Translation Fidelity Benchmark
億 (10^8) and 万 (10^4) are one character apart and 10,000× different, and frontier models confuse them on Japanese filings every quarter.
For quant PMs, research platforms, and compliance desks. 54 passages drawn from the FSA, the Japanese Law Translation database (JLT, Ministry of Justice), and JPX.
Charts: Numbers preserved (higher is better); Regulator's term preserved (higher is better); Misleading variants emitted (lower is better).
Frontier models paraphrase regulator terms. Axiora preserves them.
Preliminary: canonical adherence is scored under strict verbatim matching. An equivalence-aware re-run is in progress; it is expected to raise absolute scores across all systems without changing the ranking.
| # | System | Numerical fidelity (higher is better) | Canonical adherence (higher is better) | Forbidden-alias rate (lower is better) |
|---|---|---|---|---|
| 1 | Axiora | 100% | 78% | 0% |
| 2 | GPT-4o | 100% | 63% | 2% |
| 3 | Gemini 2.5 Flash | 98% | 52% | 2% |
| 4 | Claude Sonnet 4.6 | 98% | 39% | 2% |
Numerical fidelity: source magnitudes round-trip exactly (¥1.2T and 1.2 trillion yen both count). Canonical adherence: the regulator's verbatim published English appears in the translation; paraphrases of statutory phrases count as misses. These canonical figures use strict matching, so formatting-equivalent renderings (¥ vs the word "yen", 2,850B vs 2.85T) also count as misses, a deliberately harsh bar applied identically to every system. Forbidden-alias rate: outputs that use a regulator-flagged misleading variant. Lower is better.
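The magnitude equivalence behind numerical fidelity can be sketched in a few lines. This is an illustrative reading of the definition, not Axiora's scorer: the function name `yen_magnitude`, the `SCALES` table, and the set of accepted suffixes are all assumptions made for the example.

```python
import re
from decimal import Decimal

# Hypothetical scale table: which words and suffixes count is an assumption.
SCALES = {
    "trillion": Decimal("1e12"), "t": Decimal("1e12"),
    "billion": Decimal("1e9"), "b": Decimal("1e9"),
    "million": Decimal("1e6"), "m": Decimal("1e6"),
}

# A number (with optional thousands separators and decimals), then an optional unit.
AMOUNT = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(trillion|billion|million|[tbm])?\b",
    re.IGNORECASE,
)

def yen_magnitude(text: str) -> Decimal:
    """Return the first amount found in `text` as a plain number of yen."""
    match = AMOUNT.search(text)
    if match is None:
        raise ValueError(f"no amount found in {text!r}")
    number = Decimal(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    return number * SCALES.get(unit, Decimal(1))

# Different surface forms, same magnitude: all three compare equal.
same = (
    yen_magnitude("¥1.2T")
    == yen_magnitude("1.2 trillion yen")
    == yen_magnitude("1,200 billion yen")
)
```

Using `Decimal` rather than `float` keeps the comparison exact, which matters when the pass condition is that magnitudes match exactly rather than approximately.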
At the current measured gap, Axiora leads the best frontier model by 15pp on canonical adherence (78% vs 63%). Assuming one scored regulator term per filing, that gap would imply roughly 150 fewer regulatory-term drift events per 1,000 filings a year, and about 1,500 per 10,000.
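The projection is simple arithmetic on the table's canonical-adherence figures. The sketch below makes its one hidden assumption explicit: the `terms_per_filing` parameter (hypothetical, defaulting to one scored term per filing) scales the result linearly.

```python
def fewer_drift_events(
    filings_per_year: int,
    leader_pct: int,
    runner_up_pct: int,
    terms_per_filing: int = 1,  # assumption: one scored regulator term per filing
) -> int:
    """Project the canonical-adherence gap onto a yearly filing volume."""
    gap_pp = leader_pct - runner_up_pct  # gap in percentage points
    return filings_per_year * terms_per_filing * gap_pp // 100

# Axiora 78% vs GPT-4o 63% from the table above.
per_1k = fewer_drift_events(1_000, 78, 63)    # 150
per_10k = fewer_drift_events(10_000, 78, 63)  # 1500
```

Because the gap is measured on 54 rows, these projections are an order-of-magnitude illustration rather than a forecast.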
What each system returned
Each card shows the Japanese source, the regulator's published English, and what each system actually returned.
- "This matter constitutes a significant proposal action." Paraphrase. Loses the FSA-published canonical.
- "This matter constitutes a material proposal." Paraphrase. Loses the FSA-published canonical.
- "This matter falls under act of making important suggestion."
Methodology
- Model under test
- axiora-annual-v3.0.0-flash: Axiora's annual-report translation pipeline. Every production translation row carries this version stamp for audit.
- Input
- Identical Japanese input to every system. Same deterministic scorer on every output. No per-system prompt tuning.
- References
- References are public: FSA FAQ, the Japanese Law Translation database (Ministry of Justice), JPX glossary, and rule-checked numeric round-trips. Each row cites its source.
- Scorers
- Three scorers, run independently. Numerical fidelity checks the JA→EN magnitude round-trip (¥1.2T, 1.2 trillion yen, and 1,200 billion yen are the same magnitude). Canonical adherence checks that the regulator's verbatim published English phrase appears (case + hyphen + whitespace normalised); the figures shown apply this strictly, so formatting-equivalent renderings count as misses too, and an equivalence-aware re-run is in progress that will raise absolute canonical scores across all four systems. Forbidden-alias detection flags regulator-flagged misleading variants (e.g., "acquisition" for 完全子会社化).
- Fairness
- This is a product comparison, not a model comparison. Axiora ran through its full pipeline, the deterministic numeric and glossary processing it ships with. Frontier models received a vanilla translation prompt: "translate this Japanese financial filing into English." That mirrors how a customer would actually call each system out of the box. A pure-LLM comparison (giving every system the same scaffolding) would be a different study.
- Caveats
- 54 rows is a pattern, not a statistical claim. Numerical fidelity is deterministic by design. Forbidden-alias prevention is enforced at the system level. Canonical adherence reflects how strictly each system honors the regulator's exact published wording; the figures shown apply strict matching, where formatting variants of the same magnitude (¥X vs X yen, 2,850B vs 2.85T) also count as misses. An equivalence-aware re-run is in progress and will raise absolute canonical scores across all systems without changing the ranking.
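The canonical-adherence and forbidden-alias scorers described above can be sketched as plain substring checks after normalisation. This is a minimal sketch, not the benchmark's scorer: the function names are hypothetical, hyphens are assumed to normalise to spaces, and the example phrase is assumed for illustration to be the regulator's published wording.

```python
import re

def normalise(text: str) -> str:
    """Lower-case, turn hyphens into spaces, and collapse runs of whitespace."""
    text = text.lower().replace("-", " ")
    return re.sub(r"\s+", " ", text).strip()

def canonical_hit(translation: str, canonical: str) -> bool:
    """True when the regulator's phrase appears verbatim after normalisation."""
    return normalise(canonical) in normalise(translation)

def forbidden_hits(translation: str, forbidden: list[str]) -> list[str]:
    """Return every flagged variant that appears in the translation."""
    haystack = normalise(translation)
    return [alias for alias in forbidden if normalise(alias) in haystack]

# Assumed canonical phrase, taken from the sample outputs shown earlier.
CANONICAL = "act of making important suggestion"

paraphrase = canonical_hit("This matter constitutes a material proposal.", CANONICAL)
verbatim = canonical_hit(
    "This matter falls under act of making important suggestion.", CANONICAL
)
flagged = forbidden_hits("The acquisition closed in March.", ["acquisition"])
```

A production scorer would need to tie each forbidden alias to the source term it mistranslates (for example, flagging "acquisition" only where the Japanese reads 完全子会社化); the unconditional substring check here would over-flag.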
What this means at your desk
Numbers you don't re-read
Stop asking an analyst to spot-check yen magnitudes. The 億 / 万 / 兆 / 千 chain round-trips by design: same input, same output, every run.
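To show why the unit chain can be handled deterministically, here is a minimal parser sketch. It is not Axiora's pipeline: the names `parse_yen` and `_group` are hypothetical, and it assumes Arabic digits combined with 千, 万, 億, and 兆. Decimal forms such as 1.2兆円 and fully kanji numerals are out of scope and raise `ValueError`.

```python
LARGE_UNITS = {"兆": 10**12, "億": 10**8, "万": 10**4}  # largest unit first

def _group(chunk: str) -> int:
    """Value of one sub-10,000 group, e.g. '5千' -> 5000, '2850' -> 2850."""
    if "千" in chunk:
        thousands, remainder = chunk.split("千", 1)
        return int(thousands or "1") * 1000 + (int(remainder) if remainder else 0)
    return int(chunk)

def parse_yen(text: str) -> int:
    """Parse amounts such as '1兆2,000億円' into an integer number of yen."""
    rest = text.replace(",", "").replace("円", "").strip()
    total = 0
    for unit, scale in LARGE_UNITS.items():
        if unit in rest:
            head, rest = rest.split(unit, 1)
            total += _group(head) * scale
    if rest:  # trailing digits below 万
        total += _group(rest)
    return total

trillion_plus = parse_yen("1兆2,000億円")  # 1,200,000,000,000
mixed = parse_yen("3億5千万円")            # 350,000,000
```

The parser is a pure function of its input string, so the same text always yields the same integer; there is no sampling step that could swap 億 for 万.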
Survives an audit
Every translated row carries the translator version, the active glossary version, and a citation trail back to the source filing. Compliance pulls the exact persisted text two years later. The row is the receipt, not a re-run of the model.
Scales without drift
10,000 filings/year, the same regulator-canonical English. No per-filing manual review, no terminology drift between Q1 and Q4 reports.
Backtest-friendly
Pin a translator + glossary version. Reruns produce identical bytes. A signal you build today on translated text holds the same shape five years from now.
Built for production use
Translations that survive a regulator's audit. Versioned. Replayable. Every persisted row stays byte-identical to the day it shipped. Auditors pull the exact text we wrote, not a re-roll of a model.
Every translated row stamps the translator version (SemVer) and the active glossary version. Pin a version for reproducibility; replay against a frozen reference set anytime.
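One way to picture the version stamp and the byte-identical guarantee is a persisted row plus a content hash. The sketch below is an assumption about shape, not Axiora's schema: the `TranslationRow` class, its field names, and the glossary version string are hypothetical; only the translator version string comes from the methodology above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class TranslationRow:
    """Hypothetical persisted row; field names are illustrative."""
    source_id: str
    english: str
    translator_version: str
    glossary_version: str

    def digest(self) -> str:
        """SHA-256 over a canonical JSON encoding of the whole row."""
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

row = TranslationRow(
    source_id="example-filing-001",                   # hypothetical identifier
    english="Net sales were 1.2 trillion yen.",
    translator_version="axiora-annual-v3.0.0-flash",
    glossary_version="glossary-2026.05",              # hypothetical version label
)
replayed = TranslationRow(**asdict(row))
unchanged = row.digest() == replayed.digest()
```

Storing the digest alongside the row lets an auditor confirm that the text pulled later is the text that shipped, without re-running any model.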
Internal benchmark, run 2026-05-10. Current scope: annual securities reports (有価証券報告書, doc_type 120). Planned expansion: tender offer bid (TOB) filings, large-shareholding reports, and 100+ rows from issuer-published bilingual annual reports.