Translation fidelity | Annual reports (有価証券報告書), expanding | 2026-05-10 | axiora-annual-v3.0.0-flash

Translation Fidelity Benchmark

億 and 万 are one character apart and 10,000× different, and frontier models confuse them on Japanese filings every quarter.

For quant PMs, research platforms, and compliance desks. 54 passages from FSA, JLT (Ministry of Justice), and JPX.


Numbers preserved (higher is better): Axiora 100%, GPT-4o 100%, Gemini 98%, Claude 98%

Regulator's term preserved (higher is better): Axiora 78%, GPT-4o 63%, Gemini 52%, Claude 39%

Misleading variants emitted (lower is better): Axiora 0%, GPT-4o 2%, Gemini 2%, Claude 2%

Per-system scores

Frontier models paraphrase regulator terms. Axiora preserves them.

Preliminary: canonical adherence is scored under strict verbatim matching. An equivalence-aware re-run is in progress; it raises absolute scores across all systems without changing the ranking.

#  System             Numerical  Canonical  Forbidden
1  Axiora             100%       78%        0%
2  GPT-4o             100%       63%        2%
3  Gemini Flash 2.5   98%        52%        2%
4  Claude Sonnet 4.6  98%        39%        2%

Numerical fidelity: source magnitudes round-trip exactly (¥1.2T and 1.2 trillion yen both count).
Canonical adherence: the regulator's verbatim published English appears in the translation; paraphrases of statutory phrases count as misses. These canonical figures use strict matching, so formatting-equivalent renderings (¥ vs the word "yen", 2,850B vs 2.85T) also count as misses, a deliberately harsh bar applied identically to every system.
Forbidden-alias rate: the share of outputs that use a regulator-flagged misleading variant. Lower is better.
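To make the numerical-fidelity definition concrete, here is a minimal sketch of the English-side comparison. It is not Axiora's scorer: the function name `yen_magnitude`, the unit table, and the regular expression are assumptions chosen for illustration, and a production scorer would need to cover many more spellings.

```python
import re
from decimal import Decimal

# Hypothetical unit table (an assumption, not the benchmark's actual list).
_UNITS = {
    "t": 10**12, "trillion": 10**12,
    "b": 10**9, "bn": 10**9, "billion": 10**9,
    "m": 10**6, "mn": 10**6, "million": 10**6,
}

# Optional yen sign, a number with optional thousands separators and decimals,
# then an optional unit word or letter.
_AMOUNT = re.compile(r"(?:¥\s*)?([\d,]+(?:\.\d+)?)\s*([A-Za-z]+)?")


def yen_magnitude(text: str) -> int:
    """Return the first yen amount found in `text` as an exact integer."""
    match = _AMOUNT.search(text)
    if match is None:
        raise ValueError(f"no amount found in {text!r}")
    number = Decimal(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    if unit in ("", "yen"):
        scale = 1
    elif unit in _UNITS:
        scale = _UNITS[unit]
    else:
        raise ValueError(f"unknown unit {unit!r} in {text!r}")
    value = number * scale
    if value != value.to_integral_value():
        raise ValueError(f"amount is not a whole number of yen: {text!r}")
    return int(value)


def same_magnitude(a: str, b: str) -> bool:
    """True when two renderings express the same yen amount."""
    return yen_magnitude(a) == yen_magnitude(b)
```

Using `Decimal` rather than `float` keeps the comparison exact, so "2,850B" and "2.85T" compare equal without rounding tolerance. Under strict canonical matching, by contrast, those two renderings would be different strings and would count as a miss.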

At scale

At the current measured gap, Axiora leads the best frontier model (GPT-4o) by 15pp on canonical adherence (78% vs 63%). If each filing contained one scored regulator term, that gap would imply roughly 150 fewer regulatory-term drift events per 1,000 filings a year, and about 1,500 per 10,000.
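The arithmetic behind these figures can be sketched as follows. The function name and the `terms_per_filing` parameter are assumptions added for illustration; the stated figures correspond to one scored term per filing.

```python
def implied_fewer_drift_events(
    gap_pp: int, filings_per_year: int, terms_per_filing: int = 1
) -> float:
    """Scale a percentage-point gap in canonical adherence to a yearly count.

    Assumes (hypothetically) that every filing contains `terms_per_filing`
    scored regulator terms and that the measured gap holds at scale.
    """
    return gap_pp * filings_per_year * terms_per_filing / 100


gap_pp = 78 - 63  # Axiora vs GPT-4o on canonical adherence, from the table
print(implied_fewer_drift_events(gap_pp, 1_000))
print(implied_fewer_drift_events(gap_pp, 10_000))
```

Real filings contain many regulator terms each, so the per-filing assumption is what drives the result; the 54-row sample size also limits how far the gap can be extrapolated.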

Side-by-side

What each system returned

Each card shows the Japanese source, the regulator's published English, and what each system actually returned.

Frontier paraphrases

This matter constitutes a significant proposal action.

GPT-4o, Claude Sonnet 4.6

Paraphrase. Loses the FSA-published canonical.

This matter constitutes a material proposal.

Gemini Flash 2.5

Paraphrase. Loses the FSA-published canonical.

Axiora

This matter falls under act of making important suggestion.

Methodology

Model under test
axiora-annual-v3.0.0-flash: Axiora's annual-report translation pipeline. Every production translation row carries this version stamp for audit.
Input
Identical Japanese input to every system. Same deterministic scorer on every output. No per-system prompt tuning.
References
References are public: FSA FAQ, the Japanese Law Translation database (Ministry of Justice), JPX glossary, and rule-checked numeric round-trips. Each row cites its source.
Scorers
Three scorers, run independently. Numerical fidelity checks the JA→EN magnitude round-trip: ¥1.2T, 1.2 trillion yen, and 1,200 billion yen all express the same magnitude and all pass. Canonical adherence checks that the regulator's verbatim published English phrase appears in the output after case, hyphen, and whitespace normalisation. The figures shown apply this strictly, so formatting-equivalent renderings count as misses too; an equivalence-aware re-run is in progress and will raise absolute canonical scores across all four systems. Forbidden-alias detection flags regulator-flagged misleading variants (e.g., "acquisition" for 完全子会社化).
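As an illustration of the canonical-adherence and forbidden-alias checks described above, the following is a minimal sketch, not the benchmark's actual scorer. The function names are hypothetical, and treating hyphens as spaces is one assumed reading of "hyphen normalisation".

```python
import re


def normalise(text: str) -> str:
    """Lower-case, treat hyphens as spaces, and collapse whitespace."""
    text = text.lower().replace("-", " ")
    return re.sub(r"\s+", " ", text).strip()


def has_canonical(translation: str, canonical: str) -> bool:
    """Strict check: the canonical phrase must appear verbatim after normalisation."""
    return normalise(canonical) in normalise(translation)


def forbidden_hits(translation: str, forbidden: list[str]) -> list[str]:
    """Return every flagged variant that appears in the translation."""
    normalised = normalise(translation)
    return [alias for alias in forbidden if normalise(alias) in normalised]


canonical = "act of making important suggestion"
axiora_output = "This matter falls under act of making important suggestion."
paraphrase = "This matter constitutes a significant proposal action."

print(has_canonical(axiora_output, canonical))
print(has_canonical(paraphrase, canonical))
# Hypothetical output sentence, used only to show an alias hit.
print(forbidden_hits("The company completed the acquisition.", ["acquisition"]))
```

Plain substring matching is deliberately simple: it would also flag "acquisitions", so a real detector would likely want word-boundary handling and a per-source-term alias list.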
Fairness
This is a product comparison, not a model comparison. Axiora ran through its full pipeline, including the deterministic numeric and glossary processing it ships with. Frontier models received a vanilla translation prompt: "translate this Japanese financial filing into English." That mirrors how a customer would actually call each system out of the box. A pure-LLM comparison, giving every system the same scaffolding, would be a different study.
Caveats
54 rows show a pattern; they are not a statistical claim. For Axiora, numerical fidelity is deterministic by design and forbidden-alias prevention is enforced at the system level. Canonical adherence reflects how strictly each system honors the regulator's exact published wording; the figures shown apply strict matching, where formatting variants of the same magnitude (¥X vs X yen, 2,850B vs 2.85T) also count as misses. An equivalence-aware re-run is in progress and will raise absolute canonical scores across all systems without changing the ranking.
Outcomes

What this means at your desk

Numbers you don't re-read

Stop asking an analyst to spot-check yen magnitudes. The 億 / 万 / 兆 / 千 chain round-trips by design: same input, same output, every run.
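For readers unfamiliar with the unit chain, a rule-based parse of the Japanese side might look like the sketch below. This is an assumption-laden illustration, not Axiora's implementation: `parse_japanese_yen` is a hypothetical name, and the sketch handles only Arabic digits combined with 千, 万, 億, and 兆.

```python
import re
from decimal import Decimal

# Each step up the chain is a factor of 10,000: 万 = 10^4, 億 = 10^8, 兆 = 10^12.
_KANJI_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12}

# A number, an optional 千 (thousand) multiplier, and an optional large unit.
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(千?)([兆億万]?)")


def parse_japanese_yen(text: str) -> int:
    """Parse amounts such as '1兆2,000億円' into an exact integer number of yen."""
    body = text.replace(",", "").replace("円", "").strip()
    tokens = list(_TOKEN.finditer(body))
    # Reject input that the tokens do not cover completely.
    if not tokens or "".join(t.group(0) for t in tokens) != body:
        raise ValueError(f"unsupported amount: {text!r}")
    total = Decimal(0)
    for token in tokens:
        value = Decimal(token.group(1))
        if token.group(2):
            value *= 1000
        value *= _KANJI_UNITS.get(token.group(3), 1)
        total += value
    if total != total.to_integral_value():
        raise ValueError(f"amount is not a whole number of yen: {text!r}")
    return int(total)


print(parse_japanese_yen("5億円") // parse_japanese_yen("5万円"))
```

Because the parse is pure arithmetic on the matched tokens, the same input always yields the same integer, which is the property the round-trip relies on.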

Survives an audit

Every translated row carries the translator version, the active glossary version, and a citation trail back to the source filing. Compliance pulls the exact persisted text two years later. The row is the receipt, not a re-run of the model.

Scales without drift

10,000 filings/year, the same regulator-canonical English. No per-filing manual review, no terminology drift between Q1 and Q4 reports.

Backtest-friendly

Pin a translator + glossary version. Reruns produce identical bytes. A signal you build today on translated text holds the same shape five years from now.

For production use

Built for production use

Translations that survive a regulator's audit. Versioned. Replayable. Every persisted row stays byte-identical to the day it shipped. Auditors pull the exact text we wrote, not a re-roll of a model.

Every translated row stamps the translator version (SemVer) and the active glossary version. Pin a version for reproducibility; replay against a frozen reference set anytime.
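One way to picture a version-stamped row is sketched below. The record layout is hypothetical: every field name, the example glossary version, and the example document ID are assumptions for illustration, and the SHA-256 digest is simply one common way to check byte-identity, not a documented part of Axiora's storage.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationRow:
    """Hypothetical persisted row; field names are illustrative only."""
    source_doc_id: str
    source_text: str
    translated_text: str
    translator_version: str
    glossary_version: str

    def digest(self) -> str:
        """SHA-256 of the persisted translation bytes (UTF-8)."""
        return hashlib.sha256(self.translated_text.encode("utf-8")).hexdigest()


def replay_matches(row: TranslationRow, replayed_text: str) -> bool:
    """True when a replay under the pinned versions is byte-identical to the row."""
    replayed = hashlib.sha256(replayed_text.encode("utf-8")).hexdigest()
    return replayed == row.digest()


row = TranslationRow(
    source_doc_id="EXAMPLE-0001",        # hypothetical identifier
    source_text="重要提案行為",
    translated_text="act of making important suggestion",
    translator_version="axiora-annual-v3.0.0-flash",
    glossary_version="2026.05",          # hypothetical glossary version
)
print(replay_matches(row, "act of making important suggestion"))
```

Freezing the dataclass mirrors the idea that the row is the receipt: an audit reads the stored text, and a replay is only a check against it.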

Internal benchmark, run 2026-05-10. Current scope: annual securities reports (有価証券報告書, doc_type 120). Expanding to TOB filings, large-shareholding reports, and 100+ rows from issuer-published bilingual ARs.