Translation fidelity · Annual reports (有価証券報告書), expanding · 2026-05-10 · axiora-annual-v3.0.0-flash

Translation Fidelity Benchmark

億 and 万 are one character apart and 10,000× different, and frontier models confuse them on Japanese filings every quarter.

For quant PMs, research platforms, and compliance desks. 54 passages from FSA, JLT (Ministry of Justice), and JPX.

[Charts: "Numbers preserved" (higher is better), "Regulator's term preserved" (higher is better), and "Misleading variants emitted" (lower is better), each comparing Axiora, GPT-4o, Gemini, and Claude. The values are listed under Per-system scores.]

Per-system scores

Frontier models paraphrase regulator terms. Axiora preserves them.

Preliminary: canonical adherence is scored under strict verbatim matching. An equivalence-aware re-run is in progress; it raises absolute scores across all systems without changing the ranking.

#	System	Numerical	Canonical	Forbidden
1	Axiora	100%	78%	0%
2	GPT-4o	100%	63%	2%
3	Gemini Flash 2.5	98%	52%	2%
4	Claude Sonnet 4.6	98%	39%	2%

Numerical fidelity: source magnitudes round-trip exactly (¥1.2T and 1.2 trillion yen both count).
Canonical adherence: the regulator's verbatim published English appears in the translation; paraphrases of statutory phrases count as misses. These canonical figures use strict matching, so formatting-equivalent renderings (¥ vs the word "yen", 2,850B vs 2.85T) also count as misses, a deliberately harsh bar applied identically to every system.
Forbidden-alias rate: the share of outputs that use a regulator-flagged misleading variant. Lower is better.
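As an illustration of what "same magnitude" means here, the following is a minimal sketch, not Axiora's scorer. The helper name `to_yen`, the suffix table, and the accepted input shapes are all assumptions made for the example.

```python
import re
from decimal import Decimal

# Hypothetical suffix table; the real scorer's rules are not published on this page.
SCALE = {
    "t": 10**12, "trillion": 10**12,
    "b": 10**9, "bn": 10**9, "billion": 10**9,
    "m": 10**6, "mn": 10**6, "million": 10**6,
}

# Optional yen sign, a number with optional thousands separators,
# an optional scale word or suffix, and an optional trailing "yen".
AMOUNT = re.compile(
    r"(?:¥\s*)?(?P<num>\d[\d,]*(?:\.\d+)?)\s*"
    r"(?P<unit>trillion|billion|million|bn|mn|t|b|m)?(?:\s*yen)?",
    re.IGNORECASE,
)

def to_yen(text: str) -> int:
    """Normalise an English yen amount to an integer number of yen."""
    m = AMOUNT.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unrecognised amount: {text!r}")
    number = Decimal(m.group("num").replace(",", ""))
    unit = (m.group("unit") or "").lower()
    value = number * SCALE.get(unit, 1)
    if value != value.to_integral_value():
        raise ValueError(f"not a whole number of yen: {text!r}")
    return int(value)

def same_magnitude(a: str, b: str) -> bool:
    """True when two renderings denote the same number of yen."""
    return to_yen(a) == to_yen(b)
```

Under these assumptions, `same_magnitude("¥1.2T", "1.2 trillion yen")` and `same_magnitude("2,850B", "2.85T")` both hold, while a strict verbatim matcher would treat each pair as different strings.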

At scale

At the current measured gap, Axiora leads the best frontier model, GPT-4o, by 15pp on canonical adherence (78% vs 63%). At one scored regulatory term per filing, that implies roughly 150 fewer regulatory-term drift events across 1,000 filings/year and about 1,500 across 10,000.
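The arithmetic behind those figures can be sketched as follows. The function name and the `terms_per_filing` parameter are hypothetical. The default of one scored term per filing is an assumption that reproduces the numbers above, not a measured property of real filings.

```python
def fewer_drift_events(filings_per_year: int, gap_pp: float,
                       terms_per_filing: int = 1) -> float:
    """Expected reduction in drift events for a given percentage-point gap."""
    # Assumption: every scored term is an independent chance to drift.
    return filings_per_year * terms_per_filing * gap_pp / 100

gap = 78 - 63  # canonical adherence, Axiora vs GPT-4o, in percentage points
small_desk = fewer_drift_events(1_000, gap)
large_desk = fewer_drift_events(10_000, gap)
```

If filings carry more than one regulator-defined term each, the estimate scales linearly with `terms_per_filing`.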

Side-by-side

What each system returned

Each card shows the Japanese source, the regulator's published English, and what each system actually returned.

Frontier paraphrases

This matter constitutes a significant proposal action.

GPT-4o, Claude Sonnet 4.6

Paraphrase. Loses the FSA-published canonical.

This matter constitutes a material proposal.

Gemini Flash 2.5

Paraphrase. Loses the FSA-published canonical.

Axiora

This matter falls under act of making important suggestion.

✓ num  ✓ canon  ✓ alias

Methodology

Model under test: axiora-annual-v3.0.0-flash, Axiora's annual-report translation pipeline. Every production translation row carries this version stamp for audit.
Input: Identical Japanese input to every system. Same deterministic scorer on every output. No per-system prompt tuning.
References: All public: the FSA FAQ, the Japanese Law Translation database (Ministry of Justice), the JPX glossary, and rule-checked numeric round-trips. Each row cites its source.
Scorers: Three scorers, run independently. Numerical fidelity checks the JA→EN magnitude round-trip (¥1.2T, 1.2 trillion yen, and 1,200 billion yen are the same magnitude). Canonical adherence checks that the regulator's verbatim published English phrase appears (case, hyphens, and whitespace normalised). The figures shown apply this strictly, so formatting-equivalent renderings also count as misses; an equivalence-aware re-run is in progress and will raise absolute canonical scores across all four systems. Forbidden-alias detection flags regulator-flagged misleading variants (e.g., "acquisition" for 完全子会社化).
Fairness: This is a product comparison, not a model comparison. Axiora ran through its full pipeline, including the deterministic numeric and glossary processing it ships with. Frontier models received a vanilla translation prompt: "translate this Japanese financial filing into English." That mirrors how a customer would actually call each system out of the box. A pure-LLM comparison (giving every system the same scaffolding) would be a different study.
Caveats: 54 rows is a pattern, not a statistical claim. Numerical fidelity is deterministic by design. Forbidden-alias prevention is enforced at the system level. Canonical adherence reflects how strictly each system honors the regulator's exact published wording; the figures shown apply strict matching, where formatting variants of the same magnitude (¥X vs X yen, 2,850B vs 2.85T) also count as misses. An equivalence-aware re-run is in progress and will raise absolute canonical scores across all systems without changing the ranking.
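
The canonical-adherence and forbidden-alias checks described under Scorers could look roughly like the sketch below. It is an illustration under stated assumptions, not the benchmark's scorer: the function names are hypothetical, and treating a hyphen as a space is one guess at what "hyphen normalised" means.

```python
import re

def normalise(text: str) -> str:
    """Lower-case, turn hyphens into spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[-\u2010\u2011]", " ", text)  # assumption: hyphen == space
    return " ".join(text.split())

def canonical_hit(output: str, canonical: str) -> bool:
    """True when the regulator's published phrase appears in the output."""
    return normalise(canonical) in normalise(output)

def forbidden_hits(output: str, forbidden: list[str]) -> list[str]:
    """Return every flagged variant that appears in the output."""
    norm = normalise(output)
    return [alias for alias in forbidden if normalise(alias) in norm]

CANONICAL = "act of making important suggestion"
axiora = "This matter falls under act of making important suggestion."
gpt4o = "This matter constitutes a significant proposal action."

axiora_ok = canonical_hit(axiora, CANONICAL)   # phrase present
gpt4o_ok = canonical_hit(gpt4o, CANONICAL)     # paraphrase, so a miss
```

Plain substring matching is the simplest reading of "verbatim phrase appears". It would also flag a forbidden alias that occurs inside a longer word, so a production scorer would likely need word-boundary handling.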

Outcomes

What this means at your desk

Numbers you don't re-read

Stop asking an analyst to spot-check yen magnitudes. The 億 / 万 / 兆 / 千 chain round-trips by design: same input, same output, every run.
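
To make the unit chain concrete, here is a minimal deterministic parser sketch. It is not Axiora's implementation. The name `parse_yen` is hypothetical, and the sketch only handles Arabic digits combined with 千, 万, 億, and 兆 (no kanji digits such as 三, and no check that units appear in descending order).

```python
import re

# 万 = 10^4, 億 = 10^8, 兆 = 10^12; 千 multiplies the digits before it by 1,000.
BIG_UNITS = {"兆": 10**12, "億": 10**8, "万": 10**4}
TOKEN = re.compile(r"(\d+)(千?)([兆億万]?)")

def parse_yen(text: str) -> int:
    """Parse amounts such as '1兆2,000億円' into an integer number of yen."""
    s = text.strip().replace(",", "").removesuffix("円")
    if not s:
        raise ValueError(f"empty amount: {text!r}")
    total, pos = 0, 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if m is None:
            raise ValueError(f"unparseable amount: {text!r}")
        digits, sen, unit = m.groups()
        value = int(digits) * (1000 if sen else 1)
        total += value * BIG_UNITS.get(unit, 1)
        pos = m.end()
    return total
```

With this sketch, `parse_yen("3億円") // parse_yen("3万円")` is 10,000, which is the size of the error when the two characters are confused. Because the function has no model call and no randomness, the same input always yields the same integer.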

Survives an audit

Every translated row carries the translator version, the active glossary version, and a citation trail back to the source filing. Compliance pulls the exact persisted text two years later. The row is the receipt, not a re-run of the model.

Scales without drift

10,000 filings/year, the same regulator-canonical English. No per-filing manual review, no terminology drift between Q1 and Q4 reports.

Backtest-friendly

Pin a translator + glossary version. Reruns produce identical bytes. A signal you build today on translated text holds the same shape five years from now.

For production use

Built for production use

Translations that survive a regulator's audit. Versioned. Replayable. Every persisted row stays byte-identical to the day it shipped. Auditors pull the exact text we wrote, not a re-roll of a model.

Every translated row stamps the translator version (SemVer) and the active glossary version. Pin a version for reproducibility; replay against a frozen reference set anytime.
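
A version-stamped row and a byte-identity check might be modelled as in the sketch below. Every name here (`TranslationRow`, its fields, `digest`, `is_byte_identical`) and the glossary version format are assumptions for illustration; the actual row schema is not described on this page.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class TranslationRow:
    source_id: str           # hypothetical identifier for the source passage
    translator_version: str  # e.g. "axiora-annual-v3.0.0-flash"
    glossary_version: str    # hypothetical format, e.g. "glossary-1.4.0"
    text: str                # the persisted English translation

    def digest(self) -> str:
        """SHA-256 over the version stamps and the exact persisted text."""
        payload = "\x1f".join(
            [self.source_id, self.translator_version,
             self.glossary_version, self.text]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def is_byte_identical(persisted: TranslationRow, replayed: TranslationRow) -> bool:
    """True when a replay reproduces the persisted row exactly."""
    return persisted.digest() == replayed.digest()
```

Storing the digest alongside the row lets an auditor confirm that the text on file is the text that shipped, and lets a replay against a pinned version be compared without a manual diff.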

Internal benchmark, run 2026-05-10. Current scope: annual securities reports (有価証券報告書, doc_type 120). Expanding to TOB filings, large-shareholding reports, and 100+ rows from issuer-published bilingual ARs.