Back to KB
Difficulty
Intermediate
Read Time
9 min

Inference-Time Glossary Retrieval for Domain-Specific Localization

By Codcompass TeamΒ·Β·9 min read

Current Situation Analysis

Modern localization pipelines are architected around throughput, not semantic precision. CI/CD workflows typically diff source files, extract changed strings, and dispatch them to translation models in complete isolation. Each request contains a single paragraph, tooltip, or UI label (usually 50–200 words) stripped of surrounding document structure, domain signals, or cross-referential context. This architectural isolation creates a fundamental grounding problem: when an LLM encounters a polysemous term like "provider" without domain anchoring, it defaults to the highest-frequency translation in its training corpus. In Portuguese, that defaults to fornecedor (commercial supplier) rather than prestador (EU regulatory/legal context). The result is systematic terminology drift that compounds silently across releases.

The problem is exacerbated by evaluation methodologies that operate at the wrong granularity. Most teams score localization quality at the article or page level, mathematically compressing terminology errors into statistical noise. Using the standard MQM quality formula 1 - weighted penalty / wordCount, a single critical terminology mismatch in a 500-word document yields a score of 0.99. The identical error in a 50-word paragraph yields 0.90. At scale, holistic metrics like GEMBA-DA report near-identical deltas (0.0007–0.0178) across model variants, even when thousands of domain-specific terminology discrepancies go unflagged. Teams optimize for these flat scores, mistaking statistical compression for quality stability.

Initial attempts to fix this with static glossaries failed because sparse term sets (often ~37 terms per locale) resulted in zero retrieval hits for the majority of paragraphs. Without inference-time context injection and unit-aligned evaluation, terminology drift remains invisible to both translation engines and quality benchmarks. The industry has been measuring localization quality at the wrong resolution while deploying at a finer one, creating a blind spot where domain accuracy degrades without triggering alerts.

WOW Moment: Key Findings

The signal emerged when evaluation granularity was realigned with production units (paragraphs), glossary density was increased to 72 terms per locale pair, and scoring was anchored against human reference translations. Retrieval Augmented Localization (RAL) compensates precisely for training data gaps, with marginal gains inversely correlated to baseline terminology accuracy.

ApproachTerminology Error ReductionTotal Error ReductionGEMBA-DA Delta
Mistral (Raw vs RAL)-44.6%-11.7%0.0012
Deepseek (Raw vs RAL)-42.1%-13.5%0.0009
OpenAI (Raw vs RAL)-33.7%-5.4%0.0084
Anthropic (Raw vs RAL)-24.4%-3.1%0.0178
Google (Raw vs RAL)-16.6%-3.2%0.0007

Why this matters:

  • RAL delivers consistent terminology error reductions (16.6–44.6%) across all major providers. Lower-baseline models capture the highest marginal gains, proving that retrieval injection effectively patches training data gaps rather than merely polishing already-strong outputs.
  • Holistic evaluation frameworks (GEMBA-DA) failed to detect meaningful differences, while reference-anchored MQM captured thousands of targeted corrections. This confirms that terminology accuracy requires error-annotated, reference-grounded scoring, not aggregate quality prompts.
  • Locale divergence from training data dictates efficacy. Portuguese showed the largest per-locale improvement due to lower baseline coverage in regulatory domains; French showed the smallest.
  • Statistical significance was validated via paired Wilcoxon signed-rank tests with Holm-Bonferroni correction (p < 0.001 across all providers). Effect sizes ranged from Cohen's d = 0.20 (Google) to 0.60 (Mistral), confirming practical significance beyond statistical noise.

The sweet spot is paragraph-level evaluation (50–200 words) combined with reference-anchored MQM scoring and targeted glossary retrieval at inference time. This alignment exposes the true signal that holistic metrics obscure.

Core Solution

RAL implements a retrieve-inject pattern optimized for localization pipelines. Unlike

πŸŽ‰ Mid-Year Sale β€” Unlock Full Article

Base plan from just $4.99/mo or $49/yr

Sign in to read the full article and unlock all 635+ tutorials.

Sign In / Register β€” Start Free Trial

7-day free trial Β· Cancel anytime Β· 30-day money-back