
Inference-Time Glossary Retrieval for Domain-Specific Localization

By Codcompass Team · 9 min read

Current Situation Analysis

Modern localization pipelines are architected around throughput, not semantic precision. CI/CD workflows typically diff source files, extract changed strings, and dispatch them to translation models in complete isolation. Each request contains a single paragraph, tooltip, or UI label (usually 50–200 words) stripped of surrounding document structure, domain signals, or cross-referential context. This architectural isolation creates a fundamental grounding problem: when an LLM encounters a polysemous term like "provider" without domain anchoring, it defaults to the highest-frequency translation in its training corpus. In Portuguese, that means fornecedor (commercial supplier) rather than prestador (the term used in EU regulatory and legal contexts). The result is systematic terminology drift that compounds silently across releases.

The problem is exacerbated by evaluation methodologies that operate at the wrong granularity. Most teams score localization quality at the article or page level, mathematically compressing terminology errors into statistical noise. Using the standard MQM quality formula 1 - weightedPenalty / wordCount, a single major terminology mismatch (penalty weight 5) in a 500-word document yields a score of 0.99. The identical error in a 50-word paragraph yields 0.90. At scale, holistic metrics like GEMBA-DA report near-identical deltas (0.0007–0.0178) across model variants, even when thousands of domain-specific terminology discrepancies go unflagged. Teams optimize for these flat scores, mistaking statistical compression for quality stability.
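To see why granularity matters, here is a minimal sketch of that scoring formula in TypeScript; the severity weights (minor=1, major=5, critical=25) are the ones used in the evaluation loop described later, and the function name is illustrative.

type Severity = 'minor' | 'major' | 'critical';

// MQM penalty weights, as defined in the evaluation loop below.
const WEIGHTS: Record<Severity, number> = { minor: 1, major: 5, critical: 25 };

// Paragraph-level MQM score: max(0, 1 - weightedPenalty / wordCount).
function mqmScore(errors: Severity[], wordCount: number): number {
  const penalty = errors.reduce((sum, e) => sum + WEIGHTS[e], 0);
  return Math.max(0, 1 - penalty / wordCount);
}

// One major error (weight 5) nearly vanishes at document granularity...
console.log(mqmScore(['major'], 500)); // 0.99
// ...but is plainly visible at paragraph granularity.
console.log(mqmScore(['major'], 50)); // 0.9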

Initial attempts to fix this with static glossaries failed because sparse term sets (often ~37 terms per locale) resulted in zero retrieval hits for the majority of paragraphs. Without inference-time context injection and unit-aligned evaluation, terminology drift remains invisible to both translation engines and quality benchmarks. The industry has been measuring localization quality at the wrong resolution while deploying at a finer one, creating a blind spot where domain accuracy degrades without triggering alerts.

WOW Moment: Key Findings

The signal emerged when evaluation granularity was realigned with production units (paragraphs), glossary density was increased to 72 terms per locale pair, and scoring was anchored against human reference translations. Retrieval Augmented Localization (RAL) compensates precisely for training data gaps, with marginal gains inversely correlated to baseline terminology accuracy.

Approach | Terminology Error Reduction | Total Error Reduction | GEMBA-DA Delta
Mistral (Raw vs RAL) | -44.6% | -11.7% | 0.0012
Deepseek (Raw vs RAL) | -42.1% | -13.5% | 0.0009
OpenAI (Raw vs RAL) | -33.7% | -5.4% | 0.0084
Anthropic (Raw vs RAL) | -24.4% | -3.1% | 0.0178
Google (Raw vs RAL) | -16.6% | -3.2% | 0.0007

Why this matters:

  • RAL delivers consistent terminology error reductions (16.6–44.6%) across all major providers. Lower-baseline models capture the highest marginal gains, proving that retrieval injection effectively patches training data gaps rather than merely polishing already-strong outputs.
  • Holistic evaluation frameworks (GEMBA-DA) failed to detect meaningful differences, while reference-anchored MQM captured thousands of targeted corrections. This confirms that terminology accuracy requires error-annotated, reference-grounded scoring, not aggregate quality prompts.
  • Locale divergence from training data dictates efficacy. Portuguese showed the largest per-locale improvement due to lower baseline coverage in regulatory domains; French showed the smallest.
  • Statistical significance was validated via paired Wilcoxon signed-rank tests with Holm-Bonferroni correction (p < 0.001 across all providers). Effect sizes ranged from Cohen's d = 0.20 (Google) to 0.60 (Mistral), confirming practical significance beyond statistical noise. The correction step is sketched just after this list.
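Holm-Bonferroni is a simple step-down procedure, shown here as a minimal TypeScript sketch; the function name and the example p-values are illustrative, not the study's raw data.

// Holm-Bonferroni step-down: sort p-values ascending and compare the k-th
// smallest against alpha / (m - k); stop at the first failure.
function holmBonferroni(pValues: number[], alpha = 0.05): boolean[] {
  const m = pValues.length;
  const order = pValues.map((p, i) => [p, i] as const).sort((a, b) => a[0] - b[0]);
  const rejected = new Array<boolean>(m).fill(false);
  for (let k = 0; k < m; k++) {
    const [p, originalIndex] = order[k];
    if (p > alpha / (m - k)) break; // first non-rejection stops the procedure
    rejected[originalIndex] = true;
  }
  return rejected;
}

// One p-value per provider comparison (hypothetical values).
console.log(holmBonferroni([0.0002, 0.0004, 0.0008, 0.0001, 0.0003]));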

The sweet spot is paragraph-level evaluation (50–200 words) combined with reference-anchored MQM scoring and targeted glossary retrieval at inference time. This alignment exposes the true signal that holistic metrics obscure.

Core Solution

RAL implements a retrieve-inject pattern optimized for localization pipelines. Unlike general-purpose RAG systems that retrieve arbitrary documents, RAL operates as a stateful terminology engine that dynamically enriches translation requests without bloating the context window. The architecture prioritizes precision over recall, injecting only domain-matched terms alongside structured locale constraints.

Architecture Decisions & Rationale

  1. Targeted Retrieval Over Full Context Injection. Passing entire glossaries dilutes attention mechanisms, increases latency, and raises token costs. RAL matches only glossary entries present in the current source paragraph. This keeps context windows lean and ensures the model focuses on relevant terminology rather than scanning irrelevant mappings.

  2. Reference-Anchored Evaluation Loop. Standard GEMBA-MQM is reference-free, which leaves judges free to exhibit self-preference bias. By injecting official human reference translations (e.g., from EUR-Lex corpora) into the evaluation prompt, judges anchor terminology accuracy against ground truth. Error categories are weighted minor=1, major=5, critical=25, and the paragraph score is calculated as max(0, 1 - weightedPenalty / wordCount).

  3. Judge Calibration & Exclusion. Four LLM judges (Claude Sonnet 4.6, GPT-4.1, Gemini 2.5 Flash, Mistral Large 2512) score all outputs. Models with known leniency bias (Deepseek, QWEN) are excluded from judging because they consistently under-flag errors (1–3 per paragraph vs. 5–15 for stricter judges). Averaging across calibrated judges smooths individual variance without introducing false negatives; a minimal averaging sketch follows this list.
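As a sketch of that calibration step, assuming a scoreWithJudge helper that wraps each judge model's API and returns a paragraph score in [0, 1] (the model identifier strings are illustrative), panel averaging looks like this:

// Calibrated judge panel; leniency-biased models are deliberately excluded.
const CALIBRATED_JUDGES = ['claude-sonnet-4.6', 'gpt-4.1', 'gemini-2.5-flash', 'mistral-large-2512'];
const EXCLUDED_JUDGES = ['deepseek', 'qwen']; // under-flag errors (1-3 vs. 5-15 per paragraph)

// Assumed to run the reference-anchored MQM prompt on one judge model.
type JudgeScorer = (judge: string, source: string, translation: string, reference: string) => Promise<number>;

async function calibratedScore(
  scoreWithJudge: JudgeScorer,
  source: string,
  translation: string,
  reference: string,
): Promise<number> {
  const scores = await Promise.all(
    CALIBRATED_JUDGES.map((judge) => scoreWithJudge(judge, source, translation, reference)),
  );
  // Mean across the calibrated panel smooths per-judge variance.
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}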

Implementation (TypeScript)

import { createHash } from 'crypto';

interface GlossaryEntry {
  sourceTerm: string;
  targetTerm: string;
  locale: string;
  category: 'custom' | 'non-translatable';
}

interface LocalizationPayload {
  sourceText: string;
  targetLocale: string;
  retrievedTerms: GlossaryEntry[];
  brandVoice: string;
  localeConstraints: string[];
}

class DomainLocalizationEngine {
  private glossaryIndex: Map<string, GlossaryEntry[]> = new Map();
  private localeInstructions: Map<string, string[]> = new Map();

  constructor() {
    this.initializeLocaleInstructions();
  }

  private initializeLocaleInstructions(): void {
    this.localeInstructions.set('pt-PT', [
      'Use formal EU regulatory register',
      'Maintain passive voice for legal obligations',
      'Preserve decimal comma formatting',
      'Apply EN-PT punctuation spacing rules',
      'Avoid anglicisms in compliance terminology',
      'Use "prestador" for service providers in legal context',
      'Capitalize defined terms per source conventions', // truncated in the original; completion assumed
    ]);
  }

  // Index entries under "<locale>:<lowercased term>" for direct lookup.
  indexGlossary(entries: GlossaryEntry[]): void {
    for (const entry of entries) {
      const key = `${entry.locale}:${entry.sourceTerm.toLowerCase()}`;
      this.glossaryIndex.set(key, [...(this.glossaryIndex.get(key) ?? []), entry]);
    }
  }

  // Targeted retrieval: inject only terms occurring in this paragraph (substring match; use word boundaries in production).
  buildPayload(sourceText: string, targetLocale: string, brandVoice: string): LocalizationPayload {
    const lowered = sourceText.toLowerCase();
    const retrievedTerms: GlossaryEntry[] = [];
    for (const [key, entries] of this.glossaryIndex) {
      const sep = key.indexOf(':');
      if (key.slice(0, sep) === targetLocale && lowered.includes(key.slice(sep + 1))) {
        retrievedTerms.push(...entries);
      }
    }
    return {
      sourceText,
      targetLocale,
      retrievedTerms,
      brandVoice,
      localeConstraints: this.localeInstructions.get(targetLocale) ?? [],
    };
  }

  // Stable per-(paragraph, locale) cache key; a plausible use of the createHash import.
  cacheKey(sourceText: string, targetLocale: string): string {
    return createHash('sha256').update(`${targetLocale}:${sourceText}`).digest('hex');
  }
}
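A hypothetical end-to-end usage of the engine above, showing the inject half of the retrieve-inject pattern; the prompt layout is illustrative rather than a format prescribed by the article.

const engine = new DomainLocalizationEngine();
engine.indexGlossary([
  { sourceTerm: 'provider', targetTerm: 'prestador', locale: 'pt-PT', category: 'custom' },
]);

const payload = engine.buildPayload(
  'The provider must notify the regulator within 72 hours.',
  'pt-PT',
  'formal, precise',
);

// Serialize retrieved terms and locale constraints into the translation prompt.
const prompt = [
  `Translate to ${payload.targetLocale}. Brand voice: ${payload.brandVoice}.`,
  `Locale constraints:\n- ${payload.localeConstraints.join('\n- ')}`,
  `Required terminology:\n${payload.retrievedTerms.map((t) => `- "${t.sourceTerm}" -> "${t.targetTerm}"`).join('\n')}`,
  `Source:\n${payload.sourceText}`,
].join('\n\n');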
