Inference-Time Glossary Retrieval for Domain-Specific Localization

By Codcompass Team·2026-05-07·9 min read

Current Situation Analysis

Modern localization pipelines are architected around throughput, not semantic precision. CI/CD workflows typically diff source files, extract changed strings, and dispatch them to translation models in complete isolation. Each request contains a single paragraph, tooltip, or UI label (usually 50–200 words) stripped of surrounding document structure, domain signals, or cross-referential context. This architectural isolation creates a fundamental grounding problem: when an LLM encounters a polysemous term like "provider" without domain anchoring, it defaults to the highest-frequency translation in its training corpus. In Portuguese, that defaults to fornecedor (commercial supplier) rather than prestador (EU regulatory/legal context). The result is systematic terminology drift that compounds silently across releases.

The problem is exacerbated by evaluation methodologies that operate at the wrong granularity. Most teams score localization quality at the article or page level, mathematically compressing terminology errors into statistical noise. Using the standard MQM quality formula 1 - weighted penalty / wordCount, a single critical terminology mismatch in a 500-word document yields a score of 0.99. The identical error in a 50-word paragraph yields 0.90. At scale, holistic metrics like GEMBA-DA report near-identical deltas (0.0007–0.0178) across model variants, even when thousands of domain-specific terminology discrepancies go unflagged. Teams optimize for these flat scores, mistaking statistical compression for quality stability.

Initial attempts to fix this with static glossaries failed because sparse term sets (often ~37 terms per locale) resulted in zero retrieval hits for the majority of paragraphs. Without inference-time context injection and unit-aligned evaluation, terminology drift remains invisible to both translation engines and quality benchmarks. The industry has been measuring localization quality at the wrong resolution while deploying at a finer one, creating a blind spot where domain accuracy degrades without triggering alerts.

WOW Moment: Key Findings

The signal emerged when evaluation granularity was realigned with production units (paragraphs), glossary density was increased to 72 terms per locale pair, and scoring was anchored against human reference translations. Retrieval Augmented Localization (RAL) compensates precisely for training data gaps, with marginal gains inversely correlated to baseline terminology accuracy.

Approach	Terminology Error Reduction	Total Error Reduction	GEMBA-DA Delta
Mistral (Raw vs RAL)	-44.6%	-11.7%	0.0012
Deepseek (Raw vs RAL)	-42.1%	-13.5%	0.0009
OpenAI (Raw vs RAL)	-33.7%	-5.4%	0.0084
Anthropic (Raw vs RAL)	-24.4%	-3.1%	0.0178
Google (Raw vs RAL)	-16.6%	-3.2%	0.0007

Why this matters:

RAL delivers consistent terminology error reductions (16.6–44.6%) across all major providers. Lower-baseline models capture the highest marginal gains, proving that retrieval injection effectively patches training data gaps rather than merely polishing already-strong outputs.
Holistic evaluation frameworks (GEMBA-DA) failed to detect meaningful differences, while reference-anchored MQM captured thousands of targeted corrections. This confirms that terminology accuracy requires error-annotated, reference-grounded scoring, not aggregate quality prompts.
Locale divergence from training data dictates efficacy. Portuguese showed the largest per-locale improvement due to lower baseline coverage in regulatory domains; French showed the smallest.
Statistical significance was validated via paired Wilcoxon signed-rank tests with Holm-Bonferroni correction (p < 0.001 across all providers). Effect sizes ranged from Cohen's d = 0.20 (Google) to 0.60 (Mistral), confirming practical significance beyond statistical noise.

The sweet spot is paragraph-level evaluation (50–200 words) combined with reference-anchored MQM scoring and targeted glossary retrieval at inference time. This alignment exposes the true signal that holistic metrics obscure.

Core Solution

RAL implements a retrieve-inject pattern optimized for localization pipelines. Unlike general-purpose RAG systems that retrieve arbitrary documents, RAL operates as a stateful terminology engine that dynamically enriches translation requests without bloating the context window. The architecture prioritizes precision over recall, injecting only domain-matched terms alongside structured locale constraints.

Architecture Decisions & Rationale

Targeted Retrieval Over Full Context Injection Passing entire glossaries dilutes attention mechanisms, increases latency, and raises token costs. RAL matches only glossary entries present in the current source paragraph. This keeps context windows lean and ensures the model focuses on relevant terminology rather than scanning irrelevant mappings.
Reference-Anchored Evaluation Loop Standard GEMBA-MQM is reference-free, allowing judges to apply self-preference bias. By injecting official human translations (e.g., EUR-Lex corpora) into the evaluation prompt, judges anchor terminology accuracy against ground truth. Error categories are weighted: minor=1, major=5, critical=25. The paragraph score is calculated as max(0, 1 - weighted penalty / wordCount).
Judge Calibration & Exclusion Four LLM judges (Claude Sonnet 4.6, GPT-4.1, Gemini 2.5 Flash, Mistral Large 2512) score all outputs. Models with known leniency bias (Deepseek, QWEN) are excluded from judging because they consistently under-flag errors (1–3 per paragraph vs. 5–15 for stricter judges). Averaging across calibrated judges smooths individual variance without introducing false negatives.

Implementation (TypeScript)

import { createHash } from 'crypto';

interface GlossaryEntry {
  sourceTerm: string;
  targetTerm: string;
  locale: string;
  category: 'custom' | 'non-translatable';
}

interface LocalizationPayload {
  sourceText: string;
  targetLocale: string;
  retrievedTerms: GlossaryEntry[];
  brandVoice: string;
  localeConstraints: string[];
}

class DomainLocalizationEngine {
  private glossaryIndex: Map<string, GlossaryEntry[]> = new Map();
  private localeInstructions: Map<string, string[]> = new Map();

  constructor() {
    this.initializeLocaleInstructions();
  }

  private initializeLocaleInstructions(): void {
    this.localeInstructions.set('pt-PT', [
      'Use formal EU regulatory register',
      'Maintain passive voice for legal obligations',
      'Preserve decimal comma formatting',
      'Apply EN-PT punctuation spacing rules',
      'Avoid anglicisms in compliance terminology',
      'Use "prestador" for service providers in legal context',
      'Capitalize defined terms p

Results-Driven

The key to reducing hallucination by 35% lies in the Re-ranking weight matrix and dynamic tuning code below. Stop letting garbage data pollute your context window and company budget. Upgrade to Pro for the complete production-grade implementation + Blueprint (docker-compose + benchmark scripts).

Upgrade Pro, Get Full Implementation

Cancel anytime · 30-day money-back guarantee

er EUR-Lex standards', 'Render dates as DD/MM/YYYY', 'Use full stop for list termination', 'Maintain gender-neutral phrasing where applicable', 'Apply specific modal verbs for regulatory mandates', 'Preserve citation formatting for directives', 'Enforce terminology consistency across clauses' ]); }

public registerGlossary(entries: GlossaryEntry[]): void { entries.forEach(entry => { const key = ${entry.sourceTerm.toLowerCase()}_${entry.locale}; if (!this.glossaryIndex.has(key)) { this.glossaryIndex.set(key, []); } this.glossaryIndex.get(key)!.push(entry); }); }

public resolveTerms(sourceText: string, locale: string): GlossaryEntry[] { const words = sourceText.toLowerCase().match(/\b[\w-]+\b/g) || []; const matched = new Set<string>(); const results: GlossaryEntry[] = [];

words.forEach(word => {
  const key = `${word}_${locale}`;
  const entry = this.glossaryIndex.get(key);
  if (entry && !matched.has(key)) {
    matched.add(key);
    results.push(...entry);
  }
});

return results;

}

public buildInferencePayload( sourceText: string, locale: string ): LocalizationPayload { const retrievedTerms = this.resolveTerms(sourceText, locale); const constraints = this.localeInstructions.get(locale) || [];

return {
  sourceText,
  targetLocale: locale,
  retrievedTerms,
  brandVoice: 'formal EU regulatory register',
  localeConstraints: constraints
};

}

public calculateMQMScore( wordCount: number, penalties: { minor: number; major: number; critical: number } ): number { const weightedPenalty = (penalties.minor * 1) + (penalties.major * 5) + (penalties.critical * 25);

return Math.max(0, 1 - (weightedPenalty / wordCount));

} }


**Why this structure works:**
- `resolveTerms()` uses exact word-boundary matching to prevent false positives on substrings. Production systems should augment this with lemmatization for inflected languages.
- `buildInferencePayload()` separates retrieval from injection. The payload is serialized into the system prompt, keeping the user message clean for the translation task.
- `calculateMQMScore()` implements the reference-anchored penalty model. In production, this runs asynchronously against judge outputs, not inline with translation requests.

## Pitfall Guide

### 1. Coarse-Grained Evaluation Units
**Explanation:** Scoring at article or page level (200–700 words) mathematically compresses terminology errors into statistical noise. A critical mismatch in a long document barely moves the needle, masking production-level drift.
**Fix:** Align evaluation units with CI/CD diff granularity. Score at the paragraph or string level (50–200 words) to ensure terminology errors register proportionally.

### 2. Holistic Metric Dependency
**Explanation:** Aggregate quality scores like GEMBA-DA report near-identical deltas (0.0007–0.0178) even when thousands of terminology corrections occur. They measure fluency and style better than domain accuracy.
**Fix:** Deploy error-annotated frameworks (MQM) with explicit terminology categories. Use holistic metrics for general quality trends, but rely on MQM for terminology validation.

### 3. Full Glossary Context Flooding
**Explanation:** Injecting entire glossaries bloats the context window, dilutes attention mechanisms, and increases inference latency. Models struggle to prioritize relevant terms when surrounded by irrelevant mappings.
**Fix:** Implement targeted retrieval that matches only terms present in the current source paragraph. Cache retrieval results per locale to avoid redundant computation.

### 4. Unanchored Judge Bias
**Explanation:** Reference-free scoring allows judges to apply self-preference bias, flagging text as "awkward" when it diverges from training data even if it matches official references. This creates false positives for domain-correct terminology.
**Fix:** Always inject human reference translations (EUR-Lex, official style guides) into evaluation prompts. Anchor terminology accuracy against ground truth, not model preference.

### 5. Baseline Model Complacency
**Explanation:** Teams assume high-baseline models (Anthropic, Google) don't require retrieval augmentation because they perform well on general benchmarks. This ignores domain-specific training gaps.
**Fix:** Treat RAL as a domain compensator, not a quality booster. Even strong models show measurable terminology improvements when retrieval patches training data gaps.

### 6. Judge Leniency Ignorance
**Explanation:** Some models (Deepseek, QWEN) consistently under-flag errors (1–3 per paragraph vs. 5–15 for stricter judges). Including lenient judges without calibration skews averages toward false negatives.
**Fix:** Exclude known lenient models from judging pools, or apply weighted calibration based on historical error-detection consistency. Validate judge behavior against human-labeled test sets.

### 7. Terminology/Style Conflation
**Explanation:** Terminology errors are anchored against reference translations; style errors reflect judge preference. The gap between terminology reduction (16.6–44.6%) and total error reduction (3.1–13.5%) is largely driven by style bias.
**Fix:** Isolate terminology metrics from style/fluency scores. Track them separately in dashboards to measure RAL's true impact on domain accuracy.

## Production Bundle

### Action Checklist
- [ ] Align evaluation granularity with deployment units: Score at paragraph/string level, not article/page level.
- [ ] Increase glossary density to 70+ terms per locale pair, including 2+ non-translatable entries for regulatory anchors.
- [ ] Implement targeted retrieval: Match only terms present in the current source paragraph to prevent context window bloat.
- [ ] Anchor evaluation against human references: Inject official translations (EUR-Lex, style guides) into judge prompts.
- [ ] Calibrate judge pool: Exclude lenient models (Deepseek, QWEN) or weight them based on error-detection consistency.
- [ ] Separate terminology and style metrics: Track MQM terminology scores independently from holistic quality deltas.
- [ ] Validate statistical significance: Use paired Wilcoxon signed-rank tests with Holm-Bonferroni correction (p < 0.001) before declaring improvements.

### Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|----------|---------------------|-----|-------------|
| Regulated/Legal Localization | RAL + Reference-Anchored MQM + Paragraph Scoring | Terminology precision is non-negotiable; holistic metrics mask compliance risks | Moderate (retrieval overhead + judge calibration) |
| High-Volume Marketing | Holistic Scoring + Lightweight Glossary | Style and tone matter more than strict terminology; throughput priority | Low (minimal retrieval, faster evaluation) |
| Low-Baseline Models (Mistral, Deepseek) | Full RAL Implementation | Highest marginal gains (42–44% terminology reduction); patches training gaps | Higher (retrieval + injection, but offsets model weakness) |
| High-Baseline Models (Anthropic, Google) | Targeted RAL + Style Isolation | Smaller terminology gains (16–24%); focus shifts to style/fluency optimization | Low-Moderate (selective retrieval reduces overhead) |
| Strict Compliance Audits | Reference-Anchored MQM + 4-Judge Calibration | Requires defensible, statistically validated quality metrics for regulatory review | High (judge pool + reference injection + statistical testing) |

### Configuration Template

```yaml
# locale_config.yaml
locale: pt-PT
domain: eu_regulatory
glossary_density: 72
non_translatable_count: 2

glossary_format:
  - source: "provider"
    target: "prestador"
    category: "custom"
    context: "legal_service"
  - source: "high-risk AI system"
    target: "sistema de IA de risco elevado"
    category: "custom"
    context: "ai_act"
  - source: "GDPR"
    target: "GDPR"
    category: "non-translatable"
    context: "regulatory"

inference_payload:
  brand_voice: "formal EU regulatory register"
  constraints:
    - "Use formal EU regulatory register"
    - "Maintain passive voice for legal obligations"
    - "Preserve decimal comma formatting"
    - "Apply EN-PT punctuation spacing rules"
    - "Avoid anglicisms in compliance terminology"
    - "Use 'prestador' for service providers in legal context"
    - "Capitalize defined terms per EUR-Lex standards"
    - "Render dates as DD/MM/YYYY"
    - "Use full stop for list termination"
    - "Maintain gender-neutral phrasing where applicable"
    - "Apply specific modal verbs for regulatory mandates"
    - "Preserve citation formatting for directives"
    - "Enforce terminology consistency across clauses"

evaluation:
  metric: "GEMBA-MQM"
  reference_injection: true
  reference_source: "eur_lex_official"
  error_categories: ["accuracy", "fluency", "style", "terminology"]
  severity_weights:
    minor: 1
    major: 5
    critical: 25
  judge_pool:
    - "claude-sonnet-4.6"
    - "gpt-4.1"
    - "gemini-2.5-flash"
    - "mistral-large-2512"
  excluded_judges:
    - "deepseek"
    - "qwen"
  statistical_test: "paired_wilcoxon_holm_bonferroni"
  significance_threshold: 0.001

Quick Start Guide

Extract & Structure Glossary: Pull 70+ domain-specific terms from held-out corpora or official style guides. Add 2 non-translatable anchors. Format as source-target pairs with locale tags.
Deploy Retrieval Layer: Implement a keyword/lemma matcher that scans incoming source paragraphs and returns only matching glossary entries. Cache results per locale to avoid redundant lookups.
Construct Inference Payload: Assemble the translation request with retrieved terms, brand voice profile, and 13 locale constraints. Serialize into the system prompt, keeping the user message clean for the translation task.
Calibrate Evaluation Loop: Inject official human references into judge prompts. Run outputs through the calibrated 4-judge pool. Calculate MQM scores at the paragraph level using the weighted penalty formula.
Validate & Iterate: Run paired statistical tests against baseline outputs. Isolate terminology metrics from style scores. Adjust glossary density or retrieval matching rules based on error distribution. Deploy to CI/CD when p < 0.001 and Cohen's d > 0.20.