general-purpose RAG systems that retrieve arbitrary documents, RAL operates as a stateful terminology engine that dynamically enriches translation requests without bloating the context window. The architecture prioritizes precision over recall, injecting only domain-matched terms alongside structured locale constraints.
Architecture Decisions & Rationale
- Targeted Retrieval Over Full Context Injection: Passing entire glossaries dilutes attention mechanisms, increases latency, and raises token costs. RAL matches only glossary entries present in the current source paragraph. This keeps context windows lean and ensures the model focuses on relevant terminology rather than scanning irrelevant mappings.
- Reference-Anchored Evaluation Loop: Standard GEMBA-MQM is reference-free, which leaves room for judges to apply self-preference bias. Injecting official human translations (e.g., EUR-Lex corpora) into the evaluation prompt anchors terminology accuracy against ground truth. Error severities are weighted: minor = 1, major = 5, critical = 25. The paragraph score is calculated as max(0, 1 - weightedPenalty / wordCount).
- Judge Calibration & Exclusion: Four LLM judges (Claude Sonnet 4.6, GPT-4.1, Gemini 2.5 Flash, Mistral Large 2512) score all outputs. Models with known leniency bias (DeepSeek, Qwen) are excluded from judging because they consistently under-flag errors (1–3 per paragraph vs. 5–15 for stricter judges). Averaging across the remaining calibrated judges smooths individual variance without pulling scores toward the lenient judges' false negatives.
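As a rough sketch of the judge-pool averaging described above, the snippet below drops excluded judges and averages the remaining paragraph scores. The `JudgeScore` shape, the `averageCalibratedScore` name, the prefix-based exclusion rule, and the sample scores are all illustrative assumptions, not part of the original pipeline.

```typescript
// Hypothetical shape for one judge's paragraph-level MQM score (0..1).
interface JudgeScore {
  judge: string;
  score: number;
}

// Average paragraph scores across the judge pool, skipping excluded judges.
// Exclusion is a case-insensitive prefix check, so "deepseek" also excludes
// "deepseek-chat" (an assumption about how model IDs are named).
function averageCalibratedScore(
  scores: JudgeScore[],
  excluded: string[]
): number {
  const blocked = excluded.map(name => name.toLowerCase());
  const kept = scores.filter(
    s => !blocked.some(prefix => s.judge.toLowerCase().indexOf(prefix) === 0)
  );
  if (kept.length === 0) {
    throw new Error('No calibrated judges left after exclusion');
  }
  const total = kept.reduce((sum, s) => sum + s.score, 0);
  return total / kept.length;
}

// Made-up scores for a single paragraph.
const sampleScores: JudgeScore[] = [
  { judge: 'claude-sonnet-4.6', score: 0.8 },
  { judge: 'gpt-4.1', score: 0.7 },
  { judge: 'gemini-2.5-flash', score: 0.9 },
  { judge: 'mistral-large-2512', score: 0.6 },
  { judge: 'deepseek-chat', score: 1.0 }
];

// The lenient judge's 1.0 is ignored; the mean of the other four is about 0.75.
const pooledScore = averageCalibratedScore(sampleScores, ['deepseek', 'qwen']);
```

A plain mean treats the four retained judges as equally reliable. If that does not hold, a weighted mean is the natural extension.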
Implementation (TypeScript)
```typescript
interface GlossaryEntry {
  sourceTerm: string;
  targetTerm: string;
  locale: string;
  category: 'custom' | 'non-translatable';
}

interface LocalizationPayload {
  sourceText: string;
  targetLocale: string;
  retrievedTerms: GlossaryEntry[];
  brandVoice: string;
  localeConstraints: string[];
}

class DomainLocalizationEngine {
  // Glossary entries keyed by `${lowercased source term}_${locale}`.
  private glossaryIndex: Map<string, GlossaryEntry[]> = new Map();
  // Style and formatting constraints keyed by target locale.
  private localeInstructions: Map<string, string[]> = new Map();

  constructor() {
    this.initializeLocaleInstructions();
  }

  private initializeLocaleInstructions(): void {
    this.localeInstructions.set('pt-PT', [
      'Use formal EU regulatory register',
      'Maintain passive voice for legal obligations',
      'Preserve decimal comma formatting',
      'Apply EN-PT punctuation spacing rules',
      'Avoid anglicisms in compliance terminology',
      'Use "prestador" for service providers in legal context',
      'Capitalize defined terms per EUR-Lex standards',
      'Render dates as DD/MM/YYYY',
      'Use full stop for list termination',
      'Maintain gender-neutral phrasing where applicable',
      'Apply specific modal verbs for regulatory mandates',
      'Preserve citation formatting for directives',
      'Enforce terminology consistency across clauses'
    ]);
  }

  public registerGlossary(entries: GlossaryEntry[]): void {
    entries.forEach(entry => {
      const key = `${entry.sourceTerm.toLowerCase()}_${entry.locale}`;
      if (!this.glossaryIndex.has(key)) {
        this.glossaryIndex.set(key, []);
      }
      this.glossaryIndex.get(key)!.push(entry);
    });
  }

  public resolveTerms(sourceText: string, locale: string): GlossaryEntry[] {
    // Tokenize into lowercase words; hyphenated words stay together.
    // Lookup is per token, so multi-word glossary terms are not matched here.
    const words = sourceText.toLowerCase().match(/\b[\w-]+\b/g) || [];
    const matched = new Set<string>();
    const results: GlossaryEntry[] = [];
    words.forEach(word => {
      const key = `${word}_${locale}`;
      const entry = this.glossaryIndex.get(key);
      // Return each matched term once, even if it repeats in the paragraph.
      if (entry && !matched.has(key)) {
        matched.add(key);
        results.push(...entry);
      }
    });
    return results;
  }

  public buildInferencePayload(
    sourceText: string,
    locale: string
  ): LocalizationPayload {
    const retrievedTerms = this.resolveTerms(sourceText, locale);
    // Locales without registered instructions get an empty constraint list.
    const constraints = this.localeInstructions.get(locale) || [];
    return {
      sourceText,
      targetLocale: locale,
      retrievedTerms,
      brandVoice: 'formal EU regulatory register',
      localeConstraints: constraints
    };
  }

  public calculateMQMScore(
    wordCount: number,
    penalties: { minor: number; major: number; critical: number }
  ): number {
    // Guard against empty paragraphs, which would otherwise divide by zero.
    if (wordCount <= 0) {
      throw new RangeError('wordCount must be greater than zero');
    }
    // Severity weights: minor = 1, major = 5, critical = 25.
    const weightedPenalty =
      (penalties.minor * 1) +
      (penalties.major * 5) +
      (penalties.critical * 25);
    // Clamp at 0 so heavily penalized paragraphs never score negative.
    return Math.max(0, 1 - (weightedPenalty / wordCount));
  }
}
```
Why this structure works:
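resolveTerms() tokenizes the source text and looks up each whole token, so a glossary term never matches inside a longer word. Because lookup is per token, multi-word entries (for example "high-risk AI system") and inflected forms are not matched by this version. Production systems should add phrase matching, plus lemmatization for inflected languages.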
buildInferencePayload() separates retrieval from injection. The payload is serialized into the system prompt, keeping the user message clean for the translation task.
calculateMQMScore() implements the weighted penalty formula used by the reference-anchored evaluation. In production, this runs asynchronously against judge outputs, not inline with translation requests.
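One way the payload might be serialized into a system prompt is sketched below. The prompt wording, the `renderSystemPrompt` and `buildMessages` helpers, and the `{ role, content }` message shape are assumptions for illustration; the source does not specify a prompt template or a client API.

```typescript
// Minimal copies of the payload fields so the sketch is self-contained.
interface TermHint {
  sourceTerm: string;
  targetTerm: string;
  category: 'custom' | 'non-translatable';
}

interface PromptPayload {
  targetLocale: string;
  brandVoice: string;
  retrievedTerms: TermHint[];
  localeConstraints: string[];
}

// Hypothetical chat message shape; adapt to whichever client is in use.
interface ChatMessage {
  role: 'system' | 'user';
  content: string;
}

// Render retrieved terms and locale constraints as plain-text instructions.
function renderSystemPrompt(payload: PromptPayload): string {
  const termLines = payload.retrievedTerms.map(t =>
    t.category === 'non-translatable'
      ? `- "${t.sourceTerm}": keep untranslated`
      : `- "${t.sourceTerm}" -> "${t.targetTerm}"`
  );
  const constraintLines = payload.localeConstraints.map(c => `- ${c}`);
  return [
    `Translate into ${payload.targetLocale}.`,
    `Voice: ${payload.brandVoice}.`,
    'Required terminology:',
    ...(termLines.length > 0 ? termLines : ['- (no glossary terms matched)']),
    'Locale constraints:',
    ...constraintLines
  ].join('\n');
}

// Instructions go in the system message; the user message is only the source text.
function buildMessages(payload: PromptPayload, sourceText: string): ChatMessage[] {
  return [
    { role: 'system', content: renderSystemPrompt(payload) },
    { role: 'user', content: sourceText }
  ];
}

const demoMessages = buildMessages(
  {
    targetLocale: 'pt-PT',
    brandVoice: 'formal EU regulatory register',
    retrievedTerms: [
      { sourceTerm: 'provider', targetTerm: 'prestador', category: 'custom' },
      { sourceTerm: 'GDPR', targetTerm: 'GDPR', category: 'non-translatable' }
    ],
    localeConstraints: ['Render dates as DD/MM/YYYY']
  },
  'The provider must comply with the GDPR.'
);
```

Because only matched terms are rendered, the system prompt grows with the paragraph's terminology rather than with the size of the glossary.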
Pitfall Guide
1. Coarse-Grained Evaluation Units
Explanation: Scoring at article or page level (200–700 words) dilutes terminology errors into statistical noise, because the weighted penalty is divided by the word count. A critical mismatch in a long document barely moves the score, masking production-level drift.
Fix: Align evaluation units with CI/CD diff granularity. Score at the paragraph or string level (50–200 words) so terminology errors register proportionally.
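To make the dilution effect concrete, the short sketch below applies the score formula to one critical error at two unit sizes. The word counts are made-up examples and `mqmScore` is a local helper, not a function from the implementation above.

```typescript
// Score formula from the article: max(0, 1 - weightedPenalty / wordCount).
function mqmScore(wordCount: number, weightedPenalty: number): number {
  return Math.max(0, 1 - weightedPenalty / wordCount);
}

const CRITICAL_WEIGHT = 25;

// One critical terminology error scored against a 700-word article:
// 1 - 25 / 700, roughly 0.964.
const articleLevelScore = mqmScore(700, CRITICAL_WEIGHT);

// The same single error scored against a 100-word paragraph:
// 1 - 25 / 100 = 0.75.
const paragraphLevelScore = mqmScore(100, CRITICAL_WEIGHT);
```

The same error costs under four points at article level and twenty-five at paragraph level, which is why a fixed quality threshold can pass the long unit and fail the short one.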
2. Holistic Metric Dependency
Explanation: Aggregate quality scores like GEMBA-DA report near-identical deltas (0.0007–0.0178) even when thousands of terminology corrections occur. They measure fluency and style better than domain accuracy.
Fix: Deploy error-annotated frameworks (MQM) with explicit terminology categories. Use holistic metrics for general quality trends, but rely on MQM for terminology validation.
3. Full Glossary Context Flooding
Explanation: Injecting entire glossaries bloats the context window, dilutes attention mechanisms, and increases inference latency. Models struggle to prioritize relevant terms when surrounded by irrelevant mappings.
Fix: Implement targeted retrieval that matches only terms present in the current source paragraph. Cache retrieval results per locale to avoid redundant computation.
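A minimal sketch of targeted retrieval with caching follows, assuming phrase-level matching so that multi-word glossary terms are found. The `PhraseRetriever` class, the lookaround-based boundary check, and the cache keyed by locale plus paragraph text are illustrative choices rather than the article's implementation. Inflected forms are still missed without lemmatization.

```typescript
interface TermEntry {
  sourceTerm: string;
  targetTerm: string;
  locale: string;
}

interface CompiledTerm {
  entry: TermEntry;
  pattern: RegExp;
}

// Escape regex metacharacters so glossary terms are matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

class PhraseRetriever {
  // Compiled patterns grouped by locale, built once at construction.
  private patternsByLocale = new Map<string, CompiledTerm[]>();
  // Results keyed by locale and paragraph text (unbounded in this sketch).
  private cache = new Map<string, TermEntry[]>();

  constructor(entries: TermEntry[]) {
    for (const entry of entries) {
      // The lookarounds reject matches embedded in a longer word or
      // hyphenated compound, e.g. "provider" inside "providers".
      const pattern = new RegExp(
        `(?<![\\w-])${escapeRegExp(entry.sourceTerm)}(?![\\w-])`,
        'i'
      );
      const list = this.patternsByLocale.get(entry.locale) || [];
      list.push({ entry, pattern });
      this.patternsByLocale.set(entry.locale, list);
    }
  }

  // Return only the glossary entries that occur in this paragraph.
  resolve(paragraph: string, locale: string): TermEntry[] {
    const cacheKey = `${locale}\u0000${paragraph}`;
    const cached = this.cache.get(cacheKey);
    if (cached) {
      return cached;
    }
    const compiled = this.patternsByLocale.get(locale) || [];
    const matches = compiled
      .filter(c => c.pattern.test(paragraph))
      .map(c => c.entry);
    this.cache.set(cacheKey, matches);
    return matches;
  }
}

const retriever = new PhraseRetriever([
  { sourceTerm: 'provider', targetTerm: 'prestador', locale: 'pt-PT' },
  { sourceTerm: 'high-risk AI system', targetTerm: 'sistema de IA de risco elevado', locale: 'pt-PT' },
  { sourceTerm: 'GDPR', targetTerm: 'GDPR', locale: 'pt-PT' }
]);
const demoMatches = retriever.resolve(
  'The provider of a high-risk AI system must keep records.',
  'pt-PT'
);
```

This version tests every glossary pattern for the locale against each paragraph, which is fine for a glossary of around 70 terms. A much larger glossary would likely need an index such as a trie or an Aho-Corasick automaton.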
4. Unanchored Judge Bias
Explanation: Reference-free scoring allows judges to apply self-preference bias, flagging text as "awkward" when it diverges from training data even if it matches official references. This creates false positives for domain-correct terminology.
Fix: Always inject human reference translations (EUR-Lex, official style guides) into evaluation prompts. Anchor terminology accuracy against ground truth, not model preference.
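The sketch below shows one way a reference could be injected into a judge prompt. The prompt wording, the `JudgeRequest` shape, and the sample sentences are hypothetical. The source states only that official human references are injected, not how the prompt is phrased.

```typescript
interface JudgeRequest {
  sourceText: string;
  candidateTranslation: string;
  referenceTranslation: string;
  targetLocale: string;
}

// Build an evaluation prompt that anchors terminology to the reference.
// Refusing to build a prompt without a reference keeps scoring from
// silently falling back to reference-free mode.
function buildJudgePrompt(req: JudgeRequest): string {
  if (req.referenceTranslation.trim().length === 0) {
    throw new Error('Reference translation is required for anchored scoring');
  }
  return [
    `You are evaluating a translation into ${req.targetLocale} using MQM error annotation.`,
    'Treat the official reference as ground truth for terminology.',
    'Do not flag a term as an error when it matches the reference.',
    'Report each error with a category (accuracy, fluency, style, terminology)',
    'and a severity (minor, major, critical).',
    '',
    `Source:\n${req.sourceText}`,
    '',
    `Official reference:\n${req.referenceTranslation}`,
    '',
    `Candidate:\n${req.candidateTranslation}`
  ].join('\n');
}

// Made-up example sentences for illustration only.
const demoJudgePrompt = buildJudgePrompt({
  sourceText: 'The provider shall keep the documentation.',
  candidateTranslation: 'O fornecedor deve manter a documentação.',
  referenceTranslation: 'O prestador deve conservar a documentação.',
  targetLocale: 'pt-PT'
});
```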
5. Baseline Model Complacency
Explanation: Teams assume high-baseline models (Anthropic, Google) don't require retrieval augmentation because they perform well on general benchmarks. This ignores domain-specific training gaps.
Fix: Treat RAL as a domain compensator, not a quality booster. Even strong models show measurable terminology improvements when retrieval patches training data gaps.
6. Judge Leniency Ignorance
Explanation: Some models (DeepSeek, Qwen) consistently under-flag errors (1–3 per paragraph vs. 5–15 for stricter judges). Including lenient judges without calibration skews averages toward false negatives.
Fix: Exclude known lenient models from judging pools, or apply weighted calibration based on historical error-detection consistency. Validate judge behavior against human-labeled test sets.
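One possible form of weighted calibration is sketched below, under loud assumptions. The detection rate is approximated from error counts alone, as judge count divided by human count and capped at 1. The 0.5 cutoff and the judge names are invented for the example. A real calibration would also check that a judge flags the same errors as the human annotators, not just a similar number.

```typescript
// One human-labeled paragraph and the number of errors a judge flagged on it.
interface CalibrationSample {
  humanErrorCount: number;
  judgeErrorCount: number;
}

// Mean share of human-labeled errors the judge's count accounts for.
// Over-flagging is capped at 1 so it cannot offset under-flagging elsewhere.
function detectionRate(samples: CalibrationSample[]): number {
  const usable = samples.filter(s => s.humanErrorCount > 0);
  if (usable.length === 0) {
    return 0;
  }
  const total = usable.reduce(
    (sum, s) => sum + Math.min(1, s.judgeErrorCount / s.humanErrorCount),
    0
  );
  return total / usable.length;
}

// Drop judges below the cutoff, then normalize the rest so weights sum to 1.
function calibrationWeights(
  history: Record<string, CalibrationSample[]>,
  minRate: number = 0.5 // assumed cutoff, not taken from the source
): Record<string, number> {
  const rates: Record<string, number> = {};
  let total = 0;
  Object.keys(history).forEach(judge => {
    const rate = detectionRate(history[judge]);
    if (rate > 0 && rate >= minRate) {
      rates[judge] = rate;
      total += rate;
    }
  });
  const weights: Record<string, number> = {};
  Object.keys(rates).forEach(judge => {
    weights[judge] = rates[judge] / total;
  });
  return weights;
}

// Made-up calibration history for three hypothetical judges.
const demoWeights = calibrationWeights({
  'strict-judge': [
    { humanErrorCount: 10, judgeErrorCount: 9 },
    { humanErrorCount: 5, judgeErrorCount: 5 }
  ],
  'second-judge': [
    { humanErrorCount: 10, judgeErrorCount: 8 },
    { humanErrorCount: 5, judgeErrorCount: 4 }
  ],
  'lenient-judge': [
    { humanErrorCount: 10, judgeErrorCount: 2 },
    { humanErrorCount: 5, judgeErrorCount: 1 }
  ]
});
```

In this example the lenient judge's rate of 0.2 falls below the cutoff and it receives no weight, while the other two judges share the total in proportion to their rates.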
7. Terminology/Style Conflation
Explanation: Terminology errors are anchored against reference translations; style errors reflect judge preference. The gap between the reduction in terminology errors (16.6–44.6%) and the reduction in total errors (3.1–13.5%) is largely driven by style bias.
Fix: Isolate terminology metrics from style/fluency scores. Track them separately in dashboards to measure RAL's true impact on domain accuracy.
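Tracking the two metrics separately could look like the sketch below. The `AnnotatedError` shape follows the four error categories named in the configuration, while the helper names and the error counts are made up for the example.

```typescript
type ErrorCategory = 'accuracy' | 'fluency' | 'style' | 'terminology';

interface AnnotatedError {
  category: ErrorCategory;
  severity: 'minor' | 'major' | 'critical';
}

// Tally errors per MQM category.
function countByCategory(errors: AnnotatedError[]): Record<ErrorCategory, number> {
  const counts: Record<ErrorCategory, number> = {
    accuracy: 0,
    fluency: 0,
    style: 0,
    terminology: 0
  };
  errors.forEach(e => {
    counts[e.category] += 1;
  });
  return counts;
}

// Fraction of baseline errors removed; 0 when there was nothing to remove.
function relativeReduction(baseline: number, treated: number): number {
  return baseline === 0 ? 0 : (baseline - treated) / baseline;
}

// Report terminology, style, and total reductions as separate numbers.
function compareRuns(baseline: AnnotatedError[], withRal: AnnotatedError[]) {
  const before = countByCategory(baseline);
  const after = countByCategory(withRal);
  return {
    terminologyReduction: relativeReduction(before.terminology, after.terminology),
    styleReduction: relativeReduction(before.style, after.style),
    totalReduction: relativeReduction(baseline.length, withRal.length)
  };
}

// Helper that builds n made-up minor errors in one category.
function repeatError(category: ErrorCategory, n: number): AnnotatedError[] {
  const out: AnnotatedError[] = [];
  for (let i = 0; i < n; i++) {
    out.push({ category, severity: 'minor' });
  }
  return out;
}

// Baseline: 4 terminology + 6 style errors. With RAL: 2 terminology + 6 style.
const demoReport = compareRuns(
  repeatError('terminology', 4).concat(repeatError('style', 6)),
  repeatError('terminology', 2).concat(repeatError('style', 6))
);
// terminologyReduction = 0.5, styleReduction = 0, totalReduction = 0.2
```

In this toy case the total reduction (20%) understates the terminology gain (50%) because unchanged style errors dominate the count, which is the pattern the pitfall describes.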
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Regulated/Legal Localization | RAL + Reference-Anchored MQM + Paragraph Scoring | Terminology precision is non-negotiable; holistic metrics mask compliance risks | Moderate (retrieval overhead + judge calibration) |
| High-Volume Marketing | Holistic Scoring + Lightweight Glossary | Style and tone matter more than strict terminology; throughput priority | Low (minimal retrieval, faster evaluation) |
| Low-Baseline Models (Mistral, DeepSeek) | Full RAL Implementation | Highest marginal gains (42–44% terminology reduction); patches training gaps | Higher (retrieval + injection, but offsets model weakness) |
| High-Baseline Models (Anthropic, Google) | Targeted RAL + Style Isolation | Smaller terminology gains (16–24%); focus shifts to style/fluency optimization | Low-Moderate (selective retrieval reduces overhead) |
| Strict Compliance Audits | Reference-Anchored MQM + 4-Judge Calibration | Requires defensible, statistically validated quality metrics for regulatory review | High (judge pool + reference injection + statistical testing) |
Configuration Template
```yaml
# locale_config.yaml
locale: pt-PT
domain: eu_regulatory
glossary_density: 72
non_translatable_count: 2
glossary_format:
  - source: "provider"
    target: "prestador"
    category: "custom"
    context: "legal_service"
  - source: "high-risk AI system"
    target: "sistema de IA de risco elevado"
    category: "custom"
    context: "ai_act"
  - source: "GDPR"
    target: "GDPR"
    category: "non-translatable"
    context: "regulatory"
inference_payload:
  brand_voice: "formal EU regulatory register"
  constraints:
    - "Use formal EU regulatory register"
    - "Maintain passive voice for legal obligations"
    - "Preserve decimal comma formatting"
    - "Apply EN-PT punctuation spacing rules"
    - "Avoid anglicisms in compliance terminology"
    - "Use 'prestador' for service providers in legal context"
    - "Capitalize defined terms per EUR-Lex standards"
    - "Render dates as DD/MM/YYYY"
    - "Use full stop for list termination"
    - "Maintain gender-neutral phrasing where applicable"
    - "Apply specific modal verbs for regulatory mandates"
    - "Preserve citation formatting for directives"
    - "Enforce terminology consistency across clauses"
evaluation:
  metric: "GEMBA-MQM"
  reference_injection: true
  reference_source: "eur_lex_official"
  error_categories: ["accuracy", "fluency", "style", "terminology"]
  severity_weights:
    minor: 1
    major: 5
    critical: 25
  judge_pool:
    - "claude-sonnet-4.6"
    - "gpt-4.1"
    - "gemini-2.5-flash"
    - "mistral-large-2512"
  excluded_judges:
    - "deepseek"
    - "qwen"
  statistical_test: "paired_wilcoxon_holm_bonferroni"
  significance_threshold: 0.001
```
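A small sanity check on the evaluation settings might look like the sketch below. It assumes the YAML has already been parsed into a plain object by whatever loader is in use, and that the evaluation keys sit together in one object. The `EvaluationConfig` interface and the specific checks are illustrative, not a schema defined by the source.

```typescript
// Assumed shape of the parsed evaluation settings from the template.
interface EvaluationConfig {
  metric: string;
  reference_injection: boolean;
  severity_weights: { minor: number; major: number; critical: number };
  judge_pool: string[];
  excluded_judges: string[];
  significance_threshold: number;
}

// Return a list of human-readable problems; an empty list means the config passed.
function validateEvaluationConfig(cfg: EvaluationConfig): string[] {
  const problems: string[] = [];
  const w = cfg.severity_weights;
  if (!(w.minor > 0 && w.minor < w.major && w.major < w.critical)) {
    problems.push('severity_weights must satisfy 0 < minor < major < critical');
  }
  if (!cfg.reference_injection) {
    problems.push('reference_injection is disabled; scoring would be reference-free');
  }
  if (cfg.judge_pool.length === 0) {
    problems.push('judge_pool is empty');
  }
  // A judge counts as excluded when its ID starts with an excluded name.
  cfg.judge_pool.forEach(judge => {
    const hit = cfg.excluded_judges.filter(
      name => judge.toLowerCase().indexOf(name.toLowerCase()) === 0
    );
    if (hit.length > 0) {
      problems.push(`judge "${judge}" is also listed as excluded`);
    }
  });
  if (!(cfg.significance_threshold > 0 && cfg.significance_threshold < 1)) {
    problems.push('significance_threshold must be between 0 and 1');
  }
  return problems;
}

// Values copied from the configuration template.
const templateProblems = validateEvaluationConfig({
  metric: 'GEMBA-MQM',
  reference_injection: true,
  severity_weights: { minor: 1, major: 5, critical: 25 },
  judge_pool: ['claude-sonnet-4.6', 'gpt-4.1', 'gemini-2.5-flash', 'mistral-large-2512'],
  excluded_judges: ['deepseek', 'qwen'],
  significance_threshold: 0.001
});
```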
Quick Start Guide
- Extract & Structure Glossary: Pull 70+ domain-specific terms from held-out corpora or official style guides. Add 2 non-translatable anchors. Format as source-target pairs with locale tags.
- Deploy Retrieval Layer: Implement a keyword/lemma matcher that scans incoming source paragraphs and returns only matching glossary entries. Cache results per locale to avoid redundant lookups.
- Construct Inference Payload: Assemble the translation request with retrieved terms, brand voice profile, and 13 locale constraints. Serialize into the system prompt, keeping the user message clean for the translation task.
- Calibrate Evaluation Loop: Inject official human references into judge prompts. Run outputs through the calibrated 4-judge pool. Calculate MQM scores at the paragraph level using the weighted penalty formula.
- Validate & Iterate: Run paired statistical tests against baseline outputs. Isolate terminology metrics from style scores. Adjust glossary density or retrieval matching rules based on error distribution. Deploy to CI/CD when p < 0.001 and Cohen's d > 0.20.
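The deployment gate in the last step can be sketched as follows. The p-values are assumed to come from an external paired Wilcoxon test, which is not implemented here. Cohen's d is computed in its paired form (mean difference divided by the standard deviation of the differences), which is an assumption since the source does not say which variant it uses. All function names and sample numbers are illustrative.

```typescript
// Paired Cohen's d: mean of per-item differences over their sample SD.
function pairedCohensD(baseline: number[], treated: number[]): number {
  if (baseline.length !== treated.length || baseline.length < 2) {
    throw new Error('Need two equally sized samples with at least 2 items');
  }
  const diffs = treated.map((t, i) => t - baseline[i]);
  const mean = diffs.reduce((a, b) => a + b, 0) / diffs.length;
  const variance =
    diffs.reduce((a, d) => a + (d - mean) * (d - mean), 0) / (diffs.length - 1);
  if (variance === 0) {
    // Every pair differs by the same amount: no spread to divide by.
    return mean === 0 ? 0 : (mean > 0 ? Infinity : -Infinity);
  }
  return mean / Math.sqrt(variance);
}

// Holm-Bonferroni step-down: compare the k-th smallest p-value (k from 0)
// with alpha / (m - k) and stop at the first comparison that fails.
function holmReject(pValues: number[], alpha: number): boolean[] {
  const order = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p);
  const rejected = pValues.map(() => false);
  for (let rank = 0; rank < order.length; rank++) {
    const threshold = alpha / (order.length - rank);
    if (order[rank].p > threshold) {
      break;
    }
    rejected[order[rank].index] = true;
  }
  return rejected;
}

// Gate: every comparison must be significant after correction and
// every effect size must exceed the minimum.
function readyToDeploy(
  pValues: number[],
  effectSizes: number[],
  alpha: number = 0.001,
  minEffect: number = 0.2
): boolean {
  const allSignificant = holmReject(pValues, alpha).every(r => r);
  const allLargeEnough = effectSizes.every(d => d > minEffect);
  return pValues.length > 0 && allSignificant && allLargeEnough;
}

// Made-up paragraph scores: differences are 1, 2, 0, so d = 1 / 1 = 1.
const demoEffect = pairedCohensD([1, 2, 3], [2, 4, 3]);
// Made-up p-values for three model comparisons.
const demoGate = readyToDeploy([0.0001, 0.0004, 0.00005], [demoEffect, 0.35, 0.6]);
```

Holm's procedure uses non-strict thresholds, whereas the checklist states p < 0.001, so a p-value that lands exactly on a threshold is treated slightly differently here.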