- Python 3.9+
- An Anthropic API key (the examples use the Anthropic SDK; adapting them to OpenAI means swapping the client calls)
- Basic familiarity with REST APIs and Python
pip install anthropic python-dotenv
Step 1: Set Up Your Project Structure
llm-judge-pipeline/
├── evaluator.py
├── prompts.py
├── logger.py
├── main.py
├── results/
│   └── evaluations.json
└── .env
Create your .env:
ANTHROPIC_API_KEY=your_api_key_here
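A missing or placeholder key is easy to overlook, because os.getenv quietly returns None and the problem only surfaces later as an authentication error. As a minimal sketch, a small guard can fail early with a clearer message. The helper name require_env and the example key value are assumptions for illustration, not part of the original project.

```python
import os


def require_env(name: str, env=None) -> str:
    """Return an environment variable's value or fail with a clear message."""
    # Allow a plain dict to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value or value == "your_api_key_here":
        raise RuntimeError(
            f"{name} is missing or still set to the placeholder; check your .env file"
        )
    return value


# Demonstration with an explicit mapping instead of the real environment.
# "sk-example" is a made-up value, not a real key format guarantee.
key = require_env("ANTHROPIC_API_KEY", env={"ANTHROPIC_API_KEY": "sk-example"})
print(key)
```

In the real project this would be called after load_dotenv(), without the env argument.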
Step 2: Design Your Evaluation Prompt
The quality of your judge depends almost entirely on your prompt. We want structured, consistent output — so we'll ask the model to respond in JSON.
prompts.py
JUDGE_PROMPT = """
You are an expert translation evaluator with deep knowledge of linguistics and cultural context.
You will be given:
- SOURCE: The original text
- TRANSLATION: The translated output to evaluate
- TARGET LANGUAGE: The language the translation is written in
Evaluate the translation on three dimensions and return ONLY a JSON object:
{{
  "fluency": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "accuracy": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "cultural_appropriateness": {{
    "score": <1-10>,
    "reason": "<one sentence explanation>"
  }},
  "overall_score": <average of the three scores>,
  "recommendation": "<pass | review | reject>"
}}
SOURCE: {source}
TRANSLATION: {translation}
TARGET LANGUAGE: {target_language}
"""
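Because the template is filled in with str.format, the doubled braces in the JSON skeleton come out as literal braces, while {source}, {translation}, and {target_language} are placeholders. The sketch below uses a shortened stand-in template (MINI_PROMPT is hypothetical, not the full judge prompt) to show what the model would actually receive.

```python
# MINI_PROMPT is a shortened, hypothetical stand-in for JUDGE_PROMPT.
MINI_PROMPT = """Return ONLY a JSON object:
{{
  "fluency": {{"score": <1-10>}}
}}
SOURCE: {source}
TRANSLATION: {translation}
TARGET LANGUAGE: {target_language}
"""

# str.format turns "{{" and "}}" into single braces and fills the placeholders.
rendered = MINI_PROMPT.format(
    source="Welcome to our platform.",
    translation="Kaabọ si pẹpẹ wa.",
    target_language="Yoruba",
)
print(rendered)
```

If the braces in the JSON skeleton were left single, str.format would treat them as placeholders and typically raise a KeyError or ValueError, so the escaping is worth keeping whenever the prompt is edited.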
Step 3: Build the Evaluator
evaluator.py
import json
import os
import re

import anthropic
from dotenv import load_dotenv

from prompts import JUDGE_PROMPT

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))


def evaluate_translation(source: str, translation: str, target_language: str) -> dict:
    """Ask the judge model to score one translation and return the parsed JSON."""
    prompt = JUDGE_PROMPT.format(
        source=source,
        translation=translation,
        target_language=target_language,
    )
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": prompt}
        ],
    )
    raw_response = message.content[0].text
    try:
        result = json.loads(raw_response)
    except json.JSONDecodeError:
        # Fallback: the model may wrap the JSON in prose or markdown,
        # so take the span from the first "{" to the last "}" and parse that.
        json_match = re.search(r'\{.*\}', raw_response, re.DOTALL)
        if json_match:
            result = json.loads(json_match.group())
        else:
            raise ValueError("Could not parse JSON from model response")
    return result


def batch_evaluate(pairs: list[dict]) -> list[dict]:
    """Evaluate multiple source/translation pairs"""
    results = []
    for pair in pairs:
        result = evaluate_translation(
            source=pair["source"],
            translation=pair["translation"],
            target_language=pair["target_language"],
        )
        # Keep the inputs next to the scores so each record is self-describing.
        result["source"] = pair["source"]
        result["translation"] = pair["translation"]
        results.append(result)
    return results
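The regex fallback in evaluate_translation is greedy: it matches from the first { to the last }, so it can still fail if the model adds text containing braces after the JSON. One possible alternative, sketched here with the standard library's json.JSONDecoder.raw_decode, scans for the first position that decodes as a JSON object. The helper name extract_json_object and the sample reply are assumptions for illustration and are not part of the original code.

```python
import json


def extract_json_object(raw: str) -> dict:
    """Return the first JSON object found in raw, ignoring surrounding text."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next one.
        start = raw.find("{", start + 1)
    raise ValueError("Could not parse JSON from model response")


# A made-up reply with prose before the JSON and stray braces after it.
reply = (
    'Sure! {"overall_score": 8.67, "recommendation": "pass"} '
    "Let me know if the {rubric} needs adjusting."
)
parsed = extract_json_object(reply)
print(parsed["recommendation"])
```

On this reply the greedy regex would capture everything up to the closing brace of {rubric} and json.loads would reject it, whereas the scan stops at the end of the first valid object.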
Step 4: Add a Logger
logger.py
import json
import os
from datetime import datetime, timezone

RESULTS_FILE = "results/evaluations.json"


def log_evaluation(evaluation: dict):
    """Append one evaluation to the results file with a UTC timestamp."""
    os.makedirs("results", exist_ok=True)
    existing = []
    if os.path.exists(RESULTS_FILE):
        with open(RESULTS_FILE, "r") as f:
            existing = json.load(f)
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time.
    evaluation["timestamp"] = datetime.now(timezone.utc).isoformat()
    existing.append(evaluation)
    with open(RESULTS_FILE, "w") as f:
        json.dump(existing, f, indent=2)
    print(f"✅ Logged evaluation. Overall score: {evaluation['overall_score']}/10")
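Once a few evaluations are logged, the file can be read back and summarized. The sketch below assumes the entries have the shape written by log_evaluation (a list of dicts with overall_score and recommendation); the function name summarize_entries and the two sample entries are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize_entries(entries: list[dict]) -> dict:
    """Summarize logged evaluations: count, mean score, recommendation tally."""
    if not entries:
        return {"count": 0, "mean_overall_score": None, "recommendations": {}}
    return {
        "count": len(entries),
        "mean_overall_score": round(mean(e["overall_score"] for e in entries), 2),
        "recommendations": dict(Counter(e["recommendation"] for e in entries)),
    }


# Made-up entries standing in for the contents of results/evaluations.json.
entries = [
    {"overall_score": 8.0, "recommendation": "pass"},
    {"overall_score": 6.0, "recommendation": "review"},
]
summary = summarize_entries(entries)
print(summary)
```

In the pipeline, entries would come from json.load on the results file rather than an inline list.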
Step 5: Run It
main.py
import json

from evaluator import evaluate_translation
from logger import log_evaluation

source = "Please ensure the patient takes the medication twice daily with food."
translation = "Jọwọ rii daju pe alaisan mu oogun naa lẹmeji lojoojumọ pẹlu ounjẹ."
target_language = "Yoruba"

result = evaluate_translation(source, translation, target_language)
log_evaluation(result)
print(json.dumps(result, indent=2))
Sample Output
{
  "fluency": {
    "score": 9,
    "reason": "The translation reads naturally and follows Yoruba grammatical structure."
  },
  "accuracy": {
    "score": 8,
    "reason": "Core meaning is preserved; minor phrasing differences don't affect intent."
  },
  "cultural_appropriateness": {
    "score": 9,
    "reason": "Terminology is appropriate for a Nigerian Yoruba-speaking audience."
  },
  "overall_score": 8.67,
  "recommendation": "pass"
}
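Since the judge computes overall_score itself, it can be worth checking the arithmetic and the value ranges before trusting a result. In the sample above, (9 + 8 + 9) / 3 ≈ 8.67, which matches. The following is a minimal sketch of such a check; the function name validate_result and the 0.01 tolerance are assumptions, chosen so that a correctly rounded two-decimal average passes.

```python
DIMENSIONS = ("fluency", "accuracy", "cultural_appropriateness")
ALLOWED_RECOMMENDATIONS = {"pass", "review", "reject"}


def validate_result(result: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found in a judge result (empty if none)."""
    problems = []
    scores = []
    for name in DIMENSIONS:
        entry = result.get(name)
        score = entry.get("score") if isinstance(entry, dict) else None
        if not isinstance(score, (int, float)) or not 1 <= score <= 10:
            problems.append(f"{name}: score {score!r} is not a number in 1-10")
        else:
            scores.append(score)
    # Only compare the average when all three scores are usable.
    if len(scores) == len(DIMENSIONS):
        expected = sum(scores) / len(scores)
        overall = result.get("overall_score")
        if not isinstance(overall, (int, float)) or abs(overall - expected) > tolerance:
            problems.append(f"overall_score {overall!r} != average {expected:.2f}")
    if result.get("recommendation") not in ALLOWED_RECOMMENDATIONS:
        problems.append(f"unexpected recommendation {result.get('recommendation')!r}")
    return problems


# The sample output from above, with the reasons omitted for brevity.
sample = {
    "fluency": {"score": 9},
    "accuracy": {"score": 8},
    "cultural_appropriateness": {"score": 9},
    "overall_score": 8.67,
    "recommendation": "pass",
}
print(validate_result(sample))
```

An empty list means the result is internally consistent; it says nothing about whether the scores themselves are right.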
Step 6: Scale It with Batch Processing
from evaluator import batch_evaluate
from logger import log_evaluation

pairs = [
    {
        "source": "Welcome to our platform.",
        "translation": "Kaabọ si pẹpẹ wa.",
        "target_language": "Yoruba"
    },
    {
        "source": "Your payment was successful.",
        "translation": "Isanwo rẹ ti ṣaṣeyọri.",
        "target_language": "Yoruba"
    }
]

results = batch_evaluate(pairs)
for result in results:
    log_evaluation(result)
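As written, batch_evaluate stops at the first exception, so one unparseable response loses the rest of the batch. One way to handle this, sketched below, is to collect failures and keep going. The names batch_evaluate_safe and fake_judge are hypothetical; the judge is passed in as a parameter so the example runs without an API call, and in the pipeline it would be evaluate_translation.

```python
from typing import Callable


def batch_evaluate_safe(
    pairs: list[dict], evaluate: Callable[..., dict]
) -> tuple[list[dict], list[dict]]:
    """Evaluate each pair, collecting failures instead of aborting the batch."""
    results, failures = [], []
    for index, pair in enumerate(pairs):
        try:
            result = evaluate(
                source=pair["source"],
                translation=pair["translation"],
                target_language=pair["target_language"],
            )
        except Exception as exc:  # broad on purpose: API, parsing, or key errors
            failures.append(
                {"index": index, "source": pair.get("source"), "error": str(exc)}
            )
            continue
        result["source"] = pair["source"]
        result["translation"] = pair["translation"]
        results.append(result)
    return results, failures


def fake_judge(source: str, translation: str, target_language: str) -> dict:
    """Stand-in for evaluate_translation; rejects empty translations."""
    if not translation:
        raise ValueError("empty translation")
    return {"overall_score": 8.0, "recommendation": "pass"}


pairs = [
    {"source": "Welcome to our platform.", "translation": "Kaabọ si pẹpẹ wa.",
     "target_language": "Yoruba"},
    {"source": "Your payment was successful.", "translation": "",
     "target_language": "Yoruba"},
]
results, failures = batch_evaluate_safe(pairs, fake_judge)
print(len(results), len(failures))  # prints "1 1"
```

The failures list can then be logged or retried separately without re-running the pairs that already succeeded.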
Pitfall Guide
- Judge-Translator Model Conflict: Using the same LLM for both translation and evaluation introduces self-bias, as the model recognizes its own generation patterns. Always decouple the translator and judge models (e.g., translate with a smaller/faster model, evaluate with a reasoning-heavy model).
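- Uncontrolled Temperature & Determinism: LLM scoring variance increases when temperature > 0. Set temperature=0 to make evaluation runs as reproducible as possible across CI/CD pipelines. Note that top_p=1 only disables nucleus filtering and does not make sampling deterministic, and even at temperature 0 small run-to-run differences can remain.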
- JSON Parsing Fragility: Frontier models occasionally prepend conversational text or markdown formatting before the JSON block. Always implement a regex fallback (r'\{.*\}' with re.DOTALL) or use native structured output APIs to prevent JSONDecodeError crashes in production.
- Cost & Rate Limiting Blind Spots: Batch processing without concurrency control or token budgeting triggers API throttling and unexpected costs. Implement exponential backoff, token counting middleware, and async batching (e.g., asyncio + aiohttp) for large datasets.
- Cultural/Linguistic Bias Drift: LLMs trained predominantly on English/Western corpora may misjudge low-resource languages or region-specific idioms. Calibrate the judge prompt with language-specific rubrics and periodically validate scores against human spot-checks.
- Static Prompt Degradation: Prompt effectiveness decays as model weights update or new capabilities emerge. Version-control your prompt templates, log raw model responses, and schedule quarterly re-validation against a held-out human-annotated benchmark set.
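The exponential backoff mentioned in the rate-limiting pitfall can be sketched with the standard library alone. The helper name call_with_backoff, its default delays, and the flaky stand-in function are all assumptions for illustration. In practice retry_on would likely be narrowed to the SDK's rate-limit exception, and it is worth checking whether your SDK version already retries on its own before layering this on top.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay,
            # with up to 50% random jitter to avoid synchronized retries.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


# Demonstration: a made-up function that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays = []  # record the delays instead of actually sleeping
outcome = call_with_backoff(flaky, sleep=delays.append)
print(outcome, len(delays))  # prints "ok 2"
```

A call such as call_with_backoff(lambda: evaluate_translation(source, translation, target_language)) would wrap the judge request from Step 3.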
Deliverables
- Architecture Blueprint: Modular pipeline diagram detailing the flow from source/translation ingestion → prompt templating → LLM judge API → regex/JSON parser → timestamped logger → CI/CD dashboard. Includes async scaling paths and fallback routing.
- Implementation Checklist: Pre-flight validation steps covering API key rotation, temperature locking, JSON schema enforcement, token budget thresholds, error logging hooks, and human calibration intervals.
- Configuration Templates: Production-ready .env scaffolding, requirements.txt/pyproject.toml dependency pins, GitHub Actions workflow YAML for automated translation evaluation on PR merges, and structured JSON log schema for downstream analytics.