achieves the strongest human-LLM alignment (Pearson 0.89) without the noise introduced by 10-point scales.
```python
from enum import Enum
from dataclasses import dataclass
from typing import List, Dict, Any


class QualityTier(Enum):
    EXCELLENT = (0.8, 1.0)
    ADEQUATE = (0.5, 0.7)
    POOR = (0.2, 0.4)
    INVALID = (0.0, 0.1)


@dataclass
class ScoringRubric:
    tier_definitions: Dict[QualityTier, str]
    scale_mapping: float = 0.2  # Maps 0-5 integer to 0.0-1.0 float

    def get_prompt_template(self) -> str:
        tiers = "\n".join([
            f"- {tier.name} ({low}-{high}): {desc}"
            for tier, desc in self.tier_definitions.items()
            for low, high in [tier.value]
        ])
        return (
            f"Evaluate the agent response against the following explicit criteria:\n{tiers}\n"
            f"Return a single float between 0.0 and 1.0. Do not include explanations."
        )
```
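The rubric above stores a `scale_mapping` factor and numeric bands but does not show how a raw 0-5 grade becomes a tier. The sketch below is one way to do it. The helper names `to_unit_score` and `tier_for` are hypothetical, the bands are copied from `QualityTier`, and assigning scores that fall in the gaps between bands (for example 0.75) to the lower tier is an assumption rather than something the rubric specifies.

```python
from typing import List, Tuple

# Bands copied from the QualityTier enum, ordered from highest to lowest.
TIER_BANDS: List[Tuple[str, float, float]] = [
    ("EXCELLENT", 0.8, 1.0),
    ("ADEQUATE", 0.5, 0.7),
    ("POOR", 0.2, 0.4),
    ("INVALID", 0.0, 0.1),
]


def to_unit_score(raw: int, scale_mapping: float = 0.2) -> float:
    """Map an integer grade on the 0-5 scale to a float in 0.0-1.0."""
    if not 0 <= raw <= 5:
        raise ValueError(f"Grade must be between 0 and 5, got {raw}")
    # Round to hide float artifacts such as 3 * 0.2 == 0.6000000000000001.
    return round(raw * scale_mapping, 2)


def tier_for(score: float) -> str:
    """Return the name of the first band whose lower bound the score reaches.

    Assumption: a score in a gap between bands belongs to the lower tier.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"Score must be between 0.0 and 1.0, got {score}")
    for name, low, _high in TIER_BANDS:
        if score >= low:
            return name
    raise AssertionError("unreachable: the INVALID band starts at 0.0")


score = to_unit_score(3)   # 0.6
tier = tier_for(score)     # "ADEQUATE"
```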
Step 2: Instrument Trajectory Capture
Agents must log every tool invocation, decision branch, and intermediate state. This is best achieved through middleware or hook providers that wrap the execution engine without modifying core business logic.
```python
import logging
import time
from functools import wraps
from typing import Any, Callable, Dict, List


class ExecutionTracker:
    def __init__(self):
        self.steps: List[Dict[str, Any]] = []
        self.logger = logging.getLogger("agent.tracker")

    def record_step(self, tool_name: str, input_args: Dict, output: Any, status: str):
        self.steps.append({
            "tool": tool_name,
            "input": input_args,
            "output": output,
            "status": status,
            "timestamp": time.time()
        })

    def get_trajectory(self) -> List[Dict[str, Any]]:
        # Shallow copy: callers cannot reorder the log, but the step dicts are shared.
        return self.steps.copy()


def track_execution(tracker: ExecutionTracker):
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            tool_name = func.__name__
            try:
                result = func(*args, **kwargs)
                # Only keyword arguments are recorded; positional arguments are not captured.
                tracker.record_step(tool_name, kwargs, result, "success")
                return result
            except Exception as e:
                # Log the failure, then re-raise so the caller's error handling still runs.
                tracker.record_step(tool_name, kwargs, str(e), "error")
                raise
        return wrapper
    return decorator
```
Step 3: Build the Hybrid Evaluation Engine
Deterministic checks should handle hard constraints (format validation, required tool usage, safety keywords). LLM judges should handle subjective quality assessment. This routing minimizes latency and cost while preserving nuance.
```python
from typing import Any, Callable, Dict, List, Tuple


class HybridEvaluator:
    def __init__(self, rubric: ScoringRubric, judge_model: str):
        self.rubric = rubric
        self.judge_model = judge_model
        self.deterministic_rules: List[Callable] = []

    def add_rule(self, rule: Callable) -> None:
        self.deterministic_rules.append(rule)

    def run_deterministic_checks(self, output: str, trajectory: List[Dict]) -> Tuple[bool, List[str]]:
        failures = []
        for rule in self.deterministic_rules:
            if not rule(output, trajectory):
                # Fall back to repr() for callables without __name__ (e.g. functools.partial)
                failures.append(getattr(rule, "__name__", repr(rule)))
        return len(failures) == 0, failures

    def evaluate(self, output: str, trajectory: List[Dict]) -> Dict[str, Any]:
        passed_det, det_failures = self.run_deterministic_checks(output, trajectory)
        if not passed_det:
            return {"score": 0.0, "reason": f"Failed deterministic rules: {det_failures}", "type": "deterministic"}

        # Route to LLM judge only if hard constraints pass
        prompt = self.rubric.get_prompt_template()
        # Simulate LLM call (replace with actual provider SDK)
        llm_score = self._call_judge(prompt, output)
        return {
            "score": llm_score,
            "reason": "Passed hard constraints, scored by rubric",
            "type": "llm_judge",
            "trajectory_length": len(trajectory),
            "tool_calls": [s["tool"] for s in trajectory]
        }

    def _call_judge(self, prompt: str, output: str) -> float:
        # Placeholder for actual LLM inference
        # In production, use AWS Bedrock, OpenAI, or Anthropic SDK
        return 0.85
```
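The `_call_judge` placeholder returns a constant, whereas a real judge returns text that must be turned into a float. The following is a minimal sketch of that parsing step, assuming the judge replies with free text that contains the score somewhere. `parse_judge_score` is a hypothetical helper, not part of any provider SDK, and returning `None` for unusable replies is a design assumption.

```python
import re
from typing import Optional

# Matches the first signed decimal number in the reply, e.g. "0.85", ".5" or "-1".
_NUMBER_RE = re.compile(r"[-+]?\d*\.?\d+")


def parse_judge_score(reply: str) -> Optional[float]:
    """Extract a score in [0.0, 1.0] from a judge reply.

    Returns None when no number is found or the number is out of range,
    so the caller can retry or flag the sample instead of trusting a bad score.
    """
    match = _NUMBER_RE.search(reply.strip())
    if match is None:
        return None
    value = float(match.group())
    if not 0.0 <= value <= 1.0:
        return None
    return value
```

Rejecting out-of-range values rather than clamping them keeps a reply such as "7/10" from silently becoming a perfect score.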
Architecture Decisions & Rationale
- Separation of Concerns: Deterministic validators run in milliseconds at zero marginal cost. They filter out format violations, missing required parameters, or unsafe outputs before the LLM judge is invoked. This reduces judge latency by 40-60% in high-throughput pipelines.
- Explicit Thresholds over Open Prompts: Mapping quality to strict numeric bands eliminates judge ambiguity. The model no longer guesses what "good" means; it matches output characteristics to predefined bands.
- Trajectory as First-Class Data: Logging tool calls, inputs, and statuses enables post-hoc analysis of agent efficiency. Duplicate calls, irrelevant tool selection, and error recovery patterns become quantifiable metrics rather than anecdotal observations.
- Scale Mapping (0-5 to 0.0-1.0): The 5-point integer scale aligns with human cognitive grading patterns, while the float representation integrates cleanly with continuous monitoring dashboards and automated gating systems.
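To make the trajectory point concrete, the sketch below derives a few efficiency metrics from the step dictionaries that `ExecutionTracker.record_step` produces. The function name `trajectory_metrics`, the metric names, and the definition of a duplicate (same tool with identical input) are assumptions chosen for illustration.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def trajectory_metrics(trajectory: List[Dict[str, Any]], max_steps: int = 8) -> Dict[str, Any]:
    """Summarize a trajectory of {"tool", "input", "output", "status"} steps."""
    # Key each call by tool name plus a stable serialization of its input.
    calls = Counter(
        (step["tool"], json.dumps(step["input"], sort_keys=True, default=str))
        for step in trajectory
    )
    # Every repeat beyond the first identical call counts as one duplicate.
    duplicate_calls = sum(count - 1 for count in calls.values())
    error_count = sum(1 for step in trajectory if step["status"] == "error")
    return {
        "steps": len(trajectory),
        "duplicate_calls": duplicate_calls,
        "error_count": error_count,
        "exceeds_max_steps": len(trajectory) > max_steps,
    }


example = [
    {"tool": "search_inventory", "input": {"sku": "AB123"}, "output": [], "status": "success"},
    {"tool": "search_inventory", "input": {"sku": "AB123"}, "output": [], "status": "success"},
    {"tool": "get_price", "input": {"sku": "AB123"}, "output": "timeout", "status": "error"},
]
metrics = trajectory_metrics(example)
```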
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Vague Rubric Design | Prompts like "Rate quality" force the judge to invent criteria, causing position and verbosity bias. Scores cluster around 0.5-0.7 regardless of actual output quality. | Define explicit tier boundaries with concrete examples. Use the 0-5 scale mapped to 0.0-1.0. Validate rubric separation on a held-out dataset before deployment. |
| Ignoring Intermediate Steps | Evaluating only the final response misses duplicate tool calls, hallucinated verification steps, and unsafe reasoning branches. | Implement trajectory capture via hooks/middleware. Score execution efficiency separately from output quality. Flag trajectories exceeding expected step counts. |
| Over-Delegating to LLM Judges | Routing every evaluation through an LLM increases latency, cost, and variance. Judges struggle with hard constraints like format validation or exact string matching. | Use deterministic checks for binary requirements (contains "$", matches regex, calls specific tool). Reserve LLM judges for subjective quality, tone, and completeness. |
| Misaligned Scoring Scales | 10-point scales introduce decision noise. Binary scales lose 73% of quality gradations. Both degrade human-LLM alignment. | Stick to a 5-point scale. Map to 0.0-1.0 for system integration. Calibrate thresholds against human reviewer baselines quarterly. |
| Missing Deterministic Guards | Without hard constraints, agents can bypass safety filters, omit required data, or return malformed JSON while still scoring highly on subjective rubrics. | Implement rule-based validators for format, required fields, safety keywords, and tool usage. Fail fast before LLM evaluation. |
| Prompt Contamination in Judges | Including example outputs or agent prompts in the judge's context causes score inflation or leakage. The judge begins optimizing for the prompt rather than the rubric. | Isolate judge context from agent context. Pass only the output, trajectory summary, and rubric. Use system prompts that explicitly forbid referencing the original task prompt. |
| Neglecting Latency/Cost Trade-offs | Running full trajectory analysis + LLM judging on every request creates pipeline bottlenecks and unpredictable compute costs. | Implement tiered evaluation: deterministic checks on 100% of requests, LLM judging on 20-30% sample, full trajectory analysis on failures or A/B test cohorts. |
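The tiered evaluation suggested in the last row can be sketched as a small routing function. Everything here is illustrative: `should_run_llm_judge` and `plan_evaluation` are hypothetical names, and hashing the request ID to pick the sample is an assumption that makes the decision repeatable for the same request.

```python
import hashlib
from typing import List


def should_run_llm_judge(request_id: str, sampling_rate: float = 0.25) -> bool:
    """Deterministically select roughly `sampling_rate` of requests for LLM judging."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate


def plan_evaluation(request_id: str, passed_deterministic: bool,
                    sampling_rate: float = 0.25) -> List[str]:
    """Return the evaluation stages to run for one request."""
    stages = ["deterministic"]  # Runs on 100% of requests.
    if not passed_deterministic:
        # Failures get full trajectory analysis and skip the judge.
        stages.append("trajectory_analysis")
        return stages
    if should_run_llm_judge(request_id, sampling_rate):
        stages.append("llm_judge")
    return stages
```

Because the sample is derived from the request ID rather than a random draw, re-running the pipeline on the same traffic judges the same subset, which keeps score comparisons across runs consistent.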
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Cost-sensitive batch processing | Deterministic rules + 10% LLM sampling | Minimizes judge calls while maintaining quality visibility | -65% vs full LLM eval |
| Safety-critical financial/healthcare | Full trajectory analysis + strict deterministic guards | Catches unsafe reasoning paths and hallucinated steps before output | +20% latency, -90% risk exposure |
| High-throughput customer support | Explicit rubric + deterministic format checks | Balances speed with consistent quality scoring | Baseline +5% overhead |
| Agent model comparison/A-B testing | Trajectory-aware hybrid + full LLM judging | Provides granular efficiency and quality metrics for model selection | +40% compute, high decision value |
Configuration Template
```yaml
evaluation_pipeline:
  rubric:
    scale: 5
    mapping: 0.0-1.0
    tiers:
      excellent: "0.8-1.0: Contains specific entities, correct format, actionable details"
      adequate: "0.5-0.7: Partial information, minor omissions, usable but incomplete"
      poor: "0.2-0.4: Vague, missing core requirements, requires significant correction"
      invalid: "0.0-0.1: Hallucinated, unsafe, or completely unhelpful"
  validators:
    deterministic:
      - type: regex_match
        pattern: "^[A-Z]{2}\\d{3,4}$"
        field: "output"
      - type: contains
        value: "$"
        field: "output"
      - type: tool_called
        name: "search_inventory"
        field: "trajectory"
    llm_judge:
      model: "us.anthropic.claude-sonnet-4-20250514-v1:0"
      temperature: 0.0
      max_tokens: 10
  trajectory:
    capture: true
    max_steps: 8
    alert_on_duplicate: true
  routing:
    deterministic_first: true
    llm_sampling_rate: 0.25
    fail_fast_on_violation: true
```
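One way to connect the deterministic validator entries to `HybridEvaluator.add_rule` is a small factory that turns each entry into a `(output, trajectory) -> bool` callable. This is a sketch under assumptions: `build_rule` is a hypothetical helper, the entries are mirrored as a Python list so that no YAML parser is assumed, and the `field` key is ignored because each validator type already implies which input it inspects.

```python
import re
from typing import Any, Callable, Dict, List

Rule = Callable[[str, List[Dict[str, Any]]], bool]

# Mirrors the deterministic validator entries of the configuration template.
DETERMINISTIC_VALIDATORS = [
    {"type": "regex_match", "pattern": r"^[A-Z]{2}\d{3,4}$", "field": "output"},
    {"type": "contains", "value": "$", "field": "output"},
    {"type": "tool_called", "name": "search_inventory", "field": "trajectory"},
]


def build_rule(spec: Dict[str, str]) -> Rule:
    """Turn one validator entry into a rule callable."""
    kind = spec["type"]
    if kind == "regex_match":
        pattern = re.compile(spec["pattern"])

        def rule(output: str, trajectory: List[Dict[str, Any]]) -> bool:
            return pattern.search(output) is not None
    elif kind == "contains":
        value = spec["value"]

        def rule(output: str, trajectory: List[Dict[str, Any]]) -> bool:
            return value in output
    elif kind == "tool_called":
        name = spec["name"]

        def rule(output: str, trajectory: List[Dict[str, Any]]) -> bool:
            return any(step["tool"] == name for step in trajectory)
    else:
        raise ValueError(f"Unknown validator type: {kind!r}")
    # A readable name helps because the evaluator reports failures by rule.__name__.
    rule.__name__ = f"{kind}_rule"
    return rule


rules = [build_rule(spec) for spec in DETERMINISTIC_VALIDATORS]
```

Each resulting callable can be passed to `add_rule` unchanged, and an unknown `type` fails loudly at load time instead of being skipped during evaluation.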
Quick Start Guide
- Install dependencies: `pip install strands-agents-evals opentelemetry-api` (or equivalent framework packages).
- Define your rubric: Create a `ScoringRubric` instance with explicit 0-5 tier definitions mapped to 0.0-1.0.
- Wrap your agent execution: Attach an `ExecutionTracker` hook to log tool calls, inputs, outputs, and status codes.
- Initialize the hybrid evaluator: Add deterministic rules for format/safety, then configure the LLM judge model and routing thresholds.
- Run evaluation: Pass agent output and trajectory to `HybridEvaluator.evaluate()`. Parse the returned score, reason, and trajectory metrics. Integrate with your CI/CD or monitoring dashboard.
This pipeline transforms agent evaluation from a retrospective quality check into a continuous, cost-aware optimization loop. By auditing both the destination and the path, you eliminate silent failures, control token spend, and establish measurable safety boundaries before production deployment.