y hooks over log parsing?** Parsing raw logs after execution is fragile and misses context. Hook-based instrumentation captures tool calls, arguments, success states, and timing in real time, preserving the causal chain required for accurate process scoring.
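**Why parallel execution?** Semantic and trajectory evaluation are independent, so running them concurrently takes roughly as long as the slower of the two calls rather than their sum. When both calls take similar time, that cuts the latency of this stage by about half without affecting the scores.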
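To make the latency argument concrete, here is a minimal sketch that simulates two independent evaluators with `asyncio.sleep`. The names `fake_evaluator`, `run_sequential`, `run_concurrent`, and `compare` are illustrative assumptions, and the delays are placeholders rather than measurements of any real judge model.

```python
import asyncio
import time
from typing import List, Tuple


async def fake_evaluator(delay_s: float) -> float:
    """Stand-in for an LLM judge call; sleeps instead of calling a model."""
    await asyncio.sleep(delay_s)
    return delay_s


async def run_sequential(delays: List[float]) -> float:
    """Await each evaluator in turn; total time is roughly sum(delays)."""
    start = time.perf_counter()
    for delay in delays:
        await fake_evaluator(delay)
    return time.perf_counter() - start


async def run_concurrent(delays: List[float]) -> float:
    """Await all evaluators together; total time is roughly max(delays)."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_evaluator(delay) for delay in delays))
    return time.perf_counter() - start


def compare(delays: List[float]) -> Tuple[float, float]:
    """Return (sequential_seconds, concurrent_seconds) for the same delays."""
    sequential = asyncio.run(run_sequential(delays))
    concurrent = asyncio.run(run_concurrent(delays))
    return sequential, concurrent
```

With two equal delays the concurrent run takes about half as long as the sequential one. The saving shrinks as the two calls become more unequal, because the slower call sets the floor.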
Implementation (Python)
The following implementation sketches an evaluation pipeline built from three layers: a deterministic guard, rubric-templated LLM judges, and a trajectory scorer that consumes hook-captured tool calls.
```python
import asyncio
import json
import re
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class QualityTier(Enum):
    EXCELLENT = (0.8, 1.0)
    ADEQUATE = (0.5, 0.7)
    POOR = (0.2, 0.4)
    CRITICAL = (0.0, 0.1)


@dataclass
class EvaluationResult:
    score: float
    reasoning: str
    tier: QualityTier
    latency_ms: float


@dataclass
class ToolInvocation:
    name: str
    arguments: Dict[str, Any]
    status: str
    timestamp: float


@dataclass
class AgentRun:
    prompt: str
    final_output: str
    trajectory: List[ToolInvocation] = field(default_factory=list)


PARSE_FAILED = {"score": 0.0, "reasoning": "Parse failed", "tier": "CRITICAL"}


def parse_judge_response(raw: str) -> Dict[str, Any]:
    """Extract the JSON object from a judge reply (simplified for example)."""
    json_match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not json_match:
        return dict(PARSE_FAILED)
    try:
        return json.loads(json_match.group())
    except json.JSONDecodeError:
        return dict(PARSE_FAILED)


def to_result(parsed: Dict[str, Any], latency_ms: float) -> EvaluationResult:
    """Build an EvaluationResult; unknown tier labels fall back to CRITICAL."""
    tier_name = str(parsed.get("tier", "CRITICAL")).strip().upper()
    tier = QualityTier.__members__.get(tier_name, QualityTier.CRITICAL)
    return EvaluationResult(
        score=float(parsed.get("score", 0.0)),
        reasoning=str(parsed.get("reasoning", "")),
        tier=tier,
        latency_ms=latency_ms,
    )


class DeterministicGuard:
    """Fast, zero-cost validation for hard constraints."""

    def __init__(self, required_fields: List[str], forbidden_patterns: List[str]):
        self.required = required_fields
        self.forbidden = forbidden_patterns

    def validate(self, run: AgentRun) -> EvaluationResult:
        start = time.perf_counter_ns()
        issues = []
        # Loop variable is not called `field`, which would shadow dataclasses.field.
        for required_field in self.required:
            if required_field not in run.final_output:
                issues.append(f"Missing required field: {required_field}")
        for pattern in self.forbidden:
            if pattern.lower() in run.final_output.lower():
                issues.append(f"Contains forbidden pattern: {pattern}")
        score = 1.0 if not issues else 0.0
        latency = (time.perf_counter_ns() - start) / 1e6
        return EvaluationResult(
            score=score,
            reasoning="; ".join(issues) if issues else "All hard constraints met",
            tier=QualityTier.EXCELLENT if score == 1.0 else QualityTier.CRITICAL,
            latency_ms=latency,
        )


class SemanticJudge:
    """LLM-as-Judge with explicit threshold mapping."""

    def __init__(self, model_client: Any, rubric_template: str):
        self.client = model_client
        self.rubric = rubric_template

    async def evaluate(self, run: AgentRun) -> EvaluationResult:
        start = time.perf_counter_ns()
        prompt = (
            f"{self.rubric}\n\n"
            f"Agent Output:\n{run.final_output}\n\n"
            "Return a JSON object with keys: score (float 0.0-1.0), "
            "reasoning (string), tier (string)."
        )
        response = await self.client.generate(prompt)
        latency = (time.perf_counter_ns() - start) / 1e6
        return to_result(parse_judge_response(response), latency)


class TrajectoryAnalyzer:
    """Scores the step-by-step execution path."""

    def __init__(self, model_client: Any, rubric_template: str):
        self.client = model_client
        self.rubric = rubric_template

    async def evaluate(self, run: AgentRun) -> EvaluationResult:
        start = time.perf_counter_ns()
        traj_summary = "\n".join(
            f"[{t.timestamp}] {t.name}({t.arguments}) -> {t.status}"
            for t in run.trajectory
        )
        prompt = (
            f"{self.rubric}\n\n"
            f"Execution Trajectory:\n{traj_summary}\n\n"
            "Return a JSON object with keys: score (float 0.0-1.0), "
            "reasoning (string), tier (string)."
        )
        response = await self.client.generate(prompt)
        latency = (time.perf_counter_ns() - start) / 1e6
        return to_result(parse_judge_response(response), latency)


class EvaluationPipeline:
    """Orchestrates parallel evaluation layers."""

    def __init__(self, guard: DeterministicGuard, judge: SemanticJudge, analyzer: TrajectoryAnalyzer):
        self.guard = guard
        self.judge = judge
        self.analyzer = analyzer

    async def run(self, run: AgentRun) -> Dict[str, Optional[EvaluationResult]]:
        # Deterministic check runs synchronously first
        guard_result = self.guard.validate(run)
        if guard_result.score == 0.0:
            # Short-circuit: skip both LLM calls when a hard constraint fails
            return {"guard": guard_result, "semantic": None, "trajectory": None}
        # Parallel execution for semantic and trajectory
        sem_res, traj_res = await asyncio.gather(
            self.judge.evaluate(run),
            self.analyzer.evaluate(run),
        )
        return {
            "guard": guard_result,
            "semantic": sem_res,
            "trajectory": traj_res,
        }
```
Why This Architecture Works
- Fail-Fast Deterministic Layer: Hard constraints are checked first. If the output lacks required fields or contains forbidden patterns, the pipeline short-circuits. This saves LLM tokens and reduces latency for obvious failures.
- Explicit Rubric Anchoring: The SemanticJudge and TrajectoryAnalyzer use structured prompts that force the model to map outputs to predefined tiers. This reduces score drift and enables consistent thresholding across runs.
- Parallel Orchestration: Semantic and trajectory evaluation are computationally independent. asyncio.gather runs them concurrently, cutting total evaluation time by nearly half compared to sequential execution.
- Hook-Based Trajectory Capture: The AgentRun object expects a pre-captured trajectory list. In production, this is populated via framework hooks that wrap each tool call, ensuring no steps are missed and that status and timing metadata are preserved.
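The hook-based capture described above could look something like the following sketch. `TrajectoryRecorder` and its `hook` decorator are hypothetical names introduced for illustration, not part of any specific agent framework; real frameworks expose their own callback or middleware APIs, and `search_flights` is a toy tool.

```python
import time
from dataclasses import dataclass, field
from functools import wraps
from typing import Any, Callable, Dict, List


@dataclass
class ToolInvocation:
    name: str
    arguments: Dict[str, Any]
    status: str
    timestamp: float


@dataclass
class TrajectoryRecorder:
    """Hypothetical hook: wraps tool callables and records each invocation."""
    trajectory: List[ToolInvocation] = field(default_factory=list)

    def hook(self, tool: Callable[..., Any]) -> Callable[..., Any]:
        # Keyword arguments only, so argument names survive in the record.
        @wraps(tool)
        def wrapper(**kwargs: Any) -> Any:
            started = time.time()
            try:
                result = tool(**kwargs)
            except Exception:
                # Record the failed step, then let the caller see the error.
                self.trajectory.append(
                    ToolInvocation(tool.__name__, dict(kwargs), "error", started)
                )
                raise
            self.trajectory.append(
                ToolInvocation(tool.__name__, dict(kwargs), "success", started)
            )
            return result
        return wrapper


recorder = TrajectoryRecorder()


@recorder.hook
def search_flights(origin: str, destination: str) -> List[str]:
    """Toy tool used only for this example."""
    return [f"{origin}-{destination}-001"]


search_flights(origin="SFO", destination="JFK")
```

After the call, `recorder.trajectory` holds one entry with the tool name, its arguments, a status, and a start timestamp, which is the shape the `AgentRun.trajectory` field expects.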
Pitfall Guide
1. Vague Rubric Definition
Explanation: Using open-ended prompts like "Evaluate quality" forces the judge to invent criteria, causing score inconsistency across runs.
Fix: Define explicit threshold bands (e.g., 0.8-1.0 = contains all required entities, 0.5-0.7 = missing one non-critical field). Anchor each tier to observable output characteristics.
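One way to keep the bands enforceable is to encode them in code and check the judge's tier label against its own score. This is a sketch under an assumption the bands themselves leave open: scores that fall in the gaps between bands (for example 0.75) are treated as belonging to the lower tier. `Tier`, `tier_for_score`, and `is_consistent` are illustrative names.

```python
from enum import Enum


class Tier(Enum):
    EXCELLENT = "EXCELLENT"
    ADEQUATE = "ADEQUATE"
    POOR = "POOR"
    CRITICAL = "CRITICAL"


# Lower bound of each band, highest first. Assumption: a score in a gap
# between bands (e.g. 0.75) rounds down to the lower tier.
TIER_FLOORS = [
    (0.8, Tier.EXCELLENT),
    (0.5, Tier.ADEQUATE),
    (0.2, Tier.POOR),
    (0.0, Tier.CRITICAL),
]


def tier_for_score(score: float) -> Tier:
    """Map a 0.0-1.0 score to the first band whose floor it reaches."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    for floor, tier in TIER_FLOORS:
        if score >= floor:
            return tier
    return Tier.CRITICAL  # not reached; the last floor is 0.0


def is_consistent(score: float, reported_tier: str) -> bool:
    """True when the judge's tier label matches the band of its own score."""
    return tier_for_score(score).name == reported_tier.strip().upper()
```

A judge reply whose score and tier disagree is a useful signal that the rubric wording is ambiguous or that the model ignored part of it.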
2. Ignoring the Execution Trajectory
Explanation: Validating only the final response misses duplicate API calls, unsafe parameter passing, and illogical tool ordering.
Fix: Instrument your agent framework with execution hooks that log every tool invocation, argument payload, and return status. Feed this sequence into a dedicated trajectory scorer.
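Some trajectory problems, such as exact duplicate calls, can be detected without an LLM at all. The sketch below is a cheap pre-check that could run alongside a trajectory scorer; `find_duplicate_calls` is a hypothetical helper, and it assumes each step is available as a `(tool_name, arguments)` pair.

```python
import json
from collections import Counter
from typing import Any, Dict, List, Tuple


def find_duplicate_calls(
    trajectory: List[Tuple[str, Dict[str, Any]]]
) -> Dict[str, int]:
    """Return {call signature: count} for calls repeated with identical arguments."""
    signatures = [
        # sort_keys makes the signature independent of argument order;
        # default=str keeps non-JSON values (dates, paths) from raising.
        f"{name}({json.dumps(arguments, sort_keys=True, default=str)})"
        for name, arguments in trajectory
    ]
    counts = Counter(signatures)
    return {signature: n for signature, n in counts.items() if n > 1}
```

For example, two identical `search` calls followed by one `book` call yield a single entry for the repeated `search` signature with a count of 2.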
3. Model Capability Mismatch
Explanation: Using a weaker model as the judge than the agent being evaluated creates a ceiling on detection accuracy. The judge cannot identify flaws it cannot comprehend.
Fix: Ensure the judge model's reasoning capability matches or exceeds the agent's model. For cost-sensitive pipelines, use a strong model for calibration and a smaller model for high-volume scoring, but validate alignment first.
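Validating alignment between a strong calibration judge and a smaller high-volume judge can start with something as simple as comparing their scores on the same runs. The following sketch assumes you already have two equal-length score lists. `alignment_report` and the 0.1 tolerance are illustrative choices, not recommended values.

```python
from statistics import mean
from typing import Dict, List


def alignment_report(
    strong_scores: List[float],
    small_scores: List[float],
    tolerance: float = 0.1,
) -> Dict[str, float]:
    """Compare two judges run on the same examples, pairwise by index."""
    if len(strong_scores) != len(small_scores) or not strong_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [abs(a - b) for a, b in zip(strong_scores, small_scores)]
    within_tolerance = sum(1 for d in diffs if d <= tolerance)
    return {
        "mean_abs_diff": mean(diffs),
        "agreement_rate": within_tolerance / len(diffs),
    }
```

A low agreement rate suggests the smaller judge should not yet replace the strong one for that rubric.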
4. Over-Reliance on LLM Judges for Simple Checks
Explanation: Routing format validation, regex matching, or required-field checks through an LLM adds unnecessary latency and cost.
Fix: Implement a deterministic guard layer that runs before any LLM invocation. Reserve semantic judges for contextual understanding, tone, and completeness assessment.
5. Static Rubrics in Evolving Workflows
Explanation: Rubrics hardcode expectations that break when agents gain new tools or when business requirements change.
Fix: Version your rubrics alongside agent deployments. Implement a rubric registry that loads configuration per environment. Schedule quarterly recalibration sessions with human reviewers to adjust thresholds.
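A rubric registry can be very small. The sketch below keys rubrics by environment and name and carries an explicit version string. `Rubric` and `RubricRegistry` are hypothetical names, and in practice the entries would likely be loaded from version-controlled configuration rather than registered by hand.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Rubric:
    name: str
    version: str
    template: str


class RubricRegistry:
    """Hypothetical registry keyed by (environment, rubric name)."""

    def __init__(self) -> None:
        self._rubrics: Dict[Tuple[str, str], Rubric] = {}

    def register(self, environment: str, rubric: Rubric) -> None:
        # Re-registering the same name replaces the previous version.
        self._rubrics[(environment, rubric.name)] = rubric

    def get(self, environment: str, name: str) -> Rubric:
        try:
            return self._rubrics[(environment, name)]
        except KeyError:
            raise KeyError(
                f"No rubric '{name}' registered for '{environment}'"
            ) from None
```

Storing the version next to each evaluation result makes it possible to tell later which rubric produced a given score.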
6. Position and Verbosity Bias
Explanation: LLM judges systematically prefer the first option presented or longer responses, skewing scores independent of actual quality.
Fix: Randomize the order of options in evaluation prompts. Add explicit instructions to the rubric: "Ignore response length. Focus strictly on factual accuracy and completeness."
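Order randomization only helps if the judge's pick can be mapped back to the original option. A minimal sketch, assuming the judge returns the position of its preferred option in the shuffled list; `shuffled_options` and `original_index` are illustrative helpers.

```python
import random
from typing import List, Optional, Tuple


def shuffled_options(
    options: List[str], seed: Optional[int] = None
) -> Tuple[List[str], List[int]]:
    """Return the options in random order plus the original index of each slot."""
    order = list(range(len(options)))
    random.Random(seed).shuffle(order)
    return [options[i] for i in order], order


def original_index(chosen_slot: int, order: List[int]) -> int:
    """Map the judge's pick (a slot in the shuffled list) back to the input list."""
    return order[chosen_slot]
```

Passing a seed makes a given evaluation reproducible, while varying the seed across runs spreads any position preference evenly over the options.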
7. Missing Cost Attribution
Explanation: Evaluation pipelines consume tokens and compute. Without tracking, teams cannot calculate the true cost of agent validation or optimize spend.
Fix: Tag every evaluation run with metadata: model used, token count, latency, and tier. Aggregate this data in your observability stack to track evaluation ROI and identify expensive failure patterns.
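The tagging described above can be sketched as a small record type plus an aggregation step. The model names and per-token prices below are placeholders invented for the example, not real provider rates; `EvalRecord`, `cost_usd`, and `cost_by_tier` are likewise hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRecord:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    tier: str


# Hypothetical (input, output) prices in USD per 1,000 tokens.
# Replace with your provider's actual rates.
PRICE_PER_1K = {
    "judge-large": (0.003, 0.015),
    "judge-small": (0.0003, 0.0015),
}


def cost_usd(record: EvalRecord) -> float:
    """Token cost of one evaluation call under the placeholder price table."""
    input_price, output_price = PRICE_PER_1K[record.model]
    return (
        record.input_tokens / 1000 * input_price
        + record.output_tokens / 1000 * output_price
    )


def cost_by_tier(records: List[EvalRecord]) -> Dict[str, float]:
    """Total evaluation spend grouped by the tier each run received."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.tier] += cost_usd(record)
    return dict(totals)
```

Grouping by tier shows how much of the evaluation budget goes to runs that end up failing, which is one way to spot expensive failure patterns.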
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume customer support | Deterministic guards + lightweight semantic judge | Speed and cost efficiency matter; trajectory complexity is low | Low (~$0.002/run) |
| Financial transaction agent | Full dual-layer + strict trajectory analyzer | Safety and compliance require process validation; errors are costly | Medium (~$0.015/run) |
| Multi-step research workflow | Dual-layer + custom trajectory rubric | Complex tool chains need step relevance scoring; hallucinations compound | High (~$0.04/run) |
| Internal knowledge retrieval | Deterministic guards only | Outputs are factual; process is linear; LLM judges add unnecessary overhead | Minimal (~$0.0001/run) |
Configuration Template
```yaml
# evaluation-pipeline-config.yaml
pipeline:
  version: "2.1"
  parallel_execution: true
  fail_fast_threshold: 0.0

layers:
  deterministic:
    enabled: true
    required_fields: ["flight_number", "departure_time", "total_price"]
    forbidden_patterns: ["error", "unauthorized", "pii_detected"]

  semantic_judge:
    enabled: true
    model: "us.anthropic.claude-sonnet-4-20250514-v1:0"
    rubric: |
      Rate output quality on a 0.0-1.0 scale:
      0.8-1.0: Contains all requested entities, accurate pricing, clear formatting
      0.5-0.7: Missing one non-critical detail or minor formatting issue
      0.2-0.4: Vague, lacks actionable specifics, or contains minor inaccuracies
      0.0-0.1: Fabricated data, completely unhelpful, or violates safety guidelines
    temperature: 0.0

  trajectory_analyzer:
    enabled: true
    model: "us.anthropic.claude-sonnet-4-20250514-v1:0"
    rubric: |
      Rate execution path quality on a 0.0-1.0 scale:
      0.8-1.0: Logical tool order, no duplicates, all calls relevant to prompt
      0.5-0.7: Minor inefficiency (1 redundant call or slightly suboptimal order)
      0.2-0.4: Irrelevant tool usage or excessive duplication (>2 repeats)
      0.0-0.1: Unsafe intermediate steps, unauthorized access, or completely wrong tool selection
    temperature: 0.0

output:
  format: "json"
  include_reasoning: true
  latency_tracking: true
  cost_tagging: true
```
Quick Start Guide
- Define your rubrics: Create explicit threshold mappings for both output quality and execution trajectory. Store them in a version-controlled configuration file.
- Instrument your agent: Add execution hooks to your framework that log every tool invocation, argument payload, and return status into a structured AgentRun object.
- Initialize the pipeline: Load your configuration, instantiate the deterministic guard, semantic judge, and trajectory analyzer, then wire them into the EvaluationPipeline orchestrator.
- Run a validation batch: Execute the pipeline against 50-100 historical agent runs. Analyze the score distribution, identify failure patterns, and adjust rubric thresholds accordingly.
- Deploy with thresholds: Set production gates (e.g., semantic >= 0.7 AND trajectory >= 0.6). Route runs below threshold to fallback handlers or human review. Monitor cost and latency metrics continuously.
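The production gate from the last step can be expressed as a small routing function. This is a sketch using the example thresholds from the list above. `route_run` and its return labels are hypothetical, and it assumes that a missing score (the guard short-circuited the pipeline) should go to review rather than be accepted.

```python
from typing import Optional


def route_run(
    semantic: Optional[float],
    trajectory: Optional[float],
    semantic_min: float = 0.7,
    trajectory_min: float = 0.6,
) -> str:
    """Return 'accept' or 'human_review' for one evaluated run."""
    # A missing score means the deterministic guard stopped the pipeline
    # before the LLM layers ran, so the run never passes the gate.
    if semantic is None or trajectory is None:
        return "human_review"
    if semantic >= semantic_min and trajectory >= trajectory_min:
        return "accept"
    return "human_review"
```

Keeping the thresholds as parameters lets the same function serve the validation batch, where you are still tuning them, and the production gate.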