Architecture Decisions
- Deterministic First, LLM Second: Always run code-based validators before invoking an LLM judge. Regex, JSON schema validation, and tool-call verification are faster, cheaper, and more consistent. Reserve LLM scoring for subjective criteria like groundedness or tone.
- Trajectory Tracking by Default: Capture step counts, tool invocation sequences, latency per step, and cumulative cost. Output correctness alone is insufficient for production safety.
- Reference-Free Online Scoring: Production traffic lacks ground truth. Use reference-free evaluators that validate format compliance, safety boundaries, and execution sanity without requiring a perfect answer.
- Failure Ingestion Pipeline: Automatically route online failures into the offline dataset with metadata (input, trajectory, failure reason, timestamp) to ensure the next version is tested against real degradation patterns.
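The reference-free idea can be sketched as a set of checks that look only at the trace itself. This is a minimal illustration rather than a complete evaluator: `LiveTrace`, `referenceFreeChecks`, the SSN-like deny-list pattern, and the 10-step ceiling are all assumptions made for the example.

```typescript
// Hypothetical trace shape for live traffic; no expected answer is attached.
interface LiveTrace {
  steps: Array<{ tool: string; latencyMs: number }>;
  finalOutput: string;
  totalCost: number;
}

type ReferenceFreeCheck = { name: string; run: (t: LiveTrace) => boolean };

// Illustrative deny-list only; a real safety check would be far more thorough.
const BLOCKED_PATTERNS: RegExp[] = [/\b\d{3}-\d{2}-\d{4}\b/]; // SSN-like strings

const referenceFreeChecks: ReferenceFreeCheck[] = [
  {
    name: 'format_validity',
    run: (t) => {
      try {
        JSON.parse(t.finalOutput);
        return true;
      } catch {
        return false;
      }
    },
  },
  {
    name: 'safety_compliance',
    run: (t) => !BLOCKED_PATTERNS.some((p) => p.test(t.finalOutput)),
  },
  {
    name: 'tool_call_sanity',
    // Assumed limits: at most 10 steps and a non-negative cost.
    run: (t) => t.steps.length <= 10 && t.totalCost >= 0,
  },
];

// Returns the names of the checks that failed; an empty array means all passed.
function runReferenceFree(t: LiveTrace): string[] {
  return referenceFreeChecks.filter((c) => !c.run(t)).map((c) => c.name);
}
```

Returning the failed check names, rather than a single boolean, keeps the failure reason available for whatever ingests the trace later.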
Implementation
// assessment-engine.ts

export interface TestCorpus {
  id: string;
  input: string;
  expectedTools?: string[];
  maxSteps?: number;
  budgetLimit?: number;
}

export interface ExecutionTrace {
  steps: Array<{ tool: string; input: any; output: any; latencyMs: number }>;
  finalOutput: string;
  totalCost: number;
  timestamp: Date;
}

export interface AssessmentResult {
  passed: boolean;
  scores: Record<string, number>;
  failures: string[];
  trajectory: ExecutionTrace;
}

export class AssessmentEngine {
  private deterministicChecks: Array<(trace: ExecutionTrace) => boolean> = [];
  private llmJudges: Array<{ name: string; rubric: string }> = [];

  addDeterministicCheck(check: (trace: ExecutionTrace) => boolean) {
    this.deterministicChecks.push(check);
  }

  registerJudge(name: string, rubric: string) {
    this.llmJudges.push({ name, rubric });
  }

  async evaluate(corpus: TestCorpus, trace: ExecutionTrace): Promise<AssessmentResult> {
    const failures: string[] = [];
    const scores: Record<string, number> = {};

    // 1. Deterministic validation (fast, zero-cost)
    if (!this.validateJsonStructure(trace.finalOutput)) failures.push('Invalid JSON structure');
    if (!this.verifyToolSequence(corpus, trace)) failures.push('Unexpected tool invocation sequence');

    // `??` rather than `||` so an explicit limit of 0 is respected
    const costWithinBudget = trace.totalCost <= (corpus.budgetLimit ?? Infinity);
    if (!costWithinBudget) failures.push('Budget exceeded');

    // Run the checks registered through addDeterministicCheck
    this.deterministicChecks.forEach((check, i) => {
      if (!check(trace)) failures.push(`Custom deterministic check #${i + 1} failed`);
    });

    // 2. Trajectory analysis
    const stepCount = trace.steps.length;
    const totalLatency = trace.steps.reduce((sum, s) => sum + s.latencyMs, 0);
    // Guard against division by zero when the trace has no steps
    const avgLatency = stepCount > 0 ? totalLatency / stepCount : 0;
    scores.trajectoryEfficiency = stepCount <= (corpus.maxSteps ?? 5) ? 1.0 : 0.0;
    scores.latencyHealth = avgLatency < 2000 ? 1.0 : 0.0;

    // 3. LLM-as-judge (only if deterministic passes)
    if (failures.length === 0) {
      for (const judge of this.llmJudges) {
        scores[judge.name] = await this.invokeJudge(judge.name, judge.rubric, trace);
      }
    }

    return {
      passed: failures.length === 0 && Object.values(scores).every(s => s >= 0.8),
      scores,
      failures,
      trajectory: trace
    };
  }

  private validateJsonStructure(output: string): boolean {
    try {
      JSON.parse(output);
      return true;
    } catch {
      return false;
    }
  }

  // Checks that every expected tool was called; it does not enforce call order.
  private verifyToolSequence(corpus: TestCorpus, trace: ExecutionTrace): boolean {
    if (!corpus.expectedTools) return true;
    const usedTools = trace.steps.map(s => s.tool);
    return corpus.expectedTools.every(t => usedTools.includes(t));
  }

  private async invokeJudge(name: string, rubric: string, trace: ExecutionTrace): Promise<number> {
    // Production implementation: route to calibrated LLM judge
    // with pairwise comparison fallback for subjective criteria
    return 0.9; // Placeholder for actual LLM call
  }
}
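Note that `verifyToolSequence` in the listing above only confirms that each expected tool was called at some point. If call order matters for your agent, one option is a subsequence check. The helper below is a hypothetical addition, not part of the original engine, and it assumes that extra tool calls between the expected ones are acceptable.

```typescript
// Hypothetical helper: true when `expected` appears inside `used` in the same
// relative order. Other tool calls may be interleaved between the matches.
function isOrderedSubsequence(expected: string[], used: string[]): boolean {
  let next = 0; // index of the next expected tool we are still waiting for
  for (const tool of used) {
    if (next < expected.length && tool === expected[next]) next++;
  }
  return next === expected.length;
}

// Usage: the expected tools are called in order, with an extra call in between.
const inOrder = isOrderedSubsequence(['search', 'summarize'], ['search', 'fetch', 'summarize']);
// Same tools, but the expectation lists them in the opposite order.
const outOfOrder = isOrderedSubsequence(['summarize', 'search'], ['search', 'fetch', 'summarize']);
```

Here `inOrder` is `true` and `outOfOrder` is `false`, whereas a presence-only check would accept both.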
Feedback Loop Integration
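// feedback-loop.ts
import type { TestCorpus, ExecutionTrace } from './assessment-engine';

export class FailureIngestor {
  private offlineDataset: TestCorpus[] = [];

  async ingestProductionFailure(trace: ExecutionTrace, reason: string): Promise<void> {
    // Step inputs are untyped, so serialize anything that is not already a string
    const firstInput = trace.steps[0]?.input;
    const input =
      firstInput == null ? 'unknown'
      : typeof firstInput === 'string' ? firstInput
      : JSON.stringify(firstInput);

    const newCase: TestCorpus = {
      id: `prod-${Date.now()}-${Math.random().toString(36).slice(2)}`,
      input,
      expectedTools: trace.steps.map(s => s.tool),
      maxSteps: trace.steps.length + 2,   // allow two steps of headroom
      budgetLimit: trace.totalCost * 1.2  // allow 20% cost headroom
    };
    this.offlineDataset.push(newCase);
    console.log(`[FeedbackLoop] Ingested failure: ${reason} -> Dataset size: ${this.offlineDataset.length}`);
  }

  // Returns a copy so callers cannot mutate the internal dataset
  getDataset(): TestCorpus[] {
    return [...this.offlineDataset];
  }
}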
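The ingestor above logs the failure reason but does not store it, while the Failure Ingestion Pipeline decision calls for input, trajectory, failure reason, and timestamp to travel with each case. A minimal sketch of a record that keeps all four follows. `FailureRecord`, `TraceStep`, and `toFailureRecord` are hypothetical names, and passing `now` and `id` in as arguments is a choice made here to keep the function deterministic and easy to test.

```typescript
interface TraceStep {
  tool: string;
  input: unknown;
}

// Hypothetical record shape carrying the metadata named in the design decision.
interface FailureRecord {
  id: string;
  input: string;
  toolSequence: string[];
  failureReason: string;
  capturedAt: string; // ISO-8601 timestamp
}

function toFailureRecord(steps: TraceStep[], reason: string, now: Date, id: string): FailureRecord {
  const first = steps[0]?.input;
  // Serialize non-string inputs so the record always holds plain text.
  const input =
    first === undefined || first === null ? 'unknown'
    : typeof first === 'string' ? first
    : JSON.stringify(first);
  return {
    id,
    input,
    toolSequence: steps.map((s) => s.tool),
    failureReason: reason,
    capturedAt: now.toISOString(),
  };
}
```

With the reason stored on the record, failures can later be grouped by cause instead of being recovered from log lines.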
The architecture enforces three production realities: deterministic checks catch structural failures before any judge call is paid for, trajectory metrics expose inefficiencies that a correct final answer would hide, and the failure ingestor turns each production incident into a regression test. The structure does not depend on where the judge runs, so `invokeJudge` can route through LangSmith, Braintrust, Langfuse, Arize, or a custom inference service.
Pitfall Guide
1. Synthetic Seeding Overload
Explanation: Teams generate entire datasets using LLMs, creating test cases that reflect model assumptions rather than actual user behavior. Synthetic data lacks malformed inputs, ambiguous phrasing, and adversarial patterns.
Fix: Seed datasets exclusively from production traces. Use synthetic generation only to expand existing real cases, never to create initial test coverage.
2. Aggregate Score Illusion
Explanation: Reporting a single pass rate (e.g., "87% success") masks failure distribution. High-stakes cases often fail while low-risk ones pass, making the aggregate metric dangerously misleading.
Fix: Decompose metrics by category, input type, and tool sequence. Track failure clusters over time and surface exact failing examples in CI/CD reports.
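As a small illustration of decomposing a pass rate, the sketch below groups results by a category label. `CaseResult` and `passRateByCategory` are hypothetical names, and the `faq` and `refund` categories are invented for the example.

```typescript
interface CaseResult {
  category: string;
  passed: boolean;
}

// Computes the pass rate for each category instead of one aggregate number.
function passRateByCategory(results: CaseResult[]): Map<string, number> {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const t = totals.get(r.category) ?? { passed: 0, total: 0 };
    t.total += 1;
    if (r.passed) t.passed += 1;
    totals.set(r.category, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, category) => rates.set(category, t.passed / t.total));
  return rates;
}

// Eight low-risk cases all pass; one of two high-stakes cases fails.
const sampleResults: CaseResult[] = [];
for (let i = 0; i < 8; i++) sampleResults.push({ category: 'faq', passed: true });
sampleResults.push({ category: 'refund', passed: true });
sampleResults.push({ category: 'refund', passed: false });

const rates = passRateByCategory(sampleResults);
```

The aggregate here is 9 of 10, or 90%, yet `rates.get('refund')` is `0.5`: half of the high-stakes cases fail behind a healthy-looking headline number.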
3. Trajectory Blindness
Explanation: Evaluating only the final answer ignores execution path. Agents can return correct outputs while calling the same tool repeatedly, exceeding cost thresholds, or violating safety boundaries mid-execution.
Fix: Implement mandatory trajectory scoring. Track step counts, tool invocation sequences, per-step latency, and cumulative cost. Fail builds when trajectory metrics degrade, even if output correctness remains stable.
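One trajectory signal that output-only evaluation misses is an agent calling the same tool with the same input several times in a row. A minimal sketch of detecting that follows. `longestRepeatRun` is a hypothetical helper, and treating two calls as identical when their tool name and JSON-serialized input match is an assumption of this example.

```typescript
interface Step {
  tool: string;
  input: unknown;
}

// Length of the longest run of consecutive identical calls
// (same tool name and same JSON-serialized input).
function longestRepeatRun(steps: Step[]): number {
  let longest = 0;
  let current = 0;
  let previousKey: string | null = null;
  for (const step of steps) {
    const key = `${step.tool}:${JSON.stringify(step.input)}`;
    current = key === previousKey ? current + 1 : 1;
    if (current > longest) longest = current;
    previousKey = key;
  }
  return longest;
}

const loopingTrace: Step[] = [
  { tool: 'search', input: { q: 'refund policy' } },
  { tool: 'search', input: { q: 'refund policy' } },
  { tool: 'search', input: { q: 'refund policy' } },
  { tool: 'summarize', input: { docId: 1 } },
];
const repeatRun = longestRepeatRun(loopingTrace);
```

For `loopingTrace`, `repeatRun` is `3`. A build gate could fail when the value exceeds a limit you choose, even if the final answer is correct.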
4. LLM Judge Calibration Gap
Explanation: LLM-as-judge systems drift over time due to model updates, prompt sensitivity, and lack of ground truth alignment. Uncalibrated judges produce inconsistent scores that break regression tracking.
Fix: Run periodic calibration sessions using human-reviewed subsets. Implement pairwise comparison instead of absolute scoring for subjective criteria. Log judge variance and alert when score distribution shifts beyond acceptable thresholds.
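A calibration session can be reduced to a simple measurement: how often the judge's pairwise preference matches a human reviewer's on the same pair. The sketch below assumes labels are collected as `'A'` or `'B'` preferences. `PairwiseLabel`, `agreementRate`, `needsRecalibration`, and the 0.8 default threshold are illustrative choices, not values from the text.

```typescript
type Preference = 'A' | 'B';

interface PairwiseLabel {
  judge: Preference; // which response the LLM judge preferred
  human: Preference; // which response the human reviewer preferred
}

// Fraction of pairs where the judge and the human picked the same response.
function agreementRate(labels: PairwiseLabel[]): number {
  if (labels.length === 0) throw new Error('No calibration labels provided');
  const agreed = labels.filter((l) => l.judge === l.human).length;
  return agreed / labels.length;
}

// Flags the judge for recalibration when agreement drops below the threshold.
function needsRecalibration(labels: PairwiseLabel[], minAgreement = 0.8): boolean {
  return agreementRate(labels) < minAgreement;
}

const calibrationSet: PairwiseLabel[] = [
  { judge: 'A', human: 'A' },
  { judge: 'B', human: 'B' },
  { judge: 'A', human: 'B' },
  { judge: 'B', human: 'B' },
];
const flagged = needsRecalibration(calibrationSet);
```

Three of the four pairs agree, so the rate is `0.75` and `flagged` is `true` under the assumed threshold. Logging this rate at each session gives a series you can alert on when it shifts.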
5. Dashboard-Only Alerting
Explanation: Routing evaluation results to internal dashboards that engineers rarely monitor delays incident response. Silent degradation continues until user complaints surface.
Fix: Wire evaluation failures directly to Slack, PagerDuty, or incident management systems. Configure threshold-based alerts for trajectory anomalies, cost spikes, and safety violations. Treat eval failures as production incidents.
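The threshold logic can be kept separate from delivery. The sketch below only decides which alerts to raise for a window of metrics. Sending them to Slack or PagerDuty is left to your own integration code, and every name and threshold here (`WindowMetrics`, `AlertThresholds`, `buildAlerts`, the `page` and `notify` severities) is hypothetical.

```typescript
interface WindowMetrics {
  avgCostUsd: number;
  avgSteps: number;
  safetyViolations: number;
}

interface AlertThresholds {
  maxAvgCostUsd: number;
  maxAvgSteps: number;
}

interface Alert {
  severity: 'page' | 'notify';
  message: string;
}

// Maps one window of evaluation metrics to the alerts it should raise.
function buildAlerts(m: WindowMetrics, t: AlertThresholds): Alert[] {
  const alerts: Alert[] = [];
  // Assumption: any safety violation pages someone immediately.
  if (m.safetyViolations > 0) {
    alerts.push({ severity: 'page', message: `${m.safetyViolations} safety violation(s) in window` });
  }
  if (m.avgCostUsd > t.maxAvgCostUsd) {
    alerts.push({ severity: 'notify', message: `Average cost ${m.avgCostUsd} exceeds ${t.maxAvgCostUsd}` });
  }
  if (m.avgSteps > t.maxAvgSteps) {
    alerts.push({ severity: 'notify', message: `Average steps ${m.avgSteps} exceeds ${t.maxAvgSteps}` });
  }
  return alerts;
}
```

Keeping this function pure means the alerting rules can be unit-tested in CI without touching any messaging service.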
6. Static Dataset Stagnation
Explanation: Datasets are built once and never updated. As user behavior evolves and new features ship, test coverage becomes obsolete, creating false confidence in regression tests.
Fix: Implement automated dataset rotation. Schedule weekly ingestion of production traces, retire low-value test cases, and tag cases by risk category. Treat the dataset as a living artifact, not a static file.
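A rotation pass might look like the sketch below. It uses case age as a stand-in for "low value", which is an assumption; a real pipeline would probably also weigh risk category and how often a case has caught regressions. `DatasetCase` and `rotateDataset` are hypothetical names.

```typescript
interface DatasetCase {
  id: string;
  addedAt: Date;
  riskCategory: string;
}

// Drops cases older than the retention window, then keeps the newest `maxCases`.
function rotateDataset(
  cases: DatasetCase[],
  now: Date,
  retentionDays: number,
  maxCases: number,
): DatasetCase[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return cases
    .filter((c) => c.addedAt.getTime() >= cutoff) // filter copies, so the input is not mutated
    .sort((a, b) => b.addedAt.getTime() - a.addedAt.getTime())
    .slice(0, maxCases);
}
```

Passing `now` as an argument keeps the function deterministic, so the rotation rule itself can be covered by a regression test.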
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Early-stage prototype | Synthetic expansion + output metrics | Fast iteration, low traffic volume | Minimal |
| High-traffic production | Production traces + trajectory metrics | Detects cost anomalies and tool misuse | Moderate (LLM judge routing) |
| Cost-sensitive deployment | Deterministic-first + sampling strategy | Reduces unnecessary LLM calls | Low |
| Safety-critical system | Human review calibration + pairwise comparison | Ensures consistent subjective scoring | High (human review overhead) |
Configuration Template
# eval-config.yaml
evaluation:
  modes:
    offline:
      dataset_source: "production_traces"
      max_cases: 500
      regression_threshold: 0.85
      ci_integration: true
    online:
      sampling_rate: 0.15
      reference_free_evaluators:
        - groundedness
        - format_validity
        - safety_compliance
        - tool_call_sanity
      alert_channels:
        - slack
        - pagerduty
  scoring:
    deterministic_priority: true
    trajectory_tracking: true
    llm_judge_calibration:
      frequency: "monthly"
      human_review_subset: 0.1
      pairwise_fallback: true
  feedback_loop:
    auto_ingest_failures: true
    retention_days: 90
    category_tagging: true
Quick Start Guide
- Extract Production Traces: Query your logging system for the last 30 days of agent interactions. Filter for completed executions and export inputs, tool sequences, and outputs.
- Initialize Deterministic Checks: Implement JSON schema validation, tool sequence verification, and budget threshold checks. Run these against your trace dataset to establish baseline pass rates.
- Configure Online Monitoring: Deploy reference-free evaluators to sample 10-15% of live traffic. Route failures to your incident management system with trajectory metadata attached.
- Activate Feedback Ingestion: Set up an automated pipeline that captures online failures, extracts inputs and execution paths, and appends them to your offline dataset with category tags.
- Validate Closed Loop: Trigger a controlled regression in your agent. Verify that the offline benchmark catches it, the online monitor detects the degradation, and the failure is automatically added to the next dataset cycle.
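For the online monitoring step, one way to sample a fixed share of traffic is to hash a stable trace identifier, so the same trace is always either in or out of the sample. The sketch below uses an FNV-1a-style hash over the string's UTF-16 code units. `shouldSample`, `hashTraceId`, and the idea of keying on a trace ID are assumptions for illustration.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units; returns an unsigned integer.
function hashTraceId(id: string): number {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV prime, 32-bit multiply
  }
  return h >>> 0;
}

// Deterministic sampling: maps the hash into [0, 1) and compares it to the rate.
function shouldSample(traceId: string, rate: number): boolean {
  return hashTraceId(traceId) / 0x100000000 < rate;
}

// Usage: evaluate roughly 15% of live traces, matching sampling_rate: 0.15.
const sampled = shouldSample('trace-12345', 0.15);
```

Compared with calling `Math.random()` per request, hashing makes the decision reproducible, which helps when you need to explain why a given trace was or was not evaluated.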