The complete process for evaluating production AI agents (datasets, evaluators, offline + online)

By Codcompass Team·2026-05-22·8 min read

Architecting Reliable AI Agents: A Closed-Loop Evaluation Framework

Current Situation Analysis

The transition from controlled demonstration to live production is where most AI agent projects fail. Teams typically validate agents against clean, predictable inputs during development, ship the system, and immediately encounter unpredictable failures under real traffic. The root cause is rarely model capability; it is an evaluation gap. Organizations treat assessment as a pre-launch gate rather than a continuous engineering discipline, leaving them blind to how agents actually behave when exposed to malformed queries, adversarial prompts, or unexpected tool interactions.

This problem persists because evaluation is frequently misunderstood as a scoring exercise rather than a feedback mechanism. Many teams rely on synthetic datasets generated by LLMs, which inherently reflect model biases rather than actual user behavior. Others measure only final outputs, completely ignoring the execution path the agent takes to reach those outputs. The result is silent degradation: agents continue to return plausible answers while silently inflating costs, looping on tool calls, or drifting from safety boundaries. Without a structured loop that captures production failures and feeds them back into development datasets, every incident remains an isolated event rather than a learning signal.

Industry audits consistently show that agents evaluated solely on output correctness miss 60-80% of production incidents related to trajectory inefficiency, tool misuse, and cost anomalies. Production traces consistently outperform synthetic or manually imagined datasets in predicting real-world failure modes. The engineering reality is clear: reliability scales only when evaluation operates in two synchronized modes—offline benchmarking before deployment and online monitoring during operation—connected by a closed feedback loop that converts live failures into permanent test cases.

WOW Moment: Key Findings

The most critical insight from production agent audits is that evaluation strategy directly dictates operational visibility. Teams that shift from synthetic seeding to production-trace datasets, and from output-only scoring to trajectory-aware metrics, consistently detect regressions weeks before they impact user experience or infrastructure costs.

Approach	Regression Detection Rate	Cost Anomaly Visibility	Real-World Failure Coverage
Synthetic Dataset + Output-Only Metrics	34%	12%	28%
Production Traces + Trajectory-Aware Metrics	89%	91%	87%

This finding matters because it transforms evaluation from a retrospective reporting tool into a proactive engineering control. When you measure the execution path alongside the final answer, you expose hidden inefficiencies like redundant tool invocations, unnecessary reasoning steps, and cost inflation. When your dataset reflects actual user inputs rather than imagined scenarios, your regression tests catch the exact failure modes that will hit production. The combination enables continuous reliability improvement instead of reactive incident management.

Core Solution

Building a production-grade evaluation system requires separating assessment into deterministic checks, model-assisted scoring, and trajectory analysis, then wiring them into a continuous pipeline. The architecture below demonstrates a vendor-agnostic TypeScript implementation that enforces the closed-loop pattern.

Architecture Decisions

Deterministic First, LLM Second: Always run code-based validators before invoking an LLM judge. Regex, JSON schema validation, and tool-call verification are faster, cheaper, and more consistent. Reserve LLM scoring for subjective criteria like groundedness or tone.
Trajectory Tracking by Default: Capture step counts, tool invocation sequences, latency per step, and cumulative cost. Output correctness alone is insufficient for production safety.
Reference-Free Online Scoring: Production traffic lacks ground truth. Use reference-free evaluators that validate format compliance, safety boundaries, and execution sanity without requiring a perfect answer.
Failure Ingestion Pipeline: Automatically route online failures into the offline dataset with metadata (input, trajectory, failure reason, timestamp) to ensure the next version is tested against real degradation patterns.

Implementation

// assessment-engine.ts

export interface TestCorpus {
  id: string;
  input: string;
  expectedTools?: string[];
  maxSteps?: number;
  budgetLimit?: number;
}

export interface ExecutionTrace {
  steps: Array<{ tool: string; input: any; output: any; latencyMs: number }>;
  finalOutput: string;
  totalCost: number;
  timestamp: Date;
}

export interface AssessmentResult {
  passed: boolean;
  scores: Record<string, number>;
  failures: string[];
  trajectory: ExecutionTrace;
}

export class AssessmentEngine {
  private deterministicChecks: Array<(trace: ExecutionTrace) => boolean>;
  private llmJudges: Array<{ name: string; rubric: string }>;

  constructor() {
    this.deterministicChecks = [];
    this.llmJudges = [];
  }

  addDeterministicCheck(check: (trace: ExecutionTrace) => boolean) {
    this.deterministicChecks.push(check);
  }

  registerJudge(name: string, rubric: string) {
    this.llmJudges.push({ name, rubric });
  }

  async evaluate(corpus: TestCorpus, trace: ExecutionTrace): Promise<AssessmentResult> {
    const failures: string[] = [];
    const scores: Record<string, number> = {};

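    // 1. Deterministic validation (fast, zero-cost)
    const jsonValid = this.validateJsonStructure(trace.finalOutput);
    if (!jsonValid) failures.push('Invalid JSON structure');

    // Note: this verifies that every expected tool was called, not the call order
    const toolSequenceValid = this.verifyToolSequence(corpus, trace);
    if (!toolSequenceValid) failures.push('Unexpected tool invocation sequence');

    // ?? (not ||) so that an explicit budget of 0 is respected
    const costWithinBudget = trace.totalCost <= (corpus.budgetLimit ?? Infinity);
    if (!costWithinBudget) failures.push('Budget exceeded');

    // Run every check registered through addDeterministicCheck
    this.deterministicChecks.forEach((check, i) => {
      if (!check(trace)) failures.push(`Custom deterministic check #${i + 1} failed`);
    });

    // 2. Trajectory analysis
    const stepCount = trace.steps.length;
    // Guard against division by zero when the agent made no tool calls
    const avgLatency = stepCount === 0
      ? 0
      : trace.steps.reduce((sum, s) => sum + s.latencyMs, 0) / stepCount;
    scores.trajectoryEfficiency = stepCount <= (corpus.maxSteps ?? 5) ? 1.0 : 0.0;
    scores.latencyHealth = avgLatency < 2000 ? 1.0 : 0.0;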

    // 3. LLM-as-judge (only if deterministic passes)
    if (failures.length === 0) {
      for (const judge of this.llmJudges) {
        const score = await this.invokeJudge(judge.name, judge.rubric, trace);
        scores[judge.name] = score;
      }
    }

    return {
      passed: failures.length === 0 && Object.values(scores).every(s => s >= 0.8),
      scores,
      failures,
      trajectory: trace
    };
  }

  private validateJsonStructure(output: string): boolean {
    try {
      JSON.parse(output);
      return true;
    } catch {
      return false;
    }
  }

  private verifyToolSequence(corpus: TestCorpus, trace: ExecutionTrace): boolean {
    if (!corpus.expectedTools) return true;
    const usedTools = trace.steps.map(s => s.tool);
    return corpus.expectedTools.every(t => usedTools.includes(t));
  }

  private async invokeJudge(name: string, rubric: string, trace: ExecutionTrace): Promise<number> {
    // Production implementation: route to calibrated LLM judge
    // with pairwise comparison fallback for subjective criteria
    return 0.9; // Placeholder for actual LLM call
  }
}
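
The `verifyToolSequence` method above only confirms that each expected tool appears somewhere in the trace. If call order matters for a given test case, a stricter check could be used instead. The sketch below is one possible approach, not part of the original engine; `isOrderedSubsequence` is a hypothetical helper name.

```typescript
// Hypothetical helper: returns true when every tool in `expected` appears in
// `used` in the same relative order. Other calls may be interleaved between them.
function isOrderedSubsequence(expected: string[], used: string[]): boolean {
  let next = 0; // index of the next expected tool we are still waiting for
  for (const tool of used) {
    if (next < expected.length && tool === expected[next]) {
      next++;
    }
  }
  return next === expected.length;
}
```

An empty `expected` list passes trivially, which matches the engine's behaviour when `expectedTools` is not set.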

Feedback Loop Integration

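// feedback-loop.ts
import type { TestCorpus, ExecutionTrace } from './assessment-engine';

export class FailureIngestor {
  private offlineDataset: TestCorpus[] = [];

  async ingestProductionFailure(trace: ExecutionTrace, reason: string): Promise<void> {
    // ExecutionTrace does not carry the original user query, so the first
    // step's input serves as a proxy; non-string inputs are serialized.
    const rawInput = trace.steps[0]?.input;
    const input = rawInput == null ? 'unknown'
      : typeof rawInput === 'string' ? rawInput : JSON.stringify(rawInput);

    const newCase: TestCorpus = {
      id: `prod-${Date.now()}-${Math.random().toString(36).slice(2)}`,
      input,
      expectedTools: trace.steps.map(s => s.tool),
      maxSteps: trace.steps.length + 2,
      budgetLimit: trace.totalCost * 1.2
    };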

    this.offlineDataset.push(newCase);
    console.log(`[FeedbackLoop] Ingested failure: ${reason} -> Dataset size: ${this.offlineDataset.length}`);
  }

  getDataset(): TestCorpus[] {
    return [...this.offlineDataset];
  }
}

The architecture enforces three production realities: deterministic checks catch unambiguous failures before any judge tokens are spent, trajectory metrics expose hidden inefficiencies, and the failure ingestor turns each ingested production incident into a permanent regression test. Because the judge call is isolated behind invokeJudge, the structure stays the same whether you route LLM judges through LangSmith, Braintrust, Langfuse, Arize, or a custom inference service.

Pitfall Guide

1. Synthetic Seeding Overload

Explanation: Teams generate entire datasets using LLMs, creating test cases that reflect model assumptions rather than actual user behavior. Synthetic data lacks malformed inputs, ambiguous phrasing, and adversarial patterns. Fix: Seed datasets exclusively from production traces. Use synthetic generation only to expand existing real cases, never to create initial test coverage.

2. Aggregate Score Illusion

Explanation: Reporting a single pass rate (e.g., "87% success") masks failure distribution. High-stakes cases often fail while low-risk ones pass, making the aggregate metric dangerously misleading. Fix: Decompose metrics by category, input type, and tool sequence. Track failure clusters over time and surface exact failing examples in CI/CD reports.
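
To make the decomposition concrete, here is a minimal sketch that breaks a run's results down by category and keeps the failing case IDs for the CI report. The `CaseOutcome` shape and its `category` tag are assumptions for illustration; the original `TestCorpus` interface has no category field.

```typescript
// Assumed shape: one record per evaluated test case.
interface CaseOutcome {
  caseId: string;
  category: string; // hypothetical tag such as "faq" or "refund"
  passed: boolean;
}

interface CategoryReport {
  total: number;
  passed: number;
  passRate: number;
  failingCaseIds: string[];
}

// Groups outcomes by category so a weak category cannot hide behind a
// healthy aggregate pass rate.
function passRateByCategory(outcomes: CaseOutcome[]): Map<string, CategoryReport> {
  const report = new Map<string, CategoryReport>();
  for (const outcome of outcomes) {
    const entry: CategoryReport = report.get(outcome.category) ??
      { total: 0, passed: 0, passRate: 0, failingCaseIds: [] };
    entry.total += 1;
    if (outcome.passed) {
      entry.passed += 1;
    } else {
      entry.failingCaseIds.push(outcome.caseId);
    }
    entry.passRate = entry.passed / entry.total;
    report.set(outcome.category, entry);
  }
  return report;
}
```

With three passing "faq" cases and one failing "refund" case, the aggregate is 75% while the refund category sits at 0%, which is exactly the distribution a single number hides.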

3. Trajectory Blindness

Explanation: Evaluating only the final answer ignores execution path. Agents can return correct outputs while calling the same tool repeatedly, exceeding cost thresholds, or violating safety boundaries mid-execution. Fix: Implement mandatory trajectory scoring. Track step counts, tool invocation sequences, per-step latency, and cumulative cost. Fail builds when trajectory metrics degrade, even if output correctness remains stable.
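
One trajectory signal that is cheap to compute is repeated identical tool calls, a common symptom of looping. The following is a rough sketch under the assumption that step inputs are JSON-serializable; `findRepeatedCalls` and the `maxRepeats` default are illustrative choices, not values from the original implementation.

```typescript
interface Step {
  tool: string;
  input: unknown; // assumed to be JSON-serializable
}

// Returns the "tool:input" keys that occur more than `maxRepeats` times.
// An empty result means no suspicious repetition was found.
function findRepeatedCalls(steps: Step[], maxRepeats = 1): string[] {
  const counts = new Map<string, number>();
  for (const step of steps) {
    const key = `${step.tool}:${JSON.stringify(step.input)}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const repeated: string[] = [];
  counts.forEach((count, key) => {
    if (count > maxRepeats) repeated.push(key);
  });
  return repeated;
}
```

A check like this could be registered through `addDeterministicCheck` by wrapping it in a function that returns `true` when the result is empty.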

4. LLM Judge Calibration Gap

Explanation: LLM-as-judge systems drift over time due to model updates, prompt sensitivity, and lack of ground truth alignment. Uncalibrated judges produce inconsistent scores that break regression tracking. Fix: Run periodic calibration sessions using human-reviewed subsets. Implement pairwise comparison instead of absolute scoring for subjective criteria. Log judge variance and alert when score distribution shifts beyond acceptable thresholds.
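
A calibration session can be reduced to a simple agreement rate between the judge and human reviewers on the same samples. The sketch below assumes each reviewed sample has a numeric judge score and a human pass/fail label; the 0.8 default mirrors the pass threshold used in the engine above, and `judgeAgreement` is a hypothetical name.

```typescript
// Assumed shape: one human-reviewed sample from the calibration subset.
interface CalibrationSample {
  judgeScore: number;   // score in [0, 1] produced by the LLM judge
  humanPassed: boolean; // verdict from the human reviewer
}

// Fraction of samples where the judge's pass/fail verdict matches the human's.
// Returns 0 for an empty set, i.e. the judge is treated as uncalibrated.
function judgeAgreement(samples: CalibrationSample[], passThreshold = 0.8): number {
  if (samples.length === 0) return 0;
  const agreeing = samples.filter(
    (s) => (s.judgeScore >= passThreshold) === s.humanPassed
  ).length;
  return agreeing / samples.length;
}
```

Tracking this number across calibration runs gives a concrete value to alert on when the judge drifts.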

5. Dashboard-Only Alerting

Explanation: Routing evaluation results to internal dashboards that engineers rarely monitor delays incident response. Silent degradation continues until user complaints surface. Fix: Wire evaluation failures directly to Slack, PagerDuty, or incident management systems. Configure threshold-based alerts for trajectory anomalies, cost spikes, and safety violations. Treat eval failures as production incidents.
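
Before wiring anything to a paging system, it helps to decide which failures page someone and which only post a notification. This is a minimal sketch of such a classification; the `EvalSignal` fields and the "twice the budget" paging threshold are assumptions for illustration, and the actual delivery to Slack or PagerDuty is left to whatever integration the team already uses.

```typescript
type Severity = "page" | "notify" | "none";

// Assumed summary of one evaluated production trace.
interface EvalSignal {
  safetyViolation: boolean;
  costUsd: number;
  budgetUsd: number;
  stepCount: number;
  maxSteps: number;
}

// Maps an evaluation result to an alert severity.
// Thresholds here are illustrative and should be tuned per deployment.
function classifySeverity(signal: EvalSignal): Severity {
  if (signal.safetyViolation) return "page";
  if (signal.costUsd > signal.budgetUsd * 2) return "page";
  if (signal.costUsd > signal.budgetUsd || signal.stepCount > signal.maxSteps) {
    return "notify";
  }
  return "none";
}
```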

6. Static Dataset Stagnation

Explanation: Datasets are built once and never updated. As user behavior evolves and new features ship, test coverage becomes obsolete, creating false confidence in regression tests. Fix: Implement automated dataset rotation. Schedule weekly ingestion of production traces, retire low-value test cases, and tag cases by risk category. Treat the dataset as a living artifact, not a static file.
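
Dataset rotation can be expressed as a small pure function. The sketch below retires low-risk cases older than a retention window and caps the dataset size, keeping the newest cases first. The `DatedCase` shape, the two-level `risk` tag, and the rule that high-risk cases are exempt from age-based retirement are all assumptions, not part of the original design.

```typescript
// Assumed shape: a test case with an ingestion date and a risk tag.
interface DatedCase {
  id: string;
  addedAt: Date;
  risk: "high" | "low";
}

// Retires low-risk cases older than `retentionDays`, then keeps at most
// `maxCases` of the newest remaining cases. The input array is not mutated.
function rotateDataset(
  cases: DatedCase[],
  now: Date,
  retentionDays: number,
  maxCases: number
): DatedCase[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  const kept = cases.filter(
    (c) => c.risk === "high" || c.addedAt.getTime() >= cutoff
  );
  // filter() returned a fresh array, so sorting in place is safe here.
  kept.sort((a, b) => b.addedAt.getTime() - a.addedAt.getTime());
  return kept.slice(0, maxCases);
}
```

Note that the size cap applies after sorting by recency, so a very small `maxCases` can still drop old high-risk cases; a real pipeline might reserve slots for them.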

Production Bundle

Action Checklist

Source initial dataset from production traces, not synthetic generation
Implement deterministic validators before any LLM-based scoring
Track trajectory metrics alongside output correctness
Configure reference-free evaluators for online monitoring
Wire evaluation failures to incident alerting channels
Decompose aggregate scores by category and track over time
Automate failure ingestion into offline dataset
Schedule monthly LLM judge calibration sessions

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Early-stage prototype	Synthetic expansion + output metrics	Fast iteration, low traffic volume	Minimal
High-traffic production	Production traces + trajectory metrics	Detects cost anomalies and tool misuse	Moderate (LLM judge routing)
Cost-sensitive deployment	Deterministic-first + sampling strategy	Reduces unnecessary LLM calls	Low
Safety-critical system	Human review calibration + pairwise comparison	Ensures consistent subjective scoring	High (human review overhead)

Configuration Template

# eval-config.yaml
evaluation:
  modes:
    offline:
      dataset_source: "production_traces"
      max_cases: 500
      regression_threshold: 0.85
      ci_integration: true
    online:
      sampling_rate: 0.15
      reference_free_evaluators:
        - groundedness
        - format_validity
        - safety_compliance
        - tool_call_sanity
      alert_channels:
        - slack
        - pagerduty
  scoring:
    deterministic_priority: true
    trajectory_tracking: true
    llm_judge_calibration:
      frequency: "monthly"
      human_review_subset: 0.1
      pairwise_fallback: true
  feedback_loop:
    auto_ingest_failures: true
    retention_days: 90
    category_tagging: true

Quick Start Guide

Extract Production Traces: Query your logging system for the last 30 days of agent interactions. Filter for completed executions and export inputs, tool sequences, and outputs.
Initialize Deterministic Checks: Implement JSON schema validation, tool sequence verification, and budget threshold checks. Run these against your trace dataset to establish baseline pass rates.
Configure Online Monitoring: Deploy reference-free evaluators to sample 10-15% of live traffic. Route failures to your incident management system with trajectory metadata attached.
Activate Feedback Ingestion: Set up an automated pipeline that captures online failures, extracts inputs and execution paths, and appends them to your offline dataset with category tags.
Validate Closed Loop: Trigger a controlled regression in your agent. Verify that the offline benchmark catches it, the online monitor detects the degradation, and the failure is automatically added to the next dataset cycle.
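
The online monitoring step above relies on evaluators that need no ground-truth answer. As a rough illustration, the sketch below checks a live trace for an empty response, calls to tools outside an allow-list, and a cost ceiling. The `LiveTrace` shape, the allow-list, and the failure labels are assumptions; subjective checks such as groundedness would still need a judge model and are not covered here.

```typescript
// Assumed minimal view of a production trace.
interface LiveTrace {
  finalOutput: string;
  steps: { tool: string }[];
  totalCost: number;
}

// Reference-free sanity checks: none of these require a known correct answer.
// Returns a list of failure labels; an empty list means the trace looks sane.
function referenceFreeCheck(
  trace: LiveTrace,
  allowedTools: Set<string>,
  costCeiling: number
): string[] {
  const failures: string[] = [];
  if (trace.finalOutput.trim().length === 0) {
    failures.push("empty_output");
  }
  const unknownTools = trace.steps
    .filter((s) => !allowedTools.has(s.tool))
    .map((s) => s.tool);
  if (unknownTools.length > 0) {
    failures.push(`unknown_tools:${unknownTools.join(",")}`);
  }
  if (trace.totalCost > costCeiling) {
    failures.push("cost_ceiling_exceeded");
  }
  return failures;
}
```

Any non-empty result is a candidate for the feedback ingestion step, with the failure labels doubling as category tags.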
