ows. The architecture follows three layers: evaluation harness, observability aggregation, and traffic control.
Step-by-Step Implementation
- Define Product Success Metrics: Replace accuracy-only scoring with multi-dimensional metrics (task completion rate, cost per request, P95 latency, safety/compliance score, and a user satisfaction proxy such as click-through, retry rate, or explicit feedback).
- Construct an Evaluation Harness: Build a deterministic runner that executes model calls against versioned datasets, captures outputs, computes metrics, and enforces thresholds. The harness must support synthetic data generation, real production sampling, and human-in-the-loop calibration.
- Implement Continuous Evaluation Pipeline: Integrate the harness into CI/CD. Run validation on every model version, prompt template change, or infrastructure update. Store results in a time-series database for trend analysis and regression detection.
- Deploy Shadow Testing & A/B Routing: Mirror a percentage of production traffic to candidate models without exposing their outputs to users. Compare metrics against the baseline. Promote only when thresholds are met across all dimensions.
- Integrate Observability & Feedback Loops: Stream evaluation results to monitoring dashboards. Capture user interactions, retry patterns, and explicit feedback. Feed high-failure segments back into the validation dataset for continuous refresh.
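As a rough illustration of the first step, the sketch below aggregates per-request records into the multi-dimensional metrics listed above. The `RequestLog` shape and its field names are assumptions made for this example, and the P95 uses the nearest-rank method, which is only one of several percentile definitions.

```typescript
// Hypothetical per-request record; field names are illustrative assumptions.
interface RequestLog {
  completed: boolean; // did the task finish successfully?
  retried: boolean;   // did the user retry the request?
  latencyMs: number;
  costCents: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Collapse raw logs into the product-level metrics used for gating.
function summarize(logs: RequestLog[]) {
  const n = logs.length || 1; // avoid division by zero on an empty window
  return {
    taskCompletionRate: logs.filter(l => l.completed).length / n,
    retryRate: logs.filter(l => l.retried).length / n,
    p95LatencyMs: percentile(logs.map(l => l.latencyMs), 95),
    costPerRequestCents: logs.reduce((sum, l) => sum + l.costCents, 0) / n,
  };
}
```

Reporting P95 rather than the mean keeps a small number of slow requests visible instead of averaging them away.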
TypeScript Evaluation Runner
import { performance } from 'perf_hooks';

// Outcome of evaluating one task against one model version.
interface EvalResult {
  taskId: string;
  modelVersion: string;
  latencyMs: number;
  costCents: number;
  taskCompletionScore: number;
  safetyScore: number;
  passed: boolean;
}

// Thresholds a single evaluation must satisfy to pass.
interface EvalConfig {
  maxLatencyMs: number;
  maxCostCents: number;
  minTaskCompletion: number;
  minSafetyScore: number;
}

class AIEvaluationRunner {
  private config: EvalConfig;
  private results: EvalResult[] = [];

  constructor(config: EvalConfig) {
    this.config = config;
  }

  async runEvaluation(
    taskId: string,
    modelVersion: string,
    prompt: string,
    modelCall: () => Promise<{ output: string; tokens: number }>
  ): Promise<EvalResult> {
    // Measure wall-clock latency around the model call only.
    const start = performance.now();
    const { output, tokens } = await modelCall();
    const latencyMs = performance.now() - start;

    // Simulated cost calculation (adjust per provider pricing)
    const costCents = this.estimateCost(tokens);

    // Metric scoring (replace with actual evaluators)
    const taskCompletionScore = this.scoreTaskCompletion(prompt, output);
    const safetyScore = this.scoreSafety(output);

    // A run passes only if every dimension meets its threshold.
    const passed =
      latencyMs <= this.config.maxLatencyMs &&
      costCents <= this.config.maxCostCents &&
      taskCompletionScore >= this.config.minTaskCompletion &&
      safetyScore >= this.config.minSafetyScore;

    const result: EvalResult = {
      taskId,
      modelVersion,
      latencyMs: Math.round(latencyMs),
      costCents: Math.round(costCents * 100) / 100,
      taskCompletionScore,
      safetyScore,
      passed
    };
    this.results.push(result);
    return result;
  }

  private estimateCost(tokens: number): number {
    // Placeholder rates in USD per token.
    const inputRate = 0.000005;
    const outputRate = 0.000015;
    // The single token count is charged at both rates (an upper-bound estimate).
    // Multiply by 100 to convert USD to cents so the value is comparable to maxCostCents.
    return (tokens * inputRate + tokens * outputRate) * 100;
  }

  private scoreTaskCompletion(_prompt: string, _output: string): number {
    // Placeholder: returns a random score in [70, 99], so results are not reproducible.
    // Replace with LLM-as-judge, rule-based parser, or embedding similarity.
    return 70 + Math.floor(Math.random() * 30);
  }

  private scoreSafety(output: string): number {
    // Placeholder heuristic. Replace with content filter, regex, or safety model.
    return output.length > 0 && !output.includes('injection') ? 95 : 40;
  }

  getAggregatedMetrics(): { avgLatency: number; avgCost: number; passRate: number } {
    // Fall back to 1 so an empty result set yields zeros instead of NaN.
    const total = this.results.length || 1;
    return {
      avgLatency: this.results.reduce((a, b) => a + b.latencyMs, 0) / total,
      avgCost: this.results.reduce((a, b) => a + b.costCents, 0) / total,
      passRate: this.results.filter(r => r.passed).length / total
    };
  }
}

// Usage example
const runner = new AIEvaluationRunner({
  maxLatencyMs: 800,
  maxCostCents: 0.45,
  minTaskCompletion: 80,
  minSafetyScore: 85
});

// Execute against dataset
const dataset = [
  { id: 't1', prompt: 'Summarize the contract clause regarding termination.' },
  { id: 't2', prompt: 'Extract billing dates from the invoice.' }
];

// Wrapped in an async function so it also runs where top-level await is unavailable.
async function main(): Promise<void> {
  for (const item of dataset) {
    await runner.runEvaluation(
      item.id,
      'v2.1-stable',
      item.prompt,
      // Stubbed model call; replace with a real provider client.
      async () => ({ output: 'Termination requires 30 days notice.', tokens: 142 })
    );
  }
  console.log(runner.getAggregatedMetrics());
}

main().catch(console.error);
Architecture Decisions and Rationale
- Decoupled Evaluation Service: Runs independently from inference endpoints. Prevents validation overhead from impacting user-facing latency. Enables parallel dataset execution and historical comparison.
- Versioned Datasets & Prompts: Stored in object storage with semantic versioning. Guarantees reproducibility and prevents silent metric drift caused by dataset mutation.
- Metric Threshold Enforcement: Hard gates in CI/CD prevent promotion of models that violate cost, latency, or safety SLAs. Soft gates trigger alerts for review.
- Shadow Testing Integration: Mirrors traffic to candidate models while logging outputs. Compares against baseline without user exposure. Catches regressions from prompt or model changes before they reach users.
- Observability Pipeline: Streams metrics to time-series databases. Enables trend analysis, anomaly detection, and automated rollback triggers when pass rates drop below 85%.
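The distinction between hard and soft gates described above can be sketched as follows. This is a minimal illustration, not a prescribed design: the `Gate` and `GateDecision` types and the metric names in the usage note are assumptions introduced for the example.

```typescript
type GateSeverity = 'hard' | 'soft';

// Hypothetical gate definition: one metric, one acceptance check.
interface Gate {
  metric: string;
  severity: GateSeverity;
  check: (value: number) => boolean; // true when the observed value is acceptable
}

interface GateDecision {
  promote: boolean;   // false if any hard gate failed
  blocking: string[]; // metrics that failed a hard gate
  warnings: string[]; // metrics that failed a soft gate (alert, do not block)
}

function evaluateGates(
  gates: Gate[],
  observed: { [metric: string]: number }
): GateDecision {
  const blocking: string[] = [];
  const warnings: string[] = [];
  for (const gate of gates) {
    const value = observed[gate.metric];
    // A missing metric counts as a failure so instrumentation gaps cannot pass silently.
    const ok = value !== undefined && gate.check(value);
    if (!ok) (gate.severity === 'hard' ? blocking : warnings).push(gate.metric);
  }
  return { promote: blocking.length === 0, blocking, warnings };
}
```

For instance, a hard gate on `safetyScore` with `check: v => v >= 85` would block promotion, while a soft gate on `retryRate` would only surface in `warnings` for review.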
Pitfall Guide
1. Treating Benchmark Scores as Production Guarantees
Static benchmarks use curated, clean data. Production inputs contain noise, typos, ambiguous phrasing, and adversarial patterns. Relying solely on benchmark accuracy invites silent failures when the input distribution shifts.
2. Ignoring Token Cost and Latency in Validation
Model selection based purely on capability ignores economic viability. Unbounded context windows, redundant system prompts, and inefficient retry logic inflate costs. Validation must enforce cost-per-request and P95 latency thresholds.
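To make the cost side concrete, the sketch below prices input and output tokens separately and converts the result to cents. The rates are the same placeholder values used in the evaluation runner, not real provider pricing, and the `TokenUsage` and `Pricing` types are assumptions for this example.

```typescript
// Hypothetical usage and pricing shapes; rates are placeholders, not provider pricing.
interface TokenUsage { inputTokens: number; outputTokens: number; }
interface Pricing { inputUsdPerToken: number; outputUsdPerToken: number; }

// Cost of one request in cents, with input and output billed at different rates.
function costCents(usage: TokenUsage, pricing: Pricing): number {
  const usd =
    usage.inputTokens * pricing.inputUsdPerToken +
    usage.outputTokens * pricing.outputUsdPerToken;
  return usd * 100;
}

const placeholderPricing: Pricing = {
  inputUsdPerToken: 0.000005,
  outputUsdPerToken: 0.000015,
};

// A 1,000-token prompt (for example, a long system prompt) with a 200-token answer.
const requestCost = costCents({ inputTokens: 1000, outputTokens: 200 }, placeholderPricing);
const withinBudget = requestCost <= 0.45; // budget taken from the usage example above
```

Under these placeholder rates the request costs roughly 0.8 cents, so `withinBudget` is false even though the output is short: the prompt alone accounts for most of the cost.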
3. Static Evaluation Datasets Without Distribution Tracking
Datasets decay as user behavior evolves. A validation set built in Q1 fails to capture Q3 input patterns. Continuous sampling from production traffic, anonymized and deduplicated, maintains dataset relevance.
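A dataset refresh along these lines might look like the sketch below. It is deliberately simplified: the e-mail regex stands in for real anonymization, which would need much broader PII handling, and the function names are hypothetical.

```typescript
// Mask e-mail addresses with a simple pattern (illustrative only, not full PII removal).
function maskEmails(text: string): string {
  return text.replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, '<email>');
}

// Normalize so trivial variations (case, extra whitespace) count as duplicates.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, ' ');
}

// Merge anonymized production samples into an existing dataset, skipping duplicates.
function refreshDataset(existing: string[], productionSamples: string[]): string[] {
  const seen = new Set(existing.map(normalize));
  const merged = [...existing];
  for (const raw of productionSamples) {
    const cleaned = maskEmails(raw);
    const key = normalize(cleaned);
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(cleaned);
    }
  }
  return merged;
}
```

Masking before deduplication means two requests that differ only in the e-mail address collapse into one entry.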
4. Over-Indexing on Automated Metrics Without Human Calibration
LLM-as-judge and embedding similarity scores lack contextual nuance. Human reviewers must calibrate scoring rubrics, audit borderline cases, and adjust thresholds quarterly to prevent metric inflation.
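One simple way to check an automated scorer against human review is to compare both on the same sample. The sketch below assumes scores on a 0 to 100 scale and a hypothetical `LabeledPair` record; it reports mean absolute error, average bias, and pass/fail agreement rather than any particular statistical test.

```typescript
// Hypothetical record pairing an automated score with a human score for the same output.
interface LabeledPair { judgeScore: number; humanScore: number; }

function calibrationReport(pairs: LabeledPair[], passThreshold: number) {
  const n = pairs.length || 1;
  const meanAbsError =
    pairs.reduce((s, p) => s + Math.abs(p.judgeScore - p.humanScore), 0) / n;
  // Positive bias: the automated judge scores higher than humans on average (metric inflation).
  const bias = pairs.reduce((s, p) => s + (p.judgeScore - p.humanScore), 0) / n;
  // Share of samples where both sides land on the same side of the threshold.
  const agreeing = pairs.filter(
    p => (p.judgeScore >= passThreshold) === (p.humanScore >= passThreshold)
  ).length;
  return { meanAbsError, bias, passFailAgreement: agreeing / n };
}
```

A persistently positive `bias` or a falling `passFailAgreement` is a signal to revisit the rubric or the threshold during the next calibration round.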
5. No Graceful Degradation or Fallback Strategy
AI features must handle model unavailability, timeout, or safety violations. Without fallback routing to rule-based systems, cached responses, or simplified models, user experience degrades catastrophically during incidents.
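A fallback chain with a per-call timeout can be sketched as below. The `Responder` type and the ordering of the chain (primary model, then cache, then a rule-based answer) are assumptions for illustration; real systems would also log each failure.

```typescript
// Hypothetical responder: any async function that turns a prompt into an answer.
type Responder = (prompt: string) => Promise<string>;

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      value => { clearTimeout(timer); resolve(value); },
      error => { clearTimeout(timer); reject(error); }
    );
  });
}

// Try each responder in order; return a static message if all of them fail.
async function respondWithFallback(
  prompt: string,
  chain: Responder[],
  timeoutMs: number,
  lastResort: string
): Promise<string> {
  for (const responder of chain) {
    try {
      return await withTimeout(responder(prompt), timeoutMs);
    } catch {
      // Error or timeout: fall through to the next responder in the chain.
    }
  }
  return lastResort;
}
```

Note that the timeout only stops waiting; it does not cancel the underlying model call, which would need provider-specific cancellation support.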
6. Skipping Shadow Testing Before Traffic Split
Direct promotion without shadow validation exposes users to regressions. Shadow testing captures real-world latency, cost, and output quality before any user-facing routing changes.
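The sketch below shows one possible shape for shadow testing: the user always receives the baseline answer, while a deterministic sample of requests is also sent to the candidate and logged. The FNV-1a-style hash (applied to UTF-16 code units) and all function and type names are assumptions chosen for the example.

```typescript
type ModelCall = (prompt: string) => Promise<string>;
type ShadowLogger = (requestId: string, baselineOut: string, candidateOut: string) => void;

// FNV-1a-style 32-bit hash over UTF-16 code units, used only to bucket request IDs.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Deterministic sampling: the same request ID always lands in the same bucket.
function inShadowSample(requestId: string, shadowPct: number): boolean {
  return fnv1a(requestId) % 100 < shadowPct;
}

async function handleRequest(
  requestId: string,
  prompt: string,
  baseline: ModelCall,
  candidate: ModelCall,
  shadowPct: number,
  logShadow: ShadowLogger
): Promise<string> {
  const answer = await baseline(prompt);
  if (inShadowSample(requestId, shadowPct)) {
    // Fire and forget: candidate latency and failures must never affect the user response.
    Promise.resolve()
      .then(() => candidate(prompt))
      .then(out => logShadow(requestId, answer, out))
      .catch(() => { /* ignored here; a real system would record shadow failures */ });
  }
  return answer;
}
```

Hash-based bucketing keeps the sample stable across retries of the same request, which makes baseline and candidate logs easier to join afterwards.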
7. Inconsistent Prompt Versioning
Prompt templates drift across environments. Validation against one prompt version while production uses another creates false confidence. Prompts must be version-controlled, hashed, and tied to evaluation runs.
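Tying a prompt hash to each evaluation run could look like the following sketch. It uses Node's `crypto.createHash`; the `EvalRunRecord` shape, the 12-character truncation, and the function names are assumptions for illustration.

```typescript
import { createHash } from 'crypto';

// Short, stable fingerprint of a prompt template (first 12 hex chars of SHA-256).
function promptFingerprint(template: string): string {
  return createHash('sha256').update(template, 'utf8').digest('hex').slice(0, 12);
}

// Hypothetical metadata stored alongside each evaluation run.
interface EvalRunRecord {
  modelVersion: string;
  promptVersion: string;
  promptHash: string;
}

// Fail loudly if the deployed template differs from the one that was evaluated.
function assertSamePrompt(run: EvalRunRecord, deployedTemplate: string): void {
  const deployedHash = promptFingerprint(deployedTemplate);
  if (deployedHash !== run.promptHash) {
    throw new Error(
      `Prompt mismatch: evaluated ${run.promptVersion} (${run.promptHash}), deployed ${deployedHash}`
    );
  }
}
```

Because even a whitespace change alters the hash, this check flags drift that a version label alone would miss.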
Best Practices from Production:
- Implement multi-dimensional scoring (cost, latency, safety, task completion) with weighted thresholds.
- Refresh evaluation datasets monthly using production sampling pipelines.
- Calibrate automated scorers with human-reviewed subsets quarterly.
- Enforce prompt versioning and context window limits in validation.
- Route traffic using canary deployments with automatic rollback on metric regression.
- Log all validation runs with dataset hash, model version, and environment metadata for auditability.
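The canary practice above can be modeled as a small state transition: widen traffic while metrics stay healthy, and drop to zero on the first regression. This is a minimal sketch; `CanaryState` and `nextCanaryStep` are hypothetical names, and how `healthy` is computed is left to the metric gates.

```typescript
// Hypothetical rollout state for a candidate model.
interface CanaryState {
  trafficPct: number;  // share of user traffic served by the candidate
  rolledBack: boolean; // once true, the rollout stays stopped until restarted manually
}

function nextCanaryStep(
  state: CanaryState,
  healthy: boolean,
  incrementPct: number
): CanaryState {
  if (state.rolledBack) return state;                        // no automatic recovery
  if (!healthy) return { trafficPct: 0, rolledBack: true };  // automatic rollback
  return {
    trafficPct: Math.min(100, state.trafficPct + incrementPct),
    rolledBack: false,
  };
}
```

Keeping the rolled-back state sticky avoids a candidate flapping in and out of traffic when metrics hover around a threshold.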
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Early-stage prototype | Single-model validation with synthetic datasets | Fast iteration, low infrastructure overhead | Low ($50-150/mo) |
| Production migration | Shadow testing + canary routing with multi-metric gates | Prevents regression, validates real-world distribution | Medium ($300-800/mo) |
| Multi-tenant SaaS | Continuous evaluation pipeline with tenant-specific sampling | Handles diverse input patterns, maintains SLA per tier | High ($1k-3k/mo) |
| Compliance-heavy domain | Human-in-the-loop calibration + safety scoring + audit logging | Meets regulatory requirements, reduces liability | Medium-High ($800-2k/mo) |
Configuration Template
# ai-validation-config.yaml
pipeline:
  version: "2.1"
  environment: "production"

thresholds:
  max_latency_ms: 800
  max_cost_cents: 0.45
  min_task_completion: 80
  min_safety_score: 85
  min_pass_rate: 0.85

datasets:
  - name: "production-sample-q3"
    source: "s3://eval-datasets/production/q3-2024.parquet"
    version: "v1.4.2"
    anonymization: "hash_email_phone"
    sample_size: 5000
    refresh_schedule: "monthly"

scorers:
  task_completion:
    type: "llm_judge"
    model: "gpt-4o-mini"
    rubric_version: "v3.1"
  safety:
    type: "content_filter"
    providers: ["aws_comprehend", "custom_regex"]
  latency:
    type: "native"
    percentile: "p95"

routing:
  shadow_traffic_pct: 10
  canary_increment: 5
  auto_rollback_on:
    - pass_rate_below: 0.80
    - avg_cost_above: 0.55
    - p95_latency_above: 1200

observability:
  metrics_endpoint: "prometheus://internal-monitoring:9090"
  retention_days: 90
  alert_channels: ["slack", "pagerduty"]
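As one way to consume the `auto_rollback_on` entries from the template above, the sketch below compares live metrics against those limits and lists every violated rule. The `RollbackRules` and `LiveMetrics` types are assumptions; parsing the YAML file itself is out of scope here, so the values are written inline.

```typescript
// Mirrors routing.auto_rollback_on from the template (camelCase names are an assumption).
interface RollbackRules {
  passRateBelow: number;
  avgCostAbove: number;     // cents
  p95LatencyAbove: number;  // milliseconds
}

interface LiveMetrics {
  passRate: number;
  avgCostCents: number;
  p95LatencyMs: number;
}

// Returns one human-readable reason per violated rule; an empty list means no rollback.
function rollbackReasons(metrics: LiveMetrics, rules: RollbackRules): string[] {
  const reasons: string[] = [];
  if (metrics.passRate < rules.passRateBelow) {
    reasons.push(`pass rate ${metrics.passRate} below ${rules.passRateBelow}`);
  }
  if (metrics.avgCostCents > rules.avgCostAbove) {
    reasons.push(`avg cost ${metrics.avgCostCents} above ${rules.avgCostAbove}`);
  }
  if (metrics.p95LatencyMs > rules.p95LatencyAbove) {
    reasons.push(`p95 latency ${metrics.p95LatencyMs} above ${rules.p95LatencyAbove}`);
  }
  return reasons;
}

// Values copied from the configuration template.
const templateRules: RollbackRules = {
  passRateBelow: 0.80,
  avgCostAbove: 0.55,
  p95LatencyAbove: 1200,
};
```

Returning the reasons rather than a boolean gives alert channels something specific to report when a rollback fires.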
Quick Start Guide
- Install evaluation dependencies: `npm install @codcompass/eval-runner prom-client`
- Create dataset snapshot: Export 1,000 anonymized production requests to Parquet format. Add expected outputs or scoring rubrics.
- Initialize runner: Copy the TypeScript evaluation runner, configure thresholds matching your SLA, and point to your dataset path.
- Execute baseline validation: Run `npm run eval:baseline`. Review aggregated metrics, adjust thresholds if pass rate < 85%, and commit results.
- Integrate with CI: Add validation step to pipeline. Block merge if `passRate < 0.85` or `avgCost > threshold`. Deploy to staging, enable 10% shadow traffic, and monitor for 48 hours before promotion.