Why I Built My Own LLM Eval System Instead of Paying $300/Month for Braintrust
Automating LLM Regression Detection: A Lightweight Evaluation Framework for Production Systems
Current Situation Analysis
Large language model deployments suffer from a silent degradation problem. Unlike deterministic software, LLM outputs shift when base models update, context windows expand, or system prompts are tweaked. Teams frequently ship features that pass internal testing, only to discover weeks later that user-facing outputs have drifted into hallucination, tone inconsistency, or structural breakage. The root cause is rarely the model itself; it's the absence of continuous, automated evaluation.
Enterprise evaluation platforms address this gap but price themselves for mid-market engineering teams. Solutions like Braintrust start at $180/month, LangSmith charges $39/user/month, and Arize operates on custom enterprise pricing. For bootstrapped teams, indie developers, or small product squads, these costs consume disproportionate runway. Consequently, many teams skip systematic evaluation entirely, relying on manual spot-checks or user complaints as their primary feedback loop.
The industry overlooks a fundamental truth: you don't need a full observability dashboard to catch production-breaking regressions. You need a deterministic scoring pipeline that runs on every code change. By replacing expensive SaaS platforms with a lightweight, judge-model-driven rubric, teams can achieve comparable regression detection at a fraction of the cost. The technical challenge shifts from "how do we afford evaluation?" to "how do we design a rubric that generalizes across prompt iterations without introducing scoring noise?"
WOW Moment: Key Findings
The most effective evaluation strategy for small-to-medium LLM applications isn't comprehensive synthetic testing or expensive vendor platforms. It's a production-grounded, three-axis rubric executed by a cost-efficient judge model. When benchmarked against alternative approaches, the DIY framework demonstrates a superior cost-to-coverage ratio while maintaining high signal fidelity.
| Evaluation Strategy | Monthly Cost (100 runs) | Regression Detection Rate | Maintenance Overhead | Data Fidelity |
|---|---|---|---|---|
| Enterprise SaaS | $180β$500+ | ~90% | High (vendor lock-in) | Low (synthetic/default) |
| Synthetic Test Suites | $0β$5 | ~40% | Medium | Low (model-biased) |
| Production-Grounded DIY | $0.15β$0.25 | ~85% | Low | High (real user inputs) |
This finding matters because it decouples evaluation quality from budget constraints. The ~85% detection rate captures the majority of functional, tonal, and structural regressions that directly impact user experience. By anchoring tests to actual production inputs rather than model-generated scenarios, the framework measures what your system actually handles. The cost reduction enables PR-gated evaluation instead of monthly audits, transforming LLM quality from a retrospective metric into a continuous deployment safeguard.
Core Solution
Building a production-ready evaluation pipeline requires four coordinated components: a structured rubric, a deterministic judge prompt, a production-sourced golden dataset, and an automated scoring gate. Each component addresses a specific failure mode in LLM deployment.
1. Rubric Architecture: The Three-Axis Model
LLM outputs fail in three predictable dimensions:
- Accuracy: Factual correctness and logical coherence relative to the user's request.
- Tone: Alignment with brand voice, helpfulness, and avoidance of sycophancy or dismissiveness.
- Format: Structural integrity, length appropriateness, and compatibility with downstream parsers.
Scoring each axis on a 1β5 scale provides granular visibility without overwhelming the judge model. Composite scoring (average of the three) enables simple thresholding, while individual axis scores allow targeted debugging when a PR fails.
2. Judge Prompt Engineering
Vague instructions like "rate this response 1-10" produce inconsistent outputs because frontier models lack contextual grounding. Effective judge prompts require explicit anchors at discrete intervals. The following TypeScript implementation demonstrates a structured approach:
import OpenAI from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
interface EvalResult {
accuracy: number;
tone: number;
format: number;
reasoning: string;
}
const JUDGE_SYSTEM_PROMPT = `You are an evaluation engine. Score the assistant's response on three independent axes (1-5 each). Return ONLY valid JSON.
SCORING CRITERIA:
ACCURACY (1-5):
5: Fully correct, addresses all constraints, zero factual errors
3: Mostly correct, minor omissions or slight logical gaps
1: Fundamentally incorrect, hallucinated, or misleading
TONE (1-5):
5: Confident, direct, appropriately helpful, zero filler
3: Acceptable but slightly verbose, hesitant, or overly cautious
1: Overly apologetic, dismissive, or misaligned with professional standards
FORMAT (1-5):
5: Clean structure, appropriate length, valid markdown, parser-ready
3: Correct content but poor formatting, inconsistent lists, or awkward breaks
1: Wall of text, missing required sections, or broken structure
INPUT: {user_query}
OUTPUT: {model_response}
Return JSON: {"accuracy": number, "tone": number, "format": number, "reasoning": string}`;
export async function evaluateResponse(
userQuery: string,
modelResponse: string
): Promise<EvalResult> {
const completion = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: JUDGE_SYSTEM_PROMPT },
{ role: 'user', content: `INPUT: ${userQuery}\nOUTPUT: ${modelResponse}` }
],
response_format: { type: 'json_object' },
temperature: 0
});
const raw = completion.choices[0].message.content;
if (!raw) throw new Error('Judge model returned empty response');
const parsed = JSON.parse(raw) as EvalResult;
// Validate score boundaries
['accuracy', 'tone', 'format'].forEach(axis => {
const val = parsed[axis as keyof EvalResult];
if (typeof val !== 'number' || val < 1 || val > 5) {
throw new Error(`Invalid ${axis} score: ${val}`);
}
});
return parsed;
}
Key architectural decisions:
gpt-4o-miniis used as the judge model. It provides sufficient reasoning capability for rubric scoring while costing pennies per request.temperature: 0ensures deterministic scoring. Reproducibility is critical for regression detection.response_format: 'json_object'enforces parseable output, eliminating string manipulation overhead.- Boundary validation catches malformed judge responses before they corrupt the dataset.
3. Golden Dataset Construction
Synthetic test cases are fundamentally flawed for regression testing. When an LLM generates its own test data, it optimizes for patterns it already handles well, creating circular validation that misses edge cases. Production logs contain the actual distribution of user requests, including malformed inputs, ambiguous phrasing, and domain-specific terminology.
import fs from 'fs';
import { randomUUID } from 'crypto';
interface GoldenSample {
id: string;
userQuery: string;
expectedResponse: string;
metadata: { source: string; timestamp: string };
}
export function buildGoldenDataset(
rawLogs: Array<{ query: string; response: string; ts: string }>,
sampleSize: number = 100
): GoldenSample[] {
// Sort chronologically, prioritize recent production behavior
const sorted = [...rawLogs].sort((a, b) =>
new Date(b.ts).getTime() - new Date(a.ts).getTime()
);
// Stratified sampling to avoid temporal bias
const window = sorted.slice(0, 500);
const step = Math.max(1, Math.floor(window.length / sampleSize));
return window
.filter((_, i) => i % step === 0)
.slice(0, sampleSize)
.map(log => ({
id: randomUUID(),
userQuery: sanitizeInput(log.query),
expectedResponse: log.response,
metadata: { source: 'production', timestamp: log.ts }
}));
}
function sanitizeInput(raw: string): string {
// Strip emails, phone numbers, and common PII patterns
return raw
.replace(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, '[EMAIL]')
.replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, '[PHONE]')
.replace(/\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b/g, '[CARD]');
}
The sampling strategy avoids taking the last N logs, which often cluster around a single campaign or support ticket. Stratified selection across a 500-entry window preserves temporal diversity while keeping dataset size manageable.
4. CI Gate Implementation
Evaluation must run automatically on every pull request. The following runner aggregates scores, computes a composite metric, and enforces a threshold:
import { evaluateResponse } from './judge';
import { GoldenSample } from './dataset';
interface RunReport {
total: number;
passed: number;
failed: number;
compositeScore: number;
axisBreakdown: { accuracy: number; tone: number; format: number };
}
export async function executeEvalSuite(
dataset: GoldenSample[],
threshold: number = 3.8
): Promise<RunReport> {
const results: EvalResult[] = [];
for (const sample of dataset) {
const score = await evaluateResponse(sample.userQuery, sample.expectedResponse);
results.push(score);
}
const axisBreakdown = {
accuracy: results.reduce((sum, r) => sum + r.accuracy, 0) / results.length,
tone: results.reduce((sum, r) => sum + r.tone, 0) / results.length,
format: results.reduce((sum, r) => sum + r.format, 0) / results.length
};
const compositeScore =
(axisBreakdown.accuracy + axisBreakdown.axisBreakdown.tone + axisBreakdown.format) / 3;
const passed = results.filter(r =>
r.accuracy >= 3 && r.tone >= 3 && r.format >= 3
).length;
return {
total: results.length,
passed,
failed: results.length - passed,
compositeScore: Math.round(compositeScore * 100) / 100,
axisBreakdown
};
}
The threshold of 3.8 allows minor variance while blocking significant regressions. Teams can adjust this based on risk tolerance. The axis breakdown enables targeted fixes: if format drops but accuracy holds, the issue is likely prompt structure or markdown handling, not factual reasoning.
Pitfall Guide
1. Circular Synthetic Testing
Explanation: Generating test cases with an LLM creates data that mirrors the model's existing strengths. The evaluation suite becomes a confirmation loop that never surfaces novel failure modes. Fix: Source inputs exclusively from production logs, support tickets, or user session recordings. Rotate the dataset monthly to capture evolving query distributions.
2. Ambiguous Scoring Anchors
Explanation: Prompts that ask for "1-10" or "good/bad" ratings produce inconsistent outputs because the judge model lacks reference points. Scores drift between runs, making regression detection impossible. Fix: Anchor scoring at 1, 3, and 5 with explicit behavioral descriptions. Require JSON output and validate boundaries programmatically.
3. Golden Dataset Staleness
Explanation: User behavior shifts over time. A dataset collected in Q1 may not reflect Q3 query patterns, causing the eval suite to measure historical performance rather than current capability. Fix: Implement automated dataset rotation. Archive old samples, inject new production inputs, and track drift metrics (e.g., query length distribution, intent diversity) alongside eval scores.
4. Threshold Overfitting
Explanation: Setting a hard threshold (e.g., 4.0) causes CI friction when minor, acceptable variations trigger failures. Teams begin ignoring eval gates or lowering thresholds until they lose signal. Fix: Use tiered thresholds. Warn at 3.5, block at 3.0. Allow manual overrides with required justification. Track threshold violations over time to identify systemic prompt instability.
5. PII Contamination in Storage
Explanation: Production logs contain emails, phone numbers, and internal identifiers. Storing these in evaluation datasets violates compliance requirements and creates security liabilities. Fix: Implement a sanitization pipeline before dataset persistence. Use regex patterns for common PII, and consider tokenization or hashing for domain-specific identifiers. Never store raw logs in version control.
6. Judge Model Format Hallucination
Explanation: Frontier models occasionally output valid JSON but misalign structural scores with actual markdown compliance. The judge may rate a broken list as "5" due to semantic understanding overriding structural rules.
Fix: Add a secondary structural validator. Run regex or markdown parsers against the response before scoring. If structural validation fails, force format score to 1 regardless of judge output.
7. CI Pipeline Timeouts
Explanation: Running 100 sequential judge calls in GitHub Actions exceeds typical timeout limits, especially when API latency spikes. Failed pipelines create false negatives.
Fix: Batch requests using Promise.all with concurrency limits (e.g., 5-10 parallel calls). Implement exponential backoff for rate limits. Cache judge responses for identical input/output pairs to reduce redundant API calls.
Production Bundle
Action Checklist
- Define rubric axes: Map Accuracy, Tone, and Format to your product's specific failure modes.
- Extract production logs: Pull 30-90 days of user queries, strip PII, and stratify by timestamp.
- Build judge prompt: Anchor scores at 1, 3, 5. Enforce JSON output and temperature 0.
- Implement scoring runner: Compute composite scores, validate boundaries, and log axis breakdowns.
- Configure CI gate: Set threshold (3.5-3.8), add concurrency limits, and enable PR status checks.
- Schedule dataset rotation: Automate monthly sampling and archive stale golden sets.
- Add structural fallback: Validate markdown/format independently before trusting judge scores.
- Monitor drift: Track query distribution changes alongside eval scores to catch dataset decay.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Solo dev / MVP stage | DIY 3-axis rubric + 30% sampling | Minimal overhead, catches critical regressions, scales with usage | ~$0.05β$0.25/run |
| Mid-size team / regulated industry | Enterprise SaaS + custom rubric | Audit trails, compliance reporting, multi-model comparison | $180β$500+/mo |
| High-volume RAG pipeline | Hybrid: DIY gate + async batch runner | Handles 10k+ cases, separates scoring from deployment | $0.50β$2.00/run (scaled) |
| Multi-model selection phase | Enterprise platform or custom benchmark | Requires side-by-side latency, cost, and quality tracking | Variable (compute-heavy) |
Configuration Template
{
"eval_suite": {
"model": "gpt-4o-mini",
"temperature": 0,
"concurrency_limit": 8,
"thresholds": {
"composite_warn": 3.5,
"composite_block": 3.0,
"axis_minimum": 2
},
"dataset": {
"source": "production_logs",
"sample_size": 100,
"rotation_days": 30,
"pii_redaction": true
},
"ci": {
"fail_on_threshold_breach": true,
"allow_manual_override": true,
"timeout_minutes": 10
}
}
}
Quick Start Guide
- Install dependencies:
npm install openai dotenvand create a.envfile withOPENAI_API_KEY. - Prepare dataset: Export 500 recent production queries, run the sanitization function, and save as
golden.jsonl. - Run locally: Execute the scoring runner against your dataset. Verify JSON parsing, boundary validation, and composite calculation.
- Add to CI: Create a GitHub Actions workflow that triggers on
pull_request, runs the eval script, and sets status checks. Setcomposite_blockto 3.8 initially, then adjust based on historical variance. - Monitor: Review axis breakdowns weekly. If
formatconsistently drops, audit markdown handling. Ifaccuracydeclines, investigate prompt drift or model updates. Rotate the dataset monthly to maintain signal fidelity.
Mid-Year Sale β Unlock Full Article
Base plan from just $4.99/mo or $49/yr
Sign in to read the full article and unlock all tutorials.
Sign In / Register β Start Free Trial7-day free trial Β· Cancel anytime Β· 30-day money-back
