on of any evaluation system is a curated dataset that represents your actual usage patterns. Instead of storing expected output strings, store input prompts alongside structured evaluation criteria. This approach decouples test data from model-specific phrasing.
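// types.ts
export interface EvalCase {
  id: string;
  inputPrompt: string;
  contextWindow: string;
  rubricCriteria: string[]; // one measurable condition per entry
  weight: number; // relative importance in the weighted average
}

export interface EvalResult {
  caseId: string;
  modelOutput: string;
  judgeScore: number; // average across rubric criteria, 0-1
  breakdown: Record<string, number>; // per-criterion scores
  latencyMs: number; // generation plus judging time
}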
Populate this dataset with 50-200 cases drawn from production logs, known edge cases, and historical failure modes. Assign weights to cases based on business criticality. High-traffic user intents should carry higher weights than niche formatting scenarios.
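A malformed case (duplicate id, empty rubric, zero weight) can quietly skew the weighted score, so it may be worth checking the dataset before any model is called. The following is a minimal sketch under that assumption; `validateDataset` and the two sample cases are hypothetical and not part of the original pipeline.

```typescript
// Mirrors the EvalCase interface from types.ts so the sketch is self-contained.
interface EvalCase {
  id: string;
  inputPrompt: string;
  contextWindow: string;
  rubricCriteria: string[];
  weight: number;
}

// Hypothetical helper: returns human-readable problems, or an empty array if none are found.
function validateDataset(cases: EvalCase[]): string[] {
  const problems: string[] = [];
  const seenIds = new Set<string>();
  for (const c of cases) {
    if (seenIds.has(c.id)) problems.push(`duplicate id: ${c.id}`);
    seenIds.add(c.id);
    if (c.inputPrompt.trim() === "") problems.push(`${c.id}: empty inputPrompt`);
    if (c.rubricCriteria.length === 0) problems.push(`${c.id}: no rubric criteria`);
    if (!Number.isFinite(c.weight) || c.weight <= 0) {
      problems.push(`${c.id}: weight must be a positive number`);
    }
  }
  return problems;
}

// Illustrative cases; the prompts, rubrics, and weights are invented for this example.
const sampleCases: EvalCase[] = [
  {
    id: "refund-request",
    inputPrompt: "I was charged twice, can I get a refund?",
    contextWindow: "",
    rubricCriteria: ["States the refund policy", "Asks for the order number"],
    weight: 3, // high-traffic intent
  },
  {
    id: "markdown-table",
    inputPrompt: "Show my last three orders as a table.",
    contextWindow: "",
    rubricCriteria: ["Output is a valid Markdown table"],
    weight: 1, // niche formatting scenario
  },
];
```

Running `validateDataset` at the start of the CI job and failing fast on a non-empty result keeps dataset mistakes from being misread as model regressions.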
Step 2: Implement the LLM-as-Judge Scorer
The scoring engine queries a lightweight reasoning model to evaluate outputs against your rubric. Temperature must be locked to zero to ensure consistent scoring behavior. The judge model receives the original input, the candidate output, and the rubric, then returns a structured score.
// scorer.ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export class QualityGateEvaluator {
  private readonly judgeModel = "gpt-4o-mini";
  private readonly maxRetries = 3;

  async evaluate(
    input: string,
    candidateOutput: string,
    rubric: string[]
  ): Promise<{ score: number; breakdown: Record<string, number> }> {
    // Number the criteria so the judge can score each one individually.
    const rubricPrompt = rubric
      .map((c, i) => `${i + 1}. ${c}`)
      .join("\n");
    const systemPrompt = `You are an evaluation engine. Score the candidate output against each rubric criterion on a scale of 0-1. Return only a JSON object with "score" (average) and "breakdown" (key-value pairs).`;
    const userPrompt = `INPUT: ${input}\n\nCANDIDATE OUTPUT: ${candidateOutput}\n\nRUBRIC:\n${rubricPrompt}`;

    for (let attempt = 1; attempt <= this.maxRetries; attempt++) {
      try {
        const response = await openai.chat.completions.create({
          model: this.judgeModel,
          messages: [
            { role: "system", content: systemPrompt },
            { role: "user", content: userPrompt },
          ],
          temperature: 0,
          response_format: { type: "json_object" },
        });
        const parsed = JSON.parse(response.choices[0].message.content || "{}");
        return {
          score: parsed.score ?? 0,
          breakdown: parsed.breakdown ?? {},
        };
      } catch (error) {
        // Covers both API errors and unparseable JSON; rethrow once retries are exhausted.
        if (attempt === this.maxRetries) throw error;
        // Linear backoff: wait 1s, then 2s, before the next attempt.
        await new Promise((r) => setTimeout(r, attempt * 1000));
      }
    }
    throw new Error("Evaluation failed after retries");
  }
}
Architecture Rationale:
- gpt-4o-mini is selected for scoring because it balances reasoning capability with sub-cent pricing. Using the same model for generation and evaluation introduces bias; a separate lightweight judge isolates scoring from feature logic.
- temperature: 0 and response_format: { type: "json_object" } make scoring as repeatable as the API allows and constrain the response to valid JSON. Neither guarantees identical scores across runs, and JSON mode does not guarantee that the "score" and "breakdown" keys are present.
- Retry logic with a delay that grows on each attempt handles transient API rate limits without failing the entire CI run.
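Because JSON mode only constrains the output to be JSON, a stricter parsing step can reject responses whose score is missing or outside the 0-1 range the judge prompt asks for, rather than recording them as 0. The sketch below assumes that behavior is wanted; `parseJudgeResponse` and `JudgeVerdict` are hypothetical names, not part of the scorer shown earlier.

```typescript
interface JudgeVerdict {
  score: number;
  breakdown: Record<string, number>;
}

// Hypothetical helper: parse the judge's raw JSON text and validate its shape.
// Throws on anything unexpected so a surrounding retry loop can try again.
function parseJudgeResponse(raw: string): JudgeVerdict {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Judge response is not a JSON object");
  }
  const { score, breakdown } = parsed as { score?: unknown; breakdown?: unknown };
  if (typeof score !== "number" || !Number.isFinite(score) || score < 0 || score > 1) {
    throw new Error(`Judge score missing or outside 0-1: ${String(score)}`);
  }
  // Keep only numeric breakdown entries; drop anything else the judge may have added.
  const cleanBreakdown: Record<string, number> = {};
  if (typeof breakdown === "object" && breakdown !== null) {
    for (const [key, value] of Object.entries(breakdown)) {
      if (typeof value === "number" && Number.isFinite(value)) {
        cleanBreakdown[key] = value;
      }
    }
  }
  return { score, breakdown: cleanBreakdown };
}
```

Throwing on a bad shape means one malformed judge reply costs a retry instead of silently dragging the weighted average down.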
Step 3: CI Enforcement Layer
The pipeline executes the evaluation suite on every pull request, compares the weighted average score against a predefined threshold, and blocks merges if quality degrades.
// ci-runner.ts
import { QualityGateEvaluator } from "./scorer";
import { EvalCase, EvalResult } from "./types";

export async function runQualityGate(
  cases: EvalCase[],
  threshold: number
): Promise<void> {
  // Guard: with no weighted cases the average is NaN, and NaN < threshold
  // is false, so the gate would pass without evaluating anything.
  const totalWeight = cases.reduce((acc, c) => acc + c.weight, 0);
  if (totalWeight <= 0) {
    console.error("Quality gate failed: no weighted cases to evaluate");
    process.exit(1);
  }

  const evaluator = new QualityGateEvaluator();
  const results: EvalResult[] = [];

  for (const testCase of cases) {
    const start = Date.now();
    const candidateOutput = await generateCandidate(testCase.inputPrompt);
    const evaluation = await evaluator.evaluate(
      testCase.inputPrompt,
      candidateOutput,
      testCase.rubricCriteria
    );
    results.push({
      caseId: testCase.id,
      modelOutput: candidateOutput,
      judgeScore: evaluation.score,
      breakdown: evaluation.breakdown,
      latencyMs: Date.now() - start,
    });
  }

  // results[i] corresponds to cases[i] because cases are evaluated in order.
  const weightedAvg =
    results.reduce((acc, r, i) => acc + r.judgeScore * cases[i].weight, 0) /
    totalWeight;

  if (weightedAvg < threshold) {
    console.error(`Quality gate failed: ${weightedAvg.toFixed(2)} < ${threshold}`);
    process.exit(1);
  }
  console.log(`Quality gate passed: ${weightedAvg.toFixed(2)}`);
}

// Stub for your actual LLM generation logic
async function generateCandidate(prompt: string): Promise<string> {
  // Replace with your feature's inference call
  return "placeholder output";
}
The threshold acts as a quality floor. Start with a baseline measurement from your current production prompt, then set the threshold 0.1-0.2 points below that baseline (on the judge's 0-1 scale) to account for natural variance. This reduces false positives while still catching meaningful regressions.
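As a worked example of the arithmetic, the sketch below computes a weighted average and derives the threshold from a baseline minus a margin. The function names, the baseline of 0.85, and the margin of 0.1 are illustrative assumptions rather than measured values.

```typescript
interface WeightedScore {
  score: number; // judge score on the 0-1 scale
  weight: number; // business-criticality weight
}

// Weighted mean: sum(score * weight) / sum(weight).
function weightedAverage(items: WeightedScore[]): number {
  const totalWeight = items.reduce((acc, it) => acc + it.weight, 0);
  if (totalWeight <= 0) throw new Error("total weight must be positive");
  return items.reduce((acc, it) => acc + it.score * it.weight, 0) / totalWeight;
}

// The gate passes when the weighted average is at or above (baseline - margin).
function passesGate(items: WeightedScore[], baseline: number, margin: number): boolean {
  return weightedAverage(items) >= baseline - margin;
}

// Two cases: a critical intent scoring 0.9 and a niche one scoring 0.5.
const exampleScores: WeightedScore[] = [
  { score: 0.9, weight: 3 },
  { score: 0.5, weight: 1 },
];
// Hypothetical baseline 0.85 with a 0.1 margin gives a threshold of 0.75.
const gatePasses = passesGate(exampleScores, 0.85, 0.1);
```

Here the weighted average is (0.9 · 3 + 0.5 · 1) / (3 + 1) = 0.8, which clears the 0.75 threshold even though the unweighted mean would be only 0.7.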
Pitfall Guide
1. Rubric Ambiguity
Explanation: Vague criteria like "Is the tone appropriate?" produce inconsistent judge scores because the LLM interprets subjective language differently across runs.
Fix: Replace subjective descriptors with measurable conditions. Change "tone appropriate" to "Uses formal register without colloquialisms or slang."
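One way to catch vague criteria before they reach the judge is a simple lint over the rubric text. This is a rough sketch only: the term list is a hypothetical starting point, and substring matching will produce both misses and false alarms.

```typescript
// Hypothetical list of subjective descriptors; extend it for your own domain.
const VAGUE_TERMS = ["appropriate", "good", "nice", "reasonable", "high quality", "natural"];

// Returns the vague terms found in a single rubric criterion (case-insensitive).
function findVagueTerms(criterion: string): string[] {
  const lower = criterion.toLowerCase();
  return VAGUE_TERMS.filter((term) => lower.includes(term));
}

const vagueHits = findVagueTerms("Is the tone appropriate?");
const measurableHits = findVagueTerms("Uses formal register without colloquialisms or slang.");
```

The first criterion is flagged for "appropriate"; the rewritten one passes with no hits.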
2. Threshold Tuning Drift
Explanation: Setting thresholds too aggressively causes constant CI failures, leading teams to disable the gate entirely. Setting them too loosely defeats the purpose.
Fix: Run 50+ evaluations against your current production prompt to establish a baseline distribution. Set the threshold at the 10th percentile of that distribution, then adjust quarterly based on production feedback.
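The 10th-percentile rule can be sketched with a nearest-rank percentile over the baseline scores. The `percentile` helper and the sample scores below are illustrative; other percentile definitions (for example, linear interpolation) would give slightly different thresholds.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of the data at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("cannot take a percentile of an empty array");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point error for integer p.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Invented baseline scores from ten evaluation runs, on the judge's 0-1 scale.
const baselineScores = [0.9, 0.8, 0.85, 0.7, 0.95, 0.88, 0.92, 0.6, 0.83, 0.87];
const suggestedThreshold = percentile(baselineScores, 10);
```

With only ten samples the 10th percentile is simply the lowest score, which is one reason the text recommends 50 or more baseline runs.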
3. Ignoring Latency in CI
Explanation: Sequential evaluation of 100 cases against an external API can push CI runtime past 10 minutes, blocking developer velocity.
Fix: Implement concurrent execution with a concurrency limit (e.g., 5-10 parallel requests). Cache judge responses for identical input/output pairs using a content hash.
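A minimal sketch of both ideas, assuming a Node.js runtime: `mapWithConcurrency` runs a worker over the cases with at most `limit` calls in flight, and `cacheKey` derives a SHA-256 content hash from an input/output pair. Both helper names are hypothetical; the cache store itself (in-memory map, file, or CI cache) is left out.

```typescript
import { createHash } from "node:crypto";

// Content hash for an input/output pair. JSON.stringify on a tuple keeps the
// boundary between the two strings unambiguous ("a" + "b" vs "ab" + "").
function cacheKey(input: string, output: string): string {
  return createHash("sha256").update(JSON.stringify([input, output])).digest("hex");
}

// Runs `worker` over `items` with at most `limit` promises in flight,
// returning results in the same order as the inputs.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (limit < 1) throw new Error("limit must be at least 1");
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each lane pulls the next unclaimed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => runLane());
  await Promise.all(lanes);
  return results;
}
```

In the CI runner, the sequential `for` loop could be replaced by a call such as `mapWithConcurrency(cases, 8, evaluateOneCase)`, where `evaluateOneCase` is whatever per-case function you define; if any worker rejects, the whole call rejects.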
4. Single-Model Judge Bias
Explanation: Relying exclusively on one judge model can mask systematic blind spots. GPT-4o-mini may overlook factual hallucinations that Claude 3.5 Haiku catches, or vice versa.
Fix: Run a secondary validation pass on a subset of cases (10-20%) using a different model family. Flag cases where scores diverge by >0.5 for manual review.
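The divergence check itself is a small comparison once both judges have scored the same subset. The sketch below assumes each judge's results are available as `{ caseId, score }` pairs; `findDivergentCases` is a hypothetical helper.

```typescript
interface ScoredCase {
  caseId: string;
  score: number; // judge score on the 0-1 scale
}

// Returns the ids of cases where the two judges disagree by more than maxGap.
// Cases the secondary judge did not score are skipped.
function findDivergentCases(
  primary: ScoredCase[],
  secondary: ScoredCase[],
  maxGap = 0.5
): string[] {
  const secondaryById = new Map<string, number>();
  for (const s of secondary) secondaryById.set(s.caseId, s.score);

  const flagged: string[] = [];
  for (const p of primary) {
    const other = secondaryById.get(p.caseId);
    if (other !== undefined && Math.abs(p.score - other) > maxGap) {
      flagged.push(p.caseId);
    }
  }
  return flagged;
}
```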
5. Dataset Staleness
Explanation: Evaluation suites decay as user behavior shifts and new edge cases emerge. A static dataset from six months ago no longer represents production traffic.
Fix: Implement a quarterly rotation process. Pull the top 20 failure cases from production logs, add them to the suite, and retire the lowest-performing historical cases.
6. Cost Blindness
Explanation: Unmonitored evaluation runs can accumulate API costs, especially when triggered on every commit across multiple branches.
Fix: Wrap evaluation calls in a cost tracker that logs token usage and estimated spend. Set a monthly budget alert at 80% of your spending limit. Run full suites only on PR merges, not on every push.
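A cost tracker along these lines can be sketched as a small accumulator. The class name, the per-million-token prices, and the budget are all placeholder assumptions; substitute your provider's current rates. With the Chat Completions API, the token counts would typically come from the `usage.prompt_tokens` and `usage.completion_tokens` fields of each response.

```typescript
interface TokenPrices {
  inputPerMillionUsd: number; // placeholder rate, not a real price
  outputPerMillionUsd: number; // placeholder rate, not a real price
}

// Hypothetical accumulator for token usage and estimated spend.
class CostTracker {
  private inputTokens = 0;
  private outputTokens = 0;
  private readonly prices: TokenPrices;
  private readonly monthlyBudgetUsd: number;

  constructor(prices: TokenPrices, monthlyBudgetUsd: number) {
    this.prices = prices;
    this.monthlyBudgetUsd = monthlyBudgetUsd;
  }

  // Call once per judge request with the token counts the API reported.
  record(promptTokens: number, completionTokens: number): void {
    this.inputTokens += promptTokens;
    this.outputTokens += completionTokens;
  }

  estimatedSpendUsd(): number {
    return (
      (this.inputTokens / 1e6) * this.prices.inputPerMillionUsd +
      (this.outputTokens / 1e6) * this.prices.outputPerMillionUsd
    );
  }

  // True once estimated spend reaches 80% of the monthly budget.
  shouldAlert(): boolean {
    return this.estimatedSpendUsd() >= 0.8 * this.monthlyBudgetUsd;
  }
}
```

The tracker only covers a single process; for a monthly figure the totals would need to be persisted between CI runs, which this sketch does not attempt.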
7. False Positives from Prompt Variance
Explanation: Minor prompt rewording that doesn't affect user experience can shift token probabilities enough to drop judge scores.
Fix: Separate generation temperature from evaluation temperature. Keep generation at your production setting, but force the judge to temperature: 0. Use semantic similarity checks alongside rubric scoring to filter cosmetic variations.
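The semantic similarity check can be sketched as cosine similarity between embedding vectors of the baseline output and the candidate output. How the embeddings are produced is out of scope here, and the 0.95 cutoff in `isCosmeticChange` is an arbitrary assumption to tune against your own data.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] for non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("cosine similarity is undefined for a zero vector");
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical filter: treat a score drop as cosmetic when the two outputs are
// semantically near-identical. The default cutoff is an assumption, not a recommendation.
function isCosmeticChange(
  baselineEmbedding: number[],
  candidateEmbedding: number[],
  minSimilarity = 0.95
): boolean {
  return cosineSimilarity(baselineEmbedding, candidateEmbedding) >= minSimilarity;
}
```

A case that fails the rubric threshold but passes this check is a candidate for manual review rather than an automatic block.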
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup MVP (0-10k users) | Rubric-based LLM judge on GPT-4o-mini | Lowest cost, fastest setup, catches 85%+ regressions | ~$4/month |
| High-Volume SaaS (100k+ users) | Rubric judge + secondary model validation | Reduces false positives, maintains accuracy at scale | ~$15-25/month |
| Compliance-Heavy (Healthcare/Finance) | Human-in-the-loop spot checks + strict threshold | Regulatory requirements demand audit trails and manual verification | ~$10/month + review time |
| Multi-Model A/B Testing | Parallel evaluation across GPT-4o, Claude 3.5 Haiku, Gemini Flash | Identifies cost-performance sweet spots before migration | ~$8-12/month |
Configuration Template
# .github/workflows/llm-quality-gate.yml
name: LLM Quality Gate
on:
  pull_request:
    branches: [main]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - name: Run Evaluation Suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          EVAL_THRESHOLD: 0.72 # example value; must be on the judge's 0-1 scale
        run: npx ts-node ci-runner.ts --threshold $EVAL_THRESHOLD

// eval-config.ts
export const EVAL_CONFIG = {
  datasetPath: "./evals/golden-dataset.json",
  // Example default; keep it on the same 0-1 scale as the judge scores.
  threshold: parseFloat(process.env.EVAL_THRESHOLD || "0.72"),
  concurrencyLimit: 8,
  cacheTtlHours: 24,
  judgeModel: "gpt-4o-mini",
  maxRetries: 3,
};
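The workflow passes `--threshold` on the command line, while `ci-runner.ts` as shown only exports `runQualityGate`, so some entrypoint has to read the flag. The sketch below covers just the parsing step, assuming the threshold lives on the same 0-1 scale the judge prompt asks for; `resolveThreshold` is a hypothetical helper.

```typescript
// Hypothetical helper: prefer the --threshold CLI flag, fall back to the env value,
// and reject anything that is not a number in [0, 1].
function resolveThreshold(argv: string[], envValue: string | undefined): number {
  const flagIndex = argv.indexOf("--threshold");
  const raw: string | undefined = flagIndex !== -1 ? argv[flagIndex + 1] : envValue;
  if (raw === undefined || raw.trim() === "") {
    throw new Error("No threshold provided via --threshold or EVAL_THRESHOLD");
  }
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new Error(`Threshold must be a number between 0 and 1, got "${raw}"`);
  }
  return value;
}

// Possible wiring at the bottom of ci-runner.ts (assumes `cases` was loaded from the dataset file):
//   const threshold = resolveThreshold(process.argv.slice(2), process.env.EVAL_THRESHOLD);
//   runQualityGate(cases, threshold).catch((err) => { console.error(err); process.exit(1); });
```

Rejecting out-of-range values up front turns a misconfigured threshold into an explicit error instead of a gate that always fails or always passes.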
Quick Start Guide
- Initialize the dataset: Create a JSON file with 50 input prompts and corresponding rubric arrays. Assign weights based on business priority.
- Install dependencies: Run npm install openai ts-node typescript and configure your OPENAI_API_KEY in environment variables.
- Run baseline evaluation: Execute the scorer against your current production prompt to establish a baseline score distribution.
- Set the threshold: Calculate the 10th percentile of your baseline scores and set EVAL_THRESHOLD slightly below it.
- Enable CI enforcement: Add the GitHub Actions workflow to your repository. Push a test PR to verify the quality gate blocks merges when scores drop below the threshold.