How to Run LLM Evaluations in CI Without Paying $249/Month
Automating Prompt Quality Gates: A Zero-Cost CI Pipeline for LLM Applications
Current Situation Analysis
Building production-grade LLM features introduces a unique engineering challenge: prompts behave predictably in isolated playgrounds but degrade silently under real-world distribution shifts. Unlike traditional software where unit tests catch regressions, LLM outputs are inherently non-deterministic. A minor tweak to a system prompt or temperature setting can silently reduce classification accuracy, increase hallucination rates, or break formatting constraints. Without automated evaluation, teams ship prompt regressions that only surface through user complaints or degraded metrics weeks later.
This problem is frequently overlooked because developers treat prompts as static configuration rather than versioned code. The industry response has been commercial evaluation platforms like LangSmith and Braintrust. These tools offer robust tracing, dataset management, and dashboarding, but they start at $249/month. For pre-PMF startups, indie developers, or internal tooling teams, that price point creates a hard barrier. The result is a gap between prototype validation and production reliability.
The technical reality is that you don't need a managed platform to enforce quality gates. LLM-as-judge scoring using cost-efficient models like GPT-4o-mini costs approximately $0.002 per evaluation. A 100-case golden dataset runs for $0.20 per execution. Combined with GitHub Actions' free tier (2,000 minutes/month), a complete CI evaluation pipeline for a team pushing 10 PRs weekly costs under $5/month. The infrastructure exists; the missing piece is a disciplined, automated workflow that treats prompt quality as a continuous integration metric rather than a manual review process.
WOW Moment: Key Findings
The shift from manual playground testing to automated CI evaluation fundamentally changes how LLM features evolve. The following comparison illustrates the operational and economic impact of adopting a self-hosted evaluation pipeline versus relying on commercial SaaS platforms.
| Approach | Monthly Cost | Time-to-First-Eval | Regression Detection | Vendor Lock-in Risk |
|---|---|---|---|---|
| Commercial Platform (e.g., LangSmith, Braintrust) | $249+ | 2-4 hours (account setup, SDK integration) | High (built-in dashboards, alerting) | High (proprietary dataset formats, API dependencies) |
| DIY CI Pipeline (LLM-as-judge + GitHub Actions) | <$5 | 45-60 minutes (script + workflow config) | High (threshold-based build failures) | Low (open standards, CSV/JSON datasets, portable code) |
This finding matters because it decouples prompt quality engineering from budget constraints. Teams can enforce strict quality gates, run multi-model cost comparisons, and iterate on system prompts with confidence that regressions will block merges. The pipeline transforms prompt engineering from an artisanal practice into a repeatable, measurable engineering discipline.
Core Solution
Building a production-grade evaluation pipeline requires three interconnected components: a versioned golden dataset, a rubric-based LLM-as-judge scorer, and a CI integration that enforces quality thresholds. Each component addresses a specific failure mode in LLM development.
Step 1: Construct the Golden Dataset
The foundation of any evaluation system is a curated dataset of 50-200 test cases. Unlike traditional unit tests, you should not define exact expected outputs. LLMs will naturally vary in phrasing, structure, and token selection. Instead, define behavioral expectations using rubrics.
Pull cases from three sources:
- Production logs containing real user queries that triggered edge cases
- Known failure modes you've already shipped accidentally
- Hypothetical adversarial inputs designed to stress-test constraints
Store the dataset as a version-controlled CSV or JSON file. Each row should contain the input prompt, a category tag, and a rubric describing what constitutes a successful response.
Step 2: Implement the LLM-as-Judge Scorer
Exact string matching fails because LLMs are probabilistic. Rubric-based scoring succeeds because it evaluates semantic alignment with requirements. The scorer sends the model's output alongside the rubric to a lightweight judge model (GPT-4o-mini is optimal at ~$0.002 per evaluation). The judge returns a numeric score and a brief justification.
Key architectural decisions:
- Judge Model Selection: Use a different model for judging than the one being evaluated. This prevents self-bias and reduces cost. GPT-4o-mini provides sufficient reasoning capability for rubric alignment at a fraction of GPT-4o's price.
- Temperature Control: Set the judge's temperature to 0.0. Evaluation must be deterministic. You want consistent scoring across runs, not creative interpretations.
- Scoring Scale: Use a 1-5 scale. It provides enough granularity to detect drift without creating false precision. A score of 3.5/5.0 typically represents the minimum threshold for production readiness.
Step 3: Integrate with CI and Enforce Thresholds
The pipeline runs on every pull request. It loads the golden dataset, executes the target prompt against each case, scores the outputs, and calculates the average. If the average falls below the configured threshold, the build fails. This creates a hard quality gate that prevents prompt regressions from merging.
Below is a complete, production-ready TypeScript implementation. It uses the official OpenAI SDK, handles rate limiting, and outputs structured results for CI consumption.
import OpenAI from "openai";
import { readFileSync } from "fs";
import { parse } from "csv-parse/sync";
interface EvalCase {
id: string;
input: string;
category: string;
rubric: string;
}
interface EvalResult {
caseId: string;
score: number;
justification: string;
latencyMs: number;
}
class PromptEvaluator {
private openai: OpenAI;
private judgeModel: string;
constructor(apiKey: string, judgeModel: string = "gpt-4o-mini") {
this.openai = new OpenAI({ apiKey });
this.judgeModel = judgeModel;
}
async loadDataset(path: string): Promise<EvalCase[]> {
const raw = readFileSync(path, "utf-8");
return parse(raw, {
columns: true,
skip_empty_lines: true,
}) as EvalCase[];
}
private async scoreResponse(
input: string,
output: string,
rubric: string
): Promise<{ score: number; justification: string }> {
const prompt = `You are an evaluation judge. Score the following AI response against the provided rubric.
Return ONLY a JSON object with "score" (1-5) and "justification" (max 50 words).
Rubric:
${rubric}
Input:
${input}
AI Response:
${output}`;
const start = Date.now();
const completion = await this.openai.chat.completions.create({
model: this.judgeModel,
messages: [{ role: "user", content: prompt }],
temperature: 0,
response_format: { type: "json_object" },
});
const latency = Date.now() - start;
const parsed = JSON.parse(completion.choices[0].message.content || "{}");
return { score: parsed.score ?? 0, justification: parsed.justification ?? "" };
}
async runSuite(
dataset: EvalCase[],
generateFn: (input: string) => Promise<string>,
threshold: number = 3.5
): Promise<{ averageScore: number; results: EvalResult[]; passed: boolean }> {
const results: EvalResult[] = [];
for (const testCase of dataset) {
const output = await generateFn(testCase.input);
const { score, justification } = await this.scoreResponse(
testCase.input,
output,
testCase.rubric
);
results.push({
caseId: testCase.id,
score,
justification,
latencyMs: 0, // Latency tracking handled by generateFn in production
});
}
const averageScore =
results.reduce((sum, r) => sum + r.score, 0) / results.length;
const passed = averageScore >= threshold;
return { averageScore, results, passed };
}
}
export { PromptEvaluator };
This implementation separates concerns cleanly: dataset loading, scoring logic, and suite execution. The generateFn parameter allows you to inject any prompt execution strategy (direct API call, LangChain chain, custom wrapper) without modifying the evaluator. The threshold check happens at the suite level, making it trivial to integrate with CI exit codes.
Step 4: Multi-Model Cost Optimization
Before locking in a model for production, run the same golden dataset against multiple candidates. GPT-4o, Claude 3.5 Haiku, and Gemini Flash often score within 0.2 points of each other on well-constructed rubrics, but their inference costs differ by 5-10x. A 10-minute comparison run can reveal that a cheaper model meets your quality threshold, cutting monthly inference spend by 60-80%. This pattern should be standard practice before any major feature launch.
Pitfall Guide
Even with a solid architecture, LLM evaluation pipelines fail when teams ignore behavioral nuances. The following pitfalls are drawn from production deployments and represent the most common points of failure.
1. Exact Match Fallacy
Explanation: Writing rubrics that expect specific phrasing or token sequences. LLMs will naturally vary in expression, causing false negatives. Fix: Focus rubrics on semantic requirements, constraints, and tone. Use phrases like "must include X" or "should avoid Y" rather than "must output exactly Z".
2. Judge Model Self-Bias
Explanation: Using the same model for generation and evaluation. Models tend to rate their own outputs higher due to distribution alignment. Fix: Always use a distinct judge model. GPT-4o-mini works well as a judge for GPT-4o or Claude outputs. Rotate judges periodically to detect drift.
3. Dataset Stagnation
Explanation: Running the same 50 cases for months while production traffic evolves. The pipeline stops catching real-world regressions. Fix: Implement a quarterly dataset refresh cycle. Pull 20% new cases from production logs, retire low-signal cases, and version the dataset alongside your code.
4. Threshold Over-Tuning
Explanation: Setting thresholds too high (4.8/5.0) causes constant CI failures on minor phrasing changes. Setting them too low (2.5/5.0) allows degraded outputs to merge. Fix: Start at 3.5/5.0. Monitor false positive rates for two weeks. Adjust in 0.2 increments based on actual regression catch rate, not theoretical perfection.
5. Rubric Leakage
Explanation: Embedding expected outputs or system prompt details directly into the rubric. This turns the judge into a regex matcher and defeats the purpose of semantic scoring. Fix: Keep rubrics abstract and behavior-focused. Example: "Response must acknowledge uncertainty when data is missing" instead of "Response must say 'I don't know'".
6. Ignoring CI Latency
Explanation: Running 200 cases sequentially in GitHub Actions causes timeouts or exceeds free tier minutes.
Fix: Implement parallel execution with concurrency limits. Use Promise.allSettled() with a batch size of 5-10. Cache judge responses for unchanged cases to skip redundant scoring.
7. Missing Error Boundaries
Explanation: API rate limits or network failures crash the entire CI run, masking actual quality issues. Fix: Wrap judge calls in retry logic with exponential backoff. Log failures separately from quality scores. Fail the build only on quality threshold breaches, not transient infrastructure errors.
Production Bundle
Action Checklist
- Define 50-100 golden dataset cases covering core workflows, edge cases, and known failure modes
- Write behavioral rubrics for each case focusing on constraints, tone, and accuracy rather than exact phrasing
- Select a lightweight judge model (GPT-4o-mini recommended) and set temperature to 0.0
- Implement the evaluation runner with parallel execution and retry logic
- Configure GitHub Actions workflow to run on pull requests with a 3.5/5.0 threshold
- Add multi-model comparison step to test Claude 3.5 Haiku and Gemini Flash before production deployment
- Schedule quarterly dataset refreshes using production log sampling
- Document threshold adjustment history and false positive/negative rates for auditability
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Pre-PMF product with <10 PRs/week | DIY CI Pipeline + GPT-4o-mini judge | Low overhead, full control, <$5/month | Negligible |
| Enterprise compliance requirements | Commercial Platform (LangSmith/Braintrust) | Audit trails, SSO, dedicated support, SLA | $249-$999/month |
| Multi-model vendor evaluation | DIY Pipeline with parallel runner | Direct score comparison, no platform abstraction | ~$0.50/run |
| High-frequency prompt iteration (>20 PRs/week) | DIY Pipeline + response caching | Reduces redundant judge calls, stays under free tier limits | ~$8-12/month |
| Strict regulatory output constraints | Hybrid: DIY CI + human-in-the-loop sampling | Automated scoring catches drift, humans validate edge cases | Variable (human review cost) |
Configuration Template
Copy this GitHub Actions workflow to .github/workflows/llm-eval.yml. It assumes your evaluation script is located at scripts/run-eval.ts and your dataset at eval/golden-dataset.csv.
name: LLM Quality Gate
on:
pull_request:
paths:
- "prompts/**"
- "src/llm/**"
- "eval/**"
jobs:
evaluate:
runs-on: ubuntu-latest
timeout-minutes: 10
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: 20
cache: "npm"
- name: Install dependencies
run: npm ci
- name: Run evaluation suite
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
EVAL_THRESHOLD: 3.5
run: |
npx tsx scripts/run-eval.ts \
--dataset eval/golden-dataset.csv \
--threshold $EVAL_THRESHOLD \
--judge-model gpt-4o-mini \
--concurrency 8
- name: Upload evaluation report
if: always()
uses: actions/upload-artifact@v4
with:
name: eval-results
path: eval/reports/
The workflow triggers only on relevant path changes to avoid unnecessary runs. It exports results as artifacts for post-merge review. The --concurrency flag controls parallel judge calls to respect API rate limits.
Quick Start Guide
- Create your dataset: Generate a CSV with columns
id,input,category,rubric. Populate 50 rows using production logs and known edge cases. Commit toeval/golden-dataset.csv. - Write the runner: Use the TypeScript evaluator class above. Implement a
generateFnthat calls your actual prompt pipeline. Add CLI argument parsing for threshold and concurrency. - Configure CI: Add the workflow template to
.github/workflows/llm-eval.yml. SetOPENAI_API_KEYin repository secrets. SetEVAL_THRESHOLDto 3.5. - Validate: Open a test PR that intentionally degrades a prompt. Verify the workflow fails and outputs the average score. Revert the change and confirm the build passes.
- Iterate: Run multi-model comparisons against Claude 3.5 Haiku and Gemini Flash. Document score deltas and adjust your production model selection based on cost/quality trade-offs.
This pipeline transforms prompt engineering from a manual, reactive process into a continuous, measurable engineering practice. By enforcing quality gates at merge time, you eliminate silent regressions, reduce inference costs through systematic model comparison, and maintain production reliability without commercial platform dependencies. The infrastructure is lightweight, the economics are favorable, and the operational discipline scales with your team.
Mid-Year Sale β Unlock Full Article
Base plan from just $4.99/mo or $49/yr
Sign in to read the full article and unlock all tutorials.
Sign In / Register β Start Free Trial7-day free trial Β· Cancel anytime Β· 30-day money-back
