A/B Test Your Prompts Without a Framework

By Codcompass Team·2026-05-26·7 min read

Current Situation Analysis

Prompt engineering lacks deterministic regression testing. When a developer modifies a system prompt, the code diff only shows text changes. It provides zero visibility into whether the modification improved output quality, introduced subtle drift, or broke downstream parsing. Engineering teams frequently rely on subjective comparison or manual spot-checking, which scales poorly and introduces cognitive bias.

This gap exists because traditional evaluation frameworks demand heavy infrastructure: vendor dashboards, LLM-as-a-judge pipelines, or complex staging environments. Most teams skip formal testing until a regression hits production, at which point rollback becomes guesswork. The absence of a lightweight, version-controlled baseline capture mechanism means prompt changes are effectively invisible to CI/CD pipelines.

Data from production LLM deployments consistently shows that minor prompt adjustments can shift output distributions by 15–30% across key metrics like format compliance, tone consistency, and factual grounding. Without a deterministic replay mechanism, teams cannot isolate whether a performance drop stems from prompt drift, model updates, or input distribution shifts. The solution requires decoupling prompt versioning from output capture, storing baselines in a git-friendly format, and enabling fast, local diffing before code merges.

WOW Moment: Key Findings

The following comparison illustrates why local fixture-based replay outperforms traditional evaluation approaches for routine prompt iteration:

Approach	Setup Complexity	Execution Speed	Traceability	Cost per Run
Vendor Dashboard / LLM-as-a-Judge	High (API keys, routing, prompt engineering for judges)	Slow (sequential judge calls, rate limits)	Low (opaque scoring, hard to audit)	High (additional LLM calls per test)
Local Fixture Replay	Low (two libraries, JSONL storage)	Fast (deterministic replay, no judge calls)	High (git-tracked baselines, exact version hashes)	Near-zero (reuses cached inputs, optional semantic scoring)

This finding matters because it shifts prompt testing from an expensive, opaque evaluation phase to a deterministic, CI-gatable regression check. Teams can now treat prompts like infrastructure code: versioned, diffed, and validated before deployment. The approach enables rapid iteration loops without vendor lock-in or budget blowouts, while maintaining full auditability of which prompt text produced which output.

Core Solution

The architecture separates three concerns: template versioning, baseline capture, and output diffing. This separation ensures that prompt changes are traceable, replayable, and comparable without mocking production systems or introducing heavy dependencies.

Step 1: Version and Hash Prompt Templates

Prompt text must be pinned to a deterministic identifier. Using semantic versioning alongside content hashing guarantees that every fixture file maps to an exact prompt state. This eliminates ambiguity when reviewing historical baselines or debugging regressions.

import { PromptRegistry } from 'prompt-template-version';

const templateRegistry = new PromptRegistry();

const baselineId = templateRegistry.register({
  name: 'customer_support_triage',
  version: '1.2.0',
  text: 'Classify the support ticket into one of three categories: billing, technical, or account. Return only the category name.'
});

const updatedId = templateRegistry.register({
  name: 'customer_support_triage',
  version: '1.3.0',
  text: 'Analyze the support ticket and assign it to exactly one category: billing, technical, or account. Output only the category name in lowercase.'
});

const baselineFixture = `fixtures/triage_${baselineId.shortHash}.jsonl`;
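The `shortHash` used in the fixture path above can be read as a truncated digest of the prompt text. The following is a minimal sketch of that idea using Node's built-in `crypto` module. It is not the internals of `prompt-template-version`: the `PromptId` shape, the helpers `hashPrompt` and `fixturePathFor`, and the 12-character truncation are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical identifier shape; the real registry may return something different.
interface PromptId {
  name: string;
  version: string;
  fullHash: string;
  shortHash: string;
}

// Hash the exact prompt text so that any edit, even whitespace, yields a new identifier.
function hashPrompt(name: string, version: string, text: string): PromptId {
  const fullHash = createHash("sha256").update(text, "utf8").digest("hex");
  // Truncating to 12 hex characters is an illustrative choice, not a library default.
  return { name, version, fullHash, shortHash: fullHash.slice(0, 12) };
}

// Build a fixture path that encodes which prompt text produced the baseline.
function fixturePathFor(id: PromptId, dir: string = "fixtures"): string {
  return `${dir}/${id.name}_${id.shortHash}.jsonl`;
}

const exampleId = hashPrompt("triage", "1.2.0", "Return only the category name.");
console.log(fixturePathFor(exampleId));
```

Because the hash depends only on the text, bumping the version without changing the text would map to the same fixture, while a silent text edit under an unchanged version would not.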

Step 2: Record Baseline Outputs

The recording phase intercepts LLM calls, captures inputs and outputs, and writes them to a JSONL file. JSONL is chosen for its git-friendly line-by-line structure, easy streaming support, and compatibility with standard Unix tools. Each line represents a single inference call, enabling granular diffing later.

import { RegressionRecorder } from 'prompt-replay';

const recorder = new RegressionRecorder({
  outputPath: baselineFixture,
  modelConfig: {
    provider: 'anthropic',
    model: 'claude-sonnet-4-6',
    maxTokens: 128,
    temperature: 0.1
  }
});

const testInputs = [
  'My invoice shows a duplicate charge for last month.',
  'The API returns 503 when I call /v2/users.',
  'I cannot reset my password using the email link.'
];

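// Run every curated input through the 1.2.0 prompt and record each call as one fixture line.
async function captureBaseline() {
  for (const input of testInputs) {
    await recorder.capture({
      prompt: templateRegistry.resolve('customer_support_triage', '1.2.0'),
      userMessage: input
    });
  }
  // Flush buffered records to disk before reporting the count.
  await recorder.flush();
  console.log(`Baseline captured: ${recorder.entryCount} records`);
}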
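To make the one-record-per-line property concrete, here is a rough sketch of how a fixture could be serialized and parsed. The `FixtureRecord` fields and the helpers `toJsonl` and `parseJsonl` are hypothetical and do not describe the actual schema written by `prompt-replay`.

```typescript
// Hypothetical record shape; field names are illustrative only.
interface FixtureRecord {
  promptHash: string;
  model: string;
  temperature: number;
  maxTokens: number;
  userMessage: string;
  output: string;
}

// One JSON object per line. JSON.stringify escapes embedded newlines,
// so a multi-line output still occupies exactly one line in the file.
function toJsonl(records: FixtureRecord[]): string {
  return records.map((record) => JSON.stringify(record)).join("\n") + "\n";
}

// Parse a fixture back into records, skipping blank lines.
function parseJsonl(text: string): FixtureRecord[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as FixtureRecord);
}
```

Storing the inference parameters on every line is a design choice in this sketch: it lets a later replay check that it is comparing like with like.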

Step 3: Replay and Diff

The replay phase loads the fixture, injects the updated prompt, and executes the same inputs. The diff engine provides three comparison modes: exact string matching, structural JSON comparison, and semantic similarity computed as the cosine similarity between output embeddings. Semantic scoring acts as a drift signal, not a quality verdict.

import { RegressionReplayer, DiffEngine } from 'prompt-replay';

const replayer = new RegressionReplayer({
  fixturePath: baselineFixture,
  modelConfig: {
    provider: 'anthropic',
    model: 'claude-sonnet-4-6',
    maxTokens: 128,
    temperature: 0.1
  }
});

async function runRegression() {
  // Re-run every recorded input against the updated 1.3.0 prompt.
  const results = await replayer.execute({
    prompt: templateRegistry.resolve('customer_support_triage', '1.3.0'),
    inputField: 'userMessage'
  });

  const diffEngine = new DiffEngine();

  for (const record of results) {
    // Compare the stored baseline output with the fresh output in each mode.
    const exactMatch = diffEngine.compare(record.baseline, record.current, 'exact');
    const structuralMatch = diffEngine.compare(record.baseline, record.current, 'json');
    const semanticScore = diffEngine.compare(record.baseline, record.current, 'semantic');

    console.log(`Input: ${record.input.slice(0, 40)}...`);
    console.log(`Exact: ${exactMatch} | Structural: ${structuralMatch} | Semantic: ${semanticScore.toFixed(3)}`);
  }
}
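The three comparison modes can be approximated in a few lines each. The sketch below shows one plausible reading of them and is not the `DiffEngine` implementation: `exactMatch`, `structuralMatch`, and `cosineSimilarity` are hypothetical helper names, and the embedding vectors are assumed to come from whatever embedding model the team already uses.

```typescript
// Exact mode: strict string equality.
function exactMatch(baseline: string, current: string): boolean {
  return baseline === current;
}

// Recursively sort object keys so key order does not affect the comparison.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = canonicalize(source[key]);
    }
    return sorted;
  }
  return value;
}

// Structural mode: both outputs must parse as JSON and match after canonicalization.
function structuralMatch(baseline: string, current: string): boolean {
  try {
    const left = JSON.stringify(canonicalize(JSON.parse(baseline)));
    const right = JSON.stringify(canonicalize(JSON.parse(current)));
    return left === right;
  } catch {
    return false; // at least one side is not valid JSON
  }
}

// Semantic mode: cosine similarity between two embedding vectors.
function cosineSimilarity(u: number[], v: number[]): number {
  if (u.length === 0 || u.length !== v.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normU = 0;
  let normV = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    normU += u[i] * u[i];
    normV += v[i] * v[i];
  }
  if (normU === 0 || normV === 0) return 0; // treat a zero vector as "no similarity"
  return dot / (Math.sqrt(normU) * Math.sqrt(normV));
}
```

Note that in this sketch a bare category name such as billing is not valid JSON, so structural mode reports a mismatch for it. For plain-text outputs like the triage example, exact and semantic modes carry the signal.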

Architecture Decisions and Rationale

JSONL over SQLite/Parquet: JSONL enables line-level git diffs, easy filtering with jq, and zero database dependencies. It aligns with how LLM providers stream responses.
Separate Recording and Replay: Decoupling phases prevents state leakage. Baselines remain immutable until explicitly regenerated, ensuring CI gates compare against a stable reference.
Semantic Similarity as Signal: Embedding-based scoring captures conceptual drift without requiring expensive judge models. A threshold of 0.85–0.92 typically indicates acceptable variance, but must be calibrated per workload.
Version Hashing in Filenames: Embedding the prompt hash in the fixture path guarantees traceability. Teams can instantly verify which prompt text generated a given baseline, even months later.

Pitfall Guide

1. Treating Semantic Scores as Ground Truth

Semantic similarity measures vector proximity, not correctness. A score of 0.91 does not mean the new output is accurate; it only means the embedding space considers them close. Fix: Always pair semantic scores with exact/structural checks and human review for critical paths.

2. Applying to Non-Deterministic Workloads

Creative generation, brainstorming, or high-temperature sampling will naturally produce divergent outputs. Fixture replay will flag every run as a regression. Fix: Use rubric-based evaluation (prompt-eval-rubric) for creative tasks, reserving replay for deterministic classification, extraction, or formatting prompts.

3. Ignoring Inference Parameter Drift

Changing max_tokens or temperature between baseline and replay invalidates comparisons. The diff engine assumes identical inference parameters. Fix: Lock model configuration in the recorder and replayer. Validate config parity before execution.
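A config parity check can be as simple as a field-by-field comparison run before replay starts. This is a sketch under the assumption that both phases expose a config object shaped like the `modelConfig` shown earlier; `configMismatches` and `assertConfigParity` are hypothetical helpers, not part of `prompt-replay`.

```typescript
interface ModelConfig {
  provider: string;
  model: string;
  maxTokens: number;
  temperature: number;
}

// List every field that differs between the baseline run and the replay run.
function configMismatches(baseline: ModelConfig, replay: ModelConfig): string[] {
  const keys: (keyof ModelConfig)[] = ["provider", "model", "maxTokens", "temperature"];
  return keys
    .filter((key) => baseline[key] !== replay[key])
    .map((key) => `${key}: baseline=${baseline[key]} replay=${replay[key]}`);
}

// Fail fast so a drifted config never produces a misleading diff.
function assertConfigParity(baseline: ModelConfig, replay: ModelConfig): void {
  const mismatches = configMismatches(baseline, replay);
  if (mismatches.length > 0) {
    throw new Error(`Inference config drift: ${mismatches.join("; ")}`);
  }
}
```

Defining the config once and importing it into both the recorder and the replayer would remove the need for the check, but an explicit assertion still guards against fixtures recorded on another machine or branch.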

4. Fixture Staleness

Approved prompt changes must regenerate baselines. Running replay against outdated fixtures creates false positives and erodes trust in CI gates. Fix: Implement a --regenerate flag that overwrites fixtures only after manual approval or successful production rollout.
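When the prompt hash is embedded in the fixture filename, staleness can be detected mechanically before replay. The sketch below assumes filenames end in `_<hex hash>.jsonl`, as in the Step 1 example; `hashFromFixturePath` and `isFixtureStale` are hypothetical helpers.

```typescript
// Extract the hash segment from a path such as "fixtures/triage_ab12cd34ef56.jsonl".
// Returns null when the filename does not follow the "_<hex>.jsonl" convention.
function hashFromFixturePath(path: string): string | null {
  const match = /_([0-9a-f]+)\.jsonl$/.exec(path);
  return match ? match[1] : null;
}

// A fixture is stale when its embedded hash is not the hash of the approved prompt.
// Fixtures without a recognizable hash are treated as stale as well.
function isFixtureStale(path: string, approvedShortHash: string): boolean {
  return hashFromFixturePath(path) !== approvedShortHash;
}
```

A CI job could run this check first and point the author at the regeneration workflow instead of reporting a wall of diffs against an obsolete baseline.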

5. Overlooking Side Effects and Tool Calls

The replay engine captures LLM text output only. It does not execute tools, write to databases, or trigger webhooks. Fix: Use this approach for prompt-only validation. For full agent tracing, integrate with agentsnap or mock downstream dependencies during replay.

6. Testing Full Production Corpus

Running thousands of inputs defeats the purpose of fast iteration. Fix: Curate a representative sample of 30–50 inputs covering edge cases, common patterns, and known failure modes. Rotate the sample quarterly to prevent dataset rot.
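One way to keep a small sample representative is to draw from each category in turn, so rare cases are not crowded out by the most common ticket type. The following sketch assumes inputs are already labeled with a category; `LabeledInput` and `stratifiedSample` are hypothetical names, and the selection is deterministic (first items per category) rather than random.

```typescript
interface LabeledInput {
  text: string;
  category: string;
}

// Round-robin across categories until maxSamples is reached or inputs run out.
function stratifiedSample(inputs: LabeledInput[], maxSamples: number): LabeledInput[] {
  // Group inputs by category, preserving first-seen category order.
  const groups = new Map<string, LabeledInput[]>();
  for (const input of inputs) {
    const bucket = groups.get(input.category) ?? [];
    bucket.push(input);
    groups.set(input.category, bucket);
  }

  const queues = Array.from(groups.values());
  const sample: LabeledInput[] = [];
  for (let round = 0; sample.length < maxSamples; round++) {
    let added = false;
    for (const queue of queues) {
      if (round < queue.length && sample.length < maxSamples) {
        sample.push(queue[round]);
        added = true;
      }
    }
    if (!added) break; // every category is exhausted
  }
  return sample;
}
```

Known failure modes can be given their own category label so that at least one of each is guaranteed a slot before common cases fill the remainder.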

7. Hardcoding Thresholds Without Calibration

A universal 0.90 semantic threshold fails across domains. Medical extraction requires higher precision than marketing copy. Fix: Run a calibration phase on historical prompt changes. Plot score distributions and set thresholds at the 5th percentile of acceptable changes.
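The calibration step amounts to a percentile calculation over scores from changes that reviewers accepted. The sketch below uses the nearest-rank definition of a percentile; other definitions interpolate and will give slightly different values. `percentile` and `calibrateThreshold` are hypothetical helpers, and the input scores are assumed to have been collected from past approved prompt changes.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples to calibrate from");
  if (p <= 0 || p > 100) throw new Error("p must be in the range (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point error in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Threshold below which a score is lower than roughly 95% of previously accepted changes.
function calibrateThreshold(acceptedScores: number[]): number {
  return percentile(acceptedScores, 5);
}
```

With only a handful of historical changes, the 5th percentile collapses to the minimum observed score, so the threshold becomes more meaningful as the calibration set grows.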

Production Bundle

Action Checklist

Pin prompt versions using semantic versioning and content hashing before recording baselines
Curate a representative input sample covering edge cases, not just happy paths
Lock inference parameters (model, temperature, max_tokens) across baseline and replay phases
Calibrate semantic similarity thresholds using historical prompt changes before enforcing CI gates
Implement a baseline regeneration workflow tied to approved prompt merges
Pair semantic scores with exact/structural diffing to catch format regressions
Exclude tool-dependent or creative prompts from replay; route them to rubric evaluation instead

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Deterministic classification/extraction prompts	Local fixture replay	Fast, deterministic, git-traceable	Near-zero
Creative generation or high-temperature tasks	Rubric-based evaluation	Semantic diff fails on variable outputs	Low (single judge call per rubric)
Full agent workflows with tool execution	Agent trace capture (agentsnap)	Replay only captures text, not side effects	Medium (trace storage + comparison)
High-volume daily prompt changes	Sampled replay + CI gate	Full corpus replay is slow and noisy	Low (cached inputs, parallel replay)
Multi-turn conversational prompts	Session-aware recorder (roadmap)	Current replay assumes single-turn state	Low (pending native support)

Configuration Template

# prompt_regression.config.yaml
registry:
  storage_path: ./prompt_versions
  hash_algorithm: sha256

recorder:
  fixture_dir: ./fixtures
  model:
    provider: anthropic
    model: claude-sonnet-4-6
    max_tokens: 256
    temperature: 0.1
  sampling:
    strategy: stratified
    max_samples: 50

replayer:
  diff_modes:
    - exact
    - json
    - semantic
  thresholds:
    semantic_min: 0.88
    structural_strict: true

ci_gate:
  fail_on_semantic_drop: true
  fail_on_format_mismatch: true
  notify_channels:
    - slack
    - github_pr_comment
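The `semantic_min` and `structural_strict` settings suggest a gate that fails when either check trips. Here is a minimal sketch of such a gate, assuming replay results have already been reduced to a structural flag and a semantic score per input. The types and the `evaluateGate` function are hypothetical and only illustrate how the thresholds could combine.

```typescript
interface GateConfig {
  semanticMin: number;       // corresponds to thresholds.semantic_min
  structuralStrict: boolean; // corresponds to thresholds.structural_strict
}

interface ReplayResult {
  input: string;
  structuralMatch: boolean;
  semanticScore: number;
}

interface GateVerdict {
  passed: boolean;
  failures: string[];
}

// Collect every failure instead of stopping at the first, so the PR comment is complete.
function evaluateGate(results: ReplayResult[], config: GateConfig): GateVerdict {
  const failures: string[] = [];
  for (const result of results) {
    if (config.structuralStrict && !result.structuralMatch) {
      failures.push(`format mismatch: ${result.input}`);
    }
    if (result.semanticScore < config.semanticMin) {
      failures.push(
        `semantic drop (${result.semanticScore.toFixed(3)} < ${config.semanticMin}): ${result.input}`
      );
    }
  }
  return { passed: failures.length === 0, failures };
}
```

A CI script could exit non-zero when `passed` is false and forward `failures` to the configured notification channels.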

Quick Start Guide

1. Install dependencies: npm install prompt-replay prompt-template-version (or pip install for Python)
2. Initialize version registry: npx prompt-template-version init --path ./prompts
3. Record baseline: npx prompt-replay record --prompt ./prompts/triage_v1.json --input ./samples/inputs.jsonl --output ./fixtures/triage_baseline.jsonl
4. Update prompt text and version in registry
5. Run regression: npx prompt-replay replay --fixture ./fixtures/triage_baseline.jsonl --prompt ./prompts/triage_v2.json --threshold 0.88
6. Review diff output and commit fixture if changes are approved
