antees that every fixture file maps to an exact prompt state. This eliminates ambiguity when reviewing historical baselines or debugging regressions.
import { PromptRegistry } from 'prompt-template-version';

const templateRegistry = new PromptRegistry();

// Register the current production prompt as the baseline version.
const baselineId = templateRegistry.register({
  name: 'customer_support_triage',
  version: '1.2.0',
  text: 'Classify the support ticket into one of three categories: billing, technical, or account. Return only the category name.'
});

// Register the candidate revision under the same name with a bumped version.
const updatedId = templateRegistry.register({
  name: 'customer_support_triage',
  version: '1.3.0',
  text: 'Analyze the support ticket and assign it to exactly one category: billing, technical, or account. Output only the category name in lowercase.'
});

// Embed the baseline prompt's hash in the fixture path for traceability.
const baselineFixture = `fixtures/triage_${baselineId.shortHash}.jsonl`;
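How the registry derives `shortHash` is not shown here. As a rough illustration of the idea, the sketch below builds a fixture path from a hash of the prompt text. Both `fnv1aHex` and `fixturePathFor` are hypothetical helpers, and 32-bit FNV-1a is used only as a dependency-free stand-in for a cryptographic hash such as SHA-256, which a real registry would more plausibly use.

```typescript
// Hypothetical helper: 32-bit FNV-1a over UTF-16 code units, hex-encoded.
// Stand-in for a cryptographic hash; it is NOT collision-resistant.
function fnv1aHex(text: string): string {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime and keep the result as an unsigned 32-bit value.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Hypothetical helper: any change to the prompt text yields a different path.
function fixturePathFor(name: string, promptText: string): string {
  return `fixtures/${name}_${fnv1aHex(promptText)}.jsonl`;
}
```

Because the suffix depends only on the prompt text, two checkouts that contain the same prompt resolve to the same fixture file, and an edited prompt cannot silently reuse an old baseline.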
Step 2: Record Baseline Outputs
The recording phase intercepts LLM calls, captures inputs and outputs, and writes them to a JSONL file. JSONL is chosen for its git-friendly line-by-line structure, easy streaming support, and compatibility with standard Unix tools. Each line represents a single inference call, enabling granular diffing later.
import { RegressionRecorder } from 'prompt-replay';

// Inference parameters are recorded with the baseline and must match at replay time.
const recorder = new RegressionRecorder({
  outputPath: baselineFixture,
  modelConfig: {
    provider: 'anthropic',
    model: 'claude-sonnet-4-6',
    maxTokens: 128,
    temperature: 0.1
  }
});

// One representative input per expected category.
const testInputs = [
  'My invoice shows a duplicate charge for last month.',
  'The API returns 503 when I call /v2/users.',
  'I cannot reset my password using the email link.'
];

async function captureBaseline() {
  for (const input of testInputs) {
    await recorder.capture({
      prompt: templateRegistry.resolve('customer_support_triage', '1.2.0'),
      userMessage: input
    });
  }
  // Write buffered records to the JSONL fixture before reporting.
  await recorder.flush();
  console.log(`Baseline captured: ${recorder.entryCount} records`);
}
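The on-disk fixture schema is not specified here. To make the one-record-per-line idea concrete, the sketch below serializes and parses JSONL using an assumed record shape; `FixtureRecord`, `toJsonl`, and `fromJsonl` are hypothetical names, and the sample outputs are illustrative rather than captured model responses.

```typescript
// Hypothetical record shape; the real prompt-replay fixture schema may differ.
interface FixtureRecord {
  promptVersion: string;
  input: string;
  output: string;
}

// One JSON object per line, newline-terminated, so git diffs stay line-level.
function toJsonl(records: FixtureRecord[]): string {
  return records.map((r) => JSON.stringify(r)).join("\n") + "\n";
}

// Parse JSONL back into records, skipping blank lines.
function fromJsonl(text: string): FixtureRecord[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as FixtureRecord);
}

// Illustrative records only.
const sample: FixtureRecord[] = [
  { promptVersion: "1.2.0", input: "My invoice shows a duplicate charge for last month.", output: "billing" },
  { promptVersion: "1.2.0", input: "The API returns 503 when I call /v2/users.", output: "technical" },
];
const serialized = toJsonl(sample);
```

Since each inference call occupies exactly one line, a changed output shows up as a single modified line in review.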
Step 3: Replay and Diff
The replay phase loads the fixture, injects the updated prompt, and executes the same inputs. The diff engine provides three comparison modes: exact string matching, structural JSON comparison, and semantic similarity, computed as the cosine similarity of output embeddings (higher means closer). Semantic scoring acts as a drift signal, not a quality verdict.
import { RegressionReplayer, DiffEngine } from 'prompt-replay';

// Same inference parameters as the recorder so only the prompt text varies.
const replayer = new RegressionReplayer({
  fixturePath: baselineFixture,
  modelConfig: {
    provider: 'anthropic',
    model: 'claude-sonnet-4-6',
    maxTokens: 128,
    temperature: 0.1
  }
});

async function runRegression() {
  // Re-run every recorded input against the updated prompt version.
  const results = await replayer.execute({
    prompt: templateRegistry.resolve('customer_support_triage', '1.3.0'),
    inputField: 'userMessage'
  });

  const diffEngine = new DiffEngine();
  for (const record of results) {
    const exactMatch = diffEngine.compare(record.baseline, record.current, 'exact');
    const structuralMatch = diffEngine.compare(record.baseline, record.current, 'json');
    const semanticScore = diffEngine.compare(record.baseline, record.current, 'semantic');
    console.log(`Input: ${record.input.slice(0, 40)}...`);
    console.log(`Exact: ${exactMatch} | Structural: ${structuralMatch} | Semantic: ${semanticScore.toFixed(3)}`);
  }
}
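The internals of `DiffEngine` are not shown. As a minimal sketch of what the three comparison modes could compute, the functions below are hypothetical stand-ins, not the library's implementation. The semantic mode assumes embedding vectors have already been obtained from some embedding model.

```typescript
// Exact mode: strict string equality.
function exactMatch(a: string, b: string): boolean {
  return a === b;
}

// Canonical form of a parsed JSON value with object keys sorted,
// so that key order does not affect the comparison.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map((v) => canonical(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`).join(",")}}`;
  }
  return JSON.stringify(value);
}

// Structural mode: both sides must parse as JSON and be equal ignoring key order.
function structuralMatch(a: string, b: string): boolean {
  try {
    return canonical(JSON.parse(a)) === canonical(JSON.parse(b));
  } catch {
    return false; // at least one side is not valid JSON
  }
}

// Semantic mode: cosine similarity between two embedding vectors.
function cosineSimilarity(u: number[], v: number[]): number {
  if (u.length !== v.length || u.length === 0) {
    throw new Error("vectors must have the same non-zero length");
  }
  let dot = 0;
  let normU = 0;
  let normV = 0;
  for (let i = 0; i < u.length; i++) {
    dot += u[i] * v[i];
    normU += u[i] * u[i];
    normV += v[i] * v[i];
  }
  if (normU === 0 || normV === 0) return 0; // treat a zero vector as dissimilar
  return dot / (Math.sqrt(normU) * Math.sqrt(normV));
}
```

In this sketch a bare category name such as `billing` is not valid JSON, so structural comparison returns false for plain-text outputs; for a classifier like the triage prompt, the exact and semantic modes carry the signal.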
Architecture Decisions and Rationale
- JSONL over SQLite/Parquet: JSONL enables line-level git diffs, easy filtering with jq, and zero database dependencies. It aligns with how LLM providers stream responses.
- Separate Recording and Replay: Decoupling phases prevents state leakage. Baselines remain immutable until explicitly regenerated, ensuring CI gates compare against a stable reference.
- Semantic Similarity as Signal: Embedding-based scoring captures conceptual drift without requiring expensive judge models. A threshold of 0.85–0.92 typically indicates acceptable variance, but must be calibrated per workload.
- Version Hashing in Filenames: Embedding the prompt hash in the fixture path guarantees traceability. Teams can instantly verify which prompt text generated a given baseline, even months later.
Pitfall Guide
1. Treating Semantic Scores as Ground Truth
Semantic similarity measures vector proximity, not correctness. A score of 0.91 does not mean the new output is accurate; it only means the embedding space considers them close. Fix: Always pair semantic scores with exact/structural checks and human review for critical paths.
2. Applying to Non-Deterministic Workloads
Creative generation, brainstorming, or high-temperature sampling will naturally produce divergent outputs. Fixture replay will flag every run as a regression. Fix: Use rubric-based evaluation (prompt-eval-rubric) for creative tasks, reserving replay for deterministic classification, extraction, or formatting prompts.
3. Ignoring Token Limit Drift
Changing max_tokens or temperature between baseline and replay invalidates comparisons. The diff engine assumes identical inference parameters. Fix: Lock model configuration in the recorder and replayer. Validate config parity before execution.
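A parity check of this kind can be very small. The sketch below is one possible shape, assuming a `ModelConfig` type that mirrors the `modelConfig` objects passed to the recorder and replayer; `configMismatches` and `assertConfigParity` are hypothetical helpers, not part of prompt-replay.

```typescript
// Hypothetical config shape mirroring the modelConfig objects in the examples.
interface ModelConfig {
  provider: string;
  model: string;
  maxTokens: number;
  temperature: number;
}

// Returns the names of the fields that differ between the two configs.
function configMismatches(baseline: ModelConfig, replay: ModelConfig): string[] {
  const keys: (keyof ModelConfig)[] = ["provider", "model", "maxTokens", "temperature"];
  return keys.filter((key) => baseline[key] !== replay[key]);
}

// Throws before any replay runs if the inference parameters have drifted.
function assertConfigParity(baseline: ModelConfig, replay: ModelConfig): void {
  const diffs = configMismatches(baseline, replay);
  if (diffs.length > 0) {
    throw new Error(`Model config drift between baseline and replay: ${diffs.join(", ")}`);
  }
}
```

Defining the config once and passing the same object to both phases avoids the problem entirely; the check is a backstop for cases where the two are loaded separately.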
4. Fixture Staleness
Approved prompt changes must regenerate baselines. Running replay against outdated fixtures creates false positives and erodes trust in CI gates. Fix: Implement a --regenerate flag that overwrites fixtures only after manual approval or successful production rollout.
5. Expecting Replay to Capture Side Effects
The replay engine captures LLM text output only. It does not execute tools, write to databases, or trigger webhooks. Fix: Use this approach for prompt-only validation. For full agent tracing, integrate with agentsnap or mock downstream dependencies during replay.
6. Testing Full Production Corpus
Running thousands of inputs defeats the purpose of fast iteration. Fix: Curate a representative sample of 30–50 inputs covering edge cases, common patterns, and known failure modes. Rotate the sample quarterly to prevent dataset rot.
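One way to keep a small sample representative is to draw from each category of input in turn. The sketch below assumes every candidate input already carries a stratum label; `LabeledInput` and `stratifiedSample` are hypothetical, and the round-robin selection is deliberately deterministic so the sample is stable between runs.

```typescript
// Hypothetical input shape: each candidate carries a stratum label such as
// "common", "edge_case", or "known_failure".
interface LabeledInput {
  text: string;
  stratum: string;
}

// Round-robin across strata so every stratum is represented before any
// stratum contributes a second item. Input order is preserved within a stratum.
function stratifiedSample(inputs: LabeledInput[], maxSamples: number): LabeledInput[] {
  const buckets = new Map<string, LabeledInput[]>();
  for (const item of inputs) {
    const bucket = buckets.get(item.stratum) ?? [];
    bucket.push(item);
    buckets.set(item.stratum, bucket);
  }
  const queues = Array.from(buckets.values());
  const picked: LabeledInput[] = [];
  for (let round = 0; picked.length < maxSamples; round++) {
    let addedThisRound = false;
    for (const queue of queues) {
      if (round < queue.length && picked.length < maxSamples) {
        picked.push(queue[round]);
        addedThisRound = true;
      }
    }
    if (!addedThisRound) break; // every stratum is exhausted
  }
  return picked;
}
```

Rare strata such as known failure modes are guaranteed a slot this way, whereas uniform random sampling from a large corpus would mostly return common cases.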
7. Hardcoding Thresholds Without Calibration
A universal 0.90 semantic threshold fails across domains. Medical extraction requires higher precision than marketing copy. Fix: Run a calibration phase on historical prompt changes. Plot score distributions and set thresholds at the 5th percentile of acceptable changes.
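The calibration step reduces to a percentile computation over scores from prompt changes that reviewers previously accepted. The sketch below uses linear interpolation between closest ranks; the `percentile` helper is hypothetical and the score values are illustrative, not measured data.

```typescript
// Percentile with linear interpolation between closest ranks (p in [0, 100]).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("need at least one score");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = (p / 100) * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  const weight = rank - lower;
  return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}

// Semantic scores from past prompt changes that were accepted (illustrative values).
const acceptedScores = [0.97, 0.93, 0.95, 0.91, 0.99, 0.94, 0.96, 0.92, 0.98, 0.90, 0.95];

// Roughly 95% of previously accepted changes scored at or above this value.
const semanticMin = percentile(acceptedScores, 5);
```

With only a handful of historical changes the 5th percentile is dominated by the lowest one or two scores, so the threshold is worth recomputing as more accepted changes accumulate.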
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Deterministic classification/extraction prompts | Local fixture replay | Fast, deterministic, git-traceable | Near-zero |
| Creative generation or high-temperature tasks | Rubric-based evaluation | Semantic diff fails on variable outputs | Low (single judge call per rubric) |
| Full agent workflows with tool execution | Agent trace capture (agentsnap) | Replay only captures text, not side effects | Medium (trace storage + comparison) |
| High-volume daily prompt changes | Sampled replay + CI gate | Full corpus replay is slow and noisy | Low (cached inputs, parallel replay) |
| Multi-turn conversational prompts | Session-aware recorder (roadmap) | Current replay assumes single-turn state | Low (pending native support) |
Configuration Template
# prompt_regression.config.yaml
registry:
  storage_path: ./prompt_versions
  hash_algorithm: sha256
recorder:
  fixture_dir: ./fixtures
  model:
    provider: anthropic
    model: claude-sonnet-4-6
    max_tokens: 256
    temperature: 0.1
  sampling:
    strategy: stratified
    max_samples: 50
replayer:
  diff_modes:
    - exact
    - json
    - semantic
  thresholds:
    semantic_min: 0.88
    structural_strict: true
ci_gate:
  fail_on_semantic_drop: true
  fail_on_format_mismatch: true
  notify_channels:
    - slack
    - github_pr_comment
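How the tool interprets these keys is not documented here. The sketch below shows one plausible reading, in which `semantic_min` is a floor on the semantic score and the two `ci_gate` flags decide which violations count as failures. `GateConfig`, `ReplayResult`, and `evaluateGate` are hypothetical, and the field meanings are assumed from the key names alone.

```typescript
// Hypothetical types; field meanings are assumed from the template's key names.
interface GateConfig {
  semanticMin: number;
  failOnSemanticDrop: boolean;
  failOnFormatMismatch: boolean;
}

interface ReplayResult {
  input: string;
  structuralMatch: boolean;
  semanticScore: number;
}

// Collects one message per violated check; the gate passes when the list is empty.
function evaluateGate(results: ReplayResult[], config: GateConfig): string[] {
  const failures: string[] = [];
  for (const r of results) {
    if (config.failOnSemanticDrop && r.semanticScore < config.semanticMin) {
      failures.push(`semantic score ${r.semanticScore.toFixed(3)} below ${config.semanticMin}: ${r.input}`);
    }
    if (config.failOnFormatMismatch && !r.structuralMatch) {
      failures.push(`format mismatch: ${r.input}`);
    }
  }
  return failures;
}
```

A CI script built on this would exit non-zero when the returned list is non-empty and post the messages to the configured notification channels.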
Quick Start Guide
- Install dependencies:
npm install prompt-replay prompt-template-version (or pip install for Python)
- Initialize version registry:
npx prompt-template-version init --path ./prompts
- Record baseline:
npx prompt-replay record --prompt ./prompts/triage_v1.json --input ./samples/inputs.jsonl --output ./fixtures/triage_baseline.jsonl
- Update prompt text and version in registry
- Run regression:
npx prompt-replay replay --fixture ./fixtures/triage_baseline.jsonl --prompt ./prompts/triage_v2.json --threshold 0.88
- Review diff output and commit fixture if changes are approved