and output stability.
Core Solution
The Prompt Replay Protocol consists of three phases: Capture, Replay, and Diff. The implementation decouples the recording mechanism from the comparison logic, allowing flexible integration into CI/CD pipelines.
Architecture Decisions
- JSONL Storage: The capture corpus is stored as JSON Lines. This format is append-only, stream-friendly, and human-readable. It avoids database dependencies and allows easy inspection with standard tools.
- Functional Replay: The replay engine accepts a callable target function. This abstraction allows testing against any model provider, local inference server, or mocked endpoint without modifying the core logic.
- Strategy-Based Diffing: Different tasks require different validation strategies. The protocol supports byte-level matching for deterministic outputs, structural diffing for JSON/tool-use, and token-overlap heuristics for text consistency.
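To make the storage decision concrete, here is a minimal sketch of an append-only JSONL corpus using only the Python standard library. The function names (`append_interaction`, `iter_interactions`) and the record fields are illustrative assumptions, not the API of any particular framework.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def append_interaction(path: Path, request: str, response: str,
                       metadata: Dict[str, Any]) -> None:
    """Append one interaction as a single JSON line (hypothetical record layout)."""
    record = {"request": request, "response": response, "metadata": metadata}
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" never rewrites earlier lines, which keeps the corpus append-only.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def iter_interactions(path: Path) -> Iterator[Dict[str, Any]]:
    """Stream records one line at a time so the corpus need not fit in memory."""
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)
```

Because `json.dumps` escapes newlines inside strings, each record stays on one physical line even when a response spans several lines.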
Implementation
Below is a reference implementation using a custom validation framework. The code demonstrates the capture and replay workflow; load_production_requests, current_llm_client, and new_llm_client are placeholders for your own request loader and model clients.
Phase 1: Capture Baseline Interactions
```python
from llm_validation import InteractionCapture, ComparisonStrategy

# Initialize capture session
capture = InteractionCapture(
    output_path="./corpus/baseline_interactions.jsonl",
    session_id="prod-v2-capture"
)

# Record real production traffic or test suite
def record_baseline():
    requests = load_production_requests()
    with capture.session():
        for req in requests:
            # Invoke current model
            response = current_llm_client.generate(req)
            # Log interaction
            capture.log(
                request=req,
                response=response,
                metadata={"model": "claude-3-5-sonnet"}
            )
    print(f"Captured {capture.count()} interactions.")
```
Phase 2: Replay Against Candidate Model
```python
from llm_validation import ComparisonStrategy, ModelComparator, ReplayReport

def validate_migration():
    comparator = ModelComparator(
        source_path="./corpus/baseline_interactions.jsonl"
    )

    # Define target function for candidate model
    def candidate_invoke(prompt):
        return new_llm_client.generate(prompt)

    # Execute replay with structural diffing
    report: ReplayReport = comparator.run(
        target_fn=candidate_invoke,
        strategy=ComparisonStrategy.STRUCTURAL_JSON,
        tolerance_threshold=0.95
    )

    # Analyze results: show at most the first five regressions
    if report.has_failures():
        print(f"Migration blocked: {report.failure_count} regressions detected.")
        for delta in report.deltas[:5]:
            print(f"  - Request: {delta.request[:50]}...")
            print(f"    Diff: {delta.diff_summary}")
        return False

    print("Migration approved: No structural regressions.")
    return True
```
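The idea of a replay engine that accepts any callable can be sketched in a few lines. This is an illustration under stated assumptions, not the internals of the framework used above: `Delta`, `replay`, and the `is_match` parameter are hypothetical names, and the corpus is an in-memory list rather than a JSONL file.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Delta:
    request: str
    baseline: str
    candidate: str


def replay(records: Iterable[Dict[str, str]],
           target_fn: Callable[[str], str],
           is_match: Callable[[str, str], bool]) -> List[Delta]:
    """Re-send each recorded request to target_fn and collect mismatches."""
    deltas = []
    for rec in records:
        candidate = target_fn(rec["request"])
        if not is_match(rec["response"], candidate):
            deltas.append(Delta(rec["request"], rec["response"], candidate))
    return deltas


# Usage with a mocked endpoint standing in for a real model client.
corpus = [
    {"request": "ping", "response": "pong"},
    {"request": "2+2", "response": "4"},
]


def mock_model(prompt: str) -> str:
    return "pong" if prompt == "ping" else "five"


regressions = replay(corpus, mock_model, lambda a, b: a == b)
```

Because the engine only sees a function from prompt to response, swapping a hosted model for a local server or a mock requires no change to the loop itself.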
Phase 3: Comparison Strategies
The protocol supports three comparison strategies, each suited to different output types:
```python
from enum import Enum

class ComparisonStrategy(Enum):
    BYTE_MATCH = "byte_match"
    STRUCTURAL_JSON = "structural_json"
    TOKEN_OVERLAP = "token_overlap"
```
BYTE_MATCH: Performs exact string comparison. Use this for deterministic tasks like code generation or regex extraction where output must be identical.
STRUCTURAL_JSON: Parses responses as JSON and computes a structural diff. This is critical for agents using tool calls or structured outputs. It detects missing fields, type changes, or schema violations even if values differ slightly.
TOKEN_OVERLAP: Calculates Jaccard similarity on token sets (split by whitespace and punctuation). This heuristic flags significant wording changes in freeform text. Note: This is not embedding-based; it measures lexical overlap, not semantic meaning.
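The token-overlap heuristic is small enough to show in full. The following is a minimal sketch of Jaccard similarity over token sets; the lowercasing step and the `\w+` tokenizer are assumptions made for the example, and the function names are hypothetical.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    # Assumption: lowercase the text and treat any run of non-word
    # characters (whitespace, punctuation) as a separator.
    return set(re.findall(r"\w+", text.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the union."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


score = jaccard_similarity("The cat sat.", "the cat slept")  # 2 shared / 4 total = 0.5
```

The lexical nature of the score shows up quickly: "refund approved" and "refund not approved" share two of three tokens, so they score about 0.67 despite opposite meanings.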
Rationale for Choices
- Why STRUCTURAL_JSON for agents? Agents rely on JSON schemas for tool invocation. A model that changes field names or nesting breaks the agent loop. Structural diffing catches this immediately.
- Why Jaccard over embeddings? Embedding-based comparison requires an additional model call per comparison, adding latency and cost. Jaccard similarity provides a zero-dependency heuristic that is sufficient for detecting style drift or major content changes.
- Why session isolation? Recording sessions with unique IDs prevents corpus contamination. Teams can maintain multiple baselines for different prompt versions or model families.
Pitfall Guide
1. Streaming Blindness
- Explanation: If the capture mechanism records streaming chunks, the replay corpus will contain fragmented responses. Comparing chunks against full responses yields false positives.
- Fix: Ensure the capture layer buffers the stream and records only the complete response. The replay engine expects atomic interactions.
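One way to buffer a stream without delaying delivery to the caller is a pass-through generator that logs only once the stream is exhausted. This is a sketch; `capture_stream` and its `on_complete` callback are hypothetical names, and the chunk list stands in for a provider's streaming iterator.

```python
from typing import Callable, Iterable, Iterator, List


def capture_stream(chunks: Iterable[str],
                   on_complete: Callable[[str], None]) -> Iterator[str]:
    """Pass chunks through to the caller, then hand the joined text to on_complete."""
    parts: List[str] = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk
    # Reached only when the stream is fully consumed, so a partial
    # response is never written to the corpus.
    on_complete("".join(parts))


# Usage with a stand-in for a streaming response.
logged: List[str] = []
fake_stream = iter(['{"tool": "sea', 'rch", "args": ', '{"q": "weather"}}'])
for _ in capture_stream(fake_stream, logged.append):
    pass  # forward each chunk to the client here
```

If the consumer abandons the stream early, nothing is logged, which matches the requirement that the corpus contain only atomic interactions.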
2. Semantic Illusion
- Explanation: Relying on TOKEN_OVERLAP for semantic equivalence is risky. Two responses can have low token overlap but identical meaning, or high overlap with different intent.
- Fix: Use TOKEN_OVERLAP only for style consistency checks. For meaning validation, combine replay with a rubric-based evaluator or manual review of flagged items.
3. Threshold Misconfiguration
- Explanation: Setting a fixed tolerance threshold (e.g., 0.85) without calibration leads to either false alarms or missed regressions.
- Fix: Calibrate thresholds using a golden dataset. Run the replay against the same model to establish a baseline similarity score, then set the threshold slightly below that value.
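The calibration step can be sketched as follows. The choice of the 10th percentile as the baseline and the 0.02 margin are assumptions made for illustration; the text only requires that the threshold sit slightly below the similarity observed when the same model is replayed against its own corpus.

```python
import statistics
from typing import Sequence


def calibrate_threshold(self_replay_scores: Sequence[float],
                        margin: float = 0.02) -> float:
    """Derive a threshold from same-model replay scores (hypothetical helper)."""
    if len(self_replay_scores) < 2:
        raise ValueError("need at least two scores to calibrate")
    # The first of the nine decile cut points approximates the 10th percentile,
    # which is less sensitive to a single outlier than the minimum.
    low_decile = statistics.quantiles(self_replay_scores, n=10)[0]
    return max(0.0, low_decile - margin)


# Scores from replaying the current model against its own baseline (made-up values).
scores = [0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99]
threshold = calibrate_threshold(scores)
```

A threshold derived this way reflects the model's own run-to-run variance, so a candidate is flagged only when it drifts further than the baseline model drifts from itself.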
4. Creative Regression Testing
- Explanation: Applying replay protocols to creative generation tasks (e.g., copywriting, brainstorming) is counterproductive. These tasks require variance, and diffing will flag all outputs as regressions.
- Fix: Scope replay testing to deterministic, structured, or consistency-critical tasks. Exclude creative endpoints from the regression suite.
5. Environment Leakage
- Explanation: Captured interactions may contain sensitive data, PII, or internal system prompts. Storing this in plain JSONL poses a security risk.
- Fix: Implement sanitization hooks in the capture layer to redact sensitive fields before writing to disk. Encrypt the corpus at rest if required.
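A sanitization hook can be as simple as a function applied to each record before it is written. The sketch below is illustrative only: the two regular expressions, the `sk-` key format, and the `system_prompt` field name are assumptions, and real deployments usually need a broader set of patterns.

```python
import re
from typing import Any, Dict

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical secret format: "sk-" followed by 16 or more alphanumerics.
API_KEY_RE = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")


def redact_text(text: str) -> str:
    """Replace e-mail addresses and key-like strings with fixed markers."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return API_KEY_RE.sub("[REDACTED_KEY]", text)


def sanitize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with top-level strings redacted and the system prompt dropped.

    Nested values (for example inside metadata) are passed through unchanged
    in this sketch.
    """
    return {
        key: redact_text(value) if isinstance(value, str) else value
        for key, value in record.items()
        if key != "system_prompt"
    }


raw = {
    "request": "Email jane.doe@example.com with key sk-abcdefghijklmnop1234",
    "response": "ok",
    "system_prompt": "internal instructions",
}
clean = sanitize_record(raw)
```

Running the hook before the write, rather than as a later cleanup pass, means unredacted data never reaches disk.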
6. Async Bottlenecks
- Explanation: Sequential replay of large corpora can be slow, blocking CI pipelines.
- Fix: Parallelize the target function invocation. Use async concurrency or thread pools within the replay engine to execute calls concurrently, reducing total validation time.
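For I/O-bound model calls, a thread pool from the standard library is often enough. The sketch below assumes the target function is safe to call from multiple threads; `replay_parallel` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def replay_parallel(prompts: Sequence[str],
                    target_fn: Callable[[str], str],
                    max_workers: int = 10) -> List[str]:
    """Call target_fn concurrently; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of its inputs, so each result
        # lines up with the baseline record at the same index.
        return list(pool.map(target_fn, prompts))


# Usage with a stand-in target function.
outputs = replay_parallel(["a", "b", "c"], lambda p: p.upper(), max_workers=2)
```

Provider rate limits may cap the useful degree of parallelism, so the worker count is worth tuning rather than maximizing.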
7. Tool Schema Drift
- Explanation: Model updates may change tool names, parameter types, or required fields. Text-based diffing misses these structural changes.
- Fix: Always use STRUCTURAL_JSON for agent workflows. Validate against a JSON Schema definition to enforce strict compliance.
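To show what a structural comparison looks for, here is a minimal standard-library sketch that compares the shape of two JSON payloads (paths and types) while ignoring values. It is not a JSON Schema validator and is not the framework's implementation; `shape` and `structural_diff` are hypothetical names, and mixed-type arrays are only approximated.

```python
import json
from typing import Any, Dict, List


def shape(value: Any, path: str = "$") -> Dict[str, str]:
    """Map each JSON path to the type name found there, ignoring values."""
    if isinstance(value, dict):
        out = {path: "object"}
        for key, child in value.items():
            out.update(shape(child, f"{path}.{key}"))
        return out
    if isinstance(value, list):
        out = {path: "array"}
        for child in value:
            out.update(shape(child, f"{path}[]"))
        return out
    return {path: type(value).__name__}


def structural_diff(baseline_json: str, candidate_json: str) -> List[str]:
    """List missing paths, unexpected paths, and type changes."""
    try:
        candidate = json.loads(candidate_json)
    except json.JSONDecodeError:
        return ["candidate is not valid JSON"]
    base, cand = shape(json.loads(baseline_json)), shape(candidate)
    problems = [f"missing: {p}" for p in sorted(base.keys() - cand.keys())]
    problems += [f"unexpected: {p}" for p in sorted(cand.keys() - base.keys())]
    problems += [
        f"type changed at {p}: {base[p]} -> {cand[p]}"
        for p in sorted(base.keys() & cand.keys())
        if base[p] != cand[p]
    ]
    return problems


baseline = '{"tool": "search", "args": {"query": "x", "limit": 5}}'
drifted = '{"tool": "search", "args": {"q": "y", "limit": "5"}}'
issues = structural_diff(baseline, drifted)
```

Here the renamed parameter and the integer that became a string are both reported, while a candidate that differs only in values produces an empty list.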
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Strategy | Why | Cost Impact |
|---|---|---|---|
| Agent Tool Calls | STRUCTURAL_JSON | Ensures schema integrity and tool compatibility. | Low |
| Code Generation | BYTE_MATCH | Deterministic output required; exact match expected. | None |
| Summarization | TOKEN_OVERLAP | Detects major content omission or style drift. | Medium |
| Creative Copy | Skip Replay | Variance is expected; diffing yields noise. | N/A |
| Regex Extraction | BYTE_MATCH | Output must match pattern exactly. | None |
Configuration Template
Use this template to configure the replay suite for a CI pipeline.
```yaml
# replay_config.yaml
capture:
  output_path: "./corpus/baseline.jsonl"
  session_id: "prod-v2"
replay:
  source_path: "./corpus/baseline.jsonl"
  target_model: "claude-sonnet-4-6"
  strategies:
    - task: "tool_use"
      strategy: "STRUCTURAL_JSON"
      threshold: 0.95
    - task: "summarization"
      strategy: "TOKEN_OVERLAP"
      threshold: 0.80
    - task: "code_gen"
      strategy: "BYTE_MATCH"
      threshold: 1.0
pipeline:
  parallelism: 10
  timeout_seconds: 300
  fail_on_regression: true
```
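Once the template is parsed, the replay suite needs to look up the strategy and threshold for each task. The sketch below assumes the strategies list is nested under the replay key and uses a literal dictionary in place of a YAML parser's output; `resolve_strategy` is a hypothetical helper.

```python
from typing import Any, Dict, Tuple

# Mirrors the strategies section of the template as a YAML parser
# would typically load it into Python objects.
CONFIG: Dict[str, Any] = {
    "replay": {
        "strategies": [
            {"task": "tool_use", "strategy": "STRUCTURAL_JSON", "threshold": 0.95},
            {"task": "summarization", "strategy": "TOKEN_OVERLAP", "threshold": 0.80},
            {"task": "code_gen", "strategy": "BYTE_MATCH", "threshold": 1.0},
        ]
    }
}


def resolve_strategy(config: Dict[str, Any], task: str) -> Tuple[str, float]:
    """Return (strategy, threshold) for a task, failing loudly if unconfigured."""
    for entry in config["replay"]["strategies"]:
        if entry["task"] == task:
            return entry["strategy"], float(entry["threshold"])
    raise KeyError(f"no replay strategy configured for task {task!r}")


strategy, threshold = resolve_strategy(CONFIG, "summarization")
```

Raising on an unknown task, instead of falling back to a default, keeps an unconfigured endpoint from silently passing the regression gate.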
Quick Start Guide
- Install Dependencies: Add the validation library to your project dependencies.
- Instrument Capture: Wrap your model invocation code with the InteractionCapture context manager to record interactions.
- Run Replay: Execute the ModelComparator against your candidate model using the recorded corpus.
- Inspect Report: Review the diff report for structural or lexical regressions.
- Deploy: Proceed with migration only if the replay report passes all configured thresholds.