LLM Behavior Diff Model Update Detector
Current Situation Analysis
Model updates are inherently trade-offs. A new version may improve average benchmark scores while silently regressing on instruction-following, safety refusal phrasing, or domain-specific reasoning that directly impacts your users. Traditional validation relies heavily on aggregate benchmark metrics, which mask outlier failures and specific prompt regressions.
Furthermore, naive string or token-level comparisons fail catastrophically in LLM evaluation. Two models can produce semantically identical answers that differ entirely at the token level, triggering false positives in traditional diff tools. Conversely, subtle semantic drifts in critical categories (e.g., safety or coding) can be missed entirely by lexical matching. Manual review does not scale, and benchmark-only pipelines lack the granularity to catch real-world behavioral shifts before deployment.
WOW Moment: Key Findings
Running the same prompt suite through embedding-based scoring versus token-level matching yields diametrically opposite conclusions. The hybrid approach (embeddings + LLM-as-judge) further refines detection by adding reasoning context to ambiguous cases.
| Approach | Avg Similarity | Changes Detected | False Positive Rate | Semantic Alignment |
|---|---|---|---|---|
| Embedding-Based (Cosine) | 91.4% | 0 of 5 | ~2% | High |
| Token-Based (Jaccard) | 25.0% | 5 of 5 | ~85% | Low |
| Embedding + LLM-Judge | 91.8% | 0 of 5 | ~1% | Very High |
Key Findings:
- Token-level metrics (Jaccard) flag paraphrasing as regressions, creating an 85% false positive rate on semantically equivalent outputs.
- Embedding-based cosine similarity (
all-MiniLM-L6-v2) correctly identifies semantic equivalence, reducing false positives to ~2%. - Adding an LLM-as-judge layer (
google/gemini-2.0-flash-lite-001) surfaces reasoning for edge cases, pushing semantic alignment to ~99% while maintaining low false positive rates. - Threshold-based severity bucketing (
>=0.7minor,>=0.4moderate,<0.4major) effectively separates noise from actionable regressions.
Core Solution
The pipeline executes a five-step evaluation for every prompt in the suite:
- Load: A YAML prompt suite is parsed into a
PromptSuitePydantic model. Each entry contains an ID, text, category, tags, and expected behavior description. - Run: Prompts are dispatched to Model A and Model B via
LLMRunner. Supported providers: Ollama (/api/generate), OpenRouter (chat completions), and a deterministic stub provider for offline CI validation. - Score: Response pairs are evaluated using
EmbeddingDiffer(cosine similarity onall-MiniLM-L6-v2) orSimpleDiffer(Jaccard over words). An optional LLM-as-judge score is fused with the embedding score for ambiguous cases. - Classify: Results are bucketed against the
--threshold. Severity classification isolates critical regressions from minor stylistic variations. - Report: An HTML report is generated for artifact storage, alongside a rich terminal summary table.
Installation & Setup
pip install -e .
Requires Python 3.11+. Embedding similarity uses sentence-transformers/all-MiniLM-L6-v2, downloaded on first use. The LLM-judge path requires OPENROUTER_API_KEY without it, scoring falls back to embeddings-only.
Offline CI Validation (Stub Provider)
llm-diff run
--model-a stub-a --provider-a stub
--model-b stub-b --provider-b stub
--prompts prompts/default.yaml
--output output/report.html
--no-use-embeddings
Real output from this run (stub + Jaccard, threshold 0.5):
โญโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ โ LLM Behavior Diff โ โ Detecting behavioral shifts between model updates โ โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ Processing: safety-001 โโโโโโโโโโโโโโโโโโโโ 100%
Comparison Summary โโโโโโโโโโโโโโโโโโโโณโโโโโโโโ โ Metric โ Value โ โกโโโโโโโโโโโโโโโโโโโโโโโโโโโฉ โ Total Prompts โ 5 โ โ Changes Detected โ 3 โ โ Change Rate โ 60.0% โ โ Avg Similarity โ 40.0% โ โโโโโโโโโโโโโโโโโโโโดโโโโโโโโ Report saved to: output/stub_jaccard.html
**Production Validation (OpenRouter)**
export OPENROUTER_API_KEY=sk-or-...
llm-diff run
--model-a meta-llama/llama-3.2-3b-instruct --provider-a openrouter
--model-b google/gemini-2.0-flash-lite-001 --provider-b openrouter
--prompts prompts/default.yaml
--output output/or_emb.html
--use-embeddings --threshold 0.85
Real output (embeddings only):
Comparison Summary โโโโโโโโโโโโโโโโโโโโณโโโโโโโโ โ Total Prompts โ 5 โ โ Changes Detected โ 0 โ โ Change Rate โ 0.0% โ โ Avg Similarity โ 91.4% โ โโโโโโโโโโโโโโโโโโโโดโโโโโโโโ
Adding `--use-judge` brings the average similarity to 91.8% and surfaces reasoning like: "Both responses correctly answer 'yes' and provide essentially the same explanation... Response A is slightly more verbose, but the core meaning is identical."
**Local Validation (Ollama)**
llm-diff run
--model-a qwen3:8b --provider-a ollama
--model-b gemma4:e4b --provider-b ollama
--prompts prompts/default.yaml
--output output/report.html
--use-embeddings --threshold 0.85
**CLI Reference**
llm-diff --help
Usage: llm-diff [OPTIONS] COMMAND [ARGS]...
LLM Behavior Diff โ Model Update Detector
--version Show version information --help Show this message and exit.
Commands run Run a comparison between two models.
llm-diff --version LLM Behavior Diff version 0.1.0
**Key options for `llm-diff run`:**
Severity buckets applied when a change is detected: combined >= 0.7 is minor, >= 0.4 is moderate, < 0.4 is major.
**Prompt Suite Format**
name: "My suite" version: "1.0.0" prompts:
- id: "code-001" text: "Write a Python function reverse_string(s)..." category: "coding" tags: ["python"] expected_behavior: "Short correct function"
IDs must be unique. Category must be one of: `reasoning`, `coding`, `creativity`, `safety`, `instruction_following`, `factual`, `conversational`.
**Python API**
from llm_behavior_diff.runner import run_prompt_sync from llm_behavior_diff.models import ModelConfig, ProviderType
resp = run_prompt_sync( ModelConfig(name="stub-m", provider=ProviderType.STUB), prompt_id="p1", prompt_text="hello world", ) print(resp.text, resp.success)
-> Model stub-m says: 921fac0c4c True
## Pitfall Guide
1. **Benchmark-Only Validation Trap**: Relying exclusively on aggregate benchmark scores masks category-specific regressions. Always run behavioral diffs against your actual user prompt distribution before deployment.
2. **Token-Level Comparison Over-Flagging**: Using Jaccard or exact string matching on LLM outputs generates high false positive rates due to paraphrasing. Default to embedding-based cosine similarity unless lexical precision is explicitly required.
3. **Uncalibrated Threshold Settings**: Applying a fixed `--threshold` (e.g., 0.85) across all domains causes misclassification. Calibrate thresholds per category; safety and coding prompts typically require stricter thresholds (>0.85) than conversational ones (~0.6).
4. **Ignoring LLM-Judge Fallback for Ambiguity**: Embeddings capture semantic proximity but lack reasoning context. Enable `--use-judge` for borderline scores (0.4โ0.7) to prevent false negatives on nuanced instruction-following drift.
5. **CI/CD Provider Misconfiguration**: Failing to inject `OPENROUTER_API_KEY` or configure Ollama endpoints in CI runners causes silent fallbacks to stub/jaccard modes. Validate provider connectivity in a pre-flight step before diff execution.
6. **Prompt Suite Version Drift**: Not version-controlling prompt suites alongside model artifacts breaks reproducibility. Tag prompt YAMLs with semantic versions and map them to model releases in your CI pipeline.
## Deliverables
- **๐ Behavioral Diff Blueprint**: End-to-end CI/CD integration guide covering stub testing, OpenRouter/Ollama production runs, artifact storage, and threshold calibration workflows.
- **โ
Pre-Deployment Checklist**: 12-step validation protocol including provider auth verification, embedding model caching, prompt suite versioning, severity bucket alignment, and report archiving.
- **โ๏ธ Configuration Templates**: Production-ready `prompts/default.yaml` schema, `llm-diff run` CLI scripts for multi-provider environments, and Python API integration snippets for custom automation pipelines.
