# RAG Series (9): When RAG Gives Bad Answers – Root Cause Diagnosis with RAGAS
## Current Situation Analysis
Deployed RAG systems frequently exhibit subtle degradation where users report answers that "aren't quite right." Traditional engineering responses involve iterative, intuition-driven tweaks: adjusting prompts, swapping embedding models, or modifying retrieval parameters. This approach suffers from critical failure modes:
- Confounded Variables: Multiple changes are applied simultaneously, making it impossible to isolate which modification actually improved or degraded performance.
- Subjective Validation: "Feels off" lacks quantifiable thresholds, leading to regression when new edge cases emerge.
- Root Cause Ambiguity: Without metric decomposition, engineers cannot distinguish between upstream retrieval failures (missing context) and downstream generation failures (hallucination or format drift).
- Inefficient Debugging: Traditional manual inspection scales poorly.

RAGAS metrics transform subjective feedback into a deterministic diagnostic pipeline, replacing guesswork with data-driven root cause isolation.
## WOW Moment: Key Findings
By deliberately inducing three classic failure modes and measuring them against a baseline, RAGAS metrics reveal precise diagnostic signatures. The decision tree logic (`context_recall` → `faithfulness` → `answer_relevancy`) isolates failure stages with near-perfect accuracy.
| Approach | context_recall | faithfulness | answer_relevancy | context_precision |
|---|---|---|---|---|
| Baseline | 0.625 | 0.829 | 0.502 | 0.583 |
| Problem 1: Low Recall | 0.250 | 0.750 | 0.191 | 0.375 |
| Problem 2: Hallucination | 0.625 | 0.320 | 0.487 | 0.583 |
| Problem 3: Off-Topic | 0.613 | 0.817 | 0.183 | 0.550 |
Key Findings:
- Retrieval Failure Signature: Sharp drop in `context_recall` (0.625 → 0.250) with collateral damage to `context_precision`. Indicates semantic fragmentation or insufficient `top_k`.
- Generation Failure Signature: `context_recall` remains stable while `faithfulness` plummets (0.829 → 0.320). Indicates prompt leakage allowing external knowledge injection.
- Format/Relevancy Failure Signature: `faithfulness` and `context_recall` stay normal, but `answer_relevancy` crashes (0.502 → 0.183). Indicates rigid structural constraints overriding direct question answering.
- Sweet Spot: Baseline configuration balances grounding, recall, and directness. Targeted fixes restore specific metrics without destabilizing others.
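The numbers in the table above come from evaluation runs along these lines. The sketch below shows how to assemble the columns RAGAS expects from pipeline runs; `FakePipeline` is a toy stand-in for this series' `RAGPipeline`, and the actual scoring call (commented out) requires an LLM/embeddings backend such as an OpenAI key.

```python
# Sketch: assembling the evaluation rows RAGAS expects from pipeline runs.
# FakePipeline is a stand-in for the article's RAGPipeline.

def build_eval_rows(pipeline, questions, ground_truths):
    """Collect question/answer/contexts/ground_truth columns for RAGAS."""
    rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for q, gt in zip(questions, ground_truths):
        contexts = pipeline.retrieve(q)          # list[str] of retrieved chunks
        answer = pipeline.generate(q, contexts)  # answer grounded in contexts
        rows["question"].append(q)
        rows["answer"].append(answer)
        rows["contexts"].append(contexts)
        rows["ground_truth"].append(gt)
    return rows

class FakePipeline:
    """Toy stand-in so the sketch runs without a vector store or LLM."""
    def retrieve(self, q):
        return ["RAGAS decomposes RAG quality into retrieval and generation metrics."]
    def generate(self, q, contexts):
        return "It measures metrics such as context_recall and faithfulness."

rows = build_eval_rows(FakePipeline(),
                       ["What does RAGAS measure?"],
                       ["Retrieval and generation quality."])

# With a real pipeline, wrap the rows and score (needs an LLM backend configured):
# from datasets import Dataset
# from ragas import evaluate
# from ragas.metrics import (context_recall, faithfulness,
#                            answer_relevancy, context_precision)
# result = evaluate(Dataset.from_dict(rows),
#                   metrics=[context_recall, faithfulness,
#                            answer_relevancy, context_precision])
```

Running the four metrics together on every change, rather than one at a time, is what makes the signatures in the table separable.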
## Core Solution
The diagnostic workflow follows a strict decision tree. Each branch maps to a configurable pipeline parameter or prompt constraint.
### Diagnostic Decision Tree Logic
```
User reports "bad answer"
            │
    Check context_recall
    ├─ Low ────→ Retrieval problem
    │            ├─ Key content not retrieved
    │            ├─ Chunks too small or too large
    │            └─ Top-K too low
    │
    └─ Normal ─→ Check faithfulness
                 ├─ Low ────→ Generation problem (hallucination)
                 │            └─ Prompt encourages model to go beyond context
                 │
                 └─ Normal ─→ Check answer_relevancy
                              └─ Low ─→ Off-topic answer
                                        └─ Prompt forces rigid structure
```
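The tree encodes directly as a function. A minimal sketch; the 0.5 threshold is an illustrative default, not a value prescribed by RAGAS, so calibrate it against your own baseline run.

```python
def diagnose(metrics, threshold=0.5):
    """Walk the decision tree: check context_recall, then faithfulness,
    then answer_relevancy, and return the first failing stage."""
    if metrics["context_recall"] < threshold:
        return "retrieval"   # key content not retrieved: chunking or top_k
    if metrics["faithfulness"] < threshold:
        return "generation"  # hallucination: prompt allows outside knowledge
    if metrics["answer_relevancy"] < threshold:
        return "relevancy"   # off-topic: prompt forces rigid structure
    return "ok"

# Applied to the rows of the table above:
p1 = diagnose({"context_recall": 0.250, "faithfulness": 0.750,
               "answer_relevancy": 0.191})  # -> "retrieval"
p2 = diagnose({"context_recall": 0.625, "faithfulness": 0.320,
               "answer_relevancy": 0.487})  # -> "generation"
p3 = diagnose({"context_recall": 0.613, "faithfulness": 0.817,
               "answer_relevancy": 0.183})  # -> "relevancy"
```

Note that the baseline row passes only because its `answer_relevancy` of 0.502 narrowly clears 0.5, which is itself a hint to keep an eye on prompt directness.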
### Failure Mode 1: Low Retrieval Recall
```python
p1_pipeline = RAGPipeline(
    chunk_size=64,      # extremely small: docs shattered into fragments
    chunk_overlap=0,
    top_k=1,            # only 1 chunk retrieved: most info is lost
    prompt_type="baseline",
)
```
**Root Cause**: Documents are split into 64-character fragments, so individual concepts are scattered across multiple chunks. `top_k=1` retrieves only one fragment, leaving the LLM with incomplete semantics. `context_recall` drops sharply because key content is never retrieved.
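The fragmentation effect is easy to reproduce with a plain character splitter, used here as a toy stand-in for the pipeline's real chunker:

```python
def split_chars(text, chunk_size, overlap=0):
    """Naive fixed-width character splitter (toy stand-in for the chunker)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = ("RAGAS faithfulness measures whether every claim in the answer "
       "is supported by the retrieved context, while context_recall "
       "measures whether the retrieved context covers the ground truth.")

tiny = split_chars(doc, 64)   # Failure Mode 1: 64-char fragments
sane = split_chars(doc, 512)  # remediated: the whole doc fits in one chunk

# With top_k=1 only one tiny fragment reaches the LLM; the definitions of
# faithfulness and context_recall land in different fragments, so neither
# concept is fully available to the generator.
```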
### Failure Mode 2: Hallucination
```python
p2_pipeline = RAGPipeline(
    chunk_size=512,
    chunk_overlap=50,
    top_k=4,
    prompt_type="hallucination",  # hallucination-inducing prompt
)
```
```python
from langchain_core.prompts import ChatPromptTemplate

PROMPT_HALLUCINATION = ChatPromptTemplate.from_messages([
    ("system",
     "You are an encyclopedic AI assistant with broad knowledge. Answer questions "
     "comprehensively, drawing on your extensive training. The reference material "
     "below is just a starting point; feel free to expand with additional background "
     "knowledge beyond what's provided. "
     "Make the answer as rich and informative as possible."),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\n"
              "Provide a comprehensive answer:"),
])
```
**Root Cause**: The prompt explicitly authorizes external knowledge injection. RAGAS `faithfulness` flags any claim not traceable to `{context}`, causing a sharp metric drop despite intact retrieval.
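Conceptually, faithfulness is the fraction of answer claims attributable to the retrieved context. RAGAS extracts and verifies claims with an LLM; the substring check below is a deliberately crude proxy that only illustrates the ratio:

```python
def toy_faithfulness(claims, contexts):
    """Fraction of claims that appear verbatim in some retrieved context.
    RAGAS verifies claims with an LLM; substring matching is a crude proxy."""
    context_text = " ".join(contexts).lower()
    supported = sum(1 for c in claims if c.lower() in context_text)
    return supported / len(claims) if claims else 0.0

contexts = ["RAGAS was released in 2023 as an evaluation framework for RAG pipelines."]
claims = [
    "RAGAS was released in 2023",                        # grounded in context
    "an evaluation framework for RAG pipelines",         # grounded in context
    "it is the most widely used framework in industry",  # injected outside knowledge
]
score = toy_faithfulness(claims, contexts)  # 2 of 3 claims supported
```

The hallucination-inducing prompt raises the share of unsupported claims, which is exactly what drags the real metric from 0.829 down to 0.320.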
### Failure Mode 3: Off-Topic Answers
```python
p3_pipeline = RAGPipeline(
    chunk_size=512,
    chunk_overlap=50,
    top_k=4,
    prompt_type="offtopic",  # forced academic survey format
)
```
```python
PROMPT_OFFTOPIC = ChatPromptTemplate.from_messages([
    ("system",
     "You are a senior technical researcher writing academic surveys. "
     "For every question, structure your response exactly as follows:\n"
     "1. Technical Background and Historical Evolution\n"
     "2. Major Technical Schools and Comparative Analysis\n"
     "3. Current Challenges and Future Trends\n"
     "Each section must be at least 200 words. Use academic language."),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\n"
              "Write the survey:"),
])
```
**Root Cause**: Rigid structural mandates override direct question answering. `answer_relevancy` measures directness; forced academic formatting decouples output from user intent.
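RAGAS estimates `answer_relevancy` by generating questions from the answer and measuring their embedding similarity to the original question. The token-overlap proxy below is not that algorithm, just a runnable illustration of why a forced survey format scores poorly:

```python
def toy_relevancy(question, answer):
    """Jaccard token overlap between question and answer: a crude proxy for
    answer_relevancy, which RAGAS computes via generated questions + embeddings."""
    q = set(question.lower().split())
    a = set(answer.lower().split())
    return len(q & a) / len(q | a)

question = "what does top_k control in retrieval"
direct = "top_k controls how many chunks retrieval returns"
survey = ("1. technical background and historical evolution of ranking "
          "2. major technical schools 3. future trends")

# The direct answer shares the question's vocabulary; the forced survey
# format barely touches it, so its proxy score collapses to zero.
```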
### Remediation Configurations
```python
# Before: chunks too small, top_k too low
RAGPipeline(chunk_size=64, chunk_overlap=0, top_k=1)

# After: reasonable chunk size + sufficient top_k
RAGPipeline(chunk_size=512, chunk_overlap=50, top_k=4)
```
**Reference Values:**
| Use Case | chunk_size | overlap | top_k |
|----------|------------|---------|-------|
| Short Q&A (technical FAQ) | 256β512 | 20β50 | 3β5 |
| Long document comprehension | 512β1024 | 50β100 | 4β6 |
| Code repository search | Per function/class | 0 | 3β5 |
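The reference table maps naturally onto a small config registry. A sketch, with mid-range starting values chosen from the table (the keys mirror the `RAGPipeline` parameters used throughout this series, and `None` marks the per-function/class chunking used for code):

```python
# Per-use-case starting points taken from the reference table above.
# chunk_size=None means "chunk per function/class" for code repositories.
CONFIGS = {
    "short_qa": {"chunk_size": 384,  "chunk_overlap": 35, "top_k": 4},
    "long_doc": {"chunk_size": 768,  "chunk_overlap": 75, "top_k": 5},
    "code":     {"chunk_size": None, "chunk_overlap": 0,  "top_k": 4},
}

def pipeline_config(use_case):
    """Look up a starting configuration; tune from here with RAGAS runs."""
    if use_case not in CONFIGS:
        raise KeyError(f"unknown use case: {use_case!r}")
    return dict(CONFIGS[use_case])  # copy so callers can tweak safely
```

Treat these as baselines to measure against, not final values: each change should be validated with a before/after RAGAS run.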
**Prompt Hardening Principle**: Explicitly constrain the model to the reference material and define fallback behavior for insufficient context.
The original snippet was truncated here; a completed version consistent with the principle above looks like this:

```python
PROMPT_STRICT = ChatPromptTemplate.from_messages([
    ("system",
     "You are a careful assistant. Answer strictly based on the reference "
     "material below. Do not add any knowledge that is not in the material. "
     "If the material does not contain the answer, reply exactly: "
     "\"The provided context is insufficient to answer this question.\""),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\nAnswer:"),
])
```
## Pitfall Guide
1. **Intuition-Driven Tuning**: Modifying prompts or embeddings without baseline metrics creates confounded variables. Always run RAGAS evaluation before and after changes to isolate impact.
2. **Fragmented Chunking**: Setting `chunk_size` below semantic boundaries (e.g., <128 tokens) shatters context. `context_recall` will drop regardless of embedding quality. Align chunk size with concept density.
3. **Prompt Leakage**: Phrases like "feel free to expand" or "use your general knowledge" break grounding. RAG requires explicit constraints: "Answer strictly based on provided context."
4. **Over-Structured Prompts**: Forcing rigid templates (academic, bullet-heavy, word-count mandates) decouples output from user intent. Prioritize directness over formatting compliance.
5. **Ignoring Context Precision**: Low `context_recall` often coincides with low `context_precision`. Retrieving tiny, topically relevant but semantically empty fragments wastes LLM context windows and degrades generation quality.
6. **Missing Fallback Instructions**: Failing to instruct the model to explicitly state "insufficient information" when context lacks answers forces hallucination or irrelevant padding. Always include a negative constraint clause.
## Deliverables
- **📋 RAG Diagnostic Blueprint**: Step-by-step decision tree mapping RAGAS metric thresholds to pipeline components (Retrieval vs. Generation). Includes metric sensitivity analysis and remediation pathways.
- **✅ Pre-Deployment RAG Validation Checklist**: 12-point verification protocol covering chunking strategy, prompt grounding constraints, `top_k` calibration, and RAGAS baseline thresholds before production rollout.
- **⚙️ Configuration Templates**: Production-ready JSON/YAML templates for `RAGPipeline` parameters (chunk_size, overlap, top_k) by use case, plus hardened prompt templates with explicit grounding and fallback clauses.
