# RAG Series (9): When RAG Gives Bad Answers – Root Cause Diagnosis with RAGAS
## Current Situation Analysis
Deployed RAG systems frequently exhibit subtle degradation where users report answers that "aren't quite right." Traditional engineering responses involve iterative, intuition-driven tweaks: adjusting prompts, swapping embedding models, or modifying retrieval parameters. This approach suffers from critical failure modes:
- Confounded Variables: Multiple changes are applied simultaneously, making it impossible to isolate which modification actually improved or degraded performance.
- Subjective Validation: "Feels off" lacks quantifiable thresholds, leading to regression when new edge cases emerge.
- Root Cause Ambiguity: Without metric decomposition, engineers cannot distinguish between upstream retrieval failures (missing context) and downstream generation failures (hallucination or format drift).
- Inefficient Debugging: Traditional manual inspection scales poorly.

RAGAS metrics transform subjective feedback into a deterministic diagnostic pipeline, replacing guesswork with data-driven root cause isolation.
## WOW Moment: Key Findings
By deliberately inducing three classic failure modes and measuring them against a baseline, RAGAS metrics reveal precise diagnostic signatures. The decision tree logic (`context_recall` → `faithfulness` → `answer_relevancy`) isolates failure stages with near-perfect accuracy.
| Approach | context_recall | faithfulness | answer_relevancy | context_precision |
|---|---|---|---|---|
| Baseline | 0.625 | 0.829 | 0.502 | 0.583 |
| Problem 1: Low Recall | 0.250 | 0.750 | 0.191 | 0.375 |
| Problem 2: Hallucination | 0.625 | 0.320 | 0.487 | 0.583 |
| Problem 3: Off-Topic | 0.613 | 0.817 | 0.183 | 0.550 |
Key Findings:
- Retrieval Failure Signature: Sharp drop in `context_recall` (0.625 → 0.250) with collateral damage to `context_precision`. Indicates semantic fragmentation or insufficient `top_k`.
- Generation Failure Signature: `context_recall` remains stable while `faithfulness` plummets (0.829 → 0.320). Indicates prompt leakage allowing external knowledge injection.
- Format/Relevancy Failure Signature: `faithfulness` and `context_recall` stay normal, but `answer_relevancy` crashes (0.502 → 0.183). Indicates rigid structural constraints overriding direct question answering.
- Sweet Spot: Baseline configuration balances grounding, recall, and directness. Targeted fixes restore specific metrics without destabilizing others.
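The numbers in the table above come from evaluation runs along these lines. The sketch below shows how to assemble the columns RAGAS expects from pipeline runs; `FakePipeline` is a toy stand-in for this series' `RAGPipeline`, and the actual scoring call (commented out) requires an LLM/embeddings backend such as an OpenAI key.

```python
# Sketch: assembling the evaluation rows RAGAS expects from pipeline runs.
# FakePipeline is a stand-in for the article's RAGPipeline.

def build_eval_rows(pipeline, questions, ground_truths):
    """Collect question/answer/contexts/ground_truth columns for RAGAS."""
    rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for q, gt in zip(questions, ground_truths):
        contexts = pipeline.retrieve(q)          # list[str] of retrieved chunks
        answer = pipeline.generate(q, contexts)  # answer grounded in contexts
        rows["question"].append(q)
        rows["answer"].append(answer)
        rows["contexts"].append(contexts)
        rows["ground_truth"].append(gt)
    return rows

class FakePipeline:
    """Toy stand-in so the sketch runs without a vector store or LLM."""
    def retrieve(self, q):
        return ["RAGAS decomposes RAG quality into retrieval and generation metrics."]
    def generate(self, q, contexts):
        return "It measures metrics such as context_recall and faithfulness."

rows = build_eval_rows(FakePipeline(),
                       ["What does RAGAS measure?"],
                       ["Retrieval and generation quality."])

# With a real pipeline, wrap the rows and score (needs an LLM backend configured):
# from datasets import Dataset
# from ragas import evaluate
# from ragas.metrics import (context_recall, faithfulness,
#                            answer_relevancy, context_precision)
# result = evaluate(Dataset.from_dict(rows),
#                   metrics=[context_recall, faithfulness,
#                            answer_relevancy, context_precision])
```

Running the four metrics together on every change, rather than one at a time, is what makes the signatures in the table separable.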
## Core Solution
The diagnostic workflow follows a strict decision tree. Each branch maps to a configurable pipeline parameter or prompt constraint.
### Diagnostic Decision Tree Logic
```
User reports "bad answer"
            │
    Check context_recall
    ├─ Low ────→ Retrieval problem
    │            ├─ Key content not retrieved
    │            ├─ Chunks too small or too large
    │            └─ Top-K too low
    │
    └─ Normal ─→ Check faithfulness
                 ├─ Low ────→ Generation problem (hallucination)
                 │            └─ Prompt encourages model to go beyond context
                 │
                 └─ Normal ─→ Check answer_relevancy
                              └─ Low ─→ Off-topic answer
                                        └─ Prompt forces rigid structure
```
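The tree encodes directly as a function. A minimal sketch; the 0.5 threshold is an illustrative default, not a value prescribed by RAGAS, so calibrate it against your own baseline run.

```python
def diagnose(metrics, threshold=0.5):
    """Walk the decision tree: check context_recall, then faithfulness,
    then answer_relevancy, and return the first failing stage."""
    if metrics["context_recall"] < threshold:
        return "retrieval"   # key content not retrieved: chunking or top_k
    if metrics["faithfulness"] < threshold:
        return "generation"  # hallucination: prompt allows outside knowledge
    if metrics["answer_relevancy"] < threshold:
        return "relevancy"   # off-topic: prompt forces rigid structure
    return "ok"

# Applied to the rows of the table above:
p1 = diagnose({"context_recall": 0.250, "faithfulness": 0.750,
               "answer_relevancy": 0.191})  # -> "retrieval"
p2 = diagnose({"context_recall": 0.625, "faithfulness": 0.320,
               "answer_relevancy": 0.487})  # -> "generation"
p3 = diagnose({"context_recall": 0.613, "faithfulness": 0.817,
               "answer_relevancy": 0.183})  # -> "relevancy"
```

Note that the baseline row passes only because its `answer_relevancy` of 0.502 narrowly clears 0.5, which is itself a hint to keep an eye on prompt directness.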
### Failure Mode 1: Low Retrieval Recall
```python
p1_pipeline = RAGPipeline(
    chunk_size=64,      # extremely small: docs shattered into fragments
    chunk_overlap=0,
    top_k=1,            # only 1 chunk retrieved: most info is lost
    prompt_type="baseline",
)
```
**Root Cause**: Documents are split into 64-character fragments, so individual concepts are scattered across multiple chunks. `top_k=1` retrieves only one fragment, leaving the LLM with incomplete semantics. `context_recall` drops sharply because key content is never retrieved.
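The fragmentation effect is easy to reproduce with a plain character splitter, used here as a toy stand-in for the pipeline's real chunker:

```python
def split_chars(text, chunk_size, overlap=0):
    """Naive fixed-width character splitter (toy stand-in for the chunker)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = ("RAGAS faithfulness measures whether every claim in the answer "
       "is supported by the retrieved context, while context_recall "
       "measures whether the retrieved context covers the ground truth.")

tiny = split_chars(doc, 64)   # Failure Mode 1: 64-char fragments
sane = split_chars(doc, 512)  # remediated: the whole doc fits in one chunk

# With top_k=1 only one tiny fragment reaches the LLM; the definitions of
# faithfulness and context_recall land in different fragments, so neither
# concept is fully available to the generator.
```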
### Failure Mode 2: Hallucination
```python
p2_pipeline = RAGPipeline(
    chunk_size=512,
    chunk_overlap=50,
    top_k=4,
    prompt_type="hallucination",  # hallucination-inducing prompt
)
```
```python
from langchain_core.prompts import ChatPromptTemplate

PROMPT_HALLUCINATION = ChatPromptTemplate.from_messages([
    ("system",
     "You are an encyclopedic AI assistant with broad knowledge. Answer questions "
     "comprehensively, drawing on your extensive training. The reference material "
     "below is just a starting point; feel free to expand with additional background "
     "knowledge beyond what's provided. "
     "Make the answer as rich and informative as possible."),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\n"
              "Provide a comprehensive answer:"),
])
```
**Root Cause**: The prompt explicitly authorizes external knowledge injection. RAGAS `faithfulness` flags any claim not traceable to `{context}`, causing a sharp metric drop despite intact retrieval.
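Conceptually, faithfulness is the fraction of answer claims attributable to the retrieved context. RAGAS extracts and verifies claims with an LLM; the substring check below is a deliberately crude proxy that only illustrates the ratio:

```python
def toy_faithfulness(claims, contexts):
    """Fraction of claims that appear verbatim in some retrieved context.
    RAGAS verifies claims with an LLM; substring matching is a crude proxy."""
    context_text = " ".join(contexts).lower()
    supported = sum(1 for c in claims if c.lower() in context_text)
    return supported / len(claims) if claims else 0.0

contexts = ["RAGAS was released in 2023 as an evaluation framework for RAG pipelines."]
claims = [
    "RAGAS was released in 2023",                        # grounded in context
    "an evaluation framework for RAG pipelines",         # grounded in context
    "it is the most widely used framework in industry",  # injected outside knowledge
]
score = toy_faithfulness(claims, contexts)  # 2 of 3 claims supported
```

The hallucination-inducing prompt raises the share of unsupported claims, which is exactly what drags the real metric from 0.829 down to 0.320.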
### Failure Mode 3: Off-Topic Answers
```python
p3_pipeline = RAGPipeline(
    chunk_size=512,
    chunk_overlap=50,
    top_k=4,
    prompt_type="offtopic",  # forced academic survey format
)
```
```python
PROMPT_OFFTOPIC = ChatPromptTemplate.from_messages([
    ("system",
     "You are a senior technical researcher writing academic surveys. "
     "For every question, structure your response exactly as follows:\n"
     "1. Technical Background and Historical Evolution\n"
     "2. Major Technical Schools and Comparative Analysis\n"
     "3. Current Challenges and Future Trends\n"
     "Each section must be at least 200 words. Use academic language."),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\n"
              "Write the survey:"),
])
```
**Root Cause**: Rigid structural mandates override direct question answering. `answer_relevancy` measures directness; forced academic formatting decouples output from user intent.
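RAGAS estimates `answer_relevancy` by generating questions from the answer and measuring their embedding similarity to the original question. The token-overlap proxy below is not that algorithm, just a runnable illustration of why a forced survey format scores poorly:

```python
def toy_relevancy(question, answer):
    """Jaccard token overlap between question and answer: a crude proxy for
    answer_relevancy, which RAGAS computes via generated questions + embeddings."""
    q = set(question.lower().split())
    a = set(answer.lower().split())
    return len(q & a) / len(q | a)

question = "what does top_k control in retrieval"
direct = "top_k controls how many chunks retrieval returns"
survey = ("1. technical background and historical evolution of ranking "
          "2. major technical schools 3. future trends")

# The direct answer shares the question's vocabulary; the forced survey
# format barely touches it, so its proxy score collapses to zero.
```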
### Remediation Configurations
```python
# Before: chunks too small, top_k too low
RAGPipeline(chunk_size=64, chunk_overlap=0, top_k=1)

# After: reasonable chunk size + sufficient top_k
RAGPipeline(chunk_size=512, chunk_overlap=50, top_k=4)
```
**Reference Values:**
| Use Case | chunk_size | overlap | top_k |
|----------|------------|---------|-------|
| Short Q&A (technical FAQ) | 256β512 | 20β50 | 3β5 |
| Long document comprehension | 512β1024 | 50β100 | 4β6 |
| Code repository search | Per function/class | 0 | 3β5 |
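The reference table maps naturally onto a small config registry. A sketch, with mid-range starting values chosen from the table (the keys mirror the `RAGPipeline` parameters used throughout this series, and `None` marks the per-function/class chunking used for code):

```python
# Per-use-case starting points taken from the reference table above.
# chunk_size=None means "chunk per function/class" for code repositories.
CONFIGS = {
    "short_qa": {"chunk_size": 384,  "chunk_overlap": 35, "top_k": 4},
    "long_doc": {"chunk_size": 768,  "chunk_overlap": 75, "top_k": 5},
    "code":     {"chunk_size": None, "chunk_overlap": 0,  "top_k": 4},
}

def pipeline_config(use_case):
    """Look up a starting configuration; tune from here with RAGAS runs."""
    if use_case not in CONFIGS:
        raise KeyError(f"unknown use case: {use_case!r}")
    return dict(CONFIGS[use_case])  # copy so callers can tweak safely
```

Treat these as baselines to measure against, not final values: each change should be validated with a before/after RAGAS run.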
**Prompt Hardening Principle**: Explicitly constrain the model to the reference material and define fallback behavior for insufficient context.
The original snippet was truncated here; a completed version consistent with the principle above looks like this:

```python
PROMPT_STRICT = ChatPromptTemplate.from_messages([
    ("system",
     "You are a careful assistant. Answer strictly based on the reference "
     "material below. Do not add any knowledge that is not in the material. "
     "If the material does not contain the answer, reply exactly: "
     "\"The provided context is insufficient to answer this question.\""),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\nAnswer:"),
])
```
## Pitfall Guide
1. **Intuition-Driven Tuning**: Modifying prompts or embeddings without baseline metrics creates confounded variables. Always run RAGAS evaluation before and after changes to isolate impact.
2. **Fragmented Chunking**: Setting `chunk_size` below semantic boundaries (e.g., <128 tokens) shatters context. `context_recall` will drop regardless of embedding quality. Align chunk size with concept density.
3. **Prompt Leakage**: Phrases like "feel free to expand" or "use your general knowledge" break grounding. RAG requires explicit constraints: "Answer strictly based on provided context."
4. **Over-Structured Prompts**: Forcing rigid templates (academic, bullet-heavy, word-count mandates) decouples output from user intent. Prioritize directness over formatting compliance.
5. **Ignoring Context Precision**: Low `context_recall` often coincides with low `context_precision`. Retrieving tiny, topically relevant but semantically empty fragments wastes LLM context windows and degrades generation quality.
6. **Missing Fallback Instructions**: Failing to instruct the model to explicitly state "insufficient information" when context lacks answers forces hallucination or irrelevant padding. Always include a negative constraint clause.
## Deliverables
- **📋 RAG Diagnostic Blueprint**: Step-by-step decision tree mapping RAGAS metric thresholds to pipeline components (Retrieval vs. Generation). Includes metric sensitivity analysis and remediation pathways.
- **✅ Pre-Deployment RAG Validation Checklist**: 12-point verification protocol covering chunking strategy, prompt grounding constraints, `top_k` calibration, and RAGAS baseline thresholds before production rollout.
- **⚙️ Configuration Templates**: Production-ready JSON/YAML templates for `RAGPipeline` parameters (chunk_size, overlap, top_k) by use case, plus hardened prompt templates with explicit grounding and fallback clauses.
