
# RAG Series (9): When RAG Gives Bad Answers – Root Cause Diagnosis with RAGAS

By Codcompass Team · 5 min read

## Current Situation Analysis

Deployed RAG systems frequently exhibit subtle degradation where users report answers that "aren't quite right." Traditional engineering responses involve iterative, intuition-driven tweaks: adjusting prompts, swapping embedding models, or modifying retrieval parameters. This approach suffers from critical failure modes:

  • Confounded Variables: Multiple changes are applied simultaneously, making it impossible to isolate which modification actually improved or degraded performance.
  • Subjective Validation: "Feels off" lacks quantifiable thresholds, leading to regression when new edge cases emerge.
  • Root Cause Ambiguity: Without metric decomposition, engineers cannot distinguish between upstream retrieval failures (missing context) and downstream generation failures (hallucination or format drift).
  • Inefficient Debugging: Manual inspection scales poorly with corpus size and query volume.

RAGAS metrics transform this subjective feedback into a deterministic diagnostic pipeline, replacing guesswork with data-driven root cause isolation, as the harness sketched below shows.
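A minimal evaluation sketch, assuming the classic RAGAS dataset schema (`question`, `answer`, `contexts`, `ground_truth`) and an LLM judge configured via API key; the sample row is illustrative:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    faithfulness,
    answer_relevancy,
)

# One row per test question: the pipeline's answer, the retrieved contexts,
# and a human-written reference answer (sample content is illustrative).
eval_data = Dataset.from_dict({
    "question": ["What does chunk overlap do?"],
    "answer": ["Chunk overlap repeats a span of text between adjacent chunks."],
    "contexts": [["Overlap preserves sentence continuity across chunk boundaries."]],
    "ground_truth": ["Overlap keeps boundary sentences intact across chunks."],
})

# Requires an LLM judge (e.g. an OpenAI key) to be configured for RAGAS.
result = evaluate(
    eval_data,
    metrics=[context_recall, context_precision, faithfulness, answer_relevancy],
)
print(result)  # dict-like scores, e.g. {'context_recall': 0.62, ...}
```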

## WOW Moment: Key Findings

By deliberately inducing three classic failure modes and measuring them against a baseline, RAGAS metrics reveal precise diagnostic signatures. The decision tree logic (context_recall → faithfulness → answer_relevancy) cleanly attributes each failure to its pipeline stage.

| Approach | context_recall | faithfulness | answer_relevancy | context_precision |
|----------|----------------|--------------|------------------|-------------------|
| Baseline | 0.625 | 0.829 | 0.502 | 0.583 |
| Problem 1: Low Recall | 0.250 | 0.750 | 0.191 | 0.375 |
| Problem 2: Hallucination | 0.625 | 0.320 | 0.487 | 0.583 |
| Problem 3: Off-Topic | 0.613 | 0.817 | 0.183 | 0.550 |

**Key Findings:**

  • Retrieval Failure Signature: Sharp drop in context_recall (0.625 → 0.250) with collateral damage to context_precision. Indicates semantic fragmentation or insufficient top_k.
  • Generation Failure Signature: context_recall remains stable while faithfulness plummets (0.829 → 0.320). Indicates prompt leakage allowing external knowledge injection.
  • Format/Relevancy Failure Signature: faithfulness and context_recall stay normal, but answer_relevancy crashes (0.502 → 0.183). Indicates rigid structural constraints overriding direct question answering.
  • Sweet Spot: Baseline configuration balances grounding, recall, and directness. Targeted fixes restore specific metrics without destabilizing others.

## Core Solution

The diagnostic workflow follows a strict decision tree. Each branch maps to a configurable pipeline parameter or prompt constraint.

### Diagnostic Decision Tree Logic

```
User reports "bad answer"
        ↓
Check context_recall
        ├─ Low ────→ Retrieval problem
        │             ├─ Key content not retrieved
        │             ├─ Chunks too small or too large
        │             └─ Top-K too low
        │
        └─ Normal ──→ Check faithfulness
                          ├─ Low ────→ Generation problem (hallucination)
                          │             └─ Prompt encourages model to go beyond context
                          │
                          └─ Normal ──→ Check answer_relevancy
                                            └─ Low ──→ Off-topic answer
                                                       └─ Prompt forces rigid structure
```
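The same routing reduces to a few lines of code. A minimal sketch; the thresholds are illustrative assumptions to be calibrated against your own baseline scores:

```python
# Thresholds are illustrative assumptions, not values prescribed by RAGAS;
# calibrate them against your measured baseline (e.g. ~0.6/0.8/0.5 above).
def diagnose(scores: dict[str, float]) -> str:
    if scores["context_recall"] < 0.4:
        return "Retrieval problem: revisit chunking and top_k"
    if scores["faithfulness"] < 0.5:
        return "Generation problem: prompt permits external knowledge"
    if scores["answer_relevancy"] < 0.3:
        return "Relevancy problem: prompt forces a rigid structure"
    return "Within baseline tolerances"

# Problem 1's signature routes to the retrieval branch:
print(diagnose({"context_recall": 0.250, "faithfulness": 0.750, "answer_relevancy": 0.191}))
```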


### Failure Mode 1: Low Retrieval Recall

```python
p1_pipeline = RAGPipeline(
    chunk_size=64,        # Extremely small: docs shattered into fragments
    chunk_overlap=0,
    top_k=1,              # Only 1 chunk retrieved; most info is lost
    prompt_type="baseline",
)
```

**Root Cause**: Documents are fragmented into 64-character chunks, so single concepts are scattered across multiple fragments. `top_k=1` retrieves only one of them, leaving the LLM with incomplete semantics. `context_recall` drops sharply because key content is never retrieved.
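You can see the fragmentation directly. A sketch assuming LangChain's `RecursiveCharacterTextSplitter` is the splitter behind `RAGPipeline`:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

doc = ("Retrieval-augmented generation grounds LLM answers in retrieved "
       "documents, which reduces hallucination on domain questions.")
splitter = RecursiveCharacterTextSplitter(chunk_size=64, chunk_overlap=0)

# The single concept is split mid-thought; with top_k=1 only one of these
# fragments ever reaches the LLM.
for chunk in splitter.split_text(doc):
    print(repr(chunk))
```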

### Failure Mode 2: Hallucination

```python
p2_pipeline = RAGPipeline(
    chunk_size=512,
    chunk_overlap=50,
    top_k=4,
    prompt_type="hallucination",  # Hallucination-inducing prompt
)
```

```python
PROMPT_HALLUCINATION = ChatPromptTemplate.from_messages([
    ("system",
     "You are an encyclopedic AI assistant with broad knowledge. Answer questions comprehensively "
     "drawing on your extensive training. The reference material below is just a starting point; "
     "feel free to expand with additional background knowledge beyond what's provided. "
     "Make the answer as rich and informative as possible."),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\nProvide a comprehensive answer:"),
])
```

**Root Cause**: The prompt explicitly authorizes external knowledge injection. RAGAS `faithfulness` flags any claim not traceable to `{context}`, causing a sharp metric drop despite intact retrieval.
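To confirm the diagnosis, `faithfulness` can be evaluated in isolation (no `ground_truth` column is needed for this metric); the sample below is illustrative:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# A deliberately hallucinated sample: only the 2017 claim is supported.
sample = Dataset.from_dict({
    "question": ["When was the transformer architecture introduced?"],
    "answer": ["Transformers were introduced in 2017 and later powered BERT, GPT-3, and PaLM."],
    "contexts": [["The transformer architecture was introduced in the 2017 paper "
                  "'Attention Is All You Need'."]],
})

# The BERT/GPT-3/PaLM claims are untraceable to the context, so the
# faithfulness score should land well below 1.0.
print(evaluate(sample, metrics=[faithfulness]))
```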

### Failure Mode 3: Off-Topic Answers

```python
p3_pipeline = RAGPipeline(
    chunk_size=512,
    chunk_overlap=50,
    top_k=4,
    prompt_type="offtopic",  # Forced academic survey format
)
```

```python
PROMPT_OFFTOPIC = ChatPromptTemplate.from_messages([
    ("system",
     "You are a senior technical researcher writing academic surveys. "
     "For every question, structure your response exactly as follows:\n"
     "1. Technical Background and Historical Evolution\n"
     "2. Major Technical Schools and Comparative Analysis\n"
     "3. Current Challenges and Future Trends\n"
     "Each section must be at least 200 words. Use academic language."),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\nWrite the survey:"),
])
```

**Root Cause**: Rigid structural mandates override direct question answering. `answer_relevancy` measures directness; forced academic formatting decouples output from user intent.
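Why does formatting hurt this particular metric? RAGAS computes `answer_relevancy` roughly by generating candidate questions from the answer and comparing their embeddings to the user's question. A simplified sketch with toy, precomputed embeddings (the mechanics here are a deliberate simplification of RAGAS's LLM-based pipeline):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A forced survey answer implies questions that drift from the user's actual
# question, so the mean similarity (the score) collapses.
def answer_relevancy_sketch(question_emb, generated_question_embs) -> float:
    return float(np.mean([cosine(question_emb, g) for g in generated_question_embs]))

user_q = np.array([1.0, 0.0])
on_topic = [np.array([0.95, 0.05]), np.array([0.90, 0.10])]
survey_drift = [np.array([0.30, 0.95]), np.array([0.10, 0.99])]
print(answer_relevancy_sketch(user_q, on_topic))      # high: answer matches intent
print(answer_relevancy_sketch(user_q, survey_drift))  # low: format overrode intent
```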

### Remediation Configurations

```python
# Before: chunks too small, top_k too low
RAGPipeline(chunk_size=64, chunk_overlap=0, top_k=1)

# After: reasonable chunk size + sufficient top_k
RAGPipeline(chunk_size=512, chunk_overlap=50, top_k=4)
```

**Reference Values:**
| Use Case | chunk_size | overlap | top_k |
|----------|------------|---------|-------|
| Short Q&A (technical FAQ) | 256–512 | 20–50 | 3–5 |
| Long document comprehension | 512–1024 | 50–100 | 4–6 |
| Code repository search | Per function/class | 0 | 3–5 |
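These rows translate naturally into reusable presets; a hypothetical mapping (the preset names and mid-range values are assumptions, not mandated defaults):

```python
# Hypothetical presets drawn from the reference table above; names and the
# specific mid-range values are illustrative assumptions.
CHUNKING_PRESETS = {
    "short_qa":    {"chunk_size": 384, "chunk_overlap": 30, "top_k": 4},
    "long_docs":   {"chunk_size": 768, "chunk_overlap": 80, "top_k": 5},
    "code_search": {"chunk_size": 512, "chunk_overlap": 0,  "top_k": 4},  # ideally split per function/class
}

qa_pipeline = RAGPipeline(prompt_type="baseline", **CHUNKING_PRESETS["short_qa"])
```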

**Prompt Hardening Principle**: Explicitly constrain the model to the reference material and define fallback behavior for insufficient context. One hardened variant:

```python
# Strict grounding: confine the model to {context} and spell out the fallback.
PROMPT_STRICT = ChatPromptTemplate.from_messages([
    ("system",
     "You are a technical assistant. Answer strictly based on the reference material below. "
     "Do not use any knowledge beyond the provided context. "
     "If the reference material does not contain enough information to answer, "
     "reply: 'The provided material is insufficient to answer this question.'"),
    ("human", "Reference material:\n{context}\n\nQuestion: {question}\n\nAnswer:"),
])
```


## Pitfall Guide
1. **Intuition-Driven Tuning**: Modifying prompts or embeddings without baseline metrics creates confounded variables. Always run RAGAS evaluation before and after changes to isolate impact.
2. **Fragmented Chunking**: Setting `chunk_size` below semantic boundaries (e.g., under ~128 characters) shatters context. `context_recall` will drop regardless of embedding quality. Align chunk size with concept density.
3. **Prompt Leakage**: Phrases like "feel free to expand" or "use your general knowledge" break grounding. RAG requires explicit constraints: "Answer strictly based on provided context."
4. **Over-Structured Prompts**: Forcing rigid templates (academic, bullet-heavy, word-count mandates) decouples output from user intent. Prioritize directness over formatting compliance.
5. **Ignoring Context Precision**: Low `context_recall` often coincides with low `context_precision`. Retrieving tiny, topically relevant but semantically empty fragments wastes LLM context windows and degrades generation quality.
6. **Missing Fallback Instructions**: Failing to instruct the model to explicitly state "insufficient information" when the context lacks the answer forces hallucination or irrelevant padding. Always include a negative constraint clause, as in `PROMPT_STRICT` above.

## Deliverables
- **πŸ“˜ RAG Diagnostic Blueprint**: Step-by-step decision tree mapping RAGAS metric thresholds to pipeline components (Retrieval vs. Generation). Includes metric sensitivity analysis and remediation pathways.
- **βœ… Pre-Deployment RAG Validation Checklist**: 12-point verification protocol covering chunking strategy, prompt grounding constraints, `top_k` calibration, and RAGAS baseline thresholds before production rollout.
- **βš™οΈ Configuration Templates**: Production-ready JSON/YAML templates for `RAGPipeline` parameters (chunk_size, overlap, top_k) by use case, plus hardened prompt templates with explicit grounding and fallback clauses.