ernal knowledge injection.
- Format/Relevancy Failure Signature:
faithfulness and context_recall stay normal, but answer_relevancy crashes (0.502 β 0.183). Indicates rigid structural constraints overriding direct question answering.
- Sweet Spot: Baseline configuration balances grounding, recall, and directness. Targeted fixes restore specific metrics without destabilizing others.
Core Solution
The diagnostic workflow follows a strict decision tree. Each branch maps to a configurable pipeline parameter or prompt constraint.
Diagnostic Decision Tree Logic
User reports "bad answer"
β
Check context_recall
ββ Low βββββ Retrieval problem
β ββ Key content not retrieved
β ββ Chunks too small or too large
β ββ Top-K too low
β
ββ Normal βββ Check faithfulness
ββ Low βββββ Generation problem (hallucination)
β ββ Prompt encourages model to go beyond context
β
ββ Normal βββ Check answer_relevancy
ββ Low βββ Off-topic answer
ββ Prompt forces rigid structure
Failure Mode 1: Low Retrieval Recall
p1_pipeline = RAGPipeline(
chunk_size=64, # Extremely small β docs shattered into fragments
chunk_overlap=0,
top_k=1, # Only 1 chunk retrieved β most info is lost
prompt_type="baseline",
)
Root Cause: Documents are fragmented into 64-character tokens. Concepts are distributed across multiple chunks. top_k=1 retrieves only one fragment, leaving the LLM with incomplete semantics. context_recall drops sharply as key content remains unretrieved.
Failure Mode 2: Hallucination
p2_pipeline = RAGPipeline(
chunk_size=512,
chunk_overlap=50,
top_k=4,
prompt_type="hallucination", # Hallucination-inducing prompt
)
PROMPT_HALLUCINATION = ChatPromptTemplate.from_messages([
("system", "You are an encyclopedic AI assistant with broad knowledge. Answer questions comprehensively "
"drawing on your extensive training. The reference material below is just a starting point β "
"feel free to expand with additional background knowledge beyond what's provided. "
"Make the answer as rich and informative as possible."),
("human", "Reference material:\n{context}\n\nQuestion: {question}\n\nProvide a comprehensive answer:"),
])
Root Cause: The prompt explicitly authorizes external knowledge injection. RAGAS faithfulness flags any claim not traceable to {context}, causing a sharp metric drop despite intact retrieval.
Failure Mode 3: Off-Topic Answers
p3_pipeline = RAGPipeline(
chunk_size=512,
chunk_overlap=50,
top_k=4,
prompt_type="offtopic", # Forced academic survey format
)
PROMPT_OFFTOPIC = ChatPromptTemplate.from_messages([
("system", "You are a senior technical researcher writing academic surveys. "
"For every question, structure your response exactly as follows:\n"
"1. Technical Background and Historical Evolution\n"
"2. Major Technical Schools and Comparative Analysis\n"
"3. Current Challenges and Future Trends\n"
"Each section must be at least 200 words. Use academic language."),
("human", "Reference material:\n{context}\n\nQuestion: {question}\n\nWrite the survey:"),
])
Root Cause: Rigid structural mandates override direct question answering. answer_relevancy measures directness; forced academic formatting decouples output from user intent.
# Before: chunks too small, top_k too low
RAGPipeline(chunk_size=64, chunk_overlap=0, top_k=1)
# After: reasonable chunk size + sufficient top_k
RAGPipeline(chunk_size=512, chunk_overlap=50, top_k=4)
Reference Values:
| Use Case | chunk_size | overlap | top_k |
|---|
| Short Q&A (technical FAQ) | 256β512 | 20β50 | 3β5 |
| Long document comprehension | 512β1024 | 50β100 | 4β6 |
| Code repository search | Per function/class | 0 | 3β5 |
Prompt Hardening Principle: Explicitly constrain the model to the reference material and define fallback behavior for insufficient context.
PROMPT_STRICT = Cha
Pitfall Guide
- Intuition-Driven Tuning: Modifying prompts or embeddings without baseline metrics creates confounded variables. Always run RAGAS evaluation before and after changes to isolate impact.
- Fragmented Chunking: Setting
chunk_size below semantic boundaries (e.g., <128 tokens) shatters context. context_recall will drop regardless of embedding quality. Align chunk size with concept density.
- Prompt Leakage: Phrases like "feel free to expand" or "use your general knowledge" break grounding. RAG requires explicit constraints: "Answer strictly based on provided context."
- Over-Structured Prompts: Forcing rigid templates (academic, bullet-heavy, word-count mandates) decouples output from user intent. Prioritize directness over formatting compliance.
- Ignoring Context Precision: Low
context_recall often coincides with low context_precision. Retrieving tiny, topically relevant but semantically empty fragments wastes LLM context windows and degrades generation quality.
- Missing Fallback Instructions: Failing to instruct the model to explicitly state "insufficient information" when context lacks answers forces hallucination or irrelevant padding. Always include a negative constraint clause.
Deliverables
- π RAG Diagnostic Blueprint: Step-by-step decision tree mapping RAGAS metric thresholds to pipeline components (Retrieval vs. Generation). Includes metric sensitivity analysis and remediation pathways.
- β
Pre-Deployment RAG Validation Checklist: 12-point verification protocol covering chunking strategy, prompt grounding constraints,
top_k calibration, and RAGAS baseline thresholds before production rollout.
- βοΈ Configuration Templates: Production-ready JSON/YAML templates for
RAGPipeline parameters (chunk_size, overlap, top_k) by use case, plus hardened prompt templates with explicit grounding and fallback clauses.