"Requirements:\n"
"- Replace all pronouns (it, this, these, which one, etc.) with specific nouns\n"
"- Fill in any omitted subjects or objects\n"
"- Output only the rewritten question, no explanation\n"
"If the question is already complete and standalone, return it unchanged."),
MessagesPlaceholder("chat_history"),
("human", "{input}"),
])
Three design decisions worth noting:
1. **"Output only the question"**: without this explicit constraint, the LLM explains its reasoning, producing output that far exceeds what an embedding model can handle
2. **History placed between system and human**: `MessagesPlaceholder("chat_history")` expands to the full message list at that position
3. **Unchanged passthrough condition**: Turn 1 or semantically complete questions don't need rewriting; give the LLM an exit
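
The message ordering and the passthrough condition can be illustrated without any LLM call. The sketch below is a plain-Python stand-in, not LangChain code: `build_rewrite_messages`, `needs_rewrite`, and the abbreviated `SYSTEM_TEXT` are hypothetical names introduced only for illustration, and the `(role, content)` tuples merely mimic the shape of the prompt above.

```python
# Hypothetical stand-in for how the rewrite prompt is assembled; not LangChain code.
Message = tuple[str, str]  # (role, content)

SYSTEM_TEXT = "Rewrite the follow-up question as a standalone question."  # abbreviated


def build_rewrite_messages(chat_history: list[Message], question: str) -> list[Message]:
    """System message first, then the full history, then the new question."""
    return [("system", SYSTEM_TEXT), *chat_history, ("human", question)]


def needs_rewrite(chat_history: list[Message]) -> bool:
    """Turn 1 has no history, so there is nothing to resolve pronouns against."""
    return len(chat_history) > 0


history = [
    ("human", "What is RAGAS?"),
    ("ai", "RAGAS is an evaluation framework for RAG systems."),
]
messages = build_rewrite_messages(history, "What are its four core metrics?")
roles = [role for role, _ in messages]
print(roles)              # ['system', 'human', 'ai', 'human']
print(needs_rewrite([]))  # False
```

Skipping the rewrite when the history is empty saves one LLM call on Turn 1. If the retriever is built with LangChain's `create_history_aware_retriever`, that helper already passes the input straight to the retriever when no `chat_history` is present; the in-prompt "return it unchanged" instruction covers the remaining case of a later turn that is already standalone.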
### The Full ConvRAG Chain
```python
# Import paths assume a recent LangChain package layout.
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

# Step 1: History-aware retrieval
history_aware_retriever = ...  # see above

# Step 2: Generate answer using retrieved docs + conversation history
ANSWER_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "You are a RAG technology expert. Answer based on the reference material.\n"
     "Reference material:\n{context}"),
    MessagesPlaceholder("chat_history"),  # history also informs generation
    ("human", "{input}"),
])
qa_chain = create_stuff_documents_chain(llm, ANSWER_PROMPT)  # llm is assumed to be defined earlier
rag_chain = create_retrieval_chain(history_aware_retriever, qa_chain)

# Step 3: Session-based history management
store: dict[str, ChatMessageHistory] = {}

def get_session_history(session_id: str) -> ChatMessageHistory:
    # Create an empty history the first time a session_id is seen.
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

conv_rag = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)
```
Each `session_id` maps to an isolated conversation history. `RunnableWithMessageHistory` automatically injects history before each invoke and appends the new Q&A pair afterward.
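
To make the inject-then-append cycle concrete, here is a minimal plain-Python sketch of the behavior described above. It is not how `RunnableWithMessageHistory` is implemented: `SessionStore`, `invoke_with_history`, and `fake_chain` are hypothetical names, and `fake_chain` stands in for `rag_chain` so the example runs without an LLM.

```python
from collections import defaultdict
from typing import Callable

Message = tuple[str, str]  # (role, content)


class SessionStore:
    """Keeps one independent message list per session_id."""

    def __init__(self) -> None:
        self._histories: dict[str, list[Message]] = defaultdict(list)

    def get(self, session_id: str) -> list[Message]:
        return self._histories[session_id]


def invoke_with_history(
    chain: Callable[[str, list[Message]], str],
    store: SessionStore,
    session_id: str,
    question: str,
) -> str:
    history = store.get(session_id)
    # Inject a copy of the history so the chain cannot mutate the stored list.
    answer = chain(question, list(history))
    # Append the new Q&A pair only after the chain has produced an answer.
    history.append(("human", question))
    history.append(("ai", answer))
    return answer


def fake_chain(question: str, chat_history: list[Message]) -> str:
    """Stand-in for rag_chain: reports how many prior messages it saw."""
    return f"answered with {len(chat_history)} prior messages"


store = SessionStore()
first = invoke_with_history(fake_chain, store, "alice", "What is RAGAS?")
second = invoke_with_history(fake_chain, store, "alice", "What are its four core metrics?")
other = invoke_with_history(fake_chain, store, "bob", "Which one is best for production?")
print(first)   # answered with 0 prior messages
print(second)  # answered with 2 prior messages
print(other)   # answered with 0 prior messages
```

With the real chain, the session is selected per call through the config, for example `conv_rag.invoke({"input": "What are its four core metrics?"}, config={"configurable": {"session_id": "alice"}})`.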
* * *
## Question Rewriting Results
Three test conversations, showing the rewriting output for each Turn 2 follow-up:
```text
[RAGAS follow-up]
  Original:  What are its four core metrics?
  Rewritten: What are the four core metrics in the RAGAS framework?

[Vector DB follow-up]
  Original:  Which one is best for production?
  Rewritten: Among the common vector databases (Chroma, Pinecone, Milvus, Qdrant),
             which is most suitable for a production environment?

[Advanced RAG follow-up]
  Original:  What about Graph RAG and Agentic RAG?
  Rewritten: What problems do Graph RAG and Agentic RAG each solve?
```
"Its" becomes "in the RAGAS framework." "Which one" expands to a full list of the databases mentioned in Turn 1. Without this disambiguation, these questions produce garbage retrieval results. With it, they retrieve the right documents.
* * *
## Retrieval Comparison: The Real Story in Turn 2
Retrieving with "What are its four core metrics?" directly versus retrieving with "What are the four core metrics in the RAGAS framework?":
```text
Baseline retrieved (raw: "What are its four core metrics?"):
  doc1: RAG core workflow: Retrieval → Augmentation → Generation.
        RAG was introduced by Meta AI in 2020...
  doc2: Document chunking strategies affect RAG retrieval quality:
        fixed-size chunking (chunk_size=512-1024) works for general cases...

ConvRAG retrieved (rewritten: "What are the four core metrics in RAGAS?"):
  doc1: RAGAS is an evaluation framework designed specifically for RAG systems,
        introduced by Es et al. in 2023.
        The four core metrics: 1. context_recall... 2. context_precision...
  doc2: Embedding models convert text to vectors, setting the quality ceiling
        for semantic retrieval...
```
The Baseline retrieves the RAG introduction and chunking strategies: both are about RAG, but neither contains the RAGAS metrics. ConvRAG retrieves the RAGAS document directly. The gap is qualitative, not marginal.
* * *
## RAGAS Metrics: An Interesting Reversal
```text
======================================================================
RAGAS Metrics Comparison (Baseline vs Conversational RAG)
Metric               Baseline   ConvRAG    Delta
----------------------------------------------------------------------
context_recall       0.667      0.400      -0.267
context_precision    0.880      0.870      -0.010
faithfulness         1.000      1.000      +0.000
answer_relevancy     0.432      0.430      -0.002
Note: Evaluated on the final turn (Turn 3) of each 3-turn conversation
```
ConvRAG's `context_recall` is 0.267 lower than the Baseline's. That's counterintuitive: why would "better retrieval" recall less of the relevant context?
**The answer is in what RAGAS actually evaluated.**
The evaluation ran on Turn 3 of each conversation:
- "Which metric is hardest to improve, and why?"
- "If my team is just starting with RAG, which database should we choose?"
- "What is the evolutionary relationship between these four techniques?"
These Turn 3 questions are **semantically complete on their own**. Even without conversation history, retrieving directly on these questions finds the right documents. The Baseline does exactly that, and it works.
ConvRAG takes Turn 3 questions and rewrites them incorporating the accumulated history. "What is the evolutionary relationship between these four techniques" might become "What is the evolutionary relationship between Self-RAG, CRAG, Graph RAG, and Agentic RAG": semantically richer, but the changed phrasing may cause retrieval to land on slightly different documents, reducing `context_recall`.
**RAGAS failed to capture Conversational RAG's core value.**
The value is in Turn 2, where pronoun disambiguation turns a failed retrieval into a correct one. RAGAS evaluated Turn 3, where the questions happened to work without history. The experiment design favored the Baseline scenario, obscuring ConvRAG's genuine contribution.
This is a recurring theme in this series: metrics measure what they measure. Always ask: what scenario did the metric actually test? What did it miss?
* * *
## When to Use Conversational RAG
| Scenario | Baseline RAG | Conversational RAG |
| --- | --- | --- |
| Every question is standalone | ✅ Direct retrieval, low cost | ⚠️ Rewriting adds latency and cost |
| Follow-ups with pronouns ("it", "which one") | ❌ Retrieval fails | ✅ Disambiguation → correct retrieval |
| Follow-ups with omitted subjects | ❌ Retrieval fails | ✅ Subject restored → correct retrieval |
| Multi-turn deep exploration of a topic | ⚠️ No context accumulation | ✅ Coherent, history-informed answers |

**Memory management trade-offs**: this implementation keeps the full conversation history. It's accurate but token cost grows with each turn. Common production alternatives:
- **Sliding window**: keep only the last N turns
- **Summary memory**: compress older turns into a summary via LLM, keep the most recent 1-2 turns in full detail
The choice depends on conversation length and how far back the relevant context might reach.
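
Both alternatives can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's implementation: `sliding_window`, `summary_memory`, and `fake_summarize` are hypothetical names, a turn is assumed to be exactly one human message followed by one AI message, and `fake_summarize` stands in for the LLM summarization call.

```python
from typing import Callable

Message = tuple[str, str]  # (role, content)


def sliding_window(messages: list[Message], max_turns: int) -> list[Message]:
    """Keep only the last `max_turns` turns (one turn = human + ai message)."""
    if max_turns <= 0:
        return []
    return messages[-2 * max_turns:]


def summary_memory(
    messages: list[Message],
    keep_turns: int,
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Compress everything older than the last `keep_turns` turns into one system message."""
    cutoff = max(len(messages) - 2 * keep_turns, 0)
    older, recent = messages[:cutoff], messages[cutoff:]
    if not older:
        return recent
    return [("system", f"Summary of earlier turns: {summarize(older)}"), *recent]


def fake_summarize(older: list[Message]) -> str:
    """Stand-in for an LLM summarization call."""
    return f"{len(older) // 2} earlier turn(s) omitted"


history = [
    ("human", "What is RAGAS?"), ("ai", "An evaluation framework for RAG."),
    ("human", "What are its four core metrics?"), ("ai", "context_recall, ..."),
    ("human", "Which metric is hardest to improve, and why?"), ("ai", "..."),
]
windowed = sliding_window(history, max_turns=2)
compressed = summary_memory(history, keep_turns=1, summarize=fake_summarize)
print(len(windowed))  # 4
print(compressed[0])  # ('system', 'Summary of earlier turns: 2 earlier turn(s) omitted')
```

Note the risk a small window carries for question rewriting: with `max_turns=1`, the Turn 1 message that names RAGAS is dropped, so a later "its" may no longer have a referent in the history passed to the rewriter.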
* * *
## Full Code
Complete code is open-sourced at:
[https://github.com/chendongqi/llm-in-action/tree/main/18-conversational-rag](https://github.com/chendongqi/llm-in-action/tree/main/18-conversational-rag)
Key file:
- `conversational_rag.py`: the full implementation (two pipelines, question rewriting demo, RAGAS evaluation)
How to run:
```bash
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/18-conversational-rag
cp .env.example .env
pip install -r requirements.txt
python conversational_rag.py
```
* * *
## Summary
This article implemented Conversational RAG. Key findings:
1. **Pronoun disambiguation is the core problem**: "what are its four metrics?" retrieves completely irrelevant documents; the Turn 2 retrieval comparison makes this gap unmistakable
2. **Question rewriting works well**: GLM-4-flash accurately rewrites "what are its four metrics?" to "what are the four core metrics of the RAGAS framework?"; disambiguation quality is solid
3. **RAGAS showed a reversal**: ConvRAG's `context_recall` was lower (0.400 vs 0.667), because the Turn 3 test questions were semantically complete on their own; direct retrieval happened to work fine for those specific questions
4. **Metrics and scenario value diverged most sharply here**: the value of Conversational RAG lies in the "pronoun follow-up fails" scenario, which RAGAS didn't test; the numbers don't reflect the actual benefit

Across this series: Self-RAG asked "should we retrieve?", CRAG asked "is what we retrieved good enough?", Graph RAG handled relational reasoning, Agentic RAG unified them into a decision loop, and Conversational RAG now handles the temporal dimension, making each question aware of what came before. Each one expands the range of scenarios the system handles correctly.
* * *
## References
- [LangChain Conversational RAG Documentation](https://python.langchain.com/docs/tutorials/qa_chat_history/)
- [RunnableWithMessageHistory API Reference](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.history.RunnableWithMessageHistory.html)
- [RAGAS: Automated Evaluation of Retrieval Augmented Generation (Es et al., 2023)](https://arxiv.org/abs/2309.15217)