RAG Series (18): Conversational RAG — The Pronoun Problem in Multi-Turn Dialogue

By Codcompass Team·2026-05-17·7 min read

## The Hidden Assumption in Single-Turn RAG

Every article in this series so far has worked with one type of question: a standalone, self-contained query that retrieves documents and generates an answer.

Real conversations don't work like that.

After asking "What is RAGAS?", a user naturally continues:

Turn 1: What is RAGAS?
Turn 2: What are its four core metrics?
Turn 3: Which one is hardest to improve, and why?


Turn 1 is fine. "Its" in Turn 2 refers to RAGAS. "Which one" in Turn 3 refers to the four metrics mentioned in Turn 2. To a human, the referent is obvious. To a retrieval system, "what are its four core metrics?" is a query with no subject — the vector search will find documents semantically similar to "its four metrics," which could be anything.

This is single-turn RAG's hidden assumption: every question is independent and complete. The moment follow-up questions appear, this assumption breaks.
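The ambiguity can be illustrated with a toy sketch. It uses keyword overlap as a crude stand-in for embedding similarity (an assumption made for brevity; real vector search behaves differently in detail), and the two one-line documents are invented for illustration. The raw follow-up cannot tell a RAGAS document apart from any other text that mentions "four core metrics", while the rewritten question can.

```python
import re

# Toy corpus: two invented one-line "documents" (illustrative only).
DOCS = {
    "ragas": "RAGAS is an evaluation framework for RAG systems with four core metrics",
    "classifier": "Classifiers are usually judged on four core metrics such as accuracy and recall",
}

STOPWORDS = {"what", "are", "is", "the", "in", "its", "a", "an", "for", "with"}


def keywords(text: str) -> set[str]:
    """Lowercase the text, split it into words, and drop stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def overlap(query: str, doc: str) -> int:
    """Count how many query keywords also appear in the document."""
    return len(keywords(query) & keywords(doc))


raw = "What are its four core metrics?"
rewritten = "What are the four core metrics in the RAGAS framework?"

raw_scores = {name: overlap(raw, text) for name, text in DOCS.items()}
rewritten_scores = {name: overlap(rewritten, text) for name, text in DOCS.items()}

print(raw_scores)        # both documents tie: the query has no subject
print(rewritten_scores)  # the RAGAS document now scores higher
```

With the pronoun left in place, the only usable signal is "four core metrics", which both documents share. Once the subject is restored, "RAGAS" and "framework" break the tie.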

## History-Aware Retriever: Rewrite Before You Retrieve

The fix is straightforward: before retrieval, use one LLM call to combine the current question with the conversation history and rewrite it into a standalone, self-contained question. Then use the rewritten question for retrieval.

Turn 1: What is RAGAS?                → retrieve directly (no history)
Turn 2: What are its four metrics?
        ↓ combine with Turn 1 history
        "What are the four core metrics in the RAGAS framework?"
        ↓ retrieve using rewritten question
Turn 3: Which one is hardest to improve?
        ↓ combine with Turn 1+2 history
        "Among RAGAS's four metrics, which is hardest to improve, and why?"
        ↓ retrieve using rewritten question
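The flow above reduces to a single branch, which can be sketched in plain Python before bringing in LangChain. The names `standalone_query` and `fake_rewrite` are hypothetical, and `fake_rewrite` merely hard-codes the substitution from the example in place of a real LLM call.

```python
from typing import Callable, Sequence

Message = tuple[str, str]  # (role, content)


def standalone_query(
    question: str,
    history: Sequence[Message],
    rewrite: Callable[[str, Sequence[Message]], str],
) -> str:
    """Return the query that should be sent to the retriever.

    With no history the question is used as-is; otherwise the `rewrite`
    callable (standing in for the LLM call) produces a standalone version.
    """
    if not history:
        return question
    return rewrite(question, history)


def fake_rewrite(question: str, history: Sequence[Message]) -> str:
    """Stand-in for the LLM: hard-codes the substitution from the example."""
    return question.replace(
        "its four metrics", "the four core metrics in the RAGAS framework"
    )


history: list[Message] = []
q1 = standalone_query("What is RAGAS?", history, fake_rewrite)

# After Turn 1 the history is no longer empty, so Turn 2 gets rewritten.
history += [("human", "What is RAGAS?"), ("ai", "RAGAS is an evaluation framework for RAG.")]
q2 = standalone_query("What are its four metrics?", history, fake_rewrite)

print(q1)
print(q2)
```

The first turn skips the rewrite entirely, which saves one LLM call per conversation and avoids any risk of the model altering a question that was already complete.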


LangChain provides `create_history_aware_retriever` for this pattern. This implementation builds the chain manually instead, so that it can add a truncation step: if the LLM returns a verbose rewrite, the query could exceed the embedding model's 512-token input limit.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableBranch, RunnableLambda

def _extract_standalone_question(text: str) -> str:
    """Keep only the first non-empty line. Guards against verbose LLM output
    exceeding the embedding model's 512-token input limit."""
    lines = [line.strip() for line in text.strip().split("\n") if line.strip()]
    question = lines[0] if lines else text
    return question[:400]  # hard cap at 400 characters (not tokens)

_contextualize_chain = (
    CONTEXTUALIZE_PROMPT
    | llm
    | StrOutputParser()
    | RunnableLambda(_extract_standalone_question)
)

# No history → retrieve directly; history present → rewrite first
history_aware_retriever = RunnableBranch(
    (
        lambda x: not x.get("chat_history"),
        (lambda x: x["input"]) | retriever,
    ),
    _contextualize_chain | retriever,
)
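To see what the truncation step does to chatty model output, the helper can be exercised in isolation. The sketch below repeats the same logic under a public name with a `max_chars` parameter (both are illustrative choices, not part of the original code) and feeds it three hand-written sample outputs.

```python
def extract_standalone_question(text: str, max_chars: int = 400) -> str:
    """Same logic as the helper above: first non-empty line, capped in length."""
    lines = [line.strip() for line in text.strip().split("\n") if line.strip()]
    question = lines[0] if lines else text
    return question[:max_chars]


# Case 1: the model appends an explanation after the question.
chatty = (
    "What are the four core metrics in the RAGAS framework?\n"
    "\n"
    "Explanation: I replaced 'its' with 'the RAGAS framework' because..."
)
clean = extract_standalone_question(chatty)

# Case 2: a single very long line is cut at the character cap.
capped = extract_standalone_question("x" * 1000)

# Case 3: the model leads with a preamble, so the first line is NOT the question.
preamble_case = extract_standalone_question(
    "Here is the rewritten question:\n"
    "What are the four core metrics in the RAGAS framework?"
)

print(clean)
print(len(capped))
print(preamble_case)
```

The third case shows a limitation of the first-line heuristic: a leading preamble would be kept and the real question dropped. That is one reason the prompt asks for the question only, with no explanation. Note also that the cap counts characters, so it is a rough proxy for the token limit rather than an exact guard.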


## The Architecture

### The Contextualize Prompt

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

CONTEXTUALIZE_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Given the conversation history and the latest question, rewrite the "
     "question as a standalone, self-contained question.\n"
     "Requirements:\n"
     "- Replace all pronouns (it, this, these, which one, etc.) with specific nouns\n"
     "- Fill in any omitted subjects or objects\n"
     "- Output only the rewritten question, no explanation\n"
     "If the question is already complete and standalone, return it unchanged."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])



Three design decisions worth noting:

1.  **"Output only the question"** — without this explicit constraint, the LLM explains its reasoning, producing output that far exceeds what an embedding model can handle
2.  **History placed between system and human** — `MessagesPlaceholder("chat_history")` expands to the full message list at that position
3.  **Unchanged passthrough condition** — Turn 1 or semantically complete questions don't need rewriting; give the LLM an exit
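The message ordering described in the second point can be sketched without LangChain. This is only an approximation of what the placeholder does, using plain `(role, content)` tuples; `build_messages` and the shortened `SYSTEM` text are hypothetical names introduced for the illustration.

```python
Message = tuple[str, str]  # (role, content)

SYSTEM = "Rewrite the latest question as a standalone question. Output only the question."


def build_messages(history: list[Message], question: str) -> list[Message]:
    """Mimic the prompt layout: system first, then history in order, then the new question."""
    return [("system", SYSTEM), *history, ("human", question)]


history = [
    ("human", "What is RAGAS?"),
    ("ai", "RAGAS is an evaluation framework for RAG systems."),
]

messages = build_messages(history, "What are its four core metrics?")
roles = [role for role, _ in messages]

# With an empty history the placeholder contributes nothing.
first_turn_roles = [role for role, _ in build_messages([], "What is RAGAS?")]

print(roles)
print(first_turn_roles)
```

Because the latest question always comes last, the model sees the referent ("RAGAS") before it sees the pronoun ("its"), which is the order a rewrite needs.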

### The Full ConvRAG Chain

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# Step 1: History-aware retrieval
history_aware_retriever = ...  # see above

# Step 2: Generate answer using retrieved docs + conversation history
ANSWER_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "You are a RAG technology expert. Answer based on the reference material.\n"
     "Reference material:\n{context}"),
    MessagesPlaceholder("chat_history"),  # history also informs generation
    ("human", "{input}"),
])
qa_chain = create_stuff_documents_chain(llm, ANSWER_PROMPT)
rag_chain = create_retrieval_chain(history_aware_retriever, qa_chain)

# Step 3: Session-based history management
store: dict[str, ChatMessageHistory] = {}

def get_session_history(session_id: str) -> ChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

conv_rag = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)



Each `session_id` maps to an isolated conversation history. `RunnableWithMessageHistory` automatically injects history before each invoke and appends the new Q&A pair afterward.
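The inject-then-append behaviour can be approximated in a few lines of plain Python. This is a simplified sketch, not LangChain's implementation: `invoke_with_history` and `echo_chain` are hypothetical names, and the "chain" just reports how many prior messages it was given.

```python
from collections import defaultdict
from typing import Callable

Message = tuple[str, str]  # (role, content)

# One history list per session_id; sessions never see each other's messages.
store: dict[str, list[Message]] = defaultdict(list)


def invoke_with_history(
    chain: Callable[[str, list[Message]], str],
    question: str,
    session_id: str,
) -> str:
    """Inject the session's history, call the chain, then record the new exchange."""
    history = store[session_id]
    answer = chain(question, list(history))  # pass a copy so the chain cannot mutate it
    history.append(("human", question))
    history.append(("ai", answer))
    return answer


def echo_chain(question: str, history: list[Message]) -> str:
    """Hypothetical stand-in for the RAG chain: reports how much history it saw."""
    return f"answer to {question!r} (saw {len(history)} prior messages)"


a1 = invoke_with_history(echo_chain, "What is RAGAS?", "alice")
a2 = invoke_with_history(echo_chain, "What are its four core metrics?", "alice")
b1 = invoke_with_history(echo_chain, "What is Chroma?", "bob")

print(a1)
print(a2)
print(b1)
```

With the real wrapper, the session is selected through the invoke config, for example `conv_rag.invoke({"input": "What is RAGAS?"}, config={"configurable": {"session_id": "alice"}})`.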

* * *

## Question Rewriting Results

Three test conversations, showing the rewriting output for each Turn 2 follow-up:

[RAGAS follow-up]
  Original:  What are its four core metrics?
  Rewritten: What are the four core metrics in the RAGAS framework?

[Vector DB follow-up]
  Original:  Which one is best for production?
  Rewritten: Among the common vector databases (Chroma, Pinecone, Milvus, Qdrant), which is most suitable for a production environment?

[Advanced RAG follow-up]
  Original:  What about Graph RAG and Agentic RAG?
  Rewritten: What problems do Graph RAG and Agentic RAG each solve?



"Its" becomes "in the RAGAS framework." "Which one" expands to a full list of the databases mentioned in Turn 1. Without this disambiguation, these questions produce garbage retrieval results. With it, they retrieve the right documents.

* * *

## Retrieval Comparison: The Real Story in Turn 2

Retrieving with "What are its four core metrics?" directly versus retrieving with "What are the four core metrics in the RAGAS framework?":

Baseline retrieved (raw: "What are its four core metrics?"):
  doc1: RAG core workflow: Retrieval → Augmentation → Generation. RAG was introduced by Meta AI in 2020...
  doc2: Document chunking strategies affect RAG retrieval quality: fixed-size chunking (chunk_size=512-1024) works for general cases...

ConvRAG retrieved (rewritten: "What are the four core metrics in RAGAS?"):
  doc1: RAGAS is an evaluation framework designed specifically for RAG systems, introduced by Es et al. in 2023. The four core metrics: 1. context_recall... 2. context_precision...
  doc2: Embedding models convert text to vectors, setting the quality ceiling for semantic retrieval...



Baseline retrieves the RAG introduction and chunking strategies — both about RAG, but neither contains the RAGAS metrics. ConvRAG retrieves the RAGAS document directly. The gap is qualitative, not marginal.

* * *

## RAGAS Metrics: An Interesting Reversal

RAGAS Metrics Comparison (Baseline vs Conversational RAG)

| Metric | Baseline | ConvRAG | Delta |
| --- | --- | --- | --- |
| context\_recall | 0.667 | 0.400 | ↓ -0.267 ◀ |
| context\_precision | 0.880 | 0.870 | → -0.010 |
| faithfulness | 1.000 | 1.000 | → +0.000 |
| answer\_relevancy | 0.432 | 0.430 | → -0.002 |

Note: Evaluated on the final turn (Turn 3) of each 3-turn conversation



ConvRAG's context\_recall is 0.267 lower than Baseline. That's counterintuitive: why would a retriever that resolves references recover less of the expected context?

**The answer is in what RAGAS actually evaluated.**

The evaluation ran on Turn 3 of each conversation:

-   "Which metric is hardest to improve, and why?"
-   "If my team is just starting with RAG, which database should we choose?"
-   "What is the evolutionary relationship between these four techniques?"

These Turn 3 questions are **semantically complete on their own**. Even without conversation history, retrieving directly on these questions finds the right documents. The Baseline does exactly that — and it works.

ConvRAG takes Turn 3 questions and rewrites them incorporating the accumulated history. "What is the evolutionary relationship between these four techniques" might become "What is the evolutionary relationship between Self-RAG, CRAG, Graph RAG, and Agentic RAG" — semantically richer, but the changed phrasing may cause the retrieval to land on slightly different documents, reducing context\_recall.

**RAGAS failed to capture Conversational RAG's core value.**

The value is in Turn 2 — pronoun disambiguation turning a failed retrieval into a correct one. RAGAS evaluated Turn 3, where the questions happened to work without history. The experiment design favored the Baseline scenario, obscuring ConvRAG's genuine contribution.

This is a recurring theme in this series: metrics measure what they measure. Always ask — what scenario did the metric actually test? What did it miss?
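One way to make that question concrete is to break retrieval quality down by turn instead of scoring only the final one. The sketch below is a hypothetical harness, not the article's evaluation code: it uses a crude document-ID recall rather than RAGAS's LLM-based context\_recall, and the records are made-up placeholders, not results from the experiment.

```python
from collections import defaultdict
from statistics import mean

# Placeholder records, NOT results from the article's experiment.
# Each record: (turn index, IDs that should be retrieved, IDs actually retrieved).
records = [
    (2, {"ragas"}, {"rag_intro", "chunking"}),
    (2, {"vector_db"}, {"vector_db", "embedding"}),
    (3, {"ragas"}, {"ragas", "embedding"}),
    (3, {"vector_db"}, {"vector_db", "chunking"}),
]


def doc_recall(expected: set[str], retrieved: set[str]) -> float:
    """Fraction of expected documents that were retrieved (a crude proxy for recall)."""
    return len(expected & retrieved) / len(expected) if expected else 0.0


def recall_by_turn(rows) -> dict[int, float]:
    """Average document-level recall separately for each turn index."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for turn, expected, retrieved in rows:
        buckets[turn].append(doc_recall(expected, retrieved))
    return {turn: mean(scores) for turn, scores in sorted(buckets.items())}


per_turn = recall_by_turn(records)
print(per_turn)
```

Reporting one number per turn makes it visible when a pipeline only struggles on pronoun-heavy follow-ups, which a single final-turn score can hide.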

* * *

## When to Use Conversational RAG

| Scenario | Baseline RAG | Conversational RAG |
| --- | --- | --- |
| Every question is standalone | ✅ Direct retrieval, low cost | ⚠️ Rewriting adds latency and cost |
| Follow-ups with pronouns ("it", "which one") | ❌ Retrieval fails | ✅ Disambiguation → correct retrieval |
| Follow-ups with omitted subjects | ❌ Retrieval fails | ✅ Subject restored → correct retrieval |
| Multi-turn deep exploration of a topic | ⚠️ No context accumulation | ✅ Coherent, history-informed answers |

**Memory management trade-offs**: this implementation keeps the full conversation history. It's accurate but token cost grows with each turn. Common production alternatives:

-   **Sliding window**: keep only the last N turns
-   **Summary memory**: compress older turns into a summary via LLM, keep the most recent 1–2 turns in full detail

The choice depends on conversation length and how far back the relevant context might reach.
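Both alternatives can be sketched as pure functions over a message list. This is a minimal illustration under stated assumptions: a turn is taken to be one human message plus one AI message, the function names are hypothetical, and `summarize` stands in for an LLM summarization call.

```python
from typing import Callable

Message = tuple[str, str]  # (role, content)


def sliding_window(history: list[Message], max_turns: int) -> list[Message]:
    """Keep only the last `max_turns` turns (one turn = one human + one ai message)."""
    if max_turns <= 0:
        return []
    return history[-2 * max_turns:]


def summary_memory(
    history: list[Message],
    keep_turns: int,
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Replace older turns with one summary message; keep recent turns verbatim."""
    cutoff = max(len(history) - 2 * keep_turns, 0)
    older, recent = history[:cutoff], history[cutoff:]
    if not older:
        return recent
    return [("system", f"Summary of earlier conversation: {summarize(older)}"), *recent]


history = [
    ("human", "q1"), ("ai", "a1"),
    ("human", "q2"), ("ai", "a2"),
    ("human", "q3"), ("ai", "a3"),
]

windowed = sliding_window(history, max_turns=2)
compressed = summary_memory(
    history,
    keep_turns=1,
    summarize=lambda msgs: f"{len(msgs)} earlier messages",  # placeholder for an LLM call
)

print(windowed)
print(compressed)
```

The sliding window is cheap but forgets anything older than N turns, so a pronoun that refers back further can no longer be resolved. Summary memory keeps a trace of older context at the cost of an extra LLM call and possible loss of detail.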

* * *

## Full Code

Complete code is open-sourced at:

[https://github.com/chendongqi/llm-in-action/tree/main/18-conversational-rag](https://github.com/chendongqi/llm-in-action/tree/main/18-conversational-rag)

Key file:

-   `conversational_rag.py` — full implementation: two pipelines, question rewriting demo, RAGAS evaluation

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/18-conversational-rag
cp .env.example .env
pip install -r requirements.txt
python conversational_rag.py



* * *

## Summary

This article implemented Conversational RAG. Key findings:

1.  **Pronoun disambiguation is the core problem** — "what are its four metrics?" retrieves completely irrelevant documents; the Turn 2 retrieval comparison makes this gap unmistakable
2.  **Question rewriting works well** — GLM-4-flash accurately rewrites "what are its four metrics?" to "what are the four core metrics of the RAGAS framework?"; disambiguation quality is solid
3.  **RAGAS showed a reversal** — ConvRAG's context\_recall was lower (0.400 vs 0.667), because the Turn 3 test questions were semantically complete on their own; direct retrieval happened to work fine for those specific questions
4.  **Metrics and scenario value diverged most sharply here** — the value of Conversational RAG lies in the "pronoun follow-up fails" scenario, which RAGAS didn't test; the numbers don't reflect the actual benefit

Across this series: Self-RAG asked "should we retrieve?", CRAG asked "is what we retrieved good enough?", Graph RAG handled relational reasoning, Agentic RAG unified them into a decision loop, and Conversational RAG now handles the temporal dimension — making each question aware of what came before. Each one expands the range of scenarios the system handles correctly.

* * *

## References

-   [LangChain Conversational RAG Documentation](https://python.langchain.com/docs/tutorials/qa_chat_history/)
-   [RunnableWithMessageHistory API Reference](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.history.RunnableWithMessageHistory.html)
-   [RAGAS: Automated Evaluation of Retrieval Augmented Generation (Es et al., 2023)](https://arxiv.org/abs/2309.15217)
