Difficulty: Intermediate

RAG Series (18): Conversational RAG - The Pronoun Problem in Multi-Turn Dialogue

By Codcompass Team · 7 min read

The Hidden Assumption in Single-Turn RAG

Every article in this series so far has worked with one type of question: a standalone, self-contained query that retrieves documents and generates an answer.

Real conversations don't work like that.

After asking "What is RAGAS?", a user naturally continues:

Turn 1: What is RAGAS?
Turn 2: What are its four core metrics?
Turn 3: Which one is hardest to improve, and why?


Turn 1 is fine on its own. "Its" in Turn 2 refers to RAGAS, and "which one" in Turn 3 refers to the four metrics mentioned in Turn 2. To a human, the referents are obvious. To a retrieval system, "What are its four core metrics?" is a query with no subject: the vector search returns whatever is semantically closest to "its four core metrics," which could be documents about anything that has four metrics.

This is single-turn RAG's hidden assumption: every question is independent and complete. The moment follow-up questions appear, this assumption breaks.
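To make the failure concrete, here is a minimal sketch that uses plain word overlap as a crude stand-in for vector similarity. The three-document corpus and the helper names (`tokens`, `overlap`, `rank`) are invented for this illustration; a real embedding model scores differently, but the missing subject hurts it in the same way.

```python
import re

# Hypothetical mini-corpus, invented for this illustration.
DOCS = {
    "ragas": "RAGAS is a framework for evaluating RAG pipelines with four core metrics.",
    "kubernetes": "Kubernetes exposes four core metrics for monitoring cluster health.",
    "seo": "Four core metrics matter most when measuring page performance.",
}

def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query: str, doc: str) -> int:
    """Number of distinct words shared by the query and the document."""
    return len(tokens(query) & tokens(doc))

def rank(query: str) -> list[tuple[str, int]]:
    """(doc_id, score) pairs sorted by score, best first."""
    scores = [(doc_id, overlap(query, text)) for doc_id, text in DOCS.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

follow_up = "What are its four core metrics?"
standalone = "What are the four core metrics in the RAGAS framework?"

print(rank(follow_up))   # all three documents tie: nothing points at RAGAS
print(rank(standalone))  # the RAGAS document now ranks first
```

With the pronoun in place, the only matching words are "four core metrics," so every document scores the same. Once the subject is restored, "RAGAS" and "framework" break the tie.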


History-Aware Retriever: Rewrite Before You Retrieve

The fix is straightforward: before retrieval, use one LLM call to combine the current question with the conversation history and rewrite it into a standalone, self-contained question. Then use the rewritten question for retrieval.

Turn 1: What is RAGAS?                → retrieve directly (no history)
Turn 2: What are its four metrics?
        ↓ combine with Turn 1 history
        "What are the four core metrics in the RAGAS framework?"
        ↓ retrieve using rewritten question
Turn 3: Which one is hardest to improve?
        ↓ combine with Turn 1+2 history
        "Among RAGAS's four metrics, which is hardest to improve, and why?"
        ↓ retrieve using rewritten question
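
The flow above can be sketched in plain Python, independent of any framework. Everything here is a stand-in under stated assumptions: `fake_rewrite` returns canned rewrites where a real system would call an LLM, `fake_retrieve` echoes the query where a real system would run a vector search, and the `History` shape is a hypothetical choice for this example.

```python
from typing import Callable

History = list[tuple[str, str]]  # (question, answer) pairs, oldest first

def history_aware_retrieve(
    question: str,
    history: History,
    rewrite: Callable[[str, History], str],
    retrieve: Callable[[str], list[str]],
) -> tuple[str, list[str]]:
    """Return (query actually sent to the retriever, retrieved documents)."""
    # First turn: there is nothing to resolve, so skip the extra LLM call.
    query = question if not history else rewrite(question, history)
    return query, retrieve(query)

def fake_rewrite(question: str, history: History) -> str:
    """Stand-in for the LLM rewrite call (canned answers, for illustration only)."""
    canned = {
        "What are its four core metrics?":
            "What are the four core metrics in the RAGAS framework?",
        "Which one is hardest to improve, and why?":
            "Among RAGAS's four metrics, which is hardest to improve, and why?",
    }
    return canned.get(question, question)

def fake_retrieve(query: str) -> list[str]:
    """Stand-in for the vector search."""
    return [f"<document matching: {query}>"]

history: History = []
sent_queries: list[str] = []
for question in [
    "What is RAGAS?",
    "What are its four core metrics?",
    "Which one is hardest to improve, and why?",
]:
    query, docs = history_aware_retrieve(question, history, fake_rewrite, fake_retrieve)
    sent_queries.append(query)
    # A real pipeline would generate the answer from docs; a placeholder is stored here.
    history.append((question, "<answer generated from retrieved documents>"))

print(sent_queries)
```

The retriever never sees a pronoun: Turn 1 passes through unchanged, and Turns 2 and 3 arrive as standalone questions. The cost is one additional LLM call per follow-up turn.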


LangChain provides create_history_aware_retriever for this pattern. This implementation builds the chain manually instead so it can add a truncation step, which keeps a verbose LLM rewrite from exceeding the embedding model's 512-token input limit:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableBranch, RunnableLambda

# CONTEXTUALIZE_PROMPT is defined in "The Contextualize Prompt" below;
# llm and retriever are assumed to be set up elsewhere.

def _extract_standalone_question(text: str) -> str:
    """Keep only the first non-empty line. Guards against verbose LLM output
    exceeding the embedding model's 512-token input limit."""
    lines = [line.strip() for line in text.strip().split("\n") if line.strip()]
    question = lines[0] if lines else text
    return question[:400]  # hard cap in characters, not tokens

# Prompt -> LLM -> plain string -> first line only
_contextualize_chain = (
    CONTEXTUALIZE_PROMPT
    | llm
    | StrOutputParser()
    | RunnableLambda(_extract_standalone_question)
)

# No history → retrieve directly; history present → rewrite first
history_aware_retriever = RunnableBranch(
    (
        lambda x: not x.get("chat_history"),
        (lambda x: x["input"]) | retriever,
    ),
    _contextualize_chain | retriever,
)
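
To see what the truncation step does to typical LLM output, the sketch below applies the same first-line logic to a few sample strings. The function is repeated here (with the cap exposed as a hypothetical `max_chars` parameter) so the snippet runs on its own, and the sample outputs are invented for illustration.

```python
def extract_standalone_question(text: str, max_chars: int = 400) -> str:
    """Same logic as _extract_standalone_question, with the cap as a parameter."""
    lines = [line.strip() for line in text.strip().split("\n") if line.strip()]
    question = lines[0] if lines else text
    return question[:max_chars]

clean = "What are the four core metrics in the RAGAS framework?"

# The model answers correctly, then keeps explaining itself.
verbose = (
    "What are the four core metrics in the RAGAS framework?\n"
    "\n"
    "Explanation: 'its' refers to RAGAS, introduced in the previous turn."
)

# A single runaway line, far longer than the cap.
runaway = "What " + "really " * 200 + "is RAGAS?"

# The model opens with a preamble instead of the question.
preamble = "Here is the standalone question:\n" + clean

print(extract_standalone_question(clean) == clean)    # True: passes through untouched
print(extract_standalone_question(verbose) == clean)  # True: the explanation is dropped
print(len(extract_standalone_question(runaway)))      # 400: cut at the character cap
print(extract_standalone_question(preamble))          # the preamble, not the question
```

Two caveats follow from this behavior. The cap counts characters rather than tokens, so it is only an approximate guard whose safety margin depends on the tokenizer and the language of the text. And because only the first non-empty line survives, a model that opens with a preamble would send that preamble to the retriever, so the prompt should instruct the model to output the question and nothing else.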



The Architecture

The Contextualize Prompt

from langchain_core.prompts import ChatPromptTemplate

CONTEXTUALIZE_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Given the conversation history and the latest question, rewrite the "
     "question as a standalone, self-contained question.\n"
  
