and returning a structured response object containing the answer and source metadata.
Implementation Strategy
The following TypeScript implementation utilizes the LangChain ecosystem to construct the pipeline. This approach leverages the createStuffDocumentsChain for efficient context injection and createRetrievalChain for end-to-end orchestration.
Step 1: Define Grounding Constraints and Document Chain
The document chain is responsible for formatting retrieved chunks into the prompt. It is critical to define a system message that enforces strict grounding. The model must be instructed to rely solely on the provided context and to decline answering if the context is insufficient.
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { StringOutputParser } from "@langchain/core/output_parsers";
// Configuration for deterministic generation
const llmConfig = {
modelName: "gpt-4o",
temperature: 0,
maxTokens: 1024,
};
const llm = new ChatOpenAI(llmConfig);
// Strict grounding prompt template
const groundingPrompt = ChatPromptTemplate.fromMessages([
[
"system",
`You are a precise knowledge analyst. Your task is to answer the user's question
based ONLY on the provided context.
RULES:
1. Use ONLY the information in the context. Do not use external knowledge.
2. If the context does not contain sufficient information, respond with:
"I cannot answer this question based on the available documentation."
3. Cite specific sources when referencing facts.
Context:
{context}`,
],
["human", "{input}"],
]);
// Create the document processing chain
const combineDocsChain = await createStuffDocumentsChain({
llm,
prompt: groundingPrompt,
outputParser: new StringOutputParser(),
});
Rationale:
- Temperature 0: Setting temperature to zero minimizes stochastic variation, ensuring the model focuses on factual extraction rather than creative generation.
- Explicit Refusal Instruction: The prompt includes a fallback clause instructing the model to decline answering when context is insufficient. This prevents the model from hallucinating to satisfy the user query.
- Separation of Concerns:
createStuffDocumentsChain handles the mechanics of joining documents and managing the context variable, keeping the orchestration logic clean.
Step 2: Orchestrate Retrieval and Generation
The retrieval chain links the vector store retriever to the document chain. This component automates the invocation sequence and structures the output.
import { createRetrievalChain } from "langchain/chains/retrieval";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
// Assume vectorStore is initialized with domain documents
const vectorStore = await MemoryVectorStore.fromTexts(
["Document content A...", "Document content B..."],
[{ source: "doc_a" }, { source: "doc_b" }],
new OpenAIEmbeddings()
);
const retriever = vectorStore.asRetriever({
k: 5,
searchType: "mmr",
searchKwargs: { fetchK: 10, lambdaMult: 0.5 },
});
// Build the full retrieval chain
const retrievalChain = await createRetrievalChain({
retriever,
combineDocsChain,
});
// Execute query
const query = "What are the compliance requirements for data residency?";
const result = await retrievalChain.invoke({ input: query });
console.log("Answer:", result.answer);
console.log("Sources:", result.context.map((doc) => doc.metadata.source));
Rationale:
- Structured Output: The chain returns an object with
answer and context keys. This structure is essential for displaying citations to the end-user and for logging provenance data.
- MMR Retrieval: Using Maximal Marginal Relevance (MMR) with
lambdaMult balances relevance with diversity, reducing redundancy in the context window and improving the signal-to-noise ratio for the LLM.
- Metadata Preservation: The retriever configuration ensures that metadata (e.g., source identifiers) is preserved in the context documents, enabling traceability.
Architecture Decisions
- Stuffing vs. Map-Reduce: The
createStuffDocumentsChain approach is selected for its simplicity and effectiveness when the retrieved context fits within the model's context window. For scenarios requiring aggregation across hundreds of documents, a map-reduce strategy would be necessary, but stuffing provides lower latency and better coherence for typical RAG use cases.
- Prompt Isolation: The grounding instructions are embedded in the system message, isolated from user input. This reduces the risk of prompt injection and ensures the constraints are applied consistently regardless of the query content.
- Output Parsing: A
StringOutputParser is used to extract the raw text response. In production, this can be replaced with structured output parsers to enforce JSON schemas for downstream integration.
Pitfall Guide
1. The "Open Book" Trap
Explanation: Omitting strict grounding constraints in the prompt template allows the LLM to revert to its internal knowledge base. Even with retrieved context present, the model may blend external facts with internal data, reintroducing hallucinations.
Fix: Always include explicit instructions in the system prompt mandating that the model answer only based on the provided context. Include a fallback response for insufficient information.
2. Token Budget Blowouts
Explanation: createStuffDocumentsChain concatenates all retrieved documents. If the retriever returns a large number of chunks or chunks are oversized, the total token count may exceed the model's context window, causing truncation or API errors.
Fix: Implement token counting before chain invocation. Configure the retriever with a reasonable k value and use chunking strategies that respect token limits. Consider dynamic context window management that prioritizes the most relevant chunks when limits are approached.
3. Retrieval-Generation Decoupling
Explanation: Optimizing retrieval metrics in isolation does not guarantee generation quality. A retriever may return highly relevant chunks that are poorly formatted or contain noise that confuses the LLM.
Fix: Evaluate the system end-to-end. Use metrics that assess both retrieval relevance and generation faithfulness. Tune retriever parameters (e.g., similarity threshold, chunk size) based on generation outcomes, not just retrieval scores.
Explanation: Failing to inspect or log the context metadata prevents debugging and verification. Without access to the source chunks, engineers cannot determine if a failure was due to poor retrieval or poor generation.
Fix: Always log the context array alongside the answer. Implement structured logging that captures the query, retrieved sources, and generated response. Use this data to build an evaluation harness for continuous improvement.
5. Data Residency Blind Spots
Explanation: While RAG prevents data from being used for model training, context snippets are still transmitted to the LLM API for inference. Organizations may inadvertently send sensitive or regulated data to third-party providers without proper compliance checks.
Fix: Implement a PII redaction layer before retrieval or before the LLM call. Review the LLM provider's data policy regarding data residency, retention, and usage. Consider on-premise models for highly sensitive domains.
6. Prompt Injection via Context
Explanation: Retrieved documents may contain malicious instructions or adversarial text that override system prompts. If user-generated content is indexed, this risk increases significantly.
Fix: Sanitize retrieved content before injection. Use robust prompt separation techniques. Implement adversarial testing to evaluate the system's resilience to injection attacks. Consider using a classifier to filter malicious documents during ingestion.
7. Position Bias in Context
Explanation: LLMs exhibit position bias, where information at the beginning or end of the context window is more likely to be utilized than information in the middle. Stuffing documents without ordering can cause relevant chunks to be ignored.
Fix: Order retrieved chunks by relevance score. Place the most critical information at the start or end of the context block. Experiment with reordering strategies to mitigate position bias effects.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Low Latency Requirement | Stuffing Chain with MMR Retrieval | Direct context injection minimizes processing steps. MMR reduces redundancy. | Low compute cost; moderate token usage. |
| High Accuracy / Complex Aggregation | Map-Reduce Chain | Processes chunks independently and synthesizes results, handling larger contexts. | Higher compute cost; increased latency. |
| Strict Compliance / PII | On-Premise LLM + Redaction Layer | Keeps data within organizational boundary; redaction prevents leakage. | High infrastructure cost; lower API dependency. |
| Dynamic Query Types | Intent Routing + Specialized Prompts | Routes queries to domain-specific prompts, improving relevance. | Moderate development cost; improved accuracy. |
Configuration Template
// rag-pipeline.config.ts
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { VectorStoreRetriever } from "@langchain/core/vectorstores";
export interface RAGPipelineConfig {
retriever: VectorStoreRetriever;
modelName?: string;
maxTokens?: number;
temperature?: number;
}
export async function buildRAGPipeline(config: RAGPipelineConfig) {
const {
retriever,
modelName = "gpt-4o",
maxTokens = 1024,
temperature = 0,
} = config;
const llm = new ChatOpenAI({
modelName,
temperature,
maxTokens,
});
const prompt = ChatPromptTemplate.fromMessages([
[
"system",
`You are a domain expert assistant. Answer the user's question strictly based on the provided context.
If the context does not contain the answer, state that you do not have enough information.
Context:
{context}`,
],
["human", "{input}"],
]);
const combineDocsChain = await createStuffDocumentsChain({
llm,
prompt,
outputParser: new StringOutputParser(),
});
const retrievalChain = await createRetrievalChain({
retriever,
combineDocsChain,
});
return retrievalChain;
}
// Usage Example
// const pipeline = await buildRAGPipeline({ retriever: myRetriever });
// const result = await pipeline.invoke({ input: "User query" });
Quick Start Guide
-
Install Dependencies:
npm install langchain @langchain/openai @langchain/core
-
Initialize Vector Store:
Load your domain documents into a vector store (e.g., MemoryVectorStore, Pinecone, or Supabase) and create a retriever with appropriate k and search parameters.
-
Build the Pipeline:
Use the configuration template to instantiate the RAG pipeline. Ensure the prompt includes strict grounding instructions and the LLM is configured with temperature: 0.
-
Invoke and Verify:
Call invoke with a test query. Inspect the result object to confirm the answer is grounded and the context array contains relevant source metadata.
-
Deploy with Logging:
Wrap the pipeline invocation in a logging layer that captures the query, retrieved sources, and generated answer. Monitor for hallucination indicators and context window utilization.