ismatched models cause retrieval failure.
Implementation
// types.ts
export interface DocumentChunk {
  id: string;
  content: string;
  metadata: Record<string, string | number | boolean>;
}

export interface RetrievalResult {
  chunk: DocumentChunk;
  similarityScore: number;
}

export interface ChunkingStrategy {
  split(text: string, maxTokens: number): DocumentChunk[];
}
// chunking-strategies.ts
import { ChunkingStrategy, DocumentChunk } from './types';

// Rough heuristic: about 4 characters per token for English text.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

export class ParagraphAwareChunker implements ChunkingStrategy {
  split(text: string, maxTokens: number): DocumentChunk[] {
    const paragraphs = text.split(/\n\s*\n/).filter(p => p.trim().length > 0);
    const chunks: DocumentChunk[] = [];
    let currentBuffer = '';

    for (const para of paragraphs) {
      const estimatedTokens = estimateTokens(para);
      if (estimatedTokens > maxTokens) {
        // Flush pending paragraphs first so chunks stay in document order
        if (currentBuffer) {
          chunks.push(this.createChunk(currentBuffer));
          currentBuffer = '';
        }
        // Fallback for oversized paragraphs
        chunks.push(...this.splitBySentences(para, maxTokens));
      } else if (estimateTokens(currentBuffer) + estimatedTokens > maxTokens) {
        chunks.push(this.createChunk(currentBuffer));
        currentBuffer = para;
      } else {
        currentBuffer += (currentBuffer ? '\n\n' : '') + para;
      }
    }

    if (currentBuffer) chunks.push(this.createChunk(currentBuffer));
    return chunks;
  }

  private splitBySentences(text: string, maxTokens: number): DocumentChunk[] {
    // The `|$` branch keeps trailing text that has no closing punctuation
    const sentences = text.match(/[^.!?]+(?:[.!?]+|$)/g) || [text];
    const chunks: DocumentChunk[] = [];
    let buffer = '';

    for (const sentence of sentences) {
      // Only flush a non-empty buffer; a single sentence longer than
      // maxTokens is emitted as its own oversized chunk
      if (buffer && estimateTokens(buffer) + estimateTokens(sentence) > maxTokens) {
        chunks.push(this.createChunk(buffer));
        buffer = sentence;
      } else {
        buffer += sentence;
      }
    }

    if (buffer) chunks.push(this.createChunk(buffer));
    return chunks;
  }

  private createChunk(content: string): DocumentChunk {
    return {
      // Relies on the global Web Crypto object; on older Node versions where
      // it is unavailable, import randomUUID from 'node:crypto' instead
      id: crypto.randomUUID(),
      content: content.trim(),
      metadata: {}
    };
  }
}
// rag-orchestrator.ts
import { ChunkingStrategy, DocumentChunk, RetrievalResult } from './types';

type ChunkMetadata = DocumentChunk['metadata'];

export class RAGOrchestrator {
  private chunkingStrategy: ChunkingStrategy;
  private vectorStore: VectorStoreInterface;
  private embeddingModel: EmbeddingModelInterface;

  constructor(
    strategy: ChunkingStrategy,
    store: VectorStoreInterface,
    model: EmbeddingModelInterface
  ) {
    this.chunkingStrategy = strategy;
    this.vectorStore = store;
    this.embeddingModel = model;
  }

  async ingestDocument(docId: string, content: string, meta: ChunkMetadata): Promise<void> {
    const chunks = this.chunkingStrategy.split(content, 500);
    if (chunks.length === 0) return; // Nothing to embed or store

    // Enrich chunks with document metadata; sourceDocId is spread last
    // so caller-supplied metadata cannot overwrite it
    const enrichedChunks = chunks.map(chunk => ({
      ...chunk,
      metadata: { ...chunk.metadata, ...meta, sourceDocId: docId }
    }));

    // Batch embed for efficiency
    const embeddings = await this.embeddingModel.encodeBatch(
      enrichedChunks.map(c => c.content)
    );

    // Upsert to vector store
    await this.vectorStore.upsert(
      enrichedChunks.map((chunk, idx) => ({
        id: chunk.id,
        vector: embeddings[idx],
        metadata: chunk.metadata,
        text: chunk.content
      }))
    );
  }

  async query(question: string, topK: number = 3): Promise<RetrievalResult[]> {
    const queryVector = await this.embeddingModel.encode(question);
    const rawResults = await this.vectorStore.search(
      queryVector,
      topK,
      { minScore: 0.75 } // Threshold to filter noise
    );
    return rawResults.map(res => ({
      chunk: {
        id: res.id,
        content: res.text,
        metadata: res.metadata
      },
      similarityScore: res.score
    }));
  }
}

// Interfaces for external dependencies
interface VectorRecord {
  id: string;
  vector: number[];
  metadata: ChunkMetadata;
  text: string;
}

interface SearchHit {
  id: string;
  text: string;
  metadata: ChunkMetadata;
  score: number;
}

interface VectorStoreInterface {
  upsert(records: VectorRecord[]): Promise<void>;
  search(vector: number[], k: number, opts?: { minScore?: number }): Promise<SearchHit[]>;
}

interface EmbeddingModelInterface {
  encode(text: string): Promise<number[]>;
  encodeBatch(texts: string[]): Promise<number[][]>;
}
Rationale
- Paragraph-Aware Chunking: The ParagraphAwareChunker prioritizes semantic boundaries. Splitting mid-paragraph often severs the relationship between a claim and its supporting evidence. The fallback to sentence-level splitting handles edge cases where paragraphs are excessively long.
- Similarity Threshold: The query method includes a minScore filter. Retrieving chunks with low similarity scores introduces noise that can confuse the LLM. It is better to return fewer high-quality chunks than many mediocre ones.
- Batch Embedding: The ingestDocument method uses encodeBatch. Vectorizing chunks individually incurs significant overhead. Batching leverages GPU parallelism and reduces API latency.
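To make the similarity threshold concrete, here is a minimal sketch of what a store's search could do internally, assuming cosine similarity as the score. The names StoredRecord, cosineSimilarity, and searchByThreshold are hypothetical and not part of the code above; a real vector database would use an index rather than this brute-force scan.

```typescript
// Hypothetical record shape for illustration only.
interface StoredRecord {
  id: string;
  vector: number[];
  text: string;
}

interface ScoredRecord extends StoredRecord {
  score: number;
}

// Cosine similarity of two equal-length vectors; returns 0 if either is a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    // Different dimensions usually mean the vectors came from different embedding models.
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every record, drop anything below minScore, then keep the k best.
function searchByThreshold(
  query: number[],
  records: StoredRecord[],
  k: number,
  minScore: number
): ScoredRecord[] {
  return records
    .map(r => ({ ...r, score: cosineSimilarity(query, r.vector) }))
    .filter(r => r.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Because the filter runs before the top-k cut, a query can legitimately return fewer than k results, or none. The dimension check only catches models with different output sizes; two different models with the same dimension would pass it and still retrieve poorly.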
Pitfall Guide
Production RAG systems fail in predictable ways. Below are the most common failure modes and their remedies.
- The Boundary Effect
  - Explanation: Fixed-size chunking splits text arbitrarily, often cutting sentences or breaking logical flow. The LLM receives a fragment that lacks context.
  - Fix: Use semantic chunking strategies. Always implement overlap (e.g., 10-15% of chunk size) to preserve context at boundaries.
- Embedding Model Mismatch
  - Explanation: Using one model for indexing and a different model for queries. The vector spaces are incompatible, so retrieval is effectively random.
  - Fix: Enforce a strict contract where the embedding model is a singleton dependency shared across ingestion and retrieval. Never mix models.
- Context Swamping
  - Explanation: Retrieving too many chunks or chunks that are too large. The LLM's context window fills with irrelevant text, diluting the signal and increasing hallucination risk.
  - Fix: Implement reranking. Retrieve a larger set (e.g., top-20) and use a cross-encoder reranker to select the top-3 most relevant chunks. Limit chunk size to 300-600 tokens.
- Metadata Stripping
  - Explanation: Dropping metadata during chunking. The system cannot cite sources or filter by document type.
  - Fix: Design the chunk schema to require metadata. Propagate document-level metadata to all child chunks during ingestion.
- The "Garbage In" Trap
  - Explanation: Indexing low-quality documents with OCR errors, formatting artifacts, or outdated information. The LLM retrieves noise and generates poor answers.
  - Fix: Implement a preprocessing pipeline. Clean HTML/PDF artifacts, remove boilerplate, and validate document freshness before ingestion.
- Latency Neglect
  - Explanation: Performing embedding and retrieval synchronously in the request path. This can add 500ms-2s of latency per query.
  - Fix: Use async pipelines for ingestion. For retrieval, consider caching frequent queries or using approximate nearest neighbor (ANN) indexes for sub-50ms search latency.
- Evaluation Blindness
  - Explanation: Deploying RAG without measuring retrieval accuracy. Teams assume the system works because the LLM generates fluent text.
  - Fix: Implement a golden dataset of query-answer pairs. Measure retrieval recall (did we get the right chunk?) and generation faithfulness (did the answer match the chunk?).
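The retrieval-recall measurement from the last pitfall could be sketched as follows. GoldenExample, recallAtK, and meanRecallAtK are hypothetical names, and the retrieve callback is a synchronous stand-in for the real (asynchronous) retriever so the metric itself stays easy to read.

```typescript
// Hypothetical golden-set entry: a query plus the chunk ids a correct retrieval should surface.
interface GoldenExample {
  query: string;
  relevantChunkIds: string[];
}

// Fraction of the relevant ids that appear among the first k retrieved ids.
function recallAtK(retrievedIds: string[], relevantIds: string[], k: number): number {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(retrievedIds.slice(0, k));
  const found = relevantIds.filter(id => topK.has(id)).length;
  return found / relevantIds.length;
}

// Mean recall@k over the whole golden set.
// `retrieve` returns chunk ids ordered from most to least similar.
function meanRecallAtK(
  golden: GoldenExample[],
  retrieve: (query: string) => string[],
  k: number
): number {
  if (golden.length === 0) return 0;
  const total = golden.reduce(
    (sum, ex) => sum + recallAtK(retrieve(ex.query), ex.relevantChunkIds, k),
    0
  );
  return total / golden.length;
}
```

Tracking this number across changes to chunk size, topK, or the embedding model shows whether retrieval actually improved, independent of how fluent the generated answer sounds. Generation faithfulness needs a separate check and is not covered by this sketch.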
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Frequently changing docs | RAG | Instant updates without retraining. | Low (Storage + Embedding compute) |
| Style/Format enforcement | Fine-Tuning | RAG cannot change model behavior or tone. | High (Training compute + Data curation) |
| Strict citation required | RAG | Fine-tuned models cannot cite sources reliably. | Medium (Vector DB + Retrieval infra) |
| Low-latency, high-volume | RAG + Caching | Embedding at query time adds latency. Cache results. | Medium (Cache infra + Vector DB) |
| Private, sensitive data | RAG | Data stays in vector store; model sees only context. | Medium (Secure infra + Access control) |
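For the "RAG + Caching" row, one possible shape for a retrieval cache is sketched below. QueryCache and normalizeQuery are hypothetical names; the sketch assumes exact-match caching on a normalized query string, a fixed time-to-live, and eviction of the oldest inserted entry once the cache is full (insertion order, not true LRU).

```typescript
// Collapse case and whitespace so trivially different phrasings share a cache key.
function normalizeQuery(query: string): string {
  return query.trim().toLowerCase().replace(/\s+/g, ' ');
}

class QueryCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private ttlMs: number;
  private maxEntries: number;
  private now: () => number;

  // `now` is injectable so expiry can be exercised without waiting on a real clock.
  constructor(ttlMs: number, maxEntries: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.now = now;
  }

  get(query: string): T | undefined {
    const key = normalizeQuery(query);
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // Expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(query: string, value: T): void {
    const key = normalizeQuery(query);
    this.entries.delete(key); // Re-inserting moves the key to the newest position
    if (this.entries.size >= this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A cache like this only helps when the same questions recur, and the TTL has to be short enough that re-ingested documents are not masked by stale results.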
Configuration Template
Use this TypeScript configuration to bootstrap a RAG pipeline with sensible defaults.
// rag-config.ts
export interface RAGConfig {
  chunking: {
    strategy: 'paragraph' | 'sentence' | 'fixed';
    maxTokens: number;
    overlapRatio: number;
  };
  retrieval: {
    topK: number;
    minSimilarity: number;
    rerank: boolean;
  };
  embedding: {
    modelId: string;
    batchSize: number;
  };
}

export const defaultConfig: RAGConfig = {
  chunking: {
    strategy: 'paragraph',
    maxTokens: 500,
    overlapRatio: 0.15,
  },
  retrieval: {
    topK: 3,
    minSimilarity: 0.75,
    rerank: true,
  },
  embedding: {
    modelId: 'sentence-transformers/all-MiniLM-L6-v2',
    batchSize: 32,
  },
};
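The template defines overlapRatio, but the ParagraphAwareChunker shown earlier does not apply any overlap. As a rough sketch of one way to honor that setting, the helper below prepends the tail of each chunk to the next one. The names tailOverlap and applyOverlap are hypothetical, and the 4-characters-per-token estimate is the same rough heuristic the chunker uses.

```typescript
const CHARS_PER_TOKEN = 4; // Rough estimate, matching the chunker's heuristic

// Return up to maxChars from the end of text, starting on a word boundary.
function tailOverlap(text: string, maxChars: number): string {
  if (maxChars <= 0) return '';
  if (text.length <= maxChars) return text;
  const start = text.length - maxChars;
  const tail = text.slice(start);
  // If the cut already falls right after whitespace, no word was split.
  if (/\s/.test(text.charAt(start - 1))) return tail;
  // Otherwise drop the leading partial word.
  const firstSpace = tail.search(/\s/);
  return firstSpace === -1 ? tail : tail.slice(firstSpace + 1);
}

// Prefix every chunk except the first with the tail of the chunk before it.
function applyOverlap(chunks: string[], maxTokens: number, overlapRatio: number): string[] {
  const overlapChars = Math.floor(maxTokens * overlapRatio * CHARS_PER_TOKEN);
  const result: string[] = [];
  let previous: string | undefined;
  for (const chunk of chunks) {
    const prefix = previous === undefined ? '' : tailOverlap(previous, overlapChars).trim();
    result.push(prefix ? `${prefix} ${chunk}` : chunk);
    previous = chunk; // Overlap is taken from the original chunk, not the prefixed one
  }
  return result;
}
```

Note that overlap makes each chunk longer than the original budget, so a caller would likely pass a reduced maxTokens to the chunker (for example 85% of the target when overlapRatio is 0.15) to stay within the intended size.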
Quick Start Guide
- Initialize Dependencies: Install your vector database client and embedding library. Ensure the embedding model is downloaded or accessible via API.
- Define Schema: Create the DocumentChunk interface and metadata schema. Align this with your vector store's capabilities.
- Run Ingestion: Load a sample document set. Run the ingestion pipeline to chunk, embed, and store the data. Verify chunk counts and metadata propagation.
- Test Retrieval: Execute test queries. Inspect the retrieved chunks for relevance and similarity scores. Adjust topK and minSimilarity as needed.
- Integrate Generator: Connect the retrieval results to your LLM prompt template. Validate that the generated answers are grounded in the retrieved context.
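The final step, connecting retrieval results to a prompt template, might look like the sketch below. RetrievedChunk and buildGroundedPrompt are hypothetical names, the instruction wording is only an example, and the sourceDocId field assumes the metadata propagated during ingestion has been flattened onto each result.

```typescript
// Hypothetical flattened view of a retrieval result.
interface RetrievedChunk {
  content: string;
  similarityScore: number;
  sourceDocId: string;
}

// Assemble a prompt that asks the model to answer only from numbered context passages.
function buildGroundedPrompt(question: string, chunks: RetrievedChunk[]): string {
  if (chunks.length === 0) {
    // The similarity threshold can filter out every candidate; say so explicitly
    // instead of letting the model answer from its own memory.
    return [
      'No supporting context was retrieved for the question below.',
      'Reply that the answer is not available in the indexed documents.',
      '',
      `Question: ${question}`,
    ].join('\n');
  }
  const context = chunks
    .map((c, i) => `[${i + 1}] (source: ${c.sourceDocId})\n${c.content}`)
    .join('\n\n');
  return [
    'Answer the question using only the numbered context passages below.',
    'Cite passages by number, and say so if the context does not contain the answer.',
    '',
    'Context:',
    context,
    '',
    `Question: ${question}`,
  ].join('\n');
}
```

Numbering the passages and including the source id gives the model something concrete to cite, which makes it easier to check by hand that an answer is grounded in the retrieved context.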