tization-First Design:** Target Q4_K_M or Q5_K_M quantization for LLMs and FP16/INT8 for embeddings to balance VRAM usage and accuracy.
Implementation: TypeScript Pipeline
This implementation assumes a backend like Ollama or llama.cpp for model serving and a vector store interface.
1. Chunking Strategy
interface Chunk {
id: string;
content: string;
metadata: Record<string, any>;
}
export class SemanticChunker {
private readonly maxTokens: number;
private readonly overlap: number;
constructor(maxTokens: number, overlap: number) {
this.maxTokens = maxTokens;
this.overlap = overlap;
}
chunk(text: string, metadata: Record<string, any>): Chunk[] {
const sentences = text.split(/(?<=[.!?])\s+/);
const chunks: Chunk[] = [];
let currentChunk = "";
let idCounter = 0;
for (const sentence of sentences) {
// Rough token estimation; replace with actual tokenizer for precision
const estimatedTokens = sentence.length / 4;
if ((currentChunk.length / 4) + estimatedTokens > this.maxTokens) {
chunks.push({
id: `chunk_${idCounter++}`,
content: currentChunk.trim(),
metadata
});
// Preserve overlap
const words = currentChunk.split(' ');
const overlapWords = words.slice(-Math.floor(this.overlap / 4));
currentChunk = overlapWords.join(' ') + " " + sentence;
} else {
currentChunk += " " + sentence;
}
}
if (currentChunk.trim()) {
chunks.push({
id: `chunk_${idCounter++}`,
content: currentChunk.trim(),
metadata
});
}
return chunks;
}
}
2. Hybrid Retrieval with RRF
import { createClient } from '@chroma/chromadb'; // Example vector DB
interface SearchResult {
id: string;
score: number;
}
export class HybridRetriever {
private vectorDb: any;
private keywordIndex: any; // Placeholder for BM25 index
private k: number;
private alpha: number; // RRF weighting
constructor(k: number = 5, alpha: number = 0.6) {
this.k = k;
this.alpha = alpha;
}
async retrieve(query: string): Promise<SearchResult[]> {
// 1. Dense Retrieval
const denseResults = await this.vectorDb.query({
queryTexts: [query],
nResults: this.k * 2
});
// 2. Sparse Retrieval (BM25)
const sparseResults = await this.keywordIndex.search(query);
// 3. Reciprocal Rank Fusion
const scores: Map<string, number> = new Map();
const rrfK = 60;
// Process Dense
denseResults.ids[0].forEach((id: string, idx: number) => {
const rank = idx + 1;
const score = this.alpha / (rank + rrfK);
scores.set(id, (scores.get(id) || 0) + score);
});
// Process Sparse
sparseResults.forEach((res: any, idx: number) => {
const rank = idx + 1;
const score = (1 - this.alpha) / (rank + rrfK);
scores.set(res.id, (scores.get(res.id) || 0) + score);
});
// Sort and return top K
return Array.from(scores.entries())
.sort((a, b) => b[1] - a[1])
.slice(0, this.k)
.map(([id, score]) => ({ id, score }));
}
}
3. Pipeline Orchestration
export class LocalRAGPipeline {
private chunker: SemanticChunker;
private retriever: HybridRetriever;
private llmClient: any; // Ollama/Llama.cpp client
constructor(config: any) {
this.chunker = new SemanticChunker(config.maxChunkTokens, config.overlap);
this.retriever = new HybridRetriever(config.topK, config.alpha);
this.llmClient = config.llmClient;
}
async ingest(document: string, metadata: Record<string, any>): Promise<void> {
const chunks = this.chunker.chunk(document, metadata);
// Parallel embedding generation for throughput
const embeddings = await Promise.all(
chunks.map(chunk => this.generateEmbedding(chunk.content))
);
// Upsert to vector DB and keyword index
await this.storeChunks(chunks, embeddings);
}
async query(question: string): Promise<string> {
const relevantChunks = await this.retriever.retrieve(question);
const context = relevantChunks.map(r => r.content).join("\n\n");
const prompt = this.buildPrompt(question, context);
return this.llmClient.generate(prompt);
}
private buildPrompt(question: string, context: string): string {
return `
You are a precise assistant. Answer the question based ONLY on the provided context.
If the answer is not in the context, state that you cannot answer.
Context:
${context}
Question: ${question}
Answer:
`;
}
}
Rationale
- RRF Weighting: The
alpha parameter allows tuning the balance between semantic and keyword search. For technical documentation, lowering alpha (favoring BM25) often improves precision on acronyms and code snippets.
- Overlap Handling: The chunker preserves context across boundaries, preventing the model from losing critical information at chunk edges.
- Modularity: Separating retrieval strategies allows swapping vector databases or embedding models without rewriting the orchestration logic.
Pitfall Guide
Common Mistakes
-
Fixed-Size Chunking Without Overlap:
- Mistake: Splitting text by character count without respecting sentence boundaries or adding overlap.
- Impact: Sentences are truncated, context is lost, and retrieval returns fragmented snippets.
- Fix: Use semantic chunking with sentence-aware splitting and 10-20% overlap.
-
Ignoring Sparse Retrieval:
- Mistake: Relying exclusively on cosine similarity for retrieval.
- Impact: Poor performance on exact matches, IDs, and domain-specific jargon.
- Fix: Implement hybrid search with BM25 and fuse results using RRF.
-
VRAM Swapping Due to Poor Quantization:
- Mistake: Loading FP16 models on hardware with insufficient VRAM, causing OS-level swapping.
- Impact: Latency increases by 10x-50x; inference becomes unusable.
- Fix: Use Q4_K_M or Q5_K_M quantization. Monitor VRAM and enable
n_gpu_layers carefully.
-
Context Window Overflow:
- Mistake: Retrieving too many chunks (
topK too high) for the model's context window.
- Impact: Model attention dilution; the model ignores relevant context in favor of recent or first tokens.
- Fix: Calculate max chunks based on embedding token count and model context limit. Use dynamic
topK.
-
Stale Vector Indices:
- Mistake: One-time ingestion with no mechanism for updates or deletions.
- Impact: RAG returns outdated information; "hallucination" of old facts.
- Fix: Implement incremental updates, versioning, and soft deletes in the vector store.
-
Lack of Evaluation:
- Mistake: Assuming accuracy based on manual testing.
- Impact: Degradation goes unnoticed; pipeline drift.
- Fix: Integrate RAGAS or custom evaluation suites measuring faithfulness, answer relevance, and context precision.
-
Prompt Injection Vulnerabilities:
- Mistake: Treating local models as inherently safe from injection.
- Impact: Malicious content in documents can override system prompts.
- Fix: Sanitize inputs, use XML tags for context separation, and implement guardrails.
Best Practices
- Metadata Filtering: Attach metadata (source, date, department) to chunks and filter retrieval queries to narrow the search space.
- Embedding Model Selection: Use models optimized for your domain.
nomic-embed-text is a strong general-purpose local embedding; consider fine-tuning for specialized jargon.
- HNSW Tuning: Adjust
M and ef_construction parameters in HNSW indices based on dataset size. Larger datasets benefit from higher M for better recall.
- Speculative Decoding: Enable speculative decoding on the LLM backend to accelerate generation without accuracy loss.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Regulated Healthcare | Local Q5 Model + Air-gapped Vector DB | Maximum privacy; Q5 preserves clinical nuance; air-gap eliminates egress risk. | High hardware cost; zero data cost. |
| Developer Docs | Local Q4 Model + Hybrid Search | Speed is prioritized; BM25 handles code snippets and IDs better than dense alone. | Moderate hardware; high developer productivity. |
| Edge/Mobile | Phi-3 Mini + ONNX Runtime | Low latency on constrained devices; ONNX enables CPU optimization. | Low hardware cost; reduced accuracy vs 8B. |
| High-Volume Enterprise | Llama-3-8B + Chroma + RRF | Scalable architecture; RRF balances precision/recall; Chroma handles scale. | Moderate hardware; scalable cost structure. |
Configuration Template
{
"pipeline": {
"chunking": {
"maxTokens": 512,
"overlap": 100,
"strategy": "semantic"
},
"retrieval": {
"topK": 5,
"alpha": 0.6,
"vectorDb": {
"type": "chroma",
"collection": "local_rag_prod",
"hnsw": {
"M": 32,
"efConstruction": 200
}
},
"keywordIndex": {
"type": "bm25",
"k1": 1.2,
"b": 0.75
}
},
"models": {
"llm": {
"name": "llama3:8b-instruct-q4_K_M",
"temperature": 0.1,
"contextWindow": 8192
},
"embedding": {
"name": "nomic-embed-text",
"dimensions": 768
}
},
"monitoring": {
"evalInterval": "weekly",
"vramThreshold": 0.85
}
}
}
Quick Start Guide
-
Install Inference Backend:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull models
ollama pull llama3:8b-instruct-q4_K_M
ollama pull nomic-embed-text
-
Initialize Vector Store:
npm install @chroma/chromadb
# Start Chroma server or use embedded mode
-
Deploy Pipeline:
// main.ts
import { LocalRAGPipeline } from './pipeline';
const config = require('./pipeline.config.json');
const pipeline = new LocalRAGPipeline(config);
// Ingest data
await pipeline. ingest(fs.readFileSync('docs.pdf', 'utf-8'), { source: 'manual_v1' });
// Query
const answer = await pipeline.query('How do I configure RRF?');
console.log(answer);
-
Validate Performance:
Run the evaluation suite against a golden dataset. Ensure TTFT < 2s and Faithfulness > 0.85. Adjust alpha and topK based on results.
-
Monitor:
Enable logging for retrieval scores and generation latency. Set up alerts for VRAM spikes or retrieval failures.