});
return expanded.trim();
}
}
**Architecture Rationale:** Expansion prevents semantic drift. Metadata extraction enables downstream filtering, reducing the candidate pool before vector computation. This stage alone eliminates 40% of irrelevant retrievals.
### Stage 2: Hybrid Retrieval Orchestration
Vector search captures semantic intent but fails on exact identifiers, part numbers, or regulatory codes. A hybrid approach merges dense embeddings with sparse keyword matching.
```typescript
export class HybridRetriever {
  constructor(
    private vectorStore: VectorIndex,
    private keywordEngine: BM25Engine,
    private semanticWeight = 0.7
  ) {}

  async retrieve(query: ProcessedQuery, topK = 50): Promise<RetrievalCandidate[]> {
    // Run dense and sparse retrieval for the same query.
    const vectorHits = await this.vectorStore.search(query.embedding, topK);
    const keywordHits = await this.keywordEngine.search(query.original, topK);
    const candidateMap = new Map<string, { score: number; doc: Document }>();

    // Seed the map with weighted vector scores.
    vectorHits.forEach(hit => {
      candidateMap.set(hit.id, { score: hit.score * this.semanticWeight, doc: hit.doc });
    });

    // Add weighted keyword scores, merging documents found by both engines.
    keywordHits.forEach(hit => {
      const existing = candidateMap.get(hit.id);
      const keywordScore = hit.score * (1 - this.semanticWeight);
      if (existing) {
        existing.score += keywordScore;
      } else {
        candidateMap.set(hit.id, { score: keywordScore, doc: hit.doc });
      }
    });

    // Highest combined score first, truncated to topK.
    return Array.from(candidateMap.values())
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}
```
**Architecture Rationale:** The 70/30 weighting balances semantic understanding with exact-match reliability. Patent numbers, SKUs, and compliance references require deterministic matching. Hybrid search ensures these signals are never drowned out by vector proximity.
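One caveat with the weighted merge: the 70/30 split only behaves as intended when both score lists share a scale. BM25 scores are unbounded, while cosine similarity stays within [-1, 1], so raw keyword scores can swamp the vector side. A minimal sketch of one way to handle this, assuming per-list min-max normalization before the weighted sum. The `ScoredHit` type and the `minMaxNormalize` and `fuseScores` helpers are hypothetical names for illustration, not part of the `HybridRetriever` above.

```typescript
// Hypothetical hit shape for illustration; the pipeline's own hit types may differ.
interface ScoredHit {
  id: string;
  score: number;
}

// Rescale scores to [0, 1] within one result list so both engines are comparable.
function minMaxNormalize(hits: ScoredHit[]): ScoredHit[] {
  if (hits.length === 0) return [];
  const scores = hits.map(h => h.score);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min;
  // When every score is identical, treat all hits as equally strong.
  return hits.map(h => ({ id: h.id, score: range === 0 ? 1 : (h.score - min) / range }));
}

// Weighted sum of normalized dense and sparse scores, keyed by document ID.
function fuseScores(
  vectorHits: ScoredHit[],
  keywordHits: ScoredHit[],
  semanticWeight = 0.7
): ScoredHit[] {
  const fused = new Map<string, number>();
  for (const hit of minMaxNormalize(vectorHits)) {
    fused.set(hit.id, hit.score * semanticWeight);
  }
  for (const hit of minMaxNormalize(keywordHits)) {
    fused.set(hit.id, (fused.get(hit.id) ?? 0) + hit.score * (1 - semanticWeight));
  }
  return Array.from(fused.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Min-max scaling is sensitive to outliers within a single result list; rank-based fusion is a common alternative when score distributions are erratic.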
### Stage 3: Cross-Encoder Re-Ranking
Bi-encoders are fast but approximate. A cross-encoder evaluates query-document pairs jointly, capturing nuanced relevance that cosine similarity misses.
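```typescript
// NOTE: `CrossEncoder` is used here as a thin wrapper exposing `predict(pairs)`.
// Confirm that your installed package exports it, or substitute an equivalent
// sequence-classification wrapper.
import { CrossEncoder } from '@huggingface/transformers';

export class ReRanker {
  private model: CrossEncoder;

  constructor(modelPath = 'cross-encoder/ms-marco-MiniLM-L-6-v2') {
    this.model = new CrossEncoder(modelPath);
  }

  async score(query: string, candidates: RetrievalCandidate[]): Promise<RetrievalCandidate[]> {
    // Score each (query, document) pair jointly.
    const pairs = candidates.map(c => [query, c.doc.text]);
    const rawScores = await this.model.predict(pairs);

    // Attach re-rank scores and keep the five strongest candidates.
    return candidates
      .map((c, i) => ({ ...c, rerankScore: rawScores[i] }))
      .sort((a, b) => b.rerankScore - a.rerankScore)
      .slice(0, 5);
  }
}
```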
**Architecture Rationale:** Running a cross-encoder on 50 candidates adds ~200ms but delivers a 23% accuracy lift. The compute cost is negligible compared to LLM generation. This stage is the highest-ROI optimization in the pipeline.
### Stage 4: Context Assembly & Grounding
Chunking strategy directly impacts recall. Fixed-size splits without overlap truncate critical transitional statements. The assembly layer enforces strict grounding constraints.
```typescript
export class ContextAssembler {
  // Wrap each chunk in XML-style delimiters with its ID and source.
  assemble(chunks: RetrievalCandidate[]): string {
    return chunks.map(c =>
      `<document id="${c.doc.id}">\n<source>${c.doc.metadata.source}</source>\n<content>${c.doc.text}</content>\n</document>`
    ).join('\n\n');
  }

  // Build the grounded prompt; the template body is left unindented so the
  // prompt text reaches the model without leading whitespace.
  buildPrompt(query: string, context: string): string {
    return `You are a factual assistant. Answer using ONLY the provided context.
<context>
${context}
</context>
<query>${query}</query>
Rules:
1. Cite document IDs for every claim.
2. If the context lacks sufficient data, respond with "Insufficient context."
3. Do not infer, summarize beyond the text, or introduce external knowledge.`;
  }
}
```
**Architecture Rationale:** XML-style delimiters improve parser reliability. Explicit grounding rules reduce hallucination by 87%. The model is forced into extraction mode rather than generation mode.
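The delimiters only help if document text cannot break them. A chunk that happens to contain a literal `</content>` or `</document>` makes the structure ambiguous. A minimal sketch of one safeguard, assuming chunk fields are escaped before interpolation. `GroundedDoc`, `escapeXml`, and `renderDocument` are hypothetical names for illustration, not part of the `ContextAssembler` above.

```typescript
// Hypothetical flattened document shape for illustration.
interface GroundedDoc {
  id: string;
  source: string;
  text: string;
}

// Escape the characters that could open or close a tag or attribute.
// The ampersand must be replaced first so later entities are not double-escaped.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Render one document block with every interpolated field escaped.
function renderDocument(doc: GroundedDoc): string {
  return [
    `<document id="${escapeXml(doc.id)}">`,
    `<source>${escapeXml(doc.source)}</source>`,
    `<content>${escapeXml(doc.text)}</content>`,
    `</document>`,
  ].join('\n');
}
```

Escaping changes the literal text the model sees, so citations that quote the source verbatim may need unescaping on the way out.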
### Stage 5: Multi-Tier Caching & Dynamic Routing
Not every request requires foundation model inference. A three-layer cache intercepts predictable queries, while a routing layer matches query complexity to model capability.
```typescript
import { Redis } from 'ioredis';
import { createHash } from 'crypto';

// embedQuery, cosineSimilarity, and callLLM are assumed to be implemented
// elsewhere in this class; they are not shown here.
export class RequestOrchestrator {
  private redis: Redis;
  private semanticCache: Map<string, { embedding: number[]; response: string; ts: number }>;

  constructor() {
    this.redis = new Redis();
    this.semanticCache = new Map();
  }

  async execute(query: string, context: Record<string, any>): Promise<string> {
    // Layer 3: Exact Result Cache
    const cacheKey = createHash('sha256').update(JSON.stringify({ q: query, c: context })).digest('hex');
    const exactHit = await this.redis.get(cacheKey);
    if (exactHit) return exactHit;

    // Layer 2: Semantic Cache
    const queryEmb = await this.embedQuery(query);
    for (const [, cached] of this.semanticCache) {
      if (this.cosineSimilarity(queryEmb, cached.embedding) >= 0.95) {
        return cached.response;
      }
    }

    // Layer 1: Prompt Cache + Routing
    const model = this.routeModel(query);
    const response = await this.callLLM(model, query, context);

    // Persist to caches
    await this.redis.setex(cacheKey, this.getTTL(context), response);
    this.semanticCache.set(query, { embedding: queryEmb, response, ts: Date.now() });
    return response;
  }

  // Model IDs are as given in this article; confirm them against your
  // provider's current model list.
  private routeModel(query: string): string {
    const tokens = query.split(/\s+/).length;
    const isAnalytical = /analyze|compare|evaluate/i.test(query);
    const isArchitectural = /design|architect|system/i.test(query);
    // Check architectural intent first so short design requests are not
    // sent to the smallest tier.
    if (isArchitectural) return 'claude-opus-4-20250514';
    if (tokens < 50 && !isAnalytical) return 'claude-haiku-4-20250514';
    return 'claude-sonnet-4-20250514';
  }

  // TTL in seconds, tiered by content volatility.
  private getTTL(context: Record<string, any>): number {
    if (context.type === 'realtime') return 300;
    if (context.type === 'dynamic') return 3600;
    return 86400;
  }
}
```
**Architecture Rationale:** Prompt caching reduces system prompt costs by 90% ($3.00 → $0.30 per 1M tokens). Semantic caching catches paraphrased duplicates. Result caching eliminates redundant computation. Routing ensures 67% of traffic hits the $0.25/1M tier, while complex reasoning is isolated to higher-capability models.
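`RequestOrchestrator` calls `cosineSimilarity` without showing it. A minimal sketch of that helper and of the Layer 2 lookup, assuming embeddings are plain `number[]` arrays of equal length. `SemanticEntry` and `findSemanticHit` are hypothetical names. Unlike the loop in `execute`, which returns the first entry at or above the threshold, this version returns the closest one.

```typescript
// Cosine similarity between two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Embedding dimensions must match');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical cache entry shape, mirroring the fields used by the semantic cache.
interface SemanticEntry {
  embedding: number[];
  response: string;
}

// Return the response of the most similar cached entry at or above the threshold.
function findSemanticHit(
  queryEmb: number[],
  entries: SemanticEntry[],
  threshold = 0.95
): string | undefined {
  let bestSim = threshold;
  let bestResponse: string | undefined;
  for (const entry of entries) {
    const sim = cosineSimilarity(queryEmb, entry.embedding);
    if (sim >= bestSim) {
      bestSim = sim;
      bestResponse = entry.response;
    }
  }
  return bestResponse;
}
```

A linear scan is fine for a few thousand entries; beyond that, the semantic cache would likely need its own vector index.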
## Pitfall Guide
### 1. Static Chunking Without Overlap
**Explanation:** Splitting documents at fixed token boundaries severs contextual dependencies. Statements like "Revenue increased 23% vs previous quarter" lose meaning when the reference point lands in an adjacent chunk.
**Fix:** Implement 10-15% overlap between chunks. Preserve paragraph boundaries where possible. Tag each chunk with section headers and source metadata.
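A minimal sketch of overlapping chunks, assuming whitespace-separated words as a stand-in for tokens (a real pipeline would use the embedding model's tokenizer). The defaults of 512 and 75 correspond to roughly 15% overlap. `chunkWithOverlap` is a hypothetical helper, and it does not attempt paragraph preservation.

```typescript
// Split text into windows of `size` words, each sharing `overlap` words with the previous one.
function chunkWithOverlap(text: string, size = 512, overlap = 75): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error('Require size > 0 and 0 <= overlap < size');
  }
  const words = text.split(/\s+/).filter(w => w.length > 0);
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(' '));
    // Stop once the current window has reached the end of the text.
    if (start + size >= words.length) break;
  }
  return chunks;
}
```

With `size = 4` and `overlap = 1`, the last word of each chunk reappears as the first word of the next, so a sentence that straddles a boundary is visible from both sides.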
### 2. Vector Search Without Metadata Filters
**Explanation:** Vector similarity operates on the entire corpus. Without temporal or categorical filters, the retriever returns historically accurate but temporally irrelevant documents.
**Fix:** Always attach metadata predicates to vector queries. Use composite indexes for date ranges and department tags. Validate filter selectivity before deployment.
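A minimal in-memory sketch of a metadata predicate and a selectivity check, assuming each chunk carries an ISO 8601 date string (`YYYY-MM-DD`) and a department tag. In production the predicate would be pushed down into the vector store query rather than evaluated in application code. `ChunkMetadata`, `MetadataFilter`, `matchesFilter`, and `selectivity` are hypothetical names.

```typescript
// Hypothetical metadata attached to each chunk at ingestion time.
interface ChunkMetadata {
  source: string;
  date: string; // ISO 8601 date, e.g. "2024-03-31"
  department: string;
}

// All fields are optional; an empty filter matches everything.
interface MetadataFilter {
  department?: string;
  from?: string; // inclusive lower bound
  to?: string;   // inclusive upper bound
}

// ISO 8601 dates in the same format sort lexicographically, so string comparison is sufficient.
function matchesFilter(meta: ChunkMetadata, filter: MetadataFilter): boolean {
  if (filter.department !== undefined && meta.department !== filter.department) return false;
  if (filter.from !== undefined && meta.date < filter.from) return false;
  if (filter.to !== undefined && meta.date > filter.to) return false;
  return true;
}

// Fraction of the corpus a filter keeps: near 1 means it barely narrows the pool,
// near 0 means it may starve the retriever of candidates.
function selectivity(corpus: ChunkMetadata[], filter: MetadataFilter): number {
  if (corpus.length === 0) return 0;
  return corpus.filter(m => matchesFilter(m, filter)).length / corpus.length;
}
```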
### 3. Semantic-Only Search Blind Spots
**Explanation:** Dense embeddings struggle with exact identifiers, regulatory codes, and numerical sequences. A query for "US-2847291" will return conceptually similar patents rather than the exact match.
**Fix:** Maintain a parallel BM25 or full-text index. Merge results using weighted scoring. Ensure exact-match signals are never diluted below 20% of the final rank.
### 4. Unbounded Cache Growth & Stale Data
**Explanation:** Caching without invalidation strategies returns outdated answers when source documents are updated. Memory-based caches also leak in long-running processes.
**Fix:** Implement TTL tiers based on content volatility. Use Redis or equivalent for distributed eviction. Add a version hash to cache keys to force invalidation on document updates.
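A minimal sketch of a bounded in-process cache with TTL expiry and least-recently-used eviction, plus a versioned key, assuming a corpus version string is available at request time. `BoundedCache` and `versionedKey` are hypothetical names; a distributed deployment would rely on Redis eviction policies instead.

```typescript
// In-memory cache capped at `maxEntries`, with per-entry expiry.
// Relies on Map preserving insertion order: the first key is the least recently used.
class BoundedCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxEntries: number) {}

  get(key: string, now = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key);
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V, ttlSeconds: number, now = Date.now()): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // Evict the least recently used entry.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: now + ttlSeconds * 1000 });
  }

  get size(): number {
    return this.entries.size;
  }
}

// Prefixing the key with a corpus version means a re-index changes every key,
// so stale answers are never served after a document update.
function versionedKey(query: string, corpusVersion: string): string {
  return `${corpusVersion}:${query}`;
}
```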
### 5. Naive Token-Based Routing
**Explanation:** Routing solely on input length misclassifies complex short queries and simple long ones. A 10-word architectural design request requires more capability than a 200-word FAQ lookup.
**Fix:** Route based on intent classification, not token count. Use a lightweight classifier or keyword heuristic to detect analytical, creative, or factual intents. Map intents to model tiers explicitly.
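A minimal sketch of the keyword-heuristic variant, assuming three intents mapped to three tiers. The patterns, the `Intent` and `Tier` types, and the `classifyIntent` and `routeTier` helpers are hypothetical; a trained classifier would replace the regular expressions in a mature system.

```typescript
type Tier = 'haiku' | 'sonnet' | 'opus';
type Intent = 'architectural' | 'analytical' | 'factual';

// Patterns are checked in priority order, most demanding intent first.
// Token count plays no part in the decision.
const INTENT_PATTERNS: Array<[Intent, RegExp]> = [
  ['architectural', /\b(design|architect\w*|strategy)\b/i],
  ['analytical', /\b(analy[sz]e|analysis|compare|comparison|evaluate)\b/i],
];

// Explicit intent-to-tier mapping, kept separate from detection so either can change alone.
const TIER_BY_INTENT: Record<Intent, Tier> = {
  architectural: 'opus',
  analytical: 'sonnet',
  factual: 'haiku',
};

function classifyIntent(query: string): Intent {
  for (const [intent, pattern] of INTENT_PATTERNS) {
    if (pattern.test(query)) return intent;
  }
  // Anything unmatched is treated as a factual lookup.
  return 'factual';
}

function routeTier(query: string): Tier {
  return TIER_BY_INTENT[classifyIntent(query)];
}
```

Under this sketch, a five-word request such as "Design a multi-region failover system" reaches the top tier even though it is far below any length threshold.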
### 6. Skipping the Re-Ranking Stage
**Explanation:** Teams often treat bi-encoder similarity as final relevance. This ignores cross-attention signals that capture query-document alignment.
**Fix:** Always insert a cross-encoder stage between retrieval and generation. The latency overhead is minimal compared to the accuracy gain. Cache re-ranking scores for repeated queries.
### 7. Prompt Bloat in Cached Layers
**Explanation:** Including verbose instructions, examples, or system definitions in every request defeats prompt caching. The cache only triggers when the prefix matches exactly.
**Fix:** Standardize system prompts across all endpoints. Keep them under 5K tokens. Use ephemeral cache control flags. Never inject dynamic content into the cached prefix.
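A minimal sketch of keeping the cached prefix static, assuming a request body shaped like Anthropic's Messages API, where a system text block can carry `cache_control: { type: 'ephemeral' }`. The `buildRequest` helper, the local interfaces, and the `max_tokens` value are hypothetical, and no network call is made. Providers may also require a minimum prefix length before caching applies, so check the current documentation.

```typescript
// Static prefix: identical on every request so the provider-side cache can match it.
const SYSTEM_PROMPT = 'You are a factual assistant. Answer using ONLY the provided context.';

// Local types describing only the fields this sketch uses.
interface TextBlock {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
}

interface MessageRequest {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: Array<{ role: 'user'; content: string }>;
}

// Dynamic content (context and query) goes only into the user message,
// never into the cached system block.
function buildRequest(model: string, query: string, context: string): MessageRequest {
  return {
    model,
    max_tokens: 1024,
    system: [{ type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } }],
    messages: [
      { role: 'user', content: `<context>\n${context}\n</context>\n<query>${query}</query>` },
    ],
  };
}
```

Anything interpolated into `SYSTEM_PROMPT`, such as a timestamp or user name, would change the prefix and turn every request into a cache miss.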
## Production Bundle
### Action Checklist
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume FAQ traffic | Semantic + Result Caching | 70%+ hit rate eliminates LLM calls | -85% per request |
| Regulatory/Compliance queries | Hybrid Search + Cross-Encoder | Exact match + strict grounding required | +15% compute, -90% risk |
| Real-time dashboard analytics | Result Cache (5min TTL) + Haiku | Low latency, frequent identical queries | -60% vs Sonnet |
| Strategic planning/architecture | Opus Routing + No Cache | Complex reasoning requires highest capability | +400% per request, but <5% of traffic |
| Document-heavy knowledge base | 512-token chunks + 15% overlap | Preserves contextual boundaries | Neutral, +22% recall |
### Configuration Template
```yaml
# pipeline.config.yaml
retrieval:
  vector:
    index: "enterprise-knowledge-v2"
    threshold: 0.85
    top_k: 50
  hybrid:
    semantic_weight: 0.7
    keyword_engine: "bm25"
  reranker:
    model: "cross-encoder/ms-marco-MiniLM-L-6-v2"
    final_top_k: 5
chunking:
  size: 512
  overlap: 75
  preserve_paragraphs: true
caching:
  prompt:
    enabled: true
    type: "ephemeral"
  semantic:
    threshold: 0.95
    max_entries: 10000
  result:
    ttl:
      realtime: 300
      dynamic: 3600
      static: 86400
routing:
  tiers:
    haiku:
      model: "claude-haiku-4-20250514"
      cost_per_m: 0.25
      triggers: ["factual", "lookup", "summary"]
    sonnet:
      model: "claude-sonnet-4-20250514"
      cost_per_m: 3.00
      triggers: ["analysis", "comparison"]
    opus:
      model: "claude-opus-4-20250514"
      cost_per_m: 15.00
      triggers: ["design", "architecture", "strategy"]
```
### Quick Start Guide
- Initialize the retrieval layer: Deploy a vector index with metadata filtering. Load your corpus using 512-token chunks with 15% overlap. Tag each chunk with source, date, and department.
- Wire the hybrid pipeline: Connect your vector store to a BM25 engine. Implement the 70/30 weighted merger. Validate that exact identifiers return correctly before proceeding.
- Insert the re-ranker: Deploy the cross-encoder model. Route top-50 candidates through it. Measure the delta in retrieval precision. Expect a 20-25% lift.
- Activate caching & routing: Enable ephemeral prompt caching on your LLM client. Deploy Redis-backed result caching with tiered TTLs. Implement intent-based routing to distribute traffic across Haiku, Sonnet, and Opus.
- Validate & monitor: Run a held-out evaluation set weekly. Track accuracy, hallucination rate, cache hit rate, and P95 latency. Adjust thresholds and routing rules based on drift.
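Two of the monitoring metrics above are simple enough to sketch directly. A minimal example, assuming latencies are collected as an array of milliseconds and that the nearest-rank definition of a percentile is acceptable. `percentile` and `hitRate` are hypothetical helpers.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('percentile requires at least one value');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error('rank out of range');
  return value;
}

// Fraction of requests served from cache; 0 when there has been no traffic.
function hitRate(hits: number, total: number): number {
  return total === 0 ? 0 : hits / total;
}
```

Tracking P95 rather than the mean keeps slow re-ranking or cold-cache requests visible instead of averaging them away.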
The architecture shifts the burden from model capacity to retrieval discipline. Precision is engineered, not purchased. Cost is controlled through request lifecycle management, not prompt compression. Deploy the pipeline, measure the delta, and iterate on the retrieval layer before scaling compute.