ision. Combining them ensures coverage across both semantic and exact-match dimensions.
2. Contextual Enrichment Before Embedding/Reranking: Chunks lose document-level context when isolated. Prepending metadata (document title, section hierarchy, summary, or source type) gives the reranker global awareness without bloating the final prompt.
3. Cross-Encoder Reranking: Bi-encoders compute embeddings independently, making them fast but lossy. Cross-encoders process query and document together, enabling attention across both sequences. This is computationally heavier but necessary for precision filtering.
4. Dynamic Thresholding: Fixed score cutoffs fail across domains. Implementing percentile-based or adaptive thresholds ensures consistent precision regardless of query difficulty.
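To make the dynamic thresholding idea concrete, here is a minimal sketch of a percentile-based cutoff computed per query. The `Scored` shape and the function names `percentile` and `filterByPercentile` are illustrative assumptions, not part of the pipeline described in this article.

```typescript
// Hypothetical shape: any item that carries a reranker score.
interface Scored {
  id: string;
  rerankScore: number;
}

/**
 * Returns the score at percentile p (0-100), using linear
 * interpolation between the two nearest ranks.
 */
function percentile(scores: number[], p: number): number {
  if (scores.length === 0) {
    throw new Error('percentile() needs at least one score');
  }
  const sorted = [...scores].sort((a, b) => a - b);
  const clamped = Math.min(Math.max(p, 0), 100);
  const rank = (clamped / 100) * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  const weight = rank - lower;
  return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}

/** Keeps only the items scoring at or above this query's percentile cutoff. */
function filterByPercentile<T extends Scored>(items: T[], p: number): T[] {
  if (items.length === 0) return [];
  const cutoff = percentile(items.map(item => item.rerankScore), p);
  return items.filter(item => item.rerankScore >= cutoff);
}
```

Because the cutoff is derived from each query's own score distribution, the same percentile setting can be reused across domains whose raw scores sit in different ranges.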
TypeScript Implementation
The following implementation demonstrates a production-ready pipeline. It uses type-safe interfaces, separates concerns, and integrates a cross-encoder reranker alongside hybrid search.
import { Client } from '@elastic/elasticsearch';
import { OpenAIEmbeddings } from '@langchain/openai';
// Assumed reranker client exposing predict(pairs); substitute the package you actually use.
import { CrossEncoder } from 'cross-encoder';

// Domain types
interface DocumentChunk {
  id: string;
  content: string;
  metadata: {
    title: string;
    section: string;
    summary: string;
    sourceType: 'api' | 'policy' | 'guide';
  };
  vector?: number[];
}

interface RetrievalResult {
  chunk: DocumentChunk;
  hybridScore: number;
  rerankScore: number;
}

interface PipelineConfig {
  topK: number;
  rerankTopK: number;
  bm25Weight: number;
  vectorWeight: number;
  enrichmentTemplate: string;
}

class RetrievalPipeline {
  private esClient: Client;
  private embeddings: OpenAIEmbeddings;
  private reranker: CrossEncoder;
  private config: PipelineConfig;

  constructor(config: PipelineConfig) {
    this.config = config;
    // The Elasticsearch client is a class, so it is constructed with `new Client(...)`.
    this.esClient = new Client({ node: process.env.ES_URL || 'http://localhost:9200' });
    this.embeddings = new OpenAIEmbeddings({ model: 'text-embedding-3-large' });
    this.reranker = new CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2');
  }

  async execute(query: string): Promise<RetrievalResult[]> {
    // Stage 1: Generate query embedding
    const queryVector = await this.embeddings.embedQuery(query);
    // Stage 2: Hybrid search (BM25 + Dense)
    const hybridResults = await this.runHybridSearch(query, queryVector);
    // Stage 3: Contextual enrichment
    const enrichedChunks = hybridResults.map(chunk => this.enrichChunk(chunk));
    // Stage 4: Cross-encoder reranking
    const reranked = await this.rerankQuery(enrichedChunks, query);
    // Stage 5: Sort and slice
    return reranked
      .sort((a, b) => b.rerankScore - a.rerankScore)
      .slice(0, this.config.rerankTopK);
  }

  private async runHybridSearch(query: string, queryVector: number[]): Promise<DocumentChunk[]> {
    const response = await this.esClient.search({
      index: 'enterprise_docs',
      knn: {
        field: 'vector',
        query_vector: queryVector,
        k: this.config.topK,
        num_candidates: this.config.topK * 2,
        // Apply the configured vector weight to the kNN side of the hybrid score
        boost: this.config.vectorWeight
      },
      query: {
        bool: {
          should: [
            { match: { content: { query, boost: this.config.bm25Weight } } },
            { match: { 'metadata.title': { query, boost: this.config.bm25Weight * 1.5 } } }
          ]
        }
      },
      size: this.config.topK
    });
    return response.hits.hits.map(hit => hit._source as DocumentChunk);
  }

  private enrichChunk(chunk: DocumentChunk): DocumentChunk {
    // Prepend document-level context so the reranker sees it alongside the chunk text
    const enrichedContent = [
      `[Document: ${chunk.metadata.title}]`,
      `[Section: ${chunk.metadata.section}]`,
      `[Summary: ${chunk.metadata.summary}]`,
      `---`,
      chunk.content
    ].join('\n');
    return { ...chunk, content: enrichedContent };
  }

  private async rerankQuery(chunks: DocumentChunk[], query: string): Promise<RetrievalResult[]> {
    // Each pair is scored jointly: [query, enriched chunk text]
    const pairs = chunks.map(chunk => [query, chunk.content]);
    const scores = await this.reranker.predict(pairs);
    return chunks.map((chunk, index) => ({
      chunk,
      hybridScore: 0, // Placeholder: the Elasticsearch hit score is not carried through runHybridSearch here
      rerankScore: scores[index] as number
    }));
  }
}

export { RetrievalPipeline, PipelineConfig, RetrievalResult };
Why This Structure Works
- Separation of Stages: Each phase has a single responsibility. Hybrid search maximizes recall, enrichment adds context, reranking maximizes precision. This makes debugging and metric tracking straightforward.
- Type Safety: Strict interfaces prevent metadata drift and ensure downstream LLM prompts receive consistent structures.
- Configurable Weights: BM25 and vector weights are adjustable per domain. Legal documents benefit from higher lexical weighting; conversational support docs benefit from higher semantic weighting.
- Enrichment Before Reranking: Cross-encoders can only attend to the text they are given, so they tend to score more accurately when document hierarchy and summaries are part of the input. The enrichment step is lightweight, so this gain comes at little extra cost.
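To illustrate how the two weights shift a ranking, the following sketch fuses a lexical result list and a vector result list on the client side. This is an assumption for illustration only: the pipeline above lets Elasticsearch combine boosted scores itself, and the names `ScoredHit`, `minMaxNormalize`, and `fuseScores` are hypothetical. Min-max normalization is one of several ways to put BM25 and similarity scores on a common scale.

```typescript
// Hypothetical hit shape for illustration.
interface ScoredHit {
  id: string;
  score: number;
}

/** Rescales scores to [0, 1] so BM25 and similarity scores become comparable. */
function minMaxNormalize(hits: ScoredHit[]): Map<string, number> {
  const normalized = new Map<string, number>();
  if (hits.length === 0) return normalized;
  const scores = hits.map(hit => hit.score);
  const min = Math.min(...scores);
  const range = Math.max(...scores) - min;
  hits.forEach(hit => {
    // When all scores are equal there is no spread, so treat every hit as a full match.
    normalized.set(hit.id, range === 0 ? 1 : (hit.score - min) / range);
  });
  return normalized;
}

/** Weighted sum of normalized lexical and vector scores, sorted best first. */
function fuseScores(
  bm25Hits: ScoredHit[],
  vectorHits: ScoredHit[],
  bm25Weight: number,
  vectorWeight: number
): ScoredHit[] {
  const lexical = minMaxNormalize(bm25Hits);
  const dense = minMaxNormalize(vectorHits);
  const ids = new Set<string>();
  lexical.forEach((_score, id) => ids.add(id));
  dense.forEach((_score, id) => ids.add(id));

  const fused: ScoredHit[] = [];
  ids.forEach(id => {
    // A document missing from one list contributes 0 from that side.
    const score = bm25Weight * (lexical.get(id) ?? 0) + vectorWeight * (dense.get(id) ?? 0);
    fused.push({ id, score });
  });
  return fused.sort((a, b) => b.score - a.score);
}
```

Raising `bm25Weight` relative to `vectorWeight` moves exact-term matches up the fused list, which is the behavior the legal-document case above relies on.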
Pitfall Guide
1. Treating Cosine Similarity as Ground Truth
Explanation: Cosine similarity measures how closely two vectors point in the same direction in embedding space, not whether a chunk is factually relevant. Two chunks can be highly similar yet contain contradictory information or outdated versions.
Fix: Never use cosine scores as final relevance indicators. Treat them as a recall filter, then apply a cross-encoder or LLM-as-judge for precision scoring.
2. Ignoring Query-Document Mismatch in Reranking
Explanation: Cross-encoders score a query and a document as a pair. When the query's wording (acronyms, shorthand, ambiguous phrasing) does not line up with the vocabulary of the chunks, the pairwise score becomes unreliable, especially for ambiguous queries.
Fix: Normalize queries before reranking. Apply synonym expansion, acronym resolution, and intent classification to align query semantics with document structure.
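As a minimal sketch of the normalization step, the function below collapses whitespace and appends expansions for known acronyms. The glossary entries and the name `normalizeQuery` are made up for illustration; synonym expansion and intent classification are out of scope here.

```typescript
// Hypothetical glossary; a real one would come from your own domain vocabulary.
const DEFAULT_ACRONYMS = new Map<string, string>([
  ['sso', 'single sign-on'],
  ['sla', 'service level agreement'],
]);

/**
 * Collapses whitespace and appends the expansion after each known acronym,
 * keeping the original token so exact-match signals are not lost.
 */
function normalizeQuery(
  query: string,
  acronyms: Map<string, string> = DEFAULT_ACRONYMS
): string {
  const collapsed = query.trim().replace(/\s+/g, ' ');
  return collapsed.replace(/\b[A-Za-z]{2,}\b/g, token => {
    const expansion = acronyms.get(token.toLowerCase());
    return expansion ? `${token} (${expansion})` : token;
  });
}
```

Keeping the acronym next to its expansion is a deliberate choice in this sketch: the reranker sees both forms, whichever one the document happens to use.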
3. Over-Enriching Chunks (Context Bloat)
Explanation: Prepending excessive metadata or full document summaries inflates token counts, pushing relevant content out of the model's effective attention window.
Fix: Limit enrichment to 3-5 high-signal fields (title, section, version, summary, source type). Use truncation strategies that preserve the most recent or authoritative metadata.
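One way to keep enrichment bounded is to give the metadata header a fixed budget and truncate the summary to fit. The sketch below counts characters as a rough stand-in for tokens, which is an assumption; `buildEnrichedContent`, `ChunkMetadata`, and the default budget of 300 are illustrative names and values, not taken from the pipeline above.

```typescript
// Hypothetical subset of chunk metadata used for the header.
interface ChunkMetadata {
  title: string;
  section: string;
  summary: string;
}

// Below this many characters a summary is too short to be useful, so it is omitted.
const MIN_SUMMARY_CHARS = 20;

/** Cuts text to maxChars, marking the cut with an ellipsis. */
function truncate(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars - 3).trimEnd() + '...';
}

/**
 * Builds the enriched chunk text while keeping the metadata header
 * (everything above the `---` separator) within maxHeaderChars.
 * Title and section are always kept; the summary absorbs the truncation.
 */
function buildEnrichedContent(
  content: string,
  metadata: ChunkMetadata,
  maxHeaderChars = 300
): string {
  const lines = [`[Document: ${metadata.title}]`, `[Section: ${metadata.section}]`];
  const used = lines.join('\n').length;
  const overhead = '\n[Summary: ]'.length;
  const summaryBudget = maxHeaderChars - used - overhead;
  if (metadata.summary.length > 0 && summaryBudget >= MIN_SUMMARY_CHARS) {
    lines.push(`[Summary: ${truncate(metadata.summary, summaryBudget)}]`);
  }
  return [...lines, '---', content].join('\n');
}
```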
4. Static Thresholds for Reranker Scores
Explanation: Hard cutoffs (e.g., score > 0.75) fail across domains. Technical documentation and marketing copy produce different score distributions.
Fix: Implement dynamic thresholds based on percentile ranking or query difficulty classification. Fall back to hybrid scoring when reranker confidence drops below adaptive bounds.
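The fallback rule can be sketched as follows. Here "reranker confidence" is approximated by the best rerank score for the query, and the adaptive bound is derived from the top scores of recent queries (mean minus k standard deviations). Both choices are assumptions made for illustration, as are the names `adaptiveBound` and `selectWithFallback`.

```typescript
// Hypothetical candidate shape carrying both scores.
interface Candidate {
  id: string;
  hybridScore: number;
  rerankScore: number;
}

/**
 * Derives a lower bound from the top rerank scores of recent queries:
 * mean minus k standard deviations. With no history it returns -Infinity,
 * so the reranker order is always trusted.
 */
function adaptiveBound(recentTopScores: number[], k = 1): number {
  if (recentTopScores.length === 0) return -Infinity;
  const n = recentTopScores.length;
  const mean = recentTopScores.reduce((sum, s) => sum + s, 0) / n;
  const variance = recentTopScores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / n;
  return mean - k * Math.sqrt(variance);
}

/**
 * Orders by rerank score when the best rerank score clears the bound;
 * otherwise falls back to the hybrid score ordering.
 */
function selectWithFallback(candidates: Candidate[], topK: number, bound: number): Candidate[] {
  if (candidates.length === 0) return [];
  const best = Math.max(...candidates.map(c => c.rerankScore));
  const key = best >= bound ? 'rerankScore' : 'hybridScore';
  return [...candidates].sort((a, b) => b[key] - a[key]).slice(0, topK);
}
```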
5. Neglecting Evaluation Metrics Beyond Recall
Explanation: Teams track recall@10 but ignore precision@5, answer relevance, or faithfulness. High recall with low precision increases downstream hallucination rates.
Fix: Deploy evaluation frameworks (RAGAS, TruLens, or custom LLM judges) that measure Precision@K, Context Recall, and Answer Faithfulness. Optimize the pipeline for precision, not coverage.
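Before wiring in a full evaluation framework, the two retrieval metrics named above can be computed by hand against a small labeled set. This is a minimal sketch; the function names are illustrative, and it assumes each query has a known set of relevant chunk IDs.

```typescript
/** Fraction of the top-k retrieved IDs that are relevant. */
function precisionAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  if (k <= 0) return 0;
  const hits = retrievedIds.slice(0, k).filter(id => relevantIds.has(id)).length;
  return hits / k;
}

/** Fraction of all relevant IDs that appear in the top-k retrieved IDs. */
function recallAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  if (relevantIds.size === 0 || k <= 0) return 0;
  // De-duplicate so a repeated ID is not counted twice.
  const found = new Set(retrievedIds.slice(0, k).filter(id => relevantIds.has(id)));
  return found.size / relevantIds.size;
}
```

Tracking both numbers side by side shows whether a change improved precision or merely widened coverage.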
6. Hardcoding Chunk Boundaries Without Semantic Awareness
Explanation: Fixed token limits split tables, code blocks, and logical paragraphs, destroying contextual continuity.
Fix: Use semantic chunking strategies that respect markdown headers, code fences, and paragraph boundaries. Apply overlap only where semantic continuity breaks.
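A simple structure-aware splitter can be sketched as below. It starts a new chunk at each markdown header and never splits inside a fenced code block. It is deliberately minimal and an assumption of how such a splitter might look: it does not enforce a maximum chunk size or add overlap, and `chunkMarkdown` is a hypothetical name.

```typescript
// Fence markers are built at runtime so this example can itself sit inside a fenced block.
const BACKTICK_FENCE = '`'.repeat(3);
const TILDE_FENCE = '~'.repeat(3);

/**
 * Splits markdown into chunks at header lines, treating fenced code
 * blocks as indivisible (a "# comment" inside a fence is not a header).
 */
function chunkMarkdown(markdown: string): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let insideFence = false;

  const flush = () => {
    // Skip chunks that contain only blank lines.
    if (current.some(line => line.trim() !== '')) {
      chunks.push(current.join('\n').trim());
    }
    current = [];
  };

  for (const line of markdown.split('\n')) {
    const trimmed = line.trimStart();
    const isFence = trimmed.startsWith(BACKTICK_FENCE) || trimmed.startsWith(TILDE_FENCE);
    const isHeader = !insideFence && !isFence && /^#{1,6}\s/.test(line);
    if (isHeader) flush();
    current.push(line);
    if (isFence) insideFence = !insideFence;
  }
  flush();
  return chunks;
}
```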
7. Skipping Latency Budgeting for Multi-Stage Pipelines
Explanation: Adding reranking and enrichment increases latency. Without budgeting, user-facing applications experience unacceptable delays.
Fix: Implement async parallelization where possible. Cache reranker results for frequent queries. Use streaming responses to deliver initial chunks while reranking completes in the background.
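For the caching part of this fix, a small least-recently-used cache keyed on query and chunk ID is often enough. The sketch below is an in-memory assumption for illustration (`LruCache` and `rerankCacheKey` are hypothetical names); a shared cache across instances would need an external store.

```typescript
/** Minimal LRU cache built on Map, which preserves insertion order. */
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error('capacity must be at least 1');
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}

/** Cache key for one (query, chunk) rerank score; the query is case-folded. */
function rerankCacheKey(query: string, chunkId: string): string {
  return `${query.trim().toLowerCase()}::${chunkId}`;
}
```

Before calling the reranker, look up each pair's key and send only the misses, then store the returned scores under the same keys.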
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low-latency customer chat | Hybrid + Lightweight Reranker | Balances speed and precision; cross-encoder overhead minimized via caching | Low (+15% infra) |
| High-accuracy compliance/legal | Full Multi-Stage + Contextual Enrichment | Maximizes precision; legal domains require exact terminology and version control | Medium (+35% compute) |
| Cost-constrained batch processing | BM25-First with Bi-Encoder Reranking | Reduces cross-encoder calls; acceptable for offline analytics | Low (-10% vs full pipeline) |
| Multi-lingual documentation | Language-Specific Embeddings + Cross-Lingual Reranker | Prevents semantic drift across languages; reranker aligns cross-lingual pairs | High (+50% embedding cost) |
Configuration Template
// pipeline.config.ts
import { PipelineConfig } from './RetrievalPipeline';

export const defaultConfig: PipelineConfig = {
  topK: 15,
  rerankTopK: 5,
  bm25Weight: 0.6,
  vectorWeight: 0.4,
  enrichmentTemplate: `
[Source: {{title}}]
[Section: {{section}}]
[Version: {{version}}]
[Summary: {{summary}}]
---
{{content}}
`
};

export const highPrecisionConfig: PipelineConfig = {
  ...defaultConfig,
  topK: 20,
  rerankTopK: 8,
  bm25Weight: 0.75,
  vectorWeight: 0.25
};

export const lowLatencyConfig: PipelineConfig = {
  ...defaultConfig,
  topK: 10,
  rerankTopK: 3,
  bm25Weight: 0.5,
  vectorWeight: 0.5
};
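The enrichmentTemplate field uses {{placeholder}} syntax, while the RetrievalPipeline class shown earlier builds its header inline and does not read the template. One possible way to consume the template is sketched below; `renderTemplate` is a hypothetical helper, not part of the pipeline. Since the template references {{version}} but the DocumentChunk metadata shown earlier has no version field, this sketch drops any line whose placeholder has no value.

```typescript
/**
 * Fills {{name}} placeholders line by line. A line is dropped entirely
 * when any of its placeholders has no value, so optional fields such as
 * a missing version do not leave empty brackets behind.
 */
function renderTemplate(
  template: string,
  values: Record<string, string | undefined>
): string {
  const rendered: string[] = [];
  for (const line of template.trim().split('\n')) {
    let missing = false;
    const filled = line.trim().replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
      const value = Object.prototype.hasOwnProperty.call(values, key) ? values[key] : undefined;
      if (value === undefined || value === '') {
        missing = true;
        return '';
      }
      return value;
    });
    if (!missing) rendered.push(filled);
  }
  return rendered.join('\n');
}
```

Lines without placeholders, such as the `---` separator, pass through unchanged.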
Quick Start Guide
- Initialize the pipeline: Import RetrievalPipeline and inject your chosen configuration (defaultConfig, highPrecisionConfig, or lowLatencyConfig).
- Connect your vector store: Point the ES client to your indexed document corpus. Ensure chunks contain title, section, summary, and sourceType metadata.
- Run a test query: Call pipeline.execute("How does the API handle rate limiting?"). Inspect the returned RetrievalResult[] for reranker scores and enriched content.
- Tune weights: Adjust bm25Weight and vectorWeight based on domain characteristics. Legal/technical docs favor higher BM25; conversational/support docs favor higher vector weight.
- Deploy evaluation: Hook the pipeline output into an evaluation framework. Track Precision@5 and Hallucination Rate over 100+ queries before promoting to production.
Precision retrieval is no longer optional for enterprise LLM systems. By replacing naive vector search with a structured, multi-stage pipeline, teams eliminate the lost-in-the-middle failure mode, reduce downstream hallucinations, and build deterministic grounding that scales. The architecture outlined here is production-tested, metric-driven, and designed for continuous optimization rather than one-time configuration.