search compensates for these gaps. Combining them via weighted ensemble retrieval ensures coverage across both semantic and lexical dimensions.
2. Cross-Encoder Reranking: Bi-encoders compute query and document embeddings independently, approximating relevance through cosine similarity. Cross-encoders process query-document pairs jointly, computing true attention-based relevance scores. Running a cross-encoder only on the top-20 candidates from the hybrid retriever balances computational cost with ranking precision.
3. Metric Isolation: Retrieval recall, faithfulness, and answer relevance are calculated independently. This prevents generator tuning from masking retrieval deficiencies and enables targeted optimization.
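To make the weighted ensemble in item 1 concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion, the fusion scheme LangChain's EnsembleRetriever is documented to use. The function name weighted_rrf, the document IDs, and the smoothing constant c=60 are illustrative assumptions, not values taken from this pipeline.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Sequence[List[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[str]:
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    if len(rankings) != len(weights):
        raise ValueError("Provide exactly one weight per ranking.")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (c + rank); earlier positions add more.
            scores[doc_id] += weight / (c + rank)
    # Best fused score first; Python's sort is stable, so ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical candidate lists (IDs are made up for illustration).
lexical = ["doc_a", "doc_b", "doc_c"]  # BM25 order
dense = ["doc_b", "doc_d", "doc_a"]    # embedding order
fused = weighted_rrf([lexical, dense], weights=[0.4, 0.6])
print(fused)
```

Documents found by both retrievers accumulate score from each list, which is why doc_a and doc_b outrank documents that only one retriever returned.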
Implementation
import logging
import time
from dataclasses import dataclass
from typing import List, Tuple

from langchain.retrievers import ContextualCompressionRetriever, EnsembleRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

logger = logging.getLogger(__name__)


@dataclass
class RetrievalResult:
    query: str
    passages: List[Document]
    recall_score: float | None = None
    latency_ms: float = 0.0


class ProductionRAGPipeline:
    def __init__(
        self,
        embedding_model: str = "BAAI/bge-small-en-v1.5",
        reranker_model: str = "BAAI/bge-reranker-base",
        chunk_size: int = 512,
        chunk_overlap: int = 64,
        hybrid_weights: Tuple[float, float] = (0.4, 0.6),
        final_k: int = 5,
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.final_k = final_k
        self.hybrid_weights = hybrid_weights

        # Initialize components with explicit configuration.
        # Normalized embeddings make L2 and inner-product ranking agree with cosine similarity.
        self.embeddings = HuggingFaceEmbeddings(
            model_name=embedding_model,
            encode_kwargs={"normalize_embeddings": True},
        )
        self.reranker_model = HuggingFaceCrossEncoder(model_name=reranker_model)

        # Cross-encoder compressor: keeps the top final_k passages after reranking
        self.reranker = CrossEncoderReranker(
            model=self.reranker_model,
            top_n=self.final_k,
        )

        # State containers (populated by ingest_documents)
        self.vector_store: FAISS | None = None
        self.ensemble: EnsembleRetriever | None = None
        self.retriever_chain: ContextualCompressionRetriever | None = None

    def _build_retriever_chain(self, chunks: List[Document]) -> None:
        """Constructs the hybrid retrieval + reranking pipeline over the indexed chunks."""
        logger.info("Initializing retrieval chain components...")

        # Lexical retriever for exact match recovery (BM25 needs the corpus up front)
        lexical_retriever = BM25Retriever.from_documents(chunks)
        lexical_retriever.k = 20

        # Dense retriever backed by the FAISS index
        dense_retriever = self.vector_store.as_retriever(search_kwargs={"k": 20})

        # Hybrid ensemble; weights follow the retriever order (lexical, dense)
        self.ensemble = EnsembleRetriever(
            retrievers=[lexical_retriever, dense_retriever],
            weights=list(self.hybrid_weights),
        )

        # CrossEncoderReranker is a document compressor, not a Runnable, so it is
        # attached through ContextualCompressionRetriever rather than the | operator.
        self.retriever_chain = ContextualCompressionRetriever(
            base_compressor=self.reranker,
            base_retriever=self.ensemble,
        )
        logger.info("Retrieval chain initialized successfully.")

    def ingest_documents(self, raw_documents: List[Document]) -> None:
        """Chunks, embeds, and indexes documents with structural awareness."""
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=["\n## ", "\n### ", "\n\n", "\n", ". "],
        )
        segmented = splitter.split_documents(raw_documents)
        logger.info(f"Segmented {len(raw_documents)} documents into {len(segmented)} chunks.")

        # Build FAISS index with explicit distance metric
        self.vector_store = FAISS.from_documents(
            documents=segmented,
            embedding=self.embeddings,
            distance_strategy=DistanceStrategy.COSINE,
        )

        # Build both retrievers over the same chunks and finalize the chain
        self._build_retriever_chain(segmented)
        logger.info("Index ingestion complete. Pipeline ready for queries.")

    def query(self, user_input: str) -> RetrievalResult:
        """Executes retrieval with latency tracking and result packaging."""
        if self.retriever_chain is None:
            raise RuntimeError("Pipeline not initialized. Call ingest_documents() first.")

        start = time.perf_counter()
        retrieved_passages = self.retriever_chain.invoke(user_input)
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(f"Query processed in {elapsed_ms:.1f}ms. Retrieved {len(retrieved_passages)} passages.")

        return RetrievalResult(
            query=user_input,
            passages=retrieved_passages,
            latency_ms=elapsed_ms,
        )
The architecture prioritizes explicit state management over implicit chaining. By separating ingestion from query execution, the pipeline supports hot-swapping of vector stores, versioned index deployments, and independent scaling of embedding versus reranking workloads. The cross-encoder operates strictly as a compressor, so the generator receives at most final_k passages, ordered by reranker score.
Pitfall Guide
1. Boundary Fragmentation
Explanation: Critical information spans across chunk boundaries. When a technical procedure or logical argument is split between two segments, neither chunk achieves sufficient semantic density to rank highly.
Fix: Implement semantic-aware chunking that respects document structure (headings, paragraphs, code blocks). Maintain 10–20% overlap between adjacent segments. For complex documentation, adopt hierarchical retrieval where parent documents are retrieved first, then child chunks are filtered.
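The overlap idea can be illustrated with a small sliding-window sketch. It works on whitespace-separated words as a stand-in for the character-based splitter used in the pipeline, and the function name chunk_with_overlap is hypothetical. For reference, the pipeline defaults of chunk_size=512 and chunk_overlap=64 correspond to 12.5% overlap.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


words = "step one open the valve step two check the pressure gauge".split()
chunks = chunk_with_overlap(words, chunk_size=6, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window repeats the last two words of its predecessor, so a sentence that straddles a boundary still appears intact in at least one chunk when it is shorter than the overlap.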
2. Context Window Saturation
Explanation: Increasing top-k to improve recall floods the generator with low-signal passages. The model's attention mechanism distributes weight across irrelevant neighbors, degrading instruction adherence.
Fix: Cap final context to 3–7 passages. Use cross-encoder reranking to promote the highest-relevance segments to positions 1 and 2. Treat recall@20 as the retrieval ceiling, not the generation input.
3. Index Staleness and Duplication
Explanation: Documents drift over time. Re-ingestion without deduplication creates multiple near-identical chunks with different metadata IDs. The retriever returns redundant neighbors, crowding out genuinely relevant content.
Fix: Implement content hashing at ingestion time. Maintain a versioned index with explicit TTL policies. Run periodic deduplication sweeps using MinHash or SimHash before vector store updates.
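A minimal sketch of content hashing at ingestion time follows. The normalization rule (lowercasing and collapsing whitespace) and the function names are assumptions chosen for illustration. Exact hashing only catches duplicates that are identical after normalization; near-duplicates still need MinHash or SimHash as noted above.

```python
import hashlib
from typing import Iterable, List, Set


def content_hash(text: str) -> str:
    """Hash lowercased, whitespace-collapsed text so trivial reformatting collides."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(chunks: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each distinct chunk, preserving input order."""
    seen: Set[str] = set()
    unique: List[str] = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


incoming = ["Reset the router.", "reset  the   router.", "Replace the cable."]
kept = deduplicate(incoming)
print(kept)
```

Storing the digest alongside each chunk's metadata would also let a re-ingestion job skip chunks whose hash is already present in the index.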
4. Metric Conflation
Explanation: Measuring only end-to-end answer quality obscures whether failures originate in retrieval or generation. Teams optimize the wrong component, wasting cycles on prompt engineering while the retriever misses ground-truth passages.
Fix: Decouple evaluation. Track retrieval recall@k against ground-truth passage IDs. Measure faithfulness (does the answer derive from provided context?) and answer relevance (does the answer address the query?) independently. Use RAGAS or equivalent frameworks for automated scoring.
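Retrieval recall@k can be computed without involving the generator at all. The sketch below assumes passages are identified by string IDs; the function name and the sample IDs are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of ground-truth passages that appear in the top-k retrieved IDs."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        raise ValueError("relevant_ids must contain at least one ground-truth ID")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


retrieved = ["p7", "p2", "p9", "p4", "p1"]  # retriever output, best first
relevant = {"p2", "p4", "p8"}               # annotated ground truth
print(recall_at_k(retrieved, relevant, k=5))
print(recall_at_k(retrieved, relevant, k=2))
```

Here two of the three ground-truth passages are found within the top 5 but only one within the top 2, which points at a ranking problem rather than a generation problem.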
5. Prompt Instruction Drift
Explanation: The generator ignores provided context and defaults to parametric knowledge, producing plausible but ungrounded responses. This manifests as high answer relevance but low faithfulness.
Fix: Enforce strict system prompts with explicit fallback instructions ("Respond only using the provided context. State 'Information unavailable' if the answer cannot be derived."). Validate instruction adherence via faithfulness scoring. Consider switching to instruction-tuned variants when baseline models exhibit drift.
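One way to apply the fallback instruction consistently is to assemble the prompt in a single helper. This is a sketch only: the function name, the numbered-passage layout, and the Context/Question labels are assumptions, while the instruction wording is the one quoted above.

```python
from typing import List

FALLBACK = "Information unavailable"


def build_grounded_prompt(question: str, passages: List[str]) -> str:
    """Assemble a prompt that restricts the answer to the supplied passages."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Respond only using the provided context. "
        f"State '{FALLBACK}' if the answer cannot be derived.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "How long is the warranty?",
    ["The warranty covers parts for 24 months.", "Labor is covered for 12 months."],
)
print(prompt)
```

Numbering the passages makes it easier for a faithfulness check to trace which passage a given statement in the answer relies on.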
6. Reranker Bottleneck
Explanation: Cross-encoders compute joint attention over query-document pairs, making them computationally expensive. Running them against the full corpus or large candidate sets introduces unacceptable latency.
Fix: Strictly limit cross-encoder evaluation to the top-20 candidates from the hybrid retriever. Implement response caching for repeated queries. Consider quantized cross-encoder variants (e.g., ONNX runtime, INT8) for production throughput.
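Response caching for repeated queries can be sketched with a small in-memory TTL cache. The class name, the query normalization, and the injected clock are assumptions for illustration; the cache is unbounded, so a production version would also need a size limit or eviction policy.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLQueryCache:
    """Cache reranked results per normalized query for ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share one entry.
        return " ".join(query.lower().split())

    def get_or_compute(self, query: str, compute: Callable[[str], List[str]]) -> List[str]:
        key = self._key(query)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]  # fresh hit: skip the expensive reranking call
        result = compute(query)
        self._entries[key] = (now, result)
        return result


calls: List[str] = []


def fake_rerank(query: str) -> List[str]:
    """Stand-in for the retrieve-and-rerank step; records each invocation."""
    calls.append(query)
    return ["passage_1", "passage_2"]


cache = TTLQueryCache(ttl_seconds=3600)
cache.get_or_compute("How do I reset my password?", fake_rerank)
cache.get_or_compute("how do I reset my  password?", fake_rerank)
print(len(calls))  # the second lookup is served from the cache
```

The 3600-second TTL mirrors the cache_ttl_seconds value in the configuration template; the cache should also be cleared whenever the index is rebuilt so stale passages are not served.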
7. Synthetic Eval Bias
Explanation: Evaluation sets constructed without domain expertise or ground-truth passage mapping produce misleading metrics. LLM-generated questions often lack the specificity required to stress-test retrieval boundaries.
Fix: Build evaluation sets using actual production queries, support tickets, and domain expert annotations. Map each question to explicit ground-truth passage IDs. Validate synthetic expansions against human-reviewed baselines before deployment.
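Before an evaluation set is used, it helps to check mechanically that every question maps to passage IDs that exist in the corpus. The sketch below assumes a hypothetical record schema with a ground_truth_passage_ids field; adapt the field names to whatever your evaluation files actually contain.

```python
from typing import Dict, List, Set

# Hypothetical schema: each record needs a question and at least one passage ID.
REQUIRED_FIELDS = ("question", "ground_truth_passage_ids")


def validate_eval_set(records: List[Dict], corpus_ids: Set[str]) -> List[str]:
    """Return human-readable problems; an empty list means the set is usable."""
    problems: List[str] = []
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            problems.append(f"record {index}: missing {', '.join(missing)}")
            continue
        unknown = sorted(set(record["ground_truth_passage_ids"]) - corpus_ids)
        if unknown:
            problems.append(f"record {index}: unknown passage IDs {unknown}")
    return problems


corpus_ids = {"kb-001", "kb-002", "kb-003"}
records = [
    {"question": "How do I rotate an API key?", "ground_truth_passage_ids": ["kb-002"]},
    {"question": "What is the refund window?", "ground_truth_passage_ids": []},
    {"question": "Where are audit logs stored?", "ground_truth_passage_ids": ["kb-009"]},
]
problems = validate_eval_set(records, corpus_ids)
for problem in problems:
    print(problem)
```

Records without a ground-truth mapping cannot contribute to recall@k, so they are reported rather than silently scored.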
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume internal knowledge base | Hybrid + Cross-Encoder Reranking | Balances precision with throughput; reduces token spend by limiting context to 5 passages | Low latency, moderate compute, 40% token reduction |
| Real-time customer support chat | Dense Retrieval + Lightweight Reranker | Minimizes p95 latency; cross-encoder overhead may violate SLA thresholds | Higher latency tolerance, lower infrastructure cost |
| Regulatory/compliance documentation | Hierarchical Chunking + Strict Faithfulness Scoring | Ensures complete procedural context; prevents hallucination in high-stakes domains | Higher ingestion cost, lower risk exposure |
| Multi-lingual enterprise corpus | Cross-lingual Embeddings + Language-Specific Rerankers | Preserves semantic alignment across languages; avoids translation-induced signal loss | Increased model storage, proportional to language count |
Configuration Template
# rag_pipeline_config.yaml
retrieval:
  embedding_model: "BAAI/bge-small-en-v1.5"
  chunk_size: 512
  chunk_overlap: 64
  separators: ["\n## ", "\n### ", "\n\n", "\n", ". "]
  hybrid_weights:
    lexical: 0.4
    dense: 0.6
  candidate_pool: 20
  final_context_size: 5

reranking:
  model: "BAAI/bge-reranker-base"
  quantization: "int8"
  cache_ttl_seconds: 3600
  max_concurrent_inference: 4

evaluation:
  metrics: ["recall_at_k", "faithfulness", "answer_relevance"]
  ground_truth_mapping: true
  synthetic_expansion: false
  human_review_threshold: 0.85

deployment:
  vector_store: "FAISS"
  index_versioning: true
  deduplication: "content_hash"
  ttl_policy: "30d"
Quick Start Guide
- Prepare your evaluation set: Compile 50–100 representative queries with explicit ground-truth passage IDs. Store as JSON with question, contexts, answer, and ground_truth fields.
- Initialize the pipeline: Load the configuration template, instantiate the ProductionRAGPipeline class, and run ingest_documents() against your source corpus. Verify chunk boundaries align with semantic sections.
- Execute baseline retrieval: Run queries against the unoptimized pipeline. Record recall@5, latency, and token consumption. Identify failure modes using the pitfall guide.
- Activate reranking and hybrid search: Enable BM25 lexical retrieval, configure ensemble weights, and attach the cross-encoder compressor. Re-run the evaluation set and compare metrics against baseline.
- Deploy metric monitoring: Wire recall@k, faithfulness, and answer relevance to your observability stack. Set alert thresholds for faithfulness drops below 0.85 and recall@5 below 0.75. Iterate on chunking and reranking before modifying generator prompts.
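The alert thresholds from the last step can be checked with a few lines of code before they are wired into a real observability stack. The metric key names and the function name below are assumptions; the threshold values are the ones stated above.

```python
from typing import Dict, List

# Minimum acceptable values, taken from the monitoring step above.
THRESHOLDS: Dict[str, float] = {"faithfulness": 0.85, "recall_at_5": 0.75}


def find_breaches(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or below their minimum."""
    breaches: List[str] = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            breaches.append(name)
    return breaches


# Hypothetical scores from one evaluation run.
latest_run = {"faithfulness": 0.91, "recall_at_5": 0.68, "answer_relevance": 0.88}
breached = find_breaches(latest_run, THRESHOLDS)
print(breached)
```

A breach on recall_at_5 with healthy faithfulness suggests working on chunking and reranking first, in line with the advice to iterate on retrieval before touching generator prompts.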