ost efficiency, and compliance. The following implementation outlines a production-ready pattern using open-standard components.
Step 1: Ingestion & Chunking Strategy
Fixed-size chunking destroys semantic boundaries. Enterprise documents require structural awareness. Implement a hybrid chunking strategy:
- Parse documents using layout-aware extractors (e.g., Unstructured, Marker, or Adobe PDF Extract)
- Split by semantic units (headings, paragraphs, code blocks) with 15–20% overlap
- Attach metadata: source URI, section hierarchy, author, classification, update timestamp
- Store raw chunks in object storage (S3/GCS) with checksums for idempotent reprocessing
# Semantic chunking with metadata attachment
# Document, Chunk, and LayoutParser are assumed to be defined by the application.
from datetime import datetime, timezone
from typing import List

def chunk_document(doc: Document) -> List[Chunk]:
    parser = LayoutParser(model="layoutlmv3")
    blocks = parser.extract(doc.content)
    chunks = []
    for i, block in enumerate(blocks):
        chunk = Chunk(
            text=block.text,
            metadata={
                "source": doc.uri,
                "section": block.heading,
                "chunk_index": i,
                # Timezone-aware UTC timestamp
                "updated_at": datetime.now(timezone.utc).isoformat(),
                "classification": doc.classification,
            },
        )
        chunks.append(chunk)
    return chunks
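The overlap and checksum bullets are not covered by the function above. A minimal sketch of both, assuming blank-line-separated paragraphs stand in for semantic units and that overlap is measured in characters of the preceding unit (both simplifications). The helper names `split_with_overlap`, `chunk_checksum`, and `needs_reprocessing` are hypothetical:

```python
import hashlib
from typing import List, Set


def split_with_overlap(text: str, overlap_ratio: float = 0.15) -> List[str]:
    """Split on blank lines and prefix each unit with the tail of the previous one."""
    units = [u.strip() for u in text.split("\n\n") if u.strip()]
    chunks: List[str] = []
    for i, unit in enumerate(units):
        if i == 0:
            chunks.append(unit)
            continue
        prev = units[i - 1]
        # Carry over roughly overlap_ratio of the previous unit's characters.
        tail_len = int(len(prev) * overlap_ratio)
        tail = prev[-tail_len:] if tail_len > 0 else ""
        chunks.append(f"{tail}\n{unit}" if tail else unit)
    return chunks


def chunk_checksum(chunk_text: str) -> str:
    """SHA-256 of the chunk text; identical text always yields the same digest."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


def needs_reprocessing(chunk_text: str, stored_checksums: Set[str]) -> bool:
    """Return False when a chunk with the same checksum has already been stored."""
    return chunk_checksum(chunk_text) not in stored_checksums
```

Comparing checksums before embedding lets a re-run of the ingestion job skip unchanged chunks, which is what makes reprocessing idempotent.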
Step 2: Embedding & Hybrid Search
Dense vectors alone fail on exact matches, acronyms, and structured data. Implement hybrid search combining dense embeddings with sparse lexical retrieval (BM25).
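from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
# Dense + Sparse ensemble
# Assumes `chunks` is a list of LangChain Document objects and `embedding_model` is an Embeddings instance.
dense_retriever = FAISS.from_documents(chunks, embedding_model).as_retriever(search_kwargs={"k": 15})
sparse_retriever = BM25Retriever.from_documents(chunks)
sparse_retriever.k = 15  # match the dense candidate count; the BM25 default is smaller
ensemble = EnsembleRetriever(retrievers=[dense_retriever, sparse_retriever], weights=[0.6, 0.4])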
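To make the 0.6/0.4 weighting concrete: `EnsembleRetriever` is documented as fusing results with a weighted form of Reciprocal Rank Fusion. The following is a dependency-free sketch of that idea, not the library's implementation. The function name `weighted_rrf`, the constant `c = 60`, and the document IDs are illustrative assumptions:

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Sequence[List[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[str]:
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    if len(rankings) != len(weights):
        raise ValueError("one weight per ranking is required")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (c + rank); earlier ranks add more.
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first; ties keep first-seen order (sorted is stable).
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical dense ranking
sparse_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 ranking
fused = weighted_rrf([dense_hits, sparse_hits], weights=[0.6, 0.4])
```

Because fusion works on ranks rather than raw scores, the dense and sparse retrievers do not need comparable score scales.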
Step 3: Cross-Encoder Reranking
Vector search returns candidates; reranking orders them by semantic relevance to the query. Use a cross-encoder model (e.g., bge-reranker-large, ms-marco-MiniLM-L-12-v2) to score top-15 candidates and truncate to top-5.
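from typing import List

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, candidates: List[str]) -> List[str]:
    # Score each (query, candidate) pair jointly with the cross-encoder
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    # Sort on the score only, so tied scores never fall back to comparing documents
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:5]]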
Step 4: Query Routing & Semantic Caching
Not all queries require full retrieval. Implement a routing layer:
- Exact match or FAQ queries → Redis semantic cache (cosine similarity threshold 0.92)
- Complex queries → hybrid search + reranking
- Fallback → direct LLM generation with disclaimer when confidence < threshold
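The three routing rules above can be sketched as a small decision function. This assumes the caller has already computed a cache similarity and, where available, a retrieval confidence score. The names `Route`, `RoutingSignals`, and `route_query`, and the 0.5 confidence threshold, are illustrative assumptions; only the 0.92 cache threshold comes from the text:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    CACHE = "semantic_cache"
    RETRIEVAL = "hybrid_search_rerank"
    FALLBACK = "direct_llm_with_disclaimer"


@dataclass
class RoutingSignals:
    # Best cosine similarity against cached queries; None if nothing is cached.
    cache_similarity: Optional[float] = None
    # Retrieval confidence in [0, 1]; None if retrieval has not run yet.
    retrieval_confidence: Optional[float] = None


def route_query(
    signals: RoutingSignals,
    cache_threshold: float = 0.92,
    confidence_threshold: float = 0.5,  # assumed value, tune per workload
) -> Route:
    """Pick a route: cache hit first, then low-confidence fallback, else retrieval."""
    if signals.cache_similarity is not None and signals.cache_similarity >= cache_threshold:
        return Route.CACHE
    if (
        signals.retrieval_confidence is not None
        and signals.retrieval_confidence < confidence_threshold
    ):
        return Route.FALLBACK
    return Route.RETRIEVAL
```

Checking the cache first keeps the cheapest path ahead of any retrieval or generation work.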
Cache keys should combine query embedding, user tenant ID, and document version hash to prevent stale responses.
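A minimal in-memory sketch of this keying scheme, standing in for the Redis-backed cache. Entries are partitioned by tenant ID and document version hash, and a lookup is a hit only when cosine similarity to a cached query embedding reaches the threshold. The class name `SemanticCache` and its methods are hypothetical, and a production version would use a vector index rather than a linear scan:

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory stand-in for a semantic cache partitioned by tenant and version."""

    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        # (tenant_id, doc_version_hash) -> list of (query embedding, answer)
        self._entries: Dict[Tuple[str, str], List[Tuple[List[float], str]]] = {}

    def put(self, tenant_id: str, doc_version: str,
            embedding: Sequence[float], answer: str) -> None:
        key = (tenant_id, doc_version)
        self._entries.setdefault(key, []).append((list(embedding), answer))

    def get(self, tenant_id: str, doc_version: str,
            embedding: Sequence[float]) -> Optional[str]:
        best_answer: Optional[str] = None
        best_score = self.threshold
        # Only entries for the same tenant and document version are considered.
        for cached_embedding, answer in self._entries.get((tenant_id, doc_version), []):
            score = cosine_similarity(embedding, cached_embedding)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer
```

Because the document version hash is part of the partition key, re-indexing under a new version invalidates old answers without an explicit purge.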
Step 5: Security & Governance Layer
Enterprise RAG must enforce data boundaries at query time:
- Row-level filtering via metadata predicates (e.g., {"classification": "internal", "tenant_id": "acme"})
- PII redaction pre- and post-generation using regex + NER models
- Query/response audit logging to immutable storage
- RBAC integration with identity provider (Okta, Azure AD, Auth0)
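# Metadata-aware retrieval filter
def retrieve_with_rbac(query: str, tenant_id: str, user_role: str) -> List[Document]:
    filter_dict = {"tenant_id": tenant_id}
    if user_role == "viewer":
        filter_dict["classification"] = "public"
    # Assumption: EnsembleRetriever.invoke does not forward a `filter` argument to
    # its sub-retrievers, so the predicates are enforced on the returned documents.
    docs = ensemble.invoke(query)
    return [d for d in docs if all(d.metadata.get(k) == v for k, v in filter_dict.items())]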
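The regex half of the PII redaction bullet could look like the sketch below. The patterns are deliberately simple, US-centric illustrations rather than a complete PII taxonomy, and the names `PII_PATTERNS` and `redact_pii` are hypothetical. An NER pass for names and addresses would run alongside it, as the list above notes:

```python
import re
from typing import Dict, Pattern

# Illustrative patterns only; real deployments need locale-specific rules.
PII_PATTERNS: Dict[str, Pattern[str]] = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each regex match with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact jane.doe@example.com or 555-123-4567, SSN 123-45-6789."
redacted = redact_pii(sample)
```

The same function can be applied to retrieved context before generation and to the model output afterwards, which covers the "pre- and post-generation" requirement.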
Architecture Decisions
- Async ingestion pipeline: Decouple document processing from query latency. Use message queues (Kafka/RabbitMQ) for chunking, embedding, and indexing.
- Versioned vector indices: Maintain snapshot-based indices to support rollback and A/B testing without downtime.
- Model routing: Route queries to lightweight models for simple retrieval and heavy models for complex reasoning. Use a classifier or confidence score to trigger routing.
- Evaluation loop: Integrate RAGAS or TruLens for continuous measurement of faithfulness, answer relevance, and context precision. Trigger alerts when metrics drift >5%.
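The drift alert in the evaluation loop can be sketched without committing to a specific framework. This assumes metric scores (for example, exported from RAGAS or TruLens) arrive as plain dictionaries and that "drift" means a relative drop from a stored baseline; the text does not say whether the 5% is relative or absolute. The function name `drifted_metrics` is hypothetical:

```python
from typing import Dict, List


def drifted_metrics(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_relative_drop: float = 0.05,
) -> List[str]:
    """Return the names of metrics that fell more than max_relative_drop below baseline."""
    alerts: List[str] = []
    for name, base in baseline.items():
        if name not in current or base <= 0:
            continue  # nothing to compare against
        drop = (base - current[name]) / base
        if drop > max_relative_drop:
            alerts.append(name)
    return alerts


baseline = {"faithfulness": 0.90, "answer_relevance": 0.88, "context_precision": 0.80}
current = {"faithfulness": 0.84, "answer_relevance": 0.87, "context_precision": 0.82}
alerts = drifted_metrics(baseline, current)
```

Here only `faithfulness` is flagged: it dropped by about 6.7% relative to baseline, while the other two metrics stayed within tolerance or improved.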
Pitfall Guide
- Chunking without semantic boundaries: Fixed-token splits fracture tables, code, and headings, causing retrieval noise. Always parse by document structure.
- Vector-only search: Dense embeddings miss exact matches, acronyms, and numeric data. Hybrid search is mandatory for enterprise accuracy.
- Skipping reranking: Top-15 vector results contain low-signal candidates. Cross-encoder reranking consistently improves precision by 15–25%.
- Neglecting evaluation pipelines: Without automated RAG metrics, accuracy degradation goes undetected until user complaints surface. Implement continuous evaluation from day one.
- Hardcoding prompts: Static prompts cannot adapt to query complexity or retrieved context length. Use dynamic templating with context-aware compression.
- Ignoring cost/latency tradeoffs: Unbounded retrieval and redundant LLM calls explode costs. Implement semantic caching, result compression, and model routing.
- Treating RAG as stateless: Enterprise workloads require session context, user-specific filters, and audit trails. Stateless designs fail compliance and personalization requirements.
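As one illustration of the dynamic templating pitfall, the sketch below packs reranked chunks into a prompt until a context budget is exhausted. It is budget-based truncation rather than true compression or summarization. The four-characters-per-token estimate, the 1500-token default, the prompt wording, and the names `estimate_tokens` and `build_prompt` are all assumptions:

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def build_prompt(question: str, ranked_chunks: List[str], context_budget: int = 1500) -> str:
    """Add chunks in rank order until the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > context_budget:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(selected, start=1))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the chunks gives the model stable handles to cite, which also helps when auditing which sources informed an answer.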
Production Bundle
Action Checklist
Decision Matrix
| Component | Option A | Option B | Option C | Best For |
|---|---|---|---|---|
| Vector DB | pgvector | Weaviate | Milvus | Small/medium: pgvector. Multi-tenant: Weaviate. High-scale: Milvus |
| Embedding Model | text-embedding-3-large | bge-m3 | nomic-embed | Accuracy-critical: 3-large. Multilingual: bge-m3. Cost-optimized: nomic |
| Orchestration | LangChain | LlamaIndex | Haystack | Rapid prototyping: LangChain. Document-heavy: LlamaIndex. Production pipelines: Haystack |
| Caching Strategy | Redis (semantic) | Upstash | Custom LRU | Low-latency: Redis. Serverless: Upstash. Simple workloads: Custom LRU |
| Evaluation Framework | RAGAS | TruLens | DeepEval | Standard metrics: RAGAS. Observability: TruLens. CI/CD integration: DeepEval |
Configuration Template
# docker-compose.rag.yml
version: "3.9"
services:
  vector-db:
    image: weaviate/weaviate:1.24.0
    environment:
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "false"
      AUTHORIZATION_ADMINLIST_ENABLED: "true"
      QUERY_DEFAULTS_LIMIT: 25
    ports: ["8080:8080"]
    volumes: ["weaviate_data:/var/lib/weaviate"]
  cache:
    image: redis/redis-stack:7.2.0-v10
    ports: ["6379:6379"]
    command: ["redis-server", "--save", "60", "1", "--loglevel", "warning"]
  ingestion-worker:
    build: ./workers/ingestion
    environment:
      VECTOR_DB_URL: "http://vector-db:8080"
      EMBEDDING_MODEL: "BAAI/bge-m3"
      CHUNK_OVERLAP: "0.15"
    depends_on: [vector-db]
  api-gateway:
    build: ./services/api
    environment:
      VECTOR_DB_URL: "http://vector-db:8080"
      REDIS_URL: "redis://cache:6379"
      RERANKER_MODEL: "cross-encoder/ms-marco-MiniLM-L-12-v2"
      RBAC_PROVIDER: "azure-ad"
      AUDIT_LOG_ENDPOINT: "https://logs.internal.company.com/rag"
    ports: ["8000:8000"]
    depends_on: [vector-db, cache]
volumes:
  weaviate_data:
Quick Start Guide
- Ingest sample data: Run the ingestion worker against a 100-document corpus. Verify chunk metadata, embedding dimensions, and index versioning.
- Deploy hybrid search: Configure dense + BM25 retrievers with 0.6/0.4 weighting. Test query latency and recall@10 against a validation set.
- Add reranker & cache: Integrate cross-encoder reranking. Enable semantic caching with cosine similarity threshold 0.92 and tenant-aware keys.
- Validate & monitor: Run RAGAS evaluation pipeline. Confirm P95 latency <250ms, recall@10 >0.85, and cost per 1K queries <$5. Enable audit logging and RBAC filters.
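The validation gates in the last step can be checked with a few small helpers. This is a sketch assuming recall@10 is computed per query against a labeled set of relevant document IDs and that P95 uses the nearest-rank method. The function names are hypothetical; the thresholds are the ones quoted above:

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the latency samples."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def passes_gates(mean_recall_at_10: float, p95_latency_ms: float,
                 cost_per_1k_queries: float) -> bool:
    """Apply the quick-start targets: recall@10 > 0.85, P95 < 250 ms, cost < $5."""
    return (
        mean_recall_at_10 > 0.85
        and p95_latency_ms < 250
        and cost_per_1k_queries < 5.0
    )
```

Running these checks in CI against a fixed validation set turns the targets into a release gate rather than a one-time measurement.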
Enterprise RAG is not a single model call. It is a distributed retrieval system with strict accuracy, latency, and compliance requirements. By decoupling ingestion, enforcing hybrid search, implementing reranking, caching strategically, and embedding continuous evaluation, teams can transition from prototype to production without sacrificing performance or governance.