mented into semantically coherent units. Chunks should average 300–500 tokens to balance context richness with retrieval precision. Overlapping boundaries prevent critical information from being split across segments.
```python
from dataclasses import dataclass
from typing import List


@dataclass
class DocumentChunk:
    chunk_id: str
    source_text: str
    metadata: dict


def segment_corpus(raw_texts: List[str], target_size: int = 400, overlap: int = 50) -> List[DocumentChunk]:
    """Split raw text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < target_size:
        raise ValueError("overlap must be non-negative and smaller than target_size")
    stride = target_size - overlap  # how far each window advances
    chunks: List[DocumentChunk] = []
    for source_index, source in enumerate(raw_texts):
        # Words are a rough proxy for tokens; swap in a tokenizer for exact budgets.
        words = source.split()
        for start in range(0, len(words), stride):
            window = words[start : start + target_size]
            chunks.append(DocumentChunk(
                chunk_id=f"doc_{len(chunks)}",
                source_text=" ".join(window),
                metadata={"source_index": source_index, "length": len(window)},
            ))
    return chunks
```
Architecture Rationale: Overlapping chunks preserve contextual continuity. A 50-word overlap (the code splits on whitespace, so words stand in for tokens) makes it more likely that a concept spanning a chunk boundary appears whole in at least one chunk. Metadata attachment enables downstream filtering without re-embedding.
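To make the overlap arithmetic concrete, the sketch below computes where each window starts and how many words two neighbouring chunks share. `window_spans` is a hypothetical helper, not part of the pipeline above, and it counts whitespace-separated words as a stand-in for tokens.

```python
from typing import List, Tuple


def window_spans(n_words: int, target_size: int = 400, overlap: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) word offsets for each overlapping window."""
    if not 0 <= overlap < target_size:
        raise ValueError("overlap must be non-negative and smaller than target_size")
    stride = target_size - overlap  # each window advances by this many words
    return [(start, min(start + target_size, n_words)) for start in range(0, n_words, stride)]


spans = window_spans(1000)
for (a_start, a_end), (b_start, b_end) in zip(spans, spans[1:]):
    shared = a_end - b_start  # words present in both neighbouring windows
    print(f"[{a_start}, {a_end}) and [{b_start}, {b_end}) share {shared} words")
```

With the defaults, a 1,000-word document yields three windows, and each neighbouring pair shares exactly the 50 overlap words.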
Phase 2: Vector Embedding Pipeline
Embeddings transform text into dense numerical representations. We use all-MiniLM-L6-v2 for its balance of speed and dimensionality (384 dimensions, ~80MB footprint). L2 normalization ensures cosine similarity reduces to a simple dot product, eliminating expensive square root calculations during search.
```python
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer


class EmbeddingEngine:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self._model = SentenceTransformer(model_name)
        self._vectors: np.ndarray = np.array([])
        self._index_map: List[str] = []

    def index(self, corpus: List[DocumentChunk]) -> None:
        """Embed every chunk and keep the chunk IDs in row order."""
        texts = [c.source_text for c in corpus]
        self._index_map = [c.chunk_id for c in corpus]
        # normalize_embeddings=True returns unit-length rows, so dot product equals cosine similarity.
        raw_embeddings = self._model.encode(texts, normalize_embeddings=True)
        self._vectors = np.asarray(raw_embeddings)

    @property
    def dimensionality(self) -> int:
        # np.array([]) is one-dimensional, so guard before reading shape[1].
        return self._vectors.shape[1] if self._vectors.size > 0 else 0
```
Architecture Rationale: Local embedding models remove external API dependencies and latency from the retrieval path. Normalization at index time shifts the norm computation to ingestion, so each query is a single matrix-vector product. The scan is still linear in the number of chunks, O(N·d) for d-dimensional vectors; the saving is a constant factor from skipping per-vector magnitude calculations at query time.
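The equivalence that this design relies on can be checked in a few lines. The sketch below uses plain Python rather than NumPy so it runs without dependencies; `l2_normalize`, `dot`, and `cosine` are illustrative helpers, not functions from the pipeline.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed the long way, with both norms at query time."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


doc_vec, query_vec = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
# After normalization the plain dot product gives the same score as cosine().
assert math.isclose(dot(l2_normalize(doc_vec), l2_normalize(query_vec)), cosine(doc_vec, query_vec))
```

Normalizing once at ingestion means only the query vector has to be normalized per request.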
Phase 3: Semantic Routing & Retrieval
Query embedding follows the same normalization pipeline. Similarity scoring is a matrix multiplication against the stored vector matrix. Top-K indices are extracted and mapped back to source text.
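```python
class SemanticRouter:
    def __init__(self, engine: EmbeddingEngine):
        self._engine = engine

    def fetch_context(self, query: str, top_k: int = 3) -> List[DocumentChunk]:
        # Nothing indexed yet: the empty placeholder array cannot be multiplied.
        if self._engine._vectors.size == 0:
            return []
        # Embed the query with the same normalization used at index time.
        query_vec = self._engine._model.encode([query], normalize_embeddings=True)
        # (N, d) @ (d, 1) -> (N, 1); flatten to one cosine score per chunk.
        similarity_scores = (self._engine._vectors @ query_vec.T).flatten()
        # argsort is ascending, so reverse it and keep the best top_k.
        ranked_indices = np.argsort(similarity_scores)[::-1][:top_k]
        # Reconstruct chunks from index map (simplified for clarity)
        results = []
        for idx in ranked_indices:
            results.append(DocumentChunk(
                chunk_id=self._engine._index_map[idx],
                source_text="",  # In production, store text alongside vectors or fetch from DB
                metadata={},
            ))
        return results
```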
Architecture Rationale: Dot product on L2-normalized vectors is mathematically equivalent to cosine similarity but executes 3–5x faster in NumPy. This approach scales efficiently to 100k+ chunks before requiring approximate nearest neighbor (ANN) indexing.
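The router above returns chunks with an empty `source_text` placeholder. One way to fill it, sketched here under the assumption that the corpus fits in memory, is to keep an ID-to-chunk lookup built at ingestion. `StoredChunk`, `build_lookup`, and `hydrate` are hypothetical names introduced for illustration; a database fetch would play the same role at larger scale.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class StoredChunk:
    chunk_id: str
    source_text: str
    metadata: dict = field(default_factory=dict)


def build_lookup(chunks: Iterable[StoredChunk]) -> Dict[str, StoredChunk]:
    """Index chunks by ID once, at ingestion time."""
    return {c.chunk_id: c for c in chunks}


def hydrate(ranked_ids: Iterable[str], lookup: Dict[str, StoredChunk]) -> List[StoredChunk]:
    """Map ranked IDs back to full chunks, keeping rank order and skipping unknown IDs."""
    return [lookup[cid] for cid in ranked_ids if cid in lookup]


lookup = build_lookup([
    StoredChunk("doc_0", "Refunds are processed within 14 days."),
    StoredChunk("doc_1", "Shipping is free above 50 EUR."),
])
print([c.source_text for c in hydrate(["doc_1", "doc_0"], lookup)])
```

The texts returned this way are what the synthesis phase expects as its context blocks.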
Phase 4: Grounded Synthesis
The generation phase injects retrieved chunks into a constrained prompt. Anthropic's Messages API handles the synthesis. Prompt caching is applied to the system instruction block to leverage the 5-minute ephemeral window.
```python
import anthropic
from typing import List


class ResponseSynthesizer:
    def __init__(self, api_key: str, model_id: str = "claude-sonnet-4-6"):
        self._client = anthropic.Anthropic(api_key=api_key)
        self._model_id = model_id

    def generate(self, query: str, context_blocks: List[str]) -> str:
        # Number each block so the model can cite it as [REF n].
        formatted_context = "\n\n".join(f"[REF {i+1}] {block}" for i, block in enumerate(context_blocks))
        system_prompt = [
            {
                "type": "text",
                "text": "Synthesize answers using ONLY the provided reference blocks. Cite reference numbers. State explicitly if information is missing.",
                # Marks this block as a cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ]
        response = self._client.messages.create(
            model=self._model_id,
            max_tokens=512,
            system=system_prompt,
            messages=[{"role": "user", "content": f"References:\n{formatted_context}\n\nQuery: {query}"}],
        )
        # The first content block holds the generated text.
        return response.content[0].text
```
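Architecture Rationale: The cache_control flag marks the system prompt as a cacheable prefix. Anthropic retains an identical prefix for 5 minutes, refreshing the window on each hit, and serves subsequent requests from cache. This reduces input token billing by up to 90% on the cached portion for high-frequency queries. Caching only takes effect once the prefix meets the model's minimum cacheable length, so a short instruction like the one above is processed normally unless it is extended, for example with stable reference material. The grounding instruction constrains what the model may draw on, but it does not change sampling behavior; set temperature explicitly when the workflow needs low-variance output.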
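The headline saving applies only to the cached part of each request, so the overall effect depends on how much of the input is a stable prefix. The sketch below estimates that ratio. `relative_input_cost` is a hypothetical helper, and its default multipliers (cache writes at 1.25x and cache reads at 0.10x the base input price) are assumptions that should be checked against current pricing.

```python
def relative_input_cost(cached_tokens: int, uncached_tokens: int, requests: int,
                        write_multiplier: float = 1.25, read_multiplier: float = 0.10) -> float:
    """Input cost with caching divided by input cost without it.

    Assumes the first request writes the cache and every later request
    arrives inside the cache lifetime and reads it.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    baseline = requests * (cached_tokens + uncached_tokens)
    with_cache = (
        cached_tokens * write_multiplier                     # one cache write
        + cached_tokens * read_multiplier * (requests - 1)   # later cache reads
        + uncached_tokens * requests                         # query and references, never cached
    )
    return with_cache / baseline


print(round(relative_input_cost(cached_tokens=2000, uncached_tokens=500, requests=100), 4))
```

Under these assumptions, a 2,000-token cached prefix with 500 uncached tokens per request over 100 requests comes to roughly 29% of the uncached input cost, and a single isolated request costs more than it would without caching.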
Pitfall Guide
1. Semantic Fragmentation
Explanation: Splitting documents at arbitrary character counts or newlines severs logical relationships. A retrieved chunk may contain a premise without its conclusion, forcing the model to guess.
Fix: Implement semantic chunking using sentence boundary detection or recursive text splitters. Maintain 10–15% overlap between segments to preserve cross-boundary context.
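A minimal sketch of sentence-aware chunking follows. It assumes a naive boundary rule (split after `.`, `!`, or `?` followed by whitespace), which will mis-split abbreviations; a dedicated sentence tokenizer would be the sturdier choice. `sentence_chunks` and its parameters are illustrative names, and word counts again stand in for tokens.

```python
import re
from typing import List


def _word_count(sentences: List[str]) -> int:
    return sum(len(s.split()) for s in sentences)


def sentence_chunks(text: str, max_words: int = 400, overlap_sentences: int = 1) -> List[str]:
    """Greedily pack whole sentences into chunks of roughly max_words words.

    The last overlap_sentences sentences of a chunk are repeated at the start
    of the next one when they fit. A single sentence longer than max_words is
    kept whole rather than split.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        if current and _word_count(current + [sentence]) > max_words:
            chunks.append(" ".join(current))
            carried = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the carried sentences if they would push the new chunk over budget.
            current = carried if _word_count(carried + [sentence]) <= max_words else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Alpha one two. Beta three four. Gamma five six. Delta seven eight."
for chunk in sentence_chunks(sample, max_words=6, overlap_sentences=1):
    print(chunk)
```

Because chunks end on sentence boundaries, a premise and its conclusion are separated only when they sit in different sentences, and the carried sentence softens even that case.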
2. Unnormalized Vector Mathematics
Explanation: Computing cosine similarity without L2 normalization requires manual division by vector magnitudes. Skipping normalization or applying it inconsistently between index and query phases corrupts similarity scores.
Fix: Normalize embeddings at both ingestion and query time. Verify that np.linalg.norm(vector) ≈ 1.0 before performing dot products.
3. Context Leakage & Grounding Failure
Explanation: LLMs are trained to be helpful and will default to internal knowledge when retrieval context is sparse or ambiguous. Without explicit constraints, models blend retrieved data with pretraining, producing unverifiable answers.
Fix: Enforce strict system instructions. Use reference tagging ([REF 1]) and require citation. Set a low temperature explicitly (e.g., temperature=0.1) to minimize creative variance; omitting the parameter leaves the API default (1.0) in place.
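The fix above can be sketched as a small request builder that sets the temperature explicitly and tags each reference. `build_grounded_request` is a hypothetical helper; it only assembles keyword arguments for `client.messages.create()` and makes no network call. It assumes the chosen model accepts a `temperature` parameter.

```python
from typing import Any, Dict, List

GROUNDING_INSTRUCTION = (
    "Synthesize answers using ONLY the provided reference blocks. "
    "Cite reference numbers. State explicitly if information is missing."
)


def build_grounded_request(query: str, context_blocks: List[str],
                           model_id: str = "claude-sonnet-4-6",
                           temperature: float = 0.1) -> Dict[str, Any]:
    """Assemble keyword arguments for client.messages.create()."""
    references = "\n\n".join(f"[REF {i}] {block}" for i, block in enumerate(context_blocks, start=1))
    return {
        "model": model_id,
        "max_tokens": 512,
        "temperature": temperature,  # set explicitly; omitting it leaves the API default in place
        "system": GROUNDING_INSTRUCTION,
        "messages": [{"role": "user", "content": f"References:\n{references}\n\nQuery: {query}"}],
    }


request = build_grounded_request("What is the refund window?", ["Refunds are processed within 14 days."])
# With a configured client: response = client.messages.create(**request)
```

Keeping the request assembly in a pure function also makes the grounding prompt easy to unit-test without spending tokens.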
4. Cache Invalidation Blindness
Explanation: Anthropic's prompt caching uses a 5-minute ephemeral window and matches on an exact prefix. Developers assuming persistent caching will see unexpected cost spikes when the system prompt changes or when requests arrive more than 5 minutes apart.
Fix: Design system prompts to be static across query types. Monitor cache hit rates through the usage fields in each API response (cache_read_input_tokens and cache_creation_input_tokens). Treat caching as a traffic-dependent optimization, not a guaranteed state.
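Cache activity is reported per response in the `usage` object of the Messages API, so a hit rate can be derived without extra instrumentation. The sketch below works on a plain mapping holding those fields; `cache_read_ratio` is a hypothetical helper name.

```python
from typing import Mapping


def cache_read_ratio(usage: Mapping[str, int]) -> float:
    """Fraction of input tokens served from cache for one response.

    Expects the usage fields returned when prompt caching is in play:
    input_tokens (uncached), cache_creation_input_tokens (written to cache)
    and cache_read_input_tokens (served from cache).
    """
    read = usage.get("cache_read_input_tokens", 0) or 0
    written = usage.get("cache_creation_input_tokens", 0) or 0
    uncached = usage.get("input_tokens", 0) or 0
    total = read + written + uncached
    return read / total if total else 0.0


warm = {"input_tokens": 100, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 900}
cold = {"input_tokens": 100, "cache_creation_input_tokens": 900, "cache_read_input_tokens": 0}
print(cache_read_ratio(warm), cache_read_ratio(cold))
```

With the SDK, the same fields are available as attributes on `response.usage`. A ratio that stays at zero across repeated requests suggests the prefix is changing between calls, is too short to be cached, or the traffic is too sparse to keep the cache warm.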
5. Silent Retrieval Failures
Explanation: Vector search always returns top-K results, even when similarity scores are near zero. The model receives irrelevant context and fabricates answers to satisfy the prompt.
Fix: Implement a similarity threshold (e.g., treat scores below 0.65 as low confidence). Return a fallback message or trigger a secondary search strategy when confidence drops. Log low-confidence retrievals for pipeline tuning.
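A threshold gate of this kind can be sketched as a post-filter on scored results. `select_confident` and `FALLBACK_MESSAGE` are hypothetical names, and 0.65 is only the example value from the text; a suitable cut-off depends on the embedding model and corpus.

```python
from typing import List, Optional, Tuple

FALLBACK_MESSAGE = "No sufficiently relevant reference was found for this query."


def select_confident(scored: List[Tuple[str, float]], threshold: float = 0.65,
                     top_k: int = 3) -> Optional[List[Tuple[str, float]]]:
    """Keep the top_k (chunk_id, score) pairs at or above threshold, or None if none qualify."""
    kept = sorted((pair for pair in scored if pair[1] >= threshold), key=lambda p: p[1], reverse=True)
    return kept[:top_k] or None


hits = select_confident([("doc_3", 0.81), ("doc_7", 0.12), ("doc_1", 0.66)])
print(hits if hits is not None else FALLBACK_MESSAGE)
```

Returning `None` rather than an empty list makes the low-confidence path explicit, so the caller has to decide between a fallback message and a secondary search.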
6. Unfiltered Corpus Search
Explanation: Semantic search over an unfiltered corpus retrieves outdated or irrelevant documents. A query about "2024 pricing" may return 2022 documentation if semantic similarity outweighs temporal relevance.
Fix: Apply metadata filters (date ranges, product categories, access levels) before vector search. Use hybrid approaches: filter first, then rank semantically.
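The filter-first, rank-second order can be sketched as follows. `IndexedChunk` and `filtered_search` are hypothetical, the vectors are toy two-dimensional values assumed to be L2-normalized, and a vector database would normally apply the filter inside the query.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass
class IndexedChunk:
    chunk_id: str
    vector: Sequence[float]  # assumed L2-normalized
    published: date
    category: str


def filtered_search(chunks: List[IndexedChunk], query_vec: Sequence[float],
                    not_before: date, category: str, top_k: int = 3) -> List[str]:
    """Apply metadata filters first, then rank only the survivors by dot product."""
    candidates = [c for c in chunks if c.published >= not_before and c.category == category]
    candidates.sort(key=lambda c: sum(x * y for x, y in zip(c.vector, query_vec)), reverse=True)
    return [c.chunk_id for c in candidates[:top_k]]


corpus = [
    IndexedChunk("pricing_2022", [1.0, 0.0], date(2022, 3, 1), "pricing"),
    IndexedChunk("pricing_2024", [0.8, 0.6], date(2024, 2, 1), "pricing"),
    IndexedChunk("careers_2024", [0.0, 1.0], date(2024, 5, 1), "careers"),
]
print(filtered_search(corpus, [1.0, 0.0], not_before=date(2024, 1, 1), category="pricing"))
```

In this toy corpus the 2022 document is the closest semantic match to the query, yet it never reaches the ranking step because the date filter removes it first.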
7. Over-Reliance on Cosine Similarity
Explanation: Dense embeddings capture semantic meaning but struggle with exact keyword matching, numerical comparisons, or structured data extraction. Pure vector search misses precise factual queries.
Fix: Implement hybrid search combining BM25 keyword matching with dense vector retrieval. Apply a cross-encoder reranker (e.g., bge-reranker-large) on top-K results to reorder by contextual relevance.
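One simple way to combine the two signals is a weighted blend of min-max normalized scores, sketched below. It assumes BM25 and dense scores have already been computed per document, and it treats the weight as the BM25 share of the final score; that reading of a hybrid weight is an assumption, and `min_max` and `hybrid_rank` are hypothetical helpers. Reranking with a cross-encoder would follow as a separate step.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_rank(bm25: Dict[str, float], dense: Dict[str, float],
                bm25_weight: float = 0.3, top_k: int = 5) -> List[str]:
    """Blend normalized keyword and dense scores; a document missing from one list scores 0 there."""
    kw, vec = min_max(bm25), min_max(dense)
    fused = {
        doc_id: bm25_weight * kw.get(doc_id, 0.0) + (1.0 - bm25_weight) * vec.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    # Sort by fused score, breaking ties by ID so the order is deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))[:top_k]


bm25_scores = {"a": 12.0, "b": 3.0, "c": 0.5}
dense_scores = {"a": 0.40, "b": 0.82, "d": 0.78}
print(hybrid_rank(bm25_scores, dense_scores))
```

Min-max scaling is sensitive to outliers in either score list; rank-based fusion is a common alternative when the score distributions differ widely.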
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| <10k documents, low latency needs | NumPy + all-MiniLM-L6-v2 | Zero infrastructure overhead, sub-50ms retrieval | Baseline |
| 10k–100k documents, multi-tenant | pgvector or ChromaDB | Persistent storage, metadata filtering, concurrent access | +15–20% infra |
| >100k documents, high accuracy | Pinecone/Milvus + cross-encoder reranker | ANN indexing scales to millions, reranking boosts precision | +30–40% infra |
| High-frequency identical queries | Prompt caching + Sonnet 4 | 5-min ephemeral window slashes repeated input costs | -70–90% token spend |
| Classification/Extraction tasks | Claude Haiku | Optimized for speed and cost, sufficient for structured tasks | -60% vs Sonnet |
| Non-urgent bulk processing | Batch API | Asynchronous processing with 50% discount, 24-hour SLA | -50% total cost |
Configuration Template
```yaml
rag_pipeline:
  ingestion:
    chunk_size: 400
    chunk_overlap: 50
    metadata_fields: ["source", "date", "category"]
  embedding:
    model: "all-MiniLM-L6-v2"
    normalize: true
    dimensionality: 384
  retrieval:
    top_k: 5
    similarity_threshold: 0.62
    reranker: "bge-reranker-large"
    hybrid_weight: 0.3 # BM25 vs dense vector balance
  generation:
    model: "claude-sonnet-4-6"
    max_tokens: 512
    cache_system_prompt: true
    grounding_instruction: "Answer using ONLY provided references. Cite [REF n]. State 'Information not found' if absent."
  evaluation:
    framework: "RAGAS"
    metrics: ["context_recall", "faithfulness", "answer_relevance"]
    threshold_faithfulness: 0.85
```
Quick Start Guide
- Install Dependencies: Run `pip install anthropic sentence-transformers numpy pyyaml` to provision the core stack.
- Initialize Pipeline: Load your configuration template, instantiate the `EmbeddingEngine`, and run `index()` against your document corpus.
- Query Execution: Pass user input to `SemanticRouter.fetch_context()`, look up the source text for the returned chunk IDs, and feed it to `ResponseSynthesizer.generate()`.
- Validate Output: Check similarity scores against your threshold. Log cache hit rates and generation latency for baseline metrics.
- Scale Iteratively: Swap NumPy storage for pgvector or ChromaDB when corpus exceeds 50k chunks. Introduce cross-encoder reranking if precision drops below 80%.