hical expansion for multi-hop reasoning, and semantic boundaries for compliance or research-heavy workloads. The data confirms that retrieval quality is not a function of model capability, but of structural preparation.
Core Solution
Building a production-grade retrieval pipeline requires decoupling ingestion, indexing, and query routing into distinct, testable stages. The following implementation demonstrates a modular chunking architecture that supports strategy switching, metadata preservation, and parent-child expansion.
Step 1: Ingestion and Strategy Selection
Raw documents must be parsed into clean text streams before segmentation. The chunking strategy should be selected based on document topology and query complexity, not hardcoded.
from typing import List, Protocol
import tiktoken
from dataclasses import dataclass
@dataclass
class ChunkMetadata:
source_file: str
section_header: str
chunk_index: int
token_count: int
class ChunkingStrategy(Protocol):
def segment(self, raw_text: str, metadata: ChunkMetadata) -> List[dict]: ...
class RecursiveSegmenter:
def __init__(self, max_tokens: int = 512, overlap_ratio: float = 0.15):
self.max_tokens = max_tokens
self.overlap = int(max_tokens * overlap_ratio)
self.encoding = tiktoken.get_encoding("cl100k_base")
def segment(self, raw_text: str, metadata: ChunkMetadata) -> List[dict]:
tokens = self.encoding.encode(raw_text)
chunks = []
start = 0
while start < len(tokens):
end = start + self.max_tokens
chunk_tokens = tokens[start:end]
chunk_text = self.encoding.decode(chunk_tokens)
chunks.append({
"content": chunk_text,
"metadata": {**metadata.__dict__, "chunk_index": len(chunks)}
})
start += (self.max_tokens - self.overlap)
return chunks
Step 2: Semantic Boundary Detection
For precision-sensitive workloads, semantic chunking replaces arbitrary token limits with embedding-driven topic transitions.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
class SemanticBoundaryEngine:
def __init__(self, threshold_percentile: int = 90):
self.splitter = SemanticChunker(
embeddings=OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=threshold_percentile
)
def segment(self, raw_text: str, metadata: ChunkMetadata) -> List[dict]:
raw_chunks = self.splitter.split_text(raw_text)
return [
{
"content": segment,
"metadata": {**metadata.__dict__, "chunk_index": i}
}
for i, segment in enumerate(raw_chunks)
]
Step 3: Hierarchical Index Construction
Parent-child indexing resolves the precision-context dilemma. Child chunks enable high-resolution vector search, while parent chunks provide complete contextual framing during generation.
from langchain.retrievers import ParentDocumentRetriever
from langchain_community.vectorstores import Chroma
from langchain.storage import InMemoryStore
class HierarchicalIndexBuilder:
def __init__(self, embedding_fn, parent_size: int = 2000, child_size: int = 400):
self.vector_store = Chroma(embedding_function=embedding_fn)
self.doc_store = InMemoryStore()
self.retriever = ParentDocumentRetriever(
vectorstore=self.vector_store,
docstore=self.doc_store,
child_splitter=RecursiveSegmenter(max_tokens=child_size),
parent_splitter=RecursiveSegmenter(max_tokens=parent_size)
)
def ingest(self, documents: List[str]):
self.retriever.add_documents(documents)
def query(self, user_input: str, top_k: int = 3) -> List[str]:
results = self.retriever.get_relevant_documents(user_input, k=top_k)
return [doc.page_content for doc in results]
Architecture Decisions and Rationale
Strategy Pattern Implementation: Chunking logic is abstracted behind a Protocol interface. This allows runtime strategy switching without modifying ingestion pipelines. Production systems should evaluate multiple strategies on a validation corpus before deployment.
Token-Based Overlap Calculation: Overlap is derived as a ratio of max_tokens rather than hardcoded character counts. This ensures consistent boundary behavior across different encoding schemes and prevents semantic fragmentation at chunk edges.
Parent-Child Decoupling: Vector search operates exclusively on child chunks to minimize noise and maximize precision. Retrieval automatically expands to parent documents, guaranteeing that the generation layer receives complete contextual framing without manual prompt stitching.
Metadata Preservation: Every chunk carries source attribution, section headers, and positional indices. This enables audit trails, citation generation, and query routing based on document taxonomy.
Pitfall Guide
1. The Arbitrary Split Trap
Explanation: Applying fixed-size token or character limits to prose, legal contracts, or technical documentation. This severs sentences mid-thought and destroys semantic continuity.
Fix: Default to recursive structural splitting for unstructured text. Reserve fixed-size chunking exclusively for logs, telemetry, or uniform CSV/JSON streams.
2. Overlap Blindness
Explanation: Setting overlap too low causes relevant information to fracture across boundaries. Setting it too high introduces redundant context, inflating token costs and confusing the attention mechanism.
Fix: Calculate overlap as 10β15% of the target chunk size. Validate by checking retrieval recall on boundary-heavy test queries.
3. Semantic Threshold Misconfiguration
Explanation: Using default percentile thresholds without dataset calibration. A 95th percentile threshold may create excessively large chunks on dense technical material, while a 70th percentile may over-fragment narrative text.
Fix: Run a grid search on a representative sample corpus. Measure retrieval precision against a ground-truth Q&A set to identify the optimal breakpoint.
Explanation: Discarding document structure, headers, and source identifiers during segmentation. This eliminates citation capabilities and prevents query routing based on content taxonomy.
Fix: Attach structured metadata to every chunk during ingestion. Preserve section hierarchies and file origins to enable filtered retrieval and audit compliance.
5. Retrieval Architecture Mismatch
Explanation: Using flat vector search for multi-hop reasoning or complex analytical queries. Flat retrieval returns isolated passages, forcing the model to hallucinate connections that were never indexed together.
Fix: Deploy hierarchical indexing for context-heavy queries. Implement Agentic RAG for multi-step reasoning where the system dynamically decides whether to search, summarize, or cross-reference. Use GraphRAG when relationships between entities (people, systems, regulations) drive query intent.
6. Ignoring Evaluation Loops
Explanation: Deploying chunking strategies without measuring retrieval recall, context precision, or generation faithfulness. Teams assume better embeddings compensate for poor segmentation.
Fix: Implement automated evaluation using frameworks like RAGAS or TruLens. Track metrics across chunking variants and enforce regression testing before pipeline updates.
7. Over-Engineering Ingestion
Explanation: Applying semantic chunking and graph traversal to simple FAQ or log-query workloads. This introduces unnecessary compute overhead and latency without measurable accuracy gains.
Fix: Match architecture complexity to query complexity. Use recursive splitting + flat retrieval for straightforward fact retrieval. Reserve hierarchical and graph-based approaches for analytical, compliance, or research workloads.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Simple FAQ or log queries | Recursive splitting + flat vector search | Low complexity, fast retrieval, sufficient precision for direct fact lookup | Low compute, minimal token overhead |
| Technical documentation or policy manuals | Hierarchical parent-child indexing | Child chunks enable precise retrieval; parent chunks preserve full context for accurate generation | Medium compute during ingestion, higher retrieval accuracy reduces retry costs |
| Research papers or compliance audits | Semantic boundary chunking + Agentic RAG | Topic-transition detection preserves semantic continuity; agentic routing handles multi-hop reasoning | High ingestion compute, justified by precision requirements and reduced hallucination rates |
| Entity-heavy knowledge bases | GraphRAG + hierarchical indexing | Knowledge graphs capture relationships between entities; hierarchical chunks provide contextual grounding | Highest infrastructure cost, optimal for complex analytical queries and relationship traversal |
Configuration Template
rag_pipeline:
ingestion:
strategy: recursive
max_tokens: 512
overlap_ratio: 0.15
preserve_headers: true
indexing:
type: hierarchical
parent_size: 2000
child_size: 400
vector_backend: chroma
docstore: in_memory
retrieval:
top_k: 3
reranker: cohere
expand_to_parent: true
evaluation:
metrics: [recall, context_precision, faithfulness]
validation_corpus: ./tests/ground_truth.json
Quick Start Guide
- Install dependencies:
pip install langchain langchain-community langchain-openai tiktoken chromadb
- Initialize the pipeline: Load your document corpus, configure the
RecursiveSegmenter or SemanticBoundaryEngine, and attach metadata handlers.
- Build the index: Run ingestion through the
HierarchicalIndexBuilder. Verify that child chunks populate the vector store and parent chunks populate the document store.
- Execute test queries: Run a validation set against the retriever. Measure recall and context precision. Adjust overlap ratios or semantic thresholds if boundary fragmentation occurs.
- Deploy with monitoring: Route production traffic through the pipeline. Track token usage, latency, and retrieval accuracy. Implement automated regression tests for chunking configuration changes.