Step 1: Token Accounting Layer
Token estimation must happen before any network call. Hardcoding a single characters-per-token divisor is fragile. Instead, implement a dual-estimation strategy that runs both a character-based and a word-based heuristic and keeps whichever estimate is larger.
interface TokenEstimator {
  estimateByChars(text: string): number;
  estimateByWords(text: string): number;
  getConservativeEstimate(text: string): number;
}

class StandardTokenEstimator implements TokenEstimator {
  private readonly CHAR_RATIO = 4.0;
  private readonly WORD_RATIO = 1.3;

  estimateByChars(text: string): number {
    return Math.ceil(text.length / this.CHAR_RATIO);
  }

  estimateByWords(text: string): number {
    const wordMatches = text.match(/\w+/g);
    const wordCount = wordMatches ? wordMatches.length : 0;
    return Math.ceil(wordCount * this.WORD_RATIO);
  }

  getConservativeEstimate(text: string): number {
    const charEst = this.estimateByChars(text);
    const wordEst = this.estimateByWords(text);
    return Math.max(charEst, wordEst);
  }
}
Architecture Rationale: Using the maximum of both heuristics prevents underestimation. Code-heavy inputs tokenize differently than prose. The conservative estimate acts as a safety buffer, ensuring you never accidentally breach the window during assembly.
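As a quick sanity check, a minimal usage sketch of the estimator above (the sample strings and printed values are illustrative):

const estimator = new StandardTokenEstimator();

// Prose: the character heuristic (~4 chars per token) tends to dominate.
const prose = 'Context windows are a hard budget, not a suggestion.';
console.log(estimator.getConservativeEstimate(prose)); // 13 with the ratios above

// Dense code: the word-based heuristic yields the higher number here, so it wins.
const snippet = 'const x=(a,b)=>a.map(v=>v*b).filter(Boolean);';
console.log(estimator.getConservativeEstimate(snippet)); // 15 with the ratios above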
Step 2: Semantic Partitioning
Splitting by character count destroys logical boundaries. Semantic partitioning respects structural markers (double newlines, headings, code fences) and applies controlled overlap to preserve cross-boundary context.
interface PartitionConfig {
  maxTokens: number;
  overlapTokens: number;
  structuralRegex: RegExp;
}

class SemanticPartitioner {
  private estimator: TokenEstimator;
  private config: PartitionConfig;

  constructor(estimator: TokenEstimator, config: PartitionConfig) {
    this.estimator = estimator;
    this.config = config;
  }

  partition(rawText: string): string[] {
    const segments = rawText.split(this.config.structuralRegex).filter(s => s.trim().length > 0);
    const chunks: string[] = [];
    let currentBuffer: string[] = [];
    let currentTokenCount = 0;

    for (const segment of segments) {
      const segmentTokens = this.estimator.getConservativeEstimate(segment);
      if (currentTokenCount + segmentTokens > this.config.maxTokens) {
        if (currentBuffer.length > 0) {
          chunks.push(currentBuffer.join('\n\n'));
        }
        const overlapBuffer: string[] = [];
        let overlapTokens = 0;
        for (let i = currentBuffer.length - 1; i >= 0; i--) {
          const prevTokens = this.estimator.getConservativeEstimate(currentBuffer[i]);
          if (overlapTokens + prevTokens <= this.config.overlapTokens) {
            overlapBuffer.unshift(currentBuffer[i]);
            overlapTokens += prevTokens;
          } else {
            break;
          }
        }
        currentBuffer = [...overlapBuffer, segment];
        currentTokenCount = overlapTokens + segmentTokens;
      } else {
        currentBuffer.push(segment);
        currentTokenCount += segmentTokens;
      }
    }

    if (currentBuffer.length > 0) {
      chunks.push(currentBuffer.join('\n\n'));
    }
    return chunks;
  }
}
Architecture Rationale: Overlap is not optional; it's a context bridge. When a paragraph spans a chunk boundary, the model loses causal relationships. A 10-15% token overlap preserves syntactic continuity without bloating the window. The structural regex allows you to adapt partitioning to markdown, JSON, or source code by simply swapping the delimiter pattern.
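A minimal wiring sketch for the partitioner (the regex and numbers are illustrative defaults, not requirements; documentText stands in for your raw corpus string):

const partitioner = new SemanticPartitioner(new StandardTokenEstimator(), {
  maxTokens: 4000,
  overlapTokens: 500,        // ~12% of maxTokens, inside the 10-15% guideline
  structuralRegex: /\n{2,}/  // paragraph breaks; swap in heading- or code-aware patterns as needed
});

const chunks = partitioner.partition(documentText); // documentText: placeholder for the raw input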
Step 3: Dynamic Context Assembly & Retrieval
Never embed the entire corpus into the prompt. Use a lightweight retrieval layer to surface only relevant partitions, then assemble the final payload with strict token budgeting.
interface RetrievalResult {
  chunk: string;
  score: number;
}

interface ContextAssemblerConfig {
  maxContextTokens: number;
  reservedForOutput: number;
  systemPromptTokens: number;
}

class ContextAssembler {
  private estimator: TokenEstimator;
  private config: ContextAssemblerConfig;

  constructor(estimator: TokenEstimator, config: ContextAssemblerConfig) {
    this.estimator = estimator;
    this.config = config;
  }

  assemble(systemPrompt: string, query: string, candidates: RetrievalResult[]): string[] {
    const availableTokens = this.config.maxContextTokens
      - this.config.systemPromptTokens
      - this.config.reservedForOutput
      - this.estimator.getConservativeEstimate(query);
    const assembled: string[] = [];
    let consumed = 0;

    for (const candidate of candidates) {
      const chunkTokens = this.estimator.getConservativeEstimate(candidate.chunk);
      if (consumed + chunkTokens <= availableTokens) {
        assembled.push(candidate.chunk);
        consumed += chunkTokens;
      } else {
        break;
      }
    }
    return assembled;
  }
}
Architecture Rationale: Context assembly must be deterministic. By reserving tokens for the system prompt, user query, and expected model output, you prevent mid-generation truncation. The assembly loop respects a hard budget, guaranteeing that the final payload never exceeds the model's threshold. This approach also keeps per-query cost flat as the corpus grows: adding more documents increases retrieval latency, not prompt size.
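A hedged sketch of the query path, assuming the classes above; retrieveTopK, queryEmbedding, systemPrompt, and userQuery are placeholders for your own retrieval layer and inputs:

const assembler = new ContextAssembler(new StandardTokenEstimator(), {
  maxContextTokens: 200000,
  reservedForOutput: 4000,
  systemPromptTokens: 300
});

// Pass candidates sorted by descending score so the budget is spent on the best chunks first.
const candidates: RetrievalResult[] = await retrieveTopK(queryEmbedding, 5); // placeholder retrieval call
const contextChunks = assembler.assemble(systemPrompt, userQuery, candidates);
const payload = [systemPrompt, ...contextChunks, userQuery].join('\n\n');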
Pitfall Guide
1. Fixed-Byte Chunking
Explanation: Splitting documents at exact character or byte boundaries fractures sentences, breaks code syntax, and severs logical flow. The model receives incomplete thoughts and compensates with hallucination.
Fix: Always partition on structural markers (newlines, headings, braces, semicolons). Use a regex that respects document syntax before applying token limits.
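For illustration, a few delimiter patterns that respect document syntax (starting points to tune against your own corpus, not canon):

const STRUCTURAL_PATTERNS: Record<string, RegExp> = {
  markdown: /\n{2,}|(?<=\n)(?=#{1,6}\s)/, // blank lines or heading starts
  source: /\n{2,}|(?<=\})\n/,             // blank lines or a newline right after a closing brace
  plainText: /\n{2,}/                     // paragraph breaks only
};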
2. Zero Overlap Strategy
Explanation: Removing overlap to save tokens creates context cliffs. Information that spans two chunks becomes unrecoverable, degrading retrieval accuracy by 15-20%.
Fix: Implement a sliding overlap window targeting 10-15% of the max chunk size. This preserves boundary continuity at minimal token cost.
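A small helper makes the relationship explicit (the 12% default is one reasonable point inside that band):

// Derive the overlap from the chunk budget instead of hardcoding it.
function overlapFor(maxTokens: number, ratio = 0.12): number {
  return Math.round(maxTokens * ratio); // e.g. maxTokens = 4000 -> 480-token overlap
}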
3. Summary Drift in Multi-Turn Conversations
Explanation: Repeatedly summarizing older messages compounds information loss. Each compression cycle discards nuance, eventually producing a generic summary that fails to ground new responses.
Fix: Anchor summaries to extracted key facts or use a sliding window that retains the last N turns verbatim while compressing only the tail. Validate summaries against original facts before injection.
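One possible shape for that policy, assuming a minimal { role, content } turn type and a summarize function supplied by the caller:

interface Turn { role: 'user' | 'assistant'; content: string; }

// Keep the last `keepVerbatim` turns untouched; compress only the older tail.
function compactHistory(
  turns: Turn[],
  keepVerbatim: number,
  summarize: (tail: Turn[]) => string // placeholder for your summarization call
): Turn[] {
  if (turns.length <= keepVerbatim) return turns;
  const tail = turns.slice(0, turns.length - keepVerbatim);
  const recent = turns.slice(turns.length - keepVerbatim);
  const summary: Turn = {
    role: 'assistant',
    content: `Summary of earlier turns: ${summarize(tail)}`
  };
  return [summary, ...recent];
}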
4. Prompt Bloat & Conversational Filler
Explanation: System prompts padded with polite phrasing, redundant role definitions, or excessive formatting instructions consume 10-20% of the window without improving output.
Fix: Adopt directive-style prompts. Strip conversational filler. Use structured formats (role, goal, constraints, output schema) that parsers and models both process efficiently.
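For instance, a directive-style skeleton (field names and wording are illustrative):

// Structure over pleasantries: role, goal, constraints, output schema.
const systemPrompt = [
  'ROLE: Senior support engineer for the billing API.',
  'GOAL: Resolve the user issue in a single response.',
  'CONSTRAINTS: Cite only the provided context chunks. Max 200 words.',
  'OUTPUT: JSON with fields { "answer": string, "sources": string[] }.'
].join('\n');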
5. Ignoring Streaming Overhead
Explanation: Streaming responses hide incremental token consumption. Teams often track only the final payload, missing the cumulative cost of partial deltas and tool calls.
Fix: Implement real-time token counters that accumulate during stream consumption. Set early-termination thresholds to halt generation when cost or length limits are approached.
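A rough sketch of stream-time accounting using the estimator from Step 1; the AsyncIterable<string> shape is an assumption, not any particular SDK's API:

// Accumulate estimated output tokens as deltas arrive; stop once the budget is hit.
async function consumeWithBudget(
  stream: AsyncIterable<string>,
  estimator: TokenEstimator,
  maxOutputTokens: number
): Promise<string> {
  let text = '';
  let tokens = 0;
  for await (const delta of stream) {
    text += delta;
    tokens += estimator.getConservativeEstimate(delta);
    if (tokens >= maxOutputTokens) break; // early-termination threshold
  }
  return text;
}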
6. Context Window Math Errors
Explanation: Developers calculate limits based only on user input, forgetting that system prompts, tool definitions, function schemas, and output buffers all consume the same window.
Fix: Reserve 20-25% of the total window for non-user content. Always subtract system prompt tokens, tool schemas, and expected output length before allocating space for retrieved context.
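The budget arithmetic, made explicit with illustrative numbers:

// Everything sharing the window is subtracted before retrieved context gets its share.
const windowTokens = 128000;
const systemTokens = 800;
const toolSchemaTokens = 3200;
const outputReserve = 16000;
const safetyBuffer = Math.ceil(windowTokens * 0.05); // 6400

const retrievalBudget =
  windowTokens - systemTokens - toolSchemaTokens - outputReserve - safetyBuffer;
// 26400 tokens (~21% of the window) held back; 101600 remain for retrieved context.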
7. Embedding Dimension Mismatch
Explanation: Retrieval fails silently when chunk embeddings and query embeddings use different vector dimensions or normalization strategies. Cosine similarity returns noise.
Fix: Standardize on a single embedding model and dimension (e.g., 1536 or 3072). Validate vector shapes at ingestion and query time. Apply L2 normalization consistently across the pipeline.
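A small validation-and-normalization pass, sketched with an example dimension of 1536:

const EXPECTED_DIM = 1536; // must match the embedding model actually in use

// Reject mismatched vectors at ingestion and query time, then L2-normalize.
function normalizeOrThrow(vector: number[]): number[] {
  if (vector.length !== EXPECTED_DIM) {
    throw new Error(`Embedding has ${vector.length} dimensions, expected ${EXPECTED_DIM}`);
  }
  const norm = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0));
  return norm === 0 ? vector : vector.map(v => v / norm);
}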
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single large document (50K+ tokens) | Semantic Partitioning + RAG | Isolates relevant sections, avoids full-window inflation | Reduces input cost by ~60% |
| Multi-turn customer support chat | Context Compression (Summarization) | Maintains conversational flow while shedding older noise | Moderate cost, improves continuity |
| Real-time code completion | Fixed-Size Chunking (with syntax awareness) | Low latency required; structural markers align with code blocks | Predictable, low overhead |
| High-accuracy compliance review | Semantic Partitioning + Strict Budget Assembly | Guarantees no context loss, enforces token caps | Higher per-query cost, but audit-safe |
| Batch document ingestion | Parallel Partitioning + Vector Store | Scales horizontally, decouples ingestion from inference | Upfront compute, linear query cost |
Configuration Template
export const PipelineConfig = {
  models: {
    primary: {
      name: 'claude-3-5-sonnet-20241022',
      maxContextTokens: 200000,
      inputCostPer1K: 0.003,
      outputCostPer1K: 0.015
    },
    fallback: {
      name: 'gpt-4-turbo',
      maxContextTokens: 128000,
      inputCostPer1K: 0.01,
      outputCostPer1K: 0.03
    }
  },
  partitioning: {
    maxTokens: 4000,
    overlapTokens: 600,
    // Zero-width heading boundary, so split() neither captures nor consumes the "#" markers.
    structuralRegex: /\n{2,}|(?<=\n)(?=#{1,6}\s)/
  },
  assembly: {
    reservedForOutput: 4000,
    systemPromptTokens: 300,
    maxRetrievalCandidates: 5
  },
  safety: {
    enableEarlyTermination: true,
    costThresholdPerRequest: 0.50,
    tokenBufferMultiplier: 1.15
  }
};
Quick Start Guide
- Initialize the estimator and partitioner: Instantiate StandardTokenEstimator and SemanticPartitioner using the structural regex and overlap settings from your configuration.
- Ingest and chunk: Pass raw documents through the partitioner. Store resulting chunks in a vector database or local index with consistent embedding dimensions.
- Assemble context: On query, retrieve top-k candidates, run them through ContextAssembler to enforce token budgets, and construct the final message payload.
- Stream with accounting: Execute the request with streaming enabled. Accumulate tokens incrementally, compare against costThresholdPerRequest, and terminate early if limits are breached.
- Validate and iterate: Log actual vs. estimated token counts per request. Adjust tokenBufferMultiplier and overlap thresholds based on observed drift before scaling to production traffic. A condensed end-to-end wiring sketch follows below.
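Putting the steps together, a condensed end-to-end sketch; PipelineConfig is the template above, while indexChunks, retrieve, sendRequest, rawDocument, systemPrompt, and userQuery are placeholders for your own storage, retrieval, and API layers:

const estimator = new StandardTokenEstimator();
const partitioner = new SemanticPartitioner(estimator, PipelineConfig.partitioning);
const assembler = new ContextAssembler(estimator, {
  maxContextTokens: PipelineConfig.models.primary.maxContextTokens,
  reservedForOutput: PipelineConfig.assembly.reservedForOutput,
  systemPromptTokens: PipelineConfig.assembly.systemPromptTokens
});

// Ingestion: chunk the corpus and index it (indexChunks is a placeholder for your vector store).
const chunks = partitioner.partition(rawDocument);
await indexChunks(chunks);

// Query: retrieve candidates, enforce the token budget, then send the request.
const candidates = await retrieve(userQuery, PipelineConfig.assembly.maxRetrievalCandidates);
const context = assembler.assemble(systemPrompt, userQuery, candidates);
const response = await sendRequest(systemPrompt, context, userQuery);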