PromptModule {
  id: string;
  template: string;
  estimatedTokens: number;
}

class PromptRegistry {
  private modules: Map<string, PromptModule> = new Map();

  register(module: PromptModule): void {
    this.modules.set(module.id, module);
  }

  compose(taskIds: string[]): { content: string; tokenCount: number } {
    let content = '';
    let totalTokens = 0;
    for (const id of taskIds) {
      const mod = this.modules.get(id);
      if (mod) {
        content += mod.template + '\n';
        totalTokens += mod.estimatedTokens;
      }
    }
    return { content, tokenCount: totalTokens };
  }
}

// Usage: Register modules once at startup
const registry = new PromptRegistry();

registry.register({
  id: 'finance-tone',
  template: 'Respond with professional financial terminology. Be concise.',
  estimatedTokens: 25
});

registry.register({
  id: 'invoice-rules',
  template: 'Validate invoice numbers against format INV-YYYY-NNNN.',
  estimatedTokens: 30
});

// Compose only what is needed for the current request.
// compose() returns both the prompt text and its estimated token count.
const { content: systemPrompt, tokenCount } = registry.compose(['finance-tone', 'invoice-rules']);
```

**Rationale:** Composing the prompt per request removes instruction tokens that the current task does not need. A simple status query, for example, never loads the invoice validation rules, so those tokens are neither sent nor billed.

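One way to decide which module IDs to pass to `compose` is a static lookup from request intent to modules. The sketch below is illustrative only: the intent labels, the `MODULES_BY_INTENT` table, and the `modulesForIntent` and `estimatePromptTokens` helpers are assumptions, not part of the registry above. The token estimates reuse the values from the registration example.

```typescript
// Hypothetical intent labels; a real system would derive these from its own router.
type PromptIntent = 'status-check' | 'invoice-validation';

// Assumed mapping from intent to the prompt module IDs it needs.
const MODULES_BY_INTENT: Record<PromptIntent, string[]> = {
  'status-check': ['finance-tone'],
  'invoice-validation': ['finance-tone', 'invoice-rules'],
};

// Returns the module IDs to compose for a given intent.
function modulesForIntent(intent: PromptIntent): string[] {
  return MODULES_BY_INTENT[intent];
}

// Sums the token estimates for the selected modules; unknown IDs count as 0.
function estimatePromptTokens(ids: string[], estimates: Record<string, number>): number {
  return ids.reduce((sum, id) => sum + (estimates[id] ?? 0), 0);
}

const moduleEstimates: Record<string, number> = { 'finance-tone': 25, 'invoice-rules': 30 };

// A status check loads one module; invoice validation loads both.
const statusPromptTokens = estimatePromptTokens(modulesForIntent('status-check'), moduleEstimates);
const invoicePromptTokens = estimatePromptTokens(modulesForIntent('invoice-validation'), moduleEstimates);
```

With these estimates, the status check carries 25 instruction tokens instead of 55. The absolute numbers are small here, but the same ratio applies to larger system prompts.
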
#### 2. Adaptive RAG with Reranking
Retrieval-Augmented Generation (RAG) systems often suffer from context bloat by retrieving a fixed `top_k` chunks regardless of relevance. An optimized pipeline retrieves a larger candidate set, applies a reranking model, and passes only the highest-scoring chunks to the LLM.
```typescript
interface ContextChunk {
  id: string;
  content: string;
  score: number;
}

// `vectorStore` and `reranker` are assumed to be provided by the application.
class AdaptiveRAG {
  async retrieve(query: string, metadataFilters: Record<string, unknown>): Promise<ContextChunk[]> {
    // Step 1: Retrieve a broader candidate set with metadata filtering
    const candidates = await vectorStore.search({
      query,
      topK: 20,
      filters: metadataFilters // e.g., { department: 'finance', docType: 'contract' }
    });

    // Step 2: Rerank candidates against the query
    const ranked = await reranker.score(query, candidates);

    // Step 3: Return only the top 3 chunks (assumes `ranked` is sorted by descending score)
    return ranked.slice(0, 3).map(c => ({
      id: c.id,
      content: c.content,
      score: c.score
    }));
  }
}
```

**Rationale:** Metadata filtering narrows the search space before retrieval, and reranking ensures that only semantically relevant context consumes LLM tokens. Passing 3 high-quality chunks instead of 10 noisy ones reduces input tokens and lowers hallucination risk.

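Taking a fixed top 3 still passes weak chunks when few candidates are relevant. A possible refinement, sketched below, combines the cap with a minimum relevance score. The `ScoredChunk` type, the `selectChunks` helper, the sample data, and the assumption that reranker scores fall in [0, 1] are all illustrative.

```typescript
interface ScoredChunk {
  id: string;
  content: string;
  score: number; // reranker relevance, assumed to lie in [0, 1]
}

// Keeps at most `maxChunks` chunks whose score meets `minScore`, best first.
function selectChunks(chunks: ScoredChunk[], maxChunks: number, minScore: number): ScoredChunk[] {
  return chunks
    .filter(c => c.score >= minScore)       // drop low-relevance chunks
    .sort((a, b) => b.score - a.score)      // highest score first
    .slice(0, maxChunks);                   // enforce the cap
}

// Made-up reranker output for illustration.
const rerankedSample: ScoredChunk[] = [
  { id: 'a', content: 'Payment terms: net 30.', score: 0.91 },
  { id: 'b', content: 'Office relocation notice.', score: 0.42 },
  { id: 'c', content: 'Late fee schedule.', score: 0.78 },
];

// With a 0.75 cutoff, only chunks 'a' and 'c' reach the model.
const selectedChunks = selectChunks(rerankedSample, 3, 0.75);
```

The cutoff means a query with no strong matches sends fewer than three chunks, or none, rather than filling the context with noise.
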
#### 3. Intent-Based Routing
Agentic workflows introduce multiplicative token costs. A single query should not trigger a multi-agent chain if a direct tool call suffices. An intent router classifies the request complexity and selects the appropriate execution path.
```typescript
type AgentType = 'DIRECT_TOOL' | 'SINGLE_AGENT' | 'MULTI_AGENT';

class IntentRouter {
  // `llm.evaluateComplexity` is assumed to return a score between 0 and 1.
  async classify(query: string): Promise<AgentType> {
    const complexity = await llm.evaluateComplexity(query);
    if (complexity.score < 0.3) return 'DIRECT_TOOL';
    if (complexity.score < 0.7) return 'SINGLE_AGENT';
    return 'MULTI_AGENT';
  }
}

// Execution logic
const router = new IntentRouter();
const route = await router.classify(userQuery);
let result;
switch (route) {
  case 'DIRECT_TOOL':
    result = await toolExecutor.run(userQuery);
    break;
  case 'MULTI_AGENT':
    result = await multiAgentOrchestrator.run(userQuery);
    break;
  default: // 'SINGLE_AGENT'
    result = await singleAgent.run(userQuery);
}
```

**Rationale:** This prevents "agentic over-engineering." Simple queries bypass expensive orchestration layers, saving tokens and latency. Complex queries are routed to the appropriate depth of reasoning.

#### 4. Output Constraints and Schema Enforcement
Output token waste is often ignored. LLMs tend to over-generate verbose explanations when concise answers suffice. Enforcing strict output schemas and token limits controls generation costs.
```typescript
interface InvoiceResponse {
  status: 'APPROVED' | 'PENDING' | 'REJECTED';
  amount: number;
  reason?: string;
}

// `InvoiceResponseSchema`, `userQuery`, and `contextChunks` are defined elsewhere.
const response = await llm.generate<InvoiceResponse>({
  prompt: userQuery,
  context: contextChunks,
  schema: InvoiceResponseSchema,
  maxTokens: 150,
  temperature: 0.2
});
```

**Rationale:** Structured outputs (JSON schemas) reduce variability and verbosity. Setting `maxTokens` prevents runaway generation. Lower temperature improves consistency, reducing the need for regeneration loops.

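Schema enforcement at the API level is usually paired with a runtime check, since a response truncated by `maxTokens` may not be valid JSON. The following is a dependency-free sketch using a hand-written type guard; production code would more likely use a schema library. The names `InvoiceResponseShape`, `isInvoiceResponse`, and `parseInvoiceResponse` are hypothetical and do not correspond to the `InvoiceResponseSchema` referenced above.

```typescript
// Mirrors the InvoiceResponse interface; renamed so the sketch stands alone.
interface InvoiceResponseShape {
  status: 'APPROVED' | 'PENDING' | 'REJECTED';
  amount: number;
  reason?: string;
}

const INVOICE_STATUSES: string[] = ['APPROVED', 'PENDING', 'REJECTED'];

// Type guard: true only if `value` has the expected fields and types.
function isInvoiceResponse(value: unknown): value is InvoiceResponseShape {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  const status = record['status'];
  const amount = record['amount'];
  const reason = record['reason'];
  return (
    typeof status === 'string' &&
    INVOICE_STATUSES.indexOf(status) !== -1 &&
    typeof amount === 'number' &&
    isFinite(amount) &&
    (reason === undefined || typeof reason === 'string')
  );
}

// Parses raw model output; returns null instead of throwing on bad JSON or shape.
function parseInvoiceResponse(raw: string): InvoiceResponseShape | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isInvoiceResponse(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```

Returning `null` lets the caller decide whether to retry, fall back, or surface an error, which keeps regeneration loops explicit and measurable.
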
### Pitfall Guide

Production systems fail when token constraints are treated as afterthoughts. The following pitfalls are common in real-world deployments.
| Pitfall | Explanation | Fix |
|---|---|---|
| The "Append-All" History Trap | Systems that append the entire chat history to every request cause linear token growth. After 20 turns, context becomes bloated with irrelevant early messages. | Implement summarization windows. Replace raw history with a compressed summary of past turns, or use a sliding window with relevance scoring. |
| Context Window Dilution | Teams assume larger context windows yield better results, pushing 100k+ tokens. This dilutes the model's attention, increasing latency and hallucination rates. | Enforce relevance thresholds. Only inject context that scores above a semantic similarity cutoff. Bigger is not better; precision is. |
| Output Token Blindness | Engineers monitor input tokens but ignore generation costs. Verbose LLM responses can double the token count per request without adding value. | Apply output constraints. Use response schemas, word limits, and tone instructions to enforce conciseness. Monitor output token drift. |
| Static System Prompts | Teams send a 2,000-token system prompt with every request, even when only 200 tokens of instructions are relevant to the current task. | Adopt dynamic prompt composition. Use a registry to inject only task-specific instructions. Cache static instructions where the API supports it. |
| Agentic Over-Engineering | Multi-agent chains are triggered for simple queries like "What is my balance?", multiplying token usage by 5-10x unnecessarily. | Implement intent classification. Route simple queries to direct tool calls. Reserve agents for complex, multi-step reasoning. |
| RAG "Top-K" Dogma | Pipelines blindly retrieve `top_k=10` chunks regardless of query complexity or document density, flooding the context with noise. | Use adaptive retrieval. Adjust `top_k` based on query ambiguity. Always apply reranking to filter low-relevance chunks before LLM injection. |
| Missing Telemetry | No visibility into token consumption per workflow, agent, or endpoint. Cost spikes are detected only after billing cycles. | Deploy token middleware. Log input/output tokens, cost per request, and latency for every call. Set up alerts for abnormal token drift. |
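
As an illustration of the fix for the "Append-All" history trap, the sketch below keeps only the most recent turns that fit a token budget. It is a plain sliding window without summarization or relevance scoring. The `ChatTurn` type, the `trimHistory` helper, and the rule of thumb of roughly 4 characters per token are assumptions; a real system should count tokens with the model's own tokenizer.

```typescript
interface ChatTurn {
  role: 'user' | 'assistant';
  content: string;
}

// Rough token estimate (~4 characters per token); an assumption, not a real tokenizer.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps the most recent turns that fit within `maxTokens`, preserving chronological order.
function trimHistory(turns: ChatTurn[], maxTokens: number): ChatTurn[] {
  const kept: ChatTurn[] = [];
  let used = 0;
  const newestFirst = turns.slice().reverse(); // copy so the input is not mutated
  for (const turn of newestFirst) {
    const cost = approxTokens(turn.content);
    if (used + cost > maxTokens) break; // stop at the first turn that does not fit
    kept.unshift(turn);                 // re-insert at the front to restore order
    used += cost;
  }
  return kept;
}

const sampleHistory: ChatTurn[] = [
  { role: 'user', content: 'Hello' },
  { role: 'assistant', content: 'Hi, how can I help?' },
  { role: 'user', content: 'Status of INV-2024-0001?' },
];

// With a budget of 11 estimated tokens, the oldest turn is dropped.
const trimmedHistory = trimHistory(sampleHistory, 11);
```

Because the loop stops at the first turn that does not fit, the window is always a contiguous suffix of the conversation, which avoids sending replies without the questions that prompted them.
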
### Production Bundle

#### Action Checklist

#### Decision Matrix

Use this matrix to select the appropriate architecture based on query characteristics and cost constraints.
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Volume, Simple Queries | Direct Tool Call | Bypasses LLM reasoning entirely. Lowest latency and cost. | Minimal |
| Complex Reasoning Required | Multi-Agent Workflow | Necessary for multi-step logic, but incurs high token overhead. | High |
| Large Document Corpus | RAG with Reranking | Ensures only relevant context is processed. Reduces input tokens significantly. | Medium |
| Conversational Interface | Summarized Memory | Prevents history explosion while maintaining context continuity. | Low-Medium |
| Strict Compliance/Format | Schema-Enforced Output | Guarantees structured responses and limits generation tokens. | Low |

#### Configuration Template

A ready-to-use configuration structure for token governance.
```typescript
// token-budget.config.ts
export const TokenBudgetConfig = {
  // Global limits
  maxInputTokens: 4000,
  maxOutputTokens: 500,
  costPer1kTokens: 0.002, // Example pricing

  // RAG settings
  rag: {
    retrievalTopK: 20,
    rerankTopK: 3,
    minRelevanceScore: 0.75,
    metadataFilters: ['department', 'docType', 'dateRange']
  },

  // Prompt settings
  prompts: {
    modularization: true,
    cacheStaticInstructions: true,
    maxSystemPromptTokens: 300
  },

  // Routing thresholds
  routing: {
    simpleThreshold: 0.3,
    complexThreshold: 0.7,
    allowedAgents: ['research', 'validation', 'summarizer']
  },

  // Output constraints
  output: {
    enforceSchema: true,
    maxTokens: 150,
    temperature: 0.2
  },

  // Observability
  telemetry: {
    enabled: true,
    logTokens: true,
    logCost: true,
    alertThreshold: 1.5 // Alert if cost exceeds 1.5x baseline
  }
};
```

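
A configuration like this only helps if something enforces it. The sketch below shows one way the global limits might be applied before a request is sent. The `checkBudget` helper and the local `budgetLimits` copy are hypothetical. Applying the single `costPer1kTokens` value to both input and output tokens follows the template and is a simplification, since many providers price the two differently.

```typescript
// Local copy of the global limits so the sketch is self-contained.
const budgetLimits = { maxInputTokens: 4000, maxOutputTokens: 500, costPer1kTokens: 0.002 };

interface BudgetCheck {
  withinBudget: boolean;
  estimatedCost: number; // same currency unit as costPer1kTokens
}

// Checks a request's token counts against the limits and estimates its cost.
function checkBudget(inputTokens: number, outputTokens: number): BudgetCheck {
  const withinBudget =
    inputTokens <= budgetLimits.maxInputTokens &&
    outputTokens <= budgetLimits.maxOutputTokens;
  const estimatedCost =
    ((inputTokens + outputTokens) / 1000) * budgetLimits.costPer1kTokens;
  return { withinBudget, estimatedCost };
}

// 3,000 input + 150 output tokens fits both limits.
const smallRequest = checkBudget(3000, 150);
// 5,200 input tokens exceeds maxInputTokens, so the request should be trimmed or rejected.
const largeRequest = checkBudget(5200, 150);
```
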

#### Quick Start Guide

- **Instrument Telemetry:** Add a middleware wrapper around your LLM client that logs token counts and calculates cost per request. Deploy this immediately to establish a baseline.
- **Apply Output Constraints:** Identify your top 5 most frequent endpoints. Replace free-text responses with JSON schemas and set `maxTokens` limits. Measure the reduction in output tokens.
- **Optimize Retrieval:** If using RAG, add metadata filtering to your search queries. Introduce a reranking step to reduce the number of chunks passed to the model from 10 to 3.
- **Audit Prompts:** Review your system prompts. Remove redundant instructions and modularize task-specific rules. Implement dynamic composition to avoid sending unnecessary tokens.
- **Monitor and Iterate:** Set up dashboards for token usage, cost per workflow, and latency. Review metrics weekly to identify drift and optimize further.
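
The telemetry step can start as simple bookkeeping. The sketch below covers only the recording and alerting logic that a middleware wrapper would call after each LLM response; the client wrapper itself is omitted. The `TokenTelemetry` class, the `UsageRecord` shape, and the sample numbers are hypothetical. The price and the 1.5x alert multiplier reuse the example values from the configuration template.

```typescript
interface UsageRecord {
  workflow: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

class TokenTelemetry {
  private records: UsageRecord[] = [];
  private costPer1kTokens: number;
  private alertMultiplier: number;

  constructor(costPer1kTokens: number, alertMultiplier: number) {
    this.costPer1kTokens = costPer1kTokens;
    this.alertMultiplier = alertMultiplier;
  }

  // Stores one request's usage; a real system would also ship it to a log sink.
  record(entry: UsageRecord): void {
    this.records.push(entry);
  }

  // Cost of a single request, using one blended price for input and output.
  costOf(entry: UsageRecord): number {
    return ((entry.inputTokens + entry.outputTokens) / 1000) * this.costPer1kTokens;
  }

  // Average cost per request across everything recorded so far (0 if empty).
  averageCost(): number {
    if (this.records.length === 0) return 0;
    const total = this.records.reduce((sum, r) => sum + this.costOf(r), 0);
    return total / this.records.length;
  }

  // True when a request costs more than alertMultiplier times the baseline.
  shouldAlert(entry: UsageRecord, baselineCost: number): boolean {
    return this.costOf(entry) > baselineCost * this.alertMultiplier;
  }
}

const telemetry = new TokenTelemetry(0.002, 1.5);
telemetry.record({ workflow: 'invoice-status', inputTokens: 800, outputTokens: 200, latencyMs: 420 });
const baselineCost = telemetry.averageCost();

// A request with 3,500 total tokens costs well over 1.5x the baseline.
const spike: UsageRecord = { workflow: 'invoice-status', inputTokens: 3000, outputTokens: 500, latencyMs: 1900 };
const spikeAlert = telemetry.shouldAlert(spike, baselineCost);
```

Keying records by `workflow` makes it possible to compute baselines per endpoint rather than globally, which is what the weekly review in the last step relies on.
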