ype-safe wrapper around the provider's tokenizer to standardize counting across the codebase.
import { getEncoding, type Tiktoken } from "tiktoken";
export class TokenBudgetGuard {
private encoder: Tiktoken;
private readonly INPUT_PRICE_PER_K = 0.005;
private readonly OUTPUT_PRICE_PER_K = 0.015;
constructor(model: string = "gpt-4o") {
this.encoder = getEncoding("cl100k_base");
}
public countTokens(text: string): number {
return this.encoder.encode(text).length;
}
public preFlightCheck(
systemPrompt: string,
userPayload: string,
maxInputTokens: number,
expectedOutputTokens: number
): { isValid: boolean; estimatedCost: number; tokenBreakdown: Record<string, number> } {
const inputTokens = this.countTokens(systemPrompt) + this.countTokens(userPayload);
const inputCost = (inputTokens / 1000) * this.INPUT_PRICE_PER_K;
const outputCost = (expectedOutputTokens / 1000) * this.OUTPUT_PRICE_PER_K;
const totalCost = inputCost + outputCost;
if (inputTokens > maxInputTokens) {
return {
isValid: false,
estimatedCost: totalCost,
tokenBreakdown: { input: inputTokens, output: expectedOutputTokens }
};
}
return { isValid: true, estimatedCost: totalCost, tokenBreakdown: { input: inputTokens, output: expectedOutputTokens } };
}
}
Architectural Rationale: Local counting runs in microseconds and requires zero network I/O. By enforcing maxInputTokens before the API call, you prevent accidental database dumps or log files from entering the context window. The cost calculation uses provider baseline rates; these should be externalized to environment configuration to accommodate pricing updates without code changes.
Step 2: Context Compression & Middle-Truncation
Prompts accumulate redundancy through repeated system instructions, excessive whitespace, and legacy HTML/markup artifacts. Compression reduces token count without degrading semantic meaning. When truncation is unavoidable, middle-truncation preserves both introductory context and concluding directives, which is critical for legal, technical, or debugging workflows.
export class ContextCompressor {
public sanitize(raw: string): string {
return raw
.replace(/<[^>]+>/g, "")
.replace(/\n{3,}/g, "\n\n")
.replace(/ {2,}/g, " ")
.trim();
}
public applyMiddleTruncation(
text: string,
encoder: Tiktoken,
budget: number
): string {
const tokens = encoder.encode(text);
if (tokens.length <= budget) return text;
const half = Math.floor(budget / 2);
const truncated = tokens.slice(0, half).concat(tokens.slice(-half));
return encoder.decode(truncated);
}
}
Architectural Rationale: Regex-based sanitization handles 80% of common bloat patterns. Middle-truncation is deliberately chosen over tail-truncation because many production prompts require the opening premise and the closing instruction to remain intact. The safety margin (typically 50β100 tokens) accounts for system prompt overhead and prevents boundary errors.
Step 3: Complexity-Based Model Routing
Flagship models are overqualified for classification, labeling, and simple extraction tasks. Routing logic should evaluate prompt complexity and token volume to select the most cost-efficient model that meets accuracy thresholds.
export class ModelRouter {
private readonly SIMPLE_KEYWORDS = ["classify", "label", "category", "yes or no", "true or false", "extract field"];
public selectModel(prompt: string, tokenCount: number): string {
const isSimple = this.SIMPLE_KEYWORDS.some(k => prompt.toLowerCase().includes(k));
if (tokenCount < 600 && isSimple) return "gpt-4o-mini";
if (tokenCount < 2000) return "gpt-4o-mini";
return "gpt-4o";
}
}
Architectural Rationale: Smaller models like gpt-4o-mini typically cost 10β15Γ less per token while maintaining >94% accuracy on structured tasks. Routing 70% of traffic to the smaller model and 30% to the flagship model routinely halves monthly spend. The keyword heuristic is lightweight; in production, replace it with a lightweight classifier or embedding similarity check for higher precision.
Step 4: Deterministic Response Caching
Identical prompts must never trigger duplicate API calls. Caching is most effective for FAQ systems, template generation, and batch classification. In-memory caches are insufficient for distributed deployments; Redis with TTL-based invalidation is the production standard.
import { createHash } from "crypto";
export class ResponseCache {
private cache: Map<string, string> = new Map();
public generateKey(system: string, user: string, model: string): string {
const payload = JSON.stringify({ system, user, model }, Object.keys({ system, user, model }).sort());
return createHash("sha256").update(payload).digest("hex");
}
public get(key: string): string | undefined {
return this.cache.get(key);
}
public set(key: string, value: string, ttlSeconds: number = 3600): void {
this.cache.set(key, value);
setTimeout(() => this.cache.delete(key), ttlSeconds * 1000);
}
}
Architectural Rationale: The cache key incorporates system prompt, user input, and model selection to prevent cross-contamination. TTL values should align with data volatility: 1 hour for dynamic content, 24 hours for reference data, or longer for static templates. In production, swap the Map for ioredis or @redis/client with cluster support.
Step 5: Telemetry & Feature-Level Cost Attribution
Raw token counts are useless without attribution. Every API response includes a usage object. Aggregate these metrics into a time-series database and tag them by business feature, not just by model.
export class CostTelemetry {
public recordUsage(
feature: string,
promptTokens: number,
completionTokens: number,
model: string
): void {
const inputCost = (promptTokens / 1000) * 0.005;
const outputCost = (completionTokens / 1000) * 0.015;
// Emit to Prometheus/InfluxDB/CloudWatch
console.log(`[TELEMETRY] feature=${feature} model=${model} input=${promptTokens} output=${completionTokens} cost=${(inputCost + outputCost).toFixed(6)}`);
}
}
Architectural Rationale: Feature-level tagging exposes hidden cost centers. A chatbot feature budgeted at $1/day that consistently hits $15/day signals either prompt bloat, missing cache layers, or inappropriate model routing. Time-series aggregation enables alerting on daily spend thresholds before invoices arrive.
Pitfall Guide
1. The "One Token Equals One Word" Assumption
Explanation: Developers frequently estimate costs by dividing character count by 4 or assuming 1 token = 1 word. This breaks down catastrophically for code, JSON, and non-Latin languages, where tokenization can be 2β3Γ denser.
Fix: Always use the provider's official tokenizer (tiktoken, @anthropic-ai/tokenizer, etc.) for accurate counting. Never rely on character-based approximations in production.
2. Ignoring the Output Token Multiplier
Explanation: Teams optimize prompt length but leave max_tokens unset or set to the model's maximum. Since output tokens cost 3β5Γ more, a single verbose response can exceed the cost of the entire prompt.
Fix: Set max_tokens to 20β30% above your expected output length. Enforce hard caps on every API call, especially for classification, extraction, and JSON generation tasks.
3. Blind Tail-Truncation of Context
Explanation: Cutting off the end of a long document to fit context windows destroys concluding instructions, legal clauses, or debugging traces. The model receives the premise but loses the directive.
Fix: Implement middle-truncation when context exceeds budget. Preserve the first and last segments to maintain both setup and instruction integrity.
4. Over-Caching Non-Deterministic Prompts
Explanation: Caching prompts that contain timestamps, user IDs, or session tokens creates cache misses or returns stale data. The cache key must reflect only the deterministic components.
Fix: Strip dynamic variables before generating cache keys, or use parameterized templates. Set TTLs that match data refresh cycles. Validate cache hits against freshness requirements.
5. Setting max_tokens to the Model Limit
Explanation: Leaving max_tokens at 4096 or 8192 when you only need 50 tokens forces the model to generate filler text, inflating costs and latency. The API bills for every token generated, not just the useful ones.
Fix: Calculate expected output length during design phase. Apply a 20β30% buffer for safety. Monitor actual completion tokens and adjust caps quarterly.
6. Aggregating Costs at the Model Level Instead of Feature Level
Explanation: Tracking spend by gpt-4o vs gpt-4o-mini hides which product features are driving costs. A single misconfigured endpoint can drain the budget while other features appear efficient.
Fix: Tag every telemetry event with feature, endpoint, and user_segment. Build dashboards that show cost per feature, not just cost per model.
7. Streaming Without Token Accounting
Explanation: Streaming responses improve perceived latency but complicate cost tracking. If you only log final usage, you miss intermediate token generation patterns and cannot enforce mid-stream caps.
Fix: Parse streaming deltas to count tokens in real-time. Implement early termination logic when output exceeds budget thresholds. Log partial usage for accurate attribution.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| FAQ / Static Knowledge Base | Redis cache with 24h TTL + gpt-4o-mini | Identical prompts repeat frequently; small model handles retrieval accurately | 60β80% reduction |
| Dynamic User Chat | Pre-flight budgeting + middle-truncation + max_tokens=300 | Context grows unpredictably; hard caps prevent verbose drift | 30β45% reduction |
| Batch Classification | Complexity routing to gpt-4o-mini + parallel processing | Simple tasks don't require flagship reasoning; parallelism improves throughput | 70β85% reduction |
| Legal/Technical Docs | Middle-truncation + feature-level telemetry | Conclusion and directives are critical; accurate attribution prevents budget blind spots | 20β35% reduction |
| Real-time Streaming | Mid-stream token counting + early termination | Prevents runaway generation; maintains latency SLAs while controlling output cost | 25β40% reduction |
Configuration Template
// llm-cost-config.ts
export const LLM_COST_CONFIG = {
models: {
flagship: "gpt-4o",
efficient: "gpt-4o-mini"
},
pricing: {
inputPer1k: 0.005,
outputPer1k: 0.015
},
limits: {
maxInputTokens: 3000,
outputBufferMultiplier: 1.25,
cacheTTLSeconds: 3600
},
routing: {
simpleThreshold: 600,
mediumThreshold: 2000,
keywords: ["classify", "label", "category", "yes or no", "true or false"]
},
telemetry: {
enabled: true,
endpoint: "/metrics/llm-cost",
aggregationInterval: 60
}
};
Quick Start Guide
- Install tokenizer & dependencies: Run
npm install tiktoken ioredis to enable local counting and distributed caching.
- Initialize the guard layer: Instantiate
TokenBudgetGuard, ContextCompressor, ModelRouter, and ResponseCache at application startup. Inject LLM_COST_CONFIG for centralized control.
- Wrap API calls: Replace direct provider SDK calls with a middleware function that runs
preFlightCheck(), applies sanitize() and applyMiddleTruncation(), selects the model via selectModel(), checks the cache, and emits telemetry after response completion.
- Deploy telemetry pipeline: Configure your time-series database to ingest
[TELEMETRY] logs. Create a dashboard grouping costs by feature and set daily alert thresholds at 80% of budget.
- Validate in staging: Run a synthetic load test with mixed prompt types. Verify cache hit rates, routing accuracy, and cost attribution before promoting to production.