imits. It acts as a soft guardrail that logs drift, warns on threshold breaches, and optionally blocks requests that exceed business-defined caps. This prevents single runaway calls from skewing monthly budgets.
Implementation (TypeScript)
The following implementation demonstrates a telemetry engine that coordinates tagging, estimation, and cache profiling. It uses explicit interfaces, dependency injection, and async-safe state management.
import { createHash } from 'crypto';

// ─── Domain Interfaces ───────────────────────────────────────────────────────
interface RunMetadata {
  feature: string;
  userId: string;
  templateVersion: string;
  model: string;
}

interface CacheMetrics {
  readTokens: number;
  writeTokens: number;
  totalInputTokens: number;
  totalOutputTokens: number;
}

interface BudgetEstimate {
  estimatedCost: number;
  blocked: boolean;
  warningTriggered: boolean;
}

interface PricingConfig {
  inputPerMillion: number;
  outputPerMillion: number;
}

interface TelemetryConfig {
  pricing: Record<string, PricingConfig>;
  hardCap: number;
  warnThreshold: number;
  outputLogPath: string;
}

// ─── Core Engine ─────────────────────────────────────────────────────────────
class CostAttributionEngine {
  private config: TelemetryConfig;
  private runRegistry: Map<string, RunMetadata> = new Map();
  private cacheStore: Map<string, CacheMetrics[]> = new Map();

  constructor(config: TelemetryConfig) {
    this.config = config;
  }

  /**
   * Initiates a traced run. Returns a stable run ID for downstream correlation.
   */
  public initiateRun(metadata: RunMetadata): string {
    const runId = createHash('sha256')
      .update(`${metadata.feature}-${metadata.userId}-${Date.now()}`)
      .digest('hex')
      .slice(0, 16);
    this.runRegistry.set(runId, metadata);
    return runId;
  }

  /**
   * Pre-flight estimation. Calculates expected cost before token consumption.
   */
  public estimateBudget(
    runId: string,
    inputTokens: number,
    outputTokens: number
  ): BudgetEstimate {
    const meta = this.runRegistry.get(runId);
    if (!meta) throw new Error(`Run ${runId} not found in registry`);

    const pricing = this.config.pricing[meta.model];
    if (!pricing) throw new Error(`Pricing missing for model ${meta.model}`);

    const inputCost = (inputTokens / 1_000_000) * pricing.inputPerMillion;
    const outputCost = (outputTokens / 1_000_000) * pricing.outputPerMillion;
    const totalEstimate = inputCost + outputCost;

    const blocked = totalEstimate > this.config.hardCap;
    const warningTriggered = totalEstimate > this.config.warnThreshold;

    return {
      estimatedCost: totalEstimate,
      blocked,
      warningTriggered,
    };
  }

  /**
   * Records cache headers and finalizes run accounting.
   */
  public finalizeRun(
    runId: string,
    cacheData: CacheMetrics,
    actualInputTokens: number,
    actualOutputTokens: number
  ): void {
    const meta = this.runRegistry.get(runId);
    if (!meta) throw new Error(`Run ${runId} not found in registry`);

    // Same guard as estimateBudget: fail loudly rather than read from undefined
    const pricing = this.config.pricing[meta.model];
    if (!pricing) throw new Error(`Pricing missing for model ${meta.model}`);

    const actualCost =
      (actualInputTokens / 1_000_000) * pricing.inputPerMillion +
      (actualOutputTokens / 1_000_000) * pricing.outputPerMillion;

    // Persist cache metrics for feature-level aggregation
    const featureKey = meta.feature;
    if (!this.cacheStore.has(featureKey)) {
      this.cacheStore.set(featureKey, []);
    }
    this.cacheStore.get(featureKey)!.push(cacheData);

    // In production, append to JSONL or emit to OpenTelemetry collector
    this.emitTelemetry(runId, meta, actualCost, cacheData);

    // Release the registry entry so finished runs do not accumulate in memory
    this.runRegistry.delete(runId);
  }

  /**
   * Aggregates cache effectiveness per feature.
   */
  public getCacheReport(): Record<string, { hitRatio: number; savingsUsd: number }> {
    const report: Record<string, { hitRatio: number; savingsUsd: number }> = {};

    for (const [feature, metrics] of this.cacheStore.entries()) {
      const totalRead = metrics.reduce((sum, m) => sum + m.readTokens, 0);
      const totalInput = metrics.reduce((sum, m) => sum + m.totalInputTokens, 0);
      const hitRatio = totalInput > 0 ? totalRead / totalInput : 0;

      // Simplified savings calculation based on cached read tokens. It prices every
      // feature at the claude-sonnet-4-6 input rate and assumes a flat 85% discount
      // on cached reads; adjust both to your provider's current rate card.
      const pricing = this.config.pricing['claude-sonnet-4-6'];
      if (!pricing) throw new Error('Pricing missing for model claude-sonnet-4-6');
      const savings = (totalRead / 1_000_000) * pricing.inputPerMillion * 0.85;

      report[feature] = { hitRatio, savingsUsd: savings };
    }

    return report;
  }

  private emitTelemetry(
    runId: string,
    meta: RunMetadata,
    cost: number,
    cache: CacheMetrics
  ): void {
    const payload = {
      runId,
      timestamp: new Date().toISOString(),
      ...meta,
      cost,
      cacheRead: cache.readTokens,
      cacheWrite: cache.writeTokens,
    };
    // Replace with file stream write or OTLP exporter
    console.log(JSON.stringify(payload));
  }
}

// ─── Usage Example ───────────────────────────────────────────────────────────
async function demonstrateAttribution() {
  const engine = new CostAttributionEngine({
    pricing: {
      'claude-sonnet-4-6': { inputPerMillion: 3.0, outputPerMillion: 15.0 },
      'gpt-5.4': { inputPerMillion: 2.5, outputPerMillion: 10.0 },
    },
    hardCap: 0.05,
    warnThreshold: 0.02,
    outputLogPath: './telemetry/runs.jsonl',
  });

  // 1. Tag at ingress
  const runId = engine.initiateRun({
    feature: 'document-summarization',
    userId: 'acct-8821',
    templateVersion: 'v3.2-stable',
    model: 'claude-sonnet-4-6',
  });

  // 2. Pre-flight estimation
  const estimate = engine.estimateBudget(runId, 4200, 600);
  if (estimate.blocked) {
    console.warn(`Run ${runId} blocked: $${estimate.estimatedCost.toFixed(4)} exceeds cap`);
    return;
  }
  if (estimate.warningTriggered) {
    console.info(`Run ${runId} warning: $${estimate.estimatedCost.toFixed(4)} near threshold`);
  }

  // 3. Simulate LLM call & capture cache headers
  const mockResponse = {
    usage: {
      input_tokens: 4150,
      output_tokens: 580,
      cache_read_input_tokens: 3200,
      cache_creation_input_tokens: 950,
    },
  };

  // 4. Finalize with cache normalization
  engine.finalizeRun(
    runId,
    {
      readTokens: mockResponse.usage.cache_read_input_tokens,
      writeTokens: mockResponse.usage.cache_creation_input_tokens,
      totalInputTokens: mockResponse.usage.input_tokens,
      totalOutputTokens: mockResponse.usage.output_tokens,
    },
    mockResponse.usage.input_tokens,
    mockResponse.usage.output_tokens
  );

  // 5. Aggregate cache ROI
  const cacheReport = engine.getCacheReport();
  console.log('Cache ROI:', JSON.stringify(cacheReport, null, 2));
}

demonstrateAttribution();
Why This Architecture Works
- State isolation per run: The runRegistry prevents cross-contamination between concurrent requests. Each run carries its own metadata, ensuring accurate attribution even under high concurrency.
- Explicit cache normalization: Provider responses are mapped to a unified CacheMetrics shape. This allows cross-model cache analysis without vendor lock-in.
- Separation of estimation and execution: The estimateBudget method runs synchronously before network I/O. This enables early rejection or model downgrading without consuming provider tokens.
- Observable output: The emitTelemetry method is designed to interface with OpenTelemetry, Kafka, or append-only JSONL streams. Production deployments should route this to a time-series backend for P95/P99 analysis.
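The cache normalization point can be made concrete with a small mapping layer. The sketch below is illustrative rather than part of the engine: the two usage shapes and the function names are assumptions, and the comments state how each provider is assumed to count cached tokens. Whether a provider's input count already includes cached tokens varies, so check the field semantics against current provider documentation before relying on the totals.

```typescript
// Unified shape, mirroring the CacheMetrics interface from the implementation.
interface CacheMetrics {
  readTokens: number;
  writeTokens: number;
  totalInputTokens: number;
  totalOutputTokens: number;
}

// Assumption: input_tokens counts only uncached input, so cache reads and
// cache writes must be added to obtain the full prompt size.
interface AnthropicStyleUsage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

// Assumption: prompt_tokens already includes cached tokens, and no separate
// cache-write count is reported.
interface OpenAIStyleUsage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Hypothetical helper: maps an Anthropic-style usage block to CacheMetrics.
function normalizeAnthropicUsage(u: AnthropicStyleUsage): CacheMetrics {
  const read = u.cache_read_input_tokens ?? 0;
  const write = u.cache_creation_input_tokens ?? 0;
  return {
    readTokens: read,
    writeTokens: write,
    totalInputTokens: u.input_tokens + read + write,
    totalOutputTokens: u.output_tokens,
  };
}

// Hypothetical helper: maps an OpenAI-style usage block to CacheMetrics.
function normalizeOpenAIUsage(u: OpenAIStyleUsage): CacheMetrics {
  return {
    readTokens: u.prompt_tokens_details?.cached_tokens ?? 0,
    writeTokens: 0,
    totalInputTokens: u.prompt_tokens,
    totalOutputTokens: u.completion_tokens,
  };
}
```

Keeping the provider-specific arithmetic inside these mappers means the hit ratio computed downstream always divides cached reads by the same notion of total input, regardless of which vendor served the request.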
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Late-Stage Tagging | Attaching metadata after the LLM call completes or inside retry loops. Async context loss means attribution keys are missing or misaligned. | Tag at the API gateway or service ingress. Pass RunMetadata through request context or headers. Never reconstruct attribution from logs. |
| Global Cache Aggregation | Calculating cache hit ratios across all features. High-hit features mask low-hit ones, leading to false confidence in caching strategy. | Segment cache metrics by feature and template_version. Analyze hit ratios per workflow, not globally. |
| Static Tokenization Assumptions | Using word count or character length to estimate tokens. Different models tokenize differently; estimates drift by 20-40%. | Use provider-specific tokenizer libraries or pre-flight API endpoints that return exact token counts before generation. |
| Over-Reliance on Pre-Flight Caps | Treating estimation as a hard guarantee. Estimates ignore dynamic tool outputs, streaming variance, and provider rate adjustments. | Use caps as soft guardrails. Implement circuit breakers that degrade gracefully (e.g., switch to cheaper model, truncate context) rather than hard failures. |
| Ignoring Embedding & Tool Costs | Attributing only generation tokens. Retrieval pipelines, vector searches, and tool execution often consume 30-50% of total compute. | Instrument embedding calls separately. Tag tool execution runs with the same feature key to maintain end-to-end attribution. |
| Hardcoded Pricing Models | Embedding rate cards in application code. Provider pricing changes break accounting and cause budget miscalculations. | Externalize pricing to a configuration service or fetch from a provider pricing API. Cache rates with TTL and validate against actual invoices monthly. |
| Optimizing for Mean Instead of P95 | Focusing on average cost per run. Outliers (long tool loops, verbose responses) drive 70% of spend but disappear in mean calculations. | Track P90/P95/P99 cost distributions. Implement outlier detection that flags runs exceeding 3x the feature median for manual review. |
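The last pitfall in the table can be sketched in a few lines. The example below is a minimal illustration, assuming per-run costs have already been collected for a single feature; the RunCost shape and both function names are hypothetical. It uses the nearest-rank percentile method and flags runs above 3x the feature median, as the table suggests.

```typescript
// Hypothetical record of one finalized run.
interface RunCost {
  runId: string;
  feature: string;
  cost: number;
}

// Nearest-rank percentile: the smallest value with at least p% of the data at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('percentile of an empty list');
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes the cost distribution for one feature and lists outlier runs.
function summarizeFeature(runs: RunCost[], outlierMultiplier = 3) {
  const costs = runs.map((r) => r.cost);
  const median = percentile(costs, 50);
  const mean = costs.reduce((sum, c) => sum + c, 0) / costs.length;
  return {
    mean,
    median,
    p95: percentile(costs, 95),
    p99: percentile(costs, 99),
    // Runs costing more than outlierMultiplier times the median go to manual review.
    outliers: runs.filter((r) => r.cost > outlierMultiplier * median),
  };
}
```

With nine runs at $0.01 and one at $0.50, the median stays at $0.01 while P95 jumps to $0.50, which is the gap a mean-only dashboard would blur.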
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Multi-tenant SaaS with usage-based billing | Per-user + per-feature attribution with P95 cost tracking | Enables accurate margin calculation and tiered pricing adjustments | High: Directly impacts revenue recognition and customer pricing |
| Internal agent suite with 3+ workflows | Feature-level tagging + cache ROI segmentation | Identifies which pipelines benefit from prompt stabilization vs. model downgrades | Medium: Reduces compute waste by 20-40% through targeted optimizations |
| High-throughput batch processing | Pre-flight estimation + hard caps + streaming truncation | Prevents runaway jobs from consuming monthly budgets in hours | Critical: Avoids catastrophic overages and enables predictable batch scheduling |
| Single-feature, single-model deployment | Aggregate billing only | Attribution overhead exceeds value when there is no comparative dimension | Low: Skip instrumentation; focus on latency and accuracy |
Configuration Template
// telemetry.config.ts
export const TELEMETRY_CONFIG = {
  pricing: {
    'claude-sonnet-4-6': { inputPerMillion: 3.0, outputPerMillion: 15.0 },
    'gpt-5.4': { inputPerMillion: 2.5, outputPerMillion: 10.0 },
    'embedding-v3': { inputPerMillion: 0.1, outputPerMillion: 0.0 },
  },
  thresholds: {
    hardCap: 0.05,
    warnThreshold: 0.02,
    p95DriftAlert: 0.30, // 30% deviation triggers investigation
  },
  cache: {
    minHitRatioForOptimization: 0.60,
    prefixStabilityCheck: true, // Validate system prompt consistency
  },
  export: {
    format: 'jsonl',
    path: './data/telemetry/runs.jsonl',
    otelEndpoint: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  },
};
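The template nests hardCap and warnThreshold under thresholds and the log path under export, whereas the TelemetryConfig interface accepted by CostAttributionEngine is flat. One way to bridge the two is a small adapter, sketched below. The names toEngineConfig, TelemetryTemplate, and exceedsDrift are hypothetical, and reading p95DriftAlert as a relative deviation between estimated and actual cost is an assumption based on the inline comment in the template.

```typescript
interface PricingConfig {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Flat shape expected by the CostAttributionEngine constructor.
interface TelemetryConfig {
  pricing: Record<string, PricingConfig>;
  hardCap: number;
  warnThreshold: number;
  outputLogPath: string;
}

// Nested shape of the template (only the fields the adapter reads).
interface TelemetryTemplate {
  pricing: Record<string, PricingConfig>;
  thresholds: { hardCap: number; warnThreshold: number; p95DriftAlert: number };
  export: { path: string };
}

// Hypothetical adapter: validates the template and flattens it for the engine.
function toEngineConfig(t: TelemetryTemplate): TelemetryConfig {
  const { hardCap, warnThreshold } = t.thresholds;
  if (!(warnThreshold > 0 && warnThreshold <= hardCap)) {
    throw new Error('warnThreshold must be positive and no greater than hardCap');
  }
  for (const model of Object.keys(t.pricing)) {
    const p = t.pricing[model];
    if (p.inputPerMillion < 0 || p.outputPerMillion < 0) {
      throw new Error(`Negative rate configured for model ${model}`);
    }
  }
  return { pricing: t.pricing, hardCap, warnThreshold, outputLogPath: t.export.path };
}

// Hypothetical drift check: true when actual cost deviates from the estimate
// by more than the configured fraction (0.30 means 30%).
function exceedsDrift(estimated: number, actual: number, limit: number): boolean {
  if (estimated <= 0) return actual > 0;
  return Math.abs(actual - estimated) / estimated > limit;
}
```

Validating at load time means a misordered threshold pair or a negative rate surfaces at startup instead of as silently wrong estimates.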
Quick Start Guide
- Initialize the engine: Import CostAttributionEngine and pass the configuration object. Ensure pricing data matches your provider's current rate card.
- Tag at ingress: Call initiateRun() with feature, userId, templateVersion, and model before any LLM or embedding call. Store the returned runId in request context.
- Estimate before execution: Run estimateBudget() with expected token counts. Implement conditional logic to downgrade models or truncate context if thresholds are breached.
- Finalize with cache data: After the provider response, extract cache headers and call finalizeRun(). Route the output to your telemetry backend for aggregation and P95 analysis.
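The conditional logic mentioned in the estimation step might look like the following. This is a sketch under stated assumptions, not the engine's behavior: planRun, RunPlan, the fallbackModel parameter, and the minInputTokens floor are all hypothetical. It tries the preferred model first, then a cheaper fallback, then a truncated context on the fallback, and rejects only when none of those fit under the cap.

```typescript
interface PricingConfig {
  inputPerMillion: number;
  outputPerMillion: number;
}

interface RunPlan {
  action: 'proceed' | 'downgrade' | 'truncate' | 'reject';
  model: string;
  inputTokens: number;
  estimatedCost: number;
}

function cost(p: PricingConfig, inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * p.inputPerMillion +
    (outputTokens / 1_000_000) * p.outputPerMillion
  );
}

// Hypothetical planner: degrade step by step before rejecting a request.
function planRun(
  pricing: Record<string, PricingConfig>,
  preferredModel: string,
  fallbackModel: string,
  inputTokens: number,
  outputTokens: number,
  hardCap: number,
  minInputTokens = 1000 // assumed floor below which truncation is not useful
): RunPlan {
  const preferred = pricing[preferredModel];
  const fallback = pricing[fallbackModel];
  if (!preferred || !fallback) throw new Error('Pricing missing for a candidate model');

  const preferredCost = cost(preferred, inputTokens, outputTokens);
  if (preferredCost <= hardCap) {
    return { action: 'proceed', model: preferredModel, inputTokens, estimatedCost: preferredCost };
  }

  const fallbackCost = cost(fallback, inputTokens, outputTokens);
  if (fallbackCost <= hardCap) {
    return { action: 'downgrade', model: fallbackModel, inputTokens, estimatedCost: fallbackCost };
  }

  // Reserve the output budget, then spend what remains on input tokens.
  const inputBudget = hardCap - cost(fallback, 0, outputTokens);
  if (inputBudget > 0 && fallback.inputPerMillion > 0) {
    const maxInput = Math.floor((inputBudget / fallback.inputPerMillion) * 1_000_000);
    if (maxInput >= minInputTokens) {
      const truncated = Math.min(maxInput, inputTokens);
      return {
        action: 'truncate',
        model: fallbackModel,
        inputTokens: truncated,
        estimatedCost: cost(fallback, truncated, outputTokens),
      };
    }
  }

  return { action: 'reject', model: preferredModel, inputTokens, estimatedCost: preferredCost };
}
```

Because the plan is computed from estimates, the caller should still record actual usage through finalizeRun() and treat the cap as a soft guardrail.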
Cost attribution is not an accounting exercise; it is an architectural feedback loop. When you can trace every dollar to a specific workflow, user segment, and prompt template, optimization stops being guesswork and becomes a deterministic engineering process.