Which Agent Feature Costs the Most? Here's How to Find Out.

By Codcompass Team·2026-05-26·10 min read

Current Situation Analysis

Modern LLM-powered applications suffer from a fundamental accounting blind spot: billing is aggregated, but value is distributed. Engineering teams deploy multiple features, routing logic, and model tiers through a single API key. The resulting invoice presents a monolithic total, obscuring which workflows drive spend, which users generate the most compute, and where architectural optimizations would yield the highest ROI.

This opacity stems from how SDKs abstract away token consumption. Developers interact with high-level completion endpoints, while providers bill on raw token volume and cache utilization. Without explicit instrumentation, cost attribution becomes a retrospective guessing game. Teams optimize for latency or accuracy, assuming cost scales linearly with traffic. In reality, LLM spend follows a power law: a small subset of prompt templates, tool loops, or cache-miss patterns typically accounts for 60-80% of the monthly bill.

The problem is compounded by three industry realities:

Prompt caching is highly asymmetric. Stable system prefixes can achieve 70-90% cache hit rates, slashing compute costs. However, cache effectiveness varies drastically by feature. A summarization pipeline with a fixed instruction block will cache efficiently, while a dynamic search agent that mutates its system prompt per request will see near-zero hit rates. Aggregated metrics mask this divergence.
Tokenization is non-linear. Word count, character length, and actual token consumption diverge significantly across models. Relying on rough estimates leads to budget overruns and inaccurate unit economics.
Pre-flight estimation is necessary but insufficient. Provider rate limits and hard caps exist, but they trigger after spend occurs. A pre-execution estimation layer allows teams to reject or downgrade requests before tokens are consumed, turning cost control from reactive to proactive.
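To make the first point concrete, here is a minimal sketch of how one aggregated cache hit ratio can hide a poorly cached feature. The `UsageSample` shape, the feature names, and the token counts are illustrative assumptions, not measurements from a real system.

```typescript
// Illustrative samples only; the feature names and token counts are made up.
interface UsageSample {
  feature: string;
  cachedInputTokens: number; // input tokens served from the prompt cache
  totalInputTokens: number;  // all input tokens, cached or not
}

// Token-weighted cache hit ratio over any set of samples.
function hitRatio(samples: UsageSample[]): number {
  const cached = samples.reduce((sum, s) => sum + s.cachedInputTokens, 0);
  const total = samples.reduce((sum, s) => sum + s.totalInputTokens, 0);
  return total > 0 ? cached / total : 0;
}

// The same ratio, computed separately for each feature.
function hitRatioByFeature(samples: UsageSample[]): Record<string, number> {
  const groups = new Map<string, UsageSample[]>();
  for (const s of samples) {
    const group = groups.get(s.feature) ?? [];
    group.push(s);
    groups.set(s.feature, group);
  }
  const report: Record<string, number> = {};
  groups.forEach((group, feature) => {
    report[feature] = hitRatio(group);
  });
  return report;
}

const usageSamples: UsageSample[] = [
  { feature: 'summarization', cachedInputTokens: 9000, totalInputTokens: 10000 },
  { feature: 'search-agent', cachedInputTokens: 100, totalInputTokens: 2000 },
];

const globalRatio = hitRatio(usageSamples);
const perFeatureRatio = hitRatioByFeature(usageSamples);
console.log('global:', globalRatio, 'per feature:', perFeatureRatio);
```

With these made-up numbers the global ratio comes out at roughly 0.76, which looks healthy, while the search agent on its own sits at 0.05.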

Without granular telemetry, teams cannot answer basic operational questions: Should we downgrade the search agent to a cheaper model? Is the document upload workflow worth the cache overhead? Which user segment is subsidizing compute-heavy features? Cost attribution transforms these questions from speculation into data-driven decisions.

WOW Moment: Key Findings

The shift from aggregate billing to feature-level attribution reveals optimization levers that are invisible in standard dashboards. The following comparison demonstrates how attribution depth changes operational outcomes:

Approach	Spend Visibility	Breakdown Available	Optimization Insight
Aggregate Dashboard	Total monthly spend visible	No feature breakdown	Cache ROI unknown
Feature-Level Tagging	Cost split by workflow	Run frequency tracked	Cache effectiveness masked
Cache-Aware Attribution	Per-feature spend & run count	Cache hit ratio & savings	Actionable optimization paths

This finding matters because it decouples cost reduction strategies. High total spend with high cache hit ratios indicates a volume problem: reduce run frequency, batch requests, or implement request coalescing. Low total spend with low cache hit ratios indicates a structural problem: stabilize prompt prefixes, extract invariant instructions, or route to models with larger context windows. Attribution does not just report numbers; it prescribes architecture changes.
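The decision rule in this paragraph can be sketched as a small classifier. The `pickLever` name, the lever labels, and the spend threshold are hypothetical; the 0.6 hit ratio default simply mirrors `minHitRatioForOptimization` from the configuration template later in the article.

```typescript
type Lever = 'stabilize-prompts' | 'reduce-volume' | 'none';

interface FeatureStats {
  monthlySpendUsd: number;
  cacheHitRatio: number; // 0..1, token-weighted
}

interface LeverThresholds {
  highSpendUsd: number; // hypothetical cutoff for "high spend"
  goodHitRatio: number; // hit ratio at or above which caching is considered healthy
}

const DEFAULT_THRESHOLDS: LeverThresholds = { highSpendUsd: 1000, goodHitRatio: 0.6 };

// A low hit ratio is treated as a structural problem regardless of spend;
// a healthy hit ratio with high spend is treated as a volume problem.
function pickLever(
  stats: FeatureStats,
  thresholds: LeverThresholds = DEFAULT_THRESHOLDS
): Lever {
  if (stats.cacheHitRatio < thresholds.goodHitRatio) return 'stabilize-prompts';
  if (stats.monthlySpendUsd >= thresholds.highSpendUsd) return 'reduce-volume';
  return 'none';
}

console.log(pickLever({ monthlySpendUsd: 4200, cacheHitRatio: 0.85 })); // reduce-volume
console.log(pickLever({ monthlySpendUsd: 150, cacheHitRatio: 0.05 }));  // stabilize-prompts
```

Both thresholds are business decisions, so in practice they would live in configuration rather than in code.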

Core Solution

Building a production-grade cost attribution system requires three coordinated layers: metadata injection at ingress, cache header normalization, and pre-flight budget evaluation. The implementation must operate asynchronously, survive network failures, and remain decoupled from business logic.

Architecture Decisions & Rationale

Ingress Tagging Over Retrospective Enrichment: Metadata must be attached when the request enters the system. Async chains, retries, and parallel tool calls fragment context downstream. Capturing feature, user_id, template_version, and model at the entry point guarantees every downstream token event inherits the correct attribution keys.
Provider-Agnostic Cache Profiling: Cache headers (cache_read_input_tokens, cache_creation_input_tokens) differ across providers. A normalization layer translates provider-specific metrics into a unified CacheMetrics interface, enabling cross-model cache ROI analysis.
Pre-Flight Estimation as a Circuit Breaker: Estimation should not replace provider limits. It acts as a soft guardrail that logs drift, warns on threshold breaches, and optionally blocks requests that exceed business-defined caps. This prevents single runaway calls from skewing monthly budgets.
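As a rough sketch of the normalization layer described above, the two adapters below map provider usage payloads onto the `CacheMetrics` shape. The field names follow the Anthropic-style and OpenAI-style usage objects as commonly documented, but treat them, and especially the question of whether the input count already includes cached tokens, as assumptions to verify against your provider's current API reference.

```typescript
interface CacheMetrics {
  readTokens: number;
  writeTokens: number;
  totalInputTokens: number;
  totalOutputTokens: number;
}

// Assumed Anthropic-style usage payload.
interface AnthropicStyleUsage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

// Assumed OpenAI-style usage payload.
interface OpenAIStyleUsage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

function fromAnthropicStyle(u: AnthropicStyleUsage): CacheMetrics {
  const read = u.cache_read_input_tokens ?? 0;
  const write = u.cache_creation_input_tokens ?? 0;
  return {
    readTokens: read,
    writeTokens: write,
    // Assumption: input_tokens counts only uncached tokens, so the total is the sum.
    totalInputTokens: u.input_tokens + read + write,
    totalOutputTokens: u.output_tokens,
  };
}

function fromOpenAIStyle(u: OpenAIStyleUsage): CacheMetrics {
  return {
    readTokens: u.prompt_tokens_details?.cached_tokens ?? 0,
    // Assumption: cache writes are not reported separately in this payload.
    writeTokens: 0,
    // Assumption: prompt_tokens already includes cached tokens.
    totalInputTokens: u.prompt_tokens,
    totalOutputTokens: u.completion_tokens,
  };
}

const normalizedA = fromAnthropicStyle({
  input_tokens: 50,
  output_tokens: 580,
  cache_read_input_tokens: 3200,
  cache_creation_input_tokens: 950,
});
const normalizedB = fromOpenAIStyle({
  prompt_tokens: 4000,
  completion_tokens: 300,
  prompt_tokens_details: { cached_tokens: 3072 },
});
console.log(normalizedA, normalizedB);
```

Keeping one adapter per provider means the aggregation code only ever sees `CacheMetrics`, so adding a provider does not touch the reporting logic.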

Implementation (TypeScript)

The following implementation demonstrates a telemetry engine that coordinates tagging, estimation, and cache profiling. It uses explicit interfaces, constructor-injected configuration, and per-run state keyed by run ID.

import { createHash } from 'crypto';

// ─── Domain Interfaces ───────────────────────────────────────────────────────
interface RunMetadata {
  feature: string;
  userId: string;
  templateVersion: string;
  model: string;
}

interface CacheMetrics {
  readTokens: number;
  writeTokens: number;
  totalInputTokens: number;
  totalOutputTokens: number;
}

interface BudgetEstimate {
  estimatedCost: number;
  blocked: boolean;
  warningTriggered: boolean;
}

interface PricingConfig {
  inputPerMillion: number;
  outputPerMillion: number;
}

interface TelemetryConfig {
  pricing: Record<string, PricingConfig>;
  hardCap: number;
  warnThreshold: number;
  outputLogPath: string;
}

// ─── Core Engine ─────────────────────────────────────────────────────────────
class CostAttributionEngine {
  private config: TelemetryConfig;
  private runRegistry: Map<string, RunMetadata> = new Map();
  private cacheStore: Map<string, CacheMetrics[]> = new Map();

  constructor(config: TelemetryConfig) {
    this.config = config;
  }

  /**
   * Initiates a traced run. Returns a stable run ID for downstream correlation.
   */
  public initiateRun(metadata: RunMetadata): string {
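    // Date.now() alone can collide when the same user triggers the same feature
    // twice within one millisecond, so a random component is mixed in.
    const runId = createHash('sha256')
      .update(`${metadata.feature}-${metadata.userId}-${Date.now()}-${Math.random()}`)
      .digest('hex')
      .slice(0, 16);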

    this.runRegistry.set(runId, metadata);
    return runId;
  }

  /**
   * Pre-flight estimation. Calculates expected cost before token consumption.
   */
  public estimateBudget(
    runId: string,
    inputTokens: number,
    outputTokens: number
  ): BudgetEstimate {
    const meta = this.runRegistry.get(runId);
    if (!meta) throw new Error(`Run ${runId} not found in registry`);

    const pricing = this.config.pricing[meta.model];
    if (!pricing) throw new Error(`Pricing missing for model ${meta.model}`);

    const inputCost = (inputTokens / 1_000_000) * pricing.inputPerMillion;
    const outputCost = (outputTokens / 1_000_000) * pricing.outputPerMillion;
    const totalEstimate = inputCost + outputCost;

    const blocked = totalEstimate > this.config.hardCap;
    const warningTriggered = totalEstimate > this.config.warnThreshold;

    return {
      estimatedCost: totalEstimate,
      blocked,
      warningTriggered,
    };
  }

  /**
   * Records cache headers and finalizes run accounting.
   */
  public finalizeRun(
    runId: string,
    cacheData: CacheMetrics,
    actualInputTokens: number,
    actualOutputTokens: number
  ): void {
    const meta = this.runRegistry.get(runId);
    if (!meta) throw new Error(`Run ${runId} not found in registry`);

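    const pricing = this.config.pricing[meta.model];
    if (!pricing) throw new Error(`Pricing missing for model ${meta.model}`);

    // Simplification: every input token is billed at the base input rate;
    // cache read discounts and cache write surcharges are not applied here.
    const actualCost =
      (actualInputTokens / 1_000_000) * pricing.inputPerMillion +
      (actualOutputTokens / 1_000_000) * pricing.outputPerMillion;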

    // Persist cache metrics for feature-level aggregation
    const featureKey = meta.feature;
    if (!this.cacheStore.has(featureKey)) {
      this.cacheStore.set(featureKey, []);
    }
    this.cacheStore.get(featureKey)!.push(cacheData);

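    // In production, append to JSONL or emit to OpenTelemetry collector
    this.emitTelemetry(runId, meta, actualCost, cacheData);

    // Release the registry entry so finished runs do not accumulate in memory
    this.runRegistry.delete(runId);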
  }

  /**
   * Aggregates cache effectiveness per feature.
   */
  public getCacheReport(): Record<string, { hitRatio: number; savingsUsd: number }> {
    const report: Record<string, { hitRatio: number; savingsUsd: number }> = {};

    for (const [feature, metrics] of this.cacheStore.entries()) {
      const totalRead = metrics.reduce((sum, m) => sum + m.readTokens, 0);
      const totalInput = metrics.reduce((sum, m) => sum + m.totalInputTokens, 0);
      const hitRatio = totalInput > 0 ? totalRead / totalInput : 0;

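      // Simplified savings calculation based on cached read tokens.
      // Assumes every feature is priced at the 'claude-sonnet-4-6' input rate and
      // that cached reads are discounted by 85%; adjust both for your provider.
      const pricing = this.config.pricing['claude-sonnet-4-6'];
      const savings = pricing
        ? (totalRead / 1_000_000) * pricing.inputPerMillion * 0.85
        : 0;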

      report[feature] = { hitRatio, savingsUsd: savings };
    }

    return report;
  }

  private emitTelemetry(
    runId: string,
    meta: RunMetadata,
    cost: number,
    cache: CacheMetrics
  ): void {
    const payload = {
      runId,
      timestamp: new Date().toISOString(),
      ...meta,
      cost,
      cacheRead: cache.readTokens,
      cacheWrite: cache.writeTokens,
    };
    // Replace with file stream write or OTLP exporter
    console.log(JSON.stringify(payload));
  }
}

// ─── Usage Example ───────────────────────────────────────────────────────────
async function demonstrateAttribution() {
  const engine = new CostAttributionEngine({
    pricing: {
      'claude-sonnet-4-6': { inputPerMillion: 3.0, outputPerMillion: 15.0 },
      'gpt-5.4': { inputPerMillion: 2.5, outputPerMillion: 10.0 },
    },
    hardCap: 0.05,
    warnThreshold: 0.02,
    outputLogPath: './telemetry/runs.jsonl',
  });

  // 1. Tag at ingress
  const runId = engine.initiateRun({
    feature: 'document-summarization',
    userId: 'acct-8821',
    templateVersion: 'v3.2-stable',
    model: 'claude-sonnet-4-6',
  });

  // 2. Pre-flight estimation
  const estimate = engine.estimateBudget(runId, 4200, 600);
  if (estimate.blocked) {
    console.warn(`Run ${runId} blocked: $${estimate.estimatedCost.toFixed(4)} exceeds cap`);
    return;
  }
  if (estimate.warningTriggered) {
    console.info(`Run ${runId} warning: $${estimate.estimatedCost.toFixed(4)} exceeds warning threshold`);
  }

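  // 3. Simulate LLM call & capture cache usage fields.
  //    In this mock, input_tokens is treated as the total, including cached tokens;
  //    check how your provider reports it before reusing this mapping.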
  const mockResponse = {
    usage: {
      input_tokens: 4150,
      output_tokens: 580,
      cache_read_input_tokens: 3200,
      cache_creation_input_tokens: 950,
    },
  };

  // 4. Finalize with cache normalization
  engine.finalizeRun(
    runId,
    {
      readTokens: mockResponse.usage.cache_read_input_tokens,
      writeTokens: mockResponse.usage.cache_creation_input_tokens,
      totalInputTokens: mockResponse.usage.input_tokens,
      totalOutputTokens: mockResponse.usage.output_tokens,
    },
    mockResponse.usage.input_tokens,
    mockResponse.usage.output_tokens
  );

  // 5. Aggregate cache ROI
  const cacheReport = engine.getCacheReport();
  console.log('Cache ROI:', JSON.stringify(cacheReport, null, 2));
}

demonstrateAttribution();
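The `finalizeRun` method above prices every input token at the base input rate. One possible refinement, sketched here under stated assumptions, is to price cached reads and cache writes separately. The `CachePricing` shape and both multipliers are hypothetical: the 0.15 read multiplier only mirrors the 0.85 savings factor used in `getCacheReport`, and the 1.25 write multiplier is a placeholder to replace with your provider's published rate.

```typescript
interface CacheMetrics {
  readTokens: number;
  writeTokens: number;
  totalInputTokens: number;
  totalOutputTokens: number;
}

// Hypothetical extension of the pricing config with cache multipliers.
interface CachePricing {
  inputPerMillion: number;
  outputPerMillion: number;
  cacheReadMultiplier: number;  // cached read price as a fraction of the base input rate
  cacheWriteMultiplier: number; // cache write price as a fraction of the base input rate
}

function cacheAwareCost(m: CacheMetrics, p: CachePricing): number {
  // Tokens that were neither read from nor written to the cache.
  const uncached = Math.max(0, m.totalInputTokens - m.readTokens - m.writeTokens);
  const inputRate = p.inputPerMillion / 1_000_000;
  const outputRate = p.outputPerMillion / 1_000_000;
  return (
    uncached * inputRate +
    m.readTokens * inputRate * p.cacheReadMultiplier +
    m.writeTokens * inputRate * p.cacheWriteMultiplier +
    m.totalOutputTokens * outputRate
  );
}

// Same token counts as the mock response in the usage example.
const mockMetrics: CacheMetrics = {
  readTokens: 3200,
  writeTokens: 950,
  totalInputTokens: 4150,
  totalOutputTokens: 580,
};
const assumedPricing: CachePricing = {
  inputPerMillion: 3.0,
  outputPerMillion: 15.0,
  cacheReadMultiplier: 0.15, // assumption
  cacheWriteMultiplier: 1.25, // assumption
};
const cacheAwareUsd = cacheAwareCost(mockMetrics, assumedPricing);
console.log(cacheAwareUsd.toFixed(6));
```

Under these assumed multipliers the run costs about $0.0137, compared with $0.02115 when all input tokens are billed at the base rate, which is the kind of gap that distorts per-feature comparisons.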

Why This Architecture Works

State isolation per run: The runRegistry prevents cross-contamination between concurrent requests. Each run carries its own metadata, ensuring accurate attribution even under high concurrency.
Explicit cache normalization: Provider responses are mapped to a unified CacheMetrics shape. This allows cross-model cache analysis without vendor lock-in.
Separation of estimation and execution: The estimateBudget method runs synchronously before network I/O. This enables early rejection or model downgrading without consuming provider tokens.
Observable output: The emitTelemetry method is designed to interface with OpenTelemetry, Kafka, or append-only JSONL streams. Production deployments should route this to a time-series backend for P95/P99 analysis.
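Because estimation runs before any network I/O, it can also drive model downgrading. The sketch below walks a preference-ordered list of models and returns the first whose estimate fits under the cap. `chooseModelWithinCap` is a hypothetical helper, and it assumes the same token counts apply to every candidate, which is only approximately true since tokenizers differ between models.

```typescript
interface PricingConfig {
  inputPerMillion: number;
  outputPerMillion: number;
}

function estimateCost(p: PricingConfig, inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * p.inputPerMillion +
    (outputTokens / 1_000_000) * p.outputPerMillion
  );
}

// Returns the first model, in preference order, whose estimated cost is within
// the cap, or null when none fits and the request should be rejected.
function chooseModelWithinCap(
  preferredModels: string[],
  pricing: Record<string, PricingConfig>,
  inputTokens: number,
  outputTokens: number,
  hardCap: number
): string | null {
  for (const model of preferredModels) {
    const rates = pricing[model];
    if (!rates) continue; // skip models without a rate card
    if (estimateCost(rates, inputTokens, outputTokens) <= hardCap) return model;
  }
  return null;
}

// Rate card copied from the article's usage example.
const RATE_CARD: Record<string, PricingConfig> = {
  'claude-sonnet-4-6': { inputPerMillion: 3.0, outputPerMillion: 15.0 },
  'gpt-5.4': { inputPerMillion: 2.5, outputPerMillion: 10.0 },
};
const MODEL_PREFERENCE = ['claude-sonnet-4-6', 'gpt-5.4'];

console.log(chooseModelWithinCap(MODEL_PREFERENCE, RATE_CARD, 4200, 600, 0.05));
console.log(chooseModelWithinCap(MODEL_PREFERENCE, RATE_CARD, 10000, 2000, 0.05));
console.log(chooseModelWithinCap(MODEL_PREFERENCE, RATE_CARD, 100000, 2000, 0.05));
```

The three calls illustrate the three outcomes: the preferred model fits, a cheaper model is substituted, and no model fits so the caller gets `null`.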

Pitfall Guide

Pitfall	Explanation	Fix
Late-Stage Tagging	Attaching metadata after the LLM call completes or inside retry loops. Async context loss means attribution keys are missing or misaligned.	Tag at the API gateway or service ingress. Pass `RunMetadata` through request context or headers. Never reconstruct attribution from logs.
Global Cache Aggregation	Calculating cache hit ratios across all features. High-hit features mask low-hit ones, leading to false confidence in caching strategy.	Segment cache metrics by `feature` and `template_version`. Analyze hit ratios per workflow, not globally.
Static Tokenization Assumptions	Using word count or character length to estimate tokens. Different models tokenize differently; estimates drift by 20-40%.	Use provider-specific tokenizer libraries or pre-flight API endpoints that return exact token counts before generation.
Over-Reliance on Pre-Flight Caps	Treating estimation as a hard guarantee. Estimates ignore dynamic tool outputs, streaming variance, and provider rate adjustments.	Use caps as soft guardrails. Implement circuit breakers that degrade gracefully (e.g., switch to cheaper model, truncate context) rather than hard failures.
Ignoring Embedding & Tool Costs	Attributing only generation tokens. Retrieval pipelines, vector searches, and tool execution often consume 30-50% of total compute.	Instrument embedding calls separately. Tag tool execution runs with the same `feature` key to maintain end-to-end attribution.
Hardcoded Pricing Models	Embedding rate cards in application code. Provider pricing changes break accounting and cause budget miscalculations.	Externalize pricing to a configuration service or fetch from a provider pricing API. Cache rates with TTL and validate against actual invoices monthly.
Optimizing for Mean Instead of P95	Focusing on average cost per run. Outliers (long tool loops, verbose responses) drive 70% of spend but disappear in mean calculations.	Track P90/P95/P99 cost distributions. Implement outlier detection that flags runs exceeding 3x the feature median for manual review.
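The last pitfall can be illustrated with a small sketch of percentile tracking and median-based outlier flagging. It uses the nearest-rank percentile definition, which is one of several conventions, and the sample costs are invented for illustration.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('percentile of empty sample');
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Flags runs whose cost exceeds `multiplier` times the median cost.
function flagOutliers(costs: number[], multiplier: number = 3): number[] {
  const median = percentile(costs, 50);
  return costs.filter((cost) => cost > multiplier * median);
}

// Invented per-run costs in USD for a single feature.
const runCosts = [0.01, 0.02, 0.01, 0.03, 0.2, 0.02];

const meanCost = runCosts.reduce((sum, c) => sum + c, 0) / runCosts.length;
const p95Cost = percentile(runCosts, 95);
const outlierRuns = flagOutliers(runCosts);
console.log({ meanCost, p95Cost, outlierRuns });
```

In this made-up sample a single run dominates the total, yet the mean alone gives no hint of which run to investigate.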

Production Bundle

Action Checklist

Instrument ingress layer: Attach feature, user_id, template_version, and model to every request before LLM invocation.
Normalize cache headers: Map provider-specific cache tokens to a unified CacheMetrics interface for cross-model analysis.
Implement pre-flight estimation: Calculate expected cost before network I/O. Log estimates alongside actuals for drift detection.
Route telemetry to time-series backend: Replace console logging with OpenTelemetry, Kafka, or append-only JSONL streams for P95/P99 analysis.
Externalize pricing configuration: Store rate cards in a config service with TTL. Validate monthly against provider invoices.
Add outlier detection: Flag runs exceeding 3x feature median cost. Investigate prompt bloat, tool loops, or streaming variance.
Instrument embedding & tool calls: Tag retrieval and execution steps with the same feature key to maintain end-to-end attribution.
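For the drift detection item in the checklist, a minimal sketch could compare each estimate with the actual cost and flag large relative deviations. The function names are hypothetical; the 0.30 default mirrors `p95DriftAlert` in the configuration template, although here it is applied per run rather than to a P95 aggregate.

```typescript
// Relative deviation of the actual cost from the pre-flight estimate.
function relativeDrift(estimatedUsd: number, actualUsd: number): number {
  if (estimatedUsd <= 0) return actualUsd > 0 ? Infinity : 0;
  return Math.abs(actualUsd - estimatedUsd) / estimatedUsd;
}

function exceedsDriftThreshold(
  estimatedUsd: number,
  actualUsd: number,
  threshold: number = 0.3
): boolean {
  return relativeDrift(estimatedUsd, actualUsd) > threshold;
}

// Figures from the article's usage example: estimated with 4200/600 tokens,
// actual usage of 4150/580 tokens at $3 and $15 per million.
const exampleEstimateUsd = (4200 / 1_000_000) * 3.0 + (600 / 1_000_000) * 15.0;
const exampleActualUsd = (4150 / 1_000_000) * 3.0 + (580 / 1_000_000) * 15.0;

console.log(relativeDrift(exampleEstimateUsd, exampleActualUsd).toFixed(3));
console.log(exceedsDriftThreshold(exampleEstimateUsd, exampleActualUsd));
```

For that example the drift is about 2%, well inside the threshold; a tool loop that doubles the output tokens would not be.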

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Multi-tenant SaaS with usage-based billing	Per-user + per-feature attribution with P95 cost tracking	Enables accurate margin calculation and tiered pricing adjustments	High: Directly impacts revenue recognition and customer pricing
Internal agent suite with 3+ workflows	Feature-level tagging + cache ROI segmentation	Identifies which pipelines benefit from prompt stabilization vs. model downgrades	Medium: Reduces compute waste by 20-40% through targeted optimizations
High-throughput batch processing	Pre-flight estimation + hard caps + streaming truncation	Prevents runaway jobs from consuming monthly budgets in hours	Critical: Avoids catastrophic overages and enables predictable batch scheduling
Single-feature, single-model deployment	Aggregate billing only	Attribution overhead exceeds value when there is no comparative dimension	Low: Skip instrumentation; focus on latency and accuracy

Configuration Template

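// telemetry.config.ts
// Note: this template is a superset of the engine's TelemetryConfig. Map
// thresholds.hardCap, thresholds.warnThreshold, and export.path onto the flat
// hardCap, warnThreshold, and outputLogPath fields before constructing the engine.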
export const TELEMETRY_CONFIG = {
  pricing: {
    'claude-sonnet-4-6': { inputPerMillion: 3.0, outputPerMillion: 15.0 },
    'gpt-5.4': { inputPerMillion: 2.5, outputPerMillion: 10.0 },
    'embedding-v3': { inputPerMillion: 0.1, outputPerMillion: 0.0 },
  },
  thresholds: {
    hardCap: 0.05,
    warnThreshold: 0.02,
    p95DriftAlert: 0.30, // 30% deviation triggers investigation
  },
  cache: {
    minHitRatioForOptimization: 0.60,
    prefixStabilityCheck: true, // Validate system prompt consistency
  },
  export: {
    format: 'jsonl',
    path: './data/telemetry/runs.jsonl',
    otelEndpoint: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  },
};
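The template nests its thresholds and export settings, while the engine's `TelemetryConfig` interface expects flat `hardCap`, `warnThreshold`, and `outputLogPath` fields. A small adapter along these lines could bridge the two. `TemplateConfig` and `toEngineConfig` are hypothetical names, and the sanity check on threshold ordering is an added assumption rather than something the article requires.

```typescript
interface PricingConfig {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Flat shape expected by the engine shown earlier in the article.
interface TelemetryConfig {
  pricing: Record<string, PricingConfig>;
  hardCap: number;
  warnThreshold: number;
  outputLogPath: string;
}

// Hypothetical type describing the nested template.
interface TemplateConfig {
  pricing: Record<string, PricingConfig>;
  thresholds: { hardCap: number; warnThreshold: number; p95DriftAlert: number };
  cache: { minHitRatioForOptimization: number; prefixStabilityCheck: boolean };
  export: { format: string; path: string; otelEndpoint: string };
}

function toEngineConfig(template: TemplateConfig): TelemetryConfig {
  const { hardCap, warnThreshold } = template.thresholds;
  // Assumed sanity check: a warning above the hard cap would never fire usefully.
  if (warnThreshold > hardCap) {
    throw new Error('warnThreshold must not exceed hardCap');
  }
  return {
    pricing: template.pricing,
    hardCap,
    warnThreshold,
    outputLogPath: template.export.path,
  };
}

const templateExample: TemplateConfig = {
  pricing: { 'claude-sonnet-4-6': { inputPerMillion: 3.0, outputPerMillion: 15.0 } },
  thresholds: { hardCap: 0.05, warnThreshold: 0.02, p95DriftAlert: 0.3 },
  cache: { minHitRatioForOptimization: 0.6, prefixStabilityCheck: true },
  export: { format: 'jsonl', path: './data/telemetry/runs.jsonl', otelEndpoint: 'http://localhost:4317' },
};

const engineConfig = toEngineConfig(templateExample);
console.log(engineConfig);
```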

Quick Start Guide

Initialize the engine: Import CostAttributionEngine and pass the configuration object. Ensure pricing data matches your provider's current rate card.
Tag at ingress: Call initiateRun() with feature, userId, templateVersion, and model before any LLM or embedding call. Store the returned runId in request context.
Estimate before execution: Run estimateBudget() with expected token counts. Implement conditional logic to downgrade models or truncate context if thresholds are breached.
Finalize with cache data: After the provider response, extract cache headers and call finalizeRun(). Route the output to your telemetry backend for aggregation and P95 analysis.
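One way to keep the run ID in request context without threading it through every function signature is Node's `AsyncLocalStorage`. The sketch below is an assumption about how you might wire it, not part of the engine above; `RunContext`, `withRun`, `currentRun`, and `callModel` are hypothetical names, and the model call is a stand-in.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

interface RunContext {
  runId: string;
  feature: string;
}

const runContext = new AsyncLocalStorage<RunContext>();

// Reads the context set at ingress; fails loudly if a call path skipped tagging.
function currentRun(): RunContext {
  const ctx = runContext.getStore();
  if (!ctx) throw new Error('No run context: call withRun() at ingress first');
  return ctx;
}

// Runs the handler with the given context visible to everything it calls,
// including code that executes after an await.
function withRun<T>(ctx: RunContext, handler: () => T): T {
  return runContext.run(ctx, handler);
}

// Stand-in for a downstream LLM or tool call that needs the attribution keys.
async function callModel(prompt: string): Promise<string> {
  await Promise.resolve(); // simulated network I/O
  const { runId, feature } = currentRun();
  return `${feature}:${runId}:${prompt.length}`;
}

withRun({ runId: 'run-123', feature: 'document-summarization' }, () => callModel('hello'))
  .then((tag) => console.log(tag));
```

Calling `currentRun()` outside `withRun` throws, which surfaces untagged code paths early instead of producing telemetry with missing attribution keys.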

Cost attribution is not an accounting exercise; it is an architectural feedback loop. When you can trace every dollar to a specific workflow, user segment, and prompt template, optimization stops being guesswork and becomes a deterministic engineering process.
