AI Cost Attribution: LLM Chargeback by Business Unit

By Codcompass Team·2026-06-01·8 min read

Deterministic LLM Cost Attribution: Gateway Enforcement and OpenTelemetry Integration

Current Situation Analysis

The fundamental friction in modern AI FinOps stems from a structural mismatch: provider billing exports are aggregate, while internal accountability is request-level. OpenAI and Amazon Bedrock invoices deliver monthly totals, model identifiers, and broad token categories, but they contain zero knowledge of your internal cost centers, product lines, or environment boundaries. When engineering teams attempt to map account-level invoices to business-unit P&L statements, the attribution breaks down.

This problem is routinely overlooked because organizations treat provider exports as the source of truth rather than a raw input. At sub-$5,000 monthly spend, manual spreadsheets or rough percentage splits mask the inaccuracy. Once spend crosses the $10,000–$25,000 threshold across multiple teams, the noise becomes financially material. A 7% attribution variance on a $60,000 monthly bill translates to $4,200 in misallocated costs. That margin is large enough to distort unit economics, trigger budget overruns, and force finance teams to issue repeated adjustment entries during month-end close.

The issue compounds when platform infrastructure is shared. Without strict runtime tagging, non-production traffic (staging experiments, load tests, developer sandboxes) bleeds into production cost reports. Business-unit leaders are charged for unauthorized work, engineering disputes the numbers, and procurement lacks the granularity to negotiate committed use discounts or reserved capacity. The industry standard for solving this is shifting from post-hoc invoice parsing to deterministic, request-level telemetry captured at the network boundary.

WOW Moment: Key Findings

The most impactful insight for FinOps and platform engineering is that attribution accuracy is not a function of logging volume, but of enforcement timing. Capturing metadata at the gateway before the request leaves your network yields deterministic chargeback data, while app-level logging or provider export allocation introduces probabilistic gaps.

Implementation Pattern	Request-Level Accuracy	Audit Trail Depth	Engineering Overhead	Reconciliation Latency
App-Level Logging	60–75%	Low	Medium	High (days)
Gateway-Enforced Telemetry	95–99%	High	Medium	Low (hours)
Provider Export Allocation	40–60%	Medium	Low	Very High (weeks)

This finding matters because it shifts cost attribution from a retrospective accounting exercise to a real-time platform capability. Gateway enforcement guarantees that every outbound LLM call carries immutable ownership metadata. When paired with OpenTelemetry semantic conventions, it creates a single source of truth that satisfies both operational monitoring and financial audit requirements. Finance teams receive defensible per-request cost events, engineering retains low-latency observability, and procurement gains the granularity needed to optimize model routing and reserved capacity.

Core Solution

Building a finance-ready attribution pipeline requires three architectural decisions: enforce metadata at the proxy layer, instrument with OpenTelemetry GenAI conventions, and decouple pricing calculation from request execution.

Step 1: Gateway Enforcement Layer

Place a lightweight proxy in front of all LLM provider endpoints. This proxy intercepts outbound calls, validates required attribution headers, and rejects or quarantines requests missing mandatory fields. Centralizing enforcement eliminates schema drift across application teams.

Step 2: OpenTelemetry Instrumentation

Emit spans using the official GenAI semantic conventions. These conventions standardize how token usage, model identifiers, and provider names are recorded, ensuring traces remain comparable across internal services and external monitoring tools.

Step 3: Deterministic Pricing Engine

Provider pricing changes frequently, so cache rates, token subtypes, and regional multipliers must be versioned. The pricing engine should calculate computed_cost_usd at request time from a versioned snapshot of rates rather than a live lookup, so that month-end calculations are reproducible.
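The article's example imports a PricingEngine without showing it. One possible shape is sketched below, under stated assumptions: the rate-file layout, the field names (inputPerMTok and friends), the version labels, and every number are hypothetical placeholders, not real provider prices. The sketch also assumes that the input token count already includes cached tokens, as OpenAI's prompt_tokens does, so cached tokens are subtracted before the standard input rate is applied.

```typescript
// Per-million-token rates for one model. Field names are illustrative.
interface ModelRates {
  inputPerMTok: number;
  outputPerMTok: number;
  cacheReadPerMTok: number;
}

// A snapshot resolved for a single model at a point in time.
interface PricingSnapshot {
  version: string;
  rates: ModelRates;
}

// Hypothetical on-disk rate file, one per pricing version.
interface RateFile {
  version: string;
  effectiveFrom: string; // ISO-8601 UTC, inclusive
  models: Record<string, ModelRates>;
}

// Placeholder numbers, NOT real provider prices.
const RATE_FILES: RateFile[] = [
  {
    version: "example-v1",
    effectiveFrom: "2026-01-01T00:00:00Z",
    models: { "example-model": { inputPerMTok: 2.0, outputPerMTok: 8.0, cacheReadPerMTok: 0.5 } },
  },
  {
    version: "example-v2",
    effectiveFrom: "2026-04-01T00:00:00Z",
    models: { "example-model": { inputPerMTok: 1.0, outputPerMTok: 4.0, cacheReadPerMTok: 0.25 } },
  },
];

// Pick the newest rate file that was already in effect at the request timestamp.
function getSnapshot(model: string, atIso: string): PricingSnapshot {
  const at = Date.parse(atIso);
  let best: RateFile | undefined;
  for (const file of RATE_FILES) {
    const from = Date.parse(file.effectiveFrom);
    const covers = from <= at && file.models[model] !== undefined;
    if (covers && (best === undefined || from > Date.parse(best.effectiveFrom))) {
      best = file;
    }
  }
  if (best === undefined) {
    throw new Error(`no pricing snapshot for ${model} at ${atIso}`);
  }
  return { version: best.version, rates: best.models[model] };
}

// Assumes inputTokens includes cacheReadTokens, so cached tokens are billed
// once at the cache-read rate and not again at the standard input rate.
function calculate(
  inputTokens: number,
  outputTokens: number,
  cacheReadTokens: number,
  snapshot: PricingSnapshot
): number {
  const { inputPerMTok, outputPerMTok, cacheReadPerMTok } = snapshot.rates;
  const uncachedInput = Math.max(inputTokens - cacheReadTokens, 0);
  const total =
    uncachedInput * inputPerMTok +
    cacheReadTokens * cacheReadPerMTok +
    outputTokens * outputPerMTok;
  return total / 1e6;
}
```

Because the snapshot is selected by request timestamp rather than by "now", re-running the same request against the archived rate files should reproduce the same cost, which is the property month-end close depends on.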

Step 4: Durable Event Persistence

OTel spans are optimized for observability, not financial reporting. Persist normalized cost events to a columnar warehouse (BigQuery, Snowflake, Redshift) alongside the trace ID. This dual-write pattern links operational debugging with financial reconciliation.
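A rough sketch of the warehouse side of the dual write follows. The CostEvent fields mirror the ones used elsewhere in this article; CostEventBuffer and the BatchWriter callback are hypothetical names, and the callback stands in for whatever bulk-insert client your warehouse provides. Timer-based flushing and failure handling are left out to keep the example short.

```typescript
// Normalized cost event, one row per billable LLM request.
interface CostEvent {
  trace_id: string; // links the row back to the OTel trace
  request_id: string;
  cost_center: string;
  environment: string;
  model: string;
  input_tokens: number;
  output_tokens: number;
  cache_read_tokens: number;
  pricing_rule_version: string;
  computed_cost_usd: number;
  timestamp: string;
}

// Hypothetical sink: in practice this would wrap a warehouse bulk insert.
type BatchWriter = (rows: CostEvent[]) => void;

class CostEventBuffer {
  private pending: CostEvent[] = [];

  constructor(
    private readonly batchSize: number,
    private readonly write: BatchWriter
  ) {}

  // Queue an event and flush as soon as a full batch is available.
  add(event: CostEvent): void {
    this.pending.push(event);
    if (this.pending.length >= this.batchSize) {
      this.flush();
    }
  }

  // Write whatever is queued; call this on an interval and on shutdown too.
  flush(): void {
    if (this.pending.length === 0) return;
    const rows = this.pending;
    this.pending = [];
    this.write(rows);
  }
}
```

Keeping the buffer separate from the span lifecycle means a slow warehouse write does not have to block the response to the caller, although a production version would still need a retry or dead-letter path for failed batches.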

Implementation Example (TypeScript)

import { trace, SpanStatusCode } from "@opentelemetry/api";
import { Request, Response, NextFunction } from "express";
import { PricingEngine } from "./pricing-engine";
import { CostEventPublisher } from "./cost-event-publisher";

const tracer = trace.getTracer("llm-cost-attribution");

export class AttributionProxy {
  constructor(
    private pricingEngine: PricingEngine,
    private publisher: CostEventPublisher
  ) {}

  middleware = (req: Request, res: Response, next: NextFunction) => {
    const requiredHeaders = ["x-cost-center", "x-service-name", "x-deployment-env"];
    const missing = requiredHeaders.filter(h => !req.headers[h]);
    
    if (missing.length > 0) {
      return res.status(400).json({
        error: "Attribution metadata missing",
        required: requiredHeaders
      });
    }

    const span = tracer.startSpan("llm.gateway.request");
    span.setAttribute("gen_ai.provider.name", req.body.provider || "openai");
    span.setAttribute("gen_ai.request.model", req.body.model);
    span.setAttribute("app.cost_center", String(req.headers["x-cost-center"]));
    span.setAttribute("app.service", String(req.headers["x-service-name"]));
    span.setAttribute("app.environment", String(req.headers["x-deployment-env"]));

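    // Stash the span on res.locals; Express's Request type has no `span` property.
    res.locals.span = span;
    next();
  };

  // `next` is accepted so the catch block below can forward errors to Express.
  async handleCompletion(req: Request, res: Response, next: NextFunction) {
    const span = res.locals.span as import("@opentelemetry/api").Span;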
    try {
      const providerResponse = await this.callProvider(req.body);
      
      const inputTokens = providerResponse.usage.prompt_tokens;
      const outputTokens = providerResponse.usage.completion_tokens;
      const cacheRead = providerResponse.usage.prompt_tokens_details?.cached_tokens || 0;

      span.setAttribute("gen_ai.usage.input_tokens", inputTokens);
      span.setAttribute("gen_ai.usage.output_tokens", outputTokens);
      span.setAttribute("gen_ai.usage.cache_read_tokens", cacheRead);

      const pricingSnapshot = this.pricingEngine.getSnapshot(
        req.body.model,
        new Date().toISOString()
      );

      const cost = this.pricingEngine.calculate(
        inputTokens,
        outputTokens,
        cacheRead,
        pricingSnapshot
      );

      await this.publisher.emit({
        trace_id: span.spanContext().traceId,
        span_id: span.spanContext().spanId,
        request_id: req.headers["x-request-id"] as string,
        cost_center: req.headers["x-cost-center"] as string,
        service: req.headers["x-service-name"] as string,
        environment: req.headers["x-deployment-env"] as string,
        provider: req.body.provider || "openai", // same default as the span attribute
        model: req.body.model,
        input_tokens: inputTokens,
        output_tokens: outputTokens,
        cache_read_tokens: cacheRead,
        unit_cost: pricingSnapshot.version, // holds the pricing snapshot version, not a dollar rate
        computed_cost_usd: cost,
        timestamp: new Date().toISOString()
      });

      span.setStatus({ code: SpanStatusCode.OK });
      span.end();
      res.json(providerResponse);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.end();
      next(err);
    }
  }

  private async callProvider(payload: any) {
    // Placeholder: replace with the provider SDK or HTTP call.
    // The stub mirrors the shape of an OpenAI-style `usage` object.
    return { usage: { prompt_tokens: 0, completion_tokens: 0, prompt_tokens_details: { cached_tokens: 0 } } };
  }
}
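
The handler above reads usage from a complete response body. For streamed responses, token counts typically arrive late in the stream; OpenAI's chat completions stream, for instance, attaches usage to a final chunk when stream_options.include_usage is enabled. A minimal sketch of deferring cost emission until the stream ends is shown below. StreamUsageCollector and the StreamChunk shape are hypothetical and simplified, not a provider SDK type.

```typescript
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Simplified chunk shape: most chunks carry text, a late chunk carries usage.
interface StreamChunk {
  text?: string;
  usage?: Usage | null;
}

class StreamUsageCollector {
  private usage: Usage | null = null;
  private closed = false;

  // Call for every chunk; remembers the most recent usage payload seen.
  onChunk(chunk: StreamChunk): void {
    if (this.closed) {
      throw new Error("chunk received after stream end");
    }
    if (chunk.usage) {
      this.usage = chunk.usage;
    }
  }

  // Call from the stream's end/close handler. Returns null when the provider
  // never reported usage, so the caller can quarantine the request instead of
  // emitting a zero-cost event.
  finish(): Usage | null {
    this.closed = true;
    return this.usage;
  }
}
```

The cost event would then be computed and published only from the value returned by finish(), never from partial counts observed mid-stream.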

Architecture Rationale

Why enforce at the gateway? Application teams inevitably drift on header naming, casing, or optional fields. A single proxy enforces a strict contract, rejects non-compliant traffic early, and guarantees schema consistency downstream.
Why dual-write OTel + warehouse? OTel excels at distributed tracing and latency analysis but lacks financial reconciliation features like versioned pricing snapshots or immutable cost totals. Persisting normalized events to a warehouse enables deterministic month-end closes while OTel handles operational debugging.
Why versioned pricing snapshots? Provider rates change mid-cycle. Computing cost at request time using a versioned snapshot (unit_cost) allows finance to re-run historical calculations if a pricing rule was misapplied, without altering original request data.

Pitfall Guide

Pitfall	Explanation	Fix
Token Category Collapse	Combining input, output, and cache tokens into a single `total_tokens` field destroys pricing accuracy. OpenAI and Bedrock price cache reads/writes differently from standard tokens.	Maintain separate fields (`input_tokens`, `output_tokens`, `cache_read_tokens`, `cache_write_tokens`) in both OTel attributes and warehouse schema.
Unversioned Pricing Rules	Hardcoding rates or fetching live prices during month-end close causes reconciliation drift when providers update pricing mid-cycle.	Store a `pricing_rule_version` or `unit_cost` snapshot per request. Re-run historical calculations against archived rule files during financial close.
Retry Double-Counting	Logging every retry attempt as a successful request inflates costs. Failed retries should not bill, or should bill at a reduced multiplier.	Attach a `parent_request_id` to all retries. Only emit cost events for the final successful attempt, or apply a configurable retry billing policy in the aggregation layer.
Mutable Team Identifiers	Using human-readable team names (`growth_team`, `risk_v2`) as primary keys breaks historical attribution when teams rebrand or restructure.	Use immutable UUIDs for cost centers. Maintain a separate lookup table mapping UUIDs to current display names for reporting.
Streaming Token Latency	Streaming responses emit partial usage data. Capturing token counts before the stream closes results in underreported costs.	Attach a completion listener to the stream. Only extract and emit token usage after the `end` or `close` event fires.
Environment Bleed	Staging, QA, and developer sandboxes share provider accounts without strict tagging, leaking non-production costs into business-unit P&L.	Enforce a controlled vocabulary for `x-deployment-env` at the gateway. Reject requests with unrecognized environment tags or route them to a dedicated cost center.
Ignoring Cache Write Tokens	Some providers bill cache writes at their own rate (Anthropic models on Amazon Bedrock, for example), and which providers do so changes over time. Omitting this field underreports costs for high-throughput caching workloads.	Explicitly map provider cache write fields to a dedicated warehouse column. Include cache write pricing in the pricing engine snapshot, set to zero where the provider does not charge for writes.
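
As an illustration of the retry fix in the table above, the sketch below applies a "final success only" policy in the aggregation layer. The AttemptEvent shape and the billableEvents function are hypothetical; the sketch assumes every retry carries the original request's ID in parent_request_id and that timestamps are ISO-8601 strings in UTC, so plain string comparison orders them correctly.

```typescript
interface AttemptEvent {
  request_id: string;
  parent_request_id: string | null; // null for the first attempt
  succeeded: boolean;
  computed_cost_usd: number;
  timestamp: string; // ISO-8601 UTC
}

// Keep at most one billable event per logical request: the latest successful
// attempt. Failed attempts are dropped from the chargeback view entirely.
function billableEvents(attempts: AttemptEvent[]): AttemptEvent[] {
  const latest = new Map<string, AttemptEvent>();
  for (const attempt of attempts) {
    if (!attempt.succeeded) continue;
    // Retries are grouped under the ID of the request they retry.
    const key = attempt.parent_request_id ?? attempt.request_id;
    const previous = latest.get(key);
    if (previous === undefined || attempt.timestamp > previous.timestamp) {
      latest.set(key, attempt);
    }
  }
  return Array.from(latest.values());
}
```

If the provider does bill for tokens consumed by a failed attempt, a different policy is needed; keeping the raw attempts in the warehouse and applying the policy at aggregation time leaves that choice reversible.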

Production Bundle

Action Checklist

Define attribution schema: Establish mandatory headers (x-cost-center, x-service-name, x-deployment-env) and enforce them at the proxy layer.
Configure OpenTelemetry GenAI exporters: Set up span processors to emit gen_ai.usage.* and gen_ai.request.* attributes to your observability backend.
Implement versioned pricing engine: Build a pricing calculator that loads rate files by version and computes computed_cost_usd at request time.
Design warehouse schema: Create a fact table with separate columns for input, output, cache read, and cache write tokens, plus a pricing_rule_version field.
Establish reconciliation cadence: Run daily ingestion jobs, weekly variance reviews against provider invoices, and month-end close with explicit adjustment rules.
Quarantine non-compliant traffic: Route requests missing required attribution metadata to a dead-letter queue for engineering remediation before billing.
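
The first and last checklist items can be combined into one decision function, sketched here under stated assumptions. evaluateAttribution, HeaderBag, and the Decision type are hypothetical names; the environment list matches the configuration template in this article. Header names are looked up in lowercase because Node.js lowercases incoming header names.

```typescript
type HeaderBag = Record<string, string | string[] | undefined>;

type Decision =
  | { action: "accept"; costCenter: string; service: string; environment: string }
  | { action: "quarantine"; reasons: string[] };

// Controlled vocabulary for x-deployment-env.
const ALLOWED_ENVIRONMENTS = new Set(["prod", "staging", "dev"]);

// Return the first value of a header, trimmed, or "" when absent.
function headerValue(headers: HeaderBag, name: string): string {
  const raw = headers[name];
  const first = Array.isArray(raw) ? raw[0] : raw;
  return (first ?? "").trim();
}

// Decide whether a request may be billed as-is or must be quarantined.
function evaluateAttribution(headers: HeaderBag): Decision {
  const costCenter = headerValue(headers, "x-cost-center");
  const service = headerValue(headers, "x-service-name");
  const environment = headerValue(headers, "x-deployment-env");

  const reasons: string[] = [];
  if (costCenter === "") reasons.push("missing x-cost-center");
  if (service === "") reasons.push("missing x-service-name");
  if (environment === "") {
    reasons.push("missing x-deployment-env");
  } else if (!ALLOWED_ENVIRONMENTS.has(environment)) {
    reasons.push(`unrecognized environment: ${environment}`);
  }

  if (reasons.length > 0) {
    return { action: "quarantine", reasons };
  }
  return { action: "accept", costCenter, service, environment };
}
```

Whether a quarantined request is rejected outright or forwarded and billed to a holding cost center is a policy choice; returning the reasons lets either path log why the request failed the contract.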

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Single-team pilot under $5k/mo	App-level logging + manual reconciliation	Low overhead, fast iteration, attribution errors are financially immaterial	Minimal engineering cost, moderate finance time
Multi-team production $10k–$50k/mo	Gateway-enforced telemetry + OTel + warehouse	Deterministic attribution, audit-ready, prevents environment bleed and schema drift	Medium engineering setup, low ongoing finance overhead
Regulated enterprise $50k+/mo	Gateway proxy + OTel + versioned pricing + alignment with AWS Cost and Usage Report (CUR) 2.0	Meets compliance requirements, supports reserved capacity negotiations, enables provider-level cost allocation tags	High initial architecture cost, significant long-term savings through accurate chargeback

Configuration Template

# attribution-proxy-config.yaml
gateway:
  enforcement:
    required_headers:
      - x-cost-center
      - x-service-name
      - x-deployment-env
    allowed_environments:
      - prod
      - staging
      - dev
    reject_on_missing: true

telemetry:
  otel:
    service_name: "llm-attribution-gateway"
    exporter: "otlp"
    endpoint: "http://otel-collector:4317"
    gen_ai_conventions: true

pricing:
  snapshot_dir: "./pricing-rules"
  default_version: "openai-2024-06"
  cache_multiplier: 0.1
  retry_billing_policy: "final_success_only"

warehouse:
  target: "snowflake"
  schema: "finops"
  table: "llm_cost_events"
  batch_size: 500
  flush_interval_ms: 2000

Quick Start Guide

Deploy the proxy: Run the attribution gateway as a sidecar or dedicated service in front of your LLM SDK calls. Configure the required headers and environment allowlist in the YAML template.
Initialize OTel: Attach the OpenTelemetry SDK to your runtime. Configure the OTLP exporter to point to your collector. Verify that gen_ai.request.model and gen_ai.usage.* attributes appear in your trace viewer.
Load pricing rules: Place versioned JSON rate files in the pricing-rules directory. Each file should map model identifiers to per-million-token rates for input, output, cache read, and cache write.
Run first reconciliation: Execute a 24-hour test run. Query the warehouse for computed_cost_usd grouped by cost_center (the column populated from the x-cost-center header). Compare the sum against the provider's usage or billing export for the same 24-hour window. Investigate any variance above 2% using trace IDs to pinpoint missing cache tokens or retry miscounts.
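
The reconciliation step can be sketched as a small function over rows pulled from the warehouse. The CostRow shape and the reconcile function are hypothetical, the 2% threshold is the one suggested above, and the figures in the usage note are made up for illustration.

```typescript
interface CostRow {
  cost_center: string;
  computed_cost_usd: number;
}

interface ReconciliationResult {
  totals: Map<string, number>; // computed USD per cost center
  computedTotalUsd: number;
  variancePercent: number; // absolute difference relative to the invoice
  needsInvestigation: boolean;
}

// Compare warehouse-computed cost against the provider total for the same window.
function reconcile(
  rows: CostRow[],
  invoiceTotalUsd: number,
  thresholdPercent: number = 2
): ReconciliationResult {
  if (invoiceTotalUsd <= 0) {
    throw new Error("invoice total must be positive");
  }
  const totals = new Map<string, number>();
  let computedTotalUsd = 0;
  for (const row of rows) {
    totals.set(row.cost_center, (totals.get(row.cost_center) ?? 0) + row.computed_cost_usd);
    computedTotalUsd += row.computed_cost_usd;
  }
  const variancePercent =
    (Math.abs(computedTotalUsd - invoiceTotalUsd) / invoiceTotalUsd) * 100;
  return {
    totals,
    computedTotalUsd,
    variancePercent,
    needsInvestigation: variancePercent > thresholdPercent,
  };
}
```

For example, rows summing to $195 against a $200 provider total give a variance of 2.5%, which exceeds the 2% threshold and would be flagged for a trace-level investigation.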
