lls, validates required attribution headers, and rejects or quarantines requests missing mandatory fields. Centralizing enforcement eliminates schema drift across application teams.
Step 2: OpenTelemetry Instrumentation
Emit spans using the official GenAI semantic conventions. These conventions standardize how token usage, model identifiers, and provider names are recorded, ensuring traces remain comparable across internal services and external monitoring tools.
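As a minimal sketch of what that standardization looks like, the helper below flattens one LLM call into the attribute keys used in this article. `LlmCallInfo` and `toGenAiAttributes` are hypothetical names, and the cache attribute key should be checked against the semantic conventions version your collector expects.

```typescript
// Flat attribute map, in the shape an OpenTelemetry span accepts.
type SpanAttributes = Record<string, string | number>;

// Hypothetical description of one provider call.
interface LlmCallInfo {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
}

// Maps a call to the attribute keys used in this article.
// Token categories stay separate so pricing can treat them differently.
function toGenAiAttributes(call: LlmCallInfo): SpanAttributes {
  return {
    "gen_ai.provider.name": call.provider,
    "gen_ai.request.model": call.model,
    "gen_ai.usage.input_tokens": call.inputTokens,
    "gen_ai.usage.output_tokens": call.outputTokens,
    "gen_ai.usage.cache_read_tokens": call.cacheReadTokens ?? 0,
  };
}

// "example-model" is a placeholder, not a real model identifier.
const attrs = toGenAiAttributes({
  provider: "openai",
  model: "example-model",
  inputTokens: 1200,
  outputTokens: 300,
});
```

With the OpenTelemetry API, a map like this can be passed to `span.setAttributes(attrs)` in one call, which keeps the key names in a single place.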
Step 3: Deterministic Pricing Engine
Provider pricing changes frequently. Cache rates, token subtypes, and regional multipliers must be versioned. The pricing engine should calculate computed_cost_usd at request time from a versioned snapshot of rates, not a live lookup, so that month-end calculations are reproducible.
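A minimal sketch of such an engine follows. The snapshot shape, the version names, and every rate are made up for illustration and are not real provider prices. It also assumes the input token count includes cached tokens (as OpenAI's prompt_tokens does), so cached tokens are subtracted before the standard input rate is applied.

```typescript
// One versioned rate card for one model. Rates are USD per million tokens.
interface PricingSnapshot {
  version: string;
  model: string;
  effectiveFrom: string; // ISO-8601 UTC timestamp
  inputPerMillion: number;
  outputPerMillion: number;
  cacheReadPerMillion: number;
}

// Illustrative values only; load real ones from versioned rule files.
const SNAPSHOTS: PricingSnapshot[] = [
  { version: "example-2024-06", model: "example-model", effectiveFrom: "2024-06-01T00:00:00Z",
    inputPerMillion: 5, outputPerMillion: 15, cacheReadPerMillion: 2.5 },
  { version: "example-2024-09", model: "example-model", effectiveFrom: "2024-09-01T00:00:00Z",
    inputPerMillion: 2.5, outputPerMillion: 10, cacheReadPerMillion: 1.25 },
];

// Returns the latest snapshot for the model that was in effect at `at`.
function getSnapshot(model: string, at: string): PricingSnapshot {
  const t = Date.parse(at);
  let best: PricingSnapshot | undefined;
  for (const s of SNAPSHOTS) {
    if (s.model !== model || Date.parse(s.effectiveFrom) > t) continue;
    if (!best || Date.parse(s.effectiveFrom) > Date.parse(best.effectiveFrom)) best = s;
  }
  if (!best) throw new Error(`No pricing snapshot for ${model} at ${at}`);
  return best;
}

// Pure function of its inputs: same tokens + same snapshot = same cost.
function calculate(
  inputTokens: number,
  outputTokens: number,
  cacheReadTokens: number,
  snap: PricingSnapshot
): number {
  const uncachedInput = Math.max(inputTokens - cacheReadTokens, 0);
  const total =
    uncachedInput * snap.inputPerMillion +
    cacheReadTokens * snap.cacheReadPerMillion +
    outputTokens * snap.outputPerMillion;
  return total / 1e6;
}

const snap = getSnapshot("example-model", "2024-07-15T12:00:00Z");
const cost = calculate(1000, 500, 400, snap);
```

Because `calculate` reads nothing but its arguments, storing the snapshot version next to the token counts is enough to reproduce the figure later.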
Step 4: Durable Event Persistence
OTel spans are optimized for observability, not financial reporting. Persist normalized cost events to a columnar warehouse (BigQuery, Snowflake, Redshift) alongside the trace ID. This dual-write pattern links operational debugging with financial reconciliation.
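One way to sketch the warehouse side of the dual write is a small in-memory buffer that hands back a batch once it is full. The `CostEvent` fields mirror the event emitted in the implementation example below; `CostEventBuffer`, `sampleEvent`, and all sample values are hypothetical, and the actual warehouse write (an async call in practice) is left to the caller.

```typescript
// Normalized row destined for the warehouse. trace_id links it to the OTel trace.
interface CostEvent {
  trace_id: string;
  request_id: string;
  cost_center: string;
  model: string;
  input_tokens: number;
  output_tokens: number;
  cache_read_tokens: number;
  unit_cost: string; // pricing snapshot version
  computed_cost_usd: number;
  timestamp: string;
}

// Collects events and returns a full batch when batchSize is reached.
class CostEventBuffer {
  private rows: CostEvent[] = [];
  private readonly batchSize: number;

  constructor(batchSize: number) {
    this.batchSize = batchSize;
  }

  // Returns a batch to write when the buffer is full, otherwise null.
  add(row: CostEvent): CostEvent[] | null {
    this.rows.push(row);
    return this.rows.length >= this.batchSize ? this.drain() : null;
  }

  // Returns whatever is buffered (for timed flushes or shutdown) and resets.
  drain(): CostEvent[] {
    const batch = this.rows;
    this.rows = [];
    return batch;
  }
}

// Placeholder values for illustration.
function sampleEvent(n: number): CostEvent {
  return {
    trace_id: "0af7651916cd43dd8448eb211c80319c",
    request_id: `req-${n}`,
    cost_center: "cc-a",
    model: "example-model",
    input_tokens: 1000,
    output_tokens: 500,
    cache_read_tokens: 0,
    unit_cost: "example-2024-06",
    computed_cost_usd: 0.0125,
    timestamp: "2024-07-15T12:00:00Z",
  };
}

const buffer = new CostEventBuffer(2);
const first = buffer.add(sampleEvent(1));  // buffer not yet full
const second = buffer.add(sampleEvent(2)); // batch of two is handed back
buffer.add(sampleEvent(3));
const rest = buffer.drain();               // remaining row, e.g. on a timer tick
```

Keeping the buffer free of I/O makes the batching rule easy to test; a timer or shutdown hook can call `drain()` so that low-traffic periods still flush.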
Implementation Example (TypeScript)
import { trace, SpanStatusCode } from "@opentelemetry/api";
import { Request, Response, NextFunction } from "express";
import { PricingEngine } from "./pricing-engine";
import { CostEventPublisher } from "./cost-event-publisher";
const tracer = trace.getTracer("llm-cost-attribution");
export class AttributionProxy {
constructor(
private pricingEngine: PricingEngine,
private publisher: CostEventPublisher
) {}
middleware = (req: Request, res: Response, next: NextFunction) => {
const requiredHeaders = ["x-cost-center", "x-service-name", "x-deployment-env"];
const missing = requiredHeaders.filter(h => !req.headers[h]);
if (missing.length > 0) {
return res.status(400).json({
error: "Attribution metadata missing",
required: requiredHeaders
});
}
const span = tracer.startSpan("llm.gateway.request");
span.setAttribute("gen_ai.provider.name", req.body.provider || "openai");
span.setAttribute("gen_ai.request.model", req.body.model);
span.setAttribute("app.cost_center", String(req.headers["x-cost-center"]));
span.setAttribute("app.service", String(req.headers["x-service-name"]));
span.setAttribute("app.environment", String(req.headers["x-deployment-env"]));
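// Express's Request type has no `span` property; res.locals is its per-request store.
res.locals.span = span;
next();
};
async handleCompletion(req: Request, res: Response, next: NextFunction) {
const span = res.locals.span as import("@opentelemetry/api").Span;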
try {
const providerResponse = await this.callProvider(req.body);
const inputTokens = providerResponse.usage.prompt_tokens;
const outputTokens = providerResponse.usage.completion_tokens;
const cacheRead = providerResponse.usage.prompt_tokens_details?.cached_tokens || 0;
span.setAttribute("gen_ai.usage.input_tokens", inputTokens);
span.setAttribute("gen_ai.usage.output_tokens", outputTokens);
span.setAttribute("gen_ai.usage.cache_read_tokens", cacheRead);
const pricingSnapshot = this.pricingEngine.getSnapshot(
req.body.model,
new Date().toISOString()
);
const cost = this.pricingEngine.calculate(
inputTokens,
outputTokens,
cacheRead,
pricingSnapshot
);
await this.publisher.emit({
trace_id: span.spanContext().traceId,
span_id: span.spanContext().spanId,
request_id: req.headers["x-request-id"] as string,
cost_center: req.headers["x-cost-center"] as string,
service: req.headers["x-service-name"] as string,
environment: req.headers["x-deployment-env"] as string,
provider: req.body.provider || "openai", // same default as the span attribute
model: req.body.model,
input_tokens: inputTokens,
output_tokens: outputTokens,
cache_read_tokens: cacheRead,
unit_cost: pricingSnapshot.version,
computed_cost_usd: cost,
timestamp: new Date().toISOString()
});
span.setStatus({ code: SpanStatusCode.OK });
span.end();
res.json(providerResponse);
} catch (err) {
span.recordException(err as Error);
span.setStatus({ code: SpanStatusCode.ERROR });
span.end();
next(err);
}
}
private async callProvider(payload: any) {
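// Provider SDK or HTTP call implementation goes here.
// Placeholder response; cached_tokens is included so the usage shape type-checks.
return { usage: { prompt_tokens: 0, completion_tokens: 0, prompt_tokens_details: { cached_tokens: 0 } } };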
}
}
Architecture Rationale
- Why enforce at the gateway? Application teams inevitably drift on header naming, casing, or optional fields. A single proxy enforces a strict contract, rejects non-compliant traffic early, and guarantees schema consistency downstream.
- Why dual-write OTel + warehouse? OTel excels at distributed tracing and latency analysis but lacks financial reconciliation features like versioned pricing snapshots or immutable cost totals. Persisting normalized events to a warehouse enables deterministic month-end closes while OTel handles operational debugging.
- Why versioned pricing snapshots? Provider rates change mid-cycle. Computing cost at request time using a versioned snapshot (unit_cost) allows finance to re-run historical calculations if a pricing rule was misapplied, without altering original request data.
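The re-run described in the last point can be sketched as a pure function over stored events. `StoredEvent`, `Rates`, `reprice`, and the sample figures are hypothetical; the sketch assumes input_tokens includes cached tokens and that corrected rates are keyed by the version stored in unit_cost.

```typescript
// Subset of a stored cost event needed to recompute its cost.
interface StoredEvent {
  request_id: string;
  input_tokens: number;
  output_tokens: number;
  cache_read_tokens: number;
  unit_cost: string; // pricing snapshot version recorded at request time
  computed_cost_usd: number;
}

// Corrected rates in USD per million tokens.
interface Rates {
  input: number;
  output: number;
  cacheRead: number;
}

interface Adjustment {
  request_id: string;
  original_cost_usd: number;
  recomputed_cost_usd: number;
  delta_usd: number;
}

// Recomputes cost from the stored token counts. The input events are not
// mutated; the result is a separate list of adjustments.
function reprice(events: StoredEvent[], corrected: Record<string, Rates>): Adjustment[] {
  return events.map((e) => {
    const r = corrected[e.unit_cost];
    if (!r) throw new Error(`No corrected rates for version ${e.unit_cost}`);
    const uncached = Math.max(e.input_tokens - e.cache_read_tokens, 0);
    const recomputed =
      (uncached * r.input + e.cache_read_tokens * r.cacheRead + e.output_tokens * r.output) / 1e6;
    return {
      request_id: e.request_id,
      original_cost_usd: e.computed_cost_usd,
      recomputed_cost_usd: recomputed,
      delta_usd: recomputed - e.computed_cost_usd,
    };
  });
}

// Made-up example: the event was priced with an input rate of 5 instead of 4.
const stored: StoredEvent[] = [
  { request_id: "req-1", input_tokens: 2000000, output_tokens: 1000000,
    cache_read_tokens: 0, unit_cost: "example-v1", computed_cost_usd: 25 },
];
const adjustments = reprice(stored, { "example-v1": { input: 4, output: 15, cacheRead: 2 } });
```

Emitting adjustments as their own rows, rather than overwriting computed_cost_usd, keeps the original request data intact for audit.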
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Token Category Collapse | Combining input, output, and cache tokens into a single total_tokens field destroys pricing accuracy. OpenAI and Bedrock price cache reads/writes differently from standard tokens. | Maintain separate fields (input_tokens, output_tokens, cache_read_tokens, cache_write_tokens) in both OTel attributes and warehouse schema. |
| Unversioned Pricing Rules | Hardcoding rates or fetching live prices during month-end close causes reconciliation drift when providers update pricing mid-cycle. | Store a pricing_rule_version or unit_cost snapshot per request. Re-run historical calculations against archived rule files during financial close. |
| Retry Double-Counting | Logging every retry attempt as a successful request inflates costs. Failed retries should not bill, or should bill at a reduced multiplier. | Attach a parent_request_id to all retries. Only emit cost events for the final successful attempt, or apply a configurable retry billing policy in the aggregation layer. |
| Mutable Team Identifiers | Using human-readable team names (growth_team, risk_v2) as primary keys breaks historical attribution when teams rebrand or restructure. | Use immutable UUIDs for cost centers. Maintain a separate lookup table mapping UUIDs to current display names for reporting. |
| Streaming Token Latency | Streaming responses emit partial usage data. Capturing token counts before the stream closes results in underreported costs. | Attach a completion listener to the stream. Only extract and emit token usage after the end or close event fires. |
| Environment Bleed | Staging, QA, and developer sandboxes share provider accounts without strict tagging, leaking non-production costs into business-unit P&L. | Enforce a controlled vocabulary for x-deployment-env at the gateway. Reject requests with unrecognized environment tags or route them to a dedicated cost center. |
| Ignoring Cache Write Tokens | Some providers bill cache writes separately from cache reads (Bedrock does for models that support prompt caching). Omitting this field underreports costs for high-throughput caching workloads. | Explicitly map provider cache write fields to a dedicated warehouse column. Include cache write pricing in the pricing engine snapshot. |
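The retry fix in the table can be sketched as a filter in the aggregation layer. The `Attempt` shape and the `finalSuccessOnly` function are hypothetical; the policy shown is the one the table calls "final successful attempt only", and whether a provider actually bills failed attempts is a separate question to check against invoices.

```typescript
// One provider call attempt. All retries of a request share parent_request_id.
interface Attempt {
  request_id: string;
  parent_request_id: string;
  attempt: number; // 1 for the first try, 2 for the first retry, ...
  succeeded: boolean;
  computed_cost_usd: number;
}

// Keeps, per parent request, only the highest-numbered successful attempt.
function finalSuccessOnly(attempts: Attempt[]): Attempt[] {
  const winners = new Map<string, Attempt>();
  for (const a of attempts) {
    if (!a.succeeded) continue;
    const current = winners.get(a.parent_request_id);
    if (!current || a.attempt > current.attempt) {
      winners.set(a.parent_request_id, a);
    }
  }
  return Array.from(winners.values());
}

// Sample data: p1 needed three tries, p2 succeeded immediately.
const billable = finalSuccessOnly([
  { request_id: "r1", parent_request_id: "p1", attempt: 1, succeeded: false, computed_cost_usd: 0 },
  { request_id: "r2", parent_request_id: "p1", attempt: 2, succeeded: false, computed_cost_usd: 0 },
  { request_id: "r3", parent_request_id: "p1", attempt: 3, succeeded: true, computed_cost_usd: 0.02 },
  { request_id: "r4", parent_request_id: "p2", attempt: 1, succeeded: true, computed_cost_usd: 0.01 },
]);
```

A different retry billing policy (for example, a reduced multiplier for failed attempts) would replace the `continue` with its own rule.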
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single-team pilot under $5k/mo | App-level logging + manual reconciliation | Low overhead, fast iteration, attribution errors are financially immaterial | Minimal engineering cost, moderate finance time |
| Multi-team production $10k–$50k/mo | Gateway-enforced telemetry + OTel + warehouse | Deterministic attribution, audit-ready, prevents environment bleed and schema drift | Medium engineering setup, low ongoing finance overhead |
| Regulated enterprise $50k+/mo | Gateway proxy + OTel + versioned pricing + CUR 2.0 alignment | Meets compliance requirements, supports reserved capacity negotiations, enables provider-level cost allocation tags | High initial architecture cost, significant long-term savings through accurate chargeback |
Configuration Template
# attribution-proxy-config.yaml
gateway:
  enforcement:
    required_headers:
      - x-cost-center
      - x-service-name
      - x-deployment-env
    allowed_environments:
      - prod
      - staging
      - dev
    reject_on_missing: true
telemetry:
  otel:
    service_name: "llm-attribution-gateway"
    exporter: "otlp"
    endpoint: "http://otel-collector:4317"
    gen_ai_conventions: true
pricing:
  snapshot_dir: "./pricing-rules"
  default_version: "openai-2024-06"
  cache_multiplier: 0.1
  retry_billing_policy: "final_success_only"
warehouse:
  target: "snowflake"
  schema: "finops"
  table: "llm_cost_events"
  batch_size: 500
  flush_interval_ms: 2000
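As a rough sketch of how the enforcement settings might be applied, the function below checks a header map against the required headers and the environment allowlist. `EnforcementConfig`, `HeaderMap`, and `checkAttribution` are hypothetical names, and loading the YAML into the config object is assumed to happen elsewhere.

```typescript
// Mirrors the enforcement keys of the template (required_headers, allowed_environments).
interface EnforcementConfig {
  requiredHeaders: string[];
  allowedEnvironments: string[];
}

type HeaderMap = Record<string, string | undefined>;

interface Verdict {
  ok: boolean;
  reason?: string;
}

function checkAttribution(headers: HeaderMap, cfg: EnforcementConfig): Verdict {
  // HTTP header names are case-insensitive, so compare on lowercased keys.
  const normalized: HeaderMap = {};
  for (const key of Object.keys(headers)) {
    normalized[key.toLowerCase()] = headers[key];
  }

  // Empty or whitespace-only values count as missing.
  const missing = cfg.requiredHeaders.filter((h) => !normalized[h]?.trim());
  if (missing.length > 0) {
    return { ok: false, reason: `missing headers: ${missing.join(", ")}` };
  }

  // Controlled vocabulary for the environment tag.
  const env = (normalized["x-deployment-env"] ?? "").trim();
  if (cfg.allowedEnvironments.indexOf(env) === -1) {
    return { ok: false, reason: `unrecognized environment: ${env}` };
  }
  return { ok: true };
}

const enforcement: EnforcementConfig = {
  requiredHeaders: ["x-cost-center", "x-service-name", "x-deployment-env"],
  allowedEnvironments: ["prod", "staging", "dev"],
};

const accepted = checkAttribution(
  { "X-Cost-Center": "cc-a", "x-service-name": "search", "x-deployment-env": "prod" },
  enforcement
);
const rejected = checkAttribution(
  { "x-cost-center": "cc-a", "x-service-name": "search", "x-deployment-env": "qa" },
  enforcement
);
```

Returning a reason string rather than throwing leaves the caller free to reject the request or route it to a dedicated cost center.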
Quick Start Guide
- Deploy the proxy: Run the attribution gateway as a sidecar or dedicated service in front of your LLM SDK calls. Configure the required headers and environment allowlist in the YAML template.
- Initialize OTel: Attach the OpenTelemetry SDK to your runtime. Configure the OTLP exporter to point to your collector. Verify that gen_ai.request.model and gen_ai.usage.* attributes appear in your trace viewer.
- Load pricing rules: Place versioned JSON rate files in the pricing-rules directory. Each file should map model identifiers to per-million-token rates for input, output, cache read, and cache write.
- Run first reconciliation: Execute a 24-hour test run. Query the warehouse for computed_cost_usd grouped by cost_center (the column populated from the x-cost-center header). Compare the sum against the provider invoice export. Investigate any variance above 2% using trace IDs to pinpoint missing cache tokens or retry miscounts.
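The comparison in the last step can be sketched as follows, working on rows already pulled from the warehouse. `CostRow`, the cost center labels, and the invoice total of 103 are made-up illustration values; the 2% threshold is the one named above.

```typescript
// One warehouse row, reduced to the columns the reconciliation needs.
interface CostRow {
  cost_center: string;
  computed_cost_usd: number;
}

// Sums computed_cost_usd per cost center.
function totalsByCostCenter(rows: CostRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row.cost_center, (totals.get(row.cost_center) ?? 0) + row.computed_cost_usd);
  }
  return totals;
}

// Absolute difference as a percentage of the invoice total.
function variancePercent(computedTotal: number, invoiceTotal: number): number {
  if (invoiceTotal === 0) throw new Error("invoice total must be non-zero");
  return (Math.abs(computedTotal - invoiceTotal) / invoiceTotal) * 100;
}

const rows: CostRow[] = [
  { cost_center: "cc-a", computed_cost_usd: 60 },
  { cost_center: "cc-a", computed_cost_usd: 15 },
  { cost_center: "cc-b", computed_cost_usd: 25 },
];

const totals = totalsByCostCenter(rows);
let computedTotal = 0;
totals.forEach((value) => {
  computedTotal += value;
});

const variance = variancePercent(computedTotal, 103); // 103 = hypothetical invoice total
const needsInvestigation = variance > 2;
```

In this sample the computed total is 100 against an invoice of 103, so the variance is just under 3% and the run would be flagged for investigation.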