How I track per-customer LLM costs in production
Multi-Tenant LLM Cost Attribution: Architecture, Implementation, and Budget Enforcement
Current Situation Analysis
Inference billing from major model providers operates on a strictly aggregated model. You receive a single invoice covering total input and output tokens across all endpoints, regions, and model variants. For early-stage applications, this abstraction works fine. Average cost-per-request multiplied by monthly active users yields a predictable margin. The problem emerges the moment your platform scales past a handful of tenants.
The industry pain point is attribution latency. Billing dashboards from OpenAI, Anthropic, and OpenRouter report total spend, but they do not natively segment costs by customer, workspace, or subscription tier in real time. Engineering teams are forced to reconcile provider invoices against internal subscription revenue using manual CSV exports and spreadsheet pivot tables. This approach breaks down under three conditions:
- Usage outliers: A single power user or enterprise trial account can trigger tens of thousands of document processing jobs overnight. At standard pricing tiers, a $19/month plan can absorb $50+ in inference costs in hours, destroying unit economics before the next billing cycle.
- Model pricing fragmentation: Different tenants use different models. GPT-4o, Claude Sonnet, and DeepSeek V3 have distinct per-million-token rates. Aggregated dashboards flatten these differences, making it impossible to calculate true margin per tenant.
- Lack of enforcement: Without real-time attribution, budget limits are reactive. You discover overspend after the invoice arrives, not during the request lifecycle.
Most teams overlook this because initial infrastructure focuses on request routing and response caching. Cost tracking is treated as a finance problem rather than an engineering constraint. By the time margin erosion becomes visible, the architecture lacks the telemetry hooks needed to isolate tenant-level consumption. Decoupling usage tracking from subscription billing is not optional in production; it is a prerequisite for sustainable LLM product economics.
WOW Moment: Key Findings
The shift from aggregate billing to per-tenant attribution fundamentally changes how you manage inference economics. The table below compares three common operational approaches across critical production metrics.
| Approach | Attribution Granularity | Alert Latency | Implementation Complexity | Cost Control Capability |
|---|---|---|---|---|
| Provider Dashboard | Account-level only | 24-72 hours | None | None (post-hoc only) |
| CSV Reconciliation | Tenant-level (manual) | 1-7 days | Low | Low (manual intervention) |
| Real-Time Attribution Pipeline | Tenant/Model/Request-level | <5 seconds | Medium | High (automated throttling/alerting) |
The real-time pipeline approach transforms cost tracking from a retrospective accounting exercise into an active control surface. You gain the ability to:
- Trigger Slack or PagerDuty alerts the moment a tenant crosses a configurable budget threshold
- Implement soft limits that degrade model quality or switch to cheaper alternatives before hard caps are hit
- Correlate API spend directly with subscription tiers, enabling accurate LTV/CAC calculations
- Identify abusive or misconfigured integrations before they impact provider invoices
This capability enables predictable margins, automated tenant onboarding with usage caps, and data-driven pricing adjustments without manual reconciliation.
Core Solution
Building a production-grade attribution pipeline requires intercepting inference requests, normalizing token consumption into monetary values, and evaluating budgets asynchronously. The architecture separates request handling from cost evaluation to prevent latency penalties.
Architecture Overview
- Request Interceptor: A lightweight middleware captures the tenant identifier, model selection, and request context before forwarding to the provider.
- Metadata Injection: Provider-specific parameters attach the tenant ID to the API call, enabling downstream attribution.
- Async Event Dispatch: Token usage is published to a durable event queue. This decouples cost tracking from the request-response cycle.
- Time-Series Storage: Usage records are persisted in a relational database optimized for aggregations and rolling window calculations.
- Budget Engine: A scheduled or event-driven worker evaluates cumulative spend against tenant limits and triggers alerts or enforcement actions.
Implementation Details
1. Metadata Injection Strategy
Provider APIs handle tenant attribution differently. OpenAI accepts a user field in the completion payload. Anthropic accepts a metadata.user_id field on the Messages API request body. OpenRouter supports metadata routing. The interceptor must normalize these differences.
```typescript
import { OpenAI } from "openai";
import Anthropic from "@anthropic-ai/sdk";

type ProviderConfig = {
  openai: OpenAI;
  anthropic: Anthropic;
};

export class InferenceRouter {
  private clients: ProviderConfig;

  constructor(clients: ProviderConfig) {
    this.clients = clients;
  }

  async routeCompletion(
    provider: "openai" | "anthropic",
    tenantId: string,
    model: string,
    messages: Array<{ role: "user" | "assistant"; content: string }>
  ) {
    if (provider === "openai") {
      return this.clients.openai.chat.completions.create({
        model,
        messages,
        user: `workspace_${tenantId}`, // Native attribution field
      });
    }
    if (provider === "anthropic") {
      return this.clients.anthropic.messages.create({
        model,
        max_tokens: 1024,
        messages,
        metadata: { user_id: `workspace_${tenantId}` }, // Native attribution field
      });
    }
    throw new Error(`Unsupported provider: ${provider}`);
  }
}
```
Why this structure: The router abstracts provider-specific metadata requirements behind a unified interface. This prevents business logic from coupling to vendor implementations and allows seamless provider switching without rewriting attribution logic.
2. Usage Normalization & Event Dispatch
Token counts must be converted to monetary values using current pricing tables. The conversion happens immediately after response completion, then the normalized record is dispatched asynchronously.
```typescript
import { Inngest } from "inngest";

const inngestClient = new Inngest({ id: "llm-cost-tracker" });

type UsageRecord = {
  tenantId: string;
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  timestamp: string;
  requestId: string;
};

const PRICING_TABLE: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10 }, // USD per 1M tokens
  "claude-3-5-sonnet": { input: 3, output: 15 },
  "deepseek-chat": { input: 0.14, output: 0.28 },
};

export function calculateCost(record: UsageRecord): number {
  const rates = PRICING_TABLE[record.model];
  if (!rates) return 0;
  const inputCost = (record.inputTokens / 1_000_000) * rates.input;
  const outputCost = (record.outputTokens / 1_000_000) * rates.output;
  return Number((inputCost + outputCost).toFixed(6));
}

export const trackUsageEvent = inngestClient.createFunction(
  { id: "ingest-usage" },
  { event: "inference/usage.created" },
  async ({ event, step }) => {
    const record = event.data as UsageRecord;
    const cost = calculateCost(record);
    await step.run("persist-usage", async () => {
      // Supabase insert with tenant_id, model, tokens, cost, timestamp
      // Uses upsert for idempotency on requestId
    });
    await step.run("evaluate-budget", async () => {
      // Query rolling 30-day spend for tenant
      // Compare against tenant.budget_limit
      // Trigger Slack alert if threshold exceeded
    });
  }
);
```
Why this structure: Cost calculation is isolated from persistence. This allows pricing tables to be updated without touching database logic. Inngest provides automatic retries, dead-letter queues, and scheduled execution, which eliminates the need for custom cron infrastructure.
3. Budget Evaluation & Alerting
Budget checks run asynchronously to avoid blocking the inference request. The engine queries cumulative spend over a configurable window (calendar month or rolling 30 days) and compares it against tenant limits.
```typescript
import { createClient } from "@supabase/supabase-js";

// Assumes SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY are set in the environment
const supabaseClient = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export async function evaluateTenantBudget(
  tenantId: string,
  windowDays: number = 30
): Promise<{ currentSpend: number; limit: number; exceeded: boolean }> {
  const cutoff = new Date();
  cutoff.setDate(cutoff.getDate() - windowDays);

  // Aggregation query against the Supabase usage table
  const { data: usageRecords } = await supabaseClient
    .from("inference_usage")
    .select("cost_usd")
    .eq("tenant_id", tenantId)
    .gte("created_at", cutoff.toISOString());

  const currentSpend =
    usageRecords?.reduce((sum, r) => sum + r.cost_usd, 0) ?? 0;

  const { data: tenant } = await supabaseClient
    .from("tenant_budgets")
    .select("budget_limit")
    .eq("tenant_id", tenantId)
    .single();

  const limit = tenant?.budget_limit ?? Infinity;

  return {
    currentSpend,
    limit,
    exceeded: currentSpend >= limit,
  };
}
```
Why this structure: Rolling windows prevent end-of-month billing spikes from being ignored. Separating budget evaluation from usage ingestion allows independent scaling. The database handles aggregation efficiently, while the application layer focuses on alert routing and enforcement policies.
Pitfall Guide
1. Relying on Provider CSV Exports for Real-Time Control
Explanation: Provider dashboards update on delayed schedules. CSV reconciliation takes hours and cannot enforce limits during active usage. Fix: Implement an async event pipeline that processes usage within seconds of request completion. Use provider exports only for audit reconciliation.
2. Ignoring Token Pricing Tiers and Caching
Explanation: Input and output tokens are priced differently. Prompt caching can reduce costs by 50-90% for repeated contexts, but naive tracking counts cached tokens as full price.
Fix: Store cached_input_tokens separately. Apply caching discounts during cost calculation. Update pricing tables monthly to reflect provider rate changes.
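As a minimal sketch of this fix, the earlier cost calculation can be extended to price cached input tokens at a fraction of the normal input rate. The pricing figures and the `CACHE_DISCOUNT_MULTIPLIER` value below are illustrative assumptions, not provider-confirmed rates:

```typescript
// Sketch: cache-aware cost calculation. Cached input tokens are billed at a
// discounted multiple of the normal input rate (assumed 10% here).
type CachedUsageRecord = {
  model: string;
  inputTokens: number;       // uncached input tokens
  cachedInputTokens: number; // tokens served from the prompt cache
  outputTokens: number;
};

const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10 }, // USD per 1M tokens (illustrative)
};

const CACHE_DISCOUNT_MULTIPLIER = 0.1; // 90% discount for cached input

export function calculateCachedCost(record: CachedUsageRecord): number {
  const rates = PRICING[record.model];
  if (!rates) return 0; // unknown model: fall through to reconciliation
  const input = (record.inputTokens / 1_000_000) * rates.input;
  const cached =
    (record.cachedInputTokens / 1_000_000) * rates.input * CACHE_DISCOUNT_MULTIPLIER;
  const output = (record.outputTokens / 1_000_000) * rates.output;
  return Number((input + cached + output).toFixed(6));
}
```

Keeping the discount as a named multiplier means a provider rate change is a one-line update rather than a change to the calculation logic.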
3. Synchronous Budget Enforcement
Explanation: Checking budgets inside the request path adds 50-200ms latency per call. Under load, this creates cascading timeouts. Fix: Decouple enforcement. Allow the request to complete, then evaluate budget asynchronously. Implement soft limits that downgrade model quality or queue requests rather than blocking synchronously.
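One way to sketch such a soft-limit policy is a pure function that maps the tenant's spend-to-budget ratio to an action, evaluated by the async budget engine rather than in the request path. The thresholds, the fallback model name, and the action names are illustrative assumptions:

```typescript
// Sketch: soft-limit enforcement policy. Evaluated asynchronously; the
// request path only reads the latest cached decision for the tenant.
type EnforcementAction =
  | { kind: "allow"; model: string }
  | { kind: "degrade"; model: string } // switch to a cheaper model
  | { kind: "queue" }                  // defer instead of blocking
  | { kind: "block" };

export function resolveSoftLimit(
  requestedModel: string,
  spendRatio: number // currentSpend / budgetLimit
): EnforcementAction {
  if (spendRatio < 0.75) return { kind: "allow", model: requestedModel };
  if (spendRatio < 1.0) return { kind: "degrade", model: "gpt-4o-mini" };
  if (spendRatio < 1.2) return { kind: "queue" };
  return { kind: "block" };
}
```

Because the function is pure, the same policy can be unit-tested and reused by both the alerting worker and the gateway.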
4. Missing Metadata Fallbacks
Explanation: If tenantId is null or malformed, usage records become orphaned. Aggregated costs balloon without attribution.
Fix: Validate tenant context at the API gateway. Route unauthenticated or missing-tenant requests to a default_workspace with strict limits. Log validation failures for monitoring.
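A gateway-level validator for this can be sketched as follows; the `default_workspace` name and the ID format are assumptions to match the fix described above:

```typescript
// Sketch: tenant-context validation at the API gateway. Malformed or missing
// tenant IDs are routed to a strictly limited default workspace so usage
// never becomes unattributed.
const TENANT_ID_PATTERN = /^[a-z0-9_-]{1,64}$/i;

export function resolveTenantId(raw: string | null | undefined): {
  tenantId: string;
  fallback: boolean;
} {
  if (typeof raw === "string" && TENANT_ID_PATTERN.test(raw)) {
    return { tenantId: raw, fallback: false };
  }
  // Caller should log the fallback for monitoring
  return { tenantId: "default_workspace", fallback: true };
}
```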
5. Hardcoding Rate Limits Instead of Cost Limits
Explanation: Token limits do not translate linearly to dollars. A 10k token request to GPT-4o costs significantly more than the same request to DeepSeek. Fix: Base limits on monetary thresholds, not token counts. Normalize all usage to USD before comparison. Allow per-model multipliers if business logic requires it.
6. Overlooking Streaming Token Accumulation
Explanation: Streaming responses emit tokens incrementally. If you only track the final response, you lose visibility into long-running generations that may exceed budgets mid-stream. Fix: Accumulate token counts during stream processing. Emit usage events at configurable intervals (e.g., every 500 tokens) for real-time budget tracking.
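A minimal accumulator for this pattern might look like the following. `StreamUsageAccumulator` is a hypothetical helper; the `emit` callback stands in for the async usage-event dispatch:

```typescript
// Sketch: accumulate streamed output tokens and flush a usage event every
// `flushInterval` tokens, so a runaway generation shows up mid-stream.
export class StreamUsageAccumulator {
  private pending = 0;
  private total = 0;

  constructor(
    private flushInterval: number,
    private emit: (tokens: number) => void
  ) {}

  // Call for each streamed chunk with its token count
  addTokens(count: number): void {
    this.pending += count;
    this.total += count;
    while (this.pending >= this.flushInterval) {
      this.emit(this.flushInterval);
      this.pending -= this.flushInterval;
    }
  }

  // Call when the stream ends; flushes the remainder and returns the total
  finish(): number {
    if (this.pending > 0) this.emit(this.pending);
    this.pending = 0;
    return this.total;
  }
}
```

Each emitted interval becomes a partial usage event, so the budget engine sees spend accrue while the generation is still running.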
7. Failing to Handle Idempotency
Explanation: Network retries or Inngest redeliveries can cause duplicate usage records. Double-counting inflates tenant spend and triggers false alerts.
Fix: Include a deterministic requestId in every usage event. Use database upserts with unique constraints on requestId. Deduplicate at the ingestion layer before cost calculation.
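The contract this fix establishes can be illustrated with an in-process sketch; in production the same guarantee comes from the database's unique constraint on `request_id` plus an upsert, this class just makes the behavior concrete:

```typescript
// Sketch: deduplication keyed on a deterministic requestId. A duplicate
// delivery is acknowledged but never counted twice.
export class UsageIngestor {
  private seen = new Set<string>();
  private totalCost = 0;

  // Returns true if the record was ingested, false if it was a duplicate
  ingest(requestId: string, cost: number): boolean {
    if (this.seen.has(requestId)) return false;
    this.seen.add(requestId);
    this.totalCost += cost;
    return true;
  }

  spend(): number {
    return Number(this.totalCost.toFixed(6));
  }
}
```

With supabase-js, the equivalent at the storage layer is an upsert that conflicts on `request_id` and ignores duplicates rather than overwriting them.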
Production Bundle
Action Checklist
- Define tenant attribution strategy: Map every inference request to a `tenant_id` or `workspace_id` before provider dispatch
- Implement pricing normalization: Create a centralized pricing table that converts input/output tokens to USD per model
- Deploy async event pipeline: Route usage records through Inngest or equivalent queue to decouple tracking from request latency
- Configure time-series storage: Use Supabase or PostgreSQL with indexes on `tenant_id`, `created_at`, and `model` for fast aggregations
- Build budget evaluation engine: Implement rolling window spend calculations with configurable thresholds and alert routing
- Add idempotency safeguards: Enforce unique `requestId` constraints and deduplicate ingestion events
- Establish pricing update cadence: Schedule monthly reviews of provider rate cards and update normalization tables automatically
- Implement soft limit policies: Design degradation paths (model downgrade, queueing, throttling) before hard caps are reached
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup MVP (<100 tenants) | Provider CSV + manual tracking | Low overhead, sufficient for early validation | Minimal engineering cost, high reconciliation time |
| Mid-market SaaS (100-10k tenants) | Async event pipeline + Supabase | Real-time attribution, automated alerts, scalable | Moderate infra cost, prevents margin erosion |
| Enterprise Multi-tenant (>10k tenants) | Dedicated usage service + Kafka + ClickHouse | High throughput, complex budget windows, audit compliance | Higher infra cost, enables precise LTV/CAC modeling |
| Cost-sensitive workloads | Streaming accumulation + cache-aware tracking | Captures incremental spend, applies caching discounts | Reduces overbilling by 15-40% on repeated contexts |
Configuration Template
```typescript
// config/usage-tracking.ts
export const USAGE_TRACKING_CONFIG = {
  // Provider metadata injection
  providers: {
    openai: { metadataField: "user", prefix: "workspace_" },
    anthropic: { metadataField: "metadata", key: "user_id" },
    openrouter: { metadataField: "meta", key: "tenant_id" },
  },
  // Budget evaluation
  budget: {
    windowDays: 30,
    alertThresholds: [0.5, 0.75, 0.9, 1.0], // 50%, 75%, 90%, 100%
    enforcement: "soft_limit", // soft_limit | hard_limit | degrade_quality
    alertChannels: ["slack", "pagerduty"],
  },
  // Pricing normalization
  pricing: {
    updateFrequency: "monthly",
    cacheDiscountMultiplier: 0.1, // 90% discount for cached tokens
    roundingPrecision: 6,
  },
  // Storage
  storage: {
    table: "inference_usage",
    indexes: ["tenant_id", "created_at", "model", "request_id"],
    retentionDays: 365,
  },
};
```
```sql
-- Supabase schema for usage tracking
CREATE TABLE inference_usage (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id TEXT NOT NULL,
  provider TEXT NOT NULL,
  model TEXT NOT NULL,
  input_tokens INTEGER NOT NULL DEFAULT 0,
  output_tokens INTEGER NOT NULL DEFAULT 0,
  cached_input_tokens INTEGER NOT NULL DEFAULT 0,
  cost_usd NUMERIC(10,6) NOT NULL,
  request_id TEXT NOT NULL UNIQUE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_usage_tenant_date ON inference_usage(tenant_id, created_at);
CREATE INDEX idx_usage_model ON inference_usage(model);

CREATE TABLE tenant_budgets (
  tenant_id TEXT PRIMARY KEY,
  budget_limit NUMERIC(10,2) NOT NULL,
  alert_thresholds NUMERIC(3,2)[] DEFAULT ARRAY[0.5, 0.75, 0.9, 1.0],
  enforcement_policy TEXT DEFAULT 'soft_limit',
  updated_at TIMESTAMPTZ DEFAULT now()
);
```
Quick Start Guide
- Initialize the tracking client: Install `inngest` and `@supabase/supabase-js`. Create a new Inngest function that listens for `inference/usage.created` events and persists records to Supabase.
- Inject tenant metadata: Wrap your provider SDK calls with the `InferenceRouter` pattern. Ensure every request includes the tenant identifier using the provider-specific field or header.
- Deploy the budget engine: Configure the `evaluateTenantBudget` function to run on a 5-minute schedule or trigger on usage events. Connect Slack webhook integration for threshold alerts.
- Validate with test tenants: Create sandbox tenants with $0.05 budget limits. Run inference requests and verify that alerts trigger at 50%, 75%, and 100% thresholds. Confirm idempotency by replaying duplicate events.
- Enable production routing: Switch traffic through the attribution pipeline. Monitor Supabase query performance and Inngest execution logs. Adjust budget windows and alert thresholds based on initial tenant behavior.