How I track per-customer LLM costs in production
Multi-Tenant LLM Cost Attribution: Architecture, Implementation, and Budget Enforcement
Current Situation Analysis
Inference billing from major model providers operates on a strictly aggregated model. You receive a single invoice covering total input and output tokens across all endpoints, regions, and model variants. For early-stage applications, this abstraction works fine. Average cost-per-request multiplied by monthly active users yields a predictable margin. The problem emerges the moment your platform scales past a handful of tenants.
The industry pain point is attribution latency. Billing dashboards from OpenAI, Anthropic, and OpenRouter report total spend, but they do not natively segment costs by customer, workspace, or subscription tier in real time. Engineering teams are forced to reconcile provider invoices against internal subscription revenue using manual CSV exports and spreadsheet pivot tables. This approach breaks down under three conditions:
- Usage outliers: A single power user or enterprise trial account can trigger tens of thousands of document processing jobs overnight. At standard pricing tiers, a $19/month plan can absorb $50+ in inference costs in hours, destroying unit economics before the next billing cycle.
- Model pricing fragmentation: Different tenants use different models. GPT-4o, Claude Sonnet, and DeepSeek V3 have distinct per-million-token rates. Aggregated dashboards flatten these differences, making it impossible to calculate true margin per tenant.
- Lack of enforcement: Without real-time attribution, budget limits are reactive. You discover overspend after the invoice arrives, not during the request lifecycle.
Most teams overlook this because initial infrastructure focuses on request routing and response caching. Cost tracking is treated as a finance problem rather than an engineering constraint. By the time margin erosion becomes visible, the architecture lacks the telemetry hooks needed to isolate tenant-level consumption. Decoupling usage tracking from subscription billing is not optional in production; it is a prerequisite for sustainable LLM product economics.
WOW Moment: Key Findings
The shift from aggregate billing to per-tenant attribution fundamentally changes how you manage inference economics. The table below compares three common operational approaches across critical production metrics.
| Approach | Attribution Granularity | Alert Latency | Implementation Complexity | Cost Control Capability |
|---|---|---|---|---|
| Provider Dashboard | Account-level only | 24-72 hours | None | None (post-hoc only) |
| CSV Reconciliation | Tenant-level (manual) | 1-7 days | Low | Low (manual intervention) |
| Real-Time Attribution Pipeline | Tenant/Model/Request-level | <5 seconds | Medium | High (automated throttling/alerting) |
The real-time pipeline approach transforms cost tracking from a retrospective accounting exercise into an active control surface. You gain the ability to:
- Trigger Slack or PagerDuty alerts the moment a tenant crosses a configurable budget threshold
- Implement soft limits that degrade model quality or switch to cheaper alternatives before hard caps are hit
- Correlate API spend directly with subscription tiers, enabling accurate LTV/CAC calculations
- Identify abusive or misconfigured integrations before they impact provider invoices
This capability enables predictable margins, automated tenant onboarding with usage caps, and data-driven pricing adjustments without manual reconciliation.
Core Solution
Building a production-grade attribution pipeline requires intercepting inference requests, normalizing token consumption into monetary values, and evaluating budgets asynchronously. The architecture separates request handling from cost evaluation to prevent latency penalties.
Architecture Overview
- Request Interceptor: A lightweight middleware captures the tenant identifier, model selection, and request context before forwarding to the provider.
- Metadata Injection: Provider-specific parameters attach the tenant ID to the API call, enabling downstream attribution.
- Async Event Dispatch: Token usage is published to a durable event queue. This decouples cost tracking from the request-response cycle.
- Time-Series Storage: Usage records are persisted in a relational database optimized for aggregations and rolling window calculations.
- Budget Engine: A scheduled or event-driven worker evaluates cumulative spend against tenant limits and triggers alerts or enforcement actions.
Implementation Details
1. Metadata Injection Strategy
Provider APIs handle tenant attribution differently. OpenAI accepts a user field in the completion payload. Anthropic accepts a metadata.user_id field on the Messages API request body. OpenRouter supports metadata routing. The interceptor must normalize these differences.
```typescript
import { OpenAI } from "openai";
import Anthropic from "@anthropic-ai/sdk";

type ProviderConfig = {
  openai: OpenAI;
  anthropic: Anthropic;
};

export class InferenceRouter {
  private clients: ProviderConfig;

  constructor(clients: ProviderConfig) {
    this.clients = clients;
  }

  async routeCompletion(
    provider: "openai" | "anthropic",
    tenantId: string,
    model: string,
    messages: Array<{ role: "user" | "assistant"; content: string }>
  ) {
    if (provider === "openai") {
      return this.clients.openai.chat.completions.create({
        model,
        messages,
        user: `workspace_${tenantId}`, // Native attribution field
      });
    }
    if (provider === "anthropic") {
      return this.clients.anthropic.messages.create({
        model,
        max_tokens: 1024,
        messages,
        metadata: { user_id: `workspace_${tenantId}` }, // Native attribution field
      });
    }
    throw new Error(`Unsupported provider: ${provider}`);
  }
}
```
Why this structure: The router abstracts provider-specific metadata requirements behind a unified interface. This prevents business logic from coupling to vendor implementations and allows seamless provider switching without rewriting attribution logic.
2. Usage Normalization & Event Dispatch
Token counts must be converted to monetary values using current pricing tables. The conversion happens immediately after response completion, then the normalized record is dispatched asynchronously.
```typescript
import { Inngest } from "inngest";

const inngestClient = new Inngest({ id: "llm-cost-tracker" });

type UsageRecord = {
  tenantId: string;
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  timestamp: string;
  requestId: string;
};

const PRICING_TABLE: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10 }, // USD per 1M tokens
  "claude-3-5-sonnet": { input: 3, output: 15 },
  "deepseek-chat": { input: 0.14, output: 0.28 },
};

export function calculateCost(record: UsageRecord): number {
  const rates = PRICING_TABLE[record.model];
  if (!rates) return 0;
  const inputCost = (record.inputTokens / 1_000_000) * rates.input;
  const outputCost = (record.outputTokens / 1_000_000) * rates.output;
  return Number((inputCost + outputCost).toFixed(6));
}

export const trackUsageEvent = inngestClient.createFunction(
  { id: "ingest-usage" },
  { event: "inference/usage.created" },
  async ({ event, step }) => {
    const record = event.data as UsageRecord;
    const cost = calculateCost(record);
    await step.run("persist-usage", async () => {
      // Supabase insert with tenant_id, model, tokens, cost, timestamp
      // Uses upsert for idempotency on requestId
    });
    await step.run("evaluate-budget", async () => {
      // Query rolling 30-day spend for tenant
      // Compare against tenant.budget_limit
      // Trigger Slack alert if threshold exceeded
    });
  }
);
```
Why this structure: Cost calculation is isolated from persistence. This allows pricing tables to be updated without touching database logic. Inngest provides automatic retries, dead-letter queues, and scheduled execution, which eliminates the need for custom cron infrastructure.
3. Budget Evaluation & Alerting
Budget checks run asynchronously to avoid blocking the inference request. The engine queries cumulative spend over a configurable window (calendar month or rolling 30 days) and compares it against tenant limits.
```typescript
import { createClient } from "@supabase/supabase-js";

// Assumes SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY are set in the environment
const supabaseClient = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export async function evaluateTenantBudget(
  tenantId: string,
  windowDays: number = 30
): Promise<{ currentSpend: number; limit: number; exceeded: boolean }> {
  const cutoff = new Date();
  cutoff.setDate(cutoff.getDate() - windowDays);

  // Aggregation query against the Supabase usage table
  const { data: usageRecords } = await supabaseClient
    .from("inference_usage")
    .select("cost_usd")
    .eq("tenant_id", tenantId)
    .gte("created_at", cutoff.toISOString());

  const currentSpend =
    usageRecords?.reduce((sum, r) => sum + r.cost_usd, 0) ?? 0;

  const { data: tenant } = await supabaseClient
    .from("tenant_budgets")
    .select("budget_limit")
    .eq("tenant_id", tenantId)
    .single();

  const limit = tenant?.budget_limit ?? Infinity;

  return {
    currentSpend,
    limit,
    exceeded: currentSpend >= limit,
  };
}
```
Why this structure: Rolling windows prevent end-of-month billing spikes from being ignored. Separating budget evaluation from usage ingestion allows independent scaling. The database handles aggregation efficiently, while the application layer focuses on alert routing and enforcement policies.
Pitfall Guide
1. Relying on Provider CSV Exports for Real-Time Control
Explanation: Provider dashboards update on delayed schedules. CSV reconciliation takes hours and cannot enforce limits during active usage. Fix: Implement an async event pipeline that processes usage within seconds of request completion. Use provider exports only for audit reconciliation.
2. Ignoring Token Pricing Tiers and Caching
Explanation: Input and output tokens are priced differently. Prompt caching can reduce costs by 50-90% for repeated contexts, but naive tracking counts cached tokens as full price.
Fix: Store cached_input_tokens separately. Apply caching discounts during cost calculation. Update pricing tables monthly to reflect provider rate changes.
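As a minimal sketch of this fix, the earlier cost calculation can be extended to price cached input tokens at a fraction of the normal input rate. The pricing figures and the `CACHE_DISCOUNT_MULTIPLIER` value below are illustrative assumptions, not provider-confirmed rates:

```typescript
// Sketch: cache-aware cost calculation. Cached input tokens are billed at a
// discounted multiple of the normal input rate (assumed 10% here).
type CachedUsageRecord = {
  model: string;
  inputTokens: number;       // uncached input tokens
  cachedInputTokens: number; // tokens served from the prompt cache
  outputTokens: number;
};

const PRICING: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10 }, // USD per 1M tokens (illustrative)
};

const CACHE_DISCOUNT_MULTIPLIER = 0.1; // 90% discount for cached input

export function calculateCachedCost(record: CachedUsageRecord): number {
  const rates = PRICING[record.model];
  if (!rates) return 0; // unknown model: fall through to reconciliation
  const input = (record.inputTokens / 1_000_000) * rates.input;
  const cached =
    (record.cachedInputTokens / 1_000_000) * rates.input * CACHE_DISCOUNT_MULTIPLIER;
  const output = (record.outputTokens / 1_000_000) * rates.output;
  return Number((input + cached + output).toFixed(6));
}
```

Keeping the discount as a named multiplier means a provider rate change is a one-line update rather than a change to the calculation logic.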
3. Synchronous Budget Enforcement
Explanation: Checking budgets inside the request path adds 50-200ms latency per call. Under load, this creates cascading timeouts. Fix: Decouple enforcement. Allow the request to complete, then evaluate budget asynchronously. Implement soft limits that downgrade model quality or queue requests rather than blocking synchronously.
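One way to sketch such a soft-limit policy is a pure function that maps the tenant's spend-to-budget ratio to an action, evaluated by the async budget engine rather than in the request path. The thresholds, the fallback model name, and the action names are illustrative assumptions:

```typescript
// Sketch: soft-limit enforcement policy. Evaluated asynchronously; the
// request path only reads the latest cached decision for the tenant.
type EnforcementAction =
  | { kind: "allow"; model: string }
  | { kind: "degrade"; model: string } // switch to a cheaper model
  | { kind: "queue" }                  // defer instead of blocking
  | { kind: "block" };

export function resolveSoftLimit(
  requestedModel: string,
  spendRatio: number // currentSpend / budgetLimit
): EnforcementAction {
  if (spendRatio < 0.75) return { kind: "allow", model: requestedModel };
  if (spendRatio < 1.0) return { kind: "degrade", model: "gpt-4o-mini" };
  if (spendRatio < 1.2) return { kind: "queue" };
  return { kind: "block" };
}
```

Because the function is pure, the same policy can be unit-tested and reused by both the alerting worker and the gateway.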
4. Missing Metadata Fallbacks
Explanation: If tenantId is null or malformed, usage records become orphaned. Aggregated costs balloon without attribution.
Fix: Validate tenant context at the API gateway. Route unauthenticated or missing-tenant requests to a default_workspace with strict limits. Log validation failures for monitoring.
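A gateway-level validator for this can be sketched as follows; the `default_workspace` name and the ID format are assumptions to match the fix described above:

```typescript
// Sketch: tenant-context validation at the API gateway. Malformed or missing
// tenant IDs are routed to a strictly limited default workspace so usage
// never becomes unattributed.
const TENANT_ID_PATTERN = /^[a-z0-9_-]{1,64}$/i;

export function resolveTenantId(raw: string | null | undefined): {
  tenantId: string;
  fallback: boolean;
} {
  if (typeof raw === "string" && TENANT_ID_PATTERN.test(raw)) {
    return { tenantId: raw, fallback: false };
  }
  // Caller should log the fallback for monitoring
  return { tenantId: "default_workspace", fallback: true };
}
```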
5. Hardcoding Rate Limits Instead of Cost Limits
Explanation: Token limits do not translate linearly to dollars. A 10k token request to GPT-4o costs significantly more than the same request to DeepSeek. Fix: Base limits on monetary thresholds, not token counts. Normalize all usage to USD before comparison. Allow per-model multipliers if business logic requires it.
6. Overlooking Streaming Token Accumulation
Explanation: Streaming responses emit tokens incrementally. If you only track the final response, you lose visibility into long-running generations that may exceed budgets mid-stream. Fix: Accumulate token counts during stream processing. Emit usage events at configurable intervals (e.g., every 500 tokens) for real-time budget tracking.
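A minimal accumulator for this pattern might look like the following. `StreamUsageAccumulator` is a hypothetical helper; the `emit` callback stands in for the async usage-event dispatch:

```typescript
// Sketch: accumulate streamed output tokens and flush a usage event every
// `flushInterval` tokens, so a runaway generation shows up mid-stream.
export class StreamUsageAccumulator {
  private pending = 0;
  private total = 0;

  constructor(
    private flushInterval: number,
    private emit: (tokens: number) => void
  ) {}

  // Call for each streamed chunk with its token count
  addTokens(count: number): void {
    this.pending += count;
    this.total += count;
    while (this.pending >= this.flushInterval) {
      this.emit(this.flushInterval);
      this.pending -= this.flushInterval;
    }
  }

  // Call when the stream ends; flushes the remainder and returns the total
  finish(): number {
    if (this.pending > 0) this.emit(this.pending);
    this.pending = 0;
    return this.total;
  }
}
```

Each emitted interval becomes a partial usage event, so the budget engine sees spend accrue while the generation is still running.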
7. Failing to Handle Idempotency
Explanation: Network retries or Inngest redeliveries can cause duplicate usage records. Double-counting inflates tenant spend and triggers false alerts.
Fix: Include a deterministic requestId in every usage event. Use database upserts with unique constraints on requestId. Deduplicate at the ingestion layer before cost calculation.
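The contract this fix establishes can be illustrated with an in-process sketch; in production the same guarantee comes from the database's unique constraint on `request_id` plus an upsert, this class just makes the behavior concrete:

```typescript
// Sketch: deduplication keyed on a deterministic requestId. A duplicate
// delivery is acknowledged but never counted twice.
export class UsageIngestor {
  private seen = new Set<string>();
  private totalCost = 0;

  // Returns true if the record was ingested, false if it was a duplicate
  ingest(requestId: string, cost: number): boolean {
    if (this.seen.has(requestId)) return false;
    this.seen.add(requestId);
    this.totalCost += cost;
    return true;
  }

  spend(): number {
    return Number(this.totalCost.toFixed(6));
  }
}
```

With supabase-js, the equivalent at the storage layer is an upsert that conflicts on `request_id` and ignores duplicates rather than overwriting them.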
Production Bundle
Action Checklist
- Define tenant attribution strategy: Map every inference request to a `tenant_id` or `workspace_id` before provider dispatch
- Implement pricing normalization: Create a centralized pricing table that converts input/output tokens to USD per model
- Deploy async event pipeline: Route usage records through Inngest or equivalent queue to decouple tracking from request latency
- Configure time-series storage: Use Supabase or PostgreSQL with indexes on `tenant_id`, `created_at`, and `model` for fast aggregations
- Build budget evaluation engine: Implement rolling window spend calculations with configurable thresholds and alert routing
- Add idempotency safeguards: Enforce unique `requestId` constraints and deduplicate ingestion events
- Establish pricing update cadence: Schedule monthly reviews of provider rate cards and update normalization tables automatically
- Implement soft limit policies: Design degradation paths (model downgrade, queueing, throttling) before hard caps are reached
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup MVP (<100 tenants) | Provider CSV + manual tracking | Low overhead, sufficient for early validation | Minimal engineering cost, high reconciliation time |
| Mid-market SaaS (100-10k tenants) | Async event pipeline + Supabase | Real-time attribution, automated alerts, scalable | Moderate infra cost, prevents margin erosion |
| Enterprise Multi-tenant (>10k tenants) | Dedicated usage service + Kafka + ClickHouse | High throughput, complex budget windows, audit compliance | Higher infra cost, enables precise LTV/CAC modeling |
| Cost-sensitive workloads | Streaming accumulation + cache-aware tracking | Captures incremental spend, applies caching discounts | Reduces overbilling by 15-40% on repeated contexts |
Configuration Template
```typescript
// config/usage-tracking.ts
export const USAGE_TRACKING_CONFIG = {
  // Provider metadata injection
  providers: {
    openai: { metadataField: "user", prefix: "workspace_" },
    anthropic: { metadataField: "metadata", key: "user_id" },
    openrouter: { metadataField: "meta", key: "tenant_id" },
  },
  // Budget evaluation
  budget: {
    windowDays: 30,
    alertThresholds: [0.5, 0.75, 0.9, 1.0], // 50%, 75%, 90%, 100%
    enforcement: "soft_limit", // soft_limit | hard_limit | degrade_quality
    alertChannels: ["slack", "pagerduty"],
  },
  // Pricing normalization
  pricing: {
    updateFrequency: "monthly",
    cacheDiscountMultiplier: 0.1, // 90% discount for cached tokens
    roundingPrecision: 6,
  },
  // Storage
  storage: {
    table: "inference_usage",
    indexes: ["tenant_id", "created_at", "model", "request_id"],
    retentionDays: 365,
  },
};
```
```sql
-- Supabase schema for usage tracking
CREATE TABLE inference_usage (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id TEXT NOT NULL,
  provider TEXT NOT NULL,
  model TEXT NOT NULL,
  input_tokens INTEGER NOT NULL DEFAULT 0,
  output_tokens INTEGER NOT NULL DEFAULT 0,
  cached_input_tokens INTEGER NOT NULL DEFAULT 0,
  cost_usd NUMERIC(10,6) NOT NULL,
  request_id TEXT NOT NULL UNIQUE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_usage_tenant_date ON inference_usage(tenant_id, created_at);
CREATE INDEX idx_usage_model ON inference_usage(model);

CREATE TABLE tenant_budgets (
  tenant_id TEXT PRIMARY KEY,
  budget_limit NUMERIC(10,2) NOT NULL,
  alert_thresholds NUMERIC(3,2)[] DEFAULT ARRAY[0.5, 0.75, 0.9, 1.0],
  enforcement_policy TEXT DEFAULT 'soft_limit',
  updated_at TIMESTAMPTZ DEFAULT now()
);
```
Quick Start Guide
- Initialize the tracking client: Install `inngest` and `@supabase/supabase-js`. Create a new Inngest function that listens for `inference/usage.created` events and persists records to Supabase.
- Inject tenant metadata: Wrap your provider SDK calls with the `InferenceRouter` pattern. Ensure every request includes the tenant identifier using the provider-specific field or header.
- Deploy the budget engine: Configure the `evaluateTenantBudget` function to run on a 5-minute schedule or trigger on usage events. Connect Slack webhook integration for threshold alerts.
- Validate with test tenants: Create sandbox tenants with $0.05 budget limits. Run inference requests and verify that alerts trigger at 50%, 75%, and 100% thresholds. Confirm idempotency by replaying duplicate events.
- Enable production routing: Switch traffic through the attribution pipeline. Monitor Supabase query performance and Inngest execution logs. Adjust budget windows and alert thresholds based on initial tenant behavior.