Stop Your OpenAI Bill from Exploding: Per-User LLM Budget Caps in Node.js

Current Situation Analysis

Pain Points & Failure Modes:

Unpredictable Cost Spikes: A single session can consume 4× the daily application budget without triggering any infrastructure alarms or rate limits.
Decoupled Request/Cost Metrics: Traditional rate limiting operates on requests/minute, but LLM APIs charge per token. A 50-token query ($0.0005) and a 50-page document summary ($0.30) both count as "1 request," creating a 600× cost variance within identical rate-limit buckets.
Silent Billing Leakage: Without real-time cost tracking, overages only surface on monthly invoices, making root-cause analysis and budget enforcement reactive rather than proactive.

Why Traditional Methods Fail: Rate limits protect infrastructure concurrency, not financial exposure. Applying REST-style throttling to token-based APIs is mathematically misaligned. Running an LLM-powered application without per-user cost caps is equivalent to exposing a payment endpoint without spend validation, leaving the system vulnerable to buggy clients, forgotten test sessions, or power users driving triple-digit hourly bills while staying strictly within request-rate bounds.

WOW Moment: Key Findings

Approach	Max Session Cost Variance	Billing Accuracy	Real-Time Enforcement Latency	Cache Utilization Tracking
Traditional Rate Limiting (req/min)	600× spread per request	±40% (estimation only)	<50ms (infra-level)	None
Per-User Cost Capping + Structured Logging	<5% deviation from budget	±2% (API-reported tokens)	<120ms (DB lookup)	Full (cached vs raw)

Key Findings:

Decoupling request count from cost eliminates the 600× pricing blind spot inherent in token-bucket rate limiters.
Storing costs as NUMERIC(10,6) preserves 4–6 significant decimal digits, preventing rounding errors that zero out short calls.
Prompt caching automatically reduces prompt costs by ~50%, but only if prompt_tokens_details.cached_tokens is explicitly tracked and billed separately.

Sweet Spot: Implementing a measure → cap → degrade → cache pipeline with Postgres-backed structured logging allows teams to enforce strict per-session budgets, maintain sub-150ms enforcement latency, and capture real-time cache discounts without modifying core business logic.

Core Solution

Step 1 — Measure before you cap

You cannot cap what you do not measure. The first thing to ship is a single funnel that every LLM call passes through, with structured logging into a real database (not a JSON file, not an analytics tool — something you can JOIN and WHERE against in real time).

Here's the schema I use, simplified:

CREATE TABLE llm_usage_logs (
  id                     BIGSERIAL PRIMARY KEY,
  session_id             TEXT NOT NULL,
  model                  TEXT NOT NULL,
  prompt_tokens          INT  NOT NULL DEFAULT 0,
  cached_prompt_tokens   INT  NOT NULL DEFAULT 0,    -- subset of prompt_tokens that hit OpenAI's cache
  completion_tokens      INT  NOT NULL DEFAULT 0,
  total_tokens           INT  NOT NULL DEFAULT 0,
  prompt_cost_usd        NUMERIC(10,6) NOT NULL DEFAULT 0,
  cached_prompt_cost_usd NUMERIC(10,6) NOT NULL DEFAULT 0,  -- billed at ~50% of full prompt rate
  completion_cost_usd    NUMERIC(10,6) NOT NULL DEFAULT 0,
  total_cost_usd         NUMERIC(10,6) NOT NULL DEFAULT 0,
  finish_reason          TEXT,
  response_time_ms       INT,
  created_at             TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX llm_usage_logs_session_time_idx
  ON llm_usage_logs (session_id, created_at DESC);

Three non-obvious choices:

Store cost in USD as NUMERIC, not as cents in an integer. Token-priced cost has 4–6 significant decimal digits. If you store cents, you'll round most short calls to zero and the arithmetic gets useless.
Index on (session_id, created_at DESC). Every "is this user over budget?" query scans recent rows for a session. Without this index it's a sequential scan, and you'll regret it the day usage spikes.
Provision the cached-token columns from day one. Even if you're not using prompt caching yet, adding cached_prompt_tokens and cached_prompt_cost_usd up front saves you a migration later — they default to 0 and Step 2 wires them up.

Then a single logging function:

async function logLLMCall({
  session_id, model,
  prompt_tokens, completion_tokens, total_tokens,
  finish_reason, response_time_ms,
}) {
  // All prices below are USD per 1M tokens (NOT per 1K).
  const promptPricePerM     = parseFloat(process.env.LLM_PROMPT_PRICE_PER_M)     || 2.50;
  const completionPricePerM = parseFloat(process.env.LLM_COMPLETION_PRICE_PER_M) || 10.00;

  const promptCost     = (prompt_tokens     / 1_000_000) * promptPricePerM;
  const completionCost = (completion_tokens / 1_000_000) * completionPricePerM;
  const totalCost      = promptCost + completionCost;

  await pg.query(
    `INSERT INTO llm_usage_logs
       (session_id, model, prompt_tokens, completion_tokens, total_tokens,
        prompt_cost_usd, completion_cost_usd, total_cost_usd,
        finish_reason, response_time_ms)
     VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10)`,
    [session_id, model, prompt_tokens, completion_tokens, total_tokens,
     promptCost, completionCost, totalCost, finish_reason, response_time_ms]
  );
}

Step 2 — Get the cost math right

Three things that bite people in cost calculation:

Different pricing per model. GPT-4o costs roughly 5× what GPT-4o-mini costs per million tokens. Don't hardcode prices; pull them from env vars (or a small model_pricing table) keyed by model name. When OpenAI announces new pricing — and they will — you change config, not code.

Trust the API's token count, not your local one. Tokenizers like tiktoken are close to what the API actually charges, but not identical. The number that matters is response.usage.{prompt,completion}_tokens returned in the API response. Log that, not your local pre-call estimate.

Streaming responses still report usage — if you ask for it. With stream: true, you must pass stream_options: { include_usage: true } to get a final usage chunk. Many people miss this and end up logging 0 tokens for every streamed call, which silently zeroes their cost dashboard.

const stream = await openai.chat.completions.create({
  model, messages,
  stream: true,
  stream_options: { include_usage: true },
});

let usage;
for await (const chunk of stream) {
  if (chunk.usage) usage = chunk.usage;        // arrives in the final chunk
  // ... yield chunk.choices[0].delta to client
}

await logLLMCall({ session_id, model, ...usage, ...timing });

Don't forget prompt caching — it changes the math

If you've enabled OpenAI prompt caching (it kicks in automatically once your prompt prefix is long and reused), part of your prompt tokens come back at roughly half price. They show up under usage.prompt_tokens_details.cached_tokens. If you ignore the field, your dashboard will overstate spend by 20–30% — and worse, you'll under-credit the optimizations you're actually doing.

Three-rate calculation:

function calculateCost(usage) {
  const cached    = usage.prompt_tokens_details?.cached_tokens || 0;
  const promptRaw = (usage.prompt_tokens || 0) - cached;          // billed at full rate
  const completion = usage.completion_tokens || 0;

  // All prices below are USD per 1M tokens.
  const fullPromptPerM   = parseFloat(process.env.LLM_PROMPT_PRICE_PER_M)        || 2.50;
  const cachedPromptPerM = parseFloat(process.env.LLM_CACHED_PROMPT_PRICE_PER_M) || 1.25;  // ~50% off full
  const completionPerM   = parseFloat(process.env.LLM_COMPLETION_PRICE_PER_M)    || 10.00;

  return {
    prompt_cost_usd:        (promptRaw  / 1_000_000) * fullPromptPerM,
    cached_prompt_cost_usd: (cached     / 1_000_000) * cachedPromptPerM,
    completion_cost_usd:    (completion / 1_000_000) * completionPerM,
    total_cost_usd:

Pitfall Guide

Confusing Rate Limits with Budget Caps: Rate limits control concurrency, not spend. LLM cost scales with tokens, not requests. Applying req/min limits to token-billing APIs leaves financial exposure completely unmanaged.
Storing Costs as Integers/Cents: Token pricing requires 4–6 decimal places. Using INT or CENTS rounds micro-transactions to zero, breaking budget arithmetic and making cost aggregation mathematically useless.
Ignoring Streaming Usage Chunks: stream: true suppresses token counts by default. Without stream_options: { include_usage: true }, the final chunk returns 0 tokens, silently zeroing out cost dashboards for all streaming endpoints.
Overlooking Prompt Caching Discounts: Cached tokens are billed at ~50% of the standard prompt rate. Failing to parse prompt_tokens_details.cached_tokens inflates reported spend by 20–30% and masks actual optimization ROI.
Hardcoding Model Prices: LLM pricing changes frequently. Embedding rates in code requires redeployment for every price update. Use environment variables or a model_pricing configuration table keyed by model name.
Relying on Local Tokenizers for Billing: tiktoken and similar libraries provide estimates, not billing truth. OpenAI's API response usage object contains the exact charged token counts. Always log API-reported values.
Missing Composite Indexes for Usage Queries: Budget enforcement requires scanning recent rows per session. Querying (session_id, created_at DESC) without a composite index triggers sequential scans during usage spikes, degrading enforcement latency and increasing DB load.

Deliverables

Blueprint: Production-ready Node.js + Express middleware architecture for real-time LLM cost tracking, featuring Postgres-backed structured logging, per-session budget enforcement, graceful degradation hooks, and streaming/caching-aware cost calculation.
Checklist:
- Provision llm_usage_logs schema with NUMERIC(10,6) cost columns
- Create composite index (session_id, created_at DESC)
- Configure LLM_PROMPT_PRICE_PER_M, LLM_CACHED_PROMPT_PRICE_PER_M, LLM_COMPLETION_PRICE_PER_M env vars
- Implement logLLMCall() funnel wrapping all OpenAI client calls
- Enable stream_options: { include_usage: true } for streaming endpoints
- Integrate calculateCost() with prompt_tokens_details.cached_tokens parsing
- Deploy budget cap middleware with configurable thresholds and degradation strategies