nteraction must emit a structured event containing:
- user_id, org_id
- model_id, model_version
- input_tokens, output_tokens
- latency_ms
- cost_center (e.g., chat, embeddings, image_gen)
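A minimal sketch of how the event fields listed above could be typed, using only the standard library. The `UsageEvent` name and the `request_id` and `timestamp` defaults are assumptions added for illustration, not part of the original schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class UsageEvent:
    """One record per AI interaction; field names follow the list above."""
    user_id: str
    org_id: str
    model_id: str
    model_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_center: str  # e.g. "chat", "embeddings", "image_gen"
    # Assumed extras: a deduplication key and an emission timestamp.
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize with stable key order so events diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)
```

Freezing the dataclass keeps an event immutable once created, which suits an append-only billing log.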
2. Architecture Decisions
- Synchronous Enforcement: Hard limits (e.g., free tier caps) must be checked synchronously before inference to prevent cost leaks.
- Asynchronous Billing: Soft limits and overage billing should be processed asynchronously via an event bus to avoid adding latency to the critical path.
- Idempotency: Usage reporting must be idempotent. Network retries should not result in double billing. Use request_id as the deduplication key.
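The idempotency rule can be sketched as follows. `IdempotentLedger` is a hypothetical in-memory stand-in; a real system would enforce the same rule with a unique constraint on request_id in a durable store.

```python
class IdempotentLedger:
    """In-memory stand-in for a store with a unique key on request_id."""

    def __init__(self) -> None:
        self._events = {}  # request_id -> event dict

    def record(self, event: dict) -> bool:
        """Store the event once; return False when request_id was already seen."""
        request_id = event["request_id"]
        if request_id in self._events:
            return False  # a retry of an event that was already recorded
        self._events[request_id] = event
        return True

    def total_tokens(self) -> int:
        """Billable tokens across all unique events."""
        return sum(e["input_tokens"] + e["output_tokens"] for e in self._events.values())
```

Because the second `record` call for the same request_id is a no-op, a producer can retry publishing after a timeout without risking a double charge.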
3. Code Implementation: Python/FastAPI Middleware
This example demonstrates a metering decorator that uses Redis for real-time quota enforcement and an event bus for billing.
from __future__ import annotations

import logging
import time
import uuid
from functools import wraps
from typing import Callable

from fastapi import HTTPException, Request
from redis.asyncio import Redis

logger = logging.getLogger(__name__)

QUOTA_TTL_SECONDS = 86400  # daily quota window


class MonetizationMiddleware:
    # PricingEngine and EventBus are application types defined elsewhere; the
    # __future__ import keeps these annotations from being evaluated at import time.
    def __init__(self, redis: Redis, pricing_engine: PricingEngine, event_bus: EventBus):
        self.redis = redis
        self.pricing_engine = pricing_engine
        self.event_bus = event_bus

    def metered(self, cost_center: str, model: str):
        def decorator(func: Callable):
            @wraps(func)
            async def wrapper(*args, **kwargs):
                # The wrapped route must receive `request` as a keyword argument.
                request: Request = kwargs["request"]
                user_id = request.state.user_id
                org_id = request.state.org_id
                # Reuse a client-supplied ID so retries deduplicate (the header
                # name is an assumption); otherwise fall back to a fresh UUID.
                req_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())

                # 1. Pre-flight Quota Check (Synchronous)
                quota_key = f"quota:{org_id}:{cost_center}"
                current_usage = await self.redis.get(quota_key)
                # Assumes the pricing engine exposes an async get_limit().
                limit = await self.pricing_engine.get_limit(org_id, cost_center)
                if current_usage and int(current_usage) >= limit:
                    raise HTTPException(status_code=429, detail="Quota exceeded")

                # 2. Execute Inference
                start_time = time.perf_counter()
                try:
                    # Assume func returns a response object with token counts
                    response = await func(*args, **kwargs)
                    latency_ms = (time.perf_counter() - start_time) * 1000
                    input_tokens = response.usage.prompt_tokens
                    output_tokens = response.usage.completion_tokens

                    # 3. Calculate Cost
                    cost = self.pricing_engine.calculate(
                        model=model,
                        input_tokens=input_tokens,
                        output_tokens=output_tokens,
                    )

                    # 4. Update Quota & Emit Event
                    # Quotas are expressed in tokens, so count tokens, not requests.
                    tokens_used = input_tokens + output_tokens
                    new_total = await self.redis.incrby(quota_key, tokens_used)
                    if new_total == tokens_used:
                        # First write in this window: set the TTL once so the
                        # counter resets daily instead of sliding on every call.
                        await self.redis.expire(quota_key, QUOTA_TTL_SECONDS)
                    # Awaited here; hand off to a background task if publish
                    # latency matters on the critical path.
                    await self.event_bus.publish({
                        "event": "ai_usage",
                        "request_id": req_id,
                        "user_id": user_id,
                        "org_id": org_id,
                        "cost_center": cost_center,
                        "model": model,
                        "input_tokens": input_tokens,
                        "output_tokens": output_tokens,
                        "latency_ms": latency_ms,
                        "cost_usd": cost,
                        "timestamp": time.time(),
                    })
                    return response
                except HTTPException:
                    raise  # keep status codes raised by the wrapped route
                except Exception as e:
                    logger.exception("Monetization failure for %s", req_id)
                    # Fail open or closed based on policy?
                    # Recommendation: Fail closed for cost protection.
                    raise HTTPException(status_code=503, detail="Service unavailable") from e
            return wrapper
        return decorator
4. Pricing Engine Design
Hardcoding prices is an anti-pattern. Model providers change rates frequently. The pricing engine should load configurations from a versioned store.
class PricingEngine:
    def __init__(self, config_store: "ConfigStore"):
        # ConfigStore is an application type that returns the parsed pricing config.
        self.config = config_store.load_pricing_config()

    def calculate(self, model: str, input_tokens: int, output_tokens: int) -> float:
        pricing = self.config.models.get(model)
        if not pricing:
            raise ValueError(f"Unknown model: {model}")
        # Support for tiered pricing (e.g., first 1M tokens @ $X, next @ $Y)
        input_cost = self._apply_tiers(input_tokens, pricing.input_tiers)
        output_cost = self._apply_tiers(output_tokens, pricing.output_tiers)
        return input_cost + output_cost

    def _apply_tiers(self, tokens: int, tiers: list) -> float:
        # Each tier.limit is treated as the width of that tier, in tokens.
        cost = 0.0
        remaining = tokens
        for tier in tiers:
            if remaining <= 0:
                break
            volume = min(remaining, tier.limit)
            # Rates are quoted per 1,000 tokens (rate_per_1k in pricing_config.yaml).
            cost += (volume / 1000) * tier.rate_per_1k
            remaining -= volume
        if remaining > 0:
            # Without this check, tokens beyond the last tier would be free.
            raise ValueError(f"{remaining} tokens exceed the configured tiers")
        return cost
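As a worked example of the tier arithmetic, here is a standalone sketch that uses the gpt-4o input rates from the configuration template further down. The `Tier` dataclass and `tiered_cost` function are illustrative names, and `limit` is assumed to mean the width of each tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    limit: int          # width of the tier in tokens (assumed meaning)
    rate_per_1k: float  # USD per 1,000 tokens


def tiered_cost(tokens: int, tiers: list) -> float:
    """Charge each tier's rate for the tokens that fall inside that tier."""
    cost = 0.0
    remaining = tokens
    for tier in tiers:
        if remaining <= 0:
            break
        volume = min(remaining, tier.limit)
        cost += (volume / 1000) * tier.rate_per_1k
        remaining -= volume
    return cost


tiers = [Tier(1_000_000, 0.005), Tier(10_000_000, 0.004)]
small = tiered_cost(200_000, tiers)    # 200 * 0.005, about 1.00 USD
large = tiered_cost(1_500_000, tiers)  # 1000 * 0.005 + 500 * 0.004, about 7.00 USD
```

Note that a single request rarely reaches the second tier. Volume discounts usually apply to cumulative usage over a billing period, so the token count passed in would typically be a running total rather than one request's tokens.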
Pitfall Guide
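Most models charge significantly more for output tokens than for input tokens. If your AI generates long responses (e.g., code completion, summaries), output tokens dominate the bill, and a small increase in average response length produces a disproportionate increase in cost.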
- Fix: Implement output token caps and stream responses to allow early termination if costs exceed thresholds.
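One way to sketch the early-termination idea is a generator that wraps a response stream. The whitespace-based token count is a crude placeholder (an assumption); swap in your provider's tokenizer for real accounting.

```python
from typing import Callable, Iterable, Iterator


def cap_stream(
    chunks: Iterable[str],
    max_output_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> Iterator[str]:
    """Yield chunks until the running token count would exceed the cap."""
    used = 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > max_output_tokens:
            # Stop consuming; the caller should also cancel the upstream request
            # so the provider stops generating (and billing) tokens.
            break
        used += n
        yield chunk
```

The cap only saves money if stopping the iteration also closes the upstream connection, so pair it with whatever cancellation mechanism your client library offers.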
2. The "Prompt Injection" Cost Attack
Malicious users can craft prompts designed to maximize output tokens or trigger infinite loops in agentic workflows.
- Fix: Enforce strict max_tokens limits. Monitor for anomalous output lengths per request. Implement circuit breakers for cost-per-minute spikes.
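A cost-per-minute circuit breaker could look like the sketch below. The class name, the budget parameter, and the injectable clock are all illustrative assumptions; a multi-instance deployment would need shared state rather than process memory.

```python
import time
from collections import deque


class CostCircuitBreaker:
    """Trips when spend inside a sliding window exceeds a budget."""

    def __init__(self, budget_usd: float, window_s: float = 60.0, clock=time.monotonic):
        self.budget_usd = budget_usd
        self.window_s = window_s
        self._clock = clock
        self._events = deque()  # (timestamp, cost_usd), oldest first

    def _evict(self, now: float) -> None:
        """Drop entries that have aged out of the window."""
        while self._events and now - self._events[0][0] >= self.window_s:
            self._events.popleft()

    def record(self, cost_usd: float) -> None:
        now = self._clock()
        self._evict(now)
        self._events.append((now, cost_usd))

    def is_open(self) -> bool:
        """True means: reject new requests until spend drops back under budget."""
        self._evict(self._clock())
        return sum(cost for _, cost in self._events) > self.budget_usd
```

Call `record` after each billed request and check `is_open` before starting a new one, per user or per organization depending on where you want the blast radius contained.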
3. Race Conditions in Quota Deduction
Using standard read-modify-write patterns for quotas without atomic operations leads to over-consumption.
- Fix: Use Redis INCR or Lua scripts for atomic quota checks. Never read, calculate, and write separately in distributed environments.
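A sketch of the Lua approach: the check and the increment run as one script, which Redis executes atomically. The function name, key layout, and TTL handling are assumptions; `redis` is expected to be an async client exposing `eval(script, numkeys, *keys_and_args)`, as redis-py's asyncio client does.

```python
# Check-and-increment in a single round trip. No other client can interleave
# between the GET and the INCRBY because the script runs atomically.
RESERVE_QUOTA_LUA = """
local used = tonumber(redis.call('GET', KEYS[1]) or '0')
local amount = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
if used + amount > limit then
  return 0
end
redis.call('INCRBY', KEYS[1], amount)
if redis.call('TTL', KEYS[1]) == -1 then
  redis.call('EXPIRE', KEYS[1], tonumber(ARGV[3]))
end
return 1
"""


async def reserve_quota(redis, key: str, amount: int, limit: int, ttl_s: int = 86400) -> bool:
    """Return True if `amount` was reserved, False if it would exceed `limit`."""
    result = await redis.eval(RESERVE_QUOTA_LUA, 1, key, amount, limit, ttl_s)
    return int(result) == 1
```

Since output token counts are unknown before inference, one option is to reserve an estimate up front (for example the request's max_tokens) and reconcile the difference afterwards.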
4. Forgetting Non-Inference Costs
Monetization often focuses only on LLM tokens. Embeddings, vector database storage, and GPU compute for fine-tuning also incur costs.
- Fix: Expand the cost_center taxonomy. Track embedding tokens and storage GBs. Include these in the total cost of ownership calculation.
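An expanded taxonomy might be sketched as an enum. The `vector_storage` and `fine_tuning` members and the `total_cost` helper are assumptions chosen to match the costs named above, not a fixed standard.

```python
from enum import Enum


class CostCenter(str, Enum):
    CHAT = "chat"
    EMBEDDINGS = "embeddings"
    IMAGE_GEN = "image_gen"
    # Assumed additions for non-inference costs:
    VECTOR_STORAGE = "vector_storage"  # billed per GB stored
    FINE_TUNING = "fine_tuning"        # billed per GPU hour


def total_cost(line_items) -> dict:
    """Sum USD per cost center from (CostCenter, cost_usd) pairs."""
    totals = {}
    for center, cost_usd in line_items:
        totals[center.value] = totals.get(center.value, 0.0) + cost_usd
    return totals
```

Using an enum instead of free-form strings means a misspelled cost center fails at the call site rather than silently creating a new billing bucket.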
5. Hardcoded Pricing Models
API providers update pricing frequently. Hardcoded values go stale and result in margin erosion or overcharging.
- Fix: Externalize pricing to a configuration service or database. Implement a pricing update webhook that refreshes the engine cache without redeployment.
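A refreshable pricing cache could be sketched like this. `PricingCache` and its `loader` callable are hypothetical; the loader stands in for whatever reads your configuration service or database, and `refresh` is what a webhook handler would call.

```python
import threading
from typing import Callable


class PricingCache:
    """Holds the current pricing config and swaps it in on refresh."""

    def __init__(self, loader: Callable[[], dict]):
        self._loader = loader
        self._lock = threading.Lock()
        self._config = loader()

    def refresh(self) -> str:
        """Reload pricing; return the version now in effect."""
        new_config = self._loader()  # load outside the lock so readers never wait on I/O
        if "models" not in new_config:
            # Keep serving the old config rather than swap in a broken one.
            raise ValueError("pricing config has no 'models' section")
        with self._lock:
            self._config = new_config
        return str(new_config.get("version", "unknown"))

    def get(self) -> dict:
        with self._lock:
            return self._config
```

Validating before the swap matters more than the locking: a malformed update should leave the previous prices in place instead of taking billing down.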
6. Lack of User Transparency
Users churn when they receive a bill they cannot explain.
- Fix: Provide a real-time usage dashboard. Include a "Cost Estimator" before execution where possible. Send usage alerts at 50%, 80%, and 100% of quotas.
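The alerting rule can be sketched as a pure function that reports which thresholds a usage update has just crossed, so each alert fires once. The function name and the integer-percent representation are illustrative choices.

```python
def crossed_thresholds(previous: int, current: int, limit: int,
                       thresholds=(50, 80, 100)) -> list:
    """Percent thresholds passed when usage moved from `previous` to `current`."""
    # Integer arithmetic avoids float rounding at the exact boundary.
    return [t for t in thresholds if previous * 100 < t * limit <= current * 100]
```

For example, with a limit of 100, moving from 40 to 85 tokens crosses both the 50% and 80% marks, while a later move from 85 to 90 crosses none and sends nothing.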
7. Model Drift Impacting Costs
As models improve, prompt effectiveness may change, altering token usage patterns. A prompt that worked efficiently on v3 might waste tokens on v4 due to different behavior.
- Fix: Monitor "Cost per Successful Outcome" metrics, not just token volume. A/B test prompts for cost efficiency alongside quality.
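A minimal sketch of the metric, assuming each request record carries a `cost_usd` value and a boolean `success` flag (both field names are assumptions; how you define success is product-specific).

```python
def cost_per_success(requests: list) -> float:
    """Total spend divided by the number of successful outcomes."""
    total = sum(r["cost_usd"] for r in requests)
    successes = sum(1 for r in requests if r["success"])
    if successes == 0:
        return float("inf")  # money spent, nothing delivered
    return total / successes
```

Failed requests still count toward the numerator, which is the point: a prompt variant that is cheaper per call but fails more often can come out more expensive per outcome.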
Production Bundle
Action Checklist
Decision Matrix: Billing Integration
| Feature | Stripe Metered Billing | Custom Ledger + Stripe | Hybrid (LemonSqueezy/Razorpay) |
|---|---|---|---|
| Flexibility | Medium | High | Medium |
| Dev Effort | Low | High | Low |
| Auditability | Medium | High | Medium |
| Real-time Limits | Requires Webhooks | Native | Medium |
| Best For | Standard SaaS | Complex AI Usage | Global/Creator Economy |
| Recommendation | Use for simple per-seat + flat overage. | Recommended for AI. Full control over token logic, complex tiers, and custom cost attribution. | Good for rapid validation, limited by API constraints. |
Configuration Template
Save this as pricing_config.yaml to drive your pricing engine. This structure supports multi-model, tiered pricing.
version: "2024-05-15"
models:
  gpt-4o:
    input_tiers:
      - limit: 1000000
        rate_per_1k: 0.005
      - limit: 10000000
        rate_per_1k: 0.004
    output_tiers:
      - limit: 1000000
        rate_per_1k: 0.015
      - limit: 10000000
        rate_per_1k: 0.012
  text-embedding-3-large:
    input_tiers:
      - limit: 100000000
        rate_per_1k: 0.00013
    output_tiers: []  # Embeddings usually input-only
quotas:
  free_tier:
    daily_limit: 100000  # tokens
    monthly_limit: 2000000
  pro_tier:
    daily_limit: 5000000
    monthly_limit: 100000000
Quick Start Guide
- Initialize Middleware:
pip install redis fastapi ai-monetizer
# Configure Redis connection and PricingEngine in your app entry point
- Wrap Endpoints:
Apply the @metered decorator to all routes invoking AI models. Ensure request.state contains user/org context.
- Configure Pricing:
Create pricing_config.yaml based on current provider rates. Load into your PricingEngine.
- Deploy & Monitor:
Deploy the service. Monitor the ai_usage event stream. Verify that the Redis usage counters are incrementing and events are firing.
- Iterate:
After 7 days, analyze cost attribution data. Adjust pricing tiers and quotas based on actual usage distribution.
Codcompass Insight: Monetization is not just a revenue function; it is a system design constraint. By engineering monetization into the architecture from day one, you gain real-time visibility into unit economics, protect margins against usage volatility, and build a product that scales profitably. Treat tokens as currency, and your middleware as the central bank.