c resource allocation system, aligning technical architecture with business unit economics.
Core Solution
Implementing an AI freemium model requires a Metering and Routing Gateway that sits between the client and the model providers. This architecture enforces quotas, tracks token consumption, and dynamically selects models based on tier rules.
Architecture Decisions
- Decoupled Metering: Metering must be independent of inference. Use an event-driven architecture where inference emits metering events to a sidecar or message queue, ensuring quotas are updated even if inference fails.
- Token-Aware Quotas: Quotas must be defined in weighted tokens, not raw requests. Weights allow differentiation between input/output costs and model pricing.
- Graceful Degradation: When quotas are exhausted, the system should not return 429 Too Many Requests. Instead, it should return a structured error with upsell metadata or fall back to a cached/simplified response.
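The token-aware quota idea can be sketched as a small pure function. The per-model weights and the input/output split below are illustrative assumptions, not values taken from any provider's price list, and `weightedTokenCost` is a hypothetical helper name.

```typescript
// Hypothetical weights: how many quota units one raw token costs.
interface TokenWeights {
  input: number;  // weight per input (prompt) token
  output: number; // weight per output (completion) token
}

// Assumed example values; tune them to your own provider pricing.
const MODEL_WEIGHTS: Record<string, TokenWeights> = {
  'model-small-v1': { input: 1, output: 2 },
  'model-reasoning-v1': { input: 3, output: 6 },
};

// Returns the weighted quota cost, rounded up to a whole unit so it can be
// stored in an integer counter.
function weightedTokenCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
  tierMultiplier: number = 1.0,
): number {
  const weights = MODEL_WEIGHTS[model];
  if (!weights) {
    throw new Error(`No weights configured for model: ${model}`);
  }
  const raw = inputTokens * weights.input + outputTokens * weights.output;
  return Math.ceil(raw * tierMultiplier);
}
```

Keeping the calculation in one function means the gateway and the metering consumer charge the same amount for the same request.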
Technical Implementation
1. Quota Definition Schema
Define tiers using a configuration-driven approach. This allows product teams to adjust limits without code deployments.
// quota-config.ts
export interface TierRule {
  tier: string;
  maxDailyTokens: number;
  maxDailyRequests: number;
  allowedModels: string[];
  priority: 'low' | 'medium' | 'high';
  weightMultiplier: number; // Applied to token cost calculation
}

export const TIERS: Record<string, TierRule> = {
  free: {
    tier: 'free',
    maxDailyTokens: 50_000,
    maxDailyRequests: 20,
    allowedModels: ['model-small-v1', 'model-fast-v1'],
    priority: 'low',
    weightMultiplier: 1.0,
  },
  pro: {
    tier: 'pro',
    maxDailyTokens: 1_000_000,
    maxDailyRequests: 500,
    allowedModels: ['model-small-v1', 'model-fast-v1', 'model-reasoning-v1'],
    priority: 'medium',
    weightMultiplier: 0.5, // Pro users get cost discounts on quota usage
  },
  enterprise: {
    tier: 'enterprise',
    maxDailyTokens: -1, // Unlimited
    maxDailyRequests: -1,
    allowedModels: ['*'],
    priority: 'high',
    weightMultiplier: 0.2,
  },
};
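To adjust limits without a code deployment, the tier definitions have to come from data loaded at runtime rather than from a compiled constant. The following is a minimal sketch of validating such data, assuming the config arrives as a JSON string keyed by tier name; `parseTiers`, `parseTierRule`, and `isLimit` are hypothetical helpers, and the `TierRule` interface is repeated so the snippet stands alone.

```typescript
interface TierRule {
  tier: string;
  maxDailyTokens: number;
  maxDailyRequests: number;
  allowedModels: string[];
  priority: 'low' | 'medium' | 'high';
  weightMultiplier: number;
}

const VALID_PRIORITIES = ['low', 'medium', 'high'];

// A limit is either -1 (unlimited) or a non-negative integer.
function isLimit(value: unknown): value is number {
  return typeof value === 'number' && Number.isInteger(value) && value >= -1;
}

// Validates one untrusted config entry and returns it as a TierRule.
function parseTierRule(tier: string, raw: unknown): TierRule {
  if (typeof raw !== 'object' || raw === null) {
    throw new Error(`Tier "${tier}" must be an object`);
  }
  const { maxDailyTokens, maxDailyRequests, allowedModels, priority, weightMultiplier } =
    raw as Record<string, unknown>;
  if (!isLimit(maxDailyTokens) || !isLimit(maxDailyRequests)) {
    throw new Error(`Tier "${tier}" has an invalid daily limit`);
  }
  if (!Array.isArray(allowedModels) || !allowedModels.every(m => typeof m === 'string')) {
    throw new Error(`Tier "${tier}" needs a string array in allowedModels`);
  }
  if (typeof priority !== 'string' || VALID_PRIORITIES.indexOf(priority) === -1) {
    throw new Error(`Tier "${tier}" has an invalid priority`);
  }
  if (typeof weightMultiplier !== 'number' || !(weightMultiplier > 0)) {
    throw new Error(`Tier "${tier}" needs a positive weightMultiplier`);
  }
  return {
    tier,
    maxDailyTokens,
    maxDailyRequests,
    allowedModels: allowedModels as string[],
    priority: priority as TierRule['priority'],
    weightMultiplier,
  };
}

// Parses a JSON document of the form { "<tierName>": { ...rule } }.
function parseTiers(json: string): Record<string, TierRule> {
  const parsed: unknown = JSON.parse(json);
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Tier config must be a JSON object');
  }
  const source = parsed as Record<string, unknown>;
  const tiers: Record<string, TierRule> = {};
  for (const tierName of Object.keys(source)) {
    tiers[tierName] = parseTierRule(tierName, source[tierName]);
  }
  return tiers;
}
```

Rejecting a malformed config at load time is usually preferable to discovering a missing limit in the request path.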
2. Adaptive Router Middleware
The router intercepts requests, validates quotas, and selects the optimal model.
// router-middleware.ts
import { NextFunction, Request, Response } from 'express';
import { MeteringService } from './metering-service';
import { TIERS, TierRule } from './quota-config';

// Request shape used by this middleware. `user` is expected to be set by an
// upstream auth middleware; `aiContext` is populated here for the inference handler.
export interface AIRequest extends Request {
  user?: { id: string };
  aiContext?: {
    userId: string;
    tier: string;
    model: string;
    priority: TierRule['priority'];
    estimatedCost: number;
  };
}

export class AIRouterMiddleware {
  constructor(private metering: MeteringService) {}

  async handle(req: AIRequest, res: Response, next: NextFunction) {
    try {
      const userId = req.user?.id;
      if (!userId) return res.status(401).json({ error: 'Unauthorized' });

      // 1. Fetch User Tier
      const userTier = await this.getUserTier(userId);
      const tierRule = TIERS[userTier];
      if (!tierRule) return res.status(500).json({ error: `Unknown tier: ${userTier}` });

      // 2. Estimate Token Cost
      // Rough heuristic (about 4 characters per token); use a real tokenizer for accuracy.
      const inputTokens = Math.ceil((req.body.prompt?.length ?? 0) / 4);
      const estimatedOutputTokens = 200; // Heuristic or LLM-based estimate
      // Rounded up so the value can be stored in an integer counter.
      const estimatedCost = Math.ceil(
        (inputTokens + estimatedOutputTokens) * tierRule.weightMultiplier,
      );

      // 3. Check Quotas (-1 means unlimited)
      const usage = await this.metering.getDailyUsage(userId);
      const tokensExceeded =
        tierRule.maxDailyTokens !== -1 &&
        usage.tokens + estimatedCost > tierRule.maxDailyTokens;
      const requestsExceeded =
        tierRule.maxDailyRequests !== -1 &&
        usage.requests + 1 > tierRule.maxDailyRequests;
      if (tokensExceeded || requestsExceeded) {
        return this.handleQuotaExhausted(res, userTier);
      }

      // 4. Model Selection
      const requestedModel = req.body.model;
      const selectedModel = this.selectModel(requestedModel, tierRule);
      if (!selectedModel) {
        return res.status(403).json({
          error: 'Model not available in current tier',
          upgradeRequired: true,
        });
      }

      // 5. Inject Routing Metadata
      req.aiContext = {
        userId,
        tier: userTier,
        model: selectedModel,
        priority: tierRule.priority,
        estimatedCost,
      };
      next();
    } catch (err) {
      // Express 4 does not catch rejected promises, so forward errors explicitly.
      next(err);
    }
  }

  private selectModel(requested: string, tier: TierRule): string | null {
    if (tier.allowedModels.includes('*') || tier.allowedModels.includes(requested)) {
      return requested;
    }
    // Fallback logic: Try to find a cheaper allowed model
    const fallback = tier.allowedModels.find(m => m.includes('fast') || m.includes('small'));
    return fallback || null;
  }

  // Returns 200 with a structured payload (not 429) so clients can render an upsell.
  private handleQuotaExhausted(res: Response, tier: string) {
    res.status(200).json({
      status: 'quota_exceeded',
      message: 'Daily quota reached.',
      upgrade_url: `/upgrade?tier=${tier}`,
      retry_after: 86400,
      data: null,
    });
  }

  private async getUserTier(userId: string): Promise<string> {
    // Fetch from DB/Cache
    return 'free'; // Placeholder
  }
}
3. Metering Service with Atomic Updates
Ensure metering updates are atomic to prevent race conditions during burst traffic.
// metering-service.ts
import { Redis } from 'ioredis';

export class MeteringService {
  private redis: Redis;

  constructor() {
    // Falls back to a local instance when REDIS_URL is unset (assumed development default).
    this.redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
  }

  async recordUsage(userId: string, tokens: number, model: string) {
    const key = `quota:${userId}:${this.getTodayKey()}`;
    // One Lua script so all four commands apply atomically
    const script = `
      redis.call('HINCRBY', KEYS[1], 'tokens', ARGV[1])
      redis.call('HINCRBY', KEYS[1], 'requests', 1)
      redis.call('HSET', KEYS[1], 'model', ARGV[2])
      redis.call('EXPIRE', KEYS[1], 86400)
    `;
    // HINCRBY only accepts integers, so round fractional weighted costs up.
    await this.redis.eval(script, 1, key, Math.ceil(tokens), model);
  }

  async getDailyUsage(userId: string) {
    const key = `quota:${userId}:${this.getTodayKey()}`;
    const data = await this.redis.hgetall(key);
    return {
      tokens: parseInt(data.tokens || '0', 10),
      requests: parseInt(data.requests || '0', 10),
      lastModel: data.model,
    };
  }

  // Daily buckets are keyed by UTC date (YYYY-MM-DD), so quotas reset at 00:00 UTC.
  private getTodayKey(): string {
    return new Date().toISOString().split('T')[0];
  }
}
Rationale
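- Weight Multipliers: Allow business logic to adjust effective quotas without changing raw limits. In the configuration above the multiplier is set per tier (Pro usage counts at 0.5x). The same mechanism can be extended with per-model weights so that, for example, premium models consume 3x quota weight, naturally throttling expensive usage.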
- Model Fallback: Prevents hard failures. If a free user requests a premium model, the router silently downgrades to an allowed model, maintaining user flow while preserving cost controls.
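- Redis Lua Scripts: Guarantee atomicity of each quota update, essential for high-concurrency AI APIs. Note that the quota check in the middleware is a separate read, so concurrent requests can still overshoot a limit slightly unless the check and the increment run in the same script.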
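One way to close the gap between checking a quota and recording usage is to do both inside a single Lua script. The sketch below is an assumption-laden illustration rather than part of the service above: `reserveTokens` and `ScriptRunner` are hypothetical names, and the `ScriptRunner` interface is a minimal stand-in intended to match the shape of the `eval` method that `MeteringService` already calls on its ioredis client.

```typescript
// Minimal slice of a Redis client: just the `eval` call this sketch needs.
interface ScriptRunner {
  eval(script: string, numKeys: number, ...args: (string | number)[]): Promise<unknown>;
}

// Checks the limit and increments the counters in one atomic step.
// ARGV[1] = integer cost, ARGV[2] = daily limit (-1 means unlimited).
const RESERVE_SCRIPT = `
local cost = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local used = tonumber(redis.call('HGET', KEYS[1], 'tokens') or '0')
if limit >= 0 and used + cost > limit then
  return 0
end
redis.call('HINCRBY', KEYS[1], 'tokens', ARGV[1])
redis.call('HINCRBY', KEYS[1], 'requests', 1)
redis.call('EXPIRE', KEYS[1], 86400)
return 1
`;

// Resolves to true when the tokens were reserved, false when the quota is exhausted.
async function reserveTokens(
  redis: ScriptRunner,
  userId: string,
  day: string,
  cost: number,
  limit: number,
): Promise<boolean> {
  if (!Number.isInteger(cost) || cost < 0) {
    throw new Error('cost must be a non-negative integer');
  }
  const key = `quota:${userId}:${day}`;
  const result = await redis.eval(RESERVE_SCRIPT, 1, key, cost, limit);
  return result === 1;
}
```

Because the reservation happens before inference, a failed inference call would need a compensating decrement, or the actual usage can be reconciled against the estimate once the response is complete.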
Pitfall Guide
1. The "Big Model" Trap
Mistake: Allowing free users to access the highest-quality model to improve initial UX.
Consequence: Free users consume disproportionate compute costs. A single free user running complex tasks can cost more than a paid user on a standard plan.
Best Practice: Reserve frontier models for paid tiers only. Use smaller, distilled models for freemium. Quality degradation should be the primary upsell trigger.
2. Ignoring Token Cost Asymmetry
Mistake: Quotas based on request count rather than tokens.
Consequence: Users can abuse the system by sending massive prompts with tiny outputs, or vice versa, depending on pricing disparities.
Best Practice: Implement token-weighted quotas. Track input and output tokens separately if pricing differs significantly.
3. Hard Quota Walls
Mistake: Returning 429 errors immediately upon quota exhaustion.
Consequence: High churn. Users feel punished and abandon the product rather than upgrading.
Best Practice: Implement soft limits. Allow slight overages with warnings, or provide a "sponsored" response with ads/branding. Always return an upgrade path in the response payload.
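A soft limit can be expressed as a three-way decision instead of a boolean. The sketch below assumes a warning threshold at 80% of the daily limit and a 10% overage allowance; both ratios, along with the `evaluateSoftLimit` name, are illustrative choices rather than values from the text.

```typescript
type QuotaDecision = 'allow' | 'allow_with_warning' | 'block';

// Decides how to treat a request given current usage and the tier's daily limit.
// dailyLimit of -1 means unlimited, matching the tier configuration convention.
function evaluateSoftLimit(
  usedTokens: number,
  requestCost: number,
  dailyLimit: number,
  warnRatio: number = 0.8,    // assumed: start warning at 80% of the limit
  overageRatio: number = 0.1, // assumed: tolerate up to 10% over the limit
): QuotaDecision {
  if (dailyLimit === -1) return 'allow';
  const projected = usedTokens + requestCost;
  if (projected > dailyLimit * (1 + overageRatio)) return 'block';
  if (projected >= dailyLimit * warnRatio) return 'allow_with_warning';
  return 'allow';
}
```

The `allow_with_warning` outcome is where a response could carry the upgrade path alongside the normal result, so the user sees the limit approaching before being blocked.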
4. Shared Inference Queues
Mistake: Serving free and paid users on the same inference queue with identical priority.
Consequence: Free users degrade the experience for paid users during peak loads, leading to paid churn.
Best Practice: Implement priority queuing. Paid requests bypass free queues. Use rate limiting to smooth free traffic spikes.
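Priority queuing can be sketched with one in-memory lane per priority level. This is a minimal illustration with a hypothetical `PriorityRequestQueue` class; a production system would more likely rely on the priority features of its message broker.

```typescript
type Priority = 'low' | 'medium' | 'high';

// Lanes are drained strictly in this order.
const PRIORITY_ORDER: Priority[] = ['high', 'medium', 'low'];

class PriorityRequestQueue<T> {
  private lanes: Record<Priority, T[]> = { high: [], medium: [], low: [] };

  enqueue(item: T, priority: Priority): void {
    this.lanes[priority].push(item);
  }

  // Returns the oldest item from the highest non-empty lane, or undefined when empty.
  dequeue(): T | undefined {
    for (const priority of PRIORITY_ORDER) {
      if (this.lanes[priority].length > 0) {
        return this.lanes[priority].shift();
      }
    }
    return undefined;
  }

  get size(): number {
    return this.lanes.high.length + this.lanes.medium.length + this.lanes.low.length;
  }
}
```

Strict ordering like this can starve the low lane under sustained paid load, so a real deployment may want to reserve a small share of capacity for free traffic.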
5. Cache Mismanagement
Mistake: Caching responses indiscriminately or not caching at all.
Consequence: Either users receive stale data, or costs remain high for repetitive queries.
Best Practice: Cache aggressively for free users on deterministic queries. Use cache hits as a cost-saving mechanism. Paid users may require fresh responses, so adjust cache TTL by tier.
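Tier-dependent cache TTLs can be sketched with a small in-memory cache. The TTL values mirror the `cache_ttl_seconds` settings in the configuration template; the `TieredResponseCache` class, its key format, and the injectable clock are assumptions made for illustration.

```typescript
// Seconds a cached response stays valid per tier; 0 disables cache reads.
const CACHE_TTL_SECONDS: Record<string, number> = {
  free: 3600,
  pro: 600,
  enterprise: 0,
};

interface CacheEntry {
  value: string;
  storedAt: number; // milliseconds since epoch
}

class TieredResponseCache {
  private entries = new Map<string, CacheEntry>();

  // The clock is injectable so expiry can be tested without waiting.
  constructor(private now: () => number = () => Date.now()) {}

  private keyFor(model: string, prompt: string): string {
    return `${model}:${prompt.trim()}`;
  }

  set(model: string, prompt: string, value: string): void {
    this.entries.set(this.keyFor(model, prompt), { value, storedAt: this.now() });
  }

  // Returns a cached response only if it is still fresh for the caller's tier.
  get(tier: string, model: string, prompt: string): string | undefined {
    const ttlMs = (CACHE_TTL_SECONDS[tier] ?? 0) * 1000;
    const entry = this.entries.get(this.keyFor(model, prompt));
    if (!entry || ttlMs === 0) return undefined;
    return this.now() - entry.storedAt <= ttlMs ? entry.value : undefined;
  }
}
```

Entries are shared across tiers and only the freshness check differs, so a response generated for a paid user can still serve a later free-tier request for the same prompt.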
6. Security Blind Spots
Mistake: Applying weaker security filters to free tier due to cost concerns.
Consequence: Free accounts become vectors for prompt injection, data scraping, or abuse, risking model integrity and compliance.
Best Practice: Security and safety filters must be tier-agnostic. Cost savings should come from model selection and routing, not security reductions.
7. Lack of Usage Analytics
Mistake: Not tracking cost-per-user or conversion triggers.
Consequence: Inability to optimize the freemium model. You cannot adjust tiers without data on where users drop off or how much they cost.
Best Practice: Instrument every inference with cost metrics. Track the correlation between specific features/model usage and conversion events.
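Per-user cost tracking can start as a small accumulator fed after each inference call. The prices below are made-up placeholders, not real provider rates, and `CostTracker` is a hypothetical class; in practice the totals would be persisted rather than held in memory.

```typescript
interface ModelPrice {
  inputPer1K: number;  // USD per 1,000 input tokens
  outputPer1K: number; // USD per 1,000 output tokens
}

// Hypothetical prices for illustration only; substitute your provider's rates.
const HYPOTHETICAL_PRICES: Record<string, ModelPrice> = {
  'model-small-v1': { inputPer1K: 0.001, outputPer1K: 0.002 },
  'model-reasoning-v1': { inputPer1K: 0.01, outputPer1K: 0.03 },
};

class CostTracker {
  private costByUser = new Map<string, number>();

  // Records one inference call and returns its cost in USD.
  record(userId: string, model: string, inputTokens: number, outputTokens: number): number {
    const price = HYPOTHETICAL_PRICES[model];
    if (!price) throw new Error(`No price configured for model: ${model}`);
    const cost =
      (inputTokens / 1000) * price.inputPer1K + (outputTokens / 1000) * price.outputPer1K;
    this.costByUser.set(userId, (this.costByUser.get(userId) ?? 0) + cost);
    return cost;
  }

  totalFor(userId: string): number {
    return this.costByUser.get(userId) ?? 0;
  }

  // Lists users whose accumulated cost exceeds the alert threshold.
  usersOverThreshold(thresholdUsd: number): string[] {
    const flagged: string[] = [];
    this.costByUser.forEach((total, userId) => {
      if (total > thresholdUsd) flagged.push(userId);
    });
    return flagged;
  }
}
```

Joining these totals with conversion events is what makes it possible to see which usage patterns precede an upgrade and which only generate cost.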
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Volume, Low-Complexity Tasks | Token-Weighted Quotas with Small Models | Maximizes user acquisition while controlling costs via efficient models. | Low |
| Enterprise-Grade Reasoning | Strict Tier Gating + Priority Queuing | Protects expensive compute for high-value users; ensures SLA compliance. | High (Revenue Protected) |
| Viral Growth Phase | Adaptive Routing with Soft Limits | Balances user experience with cost control; prevents bankruptcy during spikes. | Medium |
| Embedding/Vector Heavy Workloads | Cached Quotas + Batch Processing | Reduces redundant inference costs; leverages cache hits for free tier. | Low |
Configuration Template
# freemium-config.yaml
tiers:
  free:
    daily_token_limit: 50000
    daily_request_limit: 20
    allowed_models:
      - "model-fast-v1"
      - "model-small-v1"
    priority: low
    weight_multiplier: 1.0
    fallback_model: "model-fast-v1"
    cache_ttl_seconds: 3600
  pro:
    daily_token_limit: 1000000
    daily_request_limit: 500
    allowed_models:
      - "model-fast-v1"
      - "model-small-v1"
      - "model-reasoning-v1"
    priority: medium
    weight_multiplier: 0.5
    fallback_model: null
    cache_ttl_seconds: 600
  enterprise:
    daily_token_limit: -1
    daily_request_limit: -1
    allowed_models: ["*"]
    priority: high
    weight_multiplier: 0.2
    fallback_model: null
    cache_ttl_seconds: 0
monitoring:
  cost_alert_threshold_per_user: 0.50
  conversion_tracking_events:
    - "quota_warning_shown"
    - "model_downgrade_triggered"
    - "upgrade_click"
Quick Start Guide
- Initialize Metering: Deploy the MeteringService with Redis. Configure REDIS_URL in your environment.
- Define Tiers: Copy freemium-config.yaml and adjust limits based on your unit economics. Update quota-config.ts with your tier definitions.
- Integrate Middleware: Add AIRouterMiddleware to your API route handler chain before inference execution.
- Configure Inference: Update your inference client to read req.aiContext.model and req.aiContext.priority for dynamic routing and queue selection.
- Validate: Run load tests with simulated free and paid users. Verify quota enforcement, model routing, and cost metrics in your monitoring dashboard.