ion
Before implementing controls, classify traffic by cost and statefulness. Not all requests require the same resources. Static assets, authenticated API calls, and unauthenticated public endpoints should be routed differently.
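// traffic-classifier.ts
// Assumes an Express-style request object that exposes `path` and `headers`.
import type { Request } from 'express';

export enum TrafficClass {
  STATIC = 'static',
  AUTH_API = 'auth_api',
  PUBLIC_API = 'public_api',
  WEBHOOK = 'webhook'
}

// Order matters: path-based classes are checked before the auth header,
// so an authenticated request for a static asset is still classified as STATIC.
export function classifyRequest(req: Request): TrafficClass {
  if (req.path.startsWith('/assets/') || req.path.startsWith('/cdn/')) return TrafficClass.STATIC;
  if (req.path.startsWith('/api/webhook/')) return TrafficClass.WEBHOOK;
  if (req.headers.authorization) return TrafficClass.AUTH_API;
  return TrafficClass.PUBLIC_API;
}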
Route classified traffic to dedicated processing pipelines. Static and public API traffic should hit cache layers first. Webhooks require idempotency and async processing. Authenticated API calls require connection pooling and rate limiting.
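One way to make this routing explicit is a lookup table from traffic class to pipeline policy. The sketch below is illustrative only: `PipelinePolicy`, `PIPELINE_POLICIES`, and `policyFor` are hypothetical names, and the flags simply restate the rules in the paragraph above.

```typescript
// pipeline-policy.ts (hypothetical sketch; names are illustrative)
type TrafficKind = 'static' | 'auth_api' | 'public_api' | 'webhook';

interface PipelinePolicy {
  cacheFirst: boolean;      // try cache layers before reaching the origin
  rateLimited: boolean;     // apply per-client rate limiting
  pooledDb: boolean;        // go through the pooled database layer
  asyncProcessing: boolean; // enqueue and acknowledge instead of processing inline
  idempotent: boolean;      // deduplicate retried deliveries
}

const PIPELINE_POLICIES: Record<TrafficKind, PipelinePolicy> = {
  static:     { cacheFirst: true,  rateLimited: false, pooledDb: false, asyncProcessing: false, idempotent: false },
  // The text does not say whether public endpoints are rate limited; adjust as needed.
  public_api: { cacheFirst: true,  rateLimited: false, pooledDb: false, asyncProcessing: false, idempotent: false },
  webhook:    { cacheFirst: false, rateLimited: false, pooledDb: false, asyncProcessing: true,  idempotent: true  },
  auth_api:   { cacheFirst: false, rateLimited: true,  pooledDb: true,  asyncProcessing: false, idempotent: false }
};

function policyFor(kind: TrafficKind): PipelinePolicy {
  return PIPELINE_POLICIES[kind];
}
```

Keeping the policy in data rather than in branching code makes it easier to review which class gets which treatment.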
Step 2: Adaptive Rate Limiting with Token Bucket Algorithm
Fixed rate limits break under burst traffic. Adaptive limiting adjusts thresholds based on backend health and queue depth.
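// adaptive-rate-limiter.ts
// Note: this is a per-second fixed-window counter whose limit scales with
// backend health; it is not a token bucket in the strict sense.
import { Redis } from 'ioredis';

export class AdaptiveRateLimiter {
  private redis: Redis;
  private baseLimit: number; // stored for configuration; not used by isAllowed
  private minLimit: number;
  private maxLimit: number;

  constructor(redis: Redis, baseLimit = 100, minLimit = 20, maxLimit = 300) {
    this.redis = redis;
    this.baseLimit = baseLimit;
    this.minLimit = minLimit;
    this.maxLimit = maxLimit;
  }

  async isAllowed(clientId: string, backendHealthScore: number): Promise<boolean> {
    // Clamp the score to [0, 1] so a bad reading cannot push the limit out of range.
    const health = Math.min(1, Math.max(0, backendHealthScore));
    const adjustedLimit = Math.floor(
      this.minLimit + (this.maxLimit - this.minLimit) * health
    );
    // One counter per client per one-second window.
    const key = `rate:${clientId}:${Math.floor(Date.now() / 1000)}`;
    const current = await this.redis.incr(key);
    await this.redis.expire(key, 1);
    return current <= adjustedLimit;
  }
}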
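For comparison with the windowed counter above, here is a minimal in-memory token bucket, the algorithm this step's title names. It is a single-process sketch under stated assumptions: the class name `TokenBucket` and its parameters are hypothetical, and a distributed version would need to keep this state in Redis and update it atomically.

```typescript
// token-bucket.ts (hypothetical in-memory sketch; single process only)
class TokenBucket {
  private capacity: number;        // maximum burst size
  private refillPerSecond: number; // sustained request rate
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true and deducts `cost` tokens if enough are available.
  tryConsume(nowMs: number = Date.now(), cost: number = 1): boolean {
    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}
```

Unlike a fixed window, the bucket absorbs a burst up to `capacity` and then settles to `refillPerSecond`. An adaptive variant could scale `refillPerSecond` with the backend health score.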
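Backend health score should derive from database connection pool utilization, error rates, and queue depth, normalized to a value between 0 (failing) and 1 (healthy). The limit scales linearly with the score: at 0 it falls to minLimit, at 1 it rises to maxLimit, and with the defaults a score of 0.4 allows about 132 requests per second per client. As health drops the limiter tightens; as health recovers it relaxes. This helps prevent cascade failures during partial degradation.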
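The text does not specify how the three signals combine into one score. A minimal sketch, assuming the worst signal should dominate, follows. `HealthInputs`, `computeHealthScore`, and the 10% error-rate ceiling are assumptions made for illustration, not values from the original.

```typescript
// health-score.ts (hypothetical sketch; the weighting is an assumption)
interface HealthInputs {
  poolUtilization: number; // fraction of DB connections in use, 0..1
  errorRate: number;       // fraction of failed requests, 0..1
  queueDepth: number;      // current number of queued requests
  maxQueueDepth: number;   // depth at which the queue counts as saturated
}

// Assumed ceiling: an error rate of 10% or more counts as fully unhealthy.
const ERROR_RATE_CEILING = 0.1;

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Returns 1 for a healthy backend and 0 for a saturated or failing one.
function computeHealthScore(h: HealthInputs): number {
  const poolPressure = clamp01(h.poolUtilization);
  const errorPressure = clamp01(h.errorRate / ERROR_RATE_CEILING);
  const queuePressure = h.maxQueueDepth > 0 ? clamp01(h.queueDepth / h.maxQueueDepth) : 0;
  // The worst signal dominates, so one saturated resource is enough to tighten limits.
  return 1 - Math.max(poolPressure, errorPressure, queuePressure);
}
```

With the limiter defaults (minLimit 20, maxLimit 300), a score of 0.5 maps to a limit of 160 requests per second per client.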
Step 3: Multi-Layer Caching with Stampede Prevention
Single-tier caching creates hot keys and cache stampedes. Implement CDN → Redis → In-Memory layering with probabilistic early expiration.
// multi-layer-cache.ts
import { Redis } from 'ioredis';

export class MultiLayerCache {
  private redis: Redis;
  // ttl and earlyExpire are absolute timestamps in milliseconds.
  private memoryCache: Map<string, { value: any; ttl: number; earlyExpire: number }>;

  constructor(redis: Redis) {
    this.redis = redis;
    this.memoryCache = new Map();
  }

  async get(key: string): Promise<any | null> {
    // L1: In-memory. Note: earlyExpire is recorded on each entry but not consulted here.
    const mem = this.memoryCache.get(key);
    if (mem && Date.now() < mem.ttl) return mem.value;
    // L2: Redis
    const redisVal = await this.redis.get(key);
    if (redisVal !== null) {
      const parsed = JSON.parse(redisVal);
      // Keep Redis-sourced entries in memory for 5 s only, to limit staleness.
      this.memoryCache.set(key, { value: parsed, ttl: Date.now() + 5000, earlyExpire: Date.now() + 3000 });
      return parsed;
    }
    return null;
  }

  // baseTTL is in seconds, because SETEX takes its TTL in seconds.
  async set(key: string, value: any, baseTTL: number): Promise<void> {
    // Jitter by -1, 0, or +1 second and never go below 1 second.
    const jitterSec = Math.floor(Math.random() * 3) - 1;
    const jitteredTTL = Math.max(1, baseTTL + jitterSec);
    await this.redis.setex(key, jitteredTTL, JSON.stringify(value));
    this.memoryCache.set(key, {
      value,
      ttl: Date.now() + baseTTL * 1000,
      earlyExpire: Date.now() + Math.max(0, baseTTL - 2) * 1000
    });
  }
}
Jittered TTLs and early expiration windows prevent cache stampedes by staggering regeneration. L1 memory cache handles sub-millisecond reads for hot keys. Redis handles distributed state. CDN handles static and public endpoint caching.
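Probabilistic early expiration can be sketched as a single decision function. The version below follows the commonly described "XFetch" formulation, in which each reader independently decides to refresh slightly before expiry, with a probability that rises as expiry approaches. The function name and parameters are hypothetical and are not part of the class above.

```typescript
// early-expiration.ts (hypothetical sketch of probabilistic early expiration)
// nowMs:       current time in milliseconds
// expiresAtMs: absolute expiry time of the cached value
// recomputeMs: how long regenerating the value typically takes
// beta:        values above 1 refresh earlier, values below 1 refresh later
// rand:        injectable random source in (0, 1], for testing
function shouldRefreshEarly(
  nowMs: number,
  expiresAtMs: number,
  recomputeMs: number,
  beta: number = 1,
  rand: () => number = Math.random
): boolean {
  // Guard against log(0), which would be -Infinity.
  const r = Math.max(rand(), Number.MIN_VALUE);
  // -log(r) is exponentially distributed, so most readers see a small head
  // start and only a few see a large one; refreshes are therefore spread out.
  const headStartMs = -recomputeMs * beta * Math.log(r);
  return nowMs + headStartMs >= expiresAtMs;
}
```

A caller would check this on each cache hit and regenerate the value when it returns true, so that one request usually refreshes the key before it expires for everyone.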
Step 4: Database Connection Pooling and Read Replica Routing
Connection exhaustion is the primary traffic bottleneck. Use connection pooling with circuit breaking and route read-heavy traffic to replicas.
// db-pool-manager.ts
import { Pool, PoolConfig } from 'pg';

export class DbPoolManager {
  private primary: Pool;
  private replica: Pool;

  constructor(primaryConfig: PoolConfig, replicaConfig: PoolConfig) {
    this.primary = new Pool({ ...primaryConfig, max: 20, idleTimeoutMillis: 30000 });
    this.replica = new Pool({ ...replicaConfig, max: 50, idleTimeoutMillis: 30000 });
  }

  // Queries go to the replica unless isWrite is true. Reads that must observe
  // a just-committed write should also pass isWrite = true to avoid replica lag.
  async query(sql: string, params?: any[], isWrite: boolean = false) {
    const pool = isWrite ? this.primary : this.replica;
    const client = await pool.connect();
    try {
      return await client.query(sql, params);
    } finally {
      // Always return the connection to the pool, even if the query throws.
      client.release();
    }
  }
}
Set max connections based on database CPU cores and memory, not application instances. Use PgBouncer or equivalent in transaction mode to multiplex connections. Route 70–80% of traffic to read replicas under growth conditions.
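This step mentions circuit breaking, which the pool manager itself does not implement. A minimal sketch of a breaker that could wrap database calls follows. `CircuitBreaker`, its default thresholds, and the injectable clock are assumptions made for illustration.

```typescript
// circuit-breaker.ts (hypothetical sketch; thresholds are illustrative)
type BreakerState = 'closed' | 'open' | 'half_open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAtMs = 0;
  private failureThreshold: number;
  private cooldownMs: number;

  constructor(failureThreshold: number = 5, cooldownMs: number = 10000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  getState(): BreakerState {
    return this.state;
  }

  // Runs op unless the breaker is open and still cooling down.
  async run<T>(op: () => Promise<T>, nowMs: () => number = Date.now): Promise<T> {
    if (this.state === 'open') {
      if (nowMs() - this.openedAtMs < this.cooldownMs) {
        throw new Error('circuit open'); // fail fast without touching the database
      }
      // Cooldown elapsed: let a trial call through. Concurrent callers may
      // also pass at this point; this sketch does not serialize them.
      this.state = 'half_open';
    }
    try {
      const result = await op();
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the breaker.
      if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAtMs = nowMs();
      }
      throw err;
    }
  }
}
```

A call site might look like `breaker.run(() => db.query(sql, params))`, with one breaker per pool so that a failing replica does not block writes to the primary.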
Step 5: Observability-Driven Autoscaling
CPU-based autoscaling fails for I/O-bound traffic. Use custom metrics: request queue depth, database connection utilization, and cache hit ratio.
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-autoscaler
spec:
  scaleTargetRef:
    name: api-deployment
  triggers:
    # error_rate: share of requests answered with a 5xx status over the last minute
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        query: |
          sum(rate(http_request_duration_seconds_count{status=~"5.."}[1m]))
          / sum(rate(http_request_duration_seconds_count[1m]))
        threshold: "0.05"
    # db_connection_util: active connections as a share of max_connections.
    # Both sides are aggregated with sum() so their label sets match for the division.
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        query: |
          sum(avg_over_time(pg_stat_activity_count{datname="appdb"}[5m]))
          / sum(pg_settings_max_connections)
        threshold: "0.7"
KEDA evaluates custom metrics every 15–30 seconds, scaling replicas before queue buildup occurs. This eliminates the 4-minute autoscale lag inherent in CPU-based HPA.
Pitfall Guide
- Treating rate limiting as a security feature only
  Rate limiting is a traffic-shaping mechanism. Fixed limits ignore backend health and cause unnecessary 429s during legitimate bursts. Adaptive limiting tied to health scores preserves throughput during partial degradation.
- Over-caching dynamic endpoints with auth context
  Caching user-specific or session-dependent responses without cache key segmentation causes data leakage and stale auth states. Always include tenant/user hash in cache keys and set strict TTLs for auth-adjacent endpoints.
- Ignoring connection pool exhaustion under burst traffic
  Applications often scale horizontally while database connections remain static. Each new instance competes for the same connection pool, causing queueing. Use connection pooling proxies (PgBouncer, ProxySQL) and set pool limits based on database capacity, not application replicas.
- Relying on CPU-based autoscaling for I/O-bound workloads
  Traffic spikes increase database and cache I/O, not CPU. CPU autoscalers trigger too late or too aggressively. Use custom metrics: error rate, queue depth, connection utilization, and cache miss ratio.
- Missing request correlation IDs in distributed tracing
  Without correlation IDs, timeout storms cannot be traced to origin. Inject X-Request-ID at the edge, propagate through all services, and attach to database queries and cache operations. This reduces mean time to resolution (MTTR) by 60–80% during traffic incidents.
- Using blanket TTLs instead of cache invalidation strategies
  Fixed TTLs cause stampedes and stale data. Implement probabilistic early expiration, write-through invalidation for critical paths, and event-driven cache purging for user-specific data.
- Scaling horizontally without addressing stateful session storage
  Stateless applications require distributed session stores. If sessions remain in-memory, horizontal scaling breaks authentication and personalization. Use Redis or equivalent for session state, and configure sticky sessions only as a temporary mitigation.
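The cache key segmentation advice in the list above can be sketched as a small key builder. The function name `scopedCacheKey` and the `cache:` prefix are hypothetical. The important property is that the tenant and user are part of every key for a user-specific response.

```typescript
// scoped-cache-key.ts (hypothetical sketch; key layout is an assumption)
import { createHash } from 'node:crypto';

// Builds a cache key that cannot collide across tenants or users.
function scopedCacheKey(tenantId: string, userId: string, resource: string): string {
  // JSON encoding keeps ("a:b", "c") distinct from ("a", "b:c").
  const scope = createHash('sha256')
    .update(JSON.stringify([tenantId, userId]))
    .digest('hex')
    .slice(0, 16); // shortened for readable keys; use the full digest if collisions matter
  return `cache:${scope}:${resource}`;
}
```

Hashing also keeps raw user identifiers out of cache key listings and logs.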
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Marketing campaign spike (3–5x, 2–4 hours) | Adaptive rate limiting + Redis L2 + KEDA custom metrics | Handles burst without over-provisioning; scales predictably | -40% vs baseline autoscaling |
| API-heavy B2B workload (steady 2x growth) | Connection pooling + read replicas + write-through cache | Database I/O is the bottleneck; horizontal scaling amplifies contention | -25% infrastructure, +15% cache spend |
| Global rollout with regional traffic | CDN edge caching + regional Redis clusters + geo-routing | Latency and cross-region DB calls dominate cost; edge caching reduces origin load | -60% origin compute, +10% CDN cost |
Configuration Template
# docker-compose.traffic-engineering.yml
# Note: pg-primary and pg-replica are not defined in this file; they must be
# reachable on the same network (for example, an existing database deployment).
version: "3.8"
services:
  api:
    build: .
    environment:
      - REDIS_URL=redis://redis:6379
      - DB_PRIMARY=postgresql://user:pass@pg-primary:5432/appdb
      - DB_REPLICA=postgresql://user:pass@pg-replica:5432/appdb
      - RATE_LIMIT_BASE=100
      - HEALTH_CHECK_INTERVAL=5000
    depends_on:
      - redis
      - pg-bouncer
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "0.5"
  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
    ports:
      - "6379:6379"
  pg-bouncer:
    image: edoburu/pgbouncer:latest
    environment:
      - DB_HOST=pg-primary
      - DB_PORT=5432
      - POOL_MODE=transaction
      - MAX_CLIENT_CONN=200
      - DEFAULT_POOL_SIZE=20
    ports:
      - "6432:5432"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
Quick Start Guide
- Deploy the stack: Run docker compose -f docker-compose.traffic-engineering.yml up -d to spin up API, Redis, PgBouncer, and Prometheus.
- Instrument your app: Add X-Request-ID middleware, integrate the adaptive rate limiter, and configure the multi-layer cache wrapper around your primary data fetchers.
- Configure health scoring: Expose a /health endpoint returning database connection utilization, cache hit ratio, and error rate. Point Prometheus to scrape it every 15 seconds.
- Validate with load testing: Run k6 run load-test.js simulating 3x traffic. Monitor P95 latency, cache hit ratio, and connection utilization. Adjust rate limit thresholds and pool sizes until P95 remains under 100ms.
- Switch to production autoscaling: Replace CPU-based HPA with the KEDA ScaledObject template. Verify scaling triggers fire at 70% connection utilization and 5% error rate.
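The X-Request-ID step in the list above can be sketched as a framework-agnostic helper that a middleware would call at the edge. `ensureRequestId` and `HeaderBag` are hypothetical names, and the header map is assumed to use lowercased names, as Node's http module does for incoming requests.

```typescript
// request-id.ts (hypothetical sketch; not tied to any specific framework)
import { randomUUID } from 'node:crypto';

type HeaderBag = Record<string, string | undefined>;

// Reuses an incoming X-Request-ID when present, otherwise generates one,
// and stores it on the headers so downstream calls can forward it.
function ensureRequestId(headers: HeaderBag, generate: () => string = randomUUID): string {
  const existing = headers['x-request-id'];
  if (existing !== undefined && existing.trim() !== '') return existing;
  const id = generate();
  headers['x-request-id'] = id;
  return id;
}
```

The returned ID can then be attached to log lines, outbound HTTP calls, and database query comments so that one request is traceable end to end.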