se's max_connections divided by the number of application instances.
- Redis: Use a persistent connection pool. Avoid creating clients per request.
- HTTP: Enforce HTTP/1.1 Keep-Alive or HTTP/2 multiplexing. Ensure load balancers maintain connection state.
2. Cache Stampede Prevention
Standard cache-aside patterns fail under load when keys expire simultaneously. Implement one of two strategies:
- Probabilistic Early Expiration: Refresh a key shortly before it expires, with a probability that rises as expiry approaches, and randomize TTLs slightly so keys expire at different times.
- Distributed Locking: When a cache miss occurs, acquire a lock before querying the database. Other requests wait for the lock holder to populate the cache.
// TypeScript: Cache Manager with Stampede Protection
import { randomUUID } from 'node:crypto';
import { Redis } from 'ioredis';

interface CacheManagerOptions {
  redis: Redis;
  defaultTTL: number;  // seconds
  lockTimeout: number; // seconds
}

// Deletes the lock only if it still holds this worker's token, so a worker
// whose lock already expired cannot release a lock owned by someone else.
const RELEASE_LOCK_SCRIPT = `
if redis.call('get', KEYS[1]) == ARGV[1] then
  return redis.call('del', KEYS[1])
end
return 0`;

export class ResilientCacheManager {
  private redis: Redis;
  private defaultTTL: number;
  private lockTimeout: number;

  constructor(options: CacheManagerOptions) {
    this.redis = options.redis;
    this.defaultTTL = options.defaultTTL;
    this.lockTimeout = options.lockTimeout;
  }

  async getOrSet<T>(
    key: string,
    fetchFn: () => Promise<T>,
    ttl: number = this.defaultTTL
  ): Promise<T> {
    const lockKey = `lock:${key}`;

    // Loop instead of recursing so waiting workers do not build a promise chain.
    while (true) {
      // 1. Check Cache (compare against null so falsy payloads still count as hits)
      const cached = await this.redis.get(key);
      if (cached !== null) {
        return JSON.parse(cached) as T;
      }

      // 2. Stampede Protection via Distributed Lock
      const token = randomUUID();
      const acquired = await this.redis.set(lockKey, token, 'EX', this.lockTimeout, 'NX');

      if (acquired) {
        try {
          // Double-check cache after acquiring lock (another worker might have populated it)
          const doubleCheck = await this.redis.get(key);
          if (doubleCheck !== null) {
            return JSON.parse(doubleCheck) as T;
          }

          // Fetch and Set
          const result = await fetchFn();

          // Add up to 59 seconds of jitter to the TTL to prevent synchronized expiration
          const jitteredTTL = ttl + Math.floor(Math.random() * 60);
          await this.redis.set(key, JSON.stringify(result), 'EX', jitteredTTL);
          return result;
        } finally {
          await this.redis.eval(RELEASE_LOCK_SCRIPT, 1, lockKey, token);
        }
      }

      // Lock held by another worker; wait, then re-check the cache
      await new Promise(resolve => setTimeout(resolve, 100));
    }
  }
}
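The class above combines locking with TTL jitter. Probabilistic early expiration can also be applied on the read path: each reader occasionally decides to refresh a value before it expires, and the chance of doing so grows as expiry approaches. The following is a minimal sketch of that decision, modelled on the approach often called XFetch. The function name, parameters, and the injectable `rand` argument are hypothetical and not part of `ResilientCacheManager`; the caller is assumed to store the expiry time and the last recomputation time alongside the cached value.

```typescript
/**
 * Decides whether a reader should recompute a cached value before it expires.
 *
 * nowMs       current time in milliseconds
 * expiryMs    absolute expiry time of the cached value in milliseconds
 * recomputeMs how long the last recomputation took, in milliseconds
 * beta        aggressiveness; 1.0 is neutral, higher values refresh earlier
 * rand        uniform random number in (0, 1]; injectable for deterministic tests
 */
export function shouldRefreshEarly(
  nowMs: number,
  expiryMs: number,
  recomputeMs: number,
  beta: number = 1.0,
  rand: number = 1 - Math.random(), // Math.random() is [0, 1); shift to (0, 1]
): boolean {
  // -ln(rand) is >= 0, so the shifted time only ever moves toward expiry.
  const shiftedNow = nowMs - recomputeMs * beta * Math.log(rand);
  return shiftedNow >= expiryMs;
}
```

Slow-to-compute values get a wider early-refresh window because `recomputeMs` scales the shift, which is the main reason to prefer this over a fixed "refresh N seconds early" rule.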
3. Asynchronous Offloading
Synchronous execution of non-critical logic kills throughput. Identify side effects (analytics, notifications, audit logs) and move them to a message queue.
- Pattern: Request -> API -> Response + Queue Message.
- Queue: Kafka or RabbitMQ for durability. Redis Streams for lightweight use cases.
- Batching: Consumers should process messages in batches to reduce database write overhead.
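As a rough illustration of the batching point, a consumer can buffer messages and hand them to the database in groups. The sketch below is an in-memory buffer only; `BatchBuffer` and `flushFn` are hypothetical names, and a real consumer would also flush on a timer and acknowledge messages to the broker only after the write succeeds.

```typescript
// Hypothetical in-memory batch buffer for a queue consumer.
export class BatchBuffer<T> {
  private items: T[] = [];
  private readonly maxBatchSize: number;
  private readonly flushFn: (batch: T[]) => void;

  constructor(maxBatchSize: number, flushFn: (batch: T[]) => void) {
    if (!Number.isInteger(maxBatchSize) || maxBatchSize < 1) {
      throw new Error('maxBatchSize must be a positive integer');
    }
    this.maxBatchSize = maxBatchSize;
    this.flushFn = flushFn;
  }

  // Buffers one message and flushes as soon as the batch is full.
  add(item: T): void {
    this.items.push(item);
    if (this.items.length >= this.maxBatchSize) {
      this.flush();
    }
  }

  // Hands any buffered messages to flushFn as a single batch.
  flush(): void {
    if (this.items.length === 0) {
      return;
    }
    const batch = this.items;
    this.items = [];
    this.flushFn(batch);
  }
}
```

With a batch size of 3, adding four messages and then calling `flush()` produces two writes (three items, then one) instead of four.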
4. Serialization Optimization
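JSON serialization becomes a measurable CPU and bandwidth cost at 100M requests. Benchmarking shows Protocol Buffers (Protobuf) or MessagePack can reduce payload size by 40-60% and CPU usage by 30%, although the gain depends on payload shape, so measure with your own data.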
- Internal Services: Switch to gRPC/Protobuf for inter-service communication.
- Public API: If JSON is required, use schema-based serializers (e.g., fast-json-stringify in Node.js) or compress responses with Brotli.
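To get a feel for what Brotli does to a JSON payload, the sketch below uses Node's built-in `node:zlib` module. The helper names `compressJson` and `decompressJson` and the default quality of 5 are assumptions for illustration; in production, response compression is more commonly handled by middleware or the reverse proxy than by hand.

```typescript
import { brotliCompressSync, brotliDecompressSync, constants } from 'node:zlib';

// Serializes a value to JSON and compresses it with Brotli.
// quality ranges from 0 (fastest) to 11 (smallest); 5 is an assumed middle ground.
export function compressJson(value: unknown, quality: number = 5): Buffer {
  const json = Buffer.from(JSON.stringify(value), 'utf8');
  return brotliCompressSync(json, {
    params: { [constants.BROTLI_PARAM_QUALITY]: quality },
  });
}

// Reverses compressJson.
export function decompressJson<T>(compressed: Buffer): T {
  return JSON.parse(brotliDecompressSync(compressed).toString('utf8')) as T;
}
```

Repetitive payloads such as lists of similar objects compress well; small or already-compact responses may not be worth the CPU, so compare sizes and latency before enabling compression everywhere.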
5. Database Read/Write Separation
- Reads: Route all read traffic to read replicas. Use a connection pooler that supports read/write splitting.
- Writes: Implement write-ahead logging and batch inserts. For write-heavy endpoints, consider sharding based on a tenant ID or user ID to distribute load.
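Sharding by tenant ID needs a stable mapping from tenant to shard. A minimal sketch, assuming a fixed shard count and a simple FNV-1a-style hash over the ID's UTF-16 code units; `fnv1a32` and `shardFor` are hypothetical helpers, not part of any library.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units (identical to byte-wise
// FNV-1a for ASCII input).
export function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Maps a tenant ID to a shard index in [0, shardCount).
export function shardFor(tenantId: string, shardCount: number): number {
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new Error('shardCount must be a positive integer');
  }
  return fnv1a32(tenantId) % shardCount;
}
```

Plain modulo routing remaps most tenants when the shard count changes; if resharding is expected, a lookup table or consistent hashing is usually a better fit.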
Pitfall Guide
1. Cache Stampede Ignorance
Mistake: Using simple GET/SET without locking or jitter.
Result: When a popular key expires, thousands of requests hit the database simultaneously, causing connection exhaustion and timeouts.
Fix: Implement distributed locking or probabilistic expiration as shown in the Core Solution.
2. Connection Pool Exhaustion
Mistake: Configuring pool sizes based on development environment or ignoring the ratio between app instances and DB connections.
Result: Application threads block waiting for connections, causing latency to spike linearly with traffic.
Fix: Calculate pool size: PoolSize = (DB_Max_Connections / App_Instances) * 0.8. Monitor waiting_clients metrics.
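The formula can be written as a small helper. This is a sketch only: the function name is hypothetical, and rounding down with a floor of one connection is an assumption added so the result is always a usable pool size.

```typescript
// PoolSize = (DB_Max_Connections / App_Instances) * 0.8, rounded down.
export function poolSizePerInstance(
  dbMaxConnections: number,
  appInstances: number,
  headroom: number = 0.8, // leaves 20% for migrations, admin sessions, and failover
): number {
  if (appInstances < 1) {
    throw new Error('appInstances must be >= 1');
  }
  return Math.max(1, Math.floor((dbMaxConnections / appInstances) * headroom));
}
```

For example, a database with `max_connections` of 500 shared by 10 instances gives each instance a pool of 40. Remember to recompute this whenever auto-scaling changes the instance count.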
3. Synchronous Side-Effects
Mistake: Sending emails, updating analytics, or writing audit logs within the request-response cycle.
Result: Request duration increases by 20-50ms per side effect. At scale, this reduces effective throughput by 30%.
Fix: Move all non-critical work to async queues. Return 202 Accepted immediately for operations that require processing.
4. N+1 Query Patterns in Microservices
Mistake: Fetching a list of items, then making individual RPC calls for each item's details.
Result: A single API request generates hundreds of downstream calls, causing RPC storms and cascading failures.
Fix: Implement DataLoader patterns or batch endpoints. Aggregate data in the gateway layer before returning.
5. Ignoring p99 Latency
Mistake: Optimizing for average latency.
Result: The system appears healthy, but a subset of requests experiences multi-second delays due to GC pauses, lock contention, or slow queries.
Fix: Monitor p95 and p99 latency instead of relying on averages. Set alerts on p99. Profile tail latency specifically.
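A small example shows how an average hides the tail. The sketch below computes percentiles with the nearest-rank method; `percentile` and `mean` are hypothetical helpers, and a production system would normally rely on histogram-based metrics rather than sorting raw samples.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('no samples');
  }
  if (p <= 0 || p > 100) {
    throw new Error('p must be in (0, 100]');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

export function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}
```

Take 100 requests where 98 finish in 20 ms and 2 take 3000 ms: the mean is about 80 ms and p95 is still 20 ms, while p99 is 3000 ms. Only the p99 figure reveals that some users are waiting three seconds.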
6. Over-Caching Mutable Data
Mistake: Caching data that changes frequently without a robust invalidation strategy.
Result: Stale data issues and cache thrashing where data is evicted and re-fetched constantly.
Fix: Use write-through caching for mutable data or set short TTLs. Implement event-driven cache invalidation via pub/sub.
7. Lack of Backpressure
Mistake: Accepting all incoming requests regardless of downstream capacity.
Result: System overload leads to total failure. Memory usage grows without bound as requests queue up in memory.
Fix: Implement rate limiting at the edge and circuit breakers between services. Reject requests early with 429 Too Many Requests or 503 Service Unavailable.
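One simple form of backpressure is to cap the number of in-flight requests and shed anything above the cap. The sketch below is a minimal, single-process illustration; `LoadShedder` is a hypothetical class, and the right `maxInFlight` value is an assumption that has to come from load testing.

```typescript
// Tracks in-flight requests and refuses new ones above a fixed cap.
export class LoadShedder {
  private inFlight = 0;
  private readonly maxInFlight: number;

  constructor(maxInFlight: number) {
    if (!Number.isInteger(maxInFlight) || maxInFlight < 1) {
      throw new Error('maxInFlight must be a positive integer');
    }
    this.maxInFlight = maxInFlight;
  }

  // Returns true and reserves a slot, or false if the request should be rejected.
  tryAcquire(): boolean {
    if (this.inFlight >= this.maxInFlight) {
      return false;
    }
    this.inFlight++;
    return true;
  }

  // Frees a slot once the request has finished, successfully or not.
  release(): void {
    if (this.inFlight > 0) {
      this.inFlight--;
    }
  }
}
```

In an HTTP handler, a failed `tryAcquire()` would translate to a 503 Service Unavailable response, and `release()` would be called when the response finishes or the connection closes.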
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Read / Low Write | Aggressive Caching + CDN | Reduces DB load by >90%; serves content from edge | Low compute, Medium cache cost |
| Write Heavy / Batchable | Async Queue + Batch Inserts | Maximizes DB write throughput; reduces connection usage | Medium queue cost, Low DB cost |
| Low Latency Requirement | Edge Compute + Local Cache | Minimizes network hops; serves from nearest node | Higher edge cost, Best performance |
| Multi-Tenant SaaS | Database Sharding by TenantID | Isolates noisy neighbors; scales writes linearly | High infra complexity, Linear cost |
| Unpredictable Spikes | Auto-scaling + Rate Limiting | Handles burst traffic without over-provisioning | Variable compute, Stable DB load |
Configuration Template
Redis Configuration for High Scale (redis.conf snippet):
# Max memory policy: Evict least recently used keys (only takes effect once maxmemory is set)
maxmemory-policy allkeys-lru
# Enable lazy freeing to avoid blocking on large key deletions
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
# TCP Keepalive to detect dead connections
tcp-keepalive 60
# Timeout idle connections
timeout 300
# Enable AOF for durability if caching critical state
appendonly yes
appendfsync everysec
TypeScript Rate Limiter Middleware (Redis-backed):
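import { Redis } from 'ioredis';
import type { Request, Response, NextFunction } from 'express';

// INCR and PEXPIRE run atomically, so a failure between the two calls
// cannot leave a counter without an expiry.
const INCR_WITH_EXPIRY = `
local current = redis.call('incr', KEYS[1])
if current == 1 then
  redis.call('pexpire', KEYS[1], ARGV[1])
end
return current`;

export function createRateLimiter(redis: Redis, limit: number, windowMs: number) {
  return async (req: Request, res: Response, next: NextFunction) => {
    try {
      const key = `rate_limit:${req.ip}:${req.path}`;
      const current = (await redis.eval(INCR_WITH_EXPIRY, 1, key, windowMs)) as number;
      res.set('X-RateLimit-Limit', String(limit));
      res.set('X-RateLimit-Remaining', String(Math.max(0, limit - current)));
      if (current > limit) {
        res.status(429).json({ error: 'Too Many Requests' });
        return;
      }
      next();
    } catch (err) {
      // Express 4 does not catch rejected promises from async middleware.
      next(err);
    }
  };
}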
Quick Start Guide
- Baseline Profile: Run a load test simulating 100M requests/month traffic patterns. Record p99 latency, error rate, and CPU/memory usage. Identify the top 3 bottlenecks.
- Add Redis Cache Layer: Implement the ResilientCacheManager for read-heavy endpoints. Configure TTLs with jitter. Verify cache hit ratio >80%.
- Offload Writes: Identify write operations that don't require immediate confirmation. Move them to a message queue. Update the API to return 202 Accepted.
- Tune Connection Pools: Adjust DB and Redis pool sizes based on Little's Law calculations. Monitor connection wait times during load testing.
- Deploy and Monitor: Roll out changes incrementally. Monitor p99 latency and error rates in real-time. Validate that throughput increases while latency decreases.
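For the pool-tuning step, Little's Law states that the average number of items in a system equals the arrival rate multiplied by the average time each item spends there (L = λW). Applied to connections, that gives a first estimate of how many are busy at once. The helpers below are a hypothetical sketch; the 30-day month and any peak multiplier you apply are assumptions to replace with measured traffic.

```typescript
const SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60; // 2,592,000

// Average request rate implied by a monthly volume.
export function averageRps(requestsPerMonth: number): number {
  return requestsPerMonth / SECONDS_PER_30_DAY_MONTH;
}

// Little's Law: connections busy on average = arrival rate * hold time.
export function connectionsInUse(
  arrivalsPerSecond: number,
  holdSeconds: number,
): number {
  return arrivalsPerSecond * holdSeconds;
}
```

100M requests per month averages out to roughly 39 requests per second. If peak traffic were 400 requests per second and each request held a connection for 25 ms, about 10 connections would be busy on average, so the pool needs headroom above that figure to absorb bursts and slow queries.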
Scaling to 100M requests is a discipline of reduction: reduce work per request, reduce synchronization, and reduce dependencies. By implementing these patterns, you transform the API from a fragile monolith into a resilient, cost-efficient system capable of sustaining exponential growth.