a robust consumer pattern featuring retries, dead-letter queues, and idempotency checks.
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from "@aws-sdk/client-sqs";
import { v4 as uuidv4 } from "uuid";
import { Logger } from "./logger";
interface MessageHandler<T> {
handle(payload: T): Promise<void>;
}
class ResilientConsumer<T> {
private queueUrl: string;
private handler: MessageHandler<T>;
private dlqUrl: string;
private maxRetries: number;
constructor(config: {
queueUrl: string;
dlqUrl: string;
handler: MessageHandler<T>;
maxRetries?: number;
}) {
this.queueUrl = config.queueUrl;
this.dlqUrl = config.dlqUrl;
this.handler = config.handler;
this.maxRetries = config.maxRetries || 3;
}
async poll(): Promise<void> {
const client = new SQSClient({ region: process.env.AWS_REGION });
const command = new ReceiveMessageCommand({
QueueUrl: this.queueUrl,
MaxNumberOfMessages: 10,
WaitTimeSeconds: 20,
VisibilityTimeout: 30,
});
const response = await client.send(command);
if (!response.Messages) return;
const promises = response.Messages.map(async (msg) => {
try {
const payload = JSON.parse(msg.Body!) as T;
await this.executeWithRetry(payload, msg.ReceiptHandle!, client);
} catch (error) {
Logger.error("Message processing failed", { error, messageId: msg.MessageId });
await this.moveToDLQ(msg, client, error);
}
});
await Promise.allSettled(promises);
}
private async executeWithRetry(
payload: T,
receiptHandle: string,
client: SQSClient,
attempt: number = 1
): Promise<void> {
try {
await this.handler.handle(payload);
await client.send(new DeleteMessageCommand({
QueueUrl: this.queueUrl,
ReceiptHandle: receiptHandle,
}));
} catch (error) {
if (attempt >= this.maxRetries) throw error;
// Exponential backoff handled by SQS visibility timeout or requeue
await new Promise(resolve => setTimeout(resolve, Math.pow(2, attempt) * 1000));
return this.executeWithRetry(payload, receiptHandle, client, attempt + 1);
}
}
private async moveToDLQ(msg: any, client: SQSClient, error: any): Promise<void> {
// Logic to forward to DLQ with error context
Logger.warn("Message moved to DLQ", { messageId: msg.MessageId, error: error.message });
}
}
2. Database Read/Write Split Router
Implement a router that directs queries to read replicas and mutations to the primary node, reducing write contention.
import { Pool, PoolClient } from "pg";
class DatabaseRouter {
private writePool: Pool;
private readPools: Pool[];
private currentIndex: number = 0;
constructor(config: {
write: { connectionString: string };
reads: { connectionString: string }[];
}) {
this.writePool = new Pool(config.write);
this.readPools = config.reads.map(r => new Pool(r));
}
async getWriteClient(): Promise<PoolClient> {
return this.writePool.connect();
}
async getReadClient(): Promise<PoolClient> {
// Round-robin load balancing across read replicas
const pool = this.readPools[this.currentIndex % this.readPools.length];
this.currentIndex++;
return pool.connect();
}
async queryRead<T>(text: string, params?: any[]): Promise<T[]> {
const client = await this.getReadClient();
try {
const res = await client.query(text, params);
return res.rows as T[];
} finally {
client.release();
}
}
async queryWrite<T>(text: string, params?: any[]): Promise<T> {
const client = await this.getWriteClient();
try {
const res = await client.query(text, params);
return res.rows[0] as T;
} finally {
client.release();
}
}
}
3. Circuit Breaker for External Dependencies
Protect the system from cascading failures when third-party APIs or internal services degrade.
class CircuitBreaker {
private failures: number = 0;
private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
private lastFailureTime: number = 0;
private threshold: number;
private resetTimeout: number;
constructor(options: { threshold?: number; resetTimeout?: number }) {
this.threshold = options.threshold || 5;
this.resetTimeout = options.resetTimeout || 30000;
}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'OPEN') {
if (Date.now() - this.lastFailureTime > this.resetTimeout) {
this.state = 'HALF_OPEN';
} else {
throw new Error('Circuit breaker is OPEN');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess(): void {
this.failures = 0;
this.state = 'CLOSED';
}
private onFailure(): void {
this.failures++;
this.lastFailureTime = Date.now();
if (this.failures >= this.threshold) {
this.state = 'OPEN';
}
}
}
Rationale
- Modular Structure: Allows teams to work on bounded contexts without merge conflicts while maintaining a single deployment pipeline, reducing CI/CD complexity.
- Async Processing: Decouples user-facing latency from background work. Notifications and analytics processing do not block the critical path.
- Circuit Breakers: Prevent resource exhaustion during partial outages, ensuring the core product remains available even if non-essential integrations fail.
Pitfall Guide
Mistake: Breaking the monolith into microservices before establishing clear domain boundaries and operational maturity.
Impact: Increases network latency, introduces distributed transaction complexity, and requires significant investment in service mesh, logging aggregation, and tracing. At 1M users, the operational overhead often outweighs the benefits unless distinct scaling requirements exist.
Best Practice: Use modular monoliths with internal event buses. Extract services only when independent deployment or scaling is demonstrably required.
2. N+1 Query Patterns at Scale
Mistake: Executing database queries inside loops or nested resolvers without batching.
Impact: A single request can trigger hundreds of database queries. At 1M users, this exhausts connection pools and causes database CPU saturation.
Best Practice: Implement DataLoader or similar batching mechanisms. Use IN clauses with limits and pagination. Audit all data access layers for batch loading.
3. Cache Stampedes
Mistake: Using a single cache key for hot data without locking or probabilistic early expiration.
Impact: When the cache expires, thousands of concurrent requests hit the database simultaneously, causing a thundering herd problem and database overload.
Best Practice: Implement cache locking, jittered expiration times, or background refresh patterns. Use SETNX for distributed locks during cache rebuilds.
4. Synchronous Third-Party Calls
Mistake: Making HTTP requests to external APIs during user request handling.
Impact: External API latency or downtime directly impacts user experience. Timeout configurations may be insufficient, leading to thread/connection blocking.
Best Practice: Offload all external calls to async workers. Implement retries with exponential backoff and circuit breakers. Design for eventual consistency where possible.
5. Ignoring Backpressure
Mistake: Allowing producers to send messages faster than consumers can process.
Impact: Message queues grow unbounded, memory usage spikes, and processing latency increases. System may crash due to OOM errors.
Best Practice: Implement consumer-side rate limiting. Monitor queue depth and scale consumers dynamically. Drop or prioritize messages based on SLA during overload.
6. Database Index Neglect
Mistake: Adding indexes reactively after performance degradation occurs.
Impact: Full table scans on large tables cause high I/O and latency. Write performance degrades due to excessive index maintenance if indexes are poorly chosen.
Best Practice: Analyze slow query logs continuously. Use covering indexes for frequent read patterns. Monitor index usage statistics to remove unused indexes.
7. Session State in Memory
Mistake: Storing user session data in application memory without replication.
Impact: Scaling application nodes requires sticky sessions or causes session loss. Node failures result in user logouts.
Best Practice: Store session state in Redis or a distributed cache. Use JWT for stateless authentication where appropriate.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| High Read / Low Write | CQRS + Read Replicas + Cache | Decouples read scaling; cache reduces DB load significantly. | Lowers DB costs; increases cache costs. Net savings >30%. |
| High Write / Low Read | Write Sharding + Async Ingestion | Distributes write load; queues buffer spikes. | Higher storage costs; improved throughput justifies cost. |
| Small Team (<10 devs) | Modular Monolith | Minimizes operational overhead; faster iteration. | Lowest infrastructure and ops cost. |
| Multi-Region Requirement | Active-Active with CRDTs | Reduces latency globally; handles region failures. | Higher data transfer costs; complex conflict resolution. |
| Budget Constraint | Vertical Scaling + Optimization | Maximizes existing resources; defactors horizontal scaling costs. | Low immediate cost; limited ceiling. |
Configuration Template
Kubernetes HPA for Queue Consumers:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: queue-consumer-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: queue-consumer
minReplicas: 3
maxReplicas: 50
metrics:
- type: Pods
pods:
metric:
name: queue_messages_per_second
target:
type: AverageValue
averageValue: "100"
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 120
Redis Cache Configuration:
# redis.conf optimizations for high throughput
maxmemory 4gb
maxmemory-policy allkeys-lru
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
tcp-backlog 511
timeout 300
Quick Start Guide
- Profile Baseline: Run a load test against current architecture. Record p99 latency, error rate, and database CPU usage. Identify the top 3 bottlenecks.
- Add Redis Cache: Implement cache-aside pattern for read-heavy endpoints. Configure TTLs with jitter. Verify cache hit rate exceeds 80%.
- Decouple Async Tasks: Identify synchronous external calls and background tasks. Move them to SQS/RabbitMQ. Implement the resilient consumer pattern.
- Enable Read Replicas: Provision read replicas. Update the database router to direct
SELECT queries to replicas. Verify replication lag is under 100ms.
- Deploy Observability: Install OpenTelemetry agents. Configure dashboards for latency, error rate, and saturation. Set alerts for p99 > 500ms and error rate > 1%.
Scaling to 1M users demands a shift from reactive firefighting to proactive architectural governance. By implementing event-driven modularity, rigorous caching strategies, and asynchronous processing, startups can achieve linear cost scaling while maintaining sub-second latency and high availability. The focus must remain on data access patterns, decoupling critical paths, and establishing comprehensive observability to sustain growth.