nforce business-level fallbacks.
2. State Isolation: Circuit breaker state, bulkhead pools, and timeout deadlines are encapsulated in dedicated modules. This prevents cross-client state contamination and enables per-dependency tuning.
3. Async Cancellation: Modern JavaScript runtimes (current browsers and Node.js) provide AbortController. Timeouts propagate an AbortSignal to the in-flight call rather than relying on setTimeout cleanup alone, which prevents memory leaks and zombie requests.
4. Progressive Degradation: Fallback functions are explicitly defined per dependency. The system degrades gracefully rather than failing silently or throwing unhandled exceptions.
Step-by-Step Implementation
1. Timeout & Deadline Pattern
Hard boundaries prevent thread starvation. Timeouts must be shorter than the upstream caller's timeout to allow time for fallback execution.
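export interface TimeoutConfig {
  hardLimitMs: number;
  fallback?: () => Promise<any>;
}

export async function withTimeout<T>(
  // fn may ignore the signal, but passing it on (e.g. to fetch) cancels the in-flight work.
  fn: (signal: AbortSignal) => Promise<T>,
  config: TimeoutConfig
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  // The deadline rejects on its own, so the limit holds even if fn never observes the signal.
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`Request timed out after ${config.hardLimitMs}ms`));
    }, config.hardLimitMs);
  });
  try {
    return await Promise.race([fn(controller.signal), deadline]);
  } catch (err) {
    if (!controller.signal.aborted) throw err; // genuine failure, not a timeout
    if (config.fallback) return config.fallback();
    throw new Error(`Request timed out after ${config.hardLimitMs}ms`);
  } finally {
    clearTimeout(timer); // release the timer on success and on failure
  }
}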
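To make the cancellation point concrete, here is a minimal sketch of a downstream call that honors an AbortSignal, so that aborting at the deadline actually stops the work instead of leaving it running. It assumes a Node.js runtime, where `timers/promises` accepts a `signal` option. The names `slowLookup` and `lookupWithDeadline` and the `'fresh'`/`'stale'` values are hypothetical and exist only for illustration.

```typescript
import { setTimeout as sleep } from 'timers/promises';

// Hypothetical downstream call: resolves after workMs unless the signal aborts first.
export async function slowLookup(workMs: number, signal: AbortSignal): Promise<string> {
  // timers/promises rejects with an AbortError (and clears its timer) when the signal aborts.
  await sleep(workMs, undefined, { signal });
  return 'fresh';
}

// Runs slowLookup under a deadline and degrades to a stale value on timeout.
export async function lookupWithDeadline(workMs: number, limitMs: number): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), limitMs);
  try {
    return await slowLookup(workMs, controller.signal);
  } catch (err) {
    // Only cancellation maps to the fallback value; other errors propagate.
    if (err instanceof Error && err.name === 'AbortError') return 'stale';
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

Because the sleep is cancelled together with the request, no timer outlives the call. A real client would pass the same signal to its HTTP library.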
2. Circuit Breaker Pattern
The circuit breaker monitors failure rates and opens the circuit when thresholds are exceeded. It transitions through three states: CLOSED (normal), OPEN (reject immediately), HALF_OPEN (probe recovery).
export type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

export interface CircuitBreakerConfig {
  failureThreshold: number; // failures within windowMs that open the circuit
  successThreshold: number; // successes while HALF_OPEN required to close it
  resetTimeoutMs: number;   // time spent OPEN before probing
  windowMs: number;         // sliding window used to count outcomes
}

export class CircuitBreaker {
  private state: CircuitState = 'CLOSED';
  private failures: number[] = [];
  private successes: number[] = [];
  private resetTimer: ReturnType<typeof setTimeout> | null = null;

  constructor(private config: CircuitBreakerConfig) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      throw new Error('Circuit breaker is OPEN');
    }
    // CLOSED and HALF_OPEN share one path so that failed probes are recorded too.
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  private recordFailure() {
    this.failures.push(Date.now());
    this.pruneWindow();
    // A failed probe reopens the circuit immediately; otherwise the threshold applies.
    if (this.state === 'HALF_OPEN' || this.failures.length >= this.config.failureThreshold) {
      this.openCircuit();
    }
  }

  private recordSuccess() {
    this.successes.push(Date.now());
    this.pruneWindow();
    if (this.state === 'HALF_OPEN' && this.successes.length >= this.config.successThreshold) {
      this.state = 'CLOSED';
      this.failures = [];
      this.successes = [];
    }
  }

  private openCircuit() {
    this.state = 'OPEN';
    if (this.resetTimer) clearTimeout(this.resetTimer);
    // After resetTimeoutMs, let probe requests through again.
    this.resetTimer = setTimeout(() => {
      this.state = 'HALF_OPEN';
      this.successes = [];
    }, this.config.resetTimeoutMs);
  }

  private pruneWindow() {
    const cutoff = Date.now() - this.config.windowMs;
    this.failures = this.failures.filter(t => t > cutoff);
    this.successes = this.successes.filter(t => t > cutoff);
  }
}
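The three states and their transitions can also be written as a small pure function, which is easier to unit-test than the timer-driven class. This is a sketch only: the event names are hypothetical labels for the conditions described above, and `CircuitState` is redeclared so the snippet stands alone.

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Hypothetical event names for the conditions that drive a transition.
type CircuitEvent =
  | 'FAILURE_THRESHOLD_REACHED'
  | 'RESET_TIMEOUT_ELAPSED'
  | 'PROBE_FAILED'
  | 'SUCCESS_THRESHOLD_REACHED';

// Returns the next state; events that do not apply leave the state unchanged.
export function nextState(state: CircuitState, event: CircuitEvent): CircuitState {
  if (state === 'CLOSED' && event === 'FAILURE_THRESHOLD_REACHED') return 'OPEN';
  if (state === 'OPEN' && event === 'RESET_TIMEOUT_ELAPSED') return 'HALF_OPEN';
  if (state === 'HALF_OPEN' && event === 'PROBE_FAILED') return 'OPEN';
  if (state === 'HALF_OPEN' && event === 'SUCCESS_THRESHOLD_REACHED') return 'CLOSED';
  return state;
}
```

Keeping the transition table separate from timers and counters makes it straightforward to check that, for example, a failed probe can never close the circuit.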
3. Bulkhead Pattern
Bulkheads isolate resource pools per dependency. Thread/connection exhaustion in one service cannot starve others.
export interface BulkheadConfig {
  maxConcurrent: number;
  queueSize: number;
}

export class Bulkhead {
  private active: number = 0;
  private queue: Array<{ resolve: (val: any) => void; reject: (err: any) => void; fn: () => Promise<any> }> = [];

  constructor(private config: BulkheadConfig) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.config.maxConcurrent) {
      if (this.queue.length >= this.config.queueSize) {
        throw new Error('Bulkhead queue full');
      }
      // Park the call; processQueue() settles this promise once a slot frees up.
      return new Promise<T>((resolve, reject) => {
        this.queue.push({ resolve, reject, fn });
      });
    }
    this.active++;
    try {
      return await fn();
    } finally {
      // Release the slot before draining; otherwise the capacity check in
      // processQueue() never passes and queued calls wait forever.
      this.active--;
      this.processQueue();
    }
  }

  private processQueue() {
    if (this.queue.length > 0 && this.active < this.config.maxConcurrent) {
      const next = this.queue.shift()!;
      this.execute(next.fn).then(next.resolve, next.reject);
    }
  }
}
4. Composed Resilient Client
Combine the patterns behind a single interface. Order of execution matters: the timeout wraps the whole call, the circuit breaker rejects calls while the downstream is unhealthy, and the bulkhead caps concurrency for the calls that are admitted.
export interface ResilientClientConfig {
  timeout: TimeoutConfig;
  circuitBreaker: CircuitBreakerConfig;
  bulkhead: BulkheadConfig;
}

export class ResilientClient {
  private circuit: CircuitBreaker;
  private bulkhead: Bulkhead;

  constructor(private config: ResilientClientConfig) {
    this.circuit = new CircuitBreaker(config.circuitBreaker);
    this.bulkhead = new Bulkhead(config.bulkhead);
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    // Nesting, outermost first: timeout -> circuit breaker -> bulkhead -> fn.
    return withTimeout(
      () => this.circuit.execute(() => this.bulkhead.execute(fn)),
      this.config.timeout
    );
  }
}
Architecture Rationale
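The composition order enforces defense-in-depth, with the layers nested from the outside in. The timeout is outermost, so the hard latency boundary covers circuit evaluation, time spent waiting in the bulkhead queue, and the downstream call itself. The circuit breaker comes next: it evaluates failure history and rejects immediately when the downstream is unhealthy, so rejected calls never occupy a bulkhead slot. The bulkhead is innermost and limits the concurrency of admitted calls, preventing resource exhaustion. The fallback is tied to the timeout layer and runs only when the deadline expires; errors from an open circuit or a full bulkhead queue propagate to the caller. One consequence of this nesting is that bulkhead queue rejections pass through the circuit breaker and are counted as failures. This ordering prevents premature fallbacks, reduces false positives, and maintains system stability under partial failure conditions.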
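The effect of nesting order can be made visible with a small tracing sketch. The `traced` and `compose` helpers and the layer names below are hypothetical stand-ins, not the real timeout, circuit breaker, or bulkhead. They only record when each layer is entered and exited.

```typescript
type Call<T> = () => Promise<T>;
type Layer = <T>(next: Call<T>) => Call<T>;

// Hypothetical tracing layer: records entry and exit around the wrapped call.
function traced(name: string, trace: string[]): Layer {
  return <T>(next: Call<T>): Call<T> => async () => {
    trace.push(`enter ${name}`);
    try {
      return await next();
    } finally {
      trace.push(`exit ${name}`);
    }
  };
}

// Layers are listed outermost first, mirroring withTimeout(circuit(bulkhead(fn))).
export function compose<T>(layers: Layer[], fn: Call<T>): Call<T> {
  return layers.reduceRight<Call<T>>((next, layer) => layer(next), fn);
}

export async function demo(): Promise<string[]> {
  const trace: string[] = [];
  const call = compose(
    [traced('timeout', trace), traced('circuit', trace), traced('bulkhead', trace)],
    async () => {
      trace.push('downstream');
      return 'ok';
    }
  );
  await call();
  return trace;
}
```

Running `demo()` yields the layers in entry order (timeout, circuit, bulkhead), then the downstream call, then the exits in reverse. Swapping two entries in the array is enough to see how a different nesting changes which layer observes which failure.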
Pitfall Guide
1. Timeout Misalignment
Setting timeouts shorter than downstream processing time causes premature failures. Setting them longer than upstream deadlines creates cascading thread blocks. Timeouts must be calibrated against P95 latency plus 20% buffer, and must always be shorter than the caller's timeout.
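As a rough illustration of the calibration rule, the sketch below derives a timeout from observed latencies: P95 plus a 20% buffer, capped so that it stays below the caller's timeout. The function names, the nearest-rank percentile method, and the `fallbackReserveMs` parameter are assumptions made for this example.

```typescript
// Nearest-rank percentile over a non-empty sample of latencies in milliseconds.
export function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error('no latency samples');
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Suggests a timeout: P95 plus a percentage buffer, never exceeding what the
// caller's own timeout allows once time for the fallback is reserved.
export function suggestTimeoutMs(
  samplesMs: number[],
  callerTimeoutMs: number,
  fallbackReserveMs: number,
  bufferPercent: number = 20
): number {
  const p95 = percentile(samplesMs, 95);
  const candidate = Math.ceil((p95 * (100 + bufferPercent)) / 100);
  const ceiling = callerTimeoutMs - fallbackReserveMs;
  if (ceiling <= 0) throw new Error('caller timeout leaves no room for the fallback');
  return Math.min(candidate, ceiling);
}
```

When the cap applies, the downstream is slower than the caller can tolerate, which is a signal to revisit the caller's deadline or the dependency rather than to raise the timeout.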
2. Circuit Breaker Threshold Tuning Without Load Testing
Default thresholds (e.g., 5 failures in 10s) break under burst traffic or high-throughput services. Thresholds must be derived from historical success rates and adjusted per dependency. Static thresholds cause premature opening or delayed protection.
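One way to make a threshold scale with throughput is to trip on failure rate with a minimum request volume, instead of on an absolute count. The sketch below is an alternative to the count-based check in the CircuitBreaker class above, not a description of it. `WindowStats`, `RatePolicy`, and `shouldOpen` are hypothetical names.

```typescript
export interface WindowStats {
  total: number;  // requests observed in the current window
  failed: number; // how many of them failed
}

export interface RatePolicy {
  minRequests: number;    // below this volume the sample is too small to judge
  maxFailureRate: number; // fraction in [0, 1] at which the circuit should open
}

// Opens only when there is enough traffic and the failure rate is at or above the limit.
export function shouldOpen(stats: WindowStats, policy: RatePolicy): boolean {
  if (stats.total < policy.minRequests) return false;
  return stats.failed / stats.total >= policy.maxFailureRate;
}
```

With a policy of `{ minRequests: 20, maxFailureRate: 0.5 }`, five failures among 1,000 requests leave the circuit closed, whereas a fixed "5 failures in 10s" rule would have opened it.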
3. Bulkhead Resource Starvation
Over-provisioning bulkheads wastes memory; under-provisioning causes queue rejections during normal traffic. Queue sizes must account for retry storms. Implement backpressure by rejecting queued requests after a secondary timeout rather than blocking indefinitely.
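The secondary timeout for queued requests can be sketched as a sweep that rejects entries that have waited too long. The `QueuedCall` shape, with its `enqueuedAt` field, and the `expireStale` helper are assumptions for illustration. The Bulkhead class above does not record enqueue times.

```typescript
export interface QueuedCall {
  enqueuedAt: number;           // epoch milliseconds when the call was queued
  reject: (err: Error) => void; // settles the waiting caller's promise
}

// Rejects entries that have waited longer than maxWaitMs and returns those still eligible.
export function expireStale<Q extends QueuedCall>(queue: Q[], now: number, maxWaitMs: number): Q[] {
  const kept: Q[] = [];
  for (const entry of queue) {
    if (now - entry.enqueuedAt > maxWaitMs) {
      entry.reject(new Error(`Bulkhead queue wait exceeded ${maxWaitMs}ms`));
    } else {
      kept.push(entry);
    }
  }
  return kept;
}
```

A bulkhead could run such a sweep before dequeuing the next call, so that callers whose deadline has already passed fail fast instead of consuming a slot.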
4. Ignoring Fallback Degradation Paths
Circuit breakers and timeouts fail silently if fallbacks are undefined or throw exceptions. Fallbacks must be idempotent, cache-aware, and explicitly tested. Returning null or stale data without validation corrupts downstream state.
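A fallback can be guarded so that it neither throws nor returns unvalidated stale data. The following sketch wraps a cache-backed fallback and reports an explicit failure reason. `CachedValue`, `FallbackResult`, `safeFallback`, and the staleness limit are hypothetical, and real validation would depend on the business rules for the data.

```typescript
export interface CachedValue<T> {
  value: T;
  storedAt: number; // epoch milliseconds when the value was cached
}

export type FallbackResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: string };

// Runs the fallback, converting exceptions, missing data, and overly stale data
// into an explicit failure instead of letting them leak downstream.
export async function safeFallback<T>(
  fallback: () => Promise<CachedValue<T> | null>,
  maxStaleMs: number,
  now: number = Date.now()
): Promise<FallbackResult<T>> {
  try {
    const cached = await fallback();
    if (cached === null) return { ok: false, reason: 'no cached value' };
    if (now - cached.storedAt > maxStaleMs) return { ok: false, reason: 'cached value too stale' };
    return { ok: true, value: cached.value };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, reason: `fallback threw: ${message}` };
  }
}
```

Returning a tagged result forces the caller to decide what a failed fallback means, rather than receiving null and passing it on.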
5. Half-Open State Flooding
When a circuit breaker transitions to HALF_OPEN, concurrent requests can flood a recovering service. Implement a single-probe policy or rate-limited half-open execution. Only one request should test recovery until the success threshold is met.
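A single-probe policy can be sketched with a small gate that admits one in-flight request at a time. `ProbeGate` and `runProbe` are hypothetical helpers, not part of the CircuitBreaker class above. A breaker would consult such a gate only while in HALF_OPEN.

```typescript
// Admits at most one probe at a time.
export class ProbeGate {
  private probing = false;

  // Returns true if the caller may send the probe, false if one is already in flight.
  tryAcquire(): boolean {
    if (this.probing) return false;
    this.probing = true;
    return true;
  }

  release(): void {
    this.probing = false;
  }
}

// Runs fn as the probe, or rejects immediately when another probe is in flight.
export async function runProbe<T>(gate: ProbeGate, fn: () => Promise<T>): Promise<T> {
  if (!gate.tryAcquire()) {
    throw new Error('Circuit breaker is HALF_OPEN: probe already in flight');
  }
  try {
    return await fn();
  } finally {
    gate.release();
  }
}
```

Because JavaScript runs these checks on a single thread, the boolean flag is enough here. A multi-threaded runtime would need an atomic compare-and-set instead.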
6. Treating Resilience as a Library Instead of a System Property
Importing a resilience package without configuring per-dependency boundaries creates a false sense of security. Each downstream service requires distinct timeout, circuit breaker, and bulkhead configurations. Shared configurations mask dependency-specific failure modes.
7. Missing Observability for State Transitions
Circuit breaker state changes, bulkhead queue rejections, and timeout occurrences are invisible without explicit metrics. Instrument circuit.state, bulkhead.active, timeout.exceeded, and fallback.invoked. Alert on state transitions, not just error rates.
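The instrumentation points can be sketched with an in-memory recorder that uses the metric names listed above. `MetricsRecorder` is a hypothetical stand-in for a real metrics client. It only shows where events would be emitted and how transitions can be queried per dependency.

```typescript
export type MetricName =
  | 'circuit.state'
  | 'bulkhead.active'
  | 'timeout.exceeded'
  | 'fallback.invoked';

export interface MetricEvent {
  name: MetricName;
  dependency: string;
  value: number | string;
  at: number; // epoch milliseconds
}

// In-memory stand-in for a metrics client; a real one would forward events to a backend.
export class MetricsRecorder {
  private events: MetricEvent[] = [];

  record(name: MetricName, dependency: string, value: number | string, at: number = Date.now()): void {
    this.events.push({ name, dependency, value, at });
  }

  // Ordered circuit states for one dependency: the sequence to alert on.
  transitions(dependency: string): string[] {
    return this.events
      .filter(e => e.name === 'circuit.state' && e.dependency === dependency)
      .map(e => String(e.value));
  }

  // Number of times a metric was recorded for one dependency.
  count(name: MetricName, dependency: string): number {
    return this.events.filter(e => e.name === name && e.dependency === dependency).length;
  }
}
```

Tagging every event with the dependency name keeps one noisy downstream from hiding state changes in another.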
Best Practice: Implement progressive degradation. Define explicit contracts for each failure layer. Test fallback paths in staging with fault injection. Validate that degraded responses maintain business correctness.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Downstream latency spikes to 5s+ | Timeout + Circuit Breaker | Prevents thread exhaustion and cascade propagation | Low (CPU/memory bounded) |
| High-throughput batch processing | Bulkhead + Queue Rejection | Isolates resource consumption per job type | Medium (queue memory) |
| Non-critical data fetch | Fallback + Cache | Maintains UX without blocking critical path | Low (cache hit rate dependent) |
| Payment/Transaction service | Circuit Breaker + Idempotent Queue | Prevents duplicate charges during recovery | High (queue storage + replay logic) |
Configuration Template
// queueTransaction, getStaleCatalog, and dropNotification are application-defined
// fallback functions; supply your own implementations.
export const resilienceDefaults = {
  paymentService: {
    timeout: { hardLimitMs: 2500, fallback: () => queueTransaction() },
    circuitBreaker: { failureThreshold: 3, successThreshold: 2, resetTimeoutMs: 15000, windowMs: 10000 },
    bulkhead: { maxConcurrent: 20, queueSize: 50 }
  },
  catalogService: {
    timeout: { hardLimitMs: 800, fallback: () => getStaleCatalog() },
    circuitBreaker: { failureThreshold: 5, successThreshold: 3, resetTimeoutMs: 10000, windowMs: 10000 },
    bulkhead: { maxConcurrent: 50, queueSize: 100 }
  },
  notificationService: {
    timeout: { hardLimitMs: 1200, fallback: () => dropNotification() },
    circuitBreaker: { failureThreshold: 8, successThreshold: 4, resetTimeoutMs: 8000, windowMs: 15000 },
    bulkhead: { maxConcurrent: 30, queueSize: 200 }
  }
};
Quick Start Guide
- Install dependencies: Add @types/node and target Node.js 16 or later, which provides AbortController and timers/promises natively.
- Define per-dependency config: Copy the configuration template and adjust thresholds based on downstream P95 latency and acceptable degradation.
- Wrap external calls: Replace direct fetch/axios/HTTP client calls with ResilientClient.call() to enforce timeout, circuit breaker, and bulkhead boundaries.
- Instrument metrics: Export circuit.state, bulkhead.active, timeout.exceeded, and fallback.invoked to your observability pipeline. Set alerts on state transitions.
- Validate with fault injection: Use tools like toxiproxy or chaos-mesh to simulate latency spikes and 5xx errors. Verify that fallbacks execute, circuits open, and bulkheads reject without cascading failures.
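Before reaching for network-level tools, the same failure modes can be approximated in unit tests with an in-process wrapper. The sketch below is a hypothetical helper. It is not an API of toxiproxy or chaos-mesh and does not replace them, since it cannot reproduce real network behavior.

```typescript
export interface FaultPlan {
  latencyMs: number; // artificial delay added before every call
  failEvery: number; // fail every Nth call; 0 disables injected failures
}

// Wraps fn so that calls are delayed and periodically fail, simulating a degraded downstream.
export function withFaults<T>(fn: () => Promise<T>, plan: FaultPlan): () => Promise<T> {
  let calls = 0;
  return async () => {
    calls++;
    if (plan.latencyMs > 0) {
      await new Promise<void>(resolve => setTimeout(resolve, plan.latencyMs));
    }
    if (plan.failEvery > 0 && calls % plan.failEvery === 0) {
      throw new Error('Injected fault: simulated 5xx');
    }
    return fn();
  };
}
```

Passing the wrapped function to a resilient client in a test makes it possible to assert that the circuit opens after the expected number of injected failures.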