:** Threshold breached. Requests fail fast without reaching the downstream. A recovery timer begins.
- Half-Open: Recovery probe phase. A limited number of requests are allowed through to test downstream health. Success resets to Closed; failure reopens the circuit.
Failure counting must use a sliding window rather than fixed intervals. Fixed intervals create boundary artifacts where a burst of failures at the end of one window and the start of the next can mask an ongoing outage. A sliding window continuously evaluates the failure ratio over the most recent N requests or T seconds.
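To make the boundary artifact concrete, here is a minimal sketch that evaluates the same request log with a fixed 10-second bucket and with a trailing 10-second window. The names (`Outcome`, `fixedWindowRatio`, `slidingWindowRatio`) and the sample timestamps are illustrative assumptions, not part of any library.

```typescript
interface Outcome {
  atMs: number;
  success: boolean;
}

function failureRatio(outcomes: Outcome[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(o => !o.success).length / outcomes.length;
}

// Fixed interval: only outcomes in the clock-aligned bucket that contains nowMs.
function fixedWindowRatio(log: Outcome[], nowMs: number, windowMs: number): number {
  const bucketStart = Math.floor(nowMs / windowMs) * windowMs;
  return failureRatio(log.filter(o => o.atMs >= bucketStart && o.atMs <= nowMs));
}

// Sliding window: outcomes in the trailing windowMs, regardless of bucket edges.
function slidingWindowRatio(log: Outcome[], nowMs: number, windowMs: number): number {
  return failureRatio(log.filter(o => o.atMs > nowMs - windowMs && o.atMs <= nowMs));
}

// Four failures straddle the 10 s boundary (8 s, 9 s, 10 s, 11 s).
const outcomeLog: Outcome[] = [
  { atMs: 1000, success: true }, { atMs: 2000, success: true },
  { atMs: 3000, success: true }, { atMs: 4000, success: true },
  { atMs: 8000, success: false }, { atMs: 9000, success: false },
  { atMs: 10000, success: false }, { atMs: 11000, success: false },
  { atMs: 16000, success: true }, { atMs: 17000, success: true },
  { atMs: 18000, success: true }, { atMs: 19000, success: true },
];

console.log(fixedWindowRatio(outcomeLog, 9999, 10000).toFixed(2));    // 0.33
console.log(fixedWindowRatio(outcomeLog, 19999, 10000).toFixed(2));   // 0.33
console.log(slidingWindowRatio(outcomeLog, 11000, 10000).toFixed(2)); // 0.57
```

With a 50% threshold, neither fixed bucket trips, while the sliding window crosses the threshold at the 11-second mark.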
Step 2: Implementation (TypeScript)
This example wraps an HTTP client call, implements explicit state tracking, and integrates fallback logic. It uses a modern async/await pattern and avoids the callback-heavy style of older implementations.
import { EventEmitter } from 'events';

export type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

export interface CircuitBreakerConfig {
  name: string;
  failureThreshold: number;    // failure ratio that trips the breaker, e.g., 0.5 for 50%
  windowSize: number;          // number of most recent requests to evaluate
  openTimeout: number;         // ms before transitioning to HALF_OPEN
  maxHalfOpenRequests: number; // max concurrent probes while HALF_OPEN
}

export class CircuitBreaker extends EventEmitter {
  private state: CircuitState = 'CLOSED';
  private requestLog: Array<{ success: boolean; timestamp: number }> = [];
  private openUntil = 0;
  private halfOpenInFlight = 0; // probes currently running in HALF_OPEN

  constructor(private readonly config: CircuitBreakerConfig) {
    super();
  }

  async execute<T>(fn: () => Promise<T>, fallback?: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() < this.openUntil) return this.reject('is OPEN', fallback);
      this.transitionTo('HALF_OPEN'); // recovery timer elapsed: start probing
    }

    const isProbe = this.state === 'HALF_OPEN';
    if (isProbe) {
      if (this.halfOpenInFlight >= this.config.maxHalfOpenRequests) {
        return this.reject('HALF_OPEN limit reached', fallback);
      }
      this.halfOpenInFlight++;
    }

    try {
      const result = await fn();
      this.recordResult(true);
      if (isProbe && this.state === 'HALF_OPEN') this.transitionTo('CLOSED');
      return result;
    } catch (error) {
      this.recordResult(false);
      // A failed probe reopens the circuit immediately; otherwise use the failure ratio.
      const probeFailed = isProbe && this.state === 'HALF_OPEN';
      if (probeFailed || (this.state === 'CLOSED' && this.shouldTrip())) {
        this.transitionTo('OPEN');
      }
      throw error;
    } finally {
      if (isProbe) this.halfOpenInFlight--;
    }
  }

  // Emits 'rejected', then resolves via the fallback or rejects with an explicit error.
  private reject<T>(reason: string, fallback?: () => Promise<T>): Promise<T> {
    this.emit('rejected', this.config.name);
    return fallback
      ? fallback()
      : Promise.reject(new Error(`Circuit ${this.config.name} ${reason}`));
  }

  private recordResult(success: boolean): void {
    this.requestLog.push({ success, timestamp: Date.now() });
    // Count-based sliding window: keep only the most recent windowSize results.
    if (this.requestLog.length > this.config.windowSize) {
      this.requestLog.shift();
    }
  }

  private shouldTrip(): boolean {
    // Wait for a full window so a few early failures cannot trip the breaker.
    if (this.requestLog.length < this.config.windowSize) return false;
    const failures = this.requestLog.filter(r => !r.success).length;
    return failures / this.requestLog.length >= this.config.failureThreshold;
  }

  private transitionTo(newState: CircuitState): void {
    this.state = newState;
    this.requestLog = []; // each state starts with a fresh window
    if (newState === 'OPEN') {
      this.openUntil = Date.now() + this.config.openTimeout;
    }
    this.emit('stateChange', this.config.name, newState);
  }
}
Step 3: Integration & Architecture Decisions
// httpClient and messageQueue are assumed to be provided by the application.
const paymentBreaker = new CircuitBreaker({
  name: 'payment-gateway',
  failureThreshold: 0.5,
  windowSize: 10,
  openTimeout: 15000,
  maxHalfOpenRequests: 2
});

paymentBreaker.on('stateChange', (name, state) => {
  console.warn(`[CircuitBreaker] ${name} -> ${state}`);
  // Emit to Prometheus/Datadog here
});

async function processPayment(orderId: string, amount: number) {
  return paymentBreaker.execute(
    () => httpClient.post('/v1/charge', { orderId, amount }),
    async () => {
      // Fallback: queue for async processing or return cached estimate
      await messageQueue.enqueue('payment-retry', { orderId, amount });
      return { status: 'queued', orderId };
    }
  );
}
Why these choices matter:
- Explicit State Transitions: The breaker tracks state internally and emits events. This enables external monitoring without coupling metrics libraries to business logic.
- Fallback Contract: The execute method accepts an optional fallback function. When the circuit rejects a call, the caller receives either the fallback result or an explicit rejection error, so every protected call has a defined failure path and nothing fails silently.
- Half-Open Concurrency Limit: maxHalfOpenRequests prevents a flood of probes from overwhelming a recovering downstream. Only a controlled trickle is allowed.
- Sliding Window Pruning: The request log is trimmed to recent entries only, ensuring the failure ratio reflects current conditions, not historical noise.
Libraries like resilience4j (Java) or gobreaker (Go) abstract these mechanics, but understanding the underlying state machine is critical for tuning thresholds and diagnosing trip behavior in production.
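As a compact reference for that state machine, the transitions can be written as a lookup table. This is a sketch only: the event names (`THRESHOLD_BREACHED`, `OPEN_TIMEOUT_ELAPSED`, `PROBE_SUCCEEDED`, `PROBE_FAILED`) are labels chosen here for illustration and do not correspond to any library's API.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
type BreakerEvent =
  | 'THRESHOLD_BREACHED'
  | 'OPEN_TIMEOUT_ELAPSED'
  | 'PROBE_SUCCEEDED'
  | 'PROBE_FAILED';

// Only the listed pairs change state; every other event is ignored.
const TRANSITIONS: Record<BreakerState, Partial<Record<BreakerEvent, BreakerState>>> = {
  CLOSED: { THRESHOLD_BREACHED: 'OPEN' },
  OPEN: { OPEN_TIMEOUT_ELAPSED: 'HALF_OPEN' },
  HALF_OPEN: { PROBE_SUCCEEDED: 'CLOSED', PROBE_FAILED: 'OPEN' },
};

function nextState(state: BreakerState, event: BreakerEvent): BreakerState {
  return TRANSITIONS[state][event] ?? state;
}

console.log(nextState('CLOSED', 'THRESHOLD_BREACHED'));  // OPEN
console.log(nextState('OPEN', 'OPEN_TIMEOUT_ELAPSED'));  // HALF_OPEN
console.log(nextState('HALF_OPEN', 'PROBE_FAILED'));     // OPEN
```

When diagnosing an unexpected trip, checking the emitted state changes against a table like this quickly shows whether the breaker or the configuration is at fault.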
Pitfall Guide
1. The Global Breaker Anti-Pattern
Explanation: Applying a single circuit breaker instance across multiple downstream dependencies. When one service fails, the breaker opens and blocks traffic to all other services sharing the instance.
Fix: Instantiate a dedicated breaker per dependency endpoint. Use a factory pattern or dependency injection container to manage lifecycle and configuration per service.
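One way to sketch the factory idea is a small registry that lazily creates one breaker per dependency name. `BreakerRegistry` and the `DependencyBreaker` shape are hypothetical stand-ins used only to show isolation; in practice the factory would construct a real breaker with per-service configuration.

```typescript
// Hypothetical minimal breaker shape, just enough to demonstrate isolation.
interface DependencyBreaker {
  dependency: string;
  state: 'CLOSED' | 'OPEN';
}

class BreakerRegistry<B> {
  private readonly breakers = new Map<string, B>();

  constructor(private readonly create: (dependency: string) => B) {}

  // Returns the existing breaker for a dependency, creating it on first use.
  get(dependency: string): B {
    let breaker = this.breakers.get(dependency);
    if (breaker === undefined) {
      breaker = this.create(dependency);
      this.breakers.set(dependency, breaker);
    }
    return breaker;
  }
}

const registry = new BreakerRegistry<DependencyBreaker>(
  dependency => ({ dependency, state: 'CLOSED' })
);

// Opening the payment breaker leaves the inventory breaker untouched.
registry.get('payment-gateway').state = 'OPEN';
console.log(registry.get('payment-gateway').state); // OPEN
console.log(registry.get('inventory-check').state); // CLOSED
```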
2. Retry-Breaker Feedback Loop
Explanation: Placing the breaker inside the retry loop, so that every retry attempt passes through the breaker. Each retry is recorded as a separate request, artificially inflating the failure counter and causing premature trips.
Fix: Apply retries at a lower layer (e.g., HTTP client interceptor) before the breaker evaluates the call. Alternatively, configure the breaker to ignore idempotent retry attempts by tagging them or using a separate retry policy that runs outside the breaker's failure accounting.
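The effect of the layering can be shown with a small comparison. In this sketch, `CountingBreaker` is a hypothetical stand-in that only counts the outcomes it observes, and `makeFlakyCall` simulates a downstream that fails twice before succeeding; neither is a real breaker implementation.

```typescript
async function withRetry<T>(fn: () => Promise<T>, maxAttempts: number): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Hypothetical stand-in for a breaker: records each outcome it sees.
class CountingBreaker {
  failures = 0;
  successes = 0;

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    try {
      const result = await fn();
      this.successes++;
      return result;
    } catch (err) {
      this.failures++;
      throw err;
    }
  }
}

// Simulated downstream call that fails a fixed number of times, then succeeds.
function makeFlakyCall(failuresBeforeSuccess: number): () => Promise<string> {
  let calls = 0;
  return async () => {
    calls++;
    if (calls <= failuresBeforeSuccess) throw new Error('transient failure');
    return 'ok';
  };
}

async function compareRetryPlacement() {
  // Retries below the breaker: the breaker sees one final outcome.
  const retryBelow = new CountingBreaker();
  await retryBelow.execute(() => withRetry(makeFlakyCall(2), 3));

  // Retries around the breaker: every attempt is counted.
  const retryAround = new CountingBreaker();
  const flaky = makeFlakyCall(2);
  await withRetry(() => retryAround.execute(flaky), 3);

  return { belowFailures: retryBelow.failures, aroundFailures: retryAround.failures };
}

compareRetryPlacement().then(r => console.log(r)); // { belowFailures: 0, aroundFailures: 2 }
```

The same transient blip records zero failures in one arrangement and two in the other, which is the difference between a stable breaker and a premature trip.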
3. Half-Open Saturation
Explanation: Allowing unlimited requests during the half-open state. A fragile downstream may recover enough to accept connections but lack capacity to process them, causing immediate re-failure and extended outage windows.
Fix: Enforce a strict maxHalfOpenRequests limit (typically 1-3). Use exponential backoff between probes if multiple attempts are needed.
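A possible shape for the backoff is to double the open timeout after each failed probe, up to a cap. `nextOpenTimeoutMs` and its parameters are assumptions made for this sketch; the 15-second base matches the example configuration used earlier.

```typescript
// Doubles the open timeout for each consecutive failed probe, capped at maxMs.
function nextOpenTimeoutMs(baseMs: number, failedProbes: number, maxMs: number): number {
  return Math.min(baseMs * Math.pow(2, failedProbes), maxMs);
}

console.log(nextOpenTimeoutMs(15000, 0, 120000)); // 15000
console.log(nextOpenTimeoutMs(15000, 1, 120000)); // 30000
console.log(nextOpenTimeoutMs(15000, 2, 120000)); // 60000
console.log(nextOpenTimeoutMs(15000, 4, 120000)); // 120000 (capped)
```

Resetting the failed-probe counter to zero once a probe succeeds keeps the next outage from inheriting a long timeout.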
4. Static Thresholds in Volatile Workloads
Explanation: Using fixed failure ratios (e.g., 50%) regardless of traffic volume. During low-traffic periods, 2 failures out of 4 requests trip the breaker, while during spikes, 50 failures out of 1000 (a 5% ratio) do not, masking real degradation.
Fix: Implement minimum request thresholds before evaluation (minRequests: 5) and consider adaptive thresholds that scale with traffic volume or use error budgets aligned with SLOs.
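The minimum-request guard is small enough to show in full. This is a sketch under the assumption that the breaker already tracks failure and total counts for its window; `TripPolicy` and `shouldTripWithMinimum` are illustrative names.

```typescript
interface TripPolicy {
  failureThreshold: number; // ratio, e.g. 0.5
  minRequests: number;      // do not evaluate below this volume
}

function shouldTripWithMinimum(failures: number, total: number, policy: TripPolicy): boolean {
  if (total < policy.minRequests) return false; // too little data to judge
  return failures / total >= policy.failureThreshold;
}

const tripPolicy: TripPolicy = { failureThreshold: 0.5, minRequests: 5 };

console.log(shouldTripWithMinimum(2, 4, tripPolicy));     // false: below minRequests
console.log(shouldTripWithMinimum(5, 10, tripPolicy));    // true: 50% of 10 requests
console.log(shouldTripWithMinimum(50, 1000, tripPolicy)); // false: only 5%
```

The last case still goes undetected by a ratio alone, which is where adaptive thresholds or SLO-based error budgets come in.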
5. Silent Fallbacks
Explanation: Returning default values or empty responses without logging or metrics. This hides degradation from operators and makes debugging impossible.
Fix: Every fallback must emit a structured event containing the breaker name, trigger reason, and request context. Route these to your observability stack and alert on fallback invocation rates.
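A lightweight way to enforce this is to wrap each fallback so that it cannot run without emitting an event first. The `FallbackEvent` shape, the `reason` values, and the `sink` callback are assumptions for this sketch; the sink would typically forward to a logger or metrics client.

```typescript
interface FallbackEvent {
  breaker: string;
  reason: 'circuit_open' | 'half_open_limit';
  context: Record<string, unknown>;
  at: string; // ISO 8601 timestamp
}

// Returns a fallback that reports a structured event before delegating.
function observedFallback<T>(
  breaker: string,
  reason: FallbackEvent['reason'],
  context: Record<string, unknown>,
  fallback: () => Promise<T>,
  sink: (event: FallbackEvent) => void
): () => Promise<T> {
  return () => {
    sink({ breaker, reason, context, at: new Date().toISOString() });
    return fallback();
  };
}

const emittedEvents: FallbackEvent[] = [];
const queuedFallback = observedFallback(
  'payment-gateway',
  'circuit_open',
  { orderId: 'o-123' },
  async () => ({ status: 'queued' }),
  event => { emittedEvents.push(event); }
);
```

Passing `queuedFallback` as the fallback argument of a breaker means every invocation leaves a trace that can be counted and alerted on.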
6. Ignoring Circuit State in Load Balancing
Explanation: Relying solely on application-level breakers while load balancers continue routing traffic to unhealthy instances. The breaker protects the calling service, but doesn't inform infrastructure routing.
Fix: Expose breaker state via health check endpoints (e.g., /health returns 200 when closed, 503 when open). Integrate with service mesh sidecars (Envoy, Linkerd) that can read application health and adjust routing dynamically.
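The mapping from breaker state to a health response might look like the following sketch. It is framework-agnostic and returns a plain object; `healthFromBreakers` is a hypothetical helper, and treating HALF_OPEN as healthy is an assumption made here, since the text only specifies closed and open.

```typescript
type CircuitHealthState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface HealthResponse {
  statusCode: 200 | 503;
  body: { openCircuits: string[] };
}

// 503 if any breaker is OPEN; HALF_OPEN is treated as healthy (assumption).
function healthFromBreakers(states: Record<string, CircuitHealthState>): HealthResponse {
  const openCircuits = Object.keys(states).filter(key => states[key] === 'OPEN');
  return {
    statusCode: openCircuits.length === 0 ? 200 : 503,
    body: { openCircuits },
  };
}

console.log(healthFromBreakers({ 'user-service': 'CLOSED' }).statusCode); // 200
console.log(healthFromBreakers({ 'user-service': 'CLOSED', 'payment-processor': 'OPEN' }).statusCode); // 503
```

Listing the open circuits in the body gives operators the failing dependency without a second lookup.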
7. Over-Protection of Idempotent Reads
Explanation: Applying aggressive breakers to cache lookups or read-only APIs where stale data is acceptable. This causes unnecessary fallbacks and reduces cache hit rates.
Fix: Use higher failure thresholds or longer open timeouts for read-heavy, idempotent endpoints. Consider stale-while-revalidate patterns instead of hard circuit breaks for cache layers.
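For the cache-layer case, a stale-while-revalidate read path can be sketched as follows. `StaleWhileRevalidateCache` is a hypothetical in-memory class with an injectable clock for testing; it is not a drop-in replacement for a real cache client.

```typescript
interface CacheEntry<V> {
  value: V;
  fetchedAtMs: number;
}

class StaleWhileRevalidateCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly refreshing = new Set<string>();

  constructor(
    private readonly maxAgeMs: number,
    private readonly now: () => number = () => Date.now()
  ) {}

  async get(key: string, load: () => Promise<V>): Promise<V> {
    const entry = this.entries.get(key);
    if (entry === undefined) {
      // Cold miss: nothing to serve, so wait for the loader.
      const value = await load();
      this.entries.set(key, { value, fetchedAtMs: this.now() });
      return value;
    }
    const isStale = this.now() - entry.fetchedAtMs > this.maxAgeMs;
    if (isStale && !this.refreshing.has(key)) {
      // Refresh in the background; at most one refresh per key at a time.
      this.refreshing.add(key);
      load()
        .then(value => { this.entries.set(key, { value, fetchedAtMs: this.now() }); })
        .catch(() => { /* refresh failed: keep serving the stale value */ })
        .then(() => { this.refreshing.delete(key); });
    }
    return entry.value; // served immediately, even if stale
  }
}
```

A failed refresh never reaches the caller, so a brief downstream outage degrades freshness instead of triggering a fallback.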
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Critical Payment API | Strict breaker (30% threshold, 10s open, async fallback) | Prevents transaction loss and thread exhaustion; async queue preserves revenue | Low (queue infrastructure) |
| Non-Critical Analytics | Loose breaker (70% threshold, 30s open, drop/fallback) | Tolerates higher failure rates; avoids over-engineering for non-revenue paths | Minimal |
| High-Latency Database | Bulkhead + breaker combo | Isolates connection pool exhaustion; breaker prevents cascading query timeouts | Medium (connection pool tuning) |
| Third-Party SaaS | Aggressive breaker (20% threshold, 60s open, cached fallback) | External dependencies are uncontrollable; long open timeout prevents probe storms | Low (cache storage) |
| Internal Cache Layer | Stale-while-revalidate + soft breaker | Prioritizes availability over consistency; reduces breaker trips on transient cache misses | Low (memory overhead) |
Configuration Template
circuit_breaker:
  defaults:
    failure_threshold: 0.5
    window_size_requests: 10
    open_timeout_ms: 15000
    max_half_open_requests: 2
    min_requests_before_evaluation: 5
  endpoints:
    - name: "user-service"
      failure_threshold: 0.4
      open_timeout_ms: 10000
      fallback_strategy: "cached_profile"
    - name: "inventory-check"
      failure_threshold: 0.6
      open_timeout_ms: 20000
      fallback_strategy: "assume_available"
    - name: "payment-processor"
      failure_threshold: 0.3
      open_timeout_ms: 30000
      fallback_strategy: "async_queue"
  observability:
    metrics_prefix: "app.circuit_breaker"
    state_change_alert: true
    fallback_invocation_alert: true
    log_level: "warn"
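How the per-endpoint overrides combine with the defaults is left implicit in the template. One plausible reading, sketched below, is a shallow merge where endpoint values win. The interfaces mirror the template's keys; `resolveEndpoint` and the range check are assumptions, and the objects are written inline so no YAML parser is needed.

```typescript
interface BreakerSettings {
  failure_threshold: number;
  window_size_requests: number;
  open_timeout_ms: number;
  max_half_open_requests: number;
  min_requests_before_evaluation: number;
}

interface EndpointOverride extends Partial<BreakerSettings> {
  name: string;
  fallback_strategy: string;
}

interface ResolvedEndpoint extends BreakerSettings {
  name: string;
  fallback_strategy: string;
}

// Endpoint values override defaults; anything unspecified falls back to defaults.
function resolveEndpoint(defaults: BreakerSettings, endpoint: EndpointOverride): ResolvedEndpoint {
  const resolved: ResolvedEndpoint = { ...defaults, ...endpoint };
  if (resolved.failure_threshold <= 0 || resolved.failure_threshold > 1) {
    throw new Error(`${resolved.name}: failure_threshold must be in (0, 1]`);
  }
  return resolved;
}

const templateDefaults: BreakerSettings = {
  failure_threshold: 0.5,
  window_size_requests: 10,
  open_timeout_ms: 15000,
  max_half_open_requests: 2,
  min_requests_before_evaluation: 5,
};

const userService = resolveEndpoint(templateDefaults, {
  name: 'user-service',
  failure_threshold: 0.4,
  open_timeout_ms: 10000,
  fallback_strategy: 'cached_profile',
});

console.log(userService.failure_threshold);    // 0.4 (overridden)
console.log(userService.window_size_requests); // 10 (inherited)
```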
Quick Start Guide
- Install a production-ready library: npm install cockatiel or npm install opossum. Both offer battle-tested implementations with event hooks for metrics and TypeScript typings.
- Wrap your HTTP client: Route direct fetch or axios calls through the breaker's execute method (or the library's equivalent). Pass your request function and a fallback handler.
- Configure thresholds per dependency: Start with conservative values (50% failure rate, 10-request window, 15s open timeout). Adjust based on actual traffic patterns and SLOs.
- Wire observability: Subscribe to stateChange and rejected events. Forward them to your metrics pipeline (Prometheus, Datadog, CloudWatch) and set alerts for fallback invocation spikes.
- Validate with load testing: Run a controlled failure injection test. Verify that the breaker trips at the expected threshold, half-open probes are limited, and fallbacks execute without blocking the main thread.