hes and forces error awareness at the call site. The normalization step ensures consistent error typing regardless of what the underlying operation throws.
Step 2: Build a Typed Error Hierarchy with Context
Generic Error objects lack the metadata required for intelligent routing and observability. A structured hierarchy carries classification, retry eligibility, and correlation data.
export interface ErrorMetadata {
  correlationId: string;
  timestamp: string;
  retryable: boolean;
  severity: 'low' | 'medium' | 'high' | 'critical';
}

// Exported so other modules (such as the retry policy) can run instanceof checks.
export abstract class DomainError extends Error {
  public readonly metadata: ErrorMetadata;

  constructor(message: string, metadata: Omit<ErrorMetadata, 'timestamp'>) {
    super(message);
    // Restore the prototype chain so instanceof still works when compiling to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = this.constructor.name;
    // The timestamp is stamped at construction time, never supplied by callers.
    this.metadata = {
      ...metadata,
      timestamp: new Date().toISOString(),
    };
  }
}

export class TransientFailure extends DomainError {
  constructor(message: string, correlationId: string) {
    super(message, { correlationId, retryable: true, severity: 'medium' });
  }
}

export class AuthorizationFailure extends DomainError {
  constructor(message: string, correlationId: string) {
    super(message, { correlationId, retryable: false, severity: 'high' });
  }
}
Architecture Rationale: Errors become self-describing. The retryable flag drives automated recovery logic, while severity and correlationId integrate directly with observability platforms. Abstract base classes enforce consistent metadata injection across the codebase.
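To make the "self-describing" point concrete, here is a minimal sketch of turning an error into a log-ready record. The `toLogEntry` helper and the `LogEntry` shape are hypothetical names introduced for illustration, and treating unknown values as non-retryable and critical is an assumption, not something the hierarchy above prescribes. The class definitions are condensed stand-ins so the snippet runs on its own.

```typescript
// Condensed stand-ins for the hierarchy above so the sketch is self-contained.
interface ErrorMetadata {
  correlationId: string;
  timestamp: string;
  retryable: boolean;
  severity: 'low' | 'medium' | 'high' | 'critical';
}

abstract class DomainError extends Error {
  public readonly metadata: ErrorMetadata;
  constructor(message: string, metadata: Omit<ErrorMetadata, 'timestamp'>) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype); // keeps instanceof reliable on ES5 targets
    this.name = this.constructor.name;
    this.metadata = { ...metadata, timestamp: new Date().toISOString() };
  }
}

class TransientFailure extends DomainError {
  constructor(message: string, correlationId: string) {
    super(message, { correlationId, retryable: true, severity: 'medium' });
  }
}

// Hypothetical record shape for a structured logger.
interface LogEntry {
  name: string;
  message: string;
  correlationId: string | null;
  retryable: boolean;
  severity: ErrorMetadata['severity'];
}

// Hypothetical helper: flattens any thrown value into a log-ready record.
function toLogEntry(err: unknown): LogEntry {
  if (err instanceof DomainError) {
    const { correlationId, retryable, severity } = err.metadata;
    return { name: err.name, message: err.message, correlationId, retryable, severity };
  }
  // Assumption: values without metadata are treated as non-retryable and critical.
  const message = err instanceof Error ? err.message : String(err);
  return { name: 'UnknownError', message, correlationId: null, retryable: false, severity: 'critical' };
}

const entry = toLogEntry(new TransientFailure('Upstream timed out', 'req-123'));
console.log(JSON.stringify(entry));
```

Because the routing data lives on the error itself, a handler like this needs no lookup table keyed on message strings.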
Step 3: Implement Transient Failure Recovery
Network blips and temporary resource contention require intelligent retry logic. Blind retries amplify outages; exponential backoff with jitter stabilizes recovery.
// Exported because the configuration template imports this type.
export interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number;
}

export async function retryWithBackoff<T>(
  operation: () => Promise<TaskResult<T>>,
  config: RetryConfig
): Promise<TaskResult<T>> {
  let attempt = 0;
  while (attempt < config.maxAttempts) {
    attempt++;
    const result = await operation();
    if (result.success) return result;

    // Only errors explicitly tagged as retryable get another attempt.
    const isRetryable = result.error instanceof DomainError && result.error.metadata.retryable;
    if (!isRetryable || attempt === config.maxAttempts) return result;

    // The delay doubles per attempt, is capped at maxDelayMs, then gets positive jitter on top.
    const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt - 1);
    const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);
    const jitter = Math.random() * config.jitterFactor * cappedDelay;
    const waitTime = cappedDelay + jitter;
    await new Promise(resolve => setTimeout(resolve, waitTime));
  }
  // Only reached when maxAttempts is less than 1, so the operation never ran.
  return { success: false, error: new Error('Max retry attempts exhausted') };
}
Architecture Rationale: Exponential backoff prevents retry storms. Jitter randomizes delays across concurrent clients, avoiding synchronized request spikes. The retryable check ensures client-side errors (4xx) fail fast, preserving system resources.
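The delay arithmetic is easier to reason about in isolation. The sketch below extracts the same formula used in retryWithBackoff into a pure function; `computeDelay` and its injectable `random` parameter are hypothetical additions for illustration, so the bounds can be checked without real randomness.

```typescript
interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number;
}

// Same formula as the retry loop, with the random source injected (an assumption
// of this sketch) so results are reproducible.
function computeDelay(
  attempt: number,
  config: RetryConfig,
  random: () => number = Math.random
): number {
  const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt - 1);
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);
  const jitter = random() * config.jitterFactor * cappedDelay;
  return cappedDelay + jitter;
}

const config: RetryConfig = { maxAttempts: 5, baseDelayMs: 500, maxDelayMs: 5000, jitterFactor: 0.3 };

// Print the lower bound (random = 0) and upper bound (random = 1) of each wait.
// The loop stops before maxAttempts because no wait follows the final attempt.
for (let attempt = 1; attempt < config.maxAttempts; attempt++) {
  const min = computeDelay(attempt, config, () => 0);
  const max = computeDelay(attempt, config, () => 1);
  console.log(`after attempt ${attempt}: ${min}ms to ${max}ms`);
}
```

With these illustrative values the waits fall in roughly 500 to 650 ms, 1000 to 1300 ms, 2000 to 2600 ms, and 4000 to 5200 ms. The last range shows a detail worth knowing: jitter is added after the cap, so an individual wait can exceed maxDelayMs by up to jitterFactor times the capped delay.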
Step 4: Deploy Circuit Breakers for Downstream Dependencies
When a service degrades, continuous retries waste resources and increase latency. A circuit breaker monitors failure rates and temporarily halts requests to failing dependencies.
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

export class ServiceCircuit {
  private failures: number = 0;
  private state: CircuitState = 'CLOSED';
  private nextProbeTime: number = 0;

  constructor(
    private failureThreshold: number,
    private recoveryTimeoutMs: number
  ) {}

  async execute<T>(operation: () => Promise<TaskResult<T>>): Promise<TaskResult<T>> {
    if (this.state === 'OPEN') {
      // Fail fast until the recovery timeout has elapsed.
      if (Date.now() < this.nextProbeTime) {
        return { success: false, error: new Error('Circuit open: dependency unavailable') };
      }
      // Timeout elapsed: let this call through as a health probe.
      this.state = 'HALF_OPEN';
    }
    const result = await operation();
    if (result.success) {
      this.reset();
    } else {
      this.recordFailure();
    }
    return result;
  }

  private recordFailure(): void {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.nextProbeTime = Date.now() + this.recoveryTimeoutMs;
    }
  }

  private reset(): void {
    this.failures = 0;
    this.state = 'CLOSED';
  }
}
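Architecture Rationale: The three-state model (CLOSED → OPEN → HALF_OPEN) balances fail-fast behavior with automatic recovery. Once the recovery timeout elapses, HALF_OPEN lets a probe request through to validate service health: a success closes the circuit and resumes normal traffic, while a failure reopens it for another timeout period. Note that the class above does not limit HALF_OPEN to a single in-flight probe, so concurrent callers arriving at that moment all pass through; add a probe-in-flight flag if strict single-probe behavior is required. This prevents cascading failures during partial outages.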
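The state transitions can be traced with a fake clock. The following is a condensed variant of the breaker, not the class above verbatim: `DemoCircuit`, the injectable `now` parameter, and the `getState()` accessor are hypothetical additions that make the behavior observable without waiting on real timers. The `TaskResult` shape is assumed to be a success/failure discriminated union.

```typescript
// Assumed result shape for this sketch.
type TaskResult<T> = { success: true; data: T } | { success: false; error: Error };
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class DemoCircuit {
  private failures = 0;
  private state: CircuitState = 'CLOSED';
  private nextProbeTime = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly recoveryTimeoutMs: number,
    private readonly now: () => number = () => Date.now() // injectable clock (sketch-only)
  ) {}

  getState(): CircuitState {
    return this.state;
  }

  async execute<T>(operation: () => Promise<TaskResult<T>>): Promise<TaskResult<T>> {
    if (this.state === 'OPEN') {
      if (this.now() < this.nextProbeTime) {
        return { success: false, error: new Error('Circuit open: dependency unavailable') };
      }
      this.state = 'HALF_OPEN';
    }
    const result = await operation();
    if (result.success) {
      this.failures = 0;
      this.state = 'CLOSED';
    } else if (++this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.nextProbeTime = this.now() + this.recoveryTimeoutMs;
    }
    return result;
  }
}

async function demo(): Promise<CircuitState[]> {
  let clock = 0;
  const circuit = new DemoCircuit(2, 1000, () => clock);
  const fail = async (): Promise<TaskResult<string>> => ({ success: false, error: new Error('503') });
  const ok = async (): Promise<TaskResult<string>> => ({ success: true, data: 'pong' });
  const seen: CircuitState[] = [];

  await circuit.execute(fail);
  seen.push(circuit.getState()); // one failure, below threshold
  await circuit.execute(fail);
  seen.push(circuit.getState()); // threshold reached, circuit trips
  await circuit.execute(ok);
  seen.push(circuit.getState()); // rejected without calling the operation
  clock = 1000;                  // recovery timeout has elapsed
  await circuit.execute(ok);
  seen.push(circuit.getState()); // probe succeeds, circuit closes
  return seen;
}

demo().then(states => console.log(states.join(' -> ')));
```

Injecting the clock is a testing convenience; production code can keep calling Date.now() directly as the original class does.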
Step 5: Guarantee Consistency with Compensation Patterns
Multi-step operations risk partial state corruption. Compensation logic ensures the system returns to a valid baseline when intermediate steps fail.
interface CompensatableStep<T> {
  execute: () => Promise<TaskResult<T>>;
  compensate: (result: T) => Promise<void>;
}

export async function executeWithCompensation<T>(
  steps: CompensatableStep<T>[],
  correlationId: string
): Promise<TaskResult<T>> {
  const completedResults: T[] = [];
  for (const step of steps) {
    const result = await step.execute();
    if (!result.success) {
      // Roll back completed steps in reverse order. completedResults[i] was
      // produced by steps[i], so each step is undone by its own handler.
      for (let i = completedResults.length - 1; i >= 0; i--) {
        await steps[i].compensate(completedResults[i]).catch(err => {
          console.error(`Compensation failed at step ${i}:`, err);
        });
      }
      // Keep the original cause in the message so it is not lost to observability.
      return {
        success: false,
        error: new TransientFailure(`Workflow aborted: ${result.error.message}`, correlationId),
      };
    }
    completedResults.push(result.data);
  }
  // Check the length rather than truthiness so falsy results (0, '', false) still count.
  return completedResults.length > 0
    ? { success: true, data: completedResults[completedResults.length - 1] }
    : { success: false, error: new Error('No data returned') };
}
Architecture Rationale: Compensation is preferred over distributed transactions in async systems. By storing intermediate results and executing reverse operations on failure, the system maintains eventual consistency without locking resources. Reverse-order rollback ensures dependencies are cleaned up correctly.
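A small run-through shows the rollback order. This sketch uses a condensed executor, `runSaga`, which is a hypothetical name and a trimmed stand-in for executeWithCompensation; the step names (`reserve-stock`, `charge-card`, `send-receipt`) are invented for illustration, and `TaskResult` is assumed to be a success/failure discriminated union.

```typescript
// Assumed result shape for this sketch.
type TaskResult<T> = { success: true; data: T } | { success: false; error: Error };

interface CompensatableStep<T> {
  execute: () => Promise<TaskResult<T>>;
  compensate: (result: T) => Promise<void>;
}

// Condensed executor: runs steps in order and, on failure, undoes the completed
// ones in reverse, each with its own compensate handler.
async function runSaga<T>(steps: CompensatableStep<T>[]): Promise<TaskResult<T[]>> {
  const completed: T[] = [];
  for (const step of steps) {
    const result = await step.execute();
    if (!result.success) {
      for (let i = completed.length - 1; i >= 0; i--) {
        // A failed compensation is ignored here; real code should log it.
        await steps[i].compensate(completed[i]).catch(() => undefined);
      }
      return { success: false, error: result.error };
    }
    completed.push(result.data);
  }
  return { success: true, data: completed };
}

async function demo(): Promise<string[]> {
  const log: string[] = [];

  // Builds a step that records what happened to it.
  const step = (name: string, succeed: boolean): CompensatableStep<string> => ({
    execute: async (): Promise<TaskResult<string>> => {
      log.push(`execute ${name}`);
      return succeed
        ? { success: true, data: `${name}-result` }
        : { success: false, error: new Error(`${name} failed`) };
    },
    compensate: async (result: string): Promise<void> => {
      log.push(`compensate ${result}`);
    },
  });

  const outcome = await runSaga([
    step('reserve-stock', true),
    step('charge-card', true),
    step('send-receipt', false),
  ]);
  log.push(outcome.success ? 'committed' : `aborted: ${outcome.error.message}`);
  return log;
}

demo().then(lines => console.log(lines.join('\n')));
```

The third step fails, so the charge is undone before the stock reservation, mirroring the order in which the work was done. The failing step itself is not compensated because it never produced a result.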
Pitfall Guide
1. Silent Exception Swallowing
Explanation: Empty catch blocks or console.log-only handlers hide failures from users and monitoring systems. The application continues in an undefined state.
Fix: Always route errors through a structured handler. Return explicit failure states, trigger user notifications, and emit structured logs with correlation IDs.
2. Retrying Non-Idempotent Operations
Explanation: Blindly retrying POST/PUT requests can duplicate data, charge customers twice, or corrupt records. Not all failures are transient.
Fix: Tag operations with idempotency keys. Only retry operations marked as safe for repetition. Use idempotency headers in API clients to prevent duplicate processing.
3. Ignoring Error Context in Observability
Explanation: Logging only the error message strips stack traces, request IDs, and user context. Debugging becomes a manual forensic exercise.
Fix: Attach correlationId, userId, endpoint, and attemptCount to every error payload. Use structured logging formats (JSON) that integrate with APM platforms.
4. Over-Engineering Circuit Breakers
Explanation: Applying circuit breakers to local functions or compute-bound operations adds unnecessary latency and complexity. They're designed for external dependencies.
Fix: Reserve circuit breakers for network calls, database connections, and third-party APIs. Use simple timeouts and fallbacks for internal logic.
5. Mixing Sync and Async Error Boundaries
Explanation: Synchronous try/catch cannot capture Promise rejections. Unhandled rejections crash Node.js processes or leave UIs in broken states.
Fix: Use async/await consistently. Wrap Promise chains with .catch() or use the result pattern. Implement global unhandled rejection listeners as a safety net, not a primary handler.
6. Failing to Validate Error Types Before Branching
Explanation: Assuming every caught value is an Error object leads to runtime crashes when accessing .message or .stack.
Fix: Always normalize caught values. Check instanceof or use type guards before branching. The result wrapper pattern enforces this automatically.
7. Neglecting Partial State Recovery
Explanation: Multi-step workflows that fail midway leave databases, caches, and external services in inconsistent states.
Fix: Implement compensation handlers for every mutable step. Design operations to be reversible. Log compensation attempts separately for audit trails.
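Pitfall 6 is the easiest of these to guard against in code. As a minimal sketch, the hypothetical `normalizeError` helper below converts whatever a catch clause receives into an Error before any branching happens; the fallback choices (JSON for plain objects, a generic tag for values that cannot be serialized) are assumptions of this example rather than a prescribed format.

```typescript
// Hypothetical helper: converts anything a catch clause can receive into an Error.
function normalizeError(value: unknown): Error {
  if (value instanceof Error) return value;
  if (typeof value === 'string') return new Error(value);
  try {
    // JSON.stringify returns undefined for undefined, functions, and symbols.
    const serialized: string | undefined = JSON.stringify(value);
    return new Error(serialized !== undefined ? serialized : String(value));
  } catch {
    // JSON.stringify throws on circular structures and BigInt values.
    return new Error(Object.prototype.toString.call(value));
  }
}

// Example: a function that (deliberately) throws a string instead of an Error.
function parsePort(raw: string): number {
  const port = Number(raw);
  if (isNaN(port)) throw 'invalid port: ' + raw;
  return port;
}

try {
  parsePort('eighty');
} catch (caught) {
  const error = normalizeError(caught);
  // Safe to read .message and .stack now, whatever was thrown.
  console.log(error.message);
}
```

Once every caught value passes through a function like this, downstream handlers can rely on instanceof checks without defensive guards at each site.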
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Transient network timeout (5xx) | Retry with exponential backoff | Temporary failures resolve quickly; backoff prevents storms | Low (minor latency increase) |
| Downstream service degradation | Circuit breaker + fallback | Prevents cascading failures; preserves local resources | Medium (requires state tracking) |
| Invalid user input (4xx) | Fail fast + structured error | Retrying wastes resources; user must correct input | None (immediate response) |
| Multi-step data mutation | Compensation pattern | Guarantees consistency without distributed locks | Medium (additional rollback logic) |
| Critical payment processing | Idempotency keys + manual review | Prevents duplicate charges; enables audit trails | High (requires infrastructure) |
Configuration Template
// src/infrastructure/error-handling/config.ts
import { ServiceCircuit } from './circuit-breaker';
import { RetryConfig } from './retry-policy';

export const errorHandlingConfig = {
  retry: {
    maxAttempts: 3,
    baseDelayMs: 500,
    maxDelayMs: 5000,
    jitterFactor: 0.3,
  } as RetryConfig,
  circuits: {
    // ServiceCircuit(failureThreshold, recoveryTimeoutMs)
    paymentGateway: new ServiceCircuit(5, 30000),
    inventoryService: new ServiceCircuit(3, 15000),
    notificationProvider: new ServiceCircuit(4, 20000),
  },
  logging: {
    enableStackTraces: true,
    maskSensitiveFields: ['password', 'token', 'ssn'],
    correlationHeader: 'X-Correlation-ID',
  },
  ui: {
    fallbackTimeoutMs: 8000,
    retryableErrorCodes: ['NETWORK_ERROR', 'TIMEOUT', 'SERVICE_UNAVAILABLE'],
    userMessageMap: {
      AUTH_ERROR: 'Session expired. Please log in again.',
      RATE_LIMIT: 'Too many requests. Please wait a moment.',
      DEFAULT: 'An unexpected error occurred. Our team has been notified.',
    },
  },
};
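The template only declares values; how they are consumed is left open. One possible reading is sketched below, assuming two hypothetical helpers: `userMessageFor`, which resolves a user-facing message with a DEFAULT fallback, and `maskSensitive`, which redacts the listed fields before logging. Neither helper is part of the template, the `[REDACTED]` marker is an arbitrary choice, and the masking shown covers top-level fields only.

```typescript
// Trimmed copy of the relevant config fields so the sketch runs on its own.
const logging = { maskSensitiveFields: ['password', 'token', 'ssn'] };
const userMessageMap: Record<string, string> = {
  AUTH_ERROR: 'Session expired. Please log in again.',
  RATE_LIMIT: 'Too many requests. Please wait a moment.',
  DEFAULT: 'An unexpected error occurred. Our team has been notified.',
};

// Hypothetical helper: looks up a message by error code, falling back to DEFAULT.
// hasOwnProperty avoids matching inherited keys such as 'toString'.
function userMessageFor(code: string): string {
  const known = Object.prototype.hasOwnProperty.call(userMessageMap, code);
  return known ? userMessageMap[code] : userMessageMap['DEFAULT'];
}

// Hypothetical helper: returns a shallow copy with sensitive top-level fields redacted.
function maskSensitive(
  payload: Record<string, unknown>,
  fields: string[]
): Record<string, unknown> {
  const masked: Record<string, unknown> = {};
  for (const key of Object.keys(payload)) {
    masked[key] = fields.indexOf(key) >= 0 ? '[REDACTED]' : payload[key];
  }
  return masked;
}

console.log(userMessageFor('RATE_LIMIT'));
console.log(userMessageFor('DISK_FULL'));
console.log(JSON.stringify(maskSensitive({ user: 'ada', token: 'abc123' }, logging.maskSensitiveFields)));
```

Nested payloads would need a recursive variant; that is left out here to keep the sketch short.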
Quick Start Guide
- Install the result wrapper: Replace existing try/catch blocks with executeTask() for all async operations. Update call sites to check result.success before proceeding.
- Define error types: Create domain-specific error classes extending DomainError. Tag each with retryable: true/false and appropriate severity.
- Wire up recovery: Import retryWithBackoff and ServiceCircuit from their modules, and take the retry settings and circuit instances from the config. Wrap each external API call in the circuit breaker's execute(), then pass the wrapped call to retryWithBackoff.
- Add compensation: For workflows modifying multiple resources, implement the CompensatableStep interface for each mutation and register its reverse operation.
- Connect observability: Attach correlation IDs to HTTP headers and error metadata. Configure your logging pipeline to ingest structured JSON payloads. Verify traces in your APM dashboard.
Error handling isn't about preventing failures; it's about controlling how they propagate. By treating exceptions as structured data rather than control flow, you transform fragile applications into resilient systems that degrade gracefully, recover automatically, and provide clear visibility when things go wrong.