Layer 2: Centralized Route Error Handling (Express)
Express requires explicit error forwarding: middleware that returns a promise must hand rejections to the error-handling stack itself. A shared route wrapper eliminates repetitive try/catch blocks and guarantees that every rejection reaches Express's four-parameter error middleware.
```typescript
import { Request, Response, NextFunction } from 'express';

type RouteHandler = (req: Request, res: Response, next: NextFunction) => Promise<void>;

function wrapRoute(handler: RouteHandler) {
  return (req: Request, res: Response, next: NextFunction) => {
    Promise.resolve(handler(req, res, next)).catch(next);
  };
}
```
```typescript
// Global error middleware (must be registered last).
// Keep all four parameters: Express identifies error middleware by its arity.
function globalErrorHandler(err: Error, req: Request, res: Response, next: NextFunction) {
  const isOperational = (err as any).isOperational === true;
  const statusCode = (err as any).statusCode || 500;

  if (!isOperational) {
    console.error('UNHANDLED PROGRAMMER ERROR:', err.stack);
  }

  res.status(statusCode).json({
    status: 'error',
    message: isOperational ? err.message : 'Service unavailable',
    ...(process.env.NODE_ENV === 'development' && { stack: err.stack }),
  });
}
```
Architecture Rationale: Express's error middleware only triggers when next(err) is called. The wrapper converts promise rejections into next() calls. The global handler distinguishes between operational failures (expected, client-facing) and programmer errors (unexpected, internal). Stack traces are suppressed in production to prevent information leakage.
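A minimal wiring sketch under these definitions (findUser is a hypothetical data-access helper; the route and port are placeholders):

```typescript
import express from 'express';

declare function findUser(id: string): Promise<{ id: string } | null>; // hypothetical lookup

const app = express();
app.use(express.json());

// Async handler: any throw or rejection is forwarded to next() by wrapRoute.
app.get('/users/:id', wrapRoute(async (req, res) => {
  const user = await findUser(req.params.id);
  if (!user) {
    res.status(404).json({ status: 'error', message: 'User not found' });
    return;
  }
  res.json(user);
}));

// Registered last so next(err) from any route lands here.
app.use(globalErrorHandler);
```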
Layer 3: Domain-Specific Error Taxonomy
Flat error objects make routing impossible. A typed error hierarchy enables precise handling logic. Operational errors represent expected failure states (validation, auth, not found). Programmer errors represent bugs (null references, type mismatches).
```typescript
abstract class ServiceFault extends Error {
  public readonly statusCode: number;
  public readonly isOperational: boolean;

  constructor(message: string, statusCode: number, isOperational: boolean) {
    super(message);
    this.statusCode = statusCode;
    this.isOperational = isOperational;
    Error.captureStackTrace(this, this.constructor);
  }
}

class ValidationFault extends ServiceFault {
  constructor(public readonly fields: string[]) {
    super('Validation failed', 400, true);
  }
}

class DependencyFault extends ServiceFault {
  constructor(public readonly service: string, message: string) {
    super(`Dependency ${service} failed: ${message}`, 502, true);
  }
}

class CriticalFault extends ServiceFault {
  constructor(message: string) {
    super(message, 500, false);
  }
}
```
Architecture Rationale: Separating isOperational lets the error handler decide whether to alert on-call engineers or simply return a client-friendly response. Error.captureStackTrace trims the constructor call out of the stack trace, so traces point at the throw site rather than the error class. Custom properties (fields, service) carry structured context without polluting the message string.
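A usage sketch, assuming the wrapRoute/app wiring from Layer 2 (chargePayment is a hypothetical payment client):

```typescript
declare function chargePayment(order: unknown): Promise<void>; // hypothetical payment client

app.post('/orders', wrapRoute(async (req, res) => {
  const missing = ['sku', 'quantity'].filter((field) => req.body[field] === undefined);
  if (missing.length > 0) {
    throw new ValidationFault(missing); // operational: 400, no alert
  }

  try {
    await chargePayment(req.body);
  } catch (err) {
    throw new DependencyFault('payments', (err as Error).message); // operational: 502
  }

  res.status(201).json({ status: 'created' });
}));
```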
Layer 4: Resilience Primitives
Retries and circuit breakers address different failure modes. Retries handle transient network blips. Circuit breakers prevent cascading failures when a dependency is genuinely degraded.
```typescript
class BackoffStrategy {
  constructor(
    private readonly maxAttempts: number,
    private readonly baseDelay: number,
    private readonly jitter: boolean = true
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        return await fn();
      } catch (error) {
        if (attempt === this.maxAttempts) throw error;
        // Exponential delay, optionally spread with jitter to avoid synchronized retries.
        const delay = this.baseDelay * Math.pow(2, attempt - 1);
        const adjustedDelay = this.jitter ? delay * (0.5 + Math.random() * 0.5) : delay;
        await new Promise((res) => setTimeout(res, adjustedDelay));
      }
    }
    throw new Error('Max retries exceeded'); // unreachable; satisfies the return type
  }
}

class CircuitGuard {
  private failures = 0;
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
  private nextRetry = 0;

  constructor(
    private readonly threshold: number,
    private readonly resetTimeout: number
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextRetry) {
        throw new DependencyFault('CircuitBreaker', 'Service temporarily unavailable');
      }
      // Reset window elapsed: let a probe request through.
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  private onFailure() {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.nextRetry = Date.now() + this.resetTimeout;
    }
  }
}
```
Architecture Rationale: Exponential backoff with jitter prevents thundering herd problems when multiple instances retry simultaneously. The circuit breaker's HALF_OPEN state allows a single probe request to test dependency health without overwhelming it. These primitives are composable: wrap external calls with CircuitGuard, and wrap database queries with BackoffStrategy.
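A composition sketch along those lines (the rates endpoint and db client are placeholders; global fetch assumes Node 18+):

```typescript
declare const db: { query: (sql: string, params: unknown[]) => Promise<unknown> }; // placeholder client

const ratesBreaker = new CircuitGuard(5, 30_000);
const dbRetry = new BackoffStrategy(3, 500);

// External HTTP dependency: the breaker guards the whole call.
async function fetchExchangeRate(currency: string): Promise<number> {
  return ratesBreaker.call(async () => {
    const res = await fetch(`https://rates.example.com/${currency}`); // hypothetical endpoint
    if (!res.ok) throw new DependencyFault('rates', `HTTP ${res.status}`);
    const body = (await res.json()) as { rate: number };
    return body.rate;
  });
}

// Database read: transient blips are retried with backoff.
async function loadUser(id: string) {
  return dbRetry.execute(() => db.query('SELECT * FROM users WHERE id = $1', [id]));
}
```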
Layer 5: Process Lifecycle Management
Node.js does not automatically recover from uncaught exceptions. The process state becomes undefined. The correct response is immediate termination after logging. Unhandled rejections should be treated similarly in production.
```typescript
function initializeProcessSafeguards() {
  process.on('unhandledRejection', (reason) => {
    console.error('UNHANDLED REJECTION:', reason);
    if (process.env.NODE_ENV === 'production') {
      process.exit(1);
    }
  });

  process.on('uncaughtException', (error) => {
    console.error('UNCAUGHT EXCEPTION:', error.stack);
    process.exit(1);
  });
}
```
```typescript
import type { Server } from 'http';

function attachGracefulShutdown(server: Server, dbClient: { disconnect: () => Promise<void> }) {
  const shutdown = async (signal: string) => {
    console.log(`${signal} received. Initiating graceful shutdown.`);

    // Stop accepting new connections; the callback runs once in-flight requests finish.
    server.close(async () => {
      try {
        await dbClient.disconnect();
        console.log('Resources released. Exiting.');
        process.exit(0);
      } catch (err) {
        console.error('Cleanup failed:', err);
        process.exit(1);
      }
    });

    // Hard deadline so a stuck connection cannot keep the process alive forever.
    setTimeout(() => {
      console.error('Shutdown timeout exceeded. Forcing exit.');
      process.exit(1);
    }, 10000);
  };

  process.on('SIGTERM', () => shutdown('SIGTERM'));
  process.on('SIGINT', () => shutdown('SIGINT'));
}
```
Architecture Rationale: uncaughtException indicates a bug that corrupted memory or event loop state. Continuing execution risks data inconsistency. Graceful shutdown drains existing connections, closes database pools, and enforces a hard timeout to prevent zombie processes. This pattern integrates cleanly with Kubernetes, Docker, or PM2.
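A bootstrap sketch tying the pieces together (app is the Express instance from Layer 2; dbClient stands in for any pool wrapper exposing disconnect()):

```typescript
import http from 'http';

declare const dbClient: { disconnect: () => Promise<void> }; // placeholder pool wrapper

initializeProcessSafeguards();

const server = http.createServer(app);
server.listen(3000, () => console.log('Listening on :3000'));

attachGracefulShutdown(server, dbClient);
```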
Pitfall Guide
1. Silent Rejection Swallowing
Explanation: Empty catch blocks or .catch(() => {}) discard error context. The application continues with incomplete state, causing downstream failures that are nearly impossible to trace.
Fix: Always log or forward rejections. At minimum: catch(err => { logger.error(err); throw err; }). Use structured logging to attach request IDs and timestamps.
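A sketch of the log-and-rethrow pattern with a structured logger (pino is assumed here, as in the configuration template below; profileService is a hypothetical client):

```typescript
import pino from 'pino';

const logger = pino();

declare const profileService: { fetch: (userId: string) => Promise<unknown> }; // hypothetical client

async function loadProfile(userId: string, requestId: string) {
  try {
    return await profileService.fetch(userId);
  } catch (err) {
    logger.error({ err, requestId, userId }, 'profile fetch failed');
    throw err; // never swallow: let the route wrapper forward it
  }
}
```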
2. Continuing After Uncaught Exceptions
Explanation: Developers sometimes register uncaughtException handlers and attempt to keep the server running. After an uncaught exception the process is in an undefined state; timers, sockets, and in-memory data may be inconsistent.
Fix: Treat uncaughtException as fatal. Log the stack, flush buffers, and call process.exit(1). Rely on the orchestrator to restart the container.
3. Mixing Operational and Programmer Errors
Explanation: Returning a 500 status for a missing parameter confuses monitoring systems. Alerting on expected validation failures creates noise and masks real outages.
Fix: Enforce the isOperational flag. Route operational errors to client responses. Route programmer errors to error tracking services (Sentry, Datadog) and trigger on-call alerts.
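A routing sketch for this rule (captureException and pageOnCall are stand-ins for your error-tracking and alerting clients):

```typescript
declare function captureException(err: Error): void;  // e.g. Sentry/Datadog client
declare function pageOnCall(summary: string): void;   // e.g. PagerDuty hook

function reportIfProgrammerError(err: Error): void {
  const fault = err as Partial<ServiceFault>;
  if (fault.isOperational === true) return;           // expected failure: client response only
  captureException(err);                              // ship full context to error tracking
  pageOnCall(`Unexpected failure: ${err.message}`);   // trigger on-call alert
}
```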
4. Retry Storms Without Jitter
Explanation: Multiple service instances retrying simultaneously with fixed delays create synchronized load spikes. This amplifies dependency degradation instead of recovering from it.
Fix: Implement exponential backoff with random jitter. Add a maximum delay cap. Consider distributed rate limiting if retries originate from multiple nodes.
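A sketch of a capped, jittered delay (maxDelayMs is an assumed addition to the BackoffStrategy shown in Layer 4):

```typescript
function delayForAttempt(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  const exponential = baseDelayMs * Math.pow(2, attempt - 1);
  const capped = Math.min(exponential, maxDelayMs);
  return capped * (0.5 + Math.random() * 0.5); // jitter applied after the cap
}

// With baseDelayMs = 500 and maxDelayMs = 10_000, attempt 6 is capped at 10s before jitter.
```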
5. Blocking Graceful Shutdown
Explanation: Long-running tasks or unclosed database connections prevent server.close() from completing. The orchestrator kills the process forcefully, dropping in-flight requests and corrupting transactions.
Fix: Track active requests. Cancel or timeout long operations during shutdown. Close connection pools explicitly. Enforce a hard timeout (e.g., 10s) to guarantee termination.
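A sketch of in-flight tracking that a shutdown routine could poll before closing pools (the counter middleware and waitForDrain helper are assumptions, not Express built-ins; app is the Express instance from Layer 2):

```typescript
let inFlight = 0;

app.use((req, res, next) => {
  inFlight++;
  res.once('close', () => { inFlight--; }); // fires for completed and aborted responses
  next();
});

async function waitForDrain(timeoutMs: number): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (inFlight > 0 && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, 100)); // poll until drained or deadline
  }
}
```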
6. Exposing Internal Stack Traces in Production
Explanation: Returning full error stacks to clients leaks implementation details, file paths, and dependency versions. Attackers use this information for reconnaissance.
Fix: Conditionally attach stacks based on NODE_ENV. In production, return generic messages. Store full traces in internal logs or error tracking platforms.
7. Overusing Circuit Breakers for Idempotent Calls
Explanation: Applying circuit breakers to every external call introduces unnecessary latency and state management overhead. Some failures are transient and don't warrant isolation.
Fix: Reserve circuit breakers for non-idempotent or high-latency dependencies (payment gateways, third-party APIs). Use simple retries for idempotent reads. Monitor breaker state transitions to tune thresholds.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Internal microservice call | Retry with exponential backoff | Transient failures are common; idempotent retries are safe | Low (network overhead) |
| Third-party payment API | Circuit breaker + fallback cache | Prevents cascading failures; maintains UX during outages | Medium (cache infrastructure) |
| User input validation | Operational error class + 400 response | Expected failure; no retry needed; client must correct input | None |
| Database write transaction | Retry + idempotency key | Network blips occur; duplicate writes must be prevented (see sketch below) | Low (index overhead) |
| Legacy monolith endpoint | Circuit breaker + strict timeout | Unpredictable latency; high risk of thread pool exhaustion | Medium (monitoring setup) |
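A sketch of the retry-plus-idempotency-key row (the table name, columns, and PostgreSQL ON CONFLICT clause are assumptions; a unique index on idempotency_key is what turns a retried insert into a no-op):

```typescript
import { randomUUID } from 'crypto';

declare const db: { query: (sql: string, params: unknown[]) => Promise<unknown> }; // placeholder client

async function createPayment(amountCents: number) {
  const idempotencyKey = randomUUID();       // generated once, reused across retries
  const retry = new BackoffStrategy(3, 500);

  return retry.execute(() =>
    db.query(
      `INSERT INTO payments (idempotency_key, amount_cents)
       VALUES ($1, $2)
       ON CONFLICT (idempotency_key) DO NOTHING`,
      [idempotencyKey, amountCents]
    )
  );
}
```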
Configuration Template
```typescript
// resilience.config.ts
export const ResilienceConfig = {
  retry: {
    maxAttempts: 3,
    baseDelayMs: 500,
    jitter: true,
    retryableStatuses: [408, 429, 500, 502, 503, 504],
  },
  circuitBreaker: {
    failureThreshold: 5,
    resetTimeoutMs: 30000,
    monitoringWindowMs: 60000,
  },
  shutdown: {
    drainTimeoutMs: 10000,
    dbDisconnectTimeoutMs: 5000,
  },
  errorHandling: {
    exposeStackInDev: true,
    operationalErrorPrefix: 'OP_ERR_',
    logger: 'pino', // or winston, bunyan
  },
};
```
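A consumption sketch: deciding whether an HTTP failure is retryable based on retryableStatuses (the statusCode property on the error is an assumption about your HTTP client):

```typescript
const retry = new BackoffStrategy(ResilienceConfig.retry.maxAttempts, ResilienceConfig.retry.baseDelayMs);

function isRetryable(err: unknown): boolean {
  const status = (err as { statusCode?: number }).statusCode;
  return status !== undefined && ResilienceConfig.retry.retryableStatuses.includes(status);
}

async function callWithPolicy<T>(fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    if (!isRetryable(err)) throw err; // fail fast on non-retryable statuses
    return retry.execute(fn);         // otherwise retry with backoff
  }
}
```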
Quick Start Guide
- Install dependencies: `npm install express pino` (or your preferred logger)
- Create the error taxonomy: copy the `ServiceFault` hierarchy into `src/errors/`
- Wrap your routes: replace raw async handlers with `wrapRoute()` from the core solution
- Register global middleware: add `globalErrorHandler` as the last `app.use()` call
- Initialize safeguards: call `initializeProcessSafeguards()` and `attachGracefulShutdown()` at application bootstrap
Run the service and trigger a deliberate failure (e.g., throw inside a route). Verify that the error is logged, the response returns a structured JSON payload, and the process remains stable. Deploy to a staging environment and simulate dependency timeouts to validate retry and circuit breaker behavior.