ault {
  constructor(field: string, reason: string) {
    super(`Validation failed for ${field}: ${reason}`, {
      code: 'VALIDATION_FAILURE',
      httpStatus: 422,
      context: { field, reason },
      isRetryable: false
    });
  }
}

// Infrastructure: Transient network or resource issues
class TransientInfraFault extends ServiceFault {
  constructor(service: string, underlying: Error) {
    super(`Upstream dependency unavailable: ${service}`, {
      code: 'TRANSIENT_INFRA',
      httpStatus: 503,
      context: { service, originalMessage: underlying.message },
      isRetryable: true
    });
  }
}
**Architecture Rationale**: Using an abstract base class enforces a contract. Every fault carries an HTTP status, a machine-readable code, a retryability flag, and optional context, so middleware and monitoring pipelines never have to guess at an error's shape. The `isRetryable` flag is critical: it stops retry logic from wasting resources on client errors (4xx) while letting transient failures (5xx) be retried until they recover.
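
The same contract extends to other domains. As a minimal sketch, here is what the `AuthFault` subclass mentioned in the Quick Start might look like. The `ServiceFault` base below is a stand-in inferred from the `super(message, options)` calls above, not the original definition, and the `AUTH_FAILURE` code and 401 status are illustrative choices.

```typescript
// Stand-in for the base class, reconstructed from how the subclasses call super().
interface FaultOptions {
  code: string;
  httpStatus: number;
  context?: Record<string, unknown>;
  isRetryable: boolean;
}

abstract class ServiceFault extends Error {
  readonly code: string;
  readonly httpStatus: number;
  readonly context?: Record<string, unknown>;
  readonly isRetryable: boolean;

  constructor(message: string, options: FaultOptions) {
    super(message);
    this.code = options.code;
    this.httpStatus = options.httpStatus;
    this.context = options.context;
    this.isRetryable = options.isRetryable;
  }
}

// Hypothetical subclass: authentication failures are client errors, so they are not retried.
class AuthFault extends ServiceFault {
  constructor(reason: string) {
    super(`Authentication failed: ${reason}`, {
      code: 'AUTH_FAILURE',
      httpStatus: 401,
      context: { reason },
      isRetryable: false
    });
  }
}
```
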
### Step 2: Protect Async Boundaries with Pipeline Middleware
Express 4 does not catch rejected promises in route handlers (Express 5 forwards them to the error-handling middleware on its own). On Express 4, a rejection escapes as an unhandled rejection at the process level, which terminates the process by default on Node.js 15 and later. A dedicated pipeline guard intercepts these failures and routes them into Express's error chain, where a global handler normalizes them.
```typescript
import { Request, Response, NextFunction } from 'express';
import { ServiceFault } from './errors'; // path assumed; point this at your fault hierarchy

type AsyncRouteHandler = (req: Request, res: Response, next: NextFunction) => Promise<void>;

// Forwards any rejection from an async handler to Express's error chain
export function wrapAsync(handler: AsyncRouteHandler) {
  return (req: Request, res: Response, next: NextFunction) => {
    Promise.resolve(handler(req, res, next)).catch(next);
  };
}

// Global fault router (requires exactly 4 parameters)
export function faultRouter(err: Error, _req: Request, res: Response, _next: NextFunction) {
  const isProduction = process.env.NODE_ENV === 'production';

  if (err instanceof ServiceFault) {
    res.status(err.httpStatus).json({
      error: {
        code: err.code,
        message: isProduction ? 'Service request failed' : err.message,
        // context is sent in every environment, so keep sensitive details out of it
        context: err.context
      }
    });
    return;
  }

  // Fallback for unclassified programmer errors
  console.error('[UNCLASSIFIED_FAULT]', err.stack);
  res.status(500).json({
    error: {
      code: 'INTERNAL_SYSTEM_ERROR',
      message: isProduction ? 'An unexpected error occurred' : err.message
    }
  });
}
```
**Architecture Rationale**: The `wrapAsync` higher-order function converts promise rejections into `next(err)` calls, ensuring Express's error middleware chain receives them. The global router inspects the error type first. If it's a `ServiceFault`, it returns structured JSON. If it's an unclassified `Error`, it logs the stack trace internally and, in production, returns only a sanitized message to the client. This prevents information leakage while preserving debugging capability.
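
To see the forwarding in isolation, the sketch below reproduces the wrapper with stand-in types instead of Express's and returns the promise so the outcome can be observed. Both of those are simplifications for illustration; `wrapAsyncSketch` is a hypothetical name.

```typescript
// Minimal stand-ins for Express's types (assumptions for a dependency-free sketch).
type Next = (err?: unknown) => void;
type AsyncHandler<Req, Res> = (req: Req, res: Res, next: Next) => Promise<void>;

// Same idea as wrapAsync: a rejection becomes a next(err) call.
function wrapAsyncSketch<Req, Res>(handler: AsyncHandler<Req, Res>) {
  return (req: Req, res: Res, next: Next): Promise<void> =>
    Promise.resolve(handler(req, res, next)).catch(next);
}

// A handler that rejects, as a failing database call would.
const failing = wrapAsyncSketch(async (_req: unknown, _res: unknown) => {
  throw new Error('lookup failed');
});

// Records whatever the wrapper forwards to the error chain.
const forwarded: unknown[] = [];
const done = failing({}, {}, (err) => { forwarded.push(err); });
```

Once `done` settles, `forwarded` holds the single `Error` thrown by the handler, which is what an error middleware would receive.
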
### Step 3: Orchestrate Resilient Retries with Jitter
Network calls and database connections experience transient failures. Blindly retrying without backoff creates thundering herd scenarios that amplify outages. Exponential backoff with randomized jitter distributes retry attempts across time, allowing upstream services to recover.
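
As a rough illustration of how a capped exponential schedule grows, the helpers below compute the delay for each attempt. `backoffDelay` and `withJitter` are hypothetical names, and the random source is injectable only to make the behavior easy to check.

```typescript
// Pre-jitter delay for a 1-based attempt number: base * 2^(attempt - 1), capped at max.
function backoffDelay(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
}

// Adds up to jitterMs of random extra delay on top of the computed delay.
function withJitter(delayMs: number, jitterMs: number, random: () => number = Math.random): number {
  return delayMs + random() * jitterMs;
}

// Schedule for a 500 ms base and a 5000 ms cap over five attempts.
const schedule = [1, 2, 3, 4, 5].map((attempt) => backoffDelay(attempt, 500, 5000));
console.log(schedule); // [500, 1000, 2000, 4000, 5000]
```

The fifth attempt would be 8000 ms uncapped, so the cap is what keeps the tail of the schedule bounded.
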
```typescript
import type { ServiceFault } from './errors'; // path assumed; point this at your fault hierarchy

export interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number; // maximum random delay added per retry, in milliseconds
}

export async function executeWithResilience<T>(
  operation: () => Promise<T>,
  config: RetryConfig
): Promise<T> {
  const { maxAttempts, baseDelayMs, maxDelayMs, jitterFactor } = config;
  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err as Error;
      const fault = err as ServiceFault;

      // Abort retry chain for non-retryable faults
      if (fault?.isRetryable === false) throw fault;
      if (attempt === maxAttempts) throw lastError;

      // Exponential growth, capped, with random jitter added on top of the cap
      const exponentialDelay = baseDelayMs * Math.pow(2, attempt - 1);
      const cappedDelay = Math.min(exponentialDelay, maxDelayMs);
      const jitter = Math.random() * jitterFactor;
      const waitTime = cappedDelay + jitter;

      console.warn(`[RETRY] Attempt ${attempt}/${maxAttempts} failed. Waiting ${Math.round(waitTime)}ms`);
      await new Promise(resolve => setTimeout(resolve, waitTime));
    }
  }

  // Only reachable when maxAttempts < 1, in which case the operation never ran
  throw lastError ?? new Error('maxAttempts must be at least 1');
}
```
**Architecture Rationale**: The retry orchestrator respects the `isRetryable` flag from the fault hierarchy. It caps the exponential delay to prevent excessive wait times (jitter of up to `jitterFactor` milliseconds is added on top of the cap), desynchronizes concurrent retries through that jitter, and logs attempt counts for observability. This pattern replaces ad hoc `setTimeout` loops with a single configurable resilience primitive.
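
The orchestrator only reads one flag from the thrown value, so a small type guard can state that intent without asserting a type the value may not have. A sketch, with `isNonRetryable` as a hypothetical helper name:

```typescript
// True only when the thrown value explicitly opts out of retries.
function isNonRetryable(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    'isRetryable' in err &&
    (err as { isRetryable: unknown }).isRetryable === false
  );
}
```

Inside a `catch` block this reads as `if (isNonRetryable(err)) throw err;`, and plain `Error` objects without the flag remain eligible for retry.
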
### Step 4: Coordinate Graceful Process Termination
Cloud platforms and container orchestrators send SIGTERM signals before terminating instances. Failing to handle this signal drops active connections, corrupts in-flight transactions, and triggers false-positive alerts. A shutdown coordinator drains traffic, closes resources, and exits cleanly.
```typescript
import { Server } from 'http';
import { Socket } from 'net';

export class LifecycleManager {
  private activeConnections: Set<Socket> = new Set();
  private isShuttingDown = false;

  constructor(private server: Server) {
    this.server.on('connection', (socket: Socket) => {
      this.activeConnections.add(socket);
      socket.on('close', () => this.activeConnections.delete(socket));
    });
  }

  public initialize() {
    process.on('SIGTERM', () => this.terminate('SIGTERM'));
    process.on('SIGINT', () => this.terminate('SIGINT'));
    process.on('uncaughtException', (err) => {
      console.error('[FATAL] Uncaught exception:', err);
      this.terminate('UNCAUGHT_EXCEPTION', 1);
    });
    process.on('unhandledRejection', (reason) => {
      console.error('[FATAL] Unhandled rejection:', reason);
      // Registering this listener replaces Node 15+'s default crash on unhandled
      // rejections, so shutdown has to be triggered explicitly
      this.terminate('UNHANDLED_REJECTION', 1);
    });
  }

  private terminate(signal: string, exitCode = 0) {
    if (this.isShuttingDown) return;
    this.isShuttingDown = true;
    console.log(`[${signal}] Initiating graceful shutdown...`);

    // Stop accepting new connections; the callback runs once every socket has closed
    this.server.close(() => {
      console.log('[SHUTDOWN] Process exiting cleanly');
      process.exit(exitCode);
    });

    // After the drain window, end sockets that are still open (e.g. idle keep-alive)
    setTimeout(() => {
      console.log('[SHUTDOWN] Drain window elapsed. Ending remaining connections...');
      this.activeConnections.forEach(conn => conn.end());
    }, 5000);

    // Safety valve: force exit if graceful drain stalls
    setTimeout(() => {
      console.error('[SHUTDOWN] Forced termination after timeout');
      process.exit(1);
    }, 15000);
  }
}
```
**Architecture Rationale**: The manager tracks active sockets so it can give in-flight requests time to complete and then close whatever is still open. It registers handlers for both termination signals and fatal runtime errors. The dual-timer approach (5s for clean drain, 15s safety valve) prevents zombie processes in containerized environments. Note that attaching an `unhandledRejection` listener replaces Node.js's default behavior (since Node 15, raising the rejection as an uncaught exception), so the listener itself decides whether the process logs and continues or shuts down.
## Pitfall Guide
### 1. Swallowing Errors in Promise Chains
**Explanation**: Using `.catch(() => {})` or empty `catch` blocks silently discards failures, making debugging impossible and masking data corruption.

**Fix**: Always log or re-throw. If suppression is intentional, document the business reason and emit a structured warning event.
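
When suppression really is intended, one way to keep it visible is a small helper that records why the failure was ignored. This is a sketch; `suppressWithReason` and the shape of the warning event are hypothetical.

```typescript
interface SuppressedEvent {
  event: 'error_suppressed';
  reason: string;  // the documented business reason
  message: string; // what actually failed
}

// Resolves to the fallback instead of rejecting, but always emits a structured warning.
async function suppressWithReason<T>(
  work: Promise<T>,
  fallback: T,
  reason: string,
  emit: (e: SuppressedEvent) => void = (e) => console.warn(JSON.stringify(e))
): Promise<T> {
  try {
    return await work;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    emit({ event: 'error_suppressed', reason, message });
    return fallback;
  }
}
```
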
### 2. Omitting the Fourth Parameter in Express Error Middleware
**Explanation**: Express identifies error handlers by their function signature. A middleware that declares three parameters is treated as regular middleware, so errors bypass it entirely.

**Fix**: Always define error routers with exactly four parameters: `(err, req, res, next)`. Typing the function as `ErrorRequestHandler` types the parameters for you, but TypeScript still accepts a function that declares fewer of them, so keep all four in the signature and prefix unused ones with an underscore.
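
Express makes this decision from the function's `length` property, which reports the number of declared parameters. The plain-TypeScript sketch below shows what that property returns; the handler names and empty bodies are placeholders.

```typescript
// Three declared parameters: Express would treat this as ordinary middleware.
function looksLikeErrorHandler(err: unknown, req: unknown, res: unknown): void {}

// Four declared parameters: recognized as an error handler, even if some go unused.
function realErrorHandler(err: unknown, _req: unknown, res: unknown, _next: unknown): void {}

console.log(looksLikeErrorHandler.length, realErrorHandler.length); // 3 4
```
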
### 3. Retrying Client-Side (4xx) Failures
**Explanation**: Retrying validation errors, authentication failures, or malformed requests wastes bandwidth, increases latency, and can trigger rate limits.

**Fix**: Implement a retryability flag in your error model. Only retry on 5xx or explicit transient network codes. Validate inputs before making upstream calls.
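
One way to centralize that decision is a single predicate over the upstream response or socket error. The status handling follows the 4xx/5xx rule above; which Node.js socket error codes count as transient is an assumption here and should be adjusted per dependency. `shouldRetry` and `UpstreamFailure` are hypothetical names.

```typescript
// Socket-level error codes assumed to be transient (illustrative, not exhaustive).
const TRANSIENT_NETWORK_CODES: readonly string[] = ['ECONNRESET', 'ETIMEDOUT', 'ECONNREFUSED', 'EAI_AGAIN'];

interface UpstreamFailure {
  status?: number; // HTTP status, when a response was received
  code?: string;   // Node.js error code, when the request never completed
}

function shouldRetry(failure: UpstreamFailure): boolean {
  if (failure.status !== undefined) {
    // 5xx: the upstream failed. 4xx: the request itself is at fault, so an identical retry is unlikely to help.
    return failure.status >= 500 && failure.status <= 599;
  }
  return failure.code !== undefined && TRANSIENT_NETWORK_CODES.indexOf(failure.code) !== -1;
}
```
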
### 4. Blocking the Event Loop During Shutdown
**Explanation**: Running synchronous heavy computation or waiting on unresponsive external services during SIGTERM prevents the process from exiting, causing orchestrators to force-kill it and lose telemetry.

**Fix**: Keep shutdown handlers lightweight. Close database pools, cancel pending timers, and terminate HTTP listeners. Offload heavy cleanup to background workers if necessary.
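
A common way to keep an unresponsive dependency from stalling shutdown is to bound each cleanup step with a timeout. The sketch below is one possible shape; `withTimeout` is a hypothetical helper.

```typescript
// Resolves with the step's result, or rejects if the step takes longer than `ms`.
function withTimeout<T>(step: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
    step.then(
      (value) => { clearTimeout(timer); resolve(value); }, // finished in time
      (err) => { clearTimeout(timer); reject(err); }       // failed in time
    );
  });
}
```

A shutdown handler could then await something like `withTimeout(closeDatabasePool(), 3000, 'db pool')`, where `closeDatabasePool` stands for whatever cleanup call your driver provides, and move on if the step rejects.
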
### 5. Mixing Operational and Programmer Errors
**Explanation**: Treating a `TypeError` the same as a missing file leads to incorrect retry behavior or inappropriate HTTP status codes.

**Fix**: Enforce strict error classification at the source. Use custom classes for operational faults. Let programmer errors bubble up uncaught to trigger process restarts and alerting.
### 6. Leaking Stack Traces to External Clients
**Explanation**: Returning raw error stacks exposes internal architecture, dependency versions, and file paths, creating security vulnerabilities.

**Fix**: Implement environment-aware serialization. Return sanitized messages and error codes in production. Preserve full stacks only in internal logging pipelines.
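
A serializer driven by the flags in the `API_ERROR_FORMAT` configuration template could look like the following sketch. The flag names come from that template; `serializeError` and the `FaultLike` shape are hypothetical.

```typescript
interface ErrorFormat {
  maskMessage: boolean;
  includeStack: boolean;
  includeContext: boolean;
}

interface FaultLike {
  code: string;
  message: string;
  stack?: string;
  context?: Record<string, unknown>;
}

// Builds the response body, leaving out anything the format does not allow.
function serializeError(fault: FaultLike, format: ErrorFormat): Record<string, unknown> {
  const body: Record<string, unknown> = {
    code: fault.code,
    message: format.maskMessage ? 'Service request failed' : fault.message
  };
  if (format.includeContext && fault.context !== undefined) body['context'] = fault.context;
  if (format.includeStack && fault.stack !== undefined) body['stack'] = fault.stack;
  return { error: body };
}
```
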
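### 7. Ignoring Connection Tracking in HTTP Servers
**Explanation**: `server.close()` only stops the listener from accepting new connections. Sockets that are already open, including idle keep-alive connections, can stay open, so the close callback may not fire before the orchestrator force-kills the process and cuts off whatever requests are still running.

**Fix**: Maintain a `Set` of active connections. After closing the listener and allowing a drain window, call `.end()` on each remaining socket. Implement a hard timeout to prevent indefinite hangs.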
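
An open socket shows that a connection exists, not whether a request is still being processed on it. As a complementary sketch, the class below counts in-flight work and resolves once the count reaches zero or a deadline passes. `InFlightTracker` is a hypothetical helper and not part of `LifecycleManager`.

```typescript
class InFlightTracker {
  private count = 0;
  private waiters: Array<() => void> = [];

  // Call when a request starts; invoke the returned function once when it finishes.
  begin(): () => void {
    this.count++;
    let finished = false;
    return () => {
      if (finished) return; // ignore duplicate completion calls
      finished = true;
      this.count--;
      if (this.count === 0) {
        this.waiters.forEach((wake) => wake());
        this.waiters = [];
      }
    };
  }

  // Resolves true once nothing is in flight, or false if the deadline passes first.
  drain(timeoutMs: number): Promise<boolean> {
    if (this.count === 0) return Promise.resolve(true);
    return new Promise<boolean>((resolve) => {
      const timer = setTimeout(() => resolve(false), timeoutMs);
      this.waiters.push(() => {
        clearTimeout(timer);
        resolve(true);
      });
    });
  }
}
```

A request middleware would call `begin()` on arrival and the returned function when the response finishes; the shutdown path can then await `drain(...)` before ending sockets.
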
## Production Bundle
### Action Checklist
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Third-party API with intermittent timeouts | Retry with exponential backoff + jitter | Transient failures recover automatically; jitter prevents thundering herd | Low compute, reduced failure rate |
| User submits malformed JSON payload | Validation fault with 422 response | Client error; retrying wastes resources and delays feedback | Zero infrastructure cost |
| Database connection pool exhausted | Circuit breaker + fallback cache | Prevents cascading failures; maintains read availability during write outages | Moderate caching overhead |
| Internal service throws TypeError | Uncaught exception handler + process restart | Programmer bug indicates unstable state; continuing risks data corruption | Restart cost, but prevents corruption |
| Production API returning 500s | Structured fault router + sanitized messages | Hides internal details while providing machine-readable codes for monitoring | Zero, improves security posture |
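
The circuit breaker in the matrix is not implemented elsewhere in this guide. The sketch below shows the general idea under simplifying assumptions: consecutive failures open the circuit, a cool-down lets one trial call through, and a fallback (such as a cached read) answers in the meantime. The class, its parameters, and the injectable clock are all hypothetical, and concurrent calls during the trial phase are not coordinated.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';

  constructor(
    private readonly failureThreshold: number,
    private readonly resetAfterMs: number,
    private readonly now: () => number = Date.now
  ) {}

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(operation: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === 'open') {
      // While cooling down, skip the dependency entirely.
      if (this.now() - this.openedAt < this.resetAfterMs) return fallback();
      this.state = 'half-open'; // cool-down elapsed: allow a trial call
    }
    try {
      const result = await operation();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (_err) {
      this.failures++;
      // A failed trial call or too many consecutive failures opens the circuit.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}
```
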
### Configuration Template
```typescript
// fault-config.ts
import { RetryConfig } from './retry-orchestrator';

export const RETRY_POLICIES: Record<string, RetryConfig> = {
  default: {
    maxAttempts: 3,
    baseDelayMs: 500,
    maxDelayMs: 5000,
    jitterFactor: 200
  },
  paymentGateway: {
    maxAttempts: 5,
    baseDelayMs: 1000,
    maxDelayMs: 10000,
    jitterFactor: 500
  },
  analytics: {
    maxAttempts: 2,
    baseDelayMs: 200,
    maxDelayMs: 2000,
    jitterFactor: 100
  }
};

export const SHUTDOWN_CONFIG = {
  gracefulDrainTimeoutMs: 5000,
  forcedExitTimeoutMs: 15000,
  logLevel: process.env.LOG_LEVEL || 'info'
};

export const API_ERROR_FORMAT = {
  production: {
    maskMessage: true,
    includeStack: false,
    includeContext: false
  },
  development: {
    maskMessage: false,
    includeStack: true,
    includeContext: true
  }
};
```
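
Because the policies are keyed by plain strings, a lookup helper with a fallback avoids passing `undefined` to the retry orchestrator. A sketch under that assumption; `policyFor` is a hypothetical name, and the two policies are copied from the template so the snippet stands alone.

```typescript
interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number;
}

const RETRY_POLICIES: Record<string, RetryConfig> = {
  default: { maxAttempts: 3, baseDelayMs: 500, maxDelayMs: 5000, jitterFactor: 200 },
  paymentGateway: { maxAttempts: 5, baseDelayMs: 1000, maxDelayMs: 10000, jitterFactor: 500 }
};

// Unknown dependency names fall back to the default policy. The own-property
// check keeps inherited keys such as "constructor" from matching.
function policyFor(dependency: string): RetryConfig {
  return Object.prototype.hasOwnProperty.call(RETRY_POLICIES, dependency)
    ? RETRY_POLICIES[dependency]
    : RETRY_POLICIES['default'];
}
```
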
### Quick Start Guide
- **Initialize the fault hierarchy**: Copy the `ServiceFault` abstract class and create domain-specific subclasses (`ValidationFault`, `TransientInfraFault`, `AuthFault`). Export them from a central `errors/` directory.
- **Wrap your Express routes**: Replace direct async handlers with `wrapAsync(handler)`. Register the `faultRouter` as the last middleware in your Express stack.
- **Inject retry logic**: Import `executeWithResilience` and wrap external HTTP calls or database queries. Pass environment-specific `RetryConfig` from your configuration module.
- **Attach lifecycle management**: Instantiate `LifecycleManager` with your HTTP server instance and call `.initialize()` before starting the application.
- **Verify in staging**: Trigger a simulated timeout, a validation error, and a `SIGTERM` signal. Confirm that retries respect backoff, errors return structured JSON, and the process drains connections before exiting.