ations, and provides a clear observability surface for connection pool metrics.
Core Solution
Resolving TCP RST race conditions requires explicit control over connection lifecycles at both the client and server boundaries. The implementation proceeds in four phases: enforcing the timeout hierarchy, configuring the client connection pool, adding idempotency-aware retries, and aligning server-side timeouts.
Phase 1: Enforce the Timeout Hierarchy
The fundamental rule is strict: client_idle_timeout < proxy_idle_timeout < origin_server_idle_timeout. This ordering guarantees that the connection owner retires the socket before intermediate layers invalidate it.
For a Node.js client operating behind an AWS ALB (60s default), the client must retire idle sockets at 50-55 seconds. For a Node.js server receiving traffic through the same ALB, the server must keep connections alive for at least 61 seconds to outlast the proxy's window.
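The ordering rule can be checked mechanically before deployment. The following is a minimal sketch, assuming all three idle timeouts are available as millisecond values; the `TimeoutChain` shape and the `checkTimeoutHierarchy` name are hypothetical and exist only for illustration.

```typescript
// Hypothetical shape: every value is an idle timeout in milliseconds.
interface TimeoutChain {
  clientIdleMs: number;
  proxyIdleMs: number;
  originIdleMs: number;
}

// Returns human-readable violations of client < proxy < origin.
// An empty array means the hierarchy holds.
function checkTimeoutHierarchy(chain: TimeoutChain): string[] {
  const violations: string[] = [];
  if (chain.clientIdleMs >= chain.proxyIdleMs) {
    violations.push(
      `client idle timeout (${chain.clientIdleMs}ms) must be below proxy (${chain.proxyIdleMs}ms)`,
    );
  }
  if (chain.proxyIdleMs >= chain.originIdleMs) {
    violations.push(
      `proxy idle timeout (${chain.proxyIdleMs}ms) must be below origin (${chain.originIdleMs}ms)`,
    );
  }
  return violations;
}

// The ALB example from the text: 50s client, 60s proxy, 61s origin.
const albAligned = checkTimeoutHierarchy({
  clientIdleMs: 50_000,
  proxyIdleMs: 60_000,
  originIdleMs: 61_000,
});

// Same client and proxy, but the origin left at Node.js's 5s default.
const originAtDefault = checkTimeoutHierarchy({
  clientIdleMs: 50_000,
  proxyIdleMs: 60_000,
  originIdleMs: 5_000,
});
```

Running a check like this in CI or at process startup turns a silent misconfiguration into an explicit failure.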
Phase 2: Configure the Connection Pool
Legacy http.Agent implementations lack granular lifecycle tracking. Node.js 18+ routes globalThis.fetch through undici, which provides explicit idle retirement and parses the server's Keep-Alive response header. Using undici directly allows precise control over socket expiration without relying on implicit agent behavior.
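```typescript
import { Pool, setGlobalDispatcher } from 'undici';

// Client-side pool configuration
const outboundDispatcher = new Pool('https://api.upstream-service.io', {
  connections: 25,
  pipelining: 1, // One in-flight request per socket; 0 would disable keep-alive in undici
  keepAliveTimeout: 50_000, // Retire idle sockets before the ALB's 60s window
  keepAliveMaxTimeout: 55_000, // Cap any longer timeout advertised in a Keep-Alive response header
});

setGlobalDispatcher(outboundDispatcher);
```
The keepAliveTimeout dictates how long an idle socket remains in the pool when the server sends no Keep-Alive hint. Setting it below the proxy's threshold ensures the client initiates closure. The keepAliveMaxTimeout parameter caps the value undici accepts from a server's Keep-Alive header, so a generous upstream hint cannot push idle sockets past the proxy window. Because a Pool is bound to a single origin, installing it globally suits a service that calls one upstream; undici's Agent accepts the same options when several origins are involved.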
Phase 3: Implement Idempotency-Aware Retries
Even with aligned timeouts, network partitions and rolling deployments will occasionally trigger resets. A retry strategy must distinguish between safe and unsafe operations.
```typescript
import { fetch, type RequestInit } from 'undici';

const RETRYABLE_METHODS = new Set(['GET', 'HEAD', 'PUT', 'DELETE']);
const MAX_RETRIES = 1;

async function resilientRequest(url: string, options: RequestInit = {}) {
  const method = (options.method ?? 'GET').toUpperCase();
  for (let attempt = 0; ; attempt++) {
    try {
      const response = await fetch(url, options);
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return response;
    } catch (error: any) {
      // undici's fetch wraps socket failures in a TypeError; the code sits on error.cause
      const code = error?.cause?.code ?? error?.code;
      const isReset =
        code === 'ECONNRESET' ||
        code === 'UND_ERR_SOCKET' ||
        String(error?.message).includes('socket hang up');
      const isRetryable = RETRYABLE_METHODS.has(method);
      if (isReset && isRetryable && attempt < MAX_RETRIES) {
        // The failed socket is already destroyed, so the retry gets a fresh connection
        continue;
      }
      throw error;
    }
  }
}
```
This wrapper explicitly checks for reset conditions, validates idempotency, and retries at most once. undici discards a socket as soon as it errors, so the retry is dispatched on a fresh connection. The shared pool must stay open, because closing it would reject every subsequent request. POST requests are excluded by default to prevent duplicate resource creation. When POST retries are required, the upstream service must support an idempotency key header.
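For POST retries, the key has to be generated once per logical operation and reused on every attempt. The sketch below illustrates that, assuming the upstream API reads a header named `Idempotency-Key`; both the header name and the `withIdempotencyKey` helper are assumptions and should be checked against the actual upstream contract.

```typescript
import { randomUUID } from 'node:crypto';

// Assumed header name; confirm what the upstream API actually expects.
const IDEMPOTENCY_HEADER = 'Idempotency-Key';

// Builds the header set once per logical operation. The returned object
// must be passed to every attempt so the server sees the same key on a retry.
function withIdempotencyKey(
  headers: Record<string, string> = {},
  key: string = randomUUID(),
): Record<string, string> {
  return { ...headers, [IDEMPOTENCY_HEADER]: key };
}

// One key per operation, shared by the first attempt and its retry.
const postHeaders = withIdempotencyKey({ 'content-type': 'application/json' });
const firstAttempt = { method: 'POST', headers: postHeaders };
const retryAttempt = { method: 'POST', headers: postHeaders };
```

Generating the key inside the retry loop would defeat the purpose, since each attempt would look like a new operation to the server.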
Phase 4: Server-Side Timeout Alignment
Origin servers must outlast proxy idle windows. Node.js defaults server.keepAliveTimeout to 5 seconds, so the value must be raised explicitly.
```typescript
import { createServer } from 'http';
import { applicationRouter } from './routes';

const originServer = createServer(applicationRouter);

// Must exceed ALB/nginx idle timeout
originServer.keepAliveTimeout = 61_000;
// Headers timeout must strictly exceed keepAliveTimeout
originServer.headersTimeout = 62_000;

originServer.listen(3000, () => {
  console.log('Origin server listening on port 3000');
});
```
The headersTimeout parameter controls how long the server waits for complete HTTP headers after a connection is established. It must always exceed keepAliveTimeout to prevent premature header parsing failures during slow client transmissions.
Pitfall Guide
1. Confusing TCP Keepalive with HTTP Keep-Alive
Explanation: TCP keepalive (socket.setKeepAlive) sends OS-level probes to detect dead connections. HTTP keep-alive manages connection reuse at the application layer. Enabling TCP keepalive does not prevent proxy-initiated socket closures.
Fix: Rely on explicit HTTP timeout alignment and connection pool configuration. Disable TCP keepalive unless diagnosing deep network partition issues.
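The two mechanisms are configured through different Node.js APIs, which is part of why they get confused. A brief sketch for contrast, using only `node:net` and `node:http`; the variable names and numeric values are illustrative.

```typescript
import { Socket } from 'node:net';
import { Agent } from 'node:http';

// TCP keepalive: the OS sends probes on this one socket after 30s of
// silence to detect a dead peer. It does not reset a proxy's idle timer
// for HTTP purposes and does not control connection reuse.
const probedSocket = new Socket();
probedSocket.setKeepAlive(true, 30_000);

// HTTP keep-alive: the agent keeps sockets open between requests so they
// can be reused. Note that keepAliveMsecs is the initial delay for TCP
// keepalive probes on pooled sockets, not an idle timeout.
const reuseAgent = new Agent({
  keepAlive: true,
  keepAliveMsecs: 1_000,
  maxSockets: 10,
});
```

Neither setting tells the client when to retire an idle socket, which is why the timeout alignment described earlier is still needed.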
2. Retrying Non-Idempotent Operations Blindly
Explanation: Automatically retrying POST or PATCH requests after a reset can create duplicate resources or corrupt state, especially if the request reached the server before the connection dropped.
Fix: Restrict automatic retries to GET, HEAD, PUT, and DELETE. Implement idempotency keys for POST/PATCH operations and verify server-side deduplication before retrying.
3. Setting Client Timeout Higher Than Proxy Timeout
Explanation: If the client holds sockets longer than the load balancer, the proxy wins the race and sends RST packets. This is the primary cause of intermittent production resets.
Fix: Enforce client_idle_timeout < proxy_idle_timeout. Use configuration management to sync these values across infrastructure and application layers.
4. Leaving headersTimeout at or Below keepAliveTimeout
Explanation: Node.js requires headersTimeout to exceed keepAliveTimeout. Failing to set this causes the server to reject valid connections during slow header transmissions or TLS renegotiations.
Fix: Always configure headersTimeout = keepAliveTimeout + 1000. Document this relationship in deployment runbooks.
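A startup guard can enforce this relationship so a bad value fails fast instead of surfacing as intermittent errors. This is a sketch only; `assertHeadersOutlastKeepAlive` and the `KeepAliveSettings` shape are hypothetical names, chosen so that a Node.js `http.Server` instance can be passed directly.

```typescript
// Structural type: an http.Server exposes both properties, so it can be
// passed here without an adapter.
interface KeepAliveSettings {
  keepAliveTimeout: number;
  headersTimeout: number;
}

// Throws if headersTimeout does not strictly exceed keepAliveTimeout.
function assertHeadersOutlastKeepAlive(settings: KeepAliveSettings): void {
  if (settings.headersTimeout <= settings.keepAliveTimeout) {
    throw new RangeError(
      `headersTimeout (${settings.headersTimeout}ms) must exceed ` +
        `keepAliveTimeout (${settings.keepAliveTimeout}ms)`,
    );
  }
}

// Passes: 62s headers timeout over a 61s keep-alive timeout.
assertHeadersOutlastKeepAlive({ keepAliveTimeout: 61_000, headersTimeout: 62_000 });

// Captures the failure for a keep-alive raised without touching headersTimeout.
let misconfigurationMessage = '';
try {
  assertHeadersOutlastKeepAlive({ keepAliveTimeout: 61_000, headersTimeout: 60_000 });
} catch (error) {
  misconfigurationMessage = (error as Error).message;
}
```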
5. Assuming All Resets Are Network Flakes
Explanation: Bursts of ECONNRESET during deployments or memory pressure events are expected. Treating them as random failures leads to unnecessary infrastructure scaling or retry storm configurations.
Fix: Correlate reset spikes with deployment timestamps and OOM killer logs. Implement connection draining during rolling updates to gracefully terminate in-flight requests.
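The correlation step can start as a simple bucketing of reset timestamps against known deployment windows. The sketch below assumes both are available as epoch milliseconds; the `DeployWindow` shape, the function name, and the 30-second grace period are illustrative assumptions.

```typescript
// Hypothetical shape for one rolling deployment, in epoch milliseconds.
interface DeployWindow {
  startMs: number;
  endMs: number;
}

// Splits reset timestamps into those that fall inside a deployment window
// (plus a grace period for connections still draining) and those that do not.
function splitResetsByDeploy(
  resetTimestampsMs: number[],
  deploys: DeployWindow[],
  graceMs: number = 30_000,
): { duringDeploy: number[]; unexplained: number[] } {
  const duringDeploy: number[] = [];
  const unexplained: number[] = [];
  for (const ts of resetTimestampsMs) {
    const inWindow = deploys.some(
      (d) => ts >= d.startMs && ts <= d.endMs + graceMs,
    );
    (inWindow ? duringDeploy : unexplained).push(ts);
  }
  return { duringDeploy, unexplained };
}

// Small synthetic example: one deploy from t=1000 to t=2000, 500ms grace.
const buckets = splitResetsByDeploy(
  [900, 1_500, 2_400, 2_600],
  [{ startMs: 1_000, endMs: 2_000 }],
  500,
);
```

Only the `unexplained` bucket is a candidate for timeout misalignment or genuine network faults; a large `duringDeploy` bucket points at connection draining instead.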
6. Over-Retiring Connections
Explanation: Setting idle timeouts too aggressively (e.g., <10 seconds) forces constant TCP/TLS handshakes, increasing latency and CPU overhead.
Fix: Align timeouts with actual proxy windows. Add a 5-10 second safety margin below the proxy limit rather than using arbitrary low values.
7. Missing Observability on Pool Metrics
Explanation: Without tracking connection creation, retirement, and reset rates, timeout alignment becomes guesswork. Teams cannot validate whether configuration changes actually reduce errors.
Fix: Instrument connection pools with metrics for active sockets, idle retirements, and RST occurrences. Export to Prometheus or Datadog for trend analysis.
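As a starting point, the three signals can be tracked with plain counters and rendered in the Prometheus text format. This is a minimal sketch rather than a full client; the `PoolMetrics` class, the event names, and the `http_pool` prefix are assumptions, and the calls to `record` would need to be wired into wherever the application observes these events (for example, the catch block of a retry wrapper).

```typescript
type PoolEvent = 'connection_created' | 'idle_retired' | 'reset';

// In-memory counters for connection pool lifecycle events.
class PoolMetrics {
  private counts: Record<PoolEvent, number> = {
    connection_created: 0,
    idle_retired: 0,
    reset: 0,
  };

  // Increments the counter for one observed event.
  record(event: PoolEvent): void {
    this.counts[event] += 1;
  }

  // Returns a copy so callers cannot mutate internal state.
  snapshot(): Record<PoolEvent, number> {
    return { ...this.counts };
  }

  // Renders counters as Prometheus text exposition lines.
  toPrometheus(prefix: string = 'http_pool'): string {
    const events = Object.keys(this.counts) as PoolEvent[];
    return (
      events
        .map((event) => {
          const name = `${prefix}_${event}_total`;
          return `# TYPE ${name} counter\n${name} ${this.counts[event]}`;
        })
        .join('\n') + '\n'
    );
  }
}

const poolMetrics = new PoolMetrics();
poolMetrics.record('connection_created');
poolMetrics.record('reset');
poolMetrics.record('reset');
const exposition = poolMetrics.toPrometheus();
```

Comparing the reset counter before and after a timeout change gives a direct measure of whether the alignment worked.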
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Behind AWS ALB | Client: 50s, Server: 61s | Matches ALB 60s default with safety margin | Low (config only) |
| Direct to Origin | Client: 30s, Server: 35s | No proxy layer; optimize for connection reuse | Low |
| High-Throughput API | Undici pool + idempotency keys | Prevents retry storms and duplicate writes | Medium (dev time) |
| Batch Worker | Short idle timeout + explicit close | Workers don't reuse connections; minimize resource hold | Low |
| Multi-Region Mesh | Region-aware routing + aligned timeouts | Cross-region latency requires longer windows | High (infra) |
Configuration Template
```typescript
// infrastructure/network-config.ts
import { Pool, setGlobalDispatcher } from 'undici';
import { createServer } from 'http';

export const PROXY_IDLE_WINDOW = 60_000; // AWS ALB default
export const CLIENT_RETIREMENT = PROXY_IDLE_WINDOW - 10_000;
export const SERVER_KEEPALIVE = PROXY_IDLE_WINDOW + 1_000;

// Client dispatcher
export function createOutboundPool(baseUrl: string) {
  const pool = new Pool(baseUrl, {
    connections: 30,
    pipelining: 1, // 0 would disable keep-alive in undici
    keepAliveTimeout: CLIENT_RETIREMENT,
    // Cap server-advertised Keep-Alive values below the proxy window
    keepAliveMaxTimeout: PROXY_IDLE_WINDOW - 5_000,
  });
  setGlobalDispatcher(pool);
  return pool;
}

// Server configuration
export function configureOriginServer(server: ReturnType<typeof createServer>) {
  server.keepAliveTimeout = SERVER_KEEPALIVE;
  server.headersTimeout = SERVER_KEEPALIVE + 1_000;
  server.timeout = 0; // Keep the socket inactivity timeout disabled
  return server;
}
```
Quick Start Guide
- Identify your proxy window: Check your load balancer or reverse proxy documentation for the default idle timeout (ALB: 60s, nginx: 75s, Cloudflare: 100s).
- Configure client retirement: Set your HTTP client's keepAliveTimeout to 5-10 seconds below the proxy window. Use undici for explicit control.
- Configure server longevity: Set your Node.js server's keepAliveTimeout to 1-2 seconds above the proxy window. Ensure headersTimeout exceeds it.
- Validate in staging: Temporarily reduce the proxy timeout to 2 seconds. Send requests, wait 3 seconds, and send again. Verify deterministic resets disappear after alignment.
- Deploy with metrics: Roll out configuration changes alongside connection pool observability. Monitor ECONNRESET frequency for 24 hours to confirm elimination.