Back to KB
Difficulty
Intermediate
Read Time
8 min

Zero-Downtime Deployments for a React + Node App

By Codcompass TeamΒ·Β·8 min read

Atomic Deployment Patterns for Full-Stack JavaScript Applications

Current Situation Analysis

Traditional deployment workflows treat service updates as instantaneous switches. Developers push code, trigger a build, restart the runtime, and assume the transition is seamless. In reality, process termination and initialization introduce a measurable interruption window. During this window, load balancers route traffic to unresponsive instances, Nginx returns 502/503 errors, and React clients render blank screens or stale assets.

This problem is frequently misunderstood because local development environments mask the issue. Single-process servers on localhost restart in milliseconds, and developers rarely test concurrent user sessions during updates. In production, however, in-flight HTTP requests, open WebSocket connections, and in-memory session stores create state that cannot survive a hard restart. Teams often attribute deployment failures to "network issues" or "browser caching" when the root cause is an uncoordinated process lifecycle.

Industry reliability benchmarks indicate that even a 15-second deployment window can trigger a 12–18% spike in client-side error rates and increase bounce rates for authenticated users. Modern SLAs demand 99.9%+ availability, which requires treating deployments as continuous transitions rather than atomic switches. The solution lies in decoupling artifact preparation from traffic routing, externalizing transient state, and orchestrating process lifecycle events with explicit drain periods.

WOW Moment: Key Findings

The difference between a traditional restart and an atomic deployment strategy is not theoretical; it is measurable across three critical dimensions: downtime duration, error propagation, and operational overhead.

ApproachDowntime WindowError Rate During DeploySession ContinuityOperational Complexity
Traditional Process Restart5–30 seconds15–25% (502/503 spikes)Broken (in-memory loss)Low
Atomic Symlink + Clustered Reload0 seconds<0.1% (graceful drain)Preserved (external store)Medium

This finding matters because it shifts deployment from a risk-mitigation exercise to a routine operational task. By isolating build artifacts, clustering runtime processes, and externalizing session state, teams can deploy at any hour without user impact. The architectural trade-off is slightly higher disk I/O during versioning and a modest increase in configuration complexity, both of which are negligible compared to the reliability gains.

Core Solution

Achieving zero-downtime deployments requires coordinating three subsystems: the frontend asset pipeline, the backend process manager, and the routing layer. The following implementation uses versioned directories, PM2 clustering, Nginx atomic routing, and explicit signal handling.

1. Artifact Isolation and Versioning

Overwriting live directories creates race conditions. If Nginx reads a file while the build process is replacing it, clients receive corrupted responses. Instead, build artifacts into timestamped directories.

// deploy-scripts/prepare-artifacts.ts
import { execSync } from 'child_process';
import { mkdirSync, copyFileSync, existsSync } from 'fs';
import { join } from 'path';

const TIMESTAMP = Math.floor(Date.now() / 1000).toString();
const BUILD_ROOT = '/opt/apps/platform';
const FRONTEND_DEST = join(BUILD_ROOT, 'frontend', TIMESTAMP);
const BACKEND_DEST = join(BUILD_ROOT, 'backend', TIMESTAMP);

mkdirSync(FRONTEND_DEST, { recursive: true });
mkdirSync(BACKEND_DEST, { recursive: true });

// Build frontend to isolated directory
execSync(`npx vite build --outDir ${FRONTEND_DEST}`, { stdio: 'inherit' });

// Copy backend source to versioned path
execSync(`rsync -a --delete src/ ${BACKEND_DEST}/`, { stdio: 'inherit' });

console.log(`Artifacts prepared: ${TIMESTAMP}`);

Rationale: Timestamped directories guarantee that no two deployments share the same path. This eliminates file-lock contention and allows instant rollback by reverting the symlink.

2. Clustered Process Management with Rolling Reloads

PM2's cluster mode spawns multiple worker processes bound to the same port. When pm2 reload is invoked, PM2 starts a new worker, waits for it to become healthy, routes new connections to it, and gracefully terminates the old worker. This ensures at least one process handles traffic at all times.

# Start clustered API
pm2 start src/server.js -i 4 --name "platform-api" --max-memory-restart 512M

# Trigger rolling reload
pm2 reload platform-api --update-env

Rationale: The -i 4 flag matches typical CPU core counts, maximizing throughput. The --update-env flag ensures environment variables injected during deployment are propagated to workers without a full restart.

Nginx should never read directly from a build directory. Instead, maintain a current symlink that points to the active version. Updating the symlink is an atomic filesystem operation that completes in microseconds.

# Initial setup
ln -sfn /opt/apps/platform/frontend/1715000000 /opt/apps/platform/frontend/current

# After new build completes
ln -sfn /opt/apps/platform/frontend/1715000060 /opt/apps/platform/frontend/current

Nginx configuration:

location / {
    root /opt/apps/platform/frontend/current;
    try_files $uri $uri/ /index.html;
}

Rationale: ln -sfn atomically replaces the symlink target. Nginx reads the new path on the next request cycle without reloading or restarting. Combined with chunk hashing in Vite, this prevents stale asset delivery.

4. Graceful Termination Handling

PM2 sends SIGINT during reloads. The Express server must intercept this signal, stop accepting new connections, allow in-flight requests to complete, and then exit.

import express from 'express';
import http from 'http';

const app = express();
const server = http.createServer(app);
const PORT = process.env.API_P

ORT || 3000;

server.listen(PORT, () => { console.log(API listening on port ${PORT}); });

const gracefulShutdown = (signal: string) => { console.log(Received ${signal}. Initiating graceful shutdown...); server.close((err) => { if (err) { console.error('Forced shutdown due to timeout'); process.exit(1); } console.log('All connections drained. Exiting.'); process.exit(0); });

// Safety valve: force exit after 10 seconds setTimeout(() => { console.error('Shutdown timeout exceeded. Forcing exit.'); process.exit(1); }, 10000); };

process.on('SIGINT', () => gracefulShutdown('SIGINT')); process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));


**Rationale:** Handling both `SIGINT` and `SIGTERM` ensures compatibility with PM2, systemd, and container orchestrators. The 10-second timeout prevents zombie processes from blocking deployments indefinitely.

### 5. Externalized Session Management

In-memory session stores break during process reloads because worker memory is not shared. Externalizing sessions to Redis ensures continuity across rolling updates.

```typescript
import session from 'express-session';
import RedisStore from 'connect-redis';
import { createClient } from 'redis';

const redisClient = createClient({ url: process.env.REDIS_URL });
redisClient.connect();

app.use(
  session({
    store: new RedisStore({ client: redisClient }),
    secret: process.env.SESSION_SECRET!,
    resave: false,
    saveUninitialized: false,
    cookie: { secure: process.env.NODE_ENV === 'production', httpOnly: true, maxAge: 86400000 }
  })
);

Rationale: Redis acts as a single source of truth for session state. When PM2 rotates workers, new processes read existing sessions from Redis, preserving authentication and user context.

6. Database Migration Strategy

Schema changes cannot be applied atomically alongside code deployments. Use the expand/contract pattern:

  1. Deploy code that supports both old and new schema (expand)
  2. Run migrations to add columns/tables
  3. Deploy code that removes deprecated schema references (contract)
// migration-runner.ts
import { runMigrations } from './db/migrate';

async function preDeployValidation() {
  const isCompatible = await checkSchemaCompatibility();
  if (!isCompatible) {
    console.warn('Schema mismatch detected. Running safe migrations...');
    await runMigrations({ direction: 'up', lockTimeout: 30000 });
  }
}

preDeployValidation().catch(console.error);

Rationale: This approach prevents runtime errors caused by missing columns or type mismatches. The lock timeout prevents concurrent migration processes from corrupting the database state.

Pitfall Guide

1. In-Memory Session Storage

Explanation: Storing sessions in Node.js process memory means every rolling reload invalidates active user sessions. Users are forced to re-authenticate, triggering support tickets and trust erosion. Fix: Externalize session state to Redis, Memcached, or a managed session service. Configure connect-redis or equivalent adapters with connection pooling and retry logic.

2. Ignoring SIGTERM vs SIGINT

Explanation: PM2 sends SIGINT during reloads, but cloud platforms (AWS, GCP, Kubernetes) send SIGTERM during scaling events or health check failures. Handling only one signal leaves the process vulnerable to hard kills. Fix: Register handlers for both SIGINT and SIGTERM. Ensure the drain logic is identical and includes a hard timeout to prevent deployment hangs.

3. Stale Nginx Cache Delivery

Explanation: Browsers cache index.html aggressively. If the symlink updates but the client holds a cached version, it loads outdated JavaScript chunks, causing runtime errors or missing features. Fix: Set Cache-Control: no-cache for index.html and rely on content-hash filenames for JS/CSS assets. Vite and Webpack handle chunk hashing automatically; verify the Nginx config does not override it.

4. Database Schema Incompatibility

Explanation: Deploying code that expects a new column before the migration runs causes immediate 500 errors. Conversely, running migrations before backward-compatible code is deployed breaks the old version. Fix: Adopt the expand/contract pattern. Always deploy compatible code first, run migrations, then deploy the cleanup version. Use feature flags to toggle new schema usage.

5. WebSocket Connection Drops

Explanation: Rolling reloads terminate TCP connections. Clients using raw WebSockets experience abrupt disconnections without automatic recovery. Fix: Implement client-side reconnection logic with exponential backoff. For Socket.IO, use the Redis adapter to broadcast state across workers and enable automatic reconnection handling.

6. Environment Variable Staleness

Explanation: PM2 caches environment variables at startup. Updating .env files or system variables without reloading the process manager leaves workers using outdated configuration. Fix: Always use pm2 reload --update-env or define variables in an ecosystem file (ecosystem.config.js). Validate variable propagation in CI/CD logs before routing traffic.

7. Insufficient Grace Period

Explanation: The default 10-second shutdown timeout may be too short for long-running requests (file uploads, report generation, third-party API calls). Premature termination causes data loss and client errors. Fix: Tune kill_timeout in the PM2 ecosystem configuration to match your longest expected request. Monitor average response times and add a 20% buffer.

Production Bundle

Action Checklist

  • Isolate build artifacts: Route all frontend/backend builds to timestamped directories
  • Configure PM2 clustering: Set -i flag to CPU core count, enable --update-env
  • Implement atomic routing: Use ln -sfn for Nginx root, verify symlink updates
  • Add graceful shutdown: Handle SIGINT/SIGTERM, drain connections, enforce timeout
  • Externalize sessions: Replace in-memory stores with Redis or managed session backend
  • Validate migrations: Apply expand/contract pattern, test schema compatibility in staging
  • Tune kill_timeout: Align PM2 timeout with longest request + 20% buffer
  • Verify cache headers: Set no-cache for HTML, rely on chunk hashing for assets

Decision Matrix

ScenarioRecommended ApproachWhyCost Impact
Small team, single serverPM2 cluster + Nginx symlinkLow operational overhead, proven reliabilityMinimal (disk I/O increase)
High traffic, multi-corePM2 cluster + Redis sessions + HAProxyDistributes load, preserves state across nodesModerate (Redis instance, LB config)
Containerized/KubernetesRolling updates + readiness probesNative orchestration, no process manager neededHigher (cluster resources, monitoring)
Strict compliance/auditBlue-green deployment + immutable artifactsInstant rollback, full version traceabilityHigh (duplicate infrastructure, storage)

Configuration Template

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'platform-api',
    script: 'src/server.js',
    instances: 'max',
    exec_mode: 'cluster',
    max_memory_restart: '512M',
    kill_timeout: 15000,
    wait_ready: true,
    listen_timeout: 5000,
    env_production: {
      NODE_ENV: 'production',
      API_PORT: 3000,
      REDIS_URL: 'redis://127.0.0.1:6379',
      SESSION_SECRET: process.env.SESSION_SECRET
    }
  }]
};
# /etc/nginx/sites-available/platform.conf
server {
    listen 80;
    server_name api.example.com;

    location / {
        root /opt/apps/platform/frontend/current;
        try_files $uri $uri/ /index.html;
        
        # Prevent HTML caching
        if ($uri ~* \.html$) {
            add_header Cache-Control "no-cache, no-store, must-revalidate";
        }
    }

    location /api/ {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
        proxy_read_timeout 60s;
    }
}

Quick Start Guide

  1. Initialize versioned directories: Create /opt/apps/platform/frontend and /opt/apps/platform/backend. Set up a cron job or CI step to build artifacts into timestamped subdirectories.
  2. Configure PM2 ecosystem: Place ecosystem.config.js in your project root. Run pm2 start ecosystem.config.js --env production to launch clustered workers.
  3. Set up Nginx symlink: Create the current symlink pointing to your initial build. Update Nginx config to serve from /opt/apps/platform/frontend/current and reload Nginx.
  4. Test graceful reload: Run pm2 reload platform-api --update-env while sending continuous requests (while true; do curl http://localhost/api/health; sleep 0.1; done). Verify zero errors in logs.
  5. Deploy pipeline integration: Wrap artifact preparation, symlink swap, and PM2 reload into a single CI/CD script. Add pre-deploy migration checks and post-deploy health verification.