
Database disaster recovery

By Codcompass Team · 7 min read

Current Situation Analysis

Database disaster recovery (DR) is routinely treated as a backup strategy rather than a recovery architecture. Teams configure automated snapshots, enable point-in-time recovery (PITR), and declare themselves protected. The gap emerges when failure conditions intersect with operational reality: cross-region latency breaks synchronous replication, WAL retention policies expire before corruption is detected, or failover scripts assume a network topology that no longer exists. Backup systems are designed for data preservation; recovery systems are designed for service continuity. Conflating the two is the primary reason DR drills fail in production.

The problem is overlooked because managed cloud databases abstract replication and backup mechanics behind single-click toggles. Engineers assume platform guarantees translate to application-level recovery guarantees. They rarely account for:

  • Extension compatibility during physical restore (PostGIS, TimescaleDB, Citus)
  • Connection pooler state synchronization during promotion
  • Transaction log gaps caused by checkpoint stalls or storage throttling
  • Credential rotation windows that invalidate replication slots mid-failover (a slot-health monitoring sketch follows this list)
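
Replication-slot health is the most monitorable of these risks. Below is a minimal sketch, assuming a monitoring role that can query pg_replication_slots on a PostgreSQL 13+ server (wal_status and safe_wal_size were added in 13); the host and credentials (resolved from PG* environment variables) are illustrative.

import { Client } from 'pg';

// Hypothetical monitoring check: flags slots whose required WAL is at risk
// of being recycled before the consumer catches up.
export async function checkSlotHealth(host: string): Promise<void> {
  const client = new Client({ host, database: 'postgres' });
  await client.connect();
  const res = await client.query(`
    SELECT slot_name, active, wal_status, safe_wal_size
    FROM pg_replication_slots
  `);
  await client.end();
  for (const row of res.rows) {
    // 'unreserved' means required WAL may be recycled; 'lost' means it already was
    if (!row.active || ['unreserved', 'lost'].includes(row.wal_status)) {
      console.warn(`Slot ${row.slot_name} at risk: active=${row.active}, wal_status=${row.wal_status}`);
    }
  }
}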

Industry data confirms the drift between perception and reality. Veeam’s 2024 Data Protection Trends Report indicates that 68% of organizations failed their most recent recovery test. Gartner estimates that unplanned downtime costs average $5,600 per minute, with database-centric outages representing 41% of enterprise incidents. The median RPO (Recovery Point Objective) for production systems is 15 minutes, yet 73% of teams retain WAL segments for only 24 hours, creating a silent compliance and operational gap. Recovery is not a storage problem; it is a state synchronization problem.

WOW Moment: Key Findings

The critical insight is that recovery capability does not scale linearly with backup frequency. Architecture choice dictates actual RTO/RPO, not storage volume.

| Approach | RTO | RPO | Storage Cost |
| --- | --- | --- | --- |
| Daily Snapshot Backup | 4–12 hrs | 24 hrs | Low |
| WAL Archiving + PITR | 15–45 min | 1–5 min | Medium |
| Streaming Replication + Auto-Failover | 10–30 sec | 0–1 sec | High |
| Multi-Master Active-Active | 0 sec | 0 sec | Very High |

Why this matters: Teams routinely provision daily snapshots while claiming "near-zero RPO" capabilities. The table forces alignment between business tolerance and technical implementation. Streaming replication reduces RTO to seconds but introduces split-brain risks and network dependency. WAL archiving offers deterministic recovery windows with minimal replication overhead. Active-active eliminates downtime but demands conflict resolution, distributed consensus, and significantly higher operational complexity. The optimal choice is not the most advanced; it is the one that matches measurable failure tolerance.
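
To make that alignment concrete, the sketch below maps measured tolerance targets to the table's rows. The thresholds simply restate the table; they are illustrative, not universal constants.

// Illustrative mapping from measured business tolerance to a DR approach.
type DRApproach = 'daily-snapshot' | 'wal-pitr' | 'streaming-failover' | 'active-active';

export function selectApproach(rpoSeconds: number, rtoSeconds: number): DRApproach {
  if (rpoSeconds === 0 && rtoSeconds === 0) return 'active-active';
  if (rpoSeconds <= 1 && rtoSeconds <= 30) return 'streaming-failover';
  if (rpoSeconds <= 300 && rtoSeconds <= 45 * 60) return 'wal-pitr';
  return 'daily-snapshot';
}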

Core Solution

A production-grade database disaster recovery architecture requires deterministic state capture, immutable archival, automated promotion, and continuous validation. The following implementation targets PostgreSQL but applies conceptually to any WAL/transaction-log-driven system.

Step 1: Baseline Physical Backup with Checksum Verification

Physical backups capture the exact on-disk state. Use pg_basebackup with WAL streaming; on clusters initialized with data checksums, pg_basebackup verifies them by default (PostgreSQL 11+), with --no-verify-checksums as the opt-out.

pg_basebackup -h primary-db.internal -U backup_user -D /var/lib/pg-backup/base \
  --checkpoint=fast --wal-method=stream --progress

Archive the output to immutable object storage with versioning enabled. Never overwrite historical baselines.
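
One hedged way to enforce that immutability, assuming an S3 bucket (named pg-base-backups here for illustration) created with versioning and a default Object Lock retention period, so every upload is immutable until its retention date:

import { readFileSync } from 'fs';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

// Minimal sketch: archive a base-backup tarball under a unique key.
// Bucket name, key layout, and region are illustrative assumptions.
export async function archiveBaseBackup(tarballPath: string, label: string): Promise<void> {
  const s3 = new S3Client({ region: 'us-east-1' });
  await s3.send(new PutObjectCommand({
    Bucket: 'pg-base-backups',
    Key: `base/${label}/base.tar.gz`, // never reuse a key for a new baseline
    Body: readFileSync(tarballPath),
  }));
}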

Step 2: Continuous WAL Archiving with Retention Policy

Configure postgresql.conf to ship WAL segments to durable storage. Implement a retention window that exceeds your maximum acceptable RPO.

wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://pg-wal-archive/%f --sse aws:kms'
archive_timeout = 60

Set a lifecycle policy to expire WALs only after confirming successful restore drills. Premature expiration is the leading cause of PITR failure.
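
Before a lifecycle rule is allowed to expire segments, coverage can be verified programmatically. A sketch, assuming the pg-wal-archive bucket from archive_command above and a PITR window expressed in days:

import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

// Returns true only if the oldest retained WAL segment predates the
// full PITR window. Bucket name and region are illustrative assumptions.
export async function walRetentionCovers(windowDays: number): Promise<boolean> {
  const s3 = new S3Client({ region: 'us-east-1' });
  let oldest = Date.now(); // stays "now" if the bucket is empty, which fails the check
  let token: string | undefined;
  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: 'pg-wal-archive',
      ContinuationToken: token,
    }));
    for (const obj of page.Contents ?? []) {
      const ts = obj.LastModified?.getTime() ?? Date.now();
      if (ts < oldest) oldest = ts;
    }
    token = page.NextContinuationToken;
  } while (token);
  return Date.now() - oldest >= windowDays * 24 * 3600 * 1000;
}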

Step 3: Standby Promotion with Automated Failover

Deploy a standby node in a separate availability zone. Use a lightweight orchestrator to monitor replication lag, detect primary failure, and promote safely.

import { Client } from 'pg';
import { RDS } from '@aws-sdk/client-rds';

export class PostgresFailoverOrchestrator {
  private rds: RDS;
  private standbyHost: string;
  private primaryHost: string;

  constructor(config: { standby: string; primary: string }) {
    this.rds = new RDS({ region: 'us-east-1' });
    this.standbyHost = config.standby;
    this.primaryHost = config.primary;
  }

  async checkReplicationLag(): Promise<number> {
    // Credentials are resolved from PG* environment variables
    const client = new Client({ host: this.standbyHost, database: 'postgres' });
    await client.connect();
    const res = await client.query(
      `SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag`
    );
    await client.end();
    return Number(res.rows[0].lag) || 0;
  }

  async promoteStandby(): Promise<void> {
    const lag = await this.checkReplicationLag();
    if (lag > 30) throw new Error(`Replication lag too high: ${lag}s. Aborting promotion.`);

    // Assumes the standby identifier is the RDS DBInstanceIdentifier,
    // not a raw hostname
    await this.rds.promoteReadReplica({
      DBInstanceIdentifier: this.standbyHost,
    });

    // Update connection poolers, DNS, or service mesh routing
    await this.updateRoutingTable();
  }

  private async updateRoutingTable(): Promise<void> {
    // Implementation depends on infrastructure (Route53, HAProxy, K8s Service)
    console.log('Routing updated to promoted standby');
  }
}
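
A minimal watchdog sketch wiring the orchestrator (assumed in scope) into a polling loop. Hostnames, the three-strike threshold, and the 10-second interval are illustrative assumptions, not recommendations.

import { Client } from 'pg';

const orchestrator = new PostgresFailoverOrchestrator({
  primary: 'primary-db.internal',
  standby: 'standby-db-1', // doubles as the RDS instance identifier here
});

let failures = 0;

// True if the primary answers a trivial query within the connection timeout
async function primaryAlive(): Promise<boolean> {
  const client = new Client({
    host: 'primary-db.internal',
    database: 'postgres',
    connectionTimeoutMillis: 3000,
  });
  try {
    await client.connect();
    await client.query('SELECT 1');
    return true;
  } catch {
    return false;
  } finally {
    await client.end().catch(() => {});
  }
}

setInterval(async () => {
  failures = (await primaryAlive()) ? 0 : failures + 1;
  if (failures >= 3) {
    failures = 0;
    await orchestrator.promoteStandby(); // refuses if replication lag > 30 s
  }
}, 10_000);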


Step 4: Recovery Validation Pipeline

Automate restore testing on isolated infrastructure. Validate data consistency, extension compatibility, and query performance before promoting to production readiness.
import { execSync } from 'child_process';
import { mkdirSync } from 'fs';

// Runs against an isolated test instance, never production. pg_restore
// validates a logical (custom-format) dump; pg_checksums requires the
// test cluster to be stopped first.
export async function validateRecovery(backupPath: string): Promise<boolean> {
  const testDir = `/tmp/pg-restore-test-${Date.now()}`;
  mkdirSync(testDir, { recursive: true });

  // --exit-on-error makes execSync throw if any statement fails to apply
  execSync(`pg_restore -d postgres ${backupPath} --single-transaction --exit-on-error`, {
    cwd: testDir,
    stdio: 'inherit'
  });

  // Stop the test cluster, then verify block-level checksums on its data
  // directory; pg_checksums exits non-zero (throwing here) on corruption
  execSync('pg_ctl stop -D /var/lib/postgresql/data -m fast');
  execSync('pg_checksums --check -D /var/lib/postgresql/data', { encoding: 'utf8' });
  return true;
}
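
The "critical query smoke tests" can be as simple as row-count and latency assertions. A hypothetical example; the table names and the 5-second budget are assumptions to replace with your own critical paths:

import { Client } from 'pg';

// Assert critical tables exist, are non-empty, and answer within budget
// on the restored instance. Host is illustrative.
export async function smokeTest(host: string): Promise<boolean> {
  const client = new Client({ host, database: 'postgres' });
  await client.connect();
  try {
    for (const table of ['orders', 'customers']) { // assumed critical tables
      const start = Date.now();
      const res = await client.query(`SELECT count(*) AS n FROM ${table}`);
      const elapsed = Date.now() - start;
      if (Number(res.rows[0].n) === 0 || elapsed > 5000) return false;
    }
    return true;
  } finally {
    await client.end();
  }
}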

Architecture Decisions and Rationale

  • Immutable WAL storage: Prevents accidental deletion or overwriting during ransomware or operator error.
  • Separate replication network: Isolates WAL traffic from application traffic, preventing backup storms from degrading production latency.
  • Idempotent promotion logic: Ensures failover scripts can be retried safely without corrupting replication slots or connection pools (a pre-check sketch follows this list).
  • Automated validation: Recovery is only as reliable as the last successful test. Pipeline integration catches extension mismatches and schema drift before they hit production.
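
A sketch of that idempotency pre-check, assuming PostgreSQL 12+ (pg_promote was added in 12); the host parameter is illustrative:

import { Client } from 'pg';

// pg_is_in_recovery() is false once a node has been promoted, so a retried
// failover run becomes a no-op instead of re-issuing promotion.
export async function promoteIfStillStandby(host: string): Promise<boolean> {
  const client = new Client({ host, database: 'postgres' });
  await client.connect();
  try {
    const res = await client.query('SELECT pg_is_in_recovery() AS standby');
    if (!res.rows[0].standby) return false;             // already promoted; retry is safe
    await client.query('SELECT pg_promote(true, 60)');  // wait up to 60 s for completion
    return true;
  } finally {
    await client.end();
  }
}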

Pitfall Guide

  1. Assuming backup size equals recovery capability: Large backups do not guarantee fast recovery. I/O bottlenecks during restore, missing WAL segments, or incompatible extensions will stall promotion regardless of storage capacity. Always measure restore time, not backup size.

  2. Ignoring RPO/RTO alignment with business metrics: Engineering teams often settle for technical defaults (e.g., 24-hour WAL retention) without mapping them to actual data loss tolerance. A 5-minute RPO requires continuous archiving and replication; a 24-hour RPO does not. Mismatched objectives create compliance failures and false confidence.

  3. Skipping cross-region latency validation: Synchronous replication across regions introduces unacceptable write latency. Asynchronous replication introduces data loss windows. Teams frequently deploy cross-region standby nodes without modeling network jitter, packet loss, or DNS TTL propagation delays. Validate failover under degraded network conditions.

  4. Mixing logical and physical replication without boundaries: Logical replication supports cross-version upgrades and selective table sync but cannot recover system catalogs or extensions. Physical replication captures full cluster state but blocks version upgrades. Using both simultaneously without clear separation causes slot conflicts, lag spikes, and inconsistent promotion behavior.

  5. Neglecting credential rotation during recovery: Replication slots, connection poolers, and monitoring agents often hardcode credentials. During failover, rotated secrets invalidate connections, causing promotion to hang or fail. Implement dynamic credential injection (e.g., HashiCorp Vault, AWS Secrets Manager) with short TTLs and automatic renewal.

  6. Over-relying on managed service abstractions: Cloud providers handle infrastructure redundancy, but application-level recovery remains your responsibility. Managed databases do not automatically recover from logical corruption, schema drift, or extension incompatibility. Understand the underlying replication mechanics; do not treat the console toggle as a recovery strategy.

  7. Failing to test DNS/TTL propagation: Promotion is instantaneous; routing updates are not. High TTL values delay client redirection, causing connection storms to the old primary. Use low TTLs (30–60 seconds) for database endpoints, implement connection retry logic with exponential backoff (see the sketch after this list), and validate routing updates during drills.
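
A minimal sketch of that retry logic using node-postgres; the host, attempt budget, and backoff cap are illustrative assumptions:

import { Client } from 'pg';

// Exponential backoff rides out the window between promotion and DNS
// propagation instead of hammering the old primary.
export async function connectWithBackoff(host: string, maxAttempts = 6): Promise<Client> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const client = new Client({ host, database: 'postgres', connectionTimeoutMillis: 3000 });
    try {
      await client.connect();
      return client; // success: caller owns the client
    } catch {
      await client.end().catch(() => {});
      // 0.5 s, 1 s, 2 s, 4 s, ... capped at 30 s
      const delay = Math.min(500 * 2 ** attempt, 30_000);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error(`Could not connect to ${host} after ${maxAttempts} attempts`);
}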

Best Practice: Run recovery drills quarterly on isolated infrastructure. Measure actual RTO/RPO, document failure modes, and update runbooks. Recovery is a muscle, not a configuration.

Production Bundle

Action Checklist

  • Define explicit RPO and RTO targets per database tier and document business impact
  • Enable WAL archiving with immutable storage and retention exceeding maximum RPO
  • Deploy standby nodes in separate failure domains with automated lag monitoring
  • Implement idempotent promotion logic with pre-checks for replication lag and disk health
  • Configure low TTL routing and connection retry mechanisms for seamless client redirection
  • Automate recovery validation pipelines with checksum verification and query smoke tests
  • Rotate replication credentials dynamically and avoid hardcoded secrets in failover scripts
  • Schedule quarterly cross-region failover drills and publish post-mortem metrics

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Startup / MVP | Daily snapshots + 24h WAL retention | Low operational overhead, acceptable data loss window for non-critical workloads | Low storage, minimal compute |
| Mid-scale SaaS | WAL archiving + PITR + async standby | Deterministic recovery, balances cost and RPO/RTO for customer-facing apps | Medium storage, moderate compute |
| Financial / Compliance | Streaming replication + auto-failover + sync WAL | Zero or near-zero data loss, strict audit trails, regulatory alignment | High storage, premium compute, network costs |
| Global Multi-Region | Multi-master active-active with conflict resolution | Eliminates single-region dependency, supports geo-distributed workloads | Very high compute, complex networking, licensing |

Configuration Template

postgresql.conf (Primary)

wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
archive_mode = on
archive_command = 'aws s3 cp %p s3://pg-wal-archive/%f --sse aws:kms'
archive_timeout = 60
hot_standby = on

postgresql.conf (Standby)

restore_command = 'aws s3 cp s3://pg-wal-archive/%f %p'
recovery_target_timeline = 'latest'
promote_trigger_file = '/tmp/pg_promote'

Since PostgreSQL 12, these recovery settings belong in postgresql.conf on the standby; what marks the node as a standby is an empty standby.signal file in the data directory (recovery.signal alone triggers one-shot PITR instead). Note that promote_trigger_file was removed in PostgreSQL 16; use pg_ctl promote or pg_promote() there.
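
To trigger promotion under this configuration, either create the trigger file or call pg_ctl promote. A small sketch; the paths are illustrative:

import { execSync } from 'child_process';
import { writeFileSync } from 'fs';

// Two ways to promote the standby configured above.
export function triggerPromotion(useTriggerFile = false): void {
  if (useTriggerFile) {
    // Watched path must match promote_trigger_file (PG 12–15 only)
    writeFileSync('/tmp/pg_promote', '');
  } else {
    // Version-independent; -w waits until promotion completes
    execSync('pg_ctl promote -D /var/lib/postgresql/data -w');
  }
}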

TypeScript Recovery Runner

import { execSync } from 'child_process';
import { appendFileSync, writeFileSync } from 'fs';

export async function executePITR(targetTime: string, bucket: string) {
  const dataDir = '/var/lib/postgresql/data';

  // 1. Restore base backup from the primary
  execSync(`pg_basebackup -h primary-db.internal -D ${dataDir} -Fp -Xs -P`);

  // 2. Configure PITR (PostgreSQL 12+: recovery settings go in postgresql.conf,
  //    and an empty recovery.signal file switches the server into recovery mode)
  appendFileSync(`${dataDir}/postgresql.conf`, `
restore_command = 'aws s3 cp s3://${bucket}/%f %p'
recovery_target_time = '${targetTime}'
recovery_target_action = 'promote'
`);
  writeFileSync(`${dataDir}/recovery.signal`, '');

  // 3. Start recovery
  execSync(`pg_ctl start -D ${dataDir}`);

  // 4. Validate
  const check = execSync('pg_isready -t 30', { encoding: 'utf8' });
  if (!check.includes('accepting connections')) {
    throw new Error('PITR validation failed');
  }
}

Quick Start Guide

  1. Provision baseline infrastructure: Deploy primary and standby PostgreSQL instances in separate availability zones. Configure security groups to allow replication traffic on port 5432.
  2. Enable WAL archiving: Update postgresql.conf with archive_mode=on and point archive_command to an S3 bucket with versioning and object lock enabled.
  3. Initialize standby: Run pg_basebackup from the standby node (the -R flag writes the connection settings and creates standby.signal for you), configure restore_command, then start the standby and verify replication lag stays under 5 seconds.
  4. Deploy failover orchestrator: Install the TypeScript failover script, configure AWS credentials, and set up a cron job or Kubernetes CronJob to monitor lag and trigger promotion if the primary becomes unreachable for >30 seconds.
  5. Validate: Run a controlled failover drill. Measure promotion time, verify data consistency, update DNS/TTL, and document the actual RTO/RPO against targets (a measurement sketch follows below).
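
For step 5, a hedged sketch of measuring actual RTO: timestamp when promotion is triggered, then poll until the promoted endpoint accepts connections. The data directory path and check budget are assumptions:

import { execSync } from 'child_process';

// Elapsed seconds from promotion trigger to first successful health check.
export function measureDrillRTO(promotedHost: string, maxChecks = 60): number {
  const start = Date.now();
  execSync('pg_ctl promote -D /var/lib/postgresql/data -w');
  for (let i = 0; i < maxChecks; i++) {
    try {
      // pg_isready exits non-zero (throwing here) until connections are accepted
      execSync(`pg_isready -h ${promotedHost} -t 5`);
      return (Date.now() - start) / 1000;
    } catch {
      // not ready yet; pg_isready already waited up to 5 s
    }
  }
  throw new Error('Promoted endpoint never became ready');
}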
