continuity requirements:
- Active-Passive: Primary region handles all traffic; secondary region remains idle or runs minimal workloads. Lowest cost, highest RTO.
- Active-Active: Traffic split across regions with synchronous or near-synchronous replication. Highest cost, lowest RTO/RPO.
- Pilot Light: Core data and minimal compute run in DR region; scale up on failover. Balanced cost/performance.
Define RPO and RTO explicitly. RPO dictates replication frequency and consistency mode. RTO dictates how quickly compute, networking, and routing can be provisioned. These values drive architectural choices, not the reverse.
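To illustrate how the targets drive the topology rather than the reverse, here is a minimal sketch that maps RPO/RTO values to one of the three models above. The `chooseStrategy` function and its numeric thresholds are hypothetical and should be tuned to your own SLAs and budget.

```typescript
// Illustrative thresholds only; they are assumptions, not prescribed values.
type DrStrategy = "active-active" | "pilot-light" | "active-passive";

interface ContinuityTargets {
  rpoSeconds: number; // maximum tolerable data loss
  rtoMinutes: number; // maximum tolerable downtime
}

function chooseStrategy(t: ContinuityTargets): DrStrategy {
  if (t.rpoSeconds < 0 || t.rtoMinutes < 0) {
    throw new RangeError("RPO and RTO must be non-negative");
  }
  // Very tight data-loss or downtime targets need both regions serving traffic.
  if (t.rpoSeconds <= 60 || t.rtoMinutes <= 5) return "active-active";
  // Moderate targets: keep data replicated and scale compute on failover.
  if (t.rtoMinutes <= 60) return "pilot-light";
  // Relaxed targets: an idle secondary region is the cheapest option.
  return "active-passive";
}

console.log(chooseStrategy({ rpoSeconds: 30, rtoMinutes: 120 }));   // active-active
console.log(chooseStrategy({ rpoSeconds: 300, rtoMinutes: 15 }));   // pilot-light
console.log(chooseStrategy({ rpoSeconds: 3600, rtoMinutes: 240 })); // active-passive
```

Note that the first call selects active-active even with a relaxed RTO, because a tight RPO alone is enough to rule out the cheaper models.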
Step 2: Decouple State from Compute
State must never reside in ephemeral compute layers. Use managed services with cross-region replication capabilities:
- Object storage: Cross-region replication with versioning
- Databases: Read replicas, logical replication, or managed cross-region failover
- Secrets: Multi-region vault replication or dynamic secret generation
State isolation prevents cascading failures during failover and ensures that DR infrastructure can be destroyed and recreated without data loss.
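One way to enforce this rule is to keep an inventory of stateful components and flag anything that lives on compute or lacks cross-region replication. The sketch below is a hypothetical audit helper; the component names and the `StatefulComponent` shape are assumptions made for illustration.

```typescript
type ReplicationMode = "cross-region" | "single-region" | "none";

interface StatefulComponent {
  name: string;
  kind: "object-storage" | "database" | "secrets" | "compute-local";
  replication: ReplicationMode;
}

// Returns the names of components that would lose data if the region
// (or the compute layer) were destroyed and recreated.
function findUnprotectedState(components: StatefulComponent[]): string[] {
  return components
    .filter((c) => c.kind === "compute-local" || c.replication !== "cross-region")
    .map((c) => c.name);
}

// Hypothetical inventory for demonstration.
const inventory: StatefulComponent[] = [
  { name: "app-data-bucket", kind: "object-storage", replication: "cross-region" },
  { name: "appdb", kind: "database", replication: "cross-region" },
  { name: "session-files-on-vm", kind: "compute-local", replication: "none" },
  { name: "vault", kind: "secrets", replication: "single-region" },
];

// Flags session-files-on-vm and vault.
console.log(findUnprotectedState(inventory));
```

Running a check like this in CI would make a new piece of unreplicated state fail the build instead of surfacing during a failover.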
Step 3: Provision DR Infrastructure via IaC
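Infrastructure must be declarative, version-controlled, and environment-parameterized. The following TypeScript/Pulumi example sketches an active-passive DR setup with cross-region S3 replication, an RDS cross-region read replica, and Route53 failover routing. It assumes the @pulumi/aws v6 SDK and an existing IAM role named s3-replication-role.

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

const config = new pulumi.Config();
const primaryRegion = config.get("primaryRegion") || "us-east-1";
const drRegion = config.get("drRegion") || "us-west-2";
const environment = config.get("environment") || "production";

// One explicit provider per region. Resources select their region through
// resource options (third argument), not through a `region` property.
const primaryProvider = new aws.Provider("primary", { region: primaryRegion });
const drProvider = new aws.Provider("dr", { region: drRegion });

// Cross-region S3 replication with versioning
const primaryBucket = new aws.s3.BucketV2("primary-bucket", {
    bucket: `${environment}-app-data-${primaryRegion}`,
}, { provider: primaryProvider });

const drBucket = new aws.s3.BucketV2("dr-bucket", {
    bucket: `${environment}-app-data-${drRegion}`,
}, { provider: drProvider });

// Replication requires versioning on both the source and destination bucket.
const primaryVersioning = new aws.s3.BucketVersioningV2("primary-versioning", {
    bucket: primaryBucket.id,
    versioningConfiguration: { status: "Enabled" },
}, { provider: primaryProvider });

const drVersioning = new aws.s3.BucketVersioningV2("dr-versioning", {
    bucket: drBucket.id,
    versioningConfiguration: { status: "Enabled" },
}, { provider: drProvider });

// The replication role is assumed to exist already in this account.
const accountId = aws.getCallerIdentityOutput().accountId;

new aws.s3.BucketReplicationConfig("replication-config", {
    bucket: primaryBucket.id,
    role: pulumi.interpolate`arn:aws:iam::${accountId}:role/s3-replication-role`,
    rules: [{
        id: "dr-replication",
        status: "Enabled",
        destination: {
            bucket: drBucket.arn,
            storageClass: "STANDARD",
        },
    }],
}, { provider: primaryProvider, dependsOn: [primaryVersioning, drVersioning] });

// RDS cross-region read replica
const primaryDb = new aws.rds.Instance("primary-db", {
    engine: "postgres",
    engineVersion: "15.4",
    instanceClass: "db.r6g.large",
    allocatedStorage: 100,
    dbName: "appdb",
    username: config.require("dbUsername"),
    password: config.requireSecret("dbPassword"),
    backupRetentionPeriod: 7, // automated backups must be enabled before a replica can be created
    skipFinalSnapshot: true,  // convenient for a demo; reconsider for real production data
    multiAz: true,
}, { provider: primaryProvider });

const drDb = new aws.rds.Instance("dr-db-replica", {
    instanceClass: "db.r6g.large",
    // Cross-region replicas reference the source by ARN, not by identifier.
    // Engine, version, and storage are inherited from the source instance.
    replicateSourceDb: primaryDb.arn,
    skipFinalSnapshot: true,
}, { provider: drProvider });

// Route53 failover routing policy. Route53 is a global service, so no
// region-specific provider is needed for the records.
// NOTE: aliasing an S3 website endpoint only resolves when the bucket has
// website hosting enabled and its name matches the record name. In most
// deployments the alias target would be a load balancer or API endpoint.
const recordName = `app.${config.require("domain")}`;
const zoneId = config.require("zoneId");

const primaryRecord = new aws.route53.Record("primary-dns", {
    name: recordName,
    type: "A",
    zoneId: zoneId,
    failoverRoutingPolicies: [{ type: "PRIMARY" }],
    setIdentifier: "primary",
    aliases: [{
        name: primaryBucket.websiteDomain,
        zoneId: primaryBucket.hostedZoneId,
        evaluateTargetHealth: true,
    }],
});

const drRecord = new aws.route53.Record("dr-dns", {
    name: recordName,
    type: "A",
    zoneId: zoneId,
    failoverRoutingPolicies: [{ type: "SECONDARY" }],
    setIdentifier: "secondary",
    aliases: [{
        name: drBucket.websiteDomain,
        zoneId: drBucket.hostedZoneId,
        evaluateTargetHealth: true,
    }],
});

export const primaryEndpoint = primaryRecord.fqdn;
export const drEndpoint = drRecord.fqdn;
export const drReplicaEndpoint = drDb.endpoint;
```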
Step 4: Automate Failover Orchestration
DNS failover alone is insufficient. Production failover requires:
- Health check validation (application-level, not just TCP)
- Database promotion (read replica to primary)
- Cache invalidation and session migration
- CI/CD pipeline trigger to scale DR compute
- Post-failover validation and rollback capability
Implement failover as a GitHub Actions or GitLab CI pipeline that executes pulumi up against the DR stack, promotes the database, updates DNS, and runs synthetic transaction tests. Automate the entire sequence to eliminate manual intervention during incidents.
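The sequencing and rollback requirement can be sketched independently of any CI system. The runner below is a minimal illustration, not a finished orchestrator: `FailoverStep`, `runFailover`, and the stubbed step names are hypothetical, and in a real pipeline each `run` would shell out to `pulumi up`, a database promotion call, and so on.

```typescript
interface FailoverStep {
  name: string;
  run: () => Promise<void>;
  rollback?: () => Promise<void>;
}

interface FailoverResult {
  ok: boolean;
  completed: string[];
  rolledBack: string[];
  failedStep?: string;
}

// Runs steps in order. On the first failure, rolls back the completed steps
// in reverse order and reports which step failed. A rollback that throws is
// not caught here and will reject the returned promise.
async function runFailover(steps: FailoverStep[]): Promise<FailoverResult> {
  const done: FailoverStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
    } catch (err) {
      const rolledBack: string[] = [];
      for (const prev of [...done].reverse()) {
        if (prev.rollback) {
          await prev.rollback();
          rolledBack.push(prev.name);
        }
      }
      return {
        ok: false,
        completed: done.map((s) => s.name),
        rolledBack,
        failedStep: step.name,
      };
    }
  }
  return { ok: true, completed: done.map((s) => s.name), rolledBack: [] };
}

// Stubbed steps standing in for real automation.
const stubStep = (name: string, fail = false): FailoverStep => ({
  name,
  run: async () => {
    if (fail) throw new Error(`${name} failed`);
  },
  rollback: async () => {},
});

runFailover([
  stubStep("validate-health-checks"),
  stubStep("promote-database"),
  stubStep("scale-dr-compute"),
  stubStep("update-dns"),
  stubStep("synthetic-transactions", true), // simulate a failed validation
]).then((result) => {
  console.log(result.ok, result.failedStep, result.rolledBack);
});
```

In this demo the last step fails, so the result reports `ok: false` and lists the four earlier steps as rolled back in reverse order. Some steps, such as database promotion, cannot truly be reversed; for those a `rollback` would more realistically open an incident or trigger a rebuild.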
Step 5: Establish Continuous DR Validation
DR must be tested in production-like conditions without impacting live traffic. Use chaos engineering principles:
- Schedule weekly synthetic failover drills in an isolated DR environment
- Inject network partitions and region degradation
- Validate RPO/RTO against SLA thresholds
- Measure cost impact of DR provisioning
Continuous validation transforms DR from a reactive contingency into a measurable reliability metric.
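To make the "measurable" part concrete, a drill can emit two numbers and compare them with the SLA. The following is a minimal sketch under that assumption; `DrillMeasurement` and `evaluateDrill` are hypothetical names, and the thresholds in the example (120 s RPO, 15 min RTO) are illustrative.

```typescript
interface DrillMeasurement {
  replicationLagSeconds: number; // data-loss window observed at failover time
  recoveryMinutes: number;       // time from failure injection to healthy traffic
}

interface SlaThresholds {
  rpoSeconds: number;
  rtoMinutes: number;
}

interface DrillReport {
  passed: boolean;
  violations: string[];
}

// Compares one drill's measurements against the SLA and lists every breach.
function evaluateDrill(m: DrillMeasurement, sla: SlaThresholds): DrillReport {
  const violations: string[] = [];
  if (m.replicationLagSeconds > sla.rpoSeconds) {
    violations.push(
      `RPO breached: lag ${m.replicationLagSeconds}s exceeds ${sla.rpoSeconds}s`
    );
  }
  if (m.recoveryMinutes > sla.rtoMinutes) {
    violations.push(
      `RTO breached: recovery ${m.recoveryMinutes}min exceeds ${sla.rtoMinutes}min`
    );
  }
  return { passed: violations.length === 0, violations };
}

const sla: SlaThresholds = { rpoSeconds: 120, rtoMinutes: 15 };

// Passes: both measurements are inside the thresholds.
console.log(evaluateDrill({ replicationLagSeconds: 45, recoveryMinutes: 11 }, sla));

// Fails with a single RPO violation.
console.log(evaluateDrill({ replicationLagSeconds: 300, recoveryMinutes: 11 }, sla));
```

Storing each report alongside the drill date gives a trend line for both metrics, which is what turns DR readiness into something that can be tracked over time.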
Pitfall Guide
- Confusing HA with DR: Availability zones protect against hardware failures, not region outages or control plane degradation. DR requires cross-region topology, independent state replication, and separate networking. Treating multi-AZ deployments as DR leaves organizations vulnerable to regional AWS/GCP/Azure incidents.
- Ignoring Replication Lag: Synchronous replication guarantees consistency but adds latency and cost. Asynchronous replication reduces latency but risks data loss during failover. Failing to measure replication lag against RPO requirements results in silent data corruption or failed transactions post-failover. Implement lag monitoring and automated throttling when thresholds are breached.
- Hardcoding Region Identifiers: Infrastructure that references specific availability zones or region codes breaks during cross-region failover. Use region-parameterized variables, dynamic zone selection, and infrastructure-as-code abstractions that resolve endpoints at deployment time.
- Over-Provisioning Failover Capacity: Running full-scale DR infrastructure continuously inflates costs by 40–60%. Pilot light or warm standby models reduce idle spend while maintaining rapid scale-up capability. Right-size DR compute based on peak traffic multipliers, not current production load.
- Skipping Chaos-Driven Validation: Manual runbooks fail under stress because they assume perfect conditions. Automated DR drills that simulate region loss, database corruption, or DNS hijacking expose configuration drift, missing dependencies, and routing gaps. Schedule game days quarterly and integrate failover tests into CI/CD pipelines.
- Assuming Cloud Providers Handle App-Level DR: Cloud providers guarantee infrastructure availability, not application continuity. Stateful services, custom caching layers, background job queues, and third-party API dependencies require explicit DR strategies. Map every external dependency to a failover or degradation path.
- Neglecting DNS and CDN Propagation: Route53 failover routing has TTL constraints. CDN caches may serve stale content from the primary region after failover. Implement short TTLs during incidents, purge CDN caches programmatically, and validate edge node routing before declaring failover complete.
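The DNS pitfall in particular lends itself to a quick back-of-the-envelope check. As a rough estimate, clients keep resolving to the failed region for about the health check detection time (interval times failure threshold) plus the record TTL. The sketch below encodes that approximation; the function names are hypothetical, and the model deliberately ignores resolvers that disregard TTLs and any CDN caching.

```typescript
interface DnsFailoverInputs {
  healthCheckIntervalSeconds: number; // how often the health check probes
  failureThreshold: number;           // consecutive failures before "unhealthy"
  recordTtlSeconds: number;           // how long resolvers may cache the old answer
}

// Approximate upper bound on how long clients may still reach the failed region.
function worstCaseDnsFailoverSeconds(i: DnsFailoverInputs): number {
  const detectionSeconds = i.healthCheckIntervalSeconds * i.failureThreshold;
  return detectionSeconds + i.recordTtlSeconds;
}

// True when the DNS portion of failover alone fits inside the RTO budget.
function fitsWithinRto(i: DnsFailoverInputs, rtoMinutes: number): boolean {
  return worstCaseDnsFailoverSeconds(i) <= rtoMinutes * 60;
}

const relaxed: DnsFailoverInputs = {
  healthCheckIntervalSeconds: 30,
  failureThreshold: 3,
  recordTtlSeconds: 300,
};
const tuned: DnsFailoverInputs = {
  healthCheckIntervalSeconds: 10,
  failureThreshold: 3,
  recordTtlSeconds: 60,
};

console.log(worstCaseDnsFailoverSeconds(relaxed)); // 390
console.log(fitsWithinRto(relaxed, 5));            // false (390s > 300s)
console.log(worstCaseDnsFailoverSeconds(tuned));   // 90
console.log(fitsWithinRto(tuned, 5));              // true
```

The DNS window is only one component of RTO. Database promotion and compute scale-up run alongside or after it, so the remaining budget is smaller than this check alone suggests.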
Best Practices from Production:
- Treat DR infrastructure as immutable. Never patch a failed region; provision a new one.
- Separate state storage from compute provisioning. Use managed replication with explicit consistency controls.
- Automate runbooks as code. Replace PDFs with CI/CD pipelines that execute, validate, and rollback.
- Implement observability during failover. Track replication lag, DNS propagation, transaction success rates, and error budgets in real time.
- Align DR testing with release cycles. Every major deployment must include a DR validation step.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Regulatory compliance with strict RPO (<1 min) | Active-Active with synchronous replication | Guarantees zero data loss across regions | High (40-60% infrastructure premium) |
| Mid-tier SaaS with 15-min RTO tolerance | Warm Standby | Scaled-down compute stays running with full data replication | Medium (20-30% premium, pay-as-you-scale) |
| Cost-sensitive workloads with 4-hr RTO | Pilot Light | Core data stays replicated; compute is provisioned on failover | Low (10-15% premium, idle resource optimization) |
| Multi-cloud vendor lock-in avoidance | Active-Passive with abstraction layer | Cloud-agnostic IaC enables cross-provider failover | Medium-High (abstraction overhead, dual-cloud licensing) |
Configuration Template
```typescript
// pulumi/stacks/dr-config.ts
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

export class DRConfig {
    readonly primaryRegion: string;
    readonly drRegion: string;
    readonly environment: string;
    readonly rpoSeconds: number;
    readonly rtoMinutes: number;

    // Cached so repeated calls do not register a second provider named "dr".
    private drProvider?: aws.Provider;

    // `_opts` is currently unused; it is kept so existing call sites still compile.
    constructor(_opts?: pulumi.ComponentResourceOptions) {
        const config = new pulumi.Config();
        // `??` only falls back when the value is unset, so an explicit 0 is respected.
        this.primaryRegion = config.get("primaryRegion") ?? "us-east-1";
        this.drRegion = config.get("drRegion") ?? "us-west-2";
        this.environment = config.get("environment") ?? "production";
        this.rpoSeconds = config.getNumber("rpoSeconds") ?? 120;
        this.rtoMinutes = config.getNumber("rtoMinutes") ?? 15;
    }

    getDrProvider(): aws.Provider {
        if (!this.drProvider) {
            this.drProvider = new aws.Provider("dr", { region: this.drRegion });
        }
        return this.drProvider;
    }

    // Call once after construction to fail fast on targets this stack cannot meet.
    validateBoundaries(): void {
        if (this.rpoSeconds < 30) {
            throw new Error("Synchronous replication required for RPO < 30s");
        }
        if (this.rtoMinutes > 30) {
            console.warn("RTO > 30min requires manual intervention fallback");
        }
    }
}
```
Quick Start Guide
- Initialize the DR stack: Run `pulumi stack init dr-production` and set configuration values with `pulumi config set primaryRegion us-east-1` and `pulumi config set drRegion us-west-2`.
- Preview infrastructure: Execute `pulumi preview` to validate cross-region resource mapping, replication rules, and routing policies before provisioning.
- Deploy DR environment: Run `pulumi up` to provision isolated DR infrastructure, configure cross-region replication, and establish failover DNS records.
- Execute synthetic failover: Trigger the CI/CD failover pipeline, monitor replication lag, validate database promotion, and confirm DNS routing shifts within RTO/RPO boundaries.