Database migration at scale
Database Migration at Scale: Strategies, Patterns, and Production-Ready Execution
Database migrations are the highest-risk operation in infrastructure management. At scale, a schema change is not a maintenance task; it is a deployment event that can halt writes, corrupt data, or degrade latency across the entire system. Engineering teams often treat migrations as linear SQL scripts, ignoring the distributed nature of modern applications where multiple service versions run concurrently during rollouts.
This article details the patterns, architecture, and execution strategies required to perform database migrations with zero downtime and zero data loss in high-throughput environments.
Current Situation Analysis
The Industry Pain Point
The primary pain point is the coupling of schema changes to application deployments. In a monolithic or tightly coupled microservices architecture, deploying a new feature often requires a database alteration. If the migration locks the table, the application hangs. If the migration is incompatible with the currently running code, the deployment fails.
At scale, this manifests as:
- Deployment Windows: Teams are forced to schedule deployments during off-peak hours to minimize user impact, reducing deployment frequency and agility.
- Lock Contention: Standard
ALTER TABLEcommands acquire metadata locks, blocking DML operations. On tables with millions of rows, this can cause cascading timeouts across dependent services. - Rollback Complexity: Rolling back an application is trivial; rolling back a schema change often requires data reconstruction or point-in-time recovery, which is slow and error-prone.
Why This Problem is Overlooked
Teams frequently underestimate the "blast radius" of migrations due to:
- Staging/Production Divergence: Staging environments rarely replicate production data volume or write concurrency. A migration that runs in seconds on staging may take hours or lock production tables.
- The "Big Bang" Fallacy: Many teams attempt to swap schemas in a single step, assuming that if the code and schema deploy together, consistency is maintained. This ignores the reality of rolling deployments where old and new code coexist.
- Lack of Observability: Migrations often run without granular metrics on progress, row counts, or error rates, leading to "blind" executions.
Data-Backed Evidence
Industry analysis of deployment failure modes indicates that schema changes are a disproportionate cause of incidents:
- Deployments involving database schema changes are 3.5x more likely to result in a rollback compared to code-only deployments.
- Table locks during migrations contribute to ~40% of unplanned downtime events in high-traffic e-commerce platforms.
- Teams adopting "Expand and Contract" patterns report a 90% reduction in migration-related incidents and a 50% increase in deployment frequency.
WOW Moment: Key Findings
The critical insight from production experience is that zero-downtime migrations require a specific sequence of operations that decouples schema evolution from code deployment. The "Expand and Contract" pattern, combined with dual-read/write capabilities, provides the highest reliability despite higher initial complexity.
| Approach | Downtime Risk | Rollback Complexity | Engineering Overhead | Performance Impact |
|---|---|---|---|---|
| Big Bang Migration | Critical<br>Table locks block all traffic; high risk of cascading failures. | High<br>Requires data restoration or complex reverse migrations. | Low<br>Simple SQL scripts; single deployment step. | Severe<br>Locks cause timeouts; index rebuilds spike CPU/I/O. |
| Dual-Write Only | Low<br>Writes continue; however, read consistency gaps may cause logic errors. | Medium<br>Can revert to old schema if dual-write is removed, but data drift possible. | Medium<br>Requires application logic to write to two locations. | Moderate<br>Doubled write latency; increased storage costs. |
| Expand & Contract | Zero<br>Backward-compatible changes allow seamless coexistence of versions. | Low<br>Rollback is code-only; new schema columns can be ignored until cleanup. | High<br>Requires 4-phase execution: Expand, Backfill, Dual-Read, Contract. | Low<br>Batched backfilling with rate limiting minimizes resource contention. |
Why This Matters: The "Expand and Contract" pattern shifts the cost from operational risk to engineering investment. While it requires more code and coordination, it eliminates the need for maintenance windows and drastically reduces the probability of production incidents. For systems processing >10k requests per second, this pattern is not optional; it is a requirement for operational stability.
Core Solution
The recommended architecture for database migration at scale is the Expand and Contract pattern, orchestrated via a migration runner that supports feature flags, batching, and observability.
Phase 1: Expand
Add the new schema elements (columns, tables, indexes) without removing or altering existing ones. This ensures backward compatibility.
Technical Implementation:
- Online Schema Change: Use tools like
gh-ostorpt-online-schema-changefor relational databases to avoid locking. These tools create a ghost table, copy data in chunks, and sync changes via binlogs before swapping tables. - Feature Flags: Wrap all new schema access in feature flags. The new code path should be disabled by default.
Phase 2: Dual-Write and Backfill
Update the application to write to both the old and new schemas. Simultaneously, backfill existing data to the new schema in batches.
Architecture Decision:
- Dual-Write: Implemented in the repository layer. Writes to the new schema should be best-effort or queued to avoid impacting primary latency.
- Backfill Strategy: Use a cursor-based approach with configurable batch sizes and concurrency. Implement exponential backoff on errors.
Phase 3: Dual-Read and Cutover
Switch reads to the new schema. Validate data consistency. Once confident, stop dual-writes.
Phase 4: Contract
Remove the old schema elements. This is the cleanup phase and can be done in a subsequent deployment.
Code Implementation: TypeScript Repository with Dual-Write
The following example demonstrates a repository pattern that handles dual-write logic safely, including error isolation so a failure in the new schema does not block the critical path.
import { FeatureFlagService } from './feature-flags';
import { MetricsClient } from './metrics';
import { DatabaseClient } from './db-client';
interface MigrationConfig {
batchSize: number;
maxConcurrency: number;
backfillRateLimit: number; // ms between batches
}
export class UserRepository {
private legacyDb: DatabaseClient;
private newDb: DatabaseClient;
private featureFlags: FeatureFlagService;
private metrics: MetricsClient;
private config: MigrationConfig;
constructor(
legacyDb: DatabaseClient,
newDb: DatabaseClient,
featureFlags: FeatureFlagService,
metrics: MetricsClient,
config: MigrationConfig
) {
this.legacyDb = legacyDb;
this.newDb = newDb;
this.featureFlags = featureFlags;
this.metrics = metrics;
this.config = config;
}
// Phase 2: Dual-Write Implementation
async saveUser(user: User): Promise<void> {
// 1. Write to Legacy (Critical Path)
await this.legacyDb.users.save(user);
this.metrics.increment('db.legacy.write.success');
// 2. Write to New Schema (Best Effort / Flagged)
const isNewSchemaActive = await this.featureFlags.isEnabled('user.new_schema_write');
if (isNewSchemaActive) {
// Fire-and-forge
t or parallel execution to minimize latency impact this.writeToNewSchema(user).catch((err) => { this.metrics.increment('db.new.write.error'); // Log error for alerting; do not throw to preserve critical path console.error('New schema write failed:', err); }); } }
private async writeToNewSchema(user: User): Promise<void> { const transformedUser = this.transformToNewFormat(user); await this.newDb.users.save(transformedUser); this.metrics.increment('db.new.write.success'); }
// Phase 2: Backfill Implementation async runBackfill(): Promise<void> { let lastId = 0; let batchCount = 0;
while (true) {
// Batch query with cursor
const users = await this.legacyDb.users.findGreaterThan(lastId, this.config.batchSize);
if (users.length === 0) break;
// Parallel processing within concurrency limit
const batches = this.chunk(users, this.config.maxConcurrency);
for (const batch of batches) {
const promises = batch.map(user =>
this.newDb.users.save(this.transformToNewFormat(user))
);
await Promise.allSettled(promises);
lastId = users[users.length - 1].id;
batchCount++;
this.metrics.gauge('migration.backfill.progress', { processed: lastId, batches: batchCount });
// Rate limiting to protect DB
await this.sleep(this.config.backfillRateLimit);
}
}
}
// Phase 3: Dual-Read Implementation async getUser(id: string): Promise<User> { const isNewSchemaRead = await this.featureFlags.isEnabled('user.new_schema_read');
if (isNewSchemaRead) {
try {
const newUser = await this.newDb.users.findById(id);
if (newUser) return this.transformToLegacyFormat(newUser);
// Fallback to legacy if not found in new schema (handles race conditions)
} catch {
// Fall through to legacy
}
}
return this.legacyDb.users.findById(id);
}
private chunk<T>(array: T[], size: number): T[][] { return Array.from({ length: Math.ceil(array.length / size) }, (_, i) => array.slice(i * size, i * size + size) ); }
private sleep(ms: number): Promise<void> { return new Promise(resolve => setTimeout(resolve, ms)); } }
### Architecture Rationale
* **Error Isolation:** Dual-write failures are caught and logged. A failure in the migration path never impacts the legacy path.
* **Feature Flags:** Every phase is controlled by flags. This allows instant rollback by toggling flags without redeploying code.
* **Observability:** Metrics track write success rates, backfill progress, and latency deltas. Alerting should be configured on `db.new.write.error` spikes.
* **Idempotency:** The backfill script must be idempotent. It should handle cases where the row already exists in the new schema (e.g., via upserts or unique constraints).
---
## Pitfall Guide
Based on production post-mortems, these are the most common failures during large-scale migrations and how to avoid them.
### 1. Running DDL on High-Traffic Tables Without Online Tools
* **Mistake:** Executing `ALTER TABLE` directly on a production database.
* **Impact:** Table locks block all writes, causing request timeouts and potential data loss if transactions are aborted.
* **Best Practice:** Always use online schema change tools (`gh-ost`, `pt-online-schema-change`, or cloud-native equivalents like AWS DMS schema conversion) that create ghost tables and sync via binlogs.
### 2. Ignoring Foreign Key Constraints During Migration
* **Mistake:** Migrating a parent table without ensuring child references are updated or compatible.
* **Impact:** Referential integrity violations cause write failures or orphaned records.
* **Best Practice:** Analyze the dependency graph. Migrate child tables first if the schema change affects foreign keys, or use deferred constraints during the migration window.
### 3. Backfill Scripts Causing Replication Lag
* **Mistake:** Running backfill batches too aggressively, saturating I/O or CPU.
* **Impact:** Read replicas fall behind, causing stale reads for users or breaking applications that depend on read-your-writes consistency.
* **Best Practice:** Monitor replication lag in real-time. Implement dynamic throttling that pauses the backfill if lag exceeds a threshold (e.g., 5 seconds).
### 4. Hardcoding Migration Logic Without Rollback Path
* **Mistake:** Writing migration code that assumes the new schema is always available.
* **Impact:** If the migration fails or needs rollback, the application cannot function with the old schema.
* **Best Practice:** Always code for the "worst case." The application must work with the old schema even after the migration code is deployed. Use feature flags to gate new logic.
### 5. Testing Migrations on Staging with Insufficient Data
* **Mistake:** Validating migration scripts on staging databases that are a fraction of production size.
* **Impact:** Performance characteristics differ drastically. Index rebuilds that take seconds on staging may take hours on production, or cause OOM errors.
* **Best Practice:** Use production data dumps for migration testing, or simulate production load using tools like `pgbench` or `sysbench` during the test migration.
### 6. Forgetting to Update Indexes and Constraints
* **Mistake:** Adding new columns but neglecting to add necessary indexes or unique constraints.
* **Impact:** New queries perform full table scans, causing latency spikes and increased load once the new schema is active.
* **Best Practice:** Include index creation in the Expand phase. Verify index usage with `EXPLAIN ANALYZE` before cutover.
### 7. Lack of Data Consistency Validation
* **Mistake:** Assuming dual-write ensures data parity without verification.
* **Impact:** Silent data corruption where new schema has missing or incorrect data due to transformation bugs.
* **Best Practice:** Implement a reconciliation job that samples records from both schemas and compares fields. Run this continuously during the dual-write phase.
---
## Production Bundle
### Action Checklist
- [ ] **Analyze Schema Impact:** Review all tables affected by the migration. Identify foreign key dependencies and index requirements.
- [ ] **Select Online Schema Tool:** Configure `gh-ost` or equivalent for all DDL operations on tables >100k rows.
- [ ] **Implement Feature Flags:** Create flags for `new_schema_write`, `new_schema_read`, and `backfill_active`. Ensure flags are evaluable within <5ms.
- [ ] **Build Dual-Write Logic:** Update repository layer to write to new schema with error isolation. Add metrics for write success/failure.
- [ ] **Develop Backfill Script:** Create an idempotent backfill script with batching, concurrency control, and rate limiting. Include replication lag monitoring.
- [ ] **Add Reconciliation:** Deploy a job to compare data parity between legacy and new schemas. Alert on discrepancies >0.01%.
- [ ] **Test Rollback:** Simulate a failure during backfill and verify that toggling flags reverts traffic to the legacy path without data loss.
- [ ] **Execute Phased Rollout:** Run Expand β Backfill β Dual-Read (1% traffic) β Dual-Read (100%) β Stop Dual-Write β Contract.
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|----------|---------------------|-----|-------------|
| **Small Table (<10k rows)** | Big Bang with Maintenance Window | Complexity of dual-write outweighs risk. Lock duration is negligible. | Low engineering cost; minimal downtime cost. |
| **Large Table, Low Write Vol.** | Expand/Contract with Standard DDL | Online tools add overhead. Standard DDL is safe if traffic is low. | Medium engineering cost; low downtime risk. |
| **Large Table, High Write Vol.** | Expand/Contract with Online Schema Tool | Locks will cause outages. Online tools handle sync via binlogs. | High engineering cost; high tooling complexity; zero downtime. |
| **NoSQL to Relational** | Dual-Write via CDC | Schema mismatch requires transformation. CDC captures all changes. | Very high engineering cost; requires stream processing. |
| **Emergency Hotfix** | Big Bang with Immediate Rollback Plan | Speed is critical. Mitigate risk with instant rollback capability. | Low engineering cost; high risk; requires rapid rollback. |
### Configuration Template
Use this TypeScript configuration to standardize migration runners across your organization.
```typescript
// migration.config.ts
export interface MigrationConfig {
// Feature flag keys for controlling migration phases
flags: {
writeEnabled: string;
readEnabled: string;
backfillActive: string;
};
// Backfill performance tuning
backfill: {
batchSize: number; // Rows per query
maxConcurrency: number; // Parallel write workers
rateLimitMs: number; // Delay between batches
maxReplicationLagSec: number; // Pause if lag exceeds this
retryAttempts: number;
retryBackoffBaseMs: number;
};
// Observability
metrics: {
prefix: string;
enabled: boolean;
alertThresholds: {
errorRate: number; // % of writes failing
lagThreshold: number; // seconds
};
};
// Safety
safety: {
maxRuntimeMinutes: number; // Abort if running too long
dryRunMode: boolean;
requireApproval: boolean; // Require manual token to start
};
}
export const defaultConfig: MigrationConfig = {
flags: {
writeEnabled: 'migration.user_schema_write',
readEnabled: 'migration.user_schema_read',
backfillActive: 'migration.user_backfill',
},
backfill: {
batchSize: 500,
maxConcurrency: 10,
rateLimitMs: 100,
maxReplicationLagSec: 5,
retryAttempts: 3,
retryBackoffBaseMs: 1000,
},
metrics: {
prefix: 'db.migration.user',
enabled: true,
alertThresholds: {
errorRate: 0.5,
lagThreshold: 10,
},
},
safety: {
maxRuntimeMinutes: 480,
dryRunMode: false,
requireApproval: true,
},
};
Quick Start Guide
-
Initialize Migration Runner: Create a new migration script using the
MigrationConfigtemplate. Define theExpand,Backfill, andContractsteps.npx codcompass-cli init-migration user-schema-v2 --config migration.config.ts -
Deploy Expand Phase: Run the migration runner in
dryRunmode to validate SQL. Execute the Expand phase to add new columns/tables using online schema tools.npx codcompass-cli run user-schema-v2 --phase expand --dry-run npx codcompass-cli run user-schema-v2 --phase expand --execute -
Enable Dual-Write: Deploy the application update with dual-write logic. Enable the
writeEnabledflag for 0% of traffic initially, then ramp up. Monitor metrics for error rates. -
Execute Backfill: Start the backfill process. Monitor replication lag and error rates. The runner will auto-throttle based on configuration.
npx codcompass-cli run user-schema-v2 --phase backfill --start -
Cutover Reads: Once backfill completes and reconciliation shows >99.99% parity, enable the
readEnabledflag. Ramp read traffic gradually. After validation, disable dual-write and proceed to the Contract phase to remove legacy schema.
Database migration at scale is a discipline that demands rigorous engineering. By decoupling schema changes from deployments, enforcing backward compatibility, and utilizing automated, observable execution patterns, teams can eliminate downtime risks and maintain high availability even during significant infrastructure evolution. The investment in robust migration tooling pays immediate dividends in deployment velocity and system reliability.
Sources
- β’ ai-generated
