Architecture Decisions
- Blue-Green over Canary: Selected due to the requirement for strict transactional consistency and the need to validate the entire Green environment against production load before full cutover.
- Dual-Write Strategy: The database cannot be locked during migration. The application writes to both the legacy schema (Blue) and the new schema (Green) simultaneously during the transition phase.
- Expand-Contract Pattern: Database changes follow the expand-contract methodology. New columns are added (expand), data is backfilled, traffic is switched, and old columns are removed (contract) in a subsequent deployment.
- Traffic Management: External DNS/Load Balancer controls the cutover, allowing fast reversal by pointing traffic back to the Blue service. With DNS-based switching, reversal is only as fast as the record TTL allows.
Step-by-Step Implementation
Phase 1: Database Expansion
Add new columns to the payments table without removing existing ones. Ensure backward compatibility.
-- Migration 1: Expand
ALTER TABLE payments ADD COLUMN metadata_json JSONB;
-- Do NOT drop legacy columns yet
Phase 2: Dual-Write Implementation
The application service must write to both schemas. Errors in the Green write must not block the Blue transaction but must trigger alerting.
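// payment-service/src/infrastructure/dual-write.repository.ts
import { Injectable, Logger } from '@nestjs/common';
import { MetricsService } from './metrics.service';
// Assumed import paths: the article does not show where these types live.
import { BluePaymentRepository } from './blue-payment.repository';
import { GreenPaymentRepository } from './green-payment.repository';
import { CreatePaymentDto, Payment } from './payment.types';

@Injectable()
export class DualWritePaymentRepository {
  private readonly logger = new Logger(DualWritePaymentRepository.name);

  constructor(
    private readonly blueRepo: BluePaymentRepository,
    private readonly greenRepo: GreenPaymentRepository,
    private readonly metrics: MetricsService,
  ) {}

  async createPayment(dto: CreatePaymentDto): Promise<Payment> {
    // 1. Write to Blue (legacy): blocking, and the source of truth during transition
    const blueResult = await this.blueRepo.save(dto);

    // 2. Write to Green (new): fire-and-forget with error handling.
    // Not awaited, to avoid adding latency to the primary path. Starting the
    // chain with Promise.resolve() routes a synchronous throw from
    // mapToGreenDto (a mapper assumed to be defined elsewhere in this class)
    // into the same catch handler instead of failing the Blue path.
    void Promise.resolve()
      .then(() => this.greenRepo.save(this.mapToGreenDto(dto)))
      .catch((error: unknown) => {
        const err = error instanceof Error ? error : new Error(String(error));
        this.logger.error('Dual-write failure to Green schema', err.stack);
        // Increment metric for alerting; data can be backfilled later
        this.metrics.increment('dual_write_failure_total', {
          table: 'payments',
          error_type: err.name,
        });
      });

    return blueResult;
  }

  async findPayment(id: string): Promise<Payment> {
    // During transition, read from Blue to ensure consistency.
    // Once cutover is complete, switch the read path to Green.
    return this.blueRepo.findById(id);
  }
}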
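The repository above calls mapToGreenDto without showing it. The following is a minimal sketch of what such a mapper might look like. The DTO shapes and the field names customerNote and sourceChannel are hypothetical, since the article does not define them; the only detail taken from the text is that the new schema carries a metadata_json column.

```typescript
// Hypothetical shapes: the article does not show the fields of either DTO.
interface LegacyPaymentDto {
  id: string;
  amountCents: number;
  currency: string;
  // Assumed legacy flat columns that the new schema folds into metadata_json.
  customerNote?: string;
  sourceChannel?: string;
}

interface GreenPaymentDto {
  id: string;
  amountCents: number;
  currency: string;
  metadataJson: Record<string, string>;
}

// Pure mapping with no I/O, so it can be unit-tested without a database.
function mapToGreenDto(dto: LegacyPaymentDto): GreenPaymentDto {
  const metadataJson: Record<string, string> = {};
  if (dto.customerNote !== undefined) {
    metadataJson["customerNote"] = dto.customerNote;
  }
  if (dto.sourceChannel !== undefined) {
    metadataJson["sourceChannel"] = dto.sourceChannel;
  }
  return {
    id: dto.id, // reuse the Blue ID so both schemas share a key
    amountCents: dto.amountCents,
    currency: dto.currency,
    metadataJson,
  };
}
```

Keeping the mapper free of side effects means a mapping bug can be caught in a unit test rather than surfacing as a dual-write failure metric in production.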
Phase 3: Deployment Orchestration
Use a deployment controller to manage the Blue and Green environments.
# k8s/rollouts/payment-service-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  strategy:
    blueGreen:
      activeService: payment-active-svc
      previewService: payment-preview-svc
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 600 # Keep Blue running for 10 mins post-promotion
      prePromotionAnalysis:
        templates:
          - templateName: latency-check
      postPromotionAnalysis:
        templates:
          - templateName: error-rate-check
Phase 4: Traffic Cutover
- Deploy Green version alongside Blue.
- Enable dual-write via configuration flag.
- Backfill historical data to Green schema using a batch job.
- Verify data consistency between Blue and Green.
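- Verify Green through payment-preview-svc while payment-active-svc still serves Blue.
- Promote Green to Active so that payment-active-svc switches from Blue to Green.
- Monitor metrics. If they degrade, roll back to Blue while it is still running.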
- Disable dual-write in Green code.
- Schedule Contract migration to remove legacy columns.
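The backfill and consistency-verification steps in the list above can be sketched as two small helpers. This is an illustrative outline only: the row shape, the function names toBatches and compareSchemas, and the choice of compared fields are assumptions, and a real job would read from and write to the database rather than in-memory arrays.

```typescript
// Hypothetical row shape shared by both schemas after normalisation.
interface ComparableRow {
  id: string;
  amountCents: number;
  currency: string;
}

interface ConsistencyReport {
  missingInGreen: string[];
  mismatched: string[];
}

// Split rows into fixed-size batches so a backfill job can commit in small steps.
function toBatches<T>(rows: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < rows.length; start += batchSize) {
    batches.push(rows.slice(start, start + batchSize));
  }
  return batches;
}

// Compare Blue rows with Green rows by ID and report gaps and differences.
function compareSchemas(
  blue: readonly ComparableRow[],
  green: readonly ComparableRow[],
): ConsistencyReport {
  const greenById = new Map<string, ComparableRow>();
  for (const row of green) {
    greenById.set(row.id, row);
  }
  const report: ConsistencyReport = { missingInGreen: [], mismatched: [] };
  for (const row of blue) {
    const match = greenById.get(row.id);
    if (match === undefined) {
      report.missingInGreen.push(row.id);
    } else if (match.amountCents !== row.amountCents || match.currency !== row.currency) {
      report.mismatched.push(row.id);
    }
  }
  return report;
}
```

An empty report (no missing and no mismatched IDs) is one possible gate for proceeding to promotion; rows listed in missingInGreen are candidates for another backfill pass.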
Rationale
The dual-write pattern decouples deployment from migration. It allows the team to deploy the new schema logic without risking data loss, because Blue remains the source of truth and failed Green writes can be backfilled later. The fire-and-forget approach keeps request latency dominated by the legacy path, preventing performance regression during the transition. The scale-down delay keeps Blue running after promotion, providing a safety window for immediate rollback without redeployment.
Pitfall Guide
1. The Dual-Write Idempotency Trap
Mistake: Failing to ensure dual-write operations are idempotent. If the Green write fails and retries, it may create duplicate records in the new schema.
Best Practice: Use deterministic IDs or upsert semantics in the Green repository. Implement a deduplication key based on the Blue transaction ID.
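As a rough illustration of upsert semantics keyed on the Blue transaction ID, the sketch below uses an in-memory map as a stand-in for the Green table. The class and field names are hypothetical; a real repository would rely on the database's own upsert (for example INSERT ... ON CONFLICT in PostgreSQL) rather than application memory.

```typescript
interface GreenRecord {
  blueTransactionId: string; // deduplication key taken from the Blue write
  amountCents: number;
  metadataJson: Record<string, string>;
}

// In-memory stand-in for the Green table, used only to show the semantics.
class InMemoryGreenStore {
  private readonly rows = new Map<string, GreenRecord>();

  // Writing the same key twice overwrites instead of inserting a duplicate.
  upsert(record: GreenRecord): "inserted" | "updated" {
    const outcome = this.rows.has(record.blueTransactionId) ? "updated" : "inserted";
    this.rows.set(record.blueTransactionId, record);
    return outcome;
  }

  count(): number {
    return this.rows.size;
  }
}
```

With this shape, a retried Green write for the same Blue transaction leaves exactly one row behind, which is the property the dual-write retry path needs.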
2. Schema Divergence During Rollback
Mistake: Rolling back traffic to Blue without cleaning up dual-write state. If the Blue code still attempts to write to Green (due to config lag), data divergence occurs.
Best Practice: Bind configuration changes to the deployment version. Ensure rollback reverts the dual-write flag atomically with the image swap.
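One way to picture binding the flag to the deployment version is to treat the image tag and the dual-write flag as a single release descriptor, so that rolling back restores both together. The sketch below is a simplified model under that assumption; ReleaseDescriptor and ReleaseHistory are hypothetical names, and the image tags are placeholders.

```typescript
// The flag travels with the image tag, so the two cannot drift apart.
interface ReleaseDescriptor {
  readonly imageTag: string;
  readonly dualWriteEnabled: boolean;
}

class ReleaseHistory {
  private readonly releases: ReleaseDescriptor[] = [];

  deploy(release: ReleaseDescriptor): ReleaseDescriptor {
    this.releases.push(release);
    return release;
  }

  current(): ReleaseDescriptor | undefined {
    return this.releases[this.releases.length - 1];
  }

  // Rolling back restores the previous image and its flag value in one step.
  rollback(): ReleaseDescriptor {
    if (this.releases.length < 2) {
      throw new Error("no previous release to roll back to");
    }
    this.releases.pop();
    const previous = this.current();
    if (previous === undefined) {
      throw new Error("release history is unexpectedly empty");
    }
    return previous;
  }
}
```

In a Kubernetes setting the same idea is often achieved by shipping the flag in the pod template (for example as an environment variable), so that the flag is part of the revision being rolled back.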
3. Aggressive Health Checks Killing Old Instances
Mistake: Letting Blue pods be removed from service or terminated immediately after the traffic switch, before in-flight requests complete. Note that a failing readiness probe only removes a pod from Service endpoints; it is pod deletion (or a failing liveness probe) that stops the container.
Best Practice: Set terminationGracePeriodSeconds to cover the longest expected request, and ensure load balancers drain connections before endpoints are removed. Verify that health checks account for the cutover latency.
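On the application side, connection draining usually means refusing new work once shutdown begins and exiting only when outstanding requests have finished. The class below is a minimal sketch of that bookkeeping; DrainTracker and its method names are hypothetical, and wiring it to a SIGTERM handler and to the readiness endpoint is left out.

```typescript
// Tracks in-flight requests so a pod can stop accepting work and
// wait for outstanding requests before exiting.
class DrainTracker {
  private inFlight = 0;
  private draining = false;

  // Call when a request arrives; returns false once draining has started.
  tryBegin(): boolean {
    if (this.draining) {
      return false;
    }
    this.inFlight += 1;
    return true;
  }

  // Call when a request finishes, whether it succeeded or failed.
  end(): void {
    if (this.inFlight > 0) {
      this.inFlight -= 1;
    }
  }

  // Intended to be called from the shutdown signal handler.
  startDraining(): void {
    this.draining = true;
  }

  // Suitable as the basis of a readiness endpoint: not ready while draining.
  isReady(): boolean {
    return !this.draining;
  }

  // True once draining has started and no requests remain in flight.
  canExit(): boolean {
    return this.draining && this.inFlight === 0;
  }
}
```

The grace period configured on the pod needs to be longer than the time this drain can take; otherwise the container is killed while canExit() is still false.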
4. Client-Side Caching Staleness
Mistake: Assuming zero-downtime at the server level. Client SDKs or CDNs may cache DNS resolutions or API responses, continuing to hit Blue endpoints or serving stale data.
Best Practice: Set appropriate TTLs on DNS records. Implement version headers in API responses to help clients detect changes. Use feature flags for client-side behavior if necessary.
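To illustrate how a version header can help clients detect a cutover, the sketch below shows a client-side cache that clears itself when the observed server version changes. It assumes the server sends its version in a response header (a name such as X-Service-Version would be a project-specific choice, not something the article specifies); VersionAwareCache is a hypothetical helper.

```typescript
// Client-side cache that is invalidated when the server version changes.
class VersionAwareCache<T> {
  private lastSeenVersion: string | undefined;
  private readonly entries = new Map<string, T>();

  // Feed in the version from each response; returns true if the cache
  // was cleared because the version differs from the last one seen.
  observeVersion(version: string): boolean {
    const changed =
      this.lastSeenVersion !== undefined && this.lastSeenVersion !== version;
    if (changed) {
      this.entries.clear();
    }
    this.lastSeenVersion = version;
    return changed;
  }

  set(key: string, value: T): void {
    this.entries.set(key, value);
  }

  get(key: string): T | undefined {
    return this.entries.get(key);
  }
}
```

This only addresses cached API responses; stale DNS resolutions still depend on the TTLs configured on the records.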
5. Ignoring Stateful Session Affinity
Mistake: Deploying Blue-Green for a service that relies on in-memory session state without externalizing sessions. Users may lose session data upon traffic switch.
Best Practice: Externalize all session state to Redis or a database. If state must be local, use sticky sessions with a migration strategy, though this limits scalability.
6. Metrics Blind Spots During Transition
Mistake: Relying on aggregate monitoring, which hides errors in the Green environment. If Green handles only 1% of traffic initially, an error spike there is diluted in the global average.
Best Practice: Implement per-version metric tagging. Alert on error rates specific to the version=green tag, not just global aggregates.
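A small worked example makes the dilution concrete. With the illustrative counts below (invented for this sketch, not measurements), Green fails 20 of 100 requests, a 20% error rate, yet the global rate is 30 of 10,000, or 0.3%, which would stay under a 1% alert threshold. The function names are hypothetical.

```typescript
interface VersionCounts {
  version: string; // for example "blue" or "green"
  requests: number;
  errors: number;
}

function errorRate(requests: number, errors: number): number {
  return requests === 0 ? 0 : errors / requests;
}

// Error rate across all versions combined, as a global dashboard would show it.
function globalErrorRate(counts: readonly VersionCounts[]): number {
  let requests = 0;
  let errors = 0;
  for (const entry of counts) {
    requests += entry.requests;
    errors += entry.errors;
  }
  return errorRate(requests, errors);
}

// Versions whose own error rate exceeds the threshold.
function versionsOverThreshold(
  counts: readonly VersionCounts[],
  threshold: number,
): string[] {
  return counts
    .filter((entry) => errorRate(entry.requests, entry.errors) > threshold)
    .map((entry) => entry.version);
}

// Illustrative numbers only: Green receives 1% of the traffic.
const sampleCounts: VersionCounts[] = [
  { version: "blue", requests: 9900, errors: 10 },
  { version: "green", requests: 100, errors: 20 },
];

const alertThreshold = 0.01; // 1%
const globalAlertFires = globalErrorRate(sampleCounts) > alertThreshold;
const versionsAlerting = versionsOverThreshold(sampleCounts, alertThreshold);
```

Here globalAlertFires is false while versionsAlerting contains only "green", which is the blind spot that per-version tagging removes.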
7. Contract Testing Gaps
Mistake: Relying solely on integration tests for database compatibility. Contract tests often miss edge cases in data transformation between schemas.
Best Practice: Implement shadow traffic testing where requests are mirrored to Green, and responses are compared against Blue responses to detect semantic drift before cutover.
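The comparison step of shadow testing can be sketched as a structural diff of the two responses that skips fields expected to differ, such as timestamps. The code below is one possible shape for that check, not a prescribed tool: diffResponses, the "$"-rooted path notation, and the sample field names are all assumptions made for illustration.

```typescript
type JsonValue =
  | string
  | number
  | boolean
  | null
  | JsonValue[]
  | { [key: string]: JsonValue };

type JsonObject = { [key: string]: JsonValue };

function isJsonObject(value: JsonValue): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns the paths at which the Blue and Green responses differ,
// skipping any path listed in ignoredPaths.
function diffResponses(
  blue: JsonValue,
  green: JsonValue,
  ignoredPaths: readonly string[],
  path: string = "$",
): string[] {
  if (ignoredPaths.indexOf(path) !== -1) return [];
  if (Array.isArray(blue) && Array.isArray(green)) {
    if (blue.length !== green.length) return [path];
    const diffs: string[] = [];
    for (let i = 0; i < blue.length; i += 1) {
      const left = blue[i];
      const right = green[i];
      if (left === undefined || right === undefined) continue;
      diffs.push(...diffResponses(left, right, ignoredPaths, `${path}[${i}]`));
    }
    return diffs;
  }
  if (isJsonObject(blue) && isJsonObject(green)) {
    const diffs: string[] = [];
    const keys = Object.keys(blue);
    for (const key of Object.keys(green)) {
      if (keys.indexOf(key) === -1) keys.push(key);
    }
    for (const key of keys) {
      const childPath = `${path}.${key}`;
      const left = blue[key];
      const right = green[key];
      if (left === undefined || right === undefined) {
        // Key present on one side only.
        if (ignoredPaths.indexOf(childPath) === -1) diffs.push(childPath);
        continue;
      }
      diffs.push(...diffResponses(left, right, ignoredPaths, childPath));
    }
    return diffs;
  }
  // Primitives, null, or mismatched kinds (for example array versus object).
  return blue === green ? [] : [path];
}
```

A non-empty result for a mirrored request is a signal of semantic drift worth investigating before cutover; the ignore list should stay short so that real differences are not masked.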
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Stateless API, High Traffic | Canary | Resource efficient; gradual rollout allows fine-grained validation. | Low |
| Stateful Service, Critical Path | Blue-Green | Predictable rollback; dual-write ensures data safety; faster recovery. | Medium |
| Small Team, Low Risk Feature | Rolling | Minimal infrastructure overhead; simple to implement. | Lowest |
| Multi-Region, Compliance Constraints | Blue-Green per Region | Regional isolation; allows regional cutover control; meets data residency. | High |
| Database Schema Change Required | Blue-Green + Dual-Write | Eliminates schema lock risk; allows safe migration without downtime. | Medium-High |
| Feature Experimentation | Feature Flags | Instant toggling; no infrastructure changes; supports A/B testing. | Low |
Configuration Template
Kubernetes Rollout Manifest with Argo Rollouts:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ .Release.Name }}-service
  labels:
    app: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
  strategy:
    blueGreen:
      activeService: {{ .Release.Name }}-active
      previewService: {{ .Release.Name }}-preview
      autoPromotionEnabled: false
      autoPromotionSeconds: 300
      scaleDownDelaySeconds: 900
      prePromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: success-rate
            value: "99.9"
      postPromotionAnalysis:
        templates:
          - templateName: latency-p99
        args:
          - name: latency-p99
            value: "200ms"
---
apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-active
spec:
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: {{ .Release.Name }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-preview
spec:
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: {{ .Release.Name }}
Quick Start Guide
- Install Argo Rollouts:
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
- Create Services:
Define the active and preview services with the same selector. The Rollout controller manages traffic by adding a pod-template-hash label to each service's selector, so that each service targets the intended ReplicaSet.
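- Apply Rollout Manifest:
The configuration template contains Helm placeholders ({{ ... }}), so render it first (for example with helm template) and save the output as rollout.yaml. Then apply:
kubectl apply -f rollout.yaml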
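- Deploy New Version:
Update the image tag in the Rollout manifest or via the Argo Rollouts kubectl plugin (installed separately from the controller). Substitute your own Rollout and container names: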
kubectl argo rollouts set image payment-service payment-service=registry.io/payment:v2.0.0
- Monitor and Promote:
Watch the rollout status. The controller will spin up the Green instances and expose them via the preview service. Verify using the preview service endpoint. Once validated:
kubectl argo rollouts promote payment-service
Traffic switches to Green. Blue instances remain running for the configured scale-down delay, allowing immediate rollback if needed.