anagement:** Blue-green requires handling sticky sessions. If sessions are in-memory, blue-green will drop user state upon switch. Use externalized session storage (Redis, DynamoDB) or header-based routing to maintain session affinity during canary.
Implementation Steps
1. Define Deployment Strategy Interface
Create a TypeScript interface to standardize deployment configurations across environments.
export interface DeploymentStrategy {
  type: 'blue-green' | 'canary';
  rollbackOnFailure: boolean;
  healthCheck: {
    path: string;
    port: number;
    interval: number; // seconds
    timeout: number; // seconds
  };
}

export interface CanaryStrategy extends DeploymentStrategy {
  type: 'canary';
  steps: {
    weight: number; // 0 to 100
    analysis: {
      metric: string;
      threshold: number;
      duration: number; // seconds
    };
  }[];
}

export interface BlueGreenStrategy extends DeploymentStrategy {
  type: 'blue-green';
  autoPromote?: boolean;
  prePromotionTests?: string[]; // URLs for smoke tests
}
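As an illustration of how these interfaces might be used, the sketch below defines a sample canary configuration and a small validator for its steps. The `validateCanarySteps` helper, the metric name `http_error_rate`, and all concrete numbers are assumptions made for this example; they are not part of the interface definition above.

```typescript
// Types repeated from the interface definition so the sketch is self-contained.
interface DeploymentStrategy {
  type: 'blue-green' | 'canary';
  rollbackOnFailure: boolean;
  healthCheck: { path: string; port: number; interval: number; timeout: number };
}

interface CanaryStrategy extends DeploymentStrategy {
  type: 'canary';
  steps: {
    weight: number;
    analysis: { metric: string; threshold: number; duration: number };
  }[];
}

// Hypothetical helper: returns a list of problems, empty when the steps look sane.
function validateCanarySteps(config: CanaryStrategy): string[] {
  const problems: string[] = [];
  if (config.steps.length === 0) problems.push('at least one step is required');
  let previous = 0;
  config.steps.forEach((step, i) => {
    if (step.weight <= 0 || step.weight > 100) {
      problems.push(`step ${i}: weight ${step.weight} is outside (0, 100]`);
    }
    if (step.weight <= previous) {
      problems.push(`step ${i}: weight ${step.weight} does not increase`);
    }
    if (step.analysis.duration <= 0) {
      problems.push(`step ${i}: analysis duration must be positive`);
    }
    previous = step.weight;
  });
  return problems;
}

// Example values only; the metric name and thresholds are assumptions.
const analysis = { metric: 'http_error_rate', threshold: 0.001, duration: 300 };
const sampleCanary: CanaryStrategy = {
  type: 'canary',
  rollbackOnFailure: true,
  healthCheck: { path: '/healthz', port: 8080, interval: 10, timeout: 2 },
  steps: [
    { weight: 5, analysis },
    { weight: 25, analysis },
    { weight: 50, analysis },
    { weight: 100, analysis },
  ],
};
```

Running the validator in CI before a rollout starts is one way to catch a misordered or out-of-range step early.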
2. Kubernetes Resource Configuration
Use Pulumi or CDK8s to generate resources. Below is a TypeScript example using a Pulumi-style approach to configure weighted routing for canary.
import * as k8s from "@pulumi/kubernetes";

export function createCanaryDeployment(
  name: string,
  namespace: string,
  stableVersion: string,
  canaryVersion: string,
  weight: number
) {
  // Stable Deployment
  const stable = new k8s.apps.v1.Deployment(`${name}-stable`, {
    metadata: { name: `${name}-stable`, namespace },
    spec: {
      replicas: 3,
      selector: { matchLabels: { app: name, track: "stable" } },
      template: {
        metadata: { labels: { app: name, track: "stable" } },
        spec: {
          containers: [{
            name,
            image: `${name}:${stableVersion}`,
            ports: [{ containerPort: 8080 }]
          }]
        }
      }
    }
  });

  // Canary Deployment
  const canary = new k8s.apps.v1.Deployment(`${name}-canary`, {
    metadata: { name: `${name}-canary`, namespace },
    spec: {
      replicas: 1, // Start with minimal capacity
      selector: { matchLabels: { app: name, track: "canary" } },
      template: {
        metadata: { labels: { app: name, track: "canary" } },
        spec: {
          containers: [{
            name,
            image: `${name}:${canaryVersion}`,
            ports: [{ containerPort: 8080 }]
          }]
        }
      }
    }
  });

  // Service for the canary pods; the Ingress backend below refers to it by name
  const canaryService = new k8s.core.v1.Service(`${name}-canary-svc`, {
    metadata: { name: `${name}-canary`, namespace },
    spec: {
      selector: { app: name, track: "canary" },
      ports: [{ port: 80, targetPort: 8080 }]
    }
  });

  // Canary Ingress with weighted routing (ingress controller dependent).
  // Assumes the NGINX Ingress Controller, which applies canary annotations only
  // when a primary (non-canary) Ingress for the same host and path already exists.
  const ingress = new k8s.networking.v1.Ingress(`${name}-ingress`, {
    metadata: {
      name: `${name}-ingress`,
      namespace,
      annotations: {
        "nginx.ingress.kubernetes.io/canary": "true",
        "nginx.ingress.kubernetes.io/canary-weight": weight.toString(),
        // Optional: Header-based routing for specific users
        "nginx.ingress.kubernetes.io/canary-by-header": "x-canary",
        "nginx.ingress.kubernetes.io/canary-by-header-value": "always"
      }
    },
    spec: {
      rules: [{
        http: {
          paths: [{
            path: "/",
            pathType: "Prefix",
            backend: {
              service: {
                name: `${name}-canary`,
                port: { number: 80 }
              }
            }
          }]
        }
      }]
    }
  });

  return { stable, canary, canaryService, ingress };
}
3. Blue-Green Switch Implementation
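Blue-green runs two complete environments (blue and green) behind a single router Service. The switch changes the Service selector or the Ingress backend in one update, so all traffic moves to the target version at once.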
export function switchBlueGreenTraffic(
  serviceName: string,
  targetVersion: 'blue' | 'green'
) {
  // Update the Service selector to point to the target version's pods.
  // The selector change is a single Service update; the endpoints that
  // proxies route to are refreshed shortly after it is applied.
  return new k8s.core.v1.Service(`${serviceName}-router`, {
    metadata: { name: serviceName },
    spec: {
      selector: {
        app: serviceName,
        version: targetVersion
      },
      ports: [{ port: 80, targetPort: 8080 }]
    }
  });
}
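The promotion decision that precedes the switch can be sketched independently of Kubernetes. In the sketch below, `promoteIfHealthy`, `SmokeCheck`, and `PromotionResult` are hypothetical names, and the check function is injected so that the logic can be exercised without a cluster. It assumes that the smoke-test URLs target the idle environment.

```typescript
type Color = 'blue' | 'green';

// Injected check: resolves to true when a smoke-test URL passes.
type SmokeCheck = (url: string) => Promise<boolean>;

interface PromotionResult {
  live: Color;           // color that should receive traffic after the call
  promoted: boolean;     // true when traffic should move to the idle color
  failedTests: string[]; // URLs whose smoke test did not pass
}

// The idle color is whichever one is not currently live.
function idleColor(live: Color): Color {
  return live === 'blue' ? 'green' : 'blue';
}

// Hypothetical gate: run every pre-promotion test and flip only when all pass.
async function promoteIfHealthy(
  live: Color,
  prePromotionTests: string[],
  check: SmokeCheck
): Promise<PromotionResult> {
  const results = await Promise.all(
    prePromotionTests.map(async (url) => ({ url, ok: await check(url) }))
  );
  const failedTests = results.filter((r) => !r.ok).map((r) => r.url);
  if (failedTests.length > 0) {
    // Keep traffic where it is; the caller can alert or retry.
    return { live, promoted: false, failedTests };
  }
  return { live: idleColor(live), promoted: true, failedTests: [] };
}
```

The returned `live` value could then be passed as `targetVersion` to `switchBlueGreenTraffic`.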
4. Automated Canary Analysis Loop
Implement a controller or CI/CD step that evaluates metrics before advancing weights.
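// Minimal shape of the metrics client this function relies on
// (an assumption; adapt it to your metrics backend)
interface MetricClient {
  query(params: { metric: string; window: string }): Promise<{
    errorRate: number;
    latencyP99: number;
  }>;
}

async function evaluateCanaryProgress(
  currentWeight: number,
  metricClient: MetricClient,
  config: CanaryStrategy
): Promise<'advance' | 'abort' | 'hold'> {
  // Find the step that matches the weight currently being served
  const step = config.steps.find(s => s.weight === currentWeight);
  if (!step) return 'abort';

  const metrics = await metricClient.query({
    metric: step.analysis.metric,
    window: `${step.analysis.duration}s`
  });

  if (metrics.errorRate > step.analysis.threshold) {
    console.error(`Canary abort: Error rate ${metrics.errorRate} exceeds threshold ${step.analysis.threshold}`);
    return 'abort';
  }

  // The step defines a single threshold, so this check is only meaningful when
  // latencyP99 is reported in comparable units; a separate latency threshold
  // would be clearer.
  if (metrics.latencyP99 > step.analysis.threshold * 1.5) {
    return 'hold'; // Wait for stabilization
  }

  return 'advance';
}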
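A controller built around this evaluation could drive the rollout roughly as follows. This is a minimal sketch under stated assumptions: `runCanaryRollout`, `RolloutDeps`, and `maxHolds` are hypothetical names, and the evaluation and weight-setting functions are injected rather than tied to a real metrics backend or ingress controller.

```typescript
type Verdict = 'advance' | 'abort' | 'hold';

// Injected dependencies keep the loop testable without a cluster.
interface RolloutDeps {
  evaluate: (weight: number) => Promise<Verdict>; // e.g. wraps evaluateCanaryProgress
  setWeight: (weight: number) => Promise<void>;   // e.g. updates the canary weight
  maxHolds: number;                               // holds tolerated per step before aborting
}

// Hypothetical driver: walks the weights in order and reverts to 0 on abort.
async function runCanaryRollout(
  weights: number[],
  deps: RolloutDeps
): Promise<'promoted' | 'rolled-back'> {
  for (const weight of weights) {
    await deps.setWeight(weight);
    let holds = 0;
    for (;;) {
      const verdict = await deps.evaluate(weight);
      if (verdict === 'advance') break;
      if (verdict === 'abort' || ++holds > deps.maxHolds) {
        await deps.setWeight(0); // automatic rollback, no human in the loop
        return 'rolled-back';
      }
      // 'hold': evaluate again; a real loop would wait for the next
      // analysis window before re-querying metrics.
    }
  }
  return 'promoted';
}
```

Bounding the number of holds is a design choice in this sketch: it prevents a canary that never stabilizes from staying at a partial weight indefinitely.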
Pitfall Guide
Common Mistakes
- Incompatible Database Migrations: Deploying a schema change that breaks the previous version while traffic is split. Impact: Blue-green rollback fails because the old code cannot read the new schema. Fix: Always use expand/contract. Never drop columns until 100% traffic is on the new version.
- Ignoring Sticky Sessions in Blue-Green: Switching traffic immediately drops users with in-memory sessions. Impact: User frustration, lost cart data, forced re-authentication. Fix: Externalize sessions or implement a drain period with header-based routing for active users.
- Metric Blindness in Canary: Relying on HTTP 5xx rates only. Impact: Performance regressions or business logic errors (e.g., incorrect pricing) that return 200 OK go undetected. Fix: Implement business-level metrics (e.g., "checkout_success_rate") and latency distributions, not just error counts.
- Resource Starvation on Canary Nodes: Setting canary replicas too low to represent production load characteristics. Impact: Memory leaks or concurrency bugs only manifest under load, which the canary cannot reproduce. Fix: Ensure canary has sufficient replicas to handle sampled traffic volume, or use load injection tools.
- Manual Rollback Delays: Relying on a human to trigger the rollback. Impact: Canary analysis detects failure, but human intervention takes minutes, exposing more users. Fix: Automate rollback triggers based on SLO breaches. The system must auto-revert weight to 0% on critical threshold violation.
- Stateful Service Misalignment: Applying canary to stateful services without partitioning. Impact: Data inconsistency if writes go to different versions with different logic. Fix: Canary is generally unsafe for stateful services unless the state layer supports versioned writes or the canary is read-only.
- Configuration Drift: Blue and green environments drift in configuration over time. Impact: "It works in blue but fails in green." Fix: Manage all configuration via IaC and secrets managers. Ensure both environments are identical except for the application version.
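The expand/contract fix from the first pitfall can be illustrated with application code that tolerates both schema shapes while traffic is split. The column names (`full_name`, `first_name`, `last_name`) and the helpers `readDisplayName` and `buildWrite` are invented for this sketch.

```typescript
// Row shape during the "expand" phase: the new columns exist alongside the
// old one, and either side may be empty depending on which version wrote the row.
interface UserRow {
  full_name?: string | null;  // old column, dropped only in the contract phase
  first_name?: string | null; // new columns, added in the expand phase
  last_name?: string | null;
}

// Prefers the new columns and falls back to the old one.
function readDisplayName(row: UserRow): string {
  const parts = [row.first_name, row.last_name].filter(
    (p): p is string => typeof p === 'string' && p.length > 0
  );
  if (parts.length > 0) return parts.join(' ');
  return row.full_name ?? '';
}

// Writes both shapes so that old and new code can read the same row.
function buildWrite(first: string, last: string): UserRow {
  return {
    first_name: first,
    last_name: last,
    full_name: `${first} ${last}`.trim(),
  };
}
```

Because both versions can read every row, rolling back to the old version stays safe until the contract step removes the old column.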
Best Practices
- Idempotent Deployments: Ensure deployment scripts can be re-run without side effects.
- Synthetic Monitoring: Inject synthetic traffic to validate health checks immediately after deployment, independent of real user traffic.
- Header-Based Canary: Use x-canary headers to route internal users or specific tenants to canary versions for early validation before weight-based rollout.
- Cost Attribution: Tag canary resources to track incremental costs. Canary should rarely exceed 10-15% of total infrastructure cost during rollout.
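The header-based practice can be sketched as a routing decision in which an explicit header wins over the weight. The sketch is loosely modeled on how the NGINX canary annotations treat the header values `always` and `never`; `chooseTrack` and `bucketFor` are hypothetical helpers, and the hash is a simple illustrative one rather than anything an ingress controller is known to use.

```typescript
type Track = 'stable' | 'canary';

// Hypothetical router: `bucket` is a number in [0, 100) derived from a stable
// request attribute, for example a hash of the user ID.
function chooseTrack(
  headers: Record<string, string | undefined>,
  weight: number,
  bucket: number
): Track {
  const override = headers['x-canary'];
  if (override === 'always') return 'canary'; // internal users or chosen tenants
  if (override === 'never') return 'stable';  // explicit opt-out
  return bucket < weight ? 'canary' : 'stable';
}

// Deterministic bucket, so a given user stays on one track at a fixed weight.
function bucketFor(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}
```

With a weight of 0, only requests that carry the header reach the canary, which matches the early-validation phase described above.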
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Risk Monolith | Blue-Green | Instant rollback capability; the new environment is validated in full isolation before any user traffic reaches it. | High (2x infra during switch) |
| High-Traffic Microservices | Canary | Granular control minimizes user impact; auto-scaling aligns cost with validated traffic. | Low (Dynamic scaling) |
| Stateful Backend Services | Blue-Green | Canary on stateful services risks data corruption; blue-green allows full validation before cutover. | High (2x infra) |
| Frequent Small Updates | Canary | Overhead of provisioning full blue-green environment outweighs benefits; canary is lightweight. | Low |
| Regulatory Compliance | Blue-Green | Easier to audit and validate entire environment before promotion; deterministic state. | High |
| A/B Testing Required | Canary | Native support for traffic splitting based on user attributes or headers. | Low |
Configuration Template
TypeScript template for Pulumi defining a reusable deployment strategy component.
import * as pulumi from "@pulumi/pulumi";
import * as k8s from "@pulumi/kubernetes";

export interface DeploymentArgs {
  appName: string;
  namespace: string;
  image: string;
  strategy: 'blue-green' | 'canary';
  canaryWeight?: number; // Only for canary
  replicas: number;
  port: number;
}

export class DeploymentStrategy extends pulumi.ComponentResource {
  constructor(name: string, args: DeploymentArgs, opts?: pulumi.ComponentResourceOptions) {
    super("codcompass:deployments:Strategy", name, opts);

    const labels = { app: args.appName };

    // Base Deployment
    const deployment = new k8s.apps.v1.Deployment(`${name}-dep`, {
      metadata: { name: args.appName, namespace: args.namespace },
      spec: {
        replicas: args.replicas,
        selector: { matchLabels: labels },
        template: {
          metadata: { labels },
          spec: {
            containers: [{
              name: args.appName,
              image: args.image,
              ports: [{ containerPort: args.port }],
              readinessProbe: {
                httpGet: { path: "/healthz", port: args.port },
                initialDelaySeconds: 5,
                periodSeconds: 10
              }
            }]
          }
        }
      }
    }, { parent: this });

    // Service
    const service = new k8s.core.v1.Service(`${name}-svc`, {
      metadata: { name: `${args.appName}-svc`, namespace: args.namespace },
      spec: {
        selector: labels,
        ports: [{ port: 80, targetPort: args.port }]
      }
    }, { parent: this });

    // Conditional Routing
    if (args.strategy === 'canary' && args.canaryWeight !== undefined) {
      new k8s.networking.v1.Ingress(`${name}-ingress`, {
        metadata: {
          name: `${args.appName}-ingress`,
          namespace: args.namespace,
          annotations: {
            "nginx.ingress.kubernetes.io/canary": "true",
            "nginx.ingress.kubernetes.io/canary-weight": args.canaryWeight.toString()
          }
        },
        spec: {
          rules: [{
            http: {
              paths: [{
                path: "/",
                pathType: "Prefix",
                backend: {
                  service: { name: `${args.appName}-svc`, port: { number: 80 } }
                }
              }]
            }
          }]
        }
      }, { parent: this });
    }

    // The load balancer IP is populated only for Services of type LoadBalancer;
    // with the default ClusterIP type used above, this output stays undefined.
    this.registerOutputs({
      serviceEndpoint: service.status.apply(s => s?.loadBalancer?.ingress?.[0]?.ip)
    });
  }
}
Quick Start Guide
- Instrument Application: Add /healthz and /metrics endpoints. Ensure metrics expose error rates and latency histograms.
- Configure Router: Deploy NGINX Ingress Controller or install Istio. Verify that weight-based routing is supported (canary annotations for NGINX, VirtualService route weights for Istio).
- Define Policy: Set SLO thresholds (e.g., Error Rate < 0.1%, P99 Latency < 200ms). Configure rollback automation to trigger on breach.
- Deploy Canary: Push new version with canary: true and weight 5. Monitor metrics for 5 minutes.
- Advance or Abort: If metrics pass, increase weight to 25%, then 50%, then 100%. If metrics fail, automation reverts weight to 0% and alerts on-call.
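The policy step above could be expressed in code along these lines. This is a sketch only: `SloPolicy`, `ObservedMetrics`, and `findSloBreaches` are hypothetical names, and the thresholds simply restate the example values from the list (error rate below 0.1%, P99 latency below 200 ms).

```typescript
interface SloPolicy {
  maxErrorRate: number;    // fraction, e.g. 0.001 for 0.1%
  maxLatencyP99Ms: number; // milliseconds
}

interface ObservedMetrics {
  errorRate: number;    // fraction of failed requests in the window
  latencyP99Ms: number; // 99th percentile latency in the window
}

// Thresholds taken from the example policy in the list above.
const examplePolicy: SloPolicy = { maxErrorRate: 0.001, maxLatencyP99Ms: 200 };

// Hypothetical helper: returns one message per breached objective.
// An empty result means the canary may advance to the next weight.
function findSloBreaches(observed: ObservedMetrics, policy: SloPolicy): string[] {
  const breaches: string[] = [];
  if (observed.errorRate >= policy.maxErrorRate) {
    breaches.push(`error rate ${observed.errorRate} is not below ${policy.maxErrorRate}`);
  }
  if (observed.latencyP99Ms >= policy.maxLatencyP99Ms) {
    breaches.push(`P99 latency ${observed.latencyP99Ms}ms is not below ${policy.maxLatencyP99Ms}ms`);
  }
  return breaches;
}
```

Returning the breach messages rather than a boolean gives the rollback automation something concrete to include in the on-call alert.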