nfrastructure to accept a broad range of instance families that meet your CPU/Memory requirements. This increases the surface area for available capacity.
2. Interruption Handling
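Spot instances receive a 2-minute interruption notice before termination, exposed through the instance metadata service (IMDS) and as an EventBridge event. Processes typically see SIGTERM only once the orchestrator drains the node or the OS begins shutting down. Your architecture must handle both gracefully: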
- Stateless Services: Rely on the orchestrator to reschedule pods/containers. Ensure health checks allow time for draining.
- Stateful/Tolerant Jobs: Implement checkpointing. Save progress to external storage (S3, EBS snapshots, databases) periodically so interrupted jobs can resume rather than restart.
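The checkpointing idea above can be sketched as follows. This is a minimal illustration, not a production implementation: `CheckpointStore`, `InMemoryStore`, and `runJob` are hypothetical names, and the in-memory store stands in for external storage such as S3 or a database.

```typescript
// Hypothetical storage interface; in practice this could wrap S3 or a database.
interface CheckpointStore {
  load(jobId: string): number | undefined;
  save(jobId: string, nextIndex: number): void;
}

// In-memory stand-in so the sketch runs without any cloud dependency.
class InMemoryStore implements CheckpointStore {
  private readonly data = new Map<string, number>();
  load(jobId: string): number | undefined {
    return this.data.get(jobId);
  }
  save(jobId: string, nextIndex: number): void {
    this.data.set(jobId, nextIndex);
  }
}

// Processes items starting at the last saved index and saves progress every
// `interval` items. `shouldStop` simulates an interruption notice.
// Returns true when the job finished, false when it stopped early.
function runJob<T>(
  jobId: string,
  items: T[],
  handle: (item: T) => void,
  store: CheckpointStore,
  interval: number,
  shouldStop: () => boolean = () => false,
): boolean {
  let next = store.load(jobId) ?? 0;
  while (next < items.length) {
    if (shouldStop()) {
      store.save(jobId, next); // persist progress before exiting
      return false;
    }
    handle(items[next]);
    next += 1;
    if (next % interval === 0) {
      store.save(jobId, next);
    }
  }
  store.save(jobId, next);
  return true;
}
```

A replacement instance that calls `runJob` with the same `jobId` and store continues from the saved index instead of starting over. The checkpoint interval trades storage writes against the amount of work repeated after an interruption.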
3. Implementation Example: AWS CDK with Diversified Spot Fleet
The following TypeScript example uses AWS CDK to configure a diversified Spot Fleet with a fixed On-Demand baseline. The baseline keeps a minimum amount of capacity running even when Spot capacity is unavailable, so the workload is never starved entirely.
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class SpotOptimizedStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Define a diverse set of instance types across families and generations
    const instanceTypes = [
      'm5.large', 'm5.xlarge', 'm6i.large', 'm6i.xlarge',
      'c5.large', 'c5.xlarge', 'c6i.large', 'c6i.xlarge',
    ];

    // Role that lets the Spot Fleet service launch and tag instances (required by CfnSpotFleet)
    const fleetRole = new iam.Role(this, 'SpotFleetRole', {
      assumedBy: new iam.ServicePrincipal('spotfleet.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2SpotFleetTaggingRole'),
      ],
    });

    // Create a Launch Template with Spot optimization
    const launchTemplate = new ec2.CfnLaunchTemplate(this, 'SpotLaunchTemplate', {
      launchTemplateData: {
        instanceType: 'm5.large', // Base type, overridden by fleet strategy
        imageId: ec2.MachineImage.latestAmazonLinux2().getImage(this).imageId,
        securityGroupIds: [this.createSecurityGroup().securityGroupId],
        // Enable detailed monitoring for interruption analysis
        monitoring: { enabled: true },
        instanceMarketOptions: {
          marketType: 'spot',
          spotOptions: {
            spotInstanceType: 'one-time', // Request terminates on interruption
            instanceInterruptionBehavior: 'terminate',
          },
        },
      },
    });

    // Configure Spot Fleet with Diversification.
    // CfnSpotFleet lives in the aws-ec2 module, not aws-autoscaling.
    const spotFleet = new ec2.CfnSpotFleet(this, 'DiversifiedSpotFleet', {
      spotFleetRequestConfigData: {
        iamFleetRole: fleetRole.roleArn,
        type: 'maintain', // Keep target capacity; needed for replaceUnhealthyInstances
        targetCapacity: 10,
        // Scaling bounds (for example 2 to 50) are not fleet properties;
        // they are set on an Application Auto Scaling scalable target.
        // On-Demand baseline ensures some capacity is always available
        onDemandTargetCapacity: 2,
        allocationStrategy: 'diversified',
        launchTemplateConfigs: [{
          launchTemplateSpecification: {
            launchTemplateId: launchTemplate.ref,
            version: launchTemplate.attrLatestVersionNumber,
          },
          // To spread across AZs explicitly, add a subnetId to each override
          overrides: instanceTypes.map(type => ({
            instanceType: type,
          })),
        }],
        // Replace unhealthy instances automatically
        replaceUnhealthyInstances: true,
      },
    });

    // Output the Fleet ID for integration
    new cdk.CfnOutput(this, 'SpotFleetId', { value: spotFleet.ref });
  }

  private createSecurityGroup(): ec2.SecurityGroup {
    return new ec2.SecurityGroup(this, 'SpotSG', {
      // Vpc.fromLookup requires the stack to have an explicit account and region
      vpc: ec2.Vpc.fromLookup(this, 'Vpc', { isDefault: true }),
      allowAllOutbound: true,
    });
  }
}
Architecture Rationale:
- allocationStrategy: 'diversified': Spreads instances across the provided instance types (Spot capacity pools), minimizing the blast radius when a single type is interrupted.
- onDemandTargetCapacity: Guarantees a baseline of stable capacity. The fleet fills the remainder with Spot, balancing cost and reliability.
- replaceUnhealthyInstances: true: Enables self-healing by replacing instances that fail health checks. A fleet of type maintain (the default) also launches replacements when Spot instances are reclaimed.
Pitfall Guide
Ignoring the 2-Minute Warning
- Mistake: Failing to poll the IMDS endpoint (http://169.254.169.254/latest/meta-data/spot/instance-action) or to handle SIGTERM.
- Consequence: Abrupt termination leads to data loss in long-running processes and failed requests in services.
- Fix: Implement a daemon or sidecar that polls IMDS and triggers graceful shutdown procedures (drain connections, flush buffers) immediately upon detection.
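A polling check of this kind might look like the sketch below. It assumes IMDSv2 (a session token fetched with a PUT before each metadata read) and injects the HTTP call through a hypothetical `HttpCall` type so the logic can be exercised off-instance; `parseInstanceAction` and `checkForInterruption` are illustrative names, not part of any AWS SDK.

```typescript
const IMDS_BASE = 'http://169.254.169.254';

interface InstanceAction {
  action: string; // for example "terminate", "stop", or "hibernate"
  time: string;   // UTC timestamp of the scheduled action
}

// Hypothetical minimal HTTP abstraction; wire it to fetch or an HTTP client of your choice.
type HttpCall = (
  method: 'GET' | 'PUT',
  url: string,
  headers: Record<string, string>,
) => Promise<{ status: number; body: string }>;

// IMDS answers 404 until an interruption is scheduled, then returns a JSON document.
function parseInstanceAction(status: number, body: string): InstanceAction | null {
  if (status !== 200) {
    return null;
  }
  try {
    const parsed = JSON.parse(body) as Partial<InstanceAction>;
    if (typeof parsed.action === 'string' && typeof parsed.time === 'string') {
      return { action: parsed.action, time: parsed.time };
    }
  } catch (_err) {
    // Malformed body: treat as "no notice" and let the next poll retry.
  }
  return null;
}

// One polling step: obtain an IMDSv2 token, then read the instance-action document.
async function checkForInterruption(http: HttpCall): Promise<InstanceAction | null> {
  const token = await http('PUT', `${IMDS_BASE}/latest/api/token`, {
    'X-aws-ec2-metadata-token-ttl-seconds': '60',
  });
  if (token.status !== 200) {
    return null;
  }
  const res = await http('GET', `${IMDS_BASE}/latest/meta-data/spot/instance-action`, {
    'X-aws-ec2-metadata-token': token.body,
  });
  return parseInstanceAction(res.status, res.body);
}
```

A daemon or sidecar would call `checkForInterruption` every few seconds and start draining connections and flushing buffers as soon as it returns a non-null value.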
State on Ephemeral Storage
- Mistake: Storing application state, logs, or database files on instance store volumes or root EBS volumes without snapshotting.
- Consequence: Interruption results in permanent state loss.
- Fix: Externalize all state. Use managed databases, object storage, or distributed caches. If using local storage, implement frequent checkpointing to remote storage.
Single Availability Zone Pinning
- Mistake: Restricting Spot requests to a single AZ to reduce latency or simplify networking.
- Consequence: AZ-level capacity reclamation events take down the entire workload.
- Fix: Configure multi-AZ deployments. Use latency-based routing or global load balancers to distribute traffic across regions if necessary.
Over-Provisioning without Auto-Scaling
- Mistake: Setting a fixed target capacity for Spot fleets that matches peak load.
- Consequence: Paying for idle capacity during low-traffic periods, negating Spot savings.
- Fix: Integrate Spot fleets with auto-scaling policies based on CPU, memory, or custom queue depth metrics. Scale in aggressively when demand drops.
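As a rough illustration of scaling on queue depth, the function below derives a target capacity from the current backlog. The name `desiredCapacity`, the per-instance backlog figure, and the bounds are assumptions for the example; a real policy would feed the result into your auto-scaling mechanism.

```typescript
// Computes how many instances are needed so that each one handles at most
// `backlogPerInstance` queued messages, clamped to [minCapacity, maxCapacity].
function desiredCapacity(
  queueDepth: number,
  backlogPerInstance: number,
  minCapacity: number,
  maxCapacity: number,
): number {
  if (backlogPerInstance <= 0) {
    throw new RangeError('backlogPerInstance must be positive');
  }
  if (minCapacity > maxCapacity) {
    throw new RangeError('minCapacity must not exceed maxCapacity');
  }
  const needed = Math.ceil(Math.max(0, queueDepth) / backlogPerInstance);
  return Math.min(maxCapacity, Math.max(minCapacity, needed));
}

// Example with hypothetical numbers: 950 queued messages, 100 per instance,
// bounds of 2 and 50 instances.
const target = desiredCapacity(950, 100, 2, 50);
console.log(`target capacity: ${target}`);
```

Clamping to a minimum keeps a small floor for availability, while the maximum caps spend during traffic spikes.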
Using Spot for Latency-Sensitive Singletons
- Mistake: Running a single instance of a latency-critical service on Spot.
- Consequence: Interruption causes immediate downtime until replacement provisions.
- Fix: Never run singletons on Spot. Use redundancy (N+1) so that the interruption of one node does not impact availability. Reserve Spot for horizontal scale-out workloads.
Neglecting Interruption Rate Monitoring
- Mistake: Assuming Spot is always cheap and stable.
- Consequence: Instance types can spike in price or become unavailable. A "cheap" instance type with 50% interruption rate may be more expensive than a slightly more expensive type with 1% interruption due to re-computation costs.
- Fix: Monitor interruption rates and pricing trends. Rotate instance types dynamically based on availability scores.
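The re-computation argument can be made concrete with a simple model. The sketch assumes the interruption rate can be read as the fraction of paid compute hours whose work is lost and must be redone, which is a simplification; the prices are hypothetical and `effectiveHourlyCost` is an illustrative name.

```typescript
// Cost per hour of useful (non-repeated) work, given the fraction of paid
// hours whose output is lost to interruptions.
function effectiveHourlyCost(pricePerHour: number, lostWorkFraction: number): number {
  if (lostWorkFraction < 0 || lostWorkFraction >= 1) {
    throw new RangeError('lostWorkFraction must be in [0, 1)');
  }
  return pricePerHour / (1 - lostWorkFraction);
}

// Hypothetical prices: a "cheap" type losing half its work versus a slightly
// pricier type losing 1% of its work.
const cheapButUnstable = effectiveHourlyCost(0.03, 0.5);   // roughly 0.060 per useful hour
const pricierButStable = effectiveHourlyCost(0.04, 0.01);  // roughly 0.0404 per useful hour

console.log(cheapButUnstable > pricierButStable
  ? 'The stable type is cheaper per useful hour.'
  : 'The cheap type is still cheaper per useful hour.');
```

Frequent checkpointing lowers the lost-work fraction for a given interruption rate, which is why the two measures are worth tracking together.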
EBS Cost Blindness
- Mistake: Focusing only on compute savings while leaving large EBS volumes attached to terminated instances.
- Consequence: "Zombie" storage costs accumulate.
- Fix: Configure DeleteOnTermination for ephemeral volumes. Use lifecycle policies for persistent volumes. Automate orphaned volume cleanup.
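The selection step of an orphaned-volume cleanup could be sketched like this. `VolumeInfo` is a hypothetical, simplified shape rather than an AWS SDK type, and the `retain` tag is an assumed opt-out convention; the sketch only picks candidates and leaves deletion to a separate, reviewed step.

```typescript
// Simplified, hypothetical view of an EBS volume.
interface VolumeInfo {
  id: string;
  state: 'available' | 'in-use'; // 'available' means not attached to any instance
  createTime: Date;
  tags: Record<string, string>;
}

// Returns IDs of unattached volumes older than `minAgeHours` that do not
// carry the opt-out tag. Age is measured from creation time, which is a
// conservative proxy for "detached long ago".
function findOrphanedVolumes(
  volumes: VolumeInfo[],
  now: Date,
  minAgeHours: number,
  keepTag: string = 'retain',
): string[] {
  const cutoff = now.getTime() - minAgeHours * 3600 * 1000;
  return volumes
    .filter(v => v.state === 'available')
    .filter(v => v.createTime.getTime() <= cutoff)
    .filter(v => !(keepTag in v.tags))
    .map(v => v.id);
}
```

Running the selection on a schedule and reporting the candidates before deleting them gives teams a chance to tag volumes they still need.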
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact (savings vs. On-Demand) |
|---|---|---|---|
| Batch Processing / ML Training | Pure Diversified Spot | Workloads are fault-tolerant; checkpointing handles interruptions. | ~85-90% |
| Web Frontend / API Gateway | Spot + On-Demand Fallback | High availability required; auto-scaling absorbs interruptions. | ~60-65% |
| CI/CD Build Agents | Spot with Short TTL | Agents are ephemeral; interruption only affects a single build. | ~75% |
| Stateful Database | On-Demand / Reserved | State cannot be moved; consistency requirements preclude Spot. | 0% |
| Data Processing Pipeline | Spot with Checkpointing | Intermediate results saved externally; resume capability essential. | ~70-80% |
| Development / QA Environments | Pure Spot | Tolerance for downtime is high; cost minimization is priority. | ~90% |
Configuration Template
Karpenter Provisioner for Kubernetes
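Karpenter is a widely used node provisioner for Kubernetes on AWS. This YAML configures a diversified Spot provisioner with consolidation enabled. To allow fallback when Spot capacity is unavailable, add "on-demand" to the capacity-type values; Karpenter prefers Spot when both are listed.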
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-diversified
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      # Diversify across families to maximize availability
      values: ["m5.large", "m5.xlarge", "m6i.large", "c5.large", "c5.xlarge", "c6i.large"]
    - key: topology.kubernetes.io/zone
      operator: In
      # Spread across AZs
      values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  limits:
    resources:
      cpu: "200"
      memory: 800Gi
  ttlSecondsUntilExpired: 2592000 # 30 days, forces recycling for updates
  # Consolidation also removes empty nodes; in v1alpha5 it cannot be
  # combined with ttlSecondsAfterEmpty, so that field is omitted here.
  consolidation:
    enabled: true
  provider:
    subnetSelector:
      karpenter.sh/discovery: my-cluster
    securityGroupSelector:
      karpenter.sh/discovery: my-cluster
    tags:
      managed-by: karpenter
      cost-center: engineering
  weight: 100 # Priority relative to other Provisioners
Quick Start Guide
- Identify a Candidate: Select a stateless deployment or a batch job that currently runs on On-Demand. Ensure it has no local state dependencies.
- Add Interruption Handling: If using Kubernetes, ensure your pods have terminationGracePeriodSeconds set appropriately and handle SIGTERM. If using EC2, add a script to /etc/rc.local or a systemd service that monitors IMDS for the interruption warning.
- Update Provisioning: Modify your IaC (Terraform/CDK/CloudFormation) to use the diversified Spot configuration template. Set an initial On-Demand fallback of 10-20% of target capacity.
- Deploy and Observe: Roll out the configuration. Monitor the orchestration logs for interruption events. Verify that replacement instances launch within a few minutes and that traffic is drained gracefully.
- Optimize: After 24 hours, review the interruption rate. If interruptions are frequent on specific instance types, remove them from the diversification list. Adjust the On-Demand fallback ratio based on stability requirements to maximize savings.
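The 10-20% On-Demand fallback from the Update Provisioning step, and its effect on savings, can be estimated with a small calculation. This is a back-of-the-envelope sketch: `splitCapacity` and `blendedSavings` are illustrative names, and the 70% Spot discount is an assumed figure, not a quoted price.

```typescript
interface CapacitySplit {
  onDemand: number;
  spot: number;
}

// Splits a target capacity into On-Demand and Spot counts, rounding the
// On-Demand share up so the stable baseline is never smaller than requested.
function splitCapacity(targetCapacity: number, onDemandPercent: number): CapacitySplit {
  if (onDemandPercent < 0 || onDemandPercent > 100) {
    throw new RangeError('onDemandPercent must be between 0 and 100');
  }
  const onDemand = Math.ceil((targetCapacity * onDemandPercent) / 100);
  return { onDemand, spot: targetCapacity - onDemand };
}

// Fraction saved compared with running everything On-Demand, assuming all
// instances have the same On-Demand price and Spot is discounted uniformly.
function blendedSavings(split: CapacitySplit, spotDiscount: number): number {
  const total = split.onDemand + split.spot;
  return total === 0 ? 0 : (split.spot / total) * spotDiscount;
}

// Example: 10 instances, 20% On-Demand, assumed 70% Spot discount.
const split = splitCapacity(10, 20);
const savings = blendedSavings(split, 0.7);
console.log(`${split.onDemand} On-Demand, ${split.spot} Spot, ~${Math.round(savings * 100)}% saved`);
```

Lowering the On-Demand percentage during the Optimize step raises the blended savings, at the cost of a smaller stable baseline.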