Step 1: Extract and Normalize Historical Usage
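Cost Explorer provides aggregated data, but a precise baseline calculation requires hourly granularity. We query the Cost Explorer API for the past 90 days, normalize the data, and identify the consistent floor. Hourly granularity has to be enabled in the Cost Explorer preferences, and Cost Explorer keeps hourly data for only the most recent 14 days, so a full 90-day hourly window has to be assembled from the Cost and Usage Report (CUR) or from stored results of earlier queries.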
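import {
  CostExplorerClient,
  GetCostAndUsageCommand,
  GetCostAndUsageCommandOutput,
} from "@aws-sdk/client-cost-explorer";

interface UsagePoint {
  timestamp: Date;
  hourlySpend: number;
}

export class BaselineCalculator {
  private client: CostExplorerClient;

  constructor(region: string) {
    this.client = new CostExplorerClient({ region });
  }

  // Note: Cost Explorer serves hourly data only for the most recent 14 days, and only
  // when hourly granularity is enabled. Pass a smaller `days` value or source older
  // history from the Cost and Usage Report.
  async extractHourlySpend(days: number = 90): Promise<UsagePoint[]> {
    // HOURLY granularity expects timestamps with a time component (yyyy-MM-ddTHH:mm:ssZ).
    const toHourStamp = (d: Date) => `${d.toISOString().slice(0, 13)}:00:00Z`;
    const endDate = new Date();
    const startDate = new Date(endDate.getTime() - days * 24 * 60 * 60 * 1000);

    const points: UsagePoint[] = [];
    let nextPageToken: string | undefined;
    do {
      // GetCostAndUsage returns cost metrics such as UnblendedCost per time bucket.
      const response: GetCostAndUsageCommandOutput = await this.client.send(
        new GetCostAndUsageCommand({
          TimePeriod: { Start: toHourStamp(startDate), End: toHourStamp(endDate) },
          Granularity: "HOURLY",
          Metrics: ["UnblendedCost"],
          NextPageToken: nextPageToken,
        })
      );
      for (const bucket of response.ResultsByTime ?? []) {
        const start = bucket.TimePeriod?.Start;
        if (!start) continue; // skip buckets without a usable timestamp
        points.push({
          timestamp: new Date(start),
          hourlySpend: parseFloat(bucket.Total?.UnblendedCost?.Amount ?? "0"),
        });
      }
      nextPageToken = response.NextPageToken; // keep paging until all buckets are read
    } while (nextPageToken);

    return points.sort((a, b) => a.timestamp.getTime() - b.timestamp.getTime());
  }
}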
Architecture Rationale: We use HOURLY granularity instead of DAILY because daily aggregation masks intra-day scaling events. The 90-day window captures seasonal patterns and deployment cycles. Sorting chronologically enables time-series analysis for baseline detection.
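To illustrate why daily aggregation can hide the floor, the following sketch compares one day's average hourly spend with its lowest hour. The helper name `summarizeDay` and the sample values are hypothetical and are not part of the Cost Explorer API.

```typescript
interface DaySummary {
  dailyAverage: number; // the per-hour figure a DAILY total effectively implies
  hourlyFloor: number;  // the lowest hourly spend seen during the day
}

function summarizeDay(hourlySpend: number[]): DaySummary {
  if (hourlySpend.length === 0) {
    throw new Error("hourlySpend must contain at least one value");
  }
  const total = hourlySpend.reduce((sum, value) => sum + value, 0);
  return {
    dailyAverage: total / hourlySpend.length,
    hourlyFloor: Math.min(...hourlySpend),
  };
}

// Made-up day: $40/hour during an 8-hour scale-out (09:00-17:00), $10/hour otherwise.
const sampleDay: number[] = [];
for (let hour = 0; hour < 24; hour++) {
  sampleDay.push(hour >= 9 && hour < 17 ? 40 : 10);
}

const summary = summarizeDay(sampleDay);
console.log(summary); // { dailyAverage: 20, hourlyFloor: 10 }
```

In this made-up day, a commitment sized to the $20 average would sit partly unused during the 16 off-peak hours, while the $10 hourly floor is consumed around the clock.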
Step 2: Calculate Commitment Floor
The commitment floor represents the hourly spend level that remains stable across 80-90% of the observation window. We calculate this using percentile filtering rather than simple averages.
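export class CommitmentSizingEngine {
  // Returns the hourly spend level that is met or exceeded in `coverage` of the observed hours.
  // With values sorted ascending, that level sits at the (1 - coverage) quantile.
  calculateBaselineFloor(usagePoints: UsagePoint[], coverage: number = 0.85): number {
    if (usagePoints.length === 0) return 0;
    const spendValues = usagePoints.map((p) => p.hourlySpend).sort((a, b) => a - b);
    const floorIndex = Math.floor(spendValues.length * (1 - coverage));
    return spendValues[Math.min(floorIndex, spendValues.length - 1)];
  }

  // Scales the floor down so the commitment stays below the observed baseline.
  generateRecommendation(baseline: number, safetyMargin: number = 0.9): number {
    return baseline * safetyMargin;
  }
}
Architecture Rationale: The floor is the level that hourly spend meets or exceeds in 85% of observed hours, which is the 15th percentile of the ascending-sorted values. The value at the 85th percentile would instead be a level reached in only about 15% of hours, which is the peak-load overcommitment this method is meant to avoid. Using a percentile rather than the minimum keeps a few transient dips from dragging the floor down, and using it rather than the mean keeps transient spikes from inflating it. The 0.9 safety margin prevents overcommitment by leaving 10% headroom for unexpected baseline drift. This approach replaces manual Cost Explorer estimation with deterministic statistical modeling.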
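As a worked example of the floor described above (the level that hourly spend meets or exceeds for a chosen share of hours), the sketch below runs the calculation on a small made-up series. The function `floorAtCoverage`, its `coverage` parameter, and the sample values are illustrative assumptions, not part of any AWS API.

```typescript
// Returns the spend level that is met or exceeded in roughly `coverage` of the hours.
// With values sorted ascending, that is the value at the (1 - coverage) quantile.
function floorAtCoverage(hourlySpend: number[], coverage: number): number {
  if (hourlySpend.length === 0) return 0;
  if (coverage <= 0 || coverage > 1) {
    throw new RangeError("coverage must be in (0, 1]");
  }
  const sorted = [...hourlySpend].sort((a, b) => a - b);
  const index = Math.floor(sorted.length * (1 - coverage));
  return sorted[Math.min(index, sorted.length - 1)];
}

// Twenty made-up hourly spend values in $/hour, including a few dips and spikes.
const sampleSpend: number[] = [
  12, 10, 45, 11, 13, 4, 10, 60, 12, 14,
  6, 15, 11, 30, 8, 18, 10, 22, 13, 12,
];

const floor = floorAtCoverage(sampleSpend, 0.85);
const hoursAtOrAboveFloor = sampleSpend.filter((v) => v >= floor).length;
const recommendation = floor * 0.9; // apply the 0.9 safety margin

console.log(floor, hoursAtOrAboveFloor, recommendation); // 10 17 9
```

Here 17 of the 20 hours (85%) run at or above $10/hour, so the sketch proposes a $9/hour commitment and leaves the $45 and $60 spikes to On-Demand or Spot.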
Step 3: Automate Utilization Monitoring
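Post-purchase drift detection requires near-real-time monitoring through CloudWatch. We configure an alarm that triggers when utilization drops below the 85% threshold, bypassing the 72-hour dashboard refresh lag. The alarm assumes that a UtilizationPercentage metric with a SavingsPlanArn dimension is reporting under the AWS/SavingsPlans namespace. Confirm that the metric exists in your account before deploying, because an alarm on a metric that never reports stays in ALARM when missing data is treated as breaching.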
import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

export class UtilizationWatcher {
  private cwClient: CloudWatchClient;

  constructor(region: string) {
    this.cwClient = new CloudWatchClient({ region });
  }

  // Creates (or updates) an hourly alarm on the plan's utilization percentage.
  async deployDriftAlarm(commitmentArn: string, threshold: number = 85) {
    const alarmConfig = new PutMetricAlarmCommand({
      AlarmName: `SavingsPlan-Utilization-Drift-${commitmentArn.slice(-8)}`,
      ComparisonOperator: "LessThanThreshold",
      EvaluationPeriods: 2,
      MetricName: "UtilizationPercentage",
      Namespace: "AWS/SavingsPlans",
      Period: 3600, // one datapoint per hour
      Statistic: "Average",
      Threshold: threshold,
      AlarmDescription: "Triggers when Savings Plan utilization drops below target, indicating commitment drift.",
      Dimensions: [
        {
          Name: "SavingsPlanArn",
          Value: commitmentArn,
        },
      ],
      TreatMissingData: "breaching", // a gap in the metric counts as a breach
    });
    await this.cwClient.send(alarmConfig);
  }
}
Architecture Rationale: TreatMissingData: "breaching" counts missing datapoints as breaches, so the alarm fires once metric ingestion has paused for the evaluation window instead of letting drift go unnoticed. Two evaluation periods at hourly frequency balance alert responsiveness against noise. The alarm targets UtilizationPercentage directly, bypassing Cost Explorer's delayed aggregation.
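The interaction between the threshold, EvaluationPeriods, and TreatMissingData can be sketched with a small simulation. This is a simplified model for intuition only: CloudWatch's actual handling of missing data looks at a wider evaluation range than shown here, and the function name `wouldAlarm` and the sample series are hypothetical.

```typescript
// An hourly utilization percentage, or null when no datapoint arrived for that hour.
type Datapoint = number | null;

// Simplified model: the alarm fires when the most recent `evaluationPeriods`
// datapoints are all breaching. A missing datapoint counts as breaching,
// mirroring TreatMissingData: "breaching".
function wouldAlarm(
  series: Datapoint[],
  threshold: number = 85,
  evaluationPeriods: number = 2
): boolean {
  if (series.length < evaluationPeriods) return false;
  const recent = series.slice(-evaluationPeriods);
  return recent.every((point) => point === null || point < threshold);
}

console.log(wouldAlarm([92, 90, 84]));   // false: a single low hour is not enough
console.log(wouldAlarm([92, 84, 80]));   // true: two consecutive hours below 85
console.log(wouldAlarm([92, 84, null])); // true: the missing hour counts as a breach
console.log(wouldAlarm([92, null, 91])); // false: the latest hour is healthy
```

Requiring two consecutive breaching hours is what filters out one-off dips in this model, at the cost of a one-hour delay before a sustained drop is reported.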
Step 4: Plan Type Selection Logic
The final architectural decision involves choosing between EC2 Instance Savings Plans and Compute Savings Plans. The selection matrix should be codified into deployment pipelines.
export class PlanTypeSelector {
  // Returns "INSTANCE" only when every lock-in condition holds; otherwise defaults to "COMPUTE".
  static evaluate(
    workloadStabilityMonths: number,
    familyLockRequired: boolean,
    regionConstraint: boolean
  ): "COMPUTE" | "INSTANCE" {
    if (workloadStabilityMonths >= 24 && familyLockRequired && regionConstraint) {
      return "INSTANCE";
    }
    return "COMPUTE";
  }
}
Architecture Rationale: EC2 Instance Savings Plans offer ~10-15% higher discounts but require proven stability in instance family and region; instance size, operating system, and tenancy can still change within the family. Compute Savings Plans sacrifice marginal discount depth for architectural mobility. The selector enforces policy: only lock to instance families when stability is at least 24 months and migration paths are explicitly blocked.
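One way to reason about this trade-off is to ask how much of an EC2 Instance Savings Plan must stay utilized before its deeper discount is erased. The sketch below works through that break-even under simplifying assumptions: the discount rates are placeholders rather than AWS price-list values, the Compute plan is assumed to stay fully utilized, and `effectiveDiscount` and `breakEvenUtilization` are hypothetical helpers.

```typescript
// Discount actually realized when only `utilization` (0 < u <= 1) of a commitment
// is consumed: the whole commitment is billed, but only the used share displaces
// On-Demand spend, so the realized discount is 1 - (1 - d) / u.
function effectiveDiscount(planDiscount: number, utilization: number): number {
  if (utilization <= 0 || utilization > 1) {
    throw new RangeError("utilization must be in (0, 1]");
  }
  return 1 - (1 - planDiscount) / utilization;
}

// Utilization an instance-locked plan must sustain to match a fully used Compute plan.
function breakEvenUtilization(instanceDiscount: number, computeDiscount: number): number {
  return (1 - instanceDiscount) / (1 - computeDiscount);
}

const instanceDiscount = 0.4; // placeholder rate, not an AWS quote
const computeDiscount = 0.3;  // placeholder rate, not an AWS quote

const breakEven = breakEvenUtilization(instanceDiscount, computeDiscount);
console.log(breakEven.toFixed(3)); // 0.857
```

With these placeholder rates, the instance-locked plan only comes out ahead while more than about 86% of it stays in use; a migration that strands a larger share erases its advantage.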
Pitfall Guide
1. Pre-Commitment Right-Sizing Neglect
Explanation: Purchasing commitments before optimizing instance sizes locks in discounted rates for over-provisioned capacity. The discount applies to the committed hourly rate, not to unused CPU or memory headroom. An m5.4xlarge running at 15% CPU will still consume the full committed rate, while an m5.large would satisfy the workload at a fraction of the cost.
Fix: Run AWS Compute Optimizer for 14 days to analyze CloudWatch metrics. Rightsize instances first, operate for 30 days to establish post-optimization baselines, then purchase commitments.
2. Peak-Load Overcommitment
Explanation: Sizing commitments to cover maximum scaling events forces you to pay committed rates for capacity that only exists intermittently. Auto Scaling groups naturally create usage volatility. Committing to the peak converts temporary demand into permanent financial obligation.
Fix: Extract 60-90 days of hourly usage from Cost Explorer. Identify the consistent floor (the level that hourly spend stays at or above for nearly all of the window). Commit to 80-90% of that floor. Route peak demand through On-Demand or Spot pricing.
3. Rigid Instance Family Locking
Explanation: EC2 Instance Savings Plans bind commitments to a specific instance family and region. Workloads that migrate to newer generations (e.g., m5 to m6i), shift regions, or transition to Fargate stop matching the commitment as soon as they move. The discount no longer applies to the new usage while billing for the plan continues.
Fix: Default to Compute Savings Plans for any workload with a documented migration roadmap. Reserve EC2 Instance Savings Plans for workloads that have remained unchanged for 12+ months and include a mandatory 6-month architectural review checkpoint.
4. Term Length Mismatch
Explanation: 3-year commitments offer approximately 10-15% additional discount over 1-year terms. This math only holds if workloads remain architecturally stable for the full 36 months. Mid-term migrations or refactoring efforts convert the remaining term into pure waste. A $50,000/month commitment with 18 months remaining after migration generates $900,000 ($50,000 × 18) in non-offset spend.
Fix: Align term length with architectural stability horizons. If Kubernetes migration, major refactoring, or significant scale changes are planned within 24 months, restrict purchases to 1-year terms.
5. Post-Purchase Utilization Blindness
Explanation: Commitments continue billing regardless of utilization. Default AWS dashboards refresh every 72+ hours, creating a detection lag in which utilization drops of 40-50% go unnoticed for multiple days. At a $10,000/month commitment, 60% utilization generates $4,000/month in waste, accumulating to $48,000 annually.
Fix: Deploy CloudWatch alarms on UtilizationPercentage metrics. Configure alerts for thresholds below 85%. Implement weekly utilization reviews instead of quarterly audits.
6. Cross-Account Fragmentation
Explanation: A Savings Plan applies first to the account that purchased it. Within AWS Organizations, leftover commitment reaches other accounts only while discount sharing is enabled for them. Where sharing has been turned off, one account's unused commitment remains isolated while another account pays full On-Demand rates for eligible usage. This fragmentation routinely wastes 20-30% of total commitment value.
Fix: In the AWS Organizations management account, confirm in Billing and Cost Management that commitment sharing is enabled for every account that should participate. Verify that coverage reports show linked accounts consuming shared commitments before purchasing additional plans per account.
7. Stale Recommendation Reliance
Explanation: AWS Cost Explorer recommendations update on a 72-hour cycle. Purchasing based on weekend-stale data commits to usage patterns that may have already shifted. Batch workload completions, decommissioned environments, or scaling policy changes render recommendations obsolete within hours.
Fix: Cross-reference Cost Explorer recommendations against the Cost and Usage Report (CUR). Analyze the past 7 days of actual hourly usage. Recalculate manually if material changes occurred within the last 72 hours.
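The waste arithmetic behind these pitfalls is simple enough to script and rerun whenever utilization or plans change. The sketch below uses two hypothetical helpers, `monthlyUtilizationWaste` and `strandedCommitment`. The first call reproduces the $10,000/month, 60% utilization case from pitfall 5; the second uses made-up inputs purely for illustration.

```typescript
// Committed spend that buys nothing in a month at a given utilization (0-100%).
function monthlyUtilizationWaste(monthlyCommitment: number, utilizationPercent: number): number {
  if (utilizationPercent < 0 || utilizationPercent > 100) {
    throw new RangeError("utilizationPercent must be between 0 and 100");
  }
  return (monthlyCommitment * (100 - utilizationPercent)) / 100;
}

// Commitment still owed after a workload stops using the plan entirely.
function strandedCommitment(monthlyCommitment: number, monthsRemaining: number): number {
  if (monthsRemaining < 0) {
    throw new RangeError("monthsRemaining must not be negative");
  }
  return monthlyCommitment * monthsRemaining;
}

const monthlyWaste = monthlyUtilizationWaste(10000, 60);
const annualWaste = monthlyWaste * 12;
console.log(monthlyWaste, annualWaste); // 4000 48000

// Made-up example: a $20,000/month plan abandoned with 6 months left on the term.
const stranded = strandedCommitment(20000, 6);
console.log(stranded); // 120000
```

Passing utilization as a whole percentage keeps the arithmetic in integers for typical inputs, which avoids floating-point noise in the reported dollar figures.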
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Stable monolith, single region, 3+ year roadmap | EC2 Instance Savings Plan (3-Year) | Maximum discount capture on predictable, unchanging footprint | -40% to -72% vs On-Demand |
| Microservices, frequent family upgrades, multi-region | Compute Savings Plan (1-Year) | Preserves discount coverage during migrations and scaling shifts | -35% to -60% vs On-Demand, near-zero waste |
| Batch processing with daily spikes, unpredictable load | Baseline Commitment + On-Demand/Spot for peaks | Avoids paying committed rates for intermittent capacity | -25% to -45% vs On-Demand, eliminates peak waste |
| Multi-account organization with uneven usage distribution | Shared Compute Savings Plan across OU | Consolidates unused commitment to cover eligible spend elsewhere | Recovers 20-30% of fragmented commitment value |
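The multi-account row can be illustrated with a toy model that compares isolated and shared commitments. It deliberately ignores the order in which AWS applies discounts across accounts and rates; the `Account` shape, the function names, and the hourly figures are all hypothetical.

```typescript
interface Account {
  name: string;
  hourlyCommitment: number;    // Savings Plan commitment owned by this account ($/hour)
  hourlyEligibleUsage: number; // eligible usage priced at Savings Plan rates ($/hour)
}

// Without sharing, each account can only consume its own commitment.
function unusedWithoutSharing(accounts: Account[]): number {
  return accounts.reduce(
    (sum, a) => sum + Math.max(0, a.hourlyCommitment - a.hourlyEligibleUsage),
    0
  );
}

// With sharing, leftover commitment can absorb eligible usage in other accounts.
function unusedWithSharing(accounts: Account[]): number {
  const totalCommitment = accounts.reduce((sum, a) => sum + a.hourlyCommitment, 0);
  const totalUsage = accounts.reduce((sum, a) => sum + a.hourlyEligibleUsage, 0);
  return Math.max(0, totalCommitment - totalUsage);
}

const org: Account[] = [
  { name: "platform", hourlyCommitment: 10, hourlyEligibleUsage: 6 },
  { name: "analytics", hourlyCommitment: 0, hourlyEligibleUsage: 5 },
];

console.log(unusedWithoutSharing(org)); // 4
console.log(unusedWithSharing(org));    // 0
```

In this toy organization, $4/hour of the platform account's commitment goes unused when isolated, while the analytics account's eligible usage would absorb all of it once sharing is in effect.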
Configuration Template
# CloudWatch Alarm Configuration for Savings Plan Drift Detection
# Deploy via CloudFormation or CDK
# Assumes UtilizationPercentage is reporting under AWS/SavingsPlans; verify before deploying.
Parameters:
  SavingsPlanArn:
    Type: String
    Description: ARN of the Savings Plan to monitor
  AlertEmailAddress:
    Type: String
    Description: Email address that receives drift notifications
Resources:
  SavingsPlanUtilizationAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "SavingsPlan-Utilization-Drift-${AWS::StackName}"
      ComparisonOperator: LessThanThreshold
      EvaluationPeriods: 2
      MetricName: UtilizationPercentage
      Namespace: AWS/SavingsPlans
      Period: 3600
      Statistic: Average
      Threshold: 85
      AlarmDescription: "Triggers when Savings Plan utilization drops below 85%, indicating commitment drift or workload migration."
      Dimensions:
        - Name: SavingsPlanArn
          Value: !Ref SavingsPlanArn
      TreatMissingData: breaching
      AlarmActions:
        - !Ref UtilizationDriftSNSTopic
      OKActions:
        - !Ref UtilizationDriftSNSTopic
  UtilizationDriftSNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub "finops-savings-plan-drift-${AWS::Region}"
      Subscription:
        - Protocol: email
          Endpoint: !Ref AlertEmailAddress
Quick Start Guide
- Audit Current Footprint: Run AWS Compute Optimizer for 14 days. Export recommendations and rightsize all target workloads. Wait 30 days to establish post-optimization baselines.
- Calculate Baseline: Use the BaselineCalculator and CommitmentSizingEngine modules to extract 90 days of hourly spend. Identify the floor that spend meets or exceeds in 85% of hours and apply a 0.9 safety margin.
- Purchase Commitment: Select Compute Savings Plans for flexible workloads or EC2 Instance Savings Plans only for proven stable workloads. Commit to the calculated baseline, not peak.
- Deploy Monitoring: Apply the CloudWatch alarm template to track UtilizationPercentage. Configure SNS notifications for <85% threshold breaches. Schedule weekly utilization reviews.
- Enable Sharing: In the AWS Organizations management account, navigate to Billing and Cost Management → Savings Plans → enable commitment sharing. Verify coverage reports show linked accounts consuming shared commitments.