ttlenecks.
- Metric Ingestion & Normalization: Pull CPU, memory, disk IOPS, network throughput, and latency metrics from cloud-native monitoring services. Normalize across instance families using baseline performance indices (e.g., AWS vCPU performance tiers, Azure Compute Units). Store raw metrics in a time-series database with consistent resolution (1-minute or 5-minute intervals).
- Utilization Modeling: Calculate 95th percentile, 50th percentile, and burst ratios over a rolling 30-day window. Identify underutilized resources using multi-dimensional thresholds: CPU < 30%, memory < 40%, network < 20%, and I/O < 25% for sustained periods. Flag resources with high variance for burst-capable instance families instead of linear downgrades.
- Policy Engine & Recommendation: Map utilization profiles to target instance types using a constraint solver. Apply business rules: license vCPU limits, GPU/accelerator requirements, availability zone affinity, and minimum baseline specs. Generate recommendations with confidence scores and projected savings.
- Safe Execution: Deploy in shadow mode first. Compare recommended changes against current performance SLAs. Roll out via canary deployments or infrastructure-as-code drift detection. Implement automated rollback on error rate spikes, latency degradation, or health check failures.
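The utilization modeling step above can be sketched in a few lines. This is a minimal illustration, assuming metric samples are already expressed as percentages of capacity. The names `summarize`, `isUnderutilized`, and `UtilizationSummary` are hypothetical, the nearest-rank percentile is only one of several reasonable definitions, and the thresholds are the ones quoted in the text.

```typescript
// Hypothetical types for illustration; not part of any SDK.
type Dimension = "cpu" | "memory" | "network" | "io";

interface UtilizationSummary {
  p50: number;
  p95: number;
  burstRatio: number; // peak divided by p95
}

// Thresholds quoted in the text (percent of capacity).
const THRESHOLDS: Record<Dimension, number> = { cpu: 30, memory: 40, network: 20, io: 25 };

// Nearest-rank percentile computed on a copy, so the caller's array is not reordered.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? 0;
}

function summarize(samples: number[]): UtilizationSummary {
  const p50 = percentile(samples, 50);
  const p95 = percentile(samples, 95);
  const peak = samples.reduce((max, v) => Math.max(max, v), 0);
  // Guard against division by zero for a fully idle resource.
  return { p50, p95, burstRatio: p95 > 0 ? peak / p95 : 1 };
}

// A resource counts as underutilized only if every dimension's p95 is below its threshold.
function isUnderutilized(profile: Record<Dimension, UtilizationSummary>): boolean {
  return (Object.keys(THRESHOLDS) as Dimension[]).every(
    (dim) => profile[dim].p95 < THRESHOLDS[dim]
  );
}
```

Requiring every dimension to pass is a deliberately conservative choice in this sketch: a resource with idle CPU but busy memory is not flagged.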
Architecture Decisions and Rationale
- Event-driven serverless pipeline: Scales with account size, isolates failure domains, and aligns cost with usage. EventBridge routes threshold events (for example, CloudWatch alarm state changes) to Lambda analyzers, which avoids monolithic polling.
- Separation of analysis and execution: Analysis runs in read-only mode. Execution is gated by policy approval and change windows. This prevents race conditions and accidental production disruption.
- DynamoDB for state tracking: Low-latency, partition-key optimized storage for recommendation state, rollout history, and rollback triggers. Supports conditional writes for idempotent operations.
- OpenTelemetry metric standardization: Abstracts cloud provider differences. Enables cross-cloud rightsizing without rewriting ingestion logic.
- Infrastructure-as-code integration: Recommendations translate to Terraform/CDK diff outputs. Rightsizing becomes a declarative drift correction rather than imperative API calls.
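As a sketch of the idempotent conditional write mentioned above, the helper below builds the input for a DynamoDB `PutItem` call that succeeds only if no item with the same primary key exists yet. The table name `rightsizing-recommendations`, the attribute names, and the `buildPutRecommendation` helper are assumptions for illustration. The resulting object would be passed to `PutItemCommand` from `@aws-sdk/client-dynamodb`, which is left out here so the example stays self-contained.

```typescript
// Hypothetical shapes for illustration only.
interface Recommendation {
  recommendationId: string;
  instanceId: string;
  targetType: string;
}

interface PutRecommendationInput {
  TableName: string;
  Item: Record<string, { S: string }>;
  ConditionExpression: string;
}

function buildPutRecommendation(
  rec: Recommendation,
  tableName = "rightsizing-recommendations" // assumed table name
): PutRecommendationInput {
  return {
    TableName: tableName,
    Item: {
      recommendationId: { S: rec.recommendationId },
      instanceId: { S: rec.instanceId },
      targetType: { S: rec.targetType },
      status: { S: "shadow" },
    },
    // Reject the write if an item with this key already exists,
    // so a retried invocation cannot overwrite existing state.
    ConditionExpression: "attribute_not_exists(recommendationId)",
  };
}
```

When the condition fails, DynamoDB rejects the request with a `ConditionalCheckFailedException`, which the caller can treat as "already recorded" rather than as an error.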
TypeScript Implementation Example
import { CloudWatchClient, GetMetricDataCommand } from "@aws-sdk/client-cloudwatch";
import { EC2Client, DescribeInstancesCommand } from "@aws-sdk/client-ec2";
interface UtilizationProfile {
  instanceId: string;
  cpu95th: number;
  mem95th: number;
  net95th: number;
  burstRatio: number;
  recommendedFamily: string;
  confidence: number;
}
const cloudWatch = new CloudWatchClient({ region: "us-east-1" });
const ec2 = new EC2Client({ region: "us-east-1" });
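async function fetchMetrics(instanceIds: string[]): Promise<Record<string, number[]>> {
  // GetMetricData rejects an empty query list, so return early.
  if (instanceIds.length === 0) return {};
  // One query per instance; Label carries the instance ID back in the response.
  // GetMetricData accepts at most 500 queries per request, so batch larger fleets.
  const metricQueries = instanceIds.map((id, idx) => ({
    Id: `metric_${idx}`,
    MetricStat: {
      Metric: {
        Namespace: "AWS/EC2",
        MetricName: "CPUUtilization",
        Dimensions: [{ Name: "InstanceId", Value: id }],
      },
      Period: 300, // 5-minute resolution
      Stat: "Average",
    },
    Label: id,
    ReturnData: true,
  }));
  const startTime = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
  const endTime = new Date();
  const metrics: Record<string, number[]> = {};
  let nextToken: string | undefined;
  do {
    // Results are paginated; follow NextToken until every data point is read.
    const response = await cloudWatch.send(
      new GetMetricDataCommand({
        StartTime: startTime,
        EndTime: endTime,
        MetricDataQueries: metricQueries,
        NextToken: nextToken,
      })
    );
    for (const res of response.MetricDataResults ?? []) {
      if (!res.Label) continue;
      const series = metrics[res.Label] ?? (metrics[res.Label] = []);
      series.push(...(res.Values ?? []));
    }
    nextToken = response.NextToken;
  } while (nextToken);
  return metrics;
}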
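function calculatePercentile(values: number[], percentile: number): number {
  if (values.length === 0) return 0;
  // Copy before sorting: Array.prototype.sort mutates the array in place.
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank method, clamped so that percentile = 0 maps to the first element.
  const index = Math.max(0, Math.ceil((percentile / 100) * sorted.length) - 1);
  return sorted[index] ?? 0;
}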
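export async function generateRightsizingRecommendations(): Promise<UtilizationProfile[]> {
  // Note: DescribeInstances is paginated; follow NextToken for large accounts.
  const instances = await ec2.send(new DescribeInstancesCommand({}));
  const instanceIds =
    instances.Reservations?.flatMap(
      (r) => r.Instances?.flatMap((i) => (i.InstanceId ? [i.InstanceId] : [])) ?? []
    ) ?? [];
  const cpuMetrics = await fetchMetrics(instanceIds);
  const recommendations: UtilizationProfile[] = [];
  for (const id of instanceIds) {
    const values = cpuMetrics[id] ?? [];
    // No data points (e.g., a stopped or newly launched instance): nothing to recommend.
    if (values.length === 0) continue;
    const cpu95 = calculatePercentile(values, 95);
    // Peak-to-p95 ratio. The guard avoids dividing by zero on a fully idle instance.
    const peak = values.reduce((max, v) => Math.max(max, v), 0);
    const burstRatio = cpu95 > 0 ? peak / cpu95 : 1;
    // Illustrative mapping only; real targets come from the policy engine.
    // t4g is Arm-based (Graviton), so confirm the workload runs on arm64 first.
    let recommendedFamily = "current";
    let confidence = 0.5;
    if (cpu95 < 20 && burstRatio < 1.5) {
      recommendedFamily = "t4g.medium";
      confidence = 0.85;
    } else if (cpu95 < 35 && burstRatio > 2.0) {
      recommendedFamily = "m6i.large";
      confidence = 0.75;
    }
    recommendations.push({
      instanceId: id,
      cpu95th: cpu95,
      mem95th: 0, // Placeholder: integrate memory metric collection
      net95th: 0, // Placeholder: integrate network metric collection
      burstRatio,
      recommendedFamily,
      confidence,
    });
  }
  return recommendations;
}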
The example demonstrates metric collection, percentile calculation, and baseline recommendation logic. Production systems extend this with memory/network ingestion, constraint validation, DynamoDB state persistence, and Step Functions orchestration for safe rollout.
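The `mem95th` placeholder could be filled by a second metric query. EC2 does not publish memory utilization in the `AWS/EC2` namespace, so this sketch assumes the CloudWatch agent is installed and publishing `mem_used_percent` under its default `CWAgent` namespace with an `InstanceId` dimension. The `buildMemoryQuery` helper and the `MetricQuery` type are hypothetical, and the names need adjusting if the agent is configured differently.

```typescript
// Local shape mirroring the fields of a GetMetricData query that this sketch uses.
interface MetricQuery {
  Id: string;
  Label: string;
  ReturnData: boolean;
  MetricStat: {
    Metric: {
      Namespace: string;
      MetricName: string;
      Dimensions: { Name: string; Value: string }[];
    };
    Period: number;
    Stat: string;
  };
}

function buildMemoryQuery(instanceId: string, index: number): MetricQuery {
  return {
    // Query IDs must start with a lowercase letter.
    Id: `mem_${index}`,
    Label: instanceId,
    ReturnData: true,
    MetricStat: {
      Metric: {
        Namespace: "CWAgent", // assumed: the agent's default namespace
        MetricName: "mem_used_percent", // assumed: Linux agent metric name
        Dimensions: [{ Name: "InstanceId", Value: instanceId }],
      },
      Period: 300, // match the 5-minute CPU resolution
      Stat: "Average",
    },
  };
}
```

CloudWatch identifies a metric by its full dimension set, so if the agent attaches extra dimensions (for example an instance type), the query has to list them too or it returns no data.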
Pitfall Guide
- Single-metric myopia: Rightsizing based solely on CPU ignores memory, I/O, and network bottlenecks. A database instance may show 15% CPU but saturate disk IOPS, causing latency spikes after downgrade. Always model multi-dimensional utilization.
- Ignoring temporal baselines: Seasonal traffic, weekend dips, and batch windows create artificial underutilization. Rightsizing against a 7-day snapshot instead of a 30–60 day rolling window triggers false positives. Align baseline periods with business cycles.
- Over-automating without guardrails: Direct execution of recommendations without shadow mode or canary validation causes production outages. Implement approval gates, change windows, and automated rollback on SLA breach.
- Neglecting licensing and dependency constraints: Enterprise software (Oracle, SQL Server, VMware) often licenses per vCPU or socket. Downgrading instance size can violate contracts or trigger audit penalties. Map license boundaries into the policy engine.
- Skipping post-change validation: Rightsizing is not complete when the API call succeeds. Validate error rates, p95 latency, and health check status for 24–48 hours. Silent degradation is more expensive than idle capacity.
- Treating rightsizing as episodic: Cloud workloads drift. New deployments, traffic shifts, and code changes invalidate static audits within 30 days. Embed rightsizing as a continuous control plane, not a quarterly exercise.
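To make the licensing pitfall concrete, here is one way license and hardware boundaries might be encoded as hard constraints that filter candidate instance types before any sizing logic runs. It is a sketch under stated assumptions: `CandidateType`, `WorkloadConstraints`, and `filterCandidates` are hypothetical names, and a real policy engine would load the catalog from the provider's API rather than pass it in by hand.

```typescript
// Hypothetical shapes for illustration only.
interface CandidateType {
  name: string;
  vcpu: number;
  memoryGiB: number;
  gpus: number;
}

interface WorkloadConstraints {
  maxLicensedVcpu: number; // upper bound imposed by the software license
  minVcpu: number; // minimum baseline spec
  minMemoryGiB: number;
  requiresGpu: boolean;
}

// Keep only candidates that satisfy every hard constraint, smallest first.
function filterCandidates(catalog: CandidateType[], c: WorkloadConstraints): CandidateType[] {
  return catalog
    .filter((t) => t.vcpu <= c.maxLicensedVcpu)
    .filter((t) => t.vcpu >= c.minVcpu && t.memoryGiB >= c.minMemoryGiB)
    // GPU types are excluded when no GPU is needed, to avoid paying for an idle accelerator.
    .filter((t) => (c.requiresGpu ? t.gpus > 0 : t.gpus === 0))
    .sort((a, b) => a.vcpu - b.vcpu || a.memoryGiB - b.memoryGiB);
}
```

Treating these as filters rather than as scoring weights means a recommendation that breaks a license bound can never be produced, however attractive its projected savings.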
Best Practices from Production
- Run recommendations in shadow mode for 14 days before execution.
- Use multi-dimensional thresholds with weighted scoring (CPU 40%, memory 30%, I/O 20%, network 10%).
- Align change windows with low-traffic periods and deploy via infrastructure-as-code drift correction.
- Implement automated rollback on error rate > 0.5% or p95 latency increase > 20%.
- Integrate rightsizing feedback into CI/CD to prevent over-provisioning at deployment time.
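The weighted scoring in the list above could look roughly like the following. The weights are the ones given in the text. How the score is used is an assumption of this sketch: it treats a lower weighted utilization as a stronger downsizing candidate, and `weightedUtilization` is a hypothetical helper name.

```typescript
// Weights from the text, kept as integer percentages so the arithmetic stays exact.
const WEIGHTS = { cpu: 40, memory: 30, io: 20, network: 10 } as const;
type WeightedDimension = keyof typeof WEIGHTS;

// Combine per-dimension p95 utilization (0-100) into a single weighted figure (0-100).
function weightedUtilization(p95: Record<WeightedDimension, number>): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as WeightedDimension[]) {
    total += WEIGHTS[dim] * p95[dim];
  }
  return total / 100;
}
```

For p95 values of 20% CPU, 30% memory, 10% I/O, and 5% network, the result is (800 + 900 + 200 + 50) / 100 = 19.5. A single score is convenient for ranking, but it can hide one saturated dimension behind three idle ones, so it is best paired with the per-dimension thresholds.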
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Stateless web tier with predictable traffic | Predictive scaling + rightsize to burstable family | Stable baseline, low variance, elastic demand | 30–40% reduction |
| Batch processing with nightly spikes | Reactive automation + scheduled upscaling | High variance, time-bound workload | 20–25% reduction |
| Stateful database with strict latency SLAs | Manual audit + shadow recommendations | Low risk tolerance, license constraints | 10–15% reduction |
| Legacy monolith with unknown dependencies | Tagging + metric baselining before action | Hidden bottlenecks, no rollback safety | 5–10% reduction |
Configuration Template
rightsizing:
  version: "1.0"
  collection:
    window_days: 30
    resolution_minutes: 5
    metrics: [cpu, memory, disk_iops, network_throughput]
  thresholds:
    cpu_95th: 30
    memory_95th: 40
    io_95th: 25
    network_95th: 20
    burst_ratio_max: 1.8
  policies:
    license_constraints:
      max_vcpu: 8
      require_licensing_check: true
    change_control:
      mode: shadow_first
      canary_percentage: 10
      rollback_on:
        error_rate_threshold: 0.005
        latency_p95_increase: 0.20
      change_window: "02:00-04:00 UTC"
  output:
    format: terraform_diff
    state_backend: dynamodb
    confidence_minimum: 0.70
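As an illustration of how the `rollback_on` values in the template might be evaluated, the sketch below compares post-change observations with a pre-change baseline. The `RollbackPolicy` field names follow the template. The `Observation` shape and the `shouldRollback` function are hypothetical, and parsing the YAML into this object is assumed to happen elsewhere.

```typescript
// Mirrors the rollback_on block of the configuration template.
interface RollbackPolicy {
  error_rate_threshold: number; // absolute error rate, e.g. 0.005 = 0.5%
  latency_p95_increase: number; // relative increase, e.g. 0.20 = 20%
}

// Hypothetical observation shape for illustration.
interface Observation {
  errorRate: number;
  baselineP95Ms: number;
  currentP95Ms: number;
}

function shouldRollback(
  policy: RollbackPolicy,
  obs: Observation
): { rollback: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (obs.errorRate > policy.error_rate_threshold) {
    reasons.push("error_rate");
  }
  // Relative p95 latency change against the pre-change baseline.
  if (obs.baselineP95Ms > 0) {
    const increase = (obs.currentP95Ms - obs.baselineP95Ms) / obs.baselineP95Ms;
    if (increase > policy.latency_p95_increase) {
      reasons.push("latency_p95");
    }
  }
  return { rollback: reasons.length > 0, reasons };
}
```

Returning the reasons alongside the decision keeps the rollback history explainable when it is written to the state store.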
Quick Start Guide
- Deploy the metric collector: Attach an IAM role with cloudwatch:GetMetricData and ec2:DescribeInstances permissions. Run the TypeScript analyzer in a Lambda or containerized job.
- Initialize state storage: Create a DynamoDB table with recommendationId as partition key and status (shadow/approved/applied/rolled_back) as sort key.
- Apply the configuration template: Save the YAML above as rightsizing-policy.yaml. Load it into your policy engine or CI/CD pipeline.
- Run in shadow mode: Execute the analyzer. Review generated Terraform/CDK diffs. Validate against p95 latency and error rate baselines.
- Approve and apply: Promote recommendations to approved status. Trigger infrastructure-as-code apply during the defined change window. Monitor rollback triggers for 48 hours.
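The status values used in these steps suggest a small state machine. The sketch below assumes a strictly linear lifecycle (shadow, then approved, then applied, with rollback possible only after apply); that ordering and the `transition` helper are assumptions, not something the steps above spell out.

```typescript
type Status = "shadow" | "approved" | "applied" | "rolled_back";

// Assumed lifecycle: each status lists the statuses it may move to.
const ALLOWED: Record<Status, Status[]> = {
  shadow: ["approved"],
  approved: ["applied"],
  applied: ["rolled_back"],
  rolled_back: [],
};

// Return the new status, or throw if the move skips a step.
function transition(current: Status, next: Status): Status {
  if (ALLOWED[current].indexOf(next) === -1) {
    throw new Error(`Invalid transition: ${current} -> ${next}`);
  }
  return next;
}
```

Enforcing the order in code means a recommendation cannot jump from shadow straight to applied, which is the guardrail the approval step exists to provide.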