metrics, cloud pricing APIs, and Kubernetes scheduling primitives.
Step 1: Instrument GPU Telemetry
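Deploy NVIDIA DCGM (typically through dcgm-exporter) or a cloud-native GPU metrics exporter, and expose the following metrics to Prometheus:
- dcgm_sm_utilization (percentage)
- dcgm_mem_utilization (percentage)
- dcgm_gpu_power_usage (watts)
- kube_pod_status_phase (for job lifecycle tracking)
- node_gpu_memory_total_bytes / node_gpu_memory_used_bytes
Note that a stock dcgm-exporter publishes its fields under DCGM_FI_* names (for example DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_POWER_USAGE, and DCGM_FI_DEV_FB_USED), so the shorter names above assume recording rules or metric relabeling. kube_pod_status_phase comes from kube-state-metrics rather than from the GPU exporter.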
Configure Prometheus to scrape at 15-second intervals. Retain raw data for 7 days, aggregated data for 90 days.
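To make the telemetry usable by the controller, the percentage samples have to be read back out of Prometheus and normalized. The following is a minimal sketch, assuming the metrics carry a pod label and are queried through the standard instant-query endpoint (GET /api/v1/query); the helper names avgOverTimeQuery and toUtilizationByPod are hypothetical and not part of any library.

```typescript
// Shape of a Prometheus instant-query response for a vector result.
interface PromVectorResponse {
  status: 'success' | 'error';
  data: {
    resultType: 'vector';
    result: Array<{
      metric: Record<string, string>;
      value: [number, string]; // [unix timestamp, sample value as a string]
    }>;
  };
}

// Hypothetical helper: builds a PromQL expression that averages a metric over a range.
// Assumes the metric has a `pod` label.
function avgOverTimeQuery(metric: string, pod: string, range = '5m'): string {
  return `avg_over_time(${metric}{pod="${pod}"}[${range}])`;
}

// Converts percentage samples (0-100) into fractions (0-1), keyed by pod name.
// Samples without a pod label or with a non-numeric value are skipped.
function toUtilizationByPod(resp: PromVectorResponse): Map<string, number> {
  const out = new Map<string, number>();
  if (resp.status !== 'success') return out;
  for (const sample of resp.data.result) {
    const pod = sample.metric['pod'];
    const pct = Number.parseFloat(sample.value[1]);
    if (pod === undefined || Number.isNaN(pct)) continue;
    out.set(pod, Math.min(Math.max(pct / 100, 0), 1));
  }
  return out;
}
```

Averaging over a few minutes rather than reading the latest 15-second sample keeps a single idle scrape from triggering a scaling decision.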
Step 2: Build the Cost Engine
The cost engine translates utilization metrics into spend-aware scheduling decisions. It uses a pricing abstraction layer to remain cloud-agnostic.
import { PrometheusMetricsClient } from './metrics';
import { PricingProvider } from './pricing';
import { SchedulerAdapter } from './scheduler';

interface JobCostProfile {
  jobId: string;
  gpuType: string;
  requestedVramGB: number;
  currentUtilization: number; // SM utilization as a fraction (0-1)
  memoryPressure: number;     // used VRAM / total VRAM (0-1)
  estimatedHourlyCost: number;
  scalingAction: 'SCALE_UP' | 'SCALE_DOWN' | 'MAINTAIN' | 'MIGRATE_TO_SPOT';
}

export class GPUCostController {
  constructor(
    private metrics: PrometheusMetricsClient,
    private pricing: PricingProvider,
    private scheduler: SchedulerAdapter
  ) {}

  // Builds a cost profile and a recommended action for every active GPU job.
  async evaluateJobs(): Promise<JobCostProfile[]> {
    const jobs = await this.metrics.getActiveGPUJobs();
    const profiles: JobCostProfile[] = [];

    for (const job of jobs) {
      const gpuMetrics = await this.metrics.getGPUUtilization(job.podName);
      const pricing = await this.pricing.getCurrentRate(job.gpuType, job.region);

      // The thresholds below assume fractions in [0, 1]; if the exporter
      // reports percentages, divide by 100 before comparing.
      const utilization = gpuMetrics.smUtilization;
      // Guard against a zero total so a missing sample does not produce NaN.
      const memoryPressure =
        gpuMetrics.memoryTotal > 0 ? gpuMetrics.memoryUsed / gpuMetrics.memoryTotal : 0;

      // Cost-aware policy logic
      let action: JobCostProfile['scalingAction'] = 'MAINTAIN';
      let estimatedCost = pricing.onDemand;

      if (utilization < 0.25 && memoryPressure < 0.4) {
        action = 'SCALE_DOWN';
        estimatedCost = pricing.onDemand * 0.5;
      } else if (utilization > 0.85 && memoryPressure > 0.8) {
        action = 'SCALE_UP';
        estimatedCost = pricing.onDemand * 1.2;
      } else if (utilization > 0.5 && job.checkpointInterval && job.checkpointInterval <= 1800) {
        // Only jobs that checkpoint at least every 30 minutes are spot candidates.
        action = 'MIGRATE_TO_SPOT';
        estimatedCost = pricing.spot;
      }

      profiles.push({
        jobId: job.id,
        gpuType: job.gpuType,
        requestedVramGB: job.vramGB,
        currentUtilization: utilization,
        memoryPressure,
        estimatedHourlyCost: estimatedCost,
        scalingAction: action
      });
    }
    return profiles;
  }

  // Applies each profile's recommended action through the scheduler adapter.
  async applyPolicies(profiles: JobCostProfile[]): Promise<void> {
    for (const profile of profiles) {
      switch (profile.scalingAction) {
        case 'SCALE_DOWN':
          await this.scheduler.reduceReplicas(profile.jobId, 0.5);
          break;
        case 'SCALE_UP':
          await this.scheduler.increaseReplicas(profile.jobId, 1.5);
          break;
        case 'MIGRATE_TO_SPOT':
          await this.scheduler.migrateToSpot(profile.jobId);
          break;
        case 'MAINTAIN':
          // No action, emit telemetry for showback
          break;
      }
    }
  }
}
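The policy branch inside evaluateJobs is easier to reason about, and to unit test, when it is pulled out as a pure function. The sketch below mirrors the same thresholds and branch order; the names decideAction and PolicyInput are hypothetical and introduced only for illustration.

```typescript
type ScalingAction = 'SCALE_UP' | 'SCALE_DOWN' | 'MAINTAIN' | 'MIGRATE_TO_SPOT';

interface PolicyInput {
  utilization: number;            // SM utilization as a fraction, 0-1
  memoryPressure: number;         // used VRAM / total VRAM, 0-1
  checkpointIntervalSec?: number; // undefined when the job does not checkpoint
}

// Same thresholds and precedence as the controller: scale-down is checked first,
// then scale-up, then spot migration; everything else is left alone.
function decideAction(input: PolicyInput): ScalingAction {
  const { utilization, memoryPressure, checkpointIntervalSec } = input;
  if (utilization < 0.25 && memoryPressure < 0.4) return 'SCALE_DOWN';
  if (utilization > 0.85 && memoryPressure > 0.8) return 'SCALE_UP';
  const checkpointsOftenEnough =
    checkpointIntervalSec !== undefined &&
    checkpointIntervalSec > 0 &&
    checkpointIntervalSec <= 1800;
  if (utilization > 0.5 && checkpointsOftenEnough) return 'MIGRATE_TO_SPOT';
  return 'MAINTAIN';
}
```

For example, a job at 60% utilization and 50% memory pressure that checkpoints every 15 minutes is a spot candidate, while the same job without checkpoints stays on its current capacity. Because scale-up is checked before spot migration, a saturated job is never moved to spot even if it checkpoints frequently.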
Step 3: Architecture Decisions & Rationale
- Decoupled Pricing Layer: Cloud GPU pricing changes frequently and varies by region, availability zone, and commitment tier. Abstracting pricing behind a PricingProvider interface allows runtime updates without controller restarts. Implement caching with TTL-based invalidation to avoid API rate limits.
- Event-Driven Policy Evaluation: Run the controller on a 60-second reconciliation loop. Use Kubernetes informers or Prometheus alerting rules to trigger immediate evaluation during sudden load spikes or preemption events.
- Checkpoint-First Spot Migration: The controller only migrates jobs to spot instances if checkpointInterval <= 1800 (30 minutes). This enforces fault tolerance as a prerequisite for cost optimization, preventing data loss and wasted compute.
- Utilization Thresholds Over Absolute Metrics: Policy decisions use relative utilization (smUtilization, memoryPressure) rather than raw GPU counts. This prevents over-provisioning when workloads are memory-bound but compute-light, or vice versa.
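The TTL-based pricing cache from the first decision can be sketched as follows. This is one possible shape under stated assumptions: TtlCache, GpuRate, and getRateCached are hypothetical names, fetchRate stands in for whatever cloud pricing call the PricingProvider wraps, and the clock is injectable so expiry can be tested without waiting.

```typescript
interface GpuRate {
  onDemand: number; // hourly on-demand price
  spot: number;     // hourly spot price
}

// Minimal in-memory cache whose entries expire ttlMs after they are written.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if it is missing or expired.
  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Looks up a rate in the cache and only calls the pricing API on a miss.
async function getRateCached(
  cache: TtlCache<GpuRate>,
  gpuType: string,
  region: string,
  fetchRate: (gpuType: string, region: string) => Promise<GpuRate>
): Promise<GpuRate> {
  const key = `${gpuType}:${region}`;
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const fresh = await fetchRate(gpuType, region);
  cache.set(key, fresh);
  return fresh;
}
```

With a 60-second reconciliation loop and a 5-minute TTL, each GPU type and region pair would hit the pricing API roughly once per five evaluations instead of once per job per loop.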
Step 4: Implement Showback & Budget Enforcement
Expose cost profiles via a Grafana dashboard and Kubernetes annotations. Implement budget thresholds that trigger pod eviction or queueing when job-level spend exceeds allocated limits. Integrate with existing CI/CD pipelines to block deployments that request GPU tiers outside approved cost bands.
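A budget threshold check might look like the sketch below. It assumes a simple reading of the rule above: running pods that exceed their allocated limit are evicted, and pending pods whose budget is already exhausted are held in a queue. The types and the enforceBudget function are hypothetical, and accrued spend is approximated as hourly cost multiplied by hours run.

```typescript
type BudgetDecision = 'ALLOW' | 'QUEUE' | 'EVICT';

interface JobSpend {
  jobId: string;
  phase: 'Pending' | 'Running'; // e.g. derived from kube_pod_status_phase
  hourlyCostUsd: number;
  hoursRun: number;
  budgetUsd: number;            // allocated job-level limit
}

// Decides what to do with a job based on accrued spend versus its budget.
function enforceBudget(job: JobSpend): BudgetDecision {
  const spentUsd = job.hourlyCostUsd * job.hoursRun;
  if (job.phase === 'Pending') {
    // Do not start more work against a budget that is already used up.
    return spentUsd >= job.budgetUsd ? 'QUEUE' : 'ALLOW';
  }
  // Running jobs are only evicted once they actually exceed the limit.
  return spentUsd > job.budgetUsd ? 'EVICT' : 'ALLOW';
}
```

In practice eviction would usually be preceded by a warning annotation and a forced checkpoint, so treat the hard cutoff here as the last step of the policy rather than the whole of it.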
Pitfall Guide
- Equating GPU Utilization with Cost Efficiency: High SM utilization does not guarantee cost efficiency. A job may saturate tensor cores while underutilizing VRAM, indicating poor batch sizing or unnecessary precision. Always pair compute metrics with memory bandwidth and VRAM pressure before scaling decisions.
- Ignoring MIG/Partitioning Overhead: Multi-Instance GPU (MIG) partitioning reduces idle VRAM waste but introduces context-switch overhead and limits PCIe bandwidth per partition. Use MIG only for inference or micro-batch workloads. Training jobs requiring NVLink or high memory bandwidth will experience throughput degradation that negates cost savings.
- Spot Preemption Without Graceful Degradation: Spot instances are cost-effective but disruptive. Migrating jobs without checkpointing, state serialization, or queue-based retry logic causes silent data loss and forces full retraining. Always enforce minimum checkpoint intervals and implement exponential backoff for spot reclamation events.
- Static Batch Sizes Causing Memory Fragmentation: Fixed batch sizes lead to VRAM fragmentation and underutilization. Implement dynamic batching (e.g., vLLM, Triton Inference Server) or gradient accumulation to match memory capacity. Static batching inflates GPU count requirements by 20–40% without improving model quality.
- Missing Network and Storage Egress Costs: GPU workflows often move terabytes of dataset shards and model checkpoints across zones. Egress fees and parallel filesystem costs can exceed compute spend. Co-locate data and GPU nodes, use regional endpoints, and compress checkpoint artifacts before cross-zone transfer.
- Over-Optimizing for Cost at the Expense of SLA: Aggressive downscaling or frequent spot migration can breach latency SLOs for inference endpoints or extend training timelines beyond business windows. Implement cost-aware SLO guards: if p95 latency exceeds threshold or training ETA increases by >15%, revert to on-demand provisioning.
- Lack of Job-Level Cost Attribution: Aggregated GPU spend masks inefficient jobs. Without per-pod cost tagging, teams cannot identify which models, datasets, or engineers drive waste. Enforce Kubernetes labels (app, team, model-version) and map them to cost allocation tags in cloud billing.
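Per-label cost attribution comes down to grouping job costs by a label value. A minimal sketch, assuming each job's hourly cost and its Kubernetes labels are already available in one record; LabeledCost and costByLabel are hypothetical names.

```typescript
interface LabeledCost {
  labels: Record<string, string>; // Kubernetes labels such as app, team, model-version
  hourlyCostUsd: number;
}

// Sums hourly cost per value of the given label; jobs missing the label are
// grouped under "unlabeled" so untagged spend stays visible.
function costByLabel(items: LabeledCost[], labelKey: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const item of items) {
    const key = item.labels[labelKey] ?? 'unlabeled';
    totals.set(key, (totals.get(key) ?? 0) + item.hourlyCostUsd);
  }
  return totals;
}
```

Keeping an explicit "unlabeled" bucket is a deliberate choice: a growing unlabeled total is usually the first sign that label enforcement is being bypassed.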
Best Practice: Treat GPU cost management as a continuous feedback loop. Instrument → evaluate → act → measure. Automate policy enforcement, but retain manual override paths for critical workloads. Review cost-to-throughput ratios monthly, not quarterly.
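The SLO guard from the pitfall on over-optimizing for cost can be expressed as a small predicate that runs in the "measure" step of the loop. This is a sketch under assumptions: the field names follow the sloGuards keys used in the controller configuration, while SloObservation and shouldRevertToOnDemand are hypothetical.

```typescript
interface SloGuards {
  maxLatencyP95Ms: number;
  maxTrainingETADeltaPercent: number;
}

interface SloObservation {
  latencyP95Ms?: number;     // for inference endpoints
  baselineEtaHours?: number; // training ETA before the cost action
  currentEtaHours?: number;  // training ETA after the cost action
}

// Returns true when a cost optimization should be rolled back to on-demand.
function shouldRevertToOnDemand(obs: SloObservation, guards: SloGuards): boolean {
  if (obs.latencyP95Ms !== undefined && obs.latencyP95Ms > guards.maxLatencyP95Ms) {
    return true;
  }
  if (
    obs.baselineEtaHours !== undefined &&
    obs.currentEtaHours !== undefined &&
    obs.baselineEtaHours > 0
  ) {
    const deltaPercent =
      ((obs.currentEtaHours - obs.baselineEtaHours) / obs.baselineEtaHours) * 100;
    return deltaPercent > guards.maxTrainingETADeltaPercent;
  }
  return false;
}
```

A training job whose ETA slips from 100 to 120 hours after a spot migration has regressed by 20%, which exceeds a 15% guard and would trigger a revert; a slip to 110 hours would not.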
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Large-scale pretraining (multi-week) | Reserved + Spot Hybrid with 15-min checkpoints | Reserved guarantees baseline capacity; spot absorbs variable load with fault tolerance | 55–70% reduction vs pure on-demand |
| Real-time inference (sub-100ms SLO) | Right-sized on-demand with MIG partitioning | Predictable latency requires stable hardware; MIG reduces idle VRAM waste | 30–40% reduction via partition consolidation |
| Experimental model tuning (short runs) | Spot-first with auto-retry queue | Interruption risk is acceptable; queue ensures completion without manual intervention | 75–85% reduction vs on-demand |
| Batch inference (overnight jobs) | Auto-scaling spot cluster with dynamic batching | Workload is time-flexible; dynamic batching maximizes VRAM utilization | 65–80% reduction via scale-to-zero |
| Multi-tenant research lab | Showback-enabled K8s with budget quotas | Prevents cross-team cost leakage; enforces accountability without blocking development | 20–35% reduction via behavioral alignment |
Configuration Template
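# karpenter-gpu-cost-policy.yaml
# Provisioner is the karpenter.sh/v1alpha5 kind; under karpenter.sh/v1beta1 the
# equivalent resource is NodePool, with a different spec layout.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-cost-optimized
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      # EC2 instance types; the "ml." prefix is SageMaker naming and does not match node labels.
      values: ["g5.xlarge", "g5.2xlarge", "g5.4xlarge"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  limits:
    resources:
      nvidia.com/gpu: "16"
  # ttlSecondsAfterEmpty is omitted: it is mutually exclusive with consolidation,
  # and consolidation already removes empty nodes.
  ttlSecondsUntilExpired: 604800
  consolidation:
    enabled: true
  weight: 100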
---
# prometheus-gpu-metrics-scrape.yaml
scrape_configs:
  - job_name: 'gpu-telemetry'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_gpu_monitor]
        action: keep
        regex: true
    metrics_path: /metrics
    scrape_interval: 15s
---
# ts-cost-controller-config.json
{
  "reconciliationInterval": 60,
  "utilizationThresholds": {
    "scaleDown": 0.25,
    "scaleUp": 0.85,
    "spotMigration": 0.5
  },
  "checkpointRequirementSeconds": 1800,
  "pricingCacheTTL": 300,
  "sloGuards": {
    "maxLatencyP95Ms": 120,
    "maxTrainingETADeltaPercent": 15
  }
}
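Because a mistyped threshold can silently scale every job down, it is worth validating the controller configuration at startup. The sketch below is one way to do that; the ControllerConfig type mirrors the JSON keys above, and validateConfig is a hypothetical helper that returns a list of problems instead of throwing.

```typescript
interface ControllerConfig {
  reconciliationInterval: number;
  utilizationThresholds: { scaleDown: number; scaleUp: number; spotMigration: number };
  checkpointRequirementSeconds: number;
  pricingCacheTTL: number;
  sloGuards: { maxLatencyP95Ms: number; maxTrainingETADeltaPercent: number };
}

// Parses the JSON config and returns human-readable problems (empty when valid).
function validateConfig(json: string): string[] {
  const cfg = JSON.parse(json) as Partial<ControllerConfig>;
  const problems: string[] = [];
  const isFraction = (x: unknown): x is number => typeof x === 'number' && x > 0 && x < 1;
  const isPositive = (x: unknown): boolean => typeof x === 'number' && x > 0;

  const t = cfg.utilizationThresholds;
  if (t === undefined || !isFraction(t.scaleDown) || !isFraction(t.scaleUp) || !isFraction(t.spotMigration)) {
    problems.push('utilizationThresholds must be fractions between 0 and 1');
  } else if (t.scaleDown >= t.scaleUp) {
    problems.push('scaleDown must be lower than scaleUp');
  }
  if (!isPositive(cfg.reconciliationInterval)) problems.push('reconciliationInterval must be positive');
  if (!isPositive(cfg.checkpointRequirementSeconds)) problems.push('checkpointRequirementSeconds must be positive');
  if (!isPositive(cfg.pricingCacheTTL)) problems.push('pricingCacheTTL must be positive');
  return problems;
}
```

Checking that scaleDown is lower than scaleUp matters more than it looks: if the two are swapped, a job can satisfy both conditions and the branch order alone decides the outcome.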
Quick Start Guide
- Deploy NVIDIA DCGM or a cloud GPU metrics exporter to all GPU nodes and configure Prometheus to scrape at 15-second intervals.
- Install the TypeScript cost controller as a Kubernetes Deployment with RBAC permissions for pod scaling and node inspection.
- Apply the Karpenter provisioner template to enable spot/on-demand hybrid scheduling with consolidation and scale-to-zero.
- Configure Prometheus alerting rules for utilization thresholds and integrate Grafana dashboards for job-level cost showback.
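- Validate by running a test training job, observing auto-scaling events, and confirming that job-level cost attribution appears in the showback dashboard within 5 minutes. Cloud billing consoles usually lag by hours, so check allocation tags there later.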