g jobs that can survive node termination. The implementation isolates these workloads from control-plane components using taints and tolerations.
Step 1: Provision the Spot Node Pool
Create a dedicated node pool with the --spot flag. Apply a taint to prevent accidental scheduling of critical services.
# g2-standard-24 is the G2 shape with two L4 GPUs attached; g2-standard-8 has only one.
gcloud container node-pools create ai-spot-workers \
  --cluster=ml-platform-cluster \
  --region=us-central1 \
  --machine-type=g2-standard-24 \
  --accelerator=type=nvidia-l4,count=2 \
  --spot \
  --node-taints=provisioning-tier=preemptible:NoSchedule \
  --num-nodes=3
Step 2: Route Workloads via Tolerations
Workloads must explicitly declare tolerance for the taint. Pair this with a lower priority class to ensure system pods always win scheduling contention.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spot-workload-low
value: -5
globalDefault: false
description: "Low priority for interruptible AI tasks"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: hyperparameter-search
spec:
  template:
    metadata:
      labels:
        app: ml-sweeper
    spec:
      priorityClassName: spot-workload-low
      tolerations:
        - key: "provisioning-tier"
          operator: "Equal"
          value: "preemptible"
          effect: "NoSchedule"
      containers:
        - name: trainer
          image: us-docker.pkg.dev/ml-registry/trainer:v2.1
          resources:
            limits:
              nvidia.com/gpu: 2
          env:
            - name: CHECKPOINT_DIR
              value: "/mnt/checkpoints"
      restartPolicy: Never
Architectural Rationale:
- Taint Isolation: Prevents control-plane agents, monitoring sidecars, and inference endpoints from landing on preemptible nodes.
- Priority Classes: When cluster capacity is scarce, the scheduler preempts lower-priority pods first, so Spot workloads yield to system and production pods and cluster stability is preserved.
- Explicit Resource Requests: For extended resources such as nvidia.com/gpu, the limit doubles as the request, so the scheduler places the pod only on a node with enough unallocated GPUs of the matching type.
Path 2: Queue-Driven Provisioning (DWS Flex-Start)
For multi-day model training or large-scale batch inference, preemption is unacceptable. DWS flex-start bypasses immediate quota checks by placing the request in a provisioning queue. GKE monitors regional capacity and allocates standard (non-preemptible) nodes once inventory is available.
Step 1: Declare Flex-Start Intent
No custom resources or external controllers are required. The scheduling directive lives directly in the pod spec.
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-run
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-flex-start: "true"
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
      containers:
        - name: pytorch-trainer
          image: us-docker.pkg.dev/ai-labs/dl-training:cuda12
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: dataset-volume
              mountPath: /data
      volumes:
        - name: dataset-volume
          persistentVolumeClaim:
            claimName: training-data-pvc
      restartPolicy: Never
Step 2: Cluster Prerequisites
DWS flex-start requires Node Auto-Provisioning (NAP) enabled on Standard clusters, or native support on Autopilot clusters. When the Job is applied, pods enter a Pending state. GKE's internal scheduler evaluates regional accelerator availability, provisions the nodes, and transitions pods to Running. Upon job completion, nodes are automatically deprovisioned.
Architectural Rationale:
- Queue-Based Allocation: Eliminates manual retry loops and API polling. GKE handles capacity detection and node creation atomically.
- Non-Preemptible Guarantee: Once provisioned, nodes are reserved for the workload's duration (capped at 7 days), ensuring training continuity.
- Cost-Performance Balance: Delivers significant discounts compared to on-demand pricing while maintaining the stability required for long-running compute graphs.
Pitfall Guide
1. Silent Preemption Loss
Explanation: Spot VMs receive a 30-second termination warning. If the application does not trap SIGTERM or flush in-memory state, training progress is lost.
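Fix: Implement signal handlers in the training script. Save model checkpoints to remote storage (GCS, Cloud Storage FUSE, or a networked PVC) every N steps. Set terminationGracePeriodSeconds: 25 in the pod spec to leave room for cleanup before the 30-second hard limit, and keep the shutdown path short, because the grace period on Spot nodes is honored only on a best-effort basis.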
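
As a rough illustration of this fix, the sketch below traps SIGTERM and writes one final checkpoint before the loop exits. It is a minimal sketch rather than the trainer's actual code: `GracefulStop`, `train`, and the `save_checkpoint` and `run_step` callbacks are hypothetical names, and `save_checkpoint` is assumed to write to remote storage.

```python
import signal


class GracefulStop:
    """Records SIGTERM so the training loop can exit at a safe point."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Register the handler; call this once at process start.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the heavy work happens in the loop.
        self.requested = True


def train(total_steps, checkpoint_every, save_checkpoint, stop,
          run_step=lambda step: None):
    """Run steps, checkpoint every N steps, and once more on shutdown.

    Returns the last completed step.
    """
    last = 0
    for step in range(1, total_steps + 1):
        run_step(step)
        last = step
        if stop.requested:
            save_checkpoint(step)  # final flush before the node goes away
            break
        if step % checkpoint_every == 0:
            save_checkpoint(step)
    return last
```

Setting only a flag in the handler keeps the signal path cheap; the checkpoint write happens between steps, where model state is consistent. The final write still has to finish within the grace period, so the checkpoint size limits how useful this is.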
2. Control Plane Starvation
Explanation: Scheduling system components (metrics-server, ingress controllers, logging agents) on Spot nodes causes cluster instability during preemption events.
Fix: Taint every Spot node pool. A taint only repels pods that lack a matching toleration, so also use nodeSelector or affinity rules to pin critical deployments to on-demand or Autopilot system nodes, and never give infrastructure pods a toleration for the Spot taint.
3. DWS Queue Blindness
Explanation: Teams assume DWS provides immediate execution. Pods can remain Pending for hours or days depending on regional accelerator inventory.
Fix: Set realistic SLAs for batch jobs. Check job and pod status with kubectl get jobs and kubectl describe pod; the pod's events show why it is still Pending. Alert when Pending duration exceeds a threshold. Consider splitting very large jobs into smaller parallel chunks, which are easier to match against available capacity.
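
One possible way to alert on Pending duration is to scan the output of `kubectl get pods -o json`. The sketch below assumes that JSON has already been parsed into a dict. `long_pending` is a hypothetical helper, and it uses pod age since creation as an approximation of time spent Pending.

```python
from datetime import datetime, timedelta, timezone


def long_pending(pod_list, threshold, now=None):
    """Return names of Pending pods whose age exceeds the threshold.

    pod_list is the parsed output of `kubectl get pods -o json`.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        # creationTimestamp is RFC 3339 in UTC, e.g. 2024-05-01T12:00:00Z
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > threshold:
            stale.append(pod["metadata"]["name"])
    return stale
```

A scheduled check could call this with a threshold that matches the job's SLA and raise an alert whenever the returned list is not empty.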
4. Storage Persistence Gaps
Explanation: Spot nodes are ephemeral. Local node storage vanishes upon preemption. Workloads writing to emptyDir or local SSDs lose data instantly.
Fix: Mandate PersistentVolumeClaims (PVCs) backed by regional persistent disks or Cloud Storage FUSE CSI drivers. Ensure training scripts read/write checkpoints to networked storage, not local filesystems.
5. Taint and Selector Mismatches
Explanation: Typos in taint keys, operator values, or nodeSelector strings cause pods to remain unschedulable without clear error messages.
Fix: Use consistent naming conventions across infrastructure-as-code templates. Validate scheduling rules with kubectl describe node and kubectl describe pod before deploying to production. Implement CI linting for Kubernetes manifests.
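
A CI lint for this class of typo can be sketched as a pure function over the manifest data. The version below is a simplified model of the Kubernetes matching rules that covers only the `Equal` and `Exists` operators. `tolerates` and `untolerated` are hypothetical helper names, and taints and tolerations are assumed to be plain dicts loaded from the manifests.

```python
def tolerates(toleration, taint):
    """Return True if the toleration matches the taint (Equal/Exists only)."""
    effect = toleration.get("effect", "")
    # An empty effect on the toleration matches every taint effect.
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # An empty key with Exists matches every taint.
        return key == "" or key == taint["key"]
    if operator == "Equal":
        return (key == taint["key"]
                and toleration.get("value", "") == taint.get("value", ""))
    return False


def untolerated(taints, tolerations):
    """List the taints that no toleration in the pod spec covers."""
    return [t for t in taints
            if not any(tolerates(tol, t) for tol in tolerations)]
```

Running `untolerated` against the node pool's taints and each workload's tolerations in CI turns a silently Pending pod into a failed build.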
6. Over-Provisioning DWS Requests
Explanation: Requesting more GPUs than a region can allocate in a single batch causes the DWS queue to stall indefinitely.
Fix: Start with a conservative value for the Job's parallelism field and raise it incrementally. Check which accelerator types each zone offers in the GCP Console or with gcloud compute accelerator-types list; note that this shows offerings, not live capacity. Avoid requesting one massive allocation up front.
7. Ignoring Pod Disruption Budgets (PDBs)
Explanation: DWS nodes are reclaimed after 7 days. Without PDBs, simultaneous node expiration can crash distributed training jobs.
Fix: Configure PDBs with minAvailable or maxUnavailable thresholds, keeping in mind that PDBs limit only voluntary evictions such as node drains. Implement graceful job restart logic that detects node expiration signals and requeues incomplete tasks.
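
Requeueing is simpler when a restarted pod can find where the previous one stopped. The following is a minimal sketch, assuming checkpoints are files named `step-<N>.ckpt` inside the directory that `CHECKPOINT_DIR` points to. The naming scheme and the `latest_checkpoint_step` helper are hypothetical.

```python
import os
import re

# Hypothetical naming scheme: step-500.ckpt, step-1000.ckpt, ...
_CHECKPOINT_NAME = re.compile(r"^step-(\d+)\.ckpt$")


def latest_checkpoint_step(checkpoint_dir):
    """Return the highest checkpointed step, or 0 if none exists."""
    if not os.path.isdir(checkpoint_dir):
        return 0
    steps = []
    for name in os.listdir(checkpoint_dir):
        match = _CHECKPOINT_NAME.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps, default=0)


if __name__ == "__main__":
    # Resume after the last saved step instead of starting from scratch.
    directory = os.environ.get("CHECKPOINT_DIR", "/mnt/checkpoints")
    print("resuming after step", latest_checkpoint_step(directory))
```

Because the result depends only on what is in storage, a replacement pod can resume without any coordination with the pod that was evicted.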
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Hyperparameter sweeps, CI validation, fault-tolerant batch jobs | Spot VM Node Pool | Immediate start, high interruptibility tolerance, maximum discount | Up to 90% reduction vs on-demand |
| Multi-day model training, RL fine-tuning, large batch inference | DWS Flex-Start Queue | Non-preemptible runtime, bypasses immediate quota gates, automatic deprovisioning | Up to 53% reduction vs on-demand |
| Production inference endpoints, low-latency APIs | On-Demand Node Pool | Zero preemption risk, predictable latency, guaranteed capacity | Baseline pricing (100%) |
| Mixed workload cluster with strict SLAs | Hybrid (Spot + DWS + On-Demand) | Routes workloads by tolerance profile, optimizes spend without violating uptime requirements | Optimized blended cost |
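
To make the blended cost of the hybrid row concrete, the sketch below weights each tier's price by its share of GPU-hours. The discounts are the upper bounds from the table ("up to"), so the result is a best case, and the workload mix in the usage example is made up for illustration. `blended_cost_factor` is a hypothetical helper.

```python
# Price relative to on-demand, using the table's best-case discounts.
PRICE_FACTOR = {
    "spot": 1 - 0.90,
    "flex_start": 1 - 0.53,
    "on_demand": 1.0,
}


def blended_cost_factor(gpu_hours_by_tier):
    """Return total cost as a fraction of an all-on-demand bill."""
    total_hours = sum(gpu_hours_by_tier.values())
    if total_hours <= 0:
        raise ValueError("need a positive number of GPU-hours")
    cost = sum(PRICE_FACTOR[tier] * hours
               for tier, hours in gpu_hours_by_tier.items())
    return cost / total_hours


if __name__ == "__main__":
    # Illustrative mix: half Spot, 30% flex-start, 20% on-demand.
    mix = {"spot": 50, "flex_start": 30, "on_demand": 20}
    print(f"{blended_cost_factor(mix):.3f} of the on-demand cost")
```

With that illustrative mix the best-case bill is roughly 39% of the all-on-demand cost; real discounts vary by region and machine type.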
Configuration Template
# namespace: ai-workloads
# Apply to cluster with NAP enabled or Autopilot
---
apiVersion: v1
kind: Namespace
metadata:
  name: ai-workloads
---
apiVersion: batch/v1
kind: Job
metadata:
  name: flex-start-training
  namespace: ai-workloads
spec:
  parallelism: 2
  completions: 2
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-flex-start: "true"
        cloud.google.com/gke-accelerator: nvidia-l4
      # Optional: this toleration has no effect unless the selected nodes carry the taint.
      tolerations:
        - key: "provisioning-tier"
          operator: "Equal"
          value: "preemptible"
          effect: "NoSchedule"
      containers:
        - name: model-trainer
          image: us-docker.pkg.dev/ai-platform/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: STORAGE_BACKEND
              value: "gcs"
            - name: CHECKPOINT_INTERVAL
              value: "500"
          volumeMounts:
            - name: model-output
              mountPath: /output
      volumes:
        - name: model-output
          persistentVolumeClaim:
            claimName: training-output-pvc
      restartPolicy: Never
Quick Start Guide
- Verify Cluster Configuration: Ensure Node Auto-Provisioning is enabled on your GKE Standard cluster, or confirm you are using GKE Autopilot. Run gcloud container clusters describe <CLUSTER_NAME> --region <REGION> --format="value(autopilot.enabled)" to validate.
- Create Spot Node Pool: Execute the gcloud container node-pools create command with --spot and a dedicated taint. Wait for nodes to reach Ready status.
- Deploy a Test Job: Apply a minimal Job manifest with the cloud.google.com/gke-flex-start: "true" nodeSelector and a GPU resource request. Monitor pod status with kubectl get pods -w.
- Validate Scheduling: Confirm Spot workloads respect taints and DWS jobs enter the provisioning queue. Check Cloud Monitoring for preemption events and queue depth metrics. Adjust checkpoint intervals and PDBs based on observed behavior.