oughput and 3.9x cost efficiency over naive BF16 A100 deployments, while maintaining sub-350ms TTFT for real-time serving.
- Bandwidth > FLOPS in Decode: Once prefill completes, inter-token latency scales with HBM streaming speed. H100βs 3.35 TB/s HBM3 outpaces A100βs 2.0 TB/s, widening the gap as KV cache grows.
- VRAM Ceiling Prediction: Accurate KV cache arithmetic prevents OOM crashes and reveals that INT4 quantization drops total VRAM from ~168GB to ~51GB, enabling single-A100 80GB deployment instead of multi-GPU tensor parallelism.
Core Solution
Deployment success hinges on four sequential decisions: workload classification, VRAM calculation, instance selection, and pricing model alignment.
1. Workload Classification
The compute signature dictates hardware requirements:
- Full Training: Every parameter is in motion. GPUs execute Allreduce across data-parallel replicas and shuttle optimizer states between HBM and compute units. AdamW in mixed precision stores weights, gradients, first moments, and second moments (16-18 bytes/parameter). Memory capacity caps batch size; memory bandwidth caps weight update velocity.
- Fine-Tuning (LoRA): Freezes base weights and trains only low-rank decomposition matrices. For Llama 3 8B at rank=16, trainable parameters drop from 8B to ~42M. Base model stays in BF16/FP16, adapters are negligible, optimizer states cover only the trainable slice. Total VRAM drops to ~20GB, enabling single-A100 80GB deployment.
- Inference: Optimizer disappears; bottleneck shifts to bandwidth. Batch inference maximizes throughput per dollar by packing sequences. Real-time inference targets TTFT (FLOPS-limited during prefill). Decode phase generates one token at a time; inter-token latency scales with KV cache streaming speed from HBM.
2. VRAM Calculation
Inference VRAM formula:
VRAM = (N_params x bytes_per_param) + KV_cache_size + framework overhead (10-15%)
KV cache size formula:
KV_cache_size = 2 x num_layers x num_heads x head_dim x seq_len x batch_size x bytes_per_element
Note: num_heads for GQA models refers to the KV head count, not the query head count (e.g., 8 for Llama-3-70B, not 64). Retrieve num_layers, num_heads (as num_key_value_heads), and head_dim from config.json on HuggingFace Hub.
Example: Llama-3-70B at 4K context, batch size 8
- Weights at BF16: 70B x 2 bytes = 140GB
- Weights at INT4 via bitsandbytes: 70B x 0.5 bytes = 35GB
- KV cache at BF16: 2 x 80 layers x 8 KV heads x 128 head_dim x 4096 tokens x 8 batch x 2 bytes β 10.7GB
- Framework overhead at BF16: 140GB x 0.12 β 17GB
- Total at BF16: β 168GB (requires 2x H100 80GB or tensor parallelism)
- Total at INT4: β 35GB + 10.7GB KV cache + 5GB overhead β 51GB (fits one A100 80GB)
Minimum Per-Precision VRAM for LLM Inference (weights only, ~12% overhead included)
| Model Size | FP16/BF16 | INT8 | INT4 | Min Instance (FP16) | Min Instance (INT4) |
|---|
| 8B | ~18GB | ~9GB | ~5GB | A100 40GB | RTX 4090 24GB |
| 13B | ~29GB | ~14GB | ~8GB | A100 40GB | RTX 4090 24GB |
| 34B | ~76GB | ~38GB | ~19GB | A100 80GB | A100 40GB |
| 70B | ~157GB | ~78GB | ~40GB | 2x A100 80GB | A100 80GB |
Fine-tuning adjustment: Full AdamW mixed-precision training multiplies FP16 weight VRAM by 8x. LoRA at rank=16 adds ~4GB overhead on top of the frozen base model. Rank=8 halves overhead (quality trade-off); rank=32 doubles it.
3. Instance & Pricing Alignment
- Prefill-heavy workloads: Prioritize TFLOPS (A100/H100 PCIe/SXM).
- Decode-heavy/long-context workloads: Prioritize HBM bandwidth (H100 SXM > A100 > consumer GPUs).
- Pricing: Spot instances for fault-tolerant, long-running training. On-demand or reserved for latency-sensitive inference. Auto-scaling groups for bursty batch inference.
Pitfall Guide
- Ignoring KV Cache Scaling at Long Contexts: VRAM scales linearly with sequence length. Assuming static weight memory suffices for 8K+ contexts guarantees OOM. Always calculate KV cache separately and add 2-10GB baseline, scaling upward for high concurrency.
- Miscounting GQA Heads for KV Cache: Using query head count instead of KV head count (e.g., 64 vs 8 for Llama-3-70B) inflates KV cache estimates by 8x, leading to massive over-provisioning. Verify
num_key_value_heads in config.json.
- Confusing Prefill (FLOPS-bound) with Decode (Bandwidth-bound): Selecting GPUs based on TFLOPS for decode-heavy serving wastes budget. Once prefill completes, inter-token latency depends on HBM bandwidth. H100 SXMβs 3.35 TB/s significantly outperforms A100βs 2.0 TB/s for token generation.
- Over-Provisioning Full Training When LoRA Suffices: Full AdamW training requires ~8x VRAM of FP16 weights due to optimizer states. LoRA freezes base weights, dropping trainable parameters to a fraction. Single-GPU deployment is often viable where multi-GPU was assumed.
- Neglecting Modality-Specific Memory Pressure: Vision encoders produce embeddings that inflate effective sequence length before the LLM processes input. Pure text VRAM estimates fail for multimodal models like LLaVA, causing unexpected VRAM limits at standard batch sizes.
- Misaligning Pricing Models with Workload Duration: Running long training jobs on spot instances risks interruption and checkpoint loss. Using on-demand for intermittent inference burns budget. Match pricing tier to workload fault tolerance and duration.
- Skipping Framework Overhead Allocation: Assuming 100% VRAM efficiency ignores CUDA context, activation buffers, and serving framework overhead. Always reserve 10-15% headroom to prevent silent degradation or OOM during peak load.
Deliverables
- GPU Provisioning Blueprint: Decision tree mapping workload class β modality β VRAM calculation β instance topology β pricing model. Includes quantization trade-off matrix (BF16/INT8/INT4) and GQA-aware KV cache calculator.
- Pre-Deployment Validation Checklist: 12-step verification covering workload classification, config.json parameter extraction, VRAM arithmetic, bandwidth vs FLOPS alignment, overhead reservation, and spot/on-demand pricing validation.
- Configuration Templates: Production-ready
vLLM and TGI serving configs for INT4 vs BF16 deployment, LoRA training YAML with rank/learning-rate/scheduling presets, and Terraform/CloudFormation snippets for auto-scaling inference clusters with GPU-specific instance families.