Optimizing Distributed LLM Serving on Kubernetes: Storage, Network, and Scheduling Strategies

Current Situation Analysis

Deploying large language models in Kubernetes clusters introduces a hidden performance ceiling that most teams discover only after scaling past three or four replicas. The bottleneck rarely originates from the GPU itself. Instead, it emerges from the data pipeline feeding the inference engines. Modern LLM checkpoints frequently exceed 140GB. When multiple vLLM pods attempt to load these weights simultaneously, the underlying storage protocol and network stack become the primary determinants of throughput.

This problem is systematically overlooked because infrastructure teams and ML engineers operate in separate domains. ML practitioners optimize quantization, batch sizes, and KV-cache parameters, while platform engineers focus on node autoscaling and GPU drivers. The intersection—how model weights traverse the network to reach GPU memory—is treated as a "cloud provider default" problem. In reality, default Kubernetes networking and storage configurations are tuned for general-purpose microservices, not for sustained, high-bandwidth, read-heavy workloads.

Empirical observations from production deployments reveal a consistent pattern: unoptimized clusters cap NFS throughput at baseline levels, directly throttling inference latency. Without jumbo frames and adjusted TCP socket buffers, network interfaces saturate prematurely under large sequential reads. Furthermore, unsynchronized pod scheduling creates a silent race condition. vLLM containers frequently start before network tuning daemons complete their sysctl adjustments, causing GPU nodes to operate at roughly half their theoretical throughput. The cost impact is immediate: idle GPU cycles, inflated cloud bills, and unpredictable latency spikes during scale-up events.

WOW Moment: Key Findings

The performance delta between a baseline Kubernetes deployment and a tuned reference architecture is not incremental. It is structural. By aligning storage protocols, network stack parameters, and scheduling constraints, teams can unlock predictable scaling behavior without purchasing additional hardware.

Approach	Model Load Time (140GB)	Inference Throughput (tok/s)	GPU Utilization	Network Overhead
Baseline K8s (Block Storage + Default MTU/TCP)	48–55s	1,200–1,400	58–65%	High (packet fragmentation, buffer thrashing)
Optimized K8s (Managed NFS + Jumbo Frames/TCP Tuning + Taint-Based Scheduling)	22–28s	2,400–2,800	89–94%	Low (aligned payloads, tuned socket windows)

This finding matters because it decouples compute scaling from storage latency. When model weights load in under 30 seconds and network throughput doubles on identical hardware, cold-start penalties vanish. The architecture enables horizontal scaling without linear cost increases, and it transforms GPU utilization from a guessing game into a deterministic metric. Teams can confidently run larger batch sizes or support more concurrent users because the data pipeline no longer starves the inference engine.

Core Solution

Building a production-ready LLM inference stack requires three coordinated layers: storage selection, network stack optimization, and scheduling synchronization. Each layer addresses a specific failure mode that degrades throughput.

Step 1: Storage Layer Selection

Object storage (S3-compatible) introduces latency due to chunked downloads, authentication overhead, and eventual consistency models. Block volumes (EBS, Persistent Disks) restrict concurrent access, forcing each replica to maintain a separate copy of the weights. Managed NFS provides a shared, POSIX-compliant filesystem optimized for read-heavy workloads. Multiple vLLM pods can mount the same volume concurrently without duplication, and the protocol handles metadata caching efficiently.

Rationale: NFS reduces storage costs by eliminating redundant weight copies. It also simplifies model versioning—update a single directory, and all replicas see the change on next restart. For read-only inference workloads, NFS attribute caching (actimeo) can be tuned to minimize metadata round-trips.

Step 2: Network Stack Optimization

Default Kubernetes nodes ship with an MTU of 1500 bytes and conservative TCP buffer sizes. Large sequential reads from NFS generate thousands of packets, increasing interrupt overhead and causing buffer exhaustion. Enabling jumbo frames (MTU 9000) reduces packet count by ~6x. Adjusting TCP socket buffers (rmem_max, wmem_max, tcp_rmem, tcp_wmem) prevents the kernel from throttling throughput during sustained transfers.

Implementation: A privileged DaemonSet applies sysctl parameters across all GPU nodes. The tuning targets both the network interface and the TCP stack, ensuring end-to-end alignment.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: net-tuner-daemon
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: net-tuner
  template:
    metadata:
      labels:
        app: net-tuner
    spec:
      hostNetwork: true
      hostPID: true
      tolerations:
        - key: "node-role.kubernetes.io/control-plane"
          operator: "Exists"
      containers:
        - name: sysctl-tuner
          image: alpine:3.18
          command:
            - /bin/sh
            - -c
            - |
              sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
              sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
              sysctl -w net.core.rmem_max=16777216
              sysctl -w net.core.wmem_max=16777216
              sysctl -w net.ipv4.tcp_mtu_probing=1
              sysctl -w net.ipv4.tcp_window_scaling=1
              echo "Network stack tuned successfully"
              sleep infinity
          securityContext:
            privileged: true

Step 3: Scheduling Synchronization

Network tuning and vLLM pod startup compete for the same node resources. If vLLM containers launch before the DaemonSet finishes applying sysctl changes, the inference engine inherits suboptimal socket buffers and MTU settings. The result is a silent throughput cap.

Fix: Apply a taint to GPU nodes during initialization. vLLM deployments declare a toleration that only matches after a readiness marker is created. A lightweight init container or sidecar verifies network parameters before removing the taint, ensuring pods only schedule when the stack is fully optimized.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference-stack
spec:
  replicas: 4
  template:
    spec:
      tolerations:
        - key: "llm-inference.net-ready"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: vllm-engine
          image: vllm/vllm-openai:latest
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "/mnt/weights/llama-3.1-70b-instruct"
            - "--tensor-parallel-size"
            - "4"
            - "--max-model-len"
            - "8192"
          volumeMounts:
            - name: model-weights
              mountPath: /mnt/weights
              readOnly: true
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: nfs-model-pvc

Architecture Decision: The taint-based approach avoids complex init container retries or custom admission controllers. It leverages Kubernetes-native scheduling semantics, making it auditable and compatible with cluster autoscalers.

Pitfall Guide

1. MTU Mismatch Across the Data Path

Explanation: Jumbo frames require every hop (node NIC, VPC router, load balancer, NFS endpoint) to support MTU 9000. If any component defaults to 1500, packets fragment or drop, causing retransmissions and throughput collapse. Fix: Validate MTU end-to-end using ping -M do -s 8972 <nfs-endpoint>. Configure cloud provider VPC settings and ensure the NFS service advertises the correct MTU.

2. Over-Provisioning TCP Buffers

Explanation: Setting rmem_max and wmem_max to extreme values (e.g., 64MB+) without considering node memory capacity triggers OOM conditions under high concurrency. The kernel allocates per-socket buffers eagerly. Fix: Start with 16MB buffers. Monitor /proc/net/sockstat and ss -m to verify actual usage. Scale buffers proportionally to expected concurrent connections, not theoretical maximums.

3. Ignoring NFS Attribute Caching

Explanation: Default actimeo=3 causes frequent metadata lookups. For read-only model directories, this adds unnecessary latency and NFS server load. Fix: Mount with actimeo=30 or noac (if strict consistency is required). Test with stat frequency under load to find the optimal balance.

4. Race Conditions on Pod Startup

Explanation: vLLM containers begin loading weights before network tuning completes. The inference engine inherits default socket parameters, capping throughput at ~50% of potential. Fix: Implement node taints with a readiness marker. Use a DaemonSet that applies tuning, then creates a marker file or label. Only then should the taint be removed.

5. Assuming Cloud Provider Defaults Are Optimal

Explanation: Managed Kubernetes services ship with conservative networking profiles designed for general workloads. They rarely enable TCP window scaling or MTU probing by default. Fix: Treat cloud defaults as a baseline, not a target. Audit sysctl -a | grep tcp and ip link show on fresh nodes before deploying inference workloads.

6. Neglecting GPU Memory Bandwidth vs Network Bandwidth

Explanation: Doubling NFS throughput is useless if the PCIe/NVLink path to GPU memory becomes the bottleneck. vLLM's weight loading pipeline must be profiled to ensure network gains translate to GPU utilization. Fix: Use nvtop and nvidia-smi dmon during load phases. If GPU memory bandwidth saturates before network, consider weight quantization or tensor parallelism adjustments.

7. Skipping Read-Only Mounts for Weights

Explanation: Mounting model directories as read-write triggers unnecessary journaling, lock contention, and NFS attribute updates. It also increases corruption risk during concurrent access. Fix: Always mount inference weights with readOnly: true. This disables write-back caches and reduces NFS server CPU overhead.

Production Bundle

Action Checklist

Audit node MTU and TCP buffer defaults before cluster provisioning
Deploy a privileged DaemonSet for sysctl tuning with idempotent execution
Configure Managed NFS with read-only mounts and optimized actimeo values
Implement node taints to block vLLM scheduling until network tuning completes
Validate end-to-end MTU alignment using packet size probes
Profile GPU memory bandwidth during weight loading to identify secondary bottlenecks
Monitor NFS server CPU and network interrupt rates under sustained load
Document rollback procedures for sysctl changes in case of node instability

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Prototype / <3 Replicas	Block volumes + default networking	Simplicity outweighs optimization overhead	Low (no tuning labor)
Production / 4–12 Replicas	Managed NFS + jumbo frames + taint scheduling	Eliminates weight duplication, doubles throughput	Medium (initial tuning effort, lower GPU waste)
Multi-AZ / High Availability	Managed NFS with cross-AZ replication + TCP tuning	Maintains consistency while optimizing network path	High (cross-AZ data transfer, but prevents scaling failures)
Cost-Constrained / Spot Instances	Ephemeral weights + object storage sync	Avoids persistent storage costs on volatile nodes	Low storage, higher cold-start latency

Configuration Template

# PersistentVolumeClaim for Managed NFS
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-model-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: managed-nfs-storage
  resources:
    requests:
      storage: 200Gi
---
# vLLM Deployment with Network-Ready Taint
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving-cluster
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    spec:
      tolerations:
        - key: "inference.network-tuned"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: vllm-runtime
          image: vllm/vllm-openai:latest
          ports:
            - containerPort: 8000
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "/shared-weights/llama-3.1-70b"
            - "--gpu-memory-utilization"
            - "0.92"
            - "--enable-prefix-caching"
          volumeMounts:
            - name: weight-storage
              mountPath: /shared-weights
              readOnly: true
          resources:
            limits:
              nvidia.com/gpu: 4
            requests:
              nvidia.com/gpu: 4
      volumes:
        - name: weight-storage
          persistentVolumeClaim:
            claimName: nfs-model-pvc

Quick Start Guide

Provision Managed NFS: Create a shared filesystem in your cloud provider console. Note the endpoint IP and ensure VPC routing allows port 2049 (NFS) and 111 (rpcbind).
Deploy Network Tuner: Apply the privileged DaemonSet to GPU node pools. Verify sysctl changes with sysctl net.ipv4.tcp_rmem and confirm MTU with ip link show eth0.
Apply Node Taint: Tag GPU nodes with inference.network-tuned:NoSchedule. Create a readiness marker (label or file) that the tuning DaemonSet sets upon successful completion.
Launch vLLM Workloads: Deploy the inference stack with the matching toleration. Pods will remain pending until the taint is removed, guaranteeing optimized network parameters.
Validate Throughput: Run a benchmark client (e.g., vllm bench or custom OpenAI-compatible load test). Compare tok/s and GPU utilization against baseline metrics. Adjust tcp_rmem/tcp_wmem if socket buffer exhaustion appears in ss -m.

We Got 2x LLM Inference Speed With Three Kubernetes Settings