
CPU Inference on AMD EPYC 9334: Real Numbers for LLM and TTS Workloads

By Codcompass Team · 4 min read

Current Situation Analysis

Training is a capital expenditure; inference is an operational tax. Once models enter production, compute costs scale linearly with traffic, and for many organizations, inference spend eclipses training budgets within months. The traditional hardware paradigm defaults to GPU clusters for all inference workloads, but this approach frequently misaligns with the actual performance characteristics of deployed models.

Training workloads are heavily compute-bound and thrive on high-bandwidth interconnects (NVLink/InfiniBand) across large GPU clusters. Inference, however, splits into two distinct phases with divergent bottlenecks:

  • Prefill: Compute-bound. The model processes input tokens, builds the KV cache, and generates the first output token.
  • Decode: Memory-bandwidth-bound. The model generates subsequent tokens sequentially by reading from the KV cache.

When serving quantized models (Q4/Q5), the decode phase becomes strictly limited by DRAM bandwidth rather than raw FLOPS. Defaulting to GPUs for these workloads often results in underutilized tensor cores, inflated TCO, and unnecessary latency from PCIe/NVLink data movement. Without proper workload routing, quantization alignment, and memory topology awareness, teams waste budget on hardware that doesn't match the bottleneck profile of their actual traffic.
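
To see why decode collapses at higher precision, a back-of-envelope ceiling can be derived from memory bandwidth alone: every generated token has to stream the full weight set out of DRAM. The sketch below uses illustrative assumptions (the sustained bandwidth of the 64 GB test node is not reported in this benchmark), not measured values:

# Roofline-style decode ceiling: tok/s ~ sustained DRAM read bandwidth / bytes streamed per token
# All values are assumptions for illustration, not measurements from this benchmark.
SUSTAINED_BW_GBS=150       # assumed sustained read bandwidth of the node, GB/s
WEIGHTS_Q4_GB=5            # ~8B params at ~4.8 bits/weight (Q4_K_M)
WEIGHTS_FP16_GB=16         # same model at FP16 (matches the ~16 GB footprint below)

echo "scale=1; $SUSTAINED_BW_GBS / $WEIGHTS_Q4_GB" | bc     # Q4 ceiling:   ~30 tok/s
echo "scale=1; $SUSTAINED_BW_GBS / $WEIGHTS_FP16_GB" | bc   # FP16 ceiling: ~9 tok/s

If the node sustains on the order of 150 GB/s, the measured 27.8 and 8.1 tok/s in the table below sit just under these ceilings, which is the signature of a bandwidth-bound decode phase.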

WOW Moment: Key Findings

Approach | TTFT (s) | Decode Throughput (tok/s) | Memory Footprint (GB) | RTF
--- | --- | --- | --- | ---
CPU (EPYC 9334) + DeepSeek-R1-8B Q4_K_M | 4.1 | 27.8 | ~6.2 | N/A
CPU (EPYC 9334) + DeepSeek-R1-8B FP16 | 8.1 | 8.1 | ~16.0 | N/A
CPU (EPYC 9334) + GPT-OSS-20B Q4 | 3.6 | 18.3 | ~11.5 | N/A
CPU (EPYC 9334) + GPT-OSS-20B FP16 | 3.6 | 26.2 | ~22.0 | N/A
GPU (Nvidia L4) + DeepSeek-R1-8B FP16 | ~2.1 | 16.7 | ~14.0 | N/A
GPU (Nvidia L4) + GPT-OSS-20B FP16 | ~1.8 | 58.6 | ~38.0 | N/A
CPU + Kokoro TTS (82M, ONNX) | N/A | N/A | ~0.5 | 0.162
CPU + SpeechT5 TTS (150M) | N/A | N/A | ~1.4 | 0.600
CPU + XTTS-v2 TTS (400M) | N/A | N/A | ~4.0 | 1.410

Key Findings:

  • Q4 quantization is the practical default for CPU decode: Switching DeepSeek-R1-8B from Q4_K_M to FP16 dropped throughput by 3.4× while doubling TTFT. Memory bandwidth saturation occurs rapidly with higher precision.
  • CPU DRAM headroom enables multi-tenancy: Q4 workloads maintain 20–30% CPU utilization and leave substantial memory available, allowing concurrent instances on a single 64 GB node.
  • GPU throughput advantage is real but context-dependent: The L4 delivers ~2× higher decode throughput for FP16, but at significantly higher memory cost and power draw. For batch/queue workloads, the CPU cost-per-token ratio is superior.
  • TTS models show clear RTF stratification: Kokoro (0.162 RTF) comfortably beats real-time on CPU. XTTS-v2 (1.41 RTF) cannot sustain streaming and belongs in batch queues.
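
For context, RTF (real-time factor) is synthesis wall-clock time divided by the duration of the audio produced; anything below 1.0 can keep pace with streaming playback. A minimal computation with placeholder values:

# RTF = synthesis wall-clock time / generated audio duration (<1.0 means real-time capable)
# Placeholder values; the benchmark averaged 30 iterations per model on a 180-character input.
SYNTH_SECONDS=1.62      # time spent generating the clip
AUDIO_SECONDS=10.0      # length of the resulting waveform

echo "scale=3; $SYNTH_SECONDS / $AUDIO_SECONDS" | bc   # => .162, i.e. Kokoro-class headroom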

Core Solution

The benchmark infrastructure relies on a dual-socket AMD EPYC 9334 (Zen 4, 32c/64t per socket, 2.7 GHz base, 128 MB L3 cache per socket, 210W TDP) paired with 64 GB DDR5. The architecture leverages default OS scheduling without NUMA pinning or hugepage configuration to reflect production-ready, zero-config deployments on HPE ProLiant DL385 Gen11 hardware.
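
Because the runs intentionally use default scheduling, it is still worth recording what topology the kernel sees before any later tuning. The standard tools below report socket/NUMA layout; their output depends on BIOS NPS settings and DIMM population:

# Inspect socket/NUMA layout and current THP policy on the inference node
lscpu | grep -iE "socket|numa"                       # sockets, NUMA nodes, CPU-to-node mapping
numactl --hardware                                   # per-node memory size and node distances
cat /sys/kernel/mm/transparent_hugepage/enabled      # THP policy (see pitfall 7 below)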

Benchmark Methodology:

  • LLM Evaluation: llama-bench (llama.cpp) for local token generation profiling. Configuration: 512-token prompt, 128 generated tokens, 24 CPU threads.
  • API-Level Throughput: OpenLLM + llmperf for concurrent request simulation. Configuration: 1 concurrent request, 512 input tokens, 128 output tokens.
  • TTS Evaluation: Standard Python inference loops + ONNX Runtime. Configuration: 180-character input, 32 CPU threads, 30 iterations per model.

Reproduction Commands:

# LLM benchmark: llama-bench (part of llama.cpp)
llama-bench \
  -m /path/to/model.gguf \
  -p 512 \
  -n 128 \
  -t 24

# TTS benchmark: run per model, 30 iterations
# Kokoro: ONNX Runtime
# SpeechT5 + XTTS-v2: standard Python inference loop
# Input: 180-character text string, 32 threads

# API-level throughput: OpenLLM + llmperf
openllm start /path/to/model.gguf --backend llama-cpp

llmperf run \
  --model <model-name> \
  --num-concurrent-requests 1 \
  --num-output-tokens 128 \
  --num-input-tokens 512

Models are sourced directly from HuggingFace (e.g., bartowski/DeepSeek-R1-0528-Qwen3-8B-GGUF), pulling the Q4_K_M variant for llama.cpp compatibility. No special system preparation was applied; results reflect baseline Linux kernel memory management and scheduler behavior.
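
For reference, the quantized artifact can be pulled with the HuggingFace CLI; the include pattern and target directory below are illustrative, so check them against the repository listing:

# Fetch only the Q4_K_M GGUF from the repo referenced above
huggingface-cli download bartowski/DeepSeek-R1-0528-Qwen3-8B-GGUF \
  --include "*Q4_K_M*.gguf" \
  --local-dir ./models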

Pitfall Guide

  1. Ignoring the Prefill vs Decode Bottleneck Split: Treating inference as a single compute task leads to wrong hardware choices. Prefill is compute-heavy; decode is memory-bandwidth-bound. CPU architectures with high DDR5 bandwidth often outperform GPUs in decode-heavy streaming when models are quantized.
  2. Overlooking Quantization Impact on Memory Bandwidth: Running FP16/FP32 on CPU forces the memory controller to saturate, collapsing throughput. Q4_K_M reduces weight size by ~75%, keeping the decode phase within DRAM bandwidth limits and preserving TTFT stability.
  3. Neglecting NUMA Topology and Thread Affinity: Dual-socket EPYC systems split memory across two NUMA nodes. Without explicit thread pinning or numactl routing, cross-NUMA memory access adds latency. For production, bind inference threads to the socket holding the model weights (see the launch sketch after this list).
  4. Assuming GPU Throughput Always Equals Lower TCO: While GPUs deliver higher raw tok/s, power consumption, licensing, and idle overhead often make CPU inference cheaper for batch, overnight, or low-concurrency workloads. Calculate cost-per-1000-tokens, not just peak throughput.
  5. Misaligning TTS Model Capability with RTF Requirements: High-capability TTS models (e.g., XTTS-v2) carry architectural overhead that pushes RTF > 1.0 on CPU. Using them for real-time voice streaming causes buffer underruns. Reserve them for offline batch generation; use lightweight ONNX models (Kokoro, SpeechT5) for interactive pipelines.
  6. Skipping KV Cache Memory Budgeting: Long prompts or multi-turn conversations expand the KV cache linearly. On 64 GB DDR5 nodes, FP16 caches exhaust memory quickly, triggering swap or OOM. Q4 quantization and context window limits are mandatory for multi-tenant CPU serving.
  7. Relying on Default OS Memory Policies in Production: Transparent HugePages (THP) and aggressive page reclaim can introduce tail latency spikes during decode. Disabling THP, setting vm.swappiness=1, and using mlock() for model weights stabilizes p95 TTFT on CPU inference nodes.
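
A minimal node-preparation and launch sketch tying pitfalls 3, 6, and 7 together. The KV-cache geometry (layer count, KV heads, head dimension, context length) is illustrative for an 8B-class model and must be read from the actual model config; llama.cpp's llama-server stands in for whatever serving process is deployed, and the model path is a placeholder:

# Pitfall 6 - rough FP16 KV-cache budget per sequence:
# bytes ~ 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_element * context_len
N_LAYERS=32; N_KV_HEADS=8; HEAD_DIM=128; BYTES_PER_ELEM=2; CTX_LEN=8192
echo "scale=2; 2 * $N_LAYERS * $N_KV_HEADS * $HEAD_DIM * $BYTES_PER_ELEM * $CTX_LEN / 1024^3" | bc
# => ~1.00 GiB per 8k-token sequence; multiply by concurrent sequences when sizing the 64 GB node

# Pitfall 7 - disable THP and minimize swap pressure before serving
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
sudo sysctl -w vm.swappiness=1

# Pitfall 3 - pin the serving process and its allocations to one socket (NUMA node 0)
numactl --cpunodebind=0 --membind=0 \
  llama-server -m /path/to/model-Q4_K_M.gguf -t 24 --mlock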

Deliverables

  • Blueprint: CPU Inference Architecture & Workload Routing Guide
    A reference architecture diagram and decision matrix for routing LLM/TTS workloads across CPU vs GPU nodes based on quantization level, concurrency, and RTF/TTFT SLAs. Includes NUMA-aware deployment patterns and DRAM sizing formulas for KV cache + weights.

  • Checklist: Pre-Deployment Validation for CPU/TTS Inference
    Step-by-step validation protocol covering: model quantization verification, thread-to-core binding, memory bandwidth headroom calculation, RTF/TTFT baseline testing, and fallback routing rules when SLAs degrade.

  • Configuration Templates: Benchmark & Production Tuning
    Ready-to-use llama-bench and OpenLLM parameter sets, numactl invocation examples for dual-socket routing, systemd service overrides for thread affinity, and Linux kernel tuning snippets (sysctl.conf for memory pressure, THP controls, and swap behavior) optimized for Zen 4 DDR5 memory controllers.