AI/ML · 2026-05-13 · 59 min read

99% of Requests Failed and My Dashboard Showed Green

By NaveenKumar Namachivayam ⚡

The Prefill Bottleneck: Diagnosing LLM Queue Saturation with Goodput Metrics

Current Situation Analysis

In modern LLM deployments, performance engineering often relies on metrics that provide a dangerously incomplete picture of user experience. The industry-standard approach involves running single-user baselines and monitoring aggregate throughput. This methodology creates a "green dashboard" illusion where systems appear healthy despite delivering unacceptable latency to end-users.

The core issue stems from a misunderstanding of how inference servers handle concurrency. Most monitoring tools track Request Throughput (requests completed per second) and Average Latency. However, these metrics mask queue saturation. When an inference endpoint receives concurrent requests, the server must manage a queue. If the queue grows, requests sit waiting before the prefill phase even starts; only once prefill completes does the model emit its first token. During this wait, the user perceives a frozen interface, yet the server continues to process requests, maintaining high throughput numbers.

Data from production load tests reveals the severity of this disconnect. In controlled experiments using NVIDIA AIPerf against a local granite4:350m model, single-user tests showed an average Time to First Token (TTFT) of 223ms, suggesting a responsive system. However, when concurrency increased to 50 users, the average TTFT exploded to 41,660ms—a 186x degradation. Despite this, the system maintained a throughput of 0.88 req/sec. Without specific SLO-based analysis, this system would be declared production-ready based on throughput, while 99% of users experience timeouts.
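The 186x figure falls straight out of the two TTFT averages reported above; a one-line check:

```shell
# TTFT degradation factor: high-concurrency average / single-user average
awk 'BEGIN { printf "%.1fx\n", 41660.92 / 223.11 }'
# → 186.7x
```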

WOW Moment: Key Findings

The critical insight emerges when comparing baseline performance against concurrent load while isolating token generation metrics. The data reveals a distinct pattern: Inter-Token Latency (ITL) remains stable even as TTFT collapses.

This separation proves that the inference model itself is not the bottleneck. Once the model begins processing a request, it generates tokens at a consistent pace. The performance degradation is entirely confined to the queue management and prefill phase.

Metric               Single-User Baseline   High Concurrency (50 Users)   Goodput Analysis (SLO: 500ms)
TTFT Average         223.11 ms              41,660.92 ms                  37,380.20 ms
ITL Average          10.67 ms               10.38 ms                      9.71 ms
Request Throughput   0.76 req/sec           0.88 req/sec                  0.91 req/sec
Goodput              N/A                    N/A                           0.01 req/sec
SLO Compliance       100%                   ~1%                           1%

Why this matters: The stability of ITL (hovering around 10ms across all runs) indicates that the decode phase is efficient. The model is capable of streaming tokens rapidly. The bottleneck is architectural: requests are queuing up waiting for the prefill phase to complete. This distinction changes the remediation strategy entirely. If ITL were also degrading, the solution would require faster hardware or a smaller model. Since only TTFT is affected, the fix lies in queue optimization, request routing, or horizontal scaling of inference replicas.

Furthermore, the Goodput metric exposes the true user impact. While throughput suggests the system is handling nearly 1 request per second, Goodput (measuring requests that meet the 500ms TTFT SLO) drops to 0.01 req/sec. This confirms that roughly 99% of requests are failing to meet user-experience standards, rendering the high throughput metric irrelevant for user satisfaction.
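The ~99% failure figure follows directly from the ratio of the two reported rates:

```shell
# Share of completed requests that met the 500 ms TTFT SLO:
# goodput / throughput = 0.01 / 0.88
awk 'BEGIN { printf "%.1f%% of requests meet the SLO\n", 0.01 / 0.88 * 100 }'
# → 1.1% of requests meet the SLO
```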

Core Solution

To accurately diagnose LLM performance, engineers must shift from throughput-centric testing to Goodput-centric testing using NVIDIA AIPerf. This tool enables the definition of Service Level Objectives (SLOs) and measures the rate of requests that actually satisfy those constraints under load.

Implementation Strategy

  1. Define User-Centric SLOs: Establish a maximum acceptable TTFT based on user experience requirements. For streaming interfaces, 500ms is a common threshold for perceived responsiveness.
  2. Simulate Production Concurrency: Never rely on single-user tests. Configure concurrency levels that match or exceed expected peak load.
  3. Measure Goodput: Use AIPerf's --goodput flag to filter throughput based on the SLO. This provides the only metric that correlates with user satisfaction.
  4. Analyze ITL Stability: Verify that ITL remains constant. If ITL spikes, the issue is compute-bound; if ITL is stable and TTFT spikes, the issue is queue-bound.

Technical Implementation

The following example demonstrates a robust testing workflow using shell variables for reusability and structured parameter management. This approach ensures tests are reproducible and adaptable to different endpoints.

#!/bin/bash
# LLM Performance Diagnostic Script
# Uses NVIDIA AIPerf to evaluate queue saturation and goodput

# Configuration
TARGET_MODEL="granite4:350m"
INFERENCE_ENDPOINT="http://localhost:11434"
ENDPOINT_TYPE="chat"
TOKENIZER="builtin"

# Load Parameters
CONCURRENCY_LEVEL=50
WARMUP_REQUESTS=10
TEST_DURATION_SEC=60

# SLO Definition
# TTFT must be under 500ms to count as a successful request
TTFT_SLO_MS=500

echo "Starting AIPerf diagnostic..."
echo "Model: ${TARGET_MODEL} | Concurrency: ${CONCURRENCY_LEVEL}"
echo "SLO: TTFT < ${TTFT_SLO_MS}ms"

# Execute Profile with Goodput Constraint
aiperf profile \
  --model "${TARGET_MODEL}" \
  --url "${INFERENCE_ENDPOINT}" \
  --endpoint-type "${ENDPOINT_TYPE}" \
  --tokenizer "${TOKENIZER}" \
  --concurrency "${CONCURRENCY_LEVEL}" \
  --warmup-request-count "${WARMUP_REQUESTS}" \
  --benchmark-duration "${TEST_DURATION_SEC}" \
  --goodput "time_to_first_token:${TTFT_SLO_MS}" \
  --streaming

echo "Diagnostic complete. Review Goodput vs Throughput ratio."

Architecture Decisions:

  • Warmup Requests: The --warmup-request-count parameter eliminates cold-start artifacts, ensuring metrics reflect steady-state performance.
  • Streaming Mode: Enabling --streaming captures TTFT and ITL accurately, as non-streaming endpoints may buffer responses, skewing latency measurements.
  • Goodput Flag: The --goodput parameter is the critical differentiator. It instructs AIPerf to calculate throughput only for requests where time_to_first_token is below the specified threshold. This filters out "successful" requests that are effectively failures from a user perspective.
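A single run captures the symptom at one load level only. To locate the concurrency at which Goodput collapses, the same flags can be swept across load levels. The sketch below only prints each aiperf invocation so the plan can be reviewed before piping it to bash; the levels chosen (1 through 50) are illustrative:

```shell
#!/bin/bash
# Emit one aiperf invocation per concurrency level so the sweep can be
# reviewed (or piped to `bash`) rather than executed blindly.
sweep_cmds() {
  local model="granite4:350m"
  local url="http://localhost:11434"
  local slo_ms=500
  local level
  for level in 1 5 10 25 50; do
    echo "aiperf profile --model ${model} --url ${url}" \
         "--endpoint-type chat --tokenizer builtin" \
         "--concurrency ${level} --warmup-request-count 10" \
         "--benchmark-duration 60" \
         "--goodput time_to_first_token:${slo_ms} --streaming"
  done
}

# Preview the sweep plan; execute with: sweep_cmds | bash
sweep_cmds
```

Plotting Goodput against concurrency from such a sweep exposes the knee of the curve, which is the practical capacity of the deployment.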

Pitfall Guide

  • The Single-User Mirage: Testing with --concurrency 1 yields optimistic metrics that vanish under load; queues do not form with one user, hiding saturation issues. Fix: Always test at concurrency levels matching production peaks. Start at 10x expected load to identify breaking points.
  • Averaging Latency: Average TTFT masks tail latency; a system with an average TTFT of 200ms might have a p99 of 60s, causing timeouts for a subset of users. Fix: Monitor p95 and p99 percentiles alongside averages. Use Goodput to enforce percentile-based SLOs.
  • Ignoring ITL Stability: Assuming the model is slow when TTFT increases; if ITL remains constant, the model is fine and the queue is the problem. Fix: Always compare ITL across load levels. Stable ITL + high TTFT = queue bottleneck; rising ITL = compute bottleneck.
  • Throughput Celebration: Celebrating high req/sec while users complain; throughput counts all completed requests, including those that took minutes to start. Fix: Prioritize Goodput over Throughput. A system with 1.0 req/sec throughput but 0.01 req/sec goodput is failing.
  • Cold Start Noise: Running tests without warmup requests captures model loading and cache warming, artificially inflating latency metrics. Fix: Include --warmup-request-count to stabilize the inference server before measurement begins.
  • Static SLOs: Using arbitrary SLO values without user research; an SLO of 500ms might be too strict for complex reasoning models or too loose for chat. Fix: Define SLOs based on user tolerance studies, adjusted per model capability and use case.
  • Endpoint Mismatch: Testing against a different endpoint configuration than production (e.g., different batch sizes or context windows). Fix: Ensure the test environment matches production exactly, including server flags and resource limits.
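The averaging pitfall is cheap to guard against. Given a plain-text file with one per-request TTFT sample in milliseconds per line (the file format here is an assumption, not an AIPerf export), percentiles need only sort and awk:

```shell
# percentile P FILE: the value at the ceil(N * P / 100)-th position of the
# numerically sorted samples (nearest-rank method).
percentile() {
  local p="$1" file="$2"
  sort -n "$file" | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      idx = int(NR * p / 100)
      if (idx < NR * p / 100) idx++   # round up to the next rank
      if (idx < 1) idx = 1
      print v[idx]
    }'
}

# Example: percentile 99 ttft_samples.txt
```

Comparing `percentile 50` against `percentile 99` on the same sample file makes the tail-latency gap visible immediately.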

Production Bundle

Action Checklist

  • Define SLOs: Establish maximum acceptable TTFT for your application context (e.g., 500ms for chat, 2000ms for code completion).
  • Install AIPerf: Deploy NVIDIA AIPerf in your CI/CD or staging environment using pip install aiperf.
  • Configure Concurrency: Set concurrency parameters to match or exceed peak production load estimates.
  • Execute Goodput Test: Run AIPerf with the --goodput flag targeting your TTFT SLO.
  • Analyze ITL: Verify Inter-Token Latency remains stable. If ITL spikes, investigate compute resources; if stable, investigate queue management.
  • Calculate Goodput Ratio: Compare Goodput to Throughput. A ratio below 0.5 indicates significant user experience degradation.
  • Iterate Architecture: If Goodput is low, implement horizontal scaling, request queuing optimization, or model routing strategies.
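The ratio check from the checklist can be scripted as a gate, for example in CI. The function below is a sketch in which the two metric values are supplied by hand; extracting them from AIPerf's output is left to the reader, and the 0.5 threshold mirrors the checklist:

```shell
# Gate on the Goodput/Throughput ratio; below 0.5 signals that a large
# share of "successful" requests are failing the user-experience SLO.
goodput_ratio_check() {
  local goodput="$1" throughput="$2"
  awk -v g="$goodput" -v t="$throughput" 'BEGIN {
    r = g / t
    printf "goodput ratio: %.2f\n", r
    exit (r >= 0.5) ? 0 : 1   # nonzero exit = degraded
  }'
}

if ! goodput_ratio_check 0.01 0.88; then
  echo "DEGRADED: most completed requests miss the SLO"
fi
```

With the article's own numbers (0.01 goodput, 0.88 throughput) the gate fires, which is exactly the failure the green dashboard hid.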

Decision Matrix

  • Queue Saturation (ITL stable, TTFT explodes): Prefill bottleneck; requests waiting in queue. Action: scale inference replicas, optimize batch scheduling, implement request prioritization.
  • Compute Bound (ITL increases, TTFT explodes): Model generation is slow; hardware insufficient. Action: upgrade GPU, switch to a smaller/faster model, optimize model quantization.
  • Network Latency (ITL stable, TTFT stable): Network overhead is high but consistent. Action: optimize the network path, check proxy configurations, verify endpoint health.
  • SLO Violation (ITL stable, TTFT variable): Some requests meet the SLO, others do not. Action: implement adaptive routing, use smaller models for simple queries, cache frequent prompts.
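The matrix can be encoded as a small triage helper. The 20% ITL-drift and 2x TTFT-growth thresholds below are illustrative assumptions, not AIPerf defaults; tune them to your own baselines:

```shell
# Classify a load-test result by comparing baseline vs loaded metrics (ms).
diagnose() {
  local itl_base="$1" itl_load="$2" ttft_base="$3" ttft_load="$4"
  awk -v ib="$itl_base" -v il="$itl_load" -v tb="$ttft_base" -v tl="$ttft_load" '
    BEGIN {
      itl_stable  = (il <= ib * 1.2)   # under 20% drift counts as stable
      ttft_spikes = (tl >= tb * 2.0)   # 2x growth or worse counts as exploding
      if (itl_stable && ttft_spikes)       print "queue-bound: scale replicas / fix scheduling"
      else if (!itl_stable && ttft_spikes) print "compute-bound: faster hardware or smaller model"
      else if (itl_stable && !ttft_spikes) print "healthy or network-dominated: check endpoint path"
      else                                 print "mixed: profile the decode path under load"
    }'
}

# The article's measurements classify as queue-bound:
diagnose 10.67 10.38 223.11 41660.92
# → queue-bound: scale replicas / fix scheduling
```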

Configuration Template

Use this template to standardize AIPerf testing across your organization. Save as aiperf_config.env and source before running tests.

# AIPerf Configuration Template
# Source this file before running aiperf profile

# Target System
export AIPERF_MODEL="granite4:350m"
export AIPERF_URL="http://localhost:11434"
export AIPERF_ENDPOINT_TYPE="chat"
export AIPERF_TOKENIZER="builtin"

# Load Profile
export AIPERF_CONCURRENCY=50
export AIPERF_WARMUP=10
export AIPERF_DURATION=60

# SLO Definitions (Format: metric:value)
export AIPERF_GOODPUT_TTFT="time_to_first_token:500"
export AIPERF_GOODPUT_LATENCY="request_latency:5000"

# Execution Command
# aiperf profile \
#   --model "${AIPERF_MODEL}" \
#   --url "${AIPERF_URL}" \
#   --endpoint-type "${AIPERF_ENDPOINT_TYPE}" \
#   --tokenizer "${AIPERF_TOKENIZER}" \
#   --concurrency "${AIPERF_CONCURRENCY}" \
#   --warmup-request-count "${AIPERF_WARMUP}" \
#   --benchmark-duration "${AIPERF_DURATION}" \
#   --goodput "${AIPERF_GOODPUT_TTFT}" \
#   --streaming

Quick Start Guide

  1. Install Tooling: Run pip install aiperf to install NVIDIA AIPerf.
  2. Set SLO: Determine your TTFT threshold (e.g., 500ms) based on user requirements.
  3. Run Test: Execute the AIPerf command with --concurrency 50 and --goodput "time_to_first_token:500".
  4. Review Results: Check the Goodput metric. If Goodput is significantly lower than Throughput, your system has a queue bottleneck.
  5. Diagnose: Confirm ITL stability. If ITL is constant, focus on scaling inference capacity or optimizing request routing rather than model optimization.