267 tok/s local inference on RTX 5090: llama.cpp MTP + Qwen3-35B-A3B MoE
Architecting High-Throughput Local Inference: MoE Models Meet Speculative Decoding
Current Situation Analysis
The push toward local large language model deployment has hit a hard ceiling: throughput. Engineering teams routinely benchmark dense architectures, configure speculative decoding pipelines, and discover that the draft-verify overhead often neutralizes any theoretical speedup. The industry assumption has been that speculative decoding requires a carefully paired smaller draft model, and that gains plateau quickly once verification rejection rates climb.
This assumption overlooks a structural mismatch in how modern routing architectures interact with token prediction loops. Mixture-of-Experts (MoE) models activate only a fraction of their total parameters per forward pass. When paired with Multi-Token Prediction (MTP) inside llama.cpp, this sparsity creates an unexpected computational surplus. Instead of the draft-verify loop becoming a bottleneck, the verification step becomes nearly free because the routing mechanism keeps the active expert set stable across consecutive tokens.
The data contradicts conventional wisdom. On identical hardware, a standard dense pipeline with speculative decoding often underperforms its non-speculative baseline due to verification overhead. Meanwhile, an MoE architecture running the same speculative configuration can raise throughput by more than half over its baseline (171 to 267 tok/s in the measurements below). This isn't a marginal optimization; it's a structural advantage that changes how local inference should be architected for high-volume workloads.
Cloud alternatives remain expensive at scale. Services like Claude Haiku deliver roughly 150 tok/s via API but bill at $150 per million tokens. For applications processing millions of requests daily, the marginal cost of local deployment shifts from prohibitive to negligible, provided the throughput ceiling can be broken.
WOW Moment: Key Findings
The performance delta between dense and sparse architectures under speculative decoding reveals why traditional benchmarking misleads engineering decisions.
| Approach | Throughput (tok/s) | Verification Overhead | Cost Profile |
|---|---|---|---|
| Ollama stock (35B MoE) | 171 | None | Electricity only |
| 27B Dense + MTP | 104 | High (rejection penalty) | Electricity only |
| 35B MoE + MTP | 267 | Low (sparse routing) | Electricity only |
| Claude Haiku (API) | ~150 | N/A (server-side) | $150/MTok |
The 267 tok/s result on an RTX 5090 isn't just faster than the cloud baseline; it's 56% faster than the local non-speculative MoE baseline. The dense model actually slows down under MTP because the verification pass forces full parameter activation without the sparsity buffer. MoE architectures bypass this penalty because the routing matrix remains stable across draft tokens, allowing the verification step to reuse cached expert activations.
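The percentages quoted above follow directly from the table. A minimal sketch of that arithmetic, using only the table's throughput figures (the helper name `percentFaster` is illustrative):

```typescript
// Throughput figures in tok/s, copied from the table above.
const throughput = {
  ollamaStockMoe: 171,
  denseMtp: 104,
  moeMtp: 267,
  claudeHaikuApi: 150, // approximate, as listed in the table
};

// Relative gain of `candidate` over `baseline`, expressed as a percentage.
function percentFaster(candidate: number, baseline: number): number {
  return (candidate / baseline - 1) * 100;
}

const vsStockMoe = percentFaster(throughput.moeMtp, throughput.ollamaStockMoe);
const vsDenseMtp = percentFaster(throughput.moeMtp, throughput.denseMtp);
const vsApi = percentFaster(throughput.moeMtp, throughput.claudeHaikuApi);

console.log(`MoE + MTP vs stock MoE:   +${vsStockMoe.toFixed(0)}%`);
console.log(`MoE + MTP vs dense + MTP: +${vsDenseMtp.toFixed(0)}%`);
console.log(`MoE + MTP vs API:         +${vsApi.toFixed(0)}%`);
```

The first ratio is the 56% figure cited in the text; the other two are derived the same way from the remaining rows.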
This finding enables production-grade local inference for streaming applications, real-time agents, and high-throughput data pipelines without architectural compromises. The draft-verify loop stops being a bottleneck and becomes a throughput multiplier.
Core Solution
Implementing this pipeline requires aligning hardware constraints, quantization strategy, and speculative decoding parameters. The following implementation uses llama.cpp's native MTP support, Qwen3-35B-A3B-Instruct, and a structured launch configuration optimized for RTX 5090 under WSL2.
Step 1: Environment Preparation
WSL2 requires explicit GPU compute configuration and memory allocation tuning. The default 50% host RAM cap will throttle KV cache expansion.
# .wslconfig (placed in %USERPROFILE%)
[wsl2]
memory=64GB
swap=16GB
Install the CUDA toolkit and cuBLAS for WSL2, then compile llama.cpp with CUDA and the server target enabled:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release -j $(nproc)
Step 2: Model Acquisition & Quantization
Qwen3-35B-A3B-Instruct requires careful quantization. MoE routing matrices degrade sharply below Q4_K_XL. Lower quantizations cause expert collapse, where the routing mechanism defaults to a single expert, destroying sparsity and negating MTP benefits.
# Download the pre-quantized GGUF weights
huggingface-cli download Qwen/Qwen3-35B-A3B-Instruct-GGUF \
  --include "*Q4_K_XL*" --local-dir ./models
# Verify expert routing integrity (2>&1 merges stderr into stdout so grep sees the verbose log)
./build/bin/llama-cli -m ./models/qwen3-35b-a3b-instruct-q4_k_xl.gguf \
  --prompt "Verify routing" --n-gpu-layers 99 --verbose 2>&1 | grep -i "expert"
Step 3: Server Launch Configuration
The launch script isolates MTP parameters, context management, and API exposure. Variable names are structured for production environment injection.
#!/usr/bin/env bash
# launch_mtp_server.sh

export INFERENCE_MODEL_PATH="./models/qwen3-35b-a3b-instruct-q4_k_xl.gguf"
export INFERENCE_CTX_SIZE=65536
export INFERENCE_GPU_LAYERS=99
export INFERENCE_MTP_DRAFT=2
export INFERENCE_THREADS=$(nproc)
export INFERENCE_PORT=8080

./build/bin/llama-server \
  --model "${INFERENCE_MODEL_PATH}" \
  --ctx-size "${INFERENCE_CTX_SIZE}" \
  --gpu-layers "${INFERENCE_GPU_LAYERS}" \
  --threads "${INFERENCE_THREADS}" \
  --mtp-draft "${INFERENCE_MTP_DRAFT}" \
  --port "${INFERENCE_PORT}" \
  --host 0.0.0.0 \
  --flash-attn \
  --cache-reuse \
  --no-mmap
Architecture Rationale:
- --mtp-draft 2: Limits draft token generation to two tokens per step. Higher values increase rejection rates exponentially in MoE models due to routing drift.
- --cache-reuse: Enables KV cache persistence across requests, critical for maintaining expert activation states.
- --no-mmap: Forces explicit memory allocation, preventing WSL2 page-fault latency spikes during context expansion.
- --flash-attn: Optimizes attention computation for RTX 5090 tensor cores, reducing verification overhead.
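One way to keep the flag set and the injected variables in sync is to build the argument list in code. The sketch below mirrors the launch script under a few assumptions: the `ServerConfig` type and the helper names are hypothetical, the flags are copied verbatim from the script rather than checked against a specific llama.cpp release, and the default thread count is supplied by the caller as a stand-in for `$(nproc)`.

```typescript
interface ServerConfig {
  modelPath: string;
  ctxSize: number;
  gpuLayers: number;
  threads: number;
  mtpDraft: number;
  port: number;
}

type Env = Record<string, string | undefined>;

// Parse a positive integer from an env-style map, or use the fallback when unset.
function readInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${key} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Defaults mirror launch_mtp_server.sh; defaultThreads stands in for $(nproc).
function configFromEnv(env: Env, defaultThreads: number): ServerConfig {
  return {
    modelPath: env["INFERENCE_MODEL_PATH"] ?? "./models/qwen3-35b-a3b-instruct-q4_k_xl.gguf",
    ctxSize: readInt(env, "INFERENCE_CTX_SIZE", 65536),
    gpuLayers: readInt(env, "INFERENCE_GPU_LAYERS", 99),
    threads: readInt(env, "INFERENCE_THREADS", defaultThreads),
    mtpDraft: readInt(env, "INFERENCE_MTP_DRAFT", 2),
    port: readInt(env, "INFERENCE_PORT", 8080),
  };
}

// Produce the same argument list the shell script passes to llama-server.
function buildServerArgs(cfg: ServerConfig): string[] {
  return [
    "--model", cfg.modelPath,
    "--ctx-size", String(cfg.ctxSize),
    "--gpu-layers", String(cfg.gpuLayers),
    "--threads", String(cfg.threads),
    "--mtp-draft", String(cfg.mtpDraft),
    "--port", String(cfg.port),
    "--host", "0.0.0.0",
    "--flash-attn",
    "--cache-reuse",
    "--no-mmap",
  ];
}

const args = buildServerArgs(configFromEnv({ INFERENCE_CTX_SIZE: "32768" }, 8));
console.log(args.join(" "));
```

Validating the numeric variables up front means a typo such as `INFERENCE_MTP_DRAFT=two` fails at launch time instead of surfacing as an opaque server error.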
Step 4: TypeScript Client Integration
Production clients must handle streaming responses and fall back gracefully when speculative verification fails.
// src/inference/mtp-client.ts
import { Readable } from 'stream';

interface MTPRequestConfig {
  model: string;
  prompt: string;
  maxTokens: number;
  temperature: number;
  stream: boolean; // kept for API parity; generateStream always streams
}

export class LocalMTPClient {
  private readonly endpoint: string;
  private readonly timeout: number;

  constructor(endpoint: string = 'http://localhost:8080', timeout: number = 30000) {
    this.endpoint = endpoint;
    this.timeout = timeout;
  }

  async generateStream(config: MTPRequestConfig): Promise<Readable> {
    const response = await fetch(`${this.endpoint}/v1/chat/completions`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: config.model,
        messages: [{ role: 'user', content: config.prompt }],
        max_tokens: config.maxTokens,
        temperature: config.temperature,
        stream: true,
      }),
      signal: AbortSignal.timeout(this.timeout),
    });
    if (!response.ok || !response.body) {
      throw new Error(`Inference failed: ${response.status} ${response.statusText}`);
    }
    // Bridge the web ReadableStream returned by fetch into a Node.js Readable.
    return Readable.fromWeb(response.body as any);
  }

  async parseStream(stream: Readable): Promise<string> {
    const chunks: string[] = [];
    const decoder = new TextDecoder();
    let buffer = '';
    for await (const chunk of stream) {
      // Network chunks can end mid-line, so carry the trailing partial line over.
      buffer += typeof chunk === 'string' ? chunk : decoder.decode(chunk, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop() ?? '';
      for (const line of lines) {
        if (!line.startsWith('data: ')) continue;
        const data = line.slice('data: '.length).trim();
        // The OpenAI-compatible stream ends with a non-JSON "[DONE]" sentinel.
        if (data === '[DONE]') return chunks.join('');
        const payload = JSON.parse(data);
        const content = payload.choices?.[0]?.delta?.content;
        if (content) chunks.push(content);
      }
    }
    return chunks.join('');
  }
}
Pitfall Guide
1. Expert Routing Collapse from Aggressive Quantization
Explanation: Quantizing below Q4_K_XL degrades the routing matrix precision. The model defaults to activating a single expert for all tokens, eliminating sparsity. MTP verification then becomes computationally expensive, negating throughput gains.
Fix: Lock quantization at Q4_K_XL or higher. Validate routing distribution using --verbose and confirm expert activation spreads across multiple indices.
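"Spreads across multiple indices" can be turned into a number. The sketch below assumes you have already collected per-expert activation counts by some means (the source does not specify the log format, so the input array is hypothetical) and scores them with normalized Shannon entropy: values near 1 indicate even routing, values near 0 indicate collapse onto one expert.

```typescript
// Normalized Shannon entropy of expert usage counts.
// Returns 1 for perfectly even routing and 0 when a single expert takes every token.
function routingEntropy(expertCounts: number[]): number {
  const total = expertCounts.reduce((sum, count) => sum + count, 0);
  if (expertCounts.length < 2 || total === 0) return 0;
  let entropy = 0;
  for (const count of expertCounts) {
    if (count === 0) continue; // 0 * log(0) is treated as 0
    const p = count / total;
    entropy -= p * Math.log(p);
  }
  return entropy / Math.log(expertCounts.length);
}

// Hypothetical counts for an 8-expert layer.
const healthy = routingEntropy([120, 95, 110, 130, 105, 90, 115, 100]);
const collapsed = routingEntropy([865, 0, 0, 0, 0, 0, 0, 0]);

console.log(`healthy routing:   ${healthy.toFixed(3)}`);
console.log(`collapsed routing: ${collapsed.toFixed(3)}`);
```

Any alert threshold on this score is a judgment call; comparing the same prompt set across quantization levels is likely more informative than an absolute cutoff.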
2. Context Window Fragmentation
Explanation: Setting ctx-size to 65536 without cache management causes KV cache fragmentation. The verification pass must recompute attention for fragmented sequences, increasing latency.
Fix: Enable --cache-reuse and implement sliding window truncation in the application layer. Monitor cache behavior through the server's monitoring endpoints, such as /health and, when the server is started with --metrics, /metrics.
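Sliding window truncation in the application layer can be sketched as follows. This is one possible approach, not the article's implementation: the `ChatMessage` shape follows the OpenAI-style messages used by the client, and `estimateTokens` is a rough 4-characters-per-token assumption that should be replaced with the model's real tokenizer.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Rough estimate (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep a leading system prompt plus the most recent messages that fit the budget.
function truncateToWindow(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const hasSystem = messages.length > 0 && messages[0].role === "system";
  const system = hasSystem ? [messages[0]] : [];
  const rest = hasSystem ? messages.slice(1) : messages;

  let budget = maxTokens - system.reduce((n, m) => n + estimateTokens(m.content), 0);
  const kept: ChatMessage[] = [];
  // Walk backwards from the newest message and stop at the first one that does not fit.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (cost > budget) break;
    kept.unshift(rest[i]);
    budget -= cost;
  }
  return [...system, ...kept];
}

const history: ChatMessage[] = [
  { role: "system", content: "You are helpful." },
  { role: "user", content: "a".repeat(40) },
  { role: "assistant", content: "b".repeat(40) },
  { role: "user", content: "c".repeat(40) },
];
const windowed = truncateToWindow(history, 25);
console.log(windowed.map((m) => m.role).join(", "));
```

Dropping whole messages from the oldest end keeps the retained suffix of the conversation contiguous.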
3. WSL2 Memory Allocation Throttling
Explanation: WSL2 defaults to 50% of host RAM. KV cache expansion for 65536 context triggers swap thrashing, dropping throughput to <40 tok/s.
Fix: Configure .wslconfig with an explicit memory limit. Use wsl --shutdown and restart after changes. Verify allocation with free -h inside WSL2.
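To judge how much memory a 65536-token context needs, the KV cache footprint can be estimated from the attention shape: two tensors (keys and values) per layer, each of size KV heads × head dimension × context length. The shape values below are illustrative assumptions, not numbers read from the model file; substitute the values reported in your GGUF metadata.

```typescript
interface KvCacheShape {
  layers: number;          // transformer layers holding a KV cache
  kvHeads: number;         // key/value heads per layer (after grouped-query sharing)
  headDim: number;         // dimension of each head
  contextTokens: number;   // configured context length
  bytesPerElement: number; // 2 for f16 cache entries
}

// Keys and values each store layers * kvHeads * headDim * contextTokens elements.
function kvCacheBytes(shape: KvCacheShape): number {
  return (
    2 *
    shape.layers *
    shape.kvHeads *
    shape.headDim *
    shape.contextTokens *
    shape.bytesPerElement
  );
}

// Hypothetical architecture values, for illustration only.
const exampleShape: KvCacheShape = {
  layers: 48,
  kvHeads: 4,
  headDim: 128,
  contextTokens: 65536,
  bytesPerElement: 2,
};

const kvGiB = kvCacheBytes(exampleShape) / 2 ** 30;
console.log(`Estimated KV cache: ${kvGiB.toFixed(2)} GiB`);
```

With these assumed values the estimate comes to 6 GiB, on top of the model weights. The cost scales linearly with context length, so halving `--ctx-size` halves it.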
4. Draft Length Misconfiguration
Explanation: Setting --mtp-draft above 2 causes routing drift. MoE expert selection changes per token, making draft tokens statistically unlikely to match verification. Rejection rates exceed 60%, slowing generation.
Fix: Cap draft length at 2. If higher draft lengths are required, switch to dense architectures or implement adaptive draft scaling based on routing stability metrics.
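The trade-off behind the draft-length cap can be made concrete with the standard speculative decoding estimate. Assuming each draft token is accepted independently with probability α (a simplification; real acceptance is correlated), a draft of length k yields on average 1 + α + α² + … + α^k = (1 − α^(k+1)) / (1 − α) tokens per verification step. The function name is illustrative.

```typescript
// Expected tokens produced per verification step for draft length k and
// per-token acceptance rate alpha, assuming independent acceptances.
function expectedTokensPerStep(alpha: number, draftLength: number): number {
  if (alpha < 0 || alpha > 1) throw new RangeError("alpha must be in [0, 1]");
  if (draftLength < 0) throw new RangeError("draftLength must be non-negative");
  if (alpha === 1) return draftLength + 1; // every draft token accepted
  return (1 - Math.pow(alpha, draftLength + 1)) / (1 - alpha);
}

// Compare a high acceptance rate with the 60% rejection case (alpha = 0.4).
for (const alpha of [0.8, 0.4]) {
  for (const k of [1, 2, 4]) {
    const tokens = expectedTokensPerStep(alpha, k);
    console.log(`alpha=${alpha} k=${k} -> ${tokens.toFixed(2)} tokens/step`);
  }
}
```

At α = 0.4, going from k = 2 to k = 4 raises the expected yield only from about 1.56 to about 1.65 tokens per step, while the draft work doubles. That is why longer drafts stop paying off once rejection rates climb.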
5. Thermal Throttling on RTX 5090
Explanation: Sustained inference pushes VRAM and tensor cores to thermal limits. Without fan curve management, clocks drop by 15-20%, reducing throughput to baseline levels.
Fix: Configure nvidia-smi power limits and fan profiles. Use --gpu-layers 99 to maximize VRAM utilization, reducing PCIe transfer overhead that generates additional heat.
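Throttling is easier to catch if clocks are sampled during a sustained run. The sketch below parses one line of output from `nvidia-smi --query-gpu=temperature.gpu,clocks.sm,power.draw --format=csv,noheader,nounits` and flags a clock drop. The type and function names are hypothetical, and the 15% default threshold is taken from the lower end of the range quoted above.

```typescript
interface GpuSample {
  temperatureC: number;
  smClockMhz: number;
  powerDrawW: number;
}

// Parse one CSV line in the order temperature.gpu, clocks.sm, power.draw.
function parseGpuSample(line: string): GpuSample {
  const fields = line.split(",").map((field) => {
    const trimmed = field.trim();
    return trimmed === "" ? NaN : Number(trimmed);
  });
  if (fields.length !== 3 || fields.some((value) => Number.isNaN(value))) {
    throw new Error(`Unexpected nvidia-smi output: "${line}"`);
  }
  const [temperatureC, smClockMhz, powerDrawW] = fields;
  return { temperatureC, smClockMhz, powerDrawW };
}

// True when the SM clock is more than dropFraction below the reference clock.
function isThrottled(
  sample: GpuSample,
  referenceClockMhz: number,
  dropFraction: number = 0.15,
): boolean {
  return sample.smClockMhz < referenceClockMhz * (1 - dropFraction);
}

const sample = parseGpuSample("71, 2407, 412.35");
console.log(sample, isThrottled(sample, 2400));
```

The reference clock is best taken from a short run on a cool card, so the comparison reflects your own hardware instead of a spec-sheet figure.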
6. Ignoring Batch Size Interactions
Explanation: MTP performance degrades under concurrent requests. The verification pass serializes when multiple streams compete for expert routing tables.
Fix: Limit concurrent streams to 2-3 per RTX 5090. Implement request queuing with backpressure. Use --parallel cautiously; MoE models do not scale linearly with batch size.
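Request queuing with backpressure can be sketched as a small limiter placed in front of the client. This is an illustrative design, not part of llama.cpp: `StreamLimiter` and its parameters are hypothetical names, a task starts immediately when a slot is free, waits in a bounded queue otherwise, and is rejected outright when the queue is full so callers can shed load.

```typescript
type Task<T> = () => Promise<T>;

class StreamLimiter {
  private readonly maxConcurrent: number;
  private readonly maxQueued: number;
  private active = 0;
  private readonly waiting: Array<() => void> = [];

  constructor(maxConcurrent: number, maxQueued: number) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueued = maxQueued;
  }

  get activeCount(): number {
    return this.active;
  }

  get queuedCount(): number {
    return this.waiting.length;
  }

  // Run now if a slot is free, queue if not, reject when the queue is full.
  run<T>(task: Task<T>): Promise<T> {
    if (this.active < this.maxConcurrent) {
      return this.start(task);
    }
    if (this.waiting.length >= this.maxQueued) {
      return Promise.reject(new Error("inference queue full"));
    }
    return new Promise<T>((resolve, reject) => {
      this.waiting.push(() => {
        this.start(task).then(resolve, reject);
      });
    });
  }

  // Occupy a slot for the lifetime of the task, then hand it to the next waiter.
  private start<T>(task: Task<T>): Promise<T> {
    this.active++;
    return task().finally(() => {
      this.active--;
      const next = this.waiting.shift();
      if (next) next();
    });
  }
}
```

A caller would wrap each generation, for example `limiter.run(() => client.parseStream(stream))` with `new StreamLimiter(2, 16)`, and translate the "queue full" rejection into an HTTP 429 or a retry hint.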
7. Flash Attention Compatibility Gaps
Explanation: Older llama.cpp builds lack optimized flash attention for RTX 5090 architecture. Verification passes fall back to standard attention, increasing compute overhead.
Fix: Compile from source with GGML_CUDA=ON and verify --flash-attn is active. Check build logs for cuBLAS and Tensor Core initialization messages.
Production Bundle
Action Checklist
- Verify WSL2 memory allocation matches .wslconfig settings before launch
- Compile llama.cpp with CUDA and flash attention flags enabled
- Validate expert routing distribution using verbose logging
- Set --mtp-draft to 2 and monitor rejection rates via API metrics
- Configure thermal management profiles for sustained tensor core utilization
- Implement client-side stream parsing with timeout and fallback logic
- Test KV cache reuse under concurrent request simulation
- Document quantization level and routing stability thresholds for rollback
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume streaming (>10k req/day) | Local MoE + MTP | 267 tok/s throughput, near-zero marginal cost | Electricity only (~$0.02/MTok) |
| Low-latency single-turn queries | Cloud API (Haiku) | Optimized routing, no infrastructure overhead | $150/MTok |
| Budget-constrained prototyping | Local Dense + MTP | Lower VRAM footprint, simpler setup | Electricity only, but 104 tok/s ceiling |
| Multi-modal or tool-use pipelines | Local MoE (no MTP) | Stable expert routing, predictable latency | 171 tok/s, higher compute cost than MTP |
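The cost column can be turned into a daily figure for a given workload. The sketch below uses the per-million-token prices exactly as listed in the matrix; the request volume and tokens-per-request values are hypothetical inputs chosen for illustration.

```typescript
// Prices in USD per million tokens, as listed in the decision matrix.
const API_USD_PER_MTOK = 150;
const LOCAL_USD_PER_MTOK = 0.02;

// Daily spend for a token volume at a given per-million-token price.
function dailyCostUsd(tokensPerDay: number, usdPerMTok: number): number {
  return (tokensPerDay / 1000000) * usdPerMTok;
}

// Hypothetical workload: 10,000 requests per day at 1,000 tokens each.
const requestsPerDay = 10000;
const tokensPerRequest = 1000;
const tokensPerDay = requestsPerDay * tokensPerRequest;

const apiCost = dailyCostUsd(tokensPerDay, API_USD_PER_MTOK);
const localCost = dailyCostUsd(tokensPerDay, LOCAL_USD_PER_MTOK);

console.log(`API:   $${apiCost.toFixed(2)} per day`);
console.log(`Local: $${localCost.toFixed(2)} per day`);
```

The local figure covers electricity only, as in the matrix; hardware amortization and operations time would need to be added for a full comparison.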
Configuration Template
# docker-compose.yml (production-ready wrapper)
# Note: ${...} references under `command` are interpolated by Compose from the
# host shell or a .env file, not from the `environment` block below.
version: '3.8'
services:
  mtp-inference:
    image: ghcr.io/ggerganov/llama.cpp:full-cuda
    runtime: nvidia
    environment:
      - MODEL_PATH=/models/qwen3-35b-a3b-instruct-q4_k_xl.gguf
      - CTX_SIZE=65536
      - MTP_DRAFT=2
      - GPU_LAYERS=99
      - PORT=8080
    volumes:
      - ./models:/models
      - ./logs:/app/logs
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: >
      /app/bin/llama-server
      --model ${MODEL_PATH}
      --ctx-size ${CTX_SIZE}
      --gpu-layers ${GPU_LAYERS}
      --mtp-draft ${MTP_DRAFT}
      --port ${PORT}
      --host 0.0.0.0
      --flash-attn
      --cache-reuse
      --no-mmap
      --log-filename /app/logs/inference.log
    restart: unless-stopped
Quick Start Guide
- Prepare Environment: Install WSL2 with Ubuntu 24.04, configure .wslconfig with 64GB memory, and install NVIDIA CUDA toolkit for WSL2.
- Compile & Pull: Clone llama.cpp, build with GGML_CUDA=ON, and download Qwen3-35B-A3B-Instruct-GGUF (Q4_K_XL variant).
- Launch Server: Run the provided launch_mtp_server.sh script. Verify startup logs show flash-attn and cache-reuse initialization.
- Validate Throughput: Send a streaming request to http://localhost:8080/v1/chat/completions. Monitor token generation rate; expect ~267 tok/s under single-stream load.
- Integrate Client: Import the TypeScript LocalMTPClient class, configure endpoint and timeout, and implement stream parsing with error handling for production routing.
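For the throughput validation step, the generation rate can be measured on the client side. The sketch below is a minimal meter under stated assumptions: `ThroughputMeter` is a hypothetical helper, timestamps are passed in by the caller (for example from `performance.now()`), and the token counts must come from the caller, since a single streamed delta may not correspond to exactly one token.

```typescript
class ThroughputMeter {
  private startedAtMs: number | null = null;
  private tokens = 0;

  // Call when the first streamed delta arrives, so prompt processing is excluded.
  start(nowMs: number): void {
    this.startedAtMs = nowMs;
    this.tokens = 0;
  }

  // Add the number of tokens received since the last call.
  add(tokenCount: number): void {
    this.tokens += tokenCount;
  }

  // Average generation rate between start() and nowMs, in tokens per second.
  tokensPerSecond(nowMs: number): number {
    if (this.startedAtMs === null || nowMs <= this.startedAtMs) return 0;
    return this.tokens / ((nowMs - this.startedAtMs) / 1000);
  }
}

// Example with fixed timestamps: 534 tokens over 2 seconds.
const meter = new ThroughputMeter();
meter.start(0);
meter.add(534);
const measuredRate = meter.tokensPerSecond(2000);
console.log(`${measuredRate} tok/s`);
```

Starting the clock at the first delta separates generation speed from prompt processing time, which matters when comparing against a single-stream figure such as the ~267 tok/s quoted above.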
