eliminate PCIe transfer overhead. They run 25-40 tokens/sec on 14B models and 6-10 tokens/sec on 70B models. They are optimal when memory bandwidth is the bottleneck rather than compute.
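As a rough sanity check on the bandwidth-bound claim, a common rule of thumb is that generating each token streams the full set of weights through memory once, so decode speed is capped near memory bandwidth divided by weight size. The sketch below applies that ceiling. The 400 GB/s bandwidth and the 8 GB and 40 GB weight sizes are illustrative assumptions, not measurements from this guide.

```python
def decode_tokens_per_sec_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when every token reads all weights once."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


# Hypothetical machine with 400 GB/s of memory bandwidth (assumed value).
BANDWIDTH_GB_S = 400.0

# Assumed weight footprints: about 8 GB for a 4-bit 14B model,
# about 40 GB for a 4-bit 70B model.
ceiling_14b = decode_tokens_per_sec_ceiling(BANDWIDTH_GB_S, 8.0)
ceiling_70b = decode_tokens_per_sec_ceiling(BANDWIDTH_GB_S, 40.0)

print(f"14B ceiling: {ceiling_14b:.0f} tokens/sec")
print(f"70B ceiling: {ceiling_70b:.0f} tokens/sec")
```

Measured throughput should land below this ceiling, since KV cache reads, compute, and scheduling overhead all take their share.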
Step 2: Model Selection Matrix
Model choice should align with task requirements, not leaderboard rankings.
- Qwen 3 (8B/14B/32B dense, plus 30B-A3B and 235B-A22B MoE): The current default for general-purpose deployment. Native ChatML formatting, robust tool-calling, and strong multilingual performance make it the safest baseline. The 14B variant offers the best balance between capability and resource consumption.
- Llama 3.1 8B / Llama 3.3 70B: Use the 8B model as a benchmark reference (Llama 3.3 itself ships only as a 70B model). The 70B model closes the gap to frontier models on long-context tasks. Ideal when evaluation consistency matters.
- Phi-4 (14B): Prioritize for code-heavy or reasoning-intensive pipelines. The 16k context window is a constraint, but reasoning density per token is high.
- DeepSeek-R1 Distillates: Deploy only when multi-step reasoning is required. The chain-of-thought output increases latency and token consumption, making them unsuitable for short-response interfaces.
Step 3: Serving Architecture & Stack Selection
The serving layer dictates concurrency handling, API compatibility, and operational overhead.
- Ollama: Best for rapid prototyping and single-user deployments. Exposes an OpenAI-compatible endpoint at `localhost:11434`. Conservative defaults reduce configuration overhead but limit fine-grained control.
- vLLM: Required for multi-user or high-throughput environments. CPU support matured in 2025, and the continuous batching scheduler dramatically improves throughput under concurrent load. Setup is heavier but scales predictably.
- MLX-LM: Apple Silicon exclusive. Offers clean Python bindings and optimized memory management. Use when deploying on macOS infrastructure.
- LocalAI: Suitable for polyglot environments requiring text, embedding, and image generation from a single endpoint. Backend abstraction adds latency but reduces application code complexity.
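Because Ollama and vLLM both expose an OpenAI-compatible completions route, the same client code can target either one by swapping the base URL. The following is a minimal sketch using only the standard library. The ports come from this guide (11434 for Ollama, 8000 for vLLM), while the `BACKENDS` table and the helper names are hypothetical.

```python
import json
import urllib.request

# Base URLs as used in this guide; adjust hosts and ports for your setup.
BACKENDS = {
    "ollama": "http://localhost:11434",
    "vllm": "http://localhost:8000",
}


def build_completion_request(backend: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the backend's /v1/completions route."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": 64}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BACKENDS[backend]}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(backend: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first completion's text (needs a running server)."""
    request = build_completion_request(backend, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["choices"][0]["text"]
```

With a server running, `complete("ollama", "qwen3:14b", "Hello")` and the equivalent vLLM call differ only in the backend key and model name.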
Step 4: Quantization Strategy
Quantization is not a one-size-fits-all setting. It is a trade-off between memory footprint, evaluation speed, and output fidelity.
- Q4_K_M (~4.5 bits/weight): The production default. 95% of workloads should start here.
- Q5_K_M (~5.5 bits/weight): Use when headroom exists and marginal quality gains justify the roughly 22% size increase.
- IQ4_XS: Importance-aware quantization. Matches Q4_K_M footprint but improves quality on critical weights. Slower evaluation due to metadata overhead. Reserve for quality-sensitive pipelines.
- IQ3_M and below: Aggressive compression. Necessary for fitting 70B models on 16GB GPUs, but introduces noticeable degradation in reasoning and instruction following.
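The footprint trade-off can be estimated with simple arithmetic: weight size is roughly parameters times bits per weight divided by eight. The sketch below applies that to the bit widths quoted above. The function name is hypothetical, and the estimate covers weights only, not the KV cache or runtime overhead.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters * bits / 8."""
    return params_billion * bits_per_weight / 8


# Bits-per-weight figures as quoted in this guide.
q4 = weight_footprint_gb(14, 4.5)
q5 = weight_footprint_gb(14, 5.5)
increase_pct = (q5 / q4 - 1) * 100

print(f"14B at Q4_K_M: {q4:.2f} GB")
print(f"14B at Q5_K_M: {q5:.2f} GB")
print(f"Size increase: {increase_pct:.0f}%")
```

Running the same function with other parameter counts gives a quick first check on whether a model can fit in the available memory at all.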
Implementation Examples
Custom Inference Router (Python)
This wrapper abstracts backend differences and routes requests based on workload type.
```python
import httpx


class LocalInferenceRouter:
    """Routes completion requests to a GPU or CPU backend by priority."""

    def __init__(self, cpu_endpoint: str, gpu_endpoint: str):
        # Both backends are expected to expose an OpenAI-compatible API.
        # httpx defaults to a 5-second timeout, which is too short for generation.
        self.cpu_client = httpx.AsyncClient(base_url=cpu_endpoint, timeout=120.0)
        self.gpu_client = httpx.AsyncClient(base_url=gpu_endpoint, timeout=120.0)

    async def generate(self, prompt: str, model: str, priority: str = "interactive") -> str:
        # Interactive traffic goes to the GPU; everything else goes to the CPU node.
        client = self.gpu_client if priority == "interactive" else self.cpu_client
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": False,
            "temperature": 0.7,
            "max_tokens": 1024,
        }
        response = await client.post("/v1/completions", json=payload)
        response.raise_for_status()
        return response.json()["choices"][0]["text"].strip()

    async def close(self):
        # Release the underlying connection pools.
        await self.cpu_client.aclose()
        await self.gpu_client.aclose()
```
vLLM Async Server Deployment (Python)
Configures continuous batching and memory optimization for concurrent serving.
```python
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


def initialize_vllm_server(model_path: str, gpu_memory_utilization: float = 0.85):
    engine_args = AsyncEngineArgs(
        model=model_path,
        gpu_memory_utilization=gpu_memory_utilization,
        max_num_batched_tokens=4096,  # token budget per scheduler step
        max_num_seqs=256,             # cap on concurrently scheduled sequences
        disable_log_requests=True,
    )
    return AsyncLLMEngine.from_engine_args(engine_args)


async def run_inference(engine, prompt: str, max_tokens: int = 512) -> str:
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=max_tokens,
        stop=["<|end|>", "\n\n"],
    )
    # Each in-flight request needs a unique ID; a fixed ID collides under concurrency.
    request_id = f"req-{uuid.uuid4().hex}"
    final_output = None
    # generate() yields cumulative outputs; the last one holds the full text.
    async for output in engine.generate(prompt, sampling_params, request_id=request_id):
        final_output = output
    return final_output.outputs[0].text if final_output else ""
```
MLX-LM Generation Script (Python)
Optimized for Apple Silicon unified memory architecture.
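```python
from mlx_lm import generate, load
from mlx_lm.sample_utils import make_sampler


def run_mlx_inference(model_name: str, system_prompt: str, user_input: str) -> str:
    model, tokenizer = load(model_name)
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]
    # Render the chat messages with the model's own prompt template.
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Assumption: a recent mlx-lm release, where sampling settings are passed
    # through a sampler object rather than temperature/top_p keyword arguments.
    sampler = make_sampler(temp=0.7, top_p=0.9)
    output = generate(
        model,
        tokenizer,
        prompt=prompt,
        max_tokens=1024,
        sampler=sampler,
        verbose=False,
    )
    return output.strip()
```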
Architectural Rationale
- Routing by priority prevents interactive requests from queuing behind batch jobs.
- vLLM's `max_num_batched_tokens` and `max_num_seqs` parameters are tuned to prevent OOM crashes while maximizing GPU utilization.
- MLX-LM leverages Apple's memory hierarchy by avoiding explicit tensor transfers between CPU and GPU domains.
- Quantization defaults are enforced at the model loading stage to prevent accidental FP16 allocation.
Pitfall Guide
- Ignoring NUMA Topology on Multi-Socket CPUs
  - Explanation: Modern workstations often span multiple NUMA nodes. If the inference process allocates memory on a different node than the CPU cores executing it, latency spikes and throughput drops by 30-50%.
  - Fix: Bind the process to a specific NUMA node using `numactl --cpunodebind=0 --membind=0 ollama serve` or equivalent systemd CPU affinity settings.
- Over-Quantizing Reasoning Workloads
  - Explanation: Aggressive quantization (IQ3_M and below) degrades the model's ability to maintain coherent chain-of-thought. The weight distribution critical for logical steps gets flattened.
  - Fix: Reserve Q4_K_M or Q5_K_M for reasoning pipelines. Use IQ4_XS if memory is constrained but quality cannot be compromised.
- Mismanaging KV Cache Memory
  - Explanation: The key-value cache scales linearly with context length. Long conversations or document ingestion can exhaust VRAM/RAM without warning, causing silent failures or fallback to CPU swapping.
  - Fix: Implement context window limits at the application layer. Use sliding window attention or cache eviction strategies for long-running sessions.
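- Treating MoE Active Parameters as Total Parameters
  - Explanation: Models like Qwen 3 235B-A22B have 235B total parameters but only 22B active per token. Engineers often miscalculate memory requirements by sizing for the 22B active figure, but every expert must stay resident in memory because routing can select any of them for the next token.
  - Fix: Size memory on total parameters, and treat active parameters as a guide to per-token compute and speed. At roughly 4.5 bits/weight, 235B parameters need about 130GB for weights alone, so the model does not fit on a 24GB GPU despite the 22B active count.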
- Deploying Single-Threaded Servers for Concurrent Users
  - Explanation: With default settings, Ollama and basic llama.cpp servers process requests with little or no parallelism. Under concurrent load, requests queue up one behind another, destroying perceived performance.
  - Fix: Migrate to vLLM or LocalAI with continuous batching. Configure `max_num_seqs` to match expected concurrency, and monitor queue depth.
- Chasing Context Length Over Throughput
  - Explanation: The KV cache grows linearly with context length, and attention cost over the full prompt grows quadratically. Extending context windows beyond 32k tokens can therefore reduce tokens/sec by 40-60%.
  - Fix: Use retrieval-augmented generation (RAG) or chunking strategies instead of raw context extension. Keep local inference context windows between 8k and 32k tokens for optimal throughput.
- Neglecting Sampler Configuration for Deterministic Outputs
  - Explanation: Default sampling parameters introduce variance in code generation and data extraction tasks. Temperature > 0.5 causes inconsistent formatting.
  - Fix: Set `temperature=0.0` or `0.1` for deterministic pipelines. Use `top_p=0.9` and `repetition_penalty=1.1` to maintain coherence without sacrificing output stability.
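The linear growth of the KV cache noted in the pitfalls above can be estimated before deployment. The sketch below uses the standard accounting of one key and one value vector per layer, per KV head, per token. The architecture numbers (40 layers, 8 KV heads, head dimension 128, 16-bit cache) are illustrative assumptions and should be replaced with values from the target model's config.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes for an fp16/bf16 cache
    batch_size: int = 1,
) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


# Hypothetical architecture values, chosen for illustration only.
ASSUMED = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for context in (8_192, 32_768):
    size_gib = kv_cache_bytes(context_tokens=context, **ASSUMED) / 1024**3
    print(f"{context:>6} tokens -> {size_gib:.2f} GiB per sequence")
```

Multiplying the per-sequence figure by the expected number of concurrent sessions gives a first estimate of how much memory to leave free beyond the weights.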
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single developer prototyping | Ollama + Qwen 3 14B Q4_K_M | Zero configuration, OpenAI-compatible API, fast iteration | Near-zero hardware cost |
| Internal team tool (5-20 concurrent users) | vLLM + Qwen 3 14B/32B Q4_K_M | Continuous batching handles concurrency, predictable latency | Moderate GPU cost (RTX 4090) |
| Long-context document analysis | CPU node + Llama 3.1 8B Q4_K_M | Memory-bound workload, CPU bandwidth sufficient, avoids GPU contention | Low cost, utilizes existing workstations |
| Code generation pipeline | Phi-4 14B Q5_K_M via LocalAI | High reasoning density, deterministic sampling, multi-backend flexibility | Moderate GPU cost |
| macOS-native development environment | MLX-LM + Qwen 3 14B Q4_K_M | Unified memory optimization, native Python integration, no PCIe overhead | Zero additional hardware cost |
| Multi-modal agent (text + embeddings + images) | LocalAI + mixed backend routing | Single endpoint abstraction, reduces application code complexity | Higher RAM requirement, moderate GPU |
Configuration Template
```yaml
# docker-compose.yml - Local Inference Stack
version: "3.8"

services:
  vllm-gpu:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - VLLM_USE_V1=1
    # Note: --quantization awq expects a checkpoint that already contains AWQ weights.
    command: >
      --model qwen/Qwen3-14B-Instruct
      --quantization awq
      --gpu-memory-utilization 0.85
      --max-num-batched-tokens 4096
      --max-num-seqs 128
      --disable-log-requests
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  ollama-cpu:
    image: ollama/ollama:latest
    environment:
      - OLLAMA_NUM_GPU=0
      - OLLAMA_HOST=0.0.0.0
    command: serve
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  inference-router:
    build: ./router
    environment:
      - GPU_ENDPOINT=http://vllm-gpu:8000
      - CPU_ENDPOINT=http://ollama-cpu:11434
    ports:
      - "3000:3000"
    depends_on:
      - vllm-gpu
      - ollama-cpu

volumes:
  ollama_data:
```
Quick Start Guide
- Install the serving runtime: Deploy Ollama for CPU fallback and vLLM for GPU acceleration using the provided Docker Compose template. Ensure NVIDIA container toolkit is installed on GPU nodes.
- Pull and verify the model: Execute `ollama pull qwen3:14b` on the CPU node. On the GPU node, vLLM downloads the model weights when the server starts; the `--quantization awq` setting expects a checkpoint that already contains AWQ weights rather than quantizing on the fly. Verify that the endpoint responds using `curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model":"qwen/Qwen3-14B-Instruct","prompt":"Test","max_tokens":10}'`.
- Configure the inference router: Build and start the routing service. Set environment variables to point to your GPU and CPU endpoints. Test priority routing by sending concurrent requests with `priority: interactive` and `priority: batch`.
- Integrate with your application: Replace cloud API calls with the router endpoint (`http://localhost:3000/v1/completions`). Implement context window limits and deterministic sampling parameters in your request payload. Monitor latency and adjust `max_num_seqs` based on observed concurrency.
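The last step can be sketched as a small payload builder that caps prompt length and pins sampling to deterministic settings. This is an illustration under stated assumptions: the 4-characters-per-token ratio is a crude heuristic rather than a real tokenizer, and the `priority` field is a hypothetical way of passing the routing hint, since the router's request schema is not specified here.

```python
import json

# Assumed heuristic: about 4 characters per token. Use the model's tokenizer
# for exact limits in production.
CHARS_PER_TOKEN = 4


def build_payload(
    model: str,
    prompt: str,
    max_context_tokens: int = 8192,
    max_tokens: int = 512,
    priority: str = "interactive",  # hypothetical router field
) -> dict:
    """Build a completion payload with a capped prompt and deterministic sampling."""
    # Reserve room for the response, then convert the remaining budget to characters.
    budget_chars = max(0, (max_context_tokens - max_tokens) * CHARS_PER_TOKEN)
    # Keep the tail of the prompt, where the most recent context usually sits.
    trimmed = prompt[-budget_chars:] if budget_chars else ""
    return {
        "model": model,
        "prompt": trimmed,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for repeatable output
        "priority": priority,
    }


payload = build_payload("qwen3:14b", "Summarize: " + "lorem ipsum " * 10_000)
print(len(payload["prompt"]), "characters sent")
print(json.dumps({k: v for k, v in payload.items() if k != "prompt"}))
```

The resulting dictionary can be posted as JSON to the router endpoint; trimming from the front is one design choice among several, and summarizing or chunking older context may suit document workloads better.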