1: Hardware & Driver Validation
Ollama relies on llama.cpp under the hood. GPU acceleration requires explicit driver alignment.
Linux (NVIDIA)
# Verify CUDA toolkit and driver compatibility
nvidia-smi --query-gpu=driver_version,name,compute_cap,memory.total --format=csv
nvcc --version
Ensure CUDA 12.1+ and driver β₯535. Ollama bundles its own runtime, but mismatched host drivers cause silent CPU fallback.
Linux (AMD)
rocm-smi --showmeminfo vram
# Install ROCm 6.0+ and verify HIP_VISIBLE_DEVICES
macOS
Apple Silicon uses Metal automatically. No driver setup required. Verify unified memory allocation:
system_profiler SPDisplaysDataType | grep -i "VRAM (Total)"
Phase 2: Installation & Path Configuration
Linux/macOS (Native)
curl -fsSL https://ollama.com/install.sh | sh
By default, models cache to ~/.ollama/models. In production, relocate to a dedicated volume:
export OLLAMA_MODELS=/var/lib/ollama/models
mkdir -p $OLLAMA_MODELS
chown -R $(whoami):$(whoami) $OLLAMA_MODELS
Windows
Use the official installer. Set environment variables via System Properties > Advanced > Environment Variables:
OLLAMA_MODELS: C:\ProgramData\Ollama\models
OLLAMA_HOST: 0.0.0.0:11434 (if network exposure is required)
Phase 3: Service Hardening & Network Configuration
Running Ollama as a foreground process is unacceptable for production. Use systemd (Linux) or Docker with explicit resource limits.
Systemd Service Unit (/etc/systemd/system/ollama.service)
[Unit]
Description=Ollama LLM Inference Service
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
User=ollama
Group=ollama
ExecStart=/usr/local/bin/ollama serve
Environment=OLLAMA_HOST=0.0.0.0:11434
Environment=OLLAMA_MODELS=/var/lib/ollama/models
Environment=OLLAMA_NUM_GPU=999
Environment=OLLAMA_KEEP_ALIVE=12h
Environment=OLLAMA_MAX_LOADED_MODELS=3
Environment=OLLAMA_FLASH_ATTENTION=1
Restart=on-failure
RestartSec=5
LimitNOFILE=65536
LimitMEMLOCK=infinity
TimeoutStartSec=30
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo journalctl -u ollama -f
Phase 4: Model Selection & Quantization Strategy
Model choice dictates VRAM allocation and throughput. Use quantization to balance precision and memory.
| Quantization | VRAM (7B) | Quality Loss | Use Case |
|---|
| q4_K_M | ~4.1 GB | ~2-3% | Production chat, RAG |
| q5_K_M | ~5.0 GB | ~1% | Code generation, reasoning |
| q8_0 | ~7.2 GB | <0.5% | High-fidelity tasks, evaluation |
| f16 | ~14 GB | None | Research, fine-tuning |
Pull and verify:
ollama pull llama3.1:8b-instruct-q4_K_M
ollama show llama3.1:8b-instruct-q4_K_M
Phase 5: API Integration & Streaming Architecture
Ollama exposes a REST API compatible with OpenAI-style streaming. Implement backpressure handling and connection pooling.
cURL Example (Streaming)
curl -N -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b-instruct-q4_K_M",
"prompt": "Explain KV cache optimization in transformer architectures.",
"stream": true,
"options": {
"num_ctx": 4096,
"temperature": 0.7,
"top_p": 0.9
}
}'
Python Client (Production Pattern)
import requests
import json
from typing import Generator
class OllamaClient:
def __init__(self, base_url: str = "http://localhost:11434"):
self.base_url = base_url
self.session = requests.Session()
self.session.headers.update({"Content-Type": "application/json"})
def generate_stream(self, model: str, prompt: str, **kwargs) -> Generator[str, None, None]:
payload = {
"model": model,
"prompt": prompt,
"stream": True,
"options": kwargs
}
with self.session.post(f"{self.base_url}/api/generate", json=payload, stream=True) as resp:
resp.raise_for_status()
for line in resp.iter_lines():
if line:
data = json.loads(line)
if "response" in data:
yield data["response"]
if data.get("done", False):
break
# Usage
client = OllamaClient()
for chunk in client.generate_stream("llama3.1:8b-instruct-q4_K_M", "Summarize quantum computing.", num_ctx=4096):
print(chunk, end="", flush=True)
Architecture Decisions
- GPU Offloading:
OLLAMA_NUM_GPU=999 forces maximum layer offloading. If VRAM is constrained, calculate manually: num_gpu = total_layers - ceil((required_vram - model_vram) / vram_per_layer).
- Context Window Management: Set
num_ctx explicitly. Unbounded contexts trigger KV cache OOM. Default to 4096 for 7B models; scale to 8192 only with β₯12GB VRAM.
- Model Routing: Use
OLLAMA_MAX_LOADED_MODELS=3 to prevent cache eviction thrashing. Implement application-level routing to batch requests by model.
- Flash Attention: Enable
OLLAMA_FLASH_ATTENTION=1 on Ampere+ GPUs. Reduces memory bandwidth pressure by 30β40%.
- Keep-Alive Tuning:
OLLAMA_KEEP_ALIVE=12h prevents model reload latency. Set to 0 for ephemeral environments to free VRAM.
Pitfall Guide
-
Ignoring VRAM Budgets
Loading a q8_0 70B model on a 24GB GPU triggers silent CPU fallback. Always verify nvidia-smi memory usage after ollama run. Use ollama ps to monitor active model memory footprint.
-
Skipping Driver/CUDA Alignment
Ollama bundles its own runtime, but host NVIDIA drivers <535 or CUDA toolkit mismatches cause initialization failures. Run ldd $(which ollama) | grep libcuda to verify linkage.
-
Binding to 0.0.0.0 Without Authentication
Exposing the API to public networks without firewall rules or reverse proxy authentication invites unauthorized inference and data exfiltration. Always place behind nginx/Traefik with mTLS or API key validation.
-
Running as Root or Using Default Temp Directories
Ollama creates temporary computation buffers in /tmp. On systems with tmpfs size limits, this causes silent truncation. Set TMPDIR=/var/lib/ollama/tmp and run as a dedicated ollama user.
-
Neglecting Context Window Limits
Default num_ctx=2048 truncates long prompts silently. KV cache allocation scales quadratically with context length. Exceeding VRAM causes OOM kills. Explicitly set num_ctx matching workload requirements.
-
Assuming Automatic Model Updates
Ollama does not auto-update pulled models. Tag pinning (llama3.1:8b-instruct-q4_K_M) is mandatory for reproducibility. Implement CI/CD cache invalidation when updating base registries.
-
Overlooking Log Rotation & Disk I/O
Model downloads and inference logs accumulate rapidly. Unmanaged /var/lib/ollama fills disks, triggering service crashes. Configure logrotate or Docker volume pruning. Monitor I/O wait with iostat -x 1.
Production Bundle
Action Checklist
Decision Matrix
| Deployment Target | Recommended Setup | GPU Offloading | Network Exposure | Maintenance Overhead |
|---|
| Developer Laptop | Native + CLI | Auto (Metal/CUDA) | localhost only | Low |
| On-Prem Server | systemd + Native | Explicit (NUM_GPU=999) | Internal VPC only | Medium |
| Edge Device | Docker + CPU/CUDA | Limited (NUM_GPU=0-16) | Isolated subnet | High |
| Cloud VM (GPU) | Docker + systemd | Explicit + Flash Attention | Private subnet + LB | Medium |
| Multi-Node Cluster | Docker + Orchestrator | Per-node routing | Mesh network | High |
Configuration Template
docker-compose.yml
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama-inference
restart: unless-stopped
environment:
- OLLAMA_HOST=0.0.0.0:11434
- OLLAMA_MODELS=/models
- OLLAMA_NUM_GPU=999
- OLLAMA_FLASH_ATTENTION=1
- OLLAMA_KEEP_ALIVE=12h
- OLLAMA_MAX_LOADED_MODELS=3
volumes:
- ollama_models:/models
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ports:
- "11434:11434"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
interval: 30s
timeout: 10s
retries: 3
volumes:
ollama_models:
driver: local
driver_opts:
type: none
o: bind
device: /srv/ollama/models
ollama_data:
driver: local
.env
OLLAMA_HOST=0.0.0.0:11434
OLLAMA_MODELS=/models
OLLAMA_NUM_GPU=999
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KEEP_ALIVE=12h
OLLAMA_MAX_LOADED_MODELS=3
OLLAMA_KEEP_ALIVE=12h
Quick Start Guide
- Verify Hardware: Run
nvidia-smi or rocm-smi. Confirm β₯8GB VRAM for 7B q4 models. Update drivers if version <535 (NVIDIA) or ROCm <6.0 (AMD).
- Install & Configure: Execute native installer or deploy
docker-compose.yml. Set OLLAMA_MODELS to a dedicated volume. Export environment variables matching your hardware profile.
- Pull & Test: Run
ollama pull llama3.1:8b-instruct-q4_K_M. Verify with curl -N http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{"model":"llama3.1:8b-instruct-q4_K_M","prompt":"test","stream":true}'. Monitor ollama ps for VRAM allocation.
- Harden & Route: Bind API to internal interface only. Deploy reverse proxy with authentication. Implement streaming client with connection pooling and explicit
num_ctx limits. Enable monitoring for GPU utilization, cache hit rate, and p99 latency.
Ollama is not a toy. It is a production-grade inference router that demands explicit hardware alignment, resource budgeting, and network hardening. Treat it as infrastructure, not a CLI command, and it will deliver predictable, cost-efficient, and privacy-compliant LLM serving at scale.