ttps://ollama.com/install.sh | sh
**Verify GPU Detection:**
Ollama automatically detects CUDA and ROCm devices. Verify detection via logs:
```bash
# Check logs for GPU initialization
journalctl -u ollama -f | grep -i gpu
```
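**Environment Variables for Control:**
Create a systemd override or export these variables to control runtime behavior:
- `OLLAMA_NUM_GPU`: Number of layers to offload to GPU. Default is auto. Set to 999 to force maximum offloading. Confirm that your Ollama release honors this variable; `num_gpu` can also be passed as a per-request option, as in the TypeScript client in section 3.
- `OLLAMA_HOST`: Bind address. Default is `127.0.0.1:11434`. Use `0.0.0.0:11434` for container access.
- `OLLAMA_KEEP_ALIVE`: Duration to keep models loaded. Default `5m`. Use `-1` to keep models loaded indefinitely in production.
- `OLLAMA_MAX_LOADED_MODELS`: Maximum number of models loaded concurrently. The default depends on the Ollama version (1 in older releases). Increase for multi-model routing.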
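Once `OLLAMA_HOST` is set on the server, clients need a matching base URL. The sketch below shows one way to derive it. `clientBaseUrl` is a hypothetical helper, not part of Ollama. It assumes that a `0.0.0.0` bind address is reached through loopback from the same machine, and that a bare host with no scheme and no port uses the 11434 default mentioned above.

```typescript
// Hypothetical helper: turn an OLLAMA_HOST-style value into a URL a client can call.
function clientBaseUrl(ollamaHost: string | undefined): string {
  const raw = (ollamaHost ?? '').trim() || '127.0.0.1:11434';
  const hasScheme = /^https?:\/\//.test(raw);
  const hasPort = /:\d+$/.test(raw);

  const url = new URL(hasScheme ? raw : `http://${raw}`);

  // 0.0.0.0 is a bind address, not a destination; assume loopback for local clients.
  if (url.hostname === '0.0.0.0') url.hostname = '127.0.0.1';

  // Assumption: only a bare host (no scheme, no port) falls back to 11434.
  if (!hasScheme && !hasPort) url.port = '11434';

  return url.origin;
}
```

The result can be passed as the `host` option when constructing a client from the `ollama` npm package, for example `new Ollama({ host: clientBaseUrl(process.env.OLLAMA_HOST) })`.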
2. Modelfile Architecture
The Modelfile is the core mechanism for optimization. It allows defining model parameters, system prompts, and template overrides without altering the base weights.
**Optimized Modelfile Example:**
```dockerfile
FROM llama3:8b

# Optimization: Reduce context to match app needs, saving KV-cache VRAM
PARAMETER num_ctx 4096

# Optimization: Moderate temperature and a mild repetition penalty for more consistent output
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.1

# Optimization: Top-k and top-p for sampling control
PARAMETER top_k 40
PARAMETER top_p 0.9

# System prompt injection
SYSTEM """
You are a technical assistant. Provide concise, code-focused answers.
Do not include conversational filler.
"""

# Custom template for chat completion (Llama 3 header format, ending with the response slot)
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
```
**Build and Run:**
```bash
ollama create my-optimized-model -f Modelfile
ollama run my-optimized-model
```
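To see why a smaller `num_ctx` saves VRAM, a back-of-the-envelope estimate helps. The sketch below assumes an unquantized 16-bit KV cache and the published Llama 3 8B shape (32 layers, 8 key/value heads, head dimension 128). `KvCacheShape` and `kvCacheBytes` are hypothetical names for illustration. Ollama's real allocation also includes compute buffers, so treat the result as a rough lower bound rather than an exact figure.

```typescript
// Hypothetical description of the attention shape that determines KV-cache size.
interface KvCacheShape {
  layers: number;          // transformer blocks
  kvHeads: number;         // key/value heads (fewer than query heads under GQA)
  headDim: number;         // dimension per head
  bytesPerElement: number; // 2 for an f16 cache (assumption)
}

// Each token stores one key and one value vector per layer, hence the factor 2.
function kvCacheBytes(shape: KvCacheShape, numCtx: number): number {
  const perToken = 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerElement;
  return perToken * numCtx;
}

// Assumed shape for Llama 3 8B with a 16-bit cache.
const llama3_8b: KvCacheShape = { layers: 32, kvHeads: 8, headDim: 128, bytesPerElement: 2 };

for (const ctx of [2048, 4096, 8192]) {
  const mib = kvCacheBytes(llama3_8b, ctx) / (1024 * 1024);
  console.log(`num_ctx=${ctx}: about ${mib} MiB of KV cache`);
}
```

Under these assumptions the cache costs 128 KiB per token, so 4096 tokens need about 512 MiB and 8192 tokens about 1 GiB. The linear scaling is the point: halving `num_ctx` halves this part of the VRAM budget.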
3. TypeScript Client Integration
Use the official ollama npm package. Implement streaming for low-latency UX and configure timeouts to handle variable generation speeds.
```typescript
import ollama from 'ollama';

interface ChatConfig {
  model: string;
  prompt: string;
  contextLength?: number;
}

export async function streamCompletion(config: ChatConfig): Promise<string> {
  const response = await ollama.chat({
    model: config.model,
    messages: [{ role: 'user', content: config.prompt }],
    stream: true,
    options: {
      num_ctx: config.contextLength ?? 4096,
      temperature: 0.7,
      // Override Modelfile params if necessary
      num_gpu: 999,
    },
  });

  let fullResponse = '';
  let evalCount = 0;

  // Process stream chunks
  for await (const part of response) {
    const chunk = part.message.content;
    process.stdout.write(chunk);
    fullResponse += chunk;
    // Token statistics arrive on the final chunk (done === true)
    if (part.done) evalCount = part.eval_count;
  }

  console.log('\n---');
  console.log(`Generated tokens: ${evalCount}`);
  return fullResponse;
}

// Production usage with timeout handling
async function safeCompletion(config: ChatConfig): Promise<string> {
  // ollama.abort() cancels the client's in-flight streamed requests,
  // which makes the for-await loop above throw an AbortError
  const timeoutId = setTimeout(() => ollama.abort(), 30000); // 30s timeout
  try {
    return await streamCompletion(config);
  } catch (error) {
    if (error instanceof Error && error.name === 'AbortError') {
      console.error('Generation timed out');
    }
    throw error;
  } finally {
    clearTimeout(timeoutId);
  }
}
```
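The timeout handling can also be written independently of the client library, which makes it easy to test without a running server. The following is a minimal sketch, not the package's own mechanism: `collectWithDeadline` is a hypothetical helper that drains any async stream of text chunks and rejects once an overall deadline passes. The optional `onTimeout` callback is where a real integration would cancel the underlying request.

```typescript
// Hypothetical helper: concatenate streamed text, failing if the whole stream
// takes longer than timeoutMs.
async function collectWithDeadline(
  chunks: AsyncIterable<string>,
  timeoutMs: number,
  onTimeout: () => void = () => {},
): Promise<string> {
  const iterator = chunks[Symbol.asyncIterator]();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // A promise that only ever rejects, when the deadline expires.
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      onTimeout();
      reject(new Error(`stream exceeded ${timeoutMs} ms`));
    }, timeoutMs);
  });

  let text = '';
  try {
    while (true) {
      // Whichever settles first wins: the next chunk or the deadline.
      const step = await Promise.race([iterator.next(), deadline]);
      if (step.done) return text;
      text += step.value;
    }
  } finally {
    clearTimeout(timer);
  }
}
```

With the `ollama` package, the chunks would come from mapping each streamed part to `part.message.content`, and `onTimeout` could call `ollama.abort()`. The deadline here covers the entire generation; a per-chunk idle timeout would need the timer reset on every iteration.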
4. Docker Production Deployment
Running Ollama in Docker isolates the service and simplifies networking. Use the official image with GPU runtime.
**Docker Compose Configuration:**
```yaml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    runtime: nvidia # Requires NVIDIA Container Toolkit
    environment:
      - OLLAMA_NUM_GPU=999
      - OLLAMA_KEEP_ALIVE=-1
      - OLLAMA_HOST=0.0.0.0:11434
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
volumes:
  ollama_data:
```
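After the container starts, the API may refuse connections for a short time, so application code usually benefits from a readiness check before sending the first prompt. The sketch below polls the `/api/tags` endpoint until it answers. `waitForOllama` and `FetchLike` are hypothetical names, and the fetch function is injected so the logic can be exercised without a server.

```typescript
// Minimal shape of what the poller needs from a fetch implementation.
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

// Hypothetical helper: resolve true once the API answers, false after all attempts fail.
async function waitForOllama(
  baseUrl: string,
  fetchFn: FetchLike,
  attempts = 10,
  delayMs = 1000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetchFn(`${baseUrl}/api/tags`);
      if (res.ok) return true;
    } catch {
      // Connection errors are expected while the container is still starting.
    }
    // Pause between attempts, but not after the last one.
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```

On a runtime with a global `fetch`, a call such as `await waitForOllama('http://localhost:11434', fetch)` matches the port mapping in the Compose file above. A fixed delay keeps the sketch short; exponential backoff would be a reasonable refinement.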
Pitfall Guide
- **Ignoring Quantization Trade-offs:**
  - Mistake: Using FP16 models on limited VRAM or defaulting to Q4_K_M when higher precision is needed for code generation.
  - Correction: Use Q4_K_M for general text. Switch to Q5_K_S or Q6_K for code-heavy tasks where syntax precision is critical. Always benchmark perplexity impact vs. VRAM savings.
- **Context Window Mismatch:**
  - Mistake: Leaving `num_ctx` at 8192 when the application only processes 2048 tokens.
  - Impact: KV-cache memory scales linearly with context. Excess context allocation consumes VRAM that could be used for model layers, forcing CPU offloading and killing performance.
  - Correction: Set `num_ctx` to the largest prompt-plus-response length your application actually needs.
- **The keep_alive Cold Start Trap:**
  - Mistake: Relying on the default 5-minute keep-alive in an API server.
  - Impact: Intermittent requests cause model unloading/reloading, introducing 2-5 second latency spikes.
  - Correction: Set `OLLAMA_KEEP_ALIVE=-1` for always-on services, or use a proxy to manage model lifecycle if memory is constrained.
- **Partial GPU Offloading Overhead:**
  - Mistake: Allowing Ollama to split layers between GPU and CPU without monitoring.
  - Impact: CPU inference is far slower than GPU inference. Even a few layers on CPU can bottleneck the entire pipeline due to synchronization overhead.
  - Correction: Set `OLLAMA_NUM_GPU=999` and monitor VRAM. If OOM occurs, switch to a smaller model or a lower-bit quantization rather than accepting CPU fallback.
- **Windows WSL2 Memory Sharing:**
  - Mistake: Assuming WSL2 shares VRAM dynamically with Windows.
  - Impact: WSL2 has a capped memory limit (often 50% of system RAM) and VRAM sharing can be unstable.
  - Correction: Configure `.wslconfig` to increase memory limits. Some guides suggest a DirectML backend (`OLLAMA_HOST=0.0.0.0 ollama serve --gpu=directml`), but `ollama serve` is configured through environment variables and `--gpu` is not among its documented flags, so verify against your build before relying on it. If CUDA is unavailable, expect lower performance.
- **Concurrency Bottlenecks:**
  - Mistake: Sending parallel requests to a single Ollama instance without configuring queue handling.
  - Impact: Depending on the version and available memory, Ollama may process requests for a model one at a time. Parallel calls then queue up, increasing latency.
  - Correction: Use `OLLAMA_NUM_PARALLEL` to set how many requests each model serves at once (more parallel slots need more context memory) and `OLLAMA_MAX_QUEUE` to define queue depth, or deploy multiple Ollama instances behind a load balancer for high-concurrency scenarios.
- **Security Exposure:**
  - Mistake: Binding Ollama to `0.0.0.0` on a public-facing server without authentication.
  - Impact: Ollama has no built-in auth. Anyone with network access can run inference, pull or delete models, and consume GPU resources.
  - Correction: Never expose Ollama directly to the internet. Use a reverse proxy with auth, or restrict access via security groups/firewalls.
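For the concurrency pitfall above, one client-side mitigation is to cap the number of in-flight requests so the application does not flood the server queue. The sketch below is a small promise-based limiter. `createLimiter` is a hypothetical helper, not part of any Ollama package, and it assumes a single Node.js process issues the requests.

```typescript
// Hypothetical helper: returns a function that runs async tasks with at most
// maxInFlight of them active at any moment.
function createLimiter(maxInFlight: number) {
  let active = 0;
  const waiting: Array<() => void> = [];

  return async function run<T>(task: () => Promise<T>): Promise<T> {
    if (active >= maxInFlight) {
      // Wait until a finishing task hands its slot over to us.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) {
        next(); // transfer the slot directly, so the active count is unchanged
      } else {
        active--;
      }
    }
  };
}
```

A caller would wrap each request, for example `limit(() => streamCompletion(config))`, with `maxInFlight` chosen to match the server's parallelism setting. Handing the slot directly to the next waiter, instead of decrementing and re-checking, avoids a brief window in which a new caller could exceed the cap.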
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Consumer GPU (8GB VRAM) | llama3:8b-q4_K_M, num_ctx 4096, num_gpu auto | Balances model capacity with VRAM limits. Q4 quantization fits in 8GB with room for KV-cache. | Low. Runs on existing hardware. |
| Enterprise Multi-GPU Server | llama3:70b-q4_K_M, num_gpu 999, OLLAMA_MAX_LOADED_MODELS 2 | Maximizes throughput across GPUs. Loading multiple models reduces swap latency for routing. | Medium. Requires significant VRAM and compute investment. |
| Low-Latency API Service | OLLAMA_KEEP_ALIVE=-1, Modelfile with strict num_ctx, Docker deployment | Eliminates model loading latency. Strict context prevents KV-cache bloat. | Operational cost of keeping GPU active continuously. |
| Edge Device / Low Power | phi3:mini-4k, num_ctx 2048, DirectML/CPU fallback | Phi3 offers high efficiency for small form factors. Reduced context minimizes memory pressure. | Minimal hardware cost. Acceptable latency trade-off for edge constraints. |
Configuration Template
**Systemd Service for Linux Production:**
```ini
[Unit]
Description=Ollama LLM Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=1"

[Install]
WantedBy=multi-user.target
```
**Modelfile Template for Code Assistant:**
```dockerfile
FROM codellama:13b-python-q5_K_S

PARAMETER num_ctx 8192
PARAMETER temperature 0.2
PARAMETER repeat_penalty 1.15
PARAMETER top_k 10
PARAMETER top_p 0.95

SYSTEM """
You are an expert coding assistant. Output only valid code blocks unless asked otherwise.
Use type hints and docstrings.
"""

# The special tokens must match the base model's prompt format. Compare with the
# output of `ollama show --modelfile codellama:13b-python-q5_K_S` before overriding.
TEMPLATE """{{ if .System }}<|begin_of_text|>{{ .System }}<|end_of_text|>{{ end }}{{ if .Prompt }}{{ .Prompt }}<|end_of_text|>{{ end }}"""
```
Quick Start Guide
- Install Ollama: Run the install script on Linux or download the binary for your OS.
- Start Service: Execute `ollama serve` or enable the systemd service.
- Pull and Run: Execute `ollama run llama3:8b` to verify GPU detection and inference.
- Create Optimized Model: Write a Modelfile with `num_ctx` and parameters, then run `ollama create my-app -f Modelfile`.
- Integrate: Use the TypeScript client example to stream responses from `http://localhost:11434` in your application.