ecycle Management
Ollama must run as a managed daemon, not a foreground process. Systemd provides restart policies, environment isolation, and structured logging. Create a drop-in override to inject GPU driver paths and host bindings without modifying the base unit file.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
Restart=always
RestartSec=3
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now ollama
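Rationale: Binding to 0.0.0.0 allows containerized or remote clients to reach the endpoint; because the Ollama API has no built-in authentication, restrict access with a firewall or reverse proxy on untrusted networks. Explicit GPU device selection prevents driver conflicts on multi-GPU systems. CUDA_VISIBLE_DEVICES applies to NVIDIA hardware and HSA_OVERRIDE_GFX_VERSION to AMD ROCm, so keep only the variable that matches the installed GPU. The Restart=always policy ensures automatic recovery after kernel updates or driver reloads. LimitNOFILE prevents file descriptor exhaustion during high-concurrency streaming.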
Step 2: Modelfile Architecture
Custom GGUF deployments require explicit token mapping and parameter boundaries. The Modelfile acts as a declarative configuration layer that binds the raw weights to a chat interface.
# ./inference-config
FROM ./Llama-3.2-3B-Instruct-Q5_K_M.gguf
SYSTEM """
You are a technical reasoning engine. Provide concise, structured responses.
"""
TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|start_header_id|>"
PARAMETER temperature 0.6
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER num_gpu 999
Rationale:
- TEMPLATE maps the raw GGUF token stream to the model's native chat format. Omitting this causes malformed completions.
- num_gpu 999 forces maximum layer offloading to VRAM. The runtime caps this automatically based on available memory.
- num_ctx 8192 bounds the attention window. Exceeding hardware limits triggers silent CPU fallback.
- temperature 0.6 and top_p 0.9 balance determinism and creativity for technical workloads.
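To see what the TEMPLATE directive actually produces, it can help to preview the final prompt string. The sketch below is a minimal illustration, assuming plain placeholder substitution; `renderPrompt` and `LLAMA3_TEMPLATE` are hypothetical names, and Ollama itself renders templates with Go's template engine rather than this code.

```typescript
// Hypothetical preview helper: mirrors the TEMPLATE above with simple substitution.
// Ollama renders templates with Go's text/template; this only approximates the result.
const LLAMA3_TEMPLATE =
  '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n' +
  '{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>\n' +
  '{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n';

function renderPrompt(template: string, system: string, prompt: string): string {
  // A function replacer avoids special handling of "$" sequences in user text.
  return template.replace(
    /\{\{ \.(System|Prompt) \}\}/g,
    (_match: string, key: string) => (key === 'System' ? system : prompt)
  );
}

const preview = renderPrompt(LLAMA3_TEMPLATE, 'Be brief.', 'What is a GGUF file?');
console.log(preview);
```

The rendered string ends with the assistant header, which is where generation starts; the stop parameters then end the turn at `<|eot_id|>`.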
Build and register the model:
ollama create dev-reasoning-v1 -f ./inference-config
Step 3: Programmatic API Integration
Interactive shells are unsuitable for production. A typed client wrapper ensures request validation, streaming handling, and error recovery.
interface OllamaRequest {
  model: string;
  prompt: string;
  // Set by the client methods below, so callers may omit it.
  stream?: boolean;
  options?: {
    temperature?: number;
    num_ctx?: number;
  };
}

interface OllamaResponse {
  model: string;
  response: string;
  done: boolean;
  total_duration?: number;
  eval_count?: number;
}

class LocalInferenceClient {
  private baseUrl: string;

  constructor(baseUrl: string = 'http://localhost:11434') {
    this.baseUrl = baseUrl;
  }

  // Single-response completion: resolves once the full answer is available.
  async generate(request: OllamaRequest): Promise<OllamaResponse> {
    const payload = JSON.stringify({
      model: request.model,
      prompt: request.prompt,
      stream: false,
      options: request.options
    });
    const res = await fetch(`${this.baseUrl}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: payload
    });
    if (!res.ok) {
      throw new Error(`Inference failed: ${res.status} ${res.statusText}`);
    }
    return (await res.json()) as OllamaResponse;
  }

  // Streaming completion: yields response fragments as they arrive.
  async *streamGenerate(request: OllamaRequest): AsyncGenerator<string> {
    const payload = JSON.stringify({
      model: request.model,
      prompt: request.prompt,
      stream: true,
      options: request.options
    });
    const res = await fetch(`${this.baseUrl}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: payload
    });
    if (!res.ok || !res.body) {
      throw new Error(`Streaming connection failed: ${res.status} ${res.statusText}`);
    }
    const reader = res.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters intact across chunk boundaries.
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      // The last element may be an incomplete JSON line; carry it into the next read.
      buffer = lines.pop() ?? '';
      for (const line of lines) {
        if (!line.trim()) continue;
        const parsed = JSON.parse(line) as OllamaResponse;
        if (parsed.response) yield parsed.response;
      }
    }
    // Flush a final line that arrived without a trailing newline.
    if (buffer.trim()) {
      const parsed = JSON.parse(buffer) as OllamaResponse;
      if (parsed.response) yield parsed.response;
    }
  }
}

// Usage (top-level await requires an ES module)
const client = new LocalInferenceClient();
const result = await client.generate({
  model: 'dev-reasoning-v1',
  prompt: 'Explain the difference between attention mechanisms in Transformers vs RNNs.',
  options: { temperature: 0.6, num_ctx: 8192 }
});
console.log(result.response);
Rationale: The client abstracts HTTP boilerplate, enforces payload structure, and provides both single-response and streaming interfaces. Streaming uses an AsyncGenerator to process tokens incrementally, reducing memory overhead and enabling real-time UI updates. Non-2xx responses, such as a request for a model that is not registered, are raised as errors before they cascade, while network failures surface as rejected fetch promises.
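One detail of streaming that is easy to get wrong is that a network chunk can end in the middle of a JSON line. The sketch below isolates that concern in a pure function so it can be exercised without a running daemon; `splitNdjson` is a hypothetical helper, and the chunk contents are simulated rather than captured from Ollama.

```typescript
// Hypothetical helper, not part of any Ollama SDK.
interface SplitResult {
  lines: string[]; // complete newline-delimited JSON lines, ready for JSON.parse
  rest: string;    // trailing partial line to carry into the next chunk
}

function splitNdjson(carry: string, chunk: string): SplitResult {
  const parts = (carry + chunk).split('\n');
  // Whatever follows the last newline is incomplete (or empty).
  const rest = parts.pop() ?? '';
  return { lines: parts.filter((l) => l.trim() !== ''), rest };
}

// Simulated chunks: the second JSON object is cut in half by the transport.
const chunks = ['{"response":"Hel","done":false}\n{"respon', 'se":"lo","done":true}\n'];

let carry = '';
let assembled = '';
for (const chunk of chunks) {
  const { lines, rest } = splitNdjson(carry, chunk);
  carry = rest;
  for (const line of lines) {
    assembled += (JSON.parse(line) as { response: string }).response;
  }
}
console.log(assembled); // prints "Hello"
```

Parsing each chunk independently would throw on the truncated `{"respon` fragment; carrying the remainder forward avoids that.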
Step 4: Hardware Validation & Monitoring
GPU offloading must be verified post-deployment. Silent CPU fallback is the most common production failure mode.
# Monitor VRAM allocation in real-time
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv
# Verify daemon logs for offloading confirmation
journalctl -u ollama -f | grep -E "offloading|CUDA|ROCm|using GPU"
Rationale: nvidia-smi provides hardware-level visibility. Log filtering confirms the runtime successfully mapped tensors to VRAM. If logs show offloading 0 layers, the Modelfile num_gpu parameter or driver environment is misconfigured.
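The same check can be made programmatically. The sketch below assumes the daemon's `GET /api/ps` endpoint reports `size` and `size_vram` for each loaded model, as recent Ollama versions do; verify the field names against your installed version. `RunningModel`, `gpuFraction`, and `describePlacement` are hypothetical names, and the sample values are illustrative, not measured.

```typescript
// Shape assumed from Ollama's GET /api/ps response; confirm against your version.
interface RunningModel {
  name: string;
  size: number;      // total bytes occupied by the loaded model
  size_vram: number; // bytes resident in GPU memory
}

// Fraction of the model held in VRAM: 1 means fully offloaded, 0 means CPU only.
function gpuFraction(m: RunningModel): number {
  if (m.size <= 0) return 0;
  return Math.min(1, m.size_vram / m.size);
}

function describePlacement(m: RunningModel): string {
  const pct = Math.round(gpuFraction(m) * 100);
  if (pct === 100) return `${m.name}: 100% GPU`;
  if (pct === 0) return `${m.name}: 100% CPU (fallback)`;
  return `${m.name}: ${pct}% GPU / ${100 - pct}% CPU (partial offload)`;
}

// Illustrative values only.
const sampleModels: RunningModel[] = [
  { name: 'dev-reasoning-v1', size: 4_000_000_000, size_vram: 4_000_000_000 },
  { name: 'too-large-model', size: 8_000_000_000, size_vram: 2_000_000_000 },
];
for (const m of sampleModels) console.log(describePlacement(m));
```

Feeding the parsed `models` array from `/api/ps` into `describePlacement` after a test prompt gives a quick alert condition: anything other than "100% GPU" suggests the offloading or context settings need review.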
Pitfall Guide
1. Omitting the TEMPLATE Directive
Explanation: Raw GGUF files contain weights, not chat formatting rules. Without explicit token mapping, the model outputs unstructured text or repeats system prompts.
Fix: Always define a TEMPLATE that matches the model's native chat template. Refer to the model card on Hugging Face for the exact token boundaries.
2. Unbounded Context Windows
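Explanation: Setting num_ctx to 32768 on a 24GB GPU can exhaust VRAM once model weights and the KV cache are combined. The runtime then silently offloads layers to the CPU, which can raise time to first token (TTFT) by roughly 10x.
Fix: Budget VRAM before deploying. Roughly 1GB per 2k tokens of context is a workable rule of thumb for Q4-quantized models, though the actual figure depends on the model's layer count and key/value head configuration. Cap num_ctx at what the hardware can hold, and use num_ctx 8192 as a safe baseline.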
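For a model-specific number, the KV cache can be estimated directly. The sketch below uses the standard transformer estimate (two tensors per layer, one key and one value) rather than the rule of thumb above; it covers the KV cache only, not weights or compute buffers. The shape values are assumptions for a Llama-3.2-3B-class model with an f16 cache and should be confirmed against the model card; `KvCacheShape`, `kvCacheBytes`, and `maxContextFor` are hypothetical names.

```typescript
// Hypothetical shape; fill in from the model card or GGUF metadata.
interface KvCacheShape {
  layers: number;          // transformer blocks
  kvHeads: number;         // key/value heads (fewer than attention heads under GQA)
  headDim: number;         // dimension per head
  bytesPerElement: number; // 2 for an f16 cache
}

function kvCacheBytes(shape: KvCacheShape, numCtx: number): number {
  // The factor of 2 accounts for the separate key and value tensors.
  return 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerElement * numCtx;
}

// Largest context whose KV cache fits in the given byte budget.
function maxContextFor(shape: KvCacheShape, budgetBytes: number): number {
  return Math.floor(budgetBytes / kvCacheBytes(shape, 1));
}

// Assumed values for a Llama-3.2-3B-class model; confirm before relying on them.
const llama3b: KvCacheShape = { layers: 28, kvHeads: 8, headDim: 128, bytesPerElement: 2 };

const mib = kvCacheBytes(llama3b, 8192) / (1024 * 1024);
console.log(`KV cache at num_ctx 8192: ${mib} MiB`); // prints 896 MiB
```

Under these assumptions the cache is well below the rule-of-thumb figure, which illustrates why the estimate should be redone per model: a larger model with more layers and KV heads scales the same formula up several times.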
3. Ignoring Systemd Environment Propagation
Explanation: Ollama inherits the shell environment only when run interactively. Systemd services start with a minimal environment, causing GPU driver detection failures.
Fix: Use override.conf to inject CUDA_VISIBLE_DEVICES, HSA_OVERRIDE_GFX_VERSION, and OLLAMA_HOST. Never rely on ~/.bashrc for service daemons.
4. Using Base Models for Chat Workloads
Explanation: Base models lack instruction tuning and chat formatting. They complete prompts literally, producing raw continuations instead of conversational responses.
Fix: Always deploy -Instruct or -Chat variants. Verify the model card specifies instruction-tuning before Modelfile creation.
5. Hardcoding API Endpoints
Explanation: Embedding localhost:11434 in application code breaks containerization, remote debugging, and multi-node deployments.
Fix: Externalize the base URL via environment variables. Implement connection retry logic with exponential backoff for daemon restarts.
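A minimal sketch of this fix follows. `OLLAMA_BASE_URL` is a hypothetical variable name chosen for illustration, and `resolveBaseUrl`, `backoffDelayMs`, and `withRetry` are not part of any Ollama SDK; the retry defaults are arbitrary starting points rather than recommendations.

```typescript
type Env = Record<string, string | undefined>;

// OLLAMA_BASE_URL is a hypothetical variable name used only in this sketch.
function resolveBaseUrl(env: Env, fallback: string = 'http://localhost:11434'): string {
  const raw = (env['OLLAMA_BASE_URL'] ?? '').trim();
  // Strip trailing slashes so `${base}/api/generate` never contains "//".
  return (raw.length > 0 ? raw : fallback).replace(/\/+$/, '');
}

// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number = 250, maxMs: number = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Runs `op`, retrying with exponential backoff; rethrows the last error when exhausted.
async function withRetry<T>(op: () => Promise<T>, retries: number = 5): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw lastError;
}

console.log(resolveBaseUrl({ OLLAMA_BASE_URL: 'http://gpu-box:11434/' }));
```

In a Node.js service this would typically be called as `resolveBaseUrl(process.env)`, with each inference call wrapped in `withRetry` so that a daemon restart under systemd shows up as a short delay instead of a failed request.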
6. Neglecting Log Structuring
Explanation: Raw journalctl output is unstructured. Production debugging requires filtering by component, severity, and request ID.
Fix: Pipe logs through jq or a log aggregator. Tag requests with correlation IDs in the API client for distributed tracing.
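As an alternative to jq, the same filtering can be sketched in TypeScript. This assumes logs are exported with `journalctl -u ollama -o json` (one JSON object per line with `MESSAGE` and `__REALTIME_TIMESTAMP` fields) and that the daemon writes `key=value` style messages with a `level` field; check both assumptions against your own logs. `parseLogfmt` and `parseJournalLine` are hypothetical helpers, and the sample entry is synthetic, not real Ollama output.

```typescript
interface LogRecord {
  timestampMicros: string;
  fields: Record<string, string>;
  raw: string;
}

// Parses key=value pairs, allowing double-quoted values that contain spaces.
function parseLogfmt(message: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const pattern = /(\w+)=("(?:[^"\\]|\\.)*"|\S*)/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(message)) !== null) {
    const value = m[2];
    const quoted = value.length >= 2 && value.charAt(0) === '"' && value.charAt(value.length - 1) === '"';
    fields[m[1]] = quoted ? value.slice(1, -1) : value;
  }
  return fields;
}

// Returns null for lines that are not JSON or carry no textual MESSAGE.
function parseJournalLine(line: string): LogRecord | null {
  let entry: unknown;
  try {
    entry = JSON.parse(line);
  } catch {
    return null;
  }
  if (entry === null || typeof entry !== 'object') return null;
  const obj = entry as Record<string, unknown>;
  const message = obj['MESSAGE'];
  if (typeof message !== 'string') return null;
  const ts = obj['__REALTIME_TIMESTAMP'];
  return {
    timestampMicros: typeof ts === 'string' ? ts : '',
    fields: parseLogfmt(message),
    raw: message,
  };
}

// Synthetic entry for illustration only.
const sampleLine = JSON.stringify({
  __REALTIME_TIMESTAMP: '1700000000000000',
  MESSAGE: 'time=2024-01-01T00:00:00Z level=WARN msg="example warning"',
});
const record = parseJournalLine(sampleLine);
if (record && record.fields['level'] === 'WARN') console.log(record.fields['msg']);
```

Once entries are structured this way, filtering by severity or forwarding to a log aggregator becomes a matter of inspecting `fields` rather than matching raw text.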
7. Deploying Legacy GGML Weights
Explanation: Legacy GGML files lack modern metadata and GPU offloading support. Ollama may load them but fall back to CPU inference silently.
Fix: Migrate all weights to GGUF format. Use llama-quantize or Hugging Face conversion scripts to standardize the model registry.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Development Sandbox | Q4_K_M, 4k ctx, single GPU | Fast iteration, low VRAM footprint, tolerates minor latency | Minimal hardware cost, high dev velocity |
| Staging/Pre-Prod | Q4_K_M, 8k ctx, systemd managed | Validates production parameters, ensures service stability | Moderate VRAM usage, requires monitoring setup |
| Edge/Offline Deployment | Q5_K_M, 4k ctx, containerized | Balances precision and size, runs without cloud dependency | Higher storage cost, eliminates egress fees |
| High-Precision Analysis | Q8_0, 8k ctx, dedicated GPU | Preserves weight fidelity for technical/math workloads | 2x VRAM requirement, reduced throughput |
Configuration Template
# /etc/systemd/system/ollama.service.d/production.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="CUDA_VISIBLE_DEVICES=0"
Restart=on-failure
RestartSec=5
LimitNOFILE=131072
LimitMEMLOCK=infinity
# ./production-modelfile
FROM ./Mistral-7B-Instruct-v0.3-Q4_K_M.gguf
SYSTEM """
You are a production-grade reasoning assistant. Output structured JSON when requested.
"""
TEMPLATE """<s>[INST] {{ .System }}
{{ .Prompt }} [/INST]
"""
PARAMETER stop "[/INST]"
PARAMETER stop "</s>"
PARAMETER temperature 0.5
PARAMETER top_p 0.85
PARAMETER num_ctx 8192
PARAMETER num_gpu 999
PARAMETER num_thread 8
Quick Start Guide
- Install & Enable Service: Run the official installer script, create the systemd override with GPU environment variables, and enable the daemon with systemctl enable --now ollama.
- Build Modelfile: Place your GGUF weights in a dedicated directory, write a declarative Modelfile with explicit TEMPLATE and bounded num_ctx, then register it using ollama create.
- Validate Hardware: Open a secondary terminal, run watch -n 1 nvidia-smi, and execute a test prompt. Confirm VRAM allocation increases and logs show GPU offloading.
- Integrate Client: Deploy the TypeScript API wrapper, configure the base URL via environment variables, and implement streaming or single-response calls based on latency requirements.
- Monitor & Tune: Track tokens/sec and VRAM pressure under load. Adjust num_ctx and temperature to match workload SLAs. Rotate logs and set up alerting for daemon restarts.