    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_GPU=999
    ports:
      - "11434:11434"
    volumes:
      - ./models:/root/.ollama
      - ./configs:/app/configs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
**Architectural Rationale:**
- `OLLAMA_NUM_GPU=999` forces maximum GPU layer offloading, preventing silent CPU fallback that degrades throughput by 10–15x.
- Volume mapping separates model weights from runtime state, enabling rapid model swapping without container rebuilds.
- Explicit GPU reservation prevents resource contention in multi-service environments.
### Step 2: Quantization & Model Selection Strategy
Quantization reduces memory footprint by compressing weight precision. The trade-off curve is non-linear: Q4_K_M retains about 96% of full-precision quality while needing roughly 44% less VRAM than Q8_0 (8.2 GB vs. 14.7 GB at 14B).

| Quantization | 14B Size | 32B Size | Quality Retention | Recommended Use Case |
|--------------|----------|----------|-------------------|----------------------|
| Q8_0 | 14.7 GB | 33.6 GB | 99% | Research/Archival |
| Q6_K | 11.2 GB | 25.4 GB | 98% | High-fidelity coding |
| **Q4_K_M** | **8.2 GB** | **18.7 GB** | **96%** | **Production default** |
| Q3_K_M | 6.4 GB | 14.5 GB | 92% | VRAM-constrained |
| Q2_K | 4.9 GB | 10.8 GB | 85% | Emergency fallback |

**Selection Logic:**
- 8–12 GB VRAM → 14B Q4_K_M
- 16–24 GB VRAM → 32B Q4_K_M
- 36+ GB VRAM → 70B Q4_K_M
- Multi-GPU cluster → 671B (requires distributed inference framework)
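
The selection list above can be expressed as a small lookup. This is a minimal sketch, not part of any Ollama tooling: `selectModel`, `ModelChoice`, and `TIERS` are hypothetical names, and the handling of VRAM sizes that fall between the listed ranges (for example 14 GB or 30 GB) is an assumption that simply picks the largest tier whose lower bound is met.

```typescript
// Hypothetical helper mirroring the selection list; names are illustrative.
interface ModelChoice {
  params: string; // parameter count label, e.g. "32B"
  quant: string;  // quantization label, e.g. "Q4_K_M"
}

// Tiers ordered from largest to smallest lower bound (GB of VRAM).
const TIERS: ReadonlyArray<{ minVramGb: number; choice: ModelChoice }> = [
  { minVramGb: 36, choice: { params: "70B", quant: "Q4_K_M" } },
  { minVramGb: 16, choice: { params: "32B", quant: "Q4_K_M" } },
  { minVramGb: 8, choice: { params: "14B", quant: "Q4_K_M" } },
];

// Returns the largest tier that fits, or null below 8 GB.
// Assumption: values between listed ranges fall back to the lower tier.
function selectModel(vramGb: number): ModelChoice | null {
  for (const tier of TIERS) {
    if (vramGb >= tier.minVramGb) return tier.choice;
  }
  return null;
}
```

For example, `selectModel(24)` yields the 32B tier, while `selectModel(6)` returns `null` because nothing in the list covers that range.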
### Step 3: Inference Configuration Architecture
Ollama's `Modelfile` syntax allows deterministic control over sampling parameters, context windows, and system behavior. Production configurations should decouple reasoning intensity from output formatting.

```dockerfile
# configs/production-r1.modelfile
FROM unsloth/DeepSeek-R1-32B-GGUF:Q4_K_M

# Sampling configuration for deterministic reasoning
PARAMETER temperature 0.35
PARAMETER top_p 0.88
PARAMETER top_k 40
PARAMETER repeat_penalty 1.12

# Context management
PARAMETER num_ctx 32768
PARAMETER num_thread 16

# System behavior directives
SYSTEM """You are a senior software architect and reasoning engine.
Analyze problems step-by-step before generating solutions.
Output structured reasoning traces, then provide implementation code.
Maintain strict adherence to modern TypeScript 5.x and Python 3.12+ standards.
Never omit error handling or type definitions."""
```

**Parameter Rationale:**
- `temperature 0.35` balances creativity with deterministic code generation. Values below 0.2 cause repetitive patterns; values above 0.5 introduce hallucination in technical contexts.
- `top_p 0.88` and `top_k 40` constrain token selection to high-probability candidates, reducing syntactic errors in generated code.
- `repeat_penalty 1.12` prevents loop degradation during long reasoning chains.
- `num_ctx 32768` accommodates full repository context without triggering attention fragmentation.
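
When the same parameters need to vary per environment, the Modelfile text can be generated rather than hand-edited. The sketch below is an illustration only: `ModelfileSpec` and `renderModelfile` are hypothetical names, it emits just the `FROM`, `PARAMETER`, and `SYSTEM` directives used above, and it assumes the system prompt contains no triple quotes.

```typescript
// Hypothetical typed description of the Modelfile shown above.
interface ModelfileSpec {
  from: string;                       // base model reference
  parameters: Record<string, number>; // e.g. { temperature: 0.35 }
  system: string;                     // system prompt (must not contain """)
}

// Renders FROM / PARAMETER / SYSTEM lines in insertion order.
function renderModelfile(spec: ModelfileSpec): string {
  const lines: string[] = [`FROM ${spec.from}`];
  for (const [name, value] of Object.entries(spec.parameters)) {
    lines.push(`PARAMETER ${name} ${value}`);
  }
  lines.push(`SYSTEM """${spec.system}"""`);
  return lines.join("\n") + "\n";
}
```

Keeping the parameters in a typed object makes it easier to diff a staging configuration against production before running `ollama create`.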

### Step 4: Programmatic Orchestration (TypeScript Client)
Direct API integration enables streaming responses, timeout handling, and structured prompt templating for CI/CD pipelines or internal tooling.

```typescript
// src/clients/reasoning-client.ts
import axios, { AxiosInstance } from 'axios';

interface InferenceRequest {
  model?: string; // falls back to the client's default model
  prompt: string;
  temperature?: number;
  maxTokens?: number;
  stream?: boolean;
}

interface InferenceResponse {
  reasoning: string;
  output: string;
  tokensGenerated: number;
  latencyMs: number;
}

export class ReasoningEngineClient {
  private api: AxiosInstance;
  private defaultModel: string;

  constructor(baseUrl: string = 'http://localhost:11434', model: string = 'deepseek-r1:32b') {
    this.defaultModel = model;
    this.api = axios.create({
      baseURL: `${baseUrl}/api`,
      timeout: 120000,
      headers: { 'Content-Type': 'application/json' }
    });
  }

  async generate(request: InferenceRequest): Promise<InferenceResponse> {
    const startTime = performance.now();

    // Note: this method parses a single JSON body. With stream: true the
    // server replies with newline-delimited JSON, which needs separate handling.
    const payload = {
      model: request.model || this.defaultModel,
      prompt: request.prompt,
      stream: request.stream ?? false,
      options: {
        temperature: request.temperature ?? 0.35,
        num_predict: request.maxTokens ?? 4096
      }
    };

    const response = await this.api.post('/generate', payload);
    const latency = performance.now() - startTime;

    // Parse the explicit reasoning chain from the response;
    // fall back to an empty string if the field is missing.
    const rawText: string = response.data.response ?? '';
    const reasoningMatch = rawText.match(/<think>([\s\S]*?)<\/think>/);
    const reasoning = reasoningMatch ? reasoningMatch[1].trim() : '';
    const output = reasoningMatch
      ? rawText.replace(/<think>[\s\S]*?<\/think>/, '').trim()
      : rawText.trim();

    return {
      reasoning,
      output,
      tokensGenerated: response.data.eval_count || 0,
      latencyMs: Math.round(latency)
    };
  }
}
```

**Implementation Notes:**
- The client explicitly parses `<think>` tokens, separating reasoning traces from final output. This enables audit logging and prompt refinement without exposing internal deliberation to end-users.
- Timeout is set to 120s to accommodate long reasoning chains on 32B models. Adjust based on VRAM and prompt complexity.
- `num_predict` caps output length to prevent runaway generation during recursive debugging tasks.
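
The `generate` method above expects a single JSON body. When `stream` is enabled, the native endpoint instead returns newline-delimited JSON objects, each carrying a partial `response` and a final object with `done: true`. The sketch below shows one way to reassemble such a body into text that the `<think>` parsing can consume. `collectStream` and `GenerateChunk` are hypothetical names, and the chunk shape is limited to the three fields used here.

```typescript
// Subset of the fields carried by each streamed line (assumed shape).
interface GenerateChunk {
  response?: string;
  done?: boolean;
  eval_count?: number;
}

// Joins the partial `response` fields of an NDJSON body into one string.
function collectStream(ndjson: string): { text: string; evalCount: number; done: boolean } {
  let text = "";
  let evalCount = 0;
  let done = false;
  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank separators
    const chunk = JSON.parse(trimmed) as GenerateChunk;
    text += chunk.response ?? "";
    if (chunk.done) {
      done = true;
      evalCount = chunk.eval_count ?? 0;
    }
  }
  return { text, evalCount, done };
}
```

A `done` value of `false` after the body ends suggests the stream was cut off, which is worth logging before the text is split into reasoning and output.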

## Pitfall Guide

### 1. Silent CPU Fallback
**Explanation:** Ollama may default to CPU inference if GPU drivers are misconfigured or environment variables are unset, reducing throughput from ~25 tok/s to ~2 tok/s without explicit warnings.

**Fix:** Verify GPU allocation via `nvidia-smi` during inference. Set `OLLAMA_NUM_GPU=999` in the runtime environment. Monitor `ollama ps` to confirm GPU layer assignment.
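
The same check can be automated against the JSON that Ollama's `/api/ps` endpoint returns for loaded models. The following is a minimal sketch under the assumption that each entry exposes `size` (total bytes) and `size_vram` (bytes resident on the GPU), as in the published API examples. `gpuFraction`, `findCpuFallback`, and the 0.99 threshold are hypothetical choices, not Ollama features.

```typescript
// Subset of a running-model entry (assumed fields: size, size_vram in bytes).
interface RunningModel {
  name: string;
  size: number;
  size_vram: number;
}

// Fraction of the loaded model that resides in VRAM, clamped to [0, 1].
function gpuFraction(m: RunningModel): number {
  if (m.size <= 0) return 0;
  return Math.min(1, Math.max(0, m.size_vram / m.size));
}

// Names of models that are partly or fully running on the CPU.
function findCpuFallback(models: RunningModel[], minFraction: number = 0.99): string[] {
  return models.filter((m) => gpuFraction(m) < minFraction).map((m) => m.name);
}
```

Running this on a schedule and alerting on a non-empty result catches the fallback before users notice the throughput drop.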

### 2. Quantization Over-Compression
**Explanation:** Dropping below Q3_K_M on complex reasoning tasks causes attention mechanism degradation, manifesting as syntactic errors, broken logic chains, or hallucinated API signatures.

**Fix:** Maintain Q4_K_M as the production floor. Use Q3_K_M only for exploratory prototyping on VRAM-constrained hardware. Validate quantization impact against a benchmark suite before deployment.

### 3. Context Window Fragmentation
**Explanation:** Requesting `num_ctx` values exceeding available VRAM triggers memory paging, causing severe latency spikes and occasional generation truncation.

**Fix:** Calculate the VRAM budget as Model Size + Context Buffer + System Overhead. For 32B Q4_K_M (18.7 GB) on a 24 GB card, the remaining headroom does not comfortably hold a 32K context; reduce `num_ctx` to 16384 or upgrade to 32 GB+ VRAM.
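
The budget formula can be turned into a rough calculator. In this sketch the context buffer is approximated by the key/value cache, sized as 2 (keys and values) × layers × KV heads × head dimension × bytes per element × `num_ctx`. The architecture figures in the usage note (64 layers, 8 KV heads, head dimension 128, 2-byte cache entries) and the 1 GB overhead default are illustrative assumptions, not values taken from this guide. The helper names are hypothetical, and GB and GiB are treated as interchangeable for a first estimate.

```typescript
// Shape of the attention cache; all values here are caller-supplied assumptions.
interface KvShape {
  layers: number;
  kvHeads: number;
  headDim: number;
  bytesPerElement: number; // 2 for an fp16 cache
}

// Approximate KV cache size in GiB for a given context length.
function kvCacheGiB(shape: KvShape, numCtx: number): number {
  const bytesPerToken =
    2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerElement;
  return (bytesPerToken * numCtx) / (1024 * 1024 * 1024);
}

// Model Size + Context Buffer + System Overhead <= available VRAM.
function fitsInVram(
  modelGb: number,
  shape: KvShape,
  numCtx: number,
  vramGb: number,
  overheadGb: number = 1
): boolean {
  return modelGb + kvCacheGiB(shape, numCtx) + overheadGb <= vramGb;
}
```

With the assumed shape, `kvCacheGiB(shape, 32768)` comes to 8 and `kvCacheGiB(shape, 16384)` to 4, so an 18.7 GB model fits a 24 GB card at 16384 tokens of context but not at 32768.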

### 4. Reasoning Chain Suppression
**Explanation:** Using distilled variants or incorrect model tags strips the explicit `<think>` token generation, reverting to standard autoregressive completion without step-by-step verification.

**Fix:** Verify that the model tag matches `deepseek-r1:[size]`. Avoid community distills unless explicitly required for low-VRAM environments. Test with a known reasoning prompt to confirm chain output.

### 5. Prompt Template Drift
**Explanation:** Missing or malformed system directives cause the model to default to training-language patterns (often Chinese) or to omit required formatting constraints.

**Fix:** Always define explicit `SYSTEM` directives in the Modelfile. Include language enforcement, output structure requirements, and domain-specific constraints. Validate with a dry-run prompt before pipeline integration.

### 6. Sampling Parameter Misalignment
**Explanation:** High temperature (>0.7) combined with low `top_p` creates conflicting probability distributions, resulting in incoherent code blocks or broken mathematical derivations.

**Fix:** Align sampling parameters with task type. Use temperature 0.3–0.4 + `top_p` 0.85–0.9 for code/math. Reserve temperature 0.7+ for creative or exploratory prompts only.
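
These ranges can be enforced with a small pre-flight check before a request is sent. The sketch below encodes only the ranges stated in the fix above. `TaskType`, `Sampling`, and `samplingWarnings` are hypothetical names, and treating "math" the same as "code" follows the text's grouping.

```typescript
type TaskType = "code" | "math" | "creative";

interface Sampling {
  temperature: number;
  top_p: number;
}

// Returns human-readable warnings; an empty array means the settings
// fall inside the ranges recommended for the task type.
function samplingWarnings(task: TaskType, s: Sampling): string[] {
  const warnings: string[] = [];
  if (task === "creative") {
    if (s.temperature < 0.7) {
      warnings.push("creative tasks are expected to use temperature 0.7 or higher");
    }
    return warnings;
  }
  if (s.temperature < 0.3 || s.temperature > 0.4) {
    warnings.push("code/math tasks are expected to use temperature 0.3 to 0.4");
  }
  if (s.top_p < 0.85 || s.top_p > 0.9) {
    warnings.push("code/math tasks are expected to use top_p 0.85 to 0.9");
  }
  return warnings;
}
```

The Modelfile values from Step 3 (temperature 0.35, `top_p` 0.88) pass this check for code tasks without warnings.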

### 7. API Endpoint Version Mismatch
**Explanation:** Ollama's native API (`/api/generate`) differs from the OpenAI-compatible shim (`/v1/chat/completions`). Mixing endpoints causes payload structure errors and missing streaming support.

**Fix:** Standardize on `/v1/chat/completions` for chat-based workflows and `/api/generate` for raw prompt completion. Update client libraries to match the endpoint schema. Validate payload structure against runtime documentation.
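
To make the schema difference concrete, the sketch below builds both payloads from one request description. The field names follow the two request shapes used elsewhere in this guide (`options.num_predict` for the native endpoint, `messages` and `max_tokens` for the OpenAI-compatible one). `PromptSpec`, `toNativeGenerate`, and `toOpenAiChat` are hypothetical helper names.

```typescript
interface PromptSpec {
  model: string;
  prompt: string;
  temperature: number;
  maxTokens: number;
}

// Body for POST /api/generate: sampling settings live under `options`.
function toNativeGenerate(spec: PromptSpec) {
  return {
    model: spec.model,
    prompt: spec.prompt,
    stream: false,
    options: { temperature: spec.temperature, num_predict: spec.maxTokens },
  };
}

// Body for POST /v1/chat/completions: sampling settings are top-level
// and the prompt is wrapped in a messages array.
function toOpenAiChat(spec: PromptSpec) {
  return {
    model: spec.model,
    messages: [{ role: "user", content: spec.prompt }],
    temperature: spec.temperature,
    max_tokens: spec.maxTokens,
    stream: false,
  };
}
```

Keeping both builders next to each other makes it harder to send a `messages` array to the native endpoint or an `options` object to the compatibility layer by accident.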

## Production Bundle

### Action Checklist

### Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Internal code review assistant | R1:32B Q4_K_M on single RTX 4090 | Matches GPT-4o on coding benchmarks, zero API cost | $0/month after hardware |
| High-volume customer support routing | R1:14B Q4_K_M + vector search | Lower latency, sufficient for intent classification | $0/month, scales horizontally |
| Mathematical verification pipeline | R1:32B Q6_K on dual RTX 3090 | Higher precision reduces derivation errors | ~$1.2k hardware, $0 operational |
| Edge deployment (8GB VRAM) | R1:14B Q3_K_M + context truncation | Fits memory constraints, maintains 92% quality | $0 operational, acceptable accuracy trade-off |
| Multi-agent orchestration | R1:70B Q4_K_M on workstation GPU | Superior tool-use and planning capabilities | ~$3k hardware, replaces multiple API calls |

### Configuration Template

```dockerfile
# Modelfile: production-reasoning-32b
FROM unsloth/DeepSeek-R1-32B-GGUF:Q4_K_M

PARAMETER temperature 0.32
PARAMETER top_p 0.87
PARAMETER top_k 45
PARAMETER repeat_penalty 1.15
PARAMETER num_ctx 24576
PARAMETER num_thread 20

SYSTEM """You are an autonomous reasoning engine for software engineering tasks.
Process all requests through explicit step-by-step verification.
Output structured analysis, then implementation code.
Enforce strict typing, error handling, and modern syntax standards.
Never speculate; flag uncertainty explicitly."""
```

```yaml
# docker-compose.prod.yml
version: '3.8'
services:
  r1-node:
    image: ollama/ollama:latest
    container_name: reasoning-production
    runtime: nvidia
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_GPU=999
      - OLLAMA_KEEP_ALIVE=24h
    ports:
      - "11434:11434"
    volumes:
      - ./weights:/root/.ollama
      - ./configs:/app/configs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: always
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "3"
```

### Quick Start Guide

- Install NVIDIA drivers and Docker with GPU runtime support. Verify with `nvidia-smi` and `docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi`.
- Launch the inference container using the provided `docker-compose.prod.yml`. The runtime will initialize and expose port 11434.
- Pull and configure the model by placing the Modelfile in the `configs` directory and running `ollama create prod-r32 -f /app/configs/Modelfile` inside the container.
- Validate the deployment with a test prompt: `curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"prod-r32","messages":[{"role":"user","content":"Explain the time complexity of merge sort"}],"temperature":0.3}'`. Confirm structured reasoning output and GPU utilization via `nvidia-smi`.