lient integration pattern required for streaming and error handling.
Architecture Decisions
- Runtime Selection: Ollama is selected for its unified API, automatic quantization handling, and active ecosystem. For multi-tenant, high-concurrency scenarios, vLLM is recommended; however, Ollama remains the standard for single-tenant local development and edge deployment.
- Containerization: Docker ensures GPU passthrough consistency and isolates dependencies. The nvidia-container-toolkit is required for CUDA access.
- API Compatibility: The server exposes an OpenAI-compatible endpoint. Existing SDKs can connect with no code changes beyond swapping the baseURL.
- Model Management: Models are stored in a persistent volume to avoid re-downloading and to manage storage quotas.
Step-by-Step Implementation
1. Infrastructure Setup
Create a docker-compose.yml that configures Ollama with GPU support, persistent model storage, and network exposure.
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: local-llm-server
    runtime: nvidia
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_MODELS=/models
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_KEEP_ALIVE=24h
    volumes:
      - ollama_data:/root/.ollama
      - model_store:/models
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    restart: unless-stopped
volumes:
  ollama_data:
  model_store:
Rationale:
- OLLAMA_NUM_PARALLEL: Sets how many requests each loaded model handles concurrently. Set it based on VRAM headroom.
- OLLAMA_KEEP_ALIVE: Keeps the model loaded between requests, reducing cold-start latency.
- runtime: nvidia: Exposes the host GPU to the container through the NVIDIA container runtime so that inference runs on CUDA.
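"Set it based on VRAM headroom" can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the dominant per-request cost is the KV cache, sized as 2 (keys and values) × layers × KV heads × head dimension × bytes per element × context length. The model shape and the 2 GiB of free VRAM are hypothetical numbers chosen for illustration; real usage also includes compute buffers and runtime overhead, so treat the result as an upper bound rather than a setting to copy.

```typescript
// Rough KV-cache sizing to help pick an OLLAMA_NUM_PARALLEL value.
// All model numbers used here are illustrative assumptions, not measurements.
interface ModelShape {
  layers: number;          // transformer blocks
  kvHeads: number;         // key/value heads (fewer than attention heads under GQA)
  headDim: number;         // dimension per head
  bytesPerElement: number; // 2 for an FP16 KV cache
}

// Bytes of KV cache needed by one request slot at a given context length.
function kvBytesPerSlot(shape: ModelShape, contextTokens: number): number {
  const perToken =
    2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerElement;
  return perToken * contextTokens;
}

// Largest slot count whose KV caches fit in the VRAM left after loading weights.
function maxParallelSlots(
  freeVramBytes: number,
  shape: ModelShape,
  contextTokens: number,
): number {
  const perSlot = kvBytesPerSlot(shape, contextTokens);
  return Math.max(0, Math.floor(freeVramBytes / perSlot));
}

const GIB = 1024 * 1024 * 1024;
// Hypothetical 8B-class shape with grouped-query attention.
const exampleShape: ModelShape = {
  layers: 32,
  kvHeads: 8,
  headDim: 128,
  bytesPerElement: 2,
};
const perSlotAt4k = kvBytesPerSlot(exampleShape, 4096); // 0.5 GiB per slot
const slotsIn2GiB = maxParallelSlots(2 * GIB, exampleShape, 4096); // 4
```

Under these assumptions, each 4096-token slot costs 0.5 GiB, so 2 GiB of headroom supports at most four parallel requests. Leave margin below the computed value.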
2. Model Pull and Configuration
Pull a quantized variant of the model. Avoid FP16 unless you are debugging.
docker exec -it local-llm-server ollama pull llama3.2:3b-instruct-q4_K_M
The llama3.2:latest tag resolves to the 3B model and defaults to Q4_K_M. Llama 3.2 has no 7B variant, so for a model in that size class pull a different family, such as llama3.1:8b. Verify GPU offloading:
docker logs local-llm-server 2>&1 | grep "offload"
# Expected (layer count varies by model): llm_load_tensors: offloaded 33/33 layers to GPU
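The same check can be automated instead of read by eye. The sketch below parses a log line of the form shown above and reports whether every layer was offloaded. The function and type names are hypothetical, and the log wording is taken from the expected output in this guide; it may differ between runtime versions.

```typescript
// Result of parsing an "offloaded N/M layers to GPU" log line.
interface OffloadStatus {
  offloaded: number;
  total: number;
  fullyOnGpu: boolean;
}

// Returns null when the line does not contain an offload report.
function parseOffloadLine(line: string): OffloadStatus | null {
  const match = /offloaded (\d+)\/(\d+) layers to GPU/.exec(line);
  if (match === null) {
    return null;
  }
  const offloaded = Number(match[1]);
  const total = Number(match[2]);
  return { offloaded, total, fullyOnGpu: total > 0 && offloaded === total };
}

const fullOffload = parseOffloadLine(
  "llm_load_tensors: offloaded 33/33 layers to GPU",
);
const partialOffload = parseOffloadLine(
  "llm_load_tensors: offloaded 20/33 layers to GPU",
);
```

A partial result such as 20/33 means the remaining layers run on the CPU, which usually points to insufficient VRAM rather than a broken CUDA setup.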
3. TypeScript Client Integration
Use the official OpenAI SDK with a custom base URL. Implement streaming for latency perception and robust error handling for local failures.
// src/llm-client.ts
import OpenAI from "openai";

const localLLM = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // Ollama accepts any non-empty API key
  dangerouslyAllowBrowser: false, // Ensure server-side usage
});

interface LLMRequest {
  model: string;
  prompt: string;
  maxTokens?: number;
}

export async function streamCompletion({
  model,
  prompt,
  maxTokens = 1024,
}: LLMRequest) {
  try {
    const stream = await localLLM.chat.completions.create({
      model,
      messages: [{ role: "user", content: prompt }],
      stream: true,
      max_tokens: maxTokens,
      temperature: 0.7,
    });

    let fullResponse = "";
    for await (const chunk of stream) {
      const delta = chunk.choices[0]?.delta?.content;
      if (delta) {
        fullResponse += delta;
        // Emit chunk to client or process stream
        process.stdout.write(delta);
      }
    }
    return fullResponse;
  } catch (error) {
    if (error instanceof OpenAI.APIError) {
      // Handle local server errors specifically
      if (error.status === 503) {
        throw new Error("LLM Server overloaded or model unloading");
      }
      throw new Error(`LLM API Error: ${error.message}`);
    }
    throw error;
  }
}
Rationale:
- Streaming reduces perceived latency by delivering tokens as they are generated.
- Error handling catches HTTP 503, which indicates the server is busy or the model is being loaded/unloaded.
- dangerouslyAllowBrowser: false (the SDK default) keeps the client server-side only, preventing API key leakage and CORS issues.
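The error handling above distinguishes only HTTP 503. A local server tends to fail in a few other recognizable ways, and a caller usually wants to know which ones are worth retrying. The sketch below is one possible classification, not part of the SDK: the names are hypothetical, and treating a missing status as "server unreachable" and 404 as "model not pulled" are assumptions about typical local failures that should be confirmed against your setup.

```typescript
type FailureKind = "unreachable" | "model-missing" | "busy" | "other";

// Maps the HTTP status carried by an API error to a coarse failure category.
// A connection-level failure carries no status at all.
function classifyFailure(status: number | undefined): FailureKind {
  if (status === undefined) return "unreachable";
  if (status === 404) return "model-missing"; // assumption: model was never pulled
  if (status === 503) return "busy";
  return "other";
}

// Only transient conditions are worth retrying.
function shouldRetry(kind: FailureKind): boolean {
  return kind === "busy" || kind === "unreachable";
}

// Exponential backoff with a cap: 500 ms, 1 s, 2 s, ... up to capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}
```

In the catch block, classifyFailure(error.status) can drive both the message shown to the user and the decision to wait backoffDelayMs(attempt) before trying again.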
Pitfall Guide
Local LLM deployments that fail in production usually do so because of resource mismanagement and architectural blind spots.
- Ignoring Quantization Impact: Running FP16 weights takes roughly three times the VRAM of Q4_K_M (16 bits per weight versus about 5), usually with little quality improvement. This limits batch size and forces smaller models.
  - Fix: Default to K-quants (Q4_K_M or Q5_K_M) for inference.
- Context Window Mismatch: Sending prompts exceeding the model's context window causes silent truncation or OOM crashes.
  - Fix: Implement client-side token counting and truncate or summarize inputs before sending. Configure OLLAMA_CONTEXT_LENGTH where your Ollama version supports it, within the limit the model supports.
- Blocking the Event Loop: Using synchronous inference calls in Node.js blocks the main thread, degrading application responsiveness.
  - Fix: Always use async/await and streaming. Offload inference to worker threads if CPU fallback occurs.
- GPU Fallback Silence: If CUDA initialization fails, some runtimes silently fall back to CPU, which can increase latency by an order of magnitude without alerting the developer.
  - Fix: Check server logs for the "offloaded N/N layers to GPU" line. Implement health checks that verify GPU utilization via nvidia-smi.
- Unrestricted API Exposure: Binding to 0.0.0.0 without authentication allows any device on the network to use your GPU resources.
  - Fix: Use a reverse proxy (Nginx/Traefik) with API key validation or restrict binding to 127.0.0.1 for single-host setups.
- Version Drift: Model file formats change between runtime versions. Pulling a model with an old Ollama version may render it incompatible with updates.
  - Fix: Pin runtime versions in Docker tags. Re-pull models after runtime upgrades.
- No Resource Quotas: Allowing unlimited concurrent requests exhausts VRAM, causing thrashing or crashes.
  - Fix: Configure OLLAMA_NUM_PARALLEL and implement queueing in the application layer if load exceeds capacity.
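For the context window pitfall, a minimal client-side guard might look like the sketch below. It assumes a rough ratio of four characters per token, which is only a heuristic for English text; use the model's real tokenizer when exact counts matter. The function and type names are hypothetical.

```typescript
// Heuristic assumption: about 4 characters per token for English text.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

interface TokenBudget {
  contextWindow: number;   // total tokens the model accepts
  maxOutputTokens: number; // tokens reserved for the completion
}

// Trims the prompt so that prompt plus completion fit the context window.
// Keeps the most recent text, since the end of a prompt usually holds the question.
function fitPrompt(prompt: string, budget: TokenBudget): string {
  const promptBudget = budget.contextWindow - budget.maxOutputTokens;
  if (promptBudget <= 0) {
    throw new RangeError("maxOutputTokens leaves no room for the prompt");
  }
  if (estimateTokens(prompt) <= promptBudget) {
    return prompt;
  }
  const maxChars = promptBudget * CHARS_PER_TOKEN;
  return prompt.slice(prompt.length - maxChars);
}
```

Calling fitPrompt before streamCompletion turns silent server-side truncation into an explicit, testable client decision. Keeping the tail is a design choice; a system prompt at the start of the input would need to be preserved separately.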
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Individual Dev Machine | Ollama Desktop | Zero config, instant model switching, integrates with IDE tools. | Free |
| Team LAN / Shared GPU | Ollama Docker + Auth Proxy | Centralized model management, shared VRAM, access control. | Low (Infra) |
| High Concurrency / Throughput | vLLM + Docker | PagedAttention handles high request volume; better batching. | Medium (GPU) |
| Air-Gapped / Privacy Critical | Raw llama.cpp Server | No telemetry, minimal attack surface, full control over binaries. | High (Ops) |
| Edge / ARM Devices | Ollama ARM / llama.cpp | Optimized for ARM NEON; runs on Raspberry Pi/Jetson. | Low (Hardware) |
Configuration Template
Docker Compose with Auth Proxy:
services:
  ollama:
    image: ollama/ollama:0.3.10
    runtime: nvidia
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_MODELS=/models
    volumes:
      - ollama_data:/root/.ollama
      - model_store:/models
    networks:
      - llm-net
    restart: unless-stopped
  auth-proxy:
    image: nginx:alpine
    ports:
      - "8080:8080"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./htpasswd:/etc/nginx/.htpasswd
    networks:
      - llm-net
    depends_on:
      - ollama
volumes:
  ollama_data:
  model_store:
networks:
  llm-net:
    driver: bridge
nginx.conf (Auth Proxy):
events { worker_connections 1024; }
http {
  server {
    listen 8080;
    location / {
      auth_basic "Restricted Access";
      auth_basic_user_file /etc/nginx/.htpasswd;
      proxy_pass http://ollama:11434;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      # Pass streamed tokens through immediately instead of buffering them
      proxy_buffering off;
    }
  }
}
Generate htpasswd:
# Write to ./htpasswd so the file matches the path mounted in the Compose file
htpasswd -c ./htpasswd user
TypeScript Client Config:
// config/llm.ts
export const LLM_CONFIG = {
  baseURL: process.env.LLM_API_URL || "http://localhost:8080/v1",
  apiKey: process.env.LLM_API_KEY || "secure-api-key",
  model: "llama3.2:3b-instruct-q4_K_M",
  maxTokens: 1024,
  timeout: 30000,
  retries: 3,
};
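One way to wire this configuration into the OpenAI SDK is sketched below. The SDK constructor accepts timeout (in milliseconds) and maxRetries, so the config's retries field has to be renamed. Note also that the nginx template in this section uses HTTP basic auth, while the SDK sends apiKey as a Bearer token; the sketch therefore optionally sets an Authorization header through defaultHeaders. The helper name, the LocalClientOptions type, and the basicAuthToken parameter (the base64 encoding of user:password) are hypothetical, and the sketch assumes your SDK version lets defaultHeaders override the default Authorization header, which is worth verifying.

```typescript
interface LlmConfig {
  baseURL: string;
  apiKey: string;
  model: string;
  maxTokens: number;
  timeout: number; // milliseconds
  retries: number;
}

// Subset of the SDK constructor options used by this sketch.
interface LocalClientOptions {
  baseURL: string;
  apiKey: string;
  timeout: number;
  maxRetries: number;
  defaultHeaders?: Record<string, string>;
}

// Translates the application config into SDK constructor options.
// basicAuthToken is base64("user:password") for the basic-auth proxy.
function toClientOptions(
  config: LlmConfig,
  basicAuthToken?: string,
): LocalClientOptions {
  const options: LocalClientOptions = {
    baseURL: config.baseURL,
    apiKey: config.apiKey,
    timeout: config.timeout,
    maxRetries: config.retries,
  };
  if (basicAuthToken !== undefined) {
    options.defaultHeaders = { Authorization: `Basic ${basicAuthToken}` };
  }
  return options;
}
```

The result can be passed directly to the constructor, as in new OpenAI(toClientOptions(LLM_CONFIG, token)). The model and maxTokens fields stay in the config because they belong to each request rather than to the client.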
Quick Start Guide
- Install Runtime:
curl -fsSL https://ollama.com/install.sh | sh
# Or use Docker: docker pull ollama/ollama:latest
- Start Server:
# Docker method
docker run -d --gpus all -v ollama_data:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
- Pull Model:
docker exec ollama ollama pull llama3.2:3b-instruct-q4_K_M
- Verify API:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": false
  }'
- Integrate Client:
Copy the TypeScript client code, update baseURL to http://localhost:11434/v1, and run your application. Monitor VRAM with watch -n 1 nvidia-smi.
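Before the first request, an application can confirm that the model from the Pull Model step is actually present. Ollama's native GET /api/tags endpoint returns the locally available models as a JSON object with a models array whose entries carry a name field. The sketch below checks such a response for a given model; the helper names are hypothetical, and fetching the response is left to the caller.

```typescript
// Shape of the relevant part of the GET /api/tags response.
interface TagsResponse {
  models: Array<{ name: string }>;
}

// A model name without an explicit tag refers to the "latest" tag.
function withDefaultTag(name: string): string {
  return name.indexOf(":") === -1 ? `${name}:latest` : name;
}

// True when the wanted model appears in the server's local model list.
function hasModel(tags: TagsResponse, wanted: string): boolean {
  const target = withDefaultTag(wanted);
  return tags.models.some((m) => withDefaultTag(m.name) === target);
}

// Example response body, trimmed to the fields this check reads.
const exampleTags: TagsResponse = {
  models: [
    { name: "llama3.2:3b-instruct-q4_K_M" },
    { name: "llama3.2:latest" },
  ],
};
const modelReady = hasModel(exampleTags, "llama3.2:3b-instruct-q4_K_M");
```

Running this check at startup gives a clear "model not pulled" error instead of a failed first completion.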
Conclusion:
Local LLM API server setup requires rigorous attention to quantization, runtime selection, and resource management. By adopting the architecture patterns and safeguards outlined in this guide, teams can achieve cloud-comparable reliability with the privacy, latency, and cost benefits of local inference. The transition from experimental setup to production service hinges on treating the LLM runtime as a critical infrastructure component, not a development convenience.