Run Claude Code Locally for Free with Docker Model Runner
Architecting Offline-First AI Workflows: Local LLM Integration with Docker Model Runner and Claude Code CLI
Current Situation Analysis
Cloud-hosted AI coding assistants have fundamentally changed developer productivity, but they introduce three critical operational constraints: unpredictable token-based billing, data residency compliance risks, and network dependency. As software projects scale, growing context windows, frequent file reads, and iterative refactoring requests drive API consumption up non-linearly. For teams handling proprietary intellectual property or operating in restricted environments, routing source code through external inference endpoints is no longer a viable default.
Many engineering teams mistakenly assume that running large language models locally requires managing complex orchestration layers, custom API gateways, or sacrificing the polished developer experience of cloud-native CLI tools. This perception stems from early local inference setups that demanded manual GPU driver configuration, fragmented model repositories, and inconsistent API contracts. The reality has shifted dramatically. Containerized inference runtimes now abstract hardware complexity and expose standardized REST interfaces that align with existing cloud SDKs.
Docker Model Runner addresses this gap by providing a unified, container-native lifecycle manager for LLMs. It automatically handles model quantization, GPU/CPU resource allocation, and exposes an Anthropic-compatible /v1/messages endpoint on a local TCP port. This architectural shift allows developers to treat local inference as a drop-in replacement for cloud APIs, preserving tooling familiarity while eliminating external data transmission and per-request costs.
WOW Moment: Key Findings
The transition from cloud API routing to local containerized inference fundamentally alters the cost, security, and reliability profile of AI-assisted development. The following comparison illustrates the operational delta when routing Claude Code CLI requests through Docker Model Runner versus traditional cloud endpoints.
| Approach | Cost Structure | Data Residency | Network Dependency | Setup Overhead |
|---|---|---|---|---|
| Cloud API Routing | Pay-per-token, scales with project complexity | External provider infrastructure | Required for all inference | Minimal (API key only) |
| Local Docker Model Runner | Zero marginal cost, hardware-bound | Fully on-premise/developer machine | Optional (offline capable) | Moderate (Docker + model pull) |
This finding matters because it decouples AI capability from subscription economics. Developers can now run iterative code generation, refactoring, and documentation tasks without token budget constraints. The local endpoint also enables deterministic behavior in air-gapped environments, CI/CD runners with restricted outbound traffic, and compliance-heavy workflows where source code cannot leave the host machine. By standardizing the inference layer through Docker, teams gain reproducible model versions, versioned context windows, and consistent API contracts across development environments.
Core Solution
Implementing a local-first AI workflow requires aligning three components: the containerized inference runtime, the model artifact, and the CLI tooling. The architecture relies on Docker Model Runner's ability to expose a standardized HTTP interface that Claude Code CLI natively understands.
Step 1: Runtime Initialization and TCP Binding
Docker Model Runner operates as a background service within Docker Desktop or Docker Engine. The first step is enabling TCP access on a dedicated port. The default is 12434, but you can bind to any available port.
# Enable TCP listener for local inference routing
docker desktop enable model-runner --tcp 12434
Architecture Rationale: Binding to a specific port isolates inference traffic from other Docker services. This prevents port collisions and allows firewall rules to restrict external access. The runtime automatically negotiates hardware acceleration (CUDA, Metal, or CPU fallback) based on available system resources.
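Before proceeding, it is worth confirming that the listener actually came up. A quick check, using only commands introduced in this guide plus the standard lsof utility:
# Confirm the runner is active and the port is bound
docker model status
lsof -i :12434   # should list the Model Runner process listening on the port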
Step 2: Model Selection and Artifact Retrieval
Models are distributed through the Docker Hub AI catalog. Selection should prioritize coding-optimized architectures and quantization formats that balance VRAM consumption with inference quality. The Q4_K_M format uses 4-bit quantization with mixed precision, reducing memory footprint while preserving code generation accuracy.
# Retrieve a coding-optimized model variant
docker model pull ai/phi4:14B-Q4_K_M
# Verify artifact integrity and runtime status
docker model ls
docker model status
Architecture Rationale: Containerized models are immutable artifacts. Pulling a specific tag ensures reproducible inference across machines. Quantization directly impacts context window capacity and token generation speed. Developers should benchmark their target hardware before committing to larger parameter counts.
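As a rough sizing rule, 4-bit quantization stores about half a byte per parameter, so a 14B model needs on the order of 7 GB for weights alone, before KV cache and runtime overhead. A back-of-envelope sketch:
# Back-of-envelope VRAM estimate: Q4 ≈ 0.5 bytes per parameter
# (weights only; KV cache and runtime overhead come on top)
PARAMS_B=14
echo "~$(( PARAMS_B / 2 )) GB for weights at 4-bit quantization"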
Step 3: Endpoint Validation and Payload Verification
Before routing CLI traffic, validate that the inference service responds correctly to Anthropic-compatible message payloads. The endpoint expects a JSON structure with model, max_tokens, and messages fields.
# Validate local inference routing
curl -s http://localhost:12434/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "ai/phi4:14B-Q4_K_M",
"max_tokens": 64,
"messages": [{"role": "user", "content": "Verify local routing."}]
}' | jq '.content[0].text'
Architecture Rationale: Direct endpoint testing isolates network configuration issues from CLI routing problems. Using jq filters the response payload, confirming that the inference service returns structured text rather than raw HTTP errors. This step prevents silent failures when Claude Code attempts to initialize sessions.
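For reference, a healthy response follows the Anthropic messages schema. The sketch below shows the fields the jq filter above relies on; the values are illustrative and exact metadata may vary by runtime:
# Illustrative response shape (abridged):
# {
#   "role": "assistant",
#   "content": [{"type": "text", "text": "Local routing verified."}],
#   "stop_reason": "end_turn"
# }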
Step 4: CLI Routing and Environment Binding
Claude Code CLI routes requests to Anthropic's cloud by default. Overriding this behavior requires setting the base URL environment variable and specifying the local model identifier. The CLI transparently forwards requests to the configured endpoint.
# Route CLI traffic to local inference service
export ANTHROPIC_BASE_URL="http://localhost:12434"
claude --model ai/phi4:14B-Q4_K_M
Architecture Rationale: Environment variables provide session-scoped configuration without modifying CLI binaries. This approach allows developers to toggle between cloud and local routing by adjusting shell state. The --model flag maps directly to the Docker Model Runner artifact name, ensuring the runtime loads the correct weights.
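One convenient pattern is a pair of shell functions that make the toggle explicit. The function names below are illustrative, not part of the CLI:
# Hypothetical helpers for switching between local and cloud routing
claude-local() {
  env ANTHROPIC_BASE_URL="http://localhost:12434" \
    claude --model "ai/phi4:14B-Q4_K_M" "$@"
}
claude-cloud() {
  env -u ANTHROPIC_BASE_URL claude "$@"
}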
Step 5: Context Window Expansion and Model Packaging
Default context windows often restrict large codebase analysis. Docker Model Runner supports repackaging models with expanded context limits, trading additional VRAM for longer conversation history and file inclusion.
# Base model retrieval
docker model pull ai/gpt-oss
# Repackage with expanded context window
docker model package \
--from ai/gpt-oss \
--context-size 32000 \
gpt-oss:32k
# Deploy repackaged variant
claude --model gpt-oss:32k
Architecture Rationale: Context window size directly correlates with memory allocation and token generation latency. Repackaging creates a new immutable artifact with modified runtime parameters. This approach avoids rebuilding base images and allows teams to maintain multiple context configurations for different project scales.
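Maintaining several context configurations is then a matter of packaging once per size. A sketch using the same flags as above; the tag names are illustrative:
# Package one variant per context size for different project scales
for ctx in 8000 16000 32000; do
  docker model package --from ai/gpt-oss --context-size "$ctx" "gpt-oss:${ctx}ctx"
done
docker model ls   # confirm the packaged variants are available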
Step 6: Request Monitoring and Telemetry
Observability is critical when debugging inference behavior. Docker Model Runner exposes a request stream that logs payload metadata, token counts, and latency metrics without intercepting application traffic.
# Stream inference telemetry
docker model requests --model ai/phi4:14B-Q4_K_M
Architecture Rationale: Real-time request logging enables performance profiling and error diagnosis. Developers can identify context overflow, malformed payloads, or hardware throttling without adding external monitoring agents. The stream operates independently of the inference API, ensuring zero performance degradation.
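Because the stream writes to stdout, it can also be captured for offline analysis while you watch it live. A minimal sketch; the log filename is illustrative:
# Mirror the telemetry stream to a timestamped log while viewing it live
docker model requests --model ai/phi4:14B-Q4_K_M \
  | tee "dmr-requests-$(date +%Y%m%d-%H%M%S).log"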
Pitfall Guide
1. Port Collision on Default Interface
Explanation: Multiple services or previous Docker containers may already occupy port 12434, causing the model runner to fail silently or bind to an unexpected interface.
Fix: Verify port availability before initialization using lsof -i :12434 or netstat -tuln | grep 12434. Bind to an alternative port (--tcp 12435) and update ANTHROPIC_BASE_URL accordingly.
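Concretely, rebinding and rerouting takes two commands, reusing the flags introduced in Steps 1 and 4:
# Rebind the listener to a free port and point the CLI at it
docker desktop enable model-runner --tcp 12435
export ANTHROPIC_BASE_URL="http://localhost:12435"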
2. Quantization Mismatch and VRAM Exhaustion
Explanation: Loading unquantized or high-precision models on consumer hardware triggers out-of-memory errors, causing the inference service to crash or fallback to CPU with severe latency.
Fix: Always verify VRAM capacity against model requirements. Use Q4_K_M or Q5_K_M quantization for 7B-14B parameter models. Monitor GPU utilization with nvidia-smi or metal diagnostics before scaling context windows.
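On NVIDIA hardware, a polling query gives a quick read on memory headroom before you scale context windows. A sketch:
# Poll GPU memory usage every 2 seconds (NVIDIA hardware assumed)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2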
3. Environment Variable Scope Leakage
Explanation: Setting ANTHROPIC_BASE_URL globally affects all CLI tools and scripts that rely on Anthropic's API, causing unexpected routing to local endpoints in unrelated workflows.
Fix: Scope variables to specific shell sessions or use wrapper scripts. Prefer env ANTHROPIC_BASE_URL=http://localhost:12434 claude --model <name> for one-off executions, or maintain separate shell profiles for local vs cloud routing.
4. Context Window Overflow Without Repackaging
Explanation: Attempting to process large repositories with default context limits results in truncated file reads, incomplete refactoring suggestions, and silent context drops.
Fix: Use docker model package --context-size to create project-specific variants. Benchmark token consumption per session and adjust context limits based on average codebase size. Monitor docker model requests for truncation warnings.
5. Missing System Prompts for Coding Tasks
Explanation: Local models lack the cloud-hosted system instructions that optimize Claude Code for software engineering, resulting in verbose outputs, poor formatting, or irrelevant suggestions.
Fix: Inject coding-optimized system prompts via CLI configuration or wrapper scripts. Use structured markdown templates for file reads, diff generation, and test scaffolding. Validate output consistency across multiple iterations before adopting as a default workflow.
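As a sketch, the prompt can be injected at launch. This assumes a Claude Code version that supports the --append-system-prompt flag; if yours does not, fall back to a wrapper that prepends the instructions to each prompt:
# Inject a coding-oriented system prompt at session start
# (assumes the CLI supports --append-system-prompt)
env ANTHROPIC_BASE_URL="http://localhost:12434" \
  claude --model ai/phi4:14B-Q4_K_M \
  --append-system-prompt "Prefer concise answers, unified diffs, and fenced code blocks."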
6. Inconsistent API Payload Compatibility
Explanation: Some local inference runtimes deviate from Anthropic's message schema, causing CLI parsing failures or malformed response handling.
Fix: Validate endpoint compatibility using the curl test payload before CLI integration. If responses lack content[0].text structure, implement a lightweight proxy that normalizes payloads. Prefer Docker Model Runner's native compatibility layer to avoid custom translation layers.
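A one-liner built on the Step 3 payload can gate CLI integration on schema compatibility; jq -e exits non-zero when the queried path is missing or null:
# Assert the Anthropic message shape before wiring up the CLI
curl -s http://localhost:12434/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"ai/phi4:14B-Q4_K_M","max_tokens":16,"messages":[{"role":"user","content":"ping"}]}' \
  | jq -e '.content[0].text' >/dev/null \
  && echo "schema OK" || echo "schema mismatch: consider a normalizing proxy"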
7. Overlooking Concurrency and Request Throttling
Explanation: Running multiple CLI sessions or background agents against a single local endpoint saturates GPU memory and causes request queuing, timeouts, or degraded token generation speed.
Fix: Limit concurrent sessions to one active CLI instance per model artifact. Implement request queuing in wrapper scripts or use Docker Model Runner's built-in concurrency limits. Monitor docker model status for active session counts and adjust workflow parallelism accordingly.
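A simple way to enforce one active session per artifact is an advisory file lock around the wrapper invocation. A sketch using flock; the lock path is illustrative:
# Serialize CLI sessions against the local endpoint with an advisory lock
flock /tmp/dmr-claude.lock \
  env ANTHROPIC_BASE_URL="http://localhost:12434" \
  claude --model ai/phi4:14B-Q4_K_M "$@"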
Production Bundle
Action Checklist
- Verify Docker Desktop/Engine installation and enable Model Runner TCP binding on an available port
- Pull coding-optimized model artifacts with appropriate quantization (Q4_K_M recommended)
- Validate endpoint responsiveness using an Anthropic-compatible JSON payload and jq filtering
- Configure ANTHROPIC_BASE_URL with session-scoped environment variables or shell wrappers
- Test CLI routing with the --model flag and verify token generation latency
- Package context-expanded variants for large codebase analysis using docker model package
- Enable request telemetry streaming for performance profiling and error diagnosis
- Document model versions, context limits, and hardware requirements in project README
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Solo developer with 16GB+ VRAM | Local Docker Model Runner + Q4 quantized model | Eliminates token costs, enables offline work, maintains CLI familiarity | Zero marginal cost, hardware amortization |
| Enterprise with compliance requirements | Local routing with strict port binding and firewall rules | Prevents data exfiltration, ensures auditability, meets regulatory standards | Infrastructure setup cost, reduced cloud API spend |
| CI/CD pipeline with outbound restrictions | Pre-pulled model artifacts + containerized CLI execution | Guarantees reproducible builds, avoids network timeouts, scales with runner pool | Container registry storage, runner GPU allocation |
| Travel/air-gapped development | Local endpoint + expanded context packaging | Maintains productivity without internet, handles large repos offline | VRAM dependency, model update latency |
Configuration Template
#!/usr/bin/env bash
# local-ai-router.sh - Session-scoped CLI routing wrapper
LOCAL_INFERENCE_PORT="${LOCAL_INFERENCE_PORT:-12434}"
LOCAL_MODEL="${LOCAL_MODEL:-ai/phi4:14B-Q4_K_M}"
BASE_URL="http://localhost:${LOCAL_INFERENCE_PORT}"
# Validate runtime availability with a minimal POST (a bare GET on the
# messages endpoint would typically not return 200)
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"${LOCAL_MODEL}\",\"max_tokens\":1,\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}]}" \
  "${BASE_URL}/v1/messages")
if [ "${HTTP_CODE}" != "200" ]; then
  echo "Error: Local inference service unavailable on port ${LOCAL_INFERENCE_PORT} (HTTP ${HTTP_CODE})"
  exit 1
fi
# Execute CLI with scoped environment
env ANTHROPIC_BASE_URL="${BASE_URL}" claude --model "${LOCAL_MODEL}" "$@"
Usage:
chmod +x local-ai-router.sh
# Override the model via the wrapper's LOCAL_MODEL variable (the script
# already passes --model internally, so don't pass it again)
LOCAL_MODEL="gpt-oss:32k" ./local-ai-router.sh
Quick Start Guide
- Initialize Runtime: Run docker desktop enable model-runner --tcp 12434 to bind the inference service to a local TCP interface.
- Retrieve Model: Execute docker model pull ai/phi4:14B-Q4_K_M to download a quantized, coding-optimized artifact.
- Validate Endpoint: Test routing with curl http://localhost:12434/v1/messages -H "Content-Type: application/json" -d '{"model":"ai/phi4:14B-Q4_K_M","max_tokens":32,"messages":[{"role":"user","content":"test"}]}'.
- Launch CLI: Run ANTHROPIC_BASE_URL=http://localhost:12434 claude --model ai/phi4:14B-Q4_K_M to begin local-first development sessions.
