
Run Claude Code Locally for Free with Docker Model Runner

By Codcompass Team · 8 min read

Architecting Offline-First AI Workflows: Local LLM Integration with Docker Model Runner and Claude Code CLI

Current Situation Analysis

Cloud-hosted AI coding assistants have fundamentally changed developer productivity, but they introduce three critical operational constraints: unpredictable token-based billing, data residency compliance risks, and network dependency. As software projects scale, the volume of context windows, file reads, and iterative refactoring requests causes API consumption to grow non-linearly. For teams handling proprietary intellectual property or operating in restricted environments, routing source code through external inference endpoints is no longer a viable default.

Many engineering teams mistakenly assume that running large language models locally requires managing complex orchestration layers, custom API gateways, or sacrificing the polished developer experience of cloud-native CLI tools. This perception stems from early local inference setups that demanded manual GPU driver configuration, fragmented model repositories, and inconsistent API contracts. The reality has shifted dramatically. Containerized inference runtimes now abstract hardware complexity and expose standardized REST interfaces that align with existing cloud SDKs.

Docker Model Runner addresses this gap by providing a unified, container-native lifecycle manager for LLMs. It automatically handles model quantization, GPU/CPU resource allocation, and exposes an Anthropic-compatible /v1/messages endpoint on a local TCP port. This architectural shift allows developers to treat local inference as a drop-in replacement for cloud APIs, preserving tooling familiarity while eliminating external data transmission and per-request costs.

WOW Moment: Key Findings

The transition from cloud API routing to local containerized inference fundamentally alters the cost, security, and reliability profile of AI-assisted development. The following comparison illustrates the operational delta when routing Claude Code CLI requests through Docker Model Runner versus traditional cloud endpoints.

| Approach | Cost Structure | Data Residency | Network Dependency | Setup Overhead |
| --- | --- | --- | --- | --- |
| Cloud API Routing | Pay-per-token, scales with project complexity | External provider infrastructure | Required for all inference | Minimal (API key only) |
| Local Docker Model Runner | Zero marginal cost, hardware-bound | Fully on-premise/developer machine | Optional (offline capable) | Moderate (Docker + model pull) |

This finding matters because it decouples AI capability from subscription economics. Developers can now run iterative code generation, refactoring, and documentation tasks without token budget constraints. The local endpoint also enables deterministic behavior in air-gapped environments, CI/CD runners with restricted outbound traffic, and compliance-heavy workflows where source code cannot leave the host machine. By standardizing the inference layer through Docker, teams gain reproducible model versions, versioned context windows, and consistent API contracts across development environments.

Core Solution

Implementing a local-first AI workflow requires aligning three components: the containerized inference runtime, the model artifact, and the CLI tooling. The architecture relies on Docker Model Runner's ability to expose a standardized HTTP interface that Claude Code CLI natively understands.

Step 1: Runtime Initialization and TCP Binding

Docker Model Runner operates as a background service within Docker Desktop or Docker Engine. The first step is enabling TCP access on a dedicated port. The default port is 12434, but you can bind to any available port.

# Enable TCP listener for local inference routing
docker desktop enable model-runner --tcp 12434

Architecture Rationale: Binding to a specific port isolates inference traffic from other Docker services. This prevents port collisions and allows firewall rules to restrict external access. The runtime automatically negotiates hardware acceleration (CUDA, Metal, or CPU fallback) based on available system resources.
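Before enabling the listener, a quick pre-flight check avoids the silent bind failures covered in the pitfall guide below. A minimal sketch using bash's built-in `/dev/tcp` probe (the helper name is illustrative; 12434 is the default port assumed above):

```shell
#!/usr/bin/env bash
# Pre-flight sketch: succeed only when nothing already listens on the
# target port, so `--tcp` can bind cleanly. Uses bash's /dev/tcp, so no
# dependency on lsof or netstat.
port_is_free() {
  # A successful TCP connect means something is listening => NOT free.
  ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}
```

Usage: `port_is_free 12434 && docker desktop enable model-runner --tcp 12434`.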

Step 2: Model Selection and Artifact Retrieval

Models are distributed through the Docker Hub AI catalog. Selection should prioritize coding-optimized architectures and quantization formats that balance VRAM consumption with inference quality. The Q4_K_M format uses 4-bit quantization with mixed precision, reducing memory footprint while preserving code generation accuracy.

# Retrieve a coding-optimized model variant
docker model pull ai/phi4:14B-Q4_K_M

# Verify artifact integrity and runtime status
docker model ls
docker model status

Architecture Rationale: Containerized models are immutable artifacts. Pulling a specific tag ensures reproducible inference across machines. Quantization directly impacts context window capacity and token generation speed. Developers should benchmark their target hardware before committing to larger parameter counts.
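For a quick sanity check before pulling, the artifact footprint can be estimated from the parameter count. The 0.62 bytes-per-parameter factor below is a rough rule of thumb for Q4_K_M (about 4.85 bits per weight plus overhead), not an exact figure from any model card:

```shell
# Rough sizing sketch: approximate on-disk/VRAM footprint of a Q4_K_M
# artifact. 0.62 bytes per parameter is a ballpark heuristic, not exact.
estimate_q4km_gib() {
  local params_billions="$1"
  awk -v p="$params_billions" 'BEGIN { printf "%.1f", p * 0.62 }'
}
```

For example, `estimate_q4km_gib 14` lands near 9 GiB, in line with published 14B Q4_K_M artifacts.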

Step 3: Endpoint Validation and Payload Verification

Before routing CLI traffic, validate that the inference service responds correctly to Anthropic-compatible message payloads. The endpoint expects a JSON structure with model, max_tokens, and messages fields.

# Validate local inference routing
curl -s http://localhost:12434/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/phi4:14B-Q4_K_M",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Verify local routing."}]
  }' | jq '.content[0].text'

Architecture Rationale: Direct endpoint testing isolates network configuration issues from CLI routing problems. Using jq filters the response payload, confirming that the inference service returns structured text rather than raw HTTP errors. This step prevents silent failures when Claude Code attempts to initialize sessions.
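The manual curl check can be turned into a reusable gate. A sketch, assuming jq is installed (the helper name is illustrative):

```shell
# Sketch: gate CLI integration on the response matching the expected
# Anthropic message shape, i.e. .content[0].text is a string.
has_message_text() {
  jq -e '.content[0].text | type == "string"' >/dev/null 2>&1 <<<"$1"
}
```

The helper exits 0 for a well-formed message body and nonzero for error payloads, so it drops straight into an `if`/`&&` chain.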

Step 4: CLI Routing and Environment Binding

Claude Code CLI routes requests to Anthropic's cloud by default. Overriding this behavior requires setting the base URL environment variable and specifying the local model identifier. The CLI transparently forwards requests to the configured endpoint.

# Route CLI traffic to local inference service
export ANTHROPIC_BASE_URL="http://localhost:12434"
claude --model ai/phi4:14B-Q4_K_M

Architecture Rationale: Environment variables provide session-scoped configuration without modifying CLI binaries. This approach allows developers to toggle between cloud and local routing by adjusting shell state. The --model flag maps directly to the Docker Model Runner artifact name, ensuring the runtime loads the correct weights.
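The session-scoped pattern can be wrapped in a small launcher function. This is a sketch: `LOCAL_MODEL`, `LOCAL_INFERENCE_PORT`, and the `DRY_RUN` escape hatch are all hypothetical conventions, not CLI features:

```shell
# Sketch: route the CLI locally for one invocation without polluting the
# global environment. DRY_RUN=1 prints the resolved command instead of
# running it, which is handy for debugging routing.
claude_local() {
  local model="${LOCAL_MODEL:-ai/phi4:14B-Q4_K_M}"
  local url="http://localhost:${LOCAL_INFERENCE_PORT:-12434}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "ANTHROPIC_BASE_URL=${url} claude --model ${model}"
    return 0
  fi
  ANTHROPIC_BASE_URL="${url}" claude --model "${model}" "$@"
}
```

Because the variable is set only on the `claude` invocation itself, other tools in the same shell keep routing to the cloud.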

Step 5: Context Window Expansion and Model Packaging

Default context windows often restrict large codebase analysis. Docker Model Runner supports repackaging models with expanded context limits, trading additional VRAM for longer conversation history and file inclusion.

# Base model retrieval
docker model pull ai/gpt-oss

# Repackage with expanded context window
docker model package \
  --from ai/gpt-oss \
  --context-size 32000 \
  gpt-oss:32k

# Deploy repackaged variant
claude --model gpt-oss:32k

Architecture Rationale: Context window size directly correlates with memory allocation and token generation latency. Repackaging creates a new immutable artifact with modified runtime parameters. This approach avoids rebuilding base images and allows teams to maintain multiple context configurations for different project scales.
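To reason about the VRAM cost before repackaging, the KV cache can be approximated as context length × layers × KV heads × head dimension × 2 (keys and values) × 2 bytes (fp16). The sketch below implements that formula; the architecture numbers in the example are illustrative placeholders, not any specific model's configuration:

```shell
# Sketch: fp16 KV-cache size for a given context length.
# Formula: ctx * layers * kv_heads * head_dim * 2 (K and V) * 2 bytes.
kv_cache_gib() {
  local ctx="$1" layers="$2" kv_heads="$3" head_dim="$4"
  awk -v c="$ctx" -v l="$layers" -v h="$kv_heads" -v d="$head_dim" \
    'BEGIN { printf "%.2f", (c * l * h * d * 4) / (1024 ^ 3) }'
}
```

With placeholder values such as `kv_cache_gib 32000 40 10 128`, a 32k context adds roughly 6 GiB on top of the model weights, which is why repackaging should follow a hardware check rather than precede it.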

Step 6: Request Monitoring and Telemetry

Observability is critical when debugging inference behavior. Docker Model Runner exposes a request stream that logs payload metadata, token counts, and latency metrics without intercepting application traffic.

# Stream inference telemetry
docker model requests --model ai/phi4:14B-Q4_K_M

Architecture Rationale: Real-time request logging enables performance profiling and error diagnosis. Developers can identify context overflow, malformed payloads, or hardware throttling without adding external monitoring agents. The stream operates independently of the inference API, ensuring zero performance degradation.
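The telemetry stream can be piped into ordinary text tools for quick profiling. A sketch that averages latency, assuming key=value lines containing a `latency_ms=NNN` field (the line format is an assumption, not documented output):

```shell
# Sketch: mean latency over a telemetry stream whose lines carry
# key=value pairs, e.g. "req=abc tokens=128 latency_ms=842".
mean_latency_ms() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^latency_ms=/) { sub(/^latency_ms=/, "", $i); sum += $i; n++ }
  } END { if (n) printf "%.0f", sum / n }'
}
```

Usage: `docker model requests --model ai/phi4:14B-Q4_K_M | mean_latency_ms` (adjust the field name to whatever your runner actually emits).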

Pitfall Guide

1. Port Collision on Default Interface

Explanation: Multiple services or previous Docker containers may already occupy port 12434, causing the model runner to fail silently or bind to an unexpected interface. Fix: Verify port availability before initialization using lsof -i :12434 or netstat -tuln | grep 12434. Bind to an alternative port (--tcp 12435) and update ANTHROPIC_BASE_URL accordingly.

2. Quantization Mismatch and VRAM Exhaustion

Explanation: Loading unquantized or high-precision models on consumer hardware triggers out-of-memory errors, causing the inference service to crash or fall back to CPU with severe latency. Fix: Always verify VRAM capacity against model requirements. Use Q4_K_M or Q5_K_M quantization for 7B-14B parameter models. Monitor GPU utilization with nvidia-smi or Metal diagnostics before scaling context windows.

3. Environment Variable Scope Leakage

Explanation: Setting ANTHROPIC_BASE_URL globally affects all CLI tools and scripts that rely on Anthropic's API, causing unexpected routing to local endpoints in unrelated workflows. Fix: Scope variables to specific shell sessions or use wrapper scripts. Prefer env ANTHROPIC_BASE_URL=http://localhost:12434 claude --model <name> for one-off executions, or maintain separate shell profiles for local vs cloud routing.

4. Context Window Overflow Without Repackaging

Explanation: Attempting to process large repositories with default context limits results in truncated file reads, incomplete refactoring suggestions, and silent context drops. Fix: Use docker model package --context-size to create project-specific variants. Benchmark token consumption per session and adjust context limits based on average codebase size. Monitor docker model requests for truncation warnings.

5. Missing System Prompts for Coding Tasks

Explanation: Local models lack the cloud-hosted system instructions that optimize Claude Code for software engineering, resulting in verbose outputs, poor formatting, or irrelevant suggestions. Fix: Inject coding-optimized system prompts via CLI configuration or wrapper scripts. Use structured markdown templates for file reads, diff generation, and test scaffolding. Validate output consistency across multiple iterations before adopting as a default workflow.
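One low-effort mitigation is a project-level CLAUDE.md, which Claude Code loads into context at startup. The instructions below are an example starting point, not a canonical prompt:

```shell
# Sketch: seed a project-level CLAUDE.md with coding-focused instructions
# to compensate for the system prompt a local model does not receive.
cat > CLAUDE.md <<'EOF'
- Respond with concise, runnable code; avoid prose unless asked.
- Propose edits as unified diffs against existing files.
- Match the project's existing style and naming conventions.
EOF
```

Keep the file short: every line consumes context window on each request, which matters more with local models than with cloud endpoints.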

6. Inconsistent API Payload Compatibility

Explanation: Some local inference runtimes deviate from Anthropic's message schema, causing CLI parsing failures or malformed response handling. Fix: Validate endpoint compatibility using the curl test payload before CLI integration. If responses lack content[0].text structure, implement a lightweight proxy that normalizes payloads. Prefer Docker Model Runner's native compatibility layer to avoid custom translation layers.
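If a runtime emits OpenAI-style responses instead, a jq filter can normalize them before they reach the CLI. A sketch; the input field names are assumptions about the deviating runtime and should be adjusted to its actual schema:

```shell
# Sketch: normalize an OpenAI-style chat completion into the Anthropic
# content[0].text shape. Falls through to the Anthropic shape when the
# input already conforms.
normalize_response() {
  jq '{content: [{type: "text",
                  text: (.choices[0].message.content // .content[0].text)}]}'
}
```

This belongs inside a small proxy or test harness; as the text notes, Docker Model Runner's native compatibility layer should make it unnecessary in the common case.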

7. Overlooking Concurrency and Request Throttling

Explanation: Running multiple CLI sessions or background agents against a single local endpoint saturates GPU memory and causes request queuing, timeouts, or degraded token generation speed. Fix: Limit concurrent sessions to one active CLI instance per model artifact. Implement request queuing in wrapper scripts or use Docker Model Runner's built-in concurrency limits. Monitor docker model status for active session counts and adjust workflow parallelism accordingly.
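A minimal queuing mechanism needs nothing more than flock(1): concurrent invocations block on the lock file and run one at a time. The lock path and helper name are illustrative:

```shell
# Sketch: serialize sessions against a single local endpoint. Concurrent
# calls queue on the lock file instead of competing for GPU memory.
with_inference_lock() {
  flock "${INFERENCE_LOCK:-/tmp/model-runner.lock}" "$@"
}
```

Usage: `with_inference_lock claude --model ai/phi4:14B-Q4_K_M`.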

Production Bundle

Action Checklist

  • Verify Docker Desktop/Engine installation and enable Model Runner TCP binding on an available port
  • Pull coding-optimized model artifacts with appropriate quantization (Q4_K_M recommended)
  • Validate endpoint responsiveness using Anthropic-compatible JSON payload and jq filtering
  • Configure ANTHROPIC_BASE_URL with session-scoped environment variables or shell wrappers
  • Test CLI routing with --model flag and verify token generation latency
  • Package context-expanded variants for large codebase analysis using docker model package
  • Enable request telemetry streaming for performance profiling and error diagnosis
  • Document model versions, context limits, and hardware requirements in project README

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Solo developer with 16GB+ VRAM | Local Docker Model Runner + Q4-quantized model | Eliminates token costs, enables offline work, maintains CLI familiarity | Zero marginal cost, hardware amortization |
| Enterprise with compliance requirements | Local routing with strict port binding and firewall rules | Prevents data exfiltration, ensures auditability, meets regulatory standards | Infrastructure setup cost, reduced cloud API spend |
| CI/CD pipeline with outbound restrictions | Pre-pulled model artifacts + containerized CLI execution | Guarantees reproducible builds, avoids network timeouts, scales with runner pool | Container registry storage, runner GPU allocation |
| Travel/air-gapped development | Local endpoint + expanded context packaging | Maintains productivity without internet, handles large repos offline | VRAM dependency, model update latency |

Configuration Template

#!/usr/bin/env bash
# local-ai-router.sh - Session-scoped CLI routing wrapper

LOCAL_INFERENCE_PORT="${LOCAL_INFERENCE_PORT:-12434}"
LOCAL_MODEL="${LOCAL_MODEL:-ai/phi4:14B-Q4_K_M}"
BASE_URL="http://localhost:${LOCAL_INFERENCE_PORT}"

# Validate runtime availability: any HTTP status means the listener is up.
# A bare GET to /v1/messages typically returns 404/405 rather than 200,
# so we only treat a failed connection (code 000) as unavailable.
if [ "$(curl -s -o /dev/null -w "%{http_code}" "${BASE_URL}/v1/messages")" = "000" ]; then
  echo "Error: Local inference service unavailable on port ${LOCAL_INFERENCE_PORT}" >&2
  exit 1
fi

# Execute CLI with scoped environment
env ANTHROPIC_BASE_URL="${BASE_URL}" claude --model "${LOCAL_MODEL}" "$@"

Usage:

chmod +x local-ai-router.sh
LOCAL_MODEL=gpt-oss:32k ./local-ai-router.sh

Quick Start Guide

  1. Initialize Runtime: Run docker desktop enable model-runner --tcp 12434 to bind the inference service to a local TCP interface.
  2. Retrieve Model: Execute docker model pull ai/phi4:14B-Q4_K_M to download a quantized coding-optimized artifact.
  3. Validate Endpoint: Test routing with curl http://localhost:12434/v1/messages -H "Content-Type: application/json" -d '{"model":"ai/phi4:14B-Q4_K_M","max_tokens":32,"messages":[{"role":"user","content":"test"}]}'.
  4. Launch CLI: Run ANTHROPIC_BASE_URL=http://localhost:12434 claude --model ai/phi4:14B-Q4_K_M to begin local-first development sessions.