jq enables safe JSON validation during configuration deployment.
Step 2: Dual-Backend Provider Registration
The gateway requires explicit provider definitions. Ollama exposes a native REST interface, while MLX serves an OpenAI-compatible endpoint. Both must be registered with identical cost structures to prevent billing logic errors.
{
  "inference": {
    "providers": {
      "llama_cpp_engine": {
        "protocol": "ollama-native",
        "endpoint": "http://127.0.0.1:11434",
        "credentials": "local-auth-token",
        "models": [
          {
            "identifier": "gemma4:26b-a4b-it-q8_0",
            "display_name": "Gemma 4 26B (Q8_0)",
            "context_limit": 131072,
            "modalities": ["text", "image"],
            "reasoning_capable": true,
            "pricing": { "input": 0, "output": 0, "cache_read": 0, "cache_write": 0 }
          }
        ]
      },
      "apple_native_engine": {
        "protocol": "openai-completions",
        "endpoint": "http://127.0.0.1:8080/v1",
        "credentials": "mlx-local-key",
        "models": [
          {
            "identifier": "mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
            "display_name": "Gemma 4 26B-A4B OptiQ-4bit",
            "context_limit": 131072,
            "modalities": ["text"],
            "reasoning_capable": true,
            "max_output_tokens": 4096,
            "pricing": { "input": 0, "output": 0, "cache_read": 0, "cache_write": 0 }
          }
        ]
      }
    }
  }
}
Rationale: Explicit protocol separation prevents routing ambiguity. Setting every pricing field to zero keeps the gateway's usage accounting at zero cost for local inference and avoids tripping cost-based logic such as an accidental cloud fallback. The max_output_tokens cap on the MLX model bounds the length of each response, so an extended reasoning chain cannot consume the whole context window.
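The constraints in this rationale can be checked mechanically before the file is deployed. The following is a minimal sketch, not the gateway's own validator: the function name check_providers and the specific checks are assumptions made for illustration, and the embedded JSON is a trimmed copy of the provider block above.

```python
import json

# Trimmed copy of the Step 2 provider block (fields not checked here are omitted).
CONFIG_TEXT = """
{
  "inference": {
    "providers": {
      "llama_cpp_engine": {
        "protocol": "ollama-native",
        "endpoint": "http://127.0.0.1:11434",
        "models": [
          {"identifier": "gemma4:26b-a4b-it-q8_0", "context_limit": 131072,
           "pricing": {"input": 0, "output": 0, "cache_read": 0, "cache_write": 0}}
        ]
      },
      "apple_native_engine": {
        "protocol": "openai-completions",
        "endpoint": "http://127.0.0.1:8080/v1",
        "models": [
          {"identifier": "mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
           "context_limit": 131072, "max_output_tokens": 4096,
           "pricing": {"input": 0, "output": 0, "cache_read": 0, "cache_write": 0}}
        ]
      }
    }
  }
}
"""

# Assumption: only the two protocol strings used in this guide are expected.
KNOWN_PROTOCOLS = {"ollama-native", "openai-completions"}


def check_providers(config):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    for name, provider in config["inference"]["providers"].items():
        if provider.get("protocol") not in KNOWN_PROTOCOLS:
            problems.append(f"{name}: unexpected protocol {provider.get('protocol')!r}")
        for model in provider.get("models", []):
            ident = model.get("identifier", "<missing identifier>")
            # Every pricing field should be zero for local inference.
            nonzero = [k for k, v in model.get("pricing", {}).items() if v != 0]
            if nonzero:
                problems.append(f"{name}/{ident}: non-zero pricing fields {nonzero}")
            # An output cap larger than the context window would be meaningless.
            cap = model.get("max_output_tokens")
            if cap is not None and cap > model.get("context_limit", 0):
                problems.append(f"{name}/{ident}: max_output_tokens exceeds context_limit")
    return problems


problems = check_providers(json.loads(CONFIG_TEXT))
print(problems or "no findings")
```

Returning a list of findings rather than raising on the first one makes it easier to report every problem in a single pass.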
Step 3: Agent Routing and Workspace Partitioning
Production deployments require strict privilege separation. A single configuration file can define multiple agent personas with distinct tool access levels and routing rules.
{
  "agents": {
    "default_workspace": "/opt/local-ai/workspace",
    "active_model": "apple_native_engine/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
    "model_registry": {
      "llama_cpp_engine/gemma4:26b-a4b-it-q8_0": {
        "short_name": "gemma-q8",
        "execution_params": { "enable_thinking": true }
      },
      "apple_native_engine/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit": {
        "short_name": "gemma-optiq",
        "execution_params": { "enable_thinking": true }
      }
    },
    "instances": [
      {
        "instance_id": "admin",
        "label": "Private Assistant",
        "workspace": "/opt/local-ai/workspace/admin",
        "tool_policy": "full_access"
      },
      {
        "instance_id": "guest",
        "label": "Public Interface",
        "workspace": "/opt/local-ai/workspace/guest",
        "tool_policy": {
          "blocked": ["shell_execution", "network_scrape", "system_process"],
          "allowed": ["file_read", "code_interpreter_sandbox", "knowledge_base_query"]
        }
      }
    ],
    "routing_rules": [
      {
        "target_instance": "admin",
        "match_criteria": {
          "channel": "whatsapp",
          "peer_type": "direct_message",
          "peer_identifier": "<YOUR_NUMBER_E164>"
        }
      },
      {
        "target_instance": "guest",
        "match_criteria": {
          "channel": "whatsapp"
        }
      }
    ]
  }
}
Rationale: Workspace partitioning prevents cross-contamination of conversation history and tool outputs. The guest instance explicitly blocks shell execution, network scraping, and system process access, which limits what a successful prompt injection can do. Routing rules use a fallback pattern: the specific peer match is listed first so that it takes precedence, and a catch-all rule handles all remaining traffic on the channel.
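The fallback pattern can be illustrated with a few lines of code. This is a sketch of first-match, top-down evaluation as described here, not the gateway's actual matcher: resolve_instance is a hypothetical helper, exact equality on every criterion is an assumption, and +15555550100 is a fictional number standing in for the placeholder.

```python
ADMIN_NUMBER = "+15555550100"  # fictional stand-in for <YOUR_NUMBER_E164>

ROUTING_RULES = [
    {
        "target_instance": "admin",
        "match_criteria": {
            "channel": "whatsapp",
            "peer_type": "direct_message",
            "peer_identifier": ADMIN_NUMBER,
        },
    },
    {"target_instance": "guest", "match_criteria": {"channel": "whatsapp"}},
]


def resolve_instance(rules, message):
    """Return the target of the first rule whose criteria all match, else None."""
    for rule in rules:
        criteria = rule["match_criteria"]
        if all(message.get(key) == value for key, value in criteria.items()):
            return rule["target_instance"]
    return None


owner_dm = {"channel": "whatsapp", "peer_type": "direct_message",
            "peer_identifier": ADMIN_NUMBER}
stranger_dm = {"channel": "whatsapp", "peer_type": "direct_message",
               "peer_identifier": "+15555550199"}

print(resolve_instance(ROUTING_RULES, owner_dm))     # admin
print(resolve_instance(ROUTING_RULES, stranger_dm))  # guest
```

A message on a channel that no rule mentions resolves to None in this sketch; how the real gateway treats unmatched traffic is not specified here.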
Step 4: Persistent Service Orchestration
Local inference engines require lifecycle management that survives system reboots. macOS LaunchAgents, which launchd loads when the user logs in, provide native process supervision, automatic restart on crash, and environment variable injection.
Ollama Residency Configuration:
By default, Ollama unloads a model after 5 minutes of inactivity, which introduces cold-start latency on the next request. The service definition below keeps the model resident (OLLAMA_KEEP_ALIVE) and enables flash attention with an 8-bit quantized KV cache (OLLAMA_FLASH_ATTENTION, OLLAMA_KV_CACHE_TYPE).
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.localai.ollama.residency</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/ollama</string>
    <string>serve</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>OLLAMA_FLASH_ATTENTION</key><string>1</string>
    <key>OLLAMA_KEEP_ALIVE</key><string>24h</string>
    <key>OLLAMA_KV_CACHE_TYPE</key><string>q8_0</string>
  </dict>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardOutPath</key><string>/var/log/localai/ollama.out</string>
  <key>StandardErrorPath</key><string>/var/log/localai/ollama.err</string>
</dict>
</plist>
MLX Server Persistence:
MLX's server component loads the model at process initialization and retains it in memory. The LaunchAgent acts as both the startup trigger and the residency enforcer.
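#!/bin/bash
# /opt/local-ai/scripts/mlx-server.sh
# Wrapper that launchd runs to start the MLX server inside its virtual environment.

MODEL_REF="mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit"
VENV_PATH="/opt/local-ai/mlx-runtime"
PORT=8080

# Activate the venv so the server and any child processes inherit its environment.
source "${VENV_PATH}/bin/activate"

# exec replaces this shell, so launchd supervises the server process directly.
exec "${VENV_PATH}/bin/mlx_lm.server" \
  --model "${MODEL_REF}" \
  --port "${PORT}" \
  --host 127.0.0.1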
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.localai.mlx.server</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/local-ai/scripts/mlx-server.sh</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>PATH</key><string>/opt/local-ai/mlx-runtime/bin:/opt/homebrew/bin:/usr/bin:/bin</string>
  </dict>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardOutPath</key><string>/var/log/localai/mlx.out</string>
  <key>StandardErrorPath</key><string>/var/log/localai/mlx.err</string>
</dict>
</plist>
Rationale: Wrapping the MLX server in a shell script lets the service activate the virtual environment before launch. Because the exec builtin replaces the shell process, launchd supervises the server's PID rather than the wrapper's. Both services log under /var/log/localai for unified monitoring; that directory must already exist and be writable by the logged-in user, because launchd does not create missing parent directories. Both services set KeepAlive so that launchd restarts them after a crash or any other unexpected exit.
Pitfall Guide
1. Concurrent Model Residency Throttling
Explanation: Loading both Ollama and MLX models simultaneously saturates the unified memory controller. Apple Silicon shares memory bandwidth across CPU, GPU, and Neural Engine. Contention forces both backends to serialize memory accesses, dropping throughput by 40-50%.
Fix: Enforce single-backend residency. Use launchctl unload on the inactive provider's service before starting the active one. Implement a routing script that checks launchctl list | grep com.localai before switching providers.
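The check described in this fix could be scripted along the following lines. It is a sketch under stated assumptions: it relies on the tab-separated PID, Status, Label columns that launchctl list prints, the helper names are hypothetical, and the sample text is hand-written rather than captured from a real machine.

```python
import subprocess

PREFIX = "com.localai."


def parse_launchctl_list(text, prefix=PREFIX):
    """Map label -> PID (int), or None when the job is loaded but not running."""
    jobs = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3 or not parts[2].startswith(prefix):
            continue  # skips the header row and unrelated jobs
        pid, _status, label = parts
        jobs[label] = int(pid) if pid.isdigit() else None
    return jobs


def safe_to_start(target_label, jobs):
    """True when no other com.localai job currently has a running process."""
    return all(pid is None for label, pid in jobs.items() if label != target_label)


def current_jobs():
    """Query launchd; only meaningful on a macOS machine with launchctl."""
    result = subprocess.run(["launchctl", "list"],
                            capture_output=True, text=True, check=True)
    return parse_launchctl_list(result.stdout)


# Hand-written sample in the same column layout, so the sketch runs anywhere.
SAMPLE = (
    "PID\tStatus\tLabel\n"
    "812\t0\tcom.localai.mlx.server\n"
    "-\t0\tcom.localai.ollama.residency\n"
    "-\t0\tcom.apple.example\n"
)

jobs = parse_launchctl_list(SAMPLE)
print(safe_to_start("com.localai.ollama.residency", jobs))  # False: MLX is running
print(safe_to_start("com.localai.mlx.server", jobs))        # True
```

On a real machine, current_jobs() would replace the sample text, and a False result would be the cue to run launchctl unload on the other provider first.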
2. Ollama Lazy Unload Cold Starts
Explanation: Ollama's default OLLAMA_KEEP_ALIVE=5m unloads the model from memory after inactivity (on Apple Silicon the weights live in unified memory rather than in separate VRAM). Subsequent requests trigger a 3-8 second model reload, breaking conversational flow.
Fix: Set OLLAMA_KEEP_ALIVE=24h in the service environment. Deploy a warm-up script that sends a minimal prompt ("ping") immediately after service startup to force GPU residency before user traffic arrives.
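A warm-up script of this kind might look like the sketch below. It posts a one-word prompt to Ollama's /api/generate endpoint with stream disabled and a keep_alive value; the helper names, the 120-second timeout, and the choice to pass keep_alive per request in addition to the environment variable are assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"
MODEL = "gemma4:26b-a4b-it-q8_0"


def build_warmup_request(model, base_url=OLLAMA_URL, keep_alive="24h"):
    """Build (but do not send) a minimal generate request that loads the model."""
    payload = {
        "model": model,
        "prompt": "ping",
        "stream": False,           # one JSON object instead of a chunk stream
        "keep_alive": keep_alive,  # per-request residency hint
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(model, timeout=120):
    """Send the request; returns True once Ollama reports the generation done."""
    request = build_warmup_request(model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response).get("done", False))


request = build_warmup_request(MODEL)
print(request.get_method(), request.full_url)
```

Calling warm_up(MODEL) once after the service starts is enough; the generous timeout allows for the initial model load.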
3. LaunchAgent Path Resolution Failures
Explanation: macOS LaunchAgents execute with a minimal environment. Scripts that rely on source venv/bin/activate or Homebrew paths will fail silently, causing the service to crash on boot.
Fix: Explicitly define the PATH environment variable in the plist. Use absolute paths for all binaries. Avoid shell wrappers that depend on interactive profile loading. After loading, validate each service with launchctl print gui/$(id -u)/<label>, for example launchctl print gui/$(id -u)/com.localai.mlx.server.
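These path rules can be linted before a plist is loaded. The sketch below parses a service definition with Python's plistlib and flags relative paths and a missing PATH; lint_agent is a hypothetical helper, and treating a missing PATH as a finding is an assumption that suits wrapper-script services more than services that launch an absolute binary directly.

```python
import plistlib

MLX_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.localai.mlx.server</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/local-ai/scripts/mlx-server.sh</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>PATH</key><string>/opt/local-ai/mlx-runtime/bin:/opt/homebrew/bin:/usr/bin:/bin</string>
  </dict>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardOutPath</key><string>/var/log/localai/mlx.out</string>
  <key>StandardErrorPath</key><string>/var/log/localai/mlx.err</string>
</dict>
</plist>
"""


def lint_agent(plist_bytes):
    """Return (label, findings) for one LaunchAgent definition."""
    job = plistlib.loads(plist_bytes)
    findings = []
    args = job.get("ProgramArguments", [])
    if not args or not args[0].startswith("/"):
        findings.append("ProgramArguments[0] should be an absolute path")
    path = job.get("EnvironmentVariables", {}).get("PATH")
    if path is None:
        findings.append("PATH is not set in EnvironmentVariables")
    elif any(not entry.startswith("/") for entry in path.split(":")):
        findings.append("PATH contains a relative or empty entry")
    for key in ("StandardOutPath", "StandardErrorPath"):
        if not str(job.get(key, "")).startswith("/"):
            findings.append(f"{key} should be an absolute path")
    return job.get("Label"), findings


label, findings = lint_agent(MLX_PLIST)
print(label, findings or "no findings")
```

A lint pass like this catches path mistakes at edit time; launchctl print remains the check for what launchd actually loaded.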
4. Quantization Mismatch on MoE Architectures
Explanation: Applying uniform 4-bit quantization to mixture-of-experts models degrades routing accuracy. The gating network requires higher precision to correctly activate expert layers.
Fix: Use OptiQ-4bit or equivalent mixed-precision quantization. Verify the quantization metadata includes router_bits=8 and expert_bits=4. Avoid naive GGUF conversion without MoE-aware quantization flags.
5. Routing Rule Overlap and Catch-All Traps
Explanation: Defining a catch-all routing rule without explicit precedence causes all traffic to hit the public agent, including admin DMs. The gateway evaluates rules top-down; order matters.
Fix: Place specific peer matches before generic channel matches. Test routing with openclaw route simulate --peer <YOUR_NUMBER_E164> --channel whatsapp before production deployment. Add explicit deny rules for unauthorized peers.
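Rule-order mistakes can also be caught statically. The sketch below flags any rule that can never fire because an earlier rule's criteria are a subset of its own; it assumes the top-down, first-match evaluation described above with exact-equality criteria, and shadowed_rules is a hypothetical helper rather than part of the gateway. The peer identifier is a fictional number.

```python
def shadowed_rules(rules):
    """Return (earlier_index, later_index) pairs where the later rule is unreachable."""
    pairs = []
    for later_index, later_rule in enumerate(rules):
        later = later_rule["match_criteria"]
        for earlier_index in range(later_index):
            earlier = rules[earlier_index]["match_criteria"]
            # If every earlier criterion also appears in the later rule, any message
            # matching the later rule is already captured by the earlier one.
            if all(later.get(key) == value for key, value in earlier.items()):
                pairs.append((earlier_index, later_index))
                break
    return pairs


specific = {
    "target_instance": "admin",
    "match_criteria": {
        "channel": "whatsapp",
        "peer_type": "direct_message",
        "peer_identifier": "+15555550100",  # fictional number
    },
}
catch_all = {"target_instance": "guest", "match_criteria": {"channel": "whatsapp"}}

print(shadowed_rules([specific, catch_all]))  # []
print(shadowed_rules([catch_all, specific]))  # [(0, 1)]
```

Running a check like this in CI turns the catch-all trap into a build failure instead of a production surprise.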
6. Voice Pipeline Queue Backlog
Explanation: STT and TTS services process audio sequentially. High-frequency voice messages create a processing queue that blocks text-based tool execution, causing timeout errors.
Fix: Implement async audio processing with a dedicated worker pool. Set max_concurrent_audio=2 in the gateway config. Use streaming TTS to return partial responses while audio generates. Monitor queue depth with curl http://127.0.0.1:17494/health.
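The worker-pool idea can be sketched with asyncio: a semaphore caps the number of audio jobs in flight at two, while text jobs bypass the limiter and are never queued behind audio. Everything here is illustrative; the job functions use short sleeps as stand-ins for real STT/TTS calls, and none of the names come from the gateway.

```python
import asyncio

MAX_CONCURRENT_AUDIO = 2  # mirrors the max_concurrent_audio=2 setting


async def process_audio(job_id, limiter, stats):
    """Simulated STT/TTS job; at most MAX_CONCURRENT_AUDIO run at once."""
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for real audio work
        stats["active"] -= 1
    return f"audio-{job_id}-done"


async def handle_text(job_id):
    """Text jobs do not touch the limiter, so audio backlog cannot block them."""
    await asyncio.sleep(0)
    return f"text-{job_id}-done"


async def main():
    limiter = asyncio.Semaphore(MAX_CONCURRENT_AUDIO)
    stats = {"active": 0, "peak": 0}
    audio_jobs = [process_audio(i, limiter, stats) for i in range(6)]
    text_jobs = [handle_text(i) for i in range(3)]
    results = await asyncio.gather(*audio_jobs, *text_jobs)
    return results, stats["peak"]


results, peak = asyncio.run(main())
print(peak)  # 2
```

The semaphore bounds concurrency without a dedicated queue; a production pipeline would likely add timeouts and a queue-depth metric on top.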
7. Silent Configuration Validation Errors
Explanation: OpenClaw fails to start if provider endpoints are unreachable or model identifiers contain typos. The error logs often show generic startup failures without pinpointing the invalid configuration key.
Fix: Run openclaw config validate before restarting the gateway. Use JSON schema validation in CI/CD pipelines. Implement a pre-start health check that pings both provider endpoints and verifies model availability.
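A pre-start health check along these lines could be a short script. The sketch queries Ollama's /api/tags and the OpenAI-style /v1/models listing, then confirms that each expected model identifier appears. The function names are hypothetical, and the assumption that the MLX server lists the model under its full repository id should be confirmed against a live response.

```python
import json
import urllib.request

EXPECTED = {
    "ollama": "gemma4:26b-a4b-it-q8_0",
    "mlx": "mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
}


def fetch_json(url, timeout=5):
    """GET a URL and decode its JSON body; raises URLError if unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def ollama_model_names(payload):
    """Model names from an Ollama /api/tags response."""
    return {model["name"] for model in payload.get("models", [])}


def openai_model_ids(payload):
    """Model ids from an OpenAI-style /v1/models response."""
    return {model["id"] for model in payload.get("data", [])}


def preflight(fetch=fetch_json):
    """Return a list of problems; an empty list means both providers look ready."""
    checks = [
        ("ollama", "http://127.0.0.1:11434/api/tags", ollama_model_names),
        ("mlx", "http://127.0.0.1:8080/v1/models", openai_model_ids),
    ]
    problems = []
    for name, url, extract in checks:
        try:
            available = extract(fetch(url))
        except OSError as exc:  # URLError and socket timeouts subclass OSError
            problems.append(f"{name}: endpoint unreachable ({exc})")
            continue
        if EXPECTED[name] not in available:
            problems.append(f"{name}: model {EXPECTED[name]!r} not listed")
    return problems
```

A wrapper can call preflight() and exit non-zero when the list is non-empty, which surfaces the failing provider by name instead of a generic startup error. The fetch parameter exists so the logic can be exercised without a running server.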
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-frequency interactive chat | MLX OptiQ-4bit (isolated) | Highest throughput (~73 tok/s), lowest latency, efficient memory footprint | $0 (local compute only) |
| Multi-modal input (images + text) | Ollama Q8_0 (isolated) | Native image processing support, broader model compatibility | $0 (local compute only) |
| Memory-constrained deployment (<32GB) | MLX OptiQ-4bit | ~17 GB resident vs ~33 GB for Q8_0, leaves headroom for context/tools | $0 (avoids cloud fallback) |
| Production monitoring required | Ollama + structured logging | Native eval metrics, predictable unload behavior, easier health checks | $0 (local observability) |
| Rapid provider switching | OpenClaw config swap | One-line primary model change, zero downtime routing, unified gateway | $0 (configuration only) |
Configuration Template
{
  "inference": {
    "providers": {
      "llama_cpp_engine": {
        "protocol": "ollama-native",
        "endpoint": "http://127.0.0.1:11434",
        "credentials": "local-auth-token",
        "models": [
          {
            "identifier": "gemma4:26b-a4b-it-q8_0",
            "display_name": "Gemma 4 26B (Q8_0)",
            "context_limit": 131072,
            "modalities": ["text", "image"],
            "reasoning_capable": true,
            "pricing": { "input": 0, "output": 0, "cache_read": 0, "cache_write": 0 }
          }
        ]
      },
      "apple_native_engine": {
        "protocol": "openai-completions",
        "endpoint": "http://127.0.0.1:8080/v1",
        "credentials": "mlx-local-key",
        "models": [
          {
            "identifier": "mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
            "display_name": "Gemma 4 26B-A4B OptiQ-4bit",
            "context_limit": 131072,
            "modalities": ["text"],
            "reasoning_capable": true,
            "max_output_tokens": 4096,
            "pricing": { "input": 0, "output": 0, "cache_read": 0, "cache_write": 0 }
          }
        ]
      }
    }
  },
  "agents": {
    "default_workspace": "/opt/local-ai/workspace",
    "active_model": "apple_native_engine/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit",
    "model_registry": {
      "llama_cpp_engine/gemma4:26b-a4b-it-q8_0": {
        "short_name": "gemma-q8",
        "execution_params": { "enable_thinking": true }
      },
      "apple_native_engine/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit": {
        "short_name": "gemma-optiq",
        "execution_params": { "enable_thinking": true }
      }
    },
    "instances": [
      {
        "instance_id": "admin",
        "label": "Private Assistant",
        "workspace": "/opt/local-ai/workspace/admin",
        "tool_policy": "full_access"
      },
      {
        "instance_id": "guest",
        "label": "Public Interface",
        "workspace": "/opt/local-ai/workspace/guest",
        "tool_policy": {
          "blocked": ["shell_execution", "network_scrape", "system_process"],
          "allowed": ["file_read", "code_interpreter_sandbox", "knowledge_base_query"]
        }
      }
    ],
    "routing_rules": [
      {
        "target_instance": "admin",
        "match_criteria": {
          "channel": "whatsapp",
          "peer_type": "direct_message",
          "peer_identifier": "<YOUR_NUMBER_E164>"
        }
      },
      {
        "target_instance": "guest",
        "match_criteria": {
          "channel": "whatsapp"
        }
      }
    ]
  }
}
Quick Start Guide
- Initialize isolated runtime: Create the Python virtual environment at /opt/local-ai/mlx-runtime, install mlx-lm and hf-transfer, and verify the installation with mlx_lm.server --help.
- Deploy service definitions: Save the Ollama and MLX plist files to ~/Library/LaunchAgents/, set correct permissions (chmod 644), and load both services using launchctl load.
- Validate provider endpoints: Confirm Ollama responds at http://127.0.0.1:11434/api/tags and MLX serves models at http://127.0.0.1:8080/v1/models. Run the warm-up script for Ollama if cold starts are detected.
- Load agent configuration: Place the JSON template at ~/.openclaw/openclaw.json, replace <YOUR_NUMBER_E164> with your WhatsApp identifier, and restart the gateway with openclaw gateway restart. Verify routing with a test message.
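The placeholder substitution in the last step is easy to get wrong by hand, so a small helper may be worth having. The sketch below validates the number against the E.164 shape (a plus sign followed by up to 15 digits, the first non-zero) before substituting it and re-parsing the JSON. fill_number is a hypothetical helper, the template is a trimmed copy of the routing rules, the number is fictional, and whether the gateway expects the leading plus sign is an assumption to verify.

```python
import json
import re

PLACEHOLDER = "<YOUR_NUMBER_E164>"
E164 = re.compile(r"\+[1-9]\d{1,14}")

# Trimmed copy of the routing rules from the configuration template.
TEMPLATE = """
{"agents": {"routing_rules": [
  {"target_instance": "admin",
   "match_criteria": {"channel": "whatsapp", "peer_type": "direct_message",
                      "peer_identifier": "<YOUR_NUMBER_E164>"}},
  {"target_instance": "guest", "match_criteria": {"channel": "whatsapp"}}
]}}
"""


def fill_number(template_text, number):
    """Substitute the placeholder and return the parsed configuration."""
    if not E164.fullmatch(number):
        raise ValueError(f"not an E.164 number: {number!r}")
    if PLACEHOLDER not in template_text:
        raise ValueError("template has no placeholder left to replace")
    # json.loads doubles as a syntax check on the substituted text.
    return json.loads(template_text.replace(PLACEHOLDER, number))


config = fill_number(TEMPLATE, "+15555550100")  # fictional number
admin_rule = config["agents"]["routing_rules"][0]
print(admin_rule["match_criteria"]["peer_identifier"])
```

Writing the result back with json.dump and then restarting the gateway would complete the step.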