meters per token**, reading ~1.9 GB/token from memory. This yields ~52 tok/s on M4 Pro hardware, significantly outpacing the 31B Dense model (~10 tok/s) which loads all 31.2B parameters per inference step.
Core Solution
1. Hardware Selection & Architecture Mapping
| Model | Min RAM | Recommended | Best For |
|---|
| π’ E4B (Edge) | 4 GB | 8 GB | Raspberry Pi, Jetson Nano |
| π΅ 26B MoE β | 16 GB (Q4) | 24 GB | M4 MacBook Pro, RTX 4070 |
| π£ 31B Dense | 32 GB (Q4) | 48 GB+ | M4 Max, RTX 4090, GB10 |
Sweet Spot: 26B MoE on 24 GB hardware. MoE routing activates only 3.8B parameters per token, delivering high throughput without saturating memory bandwidth.
2. Model Deployment (Ollama vs llama.cpp)
Option A: Ollama (Streamlined)
# Install Ollama (macOS, Linux, Windows)
curl -fsSL https://ollama.com/install.sh | sh
# Pull the model β this downloads ~16 GB for the 26B MoE
ollama pull gemma4:26b
# Or the smaller edge model if you're on limited hardware
ollama pull gemma4:4b
# Verify it works π
ollama run gemma4:26b "Write a Python function to merge two sorted lists"
Option B: llama.cpp (Granular Control)
# Install via Homebrew (macOS)
brew install llama.cpp
# Or build from source for GPU support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # NVIDIA
# or: cmake -B build -DGGML_METAL=ON # Apple Silicon
cmake --build build --config Release -j
Download quantized weights:
# 26B MoE Q4 β best balance of quality and speed
huggingface-cli download gg-hf-gg/gemma-4-26B-A4B-it-GGUF \
gemma-4-26B-A4B-it-Q4_K_M.gguf \
--local-dir ./models/
Launch optimized server:
llama-server \
-m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
--port 1234 \
-ngl 99 \
-c 32768 \
-np 1 \
--jinja \
-ctk q8_0 \
-ctv q8_0
Flag Architecture Mapping:
-ngl 99: Full GPU offload
-c 32768: 32K context window
-np 1: Single inference slot (prevents KV cache multiplication)
--jinja: Enables Gemma 4's native tool-calling template
-ctk q8_0 -ctv q8_0: KV cache quantization (~940 MB β ~499 MB)
3. IDE Integration
Continue.dev (VS Code / JetBrains)
{
"models": [
{
"title": "Gemma 4 26B (Local)",
"provider": "ollama",
"model": "gemma4:26b",
"contextLength": 32768
}
],
"tabAutocompleteModel": {
"title": "Gemma 4 E4B (Autocomplete)",
"provider": "ollama",
"model": "gemma4:4b"
}
}
{
"models": [
{
"title": "Gemma 4 26B (llama.cpp)",
"provider": "openai",
"model": "gemma-4-26b",
"apiBase": "http://localhost:1234/v1",
"contextLength": 32768
}
]
}
Codex CLI (Terminal)
# Install Codex CLI
npm install -g @openai/codex
# Run with local model
codex --oss -m gemma4:26b
# Or with llama.cpp backend
codex --oss -m http://localhost:1234/v1
config.toml:
[model]
wire_api = "responses"
web_search = "disabled" # llama.cpp rejects this tool type
4. Hardware-Tuned Configuration
16 GB (Budget/MacBook Air):
ollama pull gemma4:4b
# Or aggressive quantization for 26B
ollama pull gemma4:26b-q3_K_M
Set contextLength: 8192 in IDE config.
24 GB (Sweet Spot):
ollama pull gemma4:26b
# Or llama.cpp with optimized KV cache
llama-server -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
--port 1234 -ngl 99 -c 32768 -np 1 --jinja \
-ctk q8_0 -ctv q8_0
48 GB+ (Workstation):
ollama pull gemma4:31b
# Or with full context
llama-server -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
--port 1234 -ngl 99 -c 65536 -np 1 --jinja
5. Prompt Engineering for Local Agentic Workflows
You are a coding assistant running locally. You have access to these tools:
- Read: Read a file from the filesystem
- Write: Write content to a file
- Execute: Run a shell command
Rules:
1. Read the existing code before making changes.
2. Write tests for any new function you create.
3. Run the tests and fix failures.
4. Keep changes minimal β don't refactor unrelated code.
5. If you're unsure, explain your reasoning before acting.
Operational Guidelines:
- Specify absolute/relative file paths (
src/utils/parser.ts vs "the parser file")
- Decompose features into sequential tasks (function β tests β execution)
- Leverage native JSON output for structured tool responses
Pitfall Guide
- Flash Attention Hang on Apple Silicon: Ollama's default backend triggers a known Flash Attention bug with Gemma 4 on M-series chips, causing silent hangs on long prompts. Fix: Switch to
llama.cpp or upgrade to Ollama v0.20.6+.
- Tool-Call Routing Bug in Ollama: Versions β€0.20.3 misroute Gemma 4's tool-call responses to the reasoning output stream instead of the
tool_calls field, breaking agentic loops. Fix: Update to Ollama v0.20.5+ or use llama.cpp.
- Vision Projector OOM via
-hf Flag: Using llama.cpp's -hf auto-download flag silently fetches a 1.1 GB vision projector module. On 24 GB systems, this triggers immediate OOM crashes. Fix: Always download GGUF weights manually via huggingface-cli and omit -hf.
- KV Cache Quantization Omission: Skipping
-ctk q8_0 -ctv q8_0 leaves the KV cache in FP16, consuming ~940 MB per slot and drastically reducing available context window. Fix: Always apply KV cache quantization flags for memory-constrained deployments.
- Context Length Mismatch: IDE extensions default to 4K/8K context while the server runs 32K/64K, causing silent truncation or tokenization errors. Fix: Explicitly align
contextLength in IDE config.json with the server's -c flag.
Deliverables
π¦ Offline AI Coding Assistant Blueprint
- Architecture decision matrix (MoE vs Dense, Ollama vs llama.cpp)
- Memory bandwidth optimization checklist
- Agentic tool-calling validation workflow
β
Pre-Flight Verification Checklist
βοΈ Configuration Templates
continue-config.json (Dual-model routing: 4B autocomplete + 26B agentic)
codex-config.toml (Local API routing & tool restrictions)
llama-server-flags.sh (Hardware-tuned launch scripts for 16/24/48 GB tiers)
system-prompt-agentic.txt (Production-ready tool-calling constraints)
Deploy locally. Zero API bills. Full code sovereignty.