Building a Fully Offline AI Coding Assistant with Gemma 4 - No Cloud Required 🚀
Current Situation Analysis
Traditional cloud-based AI coding assistants introduce three critical failure modes for professional development workflows:
- Cost Escalation: API billing scales linearly with usage. Multi-session daily coding, agentic tool-calling, and iterative refactoring quickly accumulate unsustainable costs.
- Privacy & Compliance Risks: Proprietary algorithms, client codebases, and internal tooling cannot safely traverse third-party servers due to data residency, IP leakage, and audit requirements.
- Operational Fragility: Cloud APIs suffer from rate limiting, regional outages, and unpredictable pricing/model deprecations. Local deployments historically failed due to poor function-calling capabilities (pre-Gemma 4 models scored ~6.6% on agentic benchmarks) and inefficient memory management, rendering them unsuitable for production coding assistance.
The transition to local AI requires overcoming architectural inefficiencies: naive quantization breaks tool-calling templates, unoptimized KV caches cause OOM crashes, and dense model deployments saturate memory bandwidth. Gemma 4's 86.4% function-calling benchmark score and Mixture-of-Experts (MoE) architecture finally bridge the gap between local feasibility and agentic reliability.
WOW Moment: Key Findings
Experimental validation across hardware tiers reveals a clear performance-cost-accuracy tradeoff. The 26B MoE variant emerges as the optimal deployment target for mainstream developer hardware, while the 31B Dense model approaches cloud-tier quality on high-end workstations.
| Approach | Quality Score | Execution Time | Tool Calls | Key Finding |
|---|---|---|---|---|
| ☁️ GPT-5.4 (Cloud) | ★★★★★ | 65s | 3 | Type hints, exception chaining, clean architecture |
| 🖥️ 31B Dense (48 GB) | ★★★★★ | 7 min | 3 | Functional, solid, minimal cleanup required |
| ⚡ 26B MoE (24 GB) | ★★★☆☆ | 4 min | 10 | Fast & functional; requires oversight for dead code/retries |
| 📱 E4B Edge (8 GB) | ★★☆☆☆ | 2 min | 15+ | Autocomplete-only; struggles with multi-file agentic tasks |
Speed Architecture Insight: Despite its "26B" label, the MoE variant activates only 3.8B parameters per token, reading ~1.9 GB/token from memory. This yields ~52 tok/s on M4 Pro hardware, significantly outpacing the 31B Dense model (~10 tok/s) which loads all 31.2B parameters per inference step.
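A back-of-the-envelope bound makes the gap concrete: decode speed on a bandwidth-limited machine is capped at memory bandwidth divided by bytes read per token. The sketch below uses Apple's published ~273 GB/s figure for the M4 Pro (an assumption of this sketch, not a number from the benchmark above); real throughput lands below the ceiling because of compute, cache, and sampling overhead.

```python
# Rough upper bound on decode throughput for a memory-bandwidth-bound model.
# Bandwidth is an assumption for this sketch; bytes/token values come from the text above.

def max_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Decode speed is capped by how fast the active weights can stream from memory."""
    return bandwidth_gb_s / gb_read_per_token

M4_PRO_BANDWIDTH_GB_S = 273.0              # Apple's published M4 Pro spec (assumption)
MOE_GB_PER_TOKEN = 1.9                     # 3.8B active params at ~Q4 (from the article)
DENSE_GB_PER_TOKEN = 1.9 * (31.2 / 3.8)    # all 31.2B params touched every step

print(f"26B MoE ceiling:   {max_tokens_per_second(M4_PRO_BANDWIDTH_GB_S, MOE_GB_PER_TOKEN):.0f} tok/s")
print(f"31B Dense ceiling: {max_tokens_per_second(M4_PRO_BANDWIDTH_GB_S, DENSE_GB_PER_TOKEN):.0f} tok/s")
# Measured speeds (~52 vs ~10 tok/s) sit below these ceilings, but the MoE/Dense
# gap follows directly from the difference in bytes read per token.
```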
Core Solution
1. Hardware Selection & Architecture Mapping
| Model | Min RAM | Recommended | Best For |
|---|---|---|---|
| 🟢 E4B (Edge) | 4 GB | 8 GB | Raspberry Pi, Jetson Nano |
| 🔵 26B MoE ⭐ | 16 GB (Q4) | 24 GB | M4 MacBook Pro, RTX 4070 |
| 🟣 31B Dense | 32 GB (Q4) | 48 GB+ | M4 Max, RTX 4090, GB10 |
Sweet Spot: 26B MoE on 24 GB hardware. MoE routing activates only 3.8B parameters per token, delivering high throughput without saturating memory bandwidth.
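Before committing to a tier, a rough memory budget helps confirm the table above: quantized weights plus KV cache plus a few gigabytes of OS and application headroom must fit in unified memory or VRAM. A minimal sketch, where the ~16 GB Q4 weight size and ~0.5 GB quantized KV cache come from this guide and the 4 GB overhead is an assumption:

```python
# Back-of-envelope memory budget for the 26B MoE; the overhead figure is an assumption.

def fits(ram_gb: float, weights_gb: float, kv_cache_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if weights + KV cache + OS/app headroom fit in unified memory or VRAM."""
    return weights_gb + kv_cache_gb + overhead_gb <= ram_gb

WEIGHTS_Q4_GB = 16.0    # 26B MoE, Q4_K_M GGUF (download size quoted in this guide)
KV_CACHE_Q8_GB = 0.5    # 32K context with -ctk/-ctv q8_0 (~499 MB, see section 2)

for ram in (16, 24, 48):
    print(f"{ram} GB machine -> fits: {fits(ram, WEIGHTS_Q4_GB, KV_CACHE_Q8_GB)}")
# 16 GB is tight under this overhead assumption, which is why the 16 GB tier
# later drops to a Q3 quant or the E4B edge model; 24 GB has room to spare.
```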
2. Model Deployment (Ollama vs llama.cpp)
Option A: Ollama (Streamlined)
# Install Ollama (macOS, Linux, Windows)
curl -fsSL https://ollama.com/install.sh | sh
# Pull the model - this downloads ~16 GB for the 26B MoE
ollama pull gemma4:26b
# Or the smaller edge model if you're on limited hardware
ollama pull gemma4:4b
# Verify it works
ollama run gemma4:26b "Write a Python function to merge two sorted lists"
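Ollama also exposes an OpenAI-compatible HTTP API on port 11434, which is what the IDE integrations below ultimately talk to. A minimal smoke test from Python (assuming the `requests` package is installed and the model tag from the pull command above):

```python
# Minimal smoke test against Ollama's OpenAI-compatible endpoint.
# Assumes Ollama is running locally and gemma4:26b has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "gemma4:26b",
        "messages": [
            {"role": "user", "content": "Write a Python function to merge two sorted lists."}
        ],
        "temperature": 0.2,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```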
Option B: llama.cpp (Granular Control)
# Install via Homebrew (macOS)
brew install llama.cpp
# Or build from source for GPU support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON      # NVIDIA
# or: cmake -B build -DGGML_METAL=ON   # Apple Silicon
cmake --build build --config Release -j
Download quantized weights:
```bash
# 26B MoE Q4 - best balance of quality and speed
huggingface-cli download gg-hf-gg/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --local-dir ./models/
```
Launch optimized server:
llama-server \
-m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
--port 1234 \
-ngl 99 \
-c 32768 \
-np 1 \
--jinja \
-ctk q8_0 \
-ctv q8_0
Flag Architecture Mapping:
- `-ngl 99`: Full GPU offload
- `-c 32768`: 32K context window
- `-np 1`: Single inference slot (prevents KV cache multiplication)
- `--jinja`: Enables Gemma 4's native tool-calling template
- `-ctk q8_0 -ctv q8_0`: KV cache quantization (~940 MB → ~499 MB)
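The ~940 MB → ~499 MB claim follows from the standard KV-cache formula: two tensors (K and V) × layers × KV heads × head dimension × context length × bytes per element. Gemma 4's exact layer and head counts aren't listed here, so the sketch below uses placeholder values chosen to land on the quoted FP16 figure; q8_0 stores roughly 8.5 bits per element, which is where the near-halving comes from.

```python
# Estimate the per-slot KV-cache footprint. Layer/head counts below are placeholders
# chosen to reproduce the ~940 MB FP16 figure quoted above, not real Gemma 4 values.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: float) -> float:
    """Two tensors (K and V), each of shape [n_layers, n_kv_heads, ctx_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

LAYERS, KV_HEADS, HEAD_DIM, CTX = 47, 2, 80, 32768        # placeholder architecture

fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 2.0)      # default cache dtype
q8   = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 1.0625)   # q8_0: ~8.5 bits/element

print(f"FP16 KV cache: {fp16 / 2**20:.0f} MiB")   # ~940 MiB
print(f"q8_0 KV cache: {q8 / 2**20:.0f} MiB")     # ~499 MiB
```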
3. IDE Integration
Continue.dev (VS Code / JetBrains)
{
"models": [
{
"title": "Gemma 4 26B (Local)",
"provider": "ollama",
"model": "gemma4:26b",
"contextLength": 32768
}
],
"tabAutocompleteModel": {
"title": "Gemma 4 E4B (Autocomplete)",
"provider": "ollama",
"model": "gemma4:4b"
}
}
For a llama.cpp backend, point Continue at the server's OpenAI-compatible endpoint instead:
{
"models": [
{
"title": "Gemma 4 26B (llama.cpp)",
"provider": "openai",
"model": "gemma-4-26b",
"apiBase": "http://localhost:1234/v1",
"contextLength": 32768
}
]
}
Codex CLI (Terminal)
# Install Codex CLI
npm install -g @openai/codex
# Run with local model
codex --oss -m gemma4:26b
# Or with llama.cpp backend
codex --oss -m http://localhost:1234/v1
config.toml:
[model]
wire_api = "responses"
web_search = "disabled" # llama.cpp rejects this tool type
4. Hardware-Tuned Configuration
16 GB (Budget/MacBook Air):
ollama pull gemma4:4b
# Or aggressive quantization for 26B
ollama pull gemma4:26b-q3_K_M
Set contextLength: 8192 in IDE config.
24 GB (Sweet Spot):
ollama pull gemma4:26b
# Or llama.cpp with optimized KV cache
llama-server -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
--port 1234 -ngl 99 -c 32768 -np 1 --jinja \
-ctk q8_0 -ctv q8_0
48 GB+ (Workstation):
ollama pull gemma4:31b
# Or with full context
llama-server -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
--port 1234 -ngl 99 -c 65536 -np 1 --jinja
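If you script these launches (e.g. the `llama-server-flags.sh` deliverable listed at the end), tier selection can be derived from detected memory. A small sketch, assuming `psutil` is installed; the thresholds and flags restate the tiers above, and the GGUF filenames for the non-26B tiers are assumptions:

```python
# Pick a llama-server profile from detected RAM; thresholds mirror the tiers above.
import psutil

ram_gb = psutil.virtual_memory().total / 2**30

if ram_gb >= 48:
    model, flags = "gemma-4-31B-it-Q4_K_M.gguf", "-c 65536 -ngl 99 -np 1 --jinja"
elif ram_gb >= 24:
    model, flags = ("gemma-4-26B-A4B-it-Q4_K_M.gguf",
                    "-c 32768 -ngl 99 -np 1 --jinja -ctk q8_0 -ctv q8_0")
else:
    # Assumed Q3 filename for the budget tier; or fall back to the E4B model via Ollama.
    model, flags = ("gemma-4-26B-A4B-it-Q3_K_M.gguf",
                    "-c 8192 -ngl 99 -np 1 --jinja -ctk q8_0 -ctv q8_0")

print(f"llama-server -m ./models/{model} --port 1234 {flags}")
```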
5. Prompt Engineering for Local Agentic Workflows
You are a coding assistant running locally. You have access to these tools:
- Read: Read a file from the filesystem
- Write: Write content to a file
- Execute: Run a shell command
Rules:
1. Read the existing code before making changes.
2. Write tests for any new function you create.
3. Run the tests and fix failures.
4. Keep changes minimal; don't refactor unrelated code.
5. If you're unsure, explain your reasoning before acting.
Operational Guidelines:
- Specify absolute/relative file paths (`src/utils/parser.ts` vs "the parser file")
- Decompose features into sequential tasks (function → tests → execution)
- Leverage native JSON output for structured tool responses
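Putting the prompt, the tools, and the local server together, here is a minimal sketch of the agentic loop this setup assumes. The Read/Write/Execute tool schemas mirror the system prompt above; the endpoint, model name, file paths, and example task are illustrative, and a real harness would sandbox the Execute tool.

```python
# Minimal agentic loop against the local OpenAI-compatible endpoint.
# Tool names mirror the system prompt above; this is a sketch, not a hardened harness.
import json
import pathlib
import subprocess
import requests

API = "http://localhost:1234/v1/chat/completions"

TOOLS = [
    {"type": "function", "function": {
        "name": "Read", "description": "Read a file from the filesystem",
        "parameters": {"type": "object", "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "Write", "description": "Write content to a file",
        "parameters": {"type": "object", "properties": {"path": {"type": "string"},
                                                        "content": {"type": "string"}},
                       "required": ["path", "content"]}}},
    {"type": "function", "function": {
        "name": "Execute", "description": "Run a shell command",
        "parameters": {"type": "object", "properties": {"command": {"type": "string"}},
                       "required": ["command"]}}},
]

def run_tool(name: str, args: dict) -> str:
    """Dispatch a model-requested tool call to the local filesystem/shell."""
    if name == "Read":
        return pathlib.Path(args["path"]).read_text()
    if name == "Write":
        pathlib.Path(args["path"]).write_text(args["content"])
        return "ok"
    if name == "Execute":
        out = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
        return out.stdout + out.stderr
    return f"unknown tool: {name}"

messages = [
    {"role": "system", "content": open("system-prompt-agentic.txt").read()},
    {"role": "user", "content": "Add a slugify() helper to src/utils/text.py with tests."},
]

for _ in range(20):  # hard cap on agentic steps
    resp = requests.post(API, json={"model": "gemma-4-26b", "messages": messages,
                                    "tools": TOOLS}, timeout=600).json()
    msg = resp["choices"][0]["message"]
    messages.append(msg)
    calls = msg.get("tool_calls") or []
    if not calls:                     # no more tool calls: final answer
        print(msg["content"])
        break
    for call in calls:
        fn = call["function"]
        result = run_tool(fn["name"], json.loads(fn["arguments"]))
        messages.append({"role": "tool", "tool_call_id": call["id"], "content": result})
```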
Pitfall Guide
- Flash Attention Hang on Apple Silicon: Ollama's default backend triggers a known Flash Attention bug with Gemma 4 on M-series chips, causing silent hangs on long prompts. Fix: Switch to `llama.cpp` or upgrade to Ollama v0.20.6+.
- Tool-Call Routing Bug in Ollama: Versions ≤0.20.3 misroute Gemma 4's tool-call responses to the reasoning output stream instead of the `tool_calls` field, breaking agentic loops (a quick check is sketched after this list). Fix: Update to Ollama v0.20.5+ or use `llama.cpp`.
- Vision Projector OOM via `-hf` Flag: Using `llama.cpp`'s `-hf` auto-download flag silently fetches a 1.1 GB vision projector module. On 24 GB systems, this triggers immediate OOM crashes. Fix: Always download GGUF weights manually via `huggingface-cli` and omit `-hf`.
- KV Cache Quantization Omission: Skipping `-ctk q8_0 -ctv q8_0` leaves the KV cache in FP16, consuming ~940 MB per slot and drastically reducing the available context window. Fix: Always apply the KV cache quantization flags on memory-constrained deployments.
- Context Length Mismatch: IDE extensions default to 4K/8K context while the server runs 32K/64K, causing silent truncation or tokenization errors. Fix: Explicitly align `contextLength` in the IDE `config.json` with the server's `-c` flag.
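As a quick pre-flight check for the tool-call routing pitfall, the sketch below sends a request that should trigger a tool call and verifies that the backend returns it in the structured `tool_calls` field rather than as plain text. The endpoint and model name assume the llama.cpp server from section 2; point it at port 11434 to test Ollama instead.

```python
# Pre-flight check: does the backend return structured tool calls?
import requests

API = "http://localhost:1234/v1/chat/completions"   # Ollama: http://localhost:11434/v1/chat/completions

read_tool = {"type": "function", "function": {
    "name": "Read", "description": "Read a file from the filesystem",
    "parameters": {"type": "object", "properties": {"path": {"type": "string"}},
                   "required": ["path"]}}}

resp = requests.post(API, json={
    "model": "gemma-4-26b",
    "messages": [{"role": "user", "content": "Read the file README.md"}],
    "tools": [read_tool],
}, timeout=120).json()

msg = resp["choices"][0]["message"]
if msg.get("tool_calls"):
    print("OK: structured tool call returned:", msg["tool_calls"][0]["function"])
else:
    print("FAIL: no tool_calls field; check the Ollama version or the --jinja flag")
    print("Raw content:", msg.get("content"))
```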
Deliverables
📦 Offline AI Coding Assistant Blueprint
- Architecture decision matrix (MoE vs Dense, Ollama vs llama.cpp)
- Memory bandwidth optimization checklist
- Agentic tool-calling validation workflow
✅ Pre-Flight Verification Checklist
- GPU offload flags match hardware capability (`-ngl 99`)
- KV cache quantization applied (`-ctk q8_0 -ctv q8_0`)
- Context window aligned across server & IDE config
- Tool-calling template enabled (`--jinja`)
- Vision projector excluded from download
- Ollama version ≥0.20.5 or llama.cpp backend active
⚙️ Configuration Templates
- `continue-config.json` (Dual-model routing: 4B autocomplete + 26B agentic)
- `codex-config.toml` (Local API routing & tool restrictions)
- `llama-server-flags.sh` (Hardware-tuned launch scripts for 16/24/48 GB tiers)
- `system-prompt-agentic.txt` (Production-ready tool-calling constraints)
Deploy locally. Zero API bills. Full code sovereignty.
