
Building a Fully Offline AI Coding Assistant with Gemma 4: No Cloud Required 🤖

By Codcompass Team · 5 min read

Current Situation Analysis

Traditional cloud-based AI coding assistants introduce three critical failure modes for professional development workflows:

  1. Cost Escalation: API billing scales linearly with usage. Multi-session daily coding, agentic tool-calling, and iterative refactoring quickly accumulate unsustainable costs.
  2. Privacy & Compliance Risks: Proprietary algorithms, client codebases, and internal tooling cannot safely traverse third-party servers due to data residency, IP leakage, and audit requirements.
  3. Operational Fragility: Cloud APIs suffer from rate limiting, regional outages, and unpredictable pricing/model deprecations. Local deployments historically failed due to poor function-calling capabilities (pre-Gemma 4 models scored ~6.6% on agentic benchmarks) and inefficient memory management, rendering them unsuitable for production coding assistance.

The transition to local AI requires overcoming architectural inefficiencies: naive quantization breaks tool-calling templates, unoptimized KV caches cause OOM crashes, and dense model deployments saturate memory bandwidth. Gemma 4's 86.4% function-calling benchmark score and Mixture-of-Experts (MoE) architecture finally bridge the gap between local feasibility and agentic reliability.

WOW Moment: Key Findings

Experimental validation across hardware tiers reveals a clear performance-cost-accuracy tradeoff. The 26B MoE variant emerges as the optimal deployment target for mainstream developer hardware, while the 31B Dense model approaches cloud-tier quality on high-end workstations.

| Approach | Quality Score | Execution Time | Tool Calls | Key Finding |
| --- | --- | --- | --- | --- |
| ☁️ GPT-5.4 (Cloud) | ★★★★★ | 65 s | 3 | Type hints, exception chaining, clean architecture |
| 🖥️ 31B Dense (48 GB) | ★★★★☆ | 7 min | 3 | Functional, solid, minimal cleanup required |
| ⚡ 26B MoE (24 GB) | ★★★☆☆ | 4 min | 10 | Fast & functional; requires oversight for dead code/retries |
| 📱 E4B Edge (8 GB) | ★★☆☆☆ | 2 min | 15+ | Autocomplete-only; struggles with multi-file agentic tasks |

Speed Architecture Insight: Despite its "26B" label, the MoE variant activates only 3.8B parameters per token, reading ~1.9 GB/token from memory. This yields ~52 tok/s on M4 Pro hardware, significantly outpacing the 31B Dense model (~10 tok/s) which loads all 31.2B parameters per inference step.
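
A quick back-of-envelope check of that claim, as a sketch: assume Q4 quantization at roughly 0.5 bytes per parameter and the M4 Pro's 273 GB/s memory bandwidth (both assumptions, not measured here). Decode speed is roughly memory-bandwidth-bound, so bytes read per token divided into bandwidth gives a theoretical ceiling:

```bash
# MoE:   3.8B active params x ~0.5 bytes/param (Q4) = ~1.9 GB read per token
# Dense: 31.2B params       x ~0.5 bytes/param (Q4) = ~15.6 GB read per token
echo "MoE ceiling:   $(echo '273 / 1.9'  | bc) tok/s (observed ~52)"
echo "Dense ceiling: $(echo '273 / 15.6' | bc) tok/s (observed ~10)"
```

Observed throughput lands well under these ceilings because attention, KV cache reads, and scheduling overhead consume bandwidth too, but the roughly 5x gap between the two variants falls straight out of the arithmetic.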

Core Solution

1. Hardware Selection & Architecture Mapping

| Model | Min RAM | Recommended | Best For |
| --- | --- | --- | --- |
| 🟢 E4B (Edge) | 4 GB | 8 GB | Raspberry Pi, Jetson Nano |
| 🔵 26B MoE ⭐ | 16 GB (Q4) | 24 GB | M4 MacBook Pro, RTX 4070 |
| 🟣 31B Dense | 32 GB (Q4) | 48 GB+ | M4 Max, RTX 4090, GB10 |

Sweet Spot: 26B MoE on 24 GB hardware. MoE routing activates only 3.8B parameters per token, delivering high throughput without saturating memory bandwidth.

2. Model Deployment (Ollama vs llama.cpp)

Option A: Ollama (Streamlined)

# Install Ollama (macOS, Linux, Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the model (downloads ~16 GB for the 26B MoE)
ollama pull gemma4:26b

# Or the smaller edge model if you're on limited hardware
ollama pull gemma4:4b

# Verify it works 🎉
ollama run gemma4:26b "Write a Python function to merge two sorted lists"
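
Ollama also exposes a local REST API on port 11434, which is what the IDE integrations below talk to. A quick way to confirm it is serving (model tag as used throughout this guide):

```bash
# Hit Ollama's HTTP API directly (default port 11434)
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "Write a Python function to merge two sorted lists",
  "stream": false
}'
```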

Option B: llama.cpp (Granular Control)

# Install via Homebrew (macOS)
brew install llama.cpp

# Or build from source for GPU support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON    # NVIDIA
# or: cmake -B build -DGGML_METAL=ON    # Apple Silicon
cmake --build build --config Release -j

Download quantized weights:
```bash
# 26B MoE Q4: best balance of quality and speed
huggingface-cli download gg-hf-gg/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --local-dir ./models/
```

Launch optimized server:

llama-server \
  -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --port 1234 \
  -ngl 99 \
  -c 32768 \
  -np 1 \
  --jinja \
  -ctk q8_0 \
  -ctv q8_0

Flag Breakdown:

  • -ngl 99: Full GPU offload
  • -c 32768: 32K context window
  • -np 1: Single inference slot (prevents KV cache multiplication)
  • --jinja: Enables Gemma 4's native tool-calling template
  • -ctk q8_0 -ctv q8_0: KV cache quantization (~940 MB → ~499 MB)
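
With the server up, a quick smoke test against its OpenAI-compatible endpoint confirms the model loaded and responds (port and endpoint paths follow this guide's examples):

```bash
# List the loaded model, then request a short completion
curl -s http://localhost:1234/v1/models
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Reply with exactly: OK"}], "max_tokens": 8}'
```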

3. IDE Integration

Continue.dev (VS Code / JetBrains)

{
  "models": [
    {
      "title": "Gemma 4 26B (Local)",
      "provider": "ollama",
      "model": "gemma4:26b",
      "contextLength": 32768
    }
  ],
  "tabAutocompleteModel": {
    "title": "Gemma 4 E4B (Autocomplete)",
    "provider": "ollama",
    "model": "gemma4:4b"
  }
}
Or, pointing at the llama.cpp server through its OpenAI-compatible endpoint:

{
  "models": [
    {
      "title": "Gemma 4 26B (llama.cpp)",
      "provider": "openai",
      "model": "gemma-4-26b",
      "apiBase": "http://localhost:1234/v1",
      "contextLength": 32768
    }
  ]
}
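
Continue reads this from a config file in your home directory; the path below is the long-standing default (newer releases may use config.yaml instead):

```bash
# Open Continue's config for editing (default location)
code ~/.continue/config.json
```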

Codex CLI (Terminal)

# Install Codex CLI
npm install -g @openai/codex

# Run with local model
codex --oss -m gemma4:26b

# Or with llama.cpp backend
codex --oss -m http://localhost:1234/v1

config.toml:

[model]
wire_api = "responses"
web_search = "disabled"  # llama.cpp rejects this tool type
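
Codex CLI reads config.toml from ~/.codex/ by default (path per its docs; adjust if your install differs). A sketch for dropping the snippet above in place:

```bash
# Append the local-model settings to Codex CLI's config
mkdir -p ~/.codex
cat >> ~/.codex/config.toml <<'EOF'
[model]
wire_api = "responses"
web_search = "disabled"  # llama.cpp rejects this tool type
EOF
```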

4. Hardware-Tuned Configuration

16 GB (Budget/MacBook Air):

ollama pull gemma4:4b
# Or aggressive quantization for 26B
ollama pull gemma4:26b-q3_K_M

Set contextLength: 8192 in IDE config.

24 GB (Sweet Spot):

ollama pull gemma4:26b
# Or llama.cpp with optimized KV cache
llama-server -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --port 1234 -ngl 99 -c 32768 -np 1 --jinja \
  -ctk q8_0 -ctv q8_0

48 GB+ (Workstation):

ollama pull gemma4:31b
# Or with full context
llama-server -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  --port 1234 -ngl 99 -c 65536 -np 1 --jinja

5. Prompt Engineering for Local Agentic Workflows

You are a coding assistant running locally. You have access to these tools:
- Read: Read a file from the filesystem
- Write: Write content to a file
- Execute: Run a shell command

Rules:
1. Read the existing code before making changes.
2. Write tests for any new function you create.
3. Run the tests and fix failures.
4. Keep changes minimal; don't refactor unrelated code.
5. If you're unsure, explain your reasoning before acting.

Operational Guidelines:

  • Specify exact file paths (src/utils/parser.ts, not "the parser file")
  • Decompose features into sequential tasks (function → tests → execution)
  • Leverage native JSON output for structured tool responses (see the tool-call sketch below)
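
Putting the pieces together, this is what a tool-enabled request to the local server looks like. The Read tool schema mirrors the system prompt above, and the OpenAI-style tools field is what --jinja renders into Gemma 4's native template (a sketch using this guide's example port):

```bash
# Agentic request: the model can answer directly or emit a tool_calls
# entry asking for "Read" with a concrete path argument
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize src/utils/parser.ts"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "Read",
        "description": "Read a file from the filesystem",
        "parameters": {
          "type": "object",
          "properties": {"path": {"type": "string", "description": "Path of the file to read"}},
          "required": ["path"]
        }
      }
    }]
  }'
```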

Pitfall Guide

  1. Flash Attention Hang on Apple Silicon: Ollama's default backend triggers a known Flash Attention bug with Gemma 4 on M-series chips, causing silent hangs on long prompts. Fix: Switch to llama.cpp or upgrade to Ollama v0.20.6+.
  2. Tool-Call Routing Bug in Ollama: Versions ≤0.20.3 misroute Gemma 4's tool-call responses to the reasoning output stream instead of the tool_calls field, breaking agentic loops. Fix: Update to Ollama v0.20.5+ or use llama.cpp.
  3. Vision Projector OOM via -hf Flag: Using llama.cpp's -hf auto-download flag silently fetches a 1.1 GB vision projector module. On 24 GB systems, this triggers immediate OOM crashes. Fix: Always download GGUF weights manually via huggingface-cli and omit -hf.
  4. KV Cache Quantization Omission: Skipping -ctk q8_0 -ctv q8_0 leaves the KV cache in FP16, consuming ~940 MB per slot and drastically reducing available context window. Fix: Always apply KV cache quantization flags for memory-constrained deployments.
  5. Context Length Mismatch: IDE extensions default to 4K/8K context while the server runs 32K/64K, causing silent truncation or tokenization errors. Fix: Explicitly align contextLength in IDE config.json with the server's -c flag.

Deliverables

📦 Offline AI Coding Assistant Blueprint

  • Architecture decision matrix (MoE vs Dense, Ollama vs llama.cpp)
  • Memory bandwidth optimization checklist
  • Agentic tool-calling validation workflow

✅ Pre-Flight Verification Checklist

  • GPU offload flags match hardware capability (-ngl 99)
  • KV cache quantization applied (-ctk q8_0 -ctv q8_0)
  • Context window aligned across server & IDE config
  • Tool-calling template enabled (--jinja)
  • Vision projector excluded from download
  • Ollama version ≥0.20.5 or llama.cpp backend active (a scripted check follows)
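
A minimal scripted version of this checklist, as a sketch (ports, endpoints, and version numbers follow this guide's examples):

```bash
#!/usr/bin/env bash
# Pre-flight: is the local server reachable, and is Ollama new enough?
curl -sf http://localhost:1234/v1/models >/dev/null \
  && echo "llama-server: reachable" || echo "llama-server: DOWN"
# Only relevant on the Ollama path (>= 0.20.5 per pitfall 2)
ollama --version 2>/dev/null || echo "ollama: not installed (fine if using llama.cpp)"
```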

⚙️ Configuration Templates

  • continue-config.json (Dual-model routing: 4B autocomplete + 26B agentic)
  • codex-config.toml (Local API routing & tool restrictions)
  • llama-server-flags.sh (Hardware-tuned launch scripts for 16/24/48 GB tiers)
  • system-prompt-agentic.txt (Production-ready tool-calling constraints)

Deploy locally. Zero API bills. Full code sovereignty.