
Building a Fully Offline AI Coding Assistant with Gemma 4: No Cloud Required 🤖

By Codcompass Team · 5 min read

Current Situation Analysis

Traditional cloud-based AI coding assistants introduce three critical failure modes for professional development workflows:

  1. Cost Escalation: API billing scales linearly with usage. Multi-session daily coding, agentic tool-calling, and iterative refactoring quickly accumulate unsustainable costs.
  2. Privacy & Compliance Risks: Proprietary algorithms, client codebases, and internal tooling cannot safely traverse third-party servers due to data residency, IP leakage, and audit requirements.
  3. Operational Fragility: Cloud APIs suffer from rate limiting, regional outages, and unpredictable pricing/model deprecations. Local deployments historically failed due to poor function-calling capabilities (pre-Gemma 4 models scored ~6.6% on agentic benchmarks) and inefficient memory management, rendering them unsuitable for production coding assistance.

The transition to local AI requires overcoming architectural inefficiencies: naive quantization breaks tool-calling templates, unoptimized KV caches cause OOM crashes, and dense model deployments saturate memory bandwidth. Gemma 4's 86.4% function-calling benchmark score and Mixture-of-Experts (MoE) architecture finally bridge the gap between local feasibility and agentic reliability.

WOW Moment: Key Findings

Experimental validation across hardware tiers reveals a clear performance-cost-accuracy tradeoff. The 26B MoE variant emerges as the optimal deployment target for mainstream developer hardware, while the 31B Dense model approaches cloud-tier quality on high-end workstations.

| Approach | Quality Score | Execution Time | Tool Calls | Key Finding |
| --- | --- | --- | --- | --- |
| ☁️ GPT-5.4 (Cloud) | ★★★★★ | 65 s | 3 | Type hints, exception chaining, clean architecture |
| 🖥️ 31B Dense (48 GB) | ★★★★☆ | 7 min | 3 | Functional, solid, minimal cleanup required |
| ⚡ 26B MoE (24 GB) | ★★★☆☆ | 4 min | 10 | Fast & functional; requires oversight for dead code/retries |
| 📱 E4B Edge (8 GB) | ★★☆☆☆ | 2 min | 15+ | Autocomplete-only; struggles with multi-file agentic tasks |

Speed Architecture Insight: Despite its "26B" label, the MoE variant activates only 3.8B parameters per token, reading ~1.9 GB/token from memory. This yields ~52 tok/s on M4 Pro hardware, significantly outpacing the 31B Dense model (~10 tok/s) which loads all 31.2B parameters per inference step.
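
A quick back-of-envelope check of that claim, as a sketch: assume Q4 quantization at roughly 0.5 bytes per parameter and the M4 Pro's 273 GB/s memory bandwidth (both assumptions, not measured here). Decode speed is roughly memory-bandwidth-bound, so bytes read per token divided into bandwidth gives a theoretical ceiling:

```bash
# MoE:   3.8B active params x ~0.5 bytes/param (Q4) = ~1.9 GB read per token
# Dense: 31.2B params       x ~0.5 bytes/param (Q4) = ~15.6 GB read per token
echo "MoE ceiling:   $(echo '273 / 1.9'  | bc) tok/s (observed ~52)"
echo "Dense ceiling: $(echo '273 / 15.6' | bc) tok/s (observed ~10)"
```

Observed throughput lands well under these ceilings because attention, KV cache reads, and scheduling overhead consume bandwidth too, but the roughly 5x gap between the two variants falls straight out of the arithmetic.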

Core Solution

1. Hardware Selection & Architecture Mapping

| Model | Min RAM | Recommended | Best For |
| --- | --- | --- | --- |
| 🟢 E4B (Edge) | 4 GB | 8 GB | Raspberry Pi, Jetson Nano |
| 🔵 26B MoE ⭐ | 16 GB (Q4) | 24 GB | M4 MacBook Pro, RTX 4070 |
| 🟣 31B Dense | 32 GB (Q4) | 48 GB+ | M4 Max, RTX 4090, GB10 |

Sweet Spot: 26B MoE on 24 GB hardware. MoE routing activates only 3.8B parameters per token, delivering high throughput without saturating memory bandwidth.

2. Model Deployment (Ollama vs llama.cpp)

Option A: Ollama (Streamlined)

# Install Ollama (macOS, Linux, Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the model (downloads ~16 GB for the 26B MoE)
ollama pull gemma4:26b

# Or the smaller edge model if you're on limited hardware
ollama pull gemma4:4b

# Verify it works 🎉
ollama run gemma4:26b "Write a Python function to merge two sorted lists"
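
Ollama also exposes a local REST API on port 11434, which is what the IDE integrations below talk to. A quick way to confirm it is serving (model tag as used throughout this guide):

```bash
# Hit Ollama's HTTP API directly (default port 11434)
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "Write a Python function to merge two sorted lists",
  "stream": false
}'
```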

Option B: llama.cpp (Granular Control)

# Install via Homebrew (macOS)
brew install llama.cpp

# Or build from source for GPU support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON    # NVIDIA
# or: cmake -B build -DGGML_METAL=ON    # Apple Silicon
cmake --build build --config Release -j

Download quantized weights:
```bash
# 26B MoE Q4: best balance of quality and speed
huggingface-cli download gg-hf-gg/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --local-dir ./models/
```

Launch optimized server:

llama-server \
  -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --port 1234 \
  -ngl 99 \
  -c 32768 \
  -np 1 \
  --jinja \
  -ctk q8_0 \
  -ctv q8_0

Flag Breakdown:

  • -ngl 99: Full GPU offload
  • -c 32768: 32K context window
  • -np 1: Single inference slot (prevents KV cache multiplication)
  • --jinja: Enables Gemma 4's native tool-calling template
  • -ctk q8_0 -ctv q8_0: KV cache quantization (~940 MB → ~499 MB)
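
With the server up, a quick smoke test against its OpenAI-compatible endpoint confirms the model loaded and responds (port and endpoint paths follow this guide's examples):

```bash
# List the loaded model, then request a short completion
curl -s http://localhost:1234/v1/models
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Reply with exactly: OK"}], "max_tokens": 8}'
```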

3. IDE Integration

Continue.dev (VS Code / JetBrains)

{
  "models": [
    {
      "title": "Gemma 4 26B (Local)",
      "provider": "ollama",
      "model": "gemma4:26b",
      "contextLength": 32768
    }
  ],
  "tabAutocompleteModel": {
    "title": "Gemma 4 E4B (Autocomplete)",
    "provider": "ollama",
    "model": "gemma4:4b"
  }
}
Or, pointing at the llama.cpp server through its OpenAI-compatible endpoint:

{
  "models": [
    {
      "title": "Gemma 4 26B (llama.cpp)",
      "provider": "openai",
      "model": "gemma-4-26b",
      "apiBase": "http://localhost:1234/v1",
      "contextLength": 32768
    }
  ]
}
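
Continue reads this from a config file in your home directory; the path below is the long-standing default (newer releases may use config.yaml instead):

```bash
# Open Continue's config for editing (default location)
code ~/.continue/config.json
```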

Codex CLI (Terminal)

# Install Codex CLI
npm install -g @openai/codex

# Run with local model
codex --oss -m gemma4:26b

# Or with llama.cpp backend
codex --oss -m http://localhost:1234/v1

config.toml:

[model]
wire_api = "responses"
web_search = "disabled"  # llama.cpp rejects this tool type
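
Codex CLI reads config.toml from ~/.codex/ by default (path per its docs; adjust if your install differs). A sketch for dropping the snippet above in place:

```bash
# Append the local-model settings to Codex CLI's config
mkdir -p ~/.codex
cat >> ~/.codex/config.toml <<'EOF'
[model]
wire_api = "responses"
web_search = "disabled"  # llama.cpp rejects this tool type
EOF
```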

4. Hardware-Tuned Configuration

16 GB (Budget/MacBook Air):

ollama pull gemma4:4b
# Or aggressive quantization for 26B
ollama pull gemma4:26b-q3_K_M

Set contextLength: 8192 in IDE config.

24 GB (Sweet Spot):

ollama pull gemma4:26b
# Or llama.cpp with optimized KV cache
llama-server -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --port 1234 -ngl 99 -c 32768 -np 1 --jinja \
  -ctk q8_0 -ctv q8_0

48 GB+ (Workstation):

ollama pull gemma4:31b
# Or with full context
llama-server -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  --port 1234 -ngl 99 -c 65536 -np 1 --jinja

5. Prompt Engineering for Local Agentic Workflows

You are a coding assistant running locally. You have access to these tools:
- Read: Read a file from the filesystem
- Write: Write content to a file
- Execute: Run a shell command

Rules:
1. Read the existing code before making changes.
2. Write tests for any new function you create.
3. Run the tests and fix failures.
4. Keep changes minimal; don't refactor unrelated code.
5. If you're unsure, explain your reasoning before acting.

Operational Guidelines:

  • Specify exact file paths (src/utils/parser.ts, not "the parser file")
  • Decompose features into sequential tasks (function → tests → execution)
  • Leverage native JSON output for structured tool responses (see the tool-call sketch below)
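
Putting the pieces together, this is what a tool-enabled request to the local server looks like. The Read tool schema mirrors the system prompt above, and the OpenAI-style tools field is what --jinja renders into Gemma 4's native template (a sketch using this guide's example port):

```bash
# Agentic request: the model can answer directly or emit a tool_calls
# entry asking for "Read" with a concrete path argument
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize src/utils/parser.ts"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "Read",
        "description": "Read a file from the filesystem",
        "parameters": {
          "type": "object",
          "properties": {"path": {"type": "string", "description": "Path of the file to read"}},
          "required": ["path"]
        }
      }
    }]
  }'
```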

Pitfall Guide

  1. Flash Attention Hang on Apple Silicon: Ollama's default backend triggers a known Flash Attention bug with Gemma 4 on M-series chips, causing silent hangs on long prompts. Fix: Switch to llama.cpp or upgrade to Ollama v0.20.6+.
  2. Tool-Call Routing Bug in Ollama: Versions ≤0.20.3 misroute Gemma 4's tool-call responses to the reasoning output stream instead of the tool_calls field, breaking agentic loops. Fix: Update to Ollama v0.20.5+ or use llama.cpp.
  3. Vision Projector OOM via -hf Flag: Using llama.cpp's -hf auto-download flag silently fetches a 1.1 GB vision projector module. On 24 GB systems, this triggers immediate OOM crashes. Fix: Always download GGUF weights manually via huggingface-cli and omit -hf.
  4. KV Cache Quantization Omission: Skipping -ctk q8_0 -ctv q8_0 leaves the KV cache in FP16, consuming ~940 MB per slot and drastically reducing available context window. Fix: Always apply KV cache quantization flags for memory-constrained deployments.
  5. Context Length Mismatch: IDE extensions default to 4K/8K context while the server runs 32K/64K, causing silent truncation or tokenization errors. Fix: Explicitly align contextLength in IDE config.json with the server's -c flag.

Deliverables

📦 Offline AI Coding Assistant Blueprint

  • Architecture decision matrix (MoE vs Dense, Ollama vs llama.cpp)
  • Memory bandwidth optimization checklist
  • Agentic tool-calling validation workflow

✅ Pre-Flight Verification Checklist

  • GPU offload flags match hardware capability (-ngl 99)
  • KV cache quantization applied (-ctk q8_0 -ctv q8_0)
  • Context window aligned across server & IDE config
  • Tool-calling template enabled (--jinja)
  • Vision projector excluded from download
  • Ollama version ≥0.20.5 or llama.cpp backend active (a scripted check follows)
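
A minimal scripted version of this checklist, as a sketch (ports, endpoints, and version numbers follow this guide's examples):

```bash
#!/usr/bin/env bash
# Pre-flight: is the local server reachable, and is Ollama new enough?
curl -sf http://localhost:1234/v1/models >/dev/null \
  && echo "llama-server: reachable" || echo "llama-server: DOWN"
# Only relevant on the Ollama path (>= 0.20.5 per pitfall 2)
ollama --version 2>/dev/null || echo "ollama: not installed (fine if using llama.cpp)"
```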

⚙️ Configuration Templates

  • continue-config.json (Dual-model routing: 4B autocomplete + 26B agentic)
  • codex-config.toml (Local API routing & tool restrictions)
  • llama-server-flags.sh (Hardware-tuned launch scripts for 16/24/48 GB tiers)
  • system-prompt-agentic.txt (Production-ready tool-calling constraints)

Deploy locally. Zero API bills. Full code sovereignty.