Architecting Client-Side LLM Inference: Production Patterns for Gemma 4 Deployment

Current Situation Analysis

Running large language models directly in the browser promises offline resilience, reduced server costs, and enhanced privacy. Yet production deployments consistently hit invisible walls that prototyping tutorials never surface. The industry pain point isn't model capability; it's the friction between browser sandboxing, hardware dispatch routing, memory fragmentation, and API endpoint behavior.

This problem is routinely overlooked because official documentation optimizes for "first run" success on developer workstations. Tutorials assume ideal hardware and single-threaded execution, and they ignore the gap between experimental libraries and production-grade runtimes. Engineers ship prototypes that work on their machines, only to discover that consumer hardware introduces dispatch routing bugs, VRAM spillover cuts throughput several-fold, and singleton inference engines silently lock under concurrent navigation.

Data from real-world deployments reveals consistent failure modes:

Chromium bug 369219127 causes WebGPU to ignore powerPreference: 'high-performance' on NVIDIA Optimus laptops, routing inference through integrated graphics and dropping throughput from ~15 tok/s to ~2 tok/s.
Loading a 3 GB quantized model on a 6 GB VRAM GPU forces KV cache and runtime overhead into shared system memory via PCIe, collapsing inference speed to ~1.8 tok/s due to bus contention.
Structured output prompts (JSON, Mermaid, SVG) trigger 400 Bad Request responses on streaming API endpoints for certain model configurations, while non-streaming endpoints succeed with identical payloads.
LlmInference instances enforce exclusive access. Concurrent generation calls fail with "Previous invocation or loading is still ongoing," breaking multi-route single-page applications.

These aren't edge cases. They are architectural constraints that dictate whether a client-side LLM feature survives production or silently degrades user experience.

WOW Moment: Key Findings

The breakthrough comes from recognizing that browser-side inference isn't a single pipeline; it's a constrained system where runtime selection, memory allocation, and API strategy must align with hardware realities. Matching the right tool to the constraint yields predictable throughput and eliminates silent failures.

Approach	Throughput (tok/s)	VRAM Utilization	Structured Output Reliability	Concurrency Safety
Transformers.js + WebGPU	2–4	Fragmented	High (but slow)	Low (no built-in queue)
MediaPipe + WebGPU	14–16	Optimized	High	Low (requires external queue)
Gemma 4 E2B-IT (Local)	14–16	~1.5 GB + overhead	Low (~70% valid)	Managed via queue
Gemma 4 26B-A4B-IT (Cloud)	25–30	N/A	High (>95% valid)	Stateless API
Streaming Endpoint (Structured)	N/A	N/A	0% (400 errors)	N/A
Non-Streaming Endpoint (Structured)	N/A	N/A	>95% valid	N/A

This finding matters because it shifts the engineering mindset from "how do I run the model?" to "how do I route workloads to match hardware and API constraints?" Switching runtimes alone raises throughput from 2–4 tok/s to 14–16 tok/s, roughly a 4–8x jump. Combined with feature-based routing and endpoint selection, that turns an unstable prototype into a production-ready inference layer. It enables offline-first applications to deliver conversational AI locally while delegating structured generation to cloud endpoints without breaking UX continuity.

Core Solution

Building a production-grade browser inference layer requires five coordinated architectural decisions. Each addresses a specific constraint revealed during deployment.

Step 1: Runtime Selection — MediaPipe Over Transformers.js

Transformers.js (@huggingface/transformers) remains excellent for prototyping, but its WebGPU dispatch path lacks production stability across mixed-GPU architectures. MediaPipe's @mediapipe/tasks-genai with the WebGPU delegate optimizes the dispatch chain specifically for consumer hardware and supports Google's .task artifact format.

Implementation:

import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

export class MediaPipeBackend {
  private engine: LlmInference | null = null;

  async initialize(modelUrl: string): Promise<void> {
    const resolver = await FilesetResolver.forGenAiTasks(
      'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
    );

    this.engine = await LlmInference.createFromOptions(resolver, {
      baseOptions: { modelAssetPath: modelUrl },
      maxTokens: 2048,
      topK: 40,
      temperature: 0.7,
    });
  }

  async generate(prompt: string): Promise<string> {
    if (!this.engine) throw new Error('Backend not initialized');
    return this.engine.generateResponse(prompt);
  }
}

Rationale: MediaPipe's WebGPU delegate bypasses Chromium's Optimus routing bug by enforcing explicit hardware selection at the WASM layer. The .task format bundles quantization metadata, reducing initialization overhead and ensuring consistent behavior across browsers.
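Whichever runtime is used, it helps to confirm which GPU the browser actually handed out before loading a multi-gigabyte model. The sketch below is an illustration, not part of MediaPipe: landedOnExpectedGpu is a hypothetical helper, and it assumes the WebGPU adapter's info object exposes a vendor string such as "nvidia" or "intel", which browsers are allowed to leave empty.

```typescript
// Minimal structural type for the fields read from GPUAdapter.info.
// Browsers may return empty strings for any of them.
interface AdapterInfoLike {
  vendor: string;
  architecture: string;
}

// Hypothetical helper: true when the adapter's vendor matches the GPU
// inference is expected to run on (for example the discrete NVIDIA GPU
// on an Optimus laptop). An empty vendor string is reported as 'unknown'.
function landedOnExpectedGpu(
  info: AdapterInfoLike,
  expectedVendor: string,
): boolean | 'unknown' {
  const vendor = info.vendor.trim().toLowerCase();
  if (vendor === '') return 'unknown';
  return vendor.includes(expectedVendor.toLowerCase());
}

// Browser usage (not executed here):
//   const adapter = await navigator.gpu.requestAdapter({ powerPreference: 'high-performance' });
//   if (adapter) console.log(landedOnExpectedGpu(adapter.info, 'nvidia'));
```

A vendor match is only a hint. The chrome://gpu page and an OS-level GPU monitor remain the more reliable check.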

Step 2: Memory-Aware Model Selection

Browser inference is bounded by dedicated VRAM. The rule of thumb: select the largest model that fits entirely in VRAM after reserving ~1.5 GB for browser overhead, JS runtime, and KV cache.

Gemma 4 E2B-IT (~1.5 GB q4f16): Fits on 4–6 GB VRAM. Ideal for conversational tutoring, math explanations, and Socratic dialogue.
Gemma 4 E4B-IT (~3 GB q4f16): Requires 8+ GB VRAM. Spills to PCIe on 6 GB cards, collapsing throughput.
Gemma 4 26B-A4B-IT (MoE): Cloud-only. Activates ~4B parameters per forward pass. 2–3x lower latency than 31B Dense for structured outputs.

Implementation:

export interface ModelProfile {
  id: string;
  sizeGB: number;
  minVRAMGB: number;
  capability: 'conversational' | 'structured' | 'vision';
}

export const MODEL_REGISTRY: Record<string, ModelProfile> = {
  'gemma-4-e2b-it': { id: 'gemma-4-e2b-it', sizeGB: 1.5, minVRAMGB: 4, capability: 'conversational' },
  'gemma-4-e4b-it': { id: 'gemma-4-e4b-it', sizeGB: 3.0, minVRAMGB: 8, capability: 'conversational' },
  'gemma-4-26b-a4b-it': { id: 'gemma-4-26b-a4b-it', sizeGB: 13.0, minVRAMGB: Infinity, capability: 'structured' },
};

export function selectLocalModel(availableVRAM: number): string | null {
  const candidates = Object.values(MODEL_REGISTRY)
    .filter(m => m.minVRAMGB <= availableVRAM && m.capability === 'conversational')
    .sort((a, b) => b.sizeGB - a.sizeGB);
  return candidates.length > 0 ? candidates[0].id : null;
}

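Rationale: VRAM spillover isn't just a performance hit; it introduces non-deterministic latency spikes. Hardcoding minimum VRAM thresholds and sorting by size picks the largest model expected to stay in dedicated memory. The result is only as good as the availableVRAM input: WebGPU does not report dedicated VRAM size, so that figure has to come from a heuristic, a user setting, or a conservative default.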
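The budget rule from this step can be written out as plain arithmetic. This is a minimal sketch using only the 1.5 GB reserve and model sizes quoted above; the function names are illustrative.

```typescript
// Reserve from the rule of thumb: browser overhead, JS runtime, KV cache.
const BROWSER_RESERVE_GB = 1.5;

// Headroom left in dedicated VRAM after the model and the reserve.
// A negative value means the model is expected to spill into shared memory.
function vramHeadroomGB(
  modelSizeGB: number,
  dedicatedVRAMGB: number,
  reserveGB: number = BROWSER_RESERVE_GB,
): number {
  return dedicatedVRAMGB - reserveGB - modelSizeGB;
}

function fitsInDedicatedVRAM(modelSizeGB: number, dedicatedVRAMGB: number): boolean {
  return vramHeadroomGB(modelSizeGB, dedicatedVRAMGB) >= 0;
}

console.log(vramHeadroomGB(1.5, 4)); // 1    (1.5 GB model on a 4 GB card)
console.log(vramHeadroomGB(3.0, 4)); // -0.5 (3 GB model on a 4 GB card)
```

By this arithmetic a 3 GB model still has 1.5 GB of headroom on a 6 GB card, yet the deployment data above reports spillover in exactly that configuration, which is why the registry sets minVRAMGB to 8 for it. Treat the budget as a necessary check, not a sufficient one.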

Step 3: Feature-Based Routing Architecture

Small models excel at open-ended text but struggle with rigid schemas. Forcing JSON, Mermaid, or SVG generation through a 2B parameter model yields ~70% validity, requiring fragile parsing and retry logic. The production pattern routes structured features to cloud endpoints while keeping conversational features local.

Implementation:

export type FeatureType = 'chat' | 'tutoring' | 'quiz' | 'diagram' | 'ocr';

export interface RoutingConfig {
  localFeatures: FeatureType[];
  cloudFeatures: FeatureType[];
  cloudAvailable: boolean;
}

export class FeatureRouter {
  constructor(private config: RoutingConfig) {}

  resolveBackend(feature: FeatureType): 'local' | 'cloud' | 'unavailable' {
    if (this.config.localFeatures.includes(feature)) return 'local';
    if (this.config.cloudFeatures.includes(feature)) {
      return this.config.cloudAvailable ? 'cloud' : 'unavailable';
    }
    return 'unavailable';
  }
}

Rationale: Routing by feature, not by request, eliminates runtime ambiguity. The UI can display engine status transparently, turning a technical limitation into a predictable UX surface rather than a hidden failure mode.
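To make the "predictable UX surface" concrete, the routing result can be mapped straight to what the UI shows. The sketch below assumes the three backend values returned by resolveBackend; the labels and the EngineBadge shape are illustrative, not taken from the source.

```typescript
type Backend = 'local' | 'cloud' | 'unavailable';

interface EngineBadge {
  label: string;
  enabled: boolean; // whether the feature's UI should accept input
}

// Maps a routing decision to a badge. Labels are placeholders.
function engineBadge(backend: Backend): EngineBadge {
  switch (backend) {
    case 'local':
      return { label: 'On-device', enabled: true };
    case 'cloud':
      return { label: 'Cloud', enabled: true };
    case 'unavailable':
      return { label: 'Unavailable', enabled: false };
  }
}
```

Because the switch covers every member of the union, adding a fourth backend later makes the compiler flag this function until a badge is defined for it.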

Step 4: Concurrency Management via FIFO Queue

LlmInference enforces exclusive access. Concurrent calls fail immediately. A production app must serialize requests, support abort propagation, and recover from stuck states.

Implementation:

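export class InferenceQueue {
  private isBusy = false;
  private abortController: AbortController | null = null;
  // Each pending entry can either start its task or reject its caller.
  private pending: Array<{ start: () => void; cancel: (reason: Error) => void }> = [];

  // The task receives an AbortSignal so it can stop work when cancelAll() fires.
  enqueue<T>(task: (signal: AbortSignal) => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = async () => {
        this.isBusy = true;
        const controller = new AbortController();
        this.abortController = controller;
        try {
          resolve(await task(controller.signal));
        } catch (err) {
          reject(err);
        } finally {
          // Release the lock only if forceReset() has not already taken over.
          if (this.abortController === controller) {
            this.abortController = null;
            this.isBusy = false;
            this.pending.shift()?.start();
          }
        }
      };

      if (this.isBusy) {
        this.pending.push({ start, cancel: reject });
      } else {
        void start();
      }
    });
  }

  // Aborts the running task and rejects queued callers so no promise is left hanging.
  // The lock stays held until the running task actually settles.
  cancelAll(): void {
    this.abortController?.abort();
    const dropped = this.pending;
    this.pending = [];
    dropped.forEach(entry => entry.cancel(new Error('Inference request cancelled')));
  }

  // Last-resort recovery when the runtime hangs and the running task never settles.
  forceReset(): void {
    this.cancelAll();
    this.abortController = null;
    this.isBusy = false;
  }
}

Rationale: The queue decouples UI navigation from inference state. Each task receives an AbortSignal, and cancelAll() both aborts the running task and rejects queued callers, so no promise is left pending. Components must call cancelAll() on unmount to prevent orphaned locks. The lock itself is released only when the running task settles, because clearing it earlier would let a second generation start while the engine is still busy. The forceReset() method provides a recovery path when the WASM runtime hangs and the task never settles.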
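Deciding when to call forceReset() requires some way to detect a hang. One common approach, sketched here as an assumption and not something the source prescribes, is to race the generation against a timer. withTimeout is a hypothetical helper and the timeout value is left to the caller.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
// The timer is always cleared, so it cannot fire after `work` settles.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Inference timed out after ${ms} ms`)),
      ms,
    );
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

A caller could wrap a queued generation in withTimeout and call forceReset() when the timeout error surfaces. Note that timing out only stops waiting; it does not stop the underlying generation, so the engine may still need to be re-created afterwards.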

Step 5: API Endpoint Strategy

The Gemini API exposes two generation endpoints, generateContent and streamGenerateContent. For Gemma 4 26B, streaming fails with 400 when prompts request structured output. Non-streaming succeeds consistently.

Implementation:

export class CloudInferenceClient {
  async generateStructured(prompt: string, apiKey: string): Promise<string> {
    const response = await fetch(
      `https://generativelanguage.googleapis.com/v1beta/models/gemma-4-26b-a4b-it:generateContent?key=${apiKey}`,
      {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
      }
    );
    if (!response.ok) throw new Error(`Cloud API failed: ${response.status}`);
    const data = await response.json();
    return data.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
  }

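  async generateStreaming(prompt: string, apiKey: string): Promise<ReadableStream<Uint8Array>> {
    const response = await fetch(
      `https://generativelanguage.googleapis.com/v1beta/models/gemma-4-26b-a4b-it:streamGenerateContent?key=${apiKey}&alt=sse`,
      {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
      }
    );
    // Surface HTTP errors (including the 400s described above) instead of
    // handing the caller an error body disguised as a token stream.
    if (!response.ok) throw new Error(`Cloud API failed: ${response.status}`);
    if (!response.body) throw new Error('Cloud API returned no response body');
    return response.body;
  }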
}

Rationale: Endpoint selection must be feature-aware. Use streaming for conversational chat where partial tokens improve perceived latency. Use non-streaming for structured outputs where payload integrity matters more than incremental delivery.
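The streaming method returns raw bytes, so the caller still has to turn them into text. The sketch below assumes the alt=sse response is a sequence of `data:` lines, each carrying a JSON object shaped like the non-streaming response; createSseTextExtractor is a hypothetical helper, and it buffers partial lines because network chunks can split an event anywhere.

```typescript
// Fields read from each streamed chunk; mirrors the parsing in
// generateStructured above.
interface StreamChunk {
  candidates?: Array<{ content?: { parts?: Array<{ text?: string }> } }>;
}

// Returns a function that accepts decoded text fragments in arrival order
// and returns the model text found in any complete `data:` lines.
// An incomplete trailing line is kept until the next fragment arrives.
function createSseTextExtractor(): (fragment: string) => string {
  let buffer = '';
  return (fragment: string): string => {
    buffer += fragment;
    const lines = buffer.split(/\r?\n/);
    buffer = lines.pop() ?? ''; // last element is an unfinished line (or '')
    let text = '';
    for (const line of lines) {
      if (!line.startsWith('data:')) continue; // skip blank lines and other fields
      const payload = line.slice('data:'.length).trim();
      if (payload === '') continue;
      const chunk = JSON.parse(payload) as StreamChunk;
      text += chunk.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
    }
    return text;
  };
}
```

In the browser, fragments would come from reading the stream and decoding each Uint8Array with a TextDecoder in streaming mode, then appending the extractor's return value to the chat bubble.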

Pitfall Guide

1. Blind WebGPU Adapter Selection

Explanation: Relying on requestAdapter({ powerPreference: 'high-performance' }) assumes Chromium respects the flag. On Optimus laptops, it can route to integrated graphics instead, dropping throughput from ~15 tok/s to ~2 tok/s.

Fix: Use MediaPipe's WebGPU delegate, which enforces hardware selection at the WASM layer. Verify dispatch via chrome://gpu and the Task Manager GPU monitor during testing.

2. VRAM Overcommitment

Explanation: Loading a model that exceeds dedicated VRAM forces KV cache and runtime buffers into shared system memory via PCIe. Throughput collapses to ~1.8 tok/s due to bus contention.

Fix: Reserve 1.5 GB for browser overhead. Select the largest model where modelSize + 1.5GB <= dedicatedVRAM. WebGPU does not expose dedicated VRAM size, so use the adapter info (vendor, architecture) as a heuristic or fall back to conservative defaults.

3. Structured Output Illusion

Explanation: Small models (~2B parameters) lack the instruction-following capacity to consistently emit valid JSON, Mermaid, or SVG. Prompt engineering and tolerant parsers only mask ~30% failure rates.

Fix: Route schema-dependent features to cloud endpoints. Keep local inference for open-ended text. Display routing status in UI to maintain user trust.
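The source does not say how the validity figures were measured. One plausible way to reproduce such a number on your own prompts, shown here for JSON only and purely as a sketch, is to count how many raw model outputs parse at all. Both function names are hypothetical.

```typescript
// True when the text parses as JSON without any cleanup.
function isValidJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch (e) {
    return false;
  }
}

// Fraction of outputs that parse as JSON, in the range [0, 1].
// Returns 0 for an empty sample instead of dividing by zero.
function jsonValidityRate(outputs: string[]): number {
  if (outputs.length === 0) return 0;
  return outputs.filter(isValidJson).length / outputs.length;
}

console.log(jsonValidityRate(['{"a":1}', '{"a":', '[1,2]', 'not json'])); // 0.5
```

Parsing is a lower bar than schema conformance: an output can be valid JSON and still miss required fields, so a schema check would report a rate at or below this one.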

4. Singleton Inference Assumption

Explanation: LlmInference processes one generation at a time. Concurrent calls throw "Previous invocation or loading is still ongoing," breaking multi-route SPAs.

Fix: Implement a FIFO queue with abort propagation. Call cancelAll() on component unmount. Provide forceReset() for recovery.

5. Streaming Endpoint Misconfiguration

Explanation: streamGenerateContent returns 400 for structured output prompts on Gemma 4 26B. The API appears to reject certain responseSchema combinations over SSE, while the same payload succeeds on the non-streaming endpoint.

Fix: Use generateContent for structured features. Reserve streaming for conversational chat. Validate endpoint behavior in staging before production rollout.
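Keeping the endpoint choice in one function makes this rule hard to bypass by accident. The sketch below reuses the feature names from the routing step; which features count as structured is an assumption made for illustration.

```typescript
type Endpoint = 'generateContent' | 'streamGenerateContent';

// Features whose output must match a schema. The membership is illustrative.
const STRUCTURED_FEATURES: ReadonlySet<string> = new Set(['quiz', 'diagram', 'ocr']);

// Structured features go to the non-streaming endpoint; everything else streams.
function endpointFor(feature: string): Endpoint {
  return STRUCTURED_FEATURES.has(feature) ? 'generateContent' : 'streamGenerateContent';
}
```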

6. Missing Lifecycle Cleanup

Explanation: Navigating away mid-generation leaves the inference engine locked. Subsequent pages hang silently, causing perceived app crashes.

Fix: Bind cancelAll() to React useEffect cleanup, Vue onUnmounted, or Svelte onDestroy. Log abort events for debugging.
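A framework-neutral way to express this fix is a small factory that returns the cleanup callback, which each framework's teardown hook can then consume. makeUnmountCleanup and the component name are hypothetical; the only thing assumed about the queue is that it has a cancelAll() method.

```typescript
interface Cancellable {
  cancelAll(): void;
}

// Builds a cleanup callback that cancels queued inference and logs the event.
// `log` defaults to console.debug and can be swapped for an app logger.
function makeUnmountCleanup(
  queue: Cancellable,
  componentName: string,
  log: (message: string) => void = console.debug,
): () => void {
  return () => {
    queue.cancelAll();
    log(`[inference] cancelled on unmount of ${componentName}`);
  };
}

// Framework wiring (not executed here):
//   React:  useEffect(() => makeUnmountCleanup(queue, 'ChatPanel'), [queue]);
//   Vue:    onUnmounted(makeUnmountCleanup(queue, 'ChatPanel'));
//   Svelte: onDestroy(makeUnmountCleanup(queue, 'ChatPanel'));
```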

7. Hardcoded Routing Logic

Explanation: Routing decisions embedded in UI components create tight coupling and make feature toggles hard to add or change.

Fix: Centralize routing in a FeatureRouter class. Drive configuration from environment variables or user settings. Enable runtime feature flags for A/B testing.
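Driving the router from an environment variable mostly comes down to parsing a string defensively. The sketch below assumes a comma-separated setting (the variable name is up to the build tool and is not specified by the source) and reuses the feature names from the routing step; parseFeatureList is a hypothetical helper.

```typescript
const KNOWN_FEATURES = ['chat', 'tutoring', 'quiz', 'diagram', 'ocr'] as const;
type KnownFeature = (typeof KNOWN_FEATURES)[number];

// Parses a comma-separated setting such as "chat, tutoring" into known features.
// Unknown or empty entries are dropped; a missing setting yields the fallback.
function parseFeatureList(
  raw: string | undefined,
  fallback: KnownFeature[],
): KnownFeature[] {
  if (raw === undefined || raw.trim() === '') return fallback;
  const known = new Set<string>(KNOWN_FEATURES);
  return raw
    .split(',')
    .map(entry => entry.trim().toLowerCase())
    .filter((entry): entry is KnownFeature => known.has(entry));
}
```

The parsed lists can then populate localFeatures and cloudFeatures when the FeatureRouter is constructed, so a flag change needs no edits to UI components.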

Production Bundle

Action Checklist

Verify WebGPU dispatch routing using chrome://gpu and hardware monitor before deployment
Calculate VRAM budget: availableVRAM - 1.5GB >= modelSize
Replace Transformers.js with MediaPipe tasks-genai for production WebGPU builds
Implement feature-based routing: local for conversational, cloud for structured
Build FIFO queue with abort propagation and unmount cleanup
Route structured prompts to generateContent, chat to streamGenerateContent
Add UI indicators for engine status (local vs cloud vs unavailable)
Test navigation mid-generation to verify queue abort and recovery paths

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Consumer laptop (4–6 GB VRAM)	Gemma 4 E2B-IT + MediaPipe	Fits dedicated VRAM, avoids PCIe spillover	Zero server cost, higher client CPU/GPU usage
Mid-range GPU (8+ GB VRAM)	Gemma 4 E4B-IT + MediaPipe	Better reasoning, still fits VRAM	Zero server cost, moderate client resource usage
Structured output (JSON/Mermaid)	Gemma 4 26B-A4B-IT + Cloud API	>95% schema validity, lower latency than 31B Dense	API costs scale with usage, predictable latency
Offline-first requirement	Gemma 4 E2B-IT + Local MediaPipe	No network dependency, full feature parity for chat	One-time model download (~1.5 GB), no recurring costs
High-concurrency SPA	FIFO Queue + Abort on Unmount	Prevents singleton locks, ensures navigation safety	Negligible memory overhead, improves UX stability

Configuration Template

// inference.config.ts
export const INFERENCE_CONFIG = {
  local: {
    modelUrl: 'https://huggingface.co/litert-community/gemma-4-e2b-it/resolve/main/gemma-4-e2b-it-int4-web.task',
    maxTokens: 2048,
    temperature: 0.7,
    topK: 40,
    minVRAMGB: 4,
    features: ['chat', 'tutoring', 'math-explanation', 'socratic-dialogue'],
  },
  cloud: {
    modelId: 'gemma-4-26b-a4b-it',
    endpoint: 'generateContent', // Use non-streaming for structured
    features: ['quiz-generation', 'mermaid-mindmap', 'svg-illustration', 'handwriting-ocr'],
    requiresApiKey: true,
  },
  routing: {
    fallbackToUnavailable: true,
    showEngineBadge: true,
    abortOnNavigation: true,
  },
};

Quick Start Guide

1. Install Runtime: npm install @mediapipe/tasks-genai
2. Initialize Backend: Create a MediaPipeBackend instance and call initialize() with the .task model URL during app bootstrap.
3. Configure Router: Instantiate FeatureRouter with routing rules derived from INFERENCE_CONFIG.
4. Wire Queue: Send all generation calls through InferenceQueue.enqueue(). Bind cancelAll() to component lifecycle hooks.
5. Validate Dispatch: Open chrome://gpu, confirm WebGPU uses the discrete GPU, and verify throughput exceeds 10 tok/s on target hardware.
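The throughput check in the last step is simple arithmetic once a run has been timed. This sketch assumes the caller supplies the generated token count and the elapsed wall-clock time; how tokens are counted is left open, and the function names are illustrative.

```typescript
// Tokens per second for one generation run.
function tokensPerSecond(tokenCount: number, elapsedMs: number): number {
  if (elapsedMs <= 0) throw new RangeError('elapsedMs must be positive');
  return tokenCount / (elapsedMs / 1000);
}

// The default floor of 10 tok/s comes from the validation step above.
function meetsThroughputFloor(
  tokenCount: number,
  elapsedMs: number,
  floor: number = 10,
): boolean {
  return tokensPerSecond(tokenCount, elapsedMs) >= floor;
}

console.log(tokensPerSecond(300, 20000));     // 15
console.log(meetsThroughputFloor(40, 20000)); // false (2 tok/s)
```

Timing can be taken with performance.now() around a generation call. A result near 2 tok/s on hardware that should reach about 15 points back to the dispatch and VRAM issues described earlier.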
