I Ran AI Models Directly in the Browser and Measured What It Did to Core Web Vitals
Client-Side AI Performance: Architecting for INP and Main Thread Safety
Current Situation Analysis
The web development landscape is rapidly shifting toward client-side artificial intelligence. Privacy regulations, reduced server costs, and the demand for instant feedback have pushed teams to run neural networks directly in the browser. Libraries like Transformers.js have made this accessible, allowing developers to load Hugging Face models via WebAssembly without writing custom inference engines. The pitch is compelling: keep data on-device, eliminate network round-trips, and deliver offline-capable features.
Yet a critical blind spot persists. Teams are shipping AI features without measuring the actual cost to user interactivity. The focus remains on model accuracy, download size, or privacy compliance, while the browser's main thread bears the hidden tax. When a neural network runs synchronously, it monopolizes the JavaScript execution context. Any user interaction (a tap, a keystroke, a scroll) gets queued until the inference completes. That queue time translates directly into Interaction to Next Paint (INP), the metric Google adopted in March 2024 to replace First Input Delay.
INP measures the latency between a user action and the browser's next visual update. Google's classification thresholds are strict:
- Good: under 200ms
- Needs Improvement: 200ms to 500ms
- Poor: over 500ms
Crossing into "Poor" doesn't just degrade perceived performance; it impacts search rankings and increases bounce rates. The misconception driving this issue is the belief that quantization or smaller parameter counts automatically guarantee smooth interactivity. In reality, architecture dictates execution behavior far more than weight size. Encoder-only transformers, autoregressive decoder loops, and vision transformers each interact with the main thread in fundamentally different ways. Without architectural routing, even a 5.7M parameter model can push INP into warning territory, while a 39M parameter speech model can completely stall user input.
The industry has treated client-side AI as a drop-in feature. It is not. It is a main thread workload that requires explicit scheduling, context isolation, and memory pressure monitoring. The following analysis breaks down how different model architectures impact INP, why parameter count is a misleading optimization target, and how to architect inference pipelines that preserve interactivity.
WOW Moment: Key Findings
Benchmarking four quantized models in Chrome stable reveals a counterintuitive performance landscape. Parameter count fails to predict main thread blocking. Architecture and execution pattern are the true determinants of INP degradation.
| Architecture Pattern | Avg INP | Inference Latency | Memory Pressure |
|---|---|---|---|
| Encoder-Only (DistilBERT) | 27.8ms | 25.1ms ±0.5 | 2.5% |
| Encoder-Only (BERT-base) | 85.0ms | 83.3ms ±1.5 | 4.1% |
| Encoder-Decoder (Whisper Tiny) | 540.3ms | 496.9ms ±6.2 | 7.1% |
| Vision Transformer (MobileViT-S) | 75.6ms | 66.7ms ±1.0 | 8.0% |
The data exposes three critical insights:
Autoregressive decoding dominates blocking time. Whisper Tiny carries only 39M parameters, yet generates the worst INP at 540.3ms. Unlike encoder-only models that process input in a single forward pass, encoder-decoder architectures run iterative decode loops. Each generated token requires a separate inference step, and every step blocks the main thread until completion. Quantization reduces weight size, but it cannot eliminate the cumulative blocking time of sequential token generation.
Vision transformers carry disproportionate WASM overhead. MobileViT-S loads in 1.15 seconds, roughly six times faster than text models. Despite having only 5.7M parameters, its INP sits at 75.6ms, nearly three times DistilBERT's 27.8ms and close to BERT-base's 85.0ms. That is still inside the "Good" band on the benchmark machine, but it leaves little headroom once slower devices multiply the cost. Vision transformers rely heavily on matrix multiplication and attention mechanisms that translate inefficiently to WebAssembly execution. The computational graph is heavier per parameter than text-based encoders, and WASM compilation overhead amplifies the cost.
Memory delta and heap pressure are decoupled. MobileViT-S consumes the least absolute memory (+37.0MB) but registers the highest memory pressure at 8.0%. Memory delta measures raw allocation, while memory pressure reflects the percentage of available JavaScript heap consumed. On mid-range Android devices with tighter heap limits, that 37MB allocation triggers garbage collection cycles far more aggressively than a larger allocation on a desktop environment.
These findings shift the optimization paradigm. Shrinking models or applying aggressive quantization will not resolve main thread blocking if the execution context remains unmanaged. The solution lies in architectural routing: matching model topology to the appropriate execution environment and scheduling inference to avoid interaction collisions.
Core Solution
Building client-side AI that preserves INP requires a deliberate execution strategy. The implementation must separate model loading, context routing, interaction scheduling, and memory management into distinct, testable layers.
Step 1: Architectural Context Routing
Not all models belong on the main thread. The routing decision should be based on architecture, not size.
- Encoder-only models (e.g., DistilBERT, BERT-base) perform single-pass inference. They can safely execute on the main thread if latency stays under 50ms.
- Encoder-decoder models (e.g., Whisper, T5, BART) use autoregressive generation. They must run in a Web Worker to prevent iterative blocking.
- Vision transformers require careful heap monitoring. They can run on the main thread for small batches, but should be offloaded for real-time camera feeds or continuous classification.
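The three routing rules above can be sketched as a small pure function. This is a minimal sketch, not part of Transformers.js: `chooseContext`, `RoutingInput`, and the `continuous` flag are hypothetical names, and the 50ms budget is simply the figure quoted in the list. It follows the rules literally, so a vision model only moves to a worker when it is marked as a continuous workload.

```typescript
type ModelTopology = 'encoder-only' | 'encoder-decoder' | 'vision';
type InferenceContext = 'main' | 'worker';

interface RoutingInput {
  topology: ModelTopology;
  /** Measured single-inference latency on the target device, in milliseconds. */
  measuredLatencyMs: number;
  /** True for camera feeds or other continuous classification (hypothetical flag). */
  continuous?: boolean;
}

// Main-thread budget taken from the routing rules above.
const MAIN_THREAD_BUDGET_MS = 50;

function chooseContext(input: RoutingInput): InferenceContext {
  switch (input.topology) {
    case 'encoder-decoder':
      // Autoregressive decoding blocks once per generated token: always isolate it.
      return 'worker';
    case 'vision':
      // Small batches may stay on the main thread; continuous feeds are offloaded.
      return input.continuous ? 'worker' : 'main';
    case 'encoder-only':
      // Single-pass inference is only safe on the main thread within the budget.
      return input.measuredLatencyMs < MAIN_THREAD_BUDGET_MS ? 'main' : 'worker';
  }
}
```

With the benchmark latencies from the table, DistilBERT (25.1ms) stays on the main thread while BERT-base (83.3ms) is routed to a worker, because it exceeds the 50ms budget.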
Step 2: Interaction-Safe Scheduling
Triggering inference directly on user events is the primary cause of INP degradation. Instead, decouple the interaction from the computation. Use scheduler.postTask or requestIdleCallback to defer inference until after the browser has painted the interaction response. For larger encoders, schedule inference as a background task that runs post-paint, ensuring the UI thread remains free for subsequent inputs.
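The decoupling described above can be sketched with an injectable scheduler. This is an illustration under stated assumptions: `pickScheduler` and `deferInference` are hypothetical helper names, and the `setTimeout` fallback is an addition for browsers that ship neither `scheduler.postTask` nor `requestIdleCallback`.

```typescript
type Schedule = (callback: () => void) => void;

/** Picks the best available deferral primitive, falling back to setTimeout. */
function pickScheduler(g: any = globalThis): Schedule {
  if (g.scheduler && typeof g.scheduler.postTask === 'function') {
    return (cb) => { g.scheduler.postTask(cb, { priority: 'background' }); };
  }
  if (typeof g.requestIdleCallback === 'function') {
    return (cb) => { g.requestIdleCallback(() => cb()); };
  }
  return (cb) => { g.setTimeout(cb, 0); };
}

/** Runs `task` later via `schedule` and exposes its result as a promise. */
function deferInference<T>(
  task: () => T | Promise<T>,
  schedule: Schedule = pickScheduler()
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    schedule(() => {
      try {
        // resolve() adopts a returned promise, so async rejections propagate too.
        resolve(task());
      } catch (err) {
        reject(err);
      }
    });
  });
}
```

In an event handler, the idea is to update the UI synchronously (for example, show a spinner) and then call `deferInference(() => classifier(text))`, where `classifier` stands for whatever pipeline function the app already holds.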
Step 3: WASM Module Caching and Preloading
WebAssembly compilation is a one-time cost, but it blocks the main thread if executed during interaction. Preload the WASM module during app initialization or route it through a service worker cache. Transformers.js lets you point the runtime at self-hosted WASM binaries via env.backends.onnx.wasm.wasmPaths, so the files can be cached alongside your other static assets. Compiling the module ahead of time eliminates the initial spike that would otherwise corrupt INP measurements.
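A bootstrap-time preload might look like the sketch below. It is written against a structural stand-in (`WasmEnvLike`) rather than importing the library, so that the example stays self-contained. In an app you would pass the `env` object exported by Transformers.js. `configureWasm`, `warmUp`, and the `/assets/ort` path are hypothetical.

```typescript
/** Structural subset of the Transformers.js `env` object that this sketch touches. */
interface WasmEnvLike {
  backends: { onnx: { wasm: { wasmPaths?: unknown; numThreads?: number } } };
}

interface PreloadOptions {
  /** Directory URL that serves the ONNX Runtime .wasm files (hypothetical path). */
  wasmBaseUrl: string;
  numThreads?: number;
}

function configureWasm(env: WasmEnvLike, options: PreloadOptions): void {
  // The runtime treats a string value as a prefix, so keep the trailing slash.
  const base = options.wasmBaseUrl.endsWith('/')
    ? options.wasmBaseUrl
    : options.wasmBaseUrl + '/';
  env.backends.onnx.wasm.wasmPaths = base;
  if (options.numThreads !== undefined) {
    env.backends.onnx.wasm.numThreads = options.numThreads;
  }
}

/** Runs one throwaway inference at bootstrap so compilation happens before any interaction. */
function warmUp<I>(run: (input: I) => Promise<unknown>, sample: I): Promise<void> {
  return run(sample).then(() => undefined);
}
```

Calling `configureWasm(env, { wasmBaseUrl: '/assets/ort' })` before the first `pipeline(...)` call, followed by `warmUp` with a tiny sample input, moves both the download and the compilation into app startup.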
Step 4: Memory Pressure Monitoring
Heap pressure correlates with garbage collection frequency, which introduces micro-stutters that inflate INP. Implement a lightweight monitor that tracks performance.memory (where available) or estimates pressure via allocation patterns. If pressure exceeds 6%, throttle inference frequency or switch to a worker context to isolate the heap.
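A minimal sketch of such a monitor follows. `performance.memory` is a non-standard, Chromium-only API, so the reader returns `null` elsewhere. `readHeap`, `heapPressure`, and `decideAction` are hypothetical names, and proceeding when no signal is available is an assumption of this sketch, not a recommendation from the benchmark.

```typescript
interface HeapSnapshot {
  usedJSHeapSize: number;
  jsHeapSizeLimit: number;
}

/** Reads Chromium's non-standard performance.memory, or returns null where it is missing. */
function readHeap(perf: any = (globalThis as any).performance): HeapSnapshot | null {
  const mem = perf && perf.memory;
  if (!mem || !(mem.jsHeapSizeLimit > 0)) return null;
  return { usedJSHeapSize: mem.usedJSHeapSize, jsHeapSizeLimit: mem.jsHeapSizeLimit };
}

/** Fraction of the available JavaScript heap currently in use (0 to 1). */
function heapPressure(snapshot: HeapSnapshot): number {
  return snapshot.usedJSHeapSize / snapshot.jsHeapSizeLimit;
}

type HeapAction = 'run' | 'throttle';

function decideAction(snapshot: HeapSnapshot | null, maxPressure = 0.06): HeapAction {
  // Assumption: with no heap signal (non-Chromium browsers) inference proceeds.
  if (snapshot === null) return 'run';
  return heapPressure(snapshot) > maxPressure ? 'throttle' : 'run';
}
```

Calling `decideAction(readHeap())` before each inference gives a cheap gate. The 0.06 default mirrors the 6% threshold above.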
Implementation Example
The following TypeScript implementation demonstrates a production-ready inference router. It abstracts context routing, post-paint scheduling, and worker isolation into a single interface.
import { pipeline, env } from '@xenova/transformers';

type InferenceContext = 'main' | 'worker';
type ModelTopology = 'encoder-only' | 'encoder-decoder' | 'vision';

interface InferenceConfig {
  modelId: string;
  task: string;
  topology: ModelTopology;
  quantization: 'int8' | 'fp16';
  maxHeapPressure: number;
}

class NeuralExecutor {
  private pipeline: any;
  private context: InferenceContext;
  private config: InferenceConfig;
  // Settles once the model has loaded; predict() awaits it so early calls never hit an undefined pipeline.
  private ready: Promise<void>;

  constructor(config: InferenceConfig) {
    this.config = config;
    this.context = this.resolveContext(config.topology);
    this.ready = this.initializePipeline();
  }

  private resolveContext(topology: ModelTopology): InferenceContext {
    if (topology === 'encoder-decoder') return 'worker';
    if (topology === 'vision') return 'worker';
    return 'main';
  }

  private async initializePipeline(): Promise<void> {
    env.allowLocalModels = true;
    // These ONNX Runtime Web flags are global, so the last executor to initialize wins.
    env.backends.onnx.wasm.numThreads = this.context === 'worker' ? 2 : 1;
    // Assumption: the runtime's proxy-worker flag replaces the original `worker: true` and
    // `device: 'wasm'` options, which are not pipeline options in @xenova/transformers v2.
    env.backends.onnx.wasm.proxy = this.context === 'worker';
    this.pipeline = await pipeline(this.config.task as any, this.config.modelId, {
      quantized: this.config.quantization === 'int8',
    });
  }

  async predict(input: string | Uint8Array): Promise<any> {
    await this.ready;
    if (this.context === 'main') {
      return this.schedulePostPaint(input);
    }
    return this.pipeline(input);
  }

  private schedulePostPaint(input: any): Promise<any> {
    return new Promise((resolve, reject) => {
      // Forward failures as well, so callers are not left with a promise that never settles.
      const run = () => this.pipeline(input).then(resolve, reject);
      const scheduler = (globalThis as any).scheduler;
      if (scheduler && typeof scheduler.postTask === 'function') {
        scheduler.postTask(run, { priority: 'background' });
      } else if (typeof requestIdleCallback === 'function') {
        requestIdleCallback(run);
      } else {
        setTimeout(run, 0); // Fallback for browsers without requestIdleCallback
      }
    });
  }
}

export { NeuralExecutor };
export type { InferenceConfig };
Architecture Rationale:
- resolveContext() enforces architectural routing. Encoder-decoder and vision models automatically spawn workers, eliminating main thread blocking by design.
- schedulePostPaint() defers encoder-only inference until after the browser paints. This prevents interaction queue buildup and keeps INP under 200ms.
- WASM thread allocation scales with context. Workers get 2 threads for parallel matrix operations; main thread stays single-threaded to avoid contention.
- The pipeline initialization runs once during app bootstrap, moving WASM compilation off the critical path.
Pitfall Guide
1. The Parameter Count Fallacy
Explanation: Assuming smaller models automatically produce lower INP. Whisper Tiny (39M) blocks the main thread longer than DistilBERT (66M) because of autoregressive decoding.
Fix: Route based on topology, not size. Use encoder-only for main thread, offload decoder/vision models to workers.
2. Synchronous Inference on User Actions
Explanation: Calling pipeline(input) directly inside onClick or onInput handlers queues the inference on the main thread, blocking subsequent interactions until completion.
Fix: Decouple interaction from computation. Use scheduler.postTask or requestIdleCallback to defer execution until after the next paint.
3. Ignoring Autoregressive Decode Loops
Explanation: Encoder-decoder models generate output token-by-token. Each iteration is a separate inference call. Even with INT8 quantization, the cumulative blocking time exceeds 500ms.
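Fix: Always run encoder-decoder models in a Web Worker. With Transformers.js this typically means creating the pipeline inside a dedicated worker script and exchanging inputs and results via postMessage. ONNX Runtime Web's env.backends.onnx.wasm.proxy = true flag is an alternative that proxies inference calls to a worker.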
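The main-thread side of a worker setup can be sketched as a small request/response correlator. Everything here is hypothetical scaffolding rather than a Transformers.js API: `InferenceClient`, the message shapes, and the worker script it talks to are all assumptions. The transport is injected as a plain function so the class can be exercised without a real `Worker`.

```typescript
interface InferenceRequest { id: number; input: unknown; }
interface InferenceResponse { id: number; output?: unknown; error?: string; }

interface PendingCall {
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
}

class InferenceClient {
  private nextId = 0;
  private pending = new Map<number, PendingCall>();
  private post: (message: InferenceRequest) => void;

  constructor(post: (message: InferenceRequest) => void) {
    this.post = post;
  }

  /** Number of requests sent to the worker that have not been answered yet. */
  get inFlight(): number {
    return this.pending.size;
  }

  /** Sends one input to the worker and resolves when the matching reply arrives. */
  run(input: unknown): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.post({ id, input });
    });
  }

  /** Call this from the worker's message handler with the event's data. */
  handleMessage(message: InferenceResponse): void {
    const call = this.pending.get(message.id);
    if (!call) return; // Unknown or already-settled id: ignore.
    this.pending.delete(message.id);
    if (message.error !== undefined) call.reject(new Error(message.error));
    else call.resolve(message.output);
  }
}
```

In the page, the wiring would be along the lines of `new InferenceClient((m) => worker.postMessage(m))` plus `worker.onmessage = (e) => client.handleMessage(e.data)`. The worker script builds the pipeline once and replies with the same `id` it received, so the decode loop never touches the main thread.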
4. Desktop-Only Performance Validation
Explanation: Benchmarking on Apple M-series hardware masks mobile constraints. Mid-range Android devices have tighter heap limits and slower WASM compilation, inflating INP by 3 to 5x.
Fix: Validate on physical mid-range devices. Chrome DevTools CPU throttling is a useful first approximation, but it does not emulate a smaller heap, so memory behavior still needs real hardware. Adjust thresholds accordingly.
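To see what that inflation means for the benchmark numbers, the sketch below projects each desktop INP through the 3x to 5x range and rates the result against the INP thresholds. The multiplier range is the article's estimate, not a measured constant, and `rateInp` and `projectToMobile` are hypothetical helper names.

```typescript
type InpRating = 'good' | 'needs-improvement' | 'poor';

/** Rates an INP value in milliseconds against the 200ms and 500ms thresholds. */
function rateInp(ms: number): InpRating {
  if (ms <= 200) return 'good';
  if (ms <= 500) return 'needs-improvement';
  return 'poor';
}

interface MobileProjection {
  low: InpRating;  // rating at the optimistic end of the inflation range
  high: InpRating; // rating at the pessimistic end
}

function projectToMobile(desktopInpMs: number, minFactor = 3, maxFactor = 5): MobileProjection {
  return {
    low: rateInp(desktopInpMs * minFactor),
    high: rateInp(desktopInpMs * maxFactor),
  };
}

// Desktop INP values from the benchmark table.
const desktopInp: Record<string, number> = {
  'DistilBERT': 27.8,
  'BERT-base': 85.0,
  'Whisper Tiny': 540.3,
  'MobileViT-S': 75.6,
};

for (const [model, inp] of Object.entries(desktopInp)) {
  const projection = projectToMobile(inp);
  console.log(`${model}: ${projection.low} to ${projection.high}`);
}
```

Under this assumption DistilBERT stays "good" across the whole range, while BERT-base and MobileViT-S both land in "needs-improvement", which is why desktop-only validation is misleading for them.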
5. Memory Delta vs Heap Pressure Confusion
Explanation: Tracking only absolute memory allocation (+37MB) ignores heap utilization percentage. MobileViT-S shows the highest pressure (8.0%) despite lowest delta, triggering aggressive garbage collection.
Fix: Monitor performance.memory.usedJSHeapSize / performance.memory.jsHeapSizeLimit. Throttle inference or switch to workers when pressure exceeds 6%.
6. WASM Compilation Overlooked
Explanation: The first inference call triggers WASM compilation, which blocks the main thread for 200 to 400ms. This spike corrupts INP measurements and degrades initial UX.
Fix: Precompile WASM modules during app initialization. Serve the WASM binaries from a stable URL via env.backends.onnx.wasm.wasmPaths and cache them with a service worker so repeat visits skip the download.
7. Missing Loading State Management
Explanation: Users interact before the model finishes loading or compiling. Unhandled promises or silent failures create inconsistent states and orphaned main thread tasks.
Fix: Implement explicit loading states (idle, compiling, ready, error). Disable interactive triggers until the pipeline emits a ready event. Queue inputs during compilation if necessary.
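One way to sketch those loading states is a small gate that queues inputs while the model compiles and flushes them once it is ready. `ModelGate` and its method names are hypothetical, and dropping queued inputs on error is a design assumption of this sketch.

```typescript
type LoadState = 'idle' | 'compiling' | 'ready' | 'error';

class ModelGate<I> {
  private state: LoadState = 'idle';
  private queue: I[] = [];
  private onInput: (input: I) => void;

  constructor(onInput: (input: I) => void) {
    this.onInput = onInput;
  }

  get current(): LoadState { return this.state; }
  get queued(): number { return this.queue.length; }

  /** Call when model download/compilation begins (also allows retry after an error). */
  startLoading(): void {
    if (this.state === 'idle' || this.state === 'error') this.state = 'compiling';
  }

  /** Call when the pipeline is usable; flushes anything queued during compilation. */
  markReady(): void {
    if (this.state !== 'compiling') return;
    this.state = 'ready';
    const backlog = this.queue;
    this.queue = [];
    for (const input of backlog) this.onInput(input);
  }

  /** Call when loading fails; queued inputs are dropped (assumption of this sketch). */
  markError(): void {
    this.state = 'error';
    this.queue = [];
  }

  /** Returns true if the input was run or queued, false if it was rejected. */
  submit(input: I): boolean {
    if (this.state === 'ready') { this.onInput(input); return true; }
    if (this.state === 'compiling') { this.queue.push(input); return true; }
    return false;
  }
}
```

UI code can bind a button's disabled state to `gate.current !== 'ready'` and still accept text typed during compilation through `submit`, which simply queues it.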
Production Bundle
Action Checklist
- Classify models by topology before integration: encoder-only, encoder-decoder, or vision transformer
- Route encoder-decoder and vision models to Web Workers by default
- Defer main thread inference using scheduler.postTask or requestIdleCallback
- Precompile and cache WASM modules during app initialization
- Implement heap pressure monitoring with a 6% throttle threshold
- Validate INP on mid-range Android devices, not just desktop hardware
- Add explicit loading states to prevent interaction during compilation
- Log INP and memory pressure in production using Performance Observer
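For the logging item, a rough field-measurement sketch is shown below. It approximates INP from Event Timing entries by taking the worst interaction and skipping one outlier per 50 interactions. `estimateInp` and `observeInteractions` are hypothetical names, the 40ms duration threshold is an assumption, and a maintained library such as web-vitals is likely the safer choice in production.

```typescript
/** Approximates INP: the worst interaction, ignoring one outlier per 50 interactions. */
function estimateInp(durationsMs: number[]): number | null {
  if (durationsMs.length === 0) return null;
  const sorted = [...durationsMs].sort((a, b) => b - a);
  const skip = Math.min(sorted.length - 1, Math.floor(sorted.length / 50));
  return sorted[skip];
}

/** Observes Event Timing entries where supported; returns a function that stops observing. */
function observeInteractions(onEstimate: (inpMs: number) => void): () => void {
  const PO = (globalThis as any).PerformanceObserver;
  const supported = PO && (PO.supportedEntryTypes || []).includes('event');
  if (!supported) return () => {}; // No Event Timing support: do nothing.

  const worstPerInteraction = new Map<number, number>();
  const observer = new PO((list: any) => {
    for (const entry of list.getEntries()) {
      if (!entry.interactionId) continue; // Skip entries that are not discrete interactions.
      const previous = worstPerInteraction.get(entry.interactionId) ?? 0;
      worstPerInteraction.set(entry.interactionId, Math.max(previous, entry.duration));
    }
    const inp = estimateInp([...worstPerInteraction.values()]);
    if (inp !== null) onEstimate(inp);
  });
  observer.observe({ type: 'event', durationThreshold: 40, buffered: true });
  return () => observer.disconnect();
}
```

The callback can forward each estimate to whatever analytics endpoint the app already uses, ideally together with a heap pressure sample taken at the same moment.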
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time sentiment analysis | Main thread + post-paint scheduling | Single-pass encoder, low latency, preserves INP | Low compute, minimal memory overhead |
| Speech transcription | Web Worker execution | Autoregressive decode loop blocks main thread iteratively | Higher memory, requires worker serialization |
| Image classification (batch) | Main thread with heap monitoring | Vision transformers have high WASM cost; batch processing isolates spikes | Moderate memory, requires pressure throttling |
| Continuous camera tagging | Web Worker + frame throttling | Real-time video feeds exceed main thread budget; workers isolate GC pressure | High compute, requires frame sampling strategy |
| Feature extraction for search | Background worker + queue | Large encoder models (BERT-base) risk 85ms+ blocking; backgrounding preserves UI | Low INP impact, higher server-like compute cost |
Configuration Template
import { NeuralExecutor } from './NeuralExecutor';
import type { InferenceConfig } from './NeuralExecutor';

// Production-ready configuration registry
const MODEL_REGISTRY: Record<string, InferenceConfig> = {
  sentiment: {
    modelId: 'Xenova/distilbert-base-uncased-finetuned-sst-2-english',
    task: 'text-classification',
    topology: 'encoder-only',
    quantization: 'int8',
    maxHeapPressure: 0.06
  },
  transcription: {
    modelId: 'Xenova/whisper-tiny',
    task: 'automatic-speech-recognition',
    topology: 'encoder-decoder',
    quantization: 'int8',
    maxHeapPressure: 0.08
  },
  imageTag: {
    modelId: 'Xenova/mobilevit-small',
    task: 'image-classification',
    topology: 'vision',
    quantization: 'int8',
    maxHeapPressure: 0.06
  }
};

// Initialize executors with error boundaries.
// Note: this try/catch only covers synchronous constructor errors; model loading fails asynchronously.
export function initializeAIModels() {
  const executors: Record<string, NeuralExecutor> = {};
  for (const [key, config] of Object.entries(MODEL_REGISTRY)) {
    try {
      executors[key] = new NeuralExecutor(config);
      console.info(`[AI] ${key} executor initialized in ${executors[key]['context']} context`);
    } catch (error) {
      console.error(`[AI] Failed to initialize ${key}:`, error);
    }
  }
  return executors;
}
Quick Start Guide
- Install dependencies: Run npm install @xenova/transformers and ensure your bundler supports WebAssembly and worker imports.
- Define your topology: Classify each model as encoder-only, encoder-decoder, or vision transformer. This dictates execution context.
- Initialize the executor: Import NeuralExecutor, pass your configuration, and call initializeAIModels() during app bootstrap.
- Route predictions: Call executor.predict(input) for main thread models or rely on automatic worker routing for decoders/vision models.
- Validate INP: Open Chrome DevTools, enable Performance panel, record a user interaction, and verify that INP stays under 200ms. Adjust scheduling or context routing if thresholds are breached.
