Build a Private AI Search on Your Device: Local RAG in the Browser

By Codcompass Team·2026-05-27·7 min read

Client-Side Vector Search: Architecting Zero-Backend RAG Pipelines in Modern Browsers

Current Situation Analysis

The standard Retrieval-Augmented Generation (RAG) stack has long been tethered to backend infrastructure. Developers typically route documents through cloud APIs, store embeddings in managed vector databases, and pay per-token for inference. While effective, this architecture introduces three compounding constraints: data egress compliance risks, recurring infrastructure costs, and network-dependent latency.

This problem is frequently misunderstood because browser capabilities have evolved faster than developer mental models. Many teams still assume client-side machine learning is prohibitively slow or that browser storage is limited to the legacy 5MB LocalStorage quota. In reality, modern web standards have closed these gaps. The Origin Private File System (OPFS) provides origin-isolated storage scaling into the gigabytes with low-latency sequential I/O. Web Workers, combined with structured cloning and SharedArrayBuffer, enable true parallelism without main-thread contention. Meanwhile, ONNX Runtime Web and Transformers.js have optimized transformer inference for CPU and WebGPU, making sub-100ms embedding generation feasible on consumer hardware.

The industry pain point is clear: organizations handling sensitive intellectual property, legal contracts, or internal engineering documentation cannot legally or practically upload raw text to third-party inference endpoints. Yet, building a local alternative has historically required Electron wrappers or native desktop applications. The browser now offers a viable, standards-compliant path to run complete RAG pipelines without leaving the client environment.

WOW Moment: Key Findings

Shifting RAG execution from cloud to client fundamentally alters the cost, latency, and compliance profile of AI search. The following comparison isolates the architectural trade-offs:

Approach	Data Egress	Infrastructure Cost	Cold Start Latency	Privacy Model
Cloud-Hosted RAG	High (uploads to API)	$50-$500+/mo	~200-800ms (network)	Trust-based (provider policy)
Browser-Native RAG	Zero	$0	~1.5-3s (model load)	Architectural (device-bound)

This finding matters because it decouples AI search from vendor lock-in and data processing agreements. When embeddings are generated and queried entirely within the browser's sandbox, the privacy guarantee becomes structural rather than contractual. It also enables offline-first workflows, edge deployments on restricted networks, and zero-cost scaling for internal tooling. The trade-off is upfront model loading time and reliance on client hardware, but for document sets under 500MB, modern CPUs handle the workload efficiently.

Core Solution

Building a browser-native RAG pipeline requires coordinating three subsystems: a background execution layer, a local inference engine, and a persistent vector store. The architecture follows a strict unidirectional flow:

Document Ingestion → Text Extraction → Semantic Chunking → Local Embedding → Vector Serialization → OPFS Persistence → Similarity Search

Phase 1: Background Execution Layer

Vector operations are CPU-intensive. Running them on the main thread will cause frame drops and unresponsive UI. We isolate indexing and search logic in a dedicated Web Worker. To avoid the boilerplate of postMessage and event listeners, we use Comlink, which proxies worker methods as native async functions.

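// main-thread.ts
import { wrap, type Remote } from "comlink";

// Assumed result shape: the article does not define SearchResult, so adjust
// this to whatever retrieveNearestNeighbors returns in the worker.
export interface SearchResult {
  text: string;
  score: number;
}

// Mirrors the methods passed to Comlink.expose() in vector-worker.ts.
interface VectorWorkerApi {
  executeIndexingPipeline(file: File): Promise<void>;
  runSemanticSearch(query: string, k: number): Promise<SearchResult[]>;
}

export class LocalSearchOrchestrator {
  private indexerProxy: Remote<VectorWorkerApi>;

  constructor() {
    // Module worker, resolved relative to this file by the bundler.
    const worker = new Worker(
      new URL("./vector-worker.ts", import.meta.url),
      { type: "module" }
    );
    this.indexerProxy = wrap<VectorWorkerApi>(worker);
  }

  async ingestDocument(file: File): Promise<void> {
    await this.indexerProxy.executeIndexingPipeline(file);
  }

  async queryDocuments(prompt: string, topK: number = 5): Promise<SearchResult[]> {
    return await this.indexerProxy.runSemanticSearch(prompt, topK);
  }
}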

Phase 2: Local Inference Engine

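We rely on @xenova/transformers (Transformers.js) for model management and onnxruntime-web for runtime execution. The pipeline loads a quantized embedding model, processes text chunks, and returns normalized float arrays. Relative to FP32 weights, FP16 roughly halves the model's size and 8-bit quantization roughly quarters it, typically with little accuracy loss for retrieval tasks. Option names depend on the library version: @xenova/transformers 2.x selects the 8-bit weights with quantized: true, while the dtype and device options belong to the 3.x release published as @huggingface/transformers.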

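// vector-worker.ts
import { pipeline, env } from "@xenova/transformers";
import * as Comlink from "comlink";
import { persistVectorsToStorage } from "./storage-adapter";
// splitIntoChunks and retrieveNearestNeighbors are not shown in this article;
// they must be defined or imported here as well.

env.allowLocalModels = true;
env.useBrowserCache = true;

let embedder: any = null;

async function initializeModel() {
  if (!embedder) {
    // @xenova/transformers 2.x selects the 8-bit ONNX weights with `quantized`
    // and runs on the WASM backend. The `dtype` and `device` options were
    // introduced with the 3.x release (@huggingface/transformers).
    embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", {
      quantized: true,
    });
  }
}

async function generateEmbeddings(textChunks: string[]): Promise<number[][]> {
  await initializeModel();
  // Mean pooling yields one vector per chunk: a tensor of shape [chunks, dim].
  const results = await embedder(textChunks, { pooling: "mean", normalize: true });
  const [count, dim] = results.dims as number[]; // dim is 384 for MiniLM-L6-v2
  const flat = results.data as Float32Array;
  // Slice the flat buffer into one plain array per chunk.
  return Array.from({ length: count }, (_, i) =>
    Array.from(flat.subarray(i * dim, (i + 1) * dim))
  );
}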

Comlink.expose({
  executeIndexingPipeline: async (file: File) => {
    const text = await file.text();
    const chunks = splitIntoChunks(text, 512, 64);
    const vectors = await generateEmbeddings(chunks);
    await persistVectorsToStorage(vectors, chunks);
  },
  runSemanticSearch: async (query: string, k: number) => {
    const queryVec = (await generateEmbeddings([query]))[0];
    return await retrieveNearestNeighbors(queryVec, k);
  }
});
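The worker calls retrieveNearestNeighbors without showing it. As a rough sketch of what the scoring step might look like, the function below ranks stored vectors by dot product against the query. The name topKByDotProduct, the ScoredChunk shape, and the in-memory arrays are assumptions for illustration; it also assumes every vector is already L2-normalized, so the dot product equals cosine similarity.

```typescript
// Hypothetical helper: brute-force top-k over unit-length vectors.
export interface ScoredChunk {
  text: string;
  score: number;
}

export function topKByDotProduct(
  query: number[],
  vectors: number[][],
  texts: string[],
  k: number
): ScoredChunk[] {
  // Score every stored vector against the query.
  const scored: ScoredChunk[] = vectors.map((vec, i) => {
    let dot = 0;
    for (let j = 0; j < vec.length; j++) dot += vec[j] * query[j];
    return { text: texts[i], score: dot };
  });
  // Highest similarity first, then keep the k best.
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, k);
}
```

Sorting the full list is the simplest option. For larger indexes a bounded heap would avoid sorting entries that can never reach the top k.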

Phase 3: OPFS Vector Persistence

IndexedDB is suitable for key-value metadata, but OPFS excels at sequential binary writes and large file handling. We serialize vectors into a structured binary format and write them to a dedicated OPFS file. This avoids JSON serialization overhead and enables faster reads during search.

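// storage-adapter.ts
const INDEX_FILE = "vector_store.bin";

export async function persistVectorsToStorage(vectors: number[][], metadata: string[]) {
  // Fetch the root and file handle fresh for each operation.
  const root = await navigator.storage.getDirectory();
  const handle = await root.getFileHandle(INDEX_FILE, { create: true });
  // createWritable() starts from an empty file by default,
  // so each call replaces the previous index.
  const writable = await handle.createWritable();

  try {
    const encoder = new TextEncoder();
    // One-line JSON header so a reader knows the record count and vector width.
    const header = JSON.stringify({ count: vectors.length, dim: vectors[0]?.length || 0 });
    await writable.write(encoder.encode(header + "\n"));

    for (let i = 0; i < vectors.length; i++) {
      const floatBuffer = new Float32Array(vectors[i]);
      const metaBuffer = encoder.encode(metadata[i]);
      // Chunk text can contain newlines, so prefix it with its byte length
      // instead of relying on "\n" as a terminator.
      await writable.write(floatBuffer);
      await writable.write(new Uint32Array([metaBuffer.byteLength]));
      await writable.write(metaBuffer);
    }
  } catch (err) {
    // Discard the partial write and release the file lock.
    await writable.abort();
    throw err;
  }

  // Data is only committed to the file when the stream closes.
  await writable.close();
}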
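Reading the index back requires knowing exactly where each record ends. The sketch below shows one unambiguous record layout and its decoder, working on plain byte arrays so it can be tried outside the browser. The names encodeRecords and decodeRecords are hypothetical, and the layout (dim float32 values, a uint32 byte length, then UTF-8 text, all little-endian) is an assumption. It covers only the record section, not the JSON header line.

```typescript
// Assumed record layout: dim float32 values, uint32 text byte length, UTF-8 text.
export function encodeRecords(vectors: number[][], texts: string[]): Uint8Array {
  const encoder = new TextEncoder();
  const dim = vectors[0]?.length ?? 0;
  const encodedTexts = texts.map((t) => encoder.encode(t));
  const total = encodedTexts.reduce((sum, t) => sum + dim * 4 + 4 + t.byteLength, 0);
  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  let offset = 0;
  vectors.forEach((vec, i) => {
    for (const value of vec) {
      view.setFloat32(offset, value, true); // true = little-endian
      offset += 4;
    }
    view.setUint32(offset, encodedTexts[i].byteLength, true);
    offset += 4;
    out.set(encodedTexts[i], offset);
    offset += encodedTexts[i].byteLength;
  });
  return out;
}

export function decodeRecords(
  bytes: Uint8Array,
  dim: number
): { vector: Float32Array; text: string }[] {
  const decoder = new TextDecoder();
  // DataView reads work at any byte offset, so records need no alignment padding.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const records: { vector: Float32Array; text: string }[] = [];
  let offset = 0;
  while (offset < bytes.byteLength) {
    const vector = new Float32Array(dim);
    for (let j = 0; j < dim; j++) {
      vector[j] = view.getFloat32(offset, true);
      offset += 4;
    }
    const textLength = view.getUint32(offset, true);
    offset += 4;
    const text = decoder.decode(bytes.subarray(offset, offset + textLength));
    offset += textLength;
    records.push({ vector, text });
  }
  return records;
}
```

Length-prefixing the text keeps the format safe for chunks that contain newlines or other delimiter characters.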

Architecture Rationale

Why OPFS over IndexedDB? OPFS provides lower overhead for large sequential writes, avoids structured cloning limits, and supports direct ArrayBuffer streaming. It behaves like a traditional filesystem without the DOM synchronization constraints.
Why Comlink? Raw postMessage requires manual message typing, error handling, and callback management. Comlink abstracts this into standard async/await patterns, which removes a common source of worker communication bugs.
Why Transformers.js + ONNX? ONNX Runtime Web is itself compiled to WebAssembly and executes ONNX model graphs in the browser at speeds that approach native CPU inference. Transformers.js handles tokenization, model caching, and device fallback automatically, eliminating custom inference glue code.

Pitfall Guide

1. Main Thread Blocking During Chunking

Explanation: Splitting large documents into semantic chunks synchronously on the UI thread causes jank and input lag. Fix: Offload chunking to the worker. Use a sliding window with overlap (e.g., 512 tokens, 64 overlap) to preserve context boundaries without blocking rendering.
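The worker code calls splitIntoChunks(text, 512, 64) without defining it. A minimal sketch of a sliding-window splitter with that signature follows. It is an assumption rather than the article's implementation, and it approximates tokens with whitespace-separated words, since exact token counts would need the model's tokenizer.

```typescript
// Sliding-window chunker. "Tokens" are approximated by whitespace-separated words.
export function splitIntoChunks(
  text: string,
  maxTokens: number,
  overlapTokens: number
): string[] {
  if (overlapTokens >= maxTokens) {
    throw new Error("overlapTokens must be smaller than maxTokens");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = maxTokens - overlapTokens; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxTokens).join(" "));
    // Stop once a window has reached the end of the text.
    if (start + maxTokens >= words.length) break;
  }
  return chunks;
}
```

Each chunk repeats the last overlapTokens words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.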

2. Unnormalized Embedding Vectors

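Explanation: Cosine similarity reduces to a plain dot product only when both vectors have unit length. Storing raw embeddings forces you to divide by both magnitudes on every comparison, and skipping that division silently skews the ranking. Fix: Always apply L2 normalization during inference. Transformers.js accepts { normalize: true } in the options passed to the feature-extraction call. Verify that each vector's magnitude is 1.0 within floating-point tolerance before persistence.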
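The magnitude check can be a few lines of arithmetic. The helpers below are a sketch with hypothetical names (l2Norm, l2Normalize, isUnitLength), and the default tolerance of 1e-3 is an assumed value chosen to leave room for float32 and quantization rounding.

```typescript
// Euclidean length of a vector.
export function l2Norm(vec: ArrayLike<number>): number {
  let sumSquares = 0;
  for (let i = 0; i < vec.length; i++) sumSquares += vec[i] * vec[i];
  return Math.sqrt(sumSquares);
}

// Scale a vector to unit length. A zero vector is returned unchanged.
export function l2Normalize(vec: number[]): number[] {
  const norm = l2Norm(vec);
  return norm === 0 ? vec.slice() : vec.map((v) => v / norm);
}

// Guard to run before persisting an embedding.
export function isUnitLength(vec: ArrayLike<number>, tolerance = 1e-3): boolean {
  return Math.abs(l2Norm(vec) - 1) <= tolerance;
}
```

Running isUnitLength on a sample of vectors before each write is a cheap way to catch a pipeline that was called without normalization.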

3. OPFS Handle Leaks

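Explanation: Data written through createWritable() is only committed when the stream is closed, so a forgotten close() results in silent write failures. A stream left open also keeps the file locked, and reusing stale file handles can fail with errors such as NoModificationAllowedError or InvalidStateError. Fix: Wrap all OPFS operations in try/finally blocks. Always call writable.close() and avoid caching handles across page navigations. Use await root.getFileHandle() fresh per operation.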
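One way to make the close-or-abort discipline hard to forget is to route every write through a small wrapper. The sketch below uses a hypothetical helper, withWritable, and minimal structural interfaces instead of the DOM types so it stays self-contained. A real FileSystemFileHandle is assumed to satisfy FileHandleLike.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
interface WritableLike {
  write(data: Uint8Array | string): Promise<void>;
  close(): Promise<void>;
  abort(): Promise<void>;
}

interface FileHandleLike {
  createWritable(): Promise<WritableLike>;
}

// Opens a writable stream, runs the callback, and always releases the stream:
// close() commits on success, abort() discards partial output on failure.
export async function withWritable(
  handle: FileHandleLike,
  writeAll: (stream: WritableLike) => Promise<void>
): Promise<void> {
  const stream = await handle.createWritable();
  try {
    await writeAll(stream);
  } catch (err) {
    await stream.abort();
    throw err;
  }
  await stream.close();
}
```

Callers then only describe what to write, for example withWritable(handle, async (s) => { await s.write(bytes); }), and cannot leave a stream open by accident.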

4. Ignoring Model Quantization Trade-offs

Explanation: Loading full-precision (FP32) models consumes excessive RAM and slows inference on mobile CPUs. Fix: Default to q8 or f16 variants, which cut weight memory to roughly a quarter or a half of FP32 respectively. Benchmark retrieval accuracy against your own corpus before committing; for most technical documents the recall loss from quantization is small.

5. Naive Linear Search at Scale

Explanation: Computing cosine similarity against every stored vector scales linearly with index size and becomes a bottleneck for large collections. Fix: For datasets under 10k chunks, linear search is acceptable. Beyond that, implement approximate nearest neighbor (ANN) search using locality-sensitive hashing (LSH) or vector quantization, or introduce a lightweight HNSW or FAISS-like client-side index.
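To illustrate the LSH option, the sketch below implements random-hyperplane hashing: each vector gets a bit string recording which side of several random hyperplanes it falls on, and a query only needs to be compared against vectors sharing its bit string. All names here are hypothetical, the hyperplanes are drawn from a uniform distribution for brevity (Gaussian components are the textbook choice), and a production index would normally use several hash tables to improve recall.

```typescript
// Small seeded PRNG (mulberry32) so the hyperplanes are reproducible across sessions.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One random normal vector per hash bit, with components in [-1, 1).
export function makeHyperplanes(numPlanes: number, dim: number, seed = 42): number[][] {
  const rand = mulberry32(seed);
  return Array.from({ length: numPlanes }, () =>
    Array.from({ length: dim }, () => rand() * 2 - 1)
  );
}

// Bit i is "1" when the vector lies on the positive side of hyperplane i.
export function lshSignature(vec: number[], planes: number[][]): string {
  return planes
    .map((plane) => {
      let dot = 0;
      for (let i = 0; i < vec.length; i++) dot += vec[i] * plane[i];
      return dot >= 0 ? "1" : "0";
    })
    .join("");
}

// Group vector indices by signature.
export function buildBuckets(vectors: number[][], planes: number[][]): Map<string, number[]> {
  const buckets = new Map<string, number[]>();
  vectors.forEach((vec, index) => {
    const key = lshSignature(vec, planes);
    const bucket = buckets.get(key);
    if (bucket) bucket.push(index);
    else buckets.set(key, [index]);
  });
  return buckets;
}

// Indices worth scoring exactly for this query. May be empty.
export function candidateIndices(
  query: number[],
  planes: number[][],
  buckets: Map<string, number[]>
): number[] {
  return buckets.get(lshSignature(query, planes)) ?? [];
}
```

The candidates still need an exact similarity pass. LSH only shrinks the set that pass has to visit, at the cost of occasionally missing a true neighbor that landed in a different bucket.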

6. Memory Spikes During Batch Indexing

Explanation: Loading an entire document into memory, chunking it, and generating embeddings simultaneously triggers GC pressure and potential crashes. Fix: Stream processing. Read files in chunks, embed incrementally, and flush to OPFS periodically. Use performance.memory (Chrome) to monitor heap usage and pause indexing if thresholds are breached.
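The incremental approach can be sketched as a loop that embeds a bounded batch and flushes it before touching the next one, so only one batch of vectors is alive at a time. The function name indexInBatches and its callback parameters are assumptions for illustration. embed stands in for the worker's embedding function and flush for whatever appends records to storage.

```typescript
// Embeds and persists chunks one batch at a time to cap peak memory.
export async function indexInBatches(
  chunks: string[],
  batchSize: number,
  embed: (batch: string[]) => Promise<number[][]>,
  flush: (vectors: number[][], texts: string[]) => Promise<void>
): Promise<number> {
  let processed = 0;
  for (let start = 0; start < chunks.length; start += batchSize) {
    const batch = chunks.slice(start, start + batchSize);
    const vectors = await embed(batch);
    // Awaiting the flush gives natural backpressure: the next batch is not
    // embedded until this one has been handed to storage.
    await flush(vectors, batch);
    processed += batch.length;
  }
  return processed;
}
```

Note that flush must append rather than overwrite. With OPFS that means opening the stream with createWritable({ keepExistingData: true }) and seeking to the end of the file before writing.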

7. Assuming Uniform Browser Support

Explanation: OPFS and WebGPU are not available in all environments (e.g., older Safari, restrictive enterprise policies). Fix: Implement feature detection. Fall back to IndexedDB for storage and CPU-only WASM for inference. Gracefully degrade UX rather than throwing unhandled exceptions.
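Feature detection for this stack can stay small. The sketch below probes for OPFS, WebGPU, and the cross-origin isolation that multi-threaded WASM depends on, then picks a storage and inference plan. The names detectCapabilities and choosePlan are hypothetical, and the presence of navigator.gpu only indicates that the API exists, not that a usable adapter will be returned.

```typescript
export interface Capabilities {
  opfs: boolean;
  webgpu: boolean;
  threads: boolean;
}

// `scope` defaults to the current global (window or worker) and is a parameter
// only so the function can be exercised with fake environments.
export function detectCapabilities(scope: any = globalThis): Capabilities {
  const nav = scope.navigator;
  return {
    opfs: typeof nav?.storage?.getDirectory === "function",
    webgpu: nav != null && typeof nav === "object" && "gpu" in nav,
    // SharedArrayBuffer-backed threading requires a cross-origin isolated page.
    threads:
      scope.crossOriginIsolated === true &&
      typeof scope.SharedArrayBuffer === "function",
  };
}

export function choosePlan(caps: Capabilities): {
  storage: "opfs" | "indexeddb";
  device: "webgpu" | "wasm";
} {
  return {
    storage: caps.opfs ? "opfs" : "indexeddb",
    device: caps.webgpu ? "webgpu" : "wasm",
  };
}
```

Calling choosePlan(detectCapabilities()) once at startup keeps the fallback decision in one place instead of scattering try/catch blocks through the pipeline.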

Production Bundle

Action Checklist

Verify OPFS availability: Check navigator.storage.getDirectory before initializing storage adapters.
Configure model caching: Set env.useBrowserCache = true to avoid re-downloading models on every session.
Implement chunk overlap: Use 10-15% overlap between segments to prevent context fragmentation at boundaries.
Normalize vectors at inference: Ensure all embeddings are L2-normalized before storage and search.
Isolate worker communication: Use Comlink or a typed message bus to prevent postMessage serialization errors.
Add memory guards: Monitor heap usage during indexing and implement backpressure or chunked processing.
Test offline resilience: Disconnect network after model load and verify full indexing/search functionality.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Internal legal/medical docs (<500MB)	Browser-Native RAG	Zero data egress, architectural privacy, offline capable	$0 infra, dev time for worker/OPFS setup
Enterprise-wide search (>10GB)	Cloud-Hosted RAG	Client hardware limits, requires distributed ANN, compliance auditing	$50-$500+/mo, vendor lock-in risk
Public-facing AI assistant	Cloud-Hosted RAG	Requires multi-tenant isolation, rate limiting, and audit trails	High infra cost, SOC2/ISO compliance overhead
Edge/air-gapped environments	Browser-Native RAG	No network dependency, runs on restricted hardware	Zero recurring cost, model update logistics required

Configuration Template

// rag-config.ts
import { env } from "@xenova/transformers";

export const RAG_CONFIG = {
  storage: {
    opfsFile: "vector_store.bin", // same file name as INDEX_FILE in storage-adapter.ts
    maxFileSizeMB: 2048,
    flushInterval: 500, // ms between OPFS writes
  },
  model: {
    name: "Xenova/all-MiniLM-L6-v2",
    dtype: "q8" as const,
    device: "wasm" as const,
    dimensions: 384,
    normalize: true,
  },
  chunking: {
    maxTokens: 512,
    overlapTokens: 64,
    separator: /\n\s*\n/,
  },
  worker: {
    timeoutMs: 30000,
    retryAttempts: 2,
  },
};

// Apply environment overrides
env.allowLocalModels = true;
env.useBrowserCache = true;
env.backends.onnx.wasm.numThreads = navigator.hardwareConcurrency || 4;

Quick Start Guide

Initialize the project: Run npm install comlink @xenova/transformers onnxruntime-web and create vector-worker.ts and main-thread.ts.
Configure the worker: Copy the RAG_CONFIG template, set up Comlink exposure, and implement the chunking + embedding pipeline inside the worker file.
Wire OPFS storage: Implement the persistVectorsToStorage adapter using navigator.storage.getDirectory() and createWritable() streams.
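Test ingestion: Drag a plain-text file into your UI, trigger ingestDocument(), and verify that the index file was created in OPFS, for example by listing the entries of navigator.storage.getDirectory() from the console or with an OPFS explorer extension. The pipeline above reads files with file.text(), so PDFs need a separate text-extraction step first.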
Execute search: Call queryDocuments() with a prompt, compute cosine similarity against stored vectors, and render the top-k results with source metadata.
