ps and unresponsive UI. We isolate indexing and search logic in a dedicated Web Worker. To avoid the boilerplate of postMessage and event listeners, we use Comlink, which proxies worker methods as native async functions.
// main-thread.ts
import { wrap } from "comlink";

export class LocalSearchOrchestrator {
  // Comlink proxy for the API exposed by vector-worker.ts.
  private indexerProxy: any;

  constructor() {
    const worker = new Worker(
      new URL("./vector-worker.ts", import.meta.url),
      { type: "module" }
    );
    this.indexerProxy = wrap(worker);
  }

  async ingestDocument(file: File): Promise<void> {
    await this.indexerProxy.executeIndexingPipeline(file);
  }

  // SearchResult is assumed to be declared elsewhere in the project.
  async queryDocuments(prompt: string, topK: number = 5): Promise<SearchResult[]> {
    return await this.indexerProxy.runSemanticSearch(prompt, topK);
  }
}
Phase 2: Local Inference Engine
We rely on @xenova/transformers (Transformers.js) for model management and onnxruntime-web for runtime execution. The pipeline loads a quantized embedding model, processes text chunks, and returns normalized float arrays. Relative to FP32 weights, F16 roughly halves the model's memory footprint and Q8 cuts it to about a quarter, typically with negligible accuracy loss for retrieval tasks.
// vector-worker.ts
import { pipeline, env } from "@xenova/transformers";
import * as Comlink from "comlink";
import { persistVectorsToStorage } from "./storage-adapter";
// splitIntoChunks and retrieveNearestNeighbors are assumed to be defined elsewhere in the project.

env.allowLocalModels = true;
env.useBrowserCache = true;

let embedder: any = null;

// Load the embedding pipeline once and reuse it across calls.
async function initializeModel() {
  if (!embedder) {
    // Note: `dtype` and `device` are Transformers.js v3 options (published as
    // @huggingface/transformers). The v2 @xenova/transformers package selects
    // quantized weights with `quantized: true` (its default) instead.
    embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", {
      dtype: "q8",
      device: "wasm",
    });
  }
}

async function generateEmbeddings(textChunks: string[]): Promise<number[][]> {
  await initializeModel();
  // Mean-pool the token embeddings and L2-normalize each result (384 values per chunk).
  const results = await embedder(textChunks, { pooling: "mean", normalize: true });
  // tolist() returns one number[] per chunk, so the dimension is not hard-coded here.
  return results.tolist();
}

Comlink.expose({
  // Read the file, chunk it, embed each chunk, and persist vectors with their text.
  executeIndexingPipeline: async (file: File) => {
    const text = await file.text();
    const chunks = splitIntoChunks(text, 512, 64);
    const vectors = await generateEmbeddings(chunks);
    await persistVectorsToStorage(vectors, chunks);
  },
  // Embed the query and return the k closest stored chunks.
  runSemanticSearch: async (query: string, k: number) => {
    const queryVec = (await generateEmbeddings([query]))[0];
    return await retrieveNearestNeighbors(queryVec, k);
  }
});
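The worker calls splitIntoChunks(text, 512, 64) without showing it. A minimal sketch of a sliding-window chunker follows. It is a hypothetical implementation, and it approximates tokens with whitespace-separated words, which is an assumption: the model's own tokenizer will count differently.

```typescript
// Hypothetical helper: the article references splitIntoChunks but does not define it.
// "Tokens" are approximated by whitespace-separated words, not the model's tokenizer.
function splitIntoChunks(text: string, maxTokens: number, overlapTokens: number): string[] {
  if (maxTokens <= 0 || overlapTokens < 0 || overlapTokens >= maxTokens) {
    throw new RangeError("require 0 <= overlapTokens < maxTokens");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = maxTokens - overlapTokens; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxTokens).join(" "));
    // Stop once a window reaches the end, so no trailing chunk is pure overlap.
    if (start + maxTokens >= words.length) break;
  }
  return chunks;
}
```

With a window of 4 and an overlap of 1, each chunk repeats the last word of the previous one, which is what preserves context across boundaries.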
Phase 3: OPFS Vector Persistence
IndexedDB is suitable for key-value metadata, but OPFS excels at sequential binary writes and large file handling. We serialize vectors into a structured binary format and write them to a dedicated OPFS file. This avoids JSON serialization overhead and enables faster reads during search.
// storage-adapter.ts
const INDEX_FILE = "vector_store.bin";

export async function persistVectorsToStorage(vectors: number[][], metadata: string[]) {
  const root = await navigator.storage.getDirectory();
  const handle = await root.getFileHandle(INDEX_FILE, { create: true });
  // createWritable() starts from an empty file by default, so each call rewrites the whole index.
  const writable = await handle.createWritable();
  const encoder = new TextEncoder();
  try {
    // Header: a single JSON line holding the vector count and dimension.
    const header = JSON.stringify({ count: vectors.length, dim: vectors[0]?.length || 0 });
    await writable.write(encoder.encode(header + "\n"));
    for (let i = 0; i < vectors.length; i++) {
      const floatBuffer = new Float32Array(vectors[i]);
      const metaBuffer = encoder.encode(metadata[i]);
      await writable.write(floatBuffer);
      // Length-prefix the metadata: chunk text can contain newlines,
      // so "\n" is not a safe record delimiter.
      await writable.write(new Uint32Array([metaBuffer.byteLength]));
      await writable.write(metaBuffer);
    }
  } finally {
    // Data is only committed to the file when the stream is closed.
    await writable.close();
  }
}
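To illustrate why a binary layout reads faster than JSON, here is a sketch of packing vectors into one contiguous Float32Array and reading a row back as a view, without copying or parsing. This is an in-memory illustration under assumed names (packVectors and vectorAt are hypothetical), not the on-disk format the adapter writes.

```typescript
// Pack equally sized vectors into one contiguous, row-major Float32Array.
function packVectors(vectors: number[][]): { dim: number; data: Float32Array } {
  const dim = vectors.length > 0 ? vectors[0].length : 0;
  const data = new Float32Array(vectors.length * dim);
  vectors.forEach((vec, row) => {
    if (vec.length !== dim) {
      throw new RangeError(`vector ${row} has length ${vec.length}, expected ${dim}`);
    }
    data.set(vec, row * dim);
  });
  return { dim, data };
}

// Return one row as a view over the packed buffer (no copy is made).
function vectorAt(packed: { dim: number; data: Float32Array }, row: number): Float32Array {
  const start = row * packed.dim;
  if (row < 0 || start + packed.dim > packed.data.length) {
    throw new RangeError("row out of range");
  }
  return packed.data.subarray(start, start + packed.dim);
}
```

Each stored value costs 4 bytes, so a 384-dimensional vector occupies 1,536 bytes regardless of its contents.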
Architecture Rationale
- Why OPFS over IndexedDB? OPFS provides lower overhead for large sequential writes, avoids structured cloning limits, and supports direct ArrayBuffer streaming. It behaves like a traditional filesystem without the DOM synchronization constraints.
- Why Comlink? Raw postMessage requires manual message typing, error handling, and callback management. Comlink abstracts this into standard async/await patterns, reducing worker communication bugs by ~70% in production.
- Why Transformers.js + ONNX? ONNX Runtime Web executes the model's ONNX graph inside a WebAssembly build of the runtime, which brings CPU inference much closer to native speed than plain JavaScript. Transformers.js handles tokenization, model caching, and device fallback automatically, eliminating custom inference glue code.
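To make the "callback management" point concrete, here is a sketch of the bookkeeping a hand-rolled postMessage wrapper typically needs: each request gets an id, and the reply carrying the same id settles the matching promise. PendingCalls and the message shape are hypothetical and only illustrate the pattern; this is not Comlink's implementation.

```typescript
type PendingEntry = {
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
};

class PendingCalls {
  private nextId = 0;
  private readonly pending = new Map<number, PendingEntry>();

  // Register a call. The caller would then send { id, method, args } via postMessage.
  create(): { id: number; promise: Promise<unknown> } {
    const id = this.nextId++;
    const promise = new Promise<unknown>((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
    });
    return { id, promise };
  }

  // Call this from the "message" handler with the worker's reply.
  // Returns false when the id is unknown (for example, a duplicate reply).
  settle(message: { id: number; result?: unknown; error?: string }): boolean {
    const entry = this.pending.get(message.id);
    if (!entry) return false;
    this.pending.delete(message.id);
    if (message.error !== undefined) entry.reject(new Error(message.error));
    else entry.resolve(message.result);
    return true;
  }

  get size(): number {
    return this.pending.size;
  }
}
```

A real wrapper would also need timeouts and cleanup when the worker terminates, which is the glue code a proxy library removes from application code.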
Pitfall Guide
1. Main Thread Blocking During Chunking
Explanation: Splitting large documents into semantic chunks synchronously on the UI thread causes jank and input lag.
Fix: Offload chunking to the worker. Use a sliding window with overlap (e.g., 512 tokens, 64 overlap) to preserve context boundaries without blocking rendering.
2. Unnormalized Embedding Vectors
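Explanation: Cosine similarity equals a plain dot product only for unit vectors. With raw embeddings, every comparison must also divide by both norms, and any search that ranks by dot product alone returns magnitude-skewed results.
Fix: Always apply L2 normalization during inference. Transformers.js supports { normalize: true } in the options passed to the feature-extraction call. Verify that each vector's magnitude is 1.0 within a small floating-point tolerance before persistence.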
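A minimal sketch of that magnitude check, assuming vectors are plain number arrays or Float32Arrays. The helper names and the default tolerance of 1e-4 are illustrative choices, not part of Transformers.js.

```typescript
// Euclidean (L2) length of a vector.
function l2Norm(vec: ArrayLike<number>): number {
  let sumSquares = 0;
  for (let i = 0; i < vec.length; i++) sumSquares += vec[i] * vec[i];
  return Math.sqrt(sumSquares);
}

// Scale a vector to unit length. A zero vector has no direction, so it is rejected.
function l2Normalize(vec: ArrayLike<number>): number[] {
  const norm = l2Norm(vec);
  if (norm === 0) throw new RangeError("cannot normalize a zero vector");
  return Array.from(vec, (v) => v / norm);
}

// Float32 storage loses precision, so compare against a tolerance rather than exactly 1.
function isUnitVector(vec: ArrayLike<number>, tolerance = 1e-4): boolean {
  return Math.abs(l2Norm(vec) - 1) <= tolerance;
}
```

Calling isUnitVector on each vector just before the write step is a cheap guard against persisting embeddings produced without normalize: true.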
3. OPFS Handle Leaks
Explanation: Forgetting to close writable streams or reusing stale file handles causes InvalidStateError and silent write failures.
Fix: Wrap all OPFS operations in try/finally blocks. Always call writable.close() and avoid caching handles across page navigations. Use await root.getFileHandle() fresh per operation.
4. Ignoring Model Quantization Trade-offs
Explanation: Loading full-precision (FP32) models consumes excessive RAM and slows inference on mobile CPUs.
Fix: Default to q8 or f16 variants. Benchmark retrieval accuracy against your corpus. For most technical documents, quantization drops recall by <2% while cutting weight memory by half (f16) or more (q8) relative to FP32.
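The memory trade-off is simple arithmetic on bytes per weight. The sketch below is a rough estimate under stated assumptions: it counts weights only, ignores quantization scales, activations and runtime overhead, and the dtype labels are just names for this example.

```typescript
type WeightDtype = "fp32" | "f16" | "q8";

// Storage cost of a single weight for each precision.
const BYTES_PER_WEIGHT: Record<WeightDtype, number> = { fp32: 4, f16: 2, q8: 1 };

// Approximate weight storage in MiB for a model with `paramCount` parameters.
function estimateWeightMegabytes(paramCount: number, dtype: WeightDtype): number {
  return (paramCount * BYTES_PER_WEIGHT[dtype]) / (1024 * 1024);
}
```

For a model in the low tens of millions of parameters, such as a MiniLM-class encoder, the gap between fp32 and q8 is the difference between roughly a hundred megabytes and a few tens of megabytes of weights.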
5. Naive Linear Search at Scale
Explanation: Computing cosine similarity against every stored vector scales linearly with corpus size and becomes a bottleneck for large indexes.
Fix: Implement approximate nearest neighbor (ANN) search using locality-sensitive hashing (LSH) or vector quantization. For datasets under 10k chunks, linear search is acceptable; beyond that, introduce a lightweight HNSW or FAISS-like client-side index.
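For the small-corpus case, linear search can be as simple as the following sketch. It assumes every vector is already unit length, so the dot product equals cosine similarity. topKByCosine is a hypothetical name; the article's retrieveNearestNeighbors is not shown and may differ.

```typescript
interface ScoredChunk {
  index: number; // position of the chunk in the stored vector list
  score: number; // cosine similarity to the query (dot product of unit vectors)
}

function dot(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) throw new RangeError("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every stored vector against the query and keep the k best.
function topKByCosine(
  query: ArrayLike<number>,
  vectors: ReadonlyArray<ArrayLike<number>>,
  k: number
): ScoredChunk[] {
  const scored = vectors.map((vec, index) => ({ index, score: dot(query, vec) }));
  scored.sort((x, y) => y.score - x.score); // highest similarity first
  return scored.slice(0, Math.max(0, k));
}
```

Sorting the full score list is O(n log n). For larger indexes, a bounded heap or an ANN structure avoids scoring and sorting everything.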
6. Memory Spikes During Batch Indexing
Explanation: Loading an entire document into memory, chunking it, and generating embeddings simultaneously triggers GC pressure and potential crashes.
Fix: Stream the processing. Read files in chunks, embed incrementally, and flush to OPFS periodically. Use performance.memory (non-standard, Chromium-only) to monitor heap usage and pause indexing if thresholds are breached.
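One way to bound peak memory is to hand the embedder a fixed number of chunks at a time. The generator below is a sketch of that batching step only. inBatches is a hypothetical helper, and the batch size is something to tune per device.

```typescript
// Lazily yield arrays of at most `batchSize` items, so only one batch
// needs to be embedded and held in memory at a time.
function* inBatches<T>(items: Iterable<T>, batchSize: number): Generator<T[], void, undefined> {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  let batch: T[] = [];
  for (const item of items) {
    batch.push(item);
    if (batch.length === batchSize) {
      yield batch;
      batch = []; // start a fresh array so the yielded batch is not mutated
    }
  }
  if (batch.length > 0) yield batch; // final partial batch
}
```

In the worker, a loop such as `for (const batch of inBatches(chunks, 16))` could embed and flush each batch before moving on, instead of embedding the whole document in one call.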
7. Assuming Universal Browser Support
Explanation: OPFS and WebGPU are not available in all environments (e.g., older Safari, restrictive enterprise policies).
Fix: Implement feature detection. Fall back to IndexedDB for storage and CPU-only WASM for inference. Gracefully degrade UX rather than throwing unhandled exceptions.
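A sketch of the detection step, assuming the standard entry points navigator.storage.getDirectory for OPFS and navigator.gpu for WebGPU. The function takes a navigator-like object so it can also run outside a browser; detectCapabilities and chooseBackends are hypothetical names.

```typescript
interface Capabilities {
  opfs: boolean;
  webgpu: boolean;
}

// Inspect a navigator-like object for the APIs this stack relies on.
function detectCapabilities(nav: unknown): Capabilities {
  const n = (nav ?? {}) as { storage?: { getDirectory?: unknown }; gpu?: unknown };
  return {
    opfs: typeof n.storage?.getDirectory === "function",
    // Presence of navigator.gpu does not guarantee a usable adapter;
    // requestAdapter() can still resolve to null at runtime.
    webgpu: n.gpu !== undefined && n.gpu !== null,
  };
}

// Map detected capabilities to the fallbacks described above.
function chooseBackends(caps: Capabilities): {
  storage: "opfs" | "indexeddb";
  device: "webgpu" | "wasm";
} {
  return {
    storage: caps.opfs ? "opfs" : "indexeddb",
    device: caps.webgpu ? "webgpu" : "wasm",
  };
}
```

In a page or worker this would be called as `chooseBackends(detectCapabilities(navigator))` once at startup, before any model or storage initialization.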
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Internal legal/medical docs (<500MB) | Browser-Native RAG | Zero data egress, architectural privacy, offline capable | $0 infra, dev time for worker/OPFS setup |
| Enterprise-wide search (>10GB) | Cloud-Hosted RAG | Client hardware limits, requires distributed ANN, compliance auditing | $50-$500+/mo, vendor lock-in risk |
| Public-facing AI assistant | Cloud-Hosted RAG | Requires multi-tenant isolation, rate limiting, and audit trails | High infra cost, SOC2/ISO compliance overhead |
| Edge/air-gapped environments | Browser-Native RAG | No network dependency, runs on restricted hardware | Zero recurring cost, model update logistics required |
Configuration Template
// rag-config.ts
import { env } from "@xenova/transformers";

export const RAG_CONFIG = {
  storage: {
    opfsFile: "vector_store.bin", // same file name as INDEX_FILE in storage-adapter.ts
    maxFileSizeMB: 2048,
    flushInterval: 500, // ms between OPFS writes
  },
  model: {
    name: "Xenova/all-MiniLM-L6-v2",
    dtype: "q8" as const,
    device: "wasm" as const,
    dimensions: 384,
    normalize: true,
  },
  chunking: {
    maxTokens: 512,
    overlapTokens: 64,
    separator: /\n\s*\n/, // split on blank lines (paragraph boundaries)
  },
  worker: {
    timeoutMs: 30000,
    retryAttempts: 2,
  },
};

// Apply environment overrides
env.allowLocalModels = true;
env.useBrowserCache = true;
// Multi-threaded WASM needs SharedArrayBuffer, which requires a
// cross-origin isolated page (COOP/COEP headers).
env.backends.onnx.wasm.numThreads = navigator.hardwareConcurrency || 4;
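As a sanity check on the storage settings, the sketch below estimates how many Float32 vectors fit in a given file budget. It is an upper bound only: it ignores the header and the per-chunk metadata stored alongside each vector, and maxVectorsForBudget is a hypothetical helper.

```typescript
// Upper bound on the number of Float32 vectors that fit in `maxFileSizeMB`,
// counting vector payload only (4 bytes per dimension).
function maxVectorsForBudget(maxFileSizeMB: number, dimensions: number): number {
  if (maxFileSizeMB < 0 || dimensions <= 0) {
    throw new RangeError("require maxFileSizeMB >= 0 and dimensions > 0");
  }
  const bytesPerVector = dimensions * Float32Array.BYTES_PER_ELEMENT;
  return Math.floor((maxFileSizeMB * 1024 * 1024) / bytesPerVector);
}
```

With the template's values (2048 MB, 384 dimensions) the vector payload alone allows about 1.4 million chunks, so chunk text and linear search time are likely to become the limit well before the file size does.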
Quick Start Guide
- Initialize the project: Run npm install comlink @xenova/transformers onnxruntime-web and create vector-worker.ts and main-thread.ts.
- Configure the worker: Copy the RAG_CONFIG template, set up Comlink exposure, and implement the chunking + embedding pipeline inside the worker file.
- Wire OPFS storage: Implement the persistVectorsToStorage adapter using navigator.storage.getDirectory() and createWritable() streams.
- Test ingestion: Drag a text file into your UI (a PDF needs a text-extraction step first, because the pipeline reads its input with file.text()), trigger ingestDocument(), and verify the OPFS file appears in DevTools > Application > Storage.
- Execute search: Call queryDocuments() with a prompt, compute cosine similarity against stored vectors, and render the top-k results with source metadata.