ly halves the input memory footprint.
Step 3: Bounded Sequential Processing
Never process all pages concurrently. Implement a processing queue with a fixed concurrency limit (typically 1–3 concurrent operations depending on device class). Sequential processing ensures that intermediate canvas buffers and decoded image data are allocated, used, and released before the next cycle begins.
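A bounded queue can be sketched in a few lines. The helper below is a minimal illustration, not part of the pipeline above: `runWithLimit` is a hypothetical name, and each task is assumed to be a function that returns a promise for one page's result.

```typescript
// Hypothetical helper: runs async tasks with at most `limit` in flight at once.
async function runWithLimit<T>(
  tasks: ReadonlyArray<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each lane claims the next unprocessed index until none remain,
  // so a new task starts only when an earlier one has finished.
  async function lane(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const laneCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With `limit` set to 1 this degenerates to strictly sequential processing; results are returned in task order regardless of completion order.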
Step 4: Event Loop Yielding & Cleanup
After each page extraction, yield control back to the event loop. This lets queued tasks run (incoming messages, timers, and, on the main thread, rendering) and gives the engine an opportunity to reclaim memory that is no longer referenced. Pair this with strict cleanup protocols: nullify parser instances, revoke object URLs, and release canvas contexts.
// main-thread-controller.ts
import type { ExtractionResult, WorkerMessage } from './types';
export class DocumentExtractionController {
private worker: Worker;
private onProgress: (page: number, total: number) => void;
private onComplete: (results: ExtractionResult[]) => void;
constructor(progressCb: (p: number, t: number) => void, completeCb: (r: ExtractionResult[]) => void) {
this.onProgress = progressCb;
this.onComplete = completeCb;
this.worker = new Worker(new URL('./pdf-extraction-worker.ts', import.meta.url), { type: 'module' });
this.worker.onmessage = this.handleWorkerResponse.bind(this);
}
public async initiateExtraction(file: File): Promise<void> {
const rawPayload = await file.arrayBuffer();
this.worker.postMessage(
{
action: 'START_EXTRACTION',
payload: rawPayload,
totalPages: await this.estimatePages(rawPayload)
},
[rawPayload] // Transfer ownership; rawPayload is detached (byteLength 0) on this thread afterwards
);
}
private handleWorkerResponse(event: MessageEvent<WorkerMessage>): void {
const { type, data } = event.data;
if (type === 'PROGRESS') {
this.onProgress(data.currentPage, data.totalPages);
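} else if (type === 'COMPLETE') {
this.onComplete(data.results);
// Caution: object URLs created inside the worker are generally tied to the worker's lifetime,
// so make sure consumers have loaded them (or re-create them on this thread) before terminating.
this.worker.terminate();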
} else if (type === 'ERROR') {
console.error('Extraction pipeline failed:', data.message);
this.worker.terminate();
}
}
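private async estimatePages(buffer: ArrayBuffer): Promise<number> {
// Delegate counting to a short-lived worker. Send a copy: transferring `buffer` itself
// would detach it before initiateExtraction() can hand it to the extraction worker.
// The copy costs one extra allocation until the counter worker is terminated.
const copy = buffer.slice(0);
return new Promise((resolve, reject) => {
const tempWorker = new Worker(new URL('./page-counter-worker.ts', import.meta.url), { type: 'module' });
tempWorker.onmessage = (e) => { tempWorker.terminate(); resolve(e.data.count); };
tempWorker.onerror = (e) => { tempWorker.terminate(); reject(new Error(e.message)); };
tempWorker.postMessage({ buffer: copy }, [copy]);
});
}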
}
// pdf-extraction-worker.ts
import type { WorkerMessage, ExtractionResult } from './types';
self.onmessage = async (event: MessageEvent<{ action: string; payload: ArrayBuffer; totalPages: number }>) => {
const { action, payload, totalPages } = event.data;
if (action !== 'START_EXTRACTION') return;
const results: ExtractionResult[] = [];
const parser = await importPdfLibrary(); // Dynamic import to keep worker lean
for (let pageIndex = 0; pageIndex < totalPages; pageIndex++) {
try {
const pageData = await parser.renderPage(payload, pageIndex);
const blobUrl = await createImageBlob(pageData.imageBuffer);
results.push({
pageIndex,
imageUrl: blobUrl,
dimensions: pageData.dimensions,
timestamp: Date.now()
});
self.postMessage({
type: 'PROGRESS',
data: { currentPage: pageIndex + 1, totalPages }
});
// Yield to the event loop so queued messages are handled and per-page allocations can be reclaimed
await yieldToEventLoop();
} catch (error) {
self.postMessage({ type: 'ERROR', data: { message: `Page ${pageIndex} failed`, error } });
return;
}
}
self.postMessage({ type: 'COMPLETE', data: { results } });
};
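async function yieldToEventLoop(): Promise<void> {
// Workers have no `window`; globalThis works in both worker and main-thread contexts
const scheduler = (globalThis as any).scheduler;
if (scheduler && typeof scheduler.yield === 'function') {
await scheduler.yield();
} else {
await new Promise(resolve => setTimeout(resolve, 0));
}
}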
async function createImageBlob(imageData: Uint8Array): Promise<string> {
const blob = new Blob([imageData], { type: 'image/png' });
return URL.createObjectURL(blob);
}
Architecture Rationale:
- Transferables over Cloning: Structured cloning copies the buffer, so sender and receiver each hold a full copy until one is released. Transferring moves ownership without copying. For a 50MB file, this avoids a second ~50MB allocation and reduces GC pressure.
- Sequential Queue over Parallel: Canvas rasterization and image encoding are CPU and memory intensive. Running 20 pages concurrently risks exhausting the heap, especially on constrained devices. A single worker with sequential processing keeps the memory curve predictable.
- scheduler.yield() Fallback: Modern browsers support explicit yielding. The fallback to setTimeout(0) ensures compatibility while still breaking up long-running work into separate tasks.
- Dynamic Imports: Keeping heavy parsing libraries out of the initial worker bundle reduces startup latency and memory overhead.
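The difference between cloning and transferring can be observed without a worker. The sketch below uses `structuredClone`, which applies the same clone and transfer rules as `postMessage`; the function name `compareCloneAndTransfer` is illustrative only.

```typescript
// Illustrative only: shows what happens to the sender's buffer in each case.
function compareCloneAndTransfer(byteLength: number) {
  const source = new ArrayBuffer(byteLength);

  // Clone: a second buffer is allocated and the original stays usable.
  const cloned = structuredClone(source);
  const sourceSizeAfterClone = source.byteLength;

  // Transfer: ownership moves to the new buffer and the original is detached.
  const moved = structuredClone(source, { transfer: [source] });
  const sourceSizeAfterTransfer = source.byteLength;

  return {
    clonedSize: cloned.byteLength,
    movedSize: moved.byteLength,
    sourceSizeAfterClone,
    sourceSizeAfterTransfer,
  };
}
```

After the transfer, the original buffer reports a length of 0, which is why the controller must not touch `rawPayload` once it has been posted.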
Pitfall Guide
1. Phantom Parser References
Explanation: Developers often cache parser instances or document buffers in module-level variables or closures, preventing GC from reclaiming memory after extraction completes.
Fix: Explicitly nullify parser instances and buffer references after use. Use WeakRef for optional caching, and implement a dispose() method that clears internal state.
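One way to make the cleanup hard to forget is to scope the parser to a session object and release it in a `finally` block. This is a sketch under assumptions: `DisposableParser`, `ParserSession`, and `withParserSession` are hypothetical names, and real PDF libraries expose their own teardown calls.

```typescript
// Hypothetical parser shape; substitute your library's real teardown method.
interface DisposableParser {
  renderPage(index: number): Promise<Uint8Array>;
  dispose(): void;
}

class ParserSession {
  private parser: DisposableParser | null;
  private source: ArrayBuffer | null;

  constructor(parser: DisposableParser, source: ArrayBuffer) {
    this.parser = parser;
    this.source = source;
  }

  async render(index: number): Promise<Uint8Array> {
    if (!this.parser) throw new Error('ParserSession already disposed');
    return this.parser.renderPage(index);
  }

  // Bytes still pinned by this session; 0 once disposed.
  get retainedBytes(): number {
    return this.source ? this.source.byteLength : 0;
  }

  // Drops both references so nothing in this object keeps them alive.
  dispose(): void {
    if (this.parser) this.parser.dispose();
    this.parser = null;
    this.source = null;
  }
}

// Runs `work` and disposes the session whether it succeeds or throws.
async function withParserSession<T>(
  session: ParserSession,
  work: (session: ParserSession) => Promise<T>
): Promise<T> {
  try {
    return await work(session);
  } finally {
    session.dispose();
  }
}
```

Because the session owns the only long-lived references, callers cannot accidentally cache the parser in a module-level variable.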
2. The Blob URL Leak
Explanation: URL.createObjectURL() creates a reference that keeps the underlying blob alive until the URL is explicitly revoked (or its owning context is torn down). Generating hundreds of image URLs without cleanup can exhaust memory.
Fix: Implement an auto-revoke strategy. Store URLs in a Set, and call URL.revokeObjectURL() immediately after the consumer (e.g., <img> or download handler) finishes using them. Consider wrapping blob creation in a factory that tracks lifecycle.
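A lifecycle-tracking factory might look like the following sketch. `ObjectUrlRegistry` is a hypothetical name, and the create and revoke functions are injected so the class can be exercised outside a browser; in a page they would be assumed to wrap `URL.createObjectURL` and `URL.revokeObjectURL`.

```typescript
// Hypothetical registry: tracks every URL it hands out so none is leaked.
class ObjectUrlRegistry<T> {
  private readonly live = new Set<string>();
  private readonly create: (source: T) => string;
  private readonly revoke: (url: string) => void;

  constructor(create: (source: T) => string, revoke: (url: string) => void) {
    this.create = create;
    this.revoke = revoke;
  }

  // Creates a URL and remembers it until release() is called.
  register(source: T): string {
    const url = this.create(source);
    this.live.add(url);
    return url;
  }

  // Revokes a single URL; returns false if it was unknown or already released.
  release(url: string): boolean {
    if (!this.live.delete(url)) return false;
    this.revoke(url);
    return true;
  }

  // Revokes everything still outstanding, e.g. on teardown.
  releaseAll(): void {
    for (const url of Array.from(this.live)) this.release(url);
  }

  get size(): number {
    return this.live.size;
  }
}
```

A consumer would call `release(url)` from an image's load handler or after a download completes, and `releaseAll()` when the view is torn down.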
3. Unbounded Concurrency
Explanation: Spawning one worker per page or using Promise.all() for all pages assumes infinite memory. Each concurrent canvas context holds pixel buffers that multiply quickly.
Fix: Implement a worker pool or sequential queue with a concurrency limit (1–3). Use a p-limit style utility or a custom async queue that processes items in batches.
4. Microtask Queue Starvation
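Explanation: Using Promise.resolve().then() or queueMicrotask() in tight loops keeps execution in the microtask queue. The event loop drains all microtasks before it picks up the next task, so timers, incoming messages, and (on the main thread) rendering are starved until the loop ends.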
Fix: Always yield via macrotask scheduling (setTimeout, MessageChannel, or scheduler.yield()). This forces the browser to process pending renders and memory cleanup before continuing.
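The ordering is easy to verify with a timer as a stand-in for any pending task. In this illustrative sketch (the function name is hypothetical), a zero-delay timer is queued first, then the code "yields" a thousand times through microtasks before yielding once through a macrotask.

```typescript
// Illustrative check: does a queued timer get a chance to run?
async function compareYieldStrategies(): Promise<{
  firedAfterMicrotasks: boolean;
  firedAfterMacrotask: boolean;
}> {
  let timerFired = false;
  setTimeout(() => { timerFired = true; }, 0);

  // Microtask-only yielding: control never returns to the task queue,
  // so the timer callback cannot run between iterations.
  for (let i = 0; i < 1000; i++) {
    await Promise.resolve();
  }
  const firedAfterMicrotasks = timerFired;

  // A single macrotask yield lets the earlier timer run first.
  await new Promise<void>(resolve => setTimeout(resolve, 0));
  const firedAfterMacrotask = timerFired;

  return { firedAfterMicrotasks, firedAfterMacrotask };
}
```

The first flag stays `false` and the second becomes `true`, which is the behavior the fix relies on.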
5. Silent GC Thrashing
Explanation: Repeatedly allocating large Uint8Array or Float32Array buffers without reuse causes the GC to run continuously, creating CPU spikes and jank.
Fix: Implement object pooling for intermediate buffers. Reuse typed arrays across page cycles, and only allocate new memory when dimensions change. Monitor heap snapshots in DevTools to verify stable memory curves.
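A small pool keyed by buffer size is often enough for page-sized scratch buffers. The sketch below is an assumption-laden illustration: `ByteBufferPool` and its `maxIdle` cap are hypothetical, and it only handles `Uint8Array`.

```typescript
// Hypothetical pool: reuses scratch buffers of the same size across pages.
class ByteBufferPool {
  private readonly idle: Uint8Array[] = [];
  private readonly maxIdle: number;
  private allocations = 0;

  constructor(maxIdle = 4) {
    this.maxIdle = maxIdle;
  }

  // Returns a zeroed buffer, reusing an idle one when the size matches.
  acquire(byteLength: number): Uint8Array {
    const index = this.idle.findIndex(b => b.byteLength === byteLength);
    if (index >= 0) {
      const reused = this.idle.splice(index, 1)[0];
      reused.fill(0);
      return reused;
    }
    this.allocations++;
    return new Uint8Array(byteLength);
  }

  // Hands a buffer back; extras beyond maxIdle are dropped for the GC.
  release(buffer: Uint8Array): void {
    if (this.idle.length < this.maxIdle) this.idle.push(buffer);
  }

  get allocationCount(): number {
    return this.allocations;
  }
}
```

Buffers that have been transferred to another thread are detached and should not be returned to the pool; pooling suits intermediate buffers that stay inside the worker.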
6. Transferable Misuse
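Explanation: Listing a non-transferable object (e.g., Blob, File, or a plain object) in the transfer list throws an error (a DataCloneError per the HTML specification). The opposite mistake fails silently: a transferable that is left out of the transfer list is cloned instead, which duplicates memory.
Fix: Validate transferables before posting. Transferable types include ArrayBuffer, MessagePort, ImageBitmap, OffscreenCanvas, and the ReadableStream, WritableStream, and TransformStream types; Blob and File are not among them. Convert a Blob to an ArrayBuffer first (blob.arrayBuffer()), then transfer it.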
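A small helper can build the transfer list from a message instead of relying on each call site to remember it. This sketch is limited by assumption to top-level `ArrayBuffer` values and typed-array views; `collectTransferables` is a hypothetical name.

```typescript
// Hypothetical helper: gathers the ArrayBuffers referenced by a flat message.
function collectTransferables(message: Record<string, unknown>): ArrayBuffer[] {
  // A Set avoids duplicates; listing the same buffer twice makes the transfer throw.
  const found = new Set<ArrayBuffer>();
  for (const value of Object.values(message)) {
    if (value instanceof ArrayBuffer) {
      found.add(value);
    } else if (ArrayBuffer.isView(value) && value.buffer instanceof ArrayBuffer) {
      // Transferring a view's buffer detaches the whole underlying buffer.
      found.add(value.buffer);
    }
    // Everything else (strings, numbers, SharedArrayBuffer, ...) is left to cloning.
  }
  return Array.from(found);
}
```

A call site would then read `worker.postMessage(message, collectTransferables(message))`.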
7. Backpressure Ignorance
Explanation: The worker emits results faster than the main thread can render or store them, causing message queue buildup and memory spikes in the worker's post queue.
Fix: Implement acknowledgment-based flow control. The main thread should signal READY_FOR_NEXT after processing each result, or use a ReadableStream with backpressure support for progressive delivery.
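Acknowledgment-based flow control can be reduced to a credit counter. The sketch below assumes a single producer loop in the worker and a consumer that calls `acknowledge()` when it receives the main thread's READY_FOR_NEXT message; `AckGate` is a hypothetical name.

```typescript
// Hypothetical gate: the producer may have at most `maxUnacknowledged`
// results outstanding before it has to wait for the consumer.
class AckGate {
  private credits: number;
  private waiter: (() => void) | null = null;

  constructor(maxUnacknowledged = 1) {
    this.credits = maxUnacknowledged;
  }

  // Call before posting a result; resolves once a credit is available.
  async beforeSend(): Promise<void> {
    if (this.credits > 0) {
      this.credits--;
      return;
    }
    await new Promise<void>(resolve => { this.waiter = resolve; });
  }

  // Call when the consumer signals it has finished with one result.
  acknowledge(): void {
    if (this.waiter) {
      const resume = this.waiter;
      this.waiter = null;
      resume(); // the freed credit goes straight to the waiting producer
    } else {
      this.credits++;
    }
  }
}
```

In the worker loop, `await gate.beforeSend()` would precede each `postMessage`, and the worker's message handler would call `gate.acknowledge()` on each READY_FOR_NEXT.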
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Enterprise compliance (PII/PHI) | Client-side Worker + Transferables | Data never leaves device; zero server storage costs | High initial dev effort, zero infra cost |
| Real-time preview (5–20 pages) | Main Thread + scheduler.yield() | Lower latency; avoids worker serialization overhead | Moderate memory usage; acceptable for small docs |
| Batch processing (100+ pages) | Worker Pool + Sequential Queue | Prevents OOM; predictable memory curve; scalable | Higher CPU time; requires queue management |
| Low-end devices (mobile/tablet) | Chunked Processing + Aggressive Yielding | Respects strict heap limits; prevents tab crashes | Slower throughput; requires UI progress indicators |
Configuration Template
// extraction-config.ts
export const ExtractionConfig = {
concurrencyLimit: 1,
yieldInterval: 0, // ms; 0 uses scheduler.yield() or setTimeout(0)
maxHeapThresholdMB: 1024, // Abort if heap exceeds this
blobRetentionMs: 30000, // Auto-revoke URLs after 30s
errorRetryLimit: 2,
progressUpdateFrequency: 1, // Emit progress every N pages
cleanupStrategy: 'immediate' as 'immediate' | 'batched' | 'timeout' // default value; the cast documents the allowed options
} as const;
export type ExtractionConfig = typeof ExtractionConfig;
// memory-guard.ts
export class MemoryGuard {
private threshold: number;
private checkInterval: number;
constructor(config: { thresholdMB: number; intervalMs: number }) {
this.threshold = config.thresholdMB * 1024 * 1024;
this.checkInterval = config.intervalMs;
}
public async assertAvailable(): Promise<void> {
// performance.memory is non-standard (Chromium-based browsers); elsewhere this check is skipped
if ('memory' in performance) {
const mem = (performance as any).memory;
if (mem.usedJSHeapSize > this.threshold) {
throw new Error(`Heap threshold exceeded: ${Math.round(mem.usedJSHeapSize / 1048576)}MB`);
}
}
await new Promise(r => setTimeout(r, this.checkInterval));
}
}
Quick Start Guide
- Initialize the worker controller: Instantiate DocumentExtractionController with progress and completion callbacks. Pass your file object to initiateExtraction().
- Configure the worker pipeline: Set concurrency to 1, enable the scheduler.yield() fallback, and implement a blob URL manager with auto-revoke.
- Add memory monitoring: Inject MemoryGuard checks between page cycles. Log heap usage to the console or a telemetry endpoint.
- Test under load: Process a 50MB+ PDF in Chrome DevTools. Capture a heap snapshot during extraction. Verify that memory returns to baseline after completion and that no DataCloneError or OutOfMemory exceptions occur.
- Deploy with fallbacks: Wrap the extraction pipeline in a try/catch. If heap thresholds are breached, gracefully degrade to server-side processing or chunked user-initiated extraction.
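The fallback step can be sketched as a thin wrapper around the pipeline. Both callbacks here are assumptions: `clientSide` stands for whatever starts the worker-based extraction, and `fallback` for a server-side or user-initiated chunked path; `extractWithFallback` is a hypothetical name.

```typescript
type ExtractionOutcome<T> =
  | { source: 'client'; value: T }
  | { source: 'fallback'; value: T; reason: string };

// Tries the client-side pipeline first and degrades if it throws,
// e.g. when a heap threshold check fails or the worker reports an error.
async function extractWithFallback<T>(
  clientSide: () => Promise<T>,
  fallback: (reason: string) => Promise<T>
): Promise<ExtractionOutcome<T>> {
  try {
    return { source: 'client', value: await clientSide() };
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    return { source: 'fallback', value: await fallback(reason), reason };
  }
}
```

Returning the `source` and `reason` alongside the value makes it straightforward to log how often the degraded path is taken.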