2. OffscreenCanvas: Standard DOM canvases trigger layout recalculation and paint cycles. OffscreenCanvas operates entirely in memory, eliminating DOM thrashing and reducing rendering overhead by ~40%.
3. Concurrency Semaphore: Processing all pages simultaneously causes heap spikes. A concurrency limiter ensures only 2–3 pages are in flight, keeping memory usage predictable.
4. Explicit Lifecycle Management: pdf.js does not automatically release internal page caches. Calling page.cleanup() and dropping references to the canvas makes those buffers collectable right away, so the heap does not keep climbing between GC cycles.
Implementation
Worker Module (pdf-rasterizer.worker.ts)
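```typescript
import * as pdfjs from 'pdfjs-dist';
import type { PDFDocumentProxy } from 'pdfjs-dist';

// One parsed document shared by all of its page tasks.
interface RasterJob {
  pdfDocument: PDFDocumentProxy;
  pagesLeft: number;
}

interface RasterTask {
  taskId: string;
  pageIndex: number; // 1-based, as expected by getPage()
  scale: number;
  job: RasterJob;
}

interface RasterResult {
  taskId: string;
  pageIndex: number;
  imageData: Blob;
  dimensions: { width: number; height: number };
}

// Concurrency control via semaphore
const MAX_CONCURRENT_TASKS = 3;
let activeTasks = 0;
const taskQueue: RasterTask[] = [];

async function executeRasterization(task: RasterTask): Promise<RasterResult> {
  const page = await task.job.pdfDocument.getPage(task.pageIndex);
  const viewport = page.getViewport({ scale: task.scale });
  const width = Math.ceil(viewport.width);
  const height = Math.ceil(viewport.height);

  // Declared with let so the references can be dropped during cleanup.
  let offscreen: OffscreenCanvas | null = new OffscreenCanvas(width, height);
  let renderContext = offscreen.getContext('2d');
  if (!renderContext) throw new Error('2D context unavailable on OffscreenCanvas');

  try {
    // The pdf.js typings expect a DOM 2D context; the offscreen context exposes the same drawing API.
    const canvasContext = renderContext as unknown as CanvasRenderingContext2D;
    await page.render({ canvasContext, viewport }).promise;
    const blob = await offscreen.convertToBlob({ type: 'image/png' });
    return {
      taskId: task.taskId,
      pageIndex: task.pageIndex,
      imageData: blob,
      dimensions: { width, height }
    };
  } finally {
    // Deterministic cleanup, also on failure
    page.cleanup();
    offscreen.width = 0; // shrink the backing store before dropping the reference
    offscreen.height = 0;
    renderContext = null;
    offscreen = null;
  }
}

async function runTask(task: RasterTask): Promise<void> {
  try {
    const result = await executeRasterization(task);
    // Blob is not transferable, so no transfer list is passed here.
    self.postMessage({ type: 'page-ready', payload: result });
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    self.postMessage({ type: 'page-error', payload: { taskId: task.taskId, pageIndex: task.pageIndex, error: message } });
  } finally {
    activeTasks--;
    // Release the parsed document once its last page has been handled.
    if (--task.job.pagesLeft === 0) await task.job.pdfDocument.destroy();
    processQueue(); // Drain remaining tasks
  }
}

function processQueue(): void {
  // Start tasks until the cap is reached; a single start per call would serialize the queue.
  while (taskQueue.length > 0 && activeTasks < MAX_CONCURRENT_TASKS) {
    const task = taskQueue.shift()!;
    activeTasks++;
    void runTask(task);
  }
}

self.onmessage = async (event: MessageEvent) => {
  const { type, payload } = event.data;
  if (type !== 'enqueue-pages') return;

  const { taskId, documentBuffer, totalPages, scale } = payload;
  try {
    // Parse the document once instead of once per page.
    const pdfDocument = await pdfjs.getDocument({ data: documentBuffer }).promise;
    const job: RasterJob = { pdfDocument, pagesLeft: totalPages };
    for (let i = 1; i <= totalPages; i++) {
      taskQueue.push({ taskId, pageIndex: i, scale: scale || 1.5, job });
    }
    processQueue();
  } catch (error) {
    // pageIndex 0 marks a document-level failure.
    const message = error instanceof Error ? error.message : String(error);
    self.postMessage({ type: 'page-error', payload: { taskId, pageIndex: 0, error: message } });
  }
};
```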
Main Thread Consumer (document-processor.ts)
```typescript
interface PageResult {
  taskId: string;
  pageIndex: number;
  imageData: Blob;
}

interface PendingJob {
  onPage: (result: PageResult) => void;
  onError: (error: Error) => void;
}

export class DocumentRasterizer {
  private worker: Worker;
  private pendingTasks = new Map<string, PendingJob>();

  constructor() {
    this.worker = new Worker(new URL('./pdf-rasterizer.worker.ts', import.meta.url), { type: 'module' });
    this.worker.onmessage = this.handleWorkerMessage.bind(this);
  }

  private handleWorkerMessage(event: MessageEvent) {
    const { type, payload } = event.data;
    const job = this.pendingTasks.get(payload.taskId);
    if (!job) return;

    if (type === 'page-ready') {
      job.onPage(payload);
    } else if (type === 'page-error') {
      // Reject instead of only logging; otherwise the caller's promise never settles.
      this.pendingTasks.delete(payload.taskId);
      job.onError(new Error(`Rasterization failed for page ${payload.pageIndex}: ${payload.error}`));
    }
  }

  extractPages(buffer: ArrayBuffer, totalPages: number, scale = 1.5): Promise<Blob[]> {
    const taskId = crypto.randomUUID();
    const results: Blob[] = new Array(totalPages);

    return new Promise((resolve, reject) => {
      let completedPages = 0;
      this.pendingTasks.set(taskId, {
        onPage: (result) => {
          // Pages can finish out of order; pageIndex restores document order.
          results[result.pageIndex - 1] = result.imageData;
          completedPages++;
          if (completedPages === totalPages) {
            this.pendingTasks.delete(taskId);
            resolve(results);
          }
        },
        onError: reject
      });

      // The buffer is cloned here. Pass [buffer] as the transfer list instead
      // if the caller no longer needs it after this call.
      this.worker.postMessage({
        type: 'enqueue-pages',
        payload: { taskId, documentBuffer: buffer, totalPages, scale }
      });
    });
  }

  terminate() {
    this.worker.terminate();
  }
}
```
Why This Architecture Works
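- Transferable Objects: The worker posts each page as a Blob. Blob is not a transferable object, so it must not appear in the transfer list (the second argument of postMessage); listing it there throws a DataCloneError. Cloning a Blob passes a handle to the same immutable data rather than copying the pixels, so the hand-off stays cheap. The transfer list is for types such as ArrayBuffer and ImageBitmap, where it moves ownership instead of cloning.
- Queue-Driven Execution: Instead of Promise.all, which launches all tasks simultaneously, the semaphore pattern caps the number of pages in flight, so memory usage remains bounded regardless of document length.
- Explicit Nullification: After page.cleanup(), references to the canvas context and offscreen buffer are manually nulled. Modern collectors already handle reference cycles, so the benefit is timing: the pixel buffer becomes unreachable immediately instead of staying alive until the enclosing scope ends.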
- Deterministic Resolution: The main thread tracks completion via a counter rather than relying on arbitrary timeouts, ensuring predictable promise resolution.
Pitfall Guide
1. Main Thread Rasterization Lock
Explanation: Running pdf.getPage() and canvas rendering directly in the UI thread blocks the event loop. The browser cannot process input, run animations, or trigger garbage collection.
Fix: Always delegate parsing and rendering to a Web Worker. Use OffscreenCanvas to avoid DOM interaction entirely.
2. Implicit Canvas Recreation
Explanation: Creating a new <canvas> element for every page forces the browser to allocate DOM nodes, trigger style recalculation, and schedule paint cycles. This compounds memory pressure and causes layout thrashing.
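Fix: Use OffscreenCanvas in workers, or reuse a single canvas across pages: resize it only when the page dimensions change and clear it with clearRect() between renders. Assigning canvas.width = canvas.width also clears the pixels, but it resets all context state and may reallocate the backing store.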
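The reuse strategy can be sketched as a small holder that creates a canvas once and only resizes it when the requested dimensions change. ReusableCanvas, CanvasLike, and the factory parameter are hypothetical names introduced for this illustration; the factory keeps the sketch independent of whether a DOM canvas or an OffscreenCanvas is used.

```typescript
// Minimal shape shared by HTMLCanvasElement and OffscreenCanvas.
interface CanvasLike {
  width: number;
  height: number;
}

// Hypothetical helper: holds one canvas and hands it out repeatedly.
class ReusableCanvas<T extends CanvasLike> {
  private canvas: T | null = null;
  private readonly create: (width: number, height: number) => T;

  constructor(create: (width: number, height: number) => T) {
    this.create = create;
  }

  acquire(width: number, height: number): T {
    if (this.canvas === null) {
      this.canvas = this.create(width, height);
      return this.canvas;
    }
    // Resize only when the page size differs; resizing clears the bitmap.
    if (this.canvas.width !== width || this.canvas.height !== height) {
      this.canvas.width = width;
      this.canvas.height = height;
    }
    return this.canvas;
  }

  release(): void {
    if (this.canvas !== null) {
      // Shrinking to 0x0 lets the engine drop the pixel buffer early.
      this.canvas.width = 0;
      this.canvas.height = 0;
      this.canvas = null;
    }
  }
}
```

In a worker the factory would typically be (w, h) => new OffscreenCanvas(w, h). When the size is unchanged the old pixels are still present, so the caller should clear them before rendering, and each in-flight page needs its own holder because a canvas cannot be shared between concurrent renders.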
3. Deferred Garbage Collection
Explanation: JavaScript's GC is non-deterministic. In tight loops, heap allocation outpaces collection, causing memory to climb until the tab crashes.
Fix: Call page.cleanup() after every page. Nullify large references immediately. Use setTimeout or requestIdleCallback between chunks to yield to the GC.
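One way to yield between chunks is sketched below, assuming a plain setTimeout(0) as the yield point. processInChunks is a hypothetical helper, not part of pdf.js; the handler stands in for whatever per-page work is being done.

```typescript
// Hypothetical helper: run a handler over items in small chunks,
// returning control to the event loop after each chunk.
async function processInChunks<T, R>(
  items: readonly T[],
  chunkSize: number,
  handler: (item: T, index: number) => Promise<R> | R,
): Promise<R[]> {
  if (chunkSize < 1) throw new RangeError('chunkSize must be at least 1');
  const results: R[] = [];
  for (let start = 0; start < items.length; start += chunkSize) {
    const end = Math.min(start + chunkSize, items.length);
    for (let i = start; i < end; i++) {
      results.push(await handler(items[i], i));
    }
    // Yield so queued input events, timers, and idle-time GC get a chance to run.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
  return results;
}
```

The chunk size is a tuning knob: smaller chunks yield more often at the cost of total throughput. On the main thread, requestIdleCallback could replace setTimeout where it is available.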
4. Unbounded Concurrency
Explanation: Launching all page tasks simultaneously (Promise.all) creates a memory spike proportional to document length. A 100-page PDF can easily exceed 3GB RAM.
Fix: Implement a concurrency limiter (semaphore or queue) that caps active tasks at 2–3. Process pages sequentially within the worker or use a controlled batch size.
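A generic limiter might look like the following sketch. mapWithLimit is a hypothetical helper written for this illustration; it starts at most limit runners, and each runner pulls the next unclaimed item until none remain.

```typescript
// Hypothetical helper: like Promise.all over a map, but with a cap on in-flight work.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (limit < 1) throw new RangeError('limit must be at least 1');
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims an index synchronously, so no item is processed twice.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Results keep the input order even when tasks finish out of order. If one task rejects, the returned promise rejects while the other runners finish the item they already started, so callers that need hard cancellation would have to add it themselves.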
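5. Cloned Buffer Messaging
Explanation: Sending an ArrayBuffer via postMessage without a transfer list clones the data, doubling memory usage during cross-thread communication. Blob behaves differently: it is not transferable, and listing one in the transfer list throws a DataCloneError, but cloning a Blob shares its immutable data instead of copying it.
Fix: Pass ArrayBuffer and ImageBitmap payloads in the transfer list: self.postMessage({ buffer }, [buffer]). This moves ownership instead of copying and leaves the sender's buffer detached. Post Blob results without a transfer list.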
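The difference between cloning and transferring can be observed without a worker, because structuredClone applies the same serialization that postMessage uses. The snippet below is a small demonstration under that assumption; the 1 MiB buffer is a stand-in for real page data.

```typescript
// A 1 MiB buffer standing in for document or pixel data.
const sourceBuffer = new ArrayBuffer(1024 * 1024);

// Clone: a second 1 MiB allocation is made and both buffers stay usable.
const clonedBuffer = structuredClone(sourceBuffer);
const sizeAfterClone = sourceBuffer.byteLength;

// Transfer: ownership moves to the receiver and the source is detached.
const movedBuffer = structuredClone(sourceBuffer, { transfer: [sourceBuffer] });
const sizeAfterTransfer = sourceBuffer.byteLength;

console.log({ sizeAfterClone, sizeAfterTransfer, moved: movedBuffer.byteLength });
```

After the transfer the source reports a byteLength of 0, which is why a buffer must not be touched again on the sending side once it has been listed for transfer.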
6. Parser Vulnerability Exposure
Explanation: PDFs are complex binary formats. Maliciously crafted documents can trigger memory exhaustion, infinite loops, or parser bugs in rendering engines, which matters most when the files come from untrusted sources.
Fix: Never process unvalidated PDFs in the main thread. Enforce input size limits, and check the file's leading %PDF- signature rather than trusting only the reported MIME type, which the client controls. For untrusted uploads, consider isolating the worker further, for example inside a sandboxed iframe with a restrictive Content-Security-Policy.
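A basic pre-flight check could look like the sketch below. validatePdfBuffer is a hypothetical helper, and the 50 MB default mirrors the limit used elsewhere in this article. It only inspects the first five bytes, so it rejects obviously wrong input but does not prove that a file is safe.

```typescript
// ASCII bytes of "%PDF-", the signature at the start of a PDF file.
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical helper: throws if the buffer is empty, too large, or lacks the PDF signature.
function validatePdfBuffer(buffer: ArrayBuffer, maxBytes = 50 * 1024 * 1024): void {
  if (buffer.byteLength === 0) {
    throw new Error('File is empty');
  }
  if (buffer.byteLength > maxBytes) {
    throw new Error(`File exceeds the ${maxBytes}-byte limit`);
  }
  const headerLength = Math.min(PDF_SIGNATURE.length, buffer.byteLength);
  const header = new Uint8Array(buffer, 0, headerLength);
  const hasSignature =
    header.length === PDF_SIGNATURE.length &&
    PDF_SIGNATURE.every((byte, i) => header[i] === byte);
  if (!hasSignature) {
    throw new Error('File does not start with the %PDF- signature');
  }
}
```

Running this before the buffer is posted to the worker keeps oversized or mislabeled files from ever reaching the parser.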
7. Misreading Heap Snapshots
Explanation: Developers often misread heap snapshots, assuming rising memory indicates a leak when it's actually normal GC behavior or temporary buffer allocation.
Fix: Take heap snapshots before and after processing. Look for detached DOM trees or lingering pdf.js internal caches. Use the "Allocation instrumentation on timeline" to track object lifecycles precisely.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| < 20 pages, internal tool | Main thread with requestIdleCallback chunks | Simpler implementation, acceptable latency | Low dev overhead |
| 20–100 pages, customer-facing | Worker + OffscreenCanvas + concurrency queue | Prevents UI freezes, maintains 60fps | Moderate dev overhead, high UX gain |
| > 100 pages, enterprise SaaS | Worker + chunked processing + server fallback | Guarantees stability, avoids mobile crashes | Higher infrastructure cost for fallback |
| Untrusted PDF uploads | Sandboxed worker + strict validation + size limits | Mitigates parser exploits and DoS vectors | Security compliance overhead |
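The matrix can also be expressed as a small selection function. This is a simplified sketch: chooseStrategy and the Strategy names are hypothetical, and it keys only on page count and trust level, ignoring the audience column (internal tool versus customer-facing).

```typescript
type Strategy =
  | 'main-thread-chunked'
  | 'worker-queue'
  | 'worker-with-server-fallback'
  | 'sandboxed-worker';

// Hypothetical helper mirroring the rows of the decision matrix.
function chooseStrategy(pageCount: number, trustedSource: boolean): Strategy {
  // Untrusted input takes priority over document size.
  if (!trustedSource) return 'sandboxed-worker';
  if (pageCount < 20) return 'main-thread-chunked';
  if (pageCount <= 100) return 'worker-queue';
  return 'worker-with-server-fallback';
}
```

Treating the thresholds as configuration rather than constants would make it easier to tune them per device class.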
Configuration Template
```typescript
// worker-config.ts
export const WORKER_CONFIG = {
  maxConcurrentPages: 3,
  defaultScale: 1.5,
  outputFormat: 'image/png' as const,
  transferableEnabled: true,
  memoryWatermarkMB: 200, // Back off if heap usage exceeds this
  timeoutMs: 30000
};
```

```typescript
// main-thread-consumer.ts
import * as pdfjs from 'pdfjs-dist';
import { DocumentRasterizer } from './document-processor';
import { WORKER_CONFIG } from './worker-config';

const rasterizer = new DocumentRasterizer();

export async function processDocument(file: File): Promise<Blob[]> {
  if (file.size > 50 * 1024 * 1024) {
    throw new Error('File exceeds 50MB safety limit');
  }
  const buffer = await file.arrayBuffer();
  // pdf.js may take ownership of the data it is given, so hand it a copy
  // and keep the original buffer intact for the worker.
  const pdf = await pdfjs.getDocument({ data: buffer.slice(0) }).promise;
  try {
    return await rasterizer.extractPages(buffer, pdf.numPages, WORKER_CONFIG.defaultScale);
  } finally {
    await pdf.destroy();
  }
}
```
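The template defines timeoutMs but does not show where it applies. One possible use, sketched here under the assumption that the whole extraction should fail after the deadline, is to race the work against a timer. withTimeout is a hypothetical helper, not part of the modules above.

```typescript
// Hypothetical helper: reject if the given promise has not settled within timeoutMs.
function withTimeout<T>(work: Promise<T>, timeoutMs: number, label = 'operation'): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}
```

A call such as withTimeout(rasterizer.extractPages(buffer, pages, scale), WORKER_CONFIG.timeoutMs, 'rasterization') would bound the wait. The race does not stop the underlying work, so on timeout the worker should also be terminated and recreated.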
Quick Start Guide
- Initialize Worker: Create a new Worker instance pointing to your rasterization module. Ensure your bundler supports new URL(..., import.meta.url) for worker imports.
- Load PDF Buffer: Fetch or read the file as an ArrayBuffer. Validate size and MIME type before passing it to the worker.
- Enqueue Pages: Send an enqueue-pages message containing the buffer, total page count, and desired scale. The worker throttles execution automatically.
- Collect Results: Listen for page-ready messages. Map results to an array using the pageIndex to maintain order. Resolve when all pages complete.
- Cleanup: Call pdf.destroy() on the main thread after extraction. Terminate the worker when the feature is no longer needed to release OS threads.
Client-side document processing is no longer a novelty; it's a production requirement. The difference between a crashing tab and a seamless experience lies in respecting browser execution limits, isolating CPU-bound work, and managing memory deterministically. Implement these patterns early, and your rasterization pipeline will scale gracefully across devices.