----|
| Cloud API (AWS Textract) | $150.00 | 1.2s | N/A (Server-side) | Low (Data leaves origin) |
| Server-side Tesseract (Node.js) | $0 (Infra only) | 0.8s | 250MB | Medium (Controlled infra) |
| Browser Tesseract.js (WASM) | $0 | 2.1s | 120MB | High (Zero egress) |
Key Findings:
- Tesseract.js achieves parity with server-side accuracy (±2% character error rate) while eliminating per-page billing.
- Latency increases by roughly 160% compared to server-side processing (2.1s vs. 0.8s) due to WASM initialization and client CPU constraints, but remains acceptable for interactive UX when paired with progress streaming.
- Memory usage stays well within browser limits when processing is chunked and Web Workers are utilized.
Core Solution
Browser-based OCR requires a three-stage pipeline: PDF Rasterization → Image Preprocessing → WASM OCR Execution. Tesseract.js does not natively parse PDFs; pdf.js must first convert pages to canvas elements or base64 images.
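The middle stage, image preprocessing, can be illustrated with a minimal binarization sketch. The function name `binarize` and the fixed threshold of 128 are assumptions made for illustration; real scans often need an adaptive threshold. The function works on any object shaped like the Canvas `ImageData` type, so it has no browser dependency of its own.

```javascript
// Hypothetical helper: converts RGBA pixels to pure black or white, in place.
// `imageData` is shaped like Canvas ImageData: { data: Uint8ClampedArray, width, height }.
function binarize(imageData, threshold = 128) {
  const { data } = imageData;
  for (let i = 0; i < data.length; i += 4) {
    // Rec. 601 luma weights approximate perceived brightness.
    const luma = 0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2];
    const value = luma >= threshold ? 255 : 0;
    data[i] = value;     // red
    data[i + 1] = value; // green
    data[i + 2] = value; // blue; alpha at i + 3 is left untouched
  }
  return imageData;
}

// In the browser this would sit between page.render() and canvas.toDataURL():
//   const pixels = context.getImageData(0, 0, canvas.width, canvas.height);
//   context.putImageData(binarize(pixels), 0, 0);
```

A fixed threshold is the simplest choice and tends to work on clean, evenly lit scans; uneven lighting usually calls for a per-region threshold instead.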
Architecture Decision
- Main Thread: Handles UI, pdf.js rendering, and progress callbacks.
- Web Worker: Isolates Tesseract.js execution to prevent UI freezing.
- Chunking Strategy: Process PDFs in batches of 3–5 pages to avoid WASM heap exhaustion.
Implementation Code
1. PDF Rasterization & Image Extraction
import * as pdfjsLib from 'pdfjs-dist';

// Load the pdf.js worker from a CDN path pinned to the installed library version.
pdfjsLib.GlobalWorkerOptions.workerSrc = `https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${pdfjsLib.version}/pdf.worker.min.mjs`;

// Renders every page to an off-screen canvas and returns the pages as PNG data URLs.
async function extractImagesFromPDF(pdfUrl) {
  const loadingTask = pdfjsLib.getDocument(pdfUrl);
  const pdf = await loadingTask.promise;
  const images = [];

  for (let i = 1; i <= pdf.numPages; i++) { // pdf.js page numbers start at 1
    const page = await pdf.getPage(i);
    const viewport = page.getViewport({ scale: 2.0 }); // 2x scale improves OCR accuracy
    const canvas = document.createElement('canvas');
    const context = canvas.getContext('2d');
    canvas.height = viewport.height;
    canvas.width = viewport.width;

    await page.render({ canvasContext: context, viewport }).promise;
    images.push(canvas.toDataURL('image/png'));
  }

  return images;
}
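The `scale` value passed to `getViewport` can be derived from a target DPI rather than hard-coded. The sketch below assumes the standard PDF unit of 72 points per inch; `scaleForDpi` and `canvasBytes` are hypothetical helper names, and the byte count is only an estimate of the canvas backing store (4 bytes per RGBA pixel), not a measurement of actual browser memory use.

```javascript
const PDF_POINTS_PER_INCH = 72; // PDF user space: 1 point = 1/72 inch

// Returns the pdf.js viewport scale that renders a page at the requested DPI.
function scaleForDpi(targetDpi) {
  return targetDpi / PDF_POINTS_PER_INCH;
}

// Estimates the RGBA backing-store size of the canvas for one rendered page.
// Page dimensions are given in PDF points (for example, US Letter is 612 x 792).
function canvasBytes(pageWidthPt, pageHeightPt, scale) {
  const width = Math.floor(pageWidthPt * scale);
  const height = Math.floor(pageHeightPt * scale);
  return width * height * 4;
}
```

For a US Letter page at scale 2.0 (144 DPI), `canvasBytes(612, 792, 2)` comes to 7,755,264 bytes, roughly 7.4 MiB per page before PNG encoding. This is one reason to keep the number of pages held in memory small.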
2. Tesseract.js Worker Execution
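import { createWorker } from 'tesseract.js';

// Recognizes a single page image in its own worker and reports progress via onProgress.
async function runOCR(imageData, onProgress) {
  const worker = await createWorker('eng', 1, {
    logger: m => onProgress(m),
    workerPath: new URL('tesseract.js/dist/worker.min.js', import.meta.url).href,
    langPath: 'https://tessdata.projectnaptha.com/4.0.0',
    // corePath should reference the tesseract.js-core directory rather than the raw
    // .wasm file, so Tesseract.js can pick the matching core build. Adjust the path
    // to wherever your bundler emits these assets.
    corePath: new URL('tesseract.js-core', import.meta.url).href
  });

  try {
    const { data: { text, confidence } } = await worker.recognize(imageData);
    return { text, confidence };
  } finally {
    await worker.terminate(); // Critical: prevents memory leaks
  }
}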
3. Production Pipeline Orchestration
async function processPDF(pdfUrl, chunkSize = 3) {
  const images = await extractImagesFromPDF(pdfUrl);
  const results = [];

  for (let i = 0; i < images.length; i += chunkSize) {
    const chunk = images.slice(i, i + chunkSize);
    const pageResults = await Promise.all(
      chunk.map(img => runOCR(img, m => console.log(`Progress: ${m.progress}`)))
    );
    results.push(...pageResults);
  }

  return results.map(r => r.text).join('\n\n--- PAGE BREAK ---\n\n');
}
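The orchestration above logs each page's progress separately, which is hard to turn into a single progress bar. One possible way to combine the per-page values is sketched below. `createProgressTracker` is a hypothetical helper, and the sketch assumes the logger messages carry a `status` string and a `progress` value between 0 and 1, with `'recognizing text'` marking the recognition phase.

```javascript
// Hypothetical helper: merges per-page logger messages into one overall fraction (0..1).
function createProgressTracker(totalPages, onOverall) {
  const perPage = new Array(totalPages).fill(0);

  // Returns a logger callback bound to one page index (0-based).
  return function forPage(pageIndex) {
    return function onMessage(m) {
      // Assumption: only the recognition phase counts toward page progress;
      // loading and initialization messages are ignored.
      if (m.status !== 'recognizing text') return;
      perPage[pageIndex] = Math.max(perPage[pageIndex], m.progress);
      const overall = perPage.reduce((sum, p) => sum + p, 0) / totalPages;
      onOverall(overall);
    };
  };
}
```

Inside the chunk loop, a call such as `runOCR(img, tracker(i + j))`, where `j` is the page's position within the chunk, would replace the `console.log` callback.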
Pitfall Guide
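- Main Thread Blocking: Tesseract.js is CPU-bound. createWorker() already runs recognition inside a Web Worker, so keep worker.recognize() there; running the OCR core on the main thread freezes the UI and can trigger unresponsive-page warnings. The pdf.js rendering and PNG encoding that stay on the main thread should be chunked with setTimeout/requestIdleCallback.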
- PDF Rasterization DPI Mismatch: pdf.js renders at 72 DPI when scale is 1.0, which destroys OCR accuracy. Always render at scale: 2.0 or higher before passing to Tesseract; scale 2.0 corresponds to 144 DPI, and 300 DPI requires a scale of about 4.17.
- WASM Heap Exhaustion: Browser tabs enforce strict memory limits. Processing >10 pages simultaneously causes RangeError: Maximum call stack size exceeded or silent WASM crashes. Implement chunking and explicit worker.terminate() after each batch.
- CORS & Asset Path Resolution: tesseract-core.wasm and language data files must be served with correct MIME types (application/wasm). Relative paths fail in production bundlers. Use new URL() or explicit CDN paths for workerPath and langPath.
- Ignoring Image Preprocessing: Low-contrast scans, skew, or JPEG artifacts reduce confidence scores below 60%. Apply canvas-based binarization or sharpening filters before OCR, or configure Tesseract's tessedit_char_whitelist and preserve_interword_spaces for noisy documents.
- Silent Language Data Failures: If langPath is unreachable or the .traineddata file is corrupted, Tesseract falls back to English without throwing. Validate language loading with explicit error handling and fallback UI states.
- Over-relying on Default Confidence Thresholds: The default 60 threshold accepts noisy output. Production systems should filter results where confidence < 75 and route low-confidence pages to manual review or re-processing with adjusted tessedit_pageseg_mode.
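The confidence filtering described in the last pitfall could look roughly like the sketch below. It assumes access to the array of `{ text, confidence }` objects that `runOCR` returns for each page, before they are joined into a single string. `routeByConfidence` is a hypothetical helper, and its default cutoff of 75 is taken from the recommendation above.

```javascript
// Hypothetical helper: splits per-page OCR results into accepted pages and
// pages that need manual review or re-processing.
function routeByConfidence(pageResults, minConfidence = 75) {
  const accepted = [];
  const needsReview = [];

  pageResults.forEach((result, index) => {
    // Keep the 1-based page number so reviewers can find the page in the PDF.
    const entry = { page: index + 1, text: result.text, confidence: result.confidence };
    if (result.confidence >= minConfidence) {
      accepted.push(entry);
    } else {
      needsReview.push(entry);
    }
  });

  return { accepted, needsReview };
}
```

Pages in `needsReview` can then be shown in a review UI or re-run with a different page segmentation mode, while `accepted` pages go straight to the output.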
Deliverables
- Architecture Blueprint: Visual flow diagram detailing the pdf.js → Canvas → Web Worker → Tesseract.js → Output pipeline, including memory boundaries and chunking thresholds.
- Pre-Flight Checklist:
- Configuration Templates:
  - Vite/Webpack WASM loader config (?url imports, asset/resource rules)
  - Tesseract.js initialization object with production-safe paths
  - pdf.js worker initialization snippet with version-locking strategy