parse` | 98% | ~1,200ms | High (signature validation + timeout guard) | Node only (explicit) |
Key Findings:
- Magic byte validation (`%PDF` → `0x25 0x50 0x44 0x46`) eliminates client-side MIME spoofing.
- A 25-second `Promise.race` timeout prevents event loop starvation from malformed or zip-bomb PDFs.
- Explicit `export const runtime = 'nodejs'` ensures deployment consistency across Vercel Fluid Compute and standard Node environments.
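The signature check in the first finding can be sketched as a small standalone function. The name `hasPdfSignature` and its `Uint8Array` parameter are illustrative assumptions, not taken from the original route; the point is only that the decision rests on the bytes and never on the declared MIME type.

```typescript
// Hypothetical helper (name and signature are assumptions for illustration).
// The four bytes spell "%PDF" in ASCII.
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46] as const;

function hasPdfSignature(bytes: Uint8Array): boolean {
  // Anything shorter than the signature cannot be a PDF.
  if (bytes.length < PDF_SIGNATURE.length) return false;
  return PDF_SIGNATURE.every((value, index) => bytes[index] === value);
}

// Usage: a ZIP-like payload claiming to be a PDF vs. a real PDF header.
const spoofedUpload = new TextEncoder().encode('PK\u0003\u0004 not a pdf');
const genuineUpload = new TextEncoder().encode('%PDF-1.7\n');
console.log(hasPdfSignature(spoofedUpload)); // false
console.log(hasPdfSignature(genuineUpload)); // true
```

A matching signature only shows that the file starts like a PDF; the parser still has to cope with whatever follows.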
Core Solution
The production fix replicates the hardened extraction pattern into the demo route. It combines binary signature detection, a timeout-guarded parser, and explicit runtime declaration to ensure reliability and security.
    const buffer = Buffer.from(await file.arrayBuffer());
    if (file.type === DOCX_MIME) return extractFromDocx(buffer, ...);
    if (file.type === XLSX_MIME) return extractFromXlsx(buffer, ...);
    const text = buffer.toString('utf-8', 0, Math.min(buffer.length, 100000));
    return extractFromText(text, ...);
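To see why a UTF-8 text fallback like the one above is unsafe for PDF bytes, the following sketch decodes a few non-UTF-8 bytes and re-encodes them. The sample bytes are made up for illustration and merely stand in for a compressed stream; `TextDecoder` is used here for portability, and Node's `buffer.toString('utf-8')` substitutes invalid sequences in the same way.

```typescript
// "%PDF" followed by bytes that are not valid UTF-8 (illustrative sample data).
const original = Uint8Array.from([0x25, 0x50, 0x44, 0x46, 0xff, 0xfe, 0x80, 0x00]);

// Non-fatal decoding replaces each invalid byte with U+FFFD.
const decoded = new TextDecoder('utf-8').decode(original);
const roundTripped = new TextEncoder().encode(decoded);

// Compare the round-tripped bytes with the input.
const survived =
  roundTripped.length === original.length &&
  roundTripped.every((byte, index) => byte === original[index]);

console.log(decoded.includes('\uFFFD')); // true
console.log(survived); // false
```

Once the replacement characters are in the string, the original bytes cannot be recovered, so no downstream text extractor can repair the damage.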
    import { NextResponse } from 'next/server';
    import pdfParse from 'pdf-parse';

    export const runtime = 'nodejs';
    export const maxDuration = 30;

    // Inside the route handler, after `buffer` has been read from the upload:

    // Magic bytes detection (defends against MIME spoofing)
    const sig = buffer.subarray(0, 4);
    const isPDFBytes = sig[0] === 0x25 && sig[1] === 0x50
      && sig[2] === 0x44 && sig[3] === 0x46; // %PDF

    if (file.type === 'application/pdf' || file.name.endsWith('.pdf') || isPDFBytes) {
      try {
        // Race the parser against a 25s timer so the handler never waits longer
        const pdfData = await Promise.race([
          pdfParse(buffer),
          new Promise<never>((_, reject) =>
            setTimeout(() => reject(new Error('PDF_PARSE_TIMEOUT')), 25_000)
          ),
        ]);
        const text = pdfData.text || '';
        // Treat near-empty output as a failed extraction
        if (text.length < 10) {
          return NextResponse.json(
            { error: 'Could not extract text from PDF.', code: 'EXTRACTION_FAILED' },
            { status: 422 }
          );
        }
        return NextResponse.json(extractFromText(text, ...));
      } catch (err) {
        if (err instanceof Error && err.message === 'PDF_PARSE_TIMEOUT') {
          return NextResponse.json(
            { error: 'PDF processing timed out.', code: 'PDF_TIMEOUT' },
            { status: 504 }
          );
        }
        return NextResponse.json(
          { error: 'PDF parsing failed.', code: 'PDF_PARSE_ERROR' },
          { status: 422 }
        );
      }
    }
Architecture Decisions:
- `Promise.race` with timeout: `pdfParse` does CPU-heavy work behind an async interface, so a malformed or zip-bomb PDF can keep a request pending indefinitely. The 25s ceiling sits below the 30s `maxDuration`, so the handler returns a structured 504 before the platform terminates the function. The race only stops the handler from waiting: it does not cancel the parse, and the timer cannot fire while synchronous work is holding the event loop.
- Magic bytes check: Validates the first 4 bytes (`%PDF`) to defend against spoofed Content-Type headers or renamed files.
- Explicit `runtime = 'nodejs'`: pdf-parse relies on Node Buffer APIs unavailable in Edge runtimes. Declaring the runtime makes the deployment contract explicit and prevents silent Vercel/Next.js routing mismatches.
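The timeout decision can be factored into a reusable wrapper. The sketch below is one possible shape, not the original route's code: `withTimeout` and its parameters are hypothetical names. Unlike an inline `Promise.race`, it clears the timer once the race settles, so a fast parse does not leave a 25-second timer pending.

```typescript
// Hypothetical helper (an assumption for illustration): rejects with
// `new Error(code)` if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, code: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(code)), ms);
  });
  // Whichever side settles first wins; the timer is cleared either way.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Usage: a 200ms task raced against a 50ms limit.
const slowTask = new Promise<string>((resolve) => setTimeout(() => resolve('done'), 200));
withTimeout(slowTask, 50, 'PDF_PARSE_TIMEOUT').catch((err: Error) => {
  console.log(err.message); // PDF_PARSE_TIMEOUT
});
```

In the route this would read roughly as `await withTimeout(pdfParse(buffer), 25_000, 'PDF_PARSE_TIMEOUT')`. The losing promise is not cancelled, so the parse may keep consuming CPU after the timeout response has been sent.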
Pitfall Guide
- Trusting Client-Declared MIME Types: `file.type` originates from the browser and can be trivially spoofed. Always validate against magic bytes (`0x25 0x50 0x44 0x46` for PDF) before routing to format-specific parsers.
- UTF-8 Decoding Binary Streams: PDFs contain compressed binary streams (FlateDecode, DCTDecode, etc.). Forcing `buffer.toString('utf-8')` corrupts the payload, because byte sequences that are not valid UTF-8 are replaced rather than preserved. Use dedicated parsers (pdf-parse, pdf2json, or WASM-based alternatives) that understand PDF object hierarchies.
- Missing Timeout Guards for CPU-Heavy Parsers: Synchronous parsing inside async wrappers can starve the event loop. Wrap heavy operations in `Promise.race` with a strict timeout to bound how long the handler waits and to prevent function hangs.
- Ignoring Runtime Environment Constraints: Edge runtimes lack Node.js `Buffer`, `fs`, and native C++ bindings. Explicitly declare `export const runtime = 'nodejs'` in Next.js API routes to avoid silent deployment-time failures or unexpected bundling behavior.
- Testing with Synthetic/Sample Data Only: Playground samples often follow idealized structures. Real customer data contains scanned pages, embedded fonts, and malformed headers. Always validate endpoints with cold, production-like payloads before shipping.
- Lack of Structured Error Degradation: Returning unhandled exceptions or generic 500 errors obscures root causes. Map parser failures to explicit HTTP status codes (`422` for extraction failure, `504` for timeout) and standardized error codes for client-side retry logic.
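The last pitfall's mapping from parser failures to status codes can be kept in one place instead of being repeated in each `catch` block. This is a minimal sketch under stated assumptions: `toFailure`, the `ExtractionFailure` type, and the `retryable` flag are hypothetical additions, while the status codes, error codes, and messages mirror the ones used in the route above.

```typescript
// Hypothetical shape for a structured failure; `retryable` is an assumption
// added to illustrate how a client might decide whether to retry.
type ExtractionFailure = {
  status: 422 | 504;
  code: 'PDF_TIMEOUT' | 'PDF_PARSE_ERROR';
  error: string;
  retryable: boolean;
};

function toFailure(err: unknown): ExtractionFailure {
  // The timeout is identified by the sentinel message set when the race is lost.
  if (err instanceof Error && err.message === 'PDF_PARSE_TIMEOUT') {
    return { status: 504, code: 'PDF_TIMEOUT', error: 'PDF processing timed out.', retryable: true };
  }
  // Everything else is treated as an unprocessable document.
  return { status: 422, code: 'PDF_PARSE_ERROR', error: 'PDF parsing failed.', retryable: false };
}

// Usage
console.log(toFailure(new Error('PDF_PARSE_TIMEOUT')).status); // 504
console.log(toFailure(new Error('bad xref table')).status); // 422
```

Whether a timeout should really be marked retryable depends on the workload; retrying the same oversized file is likely to time out again.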
Deliverables
- PDF-to-JSON Extraction Blueprint: Architecture diagram covering file upload validation, magic byte routing, timeout-guarded parsing, and structured JSON transformation pipelines.
- File Upload Validation & Parsing Checklist: Step-by-step verification matrix for MIME spoofing prevention, binary signature checks, runtime compatibility, and timeout configuration.
- Next.js API Route Runtime & Timeout Template: Pre-configured `export const runtime` and `maxDuration` declarations with `Promise.race` wrapper patterns ready for production deployment.