Step 1: Hybrid Classification Layer
Document classification should never rely solely on a single model. A hybrid approach combines rule-based heuristics for known formats with lightweight ML models for ambiguous cases. This reduces unnecessary GPU calls and routes documents to the correct extraction schema.
    interface ClassificationResult {
      documentType: 'invoice' | 'contract' | 'report' | 'unknown';
      confidence: number;
      routingKey: string;
    }

    class DocumentClassifier {
      async classify(rawBuffer: Buffer): Promise<ClassificationResult> {
        // Scan only the first 2 KB; subarray() replaces the deprecated Buffer.slice()
        const header = rawBuffer.subarray(0, 2048).toString('utf-8');

        // Rule-based fast path for known formats
        if (/invoice|statement|payment/i.test(header)) {
          return { documentType: 'invoice', confidence: 0.95, routingKey: 'ocr.invoice' };
        }
        if (/agreement|party|term/i.test(header)) {
          return { documentType: 'contract', confidence: 0.92, routingKey: 'ocr.contract' };
        }

        // Fallback to lightweight ML model for ambiguous cases
        // (runLightweightModel is assumed to be implemented elsewhere in this class)
        const mlResult = await this.runLightweightModel(rawBuffer);
        return {
          documentType: mlResult.type,
          confidence: mlResult.score,
          routingKey: `ocr.${mlResult.type}`
        };
      }
    }
Rationale: Rule-based checks execute in microseconds and handle 60-80% of enterprise documents. The ML fallback only triggers for edge cases, preserving GPU capacity for heavy inference.
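The fast path above uses one regex per type and checks invoices first, so a contract whose header mentions "payment" would be routed as an invoice. A minimal sketch of a more defensive variant follows. It assumes a hypothetical `RULES` table and `classifyByRules` helper, neither of which is part of the original code. Each rule counts distinct keyword hits, and the function returns `null` when no type clearly wins so that the caller can fall back to the ML model. The keyword lists and the two-hit minimum are illustrative, not tuned values.

```typescript
type DocumentType = 'invoice' | 'contract' | 'report' | 'unknown';

interface Rule {
  documentType: DocumentType;
  routingKey: string;
  keywords: RegExp[]; // each pattern counts as one hit when it matches
}

// Hypothetical rule table; the keywords are illustrative only.
const RULES: Rule[] = [
  {
    documentType: 'invoice',
    routingKey: 'ocr.invoice',
    keywords: [/\binvoice\b/i, /\bamount due\b/i, /\bpayment\b/i],
  },
  {
    documentType: 'contract',
    routingKey: 'ocr.contract',
    keywords: [/\bagreement\b/i, /\bparty\b/i, /\bterms?\b/i],
  },
];

interface RuleMatch {
  documentType: DocumentType;
  routingKey: string;
  hits: number;
}

// Returns the winning rule, or null when the header is ambiguous
// (fewer than minHits matches, or a tie between the top two rules).
function classifyByRules(header: string, minHits = 2): RuleMatch | null {
  const scored: RuleMatch[] = RULES.map((rule) => ({
    documentType: rule.documentType,
    routingKey: rule.routingKey,
    hits: rule.keywords.filter((kw) => kw.test(header)).length,
  })).sort((a, b) => b.hits - a.hits);

  const [best, runnerUp] = scored;
  if (!best || best.hits < minHits) return null;
  if (runnerUp && runnerUp.hits === best.hits) return null;
  return best;
}
```

A `null` result is the signal to call the ML fallback, which keeps the rule layer conservative: it only short-circuits when the evidence is unambiguous.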
Step 2: GPU-Bound Inference Isolation
OCR and LLM extraction must run in separate services with dedicated GPU pools. Sharing a single GPU instance across both workloads causes memory fragmentation and context-switching overhead.
    // OCR Service (GPU-Optimized)
    interface OcrRequest {
      jobId: string;
      pages: Buffer[];
      engine: 'tesseract' | 'paddleocr' | 'custom';
    }

    class OcrWorker {
      private gpuQueue: AsyncQueue<OcrRequest>;

      constructor() {
        this.gpuQueue = new AsyncQueue({ concurrency: 4, batchSize: 8 });
      }

      async processBatch(requests: OcrRequest[]): Promise<Record<string, string[]>> {
        const results: Record<string, string[]> = {};
        for (const req of requests) {
          const pageTexts = await Promise.all(
            req.pages.map(page => this.runVisionInference(page))
          );
          results[req.jobId] = pageTexts;
        }
        return results;
      }

      private async runVisionInference(image: Buffer): Promise<string> {
        // GPU tensor allocation, preprocessing, inference, postprocessing
        return await this.visionEngine.predict(image);
      }
    }

    // LLM Extraction Service (Separate GPU Pool)
    interface ExtractionRequest {
      jobId: string;
      extractedText: string[];
      schema: Record<string, string>;
    }

    class ExtractionWorker {
      private inferenceQueue: AsyncQueue<ExtractionRequest>;

      constructor() {
        this.inferenceQueue = new AsyncQueue({ concurrency: 2, batchSize: 4 });
      }

      async extractStructuredData(requests: ExtractionRequest[]): Promise<Record<string, unknown>> {
        const outputs: Record<string, unknown> = {};
        for (const req of requests) {
          const prompt = this.buildSchemaPrompt(req.schema, req.extractedText);
          const rawOutput = await this.llmEngine.generate(prompt);
          outputs[req.jobId] = this.parseJsonOutput(rawOutput);
        }
        return outputs;
      }
    }
Rationale: OCR benefits from high batch sizes and steady GPU utilization. LLM extraction is more variable in token length and requires different VRAM allocation. Separating them prevents one workload from starving the other.
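Since OCR benefits from full batches, one option is to batch pages across requests rather than per request, so that a one-page job and a thirty-page job can share GPU batches. The sketch below shows only that bookkeeping. `OcrJob`, `PageRef`, `buildPageBatches`, and `regroupByJob` are hypothetical names, and `Uint8Array` stands in for `Buffer` to keep the example runtime-neutral; the inference call itself is left out.

```typescript
interface OcrJob {
  jobId: string;
  pages: Uint8Array[];
}

interface PageRef {
  jobId: string;
  pageIndex: number;
  image: Uint8Array;
}

// Flatten the pages of all jobs and cut them into batches of at most batchSize.
function buildPageBatches(jobs: OcrJob[], batchSize: number): PageRef[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const flat: PageRef[] = [];
  for (const job of jobs) {
    job.pages.forEach((image, pageIndex) => {
      flat.push({ jobId: job.jobId, pageIndex, image });
    });
  }
  const batches: PageRef[][] = [];
  for (let i = 0; i < flat.length; i += batchSize) {
    batches.push(flat.slice(i, i + batchSize));
  }
  return batches;
}

// Reassemble per-job page text in page order once every batch has been inferred.
// texts[b][i] is the recognized text for batches[b][i].
function regroupByJob(batches: PageRef[][], texts: string[][]): Record<string, string[]> {
  const out: Record<string, string[]> = {};
  batches.forEach((batch, b) => {
    batch.forEach((ref, i) => {
      const pages = out[ref.jobId] ?? (out[ref.jobId] = []);
      pages[ref.pageIndex] = texts[b][i];
    });
  });
  return out;
}
```

Carrying `jobId` and `pageIndex` alongside each image is what lets batches complete in any order without losing page ordering.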
Step 3: Async Orchestration & IO Decoupling
The pipeline controller must never block on inference or network calls. Async queues with backpressure handling ensure the system remains responsive under load.
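    class PipelineOrchestrator {
      private readonly classifier = new DocumentClassifier();
      // Document type per in-flight job, used to pick the extraction schema
      private readonly pendingTypes = new Map<string, ClassificationResult['documentType']>();

      constructor(
        private readonly ocrQueue: AsyncQueue<OcrRequest>,
        private readonly extractionQueue: AsyncQueue<ExtractionRequest>,
        private readonly resultStore: KVStore
      ) {
        // Register completion handlers once. Registering them inside
        // ingestDocument would add a new pair of listeners for every document,
        // and each listener would also fire for other documents' results.
        this.ocrQueue.onComplete(async (ocrResults) => {
          for (const [jobId, extractedText] of Object.entries(ocrResults)) {
            const documentType = this.pendingTypes.get(jobId) ?? 'unknown';
            this.pendingTypes.delete(jobId);
            await this.extractionQueue.push({
              jobId,
              extractedText,
              schema: this.getSchemaForType(documentType)
            });
          }
        });
        this.extractionQueue.onComplete(async (extractionResults) => {
          for (const [jobId, output] of Object.entries(extractionResults)) {
            await this.resultStore.set(jobId, output);
            await this.notifyConsumer(jobId);
          }
        });
      }

      async ingestDocument(docId: string, fileBuffer: Buffer): Promise<void> {
        const classification = await this.classifier.classify(fileBuffer);
        this.pendingTypes.set(docId, classification.documentType);
        // Split pages and push to the OCR queue; completion is handled asynchronously
        const pages = this.splitDocumentIntoPages(fileBuffer);
        await this.ocrQueue.push({ jobId: docId, pages, engine: 'paddleocr' });
      }
    }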
Rationale: Async event-driven flow prevents thread blocking. Queue depth becomes the primary scaling metric, not CPU utilization. The orchestrator remains lightweight, delegating compute to specialized workers.
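Backpressure is the part of this step that the orchestrator code leaves implicit. One way to picture it is a bounded queue whose `push` only resolves once the item has been accepted, so a producer that awaits it slows down when consumers fall behind. The `BoundedQueue` class below is a hypothetical in-memory sketch of that idea, not the `AsyncQueue` used above and not a replacement for a real broker.

```typescript
interface ParkedProducer<T> {
  item: T;
  accept: () => void;
}

class BoundedQueue<T> {
  private readonly items: T[] = [];
  private readonly waiting: ParkedProducer<T>[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.capacity = capacity;
  }

  get size(): number {
    return this.items.length;
  }

  get blockedProducers(): number {
    return this.waiting.length;
  }

  // Resolves immediately while there is room; otherwise parks the producer
  // until a consumer frees a slot.
  push(item: T): Promise<void> {
    if (this.items.length < this.capacity) {
      this.items.push(item);
      return Promise.resolve();
    }
    return new Promise<void>((accept) => {
      this.waiting.push({ item, accept });
    });
  }

  // Removes the oldest item and admits one parked producer, if any.
  shift(): T | undefined {
    const head = this.items.shift();
    const next = this.waiting.shift();
    if (next) {
      this.items.push(next.item);
      next.accept();
    }
    return head;
  }
}
```

With this shape, `size` and `blockedProducers` are the two numbers worth exporting as metrics: the first is queue depth, the second shows how many producers are currently being throttled.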
Pitfall Guide
1. Treating OCR as a Lightweight Preprocessing Step
Explanation: Developers often assume OCR is fast and cheap, bundling it into the same process as orchestration. In reality, OCR performs heavy image transforms and layout analysis, consuming significant GPU memory and compute cycles.
Fix: Isolate OCR into a dedicated service with batched inference. Profile page throughput separately and allocate GPU pools based on pages-per-second targets, not document count.
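Sizing by pages per second comes down to simple arithmetic. The sketch below assumes you have profiled a single GPU and know its sustained page throughput. `pagesPerSecond`, `ocrGpusNeeded`, and the 0.7 utilization target are hypothetical, and the workload numbers in the comment are made up for illustration.

```typescript
// Convert a daily document volume into a pages-per-second target
// over the processing window.
function pagesPerSecond(docsPerDay: number, avgPagesPerDoc: number, windowHours: number): number {
  if (windowHours <= 0) throw new RangeError('windowHours must be positive');
  return (docsPerDay * avgPagesPerDoc) / (windowHours * 3600);
}

// Number of GPUs needed to hit the target while running each GPU
// at targetUtilization of its measured throughput (headroom for bursts).
function ocrGpusNeeded(
  targetPagesPerSec: number,
  measuredPagesPerSecPerGpu: number,
  targetUtilization = 0.7,
): number {
  if (targetPagesPerSec < 0 || measuredPagesPerSecPerGpu <= 0) {
    throw new RangeError('throughput values must be positive');
  }
  if (targetUtilization <= 0 || targetUtilization > 1) {
    throw new RangeError('targetUtilization must be in (0, 1]');
  }
  return Math.ceil(targetPagesPerSec / (measuredPagesPerSecPerGpu * targetUtilization));
}

// Illustration: 200,000 documents/day at 6 pages each, processed in an 8-hour
// window, is about 41.7 pages/s. At a measured 12 pages/s per GPU and 70%
// target utilization, that requires 5 GPUs.
const exampleGpus = ocrGpusNeeded(pagesPerSecond(200_000, 6, 8), 12, 0.7);
```

Note that the same 200,000 documents at 2 pages each would need only 2 GPUs under these assumptions, which is why document count alone is a poor sizing input.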
2. Coupling Orchestration Threads to GPU Workers
Explanation: Running inference and routing logic in the same process causes CPU threads to block while waiting for GPU kernels. This creates artificial latency and masks true GPU saturation.
Fix: Decouple via message brokers (Redis Streams, RabbitMQ, or Kafka). Scale orchestration workers independently based on queue depth, not GPU metrics.
3. Ignoring Document Variance in Classification
Explanation: Relying exclusively on a single ML classifier for routing introduces unnecessary latency and cost. Many documents follow predictable structural patterns that don't require model inference.
Fix: Implement hybrid classification. Use regex, header scanning, and metadata extraction for high-confidence routing. Reserve ML models for ambiguous or novel formats.
4. Over-Provisioning LLM Concurrency
Explanation: Teams often set LLM worker concurrency based on available CPU cores or container limits. LLM inference is bound by GPU memory (model weights plus the per-request KV cache) and GPU compute, not by CPU threads.
Fix: Cap concurrency based on GPU memory headroom and token throughput. Use dynamic batching and implement request queuing with backpressure to prevent OOM crashes.
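A rough upper bound on concurrency can be derived from the KV cache, which grows with every token of every in-flight sequence: each token stores one key and one value vector per layer. The sketch below applies that worst-case arithmetic. The `ModelShape` fields, the function names, and the example numbers are assumptions for illustration; substitute the values for your own model and GPU.

```typescript
interface ModelShape {
  numLayers: number;
  numKvHeads: number;      // fewer than attention heads when grouped-query attention is used
  headDim: number;
  bytesPerElement: number; // 2 for fp16/bf16
}

// Bytes of KV cache per token: key + value, for every layer and KV head.
function kvCacheBytesPerToken(m: ModelShape): number {
  return 2 * m.numLayers * m.numKvHeads * m.headDim * m.bytesPerElement;
}

// Worst-case cap: assumes every sequence grows to maxTokensPerSequence.
function maxConcurrentSequences(
  vramBytes: number,
  weightBytes: number,
  reserveBytes: number, // activations, fragmentation, runtime overhead
  maxTokensPerSequence: number,
  model: ModelShape,
): number {
  const headroom = vramBytes - weightBytes - reserveBytes;
  if (headroom <= 0) return 0;
  return Math.floor(headroom / (kvCacheBytesPerToken(model) * maxTokensPerSequence));
}

const GIB = 1024 * 1024 * 1024;

// Illustrative shape: 32 layers, 32 KV heads, head dimension 128, fp16.
// That is 524,288 bytes (0.5 MiB) per token, or 2 GiB per 4096-token sequence.
const exampleShape: ModelShape = { numLayers: 32, numKvHeads: 32, headDim: 128, bytesPerElement: 2 };

// 24 GiB GPU, 14 GiB of weights, 2 GiB reserved: 8 GiB of headroom, so 4 sequences.
const exampleCap = maxConcurrentSequences(24 * GIB, 14 * GIB, 2 * GIB, 4096, exampleShape);
```

This is deliberately pessimistic. Serving stacks that allocate KV cache lazily can often admit more sequences than this bound, so treat it as a safe starting point and tune upward against measured memory use.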
5. Synchronous IO in the Critical Path
Explanation: Blocking network calls for document storage, schema fetching, or result publishing stall the entire pipeline. IO latency compounds across thousands of documents.
Fix: Replace synchronous calls with async streams, connection pooling, and non-blocking writes. Use write-ahead logs or batched inserts for result persistence.
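Batched inserts can be as simple as a buffer that hands a full batch to the store in one call. The following is a minimal sketch with a hypothetical `ResultBatcher` class and an injected `sink` function standing in for the real store client. It does not retry failed writes, so a production version would need retry logic or a write-ahead log in front of it.

```typescript
type Sink<T> = (batch: T[]) => Promise<void>;

class ResultBatcher<T> {
  private buffer: T[] = [];
  private readonly sink: Sink<T>;
  private readonly maxBatch: number;

  constructor(sink: Sink<T>, maxBatch: number) {
    if (!Number.isInteger(maxBatch) || maxBatch < 1) {
      throw new RangeError('maxBatch must be a positive integer');
    }
    this.sink = sink;
    this.maxBatch = maxBatch;
  }

  // Buffer one record and flush as soon as the batch is full.
  async add(record: T): Promise<void> {
    this.buffer.push(record);
    if (this.buffer.length >= this.maxBatch) {
      await this.flush();
    }
  }

  // Write whatever is buffered in a single call. Also call this on shutdown
  // and on a timer so a partial batch never waits indefinitely.
  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = []; // swap first so records added during the write start a new batch
    await this.sink(batch);
  }
}
```

The trade-off is latency for throughput: a larger `maxBatch` means fewer round trips to the store but a longer wait before any single result becomes visible.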
6. Monolithic Scaling Policies
Explanation: Applying identical autoscaling rules to all pipeline components ignores their distinct resource profiles. OCR scales with GPU compute, LLM extraction scales with VRAM and token limits, and orchestration scales with queue depth.
Fix: Implement service-specific scaling triggers. Use GPU utilization and queue length for the inference services, and queue depth (with CPU and memory as secondary signals) for orchestration. Set independent min/max replica bounds.
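A queue-depth trigger with per-service bounds can be sketched as a single clamped division. The `ScalingPolicy` shape, the `desiredReplicas` function, and every number in `POLICIES` are hypothetical and would need to be derived from measured per-replica throughput.

```typescript
interface ScalingPolicy {
  targetBacklogPerReplica: number; // queued items one replica should be able to absorb
  minReplicas: number;
  maxReplicas: number;
}

// Replicas needed to bring the backlog per replica down to the target,
// clamped to the service's own bounds.
function desiredReplicas(queueDepth: number, policy: ScalingPolicy): number {
  if (policy.targetBacklogPerReplica <= 0) {
    throw new RangeError('targetBacklogPerReplica must be positive');
  }
  const raw = Math.ceil(queueDepth / policy.targetBacklogPerReplica);
  return Math.min(policy.maxReplicas, Math.max(policy.minReplicas, raw));
}

// Illustrative policies: the orchestrator tolerates a deep backlog per replica,
// while the GPU-bound services scale on much smaller backlogs and lower ceilings.
const POLICIES: Record<'orchestrator' | 'ocr' | 'extraction', ScalingPolicy> = {
  orchestrator: { targetBacklogPerReplica: 500, minReplicas: 2, maxReplicas: 16 },
  ocr: { targetBacklogPerReplica: 64, minReplicas: 1, maxReplicas: 8 },
  extraction: { targetBacklogPerReplica: 16, minReplicas: 1, maxReplicas: 4 },
};
```

For example, a backlog of 200 items yields 4 OCR replicas under these numbers, while a backlog of 1,000 items on the extraction queue is capped at 4 replicas because the GPU ceiling, not the formula, is the binding constraint.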
7. Neglecting OCR Confidence Routing
Explanation: Blindly passing low-confidence OCR output to LLMs degrades extraction accuracy and wastes compute. LLMs cannot reliably reconstruct garbled or missing text.
Fix: Implement confidence thresholds at the OCR layer. Route low-confidence documents to fallback engines, human review queues, or alternative preprocessing pipelines before LLM extraction.
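The routing decision itself can be a small pure function. The sketch below assumes the OCR engine reports a confidence between 0 and 1 per page. The `OcrPageResult` shape, the route names, and the rule that a single weak page disqualifies the whole document are all assumptions made for illustration.

```typescript
type OcrRoute = 'extract' | 'retry-fallback-engine' | 'human-review';

interface OcrPageResult {
  text: string;
  confidence: number; // 0..1, as reported by the OCR engine
}

// Decide what happens to a document after OCR.
// A document goes to extraction only if every page clears the threshold;
// otherwise it gets one retry on a fallback engine, then human review.
function routeByConfidence(
  pages: OcrPageResult[],
  alreadyRetried: boolean,
  threshold = 0.85,
): OcrRoute {
  if (pages.length === 0) return 'human-review';
  const allPagesClear = pages.every((p) => p.confidence >= threshold);
  if (allPagesClear) return 'extract';
  return alreadyRetried ? 'human-review' : 'retry-fallback-engine';
}
```

Gating on the weakest page rather than the average is a conservative choice: one garbled page can corrupt an otherwise clean extraction, and an average would hide it.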
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume standardized documents (invoices, receipts) | Rule-based classification + batched OCR + lightweight LLM | Predictable structure allows fast routing and high OCR throughput | Low GPU cost, high CPU efficiency |
| Mixed complexity with unknown formats | Hybrid classification + dedicated OCR pool + structured LLM extraction | Handles variance while maintaining throughput through async decoupling | Moderate GPU cost, scalable with queue depth |
| Low-volume, high-accuracy requirements (legal, medical) | Multi-engine OCR fallback + high-context LLM + human-in-the-loop review | Prioritizes extraction fidelity over raw throughput | Higher per-document cost, lower concurrency |
| Budget-constrained deployments | CPU-bound OCR + quantized LLM + aggressive batching | Reduces GPU dependency while maintaining acceptable latency | Lowest infrastructure cost, higher latency |
Configuration Template
    # pipeline-deployment.yaml
    services:
      orchestrator:
        image: doc-pipeline/orchestrator:latest
        environment:
          QUEUE_BROKER: "redis://broker:6379"
          MAX_CONCURRENCY: 32
          IO_TIMEOUT_MS: 5000
        deploy:
          replicas: 4
          resources:
            limits: { cpu: "2", memory: "4Gi" }
      ocr-worker:
        image: doc-pipeline/ocr:latest
        environment:
          GPU_POOL_SIZE: 2
          BATCH_SIZE: 16
          CONFIDENCE_THRESHOLD: 0.85
        deploy:
          replicas: 2
          resources:
            limits: { gpu: "1", memory: "8Gi" }
      extraction-worker:
        image: doc-pipeline/llm-extractor:latest
        environment:
          VRAM_LIMIT_GB: 12
          MAX_TOKENS: 4096
          BATCH_SIZE: 4
        deploy:
          replicas: 2
          resources:
            limits: { gpu: "1", memory: "16Gi" }
      broker:
        image: redis:7-alpine
        # noeviction: a queue broker must reject writes when full rather than
        # silently evict pending jobs, which allkeys-lru would do.
        command: ["redis-server", "--maxmemory", "2gb", "--maxmemory-policy", "noeviction"]
Quick Start Guide
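- Deploy the message broker: Spin up a Redis or RabbitMQ instance with a memory limit. On Redis, use the `noeviction` policy so that a full broker rejects new writes instead of silently evicting queued jobs. Configure queue depth alerts at 80% capacity.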
- Launch GPU inference services: Deploy the OCR and LLM extraction workers with separate GPU pools. Set batch sizes and concurrency limits based on your GPU VRAM and compute targets.
- Start the orchestrator: Run the async pipeline controller with queue consumers. Verify it routes documents through classification, pushes to OCR, and chains to extraction without blocking.
- Ingest a test batch: Upload 50-100 multi-page documents. Monitor queue depth, GPU utilization, and end-to-end latency. Adjust batch sizes and concurrency if queues back up.
- Enable observability: Instrument metrics for OCR confidence scores, LLM token throughput, queue processing rates, and error routing. Set alerts for GPU saturation and queue overflow.