Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

By Codcompass Team·2026-05-20·8 min read

Scaling Document Intelligence Pipelines: Production Patterns for OCR and LLM Integration

Current Situation Analysis

The document AI landscape suffers from a persistent deployment gap. Academic research heavily optimizes for benchmark accuracy on curated datasets, while production engineering demands predictable throughput, bounded latency, and cost-efficient resource utilization. Teams frequently treat document processing as a linear script: ingest, run OCR, pass text to an LLM, return JSON. This approach collapses under real-world load because it ignores the fundamental resource asymmetry between vision models, language models, and orchestration logic.

The problem is routinely misunderstood because developers assume large language models are the primary bottleneck. LLMs are computationally expensive, so intuition suggests they dictate system limits. In practice, batch profiling across thousands of multi-page documents per hour reveals the opposite. Optical character recognition dominates end-to-end latency, consuming the majority of wall-clock time due to image preprocessing, layout analysis, and character segmentation. Meanwhile, system throughput saturates based on shared GPU inference capacity, not the number of orchestration workers or CPU cores provisioned.

This mismatch leads to two common failure modes. First, teams over-provision CPU workers hoping to increase concurrency, only to hit a hard ceiling when GPU VRAM and compute queues fill up. Second, they treat OCR as a lightweight preprocessing step, failing to allocate dedicated inference resources or implement batching strategies, which causes unpredictable latency spikes during peak ingestion. Closing this gap requires a deliberate architectural shift: decoupling compute domains, isolating GPU-bound inference from CPU-bound orchestration, and treating document pipelines as distributed systems rather than sequential functions.
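One way to check where wall-clock time actually goes is to aggregate per-stage timings before changing any infrastructure. The following is a minimal sketch, assuming you already record a duration for each stage of each document; the `StageTiming` shape and the stage names are hypothetical, not part of any specific profiler.

```typescript
// Hypothetical timing record; the field and stage names are assumptions.
interface StageTiming {
  stage: string;      // e.g. "classify", "ocr", "llm", "io"
  durationMs: number; // wall-clock time spent in that stage
}

// Sums wall-clock time per stage and returns each stage's share of the total (0..1).
function stageShares(timings: StageTiming[]): Record<string, number> {
  const totals: Record<string, number> = {};
  let grandTotal = 0;
  for (const t of timings) {
    totals[t.stage] = (totals[t.stage] ?? 0) + t.durationMs;
    grandTotal += t.durationMs;
  }
  const shares: Record<string, number> = {};
  for (const stage of Object.keys(totals)) {
    const total = totals[stage] ?? 0;
    shares[stage] = grandTotal === 0 ? 0 : total / grandTotal;
  }
  return shares;
}

// Returns the stage with the largest share, or undefined for empty input.
function dominantStage(timings: StageTiming[]): string | undefined {
  const shares = stageShares(timings);
  let best: string | undefined;
  let bestShare = -1;
  for (const stage of Object.keys(shares)) {
    const share = shares[stage] ?? 0;
    if (share > bestShare) {
      best = stage;
      bestShare = share;
    }
  }
  return best;
}
```

Feeding this a day of production timings shows whether OCR or the LLM dominates in your own workload, which is worth confirming before provisioning either.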

WOW Moment: Key Findings

Batch profiling across production workloads consistently surfaces two counterintuitive findings that dictate how document AI systems must be architected. Understanding them shifts the optimization focus from prompt engineering to infrastructure design.

Approach	End-to-End Latency	Primary Bottleneck	Scaling Saturation Point
Monolithic Sequential Pipeline	High (OCR + LLM serialized)	LLM Inference	Worker Count
Decoupled Microservice Architecture	Low (Parallelized + Batched)	OCR Throughput	Shared GPU Capacity

The first finding overturns the assumption that language model parsing dictates pipeline speed. OCR engines perform heavy matrix operations on high-resolution page images, often processing dozens of pages per document. Without batching and dedicated GPU allocation, OCR becomes the latency anchor. The second finding clarifies why horizontal scaling of orchestration workers yields diminishing returns. Once GPU inference queues reach capacity, adding more CPU-bound workers only increases memory pressure and queue depth without improving throughput.

These insights enable predictable capacity planning. By isolating GPU inference, implementing async IO for orchestration, and scaling services independently, teams can achieve stable processing rates of thousands of multi-page documents per hour while maintaining sub-second queue latency and controlled GPU utilization.
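The saturation effect can be illustrated with simple arithmetic: effective throughput is the smaller of what the orchestration workers can route and what the shared GPU pool can infer. This is a sketch with hypothetical figures (500 documents per hour per worker, 3000 documents per hour for the GPU pool), not measurements from the profiling described above.

```typescript
// Effective throughput is capped by the slowest shared resource.
// All rates are in documents per hour.
function effectiveThroughput(
  workers: number,
  docsPerHourPerWorker: number,
  gpuDocsPerHour: number
): number {
  const orchestrationCapacity = workers * docsPerHourPerWorker;
  return Math.min(orchestrationCapacity, gpuDocsPerHour);
}

// Hypothetical figures: each worker routes 500 docs/hour,
// and the shared GPU pool sustains 3000 docs/hour.
const withFourWorkers = effectiveThroughput(4, 500, 3000);     // 2000, worker-bound
const withEightWorkers = effectiveThroughput(8, 500, 3000);    // 3000, GPU-bound
const withSixteenWorkers = effectiveThroughput(16, 500, 3000); // 3000, no gain
```

Past the point where the worker term exceeds the GPU term, extra workers only lengthen the queue in front of the GPU.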

Core Solution

Building a production-ready document pipeline requires separating concerns at the infrastructure level. The architecture divides into three distinct domains: hybrid classification, GPU-bound inference (OCR + LLM), and CPU-bound async orchestration. Each domain scales independently and communicates through message queues rather than direct HTTP calls.

Step 1: Hybrid Classification Layer

Document classification should never rely solely on a single model. A hybrid approach combines rule-based heuristics for known formats with lightweight ML models for ambiguous cases. This reduces unnecessary GPU calls and routes documents to the correct extraction schema.

interface ClassificationResult {
  documentType: 'invoice' | 'contract' | 'report' | 'unknown';
  confidence: number;
  routingKey: string;
}

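class DocumentClassifier {
  async classify(rawBuffer: Buffer): Promise<ClassificationResult> {
    // Inspect only the first 2 KB. This assumes the header holds readable text;
    // binary or scanned inputs simply fall through to the ML path.
    // subarray() returns a view over the same memory without copying.
    const header = rawBuffer.subarray(0, 2048).toString('utf-8');

    // Rule-based fast path for known formats
    if (/invoice|statement|payment/i.test(header)) {
      return { documentType: 'invoice', confidence: 0.95, routingKey: 'ocr.invoice' };
    }
    if (/agreement|party|term/i.test(header)) {
      return { documentType: 'contract', confidence: 0.92, routingKey: 'ocr.contract' };
    }

    // Fallback to lightweight ML model for ambiguous cases
    const mlResult = await this.runLightweightModel(rawBuffer);
    return {
      documentType: mlResult.type,
      confidence: mlResult.score,
      routingKey: `ocr.${mlResult.type}`
    };
  }

  // Placeholder for the ML fallback; replace with a call to your own model.
  private async runLightweightModel(
    _rawBuffer: Buffer
  ): Promise<{ type: ClassificationResult['documentType']; score: number }> {
    return { type: 'unknown', score: 0 };
  }
}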

Rationale: Rule-based checks execute in microseconds and handle 60-80% of enterprise documents. The ML fallback only triggers for edge cases, preserving GPU capacity for heavy inference.

Step 2: GPU-Bound Inference Isolation

OCR and LLM extraction must run in separate services with dedicated GPU pools. Sharing a single GPU instance across both workloads causes memory fragmentation and context-switching overhead.

// OCR Service (GPU-Optimized)
interface OcrRequest {
  jobId: string;
  pages: Buffer[];
  engine: 'tesseract' | 'paddleocr' | 'custom';
}

class OcrWorker {
  private gpuQueue: AsyncQueue<OcrRequest>;

  constructor() {
    this.gpuQueue = new AsyncQueue({ concurrency: 4, batchSize: 8 });
  }

  async processBatch(requests: OcrRequest[]): Promise<Record<string, string[]>> {
    const results: Record<string, string[]> = {};
    
    for (const req of requests) {
      const pageTexts = await Promise.all(
        req.pages.map(page => this.runVisionInference(page))
      );
      results[req.jobId] = pageTexts;
    }
    
    return results;
  }

  private async runVisionInference(image: Buffer): Promise<string> {
    // GPU tensor allocation, preprocessing, inference, postprocessing
    return await this.visionEngine.predict(image);
  }
}

// LLM Extraction Service (Separate GPU Pool)
interface ExtractionRequest {
  jobId: string;
  extractedText: string[];
  schema: Record<string, string>;
}

class ExtractionWorker {
  private inferenceQueue: AsyncQueue<ExtractionRequest>;

  constructor() {
    this.inferenceQueue = new AsyncQueue({ concurrency: 2, batchSize: 4 });
  }

  async extractStructuredData(requests: ExtractionRequest[]): Promise<Record<string, unknown>> {
    const outputs: Record<string, unknown> = {};
    
    for (const req of requests) {
      const prompt = this.buildSchemaPrompt(req.schema, req.extractedText);
      const rawOutput = await this.llmEngine.generate(prompt);
      outputs[req.jobId] = this.parseJsonOutput(rawOutput);
    }
    
    return outputs;
  }
}

Rationale: OCR benefits from high batch sizes and steady GPU utilization. LLM extraction is more variable in token length and requires different VRAM allocation. Separating them prevents one workload from starving the other.
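To keep OCR batches full when documents have very different page counts, pages from several jobs can be packed into fixed-size batches and regrouped afterwards. The sketch below shows one possible way to do that; `PageJob`, `PageSlot`, `packIntoBatches`, and `regroupByJob` are hypothetical names, and the inference call itself is left out.

```typescript
// Hypothetical shapes; T is whatever represents a page (e.g. an image buffer).
interface PageJob<T> {
  jobId: string;
  pages: T[];
}

interface PageSlot<T> {
  jobId: string;
  pageIndex: number;
  page: T;
}

// Flattens pages from all jobs and cuts them into batches of at most batchSize.
function packIntoBatches<T>(jobs: PageJob<T>[], batchSize: number): PageSlot<T>[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const slots: PageSlot<T>[] = [];
  for (const job of jobs) {
    job.pages.forEach((page, pageIndex) => {
      slots.push({ jobId: job.jobId, pageIndex, page });
    });
  }
  const batches: PageSlot<T>[][] = [];
  for (let i = 0; i < slots.length; i += batchSize) {
    batches.push(slots.slice(i, i + batchSize));
  }
  return batches;
}

// Reassembles per-page outputs (aligned with the batches) into per-job
// arrays in page order, matching the Record<string, string[]> shape above.
function regroupByJob<T>(
  batches: PageSlot<T>[][],
  outputs: string[][]
): Record<string, string[]> {
  const byJob: Record<string, string[]> = {};
  batches.forEach((batch, b) => {
    batch.forEach((slot, i) => {
      const texts = byJob[slot.jobId] ?? [];
      texts[slot.pageIndex] = outputs[b]?.[i] ?? "";
      byJob[slot.jobId] = texts;
    });
  });
  return byJob;
}
```

With this layout, a 3-page job and a 2-page job share one batch of four pages instead of producing two underfilled batches.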

Step 3: Async Orchestration & IO Decoupling

The pipeline controller must never block on inference or network calls. Async queues with backpressure handling ensure the system remains responsive under load.

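class PipelineOrchestrator {
  private ocrQueue: AsyncQueue<OcrRequest>;
  private extractionQueue: AsyncQueue<ExtractionRequest>;
  private resultStore: KVStore;
  private classifier = new DocumentClassifier();
  // Document type per in-flight job, needed to pick the schema after OCR
  private pendingTypes = new Map<string, ClassificationResult['documentType']>();

  constructor(
    ocrQueue: AsyncQueue<OcrRequest>,
    extractionQueue: AsyncQueue<ExtractionRequest>,
    resultStore: KVStore
  ) {
    this.ocrQueue = ocrQueue;
    this.extractionQueue = extractionQueue;
    this.resultStore = resultStore;

    // Register completion handlers once. Registering them inside
    // ingestDocument would add a new listener per document, and every
    // listener would fire on every completed batch.
    this.ocrQueue.onComplete(async (ocrResults) => {
      for (const [jobId, extractedText] of Object.entries(ocrResults)) {
        const documentType = this.pendingTypes.get(jobId) ?? 'unknown';
        this.pendingTypes.delete(jobId);
        await this.extractionQueue.push({
          jobId,
          extractedText,
          schema: this.getSchemaForType(documentType)
        });
      }
    });

    this.extractionQueue.onComplete(async (extractionResults) => {
      for (const [jobId, result] of Object.entries(extractionResults)) {
        await this.resultStore.set(jobId, result);
        await this.notifyConsumer(jobId);
      }
    });
  }

  async ingestDocument(docId: string, fileBuffer: Buffer): Promise<void> {
    const classification = await this.classifier.classify(fileBuffer);
    this.pendingTypes.set(docId, classification.documentType);

    // Split pages and push to the OCR queue; completion is handled above
    const pages = this.splitDocumentIntoPages(fileBuffer);
    await this.ocrQueue.push({ jobId: docId, pages, engine: 'paddleocr' });
  }
}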

Rationale: Async event-driven flow prevents thread blocking. Queue depth becomes the primary scaling metric, not CPU utilization. The orchestrator remains lightweight, delegating compute to specialized workers.
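The article does not define `AsyncQueue`, so the backpressure behaviour it relies on is sketched here with a hypothetical `BoundedQueue`: when the queue is full, `tryPush` refuses the item and the producer has to slow down, retry later, or shed load. A real deployment would more likely get this behaviour from the broker's own limits.

```typescript
// Hypothetical in-memory stand-in for a broker-backed queue with a depth limit.
class BoundedQueue<T> {
  private items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new Error("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  // Returns false instead of growing without bound; the caller must back off.
  tryPush(item: T): boolean {
    if (this.items.length >= this.capacity) {
      return false;
    }
    this.items.push(item);
    return true;
  }

  // Removes and returns the oldest item, or undefined when empty.
  pop(): T | undefined {
    return this.items.shift();
  }

  // Current queue depth, the metric to export for scaling decisions.
  get depth(): number {
    return this.items.length;
  }
}
```

The design choice is that rejection is explicit: an ingest endpoint can translate a `false` return into a retry-later response rather than letting memory absorb the burst.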

Pitfall Guide

1. Treating OCR as a Lightweight Preprocessing Step

Explanation: Developers often assume OCR is fast and cheap, bundling it into the same process as orchestration. In reality, OCR performs heavy image transforms and layout analysis, consuming significant GPU memory and compute cycles. Fix: Isolate OCR into a dedicated service with batched inference. Profile page throughput separately and allocate GPU pools based on pages-per-second targets, not document count.
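As a worked example of sizing by pages per second rather than document count, the sketch below converts a document rate into a page rate and divides by measured per-GPU throughput. The function name, the utilization target, and all figures are hypothetical; the per-GPU rate has to come from your own profiling.

```typescript
// Estimates how many OCR GPUs a target load needs.
// pagesPerSecPerGpu must be measured; targetUtilization (0..1] leaves headroom.
function requiredOcrGpus(
  docsPerHour: number,
  avgPagesPerDoc: number,
  pagesPerSecPerGpu: number,
  targetUtilization: number
): number {
  if (pagesPerSecPerGpu <= 0 || targetUtilization <= 0 || targetUtilization > 1) {
    throw new Error("invalid throughput or utilization");
  }
  const pagesPerSec = (docsPerHour * avgPagesPerDoc) / 3600;
  const usablePerGpu = pagesPerSecPerGpu * targetUtilization;
  return Math.ceil(pagesPerSec / usablePerGpu);
}

// Same document rate, very different GPU needs once page count is considered:
const gpusForLongDocs = requiredOcrGpus(3000, 12, 5, 0.8); // 10 pages/s over 4 usable -> 3
const gpusForShortDocs = requiredOcrGpus(3000, 2, 5, 0.8); // about 1.67 pages/s -> 1
```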

2. Coupling Orchestration Threads to GPU Workers

Explanation: Running inference and routing logic in the same process causes CPU threads to block while waiting for GPU kernels. This creates artificial latency and masks true GPU saturation. Fix: Decouple via message brokers (Redis Streams, RabbitMQ, or Kafka). Scale orchestration workers independently based on queue depth, not GPU metrics.

3. Ignoring Document Variance in Classification

Explanation: Relying exclusively on a single ML classifier for routing introduces unnecessary latency and cost. Many documents follow predictable structural patterns that don't require model inference. Fix: Implement hybrid classification. Use regex, header scanning, and metadata extraction for high-confidence routing. Reserve ML models for ambiguous or novel formats.

4. Over-Provisioning LLM Concurrency

Explanation: Teams often set LLM worker concurrency based on available CPU cores or container limits. LLM inference is strictly bound by VRAM and attention matrix computation, not CPU threads. Fix: Cap concurrency based on GPU memory headroom and token throughput. Use dynamic batching and implement request queuing with backpressure to prevent OOM crashes.
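A rough way to derive a concurrency cap from GPU memory is to subtract model weights and a safety reserve from total VRAM, then divide by the memory one in-flight request needs. This is a simplified sketch: `VramBudget` and its fields are hypothetical, and the per-request figure (mostly KV cache, which grows with context length) should be measured for your model and `MAX_TOKENS` setting.

```typescript
// Hypothetical memory budget, all values in GB.
interface VramBudget {
  totalGb: number;        // VRAM on the device
  modelWeightsGb: number; // resident model weights
  reserveGb: number;      // headroom for fragmentation and spikes
  perRequestGb: number;   // measured memory per in-flight request
}

// Returns the largest number of concurrent requests that fits the budget.
function maxConcurrentRequests(budget: VramBudget): number {
  if (budget.perRequestGb <= 0) {
    throw new Error("perRequestGb must be positive");
  }
  const freeGb = budget.totalGb - budget.modelWeightsGb - budget.reserveGb;
  if (freeGb <= 0) {
    return 0;
  }
  return Math.floor(freeGb / budget.perRequestGb);
}

// Example with made-up numbers: 24 GB card, 14 GB weights, 2 GB reserve,
// 1.5 GB per request leaves 8 GB, which fits 5 concurrent requests.
const llmConcurrencyCap = maxConcurrentRequests({
  totalGb: 24,
  modelWeightsGb: 14,
  reserveGb: 2,
  perRequestGb: 1.5,
});
```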

5. Synchronous IO in the Critical Path

Explanation: Blocking network calls for document storage, schema fetching, or result publishing stall the entire pipeline. IO latency compounds across thousands of documents. Fix: Replace synchronous calls with async streams, connection pooling, and non-blocking writes. Use write-ahead logs or batched inserts for result persistence.
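Batched inserts can be sketched as a small buffer that hands back a full batch when one is ready, so the caller issues one write per batch instead of one per document. `ResultBatcher` is a hypothetical helper, and the store call mentioned in the comment is an assumption about your persistence layer.

```typescript
// Collects results and releases them in batches of maxBatch items.
class ResultBatcher<T> {
  private pending: T[] = [];
  private readonly maxBatch: number;

  constructor(maxBatch: number) {
    if (!Number.isInteger(maxBatch) || maxBatch < 1) {
      throw new Error("maxBatch must be a positive integer");
    }
    this.maxBatch = maxBatch;
  }

  // Returns a full batch when one is ready, otherwise undefined.
  add(item: T): T[] | undefined {
    this.pending.push(item);
    return this.pending.length >= this.maxBatch ? this.drain() : undefined;
  }

  // Returns whatever is buffered and resets the buffer; call on shutdown or on a timer.
  drain(): T[] {
    const batch = this.pending;
    this.pending = [];
    return batch;
  }
}

// Intended use inside an async handler (insertMany is a hypothetical store method):
//   const batch = batcher.add(result);
//   if (batch) await store.insertMany(batch);
```

A periodic `drain()` is still needed so that a partially filled batch does not wait indefinitely during quiet periods.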

6. Monolithic Scaling Policies

Explanation: Applying identical autoscaling rules to all pipeline components ignores their distinct resource profiles. OCR scales with GPU compute, LLMs scale with VRAM and token limits, orchestration scales with queue depth. Fix: Implement service-specific scaling triggers. Use GPU utilization and queue length for inference services, and queue depth for orchestration, with CPU and memory limits as secondary guards. Set independent min/max replica bounds.
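Service-specific triggers can be expressed as one small function with a separate policy per service. The sketch below uses queue backlog as the signal and clamps the result to each service's own bounds; the policy values and the `ScalingPolicy` shape are hypothetical, and a real autoscaler would also smooth the signal over time.

```typescript
// Hypothetical per-service policy: how much backlog one replica should absorb.
interface ScalingPolicy {
  minReplicas: number;
  maxReplicas: number;
  targetBacklogPerReplica: number;
}

// Replicas needed for the current backlog, clamped to the policy bounds.
function desiredReplicas(backlog: number, policy: ScalingPolicy): number {
  const needed = Math.ceil(backlog / policy.targetBacklogPerReplica);
  return Math.min(policy.maxReplicas, Math.max(policy.minReplicas, needed));
}

// Each service gets its own bounds and target; the units differ per queue.
const scalingPolicies = {
  ocr: { minReplicas: 1, maxReplicas: 4, targetBacklogPerReplica: 200 },          // queued pages
  llm: { minReplicas: 1, maxReplicas: 2, targetBacklogPerReplica: 20 },           // queued requests
  orchestrator: { minReplicas: 2, maxReplicas: 8, targetBacklogPerReplica: 500 }, // queued documents
};

const ocrReplicas = desiredReplicas(900, scalingPolicies.ocr); // 5 needed, capped at 4
const llmReplicas = desiredReplicas(30, scalingPolicies.llm);  // 2
```

Hitting `maxReplicas` on an inference service is itself a useful alert: it means the GPU pool, not the policy, is the limit.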

7. Neglecting OCR Confidence Routing

Explanation: Blindly passing low-confidence OCR output to LLMs degrades extraction accuracy and wastes compute. LLMs cannot reliably reconstruct garbled or missing text. Fix: Implement confidence thresholds at the OCR layer. Route low-confidence documents to fallback engines, human review queues, or alternative preprocessing pipelines before LLM extraction.
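Confidence routing can be as simple as a pure function over per-page scores. In this sketch the document is judged by its weakest page, which is an assumption; averaging or per-page rerouting are equally valid choices. The route names and the `PageOcr` shape are hypothetical.

```typescript
type OcrRoute = "llm_extraction" | "fallback_engine" | "human_review";

// Hypothetical per-page OCR output with an engine-reported confidence (0..1).
interface PageOcr {
  text: string;
  confidence: number;
}

// Sends clean documents on to extraction, retries weak ones on a fallback
// engine once, and escalates to human review after that.
function routeByConfidence(
  pages: PageOcr[],
  threshold: number,
  alreadyRetried: boolean
): OcrRoute {
  if (pages.length === 0) {
    return "human_review"; // nothing was recognized at all
  }
  const lowest = Math.min(...pages.map(p => p.confidence));
  if (lowest >= threshold) {
    return "llm_extraction";
  }
  return alreadyRetried ? "human_review" : "fallback_engine";
}
```

The `alreadyRetried` flag keeps a persistently unreadable scan from bouncing between engines indefinitely.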

Production Bundle

Action Checklist

Profile OCR latency separately from LLM extraction to identify true bottlenecks
Decouple GPU inference services from CPU orchestration using a message broker
Implement hybrid classification with rule-based fast paths and ML fallbacks
Configure async queues with backpressure and configurable batch sizes
Set independent autoscaling policies for OCR, LLM, and orchestration layers
Add OCR confidence thresholds and fallback routing for low-quality scans
Instrument GPU memory utilization, queue depth, and end-to-end latency metrics
Validate LLM outputs against schema constraints before persisting results
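The last checklist item can be sketched as a small validator that runs before anything is persisted. It assumes the `schema: Record<string, string>` from the extraction request maps field names to JavaScript type names such as "string" or "number"; the article does not specify the schema format, so treat that mapping as hypothetical.

```typescript
// Returns a list of problems; an empty list means the output passed.
function validateExtraction(
  rawOutput: string,
  schema: Record<string, string>
): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawOutput);
  } catch {
    return ["output is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["output is not a JSON object"];
  }
  const record = parsed as Record<string, unknown>;
  const errors: string[] = [];
  for (const field of Object.keys(schema)) {
    const expected = schema[field];
    const value = record[field];
    if (value === undefined) {
      errors.push(`missing field: ${field}`);
    } else if (typeof value !== expected) {
      errors.push(`field ${field}: expected ${expected}, got ${typeof value}`);
    }
  }
  return errors;
}
```

Outputs that fail validation can be retried with a corrective prompt or routed to review instead of being written to the result store.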

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High-volume standardized documents (invoices, receipts)	Rule-based classification + batched OCR + lightweight LLM	Predictable structure allows fast routing and high OCR throughput	Low GPU cost, high CPU efficiency
Mixed complexity with unknown formats	Hybrid classification + dedicated OCR pool + structured LLM extraction	Handles variance while maintaining throughput through async decoupling	Moderate GPU cost, scalable with queue depth
Low-volume, high-accuracy requirements (legal, medical)	Multi-engine OCR fallback + high-context LLM + human-in-the-loop review	Prioritizes extraction fidelity over raw throughput	Higher per-document cost, lower concurrency
Budget-constrained deployments	CPU-bound OCR + quantized LLM + aggressive batching	Reduces GPU dependency while maintaining acceptable latency	Lowest infrastructure cost, higher latency

Configuration Template

# pipeline-deployment.yaml
services:
  orchestrator:
    image: doc-pipeline/orchestrator:latest
    environment:
      QUEUE_BROKER: "redis://broker:6379"
      MAX_CONCURRENCY: 32
      IO_TIMEOUT_MS: 5000
    deploy:
      replicas: 4
      resources:
        limits: { cpu: "2", memory: "4Gi" }

  ocr-worker:
    image: doc-pipeline/ocr:latest
    environment:
      GPU_POOL_SIZE: 2
      BATCH_SIZE: 16
      CONFIDENCE_THRESHOLD: 0.85
    deploy:
      replicas: 2
      resources:
        limits: { gpu: "1", memory: "8Gi" }

  extraction-worker:
    image: doc-pipeline/llm-extractor:latest
    environment:
      VRAM_LIMIT_GB: 12
      MAX_TOKENS: 4096
      BATCH_SIZE: 4
    deploy:
      replicas: 2
      resources:
        limits: { gpu: "1", memory: "16Gi" }

  broker:
    image: redis:7-alpine
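    command: ["redis-server", "--maxmemory", "2gb", "--maxmemory-policy", "noeviction"]

Quick Start Guide

Deploy the message broker: Spin up a Redis or RabbitMQ instance with memory limits. If Redis holds the job queues, use the noeviction policy so that writes are rejected at the memory limit instead of queued jobs being silently evicted. Configure queue depth alerts at 80% capacity.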
Launch GPU inference services: Deploy the OCR and LLM extraction workers with separate GPU pools. Set batch sizes and concurrency limits based on your GPU VRAM and compute targets.
Start the orchestrator: Run the async pipeline controller with queue consumers. Verify it routes documents through classification, pushes to OCR, and chains to extraction without blocking.
Ingest a test batch: Upload 50-100 multi-page documents. Monitor queue depth, GPU utilization, and end-to-end latency. Adjust batch sizes and concurrency if queues back up.
Enable observability: Instrument metrics for OCR confidence scores, LLM token throughput, queue processing rates, and error routing. Set alerts for GPU saturation and queue overflow.
