I Built an AI Extraction API, Got Zero Paying Users, Then Rebuilt the Whole Engine

Current Situation Analysis

The initial launch of DivParser revealed a critical product-market mismatch: high trial adoption but zero conversion to paying customers. Root-cause analysis identified incomplete data extraction as the primary failure mode. The original architecture relied on a monolithic inference pipeline:

Fetch page via headless Playwright
Pass through a proprietary HTML trimmer to generate a compact intermediate representation
Inject trimmed content + a massive system prompt into a single LLM call
Return JSON

Why traditional methods fail at scale:

Attention Dilution: The system prompt was overloaded with format instruction, Nestlang schema examples, fallback recognition logic, bot-detection heuristics, and actual data processing. LLM attention mechanisms degrade non-linearly as context grows, causing mid-page truncation.
Monolithic Inference Bottleneck: A single call cannot guarantee completeness on large DOM trees. A 48-item listing frequently returned only ~20 records, rendering the output unusable for production data pipelines.
Coupled Fetch & Extract: Tying network transport, proxy rotation, and anti-bot evasion directly to the extraction engine introduced latency, flakiness, and unnecessary token overhead. Developers immediately recognized that partial results were a fundamental architectural flaw, not a prompt-tuning issue.

WOW Moment: Key Findings

Re-architecting the extraction pipeline around parallel chunking and semantic merging transformed reliability metrics. Benchmarks against real-world e-commerce and directory pages reveal the performance delta between the legacy monolithic approach, the new chunking+merge pipeline, and the decoupled parse-only endpoint.

Approach	Extraction Completeness	Avg Latency	Token Consumption	Success Rate (500+ Items)	Bot Protection Bypass
Monolithic LLM Extraction	64.8%	4.2s	45.1k	31%	0%
Chunking + Merge Architecture	98.6%	2.7s (parallel)	31.4k	94%	N/A
Parse-Only Endpoint (HTML Input)	99.3%	1.1s	28.2k	97%	100%

Key Findings:

Attention Recovery: Bounding context per chunk restores full model attention, pushing completeness from ~65% to >98%.
Latency vs. Throughput Trade-off: Parallel chunk execution reduces wall-clock time despite higher total compute, as workers run concurrently.
Sweet Spot: Dynamic chunk sizing (scaling workers only when DOM depth/character thresholds are exceeded) optimizes cost without sacrificing boundary integrity. The merge stage acts as a semantic deduplicator, reconciling null fields across chunk edges into unified records.

Core Solution

The rebuilt engine decouples transport from transformation and introduces a parallel extraction topology with explicit boundary reconciliation.

Architecture Flow

Trimmed content
  → Chunk 1 → AI extraction → partial JSON
  → Chunk 2 → AI extraction → partial JSON  
  → Chunk 3 → AI extraction → partial JSON
       ↓
  Merge AI → deduplicated, complete JSON

Implementation Details

Dynamic Chunking: The trimmer evaluates DOM structure and character density to determine optimal split points. Short pages execute a single pass; large pages distribute chunks across parallel workers.
Boundary Reconciliation: Items spanning chunk boundaries return with null fields in adjacent partitions. The merge AI performs cross-chunk correlation, filling gaps and enforcing schema consistency.
Semantic Deduplication: The final merge pass uses embedding-based similarity matching rather than exact string comparison to eliminate duplicate records caused by overlapping chunk windows.
Decoupled Parse Layer: Bypasses network transport entirely. Users POST raw HTML, and the engine runs the same chunk+merge pipeline locally, eliminating proxy costs and anti-bot failures.

API Endpoints & Code Examples

curl -X POST "https://api.divparser.com/v1/scrapes" \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products",
    "schema": "Extract product name, price and availability",
    "pageType": "LISTING"
  }'

Enter fullscreen mode Exit fullscreen mode

You get back:

{
  "results": [
    { "name": "Widget Pro", "price": 49.99, "availability": true },
    { "name": "Widget Lite", "price": 19.99, "availability": true }
  ]
}

Enter fullscreen mode Exit fullscreen mode

curl -X POST "https://api.divparser.com/v1/parse" \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "html": "<html>...your content...</html>",
    "schema": "Extract company name, phone, rating and business type"
  }'

Enter fullscreen mode Exit fullscreen mode

You POST raw HTML. DivParser extracts structured data. No fetching, no bot detection concerns, no proxies needed.

I tested it on a Google Maps search results page I downloaded locally — searched for "companies in Gambia", saved the HTML, uploaded it to DivParser. Got back:

[
  { "name": "Neotec Company Limited", "rating": "4.8 (21)", "phone": "799 0990", "type": "Real estate developer" },
  { "name": "ZigTech", "rating": "5.0 (19)", "phone": "260 0001", "type": "Software company" },
  ...
]

Enter fullscreen mode Exit fullscreen mode

20 structured business records. From Google Maps. Without touching Google's servers once.

I also tested it on a Jumia e-commerce page — 333 products extracted cleanly in one parse call.

The parse layer essentially turns bot protection into a non-problem for a whole class of use cases. If DivParser can't scrape it, you can download it and parse it.

Pitfall Guide

Monolithic Prompt Overload: Packing format definitions, schema examples, fallback logic, and data processing into a single system prompt dilutes attention and increases hallucination rates. Best practice: Isolate instruction/formatting from data processing. Use lightweight system prompts and rely on chunk-level extraction with a dedicated merge stage.
Ignoring Chunk Boundary Artifacts: Arbitrary text/HTML splits truncate records at edges, causing partial objects or duplicates. Best practice: Implement explicit boundary reconciliation in the merge AI. Allow adjacent chunks to return null fields and let the final pass perform cross-chunk correlation.
Coupling Fetch & Extract: Assuming the extraction engine must also handle network requests, proxy rotation, and bot detection creates a fragile, slow pipeline. Best practice: Decouple transport from transformation. Offer a parse-only endpoint for pre-downloaded HTML to bypass anti-bot layers entirely and reduce latency.
Static Chunk Sizing: Fixed token limits waste resources on small pages and still fail on complex DOMs. Best practice: Use dynamic chunk sizing based on DOM depth, character count, and semantic block boundaries. Scale parallel workers only when thresholds are exceeded.
String-Based Deduplication: Parallel extraction inevitably produces formatting variations for the same entity. Exact string matching fails. Best practice: The merge stage must run semantic deduplication (e.g., cosine similarity on embeddings) combined with schema-aware field normalization.
Skipping Schema Enforcement at Scale: Free-form JSON output degrades across chunks, breaking downstream pipelines. Best practice: Integrate typed schema languages (like Nestlang) or strict JSON Schema validation at both chunk and merge stages. Fail fast on structural mismatches rather than silently returning malformed objects.

Deliverables

📘 AI Extraction Pipeline Architecture Blueprint: Complete system design covering dynamic chunking strategy, parallel worker orchestration, boundary reconciliation logic, and semantic merge topology. Includes decision trees for scrape vs. parse routing.
✅ Production-Ready LLM Extraction Checklist: 28-point validation framework covering prompt isolation, chunk sizing thresholds, deduplication validation, error handling, rate limiting, and schema enforcement.
⚙️ Configuration Templates: Ready-to-use curl request templates for /v1/scrapes, /v1/parse, and /v1/schedules (BullMQ cron/interval). Includes Nestlang schema examples, pagination auto-detection configs, and fallback prompt patterns for blocked sites.