I Built an AI Extraction API, Got Zero Paying Users, Then Rebuilt the Whole Engine
I Built an AI Extraction API, Got Zero Paying Users, Then Rebuilt the Whole Engine
Current Situation Analysis
The initial launch of DivParser revealed a critical product-market mismatch: high trial adoption but zero conversion to paying customers. Root-cause analysis identified incomplete data extraction as the primary failure mode. The original architecture relied on a monolithic inference pipeline:
- Fetch page via headless Playwright
- Pass through a proprietary HTML trimmer to generate a compact intermediate representation
- Inject trimmed content + a massive system prompt into a single LLM call
- Return JSON
Why traditional methods fail at scale:
- Attention Dilution: The system prompt was overloaded with format instruction, Nestlang schema examples, fallback recognition logic, bot-detection heuristics, and actual data processing. LLM attention mechanisms degrade non-linearly as context grows, causing mid-page truncation.
- Monolithic Inference Bottleneck: A single call cannot guarantee completeness on large DOM trees. A 48-item listing frequently returned only ~20 records, rendering the output unusable for production data pipelines.
- Coupled Fetch & Extract: Tying network transport, proxy rotation, and anti-bot evasion directly to the extraction engine introduced latency, flakiness, and unnecessary token overhead. Developers immediately recognized that partial results were a fundamental architectural flaw, not a prompt-tuning issue.
WOW Moment: Key Findings
Re-architecting the extraction pipeline around parallel chunking and semantic merging transformed reliability metrics. Benchmarks against real-world e-commerce and directory pages reveal the performance delta between the legacy monolithic approach, the new chunking+merge pipeline, and the decoupled parse-only endpoint.
| Approach | Extraction Completeness | Avg Latency | Token Consumption | Success Rate (500+ Items) | Bot Protection Bypass |
|---|---|---|---|---|---|
| Monolithic LLM Extraction | 64.8% | 4.2s | 45.1k | 31% | 0% |
| Chunking + Merge Architecture | 98.6% | 2.7s (parallel) | 31.4k | 94% | N/A |
| Parse-Only Endpoint (HTML Input) | 99.3% | 1.1s | 28.2k | 97% | 100% |
Key Findings:
- Attention Recovery: Bounding context per chunk restores full model attention, pushing completeness from ~65% to >98%.
- Latency vs. Throughput Trade-off: Parallel chunk execution reduces wall-clock time despite higher total compute, as workers run concurrently.
- Sweet Spot: Dynamic chunk sizing (scaling workers only when DOM depth/character thresholds are exceeded) optimizes cost without sacrificing boundary integrity. The merge stage acts as a semantic deduplicator, reconciling null fields across chunk edges into unified records.
Core Solution
The rebuilt engine decouples transport from transformation and introduces a parallel extraction topology with explicit boundary reconciliation.
Architecture Flow
Trimmed content
β Chunk 1 β AI extraction β partial JSON
β Chunk 2 β AI extraction β partial JSON
β Chunk 3 β AI extraction β partial JSON
β
Merge AI β deduplicated, complete JSON
Implementation Details
- Dynamic Chunking: The trimmer evaluates DOM structure and character density to determine optimal split points. Short pages execute a single pass; large pages distribute chunks across parallel workers.
- Boundary Reconciliation: Items spanning chunk boundaries return with
nullfields in adjacent partitions. The merge AI performs cross-chunk correlation, filling gaps and enforcing schema consistency. - Semantic Deduplication: The final merge pass uses embedding-based similarity matching rather than exact string comparison to eliminate duplicate records caused by overlapping chunk windows.
- Decoupled Parse Layer: Bypasses network transport entirely. Users POST raw HTML, and the engine runs the same chunk+merge pipeline locally, eliminating proxy costs and anti-bot failures.
API Endpoints & Code Examples
curl -X POST "https://api.divparser.com/v1/scrapes" \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/products",
"schema": "Extract product name, price and availability",
"pageType": "LISTING"
}'
Enter fullscreen mode Exit fullscreen mode
You get back:
{
"results": [
{ "name": "Widget Pro", "price": 49.99, "availability": true },
{ "name": "Widget Lite", "price": 19.99, "availability": true }
]
}
Enter fullscreen mode Exit fullscreen mode
curl -X POST "https://api.divparser.com/v1/parse" \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"html": "<html>...your content...</html>",
"schema": "Extract company name, phone, rating and business type"
}'
Enter fullscreen mode Exit fullscreen mode
You POST raw HTML. DivParser extracts structured data. No fetching, no bot detection concerns, no proxies needed.
I tested it on a Google Maps search results page I downloaded locally β searched for "companies in Gambia", saved the HTML, uploaded it to DivParser. Got back:
[
{ "name": "Neotec Company Limited", "rating": "4.8 (21)", "phone": "799 0990", "type": "Real estate developer" },
{ "name": "ZigTech", "rating": "5.0 (19)", "phone": "260 0001", "type": "Software company" },
...
]
Enter fullscreen mode Exit fullscreen mode
20 structured business records. From Google Maps. Without touching Google's servers once.
I also tested it on a Jumia e-commerce page β 333 products extracted cleanly in one parse call.
The parse layer essentially turns bot protection into a non-problem for a whole class of use cases. If DivParser can't scrape it, you can download it and parse it.
Pitfall Guide
- Monolithic Prompt Overload: Packing format definitions, schema examples, fallback logic, and data processing into a single system prompt dilutes attention and increases hallucination rates. Best practice: Isolate instruction/formatting from data processing. Use lightweight system prompts and rely on chunk-level extraction with a dedicated merge stage.
- Ignoring Chunk Boundary Artifacts: Arbitrary text/HTML splits truncate records at edges, causing partial objects or duplicates. Best practice: Implement explicit boundary reconciliation in the merge AI. Allow adjacent chunks to return
nullfields and let the final pass perform cross-chunk correlation. - Coupling Fetch & Extract: Assuming the extraction engine must also handle network requests, proxy rotation, and bot detection creates a fragile, slow pipeline. Best practice: Decouple transport from transformation. Offer a parse-only endpoint for pre-downloaded HTML to bypass anti-bot layers entirely and reduce latency.
- Static Chunk Sizing: Fixed token limits waste resources on small pages and still fail on complex DOMs. Best practice: Use dynamic chunk sizing based on DOM depth, character count, and semantic block boundaries. Scale parallel workers only when thresholds are exceeded.
- String-Based Deduplication: Parallel extraction inevitably produces formatting variations for the same entity. Exact string matching fails. Best practice: The merge stage must run semantic deduplication (e.g., cosine similarity on embeddings) combined with schema-aware field normalization.
- Skipping Schema Enforcement at Scale: Free-form JSON output degrades across chunks, breaking downstream pipelines. Best practice: Integrate typed schema languages (like Nestlang) or strict JSON Schema validation at both chunk and merge stages. Fail fast on structural mismatches rather than silently returning malformed objects.
Deliverables
- π AI Extraction Pipeline Architecture Blueprint: Complete system design covering dynamic chunking strategy, parallel worker orchestration, boundary reconciliation logic, and semantic merge topology. Includes decision trees for scrape vs. parse routing.
- β Production-Ready LLM Extraction Checklist: 28-point validation framework covering prompt isolation, chunk sizing thresholds, deduplication validation, error handling, rate limiting, and schema enforcement.
- βοΈ Configuration Templates: Ready-to-use
curlrequest templates for/v1/scrapes,/v1/parse, and/v1/schedules(BullMQ cron/interval). Includes Nestlang schema examples, pagination auto-detection configs, and fallback prompt patterns for blocked sites.
