Implementation Strategy
The following TypeScript implementation demonstrates an evidence-based scraper. It defines evidence rules, tries a plain fetch before falling back to rendering, and computes a quality score for the result.
import { chromium, Browser, Page } from 'playwright';

// Evidence rules define what constitutes "ready" for a specific target
export interface EvidenceRule {
  id: string;
  type: 'selector' | 'jsonld' | 'text-length' | 'api-response';
  target: string;
  threshold?: number; // For text-length or api-response checks
}

export interface ScrapeConfig {
  url: string;
  evidenceRules: EvidenceRule[];
  maxRetries?: number;
}

export interface ScrapeResult {
  success: boolean;
  content: string;
  evidenceHit: boolean;
  qualityScore: number;
  source: 'raw-html' | 'rendered';
}
export class EvidenceScraper {
  private browser: Browser | null = null;

  async init() {
    this.browser = await chromium.launch({ headless: true });
  }

  async scrape(config: ScrapeConfig): Promise<ScrapeResult> {
    // Step 1: Fetch raw HTML
    const rawResponse = await fetch(config.url);
    const rawHtml = await rawResponse.text();

    // Step 2: Check for evidence in the raw HTML
    const rawEvidence = this.checkEvidence(rawHtml, config.evidenceRules);
    if (rawEvidence.hit && this.isResponseValid(rawResponse)) {
      return {
        success: true,
        content: this.extractContent(rawHtml),
        evidenceHit: true,
        qualityScore: this.scoreQuality(rawHtml),
        source: 'raw-html'
      };
    }

    // Step 3: Fall back to rendering only when the response is not blocked or invalid
    if (!this.isResponseValid(rawResponse)) {
      throw new Error(`Blocked or invalid response: ${rawResponse.status}`);
    }

    // Launch the browser lazily; init() assigns this.browser, so the assertion below is safe
    if (!this.browser) await this.init();
    const page = await this.browser!.newPage();
    try {
      await page.goto(config.url, { waitUntil: 'domcontentloaded' });

      // Step 4: Wait for evidence deterministically
      const renderResult = await this.waitForEvidence(page, config.evidenceRules);
      if (!renderResult.hit) {
        return {
          success: false,
          content: '',
          evidenceHit: false,
          qualityScore: 0,
          source: 'rendered'
        };
      }

      const renderedHtml = await page.content();
      return {
        success: true,
        content: this.extractContent(renderedHtml),
        evidenceHit: true,
        qualityScore: this.scoreQuality(renderedHtml),
        source: 'rendered'
      };
    } finally {
      await page.close();
    }
  }
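  private async waitForEvidence(
    page: Page,
    rules: EvidenceRule[],
    timeoutMs = 10_000
  ): Promise<{ hit: boolean }> {
    // Poll until any rule is satisfied or the deadline passes.
    // The 10 s timeout and 250 ms interval are illustrative defaults, not tuned values.
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
      const results = await Promise.allSettled(rules.map(rule => this.evaluateRule(page, rule)));
      if (results.some(r => r.status === 'fulfilled' && r.value)) return { hit: true };
      await page.waitForTimeout(250);
    }
    return { hit: false };
  }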
  private async evaluateRule(page: Page, rule: EvidenceRule): Promise<boolean> {
    switch (rule.type) {
      case 'selector': {
        const element = await page.$(rule.target);
        return element !== null;
      }
      case 'jsonld': {
        const jsonld = await page.evaluate(() => {
          const scripts = document.querySelectorAll('script[type="application/ld+json"]');
          return Array.from(scripts).map(s => s.textContent).join('');
        });
        return jsonld.includes(rule.target);
      }
      case 'text-length': {
        // page.$ returns null right away when nothing matches, whereas
        // page.textContent would wait for the selector until its own timeout
        const element = await page.$(rule.target);
        const text = element ? await element.textContent() : null;
        return (text?.length ?? 0) >= (rule.threshold ?? 0);
      }
      default:
        // 'api-response' rules are not evaluated in this example
        return false;
    }
  }
  private checkEvidence(html: string, rules: EvidenceRule[]): { hit: boolean } {
    // Simplified check for raw HTML evidence
    const hit = rules.some(rule => {
      if (rule.type === 'jsonld') return html.includes(rule.target);
      if (rule.type === 'selector') {
        // Crude heuristic: every id or class name in the selector must appear in the raw HTML.
        // Selectors without an id or class (e.g. "main") fall through to rendering.
        const names = (rule.target.match(/[#.][\w-]+/g) ?? []).map(token => token.slice(1));
        return names.length > 0 && names.every(name => html.includes(name));
      }
      return false;
    });
    return { hit };
  }
  private isResponseValid(response: Response): boolean {
    // Note: In production, also check for anti-bot blocks, CAPTCHAs, or empty shells
    const contentType = response.headers.get('content-type') ?? '';
    return response.ok && contentType.includes('html');
  }
  private extractContent(html: string): string {
    // Implementation for HTML-to-Markdown or structured extraction
    return html; // Placeholder
  }
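  private scoreQuality(html: string): number {
    // Heuristic scoring: penalize a high nav-to-content ratio and skeleton or loading markers
    const navBlocks: string[] = html.match(/<nav[\s\S]*?<\/nav>/gi) ?? [];
    const navChars = navBlocks.reduce((sum, block) => sum + block.length, 0);
    // Share of the markup that sits inside <nav> elements
    const navRatio = html.length > 0 ? navChars / html.length : 0;
    // Caveat: a bare 'loading' match also fires on attributes such as loading="lazy"
    const hasSkeleton = html.includes('skeleton') || html.includes('loading');
    let score = 100;
    if (navRatio > 0.1) score -= 30;
    if (hasSkeleton) score -= 50;
    return Math.max(0, score);
  }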
}
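The raw-HTML check in the class above treats a JSON-LD rule as a plain substring match, so a page that merely mentions the word Article in its body would count as evidence. A stricter variant might parse each application/ld+json script block and compare the declared @type values. The sketch below assumes typical markup and uses a regex rather than a full HTML parser; extractJsonLdTypes and hasJsonLdType are hypothetical helper names, not part of the original class.

```typescript
// Collect every @type declared in the JSON-LD blocks of a raw HTML string.
// Hypothetical helper: the regex-based extraction is an assumption, not a full HTML parser.
function extractJsonLdTypes(html: string): string[] {
  const types: string[] = [];
  const scriptPattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;

  // Walk a parsed JSON-LD value, following arrays and nested objects such as @graph.
  const visit = (node: unknown): void => {
    if (Array.isArray(node)) {
      node.forEach(item => visit(item));
    } else if (node !== null && typeof node === 'object') {
      const record = node as Record<string, unknown>;
      const declared = record['@type'];
      if (typeof declared === 'string') {
        types.push(declared);
      } else if (Array.isArray(declared)) {
        for (const t of declared) if (typeof t === 'string') types.push(t);
      }
      Object.values(record).forEach(value => visit(value));
    }
  };

  for (const match of html.matchAll(scriptPattern)) {
    try {
      visit(JSON.parse(match[1] ?? ''));
    } catch {
      // Malformed JSON-LD is skipped rather than treated as evidence
    }
  }
  return types;
}

// True only when a JSON-LD block actually declares the expected type.
function hasJsonLdType(html: string, expectedType: string): boolean {
  return extractJsonLdTypes(html).includes(expectedType);
}
```

Swapping this in for the substring check would trade a little parsing cost for fewer false positives on pages whose visible text happens to contain the type name.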
Architecture Decisions
- domcontentloaded over networkidle: When rendering is required, the initial navigation uses domcontentloaded. The navigation resolves as soon as the HTML document has been parsed, so the evidence waiter can start immediately rather than waiting for background network activity to settle.
- Rule-Based Evidence: Evidence is defined via EvidenceRule objects. This allows the scraper to be data-agnostic. The same pipeline can scrape articles, products, or docs by swapping rules.
- Quality Scoring: Extraction is not considered successful until the output passes a quality threshold. This prevents returning navigation menus or cookie consent text as valid content, a common failure mode in naive scrapers.
- Separation of Concerns: Anti-bot detection, rendering decisions, and content extraction are distinct steps. This allows independent optimization and debugging.
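The Quality Scoring decision implies a gate between scoring and reporting success. The class above returns the score and leaves the comparison to the caller, so one way to make the gate explicit is a small wrapper like the sketch below. The names applyQualityGate and ScoredResult are hypothetical, and clearing the content on failure is an assumption about what callers would want.

```typescript
// Minimal shape needed by the gate; a ScrapeResult from the class above would satisfy it.
interface ScoredResult {
  success: boolean;
  content: string;
  evidenceHit: boolean;
  qualityScore: number;
}

// Downgrade a result whose score falls below the threshold so that callers
// never receive low-quality content flagged as a success.
// Hypothetical helper: the threshold is expected to come from configuration.
function applyQualityGate<T extends ScoredResult>(result: T, qualityThreshold: number): T {
  if (result.success && result.qualityScore < qualityThreshold) {
    return { ...result, success: false, content: '' };
  }
  return result;
}
```

A caller could then write something like applyQualityGate(await scraper.scrape(config), 70) and branch on success alone.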
Pitfall Guide
- The Skeleton Loader Trap
  - Explanation: Waiting for a selector like .product-card often succeeds immediately because the skeleton loader shares the same class. The scraper extracts placeholder text or empty containers.
  - Fix: Wait for specific attributes indicating data presence, such as [data-loaded="true"], or verify text length within the selector. Avoid generic container selectors.
- Generic Selector Fallacy
  - Explanation: Using #root or main as evidence. These elements exist in the initial HTML shell of almost all SPAs, long before hydration or data fetching completes.
  - Fix: Target data-specific elements. For a product, wait for .price-value or [itemprop="price"]. For articles, wait for the headline or author metadata.
- Rendering as Default
  - Explanation: Instantiating a browser for every request ignores the fact that many pages render critical data server-side or embed JSON-LD in the initial HTML. This inflates costs and latency.
  - Fix: Implement the fetch-first pattern. Parse raw HTML for JSON-LD or server-rendered content. Only render if evidence is definitively absent.
- Ignoring Post-Extraction Noise
  - Explanation: The browser renders the page, the selector matches, but the extracted text includes navigation links, footer disclaimers, or cookie banner text. This corrupts downstream LLM contexts or databases.
  - Fix: Apply a quality scorer that penalizes high navigation density, detects common cookie banner patterns, and validates content-to-noise ratios before returning success.
- Conflating Blocks with Rendering Needs
  - Explanation: Treating a blocked response (e.g., Cloudflare challenge) as a signal that JavaScript rendering is needed. Rendering a blocked page yields the same block page, wasting resources.
  - Fix: Classify responses explicitly. Distinguish between empty-shell, blocked, and client-side-data. Handle blocks via proxy rotation or anti-bot services, not rendering.
- The Infinite Scroll Mirage
  - Explanation: Assuming all content is available after the initial evidence trigger. Pages with infinite scroll or lazy-loaded reviews may satisfy the initial rule but lack complete data.
  - Fix: Define evidence rules for completeness. For example, wait for the "Load More" button to disappear or verify the count of review items matches the expected total.
- Static Evidence Rules for Dynamic Sites
  - Explanation: Hardcoding selectors that change frequently due to A/B testing or framework updates.
  - Fix: Use robust selectors based on semantic attributes (itemprop, data-testid) where possible. Implement rule versioning and monitoring to detect selector drift.
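The fix for Conflating Blocks with Rendering Needs calls for classifying a response before deciding what to do with it. A rough sketch of such a classifier follows. The status codes, marker strings, and the 200-character text floor are illustrative assumptions, real block pages vary by vendor, and the client-side-data case is not modelled here because it usually needs the evidence rules to detect. classifyResponse and visibleTextLength are hypothetical names.

```typescript
type ResponseClass = 'blocked' | 'empty-shell' | 'content';

// Assumed marker strings; a page that legitimately discusses CAPTCHAs would be misclassified.
const BLOCK_MARKERS = ['captcha', 'access denied', 'verify you are human'];

// Approximate the amount of human-visible text by dropping scripts, styles, and tags.
function visibleTextLength(html: string): number {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim().length;
}

// Decide whether a fetched page is a block page, an empty SPA shell, or real content.
function classifyResponse(statusCode: number, html: string, minTextLength = 200): ResponseClass {
  const lower = html.toLowerCase();
  const looksBlocked =
    statusCode === 403 ||
    statusCode === 429 ||
    BLOCK_MARKERS.some(marker => lower.includes(marker));
  if (looksBlocked) return 'blocked';
  return visibleTextLength(html) < minTextLength ? 'empty-shell' : 'content';
}
```

With a classifier like this, only the empty-shell result would justify launching a browser; a blocked result would be routed to proxy rotation or an anti-bot service instead.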
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Static HTML / SSR | Fetch + JSON-LD Parse | Data available in initial response; no JS execution needed. | Lowest (Network only) |
| SPA with JSON-LD | Fetch + JSON-LD Parse | Structured data embedded in HTML; reliable and fast. | Low (Network only) |
| SPA with Dynamic DOM | Browser + Evidence Wait | Data requires JS execution; evidence ensures correctness. | Medium (Browser + Evidence) |
| Blocked / CAPTCHA | Anti-Bot Proxy / Retry | Rendering blocked pages is futile; requires identity management. | High (Proxy costs) |
| Infinite Scroll | Browser + Scroll Evidence | Requires interaction to load data; evidence must verify completeness. | High (Browser + Interaction) |
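The matrix can also be read as a small decision function. The sketch below encodes the rows in a fixed priority order (blocks first, then scroll requirements, then raw-HTML evidence), which is one reasonable interpretation rather than the only one. The PageSignals fields and the Approach labels are hypothetical names introduced for illustration.

```typescript
type Approach =
  | 'fetch-jsonld'
  | 'browser-evidence-wait'
  | 'anti-bot-proxy'
  | 'browser-scroll-evidence';

// Hypothetical summary of what earlier pipeline steps learned about a page.
interface PageSignals {
  blocked: boolean;           // block page or CAPTCHA detected
  evidenceInRawHtml: boolean; // JSON-LD or server-rendered evidence found by the plain fetch
  needsScroll: boolean;       // data only loads after scrolling or clicking "Load More"
}

// Map the signals to a row of the decision matrix.
function chooseApproach(signals: PageSignals): Approach {
  if (signals.blocked) return 'anti-bot-proxy';               // rendering a block page is futile
  if (signals.needsScroll) return 'browser-scroll-evidence';  // completeness needs interaction
  if (signals.evidenceInRawHtml) return 'fetch-jsonld';       // cheapest path, no JS execution
  return 'browser-evidence-wait';                             // SPA with dynamic DOM
}
```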
Configuration Template
Use this JSON structure to define evidence rules per target type. This allows dynamic configuration without code changes.
{
  "targets": {
    "article": {
      "evidenceRules": [
        {
          "id": "headline",
          "type": "selector",
          "target": "h1.article-title"
        },
        {
          "id": "body-length",
          "type": "text-length",
          "target": "div.article-body",
          "threshold": 500
        },
        {
          "id": "schema",
          "type": "jsonld",
          "target": "Article"
        }
      ],
      "qualityThreshold": 70,
      "fallbackToRender": true
    },
    "product": {
      "evidenceRules": [
        {
          "id": "price",
          "type": "selector",
          "target": "span.price[data-currency]"
        },
        {
          "id": "availability",
          "type": "selector",
          "target": "div.availability.in-stock"
        },
        {
          "id": "schema",
          "type": "jsonld",
          "target": "Product"
        }
      ],
      "qualityThreshold": 80,
      "fallbackToRender": true
    }
  }
}
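Because the template is loaded at runtime, it may be worth validating it before any scraping starts. The sketch below parses the JSON and rejects unknown rule types or text-length rules without a threshold. Note that qualityThreshold and fallbackToRender appear in the template but not in the ScrapeConfig interface shown earlier, so the TargetConfig type and the parseTargets function here are hypothetical.

```typescript
type RuleType = 'selector' | 'jsonld' | 'text-length' | 'api-response';

interface Rule {
  id: string;
  type: RuleType;
  target: string;
  threshold?: number;
}

// Hypothetical per-target shape mirroring the JSON template.
interface TargetConfig {
  evidenceRules: Rule[];
  qualityThreshold: number;
  fallbackToRender: boolean;
}

const RULE_TYPES: readonly string[] = ['selector', 'jsonld', 'text-length', 'api-response'];

// Parse the template and fail fast on structural mistakes.
function parseTargets(json: string): Record<string, TargetConfig> {
  const parsed = JSON.parse(json) as { targets?: Record<string, TargetConfig> };
  if (!parsed.targets || typeof parsed.targets !== 'object') {
    throw new Error('Config must contain a "targets" object');
  }
  for (const [targetName, target] of Object.entries(parsed.targets)) {
    if (!Array.isArray(target.evidenceRules) || target.evidenceRules.length === 0) {
      throw new Error(`Target "${targetName}" needs at least one evidence rule`);
    }
    for (const rule of target.evidenceRules) {
      if (!RULE_TYPES.includes(rule.type)) {
        throw new Error(`Target "${targetName}": unknown rule type "${rule.type}"`);
      }
      if (rule.type === 'text-length' && typeof rule.threshold !== 'number') {
        throw new Error(`Target "${targetName}": rule "${rule.id}" needs a numeric threshold`);
      }
    }
  }
  return parsed.targets;
}
```

The returned map can then be indexed by target name to build the evidenceRules array passed to the scraper.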
Quick Start Guide
- Install Dependencies: npm install playwright
- Define Evidence Rules: Create a configuration file specifying selectors, JSON-LD types, and text thresholds for your target data.
- Initialize Scraper: Instantiate the EvidenceScraper, load your rules, and call scrape() with the target URL.
- Validate Output: Check qualityScore and evidenceHit in the result. Discard results below the quality threshold.
- Monitor: Log evidence hit rates and quality scores to detect anomalies or site structure changes. Adjust rules as needed.
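For the monitoring step, the two numbers worth tracking can be derived from a log of results. The following sketch computes an evidence hit rate and a mean quality score, then flags possible selector drift when the hit rate falls below a floor. The 0.8 floor is an arbitrary placeholder, and summarizeResults, hasDrifted, and the entry shape are hypothetical names.

```typescript
// Minimal log entry; a ScrapeResult carries both fields.
interface ResultLogEntry {
  evidenceHit: boolean;
  qualityScore: number;
}

interface ScrapeStats {
  total: number;
  evidenceHitRate: number;  // fraction of results where evidence was found
  meanQualityScore: number; // arithmetic mean over all results
}

// Aggregate a batch of logged results into summary statistics.
function summarizeResults(entries: ResultLogEntry[]): ScrapeStats {
  if (entries.length === 0) {
    return { total: 0, evidenceHitRate: 0, meanQualityScore: 0 };
  }
  const hits = entries.filter(entry => entry.evidenceHit).length;
  const scoreSum = entries.reduce((sum, entry) => sum + entry.qualityScore, 0);
  return {
    total: entries.length,
    evidenceHitRate: hits / entries.length,
    meanQualityScore: scoreSum / entries.length
  };
}

// Flag possible selector drift when the hit rate drops below a floor.
// The 0.8 default is a placeholder and should be tuned per target.
function hasDrifted(stats: ScrapeStats, minHitRate = 0.8): boolean {
  return stats.total > 0 && stats.evidenceHitRate < minHitRate;
}
```

Running this per target and per rule version would make it easier to tell a site redesign apart from a transient outage.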