ting the decision logic from the execution layer. The following implementation demonstrates a TypeScript-based routing system that selects the appropriate runtime based on workload characteristics, followed by isolated execution modules for each strategy.
Start with a strict interface that normalizes output across all runtimes. This prevents downstream consumers from breaking when switching strategies.
export interface ExtractionRequest {
targetUrl: string;
strategy: 'http' | 'headless' | 'extension';
selectors: string[];
authMethod?: 'cookie' | 'session' | 'none';
}
export interface ExtractionResult {
rows: string[][];
metadata: {
strategy: string;
latencyMs: number;
status: 'success' | 'partial' | 'failed';
};
}
Step 2: Implement the HTTP Client Module
Use native fetch for static endpoints. This module prioritizes speed and resource efficiency. It avoids launching a browser by targeting JSON endpoints or server-rendered HTML, which a lightweight parser (node-html-parser) turns into a queryable tree.
import { parse } from 'node-html-parser';
export async function extractViaHttp(req: ExtractionRequest): Promise<ExtractionResult> {
const start = performance.now();
const response = await fetch(req.targetUrl, {
headers: { 'User-Agent': 'DataPipeline/1.0', Accept: 'text/html' }
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const html = await response.text();
const root = parse(html);
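const selector = req.selectors[0] ?? 'table.data-grid'; // caller's selector, with the original default as a fallback
const table = root.querySelector(selector);
if (!table) return { rows: [], metadata: { strategy: 'http', latencyMs: performance.now() - start, status: 'failed' } };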
const extracted = Array.from(table.querySelectorAll('tr')).map(row =>
Array.from(row.querySelectorAll('td, th')).map(cell => cell.textContent.trim())
);
return {
rows: extracted,
metadata: { strategy: 'http', latencyMs: performance.now() - start, status: 'success' }
};
}
Step 3: Implement the Headless Browser Module
Use Playwright for client-side rendered interfaces. This module handles JavaScript execution, dynamic waits, and interaction flows. The example randomizes the viewport size on each run, which makes the browser fingerprint less uniform.
import { chromium, Browser, BrowserContext, Page } from 'playwright';
export async function extractViaHeadless(req: ExtractionRequest): Promise<ExtractionResult> {
const start = performance.now();
let browser: Browser | null = null;
try {
browser = await chromium.launch({ headless: true });
const context: BrowserContext = await browser.newContext({
viewport: { width: 1280 + Math.floor(Math.random() * 200), height: 720 + Math.floor(Math.random() * 100) },
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
});
const page: Page = await context.newPage();
await page.goto(req.targetUrl, { waitUntil: 'networkidle' });
await page.waitForSelector(req.selectors[0], { timeout: 15000 });
const data = await page.evaluate((sel) => {
const table = document.querySelector(sel);
if (!table) return [];
return Array.from(table.querySelectorAll('tr')).map(row =>
Array.from(row.querySelectorAll('td, th')).map(c => c.textContent?.trim() || '')
);
}, req.selectors[0]);
return {
rows: data,
metadata: { strategy: 'headless', latencyMs: performance.now() - start, status: 'success' }
};
} finally {
await browser?.close();
}
}
Step 4: Implement the Extension Content Script
Browser extensions run inside the active tab, inheriting cookies, local storage, and rendered DOM state. This module listens for messages from the popup or background script and returns parsed data directly.
// content-script.ts
interface ExtractionMessage {
type: 'REQUEST_TABLE_DATA';
payload: { selector: string };
}
chrome.runtime.onMessage.addListener((message: ExtractionMessage, _sender, sendResponse) => {
if (message.type === 'REQUEST_TABLE_DATA') {
const target = document.querySelector(message.payload.selector);
if (!target) {
sendResponse({ error: 'Selector not found' });
return;
}
const parsed = Array.from(target.querySelectorAll('tr')).map(row =>
Array.from(row.querySelectorAll('td, th')).map(cell => cell.textContent?.trim() || '')
);
sendResponse({ data: parsed, timestamp: Date.now() });
}
return false; // The response was sent synchronously; return true only if sendResponse is called later
});
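The content script answers with one of two shapes, so the caller still has to map that reply onto the shared result contract. A minimal sketch of that caller side follows, assuming the message sender is injected; the names `extractViaExtension`, `TableResponse`, and `send` are illustrative and not part of the original code. The interfaces are repeated so the block stands alone.

```typescript
interface ExtractionMessage {
  type: 'REQUEST_TABLE_DATA';
  payload: { selector: string };
}

// The two response shapes the content script can send back.
type TableResponse = { data: string[][]; timestamp: number } | { error: string };

interface ExtractionResult {
  rows: string[][];
  metadata: { strategy: string; latencyMs: number; status: 'success' | 'partial' | 'failed' };
}

// Hypothetical caller-side helper. `send` is injected so the function does not
// depend on the chrome global; an undefined reply means no listener answered.
async function extractViaExtension(
  selector: string,
  send: (message: ExtractionMessage) => Promise<TableResponse | undefined>
): Promise<ExtractionResult> {
  const startedAt = Date.now();
  const response = await send({ type: 'REQUEST_TABLE_DATA', payload: { selector } });
  const latencyMs = Date.now() - startedAt;

  if (!response || 'error' in response) {
    return { rows: [], metadata: { strategy: 'extension', latencyMs, status: 'failed' } };
  }
  return { rows: response.data, metadata: { strategy: 'extension', latencyMs, status: 'success' } };
}
```

In a popup or background script, `send` could wrap `chrome.tabs.sendMessage(tabId, message)` for the active tab. Injecting it keeps the mapping logic testable outside the browser.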
Architecture Rationale
- Strategy Isolation: Each runtime lives in its own module. This prevents dependency bloat and allows independent testing.
- Type Safety: TypeScript enforces consistent input/output shapes, making it trivial to swap strategies without breaking downstream consumers.
- Resource Boundaries: Headless instances are explicitly closed in finally blocks to prevent memory leaks. HTTP requests use native fetch and a lightweight HTML parser instead of a full browser engine. Extensions leverage the existing browser process, eliminating infrastructure costs.
- Why This Matters: Production pipelines fail when strategies are mixed haphazardly. Separating execution contexts ensures that scaling one workload (e.g., batch HTTP requests) doesn't degrade another (e.g., interactive headless sessions).
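The isolation described above implies a thin dispatcher that owns the routing decision and nothing else. A minimal sketch, assuming the strategy functions are injected as a lookup table; `createRouter` and `StrategyFn` are illustrative names that do not appear in the original code, and the interfaces are repeated so the block stands alone.

```typescript
type Strategy = 'http' | 'headless' | 'extension';

interface ExtractionRequest {
  targetUrl: string;
  strategy: Strategy;
  selectors: string[];
}

interface ExtractionResult {
  rows: string[][];
  metadata: { strategy: string; latencyMs: number; status: 'success' | 'partial' | 'failed' };
}

// Hypothetical type: any function that fulfils the shared contract.
type StrategyFn = (req: ExtractionRequest) => Promise<ExtractionResult>;

// Hypothetical helper: builds a router from a table of strategy implementations.
// Strategies that are not registered fail loudly instead of silently falling through.
function createRouter(strategies: Partial<Record<Strategy, StrategyFn>>): StrategyFn {
  return async (req) => {
    const run = strategies[req.strategy];
    if (!run) {
      throw new Error(`No implementation registered for strategy "${req.strategy}"`);
    }
    return run(req);
  };
}
```

In the pipeline above, the table would map `http` to `extractViaHttp` and `headless` to `extractViaHeadless`, so each module can still be imported and tested without the others.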
Pitfall Guide
1. Selector Fragility & DOM Volatility
Explanation: Relying on auto-generated class names or positional selectors (tr:nth-child(3)) causes pipelines to break on minor UI updates.
Fix: Target semantic attributes (data-testid, aria-label, role), implement fallback chains, or use structural heuristics (e.g., "first table with >5 columns"). Log selector misses to trigger manual review.
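A fallback chain can be sketched as a small helper that walks the selectors in priority order and reports every miss. This is an illustration under stated assumptions: `resolveFirstMatch` is a hypothetical name, and the lookup function is injected so the same logic works with `document.querySelector` in a page or with a parsed HTML tree on the server.

```typescript
// Hypothetical helper: tries selectors in priority order and reports misses.
function resolveFirstMatch<T>(
  selectors: string[],
  query: (selector: string) => T | null,
  onMiss: (selector: string) => void = (s) => console.warn(`selector miss: ${s}`)
): { selector: string; node: T } | null {
  for (const selector of selectors) {
    const node = query(selector);
    if (node !== null) return { selector, node };
    onMiss(selector); // surfaced so repeated misses can trigger a manual review
  }
  return null; // every selector in the chain missed
}
```

A call such as `resolveFirstMatch(['[data-testid="orders"]', 'table[role="grid"]', 'table'], (s) => document.querySelector(s))` would prefer the semantic attribute and only fall back to the bare tag when the earlier selectors fail.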
2. Session Decay & Cookie Rotation
Explanation: Authentication tokens expire, and session cookies rotate. Hardcoding cookies into HTTP clients leads to silent 401/403 failures.
Fix: Implement a token refresh loop. For headless browsers, use context.storageState() to persist and reload sessions. For HTTP clients, bridge extension cookies via chrome.cookies.getAll() and serialize them into fetch headers.
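The cookie bridge and refresh loop can be sketched as follows. The helper names (`toCookieHeader`, `fetchWithSession`) and the injected `loadCookies` and `send` callbacks are assumptions made for illustration; where the cookies come from (an extension, a saved Playwright storage state, a login call) is left to the caller.

```typescript
// Minimal cookie shape; both chrome.cookies.Cookie and the cookies in
// Playwright's storage state carry at least these two fields.
interface CookieLike {
  name: string;
  value: string;
}

// Serializes cookies into the value of a Cookie request header.
function toCookieHeader(cookies: CookieLike[]): string {
  return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}

// Hypothetical helper: reloads cookies and retries when the server answers 401/403.
// `loadCookies` is expected to return the freshest cookies available on each call.
async function fetchWithSession<R extends { status: number }>(
  url: string,
  loadCookies: () => Promise<CookieLike[]>,
  send: (url: string, headers: Record<string, string>) => Promise<R>,
  maxRefreshes = 1
): Promise<R> {
  let response = await send(url, { Cookie: toCookieHeader(await loadCookies()) });
  for (let i = 0; i < maxRefreshes && (response.status === 401 || response.status === 403); i++) {
    response = await send(url, { Cookie: toCookieHeader(await loadCookies()) });
  }
  return response; // may still be 401/403 if the refresh did not help
}
```

In Node.js, `send` could be as small as `(url, headers) => fetch(url, { headers })`. Returning the last response rather than throwing keeps the 401/403 visible to the caller instead of failing silently.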
3. Ignoring Rate Limits & Behavioral Fingerprinting
Explanation: Sending requests at machine speed triggers WAFs and CAPTCHAs. Default headless fingerprints (navigator.webdriver, missing WebGL, fixed viewport) are easily flagged.
Fix: Inject randomized delays (1.5 to 4.0s), rotate user agents, randomize viewport dimensions, and use playwright-extra with stealth plugins. Monitor response codes; implement exponential backoff on 429/403.
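The delay and backoff parts of this fix reduce to two small formulas. The sketch below is one possible shape, with hypothetical helper names (`jitterMs`, `backoffMs`, `politeRequest`) and assumed defaults of a 1 s base and a 60 s cap; the random source and the wait function are injectable so the behavior can be checked without real timers.

```typescript
// Uniform random delay in [minMs, maxMs); defaults mirror the 1.5 to 4.0 s range.
function jitterMs(minMs = 1500, maxMs = 4000, rand: () => number = Math.random): number {
  return minMs + rand() * (maxMs - minMs);
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at capMs.
function backoffMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: re-sends while the server answers 429 or 403,
// waiting longer before each retry, for at most maxAttempts sends in total.
async function politeRequest<R extends { status: number }>(
  send: () => Promise<R>,
  maxAttempts = 4,
  wait: (ms: number) => Promise<void> = sleep
): Promise<R> {
  let response = await send();
  for (
    let attempt = 0;
    attempt < maxAttempts - 1 && (response.status === 429 || response.status === 403);
    attempt++
  ) {
    await wait(backoffMs(attempt));
    response = await send();
  }
  return response;
}
```

Between successful requests, a call like `await sleep(jitterMs())` supplies the randomized pause; `politeRequest` only governs what happens after a throttling response.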
4. Over-Engineering Ad-Hoc Tasks
Explanation: Building a full Playwright pipeline for a one-time export wastes engineering time and introduces maintenance overhead.
Fix: Apply the "10-minute rule". If manual export or extension extraction takes less time than writing, testing, and deploying the automation, skip the pipeline. Document the manual process instead.
5. Memory Leaks in Headless Batches
Explanation: Creating pages without closing them, or failing to clear caches, causes Node.js processes to OOM after hundreds of iterations.
Fix: Strictly manage page lifecycle: page.close() after extraction, context.clearCookies() between sessions, and run batches inside isolated Docker containers with memory limits (--memory=512m).
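The page lifecycle rule can be enforced structurally instead of by convention. The following sketch assumes two hypothetical helpers, `withPage` and `runBatch`, and describes the browser objects through small structural interfaces so the block compiles without Playwright installed; a Playwright `BrowserContext` and `Page` satisfy these shapes.

```typescript
// Structural subsets of Playwright's Page and BrowserContext.
interface ClosablePage {
  close(): Promise<void>;
}
interface PageFactory<P extends ClosablePage> {
  newPage(): Promise<P>;
}

// Opens a page, runs the work, and always closes the page afterwards.
async function withPage<P extends ClosablePage, T>(
  context: PageFactory<P>,
  work: (page: P) => Promise<T>
): Promise<T> {
  const page = await context.newPage();
  try {
    return await work(page);
  } finally {
    await page.close(); // runs on success and on failure
  }
}

// Processes URLs one at a time so at most one page is open at any moment.
async function runBatch<P extends ClosablePage, T>(
  context: PageFactory<P>,
  urls: string[],
  extract: (page: P, url: string) => Promise<T>
): Promise<T[]> {
  const results: T[] = [];
  for (const url of urls) {
    results.push(await withPage(context, (page) => extract(page, url)));
  }
  return results;
}
```

Sequential processing is a deliberate simplification here. A concurrent variant would need a cap on simultaneously open pages, which this sketch does not attempt.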
6. Assuming Client-Side Data Equals Server-Side
Explanation: Parsing the DOM when the actual data originates from an XHR/fetch call adds unnecessary latency and complexity.
Fix: Always inspect the Network tab first. If a JSON endpoint exists, route through the HTTP client. DOM parsing should be a fallback, not the primary strategy.
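When the Network tab does reveal a JSON endpoint, the remaining work is mostly reshaping records into the `rows` format used by the other strategies. A minimal sketch, assuming the endpoint returns a flat array of objects; `recordsToRows` is a hypothetical helper and real payloads are often nested or paginated.

```typescript
type JsonRecord = Record<string, unknown>;

// Converts an array of flat JSON objects into a header row plus string rows.
// Column order comes from `columns` when given, else from the first record's keys.
function recordsToRows(records: JsonRecord[], columns?: string[]): string[][] {
  if (records.length === 0) return [];
  const header = columns ?? Object.keys(records[0]);
  const body = records.map((record) =>
    header.map((key) => (record[key] == null ? '' : String(record[key])))
  );
  return [header, ...body];
}
```

The output matches what the table parsers produce from `th` and `td` cells, so downstream consumers do not need to know whether the data came from JSON or from the DOM.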
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| One-time export from authenticated dashboard | Browser Extension | Zero auth overhead, instant DOM access, no infrastructure | $0 infra, ~5 min dev |
| High-frequency public data (1k+ pages/day) | HTTP Client + Queue | Lowest resource footprint, highest concurrency, predictable scaling | Low compute, moderate dev |
| Interactive SPA with dynamic pagination | Playwright/Puppeteer | Full JS execution, click/scroll simulation, deterministic waits | High compute, moderate dev |
| API endpoint available | Direct API Integration | Structured JSON, official rate limits, versioned contracts | Lowest dev, predictable cost |
| Complex table with merged cells & custom headers | Extension + CSV Export | Handles DOM quirks natively, zero parsing logic required | $0 infra, ~2 min dev |
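The first four rows of the matrix can be encoded as a small decision function. The `Workload` fields and the precedence order below are assumptions chosen for illustration, not rules stated by the matrix itself; the merged-cells row is left out because it depends on inspecting the page.

```typescript
type Approach = 'api' | 'http' | 'headless' | 'extension';

// Hypothetical description of a workload; field names are illustrative.
interface Workload {
  hasOfficialApi: boolean;      // a documented API endpoint exists
  oneOff: boolean;              // a single export rather than a recurring job
  requiresLogin: boolean;       // data sits behind an authenticated session
  requiresJsRendering: boolean; // content only appears after client-side rendering
}

// Checks the cheapest viable option first and falls through to plain HTTP.
function recommendApproach(w: Workload): Approach {
  if (w.hasOfficialApi) return 'api';
  if (w.oneOff && w.requiresLogin) return 'extension';
  if (w.requiresJsRendering) return 'headless';
  return 'http';
}
```

A recurring job against an authenticated, server-rendered page falls through to `http` in this sketch, which then relies on session handling of the kind described in the Pitfall Guide.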
Configuration Template
// extractor.config.ts
export const ExtractionConfig = {
strategies: {
http: {
enabled: true,
maxConcurrency: 50,
timeoutMs: 10000,
retryAttempts: 2,
headers: {
'Accept': 'text/html,application/xhtml+xml',
'Accept-Language': 'en-US,en;q=0.9'
}
},
headless: {
enabled: true,
maxConcurrency: 10,
timeoutMs: 30000,
viewport: { width: 1280, height: 720 },
stealth: true,
waitUntil: 'networkidle'
},
extension: {
enabled: true,
trigger: 'manual',
messageTypes: ['REQUEST_TABLE_DATA'],
storageKey: 'extraction_session'
}
},
fallback: {
enabled: true,
order: ['http', 'headless', 'extension'],
maxRetries: 1
},
monitoring: {
logLevel: 'info',
alertOnFailure: true,
metricsEndpoint: '/api/v1/extraction/metrics'
}
};
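The template does not say how `fallback.order` is consumed, so the following is one possible interpretation: try each enabled strategy in order and stop at the first result that is not `failed`. The function name `runWithFallback` and the reduced `Outcome` type are hypothetical, and `maxRetries` is deliberately ignored to keep the sketch short.

```typescript
type StrategyName = 'http' | 'headless' | 'extension';

interface Outcome {
  rows: string[][];
  status: 'success' | 'partial' | 'failed';
}

interface FallbackConfig {
  enabled: boolean;
  order: StrategyName[];
}

// Assumed semantics: with fallback disabled, only the first strategy in `order` runs.
async function runWithFallback(
  fallback: FallbackConfig,
  enabled: Record<StrategyName, boolean>,
  runners: Record<StrategyName, () => Promise<Outcome>>
): Promise<{ strategy: StrategyName; outcome: Outcome } | null> {
  const order = fallback.enabled ? fallback.order : fallback.order.slice(0, 1);
  for (const strategy of order) {
    if (!enabled[strategy]) continue; // skip strategies switched off in the config
    try {
      const outcome = await runners[strategy]();
      if (outcome.status !== 'failed') return { strategy, outcome };
    } catch {
      // A thrown error is treated like a failed result; move on to the next strategy.
    }
  }
  return null; // nothing produced usable data
}
```

With the template above, `enabled` would be derived from `ExtractionConfig.strategies[name].enabled`. Placing `extension` last in the order only makes sense when a user is available to trigger it, since its `trigger` is `manual`.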
Quick Start Guide
- Install Dependencies: Run npm install playwright node-html-parser typescript @types/node, then download the browser binary with npx playwright install chromium. Initialize TypeScript with npx tsc --init.
- Create the Router: Copy the ExtractionRequest and ExtractionResult interfaces into types.ts. Implement the strategy switcher that routes to extractViaHttp, extractViaHeadless, or extension messaging based on req.strategy.
- Test with a Static Page: Point the HTTP module to a server-rendered table. Verify parsing accuracy and measure latency.
- Scale to Dynamic Content: Switch strategy to headless. Add randomized delays and viewport rotation. Run a batch of 10 URLs and monitor memory usage.
- Deploy with Monitoring: Wrap the runner in a Docker container. Add health checks, log extraction metrics, and configure alerts for selector failures or HTTP 4xx/5xx spikes.