ssion persistence.
Implementation Patterns
The following TypeScript examples illustrate the implementation differences. Note that while the source material references Python, TypeScript is used here to demonstrate modern type-safe automation patterns common in production engineering.
Pattern A: Lightweight Extraction (Static Targets)
This pattern is suitable for static job boards or documentation sites. It relies on direct HTTP requests and DOM parsing.
import axios from 'axios';
import * as cheerio from 'cheerio';

interface LeadRecord {
  title: string;
  company: string;
  url: string;
}

export async function extractStaticLeads(targetUrl: string): Promise<LeadRecord[]> {
  try {
    // Lightweight request; fails if target requires JS execution
    const response = await axios.get(targetUrl, {
      timeout: 10000,
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; LeadBot/1.0)',
      },
    });
    const $ = cheerio.load(response.data);
    const leads: LeadRecord[] = [];
    // Selector assumes static HTML structure
    $('.job-listing').each((_index, element) => {
      leads.push({
        title: $(element).find('h2').text().trim(),
        company: $(element).find('.company-name').text().trim(),
        url: $(element).find('a').attr('href') || '',
      });
    });
    return leads;
  } catch (error) {
    // Blocks or empty responses likely here for dynamic targets
    console.error('Extraction failed:', error);
    return [];
  }
}
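Static listings often use relative links, so the `url` field in `extractStaticLeads` may come back as a path such as `/jobs/123`. A minimal sketch of normalizing it against the page URL, assuming the standard `URL` class available in Node.js and browsers; `resolveHref` is a hypothetical helper, not part of the original example:

```typescript
// Hypothetical helper: turn a possibly relative href into an absolute URL.
// Returns '' when the href is missing or cannot be parsed, matching the
// empty-string fallback used in extractStaticLeads.
function resolveHref(href: string | undefined, pageUrl: string): string {
  if (!href) return '';
  try {
    // The second argument is the base that relative hrefs resolve against.
    return new URL(href, pageUrl).toString();
  } catch {
    return '';
  }
}
```

Inside the `.each` callback this could replace the raw `attr('href')` read, for example `url: resolveHref($(element).find('a').attr('href'), targetUrl)`.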
Pattern B: Browser-Native Automation (Dynamic Targets)
For targets requiring JavaScript execution, Playwright provides a robust automation framework. This example demonstrates a configuration that attempts to mitigate detection via stealth plugins and proxy rotation, though it remains subject to structural limitations on high-security targets.
import { chromium } from 'playwright-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import type { Browser, BrowserContext } from 'playwright';
import { ProxyConfiguration } from './proxy-config';

// Register the stealth plugin once, before launch, to patch navigator.webdriver
// and other leaks. playwright-extra reuses the puppeteer-extra stealth plugin.
chromium.use(StealthPlugin());

interface DynamicLead {
  name?: string;
  role?: string;
  contact?: string | null;
}

export class DynamicLeadExtractor {
  private browser: Browser | null = null;
  private context: BrowserContext | null = null;

  async initialize(proxyConfig: ProxyConfiguration): Promise<void> {
    this.browser = await chromium.launch({
      headless: true,
      args: ['--disable-blink-features=AutomationControlled'],
    });
    this.context = await this.browser.newContext({
      proxy: {
        server: proxyConfig.server,
        username: proxyConfig.username,
        password: proxyConfig.password,
      },
      viewport: { width: 1920, height: 1080 },
      userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    });
  }

  async extractLeads(url: string): Promise<DynamicLead[]> {
    if (!this.context) throw new Error('Extractor not initialized');
    const page = await this.context.newPage();
    try {
      // Navigate and wait for network idle so that XHR calls can complete
      await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });
      // Wait for dynamic content injection
      await page.waitForSelector('.lead-card-container', { state: 'visible' });
      // Extract data after JS execution
      const leads = await page.evaluate(() => {
        const elements = document.querySelectorAll('.lead-card');
        return Array.from(elements).map((el) => ({
          name: el.querySelector('.name')?.textContent?.trim(),
          role: el.querySelector('.role')?.textContent?.trim(),
          contact: el.querySelector('.contact-info')?.getAttribute('data-email'),
        }));
      });
      return leads;
    } finally {
      await page.close();
    }
  }

  async teardown(): Promise<void> {
    // Closing the browser also closes its contexts
    if (this.browser) await this.browser.close();
    this.browser = null;
    this.context = null;
  }
}
Pattern C: Browser-Native Context (Extension-Based)
Browser-native extensions operate within the user's active Chromium instance. They inherit the existing TLS fingerprint, cookies, session tokens, and browsing history. This eliminates the need for proxy rotation to mimic identity and bypasses headless detection entirely.
The implementation shifts from automation scripts to content scripts that interact with the live DOM.
// content-script.ts
// Runs inside the real browser context; no TLS or headless issues
interface ExtractedLead {
  name: string;
  phone: string | null;
  timestamp: number;
}

interface ExtractionResult {
  success: boolean;
  data: ExtractedLead[];
  error?: string;
}

export function extractCurrentPageLeads(): ExtractionResult {
  try {
    // Access live DOM with full session context
    const leadElements = document.querySelectorAll('[data-testid="lead-item"]');
    const results: ExtractedLead[] = [];
    leadElements.forEach((el) => {
      const name = el.querySelector('.lead-name')?.textContent?.trim();
      const phone = el.querySelector('.lead-phone')?.getAttribute('data-phone') ?? null;
      if (name) {
        results.push({ name, phone, timestamp: Date.now() });
      }
    });
    return { success: true, data: results };
  } catch (err) {
    // data is required by ExtractionResult, so return an empty array on failure
    return { success: false, data: [], error: 'DOM access denied or structure changed' };
  }
}
Rationale for Choices:
- TypeScript over Python: TypeScript provides strict typing for data schemas, reducing runtime errors during pipeline integration. It also aligns with the ecosystem of modern browser automation tools like Playwright.
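- Network Idle Strategy: Dynamic targets load data asynchronously, so extraction must wait until that data is in the DOM. Waiting for `networkidle` works as a coarse signal, but waiting for a specific selector is more dependable; Playwright's own documentation discourages relying on `networkidle` alone.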
- Stealth Plugins: In automation patterns, stealth plugins are necessary to patch common detection vectors like `navigator.webdriver`. However, they cannot fully replicate the behavioral and TLS consistency of a real browser session.
- Extension Architecture: For high-value targets, the extension approach minimizes infrastructure overhead. There is no need to manage proxy pools or browser instances; the extraction leverages the user's existing environment.
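To make the typing rationale concrete: interfaces are erased at compile time, so scraped values still need a runtime check at the pipeline boundary. A minimal sketch, assuming the `LeadRecord` shape from Pattern A and assuming that every field must be a non-empty string; `isLeadRecord` and `keepValidLeads` are hypothetical names:

```typescript
interface LeadRecord {
  title: string;
  company: string;
  url: string;
}

// Runtime type guard: true only if every field is a non-empty string.
function isLeadRecord(value: unknown): value is LeadRecord {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return (['title', 'company', 'url'] as const).every((key) => {
    const field = record[key];
    return typeof field === 'string' && field.trim().length > 0;
  });
}

// Split raw rows into typed leads plus a rejected count for monitoring.
function keepValidLeads(rows: unknown[]): { leads: LeadRecord[]; rejected: number } {
  const leads = rows.filter(isLeadRecord);
  return { leads, rejected: rows.length - leads.length };
}
```

Note that Pattern A falls back to an empty `url`, which this guard would reject; whether that is desirable depends on the downstream pipeline.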
Pitfall Guide
- The Empty Shell Trap
  - Explanation: Fetching the initial HTML response and parsing it immediately yields nothing on modern SPAs, which return an empty `<div id="app"></div>` shell and load data via JavaScript 200–500ms later.
  - Fix: Always wait for network idle or specific DOM selectors indicating data injection before extraction. Use `waitUntil: 'networkidle'` in automation, or event listeners (for example a `MutationObserver`) in extensions.
- TLS Fingerprint Mismatch
  - Explanation: HTTP clients such as Python's `requests` or Node's `axios` generate distinct TLS handshakes (JA3/JA4 signatures) that differ from standard browsers. Anti-bot systems can flag these mismatches before any page content is served.
  - Fix: Use browser-based extraction or tools that emulate browser TLS fingerprints. Lightweight HTTP clients cannot easily spoof TLS signatures without significant complexity.
- Headless Detection Heuristics
  - Explanation: Sites check for automation artifacts such as `navigator.webdriver`, missing plugins, or inconsistent canvas fingerprints. Headless browsers often leak these signals.
  - Fix: Apply stealth patches and override detection vectors. However, on high-security targets like LinkedIn, headless detection remains effective. Browser-native context is the only reliable mitigation.
- Vendor Data Staleness
  - Explanation: Relying on purchased B2B lists without verifying freshness. Vendor data can be months old, with phone accuracy dropping to 61% and email accuracy to 48%.
  - Fix: Treat vendor lists as starting points only. Validate critical fields via live scraping or verification services. Prioritize live extraction for time-sensitive campaigns.
- Proxy Over-Engineering
  - Explanation: Investing heavily in residential proxy rotation while ignoring TLS and behavioral signals. Proxies mask IP reputation but do not fix fingerprint mismatches or headless detection.
  - Fix: Prioritize context fidelity over IP rotation. Ensure the extraction agent mimics a real browser's TLS and headers before scaling proxy infrastructure.
- Behavioral Rate Limiting
  - Explanation: Even with valid proxies and headers, automation scripts may trigger rate limits due to unnatural request timing or lack of mouse/keyboard events.
  - Fix: Introduce randomized delays and simulate user interactions where possible. Browser-native extensions inherit natural user behavior, avoiding this pitfall.
- Selector Fragility
  - Explanation: Hardcoding CSS selectors that break when the target site updates its DOM structure.
  - Fix: Use robust selectors based on data attributes or semantic structure. Implement fallback logic and monitoring to detect extraction failures early.
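The fallback logic suggested for Selector Fragility can be sketched as an ordered list of selectors, tried from most to least stable. This is an illustration under assumptions: `firstMatch`, `QueryRoot`, and the selectors in the usage note are hypothetical, and the structural `QueryRoot` type is intended to accept a DOM `Element` or `Document` as well as plain test objects.

```typescript
// Minimal structural type so the helper is not tied to a DOM implementation.
interface QueryRoot {
  querySelector(selector: string): { textContent: string | null } | null;
}

interface FieldMatch {
  value: string;
  selector: string; // which selector matched, useful for drift monitoring
}

// Try selectors in priority order and return the first non-empty text match.
// Returns null when nothing matches, which callers should log as a failure.
function firstMatch(root: QueryRoot, selectors: string[]): FieldMatch | null {
  for (const selector of selectors) {
    const text = root.querySelector(selector)?.textContent?.trim();
    if (text) return { value: text, selector };
  }
  return null;
}
```

A call such as `firstMatch(el, ['[data-testid="lead-name"]', '.lead-name'])` prefers the data attribute. Recording which selector matched gives an early signal: a rising share of fallback matches suggests the primary selector has drifted.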
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Static HTML Job Boards | Lightweight Automation (Python/TS) | No JS rendering required; low detection risk. | Low (Minimal infra) |
| LinkedIn Lead Extraction | Browser-Native Extension | High security; headless detection active; session context critical. | Medium (Extension dev/maint) |
| Google Maps Business Data | Playwright + Residential Proxies | Dynamic content; moderate detection; proxy rotation effective. | Medium-High (Proxy costs) |
| Bulk Vendor List Enrichment | Vendor API + Live Verification | High volume; freshness validation needed; scraping too slow. | High (API costs) |
| Yelp Category Scraping | Managed Cloud Actors | Balanced cost/performance; handles JS and proxies. | Medium (SaaS fees) |
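The matrix can be read as a small decision function. The sketch below is a simplification under assumptions: it covers only the three self-built approaches (the vendor API and managed actor rows are left out), and the `TargetProfile` fields and approach labels are hypothetical names rather than terms from the source.

```typescript
type Rendering = 'static' | 'dynamic';
type Security = 'low' | 'moderate' | 'high';
type Approach = 'lightweight-http' | 'playwright-with-proxies' | 'browser-extension';

interface TargetProfile {
  rendering: Rendering; // does the data arrive in the initial HTML?
  security: Security;   // rough estimate of anti-bot strictness
}

function chooseApproach(target: TargetProfile): Approach {
  // High-security targets need real session context regardless of rendering.
  if (target.security === 'high') return 'browser-extension';
  // Static pages with low or moderate detection risk need no JS rendering.
  if (target.rendering === 'static') return 'lightweight-http';
  // Dynamic content with low or moderate detection: headless browser plus proxies.
  return 'playwright-with-proxies';
}
```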
Configuration Template
Playwright Automation Configuration with Stealth and Proxy Rotation
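This template provides a baseline setup for dynamic extraction. It suppresses the `AutomationControlled` flag at launch and wires in proxy management; stealth plugin registration happens in the extraction code, not in this file.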
// playwright.config.ts
import { PlaywrightTestConfig } from '@playwright/test';
import { ProxyRotationStrategy } from './proxy-rotation';

const config: PlaywrightTestConfig = {
  use: {
    headless: true,
    viewport: { width: 1920, height: 1080 },
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    javaScriptEnabled: true,
    bypassCSP: false,
    ignoreHTTPSErrors: false,
    acceptDownloads: false,
    // Evaluated once when the config is loaded, so every context created from
    // this config shares the same proxy. For per-context rotation, call
    // getNextProxy() where each context is created instead.
    proxy: ProxyRotationStrategy.getNextProxy(),
    extraHTTPHeaders: {
      'Accept-Language': 'en-US,en;q=0.9',
      'Accept-Encoding': 'gzip, deflate, br',
    },
  },
  timeout: 60000,
  retries: 3,
  expect: {
    timeout: 10000,
  },
  projects: [
    {
      name: 'chromium',
      use: {
        browserName: 'chromium',
        launchOptions: {
          args: [
            '--disable-blink-features=AutomationControlled',
            '--disable-features=IsolateOrigins,site-per-process',
            '--window-size=1920,1080',
          ],
        },
      },
    },
  ],
};

export default config;
Proxy Rotation Strategy Stub
// proxy-rotation.ts
export class ProxyRotationStrategy {
  private static proxies: string[] = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    // Add residential proxy endpoints
  ];
  private static currentIndex = 0;

  // Round-robin: returns the next proxy in the list, wrapping at the end
  static getNextProxy(): { server: string; username?: string; password?: string } | undefined {
    if (this.proxies.length === 0) return undefined;
    const proxyUrl = this.proxies[this.currentIndex];
    this.currentIndex = (this.currentIndex + 1) % this.proxies.length;
    const url = new URL(proxyUrl);
    return {
      server: `${url.protocol}//${url.host}`,
      // URL keeps credentials percent-encoded; decode them, and omit empty ones
      username: url.username ? decodeURIComponent(url.username) : undefined,
      password: url.password ? decodeURIComponent(url.password) : undefined,
    };
  }
}
Quick Start Guide
- Identify Target Architecture: Inspect the target site's network traffic. If data loads via XHR/fetch after page load, classify as dynamic. If HTML contains all data, classify as static.
- Select Extraction Tool:
  - Static: Use TypeScript `axios` + `cheerio`.
  - Dynamic/Low Security: Use Playwright with stealth.
  - Dynamic/High Security: Use browser-native extension or managed actor.
- Implement Context Strategy: Configure user agents, headers, and proxies. For automation, apply stealth patches. For extensions, ensure content scripts run in the active tab.
- Validate Sample Yield: Run a test extraction on 50 records. Measure block rate and data completeness. Adjust selectors or context settings if yield is low.
- Scale with Monitoring: Deploy the extraction pipeline with error handling and logging. Monitor block rates and freshness metrics. Rotate proxies or update selectors as needed.
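The "Validate Sample Yield" step can be made measurable with two ratios. The sketch below assumes block rate means blocked attempts divided by total attempts, and completeness means filled required fields divided by expected required fields across unblocked records; `SampleOutcome`, `YieldReport`, and `measureYield` are hypothetical names.

```typescript
interface SampleOutcome {
  blocked: boolean;                            // request was blocked or challenged
  record?: Record<string, string | undefined>; // extracted fields, if any
}

interface YieldReport {
  blockRate: number;    // blocked attempts / total attempts
  completeness: number; // filled required fields / expected required fields
}

function measureYield(outcomes: SampleOutcome[], requiredFields: string[]): YieldReport {
  const total = outcomes.length;
  if (total === 0) return { blockRate: 0, completeness: 0 };

  const blocked = outcomes.filter((o) => o.blocked).length;
  const records = outcomes
    .filter((o) => !o.blocked && o.record !== undefined)
    .map((o) => o.record!);

  // Count required fields that are present and non-blank in each record.
  const expected = records.length * requiredFields.length;
  const filled = records.reduce(
    (sum, r) => sum + requiredFields.filter((f) => (r[f] ?? '').trim() !== '').length,
    0,
  );

  return {
    blockRate: blocked / total,
    completeness: expected === 0 ? 0 : filled / expected,
  };
}
```

Running this over the 50-record sample gives two numbers to track between selector or context changes; the thresholds for "low yield" are left to the team.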