lver
Construct a resolver that evaluates signals in order of stability. The resolver should attempt extraction at each tier, returning immediately on success. If a lower tier is reached, emit telemetry. This creates a canary system that alerts you to frontend changes before they cause silent data corruption.
Step 3: Integrate Telemetry and Circuit Breakers
Log every fallback event. Track the frequency of CSS fallback triggers. If fallback rates exceed a threshold, pause the pipeline and alert the engineering team. This prevents silent degradation and forces proactive selector updates.
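The threshold check described above can be sketched as a small rate-based circuit breaker. This is a minimal illustration under stated assumptions, not part of the resolver in this article: the `FallbackCircuitBreaker` class and its `maxFallbackRate` and `minSamples` parameters are hypothetical names, and the default values are placeholders to tune per pipeline.

```typescript
// Hypothetical sketch: trips when the share of extractions that needed a
// fallback exceeds a configured rate. Names and defaults are assumptions.
class FallbackCircuitBreaker {
  private total = 0;
  private fallbacks = 0;
  private readonly maxFallbackRate: number;
  private readonly minSamples: number;

  constructor(maxFallbackRate: number = 0.2, minSamples: number = 10) {
    this.maxFallbackRate = maxFallbackRate; // trip above this fraction
    this.minSamples = minSamples;           // avoid tripping on tiny samples
  }

  // Record one extraction attempt and whether it used a fallback tier.
  record(usedFallback: boolean): void {
    this.total += 1;
    if (usedFallback) this.fallbacks += 1;
  }

  fallbackRate(): number {
    return this.total === 0 ? 0 : this.fallbacks / this.total;
  }

  // True once enough samples exist and the rate exceeds the threshold.
  isOpen(): boolean {
    return this.total >= this.minSamples && this.fallbackRate() > this.maxFallbackRate;
  }

  // Call between pipeline stages; the thrown error is the "pause" signal
  // that the caller turns into an alert.
  assertClosed(): void {
    if (this.isOpen()) {
      const pct = (this.fallbackRate() * 100).toFixed(1);
      throw new Error(`Circuit open: fallback rate ${pct}% exceeds threshold`);
    }
  }
}
```

Using a rate with a minimum sample size, rather than a raw count, keeps a single flaky page early in a run from pausing the whole pipeline.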
Wrap the resolver in a reusable abstraction that accepts target configuration and returns normalized data. Decouple the resolution logic from business transformation rules.
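One way to read this separation: the extraction layer returns raw strings keyed by target, and a separate step applies business rules. The sketch below assumes a resolver exposing the same `resolve(config)` shape as the `SignalResolver` in the implementation example; the `ValueResolver` interface, `extractAll`, and `parsePrice` are hypothetical names introduced only for illustration.

```typescript
// Minimal structural types so the sketch stands alone; in a real project the
// resolver would be the SignalResolver from the implementation example.
interface TargetConfig { targetName: string }
interface ValueResolver<C> { resolve(config: C): Promise<string | null> }

// Resolution step: returns raw strings keyed by target, no business logic.
async function extractAll<C extends TargetConfig>(
  resolver: ValueResolver<C>,
  configs: C[],
): Promise<Record<string, string | null>> {
  const out: Record<string, string | null> = {};
  for (const config of configs) {
    out[config.targetName] = await resolver.resolve(config);
  }
  return out;
}

// Transformation step: a business rule kept separate from resolution.
// Assumes "." as the decimal separator; other locales need different handling.
function parsePrice(raw: string | null): number | null {
  if (raw === null) return null;
  const cleaned = raw.replace(/[^0-9.]/g, '');
  if (cleaned === '') return null;
  const value = Number.parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}
```

Because `extractAll` depends only on the `resolve` method, the transformation rules can be unit-tested against a stub resolver without launching a browser.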
Implementation Example (TypeScript)
```typescript
import type { Page, Locator } from 'playwright';

export type SignalTier = 'structured' | 'semantic' | 'attribute' | 'visual';

// Role names accepted by Playwright's getByRole.
type AriaRole = Parameters<Page['getByRole']>[0];

export interface ExtractionConfig {
  targetName: string;
  jsonLdPath?: string;
  semanticRole?: string;
  semanticLabel?: string;
  dataAttribute?: string;
  visualSelector?: string;
}

export class SignalResolver {
  private readonly page: Page;
  private readonly fallbackCounter: Map<SignalTier, number> = new Map();

  constructor(page: Page) {
    this.page = page;
  }

  async resolve(config: ExtractionConfig): Promise<string | null> {
    // Tier 1: Structured Data
    if (config.jsonLdPath) {
      const structuredValue = await this.extractFromJsonLd(config.jsonLdPath);
      if (structuredValue) return structuredValue;
    }

    // Tier 2: Semantic Identifiers
    if (config.semanticRole || config.semanticLabel) {
      const semanticValue = await this.extractFromSemantics(config);
      if (semanticValue) return semanticValue;
    }

    // Tier 3: Explicit Data Attributes
    if (config.dataAttribute) {
      const attrValue = await this.extractFromAttribute(config.dataAttribute);
      if (attrValue) return attrValue;
    }

    // Tier 4: Visual Selectors (Monitored Fallback)
    if (config.visualSelector) {
      this.recordFallback('visual');
      return await this.extractFromVisual(config.visualSelector);
    }

    return null;
  }

  private async extractFromJsonLd(jsonPath: string): Promise<string | null> {
    // allTextContents() returns [] when no script exists instead of waiting
    // for one to appear, and lets us check every JSON-LD block on the page.
    const blocks = await this.page
      .locator('script[type="application/ld+json"]')
      .allTextContents();
    for (const raw of blocks) {
      try {
        const value = this.resolveJsonPath(JSON.parse(raw), jsonPath);
        // Compare against null/undefined so values such as 0 are kept.
        if (value !== undefined && value !== null) return String(value);
      } catch {
        // Malformed JSON-LD: skip this block and try the next one.
      }
    }
    return null;
  }

  private async extractFromSemantics(config: ExtractionConfig): Promise<string | null> {
    // Try the role first, then the label, so a configured label is still
    // used when the role matches nothing.
    const candidates: Locator[] = [];
    if (config.semanticRole) {
      candidates.push(this.page.getByRole(config.semanticRole as AriaRole));
    }
    if (config.semanticLabel) {
      // String matching in getByLabel is case-insensitive and substring-based
      // by default, so no RegExp is needed.
      candidates.push(this.page.getByLabel(config.semanticLabel));
    }
    for (const locator of candidates) {
      if ((await locator.count()) > 0) {
        return await locator.first().textContent();
      }
    }
    return null;
  }

  private async extractFromAttribute(attrSelector: string): Promise<string | null> {
    const locator = this.page.locator(`[${attrSelector}]`);
    if ((await locator.count()) > 0) {
      return await locator.first().textContent();
    }
    return null;
  }

  private async extractFromVisual(selector: string): Promise<string | null> {
    const locator = this.page.locator(selector);
    if ((await locator.count()) > 0) {
      return await locator.first().textContent();
    }
    return null;
  }

  // Walks a dotted path such as "offers.price"; yields undefined on a miss.
  private resolveJsonPath(obj: unknown, path: string): unknown {
    return path
      .split('.')
      .reduce<unknown>(
        (current, key) => (current as Record<string, unknown> | null | undefined)?.[key],
        obj,
      );
  }

  private recordFallback(tier: SignalTier): void {
    const count = (this.fallbackCounter.get(tier) ?? 0) + 1;
    this.fallbackCounter.set(tier, count);
    console.warn(`[SelectorAudit] Fallback triggered for ${tier}. Total: ${count}`);
  }

  // Returns a copy so callers cannot mutate the internal counters.
  public getFallbackMetrics(): Map<SignalTier, number> {
    return new Map(this.fallbackCounter);
  }
}
```
Architecture Rationale
The resolver enforces a strict evaluation order. Structured data is evaluated first because it exists independently of the rendering engine and is explicitly authored for machine consumption. Semantic identifiers are second because they reflect the accessibility contract, which frontend teams treat as stable infrastructure. Data attributes are third because they are explicit developer hooks, often preserved across design iterations. Visual selectors are last because they represent styling tokens, which change frequently and carry no semantic guarantee.
The telemetry layer is critical. By counting fallback events, you transform selector degradation into a measurable metric. When visual fallbacks spike, it indicates a frontend deployment has removed higher-tier signals. This gives you a 3–7 day window to update selectors before data extraction fails completely.
Pitfall Guide
1. The Index Trap
Explanation: Using :nth-child(), :nth-of-type(), or array indices to locate elements. Layout changes, dynamic content injection, or ad placements instantly invalidate positional selectors.
Fix: Anchor to semantic roles, data attributes, or structured data. If positioning is unavoidable, validate the parent container's identity first and use relative positioning only within a stable subtree.
2. The Class Chain Mirage
Explanation: Building long CSS chains like .product-list > .card > .price > span. Class names are styling tokens, not identity markers. Component libraries, CSS-in-JS, and design system updates routinely rename or restructure these chains.
Fix: Collapse chains to a single stable anchor. Use data-* attributes or accessibility roles as the primary locator. Reserve CSS classes only for visual state verification, not identification.
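The two pitfalls above can be checked mechanically before a selector ships. The following is a rough heuristic sketch, not an established tool: `auditSelector`, its warning strings, and the default depth limit of 2 are all assumptions chosen for illustration.

```typescript
// Hypothetical lint: flags the fragile patterns described in pitfalls 1 and 2.
// The heuristics and the default depth limit are assumptions, not a standard.
function auditSelector(selector: string, maxDepth: number = 2): string[] {
  const warnings: string[] = [];

  // Pitfall 1: positional pseudo-classes such as :nth-child() or :nth-of-type().
  if (/:nth-(last-)?(child|of-type)\(/.test(selector)) {
    warnings.push('positional pseudo-class');
  }

  // Drop (...) and [...] contents so spaces or "+" inside them are not
  // mistaken for combinators when counting chain depth.
  const stripped = selector.replace(/\([^)]*\)|\[[^\]]*\]/g, '').trim();

  // Pitfall 2: count compound selectors separated by combinators.
  const depth = stripped.split(/\s*[>+~]\s*|\s+/).filter(Boolean).length;
  if (depth > maxDepth) {
    warnings.push(`chain depth ${depth} exceeds ${maxDepth}`);
  }

  // Class names present but no data-*, role, or aria-* attribute to anchor on.
  const hasStableAttr = /\[(data-|role=|aria-)/.test(selector);
  if (!hasStableAttr && /\.[A-Za-z_-]/.test(stripped)) {
    warnings.push('class-only anchor');
  }
  return warnings;
}
```

A check like this could run in CI against the selector configuration, so a long class chain is flagged at review time rather than discovered after a frontend deployment.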
3. Ignoring the JSON-LD Layer
Explanation: Skipping structured data audits because the page renders visually. JSON-LD blocks are often embedded in the <head> or at the end of <body>. They are script elements rather than rendered markup, so they are decoupled from the page's visual layout and styling.
Fix: Always parse application/ld+json scripts first. Map your extraction targets to JSON-LD properties (offers.price, name, description). This eliminates DOM traversal entirely for the majority of targets.
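In practice a page may carry several JSON-LD scripts, and the target object may sit inside a top-level array or an `@graph` wrapper. The sketch below shows one way to handle those shapes. It operates on raw script strings that are assumed to be collected already; `findInJsonLd` and `flattenNodes` are hypothetical helper names, not part of any library.

```typescript
// Looks up a dotted path (e.g. "offers.price") across raw JSON-LD script
// contents. Handles top-level arrays and "@graph" wrappers; skips invalid JSON.
function findInJsonLd(rawBlocks: string[], path: string): string | null {
  const keys = path.split('.');
  for (const raw of rawBlocks) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // malformed block: try the next one
    }
    for (const node of flattenNodes(parsed)) {
      let current: unknown = node;
      for (const key of keys) {
        current =
          current !== null && typeof current === 'object'
            ? (current as Record<string, unknown>)[key]
            : undefined;
      }
      // Accept primitives only, so 0 and false survive but objects do not.
      if (current !== undefined && current !== null && typeof current !== 'object') {
        return String(current);
      }
    }
  }
  return null;
}

// Expands arrays and "@graph" containers into a flat list of candidate nodes.
function flattenNodes(value: unknown): unknown[] {
  if (Array.isArray(value)) return value.flatMap(flattenNodes);
  if (value !== null && typeof value === 'object') {
    const graph = (value as Record<string, unknown>)['@graph'];
    return graph === undefined ? [value] : [value, ...flattenNodes(graph)];
  }
  return [];
}
```

The first matching node wins, so when a page describes several entities that share a property name (for example `name`), filtering nodes by `@type` before the lookup would be a sensible refinement.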
4. Silent Fallbacks
Explanation: Dropping to CSS selectors without logging or alerting. The scraper continues running, but data quality degrades silently. Breakages are only discovered during downstream data validation or customer complaints.
Fix: Implement explicit fallback logging with rate limiting. Trigger alerts when fallback frequency exceeds a baseline. Treat fallback events as technical debt that requires immediate resolution.
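Fallback logging with rate limiting might look like the sketch below: every event is counted, but a log line is emitted at most once per interval for each target and tier. `RateLimitedFallbackLogger`, its constructor parameters, and the 60-second default are hypothetical; the injected `log` and `now` functions exist only to make the behavior testable.

```typescript
type LogFn = (message: string) => void;

// Hypothetical helper: logs the first fallback per target/tier immediately,
// then at most once per interval, while still counting every event.
class RateLimitedFallbackLogger {
  private readonly counts = new Map<string, number>();
  private readonly lastLogged = new Map<string, number>();
  private readonly intervalMs: number;
  private readonly log: LogFn;
  private readonly now: () => number;

  constructor(
    intervalMs: number = 60_000,
    log: LogFn = (message) => console.warn(message),
    now: () => number = () => Date.now(),
  ) {
    this.intervalMs = intervalMs;
    this.log = log;
    this.now = now;
  }

  // Returns true when a log line was emitted, false when it was suppressed.
  record(targetName: string, tier: string): boolean {
    const key = `${targetName}:${tier}`;
    const count = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, count);

    const timestamp = this.now();
    const last = this.lastLogged.get(key);
    if (last !== undefined && timestamp - last < this.intervalMs) {
      return false; // counted, but not logged again yet
    }
    this.lastLogged.set(key, timestamp);
    this.log(`[SelectorAudit] ${key} fallback, ${count} event(s) so far`);
    return true;
  }

  // Total events seen for a target/tier, including suppressed ones.
  total(targetName: string, tier: string): number {
    return this.counts.get(`${targetName}:${tier}`) ?? 0;
  }
}
```

Keeping the count separate from the log output means alert thresholds still see every event even when the log itself is throttled.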
5. Over-Engineering XPath
Explanation: Writing complex XPath expressions to navigate deeply nested structures. XPath is brittle, harder to debug, and often slower than CSS or semantic queries. It encourages fragile traversal patterns.
Fix: Prefer CSS or semantic locators. Use XPath only when navigating sibling relationships or text content that cannot be accessed via standard APIs. Keep expressions shallow and anchored to stable identifiers.
6. Assuming Static DOM States
Explanation: Writing selectors against the initial HTML response without accounting for client-side rendering, lazy loading, or dynamic content injection. Selectors fail intermittently based on network latency or script execution order.
Fix: Use explicit wait strategies (waitForSelector, waitForFunction) tied to the target signal, not arbitrary timeouts. Verify that the identified element is visible and populated before extraction.
7. Neglecting Accessibility Trees
Explanation: Overlooking the accessibility tree as a stable identification source. Frontend teams invest heavily in ARIA roles and labels to meet compliance standards. These signals are rarely changed without deliberate effort.
Fix: Query getByRole(), getByLabel(), or getByText() before falling back to CSS. These methods align with how assistive technologies parse the page, making them inherently more stable than visual selectors.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| E-commerce product pages | JSON-LD + Semantic Roles | Structured data covers 90%+ of fields; roles survive layout changes | Low maintenance, high initial audit effort |
| News/Content sites | Semantic Roles + Data Attributes | Articles use consistent heading structures and author metadata | Medium maintenance, predictable updates |
| Dynamic SPAs with heavy JS | Data Attributes + Explicit Waits | Client-side rendering obscures structured data; dev hooks remain stable | Medium maintenance, requires wait strategy tuning |
| Legacy/Static HTML | Anchored CSS + Fallback Logging | Limited semantic signals; CSS chains must be minimized and monitored | High maintenance, requires frequent audits |
| High-frequency monitoring | Structured Data + Circuit Breakers | Speed and reliability critical; fallbacks trigger pipeline pauses | Low operational cost, prevents data corruption |
Configuration Template
```typescript
// selector-config.ts
import { ExtractionConfig, SignalResolver } from './SignalResolver';

export const TARGET_CONFIGS: Record<string, ExtractionConfig> = {
  productPrice: {
    targetName: 'price',
    jsonLdPath: 'offers.price',
    // No semanticRole here: 'text' is not among the roles getByRole accepts,
    // so the semantic tier relies on the label alone.
    semanticLabel: 'price',
    dataAttribute: 'data-price-value',
    visualSelector: '.product-price',
  },
  productTitle: {
    targetName: 'title',
    jsonLdPath: 'name',
    semanticRole: 'heading',
    semanticLabel: 'product name',
    dataAttribute: 'data-product-title',
    visualSelector: '.product-title',
  },
  availability: {
    targetName: 'availability',
    jsonLdPath: 'offers.availability',
    semanticRole: 'status',
    semanticLabel: 'stock status',
    dataAttribute: 'data-stock-status',
    visualSelector: '.stock-indicator',
  },
};

// Usage in pipeline (`page` is an existing Playwright Page, inside an async context)
const resolver = new SignalResolver(page);
const price = await resolver.resolve(TARGET_CONFIGS.productPrice);
const metrics = resolver.getFallbackMetrics();
// get() returns undefined when no visual fallback has been recorded yet.
if ((metrics.get('visual') ?? 0) > 5) {
  throw new Error('Selector degradation detected: visual fallback threshold exceeded');
}
```
Quick Start Guide
- Install Dependencies: Add playwright and typescript to your project. Initialize a new scraper module.
- Map Signals: Open the target page in DevTools. Check the Accessibility tab, search for application/ld+json, and inspect elements for data-* attributes. Document findings.
- Deploy Resolver: Copy the SignalResolver class and configuration template into your codebase. Replace placeholder selectors with your audited values.
- Add Telemetry: Integrate fallback logging into your monitoring stack. Set alert thresholds at 3–5 fallback events per run.
- Validate & Iterate: Run the pipeline against staging or cached pages. Verify extraction accuracy. Monitor fallback metrics for 48 hours before promoting to production.