the required data. Implementation follows a four-step progression: schema definition, async job dispatch, resilient polling, and agent integration.
Step 1: Define the Schema
Before writing retrieval logic, establish a strict contract. Use a validation library to enforce type safety and handle missing fields gracefully.
```typescript
import { z } from "zod";

export const ProductSchema = z.object({
  identifier: z.string().uuid(),
  title: z.string().min(1),
  price: z.number().positive(),
  currency: z.enum(["USD", "EUR", "GBP"]),
  stock_status: z.enum(["in_stock", "low_stock", "out_of_stock"]),
  last_updated: z.string().datetime()
});

export type ProductData = z.infer<typeof ProductSchema>;
```
Defining the schema upfront prevents downstream parsing errors and allows the extraction service to fail fast if the source page structure changes.
Step 2: Implement Async Job Dispatch
Long-running extraction must never block the agent's reasoning cycle. Submit work asynchronously and receive a job identifier.
```typescript
interface ExtractionClientConfig {
  baseUrl: string;
  apiKey: string;
  timeoutMs: number;
}

export class DataExtractionClient {
  private config: ExtractionClientConfig;

  constructor(config: ExtractionClientConfig) {
    this.config = config;
  }

  // Submits an extraction job and returns its identifier without waiting for the result.
  async submitExtractionJob(url: string, schemaName: string): Promise<string> {
    const response = await fetch(`${this.config.baseUrl}/v1/extractions`, {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${this.config.apiKey}`,
        "Content-Type": "application/json",
        "X-Request-Timeout": String(this.config.timeoutMs)
      },
      body: JSON.stringify({
        target_url: url,
        output_schema: schemaName,
        retry_policy: "exponential"
      })
    });

    if (!response.ok) {
      throw new Error(`Submission failed: ${response.status}`);
    }

    const payload = (await response.json()) as { job_id?: unknown };
    // Guard against a success response that carries no usable job identifier.
    if (typeof payload.job_id !== "string") {
      throw new Error("Submission response did not include a job_id");
    }
    return payload.job_id;
  }
}
```
This design isolates network I/O from agent logic. The client returns immediately, allowing the orchestrator to queue additional tasks or proceed with parallel reasoning.
Step 3: Build Resilient Polling with Backoff
Polling must handle transient failures, rate limits, and terminal states without consuming excessive resources.
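```typescript
import { z } from "zod";

// JavaScript has no built-in TimeoutError constructor, so define one.
export class TimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TimeoutError";
  }
}

export class JobPoller {
  private readonly MAX_ATTEMPTS = 12;
  private readonly BASE_DELAY_MS = 1000;

  async resolveJob<T>(jobId: string, schema: z.ZodType<T>): Promise<T> {
    let attempt = 0;

    while (attempt < this.MAX_ATTEMPTS) {
      // Note: the endpoint and API key are hardcoded in this example.
      const response = await fetch(`https://api.example.com/v1/jobs/${jobId}`, {
        headers: { "Authorization": `Bearer ${process.env.API_KEY}` }
      });

      // Rate limited: wait as long as the server asks. The attempt still
      // counts, so a persistent 429 cannot loop forever.
      if (response.status === 429) {
        const retryAfter = parseInt(response.headers.get("Retry-After") || "2", 10);
        // Retry-After may also be an HTTP date, which parseInt turns into NaN.
        await this.sleep((Number.isNaN(retryAfter) ? 2 : retryAfter) * 1000);
        attempt++;
        continue;
      }

      // Transient server error: exponential backoff capped at 30 seconds.
      if (response.status >= 500) {
        const backoff = Math.min(this.BASE_DELAY_MS * Math.pow(2, attempt), 30000);
        await this.sleep(backoff);
        attempt++;
        continue;
      }

      // Any other non-2xx status is treated as a permanent failure.
      if (!response.ok) {
        throw new Error(`Polling failed: ${response.status}`);
      }

      const data = (await response.json()) as {
        status?: string;
        result?: unknown;
        error?: string;
      };

      if (data.status === "completed") {
        // Validate the payload before it reaches the agent.
        return schema.parse(data.result);
      }
      if (data.status === "failed") {
        throw new Error(`Extraction failed: ${data.error || "Unknown reason"}`);
      }

      // Job still pending: exponential backoff capped at 15 seconds.
      const delay = Math.min(this.BASE_DELAY_MS * Math.pow(2, attempt), 15000);
      await this.sleep(delay);
      attempt++;
    }

    throw new TimeoutError(`Job ${jobId} exceeded maximum polling attempts`);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
```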
The polling logic enforces exponential backoff, respects Retry-After headers, validates against the schema on completion, and caps attempts to prevent infinite loops. This behavior belongs in application code, not in natural language prompts.
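To make the attempt cap concrete, the following sketch adds up the delays the pending-job path would sleep through if a job never reached a terminal state. The helper name `worstCasePollingMs` is hypothetical and not part of the poller; the arguments mirror `MAX_ATTEMPTS`, `BASE_DELAY_MS`, and the 15000 ms cap shown above, and request round-trip time is ignored.

```typescript
// Hypothetical helper: sums the capped exponential delays slept between polls.
function worstCasePollingMs(
  maxAttempts: number,
  baseDelayMs: number,
  capMs: number
): number {
  let total = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    total += Math.min(baseDelayMs * Math.pow(2, attempt), capMs);
  }
  return total;
}

// Pending-job path: 1s + 2s + 4s + 8s, then eight sleeps at the 15s cap.
const pendingBudgetMs = worstCasePollingMs(12, 1000, 15000); // 135000
```

Under these assumptions a stuck job is abandoned after roughly 135 seconds of sleeping, a figure worth comparing against the agent's own step timeout.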
Step 4: Integrate into the Agent Workflow
Combine the client and poller into a deterministic retrieval step.
```typescript
export async function fetchProductData(url: string): Promise<ProductData> {
  const client = new DataExtractionClient({
    baseUrl: "https://extraction.internal",
    apiKey: process.env.EXTRACTION_API_KEY!,
    timeoutMs: 45000
  });
  const poller = new JobPoller();

  const jobId = await client.submitExtractionJob(url, "product_v2");
  const result = await poller.resolveJob<ProductData>(jobId, ProductSchema);
  return result;
}
```
Architecture Decisions & Rationale
- Async-first dispatch: Prevents agent timeout cascades. Synchronous HTTP calls block the event loop and force the model to wait for network conditions outside its control.
- Schema validation at ingestion: Catches structural drift immediately. Without validation, malformed data propagates into the reasoning loop, causing hallucinations or silent failures.
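- Backoff with jitter: Reduces thundering herd effects during upstream outages. Fixed delays amplify load; exponential backoff with a cap stabilizes recovery, and random jitter keeps many clients from retrying in lockstep. The poller in Step 3 implements the capped backoff only; jitter is a further refinement.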
- Separation of concerns: The extraction service handles rendering, parsing, and retries. The agent handles planning, ranking, and response generation. Mixing these responsibilities creates brittle, untestable systems.
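The jitter mentioned in the list above could be sketched as follows. This is one common variant (often called "full jitter"), not necessarily the one the extraction service uses; the function name `backoffWithJitter` and the injectable `random` parameter are assumptions made for illustration and testability.

```typescript
// Picks a uniformly random delay between 0 and the capped exponential ceiling.
// `random` defaults to Math.random and can be replaced in tests.
function backoffWithJitter(
  attempt: number,
  baseDelayMs: number,
  capMs: number,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(baseDelayMs * Math.pow(2, attempt), capMs);
  return Math.floor(random() * ceiling);
}

// Example: attempt 3 with a 1s base has an 8s ceiling, so the delay
// falls somewhere in [0, 8000) ms.
const exampleDelayMs = backoffWithJitter(3, 1000, 30000);
```

Spreading delays across the whole interval trades a slightly longer average recovery for a much flatter load curve when many pollers back off at once.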
Pitfall Guide
1. Treating networkidle as a Reliable Signal
Modern websites emit continuous telemetry, WebSocket heartbeats, and ad requests. As a result, networkidle may never trigger, or it may fire before the data of interest has loaded. Relying on it causes false timeouts or incomplete data.
Fix: Use explicit DOM readiness checks, wait for specific network responses, or switch to structured extraction that bypasses rendering entirely.
2. Polling Without Rate Limit Awareness
Aggressive polling during upstream throttling amplifies load and triggers stricter rate limits. Ignoring Retry-After headers guarantees 429 escalation.
Fix: Always parse Retry-After, implement exponential backoff, and cap maximum attempts. Log rate limit events for capacity planning.
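Parsing Retry-After is slightly subtler than a single `parseInt`, because the header may carry either a number of seconds or an HTTP date. A minimal sketch, assuming a hypothetical helper named `parseRetryAfterMs` with an injectable clock for testing:

```typescript
// Returns the wait in milliseconds, or `fallbackMs` when the header is
// missing or unparseable. `nowMs` is injectable so the date form is testable.
function parseRetryAfterMs(
  header: string | null,
  fallbackMs: number,
  nowMs: number = Date.now()
): number {
  if (header === null) return fallbackMs;
  const trimmed = header.trim();

  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return fallbackMs;

  // A date in the past means the client may retry immediately.
  return Math.max(0, dateMs - nowMs);
}
```

A poller could call this with `response.headers.get("Retry-After")` and its own default delay, then clamp the result to an upper bound it is willing to wait.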
3. Mixing I/O Retries with LLM Retries
When extraction fails, retrying the entire agent step wastes tokens and compounds latency. The model should not repeat reasoning because a network request timed out.
Fix: Isolate I/O retries in a dedicated layer. Only retry the extraction job. If it fails after N attempts, surface a structured error to the agent for fallback planning.
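One way to keep I/O retries out of the reasoning loop is a small wrapper that retries only the network operation and returns a structured outcome instead of throwing. The names `withIoRetries` and `ExtractionOutcome` are hypothetical, and the exponential delay schedule is an assumption carried over from the poller.

```typescript
// Structured outcome handed to the agent instead of a thrown exception.
type ExtractionOutcome<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; reason: string; attempts: number };

// Retries only the I/O operation. `sleep` is injectable so tests run instantly.
async function withIoRetries<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number,
  sleep: (ms: number) => Promise<void> =
    (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<ExtractionOutcome<T>> {
  let lastReason = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { ok: true, value: await operation(), attempts: attempt };
    } catch (err) {
      lastReason = err instanceof Error ? err.message : String(err);
      // Back off before the next attempt, but not after the last one.
      if (attempt < maxAttempts) {
        await sleep(baseDelayMs * Math.pow(2, attempt - 1));
      }
    }
  }
  return { ok: false, reason: lastReason, attempts: maxAttempts };
}
```

The agent then branches on `ok` and plans a fallback from `reason` and `attempts`, without re-running its own reasoning step for a failure that was purely network-level.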
4. Ignoring Schema Drift
Websites change layouts, rename attributes, or restructure JSON responses. Hardcoded selectors or loose parsing break silently.
Fix: Validate all extraction results against a strict schema. Implement versioned schemas and alert on validation failures. Treat schema mismatches as infrastructure incidents, not agent errors.
5. Over-Provisioning Browser Contexts
Launching parallel browsers without connection pooling or resource limits exhausts memory and CPU. The system becomes unstable before the agent even begins reasoning.
Fix: Use structured extraction for predictable data. Reserve browsers only for multi-step authentication, CAPTCHA resolution, or visual verification. Implement circuit breakers to halt browser launches when resource thresholds are breached.
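A circuit breaker for browser launches could look like the sketch below. It is a simplification: the class name `BrowserLaunchBreaker` is hypothetical, and it uses a concurrency limit plus consecutive launch failures as the trip condition, standing in for the memory and CPU thresholds a real deployment would measure.

```typescript
// Refuses new browser launches when the concurrency limit is reached or
// while the breaker is open after repeated launch failures.
class BrowserLaunchBreaker {
  private active = 0;
  private consecutiveFailures = 0;
  private openUntilMs = 0;
  private readonly maxConcurrent: number;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    maxConcurrent: number,
    failureThreshold: number,
    cooldownMs: number,
    now: () => number = Date.now
  ) {
    this.maxConcurrent = maxConcurrent;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns true and reserves a slot if a launch is currently allowed.
  tryAcquire(): boolean {
    if (this.now() < this.openUntilMs) return false; // breaker is open
    if (this.active >= this.maxConcurrent) return false; // at capacity
    this.active++;
    return true;
  }

  // Releases the slot; consecutive failures trip the breaker for `cooldownMs`.
  release(succeeded: boolean): void {
    this.active = Math.max(0, this.active - 1);
    if (succeeded) {
      this.consecutiveFailures = 0;
      return;
    }
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openUntilMs = this.now() + this.cooldownMs;
      this.consecutiveFailures = 0;
    }
  }
}
```

Callers check `tryAcquire()` before launching a browser and call `release()` when the session ends; a `false` result is the signal to fall back to structured extraction or to queue the work.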
6. Synchronous Blocking in Async Runtimes
Awaiting a long extraction call inline stalls the agent step that issued it. When every step waits serially on extraction, throughput degrades and latency increases across the entire agent pipeline.
Fix: Always dispatch extraction as async jobs. Use Promise.all for parallel submissions, or Promise.allSettled when one failed job should not reject the whole batch, so that each result resolves independently. Keep the reasoning loop non-blocking.
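Resolving parallel jobs independently can be sketched with the standard `Promise.allSettled`, which waits for every promise and reports each outcome separately. The helper name `settleAll` and its return shape are assumptions for illustration.

```typescript
// Runs every task to completion and separates successes from failures,
// so one rejected extraction does not discard the others.
async function settleAll<T>(
  tasks: Array<() => Promise<T>>
): Promise<{ values: T[]; errors: string[] }> {
  const settled = await Promise.allSettled(tasks.map((task) => task()));
  const values: T[] = [];
  const errors: string[] = [];
  for (const result of settled) {
    if (result.status === "fulfilled") {
      values.push(result.value);
    } else {
      const reason: unknown = result.reason;
      errors.push(reason instanceof Error ? reason.message : String(reason));
    }
  }
  return { values, errors };
}
```

With a retrieval function such as `fetchProductData`, the call might be `settleAll(urls.map((u) => () => fetchProductData(u)))`, after which the agent can proceed with the values that arrived and plan around the errors.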
7. Operating Without Observability
Without metrics on success rates, latency percentiles, and failure reasons, teams cannot distinguish between agent reasoning failures and infrastructure bottlenecks.
Fix: Instrument extraction clients with OpenTelemetry. Track job submission latency, polling duration, schema validation failures, and upstream error codes. Correlate metrics with agent step outcomes.
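The quantities named above can be illustrated with a small in-memory recorder. This is a dependency-free sketch only: the class name `ExtractionStats` is hypothetical, and in production the same measurements would be emitted through an OpenTelemetry meter rather than held in arrays. Percentiles use the nearest-rank method.

```typescript
// In-memory stand-in for a metrics backend.
class ExtractionStats {
  private latenciesMs: number[] = [];
  private failureCounts = new Map<string, number>();

  recordSuccess(latencyMs: number): void {
    this.latenciesMs.push(latencyMs);
  }

  recordFailure(reason: string): void {
    this.failureCounts.set(reason, (this.failureCounts.get(reason) ?? 0) + 1);
  }

  // Fraction of recorded jobs that succeeded; NaN when nothing was recorded.
  successRate(): number {
    let failures = 0;
    for (const count of this.failureCounts.values()) failures += count;
    const total = this.latenciesMs.length + failures;
    return total === 0 ? NaN : this.latenciesMs.length / total;
  }

  // Nearest-rank percentile of successful job latencies (p in 0..100).
  percentile(p: number): number {
    if (this.latenciesMs.length === 0) return NaN;
    const sorted = [...this.latenciesMs].sort((a, b) => a - b);
    const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
    return sorted[rank - 1];
  }

  failuresFor(reason: string): number {
    return this.failureCounts.get(reason) ?? 0;
  }
}
```

Keeping failure reasons as labeled counts is what lets a dashboard separate schema validation failures from upstream error codes.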
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Public catalog data with predictable structure | Structured JSON Extraction | Bypasses rendering, scales horizontally, low latency | 60–80% reduction vs browsers |
| Multi-step OAuth or SSO login flow | Browser Automation | Requires session state, redirect handling, and user simulation | High infrastructure cost, unavoidable |
| Real-time inventory with official API | Direct Platform API | Lowest latency, highest reliability, native rate limits | Minimal compute, predictable pricing |
| Visual verification or layout-dependent data | Browser Automation + Screenshot | DOM structure is irrelevant; pixel/visual confirmation required | High memory, low concurrency |
| High-volume scraping with frequent layout changes | Structured Extraction + Schema Versioning | Decouples parsing from rendering, enables fast fallbacks | Moderate cost, high maintainability |
Configuration Template
```typescript
// extraction.config.ts
export const extractionConfig = {
  api: {
    baseUrl: process.env.EXTRACTION_API_URL || "https://extraction.internal",
    apiKey: process.env.EXTRACTION_API_KEY,
    defaultTimeoutMs: 45000,
    maxRetries: 3
  },
  polling: {
    maxAttempts: 12,
    baseDelayMs: 1000,
    maxDelayMs: 30000,
    respectRetryAfter: true
  },
  fallback: {
    enableBrowserFallback: true,
    browserTimeoutMs: 30000,
    maxConcurrentBrowsers: 10
  },
  observability: {
    enableMetrics: true,
    metricPrefix: "agent.extraction",
    logLevel: "info"
  }
};
```
Quick Start Guide
- Install dependencies: `npm install zod @opentelemetry/api`
- Define your schema: Create a Zod object matching the exact fields your agent requires. Validate against sample payloads before deployment.
- Initialize the client: Configure `DataExtractionClient` with your API endpoint and credentials. Set timeouts that align with your agent's SLA.
- Dispatch and poll: Submit the extraction job, capture the `job_id`, and run `JobPoller.resolveJob()` with your schema. Handle validation errors separately from network errors.
- Integrate into the agent: Replace synchronous browser calls with the async retrieval function. Add circuit breakers and metrics to monitor extraction health independently from reasoning performance.