leakage. Model routing is configured to default to cost-optimized providers (DeepSeek V4 Flash, GPT-4o mini) unless visual validation is explicitly required.
3. Execution Layer: Playwright handles state mutations. The orchestrator maps LLM output to concrete locator strategies (page.getByRole, page.locator). This keeps the AI from generating raw CSS selectors, which are notoriously brittle.
4. Feedback Loop: After each action, we capture a state diff (URL change, DOM mutation, network response) to verify progress. This prevents infinite loops and enables graceful fallbacks.
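The feedback loop in step 4 can be sketched as a pure comparison between two state records. This is a minimal sketch under stated assumptions, not the orchestrator's actual implementation: `StepState`, `diffState`, and `shouldFallback` are hypothetical names, and the DOM fingerprint is assumed to be any string that changes when the set of interactive elements changes.

```typescript
// Hypothetical record of observable page state captured after an action.
interface StepState {
  url: string;
  domFingerprint: string; // assumption: e.g. the joined element list from the snapshot
  lastResponseStatus: number | null; // null when no network response was observed
}

interface StateDiff {
  urlChanged: boolean;
  domChanged: boolean;
  networkOk: boolean;
  progressed: boolean;
}

// Compare the state before and after an action.
function diffState(before: StepState, after: StepState): StateDiff {
  const urlChanged = before.url !== after.url;
  const domChanged = before.domFingerprint !== after.domFingerprint;
  const networkOk = after.lastResponseStatus === null || after.lastResponseStatus < 400;
  return { urlChanged, domChanged, networkOk, progressed: (urlChanged || domChanged) && networkOk };
}

// True when the last `maxStalled` steps all failed to make progress,
// so the caller can fall back instead of looping.
function shouldFallback(diffs: StateDiff[], maxStalled = 3): boolean {
  const recent = diffs.slice(-maxStalled);
  return recent.length === maxStalled && recent.every(d => !d.progressed);
}
```

Keeping the diff as plain data means the loop-detection rule can be unit-tested without a browser.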
Implementation
import { Page } from 'playwright';
import OpenAI from 'openai';

interface InteractionSnapshot {
  url: string;
  title: string;
  elements: string[];
}

interface AgentAction {
  type: 'click' | 'fill' | 'select' | 'wait' | 'complete';
  targetIndex: number;
  value?: string;
  reasoning: string;
}

class StructuralTestOrchestrator {
  private client: OpenAI;
  private model: string;
  private maxContextTokens = 4000;
  private conversationHistory: { role: 'user' | 'assistant'; content: string }[] = [];

  // baseURL and model default to DeepSeek; pass other OpenAI-compatible values to switch providers
  constructor(apiKey: string, baseURL = 'https://api.deepseek.com/v1', model = 'deepseek-chat') {
    this.client = new OpenAI({ apiKey, baseURL });
    this.model = model;
  }

  private async extractSnapshot(page: Page): Promise<InteractionSnapshot> {
    return page.evaluate(() => {
      const selectors = 'button, a, input, select, textarea, [role="button"], [tabindex]';
      // Clear indices left over from the previous snapshot so each data-idx stays unique
      document.querySelectorAll('[data-idx]').forEach(el => el.removeAttribute('data-idx'));
      const nodes = Array.from(document.querySelectorAll<HTMLElement>(selectors));
      return {
        url: window.location.href,
        title: document.title,
        elements: nodes
          // Keep only rendered, enabled elements
          .filter(el => el.offsetParent !== null && !el.hasAttribute('disabled'))
          .map((el, idx) => {
            // Tag the element so executeTask can resolve the index with a locator
            el.setAttribute('data-idx', String(idx));
            const tag = el.tagName.toLowerCase();
            const text = el.textContent?.trim().slice(0, 40) || '';
            const placeholder = (el as HTMLInputElement).placeholder || '';
            const name = el.getAttribute('name') || el.getAttribute('aria-label') || '';
            return `[${idx}] <${tag}>${text ? ` "${text}"` : ''}${placeholder ? ` placeholder="${placeholder}"` : ''}${name ? ` name="${name}"` : ''}`;
          })
      };
    });
  }

  private async decideNextStep(snapshot: InteractionSnapshot, task: string): Promise<AgentAction> {
    // Rough budget guard: assume ~4 characters per token and drop elements that do not fit
    const charBudget = this.maxContextTokens * 4;
    const elements: string[] = [];
    let used = 0;
    for (const line of snapshot.elements) {
      if (used + line.length > charBudget) break;
      elements.push(line);
      used += line.length + 1;
    }
    const contextPrompt = `
Current URL: ${snapshot.url}
Page Title: ${snapshot.title}
Available Elements:
${elements.join('\n')}
Task: ${task}
Respond with a JSON object containing: type, targetIndex, value (if applicable), reasoning.
`;
    this.conversationHistory.push({ role: 'user', content: contextPrompt });
    // Trim history to stay within token limits; an odd window keeps a user message first
    if (this.conversationHistory.length > 6) {
      this.conversationHistory = this.conversationHistory.slice(-5);
    }
    const response = await this.client.chat.completions.create({
      model: this.model,
      messages: this.conversationHistory,
      temperature: 0.1,
      response_format: { type: 'json_object' },
    });
    const content = response.choices[0].message.content || '{}';
    const action: AgentAction = JSON.parse(content);
    this.conversationHistory.push({ role: 'assistant', content });
    return action;
  }

  async executeTask(page: Page, task: string, maxSteps: number = 20): Promise<void> {
    for (let step = 0; step < maxSteps; step++) {
      const snapshot = await this.extractSnapshot(page);
      const action = await this.decideNextStep(snapshot, task);
      if (action.type === 'complete') {
        console.log(`✅ Task completed: ${action.reasoning}`);
        return;
      }
      if (action.type === 'wait') {
        // A wait needs no target element
        await page.waitForTimeout(1000);
      } else {
        if (!snapshot.elements[action.targetIndex]) {
          throw new Error(`Invalid target index: ${action.targetIndex}`);
        }
        // Resolve the index through the data-idx attribute set in extractSnapshot
        const target = page.locator(`[data-idx="${action.targetIndex}"]`);
        switch (action.type) {
          case 'click':
            await target.click();
            break;
          case 'fill':
            await target.fill(action.value || '');
            break;
          case 'select':
            await target.selectOption({ label: action.value ?? '' });
            break;
        }
      }
      await page.waitForLoadState('networkidle');
    }
    throw new Error('Max steps exceeded. Task may require visual validation or manual intervention.');
  }
}
Why This Architecture Works
- Deterministic Mapping: By assigning sequential indices to filtered elements, we eliminate CSS selector fragility. The LLM references indices, not selectors, making the pipeline resilient to DOM restructuring.
- Token Budgeting: The maxContextTokens guard and history slicing prevent runaway costs. LLMs don't need full conversation history to make stateless UI decisions; they only need the current snapshot and task context.
- Model Agnosticism: The orchestrator accepts any OpenAI-compatible endpoint. Swapping to gpt-4o-mini or gemini-2.0-flash requires only a configuration change, not a code rewrite.
- Graceful Degradation: When the agent hits maxSteps, it fails fast rather than looping indefinitely. This signals that the task likely requires visual assertions or hybrid mobile handling, prompting a strategic pivot instead of silent budget drain.
Pitfall Guide
1. Vision Overload for Structural Tasks
Explanation: Teams default to screenshot-based models for form filling, navigation, and CRUD operations. This forces the LLM to perform OCR and layout analysis on data that already exists as structured text.
Fix: Enforce a DOM-first policy. Only route to vision models when the task explicitly requires visual regression, CAPTCHA resolution, or canvas/SVG interaction.
2. Context Window Bloat
Explanation: Dumping raw HTML or maintaining unbounded conversation history inflates token usage. LLMs waste compute parsing irrelevant markup, and costs scale linearly with page complexity.
Fix: Filter extraction to interactive elements only. Strip CSS, scripts, and invisible nodes. Implement a sliding window for conversation history and reset state after task completion.
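The two halves of this fix, capping what goes into the prompt and sliding the history window, can be sketched as small pure functions. The names `slideWindow` and `capElements` are hypothetical, and the 4-characters-per-token ratio is a rough assumption, not a tokenizer.

```typescript
interface ChatTurn {
  role: 'user' | 'assistant';
  content: string;
}

// Keep only the most recent turns and make sure the window starts with a user turn.
function slideWindow(history: ChatTurn[], maxEntries: number): ChatTurn[] {
  const recent = history.slice(-maxEntries);
  while (recent.length > 0 && recent[0]?.role !== 'user') {
    recent.shift();
  }
  return recent;
}

// Cap the element list by count and by an approximate character budget.
function capElements(lines: string[], maxElements: number, maxTokens: number): string[] {
  const charBudget = maxTokens * 4; // assumption: ~4 characters per token
  const kept: string[] = [];
  let used = 0;
  for (const line of lines.slice(0, maxElements)) {
    if (used + line.length > charBudget) break;
    kept.push(line);
    used += line.length + 1; // +1 for the newline that joins lines in the prompt
  }
  return kept;
}
```

Dropping a leading assistant turn after slicing avoids sending the model a reply without the prompt that produced it.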
3. Silent Input Failures in Hybrid Apps
Explanation: Hybrid apps (React Native WebView, Flutter, Uni-app) often ignore set_text() because it bypasses the Input Method Editor (IME). Tests appear to succeed but leave fields empty.
Fix: Use IME-aware input methods like send_keys() or platform-specific text injection. Always verify input state via get_attribute('value') after typing.
4. Token Leakage in Retry Loops
Explanation: When a test fails and retries, the conversation history accumulates duplicate prompts. This causes exponential token growth and unpredictable model behavior.
Fix: Clear or snapshot the conversation history before each retry. Use deterministic step IDs to track state without relying on LLM memory.
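One way to keep retries from accumulating duplicate prompts is to checkpoint the history length under a step ID and roll back to it before each new attempt. The sketch below assumes this approach; `RetryableHistory` and `withRetries` are hypothetical helpers, not part of the orchestrator shown earlier.

```typescript
interface ChatTurn {
  role: 'user' | 'assistant';
  content: string;
}

class RetryableHistory {
  private turns: ChatTurn[] = [];
  private checkpoints = new Map<string, number>();

  push(turn: ChatTurn): void {
    this.turns.push(turn);
  }

  // Remember how long the history was when the step started.
  checkpoint(stepId: string): void {
    this.checkpoints.set(stepId, this.turns.length);
  }

  // Discard everything added since the checkpoint for this step.
  rollback(stepId: string): void {
    const length = this.checkpoints.get(stepId);
    if (length === undefined) throw new Error(`No checkpoint for step ${stepId}`);
    this.turns.length = length;
  }

  get size(): number {
    return this.turns.length;
  }
}

// Run a step up to `attempts` times, restoring the history after each failure.
async function withRetries<T>(
  history: RetryableHistory,
  stepId: string,
  attempts: number,
  run: () => Promise<T>,
): Promise<T> {
  history.checkpoint(stepId);
  let lastError: unknown = new Error('attempts must be at least 1');
  for (let i = 0; i < attempts; i++) {
    try {
      return await run();
    } catch (err) {
      lastError = err;
      history.rollback(stepId);
    }
  }
  throw lastError;
}
```

Because the checkpoint is keyed by a deterministic step ID, the retry state lives in the orchestrator rather than in the model's memory of earlier attempts.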
5. Flaky Test Accumulation
Explanation: Teams keep automated tests that rarely catch bugs but require constant maintenance. The 80/20 rule applies: 20% of tests catch 80% of defects. The rest drain budget.
Fix: Implement quarterly ROI audits. Remove tests that haven't triggered failures in 90 days. Prioritize happy paths and critical business flows over edge-case automation.
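The 90-day rule can be expressed as a filter over test metadata. This is a sketch under assumptions: `TestRecord` and `findRemovalCandidates` are hypothetical, the `critical` flag stands in for "happy paths and critical business flows", and a test with no recorded failure is treated as a candidate, which may be too aggressive for newly added tests.

```typescript
interface TestRecord {
  name: string;
  lastFailureAt: Date | null; // null when the test has never failed
  critical: boolean; // assumption: marks flows that are kept regardless of failure history
}

// Return the names of non-critical tests with no failure inside the audit window.
function findRemovalCandidates(tests: TestRecord[], now: Date, windowDays = 90): string[] {
  const cutoff = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  return tests
    .filter(t => !t.critical)
    .filter(t => t.lastFailureAt === null || t.lastFailureAt.getTime() < cutoff)
    .map(t => t.name);
}
```

The output is a list of candidates for review, not an automatic deletion list.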
6. GPU Self-Hosting Illusion
Explanation: Running local LLMs seems cost-free until you factor in cloud instance pricing (~$3,000/mo for A100), hardware depreciation, and engineering maintenance. It only breaks even above 100k steps/month.
Fix: Use API-based models for development and moderate CI workloads. Reserve self-hosting for high-volume production pipelines where data privacy or latency requirements justify the infrastructure overhead.
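The break-even point follows from dividing the fixed monthly cost by the API cost per step. The sketch below uses the ~$3,000/mo instance figure from the explanation and the per-step prices from the decision matrix; `breakEvenSteps` is a hypothetical helper, and it ignores hardware depreciation and engineering time, so the real threshold is higher than what it returns.

```typescript
// Steps per month at which a fixed monthly cost equals pay-per-step API spend.
function breakEvenSteps(monthlyFixedCostUSD: number, apiCostPerStepUSD: number): number {
  if (apiCostPerStepUSD <= 0) {
    throw new RangeError('apiCostPerStepUSD must be positive');
  }
  return Math.round(monthlyFixedCostUSD / apiCostPerStepUSD);
}

// Vision-model pricing at about $0.01 per step: roughly 300,000 steps per month.
const visionBreakEven = breakEvenSteps(3000, 0.01);

// Structural pricing at about $0.00035 per step: several million steps per month.
const structuralBreakEven = breakEvenSteps(3000, 0.00035);

console.log({ visionBreakEven, structuralBreakEven });
```

The threshold depends heavily on which model the self-hosted GPU would replace: at text-model prices, self-hosting needs far more volume to pay off than at vision-model prices.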
7. Ignoring Network State in Assertions
Explanation: AI agents often verify UI changes without checking underlying network responses. A button may appear disabled, but the API call could still be pending or failing silently.
Fix: Pair DOM snapshots with network interception. Assert on API response codes and payload structure before proceeding to the next step.
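A sketch of pairing an action with its network response follows. To stay self-contained it declares minimal `ResponseLike` and `PageLike` interfaces that mirror the Playwright methods `Response.status()`, `Response.url()`, and `Page.waitForResponse()`; these interfaces and the helper names are assumptions for illustration, and compatibility with a real `Page` object should be verified before use.

```typescript
// Minimal structural types mirroring the Playwright methods used below.
interface ResponseLike {
  status(): number;
  url(): string;
}

interface PageLike {
  waitForResponse(
    predicate: (response: ResponseLike) => boolean,
    options?: { timeout?: number },
  ): Promise<ResponseLike>;
}

// Throw when the API call failed; otherwise return the status code.
function checkStatus(status: number, url: string): number {
  if (status >= 400) {
    throw new Error(`API call to ${url} failed with status ${status}`);
  }
  return status;
}

// Run a UI action and assert on the matching network response before continuing.
async function actAndAssertResponse(
  page: PageLike,
  urlPart: string,
  action: () => Promise<void>,
  timeoutMs = 5000,
): Promise<number> {
  // Start waiting before the action so a fast response is not missed.
  const [response] = await Promise.all([
    page.waitForResponse(r => r.url().includes(urlPart), { timeout: timeoutMs }),
    action(),
  ]);
  return checkStatus(response.status(), response.url());
}
```

Payload checks can be layered on the same pattern by inspecting the response body after the status check passes.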
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Standard web CRUD & navigation | DOM extraction + text LLM | Structured data eliminates perception overhead | ~$0.00015–$0.00035/step |
| Visual layout regression | Vision model (Qwen-VL / GPT-4o) | Pixel-level comparison required for CSS/rendering bugs | ~$0.008–$0.015/step |
| High-volume CI pipeline (>100k steps/mo) | Self-hosted GPU or reserved API capacity | Volume discounts and infrastructure amortization lower unit cost | Break-even at ~300k steps/mo |
| Hybrid mobile apps (WebView/Flutter) | uiautomator2 + IME-aware input | Native view hierarchy + proper text injection bypasses rendering quirks | $0 API cost + local compute |
| Budget-constrained solo testing | DOM + DeepSeek V4 Flash / GPT-4o mini | Lowest per-step pricing with sufficient reasoning capability | $5–$15/month total |
Configuration Template
// test-agent.config.ts
export const TestAgentConfig = {
  models: {
    structural: {
      provider: 'deepseek',
      endpoint: 'https://api.deepseek.com/v1',
      model: 'deepseek-chat',
      maxTokens: 2000,
      temperature: 0.1,
    },
    visual: {
      provider: 'openai',
      endpoint: 'https://api.openai.com/v1',
      model: 'gpt-4o',
      maxTokens: 4000,
      temperature: 0.2,
    },
  },
  extraction: {
    maxElements: 50,
    stripInvisible: true,
    includeAttributes: ['name', 'aria-label', 'placeholder', 'type'],
  },
  limits: {
    maxStepsPerTask: 25,
    maxHistoryEntries: 6,
    monthlyBudgetUSD: 20,
    alertThresholdUSD: 15,
  },
  routing: {
    useVisionFor: ['visual_regression', 'captcha', 'canvas_svg'],
    defaultToStructural: true,
  },
};
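The routing section of this template could be consumed by a small selector like the one below. This is a sketch, not the article's implementation: `selectModel` and the `taskKind` labels are hypothetical, and the inline config repeats only the fields the selector reads.

```typescript
interface ModelSettings {
  provider: string;
  endpoint: string;
  model: string;
  maxTokens: number;
  temperature: number;
}

interface RoutingConfig {
  models: { structural: ModelSettings; visual: ModelSettings };
  routing: { useVisionFor: string[]; defaultToStructural: boolean };
}

// Trimmed copy of the template above, limited to what routing needs.
const sampleConfig: RoutingConfig = {
  models: {
    structural: {
      provider: 'deepseek',
      endpoint: 'https://api.deepseek.com/v1',
      model: 'deepseek-chat',
      maxTokens: 2000,
      temperature: 0.1,
    },
    visual: {
      provider: 'openai',
      endpoint: 'https://api.openai.com/v1',
      model: 'gpt-4o',
      maxTokens: 4000,
      temperature: 0.2,
    },
  },
  routing: {
    useVisionFor: ['visual_regression', 'captcha', 'canvas_svg'],
    defaultToStructural: true,
  },
};

// Pick the vision model only for task kinds listed in useVisionFor.
function selectModel(config: RoutingConfig, taskKind: string): ModelSettings {
  if (config.routing.useVisionFor.includes(taskKind)) {
    return config.models.visual;
  }
  return config.routing.defaultToStructural ? config.models.structural : config.models.visual;
}
```

For example, `selectModel(sampleConfig, 'captcha')` returns the visual settings, while an unlisted kind such as `'form_fill'` falls through to the structural model.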
Quick Start Guide
- Initialize the orchestrator: Install playwright and openai. Create a .env file with your API key and import the StructuralTestOrchestrator class.
- Define a structural task: Write a plain-English task description (e.g., "Navigate to login, enter credentials, submit form"). The agent will parse the DOM, select elements by index, and execute actions sequentially.
- Run a dry test: Execute the task against a staging environment. Monitor the console output for step-by-step reasoning and verify that network requests complete successfully.
- Validate cost & latency: Check the API dashboard for token consumption. A 10-step structural task should consume <2,000 input tokens and complete in under 5 seconds. Adjust maxContextTokens if latency spikes.
- Integrate into CI: Add the test script to your GitHub Actions or Jenkins pipeline. Configure the monthly budget alert to trigger a workflow pause if spend exceeds the defined threshold.
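The budget alert in the last step can be reduced to a threshold check that a CI job runs before each test batch. The sketch below assumes the `limits` fields from the configuration template; `checkBudget`, `estimateCostUSD`, and the per-million-token prices passed to it are hypothetical, and how the current spend is obtained is left to the caller.

```typescript
type BudgetStatus = 'ok' | 'alert' | 'pause';

interface BudgetLimits {
  monthlyBudgetUSD: number;
  alertThresholdUSD: number;
}

// Estimate spend from token counts; prices are per million tokens and supplied by the caller.
function estimateCostUSD(
  inputTokens: number,
  outputTokens: number,
  inputPricePerMTok: number,
  outputPricePerMTok: number,
): number {
  return (inputTokens * inputPricePerMTok + outputTokens * outputPricePerMTok) / 1_000_000;
}

// Map the month-to-date spend onto the configured thresholds.
function checkBudget(spentUSD: number, limits: BudgetLimits): BudgetStatus {
  if (spentUSD >= limits.monthlyBudgetUSD) return 'pause';
  if (spentUSD >= limits.alertThresholdUSD) return 'alert';
  return 'ok';
}
```

A pipeline step could exit with a non-zero code when `checkBudget` returns `'pause'`, which stops the workflow without any provider-specific integration.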