This shim implements the most frequently used DOM APIs: element creation, attribute manipulation, query selection, event registration, and storage interfaces. External scripts are fetched synchronously to preserve document-order execution, and XMLHttpRequest calls are routed through a synchronous HTTP client to support runtime template loading.
The JavaScript engine itself is embedded directly into the host application. QuickJS is a common choice due to its ES2023 compliance, small footprint, and clean Rust bindings. For environments lacking a C toolchain, a pure-Rust alternative like boa_engine provides a fallback, albeit with a narrower compatibility surface. The engine evaluates scripts in a sandboxed context, exposing only the stubbed browser globals.
Phase 3: DOM Serialization
After all scripts complete, the mutated DOM tree is traversed and serialized back into an HTML string. This output can be written to standard output, saved to a file, or piped into downstream processing stages. The serialization step must preserve attribute order, emit void elements such as br and img without closing tags, and maintain the exact structure produced by the JavaScript execution phase.
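As a rough illustration of the serialization step, the following sketch walks a small tree and emits HTML. The `Node` type, the `escape` helper, and the `serialize` function are hypothetical names introduced here, not part of the pipeline described above; a real implementation would use the tree type of its HTML parser. Raw-text elements such as script and style, which must not be escaped, are deliberately left out to keep the example short.

```rust
/// Hypothetical minimal DOM node; a real pipeline would use its parser's tree type.
enum Node {
    Element {
        tag: String,
        // A Vec (rather than a HashMap) keeps attributes in insertion order.
        attrs: Vec<(String, String)>,
        children: Vec<Node>,
    },
    Text(String),
}

/// Void elements never take a closing tag in HTML.
const VOID_ELEMENTS: &[&str] = &[
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
];

/// Escapes characters that would otherwise be parsed as markup.
fn escape(input: &str, in_attribute: bool) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' if in_attribute => out.push_str("&quot;"),
            other => out.push(other),
        }
    }
    out
}

/// Appends the HTML for `node` and all of its descendants to `out`.
fn serialize(node: &Node, out: &mut String) {
    match node {
        Node::Text(text) => out.push_str(&escape(text, false)),
        Node::Element { tag, attrs, children } => {
            out.push('<');
            out.push_str(tag);
            for (name, value) in attrs {
                out.push(' ');
                out.push_str(name);
                out.push_str("=\"");
                out.push_str(&escape(value, true));
                out.push('"');
            }
            out.push('>');
            // Void elements have no children and no closing tag.
            if VOID_ELEMENTS.contains(&tag.as_str()) {
                return;
            }
            for child in children {
                serialize(child, out);
            }
            out.push_str("</");
            out.push_str(tag);
            out.push('>');
        }
    }
}
```

Storing attributes in a `Vec` is what keeps their order stable here; a hash map would reorder them between runs.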
Implementation Architecture (Rust-Based Pipeline)
Below is a conceptual implementation demonstrating the pipeline architecture. The structure emphasizes separation of concerns, explicit error boundaries, and deterministic execution order.
    // JsExecutionContext, ExtractionError, html5_parser, and sync_network_fetch
    // are placeholders assumed to be defined elsewhere in the project.
    use std::fs;

    /// Core pipeline orchestrator
    struct DynamicContentExtractor {
        source_html: String,
        runtime: JsExecutionContext,
        environment_stub: String,
    }

    impl DynamicContentExtractor {
        fn from_input(raw_input: &str) -> Self {
            // Parse HTML into a mutable tree structure.
            // The tree is discarded in this sketch; a full implementation would store it.
            let _parsed_tree = html5_parser::parse_document(raw_input);
            Self {
                source_html: raw_input.to_string(),
                runtime: JsExecutionContext::new(),
                // A missing shim file falls back to an empty stub
                environment_stub: fs::read_to_string("browser_shim.js")
                    .unwrap_or_default(),
            }
        }

        fn run_extraction(&mut self) -> Result<String, ExtractionError> {
            // 1. Inject environment shim before page scripts
            self.runtime.evaluate_script(&self.environment_stub)?;

            // 2. Collect and execute scripts in document order
            let script_nodes = self.collect_ordered_scripts();
            for node in script_nodes {
                match node {
                    ScriptNode::Inline(content) => {
                        self.runtime.evaluate_script(&content)?;
                    }
                    ScriptNode::External(url) => {
                        let fetched_code = sync_network_fetch(&url)?;
                        self.runtime.evaluate_script(&fetched_code)?;
                    }
                }
            }

            // 3. Serialize mutated DOM back to HTML
            let rendered_output = self.serialize_dom_tree();
            Ok(rendered_output)
        }

        fn collect_ordered_scripts(&self) -> Vec<ScriptNode> {
            // DOM traversal logic to extract <script> elements
            // Returns inline content or external src URLs
            vec![]
        }

        fn serialize_dom_tree(&self) -> String {
            // Traverse DOM nodes and reconstruct HTML string
            String::new()
        }
    }

    enum ScriptNode {
        Inline(String),
        External(String),
    }
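The collect_ordered_scripts stub above returns an empty list. One way it could be filled in is a depth-first, pre-order traversal, which visits elements in document order. The sketch below assumes a hypothetical `Element` type with `tag`, `src`, `text`, and `children` fields; the real parser's node type will look different.

```rust
/// Mirrors the ScriptNode enum from the pipeline sketch.
#[derive(Debug, PartialEq)]
enum ScriptNode {
    Inline(String),
    External(String),
}

/// Hypothetical stand-in for the parser's element type.
struct Element {
    tag: String,
    src: Option<String>, // value of the `src` attribute, if present
    text: String,        // text content of the element
    children: Vec<Element>,
}

/// Collects scripts with a depth-first, pre-order walk (document order).
fn collect_ordered_scripts(root: &Element) -> Vec<ScriptNode> {
    let mut scripts = Vec::new();
    let mut stack = vec![root];
    while let Some(element) = stack.pop() {
        if element.tag.eq_ignore_ascii_case("script") {
            match &element.src {
                // An external script is identified by its src URL.
                Some(url) => scripts.push(ScriptNode::External(url.clone())),
                // Inline scripts with only whitespace are skipped.
                None if !element.text.trim().is_empty() => {
                    scripts.push(ScriptNode::Inline(element.text.clone()))
                }
                None => {}
            }
            continue;
        }
        // Push children in reverse so the first child is popped first.
        stack.extend(element.children.iter().rev());
    }
    scripts
}
```

An explicit stack avoids deep recursion on heavily nested documents; a recursive version would produce the same order.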
Architecture Decisions & Rationale
- Synchronous Script Fetching: Maintains document order execution, which is critical for frameworks that rely on sequential initialization. Asynchronous loading would require a complex microtask queue simulation and could break dependency chains.
- Shim Injection Order: The environment stub runs first to ensure global objects exist before page scripts execute. This prevents ReferenceError failures during framework bootstrapping and guarantees consistent global state.
- Layout API Fallbacks: Methods like getBoundingClientRect or offsetWidth return zero. This is intentional. Returning null or throwing errors would crash frameworks that perform feature detection. Zero values allow execution to continue while accurately reflecting the absence of a rendering engine.
- Engine Selection Strategy: QuickJS provides near-complete ES2023 support with minimal overhead. The pure-Rust fallback ensures deployment flexibility in constrained environments, trading some compatibility for zero native dependencies. Runtime isolation prevents state leakage between sequential extraction tasks.
Pitfall Guide
1. Assuming Layout Metrics Are Available
Explanation: Frameworks often use offsetHeight, getComputedStyle, or window.innerWidth for responsive logic and virtual scrolling. Without a CSS engine, these return zero or empty strings, causing miscalculated dimensions.
Fix: Mock critical layout values in the shim if targeting specific applications, or accept that layout-dependent features will degrade gracefully. Never rely on pixel-perfect measurements in a headless execution context.
2. Expecting Native ES Module Support
Explanation: The execution environment typically lacks a module resolution system, import.meta, and network-based import() handling. Dynamic imports will fail silently or throw syntax errors.
Fix: Pre-bundle dependencies or convert ES modules to IIFE/UMD formats before injection. Avoid dynamic import() calls in target pages. If module loading is unavoidable, implement a custom resolver that maps module specifiers to pre-fetched strings.
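A custom resolver of this kind can be as small as a lookup table. The sketch below is illustrative only: `PrefetchedModules`, `register`, `resolve`, and `normalize` are hypothetical names, and how the table is wired into the engine's module-loading hook depends on the engine bindings in use.

```rust
use std::collections::HashMap;

/// Maps module specifiers to source text fetched ahead of time.
struct PrefetchedModules {
    sources: HashMap<String, String>,
}

impl PrefetchedModules {
    fn new() -> Self {
        Self { sources: HashMap::new() }
    }

    /// Stores the source for a specifier before any page script runs.
    fn register(&mut self, specifier: &str, source: &str) {
        self.sources.insert(normalize(specifier), source.to_string());
    }

    /// Returns the pre-fetched source, or an error naming the missing module.
    fn resolve(&self, specifier: &str) -> Result<&str, String> {
        self.sources
            .get(&normalize(specifier))
            .map(String::as_str)
            .ok_or_else(|| format!("module not pre-fetched: {specifier}"))
    }
}

/// Strips a leading "./" so "./app.js" and "app.js" name the same module.
fn normalize(specifier: &str) -> String {
    specifier.strip_prefix("./").unwrap_or(specifier).to_string()
}
```

Returning an explicit error for unknown specifiers turns a silent import failure into a diagnosable one.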
3. Ignoring Global State Contamination
Explanation: Running multiple extractions in the same process without resetting the JS context leads to leaked variables, cached modules, and unpredictable behavior. Frameworks like React or Vue may retain internal state from previous runs.
Fix: Instantiate a fresh JavaScript runtime for each extraction task, or explicitly clear global properties, event listeners, and storage backends between runs. Implement a strict lifecycle: create context → inject shim → execute → serialize → destroy context.
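The lifecycle can be enforced structurally by letting the context live only inside one function call. In the sketch below, `JsContext` is a hypothetical stand-in that merely tracks global names, and `run_isolated` is an illustrative helper; a real engine context would take its place.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an embedded JS runtime: it only tracks globals.
struct JsContext {
    globals: HashMap<String, String>,
}

impl JsContext {
    fn new() -> Self {
        Self { globals: HashMap::new() }
    }

    fn set_global(&mut self, name: &str, value: &str) {
        self.globals.insert(name.to_string(), value.to_string());
    }

    fn has_global(&self, name: &str) -> bool {
        self.globals.contains_key(name)
    }
}

/// Runs one extraction task in a context that exists only for this call.
fn run_isolated<T>(
    inject_shim: impl FnOnce(&mut JsContext),
    task: impl FnOnce(&mut JsContext) -> T,
) -> T {
    let mut context = JsContext::new(); // create context
    inject_shim(&mut context);          // inject shim
    let output = task(&mut context);    // execute and serialize
    drop(context);                      // destroy context
    output
}
```

Because the task's return type cannot borrow from the context, nothing from one run can be carried into the next by accident.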
4. Overlooking Synchronous XHR Limitations
Explanation: Modern browsers deprecate synchronous XMLHttpRequest on the main thread. The lightweight renderer implements it synchronously for simplicity, which can cause deadlocks if the underlying HTTP client isn't properly configured or if network timeouts are misaligned.
Fix: Ensure the HTTP client supports blocking calls without blocking the JS event loop. Set explicit timeout thresholds (e.g., 3–5 seconds) and implement retry logic with exponential backoff. Consider routing external fetches through a connection pool to avoid socket exhaustion.
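The retry part of this fix might look like the following sketch. `fetch_with_backoff` is a hypothetical helper that wraps any fallible closure; the per-request timeout itself is assumed to be configured on the HTTP client and is not shown.

```rust
use std::thread;
use std::time::Duration;

/// Retries `fetch` up to `max_attempts` times, doubling the delay after each failure.
/// At least one attempt is always made, even if `max_attempts` is zero.
fn fetch_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut fetch: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match fetch() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(error) if attempt >= max_attempts => return Err(error),
            Err(_) => {
                thread::sleep(delay);
                delay = delay.saturating_mul(2); // exponential backoff
                attempt += 1;
            }
        }
    }
}
```

With a base delay of 250 ms and four attempts, the waits between attempts are 250 ms, 500 ms, and 1 s, so the worst case adds under two seconds on top of the request timeouts.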
5. Relying on Browser Fingerprinting APIs
Explanation: Security scripts and anti-bot measures check navigator.plugins, window.chrome, navigator.hardwareConcurrency, or canvas fingerprinting. The stub environment lacks these properties, causing immediate detection or script termination.
Fix: Use a full headless browser for sites with aggressive fingerprinting. Do not attempt to patch every missing property; it becomes a maintenance nightmare. If fingerprinting is unavoidable, inject a comprehensive navigator mock that matches standard browser profiles.
6. Misusing CSS Selector Filtering
Explanation: Filtering output via selectors before script execution completes yields empty or partial results. Many frameworks defer DOM insertion until after initial hydration, meaning early filtering captures only the skeleton.
Fix: Always serialize the full DOM first, then apply selector filtering in a post-processing step. Provided the pipeline drains pending setTimeout callbacks and microtasks before serializing, this ensures that all asynchronous mutations are reflected in the filtered output.
7. Neglecting Error Boundaries in Script Execution
Explanation: A single unhandled exception in a page script can halt the entire pipeline. Frameworks often contain optional plugins or analytics scripts that fail in non-browser environments, crashing the extraction process.
Fix: Wrap script evaluation in try-catch blocks at the engine level. Log failures but continue processing remaining scripts to maximize content recovery. Implement a continue_on_error flag for production pipelines where partial extraction is preferable to total failure.
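Note that the run_extraction sketch earlier propagates the first error with `?`, which is exactly the fail-fast behavior this pitfall describes. A tolerant loop could be sketched as follows; `ExecutionReport` and `run_scripts` are hypothetical names, and the `evaluate` closure stands in for the engine's script evaluation call.

```rust
/// Outcome of running a batch of scripts.
#[derive(Debug, Default, PartialEq)]
struct ExecutionReport {
    executed: usize,
    failures: Vec<(usize, String)>, // (script index, error message)
}

/// Evaluates scripts in order. With `continue_on_error`, failures are logged
/// and recorded; without it, the first failure aborts the run.
fn run_scripts<F>(
    scripts: &[&str],
    continue_on_error: bool,
    mut evaluate: F,
) -> Result<ExecutionReport, String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut report = ExecutionReport::default();
    for (index, source) in scripts.iter().copied().enumerate() {
        match evaluate(source) {
            Ok(()) => report.executed += 1,
            Err(message) if continue_on_error => {
                eprintln!("script {index} failed: {message}");
                report.failures.push((index, message));
            }
            Err(message) => return Err(message),
        }
    }
    Ok(report)
}
```

Returning the failures alongside the success count lets the caller decide whether a partial extraction is good enough to keep.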
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume content archiving | Lightweight JS Renderer | Minimal footprint, instant startup, scales horizontally | Low infrastructure cost, reduced CI runtime |
| E-commerce checkout testing | Full Headless Browser | Requires layout, CSS, and secure context simulation | Higher compute cost, longer pipeline duration |
| Framework compatibility validation | Lightweight JS Renderer | Covers 90%+ of standard DOM interactions without GPU overhead | Moderate setup, low ongoing maintenance |
| Anti-bot bypass / Fingerprinting | Full Headless Browser | Needs authentic navigator properties and canvas rendering | High complexity, requires proxy rotation |
Configuration Template
A production-ready pipeline configuration typically separates engine settings, network behavior, and output formatting. Below is a structured template for deployment:
    extraction_pipeline:
      engine:
        type: quickjs
        es_version: "2023"
        memory_limit_mb: 64
        timeout_ms: 5000
        continue_on_error: true
      network:
        fetch_mode: synchronous
        proxy: null
        max_redirects: 3
        user_agent: "ContentExtractor/1.0"
        timeout_seconds: 5
      dom_stub:
        inject_order: "pre_execution"
        layout_fallback: "zero_return"
        storage_backend: "memory"
        expose_console: false
      output:
        format: "html"
        selector_filter: null
        minify: false
        encoding: "utf-8"
        preserve_comments: false
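On the Rust side, the engine and network sections of this template could map onto typed structs whose defaults match the values shown. The struct and method names below are assumptions made for illustration, and parsing the file itself (which would need a YAML crate) is left out.

```rust
use std::time::Duration;

/// Mirrors the `engine` section of the template (type and es_version omitted).
#[derive(Debug, Clone, PartialEq)]
struct EngineConfig {
    memory_limit_mb: u32,
    timeout_ms: u64,
    continue_on_error: bool,
}

/// Mirrors the `network` section of the template (fetch_mode omitted).
#[derive(Debug, Clone, PartialEq)]
struct NetworkConfig {
    proxy: Option<String>,
    max_redirects: u8,
    user_agent: String,
    timeout_seconds: u64,
}

impl Default for EngineConfig {
    fn default() -> Self {
        Self { memory_limit_mb: 64, timeout_ms: 5000, continue_on_error: true }
    }
}

impl Default for NetworkConfig {
    fn default() -> Self {
        Self {
            proxy: None,
            max_redirects: 3,
            user_agent: "ContentExtractor/1.0".to_string(),
            timeout_seconds: 5,
        }
    }
}

impl EngineConfig {
    /// The script budget as a Duration, so units cannot be confused later.
    fn script_timeout(&self) -> Duration {
        Duration::from_millis(self.timeout_ms)
    }
}

impl NetworkConfig {
    /// The per-request budget as a Duration.
    fn fetch_timeout(&self) -> Duration {
        Duration::from_secs(self.timeout_seconds)
    }
}
```

Converting both timeouts to `Duration` at the boundary matters here because the template expresses one in milliseconds and the other in seconds.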
Quick Start Guide
- Install the binary: Download the precompiled executable for your target architecture. No runtime dependencies or package managers are required.
- Extract a remote page: Run `extractor https://target-site.com` to fetch, execute scripts, and output the rendered HTML to standard output.
- Process local files: Pipe a local document directly: `cat page.html | extractor --stdin`. This bypasses network requests and executes scripts against the provided markup.
- Filter output: Append a CSS selector to isolate specific content: `extractor --selector "main.content" https://target-site.com`. Filtering occurs after full DOM serialization.
- Route through proxy: Add network anonymity or bypass geo-restrictions: `extractor --proxy socks5://127.0.0.1:9050 https://target-site.com`. The proxy applies to all external script and XHR fetches.