rakers — a headless JS renderer in Rust

By Codcompass Team·2026-05-20·8 min read

Minimalist JavaScript Execution for Dynamic Content Extraction

Current Situation Analysis

Modern web applications have largely shifted toward client-side rendering (CSR). The initial HTTP response typically contains a minimal HTML skeleton, with the actual content, routing, and interactive state populated entirely through JavaScript execution after the page loads. Extracting this content for archiving, automated testing, data processing, or static site generation requires running the embedded scripts first.

The industry standard has been to deploy full headless browsers: Playwright, Puppeteer, or raw Chromium instances. These tools work reliably because they replicate the entire browser environment. However, they carry substantial operational overhead. A standard Chromium installation consumes approximately 300 MB of disk space, requires 1–2 seconds to initialize, and spawns multiple sandboxed processes. In continuous integration environments, this translates to longer pipeline durations, higher memory pressure, and complex dependency management. Many engineering teams overlook a critical architectural distinction: they rarely need a rendering engine. CSS layout calculations, GPU compositing, WebGL contexts, and pixel-perfect viewport measurements are irrelevant when the sole objective is to execute scripts and capture the resulting DOM structure. This misconception leads to bloated automation pipelines, inflated infrastructure costs, and unnecessary complexity in content extraction workflows.

WOW Moment: Key Findings

Stripping away layout and visual rendering subsystems reveals a massive efficiency gap. When the execution environment is reduced to pure script evaluation and DOM mutation, the binary footprint drops by over 95% (roughly 300 MB to 10 MB), and initialization becomes nearly instantaneous (under 50 ms instead of 1–2 seconds). This shift enables high-throughput extraction pipelines that can run on modest hardware, scale horizontally without container bloat, and integrate seamlessly into CI/CD stages that previously couldn't support browser automation.

Approach	Binary Footprint	Cold Start Time	CI Resource Overhead	DOM API Coverage
Full Headless Browser	~300 MB	1.0–2.0 s	High (GPU, sandbox, multi-process)	Complete (layout, CSS, WebGL, IndexedDB)
Lightweight JS Renderer	~10 MB	< 50 ms	Minimal (single process, no GPU)	Core DOM, XHR, Storage, Event Loop

This finding matters because it decouples script execution from visual rendering. Teams can now process dynamic content at scale, reduce cloud compute costs, and maintain extraction pipelines that start instantly without waiting for browser process initialization.

Core Solution

Building a lightweight JavaScript execution pipeline requires three distinct phases: HTML parsing, script execution with a simulated browser environment, and DOM serialization. The architecture deliberately omits layout and styling engines to maintain a minimal footprint while preserving compatibility with standard framework bootstrapping sequences.

Phase 1: HTML Parsing

The input document is parsed into a traversable, mutable DOM tree. Rather than implementing a custom parser, the pipeline relies on a mature parser that follows the HTML5 specification. The parser must handle malformed markup gracefully, resolve character encodings, and produce a node tree that supports dynamic mutation. A standards-compliant implementation ensures that <script> tags, inline event handlers, and document structure are preserved accurately before execution begins.

Phase 2: Environment Simulation & Script Execution

Client-side frameworks expect a global window and document object. Since no actual browser exists, a JavaScript environment shim is injected before any page scripts run.

This shim implements the most frequently used DOM APIs: element creation, attribute manipulation, query selection, event registration, and storage interfaces. External scripts are fetched synchronously to maintain document order execution, and XMLHttpRequest calls are routed through a synchronous HTTP client to support runtime template loading.

The JavaScript engine itself is embedded directly into the host application. QuickJS is a common choice due to its ES2023 compliance, small footprint, and clean Rust bindings. For environments lacking a C toolchain, a pure-Rust alternative like boa_engine provides a fallback, albeit with a narrower compatibility surface. The engine evaluates scripts in a sandboxed context, exposing only the stubbed browser globals.

Phase 3: DOM Serialization

After all scripts complete, the mutated DOM tree is traversed and serialized back into an HTML string. This output can be written to standard output, saved to a file, or piped into downstream processing stages. The serialization step must preserve attribute order, handle self-closing tags correctly, and maintain the exact structure produced by the JavaScript execution phase.
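The serialization requirements above can be sketched with a small recursive walk. This is a minimal illustration, not the renderer's actual serializer: the `Node` type, `to_html`, and the escaping rules are assumptions made for the example. Attributes live in a `Vec` so their order survives, and void elements such as `<br>` are written without a closing tag. A real serializer would also need to leave `<script>` and `<style>` contents unescaped, which this sketch does not do.

```rust
/// A minimal DOM node (hypothetical type for this sketch).
#[derive(Debug, Clone)]
pub enum Node {
    Element {
        tag: String,
        // A Vec rather than a HashMap, so attribute order is preserved.
        attrs: Vec<(String, String)>,
        children: Vec<Node>,
    },
    Text(String),
}

/// HTML void elements: they never have children or a closing tag.
const VOID_TAGS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "source",
    "track", "wbr",
];

/// Escapes characters that would otherwise be read as markup.
fn escape(input: &str, in_attribute: bool) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' if in_attribute => out.push_str("&quot;"),
            other => out.push(other),
        }
    }
    out
}

/// Appends the HTML for `node` and its descendants to `out`.
pub fn serialize(node: &Node, out: &mut String) {
    match node {
        Node::Text(text) => out.push_str(&escape(text, false)),
        Node::Element { tag, attrs, children } => {
            out.push('<');
            out.push_str(tag);
            for (name, value) in attrs {
                out.push(' ');
                out.push_str(name);
                out.push_str("=\"");
                out.push_str(&escape(value, true));
                out.push('"');
            }
            out.push('>');
            if VOID_TAGS.contains(&tag.as_str()) {
                return; // no children, no closing tag
            }
            for child in children {
                serialize(child, out);
            }
            out.push_str("</");
            out.push_str(tag);
            out.push('>');
        }
    }
}

/// Convenience wrapper that returns the serialized HTML as a new String.
pub fn to_html(node: &Node) -> String {
    let mut out = String::new();
    serialize(node, &mut out);
    out
}
```

Calling `to_html` on the root node after script execution yields the string that would be written to standard output or a file.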

Implementation Architecture (Rust-Based Pipeline)

Below is a conceptual implementation demonstrating the pipeline architecture. The structure emphasizes separation of concerns, explicit error boundaries, and deterministic execution order.

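use std::fs;

// NOTE: `html5_parser`, `DomTree`, `JsExecutionContext`, `ExtractionError`, and
// `sync_network_fetch` are placeholders for this conceptual sketch, not real crates or APIs.

/// Core pipeline orchestrator
struct DynamicContentExtractor {
    /// Mutable DOM tree that page scripts modify and that is serialized at the end
    dom: DomTree,
    runtime: JsExecutionContext,
    environment_stub: String,
}

impl DynamicContentExtractor {
    fn from_input(raw_input: &str) -> Self {
        Self {
            // Parse HTML into a mutable tree structure and keep it for later phases
            dom: html5_parser::parse_document(raw_input),
            runtime: JsExecutionContext::new(),
            // A missing shim file falls back to an empty string; page scripts will
            // then fail on their first `window` or `document` access.
            environment_stub: fs::read_to_string("browser_shim.js")
                .unwrap_or_else(|_| String::new()),
        }
    }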

    fn run_extraction(&mut self) -> Result<String, ExtractionError> {
        // 1. Inject environment shim before page scripts
        self.runtime.evaluate_script(&self.environment_stub)?;

        // 2. Collect and execute scripts in document order
        let script_nodes = self.collect_ordered_scripts();
        for node in script_nodes {
            match node {
                ScriptNode::Inline(content) => {
                    self.runtime.evaluate_script(&content)?;
                }
                ScriptNode::External(url) => {
                    let fetched_code = sync_network_fetch(&url)?;
                    self.runtime.evaluate_script(&fetched_code)?;
                }
            }
        }

        // 3. Serialize mutated DOM back to HTML
        let rendered_output = self.serialize_dom_tree();
        Ok(rendered_output)
    }

    fn collect_ordered_scripts(&self) -> Vec<ScriptNode> {
        // DOM traversal logic to extract <script> elements
        // Returns inline content or external src URLs
        vec![]
    }

    fn serialize_dom_tree(&self) -> String {
        // Traverse DOM nodes and reconstruct HTML string
        String::new()
    }
}

enum ScriptNode {
    Inline(String),
    External(String),
}
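The `collect_ordered_scripts` stub above returns an empty list. As a rough illustration of what "document order" means, here is a standalone sketch that scans raw HTML text instead of walking a parsed tree. It is deliberately naive and is an assumption of this example, not the renderer's approach: it only understands quoted `src` values, and a `>` inside an attribute value or a tag name such as `<scripture>` would confuse it. A real pipeline would traverse the DOM produced by the HTML5 parser.

```rust
#[derive(Debug, PartialEq)]
pub enum ScriptNode {
    Inline(String),
    External(String),
}

/// Returns the value of a quoted `src` attribute in an opening tag, if present.
fn src_attribute(open_tag: &str) -> Option<String> {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = open_tag.to_ascii_lowercase();
    let mut from = 0;
    while let Some(pos) = lower[from..].find("src=") {
        let start = from + pos;
        let value_start = start + "src=".len();
        // Require preceding whitespace so that `data-src=` is not matched.
        let after_space = lower[..start]
            .chars()
            .next_back()
            .map_or(false, |c| c.is_ascii_whitespace());
        if after_space {
            let rest = &open_tag[value_start..];
            let quote = rest.chars().next()?;
            if quote == '"' || quote == '\'' {
                let end = rest[1..].find(quote)?;
                return Some(rest[1..1 + end].to_string());
            }
        }
        from = value_start;
    }
    None
}

/// Collects scripts in the order they appear in the markup.
pub fn collect_ordered_scripts(html: &str) -> Vec<ScriptNode> {
    let lower = html.to_ascii_lowercase();
    let mut scripts = Vec::new();
    let mut cursor = 0;
    while let Some(pos) = lower[cursor..].find("<script") {
        let open_start = cursor + pos;
        let Some(gt) = lower[open_start..].find('>') else { break };
        let open_end = open_start + gt; // index of the '>' closing the opening tag
        let body_start = open_end + 1;
        let Some(close) = lower[body_start..].find("</script") else { break };
        let body_end = body_start + close;
        match src_attribute(&html[open_start..open_end]) {
            Some(url) => scripts.push(ScriptNode::External(url)),
            None => {
                let body = html[body_start..body_end].trim();
                if !body.is_empty() {
                    scripts.push(ScriptNode::Inline(body.to_string()));
                }
            }
        }
        cursor = body_end + "</script".len();
    }
    scripts
}
```

The returned vector can be fed straight into the `match` loop in `run_extraction`, which preserves the sequential initialization that frameworks depend on.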

Architecture Decisions & Rationale

Synchronous Script Fetching: Maintains document order execution, which is critical for frameworks that rely on sequential initialization. Asynchronous loading would require a complex microtask queue simulation and could break dependency chains.
Shim Injection Order: The environment stub runs first to ensure global objects exist before page scripts execute. This prevents ReferenceError failures during framework bootstrapping and guarantees consistent global state.
Layout API Fallbacks: Properties like offsetWidth return zero, and methods like getBoundingClientRect return a rectangle whose fields are all zero. This is intentional. Returning null or throwing errors would crash frameworks that perform feature detection. Zero values allow execution to continue while accurately reflecting the absence of a rendering engine.
Engine Selection Strategy: QuickJS provides near-complete ES2023 support with minimal overhead. The pure-Rust fallback ensures deployment flexibility in constrained environments, trading some compatibility for zero native dependencies. Runtime isolation prevents state leakage between sequential extraction tasks.

Pitfall Guide

1. Assuming Layout Metrics Are Available

Explanation: Frameworks often use offsetHeight, getComputedStyle, or window.innerWidth for responsive logic and virtual scrolling. Without a CSS engine, these return zero or empty strings, causing miscalculated dimensions.

Fix: Mock critical layout values in the shim if targeting specific applications, or accept that layout-dependent features will degrade gracefully. Never rely on pixel-perfect measurements in a headless execution context.
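One way to picture "mock critical layout values" is a lookup table on the host side that returns zeros by default and a configured rectangle for the few elements an application really measures. The `Rect` and `LayoutStub` types below are hypothetical names for this sketch, and keying by element id is an assumption; the actual shim may expose overrides differently.

```rust
use std::collections::HashMap;

/// Rectangle returned by a stubbed `getBoundingClientRect`. Defaults to all zeros.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

/// Layout fallback table: zero for everything unless an override is registered.
#[derive(Default)]
pub struct LayoutStub {
    overrides: HashMap<String, Rect>,
}

impl LayoutStub {
    /// Registers a mocked rectangle for one element id (builder style).
    pub fn with_override(mut self, element_id: &str, rect: Rect) -> Self {
        self.overrides.insert(element_id.to_string(), rect);
        self
    }

    /// Returns the mocked rectangle, or the zero rectangle when none is registered.
    pub fn bounding_rect(&self, element_id: &str) -> Rect {
        self.overrides.get(element_id).copied().unwrap_or_default()
    }
}
```

A virtual-scrolling list, for example, could be given a fixed viewport height so that it renders a predictable number of rows instead of none.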

2. Expecting Native ES Module Support

Explanation: The execution environment typically lacks a module resolution system, import.meta, and network-based import() handling. Dynamic imports will fail silently or throw syntax errors.

Fix: Pre-bundle dependencies or convert ES modules to IIFE/UMD formats before injection. Avoid dynamic import() calls in target pages. If module loading is unavoidable, implement a custom resolver that maps module specifiers to pre-fetched strings.
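The custom resolver mentioned in the fix can be as simple as a map from specifier to source text that is filled before execution starts. The sketch below shows only that lookup; `PrefetchedModules` and `ResolveError` are hypothetical names, and wiring the resolver into a particular engine's module loader hook is engine-specific and not shown.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    /// The page asked for a module that was never pre-fetched.
    NotPrefetched(String),
}

/// Maps module specifiers to source text that was fetched ahead of time.
#[derive(Default)]
pub struct PrefetchedModules {
    sources: HashMap<String, String>,
}

impl PrefetchedModules {
    /// Stores the source for one specifier, replacing any earlier entry.
    pub fn register(&mut self, specifier: &str, source: &str) {
        self.sources.insert(specifier.to_string(), source.to_string());
    }

    /// Looks up a specifier without touching the network.
    pub fn resolve(&self, specifier: &str) -> Result<&str, ResolveError> {
        self.sources
            .get(specifier)
            .map(String::as_str)
            .ok_or_else(|| ResolveError::NotPrefetched(specifier.to_string()))
    }
}
```

Returning an explicit error for unknown specifiers makes a missing pre-fetch visible in logs instead of failing silently.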

3. Ignoring Global State Contamination

Explanation: Running multiple extractions in the same process without resetting the JS context leads to leaked variables, cached modules, and unpredictable behavior. Frameworks like React or Vue may retain internal state from previous runs.

Fix: Instantiate a fresh JavaScript runtime for each extraction task, or explicitly clear global properties, event listeners, and storage backends between runs. Implement a strict lifecycle: create context → inject shim → execute → serialize → destroy context.
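In Rust, the strict lifecycle maps naturally onto ownership: if the context is created inside the function that performs one extraction, it is destroyed when that function returns. The sketch below uses a stand-in `JsContext` that only records global names, which is an assumption made so the example runs without an embedded engine; with a real engine binding, the same shape applies to its runtime and context types.

```rust
use std::collections::HashMap;

/// Stand-in for an embedded engine context (hypothetical). It only tracks globals.
pub struct JsContext {
    globals: HashMap<String, String>,
}

impl JsContext {
    pub fn new() -> Self {
        JsContext { globals: HashMap::new() }
    }

    pub fn set_global(&mut self, name: &str, value: &str) {
        self.globals.insert(name.to_string(), value.to_string());
    }

    /// Sorted list of global names, standing in for the serialized result.
    pub fn global_names(&self) -> Vec<String> {
        let mut names: Vec<String> = self.globals.keys().cloned().collect();
        names.sort();
        names
    }
}

/// Runs one extraction with a context that lives only for this call.
pub fn extract_once(page_globals: &[(&str, &str)]) -> Vec<String> {
    let mut ctx = JsContext::new(); // 1. create context
    ctx.set_global("window", "[shim]"); // 2. inject shim
    for &(name, value) in page_globals {
        ctx.set_global(name, value); // 3. execute (page scripts define globals)
    }
    ctx.global_names() // 4. serialize
    // 5. `ctx` is dropped here, so nothing survives into the next call
}
```

Because no context outlives `extract_once`, a global defined by one page cannot be observed by the next.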

4. Overlooking Synchronous XHR Limitations

Explanation: Modern browsers deprecate synchronous XMLHttpRequest on the main thread. The lightweight renderer implements it synchronously for simplicity, so every request blocks script execution until it returns. If the underlying HTTP client has no timeout, or its timeout is misaligned with the engine's, a single slow request can hang the whole pipeline.

Fix: Accept that a synchronous call blocks the JS event loop by design, and bound how long it can block. Set explicit timeout thresholds (e.g., 3–5 seconds) and implement retry logic with exponential backoff. Consider routing external fetches through a connection pool to avoid socket exhaustion.
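Retry with exponential backoff can be written once as a generic helper and wrapped around whatever blocking fetch the pipeline uses. The function below is a sketch under that assumption: `retry_with_backoff` is a hypothetical name, the operation is any closure returning a `Result`, and the per-request timeout itself is expected to be configured on the HTTP client, which is not shown.

```rust
use std::thread;
use std::time::Duration;

/// Calls `operation` until it succeeds or `max_attempts` is reached.
/// The wait between attempts doubles each time: base, 2x base, 4x base, ...
/// The closure receives the zero-based attempt number.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut operation: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut attempt = 0;
    loop {
        match operation(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(error) if attempt + 1 >= max_attempts => return Err(error),
            Err(_) => {
                let factor = 2u32.saturating_pow(attempt);
                thread::sleep(base_delay.saturating_mul(factor));
                attempt += 1;
            }
        }
    }
}
```

With a base delay of 200 ms and four attempts, the worst case adds 200 + 400 + 800 ms of waiting on top of the request timeouts, which keeps the total bounded.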

5. Relying on Browser Fingerprinting APIs

Explanation: Security scripts and anti-bot measures check navigator.plugins, window.chrome, navigator.hardwareConcurrency, or canvas fingerprinting. The stub environment lacks these properties, causing immediate detection or script termination.

Fix: Use a full headless browser for sites with aggressive fingerprinting. Patching missing properties one at a time becomes a maintenance nightmare. If a full browser is not an option, inject a single comprehensive navigator mock that matches a standard browser profile.

6. Misusing CSS Selector Filtering

Explanation: Filtering output via selectors before script execution completes yields empty or partial results. Many frameworks defer DOM insertion until after initial hydration, meaning early filtering captures only the skeleton.

Fix: Always serialize the full DOM first, then apply selector filtering in a post-processing step. Serialize only after pending setTimeout callbacks and microtask resolutions have run, so that asynchronous mutations are included in the output.
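Waiting for deferred work before serializing implies some form of timer queue on the host side. The sketch below shows one possible shape, using a virtual clock so that a 50 ms `setTimeout` does not cost 50 ms of wall time. `TimerQueue` and its methods are hypothetical names, microtasks are not modeled, and the `max_callbacks` guard is an assumption added so that a page that keeps rescheduling timers cannot run forever.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// A timer callback may mutate shared state and schedule further timers.
type Callback<S> = Box<dyn FnOnce(&mut TimerQueue<S>, &mut S)>;

pub struct TimerQueue<S> {
    now_ms: u64,
    next_id: u64,
    // Min-heap of (fire time, id): earliest first, insertion order on ties.
    due: BinaryHeap<Reverse<(u64, u64)>>,
    callbacks: HashMap<u64, Callback<S>>,
}

impl<S> TimerQueue<S> {
    pub fn new() -> Self {
        TimerQueue { now_ms: 0, next_id: 0, due: BinaryHeap::new(), callbacks: HashMap::new() }
    }

    /// Schedules `callback` to run `delay_ms` after the current virtual time.
    pub fn set_timeout(
        &mut self,
        delay_ms: u64,
        callback: impl FnOnce(&mut TimerQueue<S>, &mut S) + 'static,
    ) {
        let id = self.next_id;
        self.next_id += 1;
        self.due.push(Reverse((self.now_ms + delay_ms, id)));
        self.callbacks.insert(id, Box::new(callback));
    }

    /// Runs timers in firing order. Returns true if the queue was fully drained,
    /// false if `max_callbacks` was reached while timers were still pending.
    pub fn drain(&mut self, state: &mut S, max_callbacks: usize) -> bool {
        let mut ran = 0;
        while let Some(Reverse((fire_at, id))) = self.due.pop() {
            if ran == max_callbacks {
                self.due.push(Reverse((fire_at, id))); // put it back untouched
                return false;
            }
            self.now_ms = fire_at; // advance the virtual clock, no real sleeping
            if let Some(callback) = self.callbacks.remove(&id) {
                callback(self, state);
                ran += 1;
            }
        }
        true
    }
}
```

Serialization and selector filtering would run only after `drain` returns, and a `false` result signals that the output may be incomplete.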

7. Neglecting Error Boundaries in Script Execution

Explanation: A single unhandled exception in a page script can halt the entire pipeline. Frameworks often contain optional plugins or analytics scripts that fail in non-browser environments, crashing the extraction process.

Fix: Wrap script evaluation in try-catch blocks at the engine level. Log failures but continue processing remaining scripts to maximize content recovery. Implement a continue_on_error flag for production pipelines where partial extraction is preferable to total failure.
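On the Rust side, the error boundary amounts to deciding what to do with each `Err` returned by the engine. The sketch below assumes an `evaluate` closure that stands in for the engine call and a hypothetical `ScriptFailure` record; with `continue_on_error` set, failures are logged and collected instead of aborting the run.

```rust
/// Records which script failed and why.
#[derive(Debug, PartialEq)]
pub struct ScriptFailure {
    pub index: usize,
    pub message: String,
}

/// Evaluates scripts in order.
/// - `continue_on_error = false`: stops at the first failure and returns it as `Err`.
/// - `continue_on_error = true`: logs each failure, keeps going, and returns them all.
pub fn run_scripts<F>(
    scripts: &[&str],
    continue_on_error: bool,
    mut evaluate: F,
) -> Result<Vec<ScriptFailure>, ScriptFailure>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut failures = Vec::new();
    for (index, source) in scripts.iter().copied().enumerate() {
        if let Err(message) = evaluate(source) {
            let failure = ScriptFailure { index, message };
            if !continue_on_error {
                return Err(failure);
            }
            eprintln!("script {} failed: {}", failure.index, failure.message);
            failures.push(failure);
        }
    }
    Ok(failures)
}
```

Returning the collected failures, rather than discarding them, lets the caller decide whether a partially rendered page is still usable.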

Production Bundle

Action Checklist

Verify target pages use standard DOM APIs and avoid layout-dependent rendering logic.
Pre-bundle external dependencies to eliminate network fetch failures during extraction.
Implement context isolation to prevent state leakage between sequential extraction runs.
Configure synchronous HTTP timeouts to prevent pipeline hangs on slow external scripts.
Add post-processing selector filtering after full DOM serialization, not before.
Monitor JavaScript engine memory usage; reset runtimes periodically to avoid fragmentation.
Validate output against known benchmarks (e.g., TodoMVC implementations) to track compatibility drift.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High-volume content archiving	Lightweight JS Renderer	Minimal footprint, instant startup, scales horizontally	Low infrastructure cost, reduced CI runtime
E-commerce checkout testing	Full Headless Browser	Requires layout, CSS, and secure context simulation	Higher compute cost, longer pipeline duration
Framework compatibility validation	Lightweight JS Renderer	Covers 90%+ of standard DOM interactions without GPU overhead	Moderate setup, low ongoing maintenance
Anti-bot bypass / Fingerprinting	Full Headless Browser	Needs authentic navigator properties and canvas rendering	High complexity, requires proxy rotation

Configuration Template

A production-ready pipeline configuration typically separates engine settings, network behavior, and output formatting. Below is a structured template for deployment:

extraction_pipeline:
  engine:
    type: quickjs
    es_version: "2023"
    memory_limit_mb: 64
    timeout_ms: 5000
    continue_on_error: true
  network:
    fetch_mode: synchronous
    proxy: null
    max_redirects: 3
    user_agent: "ContentExtractor/1.0"
    timeout_seconds: 5
  dom_stub:
    inject_order: "pre_execution"
    layout_fallback: "zero_return"
    storage_backend: "memory"
    expose_console: false
  output:
    format: "html"
    selector_filter: null
    minify: false
    encoding: "utf-8"
    preserve_comments: false
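On the host side, a template like this usually ends up as typed settings. The sketch below mirrors only the `engine` and `network` sections, with defaults copied from the template; the struct names and the `validate` rules are assumptions for illustration, and loading the YAML itself (for example through a deserialization crate) is left out to keep the example dependency-free.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct EngineConfig {
    pub memory_limit_mb: u32,
    pub timeout_ms: u64,
    pub continue_on_error: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct NetworkConfig {
    pub proxy: Option<String>,
    pub max_redirects: u32,
    pub user_agent: String,
    pub timeout_seconds: u64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct PipelineConfig {
    pub engine: EngineConfig,
    pub network: NetworkConfig,
}

impl Default for PipelineConfig {
    /// Defaults copied from the template values.
    fn default() -> Self {
        PipelineConfig {
            engine: EngineConfig { memory_limit_mb: 64, timeout_ms: 5000, continue_on_error: true },
            network: NetworkConfig {
                proxy: None,
                max_redirects: 3,
                user_agent: "ContentExtractor/1.0".to_string(),
                timeout_seconds: 5,
            },
        }
    }
}

impl PipelineConfig {
    /// Returns a list of problems; an empty list means the settings look usable.
    pub fn validate(&self) -> Vec<String> {
        let mut problems = Vec::new();
        if self.engine.memory_limit_mb == 0 {
            problems.push("engine.memory_limit_mb must be greater than zero".to_string());
        }
        if self.engine.timeout_ms == 0 {
            problems.push("engine.timeout_ms must be greater than zero".to_string());
        }
        if self.network.timeout_seconds == 0 {
            problems.push("network.timeout_seconds must be greater than zero".to_string());
        }
        problems
    }
}
```

Validating at startup catches a zero timeout before it turns into a pipeline that never gives up on a slow script.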

Quick Start Guide

Install the binary: Download the precompiled executable for your target architecture. No runtime dependencies or package managers are required.
Extract a remote page: Run extractor https://target-site.com to fetch, execute scripts, and output the rendered HTML to standard output.
Process local files: Pipe a local document directly: cat page.html | extractor --stdin. This bypasses network requests and executes scripts against the provided markup.
Filter output: Append a CSS selector to isolate specific content: extractor --selector "main.content" https://target-site.com. Filtering occurs after full DOM serialization.
Route through proxy: Add network anonymity or bypass geo-restrictions: extractor --proxy socks5://127.0.0.1:9050 https://target-site.com. The proxy applies to all external script and XHR fetches.
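For readers wrapping a similar tool, the invocations above can be modeled with a small argument parser. This is a sketch only: it handles just the flags shown in this guide (`--stdin`, `--selector`, `--proxy`) plus a positional URL, `CliOptions` and `parse_args` are hypothetical names, and the real binary's parsing rules and error messages are not known from this article.

```rust
#[derive(Debug, Default, PartialEq)]
pub struct CliOptions {
    pub url: Option<String>,
    pub read_stdin: bool,
    pub selector: Option<String>,
    pub proxy: Option<String>,
}

/// Parses the arguments that follow the program name.
pub fn parse_args(args: &[&str]) -> Result<CliOptions, String> {
    let mut options = CliOptions::default();
    let mut iter = args.iter().copied();
    while let Some(arg) = iter.next() {
        match arg {
            "--stdin" => options.read_stdin = true,
            "--selector" => {
                let value = iter.next().ok_or("--selector requires a value")?;
                options.selector = Some(value.to_string());
            }
            "--proxy" => {
                let value = iter.next().ok_or("--proxy requires a value")?;
                options.proxy = Some(value.to_string());
            }
            flag if flag.starts_with("--") => return Err(format!("unknown flag: {flag}")),
            // Anything else is treated as the target URL.
            url => options.url = Some(url.to_string()),
        }
    }
    if options.url.is_none() && !options.read_stdin {
        return Err("expected a URL or --stdin".to_string());
    }
    Ok(options)
}
```

In a binary, the slice would typically come from `std::env::args().skip(1)`, collected into strings first.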
