How I Fixed a Race Condition in rrweb That Was Breaking 60% of My Session Recordings
Architecting Reliable Session Replay: Eliminating Initialization Races and Aggregation Drift
Current Situation Analysis
Session replay technology has become a standard requirement for product analytics, yet implementation patterns frequently introduce silent data corruption. The most pervasive issue in production environments is the Initialization Race Condition. Developers often treat recording libraries like standard analytics scripts, applying lazy-loading patterns to optimize initial page weight. This approach is fundamentally incompatible with state-capture libraries.
Unlike event-based analytics, which can tolerate dropped packets or delayed transmission, session replay requires a deterministic baseline. Libraries like rrweb rely on capturing a full DOM snapshot as soon as recording starts. In rrweb this is the FullSnapshot event (type 2), emitted right after the Meta event (type 4) that records the page URL and viewport size. Subsequent events are mutations relative to this snapshot. If the recording engine starts after the initial DOM has already rendered or mutated, the snapshot is either missing or stale. The result is a session that appears valid in the database but contains no playable content.
In high-traffic deployments, this race condition can silently invalidate more than 50% of recordings. The failure mode is particularly dangerous because it produces no client-side errors. The recording session is created, the script loads eventually, and mutation events may even be captured, but without the anchor snapshot, the playback engine cannot reconstruct the page. Additionally, backend aggregation logic often introduces Duration Drift, where batched event writes overwrite session metadata with non-monotonic values, resulting in duration metrics that regress rather than accumulate.
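The snapshot-then-mutations dependency can be checked mechanically. The following is a minimal sketch, not part of rrweb: the `RecordedEvent` shape and the `hasPlayableBaseline` helper are hypothetical names, and the numeric constants follow rrweb's `EventType` enum, where FullSnapshot is 2, IncrementalSnapshot is 3, and Meta is 4.

```typescript
// Numeric values follow rrweb's EventType enum.
const FULL_SNAPSHOT = 2;
const INCREMENTAL_SNAPSHOT = 3;

// Minimal event shape for this sketch; real rrweb events also carry `data`.
interface RecordedEvent {
  type: number;
  timestamp: number;
}

// A session has a usable baseline only if a full snapshot appears
// before the first incremental (mutation) event.
function hasPlayableBaseline(events: RecordedEvent[]): boolean {
  const ordered = [...events].sort((a, b) => a.timestamp - b.timestamp);
  for (const event of ordered) {
    if (event.type === FULL_SNAPSHOT) return true;
    if (event.type === INCREMENTAL_SNAPSHOT) return false;
  }
  return false; // no snapshot and no mutations at all
}

const healthy: RecordedEvent[] = [
  { type: 4, timestamp: 1000 }, // Meta
  { type: 2, timestamp: 1001 }, // FullSnapshot
  { type: 3, timestamp: 1500 }, // first mutation
];
const orphaned: RecordedEvent[] = [
  { type: 3, timestamp: 1500 },
  { type: 3, timestamp: 1800 },
];

console.log(hasPlayableBaseline(healthy)); // true
console.log(hasPlayableBaseline(orphaned)); // false
```

The second session is the silent failure described above: it has events and would be stored without complaint, yet nothing anchors the mutations.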
WOW Moment: Key Findings
The difference between a lazy-load strategy and a preload strategy is not merely performance; it is binary viability. The following comparison illustrates the impact on recording integrity based on production telemetry from replay implementations.
| Approach | Snapshot Capture Rate | Playback Viability | Risk Profile |
|---|---|---|---|
| Lazy / Dynamic Injection | ~40% | Low | High race condition risk. The baseline events (Meta and full snapshot) go missing on slow networks or fast user interactions. |
| Immediate Preload | >99% | High | Deterministic. Script download overlaps with critical rendering path. Baseline secured. |
| Lazy + Mutation Fallback | ~65% | Medium | Captures mutations but lacks context. Playback shows blank screens or partial renders. |
Why this matters: A 40% capture rate leaves a small sample that is skewed toward fast networks and slower interactions, which makes it unreliable for UX analysis. Preloading moves the network latency cost into the browser's parallel download queue, so the recording engine is far more likely to be ready before the first paint completes. This transforms replay from a "best effort" feature into a reliable data pipeline.
Core Solution
To eliminate initialization races and aggregation drift, the architecture must enforce two principles: Deterministic Initialization and Idempotent Aggregation.
1. Deterministic Initialization via Promise-Based Preload
The recording library must begin downloading immediately upon tracker execution, decoupling network latency from the recording start command. We implement a ReplayManager that initiates the script injection in its constructor and exposes a promise that resolves when the library is available. This allows downstream code to request recording start without blocking, while guaranteeing the library is loaded.
Implementation Strategy:
- Inject the script tag at the earliest execution point (e.g., top of IIFE or module initialization).
- Wrap the load state in a singleton promise to prevent duplicate injections.
- Resolve the promise only when `window.rrweb` is populated.
- Pin the library version to prevent breaking changes from altering the snapshot format.
TypeScript Implementation:
// Assumption: your rrweb version exports these types from the package root.
// Some releases expose them from a different path, so adjust the import if needed.
import type { eventWithTime, recordOptions } from 'rrweb';

interface ReplayConfig {
  cdnUrl: string;
  version: string;
  // `emit` is supplied by ReplayManager, so callers configure everything else
  recordOptions: Omit<recordOptions<eventWithTime>, 'emit'>;
}

declare global {
  interface Window {
    rrweb?: typeof import('rrweb');
  }
}

export class ReplayManager {
  private loadPromise: Promise<typeof import('rrweb')>;
  private config: ReplayConfig;

  constructor(config: ReplayConfig) {
    this.config = config;
    // Start the download immediately; startRecording awaits this later
    this.loadPromise = this.initializeLoad();
  }

  private initializeLoad(): Promise<typeof import('rrweb')> {
    return new Promise((resolve, reject) => {
      // Fast path: library already loaded
      if (typeof window !== 'undefined' && window.rrweb) {
        resolve(window.rrweb);
        return;
      }
      const script = document.createElement('script');
      script.src = `${this.config.cdnUrl}/rrweb@${this.config.version}/dist/rrweb.min.js`;
      script.onload = () => {
        if (window.rrweb) {
          resolve(window.rrweb);
        } else {
          reject(new Error('rrweb script loaded but global object missing'));
        }
      };
      script.onerror = () => reject(new Error('Failed to load rrweb script'));
      // Inject immediately to overlap with page load
      document.head.appendChild(script);
    });
  }

  public async startRecording(sessionId: string): Promise<void> {
    try {
      const rrweb = await this.loadPromise;
      const stopFn = rrweb.record({
        ...this.config.recordOptions,
        // Arrow function keeps `this` bound to the ReplayManager instance
        emit: (event) => {
          this.dispatchEvent(event, sessionId);
        }
      });
      // record() may return undefined, so register cleanup only when a stop function exists
      if (stopFn) {
        this.attachCleanupHandler(stopFn, sessionId);
      }
    } catch (error) {
      console.error('Replay initialization failed:', error);
      // Fallback logic or telemetry alert
    }
  }

  private dispatchEvent(event: eventWithTime, sessionId: string): void {
    // Implementation for batching and sending events
    // Validate in the pipeline that the full snapshot (type 2) arrived
  }

  private attachCleanupHandler(stopFn: () => void, sessionId: string): void {
    window.addEventListener('beforeunload', () => {
      stopFn();
      // Flush remaining events
    });
  }
}
Rationale:
- Constructor Injection: By calling `initializeLoad()` in the constructor, the script download begins before any route checks or async operations. This ensures the network request is in-flight during the critical rendering path.
- Promise Singleton: The `loadPromise` is created once. Multiple calls to `startRecording` will await the same promise, preventing race conditions where multiple script tags might be injected.
- Version Pinning: The `version` field in the config enforces strict versioning. Using dynamic versions (e.g., `latest`) risks snapshot format changes that break playback compatibility.
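The promise-singleton idea is independent of rrweb and can be isolated into a small helper. This is a sketch under stated assumptions: `createOnceLoader` is a hypothetical name, and the loader passed to it stands in for the script injection so the behavior can be observed without a DOM.

```typescript
// Wraps any async loader so it runs at most once; every caller shares one promise.
function createOnceLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = load();
    }
    return pending;
  };
}

// Hypothetical stand-in for script injection; counts how often it runs.
let injections = 0;
const loadLibrary = createOnceLoader(async () => {
  injections += 1;
  return { record: () => undefined };
});

const first = loadLibrary();
const second = loadLibrary();

console.log(first === second, injections); // true 1
```

Because the check and the assignment happen in the same synchronous step, there is no window in which a second caller can trigger a second load.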
2. Idempotent Aggregation for Session Metadata
Backend systems processing replay events often receive data in batches. A common anti-pattern is using "last-write-wins" logic for session metadata like duration. This causes duration metrics to fluctuate or regress when out-of-order batches arrive.
Implementation Strategy:
- Treat session duration as a monotonic counter.
- Use `Math.max` when merging incoming batch data with existing session state.
- Apply the same logic to other cumulative metrics, such as scroll depth or interaction count, whenever each batch reports a running maximum or total rather than a per-batch delta.
TypeScript Implementation:
interface SessionBatch {
  sessionId: string;
  duration: number;
  scrollDepth: number;
  eventCount: number;
}

interface SessionDocument {
  sessionId: string;
  metadata: {
    duration: number;
    scrollDepth: number;
    eventCount: number;
    lastUpdated: Date;
  };
}

export function mergeSessionBatch(
  batch: SessionBatch,
  existing: SessionDocument
): SessionDocument {
  return {
    ...existing,
    metadata: {
      // High-water marks: never decrease, safe under retries and reordering
      duration: Math.max(
        batch.duration ?? 0,
        existing.metadata.duration ?? 0
      ),
      scrollDepth: Math.max(
        batch.scrollDepth ?? 0,
        existing.metadata.scrollDepth ?? 0
      ),
      // Additive: assumes each batch is merged exactly once (de-duplicate retries upstream)
      eventCount: existing.metadata.eventCount + (batch.eventCount ?? 0),
      lastUpdated: new Date()
    }
  };
}
Rationale:
- Monotonic Growth: `Math.max` ensures duration never decreases, regardless of batch arrival order. This eliminates the "duration regression" bug where a 39-second session might report 20 seconds due to a late-arriving batch with a lower timestamp.
- Idempotency: The `Math.max` fields are idempotent. Applying the same batch multiple times yields the same duration and scroll depth, which is essential for retry logic in distributed systems. `eventCount` is the exception: it is summed, so a retried batch is counted twice unless batches are de-duplicated before merging.
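To see both properties at once, here is a small sketch that replays batches out of order and with a retry. It assumes each batch carries a unique `batchId`, which is a hypothetical field that the `SessionBatch` interface above does not define. The ID is what makes the additive counter safe to retry.

```typescript
interface Batch {
  batchId: string; // hypothetical unique id per batch, used for de-duplication
  duration: number;
  eventCount: number;
}

interface SessionState {
  duration: number;
  eventCount: number;
  seenBatchIds: Set<string>;
}

function applyBatch(state: SessionState, batch: Batch): SessionState {
  // A retried batch must not be counted twice.
  if (state.seenBatchIds.has(batch.batchId)) return state;
  return {
    duration: Math.max(state.duration, batch.duration),
    eventCount: state.eventCount + batch.eventCount,
    seenBatchIds: new Set(state.seenBatchIds).add(batch.batchId),
  };
}

const emptyState: SessionState = { duration: 0, eventCount: 0, seenBatchIds: new Set<string>() };
const early: Batch = { batchId: 'b1', duration: 20, eventCount: 50 };
const late: Batch = { batchId: 'b2', duration: 39, eventCount: 30 };

// The late batch arrives first, then the early one, then a retry of the early one.
const merged = [late, early, early].reduce(applyBatch, emptyState);

console.log(merged.duration, merged.eventCount); // 39 80
```

With last-write-wins, the same sequence would end at a duration of 20. With a plain sum and no de-duplication, the event count would end at 130.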
Pitfall Guide
The Lazy-Load Trap
- Explanation: Dynamically injecting the recording script after route resolution or user interaction.
- Impact: The initial DOM snapshot is missed. The session contains mutations but no baseline, rendering it unplayable.
- Fix: Preload the script at the earliest execution point. Use the Promise-based manager to decouple download from start.
Version Drift
- Explanation: Loading `rrweb` from a URL that resolves to the latest version.
- Impact: A library update may change the snapshot format or event structure. Existing recordings may become unplayable, and new recordings may fail to parse.
- Fix: Pin versions explicitly in the CDN URL. Test upgrades in a staging environment before rolling out.
Orphaned Mutations
- Explanation: Recording starts, but the full snapshot event (type 2 in rrweb, sent right after the type 4 Meta event) is filtered out or lost due to network issues.
- Impact: Playback engine cannot reconstruct the page. The session appears as a blank screen with error logs.
- Fix: Implement pipeline validation. Reject sessions that do not contain a full snapshot within the first N events. Alert on snapshot loss.
Duration Overwrite
- Explanation: Backend logic sets session duration to the value from the latest batch without comparison.
- Impact: Duration metrics regress. A session lasting 60 seconds may show 10 seconds if a late batch reports a lower value.
- Fix: Use `Math.max` for all cumulative metrics. Ensure duration is calculated based on the maximum timestamp seen.
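One way to base duration on the maximum timestamp seen is to store the session's earliest and latest event timestamps and derive duration from them, rather than trusting a duration value reported by each batch. The sketch below assumes timestamps are in milliseconds; `TimeBounds`, `extendBounds`, and `durationSeconds` are hypothetical names.

```typescript
interface TimeBounds {
  firstTimestamp: number; // milliseconds
  lastTimestamp: number; // milliseconds
}

// Widen the session's known time range with the timestamps from one batch.
function extendBounds(
  current: TimeBounds | undefined,
  timestamps: number[]
): TimeBounds | undefined {
  let result = current;
  for (const ts of timestamps) {
    result = result
      ? {
          firstTimestamp: Math.min(result.firstTimestamp, ts),
          lastTimestamp: Math.max(result.lastTimestamp, ts),
        }
      : { firstTimestamp: ts, lastTimestamp: ts };
  }
  return result;
}

function durationSeconds(current: TimeBounds | undefined): number {
  return current ? (current.lastTimestamp - current.firstTimestamp) / 1000 : 0;
}

// Batches arrive out of order: the one covering seconds 10 to 60 lands first.
const afterLateBatch = extendBounds(undefined, [10000, 35000, 60000]);
const afterEarlyBatch = extendBounds(afterLateBatch, [0, 4000, 10000]);

console.log(durationSeconds(afterEarlyBatch)); // 60
```

Since the bounds only ever widen, replaying or reordering batches cannot shrink the derived duration.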
Script Injection Order
- Explanation: Appending the script tag after other heavy resources or deferring execution.
- Impact: Increased time-to-record. On slow connections, the delay may still cause the snapshot to be missed.
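- Fix: Inject the script tag as early as possible. Scripts added with `document.createElement('script')` load asynchronously by default, so for the injected tag it is the injection time that matters. If the tag is written directly in the HTML instead, avoid `defer`, which postpones execution until parsing finishes, and place the tag high in the `<head>` so it is treated as high-priority.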
Promise Race Conditions
- Explanation: Checking for `window.rrweb` and injecting a script in separate steps without synchronization.
- Impact: Multiple script tags may be injected, or the code may attempt to use the library before it is fully initialized.
- Fix: Use a singleton promise pattern. The promise should handle both the check and the injection atomically.
Assuming Synchronous Availability
- Explanation: Calling `rrweb.record()` immediately after script injection without awaiting load.
- Impact: `ReferenceError` or silent failure. The library is not yet available in the global scope.
- Fix: Always await the load promise before calling recording APIs. The `ReplayManager` pattern enforces this.
Production Bundle
Action Checklist
- Pin Library Version: Update CDN URLs to use specific version tags (e.g., `rrweb@2.0.0-alpha.13`). Remove `latest` references.
- Implement Preload Manager: Deploy the `ReplayManager` class to handle script injection and promise resolution.
- Validate Snapshots: Add pipeline logic to verify every session contains a full snapshot event (type 2 in rrweb, preceded by the type 4 Meta event). Alert if capture rate drops below 95%.
- Fix Duration Aggregation: Update backend merge logic to use `Math.max` for duration and scroll depth.
- Monitor Initialization Latency: Track the time between tracker load and `record()` call. Alert if latency exceeds 500ms.
- Test on Throttled Networks: Verify recording integrity on 3G and slow connections. Ensure snapshots are captured.
- Review Cleanup Handlers: Ensure `stop()` is called on `beforeunload` and remaining events are flushed.
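The snapshot validation and the 95% alert threshold from the checklist could be combined into one monitoring function. This is a sketch only: `SessionSummary`, `snapshotCaptureRate`, and `shouldAlert` are hypothetical names, the per-session input is assumed to be the list of event types in arrival order, and the window size of 5 is an arbitrary illustration rather than a recommended value.

```typescript
interface SessionSummary {
  sessionId: string;
  eventTypes: number[]; // event types in arrival order
}

const FULL_SNAPSHOT_TYPE = 2; // rrweb EventType.FullSnapshot

// Fraction of sessions whose first `windowSize` events include a full snapshot.
function snapshotCaptureRate(sessions: SessionSummary[], windowSize: number): number {
  if (sessions.length === 0) return 1; // nothing recorded, nothing to alert on
  const captured = sessions.filter((session) =>
    session.eventTypes.slice(0, windowSize).includes(FULL_SNAPSHOT_TYPE)
  ).length;
  return captured / sessions.length;
}

function shouldAlert(rate: number, threshold: number = 0.95): boolean {
  return rate < threshold;
}

const sampleSessions: SessionSummary[] = [
  { sessionId: 'a', eventTypes: [4, 2, 3, 3] },
  { sessionId: 'b', eventTypes: [3, 3, 3] }, // mutations only, no baseline
  { sessionId: 'c', eventTypes: [4, 2, 3] },
  { sessionId: 'd', eventTypes: [4, 2] },
];

const captureRate = snapshotCaptureRate(sampleSessions, 5);
console.log(captureRate, shouldAlert(captureRate)); // 0.75 true
```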
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Traffic Landing Page | Immediate Preload | Maximizes capture rate. Page weight is less critical than data integrity. | Slight increase in initial bandwidth. High ROI on data quality. |
| Internal Admin Dashboard | Lazy Load | Recording is optional. Performance is prioritized. Low risk of missing critical user flows. | Lower bandwidth. Reduced capture rate acceptable. |
| Multi-Page Application | Preload + Route Guard | Preload ensures baseline. Route guard prevents duplicate recordings. | Balanced performance and reliability. |
| Strict CSP Environment | Self-Hosted Preload | CDN may be blocked. Self-hosting ensures availability. | Increased storage/CDN costs. Full control over versioning. |
Configuration Template
// replay.config.ts
import { ReplayManager } from './ReplayManager';

const replayConfig = {
  cdnUrl: 'https://cdn.yourdomain.com/libs',
  version: '2.0.0-alpha.13',
  recordOptions: {
    // No `emit` here: ReplayManager.startRecording supplies its own emit
    // handler, which overrides anything set in this object.
    sampling: {
      mousemove: 500,
      scroll: 100
    }
    // Add `packFn` only with a real packing function that returns the packed
    // event; an empty stub would return undefined in place of each event.
  }
};

export const replayManager = new ReplayManager(replayConfig);
Quick Start Guide
- Install Dependencies: Add `rrweb` to your project dependencies. Ensure the version matches your CDN pin.
- Create Manager: Copy the `ReplayManager` class into your codebase. Configure the CDN URL and version.
- Initialize Early: Instantiate `ReplayManager` at the top of your entry script or IIFE. This triggers the preload.
- Start Recording: Call `replayManager.startRecording(sessionId)` when the session begins. The manager handles the async load.
- Verify: Check your backend pipeline. Confirm each session contains a full snapshot (type 2, after the type 4 Meta event) and duration metrics are monotonic.
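The `dispatchEvent` method in `ReplayManager` is left as a stub, so before verifying the backend you need some way to batch and send events. One possible shape is sketched below. `EventBatcher` is a hypothetical helper, the batch size of 3 is only for illustration, and the `send` callback stands in for whatever transport you use.

```typescript
// Collects events and hands them to `send` in batches of up to `maxSize`.
class EventBatcher<T> {
  private buffer: T[] = [];
  private readonly maxSize: number;
  private readonly send: (batch: T[]) => void;

  constructor(maxSize: number, send: (batch: T[]) => void) {
    this.maxSize = maxSize;
    this.send = send;
  }

  add(event: T): void {
    this.buffer.push(event);
    if (this.buffer.length >= this.maxSize) {
      this.flush();
    }
  }

  // Call when the page is being unloaded to push out whatever is left.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.send(batch);
  }
}

// Stand-in transport: record what would have been sent.
const sentBatches: number[][] = [];
const batcher = new EventBatcher<number>(3, (batch) => {
  sentBatches.push(batch);
});

[1, 2, 3, 4].forEach((n) => batcher.add(n));
batcher.flush();

console.log(JSON.stringify(sentBatches)); // [[1,2,3],[4]]
```

Sending the first batch as early as possible is worth considering, since that batch carries the full snapshot that the rest of the session depends on.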
