Every experiment must declare its evaluation contract before any code is written. The contract specifies the hypothesis, the primary metric, the guardrail metrics, and the statistical parameters that define success.
interface ExperimentContract {
  id: string;
  hypothesis: string;
  primaryMetric: 'conversion' | 'retention' | 'engagement';
  guardrailMetrics: ('latency_p95' | 'error_rate' | 'crash_rate')[];
  minDetectableEffect: number; // e.g., 0.05 for 5% lift
  alpha: number; // 0.05 standard
  power: number; // 0.80 standard
}
const CHECKOUT_FLOW_V2: ExperimentContract = {
  id: 'exp-checkout-v2',
  hypothesis: 'Reducing form fields to 3 will increase conversion by ≥5%',
  primaryMetric: 'conversion',
  guardrailMetrics: ['latency_p95', 'error_rate'],
  minDetectableEffect: 0.05,
  alpha: 0.05,
  power: 0.80,
};
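The statistical parameters in the contract determine how much traffic an experiment needs. The following is a minimal sketch of that calculation, assuming a two-sided two-proportion z-test and treating minDetectableEffect as a relative lift. The function name requiredSamplePerVariant and the baselineRate parameter are illustrative assumptions (the contract above does not carry a baseline), and the z-values are hardcoded for the conventional alpha and power settings only.

```typescript
// z-quantiles for conventional settings only. Other values would need an
// inverse normal CDF, which is out of scope for this sketch.
const Z_ALPHA_TWO_SIDED: Record<string, number> = { '0.05': 1.96, '0.01': 2.576 };
const Z_POWER: Record<string, number> = { '0.8': 0.8416, '0.9': 1.2816 };

function requiredSamplePerVariant(
  baselineRate: number,        // assumption: current conversion rate, e.g. 0.10
  minDetectableEffect: number, // relative lift, e.g. 0.05 for +5%
  alpha: number,
  power: number
): number {
  const zAlpha = Z_ALPHA_TWO_SIDED[String(alpha)];
  const zBeta = Z_POWER[String(power)];
  if (zAlpha === undefined || zBeta === undefined) {
    throw new Error('Unsupported alpha or power in this sketch');
  }
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + minDetectableEffect);
  // Sum of the Bernoulli variances of the two arms.
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (p2 - p1) ** 2);
}
```

With a hypothetical 10% baseline and the CHECKOUT_FLOW_V2 parameters, this comes to roughly 58,000 users per variant, which shows why a small relative lift needs far more traffic than intuition suggests.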
Step 2: Implement Deterministic Assignment
Client-side randomization is unstable. Sessions refresh, devices change, and A/B test pollution occurs when users see multiple variants. Deterministic assignment uses consistent hashing to map user/context identifiers to a stable variant bucket.
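import { createHash } from 'crypto';
type Variant = 'control' | 'treatment_A' | 'treatment_B';
// Maps 8 hex characters of the digest to a number in [0, 1).
function hashToUnit(hash: string, offset: number): number {
  return parseInt(hash.slice(offset, offset + 8), 16) / 0x100000000;
}
function assignVariant(
  userId: string,
  experimentId: string,
  trafficAllocation: number = 1.0
): Variant | null {
  const hashInput = `${userId}::${experimentId}`;
  const hash = createHash('sha256').update(hashInput).digest('hex');
  // Enrollment and the variant split read separate slices of the hash.
  // Reusing one value for both would put every user in 'control'
  // whenever trafficAllocation is 0.5 or lower.
  if (hashToUnit(hash, 0) >= trafficAllocation) return null; // Outside traffic
  if (hashToUnit(hash, 8) < 0.5) return 'control';
  return 'treatment_A';
}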
This approach keeps a user on the same variant across sessions, devices, and page reloads, provided userId is a stable identifier (for example, an account ID rather than a per-device cookie). The trafficAllocation parameter enables safe rollout: raising it expands the enrolled hash range without re-randomizing users who are already enrolled.
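The stability claim can be checked directly. The sketch below assumes enrollment and the variant split are drawn from separate slices of the SHA-256 digest; unitInterval, assignWithRamp, and rampIsStable are hypothetical helper names for illustration, not part of any library.

```typescript
import { createHash } from 'crypto';

type Arm = 'control' | 'treatment_A';

// Hypothetical helper: maps 8 hex characters of the digest to [0, 1).
function unitInterval(hex: string, offset: number): number {
  return parseInt(hex.slice(offset, offset + 8), 16) / 0x100000000;
}

function assignWithRamp(userId: string, experimentId: string, allocation: number): Arm | null {
  const hex = createHash('sha256').update(`${userId}::${experimentId}`).digest('hex');
  if (unitInterval(hex, 0) >= allocation) return null; // not enrolled at this allocation
  return unitInterval(hex, 8) < 0.5 ? 'control' : 'treatment_A';
}

// Ramp check: every user enrolled at 10% must keep the same arm at 50%.
function rampIsStable(userIds: string[], experimentId: string): boolean {
  return userIds.every((id) => {
    const early = assignWithRamp(id, experimentId, 0.1);
    return early === null || early === assignWithRamp(id, experimentId, 0.5);
  });
}
```

Running rampIsStable over a sample of real or synthetic IDs before a rollout is a cheap way to confirm that raising the allocation only adds users and never moves them between arms.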
Step 3: Instrument Telemetry Without Blocking the Main Thread
Experimentation telemetry must be fire-and-forget. Synchronous network calls degrade performance and skew latency metrics. Use the Beacon API or async queues for event emission.
interface TelemetryEvent {
  experimentId: string;
  variant: Variant;
  metric: string;
  value: number;
  timestamp: number;
  sessionId: string;
}
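const TELEMETRY_ENDPOINT = '/api/experiments/telemetry';
const BATCH_SIZE = 10;
const telemetryQueue: TelemetryEvent[] = [];
function emitExperimentEvent(event: TelemetryEvent) {
  telemetryQueue.push(event);
  // Flush only when a full batch has accumulated. Also flushing whenever
  // navigator exists would send every event individually in the browser.
  if (telemetryQueue.length >= BATCH_SIZE) {
    flushTelemetry();
  }
}
function flushTelemetry() {
  if (telemetryQueue.length === 0) return;
  const payload = JSON.stringify(telemetryQueue.splice(0, telemetryQueue.length));
  if (typeof navigator !== 'undefined' && navigator.sendBeacon) {
    // A string body is sent as text/plain, so the endpoint must parse it as JSON itself.
    navigator.sendBeacon(TELEMETRY_ENDPOINT, payload);
  } else {
    fetch(TELEMETRY_ENDPOINT, {
      method: 'POST',
      body: payload,
      keepalive: true,
      headers: { 'Content-Type': 'application/json' }
    }).catch(() => {}); // Non-blocking failure tolerance
  }
}
// Flush the remainder when the page is hidden so partial batches are not lost.
if (typeof document !== 'undefined') {
  document.addEventListener('visibilitychange', () => {
    if (document.visibilityState === 'hidden') flushTelemetry();
  });
}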
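The batching logic is easier to verify when the transport is injected rather than hardwired. The following sketch shows one way to do that; createTelemetryQueue and Transport are hypothetical names, and in a browser the send callback would wrap navigator.sendBeacon or fetch.

```typescript
type Transport<T> = (batch: T[]) => void;

function createTelemetryQueue<T>(send: Transport<T>, batchSize = 10) {
  const pending: T[] = [];

  const flush = (): void => {
    if (pending.length === 0) return;
    // splice hands off the current events and empties the queue in one step.
    send(pending.splice(0, pending.length));
  };

  const push = (event: T): void => {
    pending.push(event);
    if (pending.length >= batchSize) flush();
  };

  return { push, flush, size: () => pending.length };
}
```

Because the queue never touches the network itself, the same code can run in unit tests, in a browser, or on a server, with only the send function changing.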
Step 4: React Integration with Evaluation Hook
Frontend components should consume variants through a hook that handles assignment, fallback, and telemetry emission.
import { useState, useEffect } from 'react';
// One ID per page load, shared by all events. A real session ID would
// normally come from the application's session layer.
const SESSION_ID = crypto.randomUUID();
function useExperiment(
  contract: ExperimentContract,
  userId: string,
  trafficAllocation: number = 1.0
): { variant: Variant | null; isLoading: boolean } {
  const [variant, setVariant] = useState<Variant | null>(null);
  const [isLoading, setIsLoading] = useState(true);
  useEffect(() => {
    const assigned = assignVariant(userId, contract.id, trafficAllocation);
    setVariant(assigned);
    setIsLoading(false);
    if (assigned) {
      emitExperimentEvent({
        experimentId: contract.id,
        variant: assigned,
        metric: 'impression',
        value: 1,
        timestamp: Date.now(),
        sessionId: SESSION_ID
      });
    }
  }, [userId, contract.id, trafficAllocation]);
  return { variant, isLoading };
}
// Usage in component
function CheckoutForm() {
  const { variant, isLoading } = useExperiment(CHECKOUT_FLOW_V2, 'user_12345', 0.2);
  if (isLoading) return <LoadingSkeleton />;
  if (!variant) return <LegacyCheckout />;
  return variant === 'treatment_A' ? <OptimizedCheckout /> : <LegacyCheckout />;
}
Architecture Decisions and Rationale
- Deterministic Hashing over Random Assignment: Per-request randomization can move a returning user to a different variant. SHA-256 hashing on userId::experimentId produces an approximately uniform distribution across variants while keeping each user's assignment stable. This prevents cross-variant contamination and simplifies statistical analysis.
- Server-Side Evaluation Fallback: The client implementation above is lightweight. In production, variant evaluation should be resolved server-side via a feature flag service (e.g., LaunchDarkly, Unleash, or a custom gRPC endpoint). Reserve client-side hashing for edge cases where latency-sensitive UI rendering cannot wait for a network round-trip.
- Async Telemetry Pipeline: Synchronous event emission blocks the main thread and inflates the latency_p95 guardrail metric. The Beacon API plus queue pattern lets telemetry be delivered even during page unload, without degrading user experience.
- Metric Contract Enforcement: Defining minDetectableEffect, alpha, and power upfront prevents post-hoc rationalization. These parameters feed directly into sample size calculators and sequential testing engines, so the statistical design is fixed before traffic is allocated.
- Decoupled Analytics Ingestion: Telemetry endpoints should route to an event streaming layer (Kafka, Kinesis, or managed equivalents) rather than writing directly to a warehouse. This enables real-time SRM (Sample Ratio Mismatch) detection, anomaly alerts, and pipeline backpressure handling.
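The server-side-first decision with a local fallback can be sketched as a race between the remote call and a timeout. This is an illustration under stated assumptions: resolveVariant, fetchRemote, and localFallback are hypothetical names, the 150 ms default is arbitrary, and no real flag service SDK is being modeled.

```typescript
type Variant = 'control' | 'treatment_A' | 'treatment_B';

async function resolveVariant(
  fetchRemote: () => Promise<Variant | null>, // assumption: a call into your flag service
  localFallback: () => Variant | null,        // e.g. deterministic client-side hashing
  timeoutMs = 150
): Promise<{ variant: Variant | null; source: 'remote' | 'local' }> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  try {
    const result = await Promise.race([fetchRemote(), timeout]);
    if (result !== 'timeout') return { variant: result, source: 'remote' };
  } catch {
    // A remote failure is handled the same way as a timeout.
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
  return { variant: localFallback(), source: 'local' };
}
```

The fallback is only safe if the local and remote paths implement the same assignment logic. Otherwise a user could see one variant when the service responds and another when it times out. Recording the source alongside the impression makes such disagreements visible in analysis.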
Pitfall Guide
1. Running Concurrent Tests Without Orthogonal Allocation
Running multiple experiments on the same user cohort creates interaction bias. If Experiment A alters checkout flow and Experiment B modifies payment options, their effects compound unpredictably.
Best Practice: Use orthogonal traffic allocation or mutually exclusive user segments. Reserve 10–15% of traffic for overlapping tests, but document interaction assumptions explicitly.
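One common way to get both properties is layered bucketing: experiments inside a layer share one hash space and are mutually exclusive, while different layers hash with different salts and are therefore approximately independent of each other. The sketch below assumes that design; LayerSlot, bucketFor, and experimentInLayer are hypothetical names.

```typescript
import { createHash } from 'crypto';

interface LayerSlot {
  experimentId: string;
  start: number; // inclusive bucket index
  end: number;   // exclusive bucket index
}

const BUCKETS = 10000;

// Salting with the layer ID gives each layer its own independent bucketing.
// The modulo introduces a negligible bias for a 32-bit input.
function bucketFor(userId: string, salt: string): number {
  const hex = createHash('sha256').update(`${userId}::${salt}`).digest('hex');
  return parseInt(hex.slice(0, 8), 16) % BUCKETS;
}

// Within one layer a user lands in at most one experiment (mutual exclusion).
function experimentInLayer(userId: string, layerId: string, slots: LayerSlot[]): string | null {
  const bucket = bucketFor(userId, layerId);
  const slot = slots.find((s) => bucket >= s.start && bucket < s.end);
  return slot ? slot.experimentId : null;
}
```

Experiments that could plausibly interact, such as checkout flow and payment options, would share a layer. Unrelated experiments can sit in separate layers and overlap freely.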
2. Statistical Peeking and Early Stopping
Checking results daily and stopping as soon as p < 0.05 can inflate the false positive rate far above the nominal 5%, to 30–50% with enough interim looks. Fixed-horizon frequentist tests assume the sample size is set before the test starts.
Best Practice: Implement sequential testing (SPRT) or Bayesian hierarchical models. Pre-calculate required sample size and enforce a minimum run duration (usually 1–2 full business cycles).
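To illustrate why a sequential test tolerates continuous monitoring, here is a sketch of Wald's SPRT for a single stream of Bernoulli outcomes. It is a deliberate simplification: it tests a fixed rate p0 against a fixed rate p1 rather than comparing two live arms, and the function name and parameters are assumptions for illustration. The stopping boundaries depend only on alpha and beta, so checking after every observation is part of the design rather than a violation of it.

```typescript
type SprtDecision = 'accept_h0' | 'accept_h1' | 'continue';

function sprtBernoulli(
  outcomes: boolean[], // true = conversion
  p0: number,          // rate under the null hypothesis
  p1: number,          // rate under the alternative
  alpha: number,       // tolerated false positive rate
  beta: number         // tolerated false negative rate (1 - power)
): SprtDecision {
  // Wald's approximate boundaries on the log-likelihood ratio.
  const upper = Math.log((1 - beta) / alpha);
  const lower = Math.log(beta / (1 - alpha));
  let llr = 0;
  for (const converted of outcomes) {
    llr += converted ? Math.log(p1 / p0) : Math.log((1 - p1) / (1 - p0));
    if (llr >= upper) return 'accept_h1';
    if (llr <= lower) return 'accept_h0';
  }
  return 'continue'; // not enough evidence either way yet
}
```

A production engine would typically use a two-sample or mixture variant of this idea, but the structure is the same: accumulate evidence, stop only when a pre-declared boundary is crossed.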
3. Ignoring Guardrail Metrics
Optimizing for a primary metric often degrades system health. A checkout variant may increase conversion but spike error rates or latency.
Best Practice: Always track guardrail metrics. Implement automated rollback triggers: if error_rate exceeds baseline by more than 0.5 percentage points or latency_p95 degrades by more than 200 ms, pause the experiment immediately.
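A rollback trigger of this kind reduces to comparing each observed guardrail against its baseline plus an allowed increase. The sketch below assumes both values are expressed in the same unit per metric (milliseconds for latency, percentage points for rates); GuardrailThreshold and breachedGuardrails are hypothetical names.

```typescript
interface GuardrailThreshold {
  baseline: number;    // expected value without the experiment
  maxIncrease: number; // allowed absolute increase, same unit as baseline
}

function breachedGuardrails(
  observed: Record<string, number>,
  thresholds: Record<string, GuardrailThreshold>
): string[] {
  return Object.entries(thresholds)
    .filter(([metric, threshold]) => {
      const value = observed[metric];
      // A metric with no observation is not counted as a breach here;
      // a stricter policy could treat missing data as a failure.
      return value !== undefined && value - threshold.baseline > threshold.maxIncrease;
    })
    .map(([metric]) => metric);
}
```

A monitoring job could call this on each aggregation window and pause the experiment whenever the returned list is non-empty.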
4. Hardcoding Variants Instead of Using a Flag Service
Inline conditionals (if (user.id % 2 === 0)) create technical debt. They cannot be toggled remotely, audited, or cleaned up systematically.
Best Practice: Route all variant logic through a centralized flag service. Tag flags with expiration dates and owner metadata. Automate cleanup via CI/CD pipeline checks.
5. Inconsistent Randomization Seeds Across Environments
Using different hashing algorithms or seed values in staging vs production causes environment drift. Tests that pass in staging may fail in production due to allocation mismatches.
Best Practice: Enforce identical assignment logic across all environments. Validate allocation distribution using A/A tests before launching production experiments.
6. Treating Feature Flags as Permanent Architecture
Flags accumulate. Teams forget to remove them, leading to branching complexity, increased bundle size, and unpredictable runtime behavior.
Best Practice: Implement flag lifecycle management. Require documentation, owner assignment, and expiration dates. Run monthly flag audits and automate deprecation warnings in code reviews.
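A flag audit can be a small function run in CI or on a schedule. The following sketch assumes flag metadata is available as plain records; FlagRecord and auditFlags are hypothetical names, and the 90-day default is an illustrative choice.

```typescript
interface FlagRecord {
  key: string;
  owner?: string;
  createdAt: Date;
  expiresAt?: Date;
}

function auditFlags(flags: FlagRecord[], now: Date, maxFlagAgeDays = 90): Map<string, string[]> {
  const msPerDay = 24 * 60 * 60 * 1000;
  const findings = new Map<string, string[]>();
  for (const flag of flags) {
    const problems: string[] = [];
    if (!flag.owner) problems.push('missing owner');
    if (!flag.expiresAt) problems.push('missing expiration');
    else if (flag.expiresAt.getTime() < now.getTime()) problems.push('expired');
    const ageDays = (now.getTime() - flag.createdAt.getTime()) / msPerDay;
    if (ageDays > maxFlagAgeDays) problems.push('older than max age');
    if (problems.length > 0) findings.set(flag.key, problems);
  }
  return findings;
}
```

Failing the build when the returned map is non-empty turns flag cleanup from a periodic chore into an enforced rule.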
7. Skipping Sample Ratio Mismatch (SRM) Checks
SRM occurs when observed traffic distribution deviates significantly from expected allocation. It indicates instrumentation bugs, ad blockers, or routing errors.
Best Practice: Run chi-squared goodness-of-fit tests on impression counts. If SRM p-value < 0.01, halt analysis and debug telemetry routing before proceeding.
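For a two-arm experiment the check is short enough to write by hand. The sketch below computes the chi-squared goodness-of-fit statistic and compares it with 6.635, the critical value for one degree of freedom at p = 0.01, instead of computing an exact p-value. It only covers the two-arm case; srmStatistic and hasSrm are hypothetical names.

```typescript
// Chi-squared critical value for 1 degree of freedom at p = 0.01.
const CHI2_CRITICAL_DF1_P01 = 6.635;

function srmStatistic(observed: [number, number], expectedRatio: [number, number] = [1, 1]): number {
  const total = observed[0] + observed[1];
  const expected0 = (total * expectedRatio[0]) / (expectedRatio[0] + expectedRatio[1]);
  const expected1 = total - expected0;
  return (
    (observed[0] - expected0) ** 2 / expected0 +
    (observed[1] - expected1) ** 2 / expected1
  );
}

// True when the observed split is unlikely (p < 0.01) under the expected ratio.
function hasSrm(observed: [number, number], expectedRatio: [number, number] = [1, 1]): boolean {
  return srmStatistic(observed, expectedRatio) > CHI2_CRITICAL_DF1_P01;
}
```

For example, 5,200 versus 4,800 impressions on a planned 50/50 split gives a statistic of 16, well past the threshold, while 5,050 versus 4,950 gives 1 and passes.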
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / MVP Validation | Client-side hashing + lightweight telemetry | Fast iteration, minimal infrastructure overhead, validates core hypotheses quickly | Low initial cost; scales poorly beyond 10 concurrent tests |
| Enterprise / High-Traffic Product | Server-side evaluation + event streaming + sequential testing engine | Handles millions of daily events, prevents client manipulation, supports complex orthogonal allocation | Higher infrastructure cost; reduces engineering waste by 30–40% |
| Regulated / Compliance-Heavy Domain | Deterministic assignment + immutable audit logs + Bayesian analysis | Meets regulatory requirements for change tracking, avoids p-hacking, ensures reproducible results | Moderate cost; avoids compliance penalties and rollback liabilities |
| Mobile / Offline-First Apps | Local flag caching + deferred telemetry sync | Maintains UX during connectivity loss, ensures variant stability, batches events for efficient upload | Low network cost; requires careful cache invalidation strategy |
Configuration Template
// config/experimentation.ts
export const EXPERIMENTATION_CONFIG = {
  sdk: {
    evaluation: 'server-side', // 'client-side' | 'server-side' | 'hybrid'
    assignment: {
      algorithm: 'sha256',
      salt: process.env.EXPERIMENT_SALT || 'default-salt',
      trafficBucketSize: 10000, // 0.01% precision
    },
    telemetry: {
      endpoint: '/api/v1/experiments/telemetry',
      batchSize: 10,
      flushIntervalMs: 5000,
      keepalive: true,
      retryAttempts: 3,
    },
    guardrails: {
      enabled: true,
      metrics: ['latency_p95', 'error_rate', 'crash_rate'],
      thresholds: {
        latency_p95: { baselineMs: 450, maxIncreaseMs: 200 },
        error_rate: { baselinePercent: 0.8, maxIncreasePercent: 0.5 },
        crash_rate: { baselinePercent: 0.1, maxIncreasePercent: 0.2 },
      },
    },
    analysis: {
      method: 'sequential', // 'frequentist' | 'bayesian' | 'sequential'
      minSampleSize: 2000,
      minDurationHours: 48,
      srmCheckEnabled: true,
    },
  },
  lifecycle: {
    autoCleanup: true,
    maxFlagAgeDays: 90,
    requireOwner: true,
    requireExpiration: true,
  },
};
Quick Start Guide
- Install SDK & Configure Environment: Add your experimentation client library to the project. Set EXPERIMENT_SALT and telemetry endpoint variables in your environment configuration. Initialize the SDK during application bootstrap.
- Create Experiment Contract: Define the hypothesis, primary metric, guardrails, and statistical parameters in a TypeScript interface. Commit the contract to version control alongside feature code.
- Implement Assignment Hook: Integrate the deterministic assignment logic into your UI layer using the provided hook pattern. Wrap variant-dependent components and emit impression telemetry on mount.
- Validate with A/A Test: Deploy the implementation to a staging environment. Run a control-vs-control allocation for 24 hours. Verify uniform distribution, confirm SRM p-value > 0.01, and validate telemetry ingestion.
- Launch & Monitor: Enable traffic allocation in production. Monitor guardrail metrics and SRM checks for the first 48 hours. Pause or rollback automatically if thresholds are breached. Proceed to statistical evaluation only after minimum sample size and duration are met.