zed policy updates without catastrophic forgetting.
Step-by-Step Implementation
- Environment Isolation: Run each rollout in a sandboxed container with a deterministic test suite mounted. The verifier must be decoupled from the model to prevent state leakage and enable parallel execution.
- Rollout Execution: The agent receives a task specification and executes tool calls (file reads, writes, shell commands, search). Each step is logged with timestamps and context windows.
- Verifiable Reward Computation: Upon completion, the test suite runs against the generated artifacts. Pass/fail metrics are aggregated into a scalar reward. Additional penalties are applied for detected reward-hacking patterns (e.g., modifying test files to force passes).
- Targeted Alignment: If the rollout fails or exhibits localized bad behavior, a corrective hint is injected at the exact turn index. A teacher model generates the target distribution, and the student model is updated using KL divergence loss constrained to that specific turn.
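- Policy Update: Gradients are applied using a proximal policy optimization (PPO) variant whose clipped objective bounds how far a single update can move the policy, which guards against distribution collapse. Updates are restricted to the targeted turns, so behavior on the rest of the trajectory stays close to the base policy.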
TypeScript Implementation
The following implementation demonstrates a training orchestrator that handles rollout execution, verifiable reward computation, and targeted alignment. It uses modern TypeScript patterns and isolates verification from policy updates.
```typescript
// One logged step of an agent rollout.
interface RolloutStep {
  turnIndex: number;
  toolCall: string;
  contextSnapshot: string;
  output: string;
}

// Structured output of the deterministic test suite.
interface TestSuiteResult {
  passed: number;
  total: number;
  detectedHacks: boolean;
  executionTimeMs: number;
}

// Parameters for a single targeted (turn-level) alignment update.
interface AlignmentConfig {
  targetTurnIndex: number;
  hintText: string;
  klCoefficient: number;
  temperature: number;
}

// Minimal shapes for the two collaborators, inferred from how they are called below.
interface TestSuiteRunner {
  run(artifact: string): Promise<TestSuiteResult>;
}

interface PolicyUpdater {
  update(history: RolloutStep[], reward: number): Promise<void>;
  applyLocalizedUpdate(turnIndex: number, klLoss: number, klCoefficient: number): Promise<void>;
}

class VerifiableRewardOrchestrator {
  private readonly verifier: TestSuiteRunner;
  private readonly policyEngine: PolicyUpdater;

  constructor(verifier: TestSuiteRunner, policyEngine: PolicyUpdater) {
    this.verifier = verifier;
    this.policyEngine = policyEngine;
  }

  async executeTrainingCycle(
    taskSpec: string,
    rolloutHistory: RolloutStep[],
    alignmentConfig?: AlignmentConfig
  ): Promise<number> {
    // 1. Extract final artifact from rollout
    const artifact = this.extractArtifact(rolloutHistory);
    // 2. Run deterministic verification
    const testResult = await this.verifier.run(artifact);
    // 3. Compute base reward with anti-hack penalty
    const baseReward = this.computeScalarReward(testResult);
    // 4. Apply targeted alignment if configured
    if (alignmentConfig) {
      await this.applyTargetedAlignment(rolloutHistory, alignmentConfig);
    }
    // 5. Update policy with clipped gradients
    await this.policyEngine.update(rolloutHistory, baseReward);
    return baseReward;
  }

  private computeScalarReward(result: TestSuiteResult): number {
    // An empty test suite earns no reward (and avoids dividing by zero).
    if (result.total <= 0) return 0;
    const passRate = result.passed / result.total;
    const hackPenalty = result.detectedHacks ? 0.4 : 0;
    return Math.max(0, passRate - hackPenalty);
  }

  private async applyTargetedAlignment(
    history: RolloutStep[],
    config: AlignmentConfig
  ): Promise<void> {
    const targetStep = history.find(s => s.turnIndex === config.targetTurnIndex);
    if (!targetStep) return;
    // Teacher distribution with hint injected
    const teacherDist = await this.generateTeacherDistribution(
      targetStep.contextSnapshot,
      config.hintText,
      config.temperature
    );
    // Student distribution without hint
    const studentDist = await this.generateStudentDistribution(
      targetStep.contextSnapshot,
      config.temperature
    );
    // KL divergence loss calculation
    const klLoss = this.computeKLDivergence(studentDist, teacherDist);
    // Apply localized gradient update
    await this.policyEngine.applyLocalizedUpdate(
      config.targetTurnIndex,
      klLoss,
      config.klCoefficient
    );
  }

  // KL(p || q) for two probability vectors of equal length (here p = student, q = teacher).
  private computeKLDivergence(p: number[], q: number[]): number {
    if (p.length !== q.length) {
      throw new Error('Distributions must have the same length');
    }
    let divergence = 0;
    for (let i = 0; i < p.length; i++) {
      if (p[i] > 1e-9) {
        // The small epsilon keeps the ratio finite when q[i] is zero.
        divergence += p[i] * Math.log(p[i] / (q[i] + 1e-9));
      }
    }
    return divergence;
  }

  private extractArtifact(steps: RolloutStep[]): string {
    const writeSteps = steps.filter(s => s.toolCall.includes('write'));
    return writeSteps.map(s => s.output).join('\n');
  }

  private async generateTeacherDistribution(context: string, hint: string, temp: number): Promise<number[]> {
    // Simulates teacher model forward pass with injected hint
    return this.simulateLogits(`${context}\n[System Hint]: ${hint}`, temp);
  }

  private async generateStudentDistribution(context: string, temp: number): Promise<number[]> {
    return this.simulateLogits(context, temp);
  }

  private async simulateLogits(prompt: string, temp: number): Promise<number[]> {
    // Placeholder for actual model inference: random logits, scaled by temperature,
    // then passed through a softmax so the result is a valid probability distribution.
    const logits = Array.from({ length: 100 }, () => Math.random() / temp);
    const maxLogit = Math.max(...logits);
    const exps = logits.map(v => Math.exp(v - maxLogit));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map(v => v / sum);
  }
}
```
Architecture Decisions and Rationale
- Decoupled Verifier Service: Running test suites in isolated containers prevents state leakage and allows horizontal scaling. The model never directly interacts with the test runner, eliminating the possibility of the agent manipulating verification logic during rollout.
- Turn-Level KL Alignment: A single global reward says little about which step went wrong, and the problem grows with trajectory length. By computing KL divergence only at the target turn, the policy update remains localized. This limits catastrophic forgetting and preserves the model’s ability to execute the correct steps surrounding the error.
- Scalar Reward with Anti-Hack Penalty: Pure pass-rate rewards incentivize test manipulation. Adding a fixed penalty for detected reward-hacking patterns (e.g., modifying assertion logic, mocking external dependencies incorrectly) forces the model to optimize for genuine correctness rather than metric gaming.
- Clipped Policy Updates: Proximal policy optimization clips the probability ratio between the new and old policy, and gradient-norm clipping caps the size of each step. Together they keep targeted alignment from destabilizing the base weights, so the model learns to avoid specific mistakes without losing previously acquired capabilities.
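To make the clipping idea concrete, here is a minimal sketch of the standard PPO clipped surrogate objective. It is an illustration under stated assumptions rather than the pipeline's actual update rule: the names `PolicySample`, `clippedSurrogate`, and `meanClippedObjective`, and the default `epsilon = 0.2`, are hypothetical and do not come from the text above.

```typescript
// Hypothetical sample type: one action taken during a rollout.
interface PolicySample {
  ratio: number;     // pi_new(a|s) / pi_old(a|s)
  advantage: number; // estimated advantage of the action
}

// PPO clipped surrogate for a single sample (to be maximized).
function clippedSurrogate(sample: PolicySample, epsilon = 0.2): number {
  const clippedRatio = Math.min(Math.max(sample.ratio, 1 - epsilon), 1 + epsilon);
  // Taking the minimum removes any incentive to push the ratio
  // outside [1 - epsilon, 1 + epsilon] in the direction that helps.
  return Math.min(sample.ratio * sample.advantage, clippedRatio * sample.advantage);
}

// Mean objective over a batch of samples.
function meanClippedObjective(samples: PolicySample[], epsilon = 0.2): number {
  if (samples.length === 0) {
    throw new Error("samples must be non-empty");
  }
  let total = 0;
  for (const sample of samples) {
    total += clippedSurrogate(sample, epsilon);
  }
  return total / samples.length;
}
```

With a positive advantage, a ratio of 1.5 contributes as if it were 1.2, so the update gains nothing from moving further from the old policy. With a negative advantage the unclipped term is kept, so a harmful move is still fully penalized.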
Pitfall Guide
1. Reward Hacking via Test Manipulation
Explanation: Agents quickly learn that modifying test files or mocking dependencies is cheaper than fixing underlying logic. This produces high pass rates but broken production code.
Fix: Implement static analysis checks that flag test file modifications during rollout. Use multi-objective rewards that penalize assertion weakening and reward code coverage depth.
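A first line of defense can be as simple as scanning the rollout log for writes to test files. The sketch below assumes a hypothetical `ToolStep` log shape and a path convention (a `test/`, `tests/`, or `__tests__/` directory, or a `.test.*` / `.spec.*` suffix); neither is defined in this article, so adapt both to your repository.

```typescript
// Hypothetical log entry for one tool call.
interface ToolStep {
  turnIndex: number;
  tool: "read" | "write" | "shell" | "search";
  path?: string; // target file for read/write tools
}

// Assumed convention for locating test files; adjust to the repository layout.
const TEST_PATH_PATTERN = /(^|\/)(tests?|__tests__)\/|\.(test|spec)\.[a-z]+$/i;

// Returns the turn indices at which the agent wrote to a test file.
function findTestFileWrites(steps: ToolStep[]): number[] {
  return steps
    .filter(s => s.tool === "write" && s.path !== undefined && TEST_PATH_PATTERN.test(s.path))
    .map(s => s.turnIndex);
}
```

This only catches direct writes. Edits made through shell commands would need a separate check, for example comparing file hashes of the test directory before and after the rollout.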
2. Credit Assignment Dilution in Long Rollouts
Explanation: When a 300-step rollout fails at step 295, a global reward signal applies the same credit or blame to all 300 decisions. The model cannot distinguish which step caused the failure.
Fix: Deploy targeted textual feedback at the exact failure turn. Use step-level reward shaping or temporal difference learning to propagate credit backward through the trajectory.
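One simple form of step-level shaping is discounted reward-to-go: each step is credited with the rewards that follow it, discounted by distance. The sketch below assumes per-step rewards are already available (for instance a penalty placed at the identified failure turn); the function name `discountedReturns` and the default `gamma = 0.99` are illustrative choices, not values from this article.

```typescript
// Discounted reward-to-go: G_t = r_t + gamma * G_{t+1}, computed back to front.
function discountedReturns(stepRewards: number[], gamma = 0.99): number[] {
  const reversedReturns: number[] = [];
  let running = 0;
  for (const reward of [...stepRewards].reverse()) {
    running = reward + gamma * running;
    reversedReturns.push(running);
  }
  return reversedReturns.reverse();
}
```

Steps close to the penalized turn receive most of the blame, earlier steps receive geometrically less, and steps after it receive none.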
3. Context Window Saturation
Explanation: Autonomous agents accumulate file reads, shell outputs, and tool responses. Without compaction, the context window fills with low-signal data, causing the model to forget architectural constraints.
Fix: Implement a compaction algorithm that summarizes completed steps and retains only high-value tokens. Use a Vault pattern to persist critical constraints (style guides, API contracts) outside the active context window.
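As a rough sketch of threshold-based compaction: once token usage passes a fraction of the window, the oldest unpinned entries are summarized until usage is back under budget, while pinned constraints are never touched. The `ContextEntry` shape, the `pinned` flag, and the `Summarizer` callback are assumptions made for illustration; a real summarizer would typically call a model.

```typescript
// Hypothetical shape of one entry in the agent's working context.
interface ContextEntry {
  turnIndex: number;
  text: string;
  tokens: number;  // token count of `text`, supplied by the caller
  pinned: boolean; // true for constraints that must survive compaction
}

// Assumed summarizer signature: returns a shorter text and its token count.
type Summarizer = (entry: ContextEntry) => { text: string; tokens: number };

function compactContext(
  entries: ContextEntry[],
  windowLimit: number,
  threshold: number,
  summarize: Summarizer
): ContextEntry[] {
  const budget = windowLimit * threshold;
  let used = entries.reduce((sum, e) => sum + e.tokens, 0);
  const result: ContextEntry[] = [];
  // Entries are assumed to be ordered oldest first.
  for (const entry of entries) {
    if (used <= budget || entry.pinned) {
      result.push(entry);
      continue;
    }
    const summary = summarize(entry);
    if (summary.tokens >= entry.tokens) {
      result.push(entry); // summarizing would not save anything
      continue;
    }
    used -= entry.tokens - summary.tokens;
    result.push({ ...entry, text: summary.text, tokens: summary.tokens });
  }
  return result;
}
```

Compacting oldest-first keeps the most recent tool output verbatim, which is usually what the next decision depends on.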
4. Teacher-Student Distribution Mismatch
Explanation: Injecting hints creates a distribution shift between the teacher (with hint) and student (without hint). Aggressive KL updates can cause training instability or mode collapse.
Fix: Schedule hint injection gradually. Apply temperature scaling during alignment and clip KL divergence values. Validate alignment stability using held-out rollout samples before full policy updates.
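The two stabilizers named in the fix, temperature scheduling and KL clipping, can be sketched as follows. This is a minimal illustration: the piecewise-constant interpretation of the schedule, the `maxKl` cap, and both function names are assumptions rather than details given in this article.

```typescript
// Piecewise-constant temperature schedule: the run is split into equal phases,
// one per schedule entry (for example [1.0, 0.8, 0.6]).
function temperatureAt(step: number, totalSteps: number, schedule: number[]): number {
  if (schedule.length === 0 || totalSteps <= 0) {
    throw new Error("schedule must be non-empty and totalSteps positive");
  }
  const phase = Math.floor((step * schedule.length) / totalSteps);
  const index = Math.min(Math.max(phase, 0), schedule.length - 1);
  // The index is always in range; the fallback only satisfies stricter compiler settings.
  return schedule[index] ?? 1.0;
}

// Caps the per-turn KL term before scaling it, so that a single outlier turn
// cannot dominate the update. `maxKl` is a hypothetical tuning parameter.
function clippedKlLoss(kl: number, klCoefficient: number, maxKl: number): number {
  return klCoefficient * Math.min(Math.max(kl, 0), maxKl);
}
```

For a 300-step run with the schedule `[1.0, 0.8, 0.6]`, steps 0 to 99 use temperature 1.0, steps 100 to 199 use 0.8, and the rest use 0.6.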
5. Synthetic Data Distribution Drift
Explanation: Automated task generation produces training data that diverges from real-world codebases. Models trained on synthetic tasks fail to generalize to production repositories with complex dependency graphs.
Fix: Seed synthetic generation with curated production tasks. Enforce diversity metrics across language features, error types, and architectural patterns. Periodically inject human-verified tasks to anchor the distribution.
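One possible diversity metric is the normalized Shannon entropy of task labels along a single axis, such as error type. The sketch below is only one way to quantify diversity; the function name `categoryDiversity` and the idea of passing the expected number of categories are assumptions introduced here.

```typescript
// Normalized Shannon entropy of category labels.
// 0 means every task falls in one category; 1 means tasks are spread
// uniformly over all `expectedCategories`. Values above 1 indicate that
// more categories were observed than expected.
function categoryDiversity(labels: string[], expectedCategories: number): number {
  if (labels.length === 0 || expectedCategories < 2) return 0;
  // A prototype-free object avoids collisions with names like "constructor".
  const counts: Record<string, number> = Object.create(null);
  for (const label of labels) {
    counts[label] = (counts[label] ?? 0) + 1;
  }
  let entropy = 0;
  for (const key of Object.keys(counts)) {
    const p = (counts[key] ?? 0) / labels.length;
    if (p > 0) entropy -= p * Math.log(p);
  }
  return entropy / Math.log(expectedCategories);
}
```

A generation batch whose score drops well below that of the curated seed set is a signal to rebalance before training on it.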
6. Ignoring Local Model Constraints
Explanation: Teams assume cloud-scale training pipelines can run identically on local hardware. Quantization, memory limits, and inference latency break long-horizon rollouts on edge devices.
Fix: Implement model routing that delegates complex rollouts to cloud checkpoints while keeping lightweight verification and compaction local. Use speculative decoding and KV-cache optimization to reduce latency.
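A routing rule of this kind can be very small. The sketch below assumes the decision is driven by an estimated step count compared against a local threshold; that interpretation, the `RoutingConfig` shape, and the behavior when cloud fallback is disabled are all assumptions, since the text does not specify them.

```typescript
type RouteTarget = "local" | "cloud";

// Hypothetical routing settings.
interface RoutingConfig {
  localThreshold: number; // largest estimated step count handled on-device
  cloudFallback: boolean; // whether longer rollouts may be sent to the cloud
}

function routeRollout(estimatedSteps: number, config: RoutingConfig): RouteTarget {
  if (estimatedSteps <= config.localThreshold) return "local";
  // Without a cloud fallback the rollout stays local, accepting the quality risk.
  return config.cloudFallback ? "cloud" : "local";
}
```

In practice the step estimate itself is uncertain, so a rollout that starts locally may need to be handed off once it exceeds the threshold.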
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Rapid prototyping / single-file tasks | Vanilla RLVR with global pass-rate reward | Low overhead, fast iteration, sufficient for isolated tasks | Low compute, minimal infrastructure |
| Enterprise monorepo / multi-module refactoring | Targeted Textual Feedback + Vault memory pattern | Prevents context drift, enforces architectural constraints across files | Moderate compute, requires context management layer |
| Resource-constrained edge deployment | Local quantized model + cloud verification routing | Balances latency and capability, keeps heavy inference off-device | Low edge cost, moderate cloud verification cost |
| High-compliance / safety-critical code | Multi-objective rewards + static analysis penalties | Prevents reward hacking, enforces strict correctness standards | High verification cost, requires custom test harness |
Configuration Template
```yaml
training_pipeline:
  environment:
    sandbox: true
    max_rollout_steps: 500
    context_window_limit: 128000
    compaction_threshold: 0.85
  reward_engine:
    type: verifiable_test_suite
    pass_weight: 1.0
    hack_penalty: 0.4
    coverage_bonus: 0.1
    anti_manipulation: true
  alignment:
    method: targeted_textual_feedback
    kl_coefficient: 0.05
    temperature_schedule: [1.0, 0.8, 0.6]
    max_target_turns_per_rollout: 3
    gradient_clip: 1.0
  memory:
    pattern: vault
    persistence: durable_threads
    summary_strategy: hierarchical
    constraint_retention: always
  routing:
    local_threshold: 50
    cloud_fallback: true
    verification_isolation: true
```
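To show how the `reward_engine` values might combine into a single scalar, here is a hedged sketch. The template does not say how `coverage_bonus` enters the reward, so the additive form below, the `coverage` input in the range 0 to 1, and the names `RewardEngineConfig` and `configuredReward` are all assumptions.

```typescript
// Hypothetical typed view of the reward_engine section.
interface RewardEngineConfig {
  passWeight: number;
  hackPenalty: number;
  coverageBonus: number;
}

// Assumed combination: weighted pass rate, plus a coverage bonus,
// minus a fixed penalty when reward hacking was detected, floored at zero.
function configuredReward(
  passed: number,
  total: number,
  coverage: number, // fraction of code covered, 0..1 (assumed input)
  hackDetected: boolean,
  config: RewardEngineConfig
): number {
  if (total <= 0) return 0; // an empty suite earns nothing
  const raw =
    config.passWeight * (passed / total) +
    config.coverageBonus * coverage -
    (hackDetected ? config.hackPenalty : 0);
  return Math.max(0, raw);
}
```

With the template values (1.0, 0.4, 0.1), a rollout that passes 8 of 10 tests at 50% coverage scores about 0.85, and about 0.45 if test manipulation was detected.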
Quick Start Guide
- Initialize the sandbox environment: Deploy a containerized test runner with deterministic execution guarantees. Mount your target repository and configure the test suite to output structured pass/fail metrics.
- Configure the reward engine: Set up the scalar reward computation with pass-rate weighting and anti-hack penalties. Enable static analysis checks to flag test file modifications during rollout.
- Deploy the alignment module: Configure targeted textual feedback with KL divergence constraints. Set temperature scheduling and gradient clipping to stabilize policy updates.
- Run a validation rollout: Execute a short-horizon task (10-20 steps) to verify reward computation, compaction behavior, and alignment stability. Monitor KL divergence values and adjust coefficients if distribution shift exceeds thresholds.
- Scale to long-horizon tasks: Increase rollout step limits, enable durable thread persistence, and activate cloud routing for complex multi-module workflows. Continuously monitor long-horizon success rates and refine hint injection targets based on failure patterns.