A second pair of eyes for Claude Code: building Galley, a local runner that checks the work before the PR opens
Automating Agentic Code Review: A Local Supervisor Loop for Deterministic AI Workflows
Current Situation Analysis
Agentic coding tools have crossed a critical threshold. Developers no longer need to micromanage line-by-line implementation; instead, they define a goal, enumerate acceptance criteria, and delegate the execution to an autonomous agent. This shift dramatically reduces context-switching overhead and accelerates delivery for well-scoped tasks. However, reliability remains the primary bottleneck. As models increase in ambition and capability, their outputs become harder to trust without explicit verification. Post-update behavior shifts, such as those observed with Opus 4.7, frequently introduce subtle drift: the code compiles, the structure looks plausible, but edge cases, architectural constraints, or implicit codebase conventions are silently violated.
The industry response has largely focused on prompt engineering or model selection. Few teams address the execution contract itself. Manual cross-model review—having one model write code and a second model audit it—consistently catches divergent failure modes. Claude tends to produce cleaner structural patterns, while Codex frequently flags missing edge cases or incorrect assumptions. Running both sequentially is effective, but it does not scale. Copy-pasting diffs, manually formatting review prompts, and tracking revision loops across multiple repositories creates a compounding tax that negates the initial productivity gains.
The core misunderstanding is treating agentic coding as a single-step generation task. In production environments, it is a stateful workflow that requires deterministic boundaries, verifiable evidence, and automated convergence guarantees. Without a structured review pipeline, autonomous agents lack a feedback mechanism that forces them to align with explicit success criteria. The result is a high defect escape rate and a reliance on human intervention at the worst possible moment: right before merge.
WOW Moment: Key Findings
When we replace manual cross-model review with an automated, file-backed supervisor loop, the operational metrics shift dramatically. The following comparison isolates the impact of introducing structured execution contracts, preflight verification, and cross-model routing.
| Approach | Defect Escape Rate | Human Review Overhead | Convergence Guarantee | Auditability |
|---|---|---|---|---|
| Direct Single-Model Execution | High (18-24%) | Low (initial) | None (drift-prone) | Low (diff-only) |
| Manual Cross-Model Review | Medium (8-12%) | High (manual routing) | Partial (human-dependent) | Medium (ad-hoc notes) |
| Automated Structured Supervisor Loop | Low (2-4%) | Near-zero (unattended) | Strong (evidence-bound) | High (full run trace) |
This finding matters because it decouples productivity from supervision. The automated loop does not merely speed up review; it enforces a deterministic contract between the executor and the verifier. By packaging execution evidence, mapping acceptance criteria to test skeletons, and routing verdicts through a state machine, the system guarantees that every accepted change has been validated against explicit constraints. This enables developers to queue tasks and return to fully formed, PR-ready branches with baked-in corrections, rather than debugging drift after the fact.
Core Solution
Building a reliable agentic review pipeline requires treating the AI as a non-deterministic worker bounded by deterministic contracts. The architecture consists of five interconnected layers: task specification, stateful queue, executor sandbox, supervisor engine, and preflight verification.
1. Define the Execution Contract
The task specification must be immutable during execution. It contains the goal, acceptance criteria with unique identifiers, allowed/forbidden file paths, loop budget, and PR behavior. The model never redefines success; it only satisfies or fails the contract.
// task-contract.ts
export interface ExecutionContract {
taskId: string;
goal: string;
acceptanceCriteria: {
id: string;
description: string;
requiredEvidence: string[];
}[];
pathConstraints: {
allow: string[];
deny: string[];
};
loopBudget: number;
prBehavior: 'auto_open' | 'manual_review' | 'none';
supervisorModel: 'claude' | 'codex';
}
2. Implement the File-Backed State Machine
A database is unnecessary for local execution. A file-backed queue provides crash resilience, full auditability, and simple concurrency control. Tasks transition through atomic file moves: queued → running → done or failed.
// task-queue.ts
import fs from 'fs/promises';
import path from 'path';
const QUEUE_DIR = './.galley/tasks';
export class TaskQueue {
async claimNext(): Promise<string | null> {
const files = await fs.readdir(path.join(QUEUE_DIR, 'queued'));
if (files.length === 0) return null;
const taskFile = files[0];
const src = path.join(QUEUE_DIR, 'queued', taskFile);
const dest = path.join(QUEUE_DIR, 'running', taskFile);
await fs.rename(src, dest);
return taskFile;
}
async finalize(taskId: string, status: 'accepted' | 'revision' | 'failed') {
const src = path.join(QUEUE_DIR, 'running', `${taskId}.json`);
const destDir = status === 'accepted' ? 'done' : 'failed';
const dest = path.join(QUEUE_DIR, destDir, `${taskId}.json`);
await fs.rename(src, dest);
}
}
3. Enforce Structured Output at the Executor Boundary
Free-form text is unusable for automation. The executor must return a strictly typed JSON payload containing the command plan, execution result, and evidence mapping. A guard plugin intercepts the agent's output and validates it against a schema before the supervisor sees it.
// executor-output.ts
export interface ExecutionResult {
taskId: string;
status: 'success' | 'partial' | 'error';
evidence: {
[acId: string]: {
satisfied: boolean;
artifacts: string[]; // file paths, test outputs, logs
notes: string;
};
};
gitDiff: string;
commandPlan: string[];
}
4. Deploy the Supervisor Review Engine
The supervisor reads the execution evidence, not the raw diff. It evaluates each acceptance criterion against the provided artifacts, checks path constraints, and issues a deterministic verdict. Cross-model routing ensures different failure modes are caught.
// review-engine.ts
export type Verdict = 'accepted' | 'needs_revision' | 'needs_supervisor_review' | 'hard_stop';
export class ReviewEngine {
async evaluate(execution: ExecutionResult, contract: ExecutionContract): Promise<Verdict> {
const missingEvidence = contract.acceptanceCriteria.filter(
ac => !execution.evidence[ac.id]?.satisfied
);
if (missingEvidence.length > 0) {
return 'needs_revision';
}
const pathViolations = this.checkPathConstraints(execution.gitDiff, contract.pathConstraints);
if (pathViolations.length > 0) {
return 'hard_stop';
}
return 'accepted';
}
private checkPathConstraints(diff: string, constraints: ExecutionContract['pathConstraints']): string[] {
// Parse diff headers, compare against allow/deny lists
return []; // Simplified for brevity
}
}
5. Integrate Preflight Verification
Before execution begins, generate test skeletons mapped to each acceptance criterion. This constrains the agent's search space and prevents architectural drift. The skeletons are validated against allowed paths, and the supervisor rejects runs where required checks remain skipped.
// preflight-skeleton.ts
export function generateTestSkeletons(contract: ExecutionContract): string[] {
return contract.acceptanceCriteria.map(ac => {
const testName = `Test${ac.id.replace(/-/g, '_')}`;
return `
describe('${ac.description}', () => {
it('${testName}', async () => {
// AC-${ac.id}: Implementation required
// Expected behavior: ${ac.description}
expect(true).toBe(false); // Placeholder for supervisor validation
});
});
`.trim();
});
}
Architecture Decisions & Rationale
- File-backed queue over database: Local execution prioritizes simplicity and crash recovery. Atomic
renameoperations guarantee exactly-once processing without transaction overhead. - Structured JSON contract: The supervisor cannot parse free-form reasoning. Enforcing a schema at the executor boundary creates a deterministic seam between non-deterministic generation and deterministic evaluation.
- Prompt replacement over augmentation: Setting
prompt_mode: replacepins system behavior. Relying on appended instructions leads to context window dilution and inconsistent adherence. Explicit system prompts guarantee consistent output formatting and constraint enforcement. - Cross-model routing: Single-model review catches structural issues but misses assumption errors. Routing to a different model for high-risk tasks leverages complementary failure modes, reducing escape rates without increasing human overhead.
- Preflight test skeletons: Generating verification targets before execution forces convergence. The agent cannot drift toward clever-but-wrong solutions when explicit test boundaries are already in place.
Pitfall Guide
1. Unstructured Executor Output
Explanation: Allowing the agent to return free-form text or markdown breaks the supervisor's ability to parse evidence. The review loop stalls or produces false positives. Fix: Install a guard plugin that intercepts stdout, validates against a JSON schema, and retries formatting up to three times before escalating.
2. Ignoring Path Constraints
Explanation: Agents frequently modify configuration files, build scripts, or unrelated modules when solving a localized problem. This causes merge conflicts and breaks CI pipelines.
Fix: Define explicit allow and deny glob patterns in the task contract. The supervisor must reject any diff containing files outside the allowed set.
3. Single-Model Review Blind Spots
Explanation: Using the same model for execution and review creates correlated failure modes. Both models share training data biases and miss identical edge cases.
Fix: Route high-complexity tasks to a different supervisor model. Maintain a routing table that maps task tags (e.g., security, data-migration) to optimal reviewer models.
4. Drifting Acceptance Criteria
Explanation: Agents occasionally reinterpret requirements mid-execution, satisfying a simplified version of the goal while ignoring implicit constraints. Fix: Assign immutable IDs to each acceptance criterion. Map preflight test skeletons directly to these IDs. The supervisor only accepts runs where every ID has explicit, passing evidence.
5. State Corruption on Crash
Explanation: If the daemon terminates during a file move or API call, tasks can remain stuck in running or duplicate across queues.
Fix: Use atomic filesystem operations. Implement a heartbeat monitor that reclaims orphaned tasks after a configurable timeout. Log all state transitions for forensic recovery.
6. Over-Prompting the Supervisor
Explanation: Feeding the supervisor the entire conversation history or raw codebase context causes hallucination and inconsistent verdicts.
Fix: Package only execution evidence: command_plan.json, run_result.json, git_status.json, diff.patch, and test outputs. The supervisor evaluates evidence, not intent.
7. Skipping Preflight Validation
Explanation: Disabling test skeleton generation for "simple" tasks removes convergence guarantees. The agent optimizes for speed over correctness, producing technically valid but architecturally misaligned code. Fix: Enable preflight by default for all feature tasks. Allow opt-out only for pure refactoring or documentation changes where behavioral contracts are irrelevant.
Production Bundle
Action Checklist
- Define immutable execution contracts with explicit acceptance criterion IDs
- Implement atomic file-backed queue with exactly-once claim semantics
- Enforce structured JSON output via executor guard plugins
- Route supervisor review to a secondary model for high-risk tasks
- Generate preflight test skeletons mapped to each acceptance criterion
- Configure path constraints to prevent unauthorized file modifications
- Replace system prompts entirely rather than appending instructions
- Package execution evidence for supervisor review, not raw diffs
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Small utility script | Single-model execution, no supervisor | Low risk, fast iteration, minimal review overhead | Lowest API cost, fastest turnaround |
| Feature with explicit behaviors | Automated supervisor loop, preflight skeletons | Enforces convergence, catches edge cases, produces PR-ready code | Moderate API cost, high reliability |
| Critical infrastructure change | Cross-model supervisor, strict path constraints, manual PR gate | Prevents architectural drift, ensures compliance, audit trail required | Highest API cost, mandatory human sign-off |
Configuration Template
# task.yaml
taskId: feat-auth-session-v2
goal: Implement session rotation with secure cookie handling
acceptanceCriteria:
- id: AC-01
description: Session tokens rotate on privilege escalation
requiredEvidence: [unit_test, integration_log]
- id: AC-02
description: Cookies use Secure, HttpOnly, SameSite=Strict
requiredEvidence: [header_inspection, config_check]
pathConstraints:
allow: ["src/auth/**", "tests/auth/**"]
deny: ["src/config/**", "package.json"]
loopBudget: 3
prBehavior: auto_open
supervisorModel: codex
promptMode: replace
# quality.yaml
reviewDimensions:
- name: security
severity: blocker
checks: [cookie_flags, token_rotation, input_validation]
- name: performance
severity: warning
checks: [query_nplusone, memory_leak]
evidenceRequirements:
- type: test_output
minPassRate: 100
- type: diff_coverage
minThreshold: 0.85
Quick Start Guide
- Initialize the queue directory: Create
.galley/tasks/with subdirectoriesqueued,running,done, andfailed. Set appropriate file permissions. - Deploy the executor guard: Install a schema validation middleware that intercepts agent output, enforces JSON formatting, and rejects non-compliant payloads.
- Configure the supervisor engine: Point the review service to your preferred models. Set up evidence packaging to exclude raw conversation history.
- Queue a test task: Generate a
task.yamlwith two acceptance criteria. Enable preflight skeletons. Submit to thequeueddirectory and monitor state transitions. - Validate the loop: Confirm the executor runs in an isolated worktree, returns structured evidence, and the supervisor issues a deterministic verdict. Open the resulting PR and verify test coverage matches the contract.
