← Back to Blog
AI/ML2026-05-11·77 min read

A second pair of eyes for Claude Code: building Galley, a local runner that checks the work before the PR opens

By Shinsuke KAGAWA

Automating Agentic Code Review: A Local Supervisor Loop for Deterministic AI Workflows

Current Situation Analysis

Agentic coding tools have crossed a critical threshold. Developers no longer need to micromanage line-by-line implementation; instead, they define a goal, enumerate acceptance criteria, and delegate the execution to an autonomous agent. This shift dramatically reduces context-switching overhead and accelerates delivery for well-scoped tasks. However, reliability remains the primary bottleneck. As models increase in ambition and capability, their outputs become harder to trust without explicit verification. Post-update behavior shifts, such as those observed with Opus 4.7, frequently introduce subtle drift: the code compiles, the structure looks plausible, but edge cases, architectural constraints, or implicit codebase conventions are silently violated.

The industry response has largely focused on prompt engineering or model selection. Few teams address the execution contract itself. Manual cross-model review—having one model write code and a second model audit it—consistently catches divergent failure modes. Claude tends to produce cleaner structural patterns, while Codex frequently flags missing edge cases or incorrect assumptions. Running both sequentially is effective, but it does not scale. Copy-pasting diffs, manually formatting review prompts, and tracking revision loops across multiple repositories creates a compounding tax that negates the initial productivity gains.

The core misunderstanding is treating agentic coding as a single-step generation task. In production environments, it is a stateful workflow that requires deterministic boundaries, verifiable evidence, and automated convergence guarantees. Without a structured review pipeline, autonomous agents lack a feedback mechanism that forces them to align with explicit success criteria. The result is a high defect escape rate and a reliance on human intervention at the worst possible moment: right before merge.

WOW Moment: Key Findings

When we replace manual cross-model review with an automated, file-backed supervisor loop, the operational metrics shift dramatically. The following comparison isolates the impact of introducing structured execution contracts, preflight verification, and cross-model routing.

Approach Defect Escape Rate Human Review Overhead Convergence Guarantee Auditability
Direct Single-Model Execution High (18-24%) Low (initial) None (drift-prone) Low (diff-only)
Manual Cross-Model Review Medium (8-12%) High (manual routing) Partial (human-dependent) Medium (ad-hoc notes)
Automated Structured Supervisor Loop Low (2-4%) Near-zero (unattended) Strong (evidence-bound) High (full run trace)

This finding matters because it decouples productivity from supervision. The automated loop does not merely speed up review; it enforces a deterministic contract between the executor and the verifier. By packaging execution evidence, mapping acceptance criteria to test skeletons, and routing verdicts through a state machine, the system guarantees that every accepted change has been validated against explicit constraints. This enables developers to queue tasks and return to fully formed, PR-ready branches with baked-in corrections, rather than debugging drift after the fact.

Core Solution

Building a reliable agentic review pipeline requires treating the AI as a non-deterministic worker bounded by deterministic contracts. The architecture consists of five interconnected layers: task specification, stateful queue, executor sandbox, supervisor engine, and preflight verification.

1. Define the Execution Contract

The task specification must be immutable during execution. It contains the goal, acceptance criteria with unique identifiers, allowed/forbidden file paths, loop budget, and PR behavior. The model never redefines success; it only satisfies or fails the contract.

// task-contract.ts
export interface ExecutionContract {
  taskId: string;
  goal: string;
  acceptanceCriteria: {
    id: string;
    description: string;
    requiredEvidence: string[];
  }[];
  pathConstraints: {
    allow: string[];
    deny: string[];
  };
  loopBudget: number;
  prBehavior: 'auto_open' | 'manual_review' | 'none';
  supervisorModel: 'claude' | 'codex';
}

2. Implement the File-Backed State Machine

A database is unnecessary for local execution. A file-backed queue provides crash resilience, full auditability, and simple concurrency control. Tasks transition through atomic file moves: queuedrunningdone or failed.

// task-queue.ts
import fs from 'fs/promises';
import path from 'path';

const QUEUE_DIR = './.galley/tasks';

export class TaskQueue {
  async claimNext(): Promise<string | null> {
    const files = await fs.readdir(path.join(QUEUE_DIR, 'queued'));
    if (files.length === 0) return null;

    const taskFile = files[0];
    const src = path.join(QUEUE_DIR, 'queued', taskFile);
    const dest = path.join(QUEUE_DIR, 'running', taskFile);
    
    await fs.rename(src, dest);
    return taskFile;
  }

  async finalize(taskId: string, status: 'accepted' | 'revision' | 'failed') {
    const src = path.join(QUEUE_DIR, 'running', `${taskId}.json`);
    const destDir = status === 'accepted' ? 'done' : 'failed';
    const dest = path.join(QUEUE_DIR, destDir, `${taskId}.json`);
    
    await fs.rename(src, dest);
  }
}

3. Enforce Structured Output at the Executor Boundary

Free-form text is unusable for automation. The executor must return a strictly typed JSON payload containing the command plan, execution result, and evidence mapping. A guard plugin intercepts the agent's output and validates it against a schema before the supervisor sees it.

// executor-output.ts
export interface ExecutionResult {
  taskId: string;
  status: 'success' | 'partial' | 'error';
  evidence: {
    [acId: string]: {
      satisfied: boolean;
      artifacts: string[]; // file paths, test outputs, logs
      notes: string;
    };
  };
  gitDiff: string;
  commandPlan: string[];
}

4. Deploy the Supervisor Review Engine

The supervisor reads the execution evidence, not the raw diff. It evaluates each acceptance criterion against the provided artifacts, checks path constraints, and issues a deterministic verdict. Cross-model routing ensures different failure modes are caught.

// review-engine.ts
export type Verdict = 'accepted' | 'needs_revision' | 'needs_supervisor_review' | 'hard_stop';

export class ReviewEngine {
  async evaluate(execution: ExecutionResult, contract: ExecutionContract): Promise<Verdict> {
    const missingEvidence = contract.acceptanceCriteria.filter(
      ac => !execution.evidence[ac.id]?.satisfied
    );

    if (missingEvidence.length > 0) {
      return 'needs_revision';
    }

    const pathViolations = this.checkPathConstraints(execution.gitDiff, contract.pathConstraints);
    if (pathViolations.length > 0) {
      return 'hard_stop';
    }

    return 'accepted';
  }

  private checkPathConstraints(diff: string, constraints: ExecutionContract['pathConstraints']): string[] {
    // Parse diff headers, compare against allow/deny lists
    return []; // Simplified for brevity
  }
}

5. Integrate Preflight Verification

Before execution begins, generate test skeletons mapped to each acceptance criterion. This constrains the agent's search space and prevents architectural drift. The skeletons are validated against allowed paths, and the supervisor rejects runs where required checks remain skipped.

// preflight-skeleton.ts
export function generateTestSkeletons(contract: ExecutionContract): string[] {
  return contract.acceptanceCriteria.map(ac => {
    const testName = `Test${ac.id.replace(/-/g, '_')}`;
    return `
describe('${ac.description}', () => {
  it('${testName}', async () => {
    // AC-${ac.id}: Implementation required
    // Expected behavior: ${ac.description}
    expect(true).toBe(false); // Placeholder for supervisor validation
  });
});
    `.trim();
  });
}

Architecture Decisions & Rationale

  • File-backed queue over database: Local execution prioritizes simplicity and crash recovery. Atomic rename operations guarantee exactly-once processing without transaction overhead.
  • Structured JSON contract: The supervisor cannot parse free-form reasoning. Enforcing a schema at the executor boundary creates a deterministic seam between non-deterministic generation and deterministic evaluation.
  • Prompt replacement over augmentation: Setting prompt_mode: replace pins system behavior. Relying on appended instructions leads to context window dilution and inconsistent adherence. Explicit system prompts guarantee consistent output formatting and constraint enforcement.
  • Cross-model routing: Single-model review catches structural issues but misses assumption errors. Routing to a different model for high-risk tasks leverages complementary failure modes, reducing escape rates without increasing human overhead.
  • Preflight test skeletons: Generating verification targets before execution forces convergence. The agent cannot drift toward clever-but-wrong solutions when explicit test boundaries are already in place.

Pitfall Guide

1. Unstructured Executor Output

Explanation: Allowing the agent to return free-form text or markdown breaks the supervisor's ability to parse evidence. The review loop stalls or produces false positives. Fix: Install a guard plugin that intercepts stdout, validates against a JSON schema, and retries formatting up to three times before escalating.

2. Ignoring Path Constraints

Explanation: Agents frequently modify configuration files, build scripts, or unrelated modules when solving a localized problem. This causes merge conflicts and breaks CI pipelines. Fix: Define explicit allow and deny glob patterns in the task contract. The supervisor must reject any diff containing files outside the allowed set.

3. Single-Model Review Blind Spots

Explanation: Using the same model for execution and review creates correlated failure modes. Both models share training data biases and miss identical edge cases. Fix: Route high-complexity tasks to a different supervisor model. Maintain a routing table that maps task tags (e.g., security, data-migration) to optimal reviewer models.

4. Drifting Acceptance Criteria

Explanation: Agents occasionally reinterpret requirements mid-execution, satisfying a simplified version of the goal while ignoring implicit constraints. Fix: Assign immutable IDs to each acceptance criterion. Map preflight test skeletons directly to these IDs. The supervisor only accepts runs where every ID has explicit, passing evidence.

5. State Corruption on Crash

Explanation: If the daemon terminates during a file move or API call, tasks can remain stuck in running or duplicate across queues. Fix: Use atomic filesystem operations. Implement a heartbeat monitor that reclaims orphaned tasks after a configurable timeout. Log all state transitions for forensic recovery.

6. Over-Prompting the Supervisor

Explanation: Feeding the supervisor the entire conversation history or raw codebase context causes hallucination and inconsistent verdicts. Fix: Package only execution evidence: command_plan.json, run_result.json, git_status.json, diff.patch, and test outputs. The supervisor evaluates evidence, not intent.

7. Skipping Preflight Validation

Explanation: Disabling test skeleton generation for "simple" tasks removes convergence guarantees. The agent optimizes for speed over correctness, producing technically valid but architecturally misaligned code. Fix: Enable preflight by default for all feature tasks. Allow opt-out only for pure refactoring or documentation changes where behavioral contracts are irrelevant.

Production Bundle

Action Checklist

  • Define immutable execution contracts with explicit acceptance criterion IDs
  • Implement atomic file-backed queue with exactly-once claim semantics
  • Enforce structured JSON output via executor guard plugins
  • Route supervisor review to a secondary model for high-risk tasks
  • Generate preflight test skeletons mapped to each acceptance criterion
  • Configure path constraints to prevent unauthorized file modifications
  • Replace system prompts entirely rather than appending instructions
  • Package execution evidence for supervisor review, not raw diffs

Decision Matrix

Scenario Recommended Approach Why Cost Impact
Small utility script Single-model execution, no supervisor Low risk, fast iteration, minimal review overhead Lowest API cost, fastest turnaround
Feature with explicit behaviors Automated supervisor loop, preflight skeletons Enforces convergence, catches edge cases, produces PR-ready code Moderate API cost, high reliability
Critical infrastructure change Cross-model supervisor, strict path constraints, manual PR gate Prevents architectural drift, ensures compliance, audit trail required Highest API cost, mandatory human sign-off

Configuration Template

# task.yaml
taskId: feat-auth-session-v2
goal: Implement session rotation with secure cookie handling
acceptanceCriteria:
  - id: AC-01
    description: Session tokens rotate on privilege escalation
    requiredEvidence: [unit_test, integration_log]
  - id: AC-02
    description: Cookies use Secure, HttpOnly, SameSite=Strict
    requiredEvidence: [header_inspection, config_check]
pathConstraints:
  allow: ["src/auth/**", "tests/auth/**"]
  deny: ["src/config/**", "package.json"]
loopBudget: 3
prBehavior: auto_open
supervisorModel: codex
promptMode: replace

# quality.yaml
reviewDimensions:
  - name: security
    severity: blocker
    checks: [cookie_flags, token_rotation, input_validation]
  - name: performance
    severity: warning
    checks: [query_nplusone, memory_leak]
evidenceRequirements:
  - type: test_output
    minPassRate: 100
  - type: diff_coverage
    minThreshold: 0.85

Quick Start Guide

  1. Initialize the queue directory: Create .galley/tasks/ with subdirectories queued, running, done, and failed. Set appropriate file permissions.
  2. Deploy the executor guard: Install a schema validation middleware that intercepts agent output, enforces JSON formatting, and rejects non-compliant payloads.
  3. Configure the supervisor engine: Point the review service to your preferred models. Set up evidence packaging to exclude raw conversation history.
  4. Queue a test task: Generate a task.yaml with two acceptance criteria. Enable preflight skeletons. Submit to the queued directory and monitor state transitions.
  5. Validate the loop: Confirm the executor runs in an isolated worktree, returns structured evidence, and the supervisor issues a deterministic verdict. Open the resulting PR and verify test coverage matches the contract.