AdamsReview: Multi-Agent PR Reviews for Claude Code, Reviewed

By Codcompass Team·2026-05-17·7 min read

Beyond the Single Pass: Orchestrating Focused AI Agents for Production PR Reviews

Current Situation Analysis

Automated code review tools have reached a functional plateau. When a pull request crosses a moderate complexity threshold, single-pass LLM evaluations consistently degrade into surface-level feedback. They flag indentation inconsistencies, suggest obvious null checks, and highlight stale TODOs while silently missing structural vulnerabilities, race conditions, or inefficient data flows. This isn't a model intelligence problem; it's an orchestration problem.

The industry overlooks this limitation because developers assume context windows scale linearly with review quality. They don't. Three structural failure modes emerge when a single model processes a multi-file diff:

Context Window Fragmentation: As the diff grows, the model's attention is spread over more tokens and each file gets a smaller share of it. Early files receive genuine reasoning; trailing files trigger pattern-matching heuristics. The model stops analyzing and starts completing.
Validation Bias: Single prompts force the model into a cooperative stance. It reads the diff as authored and asks, "Does this align with itself?" rather than, "What input sequence breaks this assumption?" Constructive feedback and adversarial stress-testing pull in opposite directions, and a single inference pass rarely does both well.
Absence of Cross-Verification: Hallucinated signatures, misread return types, or incorrect dependency assumptions go unchecked. Human review relies on sequential verification; single-agent AI lacks a feedback loop to catch its own misreads.

The operational threshold for multi-agent review isn't perfection. It's whether focused agents surface structural defects that a broad-brief agent suppresses, within a token budget your team can sustain. Both detection lift and cost control must be measured.

WOW Moment: Key Findings

When you replace a monolithic review prompt with parallel, scope-isolated agents, the trade-off curve shifts dramatically. The following comparison illustrates the structural impact on review quality and resource consumption:

Approach	Detection Depth	Token Cost per 100 Lines	Review Latency	False Positive Rate
Single-Pass LLM	Surface-level (formatting, obvious bugs)	Low	Fast (1-2 min)	High (over-indexes on style)
Multi-Agent Orchestration	Structural (race conditions, security, perf)	Moderate-High	Medium (3-5 min)	Low (scope-constrained)

This finding matters because it decouples review depth from PR size. Instead of paying for a bloated context window that dilutes reasoning, you pay for targeted inference passes that compound into a cohesive audit. Teams can now treat AI review as a mechanical filter rather than a rubber stamp, reserving human attention for architecture, naming conventions, and product alignment.

Core Solution

Building a production-ready multi-agent review pipeline requires three layers: agent scoping, parallel dispatch, and output consolidation. The execution layer should run on top of Claude Code's CLI/agent runtime rather than direct API calls. This preserves access to local tooling, shell execution, file system reads, and any MCP servers wired into your environment.

Step 1: Define Agent Scopes

Each agent receives a narrow system prompt and a restricted context window. Typical scopes include correctness, security exposure, test coverage, and performance. Isolation prevents attention bleed and forces adversarial reasoning.

Step 2: Build the Dispatch Layer

Agents run in parallel. The orchestrator clones the diff context, injects scope-specific instructions, and spawns independent Claude Code sessions. Parallel execution minimizes wall-clock latency while maintaining strict prompt boundaries.

Step 3: Implement Consolidation

Raw agent outputs are noisy. A consolidation layer deduplicates findings, weights severity, resolves contradictions, and formats a single comment thread. This transforms fragmented reports into an actionable checklist.
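As a rough illustration of the final formatting step, the sketch below renders already-consolidated findings as a Markdown task list. The Finding type mirrors the ReviewFinding shape used by the orchestrator in this article; the function name toMarkdownChecklist and the exact line layout are assumptions for illustration, not part of any tool's output format.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

// Mirrors the ReviewFinding shape from the orchestrator; redefined here so
// the sketch is self-contained.
interface Finding {
  agentId: string;
  file: string;
  line: number;
  severity: Severity;
  description: string;
  suggestion?: string;
}

// Render one unchecked task-list item per finding, with the optional
// suggestion as a nested bullet. Hypothetical helper, not a library API.
function toMarkdownChecklist(findings: Finding[]): string {
  if (findings.length === 0) return 'No findings.';
  return findings
    .map(f => {
      const head =
        `- [ ] **${f.severity.toUpperCase()}** \`${f.file}:${f.line}\` ` +
        `(${f.agentId}): ${f.description}`;
      return f.suggestion ? `${head}\n  - Suggestion: ${f.suggestion}` : head;
    })
    .join('\n');
}
```

Passing the sorted output of the consolidation step keeps critical items at the top of the comment.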

Architecture Implementation (TypeScript)

import { spawn } from 'child_process';
import { readFileSync } from 'fs';

interface AgentConfig {
  id: string;
  scope: 'correctness' | 'security' | 'performance' | 'testing';
  systemPrompt: string;
  maxTokens: number;
}

interface ReviewFinding {
  agentId: string;
  file: string;
  line: number;
  severity: 'critical' | 'high' | 'medium' | 'low';
  description: string;
  suggestion?: string;
}

class ReviewOrchestrator {
  private agents: AgentConfig[];
  private diffPath: string;

  constructor(agents: AgentConfig[], diffPath: string) {
    this.agents = agents;
    this.diffPath = diffPath;
  }

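  async dispatch(): Promise<ReviewFinding[]> {
    const diffContent = readFileSync(this.diffPath, 'utf-8');
    const promises = this.agents.map(agent => this.runAgent(agent, diffContent));
    // allSettled keeps findings from healthy agents when one agent fails;
    // Promise.all would reject on the first failure and discard the rest.
    const settled = await Promise.allSettled(promises);
    const rawResults = settled.flatMap(r => {
      if (r.status === 'fulfilled') return r.value;
      console.error(r.reason); // surface the failed agent, keep going
      return [];
    });
    return this.consolidate(rawResults);
  }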

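  private runAgent(agent: AgentConfig, diff: string): Promise<ReviewFinding[]> {
    return new Promise((resolve, reject) => {
      // Flags assumed from Claude Code's print mode: -p runs a single
      // non-interactive prompt and --append-system-prompt adds the scope
      // instructions. Confirm both with `claude --help` for your version.
      // agent.maxTokens is not passed here; treat it as a budget to log
      // against unless your CLI version exposes an output cap.
      const claudeProcess = spawn('claude', [
        '-p', 'Review the diff provided on stdin. Respond with JSON only.',
        '--append-system-prompt', agent.systemPrompt
      ]);

      let output = '';
      claudeProcess.stdout.on('data', chunk => output += chunk.toString());
      // A missing binary emits 'error' rather than 'close'; reject instead of crashing.
      claudeProcess.on('error', reject);
      claudeProcess.on('close', code => {
        if (code === 0) resolve(this.parseAgentOutput(agent.id, output));
        else reject(new Error(`Agent ${agent.id} failed with exit code ${code}`));
      });

      // Send the diff on stdin; large diffs can exceed OS argument-length limits.
      claudeProcess.stdin.write(diff);
      claudeProcess.stdin.end();
    });
  }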

  private parseAgentOutput(agentId: string, raw: string): ReviewFinding[] {
    // Structured JSON extraction from agent stdout
    try {
      const parsed = JSON.parse(raw);
      return parsed.findings.map((f: any) => ({ ...f, agentId }));
    } catch {
      return [];
    }
  }

  private consolidate(findings: ReviewFinding[]): ReviewFinding[] {
    const dedupMap = new Map<string, ReviewFinding>();
    
    findings.forEach(f => {
      // The key already includes severity, so duplicates always share it;
      // keep the first finding reported for each file/line/severity.
      const key = `${f.file}:${f.line}:${f.severity}`;
      if (!dedupMap.has(key)) {
        dedupMap.set(key, f);
      }
    });

    return Array.from(dedupMap.values())
      .sort((a, b) => {
        const severityOrder = { critical: 0, high: 1, medium: 2, low: 3 };
        return severityOrder[a.severity] - severityOrder[b.severity];
      });
  }
}

Architecture Rationale

Parallel Dispatch: Sequential agent execution compounds latency. Parallel spawning keeps wall-clock time proportional to the slowest agent, not the sum.
CLI/Runtime Binding: Running through Claude Code preserves local context. Agents can execute grep, read package.json, or query MCP tools for dependency graphs. Direct API calls lose this environmental awareness.
Consolidation Layer: Raw LLM output is unstructured. Deduplication by file/line/severity prevents comment spam. Severity weighting ensures critical findings surface first.
Scope Isolation: Narrow system prompts force the model into adversarial or analytical modes. A security agent instructed to "enumerate attack vectors" behaves differently than a general reviewer told to "check for issues."
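The configuration template later in this article includes a parallel_limit setting, while a plain Promise.all over every agent starts them all at once. One way to honor such a limit is a small bounded-concurrency helper. The sketch below is an assumption about how that could look; mapWithLimit is a hypothetical name, and the worker stands in for whatever function runs a single agent.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results keep input order. A rejected worker rejects the whole call.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

With this shape, wall-clock time is bounded by the busiest lane rather than the single slowest agent, which is the trade-off a concurrency cap implies.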

Pitfall Guide

1. Prompt Contamination

Explanation: Agents bleed into each other's scope when system prompts overlap or context windows share unfiltered diff data. A performance agent starts commenting on security flaws, diluting its focus. Fix: Enforce strict system prompt boundaries. Pass only the diff subset relevant to each scope. Use JSON schema validation on agent output to reject out-of-scope findings.
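A minimal sketch of output validation follows. It assumes each agent is asked to tag every finding with a category field naming its scope; that field is hypothetical and does not appear in the ReviewFinding interface above. A hand-written type guard stands in for a full JSON Schema validator.

```typescript
type Scope = 'correctness' | 'security' | 'performance' | 'testing';

const SCOPES: readonly string[] = ['correctness', 'security', 'performance', 'testing'];
const SEVERITIES: readonly string[] = ['critical', 'high', 'medium', 'low'];

interface ScopedFinding {
  file: string;
  line: number;
  severity: string;
  category: Scope; // hypothetical field: the scope the agent claims for this finding
  description: string;
}

// Structural check on untrusted agent output.
function isScopedFinding(value: unknown): value is ScopedFinding {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const severity = v['severity'];
  const category = v['category'];
  const line = v['line'];
  return (
    typeof v['file'] === 'string' &&
    typeof line === 'number' && Number.isInteger(line) &&
    typeof severity === 'string' && SEVERITIES.indexOf(severity) !== -1 &&
    typeof category === 'string' && SCOPES.indexOf(category) !== -1 &&
    typeof v['description'] === 'string'
  );
}

// Keep only well-formed findings that match the agent's own scope.
function filterToScope(
  raw: unknown,
  scope: Scope
): { kept: ScopedFinding[]; rejected: number } {
  const items: unknown[] = Array.isArray(raw) ? raw : [];
  const kept: ScopedFinding[] = [];
  for (const item of items) {
    if (isScopedFinding(item) && item.category === scope) kept.push(item);
  }
  return { kept, rejected: items.length - kept.length };
}
```

Tracking the rejected count per agent gives an early signal that a prompt is drifting out of its lane.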

2. Unbounded Token Spend

Explanation: Running five agents on a 50-line PR burns tokens on ceremony. Every agent pays a fixed overhead for its system prompt and repository context, so cost scales with agent count even when the diff is tiny. Without gating, budgets explode on trivial changes. Fix: Implement size-based routing. Use git diff --stat to calculate changed lines. Route PRs under the size threshold (200 changed lines in the configuration template) to a single fast model. Trigger multi-agent orchestration only above that threshold.
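Size-based routing can be sketched as two small functions. This example swaps in git diff --numstat for the --stat output mentioned above, because numstat prints one tab-separated "added, deleted, path" row per file and is easier to parse. The function names and the default threshold of 200 (taken from the configuration template) are illustrative assumptions.

```typescript
type Route = 'single-pass' | 'multi-agent';

// Sum added + deleted lines from `git diff --numstat` output.
// Binary files are reported as "-\t-\tpath" and are skipped.
function countChangedLines(numstat: string): number {
  let total = 0;
  for (const row of numstat.split('\n')) {
    const [added, deleted] = row.split('\t');
    if (added === undefined || deleted === undefined) continue;
    const a = Number.parseInt(added, 10);
    const d = Number.parseInt(deleted, 10);
    if (Number.isNaN(a) || Number.isNaN(d)) continue;
    total += a + d;
  }
  return total;
}

// Only PRs strictly above the threshold trigger the multi-agent path.
function chooseRoute(changedLines: number, sizeThreshold = 200): Route {
  return changedLines > sizeThreshold ? 'multi-agent' : 'single-pass';
}
```

Capturing the numstat text is left to the caller, for example by running git diff --numstat against the PR's base branch in CI.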

3. Context Cache Neglect

Explanation: Each agent re-reads repository context from scratch. Without prompt caching, the same baseline context is processed at full price once per agent, so a three- or four-agent run pays for it 3-4x. Fix: Keep the shared context identical and at the front of every agent's prompt so prompt caching can apply to it. Pre-warm context with README.md, architecture docs, and shared type definitions. Cache hits drop per-agent costs significantly.
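The arithmetic behind this pitfall can be made explicit with a back-of-the-envelope estimator. Everything here is an assumption for illustration: the function name, the idea that the first agent writes the cache and later agents read it, and the multipliers themselves, which are supplied by the caller and should be replaced with your provider's current pricing. Agents dispatched at the same instant may all miss a cache that has not been written yet, so treat the cached figure as a best case.

```typescript
interface CacheAssumptions {
  writeMultiplier: number; // cost of writing shared context to cache, relative to normal input
  readMultiplier: number;  // cost of reading cached context, relative to normal input
}

// Estimate billed input in "full-price token equivalents" for one PR.
// shared: tokens of repository context common to all agents
// perAgent: tokens unique to each agent (scope prompt plus diff subset)
function estimateBilledInput(
  shared: number,
  perAgent: number,
  agents: number,
  cache?: CacheAssumptions
): number {
  if (agents <= 0) return 0;
  const scoped = agents * perAgent;
  if (!cache) return agents * shared + scoped;
  const firstAgent = shared * cache.writeMultiplier;
  const laterAgents = (agents - 1) * shared * cache.readMultiplier;
  return Math.round(firstAgent + laterAgents + scoped);
}
```

Comparing the two results for your own context size shows whether the shared prefix or the per-agent portion dominates the bill.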

4. Verdict Automation

Explanation: Teams treat consolidated output as a final approval gate. AI lacks product context, business logic awareness, and architectural intent. Automating merges based on AI review alone invites production incidents. Fix: Design output as a mechanical checklist. Require human sign-off for architecture, naming, and business alignment. AI surfaces; humans decide.

5. Shallow Adversarial Scoping

Explanation: Security agents default to regex patterns (eval, unsafe-inline). They miss logic flaws, IDOR vulnerabilities, or race conditions because prompts lack threat modeling structure. Fix: Structure security prompts around attack vectors: input validation, privilege escalation, data leakage, state manipulation. Force enumeration before conclusion.
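One way to force enumeration before conclusion is to generate the security prompt from an explicit list of vectors. The sketch below uses the four vectors named above; the builder function and its wording are assumptions, and real prompts would need tuning against your codebase.

```typescript
const ATTACK_VECTORS: readonly string[] = [
  'input validation',
  'privilege escalation',
  'data leakage',
  'state manipulation',
];

// Build a prompt that walks the vectors in order and defers the verdict
// until every vector has been addressed. Hypothetical helper.
function buildSecurityPrompt(vectors: readonly string[] = ATTACK_VECTORS): string {
  const steps = vectors.map(
    (vector, i) =>
      `${i + 1}. ${vector}: list every changed code path that touches it, ` +
      `then state whether it is exploitable and why.`
  );
  return [
    'You are reviewing a diff as an attacker.',
    'Work through each vector in order before writing any conclusion:',
    ...steps,
    'Only after all vectors are enumerated, return findings as JSON with ' +
      'file, line, severity, description, suggestion.',
  ].join('\n');
}
```

Keeping the vector list in code rather than in a hand-edited prompt makes coverage gaps visible in review when a vector is added or dropped.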

6. Merge Logic Conflicts

Explanation: Two agents flag the same line with contradictory suggestions. Without resolution logic, developers receive noise instead of direction. Fix: Implement conflict resolution in the consolidation layer. Prioritize critical severity. If severity matches, prefer the agent with higher historical accuracy. Log contradictions for human review.
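The resolution rules above can be sketched as a pure function. It assumes an accuracy table keyed by agent id that you maintain yourself (the article does not say how historical accuracy is tracked), and it treats two different suggestions on the same file and line as a contradiction, which may also catch unrelated findings that happen to share a line.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

interface Finding {
  agentId: string;
  file: string;
  line: number;
  severity: Severity;
  description: string;
  suggestion?: string;
}

const RANK: Record<Severity, number> = { critical: 0, high: 1, medium: 2, low: 3 };

// Pick one finding per file:line. Higher severity wins; ties go to the
// agent with the higher (hypothetical) historical accuracy score.
function resolveConflicts(
  findings: Finding[],
  accuracy: Record<string, number>
): { resolved: Finding[]; contradictions: Finding[][] } {
  const byLocation = new Map<string, Finding[]>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const group = byLocation.get(key) ?? [];
    group.push(f);
    byLocation.set(key, group);
  }

  const resolved: Finding[] = [];
  const contradictions: Finding[][] = [];
  byLocation.forEach(group => {
    const ordered = [...group].sort((a, b) => {
      const bySeverity = RANK[a.severity] - RANK[b.severity];
      if (bySeverity !== 0) return bySeverity;
      return (accuracy[b.agentId] ?? 0) - (accuracy[a.agentId] ?? 0);
    });
    const winner = ordered[0];
    if (winner === undefined) return;
    resolved.push(winner);
    // Differing suggestions are returned so a human can look at them.
    const suggestions = new Set(
      group.map(f => f.suggestion).filter((s): s is string => s !== undefined)
    );
    if (suggestions.size > 1) contradictions.push(group);
  });
  return { resolved, contradictions };
}
```

Returning contradictions instead of logging inside the function leaves the choice of sink (PR comment, log file, dashboard) to the caller.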

7. CI/CD Bottlenecks

Explanation: Multi-agent review blocks PR merges while waiting for all agents to finish. Slow agents or API rate limits stall pipelines. Fix: Decouple review from merge gates. Post findings as PR comments asynchronously. Use status checks only for critical security agents. Allow developers to address findings without blocking CI.
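A small decision function can keep the blocking rule explicit. The sketch assumes the security agent's id is "security", as in the configuration template, and maps the outcome onto three conclusion values (failure, neutral, success) of the kind CI status checks commonly accept. Posting the result to your CI provider is out of scope here.

```typescript
type CheckConclusion = 'failure' | 'neutral' | 'success';

interface GateFinding {
  agentId: string;
  severity: 'critical' | 'high' | 'medium' | 'low';
}

// Block only on critical findings from the security agent; everything else
// is advisory and should reach the developer as a comment.
function decideCheck(findings: GateFinding[]): CheckConclusion {
  const blocking = findings.some(
    f => f.agentId === 'security' && f.severity === 'critical'
  );
  if (blocking) return 'failure';
  return findings.length > 0 ? 'neutral' : 'success';
}
```

Because the function only inspects findings that have already arrived, a slow non-security agent never holds up the gate.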

Production Bundle

Action Checklist

Gate multi-agent orchestration on PR size (e.g., >200 changed lines)
Enable prompt caching for repository context and shared types
Log token consumption per agent per PR to track budget drift
Structure system prompts around explicit attack vectors and analysis scopes
Treat AI output as a mechanical checklist, not an approval verdict
Implement deduplication and severity weighting in the consolidation layer
Run agents asynchronously in CI to avoid pipeline blocking
Audit agent prompts monthly for prompt drift and coverage gaps
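For the token-logging item, a minimal ledger might look like the sketch below. The TokenUsage record and the function name are assumptions; where the token counts come from depends on what your runtime reports per session.

```typescript
interface TokenUsage {
  prNumber: number;
  agentId: string;
  tokens: number;
}

// Sum tokens per (PR, agent) pair and report pairs that exceed the budget.
function findOverBudget(log: TokenUsage[], budgetPerAgent: number): string[] {
  const totals = new Map<string, number>();
  for (const entry of log) {
    const key = `PR ${entry.prNumber} / ${entry.agentId}`;
    totals.set(key, (totals.get(key) ?? 0) + entry.tokens);
  }
  const flagged: string[] = [];
  totals.forEach((tokens, key) => {
    if (tokens > budgetPerAgent) flagged.push(`${key}: ${tokens} tokens`);
  });
  return flagged;
}
```

Running this over a week of entries against token_budget_per_agent shows which scope is drifting before the invoice does.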

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Trivial PR (<50 lines)	Single fast model or manual merge	Orchestration overhead exceeds value	Minimal
Medium refactor (150-400 lines)	Multi-agent with correctness + testing scopes	Catches regression patterns without full suite	Moderate
Large feature (>500 lines)	Full multi-agent orchestration	Attention dilution makes single-pass unreliable	High
Security-critical module	Dedicated security agent + human audit	AI misses business-logic flaws; human verifies intent	High + human time
Dependency bump	Automated lint + single model	Structural analysis unnecessary for version updates	Low

Configuration Template

review_orchestrator:
  size_threshold: 200
  cache_enabled: true
  cache_ttl_minutes: 30
  token_budget_per_agent: 8000
  parallel_limit: 4

agents:
  - id: correctness
    scope: correctness
    system_prompt: |
      Analyze the diff for logical errors, type mismatches, and control flow issues.
      Focus on edge cases, null propagation, and state mutations.
      Return findings as JSON with file, line, severity, description, suggestion.
    max_tokens: 6000

  - id: security
    scope: security
    system_prompt: |
      Evaluate the diff for injection vectors, privilege escalation, data exposure,
      and insecure deserialization. Enumerate attack paths before concluding.
      Return findings as JSON with file, line, severity, description, suggestion.
    max_tokens: 7000

  - id: performance
    scope: performance
    system_prompt: |
      Identify N+1 queries, unnecessary allocations, synchronous blocking calls,
      and algorithmic inefficiencies. Suggest concrete optimizations.
      Return findings as JSON with file, line, severity, description, suggestion.
    max_tokens: 5000

consolidation:
  dedup_strategy: file_line_severity
  severity_priority: [critical, high, medium, low]
  conflict_resolution: prefer_critical_or_log
  output_format: markdown_checklist
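Before dispatching agents it can help to sanity-check the parsed configuration. The sketch below works on an already-parsed object (YAML loading is left to whichever parser you use) and assumes one interpretation the template does not spell out: that each agent's max_tokens should not exceed token_budget_per_agent. The interface and function names are illustrative.

```typescript
interface AgentEntry {
  id: string;
  scope: string;
  system_prompt: string;
  max_tokens: number;
}

interface ReviewConfig {
  review_orchestrator: {
    size_threshold: number;
    token_budget_per_agent: number;
    parallel_limit: number;
  };
  agents: AgentEntry[];
}

// Return a list of human-readable problems; an empty list means the
// config passed these (assumed) checks.
function validateConfig(config: ReviewConfig): string[] {
  const problems: string[] = [];
  const { size_threshold, token_budget_per_agent, parallel_limit } =
    config.review_orchestrator;

  if (size_threshold <= 0) problems.push('size_threshold must be positive');
  if (parallel_limit < 1) problems.push('parallel_limit must be at least 1');

  const seen = new Set<string>();
  for (const agent of config.agents) {
    if (seen.has(agent.id)) problems.push(`duplicate agent id: ${agent.id}`);
    seen.add(agent.id);
    if (agent.max_tokens > token_budget_per_agent) {
      problems.push(
        `${agent.id}: max_tokens ${agent.max_tokens} exceeds ` +
          `token_budget_per_agent ${token_budget_per_agent}`
      );
    }
    if (agent.system_prompt.trim() === '') {
      problems.push(`${agent.id}: empty system_prompt`);
    }
  }
  return problems;
}
```

Failing fast on a bad config is cheaper than discovering it after several agent sessions have already been billed.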

Quick Start Guide

Install the runtime: Ensure Claude Code CLI is installed and authenticated. Verify MCP servers and local tooling are accessible.
Configure scopes: Copy the configuration template. Adjust size_threshold, token_budget_per_agent, and system prompts to match your codebase conventions.
Initialize the orchestrator: Compile the TypeScript source, then run node review-orchestrator.js --diff pr-142.diff --config review-config.yaml. The script spawns parallel agents, waits for completion, and outputs a consolidated markdown checklist.
Integrate with CI: Add a GitHub Action or GitLab CI step that triggers the orchestrator on PR creation. Configure it to post findings as comments rather than blocking status checks.
Validate and iterate: Review the first 10 runs. Check token logs, verify deduplication accuracy, and refine system prompts based on false positive patterns. Adjust thresholds as your team adopts the workflow.
