del A influences Model B, the ensemble loses its statistical independence.
2. Semantic Deduplication: Findings from different models often describe the same issue using different wording. A deduplication layer is essential to merge these into a single actionable item.
3. Severity-Based Triage: Not all findings require immediate action. The system must classify issues as Critical, High, Medium, or Low to guide developer effort.
4. Human-in-the-Loop Validation: AI suggestions are never auto-applied. The developer must validate fixes, ensuring deep understanding of the codebase.
Implementation: TypeScript Orchestrator
The following TypeScript example demonstrates the core logic of the orchestrator. This implementation uses a plugin-based agent system, allowing integration with various model providers.
// types.ts
export type Severity = 'critical' | 'high' | 'medium' | 'low';

export interface Finding {
  id: string;
  agent: string;
  severity: Severity;
  description: string;
  location: { file: string; line: number };
  suggestion: string;
}

export interface ReviewAgent {
  name: string;
  review(diff: string, context: string): Promise<Finding[]>;
}

export interface ConsolidatedReport {
  findings: Finding[];
  summary: {
    critical: number;
    high: number;
    medium: number;
    low: number;
  };
  recommendation: 'merge' | 'fix' | 'abandon';
}
// orchestrator.ts
import type { ReviewAgent, Finding, Severity, ConsolidatedReport } from './types';

export class QualityGateOrchestrator {
  private readonly agents: ReviewAgent[];

  constructor(agents: ReviewAgent[]) {
    this.agents = agents;
  }

  async executeReview(diff: string, context: string): Promise<ConsolidatedReport> {
    // 1. Independent execution: every agent receives the same input and never
    //    sees another agent's output. Promise.all rejects if any agent fails,
    //    so a broken agent fails the gate instead of silently reducing coverage.
    const rawFindings = await Promise.all(
      this.agents.map(agent => agent.review(diff, context))
    );

    // 2. Flatten and deduplicate
    const allFindings = rawFindings.flat();
    const deduplicated = this.deduplicateFindings(allFindings);

    // 3. Rank by severity
    const ranked = this.rankBySeverity(deduplicated);

    // 4. Generate recommendation
    const recommendation = this.generateRecommendation(ranked);

    return {
      findings: ranked,
      summary: this.calculateSummary(ranked),
      recommendation
    };
  }

  private deduplicateFindings(findings: Finding[]): Finding[] {
    // Semantic deduplication logic would go here.
    // In production, use embeddings or LLM-based clustering to merge
    // findings that describe the same issue.
    // For this example, we group by location and severity as a proxy.
    const uniqueMap = new Map<string, Finding>();
    findings.forEach(finding => {
      const key = `${finding.location.file}:${finding.location.line}:${finding.severity}`;
      const existing = uniqueMap.get(key);
      if (!existing) {
        // Store a copy so the agents' original objects are never mutated.
        uniqueMap.set(key, { ...finding });
      } else if (!existing.agent.split(', ').includes(finding.agent)) {
        // Append the agent name to indicate consensus, once per agent.
        existing.agent += `, ${finding.agent}`;
      }
    });
    return Array.from(uniqueMap.values());
  }

  private rankBySeverity(findings: Finding[]): Finding[] {
    const severityOrder: Record<Severity, number> = {
      critical: 0,
      high: 1,
      medium: 2,
      low: 3
    };
    // Sort a copy: Array.prototype.sort reorders the array in place.
    return [...findings].sort((a, b) => severityOrder[a.severity] - severityOrder[b.severity]);
  }

  private generateRecommendation(findings: Finding[]): 'merge' | 'fix' | 'abandon' {
    const criticals = findings.filter(f => f.severity === 'critical');
    const highs = findings.filter(f => f.severity === 'high');

    if (criticals.length > 0) {
      // If criticals suggest fundamental architectural flaws, recommend abandon.
      // This keyword match is a crude heuristic over the description text.
      const hasArchFlaw = criticals.some(f => {
        const description = f.description.toLowerCase();
        return description.includes('architecture') || description.includes('design');
      });
      return hasArchFlaw ? 'abandon' : 'fix';
    }

    // One or two high findings still allow a merge in this example.
    if (highs.length > 2) {
      return 'fix';
    }
    return 'merge';
  }

  private calculateSummary(findings: Finding[]): ConsolidatedReport['summary'] {
    return {
      critical: findings.filter(f => f.severity === 'critical').length,
      high: findings.filter(f => f.severity === 'high').length,
      medium: findings.filter(f => f.severity === 'medium').length,
      low: findings.filter(f => f.severity === 'low').length
    };
  }
}
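The location-and-severity key above is only a proxy for the semantic deduplication the comments describe. A minimal sketch of an embedding-based variant follows, assuming you can supply an embedding function. `EmbedFn`, `deduplicateSemantically`, and the greedy clustering are illustrative assumptions rather than part of the orchestrator; the 0.85 default mirrors the threshold in the configuration template, and restricting matches to the same file is a simplification.

```typescript
// Sketch only: EmbedFn is a hypothetical stand-in for your embedding provider.
type Severity = 'critical' | 'high' | 'medium' | 'low';

interface Finding {
  id: string;
  agent: string;
  severity: Severity;
  description: string;
  location: { file: string; line: number };
  suggestion: string;
}

type EmbedFn = (text: string) => Promise<number[]>;

const SEVERITY_RANK: Record<Severity, number> = { critical: 0, high: 1, medium: 2, low: 3 };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Greedy clustering: a finding joins the first cluster in the same file whose
// representative description is at least `threshold` similar; otherwise it
// starts a new cluster.
async function deduplicateSemantically(
  findings: Finding[],
  embed: EmbedFn,
  threshold = 0.85
): Promise<Finding[]> {
  const vectors = await Promise.all(findings.map(f => embed(f.description)));
  const clusters: { representative: Finding; vector: number[]; agents: Set<string> }[] = [];

  findings.forEach((finding, i) => {
    const match = clusters.find(
      c =>
        c.representative.location.file === finding.location.file &&
        cosineSimilarity(c.vector, vectors[i]) >= threshold
    );
    if (!match) {
      clusters.push({
        representative: { ...finding }, // copy, so inputs are never mutated
        vector: vectors[i],
        agents: new Set([finding.agent])
      });
      return;
    }
    match.agents.add(finding.agent);
    // Keep the most severe rating any agent assigned to the merged issue.
    if (SEVERITY_RANK[finding.severity] < SEVERITY_RANK[match.representative.severity]) {
      match.representative.severity = finding.severity;
    }
  });

  return clusters.map(c => ({
    ...c.representative,
    agent: Array.from(c.agents).join(', ')
  }));
}
```

Unlike the proxy key, this merges findings that different agents rated at different severities, and it keeps the strictest rating. Whether that is the right policy is a judgment call for your team.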
Triage Workflow
Once the orchestrator produces a ConsolidatedReport, the developer follows a strict triage protocol:
- Critical/High Findings: Must be resolved. The developer reviews the AI's suggestion but implements the fix manually, ensuring the solution aligns with system constraints. Blind acceptance is prohibited.
- Medium Findings: Evaluated based on cost-benefit analysis. If the fix introduces complexity disproportionate to the risk, it may be deferred or documented as a known limitation.
- Abandonment Criteria: If the report indicates that the PR's approach is fundamentally flawed (e.g., critical architectural violations), the PR should be abandoned. This guards against the sunk-cost fallacy and encourages rethinking the design before more effort is invested.
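The protocol above can be sketched as a small function that turns a report into a per-finding action list. The names `TriageAction`, `TriagePlan`, and `buildTriagePlan` are hypothetical. The protocol does not say how to treat Low findings, so marking them `optional` here is an assumption.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';
type Recommendation = 'merge' | 'fix' | 'abandon';

// 'optional' for low-severity findings is an assumption; the protocol is silent on them.
type TriageAction = 'must-fix' | 'evaluate' | 'optional';

interface TriageItem {
  id: string;
  severity: Severity;
  action: TriageAction;
}

interface TriagePlan {
  abandon: boolean; // true when the PR should be dropped rather than patched
  items: TriageItem[];
}

function actionFor(severity: Severity): TriageAction {
  switch (severity) {
    case 'critical':
    case 'high':
      return 'must-fix'; // resolved manually by the developer, never auto-applied
    case 'medium':
      return 'evaluate'; // cost-benefit decision: fix, defer, or document
    case 'low':
      return 'optional';
  }
}

function buildTriagePlan(
  findings: { id: string; severity: Severity }[],
  recommendation: Recommendation
): TriagePlan {
  return {
    abandon: recommendation === 'abandon',
    items: findings.map(f => ({ id: f.id, severity: f.severity, action: actionFor(f.severity) }))
  };
}
```

When `abandon` is true, the item list is still returned so the findings can inform the replacement design.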
Understanding Tools Integration
To deepen code comprehension, the workflow can be augmented with "understanding tools." These tools ask the AI to explain the changeset, generate Mermaid diagrams of data flow, or simulate a Q&A session where the AI challenges the developer's understanding. This ensures the developer retains ownership of the logic and can articulate the system's behavior.
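One way to wire these tools in is a small prompt builder, one prompt per task. This is a sketch under assumptions: the task names, the instruction wording, and `buildUnderstandingPrompt` are all hypothetical, and the prompts would need tuning for whichever model you use.

```typescript
type UnderstandingTask = 'explain' | 'diagram' | 'quiz';

// Illustrative instruction text only; adjust the wording for your model.
const INSTRUCTIONS: Record<UnderstandingTask, string> = {
  explain: 'Explain what this changeset does and why, step by step.',
  diagram: 'Produce a Mermaid flowchart of the data flow this changeset introduces or modifies.',
  quiz: 'Ask me five questions that test whether I understand this changeset. Wait for my answers before giving yours.'
};

function buildUnderstandingPrompt(task: UnderstandingTask, diff: string): string {
  return `${INSTRUCTIONS[task]}\n\n--- DIFF ---\n${diff}`;
}
```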
Pitfall Guide
Implementing an ensemble review system requires discipline. The following pitfalls are common in production environments.
| Pitfall | Explanation | Fix |
|---|---|---|
| Blind Acceptance | Developers treat AI suggestions as authoritative truth, applying fixes without validation. This propagates subtle errors or introduces new bugs. | Enforce a policy where all AI suggestions must be manually implemented and reviewed. Use the AI for diagnosis, not prescription. |
| Single-Model Reliance | Using only one LLM for review. This model may have blind spots or hallucinate issues specific to its training data. | Deploy an ensemble of at least two distinct models. Diversity in model architecture reduces correlated errors. |
| Cost Neglect | Running expensive models on every trivial change, leading to prohibitive API costs. | Implement gating logic. Run full ensemble reviews only on PRs exceeding a complexity threshold or touching critical modules. |
| Legacy Blindness | AI surfaces pre-existing bugs in unrelated files, and developers ignore them, missing opportunities to improve codebase health. | Configure the orchestrator to flag pre-existing bugs separately. Schedule dedicated "bug bashes" to address these findings. |
| Abandonment Resistance | Developers refuse to abandon a PR despite critical architectural flaws, continuing to patch a broken approach. | Define explicit abandonment criteria in the triage policy. Encourage a culture where killing a bad PR is viewed as a success. |
| Context Overflow | Feeding the entire codebase to the AI, causing context window saturation and degraded review quality. | Limit input to the PR diff plus relevant context files. Use chunking strategies for large changesets. |
| Deduplication Failure | Multiple models report the same issue, cluttering the report and wasting developer time. | Implement robust semantic deduplication. Group findings by location and semantic similarity before presenting to the developer. |
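For the Legacy Blindness fix, one simple way to flag pre-existing bugs separately is to check whether a finding's location falls inside the lines the PR changed. The sketch below assumes you can extract changed line ranges from the diff. `ChangedRanges` and `partitionByOrigin` are hypothetical names, and the rule itself is a heuristic: a change can also cause a bug on a line it did not touch.

```typescript
interface LocatedFinding {
  id: string;
  location: { file: string; line: number };
}

// File path -> inclusive [start, end] line ranges touched by the PR (hypothetical shape).
type ChangedRanges = Record<string, Array<[number, number]>>;

function partitionByOrigin<T extends LocatedFinding>(
  findings: T[],
  changed: ChangedRanges
): { introduced: T[]; preExisting: T[] } {
  const introduced: T[] = [];
  const preExisting: T[] = [];
  for (const finding of findings) {
    const ranges = changed[finding.location.file] ?? [];
    const touched = ranges.some(
      ([start, end]) => finding.location.line >= start && finding.location.line <= end
    );
    // Findings outside the changed lines are treated as pre-existing.
    (touched ? introduced : preExisting).push(finding);
  }
  return { introduced, preExisting };
}
```

The `preExisting` list can then feed a backlog for the dedicated bug bashes instead of blocking the current PR.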
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Critical System PR | Full Ensemble Review | High risk requires maximum bug coverage and low false positives. | High (Multiple models, full context) |
| Minor UI Tweak | Single Model or Skip | Low risk; ensemble overhead outweighs benefits. | Low |
| Refactoring Large Module | Ensemble + Abandon Check | Refactoring often exposes architectural flaws; abandon option is crucial. | Medium-High |
| Prototype / Spike | No AI Review | Speed is priority; quality gates are unnecessary for disposable code. | None |
Configuration Template
Use this JSON template to configure the orchestrator. Adjust thresholds and model endpoints based on your environment.
{
  "ensemble": {
    "agents": [
      {
        "name": "claude-sonnet",
        "endpoint": "https://api.anthropic.com/v1/messages",
        "weight": 1.0
      },
      {
        "name": "codex-mini",
        "endpoint": "https://api.openai.com/v1/chat/completions",
        "weight": 1.0
      }
    ],
    "deduplication": {
      "strategy": "semantic",
      "threshold": 0.85
    }
  },
  "triage": {
    "abandon_threshold": {
      "critical_count": 1,
      "arch_flaw_detected": true
    },
    "medium_deferral": {
      "max_mediums": 5,
      "require_doc": true
    }
  },
  "gating": {
    "min_diff_size": 50,
    "critical_paths": ["src/core", "src/auth"]
  }
}
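The `gating` section could be consumed along these lines. This is a sketch, not the package's actual behavior: it assumes `min_diff_size` counts total changed lines across the PR and that `critical_paths` are directory prefixes. `ChangedFile`, `ReviewMode`, and `selectReviewMode` are hypothetical names.

```typescript
interface GatingConfig {
  min_diff_size: number; // assumed to mean total changed lines in the PR
  critical_paths: string[]; // assumed to be directory prefixes
}

interface ChangedFile {
  path: string;
  changedLines: number;
}

type ReviewMode = 'ensemble' | 'skip';

// Match on a path-segment boundary so "src/core" does not match "src/corelib".
function touchesCriticalPath(path: string, criticalPaths: string[]): boolean {
  return criticalPaths.some(p => path === p || path.startsWith(`${p}/`));
}

function selectReviewMode(files: ChangedFile[], gating: GatingConfig): ReviewMode {
  const totalChanged = files.reduce((sum, f) => sum + f.changedLines, 0);
  const critical = files.some(f => touchesCriticalPath(f.path, gating.critical_paths));
  return (critical || totalChanged >= gating.min_diff_size) ? 'ensemble' : 'skip';
}
```

A small change under `src/auth` still triggers the full ensemble, while a small change elsewhere is skipped. A third mode for single-model review could be added to cover the "Single Model or Skip" row of the decision matrix.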
Quick Start Guide
- Install Dependencies: Add the orchestrator package and model SDKs to your project.
npm install @codcompass/ai-review-orchestrator @anthropic-ai/sdk openai
- Configure Agents: Create a configuration file with your API keys and model endpoints.
- Run Dry-Run: Execute the orchestrator on a recent PR to validate findings and deduplication logic.
npx ai-review --dry-run --pr-id 1234
- Integrate Hook: Add the orchestrator to your pre-merge CI job. Block merges if the recommendation is fix or abandon.
# .github/workflows/ai-review.yml
on: pull_request
jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npx ai-review --pr ${{ github.event.pull_request.number }}
        env:
          AI_REVIEW_CONFIG: ./config/review.json
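A CI step fails when its command exits with a non-zero code, so blocking on fix or abandon comes down to mapping the recommendation to an exit code. A minimal sketch of that mapping follows. `exitCodeFor` and `gateMessage` are hypothetical helpers, and how the `ai-review` CLI actually reports its result is not specified here.

```typescript
type Recommendation = 'merge' | 'fix' | 'abandon';

// Non-zero exit codes fail the CI step; the merge is blocked when the job
// is configured as a required check.
function exitCodeFor(recommendation: Recommendation): number {
  return recommendation === 'merge' ? 0 : 1;
}

function gateMessage(recommendation: Recommendation): string {
  switch (recommendation) {
    case 'merge':
      return 'Quality gate passed.';
    case 'fix':
      return 'Quality gate failed: resolve the blocking findings before merging.';
    case 'abandon':
      return 'Quality gate failed: the approach is flagged as fundamentally flawed.';
  }
}

// In a Node entry point (hypothetical wiring):
//   console.log(gateMessage(report.recommendation));
//   process.exitCode = exitCodeFor(report.recommendation);
```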
This architecture shifts the paradigm from AI as a velocity multiplier to AI as a quality multiplier. By embracing slower, more rigorous review processes, teams can ship code that is not only faster to write but significantly more robust, reducing long-term maintenance costs and improving system reliability.