sable stages. Each stage must be independently testable, routable, and observable. The architecture follows a five-step pipeline: classification, parallel execution, context grounding, arbitration, and feedback ingestion.
Step 1: Task Classification and Routing
A lightweight classifier evaluates the diff metadata, file types, and change magnitude to assign a task category. Categories map to specific model tiers or static tools. Simple structural changes (import ordering, dead code removal, formatting) route to fast, low-cost models or linters. Complex logic changes (state machine modifications, concurrency patterns, cross-module dependencies) route to frontier reasoning models.
// Result of classifying a diff: category, classifier confidence, and the tier to route to.
interface TaskCategory {
  id: 'style' | 'logic' | 'security' | 'compliance';
  confidence: number; // classifier score in [0, 1]
  targetTier: 'fast' | 'standard' | 'frontier';
}

class TaskClassifier {
  async categorize(diff: DiffContext): Promise<TaskCategory> {
    // extractFeatures, mapToCategory, and this.model are not shown in this excerpt.
    const features = this.extractFeatures(diff);
    const prediction = await this.model.predict(features);
    return {
      id: this.mapToCategory(prediction.label),
      confidence: prediction.score,
      targetTier: this.resolveTier(prediction.score, diff.complexity)
    };
  }

  // Escalate when the change is complex or the classifier is unsure of its own label.
  private resolveTier(score: number, complexity: number): TaskCategory['targetTier'] {
    if (complexity > 0.75 || score < 0.6) return 'frontier';
    if (complexity > 0.4) return 'standard';
    return 'fast';
  }
}
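To make the escalation rule concrete, the sketch below lifts the same cut-offs into a standalone function and runs it on a few inputs. The `Tier` alias, the sample labels, and the sample scores are illustrative assumptions, not output from a real classifier.

```typescript
type Tier = 'fast' | 'standard' | 'frontier';

// Same cut-offs as resolveTier: escalate on high complexity or low classifier confidence.
function resolveTier(score: number, complexity: number): Tier {
  if (complexity > 0.75 || score < 0.6) return 'frontier';
  if (complexity > 0.4) return 'standard';
  return 'fast';
}

// Hypothetical inputs: [description, classifier score, diff complexity].
const samples: Array<[string, number, number]> = [
  ['import reordering', 0.95, 0.1],
  ['refactor across two modules', 0.9, 0.5],
  ['state machine change', 0.9, 0.8],
  ['small diff, unsure classifier', 0.5, 0.1],
];

for (const [label, score, complexity] of samples) {
  console.log(`${label} -> ${resolveTier(score, complexity)}`);
}
```

Note the last sample: a trivial diff still escalates to the frontier tier when the classifier is unsure, which trades cost for safety on ambiguous changes.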
Step 2: Parallel Deterministic and Probabilistic Execution
Static analysis and model invocation run concurrently. Deterministic tools provide ground truth that the orchestrator uses to anchor probabilistic outputs. This prevents the model from hallucinating fixes for issues that compilers or type checkers already caught.
interface AnalysisResult {
  source: 'static' | 'model' | 'scanner';
  findings: ReviewComment[];
  latencyMs: number;
}

async function executeParallelReview(
  diff: DiffContext,
  tier: 'fast' | 'standard' | 'frontier'
): Promise<AnalysisResult[]> {
  // Start all three analyses before awaiting so they run concurrently.
  const staticPromise = StaticAnalyzer.run(diff);
  const modelPromise = ModelGateway.invoke(diff, tier);
  const securityPromise = SecretScanner.scan(diff);
  // Promise.all rejects as soon as any one of the analyses fails.
  const [staticRes, modelRes, securityRes] = await Promise.all([
    staticPromise, modelPromise, securityPromise
  ]);
  // Each tool result is assumed to expose both `findings` and `latency`.
  return [
    { source: 'static', findings: staticRes.findings, latencyMs: staticRes.latency },
    { source: 'model', findings: modelRes.findings, latencyMs: modelRes.latency },
    { source: 'scanner', findings: securityRes.findings, latencyMs: securityRes.latency }
  ];
}
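One way the orchestrator might use deterministic results as anchors is to drop model findings that point at a location a static tool has already flagged. The sketch below assumes a minimal `Finding` shape with `file`, `line`, and `message` fields; the real `ReviewComment` type is not shown in this section, and matching on file and line alone is a deliberately coarse heuristic.

```typescript
// Hypothetical minimal shape; stands in for the ReviewComment type, which is not shown here.
interface Finding {
  file: string;
  line: number;
  message: string;
}

// Drop model findings that land on a file/line a deterministic tool already flagged.
function suppressAnchored(modelFindings: Finding[], staticFindings: Finding[]): Finding[] {
  const anchored = new Set(staticFindings.map(f => `${f.file}:${f.line}`));
  return modelFindings.filter(f => !anchored.has(`${f.file}:${f.line}`));
}

const staticFindings: Finding[] = [
  { file: 'src/auth.ts', line: 42, message: 'Type mismatch reported by the type checker' },
];
const modelFindings: Finding[] = [
  { file: 'src/auth.ts', line: 42, message: 'Possible type mismatch' },
  { file: 'src/auth.ts', line: 88, message: 'Token refresh can race with logout' },
];

const kept = suppressAnchored(modelFindings, staticFindings);
console.log(kept.map(f => f.message));
```

A stricter variant could also compare rule identifiers, so that a model comment about a different problem on the same line survives.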
Step 3: Repository-Aware Context Grounding
Pure model review lacks institutional memory. Retrieval-augmented generation over the repository injects prior review threads, commit messages, and design documents scoped to the touched modules. This shifts the model from generic best-practice advice to comments that align with established codebase conventions.
class ContextRetriever {
  async fetchScopedContext(diff: DiffContext): Promise<string[]> {
    // Restrict retrieval to the modules this diff touches.
    const touchedModules = this.extractModules(diff);
    const historicalThreads = await ReviewArchive.query({
      modules: touchedModules,
      limit: 5,
      sort: 'recency'
    });
    // Render each thread as a compact snippet for the model prompt.
    return historicalThreads.map(thread =>
      `Module: ${thread.module}\nPattern: ${thread.summary}\nResolution: ${thread.outcome}`
    );
  }
}
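The retrieved snippets still have to reach the model. A minimal sketch of that step, assuming a hypothetical `buildGroundedPrompt` helper and prompt wording that are not part of the original design, could cap the snippet count and place the diff last:

```typescript
// Hypothetical helper: combines retrieved snippets and the diff into one prompt string.
function buildGroundedPrompt(diffText: string, snippets: string[], maxSnippets = 5): string {
  const context = snippets
    .slice(0, maxSnippets) // enforce the snippet budget
    .map((s, i) => `[Context ${i + 1}]\n${s}`)
    .join('\n\n');
  return [
    'Prior review decisions for the touched modules:',
    context || '(none found)',
    'Review the following diff in light of these conventions:',
    diffText,
  ].join('\n\n');
}

const prompt = buildGroundedPrompt('- const a = 1;\n+ const a = 2;', [
  'Module: billing\nPattern: raw SQL in handlers\nResolution: moved to repository layer',
]);
console.log(prompt);
```

Capping the snippet count in the prompt builder as well as in the archive query keeps the context budget enforced even if a different retriever is plugged in later.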
Step 4: Arbitration and Confidence Thresholding
The orchestrator merges findings, removes duplicates, and applies calibrated confidence thresholds. Comments below the threshold are suppressed. Historical dismissal patterns are hashed and compared against new findings to prevent recurring noise.
class ArbitrationEngine {
  private readonly THRESHOLD = 0.78;
  // Signature hash -> number of times developers dismissed a matching comment.
  private readonly dismissalCache = new Map<string, number>();

  async filterAndRank(results: AnalysisResult[]): Promise<ReviewComment[]> {
    const merged = this.deduplicate(results.flatMap(r => r.findings));
    const filtered = merged.filter(comment => {
      const hash = this.computeSignature(comment);
      const pastDismissals = this.dismissalCache.get(hash) ?? 0;
      // Suppress patterns that developers have already dismissed twice.
      if (pastDismissals >= 2) return false;
      return comment.confidence >= this.THRESHOLD;
    });
    // Highest severity first.
    return filtered.sort((a, b) => b.severity - a.severity);
  }
}
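`computeSignature` is not shown above. A plausible sketch, assuming the signature covers pattern, file scope, and severity (as the Pitfall Guide suggests) and that SHA-256 from Node's built-in `crypto` module is acceptable, follows. The `SignatureInput` shape and the normalization rules are assumptions.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical fields; the article only says a signature covers pattern, file scope, and severity.
interface SignatureInput {
  pattern: string;   // normalized rule name or message template
  fileScope: string; // e.g. a directory or module path
  severity: number;
}

function computeSignature(input: SignatureInput): string {
  // Normalize case and surrounding whitespace so cosmetic differences hash identically.
  // The NUL separator is unlikely to occur inside a field, which keeps field boundaries unambiguous.
  const canonical = [
    input.pattern.trim().toLowerCase(),
    input.fileScope,
    String(input.severity),
  ].join('\u0000');
  return createHash('sha256').update(canonical).digest('hex');
}

console.log(computeSignature({ pattern: 'Unused variable', fileScope: 'src/billing', severity: 2 }));
```

Including the file scope makes suppression local to a module; leaving it out would suppress a dismissed pattern repository-wide, which is a stronger and riskier policy.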
Step 5: Closed Feedback Loop
Every accepted or dismissed comment becomes a training signal. The system updates routing weights, adjusts confidence thresholds, and logs compliance metadata. Without this loop, the false-positive rate has no mechanism to improve. With it, precision can trend upward with each release cycle.
class FeedbackCollector {
  async ingest(action: 'accept' | 'dismiss', comment: ReviewComment): Promise<void> {
    // Record what the developer did and which routing path produced the comment.
    const signal = {
      commentId: comment.id,
      action,
      timestamp: Date.now(),
      routingPath: comment.routingMetadata
    };
    await TelemetryPipeline.push(signal);
    // updateThresholds and refreshDismissalCache are not shown in this excerpt.
    await this.updateThresholds(signal);
    await this.refreshDismissalCache(comment);
  }
}
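The dismissal cache that `refreshDismissalCache` maintains can be sketched as a small counter keyed by comment signature. This in-memory `DismissalCache` class is hypothetical; a production version would presumably be persisted and shared across orchestrator instances.

```typescript
// In-memory sketch of the dismissal counter consulted during arbitration.
class DismissalCache {
  private readonly counts = new Map<string, number>();
  private readonly limit: number;

  constructor(limit = 2) {
    this.limit = limit;
  }

  // Called when a developer dismisses a comment with this signature.
  recordDismissal(signature: string): number {
    const next = (this.counts.get(signature) ?? 0) + 1;
    this.counts.set(signature, next);
    return next;
  }

  // Mirrors the arbitration check: suppress once dismissals reach the limit.
  shouldSuppress(signature: string): boolean {
    return (this.counts.get(signature) ?? 0) >= this.limit;
  }
}

const cache = new DismissalCache();
cache.recordDismissal('sig-a');
console.log(cache.shouldSuppress('sig-a')); // false: one dismissal is below the limit of 2
cache.recordDismissal('sig-a');
console.log(cache.shouldSuppress('sig-a')); // true: the limit has been reached
```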
Architecture Rationale:
- Parallel execution minimizes wall-clock latency. Static analyzers complete in milliseconds; models take seconds. Running them concurrently bounds total latency by the slowest call rather than the sum of all calls.
- Static grounding prevents hallucination. The model receives compiler/type-checker output as context, forcing it to focus on architectural and logical gaps rather than re-inventing deterministic rules.
- Scoped RAG reduces context window waste. Feeding entire repositories degrades precision and increases cost. Module-scoped retrieval preserves relevance.
- Thresholding + historical dedup protects trust. Confidence scores are calibrated against a validation set of past PRs with human accept/reject labels. Dismissal hashing prevents the same pattern from resurfacing after repeated rejections.
- Feedback ingestion closes the loop. Routing weights and thresholds are recalibrated weekly using offline evaluation sets, ensuring the system adapts to model updates and codebase evolution.
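The calibration mentioned above can be sketched as a sweep over candidate thresholds on a labeled evaluation set. This sketch assumes "false positives" means the share of surfaced comments that humans rejected; the `LabeledComment` shape, the function name, and the sample data are illustrative.

```typescript
// One past comment from the evaluation set: model confidence plus the human verdict.
interface LabeledComment {
  confidence: number;
  accepted: boolean;
}

// Returns the lowest threshold at which the share of surfaced comments that were
// rejected stays under maxFalsePositiveShare, or null if no threshold qualifies.
function calibrateThreshold(
  evalSet: LabeledComment[],
  maxFalsePositiveShare = 0.03
): number | null {
  const candidates = Array.from(new Set(evalSet.map(c => c.confidence))).sort((a, b) => a - b);
  for (const threshold of candidates) {
    // Every candidate comes from the set, so at least one comment is surfaced.
    const surfaced = evalSet.filter(c => c.confidence >= threshold);
    const rejected = surfaced.filter(c => !c.accepted).length;
    if (rejected / surfaced.length < maxFalsePositiveShare) return threshold;
  }
  return null;
}

const evalSet: LabeledComment[] = [
  { confidence: 0.95, accepted: true },
  { confidence: 0.9, accepted: true },
  { confidence: 0.85, accepted: true },
  { confidence: 0.8, accepted: false },
  { confidence: 0.7, accepted: true },
  { confidence: 0.6, accepted: false },
];
console.log(calibrateThreshold(evalSet)); // 0.85 for this sample
```

Choosing the lowest qualifying threshold keeps as many valid comments as possible while still meeting the false-positive target.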
Pitfall Guide
1. Model-Centric Design
Explanation: Treating the language model as the product rather than a component in a larger arbitration system. Teams spend weeks optimizing prompts while ignoring routing, grounding, and feedback infrastructure.
Fix: Decouple model invocation from orchestration. Build the router, static analysis pipeline, and feedback loop first. Treat the model as a pluggable execution tier.
2. Ignoring Deterministic Anchors
Explanation: Running the model without parallel static analysis. The model wastes tokens re-checking type errors, missing await keywords, or unused imports, increasing latency and hallucination risk.
Fix: Always execute compilers, type checkers, and linters in parallel. Feed their output as ground truth context to the model. Suppress model comments that duplicate deterministic findings.
3. Unbounded Confidence Output
Explanation: Surfacing every model comment regardless of confidence score. Low-confidence suggestions flood the PR, degrading trust and increasing cognitive load.
Fix: Implement calibrated confidence thresholding. Run an offline evaluation set to determine the precision-recall trade-off curve. Set the threshold at the point where false positives drop below 3% without sacrificing critical logic catches.
4. Compliance as a Post-Processing Step
Explanation: Adding GDPR, HIPAA, or PCI-DSS checks after the review pipeline completes. This creates gaps in data transfer, retention, and access control that surface during audits.
Fix: Treat compliance as a first-class routing constraint. Tag diffs containing regulated data at ingestion. Route PHI/PCI payloads to BAA-compliant endpoints with redaction layers. Enforce retention policies on prompt/completion logs at the orchestrator level.
5. Static Routing Weights
Explanation: Hardcoding task classification rules or model routing decisions. As models update and codebases evolve, static weights drift, causing misclassification and confident-sounding nonsense on complex changes.
Fix: Implement evaluation-driven A/B routing. Maintain an offline dataset of past PRs with human accept/reject outcomes. Score model variants on precision and recall per slice. Update routing weights every release cycle based on empirical performance.
6. Ignoring Historical Dismissals
Explanation: Failing to track past developer dismissals. The same flawed comment pattern resurfaces across files, eroding trust and wasting developer time.
Fix: Hash comment signatures (pattern + file scope + severity). Cache dismissal counts. Suppress findings that match historically rejected patterns. Log these hashes for periodic review and threshold adjustment.
7. Context Window Overload
Explanation: Feeding the entire repository or large design documents into the model prompt. This dilutes relevance, increases token cost, and degrades precision on targeted changes.
Fix: Implement scoped retrieval. Index prior review threads, commit messages, and module-level design docs. Query only the touched files, owners, and recent architectural decisions. Limit context to 4-6 highly relevant snippets.
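The evaluation-driven routing in pitfall 5 depends on scoring each model variant per slice. A minimal sketch of that scoring, assuming an `EvalRecord` shape in which each labeled issue records whether the variant surfaced it and whether a human judged it valid, might look like this. All names and the zero-denominator convention are assumptions.

```typescript
// Hypothetical evaluation record for one candidate comment or labeled issue.
interface EvalRecord {
  slice: string;     // e.g. a task category such as 'logic' or 'security'
  surfaced: boolean; // did the model variant surface a comment?
  valid: boolean;    // did a human judge the issue real (accept)?
}

interface SliceScore {
  precision: number;
  recall: number;
}

// Convention for this sketch: a ratio with a zero denominator is reported as 0.
const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);

function scoreBySlice(records: EvalRecord[]): Map<string, SliceScore> {
  const tallies = new Map<string, { tp: number; fp: number; fn: number }>();
  for (const r of records) {
    const t = tallies.get(r.slice) ?? { tp: 0, fp: 0, fn: 0 };
    if (r.surfaced && r.valid) t.tp++;      // true positive
    else if (r.surfaced) t.fp++;            // false positive
    else if (r.valid) t.fn++;               // missed real issue
    tallies.set(r.slice, t);
  }
  const scores = new Map<string, SliceScore>();
  tallies.forEach((t, slice) => {
    scores.set(slice, {
      precision: ratio(t.tp, t.tp + t.fp),
      recall: ratio(t.tp, t.tp + t.fn),
    });
  });
  return scores;
}

const records: EvalRecord[] = [
  { slice: 'logic', surfaced: true, valid: true },
  { slice: 'logic', surfaced: true, valid: true },
  { slice: 'logic', surfaced: true, valid: true },
  { slice: 'logic', surfaced: true, valid: false },
  { slice: 'logic', surfaced: false, valid: true },
  { slice: 'style', surfaced: true, valid: true },
];
console.log(scoreBySlice(records));
```

Comparing these per-slice scores across model variants gives the empirical basis for updating routing weights, rather than relying on an aggregate number that can hide a weak slice.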
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume open source repository | Fast-tier routing + static grounding + strict thresholding | Volume demands low latency and cost; precision matters more than exhaustive reasoning | -$0.02/PR vs frontier-only |
| Regulated fintech or healthcare | Compliance-first routing + PHI/PCI redaction + BAA endpoints | Legal constraints dictate data handling; model choice is secondary to compliance | +$0.01/PR for redaction/compliance layer |
| Legacy monolith with sparse docs | Scoped RAG + historical dedup + fallback chains | Institutional knowledge is fragmented; grounding prevents hallucination and repeated noise | Neutral cost, +15% precision |
| Rapid prototyping / internal tools | Single-pass frontier model + minimal filtering | Speed and iteration matter more than precision; false positives are tolerable | +$0.12/PR, lower trust retention |
Configuration Template
orchestrator:
  version: "2.1"
  routing:
    classifier:
      model: "task-router-v3"
      confidence_floor: 0.65
    fallback:
      enabled: true
      max_retries: 2
      escalation_tier: "frontier"
    evaluation:
      dataset: "pr-eval-v2024-q3"
      update_cycle: "weekly"
      metric: "precision_at_0.78"
  grounding:
    static_analyzers:
      - "typescript-compiler"
      - "eslint-config-strict"
      - "null-safety-checker"
    context_retrieval:
      scope: "module"
      max_snippets: 5
      sources: ["review_threads", "commit_messages", "design_docs"]
  arbitration:
    confidence_threshold: 0.78
    deduplication:
      enabled: true
      dismissal_limit: 2
      hash_algorithm: "sha256"
  compliance:
    gdpr:
      transfer_restriction: true
      retention_days: 30
    hipaa:
      baa_required: true
      min_necessary_access: true
    pci_dss:
      redact_cardholder_data: true
      tokenization: true
  feedback:
    pipeline: "telemetry-ingest-v2"
    threshold_adjustment: "adaptive"
  logging:
    level: "info"
    compliance_audit: true
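Several values in this template are coupled: the evaluation metric name appears to encode the same 0.78 cut-off that arbitration uses. A small sanity check run at load time could catch drift between them. The sketch below takes the relevant values as a flat object to avoid assuming a particular nesting; the function name, the field names, and the reading of `precision_at_<threshold>` as a naming convention are all assumptions.

```typescript
// Flat view of the template values checked here (field names are hypothetical).
interface ThresholdSettings {
  confidenceFloor: number;     // routing classifier confidence_floor
  confidenceThreshold: number; // arbitration confidence_threshold
  dismissalLimit: number;      // deduplication dismissal_limit
  evaluationMetric: string;    // evaluation metric, e.g. "precision_at_0.78"
}

// Returns a list of human-readable problems; an empty list means the values look consistent.
function validateThresholds(cfg: ThresholdSettings): string[] {
  const problems: string[] = [];
  const inUnitInterval = (x: number): boolean => x >= 0 && x <= 1;

  if (!inUnitInterval(cfg.confidenceFloor)) problems.push('confidence_floor must be in [0, 1]');
  if (!inUnitInterval(cfg.confidenceThreshold)) problems.push('confidence_threshold must be in [0, 1]');
  if (!Number.isInteger(cfg.dismissalLimit) || cfg.dismissalLimit < 1) {
    problems.push('dismissal_limit must be a positive integer');
  }

  // Assumes the metric name encodes the threshold it was evaluated at.
  const match = /^precision_at_(\d*\.?\d+)$/.exec(cfg.evaluationMetric);
  if (match && Number(match[1]) !== cfg.confidenceThreshold) {
    problems.push('evaluation metric and confidence_threshold refer to different cut-offs');
  }
  return problems;
}

console.log(validateThresholds({
  confidenceFloor: 0.65,
  confidenceThreshold: 0.78,
  dismissalLimit: 2,
  evaluationMetric: 'precision_at_0.78',
}));
```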
Quick Start Guide
- Initialize the orchestrator skeleton: Deploy the routing layer with a lightweight classifier and map three task categories (style, logic, security) to fast, standard, and frontier tiers.
- Wire parallel execution: Connect your existing static analyzers and model gateway to run concurrently. Ensure deterministic outputs are injected as context before model invocation.
- Calibrate thresholds: Run an offline evaluation against 500 past PRs with human accept/reject labels. Plot precision-recall curves and set the confidence threshold at the point where false positives drop below 3%.
- Enable feedback ingestion: Deploy the telemetry collector to log accept/dismiss actions. Schedule a weekly job that recalibrates routing weights and updates the dismissal cache.
- Validate compliance routing: Tag a sample diff containing regulated data. Verify that the orchestrator routes it to a BAA-compliant endpoint, applies redaction, and enforces retention policies before surfacing comments.
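The compliance validation in the last step is easier to test if the routing decision is a pure function of the tags applied at ingestion. A minimal sketch, assuming hypothetical tag names and a `RoutingDecision` shape, and borrowing the 30-day retention value from the configuration template:

```typescript
// Hypothetical tags attached to a diff at ingestion.
type RegulatedTag = 'phi' | 'pci' | 'gdpr';

interface RoutingDecision {
  endpoint: 'standard' | 'baa-compliant';
  redact: boolean;
  retentionDays: number | null; // null = no compliance-driven retention limit in this sketch
}

// PHI/PCI payloads go to a BAA-compliant endpoint with redaction; GDPR data gets a retention cap.
function routeForCompliance(tags: ReadonlySet<RegulatedTag>): RoutingDecision {
  const regulatedPayload = tags.has('phi') || tags.has('pci');
  return {
    endpoint: regulatedPayload ? 'baa-compliant' : 'standard',
    redact: regulatedPayload,
    retentionDays: tags.has('gdpr') ? 30 : null,
  };
}

console.log(routeForCompliance(new Set<RegulatedTag>(['phi', 'gdpr'])));
console.log(routeForCompliance(new Set<RegulatedTag>()));
```

Keeping the decision free of side effects means the sample-diff check in the last step reduces to asserting on the returned object before any model call is made.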