mentation pattern using TypeScript.
Step 1: Context Assembly Architecture
AI models degrade in accuracy when context is fragmented. The review system must bundle three distinct layers:
- Intent & Scope: What the change is supposed to achieve
- Constraints: Security, performance, backward compatibility, and deployment rules
- Diff Payload: The complete patch or file set under review
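interface ReviewContext {
  objective: string;
  domain: 'frontend' | 'backend' | 'data-pipeline' | 'infrastructure';
  constraints: string[];
  affectedFiles: string[];
  diffPayload: string;
}

class ReviewContextBuilder {
  private context: ReviewContext;

  constructor() {
    // Defaults; objective and diffPayload must be set before build().
    this.context = {
      objective: '',
      domain: 'backend',
      constraints: [],
      affectedFiles: [],
      diffPayload: ''
    };
  }

  setObjective(goal: string): this {
    this.context.objective = goal;
    return this;
  }

  // Overrides the default 'backend' domain.
  setDomain(domain: ReviewContext['domain']): this {
    this.context.domain = domain;
    return this;
  }

  addConstraint(rule: string): this {
    this.context.constraints.push(rule);
    return this;
  }

  // Records a file touched by the change so it appears in the prompt.
  addAffectedFile(path: string): this {
    this.context.affectedFiles.push(path);
    return this;
  }

  setDiff(patch: string): this {
    this.context.diffPayload = patch;
    return this;
  }

  build(): ReviewContext {
    if (!this.context.objective || !this.context.diffPayload) {
      throw new Error('Review context requires objective and diff payload');
    }
    // Copy the arrays so later builder calls cannot mutate a built context.
    return {
      ...this.context,
      constraints: [...this.context.constraints],
      affectedFiles: [...this.context.affectedFiles]
    };
  }
}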
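The `affectedFiles` field can be filled in by hand, but it can also be derived from the diff itself. The following is a minimal sketch, assuming the diff comes from `git diff` and uses the usual `--- a/<path>` and `+++ b/<path>` file headers; `extractAffectedFiles` is a hypothetical helper, not part of the builder above.

```typescript
// Hypothetical helper: list the files touched by a git-style unified diff.
// Assumes "--- a/<path>" / "+++ b/<path>" headers; other diff formats are not handled.
function extractAffectedFiles(diff: string): string[] {
  const files = new Set<string>();
  let previousPath: string | undefined;

  for (const line of diff.split(/\r?\n/)) {
    // Remember the pre-change path; it is only needed for deleted files.
    const oldHeader = /^--- a\/(.+)$/.exec(line);
    if (oldHeader) {
      previousPath = oldHeader[1];
      continue;
    }

    const newHeader = /^\+\+\+ b\/(.+)$/.exec(line);
    if (newHeader && newHeader[1]) {
      // Modified or added file: use the post-change path.
      files.add(newHeader[1]);
    } else if (line.startsWith('+++ /dev/null') && previousPath) {
      // Deleted file: the new side is /dev/null, so fall back to the old path.
      files.add(previousPath);
    }
    previousPath = undefined;
  }

  return Array.from(files);
}
```

Each returned path could then be passed to the builder one at a time. A removed source line that happens to begin with `-- a/` would be misread as a header, so treat the result as a convenience and not as a guarantee.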
Step 2: Prompt Template Engineering
The prompt must enforce categorical evaluation and suppress conversational filler. It should explicitly forbid assumption generation and require evidence-backed findings.
const REVIEW_TEMPLATE = `
You are a senior systems engineer conducting a pre-merge risk assessment.
Your mandate is to identify failure modes, not validate implementation choices.
Context:
- Objective: {{objective}}
- Domain: {{domain}}
- Constraints: {{constraints}}
- Affected Files: {{affectedFiles}}
Diff Payload:
{{diffPayload}}
Evaluation Categories:
1. Scope Alignment: Does the change solve the stated objective without introducing unrelated modifications?
2. Correctness Boundaries: Are null, empty, malformed, and boundary inputs handled? Are error states contained?
3. Security Posture: Are authentication boundaries preserved? Is user input validated? Are secrets or tokens exposed in logs or responses?
4. Failure Resilience: How does the system behave under dependency timeout, partial failure, or network degradation? Does it fail open or closed?
5. Performance Topology: Are there unbounded loops, N+1 query patterns, or hot-path latency additions?
6. Test Validity: Do tests verify intent or just implementation? Are negative and failure paths covered?
7. Maintainability: Is duplication introduced? Are naming conventions consistent with the existing codebase?
8. Deployment Topology: Does this require migrations, feature flags, or config updates? Is rollback safe?
Output Requirements:
- Verdict: SAFE | REQUIRES_CHANGES | HIGH_RISK
- Risk Register: List 3-7 risks. Each must include: category, severity, code evidence, and mitigation.
- Test Gaps: Identify missing validation scenarios.
- Clarification Queue: Questions that must be resolved before merge.
- Minimal Alternative: If scope is excessive, propose a constrained implementation.
Rules:
- Reference specific functions, lines, or modules.
- Do not invent code or assume missing context.
- If uncertain, state what evidence would resolve the ambiguity.
`;
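The template uses `{{placeholder}}` tokens but needs something to fill them in. One possible renderer is sketched below; `renderTemplate` and `TemplateValue` are hypothetical names, and joining list values with `; ` is an assumption about how constraints and files should appear in the prompt.

```typescript
type TemplateValue = string | string[];

// Hypothetical renderer: replaces each {{key}} token with the matching value.
// Throws on an unknown key so a typo in the template cannot silently ship.
function renderTemplate(
  template: string,
  values: Record<string, TemplateValue>
): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    const value = values[key];
    if (value === undefined) {
      throw new Error(`Missing template value: ${key}`);
    }
    if (Array.isArray(value)) {
      // Lists (constraints, affected files) are flattened onto one line.
      return value.length > 0 ? value.join('; ') : '(none)';
    }
    return value;
  });
}
```

Because the replacement runs in a single pass with a callback, text inside the diff that happens to contain `{{...}}` or `$&` is inserted literally and is not expanded a second time.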
Step 3: Response Parsing & Validation
AI outputs are unstructured by default. Production systems should enforce schema validation to extract actionable data.
interface ReviewOutput {
  verdict: 'SAFE' | 'REQUIRES_CHANGES' | 'HIGH_RISK';
  riskRegister: Array<{
    category: string;
    severity: 'LOW' | 'MEDIUM' | 'HIGH' | 'CRITICAL';
    evidence: string;
    mitigation: string;
  }>;
  testGaps: string[];
  clarificationQueue: string[];
  minimalAlternative?: string;
}

// extractRisks, extractList, and extractBlock are helpers that pull each
// labeled section out of the raw text; they are not defined in this snippet.
function parseReviewResponse(raw: string): ReviewOutput {
  // In production, use a JSON schema validator or structured output API.
  // This is a simplified extraction pattern.
  const verdictMatch = raw.match(/Verdict:\s*(SAFE|REQUIRES_CHANGES|HIGH_RISK)/i);
  if (!verdictMatch) throw new Error('Invalid review output: missing verdict');
  return {
    // The match is case-insensitive, so normalize before narrowing the type.
    verdict: verdictMatch[1].toUpperCase() as ReviewOutput['verdict'],
    riskRegister: extractRisks(raw),
    testGaps: extractList(raw, 'Test Gaps:'),
    clarificationQueue: extractList(raw, 'Clarification Queue:'),
    minimalAlternative: extractBlock(raw, 'Minimal Alternative:')
  };
}
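The parser above calls `extractList` and `extractBlock` without showing them. A minimal sketch of the two follows, assuming the model echoes the section labels from the Output Requirements at the start of a line and writes list items as `-`, `*`, or numbered bullets. `SECTION_LABELS` is a hypothetical constant, and `extractRisks` is left out because its format depends on how the model lays out each risk entry.

```typescript
// Assumed section labels, copied from the Output Requirements in the template.
const SECTION_LABELS = [
  'Verdict:',
  'Risk Register:',
  'Test Gaps:',
  'Clarification Queue:',
  'Minimal Alternative:'
];

// Returns the text between `label` and the next known label, or undefined.
function extractBlock(raw: string, label: string): string | undefined {
  const lines = raw.split(/\r?\n/);
  const start = lines.findIndex((line) => line.trim().startsWith(label));
  if (start === -1) return undefined;

  const collected: string[] = [];
  // Keep any text that follows the label on the same line.
  const inline = (lines[start] ?? '').trim().slice(label.length).trim();
  if (inline) collected.push(inline);

  for (const line of lines.slice(start + 1)) {
    const trimmed = line.trim();
    // Stop as soon as another section begins.
    if (SECTION_LABELS.some((other) => trimmed.startsWith(other))) break;
    collected.push(trimmed);
  }

  const text = collected.join('\n').trim();
  return text.length > 0 ? text : undefined;
}

// Returns the section's items with their bullet or number prefix removed.
function extractList(raw: string, label: string): string[] {
  const block = extractBlock(raw, label);
  if (!block) return [];
  return block
    .split('\n')
    .map((line) => line.replace(/^(?:[-*]|\d+\.)\s+/, '').trim())
    .filter((line) => line.length > 0);
}
```

This kind of label matching is brittle if the model rewords a heading, which is one reason a structured output API or JSON schema is the safer choice in production.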
Architecture Decisions & Rationale
- Separation of Context and Diff: Keeping constraints in their own labeled section, apart from the code payload, makes them easier for the model to apply consistently. When rules are interleaved with the diff, they are more likely to be overlooked.
- Explicit Output Schema: Forcing categorical responses prevents the model from defaulting to conversational summaries. It also enables programmatic integration with CI/CD pipelines.
- Evidence Requirement: Mandating code references makes hallucinated findings easier to spot and pushes the model to ground its analysis in the actual diff.
- Verdict Tiers: SAFE, REQUIRES_CHANGES, and HIGH_RISK provide clear merge gates. This aligns with standard engineering risk matrices and simplifies reviewer decision-making.
Pitfall Guide
1. The Validation Trap
Explanation: Asking "Is this code good?" or "Does this look okay?" triggers the model's alignment training toward reassurance. The output becomes a checklist of compliments with minor style suggestions.
Fix: Replace approval-seeking prompts with risk-seeking directives. Use "What failure modes exist in this change?" or "Identify the highest-risk assumptions before merge."
2. Context Fragmentation
Explanation: Pasting isolated functions or truncated diffs forces the model to guess surrounding dependencies, type definitions, and project conventions. Those guesses raise the rate of both false positives and missed risks.
Fix: Always provide the complete patch, relevant type definitions, and configuration files. If the diff exceeds context limits, split the review by module and aggregate findings.
3. Constraint Blindness
Explanation: Omitting business rules, security policies, or performance SLAs causes the model to evaluate code against generic best practices rather than your actual production requirements.
Fix: Inject explicit constraints into the context payload. Examples: "Tokens must expire within 15 minutes", "No direct database writes in the request path", "Must support idempotent retries".
4. Test Illusion
Explanation: AI-generated tests often validate the exact path the model used to generate the code. They confirm the happy path but ignore error boundaries, malformed inputs, or concurrent execution states.
Fix: Require the review prompt to explicitly audit test intent vs. implementation coupling. Mandate negative tests, timeout simulations, and boundary condition coverage.
5. Rollback Neglect
Explanation: Developers focus on forward deployment and ignore backward compatibility, migration steps, or feature flag requirements. AI models rarely infer deployment topology unless explicitly prompted.
Fix: Include a deployment topology category in every review. Require answers to: "Can this be reverted without data loss?", "Does it require config changes?", and "What monitoring signals indicate failure?"
6. Prompt Drift
Explanation: Modifying the checklist structure mid-review or adding ad-hoc questions breaks the evaluation consistency. The model loses its categorical anchor and reverts to general commentary.
Fix: Version control your review templates. Treat prompts as infrastructure code. Use configuration files or environment variables to inject constraints rather than rewriting the prompt manually.
7. Over-Reliance on AI Verdicts
Explanation: Treating the AI output as authoritative rather than advisory overlooks what the model cannot see: incident history, undocumented business rules, traffic patterns, and internal security audits.
Fix: Use AI findings as a pre-filter. Human reviewers must validate severity ratings, dismiss false positives, and approve mitigations. The AI surfaces risks; engineers own the decision.
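The fix for Context Fragmentation (pitfall 2) suggests splitting an oversized diff by module and aggregating the findings. A rough sketch of that flow is below. It assumes git-style diffs where each file starts with a `diff --git` header, uses a character budget as a stand-in for a real token limit, and aggregates by taking the most severe verdict; `splitDiffByFile`, `batchChunks`, and `worstVerdict` are hypothetical names.

```typescript
type Verdict = 'SAFE' | 'REQUIRES_CHANGES' | 'HIGH_RISK';

// Split a git-style unified diff into one chunk per file.
function splitDiffByFile(diff: string): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  for (const line of diff.split(/\r?\n/)) {
    if (line.startsWith('diff --git ') && current.length > 0) {
      chunks.push(current.join('\n'));
      current = [];
    }
    current.push(line);
  }
  // Keep the trailing chunk unless it is only whitespace.
  if (current.some((line) => line.trim().length > 0)) {
    chunks.push(current.join('\n'));
  }
  return chunks;
}

// Greedily pack chunks into batches that stay under maxChars.
// A single chunk larger than the budget still gets its own batch.
function batchChunks(chunks: string[], maxChars: number): string[][] {
  const batches: string[][] = [];
  let batch: string[] = [];
  let size = 0;
  for (const chunk of chunks) {
    if (batch.length > 0 && size + chunk.length > maxChars) {
      batches.push(batch);
      batch = [];
      size = 0;
    }
    batch.push(chunk);
    size += chunk.length;
  }
  if (batch.length > 0) batches.push(batch);
  return batches;
}

const VERDICT_RANK: Record<Verdict, number> = {
  SAFE: 0,
  REQUIRES_CHANGES: 1,
  HIGH_RISK: 2
};

// Aggregate per-batch verdicts: the overall result is the most severe one.
function worstVerdict(verdicts: Verdict[]): Verdict {
  if (verdicts.length === 0) {
    throw new Error('Cannot aggregate an empty list of verdicts');
  }
  return verdicts.reduce<Verdict>(
    (worst, v) => (VERDICT_RANK[v] > VERDICT_RANK[worst] ? v : worst),
    'SAFE'
  );
}
```

Splitting by file loses cross-file context, so shared type definitions may still need to be attached to every batch.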
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Minor UI tweak or copy update | Ad-hoc review + linter | Low risk, minimal system impact | Negligible |
| New API endpoint or data model | Structured checklist protocol | Requires security, validation, and rollback assessment | Low (saves 30-40 min human review) |
| Authentication or payment flow | Structured checklist + manual security audit | High compliance risk, requires domain expertise | Medium (justified by incident prevention) |
| Infrastructure or deployment change | Structured checklist + rollback simulation | Failure modes are catastrophic, require topology validation | High (prevents outages) |
| Legacy code refactor | Structured checklist + regression test suite | High risk of silent behavioral changes | Medium (reduces regression bugs) |
Configuration Template
# .ai-review-config.yaml
review:
  template_version: "2.1"
  output_format: "structured"
  required_categories:
    - scope_alignment
    - correctness_boundaries
    - security_posture
    - failure_resilience
    - performance_topology
    - test_validity
    - maintainability
    - deployment_topology
  constraints:
    - "No direct database writes in request handlers"
    - "All user inputs must be validated against schema"
    - "Secrets must never appear in logs or error responses"
    - "Changes must be backward compatible unless flagged"
  # Threshold expressions are quoted: an unquoted value starting with ">" is
  # read by YAML as a folded block scalar and fails to parse.
  verdict_thresholds:
    SAFE: "0 critical, 0 high"
    REQUIRES_CHANGES: "0 critical, <=2 high"
    HIGH_RISK: ">=1 critical OR >=3 high"
  ci_integration:
    block_merge_on: "HIGH_RISK"
    require_human_approval_for: "REQUIRES_CHANGES"
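The `verdict_thresholds` and `ci_integration` entries can be enforced in code instead of being left to the model's own verdict. The sketch below is one reading of those rules, not a definitive one. Since zero high-severity findings satisfies both the SAFE and the REQUIRES_CHANGES rows as written, it assumes REQUIRES_CHANGES means one or two high findings. `resolveVerdict`, `gateAction`, and `GateAction` are hypothetical names.

```typescript
type Severity = 'LOW' | 'MEDIUM' | 'HIGH' | 'CRITICAL';
type Verdict = 'SAFE' | 'REQUIRES_CHANGES' | 'HIGH_RISK';
type GateAction = 'block' | 'needs_human_approval' | 'pass';

// Derive the verdict from the severities in the risk register.
// Checks run from most to least severe so the overlapping rows resolve cleanly.
function resolveVerdict(severities: Severity[]): Verdict {
  const critical = severities.filter((s) => s === 'CRITICAL').length;
  const high = severities.filter((s) => s === 'HIGH').length;

  if (critical >= 1 || high >= 3) return 'HIGH_RISK';
  if (high === 0) return 'SAFE';
  // Assumption: 1-2 high findings with no critical ones.
  return 'REQUIRES_CHANGES';
}

// Map a verdict to a CI action, mirroring block_merge_on and
// require_human_approval_for in the configuration.
function gateAction(verdict: Verdict): GateAction {
  switch (verdict) {
    case 'HIGH_RISK':
      return 'block';
    case 'REQUIRES_CHANGES':
      return 'needs_human_approval';
    case 'SAFE':
      return 'pass';
  }
}
```

Computing the verdict from the risk register also gives a cross-check: if the model's stated verdict disagrees with the computed one, that disagreement is itself worth a human look.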
Quick Start Guide
- Install a prompt versioning tool: Store your review template in a shared configuration file or repository. Never hardcode prompts in chat interfaces.
- Configure your diff extractor: Use `git diff --unified=10` or your IDE's patch export to capture complete context. Ensure type definitions and config files are included if they change.
- Inject constraints: Add your team's security, performance, and deployment rules to the context payload. Treat these as non-negotiable evaluation criteria.
- Run the review: Execute the structured prompt against the diff. Parse the output using the required schema. Address any HIGH_RISK findings before opening the PR.
- Attach findings: Paste the AI review summary into the pull request description. This gives human reviewers a pre-filtered risk register and reduces merge latency.
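For the last step, the parsed review can be turned into a short text summary for the pull request description. This is a minimal sketch; `formatReviewSummary` and `ReviewSummaryInput` are hypothetical names, the input shape mirrors the fields of `ReviewOutput`, and the layout is only one possible choice.

```typescript
interface RiskEntry {
  category: string;
  severity: string;
  evidence: string;
  mitigation: string;
}

interface ReviewSummaryInput {
  verdict: string;
  riskRegister: RiskEntry[];
  testGaps: string[];
  clarificationQueue: string[];
}

// Hypothetical formatter: renders the review as plain bullet lists.
// Empty sections are omitted so the summary stays short.
function formatReviewSummary(review: ReviewSummaryInput): string {
  const section = (title: string, items: string[]): string[] =>
    items.length > 0 ? ['', `${title}:`, ...items.map((item) => `- ${item}`)] : [];

  const risks = review.riskRegister.map(
    (risk) =>
      `[${risk.severity}] ${risk.category}: ${risk.evidence} (mitigation: ${risk.mitigation})`
  );

  return [
    `AI pre-merge review verdict: ${review.verdict}`,
    ...section('Risk register', risks),
    ...section('Test gaps', review.testGaps),
    ...section('Open questions', review.clarificationQueue)
  ].join('\n');
}
```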
By treating AI-assisted development as a two-phase workflow (generation followed by structured risk assessment), teams retain velocity while enforcing production-grade engineering standards. The checklist is not a replacement for human judgment; it is a force multiplier that surfaces failure modes before they reach the merge queue.