string;
phase: IncidentPhase;
severity: 'P1' | 'P2' | 'P3';
affectedServices: string[];
containmentActions: string[];
hypothesisLog: HypothesisEntry[];
createdAt: number;
updatedAt: number;
}
class IncidentStateMachine {
  private context: IncidentContext;

  // Allowed next phases for each phase; an empty list marks a terminal phase.
  private readonly validTransitions: Record<IncidentPhase, IncidentPhase[]> = {
    DETECTION: ['TRIAGE'],
    TRIAGE: ['CONTAINMENT', 'DETECTION'],
    CONTAINMENT: ['RECOVERY', 'TRIAGE'],
    RECOVERY: ['POSTMORTEM'],
    POSTMORTEM: []
  };

  constructor(initial: IncidentContext) {
    this.context = { ...initial, updatedAt: Date.now() };
  }

  // Moves to nextPhase, or throws if the transition map does not allow it.
  advanceTo(nextPhase: IncidentPhase): boolean {
    const allowed = this.validTransitions[this.context.phase];
    if (!allowed.includes(nextPhase)) {
      throw new Error(`Invalid transition: ${this.context.phase} -> ${nextPhase}`);
    }
    this.context.phase = nextPhase;
    this.context.updatedAt = Date.now();
    return true;
  }

  // Returns a shallow copy so callers cannot reassign fields on the internal context.
  getContext(): Readonly<IncidentContext> {
    return { ...this.context };
  }
}
```
**Architecture Rationale:** The transition map is fixed (`readonly`), so teams cannot skip containment or declare recovery prematurely. It enforces the industry-standard lifecycle (Detection → Triage → Containment → Recovery → Postmortem) while allowing controlled backtracking (Triage → Detection, Containment → Triage) when new evidence emerges.
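
To see which phase sequences the transition map accepts, the map can be exercised on its own. The sketch below copies the map from the class above; `canTransition` and `firstInvalidHop` are hypothetical helpers introduced here for illustration and are not part of the state machine.

```typescript
type IncidentPhase = 'DETECTION' | 'TRIAGE' | 'CONTAINMENT' | 'RECOVERY' | 'POSTMORTEM';

// Same transition map as the state machine above.
const VALID_TRANSITIONS: Record<IncidentPhase, readonly IncidentPhase[]> = {
  DETECTION: ['TRIAGE'],
  TRIAGE: ['CONTAINMENT', 'DETECTION'],
  CONTAINMENT: ['RECOVERY', 'TRIAGE'],
  RECOVERY: ['POSTMORTEM'],
  POSTMORTEM: []
};

// Hypothetical helper: true when `to` is reachable from `from` in one step.
function canTransition(from: IncidentPhase, to: IncidentPhase): boolean {
  return VALID_TRANSITIONS[from].includes(to);
}

// Hypothetical helper: index of the first illegal hop in a proposed
// sequence of phases, or -1 when every hop is allowed.
function firstInvalidHop(path: IncidentPhase[]): number {
  let index = 0;
  let prev: IncidentPhase | undefined;
  for (const phase of path) {
    if (prev !== undefined && !canTransition(prev, phase)) return index;
    prev = phase;
    index += 1;
  }
  return -1;
}

const happyPath: IncidentPhase[] = ['DETECTION', 'TRIAGE', 'CONTAINMENT', 'RECOVERY', 'POSTMORTEM'];
const backtrack: IncidentPhase[] = ['DETECTION', 'TRIAGE', 'CONTAINMENT', 'TRIAGE', 'CONTAINMENT'];
const skipsContainment: IncidentPhase[] = ['DETECTION', 'TRIAGE', 'RECOVERY'];
```

Here `happyPath` and `backtrack` pass, while `skipsContainment` fails at its third element because Triage cannot move directly to Recovery.
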
### Step 2: Automated Communication Router
Stakeholder updates must be predictable, plain-language, and cadence-driven. Technical jargon during outages increases panic and misalignment. The router abstracts message composition and enforces delivery intervals.
```typescript
interface CommsPayload {
  channel: 'slack' | 'email' | 'statuspage';
  audience: 'engineering' | 'product' | 'executive' | 'customers';
  template: 'initial' | 'update' | 'resolution';
  data: {
    status: string;
    impact: string;
    nextMilestone: string;
    confidence: 'high' | 'medium' | 'low';
    caveats: string[];
  };
}

class CommsRouter {
  // Minutes between updates, keyed by severity.
  private readonly cadenceMap: Record<string, number> = {
    P1: 15,
    P2: 30,
    P3: 60
  };

  async broadcast(payload: CommsPayload, severity: 'P1' | 'P2' | 'P3'): Promise<void> {
    const interval = this.cadenceMap[severity];
    const message = this.translateToAudience(payload);
    await this.deliver(payload.channel, payload.audience, message);
    // Only logs the interval; scheduling the follow-up is left to the integration.
    console.log(`[COMMS] Scheduled next update in ${interval}m | Channel: ${payload.channel}`);
  }

  // Renders one fixed template; the audience does not yet change the wording.
  private translateToAudience(payload: CommsPayload): string {
    const { status, impact, nextMilestone, confidence, caveats } = payload.data;
    return `[${payload.template.toUpperCase()}] Status: ${status}. Impact: ${impact}. Next milestone: ${nextMilestone} (Confidence: ${confidence}). Caveats: ${caveats.join(', ') || 'None'}`;
  }

  private async deliver(channel: string, audience: string, message: string): Promise<void> {
    // Integration with Slack SDK, SendGrid, or Statuspage API
    console.log(`[DELIVER] ${channel} -> ${audience}: ${message}`);
  }
}
```
**Architecture Rationale:** Separating message composition from delivery channels ensures consistency across platforms. The cadence map ties update frequency to severity, preventing notification fatigue during P3 events while maintaining urgency for P1 incidents. Plain-language translation is enforced at the router level, not left to individual engineers.
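
The router above logs the interval but does not schedule anything. One way a caller might turn the cadence map into a concrete deadline is sketched below. `nextUpdateDueAt` and `isUpdateOverdue` are hypothetical names, and timestamps are assumed to be epoch milliseconds, as returned by `Date.now()`.

```typescript
type Severity = 'P1' | 'P2' | 'P3';

// Minutes between stakeholder updates, mirroring the router's cadence map.
const CADENCE_MINUTES: Record<Severity, number> = { P1: 15, P2: 30, P3: 60 };

const MS_PER_MINUTE = 60 * 1000;

// Hypothetical helper: epoch-millisecond deadline for the next update.
function nextUpdateDueAt(severity: Severity, lastSentAt: number): number {
  return lastSentAt + CADENCE_MINUTES[severity] * MS_PER_MINUTE;
}

// Hypothetical helper: true once the deadline has passed.
function isUpdateOverdue(severity: Severity, lastSentAt: number, now: number): boolean {
  return now > nextUpdateDueAt(severity, lastSentAt);
}

// Example: an update sent at 12:00 UTC, checked again 16 minutes later.
const sentAt = Date.UTC(2024, 0, 1, 12, 0, 0);
const sixteenMinutesLater = sentAt + 16 * MS_PER_MINUTE;
const p1Overdue = isUpdateOverdue('P1', sentAt, sixteenMinutesLater); // 15-minute cadence has lapsed
const p3Overdue = isUpdateOverdue('P3', sentAt, sixteenMinutesLater); // 60-minute cadence has not
```

Passing `now` in as a parameter, instead of reading the clock inside the helper, keeps the check easy to test.
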
### Step 3: Hypothesis & Evidence Tracker
Debugging during outages requires structured iteration. Untracked hypotheses lead to duplicated effort and premature conclusions. The tracker enforces the observe-hypothesize-test-confirm loop.
```typescript
import { randomUUID } from 'node:crypto';

interface HypothesisEntry {
  id: string;
  boundary: string;
  observation: string;
  hypothesis: string;
  testAction: string;
  result: 'confirmed' | 'refuted' | 'inconclusive';
  evidence: Record<string, unknown>;
  timestamp: number;
}

class DebugHypothesisTracker {
  private log: HypothesisEntry[] = [];

  // Appends an entry, assigning its id and timestamp.
  record(entry: Omit<HypothesisEntry, 'id' | 'timestamp'>): HypothesisEntry {
    const newEntry: HypothesisEntry = {
      ...entry,
      id: randomUUID(),
      timestamp: Date.now()
    };
    this.log.push(newEntry);
    return newEntry;
  }

  // Hypotheses whose tests have not yet produced a verdict.
  getActiveHypotheses(): HypothesisEntry[] {
    return this.log.filter(h => h.result === 'inconclusive');
  }

  // Ruled-out paths, formatted as "boundary -> hypothesis".
  getRefutedPaths(): string[] {
    return this.log
      .filter(h => h.result === 'refuted')
      .map(h => `${h.boundary} -> ${h.hypothesis}`);
  }

  exportAuditTrail(): string {
    return JSON.stringify(this.log, null, 2);
  }
}
```
**Architecture Rationale:** Immutable hypothesis logging prevents confirmation bias. By requiring explicit test actions and evidence capture before marking a hypothesis as confirmed, teams avoid declaring root causes based on correlation. The refuted paths list becomes invaluable during postmortems, highlighting detection gaps and monitoring blind spots.
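
The tracker above never deletes entries, but the objects it stores remain editable by any caller holding a reference. If stricter immutability is wanted, one option is to freeze each entry and record revisions as new entries. The following is a sketch under that assumption; `AppendOnlyHypothesisLog`, the `supersedes` field, and the sample hypothesis text are illustrative and not part of the tracker.

```typescript
type HypothesisResult = 'confirmed' | 'refuted' | 'inconclusive';

interface FrozenEntry {
  readonly id: number;
  readonly hypothesis: string;
  readonly result: HypothesisResult;
  readonly supersedes: number | null; // id of the earlier entry this one revises
}

class AppendOnlyHypothesisLog {
  private readonly entries: FrozenEntry[] = [];
  private nextId = 1;

  // Appends a frozen entry; existing entries are never edited in place.
  append(hypothesis: string, result: HypothesisResult, supersedes: number | null = null): FrozenEntry {
    const entry: FrozenEntry = Object.freeze({ id: this.nextId, hypothesis, result, supersedes });
    this.nextId += 1;
    this.entries.push(entry);
    return entry;
  }

  // Latest verdicts only: drops entries that a later entry supersedes.
  current(): FrozenEntry[] {
    const superseded = new Set<number>();
    for (const e of this.entries) {
      if (e.supersedes !== null) superseded.add(e.supersedes);
    }
    return this.entries.filter(e => !superseded.has(e.id));
  }

  // Full history, including superseded entries, for the audit trail.
  history(): FrozenEntry[] {
    return [...this.entries];
  }
}

const log = new AppendOnlyHypothesisLog();
const first = log.append('Connection pool exhausted on checkout-db', 'inconclusive');
const revised = log.append('Connection pool exhausted on checkout-db', 'refuted', first.id);
```

After the second call, `history()` still holds both entries, while `current()` returns only the refuted revision, so the audit trail shows how the verdict changed.
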
### Step 4: Playbook Executor
Runbooks and playbooks must be version-controlled, approval-gated, and executable. Manual runbooks drift; automated playbooks enforce consistency.
```typescript
interface PlaybookStep {
  id: string;
  action: string;
  requiresApproval: boolean;
  approvalRole?: string;
  rollbackCriteria: string;
  successCriteria: string;
}

interface Playbook {
  id: string;
  version: string;
  triggerConditions: string[];
  steps: PlaybookStep[];
}

class PlaybookEngine {
  private approvedRunners: Set<string> = new Set();

  constructor(allowedRoles: string[]) {
    allowedRoles.forEach(r => this.approvedRunners.add(r));
  }

  // Runs every step in order; throws if runnerRole is not an allowed role.
  async execute(playbook: Playbook, runnerRole: string): Promise<boolean> {
    if (!this.approvedRunners.has(runnerRole)) {
      throw new Error(`Unauthorized: ${runnerRole} cannot execute playbooks`);
    }
    for (const step of playbook.steps) {
      if (step.requiresApproval) {
        // Placeholder: this only logs and does not block until approval arrives.
        console.log(`[PLAYBOOK] Awaiting approval for step: ${step.action}`);
        // Integration with approval workflow (e.g., Slack approval, PagerDuty escalation)
      }
      console.log(`[PLAYBOOK] Executing: ${step.action}`);
      // Execute action, validate successCriteria, trigger rollback if needed
    }
    return true;
  }
}
```
**Architecture Rationale:** Playbook execution is gated by role-based authorization and step-level success/rollback criteria. This prevents unsafe automated actions during high-severity incidents while ensuring routine containment steps execute deterministically. Versioning guarantees that responders always use validated procedures, eliminating runbook drift.
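
The engine above leaves approval, success checks, and rollback as comments. A minimal synchronous sketch of how those three gates might fit together follows. It assumes each step carries callbacks (`run`, `succeeded`, `rollback`) instead of the descriptive strings in `PlaybookStep`, and that approval is a function supplied by the caller; all of these names are hypothetical.

```typescript
interface RunnableStep {
  id: string;
  requiresApproval: boolean;
  run: () => void;          // performs the action
  succeeded: () => boolean; // evaluates the step's success criteria
  rollback: () => void;     // undoes the action
}

type StepOutcome = 'done' | 'rolled_back' | 'approval_denied';

interface StepResult {
  id: string;
  outcome: StepOutcome;
}

// Runs steps in order and stops at the first denial or failed success check.
function runSteps(steps: RunnableStep[], approve: (step: RunnableStep) => boolean): StepResult[] {
  const results: StepResult[] = [];
  for (const step of steps) {
    if (step.requiresApproval && !approve(step)) {
      results.push({ id: step.id, outcome: 'approval_denied' });
      break; // later steps may depend on this one
    }
    step.run();
    if (!step.succeeded()) {
      step.rollback();
      results.push({ id: step.id, outcome: 'rolled_back' });
      break;
    }
    results.push({ id: step.id, outcome: 'done' });
  }
  return results;
}

// Example with in-memory state standing in for real infrastructure.
let replicas = 3;
let trafficShifted = false;
const steps: RunnableStep[] = [
  { id: 'scale-up', requiresApproval: false, run: () => { replicas += 2; }, succeeded: () => replicas === 5, rollback: () => { replicas -= 2; } },
  { id: 'shift-traffic', requiresApproval: true, run: () => { trafficShifted = true; }, succeeded: () => false, rollback: () => { trafficShifted = false; } },
  { id: 'decommission-old', requiresApproval: false, run: () => { replicas -= 3; }, succeeded: () => true, rollback: () => { replicas += 3; } }
];
const outcomes = runSteps(steps, () => true);
```

In this run the first step completes, the second fails its success check and is rolled back, and the third never executes. Stopping at the first failure is a design choice of this sketch; a real engine might instead unwind earlier steps as well.
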
## Pitfall Guide
### 1. Premature Root Cause Locking

**Explanation:** Teams declare a root cause after observing a single correlated symptom, then spend hours validating a false hypothesis instead of containing the blast radius.

**Fix:** Enforce a strict containment-first policy. Root cause analysis only begins after the incident state machine transitions to RECOVERY. Require two independent evidence sources before marking any hypothesis as confirmed.
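
The two conditions in this fix can be expressed as a single guard. In the sketch below, `canConfirmRootCause` and the `Evidence` shape are hypothetical, and "independent" is approximated as "reported by distinct sources", which is an assumption a real team may want to tighten.

```typescript
type IncidentPhase = 'DETECTION' | 'TRIAGE' | 'CONTAINMENT' | 'RECOVERY' | 'POSTMORTEM';

interface Evidence {
  source: string; // e.g. a metrics system, a log store, a trace
  detail: string;
}

// Phases in which root cause analysis is allowed under the containment-first rule.
const RCA_PHASES: readonly IncidentPhase[] = ['RECOVERY', 'POSTMORTEM'];

// True only when the incident is past containment AND the evidence
// comes from at least `minSources` distinct sources.
function canConfirmRootCause(phase: IncidentPhase, evidence: Evidence[], minSources = 2): boolean {
  if (!RCA_PHASES.includes(phase)) return false;
  const distinctSources = new Set(evidence.map(e => e.source));
  return distinctSources.size >= minSources;
}

const twoSources: Evidence[] = [
  { source: 'metrics', detail: 'connection count at pool limit' },
  { source: 'logs', detail: 'timeout errors from pool acquire' }
];
const oneSourceTwice: Evidence[] = [
  { source: 'metrics', detail: 'connection count at pool limit' },
  { source: 'metrics', detail: 'latency spike at the same time' }
];
```

With these inputs the guard rejects `twoSources` during CONTAINMENT, accepts it during RECOVERY, and rejects `oneSourceTwice` in any phase.
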
### 2. Communication Channel Fragmentation

**Explanation:** Updates scatter across Slack threads, email chains, and voice calls. Stakeholders receive conflicting information, eroding trust and creating duplicate triage efforts.

**Fix:** Route all external updates through a single communication router. Designate one approved spokesperson per channel. Enforce a fixed cadence and template structure. Archive all updates in the incident context for postmortem reconstruction.
### 3. Runbook Staleness (Drift)

**Explanation:** Playbooks are written once and never updated. Infrastructure changes, API deprecations, and team turnover render procedures obsolete, causing execution failures during actual incidents.

**Fix:** Store playbooks in version control with mandatory review cycles. Implement automated drift detection by comparing playbook dependencies against current infrastructure state. Require quarterly tabletop exercises to validate execution paths.
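
A simple form of drift detection is a set difference between what a playbook references and what currently exists. The sketch below assumes each playbook declares its dependencies as resource identifiers and that a list of live resources is available from some inventory; `detectDrift` and every identifier in the example are made up for illustration.

```typescript
interface PlaybookDependencies {
  playbookId: string;
  dependsOn: string[]; // resource identifiers the playbook's steps reference
}

// Returns, per playbook, the dependencies that are absent from the live inventory.
// Playbooks with no missing dependencies are omitted from the result.
function detectDrift(playbooks: PlaybookDependencies[], liveResources: string[]): Map<string, string[]> {
  const live = new Set(liveResources);
  const drift = new Map<string, string[]>();
  for (const pb of playbooks) {
    const missing = pb.dependsOn.filter(dep => !live.has(dep));
    if (missing.length > 0) drift.set(pb.playbookId, missing);
  }
  return drift;
}

const playbooks: PlaybookDependencies[] = [
  { playbookId: 'db-failover', dependsOn: ['primary-db', 'replica-eu-1', 'replica-eu-2'] },
  { playbookId: 'cache-flush', dependsOn: ['redis-main'] }
];
const liveResources = ['primary-db', 'replica-eu-1', 'redis-main'];
const drift = detectDrift(playbooks, liveResources);
```

Here only `db-failover` is flagged, because `replica-eu-2` no longer appears in the inventory. A scheduled job could run a check like this and open a review ticket for each flagged playbook.
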
### 4. Metric-Driven Containment Neglect

**Explanation:** Teams optimize for MTTD/MTTR dashboards instead of actual system stability. Engineers rush containment to improve metrics, introducing secondary failures or skipping validation steps.

**Fix:** Treat metrics as lagging indicators, not operational targets. Prioritize containment safety over speed. Use the state machine to enforce validation gates before advancing phases. Review metric trends during postmortems, not during active incidents.
### 5. Blame-Attributed Postmortems

**Explanation:** Post-incident reviews focus on individual actions rather than systemic gaps. This triggers defensive behavior, suppresses honest reporting, and guarantees recurrence.

**Fix:** Mandate blameless framing syntax. Replace "Who caused this?" with "What conditions allowed this to happen?" Use the Five Whys technique to trace issues to process, tooling, or architectural constraints. Assign action items to systems, not people.
### 6. Unvalidated Playbook Assumptions

**Explanation:** Playbooks assume ideal conditions: network connectivity, available replicas, correct permissions. Real incidents rarely match documentation assumptions.

**Fix:** Design playbooks with explicit failure modes. Include fallback paths for each step. Validate assumptions during tabletop exercises. Log assumption violations during actual incidents to refine future procedures.
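
One possible shape for a step with explicit assumptions and a fallback path is sketched below. `GuardedStep`, `runGuarded`, and the replica example are hypothetical; the returned `violated` list is what would be logged for later refinement.

```typescript
interface Assumption {
  name: string;
  holds: () => boolean; // returns false when reality differs from the playbook
}

interface GuardedStep {
  action: string;
  assumptions: Assumption[];
  primary: () => string;  // path taken when every assumption holds
  fallback: () => string; // path taken when any assumption is violated
}

interface GuardedResult {
  action: string;
  output: string;
  violated: string[]; // names of assumptions that did not hold
}

// Checks every assumption first, then chooses the primary or fallback path.
function runGuarded(step: GuardedStep): GuardedResult {
  const violated = step.assumptions.filter(a => !a.holds()).map(a => a.name);
  const output = violated.length === 0 ? step.primary() : step.fallback();
  return { action: step.action, output, violated };
}

// Example: the playbook assumes a replica exists, but none is available.
const availableReplicas: string[] = [];
const promoteReplica: GuardedStep = {
  action: 'promote read replica',
  assumptions: [
    { name: 'replica available', holds: () => availableReplicas.length > 0 },
    { name: 'operator has write access', holds: () => true }
  ],
  primary: () => 'promoted replica',
  fallback: () => 'restored from latest snapshot'
};
const guardedResult = runGuarded(promoteReplica);
```

Because the replica assumption fails, the step takes its fallback path and reports `replica available` as the violated assumption.
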
### 7. Cognitive Overload During Triage

**Explanation:** On-call engineers attempt to track metrics, debug logs, coordinate teams, and draft updates simultaneously. Working memory saturation leads to missed signals and delayed escalation.

**Fix:** Externalize cognitive load. Assign explicit roles: Incident Commander (coordination), Debugger (hypothesis tracking), Comms Lead (stakeholder updates), Scribe (audit logging). Use checklists and automated routers to remove manual tracking from the debugger's workflow.
## Production Bundle

### Action Checklist

### Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Minor latency spike (P3) | Automated runbook execution + async comms | Low blast radius; deterministic rollback reduces manual overhead | Minimal engineering time; avoids alert fatigue |
| Full database outage (P1) | State machine containment + synchronous comms + dedicated IC | High data risk; requires strict phase control and stakeholder alignment | Higher immediate cost; prevents data corruption and extended downtime |
| Cross-service dependency failure (P2) | Boundary-mapped debugging + hypothesis tracker + cross-team comms router | Complex failure surface; requires structured evidence collection to avoid tunnel vision | Moderate cost; reduces duplicate debugging and accelerates root cause isolation |
| Security incident (P1) | Isolation playbook + executive comms + forensic logging | Regulatory and data exposure risks; requires strict access control and audit trails | High compliance cost; mitigates legal and reputational damage |

### Configuration Template

```yaml
# incident-framework.config.yaml
framework:
  version: "2.1.0"
  lifecycle:
    phases: ["DETECTION", "TRIAGE", "CONTAINMENT", "RECOVERY", "POSTMORTEM"]
    transitions:
      DETECTION: ["TRIAGE"]
      TRIAGE: ["CONTAINMENT", "DETECTION"]
      CONTAINMENT: ["RECOVERY", "TRIAGE"]
      RECOVERY: ["POSTMORTEM"]
      POSTMORTEM: []
  communications:
    cadence_minutes:
      P1: 15
      P2: 30
      P3: 60
    channels:
      slack:
        webhook: "${SLACK_WEBHOOK_URL}"
        template: "plain_language"
      statuspage:
        api_key: "${STATUSPAGE_API_KEY}"
        component_mapping: "auto"
  playbooks:
    storage: "git"
    repository: "ops/playbooks"
    approval_roles: ["sre-lead", "platform-engineer"]
    drift_detection:
      enabled: true
      interval_hours: 168
      compare_against: "infrastructure_state"
  postmortem:
    framework: "blameless_systemic"
    required_fields:
      - executive_summary
      - timeline
      - root_cause_analysis
      - corrective_actions
      - owners
      - deadlines
      - lessons_learned
    action_tracking: "jira"
    completion_target_days: 14
```
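
Because the transition table is plain data, it can be checked before the framework starts. The sketch below validates an already-parsed `lifecycle` section (YAML parsing is out of scope here); `validateLifecycle` is a hypothetical helper, not part of any published package.

```typescript
interface LifecycleConfig {
  phases: string[];
  transitions: Record<string, string[]>;
}

// Returns a list of human-readable problems; an empty list means the section is consistent.
function validateLifecycle(cfg: LifecycleConfig): string[] {
  const errors: string[] = [];
  const known = new Set(cfg.phases);

  // Every declared phase needs a transitions entry, even if it is empty.
  for (const phase of cfg.phases) {
    if (!(phase in cfg.transitions)) errors.push(`missing transitions for ${phase}`);
  }

  // Every transition must start and end at a declared phase.
  for (const from of Object.keys(cfg.transitions)) {
    if (!known.has(from)) errors.push(`unknown phase ${from}`);
    for (const to of cfg.transitions[from] ?? []) {
      if (!known.has(to)) errors.push(`${from} -> ${to}: unknown target`);
    }
  }
  return errors;
}

// The lifecycle section from the template above, as a parsed object.
const lifecycle: LifecycleConfig = {
  phases: ['DETECTION', 'TRIAGE', 'CONTAINMENT', 'RECOVERY', 'POSTMORTEM'],
  transitions: {
    DETECTION: ['TRIAGE'],
    TRIAGE: ['CONTAINMENT', 'DETECTION'],
    CONTAINMENT: ['RECOVERY', 'TRIAGE'],
    RECOVERY: ['POSTMORTEM'],
    POSTMORTEM: []
  }
};
const lifecycleErrors = validateLifecycle(lifecycle);
```

The template's own lifecycle passes with no errors. A typo such as a transition to an undeclared `MITIGATION` phase would be reported as `CONTAINMENT -> MITIGATION: unknown target`.
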

### Quick Start Guide

- Initialize the framework: Clone the incident toolkit repository and run `npm install @codcompass/incident-framework`. Import the state machine, comms router, and hypothesis tracker into your on-call service.
- Configure severity routing: Map your existing alerting rules to P1/P2/P3 severities. Update the communication router cadence and channel webhooks in the YAML configuration.
- Deploy first playbook: Create a minimal containment playbook for your highest-traffic service. Define trigger conditions, approval roles, rollback criteria, and success metrics. Commit to version control.
- Run tabletop validation: Simulate a P2 incident with your on-call rotation. Execute the playbook, track hypotheses, and broadcast updates using the router. Log assumption violations and refine the procedure.
- Enable postmortem tracking: Connect the framework to your issue tracker. Enforce blameless framing syntax and assign systemic action items with deadlines. Monitor completion rates monthly.
Deterministic incident response isn't about preventing outages. It's about ensuring that when they occur, your team operates with precision, transparency, and continuous improvement. Externalize decision-making, enforce structural discipline, and treat every incident as a reliability engineering opportunity. The framework scales with your system; your response should too.