inst business rules. The engine persists state, orchestrates remediation actions, and manages communication channels.
Architecture Decisions and Rationale
- State Machine Pattern: Incidents follow a deterministic lifecycle (DETECTED → TRIAGE → MITIGATING → RESOLVED → POST_MORTEM). A state machine enforces valid transitions, prevents race conditions, and provides a clear audit trail of state changes.
- Event-Driven Ingestion: The engine consumes events via a message bus (e.g., Kafka, SQS) from observability tools (Prometheus, Datadog, Sentry). This decouples detection from processing and allows for high-throughput ingestion.
- Idempotent Remediation Hooks: Automated mitigation actions are exposed as idempotent hooks. The workflow engine invokes these hooks with a unique correlation ID, ensuring that retries do not cause duplicate side effects.
- Human-in-the-Loop Gates: Critical transitions, such as MITIGATING for P1 incidents or RESOLVED for complex failures, require explicit human approval via interactive notifications, balancing automation speed with safety.
Step-by-Step Technical Implementation
1. Define the Incident Schema and State Transitions
Use TypeScript to enforce strict typing for incident payloads and state transitions.
export type IncidentSeverity = 'P1' | 'P2' | 'P3' | 'P4';

export type IncidentState =
  | 'DETECTED'
  | 'TRIAGE'
  | 'MITIGATING'
  | 'RESOLVED'
  | 'POST_MORTEM';

export interface IncidentPayload {
  id: string;
  title: string;
  severity: IncidentSeverity;
  source: string; // e.g., 'prometheus', 'sentry'
  state: IncidentState;
  metadata: Record<string, unknown>;
  createdAt: number;
  updatedAt: number;
  assignedEngineer?: string;
  remediationSteps: string[];
}

export interface TransitionRule {
  from: IncidentState;
  to: IncidentState;
  validator: (incident: IncidentPayload) => Promise<boolean>;
  action?: (incident: IncidentPayload) => Promise<void>;
}

const TRANSITIONS: TransitionRule[] = [
  {
    from: 'DETECTED',
    to: 'TRIAGE',
    validator: async (inc) => inc.severity !== undefined,
    action: async (inc) => {
      // Auto-assign based on on-call schedule
      console.log(`Assigning ${inc.id} to on-call engineer.`);
    }
  },
  {
    from: 'TRIAGE',
    to: 'MITIGATING',
    validator: async (inc) => {
      // P1 requires manual approval for mitigation
      if (inc.severity === 'P1') return inc.metadata.approved === true;
      return true;
    },
    action: async (inc) => {
      // Trigger auto-remediation hooks
      console.log(`Executing mitigation steps for ${inc.id}.`);
    }
  },
  // Additional transitions...
];
2. Implement the Workflow Engine
The engine processes events, validates transitions, and persists state.
import { v4 as uuidv4 } from 'uuid';

export class IncidentWorkflowEngine {
  // In-memory store for illustration; see the persistence note in transition()
  private incidents: Map<string, IncidentPayload> = new Map();

  async ingestEvent(event: { type: string; payload: Partial<IncidentPayload> }): Promise<void> {
    const incidentId = event.payload.id || uuidv4();
    let incident = this.incidents.get(incidentId);
    if (!incident) {
      // Initialize new incident
      incident = {
        id: incidentId,
        state: 'DETECTED',
        createdAt: Date.now(),
        updatedAt: Date.now(),
        severity: event.payload.severity || 'P3',
        title: event.payload.title || 'Unknown Incident',
        source: event.payload.source || 'unknown',
        // Carry over metadata from the event instead of discarding it
        metadata: { ...event.payload.metadata },
        remediationSteps: []
      };
      this.incidents.set(incidentId, incident);
      console.log(`New incident detected: ${incidentId}`);
    } else if (event.payload.metadata) {
      // Merge metadata from follow-up events so validators can see fields such as `approved`
      incident.metadata = { ...incident.metadata, ...event.payload.metadata };
    }

    // Process state transition based on event type
    const targetState = this.mapEventTypeToState(event.type);
    if (targetState && targetState !== incident.state) {
      await this.transition(incident, targetState);
    }
  }

  private async transition(incident: IncidentPayload, targetState: IncidentState): Promise<void> {
    const rule = TRANSITIONS.find(
      t => t.from === incident.state && t.to === targetState
    );
    if (!rule) {
      throw new Error(`Invalid transition from ${incident.state} to ${targetState} for incident ${incident.id}`);
    }

    const isValid = await rule.validator(incident);
    if (!isValid) {
      throw new Error(`Validation failed for transition ${incident.state} -> ${targetState}`);
    }

    // Execute transition action; if it throws, the state is left unchanged
    if (rule.action) {
      await rule.action(incident);
    }

    // Update state
    incident.state = targetState;
    incident.updatedAt = Date.now();
    console.log(`Incident ${incident.id} transitioned to ${targetState}`);

    // Persist to database/event store here
    this.incidents.set(incident.id, incident);
  }

  private mapEventTypeToState(eventType: string): IncidentState | null {
    switch (eventType) {
      case 'alert.resolved': return 'RESOLVED';
      case 'engineer.ack': return 'TRIAGE';
      case 'mitigation.approved': return 'MITIGATING';
      case 'postmortem.created': return 'POST_MORTEM';
      default: return null;
    }
  }
}
3. Integrate Observability Webhooks
Configure observability tools to send structured events to the workflow engine's ingestion endpoint. Ensure payloads include correlation IDs and severity levels.
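A minimal sketch of checking webhook bodies before they reach the engine, assuming the `{ type, payload }` event shape used by `ingestEvent`. The `correlationId` field and the `parseIngestEvent` helper are hypothetical names introduced for illustration; they are not part of the schema defined above.

```typescript
type Severity = 'P1' | 'P2' | 'P3' | 'P4';

interface IngestEvent {
  type: string;
  correlationId: string; // hypothetical field, assumed to be set by the sending tool
  payload: { id?: string; title?: string; severity: Severity; source?: string };
}

type ParseResult =
  | { ok: true; event: IngestEvent }
  | { ok: false; errors: string[] };

const SEVERITIES: readonly string[] = ['P1', 'P2', 'P3', 'P4'];

// Treat anything that is not a plain object as an empty record.
function asRecord(value: unknown): Record<string, unknown> {
  return typeof value === 'object' && value !== null
    ? (value as Record<string, unknown>)
    : {};
}

// Returns a typed event, or the list of reasons the body was rejected.
function parseIngestEvent(body: unknown): ParseResult {
  const errors: string[] = [];
  const b = asRecord(body);
  const payload = asRecord(b.payload);

  if (typeof b.type !== 'string' || b.type.length === 0) {
    errors.push('type must be a non-empty string');
  }
  if (typeof b.correlationId !== 'string' || b.correlationId.length === 0) {
    errors.push('correlationId is required');
  }
  if (typeof payload.severity !== 'string' || SEVERITIES.indexOf(payload.severity) === -1) {
    errors.push('payload.severity must be one of P1, P2, P3, P4');
  }

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, event: b as unknown as IngestEvent };
}
```

Rejecting malformed events at the boundary keeps defaulting logic (such as falling back to `'P3'`) from hiding a misconfigured sender.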
4. Add Remediation and Communication Hooks
Extend the workflow engine with side-effect handlers for Slack notifications, PagerDuty escalation, and API calls to remediation services.
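One way to attach side effects is a small registry of handlers keyed by the state an incident enters. The sketch below is an assumption about how this could be wired, not the engine's actual extension point; `HookRegistry`, `TransitionEvent`, and the handler bodies are hypothetical, and real Slack or PagerDuty calls would replace the stubs.

```typescript
type IncidentState = 'DETECTED' | 'TRIAGE' | 'MITIGATING' | 'RESOLVED' | 'POST_MORTEM';

interface TransitionEvent {
  incidentId: string;
  from: IncidentState;
  to: IncidentState;
}

type TransitionHook = (event: TransitionEvent) => Promise<void>;

class HookRegistry {
  private hooks = new Map<IncidentState, TransitionHook[]>();

  // Register a handler to run whenever an incident enters `state`.
  on(state: IncidentState, hook: TransitionHook): void {
    const existing = this.hooks.get(state) ?? [];
    this.hooks.set(state, [...existing, hook]);
  }

  // Run handlers for the target state in registration order.
  // A failing handler is recorded and does not stop the remaining ones.
  async dispatch(event: TransitionEvent): Promise<string[]> {
    const failures: string[] = [];
    for (const hook of this.hooks.get(event.to) ?? []) {
      try {
        await hook(event);
      } catch (err) {
        failures.push(err instanceof Error ? err.message : String(err));
      }
    }
    return failures;
  }
}

// Usage: stub handlers standing in for notification and escalation calls.
const registry = new HookRegistry();
registry.on('MITIGATING', async (e) => {
  console.log(`[notify] ${e.incidentId} entered MITIGATING`);
});
```

Calling `dispatch` after the state update in `transition` would keep notification failures from blocking the transition itself; whether that ordering is appropriate depends on how critical each side effect is.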
Pitfall Guide
1. Alert Fatigue and Noise Flooding
Mistake: Ingesting every alert into the workflow engine without deduplication or correlation.
Explanation: This overwhelms the engine and engineers, causing critical incidents to be lost in noise.
Best Practice: Implement a correlation layer before the workflow engine. Group alerts by service, host, and time window. Only trigger workflow ingestion when a correlation threshold is met.
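The correlation layer described above can be sketched as a sliding-window counter per service and host. The `AlertCorrelator` class, the `RawAlert` shape, and the window and threshold values are illustrative assumptions; a production version would also need eviction of idle keys and shared state across instances.

```typescript
interface RawAlert {
  service: string;
  host: string;
  timestamp: number; // epoch milliseconds
}

class AlertCorrelator {
  private buckets = new Map<string, number[]>();
  private windowMs: number;
  private threshold: number;

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Returns true when this alert should be forwarded to the workflow engine.
  shouldIngest(alert: RawAlert): boolean {
    const key = `${alert.service}:${alert.host}`;
    const cutoff = alert.timestamp - this.windowMs;
    // Keep only timestamps that still fall inside the sliding window.
    const recent = (this.buckets.get(key) ?? []).filter((t) => t > cutoff);
    recent.push(alert.timestamp);
    this.buckets.set(key, recent);
    // Forward when the count reaches the threshold; later alerts in the
    // same burst are suppressed because the count then exceeds it.
    return recent.length === this.threshold;
  }
}

// Usage: forward to the engine on the third alert within 60 seconds.
const correlator = new AlertCorrelator(60_000, 3);
```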
2. Hardcoded Runbooks in Workflow Logic
Mistake: Embedding remediation steps directly in the engine code.
Explanation: This couples workflow logic with operational knowledge, requiring code deployments to update runbooks.
Best Practice: Store remediation steps in an external configuration store or knowledge base. The workflow engine should reference step IDs and fetch actions dynamically.
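As an illustration of keeping runbooks outside the engine, the sketch below resolves step IDs (the strings held in `remediationSteps`) against a store at execution time. `RunbookStore` is an in-memory stand-in for whatever configuration service or knowledge base is actually used, and the `RunbookStep` fields are assumptions.

```typescript
interface RunbookStep {
  id: string;
  description: string;
  hookId: string; // references a remediation hook by ID, e.g. 'restart_container'
}

// In-memory stand-in for an external configuration store.
class RunbookStore {
  private steps = new Map<string, RunbookStep>();

  upsert(step: RunbookStep): void {
    this.steps.set(step.id, step);
  }

  get(id: string): RunbookStep | undefined {
    return this.steps.get(id);
  }
}

// Resolve step IDs when the incident is processed, so runbook edits
// take effect without redeploying the engine.
function resolveSteps(store: RunbookStore, stepIds: string[]): RunbookStep[] {
  return stepIds.map((id) => {
    const step = store.get(id);
    if (!step) throw new Error(`Unknown remediation step: ${id}`);
    return step;
  });
}
```

Failing loudly on an unknown step ID surfaces stale references early instead of silently skipping part of a runbook.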
3. Non-Idempotent Remediation Hooks
Mistake: Designing remediation hooks that are not idempotent.
Explanation: Network retries or workflow re-processing can trigger duplicate actions, such as restarting a service twice or scaling resources incorrectly.
Best Practice: All remediation hooks must accept a unique correlation ID and check for prior execution. Use distributed locks or idempotency keys in downstream APIs.
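A minimal sketch of the prior-execution check, assuming each hook call carries a correlation ID. `IdempotentHookRunner` is a hypothetical helper that tracks completed keys in memory; it does not guard against two concurrent calls with the same key, which is where the distributed lock or a downstream idempotency key would come in.

```typescript
type HookResult = { status: 'executed' | 'skipped'; correlationId: string };

class IdempotentHookRunner {
  // In-memory record of completed keys; a shared store would be needed
  // if several engine instances can process the same incident.
  private completed = new Set<string>();

  async run(
    hookId: string,
    correlationId: string,
    action: () => Promise<void>
  ): Promise<HookResult> {
    const key = `${hookId}:${correlationId}`;
    if (this.completed.has(key)) {
      return { status: 'skipped', correlationId };
    }
    await action();
    // Mark as completed only after success so a failed attempt can be retried.
    this.completed.add(key);
    return { status: 'executed', correlationId };
  }
}
```

With this wrapper, a retried `restart_container` call that reuses the same correlation ID returns `skipped` instead of restarting the service a second time.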
4. Lack of Human-in-the-Loop for Critical Paths
Mistake: Fully automating mitigation for P1 incidents without approval gates.
Explanation: Automated actions on critical systems can cause cascading failures if the detection logic is flawed.
Best Practice: Implement mandatory approval gates for P1 incidents. The workflow should pause at TRIAGE and wait for explicit engineer confirmation before transitioning to MITIGATING.
5. Treating Post-Mortem as an Afterthought
Mistake: The workflow ends at RESOLVED with no mechanism to trigger post-incident review.
Explanation: Valuable learnings are lost, and systemic issues remain unaddressed.
Best Practice: Automate the transition to POST_MORTEM upon resolution. The workflow should automatically create a post-mortem ticket, assign owners, and schedule a review meeting.
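The ticket-creation step could look roughly like the following, which builds the data a RESOLVED → POST_MORTEM action might submit to a tracker. The `PostMortemTicket` shape, the `buildPostMortemTicket` helper, and the five-day default review window are assumptions made for illustration.

```typescript
interface ResolvedIncident {
  id: string;
  title: string;
  assignedEngineer?: string;
  updatedAt: number; // epoch milliseconds; treated here as the resolution time
}

interface PostMortemTicket {
  incidentId: string;
  summary: string;
  owner: string;
  reviewDueAt: number; // epoch milliseconds
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Build the ticket payload; submitting it to a tracker is left to the caller.
function buildPostMortemTicket(
  incident: ResolvedIncident,
  reviewWithinDays = 5 // assumed default, adjust to the team's policy
): PostMortemTicket {
  return {
    incidentId: incident.id,
    summary: `Post-mortem: ${incident.title}`,
    // Fall back to a placeholder owner when no engineer was assigned.
    owner: incident.assignedEngineer ?? 'unassigned',
    reviewDueAt: incident.updatedAt + reviewWithinDays * DAY_MS
  };
}
```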
6. Insufficient Audit Trails
Mistake: Overwriting incident state without preserving history.
Explanation: Makes it impossible to reconstruct the timeline for compliance or analysis.
Best Practice: Use an event-sourcing pattern or append-only log for state changes. Every transition must be recorded with a timestamp, actor, and reason.
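An append-only log along these lines can be sketched as follows. `AuditLog` and `TransitionRecord` are hypothetical names, and the array stands in for a durable store; the point is that current state is derived from the recorded history and never overwritten in place.

```typescript
type IncidentState = 'DETECTED' | 'TRIAGE' | 'MITIGATING' | 'RESOLVED' | 'POST_MORTEM';

interface TransitionRecord {
  incidentId: string;
  from: IncidentState | null; // null for the initial detection
  to: IncidentState;
  timestamp: number; // epoch milliseconds
  actor: string;     // engineer or system component that caused the change
  reason: string;
}

class AuditLog {
  private records: TransitionRecord[] = [];

  // Append a frozen copy so entries cannot be mutated after being written.
  append(record: TransitionRecord): void {
    this.records.push(Object.freeze({ ...record }));
  }

  // Full timeline for one incident, in the order it was recorded.
  history(incidentId: string): readonly TransitionRecord[] {
    return this.records.filter((r) => r.incidentId === incidentId);
  }

  // Derive the current state from the last recorded transition.
  currentState(incidentId: string): IncidentState | undefined {
    const h = this.history(incidentId);
    return h.length > 0 ? h[h.length - 1].to : undefined;
  }
}
```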
7. Single-Source Observability
Mistake: The workflow engine integrates with only one observability tool.
Explanation: Incidents often span multiple systems; missing data from one source leads to incomplete triage.
Best Practice: Design the ingestion layer to normalize events from diverse sources (metrics, logs, traces, synthetic checks) into a unified incident schema.
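One possible shape for that normalization layer is a table of per-source adapters that all emit the same event structure. The source names and raw field names below (`alertId`, `alertName`, `critical`, `issueId`, `message`) are invented for illustration and do not match any particular tool's webhook format; each real integration needs its own mapping.

```typescript
type Severity = 'P1' | 'P2' | 'P3' | 'P4';

interface NormalizedEvent {
  type: string;
  payload: { id: string; title: string; severity: Severity; source: string };
}

type Normalizer = (raw: Record<string, unknown>) => NormalizedEvent;

// Hypothetical raw payload shapes, one adapter per source.
const normalizers: Record<string, Normalizer> = {
  metrics: (raw) => ({
    type: 'alert.triggered',
    payload: {
      id: String(raw.alertId),
      title: String(raw.alertName),
      severity: raw.critical === true ? 'P1' : 'P3',
      source: 'metrics'
    }
  }),
  errors: (raw) => ({
    type: 'alert.triggered',
    payload: {
      id: String(raw.issueId),
      title: String(raw.message),
      severity: 'P2',
      source: 'errors'
    }
  })
};

// Convert a raw payload from a named source into the unified event shape.
function normalize(source: string, raw: Record<string, unknown>): NormalizedEvent {
  const fn = normalizers[source];
  if (!fn) throw new Error(`No normalizer registered for source: ${source}`);
  return fn(raw);
}
```

Adding a new source then means registering one adapter, with no change to the engine or to the incident schema.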
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Small Team (<10 devs) | Lightweight Workflow Script | Low overhead, fast implementation, sufficient for limited alert volume. | Low |
| Enterprise / Multi-Service | Full Workflow-as-Code Engine | Scalability, auditability, cross-team coordination, compliance requirements. | Medium |
| Regulated Industry (FinTech/Health) | Compliance-First Workflow | Mandatory audit trails, human approval gates, strict state validation. | High |
| High-Frequency Auto-Remediation | Event-Driven Engine with Idempotency | Prevents duplicate actions, handles high throughput, ensures safety. | Medium |
Configuration Template
# incident_workflow_config.yaml
severity_matrix:
  P1:
    sla_minutes: 15
    auto_remediation: false
    approval_required: true
    notification_channels:
      - slack_critical
      - pagerduty
  P2:
    sla_minutes: 60
    auto_remediation: true
    approval_required: false
    notification_channels:
      - slack_ops
      - email
workflow_rules:
  - transition: DETECTED -> TRIAGE
    validator: severity_defined
    action: assign_on_call
  - transition: TRIAGE -> MITIGATING
    validator: approval_or_severity_check
    action: execute_remediation_steps
  - transition: RESOLVED -> POST_MORTEM
    validator: always_true
    action: create_postmortem_ticket
remediation_hooks:
  - id: scale_up_service
    endpoint: https://api.internal/scaling
    method: POST
    idempotency_key: correlation_id
  - id: restart_container
    endpoint: https://api.internal/containers/restart
    method: POST
    idempotency_key: correlation_id
Quick Start Guide
- Initialize Project:
mkdir incident-workflow && cd incident-workflow
npm init -y
npm install express uuid
npm install --save-dev typescript @types/node @types/express
npx tsc --init
- Create Engine File:
Copy the TypeScript implementation from the Core Solution into src/engine.ts. Define your TRANSITIONS and IncidentPayload interface.
- Configure Ingestion:
Create a simple Express server to expose the ingestion endpoint.
// src/server.ts
import express from 'express';
import { IncidentWorkflowEngine } from './engine';

const app = express();
const engine = new IncidentWorkflowEngine();

app.use(express.json());

app.post('/ingest', async (req, res) => {
  try {
    await engine.ingestEvent(req.body);
    res.status(200).send('Event processed');
  } catch (err) {
    // `err` is typed `unknown` under strict mode, so narrow before reading `.message`
    res.status(400).send(err instanceof Error ? err.message : String(err));
  }
});

app.listen(3000, () => console.log('Workflow engine running on port 3000'));
- Deploy and Test:
Run the server and send a test event via curl:
curl -X POST http://localhost:3000/ingest \
  -H "Content-Type: application/json" \
  -d '{"type": "alert.triggered", "payload": {"id": "test-1", "severity": "P2", "title": "High Latency"}}'
Verify the incident is created and transitions are logged.
- Integrate Observability:
Configure your monitoring tool to send webhooks to http://<engine-host>:3000/ingest. Map alert fields to the IncidentPayload schema. Validate end-to-end flow.