OpenSRE: Build Your Own AI Incident-Investigation Agent

By Codcompass Team·2026-05-18·8 min read

Engineering Deterministic Incident Investigation Agents with LangGraph

Current Situation Analysis

Modern observability stacks are fundamentally fragmented. Logs live in Datadog or Loki, metrics in Grafana or CloudWatch, deployment configs in Git, and runtime state in Kubernetes or cloud control planes. When a production incident triggers, the evidence required to diagnose it is scattered across six to eight independent platforms. Engineers are forced to become manual ETL pipelines: pulling timestamps, cross-referencing traces, pinging subject-matter experts, and reconstructing a timeline from disjointed signals.

This fragmentation is rarely treated as a systemic engineering problem. Most AI tooling investments target the development phase—code completion, PR reviews, test generation. The operational phase, where system failures actually occur, receives minimal automation. The result is predictable: Mean Time to Resolution (MTTR) balloons because 60–70% of incident response time is spent gathering context, not applying fixes. Under on-call fatigue, teams default to "patch-and-pray" mitigations, deferring root cause analysis until after stability returns. This creates technical debt in the form of unresolved failure modes that inevitably resurface.

The misunderstanding lies in assuming LLMs are too unpredictable for production incident response. While raw generative models lack determinism, structured agent frameworks built on directed graphs and state machines can enforce rigorous, evidence-backed workflows. The gap isn't AI capability; it's architectural discipline. We need investigation agents that operate like senior SREs: they gather context in parallel, test multiple hypotheses simultaneously, ground conclusions in verifiable data, and maintain a complete audit trail. Frameworks like the Apache 2.0 licensed toolkit maintained by Tracer demonstrate that this is achievable when LangGraph is used to orchestrate deterministic, multi-step investigation pipelines rather than open-ended chat loops.

WOW Moment: Key Findings

The shift from manual triage to AI-driven parallel investigation fundamentally changes how SRE teams allocate cognitive resources. The following comparison illustrates the operational delta:

Approach	Context Assembly Time	Hypothesis Parallelism	Evidence Traceability	Cognitive Overhead
Manual Triage	45–120 minutes	Sequential (1–2 at a time)	Fragmented across tools	High (context-switching, fatigue)
AI-Driven Agent	3–8 minutes	Concurrent (5–10 failure modes)	Immutable state logs, source citations	Low (review & decide)

This finding matters because it decouples data correlation from decision-making. Traditional runbooks force engineers to follow linear checklists. An agent framework evaluates multiple failure paths simultaneously, queries connected systems in parallel, and reports a conclusion only when a confidence threshold is met; otherwise it escalates to a human. The output is not a guess; it is a structured report mapping observed signals to probable root causes, complete with query provenance. This enables teams to treat incident investigation as a repeatable, auditable process rather than a heroic effort.

Core Solution

Building a deterministic investigation agent requires moving beyond simple prompt chains. The architecture must enforce state persistence, parallel tool execution, hypothesis evaluation, and strict security boundaries. Below is an implementation pattern using TypeScript and LangGraph. Helper functions that the snippets call but do not define (tool adapters, scoring helpers, notification nodes) are assumed to exist elsewhere in the project.

Step 1: Define the Investigation State Machine

The agent operates as a state machine. Each node represents a discrete phase: alert ingestion, context assembly, hypothesis framing, parallel querying, evaluation, and reporting. State is passed explicitly between nodes, ensuring every decision is reproducible.

import { StateGraph, END, START } from "@langchain/langgraph";

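// Shared state passed between graph nodes; each node returns a partial update.
interface InvestigationState {
  alertId: string;
  severity: "critical" | "warning" | "info";
  timestamp: number;                      // epoch milliseconds
  gatheredEvidence: Record<string, any>;  // keyed by tool name, e.g. "logs"
  activeHypotheses: string[];
  evaluatedResults: Array<{ hypothesis: string; confidence: number; sources: string[] }>;
  finalConclusion: string | null;         // null until a hypothesis clears the threshold
  confidenceThreshold: number;            // expected range 0..1
  auditTrail: Array<{ step: string; timestamp: number; data: any }>;
}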

const initialState: InvestigationState = {
  alertId: "",
  severity: "warning",
  timestamp: Date.now(),
  gatheredEvidence: {},
  activeHypotheses: [],
  evaluatedResults: [],
  finalConclusion: null,
  confidenceThreshold: 0.85,
  auditTrail: []
};
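
A module-level `initialState` evaluates `Date.now()` once, at load time, and its arrays and objects are shared by every investigation that reuses it. One way around this, sketched below, is a small factory that builds a fresh state per alert. `createInitialState` is a hypothetical helper, not part of LangGraph, and the state type is repeated only so the snippet stands alone.

```typescript
type Severity = "critical" | "warning" | "info";

interface AuditEntry {
  step: string;
  timestamp: number;
  data: unknown;
}

interface InvestigationState {
  alertId: string;
  severity: Severity;
  timestamp: number;
  gatheredEvidence: Record<string, unknown>;
  activeHypotheses: string[];
  evaluatedResults: Array<{ hypothesis: string; confidence: number; sources: string[] }>;
  finalConclusion: string | null;
  confidenceThreshold: number;
  auditTrail: AuditEntry[];
}

// Hypothetical factory: returns a new state object for each alert so that
// no array or object is shared between investigations.
function createInitialState(
  alertId: string,
  severity: Severity,
  now: number = Date.now()
): InvestigationState {
  return {
    alertId,
    severity,
    timestamp: now,
    gatheredEvidence: {},
    activeHypotheses: [],
    evaluatedResults: [],
    finalConclusion: null,
    confidenceThreshold: 0.85, // same default as the article's initialState
    auditTrail: [{ step: "state_created", timestamp: now, data: { alertId } }],
  };
}
```

Passing `now` explicitly keeps the factory deterministic in tests; production callers can rely on the default.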

Step 2: Implement Parallel Evidence Gathering

Instead of sequential API calls, the agent fans out to connected observability and infrastructure endpoints. Each tool adapter returns structured evidence with source metadata.

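// Fan out to every tool adapter at once and record a per-tool outcome.
// The fetch* adapters are assumed to be defined elsewhere and to return Promises.
async function gatherContext(state: InvestigationState): Promise<Partial<InvestigationState>> {
  // Keep each adapter next to its name so the two cannot drift out of sync.
  const tools: Array<[string, Promise<unknown>]> = [
    ["logs", fetchLogMetrics(state.alertId)],
    ["deployments", fetchDeploymentHistory(state.alertId)],
    ["infra", fetchInfraHealth(state.alertId)],
    ["dependencies", fetchDependencyGraph(state.alertId)]
  ];

  // allSettled waits for every call and never rejects, so one failure cannot abort the rest.
  const results = await Promise.allSettled(tools.map(([, call]) => call));
  const evidence: Record<string, any> = {};
  const auditEntries: InvestigationState["auditTrail"] = [];

  results.forEach((res, idx) => {
    const toolName = tools[idx][0];
    if (res.status === "fulfilled") {
      evidence[toolName] = res.value;
      auditEntries.push({ step: `fetch_${toolName}`, timestamp: Date.now(), data: { status: "success" } });
    } else {
      // Store the failure as a string so the audit entry stays serializable.
      auditEntries.push({ step: `fetch_${toolName}`, timestamp: Date.now(), data: { status: "failed", error: String(res.reason) } });
    }
  });

  return { gatheredEvidence: evidence, auditTrail: [...state.auditTrail, ...auditEntries] };
}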

Step 3: Frame & Evaluate Hypotheses

The agent generates plausible failure modes based on the alert signature and gathered context. Each hypothesis is tested against the evidence using a structured evaluation function. Confidence scores are calculated based on signal correlation strength, not LLM intuition.

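// Score each candidate failure mode against the evidence and pick the strongest.
// generateFailureModes, calculateSignalCorrelation, and extractSourceReferences
// are assumed to be deterministic helpers defined elsewhere.
function evaluateHypotheses(state: InvestigationState): Partial<InvestigationState> {
  const hypotheses = generateFailureModes(state.alertId, state.gatheredEvidence);
  const evaluations = hypotheses.map(hyp => {
    const correlationScore = calculateSignalCorrelation(hyp, state.gatheredEvidence);
    // Clamp to [0, 1] so an out-of-range score cannot distort the threshold check.
    const confidence = Math.max(0, Math.min(correlationScore, 1.0));
    return {
      hypothesis: hyp,
      confidence,
      sources: extractSourceReferences(hyp, state.gatheredEvidence)
    };
  });

  // reduce() without an initial value throws on an empty array, so seed it with null.
  const topMatch = evaluations.reduce<(typeof evaluations)[number] | null>(
    (best, curr) => (best === null || curr.confidence > best.confidence ? curr : best),
    null
  );

  // No hypotheses, or none strong enough, leaves the conclusion null and triggers escalation.
  const finalConclusion =
    topMatch !== null && topMatch.confidence >= state.confidenceThreshold ? topMatch.hypothesis : null;

  return {
    activeHypotheses: hypotheses,
    evaluatedResults: evaluations,
    finalConclusion,
    auditTrail: [...state.auditTrail, { step: "hypothesis_evaluation", timestamp: Date.now(), data: evaluations }]
  };
}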
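
The article does not define `calculateSignalCorrelation`, so the following is only one possible reading of "signal correlation strength": each hypothesis lists the signals it expects to see, and the score is the weighted fraction of those signals actually present in the evidence. `ExpectedSignal`, `scoreHypothesis`, the weights, and the sample evidence are all hypothetical.

```typescript
type Evidence = Record<string, unknown>;

// A signal is a named, weighted check against the gathered evidence.
interface ExpectedSignal {
  source: string; // evidence key the check reads, e.g. "deployments"
  weight: number; // relative importance, expected to be > 0
  matches: (evidence: Evidence) => boolean;
}

// Hypothetical scoring rule: weighted fraction of expected signals observed.
function scoreHypothesis(
  signals: ExpectedSignal[],
  evidence: Evidence
): { confidence: number; sources: string[] } {
  const totalWeight = signals.reduce((sum, s) => sum + s.weight, 0);
  if (totalWeight <= 0) return { confidence: 0, sources: [] };

  let matchedWeight = 0;
  const sources: string[] = [];
  for (const signal of signals) {
    // A missing evidence source (for example a failed fetch) counts as "not observed".
    if (signal.source in evidence && signal.matches(evidence)) {
      matchedWeight += signal.weight;
      sources.push(signal.source);
    }
  }
  return { confidence: matchedWeight / totalWeight, sources };
}

// Illustrative data only.
const badDeploySignals: ExpectedSignal[] = [
  { source: "deployments", weight: 2, matches: (e) => (e.deployments as { recent: boolean }).recent },
  { source: "logs", weight: 1, matches: (e) => (e.logs as { errorRate: number }).errorRate > 0.05 },
  { source: "infra", weight: 1, matches: (e) => (e.infra as { healthy: boolean }).healthy },
];

const sampleEvidence: Evidence = {
  deployments: { recent: true },
  logs: { errorRate: 0.12 },
  // "infra" is absent, as if that adapter had failed
};

// Matched weight is 3 of 4, so confidence is 0.75 and the sources are
// "deployments" and "logs". That is below a 0.85 threshold, so this
// hypothesis alone would lead to escalation.
const badDeployScore = scoreHypothesis(badDeploySignals, sampleEvidence);
```

Because the score is computed from explicit checks, the same evidence always yields the same confidence, and the `sources` list doubles as the citation trail.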

Step 4: Route & Terminate

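The graph routes to the reporting node only when the top hypothesis meets or exceeds the confidence threshold. Otherwise, it escalates to manual review. Triggering additional targeted queries instead would require an edge from the evaluation node back to the gathering node, which the graph below omits.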

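import { Annotation } from "@langchain/langgraph";
// Interfaces are erased at compile time, so StateGraph needs a runtime schema, one channel per state field.
const InvestigationAnnotation = Annotation.Root({
  alertId: Annotation<string>(),
  severity: Annotation<InvestigationState["severity"]>(),
  timestamp: Annotation<number>(),
  gatheredEvidence: Annotation<InvestigationState["gatheredEvidence"]>(),
  activeHypotheses: Annotation<string[]>(),
  evaluatedResults: Annotation<InvestigationState["evaluatedResults"]>(),
  finalConclusion: Annotation<string | null>(),
  confidenceThreshold: Annotation<number>(),
  auditTrail: Annotation<InvestigationState["auditTrail"]>()
});

// ingestAlert, generateSlackReport, and notifyOnCall are assumed to be defined elsewhere.
const workflow = new StateGraph(InvestigationAnnotation)
  .addNode("ingest", ingestAlert)
  .addNode("gather", gatherContext)
  .addNode("evaluate", evaluateHypotheses)
  .addNode("report", generateSlackReport)
  .addNode("escalate", notifyOnCall)
  .addEdge(START, "ingest")
  .addEdge("ingest", "gather")
  .addEdge("gather", "evaluate")
  .addConditionalEdges("evaluate", (state) =>
    state.finalConclusion ? "report" : "escalate"
  )
  .addEdge("report", END)
  .addEdge("escalate", END);

const app = workflow.compile();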

Architecture Decisions & Rationale

State Machine over Linear Chain: LangGraph makes state transitions explicit. Because this graph contains no cycles, every run terminates after a fixed number of steps, and the state can be inspected at each node. This is critical for compliance and debugging.
Parallel Tool Execution: Promise.allSettled lets every adapter run to completion even if another one rejects, so partial failures are logged without halting the investigation. It still waits for the slowest call, so each adapter needs its own timeout to keep one slow API from stalling the batch.
Confidence Thresholds: Requiring a minimum confidence (e.g., 0.85) prevents premature conclusions. The agent escalates rather than guesses.
Evidence Grounding: Every conclusion must cite source queries. This limits hallucination drift and enables post-incident verification.
Separation of Investigation & Remediation: The agent stops at reporting. Automated fixes require a separate, human-approved execution pipeline. This aligns with least-privilege SRE practices.

Pitfall Guide

1. Unbounded Tool Execution

Explanation: LLMs can enter recursive loops when querying APIs, exhausting rate limits or incurring unexpected costs. Fix: Implement strict iteration caps (max_iterations: 5), timeout guards per tool, and circuit breakers that halt execution after repeated failures.
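
As a rough illustration of the iteration cap and circuit breaker described above, the sketch below stops calling a tool once it has failed several times in a row or once the iteration budget is spent. `CircuitBreaker` and `runWithGuards` are hypothetical names, the tool is modeled as a synchronous function returning success or failure to keep the example short, and time is injected so the behavior is deterministic.

```typescript
// Hypothetical per-tool circuit breaker: opens after N consecutive failures
// and stays open until the cool-down period has elapsed.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly coolDownMs: number;

  constructor(failureThreshold: number, coolDownMs: number) {
    this.failureThreshold = failureThreshold;
    this.coolDownMs = coolDownMs;
  }

  canExecute(now: number): boolean {
    if (this.openedAt === null) return true;
    // After the cool-down, calls are allowed again; another failure re-opens the breaker.
    return now - this.openedAt >= this.coolDownMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = now;
    }
  }
}

// Calls the tool until it succeeds, the breaker opens, or maxIterations is reached.
function runWithGuards(
  callTool: () => boolean, // true on success; stands in for a real adapter
  breaker: CircuitBreaker,
  maxIterations: number,
  now: () => number
): { calls: number; succeeded: boolean } {
  let calls = 0;
  for (let i = 0; i < maxIterations; i++) {
    if (!breaker.canExecute(now())) break; // breaker open: stop instead of retrying
    calls += 1;
    if (callTool()) {
      breaker.recordSuccess();
      return { calls, succeeded: true };
    }
    breaker.recordFailure(now());
  }
  return { calls, succeeded: false };
}
```

With a threshold of 3 and a cap of 5, a tool that always fails is called three times and then left alone. A real adapter would be asynchronous and would also need a per-call timeout, which this sketch leaves out.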

2. Write-Access During Investigation

Explanation: Granting the agent modify permissions to production systems during triage risks accidental configuration drift or data mutation. Fix: Enforce read-only IAM roles, network egress filtering, and API scopes limited to GET/LIST operations. Remediation must use a separate, gated pipeline.
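
IAM roles and network policy are the real enforcement points, but a client-side guard in the tool adapters can catch mistakes earlier. The sketch below rejects any request whose method is not read-only or whose host is not on an allowlist. `checkReadOnlyRequest` and the host list are hypothetical and chosen only for illustration.

```typescript
// Illustrative allowlist in host:port form; a real one would come from configuration.
const ALLOWED_HOSTS = new Set<string>([
  "grafana.internal:3000",
  "api.datadoghq.com:443",
  "k8s-api.internal:6443",
]);

const READ_ONLY_METHODS = new Set<string>(["GET", "HEAD"]);

interface GuardDecision {
  allowed: boolean;
  reason: string;
}

// Returns a decision instead of throwing so the caller can log it to the audit trail.
function checkReadOnlyRequest(method: string, url: string): GuardDecision {
  const verb = method.toUpperCase();
  if (!READ_ONLY_METHODS.has(verb)) {
    return { allowed: false, reason: `method ${verb} is not read-only` };
  }

  let parsed: URL;
  try {
    parsed = new URL(url);
  } catch {
    return { allowed: false, reason: "unparseable URL" };
  }

  // URL.port is empty when the scheme's default port is used, so fill it in.
  const port = parsed.port !== "" ? parsed.port : parsed.protocol === "https:" ? "443" : "80";
  const hostPort = `${parsed.hostname}:${port}`;
  if (!ALLOWED_HOSTS.has(hostPort)) {
    return { allowed: false, reason: `host ${hostPort} is not on the allowlist` };
  }
  return { allowed: true, reason: "ok" };
}
```

Some observability APIs expose read-only queries over POST. Those calls would need an explicit, reviewed exception rather than a blanket relaxation of the method check.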

3. Hallucinated Correlations

Explanation: The agent may infer causal relationships between unrelated metrics due to pattern matching without statistical validation. Fix: Require source citations for every claim. Implement a correlation validator that checks temporal alignment and statistical significance before accepting a hypothesis.
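
A correlation validator along these lines could look like the sketch below. It accepts a cause and symptom pair only if the cause precedes the symptom within a bounded lag and the two metric series are strongly correlated over enough samples. All names and limits are hypothetical, and a Pearson coefficient with a minimum sample count is only a crude stand-in for a proper significance test.

```typescript
interface TimedEvent {
  source: string;
  timestampMs: number;
}

// A candidate cause must precede the symptom, and by no more than maxLagMs.
function isTemporallyAligned(cause: TimedEvent, symptom: TimedEvent, maxLagMs: number): boolean {
  const lagMs = symptom.timestampMs - cause.timestampMs;
  return lagMs >= 0 && lagMs <= maxLagMs;
}

// Pearson correlation coefficient over paired samples; returns 0 when it is undefined.
function pearson(xs: number[], ys: number[]): number {
  const n = Math.min(xs.length, ys.length);
  if (n < 2) return 0;
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += xs[i];
    sumY += ys[i];
  }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) return 0; // a constant series carries no signal
  return cov / Math.sqrt(varX * varY);
}

interface ValidationRules {
  maxLagMs: number;
  minSamples: number;
  minAbsCorrelation: number;
}

function validateCorrelation(
  cause: TimedEvent,
  symptom: TimedEvent,
  causeSeries: number[],
  symptomSeries: number[],
  rules: ValidationRules
): { accepted: boolean; reason: string } {
  if (!isTemporallyAligned(cause, symptom, rules.maxLagMs)) {
    return { accepted: false, reason: "cause does not precede symptom within the allowed lag" };
  }
  const samples = Math.min(causeSeries.length, symptomSeries.length);
  if (samples < rules.minSamples) {
    return { accepted: false, reason: "too few paired samples" };
  }
  if (Math.abs(pearson(causeSeries, symptomSeries)) < rules.minAbsCorrelation) {
    return { accepted: false, reason: "correlation too weak" };
  }
  return { accepted: true, reason: "aligned and correlated" };
}
```

Returning a reason alongside the verdict lets the rejected hypotheses be recorded in the audit trail with an explanation.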

4. Ignoring Rate Limits & API Costs

Explanation: Parallel queries to Datadog, Grafana, or cloud APIs can trigger throttling or billing spikes. Fix: Implement query batching, response caching for static metadata, and credit budgeting per investigation. Log API call counts in the audit trail.
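
The budgeting and caching ideas above can be sketched with two small classes: a per-investigation call budget whose summary can be appended to the audit trail, and a time-to-live cache for metadata that rarely changes. `InvestigationBudget` and `TtlCache` are hypothetical names, and the clock is passed in explicitly to keep the example deterministic.

```typescript
// Caps the number of outbound API calls for one investigation.
class InvestigationBudget {
  private used = 0;
  private readonly callsByTool = new Map<string, number>();
  private readonly maxCalls: number;

  constructor(maxCalls: number) {
    this.maxCalls = maxCalls;
  }

  // Returns false, and records nothing, once the budget is exhausted.
  tryConsume(tool: string): boolean {
    if (this.used >= this.maxCalls) return false;
    this.used += 1;
    this.callsByTool.set(tool, (this.callsByTool.get(tool) ?? 0) + 1);
    return true;
  }

  // Plain-object snapshot suitable for an audit trail entry.
  summary(): { used: number; remaining: number; byTool: Record<string, number> } {
    const byTool: Record<string, number> = {};
    this.callsByTool.forEach((count, tool) => {
      byTool[tool] = count;
    });
    return { used: this.used, remaining: this.maxCalls - this.used, byTool };
  }
}

// Minimal time-to-live cache for static metadata such as service ownership.
class TtlCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  set(key: string, value: T, now: number): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }

  get(key: string, now: number): T | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined || now >= hit.expiresAt) return undefined;
    return hit.value;
  }
}
```

An adapter would check the cache first, then call `tryConsume` before touching the network, and treat a `false` result the same way as a tool failure.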

5. Skipping Immutable Audit Trails

Explanation: Without persistent state logging, investigations become black boxes. Post-incident reviews lack verifiable evidence. Fix: Append every step, tool response, and confidence score to an append-only log. Store snapshots in object storage with cryptographic checksums for compliance.
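
One common way to make an append-only log tamper-evident is to chain entries by hash, so that altering any earlier entry invalidates everything after it. The sketch below shows the chaining logic only; it is not the article's storage design. The hash function is injected, and `toyHash` is a non-cryptographic FNV-1a stand-in used purely to keep the example dependency-free. All names are hypothetical.

```typescript
type HashFn = (input: string) => string;

interface ChainedEntry {
  step: string;
  timestamp: number;
  data: unknown;
  prevHash: string; // hash of the previous entry, or GENESIS for the first
  hash: string;     // hash over prevHash plus this entry's content
}

const GENESIS = "0";

// Serializes the fields covered by the hash. JSON.stringify keeps object keys in
// insertion order, so verification must run on entries that were stored unchanged.
function digestInput(prevHash: string, step: string, timestamp: number, data: unknown): string {
  return JSON.stringify([prevHash, step, timestamp, data]);
}

// Returns a new array; the existing log is never mutated.
function appendEntry(
  log: readonly ChainedEntry[],
  step: string,
  timestamp: number,
  data: unknown,
  hashFn: HashFn
): ChainedEntry[] {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : GENESIS;
  const hash = hashFn(digestInput(prevHash, step, timestamp, data));
  return [...log, { step, timestamp, data, prevHash, hash }];
}

// Recomputes every hash from the start; any edited entry breaks the chain.
function verifyChain(log: readonly ChainedEntry[], hashFn: HashFn): boolean {
  let prevHash = GENESIS;
  for (const entry of log) {
    const expected = hashFn(digestInput(prevHash, entry.step, entry.timestamp, entry.data));
    if (entry.prevHash !== prevHash || entry.hash !== expected) return false;
    prevHash = entry.hash;
  }
  return true;
}

// NOT cryptographic: 32-bit FNV-1a, here only so the sketch runs without imports.
function toyHash(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}
```

In a Node.js deployment, `toyHash` would be replaced by a SHA-256 function built on `createHash("sha256")` from `node:crypto`, which matches the `checksum_algorithm: sha256` setting in the configuration template.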

6. Over-Automating Remediation

Explanation: Auto-applying fixes based on AI conclusions can cascade failures if the root cause is misidentified. Fix: Decouple investigation from execution. The agent generates a remediation proposal; a human or policy engine approves before any write operation occurs.
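
The decoupling can be made explicit in the data model: the agent may only create proposals, and nothing is executable until an approver has been recorded. The types and function names below are hypothetical and show just the gating logic, not an execution pipeline.

```typescript
type ProposalStatus = "proposed" | "approved" | "rejected";

interface RemediationProposal {
  id: string;
  action: string;            // human-readable description of the intended change
  basedOnHypothesis: string; // conclusion that motivated the proposal
  confidence: number;
  status: ProposalStatus;
  approvedBy: string | null;
}

// The only constructor available to the investigation agent.
function proposeRemediation(
  id: string,
  action: string,
  basedOnHypothesis: string,
  confidence: number
): RemediationProposal {
  return { id, action, basedOnHypothesis, confidence, status: "proposed", approvedBy: null };
}

// Called by a human or a policy engine, never by the agent itself.
function approveProposal(proposal: RemediationProposal, approver: string): RemediationProposal {
  if (proposal.status !== "proposed") {
    throw new Error(`proposal ${proposal.id} is already ${proposal.status}`);
  }
  return { ...proposal, status: "approved", approvedBy: approver };
}

// The execution pipeline checks this before performing any write.
function isExecutable(proposal: RemediationProposal): boolean {
  return proposal.status === "approved" && proposal.approvedBy !== null;
}
```

Keeping `approveProposal` outside the agent's code path means a misidentified root cause can produce, at worst, a wrong suggestion rather than a wrong change.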

7. Poor Confidence Calibration

Explanation: Static thresholds don't adapt to incident complexity. Simple alerts may resolve at 0.7 confidence, while cascading failures require 0.95. Fix: Implement dynamic thresholds based on severity, system criticality, and historical false-positive rates. Log threshold adjustments for model tuning.
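
A dynamic threshold can be as simple as a base value per severity plus adjustments for service criticality and historical false positives. The function below is a sketch under that assumption. The base values echo the 0.7 and 0.95 figures mentioned above, while the tiering scheme, the adjustment sizes, and the clamping bounds are invented for illustration and would need tuning against real incident data.

```typescript
type AlertSeverity = "critical" | "warning" | "info";

interface ThresholdInputs {
  severity: AlertSeverity;
  serviceTier: 1 | 2 | 3;              // 1 = most critical (hypothetical tiering)
  historicalFalsePositiveRate: number; // expected range 0..1
}

// Illustrative base thresholds per severity.
const BASE_THRESHOLD: Record<AlertSeverity, number> = {
  critical: 0.95,
  warning: 0.85,
  info: 0.7,
};

function dynamicThreshold(inputs: ThresholdInputs): number {
  let threshold = BASE_THRESHOLD[inputs.severity];

  // Demand more certainty for the most critical services.
  if (inputs.serviceTier === 1) threshold += 0.02;

  // A history of false positives raises the bar by up to 0.1.
  const fpRate = Math.min(1, Math.max(0, inputs.historicalFalsePositiveRate));
  threshold += 0.1 * fpRate;

  // Keep the result in a sane band so it can neither be trivially met nor unreachable.
  return Math.min(0.99, Math.max(0.5, threshold));
}
```

The returned value would replace the fixed `confidenceThreshold` in the investigation state, and logging it with each run makes later calibration possible.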

Production Bundle

Action Checklist

Define investigation state schema with explicit fields for evidence, hypotheses, and audit logs
Implement parallel tool adapters with timeout guards and circuit breakers
Enforce read-only IAM policies and network isolation for all agent integrations
Set dynamic confidence thresholds based on service criticality and alert severity
Route low-confidence results to human review rather than acting on them automatically
Store immutable audit trails in append-only storage with cryptographic verification
Separate investigation workflows from remediation execution pipelines
Instrument API call metrics, latency, and cost per investigation for capacity planning

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Data pipeline failure (Airflow/Kafka)	AI parallel investigation	High signal density across logs, metrics, and job states; agent correlates efficiently	Low (targets the 60–70% of response time spent gathering context)
Cloud infrastructure drift	AI investigation + policy enforcement	Infra state requires deterministic validation; agent flags drift, policy engine remediates	Medium (API costs + policy engine licensing)
Application-level bug	Traditional runbooks + AI triage	Code-level issues require stack traces and PR diffs; agent gathers context, engineer diagnoses	Low (minimal API overhead)
Compliance audit requirement	AI investigation + immutable logging	Regulatory frameworks demand traceable evidence chains; agent provides auditable state snapshots	High (storage + compliance tooling)

Configuration Template

agent:
  name: incident-investigator-v1
  framework: langgraph
  max_iterations: 5
  timeout_seconds: 120
  confidence_threshold: 0.85
  dynamic_thresholding: true

security:
  iam_role: read-only-sre-investigator
  network_policy: egress-deny-all
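  allowed_endpoints:
    - grafana.internal:3000
    - api.datadoghq.com:443   # matches the Datadog integration endpoint below
    - k8s-api.internal:6443
    - hooks.slack.com:443     # assumed host of the Slack webhook; needed under egress-deny-all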
  write_access: false

integrations:
  observability:
    - type: grafana
      endpoint: ${GRAFANA_URL}
      api_key: ${GRAFANA_API_KEY}
    - type: datadog
      endpoint: https://api.datadoghq.com
      api_key: ${DD_API_KEY}
  infrastructure:
    - type: kubernetes
      cluster: prod-us-east-1
      namespace: default
  communication:
    - type: slack
      webhook: ${SLACK_WEBHOOK_URL}
      channel: "#incidents"

logging:
  audit_storage: s3://sre-audit-logs/
  retention_days: 365
  checksum_algorithm: sha256
  format: jsonl

Quick Start Guide

Initialize the project structure: Create a TypeScript workspace with @langchain/langgraph and your preferred HTTP client. Define the state schema and node functions as shown in the Core Solution.
Configure mock observability endpoints: Spin up local instances of Grafana and Loki, or use HTTP mocks that return structured JSON responses for logs, metrics, and deployment history. Point the tool adapters to these endpoints.
Set security boundaries: Apply read-only API keys, restrict egress traffic to localhost or mock servers, and enable append-only logging to a local directory. Verify that no write operations are possible.
Compile and test the graph: Call workflow.compile(), invoke the compiled graph with a synthetic alert payload, and observe the state transitions. Validate that parallel queries execute concurrently, confidence scoring aligns with evidence, and the audit trail captures every step.
Iterate on thresholds: Adjust the confidence threshold based on test results. Introduce dynamic scaling rules for severity levels, and verify escalation paths route correctly to Slack or PagerDuty when confidence falls below the minimum.
