Each node represents a discrete phase: alert ingestion, context assembly, hypothesis framing, parallel querying, evaluation, and reporting. State is passed explicitly between nodes, ensuring every decision is reproducible.
import { StateGraph, END, START } from "@langchain/langgraph";

interface InvestigationState {
  alertId: string;
  severity: "critical" | "warning" | "info";
  timestamp: number;
  gatheredEvidence: Record<string, any>;
  activeHypotheses: string[];
  evaluatedResults: Array<{ hypothesis: string; confidence: number; sources: string[] }>;
  finalConclusion: string | null;
  confidenceThreshold: number;
  auditTrail: Array<{ step: string; timestamp: number; data: any }>;
}

const initialState: InvestigationState = {
  alertId: "",
  severity: "warning",
  timestamp: Date.now(),
  gatheredEvidence: {},
  activeHypotheses: [],
  evaluatedResults: [],
  finalConclusion: null,
  confidenceThreshold: 0.85,
  auditTrail: []
};
Step 2: Implement Parallel Evidence Gathering
Instead of sequential API calls, the agent fans out to connected observability and infrastructure endpoints. Each tool adapter returns structured evidence with source metadata.
async function gatherContext(state: InvestigationState): Promise<Partial<InvestigationState>> {
  // Order must match the `tools` array below.
  const toolNames = ["logs", "deployments", "infra", "dependencies"];
  // The calls start immediately, so they run concurrently.
  const tools = [
    fetchLogMetrics(state.alertId),
    fetchDeploymentHistory(state.alertId),
    fetchInfraHealth(state.alertId),
    fetchDependencyGraph(state.alertId)
  ];
  // allSettled never rejects, so one failing tool cannot abort the others.
  const results = await Promise.allSettled(tools);
  const evidence: Record<string, any> = {};
  const auditEntries: InvestigationState["auditTrail"] = [];
  results.forEach((res, idx) => {
    const toolName = toolNames[idx];
    if (res.status === "fulfilled") {
      evidence[toolName] = res.value;
      auditEntries.push({ step: `fetch_${toolName}`, timestamp: Date.now(), data: { status: "success" } });
    } else {
      // Stringify the rejection reason so the audit entry stays serializable.
      auditEntries.push({ step: `fetch_${toolName}`, timestamp: Date.now(), data: { status: "failed", error: String(res.reason) } });
    }
  });
  return { gatheredEvidence: evidence, auditTrail: [...state.auditTrail, ...auditEntries] };
}
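The adapters themselves are not shown above. A minimal sketch of what one could look like, assuming a hypothetical `Evidence` envelope for the source metadata and a hypothetical `withTimeout` helper so that a hung endpoint cannot stall the fan-out (both names are illustrative, not part of the original code):

```typescript
// Hypothetical evidence envelope: the payload plus where and when it came from.
interface Evidence<T> {
  source: string;    // e.g. "logs" or "deployments"
  query: string;     // the query or endpoint that produced the payload
  fetchedAt: number; // epoch milliseconds
  payload: T;
}

// Rejects if `work` has not settled within `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it does not keep the process alive.
    clearTimeout(timer);
  }
}

// Wraps any fetcher so that its result carries source metadata.
async function toEvidence<T>(
  source: string,
  query: string,
  fetcher: () => Promise<T>,
  timeoutMs: number = 5000
): Promise<Evidence<T>> {
  const payload = await withTimeout(fetcher(), timeoutMs, source);
  return { source, query, fetchedAt: Date.now(), payload };
}
```

With a wrapper of this kind, a timed-out tool surfaces as a rejected promise, which `Promise.allSettled` records as a failed audit entry instead of blocking the node.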
Step 3: Frame & Evaluate Hypotheses
The agent generates plausible failure modes based on the alert signature and gathered context. Each hypothesis is tested against the evidence using a structured evaluation function. Confidence scores are calculated based on signal correlation strength, not LLM intuition.
function evaluateHypotheses(state: InvestigationState): Partial<InvestigationState> {
  const hypotheses = generateFailureModes(state.alertId, state.gatheredEvidence);
  const evaluations = hypotheses.map(hyp => {
    const correlationScore = calculateSignalCorrelation(hyp, state.gatheredEvidence);
    // Clamp to [0, 1] so an out-of-range score cannot skew routing.
    const confidence = Math.max(0, Math.min(correlationScore, 1.0));
    return {
      hypothesis: hyp,
      confidence,
      sources: extractSourceReferences(hyp, state.gatheredEvidence)
    };
  });
  // reduce() without an initial value throws on an empty array, so seed it with null.
  const topMatch = evaluations.reduce<(typeof evaluations)[number] | null>(
    (prev, curr) => (prev === null || curr.confidence > prev.confidence ? curr : prev),
    null
  );
  // No hypotheses, or none confident enough, leaves the conclusion unset.
  const conclusion =
    topMatch !== null && topMatch.confidence >= state.confidenceThreshold ? topMatch.hypothesis : null;
  return {
    activeHypotheses: hypotheses,
    evaluatedResults: evaluations,
    finalConclusion: conclusion,
    auditTrail: [...state.auditTrail, { step: "hypothesis_evaluation", timestamp: Date.now(), data: evaluations }]
  };
}
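The body of `calculateSignalCorrelation` is not given. One possible building block, assuming the evidence for a hypothesis can be reduced to two equally sampled numeric series (for example, error rate and deploy activity per minute), is a Pearson correlation coefficient. The names `pearson` and `correlationToConfidence` and the mapping from coefficient to confidence are assumptions made for this sketch:

```typescript
// Pearson correlation coefficient of two numeric series.
// Uses the overlapping prefix if lengths differ; returns 0 when the
// series are too short or either one is constant.
function pearson(xs: number[], ys: number[]): number {
  const n = Math.min(xs.length, ys.length);
  if (n < 2) return 0;
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += xs[i];
    sumY += ys[i];
  }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  const denom = Math.sqrt(varX * varY);
  return denom === 0 ? 0 : cov / denom;
}

// Illustrative mapping: only the strength of the relationship counts,
// so a strong negative correlation also yields a high score.
function correlationToConfidence(xs: number[], ys: number[]): number {
  return Math.min(Math.abs(pearson(xs, ys)), 1);
}
```

A correlation coefficient alone says nothing about causality or sample size, so a score like this is better treated as one input to the confidence value than as the value itself.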
Step 4: Route & Terminate
The graph routes to the reporting node only when the top hypothesis meets or exceeds the confidence threshold. Otherwise, it escalates to manual review; triggering additional targeted queries instead would require an edge that loops back to the gather node.
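// An interface is erased at compile time, so StateGraph needs a runtime schema that mirrors InvestigationState.
import { Annotation } from "@langchain/langgraph";

const InvestigationAnnotation = Annotation.Root({
  alertId: Annotation<string>(),
  severity: Annotation<"critical" | "warning" | "info">(),
  timestamp: Annotation<number>(),
  gatheredEvidence: Annotation<Record<string, any>>(),
  activeHypotheses: Annotation<string[]>(),
  evaluatedResults: Annotation<InvestigationState["evaluatedResults"]>(),
  finalConclusion: Annotation<string | null>(),
  confidenceThreshold: Annotation<number>(),
  auditTrail: Annotation<InvestigationState["auditTrail"]>(),
});
const workflow = new StateGraph(InvestigationAnnotation)
  .addNode("ingest", ingestAlert)
  .addNode("gather", gatherContext)
  .addNode("evaluate", evaluateHypotheses)
  .addNode("report", generateSlackReport)
  .addNode("escalate", notifyOnCall)
  .addEdge(START, "ingest")
  .addEdge("ingest", "gather")
  .addEdge("gather", "evaluate")
  .addConditionalEdges("evaluate", (state) =>
    state.finalConclusion ? "report" : "escalate"
  )
  .addEdge("report", END)
  .addEdge("escalate", END);
const app = workflow.compile();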
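The conditional edge in this graph chooses only between reporting and escalation. If additional targeted queries are wanted, one option is a router that loops back to `gather` a bounded number of times before escalating. A minimal sketch, assuming a hypothetical `gatherAttempts` counter in the state and an assumed `MAX_GATHER_ATTEMPTS` budget (neither exists in the original state):

```typescript
type Route = "report" | "gather" | "escalate";

// Only the fields the router needs; `gatherAttempts` is a hypothetical
// counter that the gather node would have to increment on each pass.
interface RoutingView {
  finalConclusion: string | null;
  gatherAttempts: number;
}

// Assumed budget for re-querying before handing off to a human.
const MAX_GATHER_ATTEMPTS = 3;

function routeAfterEvaluate(state: RoutingView): Route {
  // A conclusion is only set when confidence met the threshold.
  if (state.finalConclusion !== null) return "report";
  // Otherwise retry while budget remains, then escalate.
  return state.gatherAttempts < MAX_GATHER_ATTEMPTS ? "gather" : "escalate";
}
```

A function of this shape could be supplied as the routing callback to `addConditionalEdges`, provided the graph state also carries the counter.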
Architecture Decisions & Rationale
- State Machine over Linear Chain: LangGraph makes state transitions explicit, which keeps loops visible and bounded (the runtime enforces a recursion limit) and enables mid-flight inspection. This is critical for compliance and debugging.
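- Parallel Tool Execution: The tool calls run concurrently and Promise.allSettled waits for all of them, so total latency tracks the slowest call rather than the sum, and a single failure does not reject the batch. Partial failures are logged without halting the investigation; a per-tool timeout is still needed to bound a hung call.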
- Confidence Thresholds: Hardcoding a minimum confidence (e.g., 0.85) prevents premature conclusions. The agent escalates rather than guesses.
- Evidence Grounding: Every conclusion must cite source queries. This limits hallucination drift and enables post-incident verification.
- Separation of Investigation & Remediation: The agent stops at reporting. Automated fixes require a separate, human-approved execution pipeline. This aligns with zero-trust SRE practices.
Pitfall Guide
1. Unbounded Agent Loops
Explanation: LLMs can enter recursive loops when querying APIs, exhausting rate limits or incurring unexpected costs.
Fix: Implement strict iteration caps (max_iterations: 5), timeout guards per tool, and circuit breakers that halt execution after repeated failures.
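The circuit-breaker part of this fix can be sketched as follows. This is a minimal illustration rather than a production breaker (it has no half-open state or reset timer); the class name `CircuitBreaker` and the helper `guardedCall` are assumptions made for this sketch:

```typescript
// Opens after `maxFailures` consecutive failures and stays open
// for the rest of the investigation.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private readonly maxFailures: number;

  constructor(maxFailures: number) {
    this.maxFailures = maxFailures;
  }

  get isOpen(): boolean {
    return this.consecutiveFailures >= this.maxFailures;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
  }
}

// Runs `call` unless the breaker is open, and records the outcome.
async function guardedCall<T>(breaker: CircuitBreaker, call: () => Promise<T>): Promise<T> {
  if (breaker.isOpen) {
    throw new Error("circuit open: tool disabled for this investigation");
  }
  try {
    const result = await call();
    breaker.recordSuccess();
    return result;
  } catch (err) {
    breaker.recordFailure();
    throw err;
  }
}
```

One breaker per tool keeps a failing endpoint from being retried on every loop iteration while the remaining tools continue to run.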
2. Write-Access During Investigation
Explanation: Granting the agent modify permissions to production systems during triage risks accidental configuration drift or data mutation.
Fix: Enforce read-only IAM roles, network egress filtering, and API scopes limited to GET/LIST operations. Remediation must use a separate, gated pipeline.
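IAM roles and API scopes are enforced outside the agent, but a client-side guard can add defense in depth. A minimal sketch, assuming every outbound request passes through a hypothetical `assertAllowed` check; the host allowlist here is illustrative:

```typescript
// HTTP methods that do not mutate state.
const READ_ONLY_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

// Illustrative allowlist of host:port pairs the agent may contact.
const ALLOWED_HOSTS = new Set(["grafana.internal:3000", "k8s-api.internal:6443"]);

// Throws unless the request is read-only and targets an allowed host.
function assertAllowed(method: string, url: string): void {
  if (!READ_ONLY_METHODS.has(method.toUpperCase())) {
    throw new Error(`blocked ${method} ${url}: investigation runs read-only`);
  }
  // URL.host includes the port when it is not the scheme default.
  const host = new URL(url).host;
  if (!ALLOWED_HOSTS.has(host)) {
    throw new Error(`blocked ${url}: host ${host} is not on the allowlist`);
  }
}
```

A guard like this does not replace server-side permissions; it only fails fast if a tool adapter is misconfigured.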
3. Hallucinated Correlations
Explanation: The agent may infer causal relationships between unrelated metrics due to pattern matching without statistical validation.
Fix: Require source citations for every claim. Implement a correlation validator that checks temporal alignment and statistical significance before accepting a hypothesis.
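The temporal-alignment half of such a validator can be sketched as below. It assumes a hypothetical `TimedEvent` shape and a lookback window chosen by the operator, and it does not cover statistical significance, which would need a separate test:

```typescript
// Hypothetical shape for a candidate cause, e.g. a deployment or config change.
interface TimedEvent {
  name: string;
  timestamp: number; // epoch milliseconds
}

// A candidate cause is temporally plausible only if it happened
// before the alert and no more than `windowMs` earlier.
function isTemporallyAligned(cause: TimedEvent, alertTimestamp: number, windowMs: number): boolean {
  const lead = alertTimestamp - cause.timestamp;
  return lead >= 0 && lead <= windowMs;
}

// Keeps only the candidates that pass the temporal check.
// The 15-minute default window is an assumption.
function filterPlausibleCauses(
  candidates: TimedEvent[],
  alertTimestamp: number,
  windowMs: number = 15 * 60 * 1000
): TimedEvent[] {
  return candidates.filter((c) => isTemporallyAligned(c, alertTimestamp, windowMs));
}
```

Rejecting events that occur after the alert, or long before it, removes a common class of spurious correlations before any scoring takes place.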
4. Ignoring Rate Limits & API Costs
Explanation: Parallel queries to Datadog, Grafana, or cloud APIs can trigger throttling or billing spikes.
Fix: Implement query batching, response caching for static metadata, and credit budgeting per investigation. Log API call counts in the audit trail.
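Caching and budgeting can be combined in a thin wrapper around each tool call. A minimal in-memory sketch, where `CallBudget`, `TtlCache`, and `cachedFetch` are hypothetical helpers and the limits are left to the caller:

```typescript
// Caps the number of outbound API calls for one investigation.
class CallBudget {
  private used = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  spend(): void {
    if (this.used >= this.limit) {
      throw new Error(`API call budget of ${this.limit} exhausted`);
    }
    this.used += 1;
  }

  get callsUsed(): number {
    return this.used;
  }
}

// Small time-to-live cache for slow-changing metadata.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined || entry.expiresAt <= now) return undefined;
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

// Serves from cache when possible; otherwise spends budget and fetches.
async function cachedFetch<V>(
  key: string,
  cache: TtlCache<V>,
  budget: CallBudget,
  fetcher: () => Promise<V>
): Promise<V> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  budget.spend();
  const value = await fetcher();
  cache.set(key, value);
  return value;
}
```

The `callsUsed` counter is the value that would be written to the audit trail at the end of an investigation.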
5. Skipping Immutable Audit Trails
Explanation: Without persistent state logging, investigations become black boxes. Post-incident reviews lack verifiable evidence.
Fix: Append every step, tool response, and confidence score to an append-only log. Store snapshots in object storage with cryptographic checksums for compliance.
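One way to make an audit log tamper-evident is to chain SHA-256 hashes, so that each record commits to the one before it. The sketch below keeps the log in memory and uses Node's built-in `crypto` module; the record shape and the `"genesis"` seed value are assumptions, and writing the records to object storage is left out:

```typescript
import { createHash } from "node:crypto";

interface AuditRecord {
  step: string;
  timestamp: number;
  data: unknown;
  prevHash: string; // hash of the previous record, or "genesis" for the first
  hash: string;     // SHA-256 over this record's content and prevHash
}

function hashRecord(step: string, timestamp: number, data: unknown, prevHash: string): string {
  const serialized = JSON.stringify({ step, timestamp, data, prevHash });
  return createHash("sha256").update(serialized).digest("hex");
}

// Returns a new log with the record appended; the input log is not mutated.
function appendRecord(log: readonly AuditRecord[], step: string, timestamp: number, data: unknown): AuditRecord[] {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : "genesis";
  const hash = hashRecord(step, timestamp, data, prevHash);
  return [...log, { step, timestamp, data, prevHash, hash }];
}

// Recomputes every hash and checks that each record links to its predecessor.
function verifyChain(log: readonly AuditRecord[]): boolean {
  let prevHash = "genesis";
  for (const rec of log) {
    if (rec.prevHash !== prevHash) return false;
    if (hashRecord(rec.step, rec.timestamp, rec.data, rec.prevHash) !== rec.hash) return false;
    prevHash = rec.hash;
  }
  return true;
}
```

Because each hash covers the previous one, altering or removing an earlier record invalidates every record after it, which is what a post-incident reviewer would check.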
6. Automated Remediation Without Approval
Explanation: Auto-applying fixes based on AI conclusions can cascade failures if the root cause is misidentified.
Fix: Decouple investigation from execution. The agent generates a remediation proposal; a human or policy engine approves before any write operation occurs.
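The decoupling can be modeled as a proposal object that the agent may create but never execute. A minimal sketch with hypothetical names (`RemediationProposal`, `decide`, `assertExecutable`); the execution pipeline itself is out of scope:

```typescript
type ProposalStatus = "pending" | "approved" | "rejected";

interface RemediationProposal {
  id: string;
  action: string;    // human-readable description; the agent never runs it
  rationale: string; // the conclusion and evidence behind the proposal
  status: ProposalStatus;
  decidedBy: string | null;
}

// The investigation agent may only create pending proposals.
function proposeRemediation(id: string, action: string, rationale: string): RemediationProposal {
  return { id, action, rationale, status: "pending", decidedBy: null };
}

// A human or policy engine records the decision exactly once.
function decide(p: RemediationProposal, approver: string, approved: boolean): RemediationProposal {
  if (p.status !== "pending") {
    throw new Error(`proposal ${p.id} is already ${p.status}`);
  }
  return { ...p, status: approved ? "approved" : "rejected", decidedBy: approver };
}

// The execution pipeline calls this before any write operation.
function assertExecutable(p: RemediationProposal): void {
  if (p.status !== "approved" || p.decidedBy === null) {
    throw new Error(`proposal ${p.id} has not been approved`);
  }
}
```

Keeping the approval check in the execution pipeline, not in the agent, means a misbehaving agent cannot bypass it.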
7. Poor Confidence Calibration
Explanation: Static thresholds don't adapt to incident complexity. Simple alerts may resolve at 0.7 confidence, while cascading failures require 0.95.
Fix: Implement dynamic thresholds based on severity, system criticality, and historical false-positive rates. Log threshold adjustments for model tuning.
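A dynamic threshold can start from a per-severity base and tighten as the historical false-positive rate grows. In the sketch below, the base values reuse the 0.7, 0.85, and 0.95 figures mentioned in this guide, but mapping them to severity levels, the 0.05 adjustment, and the 0.99 cap are all assumptions:

```typescript
type Severity = "critical" | "warning" | "info";

// Assumed base thresholds per severity level.
const BASE_THRESHOLD: Record<Severity, number> = {
  critical: 0.95,
  warning: 0.85,
  info: 0.7,
};

// Raises the base threshold by up to 0.05 as the false-positive rate
// approaches 1, and never returns more than 0.99.
function dynamicThreshold(severity: Severity, historicalFalsePositiveRate: number): number {
  const fpr = Math.min(Math.max(historicalFalsePositiveRate, 0), 1);
  return Math.min(BASE_THRESHOLD[severity] + 0.05 * fpr, 0.99);
}
```

The returned value would replace the static `confidenceThreshold` in the state at ingestion time, and logging it alongside the inputs supports later tuning.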
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Data pipeline failure (Airflow/Kafka) | AI parallel investigation | High signal density across logs, metrics, and job states; agent correlates efficiently | Low (reduces MTTR by 60–70%) |
| Cloud infrastructure drift | AI investigation + policy enforcement | Infra state requires deterministic validation; agent flags drift, policy engine remediates | Medium (API costs + policy engine licensing) |
| Application-level bug | Traditional runbooks + AI triage | Code-level issues require stack traces and PR diffs; agent gathers context, engineer diagnoses | Low (minimal API overhead) |
| Compliance audit requirement | AI investigation + immutable logging | Regulatory frameworks demand traceable evidence chains; agent provides auditable state snapshots | High (storage + compliance tooling) |
Configuration Template
agent:
  name: incident-investigator-v1
  framework: langgraph
  max_iterations: 5
  timeout_seconds: 120
  confidence_threshold: 0.85
  dynamic_thresholding: true
security:
  iam_role: read-only-sre-investigator
  network_policy: egress-deny-all
  allowed_endpoints:
    - grafana.internal:3000
    - api.datadoghq.com:443
    - k8s-api.internal:6443
  write_access: false
integrations:
  observability:
    - type: grafana
      endpoint: ${GRAFANA_URL}
      api_key: ${GRAFANA_API_KEY}
    - type: datadog
      endpoint: https://api.datadoghq.com
      api_key: ${DD_API_KEY}
  infrastructure:
    - type: kubernetes
      cluster: prod-us-east-1
      namespace: default
  communication:
    - type: slack
      webhook: ${SLACK_WEBHOOK_URL}
      channel: "#incidents"
logging:
  audit_storage: s3://sre-audit-logs/
  retention_days: 365
  checksum_algorithm: sha256
  format: jsonl
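Before the agent starts, the parsed configuration can be checked against the safety assumptions made throughout this guide. A minimal sketch, assuming the template has already been parsed into an object with `agent` and `security` sections; the `AgentConfig` type and `validateConfig` function are hypothetical and cover only a subset of the fields:

```typescript
// Subset of the template that matters for safety checks.
interface AgentConfig {
  agent: {
    max_iterations: number;
    timeout_seconds: number;
    confidence_threshold: number;
  };
  security: {
    write_access: boolean;
    allowed_endpoints: string[];
  };
}

// Returns a list of problems; an empty list means the config passed.
function validateConfig(cfg: AgentConfig): string[] {
  const problems: string[] = [];
  if (!Number.isInteger(cfg.agent.max_iterations) || cfg.agent.max_iterations < 1) {
    problems.push("agent.max_iterations must be a positive integer");
  }
  if (cfg.agent.timeout_seconds <= 0) {
    problems.push("agent.timeout_seconds must be positive");
  }
  if (cfg.agent.confidence_threshold <= 0 || cfg.agent.confidence_threshold > 1) {
    problems.push("agent.confidence_threshold must be in (0, 1]");
  }
  if (cfg.security.write_access) {
    problems.push("security.write_access must be false during investigation");
  }
  if (cfg.security.allowed_endpoints.length === 0) {
    problems.push("security.allowed_endpoints must list at least one endpoint");
  }
  return problems;
}
```

Refusing to start when the list is non-empty keeps a misedited template from silently granting write access or removing the iteration cap.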
Quick Start Guide
- Initialize the project structure: Create a TypeScript workspace with @langchain/langgraph and your preferred HTTP client. Define the state schema and node functions as shown in the Core Solution.
- Configure mock observability endpoints: Spin up local instances of Grafana and Loki, or use HTTP mocks that return structured JSON responses for logs, metrics, and deployment history. Point the tool adapters to these endpoints.
- Set security boundaries: Apply read-only API keys, restrict egress traffic to localhost or mock servers, and enable append-only logging to a local directory. Verify that no write operations are possible.
- Compile and test the graph: Call workflow.compile(), invoke the compiled graph with a synthetic alert payload, and observe the state transitions. Validate that parallel queries execute concurrently, confidence scoring aligns with evidence, and the audit trail captures every step.
- Iterate on thresholds: Adjust the confidence threshold based on test results. Introduce dynamic scaling rules for severity levels, and verify escalation paths route correctly to Slack or PagerDuty when confidence falls below the minimum.