AI Agent Debugging Checklist: From Failed Run to Root Cause

By Codcompass Team·2026-06-01·9 min read

Forensic Replay: A Protocol for Root Cause Analysis in Production AI Agents

Current Situation Analysis

Production AI agents introduce a debugging paradigm that breaks traditional request-tracing assumptions. When a standard microservice fails, engineers inspect logs, identify the error, and fix the code. When an AI agent fails, the instinct is often to tweak the prompt and rerun the workflow. This approach is not just ineffective; it is actively destructive to incident resolution.

A naive rerun of an AI agent alters multiple variables simultaneously. The stochastic nature of the model may produce different outputs. The retrieval system may return different chunks due to index drift or timing. External APIs may return updated data. Tool states may have changed. Worse, if the agent has already executed side effects—sending emails, issuing refunds, modifying database records, or triggering external webhooks—a rerun risks duplicating these actions.

The core misunderstanding is treating agents as deterministic functions rather than stateful systems with external dependencies. Debugging requires preserving the exact state of the failed execution to isolate the "first unsupported decision"—the earliest point where the agent's action lacked sufficient evidence from its context. Without a forensic protocol, teams waste hours chasing symptoms caused by stale retrieval, permission drift, or tool schema mismatches, while the actual root cause remains obscured by the noise of a changed environment.

WOW Moment: Key Findings

The difference between a naive rerun and a forensic replay shows up in safety, fidelity, and resolution speed. Forensic replay pins the execution environment, allowing engineers to analyze the agent's decision graph without risking production state.

Strategy	Side-Effect Risk	Context Fidelity	Root Cause Visibility	Regression Readiness
Naive Rerun	High: risk of duplicating writes, emails, or API calls.	Low: context shifts due to index updates, memory changes, or API state drift.	Low: model output changes, so model error is hard to distinguish from context error.	None: requires manual recreation of inputs; fragile.
Forensic Replay	Zero: mutating tools are stubbed; no external state changes.	High: inputs, context, tool responses, and timestamps are pinned.	High: the decision graph is analyzed against pinned evidence; divergence is isolated.	Immediate: the incident converts directly to a regression fixture.

Forensic replay enables the team to answer three critical questions deterministically: What did the agent see? What did it choose? Why was that choice unsupported by the evidence?

Core Solution

Implementing a forensic replay protocol requires instrumenting the agent lifecycle to capture a complete execution snapshot, classifying tools by risk, and providing a safe replay engine. The following TypeScript implementation demonstrates the architecture.

1. Define the Trace Snapshot Schema

The foundation of forensic analysis is a structured snapshot that captures identity, context, and decisions. This schema ensures no evidence is lost.

interface TraceSnapshot {
  // Identity
  traceId: string;
  runId: string;
  agentVersion: string;
  deploymentSha: string;
  modelProvider: string;
  modelId: string;
  timestamp: string;
  
  // Context
  userPayload: unknown;
  systemInstructions: string;
  retrievedChunks: Array<{ source: string; content: string; score: number }>;
  memoryEntries: Array<{ key: string; value: string; timestamp: string }>;
  featureFlags: Record<string, boolean>;
  
  // Decision Graph
  steps: DecisionStep[];
  sideEffects: SideEffectRecord[];
}

interface DecisionStep {
  stepIndex: number;
  type: 'model_call' | 'tool_call' | 'guardrail' | 'handoff';
  inputContext: string;
  modelOutput?: string;
  selectedAction: string;
  alternatives?: string[];
  toolCall?: ToolCallRecord;
  guardrailResult?: GuardrailResult;
  evidenceSupport: 'strong' | 'weak' | 'none';
}

interface ToolCallRecord {
  toolName: string;
  riskLevel: 'read_only' | 'write' | 'risky_write' | 'human_approval';
  arguments: Record<string, unknown>;
  validation: 'passed' | 'failed';
  authScope: string;
  rawResponse: unknown;
  normalizedResponse: unknown;
  latencyMs: number;
  retryCount: number;
  idempotencyKey?: string;
  durableReceiptId?: string;
}
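TraceSnapshot and DecisionStep refer to SideEffectRecord and GuardrailResult, which the schema does not define. A minimal sketch of both follows. The SideEffectRecord fields mirror the object that the capture service pushes into sideEffects; GuardrailResult is entirely an assumption, since the article never shows its contents, and isSideEffectRecord and isBlocking are hypothetical helpers.

```typescript
// Mirrors the object the capture service records for each mutating tool call.
interface SideEffectRecord {
  type: 'tool_mutation';
  toolName: string;
  receiptId?: string; // absent when the tool returned no durable receipt
  timestamp: string;  // ISO 8601 string
}

// Hypothetical shape: these field names are assumptions for illustration only.
interface GuardrailResult {
  guardrailName: string;
  outcome: 'allow' | 'block' | 'escalate';
  reason?: string;
}

// Anything other than an explicit "allow" stops the agent from proceeding.
function isBlocking(result: GuardrailResult): boolean {
  return result.outcome !== 'allow';
}

// Runtime check for snapshots loaded from storage, where compile-time types do not apply.
function isSideEffectRecord(value: unknown): value is SideEffectRecord {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v.type === 'tool_mutation' &&
    typeof v.toolName === 'string' &&
    typeof v.timestamp === 'string' &&
    (v.receiptId === undefined || typeof v.receiptId === 'string')
  );
}
```

A runtime guard matters here because snapshots are usually deserialized from JSON during an incident, long after the code that wrote them was compiled.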


2. Implement the Forensic Capture Service

This service intercepts agent execution to build the snapshot. It classifies tools and records side effects.

```typescript
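// Explicit import so the code does not depend on a global `crypto`,
// which older Node.js releases do not expose.
import * as crypto from 'node:crypto';

class ForensicCaptureService {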
  private snapshot: TraceSnapshot;
  private toolRegistry: Map<string, ToolRiskLevel>;

  constructor(runId: string, agentVersion: string) {
    this.snapshot = {
      traceId: crypto.randomUUID(),
      runId,
      agentVersion,
      deploymentSha: process.env.DEPLOYMENT_SHA || 'unknown',
      modelProvider: 'default',
      modelId: 'default',
      timestamp: new Date().toISOString(),
      userPayload: null,
      systemInstructions: '',
      retrievedChunks: [],
      memoryEntries: [],
      featureFlags: {},
      steps: [],
      sideEffects: [],
    };
    this.toolRegistry = new Map();
  }

  registerTool(toolName: string, riskLevel: ToolRiskLevel): void {
    this.toolRegistry.set(toolName, riskLevel);
  }

  captureContext(
    payload: unknown,
    instructions: string,
    chunks: TraceSnapshot['retrievedChunks'],
    memory: TraceSnapshot['memoryEntries'],
  ): void {
    this.snapshot.userPayload = payload;
    this.snapshot.systemInstructions = instructions;
    this.snapshot.retrievedChunks = chunks;
    this.snapshot.memoryEntries = memory;
  }

  recordDecision(step: DecisionStep): void {
    this.snapshot.steps.push(step);
  }

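  recordToolCall(toolName: string, args: Record<string, unknown>, response: unknown, isMutation: boolean): void {
    // Fail closed: an unregistered tool is treated as a risky write so that replay stubs it.
    const riskLevel = this.toolRegistry.get(toolName) ?? 'risky_write';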
    const toolCall: ToolCallRecord = {
      toolName,
      riskLevel,
      arguments: args,
      validation: 'passed',
      authScope: 'current_user',
      rawResponse: response,
      normalizedResponse: response,
      latencyMs: 0,
      retryCount: 0,
      idempotencyKey: isMutation ? crypto.randomUUID() : undefined,
      durableReceiptId: isMutation ? `receipt_${Date.now()}` : undefined,
    };
    
    const step: DecisionStep = {
      stepIndex: this.snapshot.steps.length,
      type: 'tool_call',
      inputContext: 'tool_execution',
      selectedAction: `call_tool:${toolName}`,
      toolCall,
      evidenceSupport: 'strong',
    };
    
    this.snapshot.steps.push(step);

    if (isMutation) {
      this.snapshot.sideEffects.push({
        type: 'tool_mutation',
        toolName,
        receiptId: toolCall.durableReceiptId,
        timestamp: new Date().toISOString(),
      });
    }
  }

  getSnapshot(): TraceSnapshot {
    return JSON.parse(JSON.stringify(this.snapshot));
  }
}

type ToolRiskLevel = 'read_only' | 'write' | 'risky_write' | 'human_approval';
```
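The service takes both a registered risk level and a separate isMutation flag, so the two can disagree. One way to avoid that is to derive the mutation flag from the risk level and to fail closed for tools that were never registered. The sketch below is an assumption about how that could look; resolveRisk and isMutating are hypothetical helpers, not part of the service.

```typescript
type ToolRiskLevel = 'read_only' | 'write' | 'risky_write' | 'human_approval';

// Unregistered tools are treated as risky writes so that replay stubs them by default.
function resolveRisk(registry: ReadonlyMap<string, ToolRiskLevel>, toolName: string): ToolRiskLevel {
  return registry.get(toolName) ?? 'risky_write';
}

// Everything except read_only is assumed to be capable of changing external state.
function isMutating(level: ToolRiskLevel): boolean {
  return level !== 'read_only';
}

const registry = new Map<string, ToolRiskLevel>([
  ['search_docs', 'read_only'],
  ['issue_refund', 'risky_write'],
]);

const refundIsMutation = isMutating(resolveRisk(registry, 'issue_refund'));    // true
const unknownIsMutation = isMutating(resolveRisk(registry, 'delete_account')); // true (fail closed)
const searchIsMutation = isMutating(resolveRisk(registry, 'search_docs'));     // false
```

Deriving the flag from a single source of truth means a tool cannot be recorded as a mutation in one place and replayed as read-only in another.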

3. Build the Safe Replay Engine

The replay engine consumes a snapshot and executes the agent logic with pinned inputs. It stubs mutating tools to prevent side effects.

interface ReplayConfig {
  pinModelOutput?: boolean;
  stubMutatingTools: boolean;
  overrideModelId?: string;
}

class ReplayOrchestrator {
  private snapshot: TraceSnapshot;
  private config: ReplayConfig;

  constructor(snapshot: TraceSnapshot, config: ReplayConfig) {
    this.snapshot = snapshot;
    this.config = config;
  }

  async execute(): Promise<ReplayResult> {
    const result: ReplayResult = {
      traceId: this.snapshot.traceId,
      steps: [],
      sideEffectsBlocked: [],
      errors: [],
    };

    for (const step of this.snapshot.steps) {
      if (step.type === 'tool_call' && step.toolCall) {
        const isMutating = ['write', 'risky_write', 'human_approval'].includes(step.toolCall.riskLevel);
        
        if (isMutating && this.config.stubMutatingTools) {
          result.sideEffectsBlocked.push({
            toolName: step.toolCall.toolName,
            reason: 'Mutating tool stubbed during replay',
            originalReceiptId: step.toolCall.durableReceiptId,
          });
          
          result.steps.push({
            ...step,
            toolCall: {
              ...step.toolCall,
              rawResponse: { status: 'stubbed', original: step.toolCall.rawResponse },
              normalizedResponse: { status: 'stubbed' },
            },
          });
          continue;
        }
      }

      // In a real implementation, this would invoke the agent logic
      // with pinned context. For this example, we record the replay step.
      result.steps.push(step);
    }

    return result;
  }
}

interface ReplayResult {
  traceId: string;
  steps: DecisionStep[];
  sideEffectsBlocked: Array<{ toolName: string; reason: string; originalReceiptId?: string }>;
  errors: string[];
}
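The orchestrator passes non-mutating steps through unchanged. If the agent logic is actually re-invoked during replay, read-only tools also need to return their recorded responses, otherwise the context is no longer pinned. A minimal sketch, assuming tool calls are matched by tool name plus serialized arguments; PinnedToolResponder and its matching rule are hypothetical.

```typescript
interface RecordedCall {
  toolName: string;
  arguments: Record<string, unknown>;
  rawResponse: unknown;
}

class PinnedToolResponder {
  // Identical calls may occur more than once, so responses are queued per key.
  private readonly responses = new Map<string, unknown[]>();

  constructor(recorded: RecordedCall[]) {
    for (const call of recorded) {
      const key = PinnedToolResponder.keyFor(call.toolName, call.arguments);
      const queue = this.responses.get(key) ?? [];
      queue.push(call.rawResponse);
      this.responses.set(key, queue);
    }
  }

  // Argument keys are sorted (top level only) so key order does not affect matching.
  private static keyFor(toolName: string, args: Record<string, unknown>): string {
    const sorted = Object.keys(args).sort().map((k) => [k, args[k]]);
    return `${toolName}:${JSON.stringify(sorted)}`;
  }

  // Returns the recorded response in original order. Throws when the replayed agent
  // requests something the original run never asked for, which is a divergence signal.
  respond(toolName: string, args: Record<string, unknown>): unknown {
    const queue = this.responses.get(PinnedToolResponder.keyFor(toolName, args));
    if (!queue || queue.length === 0) {
      throw new Error(`No recorded response for ${toolName}; replay diverged from the trace`);
    }
    return queue.shift();
  }
}
```

Throwing on an unrecorded call is a deliberate choice in this sketch: silently falling back to the live tool would reintroduce the drift that replay is meant to remove.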

4. Identify the First Unsupported Decision

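Analyze the trace to find the first step where the agent's action was not supported by the pinned context. That step is the strongest root-cause candidate. The function below relies on the evidenceSupport label recorded for each step and treats both 'weak' and 'none' as unsupported.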

function analyzeFirstUnsupportedDecision(snapshot: TraceSnapshot): string | null {
  for (const step of snapshot.steps) {
    if (step.evidenceSupport === 'none' || step.evidenceSupport === 'weak') {
      return `Step ${step.stepIndex}: ${step.selectedAction} had ${step.evidenceSupport} evidence support.`;
    }
  }
  return null;
}
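The analysis function depends on evidenceSupport labels recorded at capture time. A complementary check, sketched here under assumptions, compares the original steps with the replayed steps and reports the first index where the selected action differs. findFirstDivergence and the StepSummary shape are hypothetical and keep only the fields needed for the comparison.

```typescript
interface StepSummary {
  stepIndex: number;
  selectedAction: string;
}

interface Divergence {
  stepIndex: number;
  original: string | null; // null when the original run had no step at this index
  replayed: string | null; // null when the replay ended early
}

function findFirstDivergence(original: StepSummary[], replayed: StepSummary[]): Divergence | null {
  const length = Math.max(original.length, replayed.length);
  for (let i = 0; i < length; i++) {
    const a = original[i]?.selectedAction ?? null;
    const b = replayed[i]?.selectedAction ?? null;
    if (a !== b) {
      return { stepIndex: i, original: a, replayed: b };
    }
  }
  return null; // every step chose the same action
}
```

With pinned context, a divergence at step N suggests the change under test (prompt, model, or configuration) altered the decision at that step rather than somewhere downstream.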

5. Convert to Regression Fixture

Once the root cause is identified, the snapshot and analysis are converted into a regression test. The test then flags the failure if it reappears after a prompt change, a model upgrade, or retrieval drift.

interface RegressionFixture {
  name: string;
  snapshot: TraceSnapshot;
  expectedDecision: string;
  blockedActions: string[];
  assertions: Array<{ type: string; path: string; value: unknown }>;
  rootCauseNote: string;
}

function createRegressionFixture(
  snapshot: TraceSnapshot, 
  rootCause: string
): RegressionFixture {
  return {
    name: `regression_${snapshot.runId}`,
    snapshot,
    expectedDecision: 'agent_should_not_execute_mutation',
    blockedActions: snapshot.sideEffects.map(se => se.toolName),
    assertions: [
      { type: 'no_side_effects', path: 'sideEffects', value: [] },
      { type: 'decision_support', path: 'steps[*].evidenceSupport', value: 'strong' },
    ],
    rootCauseNote: rootCause,
  };
}
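A fixture is only useful once something evaluates it. The sketch below checks a replayed run against the two assertion ideas used in the fixture: no blocked action executed, and every step has strong evidence support. checkFixture and the ReplayedRun shape are assumptions; a real harness would plug into whatever test runner the team already uses.

```typescript
type EvidenceSupport = 'strong' | 'weak' | 'none';

interface ReplayedRun {
  steps: Array<{ stepIndex: number; selectedAction: string; evidenceSupport: EvidenceSupport }>;
  executedSideEffects: string[]; // tool names that actually ran (were not stubbed) during replay
}

interface FixtureExpectation {
  blockedActions: string[]; // tool names that must not execute
}

function checkFixture(run: ReplayedRun, fixture: FixtureExpectation): string[] {
  const failures: string[] = [];

  for (const toolName of run.executedSideEffects) {
    if (fixture.blockedActions.includes(toolName)) {
      failures.push(`Blocked action executed: ${toolName}`);
    }
  }

  for (const step of run.steps) {
    if (step.evidenceSupport !== 'strong') {
      failures.push(
        `Step ${step.stepIndex} (${step.selectedAction}) has ${step.evidenceSupport} evidence support`,
      );
    }
  }

  return failures; // an empty array means the fixture passes
}
```

Returning a list of failures rather than a boolean keeps the incident note and the test output aligned, because each failure names the step or tool involved.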

Pitfall Guide

The Stochastic Rerun Fallacy
- Explanation: Rerunning the agent without pinning inputs changes the model output, making it impossible to determine if a fix works or if the model just got lucky.
- Fix: Always use a replay engine that pins user input, retrieved context, and tool responses before testing changes.
Context Blindness
- Explanation: Blaming the model for hallucination when the retrieval system returned stale or irrelevant chunks. The model is only as good as its evidence.
- Fix: Inspect retrievedChunks and memoryEntries in the snapshot. Verify chunk relevance and freshness before adjusting prompts.
Tool Equivalence Error
- Explanation: Treating a read-only search tool the same as a write tool that issues refunds. This leads to unsafe debugging practices that duplicate side effects.
- Fix: Classify all tools by risk level. Enforce stubbing for write, risky_write, and human_approval tools during replay.
Idempotency Ignorance
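- Explanation: Retrying a failed tool call with a fresh idempotency key, or with none at all, can cause duplicate mutations even when the tool API supports idempotency, because the API can only deduplicate requests that carry the same key.
- Fix: Capture idempotencyKey and retryCount in tool logs. Reuse the original key on every retry, and make replay stubs return the recorded result for a known key instead of executing the call again.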
Prompt Tunnel Vision
- Explanation: Focusing exclusively on prompt engineering while ignoring system state, permissions, or feature flags that may have caused the failure.
- Fix: Review featureFlags, authScope, and memoryEntries in the snapshot. A prompt change cannot fix a permission error or stale memory.
Missing the Divergence Point
- Explanation: Analyzing the final output instead of the decision graph. The final response may look correct even if an intermediate decision was wrong.
- Fix: Use the analyzeFirstUnsupportedDecision function to find the earliest step with weak or no evidence support.
Regression Fragility
- Explanation: Creating regression tests that depend on exact model output strings, which break on model upgrades.
- Fix: Assert on decisions, side effects, and evidence support rather than exact text. Use expectedDecision and blockedActions in fixtures.
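The idempotency pitfall can be made concrete with a small in-memory executor that runs a mutation at most once per key and returns the stored result on retries. This is a sketch under assumptions: IdempotentExecutor is hypothetical, and a production version would persist keys in durable storage rather than a Map.

```typescript
class IdempotentExecutor<T> {
  private readonly results = new Map<string, T>();

  // Runs `action` only if `idempotencyKey` has not been seen; otherwise returns the stored result.
  execute(idempotencyKey: string, action: () => T): { result: T; replayed: boolean } {
    if (this.results.has(idempotencyKey)) {
      return { result: this.results.get(idempotencyKey) as T, replayed: true };
    }
    const result = action();
    this.results.set(idempotencyKey, result);
    return { result, replayed: false };
  }
}

let refundsIssued = 0;
const executor = new IdempotentExecutor<string>();
const issueRefund = (): string => {
  refundsIssued += 1;
  return `refund_${refundsIssued}`;
};

const first = executor.execute('key-123', issueRefund);
const retry = executor.execute('key-123', issueRefund); // same key: the action is not run again
const fresh = executor.execute('key-456', issueRefund); // new key: the action runs a second time
```

The third call shows the failure mode described in the pitfall: a retry that generates a new key is indistinguishable from a new request, so the refund is issued twice.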

Production Bundle

Action Checklist

Capture Identity: Record traceId, runId, agent version, deployment SHA, model ID, and timestamp for every run.
Classify Tools: Map all tools to risk levels (read_only, write, risky_write, human_approval) in the tool registry.
Pin Context: Ensure userPayload, systemInstructions, retrievedChunks, and memoryEntries are captured in the snapshot.
Stub Mutations: Configure the replay engine to stub all mutating tools and block side effects.
Analyze Decisions: Run the decision analyzer to identify the first unsupported decision in the trace.
Verify Evidence: Check retrieved chunks and memory for staleness or irrelevance before blaming the model.
Create Fixture: Convert the incident into a regression fixture with pinned context and assertions.
Review Incident: Use the incident review template to document the root cause and fix.
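The Capture Identity and Pin Context items can be enforced mechanically before a snapshot is accepted for replay. A minimal sketch, assuming a reduced snapshot shape with only the fields being checked; findMissingEvidence is a hypothetical helper, and treating 'unknown' or 'default' as missing is an assumption based on the placeholder values the capture service uses.

```typescript
interface SnapshotEvidence {
  traceId: string;
  runId: string;
  agentVersion: string;
  deploymentSha: string;
  modelId: string;
  timestamp: string;
  systemInstructions: string;
  userPayload: unknown;
}

// Values that indicate a field was never filled in with real data.
const PLACEHOLDERS = new Set(['', 'unknown', 'default']);

function findMissingEvidence(snapshot: SnapshotEvidence): string[] {
  const missing: string[] = [];
  const stringFields: Array<keyof SnapshotEvidence> = [
    'traceId', 'runId', 'agentVersion', 'deploymentSha', 'modelId', 'timestamp', 'systemInstructions',
  ];

  for (const field of stringFields) {
    const value = snapshot[field];
    if (typeof value !== 'string' || PLACEHOLDERS.has(value.trim())) {
      missing.push(field);
    }
  }

  // A timestamp that does not parse cannot be used to pin time-dependent context.
  if (missing.indexOf('timestamp') === -1 && Number.isNaN(Date.parse(snapshot.timestamp))) {
    missing.push('timestamp');
  }

  if (snapshot.userPayload === null || snapshot.userPayload === undefined) {
    missing.push('userPayload');
  }
  return missing;
}
```

Running a check like this at capture time, rather than during an incident, surfaces gaps while the missing data can still be recorded.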

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Critical Write Failure	Full Stub Replay	Prevents duplicate mutations; isolates decision logic safely.	Low compute; high safety.
Retrieval Drift	Context Pin + Model Rerun	Isolates retrieval quality from model behavior.	Medium compute; allows model evaluation.
New Model Evaluation	Shadow Replay	Compares decisions between models without side effects.	High compute; valuable for upgrades.
Permission Error	Auth Scope Replay	Verifies agent behavior under specific permission constraints.	Low compute; focuses on auth logic.
Memory Corruption	Memory Pin + Replay	Tests agent behavior with specific memory states.	Low compute; targets memory issues.
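Three rows of the matrix map onto the ReplayConfig fields shown earlier. The sketch below is one possible mapping, not a prescribed one: the scenario names, replayConfigFor, and the choice to pin model output for a full stub replay are assumptions. The permission and memory rows need inputs that ReplayConfig does not model, so they are left out.

```typescript
interface ReplayConfig {
  pinModelOutput?: boolean;
  stubMutatingTools: boolean;
  overrideModelId?: string;
}

type ReplayScenario =
  | { kind: 'critical_write_failure' }
  | { kind: 'retrieval_drift' }
  | { kind: 'new_model_evaluation'; candidateModelId: string };

function replayConfigFor(scenario: ReplayScenario): ReplayConfig {
  switch (scenario.kind) {
    case 'critical_write_failure':
      // Full stub replay: reuse recorded model output and block every mutation.
      return { pinModelOutput: true, stubMutatingTools: true };
    case 'retrieval_drift':
      // Context stays pinned; the model is called again so its behavior can be judged.
      return { pinModelOutput: false, stubMutatingTools: true };
    case 'new_model_evaluation':
      // Shadow replay: same pinned context, different model, still no side effects.
      return {
        pinModelOutput: false,
        stubMutatingTools: true,
        overrideModelId: scenario.candidateModelId,
      };
  }
}
```

Note that stubMutatingTools stays true in every branch; only the model-related settings vary between scenarios.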

Configuration Template

Use this TypeScript configuration to set up the forensic capture and replay system.

// agent-debug-config.ts

export interface AgentDebugConfig {
  capture: {
    enabled: boolean;
    includeRawToolResponses: boolean;
    includeMemoryEntries: boolean;
    includeFeatureFlags: boolean;
  };
  replay: {
    enabled: boolean;
    stubMutatingTools: boolean;
    pinModelOutput: boolean;
    maxReplayRetries: number;
  };
  tools: {
    [toolName: string]: {
      riskLevel: 'read_only' | 'write' | 'risky_write' | 'human_approval';
      idempotent: boolean;
    };
  };
  regression: {
    autoGenerate: boolean;
    assertionStrategy: 'decision_based' | 'output_based';
  };
}

export const defaultConfig: AgentDebugConfig = {
  capture: {
    enabled: true,
    includeRawToolResponses: true,
    includeMemoryEntries: true,
    includeFeatureFlags: true,
  },
  replay: {
    enabled: true,
    stubMutatingTools: true,
    pinModelOutput: false,
    maxReplayRetries: 3,
  },
  tools: {
    'send_email': { riskLevel: 'risky_write', idempotent: false },
    'issue_refund': { riskLevel: 'risky_write', idempotent: true },
    'search_docs': { riskLevel: 'read_only', idempotent: true },
    'update_ticket': { riskLevel: 'write', idempotent: false },
  },
  regression: {
    autoGenerate: true,
    assertionStrategy: 'decision_based',
  },
};
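A configuration like this can be sanity-checked at startup. The sketch below flags combinations that contradict the protocol, such as mutating tools with stubbing disabled. lintDebugConfig is a hypothetical helper, the interface is trimmed to the fields it reads, and the specific warnings are assumptions about what a team might want to catch.

```typescript
type RiskLevel = 'read_only' | 'write' | 'risky_write' | 'human_approval';

interface DebugConfigSubset {
  replay: { stubMutatingTools: boolean };
  tools: { [toolName: string]: { riskLevel: RiskLevel; idempotent: boolean } };
  regression: { assertionStrategy: 'decision_based' | 'output_based' };
}

function lintDebugConfig(config: DebugConfigSubset): string[] {
  const warnings: string[] = [];
  const mutating: string[] = [];

  for (const name of Object.keys(config.tools)) {
    const tool = config.tools[name];
    if (!tool || tool.riskLevel === 'read_only') continue;
    mutating.push(name);
    if (!tool.idempotent) {
      warnings.push(`${name} mutates state and is not idempotent; retries can duplicate its effect`);
    }
  }

  if (mutating.length > 0 && !config.replay.stubMutatingTools) {
    warnings.push(`Stubbing is disabled but mutating tools are registered: ${mutating.join(', ')}`);
  }
  if (config.regression.assertionStrategy === 'output_based') {
    warnings.push('output_based assertions depend on exact model text and may break on model upgrades');
  }
  return warnings;
}
```

Applied to the default configuration above, this sketch would warn about send_email and update_ticket, the two mutating tools marked as not idempotent.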

Quick Start Guide

Instrument Capture: Integrate ForensicCaptureService into your agent execution pipeline. Ensure every run captures a TraceSnapshot.
Define Tool Risks: Populate the tool registry with risk levels for all tools. Mark mutating tools appropriately.
Deploy Replay Engine: Implement ReplayOrchestrator in your debugging environment. Configure it to stub mutating tools.
Run Replay: When an incident occurs, load the TraceSnapshot and execute a replay. Analyze the result for the first unsupported decision.
Assert Regression: Convert the incident to a RegressionFixture. Add assertions to your test suite to prevent recurrence.

Production agent debugging requires discipline. By preserving evidence, classifying risks, and replaying safely, you transform chaotic failures into reproducible regressions. The goal is not to make every run deterministic; it is to find the first unsupported decision and ensure it never happens again.
