Debugging Multi-Agent Systems in TypeScript: From Flat Logs to Execution Trees
Visualizing Agent Concurrency: A Tree-Based Approach to Multi-Agent Observability
Current Situation Analysis
Multi-agent architectures introduce a class of concurrency bugs that line-oriented observability tools handle poorly. When a single agent executes, linear logs provide a sufficient audit trail. As soon as multiple agents interact (sharing state, competing for resources, or executing in parallel), the system's control flow stops being a list and becomes a branching structure, typically modeled as a tree or directed acyclic graph (DAG).
Flat logs fail in this environment because they collapse the causal hierarchy into a timestamped sequence. A log entry might show Agent A writing to a database at 14:00:01 and Agent B reading stale data at 14:00:02, but the log does not explicitly encode the dependency or the race condition. Engineers are forced to manually reconstruct the execution tree by correlating timestamps, which is error-prone and unscalable.
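To make the contrast concrete, the sketch below rebuilds a hierarchy from flat records. It assumes each record carries an explicit `parentId`, which plain timestamped logs usually lack. The `TraceEvent` shape and the `buildTree` and `renderTree` helpers are hypothetical names for illustration, not part of any library.

```typescript
// Hypothetical event shape: a flat log record plus an explicit causal parent.
interface TraceEvent {
  id: string;
  parentId: string | null;
  label: string;
}

interface TraceNode {
  event: TraceEvent;
  children: TraceNode[];
}

// Rebuild the hierarchy from parent links instead of inferring it from timestamps.
// Events whose parent is unknown are treated as roots.
function buildTree(events: TraceEvent[]): TraceNode[] {
  const nodes = new Map<string, TraceNode>();
  for (const event of events) nodes.set(event.id, { event, children: [] });
  const roots: TraceNode[] = [];
  for (const event of events) {
    const node = nodes.get(event.id);
    if (!node) continue;
    const parent = event.parentId === null ? undefined : nodes.get(event.parentId);
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}

// Render the tree as indented lines, two spaces per level.
function renderTree(nodes: TraceNode[], depth = 0, lines: string[] = []): string[] {
  for (const node of nodes) {
    lines.push(`${'  '.repeat(depth)}${node.event.label}`);
    renderTree(node.children, depth + 1, lines);
  }
  return lines;
}

const flatEvents: TraceEvent[] = [
  { id: 'run', parentId: null, label: 'pipeline' },
  { id: 'a', parentId: 'run', label: 'agent-a: write account' },
  { id: 'b', parentId: 'run', label: 'agent-b: read account' },
  { id: 'a1', parentId: 'a', label: 'tool: db.update' },
];

const renderedLines = renderTree(buildTree(flatEvents));
console.log(renderedLines.join('\n'));
```

With the parent links in place, the two agents appear as sibling branches under one run, and the database update is nested under the agent that issued it.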
This gap is often overlooked because developers initially treat agents as sequential functions. The complexity emerges only when orchestration patterns like fan-out, retry loops, and resource locking are introduced. Without a visualization of the execution tree, debugging becomes a game of inference rather than inspection. You can see the symptoms (errors, conflicts), but the structural cause (parallel branches touching the same resource, decision loops based on outdated state) remains hidden.
WOW Moment: Key Findings
The shift from flat logging to execution tree tracing fundamentally changes how concurrency failures are diagnosed. By capturing the hierarchical relationship between decisions, tool calls, and parallel branches, you gain immediate visibility into race conditions and coordination failures.
| Dimension | Linear Logging | Execution Tree Tracing |
|---|---|---|
| Concurrency Visibility | Implicit; requires timestamp correlation | Explicit; parallel branches are structural nodes |
| Root Cause Isolation | Manual; scan logs for error keywords | Structural; trace path to failure node |
| State Freshness | Unknown; logs rarely capture state snapshots | Visible; each node can record state context |
| Coordination Gaps | Hidden; orchestrator logic is opaque | Clear; reveals if conflict resolution was bypassed |
| Debug Latency | High; minutes to hours of log parsing | Low; seconds to identify structural anomaly |
Why this matters: Execution trees allow you to distinguish between a tool failure, an LLM reasoning error, and an orchestration bug. In a tree view, a failure inside a parallel branch that prevents the coordinator from reaching a conflict-resolution step is immediately obvious, whereas logs might just show a timeout and a later conflict error with no causal link.
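As a rough illustration of structural root-cause isolation, the sketch below walks a tree to the deepest failed node and labels the failure by node kind. The `ExecNode` shape is hypothetical, and the sketch assumes a parent is marked failed whenever one of its children fails. The classification is a coarse heuristic, not a diagnosis.

```typescript
type NodeKind = 'step' | 'tool' | 'llm';

// Hypothetical node shape for a recorded execution tree.
interface ExecNode {
  label: string;
  kind: NodeKind;
  failed: boolean;
  children: ExecNode[];
}

// Follow failed nodes downward; the deepest one is the likely origin.
function findFailurePath(node: ExecNode): ExecNode[] {
  if (!node.failed) return [];
  for (const child of node.children) {
    const childPath = findFailurePath(child);
    if (childPath.length > 0) return [node, ...childPath];
  }
  return [node];
}

function describeFailure(root: ExecNode): string {
  const failurePath = findFailurePath(root);
  const deepest = failurePath[failurePath.length - 1];
  if (!deepest) return 'no failure recorded';
  const route = failurePath.map((n) => n.label).join(' > ');
  // A failed step with no failed children points at the orchestration itself.
  const cause =
    deepest.kind === 'tool' ? 'tool failure'
    : deepest.kind === 'llm' ? 'LLM call failure'
    : 'orchestration-level failure';
  return `${cause} at ${route}`;
}

const sampleRun: ExecNode = {
  label: 'fraud-mitigation-pipeline', kind: 'step', failed: true,
  children: [
    { label: 'evaluate-risk-profile', kind: 'step', failed: false, children: [] },
    {
      label: 'execute-mitigation-protocol', kind: 'step', failed: true,
      children: [
        { label: 'freeze-user-account', kind: 'tool', failed: true, children: [] },
        { label: 'notify-compliance', kind: 'tool', failed: false, children: [] },
      ],
    },
  ],
};

const failureSummary = describeFailure(sampleRun);
console.log(failureSummary);
```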
Core Solution
The solution involves instrumenting your TypeScript agent flows with a local-first execution tracer. Tools like agent-inspect provide a structured debugging layer that writes trace data to the local filesystem, enabling rapid iteration without the overhead of a hosted observability platform.
The implementation strategy focuses on three layers:
- Session Boundaries: Wrap the top-level orchestration flow to define the scope of a single run.
- Phase Instrumentation: Mark logical steps within agents to capture decision points.
- Action Tagging: Explicitly label tool calls and LLM interactions to separate side effects from computation.
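The three layers map onto a small data model. The sketch below is a hypothetical in-memory tracer written for illustration; it is not the agent-inspect implementation, and `traceRun`, `SpanContext`, and `Span` are invented names. It passes an explicit context object to each callback so that parallel branches attach to the correct parent.

```typescript
type SpanKind = 'run' | 'step' | 'tool';

interface Span {
  label: string;
  kind: SpanKind;
  outcome: 'pending' | 'ok' | 'error';
  children: Span[];
}

// Context handed to each traced callback; child spans attach to `span`.
class SpanContext {
  readonly span: Span;

  constructor(span: Span) {
    this.span = span;
  }

  // Phase instrumentation: a logical step inside the run.
  step<T>(label: string, fn: (ctx: SpanContext) => Promise<T>): Promise<T> {
    return this.child('step', label, fn);
  }

  // Action tagging: a side-effecting call, recorded with its own kind.
  tool<T>(label: string, fn: (ctx: SpanContext) => Promise<T>): Promise<T> {
    return this.child('tool', label, fn);
  }

  private async child<T>(kind: SpanKind, label: string, fn: (ctx: SpanContext) => Promise<T>): Promise<T> {
    const span: Span = { label, kind, outcome: 'pending', children: [] };
    this.span.children.push(span);
    try {
      const result = await fn(new SpanContext(span));
      span.outcome = 'ok';
      return result;
    } catch (error) {
      span.outcome = 'error';
      throw error;
    }
  }
}

// Session boundary: creates the root span and returns it with the result.
async function traceRun<T>(
  label: string,
  fn: (ctx: SpanContext) => Promise<T>,
): Promise<{ root: Span; result: T }> {
  const root: Span = { label, kind: 'run', outcome: 'pending', children: [] };
  try {
    const result = await fn(new SpanContext(root));
    root.outcome = 'ok';
    return { root, result };
  } catch (error) {
    root.outcome = 'error';
    throw error;
  }
}

const demoRun = traceRun('demo-run', async (ctx) => {
  await ctx.step('assess', async () => 'critical');
  return ctx.step('mitigate', (inner) =>
    Promise.all([
      inner.tool('freeze-account', async () => 'frozen'),
      inner.tool('notify-compliance', async () => 'sent'),
    ]),
  );
});
```

Explicit context passing is more verbose than an ambient `step` function, but it keeps the sketch free of runtime-specific context propagation.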
Implementation Strategy
Consider a payment fraud detection system where a RiskAnalyzer agent evaluates transactions, and a MitigationAgent handles account freezes. The system must ensure that if multiple agents attempt to modify the same account, they do not conflict.
1. Orchestrator Instrumentation
Wrap the main entry point with `inspectRun`. This creates the root of the execution tree. Use `step` for logical phases and `step.tool` for operations that interact with external systems or simulate infrastructure changes.
```typescript
import { inspectRun, step } from 'agent-inspect';

// riskAnalyzer, accountService, complianceService, and paymentGateway are
// assumed to be application services already in scope.

interface FraudCase {
  caseId: string;
  amount: number;
  userId: string;
}

async function processFraudCase(fraudCase: FraudCase) {
  return inspectRun(
    'fraud-mitigation-pipeline',
    async () => {
      // Phase 1: Risk Assessment
      const riskEvaluation = await step('evaluate-risk-profile', async () => {
        return riskAnalyzer.assess(fraudCase);
      });

      // Phase 2: Conditional Mitigation
      if (riskEvaluation.severity === 'critical') {
        return step('execute-mitigation-protocol', async () => {
          // Parallel execution of independent actions
          return Promise.all([
            step.tool('freeze-user-account', () =>
              accountService.lock(fraudCase.userId)
            ),
            step.tool('notify-compliance', () =>
              complianceService.alert(fraudCase.caseId)
            ),
            step.tool('block-payment-channel', () =>
              paymentGateway.revoke(fraudCase.userId)
            ),
          ]);
        });
      }

      return { status: 'monitored', reason: 'Risk below threshold' };
    },
    { traceDir: './.agent-traces' }
  );
}
```
2. Agent-Level Granularity
Instrument individual agents to capture internal decision loops. This is critical for identifying when an agent acts on stale state. Use descriptive names that reflect the business logic, not generic labels.
```typescript
// Assumes `step` is imported from 'agent-inspect'. accountRepo, model,
// waitForLockRelease, and executePlan are members defined elsewhere in the class.
class MitigationAgent {
  async handleCriticalCase(caseId: string) {
    return step('mitigation-agent-flow', async () => {
      // Check current state before acting
      const accountState = await step.tool('fetch-account-lock-status', async () => {
        return this.accountRepo.getLockStatus(caseId);
      });

      // Guard against concurrent modifications
      if (accountState.isLockedByAnotherAgent) {
        return step('defer-to-coordinator', async () => {
          await this.waitForLockRelease();
          return this.handleCriticalCase(caseId);
        });
      }

      // LLM decision point
      const actionPlan = await step.llm('determine-mitigation-steps', async () => {
        return this.model.chat({
          messages: [{
            role: 'user',
            content: JSON.stringify({
              task: 'Select mitigation steps based on account state',
              state: accountState,
            }),
          }],
        });
      });

      // Execute selected actions
      return step.tool('apply-mitigation', async () => {
        return this.executePlan(actionPlan.steps);
      });
    });
  }
}
```
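The recursive retry in the `defer-to-coordinator` branch has no upper bound, so a lock that is never released would recurse indefinitely and keep deepening the trace. One way to cap it is sketched below. `attemptWithLimit` and `AttemptResult` are hypothetical names, and the attempt limit is an arbitrary choice.

```typescript
// Outcome of one attempt: either finished, or the resource was held by someone else.
type AttemptResult<T> = { state: 'done'; value: T } | { state: 'busy' };

// Hypothetical helper: repeat `attempt` while it reports 'busy', waiting
// between tries, and give up after `maxAttempts`.
async function attemptWithLimit<T>(
  attempt: (attemptNumber: number) => Promise<AttemptResult<T>>,
  waitForRelease: () => Promise<void>,
  maxAttempts: number,
): Promise<T> {
  for (let attemptNumber = 1; attemptNumber <= maxAttempts; attemptNumber++) {
    const outcome = await attempt(attemptNumber);
    if (outcome.state === 'done') return outcome.value;
    if (attemptNumber < maxAttempts) await waitForRelease();
  }
  throw new Error(`resource still busy after ${maxAttempts} attempts`);
}

// Demo: the resource is busy for the first two checks, then becomes free.
let lockChecks = 0;
const boundedResult = attemptWithLimit<string>(
  async (): Promise<AttemptResult<string>> => {
    lockChecks += 1;
    return lockChecks < 3 ? { state: 'busy' } : { state: 'done', value: 'mitigated' };
  },
  async () => {},
  5,
);
```

A loop also keeps the trace flat: each attempt can be recorded as a sibling step instead of a nested one.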
3. Resolving Coordination Failures
Execution traces often reveal that failures stem from parallel agents acting on shared resources without synchronization. The trace might show two agents attempting to scale a database or freeze an account simultaneously, causing a quorum loss or lock conflict.
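If each branch records the resources it touches, this kind of overlap can be flagged mechanically. The sketch below assumes a hypothetical `resources` field on each parallel branch; whether a given tracer records such a field is not established here, and the resource keys are invented for the example.

```typescript
interface ParallelBranch {
  label: string;
  resources: string[]; // hypothetical: keys of resources the branch reads or writes
}

// Branches under one parallel node are concurrent by construction, so any
// resource listed by two or more of them is a candidate for contention.
function findSharedResources(branches: ParallelBranch[]): Map<string, string[]> {
  const touchedBy = new Map<string, string[]>();
  for (const branch of branches) {
    for (const resource of branch.resources) {
      const labels = touchedBy.get(resource) ?? [];
      if (!labels.includes(branch.label)) labels.push(branch.label);
      touchedBy.set(resource, labels);
    }
  }
  const shared = new Map<string, string[]>();
  touchedBy.forEach((labels, resource) => {
    if (labels.length > 1) shared.set(resource, labels);
  });
  return shared;
}

const mitigationBranches: ParallelBranch[] = [
  { label: 'freeze-user-account', resources: ['account:u-17'] },
  { label: 'notify-compliance', resources: ['case:c-9'] },
  { label: 'block-payment-channel', resources: ['account:u-17', 'gateway'] },
];

const sharedResources = findSharedResources(mitigationBranches);
sharedResources.forEach((labels, resource) => {
  console.log(`${resource} is touched by parallel branches: ${labels.join(', ')}`);
});
```

The check relies on tree structure rather than timestamps: siblings under a parallel node are treated as concurrent regardless of when each one happened to run.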
Fix 1: State Refresh Guards
Ensure agents verify state freshness before committing actions. If a resource is in flux, the agent should wait or re-evaluate.
```typescript
// Assumes `step` is imported from 'agent-inspect' and that resourceClient,
// pollUntilStable, and performAction are defined elsewhere in the module.
async function applyMitigation(plan: ActionPlan) {
  return step('safe-mitigation-execution', async () => {
    // Re-read the resource state immediately before acting on it
    const currentState = await step.tool('verify-resource-state', async () => {
      return resourceClient.getStatus(plan.targetId);
    });

    if (currentState.isModifying) {
      return step('wait-for-stability', async () => {
        await pollUntilStable(plan.targetId);
        return applyMitigation(plan);
      });
    }

    return performAction(plan);
  });
}
```
Fix 2: Distributed Locking
Protect critical sections with locks. The trace should show lock acquisition and release, making contention visible.
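```typescript
// Assumes `step` is imported from 'agent-inspect' and that lockManager and
// accountService are defined elsewhere in the module.
async function freezeAccount(userId: string) {
  return step.tool('secure-account-freeze', async () => {
    const lock = await lockManager.acquire(`account:${userId}`, 30_000);
    try {
      // Await inside try so the lock is held until the freeze completes
      return await accountService.freeze(userId);
    } finally {
      await lock.release();
    }
  });
}
```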
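`lockManager` is not defined in the snippet above. For local experiments, a single-process stand-in with the same `acquire` and `release` shape might look like the sketch below. It is an in-memory mutex, not a distributed lock, so it offers no protection across processes or hosts. Treating the second argument as a maximum wait time is an assumption; the source does not say whether it is a timeout or a lease duration.

```typescript
interface HeldLock {
  release(): Promise<void>;
}

// Hypothetical single-process lock manager: waiters for a key queue up in call order.
class InMemoryLockManager {
  // Per key: a promise that settles once every queued holder has released.
  private tails = new Map<string, Promise<void>>();

  async acquire(key: string, timeoutMs: number): Promise<HeldLock> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    let releaseCurrent: () => void = () => {};
    const current = new Promise<void>((resolve) => { releaseCurrent = resolve; });
    const tail = previous.then(() => current);
    this.tails.set(key, tail);

    // Wait for the previous holder, but no longer than timeoutMs.
    const winner = await new Promise<'acquired' | 'timeout'>((resolve) => {
      const timer = setTimeout(() => resolve('timeout'), timeoutMs);
      previous.then(() => {
        clearTimeout(timer);
        resolve('acquired');
      });
    });
    if (winner === 'timeout') {
      releaseCurrent(); // give up our slot so later waiters are not blocked
      throw new Error(`timed out waiting for lock ${key}`);
    }
    return {
      release: async () => {
        releaseCurrent();
        if (this.tails.get(key) === tail) this.tails.delete(key);
      },
    };
  }
}

// Demo: two agents contend for the same key; their critical sections must not interleave.
const demoLocks = new InMemoryLockManager();
const lockEvents: string[] = [];

async function criticalSection(agent: string): Promise<void> {
  const lock = await demoLocks.acquire('account:u-17', 1_000);
  try {
    lockEvents.push(`${agent}:enter`);
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
    lockEvents.push(`${agent}:exit`);
  } finally {
    await lock.release();
  }
}

const lockDemo = Promise.all([criticalSection('agent-a'), criticalSection('agent-b')]);
```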
Fix 3: Orchestrator Sequencing
The coordinator should analyze resource targets and sequence actions that overlap, rather than blindly parallelizing.
```typescript
async function coordinateRemediation(diagnosis: Diagnosis) {
  return step('orchestrate-remediation', async () => {
    const targets = extractResourceTargets(diagnosis);

    // Sequence actions if they target the same resource
    if (targets.hasOverlap(['account', 'payment'])) {
      return step('sequential-remediation', async () => {
        await step.tool('handle-account', () => accountAgent.resolve(targets.account));
        return step.tool('handle-payment', () => paymentAgent.resolve(targets.payment));
      });
    }

    // Parallelize independent resources
    return Promise.all([
      step.tool('handle-network', () => networkAgent.resolve(targets.network)),
      step.tool('handle-storage', () => storageAgent.resolve(targets.storage)),
    ]);
  });
}
```
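`extractResourceTargets` and `hasOverlap` are left undefined above. One possible way to derive the sequencing decision is sketched here, assuming each planned action declares the resource keys it touches. The `PlannedAction` shape, `planLanes`, and `runLanes` are hypothetical. Actions that share a resource, directly or transitively, land in the same lane and run one after another, while separate lanes can run in parallel.

```typescript
interface PlannedAction {
  label: string;
  resources: string[];
}

interface Lane {
  resources: Set<string>;
  actions: PlannedAction[];
}

// Group actions into lanes of mutually conflicting work.
function planLanes(actions: PlannedAction[]): PlannedAction[][] {
  let lanes: Lane[] = [];
  for (const action of actions) {
    // Every existing lane that shares a resource with this action is merged with it.
    const touching = lanes.filter((lane) => action.resources.some((r) => lane.resources.has(r)));
    const merged: Lane = { resources: new Set(action.resources), actions: [action] };
    for (const lane of touching) {
      lane.resources.forEach((r) => merged.resources.add(r));
      merged.actions.push(...lane.actions);
    }
    lanes = lanes.filter((lane) => !touching.includes(lane));
    lanes.push(merged);
  }
  // Restore the original input order inside each lane.
  return lanes.map((lane) => actions.filter((a) => lane.actions.includes(a)));
}

// Run lanes in parallel; inside a lane, run actions strictly in sequence.
async function runLanes(
  lanes: PlannedAction[][],
  execute: (action: PlannedAction) => Promise<void>,
): Promise<void> {
  await Promise.all(
    lanes.map(async (lane) => {
      for (const action of lane) await execute(action);
    }),
  );
}

const remediationPlan: PlannedAction[] = [
  { label: 'handle-account', resources: ['account:u-17'] },
  { label: 'handle-network', resources: ['network'] },
  { label: 'handle-payment', resources: ['account:u-17', 'payment'] },
  { label: 'handle-storage', resources: ['storage'] },
];

const plannedLanes = planLanes(remediationPlan);
console.log(plannedLanes.map((lane) => lane.map((a) => a.label)));
```

In this example the account and payment actions share `account:u-17`, so they form one sequential lane, and the network and storage actions each get a lane of their own.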
Pitfall Guide
1. The Timestamp Trap
- Explanation: Relying on log timestamps to determine execution order in concurrent systems. Clock skew and asynchronous scheduling make timestamps unreliable for causality.
- Fix: Use execution trees with causal IDs. The tree structure inherently encodes order and parallelism, removing ambiguity.
2. Blind Parallelism on Shared Resources
- Explanation: Orchestrating agents to run in parallel without checking if they modify the same resource. This leads to race conditions, data corruption, or conflicting actions.
- Fix: Implement resource target analysis in the coordinator. Sequence actions that share resources; parallelize only independent ones.
3. Stale State Decisions
- Explanation: Agents making decisions based on state fetched at the start of a long-running flow. By the time the action executes, the state may have changed.
- Fix: Add state refresh guards before critical actions. If the state is inconsistent, retry or escalate.
4. Trace Bloat and Noise
- Explanation: Instrumenting every function call or internal loop, resulting in massive traces that obscure the important structure.
- Fix: Trace boundaries, not internals. Use `step` for logical phases and `step.tool` for side effects. Avoid tracing pure computation or tight loops unless debugging specific performance issues.
5. Missing LLM Context
- Explanation: Tracing tool calls but omitting the prompts or responses from LLM calls. This makes it impossible to understand why an agent made a specific decision.
- Fix: Use `step.llm` to capture prompt summaries and response structures. Ensure sensitive data is redacted but the reasoning path is preserved.
6. Orchestration Bypass
- Explanation: Agents calling tools directly without going through the coordinator, bypassing conflict resolution and locking mechanisms.
- Fix: Enforce entry points. Agents should expose high-level methods that the coordinator invokes, rather than allowing direct tool access.
7. Ignoring Trace Retention
- Explanation: Local traces accumulate and consume disk space, or are lost after a crash, hindering post-mortem analysis.
- Fix: Configure retention policies. Rotate traces based on age or count. For production, consider streaming traces to a centralized store.
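For the retention pitfall, the selection logic can be kept separate from file deletion so it is easy to test. The sketch below is one possible policy under assumed inputs. `TraceFileInfo` and `RetentionLimits` are hypothetical shapes, and the caller is expected to supply the metadata and perform the actual removal.

```typescript
interface TraceFileInfo {
  runId: string;
  createdAtMs: number;
  sizeBytes: number;
}

interface RetentionLimits {
  maxAgeDays: number;
  maxTotalBytes: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Decide which traces to delete: everything past the age limit, plus the
// oldest remaining traces once the size budget is used up.
function selectTracesToPrune(files: TraceFileInfo[], limits: RetentionLimits, nowMs: number): string[] {
  const newestFirst = [...files].sort((a, b) => b.createdAtMs - a.createdAtMs);
  const pruned: string[] = [];
  let keptBytes = 0;
  let budgetExhausted = false;
  for (const file of newestFirst) {
    const expired = nowMs - file.createdAtMs > limits.maxAgeDays * DAY_MS;
    const fits = keptBytes + file.sizeBytes <= limits.maxTotalBytes;
    if (!expired && !budgetExhausted && fits) {
      keptBytes += file.sizeBytes;
      continue;
    }
    // Once one trace does not fit, all older traces are pruned as well.
    if (!expired) budgetExhausted = true;
    pruned.push(file.runId);
  }
  return pruned;
}

const retentionNow = 10 * DAY_MS;
const pruneList = selectTracesToPrune(
  [
    { runId: 'r-old', createdAtMs: 2 * DAY_MS, sizeBytes: 50 },
    { runId: 'r-mid', createdAtMs: 7 * DAY_MS, sizeBytes: 300 },
    { runId: 'r-new', createdAtMs: 9 * DAY_MS, sizeBytes: 300 },
  ],
  { maxAgeDays: 7, maxTotalBytes: 500 },
  retentionNow,
);
console.log(pruneList);
```

Here `r-new` is kept, `r-mid` is pruned because it would exceed the size budget, and `r-old` is pruned because it is older than seven days.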
Production Bundle
Action Checklist
- Define Trace Boundaries: Wrap all agent entry points and orchestration flows with `inspectRun` or equivalent session wrappers.
- Instrument Tool Calls: Tag every external interaction (API calls, DB writes, file ops) with `step.tool` to capture side effects.
- Capture LLM Decisions: Use `step.llm` to record prompts and responses for all model interactions.
- Implement Resource Locking: Add distributed locks for any resource accessed by multiple agents.
- Add State Guards: Insert state verification steps before critical actions to prevent stale data usage.
- Configure Local Output: Set up local trace directories for development to enable fast iteration.
- Review Failed Runs: Establish a workflow to inspect execution trees for every failed agent run.
- Redact Sensitive Data: Ensure prompts and tool inputs are sanitized before writing to traces.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single Agent Flow | Structured Logging | Low overhead; linear execution is easy to follow with logs. | Minimal |
| Multi-Agent Development | Local Execution Tree | Fast feedback loop; no infrastructure setup; reveals concurrency bugs. | Low (Disk I/O) |
| Multi-Agent Production | Distributed Tracing | Centralized storage; long-term retention; cross-service correlation. | Moderate (Network/Storage) |
| High-Frequency Agents | Sampled Tracing | Reduces overhead while maintaining statistical visibility. | Low |
| Compliance Auditing | Immutable Trace Store | Provides tamper-evident records of agent decisions and actions. | High (Storage/Integrity) |
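For the sampled-tracing row, one common approach is to make the sampling decision a deterministic function of the run ID, so every component that sees the same run makes the same choice. The sketch below uses an FNV-1a style hash as an illustrative choice; `shouldTrace` is a hypothetical helper, not a tracer API.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units: small, dependency-free,
// and stable across processes. Matches byte-wise FNV-1a for ASCII input.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Keep a run when its hash, scaled to [0, 1), falls below the sampling rate.
function shouldTrace(runId: string, rate: number): boolean {
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  return fnv1a(runId) / 0x100000000 < rate;
}

const sampledRunIds = ['run-1', 'run-2', 'run-3', 'run-4'].filter((id) => shouldTrace(id, 0.5));
console.log(sampledRunIds);
```

Because the decision depends only on the run ID and the rate, a retried or resumed run is either always traced or never traced, which avoids partial trees.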
Configuration Template
Use this configuration to set up local tracing with retention and redaction policies.
```json
{
  "tracer": {
    "enabled": true,
    "mode": "local",
    "outputDirectory": "./.agent-traces",
    "retention": {
      "maxAgeDays": 7,
      "maxSizeMB": 500
    },
    "redaction": {
      "enabled": true,
      "patterns": [
        "apiKey",
        "token",
        "password",
        "creditCard"
      ]
    },
    "sampling": {
      "enabled": false,
      "rate": 1.0
    }
  }
}
```
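The template does not say how the redaction `patterns` are matched. The sketch below shows one plausible interpretation, labeled as an assumption: each pattern is compared case-insensitively against object key names, and the value of any matching key is replaced. `redact` and `JsonValue` are hypothetical names.

```typescript
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };

// Replace the value of any key whose name contains one of the patterns
// (case-insensitive). Returns a new structure; the input is left untouched.
function redact(value: JsonValue, patterns: string[]): JsonValue {
  if (Array.isArray(value)) return value.map((item) => redact(item, patterns));
  if (value === null || typeof value !== 'object') return value;
  const lowered = patterns.map((p) => p.toLowerCase());
  const cleaned: { [key: string]: JsonValue } = {};
  for (const key of Object.keys(value)) {
    const child = value[key];
    if (child === undefined) continue;
    const sensitive = lowered.some((p) => key.toLowerCase().includes(p));
    cleaned[key] = sensitive ? '[REDACTED]' : redact(child, patterns);
  }
  return cleaned;
}

const rawToolInput: JsonValue = {
  userId: 'u-17',
  apiKey: 'example-key',
  payment: { creditCardNumber: '4111111111111111', amount: 250 },
  headers: [{ authToken: 'abc' }],
};

const safeToolInput = redact(rawToolInput, ['apiKey', 'token', 'password', 'creditCard']);
console.log(JSON.stringify(safeToolInput));
```

Key-based matching will miss secrets embedded in free-text values such as prompts, so it is best treated as a first line of defense rather than a complete redaction strategy.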
Quick Start Guide
- Install the Tracer: Add `agent-inspect` to your project dependencies.

  ```bash
  npm install agent-inspect
  ```
- Wrap Your Flow: Import `inspectRun` and `step`, then wrap your main agent orchestration function.

  ```typescript
  import { inspectRun, step } from 'agent-inspect';

  async function main() {
    return inspectRun('my-agent-run', async () => {
      // Your agent logic here
    });
  }
  ```
- Run and Inspect: Execute your application. After the run, list and view the trace using the CLI.

  ```bash
  npx agent-inspect list
  npx agent-inspect view <run-id>
  ```
- Analyze the Tree: Look for failed nodes, parallel branches, and resource contention. Use the tree structure to identify the root cause of coordination failures.