our trace viewer.
Step 2: Define the Middleware Interface
Instead of wrapping the entire LLM call, we intercept tool execution at the dispatch boundary. This requires a lightweight middleware contract that fires before invocation, after completion, and on failure.
```ts
import { Span } from "@opentelemetry/api";

export interface ToolExecutionContext {
  toolName: string;
  serverId: string;
  callId: string;
  inputPayload: unknown;
  traceContext?: Span;   // parent span handed down from the agent run
  _spanStart?: number;   // set by the tracing middleware in preInvoke
}

export interface ToolMiddleware {
  name: string;
  priority: number;      // higher runs first
  preInvoke: (ctx: ToolExecutionContext) => Promise<ToolExecutionContext>;
  postInvoke: (ctx: ToolExecutionContext, output: unknown) => Promise<void>;
  onError: (ctx: ToolExecutionContext, error: Error) => Promise<void>;
}
```
Why this choice: Decoupling instrumentation from the core agent logic allows multiple middlewares (tracing, rate limiting, caching) to compose cleanly. Explicit context objects prevent implicit state leaks and make testing deterministic.
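To make the composition concrete, here is a minimal dispatcher sketch. The `executeTool` helper and its `invoke` parameter are illustrative names, not part of the contract above:

```ts
import { ToolMiddleware, ToolExecutionContext } from "./middleware-types";

// Runs every registered middleware around a single tool invocation.
export async function executeTool(
  middlewares: ToolMiddleware[],
  ctx: ToolExecutionContext,
  invoke: (ctx: ToolExecutionContext) => Promise<unknown>
): Promise<unknown> {
  // Higher priority runs first; sort a copy so the registry stays untouched.
  const ordered = [...middlewares].sort((a, b) => b.priority - a.priority);
  for (const mw of ordered) {
    ctx = await mw.preInvoke(ctx); // each middleware may enrich the context
  }
  try {
    const output = await invoke(ctx);
    for (const mw of ordered) await mw.postInvoke(ctx, output);
    return output;
  } catch (err) {
    for (const mw of ordered) await mw.onError(ctx, err as Error);
    throw err;
  }
}
```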
Step 3: Implement the Tracing Middleware
The tracing middleware manages span creation, latency calculation, and attribute attachment. It uses a Map to track active spans by call ID, ensuring concurrent tool calls don't interfere with each other.
```ts
import { context, trace, Span } from "@opentelemetry/api";
import { agentTracer, SpanStatusCode } from "./otel-setup";
import { ToolMiddleware, ToolExecutionContext } from "./middleware-types";

// Track active spans by call ID so concurrent tool calls don't interfere.
const activeToolSpans = new Map<string, Span>();

export const toolTracingMiddleware: ToolMiddleware = {
  name: "mcp-tool-tracer",
  priority: 100,

  async preInvoke(ctx: ToolExecutionContext) {
    // Bind to the parent span when one was handed down; otherwise use the active context.
    const parentCtx = ctx.traceContext
      ? trace.setSpan(context.active(), ctx.traceContext)
      : context.active();
    const span = agentTracer.startSpan(
      `mcp.tool.${ctx.toolName}`,
      {
        attributes: {
          "mcp.tool.name": ctx.toolName,
          "mcp.server.id": ctx.serverId,
          "mcp.call.id": ctx.callId,
          "mcp.input.size_bytes": Buffer.byteLength(JSON.stringify(ctx.inputPayload)),
        },
      },
      parentCtx
    );
    activeToolSpans.set(ctx.callId, span);
    return { ...ctx, _spanStart: performance.now() };
  },

  async postInvoke(ctx: ToolExecutionContext, output: unknown) {
    const span = activeToolSpans.get(ctx.callId);
    if (!span) return;
    const latency = performance.now() - (ctx._spanStart ?? performance.now());
    span.setAttributes({
      "mcp.latency_ms": Math.round(latency),
      "mcp.output.size_bytes": Buffer.byteLength(JSON.stringify(output)),
      "mcp.success": true,
    });
    span.setStatus({ code: SpanStatusCode.OK });
    span.end();
    activeToolSpans.delete(ctx.callId);
  },

  async onError(ctx: ToolExecutionContext, error: Error) {
    const span = activeToolSpans.get(ctx.callId);
    if (!span) return;
    const latency = performance.now() - (ctx._spanStart ?? performance.now());
    const errorClass = classifyToolError(error);
    const isRetryable = errorClass !== "malformed_output";
    span.setAttributes({
      "mcp.latency_ms": Math.round(latency),
      "mcp.error.class": errorClass,
      "mcp.error.retryable": isRetryable,
      "mcp.error.message": error.message,
    });
    span.setStatus({ code: SpanStatusCode.ERROR, message: `${errorClass}: ${error.message}` });
    span.end();
    activeToolSpans.delete(ctx.callId);
  },
};

export function classifyToolError(err: Error): string {
  const msg = err.message.toLowerCase();
  if (msg.includes("timeout") || msg.includes("etimedout")) return "timeout";
  if (msg.includes("rate_limit") || msg.includes("429")) return "rate_limit";
  if (msg.includes("schema") || msg.includes("json") || msg.includes("syntax")) return "malformed_output";
  return "unknown";
}
```
Why this choice:
- performance.now() provides sub-millisecond precision for latency tracking.
- Explicit span cleanup in both success and error paths prevents memory leaks.
- Error classification drives downstream retry logic and alerting thresholds.
- Attribute naming follows OpenTelemetry semantic conventions for consistency across trace backends.
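As a sketch of the retry logic this classification enables (the module path ./tracing-middleware is an assumed file name, and the backoff constants are arbitrary):

```ts
import { classifyToolError } from "./tracing-middleware"; // assumed module path

// Retries transient failures; malformed_output is permanent, so retrying
// it only burns tokens (see Pitfall 3 below).
export async function invokeWithRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: Error | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err as Error;
      if (classifyToolError(lastError) === "malformed_output") throw lastError;
      // Exponential backoff: 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 250));
    }
  }
  throw lastError;
}
```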
Step 4: Propagate Trace Context Across Agent Runs
Each agent execution should share a single root trace ID. The root span wraps the entire generate() call, and all tool spans become children. Context must be explicitly bound to avoid cross-request contamination.
```ts
import { context, trace } from "@opentelemetry/api";
import { agentTracer, SpanStatusCode } from "./otel-setup";

export async function executeAgentRun(prompt: string, runId: string, agentInstance: any) {
  const rootSpan = agentTracer.startSpan("agent.execution", {
    attributes: {
      "agent.run_id": runId,
      "agent.prompt_length": prompt.length,
    },
  });
  const executionCtx = trace.setSpan(context.active(), rootSpan);
  try {
    // context.with() returns the callback's result, so every span started
    // inside generate() inherits rootSpan as its parent.
    const result = await context.with(executionCtx, () =>
      agentInstance.generate({ input: { text: prompt } })
    );
    rootSpan.setAttributes({
      "agent.tool_calls_count": result?.toolCalls?.length ?? 0,
      "agent.total_tokens": result?.usage?.totalTokens ?? 0,
      "agent.success": true,
    });
    rootSpan.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    rootSpan.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
    throw err;
  } finally {
    rootSpan.end(); // always close the root span, even if generate() throws
  }
}
```
Why this choice: context.with() ensures all middleware invocations during the generate() call inherit the correct parent span. This produces a clean flame graph in trace viewers where the root span contains the LLM call, which contains all tool spans. The slow tool becomes visually obvious.
Pitfall Guide
1. Context Leakage Across Concurrent Requests
Explanation: Forgetting to bind spans to the correct execution context causes traces from different user requests to merge, creating impossible parent-child relationships.
Fix: Always use context.with() or AsyncLocalStorage when invoking agent runs. Pass trace context explicitly to middleware instead of relying on implicit globals.
2. Span Memory Leaks on Timeout
Explanation: If a tool hangs and the executor kills the promise, the postInvoke or onError handler may never fire, leaving the span in the active map indefinitely.
Fix: Wrap tool execution in a try/finally block. Always delete the span from the map in the finally clause, and call span.end() if it hasn't been closed.
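A minimal sketch of that fix, assuming the tracing middleware lives in ./tracing-middleware and exports its activeToolSpans map (the export is an assumption; Step 3 keeps the map module-private):

```ts
import { ToolExecutionContext } from "./middleware-types";
import { activeToolSpans } from "./tracing-middleware"; // assumed export

// Guarantees span cleanup even when the tool promise is abandoned by a
// timeout race and neither postInvoke nor onError ever fires.
export async function invokeTracked(
  ctx: ToolExecutionContext,
  run: () => Promise<unknown> // the full middleware-wrapped tool call
): Promise<unknown> {
  try {
    return await run();
  } finally {
    const orphan = activeToolSpans.get(ctx.callId);
    if (orphan) {
      orphan.end(); // close the leaked span
      activeToolSpans.delete(ctx.callId);
    }
  }
}
```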
3. Misclassifying Schema Errors as Transient
Explanation: Treating malformed tool output as retryable causes infinite retry loops. The LLM will repeatedly call the tool, consuming tokens and delaying responses.
Fix: Explicitly tag schema/validation errors as malformed_output with retryable: false. Route these to alerting channels for immediate developer review.
4. Ignoring Output Size Correlation
Explanation: Large payloads cause downstream timeouts, serialization overhead, and context window exhaustion. Without tracking output size, you cannot correlate latency spikes with data volume.
Fix: Record mcp.output.size_bytes on every span. Set up dashboards that alert when output size exceeds the 95th percentile for a given tool.
5. Hardcoding Global Timeouts
Explanation: Applying a single timeout to all tools wastes budget on fast operations and starves slow ones. A 5-second global timeout kills a web search that legitimately needs 10 seconds, yet is far too generous for a database query that should return in 3.
Fix: Implement per-tool latency budgets. Register tools with explicit timeoutMs values and enforce them at the executor level. Adjust budgets based on historical p95 latency.
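A sketch of executor-level enforcement, using getToolTimeout from the Configuration Template below; the ToolTimeoutError class and enforceToolBudget helper are illustrative names:

```ts
import { getToolTimeout } from "./tool-timeout-config";

export class ToolTimeoutError extends Error {
  constructor(toolName: string, budgetMs: number) {
    // "timeout" in the message lets classifyToolError tag this as retryable.
    super(`timeout: ${toolName} exceeded its ${budgetMs}ms budget`);
  }
}

// Race the tool call against its per-tool budget instead of a global timeout.
export function enforceToolBudget<T>(toolName: string, call: Promise<T>): Promise<T> {
  const budgetMs = getToolTimeout(toolName);
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new ToolTimeoutError(toolName, budgetMs)), budgetMs);
  });
  return Promise.race([call, timeout]).finally(() => clearTimeout(timer));
}
```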
6. Missing Cost Attribution
Explanation: Only tracking LLM tokens ignores paid API calls inside tools. In data-heavy workflows, tool costs often exceed model costs by 3–5x.
Fix: Attach mcp.estimated_cost_usd attributes using an internal pricing table. Aggregate costs per run and per tool to identify expensive workflows.
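A sketch of the pricing-table approach; the table values below are placeholders, not real vendor rates:

```ts
import { Span } from "@opentelemetry/api";

// Hypothetical internal pricing table: flat per-invocation cost in USD.
const TOOL_COST_USD: Record<string, number> = {
  "web.search": 0.005,
  "github.search_code": 0.002,
};

// Call from postInvoke to attach cost to the active tool span.
export function recordToolCost(span: Span, toolName: string): void {
  const cost = TOOL_COST_USD[toolName];
  if (cost !== undefined) {
    span.setAttribute("mcp.estimated_cost_usd", cost);
  }
}
```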
7. Over-Instrumenting Internal Functions
Explanation: Wrapping every utility function, cache lookup, or string formatter creates noise and degrades performance. Tracing should focus on external boundaries.
Fix: Scope instrumentation strictly to MCP tool dispatch boundaries. Use sampling or disable tracing for internal helper functions.
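For the sampling route, the OTEL SDK supports it directly; a sketch with an arbitrary 10% ratio:

```ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Sample 10% of root traces; child spans follow their parent's decision,
// so a sampled agent run keeps all of its tool spans intact.
export const sampledSdk = new NodeSDK({
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
  serviceName: "mcp-agent-runtime",
});
```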
Production Bundle
Action Checklist
- Initialize the OTEL SDK at startup with your OTLP endpoint and service name
- Instrument tool dispatch with preInvoke / postInvoke / onError middleware
- Record latency, payload size, and error class on every tool span
- Wrap each agent run in context.with() under a single root span
- Enforce per-tool latency budgets at the executor level
- Mark malformed_output errors as non-retryable and route them to alerting
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume search tools | Per-tool timeout + output size tracking | Prevents context window exhaustion and reduces token waste | Low (infrastructure tuning) |
| Paid external APIs | Cost attribution + rate_limit classification | Enables budget alerts and automatic backoff strategies | Medium (API spend optimization) |
| Internal database queries | Latency budget + malformed_output detection | Catches schema drift and missing indexes before they cascade | Low (DB optimization) |
| Multi-step agent chains | Root span propagation + tool call counting | Provides end-to-end visibility without nested trace fragmentation | Low (observability overhead) |
| Production debugging | Granular error classification + retry flags | Reduces mean time to resolution from hours to minutes | High (engineering efficiency) |
Configuration Template
```ts
// otel-setup.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { trace, SpanStatusCode, context } from "@opentelemetry/api";

export const otelSdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
  }),
  serviceName: "mcp-agent-runtime",
});
otelSdk.start();

export const agentTracer = trace.getTracer("mcp-tool-orchestrator", "2.1.0");
export { SpanStatusCode, context };
```
```ts
// tool-timeout-config.ts
export const TOOL_LATENCY_BUDGETS: Record<string, number> = {
  "github.search_code": 8_000,
  "github.read_file": 5_000,
  "database.query": 3_000,
  "web.search": 10_000,
  "code.analyzer": 12_000,
};

export function getToolTimeout(toolName: string): number {
  return TOOL_LATENCY_BUDGETS[toolName] ?? 15_000;
}
```
Quick Start Guide
- Install dependencies: npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-http
- Initialize the OTEL SDK at application startup with your OTLP endpoint and service name
- Define a middleware interface with preInvoke, postInvoke, and onError hooks
- Implement the tracing middleware to start spans, attach attributes, and classify errors
- Wrap agent execution with context.with() to propagate trace IDs and generate clean flame graphs
Deploy to a staging environment, trigger a multi-step agent run, and verify that your trace backend displays a root span containing individual tool spans with latency, size, and error attributes. Adjust per-tool budgets based on p95 latency metrics, and route malformed_output errors to your incident management system.