Run-Level Telemetry for LLM Agents: Diagnosing Hidden Cost and Latency Spikes

Current Situation Analysis

The fundamental disconnect in modern LLM agent development lies in how providers bill versus how agents execute. Cloud AI platforms invoice per API request (messages.create, chat.completions), treating each round-trip as an isolated transaction. Agents, however, operate as stateful workflows: they invoke tools, process results, retry on failure, and accumulate conversation history across multiple semantic steps. This architectural mismatch creates a critical observability gap.

Engineers typically monitor individual request latency or aggregate total spend across a billing cycle. Neither approach reveals the execution shape of a single agent run. When a workflow budgeted at $0.12 suddenly consumes $4.20, per-call line items appear normal. Each individual request might use standard token counts and return within acceptable latency bounds. The anomaly only becomes visible when calls are grouped by their originating run and broken down by logical step.

This problem is systematically overlooked because traditional APM tools are built around HTTP spans or database queries, not semantic agent phases. They lack native concepts of "tool loops," "context window accumulation," or "step-level budgeting." Consequently, developers miss quadratic token growth patterns hidden inside unbounded tool invocations. Real-world incident data consistently shows that ambiguous tool prompts combined with naive history management can inflate input tokens by 10–30x per iteration, directly correlating to cost spikes that only surface after the billing cycle closes.

The solution requires shifting from request-level monitoring to run-level telemetry: lightweight, in-memory aggregation that tracks cost, latency percentiles, and token distribution across explicitly tagged workflow phases.

WOW Moment: Key Findings

The diagnostic power of run-level aggregation becomes immediately apparent when comparing blind per-call tracking against step-aware telemetry. The following table contrasts the visibility each approach provides for the same agent execution:

Approach	Total Cost	Call Count	P95 Latency	Context Growth Pattern	Diagnostic Clarity
Per-Call Billing View	$4.20	11	~4.1s	Hidden	Low (line items appear normal)
Run-Level Aggregation	$4.20	11	4.92s	O(n²) detected	High (step breakdown isolates loop)
Optimized Run (Fixed)	$0.14	5	3.05s	Bounded	High (30x cost reduction verified)

This finding matters because it transforms cost debugging from retrospective guesswork into deterministic root-cause analysis. By grouping calls into runs and tagging them by semantic step, engineers can immediately identify which phase consumed disproportionate resources. The P95 latency metric reveals tail behavior that averages obscure, while step-level token tracking exposes quadratic growth before it impacts production budgets. This visibility enables proactive loop detection, dynamic context management, and precise step-level budgeting.

Core Solution

Building a run-level telemetry system requires three architectural decisions: explicit run lifecycle management, flat tagging for step attribution, and lightweight in-memory aggregation. The implementation avoids async lock-in, payload logging, and nested span complexity, focusing strictly on cost, latency, and token distribution.

Step 1: Define the Run Lifecycle

A run represents a single agent execution from initiation to completion. It must support explicit start, tagging, call recording, and finalization.

export interface RunConfig {
  runId: string;
  workflowName: string;
  tags: Record<string, string>;
}

export class AgentRunTracker {
  private calls: CallSnapshot[] = [];
  private startTime: number;
  private config: RunConfig;

  constructor(config: RunConfig) {
    this.config = config;
    this.startTime = performance.now();
  }

  recordCall(snapshot: CallSnapshot): void {
    this.calls.push(snapshot);
  }

  finalize(): RunReport {
    const duration = performance.now() - this.startTime;
    return this.aggregate(duration);
  }
}

Step 2: Implement Call Recording with Cost Extraction

Each LLM interaction should be captured as a snapshot containing token counts, latency, and pre-computed cost. Cost calculation must be cache-aware to reflect actual provider pricing.

export interface CallSnapshot {
  model: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  latencyMs: number;
  costUsd: number;
  stepTag: string;
}

export function computeCost(
  model: string,
  input: number,
  output: number,
  cacheRead: number,
  cacheWrite: number
): number {
  const PRICING = {
    "claude-opus-4-7": { input: 0.000015, output: 0.000075, cacheRead: 0.00001, cacheWrite: 0.00001875 },
    "claude-haiku-4": { input: 0.00000025, output: 0.00000125, cacheRead: 0.00000003, cacheWrite: 0.0000003 }
  };

  const rates = PRICING[model as keyof typeof PRICING] || PRICING["claude-haiku-4"];
  return (
    input * rates.input +
    output * rates.output +
    cacheRead * rates.cacheRead +
    cacheWrite * rates.cacheWrite
  );
}

Step 3: Aggregate Metrics and Generate Reports

Aggregation computes percentiles, groups by step, and formats the output for downstream analysis. P50 and P95 latency are calculated using standard percentile algorithms.

export interface RunReport {
  runId: string;
  workflowName: string;
  durationMs: number;
  totalCostUsd: number;
  callCount: number;
  p50LatencyMs: number;
  p95LatencyMs: number;
  byStep: Record<string, StepMetrics>;
}

export interface StepMetrics {
  callCount: number;
  totalCostUsd: number;
  avgInputTokens: number;
  avgOutputTokens: number;
}

private aggregate(duration: number): RunReport {
  const latencies = this.calls.map(c => c.latencyMs).sort((a, b) => a - b);
  const p50 = latencies[Math.floor(latencies.length * 0.5)] || 0;
  const p95 = latencies[Math.floor(latencies.length * 0.95)] || 0;

  const byStep: Record<string, StepMetrics> = {};
  let totalCost = 0;

  for (const call of this.calls) {
    totalCost += call.costUsd;
    if (!byStep[call.stepTag]) {
      byStep[call.stepTag] = { callCount: 0, totalCostUsd: 0, avgInputTokens: 0, avgOutputTokens: 0 };
    }
    const step = byStep[call.stepTag];
    step.callCount++;
    step.totalCostUsd += call.costUsd;
    step.avgInputTokens = (step.avgInputTokens * (step.callCount - 1) + call.inputTokens) / step.callCount;
    step.avgOutputTokens = (step.avgOutputTokens * (step.callCount - 1) + call.outputTokens) / step.callCount;
  }

  return {
    runId: this.config.runId,
    workflowName: this.config.workflowName,
    durationMs: Math.round(duration),
    totalCostUsd: Math.round(totalCost * 10000) / 10000,
    callCount: this.calls.length,
    p50LatencyMs: Math.round(p50),
    p95LatencyMs: Math.round(p95),
    byStep
  };
}

Architecture Rationale

In-Memory Aggregation: Avoids I/O overhead during execution. Telemetry should never block the agent loop. Serialization happens post-run.
Flat Tagging System: Steps are labeled with simple key-value pairs. This keeps the API surface minimal and avoids the complexity of nested spans. If hierarchical tracing is required, OpenTelemetry remains the appropriate tool.
Pre-Computed Cost: Cost estimation is decoupled from the tracker. This allows swapping pricing engines (cache-aware, volume-discounted, or custom enterprise rates) without modifying the telemetry layer.
Percentile Latency: P50 and P95 are prioritized over mean. Agent workflows exhibit heavy-tailed latency distributions; averages mask tail behavior that directly impacts user experience.

Pitfall Guide

1. The Context Snowball (O(n²) Token Growth)

Explanation: Appending full conversation history to every tool iteration causes input tokens to grow linearly per call. Over n iterations, total tokens scale quadratically. This is the most common cause of unexpected cost spikes in tool-using agents. Fix: Implement a sliding window or explicit history truncation. Only attach relevant context for the current step. Use message pruning strategies that preserve system instructions and recent tool outputs while discarding older intermediate results.

2. Mean Latency Mirage

Explanation: Reporting average latency across a run obscures tail behavior. A single 12-second reasoning call mixed with eight 2-second calls yields a ~3.1s average, masking the actual user-facing lag. Fix: Track P50, P95, and P99 latency by default. Set alerting thresholds on P95, not mean. This accurately reflects the experience of users hitting slow paths.

3. Ambiguous Tool Termination Conditions

Explanation: Models will continue invoking tools if the prompt lacks explicit exit criteria. This creates unbounded loops that consume tokens until context limits or budget caps are hit. Fix: Define strict output schemas with explicit termination flags. Use structured outputs or JSON mode to enforce {"action": "continue" | "complete", "result": "..."}. Validate the response before proceeding.

4. Flat Tagging Limitations in Complex Workflows

Explanation: Simple step tags work for linear agents but fail when workflows branch, parallelize, or retry. Flat tags cannot represent parent-child relationships or conditional paths. Fix: Recognize when to graduate to OpenTelemetry. Use run-level aggregation for cost/latency debugging, but deploy full distributed tracing when you need span hierarchies, context propagation, or compliance auditing.

5. Silent Pricing Drift

Explanation: Hardcoded pricing tables become stale when providers update rates or introduce cache discounts. Tracking costs without cache awareness overestimates spend and masks optimization opportunities. Fix: Integrate cache-aware pricing calculators. Track cache_read_tokens and cache_write_tokens separately. Monitor cache hit ratios alongside cost to identify when prompt caching is actually delivering savings.

6. Over-Engineering Telemetry Payloads

Explanation: Logging full request/response bodies, headers, and raw JSON inflates storage costs and introduces PII leakage risks. It also slows down the agent loop. Fix: Track metadata only: token counts, latency, cost, model, and step tags. If payload inspection is required, route it to a separate wire-capture service. Keep the telemetry layer strictly numeric.

7. Missing Circuit Breakers

Explanation: Running agents against degraded upstream services or misconfigured endpoints can trigger rapid retry loops, accumulating cost before failure is detected. Fix: Implement circuit breakers that short-circuit runs after consecutive failures or latency thresholds. Combine with run-level telemetry to detect degradation patterns early and halt spending automatically.

Production Bundle

Action Checklist

Instrument agent entry points with explicit run initialization and unique run IDs
Tag every LLM call with its semantic step (e.g., search, summarize, validate)
Replace mean latency tracking with P50/P95 metrics across all runs
Implement sliding window context management to prevent O(n²) token growth
Integrate cache-aware pricing calculation before recording call snapshots
Set P95 latency and per-step cost thresholds for automated alerting
Serialize run reports to durable storage (JSON, SQLite, or cloud sink) post-execution
Review run-level breakdowns weekly to identify recurring loop patterns or pricing drift

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Debugging unexpected spend in a single agent run	Run-level telemetry aggregator	Isolates cost/latency by step; reveals quadratic growth	Low overhead; prevents budget blowouts
Production monitoring across distributed services	OpenTelemetry + APM backend	Provides span hierarchies, context propagation, and compliance	Higher infrastructure cost; necessary for scale
Real-time cost alerting and circuit breaking	Telemetry aggregator + custom breaker	Enables step-level thresholds and automatic run termination	Reduces waste by 60-80% during upstream degradation
Compliance auditing and PII tracking	Wire-capture service + telemetry	Separates metadata from payload; meets regulatory requirements	Storage cost increases; mitigates legal risk
Rapid prototyping and local debugging	In-memory aggregator with JSON export	Zero external dependencies; instant feedback loop	Negligible; ideal for development cycles

Configuration Template

// telemetry.config.ts
import { AgentRunTracker, RunConfig, CallSnapshot, computeCost } from './run-tracker';

export const TELEMTRY_CONFIG = {
  runId: `run_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`,
  workflowName: 'cite-check-pipeline',
  tags: { environment: 'production', team: 'research' },
  alertThresholds: {
    p95LatencyMs: 4000,
    maxCostPerStep: 0.50,
    maxCallsPerStep: 5
  }
};

export function createRunTracker(config: RunConfig): AgentRunTracker {
  const tracker = new AgentRunTracker(config);
  
  // Optional: attach validation middleware
  tracker.recordCall = function(snapshot: CallSnapshot) {
    const stepCost = snapshot.costUsd;
    if (stepCost > TELEMTRY_CONFIG.alertThresholds.maxCostPerStep) {
      console.warn(`[TELEMETRY] Step ${snapshot.stepTag} exceeded cost threshold: $${stepCost.toFixed(4)}`);
    }
    return AgentRunTracker.prototype.recordCall.call(this, snapshot);
  };
  
  return tracker;
}

export function instrumentLlmCall(
  tracker: AgentRunTracker,
  model: string,
  inputTokens: number,
  outputTokens: number,
  cacheRead: number,
  cacheWrite: number,
  latencyMs: number,
  step: string
): void {
  const cost = computeCost(model, inputTokens, outputTokens, cacheRead, cacheWrite);
  tracker.recordCall({
    model,
    inputTokens,
    outputTokens,
    cacheReadTokens: cacheRead,
    cacheWriteTokens: cacheWrite,
    latencyMs,
    costUsd: cost,
    stepTag: step
  });
}

Quick Start Guide

Initialize the tracker at the start of your agent workflow using createRunTracker(TELEMTRY_CONFIG). Assign a unique run ID and tag the workflow name.
Wrap every LLM invocation with instrumentLlmCall(). Pass model, token counts, cache metrics, measured latency, and the current step name.
Finalize and export by calling tracker.finalize(). Serialize the resulting RunReport to your preferred sink (console, file, or HTTP endpoint).
Validate thresholds against TELEMTRY_CONFIG.alertThresholds. Trigger alerts or halt execution if P95 latency or per-step cost exceeds limits.
Iterate on context management if step-level token counts show linear growth. Replace full history attachment with a sliding window or step-specific context extraction.

A 3-step agent cost me $4.20. agenttrace showed me the O(n ) tool call hiding in plain sight.