A 3-step agent cost me $4.20. agenttrace showed me the O(n²) tool call hiding in plain sight.
Run-Level Telemetry for LLM Agents: Diagnosing Hidden Cost and Latency Spikes
Current Situation Analysis
The fundamental disconnect in modern LLM agent development lies in how providers bill versus how agents execute. Cloud AI platforms invoice per API request (messages.create, chat.completions), treating each round-trip as an isolated transaction. Agents, however, operate as stateful workflows: they invoke tools, process results, retry on failure, and accumulate conversation history across multiple semantic steps. This architectural mismatch creates a critical observability gap.
Engineers typically monitor individual request latency or aggregate total spend across a billing cycle. Neither approach reveals the execution shape of a single agent run. When a workflow budgeted at $0.12 suddenly consumes $4.20, per-call line items appear normal. Each individual request might use standard token counts and return within acceptable latency bounds. The anomaly only becomes visible when calls are grouped by their originating run and broken down by logical step.
This problem is systematically overlooked because traditional APM tools are built around HTTP spans or database queries, not semantic agent phases. They lack native concepts of "tool loops," "context window accumulation," or "step-level budgeting." Consequently, developers miss quadratic token growth patterns hidden inside unbounded tool invocations. Real-world incident data consistently shows that ambiguous tool prompts combined with naive history management can inflate input tokens by 10–30x per iteration, directly correlating to cost spikes that only surface after the billing cycle closes.
The solution requires shifting from request-level monitoring to run-level telemetry: lightweight, in-memory aggregation that tracks cost, latency percentiles, and token distribution across explicitly tagged workflow phases.
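The gap is easy to reproduce on paper. The sketch below assumes a hypothetical agent that re-sends its whole history on every iteration, with a fixed-size tool result appended each time. The function names and token figures are illustrative, not measurements from the incident.

```typescript
// Hypothetical model: call i carries the base prompt plus the i tool
// results produced so far, each adding `tokensPerResult` tokens.
function inputTokensPerCall(
  basePromptTokens: number,
  tokensPerResult: number,
  iterations: number
): number[] {
  const perCall: number[] = [];
  for (let i = 0; i < iterations; i++) {
    perCall.push(basePromptTokens + i * tokensPerResult);
  }
  return perCall;
}

// Sums the input tokens billed across the whole run.
function totalInputTokens(perCall: number[]): number {
  return perCall.reduce((sum, tokens) => sum + tokens, 0);
}

// Illustrative figures: 1,000-token prompt, 2,000-token tool results, 10 iterations.
const growth = inputTokensPerCall(1000, 2000, 10);
const runTotal = totalInputTokens(growth);
console.log(growth[0], growth[growth.length - 1], runTotal);
```

Under these assumptions the last call alone carries 19x the input of the first, and the run total follows n·base + r·n(n−1)/2. That second term is the quadratic growth a per-call view does not show.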
WOW Moment: Key Findings
The diagnostic power of run-level aggregation becomes immediately apparent when comparing blind per-call tracking against step-aware telemetry. The following table contrasts the visibility each approach provides for the same agent execution:
| Approach | Total Cost | Call Count | P95 Latency | Context Growth Pattern | Diagnostic Clarity |
|---|---|---|---|---|---|
| Per-Call Billing View | $4.20 | 11 | ~4.1s | Hidden | Low (line items appear normal) |
| Run-Level Aggregation | $4.20 | 11 | 4.92s | O(n²) detected | High (step breakdown isolates loop) |
| Optimized Run (Fixed) | $0.14 | 5 | 3.05s | Bounded | High (30x cost reduction verified) |
This finding matters because it transforms cost debugging from retrospective guesswork into deterministic root-cause analysis. By grouping calls into runs and tagging them by semantic step, engineers can immediately identify which phase consumed disproportionate resources. The P95 latency metric reveals tail behavior that averages obscure, while step-level token tracking exposes quadratic growth before it impacts production budgets. This visibility enables proactive loop detection, dynamic context management, and precise step-level budgeting.
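As a rough illustration of how a step breakdown isolates the loop, the sketch below ranks steps by their share of run cost. The rankStepsByCost helper and the per-step split are made up for this example; only the $4.20 total and the 11-call count come from the table above.

```typescript
interface StepCost {
  callCount: number;
  totalCostUsd: number;
}

interface RankedStep {
  step: string;
  share: number; // fraction of total run cost, 0..1
  callCount: number;
}

// Returns steps sorted by their share of total cost, largest first.
function rankStepsByCost(byStep: Record<string, StepCost>): RankedStep[] {
  const totalCost = Object.values(byStep).reduce((sum, m) => sum + m.totalCostUsd, 0);
  return Object.entries(byStep)
    .map(([step, m]) => ({
      step,
      share: totalCost > 0 ? m.totalCostUsd / totalCost : 0,
      callCount: m.callCount,
    }))
    .sort((a, b) => b.share - a.share);
}

// Hypothetical split of an 11-call, $4.20 run in which one step loops.
const ranked = rankStepsByCost({
  plan: { callCount: 1, totalCostUsd: 0.02 },
  search: { callCount: 9, totalCostUsd: 4.1 },
  summarize: { callCount: 1, totalCostUsd: 0.08 },
});
console.log(ranked[0].step, ranked[0].callCount, ranked[0].share.toFixed(2));
```

A single step holding most of the cost and most of the calls is the signature to look for; the per-call billing view spreads the same spend across line items that each look unremarkable.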
Core Solution
Building a run-level telemetry system requires three architectural decisions: explicit run lifecycle management, flat tagging for step attribution, and lightweight in-memory aggregation. The implementation avoids async lock-in, payload logging, and nested span complexity, focusing strictly on cost, latency, and token distribution.
Step 1: Define the Run Lifecycle
A run represents a single agent execution from initiation to completion. It must support explicit start, tagging, call recording, and finalization.
export interface RunConfig {
runId: string;
workflowName: string;
tags: Record<string, string>;
}
export class AgentRunTracker {
private calls: CallSnapshot[] = [];
private startTime: number;
private config: RunConfig;
constructor(config: RunConfig) {
this.config = config;
this.startTime = performance.now();
}
recordCall(snapshot: CallSnapshot): void {
this.calls.push(snapshot);
}
finalize(): RunReport {
const duration = performance.now() - this.startTime;
return this.aggregate(duration);
}
}
Step 2: Implement Call Recording with Cost Extraction
Each LLM interaction should be captured as a snapshot containing token counts, latency, and pre-computed cost. Cost calculation must be cache-aware to reflect actual provider pricing.
export interface CallSnapshot {
model: string;
inputTokens: number;
outputTokens: number;
cacheReadTokens: number;
cacheWriteTokens: number;
latencyMs: number;
costUsd: number;
stepTag: string;
}
export function computeCost(
model: string,
input: number,
output: number,
cacheRead: number,
cacheWrite: number
): number {
const PRICING = {
"claude-opus-4-7": { input: 0.000015, output: 0.000075, cacheRead: 0.00001, cacheWrite: 0.00001875 },
"claude-haiku-4": { input: 0.00000025, output: 0.00000125, cacheRead: 0.00000003, cacheWrite: 0.0000003 }
};
const rates = PRICING[model as keyof typeof PRICING] || PRICING["claude-haiku-4"];
return (
input * rates.input +
output * rates.output +
cacheRead * rates.cacheRead +
cacheWrite * rates.cacheWrite
);
}
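One way to build a snapshot is to map the provider's usage object directly. The sketch below assumes the usage field names of Anthropic's Messages API (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens); check them against the SDK version you use. The toSnapshot helper, the local Snapshot type, and the flat rates in the example are hypothetical.

```typescript
// Assumed shape of the provider's usage object; cache fields may be absent.
interface ProviderUsage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

// Local copy of the CallSnapshot fields so the sketch is self-contained.
interface Snapshot {
  model: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  latencyMs: number;
  costUsd: number;
  stepTag: string;
}

type CostFn = (
  model: string,
  input: number,
  output: number,
  cacheRead: number,
  cacheWrite: number
) => number;

// Converts a usage object plus measured latency into a snapshot.
function toSnapshot(
  model: string,
  usage: ProviderUsage,
  latencyMs: number,
  stepTag: string,
  costFn: CostFn
): Snapshot {
  const cacheReadTokens = usage.cache_read_input_tokens ?? 0;
  const cacheWriteTokens = usage.cache_creation_input_tokens ?? 0;
  return {
    model,
    inputTokens: usage.input_tokens,
    outputTokens: usage.output_tokens,
    cacheReadTokens,
    cacheWriteTokens,
    latencyMs,
    costUsd: costFn(model, usage.input_tokens, usage.output_tokens, cacheReadTokens, cacheWriteTokens),
    stepTag,
  };
}

// Placeholder rates (USD per million tokens) for the example only.
const exampleRates: CostFn = (_model, input, output, cacheRead, cacheWrite) =>
  (input * 2 + output * 10 + cacheRead * 0.5 + cacheWrite * 2.5) / 1e6;

const snap = toSnapshot(
  "example-model",
  { input_tokens: 1200, output_tokens: 300, cache_read_input_tokens: 4000 },
  1850,
  "search",
  exampleRates
);
console.log(snap.costUsd.toFixed(4), snap.cacheWriteTokens);
```

Passing the cost function in as a parameter keeps the mapping independent of any one pricing table, in line with the pre-computed cost design described later.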
Step 3: Aggregate Metrics and Generate Reports
Aggregation computes percentiles, groups by step, and formats the output for downstream analysis. P50 and P95 latency are calculated using standard percentile algorithms.
export interface RunReport {
runId: string;
workflowName: string;
durationMs: number;
totalCostUsd: number;
callCount: number;
p50LatencyMs: number;
p95LatencyMs: number;
byStep: Record<string, StepMetrics>;
}
export interface StepMetrics {
callCount: number;
totalCostUsd: number;
avgInputTokens: number;
avgOutputTokens: number;
}
// Method of AgentRunTracker: place this inside the class body from Step 1.
private aggregate(duration: number): RunReport {
const latencies = this.calls.map(c => c.latencyMs).sort((a, b) => a - b);
const p50 = latencies[Math.floor(latencies.length * 0.5)] || 0;
const p95 = latencies[Math.floor(latencies.length * 0.95)] || 0;
const byStep: Record<string, StepMetrics> = {};
let totalCost = 0;
for (const call of this.calls) {
totalCost += call.costUsd;
if (!byStep[call.stepTag]) {
byStep[call.stepTag] = { callCount: 0, totalCostUsd: 0, avgInputTokens: 0, avgOutputTokens: 0 };
}
const step = byStep[call.stepTag];
step.callCount++;
step.totalCostUsd += call.costUsd;
step.avgInputTokens = (step.avgInputTokens * (step.callCount - 1) + call.inputTokens) / step.callCount;
step.avgOutputTokens = (step.avgOutputTokens * (step.callCount - 1) + call.outputTokens) / step.callCount;
}
return {
runId: this.config.runId,
workflowName: this.config.workflowName,
durationMs: Math.round(duration),
totalCostUsd: Math.round(totalCost * 10000) / 10000,
callCount: this.calls.length,
p50LatencyMs: Math.round(p50),
p95LatencyMs: Math.round(p95),
byStep
};
}
Architecture Rationale
- In-Memory Aggregation: Avoids I/O overhead during execution. Telemetry should never block the agent loop. Serialization happens post-run.
- Flat Tagging System: Steps are labeled with simple key-value pairs. This keeps the API surface minimal and avoids the complexity of nested spans. If hierarchical tracing is required, OpenTelemetry remains the appropriate tool.
- Pre-Computed Cost: Cost estimation is decoupled from the tracker. This allows swapping pricing engines (cache-aware, volume-discounted, or custom enterprise rates) without modifying the telemetry layer.
- Percentile Latency: P50 and P95 are prioritized over mean. Agent workflows exhibit heavy-tailed latency distributions; averages mask tail behavior that directly impacts user experience.
Pitfall Guide
1. The Context Snowball (O(n²) Token Growth)
Explanation: Appending full conversation history to every tool iteration causes input tokens to grow linearly per call. Over n iterations, total tokens scale quadratically. This is the most common cause of unexpected cost spikes in tool-using agents.
Fix: Implement a sliding window or explicit history truncation. Only attach relevant context for the current step. Use message pruning strategies that preserve system instructions and recent tool outputs while discarding older intermediate results.
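A minimal sketch of the sliding-window fix, assuming a simple role/content message shape. The ChatMessage type and pruneHistory helper are hypothetical and not part of any provider SDK.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Keeps every system message plus the most recent `maxRecent` non-system
// messages, preserving the original order.
function pruneHistory(messages: ChatMessage[], maxRecent: number): ChatMessage[] {
  const nonSystem = messages.filter((m) => m.role !== "system");
  const start = Math.max(0, nonSystem.length - maxRecent);
  const keep = new Set(nonSystem.slice(start));
  return messages.filter((m) => m.role === "system" || keep.has(m));
}

const fullHistory: ChatMessage[] = [
  { role: "system", content: "You are a citation checker." },
  { role: "user", content: "Check reference 1." },
  { role: "assistant", content: "Calling search..." },
  { role: "tool", content: "search result A" },
  { role: "assistant", content: "Calling search again..." },
  { role: "tool", content: "search result B" },
];

const pruned = pruneHistory(fullHistory, 2);
console.log(pruned.map((m) => m.role).join(","));
```

With a fixed window, per-call input stays bounded instead of growing with every iteration. A production version would also need to keep tool calls paired with their results, since provider APIs may reject a tool result that has lost its originating call.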
2. Mean Latency Mirage
Explanation: Reporting average latency across a run obscures tail behavior. A single 12-second reasoning call mixed with eight 2-second calls yields a ~3.1s average, masking the actual user-facing lag.
Fix: Track P50, P95, and P99 latency by default. Set alerting thresholds on P95, not mean. This accurately reflects the experience of users hitting slow paths.
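The numbers in this example can be checked with a short sketch. It uses the nearest-rank percentile definition, which is one of several common conventions and differs slightly from the Math.floor indexing in the tracker above; the helper names are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Arithmetic mean, 0 for an empty input.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// One 12-second call mixed with eight 2-second calls, in milliseconds.
const latenciesMs = [12000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000];

const meanMs = mean(latenciesMs);
const p50Ms = percentile(latenciesMs, 50);
const p95Ms = percentile(latenciesMs, 95);
console.log(Math.round(meanMs), p50Ms, p95Ms);
```

The mean lands near 3.1 seconds, a latency no call in the run actually had, while P50 reports the typical 2 seconds and P95 surfaces the 12-second outlier. With only nine samples P95 is simply the maximum, so percentiles become more informative as runs are pooled.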
3. Ambiguous Tool Termination Conditions
Explanation: Models will continue invoking tools if the prompt lacks explicit exit criteria. This creates unbounded loops that consume tokens until context limits or budget caps are hit.
Fix: Define strict output schemas with explicit termination flags. Use structured outputs or JSON mode to enforce {"action": "continue" | "complete", "result": "..."}. Validate the response before proceeding.
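A minimal sketch of that validation step, paired with a hard iteration cap. The parseDecision and shouldStop helpers are hypothetical, and treating malformed output as a stop signal is one possible policy rather than the only reasonable one.

```typescript
interface AgentDecision {
  action: "continue" | "complete";
  result: string;
}

// Parses a raw model response; returns null if it does not match the schema.
function parseDecision(raw: string): AgentDecision | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const fields = parsed as Record<string, unknown>;
  const action = fields.action;
  const result = fields.result;
  if (action !== "continue" && action !== "complete") return null;
  if (typeof result !== "string") return null;
  return { action, result };
}

// Decides whether the loop should end after the given zero-based iteration.
function shouldStop(
  decision: AgentDecision | null,
  iteration: number,
  maxIterations: number
): boolean {
  if (decision === null) return true; // malformed output: stop rather than loop
  if (decision.action === "complete") return true;
  return iteration + 1 >= maxIterations; // hard cap even if the model says "continue"
}

const ok = parseDecision('{"action": "complete", "result": "3 of 4 citations verified"}');
const bad = parseDecision('{"action": "retry", "result": "x"}');
console.log(ok?.action, bad);
```

The cap matters because a schema alone does not bound the loop: a model can return a perfectly valid "continue" indefinitely.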
4. Flat Tagging Limitations in Complex Workflows
Explanation: Simple step tags work for linear agents but fail when workflows branch, parallelize, or retry. Flat tags cannot represent parent-child relationships or conditional paths.
Fix: Recognize when to graduate to OpenTelemetry. Use run-level aggregation for cost/latency debugging, but deploy full distributed tracing when you need span hierarchies, context propagation, or compliance auditing.
5. Silent Pricing Drift
Explanation: Hardcoded pricing tables become stale when providers update rates or introduce cache discounts. Tracking costs without cache awareness overestimates spend and masks optimization opportunities.
Fix: Integrate cache-aware pricing calculators. Track cache_read_tokens and cache_write_tokens separately. Monitor cache hit ratios alongside cost to identify when prompt caching is actually delivering savings.
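A small sketch of the cache hit ratio mentioned above. It assumes that inputTokens counts only uncached input, so the full prompt size is the sum of the three token fields; verify that convention for your provider before relying on the number. The helper and type names are illustrative.

```typescript
interface PromptTokenCounts {
  inputTokens: number; // assumed to exclude cached tokens
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

// Fraction of all prompt tokens across the given calls that were served
// from cache. Returns 0 when there are no prompt tokens at all.
function cacheHitRatio(calls: PromptTokenCounts[]): number {
  let cacheRead = 0;
  let totalPrompt = 0;
  for (const call of calls) {
    cacheRead += call.cacheReadTokens;
    totalPrompt += call.inputTokens + call.cacheReadTokens + call.cacheWriteTokens;
  }
  return totalPrompt > 0 ? cacheRead / totalPrompt : 0;
}

// Illustrative run: the first call writes the cache, the second reads it.
const hitRatio = cacheHitRatio([
  { inputTokens: 100, cacheReadTokens: 0, cacheWriteTokens: 300 },
  { inputTokens: 100, cacheReadTokens: 300, cacheWriteTokens: 0 },
]);
console.log(hitRatio);
```

A ratio that stays near zero while cacheWriteTokens keeps growing suggests the cache is being written on every call but rarely read, which costs more than not caching at all.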
6. Over-Engineering Telemetry Payloads
Explanation: Logging full request/response bodies, headers, and raw JSON inflates storage costs and introduces PII leakage risks. It also slows down the agent loop.
Fix: Track metadata only: token counts, latency, cost, model, and step tags. If payload inspection is required, route it to a separate wire-capture service. Keep the telemetry layer strictly numeric.
7. Missing Circuit Breakers
Explanation: Running agents against degraded upstream services or misconfigured endpoints can trigger rapid retry loops, accumulating cost before failure is detected.
Fix: Implement circuit breakers that short-circuit runs after consecutive failures or latency thresholds. Combine with run-level telemetry to detect degradation patterns early and halt spending automatically.
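A deliberately small, run-scoped breaker might look like the sketch below. It trips on consecutive failures or on cumulative spend and has no half-open or recovery state, which a service-level breaker would normally add. The class name and limits are assumptions for illustration.

```typescript
class RunCircuitBreaker {
  private consecutiveFailures = 0;
  private spentUsd = 0;
  private readonly maxConsecutiveFailures: number;
  private readonly maxRunCostUsd: number;

  constructor(maxConsecutiveFailures: number, maxRunCostUsd: number) {
    this.maxConsecutiveFailures = maxConsecutiveFailures;
    this.maxRunCostUsd = maxRunCostUsd;
  }

  // A successful call resets the failure streak but still counts toward spend.
  recordSuccess(costUsd: number): void {
    this.consecutiveFailures = 0;
    this.spentUsd += costUsd;
  }

  // Failed calls can still be billed, so their cost is tracked too.
  recordFailure(costUsd: number): void {
    this.consecutiveFailures += 1;
    this.spentUsd += costUsd;
  }

  // True once the run should be halted.
  isOpen(): boolean {
    return (
      this.consecutiveFailures >= this.maxConsecutiveFailures ||
      this.spentUsd >= this.maxRunCostUsd
    );
  }
}

// Usage: check the breaker before each LLM call and abort the run if it is open.
const breaker = new RunCircuitBreaker(3, 1.0);
breaker.recordFailure(0.01);
breaker.recordFailure(0.01);
breaker.recordSuccess(0.02); // streak resets here
breaker.recordFailure(0.01);
console.log(breaker.isOpen());
```

Checking isOpen() before every call is what turns the telemetry from a post-mortem report into a spending limit.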
Production Bundle
Action Checklist
- Instrument agent entry points with explicit run initialization and unique run IDs
- Tag every LLM call with its semantic step (e.g., search, summarize, validate)
- Replace mean latency tracking with P50/P95 metrics across all runs
- Implement sliding window context management to prevent O(n²) token growth
- Integrate cache-aware pricing calculation before recording call snapshots
- Set P95 latency and per-step cost thresholds for automated alerting
- Serialize run reports to durable storage (JSON, SQLite, or cloud sink) post-execution
- Review run-level breakdowns weekly to identify recurring loop patterns or pricing drift
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Debugging unexpected spend in a single agent run | Run-level telemetry aggregator | Isolates cost/latency by step; reveals quadratic growth | Low overhead; prevents budget blowouts |
| Production monitoring across distributed services | OpenTelemetry + APM backend | Provides span hierarchies, context propagation, and compliance | Higher infrastructure cost; necessary for scale |
| Real-time cost alerting and circuit breaking | Telemetry aggregator + custom breaker | Enables step-level thresholds and automatic run termination | Reduces waste by 60-80% during upstream degradation |
| Compliance auditing and PII tracking | Wire-capture service + telemetry | Separates metadata from payload; meets regulatory requirements | Storage cost increases; mitigates legal risk |
| Rapid prototyping and local debugging | In-memory aggregator with JSON export | Zero external dependencies; instant feedback loop | Negligible; ideal for development cycles |
Configuration Template
// telemetry.config.ts
import { AgentRunTracker, RunConfig, CallSnapshot, computeCost } from './run-tracker';
export const TELEMTRY_CONFIG = {
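// Note: evaluated once at module load. Generate a fresh ID per run if this config object is reused.
runId: `run_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`,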
workflowName: 'cite-check-pipeline',
tags: { environment: 'production', team: 'research' },
alertThresholds: {
p95LatencyMs: 4000,
maxCostPerStep: 0.50,
maxCallsPerStep: 5
}
};
export function createRunTracker(config: RunConfig): AgentRunTracker {
const tracker = new AgentRunTracker(config);
// Optional: attach validation middleware
tracker.recordCall = function (this: AgentRunTracker, snapshot: CallSnapshot): void {
// Compares a single call's cost with the per-step limit; cumulative step cost is only known after finalize().
const callCost = snapshot.costUsd;
if (callCost > TELEMTRY_CONFIG.alertThresholds.maxCostPerStep) {
console.warn(`[TELEMETRY] Call in step ${snapshot.stepTag} exceeded cost threshold: $${callCost.toFixed(4)}`);
}
AgentRunTracker.prototype.recordCall.call(this, snapshot);
};
return tracker;
}
export function instrumentLlmCall(
tracker: AgentRunTracker,
model: string,
inputTokens: number,
outputTokens: number,
cacheRead: number,
cacheWrite: number,
latencyMs: number,
step: string
): void {
const cost = computeCost(model, inputTokens, outputTokens, cacheRead, cacheWrite);
tracker.recordCall({
model,
inputTokens,
outputTokens,
cacheReadTokens: cacheRead,
cacheWriteTokens: cacheWrite,
latencyMs,
costUsd: cost,
stepTag: step
});
}
Quick Start Guide
- Initialize the tracker at the start of your agent workflow using createRunTracker(TELEMTRY_CONFIG). Assign a unique run ID and tag the workflow name.
- Wrap every LLM invocation with instrumentLlmCall(). Pass model, token counts, cache metrics, measured latency, and the current step name.
- Finalize and export by calling tracker.finalize(). Serialize the resulting RunReport to your preferred sink (console, file, or HTTP endpoint).
- Validate thresholds against TELEMTRY_CONFIG.alertThresholds. Trigger alerts or halt execution if P95 latency or per-step cost exceeds limits.
- Iterate on context management if step-level token counts show linear growth. Replace full history attachment with a sliding window or step-specific context extraction.
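The threshold validation step can be sketched as a pure function over a finalized report. The findViolations helper and the ReportLike type are hypothetical; the limits mirror the alertThresholds values from the configuration template, and the report figures are illustrative.

```typescript
interface AlertThresholds {
  p95LatencyMs: number;
  maxCostPerStep: number;
  maxCallsPerStep: number;
}

// Only the report fields this check needs.
interface ReportLike {
  p95LatencyMs: number;
  byStep: Record<string, { callCount: number; totalCostUsd: number }>;
}

// Returns one human-readable message per exceeded limit.
function findViolations(report: ReportLike, limits: AlertThresholds): string[] {
  const violations: string[] = [];
  if (report.p95LatencyMs > limits.p95LatencyMs) {
    violations.push(`p95 latency ${report.p95LatencyMs}ms exceeds ${limits.p95LatencyMs}ms`);
  }
  for (const [step, metrics] of Object.entries(report.byStep)) {
    if (metrics.totalCostUsd > limits.maxCostPerStep) {
      violations.push(`step ${step} cost $${metrics.totalCostUsd.toFixed(2)} exceeds $${limits.maxCostPerStep.toFixed(2)}`);
    }
    if (metrics.callCount > limits.maxCallsPerStep) {
      violations.push(`step ${step} made ${metrics.callCount} calls, limit is ${limits.maxCallsPerStep}`);
    }
  }
  return violations;
}

const violations = findViolations(
  {
    p95LatencyMs: 4920,
    byStep: {
      plan: { callCount: 1, totalCostUsd: 0.02 },
      search: { callCount: 9, totalCostUsd: 4.1 },
    },
  },
  { p95LatencyMs: 4000, maxCostPerStep: 0.5, maxCallsPerStep: 5 }
);
violations.forEach((v) => console.log(v));
```

Returning messages instead of throwing leaves the caller free to decide whether a violation should only alert or should halt the run.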
