I Turned on Agent Tracing for 30 Days. 4 Hidden Bottlenecks Were Eating 47% of My Tokens.

The Silent Token Tax: Auditing LLM Agent Economics with Distributed Tracing

Current Situation Analysis

The Opacity of LLM Spend

Engineering teams deploying autonomous agents frequently encounter a disconnect between operational behavior and financial reality. The invoice arrives with a total token count, but the agent's internal logs report only high-level summaries: "Task completed successfully." This summary-level logging creates a blind spot where inefficient behaviors—retry storms, redundant context loading, and verbose output patterns—accumulate silently.

Why the Problem is Misunderstood

Most teams rely on aggregated metrics (average tokens per turn, daily cost totals) or agent-generated logs. Aggregates smooth out anomalies, masking behavioral spikes. Agent logs are inherently biased; the model generating the log is the same entity being measured, often optimizing for brevity or perceived success rather than diagnostic accuracy. Consequently, teams treat token consumption as a fixed cost of model capability rather than a variable output of system design.

Data-Backed Evidence

In a production audit of a Claude-based agent performing code reviews, changelog drafting, and channel summarization, the system consumed 5.2 million tokens monthly. That volume looked out of proportion to the actual workload of approximately 80 PRs per month. After implementing granular per-call tracing over a 30-day window, four distinct inefficiencies were isolated. Together they accounted for 47% of the monthly token expenditure while delivering no functional value. Remediation reduced consumption to roughly 2.8 million tokens without altering the agent's output quality or scope.
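The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the reported numbers; the per-PR values are rough upper bounds, since the agent also drafts changelogs and summarizes channels.

```typescript
// Figures reported in the audit.
const monthlyTokens = 5_200_000;
const wasteShare = 0.47;
const prsPerMonth = 80;

// Tokens attributed to the four bottlenecks.
const wastedTokens = Math.round(monthlyTokens * wasteShare);

// What remains once the waste is removed; the article rounds this to 2.8 million.
const remainingTokens = monthlyTokens - wastedTokens;

// Average tokens per PR before and after remediation.
const tokensPerPrBefore = monthlyTokens / prsPerMonth;
const tokensPerPrAfter = remainingTokens / prsPerMonth;
```

The remainder comes to about 2.76 million tokens, which agrees with the reported 2.8 million after rounding.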

WOW Moment: Key Findings

The audit revealed that aggregate dashboards are structurally incapable of diagnosing token waste. The following comparison highlights why distributed tracing is required for economic optimization.

Dimension	Aggregate Dashboards	Per-Call Distributed Tracing
Cost Attribution	Shows total spend; cannot link cost to specific behaviors.	Links every token to a specific span, tool call, or turn.
Retry Detection	Masks retry loops; retries appear as normal turns.	Exposes consecutive identical tool calls and error states.
Context Efficiency	Reports average input size; hides duplication.	Identifies identical payloads across sub-agents or turns.
Output Quality	No visibility into output content.	Enables analysis of output text for verbosity or sycophancy.
Root Cause ID	Requires guesswork and manual reproduction.	Provides exact query paths to isolate bottlenecks.

Why This Matters:
Token costs in production agents are rarely driven by model pricing alone. They are driven by system architecture. Per-call tracing transforms token spend from an opaque invoice line item into a queryable dataset, enabling engineers to apply the same optimization rigor to LLM systems as they do to traditional backend services.
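To make "queryable dataset" concrete, here is a minimal sketch of one such query: finding runs of consecutive identical tool calls within a turn. The `ToolSpan` shape and the function name are assumptions for illustration; in practice the rows would come from whatever trace backend you export to.

```typescript
// Hypothetical flattened span record, e.g. one row of a trace query result.
interface ToolSpan {
  turnId: string;
  toolName: string;
  toolInput: string; // serialized tool arguments
  startTimeMs: number;
}

interface RetryStorm {
  turnId: string;
  toolName: string;
  repeats: number; // length of the run of identical calls
}

// Returns every run of consecutive identical tool calls longer than `threshold`.
function findRetryStorms(spans: ToolSpan[], threshold = 2): RetryStorm[] {
  // Group spans by turn.
  const byTurn = new Map<string, ToolSpan[]>();
  for (const s of spans) {
    const list = byTurn.get(s.turnId) ?? [];
    list.push(s);
    byTurn.set(s.turnId, list);
  }

  const storms: RetryStorm[] = [];
  byTurn.forEach((list, turnId) => {
    list.sort((a, b) => a.startTimeMs - b.startTimeMs);
    let runStart = 0;
    // Walk one past the end so the final run is also closed.
    for (let i = 1; i <= list.length; i++) {
      const same =
        i < list.length &&
        list[i].toolName === list[runStart].toolName &&
        list[i].toolInput === list[runStart].toolInput;
      if (!same) {
        const repeats = i - runStart;
        if (repeats > threshold) {
          storms.push({ turnId, toolName: list[runStart].toolName, repeats });
        }
        runStart = i;
      }
    }
  });
  return storms;
}
```

An aggregate dashboard would count each of those calls as an ordinary turn; the per-call view is what makes the repetition visible.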

Core Solution

Implementation Strategy

The solution requires instrumenting the agent loop with OpenTelemetry (OTel) using GenAI semantic conventions. The goal is to capture a span for every LLM inference call, every tool execution, and every agent turn, with trace IDs propagated across sub-agent boundaries.

1. Instrumentation Architecture

We define a TracingAgent wrapper that manages span lifecycle and attribute injection. This approach decouples tracing logic from business logic, ensuring consistent data capture.

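import { trace, SpanStatusCode } from '@opentelemetry/api';

// Attribute keys as plain strings. 'gen_ai.usage.output_tokens' follows the OTel GenAI
// semantic conventions; the turn and session keys are custom to this agent.
const ATTR = {
  TURN_ID: 'gen_ai.turn.id',
  SESSION_ID: 'gen_ai.session.id',
  OUTPUT_TOKENS: 'gen_ai.usage.output_tokens',
} as const;

// TurnContext, AgentResponse, and the helper methods called below
// (resolveContext, executeInference, executeTools) are defined elsewhere in the agent.
export class TracingAgent {
  private tracer = trace.getTracer('agent-economics');

  async executeTurn(turnContext: TurnContext): Promise<AgentResponse> {
    // startActiveSpan makes the turn span the current span, so inference and
    // tool spans started inside the callback become its children.
    return this.tracer.startActiveSpan(
      'agent.turn',
      {
        attributes: {
          [ATTR.TURN_ID]: turnContext.id,
          [ATTR.SESSION_ID]: turnContext.sessionId,
        },
      },
      async (turnSpan) => {
        try {
          // 1. Resolve context with caching
          const resolvedContext = await this.resolveContext(turnContext);

          // 2. Execute LLM inference
          const llmResult = await this.executeInference(resolvedContext);

          // 3. Process tool calls if present
          const toolResults = llmResult.toolCalls
            ? await this.executeTools(llmResult.toolCalls)
            : [];

          turnSpan.setAttribute(ATTR.OUTPUT_TOKENS, llmResult.outputTokens);
          turnSpan.setStatus({ code: SpanStatusCode.OK });

          return { response: llmResult.text, toolResults };
        } catch (error) {
          // `error` is `unknown` in a catch clause, so it is narrowed before recording.
          turnSpan.recordException(error as Error);
          turnSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
          throw error;
        } finally {
          turnSpan.end();
        }
      },
    );
  }
}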
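The wrapper above records only output tokens on the turn span. As a sketch of what a per-inference span might carry, the helper below copies the usage block of a model response onto a span. `SpanLike` is a minimal stand-in for the OTel `Span` interface so the example runs without dependencies, `recordUsage` is a hypothetical name, and the two cache attribute keys are assumed custom keys rather than confirmed convention names.

```typescript
// Minimal stand-in for the part of an OTel Span this sketch needs.
interface SpanLike {
  setAttribute(key: string, value: string | number): unknown;
}

// Usage fields as reported on an Anthropic Messages API response.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Copies model and token usage onto the span for one inference call.
function recordUsage(span: SpanLike, model: string, usage: Usage): void {
  span.setAttribute('gen_ai.request.model', model);
  span.setAttribute('gen_ai.usage.input_tokens', usage.input_tokens);
  span.setAttribute('gen_ai.usage.output_tokens', usage.output_tokens);
  // Assumed custom keys: cache writes and cache reads, defaulting to 0 when absent.
  span.setAttribute(
    'gen_ai.usage.cache_creation_input_tokens',
    usage.cache_creation_input_tokens ?? 0,
  );
  span.setAttribute(
    'gen_ai.usage.cache_read_input_tokens',
    usage.cache_read_input_tokens ?? 0,
  );
}
```

Calling this once per inference, right after the response arrives, is what makes every later query in this article possible: each span carries its own token counts instead of contributing to a monthly total.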

2. Bottleneck Remediation

The audit identified four specific patterns. Below are the technical fixes implemented to address each.

A. Tool-Call Retry Storms
Issue: The agent retried tool calls up to seven times on transient errors (e.g., HTTP 429), resending the full prompt context with each attempt.
Fix: Implement a rate-limit-aware retry policy with exponential backoff and a hard cap on retries per turn.

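import { trace, SpanStatusCode } from '@opentelemetry/api';
// 'retry-policy-lib' is a placeholder name; substitute the retry library you use.
import { RetryPolicy, ExponentialBackoff } from 'retry-policy-lib';

const tracer = trace.getTracer('agent-economics');

const retryPolicy = new RetryPolicy({
  maxAttempts: 2, // hard cap per tool call
  backoff: new ExponentialBackoff({
    initialDelay: 1000,
    maxDelay: 5000,
    jitter: true,
  }),
  retryableErrors: [429, 503],
});

async function executeToolWithRetry(toolCall: ToolCall): Promise<ToolResult> {
  return retryPolicy.execute(async () => {
    // One span per attempt, so retries appear as separate spans in the trace.
    // startSpan lives on a Tracer instance, not on the `trace` API object.
    const span = tracer.startSpan('tool.execution', {
      attributes: { 'gen_ai.tool.name': toolCall.name },
    });

    try {
      const result = await toolExecutor.run(toolCall);
      // Custom attribute: size of the serialized result in characters.
      span.setAttribute('gen_ai.tool.output_size', JSON.stringify(result).length);
      return result;
    } catch (error) {
      // Mark failed attempts so retry loops can be queried by error state.
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}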
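For readers without a retry library at hand, the same policy can be sketched in plain TypeScript. This is a minimal, dependency-free version under stated assumptions: errors are expected to carry a numeric `status` field, `withRetry` and `RetryOptions` are hypothetical names, and `sleep` is injectable so the behavior can be tested without real delays.

```typescript
interface RetryOptions {
  maxRetries: number; // retries allowed after the first attempt
  initialDelayMs: number;
  maxDelayMs: number;
  retryableStatuses: number[];
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Extracts a numeric `status` from a thrown value, if it has one.
function statusOf(err: unknown): number | undefined {
  if (typeof err === 'object' && err !== null && 'status' in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === 'number' ? status : undefined;
  }
  return undefined;
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = statusOf(err);
      const retryable =
        status !== undefined && opts.retryableStatuses.indexOf(status) !== -1;
      // Give up on non-retryable errors or once the hard cap is reached.
      if (!retryable || attempt >= opts.maxRetries) throw err;
      // Exponential backoff with full jitter:
      // a random delay in [0, min(maxDelayMs, initialDelayMs * 2^attempt)).
      const ceiling = Math.min(opts.maxDelayMs, opts.initialDelayMs * 2 ** attempt);
      await sleep(Math.random() * ceiling);
    }
  }
}
```

With `maxRetries: 2`, a tool that keeps returning 429 is called at most three times in total, compared with the seven retries observed in the audit. Whether the cap should count attempts or retries is a design choice worth stating explicitly in your own configuration.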

B. Context Re-Fetching
Issue: Static configuration files (e.g., CLAUDE.md, OWNERS.md) were re-read and re-sent to the model on every turn, consuming ~56k tokens per session unnecessarily.
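Fix: Use ephemeral prompt caching for read-only context blocks. The first request writes the block to the cache at a small premium; later turns that arrive before the cache entry expires read it back at a fraction of the normal input price, reducing input costs by ~90% for the cached content.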

interface ContextBlock {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
}

function buildCachedContext(files: string[]): ContextBlock[] {
  return files.map(file => ({
    type: 'text',
    text: loadFileContent(file),
    cache_control: { type: 'ephemeral' },
  }));
}

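// Usage in payload construction. Every entry in `messages` needs a role,
// so the cached blocks go in the top-level `system` array instead.
const payload = {
  system: buildCachedContext(['CLAUDE.md', 'OWNERS.md']),
  messages: [
    { role: 'user', content: dynamicUserInput },
  ],
};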
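The ~90% figure applies to each cached read, not to the session as a whole, because the first turn still pays to write the cache. A rough estimate can be sketched as follows. The two multipliers are assumptions based on published pricing for short-lived caches and should be checked against your provider's current price list; the function names are hypothetical.

```typescript
// Assumed price multipliers relative to the base input-token price.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Cost, in base-input-token units, of resending a static block on every turn.
function uncachedCost(blockTokens: number, turns: number): number {
  return blockTokens * turns;
}

// Cost when turn 1 writes the cache and every later turn reads it.
// Assumes each later turn arrives before the cache entry expires.
function cachedCost(blockTokens: number, turns: number): number {
  if (turns <= 0) return 0;
  return blockTokens * (CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (turns - 1));
}

// Fraction of the static-context cost saved over a whole session.
function sessionSavings(blockTokens: number, turns: number): number {
  const before = uncachedCost(blockTokens, turns);
  return before === 0 ? 0 : 1 - cachedCost(blockTokens, turns) / before;
}
```

As a purely illustrative case (these are not audit figures), a 7,000-token block sent over 8 turns costs 56,000 token-units uncached and about 13,650 cached, a saving of roughly 76% for the session. The saving approaches 90% only as the number of turns grows, and a single-turn session actually costs more with caching enabled.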

C. Sub-Agent Fan-Out Duplication
Issue: Parallel sub-agents (e.g., security, performance, architecture reviewers) each received the full PR diff, duplicating ~3k tokens per sub-agent.
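Fix: Put the diff in a single cached context block that every sub-agent request reuses, so that only the request that writes the cache pays full price for it. Cache hits require the shared block to form an identical prefix in each request, and an entry typically becomes readable only once the request that writes it has started responding, so sub-agents launched at exactly the same moment may each miss the cache. Verify the behavior through the cache-read token counts on the resulting spans.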

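import { trace, context } from '@opentelemetry/api';

const tracer = trace.getTracer('agent-economics');

async function fanOutReview(diff: string): Promise<ReviewResult[]> {
  const parentSpan = tracer.startSpan('review.fanout');
  // Child spans are started in a context that carries the parent span,
  // so every sub-agent span shares the parent's trace ID.
  const parentCtx = trace.setSpan(context.active(), parentSpan);

  const subAgents = [
    { role: 'security', prompt: 'Analyze security implications...' },
    { role: 'performance', prompt: 'Evaluate performance impact...' },
    { role: 'architecture', prompt: 'Assess architectural alignment...' },
  ];

  // All sub-agents share the same cached diff block. The diff is already in
  // memory, so the block is built directly rather than through the file loader.
  const sharedContext: ContextBlock[] = [
    { type: 'text', text: diff, cache_control: { type: 'ephemeral' } },
  ];

  try {
    return await Promise.all(
      subAgents.map(async (agent) => {
        const childSpan = tracer.startSpan(`sub-agent.${agent.role}`, undefined, parentCtx);
        try {
          return await runSubAgent(agent, sharedContext);
        } finally {
          childSpan.end();
        }
      })
    );
  } finally {
    // End the parent even if a sub-agent throws.
    parentSpan.end();
  }
}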

D. Sycophancy Preamble
Issue: The model frequently generated conversational filler ("You're absolutely right," "Great question") before answering, adding ~120 output tokens per turn across 40% of interactions.
Fix: Inject a strict system instruction to suppress preamble and require direct responses.

const SYSTEM_INSTRUCTIONS = `
  You are an automated code review assistant.
  RULES:
  - Do not restate the user's request.
  - Do not use conversational fillers or agreement phrases.
  - Begin responses immediately with the analysis or answer.
  - Be concise and technical.
`;
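A system instruction is only half of the fix; the other half is measuring whether it worked. The sketch below estimates how often outputs open with filler, using the two phrases quoted above plus a few assumed variants. The pattern list and function names are illustrative and should be extended from what your own traces show.

```typescript
// Illustrative opening-filler patterns; not an exhaustive list.
const FILLER_PATTERNS: RegExp[] = [
  /^you['’]?re absolutely right/i,
  /^great question/i,
  /^(sure|certainly|of course)[,!.]/i,
];

// True when the output begins with one of the filler patterns.
function hasPreamble(output: string): boolean {
  const head = output.trim();
  return FILLER_PATTERNS.some((pattern) => pattern.test(head));
}

// Share of outputs that begin with filler, between 0 and 1.
function preambleRate(outputs: string[]): number {
  if (outputs.length === 0) return 0;
  const hits = outputs.filter(hasPreamble).length;
  return hits / outputs.length;
}
```

Running `preambleRate` over captured output text before and after the prompt change gives a direct comparison against the 40% baseline reported in the audit.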

Architecture Decisions

OpenTelemetry with GenAI Conventions: Ensures vendor neutrality and compatibility with multiple backends (Logfire, Helicone, Langfuse). The GenAI attributes provide standardized keys for token counts and model identifiers.
Per-Call Spans: Aggregated metrics cannot diagnose behavioral inefficiencies. Per-call spans enable queries such as "identify turns with >2 identical tool calls" or "measure input similarity across sub-agents."
Ephemeral Caching: Anthropic's prompt caching reduces costs for repeated context. This is critical for static configuration and shared sub-agent payloads.
Trace Propagation: Trace IDs must be threaded through sub-agent spawns to group related spans. This allows analysis of fan-out duplication and parent-child cost relationships.
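Once trace IDs are propagated, fan-out duplication becomes a grouping problem. The following sketch groups sub-agent inference spans by trace and payload and reports how many tokens were sent redundantly. The `InferenceSpan` shape is an assumption; in a real pipeline the `payload` field would more likely be a hash of the shared context block than the full text.

```typescript
// Hypothetical flattened record for one sub-agent inference span.
interface InferenceSpan {
  traceId: string;
  agentRole: string;
  payload: string; // shared context block, or a hash of it
  payloadTokens: number;
}

interface DuplicateGroup {
  traceId: string;
  roles: string[];
  redundantTokens: number; // tokens in every copy after the first
}

// Finds payloads that were sent to more than one sub-agent within the same trace.
function findDuplicatePayloads(spans: InferenceSpan[]): DuplicateGroup[] {
  const groups = new Map<string, InferenceSpan[]>();
  for (const s of spans) {
    // A separator that cannot appear in a trace ID keeps the two parts distinct.
    const key = `${s.traceId}\u0000${s.payload}`;
    const group = groups.get(key) ?? [];
    group.push(s);
    groups.set(key, group);
  }

  const duplicates: DuplicateGroup[] = [];
  groups.forEach((group) => {
    if (group.length < 2) return;
    duplicates.push({
      traceId: group[0].traceId,
      roles: group.map((g) => g.agentRole),
      redundantTokens: group
        .slice(1)
        .reduce((sum, g) => sum + g.payloadTokens, 0),
    });
  });
  return duplicates;
}
```

Without a shared trace ID the three reviewer calls look like unrelated requests, and the grouping key has nothing to join on.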

Pitfall Guide

1. The Summary Log Trap

Explanation: Relying on logs generated by the agent itself. The model optimizes logs for brevity and success, often omitting retries, errors, or verbose output.
Fix: Use external instrumentation. Spans are generated by the infrastructure, not the model, providing ground truth.

2. Blind Retry Policies

Explanation: Default retry mechanisms may retry indefinitely or excessively on rate limits, burning tokens on every attempt.
Fix: Implement rate-limit detection with exponential backoff and hard caps on retries per turn. Monitor for consecutive identical tool calls.

3. Context Amnesia

Explanation: Re-fetching static files on every turn. This multiplies context costs by the number of turns without adding value.
Fix: Use ephemeral caching for read-only context. Load configuration once per session and reference the cache in subsequent turns.

4. Fan-Out Redundancy

Explanation: Sending full context to parallel sub-agents. This duplicates input tokens linearly with the number of workers.
Fix: Share context via cached blocks. Ensure sub-agents reference the same parent context to leverage cache hits.

5. Conversational Fluff

Explanation: Model pleasantries and preamble add output tokens that contribute no information.
Fix: Enforce strict system instructions. Monitor output text for common filler patterns and adjust prompts accordingly.

6. Aggregate Blindness

Explanation: Using dashboards that show averages. Averages hide spikes and repetitive behaviors.
Fix: Query per-call spans. Use saved queries to detect patterns like retry loops, duplication, and verbosity.

7. Ignoring Cache Hit Rates

Explanation: Enabling caching without monitoring hit rates. Caching only reduces costs if the cache is actually used.
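Fix: Track cache-read tokens (cache_read_input_tokens) against total input tokens on every inference span. Order the context so that stable content comes first and changing content last, which maximizes cache reuse.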
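A cache hit rate can be computed directly from the usage fields on each inference span. The sketch below assumes the Anthropic-style accounting in which `input_tokens` counts only uncached input, so total input is the sum of the three fields; `CallUsage` and `cacheHitRate` are hypothetical names.

```typescript
// Per-call usage fields, as captured on each inference span.
interface CallUsage {
  input_tokens: number; // uncached input only (assumed accounting)
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
}

// Share of all input tokens that were served from the cache, between 0 and 1.
function cacheHitRate(calls: CallUsage[]): number {
  let readTokens = 0;
  let totalTokens = 0;
  for (const u of calls) {
    readTokens += u.cache_read_input_tokens;
    totalTokens +=
      u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens;
  }
  return totalTokens === 0 ? 0 : readTokens / totalTokens;
}
```

A rate near zero after enabling caching usually means the cached prefix differs between requests or the entries expire between turns; both are visible by comparing cache-creation counts across consecutive spans.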

Production Bundle

Action Checklist

Instrument OTel: Add OpenTelemetry spans for LLM calls, tool executions, and agent turns using GenAI conventions.
Enable Per-Call Tracing: Ensure every inference call generates a span with input/output token counts and model attributes.
Implement Retry Caps: Configure retry policies with exponential backoff and a maximum of 2 retries per tool per turn.
Activate Ephemeral Caching: Apply cache_control: ephemeral to static context blocks and shared payloads.
Suppress Preamble: Update system instructions to forbid conversational filler and require direct responses.
Set Budget Alerts: Configure alerts at 80% of the previous month's token consumption to detect anomalies early.
Schedule Weekly Audits: Run saved trace queries weekly to identify new bottlenecks or regressions.
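The budget alert in the checklist reduces to a single comparison. This is a minimal sketch with hypothetical names; the 80% share comes from the checklist, and how month-to-date consumption is obtained is left to your metrics backend.

```typescript
interface BudgetCheck {
  thresholdTokens: number;
  triggered: boolean;
}

// Compares month-to-date consumption with a share of the previous month's total.
function checkBudget(
  previousMonthTokens: number,
  monthToDateTokens: number,
  thresholdShare = 0.8,
): BudgetCheck {
  const thresholdTokens = Math.round(previousMonthTokens * thresholdShare);
  return {
    thresholdTokens,
    triggered: monthToDateTokens >= thresholdTokens,
  };
}
```

With the post-remediation baseline of roughly 2.8 million tokens, the alert would fire once month-to-date usage reaches about 2.24 million. An alert that fires well before month end is the cue to run the saved trace queries rather than wait for the weekly audit.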

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High Turn Count	Ephemeral Caching	Reduces cost of repeated context across turns.	~90% reduction on cached input.
Parallel Sub-Agents	Shared Context Blocks	Eliminates duplication of payloads across workers.	Linear savings per sub-agent.
Rate-Limited Tools	Backoff + Retry Cap	Prevents token burn on transient errors.	Eliminates retry storm costs.
Verbose Output	Strict System Prompts	Removes non-informative output tokens.	~10-15% reduction in output tokens.
Debugging Spikes	Per-Call Trace Queries	Isolates specific behaviors causing cost increases.	Enables targeted fixes.

Configuration Template

# otel-agent-config.yaml
service:
  name: llm-agent-economics

tracer:
  provider: opentelemetry
  conventions: genai-2026-03

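# Attribute keys use the gen_ai.* prefix from the OTel GenAI semantic conventions.
# The size, session, and turn keys are custom extensions; check the cache keys
# against the convention version you target.
spans:
  - name: llm.inference
    attributes:
      - gen_ai.system
      - gen_ai.request.model
      - gen_ai.usage.input_tokens
      - gen_ai.usage.output_tokens
      - gen_ai.usage.cache_creation_input_tokens
      - gen_ai.usage.cache_read_input_tokens
  - name: tool.execution
    attributes:
      - gen_ai.tool.name
      - gen_ai.tool.input_size
      - gen_ai.tool.output_size
      - error.type
  - name: agent.turn
    attributes:
      - gen_ai.session.id
      - gen_ai.turn.id
      - gen_ai.turn.duration_ms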

exporters:
  - type: otlp
    endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT}
    headers:
      Authorization: Bearer ${OTEL_API_KEY}

Quick Start Guide

Install Dependencies: Add @opentelemetry/api, @opentelemetry/sdk-trace-base, and the GenAI semantic conventions package to your project.
Initialize Tracer: Configure the OTel tracer provider with an exporter that sends spans to your chosen backend (e.g., Logfire, Helicone).
Wrap LLM Calls: Modify your inference function to start a span for each call, capturing token usage and model attributes.
Add Caching: Update context resolution to apply cache_control: ephemeral to static files and shared blocks.
Deploy and Query: Run the agent for 24 hours, then query traces for retry loops, duplication, and verbosity using the patterns identified in the Core Solution.
