I Turned on Agent Tracing for 30 Days. 4 Hidden Bottlenecks Were Eating 47% of My Tokens.
The Silent Token Tax: Auditing LLM Agent Economics with Distributed Tracing
Current Situation Analysis
The Opacity of LLM Spend
Engineering teams deploying autonomous agents frequently encounter a disconnect between operational behavior and financial reality. The invoice arrives with a total token count, but the agent's internal logs report only high-level summaries: "Task completed successfully." This summary-level logging creates a blind spot where inefficient behaviors (retry storms, redundant context loading, and verbose output patterns) accumulate silently.
Why the Problem is Misunderstood
Most teams rely on aggregated metrics (average tokens per turn, daily cost totals) or agent-generated logs. Aggregates smooth out anomalies, masking behavioral spikes. Agent logs are inherently biased; the model generating the log is the same entity being measured, often optimizing for brevity or perceived success rather than diagnostic accuracy. Consequently, teams treat token consumption as a fixed cost of model capability rather than a variable output of system design.
Data-Backed Evidence
In a production audit of a Claude-based agent performing code reviews, changelog drafting, and channel summarization, the system consumed 5.2 million tokens monthly. Initial log analysis suggested a volume inconsistent with the actual workload (approximately 80 PRs per month). After implementing granular per-call tracing over a 30-day window, four distinct inefficiencies were isolated. These bottlenecks accounted for 47% of the monthly token expenditure while delivering no functional value. Remediation reduced consumption to 2.8 million tokens without altering the agent's output quality or scope.
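The reported figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in the audit (5.2 million tokens, 47% waste) and confirms that removing the wasted share lands at roughly the reported 2.8 million.

```typescript
// Cross-check of the audit figures reported above.
const monthlyTokensBefore = 5_200_000; // reported monthly consumption
const wasteShare = 0.47;               // reported share lost to the four bottlenecks

const wastedTokens = monthlyTokensBefore * wasteShare;
const expectedAfter = monthlyTokensBefore - wastedTokens;

// Round to one decimal place (in millions) for comparison with the reported figure.
const expectedAfterMillions = Math.round(expectedAfter / 100_000) / 10;

console.log(`wasted: about ${(wastedTokens / 1e6).toFixed(2)}M tokens`);
console.log(`expected after remediation: about ${expectedAfterMillions}M tokens`);
```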
WOW Moment: Key Findings
The audit revealed that aggregate dashboards are structurally incapable of diagnosing token waste. The following comparison highlights why distributed tracing is required for economic optimization.
| Dimension | Aggregate Dashboards | Per-Call Distributed Tracing |
|---|---|---|
| Cost Attribution | Shows total spend; cannot link cost to specific behaviors. | Links every token to a specific span, tool call, or turn. |
| Retry Detection | Masks retry loops; retries appear as normal turns. | Exposes consecutive identical tool calls and error states. |
| Context Efficiency | Reports average input size; hides duplication. | Identifies identical payloads across sub-agents or turns. |
| Output Quality | No visibility into output content. | Enables analysis of output text for verbosity or sycophancy. |
| Root Cause ID | Requires guesswork and manual reproduction. | Provides exact query paths to isolate bottlenecks. |
Why This Matters:
Token costs in production agents are rarely driven by model pricing alone. They are driven by system architecture. Per-call tracing transforms token spend from an opaque invoice line item into a queryable dataset, enabling engineers to apply the same optimization rigor to LLM systems as they do to traditional backend services.
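To make "queryable dataset" concrete, here is a minimal sketch of cost attribution over exported span records. The `SpanRecord` shape, the `operation` field, and the sample values are assumptions for illustration; a real backend will expose its own schema and query language.

```typescript
// Hypothetical flattened span record. Field names are illustrative and are
// not taken from any particular tracing backend.
interface SpanRecord {
  traceId: string;
  operation: string; // e.g. "code_review", "changelog", "channel_summary"
  inputTokens: number;
  outputTokens: number;
}

// Sum input and output tokens per operation so spend can be ranked by behavior.
function tokensByOperation(spans: SpanRecord[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const s of spans) {
    totals[s.operation] = (totals[s.operation] ?? 0) + s.inputTokens + s.outputTokens;
  }
  return totals;
}

// Made-up sample data, for illustration only.
const sampleSpans: SpanRecord[] = [
  { traceId: "t1", operation: "code_review", inputTokens: 9000, outputTokens: 700 },
  { traceId: "t1", operation: "code_review", inputTokens: 9000, outputTokens: 650 },
  { traceId: "t2", operation: "changelog", inputTokens: 2000, outputTokens: 300 },
];

const operationTotals = tokensByOperation(sampleSpans);
console.log(operationTotals);
```

An aggregate dashboard would report only the grand total of these rows; the per-operation breakdown is what makes a specific behavior attributable.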
Core Solution
Implementation Strategy
The solution requires instrumenting the agent loop with OpenTelemetry (OTel) using GenAI semantic conventions. The goal is to capture a span for every LLM inference call, every tool execution, and every agent turn, with trace IDs propagated across sub-agent boundaries.
1. Instrumentation Architecture
We define a TracingAgent wrapper that manages span lifecycle and attribute injection. This approach decouples tracing logic from business logic, ensuring consistent data capture.
    import { trace, SpanStatusCode } from '@opentelemetry/api';
    // Package and export name as given by the author; verify them against the
    // semantic-conventions package installed in your project.
    import { GenAIAttributes } from '@opentelemetry/semantic-conventions-ai';

    export class TracingAgent {
      private tracer = trace.getTracer('agent-economics');

      // resolveContext, executeInference, and executeTools are omitted here.
      async executeTurn(turnContext: TurnContext): Promise<AgentResponse> {
        const turnSpan = this.tracer.startSpan('agent.turn', {
          attributes: {
            [GenAIAttributes.TURN_ID]: turnContext.id,
            [GenAIAttributes.SESSION_ID]: turnContext.sessionId,
          },
        });

        try {
          // 1. Resolve context with caching
          const resolvedContext = await this.resolveContext(turnContext);

          // 2. Execute LLM inference
          const llmResult = await this.executeInference(resolvedContext);

          // 3. Process tool calls if present
          const toolResults = llmResult.toolCalls
            ? await this.executeTools(llmResult.toolCalls)
            : [];

          turnSpan.setAttribute(GenAIAttributes.OUTPUT_TOKENS, llmResult.outputTokens);
          turnSpan.setStatus({ code: SpanStatusCode.OK });
          return { response: llmResult.text, toolResults };
        } catch (error) {
          // `error` is `unknown` in strict TypeScript, so narrow before recording.
          turnSpan.recordException(error instanceof Error ? error : String(error));
          turnSpan.setStatus({ code: SpanStatusCode.ERROR });
          throw error;
        } finally {
          turnSpan.end();
        }
      }
    }
2. Bottleneck Remediation
The audit identified four specific patterns. Below are the technical fixes implemented to address each.
A. Tool-Call Retry Storms
Issue: The agent retried tool calls up to seven times on transient errors (e.g., HTTP 429), resending the full prompt context with each attempt.
Fix: Implement a rate-limit-aware retry policy with exponential backoff and a hard cap on retries per turn.
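    import { trace } from '@opentelemetry/api';
    // 'retry-policy-lib' is the library name as given by the author.
    import { RetryPolicy, ExponentialBackoff } from 'retry-policy-lib';

    const tracer = trace.getTracer('agent-economics');

    const retryPolicy = new RetryPolicy({
      maxAttempts: 2,
      backoff: new ExponentialBackoff({
        initialDelay: 1000,
        maxDelay: 5000,
        jitter: true,
      }),
      retryableErrors: [429, 503],
    });

    async function executeToolWithRetry(toolCall: ToolCall): Promise<ToolResult> {
      return retryPolicy.execute(async () => {
        // One span per attempt, so each retry is visible as its own span.
        // startSpan is a method of a Tracer instance, not of the `trace` API object.
        const span = tracer.startSpan('tool.execution');
        span.setAttribute(GenAIAttributes.TOOL_NAME, toolCall.name);
        try {
          const result = await toolExecutor.run(toolCall);
          span.setAttribute(GenAIAttributes.TOOL_OUTPUT_SIZE, JSON.stringify(result).length);
          return result;
        } catch (error) {
          span.recordException(error instanceof Error ? error : String(error));
          throw error;
        } finally {
          span.end();
        }
      });
    }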
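For readers without a retry library at hand, the same policy can be sketched in plain TypeScript. This is an illustration under stated assumptions, not the author's implementation: `withRetryCap`, `backoffDelayMs`, and `isRetryable` are hypothetical helpers, `maxAttempts` is interpreted as the total number of attempts (so 2 means one retry), errors are assumed to expose a numeric `status` field, and jitter is omitted for brevity.

```typescript
const RETRYABLE_STATUSES = [429, 503];

// Extract an HTTP status from an unknown error, if it carries one.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

function isRetryable(err: unknown): boolean {
  const status = statusOf(err);
  return status !== undefined && RETRYABLE_STATUSES.indexOf(status) !== -1;
}

// Delay before the n-th retry (1-based): initial * 2^(n-1), capped at maxMs.
function backoffDelayMs(retry: number, initialMs = 1000, maxMs = 5000): number {
  return Math.min(initialMs * Math.pow(2, retry - 1), maxMs);
}

// Run fn at most maxAttempts times in total; rethrow once the cap is reached
// or the error is not retryable.
async function withRetryCap<T>(
  fn: () => Promise<T>,
  maxAttempts = 2,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRetryable(err) || attempt >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The `sleep` parameter is injectable so that the cap can be exercised in tests without real delays. Because the cap applies per call to `withRetryCap`, a per-turn budget would need a counter shared across tool calls within the turn.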
B. Context Re-Fetching
Issue: Static configuration files (e.g., CLAUDE.md, OWNERS.md) were re-read and re-sent to the model on every turn, consuming ~56k tokens per session unnecessarily.
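Fix: Mark read-only context blocks for ephemeral prompt caching. Subsequent turns that arrive while the cache entry is still alive read the cached block instead of reprocessing it, which reduces input cost by ~90% for the cached portion. The first request still pays to write the cache, so the saving depends on how many later turns reuse it.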
    interface ContextBlock {
      type: 'text';
      text: string;
      cache_control?: { type: 'ephemeral' };
    }

    function buildCachedContext(files: string[]): ContextBlock[] {
      return files.map((file) => ({
        type: 'text',
        text: loadFileContent(file),
        cache_control: { type: 'ephemeral' },
      }));
    }

    // Usage in payload construction. Content blocks carrying cache_control go in
    // the `system` array (or inside a message's `content` array), not directly
    // in `messages`, where every entry must have a role.
    const payload = {
      system: buildCachedContext(['CLAUDE.md', 'OWNERS.md']),
      messages: [{ role: 'user', content: dynamicUserInput }],
    };
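A short worked estimate shows how the ~90% read discount translates into a per-session saving. This is a sketch under assumptions: a cache write is taken to cost 1.25x the base input price and a cache read 0.1x (the latter corresponds to the ~90% figure above), no cache entry expires between turns, and the 7,000-token block over 8 turns is a hypothetical split chosen only so that the uncached total matches the ~56k tokens reported. The source does not give the actual split.

```typescript
// Costs are expressed in "base-price token units": one unit is the price of
// one uncached input token.

// Without caching, the static block is billed in full on every turn.
function uncachedCost(blockTokens: number, turns: number): number {
  return blockTokens * turns;
}

// With caching, the first turn writes the cache and later turns read from it.
// Assumes the cache entry never expires between turns.
function cachedCost(
  blockTokens: number,
  turns: number,
  writeMultiplier = 1.25, // assumed cache-write premium
  readMultiplier = 0.1,   // assumed cache-read price (~90% discount)
): number {
  if (turns <= 0) return 0;
  return blockTokens * (writeMultiplier + readMultiplier * (turns - 1));
}

const staticBlockTokens = 7000; // hypothetical size of the static files
const turnsPerSession = 8;      // hypothetical number of turns

const before = uncachedCost(staticBlockTokens, turnsPerSession);
const after = cachedCost(staticBlockTokens, turnsPerSession);
const savingPercent = Math.round((1 - after / before) * 100);

console.log(`uncached: ${before} units, cached: about ${Math.round(after)} units`);
console.log(`session-level saving: about ${savingPercent}%`);
```

Under these assumptions the session-level saving comes out below 90%, because the initial cache write is billed at a premium. The more turns that reuse the block, the closer the saving gets to the read discount.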
C. Sub-Agent Fan-Out Duplication
Issue: Parallel sub-agents (e.g., security, performance, architecture reviewers) each received the full PR diff, duplicating ~3k tokens per sub-agent.
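Fix: Build the diff as a single cached block in the parent and pass that same block to every sub-agent, so the sub-agents can share one cache entry instead of each paying full input price. Cache hits are not guaranteed when requests start at the same instant; if the trace shows every sub-agent writing the cache, warming it with one request before fanning out may help.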
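    import { context, trace } from '@opentelemetry/api';

    const tracer = trace.getTracer('agent-economics');

    async function fanOutReview(diff: string): Promise<ReviewResult[]> {
      const parentSpan = tracer.startSpan('review.fanout');
      // Spans started with this context become children of parentSpan and
      // therefore share its trace ID.
      const parentCtx = trace.setSpan(context.active(), parentSpan);

      const subAgents = [
        { role: 'security', prompt: 'Analyze security implications...' },
        { role: 'performance', prompt: 'Evaluate performance impact...' },
        { role: 'architecture', prompt: 'Assess architectural alignment...' },
      ];

      // All sub-agents share the same cached diff block. buildCachedContext
      // expects file paths, so the block is built directly from the diff text.
      const sharedContext: ContextBlock[] = [
        { type: 'text', text: diff, cache_control: { type: 'ephemeral' } },
      ];

      try {
        return await Promise.all(
          subAgents.map(async (agent) => {
            const childSpan = tracer.startSpan(`sub-agent.${agent.role}`, undefined, parentCtx);
            try {
              return await runSubAgent(agent, sharedContext);
            } finally {
              childSpan.end();
            }
          })
        );
      } finally {
        parentSpan.end();
      }
    }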
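Fan-out duplication of this kind can be measured from trace data before and after the fix. The sketch below is an assumption-laden illustration: `SubAgentCall` is a hypothetical record shape, it presumes the input text (or a digest of it) is captured on each span, and the sample values mirror the ~3k tokens per sub-agent mentioned above.

```typescript
interface SubAgentCall {
  traceId: string;
  role: string;
  inputText: string;
  inputTokens: number;
}

// Count tokens spent on payloads that an earlier call in the same trace
// already sent verbatim. The raw text is used as the lookup key here; a real
// pipeline would more likely store a hash of the payload instead.
function duplicatedInputTokens(calls: SubAgentCall[]): number {
  const seen: Record<string, boolean> = {};
  let duplicated = 0;
  for (const call of calls) {
    const key = JSON.stringify([call.traceId, call.inputText]);
    if (seen[key]) {
      duplicated += call.inputTokens;
    } else {
      seen[key] = true;
    }
  }
  return duplicated;
}

// Made-up sample: three reviewers in one trace receive the same diff.
const diffText = "diff --git a/app.ts b/app.ts ...";
const fanOutCalls: SubAgentCall[] = [
  { traceId: "pr-1", role: "security", inputText: diffText, inputTokens: 3000 },
  { traceId: "pr-1", role: "performance", inputText: diffText, inputTokens: 3000 },
  { traceId: "pr-1", role: "architecture", inputText: diffText, inputTokens: 3000 },
];

const duplicatedTokens = duplicatedInputTokens(fanOutCalls);
console.log(`duplicated input tokens: ${duplicatedTokens}`);
```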
D. Sycophancy Preamble
Issue: The model frequently generated conversational filler ("You're absolutely right," "Great question") before answering, adding ~120 output tokens per turn across 40% of interactions.
Fix: Inject a strict system instruction to suppress preamble and require direct responses.
    const SYSTEM_INSTRUCTIONS = `
    You are an automated code review assistant.
    RULES:
    - Do not restate the user's request.
    - Do not use conversational fillers or agreement phrases.
    - Begin responses immediately with the analysis or answer.
    - Be concise and technical.
    `;
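Whether the instruction works can be checked against captured output text. The following sketch flags responses that open with a filler phrase. The two patterns are the examples quoted above; any further patterns, along with the helper names `startsWithFiller` and `fillerRate`, are assumptions to be tuned against real traces.

```typescript
// Filler openers to look for. These two come from the examples quoted above;
// extend the list based on what the traces actually show.
const FILLER_PATTERNS: RegExp[] = [
  /^you['’]?re absolutely right/i,
  /^great question/i,
];

function startsWithFiller(output: string): boolean {
  const head = output.trim();
  return FILLER_PATTERNS.some((pattern) => pattern.test(head));
}

// Share of responses that begin with a filler phrase (0 for an empty list).
function fillerRate(outputs: string[]): number {
  if (outputs.length === 0) return 0;
  return outputs.filter(startsWithFiller).length / outputs.length;
}

// Made-up sample outputs, for illustration only.
const sampleOutputs = [
  "Great question! The loop on line 12 allocates on every pass.",
  "The handler leaks a file descriptor when parsing fails.",
  "You're absolutely right, the null check is missing.",
  "Line 42 shadows the outer variable.",
  "No issues found.",
];

const observedFillerRate = fillerRate(sampleOutputs);
console.log(`filler rate: ${observedFillerRate}`);
```

Tracking this rate before and after the prompt change gives a direct measure of whether the instruction is being followed.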
Architecture Decisions
- OpenTelemetry with GenAI Conventions: Ensures vendor neutrality and compatibility with multiple backends (Logfire, Helicone, Langfuse). The GenAI attributes provide standardized keys for token counts and model identifiers.
- Per-Call Spans: Aggregated metrics cannot diagnose behavioral inefficiencies. Per-call spans enable queries such as "identify turns with >2 identical tool calls" or "measure input similarity across sub-agents."
- Ephemeral Caching: Anthropic's prompt caching reduces costs for repeated context. This is critical for static configuration and shared sub-agent payloads.
- Trace Propagation: Trace IDs must be threaded through sub-agent spawns to group related spans. This allows analysis of fan-out duplication and parent-child cost relationships.
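The query "identify turns with >2 identical tool calls" mentioned above can be sketched over exported tool spans as follows. `ToolSpan` and its fields are hypothetical; the sketch assumes tool arguments are recorded on each span in serialized form.

```typescript
interface ToolSpan {
  turnId: string;
  toolName: string;
  argsJson: string; // serialized tool arguments, as recorded on the span
}

// Return the IDs of turns in which the same (tool, arguments) pair occurs
// more than `threshold` times, which is the signature of a retry loop.
function turnsWithRepeatedCalls(spans: ToolSpan[], threshold = 2): string[] {
  const counts: Record<string, number> = {};
  const flagged: string[] = [];
  for (const s of spans) {
    const key = JSON.stringify([s.turnId, s.toolName, s.argsJson]);
    counts[key] = (counts[key] ?? 0) + 1;
    if (counts[key] === threshold + 1 && flagged.indexOf(s.turnId) === -1) {
      flagged.push(s.turnId);
    }
  }
  return flagged;
}

// Made-up sample: turn t1 repeats one call three times, turn t2 only twice.
const toolSpans: ToolSpan[] = [
  { turnId: "t1", toolName: "fetch_pr", argsJson: '{"pr":12}' },
  { turnId: "t1", toolName: "fetch_pr", argsJson: '{"pr":12}' },
  { turnId: "t1", toolName: "fetch_pr", argsJson: '{"pr":12}' },
  { turnId: "t2", toolName: "fetch_pr", argsJson: '{"pr":13}' },
  { turnId: "t2", toolName: "fetch_pr", argsJson: '{"pr":13}' },
];

const suspiciousTurns = turnsWithRepeatedCalls(toolSpans);
console.log(suspiciousTurns);
```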
Pitfall Guide
1. The Summary Log Trap
Explanation: Relying on logs generated by the agent itself. The model optimizes logs for brevity and success, often omitting retries, errors, or verbose output.
Fix: Use external instrumentation. Spans are generated by the infrastructure, not the model, providing ground truth.
2. Blind Retry Policies
Explanation: Default retry mechanisms may retry indefinitely or excessively on rate limits, burning tokens on every attempt.
Fix: Implement rate-limit detection with exponential backoff and hard caps on retries per turn. Monitor for consecutive identical tool calls.
3. Context Amnesia
Explanation: Re-fetching static files on every turn. This multiplies context costs by the number of turns without adding value.
Fix: Use ephemeral caching for read-only context. Load configuration once per session and reference the cache in subsequent turns.
4. Fan-Out Redundancy
Explanation: Sending full context to parallel sub-agents. This duplicates input tokens linearly with the number of workers.
Fix: Share context via cached blocks. Ensure sub-agents reference the same parent context to leverage cache hits.
5. Conversational Fluff
Explanation: Model pleasantries and preamble add output tokens that contribute no information.
Fix: Enforce strict system instructions. Monitor output text for common filler patterns and adjust prompts accordingly.
6. Aggregate Blindness
Explanation: Using dashboards that show averages. Averages hide spikes and repetitive behaviors.
Fix: Query per-call spans. Use saved queries to detect patterns like retry loops, duplication, and verbosity.
7. Ignoring Cache Hit Rates
Explanation: Enabling caching without monitoring hit rates. Caching only reduces costs if the cache is actually used.
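Fix: Track cache-read tokens against total input tokens (for example, cache_read_input_tokens relative to all input-side token fields). Keep the static prefix byte-identical across calls, since any change to it forces a new cache write.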
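For pitfall 7, a cache read ratio can be computed from the usage fields listed in the configuration template. This sketch assumes each inference span carries `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`, and that `input_tokens` counts only the uncached portion; the sample numbers are made up.

```typescript
interface Usage {
  input_tokens: number;                // uncached input tokens (assumed)
  cache_creation_input_tokens: number; // tokens written to the cache
  cache_read_input_tokens: number;     // tokens served from the cache
}

// Share of all input-side tokens that were served from the cache.
function cacheReadRatio(usages: Usage[]): number {
  let read = 0;
  let total = 0;
  for (const u of usages) {
    read += u.cache_read_input_tokens;
    total += u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens;
  }
  return total === 0 ? 0 : read / total;
}

// Made-up sample: the first turn writes the cache, the second turn reads it.
const sessionUsage: Usage[] = [
  { input_tokens: 500, cache_creation_input_tokens: 7000, cache_read_input_tokens: 0 },
  { input_tokens: 500, cache_creation_input_tokens: 0, cache_read_input_tokens: 7000 },
];

const readRatio = cacheReadRatio(sessionUsage);
console.log(`cache read ratio: ${readRatio.toFixed(2)}`);
```

A ratio that stays near zero while cache_creation_input_tokens keeps growing suggests the cache is being rewritten on every call rather than reused.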
Production Bundle
Action Checklist
- Instrument OTel: Add OpenTelemetry spans for LLM calls, tool executions, and agent turns using GenAI conventions.
- Enable Per-Call Tracing: Ensure every inference call generates a span with input/output token counts and model attributes.
- Implement Retry Caps: Configure retry policies with exponential backoff and a hard cap per tool per turn (the audit's policy used maxAttempts: 2).
- Activate Ephemeral Caching: Apply cache_control: ephemeral to static context blocks and shared payloads.
- Suppress Preamble: Update system instructions to forbid conversational filler and require direct responses.
- Set Budget Alerts: Configure alerts at 80% of the previous month's token consumption to detect anomalies early.
- Schedule Weekly Audits: Run saved trace queries weekly to identify new bottlenecks or regressions.
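The budget alert in the checklist reduces to a single comparison. A minimal sketch follows; `budgetAlert` is a hypothetical helper, and how month-to-date consumption is obtained depends on your billing or tracing backend.

```typescript
// True once month-to-date consumption reaches the given share (default 80%)
// of the previous month's total.
function budgetAlert(
  monthToDateTokens: number,
  previousMonthTokens: number,
  thresholdShare = 0.8,
): boolean {
  return monthToDateTokens >= previousMonthTokens * thresholdShare;
}

// Example using the post-remediation figure from the audit as the baseline.
const previousMonth = 2_800_000;
console.log(budgetAlert(1_000_000, previousMonth)); // well below the threshold
console.log(budgetAlert(2_300_000, previousMonth)); // above the threshold
```

Because the threshold is measured against the whole previous month, reaching it early in the month is the anomaly signal. Reaching it in the final days is expected.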
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Turn Count | Ephemeral Caching | Reduces cost of repeated context across turns. | ~90% reduction on cached input. |
| Parallel Sub-Agents | Shared Context Blocks | Eliminates duplication of payloads across workers. | Linear savings per sub-agent. |
| Rate-Limited Tools | Backoff + Retry Cap | Prevents token burn on transient errors. | Eliminates retry storm costs. |
| Verbose Output | Strict System Prompts | Removes non-informative output tokens. | ~10-15% reduction in output tokens. |
| Debugging Spikes | Per-Call Trace Queries | Isolates specific behaviors causing cost increases. | Enables targeted fixes. |
Configuration Template
    # otel-agent-config.yaml
    service:
      name: llm-agent-economics
    tracer:
      provider: opentelemetry
      conventions: genai-2026-03
    spans:
      - name: llm.inference
        attributes:
          - genai.system
          - genai.request.model
          - genai.usage.input_tokens
          - genai.usage.output_tokens
          - genai.usage.cache_creation_input_tokens
          - genai.usage.cache_read_input_tokens
      - name: tool.execution
        attributes:
          - genai.tool.name
          - genai.tool.input_size
          - genai.tool.output_size
          - error.type
      - name: agent.turn
        attributes:
          - genai.session.id
          - genai.turn.id
          - genai.turn.duration_ms
    exporters:
      - type: otlp
        endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT}
        headers:
          Authorization: Bearer ${OTEL_API_KEY}
Quick Start Guide
- Install Dependencies: Add @opentelemetry/api, @opentelemetry/sdk-trace-base, and the GenAI semantic conventions package to your project.
- Initialize Tracer: Configure the OTel tracer provider with your chosen exporter (e.g., Logfire, Helicone).
- Wrap LLM Calls: Modify your inference function to start a span for each call, capturing token usage and model attributes.
- Add Caching: Update context resolution to apply cache_control: ephemeral to static files and shared blocks.
- Deploy and Query: Run the agent for 24 hours, then query traces for retry loops, duplication, and verbosity using the patterns identified in the Core Solution.
