I lost $14,502 to Claude Code in one month. Here's the autopsy.
Beyond the Invoice: Engineering a Cross-Session Audit Pipeline for LLM Coding Agents
Current Situation Analysis
The adoption of agentic coding assistants has introduced a new class of operational expense: behavioral token consumption. Unlike traditional cloud infrastructure where costs map directly to compute hours or storage gigabytes, LLM agent billing reflects conversational loops, tool invocations, and context window accumulation. The industry pain point is not the raw price per token; it is the opacity of cost attribution. Most developers interact with a single aggregate dashboard that shows a monthly total but provides zero visibility into the behavioral patterns driving that number.
This problem is systematically overlooked because monitoring tools are architected around single-session boundaries. Real-time endpoints like Anthropic's /usage API answer what a current conversation costs, but they cannot detect cross-session repetition, model misallocation drift, or compounding context bloat. When an agent runs across dozens of sessions over a month, small inefficiencies compound multiplicatively. A 14k-token context window that grows by 2k tokens per turn doesn't just add linear cost; it forces every subsequent turn to re-process the entire history. The billing model rewards brevity and session hygiene, yet most development workflows treat agent sessions like persistent terminal windows.
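To make the compounding concrete, here is a small worked sketch using the numbers from the example above (the figures are illustrative, not from the audit data):

```typescript
// Because every turn re-sends the full conversation history, total billed
// input tokens grow quadratically with turn count, not linearly.
function cumulativeInputTokens(baseContext: number, growthPerTurn: number, turns: number): number {
  let total = 0;
  for (let i = 0; i < turns; i++) {
    total += baseContext + growthPerTurn * i; // turn i re-processes all prior growth
  }
  return total;
}

// A 14k-token context growing 2k per turn over 20 turns bills 660k input
// tokens, versus 280k if the context stayed flat.
```

The gap widens with every additional turn, which is why session hygiene matters more than per-turn frugality.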
Data from a 31-day audit of a high-usage coding agent reveals eight distinct cost-drain patterns. Retry storms and model misallocation alone account for 41.5% of total spend. Context accumulation and redundant file I/O contribute another 25.8%. Tool duplication, routing fallbacks, unattended execution, and gradual baseline drift make up most of the remaining 32.7%. Without a cross-session analytical layer, these patterns remain invisible until the invoice arrives. The solution requires shifting from reactive token counting to prescriptive behavioral auditing.
WOW Moment: Key Findings
The following table isolates the primary cost drivers, their proportional impact, and the technical leverage available for mitigation. The data demonstrates that cost optimization is not about reducing token volume; it is about interrupting specific behavioral loops.
| Cost Driver | Share of Total Spend | Primary Trigger | Mitigation ROI |
|---|---|---|---|
| Retry Storms | 21.6% | Consecutive tool errors without user intervention | High: Breaks exponential context growth |
| Model Misallocation | 19.9% | Sticky model selection across trivial tasks | High: 20-30x cost reduction per turn |
| Context Bloat | 14.7% | Sessions exceeding 50k-60k token thresholds | Medium: Requires session recycling discipline |
| Repeated File Reads | 11.1% | Agent re-ingesting unchanged source files | Medium: Eliminates redundant input tokens |
| Tool Duplication | 9.4% | Overlapping search queries and redundant calls | High: Reduces planning + consumption overhead |
| Routing Fallbacks | 7.7% | Platform throttling silently upgrading model tier | Medium: Requires log verification |
| Unattended Execution | 5.9% | Sessions running outside human review windows | Low-Medium: Visibility-driven throttling |
| Baseline Drift | 4.5% | Gradual increase in average session cost | Medium: Trend-based early warning |
This finding matters because it reframes cost management from a budgeting problem to an engineering problem. Each pattern maps to a specific architectural or workflow intervention. Retry storms require error-handling gates. Model misallocation requires explicit routing policies. Context bloat requires session lifecycle management. Once these patterns are isolated, mitigation becomes deterministic rather than speculative.
Core Solution
Building a cross-session audit pipeline requires three components: a streaming log parser, a pattern detection engine, and a cost attribution layer. The pipeline ingests JSONL session logs, applies statistical baselines, flags anomalous behaviors, and outputs a structured report. The architecture prioritizes memory efficiency, deterministic pattern matching, and extensibility for new detection rules.
Step 1: Stream-Based Log Ingestion
JSONL files are append-only and grow continuously. Loading an entire month of logs into memory is inefficient and fragile. A streaming approach processes turns incrementally, maintaining only the current session state in memory.
```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';
import { randomUUID } from 'node:crypto';

interface TurnRecord {
  role: 'assistant' | 'user' | 'tool_result';
  model?: string;
  usage?: { input_tokens: number; output_tokens: number };
  tool_use?: Array<{ name: string; input: Record<string, unknown> }>;
  is_error?: boolean;
  timestamp?: string;
}

interface SessionBucket {
  id: string;
  turns: TurnRecord[];
  totalInputTokens: number;
  totalOutputTokens: number;
  startTime: string;
  endTime: string;
}

async function streamSessionLogs(
  filePath: string,
  onSessionComplete: (session: SessionBucket) => void
): Promise<void> {
  const rl = createInterface({ input: createReadStream(filePath) });
  let currentSession: SessionBucket | null = null;
  for await (const line of rl) {
    if (!line.trim()) continue; // append-only logs can contain blank lines
    const turn: TurnRecord = JSON.parse(line);
    if (!currentSession) {
      currentSession = {
        id: randomUUID(),
        turns: [],
        totalInputTokens: 0,
        totalOutputTokens: 0,
        startTime: turn.timestamp ?? new Date().toISOString(),
        endTime: turn.timestamp ?? new Date().toISOString()
      };
    }
    currentSession.turns.push(turn);
    currentSession.totalInputTokens += turn.usage?.input_tokens ?? 0;
    currentSession.totalOutputTokens += turn.usage?.output_tokens ?? 0;
    currentSession.endTime = turn.timestamp ?? new Date().toISOString();
    // Session boundary heuristic: a user message after prior turns closes the session
    if (turn.role === 'user' && currentSession.turns.length > 1) {
      onSessionComplete(currentSession);
      currentSession = null;
    }
  }
  if (currentSession) onSessionComplete(currentSession);
}
```
Architecture Rationale: The streaming parser decouples I/O from analysis. Session boundaries are determined by user intervention points, which align with how agents reset context. Token accumulation is tracked incrementally to avoid post-hoc summation overhead.
Step 2: Pattern Detection Engine
Each cost driver requires a deterministic detector. Statistical baselines prevent false positives across different workload profiles.
```typescript
type DetectorResult = {
  pattern: string;
  severity: 'high' | 'medium' | 'low';
  evidence: Record<string, unknown>;
};

function detectRetryStorm(session: SessionBucket): DetectorResult | null {
  let consecutiveFailures = 0;
  for (const turn of session.turns) {
    if (turn.role === 'tool_result' && turn.is_error) {
      consecutiveFailures++;
      if (consecutiveFailures >= 3) {
        return {
          pattern: 'retry_storm',
          severity: 'high',
          evidence: { failureCount: consecutiveFailures, contextTokens: session.totalInputTokens }
        };
      }
    } else if (turn.role === 'user') {
      consecutiveFailures = 0; // user intervention breaks the storm
    }
  }
  return null;
}

function detectContextBloat(
  session: SessionBucket,
  baselineAvg: number,
  baselineStdDev: number
): DetectorResult | null {
  const avgPerTurn = session.totalInputTokens / Math.max(session.turns.length, 1);
  const threshold = baselineAvg + 3 * baselineStdDev;
  if (avgPerTurn > threshold && session.totalInputTokens > 50_000) {
    return {
      pattern: 'context_bloat',
      severity: 'medium',
      evidence: { avgTokensPerTurn: avgPerTurn, threshold, totalTokens: session.totalInputTokens }
    };
  }
  return null;
}

function detectToolDuplication(session: SessionBucket): DetectorResult | null {
  const toolCalls = session.turns
    .flatMap(t => t.tool_use ?? [])
    .map(call => call.name + JSON.stringify(call.input));
  if (toolCalls.length <= 15) return null; // too few calls to judge; also avoids 0/0
  const uniqueCalls = new Set(toolCalls);
  const duplicationRatio = 1 - uniqueCalls.size / toolCalls.length;
  if (duplicationRatio > 0.3) {
    return {
      pattern: 'tool_duplication',
      severity: 'high',
      evidence: { totalCalls: toolCalls.length, uniqueCalls: uniqueCalls.size, ratio: duplicationRatio }
    };
  }
  return null;
}
```
Architecture Rationale: Detectors are pure functions that accept session state and return structured results. This enables parallel execution and unit testing. Statistical thresholds (3σ) adapt to individual developer baselines rather than enforcing rigid global limits. Duplication detection serializes each call into a set key, so counting duplicates is a single cardinality check rather than pairwise string comparison, which scales better across large call volumes of JSON payloads.
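The detectors consume a per-user baseline that the pipeline assumes already exists. A minimal sketch of how it might be derived from historical sessions (the `SessionStats` shape and `computeBaseline` name are illustrative stand-ins, not part of the pipeline above):

```typescript
// Hypothetical baseline derivation: mean and standard deviation of
// average input tokens per turn across a window of historical sessions.
interface SessionStats {
  totalInputTokens: number;
  turnCount: number;
}

function computeBaseline(history: SessionStats[]): { avg: number; stdDev: number } {
  const perTurn = history.map(s => s.totalInputTokens / Math.max(s.turnCount, 1));
  const n = Math.max(perTurn.length, 1);
  const avg = perTurn.reduce((a, b) => a + b, 0) / n;
  const variance = perTurn.reduce((a, b) => a + (b - avg) ** 2, 0) / n;
  return { avg, stdDev: Math.sqrt(variance) };
}
```

The `{ avg, stdDev }` result plugs directly into `detectContextBloat`, so the 3σ threshold tracks each developer's own workload rather than a global constant.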
Step 3: Cost Attribution & Reporting
Token counts must be mapped to pricing tiers. Model routing discrepancies require cross-referencing configuration against actual execution logs.
```typescript
const PRICING_TIERS: Record<string, { inputPerM: number; outputPerM: number }> = {
  opus: { inputPerM: 15.0, outputPerM: 75.0 },
  sonnet: { inputPerM: 3.0, outputPerM: 15.0 },
  haiku: { inputPerM: 0.25, outputPerM: 1.25 }
};

function calculateSessionCost(session: SessionBucket, model: string): number {
  const tier = PRICING_TIERS[model.toLowerCase()] ?? PRICING_TIERS['sonnet'];
  const inputCost = (session.totalInputTokens / 1_000_000) * tier.inputPerM;
  const outputCost = (session.totalOutputTokens / 1_000_000) * tier.outputPerM;
  return inputCost + outputCost;
}

async function auditPipeline(
  logPath: string,
  userBaseline: { avg: number; stdDev: number }
): Promise<DetectorResult[]> {
  const findings: DetectorResult[] = [];
  await streamSessionLogs(logPath, (session) => {
    const activeModel = session.turns.find(t => t.model)?.model?.toLowerCase() ?? 'sonnet';
    const cost = calculateSessionCost(session, activeModel);
    const detectors = [
      () => detectRetryStorm(session),
      () => detectContextBloat(session, userBaseline.avg, userBaseline.stdDev),
      () => detectToolDuplication(session)
    ];
    for (const run of detectors) {
      const result = run();
      if (result) {
        result.evidence = { ...result.evidence, estimatedCost: cost };
        findings.push(result);
      }
    }
  });
  return findings;
}
```
Architecture Rationale: Cost calculation is decoupled from detection to allow pricing model updates without touching detection logic. The pipeline aggregates findings across all sessions, enabling trend analysis. Evidence objects carry cost estimates, which bridges the gap between technical metrics and financial impact.
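The findings table lists baseline drift as a pattern, but no detector for it appears in the pipeline above. One plausible cross-session sketch (the window size and drift factor here are illustrative defaults, not values from the audit):

```typescript
// Hypothetical baseline-drift detector: flags when the average cost of
// recent sessions rises well above the long-run average.
function detectBaselineDrift(
  sessionCosts: number[], // chronological per-session costs
  recentWindow = 7,
  driftFactor = 1.5
): boolean {
  if (sessionCosts.length < recentWindow * 2) return false; // not enough history
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const recent = mean(sessionCosts.slice(-recentWindow));
  const historical = mean(sessionCosts.slice(0, -recentWindow));
  return recent > historical * driftFactor;
}
```

Because drift only shows up across sessions, this detector runs on the aggregated findings stream rather than inside the per-session loop.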
Pitfall Guide
1. Single-Session Myopia
Explanation: Relying exclusively on real-time dashboards or /usage endpoints masks cross-session repetition. A single session may appear efficient, but repeating the same inefficient pattern across 50 sessions multiplies that waste across the entire month.
Fix: Implement session-agnostic pattern detectors that aggregate across time windows. Track behavioral recurrence, not just instantaneous token volume.
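A minimal sketch of the session-agnostic aggregation this fix describes: counting how often each pattern recurs across a window of findings (the finding shape is simplified from the `DetectorResult` type used earlier):

```typescript
// Hypothetical recurrence counter: a pattern that appears once per session
// looks cheap in isolation but surfaces here when it repeats across many.
function countRecurrence(findings: { pattern: string }[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const f of findings) {
    counts.set(f.pattern, (counts.get(f.pattern) ?? 0) + 1);
  }
  return counts;
}
```

Sorting the resulting map by count gives a direct priority order for which behavioral loop to interrupt first.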
2. Hardcoded Token Thresholds
Explanation: Applying absolute limits (e.g., "flag sessions over 40k tokens") generates false positives for researchers, architects, or complex debugging workflows that legitimately require large contexts.
Fix: Calculate per-user or per-project baselines using rolling averages and standard deviations. Trigger alerts only when sessions exceed statistical outliers (e.g., 2σ or 3σ).
3. Ignoring Tool Argument Similarity
Explanation: Counting tool invocations without analyzing their payloads misses semantic duplication. An agent running `grep "User"`, `grep "users"`, and `grep "\buser\b"` appears as three distinct calls but represents redundant work.
Fix: Normalize tool arguments, compute Levenshtein distance or token overlap, and flag sessions where argument similarity exceeds a defined threshold.
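One lightweight way to approximate the token-overlap option (Levenshtein is omitted for brevity; the regex tokenizer here is a crude illustrative choice, not a production normalizer):

```typescript
// Hypothetical argument-similarity check: lowercases, splits into word
// tokens, and computes Jaccard overlap between two tool-call arguments.
function normalizeArg(arg: string): Set<string> {
  return new Set(arg.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

function argSimilarity(a: string, b: string): number {
  const ta = normalizeArg(a);
  const tb = normalizeArg(b);
  if (ta.size === 0 && tb.size === 0) return 1;
  let shared = 0;
  for (const tok of ta) if (tb.has(tok)) shared++;
  return shared / (ta.size + tb.size - shared); // Jaccard index
}
```

A session would be flagged when many call pairs score above a chosen threshold; catching morphological variants like "user" vs. "users" additionally requires stemming, which this sketch does not attempt.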
4. Assuming Model Routing Matches Configuration
Explanation: Platform throttling, availability fallbacks, or routing quirks can silently upgrade sessions to higher-tier models. Configuration files reflect intent, not execution.
Fix: Cross-reference .claude/config.json or CLI flags against actual model fields in JSONL turns. Log discrepancies and apply cost deltas to attribution reports.
5. Treating Context Windows as Free Memory
Explanation: Developers often leave sessions open for days, assuming the agent "remembers" context efficiently. In reality, every turn re-processes the entire conversation history, multiplying input costs.
Fix: Enforce session recycling policies. Summarize state at ~50k tokens, archive the conversation, and initialize a fresh session with the summary as system context.
6. Over-Optimizing Trivial Tasks
Explanation: Focusing exclusively on reducing costs for simple operations (variable renames, lint fixes) while ignoring structural leaks yields minimal ROI. The highest-impact patterns are retry storms and context bloat.
Fix: Prioritize detection rules by estimated cost impact. Allocate engineering effort to interrupting high-frequency, high-multiplier loops first.
7. Unattended Execution Blind Spots
Explanation: Letting agents run overnight or during meetings without monitoring creates autopilot spend. The agent may enter infinite loops, retry failed operations, or explore low-value paths.
Fix: Implement time-bound execution windows. Require explicit approval for sessions exceeding duration or token thresholds. Log off-hours activity separately for cost allocation.
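A minimal off-hours tagger to support that separation. Hours are taken in UTC for determinism, and the 9-to-18 window is an illustrative default; a real implementation would use the team's local review window:

```typescript
// Hypothetical off-hours check: flags sessions whose start timestamp
// falls outside the review window (UTC, inclusive start / exclusive end).
function isOffHours(timestamp: string, startHour = 9, endHour = 18): boolean {
  const h = new Date(timestamp).getUTCHours();
  return h < startHour || h >= endHour;
}
```

Tagged sessions can then be rolled into a separate bucket in the attribution report so unattended spend is visible at a glance.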
Production Bundle
Action Checklist
- Deploy streaming log parser: Replace batch processing with incremental JSONL ingestion to prevent memory exhaustion.
- Establish per-user baselines: Calculate rolling 30-day averages for tokens per turn, tool calls, and session duration.
- Implement retry storm gate: Add a hard stop after 3 consecutive tool errors requiring manual review.
- Enforce explicit model routing: Replace sticky model preferences with per-session CLI flags or environment variables.
- Configure context recycling: Set automatic session summarization at 50k tokens with fresh initialization.
- Enable tool argument deduplication: Integrate payload similarity checks into the detection pipeline.
- Schedule weekly trend reviews: Automate baseline drift reports and distribute to engineering leads.
- Tag off-hours execution: Separate unattended sessions in billing reports for accurate cost attribution.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Routine refactoring across multiple files | Sonnet with explicit session boundaries | Lower token pricing, sufficient reasoning capacity | Reduces per-turn cost by ~80% |
| Novel architecture design | Opus with strict context limits | Higher reasoning quality justifies premium pricing | Acceptable if sessions stay <40k tokens |
| Debugging complex runtime errors | Sonnet + retry storm gate | Prevents exponential context growth during failure loops | Cuts retry-related spend by ~60% |
| Large codebase exploration | Haiku for grep/read, Sonnet for synthesis | Tiered tool usage matches task complexity | Optimizes I/O token allocation |
| Cross-team agent deployment | Centralized audit pipeline + per-project baselines | Normalizes detection thresholds across workloads | Enables accurate cost allocation |
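The matrix above can be collapsed into an explicit routing policy. The task taxonomy below is an illustrative simplification of the matrix rows, not part of any CLI or API:

```typescript
// Hypothetical routing policy derived from the decision matrix above.
type TaskKind = 'refactor' | 'architecture' | 'debug' | 'codebase_io' | 'synthesis';

function pickModel(task: TaskKind): 'haiku' | 'sonnet' | 'opus' {
  switch (task) {
    case 'architecture':
      return 'opus'; // novel design work justifies the premium tier
    case 'codebase_io':
      return 'haiku'; // grep/read exploration needs minimal reasoning
    default:
      return 'sonnet'; // refactoring, debugging, synthesis
  }
}
```

Encoding the policy as code makes model selection auditable: the audit pipeline can diff what `pickModel` would have chosen against the model fields actually logged.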
Configuration Template
```typescript
// audit.config.ts
export const AuditConfig = {
  logDirectory: process.env.AGENT_LOG_DIR ?? '~/.claude/projects',
  baselineWindow: 30, // days
  statisticalThreshold: 3, // sigma
  sessionTokenLimit: 50_000,
  retryFailureLimit: 3,
  toolDuplicationThreshold: 0.3,
  pricingModel: {
    opus: { inputPerM: 15.0, outputPerM: 75.0 },
    sonnet: { inputPerM: 3.0, outputPerM: 15.0 },
    haiku: { inputPerM: 0.25, outputPerM: 1.25 }
  },
  reporting: {
    format: 'json' as 'json' | 'csv' | 'table',
    outputDirectory: './audit-reports',
    retentionDays: 90
  },
  alerts: {
    enabled: true,
    channels: ['slack', 'email'],
    severityThreshold: 'high'
  }
};
```
Quick Start Guide
- Export your agent logs: Locate the JSONL session files in your agent's configuration directory. Ensure file permissions allow read access for the audit pipeline.
- Install dependencies: Run `npm install` with the provided `package.json`. The pipeline requires only Node.js 18+ and standard library modules.
- Configure baselines: Copy `audit.config.ts` to your project root. Adjust `baselineWindow` and `statisticalThreshold` to match your team's workflow cadence.
- Execute the pipeline: Run `node audit-pipeline.js --log-dir ./logs --output ./report.json`. The script streams logs, applies detectors, and generates a structured findings report.
- Review and iterate: Open the generated report. Prioritize high-severity patterns, apply the corresponding workflow fixes, and re-run the pipeline after 7 days to measure impact.
