- Tool call volume: Total external API or function invocations
- Token consumption: Cumulative input/output tokens across all LLM calls
- Execution duration: Wall-clock time from start to completion
Step 3: Implement Explicit Output Validation
Never assume completion implies success. Require an explicit output validation event that only fires after the agent's output passes a structural or semantic check. This event must precede the completion signal.
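A minimal sketch of that ordering, assuming the agent's output is a JSON string with hypothetical summary and artifacts fields. The names checkStructure and finishRun are illustrative, not part of any library. The validation event is recorded only when the structural check passes, and the completion event only follows it.

```typescript
// Hypothetical output shape, assumed for illustration only.
interface AgentOutput {
  summary: string;
  artifacts: string[];
}

type RunEvent = 'output_validated' | 'completion' | 'ghost_run';

// Structural check: parses the raw output and verifies the required fields.
function checkStructure(raw: string): AgentOutput | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== 'object' || parsed === null) return null;
    const candidate = parsed as Record<string, unknown>;
    const summary = candidate.summary;
    const artifacts = candidate.artifacts;
    if (typeof summary !== 'string' || summary.trim() === '') return null;
    if (!Array.isArray(artifacts) || artifacts.length === 0) return null;
    if (!artifacts.every((a) => typeof a === 'string')) return null;
    return { summary, artifacts: artifacts as string[] };
  } catch {
    return null; // Not valid JSON: treat as a failed check.
  }
}

// Returns the events in the order they would be emitted for this output.
function finishRun(raw: string): RunEvent[] {
  const output = checkStructure(raw);
  if (output === null) {
    return ['ghost_run']; // No validation event, so no completion signal.
  }
  return ['output_validated', 'completion'];
}
```

A semantic check (for example, comparing the summary against the task objective) could replace or follow checkStructure without changing the event order.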
Step 4: Establish Dynamic Baselines
Instead of hardcoding thresholds, maintain rolling statistical baselines per agent and task type. Calculate mean and standard deviation for each metric over the last 50-100 successful runs. Flag deviations exceeding 2 standard deviations.
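As a rough illustration of the calculation, the sketch below computes a mean and population standard deviation over the most recent runs and applies the 2-sigma rule. The function names and the sample iteration counts are made up for this example.

```typescript
interface BaselineStats {
  mean: number;
  std: number;
}

// Mean and population standard deviation over the most recent `windowSize` values.
function rollingStats(history: number[], windowSize: number): BaselineStats {
  const recent = history.slice(-windowSize);
  if (recent.length === 0) return { mean: 0, std: 0 };
  const mean = recent.reduce((sum, v) => sum + v, 0) / recent.length;
  const variance =
    recent.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / recent.length;
  return { mean, std: Math.sqrt(variance) };
}

// True when `value` lies more than `sigma` standard deviations from the mean.
function isDeviation(value: number, stats: BaselineStats, sigma: number = 2): boolean {
  if (stats.std === 0) return false; // No spread yet: nothing to compare against.
  return Math.abs(value - stats.mean) > sigma * stats.std;
}

// Iteration counts from past successful runs (made-up numbers).
const pastIterations = [8, 10, 12, 10, 9, 11, 10, 10];
const iterationStats = rollingStats(pastIterations, 50);
// mean = 10, std = sqrt(1.25), so the 2-sigma band is roughly 7.76 to 12.24.
```

With these numbers a run that finishes in 12 iterations stays inside the band, while a run that reaches 13 is flagged.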
Step 5: Trigger Anomaly Alerts
When runtime metrics deviate from the baseline, or when the output validation event is missing, trigger a mid-execution alert. Include context: current iteration count, token burn rate, elapsed time, and missing events.
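One possible shape for that alert context is sketched below. The RunSnapshot and AlertContext types and the REQUIRED_EVENTS list are assumptions made for this example; timestamps are passed in explicitly so the function stays easy to test.

```typescript
interface RunSnapshot {
  iterations: number;
  tokens: number;
  startedAtMs: number;
  seenEvents: string[];
}

interface AlertContext {
  iterations: number;
  elapsedMs: number;
  tokensPerSecond: number;
  missingEvents: string[];
}

// Events a healthy run is expected to have emitted; an assumption for this sketch.
const REQUIRED_EVENTS: string[] = ['output_validated'];

// Builds the context attached to an alert from the current state of a run.
function buildAlertContext(run: RunSnapshot, nowMs: number): AlertContext {
  const elapsedMs = Math.max(nowMs - run.startedAtMs, 0);
  // Burn rate is cumulative tokens divided by elapsed seconds.
  const tokensPerSecond = elapsedMs > 0 ? run.tokens / (elapsedMs / 1000) : 0;
  const missingEvents = REQUIRED_EVENTS.filter(
    (required) => run.seenEvents.indexOf(required) === -1
  );
  return { iterations: run.iterations, elapsedMs, tokensPerSecond, missingEvents };
}
```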
TypeScript Implementation
```typescript
import { EventEmitter } from 'events';

interface TelemetryEvent {
  type: 'iteration' | 'tool_call' | 'token_usage' | 'output_validated' | 'completion' | 'stall';
  payload: Record<string, unknown>;
  timestamp: number;
}

interface BehavioralBaseline {
  avgIterations: number;
  stdIterations: number;
  avgTokens: number;
  stdTokens: number;
  avgDurationMs: number;
  stdDurationMs: number;
  lastUpdated: Date;
}

class AgentTelemetry extends EventEmitter {
  private sessionId: string;
  private baseline: BehavioralBaseline;
  private metrics: { iterations: number; tokens: number; durationStart: number; toolCalls: number };
  private outputValidated: boolean = false;

  constructor(sessionId: string, baseline: BehavioralBaseline) {
    super();
    this.sessionId = sessionId;
    this.baseline = baseline;
    this.metrics = { iterations: 0, tokens: 0, durationStart: Date.now(), toolCalls: 0 };
  }

  // Call once per agent loop iteration.
  trackIteration(taskId: string): void {
    this.metrics.iterations++;
    this.emit('telemetry', {
      type: 'iteration',
      payload: { taskId, count: this.metrics.iterations },
      timestamp: Date.now()
    });
    this.checkDeviation('iterations', this.metrics.iterations);
  }

  // Call around every external API or function invocation.
  trackToolCall(toolName: string): void {
    this.metrics.toolCalls++;
    this.emit('telemetry', {
      type: 'tool_call',
      payload: { tool: toolName, total: this.metrics.toolCalls },
      timestamp: Date.now()
    });
  }

  // `count` is the token usage of a single LLM call; the total is accumulated here.
  trackTokens(count: number): void {
    this.metrics.tokens += count;
    this.emit('telemetry', {
      type: 'token_usage',
      payload: { tokens: count, cumulative: this.metrics.tokens },
      timestamp: Date.now()
    });
    this.checkDeviation('tokens', this.metrics.tokens);
  }

  // Call only after the output has passed a structural or semantic check.
  validateOutput(summary: string): void {
    this.outputValidated = true;
    this.emit('telemetry', {
      type: 'output_validated',
      payload: { summary },
      timestamp: Date.now()
    });
  }

  complete(): void {
    // Completion without a prior validation event is reported as a ghost run.
    if (!this.outputValidated) {
      this.emit('anomaly', {
        type: 'ghost_run',
        sessionId: this.sessionId,
        message: 'Completion signaled without output validation',
        metrics: this.metrics
      });
      return;
    }
    const duration = Date.now() - this.metrics.durationStart;
    this.checkDeviation('duration', duration);
    this.emit('telemetry', {
      type: 'completion',
      payload: { duration, metrics: this.metrics },
      timestamp: Date.now()
    });
  }

  private checkDeviation(metric: 'iterations' | 'tokens' | 'duration', value: number): void {
    // Maps each metric to its mean and standard deviation fields on the baseline.
    const keys = {
      iterations: ['avgIterations', 'stdIterations'],
      tokens: ['avgTokens', 'stdTokens'],
      duration: ['avgDurationMs', 'stdDurationMs']
    } as const;
    const [avgKey, stdKey] = keys[metric];
    const avg = this.baseline[avgKey];
    const std = this.baseline[stdKey];

    // Iterations and tokens are running totals checked mid-run, so only an
    // overshoot is meaningful (a run always starts below the average).
    // Duration is checked once at completion, so both directions count.
    const delta = value - avg;
    const exceeded = metric === 'duration' ? Math.abs(delta) > 2 * std : delta > 2 * std;

    if (std > 0 && exceeded) {
      this.emit('anomaly', {
        type: 'baseline_deviation',
        sessionId: this.sessionId,
        metric,
        currentValue: value,
        baseline: { avg, std },
        timestamp: Date.now()
      });
    }
  }
}
```
Architecture Decisions & Rationale
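- Event-Driven Emitter Pattern: Decouples telemetry collection from alerting logic. The agent loop only calls tracking methods; the baseline checks run inline and are cheap, while notification routing lives in listeners. Node's EventEmitter invokes listeners synchronously, so listeners should only enqueue or hand off work rather than call alerting services directly. That keeps telemetry overhead from blocking the agent's execution path.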
- Explicit Output Validation Hook: Separates task completion from output verification. By requiring validateOutput() to fire before complete(), the system catches ghost runs where the agent self-reports success without producing usable artifacts.
- Dynamic Baseline Comparison: Hardcoded thresholds fail across different task complexities and model versions. Rolling statistical baselines adapt to normal variance and only flag true anomalies. The 2-sigma threshold balances sensitivity with false positive reduction.
- Cumulative Token Tracking: LLM costs can grow faster than linearly with iteration count, because each iteration typically resends the accumulated context. Tracking cumulative tokens per session enables real-time cost containment alerts before the bill spikes.
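Node's EventEmitter calls listeners synchronously, in registration order, so a slow anomaly listener would run inside the agent's tracking call. One way to keep the loop unblocked, sketched here under the assumption that alerts can tolerate a short delay, is for the listener to only enqueue while a separate timer drains the queue. AnomalyBuffer and its drop-oldest capacity policy are hypothetical, not part of the implementation above.

```typescript
interface AnomalyRecord {
  type: string;
  receivedAtMs: number;
}

// Bounded in-memory buffer: the listener only enqueues, so the agent loop
// pays for an array push rather than for a network call.
class AnomalyBuffer {
  private queue: AnomalyRecord[] = [];
  private dropped = 0;
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Cheap and synchronous: safe to call directly from an event listener.
  enqueue(record: AnomalyRecord): void {
    if (this.queue.length >= this.capacity) {
      this.queue.shift(); // Drop the oldest record when full.
      this.dropped++;
    }
    this.queue.push(record);
  }

  // Called off the hot path (for example from a timer) to hand records to a sink.
  // Returns the number of records delivered.
  drain(sink: (batch: AnomalyRecord[]) => void): number {
    const batch = this.queue;
    this.queue = [];
    if (batch.length > 0) sink(batch);
    return batch.length;
  }

  get droppedCount(): number {
    return this.dropped;
  }
}
```

A listener registered for the anomaly event would call enqueue, and a periodic task would call drain with the function that actually posts to the alerting channel.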
Pitfall Guide
1. Relying on Exit Codes and Standard Logs
Explanation: Exit code 0 and clean logs only indicate the process terminated without unhandled exceptions. They provide zero insight into whether the agent's output matches the task objective.
Fix: Implement semantic validation hooks that run after each agent iteration. Treat exit codes as infrastructure signals, not task success indicators.
2. Hardcoding Static Thresholds
Explanation: Fixed limits (e.g., "alert if iterations > 10") break when task complexity changes, models are upgraded, or input data varies. They generate constant false positives or miss real anomalies.
Fix: Maintain rolling baselines per agent-task combination. Update averages and standard deviations after each successful run. Use statistical process control instead of rigid limits.
3. Skipping Explicit Output Validation
Explanation: Assuming complete() implies success allows ghost runs to propagate. Agents frequently self-evaluate against loose criteria and mark tasks done without generating meaningful output.
Fix: Require an explicit validation event that only fires after structural or semantic checks pass. Block completion signaling until validation succeeds.
4. Ignoring Token Accumulation Patterns
Explanation: Monitoring per-call token usage misses cumulative burn. An agent making 50 calls at 200 tokens each looks normal per call but consumes 10,000 tokens total, triggering cost overruns.
Fix: Track cumulative tokens per session. Compare against baseline cumulative averages. Implement mid-run cost alerts that trigger before the run finishes.
5. Treating Latency as Progress
Explanation: Normal API latency or steady response times do not indicate forward progress. An agent can wait indefinitely on an external dependency, or cycle through a reasoning loop, while every individual LLM call still returns within normal latency.
Fix: Track iteration velocity and tool call frequency. Implement stall detection by measuring time between telemetry events. Alert when event gaps exceed baseline expectations.
6. Over-Instrumenting Without Contextual Grouping
Explanation: Emitting telemetry for every minor operation creates noise and makes baseline learning impossible. Without grouping events by task type or agent version, baselines become meaningless averages.
Fix: Tag telemetry events with agentId, taskType, and modelVersion. Maintain separate baselines per combination. Aggregate metrics at the session level, not the individual call level.
7. Failing to Handle Async External Dependencies
Explanation: Agents frequently pause for external APIs, database writes, or human-in-the-loop approvals. Standard timeouts fire prematurely, or stalls go undetected because the process remains alive.
Fix: Implement explicit stall markers that pause the baseline timer during known external waits. Resume tracking when the dependency resolves. Alert only when stall duration exceeds the expected external response window.
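The stall handling described in pitfalls 5 and 7 could be sketched as a small detector that measures the gap since the last telemetry event and switches to a separate wait window while a known external dependency is pending. StallDetector and its method names are hypothetical; all timestamps are passed in explicitly, and the thresholds would normally come from the baseline rather than being fixed.

```typescript
// Tracks the gap since the last telemetry event. While a declared external
// wait is in progress, only the wait window is enforced.
class StallDetector {
  private lastEventMs: number;
  private waitStartedMs: number | null = null;
  private maxGapMs: number;
  private maxWaitMs: number;

  constructor(startMs: number, maxGapMs: number, maxWaitMs: number) {
    this.lastEventMs = startMs;
    this.maxGapMs = maxGapMs;
    this.maxWaitMs = maxWaitMs;
  }

  // Call on every telemetry event (iteration, tool call, token usage).
  recordEvent(nowMs: number): void {
    this.lastEventMs = nowMs;
  }

  // Stall marker: a known external wait begins (API call, human approval).
  beginWait(nowMs: number): void {
    this.waitStartedMs = nowMs;
  }

  // The dependency resolved; restart the gap clock from now.
  endWait(nowMs: number): void {
    this.waitStartedMs = null;
    this.lastEventMs = nowMs;
  }

  // Poll periodically. During a wait, only the expected response window applies.
  isStalled(nowMs: number): boolean {
    if (this.waitStartedMs !== null) {
      return nowMs - this.waitStartedMs > this.maxWaitMs;
    }
    return nowMs - this.lastEventMs > this.maxGapMs;
  }
}
```

Keeping the two windows separate means a slow approval step does not trip the event-gap alert, while a dependency that never resolves is still caught.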
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume batch processing | Cumulative token tracking + rolling baselines | Prevents silent cost accumulation across thousands of runs | Reduces unexpected LLM spend by 60-80% |
| Real-time conversational agent | Per-call latency + iteration velocity monitoring | Maintains response SLAs while detecting reasoning loops | Minimal infra cost, prevents token waste |
| Multi-step workflow with external APIs | Stall markers + event gap monitoring | Distinguishes healthy external waits from true hangs | Avoids premature timeout retries and duplicate charges |
| New agent deployment | Strict validation hooks + conservative 1.5-sigma alerts | Catches configuration errors before baseline stabilizes | Higher initial alert volume, prevents downstream corruption |
Configuration Template
```typescript
// telemetry.config.ts
export const TELEMETRY_CONFIG = {
  session: {
    ttlMs: 3600000,            // 1 hour max session duration
    cleanupIntervalMs: 300000  // 5 min cleanup interval
  },
  baselines: {
    windowSize: 75,       // Rolling window of successful runs
    updateThreshold: 10,  // Min runs before baseline activates
    deviationSigma: 2.0   // Alert threshold
  },
  metrics: {
    trackIterations: true,
    trackToolCalls: true,
    trackTokens: true,
    trackDuration: true,
    requireOutputValidation: true
  },
  alerts: {
    channels: ['pagerduty', 'slack'],
    cooldownMs: 300000,   // 5 min cooldown per session
    includeContext: true  // Attach current metrics to alert payload
  }
};
```

```typescript
// baseline.manager.ts
// Assumes telemetry.config.ts sits in the same directory.
import { TELEMETRY_CONFIG } from './telemetry.config';

export class BaselineManager {
  private store: Map<string, number[]> = new Map();

  // Appends a value and trims the history to the rolling window.
  record(metric: string, value: number): void {
    if (!this.store.has(metric)) this.store.set(metric, []);
    const history = this.store.get(metric)!;
    history.push(value);
    if (history.length > TELEMETRY_CONFIG.baselines.windowSize) {
      history.shift();
    }
  }

  // Returns zeroed stats until enough runs are recorded. A std of 0 keeps
  // AgentTelemetry.checkDeviation from raising alerts on an immature baseline.
  getStats(metric: string): { avg: number; std: number; count: number } {
    const data = this.store.get(metric) || [];
    if (data.length < TELEMETRY_CONFIG.baselines.updateThreshold) {
      return { avg: 0, std: 0, count: data.length };
    }
    const avg = data.reduce((a, b) => a + b, 0) / data.length;
    // Population variance over the current window.
    const variance = data.reduce((a, b) => a + Math.pow(b - avg, 2), 0) / data.length;
    return { avg, std: Math.sqrt(variance), count: data.length };
  }
}
```
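The BaselineManager above keys its history by metric name alone. To keep separate baselines per agent, task type, and model version, as pitfall 6 recommends, the key could be composed from those tags. The sketch below is one way to do that; BaselineTags, baselineKey, and GroupedHistory are hypothetical names, and each part is URI-encoded so that a separator character inside a tag cannot make two different combinations collide.

```typescript
interface BaselineTags {
  agentId: string;
  taskType: string;
  modelVersion: string;
}

// Composite key so each agent/task/model combination gets its own history.
function baselineKey(tags: BaselineTags, metric: string): string {
  return [tags.agentId, tags.taskType, tags.modelVersion, metric]
    .map((part) => encodeURIComponent(part))
    .join('|');
}

// Rolling history per composite key, trimmed to a fixed window.
class GroupedHistory {
  private store = new Map<string, number[]>();
  private windowSize: number;

  constructor(windowSize: number) {
    this.windowSize = windowSize;
  }

  record(tags: BaselineTags, metric: string, value: number): void {
    const key = baselineKey(tags, metric);
    const values = this.store.get(key) || [];
    values.push(value);
    if (values.length > this.windowSize) values.shift();
    this.store.set(key, values);
  }

  // Returns a copy so callers cannot mutate the stored window.
  get(tags: BaselineTags, metric: string): number[] {
    return (this.store.get(baselineKey(tags, metric)) || []).slice();
  }
}
```

With this grouping, a model upgrade starts from an empty history instead of skewing the previous version's averages.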
Quick Start Guide
- Initialize the telemetry session: Create a new AgentTelemetry instance at the start of your agent loop. Pass a unique session ID and load the baseline for the current task type.
- Hook into the execution loop: Replace your existing iteration counter with trackIteration(taskId). Add trackToolCall(toolName) around every external function invocation.
- Instrument token tracking: After each LLM response, extract usage.total_tokens and pass it to trackTokens(count). Ensure cumulative tracking, not per-call replacement.
- Add output validation: Before calling complete(), run a lightweight structural check on the agent's output. If it passes, call validateOutput(summary). If it fails, trigger a retry or fallback path instead of completing.
- Deploy anomaly routing: Subscribe to the anomaly event on the AgentTelemetry instance. Route deviations to your alerting pipeline with full metric context. Test with a simulated ghost run (call complete() without validateOutput()) to verify that the anomaly fires.
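For the routing step, a minimal sketch of a router that applies the cooldown from the configuration template might look like the following. AlertRouter and the Anomaly shape are hypothetical, and keying the cooldown on session plus anomaly type (rather than session alone) is an assumption made here so that a ghost run is not suppressed by an earlier baseline deviation.

```typescript
interface Anomaly {
  sessionId: string;
  type: string;
  atMs: number;
}

// Suppresses repeat alerts for the same session and anomaly type
// that arrive within the cooldown window.
class AlertRouter {
  private lastSent = new Map<string, number>();
  private cooldownMs: number;
  private send: (anomaly: Anomaly) => void;

  constructor(cooldownMs: number, send: (anomaly: Anomaly) => void) {
    this.cooldownMs = cooldownMs;
    this.send = send;
  }

  // Returns true if the anomaly was forwarded, false if it was suppressed.
  route(anomaly: Anomaly): boolean {
    const key = `${anomaly.sessionId}:${anomaly.type}`;
    const last = this.lastSent.get(key);
    if (last !== undefined && anomaly.atMs - last < this.cooldownMs) {
      return false;
    }
    this.lastSent.set(key, anomaly.atMs);
    this.send(anomaly);
    return true;
  }
}
```

The anomaly listener would build an Anomaly from the emitted payload and pass it to route, with send wrapping whichever channel client the deployment uses.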