Step 1: Trace Normalization
Raw agent logs are inconsistent. Some frameworks emit JSON, others emit markdown, and tool outputs vary wildly. The first stage converts heterogeneous logs into a unified event schema.
```typescript
interface TraceEvent {
  stepId: string;
  timestamp: number;
  agentRole: string;
  action: 'tool_call' | 'tool_response' | 'reasoning' | 'final_answer';
  input: Record<string, unknown>;
  output: Record<string, unknown>;
  status: 'success' | 'failure' | 'timeout';
}

function normalizeRawLog(rawLines: string[]): TraceEvent[] {
  const events: TraceEvent[] = [];
  rawLines
    .filter(line => line.trim().length > 0)
    .forEach((line, idx) => {
      let parsed: any;
      try {
        parsed = JSON.parse(line);
      } catch {
        return; // skip non-JSON lines (e.g. markdown-emitting frameworks)
      }
      events.push({
        stepId: `evt_${idx.toString().padStart(4, '0')}`,
        timestamp: parsed.ts ?? Date.now(),
        agentRole: parsed.agent ?? 'default',
        action: parsed.type as TraceEvent['action'],
        input: parsed.payload?.input ?? {},
        output: parsed.payload?.output ?? {},
        status: parsed.status ?? 'success'
      });
    });
  return events;
}
```
Architecture Rationale: Normalization happens upfront to guarantee that downstream detectors operate on a consistent schema. This eliminates framework-specific parsing logic from the detection layer and allows detectors to be framework-agnostic.
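As a sanity check, the per-line mapping can be exercised against a single hypothetical JSONL record. The `ts`/`agent`/`type`/`payload` field names mirror the parser above; the sample values are invented:

```typescript
// Compact, self-contained restatement of the per-line mapping in normalizeRawLog.
type Action = 'tool_call' | 'tool_response' | 'reasoning' | 'final_answer';

const rawLine =
  '{"ts": 1718000000000, "agent": "planner", "type": "tool_call", "payload": {"input": {"query": "weather in Paris"}}}';

const parsed = JSON.parse(rawLine);
const event = {
  stepId: `evt_${(0).toString().padStart(4, '0')}`, // first line of the trace
  timestamp: parsed.ts ?? Date.now(),
  agentRole: parsed.agent ?? 'default',
  action: parsed.type as Action,
  input: parsed.payload?.input ?? {},
  output: parsed.payload?.output ?? {},
  status: parsed.status ?? 'success'
};
// event.stepId is 'evt_0000'; event.agentRole is 'planner'; status falls back to 'success'
```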
Step 2: Deterministic Detection Registry
Structural failures are identified by purpose-built matchers. Each detector operates independently, scoring events against a known failure signature.
```typescript
interface DetectionResult {
  detectorId: string;
  severity: number; // 0-100
  matchedSteps: string[];
  evidence: Record<string, unknown>;
}

type Detector = (events: TraceEvent[]) => DetectionResult | null;

const structuralDetectors: Record<string, Detector> = {
  loop_detection: (events) => {
    const toolCalls = events.filter(e => e.action === 'tool_call');
    const stateHashes = toolCalls.map(e => JSON.stringify(e.input));
    // Track the longest run of identical consecutive inputs and where it ends.
    let current = 1, max = 1, runEnd = 0;
    for (let i = 1; i < stateHashes.length; i++) {
      current = stateHashes[i] === stateHashes[i - 1] ? current + 1 : 1;
      if (current > max) { max = current; runEnd = i; }
    }
    if (max >= 3) {
      return {
        detectorId: 'loop_detection',
        severity: 85,
        // Report the steps inside the repeating run, not merely the trailing calls.
        matchedSteps: toolCalls.slice(runEnd - max + 1, runEnd + 1).map(e => e.stepId),
        evidence: { repeatCount: max }
      };
    }
    return null;
  },
  context_neglect: (events) => {
    // input/output fields are untyped records, so coerce to string before matching.
    const initialInput = String(events[0]?.input?.prompt ?? '');
    // Crude key-element extraction: 4-digit numbers, capitalized words, URLs.
    const keyElements = initialInput.match(/\b\d{4}\b|\b[A-Z][a-z]+\b|https?:\/\/\S+/g) ?? [];
    if (keyElements.length === 0) return null;
    const finalOutput = String(events[events.length - 1]?.output?.text ?? '');
    const matchedElements = keyElements.filter(el => finalOutput.includes(el));
    const coverageRatio = matchedElements.length / keyElements.length;
    if (coverageRatio < 0.4) {
      return {
        detectorId: 'context_neglect',
        severity: 70,
        matchedSteps: [events[0].stepId, events[events.length - 1].stepId],
        evidence: { coverageRatio, missingElements: keyElements.filter(el => !finalOutput.includes(el)) }
      };
    }
    return null;
  }
};
```
Architecture Rationale: Detectors are isolated functions that return null when no failure is found. This design enables parallel execution, easy unit testing, and deterministic scoring. The severity values shown above are starting points: calibrate thresholds on validation sets rather than treating them as fixed constants, so false positives do not cascade into production alerts.
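Because each detector is a pure function over the same event list, the fan-out harness stays tiny. A minimal self-contained sketch, with the registry stubbed down to a single loop detector and all names illustrative:

```typescript
// Minimal harness: run every registered detector and keep the non-null results.
interface TraceEvent { stepId: string; action: string; input: Record<string, unknown>; }
interface DetectionResult { detectorId: string; severity: number; }
type Detector = (events: TraceEvent[]) => DetectionResult | null;

const registry: Record<string, Detector> = {
  loop_detection: (events) => {
    const hashes = events.filter(e => e.action === 'tool_call').map(e => JSON.stringify(e.input));
    let run = 1, max = hashes.length > 0 ? 1 : 0;
    for (let i = 1; i < hashes.length; i++) {
      run = hashes[i] === hashes[i - 1] ? run + 1 : 1;
      max = Math.max(max, run);
    }
    return max >= 3 ? { detectorId: 'loop_detection', severity: 85 } : null;
  }
};

function runDetectors(events: TraceEvent[]): DetectionResult[] {
  return Object.values(registry)
    .map(detect => detect(events))
    .filter((r): r is DetectionResult => r !== null);
}
```

Iterating `Object.values` over the registry also makes it trivial to run detectors concurrently with `Promise.all` if any of them ever become async.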
Step 3: Conditional LLM Escalation
When structural detectors return no matches, or when multi-agent blame attribution is required, the pipeline routes the trace to an LLM judge. This tier is expensive, so it only activates when necessary.
```typescript
interface EscalationRequest {
  traceId: string;
  unresolvedSteps: string[];
  multiAgentContext: boolean;
}

interface AttributionResult {
  responsibleAgent: string;
  failureStep: string;
  causalReason: string;
}

async function escalateToLLM(request: EscalationRequest): Promise<AttributionResult> {
  const prompt = `
Analyze the following agent trace for causal failure attribution.
Trace ID: ${request.traceId}
Steps requiring analysis: ${request.unresolvedSteps.join(', ')}
Multi-agent environment: ${request.multiAgentContext}
Return JSON with fields: { responsibleAgent: string, failureStep: string, causalReason: string }
`;
  // Assumes an OpenAI-compatible client (e.g. a gateway) fronting the Anthropic model.
  const response = await llmClient.chat.completions.create({
    model: 'claude-sonnet-4-20250514',
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' }
  });
  const content = response.choices[0]?.message?.content;
  if (!content) throw new Error('LLM returned empty attribution response');
  return JSON.parse(content) as AttributionResult;
}
```
Architecture Rationale: Escalation is gated by two conditions: (1) structural detectors found zero matches, indicating a novel or semantic failure, and (2) the trace involves multiple agents where blame cannot be resolved through tool ownership alone. This keeps LLM costs predictable and ensures the model focuses exclusively on causal reasoning rather than pattern matching.
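The two gates can be captured in a single predicate ahead of the LLM call. This is a sketch with assumed names, matching the `no_structural_match OR multi_agent_attribution` trigger in the configuration template:

```typescript
interface DetectionResult { detectorId: string; severity: number; }

// Gate 1: no structural detector fired, suggesting a novel or semantic failure.
// Gate 2: the caller explicitly requested multi-agent blame attribution.
function shouldEscalate(results: DetectionResult[], multiAgentAttribution: boolean): boolean {
  const noStructuralMatch = results.length === 0;
  return noStructuralMatch || multiAgentAttribution;
}
```

Everything that fails this predicate stays in the cheap deterministic tier; only the remainder pays for a model call.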
Pitfall Guide
1. Semantic Overreach on Structural Problems
Explanation: Routing loop detection, repetition counting, or tool success verification through an LLM. Language models are probabilistic and will occasionally miscount or hallucinate repetition patterns.
Fix: Replace LLM calls with deterministic state hashing, regex matching, or set-based overlap calculations. Reserve LLMs for causal chains and intent analysis.
2. Conflating Detection with Attribution
Explanation: Asking a single LLM prompt to both identify that a failure occurred and determine which agent caused it. This forces the model to perform two distinct cognitive tasks simultaneously, degrading accuracy on both.
Fix: Decouple the pipeline. Run structural detectors first to flag failures. Only escalate to LLM when attribution is explicitly required or when detection yields zero results.
3. Hardcoded Severity Thresholds
Explanation: Setting fixed pass/fail boundaries (e.g., if coverage < 0.5 then fail) without calibrating against production data. Thresholds that work on benchmarks often trigger false positives on noisy real-world traces.
Fix: Implement dynamic thresholding using percentile-based calibration on a validation set. Log detector scores continuously and adjust boundaries based on precision/recall trade-offs.
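One way to implement that calibration, sketched under the assumption that detector scores are logged as plain numbers (function name is illustrative):

```typescript
// Derive an alert threshold from the empirical score distribution
// instead of a fixed constant.
function percentileThreshold(scores: number[], percentile: number): number {
  const sorted = [...scores].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((percentile / 100) * sorted.length));
  return sorted[idx];
}
```

For example, alert only on scores above the 95th percentile of a week of validation traces, then tighten or loosen the percentile based on observed precision.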
4. Ignoring Sliding Window Context
Explanation: Analyzing each trace step in isolation. Many failures (e.g., goal deviation, context drift) only become visible when comparing early steps against late steps.
Fix: Maintain a rolling context window that preserves initial constraints, user requirements, and intermediate state. Compare final outputs against the full window, not just the immediate predecessor step.
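A minimal shape for such a window, keeping the opening constraints pinned while bounding intermediate state (field names and window size are illustrative, not from the source):

```typescript
interface ContextWindow {
  initialConstraints: string[]; // captured at step 0, never evicted
  recentSteps: string[];        // bounded FIFO of intermediate steps
}

function pushStep(win: ContextWindow, step: string, maxRecent = 20): void {
  win.recentSteps.push(step);
  if (win.recentSteps.length > maxRecent) win.recentSteps.shift();
}
```

Final-output checks then compare against `initialConstraints` plus the entire `recentSteps` buffer, not just the immediately preceding step.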
5. Cost Blindness in Escalation Logic
Explanation: Routing every trace to an LLM judge because "it's safer." This quickly erodes margins and introduces latency that breaks real-time agent loops.
Fix: Implement strict escalation gates. Only call the LLM when structural detectors return null or when multi-agent attribution is explicitly requested. Cache attribution results for identical trace patterns.
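The caching half of that fix can be as simple as memoizing verdicts by a stable trace signature, so identical failure patterns pay for exactly one judge call. All names here are assumed:

```typescript
const attributionCache = new Map<string, string>();

function traceSignature(toolCalls: { name: string; input: unknown }[]): string {
  // Stable JSON of the (tool, input) sequence stands in for a real hash.
  return JSON.stringify(toolCalls.map(c => [c.name, c.input]));
}

function cachedAttribution(sig: string, judge: () => string): string {
  const hit = attributionCache.get(sig);
  if (hit !== undefined) return hit;
  const verdict = judge();        // expensive LLM call happens only on a miss
  attributionCache.set(sig, verdict);
  return verdict;
}
```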
6. Novelty Blindness in Heuristic Systems
Explanation: Assuming rule-based detectors cover all failure modes. Heuristics only catch known patterns. A new tool failure mode, prompt injection, or emergent agent behavior will bypass them entirely.
Fix: Maintain an LLM catch-all tier specifically for out-of-distribution traces. Route traces that pass all structural checks but still exhibit anomalous latency, token usage, or confidence scores to the LLM for semantic review.
7. Precision Neglect in Benchmarking
Explanation: Optimizing exclusively for recall (catching every failure) without measuring false positive rates. High recall with low precision creates alert fatigue and erodes engineering trust in the observability stack.
Fix: Track precision alongside recall. On the TRAIL benchmark, heuristic systems achieved 100% precision across 481 detections. Prioritize precision in production; it is easier to tune recall upward than to recover from a system that cries wolf.
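Tracking both metrics requires only the confusion counts; a minimal helper (name assumed, with the usual convention that degenerate cases default to 1.0):

```typescript
// Precision/recall from labeled detector outcomes: true positives,
// false positives, and false negatives.
function precisionRecall(tp: number, fp: number, fn: number): { precision: number; recall: number } {
  return {
    precision: tp + fp === 0 ? 1 : tp / (tp + fp),
    recall: tp + fn === 0 ? 1 : tp / (tp + fn),
  };
}
```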
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume single-agent monitoring | Heuristic-only pipeline | Structural failures dominate; LLM adds latency with minimal gain | ~$0.001/trace |
| Multi-agent blame attribution | Tiered hybrid (Heuristic + Sonnet 4) | Causal chains require semantic reasoning; heuristics identify failure steps | ~$0.02/trace |
| Novel failure hunting / OOD detection | LLM catch-all tier | Heuristics miss unseen patterns; LLM provides semantic coverage | ~$0.08/trace |
| Compliance / audit logging | Heuristic-only + structured JSON output | Deterministic scoring ensures reproducibility and auditability | ~$0.00/trace |
| Real-time agent loop debugging | Heuristic-only with sub-100ms latency | LLM latency breaks interactive debugging workflows | ~$0.00/trace |
Configuration Template
```yaml
trace_pipeline:
  normalization:
    schema_version: "2.1"
    timestamp_format: "iso8601"
    max_events_per_trace: 500
  detectors:
    loop_detection:
      enabled: true
      min_consecutive_repeats: 3
      severity_weight: 0.85
    context_neglect:
      enabled: true
      min_coverage_ratio: 0.4
      element_types: ["dates", "urls", "proper_nouns", "numbers"]
    tool_failure_correlation:
      enabled: true
      max_error_rate: 0.15
      severity_weight: 0.75
    specification_coverage:
      enabled: true
      matching_strategy: "stem_and_synonym"
      min_keyword_overlap: 0.6
  escalation:
    trigger_condition: "no_structural_match OR multi_agent_attribution"
    model: "claude-sonnet-4-20250514"
    max_tokens: 512
    response_format: "json_object"
    cost_budget_per_day_usd: 5.00
  output:
    format: "structured_json"
    include_evidence: true
    severity_threshold_alert: 70
```
Quick Start Guide
- Install dependencies: Add your trace parsing library and LLM client SDK to your project. Ensure TypeScript 5.4+ is configured.
- Define your event schema: Map your agent framework's raw logs to the `TraceEvent` interface. Handle missing fields with sensible defaults.
- Register detectors: Import the structural detector registry and attach it to your pipeline. Run a dry pass on 50 historical traces to calibrate severity thresholds.
- Configure escalation rules: Set the escalation trigger to activate only when structural detectors return `null` or when multi-agent attribution is explicitly requested. Point the LLM client to Sonnet 4 for attribution tasks.
- Deploy and monitor: Route production traces through the pipeline. Log detector scores, escalation rates, and precision/recall metrics. Adjust thresholds weekly based on alert fatigue and missed failure rates.