Step 1: Trace Normalization
Raw agent logs are inconsistent. Some frameworks emit JSON, others emit markdown, and tool outputs vary wildly. The first stage converts heterogeneous logs into a unified event schema.
```typescript
interface TraceEvent {
  stepId: string;
  timestamp: number;
  agentRole: string;
  action: 'tool_call' | 'tool_response' | 'reasoning' | 'final_answer';
  input: Record<string, unknown>;
  output: Record<string, unknown>;
  status: 'success' | 'failure' | 'timeout';
}

function normalizeRawLog(rawLines: string[]): TraceEvent[] {
  const events: TraceEvent[] = [];
  rawLines
    .filter(line => line.trim().length > 0)
    .forEach((line, idx) => {
      let parsed: any;
      try {
        parsed = JSON.parse(line);
      } catch {
        return; // skip non-JSON lines (e.g. markdown-emitting frameworks)
      }
      events.push({
        stepId: `evt_${idx.toString().padStart(4, '0')}`,
        timestamp: parsed.ts ?? Date.now(),
        agentRole: parsed.agent ?? 'default',
        action: parsed.type as TraceEvent['action'],
        input: parsed.payload?.input ?? {},
        output: parsed.payload?.output ?? {},
        status: parsed.status ?? 'success'
      });
    });
  return events;
}
```
Architecture Rationale: Normalization happens upfront to guarantee that downstream detectors operate on a consistent schema. This eliminates framework-specific parsing logic from the detection layer and allows detectors to be framework-agnostic.
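As a sanity check, the per-line mapping can be exercised against a single hypothetical JSONL record. The `ts`/`agent`/`type`/`payload` field names mirror the parser above; the sample values are invented:

```typescript
// Compact, self-contained restatement of the per-line mapping in normalizeRawLog.
type Action = 'tool_call' | 'tool_response' | 'reasoning' | 'final_answer';

const rawLine =
  '{"ts": 1718000000000, "agent": "planner", "type": "tool_call", "payload": {"input": {"query": "weather in Paris"}}}';

const parsed = JSON.parse(rawLine);
const event = {
  stepId: `evt_${(0).toString().padStart(4, '0')}`, // first line of the trace
  timestamp: parsed.ts ?? Date.now(),
  agentRole: parsed.agent ?? 'default',
  action: parsed.type as Action,
  input: parsed.payload?.input ?? {},
  output: parsed.payload?.output ?? {},
  status: parsed.status ?? 'success'
};
// event.stepId is 'evt_0000'; event.agentRole is 'planner'; status falls back to 'success'
```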
Step 2: Deterministic Detection Registry
Structural failures are identified by purpose-built matchers. Each detector operates independently, scoring events against a known failure signature.
```typescript
interface DetectionResult {
  detectorId: string;
  severity: number; // 0-100
  matchedSteps: string[];
  evidence: Record<string, unknown>;
}

type Detector = (events: TraceEvent[]) => DetectionResult | null;

const structuralDetectors: Record<string, Detector> = {
  loop_detection: (events) => {
    const toolCalls = events.filter(e => e.action === 'tool_call');
    const stateHashes = toolCalls.map(e => JSON.stringify(e.input));
    // Track the longest run of identical consecutive inputs and where it ends.
    let current = 1, max = 1, runEnd = 0;
    for (let i = 1; i < stateHashes.length; i++) {
      current = stateHashes[i] === stateHashes[i - 1] ? current + 1 : 1;
      if (current > max) { max = current; runEnd = i; }
    }
    if (max >= 3) {
      return {
        detectorId: 'loop_detection',
        severity: 85,
        // Report the steps inside the repeating run, not merely the trailing calls.
        matchedSteps: toolCalls.slice(runEnd - max + 1, runEnd + 1).map(e => e.stepId),
        evidence: { repeatCount: max }
      };
    }
    return null;
  },
  context_neglect: (events) => {
    // input/output fields are untyped records, so coerce to string before matching.
    const initialInput = String(events[0]?.input?.prompt ?? '');
    // Crude key-element extraction: 4-digit numbers, capitalized words, URLs.
    const keyElements = initialInput.match(/\b\d{4}\b|\b[A-Z][a-z]+\b|https?:\/\/\S+/g) ?? [];
    if (keyElements.length === 0) return null;
    const finalOutput = String(events[events.length - 1]?.output?.text ?? '');
    const matchedElements = keyElements.filter(el => finalOutput.includes(el));
    const coverageRatio = matchedElements.length / keyElements.length;
    if (coverageRatio < 0.4) {
      return {
        detectorId: 'context_neglect',
        severity: 70,
        matchedSteps: [events[0].stepId, events[events.length - 1].stepId],
        evidence: { coverageRatio, missingElements: keyElements.filter(el => !finalOutput.includes(el)) }
      };
    }
    return null;
  }
};
```
Architecture Rationale: Detectors are isolated functions that return null when no failure is found. This design enables parallel execution, easy unit testing, and deterministic scoring. The severity values shown above are starting points: calibrate thresholds on validation sets rather than treating them as fixed constants, so false positives do not cascade into production alerts.
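Because each detector is a pure function over the same event list, the fan-out harness stays tiny. A minimal self-contained sketch, with the registry stubbed down to a single loop detector and all names illustrative:

```typescript
// Minimal harness: run every registered detector and keep the non-null results.
interface TraceEvent { stepId: string; action: string; input: Record<string, unknown>; }
interface DetectionResult { detectorId: string; severity: number; }
type Detector = (events: TraceEvent[]) => DetectionResult | null;

const registry: Record<string, Detector> = {
  loop_detection: (events) => {
    const hashes = events.filter(e => e.action === 'tool_call').map(e => JSON.stringify(e.input));
    let run = 1, max = hashes.length > 0 ? 1 : 0;
    for (let i = 1; i < hashes.length; i++) {
      run = hashes[i] === hashes[i - 1] ? run + 1 : 1;
      max = Math.max(max, run);
    }
    return max >= 3 ? { detectorId: 'loop_detection', severity: 85 } : null;
  }
};

function runDetectors(events: TraceEvent[]): DetectionResult[] {
  return Object.values(registry)
    .map(detect => detect(events))
    .filter((r): r is DetectionResult => r !== null);
}
```

Iterating `Object.values` over the registry also makes it trivial to run detectors concurrently with `Promise.all` if any of them ever become async.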
Step 3: Conditional LLM Escalation
When structural detectors return no matches, or when multi-agent blame attribution is required, the pipeline routes the trace to an LLM judge. This tier is expensive, so it only activates when necessary.
```typescript
interface EscalationRequest {
  traceId: string;
  unresolvedSteps: string[];
  multiAgentContext: boolean;
}

interface AttributionResult {
  responsibleAgent: string;
  failureStep: string;
  causalReason: string;
}

async function escalateToLLM(request: EscalationRequest): Promise<AttributionResult> {
  const prompt = `
Analyze the following agent trace for causal failure attribution.
Trace ID: ${request.traceId}
Steps requiring analysis: ${request.unresolvedSteps.join(', ')}
Multi-agent environment: ${request.multiAgentContext}
Return JSON with fields: { responsibleAgent: string, failureStep: string, causalReason: string }
`;
  // Assumes an OpenAI-compatible client (e.g. a gateway) fronting the Anthropic model.
  const response = await llmClient.chat.completions.create({
    model: 'claude-sonnet-4-20250514',
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' }
  });
  const content = response.choices[0]?.message?.content;
  if (!content) throw new Error('LLM returned empty attribution response');
  return JSON.parse(content) as AttributionResult;
}
```
Architecture Rationale: Escalation is gated by two conditions: (1) structural detectors found zero matches, indicating a novel or semantic failure, and (2) the trace involves multiple agents where blame cannot be resolved through tool ownership alone. This keeps LLM costs predictable and ensures the model focuses exclusively on causal reasoning rather than pattern matching.
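The two gates can be captured in a single predicate ahead of the LLM call. This is a sketch with assumed names, matching the `no_structural_match OR multi_agent_attribution` trigger in the configuration template:

```typescript
interface DetectionResult { detectorId: string; severity: number; }

// Gate 1: no structural detector fired, suggesting a novel or semantic failure.
// Gate 2: the caller explicitly requested multi-agent blame attribution.
function shouldEscalate(results: DetectionResult[], multiAgentAttribution: boolean): boolean {
  const noStructuralMatch = results.length === 0;
  return noStructuralMatch || multiAgentAttribution;
}
```

Everything that fails this predicate stays in the cheap deterministic tier; only the remainder pays for a model call.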
Pitfall Guide
1. Semantic Overreach on Structural Problems
Explanation: Routing loop detection, repetition counting, or tool success verification through an LLM. Language models are probabilistic and will occasionally miscount or hallucinate repetition patterns.
Fix: Replace LLM calls with deterministic state hashing, regex matching, or set-based overlap calculations. Reserve LLMs for causal chains and intent analysis.
2. Conflating Detection with Attribution
Explanation: Asking a single LLM prompt to both identify that a failure occurred and determine which agent caused it. This forces the model to perform two distinct cognitive tasks simultaneously, degrading accuracy on both.
Fix: Decouple the pipeline. Run structural detectors first to flag failures. Only escalate to LLM when attribution is explicitly required or when detection yields zero results.
3. Hardcoded Severity Thresholds
Explanation: Setting fixed pass/fail boundaries (e.g., if coverage < 0.5 then fail) without calibrating against production data. Thresholds that work on benchmarks often trigger false positives on noisy real-world traces.
Fix: Implement dynamic thresholding using percentile-based calibration on a validation set. Log detector scores continuously and adjust boundaries based on precision/recall trade-offs.
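One way to implement that calibration, sketched under the assumption that detector scores are logged as plain numbers (function name is illustrative):

```typescript
// Derive an alert threshold from the empirical score distribution
// instead of a fixed constant.
function percentileThreshold(scores: number[], percentile: number): number {
  const sorted = [...scores].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((percentile / 100) * sorted.length));
  return sorted[idx];
}
```

For example, alert only on scores above the 95th percentile of a week of validation traces, then tighten or loosen the percentile based on observed precision.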
4. Ignoring Sliding Window Context
Explanation: Analyzing each trace step in isolation. Many failures (e.g., goal deviation, context drift) only become visible when comparing early steps against late steps.
Fix: Maintain a rolling context window that preserves initial constraints, user requirements, and intermediate state. Compare final outputs against the full window, not just the immediate predecessor step.
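A minimal shape for such a window, keeping the opening constraints pinned while bounding intermediate state (field names and window size are illustrative, not from the source):

```typescript
interface ContextWindow {
  initialConstraints: string[]; // captured at step 0, never evicted
  recentSteps: string[];        // bounded FIFO of intermediate steps
}

function pushStep(win: ContextWindow, step: string, maxRecent = 20): void {
  win.recentSteps.push(step);
  if (win.recentSteps.length > maxRecent) win.recentSteps.shift();
}
```

Final-output checks then compare against `initialConstraints` plus the entire `recentSteps` buffer, not just the immediately preceding step.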
5. Cost Blindness in Escalation Logic
Explanation: Routing every trace to an LLM judge because "it's safer." This quickly erodes margins and introduces latency that breaks real-time agent loops.
Fix: Implement strict escalation gates. Only call the LLM when structural detectors return null or when multi-agent attribution is explicitly requested. Cache attribution results for identical trace patterns.
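The caching half of that fix can be as simple as memoizing verdicts by a stable trace signature, so identical failure patterns pay for exactly one judge call. All names here are assumed:

```typescript
const attributionCache = new Map<string, string>();

function traceSignature(toolCalls: { name: string; input: unknown }[]): string {
  // Stable JSON of the (tool, input) sequence stands in for a real hash.
  return JSON.stringify(toolCalls.map(c => [c.name, c.input]));
}

function cachedAttribution(sig: string, judge: () => string): string {
  const hit = attributionCache.get(sig);
  if (hit !== undefined) return hit;
  const verdict = judge();        // expensive LLM call happens only on a miss
  attributionCache.set(sig, verdict);
  return verdict;
}
```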
6. Novelty Blindness in Heuristic Systems
Explanation: Assuming rule-based detectors cover all failure modes. Heuristics only catch known patterns. A new tool failure mode, prompt injection, or emergent agent behavior will bypass them entirely.
Fix: Maintain an LLM catch-all tier specifically for out-of-distribution traces. Route traces that pass all structural checks but still exhibit anomalous latency, token usage, or confidence scores to the LLM for semantic review.
7. Precision Neglect in Benchmarking
Explanation: Optimizing exclusively for recall (catching every failure) without measuring false positive rates. High recall with low precision creates alert fatigue and erodes engineering trust in the observability stack.
Fix: Track precision alongside recall. On the TRAIL benchmark, heuristic systems achieved 100% precision across 481 detections. Prioritize precision in production; it is easier to tune recall upward than to recover from a system that cries wolf.
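Tracking both metrics requires only the confusion counts; a minimal helper (name assumed, with the usual convention that degenerate cases default to 1.0):

```typescript
// Precision/recall from labeled detector outcomes: true positives,
// false positives, and false negatives.
function precisionRecall(tp: number, fp: number, fn: number): { precision: number; recall: number } {
  return {
    precision: tp + fp === 0 ? 1 : tp / (tp + fp),
    recall: tp + fn === 0 ? 1 : tp / (tp + fn),
  };
}
```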
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume single-agent monitoring | Heuristic-only pipeline | Structural failures dominate; LLM adds latency with minimal gain | ~$0.001/trace |
| Multi-agent blame attribution | Tiered hybrid (Heuristic + Sonnet 4) | Causal chains require semantic reasoning; heuristics identify failure steps | ~$0.02/trace |
| Novel failure hunting / OOD detection | LLM catch-all tier | Heuristics miss unseen patterns; LLM provides semantic coverage | ~$0.08/trace |
| Compliance / audit logging | Heuristic-only + structured JSON output | Deterministic scoring ensures reproducibility and auditability | ~$0.00/trace |
| Real-time agent loop debugging | Heuristic-only with sub-100ms latency | LLM latency breaks interactive debugging workflows | ~$0.00/trace |
Configuration Template
```yaml
trace_pipeline:
  normalization:
    schema_version: "2.1"
    timestamp_format: "iso8601"
    max_events_per_trace: 500
  detectors:
    loop_detection:
      enabled: true
      min_consecutive_repeats: 3
      severity_weight: 0.85
    context_neglect:
      enabled: true
      min_coverage_ratio: 0.4
      element_types: ["dates", "urls", "proper_nouns", "numbers"]
    tool_failure_correlation:
      enabled: true
      max_error_rate: 0.15
      severity_weight: 0.75
    specification_coverage:
      enabled: true
      matching_strategy: "stem_and_synonym"
      min_keyword_overlap: 0.6
  escalation:
    trigger_condition: "no_structural_match OR multi_agent_attribution"
    model: "claude-sonnet-4-20250514"
    max_tokens: 512
    response_format: "json_object"
    cost_budget_per_day_usd: 5.00
  output:
    format: "structured_json"
    include_evidence: true
    severity_threshold_alert: 70
```
Quick Start Guide
- Install dependencies: Add your trace parsing library and LLM client SDK to your project. Ensure TypeScript 5.4+ is configured.
- Define your event schema: Map your agent framework's raw logs to the `TraceEvent` interface. Handle missing fields with sensible defaults.
- Register detectors: Import the structural detector registry and attach it to your pipeline. Run a dry pass on 50 historical traces to calibrate severity thresholds.
- Configure escalation rules: Set the escalation trigger to activate only when structural detectors return `null` or when multi-agent attribution is explicitly requested. Point the LLM client to Sonnet 4 for attribution tasks.
- Deploy and monitor: Route production traces through the pipeline. Log detector scores, escalation rates, and precision/recall metrics. Adjust thresholds weekly based on alert fatigue and missed failure rates.