Difficulty: Intermediate
Read Time: 10 min

Reading the Prompt You Did Not Send: Detection at the Inference Boundary

By Codcompass Team · 10 min read

Hardening the Inference Plane: Ensemble-Based Detection for Agentic Workloads

Current Situation Analysis

Agentic systems operate in environments where context is inherently untrusted. Every email, calendar invite, web scrape, or database record injected into a prompt represents a potential attack vector. The industry has historically focused on securing the tool-use layer and credential boundaries, treating the inference path as a trusted conduit. This assumption is collapsing under the weight of production incidents.

The inference boundary is the most observable layer in the agent stack. The harness captures the complete prompt and response payload on every model call. Despite this visibility, many implementations lack runtime detection mechanisms, relying instead on static system prompts or post-hoc log analysis. This gap enables LLM Scope Violations, where external content manipulates the model to exfiltrate data or execute actions outside its authorized domain.
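Because the harness already sees both sides of every model call, a detection hook can live in the same wrapper. The following is a minimal sketch under stated assumptions: `call_model` and `score` are hypothetical callables (any function mapping a prompt to a response, and any scorer returning a risk in [0, 1]), and the 0.8 threshold is illustrative rather than taken from the article.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InferenceRecord:
    """One captured model call: full prompt, full response, and both risk scores."""
    prompt: str
    response: str
    prompt_risk: float
    response_risk: float


@dataclass
class GuardedHarness:
    # Hypothetical stand-ins, not a real SDK: prompt -> response, text -> risk.
    call_model: Callable[[str], str]
    score: Callable[[str], float]
    block_threshold: float = 0.8  # assumed value for illustration
    log: List[InferenceRecord] = field(default_factory=list)

    def infer(self, prompt: str) -> str:
        prompt_risk = self.score(prompt)
        if prompt_risk >= self.block_threshold:
            # Record the attempt, then refuse before the model sees the prompt.
            self.log.append(InferenceRecord(prompt, "", prompt_risk, 0.0))
            raise PermissionError("prompt blocked at inference boundary")

        response = self.call_model(prompt)
        response_risk = self.score(response)
        # Every call is logged, so post-hoc analysis still has the full payload.
        self.log.append(
            InferenceRecord(prompt, response, prompt_risk, response_risk)
        )
        if response_risk >= self.block_threshold:
            raise PermissionError("response blocked at inference boundary")
        return response
```

The point of the design is that scoring happens at runtime on the same payload that would otherwise only reach the logs, so a block can occur before tool execution or output delivery.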

The existence proof of this failure mode is CVE-2025-32711 (EchoLeak), disclosed in June 2025. In this incident, a vendor calendar invite contained a markdown payload instructing Microsoft 365 Copilot to encode sensitive user data into a SharePoint URL path within a summary response. The model complied, exfiltrating MFA codes via an auto-unfurling link. Microsoft scored this CVSS 9.3; NVD scored it 7.5. The vulnerability was patched server-side, but the architectural flaw—processing unvalidated context without inference-layer scrutiny—remains prevalent.
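The exfiltration pattern described above (data encoded into a URL path inside markdown in the response) suggests one cheap response-side check: extract every markdown link or image URL and flag those that point off an allowlist or carry unusually long path segments. The sketch below is illustrative only; `ALLOWED_HOSTS`, `MAX_SEGMENT_LEN`, and `suspicious_links` are assumed names and values, not part of any vendor's fix.

```python
import re
from typing import List
from urllib.parse import urlparse

# Inline markdown links and images: [text](url) or ![alt](url)
MD_LINK = re.compile(r"!?\[[^\]]*\]\((https?://[^)\s]+)\)")
# Reference-style link definitions: [ref]: url
MD_REF = re.compile(r"^\s*\[[^\]]+\]:\s*(https?://\S+)", re.MULTILINE)

ALLOWED_HOSTS = {"intranet.example.com"}  # hypothetical allowlist
MAX_SEGMENT_LEN = 40  # assumed threshold for "looks like encoded data"


def suspicious_links(response: str) -> List[str]:
    """Return URLs in a model response that merit blocking or review."""
    flagged = []
    for url in MD_LINK.findall(response) + MD_REF.findall(response):
        parsed = urlparse(url)
        host = parsed.hostname or ""
        segments = [s for s in parsed.path.split("/") if s]
        # A trusted host is not enough: the path itself can carry the payload.
        has_long_segment = any(len(s) > MAX_SEGMENT_LEN for s in segments)
        if host not in ALLOWED_HOSTS or has_long_segment:
            flagged.append(url)
    return flagged
```

A length threshold is a blunt heuristic and will miss data split across short segments or query parameters; it is shown here as one signal an ensemble could weigh, not as a complete defense.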

The 2025–2026 CVE corpus demonstrates that inference failures rarely occur in isolation. They chain across boundaries:

  • Semantic Kernel CVE-2026-25592: Prompt injection bypassed validation to trigger remote code execution via DownloadFileAsync.
  • GitHub Copilot CVE-2025-53773: Injection manipulated chat.tools.autoApprove, leading to terminal RCE.
  • OpenClaw Claw Chain: Four chained vulnerabilities culminated in agent-runtime takeover, starting with inference manipulation.

These incidents confirm that tool permissions and credential brokers are insufficient without robust detection at the inference plane. The prompt is the attack surface; if you cannot score the prompt before the model executes, you are operating blind.

WOW Moment: Key Findings

Ensemble detection strategies have matured to the point where they can neutralize the majority of known inference attacks with acceptable operational overhead. Single-model classifiers are inadequate due to high false-positive rates on security-themed benign inputs and susceptibility to adversarial evasion. Ensemble architectures combining pattern matching, semantic classification, and scope validation deliver order-of-magnitude improvements in attack success rate (ASR) reduction.

The following data compares defense strategies against cross-source OWASP LLM01 corpora and production telemetry:

| Defense Strategy | Attack Success Rate (ASR) | Precision | Latency Overhead | Key Limitation |
| --- | --- | --- | --- | --- |
| Baseline (No Guard) | 50% – 86% | N/A | 0 ms | Vulnerable to all injection classes. |
| Single Classifier | ~40% | ~82% | ~400 ms | High false positives on security contexts; adversarial evasion. |
| LLMTrace Ensemble | <12.4% | 95.5% | ~1.5 s | 79.7% recall leaves gap for novel variants. |
| Anthropic Constitutional | 4.4% (from 86%) | High | ~1.2 s | Requires proprietary model integration. |
| Microsoft Spotlighting | <2% (from >50%) | High | ~1.0 s | Specialized for indirect injection; less coverage on jailbreaks. |

Why this matters: The LLMTrace ensemble demonstrates that a four-detector architecture achieves 95.5% precision while cutting ASR from the 86% upper baseline to under 12.4%, a relative reduction of over 85%. Microsoft Spotlighting and Anthropic's classifiers show that enterprise-grade reductions are achievable. The latency cost (~1.5 s median) is a trade-off that production systems can manage through async pre-checks and caching, whereas the risk of scope violation or exfiltration is often unacceptable.
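One way to read "async pre-checks and caching" is to score each untrusted context chunk concurrently and memoize verdicts by content hash, so repeated content (the same email thread, the same scraped page) pays the detection cost once. The sketch below assumes a hypothetical async `detect` callable returning a risk in [0, 1]; `CachedPrecheck` and `score_context` are illustrative names.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, Dict, List


class CachedPrecheck:
    """Memoizes detector verdicts by SHA-256 of the scored text."""

    def __init__(self, detect: Callable[[str], Awaitable[float]]):
        self._detect = detect  # hypothetical async detector, not a real API
        self._cache: Dict[str, "asyncio.Future[float]"] = {}
        self.misses = 0

    async def risk(self, text: str) -> float:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            # Cache the in-flight task so concurrent duplicates share one call.
            self._cache[key] = asyncio.ensure_future(self._detect(text))
        return await self._cache[key]


async def score_context(precheck: CachedPrecheck, chunks: List[str]) -> List[float]:
    # Run all checks concurrently: wall-clock cost approaches the slowest
    # single check instead of the sum of all checks.
    return list(await asyncio.gather(*(precheck.risk(c) for c in chunks)))
```

This cache is unbounded and also caches failures; a production version would likely add eviction, a TTL, and invalidation when detector rules change.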

Core Solution

Implementing inference-plane detection requires an ensemble architecture that scores prompts and responses before tool execution or output delivery. The solution comprises three detection layers and a voting arbiter.
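As a rough illustration of three detection layers feeding a voting arbiter, the sketch below combines a regex pattern score, a pluggable semantic score, and a scope check into one weighted verdict. Everything here is an assumption made for illustration: the patterns, the weights, the 0.5 threshold, the idea that scope is checked by comparing requested tools against an allowlist, and the `semantic_score` callable, which stands in for whatever classifier is actually deployed.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set

# Illustrative signatures only; a real deployment would maintain a larger set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now in developer mode", re.I),
]


def pattern_score(text: str) -> float:
    """Layer 1: 1.0 if any known injection signature appears, else 0.0."""
    return 1.0 if any(p.search(text) for p in INJECTION_PATTERNS) else 0.0


def scope_score(requested_tools: Set[str], allowed_tools: Set[str]) -> float:
    """Layer 3 (assumed behavior): 1.0 if any requested tool is not allowed."""
    return 1.0 if requested_tools - allowed_tools else 0.0


@dataclass
class Verdict:
    risk: float
    block: bool
    scores: Dict[str, float]


def arbitrate(
    text: str,
    requested_tools: Set[str],
    allowed_tools: Set[str],
    semantic_score: Callable[[str], float],  # Layer 2: hypothetical classifier
    weights: Optional[Dict[str, float]] = None,
    threshold: float = 0.5,
) -> Verdict:
    """Weighted vote across the three layers; weights are assumed, not tuned."""
    weights = weights or {"pattern": 0.3, "semantic": 0.4, "scope": 0.3}
    scores = {
        "pattern": pattern_score(text),
        "semantic": semantic_score(text),
        "scope": scope_score(requested_tools, allowed_tools),
    }
    risk = sum(weights[name] * scores[name] for name in scores)
    return Verdict(risk=risk, block=risk >= threshold, scores=scores)
```

With these example weights no single layer can block on its own, which reflects the ensemble idea that one noisy detector should not decide alone; a deployment that wants a hard stop on any scope violation would treat that layer as a veto instead of a vote.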

Architecture Rationale

  1. Pattern Matcher: Low-latency regex and keyword analysis catches known injection signatures and structural anomalies. This layer filters obvious attacks instantly.
  2. Semantic Classifier: A distilled model evaluates the intent and context of the prompt. This layer detects subtle manipulations, jailbreaks, and indirect injections that bypass pattern matching.
  3. **Scope Vali
