Reading the Prompt You Did Not Send: Detection at the Inference Boundary
By Codcompass Team · 10 min read
Hardening the Inference Plane: Ensemble-Based Detection for Agentic Workloads
Current Situation Analysis
Agentic systems operate in environments where context is inherently untrusted. Every email, calendar invite, web scrape, or database record injected into a prompt represents a potential attack vector. The industry has historically focused on securing the tool-use layer and credential boundaries, treating the inference path as a trusted conduit. This assumption is collapsing under the weight of production incidents.
The inference boundary is the most observable layer in the agent stack. The harness captures the complete prompt and response payload on every model call. Despite this visibility, many implementations lack runtime detection mechanisms, relying instead on static system prompts or post-hoc log analysis. This gap enables LLM Scope Violations, where external content manipulates the model to exfiltrate data or execute actions outside its authorized domain.
The existence proof of this failure mode is CVE-2025-32711 (EchoLeak), disclosed in June 2025. In this incident, a vendor calendar invite contained a markdown payload instructing Microsoft 365 Copilot to encode sensitive user data into a SharePoint URL path within a summary response. The model complied, exfiltrating MFA codes via an auto-unfurling link. Microsoft scored this CVSS 9.3; NVD scored it 7.5. The vulnerability was patched server-side, but the architectural flaw—processing unvalidated context without inference-layer scrutiny—remains prevalent.
The 2025–2026 CVE corpus demonstrates that inference failures rarely occur in isolation. They chain across boundaries:
- **Semantic Kernel CVE-2026-25592:** Prompt injection bypassed validation to trigger remote code execution via DownloadFileAsync.
- **GitHub Copilot CVE-2025-53773:** Injection manipulated chat.tools.autoApprove, leading to terminal RCE.
- **OpenClaw Claw Chain:** Four chained vulnerabilities culminated in agent-runtime takeover, starting with inference manipulation.
These incidents confirm that tool permissions and credential brokers are insufficient without robust detection at the inference plane. The prompt is the attack surface; if you cannot score the prompt before the model executes, you are operating blind.
WOW Moment: Key Findings
Ensemble detection strategies have matured to the point where they can neutralize the majority of known inference attacks with acceptable operational overhead. Single-model classifiers are inadequate due to high false-positive rates on security-themed benign inputs and susceptibility to adversarial evasion. Ensemble architectures combining pattern matching, semantic classification, and scope validation deliver order-of-magnitude improvements in attack success rate (ASR) reduction.
The following data compares defense strategies against cross-source OWASP LLM01 corpora and production telemetry:
| Defense Strategy | Attack Success Rate (ASR) | Precision | Latency Overhead | Key Limitation |
| --- | --- | --- | --- | --- |
| Baseline (No Guard) | 50% – 86% | N/A | 0 ms | Vulnerable to all injection classes. |
| Single Classifier | ~40% | ~82% | ~400 ms | High false positives on security contexts; adversarial evasion. |
| LLMTrace Ensemble | <12.4% | 95.5% | ~1.5 s | 79.7% recall leaves gap for novel variants. |
| Anthropic Constitutional | 4.4% (from 86%) | High | ~1.2 s | Requires proprietary model integration. |
| Microsoft Spotlighting | <2% (from >50%) | High | ~1.0 s | Specialized for indirect injection; less coverage on jailbreaks. |
Why this matters: The LLMTrace ensemble demonstrates that a four-detector architecture achieves 95.5% precision while reducing ASR by over 85%. Microsoft Spotlighting and Anthropic's classifiers prove that enterprise-grade reductions are achievable. The latency cost (~1.5s median) is a trade-off that production systems can manage through async pre-checks and caching, whereas the risk of scope violation or exfiltration is often unacceptable.
Core Solution
Implementing inference-plane detection requires an ensemble architecture that scores prompts and responses before tool execution or output delivery. The solution comprises three detection layers and a voting arbiter.
Architecture Rationale
1. **Pattern Matcher:** Low-latency regex and keyword analysis catches known injection signatures and structural anomalies. This layer filters obvious attacks instantly.
2. **Semantic Classifier:** A distilled model evaluates the intent and context of the prompt. This layer detects subtle manipulations, jailbreaks, and indirect injections that bypass pattern matching.
3. **Scope Validator:** Domain-specific logic checks for scope violations, such as requests to access sensitive data or generate exfiltration vectors. This layer addresses LLM Scope Violations like CVE-2025-32711.
4. **Voting Arbiter:** Aggregates results using weighted majority voting. This reduces false positives by requiring consensus across heterogeneous detectors.
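The arbiter step can be sketched as a small weighted vote. This is a minimal illustration, not the LLMTrace implementation: the `DetectorVote` shape, the `arbitrate` function, the `ALLOW` verdict, and both default thresholds are assumptions chosen for the example.

```typescript
// Hypothetical types for illustration; not taken from any specific library.
type Verdict = 'ALLOW' | 'REVIEW' | 'BLOCK';

interface DetectorVote {
  detector: string; // label of the detector that produced the vote
  flagged: boolean; // true if the detector considers the input malicious
  weight: number;   // relative trust placed in this detector (assumed >= 0)
}

// Weighted majority vote: the share of total weight held by detectors that
// flagged the input is compared against two (assumed) thresholds.
function arbitrate(
  votes: DetectorVote[],
  blockThreshold = 0.6,
  reviewThreshold = 0.3,
): Verdict {
  const totalWeight = votes.reduce((sum, v) => sum + v.weight, 0);
  if (totalWeight === 0) return 'REVIEW'; // no usable signal: defer to review

  const flaggedWeight = votes
    .filter((v) => v.flagged)
    .reduce((sum, v) => sum + v.weight, 0);
  const share = flaggedWeight / totalWeight;

  if (share >= blockThreshold) return 'BLOCK';
  if (share >= reviewThreshold) return 'REVIEW';
  return 'ALLOW';
}
```

With weights of 1 (pattern), 2 (semantic), and 2 (scope), a lone pattern hit holds a 0.2 share and passes, a lone semantic hit holds 0.4 and goes to review, and agreement between the semantic and scope detectors holds 0.8 and blocks. That is the consensus behavior the arbiter is meant to provide.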
Implementation
The following TypeScript snippet shows how a modular ensemble guardian is configured and invoked. Detectors are pluggable, and the arbiter supports configurable voting strategies.
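```typescript
// Configure the ensemble: pluggable detectors behind a single voting arbiter.
const guardian = new InferenceGuardian({
  votingThreshold: 0.6, // threshold applied by the voting arbiter
  maxLatencyMs: 2000,   // latency budget for the ensemble
  detectors: [
    // Known injection signatures.
    new RegexPatternDetector([
      /ignore previous instructions/i,
      /system prompt override/i,
      /encode.*url.*path/i,
    ]),
    // Remote semantic classifier endpoint.
    new SemanticIntentClassifier('https://guard-model.internal/v1/analyze'),
    // Field labels treated as out of scope for the agent.
    new ScopeViolationDetector(['mfa_codes', 'api_keys', 'ssn']),
  ],
});

// Score the prompt and its context before the model call proceeds.
const verdict = await guardian.evaluate(userPrompt, agentContext);
if (verdict === 'BLOCK') {
  throw new Error('Inference guard blocked request');
}
```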
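The snippet above relies on classes whose bodies are not shown. One possible shape for them is sketched below, under stated assumptions: the `Detector` interface, the `[0, 1]` score convention, the `string[]` context type, and the mean-score vote are all hypothetical. `SemanticIntentClassifier` is left out because the contract of its endpoint is not specified; any object implementing `Detector` could fill that slot.

```typescript
// All type and class bodies here are assumptions made for illustration.
type Verdict = 'ALLOW' | 'REVIEW' | 'BLOCK';

// A detector returns a risk score in [0, 1]; higher means more suspicious.
interface Detector {
  readonly label: string;
  score(prompt: string, context: string[]): Promise<number>;
}

class RegexPatternDetector implements Detector {
  readonly label = 'regex';
  private readonly patterns: RegExp[];
  constructor(patterns: RegExp[]) {
    this.patterns = patterns;
  }
  async score(prompt: string, context: string[]): Promise<number> {
    const text = [prompt, ...context].join('\n');
    return this.patterns.some((p) => p.test(text)) ? 1 : 0;
  }
}

class ScopeViolationDetector implements Detector {
  readonly label = 'scope';
  private readonly sensitiveFields: string[];
  constructor(sensitiveFields: string[]) {
    this.sensitiveFields = sensitiveFields.map((f) => f.toLowerCase());
  }
  async score(prompt: string, context: string[]): Promise<number> {
    // Naive check: does any sensitive field label appear in the text?
    const text = [prompt, ...context].join('\n').toLowerCase();
    return this.sensitiveFields.some((f) => text.includes(f)) ? 1 : 0;
  }
}

interface GuardianOptions {
  votingThreshold: number;
  maxLatencyMs: number; // accepted for parity with the snippet; not enforced here
  detectors: Detector[];
}

class InferenceGuardian {
  private readonly options: GuardianOptions;
  constructor(options: GuardianOptions) {
    this.options = options;
  }
  // Runs all detectors concurrently and blocks when the mean score
  // reaches the voting threshold (an unweighted simplification).
  async evaluate(prompt: string, context: string[]): Promise<Verdict> {
    const { detectors, votingThreshold } = this.options;
    if (detectors.length === 0) return 'REVIEW';
    const scores = await Promise.all(
      detectors.map((d) => d.score(prompt, context)),
    );
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    return mean >= votingThreshold ? 'BLOCK' : 'ALLOW';
  }
}
```

Keeping every detector behind one interface is what makes the ensemble pluggable: a detector can be swapped or rotated without touching the arbiter.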
Pitfall Guide
Deploying inference detection in production introduces specific failure modes. The following pitfalls are derived from CVE analysis and ensemble telemetry.
The LLM06 Mirage
Explanation: Inference detectors target LLM01 (Prompt Injection) and scope violations. They do not detect Excessive Agency (LLM06). An agent may delete a production database because it interprets a benign request as "helpful," without any injection. The inference detector passes the prompt, but the action is unauthorized.
Fix: Compose inference detection with a decision layer (e.g., Cedar policies) that governs tool permissions and action scope. Inference guards the prompt; decision guards the action.
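A minimal sketch of that composition follows. The rule shape and function names are hypothetical and only stand in for a real policy engine such as Cedar; this is neither Cedar syntax nor its API.

```typescript
// Hypothetical shapes; a real deployment would delegate to a policy engine.
interface ToolAction {
  principal: string; // agent identity
  tool: string;      // e.g. 'db.query'
  operation: string; // e.g. 'read', 'delete'
  resource: string;  // e.g. 'prod/orders'
}

interface PolicyRule {
  principal: string;
  tool: string;
  operations: string[];
  resourcePrefix: string;
}

// Default deny: an action is permitted only if some rule explicitly covers it.
function isActionPermitted(action: ToolAction, rules: PolicyRule[]): boolean {
  return rules.some(
    (r) =>
      r.principal === action.principal &&
      r.tool === action.tool &&
      r.operations.includes(action.operation) &&
      action.resource.startsWith(r.resourcePrefix),
  );
}

// Both gates must pass: the inference verdict guards the prompt,
// the policy check guards the action.
function mayExecute(
  promptVerdict: 'ALLOW' | 'REVIEW' | 'BLOCK',
  action: ToolAction,
  rules: PolicyRule[],
): boolean {
  return promptVerdict === 'ALLOW' && isActionPermitted(action, rules);
}
```

In this sketch a delete against a production table is refused even when the prompt itself scored as clean, which is the Excessive Agency case that inference detection alone does not cover.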
Adversarial Evasion of Detectors
Explanation: Detectors themselves are attackable. Research such as STACK and adversarial-judge studies demonstrates that prompts can be crafted to evade specific classifiers. Relying on a single model or static patterns allows attackers to probe and bypass defenses.
Fix: Use heterogeneous ensembles with different model architectures. Rotate detector models periodically. Monitor detector logs for evasion patterns and retrain classifiers on adversarial examples.
Security Context False Positives
Explanation: Classifiers often over-defend on security-themed benign inputs. PromptGuard, for example, flags 99.1% of security-themed benign inputs as malicious. Agents operating in security domains (e.g., vulnerability scanning) may trigger constant blocks.
Fix: Implement context-aware whitelisting. Tune thresholds based on domain-specific corpora. Use the ensemble to require consensus; a single classifier flagging a security term should not block if other detectors pass.
Latency Budget Blowout
Explanation: Ensemble detection adds latency. LLMTrace reports ~1.5s median overhead. In high-throughput chat or real-time agents, this can degrade user experience or cause timeouts.
Fix: Execute detectors asynchronously where possible. Cache results for identical prompts. Use distilled, smaller models for detection. Implement timeout fallbacks to REVIEW rather than blocking.
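The timeout fallback and the prompt cache can be combined in one wrapper. The sketch below assumes a hypothetical `evaluate` function that returns a verdict for a prompt; the exact-match cache key and the unbounded `Map` are simplifications, and a production cache would likely hash the prompt together with its context and evict old entries.

```typescript
type Verdict = 'ALLOW' | 'REVIEW' | 'BLOCK';
type Evaluate = (prompt: string) => Promise<Verdict>;

// Wraps a (hypothetical) evaluate function with a latency budget and a cache.
function withLatencyBudget(evaluate: Evaluate, maxLatencyMs: number): Evaluate {
  const cache = new Map<string, Verdict>(); // keyed by exact prompt text

  return async (prompt) => {
    const cached = cache.get(prompt);
    if (cached !== undefined) return cached;

    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<'TIMEOUT'>((resolve) => {
      timer = setTimeout(() => resolve('TIMEOUT'), maxLatencyMs);
    });

    try {
      const outcome = await Promise.race([evaluate(prompt), timeout]);
      // On timeout, fall back to REVIEW and do not cache the fallback.
      if (outcome === 'TIMEOUT') return 'REVIEW';
      cache.set(prompt, outcome);
      return outcome;
    } finally {
      if (timer !== undefined) clearTimeout(timer);
    }
  };
}
```

Only real verdicts are cached. Caching the `REVIEW` fallback would pin a transient slowdown onto every later identical prompt.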
Recall Gaps and False Negatives
Explanation: No detector achieves 100% recall. LLMTrace reports 79.7% recall, meaning 16 false negatives per 79 malicious samples. Novel injection techniques or obfuscation may bypass detection.
Fix: Implement human-in-the-loop review for low-confidence scores. Continuously analyze trace stores for missed attacks. Update patterns and retrain classifiers based on new CVEs and adversarial research.
Scope Blindness
Explanation: Detectors may miss scope violations if they lack context about sensitive data. CVE-2025-32711 succeeded because the model was instructed to encode sensitive data into a URL path, which appeared benign to basic detectors.
Fix: Integrate data classification tags into the context. The scope validator must know which fields are sensitive and check for exfiltration patterns, not just access requests.
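As a rough illustration of checking for exfiltration patterns rather than access requests, the sketch below scans model output for URLs that carry a tagged sensitive value. The `TaggedValue` shape, the URL regex, and the minimum-length guard are assumptions; the check only covers verbatim and percent-encoded values, so other encodings would need additional candidates.

```typescript
// Hypothetical shape: sensitive values tagged by a data classification step.
interface TaggedValue {
  field: string; // classification tag, e.g. 'mfa_codes'
  value: string; // the literal sensitive value present in the agent's context
}

const URL_PATTERN = /https?:\/\/[^\s)>\]"']+/gi;
const MIN_VALUE_LENGTH = 4; // skip very short values that match by accident

// Returns the tags of sensitive values that appear inside a URL in the
// model output, either verbatim or percent-encoded.
function findUrlExfiltration(output: string, tagged: TaggedValue[]): string[] {
  const urls = output.match(URL_PATTERN) ?? [];
  const leaked = new Set<string>();

  for (const url of urls) {
    let decoded = url;
    try {
      decoded = decodeURIComponent(url);
    } catch {
      // Malformed percent-escapes: fall back to the raw URL text.
    }
    for (const item of tagged) {
      if (item.value.length < MIN_VALUE_LENGTH) continue;
      const candidates = [item.value, encodeURIComponent(item.value)];
      if (candidates.some((c) => url.includes(c) || decoded.includes(c))) {
        leaked.add(item.field);
      }
    }
  }
  return Array.from(leaked);
}
```

A sensitive value that appears in plain response text is not flagged here. Only its presence inside a link counts, since that is the path by which an auto-fetched URL can carry data out.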
Detector Drift
Explanation: As models and prompts evolve, detector performance may degrade. Static configurations become outdated.
Fix: Establish a continuous evaluation pipeline. Run detector benchmarks against updated corpora. Automate alerts when precision or recall drops below thresholds.
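A benchmark run in such a pipeline reduces to confusion counts, from which precision and recall follow directly. The sketch below is illustrative: the function names and alert format are hypothetical, and in the usage note only the 16 false negatives out of 79 malicious samples come from the figures quoted earlier, while the false-positive count of 3 is an assumed value chosen because it reproduces the reported 95.5% precision.

```typescript
interface ConfusionCounts {
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
}

interface DetectorMetrics {
  precision: number;
  recall: number;
}

// precision = TP / (TP + FP), recall = TP / (TP + FN).
// An empty denominator is reported as 0 so that an empty run raises an alert.
function computeMetrics(c: ConfusionCounts): DetectorMetrics {
  const flagged = c.truePositives + c.falsePositives;
  const malicious = c.truePositives + c.falseNegatives;
  return {
    precision: flagged === 0 ? 0 : c.truePositives / flagged,
    recall: malicious === 0 ? 0 : c.truePositives / malicious,
  };
}

// Returns one message per metric that fell below its configured floor.
function driftAlerts(current: DetectorMetrics, floor: DetectorMetrics): string[] {
  const alerts: string[] = [];
  if (current.precision < floor.precision) {
    alerts.push(`precision ${current.precision.toFixed(3)} below floor ${floor.precision}`);
  }
  if (current.recall < floor.recall) {
    alerts.push(`recall ${current.recall.toFixed(3)} below floor ${floor.recall}`);
  }
  return alerts;
}
```

For example, `computeMetrics({ truePositives: 63, falsePositives: 3, falseNegatives: 16 })` gives a recall of 63/79 and a precision of 63/66. With floors of 0.9 for precision and 0.85 for recall, only the recall alert fires.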
Production Bundle
Action Checklist
- **Map Context Ingestion Points:** Identify all sources of untrusted context (emails, calendars, web scrapes, databases) and ensure each is routed through the inference guardian.
- **Deploy Ensemble Architecture:** Implement a multi-detector ensemble with pattern matching, semantic classification, and scope validation. Configure majority voting.
- **Calibrate Thresholds:** Tune voting thresholds and detector sensitivities using a corpus of benign security-themed inputs to minimize false positives.
- **Integrate Decision Layer:** Combine inference detection with policy-based decision controls (e.g., Cedar) to address LLM06 Excessive Agency.
- **Implement Latency Safeguards:** Add timeout handling, async execution, and caching to manage latency overhead. Set fallback to REVIEW on timeout.
- **Audit Trace Stores:** Analyze historical traces to quantify the fraction of prompts authored by non-human sources. Identify past scope violations.
- **Monitor Detector Health:** Track precision, recall, and latency metrics. Set alerts for performance degradation or adversarial evasion patterns.
- **Test Against Adversarial Inputs:** Regularly evaluate the ensemble against updated OWASP LLM01 corpora and novel injection techniques.
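The trace audit item in the checklist can be approximated with a small script. The sketch below assumes a hypothetical trace format in which each prompt is stored as labeled segments, and it measures the non-human share by character count; both the record shape and the choice of which sources count as untrusted are assumptions to adapt to the actual trace store.

```typescript
// Hypothetical trace format: each prompt is stored as labeled segments.
type SegmentSource = 'user' | 'system' | 'email' | 'calendar' | 'web' | 'database';

interface PromptSegment {
  source: SegmentSource;
  text: string;
}

interface TraceRecord {
  traceId: string;
  segments: PromptSegment[];
}

// Sources treated as external, non-human-authored context (an assumption).
const UNTRUSTED_SOURCES: SegmentSource[] = ['email', 'calendar', 'web', 'database'];

// Fraction of prompt characters, across all traces, that came from
// untrusted sources. Returns 0 when there is no text at all.
function untrustedFraction(traces: TraceRecord[]): number {
  let total = 0;
  let untrusted = 0;
  for (const trace of traces) {
    for (const segment of trace.segments) {
      total += segment.text.length;
      if (UNTRUSTED_SOURCES.includes(segment.source)) {
        untrusted += segment.text.length;
      }
    }
  }
  return total === 0 ? 0 : untrusted / total;
}
```

Character count is a crude proxy. Token counts from the model's tokenizer would track actual prompt composition more closely.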
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| High-Throughput Chat | Regex + Lightweight Classifier | Low latency is critical; regex catches most obvious attacks. | Low compute cost; minimal latency impact. |
| Financial Agent | Full Ensemble + Scope Validator | High risk of exfiltration and scope violation; precision is paramount. | Higher compute cost; acceptable latency for security. |
| Security Domain Agent | Ensemble + Context-Aware Whitelist | Avoids false positives on security-themed inputs; maintains accuracy. | Moderate cost; requires tuning effort. |
| Internal Tooling | Single Classifier + Pattern Matcher | Lower risk profile; balance between security and cost. | Low cost; faster deployment. |
| Regulated Environment | Full Ensemble + Human Review | Compliance requires high assurance and auditability. | |