Reading the Prompt You Did Not Send: Detection at the Inference Boundary

By Codcompass Team · 2026-05-22 · 10 min read

Hardening the Inference Plane: Ensemble-Based Detection for Agentic Workloads

Current Situation Analysis

Agentic systems operate in environments where context is inherently untrusted. Every email, calendar invite, web scrape, or database record injected into a prompt represents a potential attack vector. The industry has historically focused on securing the tool-use layer and credential boundaries, treating the inference path as a trusted conduit. This assumption is collapsing under the weight of production incidents.

The inference boundary is the most observable layer in the agent stack. The harness captures the complete prompt and response payload on every model call. Despite this visibility, many implementations lack runtime detection mechanisms, relying instead on static system prompts or post-hoc log analysis. This gap enables LLM Scope Violations, where external content manipulates the model to exfiltrate data or execute actions outside its authorized domain.

The existence proof of this failure mode is CVE-2025-32711 (EchoLeak), disclosed in June 2025. In this incident, a vendor calendar invite contained a markdown payload instructing Microsoft 365 Copilot to encode sensitive user data into a SharePoint URL path within a summary response. The model complied, exfiltrating MFA codes via an auto-unfurling link. Microsoft scored this CVSS 9.3; NVD scored it 7.5. The vulnerability was patched server-side, but the architectural flaw—processing unvalidated context without inference-layer scrutiny—remains prevalent.

The 2025–2026 CVE corpus demonstrates that inference failures rarely occur in isolation. They chain across boundaries:

Semantic Kernel CVE-2026-25592: Prompt injection bypassed validation to trigger remote code execution via DownloadFileAsync.
GitHub Copilot CVE-2025-53773: Injection manipulated chat.tools.autoApprove, leading to terminal RCE.
OpenClaw Claw Chain: Four chained vulnerabilities culminated in agent-runtime takeover, starting with inference manipulation.

These incidents confirm that tool permissions and credential brokers are insufficient without robust detection at the inference plane. The prompt is the attack surface; if you cannot score the prompt before the model executes, you are operating blind.

WOW Moment: Key Findings

Ensemble detection strategies have matured to the point where they can neutralize the majority of known inference attacks with acceptable operational overhead. Single-model classifiers are inadequate due to high false-positive rates on security-themed benign inputs and susceptibility to adversarial evasion. Ensemble architectures combining pattern matching, semantic classification, and scope validation deliver order-of-magnitude improvements in attack success rate (ASR) reduction.

The following data compares defense strategies against cross-source OWASP LLM01 corpora and production telemetry:

Defense Strategy	Attack Success Rate (ASR)	Precision	Latency Overhead	Key Limitation
Baseline (No Guard)	50% – 86%	N/A	0 ms	Vulnerable to all injection classes.
Single Classifier	~40%	~82%	~400 ms	High false positives on security contexts; adversarial evasion.
LLMTrace Ensemble	<12.4%	95.5%	~1.5 s	79.7% recall leaves gap for novel variants.
Anthropic Constitutional	4.4% (from 86%)	High	~1.2 s	Requires proprietary model integration.
Microsoft Spotlighting	<2% (from >50%)	High	~1.0 s	Specialized for indirect injection; less coverage on jailbreaks.

Why this matters: The LLMTrace ensemble demonstrates that a four-detector architecture achieves 95.5% precision while reducing ASR by over 85%. Microsoft Spotlighting and Anthropic's classifiers prove that enterprise-grade reductions are achievable. The latency cost (~1.5s median) is a trade-off that production systems can manage through async pre-checks and caching, whereas the risk of scope violation or exfiltration is often unacceptable.
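
The caching half of that trade-off can be sketched as a small verdict cache with a time-to-live. This is a minimal illustration, not part of any cited system: `VerdictCache`, its `ttlMs` parameter, and the injectable `now` clock are hypothetical names, and the cache key is assumed to be built by the caller from the prompt plus any context that influences the verdict.

```typescript
// Hypothetical helper: remembers guard verdicts for a short time so that
// identical inputs do not pay the ensemble latency twice.
type Verdict = 'ALLOW' | 'BLOCK' | 'REVIEW';

interface CacheEntry {
  verdict: Verdict;
  expiresAt: number;
}

class VerdictCache {
  private entries = new Map<string, CacheEntry>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be exercised without real waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): Verdict | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      // Expired entries are dropped so stale verdicts are never served.
      this.entries.delete(key);
      return undefined;
    }
    return entry.verdict;
  }

  set(key: string, verdict: Verdict): void {
    // REVIEW is not cached: it often reflects a timeout or a degraded
    // detector rather than a property of the input itself.
    if (verdict === 'REVIEW') return;
    this.entries.set(key, { verdict, expiresAt: this.now() + this.ttlMs });
  }
}
```

A harness would consult `get` before running the ensemble and call `set` afterwards. A short TTL limits how long a verdict outlives a detector or pattern update.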

Core Solution

Implementing inference-plane detection requires an ensemble architecture that scores prompts and responses before tool execution or output delivery. The solution comprises three detection layers and a voting arbiter.

Architecture Rationale

Pattern Matcher: Low-latency regex and keyword analysis catches known injection signatures and structural anomalies. This layer filters obvious attacks instantly.
Semantic Classifier: A distilled model evaluates the intent and context of the prompt. This layer detects subtle manipulations, jailbreaks, and indirect injections that bypass pattern matching.
Scope Validator: Domain-specific logic checks for scope violations, such as requests to access sensitive data or generate exfiltration vectors. This layer addresses LLM Scope Violations like CVE-2025-32711.
Voting Arbiter: Aggregates results using weighted majority voting. This reduces false positives by requiring consensus across heterogeneous detectors.

Implementation

The following TypeScript implementation defines a modular ensemble guardian. Detectors are pluggable, and the arbiter supports configurable voting strategies.

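// Core types for the inference guardian
type Verdict = 'ALLOW' | 'BLOCK' | 'REVIEW';

// Per-call context supplied by the agent harness
// (shape assumed from how ScopeViolationDetector uses it)
interface AgentContext {
  authorizedFields: string[];
}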

interface DetectorResult {
  detectorId: string;
  verdict: Verdict;
  confidence: number;
  reason: string;
  metadata?: Record<string, unknown>;
}

interface GuardConfig {
  votingThreshold: number;
  maxLatencyMs: number;
  detectors: IDetector[];
}

interface IDetector {
  id: string;
  check(prompt: string, context: AgentContext): Promise<DetectorResult>;
}

// Pattern Matcher: Fast, rule-based detection
class RegexPatternDetector implements IDetector {
  id = 'regex-pattern';
  private patterns: RegExp[];

  constructor(patterns: RegExp[]) {
    this.patterns = patterns;
  }

  async check(prompt: string, _context: AgentContext): Promise<DetectorResult> {
    for (const pattern of this.patterns) {
      if (pattern.test(prompt)) {
        return {
          detectorId: this.id,
          verdict: 'BLOCK',
          confidence: 0.95,
          reason: `Pattern match: ${pattern.source}`,
        };
      }
    }
    return {
      detectorId: this.id,
      verdict: 'ALLOW',
      confidence: 0.8,
      reason: 'No pattern violations detected.',
    };
  }
}

// Semantic Classifier: ML-based intent analysis
class SemanticIntentClassifier implements IDetector {
  id = 'semantic-classifier';
  private modelEndpoint: string;

  constructor(endpoint: string) {
    this.modelEndpoint = endpoint;
  }

  async check(prompt: string, context: AgentContext): Promise<DetectorResult> {
    // In production, this calls a distilled guard model or API
    const analysis = await this.analyzeWithModel(prompt, context);
    
    if (analysis.riskScore > 0.7) {
      return {
        detectorId: this.id,
        verdict: 'BLOCK',
        confidence: analysis.riskScore,
        reason: analysis.explanation,
        metadata: { riskCategory: analysis.category },
      };
    }
    return {
      detectorId: this.id,
      verdict: 'ALLOW',
      confidence: 1 - analysis.riskScore,
      reason: 'Semantic analysis passed.',
    };
  }

  private async analyzeWithModel(prompt: string, context: AgentContext) {
    // Placeholder for model inference
    return { riskScore: 0.2, explanation: 'Benign context.', category: 'none' };
  }
}

// Scope Validator: Checks for data exfiltration and scope violations
class ScopeViolationDetector implements IDetector {
  id = 'scope-validator';
  private sensitiveFields: Set<string>;

  constructor(sensitiveFields: string[]) {
    this.sensitiveFields = new Set(sensitiveFields);
  }

  async check(prompt: string, context: AgentContext): Promise<DetectorResult> {
    const violations = this.detectScopeBreaches(prompt, context);
    if (violations.length > 0) {
      return {
        detectorId: this.id,
        verdict: 'BLOCK',
        confidence: 0.9,
        reason: `Scope violation: ${violations.join(', ')}`,
        metadata: { violations },
      };
    }
    return {
      detectorId: this.id,
      verdict: 'ALLOW',
      confidence: 0.85,
      reason: 'Scope constraints satisfied.',
    };
  }

  private detectScopeBreaches(prompt: string, context: AgentContext): string[] {
    const breaches: string[] = [];
    // Check for requests to encode sensitive data in URLs or outputs
    if (prompt.match(/encode.*sensitive|exfiltrate|path.*secret/i)) {
      breaches.push('Potential data exfiltration pattern');
    }
    // Check if prompt requests access to fields outside agent scope
    for (const field of this.sensitiveFields) {
      if (prompt.includes(field) && !context.authorizedFields.includes(field)) {
        breaches.push(`Unauthorized access request for ${field}`);
      }
    }
    return breaches;
  }
}

// Voting Arbiter: Aggregates detector results
class MajorityVoteArbiter {
  private threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  decide(results: DetectorResult[]): Verdict {
    const total = results.length;
    // With no results the ratios below would be NaN and fall through to ALLOW;
    // escalate to REVIEW instead of failing open.
    if (total === 0) return 'REVIEW';

    const blockCount = results.filter(r => r.verdict === 'BLOCK').length;
    const reviewCount = results.filter(r => r.verdict === 'REVIEW').length;

    const blockRatio = blockCount / total;
    const reviewRatio = reviewCount / total;

    if (blockRatio >= this.threshold) return 'BLOCK';
    if (reviewRatio >= this.threshold) return 'REVIEW';
    return 'ALLOW';
  }
}

// Guardian Orchestrator
class InferenceGuardian {
  private config: GuardConfig;
  private arbiter: MajorityVoteArbiter;

  constructor(config: GuardConfig) {
    this.config = config;
    this.arbiter = new MajorityVoteArbiter(config.votingThreshold);
  }

  async evaluate(prompt: string, context: AgentContext): Promise<Verdict> {
    const startTime = Date.now();
    
    // Run detectors in parallel to minimize latency
    const results = await Promise.allSettled(
      this.config.detectors.map(detector => 
        this.runWithTimeout(detector, prompt, context, this.config.maxLatencyMs)
      )
    );

    const validResults = results
      .filter((r): r is PromiseFulfilledResult<DetectorResult> => r.status === 'fulfilled')
      .map(r => r.value);

    // If latency exceeded or results insufficient, default to safe state
    if (Date.now() - startTime > this.config.maxLatencyMs || validResults.length < this.config.detectors.length * 0.5) {
      return 'REVIEW';
    }

    return this.arbiter.decide(validResults);
  }

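  private async runWithTimeout(
    detector: IDetector, 
    prompt: string, 
    context: AgentContext, 
    timeoutMs: number
  ): Promise<DetectorResult> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    // Resolves with a zero-confidence REVIEW if the detector is too slow
    const timeout = new Promise<DetectorResult>((resolve) => {
      timer = setTimeout(() => resolve({
        detectorId: detector.id,
        verdict: 'REVIEW',
        confidence: 0,
        reason: 'Detector timeout',
      }), timeoutMs);
    });
    try {
      return await Promise.race([detector.check(prompt, context), timeout]);
    } finally {
      // Clear the timer so a fast detector does not leave a pending timeout behind
      clearTimeout(timer);
    }
  }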
}
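
The architecture notes describe weighted majority voting, while `MajorityVoteArbiter` above counts every detector equally. A weighted variant might look like the following sketch. `WeightedVote` and `decideWeighted` are hypothetical names, the weights are assumed to be positive numbers supplied by configuration, and counting BLOCK votes toward the REVIEW tally is a design choice of this sketch rather than behavior taken from the original arbiter.

```typescript
type Verdict = 'ALLOW' | 'BLOCK' | 'REVIEW';

interface WeightedVote {
  detectorId: string;
  verdict: Verdict;
  weight: number; // assumed positive; taken from configuration
}

function decideWeighted(votes: WeightedVote[], threshold: number): Verdict {
  const totalWeight = votes.reduce((sum, v) => sum + v.weight, 0);
  // No usable votes: escalate to REVIEW rather than silently allowing.
  if (totalWeight <= 0) return 'REVIEW';

  // Sum of the weights of all detectors that returned the given verdict.
  const weightFor = (verdict: Verdict): number =>
    votes
      .filter(v => v.verdict === verdict)
      .reduce((sum, v) => sum + v.weight, 0);

  const blockShare = weightFor('BLOCK') / totalWeight;
  const reviewShare = weightFor('REVIEW') / totalWeight;

  if (blockShare >= threshold) return 'BLOCK';
  // BLOCK votes also count toward REVIEW, so a split between BLOCK and
  // REVIEW escalates instead of passing as ALLOW.
  if (blockShare + reviewShare >= threshold) return 'REVIEW';
  return 'ALLOW';
}
```

With weights of 1 for the regex detector and 2 each for the semantic and scope detectors, a lone regex hit carries a 0.2 share and does not block at a 0.6 threshold, while agreement between the two heavier detectors (0.8) does.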

Usage Example

const guardian = new InferenceGuardian({
  votingThreshold: 0.6,
  maxLatencyMs: 2000,
  detectors: [
    new RegexPatternDetector([
      /ignore previous instructions/i,
      /system prompt override/i,
      /encode.*url.*path/i,
    ]),
    new SemanticIntentClassifier('https://guard-model.internal/v1/analyze'),
    new ScopeViolationDetector(['mfa_codes', 'api_keys', 'ssn']),
  ],
});

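const verdict = await guardian.evaluate(userPrompt, agentContext);
if (verdict === 'BLOCK') {
  throw new Error('Inference guard blocked request');
}
if (verdict === 'REVIEW') {
  // REVIEW is the guard's fail-safe state; hold the request for human review instead of proceeding
  throw new Error('Inference guard requires review before this request can proceed');
}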

Pitfall Guide

Deploying inference detection in production introduces specific failure modes. The following pitfalls are derived from CVE analysis and ensemble telemetry.

The LLM06 Mirage
- Explanation: Inference detectors target LLM01 (Prompt Injection) and scope violations. They do not detect Excessive Agency (LLM06). An agent may delete a production database because it interprets a benign request as "helpful," without any injection. The inference detector passes the prompt, but the action is unauthorized.
- Fix: Compose inference detection with a decision layer (e.g., Cedar policies) that governs tool permissions and action scope. Inference guards the prompt; decision guards the action.
Adversarial Evasion of Detectors
- Explanation: Detectors themselves are attackable. Research such as STACK and adversarial-judge studies demonstrates that prompts can be crafted to evade specific classifiers. Relying on a single model or static patterns allows attackers to probe and bypass defenses.
- Fix: Use heterogeneous ensembles with different model architectures. Rotate detector models periodically. Monitor detector logs for evasion patterns and retrain classifiers on adversarial examples.
Security Context False Positives
- Explanation: Classifiers often over-defend on security-themed benign inputs. PromptGuard, for example, flags 99.1% of security-themed benign inputs as malicious. Agents operating in security domains (e.g., vulnerability scanning) may trigger constant blocks.
- Fix: Implement context-aware whitelisting. Tune thresholds based on domain-specific corpora. Use the ensemble to require consensus; a single classifier flagging a security term should not block if other detectors pass.
Latency Budget Blowout
- Explanation: Ensemble detection adds latency. LLMTrace reports ~1.5s median overhead. In high-throughput chat or real-time agents, this can degrade user experience or cause timeouts.
- Fix: Execute detectors asynchronously where possible. Cache results for identical prompts. Use distilled, smaller models for detection. Implement timeout fallbacks to REVIEW rather than blocking.
Recall Gaps and False Negatives
- Explanation: No detector achieves 100% recall. LLMTrace reports 79.7% recall, which corresponds to 16 missed attacks in 79 malicious samples (63/79 ≈ 79.7%), or roughly one in five. Novel injection techniques or obfuscation may bypass detection.
- Fix: Implement human-in-the-loop review for low-confidence scores. Continuously analyze trace stores for missed attacks. Update patterns and retrain classifiers based on new CVEs and adversarial research.
Scope Blindness
- Explanation: Detectors may miss scope violations if they lack context about sensitive data. CVE-2025-32711 succeeded because the model was instructed to encode sensitive data into a URL path, which appeared benign to basic detectors.
- Fix: Integrate data classification tags into the context. The scope validator must know which fields are sensitive and check for exfiltration patterns, not just access requests.
Detector Drift
- Explanation: As models and prompts evolve, detector performance may degrade. Static configurations become outdated.
- Fix: Establish a continuous evaluation pipeline. Run detector benchmarks against updated corpora. Automate alerts when precision or recall drops below thresholds.
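
The Scope Blindness fix calls for checking exfiltration patterns, not just access requests. One way to approach this on the response side is to scan model output for URLs that embed known sensitive values before the output is rendered or unfurled. The sketch below is illustrative only: `findExfiltrationUrls`, `SensitiveValue`, and the four-character minimum are assumptions, and it catches only verbatim or percent-encoded values, so other encodings such as base64 would pass.

```typescript
// Hypothetical response-side check: flags URLs in model output that carry
// a sensitive value known to be present in the agent's context.
interface SensitiveValue {
  field: string; // classification tag, e.g. 'mfa_codes'
  value: string; // the secret value itself
}

interface ExfilFinding {
  url: string;
  field: string;
}

function findExfiltrationUrls(
  modelOutput: string,
  sensitiveValues: SensitiveValue[],
): ExfilFinding[] {
  const findings: ExfilFinding[] = [];
  // Rough URL extraction; stops at whitespace and common closing delimiters.
  const urls = modelOutput.match(/https?:\/\/[^\s)>\]"']+/gi) ?? [];

  for (const url of urls) {
    let decoded = url;
    try {
      decoded = decodeURIComponent(url);
    } catch {
      // Malformed percent-encoding: fall back to the raw URL only.
    }
    const haystack = `${url}\n${decoded}`.toLowerCase();
    for (const { field, value } of sensitiveValues) {
      // Skip very short values to avoid matching on incidental substrings.
      if (value.length >= 4 && haystack.includes(value.toLowerCase())) {
        findings.push({ url, field });
      }
    }
  }
  return findings;
}
```

A non-empty result would typically feed the scope validator as a BLOCK vote on the response, with the matched field name recorded in the trace store.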

Production Bundle

Action Checklist

Map Context Ingestion Points: Identify all sources of untrusted context (emails, calendars, web scrapes, databases) and ensure each is routed through the inference guardian.
Deploy Ensemble Architecture: Implement a multi-detector ensemble with pattern matching, semantic classification, and scope validation. Configure majority voting.
Calibrate Thresholds: Tune voting thresholds and detector sensitivities using a corpus of benign security-themed inputs to minimize false positives.
Integrate Decision Layer: Combine inference detection with policy-based decision controls (e.g., Cedar) to address LLM06 Excessive Agency.
Implement Latency Safeguards: Add timeout handling, async execution, and caching to manage latency overhead. Set fallback to REVIEW on timeout.
Audit Trace Stores: Analyze historical traces to quantify the fraction of prompts authored by non-human sources. Identify past scope violations.
Monitor Detector Health: Track precision, recall, and latency metrics. Set alerts for performance degradation or adversarial evasion patterns.
Test Against Adversarial Inputs: Regularly evaluate the ensemble against updated OWASP LLM01 corpora and novel injection techniques.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High-Throughput Chat	Regex + Lightweight Classifier	Low latency is critical; regex catches most obvious attacks.	Low compute cost; minimal latency impact.
Financial Agent	Full Ensemble + Scope Validator	High risk of exfiltration and scope violation; precision is paramount.	Higher compute cost; acceptable latency for security.
Security Domain Agent	Ensemble + Context-Aware Whitelist	Avoids false positives on security-themed inputs; maintains accuracy.	Moderate cost; requires tuning effort.
Internal Tooling	Single Classifier + Pattern Matcher	Lower risk profile; balance between security and cost.	Low cost; faster deployment.
Regulated Environment	Full Ensemble + Human Review	Compliance requires high assurance and auditability.	Highest cost; includes review workflow overhead.

Configuration Template

inference_guard:
  ensemble:
    - type: regex
      weight: 1
      patterns:
        - "ignore previous instructions"
        - "system prompt override"
        - "encode.*url.*path"
    - type: semantic
      model: guard-model-v2
      endpoint: https://guard-model.internal/v1/analyze
      weight: 2
      risk_threshold: 0.7
    - type: scope
      policy: data-classification
      sensitive_fields:
        - mfa_codes
        - api_keys
        - ssn
      weight: 2
  voting:
    strategy: majority
    threshold: 0.6
  latency:
    max_ms: 2000
    fallback: review
  monitoring:
    metrics:
      - precision
      - recall
      - latency_p95
    alerts:
      - metric: precision
        threshold: 0.9
        duration: 5m

Quick Start Guide

Define Detectors: Implement or configure regex, semantic, and scope detectors based on your domain requirements.
Wrap Agent Calls: Integrate the InferenceGuardian into your agent harness. Evaluate prompts before model execution.
Configure Voting: Set voting thresholds and latency limits in the configuration file. Adjust based on initial telemetry.
Run Validation: Test the ensemble against a corpus of benign and malicious prompts. Verify precision, recall, and latency metrics.
Deploy and Monitor: Roll out to production. Monitor trace stores and detector metrics. Iterate on configurations based on feedback.
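
The "Wrap Agent Calls" step can be sketched as a thin wrapper around the model call that scores both the prompt and the response. This is a simplified illustration under stated assumptions: `PromptGuard` is a hypothetical single-argument stand-in for the guardian's `evaluate` method (the context argument is omitted), and `guardedModelCall`, `GuardedResult`, and `callModel` are names invented for the example.

```typescript
type Verdict = 'ALLOW' | 'BLOCK' | 'REVIEW';

// Hypothetical, simplified guard interface for this sketch.
interface PromptGuard {
  evaluate(text: string): Promise<Verdict>;
}

type Stage = 'prompt' | 'response';

type GuardedResult =
  | { status: 'completed'; output: string }
  | { status: 'held'; stage: Stage }     // needs human review
  | { status: 'blocked'; stage: Stage }; // rejected outright

async function guardedModelCall(
  guard: PromptGuard,
  callModel: (prompt: string) => Promise<string>,
  prompt: string,
): Promise<GuardedResult> {
  // Score the prompt first so a blocked input never reaches the model.
  const promptVerdict = await guard.evaluate(prompt);
  if (promptVerdict === 'BLOCK') return { status: 'blocked', stage: 'prompt' };
  if (promptVerdict === 'REVIEW') return { status: 'held', stage: 'prompt' };

  const output = await callModel(prompt);

  // Score the response before it is delivered or acted upon.
  const responseVerdict = await guard.evaluate(output);
  if (responseVerdict === 'BLOCK') return { status: 'blocked', stage: 'response' };
  if (responseVerdict === 'REVIEW') return { status: 'held', stage: 'response' };

  return { status: 'completed', output };
}
```

Returning a tagged result instead of throwing lets the harness route `held` requests to a review queue and log `blocked` ones with the stage at which they were stopped.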
