How to build a 22ms agent goal-drift detector

By Codcompass Team · 8 min read

Robust Trajectory Verification: Rank-Weighted Embedding Voting for Agent Safety

Current Situation Analysis

Modern AI agents operate as multi-step trajectory executors. Unlike single-turn chatbots, these systems chain tool calls, state mutations, and external API interactions across extended sessions. The industry standard for guarding these trajectories relies on semantic similarity checks: embed each step, query a vector database, and compare the nearest neighbor against a policy allowlist. This approach assumes that malicious or drifted steps will naturally fall outside the safe embedding manifold.

The assumption breaks under adversarial pressure. Attackers do not need to generate completely novel malicious instructions. They only need to craft inputs that sit precisely on the decision boundary between safe and unsafe regions. A step like fs.read("/etc/passwd") can be linguistically reframed as fs.read("system_config.env") with a narrative wrapper about environment validation. When passed through a standard top-1 nearest-neighbor lookup, the query lands close enough to a legitimate configuration-reading step to bypass the threshold. The guardrail returns ALLOW. The session continues. The breach occurs.

This vulnerability stems from a fundamental mismatch between the decision rule and the threat model. Single-nearest-neighbor classification treats the embedding space as a static partition. It applies a hard threshold to a single distance metric. In benign distributions, safe and unsafe manifolds are well-separated, and top-1 works adequately. In adversarial or high-drift environments, the boundary becomes locally exploitable. An attacker only needs to optimize toward one reference point. The decision surface has zero redundancy.
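To make the weakness concrete, here is a minimal sketch of the hard top-1 rule described above. The 2-D vectors, the labels, the `top1_decision` name, and the 0.9 threshold are all illustrative assumptions, not values from a real deployment or vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top1_decision(query, references, threshold=0.9):
    """Hard top-1 rule: ALLOW only if the single nearest reference is
    labeled safe and its similarity clears the threshold (assumed 0.9)."""
    best_vec, best_label = max(references, key=lambda ref: cosine(query, ref[0]))
    sim = cosine(query, best_vec)
    return "ALLOW" if best_label == "safe" and sim >= threshold else "BLOCK"

# Hypothetical toy reference set: one safe and one unsafe embedding.
REFS = [
    ([1.0, 0.0], "safe"),
    ([0.0, 1.0], "unsafe"),
]

# A query nudged just far enough toward the safe reference passes the gate,
# because only one reference point has to be matched.
print(top1_decision([0.92, 0.39], REFS))
```

The decision depends on exactly one neighbor, so an input only has to be optimized against that one point.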

Production telemetry confirms the fragility. When evaluated against hand-crafted adversarial trajectories containing goal-shift, tool-abuse, and capability-laundering patterns, standard top-1 cosine similarity achieves a precision of 0.71 and a recall of 0.74, for an F1 score of 0.72. The remaining 0.28 shortfall from a perfect score represents exploitable surface area. The problem is not the vector search engine. The problem is the voting mechanism.

WOW Moment: Key Findings

Replacing the hard top-1 decision with a rank-weighted soft vote fundamentally changes what an attacker has to do. Instead of optimizing toward a single reference point, an adversarial input must now align with the centroid of multiple neighbors simultaneously. As a result, the required perturbation tends to grow with the number of voters, pushing the embedding further from the safe manifold and triggering secondary drift signals.
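A rank-weighted soft vote can be sketched as follows. The excerpt does not state the weighting scheme, so the 1/rank weights, the function name `rank_weighted_safety`, and the toy embeddings are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_weighted_safety(query, references, k=5):
    """Return a safety probability in [0, 1] from the k nearest references.
    Assumption: the neighbor at rank r (1 = nearest) votes with weight 1 / r."""
    ranked = sorted(references, key=lambda ref: cosine(query, ref[0]), reverse=True)[:k]
    weights = [1.0 / rank for rank in range(1, len(ranked) + 1)]
    safe_mass = sum(w for w, (_, label) in zip(weights, ranked) if label == "safe")
    return safe_mass / sum(weights)

# Hypothetical reference set: one safe point with unsafe points nearby.
REFS = [
    ([1.00, 0.00], "safe"),
    ([0.80, 0.60], "unsafe"),
    ([0.70, 0.71], "unsafe"),
    ([0.60, 0.80], "unsafe"),
    ([0.50, 0.87], "unsafe"),
]
query = [0.97, 0.24]  # nearest neighbor is the safe reference

print(rank_weighted_safety(query, REFS, k=1))  # top-1 sees only the safe neighbor
print(rank_weighted_safety(query, REFS, k=5))  # the unsafe neighbors outvote it
```

In this toy case the nearest neighbor alone yields a safety probability of 1.0, while the five-neighbor vote drops below 0.5 because the four lower-ranked neighbors are all unsafe.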

Retrieval Strategy   | Precision | Recall | F1 Score
Top-1 (Standard NN)  | 0.71      | 0.74   | 0.72
Top-3 Rank-Weighted  | 0.79      | 0.81   | 0.80
Top-5 Rank-Weighted  | 0.83      | 0.85   | 0.84
Top-7 Rank-Weighted  | 0.83      | 0.84   | 0.83

The data reveals a clear inflection point. Moving from k=1 to k=5 raises F1 from 0.72 to 0.84, a 12-point improvement. The gain stops at k=7, where F1 slips back to 0.83 because distant neighbors introduce semantic noise that dilutes the vote. k=5 emerges as the operational sweet spot: it forces an attacker to match enough reference points to neutralize one-shot boundary perturbations while maintaining tight computational bounds.
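The F1 figures can be re-derived from the reported precision and recall using the standard harmonic-mean definition. The snippet below is only a consistency check on the published numbers; the `RESULTS` dictionary simply restates them.

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for each retrieval strategy.
RESULTS = {
    "Top-1": (0.71, 0.74),
    "Top-3": (0.79, 0.81),
    "Top-5": (0.83, 0.85),
    "Top-7": (0.83, 0.84),
}

for name, (p, r) in RESULTS.items():
    print(f"{name}: F1 = {f1(p, r):.2f}")
```

Rounded to two decimals, the computed values match the reported F1 scores, including the small dip at k=7.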

This finding matters because it transforms trajectory verification from a brittle gate into a probabilistic filter. Instead of binary pass/fail decisions, the system outputs a continuous safety probability. That probability can be smoothed across time, weighted by action criticality, and combined with complementary signals (plan-execution matching, action-class Jaccard similarity, paraphrase stability) to form a resilient governance layer.
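Two of the ideas above, smoothing the probability over time and comparing action classes with Jaccard similarity, can be sketched in a few lines. The exponential moving average, the `alpha` value, and the action-class names are assumptions for illustration; the excerpt does not say which smoothing method or class taxonomy is used, and criticality weighting is left out.

```python
def smooth(probabilities, alpha=0.5):
    """Exponential moving average over per-step safety probabilities.
    Assumption: alpha = 0.5 weights the newest step and the history equally."""
    ema, out = None, []
    for p in probabilities:
        ema = p if ema is None else alpha * p + (1 - alpha) * ema
        out.append(ema)
    return out

def jaccard(planned, executed):
    """Jaccard similarity between planned and executed action classes."""
    planned, executed = set(planned), set(executed)
    union = planned | executed
    return len(planned & executed) / len(union) if union else 1.0

# Hypothetical session: per-step safety probability falls as the agent drifts.
print(smooth([1.0, 0.5, 0.0]))

# Hypothetical action classes: the plan said read and write, but the agent
# performed a read and a network call.
print(jaccard({"read", "write"}, {"read", "network"}))
```

Smoothing keeps a single noisy step from flipping the decision, while a sustained drop still pulls the running value down within a few steps.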

Core Solution

The architecture replaces single-point classification with ensemble-style soft voting in embedding space. The implementation
