Stopping the LLM from calling the same tool twice (and other things it shouldn't)

Designing a Pre-Execution Policy Gate for Agentic Tool Safety

Current Situation Analysis

Agentic systems are increasingly deployed with the ability to execute side effects: creating documents, charging payments, and modifying database records. However, a critical failure mode is emerging that standard observability tools miss: unintended execution loops.

The industry often treats tool calls as pure functions, focusing on whether the model generates a syntactically valid JSON payload. This perspective ignores the operational reality that tool calls are state-changing operations. When an agent enters a loop, it can generate valid, well-formed calls that result in catastrophic side effects, such as duplicate charges or resource exhaustion.

Consider a production incident where an agent was tasked with summarizing a product catalog and sharing the result. The agent's tool definition was incomplete; it lacked tools for fetching data, writing content, or sharing files. It only possessed a create_document tool. Faced with a multi-step goal and a single available action, the agent entered a loop, invoking create_document seven times with identical arguments. The result was seven empty documents in the user's drive. No data was processed, no sharing occurred, and the user's storage was polluted.

This incident highlights a fundamental gap: tool calls require a policy layer that evaluates intent and state before execution, not an audit log that records damage after the fact. Relying on downstream systems to reject duplicates is insufficient; those systems often lack the conversational context to distinguish between a legitimate retry and an agent loop. The cost of a post-hoc apology is significantly higher than the latency introduced by a pre-execution gate.

WOW Moment: Key Findings

The shift from reactive auditing to proactive gating fundamentally changes the risk profile of agentic systems. The table below compares architectural approaches based on their ability to mitigate side-effect risks.

Strategy	Detection Latency	False Positive Rate	Implementation Effort	Blast Radius Mitigation
Post-Hoc Audit	High (After damage)	Low	Low	None
Byte-Hash Dedup	Low	Medium	Low	Partial
Semantic + Idempotency Gate	Low	Low	High	High
Side-Effect Graph	Low	Very Low	Very High	Complete

Why this matters: A byte-hash deduplication layer catches obvious loops but fails when the model varies argument formatting or when downstream idempotency keys are mishandled. A policy gate that combines semantic canonicalization, idempotency injection, and authorization tiers closes most of those gaps and sharply limits the blast radius. The "Side-Effect Graph" approach, while expensive, is the only one of the four that prevents "intent-equal" loops, where the agent uses different tools to achieve the same forbidden mutation.
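To make the byte-hash limitation concrete, here is a minimal sketch. The function names byteHash and canonicalHash, and the amount and currency fields, are illustrative assumptions rather than part of any library; the canonical form shown is one possible choice.

```typescript
import { createHash } from 'node:crypto';

// Hash the raw payload exactly as the model emitted it.
function byteHash(rawJson: string): string {
  return createHash('sha256').update(rawJson).digest('hex');
}

// Hypothetical canonical form for a payment-like tool:
// integer cents, upper-case currency, fixed key order.
function canonicalHash(rawJson: string): string {
  const parsed = JSON.parse(rawJson) as { amount: number | string; currency: string };
  const canonical = {
    amount_cents: Math.round(Number(parsed.amount) * 100),
    currency: parsed.currency.trim().toUpperCase(),
  };
  return createHash('sha256').update(JSON.stringify(canonical)).digest('hex');
}

// Two payloads that mean the same thing but differ byte for byte.
const firstPayload = '{"amount": 100.00, "currency": "usd"}';
const secondPayload = '{"currency":"USD","amount":"100"}';

console.log(byteHash(firstPayload) === byteHash(secondPayload)); // false
console.log(canonicalHash(firstPayload) === canonicalHash(secondPayload)); // true
```

The byte hash treats the second payload as a new call, while the canonical hash recognizes it as a repeat.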

Core Solution

Building a pre-execution policy gate requires intercepting the tool call between the model's output and the actual execution environment. The gate must canonicalize inputs, check for duplicates, enforce authorization, and inject safety mechanisms like idempotency keys.

Architecture Decisions

Canonicalization over Raw Comparison: Models may generate semantically identical arguments with different formatting (e.g., {"amount": 100.00} vs {"amount": 100}). The gate must normalize arguments based on tool-specific schemas before hashing.
Content-Based Idempotency: The model should never generate idempotency keys. If the model retries a call, it might generate a new key, causing the downstream system to treat it as a new operation. The policy gate must derive the idempotency key from the canonicalized argument hash.
Structured Refusals: When a call is denied, the refusal must be a structured object that the model can parse. This includes the reason for denial, the result of the previous call (if applicable), and suggested next steps to prevent the model from looping on the refusal.
Tiered Authorization: Not all tools require the same level of scrutiny. Reads should be allowlisted, medium-risk writes should use per-conversation grants, and high-risk operations should trigger Human-in-the-Loop (HITL) workflows.

Implementation

The following TypeScript implementation demonstrates an AgenticPolicyEngine class that enforces these rules. The ToolCall interface is hypothetical and should be adapted to your agent framework; the example focuses on the policy logic.

import { createHash } from 'crypto';

interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
  callId: string;
}

interface PolicyDecision {
  action: 'ALLOW' | 'DENY';
  reason?: string;
  idempotencyKey?: string;
  previousResult?: unknown;
  suggestedNext?: string[];
}

interface ToolSchema {
  name: string;
  canonicalFields: string[];
  normalizers: Record<string, (val: unknown) => unknown>;
  riskLevel: 'LOW' | 'MEDIUM' | 'HIGH';
  sideEffectResource?: string;
}

class AgenticPolicyEngine {
  private dedupCache: Map<string, { timestamp: number; result: unknown }>;
  private conversationGrants: Map<string, { ceiling: number; count: number }>;
  private toolSchemas: Map<string, ToolSchema>;

  constructor() {
    this.dedupCache = new Map();
    this.conversationGrants = new Map();
    this.toolSchemas = new Map();
  }

  registerToolSchema(schema: ToolSchema): void {
    this.toolSchemas.set(schema.name, schema);
  }

  evaluate(call: ToolCall, conversationId: string): PolicyDecision {
    const schema = this.toolSchemas.get(call.name);
    if (!schema) {
      return { action: 'DENY', reason: 'unknown_tool' };
    }

    // 1. Canonicalize arguments
    const canonicalArgs = this.canonicalize(call.arguments, schema);
    const contentHash = this.computeHash(call.name, canonicalArgs);

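    // 2. Check for duplicates. The key is scoped to the conversation so identical
    //    calls made in unrelated conversations are not refused as loops.
    const dedupKey = `${conversationId}:${contentHash}`;
    const existing = this.dedupCache.get(dedupKey);
    if (existing) {
      return this.buildRefusal(call, 'duplicate_detected', existing.result, schema);
    }

    // 3. Authorization check
    const authResult = this.checkAuthorization(call, conversationId, schema);
    if (authResult.denied) {
      return { action: 'DENY', reason: authResult.reason };
    }

    // 4. Idempotency injection: derived from the content hash, never from the model
    const idempotencyKey = dedupKey;

    // 5. Allow and record. The result is unknown until the tool runs, so the
    //    executor must report it afterwards through recordResult().
    this.dedupCache.set(dedupKey, { timestamp: Date.now(), result: null });

    // Update grant counter for medium-risk tools
    if (schema.riskLevel === 'MEDIUM') {
      this.incrementGrant(conversationId);
    }

    return {
      action: 'ALLOW',
      idempotencyKey,
    };
  }

  // Stores the tool's actual result so a later duplicate refusal can return it
  // as previousResult instead of null.
  recordResult(idempotencyKey: string, result: unknown): void {
    const entry = this.dedupCache.get(idempotencyKey);
    if (entry) {
      entry.result = result;
    }
  }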

  private canonicalize(args: Record<string, unknown>, schema: ToolSchema): Record<string, unknown> {
    const normalized: Record<string, unknown> = {};
    
    for (const field of schema.canonicalFields) {
      const value = args[field];
      const normalizer = schema.normalizers[field];
      normalized[field] = normalizer ? normalizer(value) : value;
    }
    
    return normalized;
  }

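  private computeHash(toolName: string, args: Record<string, unknown>): string {
    // Serialize the arguments as sorted [key, value] pairs so key order never
    // affects the hash. An array passed as JSON.stringify's replacer is a
    // property allowlist applied at every level, so it would drop the "tool"
    // and "args" keys and make every call hash to the same value.
    // Nested objects are serialized as-is; normalizers should flatten or order them.
    const sortedArgs = Object.keys(args).sort().map((key) => [key, args[key]]);
    const payload = JSON.stringify({ tool: toolName, args: sortedArgs });
    return createHash('sha256').update(payload).digest('hex');
  }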

  private checkAuthorization(
    call: ToolCall, 
    convId: string, 
    schema: ToolSchema
  ): { denied: boolean; reason?: string } {
    switch (schema.riskLevel) {
      case 'LOW':
        return { denied: false };
      case 'MEDIUM': {
        // Braces give the const declaration its own block scope within the switch
        const grant = this.conversationGrants.get(convId);
        if (!grant || grant.count >= grant.ceiling) {
          return { denied: true, reason: 'grant_exceeded' };
        }
        return { denied: false };
      }
      case 'HIGH':
        // In production, this would trigger a HITL workflow
        return { denied: true, reason: 'requires_human_approval' };
      default:
        return { denied: true, reason: 'unknown_risk_level' };
    }
  }

  private buildRefusal(
    call: ToolCall, 
    reason: string, 
    previousResult: unknown, 
    schema: ToolSchema
  ): PolicyDecision {
    return {
      action: 'DENY',
      reason,
      previousResult,
      suggestedNext: this.generateSuggestions(call.name, schema),
    };
  }

  private generateSuggestions(toolName: string, schema: ToolSchema): string[] {
    // Logic to suggest alternative actions based on tool metadata
    if (schema.sideEffectResource) {
      return [`read_${schema.sideEffectResource}`, `list_${schema.sideEffectResource}s`];
    }
    return [];
  }

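  // Opens (or resets) a per-conversation grant. Without one, every
  // MEDIUM-risk call is denied with 'grant_exceeded'.
  grantConversation(convId: string, ceiling: number): void {
    this.conversationGrants.set(convId, { ceiling, count: 0 });
  }

  private incrementGrant(convId: string): void {
    const grant = this.conversationGrants.get(convId);
    if (grant) {
      grant.count++;
    }
  }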
}

Rationale

Canonicalization: By defining canonicalFields and normalizers per tool, the system handles semantic duplicates. For example, a phone_number field can be normalized to E.164 format before hashing, so that +1-415-555-0199 and (415) 555-0199 hash identically, assuming a default country code of +1.
Idempotency Key Derivation: The idempotencyKey is derived from the content hash. This ensures that even if the model retries the call, the downstream system receives the same key and, provided it honors idempotency keys, returns the cached result instead of double-charging or creating a duplicate.
Structured Refusal: The buildRefusal method returns previousResult and suggestedNext. This allows the model to understand that the action was already completed and suggests it proceed to the next step (e.g., reading the created resource) rather than retrying.
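The phone number normalization mentioned in the canonicalization rationale could look like the sketch below. It is deliberately simplified: normalizePhone is a hypothetical helper that assumes North American numbers and a default country code of 1. A production system would more likely rely on a dedicated phone-number library.

```typescript
// Simplified E.164-style normalizer (assumption: 10-digit national numbers,
// default country code "1"). Non-string or unrecognized values pass through.
function normalizePhone(value: unknown, defaultCountryCode = '1'): unknown {
  if (typeof value !== 'string') return value;
  const digits = value.replace(/\D/g, '');
  if (value.trim().startsWith('+')) return `+${digits}`;
  if (digits.length === 10) return `+${defaultCountryCode}${digits}`;
  if (digits.length === 11 && digits.startsWith(defaultCountryCode)) return `+${digits}`;
  return value; // unknown shape: leave untouched rather than guess
}

console.log(normalizePhone('+1-415-555-0199')); // +14155550199
console.log(normalizePhone('(415) 555-0199')); // +14155550199
```

The function matches the normalizer signature used in ToolSchema, so it could be registered under a phone_number field.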

Pitfall Guide

1. The "Approve-All" Fatigue

Explanation: Requiring human approval for every tool call leads to approval fatigue. Users begin clicking "approve" without reading, rendering the safety mechanism useless. Fix: Implement tiered authorization. Allowlist low-risk reads. Use per-conversation grants for medium-risk writes. Reserve HITL for high-risk or out-of-policy actions.

2. Canonicalization Blind Spots

Explanation: If the canonicalization logic is incomplete, the model can bypass deduplication by slightly altering arguments (e.g., adding a whitespace or changing a boolean representation). Fix: Define strict schemas for all write tools. Use comprehensive normalizers for all identifying fields. Regularly audit canonicalization rules against model output patterns.
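As an illustration of closing these blind spots, the sketch below shows two small normalizers for the whitespace and boolean cases mentioned above. The names collapseWhitespace and toBoolean, and the accepted spellings of true and false, are assumptions chosen for this example.

```typescript
type Normalizer = (value: unknown) => unknown;

// Trim and collapse internal runs of whitespace to a single space.
const collapseWhitespace: Normalizer = (value) =>
  typeof value === 'string' ? value.trim().replace(/\s+/g, ' ') : value;

// Map common textual and numeric boolean spellings to real booleans.
const toBoolean: Normalizer = (value) => {
  if (typeof value === 'boolean') return value;
  if (value === 1) return true;
  if (value === 0) return false;
  if (typeof value === 'string') {
    const text = value.trim().toLowerCase();
    if (['true', 'yes', '1'].includes(text)) return true;
    if (['false', 'no', '0'].includes(text)) return false;
  }
  return value; // unrecognized input is left for schema validation to reject
};

console.log(collapseWhitespace('  Q3   catalog summary ')); // Q3 catalog summary
console.log(toBoolean('TRUE') === toBoolean(true)); // true
```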

3. Idempotency Key Leakage

Explanation: If the model generates idempotency keys, it may generate new keys on retry, causing the downstream system to treat retries as new operations. Fix: Never expose idempotency key generation to the model. The policy gate must inject keys based on content hashes. Ensure downstream APIs accept these injected keys.
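One way to enforce this is to strip any key-like field the model supplies and attach a gate-derived key as a header. The sketch below assumes the header name from the configuration template (X-Agent-Idempotency-Key); prepareRequest and the list of stripped field names are hypothetical.

```typescript
import { createHash } from 'node:crypto';

const IDEMPOTENCY_HEADER = 'X-Agent-Idempotency-Key';
// Field names a model might invent for its own keys (illustrative list).
const MODEL_KEY_FIELDS = ['idempotency_key', 'idempotencyKey', 'request_id'];

function prepareRequest(toolName: string, modelArgs: Record<string, unknown>) {
  // Copy arguments in sorted key order, dropping anything the model
  // supplied as an idempotency key.
  const body: Record<string, unknown> = {};
  for (const key of Object.keys(modelArgs).sort()) {
    if (!MODEL_KEY_FIELDS.includes(key)) body[key] = modelArgs[key];
  }
  const derivedKey = createHash('sha256')
    .update(JSON.stringify({ tool: toolName, body }))
    .digest('hex');
  const headers: Record<string, string> = { [IDEMPOTENCY_HEADER]: derivedKey };
  return { headers, body };
}

// The model retries with a fresh key of its own; the injected key stays the same.
const attempt = prepareRequest('create_invoice', { amount_cents: 500, idempotency_key: 'abc' });
const retry = prepareRequest('create_invoice', { idempotency_key: 'xyz', amount_cents: 500 });
console.log(attempt.headers[IDEMPOTENCY_HEADER] === retry.headers[IDEMPOTENCY_HEADER]); // true
```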

4. The Paraphrase Bypass

Explanation: The model may use a different tool to achieve the same side effect (e.g., calling create_invoice vs. charge_card for the same order). Byte-hash deduplication will not catch this. Fix: Implement a side-effect graph. Map tools to the resources they mutate. The policy gate should refuse multiple mutations of the same resource within a conversation without explicit confirmation.
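A side-effect graph can start as a simple lookup from tool name to the resource it mutates. The sketch below is a minimal version under stated assumptions: the order_payment resource name, the order_id argument, and the SideEffectTracker class are all hypothetical, and a refusal here stands in for the "explicit confirmation" step.

```typescript
interface SideEffect {
  resource: string; // logical resource the tool mutates
  idField: string; // argument that identifies the resource instance
}

// Both tools are mapped to the same resource, so the gate can see that
// they compete for the same mutation.
const SIDE_EFFECT_GRAPH: Record<string, SideEffect> = {
  create_invoice: { resource: 'order_payment', idField: 'order_id' },
  charge_card: { resource: 'order_payment', idField: 'order_id' },
};

class SideEffectTracker {
  private readonly mutated = new Map<string, Set<string>>();

  // Returns true if the call may proceed, false if this conversation
  // has already mutated the same resource instance.
  tryMutate(conversationId: string, toolName: string, args: Record<string, unknown>): boolean {
    const effect = SIDE_EFFECT_GRAPH[toolName];
    if (!effect) return true; // tool has no mapped side effect
    const resourceKey = `${effect.resource}:${String(args[effect.idField])}`;
    const seen = this.mutated.get(conversationId) ?? new Set<string>();
    if (seen.has(resourceKey)) return false;
    seen.add(resourceKey);
    this.mutated.set(conversationId, seen);
    return true;
  }
}

const tracker = new SideEffectTracker();
console.log(tracker.tryMutate('conv-1', 'create_invoice', { order_id: 'A17' })); // true
console.log(tracker.tryMutate('conv-1', 'charge_card', { order_id: 'A17' })); // false
console.log(tracker.tryMutate('conv-1', 'charge_card', { order_id: 'B02' })); // true
```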

5. Opaque Refusals

Explanation: Returning a generic error message (e.g., "Tool call failed") leaves the model without context, causing it to loop or hallucinate. Fix: Always return structured refusals with reason, previousResult, and suggestedNext. This enables the model to replan effectively.
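A structured refusal can be as small as a JSON object returned in place of the tool result. The sketch below assumes a generic string tool message; the Refusal shape and buildToolMessage helper are hypothetical, and the exact envelope depends on the agent framework in use.

```typescript
interface Refusal {
  status: 'refused';
  reason: string;
  previousResult: unknown;
  suggestedNext: string[];
}

// Serialize the refusal so the model receives machine-readable context
// instead of a bare error string.
function buildToolMessage(reason: string, previousResult: unknown, suggestedNext: string[]): string {
  const refusal: Refusal = {
    status: 'refused',
    reason,
    previousResult: previousResult ?? null, // keep the field present even when empty
    suggestedNext,
  };
  return JSON.stringify(refusal);
}

const refusalMessage = buildToolMessage(
  'duplicate_detected',
  { documentId: 'doc-1' },
  ['read_document', 'list_documents'],
);
console.log(refusalMessage);
```

Keeping previousResult present, even as null, gives the model a stable shape to plan against.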

6. Loop Blindness on Successful Calls

Explanation: A loop of successful calls (like the seven empty documents) may not trigger a refusal counter. The system only sees valid executions. Fix: Track tool execution frequency per conversation. If a tool is called N times with identical arguments, trigger a circuit breaker even if the calls are successful.
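A frequency-based circuit breaker might be sketched as follows. As a simplification, this version counts calls per conversation and tool regardless of arguments; the LoopBreaker class is hypothetical, and the defaults mirror the loop_threshold and window_ms values from the configuration template.

```typescript
class LoopBreaker {
  private readonly calls = new Map<string, number[]>();
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold = 3, windowMs = 300_000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one call and returns true when the breaker should trip,
  // i.e. when more than `threshold` calls fall inside the sliding window.
  record(conversationId: string, toolName: string, now: number = Date.now()): boolean {
    const key = `${conversationId}:${toolName}`;
    const recent = (this.calls.get(key) ?? []).filter((t) => now - t < this.windowMs);
    recent.push(now);
    this.calls.set(key, recent);
    return recent.length > this.threshold;
  }
}

const breaker = new LoopBreaker(3, 300_000);
const trips = [0, 1, 2, 3].map((i) => breaker.record('conv-1', 'create_document', 1_000 + i));
console.log(trips); // [ false, false, false, true ]
```

The explicit `now` parameter exists only to make the window testable without waiting on a real clock.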

7. Grant Ceiling Misconfiguration

Explanation: Setting per-conversation grant ceilings too low can interrupt legitimate workflows. Setting them too high increases blast radius. Fix: Start with conservative ceilings and tune based on telemetry. Monitor grant exhaustion rates and adjust dynamically based on tool risk profiles.

Production Bundle

Action Checklist

Define Tool Schemas: Create canonicalization rules and risk levels for all tools exposed to agents.
Implement Policy Gate: Deploy the AgenticPolicyEngine as a middleware between the model and tool execution.
Configure Idempotency: Ensure all write tools accept content-based idempotency keys injected by the gate.
Set Authorization Tiers: Configure allowlists for reads, per-conversation grants for writes, and HITL for destructive actions.
Add Structured Refusals: Update refusal payloads to include previousResult and suggestedNext fields.
Deploy Circuit Breakers: Implement frequency tracking to detect loops of successful calls.
Map Side Effects: For high-risk tools, define resource mutation mappings to prevent paraphrase bypasses.
Monitor Telemetry: Track dedup hits, refusal rates, and grant exhaustion to tune policies.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Read-Only Queries	Allowlist	Low risk, high frequency. No side effects.	Negligible
Create Resource	Per-Conv Grant	Medium risk. User context implies intent. Ceiling limits blast radius.	Low UX cost
Update Resource	Per-Conv Grant	Medium risk. Idempotency prevents duplicates.	Low UX cost
Delete Resource	HITL	High risk. Destructive action requires explicit confirmation.	High UX cost
Financial Transfer	Side-Effect Graph + HITL	Critical risk. Complex dependencies. Must prevent paraphrase bypass.	High Dev cost
External API Call	Semantic Dedup + Idempotency	Risk of duplicate charges. Canonicalization handles formatting variance.	Medium Dev cost

Configuration Template

policy_engine:
  deduplication:
    window_ms: 300000
    strategy: semantic_hash
    loop_threshold: 3
  
  idempotency:
    generation: content_based
    header: X-Agent-Idempotency-Key
    ttl_ms: 86400000
  
  authorization:
    tiers:
      - name: reads
        pattern: "get_*|list_*"
        action: allowlist
        rate_limit: 100/min
      
      - name: writes
        pattern: "create_*|update_*"
        action: per_conversation_grant
        ceiling: 10
        ttl_ms: 3600000
      
      - name: destructive
        pattern: "delete_*|transfer_*"
        action: hitl
        require_reason: true
  
  tool_schemas:
    create_invoice:
      risk_level: MEDIUM
      canonical_fields: ["amount_cents", "currency", "customer_id"]
      normalizers:
        amount_cents: "to_int"
        currency: "to_upper"
      side_effect_resource: "invoice"
    
    send_email:
      risk_level: MEDIUM  # sending email is a side effect, so it should not share the allowlisted read tier
      canonical_fields: ["recipient", "subject"]
      normalizers:
        recipient: "to_lower"
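The template names normalizers and tier patterns as strings, so the gate needs to resolve them at load time. The sketch below shows one possible resolution. It assumes that patterns are |-separated globs in which * matches any run of characters, and that tools matching no tier are denied by default; neither assumption is stated in the template, and NORMALIZERS, globToRegExp, and resolveTier are hypothetical names.

```typescript
// Registry for the normalizer names used in the template.
const NORMALIZERS: Record<string, (value: unknown) => unknown> = {
  to_int: (v) => {
    const n = typeof v === 'string' || typeof v === 'number' ? Number(v) : NaN;
    return Number.isFinite(n) ? Math.trunc(n) : v;
  },
  to_upper: (v) => (typeof v === 'string' ? v.toUpperCase() : v),
  to_lower: (v) => (typeof v === 'string' ? v.toLowerCase() : v),
};

// Convert "get_*|list_*" into an anchored regular expression.
function globToRegExp(pattern: string): RegExp {
  const alternatives = pattern.split('|').map((glob) =>
    glob
      .split('*')
      .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
      .join('.*'),
  );
  return new RegExp(`^(?:${alternatives.join('|')})$`);
}

const TIERS = [
  { name: 'reads', pattern: 'get_*|list_*', action: 'allowlist' },
  { name: 'writes', pattern: 'create_*|update_*', action: 'per_conversation_grant' },
  { name: 'destructive', pattern: 'delete_*|transfer_*', action: 'hitl' },
];

// First matching tier wins; unmatched tools are denied (assumption).
function resolveTier(toolName: string): string {
  const tier = TIERS.find((t) => globToRegExp(t.pattern).test(toolName));
  return tier ? tier.action : 'deny';
}

console.log(resolveTier('list_invoices')); // allowlist
console.log(resolveTier('create_invoice')); // per_conversation_grant
console.log(resolveTier('send_email')); // deny
```

Note that with these patterns send_email matches no tier, which is why an explicit default matters.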

Quick Start Guide

Wrap Tool Execution: Integrate the AgenticPolicyEngine.evaluate() method into your tool execution pipeline. Ensure all calls pass through the gate before reaching the actual tool implementation.
Register Schemas: Define schemas for your tools, specifying canonical fields, normalizers, and risk levels. Start with high-risk tools and expand coverage.
Configure Auth Tiers: Set up authorization rules based on your risk tolerance. Enable allowlists for reads and per-conversation grants for writes.
Test with Duplicate Injection: Simulate agent loops by sending duplicate tool calls. Verify that the gate returns structured refusals and prevents execution.
Monitor and Tune: Deploy telemetry to track policy decisions. Adjust canonicalization rules, grant ceilings, and loop thresholds based on observed traffic patterns.
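The wrap-and-test steps above can be exercised end to end with a small harness. The sketch below replays the seven-call incident against a stand-in gate. TinyGate is a hypothetical, exact-match-only substitute for the full policy engine, and createDocument is a fake tool that only counts executions.

```typescript
import { createHash } from 'node:crypto';

type Decision =
  | { action: 'ALLOW'; idempotencyKey: string }
  | { action: 'DENY'; reason: string; previousResult: unknown };

// Hypothetical stand-in for the policy engine: exact-match dedup only.
class TinyGate {
  private readonly seen = new Map<string, unknown>();

  evaluate(name: string, args: Record<string, unknown>): Decision {
    const sorted = Object.keys(args).sort().map((k) => [k, args[k]]);
    const key = createHash('sha256').update(JSON.stringify([name, sorted])).digest('hex');
    if (this.seen.has(key)) {
      return { action: 'DENY', reason: 'duplicate_detected', previousResult: this.seen.get(key) };
    }
    this.seen.set(key, null);
    return { action: 'ALLOW', idempotencyKey: key };
  }

  recordResult(key: string, result: unknown): void {
    this.seen.set(key, result);
  }
}

// Fake tool: counts how often it really runs.
let executions = 0;
function createDocument(args: Record<string, unknown>): { id: string; title: string } {
  executions += 1;
  return { id: `doc-${executions}`, title: String(args.title) };
}

// Every call passes through the gate before the tool is invoked.
function guardedCall(gate: TinyGate, args: Record<string, unknown>): unknown {
  const decision = gate.evaluate('create_document', args);
  if (decision.action === 'DENY') return decision; // refusal goes back to the model
  const result = createDocument(args);
  gate.recordResult(decision.idempotencyKey, result);
  return result;
}

// Duplicate injection: seven identical calls, as in the incident.
const gate = new TinyGate();
const responses = Array.from({ length: 7 }, () => guardedCall(gate, { title: 'Catalog summary' }));
console.log(executions); // 1
```

The first response is the created document; the remaining six are refusals that carry it as previousResult.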
