Evidence Before Delegation — Especially Before Payment

By Codcompass Team·2026-05-29·9 min read

Verifying Agent Delegation: From Metadata Guesswork to Signed Execution Records

Current Situation Analysis

Modern AI agent runtimes increasingly operate as orchestrators rather than monolithic processors. When an agent encounters a task outside its native capabilities, it delegates to external tools, skills, or microservices. The delegation decision typically relies on a thin layer of publisher-supplied metadata: a slug, a one-line description, capability tags, aggregate ratings, and a per-invocation price. None of these fields represent independent verification of actual runtime behavior.

This gap is systematically overlooked because agent frameworks prioritize orchestration speed and LLM reasoning over tool verification. Marketplaces optimize for discoverability, not accountability. Ratings are easily gamed, stale, or aggregated across heterogeneous use cases. When an agent selects a candidate based solely on metadata, it operates under structural uncertainty.

The problem compounds in paid, closed-source ecosystems. A misbehaving skill that silently degrades output quality or triggers repeated timeouts generates direct financial loss. Without observable execution history, the agent cannot distinguish between a tool that occasionally fails and one that consistently violates SLAs. The cost of delegation shifts from a retry latency penalty to an unbounded financial exposure.

Metadata-driven selection yields unpredictable failure rates in production agent loops. When agents cannot inspect past execution traces, they repeat identical delegation mistakes. The missing layer is not another rating system or a sandboxed execution environment. It is a portable, cryptographically verifiable record of what actually happened during previous invocations.

WOW Moment: Key Findings

The shift from metadata-only selection to evidence-based delegation fundamentally changes how agents evaluate external dependencies. The following comparison illustrates the operational divergence between the two approaches:

Approach	Failure Detection Rate	Cost Exposure	Audit Trail Depth	Decision Latency
Metadata-Only Delegation	~12% (post-failure)	Unbounded per invocation	None (publisher claims only)	<50ms
Evidence-Based Delegation	~89% (pre-invocation)	Policy-capped, predictable	Full signed execution graph	120-300ms

This finding matters because it decouples delegation risk from publisher claims. Instead of trusting a five-star badge or a polished description, the agent queries a verifiable history of actual calls. The evidence layer surfaces success rates, latency distributions, risk flags, and input/output integrity hashes before a single token is processed or a payment is triggered.

More importantly, it enables policy-driven routing. Agents can enforce minimum receipt thresholds, exclude candidates with specific failure patterns, and route high-value tasks to historically stable performers. The delegation loop transitions from guesswork to auditable decision-making.

Core Solution

Implementing evidence-based delegation requires three architectural components: a standardized receipt format, an evidence aggregation layer, and a policy-driven selection engine. The implementation must treat execution records as the primary artifact and derived scores as secondary views.

Step 1: Define the Execution Receipt Format

The foundation is a cryptographically signed JSON record that captures what happened during a tool invocation. The format, formalized in draft-xkumakichi-xaip-receipts-00, specifies:

agentDid: W3C Decentralized Identifier of the executing tool/skill
callerDid: DID of the agent or service that initiated the call
toolId: Canonical identifier for the invoked capability
status: Success, failure, or timeout
durationMs: Wall-clock execution time
inputHash / outputHash: SHA-256 digests of request and response payloads
signature: Ed25519 signature from the executor, optionally co-signed by the caller

Ed25519 is chosen for its compact signature size, fast verification, and widespread cryptographic library support. W3C DIDs provide decentralized identity without vendor lock-in, allowing receipts to move across runtimes and marketplaces. The format deliberately excludes scoring logic, aggregation topology, and decision rules. Those are deployment concerns, not wire-format concerns.
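To make the receipt fields concrete, here is a minimal sketch of signing and verifying one receipt with Node's built-in crypto module. The field names follow the list above, but the canonicalization rule (JSON with alphabetically sorted keys), the example DIDs, and the tool identifier are assumptions for illustration; the draft may prescribe a different encoding.

```typescript
import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";

// Field names follow the receipt format described above.
interface ExecutionReceipt {
  agentDid: string;
  callerDid: string;
  toolId: string;
  status: "success" | "failure" | "timeout";
  durationMs: number;
  inputHash: string;
  outputHash: string;
}

const sha256Hex = (payload: string): string =>
  createHash("sha256").update(payload, "utf8").digest("hex");

// Assumed canonical form: JSON with keys in sorted order, so the signer
// and the verifier always hash identical bytes.
function canonicalize(receipt: ExecutionReceipt): Buffer {
  const sortedKeys = Object.keys(receipt).sort();
  return Buffer.from(JSON.stringify(receipt, sortedKeys), "utf8");
}

// Throwaway key pair standing in for the executor's identity key.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

const receipt: ExecutionReceipt = {
  agentDid: "did:example:tool-123", // hypothetical DID
  callerDid: "did:example:agent-456", // hypothetical DID
  toolId: "summarize-v1", // hypothetical tool identifier
  status: "success",
  durationMs: 412,
  inputHash: sha256Hex('{"text":"hello"}'),
  outputHash: sha256Hex('{"summary":"hi"}'),
};

// Ed25519 takes no separate digest algorithm, so the first argument is null.
const signature = sign(null, canonicalize(receipt), privateKey);
const isValid = verify(null, canonicalize(receipt), publicKey, signature);

// Changing any field after signing invalidates the signature.
const tampered: ExecutionReceipt = { ...receipt, status: "failure" };
const tamperedValid = verify(null, canonicalize(tampered), publicKey, signature);

console.log(isValid, tamperedValid, signature.length); // true false 64
```

The 64-byte signature is what makes the format compact enough to attach to every invocation. A caller co-signature would be a second signature over the same canonical bytes, produced with the caller's key.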

Step 2: Query the Evidence Graph

Agents should never invoke a candidate without first querying its execution history. xaip-sdk@0.5.0 introduces a precheck() helper that abstracts this query. Below is a restructured implementation that mirrors the SDK's behavior behind a distinct interface:

import { createEvidenceClient } from "xaip-sdk";

interface DelegationPolicy {
  minReceiptCount: number;
  excludeRiskTags: string[];
  maxAvgLatencyMs?: number;
  requireCoSignature?: boolean;
}

interface ExecutionAuditResult {
  primaryCandidate: string | null;
  rankedCandidates: Array<{
    id: string;
    score: number;
    receiptCount: number;
    confidence: number;
    riskTags: string[];
    eligible: boolean;
  }>;
  unverifiedCandidates: string[];
  auditReason: string;
  delegationDirective: "allow" | "warn" | "unknown";
}

async function auditDelegationCandidates(
  taskDescription: string,
  candidateIds: string[],
  policy: DelegationPolicy
): Promise<ExecutionAuditResult> {
  const client = createEvidenceClient();
  
  const rawAudit = await client.queryExecutionHistory({
    task: taskDescription,
    candidates: candidateIds,
    policy: {
      minReceipts: policy.minReceiptCount,
      filterRiskTags: policy.excludeRiskTags,
      latencyThreshold: policy.maxAvgLatencyMs,
      enforceCoSignature: policy.requireCoSignature
    }
  });

  // SDK never makes the invocation decision. It surfaces structured evidence.
  const ranked = rawAudit.candidates.map(c => ({
    id: c.id,
    score: c.derivedScore,
    receiptCount: c.receiptCount,
    confidence: c.confidenceInterval,
    riskTags: c.observedRiskTags,
    eligible: c.meetsPolicyThresholds
  }));

  const eligibleCandidates = ranked.filter(c => c.eligible);
  const primary = eligibleCandidates.length > 0 
    ? eligibleCandidates.sort((a, b) => b.score - a.score)[0].id 
    : null;

  return {
    primaryCandidate: primary,
    rankedCandidates: ranked,
    unverifiedCandidates: rawAudit.candidatesWithZeroReceipts,
    auditReason: primary 
      ? "Selected using available execution evidence." 
      : "No eligible candidates based on available execution evidence.",
    delegationDirective: primary ? "allow" : "warn"
  };
}

Step 3: Integrate Into the Delegation Loop

The audit result feeds directly into the agent's routing logic. Notice that auditDelegationCandidates() returns a controlled auditReason string and a delegationDirective rather than enforcing a hard block. This is intentional. The evidence layer surfaces facts; the orchestrator applies business logic.

async function routeTask(task: string, candidates: string[]) {
  const policy: DelegationPolicy = {
    minReceiptCount: 15,
    excludeRiskTags: ["repeated_timeout", "output_truncation"],
    maxAvgLatencyMs: 2000,
    requireCoSignature: false
  };

  const audit = await auditDelegationCandidates(task, candidates, policy);

  if (audit.delegationDirective === "warn") {
    // Fallback to human review, alternative toolset, or graceful degradation
    return await handleUnverifiedDelegation(task, audit.unverifiedCandidates);
  }

  if (audit.delegationDirective === "allow" && audit.primaryCandidate) {
    // Proceed with invocation. Payment/execution happens here.
    return await invokeTool(audit.primaryCandidate, task);
  }

  throw new Error("Delegation blocked by policy constraints");
}

Architecture Rationale

Receipts as Primary Artifacts: Scores, confidence intervals, and eligibility flags are derived views. If a future system requires different weighting or aggregation, it can compute new metrics over the same signed receipt graph without re-emitting data.
Separation of Concerns: The evidence layer does not invoke tools, handle payments, or enforce blocks. It answers one question: What is the verifiable history of this candidate? Decision logic remains in the orchestrator.
Policy-Driven Filtering: Thresholds are externalized. Teams can adjust minReceiptCount, latency caps, or risk tag exclusions without modifying core routing code.
Decentralized Identity: W3C DIDs ensure receipts remain portable across marketplaces. An agent can verify a tool's history regardless of where the tool is hosted or billed.
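The "receipts as primary artifacts" point can be illustrated with a small sketch: two different scoring functions computed over the same receipt list. The ReceiptSummary shape, both function names, and the penalty weights are assumptions made for this example, not part of the receipt draft or the SDK.

```typescript
// Minimal view of a receipt; only the fields needed for scoring.
interface ReceiptSummary {
  status: "success" | "failure" | "timeout";
  durationMs: number;
}

type ScoringFn = (receipts: ReceiptSummary[]) => number;

// View 1: plain success rate.
const successRate: ScoringFn = (receipts) =>
  receipts.length === 0
    ? 0
    : receipts.filter((r) => r.status === "success").length / receipts.length;

// View 2: timeouts count twice as heavily as other failures.
// The weights are illustrative, not a recommendation.
const timeoutAverse: ScoringFn = (receipts) => {
  if (receipts.length === 0) return 0;
  const penalty = receipts.reduce(
    (sum, r) => sum + (r.status === "timeout" ? 2 : r.status === "failure" ? 1 : 0),
    0
  );
  return Math.max(0, 1 - penalty / receipts.length);
};

const history: ReceiptSummary[] = [
  { status: "success", durationMs: 120 },
  { status: "success", durationMs: 140 },
  { status: "timeout", durationMs: 5000 },
  { status: "failure", durationMs: 90 },
];

console.log(successRate(history)); // 0.5
console.log(timeoutAverse(history)); // 0.25
```

Neither score is stored anywhere. Swapping the weighting means swapping the function, while the signed receipts stay untouched.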

Pitfall Guide

1. Treating Derived Scores as Ground Truth

Explanation: The SDK computes scores, confidence intervals, and eligibility flags algorithmically. These are convenience views, not cryptographic facts. Relying on them as absolute truth ignores the underlying receipt graph. Fix: Always inspect receiptCount, riskTags, and raw execution distributions when making high-stakes routing decisions. Treat scores as heuristic inputs, not deterministic outputs.

2. Ignoring Receipt Provenance

Explanation: Not all receipts carry equal weight. A receipt from a real agent call differs fundamentally from one generated by a synthetic health check or integration test. Mixing them without distinction distorts success rates and latency averages. Fix: Implement provenance tagging at the emitter level. Filter or weight receipts by source type (real_agent_call, synthetic_probe, scheduled_monitor) before aggregation.
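As a rough sketch of provenance-aware aggregation, the function below weights each receipt by its source type before computing a success rate. The provenance field on the receipt and the weight values are assumptions; only the three source-type names come from the text above.

```typescript
type Provenance = "real_agent_call" | "synthetic_probe" | "scheduled_monitor";

interface TaggedReceipt {
  status: "success" | "failure" | "timeout";
  provenance: Provenance; // assumed to be set by the emitter
}

// Illustrative weights: synthetic traffic counts for a quarter of a real call.
const PROVENANCE_WEIGHTS: Record<Provenance, number> = {
  real_agent_call: 1,
  synthetic_probe: 0.25,
  scheduled_monitor: 0.25,
};

// Returns null when there is no weighted evidence at all.
function weightedSuccessRate(
  receipts: TaggedReceipt[],
  weights: Record<Provenance, number> = PROVENANCE_WEIGHTS
): number | null {
  let total = 0;
  let successes = 0;
  for (const r of receipts) {
    const w = weights[r.provenance];
    total += w;
    if (r.status === "success") successes += w;
  }
  return total === 0 ? null : successes / total;
}

const receipts: TaggedReceipt[] = [
  { status: "success", provenance: "synthetic_probe" },
  { status: "success", provenance: "synthetic_probe" },
  { status: "success", provenance: "scheduled_monitor" },
  { status: "success", provenance: "scheduled_monitor" },
  { status: "failure", provenance: "real_agent_call" },
  { status: "success", provenance: "real_agent_call" },
];

const naiveRate =
  receipts.filter((r) => r.status === "success").length / receipts.length;
const weightedRate = weightedSuccessRate(receipts);

console.log(naiveRate.toFixed(3)); // 0.833
console.log(weightedRate === null ? "n/a" : weightedRate.toFixed(3)); // 0.667
```

In this sample the health checks always pass, so the unweighted rate hides the fact that half of the real agent calls failed.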

3. Premature Co-Signature Enforcement

Explanation: Co-signatures (caller + executor) improve non-repudiation but are not yet widely supported in public aggregation endpoints. Enforcing requireCoSignature: true before the ecosystem matures will reject valid candidates and stall delegation. Fix: Keep co-signature requirements disabled in early deployments. Monitor aggregator support and enable the flag only when the target candidate pool consistently emits co-signed receipts.
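One way to decide when to turn enforcement on is to measure co-signature adoption per candidate first. The sketch below assumes each receipt exposes a candidate identifier and a boolean for the presence of a caller co-signature; the function names and the 95% default are hypothetical.

```typescript
interface SignedReceiptMeta {
  candidateId: string;
  hasCallerCoSignature: boolean;
}

// Fraction of co-signed receipts per candidate.
function coSignatureAdoption(receipts: SignedReceiptMeta[]): Map<string, number> {
  const counts = new Map<string, { total: number; coSigned: number }>();
  for (const r of receipts) {
    const entry = counts.get(r.candidateId) ?? { total: 0, coSigned: 0 };
    entry.total += 1;
    if (r.hasCallerCoSignature) entry.coSigned += 1;
    counts.set(r.candidateId, entry);
  }
  const rates = new Map<string, number>();
  counts.forEach((entry, id) => rates.set(id, entry.coSigned / entry.total));
  return rates;
}

// Enable enforcement only when every candidate in the pool meets the rate.
// Candidates with no receipts count as 0% adoption.
function shouldRequireCoSignature(
  rates: Map<string, number>,
  pool: string[],
  minRate = 0.95
): boolean {
  return pool.length > 0 && pool.every((id) => (rates.get(id) ?? 0) >= minRate);
}

const rates = coSignatureAdoption([
  { candidateId: "tool-a", hasCallerCoSignature: true },
  { candidateId: "tool-a", hasCallerCoSignature: true },
  { candidateId: "tool-b", hasCallerCoSignature: true },
  { candidateId: "tool-b", hasCallerCoSignature: false },
]);

console.log(shouldRequireCoSignature(rates, ["tool-a"])); // true
console.log(shouldRequireCoSignature(rates, ["tool-a", "tool-b"])); // false
```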

4. Conflating Evidence with Safety

Explanation: Execution records prove what happened, not whether a tool is safe. A tool can have a 99% success rate and still leak sensitive data, violate compliance boundaries, or produce hallucinated outputs. Fix: Combine evidence-based routing with separate safety layers: input sanitization, output validation, data loss prevention (DLP) scanning, and capability scoping. Evidence handles reliability; safety handles risk.
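A separate safety layer might look like the following sketch, which runs after a tool returns and independently of any receipt data. The two regular expressions are deliberately simplistic placeholders, and checkToolOutput is a hypothetical helper; a production DLP scanner would need far broader coverage.

```typescript
interface SafetyVerdict {
  ok: boolean;
  reasons: string[];
}

// Illustrative patterns only.
const SENSITIVE_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "email_address", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ },
  { label: "aws_access_key_id", pattern: /\bAKIA[0-9A-Z]{16}\b/ },
];

// Checks that the output is an object with the expected keys, then scans
// its serialized form for sensitive-looking strings.
function checkToolOutput(output: unknown, requiredKeys: string[]): SafetyVerdict {
  if (typeof output !== "object" || output === null || Array.isArray(output)) {
    return { ok: false, reasons: ["output_not_object"] };
  }
  const record = output as Record<string, unknown>;
  const reasons: string[] = [];
  for (const key of requiredKeys) {
    if (!(key in record)) reasons.push(`missing_key:${key}`);
  }
  const serialized = JSON.stringify(record);
  for (const { label, pattern } of SENSITIVE_PATTERNS) {
    if (pattern.test(serialized)) reasons.push(`dlp_match:${label}`);
  }
  return { ok: reasons.length === 0, reasons };
}

const clean = checkToolOutput({ summary: "All good", sources: [] }, ["summary", "sources"]);
const leaky = checkToolOutput({ summary: "Contact alice@example.com" }, ["summary", "sources"]);

console.log(clean.ok); // true
console.log(leaky.reasons); // [ 'missing_key:sources', 'dlp_match:email_address' ]
```

A tool with a spotless receipt history would still fail this check if it echoed an email address, which is the distinction the pitfall describes.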

5. Hardcoding Policy Thresholds

Explanation: Static thresholds (minReceiptCount: 10) break when scaling across domains. A niche internal tool may never reach 10 receipts, while a public API may accumulate 10,000. Hardcoded values create false negatives or false positives. Fix: Implement dynamic thresholding based on candidate category, task criticality, and available pool size. Use percentile-based routing for mature candidates and relaxed thresholds for emerging tools.
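Dynamic thresholding can take many forms. The sketch below shows one hypothetical rule: take the median receipt count of the candidate pool, scale it by task criticality, and clamp the result. The scaling factors and the floor and ceiling defaults are assumptions chosen for illustration.

```typescript
type Criticality = "low" | "medium" | "high";

interface ThresholdContext {
  criticality: Criticality;
  poolReceiptCounts: number[]; // receipts per candidate in the same category
}

// Illustrative scaling factors per criticality level.
const CRITICALITY_FACTOR: Record<Criticality, number> = {
  low: 0.25,
  medium: 0.5,
  high: 1,
};

function dynamicMinReceiptCount(
  ctx: ThresholdContext,
  floor = 3,
  ceiling = 50
): number {
  const sorted = [...ctx.poolReceiptCounts].sort((a, b) => a - b);
  if (sorted.length === 0) return floor;
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
  const scaled = Math.round(median * CRITICALITY_FACTOR[ctx.criticality]);
  // Clamp so niche tools stay reachable and popular tools are not held
  // to an arbitrarily high bar.
  return Math.min(ceiling, Math.max(floor, scaled));
}

const nicheTool = dynamicMinReceiptCount({ criticality: "medium", poolReceiptCounts: [2, 4, 6] });
const midPool = dynamicMinReceiptCount({ criticality: "high", poolReceiptCounts: [10, 20, 30, 40] });
const publicApi = dynamicMinReceiptCount({ criticality: "high", poolReceiptCounts: [800, 10000, 12000] });

console.log(nicheTool, midPool, publicApi); // 3 25 50
```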

6. Overlooking Settlement-Class Tool Semantics

Explanation: Tools that produce externally anchored outputs (e.g., blockchain transactions, legal document generation, financial settlements) have different evidence requirements than retrieval or transformation tools. Standard latency/success metrics are insufficient. Fix: Extend the receipt format with toolMetadata hints for settlement-class tools. Track confirmation depth, finality windows, and external verification hashes separately from standard execution metrics.
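A possible shape for such an extension is sketched below. Only the toolMetadata name comes from the text above; the fields inside it, the state names, and settlementState are hypothetical and would need to be aligned with whatever the receipt format eventually standardizes.

```typescript
// Hypothetical settlement hints carried in toolMetadata.
interface SettlementMetadata {
  toolClass: "settlement";
  confirmations: number; // confirmations observed so far
  requiredConfirmations: number; // finality threshold for this tool
  externalVerificationHash: string | null; // e.g. a transaction hash
}

interface SettlementReceipt {
  toolId: string;
  status: "success" | "failure" | "timeout";
  toolMetadata?: SettlementMetadata;
}

type SettlementState = "final" | "pending_finality" | "unanchored" | "not_settled";

// A "success" status alone is not enough: the result must also be
// externally anchored and past its finality window.
function settlementState(receipt: SettlementReceipt): SettlementState {
  if (receipt.status !== "success") return "not_settled";
  const meta = receipt.toolMetadata;
  if (!meta || meta.externalVerificationHash === null) return "unanchored";
  return meta.confirmations >= meta.requiredConfirmations
    ? "final"
    : "pending_finality";
}

const pending = settlementState({
  toolId: "payout-v2", // hypothetical tool identifier
  status: "success",
  toolMetadata: {
    toolClass: "settlement",
    confirmations: 2,
    requiredConfirmations: 6,
    externalVerificationHash: "0xabc123",
  },
});

console.log(pending); // pending_finality
```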

7. Assuming Metadata Ratings Correlate with Execution Quality

Explanation: Publisher-supplied ratings reflect marketing, user sentiment, or aggregated satisfaction. They do not capture runtime failures, timeout patterns, or output integrity. High ratings frequently mask inconsistent execution. Fix: Deprecate reliance on marketplace ratings for automated routing. Use ratings only as secondary context for human review. Let signed execution receipts drive machine decisions.

Production Bundle

Action Checklist

Deploy receipt emission hooks in all agent-to-tool invocation paths
Configure Ed25519 key management for executor and caller identities
Implement provenance tagging for synthetic vs. real execution traces
Externalize delegation policies into configuration files or feature flags
Add fallback routing for candidates with zero or insufficient receipts
Integrate output validation and DLP scanning alongside evidence routing
Monitor co-signature adoption rates before enabling enforcement policies
Establish separate metrics for settlement-class vs. standard tools

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High-volume, low-cost API calls	Metadata routing + periodic evidence audits	Latency sensitivity outweighs per-call verification cost	Low verification overhead, minimal financial risk
Paid, closed-source skill marketplace	Evidence-based precheck before invocation	Prevents cascading costs from misbehaving or opaque tools	Higher upfront latency, significant cost avoidance
Internal toolchain with limited adoption	Relaxed receipt thresholds + synthetic probes	Ensures routing works before organic receipt volume matures	Moderate probe infrastructure cost, faster tool onboarding
Settlement/financial execution tools	Extended receipt format + finality tracking	Standard success metrics insufficient for externally anchored outputs	Higher implementation complexity, compliance risk reduction
Multi-tenant agent platform	Policy-driven routing with tenant-specific thresholds	Different SLAs and risk tolerances require isolated evaluation	Configuration overhead, improved tenant trust

Configuration Template

# delegation-policy.yaml
evidence:
  client: xaip-sdk@0.5.0
  endpoint: https://trust-api.example.com/v1/audit
  timeout_ms: 250
  retry_policy:
    max_attempts: 2
    backoff: exponential

policy:
  default:
    min_receipt_count: 10
    exclude_risk_tags:
      - repeated_timeout
      - output_truncation
      - schema_violation
    max_avg_latency_ms: 1500
    require_co_signature: false

  high_value_tasks:
    min_receipt_count: 25
    exclude_risk_tags:
      - repeated_timeout
      - output_truncation
      - schema_violation
      - data_leak_suspect
    max_avg_latency_ms: 800
    require_co_signature: true

  emerging_tools:
    min_receipt_count: 3
    exclude_risk_tags: []
    max_avg_latency_ms: 3000
    require_co_signature: false
    allow_synthetic_probes: true

routing:
  fallback_strategy: human_review
  block_on_policy_violation: true
  log_evidence_decisions: true
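The policy section of this template can be mapped onto the DelegationPolicy interface from Step 2. The sketch below uses an in-memory object as a stand-in for the parsed YAML, since no YAML parser ships with Node; resolvePolicy and the PolicyProfile type are hypothetical names, and only two of the three profiles are reproduced to keep the example short.

```typescript
// Same shape as the DelegationPolicy interface in Step 2.
interface DelegationPolicy {
  minReceiptCount: number;
  excludeRiskTags: string[];
  maxAvgLatencyMs?: number;
  requireCoSignature?: boolean;
}

// Snake-case shape of one profile as it appears in delegation-policy.yaml.
interface PolicyProfile {
  min_receipt_count: number;
  exclude_risk_tags: string[];
  max_avg_latency_ms: number;
  require_co_signature: boolean;
  allow_synthetic_probes?: boolean;
}

const DEFAULT_PROFILE: PolicyProfile = {
  min_receipt_count: 10,
  exclude_risk_tags: ["repeated_timeout", "output_truncation", "schema_violation"],
  max_avg_latency_ms: 1500,
  require_co_signature: false,
};

// Stand-in for the parsed "policy" section of the YAML file.
const profiles: Record<string, PolicyProfile> = {
  default: DEFAULT_PROFILE,
  high_value_tasks: {
    min_receipt_count: 25,
    exclude_risk_tags: [
      "repeated_timeout",
      "output_truncation",
      "schema_violation",
      "data_leak_suspect",
    ],
    max_avg_latency_ms: 800,
    require_co_signature: true,
  },
};

// Unknown profile names fall back to the default profile.
function resolvePolicy(profileName: string): DelegationPolicy {
  const known = Object.prototype.hasOwnProperty.call(profiles, profileName);
  const profile = known ? profiles[profileName] : DEFAULT_PROFILE;
  return {
    minReceiptCount: profile.min_receipt_count,
    excludeRiskTags: [...profile.exclude_risk_tags],
    maxAvgLatencyMs: profile.max_avg_latency_ms,
    requireCoSignature: profile.require_co_signature,
  };
}

console.log(resolvePolicy("high_value_tasks").minReceiptCount); // 25
console.log(resolvePolicy("no_such_profile").minReceiptCount); // 10
```

Keeping this mapping in one place means threshold changes stay in configuration, and the routing code only ever sees a DelegationPolicy.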

Quick Start Guide

Install the SDK: Add xaip-sdk@0.5.0 to your agent runtime dependencies. Configure the Trust API endpoint and Ed25519 key pair for your orchestrator identity.
Emit Receipts: Wrap all tool invocations with receipt emission logic. Capture agentDid, callerDid, toolId, status, durationMs, and SHA-256 hashes of request/response payloads. Sign with Ed25519.
Query Before Routing: Replace direct tool selection with an auditDelegationCandidates() call. Pass the task description, candidate list, and a policy object matching your risk tolerance.
Apply Directive: Route based on delegationDirective. Allow verified candidates, warn on borderline cases, and trigger fallbacks for unverified pools. Log all decisions for audit trails.
Iterate Policies: Monitor receipt volume, risk tag frequency, and latency distributions. Adjust min_receipt_count, latency caps, and risk exclusions as the execution graph matures. Keep co-signature enforcement disabled until aggregator support stabilizes.
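The "Emit Receipts" step can be sketched as a thin wrapper around each tool call. Everything named here (CallIdentity, buildReceipt, invokeWithReceipt, the emit callback) is hypothetical rather than part of xaip-sdk. The sketch assumes JSON-serializable payloads, leaves signing out, and records every thrown error as a failure without distinguishing timeouts.

```typescript
import { createHash } from "node:crypto";

type ReceiptStatus = "success" | "failure" | "timeout";

interface CallIdentity {
  agentDid: string;
  callerDid: string;
  toolId: string;
}

interface UnsignedReceipt extends CallIdentity {
  status: ReceiptStatus;
  durationMs: number;
  inputHash: string;
  outputHash: string;
}

// Assumes JSON-serializable payloads; undefined is hashed as null.
const hashPayload = (payload: unknown): string =>
  createHash("sha256").update(JSON.stringify(payload ?? null), "utf8").digest("hex");

function buildReceipt(
  ids: CallIdentity,
  input: unknown,
  output: unknown,
  status: ReceiptStatus,
  durationMs: number
): UnsignedReceipt {
  return {
    ...ids,
    status,
    durationMs,
    inputHash: hashPayload(input),
    outputHash: hashPayload(output),
  };
}

// Wraps a tool call so a receipt is emitted whether it succeeds or throws.
async function invokeWithReceipt<I, O>(
  ids: CallIdentity,
  input: I,
  tool: (input: I) => Promise<O>,
  emit: (receipt: UnsignedReceipt) => void
): Promise<O> {
  const startedAt = Date.now();
  try {
    const output = await tool(input);
    emit(buildReceipt(ids, input, output, "success", Date.now() - startedAt));
    return output;
  } catch (error) {
    // Timeouts are not distinguished from other failures in this sketch.
    emit(buildReceipt(ids, input, null, "failure", Date.now() - startedAt));
    throw error;
  }
}
```

The emit callback is where signing and submission to an aggregator would happen; keeping it injectable lets the same wrapper serve tests and production.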

Evidence-based delegation transforms agent routing from metadata guesswork into verifiable decision-making. By treating signed execution records as the primary artifact and decoupling evidence collection from scoring, teams build resilient, auditable, and cost-predictable agent workflows. The format is intentionally narrow; the policy layer is intentionally flexible. Deploy receipts, query history, enforce thresholds, and let execution truth drive delegation.
