ally co-signed by the caller
Ed25519 is chosen for its compact signature size, fast verification, and widespread cryptographic library support. W3C DIDs provide decentralized identity without vendor lock-in, allowing receipts to move across runtimes and marketplaces. The format deliberately excludes scoring logic, aggregation topology, and decision rules. Those are deployment concerns, not wire-format concerns.
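To make the signing step concrete, here is a minimal sketch of signing and verifying a receipt with Node's built-in `crypto` module. The `ExecutionReceipt` shape, the sorted-key JSON canonicalization, and the `did:example:` identifiers are assumptions made for illustration; the actual receipt schema and canonical form may differ.

```typescript
import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";

// Hypothetical receipt shape; the field names follow those mentioned in this guide.
interface ExecutionReceipt {
  agentDid: string;
  callerDid: string;
  toolId: string;
  status: "success" | "failure";
  durationMs: number;
  requestHash: string; // hex-encoded SHA-256 of the request payload
  responseHash: string; // hex-encoded SHA-256 of the response payload
}

const sha256Hex = (payload: string): string =>
  createHash("sha256").update(payload).digest("hex");

// Assumption: the signed bytes are the receipt's JSON with keys in sorted order.
function canonicalBytes(receipt: ExecutionReceipt): Buffer {
  const sortedKeys = Object.keys(receipt).sort();
  return Buffer.from(JSON.stringify(receipt, sortedKeys), "utf8");
}

// Throwaway key pair for the demo; a real executor would load its own key.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

const receipt: ExecutionReceipt = {
  agentDid: "did:example:executor",
  callerDid: "did:example:caller",
  toolId: "summarize-v1",
  status: "success",
  durationMs: 412,
  requestHash: sha256Hex('{"text":"hello"}'),
  responseHash: sha256Hex('{"summary":"hi"}'),
};

// Ed25519 hashes internally, so the digest-algorithm argument is null.
const signature = sign(null, canonicalBytes(receipt), privateKey);

// The untouched receipt verifies; a receipt with a flipped status does not.
const validOriginal = verify(null, canonicalBytes(receipt), publicKey, signature);
const validTampered = verify(
  null,
  canonicalBytes({ ...receipt, status: "failure" }),
  publicKey,
  signature
);
```

Because the signature covers every field, any score or ranking computed later can be re-derived from receipts whose integrity is independently checkable.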
Step 2: Query the Evidence Graph
Agents should never invoke a candidate without first querying its execution history. Version 0.5.0 of xaip-sdk introduces a precheck() helper that wraps this query. The implementation below mirrors the SDK's behavior behind a different interface:
import { createEvidenceClient } from "xaip-sdk";

interface DelegationPolicy {
  minReceiptCount: number;
  excludeRiskTags: string[];
  maxAvgLatencyMs?: number;
  requireCoSignature?: boolean;
}

interface ExecutionAuditResult {
  primaryCandidate: string | null;
  rankedCandidates: Array<{
    id: string;
    score: number;
    receiptCount: number;
    confidence: number;
    riskTags: string[];
    eligible: boolean;
  }>;
  unverifiedCandidates: string[];
  auditReason: string;
  delegationDirective: "allow" | "warn" | "unknown";
}

async function auditDelegationCandidates(
  taskDescription: string,
  candidateIds: string[],
  policy: DelegationPolicy
): Promise<ExecutionAuditResult> {
  const client = createEvidenceClient();
  const rawAudit = await client.queryExecutionHistory({
    task: taskDescription,
    candidates: candidateIds,
    policy: {
      minReceipts: policy.minReceiptCount,
      filterRiskTags: policy.excludeRiskTags,
      latencyThreshold: policy.maxAvgLatencyMs,
      enforceCoSignature: policy.requireCoSignature
    }
  });

  // SDK never makes the invocation decision. It surfaces structured evidence.
  const ranked = rawAudit.candidates.map(c => ({
    id: c.id,
    score: c.derivedScore,
    receiptCount: c.receiptCount,
    confidence: c.confidenceInterval,
    riskTags: c.observedRiskTags,
    eligible: c.meetsPolicyThresholds
  }));

  const eligibleCandidates = ranked.filter(c => c.eligible);
  const primary = eligibleCandidates.length > 0
    ? eligibleCandidates.sort((a, b) => b.score - a.score)[0].id
    : null;

  return {
    primaryCandidate: primary,
    rankedCandidates: ranked,
    unverifiedCandidates: rawAudit.candidatesWithZeroReceipts,
    auditReason: primary
      ? "Selected using available execution evidence."
      : "No eligible candidates based on available execution evidence.",
    delegationDirective: primary ? "allow" : "warn"
  };
}
Step 3: Integrate Into the Delegation Loop
The audit result feeds directly into the agent's routing logic. Notice that the SDK returns a controlled auditReason string and a delegationDirective rather than a hard block. This is intentional. The evidence layer surfaces facts; the orchestrator applies business logic.
async function routeTask(task: string, candidates: string[]) {
  const policy: DelegationPolicy = {
    minReceiptCount: 15,
    excludeRiskTags: ["repeated_timeout", "output_truncation"],
    maxAvgLatencyMs: 2000,
    requireCoSignature: false
  };

  const audit = await auditDelegationCandidates(task, candidates, policy);

  if (audit.delegationDirective === "warn") {
    // Fallback to human review, alternative toolset, or graceful degradation
    return await handleUnverifiedDelegation(task, audit.unverifiedCandidates);
  }

  if (audit.delegationDirective === "allow" && audit.primaryCandidate) {
    // Proceed with invocation. Payment/execution happens here.
    return await invokeTool(audit.primaryCandidate, task);
  }

  throw new Error("Delegation blocked by policy constraints");
}
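The routing branches can also be written as one exhaustive mapping from directive to action, so that a directive value added later fails at compile time instead of silently reaching the final throw. This is a sketch only: `RoutingAction` and `decideAction` are hypothetical names, and treating "unknown" as a rejection simply mirrors the fall-through to the throw in the routing function.

```typescript
type DelegationDirective = "allow" | "warn" | "unknown";

// Hypothetical action type: what the orchestrator should do next.
type RoutingAction =
  | { kind: "invoke"; candidateId: string }
  | { kind: "fallback"; reason: string }
  | { kind: "reject"; reason: string };

function decideAction(
  directive: DelegationDirective,
  primaryCandidate: string | null
): RoutingAction {
  switch (directive) {
    case "allow":
      // An allow directive is only actionable when a candidate was selected.
      return primaryCandidate !== null
        ? { kind: "invoke", candidateId: primaryCandidate }
        : { kind: "reject", reason: "allow directive without a primary candidate" };
    case "warn":
      return { kind: "fallback", reason: "no candidate met the policy thresholds" };
    case "unknown":
      // Assumption: an undetermined directive is rejected, as in the throw path.
      return { kind: "reject", reason: "directive could not be determined" };
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a case is missing.
      const unhandled: never = directive;
      throw new Error(`Unhandled directive: ${String(unhandled)}`);
    }
  }
}
```

Keeping the decision pure (no I/O) makes the business logic easy to unit test separately from tool invocation and payment.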
Architecture Rationale
- Receipts as Primary Artifacts: Scores, confidence intervals, and eligibility flags are derived views. If a future system requires different weighting or aggregation, it can compute new metrics over the same signed receipt graph without re-emitting data.
- Separation of Concerns: The evidence layer does not invoke tools, handle payments, or enforce blocks. It answers one question: What is the verifiable history of this candidate? Decision logic remains in the orchestrator.
- Policy-Driven Filtering: Thresholds are externalized. Teams can adjust minReceiptCount, latency caps, or risk tag exclusions without modifying core routing code.
- Decentralized Identity: W3C DIDs ensure receipts remain portable across marketplaces. An agent can verify a tool's history regardless of where the tool is hosted or billed.
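The "derived views" idea from the list above can be illustrated with a small sketch: each metric is a pure function over the same receipt set, so adding a new metric never requires new data. `ReceiptSummary`, `DerivedView`, and the sample values are assumptions for illustration, not SDK types.

```typescript
// Hypothetical, already-verified receipt fields needed for aggregation.
interface ReceiptSummary {
  toolId: string;
  status: "success" | "failure";
  durationMs: number;
}

// A derived view is any pure function over the receipt set.
type DerivedView = (receipts: readonly ReceiptSummary[]) => number;

// Fraction of receipts with a success status; NaN when there is no evidence.
const successRate: DerivedView = (receipts) =>
  receipts.length === 0
    ? Number.NaN
    : receipts.filter((r) => r.status === "success").length / receipts.length;

// Median duration across receipts; NaN when there is no evidence.
const medianLatencyMs: DerivedView = (receipts) => {
  if (receipts.length === 0) return Number.NaN;
  const sorted = receipts.map((r) => r.durationMs).sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
};

const sampleReceipts: ReceiptSummary[] = [
  { toolId: "summarize-v1", status: "success", durationMs: 100 },
  { toolId: "summarize-v1", status: "success", durationMs: 200 },
  { toolId: "summarize-v1", status: "failure", durationMs: 900 },
  { toolId: "summarize-v1", status: "success", durationMs: 300 },
];

// Two different views computed over the same unchanged receipts.
const currentSuccessRate = successRate(sampleReceipts);
const currentMedianLatency = medianLatencyMs(sampleReceipts);
```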
Pitfall Guide
1. Treating Derived Scores as Ground Truth
Explanation: The SDK computes scores, confidence intervals, and eligibility flags algorithmically. These are convenience views, not cryptographic facts. Relying on them as absolute truth ignores the underlying receipt graph.
Fix: Always inspect receiptCount, riskTags, and raw execution distributions when making high-stakes routing decisions. Treat scores as heuristic inputs, not deterministic outputs.
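One way to see why receiptCount matters alongside the score is to compute an uncertainty interval for the success rate. The sketch below uses the Wilson score interval, a standard statistical technique; it is an assumption chosen for illustration, and the SDK's own confidence calculation is not specified here.

```typescript
interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%).
function wilsonInterval(successes: number, total: number, z = 1.96): Interval {
  if (total === 0) return { lower: 0, upper: 1 }; // no evidence: maximal uncertainty
  const p = successes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const margin = (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return {
    lower: Math.max(0, center - margin),
    upper: Math.min(1, center + margin),
  };
}

// Same 90% observed success rate, very different amounts of evidence.
const fewReceipts = wilsonInterval(9, 10);
const manyReceipts = wilsonInterval(900, 1000);
```

Both candidates show a 90% success rate, but the interval for ten receipts is far wider than the one for a thousand, which is why a high score on a thin receipt set deserves less trust.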
2. Ignoring Receipt Provenance
Explanation: Not all receipts carry equal weight. A receipt from a real agent call differs fundamentally from one generated by a synthetic health check or integration test. Mixing them without distinction distorts success rates and latency averages.
Fix: Implement provenance tagging at the emitter level. Filter or weight receipts by source type (real_agent_call, synthetic_probe, scheduled_monitor) before aggregation.
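A minimal sketch of provenance-weighted aggregation follows. The three source types come from the fix above; the `provenance` field name and the specific weights are hypothetical and would need tuning per deployment.

```typescript
type Provenance = "real_agent_call" | "synthetic_probe" | "scheduled_monitor";

// Hypothetical receipt view carrying an emitter-assigned provenance tag.
interface TaggedReceipt {
  status: "success" | "failure";
  provenance: Provenance;
}

// Illustrative weights: real calls count fully, probes partially, monitors not at all.
const provenanceWeights: Record<Provenance, number> = {
  real_agent_call: 1,
  synthetic_probe: 0.25,
  scheduled_monitor: 0,
};

function weightedSuccessRate(
  receipts: readonly TaggedReceipt[],
  weights: Record<Provenance, number>
): number {
  let weightedSuccesses = 0;
  let totalWeight = 0;
  for (const receipt of receipts) {
    const weight = weights[receipt.provenance];
    totalWeight += weight;
    if (receipt.status === "success") weightedSuccesses += weight;
  }
  // NaN signals that no receipt carried any weight.
  return totalWeight === 0 ? Number.NaN : weightedSuccesses / totalWeight;
}

const mixedReceipts: TaggedReceipt[] = [
  { status: "success", provenance: "real_agent_call" },
  { status: "failure", provenance: "real_agent_call" },
  { status: "success", provenance: "synthetic_probe" },
  { status: "success", provenance: "synthetic_probe" },
  { status: "success", provenance: "synthetic_probe" },
  { status: "success", provenance: "synthetic_probe" },
  { status: "success", provenance: "scheduled_monitor" },
  { status: "success", provenance: "scheduled_monitor" },
  { status: "success", provenance: "scheduled_monitor" },
];

const naiveRate =
  mixedReceipts.filter((r) => r.status === "success").length / mixedReceipts.length;
const weightedRate = weightedSuccessRate(mixedReceipts, provenanceWeights);
```

In this sample the unweighted rate is 8 of 9, while the weighted rate is two thirds, because the only failure came from a real agent call.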
3. Premature Co-Signature Enforcement
Explanation: Co-signatures (caller + executor) improve non-repudiation but are not yet widely supported in public aggregation endpoints. Enforcing requireCoSignature: true before the ecosystem matures will reject valid candidates and stall delegation.
Fix: Keep co-signature requirements disabled in early deployments. Monitor aggregator support and enable the flag only when the target candidate pool consistently emits co-signed receipts.
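Whether a pool "consistently emits co-signed receipts" can be checked before flipping the flag. The sketch below assumes per-candidate counts of co-signed receipts are available; `CandidateReceiptStats`, `coSignedCount`, and the 95% default are hypothetical.

```typescript
// Hypothetical per-candidate counts taken from the aggregation layer.
interface CandidateReceiptStats {
  id: string;
  receiptCount: number;
  coSignedCount: number;
}

// Enable enforcement only if every candidate co-signs at least minCoverage of its receipts.
function shouldRequireCoSignature(
  pool: readonly CandidateReceiptStats[],
  minCoverage = 0.95
): boolean {
  if (pool.length === 0) return false; // nothing to enforce against
  return pool.every(
    (c) => c.receiptCount > 0 && c.coSignedCount / c.receiptCount >= minCoverage
  );
}
```

Checking every candidate rather than the pool-wide average avoids a case where one high-volume tool masks several candidates that would be rejected outright.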
4. Conflating Evidence with Safety
Explanation: Execution records prove what happened, not whether a tool is safe. A tool can have a 99% success rate and still leak sensitive data, violate compliance boundaries, or produce hallucinated outputs.
Fix: Combine evidence-based routing with separate safety layers: input sanitization, output validation, data loss prevention (DLP) scanning, and capability scoping. Evidence handles reliability; safety handles risk.
5. Hardcoding Policy Thresholds
Explanation: Static thresholds (minReceiptCount: 10) break when scaling across domains. A niche internal tool may never reach 10 receipts, while a public API may accumulate 10,000. Hardcoded values create false negatives or false positives.
Fix: Implement dynamic thresholding based on candidate category, task criticality, and available pool size. Use percentile-based routing for mature candidates and relaxed thresholds for emerging tools.
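A minimal sketch of dynamic thresholding is shown below, assuming the orchestrator knows the task's criticality and the receipt counts of the candidate pool. The function names are hypothetical, and the base values (25, 10, and a floor of 3) are borrowed from the configuration template in this guide rather than prescribed.

```typescript
type Criticality = "low" | "high";

// Nearest-rank percentile; callers must pass a non-empty array.
function percentile(values: readonly number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function dynamicMinReceiptCount(
  criticality: Criticality,
  poolReceiptCounts: readonly number[]
): number {
  const base = criticality === "high" ? 25 : 10;
  // High-criticality tasks never relax; an empty pool gives nothing to adapt to.
  if (criticality === "high" || poolReceiptCounts.length === 0) return base;
  // Relax toward the pool's 25th percentile so a niche pool is not excluded wholesale,
  // but never drop below a floor of 3 receipts.
  const poolFloor = percentile(poolReceiptCounts, 25);
  return Math.max(3, Math.min(base, poolFloor));
}

const nicheInternalPool = [2, 4, 6, 8];
const publicApiPool = [500, 2000, 10000];

const nicheThreshold = dynamicMinReceiptCount("low", nicheInternalPool);
const publicThreshold = dynamicMinReceiptCount("low", publicApiPool);
```

With these inputs the niche pool relaxes to the floor while the public pool keeps the default, so the same routing code serves both domains.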
6. Applying Standard Metrics to Settlement-Class Tools
Explanation: Tools that produce externally anchored outputs (e.g., blockchain transactions, legal document generation, financial settlements) have different evidence requirements than retrieval or transformation tools. Standard latency/success metrics are insufficient.
Fix: Extend the receipt format with toolMetadata hints for settlement-class tools. Track confirmation depth, finality windows, and external verification hashes separately from standard execution metrics.
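As a sketch of what such an extension might look like, the types below add settlement hints under toolMetadata and evaluate finality separately from the plain success status. All field names inside `SettlementMetadata` and the `isSettled` helper are hypothetical.

```typescript
// Hypothetical settlement hints carried in toolMetadata; field names are illustrative.
interface SettlementMetadata {
  confirmationDepth: number; // confirmations observed so far
  requiredDepth: number; // confirmations needed before the result counts as final
  finalityWindowMs: number; // minimum wait after submission
  externalVerificationHash: string | null; // e.g. a transaction hash, once known
}

interface SettlementReceipt {
  toolId: string;
  status: "success" | "failure";
  durationMs: number;
  submittedAtMs: number;
  toolMetadata: SettlementMetadata;
}

// A successful call is only "settled" once depth, window, and external hash all check out.
function isSettled(receipt: SettlementReceipt, nowMs: number): boolean {
  const meta = receipt.toolMetadata;
  return (
    receipt.status === "success" &&
    meta.externalVerificationHash !== null &&
    meta.confirmationDepth >= meta.requiredDepth &&
    nowMs - receipt.submittedAtMs >= meta.finalityWindowMs
  );
}
```

A receipt can therefore report status "success" within milliseconds and still be excluded from settlement success rates until it reaches finality.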
7. Relying on Publisher-Supplied Ratings
Explanation: Publisher-supplied ratings reflect marketing, user sentiment, or aggregated satisfaction. They do not capture runtime failures, timeout patterns, or output integrity. High ratings frequently mask inconsistent execution.
Fix: Deprecate reliance on marketplace ratings for automated routing. Use ratings only as secondary context for human review. Let signed execution receipts drive machine decisions.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume, low-cost API calls | Metadata routing + periodic evidence audits | Latency sensitivity outweighs per-call verification cost | Low verification overhead, minimal financial risk |
| Paid, closed-source skill marketplace | Evidence-based precheck before invocation | Prevents cascading costs from misbehaving or opaque tools | Higher upfront latency, significant cost avoidance |
| Internal toolchain with limited adoption | Relaxed receipt thresholds + synthetic probes | Ensures routing works before organic receipt volume matures | Moderate probe infrastructure cost, faster tool onboarding |
| Settlement/financial execution tools | Extended receipt format + finality tracking | Standard success metrics insufficient for externally anchored outputs | Higher implementation complexity, compliance risk reduction |
| Multi-tenant agent platform | Policy-driven routing with tenant-specific thresholds | Different SLAs and risk tolerances require isolated evaluation | Configuration overhead, improved tenant trust |
Configuration Template
# delegation-policy.yaml
evidence:
  client: xaip-sdk@0.5.0
  endpoint: https://trust-api.example.com/v1/audit
  timeout_ms: 250
  retry_policy:
    max_attempts: 2
    backoff: exponential
policy:
  default:
    min_receipt_count: 10
    exclude_risk_tags:
      - repeated_timeout
      - output_truncation
      - schema_violation
    max_avg_latency_ms: 1500
    require_co_signature: false
  high_value_tasks:
    min_receipt_count: 25
    exclude_risk_tags:
      - repeated_timeout
      - output_truncation
      - schema_violation
      - data_leak_suspect
    max_avg_latency_ms: 800
    require_co_signature: true
  emerging_tools:
    min_receipt_count: 3
    exclude_risk_tags: []
    max_avg_latency_ms: 3000
    require_co_signature: false
    allow_synthetic_probes: true
routing:
  fallback_strategy: human_review
  block_on_policy_violation: true
  log_evidence_decisions: true
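Once the template has been parsed by whatever YAML loader the runtime already uses, a named profile has to be mapped onto the camelCase policy object the audit function expects. The sketch below assumes the parsed `policy` section is available as a plain object; `PolicyProfile` and `resolvePolicy` are hypothetical names.

```typescript
// Shape of one profile as it appears in the YAML template (snake_case keys).
interface PolicyProfile {
  min_receipt_count: number;
  exclude_risk_tags: string[];
  max_avg_latency_ms?: number;
  require_co_signature?: boolean;
  allow_synthetic_probes?: boolean;
}

// Shape expected by the audit function described earlier in this guide.
interface DelegationPolicy {
  minReceiptCount: number;
  excludeRiskTags: string[];
  maxAvgLatencyMs?: number;
  requireCoSignature?: boolean;
}

// Look up a profile by name, falling back to "default" when the name is unknown.
function resolvePolicy(
  profiles: Record<string, PolicyProfile>,
  profileName: string
): DelegationPolicy {
  const profile = profiles[profileName] ?? profiles["default"];
  if (profile === undefined) {
    throw new Error(`No profile "${profileName}" and no default profile configured`);
  }
  // allow_synthetic_probes has no counterpart in DelegationPolicy and is not mapped here.
  return {
    minReceiptCount: profile.min_receipt_count,
    excludeRiskTags: profile.exclude_risk_tags,
    maxAvgLatencyMs: profile.max_avg_latency_ms,
    requireCoSignature: profile.require_co_signature,
  };
}

// Stand-in for the parsed `policy` section of the template (two profiles shown).
const parsedProfiles: Record<string, PolicyProfile> = {
  default: {
    min_receipt_count: 10,
    exclude_risk_tags: ["repeated_timeout", "output_truncation", "schema_violation"],
    max_avg_latency_ms: 1500,
    require_co_signature: false,
  },
  high_value_tasks: {
    min_receipt_count: 25,
    exclude_risk_tags: ["repeated_timeout", "output_truncation", "schema_violation", "data_leak_suspect"],
    max_avg_latency_ms: 800,
    require_co_signature: true,
  },
};

const highValuePolicy = resolvePolicy(parsedProfiles, "high_value_tasks");
const fallbackPolicy = resolvePolicy(parsedProfiles, "not_configured");
```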
Quick Start Guide
- Install the SDK: Add xaip-sdk@0.5.0 to your agent runtime dependencies. Configure the Trust API endpoint and an Ed25519 key pair for your orchestrator identity.
- Emit Receipts: Wrap all tool invocations with receipt emission logic. Capture agentDid, callerDid, toolId, status, durationMs, and SHA-256 hashes of request/response payloads. Sign with Ed25519.
- Query Before Routing: Replace direct tool selection with an auditDelegationCandidates() call. Pass the task description, candidate list, and a policy object matching your risk tolerance.
- Apply Directive: Route based on delegationDirective. Allow verified candidates, warn on borderline cases, and trigger fallbacks for unverified pools. Log all decisions for audit trails.
- Iterate Policies: Monitor receipt volume, risk tag frequency, and latency distributions. Adjust min_receipt_count, latency caps, and risk exclusions as the execution graph matures. Keep co-signature enforcement disabled until aggregator support stabilizes.
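The receipt-emission step in the list above can be sketched as a wrapper around any tool call. The version below captures the listed fields and hashes payloads with Node's built-in `crypto` module; `withReceipt`, `UnsignedReceipt`, and the `emit` callback are hypothetical, JSON serialization of payloads is an assumption, and signing is left to whatever the `emit` callback does next.

```typescript
import { createHash } from "node:crypto";

// Hypothetical unsigned receipt; signing would happen downstream of emit().
interface UnsignedReceipt {
  agentDid: string;
  callerDid: string;
  toolId: string;
  status: "success" | "failure";
  durationMs: number;
  requestHash: string;
  responseHash: string;
}

interface ReceiptIdentity {
  agentDid: string;
  callerDid: string;
  toolId: string;
}

// Assumption: payloads are JSON-serializable; undefined serializes as an empty string.
const hashPayload = (payload: unknown): string =>
  createHash("sha256").update(JSON.stringify(payload) ?? "").digest("hex");

// Runs the tool call and emits exactly one receipt, whether it succeeds or throws.
async function withReceipt<Req, Res>(
  identity: ReceiptIdentity,
  request: Req,
  invoke: (request: Req) => Promise<Res>,
  emit: (receipt: UnsignedReceipt) => void
): Promise<Res> {
  const startedAt = Date.now();
  const requestHash = hashPayload(request);
  try {
    const response = await invoke(request);
    emit({
      ...identity,
      status: "success",
      durationMs: Date.now() - startedAt,
      requestHash,
      responseHash: hashPayload(response),
    });
    return response;
  } catch (error) {
    emit({
      ...identity,
      status: "failure",
      durationMs: Date.now() - startedAt,
      requestHash,
      responseHash: hashPayload(null), // no response payload on failure
    });
    throw error; // the caller still sees the original failure
  }
}
```

Emitting on the failure path as well as the success path matters: a receipt graph that only records successes would overstate every candidate's reliability.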
Evidence-based delegation transforms agent routing from metadata guesswork into verifiable decision-making. By treating signed execution records as the primary artifact and decoupling evidence collection from scoring, teams build resilient, auditable, and cost-predictable agent workflows. The format is intentionally narrow; the policy layer is intentionally flexible. Deploy receipts, query history, enforce thresholds, and let execution truth drive delegation.