AI/ML · 2026-05-12 · 78 min read

Your API ranks on Google. But does it rank for AI agents?

By anhmtk

Architecting AI-Agent Discoverability: A Three-Tier Protocol for API Visibility

Current Situation Analysis

The modern API ecosystem faces a silent visibility crisis. Traditional search engine optimization (SEO) was engineered for human browsers: it optimizes HTML rendering, click-through rates, and semantic keyword density. AI agents, however, do not browse. They parse. They operate on a fundamentally different stack that prioritizes machine-readable manifests, explicit crawler permissions, and single-round-trip decision making.

This disconnect is routinely misunderstood. Most engineering teams treat API discoverability as a subset of traditional SEO. They deploy sitemap.xml, configure Open Graph tags, and assume the job is complete. The reality is that AI crawlers (GPTBot, ClaudeBot, Google-Extended) and agent frameworks (Cursor, Claude Desktop, custom orchestration layers) ignore marketing copy and human-centric meta tags. They scan for specific, well-known file paths and structured data schemas. If those signals are absent, the API remains functionally invisible to autonomous tools, regardless of its human search ranking.

The infrastructure supporting agent-native discovery is still maturing. Unlike Google's 25-year-old indexing pipeline, agent crawlers operate on an 18-month-old paradigm with strict expectations:

  • They require explicit robots.txt allowances for agent-specific user agents.
  • They prioritize machine-readable manifests over human-readable documentation.
  • They make consumption decisions in a single HTTP round-trip, evaluating auth models, endpoint availability, and pricing before establishing a session.

When an API fails to satisfy these protocol-level expectations, it disappears from AI-generated answers, agent tool registries, and autonomous workflow integrations. The solution is not to optimize for keywords, but to implement a three-tier discoverability architecture that satisfies human search engines, agent-native crawlers, and live consumption clients simultaneously.

WOW Moment: Key Findings

The critical insight is that discoverability is not a single channel. It is a layered protocol stack where each tier serves a distinct consumer with different parsing logic. Treating them as interchangeable guarantees visibility gaps.

| Tier | Primary Consumer | Core Signal | Decision Latency | Success Metric |
|---|---|---|---|---|
| Human Search (Layer 1) | Google, Bing, DuckDuckGo | sitemap.xml, Schema.org JSON-LD, Open Graph | Hours to days | Rich snippet eligibility, SERP ranking |
| Agent Discovery (Layer 2) | GPTBot, ClaudeBot, Perplexity, AI Search | agent-manifest.json, llms.txt, /.well-known/*, Content Signals | Milliseconds (single round-trip) | Crawler allowance, manifest parsing, tool registration |
| Live Consumption (Layer 3) | Cursor, Claude Desktop, custom agents | MCP Streamable HTTP, OpenAPI 3.0, REST | Real-time | Successful tool invocation, schema auto-discovery |

Why this matters: Layer 1 gets you into human search results. Layer 2 gets you into AI-generated answers and agent toolboxes. Layer 3 enables actual execution. Missing Layer 2 is the most common failure point: an API can rank #1 on Google while remaining completely invisible to every AI assistant on the market. The three tiers must be implemented as a cohesive system, not isolated optimizations.

Core Solution

The architecture requires three independent but interconnected implementations. Each layer addresses a specific consumer's parsing behavior.

Step 1: Structured Data for Human Search (Layer 1)

Human search engines rely on Schema.org vocabulary to render rich results. For APIs, the SoftwareApplication type combined with Offer and UnitPriceSpecification is the standard. The goal is to expose pricing tiers in a format that search engines can parse into rich snippets.

Implementation: Embed JSON-LD in the <head> of your pricing or documentation page. The structure must strictly adhere to Schema.org validation rules.

// pricing-schema.ts
export const generatePricingSchema = (apiName: string, baseUrl: string) => {
  return {
    "@context": "https://schema.org",
    "@graph": [
      {
        "@type": "SoftwareApplication",
        "@id": `${baseUrl}/#api-service`,
        "name": apiName,
        "applicationCategory": "DeveloperApplication",
        "operatingSystem": "Cloud (HTTPS)",
        "offers": [
          {
            "@type": "Offer",
            "name": "Starter",
            "price": "0",
            "priceCurrency": "USD",
            "availability": "https://schema.org/InStock",
            "eligibleQuantity": {
              "@type": "QuantitativeValue",
              "value": 500,
              "unitText": "requests/month"
            }
          },
          {
            "@type": "Offer",
            "name": "Professional",
            "price": "29",
            "priceCurrency": "USD",
            "priceSpecification": {
              "@type": "UnitPriceSpecification",
              "price": "29",
              "priceCurrency": "USD",
              "billingDuration": "P1M"
            }
          },
          {
            "@type": "Offer",
            "name": "Enterprise",
            "price": "149",
            "priceCurrency": "USD",
            "priceSpecification": {
              "@type": "UnitPriceSpecification",
              "price": "149",
              "priceCurrency": "USD",
              "billingDuration": "P1M"
            }
          }
        ]
      }
    ]
  };
};

Architecture Rationale:

  • billingDuration: "P1M" uses ISO 8601 duration format. Search engines require this to correctly render recurring pricing (e.g., "$29/month") instead of ambiguous one-time charges.
  • All price fields must be numeric strings. Custom or "Contact Sales" tiers break rich result validation if included here. Keep those in the HTML UI only.
  • The @graph wrapper allows multiple entities (e.g., SoftwareApplication + WebAPI) without schema conflicts.
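
To wire this into a page, the object returned above is serialized into a <script type="application/ld+json"> tag in the <head>. A minimal sketch, assuming a server-rendered page; the file name, API name, and domain below are placeholders:

// pricing-page.ts
import { generatePricingSchema } from './pricing-schema';

// Example values only; substitute your own API name and canonical base URL.
const schema = generatePricingSchema('DataStream Metrics API', 'https://datastream.dev');

// Search engines parse the script body directly, so a plain JSON.stringify is sufficient.
export const pricingSchemaTag =
  `<script type="application/ld+json">${JSON.stringify(schema)}</script>`;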

Step 2: Agent-Native Manifests & Signals (Layer 2)

AI agents do not read hero sections. They fetch machine-readable manifests to determine if an API is compatible with their execution environment. This layer requires three components: a discovery manifest, LLM-readable documentation, and explicit crawler permissions.

A. Discovery Manifest (/.well-known/agent-manifest.json)

This file acts as a single source of truth for agent frameworks. It exposes endpoints, authentication models, and documentation links in one payload.

{
  "name": "DataStream Metrics API",
  "version": "2.1.0",
  "description": "Real-time infrastructure telemetry and pricing data for autonomous agents.",
  "endpoints": {
    "rest": "https://api.datastream.dev/v2",
    "mcp": "https://api.datastream.dev/mcp/stream",
    "openapi": "https://api.datastream.dev/spec/openapi.json"
  },
  "authentication": {
    "type": "api_key",
    "header": "X-DS-Key",
    "registration": "https://api.datastream.dev/v2/auth/issue"
  },
  "pricing": "https://datastream.dev/pricing",
  "documentation": "https://datastream.dev/docs"
}

B. LLM-Readable Documentation (/llms.txt & /llms-full.txt)

/llms.txt provides a concise, structured summary optimized for context window efficiency. /llms-full.txt contains the complete API surface, rate limits, and error codes. Agents use these files to generate accurate tool definitions without parsing heavy HTML.
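
A minimal sketch of what /llms.txt might contain, reusing the example values from the manifest above; the endpoint list, limits, and error codes are placeholders, not a prescribed format:

# llms.txt — DataStream Metrics API (agent-readable summary)
Base URL: https://api.datastream.dev/v2
Auth: API key in X-DS-Key header; issue keys at /v2/auth/issue
Endpoints: GET /metrics, GET /metrics/{id}, POST /queries
Rate limits: 500 requests/month (Starter); higher tiers listed at /pricing
Errors: 401 invalid key, 429 rate limited, 503 upstream unavailable
Full reference: https://datastream.dev/llms-full.txt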

C. Crawler Permissions & Content Signals

Update robots.txt to explicitly allow agent crawlers. Additionally, configure HTTP response headers to communicate usage rights.

# robots.txt
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: *
Allow: /
Sitemap: https://datastream.dev/sitemap.xml

# Cloudflare / Nginx / Application Headers
content-signal: ai-input=yes
content-signal: ai-train=no
content-signal: search=yes

Architecture Rationale:

  • ai-input=yes grants agents permission to use your data for live query resolution.
  • ai-train=no explicitly excludes your content from model training corpora, protecting proprietary pricing and endpoint logic.
  • search=yes signals compatibility with AI-powered search indices.
  • The manifest uses /.well-known/ per RFC 8615 conventions, ensuring predictable discovery paths for agent frameworks.

Step 3: Consumption Protocol Exposure (Layer 3)

Discovery and consumption are orthogonal. An agent may find your manifest but cannot execute without a standardized consumption protocol. Expose three interfaces:

  1. REST API: Standard HTTP endpoints for traditional integrations.
  2. OpenAPI 3.0 Specification: Machine-readable contract for code generation and client SDKs.
  3. MCP Streamable HTTP: Model Context Protocol endpoint for direct tool injection into AI desktop clients and IDEs.

Architecture Rationale: MCP is a consumption layer, not a discovery layer. It enables agents to invoke your API as native tools, but it does not broadcast your existence. The manifest (Layer 2) points to the MCP endpoint, while the schema (Layer 1) ensures human search engines index the service. This triad creates a closed loop: discovery → validation → execution.
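
A minimal sketch of Layer 3 exposure, assuming an Express server; the route paths mirror the manifest example, and the MCP endpoint is left as a placeholder rather than a full server implementation:

// serve-consumption.ts
import express from 'express';
import { readFileSync } from 'node:fs';

const app = express();

// OpenAPI 3.0 contract: the machine-readable surface agents use for schema auto-discovery.
const openapiSpec = JSON.parse(readFileSync('./spec/openapi.json', 'utf-8'));
app.get('/spec/openapi.json', (_req, res) => {
  res.json(openapiSpec);
});

// REST surface (real business routes would be registered elsewhere in the service).
app.get('/v2/health', (_req, res) => {
  res.json({ status: 'ok' });
});

// The MCP Streamable HTTP endpoint would be mounted here via an MCP server SDK (omitted in this sketch).
// app.all('/mcp/stream', mcpRequestHandler);

app.listen(3000);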

Pitfall Guide

1. Conflating Discovery with Consumption

Explanation: Deploying an MCP server without a discovery manifest or structured data. Agents cannot find the endpoint, so the MCP server remains unused. Fix: Always pair MCP/OpenAPI exposure with agent-manifest.json and Schema.org JSON-LD. Discovery precedes execution.

2. Non-Numeric Pricing in JSON-LD

Explanation: Including "Custom" or "Contact Sales" tiers with string prices in Offer objects. Schema.org validators reject non-numeric price fields, invalidating the entire rich result for the page. Fix: Restrict JSON-LD to numeric pricing tiers only. Render custom tiers exclusively in the HTML UI.
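
A minimal guard for this pitfall, assuming pricing tiers are stored with string prices; the PricingTier type and file name are hypothetical:

// numeric-tiers.ts
type PricingTier = { name: string; price: string };

// Only tiers whose price parses as a plain number become JSON-LD Offers;
// "Custom" / "Contact Sales" tiers stay in the HTML UI only.
export const jsonLdEligibleTiers = (tiers: PricingTier[]): PricingTier[] =>
  tiers.filter((tier) => /^\d+(\.\d+)?$/.test(tier.price));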

3. Ignoring Agent-Specific Crawler Allowances

Explanation: Relying on default User-agent: * rules. Many AI crawlers operate under distinct user-agent strings and may be blocked by overly restrictive default policies. Fix: Explicitly whitelist GPTBot, ClaudeBot, and Google-Extended in robots.txt. Verify allowances using crawler simulation tools.

4. Over-Engineering llms.txt

Explanation: Dumping raw HTML, marketing copy, or unstructured markdown into LLM-readable files. Agents parse these files for token efficiency; noise increases context window usage and degrades tool generation accuracy. Fix: Structure /llms.txt as a concise API summary with clear endpoint lists, auth requirements, and rate limits. Keep it under 2,000 tokens.

5. Single-Tool Validation

Explanation: Assuming one green checkmark means full compliance. Google Rich Results, schema.org validators, and agent crawlers enforce different rules. Fix: Run parallel validation: Google Rich Results Test for JSON-LD, schema.org validator for structural compliance, Bing Webmaster for indexing, and manual agent probes for manifest parsing.

6. Missing Content Signal Headers

Explanation: Failing to declare AI usage rights. Without content-signal headers, agents may default to conservative parsing or exclude your API from live query resolution. Fix: Implement ai-input=yes, ai-train=no, and search=yes at the edge (CDN) or application layer. Document these signals in your manifest.

7. Assuming Indexing Equals Visibility

Explanation: Celebrating indexed: yes in search consoles while missing structured data, Open Graph tags, or agent manifests. The page is technically crawled but functionally invisible to both humans and agents. Fix: Treat indexing as a baseline. Layer rich snippets, per-page Open Graph cards, and agent manifests on top of indexed URLs.

Production Bundle

Action Checklist

  • Audit existing robots.txt and explicitly allow GPTBot, ClaudeBot, Google-Extended
  • Implement Schema.org SoftwareApplication + Offer JSON-LD on pricing/documentation pages
  • Validate JSON-LD with Google Rich Results Test and schema.org validator (target: 0 errors)
  • Create /.well-known/agent-manifest.json with endpoints, auth model, and documentation links
  • Generate /llms.txt (concise summary) and /llms-full.txt (complete API surface)
  • Configure content-signal HTTP headers (ai-input=yes, ai-train=no, search=yes)
  • Expose REST, OpenAPI 3.0, and MCP Streamable HTTP endpoints
  • Link all consumption endpoints from the agent manifest and structured data

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Internal/Private API | Skip Layers 2 & 3; focus on Layer 1 | Agents should not discover internal tooling | Zero (no public exposure) |
| Public SaaS API | Full three-tier implementation | Maximizes visibility across human search, AI answers, and agent toolboxes | Low (configuration only) |
| High-Volume Data API | Layer 2 + MCP + rate-limited OpenAPI | Agents require structured manifests; MCP enables efficient tool injection | Medium (edge header config, manifest hosting) |
| Enterprise/Custom Pricing | JSON-LD for numeric tiers only; HTML for custom tiers | Prevents schema validation errors while preserving sales flexibility | Zero |

Configuration Template

# nginx.conf - Edge headers for AI agent signals
location / {
    add_header content-signal "ai-input=yes" always;
    add_header content-signal "ai-train=no" always;
    add_header content-signal "search=yes" always;
    
    # Serve agent manifest from well-known path
    location = /.well-known/agent-manifest.json {
        default_type application/json;
        alias /etc/agent-discovery/manifest.json;
    }
}

// manifest-generator.ts
import type { AgentManifest } from './types';

export const buildAgentManifest = (config: {
  name: string;
  version: string;
  baseUrl: string;
  apiKeyHeader: string;
}): AgentManifest => ({
  name: config.name,
  version: config.version,
  description: `Machine-readable API surface for ${config.name}.`,
  endpoints: {
    rest: `${config.baseUrl}/api/v1`,
    mcp: `${config.baseUrl}/mcp/stream`,
    openapi: `${config.baseUrl}/spec/openapi.json`
  },
  authentication: {
    type: 'api_key',
    header: config.apiKeyHeader,
    registration: `${config.baseUrl}/api/v1/auth/issue`
  },
  pricing: `${config.baseUrl}/pricing`,
  documentation: `${config.baseUrl}/docs`
});
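
The generator imports AgentManifest from ./types, which is not shown in the bundle. A minimal sketch of that module, mirroring the fields of the manifest example (adapt to your own schema):

// types.ts
export interface AgentManifest {
  name: string;
  version: string;
  description: string;
  endpoints: {
    rest: string;
    mcp: string;
    openapi: string;
  };
  authentication: {
    type: 'api_key';
    header: string;
    registration: string;
  };
  pricing: string;
  documentation: string;
}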

Quick Start Guide

  1. Deploy the manifest: Place agent-manifest.json at /.well-known/agent-manifest.json. Ensure it returns 200 OK with application/json content type.
  2. Inject structured data: Add the SoftwareApplication JSON-LD script to your pricing page <head>. Validate with Google Rich Results Test.
  3. Configure edge signals: Add content-signal headers to your CDN or reverse proxy. Verify with curl -I https://yourdomain.com.
  4. Expose consumption endpoints: Host your OpenAPI 3.0 spec and MCP Streamable HTTP endpoint. Link both in the manifest.
  5. Validate end-to-end: Run a manual probe using an AI agent or crawler simulator. Confirm manifest parsing, schema eligibility, and tool registration.
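
For step 5, a minimal probe sketch (Node 18+ global fetch; yourdomain.com is a placeholder) that checks manifest availability, content-signal headers, and the OpenAPI spec:

// probe.ts
const base = 'https://yourdomain.com';

const probe = async (): Promise<void> => {
  // 1. Manifest must return 200 OK with a JSON content type.
  const manifestRes = await fetch(`${base}/.well-known/agent-manifest.json`);
  console.log('manifest:', manifestRes.status, manifestRes.headers.get('content-type'));

  // 2. Edge headers should carry the declared AI usage signals.
  const pageRes = await fetch(base, { method: 'HEAD' });
  console.log('content-signal:', pageRes.headers.get('content-signal'));

  // 3. OpenAPI spec should be reachable for schema auto-discovery.
  const specRes = await fetch(`${base}/spec/openapi.json`);
  console.log('openapi:', specRes.status);
};

probe().catch((err) => {
  console.error('probe failed:', err);
  process.exit(1);
});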