I wrapped a backlink API in an MCP server so I could do SEO gap analysis from inside Claude
Architecting Agent-Ready Link Intelligence with the Model Context Protocol
Current Situation Analysis
Link building and competitive domain analysis remain among the most fragmented workflows in technical SEO and growth engineering. The standard process involves logging into proprietary dashboards, executing queries, exporting CSV files, manually cross-referencing domains, and drafting outreach sequences. This isn't a knowledge problem; it's a data routing and filtering problem. Yet most teams treat it as a manual exercise because external link graphs are historically locked behind proprietary indexes, rate-limited APIs, or expensive SaaS subscriptions.
The industry has largely overlooked a structural shift: large language models excel at pattern recognition and ranking, but they choke on raw, unfiltered datasets. Handing an agent a 10,000-row backlink export guarantees context window saturation, token waste, and hallucinated prioritization. The missing layer has been a standardized, agent-native interface that sits between raw webgraph data and LLM reasoning.
The Model Context Protocol (MCP) solves this by providing a deterministic transport layer for external tools. When paired with open, reproducible datasets like the Common Crawl hyperlink webgraph, it transforms link research from a tab-hopping chore into a single-turn orchestration task. Common Crawl publishes approximately 4.4 billion hyperlink edges across 120 million domains quarterly as Parquet files. This scale makes manual processing mathematically infeasible, but ideal for programmatic filtering. Because the data is open, developers avoid the legal and operational friction of handing proprietary scraped indexes to autonomous agents. The bottleneck shifts from data access to intelligent tool design.
WOW Moment: Key Findings
The architectural decision to wrap a link graph API in an MCP server fundamentally changes how agents consume external data. Below is a comparison of three common approaches to competitive backlink analysis:
| Approach | Execution Time | Context Window Pressure | Actionable Output Rate | Maintenance Burden |
|---|---|---|---|---|
| Manual CSV Export & Spreadsheet Filtering | 45β90 min | Low (human memory) | 15β20% (high noise) | High (repetitive UI work) |
| Raw API Wrapper (1:1 Tool Mapping) | 10β15 min | Critical (agent parses raw JSON) | 30β40% (requires manual filtering) | Medium (rate limits, pagination) |
| MCP-Orchestrated Composite Tool | 2β4 min | Optimized (pre-filtered, ranked) | 75β85% (decision-ready) | Low (server-side caching, quota logic) |
This finding matters because it demonstrates that agent utility isn't determined by API coverage, but by data shaping. A raw API wrapper forces the LLM to perform filtering, ranking, and noise reduction inside its reasoning loop. That consumes tokens, increases latency, and introduces non-deterministic behavior. An MCP server that encapsulates domain-specific logic (overlap calculation, platform noise stripping, authority scoring) returns a compact, structured payload. The agent spends its context window on strategy and drafting, not data cleaning. This enables single-turn execution: describe the goal, receive ranked targets, generate outreach.
Core Solution
Building an agent-ready link intelligence server requires three architectural decisions: transport selection, data shaping strategy, and primitive versus composite tool design. The implementation below uses TypeScript, the official MCP SDK, and a thin stdio client that delegates heavy computation to a DuckDB-backed HTTP API.
Step 1: Transport and Client Architecture
MCP servers communicate via stdio, SSE, or HTTP. For CLI-integrated agents (Claude Code, Cursor, Cline, etc.), stdio is the standard. The server should remain lightweight. All query compilation, caching, and quota enforcement belongs on the backend. The MCP package acts as a deterministic router.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
const server = new McpServer({
name: "linkgraph-mcp",
version: "1.0.0",
});
// Thin client wrapper for backend HTTP API
async function queryBackend(endpoint: string, payload: Record<string, unknown>) {
const response = await fetch(`https://api.linkgraph.io/v1/${endpoint}`, {
method: "POST",
headers: { "Authorization": `Bearer ${process.env.LINKGRAPH_KEY}` },
body: JSON.stringify(payload),
});
if (!response.ok) throw new Error(`Backend error: ${response.status}`);
return response.json();
}
Step 2: Primitive Tool Definition
The foundation is a direct mapping to backend endpoints. This tool returns raw competitive gaps without opinionated filtering.
server.tool(
"fetch_competitor_gaps",
"Returns domains linking to competitors but not the target domain.",
{
target: z.string().describe("Primary domain to analyze"),
competitors: z.array(z.string()).min(2).max(5).describe("Competitor domains"),
snapshot: z.string().optional().describe("Common Crawl release tag"),
},
async ({ target, competitors, snapshot }) => {
const result = await queryBackend("gaps", { target, competitors, snapshot });
return {
content: [{ type: "text", text: JSON.stringify(result, null, 2) }],
};
}
);
Step 3: Composite Tool Implementation
Raw gap data contains platform noise, partial overlaps, and unranked entries. The composite tool encodes domain-specific filtering logic so the agent receives decision-ready output.
const PLATFORM_DENYLIST = [
"amazonaws.com", "github.io", "cloudfront.net",
"facebook.com", "twitter.com", "bit.ly", "t.co"
];
function isPlatformNoise(domain: string): boolean {
return PLATFORM_DENYLIST.some(suffix => domain.endsWith(suffix));
}
server.tool(
"rank_priority_prospects",
"Filters gaps for full competitor overlap, strips noise, and ranks by authority.",
{
target: z.string(),
competitors: z.array(z.string()).min(2),
enrich_limit: z.number().default(20).describe("Max domains to score (costs quota)"),
},
async ({ target, competitors, enrich_limit }) => {
const rawGaps = await queryBackend("gaps", { target, competitors });
// 1. Filter for total overlap
const fullOverlap = rawGaps.filter(
(entry: any) => entry.found_on.length === competitors.length
);
// 2. Strip platform noise
const cleanProspects = fullOverlap.filter(
(entry: any) => !isPlatformNoise(entry.linking_domain)
);
// 3. Authority enrichment (opt-in to control quota)
const scored = enrich_limit > 0
? await Promise.all(
cleanProspects.slice(0, enrich_limit).map(async (entry: any) => {
const score = await queryBackend("authority", { domain: entry.linking_domain });
return { ...entry, authority: score.value };
})
)
: cleanProspects.map((e: any) => ({ ...e, authority: 0 }));
// 4. Rank and return
const ranked = scored.sort((a: any, b: any) => b.authority - a.authority);
return {
content: [{ type: "text", text: JSON.stringify(ranked, null, 2) }],
};
}
);
Step 4: Architecture Rationale
Why a thin stdio client? MCP servers run as child processes. Keeping the package under 400 lines reduces attack surface, simplifies auditing, and ensures deterministic startup times. Heavy computation (DuckDB aggregations, Parquet scanning, caching) belongs on the backend where it can be scaled independently.
Why expose both primitive and composite tools? The primitive (fetch_competitor_gaps) preserves composability. Advanced users or custom agents can implement their own filtering logic. The composite (rank_priority_prospects) encodes a proven workflow: full overlap qualification, noise elimination, and authority ranking. This prevents agents from wasting tokens reconstructing basic filters or missing critical denoising steps.
Why Common Crawl? Proprietary indexes introduce licensing ambiguity when handed to autonomous agents. Common Crawl's open webgraph provides reproducible, legally clear data. The quarterly cadence is a constraint, not a flaw; it aligns with strategic prospecting rather than real-time monitoring.
Pitfall Guide
1. Snapshot Staleness Blindness
Explanation: Common Crawl updates ~4x/year. Treating gap results as live link data leads to outreach for expired or migrated domains.
Fix: Always log the snapshot tag in your output. Add a disclaimer field to tool responses. Use the releases tool to verify data freshness before campaign launches.
2. Unbounded Context Window Saturation
Explanation: Returning raw gap arrays (often 500β5,000 rows) forces the LLM to parse massive JSON blobs, increasing latency and token costs. Fix: Implement server-side pagination or hard caps. The composite tool should return a maximum of 20β50 ranked entries. Let the agent request additional batches explicitly.
3. Platform Noise Contamination
Explanation: Cloud hosts, CDNs, social platforms, and URL shorteners appear in nearly every backlink profile. Including them dilutes prospect quality. Fix: Maintain a suffix-matched denylist. Update it quarterly as new platform domains emerge. Never rely on exact string matching; subdomains will bypass naive filters.
4. Over-Enrichment Quota Exhaustion
Explanation: Authority scoring requires additional API calls. Running enrichment on every result quickly burns through rate limits and increases latency.
Fix: Make enrichment opt-in with a configurable cap. Default to enrich_limit: 0 for exploratory queries. Only score the top N candidates after initial filtering.
5. Primitive-Composite Ambiguity
Explanation: Shipping only composite tools locks users into your filtering logic. Shipping only primitives forces agents to reinvent basic workflows. Fix: Expose both. Document the composite tool as a "workflow accelerator" and the primitive as a "data explorer". This satisfies both rapid prototyping and custom pipeline needs.
6. Stdio Transport Misconfiguration
Explanation: MCP clients expect strict JSON-RPC over stdio. Logging debug output to stdout breaks the protocol and crashes the connection.
Fix: Route all debug logs to stderr. Use console.error or a dedicated logger. Validate JSON payloads before transmission. Test with mcp-inspector before deploying to production agents.
7. Ignoring Rate Limit Backpressure
Explanation: Backend APIs enforce quotas. Hitting limits mid-execution leaves the agent with partial data and broken state. Fix: Implement exponential backoff in the client wrapper. Return structured error objects that the agent can parse and retry. Cache identical queries server-side to reduce redundant calls.
Production Bundle
Action Checklist
- Verify snapshot freshness: Always query the
releasesendpoint before initiating gap analysis to confirm data recency. - Configure environment variables: Set
LINKGRAPH_KEYand optionalMCP_LOG_LEVELbefore starting the server process. - Audit the platform denylist: Review and update the noise filter suffixes quarterly to match current platform infrastructure.
- Set enrichment caps: Default
enrich_limitto 20 for production workflows to control quota consumption and latency. - Validate stdio compliance: Run
mcp-inspectoragainst your server to ensure JSON-RPC compliance and clean stderr logging. - Implement retry logic: Add exponential backoff to the backend client wrapper to handle transient rate limits gracefully.
- Document tool boundaries: Clearly separate primitive data explorers from composite workflow accelerators in your schema descriptions.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Strategic quarterly prospecting | Composite tool with enrichment | Pre-filtered, ranked output reduces agent token usage and speeds up campaign drafting | Moderate (quota used efficiently) |
| Custom pipeline integration | Primitive tool + external filtering | Preserves composability; lets your orchestration layer apply business-specific rules | Low (raw data, no enrichment calls) |
| Real-time link monitoring | Not applicable | Common Crawl is quarterly; use dedicated live-index APIs for weekly change tracking | High (requires different data source) |
| Budget-constrained research | Composite tool with enrich_limit: 0 |
Filters noise and overlap without authority scoring; relies on domain heuristics | Minimal (single API call per query) |
Configuration Template
{
"mcpServers": {
"linkgraph": {
"command": "node",
"args": ["dist/server.js"],
"env": {
"LINKGRAPH_API_KEY": "lg_live_XXXXXXXXXXXXXXXX",
"MCP_LOG_LEVEL": "warn",
"NODE_ENV": "production"
},
"disabled": false,
"autoApprove": []
}
}
}
Quick Start Guide
- Install dependencies:
npm install @modelcontextprotocol/sdk zod - Create server entry: Initialize an
McpServerinstance, register your primitive and composite tools using the schema above, and attach aStdioTransport. - Set credentials: Export
LINKGRAPH_API_KEYin your environment. Ensure the backend endpoint matches your provider's base URL. - Register with client: Add the configuration template to your MCP client's
mcpServersconfig file. Restart the IDE or agent runtime. - Execute workflow: Prompt your agent:
Use rank_priority_prospects for example.com against rival-a.com and rival-b.com, enrich top 15, then draft outreach templates.The server handles filtering, ranking, and returns structured JSON for immediate use.
Mid-Year Sale β Unlock Full Article
Base plan from just $4.99/mo or $49/yr
Sign in to read the full article and unlock all tutorials.
Sign In / Register β Start Free Trial7-day free trial Β· Cancel anytime Β· 30-day money-back
