How does an AI agent pick from 686 skills in a second?
Scaling Agent Tool Discovery: The Semantic Routing Pattern for Large Skill Catalogs
Current Situation Analysis
Modern AI agent frameworks increasingly rely on modular skill catalogs to extend capabilities. These skills typically ship as markdown files containing YAML frontmatter (name, description, tags) and a markdown body (instructions, prompts, tool definitions). The industry-standard loading strategy, often termed progressive disclosure, defers loading the skill body until invocation but eagerly injects the entire catalog index into the system prompt at session startup.
This approach creates a silent context tax. Developers assume progressive disclosure solves prompt bloat because unused bodies are never loaded. In reality, the index itself becomes the bottleneck. Each skill's name and description consumes tokens. At scale, the cumulative overhead fractures the available context window and degrades attention quality.
The mathematical reality is unforgiving. A catalog of 4,556 skills requires approximately 228,000 tokens just to represent names and descriptions. This exceeds the 200,000-token limit of standard high-context models, causing immediate overflow. Even within a 1,000,000-token window, the catalog consumes 23% of the budget before a single task is processed. Beyond raw token count, attention mechanisms degrade when parsing long, semantically similar lists. Empirical observation shows that past 1,000 entries, agents begin misrouting tasks that humans would distinguish effortlessly. Furthermore, static catalogs lack garbage collection: stale skills accumulate, duplicates persist, and the index only grows.
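The arithmetic behind these figures can be sketched in a few lines. The per-entry cost of roughly 50 tokens is an assumption derived from the numbers above (228,000 / 4,556), and `estimateCatalogBudget` is a hypothetical helper name, not part of any framework.

```typescript
// Assumption: ~50 tokens per index entry (name + description),
// derived from the article's 228,000 tokens / 4,556 skills.
const ASSUMED_TOKENS_PER_ENTRY = 50;

interface CatalogBudget {
  indexTokens: number;    // tokens spent on the catalog index alone
  windowFraction: number; // share of the context window consumed
  overflows: boolean;     // true when the index alone exceeds the window
}

function estimateCatalogBudget(
  skillCount: number,
  contextWindow: number,
  tokensPerEntry: number = ASSUMED_TOKENS_PER_ENTRY
): CatalogBudget {
  const indexTokens = skillCount * tokensPerEntry;
  return {
    indexTokens,
    windowFraction: indexTokens / contextWindow,
    overflows: indexTokens > contextWindow,
  };
}

const standardWindow = estimateCatalogBudget(4556, 200000);
const largeWindow = estimateCatalogBudget(4556, 1000000);
console.log(standardWindow.indexTokens, standardWindow.overflows); // 227800 true
console.log((largeWindow.windowFraction * 100).toFixed(1) + '%');  // 22.8%
```

Running the same function against your own catalog size is a quick way to decide whether the index deserves its own budget line.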
The semantic routing pattern addresses this by decoupling the catalog from the prompt. Instead of injecting the index, skills are embedded into a vector database. At runtime, the agent queries the index with the task description, retrieves a small candidate set, and dynamically loads only the selected skill's body. This transforms a linear context cost into a constant one, regardless of catalog size.
WOW Moment: Key Findings
The empirical validation of this pattern reveals a critical insight: retrieval accuracy saturates quickly, while context savings scale linearly. Testing against a 686-skill sample from a 4,556-skill corpus demonstrates that you do not need the entire catalog to achieve reliable routing.
| Approach | Context Overhead | Top-5 Cluster Accuracy | Strict Top-1 Accuracy | Query Latency | Scalability Profile |
|---|---|---|---|---|---|
| Static Catalog (Progressive Disclosure) | ~228,000 tokens | N/A (loads all) | N/A | >30s (context processing) | Degrades past 1,000 entries |
| Semantic Router | ~1,500 tokens | 87.5% | 62.5% | <1 second | Constant cost, linear accuracy gain |
The data shows a 456x reduction in index overhead: the router returns approximately 500 tokens for the top-5 candidates, against ~228,000 tokens for the full static index. Loading the selected skill body adds another 500-1,500 tokens, so total per-turn overhead stays at or under 2,000 tokens.
More importantly, the convergence curve reveals that top-5 cluster accuracy stabilizes at ~85% once 500 skills are indexed. Adding the remaining 186 skills only pushes strict top-1 accuracy from 50% to 62.5%. This means the routing layer reliably surfaces the correct skill family early, and additional catalog depth primarily refines the exact match rather than enabling discovery. The pattern is not a magic bullet for missing skills, but it is a highly efficient filter for available ones.
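To make the two accuracy columns concrete, here is a minimal sketch of how such metrics could be computed from labeled test queries. The article does not show its evaluation harness, so the `RoutingCase` shape and the reading of "cluster accuracy" (a hit when any of the top-k results belongs to the expected skill family) are assumptions.

```typescript
interface RankedResult {
  skillId: string;
  cluster: string; // skill family, e.g. "documents" (hypothetical label)
}

interface RoutingCase {
  expectedSkillId: string;
  expectedCluster: string;
  ranked: RankedResult[]; // router output, best match first
}

// Strict top-1: the first result must be exactly the expected skill.
function strictTop1Accuracy(cases: RoutingCase[]): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(c => c.ranked[0]?.skillId === c.expectedSkillId).length;
  return hits / cases.length;
}

// Top-k cluster: any of the first k results may come from the expected family.
function topKClusterAccuracy(cases: RoutingCase[], k: number = 5): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(c =>
    c.ranked.slice(0, k).some(r => r.cluster === c.expectedCluster)
  ).length;
  return hits / cases.length;
}
```

Tracking both numbers separately shows whether a miss is a discovery failure (wrong family) or only a ranking failure inside the right family.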
Core Solution
Implementing a semantic router requires four architectural layers: metadata extraction, embedding generation, vector storage, and runtime resolution. The following implementation uses TypeScript, pgvector for storage, and intfloat/multilingual-e5-base for embeddings.
Step 1: Metadata Extraction & Normalization
Skills must be reduced to a routing signal. The full markdown body is ignored during indexing. Only the name and description are concatenated.
interface SkillMetadata {
  id: string;
  name: string;
  description: string;
  filePath: string;
  version: string;
}

function extractRoutingSignal(skill: SkillMetadata): string {
  // Strict formatting prevents embedding noise
  return `${skill.name}\n\n${skill.description}`;
}
Step 2: Vector Indexing Pipeline
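Embeddings are generated asynchronously and stored in Postgres with pgvector. The implementation uses a worker pool to handle batch ingestion without blocking the main event loop; the class below is the per-batch unit of work that each worker runs, with one transaction per batch.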
import { Pool } from 'pg';
import { createEmbedding } from './embedding-client'; // Wraps intfloat/multilingual-e5-base

const db = new Pool({ connectionString: process.env.DATABASE_URL });

export class SkillIndexer {
  async ingestBatch(skills: SkillMetadata[]): Promise<void> {
    const client = await db.connect();
    try {
      // One transaction per batch: either every skill is written or none is
      await client.query('BEGIN');
      for (const skill of skills) {
        const signal = extractRoutingSignal(skill);
        const vector = await createEmbedding(signal, { model: 'intfloat/multilingual-e5-base' });
        // Refresh every stored column on conflict so name, description, and
        // path cannot drift out of sync with the new embedding
        await client.query(
          `INSERT INTO skill_vectors (skill_id, name, description, file_path, version, embedding)
           VALUES ($1, $2, $3, $4, $5, $6)
           ON CONFLICT (skill_id) DO UPDATE SET
             name = EXCLUDED.name,
             description = EXCLUDED.description,
             file_path = EXCLUDED.file_path,
             version = EXCLUDED.version,
             embedding = EXCLUDED.embedding`,
          // pgvector accepts the text form '[v1,v2,...]' for vector values
          [skill.id, skill.name, skill.description, skill.filePath, skill.version, `[${vector.join(',')}]`]
        );
      }
      await client.query('COMMIT');
    } catch (err) {
      await client.query('ROLLBACK');
      throw err;
    } finally {
      client.release();
    }
  }
}
Architecture Rationale:
- intfloat/multilingual-e5-base is selected for its 768-dimensional output, which balances semantic precision with inference speed. It handles technical terminology and cross-lingual descriptions effectively.
- pgvector is chosen over dedicated vector databases to leverage existing Postgres infrastructure, ACID compliance, and native connection pooling.
- The ON CONFLICT clause ensures idempotent updates during catalog refreshes.
- Batch transactions prevent partial index states during ingestion failures.
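One detail worth checking when using this model: the published usage notes for the E5 family ask for every input to start with "query: " or "passage: ". Whether the createEmbedding wrapper used in this article already adds those prefixes is not shown, so the helper below is a hypothetical sketch of how it could be done before calling the embedding client.

```typescript
// Role of the text being embedded: indexed skill descriptions are
// "passage" inputs, runtime task prompts are "query" inputs.
type E5Role = 'query' | 'passage';

// Hypothetical helper: adds the role prefix once, without duplicating it
// when the caller has already supplied it.
function withE5Prefix(text: string, role: E5Role): string {
  const prefix = `${role}: `;
  const trimmed = text.trim();
  return trimmed.startsWith(prefix) ? trimmed : prefix + trimmed;
}

// Indexing side: name and description joined as in Step 1.
const passageInput = withE5Prefix('pdf-extract\n\nExtract text from PDF files', 'passage');
// Query side: the raw task prompt.
const queryInput = withE5Prefix('pull the text out of this PDF', 'query');
console.log(passageInput.split('\n')[0]); // passage: pdf-extract
console.log(queryInput);                  // query: pull the text out of this PDF
```

If the prefixes are applied, they should be applied consistently on both the indexing and the query path; mixing prefixed and unprefixed inputs is likely to skew similarity scores.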
Step 3: Runtime Resolution
At task execution, the agent converts the user prompt into a query vector, performs a cosine similarity search, and returns the top candidates.
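// A candidate plus the cosine similarity computed by the query
export interface ScoredSkill extends SkillMetadata {
  similarity: number;
}

export class SkillRouter {
  async resolveCandidates(taskPrompt: string, limit: number = 5): Promise<ScoredSkill[]> {
    const queryVector = await createEmbedding(taskPrompt, { model: 'intfloat/multilingual-e5-base' });
    const vectorLiteral = `[${queryVector.join(',')}]`;
    // <=> is pgvector's cosine distance operator; 1 - distance gives similarity
    const result = await db.query(
      `SELECT skill_id, name, description, file_path, version,
              1 - (embedding <=> $1::vector) AS similarity
       FROM skill_vectors
       ORDER BY embedding <=> $1::vector
       LIMIT $2`,
      [vectorLiteral, limit]
    );
    // Keep the similarity score so callers can apply a confidence threshold
    return result.rows.map(row => ({
      id: row.skill_id,
      name: row.name,
      description: row.description,
      filePath: row.file_path,
      version: row.version,
      similarity: Number(row.similarity)
    }));
  }
}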
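For readers who want to see what the similarity search computes, the following in-memory sketch reproduces the same ranking without a database. It is an illustration only: `rankBySimilarity` is a hypothetical helper, and returning 0 for a zero-length vector is a choice made here, not pgvector behavior.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). pgvector's <=> operator
// returns the cosine distance, which is 1 minus this value.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // sketch-level choice for zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every item against the query, sort best-first, keep the top `limit`.
function rankBySimilarity<T extends { embedding: number[] }>(
  query: number[],
  items: T[],
  limit: number = 5
): Array<T & { similarity: number }> {
  return items
    .map(item => ({ ...item, similarity: cosineSimilarity(query, item.embedding) }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit);
}
```

This brute-force scan is linear in catalog size, which is fine for tests and small catalogs; the database index exists to avoid exactly this scan at scale.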
Step 4: Dynamic Body Hydration
The agent selects the highest-scoring candidate (or applies a fallback heuristic) and reads the full markdown body only when needed.
import { readFile } from 'fs/promises';

export async function hydrateSkillCandidate(candidate: SkillMetadata): Promise<string> {
  let rawContent: string;
  try {
    // Non-blocking read: the function is async, so avoid readFileSync here
    rawContent = await readFile(candidate.filePath, 'utf-8');
  } catch {
    throw new Error(`Skill body unavailable at ${candidate.filePath}`);
  }
  // Strip frontmatter, return only the instruction body (tolerates CRLF line endings)
  const bodyMatch = rawContent.match(/^---[\s\S]*?---\r?\n([\s\S]*)$/);
  return bodyMatch ? bodyMatch[1].trim() : rawContent;
}
Why this structure works: The routing layer operates independently of the agent's execution loop. Context consumption remains constant because the index query returns a fixed-size result set. The agent pays for skill bodies only upon explicit selection, eliminating speculative token expenditure.
Pitfall Guide
1. Corpus Depth Fallacy
Explanation: Assuming embeddings can retrieve skills that were never indexed. The router's accuracy is strictly bounded by catalog coverage. If a skill is missing from the vector store, no similarity threshold will surface it.
Fix: Implement coverage monitoring. Track indexed vs. total skill counts. Use incremental indexing pipelines that watch the filesystem for new or updated .md files.
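A coverage check along these lines can be a small pure function. The sketch below assumes you can list skill IDs from disk and from the vector store; `checkCoverage` and its report shape are hypothetical, and the 5% default mirrors the alert threshold used in the action checklist.

```typescript
interface CoverageReport {
  total: number;     // skills found on disk
  indexed: number;   // of those, how many exist in the vector store
  missing: string[]; // IDs present on disk but absent from the index
  coverage: number;  // indexed / total, 1 when the catalog is empty
  alert: boolean;    // true when the missing share exceeds maxDrift
}

function checkCoverage(
  onDiskIds: string[],
  indexedIds: string[],
  maxDrift: number = 0.05
): CoverageReport {
  const indexedSet = new Set(indexedIds);
  const missing = onDiskIds.filter(id => !indexedSet.has(id));
  const total = onDiskIds.length;
  const indexed = total - missing.length;
  return {
    total,
    indexed,
    missing,
    coverage: total === 0 ? 1 : indexed / total,
    // Compare counts rather than ratios to sidestep floating-point edge cases
    alert: missing.length > maxDrift * total,
  };
}
```

The `missing` list doubles as the work queue for an incremental indexing run.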
2. Metadata Pollution
Explanation: Embedding full skill bodies, tags, or boilerplate frontmatter introduces noise that dilutes semantic signals. The vector space becomes dominated by repetitive instructional text rather than intent descriptors.
Fix: Enforce a strict extraction pipeline. Index only name + "\n\n" + description. Validate frontmatter schema during ingestion and reject files missing required fields.
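The validation step can be sketched as a small parser that rejects files without the required fields. This is a deliberately minimal version that only understands flat `key: value` frontmatter lines; a real pipeline would more likely use a proper YAML parser. `parseSkillFile` is a hypothetical name.

```typescript
interface ParsedSkill {
  name: string;
  description: string;
  body: string; // instruction body, never embedded
}

function parseSkillFile(raw: string): ParsedSkill {
  // Frontmatter is the block between the first two '---' lines.
  const match = raw.match(/^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)([\s\S]*)$/);
  if (!match) throw new Error('missing frontmatter block');

  const fields: Record<string, string> = {};
  for (const line of match[1].split(/\r?\n/)) {
    if (/^\s/.test(line)) continue; // skip nested/indented YAML (assumption: not needed for routing)
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim();
    const value = line.slice(sep + 1).trim().replace(/^["']|["']$/g, '');
    fields[key] = value;
  }

  for (const required of ['name', 'description']) {
    if (!fields[required]) throw new Error(`missing required field: ${required}`);
  }
  return { name: fields.name, description: fields.description, body: match[2].trim() };
}

// Only name and description feed the embedding; tags and body are ignored.
function toRoutingSignal(skill: ParsedSkill): string {
  return `${skill.name}\n\n${skill.description}`;
}
```

Rejecting malformed files at ingestion keeps empty or boilerplate descriptions from landing in the vector space in the first place.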
3. Worker Bottlenecks
Explanation: Default concurrency settings (e.g., 3 workers) create ingestion backlogs. A single-CPU container processing 768-dimensional vectors sequentially will stall, leaving skills in a "pending" state.
Fix: Parameterize worker counts based on available CPU cores. Monitor queue depth and implement exponential backoff for embedding API retries. Scale to 10+ workers for bulk operations.
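The two mechanisms named in the fix, bounded concurrency and exponential backoff, can be sketched as follows. The helper names are hypothetical; the defaults (3 retries, 1000 ms base) mirror the ingestion values in the configuration template. Note that this overlaps I/O-bound embedding API calls inside one process; CPU-bound local inference would need separate processes or worker threads instead.

```typescript
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseMs: number = 1000): number {
  return baseMs * 2 ** attempt;
}

// Runs `task`, retrying up to `retryAttempts` times with exponential backoff.
async function withRetry<T>(
  task: () => Promise<T>,
  retryAttempts: number = 3,
  baseMs: number = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retryAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < retryAttempts) {
        await new Promise(resolve => setTimeout(resolve, backoffDelayMs(attempt, baseMs)));
      }
    }
  }
  throw lastError;
}

// Processes items with at most `maxWorkers` handlers in flight at once.
// Results keep the order of the input array.
async function runWithConcurrency<T, R>(
  items: T[],
  maxWorkers: number,
  handler: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // safe: JS runs this synchronously between awaits
      results[index] = await handler(items[index]);
    }
  }
  const workerCount = Math.max(1, Math.min(maxWorkers, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A batch ingestion loop could then wrap each embedding call as `runWithConcurrency(batches, 10, b => withRetry(() => indexer.ingestBatch(b)))`, assuming an `indexer` instance like the one from Step 2.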
4. Cold-Start Latency
Explanation: The first query after a service restart often times out due to connection pool initialization, model loading, or cache warming. This breaks agent workflows that expect sub-second routing.
Fix: Implement a warm-up sequence that preloads the embedding model and establishes database connections. Use async preloading to cache frequent query patterns during idle periods.
5. Threshold Myopia
Explanation: Relying exclusively on cosine similarity scores without semantic validation. High similarity does not guarantee functional relevance, especially for overlapping tool categories.
Fix: Add a lightweight verification layer. If the top candidate's similarity falls below 0.80, trigger a hybrid search (keyword + vector) or route to a fallback skill. Log low-confidence matches for manual review.
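The threshold check itself is simple to express as a pure function. In this sketch the `RoutingDecision` type and `selectCandidate` name are hypothetical, and what happens on the fallback branch (hybrid search, fallback skill, logging) is left to the caller.

```typescript
interface ScoredCandidate {
  id: string;
  similarity: number; // cosine similarity, higher is better
}

type RoutingDecision =
  | { kind: 'selected'; candidate: ScoredCandidate }
  | { kind: 'fallback'; reason: string; candidates: ScoredCandidate[] };

function selectCandidate(
  candidates: ScoredCandidate[],
  minSimilarity: number = 0.8
): RoutingDecision {
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...candidates].sort((a, b) => b.similarity - a.similarity);
  const top = sorted[0];
  if (!top) {
    return { kind: 'fallback', reason: 'no candidates', candidates: [] };
  }
  if (top.similarity < minSimilarity) {
    return {
      kind: 'fallback',
      reason: `top similarity ${top.similarity.toFixed(2)} below ${minSimilarity}`,
      candidates: sorted, // hand these to the hybrid search or a review log
    };
  }
  return { kind: 'selected', candidate: top };
}
```

Returning a tagged decision rather than a nullable candidate forces the agent loop to handle the low-confidence case explicitly.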
6. Stale Index Drift
Explanation: Skills are updated on disk, but the vector index retains outdated embeddings. The router returns candidates that no longer match the actual tool behavior.
Fix: Implement versioned indexing. Store a version or last_modified timestamp alongside each vector. Schedule periodic diff checks that re-embed only changed files.
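A diff check of this kind reduces to comparing two lists of (id, version) pairs. The sketch below assumes the version string changes whenever a skill file changes; `findStaleSkills` is a hypothetical helper. It also reports index entries whose files no longer exist, which covers the garbage-collection gap of static catalogs.

```typescript
interface VersionedSkill {
  id: string;
  version: string; // semantic version, commit hash, or last-modified stamp
}

interface IndexDiff {
  toEmbed: string[];  // new on disk, or version differs from the index
  toDelete: string[]; // still in the index but gone from disk
}

function findStaleSkills(onDisk: VersionedSkill[], indexed: VersionedSkill[]): IndexDiff {
  const indexedVersions = new Map<string, string>();
  for (const skill of indexed) indexedVersions.set(skill.id, skill.version);

  const diskIds = new Set(onDisk.map(skill => skill.id));

  return {
    // A missing map entry yields undefined, which never equals a version string
    toEmbed: onDisk
      .filter(skill => indexedVersions.get(skill.id) !== skill.version)
      .map(skill => skill.id),
    toDelete: indexed
      .filter(skill => !diskIds.has(skill.id))
      .map(skill => skill.id),
  };
}
```

Only the IDs in `toEmbed` need a new embedding call, so the cost of a scheduled check scales with the number of changes rather than the catalog size.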
7. Context Window Illusion
Explanation: Assuming larger context windows (e.g., 1M tokens) eliminate the need for routing. A 228K-token catalog still consumes 23% of a 1M window, starving task-specific context and increasing inference cost.
Fix: Calculate token tax explicitly. Treat the catalog index as a separate budget line item. Route aggressively regardless of window size to preserve attention quality and reduce API costs.
Production Bundle
Action Checklist
- Audit existing skill catalogs: Count total files, estimate token overhead, and identify stale/duplicate entries.
- Deploy pgvector instance: Provision a Postgres database with the vector extension enabled and configure connection pooling.
- Implement extraction pipeline: Build a script that parses YAML frontmatter and outputs normalized name + description strings.
- Configure embedding workers: Set concurrency to match available CPU cores, implement retry logic, and add queue depth monitoring.
- Build routing endpoint: Expose a /resolve API that accepts task prompts and returns top-5 candidates with similarity scores.
- Add dynamic hydration: Integrate file reading logic that strips frontmatter and injects only the skill body into the agent context.
- Establish version control: Tag indexed skills with git commit hashes or semantic versions to detect drift.
- Monitor coverage metrics: Track indexed vs. total skill ratio and alert when drift exceeds 5%.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| < 50 skills | Eager loading | Catalog overhead is negligible; routing adds unnecessary complexity | Baseline |
| 50–200 skills | Hybrid indexing | Progressive disclosure works, but routing reduces context tax by ~60% | Moderate reduction |
| 200–1,000 skills | Semantic routing | Attention degradation begins; constant token cost preserves task context | High reduction |
| > 1,000 skills | Semantic routing + versioning | Catalog overflow is inevitable; routing is the only sustainable pattern | Critical reduction |
| Multi-tenant deployment | Isolated vector namespaces | Prevents cross-tenant skill leakage; enables per-tenant indexing pipelines | Infrastructure increase |
Configuration Template
# docker-compose.yml for pgvector + embedding worker
version: '3.8'
services:
  vector-db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: agent_skills
      POSTGRES_USER: router
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
  embedding-worker:
    build: ./worker
    environment:
      DATABASE_URL: postgresql://router:${DB_PASSWORD}@vector-db:5432/agent_skills
      EMBEDDING_MODEL: intfloat/multilingual-e5-base
      WORKER_CONCURRENCY: 10
      BATCH_SIZE: 50
    depends_on:
      - vector-db
volumes:
  pgdata:
// router.config.ts
export const RouterConfig = {
  embedding: {
    model: 'intfloat/multilingual-e5-base',
    dimensions: 768,
    batchSize: 64,
    timeoutMs: 5000
  },
  retrieval: {
    defaultLimit: 5,
    minSimilarity: 0.80,
    fallbackStrategy: 'keyword-hybrid',
    cacheTTL: 300 // seconds
  },
  storage: {
    table: 'skill_vectors',
    vectorIndex: 'IVFFlat',
    lists: 100,
    probe: 10
  },
  ingestion: {
    maxWorkers: 10,
    retryAttempts: 3,
    backoffBase: 1000,
    validateFrontmatter: true
  }
};
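The storage settings above map onto pgvector DDL roughly as follows. The article does not show its schema migration, so this is a sketch: the column names come from the INSERT statement in Step 2, while the column types, the index name, and the `buildSchemaStatements` helper are assumptions.

```typescript
interface StorageSettings {
  table: string;      // must be a trusted identifier, it is interpolated into SQL
  dimensions: number; // embedding width, 768 for the model used here
  lists: number;      // IVFFlat: number of inverted lists
  probe: number;      // IVFFlat: lists scanned per query
}

// Statements to run once, in order, when setting up the vector store.
function buildSchemaStatements(cfg: StorageSettings): string[] {
  return [
    'CREATE EXTENSION IF NOT EXISTS vector;',
    `CREATE TABLE IF NOT EXISTS ${cfg.table} (
  skill_id    text PRIMARY KEY,
  name        text NOT NULL,
  description text NOT NULL,
  file_path   text NOT NULL,
  version     text NOT NULL,
  embedding   vector(${cfg.dimensions}) NOT NULL
);`,
    // vector_cosine_ops matches the <=> cosine distance operator used by the router
    `CREATE INDEX IF NOT EXISTS ${cfg.table}_embedding_idx ON ${cfg.table} ` +
      `USING ivfflat (embedding vector_cosine_ops) WITH (lists = ${cfg.lists});`,
  ];
}

// Per-session setting that trades recall against query speed.
function buildProbeSetting(cfg: StorageSettings): string {
  return `SET ivfflat.probes = ${cfg.probe};`;
}

const statements = buildSchemaStatements({
  table: 'skill_vectors',
  dimensions: 768,
  lists: 100,
  probe: 10,
});
statements.forEach(sql => console.log(sql));
```

Two caveats apply: an IVFFlat index generally gives better recall when it is built after the table already holds data, and a `lists` value of 100 may be larger than a catalog of a few thousand rows needs, so treat both numbers as starting points to tune.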
Quick Start Guide
- Initialize the vector store: Run CREATE EXTENSION vector; in your Postgres instance, then execute the schema migration to create the skill_vectors table with a vector(768) column and IVFFlat index.
- Ingest your catalog: Point the extraction script at your skills directory. Run the batch ingestion pipeline. Verify that SELECT COUNT(*) FROM skill_vectors; matches your expected file count.
- Test routing: Send a sample task prompt to the /resolve endpoint. Confirm that the response returns 5 candidates with similarity scores between 0.83 and 0.88.
- Integrate with agent: Replace the static catalog injection in your system prompt with a single routing call. Parse the top candidate, hydrate the body, and append it to the conversation context.
- Monitor & iterate: Track top-1 accuracy and latency. Adjust minSimilarity thresholds and worker concurrency based on production load. Schedule weekly diff checks to keep the index synchronized with disk.
