How I indexed 69,000 Claude Code skills (and what I learned doing it)
Current Situation Analysis
AI agent ecosystems are rapidly adopting markdown-based instruction files to extend base model capabilities. In the Claude Code environment, these are defined as SKILL.md files containing YAML frontmatter and natural language directives. When placed in the designated user directory (~/.claude/skills/<name>/), they become invocable slash commands. The format is lightweight, human-readable, and highly portable. However, the operational reality of managing thousands of these files reveals a critical infrastructure gap: discovery and quality assurance are entirely decentralized.
The industry pain point is not the creation of skills, but their aggregation. Authors publish to public repositories, but there is no centralized registry, standardized search surface, or programmatic API. Developers relying on native platform search or community-curated lists encounter severe fragmentation. The long tail of the ecosystem remains invisible, and quality variance is extreme. A rigorously engineered skill with explicit boundary conditions sits alongside a four-line placeholder with identical discoverability.
This problem is frequently misunderstood because teams assume GitHub's native code search or social aggregation is sufficient. In practice, platform search engines impose hard result caps (typically 1,000 per query), ignore non-repository artifacts like gists or social mentions, and lack semantic understanding of agent instruction formats. Furthermore, the specification evolves rapidly. Frontmatter fields like allowed-tools, user-invokable, and metadata.api_base are added monthly. A parser built against an early draft will silently fail or misclassify newer entries.
Data from large-scale indexing operations confirms the scale of the blind spot. Over 69,000 skill files have been cataloged across public sources, yet fewer than 300 were historically visible in curated lists. The distribution follows a steep Pareto curve: the top 25 contributors account for roughly 30% of all indexed skills. Meanwhile, zero entries originate from major AI vendors, confirming the ecosystem is entirely community-driven. The format is also leaking laterally, appearing in repositories tagged for competing agent frameworks (Cursor, Cline, Aider, Windsurf), which means any registry must treat the file as a portable agent standard rather than a platform-specific artifact.
Without a dedicated indexing layer, teams building agent tooling, evaluation pipelines, or internal marketplaces are forced to scrape, parse, and score manually. This introduces latency, inconsistency, and operational debt. The solution requires a batch-driven discovery engine, a content-only quality model, and a hybrid storage architecture optimized for static asset delivery.
WOW Moment: Key Findings
The most counterintuitive insight from scaling a skill registry is that popularity metrics actively degrade signal quality. When ranking relies on stars, forks, or follower counts, the catalog becomes vulnerable to gaming, vendor bias, and hype cycles. Conversely, a purely structural scoring model surfaces skills that actually prevent agent misbehavior.
| Indexing Strategy | Discovery Coverage | Quality Signal Accuracy | Vendor/Influence Bias | Operational Complexity |
|---|---|---|---|---|
| Popularity-Driven | High (top-heavy) | Low (correlates with marketing, not utility) | High (favors established accounts) | Low (native platform APIs suffice) |
| Content-Structural | High (long-tail inclusive) | High (measures boundary discipline, transparency) | Zero (ignores author metrics) | Medium-High (requires custom parsing & scoring) |
| Hybrid (Popularity + Content) | Medium | Medium (dilutes structural signals) | Medium (reintroduces bias) | High (requires complex weighting logic) |
This finding matters because it redefines how agent instruction quality should be measured. A skill that explicitly documents when not to trigger, includes pricing/quota transparency, and maintains structured frontmatter will consistently outperform a viral but vague instruction set in production agent workflows. The structural approach also future-proofs the registry: as the spec evolves, the scoring model adapts by weighting new frontmatter keys rather than chasing social metrics. It enables objective evaluation layers, reliable recommender systems, and trustworthy internal marketplaces without introducing pay-to-rank dynamics.
Core Solution
Building a production-ready skill registry requires separating discovery, validation, scoring, and delivery into distinct pipelines. The architecture prioritizes idempotent batch processing, edge-optimized static delivery, and a scoring engine that ignores all social signals.
Step 1: Multi-Source Discovery Pipeline
A single orchestrator runs nightly, querying 24 distinct data surfaces. Instead of relying on one search endpoint, the pipeline distributes load across:
- Repository code search with query variants (language hints, date bounds, frontmatter field filters)
- Topic-tagged repositories and gists
- Community lists, alternative Git hosts, and dataset platforms
- Social and discussion platforms via Algolia or native search APIs
- Archive indexes for renamed or deleted repositories
- Graph traversal (stargazer enumeration) to surface skills from users who interact with known entries
- LLM-assisted query expansion to generate next-cycle search terms based on discovered patterns
Each source is rate-limited and wrapped in isolated execution blocks. A single endpoint failure does not cascade. The pipeline outputs a deduplicated list of candidate repositories.
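The isolation and deduplication described here could look roughly like the sketch below. The article does not show its orchestrator, so the `DiscoverySource` shape, the `runSources` helper, and the URL normalization rules are assumptions for illustration. Sources are synchronous here to keep the example short; a real pipeline would be asynchronous and rate-limited.

```typescript
// Hypothetical shape of one discovery source; the article does not publish its own.
interface DiscoverySource {
  id: string;
  // Returns candidate repository URLs. Synchronous only to keep the sketch short.
  collect: () => string[];
}

interface DiscoveryRun {
  candidates: string[];             // deduplicated, normalized URLs
  failures: Record<string, string>; // source id -> error message
}

// Collapse trivially different spellings of the same repository URL.
// Lowercasing the whole URL is an assumption that suits case-insensitive hosts.
function normalizeRepoUrl(url: string): string {
  return url
    .trim()
    .toLowerCase()
    .replace(/^http:\/\//, "https://")
    .replace(/\/+$/, "")
    .replace(/\.git$/, "");
}

function runSources(sources: DiscoverySource[]): DiscoveryRun {
  const seen = new Set<string>();
  const failures: Record<string, string> = {};
  for (const source of sources) {
    try {
      for (const url of source.collect()) {
        seen.add(normalizeRepoUrl(url));
      }
    } catch (err) {
      // A failing source is recorded and skipped; the remaining sources still run.
      failures[source.id] = err instanceof Error ? err.message : String(err);
    }
  }
  return { candidates: Array.from(seen).sort(), failures };
}
```

Recording failures per source, rather than aborting, lets the nightly run report which endpoints were skipped while still producing a usable candidate list.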
Step 2: Frontmatter Parsing & Validation
The parser extracts YAML frontmatter and validates against a dynamic schema. It normalizes field names, strips markdown artifacts, and enforces type constraints. Critical fields include name, description, allowed-tools, model, and metadata.*. The parser also extracts structural markers: headings, code blocks, and explicit negative-space sections.
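As a rough illustration of the normalization step, the sketch below takes an already-parsed frontmatter object, converts keys to camelCase, and moves anything unrecognized into a metadata bucket instead of dropping it. The `KNOWN_KEYS` list, the function names, and the comma-splitting of `allowedTools` are assumptions made for this example, not the registry's actual schema.

```typescript
// Keys this sketch treats as known; the list is an assumption, not an official spec.
const KNOWN_KEYS = ["name", "description", "allowedTools", "model", "userInvokable", "version", "license"];

interface NormalizedFrontmatter {
  known: Record<string, unknown>;
  metadata: Record<string, unknown>; // unrecognized keys are kept here, not discarded
}

// "allowed-tools" and "allowed_tools" both become "allowedTools".
function toCamelCase(key: string): string {
  return key.trim().replace(/[-_]+([a-zA-Z0-9])/g, (_m, ch: string) => ch.toUpperCase());
}

function normalizeFrontmatter(raw: Record<string, unknown>): NormalizedFrontmatter {
  const known: Record<string, unknown> = {};
  const metadata: Record<string, unknown> = {};
  for (const rawKey of Object.keys(raw)) {
    const key = toCamelCase(rawKey);
    const value = raw[rawKey];
    if (key === "metadata" && typeof value === "object" && value !== null && !Array.isArray(value)) {
      // Flatten an explicit metadata mapping into the same bucket.
      Object.assign(metadata, value);
    } else if (KNOWN_KEYS.indexOf(key) !== -1) {
      known[key] = value;
    } else {
      metadata[key] = value;
    }
  }
  // Type constraint: accept a list or a comma-separated string for allowedTools.
  // Naive split; assumes individual entries contain no commas.
  const tools = known["allowedTools"];
  if (typeof tools === "string") {
    known["allowedTools"] = tools.split(",").map((t) => t.trim()).filter((t) => t.length > 0);
  }
  return { known, metadata };
}
```

Keeping unknown keys in `metadata` means a newly introduced field is preserved for later re-scoring rather than lost at parse time.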
Step 3: Content-Only Quality Scoring
The scoring engine evaluates the instruction file itself. It calculates a weighted score based on:
- Anti-trigger discipline: Presence of "out of scope" or "when not to use" sections (+4 per pattern, capped at +16)
- Cost transparency: Documentation of API spend, rate limits, or quota expectations (+10)
- Frontmatter depth: Number of distinct configuration keys beyond name/description (capped at 10 to prevent padding)
- Structural density: Minimum description length, presence of multiple code examples, and hierarchical headings
- Filler penalty: Detection of placeholder text, TODO markers, or generic templates (-5)
The final score is normalized to a [50, 100] range for production evaluation layers. No stars, forks, or author metrics influence the result.
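A small worked example may make the normalization concrete. The component values below are invented for illustration, and the reference maximum of 40 raw points is taken from the article's own code example; treat the helper name `normalizeScore` as hypothetical.

```typescript
// Raw component scores for one imaginary skill (values made up for illustration).
const exampleBreakdown: Record<string, number> = {
  frontmatterDepth: 2,
  antiTrigger: 8,    // two anti-trigger patterns at +4 each
  transparency: 10,
  structure: 5,
  fillerPenalty: -5,
};

// Scale the raw total against a reference maximum, then clamp to [50, 100].
function normalizeScore(breakdown: Record<string, number>, referenceMax: number = 40): number {
  const raw = Object.keys(breakdown).reduce((sum, key) => sum + (breakdown[key] ?? 0), 0);
  const scaled = 50 + (raw / referenceMax) * 50;
  return Math.round(Math.max(50, Math.min(100, scaled)));
}

// Raw total: 2 + 8 + 10 + 5 - 5 = 20, so 50 + (20 / 40) * 50 = 75.
const exampleScore = normalizeScore(exampleBreakdown);
```

Because of the clamp, a skill whose penalties outweigh its bonuses still lands at 50, and any raw total of 40 or more saturates at 100.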
Step 4: Storage & API Delivery
Per-skill HTML pages and metadata are generated statically. To avoid deploy budget exhaustion at scale, files are stored in object storage (Cloudflare R2) and served via edge rewrites. The API layer runs as lightweight serverless functions (Cloudflare Workers) bound to the same domain, providing paginated listings, single-skill retrieval, category/tag filtering, and aggregate statistics. The entire API surface is ~300 lines of code, with heavy lifting handled by the nightly batch job.
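The paginated listing endpoint can be kept small because it only slices data the batch job has already produced. The following is a minimal sketch of that slicing logic, independent of any Workers API; the `Page` shape and `paginate` helper are assumptions, and the default cap of 50 mirrors the `pagination_limit` value in the configuration template.

```typescript
interface Page<T> {
  items: T[];
  page: number;          // 1-based page index actually served
  perPage: number;
  totalItems: number;
  totalPages: number;
  nextPage: number | null;
}

// Clamp client-supplied paging parameters, then slice the precomputed list.
function paginate<T>(all: T[], requestedPage: number, requestedPerPage: number, maxPerPage: number = 50): Page<T> {
  // NaN or 0 falls back to the defaults; oversized values are capped.
  const perPage = Math.max(1, Math.min(Math.floor(requestedPerPage) || maxPerPage, maxPerPage));
  const totalPages = Math.max(1, Math.ceil(all.length / perPage));
  const page = Math.max(1, Math.min(Math.floor(requestedPage) || 1, totalPages));
  const start = (page - 1) * perPage;
  return {
    items: all.slice(start, start + perPage),
    page,
    perPage,
    totalItems: all.length,
    totalPages,
    nextPage: page < totalPages ? page + 1 : null,
  };
}
```

Clamping instead of rejecting out-of-range parameters is a design choice; a stricter API might return a 400 response for them.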
Architecture Rationale
- Batch over real-time: Skill files change infrequently. Nightly runs reduce API costs, avoid rate-limit collisions, and allow comprehensive graph traversal.
- Content-only scoring: Prevents gaming, ensures objective evaluation, and aligns with actual agent reliability.
- Hybrid static/dynamic delivery: Object storage handles scale and cost; edge functions handle routing and API logic; static site generators handle hub pages. Each layer does what it does best.
- Orthogonal tagging: Skills are categorized by domain (Engineering, Security, Growth, etc.) and tagged across ~100 dimensions (language, framework, AI provider, integration type). This enables multi-axis filtering without hardcoding taxonomies.
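To show what multi-axis filtering over orthogonal tags might look like, here is a small sketch. The `TaggedSkill` record and the filter semantics (AND across dimensions, OR within a dimension) are assumptions chosen for the example; the article does not specify them.

```typescript
// Hypothetical record: one domain category plus tags grouped by dimension.
interface TaggedSkill {
  slug: string;
  domain: string;
  tags: Record<string, string[]>; // dimension -> values, e.g. language -> ["typescript"]
}

// Accepted values per dimension. Dimensions are ANDed; values within one are ORed.
type TagFilter = Record<string, string[]>;

function filterSkills(skills: TaggedSkill[], filter: TagFilter, domain?: string): TaggedSkill[] {
  return skills.filter((skill) => {
    if (domain !== undefined && skill.domain !== domain) return false;
    return Object.keys(filter).every((dimension) => {
      const wanted = filter[dimension] ?? [];
      const actual = skill.tags[dimension] ?? [];
      // An empty value list places no constraint on that dimension.
      return wanted.length === 0 || wanted.some((value) => actual.indexOf(value) !== -1);
    });
  });
}
```

Because dimensions are plain string keys, adding a new axis requires no code change, only new tag data.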
Code Example: Skill Parser & Scorer (TypeScript)
```typescript
import { parse as yamlParse } from 'yaml';
import { createHash } from 'crypto';

interface SkillManifest {
  name: string;
  description: string;
  allowedTools?: string[];
  model?: string;
  metadata?: Record<string, unknown>;
  [key: string]: unknown;
}

interface ScoringResult {
  totalScore: number;
  breakdown: Record<string, number>;
  slug: string;
}

export class AgentSkillEvaluator {
  private readonly MAX_FRONTMATTER_KEYS = 10;
  private readonly FILLER_PENALTY = -5;
  private readonly ANTI_TRIGGER_BONUS = 4;
  private readonly MAX_ANTI_TRIGGER_BONUS = 16;
  private readonly TRANSPARENCY_BONUS = 10;

  public evaluate(rawMarkdown: string): ScoringResult {
    const { frontmatter, body } = this.extractFrontmatter(rawMarkdown);
    const breakdown: Record<string, number> = {};

    breakdown.frontmatterDepth = this.scoreFrontmatterDepth(frontmatter);
    breakdown.antiTrigger = this.scoreAntiTriggerSections(body);
    breakdown.transparency = this.scoreCostTransparency(body);
    breakdown.structure = this.scoreStructuralDensity(body);
    breakdown.fillerPenalty = this.detectFillerPhrases(body) ? this.FILLER_PENALTY : 0;

    // Scale the raw total against a 40-point reference and clamp to [50, 100].
    const rawTotal = Object.values(breakdown).reduce((sum, val) => sum + val, 0);
    const normalizedTotal = Math.max(50, Math.min(100, 50 + (rawTotal / 40) * 50));

    return {
      totalScore: Math.round(normalizedTotal),
      breakdown,
      slug: this.generateSlug(frontmatter.name)
    };
  }

  private extractFrontmatter(content: string): { frontmatter: SkillManifest; body: string } {
    // Accept both LF and CRLF line endings around the frontmatter fences.
    const match = content.match(/^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)([\s\S]*)$/);
    if (!match) throw new Error('Invalid skill format: missing YAML frontmatter');

    // yamlParse returns null or a scalar for empty or malformed frontmatter.
    const parsed: unknown = yamlParse(match[1]);
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
      throw new Error('Invalid skill format: frontmatter is not a YAML mapping');
    }
    const frontmatter = parsed as SkillManifest;
    if (typeof frontmatter.name !== 'string') {
      throw new Error('Invalid skill format: missing "name" field');
    }
    return { frontmatter, body: match[2] };
  }

  // One point per configuration key beyond name/description, capped to deter padding.
  private scoreFrontmatterDepth(fm: SkillManifest): number {
    const keys = Object.keys(fm).filter(k => !['name', 'description'].includes(k));
    return Math.min(keys.length, this.MAX_FRONTMATTER_KEYS);
  }

  // +4 per anti-trigger phrase found, capped at +16.
  private scoreAntiTriggerSections(body: string): number {
    const patterns = /(?:when\s+not\s+to\s+use|out\s+of\s+scope|negative\s+space|avoid\s+trigger)/gi;
    const matches = body.match(patterns);
    const count = matches ? matches.length : 0;
    return Math.min(count * this.ANTI_TRIGGER_BONUS, this.MAX_ANTI_TRIGGER_BONUS);
  }

  private scoreCostTransparency(body: string): number {
    const costMarkers = /(?:rate\s+limit|api\s+spend|quota|pricing|cost\s+estimate|token\s+budget)/i;
    return costMarkers.test(body) ? this.TRANSPARENCY_BONUS : 0;
  }

  private scoreStructuralDensity(body: string): number {
    const hasHeadings = /^#{1,3}\s+.+$/m.test(body);
    // Each fenced block contributes two fence markers; ignore an unpaired one.
    const codeBlocks = Math.floor((body.match(/```/g) || []).length / 2);
    const bodyLength = body.length;
    return (hasHeadings ? 2 : 0) + Math.min(codeBlocks, 4) + (bodyLength > 800 ? 3 : 0);
  }

  private detectFillerPhrases(body: string): boolean {
    const fillers = /(?:todo:|lorem\s+ipsum|placeholder|example\s+only|draft)/i;
    return fillers.test(body);
  }

  private generateSlug(name: string): string {
    const base = name.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/(^-|-$)/g, '');
    // Hash the original name rather than a timestamp so the slug stays stable across nightly runs.
    const hash = createHash('sha256').update(name).digest('hex').slice(0, 6);
    return `${base}-${hash}`;
  }
}
```
Pitfall Guide
1. Relying Solely on Platform Code Search
Explanation: Native repository search engines enforce hard result caps (usually 1,000 per query) and ignore non-repository artifacts. This blinds the indexer to gists, social mentions, and alternative hosts.
Fix: Distribute discovery across 20+ sources. Use query variants with date bounds and field filters to bypass caps. Include archive indexes and graph traversal to recover deleted or renamed entries.
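One way to generate date-bounded query variants is to split the overall time range into fixed windows and issue one query per window, so that each stays under the result cap. The sketch below is an assumption-laden illustration: the `created:` range qualifier and the `filename:SKILL.md` base query are placeholders, and whether a given search endpoint supports date qualifiers has to be checked against its documentation.

```typescript
interface DateWindow {
  from: string; // inclusive, YYYY-MM-DD (UTC)
  to: string;   // inclusive, YYYY-MM-DD (UTC)
}

function isoDay(ms: number): string {
  return new Date(ms).toISOString().slice(0, 10);
}

// Split [startIso, endIso] into consecutive windows of at most windowDays days.
function dateWindows(startIso: string, endIso: string, windowDays: number): DateWindow[] {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const start = Date.parse(`${startIso}T00:00:00Z`);
  const end = Date.parse(`${endIso}T00:00:00Z`);
  const span = Math.floor(windowDays);
  if (Number.isNaN(start) || Number.isNaN(end) || !(span >= 1)) {
    throw new Error("dateWindows: invalid date or window size");
  }
  const windows: DateWindow[] = [];
  let cursor = start;
  while (cursor <= end) {
    const last = Math.min(cursor + (span - 1) * DAY_MS, end);
    windows.push({ from: isoDay(cursor), to: isoDay(last) });
    cursor = last + DAY_MS;
  }
  return windows;
}

// Illustrative qualifier syntax only; adapt to whatever the target endpoint accepts.
function queryVariants(baseQuery: string, windows: DateWindow[]): string[] {
  return windows.map((w) => `${baseQuery} created:${w.from}..${w.to}`);
}
```

If a single window still returns the capped number of results, a crawler would typically halve that window and retry, since a saturated window means entries are being missed.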
2. Weighting Popularity Metrics in Ranking
Explanation: Stars, forks, and follower counts correlate with marketing reach, not agent reliability. Introducing these signals invites gaming and vendor bias, and it erodes trust in evaluation layers.
Fix: Enforce a strict content-only scoring model. If a proposed ranking change could be influenced by payment or social manipulation, reject it. Normalize scores to a fixed range to maintain consistency.
3. Ignoring Frontmatter Spec Drift
Explanation: Agent instruction formats evolve monthly. New fields like allowed-tools or metadata.api_base appear without deprecation cycles. Hardcoded parsers break silently or misclassify entries.
Fix: Implement a dynamic schema validator that accepts unknown keys under a metadata namespace. Log schema version mismatches and trigger re-parsing when spec updates are detected.
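Drift detection can be as simple as counting frontmatter keys the parser does not yet recognize across a batch and flagging those that recur. The sketch below assumes frontmatter has already been parsed into plain objects; the function names and the "seen in at least N files" threshold are hypothetical choices for this example.

```typescript
// Count how many files use each key that is not in the known schema.
function countUnknownKeys(
  frontmatters: Array<Record<string, unknown>>,
  knownKeys: string[]
): Map<string, number> {
  const known = new Set(knownKeys);
  const counts = new Map<string, number>();
  for (const fm of frontmatters) {
    for (const key of Object.keys(fm)) {
      if (!known.has(key)) {
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
    }
  }
  return counts;
}

// Keys seen in at least minFiles files are likely spec additions rather than typos.
function driftCandidates(counts: Map<string, number>, minFiles: number): string[] {
  const result: string[] = [];
  counts.forEach((count, key) => {
    if (count >= minFiles) result.push(key);
  });
  return result.sort();
}
```

A nightly job could log `driftCandidates` and, when the list changes, queue the affected files for re-parsing once the schema has been updated.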
4. Monolithic Static Site Deployment at Scale
Explanation: Generating tens of thousands of per-skill HTML pages in a single build pipeline exhausts deploy budgets, causes timeout failures, and slows iteration.
Fix: Decouple generation from delivery. Store static assets in object storage, serve via edge rewrites, and keep the site generator focused on hub pages and navigation. Use CDN caching for API responses.
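The core of an edge rewrite is a pure mapping from request path to object-storage key, which can be tested without any platform code. The sketch below assumes a hypothetical key layout (`skills/<slug>/index.html` and `skills/<slug>/meta.json`); the real bucket structure is not described in the article.

```typescript
// Map a public URL path to an object-storage key, or null if the path is not a skill page.
// The key layout is an assumption made for this example.
function objectKeyForPath(pathname: string): string | null {
  const match = /^\/skills\/([a-z0-9][a-z0-9-]*)(\.json)?\/?$/.exec(pathname);
  if (match === null) return null;
  const slug = match[1];
  const wantsJson = match[2] !== undefined;
  return wantsJson ? `skills/${slug}/meta.json` : `skills/${slug}/index.html`;
}
```

Restricting the slug to lowercase letters, digits, and hyphens means path traversal attempts such as `/skills/../secret` never reach the storage layer; they return `null` and can fall through to the static site.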
5. Missing Negative-Space Validation
Explanation: Skills without explicit "out of scope" or "when not to use" sections cause agents to trigger inappropriately, leading to hallucination or wasted API spend.
Fix: Treat anti-trigger sections as a primary quality signal. Require or heavily weight negative-space documentation in the scoring engine. Flag skills lacking boundary conditions for manual review.
6. Over-Fetching External Social APIs
Explanation: Scraping discussion platforms, social feeds, and comment threads without strict rate limiting or noise filtering consumes budget and returns low-signal URLs.
Fix: Use targeted search APIs (e.g., Algolia) with URL extraction patterns. Cache results, deduplicate aggressively, and apply a relevance threshold before adding candidates to the pipeline.
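A URL extraction pattern combined with a relevance threshold might look like the following sketch. The keyword list, the threshold default, and the restriction to `github.com` URLs are assumptions for illustration; a real pipeline would cover more hosts and use a better relevance measure.

```typescript
// Matches github.com/<owner>/<repo>; other hosts are out of scope for this sketch.
const REPO_URL = /https?:\/\/github\.com\/[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+/g;

// Hypothetical relevance terms; a post must mention at least minRelevance of them.
const RELEVANCE_TERMS = ["skill.md", "claude code", "claude skill", "agent skill"];

function relevance(text: string): number {
  const lower = text.toLowerCase();
  return RELEVANCE_TERMS.filter((term) => lower.indexOf(term) !== -1).length;
}

// Returns deduplicated "owner/repo" identifiers from posts that pass the threshold.
function extractCandidates(posts: string[], minRelevance: number = 1): string[] {
  const seen = new Set<string>();
  for (const post of posts) {
    if (relevance(post) < minRelevance) continue; // drop low-signal posts early
    const urls = post.match(REPO_URL) ?? [];
    for (const url of urls) {
      const repo = url
        .replace(/^https?:\/\/github\.com\//, "")
        .toLowerCase()
        .replace(/\.+$/, "")   // trailing sentence punctuation
        .replace(/\.git$/, "");
      seen.add(repo);
    }
  }
  return Array.from(seen);
}
```

Filtering on relevance before extraction keeps unrelated repository links out of the candidate queue, which is where most of the wasted parsing budget would otherwise go.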
7. Hardcoding Category Taxonomies
Explanation: Relying on fixed categories (e.g., "Engineering", "Security") fails to capture cross-domain skills and becomes outdated as new use cases emerge.
Fix: Use orthogonal tagging across ~100 dimensions. Separate domain classification from technical tagging. Allow multi-label assignment and generate dynamic hub pages from tag combinations.
Production Bundle
Action Checklist
- Deploy a nightly batch orchestrator with isolated execution blocks per data source
- Implement a dynamic YAML frontmatter parser with metadata namespace fallback
- Build a content-only scoring engine that ignores all social/popularity signals
- Store per-skill static assets in object storage and route via edge rewrites
- Expose a paginated, CORS-open REST API with OpenAPI 3.1 documentation
- Generate orthogonal tags across language, framework, provider, and integration type
- Archive daily snapshots in multiple formats (JSON, NDJSON, CSV, Parquet, Atom)
- Monitor schema drift and trigger re-indexing when frontmatter specifications change
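For the snapshot item in the checklist, the two simplest export formats can be produced with a few lines each. The sketch below covers NDJSON and CSV only; the `SkillRecord` fields and the choice to join tags with semicolons are assumptions, and Parquet or Atom output would need dedicated libraries.

```typescript
// Hypothetical snapshot row; the registry's real record has more fields.
interface SkillRecord {
  slug: string;
  name: string;
  score: number;
  tags: string[];
}

// One JSON document per line, newline-terminated.
function toNdjson(records: SkillRecord[]): string {
  return records.map((r) => JSON.stringify(r)).join("\n") + (records.length > 0 ? "\n" : "");
}

// Quote a cell only when it contains a delimiter, quote, or line break (RFC 4180 style).
function csvCell(value: string): string {
  return /[",\n\r]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(records: SkillRecord[]): string {
  const header = "slug,name,score,tags";
  const rows = records.map((r) =>
    [r.slug, r.name, String(r.score), r.tags.join(";")].map(csvCell).join(",")
  );
  return [header, ...rows].join("\n") + "\n";
}
```

Writing both formats from the same in-memory records keeps the daily snapshots consistent with each other.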
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Small team, <5k skills | Single-repo static generator + GitHub Pages | Low operational overhead, sufficient for limited scale | Minimal (free tier) |
| Medium team, 5k–50k skills | Object storage + edge rewrites + nightly batch | Prevents deploy budget exhaustion, scales horizontally | Low-Medium (storage + egress) |
| Enterprise, >50k skills | Hybrid CF Workers + R2 + Netlify + dedicated parser | Decouples concerns, enables high-throughput API, isolates failures | Medium (worker invocations + storage) |
| Internal agent marketplace | Content-only scoring + strict schema validation | Ensures objective evaluation, prevents vendor bias | Low (parser compute only) |
| Public discovery platform | Multi-source crawler + graph traversal + social APIs | Maximizes long-tail coverage, recovers deleted entries | Medium-High (API quotas + compute) |
Configuration Template
```yaml
# skill-registry.config.yaml
discovery:
  sources:
    - type: repository_search
      query_variants: 101
      date_bound_days: 30
      result_cap: 1000
    - type: topic_index
      primary_topic: claude-code-skills
      variants: 31
    - type: graph_traversal
      seed_threshold_stars: 200
      expansion_depth: 1
    - type: archive_index
      provider: wayback_cdx
      include_deleted: true
  rate_limits:
    requests_per_minute: 60
    max_concurrent: 4
    retry_backoff_ms: 2000

parsing:
  frontmatter_schema:
    required: [name, description]
    optional: [model, allowed_tools, user_invokable, version, license]
    metadata_namespace: true
  body_validation:
    min_description_length: 800
    require_code_blocks: true
    anti_trigger_weight: 4
    max_anti_trigger_bonus: 16

scoring:
  model: content_only
  transparency_bonus: 10
  filler_penalty: -5
  normalization_range: [50, 100]
  popularity_signals: []

delivery:
  storage: object_store
  edge_routing: true
  api_format: rest
  pagination_limit: 50
  cache_ttl_seconds: 3600
  export_formats: [json, ndjson, csv, parquet, atom]
```
Quick Start Guide
- Initialize the parser: Clone the registry repository, install dependencies, and run the frontmatter validator against a sample SKILL.md to confirm YAML extraction and schema compliance.
- Configure data sources: Edit the discovery configuration file, set rate limits, and enable the primary repository search and topic index sources. Disable social and archive sources initially to reduce noise.
- Execute the first batch: Run the nightly orchestrator in dry-run mode. Verify that candidates are deduplicated, parsed, and scored without errors. Check the output directory for generated metadata files.
- Deploy the delivery layer: Upload static assets to object storage, configure edge rewrite rules, and spin up the API worker. Validate pagination, single-skill retrieval, and tag filtering against the test dataset.
- Schedule production runs: Set up a cron job or CI pipeline to trigger the orchestrator daily. Monitor logs for schema drift warnings, rate limit hits, and scoring distribution shifts. Enable full source rotation once stability is confirmed.