How I indexed 69,000 Claude Code skills (and what I learned doing it)

Current Situation Analysis

AI agent ecosystems are rapidly adopting markdown-based instruction files to extend base model capabilities. In the Claude Code environment, these are defined as SKILL.md files containing YAML frontmatter and natural language directives. When placed in the designated user directory (~/.claude/skills/<name>/), they become invocable slash commands. The format is lightweight, human-readable, and highly portable. However, the operational reality of managing thousands of these files reveals a critical infrastructure gap: discovery and quality assurance are entirely decentralized.

The industry pain point is not the creation of skills, but their aggregation. Authors publish to public repositories, but there is no centralized registry, standardized search surface, or programmatic API. Developers relying on native platform search or community-curated lists encounter severe fragmentation. The long tail of the ecosystem remains invisible, and quality variance is extreme. A rigorously engineered skill with explicit boundary conditions sits alongside a four-line placeholder with identical discoverability.

This problem is frequently misunderstood because teams assume GitHub's native code search or social aggregation is sufficient. In practice, platform search engines impose hard result caps (typically 1,000 per query), ignore non-repository artifacts like gists or social mentions, and lack semantic understanding of agent instruction formats. Furthermore, the specification evolves rapidly. Frontmatter fields like allowed-tools, user-invokable, and metadata.api_base are added monthly. A parser built against an early draft will silently fail or misclassify newer entries.

Data from large-scale indexing operations confirms the scale of the blind spot. Over 69,000 skill files have been cataloged across public sources, yet fewer than 300 were historically visible in curated lists. The distribution follows a steep Pareto curve: the top 25 contributors account for roughly 30% of all indexed skills. Meanwhile, zero entries originate from major AI vendors, confirming the ecosystem is entirely community-driven. The format is also leaking laterally, appearing in repositories tagged for competing agent frameworks (Cursor, Cline, Aider, Windsurf), which means any registry must treat the file as a portable agent standard rather than a platform-specific artifact.

Without a dedicated indexing layer, teams building agent tooling, evaluation pipelines, or internal marketplaces are forced to scrape, parse, and score manually. This introduces latency, inconsistency, and operational debt. The solution requires a batch-driven discovery engine, a content-only quality model, and a hybrid storage architecture optimized for static asset delivery.

WOW Moment: Key Findings

The most counterintuitive insight from scaling a skill registry is that popularity metrics actively degrade signal quality. When ranking relies on stars, forks, or follower counts, the catalog becomes vulnerable to gaming, vendor bias, and hype cycles. Conversely, a purely structural scoring model surfaces skills that actually prevent agent misbehavior.

Indexing Strategy	Discovery Coverage	Quality Signal Accuracy	Vendor/Influence Bias	Operational Complexity
Popularity-Driven	High (top-heavy)	Low (correlates with marketing, not utility)	High (favors established accounts)	Low (native platform APIs suffice)
Content-Structural	High (long-tail inclusive)	High (measures boundary discipline, transparency)	Zero (ignores author metrics)	Medium-High (requires custom parsing & scoring)
Hybrid (Popularity + Content)	Medium	Medium (dilutes structural signals)	Medium (reintroduces bias)	High (requires complex weighting logic)

This finding matters because it redefines how agent instruction quality should be measured. A skill that explicitly documents when not to trigger, includes pricing/quota transparency, and maintains structured frontmatter will consistently outperform a viral but vague instruction set in production agent workflows. The structural approach also future-proofs the registry: as the spec evolves, the scoring model adapts by weighting new frontmatter keys rather than chasing social metrics. Content-only scoring also supports objective evaluation layers, reliable recommender systems, and trustworthy internal marketplaces without introducing pay-to-rank dynamics.

Core Solution

Building a production-ready skill registry requires separating discovery, validation, scoring, and delivery into distinct pipelines. The architecture prioritizes idempotent batch processing, edge-optimized static delivery, and a scoring engine that ignores all social signals.

Step 1: Multi-Source Discovery Pipeline

A single orchestrator runs nightly, querying 24 distinct data surfaces. Instead of relying on one search endpoint, the pipeline distributes load across:

Repository code search with query variants (language hints, date bounds, frontmatter field filters)
Topic-tagged repositories and gists
Community lists, alternative Git hosts, and dataset platforms
Social and discussion platforms via Algolia or native search APIs
Archive indexes for renamed or deleted repositories
Graph traversal (stargazer enumeration) to surface skills from users who interact with known entries
LLM-assisted query expansion to generate next-cycle search terms based on discovered patterns

Each source is rate-limited and wrapped in isolated execution blocks. A single endpoint failure does not cascade. The pipeline outputs a deduplicated list of candidate repositories.
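
One way to get the per-source isolation and deduplication described above is to wrap every source in its own try/catch and merge the results through a normalizing set. The sketch below is illustrative only: the `Source` interface, `runSources`, and `normalizeRepoUrl` are hypothetical names, rate limiting is left out for brevity, and lowercasing URLs assumes a host such as GitHub where repository paths are case-insensitive.

```typescript
// Hypothetical shape of one discovery source; not part of any real SDK.
interface Source {
  name: string;
  fetchCandidates: () => Promise<string[]>; // resolves to repository URLs
}

interface DiscoveryResult {
  candidates: string[]; // deduplicated, normalized repository keys
  failed: string[];     // names of sources that threw or rejected
}

// Collapse protocol, "www.", trailing slashes, and ".git" so duplicates compare equal.
// Lowercasing is an assumption that holds for case-insensitive hosts such as GitHub.
function normalizeRepoUrl(url: string): string {
  return url
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\//, '')
    .replace(/^www\./, '')
    .replace(/\/+$/, '')
    .replace(/\.git$/, '');
}

function dedupeCandidates(urls: string[]): string[] {
  return Array.from(new Set(urls.map(normalizeRepoUrl)));
}

// Run one source inside its own try/catch so a failure cannot cascade.
async function runIsolated(source: Source): Promise<{ name: string; urls: string[]; ok: boolean }> {
  try {
    return { name: source.name, urls: await source.fetchCandidates(), ok: true };
  } catch (err) {
    return { name: source.name, urls: [], ok: false };
  }
}

async function runSources(sources: Source[]): Promise<DiscoveryResult> {
  const outcomes = await Promise.all(sources.map(runIsolated));
  const urls = outcomes.reduce<string[]>((acc, o) => acc.concat(o.urls), []);
  return {
    candidates: dedupeCandidates(urls),
    failed: outcomes.filter((o) => !o.ok).map((o) => o.name),
  };
}
```

Returning the list of failed sources alongside the candidates lets the nightly run finish and still report which endpoints need attention.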

Step 2: Frontmatter Parsing & Validation

The parser extracts YAML frontmatter and validates against a dynamic schema. It normalizes field names, strips markdown artifacts, and enforces type constraints. Critical fields include name, description, allowed-tools, model, and metadata.*. The parser also extracts structural markers: headings, code blocks, and explicit negative-space sections.
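
As a rough illustration of the normalization step, the sketch below takes an already-parsed frontmatter object, folds key spellings onto one canonical form, keeps unknown keys under a metadata bucket, and checks the two required fields. The function names and the list of first-class fields are assumptions made for this example, not the registry's actual schema.

```typescript
// Fields this sketch treats as first-class; the list is an assumption based on the fields named above.
const KNOWN_FIELDS = new Set(['name', 'description', 'allowed-tools', 'model']);

interface NormalizedFrontmatter {
  known: Record<string, unknown>;
  metadata: Record<string, unknown>;
  warnings: string[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Fold case and snake_case variants onto the hyphenated spelling,
// e.g. "Allowed_Tools" becomes "allowed-tools".
function canonicalKey(key: string): string {
  return key.trim().toLowerCase().replace(/_/g, '-');
}

function normalizeFrontmatter(raw: Record<string, unknown>): NormalizedFrontmatter {
  const known: Record<string, unknown> = {};
  const metadata: Record<string, unknown> = {};
  const warnings: string[] = [];

  for (const rawKey of Object.keys(raw)) {
    const key = canonicalKey(rawKey);
    const value = raw[rawKey];
    if (key === 'metadata' && isRecord(value)) {
      Object.assign(metadata, value); // an explicit metadata block is kept as-is
    } else if (KNOWN_FIELDS.has(key)) {
      known[key] = value;
    } else {
      metadata[key] = value; // unknown keys are preserved, not dropped
    }
  }

  // Minimal type constraints: the two required fields must be non-empty strings.
  for (const field of ['name', 'description']) {
    const value = known[field];
    if (typeof value !== 'string' || value.trim() === '') {
      warnings.push(`missing or non-string "${field}"`);
    }
  }
  return { known, metadata, warnings };
}
```

Collecting warnings instead of throwing keeps a malformed entry from aborting the whole batch; the caller can decide whether to skip or flag it.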

Step 3: Content-Only Quality Scoring

The scoring engine evaluates the instruction file itself. It calculates a weighted score based on:

Anti-trigger discipline: Presence of "out of scope" or "when not to use" sections (+4 per pattern, capped at +16)
Cost transparency: Documentation of API spend, rate limits, or quota expectations (+10)
Frontmatter depth: Number of distinct configuration keys beyond name/description (capped at 10 to prevent padding)
Structural density: Minimum description length, presence of multiple code examples, and hierarchical headings
Filler penalty: Detection of placeholder text, TODO markers, or generic templates (-5)

The final score is normalized to a [50, 100] range for production evaluation layers. No stars, forks, or author metrics influence the result.
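
To make the normalization concrete, here is a small worked sketch that mirrors the clamping formula used in the evaluator code later in this article (raw total divided by 40, scaled to 50 points, offset by 50). The function name `normalizeScore` is illustrative.

```typescript
// Mirrors the evaluator's formula: 50 + (raw / 40) * 50, clamped to [50, 100], then rounded.
function normalizeScore(rawTotal: number): number {
  const scaled = 50 + (rawTotal / 40) * 50;
  return Math.round(Math.max(50, Math.min(100, scaled)));
}

// A file that only trips the filler penalty: raw = -5, scaled = 43.75, clamped up to the floor.
const fillerOnly = normalizeScore(-5); // 50

// A mid-range file: raw = 20, scaled = 75.
const midRange = normalizeScore(20); // 75

// The highest raw total the components allow: 10 (frontmatter) + 16 (anti-trigger)
// + 10 (transparency) + 9 (structure) = 45, scaled = 106.25, clamped down to the ceiling.
const maxedOut = normalizeScore(45); // 100
```

Because the maximum raw total (45) is larger than the divisor (40), every file with a raw total of 40 or more lands on 100, and every file at or below zero lands on 50.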

Step 4: Storage & API Delivery

Per-skill HTML pages and metadata are generated statically. To avoid deploy budget exhaustion at scale, files are stored in object storage (Cloudflare R2) and served via edge rewrites. The API layer runs as lightweight serverless functions (Cloudflare Workers) bound to the same domain, providing paginated listings, single-skill retrieval, category/tag filtering, and aggregate statistics. The entire API surface is ~300 lines of code, with heavy lifting handled by the nightly batch job.
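
The paginated listing endpoint can stay small because the batch job has already produced the data. The sketch below shows only the pagination logic such an endpoint might use. It is a hypothetical helper rather than the actual Worker code, and the default cap of 50 is borrowed from the `pagination_limit` value in the configuration template further down.

```typescript
interface Page<T> {
  items: T[];
  page: number;
  limit: number;
  total: number;
  nextPage: number | null; // null when there is nothing after this page
}

// Clamp untrusted query parameters, then slice the precomputed listing.
function paginate<T>(all: T[], page: number, limit: number, maxLimit = 50): Page<T> {
  const safeLimit = Math.min(Math.max(1, Math.floor(limit) || 1), maxLimit);
  const safePage = Math.max(1, Math.floor(page) || 1);
  const start = (safePage - 1) * safeLimit;
  const items = all.slice(start, start + safeLimit);
  const nextPage = start + safeLimit < all.length ? safePage + 1 : null;
  return { items, page: safePage, limit: safeLimit, total: all.length, nextPage };
}
```

A request handler would parse `page` and `limit` from the query string, call `paginate`, and serialize the result as JSON. Invalid or missing values fall back to page 1 and a limit of 1 rather than producing an error.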

Architecture Rationale

Batch over real-time: Skill files change infrequently. Nightly runs reduce API costs, avoid rate-limit collisions, and allow comprehensive graph traversal.
Content-only scoring: Prevents gaming, ensures objective evaluation, and aligns with actual agent reliability.
Hybrid static/dynamic delivery: Object storage handles scale and cost; edge functions handle routing and API logic; static site generators handle hub pages. Each layer does what it does best.
Orthogonal tagging: Skills are categorized by domain (Engineering, Security, Growth, etc.) and tagged across ~100 dimensions (language, framework, AI provider, integration type). This enables multi-axis filtering without hardcoding taxonomies.
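
Multi-axis filtering over orthogonal tags can be sketched as a simple predicate: match any requested value within a dimension, and require every requested dimension to match. The `TaggedSkill` shape and the dimension names below are assumptions for illustration.

```typescript
interface TaggedSkill {
  slug: string;
  domain: string;                 // single domain classification
  tags: Record<string, string[]>; // dimension name -> assigned values
}

// Requested values per dimension, e.g. { language: ['python'], provider: ['anthropic'] }.
type TagQuery = Record<string, string[]>;

// AND across dimensions, OR within a dimension. An empty query matches everything.
function filterSkills(skills: TaggedSkill[], query: TagQuery): TaggedSkill[] {
  return skills.filter((skill) =>
    Object.keys(query).every((dimension) => {
      const wanted = query[dimension];
      const actual = skill.tags[dimension] ?? [];
      return wanted.length === 0 || wanted.some((value) => actual.includes(value));
    })
  );
}
```

Since dimensions are plain string keys, adding a new axis requires no change to the filter itself.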

Code Example: Skill Parser & Scorer (TypeScript)

import { parse as yamlParse } from 'yaml';
import { createHash } from 'crypto';

interface SkillManifest {
  name: string;
  description: string;
  'allowed-tools'?: string[]; // the YAML key is hyphenated, so the property name must be quoted
  model?: string;
  metadata?: Record<string, unknown>;
  [key: string]: unknown;
}

interface ScoringResult {
  totalScore: number;
  breakdown: Record<string, number>;
  slug: string;
}

export class AgentSkillEvaluator {
  private readonly MAX_FRONTMATTER_KEYS = 10;
  private readonly FILLER_PENALTY = -5;
  private readonly ANTI_TRIGGER_BONUS = 4;
  private readonly MAX_ANTI_TRIGGER_BONUS = 16;
  private readonly TRANSPARENCY_BONUS = 10;

  public evaluate(rawMarkdown: string): ScoringResult {
    const { frontmatter, body } = this.extractFrontmatter(rawMarkdown);
    const breakdown: Record<string, number> = {};

    breakdown.frontmatterDepth = this.scoreFrontmatterDepth(frontmatter);
    breakdown.antiTrigger = this.scoreAntiTriggerSections(body);
    breakdown.transparency = this.scoreCostTransparency(body);
    breakdown.structure = this.scoreStructuralDensity(body);
    breakdown.fillerPenalty = this.detectFillerPhrases(body) ? this.FILLER_PENALTY : 0;

    const rawTotal = Object.values(breakdown).reduce((sum, val) => sum + val, 0);
    const normalizedTotal = Math.max(50, Math.min(100, 50 + (rawTotal / 40) * 50));

    return {
      totalScore: Math.round(normalizedTotal),
      breakdown,
      slug: this.generateSlug(frontmatter.name)
    };
  }

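  private extractFrontmatter(content: string): { frontmatter: SkillManifest; body: string } {
    // Tolerate a leading BOM, CRLF line endings, and a file that ends right after the closing delimiter.
    const match = content.replace(/^\uFEFF/, '').match(/^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n([\s\S]*))?$/);
    if (!match) throw new Error('Invalid skill format: missing YAML frontmatter');
    const parsed: unknown = yamlParse(match[1]);
    // Frontmatter that is empty, a scalar, or a list cannot be a manifest; reject it early.
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
      throw new Error('Invalid skill format: frontmatter must be a YAML mapping');
    }
    const frontmatter = parsed as SkillManifest;
    if (typeof frontmatter.name !== 'string' || frontmatter.name.trim() === '') {
      throw new Error('Invalid skill format: "name" must be a non-empty string');
    }
    return { frontmatter, body: match[2] ?? '' };
  }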

  private scoreFrontmatterDepth(fm: SkillManifest): number {
    const keys = Object.keys(fm).filter(k => !['name', 'description'].includes(k));
    return Math.min(keys.length, this.MAX_FRONTMATTER_KEYS);
  }

  private scoreAntiTriggerSections(body: string): number {
    const patterns = /(?:when\s+not\s+to\s+use|out\s+of\s+scope|negative\s+space|avoid\s+trigger)/gi;
    const matches = body.match(patterns);
    const count = matches ? matches.length : 0;
    return Math.min(count * this.ANTI_TRIGGER_BONUS, this.MAX_ANTI_TRIGGER_BONUS);
  }

  private scoreCostTransparency(body: string): number {
    const costMarkers = /(?:rate\s+limit|api\s+spend|quota|pricing|cost\s+estimate|token\s+budget)/i;
    return costMarkers.test(body) ? this.TRANSPARENCY_BONUS : 0;
  }

  private scoreStructuralDensity(body: string): number {
    const hasHeadings = /^#{1,3}\s+.+$/m.test(body);
    const codeBlocks = Math.floor((body.match(/```/g) || []).length / 2); // count complete fence pairs only
    const descLength = body.length;
    return (hasHeadings ? 2 : 0) + Math.min(codeBlocks, 4) + (descLength > 800 ? 3 : 0);
  }

  private detectFillerPhrases(body: string): boolean {
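    // Word boundaries keep words such as "drafting" or "placeholders" from triggering the penalty.
    const fillers = /(?:\btodo:|lorem\s+ipsum|\bplaceholder\b|example\s+only|\bdraft\b)/i;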
    return fillers.test(body);
  }

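  private generateSlug(name: string): string {
    const base = name.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/(^-|-$)/g, '');
    // Hash the original name instead of a timestamp so repeated nightly runs produce the same slug.
    // Callers that must distinguish same-named skills should mix in a stable identifier such as the source URL.
    const hash = createHash('sha256').update(name).digest('hex').slice(0, 6);
    return `${base}-${hash}`;
  }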
}

Pitfall Guide

1. Relying Solely on Platform Code Search

Explanation: Native repository search engines enforce hard result caps (usually 1,000 per query) and ignore non-repository artifacts. This blinds the indexer to gists, social mentions, and alternative hosts. Fix: Distribute discovery across 20+ sources. Use query variants with date bounds and field filters to bypass caps. Include archive indexes and graph traversal to recover deleted or renamed entries.
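
A common way to work around a per-query result cap is to split one broad query into date windows so that each window stays under the cap. The sketch below generates such windows. The helper names are hypothetical, and the `pushed:` range qualifier follows GitHub repository search syntax, which is an assumption about the endpoint in use; other hosts will need a different qualifier.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Format a timestamp as YYYY-MM-DD in UTC.
function isoDay(ms: number): string {
  return new Date(ms).toISOString().slice(0, 10);
}

// Split the inclusive range [start, end] (both YYYY-MM-DD) into consecutive
// windows of at most `days` days each.
function dateWindows(start: string, end: string, days: number): Array<[string, string]> {
  if (days < 1) throw new RangeError('days must be at least 1');
  const windows: Array<[string, string]> = [];
  const endMs = Date.parse(`${end}T00:00:00Z`);
  let cursor = Date.parse(`${start}T00:00:00Z`);
  while (cursor <= endMs) {
    const windowEnd = Math.min(cursor + (days - 1) * DAY_MS, endMs);
    windows.push([isoDay(cursor), isoDay(windowEnd)]);
    cursor = windowEnd + DAY_MS;
  }
  return windows;
}

// Attach one window to each copy of the base query.
// The qualifier name is an assumption; adjust it for the search API being called.
function windowedQueries(base: string, windows: Array<[string, string]>): string[] {
  return windows.map(([from, to]) => `${base} pushed:${from}..${to}`);
}
```

If a single window still returns the maximum number of results, it can be split again with a smaller `days` value until each window fits under the cap.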

2. Weighting Popularity Metrics in Ranking

Explanation: Stars, forks, and follower counts correlate with marketing reach, not agent reliability. Introducing these signals invites gaming and vendor bias, and it collapses trust in evaluation layers. Fix: Enforce a strict content-only scoring model. If a proposed ranking change could be influenced by payment or social manipulation, reject it. Normalize scores to a fixed range to maintain consistency.

3. Ignoring Frontmatter Spec Drift

Explanation: Agent instruction formats evolve monthly. New fields like allowed-tools or metadata.api_base appear without deprecation cycles. Hardcoded parsers break silently or misclassify entries. Fix: Implement a dynamic schema validator that accepts unknown keys under a metadata namespace. Log schema version mismatches and trigger re-parsing when spec updates are detected.
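
Spec drift can be detected from the data itself by counting frontmatter keys the schema does not yet know about across a whole batch. The following is a minimal sketch under stated assumptions: the `SCHEMA_KEYS` list, the function names, and the 1% default threshold are all illustrative, not values taken from the registry.

```typescript
// Keys the current schema already understands (illustrative list).
const SCHEMA_KEYS = new Set([
  'name', 'description', 'model', 'allowed-tools', 'user-invokable', 'metadata',
]);

// Count how many skills in the batch use each key the schema does not know.
function tallyUnknownKeys(frontmatters: Array<Record<string, unknown>>): Map<string, number> {
  const counts = new Map<string, number>();
  for (const fm of frontmatters) {
    for (const key of Object.keys(fm)) {
      if (SCHEMA_KEYS.has(key)) continue;
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return counts;
}

// Return unknown keys used by at least `minShare` of the batch (and by at least one skill),
// sorted alphabetically. These are candidates for a schema update and a re-parse.
function driftCandidates(frontmatters: Array<Record<string, unknown>>, minShare = 0.01): string[] {
  const counts = tallyUnknownKeys(frontmatters);
  const threshold = Math.max(1, Math.ceil(frontmatters.length * minShare));
  return Array.from(counts.entries())
    .filter(([, count]) => count >= threshold)
    .map(([key]) => key)
    .sort();
}
```

A share-based threshold separates a field that is spreading through the ecosystem from a one-off typo in a single file.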

4. Monolithic Static Site Deployment at Scale

Explanation: Generating tens of thousands of per-skill HTML pages in a single build pipeline exhausts deploy budgets, causes timeout failures, and slows iteration. Fix: Decouple generation from delivery. Store static assets in object storage, serve via edge rewrites, and keep the site generator focused on hub pages and navigation. Use CDN caching for API responses.

5. Missing Negative-Space Validation

Explanation: Skills without explicit "out of scope" or "when not to use" sections cause agents to trigger inappropriately, leading to hallucination or wasted API spend. Fix: Treat anti-trigger sections as a primary quality signal. Require or heavily weight negative-space documentation in the scoring engine. Flag skills lacking boundary conditions for manual review.

6. Over-Fetching External Social APIs

Explanation: Scraping discussion platforms, social feeds, and comment threads without strict rate limiting or noise filtering consumes budget and returns low-signal URLs. Fix: Use targeted search APIs (e.g., Algolia) with URL extraction patterns. Cache results, deduplicate aggressively, and apply a relevance threshold before adding candidates to the pipeline.
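
The extraction-plus-threshold idea could look roughly like this: only pull repository URLs out of a post when the text mentions at least one skill-related term. The signal terms, the threshold, and the GitHub-only URL pattern are assumptions chosen for the example.

```typescript
// Matches GitHub repository URLs and captures owner and repository name.
const REPO_URL = /https?:\/\/github\.com\/([A-Za-z0-9-]+)\/([A-Za-z0-9_.-]+)/g;

// Terms that suggest the text is about agent skills (illustrative list).
const SIGNAL_TERMS = ['skill.md', 'claude code', '.claude/skills'];

// Number of distinct signal terms that appear in the text.
function relevance(text: string): number {
  const lower = text.toLowerCase();
  return SIGNAL_TERMS.filter((term) => lower.includes(term)).length;
}

// Return unique "owner/repo" keys, or nothing when the text is below the threshold.
function extractCandidates(text: string, minRelevance = 1): string[] {
  if (relevance(text) < minRelevance) return [];
  const found = new Set<string>();
  REPO_URL.lastIndex = 0; // the regex is global, so reset its state before each scan
  let match: RegExpExecArray | null;
  while ((match = REPO_URL.exec(text)) !== null) {
    // Drop sentence punctuation and a ".git" suffix that the pattern may have captured.
    const repo = match[2].replace(/\.+$/, '').replace(/\.git$/, '');
    if (repo) found.add(`${match[1]}/${repo}`.toLowerCase());
  }
  return Array.from(found);
}
```

Filtering before extraction keeps unrelated links out of the candidate list, which saves the parsing budget for pages that are likely to contain skills.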

7. Hardcoding Category Taxonomies

Explanation: Relying on fixed categories (e.g., "Engineering", "Security") fails to capture cross-domain skills and becomes outdated as new use cases emerge. Fix: Use orthogonal tagging across ~100 dimensions. Separate domain classification from technical tagging. Allow multi-label assignment and generate dynamic hub pages from tag combinations.

Production Bundle

Action Checklist

Deploy a nightly batch orchestrator with isolated execution blocks per data source
Implement a dynamic YAML frontmatter parser with metadata namespace fallback
Build a content-only scoring engine that ignores all social/popularity signals
Store per-skill static assets in object storage and route via edge rewrites
Expose a paginated, CORS-open REST API with OpenAPI 3.1 documentation
Generate orthogonal tags across language, framework, provider, and integration type
Archive daily snapshots in multiple formats (JSON, NDJSON, CSV, Parquet, Atom)
Monitor schema drift and trigger re-indexing when frontmatter specifications change
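
For the snapshot item, two of the listed formats are simple enough to sketch without a library. The record shape, the function names, and the `|` separator used to flatten tags into one CSV cell are assumptions for this example.

```typescript
interface SkillRecord {
  slug: string;
  name: string;
  score: number;
  tags: string[];
}

// NDJSON: one JSON document per line, newline-terminated.
function toNdjson(records: SkillRecord[]): string {
  return records.map((record) => JSON.stringify(record)).join('\n') + (records.length ? '\n' : '');
}

// Quote a CSV cell only when it contains a quote, comma, or line break; double any inner quotes.
function csvCell(value: string | number): string {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// CSV with a header row and CRLF line endings.
function toCsv(records: SkillRecord[]): string {
  const header = ['slug', 'name', 'score', 'tags'].join(',');
  const rows = records.map((r) =>
    [r.slug, r.name, r.score, r.tags.join('|')].map(csvCell).join(',')
  );
  return [header, ...rows].join('\r\n') + '\r\n';
}
```

NDJSON suits streaming consumers that process one record at a time, while CSV is convenient for spreadsheets. Parquet and Atom would normally be produced with dedicated libraries.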

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Small team, <5k skills	Single-repo static generator + GitHub Pages	Low operational overhead, sufficient for limited scale	Minimal (free tier)
Medium team, 5k–50k skills	Object storage + edge rewrites + nightly batch	Prevents deploy budget exhaustion, scales horizontally	Low-Medium (storage + egress)
Enterprise, >50k skills	Hybrid CF Workers + R2 + Netlify + dedicated parser	Decouples concerns, enables high-throughput API, isolates failures	Medium (worker invocations + storage)
Internal agent marketplace	Content-only scoring + strict schema validation	Ensures objective evaluation, prevents vendor bias	Low (parser compute only)
Public discovery platform	Multi-source crawler + graph traversal + social APIs	Maximizes long-tail coverage, recovers deleted entries	Medium-High (API quotas + compute)

Configuration Template

# skill-registry.config.yaml
discovery:
  sources:
    - type: repository_search
      query_variants: 101
      date_bound_days: 30
      result_cap: 1000
    - type: topic_index
      primary_topic: claude-code-skills
      variants: 31
    - type: graph_traversal
      seed_threshold_stars: 200
      expansion_depth: 1
    - type: archive_index
      provider: wayback_cdx
      include_deleted: true
  rate_limits:
    requests_per_minute: 60
    max_concurrent: 4
    retry_backoff_ms: 2000

parsing:
  frontmatter_schema:
    required: [name, description]
    optional: [model, allowed-tools, user-invokable, version, license]
    metadata_namespace: true
  body_validation:
    min_description_length: 800
    require_code_blocks: true
    anti_trigger_weight: 4
    max_anti_trigger_bonus: 16

scoring:
  model: content_only
  transparency_bonus: 10
  filler_penalty: -5
  normalization_range: [50, 100]
  popularity_signals: []

delivery:
  storage: object_store
  edge_routing: true
  api_format: rest
  pagination_limit: 50
  cache_ttl_seconds: 3600
  export_formats: [json, ndjson, csv, parquet, atom]
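
As an illustration of how the `rate_limits` values above might be consumed, the sketch below derives a minimum spacing between requests and a retry delay. The exponential growth and the 60-second cap are assumptions added for the example; the template itself only specifies a base backoff of 2000 ms.

```typescript
// Camel-cased mirror of the rate_limits block in the template.
interface RateLimits {
  requestsPerMinute: number;
  maxConcurrent: number;
  retryBackoffMs: number;
}

// Minimum spacing between request starts so the per-minute budget is not exceeded.
function minIntervalMs(limits: RateLimits): number {
  return Math.ceil(60000 / limits.requestsPerMinute);
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at capMs.
// Doubling and the cap are assumptions, not values from the template.
function backoffDelayMs(limits: RateLimits, attempt: number, capMs = 60000): number {
  return Math.min(limits.retryBackoffMs * Math.pow(2, attempt), capMs);
}

const templateLimits: RateLimits = { requestsPerMinute: 60, maxConcurrent: 4, retryBackoffMs: 2000 };
```

With the template values, requests are spaced at least one second apart and the first retry waits two seconds.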

Quick Start Guide

1. Initialize the parser: Clone the registry repository, install dependencies, and run the frontmatter validator against a sample SKILL.md to confirm YAML extraction and schema compliance.
2. Configure data sources: Edit the discovery configuration file, set rate limits, and enable the primary repository search and topic index sources. Disable social and archive sources initially to reduce noise.
3. Execute the first batch: Run the nightly orchestrator in dry-run mode. Verify that candidates are deduplicated, parsed, and scored without errors. Check the output directory for generated metadata files.
4. Deploy the delivery layer: Upload static assets to object storage, configure edge rewrite rules, and spin up the API worker. Validate pagination, single-skill retrieval, and tag filtering against the test dataset.
5. Schedule production runs: Set up a cron job or CI pipeline to trigger the orchestrator daily. Monitor logs for schema drift warnings, rate limit hits, and scoring distribution shifts. Enable full source rotation once stability is confirmed.
