Anna's Archive llms.txt: a routing guide for LLM crawlers

By Codcompass Team · 2026-05-24 · 8 min read

Current Situation Analysis

Machine learning teams face a persistent infrastructure bottleneck: acquiring high-quality, large-scale training data without triggering anti-bot defenses, inflating proxy budgets, or violating data usage policies. The default approach remains adversarial web scraping. Teams deploy headless browsers, rotate residential proxies, and implement CAPTCHA-solving services to extract content from public-facing interfaces. This strategy works for small datasets but collapses under the weight of corpus-scale ingestion.

The misconception lies in treating CAPTCHA evasion as a purely technical challenge. In reality, it is an economic and operational liability. At the scale of tens of millions of documents, proxy rotation, solver APIs, and retry logic consume thousands of dollars monthly while introducing unpredictable latency and data fragmentation. Simultaneously, the legal posture remains unchanged: bypassing access controls does not alter copyright status or shield organizations from discovery requests.

The industry is shifting toward explicit routing protocols. The llms.txt convention, initially designed for documentation sites to serve LLM-friendly content, has evolved into a machine-readable routing contract. Sites hosting massive corpora now publish structured directives that redirect AI crawlers away from interactive UIs and toward bulk ingestion channels. This transition reduces infrastructure friction, standardizes data acquisition, and creates transparent compliance boundaries.

Recent deployments illustrate the economics. Platforms hosting over 64 million books and 95 million academic papers now document explicit bulk endpoints, a signal that adversarial scraping is both unnecessary and cost-inefficient. The convention, as deployed there, points to four distinct ingestion pathways: version-controlled repository mirrors, torrent-based metadata catalogs, programmatic JSON APIs, and enterprise-grade SFTP access. Teams that use these routes drop proxy and solver spend from the budget, which lowers acquisition cost, raises throughput, and simplifies pipeline architecture.

WOW Moment: Key Findings

The operational shift from adversarial scraping to bulk routing fundamentally changes pipeline economics and reliability. The following comparison illustrates the structural advantages of honoring explicit routing directives versus maintaining traditional scraping infrastructure.

Approach	Monthly Infrastructure Cost	Data Throughput	Pipeline Reliability	Compliance Posture
Adversarial Scraping	$2,500–$8,000 (proxies, solvers, retries)	50–200 docs/min	Low (CAPTCHA rotation, IP bans)	Weak (high risk: access control bypass)
Bulk JSON/Torrent API	$150–$600 (compute, storage, bandwidth)	5,000–15,000 docs/min	High (deterministic endpoints)	Moderate (explicit routing, legal review required)
Enterprise SFTP Tier	$10,000–$25,000 (refundable via contribution)	50,000+ docs/min	Very High (dedicated channels)	Strong (documented acquisition trail)

This finding matters because it decouples data acquisition from infrastructure warfare. Organizations can redirect budget from proxy management to data validation, deduplication, and pipeline orchestration. The routing contract also establishes a verifiable acquisition trail, which simplifies internal compliance audits and reduces exposure during legal discovery. Most importantly, it transforms data ingestion from an adversarial process into a cooperative engineering workflow.
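To make the throughput column concrete, the arithmetic below converts a docs-per-minute rate into wall-clock ingestion time. It is a minimal sketch: the function name `ingestionHours` and the 10-million-document corpus are illustrative assumptions, while the two rates are taken from the table.

```typescript
// Hours needed to ingest `docCount` documents at a sustained rate.
// Assumes the rate holds constant, which real pipelines rarely achieve.
function ingestionHours(docCount: number, docsPerMinute: number): number {
  if (docsPerMinute <= 0) throw new RangeError('docsPerMinute must be positive');
  return docCount / docsPerMinute / 60;
}

const corpusSize = 10_000_000; // hypothetical corpus size

// Rates from the comparison table
const scrapingHours = ingestionHours(corpusSize, 200);  // best case of 50 to 200 docs/min
const bulkApiHours = ingestionHours(corpusSize, 5_000); // worst case of 5,000 to 15,000 docs/min

console.log(`scraping: ${(scrapingHours / 24).toFixed(1)} days`);
console.log(`bulk API: ${bulkApiHours.toFixed(1)} hours`);
```

Under these assumptions the same corpus takes roughly 34.7 days through the fastest scraping rate and roughly 33.3 hours through the slowest bulk API rate.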

Core Solution

Building a production-grade LLM data pipeline requires shifting from reactive scraping to proactive routing. The architecture below demonstrates how to parse routing directives, authenticate against bulk endpoints, and ingest metadata and content with deterministic reliability.

Step 1: Parse Routing Directives

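The pipeline begins by fetching and parsing the llms.txt file. The parser below assumes the file carries structured, pipe-delimited hints about preferred ingestion channels, rate limits, and endpoint paths; the llms.txt proposal itself describes a Markdown file, so adapt the parsing to what the target site actually publishes. Instead of hardcoding URLs, the system resolves routing instructions dynamically.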

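// The pipe-delimited layout parsed below
// (endpoint | protocol | auth | rateLimit | description) is an assumption of
// this example; a real llms.txt file may use a different layout.
type Protocol = 'json' | 'torrent' | 'sftp';

interface LlmRoutingHint {
  endpoint: string;
  protocol: Protocol;
  authRequired: boolean;
  rateLimit?: number;
  description: string;
}

const PROTOCOLS: readonly Protocol[] = ['json', 'torrent', 'sftp'];

class LlmDataRouter {
  private hints: LlmRoutingHint[] = [];

  async initialize(sourceUrl: string): Promise<void> {
    // Strip trailing slashes so the request path never contains "//"
    const response = await fetch(`${sourceUrl.replace(/\/+$/, '')}/llms.txt`);
    if (!response.ok) throw new Error(`llms.txt fetch failed: ${response.status}`);
    this.hints = this.parseDirectives(await response.text());
  }

  private parseDirectives(raw: string): LlmRoutingHint[] {
    const hints: LlmRoutingHint[] = [];
    for (const line of raw.split(/\r?\n/)) {
      const [endpoint, protocol, auth, limit, ...descParts] = line.split('|').map(s => s.trim());
      // Skip blank lines and anything that is not a recognized directive
      if (!endpoint || !PROTOCOLS.includes(protocol as Protocol)) continue;
      const parsedLimit = Number.parseInt(limit ?? '', 10);
      hints.push({
        endpoint,
        protocol: protocol as Protocol,
        authRequired: auth === 'true',
        rateLimit: Number.isNaN(parsedLimit) ? undefined : parsedLimit,
        description: descParts.join(' | ') // restore any '|' inside the description
      });
    }
    return hints;
  }

  getPreferredChannel(preferredProtocol: Protocol): LlmRoutingHint | undefined {
    return this.hints.find(h => h.protocol === preferredProtocol);
  }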
}
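The pipe-delimited layout in the router above is this article's working assumption. The llms.txt proposal describes a Markdown file: an H1 title, an optional blockquote summary, and H2 sections holding link lists of the form `- [name](url): notes`. The sketch below extracts those links grouped by section. The names `LlmsTxtLink` and `extractLinks` are hypothetical, and the sample text is invented for illustration.

```typescript
// A link entry found under an H2 section of a Markdown llms.txt file.
interface LlmsTxtLink {
  section: string;
  name: string;
  url: string;
  notes?: string;
}

// Matches list items shaped like "- [name](url): optional notes"
const LINK_ITEM = /^\s*[-*]\s*\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$/;

function extractLinks(markdown: string): LlmsTxtLink[] {
  const links: LlmsTxtLink[] = [];
  let section = '';
  for (const line of markdown.split(/\r?\n/)) {
    const heading = /^##\s+(.*)$/.exec(line);
    if (heading) {
      section = heading[1].trim(); // remember the current H2 section
      continue;
    }
    const item = LINK_ITEM.exec(line);
    if (item) {
      const notes = item[3]?.trim();
      links.push({ section, name: item[1], url: item[2], ...(notes ? { notes } : {}) });
    }
  }
  return links;
}

// Invented sample; not the content of any real site's llms.txt
const sampleLlmsTxt = [
  '# Example Archive',
  '> Bulk access notes for automated clients.',
  '## Bulk data',
  '- [Metadata API](https://example-archive.org/api/catalog.json): JSON catalog',
  '- [Torrents](https://example-archive.org/torrents)',
].join('\n');

const foundLinks = extractLinks(sampleLlmsTxt);
console.log(foundLinks.map(l => `${l.section} -> ${l.name}`));
```

Mapping extracted links onto routing hints (which link is the JSON API, which is the torrent index) still needs site-specific rules, since the proposal does not define protocol or rate-limit fields.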

Step 2: Stream Metadata via Programmatic API

Once routing is resolved, the pipeline queries the bulk JSON endpoint. This avoids HTML parsing entirely and returns structured metadata suitable for downstream validation.

interface CatalogEntry {
  id: string;
  title: string;
  format: string;
  sizeBytes: number;
  torrentHash: string;
  metadataVersion: string;
}

class BulkMetadataFetcher {
  private router: LlmDataRouter;
  private concurrency: number;

  constructor(router: LlmDataRouter, concurrency: number = 4) {
    this.router = router;
    this.concurrency = concurrency;
  }

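  // Async generator: consume with `for await (const entry of fetcher.streamCatalog())`.
  async *streamCatalog(): AsyncGenerator<CatalogEntry> {
    const channel = this.router.getPreferredChannel('json');
    if (!channel) throw new Error('No JSON routing hint available');

    const response = await fetch(channel.endpoint, {
      headers: { 'Accept': 'application/json' }
    });

    if (!response.ok) throw new Error(`Metadata fetch failed: ${response.status}`);

    // Note: response.json() buffers the whole payload in memory; only the
    // hand-off to downstream consumers is incremental.
    const data = await response.json();
    if (!Array.isArray(data?.entries)) throw new Error('Unexpected catalog payload: missing entries array');
    yield* this.batchYield(data.entries, this.concurrency);
  }

  // Yields entries in groups of `batchSize`, pausing between groups so
  // downstream stages are not flooded.
  private async *batchYield(entries: CatalogEntry[], batchSize: number): AsyncGenerator<CatalogEntry> {
    for (let i = 0; i < entries.length; i += batchSize) {
      yield* entries.slice(i, i + batchSize);
      await new Promise(res => setTimeout(res, 100)); // Pace delivery between batches
    }
  }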
}
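The fetcher above calls `response.json()`, which holds the full catalog in memory before anything is yielded. If an endpoint offers newline-delimited JSON (an assumption; the source does not specify a wire format), records can be parsed as chunks arrive. The sketch below shows the line-splitting part; `NdjsonSplitter` and `MiniEntry` are hypothetical names.

```typescript
// Incrementally splits newline-delimited JSON into parsed records.
// Feed it decoded text chunks as they arrive; it keeps any trailing partial
// line until the next chunk completes it.
class NdjsonSplitter<T> {
  private remainder = '';

  push(chunk: string): T[] {
    const lines = (this.remainder + chunk).split('\n');
    // The last element is either '' (chunk ended on a newline) or a partial line
    this.remainder = lines.pop() ?? '';
    return lines.filter(l => l.trim().length > 0).map(l => JSON.parse(l) as T);
  }

  // Call once after the final chunk to flush a last line that has no newline
  flush(): T[] {
    const rest = this.remainder.trim();
    this.remainder = '';
    return rest.length > 0 ? [JSON.parse(rest) as T] : [];
  }
}

interface MiniEntry { id: string; sizeBytes: number }

// The second record is deliberately split across two chunks
const splitter = new NdjsonSplitter<MiniEntry>();
const firstBatch = splitter.push('{"id":"a","sizeBytes":1}\n{"id":"b",');
const secondBatch = splitter.push('"sizeBytes":2}\n{"id":"c","sizeBytes":3}');
const finalBatch = splitter.flush();
console.log(firstBatch.length, secondBatch.length, finalBatch.length);
```

In a fetch-based pipeline the chunks would typically come from reading `response.body` and decoding each piece with `TextDecoder` in streaming mode, so that multi-byte characters split across chunks are handled correctly.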

Step 3: Validate Compliance & Deduplicate

Bulk ingestion does not eliminate legal or data quality responsibilities. The pipeline must verify metadata against internal compliance rules and apply content hashing to prevent duplicate ingestion.

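import { createHash } from 'node:crypto';

class ComplianceValidator {
  private blockedPatterns: RegExp[];
  private seenHashes: Set<string>;

  constructor(blockedPatterns: string[]) {
    // Patterns are matched case-insensitively against entry titles
    this.blockedPatterns = blockedPatterns.map(p => new RegExp(p, 'i'));
    this.seenHashes = new Set();
  }

  validate(entry: CatalogEntry): { valid: boolean; reason?: string } {
    const titleMatch = this.blockedPatterns.some(rx => rx.test(entry.title));
    if (titleMatch) return { valid: false, reason: 'Blocked by title pattern' };

    const contentHash = this.computeHash(entry);
    if (this.seenHashes.has(contentHash)) {
      return { valid: false, reason: 'Duplicate detected' };
    }

    this.seenHashes.add(contentHash);
    return { valid: true };
  }

  // Fingerprints the metadata fields that identify an edition. The id is left
  // out so a re-upload of the same file under a new id still collides. A real
  // digest replaces the earlier truncated base64 string, which encoded only
  // the first 12 bytes of the payload and so collided on shared prefixes.
  private computeHash(entry: CatalogEntry): string {
    const payload = `${entry.title.trim().toLowerCase()}|${entry.format.toLowerCase()}|${entry.sizeBytes}`;
    return createHash('sha256').update(payload).digest('hex');
  }
}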

Architecture Decisions & Rationale

Dynamic Routing over Hardcoded URLs: llms.txt directives change as platforms update infrastructure. Parsing them at runtime prevents pipeline breakage during endpoint migrations.
Async Iterators for Memory Efficiency: Streaming metadata instead of loading entire catalogs into memory reduces RAM pressure and enables continuous ingestion across multi-terabyte datasets.
Explicit Compliance Layer: Routing directives provide operational guidance, not legal immunity. A dedicated validation step ensures copyright filters, internal policies, and deduplication rules execute before data enters training storage.
Rate Limit Awareness: Bulk endpoints often lack explicit rate-limit headers. Implementing client-side throttling prevents accidental overload and maintains cooperative access agreements.

Pitfall Guide

1. Treating `llms.txt` as a Legal License

Explanation: The file is a routing contract, not a copyright waiver. Platforms publish it to reduce scraping overhead, not to grant training rights. Fix: Maintain a separate legal review pipeline. Treat llms.txt as an operational directive, not a compliance shield.

2. Underestimating CAPTCHA Bypass Economics

Explanation: Teams often calculate proxy costs in isolation, ignoring solver API fees, retry latency, and data fragmentation. At scale, these compound into unsustainable monthly spend. Fix: Model total acquisition cost (TAC) including infrastructure, engineering time, and data loss. Compare TAC against bulk endpoint pricing before committing to scraping.
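A total acquisition cost model can be as small as one function. The sketch below folds infrastructure spend, engineering time, and data loss into a cost per usable document. The field names and every number in the example are hypothetical placeholders, not measurements.

```typescript
// Inputs for a monthly total-acquisition-cost estimate. All fields are
// illustrative; substitute your own measured figures.
interface AcquisitionCosts {
  infrastructureUsd: number; // proxies, solvers, compute, bandwidth
  engineeringHours: number;  // time spent maintaining the pipeline
  hourlyRateUsd: number;     // loaded engineering cost per hour
  docsAttempted: number;     // documents requested in the month
  lossRate: number;          // fraction lost to bans, retries, fragmentation, in [0, 1)
}

function costPerUsableDoc(c: AcquisitionCosts): number {
  if (c.lossRate < 0 || c.lossRate >= 1) throw new RangeError('lossRate must be in [0, 1)');
  const totalUsd = c.infrastructureUsd + c.engineeringHours * c.hourlyRateUsd;
  const usableDocs = c.docsAttempted * (1 - c.lossRate);
  if (usableDocs <= 0) throw new RangeError('no usable documents');
  return totalUsd / usableDocs;
}

// Hypothetical scenarios for comparison
const scrapingScenario: AcquisitionCosts = {
  infrastructureUsd: 5_000, engineeringHours: 40, hourlyRateUsd: 100,
  docsAttempted: 1_000_000, lossRate: 0.25,
};
const bulkScenario: AcquisitionCosts = {
  infrastructureUsd: 400, engineeringHours: 10, hourlyRateUsd: 100,
  docsAttempted: 1_000_000, lossRate: 0,
};

console.log('scraping $/doc:', costPerUsableDoc(scrapingScenario).toFixed(4));
console.log('bulk $/doc:', costPerUsableDoc(bulkScenario).toFixed(4));
```

Dividing by usable documents rather than attempted documents is the point of the model: data loss raises the effective unit cost even when the invoice stays the same.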

3. Ingesting Metadata as Training Content

Explanation: Catalog entries contain titles, formats, and hashes. Feeding these directly into training pipelines produces low-signal noise and corrupts embedding spaces. Fix: Implement a strict separation between metadata ingestion (routing, validation, deduplication) and content retrieval (file download, text extraction, chunking).

4. Skipping Deduplication at Scale

Explanation: Bulk mirrors often contain overlapping editions, re-uploads, and regional variants. Without content hashing, training sets inflate with redundant samples. Fix: Compute deterministic hashes across title, size, and format fields. Maintain a persistent hash registry (Redis, S3-backed index) to reject duplicates before storage.
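The fix above can be sketched as a normalized key plus a registry contract. `HashRegistry`, `InMemoryRegistry`, `dedupKey`, and `filterNew` are hypothetical names; the in-memory class stands in for a persistent backend, and the key is left as a plain string where production code would likely run it through a cryptographic hash.

```typescript
// Minimal registry contract. A production version might back this with Redis
// or an object-store index; the in-memory class here is only for illustration.
interface HashRegistry {
  // Returns true if the key was newly added, false if it was already present
  addIfAbsent(key: string): boolean;
}

class InMemoryRegistry implements HashRegistry {
  private seen = new Set<string>();

  addIfAbsent(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

interface EditionFields { title: string; format: string; sizeBytes: number }

// Normalizes fields so trivial differences (case, spacing) do not defeat matching
function dedupKey(e: EditionFields): string {
  const title = e.title.trim().toLowerCase().replace(/\s+/g, ' ');
  return `${title}|${e.format.trim().toLowerCase()}|${e.sizeBytes}`;
}

// Keeps only entries whose key has not been registered before
function filterNew<T extends EditionFields>(entries: T[], registry: HashRegistry): T[] {
  return entries.filter(e => registry.addIfAbsent(dedupKey(e)));
}

const registry = new InMemoryRegistry();
const kept = filterNew(
  [
    { title: 'Moby  Dick', format: 'EPUB', sizeBytes: 100 },
    { title: 'moby dick', format: 'epub', sizeBytes: 100 }, // same edition, different casing
    { title: 'Moby Dick', format: 'pdf', sizeBytes: 250 },  // different format, kept
  ],
  registry,
);
console.log(kept.length);
```

Metadata keys only catch duplicates that share title, format, and size. Re-encoded files with different sizes need a hash of the file contents, computed after download.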

5. Misconfiguring SFTP/Enterprise Access

Explanation: Enterprise tiers require credential rotation, connection pooling, and retry logic. Naive implementations fail under network instability or authentication timeouts. Fix: Use established SFTP clients with exponential backoff, connection health checks, and automated credential rotation. Log transfer metrics for capacity planning.
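Exponential backoff is independent of any particular SFTP client, so it can be sketched as a generic retry wrapper. The function names, the 500 ms base, the 30-second cap, and the five-attempt default below are all illustrative assumptions; the operation passed in would be whatever transfer call your SFTP library exposes.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retries an async operation with exponential backoff between failures.
// `sleep` is injectable so tests can run without real waiting.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = ms => new Promise<void>(res => setTimeout(res, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final failed attempt
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

// Delay schedule for the first four retries under the default settings
console.log([0, 1, 2, 3].map(a => backoffDelayMs(a)));
```

A production wrapper would usually distinguish retryable failures (timeouts, dropped connections) from permanent ones (bad credentials) and stop early on the latter.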

6. Ignoring Implicit Rate Limits

Explanation: Bulk APIs rarely publish strict rate headers. Aggressive polling triggers temporary blocks or throttling, degrading pipeline throughput. Fix: Implement client-side rate limiting with jitter. Monitor response latency and adjust concurrency dynamically based on endpoint behavior.
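One way to implement this fix is a jittered delay combined with additive-increase, multiplicative-decrease (AIMD) concurrency control. AIMD is a technique chosen here for illustration, not something the source prescribes, and the function names, thresholds, and bounds are assumptions.

```typescript
// Base delay plus a random spread, so parallel workers do not fire in lockstep.
// `random` is injectable to make the behavior deterministic in tests.
function jitteredDelayMs(baseMs: number, spreadMs: number, random: () => number = Math.random): number {
  return baseMs + Math.floor(random() * spreadMs);
}

interface ConcurrencyBounds { min: number; max: number }

// AIMD: halve concurrency when the endpoint looks stressed, otherwise add one.
function nextConcurrency(
  current: number,
  latencyMs: number,
  throttled: boolean, // e.g. the last response signalled throttling
  targetLatencyMs: number,
  bounds: ConcurrencyBounds,
): number {
  const stressed = throttled || latencyMs > targetLatencyMs;
  const proposed = stressed ? Math.floor(current / 2) : current + 1;
  return Math.max(bounds.min, Math.min(bounds.max, proposed));
}

const bounds: ConcurrencyBounds = { min: 1, max: 16 };
console.log(nextConcurrency(6, 100, false, 500, bounds)); // healthy: grows by one
console.log(nextConcurrency(6, 900, false, 500, bounds)); // slow: halves
```

Halving on trouble and growing by one on success makes the client back off quickly and recover cautiously, which suits endpoints that publish no explicit limits.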

7. Assuming Static Endpoint Paths

Explanation: Platforms migrate bulk endpoints during infrastructure updates. Hardcoded paths break pipelines without warning. Fix: Resolve endpoints via llms.txt at pipeline initialization. Cache routing hints with a TTL and implement fallback resolution logic.
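Caching with a TTL and a stale fallback can be sketched as a small generic class. `TtlCache` is a hypothetical name, and the injectable clock exists only to keep the example testable; the `refresh` callback would wrap whatever fetches and parses llms.txt.

```typescript
// Caches a value for `ttlMs`. When a refresh fails, the stale value is served
// as a fallback instead of failing the pipeline outright.
class TtlCache<T> {
  private value: T | undefined;
  private fetchedAt = 0;
  private hasValue = false;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  async get(refresh: () => Promise<T>): Promise<T> {
    const fresh = this.hasValue && this.now() - this.fetchedAt < this.ttlMs;
    if (fresh) return this.value as T;
    try {
      this.value = await refresh();
      this.fetchedAt = this.now();
      this.hasValue = true;
    } catch (err) {
      if (!this.hasValue) throw err; // nothing cached yet, so nothing to fall back to
      // Otherwise keep the stale value; the next call will attempt a refresh again
    }
    return this.value as T;
  }
}

// 24-hour TTL, matching the checklist recommendation in this article
const hintCache = new TtlCache<string[]>(24 * 60 * 60 * 1000);
```

Serving stale hints after a failed refresh is a trade-off: it keeps ingestion running through a brief outage, but it also delays noticing a genuine endpoint migration, so refresh failures should still be logged and alerted on.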

Production Bundle

Action Checklist

Parse llms.txt at pipeline startup and cache routing hints with a 24-hour TTL
Implement async metadata streaming to prevent memory exhaustion during catalog ingestion
Deploy a compliance validator that checks title patterns, copyright flags, and internal policies
Configure content hashing with a persistent registry to eliminate duplicate training samples
Set client-side rate limiting with jitter and dynamic concurrency adjustment
Establish SFTP connection pooling with exponential backoff and credential rotation
Route all ingestion metrics to observability dashboards for throughput and cost tracking
Schedule quarterly legal reviews to verify training data acquisition trails remain compliant

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Small-scale prototype (<100k docs)	Bulk JSON API	Fast setup, deterministic routing, minimal infra	Low ($150–$300/mo)
Mid-scale training (1M–10M docs)	Torrent metadata + SFTP fallback	High throughput, deduplication-friendly, scalable	Medium ($800–$2,500/mo)
Enterprise compliance audit	Enterprise SFTP tier	Documented acquisition trail, refundable via contribution	High ($10k–$25k, offsettable)
Adversarial environment (no `llms.txt`)	Proxy rotation + CAPTCHA solver	Only viable when bulk channels are unavailable	Very High ($3k–$8k/mo)

Configuration Template

# llm-pipeline-config.yaml
routing:
  source_url: "https://example-archive.org"
  hint_ttl_hours: 24
  preferred_protocol: "json"

ingestion:
  concurrency: 6
  rate_limit_delay_ms: 150
  batch_size: 500
  output_format: "parquet"

compliance:
  blocked_title_patterns:
    - "^restricted.*"
    - ".*internal.*"
  hash_registry:
    type: "redis"
    connection_string: "redis://localhost:6379"
  legal_review_required: true

storage:
  metadata_bucket: "s3://training-catalog/metadata"
  content_bucket: "s3://training-catalog/content"
  retention_days: 90

observability:
  metrics_endpoint: "/metrics"
  log_level: "info"
  alert_on_failure: true
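Validating the configuration at startup catches typos before any ingestion begins. The sketch below types a subset of the template and checks a few fields. It assumes the YAML has already been parsed into a plain object by whichever YAML library the project uses, and the specific rules (https only, positive integers) are illustrative choices rather than requirements from the source.

```typescript
// Mirrors a subset of the YAML template above
interface PipelineConfig {
  routing: { source_url: string; hint_ttl_hours: number; preferred_protocol: 'json' | 'torrent' | 'sftp' };
  ingestion: { concurrency: number; rate_limit_delay_ms: number; batch_size: number };
  compliance: { blocked_title_patterns: string[] };
}

// Returns a list of human-readable problems; an empty list means the config passed
function validateConfig(cfg: PipelineConfig): string[] {
  const problems: string[] = [];
  if (!/^https:\/\//.test(cfg.routing.source_url)) problems.push('routing.source_url must be an https URL');
  if (!(cfg.routing.hint_ttl_hours > 0)) problems.push('routing.hint_ttl_hours must be positive');
  if (!Number.isInteger(cfg.ingestion.concurrency) || cfg.ingestion.concurrency < 1) {
    problems.push('ingestion.concurrency must be a positive integer');
  }
  if (!(cfg.ingestion.rate_limit_delay_ms >= 0)) problems.push('ingestion.rate_limit_delay_ms must be non-negative');
  if (!Number.isInteger(cfg.ingestion.batch_size) || cfg.ingestion.batch_size < 1) {
    problems.push('ingestion.batch_size must be a positive integer');
  }
  for (const pattern of cfg.compliance.blocked_title_patterns) {
    // Compile each pattern the same way the validator will, to surface syntax errors early
    try { new RegExp(pattern, 'i'); } catch { problems.push(`invalid title pattern: ${pattern}`); }
  }
  return problems;
}

// Values copied from the template
const templateConfig: PipelineConfig = {
  routing: { source_url: 'https://example-archive.org', hint_ttl_hours: 24, preferred_protocol: 'json' },
  ingestion: { concurrency: 6, rate_limit_delay_ms: 150, batch_size: 500 },
  compliance: { blocked_title_patterns: ['^restricted.*', '.*internal.*'] },
};

console.log(validateConfig(templateConfig));
```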

Quick Start Guide

1. Initialize the Router: Deploy the LlmDataRouter class against your target domain. Verify that llms.txt resolves and returns at least one bulk endpoint hint.
2. Configure Compliance Rules: Populate the ComplianceValidator with internal copyright filters and hash registry settings. Run a dry-run against a 1,000-entry sample to validate deduplication logic.
3. Launch Metadata Stream: Execute the BulkMetadataFetcher with async iteration. Monitor throughput and latency. Adjust concurrency based on endpoint response times.
4. Validate & Store: Pipe validated entries into your object storage. Verify parquet schema alignment and confirm hash registry updates.
5. Monitor & Iterate: Enable observability endpoints. Track ingestion rate, duplicate rejection ratio, and compliance flags. Adjust rate limits and concurrency weekly based on pipeline performance.
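The metrics named in the last step can start as a simple counter before being wired into a real observability stack. In this sketch the class name `IngestionStats` and the three outcome labels are hypothetical.

```typescript
type Outcome = 'accepted' | 'duplicate' | 'blocked';

// Tallies validation outcomes and derives the ratios worth alerting on
class IngestionStats {
  private counts: Record<Outcome, number> = { accepted: 0, duplicate: 0, blocked: 0 };

  record(outcome: Outcome): void {
    this.counts[outcome]++;
  }

  total(): number {
    return this.counts.accepted + this.counts.duplicate + this.counts.blocked;
  }

  // Share of processed entries rejected as duplicates (0 when nothing was processed)
  duplicateRatio(): number {
    const total = this.total();
    return total === 0 ? 0 : this.counts.duplicate / total;
  }

  // Share of processed entries rejected by compliance rules
  blockedRatio(): number {
    const total = this.total();
    return total === 0 ? 0 : this.counts.blocked / total;
  }
}

const stats = new IngestionStats();
const observed: Outcome[] = ['accepted', 'accepted', 'duplicate', 'accepted'];
observed.forEach(o => stats.record(o));
console.log(stats.duplicateRatio());
```

A sudden rise in the duplicate ratio often means a mirror was re-ingested, while a rise in the blocked ratio suggests that a title pattern is matching more broadly than intended.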
