Difficulty: Intermediate · Read time: 9 min

Anna's Archive publishes an llms.txt for the LLMs that crawl its catalog

By Codcompass Team·2026-05-23·9 min read

Engineering AI Crawlers Around `llms.txt`: A Protocol for Sustainable Data Acquisition

Current Situation Analysis

The infrastructure strain caused by automated AI data collection has reached a critical inflection point. Traditional web scraping pipelines, originally designed for search engine indexing or competitive intelligence, are now being repurposed at scale for large language model (LLM) training and retrieval-augmented generation (RAG) systems. This shift has created a fundamental mismatch: legacy scraping tools treat every website as a static HTML document to be parsed, while modern data providers are increasingly deploying dynamic defenses, rate limiting, and CAPTCHA challenges to protect server capacity.

The problem is frequently misunderstood by engineering teams building AI pipelines. Developers often assume that if a page is publicly accessible, it can be fetched indiscriminately. This assumption ignores the economic reality of server-side request processing. Every automated request consumes bandwidth, CPU cycles, and database queries. When thousands of concurrent AI crawlers hit a site simultaneously, the cumulative load triggers defensive mechanisms. CAPTCHA systems, while effective at blocking bots, introduce significant computational overhead for verification and degrade the experience for legitimate human users. The cost of this friction is ultimately borne by the site operators, but it also degrades data quality for the crawlers, who receive blocked responses, incomplete payloads, or legally ambiguous content.

Recent industry moves underscore this shift. Major platforms have moved toward explicit access controls: Reddit deprecated its free API for training purposes, The New York Times initiated litigation over unauthorized data usage, and Cloudflare deployed default AI-bot mitigation suites for mid-tier web properties. In this environment, adversarial scraping is becoming economically unsustainable and legally precarious.

The emergence of the llms.txt standard represents a structural response to this friction. First formalized at llmstxt.org, the specification proposes a Markdown file hosted at the root of a domain. Unlike robots.txt, which operates as a restrictive deny-list, llms.txt functions as a cooperative access guide. It tells crawlers what data is available and which ingestion channels the provider prefers, and it can also state rate limits and ethical boundaries where the provider chooses to publish them. On February 18, 2026, Anna's Archive (the world's largest open digital library, aggregating LibGen, Sci-Hub, and Z-Library archives) published a highly structured /llms.txt file. Rather than blocking AI crawlers, the file redirected them toward bulk torrent mirrors, a programmatic JSON API, and enterprise SFTP channels, while explicitly requesting that computational resources saved from CAPTCHA avoidance be redirected toward preservation efforts. This marks a transition from adversarial data extraction to negotiated data exchange.

WOW Moment: Key Findings

The operational impact of adopting an llms.txt-aware ingestion strategy becomes clear when comparing traditional scraping against directive-guided acquisition. The following table contrasts three common approaches used by AI engineering teams:

| Approach | Infrastructure Cost | Data Freshness | Legal/Compliance Risk | Implementation Complexity |
| --- | --- | --- | --- | --- |
| Traditional Scraping | High (CAPTCHA solving, proxy rotation, retry logic) | High (real-time) | High (ToS violations, litigation exposure) | High (HTML parsing, anti-bot evasion) |
| `llms.txt`-Guided Access | Low (direct endpoints, bulk transfers) | Medium-High (scheduled updates) | Low (explicit provider consent) | Medium (directive parsing, routing logic) |
| Enterprise/Donation Channels | Variable (subscription or contribution-based) | High (dedicated pipelines) | Minimal (contractual or explicit terms) | Low (authenticated APIs, SFTP) |

This comparison reveals a critical insight: llms.txt does not merely reduce technical overhead; it transforms data acquisition from a cat-and-mouse game into a predictable, auditable pipeline. By respecting provider-specified channels, engineering teams eliminate proxy costs, reduce retry storms, and establish a documented consent trail. For organizations operating in regulated environments or building commercial RAG products, this shift is not optional; it is a baseline requirement for sustainable AI infrastructure.

Core Solution

Implementing an llms.txt-aware crawler requires an architectural shift from reactive HTML parsing to proactive directive routing. The solution involves four coordinated phases: directive discovery, structured parsing, request routing, and fallback management.

Step 1: Directive Discovery & Caching

Every crawler should first attempt to fetch /llms.txt before initiating any domain-scoped requests. The file is lightweight (typically under 5KB) and changes infrequently. Implementing a local cache with a TTL of 24-48 hours prevents redundant HTTP calls while ensuring updates are eventually propagated.

Step 2: Markdown Structure Parsing

The llms.txt specification relies on plain Markdown. A robust parser must extract:

Site identity and scope (H1 headers)
Access guidelines (blockquotes, bold text)
Endpoint mappings (hyperlinks with contextual labels)
Rate limits and ethical constraints (explicit statements)

Rather than relying on brittle regex, use a lightweight Markdown AST parser to traverse headings, links, and emphasis nodes. This preserves semantic context and allows dynamic routing based on provider intent.
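As a rough illustration of the structure such a parser needs to recover, the sketch below reads the layout described at llmstxt.org (an H1 title, a blockquote summary, and H2 sections containing link lists) line by line. It is a simplified stand-in for a real Markdown AST parser, and `parseLlmsTxt`, `ParsedDirective`, and `DirectiveLink` are hypothetical names introduced here.

```typescript
// Simplified reader for the llmstxt.org layout:
//   # Title
//   > Summary
//   ## Section
//   - [Label](https://example.org/path): optional notes
// All names below are illustrative and not part of any library.

interface DirectiveLink {
  label: string;
  url: string;
  notes: string;
}

interface ParsedDirective {
  title: string | null;
  summary: string[];
  sections: Record<string, DirectiveLink[]>;
}

function parseLlmsTxt(markdown: string): ParsedDirective {
  const result: ParsedDirective = { title: null, summary: [], sections: {} };
  let currentSection = '_root'; // bucket for links that appear before any H2

  for (const rawLine of markdown.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') continue;

    // First H1 becomes the site title; "##" does not match because "#" must
    // be followed by whitespace.
    const h1 = line.match(/^#\s+(.+)$/);
    if (h1 && result.title === null) {
      result.title = h1[1].trim();
      continue;
    }

    const h2 = line.match(/^##\s+(.+)$/);
    if (h2) {
      currentSection = h2[1].trim();
      continue;
    }

    if (line.startsWith('>')) {
      result.summary.push(line.replace(/^>\s?/, ''));
      continue;
    }

    // List item holding a link, optionally followed by ": notes".
    const item = line.match(/^[-*]\s+\[([^\]]+)\]\(([^)\s]+)\)\s*:?\s*(.*)$/);
    if (item) {
      const links = result.sections[currentSection] ?? [];
      links.push({ label: item[1], url: item[2], notes: item[3] });
      result.sections[currentSection] = links;
    }
  }
  return result;
}
```

Grouping links under their H2 heading keeps the provider's own categorization available to the routing layer, instead of inferring intent from link labels alone.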

Step 3: Request Routing Logic

Once parsed, the crawler must map acquisition goals to provider-specified channels. Common mappings include:

Bulk metadata → Torrent mirrors or JSON APIs
Real-time queries → Rate-limited REST endpoints
Enterprise datasets → Authenticated SFTP or donation-gated APIs
Source code → Public version control repositories

The routing layer should enforce provider-stated constraints, automatically switching channels if a primary endpoint returns HTTP 429 or 403.
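The channel-switching rule can be sketched as follows. This is a minimal illustration under assumed types: `Channel`, `ChannelResponse`, and `fetchWithFailover` are hypothetical, and the caller supplies the actual HTTP client as `fetcher`.

```typescript
// Hypothetical shapes for this sketch only.
interface Channel {
  name: string;
  url: string;
}

interface ChannelResponse {
  status: number;
  body: string;
}

type ChannelFetcher = (url: string) => Promise<ChannelResponse>;

interface FailoverResult {
  channel: Channel | null; // null when every channel refused the request
  body: string | null;
  skipped: { name: string; status: number }[];
}

// Try channels in the provider's stated order of preference. A 429 or 403 is
// read as "stop using this channel", so the loop moves on instead of retrying.
async function fetchWithFailover(
  channels: Channel[],
  fetcher: ChannelFetcher
): Promise<FailoverResult> {
  const skipped: { name: string; status: number }[] = [];

  for (const channel of channels) {
    const res = await fetcher(channel.url);

    if (res.status === 429 || res.status === 403) {
      skipped.push({ name: channel.name, status: res.status });
      continue;
    }
    if (res.status < 200 || res.status >= 300) {
      // Other failures are not a provider signal; surface them to the caller.
      throw new Error(`Unexpected status ${res.status} from ${channel.name}`);
    }
    return { channel, body: res.body, skipped };
  }
  return { channel: null, body: null, skipped };
}
```

Returning the list of skipped channels lets the pipeline log why a fallback happened, which is useful later when reviewing whether provider constraints were respected.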

Step 4: Fallback & Compliance Layer

If /llms.txt is absent or malformed, the crawler must fall back to conservative scraping heuristics: strict rate limiting, respectful User-Agent identification, and immediate cessation upon CAPTCHA detection. This layer ensures graceful degradation without violating implicit site policies.
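One way to express the conservative fallback is a small gate that spaces out requests and refuses to continue once a CAPTCHA is seen. The sketch below is an assumption-laden illustration: `FallbackGate` and `looksLikeCaptcha` are hypothetical names, and the marker strings are guesses rather than a reliable detection method.

```typescript
// Heuristic markers only; real challenge pages vary, so this list is an assumption.
const CAPTCHA_MARKERS = ['captcha', 'verify you are human'];

function looksLikeCaptcha(body: string): boolean {
  const text = body.toLowerCase();
  return CAPTCHA_MARKERS.some((marker) => text.includes(marker));
}

// Tracks request spacing and stops the crawl permanently after a CAPTCHA.
class FallbackGate {
  private halted = false;
  private lastRequestAt = Number.NEGATIVE_INFINITY;
  private minIntervalMs: number;

  constructor(minIntervalMs: number = 2000) {
    this.minIntervalMs = minIntervalMs;
  }

  // Milliseconds the caller should wait before the next request.
  // Throws once the gate has been halted.
  msUntilNextRequest(now: number): number {
    if (this.halted) {
      throw new Error('Crawl halted after CAPTCHA detection');
    }
    return Math.max(0, this.lastRequestAt + this.minIntervalMs - now);
  }

  // Record a completed request and inspect the response body.
  recordResponse(now: number, body: string): void {
    this.lastRequestAt = now;
    if (looksLikeCaptcha(body)) {
      this.halted = true;
    }
  }

  get isHalted(): boolean {
    return this.halted;
  }
}
```

Passing `now` explicitly rather than calling `Date.now()` inside the class keeps the gate deterministic and easy to test.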

Implementation Example (TypeScript)

The following implementation sketches a directive router. It replaces traditional scraping loops with a structured, cache-aware acquisition pipeline. For brevity, it extracts links with a regular expression rather than the Markdown AST parser recommended in Step 2.

import { createHash } from 'crypto';
import { readFileSync, writeFileSync, existsSync, mkdirSync } from 'fs';
import { join } from 'path';

// One channel advertised by the provider's llms.txt.
interface DirectiveEndpoint {
  type: 'bulk' | 'api' | 'enterprise' | 'source';
  url: string;
  description: string;
  constraints?: string[];
}

// Parsed directive plus the metadata needed for cache expiry.
interface SiteDirective {
  domain: string;
  fetchedAt: number; // epoch milliseconds
  ttl: number;       // milliseconds
  endpoints: DirectiveEndpoint[];
  rawMarkdown: string;
}

export class AiDataRouter {
  private cacheDir: string;
  private defaultTtlMs: number;

  constructor(cacheDir: string = './.cache/directives', ttlHours: number = 24) {
    this.cacheDir = cacheDir;
    this.defaultTtlMs = ttlHours * 60 * 60 * 1000;
    // Create the cache directory on first use.
    if (!existsSync(this.cacheDir)) {
      mkdirSync(this.cacheDir, { recursive: true });
    }
  }

  private getCachePath(domain: string): string {
    const hash = createHash('sha256').update(domain).digest('hex').slice(0, 12);
    return join(this.cacheDir, `${hash}.json`);
  }

  private isCacheValid(directive: SiteDirective): boolean {
    return (Date.now() - directive.fetchedAt) < directive.ttl;
  }

  private parseMarkdownEndpoints(markdown: string): DirectiveEndpoint[] {
    const endpoints: DirectiveEndpoint[] = [];
    const lines = markdown.split('\n');
    
    for (let i = 0; i < lines.length; i++) {
      const line = lines[i].trim();
      const urlMatch = line.match(/\[([^\]]+)\]\(([^)]+)\)/);
      
      if (urlMatch) {
        const label = urlMatch[1].toLowerCase();
        const url = urlMatch[2];
        let type: DirectiveEndpoint['type'] = 'api';
        
        // Match on word starts so that, for example, "digital" is not read as "git".
        if (/\b(torrent|bulk)/.test(label)) type = 'bulk';
        else if (/\b(sftp|enterprise)/.test(label)) type = 'enterprise';
        else if (/\b(git|source)/.test(label)) type = 'source';
        
        // Treat the preceding line as a constraint only when it is a plain
        // list item (a bullet with no link of its own).
        const constraints: string[] = [];
        if (i > 0) {
          const prevLine = lines[i - 1].trim();
          const isBullet = /^[*-]\s+/.test(prevLine);
          const hasLink = /\[[^\]]+\]\([^)]+\)/.test(prevLine);
          if (isBullet && !hasLink) {
            constraints.push(prevLine.replace(/^[*-]\s+/, ''));
          }
        }
        
        endpoints.push({ type, url, description: urlMatch[1], constraints });
      }
    }
    return endpoints;
  }

  public async resolveAcquisitionChannel(
    domain: string,
    fetcher: (url: string) => Promise<string>
  ): Promise<SiteDirective> {
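    const cachePath = this.getCachePath(domain);
    let stale: SiteDirective | null = null;
    
    if (existsSync(cachePath)) {
      const cached = JSON.parse(readFileSync(cachePath, 'utf-8')) as SiteDirective;
      if (this.isCacheValid(cached)) return cached;
      stale = cached; // expired, but kept as a last resort if the refetch fails
    }

    const directiveUrl = `https://${domain}/llms.txt`;
    let rawMarkdown: string;
    try {
      rawMarkdown = await fetcher(directiveUrl);
    } catch (err) {
      // Serve the expired copy during provider outages; rethrow if there is none.
      if (stale) return stale;
      throw err;
    }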
    
    const directive: SiteDirective = {
      domain,
      fetchedAt: Date.now(),
      ttl: this.defaultTtlMs,
      endpoints: this.parseMarkdownEndpoints(rawMarkdown),
      rawMarkdown
    };

    writeFileSync(cachePath, JSON.stringify(directive, null, 2));
    return directive;
  }

  public selectOptimalEndpoint(
    directive: SiteDirective,
    goal: 'metadata' | 'full_corpus' | 'source_code'
  ): DirectiveEndpoint | null {
    const typeMap: Record<string, DirectiveEndpoint['type']> = {
      metadata: 'api',
      full_corpus: 'bulk',
      source_code: 'source'
    };

    const target = typeMap[goal] || 'api';
    return directive.endpoints.find(ep => ep.type === target) ?? null;
  }
}

Architecture Rationale

Cache-First Resolution: llms.txt files change infrequently. Caching prevents unnecessary network calls and, within the TTL, keeps the pipeline running during provider outages.
Semantic Routing Over Hardcoding: Parsing the directive at runtime allows the router to adapt when providers update endpoints. Hardcoded URLs break on minor site changes; directive parsing survives them.
Goal-Based Selection: Separating acquisition intent (metadata vs full_corpus) from implementation details enables modular pipeline design. RAG systems can fetch metadata for indexing while deferring bulk downloads to offline workers.
Graceful Degradation: The router returns null when no matching endpoint exists, signaling the pipeline to switch to conservative fallback strategies rather than failing silently or aggressively retrying.

Pitfall Guide

1. Treating `llms.txt` as a Legal Shield

Explanation: The file is a technical guideline, not a binding contract. Providers can update or remove it without notice, and copyright law operates independently of site directives. Fix: Maintain a compliance audit trail. Log which directive version was active during data ingestion, and cross-reference with provider terms of service and jurisdictional copyright frameworks.
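A minimal sketch of such an audit record follows, assuming that a SHA-256 hash of the directive's raw content is an acceptable stand-in for a "directive version". `DirectiveAuditRecord`, `buildAuditRecord`, and `toLogLine` are hypothetical names; `createHash` is the standard Node.js crypto function.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical record shape; adapt the fields to your own compliance needs.
interface DirectiveAuditRecord {
  domain: string;
  directiveSha256: string;        // content hash used as the directive "version"
  fetchedAt: string;              // ISO 8601 timestamp
  selectedEndpoint: string | null;
}

function buildAuditRecord(
  domain: string,
  rawMarkdown: string,
  selectedEndpoint: string | null,
  fetchedAt: Date = new Date()
): DirectiveAuditRecord {
  return {
    domain,
    // Identical directive text always yields the same hash, so a changed hash
    // signals that the provider edited the file.
    directiveSha256: createHash('sha256').update(rawMarkdown, 'utf8').digest('hex'),
    fetchedAt: fetchedAt.toISOString(),
    selectedEndpoint,
  };
}

// One JSON object per line, suitable for an append-only log file.
function toLogLine(record: DirectiveAuditRecord): string {
  return JSON.stringify(record);
}
```

Hashing the content rather than trusting a timestamp means the log can show exactly which wording of the directive was in force during a given ingestion run.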

2. Ignoring Implicit Rate Limits

Explanation: Even when llms.txt provides direct endpoints, providers rarely state explicit rate limits. Assuming unlimited throughput will trigger infrastructure defenses. Fix: Implement exponential backoff with jitter. Start at 1 request per second per endpoint, and reduce concurrency if HTTP 429 or connection resets occur.
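The delay calculation behind this fix can be sketched as below. The option names mirror the `backoff` fields in the configuration template; `backoffDelayMs` itself is a hypothetical helper, and symmetric jitter around the base delay is one of several reasonable choices.

```typescript
interface BackoffOptions {
  initialDelayMs: number;
  maxDelayMs: number;
  jitterFactor: number; // 0.2 spreads delays by up to 20% either side of the base
}

// Delay before retry number `attempt` (0-based). The base doubles each attempt
// and is capped at maxDelayMs; jitter then shifts it up or down so that many
// clients do not retry in lockstep.
function backoffDelayMs(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random
): number {
  const base = Math.min(opts.maxDelayMs, opts.initialDelayMs * 2 ** attempt);
  const jitter = base * opts.jitterFactor * (random() * 2 - 1);
  return Math.max(0, Math.round(base + jitter));
}

// Small helper for awaiting the computed delay inside a retry loop.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

With `initialDelayMs: 1000`, `maxDelayMs: 30000`, and `jitterFactor: 0.2`, the base delays are 1 s, 2 s, 4 s, and so on up to the 30 s cap, each shifted by at most 20%.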

3. Mixing Bulk and Real-Time Channels

Explanation: Torrent mirrors and SFTP channels are optimized for high-throughput, low-frequency transfers. Querying them for real-time RAG lookups introduces latency and violates provider intent. Fix: Architect separate ingestion layers. Use bulk channels for offline corpus building, and reserve API endpoints for dynamic, low-latency queries.

4. Hardcoding Endpoint Paths

Explanation: Providers frequently restructure download pages or rotate torrent hashes. Hardcoded URLs break pipelines and require manual intervention. Fix: Always parse llms.txt dynamically. Store endpoint metadata in a configuration database that updates when the directive changes.

5. Assuming Anonymity Protects Against Enforcement

Explanation: Rotating IPs and spoofing User-Agent headers does not prevent legal action or infrastructure blocking. Many providers log request patterns and correlate them with known AI training pipelines. Fix: Use transparent User-Agent strings that identify your organization and provide contact information. Transparency reduces friction and enables direct communication with site operators.

6. Overlooking Data Provenance in Training Sets

Explanation: Ingesting data via llms.txt does not automatically resolve licensing ambiguities. Aggregated libraries often contain mixed-licensing materials. Fix: Implement metadata tagging at ingestion time. Record source domain, directive version, and licensing hints. Filter or watermark datasets before model training.
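Tagging at ingestion time might look like the following sketch. The field names and the `tagDocument` and `filterByLicense` helpers are hypothetical, and a license "hint" here is whatever string the source metadata supplies, not a verified legal determination.

```typescript
// Hypothetical provenance shape attached to each ingested item.
interface ProvenanceTag {
  sourceDomain: string;
  directiveVersion: string;   // e.g. a content hash of the llms.txt in force
  licenseHint: string | null; // null when the source gives no indication
  ingestedAt: string;         // ISO 8601 timestamp
}

interface TaggedDocument<T> {
  payload: T;
  provenance: ProvenanceTag;
}

function tagDocument<T>(
  payload: T,
  sourceDomain: string,
  directiveVersion: string,
  licenseHint: string | null,
  ingestedAt: Date = new Date()
): TaggedDocument<T> {
  return {
    payload,
    provenance: {
      sourceDomain,
      directiveVersion,
      licenseHint,
      ingestedAt: ingestedAt.toISOString(),
    },
  };
}

// Keep only documents whose license hint is on the allow-list.
// Documents with no hint are excluded rather than assumed to be permitted.
function filterByLicense<T>(
  docs: TaggedDocument<T>[],
  allowed: string[]
): TaggedDocument<T>[] {
  const allowedSet = new Set(allowed.map((l) => l.toLowerCase()));
  return docs.filter((doc) => {
    const hint = doc.provenance.licenseHint;
    return hint !== null && allowedSet.has(hint.toLowerCase());
  });
}
```

Excluding untagged items by default is the conservative choice; a pipeline that prefers recall could instead route them to manual review.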

7. Failing to Handle Directive Absence

Explanation: Not all sites publish llms.txt. Assuming it exists leads to pipeline stalls or fallback to aggressive scraping. Fix: Design a three-tier fallback: (1) Check llms.txt, (2) Check robots.txt for crawl-delay, (3) Apply conservative default limits (e.g., 1 req/2s) with immediate halt on CAPTCHA detection.
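The three tiers can be sketched as a single policy decision. `parseCrawlDelaySeconds` and `choosePolicy` are hypothetical helpers, the robots.txt handling is deliberately simplified, and `Crawl-delay` is a non-standard directive that not every site uses.

```typescript
// Extract Crawl-delay (seconds) for a user-agent token, falling back to "*".
// Simplified: consecutive User-agent lines form one group; other rules are ignored.
function parseCrawlDelaySeconds(robotsTxt: string, agentToken: string): number | null {
  const delays = new Map<string, number>();
  let activeAgents: string[] = [];
  let lastWasAgentLine = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      if (!lastWasAgentLine) activeAgents = []; // a new group starts here
      activeAgents.push(value.toLowerCase());
      lastWasAgentLine = true;
      continue;
    }
    lastWasAgentLine = false;

    if (field === 'crawl-delay' && value !== '') {
      const seconds = Number(value);
      if (Number.isFinite(seconds) && seconds >= 0) {
        for (const agent of activeAgents) delays.set(agent, seconds);
      }
    }
  }
  return delays.get(agentToken.toLowerCase()) ?? delays.get('*') ?? null;
}

type CrawlPolicy =
  | { tier: 'llms.txt' }
  | { tier: 'robots.txt'; minIntervalMs: number }
  | { tier: 'default'; minIntervalMs: number };

// Pass null for a file that was absent or could not be fetched.
function choosePolicy(
  llmsTxt: string | null,
  robotsTxt: string | null,
  agentToken: string
): CrawlPolicy {
  if (llmsTxt !== null && llmsTxt.trim() !== '') return { tier: 'llms.txt' };
  if (robotsTxt !== null) {
    const delay = parseCrawlDelaySeconds(robotsTxt, agentToken);
    if (delay !== null) return { tier: 'robots.txt', minIntervalMs: delay * 1000 };
  }
  return { tier: 'default', minIntervalMs: 2000 }; // 1 request per 2 seconds
}
```

CAPTCHA detection is left to the request loop that consumes the chosen policy, since it depends on inspecting live responses.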

Production Bundle

Action Checklist

Implement directive caching with configurable TTL and cache invalidation hooks
Build a Markdown AST parser that extracts endpoints, constraints, and access guidelines
Separate ingestion pipelines by data type (metadata indexing vs. bulk corpus download)
Add transparent User-Agent headers with organizational contact information
Log directive versions and endpoint selections for compliance auditing
Configure exponential backoff with jitter for all provider endpoints
Establish a fallback strategy for domains without llms.txt or with malformed directives
Tag ingested data with source provenance and licensing metadata before training or RAG indexing

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Building a RAG knowledge base from academic papers | `llms.txt`-guided metadata API + scheduled torrent sync | Balances real-time query needs with bulk storage efficiency | Low infrastructure cost, moderate engineering effort |
| Training a domain-specific LLM on technical documentation | Enterprise SFTP or donation-gated bulk API | Guarantees data completeness, reduces retry overhead, establishes provider consent | Higher direct cost (donation/subscription), lower operational risk |
| Monitoring competitor pricing or public catalogs | Traditional scraping with strict rate limits + `robots.txt` compliance | Real-time data required; bulk channels unavailable | High proxy/CAPTCHA cost, elevated legal risk |
| Archiving open-source documentation for offline AI agents | GitLab/Source repository clone + `llms.txt` directive routing | Preserves version history, eliminates HTML parsing overhead | Near-zero infrastructure cost, high maintainability |

Configuration Template

// pipeline.config.ts
import { AiDataRouter } from './ai-data-router';

export const ingestionConfig = {
  cache: {
    directory: './.cache/directives',
    ttlHours: 24,
    maxEntries: 5000
  },
  routing: {
    defaultConcurrency: 3,
    backoff: {
      initialDelayMs: 1000,
      maxDelayMs: 30000,
      jitterFactor: 0.2
    },
    fallback: {
      enabled: true,
      maxRetries: 3,
      captchaThreshold: 2
    }
  },
  compliance: {
    userAgent: 'AcmeAI-Research/1.0 (+https://acme.ai/bot-info)',
    logDirectiveVersions: true,
    tagProvenance: true,
    licenseFilter: ['CC-BY', 'CC0', 'Public Domain', 'MIT']
  }
};

export const router = new AiDataRouter(
  ingestionConfig.cache.directory,
  ingestionConfig.cache.ttlHours
);

Quick Start Guide

Initialize the Router: Import AiDataRouter and configure the cache directory and TTL. The constructor creates the directory if it is missing, so ensure the process has write permissions on the parent path.
Resolve a Domain: Call router.resolveAcquisitionChannel('example.com', fetcher) where fetcher is an async function that returns raw Markdown. The router caches the result and parses endpoints.
Select an Endpoint: Use router.selectOptimalEndpoint(directive, 'metadata') to retrieve the provider-recommended channel for your ingestion goal.
Execute with Fallbacks: Wrap the endpoint fetch in a retry loop with exponential backoff. If the provider returns a CAPTCHA, halt ingestion and log the event for manual review. If it returns 403, switch to another channel listed in the directive, or halt if none remains.
