System Design Interview: Decentralized Web Crawler

By Codcompass Team·2026-05-26·9 min read

Architecting a Churn-Resilient Distributed Crawler with Consistent Hashing

Current Situation Analysis

Centralized web crawlers hit a hard scalability wall once they cross the thousand-node threshold. The bottleneck isn't network bandwidth or disk I/O; it's coordination. A shared URL queue, a global scheduler, and a centralized deduplication database create lock contention, network serialization, and a single point of failure. When thousands of agents compete for the same queue, throughput plateaus and latency spikes.

The industry routinely misunderstands deduplication in this context. Engineers instinctively design for exactly-once processing, assuming that crawling the same URL twice is a critical failure. This assumption forces expensive distributed transactions, cross-node consensus protocols, or heavy database locking. In reality, large-scale indexing systems operate on eventual coverage. Accepting a 0.1%–0.5% duplication rate or rare long-tail misses eliminates the need for global coordination entirely.

Production data from petabyte-scale indexing pipelines confirms this trade-off. Systems handling billions of URLs across 10,000+ independent nodes achieve 10x higher throughput when they abandon strong consistency. Churn is constant: nodes join, crash, or get reaped by orchestrators without warning. A design that requires global state synchronization will fracture under this volatility. The solution is to decouple discovery from execution, push deduplication to the edge, and route tasks deterministically using a distributed hash table (DHT) built on consistent hashing.

WOW Moment: Key Findings

The breakthrough in decentralized crawling isn't a new algorithm; it's a shift in ownership semantics. By mapping URLs to nodes via a deterministic hash ring, you eliminate negotiation, locking, and global state. The following comparison highlights why this approach outperforms traditional architectures at scale:

Approach	Coordination Overhead	Data Movement on Churn	Deduplication Model	Max Stable Node Count
Centralized Queue	High (lock contention, scheduler latency)	None (static topology)	Strong (ACID)	~1,000
Gossip-Based Task Pool	Medium (state sync, conflict resolution)	High (task reassignment)	Eventual (conflict resolution)	~5,000
DHT Consistent Ring	Low (deterministic routing, zero negotiation)	Minimal (local arc reassignment)	Eventual (owner-side check)	10,000+

This finding matters because it proves that coordination cost scales with topology awareness, not node count. A consistent hashing ring requires each node to maintain only O(log N) routing state. When a node crashes or joins, only the keys in the adjacent arc shift ownership. The rest of the ring remains untouched. This stability enables horizontal scaling without reconfiguration, while owner-side deduplication keeps memory usage predictable and network traffic linear.

Core Solution

Building a decentralized crawler requires four interconnected components: a deterministic hash space, a self-healing routing layer, a task distribution mechanism, and an edge-side deduplication store. Each component is designed to operate independently while guaranteeing eventual coverage.

1. Hash Space & Node Positioning

The foundation is a circular key space, typically 0 to 2^160 - 1. Every node and every URL maps to a position in this space. A node's position is derived from hashing its unique identifier. A URL's position is derived from hashing its canonical form. Ownership follows a simple clockwise rule: a URL belongs to the first node whose position equals or follows the URL's key.

This eliminates modulo-based partitioning, which redistributes nearly all keys when N changes. Consistent hashing ensures that adding or removing a node only affects keys in the immediate arc, preserving routing stability under churn.

2. Routing Architecture: Finger Tables

Deterministic ownership is

useless if nodes cannot locate each other efficiently. A naive successor-only ring requires O(N) hops. Full membership gossip requires O(N) state per node and floods the network during churn. The optimal middle ground is a finger table.

Each node maintains log₂(N) entries pointing to nodes at exponentially increasing distances around the ring (+1, +2, +4, +8, ...). When routing a task, a node forwards it to the farthest finger that does not overshoot the target key. Each hop halves the remaining distance, guaranteeing O(log N) routing latency.

Finger tables optimize speed; successor pointers guarantee correctness. Successor pointers must be refreshed frequently to handle crashes, while finger tables can update asynchronously without breaking routing.

3. Task Distribution & Ownership

When a node parses a fetched page, it extracts outbound links. Instead of queuing them locally, it hashes each link, routes it to its owner, and hands it off. The owner receives the task, checks its local deduplication store, and queues it if unseen. This decouples discovery from execution and ensures that duplicate links discovered by hundreds of nodes converge on a single owner.

4. Local Execution & Deduplication

Each node runs two concurrent loops: a crawler loop that processes its local queue, and a receiver loop that handles incoming tasks. Deduplication happens at the owner. A memory-efficient seen set (or probabilistic structure) filters duplicates before they enter the fetch queue. Rate limiting and robots.txt compliance are enforced locally per host, preventing owner nodes from overwhelming target domains.

Implementation (TypeScript)

import { createHash } from 'crypto';

// 1. Hash Ring & Finger Table Management
class HashRing {
  private nodeId: string;
  private position: number;
  private successor: string | null = null;
  private fingerTable: Map<number, string> = new Map();
  private readonly SPACE_SIZE = Math.pow(2, 160);

  constructor(nodeId: string) {
    this.nodeId = nodeId;
    this.position = this.hashToNumber(nodeId);
  }

  private hashToNumber(input: string): number {
    return parseInt(createHash('sha256').update(input).digest('hex').slice(0, 16), 16);
  }

  // Determines if 'key' falls between 'start' and 'end' on the ring
  private isBetween(key: number, start: number, end: number): boolean {
    if (start < end) return key >= start && key < end;
    return key >= start || key < end;
  }

  // Core routing logic using finger table optimization
  async findOwner(targetKey: number): Promise<string> {
    if (this.isBetween(targetKey, this.position, this.successorPosition())) {
      return this.successor!;
    }

    // Forward to farthest finger that precedes the target
    for (const [distance, nodeId] of this.fingerTable.entries()) {
      const fingerPos = (this.position + distance) % this.SPACE_SIZE;
      if (this.isBetween(fingerPos, this.position, targetKey)) {
        // In production, this triggers a network RPC to the target node
        return this.rpcLookup(nodeId, targetKey);
      }
    }

    // Fallback to successor if no finger matches
    return this.rpcLookup(this.successor!, targetKey);
  }

  private async rpcLookup(nodeId: string, key: number): Promise<string> {
    // Simulated network call; replace with actual RPC/HTTP client
    console.log(`[RPC] Forwarding lookup for key ${key} to ${nodeId}`);
    return nodeId; // Placeholder for recursive resolution
  }

  private successorPosition(): number {
    return this.fingerTable.get(1) 
      ? this.hashToNumber(this.fingerTable.get(1)!) 
      : this.position;
  }

  updateFingerTable(distance: number, nodeId: string) {
    this.fingerTable.set(distance, nodeId);
  }

  updateSuccessor(nodeId: string) {
    this.successor = nodeId;
  }
}

// 2. Task Dispatcher & Deduplication
class CrawlAgent {
  private ring: HashRing;
  private localQueue: Set<string> = new Set();
  private seenStore: Map<string, number> = new Map(); // In prod: use Bloom filter or Redis
  private readonly DEDUP_TTL_MS = 86400000; // 24 hours

  constructor(ring: HashRing) {
    this.ring = ring;
  }

  // Handles incoming tasks routed by other nodes
  async handleIncomingTask(url: string): Promise<void> {
    const key = this.ring['hashToNumber'](url);
    const owner = await this.ring.findOwner(key);
    
    if (owner === this.ring['nodeId']) {
      await this.processLocalTask(url);
    } else {
      // Forward to correct owner
      console.log(`[ROUTING] Task ${url} belongs to ${owner}, forwarding...`);
    }
  }

  private async processLocalTask(url: string): Promise<void> {
    const now = Date.now();
    const lastSeen = this.seenStore.get(url);
    
    if (lastSeen && (now - lastSeen) < this.DEDUP_TTL_MS) {
      return; // Duplicate within TTL window
    }

    this.seenStore.set(url, now);
    this.localQueue.add(url);
    console.log(`[QUEUE] Added ${url} to local crawl queue`);
  }

  // Crawling loop: fetch, parse, distribute
  async crawlLoop(): Promise<void> {
    for (const url of this.localQueue) {
      this.localQueue.delete(url);
      const pageContent = await this.fetchPage(url);
      const extractedLinks = this.parseLinks(pageContent);
      
      for (const link of extractedLinks) {
        await this.handleIncomingTask(link);
      }
    }
  }

  private async fetchPage(url: string): Promise<string> {
    // Simulate HTTP fetch with rate limiting & robots.txt compliance
    return `<html><body><a href="https://example.com/page2"></a></body></html>`;
  }

  private parseLinks(html: string): string[] {
    const regex = /href="([^"]+)"/g;
    const links: string[] = [];
    let match;
    while ((match = regex.exec(html)) !== null) {
      links.push(match[1]);
    }
    return links;
  }
}

Architecture Rationale

Why consistent hashing? Modulo partitioning (hash % N) forces global redistribution on every topology change. Consistent hashing limits movement to adjacent arcs, preserving cache locality and reducing network churn.
Why finger tables? They provide O(log N) routing with O(log N) state. This balances lookup speed against maintenance overhead. Successor pointers alone would degrade to O(N) hops at 10k nodes.
Why owner-side deduplication? Pushing deduplication to the owner eliminates cross-node coordination. Every node agrees on ownership deterministically. The seen set only needs to track URLs assigned to that node, keeping memory usage linear and predictable.
Why eventual coverage? Strong consistency requires distributed locks or consensus, which adds latency and reduces availability. Accepting rare duplicates or missed long-tail URLs keeps the system lightweight and resilient to partitions.

Pitfall Guide

1. Stale Finger Tables Causing Routing Loops

Explanation: Finger tables update asynchronously. If a node crashes and its successor changes, stale finger entries may point to dead nodes, causing routing loops or timeouts. Fix: Implement a background refresh routine that validates finger entries every 30–60 seconds. Add a hop limit (e.g., maxHops = 20) to break loops and fallback to successor-only routing when fingers fail.

2. Unbounded Seen Set Growth

Explanation: Storing every seen URL in a Map or Set will exhaust RAM as the crawl scales to billions of URLs. Fix: Use a probabilistic data structure like a Bloom filter for fast negative lookups, paired with a time-to-live (TTL) eviction policy. For persistent deduplication across restarts, offload to a distributed cache (Redis/Memcached) with LRU eviction.

3. Owner-Side Rate Limit Blindness

Explanation: Multiple nodes may route links from the same target domain to a single owner. The owner can inadvertently violate robots.txt or host rate limits by fetching too aggressively. Fix: Implement per-host token buckets at the owner level. Track request timestamps per domain and enforce sliding window limits before dequeuing tasks.

4. Hash Skew Creating Hotspot Nodes

Explanation: If URL keys cluster in a specific range, one node may receive disproportionate traffic, causing queue starvation elsewhere. Fix: Deploy virtual nodes. Each physical node claims multiple positions on the ring (e.g., nodeId-1, nodeId-2, ... nodeId-150). This smooths distribution and prevents hotspots.

5. Churn-Induced Lookup Timeouts

Explanation: During rapid scaling or crash events, successor pointers may lag, causing lookups to fail or route to decommissioned nodes. Fix: Implement exponential backoff with jitter for routing retries. Maintain a fallback routing table that bypasses stale fingers and walks successors directly until the ring stabilizes.

6. Over-Optimizing for Exactly-Once Semantics

Explanation: Attempting to guarantee zero duplicates requires distributed transactions or idempotency keys, adding latency and complexity. Fix: Accept eventual coverage. Design downstream consumers (indexers, parsers) to handle duplicates gracefully via content hashing or canonical URL normalization.

7. Network Partition Blindness

Explanation: During a split-brain scenario, two ring segments may operate independently, causing duplicate crawling and inconsistent ownership. Fix: Implement lightweight partition detection via heartbeat monitoring. When a partition heals, trigger a lightweight reconciliation pass that compares seen set digests rather than full URL lists.

Production Bundle

Action Checklist

Initialize hash ring with virtual nodes (100–150 per physical node) to prevent skew
Implement finger table refresh interval (30s) and successor heartbeat (5s)
Replace in-memory seen sets with Bloom filters + TTL eviction for memory safety
Add per-host rate limiting and robots.txt parsing at the owner queue level
Set hop limit (max 20) and fallback routing to prevent lookup loops
Instrument routing metrics: hop count distribution, lookup latency, dedup hit rate
Seed initial URLs across multiple nodes to prevent single-point bootstrapping

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
< 1,000 nodes, strict dedup required	Centralized queue + Redis dedup	Lower complexity, strong consistency achievable	High coordination cost, limited scale
1,000–5,000 nodes, moderate churn	Gossip-based task pool	Simpler routing, self-healing via state sync	Medium network overhead, conflict resolution latency
5,000–10,000+ nodes, high churn	DHT consistent ring + finger tables	Deterministic routing, minimal state sync, O(log N) hops	Low coordination cost, requires careful TTL/finger tuning
Multi-region deployment	Region-aware virtual nodes + cross-region RPC	Reduces latency, respects data residency	Higher infrastructure cost, requires partition handling

Configuration Template

ring:
  space_bits: 160
  virtual_nodes_per_physical: 128
  successor_heartbeat_ms: 5000
  finger_refresh_ms: 30000
  max_lookup_hops: 20

deduplication:
  strategy: bloom_filter
  expected_capacity: 100000000
  false_positive_rate: 0.01
  ttl_hours: 24
  persistence_backend: redis_cluster

routing:
  rpc_timeout_ms: 2000
  retry_backoff_base_ms: 100
  retry_backoff_max_ms: 5000
  fallback_to_successor: true

crawl:
  queue_size_limit: 50000
  concurrent_fetches: 50
  per_host_rate_limit: 100 # requests per minute
  robots_txt_cache_hours: 168

Quick Start Guide

Initialize the ring: Deploy 3–5 test nodes. Configure each with a unique nodeId and generate virtual node positions using SHA-256 hashing.
Bootstrap routing: Manually set successor pointers for the first node, then run the finger table discovery routine to populate routing tables across the cluster.
Inject seed URLs: Distribute 50–100 seed URLs across different nodes to prime local queues and trigger cross-node routing.
Monitor & validate: Track lookup hop counts, dedup hit rates, and queue depths. Verify that crashed nodes are bypassed gracefully and that routing stabilizes within 60 seconds of topology changes.

🎉 Mid-Year Sale — Unlock Full Article

Base plan from just $4.99/mo or $49/yr

7-day free trial · Cancel anytime · 30-day money-back