System Design Interview: Decentralized Web Crawler

By Codcompass Team · 9 min read

Architecting a Churn-Resilient Distributed Crawler with Consistent Hashing

Current Situation Analysis

Centralized web crawlers hit a hard scalability wall once they cross the thousand-node threshold. The bottleneck isn't network bandwidth or disk I/O; it's coordination. A shared URL queue, a global scheduler, and a centralized deduplication database create lock contention, network serialization, and a single point of failure. When thousands of agents compete for the same queue, throughput plateaus and latency spikes.

The industry routinely misunderstands deduplication in this context. Engineers instinctively design for exactly-once processing, assuming that crawling the same URL twice is a critical failure. This assumption forces expensive distributed transactions, cross-node consensus protocols, or heavy database locking. In reality, large-scale indexing systems operate on eventual coverage. Accepting a 0.1%–0.5% duplication rate or rare long-tail misses eliminates the need for global coordination entirely.

Production data from petabyte-scale indexing pipelines confirms this trade-off. Systems handling billions of URLs across 10,000+ independent nodes achieve 10x higher throughput when they abandon strong consistency. Churn is constant: nodes join, crash, or get reaped by orchestrators without warning. A design that requires global state synchronization will fracture under this volatility. The solution is to decouple discovery from execution, push deduplication to the edge, and route tasks deterministically using a distributed hash table (DHT) built on consistent hashing.

WOW Moment: Key Findings

The breakthrough in decentralized crawling isn't a new algorithm; it's a shift in ownership semantics. By mapping URLs to nodes via a deterministic hash ring, you eliminate negotiation, locking, and global state. The following comparison highlights why this approach outperforms traditional architectures at scale:

| Approach | Coordination Overhead | Data Movement on Churn | Deduplication Model | Max Stable Node Count |
| --- | --- | --- | --- | --- |
| Centralized Queue | High (lock contention, scheduler latency) | None (static topology) | Strong (ACID) | ~1,000 |
| Gossip-Based Task Pool | Medium (state sync, conflict resolution) | High (task reassignment) | Eventual (conflict resolution) | ~5,000 |
| DHT Consistent Ring | Low (deterministic routing, zero negotiation) | Minimal (local arc reassignment) | Eventual (owner-side check) | 10,000+ |

This finding matters because it shows that coordination cost scales with how much of the topology each node must track, not with node count. With finger-table routing on a consistent hashing ring, each node maintains only O(log N) routing state. When a node crashes or joins, only the keys in the adjacent arc shift ownership. The rest of the ring remains untouched. This stability enables horizontal scaling without reconfiguration, while owner-side deduplication keeps memory usage predictable and network traffic linear.
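The owner-side check mentioned above can be sketched as a purely local structure. This is a minimal illustration, not the article's implementation: the `OwnerNode` class, its `accept` method, and the choice to store SHA-1 digests instead of full URLs are all assumptions made for the example.

```python
import hashlib
from collections import deque


class OwnerNode:
    """Hypothetical crawler node that deduplicates only the URLs routed to it."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self._seen = set()       # 20-byte digests of URLs already accepted
        self.frontier = deque()  # URLs waiting to be fetched by this node

    def accept(self, canonical_url: str) -> bool:
        """Queue the URL if this node has not seen it; return True if queued."""
        digest = hashlib.sha1(canonical_url.encode("utf-8")).digest()
        if digest in self._seen:
            return False  # local duplicate, dropped without any network call
        self._seen.add(digest)
        self.frontier.append(canonical_url)
        return True
```

Because the seen-set lives only on the owner, a node that inherits an arc after churn starts with no history for those keys and may re-crawl some URLs. That is the small duplication rate the design accepts in exchange for avoiding global coordination.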

Core Solution

Building a decentralized crawler requires four interconnected components: a deterministic hash space, a self-healing routing layer, a task distribution mechanism, and an edge-side deduplication store. Each component is designed to operate independently while guaranteeing eventual coverage.

1. Hash Space & Node Positioning

The foundation is a circular key space, typically 0 to 2^160 - 1 (the output range of a 160-bit hash such as SHA-1). Every node and every URL maps to a position in this space. A node's position is derived from hashing its unique identifier. A URL's position is derived from hashing its canonical form. Ownership follows a simple clockwise rule: a URL belongs to the first node whose position equals or follows the URL's key, wrapping around to the lowest-positioned node when no node follows the key.
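The clockwise ownership rule can be sketched in a few lines. The example assumes SHA-1 as the hash (it matches the 160-bit key space, but the article does not name a hash function), and the `HashRing` class and `ring_position` helper are hypothetical names introduced for illustration.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a node ID or canonical URL to an integer in 0 .. 2^160 - 1."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class HashRing:
    """Sorted list of node positions with clockwise ownership lookup."""

    def __init__(self, node_ids):
        self._entries = sorted((ring_position(n), n) for n in node_ids)
        self._positions = [pos for pos, _ in self._entries]

    def owner(self, canonical_url: str) -> str:
        key = ring_position(canonical_url)
        # First node whose position is >= key ("equals or follows").
        idx = bisect.bisect_left(self._positions, key)
        if idx == len(self._positions):
            idx = 0  # wrap around to the lowest-positioned node
        return self._entries[idx][1]


ring = HashRing(["crawler-a", "crawler-b", "crawler-c"])
print(ring.owner("https://example.com/index.html"))
```

Any node holding the same membership list computes the same owner for a given URL, so no negotiation is needed. This flat lookup requires knowledge of every node; it only illustrates the ownership rule, not the routing layer.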

This clockwise rule replaces modulo-based partitioning (hash mod N), which redistributes nearly all keys when N changes. With consistent hashing, adding or removing a node affects only the keys in the arc adjacent to that node, so routing stays stable under churn.
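The difference in key movement can be measured directly. The sketch below compares how many URLs change owner when a 21st node joins a 20-node cluster under each scheme. The node names, the `example.com` URLs, the use of SHA-1, and all function names are assumptions made for the example.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """160-bit integer hash (SHA-1 is an assumption for this example)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


def make_modulo_owner(nodes):
    """Owner lookup using hash(url) mod N."""
    return lambda url: nodes[h(url) % len(nodes)]


def make_ring_owner(nodes):
    """Owner lookup using the clockwise rule on a consistent hashing ring."""
    entries = sorted((h(n), n) for n in nodes)
    positions = [pos for pos, _ in entries]

    def owner(url):
        idx = bisect.bisect_left(positions, h(url)) % len(entries)
        return entries[idx][1]

    return owner


def moved_fraction(owner_before, owner_after, urls):
    """Fraction of URLs whose owner differs between two memberships."""
    moved = sum(1 for u in urls if owner_before(u) != owner_after(u))
    return moved / len(urls)


urls = [f"https://example.com/page/{i}" for i in range(5000)]
before = [f"node-{i}" for i in range(20)]
after = before + ["node-20"]  # one node joins

modulo_moved = moved_fraction(make_modulo_owner(before), make_modulo_owner(after), urls)
ring_moved = moved_fraction(make_ring_owner(before), make_ring_owner(after), urls)
print(f"modulo: {modulo_moved:.1%} of keys moved, ring: {ring_moved:.1%} of keys moved")
```

Under modulo partitioning a key keeps its owner only when its hash yields the same remainder for both 20 and 21, which is rare. On the ring, every key that moves lands on the new node and nothing else changes. With a single position per node, the size of that arc varies from node to node.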

2. Routing Architecture: Finger Tables

Deterministic ownership is
