Hash Set Pattern — LeetCode #217: Contains Duplicate

By Codcompass Team·2026-05-17·7 min read

Accelerating Duplicate Detection: The Hash Lookup Pattern

Current Situation Analysis

Data validation and deduplication are foundational operations in modern software systems. Whether you're sanitizing user input, processing event streams, or preparing datasets for machine learning pipelines, checking for repeated values is a routine requirement. Despite its frequency, this operation is routinely implemented using quadratic-time algorithms that scale poorly under production load.

The core pain point stems from a cognitive bias: developers naturally map the phrase "does any value appear twice?" to a pairwise comparison model. This mental model translates directly into nested iterations, where each element is scanned against every subsequent element. While logically sound for small datasets, this approach degrades rapidly as cardinality increases. A dataset of 10,000 items requires roughly 50 million comparisons. At 100,000 items, that jumps to 5 billion. In cloud environments where compute time directly correlates with cost, quadratic scaling becomes an immediate liability.

This problem is frequently overlooked because early-stage development rarely exercises the upper bounds of input size. Local testing with dozens of elements masks the algorithmic inefficiency. Additionally, many engineers fail to recognize the linguistic cue embedded in the problem statement. Phrases like "have I encountered this before?", "is this value already registered?", or "does this exist elsewhere?" are direct signals for constant-time lookup structures. When these cues are missed, teams default to sorting or nested loops, introducing unnecessary latency and memory pressure.

Empirical benchmarks across JavaScript and TypeScript runtimes consistently show that hash-based lookups maintain linear time complexity regardless of input distribution. The trade-off is explicit: you exchange O(n) auxiliary space for O(n) time. In nearly all modern architectures, this swap is favorable. Memory is cheap and predictable; CPU cycles spent on redundant comparisons are not. Recognizing this pattern early prevents architectural debt and establishes a foundation for more advanced frequency-tracking and sliding-window algorithms.

WOW Moment: Key Findings

The performance divergence between naive and optimized approaches becomes stark when measured across realistic workloads. The following comparison isolates the critical dimensions that dictate production viability.

Approach	Time Complexity	Space Complexity	Early Exit Capability	Memory Footprint
Nested Iteration	O(n²)	O(1)	Yes	Minimal
Sorting + Adjacent Check	O(n log n)	O(1) or O(n)	No	Low to Moderate
Hash Set (Full Scan)	O(n)	O(n)	No	High
Hash Set (Early Exit)	O(n)	O(n)	Yes	High

The early-exit hash set approach dominates in production scenarios where duplicates are likely to appear before the end of the dataset. By terminating execution immediately upon detection, you avoid processing the remaining 60–90% of the input. This is particularly valuable in API validation layers, where rejecting malformed payloads quickly reduces downstream load.

The full-scan one-liner

(new Set(collection).size !== collection.length) is syntactically elegant but operationally rigid. It guarantees complete traversal regardless of data distribution. In high-throughput systems, this unnecessary work accumulates into measurable latency spikes. The early-exit variant preserves the linear time guarantee while adding conditional termination, making it the superior default for validation logic.

Understanding this trade-off unlocks a broader architectural principle: constant-time membership testing is the cornerstone of efficient data processing. Once you internalize the "track what you've seen" mental model, you can extend it to frequency counting, index mapping, and windowed deduplication without rewriting core logic.

Core Solution

The optimal implementation relies on a single-pass traversal paired with a hash-based membership registry. The algorithm maintains a strict invariant: at iteration i, the registry contains every unique value encountered from index 0 through i-1. When the current element matches an existing entry, the invariant is violated, confirming a duplicate.

Implementation Strategy

Initialize a membership registry: Use a native Set structure. It provides O(1) average-case insertion and lookup without storing redundant metadata.
Iterate sequentially: Process elements in input order. No sorting or preprocessing required.
Check before storing: Query the registry for the current element. If present, return immediately.
Register and continue: If absent, add the element to the registry and proceed to the next iteration.
Handle completion: If the loop finishes without triggering a return, all values are unique.

TypeScript Implementation

/**
 * Validates whether a collection contains repeated values.
 * Uses early-exit hash lookup for optimal average-case performance.
 */
export function hasRepeatedEntry<T>(collection: readonly T[]): boolean {
  if (!Array.isArray(collection)) {
    throw new TypeError('Expected an array-like collection');
  }

  const membershipRegistry = new Set<T>();

  for (const currentItem of collection) {
    if (membershipRegistry.has(currentItem)) {
      return true;
    }
    membershipRegistry.add(currentItem);
  }

  return false;
}

Architecture Decisions & Rationale

Why Set over Map?
A Map stores key-value pairs, which introduces unnecessary memory overhead when you only need existence verification. Set allocates exactly one slot per unique value, reducing heap pressure. In TypeScript, Set<T> also preserves type safety without requiring dummy values.

Why early exit over full traversal?
Conditional termination transforms worst-case O(n) into best-case O(1) when duplicates cluster near the start. In real-world data, duplicates often follow predictable patterns (e.g., repeated user IDs, cached tokens, or malformed batch uploads). Early exit captures these cases without penalizing unique datasets.

Why not sort first?
Sorting requires O(n log n) time and either mutates the input or creates a shallow copy. While O(n log n) is acceptable for static datasets, it introduces unpredictable latency spikes and breaks referential integrity. Hash lookup maintains input immutability and guarantees linear scaling.

Why readonly T[]?
Accepting a readonly array signals that the function does not mutate input. This aligns with functional validation patterns and prevents accidental side effects in shared state environments.

Pitfall Guide

1. Ignoring Early Exit in Favor of One-Liners

Explanation: Writing new Set(arr).size !== arr.length is concise but forces complete traversal. In production pipelines processing thousands of requests per second, this adds unnecessary CPU cycles. Fix: Always implement the loop with conditional return. Reserve one-liners for static analysis or non-performance-critical scripts.

2. Assuming O(1) Guarantees Under Hash Collisions

Explanation: JavaScript engines use optimized hashing, but pathological inputs can trigger collision chains, degrading lookups to O(n) in worst-case scenarios. Fix: For security-sensitive contexts (e.g., user-controlled input), consider cryptographic hashing or fallback to balanced tree structures. In 99% of application logic, native Set remains safe.

3. Storing Metadata When Only Existence Matters

Explanation: Developers sometimes use Map<T, number> to track frequencies when the requirement is strictly binary (duplicate vs unique). This doubles memory allocation per entry. Fix: Use Set for existence checks. Upgrade to Map only when frequency thresholds, indices, or timestamps are required downstream.

4. Mutating Input Arrays for Sorting-Based Solutions

Explanation: Calling arr.sort() modifies the original reference. If the array is shared across modules or cached, this introduces silent state corruption. Fix: Never sort in-place for validation. If sorting is unavoidable, create a shallow copy first: [...collection].sort(). Prefer hash lookup to avoid the issue entirely.

5. Overlooking Type Constraints in Generic Implementations

Explanation: Using any or omitting generics in TypeScript erases compile-time safety, allowing incompatible types to bypass validation logic. Fix: Always parameterize with <T>. Add runtime type guards if the collection originates from untrusted sources (e.g., API payloads).

6. Applying Hash Lookup to Non-Primitive Types Without Care

Explanation: Objects and arrays are compared by reference, not value. { id: 1 } and { id: 1 } are treated as distinct entries in a Set. Fix: Serialize complex objects to stable strings (e.g., JSON.stringify) or extract primitive keys before insertion. Document this behavior explicitly in function contracts.

7. Neglecting Memory Pressure in Long-Running Processes

Explanation: In streaming or infinite-loop scenarios, continuously adding to a Set without eviction causes unbounded heap growth. Fix: Implement bounded caches with LRU eviction, or use probabilistic structures like Bloom filters when approximate deduplication is acceptable.

Production Bundle

Action Checklist

Validate input type before processing to prevent runtime type errors
Prefer Set over Map when only existence tracking is required
Implement early-exit logic to minimize CPU usage on duplicate-heavy datasets
Avoid in-place sorting; maintain input immutability for safe reuse
Parameterize functions with generics to preserve TypeScript type safety
Handle non-primitive types by extracting stable keys or serializing values
Add unit tests covering empty arrays, single elements, all-duplicates, and all-unique cases
Benchmark against realistic data distributions before deploying to high-throughput endpoints

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Small datasets (<1K items), strict memory limits	Nested iteration	O(1) space avoids allocation overhead	Lowest memory cost, acceptable CPU
Large datasets, duplicates likely early	Hash Set with early exit	O(n) time with conditional termination	Optimal CPU/memory balance
Static data, one-time validation, code brevity prioritized	Full-scan Set one-liner	Concise syntax, predictable behavior	Higher CPU cost, negligible in batch jobs
Need index tracking or frequency counts	Hash Map with value storage	Extends existence check to metadata	Higher memory, enables advanced analytics
Memory-constrained embedded or edge runtime	Sorting + adjacent check	O(1) auxiliary space, no dynamic allocation	Higher CPU, deterministic memory usage

Configuration Template

// deduplication-validator.ts
export interface DeduplicationOptions<T> {
  /** Custom key extractor for non-primitive types */
  keyExtractor?: (item: T) => string | number;
  /** Maximum registry size before throwing (prevents memory leaks) */
  maxRegistrySize?: number;
}

export function createDeduplicationValidator<T>(
  options: DeduplicationOptions<T> = {}
): (collection: readonly T[]) => boolean {
  const { keyExtractor, maxRegistrySize } = options;

  return (collection: readonly T[]): boolean => {
    const registry = new Set<string | number>();

    for (const item of collection) {
      const key = keyExtractor ? keyExtractor(item) : (item as any);

      if (registry.has(key)) {
        return true;
      }

      if (maxRegistrySize && registry.size >= maxRegistrySize) {
        throw new RangeError(
          `Registry exceeded maximum size of ${maxRegistrySize}. Consider bounded caching.`
        );
      }

      registry.add(key);
    }

    return false;
  };
}

// Usage example:
// const validateUserIds = createDeduplicationValidator<{ id: string }>({
//   keyExtractor: (user) => user.id,
//   maxRegistrySize: 10000
// });
// const hasDuplicates = validateUserIds(userBatch);

Quick Start Guide

Install/Import: Copy the hasRepeatedEntry function into your validation utilities module. No external dependencies required.
Define Input: Pass your array to the function. Ensure elements are primitives or implement a stable equality strategy.
Handle Result: The function returns true if duplicates exist, false otherwise. Branch your logic accordingly (e.g., reject request, log warning, or proceed).
Extend if Needed: For complex objects, supply a keyExtractor or serialize values before comparison. Adjust maxRegistrySize for streaming contexts.
Validate: Run against edge cases: empty arrays, single-item arrays, fully duplicate arrays, and large unique datasets. Confirm early exit triggers on duplicate-heavy inputs.

🎉 Mid-Year Sale — Unlock Full Article

Base plan from just $4.99/mo or $49/yr

7-day free trial · Cancel anytime · 30-day money-back