DevOps · 2026-05-06 · 51 min read

Why I Made WebSocket Delivery the Disposable Part of the Tracking System

By Kingsley Onoh


Current Situation Analysis

The most dangerous failure in real-time tracking systems is not a loud carrier outage. Loud failures are easily logged, retried, and recovered. The silent failure occurs when the gateway successfully receives a carrier update, persists it to the database, but Redis goes down exactly when the WebSocket fanout should trigger. The client's live view freezes, creating the illusion of a broken system, even though the shipment state and event timeline remain perfectly accurate in the database.

Traditional real-time architectures fail because they treat the delivery layer (WebSockets/Redis) as the source of truth. This creates several critical failure modes:

  • State Drift: When the cache or socket layer fails, the UI diverges from the actual shipment state, forcing expensive full-page reloads or reconciliation logic.
  • Coupled Processing & Delivery: Direct socket broadcasting ties event processing to live connection management. If a worker restarts, a send throws, or the event lands on a node that doesn't own the socket, delivery fails or duplicates occur.
  • Polling Stack & Load Spikes: Naive polling loops without cycle guards create overlapping requests, artificially spiking carrier API load and corrupting downstream latency metrics.
  • Timestamp Interpretation Drift: Using timestamp without time zone in PostgreSQL while passing JavaScript Date objects through Drizzle/node-postgres introduces subtle interpretation drift, breaking cursor pagination and out-of-order scan handling.

The system becomes fragile when delivery expectations override data correctness. The solution requires deliberately decoupling truth from delivery.

WOW Moment: Key Findings

| Approach | State Correctness on Redis Failure | P95 Delivery Latency | Dedup Overhead | Recovery Complexity |
| --- | --- | --- | --- | --- |
| WebSocket-First (Cache as Truth) | Fails silently; UI freezes until reconnect | 45ms | Low (in-memory) | High (requires full state sync) |
| Direct Socket Broadcast | Fails on worker restart/missing socket | 30ms | Medium (per-node maps) | High (reconnection storms) |
| DB-First + Redis Streams (This Architecture) | 100% preserved; UI resumes from DB on reconnect | 180ms (fanout + group consume) | Low (DB constraint + deterministic key) | Low (asymmetric failure handling) |

Key Findings:

  • Making the database the authoritative truth and the WebSocket stream a disposable delivery layer eliminates state drift during cache outages.
  • Moving deduplication from polling loops to deterministic event identity reduces per-carrier complexity and removes the need for duplicate caches.
  • Asymmetric failure handling (DB failure = failed, Redis failure = processed with published: false) ensures business-correctness is never compromised for delivery speed.

Core Solution

1. Adapter Normalization & Deterministic Deduplication

Carriers (DHL, DPD, GLS) return inconsistent payloads, status codes, and timestamp formats. The adapter layer normalizes these into a single shape: carrier, tracking number, carrier status, normalized status, carrier timestamp, and a deterministic dedupKey.
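As an illustration, the normalized shape can be captured in a single type. The field names follow the list above; the status union values are assumptions, not the exact production enum:

```typescript
// Sketch of the normalized event shape described above.
// The NormalizedStatus values are illustrative assumptions.
type NormalizedStatus =
  | "created"
  | "in_transit"
  | "out_for_delivery"
  | "delivered"
  | "exception";

interface NormalizedTrackingEvent {
  carrier: string;                    // normalized carrier slug, e.g. "dhl"
  trackingNumber: string;
  carrierStatus: string;              // raw status code as the carrier sent it
  normalizedStatus: NormalizedStatus; // system-wide vocabulary
  carrierTimestamp: string;           // UTC ISO-8601
  dedupKey: string;                   // deterministic event identity
}
```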

export function generateDedupKey(input: DedupKeyInput): string {
  // Normalize every component so the same physical scan always produces the
  // same key, regardless of which poll cycle or process observed it.
  const carrier = normalizeCarrier(input.carrier);
  const trackingNumber = assertRequiredString(input.trackingNumber, "trackingNumber");
  const carrierStatus = assertRequiredString(input.carrierStatus, "carrierStatus");
  const carrierTimestampIso = normalizeTimestamp(input.carrierTimestamp);

  // Shape: "carrier:trackingNumber:carrierStatus:isoTimestamp"
  return [carrier, trackingNumber, carrierStatus, carrierTimestampIso].join(":");
}

This function carries the system. Carriers send duplicate scans, polls retry after timeouts, and processes restart fetching identical history. If the fact is identical, the key is identical. Deduplication moves into event identity, freeing the poller, adapters, and WebSocket layer from duplicate logic.
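The helpers referenced in the snippet are deliberately small. A minimal sketch of what they might look like (the names match the snippet; the implementations here are assumptions):

```typescript
// Hypothetical implementations of the helpers used by generateDedupKey.

function normalizeCarrier(value: string): string {
  // Lowercase, trimmed slug so "DHL " and "dhl" produce identical keys.
  const carrier = value.trim().toLowerCase();
  if (!carrier) throw new Error("carrier is required");
  return carrier;
}

function assertRequiredString(value: unknown, field: string): string {
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`${field} is required`);
  }
  return value.trim();
}

function normalizeTimestamp(value: string | Date): string {
  // Always emit UTC ISO-8601 with millisecond precision, so equal instants
  // in different carrier formats collapse to one key component.
  const date = value instanceof Date ? value : new Date(value);
  if (Number.isNaN(date.getTime())) throw new Error("carrierTimestamp is invalid");
  return date.toISOString();
}
```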

2. The Processor: Load-Bearing Correctness

The processor handles the hard correctness problem. It starts with a dedup lookup, inserts into tracking_events with onConflictDoNothing(), and updates the shipment projection only when the new carrier timestamp is strictly newer than last_event_at.

import { and, eq, isNull, lt, or } from "drizzle-orm";

// Advance the projection only when the incoming carrier timestamp is
// strictly newer than what the shipment row already shows.
const updatedRows = await options.db
  .update(shipments)
  .set({
    currentStatus: event.normalizedStatus,
    lastEventAt: event.carrierTimestamp,
    updatedAt: new Date()
  })
  .where(
    and(
      eq(shipments.id, event.shipmentId),
      // Either the first event ever, or strictly newer than the projection.
      or(isNull(shipments.lastEventAt), lt(shipments.lastEventAt, event.carrierTimestamp))
    )
  )
  .returning({ id: shipments.id });
projectionUpdated = updatedRows.length > 0;

This establishes the boundary between history and projection. Out-of-order scans persist in the event timeline but never revert the visible dashboard state. Redis publish happens only after successful database writes. If the DB fails, the processor returns failed. If Redis fails, it returns processed with published: false. Integration tests validate both paths: invalid shipment IDs leave Redis empty, and simulated redis unavailable errors still leave the PostgreSQL event intact.
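The asymmetric failure contract can be reduced to a small control-flow sketch. The result shapes and function names below are assumptions for illustration, not the production types:

```typescript
// Sketch of the asymmetric failure handling described above.
type ProcessResult =
  | { status: "failed"; reason: string }         // DB write failed: fail loudly, retry later
  | { status: "processed"; published: boolean }; // truth persisted; publish is best-effort

async function processEvent(
  persist: () => Promise<void>, // writes tracking_events + projection in PostgreSQL
  publish: () => Promise<void>  // pushes the entry onto the Redis stream
): Promise<ProcessResult> {
  try {
    await persist();
  } catch (err) {
    // Database failure: business truth was not recorded, so the event must fail.
    return { status: "failed", reason: err instanceof Error ? err.message : String(err) };
  }
  try {
    await publish();
    return { status: "processed", published: true };
  } catch {
    // Redis failure: truth is safe in PostgreSQL; delivery is disposable.
    return { status: "processed", published: false };
  }
}
```

The key property: no Redis error can ever cause a `failed` result, and no `processed` result can exist without a durable database write.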

3. Redis Streams Delivery Contract

Direct broadcasting couples processing to live delivery. Instead, the processor writes a single stream entry to tracking:events. A broadcaster consumes as group ws-broadcaster, parses the payload, maps tracking numbers to active connection IDs, and sends JSON to registered sockets.

Key delivery rules:

  • Malformed entries are acknowledged: Invalid JSON (e.g., {) is logged and acked immediately. Leaving it pending blocks operational visibility and marks the consumer group unhealthy.
  • XAUTOCLAIM for stuck messages: If a consumer dies before acking, the broadcaster reclaims pending entries after 60 seconds. This supports single-service deployments and provides a clear upgrade path for multi-process delivery.
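The ack decision for malformed entries can be isolated in a pure function, which keeps it trivially testable apart from the Redis client. A minimal sketch (the payload field and event shape are assumptions):

```typescript
// Sketch: decide what to do with one raw stream entry payload.
interface EntryDecision {
  ack: boolean;                              // always true: malformed entries are acked, not retried
  event: { trackingNumber: string } | null;  // null when the entry is malformed
}

function handleStreamEntry(rawPayload: string): EntryDecision {
  try {
    const parsed = JSON.parse(rawPayload);
    if (typeof parsed?.trackingNumber !== "string") {
      // Valid JSON but structurally unusable: log-and-ack so the
      // pending entries list stays clean.
      return { ack: true, event: null };
    }
    return { ack: true, event: parsed };
  } catch {
    // Invalid JSON (e.g. "{"): ack immediately rather than blocking the group.
    return { ack: true, event: null };
  }
}
```

The broadcaster then only ever skips delivery; it never leaves an entry pending because of a parse error.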

4. Boring Polling Engine

The polling engine deliberately avoids cleverness. It loads enabled carrier configs, starts one loop per carrier, queries active shipments (excluding delivered, returned, deleted_at), and batches by POLL_BATCH_SIZE (default: 10). Each call uses withExponentialBackoff with baseDelayMs * 2 ** attempt.
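The backoff formula from the text can be sketched directly. The option names and the injectable `sleep` (which makes the delay schedule testable) are assumptions; only the `baseDelayMs * 2 ** attempt` formula comes from the text:

```typescript
// Sketch of withExponentialBackoff: delay = baseDelayMs * 2 ** attempt.
interface BackoffOptions {
  maxAttempts: number;
  baseDelayMs: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

async function withExponentialBackoff<T>(
  fn: () => Promise<T>,
  { maxAttempts, baseDelayMs, sleep = (ms) => new Promise((r) => setTimeout(r, ms)) }: BackoffOptions
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        // Delays grow 1x, 2x, 4x, ... of the base.
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```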

The scheduler enforces one critical rule: skip overlapping cycles. If a carrier cycle is still running when the next interval fires, it returns skipped with reason cycle_already_running. This prevents artificial load spikes and keeps downstream metrics truthful. Target: 100 active shipments across 3 carriers, carrier cycle < 30s, WebSocket delivery < 200ms post-pipeline entry.
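The overlap guard itself fits in a few lines. A minimal sketch, assuming a per-carrier closure and the `cycle_already_running` reason string from the text (the result shape is otherwise an assumption):

```typescript
// Sketch of the "skip overlapping cycles" rule: one guard per carrier loop.
type CycleResult =
  | { status: "completed" }
  | { status: "skipped"; reason: "cycle_already_running" };

function createCycleGuard(runCycle: () => Promise<void>) {
  let running = false;
  return async function tick(): Promise<CycleResult> {
    if (running) {
      // The previous cycle for this carrier has not finished;
      // skip instead of stacking requests on the carrier API.
      return { status: "skipped", reason: "cycle_already_running" };
    }
    running = true;
    try {
      await runCycle();
      return { status: "completed" };
    } finally {
      running = false;
    }
  };
}
```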

5. Timestamp & Pagination Handling

Cursor pagination failed because timestamp without time zone combined with JavaScript Date serialization introduced interpretation drift through Drizzle/node-postgres. The comparison against the cursor timestamp behaved inconsistently, repeating shipments across pages. The fix requires explicit timezone-aware casting at the database boundary and strict ISO-8601 normalization before cursor generation.
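The boundary normalization can be sketched as a single conversion point that every cursor passes through. The cursor encoding below is an illustrative assumption; only the ISO-8601/UTC normalization rule comes from the text:

```typescript
// Sketch: normalize any incoming timestamp to UTC ISO-8601 before it is
// compared against a cursor or embedded in one.
function toUtcIso(value: string | Date): string {
  const date = value instanceof Date ? value : new Date(value);
  if (Number.isNaN(date.getTime())) {
    throw new Error(`invalid timestamp: ${String(value)}`);
  }
  return date.toISOString();
}

// Hypothetical cursor helpers: encode the last row of a page as an opaque token.
function encodeCursor(lastEventAt: string | Date, id: string): string {
  return Buffer.from(JSON.stringify({ t: toUtcIso(lastEventAt), id })).toString("base64url");
}

function decodeCursor(cursor: string): { t: string; id: string } {
  return JSON.parse(Buffer.from(cursor, "base64url").toString("utf8"));
}
```

Because `toUtcIso` is the only path into a cursor, two representations of the same instant (local offset vs. UTC) can never produce different cursor comparisons.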

Pitfall Guide

  1. Treating WebSocket/Redis as Source of Truth: When the cache layer fails, the UI freezes and forces expensive full-state reconciliation. Always treat the database as the single source of truth; make delivery layers disposable and idempotent.
  2. Coupling Event Processing to Live Socket Delivery: Direct socket.send() ties business logic to connection lifecycle. Worker restarts, missing sockets, or network drops cause silent failures. Decouple via message queues/streams with explicit consumer groups.
  3. Implementing Deduplication in the Polling Loop: Pollers shouldn't track "what was seen last time". Carriers retry, processes restart, and history refetches. Move dedup to deterministic event identity (dedupKey) and let database constraints (onConflictDoNothing) handle collisions.
  4. Allowing Out-of-Order Scans to Revert Projections: Carriers frequently send delayed or out-of-sequence updates. Updating currentStatus without timestamp guards (lt(shipments.lastEventAt, event.carrierTimestamp)) causes dashboard flickering and incorrect state regression.
  5. Letting Malformed Stream Entries Block Consumer Groups: Invalid JSON or corrupted payloads will stall Redis Streams consumer groups indefinitely, masking real operational issues. Acknowledge and log malformed entries immediately to maintain group health visibility.
  6. Ignoring Polling Cycle Overlap Guards: Without cycle guards, high-latency carrier responses cause overlapping polls, creating artificial load spikes and corrupting rate-limit metrics. Always skip overlapping cycles and return structured skipped results.
  7. Using timestamp without time zone with JavaScript Date: PostgreSQL's lack of timezone context combined with JS Date serialization causes interpretation drift, breaking cursor pagination and timestamp comparisons. Always normalize to UTC ISO-8601 at the application boundary and use timestamp with time zone in schema definitions.

Deliverables

  • Architecture Blueprint: Visual data flow diagram showing adapter normalization β†’ deterministic dedup β†’ PostgreSQL truth layer β†’ Redis Streams fanout β†’ WebSocket delivery. Includes failure boundaries, async handoff points, and projection update guards.
  • Deployment & Validation Checklist: 18-step checklist covering carrier adapter registration, Redis consumer group initialization (ws-broadcaster), XAUTOCLAIM configuration, polling cycle guard validation, integration test execution (DB failure vs Redis failure paths), and timezone casting verification.
  • Configuration Templates: Ready-to-use YAML/JSON templates for carrier_configs (backoff base, batch size, rate limits), Redis stream consumer group setup, Drizzle schema definitions with timezone-safe timestamp columns, and polling scheduler parameters aligned to the 100-shipment/3-carrier target.