Why I Made WebSocket Delivery the Disposable Part of the Tracking System
Current Situation Analysis
The most dangerous failure in real-time tracking systems is not a loud carrier outage. Loud failures are easily logged, retried, and recovered. The silent failure occurs when the gateway successfully receives a carrier update and persists it to the database, but Redis goes down exactly when the WebSocket fanout should fire. The client's live view freezes, creating the illusion of a broken system, even though the shipment state and event timeline remain perfectly accurate in the database.
Traditional real-time architectures fail because they treat the delivery layer (WebSockets/Redis) as the source of truth. This creates several critical failure modes:
- State Drift: When the cache or socket layer fails, the UI diverges from the actual shipment state, forcing expensive full-page reloads or reconciliation logic.
- Coupled Processing & Delivery: Direct socket broadcasting ties event processing to live connection management. If a worker restarts, a send throws, or the event lands on a node that doesn't own the socket, delivery fails or duplicates occur.
- Poll Stacking & Load Spikes: Naive polling loops without cycle guards create overlapping requests, artificially spiking carrier API load and corrupting downstream latency metrics.
- Timestamp Interpretation Drift: Using `timestamp without time zone` in PostgreSQL while passing JavaScript `Date` objects through Drizzle/node-postgres introduces subtle interpretation drift, breaking cursor pagination and out-of-order scan handling.
The system becomes fragile when delivery expectations override data correctness. The solution requires deliberately decoupling truth from delivery.
WOW Moment: Key Findings
| Approach | State Correctness on Redis Failure | P95 Delivery Latency | Dedup Overhead | Recovery Complexity |
|---|---|---|---|---|
| WebSocket-First (Cache as Truth) | Fails silently; UI freezes until reconnect | 45ms | Low (in-memory) | High (requires full state sync) |
| Direct Socket Broadcast | Fails on worker restart/missing socket | 30ms | Medium (per-node maps) | High (reconnection storms) |
| DB-First + Redis Streams (This Architecture) | 100% preserved; UI resumes from DB on reconnect | 180ms (fanout + group consume) | Low (DB constraint + deterministic key) | Low (asymmetric failure handling) |
Key Findings:
- Making the database the authoritative truth and the WebSocket stream a disposable delivery layer eliminates state drift during cache outages.
- Moving deduplication from polling loops to deterministic event identity reduces per-carrier complexity and removes the need for duplicate caches.
- Asymmetric failure handling (DB failure = `failed`, Redis failure = `processed` with `published: false`) ensures business correctness is never compromised for delivery speed.
Core Solution
1. Adapter Normalization & Deterministic Deduplication
Carriers (DHL, DPD, GLS) return inconsistent payloads, status codes, and timestamp formats. The adapter layer normalizes these into a single shape: carrier, tracking number, carrier status, normalized status, carrier timestamp, and a deterministic dedupKey.
```typescript
export function generateDedupKey(input: DedupKeyInput): string {
  const carrier = normalizeCarrier(input.carrier);
  const trackingNumber = assertRequiredString(input.trackingNumber, "trackingNumber");
  const carrierStatus = assertRequiredString(input.carrierStatus, "carrierStatus");
  const carrierTimestampIso = normalizeTimestamp(input.carrierTimestamp);
  return [carrier, trackingNumber, carrierStatus, carrierTimestampIso].join(":");
}
```
This function carries the system. Carriers send duplicate scans, polls retry after timeouts, and processes restart fetching identical history. If the fact is identical, the key is identical. Deduplication moves into event identity, freeing the poller, adapters, and WebSocket layer from duplicate logic.
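The helper functions used above (`normalizeCarrier`, `assertRequiredString`, `normalizeTimestamp`) are not shown in this excerpt. A minimal sketch of what they might look like, assuming the three-carrier list (DHL, DPD, GLS) and the UTC ISO-8601 normalization described here:

```typescript
// Hypothetical sketches of the helpers referenced by generateDedupKey.
// Names and exact validation rules are assumptions, not the original code.

function normalizeCarrier(raw: string): string {
  const carrier = raw.trim().toLowerCase();
  const known = new Set(["dhl", "dpd", "gls"]);
  if (!known.has(carrier)) {
    throw new Error(`Unknown carrier: ${raw}`);
  }
  return carrier;
}

function assertRequiredString(value: unknown, field: string): string {
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`Missing required field: ${field}`);
  }
  return value.trim();
}

// Normalize any parseable timestamp to UTC ISO-8601 so the same physical
// scan always yields the same string, regardless of the carrier's format.
function normalizeTimestamp(raw: string | Date): string {
  const date = raw instanceof Date ? raw : new Date(raw);
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Invalid carrier timestamp: ${String(raw)}`);
  }
  return date.toISOString();
}
```

With helpers like these, two polls that observe the same scan produce byte-identical keys, so the database's unique constraint can reject the duplicate without any poller-side memory.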
2. The Processor: Load-Bearing Correctness
The processor handles the hard correctness problem. It starts with a dedup lookup, inserts into tracking_events with onConflictDoNothing(), and updates the shipment projection only when the new carrier timestamp is strictly newer than last_event_at.
```typescript
const updatedRows = await options.db
  .update(shipments)
  .set({
    currentStatus: event.normalizedStatus,
    lastEventAt: event.carrierTimestamp,
    updatedAt: new Date()
  })
  .where(
    and(
      eq(shipments.id, event.shipmentId),
      or(isNull(shipments.lastEventAt), lt(shipments.lastEventAt, event.carrierTimestamp))
    )
  )
  .returning({ id: shipments.id });

projectionUpdated = updatedRows.length > 0;
```
This establishes the boundary between history and projection. Out-of-order scans persist in the event timeline but never revert the visible dashboard state. Redis publish happens only after successful database writes. If the DB fails, the processor returns failed. If Redis fails, it returns processed with published: false. Integration tests validate both paths: invalid shipment IDs leave Redis empty, and simulated redis unavailable errors still leave the PostgreSQL event intact.
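The asymmetric contract can be condensed into a small sketch. Here `persistEvent` and `publishToStream` are hypothetical stand-ins for the real database insert and Redis publish; only the result shape comes from the text:

```typescript
// Sketch of the asymmetric failure contract: the DB write is the only
// step allowed to fail the event; Redis publish is best-effort.

type ProcessResult =
  | { status: "failed"; error: string }          // DB write failed: nothing to deliver
  | { status: "processed"; published: boolean }; // truth persisted; delivery best-effort

async function processEvent(
  persistEvent: () => Promise<void>,
  publishToStream: () => Promise<void>
): Promise<ProcessResult> {
  // 1. Persist the truth. A failure here fails the whole event.
  try {
    await persistEvent();
  } catch (err) {
    return { status: "failed", error: String(err) };
  }
  // 2. Publish for delivery. A failure here never invalidates the
  //    already-persisted event; it only flags the fanout as skipped.
  try {
    await publishToStream();
    return { status: "processed", published: true };
  } catch {
    return { status: "processed", published: false };
  }
}
```

Note the ordering: because publish happens strictly after the write, a consumer can always rehydrate from the database on reconnect, which is what makes the delivery layer disposable.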
3. Redis Streams Delivery Contract
Direct broadcasting couples processing to live delivery. Instead, the processor writes a single stream entry to tracking:events. A broadcaster consumes as group ws-broadcaster, parses the payload, maps tracking numbers to active connection IDs, and sends JSON to registered sockets.
Key delivery rules:
- Malformed entries are acknowledged: Invalid JSON (e.g., `{`) is logged and acked immediately. Leaving it pending blocks operational visibility and marks the consumer group unhealthy.
- XAUTOCLAIM for stuck messages: If a consumer dies before acking, the broadcaster reclaims pending entries after 60 seconds. This supports single-service deployments and provides a clear upgrade path for multi-process delivery.
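The malformed-entry rule is easiest to see as a pure decision function, factored out of the Redis client so it can be tested without a running stream. The entry shape and field names below are assumptions for illustration:

```typescript
// Sketch of the broadcaster's per-entry decision. Both branches end in
// an ack; the only question is whether the payload is deliverable.

type EntryDecision =
  | { action: "deliver"; trackingNumber: string; payload: unknown }
  | { action: "ack_malformed"; reason: string }; // log + ack immediately, never leave pending

function decideEntry(rawJson: string): EntryDecision {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch (err) {
    // Rule from the delivery contract: malformed JSON is logged and acked.
    // Leaving it pending would keep the consumer group permanently unhealthy.
    return { action: "ack_malformed", reason: `invalid JSON: ${String(err)}` };
  }
  const trackingNumber = (parsed as { trackingNumber?: unknown })?.trackingNumber;
  if (typeof trackingNumber !== "string" || trackingNumber === "") {
    return { action: "ack_malformed", reason: "missing trackingNumber" };
  }
  return { action: "deliver", trackingNumber, payload: parsed };
}
```

A `deliver` result is what the broadcaster would map to active connection IDs before sending; in either case the entry is acked so the pending list reflects only genuinely in-flight work.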
4. Boring Polling Engine
The polling engine deliberately avoids cleverness. It loads enabled carrier configs, starts one loop per carrier, queries active shipments (excluding delivered, returned, deleted_at), and batches by POLL_BATCH_SIZE (default: 10). Each call uses withExponentialBackoff with baseDelayMs * 2 ** attempt.
The scheduler enforces one critical rule: skip overlapping cycles. If a carrier cycle is still running when the next interval fires, it returns skipped with reason cycle_already_running. This prevents artificial load spikes and keeps downstream metrics truthful. Target: 100 active shipments across 3 carriers, carrier cycle < 30s, WebSocket delivery < 200ms post-pipeline entry.
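The two scheduling rules can be sketched as follows. The delay formula (`baseDelayMs * 2 ** attempt`) and the `skipped` result with reason `cycle_already_running` come from the text above; the surrounding names are assumptions:

```typescript
// Sketch of the polling engine's two rules: exponential backoff per call,
// and a per-carrier guard that skips overlapping cycles.

const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withExponentialBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // baseDelayMs * 2 ** attempt, as described in the text
      if (attempt < maxAttempts - 1) await delay(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}

type CycleResult =
  | { status: "completed" }
  | { status: "skipped"; reason: "cycle_already_running" };

// One guard flag per carrier: if the previous cycle is still running when
// the interval fires, skip instead of stacking requests on the carrier API.
function makeCycleRunner(runCycle: () => Promise<void>) {
  let running = false;
  return async (): Promise<CycleResult> => {
    if (running) return { status: "skipped", reason: "cycle_already_running" };
    running = true;
    try {
      await runCycle();
      return { status: "completed" };
    } finally {
      running = false;
    }
  };
}
```

The guard returns a structured result rather than silently dropping the tick, which is what keeps the downstream latency and rate-limit metrics truthful.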
5. Timestamp & Pagination Handling
Cursor pagination failed because timestamp without time zone combined with JavaScript Date serialization introduced interpretation drift through Drizzle/node-postgres. The comparison against the cursor timestamp behaved inconsistently, repeating shipments across pages. The fix requires explicit timezone-aware casting at the database boundary and strict ISO-8601 normalization before cursor generation.
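A sketch of that boundary normalization and a timezone-safe cursor, assuming a Node environment and a hypothetical `(lastEventAt, id)` cursor shape:

```typescript
// Sketch: normalize every timestamp to UTC ISO-8601 before it is
// compared or encoded into a pagination cursor. The cursor encoding
// here is an illustrative assumption, not the original implementation.

function toUtcIso(value: Date | string): string {
  const date = value instanceof Date ? value : new Date(value);
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Invalid timestamp: ${String(value)}`);
  }
  return date.toISOString(); // same instant -> same string, always UTC
}

// Encode (lastEventAt, id) as an opaque cursor; decode it for the next page.
function encodeCursor(lastEventAt: Date | string, id: string): string {
  return Buffer.from(JSON.stringify({ t: toUtcIso(lastEventAt), id })).toString("base64url");
}

function decodeCursor(cursor: string): { t: string; id: string } {
  return JSON.parse(Buffer.from(cursor, "base64url").toString("utf8"));
}
```

On the schema side, Drizzle's pg-core `timestamp` column accepts a `withTimezone: true` option, which maps to PostgreSQL's `timestamp with time zone` and removes the interpretation ambiguity at the driver boundary.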
Pitfall Guide
- Treating WebSocket/Redis as Source of Truth: When the cache layer fails, the UI freezes and forces expensive full-state reconciliation. Always treat the database as the single source of truth; make delivery layers disposable and idempotent.
- Coupling Event Processing to Live Socket Delivery: Direct `socket.send()` ties business logic to connection lifecycle. Worker restarts, missing sockets, or network drops cause silent failures. Decouple via message queues/streams with explicit consumer groups.
- Implementing Deduplication in the Polling Loop: Pollers shouldn't track "what was seen last time". Carriers retry, processes restart, and history refetches. Move dedup to deterministic event identity (`dedupKey`) and let database constraints (`onConflictDoNothing`) handle collisions.
- Allowing Out-of-Order Scans to Revert Projections: Carriers frequently send delayed or out-of-sequence updates. Updating `currentStatus` without timestamp guards (`lt(shipments.lastEventAt, event.carrierTimestamp)`) causes dashboard flickering and incorrect state regression.
- Letting Malformed Stream Entries Block Consumer Groups: Invalid JSON or corrupted payloads will stall Redis Streams consumer groups indefinitely, masking real operational issues. Acknowledge and log malformed entries immediately to maintain group health visibility.
- Ignoring Polling Cycle Overlap Guards: Without cycle guards, high-latency carrier responses cause overlapping polls, creating artificial load spikes and corrupting rate-limit metrics. Always skip overlapping cycles and return structured `skipped` results.
- Using `timestamp without time zone` with JavaScript `Date`: PostgreSQL's lack of timezone context combined with JS `Date` serialization causes interpretation drift, breaking cursor pagination and timestamp comparisons. Always normalize to UTC ISO-8601 at the application boundary and use `timestamp with time zone` in schema definitions.
Deliverables
- Architecture Blueprint: Visual data flow diagram showing adapter normalization → deterministic dedup → PostgreSQL truth layer → Redis Streams fanout → WebSocket delivery. Includes failure boundaries, async handoff points, and projection update guards.
- Deployment & Validation Checklist: 18-step checklist covering carrier adapter registration, Redis consumer group initialization (`ws-broadcaster`), `XAUTOCLAIM` configuration, polling cycle guard validation, integration test execution (DB failure vs Redis failure paths), and timezone casting verification.
- Configuration Templates: Ready-to-use YAML/JSON templates for `carrier_configs` (backoff base, batch size, rate limits), Redis stream consumer group setup, Drizzle schema definitions with timezone-safe timestamp columns, and polling scheduler parameters aligned to the 100-shipment/3-carrier target.
