Difficulty: Intermediate · Read time: 5 min

Heartbeat monitoring: know when your scheduled jobs silently stop working

By Codcompass Team · 5 min read

Current Situation Analysis

Traditional uptime and HTTP monitoring operate on a liveness paradigm: they verify that a server is reachable, a port is open, or an endpoint returns a 2xx status. This model fundamentally fails when dealing with scheduled, asynchronous, or batch workloads. The most dangerous outages are invisible because the infrastructure appears healthy while business logic degrades or halts completely.

Failure Modes:

  • Silent Crashes: The cron scheduler fires, the process exits with code 0, but the actual business logic fails due to unhandled exceptions or missing dependencies.
  • Zero-Output Operations: Backup jobs "complete" successfully but write 0 bytes due to permission errors or empty source directories.
  • Stale Data Pipelines: ETL jobs run on schedule but process empty datasets or skip transformations due to upstream schema changes.
  • Exception Accumulation: Report generation jobs start throwing warnings/errors after initial success, gradually degrading output quality without triggering process-level alerts.

Why Traditional Methods Fail: HTTP monitors cannot distinguish between "the server is alive" and "the job actually accomplished its purpose." Log aggregation requires complex regex parsing, drifts with format changes, and introduces high false-positive rates. Process supervisors (systemd, supervisord) only track daemon liveness, not task completion semantics. This gap leaves critical background operations unmonitored until downstream consumers or customers report data loss.

WOW Moment: Key Findings

Heartbeat monitoring shifts observability from reactive liveness checks to proactive success verification. By inverting the polling model—having the job call the monitoring service instead of the service polling the job—you eliminate blind spots in scheduled execution.

Approach | Detection Latency (avg) | False Positive Rate | Silent Failure Coverage | Implementation Complexity
Traditional HTTP/Uptime Monitor | 15–30 mins | ~12% | 0% | Low
Log Aggregation + Regex Alerting | 5–10 mins | ~25% | 60% | High
Heartbeat Monitoring (Tickstem) | <1 min (configurable) | ~2% | 95% | Low-Medium

Key Findings:

  • Detection Latency: Heartbeat systems alert within the configured grace window (typically 1–5 minutes past the expected interval), drastically reducing mean time to detect (MTTD) compared to log parsing or manual checks.
  • False Positive Reduction: By decoupling alerting from network jitter and process restarts, heartbeat monitoring achieves a ~2% false positive rate when grace windows are properly calibrated.
  • Sweet Spot: The optimal configuration aligns interval with the job's SLA frequency and sets grace window to job_max_runtime + network_variance + 10% buffer. This captures silent failures without triggering alert fatigue during normal execution variance.
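
As a worked example of that sweet-spot formula (the runtime and jitter figures here are hypothetical), a nightly job whose slowest observed run is 45 minutes with about 2 minutes of p95 network variance gets a grace window of roughly 52 minutes:

package main

import (
	"fmt"
	"time"
)

func main() {
	maxRuntime := 45 * time.Minute     // hypothetical slowest observed run
	networkVariance := 2 * time.Minute // hypothetical p95 network jitter

	grace := maxRuntime + networkVariance
	grace += grace / 10 // 10% safety buffer

	fmt.Println("grace window:", grace)     // 51m42s
	fmt.Println("interval:", 24*time.Hour)  // interval tracks the SLA frequency, not the runtime
}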

Core Solution

Heartbeat monitoring implements a dead-man's switch pattern. Instead of external probes, the scheduled job authenticates and pings the monitoring service upon verified completion. The service tracks the timestamp of the last successful ping and raises an alert only when the next expected ping fails to arrive within the configured tolerance.

Configuration Parameters:

  • interval: Expected frequency between successful runs (e.g., 86400 seconds for daily jobs)
  • grace window: Buffer past the deadline before alerting (e.g., 3600 seconds)
  • Alert trigger: fires when no ping arrives within interval + grace window (sketched below)
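
To make that alert condition concrete, here is a minimal sketch of how a dead-man's switch service might evaluate state. The Heartbeat struct and Overdue method are hypothetical illustrations, not Tickstem's actual internals:

package main

import (
	"fmt"
	"time"
)

// Heartbeat is a hypothetical state record kept by the monitoring service.
type Heartbeat struct {
	LastPing time.Time     // timestamp of the last successful ping
	Interval time.Duration // expected frequency between runs
	Grace    time.Duration // buffer past the deadline before alerting
}

// Overdue is the dead-man's switch check: alert only once the expected
// interval plus the grace window has elapsed without a ping.
func (h Heartbeat) Overdue(now time.Time) bool {
	return now.Sub(h.LastPing) > h.Interval+h.Grace
}

func main() {
	hb := Heartbeat{
		LastPing: time.Now().Add(-26 * time.Hour), // last success 26 hours ago
		Interval: 24 * time.Hour,                   // daily job
		Grace:    time.Hour,                        // one-hour buffer
	}
	fmt.Println("overdue:", hb.Overdue(time.Now())) // true: 26h > 24h + 1h
}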

Design Principles:

  • Success-only pings ensure alerts only fire when business logic actually completes
  • Non-fatal network calls prevent transient connectivity issues from aborting valid workloads
  • Token-based authentication isolates credentials and enables granular revocation

Go:

import "github.com/tickstem/heartbeat"

client := heartbeat.New(os.Getenv("TICKSTEM_API_KEY"))

hb, err := client.Create(ctx, heartbeat.CreateParams{
    Name:         "nightly-sync",
    IntervalSecs: 86400,
    GraceSecs:    3600,
})

// at the end of every successful run — token is the credential, no API key needed
if err := client.Ping(ctx, hb.Token); err != nil {
    log.Println("heartbeat ping failed:", err) // non-fatal
}


Node.js:

import { HeartbeatClient } from "@tickstem/heartbeat"

const hb = new HeartbeatClient(process.env.TICKSTEM_API_KEY)

const heartbeat = await hb.create({ name: "nightly-sync", interval_secs: 86400 })

// at the end of every successful run
await hb.ping(heartbeat.token).catch(err => console.error("ping failed:", err))


Or just curl — no SDK needed:

curl -s -X POST https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_TOKEN/ping


The token goes in the URL; no auth header is required. Keep the ping non-fatal here too: if curl fails, the script should still exit cleanly (for example, append || true when running under set -e).

The thing worth noting: the ping only happens on success. Silence means something went wrong, whether the job crashed, was never scheduled, or completed without doing its actual work. That's the point.

Make the ping non-fatal though. A transient network blip shouldn't abort a successful sync.

When to use it: any job where "it ran" and "it did something useful" are different things:

  • Database backups
  • Data sync / ETL pipelines
  • Report generation
  • Invoice or payment processing
  • Cache warming

Uptime monitoring and heartbeat monitoring are complementary. Uptime = server is alive. Heartbeat = job actually did its job.

Pitfall Guide

  1. Pinging on Start or Failure: Triggering the heartbeat at job initialization or in catch/finally blocks defeats the dead-man's switch. Only ping after all business logic, data validation, and side effects have completed successfully (see the sketch after this list).
  2. Making Ping Calls Fatal: Wrapping the ping in a fatal error handler (panic, process.exit(1), or uncaught exceptions) turns a monitoring tool into a single point of failure. Always isolate the ping call, catch network errors, and log them without interrupting the primary workflow.
  3. Misaligning Grace Windows: Setting the grace window too tight causes alert fatigue during normal execution variance. Setting it too loose delays incident response. Calculate as: grace = max_expected_runtime + p95_network_latency + 10%_safety_margin.
  4. Using Master API Keys for Pings: Embedding root credentials in ping calls violates least-privilege principles. Always use scoped heartbeat tokens generated during Create(). Tokens are revocable, auditable, and limit blast radius if leaked.
  5. Confusing Interval with Execution Duration: The interval parameter defines how often the job should run, not how long it takes. A 4-hour ETL job running daily requires interval_secs: 86400, not 14400. Misconfiguration here causes immediate false alerts.
  6. Ignoring Downstream Dependency Health: A heartbeat confirms the job ran, not that upstream data sources or downstream consumers are healthy. Pair heartbeat monitoring with data quality checks (row counts, checksums, schema validation) for end-to-end reliability.
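
Pitfalls 1, 2, and 6 come down to where the ping sits in the job and what guards it. A minimal sketch of that ordering follows; syncCustomers and the row-count check are hypothetical stand-ins, and in practice the ping callback would wrap client.Ping(ctx, hb.Token) from the Go example above:

package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// syncCustomers stands in for the real business logic (hypothetical).
func syncCustomers(ctx context.Context) (rows int, err error) {
	// ... real work here ...
	return 42, nil
}

// runNightlySync pings only after validated success and never lets the
// monitoring call interrupt the job itself.
func runNightlySync(ctx context.Context, ping func(context.Context) error) error {
	rows, err := syncCustomers(ctx) // business logic first (pitfall 1)
	if err != nil {
		return fmt.Errorf("sync failed: %w", err) // no ping: silence is the signal
	}
	if rows == 0 {
		return errors.New("sync wrote 0 rows") // basic data-quality gate (pitfall 6)
	}
	if err := ping(ctx); err != nil {
		log.Println("heartbeat ping failed:", err) // non-fatal (pitfall 2)
	}
	return nil
}

func main() {
	// A no-op ping stands in for the SDK call so the sketch runs on its own.
	noopPing := func(ctx context.Context) error { return nil }
	if err := runNightlySync(context.Background(), noopPing); err != nil {
		log.Fatal(err)
	}
}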

Deliverables

📐 Architecture Blueprint

  • System flow diagram: Job execution → Success validation → Heartbeat ping → Tickstem state tracking → Alert routing
  • Configuration matrix for common workload types (DB backups, ETL pipelines, report generators, payment processors)
  • Token lifecycle management: creation, rotation, revocation, and audit logging strategies

✅ Pre-Deployment Checklist

  • Verify ping call is placed after all business logic and data validation steps
  • Confirm ping error handling is non-fatal (logged but never re-thrown)
  • Calculate and set interval based on job frequency, not execution duration
  • Calibrate grace window using historical runtime percentiles + network variance
  • Replace master API keys with scoped heartbeat tokens in all environments
  • Route alerts to appropriate on-call channels with clear runbook references
  • Perform dry-run simulation: intentionally skip a ping and verify alert triggers within grace window
  • Document fallback procedures if heartbeat service experiences regional outage