Architecting a Zero-Cost External Dependency Monitor on the Edge

Current Situation Analysis

Modern application stacks increasingly rely on third-party services for non-core functionality: weather data, geocoding, currency conversion, news feeds, and public datasets. The assumption is that if an endpoint is publicly documented and labeled "free," it will remain stable. In practice, free APIs operate without service-level agreements, making them the weakest link in production dependency graphs.

The core pain point is silent degradation. Unlike internal microservices that trigger PagerDuty alerts on 5xx spikes, external free APIs often fail quietly. They return 200 OK with error payloads, throttle requests without warning, or decommission endpoints entirely. Development teams rarely instrument monitoring for these services because traditional APM tools are expensive, complex to configure for external targets, and introduce runtime overhead.

This blind spot is not hypothetical. In a sample of ten widely cited free weather APIs, roughly four had been decommissioned, moved behind undocumented paywalls, or required credit card verification despite documentation claiming otherwise. When such services power frontend features or background sync jobs, the failure mode shifts from "service degraded" to "user-facing breakage" with no early warning.

The misunderstanding stems from treating external APIs as infrastructure rather than volatile dependencies. Infrastructure demands uptime guarantees; volatile dependencies demand continuous verification. Building a lightweight, zero-cost verification layer bridges this gap without introducing operational debt.

WOW Moment: Key Findings

Traditional monitoring stacks require dedicated servers, database instances, and dashboard software. Even lightweight open-source alternatives demand maintenance, patching, and baseline hosting costs. By contrast, an edge-native architecture decouples data collection from presentation, leveraging serverless compute and static generation to eliminate runtime failure surfaces.

Approach	Monthly Cost	Infrastructure Complexity	Runtime Failure Risk	Data Freshness
Traditional Server + DB + Dashboard	$15–$50+	High (OS, runtime, DB, web server)	High (dashboard crashes if backend fails)	Real-time
SaaS Uptime Monitor	$10–$30+	Low (managed)	Low (vendor handles uptime)	1–5 min intervals
Edge-Static Architecture	$0	Medium (initial setup)	Near zero (static HTML has no backend to fail at request time)	Hourly/Daily

The edge-static model shifts the failure domain entirely. Because the monitoring dashboard is pre-rendered HTML served from a CDN, it remains accessible even if the target APIs go completely offline. The only moving parts are the background workers that collect and aggregate data. This architecture also naturally aligns with free-tier limits, as compute is event-driven and storage is append-only until rollup.

Core Solution

The system operates as a one-way data pipeline: collection → aggregation → static generation. Each stage is independently deployable, stateless, and bound to a specific schedule.

Phase 1: Data Collection (The Pinger)

A Cloudflare Worker executes on an hourly cron trigger. It queries a lightweight SQL database for active endpoints, performs HTTP health checks, and persists results. The worker avoids runtime dependencies by writing directly to D1 and never exposing public endpoints.

// worker/src/collector.ts
import type { D1Database } from '@cloudflare/workers-types';

interface EndpointConfig {
  id: string;
  targetUrl: string;
  method: 'GET' | 'HEAD';
  expectedStatus: number;
  timeoutMs: number;
}

interface CheckResult {
  endpointId: string;
  timestamp: string;
  statusCode: number;
  latencyMs: number;
  isHealthy: boolean;
}

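export async function runHealthChecks(env: { DB: D1Database }): Promise<void> {
  // Alias the snake_case columns to the camelCase fields of EndpointConfig.
  // all() resolves to a result object; the rows are on its `results` property.
  const { results: endpoints } = await env.DB.prepare(
    `SELECT id, target_url AS targetUrl, method,
            expected_status AS expectedStatus, timeout_ms AS timeoutMs
     FROM endpoints WHERE status = ?`
  ).bind('active').all<EndpointConfig>();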

  const results: CheckResult[] = [];

  for (const ep of endpoints) {
    const start = performance.now();
    let statusCode = 0;
    let isHealthy = false;

    // Abort the request once this endpoint's timeout elapses.
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), ep.timeoutMs);

    try {
      const response = await fetch(ep.targetUrl, {
        method: ep.method,
        signal: controller.signal,
        headers: { 'User-Agent': 'EdgeMonitor/1.0' }
      });

      statusCode = response.status;
      isHealthy = statusCode === ep.expectedStatus;
    } catch {
      // Network error or timeout: record the check as status 0, unhealthy.
      statusCode = 0;
      isHealthy = false;
    } finally {
      // Clear the timer on both the success and the failure path.
      clearTimeout(timeoutId);
    }

    const latency = Math.round(performance.now() - start);
    results.push({
      endpointId: ep.id,
      timestamp: new Date().toISOString(),
      statusCode,
      latencyMs: latency,
      isHealthy
    });
  }

  // Batch insert to minimize D1 round trips
  const batch = results.map(r => 
    env.DB.prepare(
      'INSERT INTO check_logs (endpoint_id, recorded_at, status_code, latency_ms, is_healthy) VALUES (?, ?, ?, ?, ?)'
    ).bind(r.endpointId, r.timestamp, r.statusCode, r.latencyMs, r.isHealthy ? 1 : 0)
  );

  // Skip the write entirely when no endpoints are active.
  if (batch.length > 0) {
    await env.DB.batch(batch);
  }
}

Architecture Rationale:

AbortController enforces a per-request timeout, so a single hanging endpoint cannot stall the remaining checks in the run.
Batch inserts send every statement to D1 in one call instead of one round trip per row, and D1 executes the batch as a single transaction.
No public routes are exposed. The worker runs exclusively on cron, eliminating attack surface.

Phase 2: Data Aggregation (The Builder)

A second Worker runs daily. It computes 24-hour uptime percentages, average latency, and state-change events. Raw logs older than 30 days are pruned to control storage growth. The worker generates a single JSON snapshot consumed by the static site.

// worker/src/aggregator.ts
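import type { D1Database } from '@cloudflare/workers-types';

// Field names match the column aliases returned by the rollup query below.
interface RollupMetrics {
  endpoint_id: string;
  uptime_percent: number;
  avg_latency: number;
  total_checks: number;
  last_check: string; // ISO timestamp of the most recent check
}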

export async function generateSnapshot(env: { DB: D1Database }): Promise<RollupMetrics[]> {
  const cutoff = new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString();

  const rollups = await env.DB.prepare(`
    SELECT 
      endpoint_id,
      ROUND(AVG(is_healthy) * 100, 2) as uptime_percent,
      ROUND(AVG(latency_ms), 2) as avg_latency,
      COUNT(*) as total_checks,
      MAX(recorded_at) as last_check
    FROM check_logs
    WHERE recorded_at >= ?
    GROUP BY endpoint_id
  `).bind(cutoff).all<RollupMetrics>();

  // Prune raw data older than retention window
  const retentionCutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString();
  await env.DB.prepare('DELETE FROM check_logs WHERE recorded_at < ?').bind(retentionCutoff).run();

  return rollups.results;
}

Architecture Rationale:

SQL aggregation pushes computation to the database layer, reducing Worker CPU time.
Retention policy prevents D1 storage from growing indefinitely. Free tier offers 5GB, but unbounded logs will eventually hit row-read/write limits.
The snapshot is a single payload. The static site fetches it once at build time, avoiding per-request database queries.
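
The rollup query covers uptime and latency, but not the state-change events mentioned in the phase description. One way to derive them is sketched below. The names `CheckRow`, `StateChange`, and `detectStateChanges` are hypothetical, and the sketch assumes the rows were already read from check_logs in ascending recorded_at order.

```typescript
// Hypothetical row shape, mirroring the check_logs columns used by the collector.
interface CheckRow {
  endpoint_id: string;
  recorded_at: string; // ISO 8601 timestamp
  is_healthy: number;  // 1 = healthy, 0 = unhealthy
}

interface StateChange {
  endpoint_id: string;
  at: string;
  from: 'up' | 'down';
  to: 'up' | 'down';
}

// Emits one event each time an endpoint flips between healthy and unhealthy.
// Assumes rows are sorted by recorded_at ascending; endpoints may be interleaved.
function detectStateChanges(rows: CheckRow[]): StateChange[] {
  const lastSeen = new Map<string, 'up' | 'down'>();
  const changes: StateChange[] = [];

  for (const row of rows) {
    const current = row.is_healthy === 1 ? 'up' : 'down';
    const previous = lastSeen.get(row.endpoint_id);
    if (previous !== undefined && previous !== current) {
      changes.push({ endpoint_id: row.endpoint_id, at: row.recorded_at, from: previous, to: current });
    }
    lastSeen.set(row.endpoint_id, current);
  }
  return changes;
}
```

The first row seen for an endpoint only establishes its baseline, so an endpoint that is down for the whole window produces no event. Whether that case should be reported separately is a design choice this sketch leaves open.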

Phase 3: Static Presentation (Astro + Pages)

The monitoring dashboard is a 140+ page static site. During the build process, Astro fetches the JSON snapshot, generates HTML for each endpoint, and deploys to Cloudflare Pages. No server-side rendering occurs at request time.

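// astro/src/pages/api/[endpoint].html.ts
import type { APIContext } from 'astro';

// Shape of one snapshot entry, matching the aggregator's column aliases.
interface SnapshotEntry {
  endpoint_id: string;
  uptime_percent: number;
  avg_latency: number;
  total_checks: number;
}

export async function getStaticPaths() {
  const response = await fetch('https://builder-worker.example.workers.dev/snapshot.json');
  if (!response.ok) {
    // Fail the build rather than publish pages from a missing snapshot.
    throw new Error(`Snapshot fetch failed with status ${response.status}`);
  }
  const snapshot = (await response.json()) as SnapshotEntry[];

  // One static page per monitored endpoint.
  return snapshot.map((ep) => ({
    params: { endpoint: ep.endpoint_id },
    props: { data: ep }
  }));
}

export async function GET({ props }: APIContext) {
  const data = props.data as SnapshotEntry;
  return new Response(
    `<html>
      <head><title>${data.endpoint_id} Status</title></head>
      <body>
        <h1>${data.endpoint_id}</h1>
        <p>Uptime: ${data.uptime_percent}%</p>
        <p>Avg Latency: ${data.avg_latency}ms</p>
        <p>Checks: ${data.total_checks}</p>
      </body>
    </html>`,
    { headers: { 'Content-Type': 'text/html' } }
  );
}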

Architecture Rationale:

Static generation eliminates runtime dependencies. The dashboard remains online regardless of API health.
Cloudflare Pages serves the HTML from its edge network, which keeps time to first byte low for visitors regardless of region.
Build-time data fetching ensures consistency. All pages reflect the exact same snapshot.

Pitfall Guide

1. The `fetch` Context Binding Trap

When using dependency injection to mock network calls in tests, storing fetch as an object property changes its this binding. Calling deps.fetcher(url) then throws TypeError: Illegal invocation, because the runtime checks that this is the global scope and finds the deps object instead. Fix: Bind the function explicitly: const client = { fetcher: fetch.bind(globalThis) }. Alternatively, avoid DI for built-ins and use module mocking in your test runner.
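
A minimal sketch of the bound-fetch pattern follows. The `Fetcher` and `Deps` types and the `checkStatus` function are hypothetical names chosen for illustration; only `fetch.bind(globalThis)` comes from the fix described above.

```typescript
// Minimal structural type for the injected dependency (hypothetical name).
type Fetcher = (url: string) => Promise<{ status: number }>;

interface Deps {
  fetcher: Fetcher;
}

// Production wiring: bind fetch to the global object so `this` stays valid
// when it is later invoked as deps.fetcher(url).
const productionDeps: Deps = { fetcher: fetch.bind(globalThis) };

// Code under test only ever sees the injected fetcher.
async function checkStatus(deps: Deps, url: string): Promise<number> {
  const response = await deps.fetcher(url);
  return response.status;
}

// Test wiring: a plain function has no `this` requirement, so no bind is needed.
const fakeDeps: Deps = { fetcher: async () => ({ status: 503 }) };
```

In a test, `checkStatus(fakeDeps, url)` resolves to 503 without touching the network, while production code passes `productionDeps`.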

2. The 50-Subrequest Ceiling

Cloudflare Workers on the free tier enforce a hard limit of 50 outbound HTTP requests per invocation. Attempting to ping 77 endpoints in a single cron run fails quietly after the 50th request: each further fetch call throws, the collector's catch block records it as a failed check, and the invocation still reports success. Fix: Split the workload across multiple cron triggers. Schedule cron-1 for endpoints 1–40 and cron-2 for 41–77, offset by 5 minutes. This keeps each run under the limit and spreads the load.
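
One possible way to split the list is to map each cron expression to a shard, as sketched below. `CRON_SHARDS` and `endpointsForCron` are hypothetical names, and the round-robin assignment is an assumption that differs from the fixed 1–40 and 41–77 ranges above: it keeps shards balanced as endpoints are added.

```typescript
// Hypothetical mapping from cron expression to shard index. The expressions
// must match the crons configured for the Worker.
const CRON_SHARDS: Record<string, number> = {
  '0 * * * *': 0,
  '5 * * * *': 1,
};

// Returns the endpoints assigned to the shard for the cron that fired.
// Endpoints are dealt round-robin, so shard sizes differ by at most one.
function endpointsForCron<T>(endpoints: T[], cron: string): T[] {
  const shard = CRON_SHARDS[cron];
  if (shard === undefined) return []; // unknown trigger: check nothing
  const shardCount = Object.keys(CRON_SHARDS).length;
  return endpoints.filter((_, index) => index % shardCount === shard);
}
```

Inside a Worker's scheduled handler, the expression that fired is available as the `cron` property of the scheduled event, so the handler can pass it straight through. With 77 endpoints and two shards, the runs cover 39 and 38 endpoints respectively.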

3. Deploy Hook Misalignment

Cloudflare Pages deploy hooks only function for projects connected to a Git repository. Direct uploads via wrangler pages deploy do not support programmatic rebuild triggers. Attempting to call a deploy hook URL returns 404. Fix: Use GitHub Actions or CI/CD pipelines to trigger builds. The workflow runs the static generator, then executes wrangler pages deploy. Set CLOUDFLARE_ACCOUNT_ID explicitly to bypass token scope limitations.

4. Status-Code Myopia

Relying solely on HTTP status codes creates false positives. Many free APIs return 200 OK with error payloads like {"error": "quota exceeded"} or {"message": "invalid key"}. The endpoint appears healthy but delivers broken data. Fix: Implement response body validation. Parse JSON responses and check for known error keys, or validate against a minimal schema. Add an expected_payload_keys field to your endpoint configuration.
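
A body check along these lines might look like the sketch below. `PayloadRules` and `isPayloadHealthy` are hypothetical names, and the sketch assumes the API returns a top-level JSON object. An API that legitimately returns a top-level array would need a different rule.

```typescript
// Hypothetical validation rules. expectedPayloadKeys mirrors the
// expected_payload_keys field suggested above.
interface PayloadRules {
  expectedPayloadKeys: string[];
  errorKeys: string[]; // top-level keys that signal an error, e.g. "error"
}

// Returns true only if the body parses as a JSON object, contains every
// expected key, and contains none of the known error keys.
function isPayloadHealthy(bodyText: string, rules: PayloadRules): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(bodyText);
  } catch {
    return false; // body is not valid JSON
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return false; // assumption: a healthy payload is a JSON object
  }
  const keys = Object.keys(parsed);
  const hasExpected = rules.expectedPayloadKeys.every(k => keys.includes(k));
  const hasError = rules.errorKeys.some(k => keys.includes(k));
  return hasExpected && !hasError;
}
```

Body validation needs a response body, so endpoints checked this way must use GET, not HEAD.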

5. Unbounded D1 Growth

Raw check logs accumulate rapidly. At hourly intervals across dozens of endpoints, D1 row writes and storage grow linearly. Without pruning, you will eventually hit free-tier write limits or storage caps. Fix: Implement a retention policy. Keep raw logs for 7–30 days, then delete them. Store only aggregated rollups for long-term historical tracking. Use DELETE FROM ... WHERE recorded_at < ? in your daily aggregation job.
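
To make the growth rate concrete, the arithmetic can be worked through for the 77-endpoint case. The helper names below are hypothetical, and the figures assume exactly one check per endpoint per hour with no retries.

```typescript
// Rows written to check_logs per day, at one row per endpoint per check.
function rowsPerDay(endpointCount: number, checksPerDay = 24): number {
  return endpointCount * checksPerDay;
}

// Steady-state size of check_logs once pruning keeps `retentionDays` of raw rows.
function steadyStateRows(endpointCount: number, retentionDays: number, checksPerDay = 24): number {
  return rowsPerDay(endpointCount, checksPerDay) * retentionDays;
}

console.log(rowsPerDay(77));          // 1848 rows written per day
console.log(steadyStateRows(77, 30)); // 55440 rows retained at 30 days
console.log(steadyStateRows(77, 7));  // 12936 rows retained at 7 days
```

With pruning in place the table size plateaus at the retention figure. Without it the table grows by the daily figure indefinitely.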

6. Cold Start Latency Skew

The first request to a Worker or external API after inactivity incurs cold start latency. Recording this in your metrics inflates average response times and triggers false degradation alerts. Fix: Discard the first measurement after a cold start, or use a warm-up probe. Alternatively, track p95 latency instead of avg to reduce skew from outlier cold starts.
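
The effect of one slow outlier on each statistic can be sketched as follows. The `percentile` helper is hypothetical and uses the nearest-rank definition, which is only one of several common percentile definitions.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.min(sorted.length, Math.max(1, Math.ceil((p / 100) * sorted.length)));
  return sorted[rank - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// One day of hourly checks: 23 warm responses at 100 ms, one 2000 ms outlier.
const day = [...Array.from({ length: 23 }, () => 100), 2000];
console.log(Math.round(mean(day))); // 179
console.log(percentile(day, 95));   // 100
console.log(percentile(day, 50));   // 100
```

At 24 samples per day, p95 absorbs only a single outlier; a second slow check in the same window would pull it up to the outlier value. The median (p50) tolerates more, at the cost of hiding real tail latency.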

7. Missing Retry/Timeout Logic

Network hiccups, DNS resolution delays, or temporary gateway errors cause transient failures. Without timeouts or retry logic, a single hanging request consumes subrequest quota and blocks subsequent checks. Fix: Always wrap external fetch calls in AbortController with explicit timeouts. Implement exponential backoff for retries, but cap them to avoid subrequest exhaustion. Log retry attempts separately from final results.
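
A capped retry loop with exponential backoff could be sketched as below. The names `Probe`, `RetryOptions`, and `probeWithRetry` are hypothetical, and treating status 0 and 5xx as the only retryable outcomes is an assumption. The probe is expected to apply its own timeout.

```typescript
// Resolves to an HTTP status code; rejects on network error or timeout.
type Probe = () => Promise<number>;

interface RetryOptions {
  maxAttempts: number; // total attempts, including the first
  baseDelayMs: number; // delay before the second attempt
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

interface RetryOutcome {
  status: number;   // 0 if the final attempt threw
  attempts: number; // how many attempts were made; log this separately
}

const defaultSleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Retries on thrown errors and 5xx statuses, doubling the delay each time.
// Every attempt costs one subrequest, so maxAttempts should stay small.
async function probeWithRetry(probe: Probe, options: RetryOptions): Promise<RetryOutcome> {
  const sleep = options.sleep ?? defaultSleep;
  let status = 0;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      status = await probe();
    } catch {
      status = 0;
    }
    const retryable = status === 0 || status >= 500;
    if (!retryable || attempt === options.maxAttempts) {
      return { status, attempts: attempt };
    }
    await sleep(options.baseDelayMs * 2 ** (attempt - 1));
  }
  return { status, attempts: options.maxAttempts };
}
```

Because retries multiply subrequest usage, the worst case is endpoints per run times `maxAttempts`. That product, not the endpoint count alone, has to stay under the per-invocation limit.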

Production Bundle

Action Checklist

Define endpoint configuration schema: include URL, method, expected status, timeout, and payload validation rules.
Split collection workload across multiple cron triggers if monitoring >40 endpoints on the free tier.
Implement strict timeouts using AbortController to prevent hanging requests from consuming quota.
Add response body validation to catch silent failures where status is 200 but payload indicates error.
Configure D1 retention policy to prune raw logs after 30 days and archive rollups separately.
Use CI/CD pipelines for static site rebuilds instead of relying on Pages deploy hooks if using direct upload.
Set CLOUDFLARE_ACCOUNT_ID in CI environments to bypass token scope lookups during deployment.
Monitor D1 row writes and Workers subrequest usage weekly to stay within free-tier thresholds.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
<50 endpoints, strict $0 budget	Single Worker + hourly cron + D1 + Astro	Fits within 50-subrequest limit; free tier covers all usage	$0/month
50–200 endpoints, free tier	Split into 2–4 offset cron triggers	Avoids subrequest ceiling; distributes load evenly	$0/month
>200 endpoints or real-time needs	Upgrade to Workers Paid ($5/mo)	Lifts subrequest limit to 1,000; enables concurrent execution	$5/month
Need historical trend analysis	Store rollups in separate D1 table + export to CSV	Raw logs are too granular; rollups enable charting without storage bloat	$0/month
Dashboard must survive API outages	Static generation at build time	Eliminates runtime dependencies; CDN caches HTML globally	$0/month

Configuration Template

# wrangler.toml
name = "api-monitor"
main = "src/index.ts"
compatibility_date = "2024-06-01"

[[d1_databases]]
binding = "DB"
database_name = "monitor-db"
database_id = "your-d1-database-id"

[triggers]
crons = ["0 * * * *", "5 * * * *"]

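[env.production]
name = "api-monitor-prod"
account_id = "your-account-id"

# Bindings are not inherited by named environments, so repeat the D1 binding here.
[[env.production.d1_databases]]
binding = "DB"
database_name = "monitor-db"
database_id = "your-d1-database-id"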

-- D1 Schema
CREATE TABLE IF NOT EXISTS endpoints (
  id TEXT PRIMARY KEY,
  target_url TEXT NOT NULL,
  method TEXT DEFAULT 'GET',
  expected_status INTEGER DEFAULT 200,
  timeout_ms INTEGER DEFAULT 5000,
  status TEXT DEFAULT 'active',
  created_at TEXT DEFAULT (datetime('now'))
);

CREATE TABLE IF NOT EXISTS check_logs (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  endpoint_id TEXT NOT NULL,
  recorded_at TEXT NOT NULL,
  status_code INTEGER NOT NULL,
  latency_ms INTEGER NOT NULL,
  is_healthy INTEGER NOT NULL,
  FOREIGN KEY (endpoint_id) REFERENCES endpoints(id)
);

CREATE INDEX IF NOT EXISTS idx_logs_endpoint_time ON check_logs(endpoint_id, recorded_at);

Quick Start Guide

Initialize the project: Run npm create cloudflare@latest api-monitor -- --type worker and select D1 as the database binding.
Seed endpoints: Insert your target APIs into the endpoints table using wrangler d1 execute monitor-db --file=seed.sql.
Deploy the collection worker: Run wrangler deploy to push the pinger. Verify cron execution via wrangler tail.
Configure the aggregator: Add the daily rollup job to the same worker or a separate deployment. Test snapshot generation locally with wrangler dev.
Build the static site: Clone the Astro template, point getStaticPaths to your builder endpoint, and deploy via wrangler pages deploy dist/. Schedule a GitHub Action to trigger rebuilds daily.

This architecture delivers production-grade external dependency monitoring without infrastructure overhead. By decoupling collection from presentation, enforcing strict resource boundaries, and validating beyond HTTP status codes, you gain visibility into the volatile layer of your stack while maintaining a zero-cost operational footprint.
