I Built a Live Monitor for 77 Free Public APIs in a Weekend (Architecture + Bugs)
Architecting a Zero-Cost External Dependency Monitor on the Edge
Current Situation Analysis
Modern application stacks increasingly rely on third-party services for non-core functionality: weather data, geocoding, currency conversion, news feeds, and public datasets. The assumption is that if an endpoint is publicly documented and labeled "free," it will remain stable. In practice, free APIs operate without service-level agreements, making them the weakest link in production dependency graphs.
The core pain point is silent degradation. Unlike internal microservices that trigger PagerDuty alerts on 5xx spikes, external free APIs often fail quietly. They return 200 OK with error payloads, throttle requests without warning, or decommission endpoints entirely. Development teams rarely instrument monitoring for these services because traditional APM tools are expensive, complex to configure for external targets, and introduce runtime overhead.
This blind spot is common. In one sample of ten widely cited free weather APIs, approximately four had been decommissioned, moved behind undocumented paywalls, or required credit card verification despite documentation claiming otherwise. When these services power frontend features or background sync jobs, the failure mode shifts from "service degraded" to "user-facing breakage" with no early warning.
The misunderstanding stems from treating external APIs as infrastructure rather than volatile dependencies. Infrastructure demands uptime guarantees; volatile dependencies demand continuous verification. Building a lightweight, zero-cost verification layer bridges this gap without introducing operational debt.
WOW Moment: Key Findings
Traditional monitoring stacks require dedicated servers, database instances, and dashboard software. Even lightweight open-source alternatives demand maintenance, patching, and baseline hosting costs. By contrast, an edge-native architecture decouples data collection from presentation, leveraging serverless compute and static generation to eliminate runtime failure surfaces.
| Approach | Monthly Cost | Infrastructure Complexity | Runtime Failure Risk | Data Freshness |
|---|---|---|---|---|
| Traditional Server + DB + Dashboard | $15–$50+ | High (OS, runtime, DB, web server) | High (dashboard fails if backend fails) | Real-time |
| SaaS Uptime Monitor | $10–$30+ | Low (managed) | Low (vendor handles uptime) | 1–5 min intervals |
| Edge-Static Architecture | $0 (within free tiers) | Medium (initial setup) | Very low (no application runtime; the CDN serves static HTML) | Hourly/Daily |
The edge-static model shifts the failure domain entirely. Because the monitoring dashboard is pre-rendered HTML served from a CDN, it remains accessible even if the target APIs go completely offline. The only moving parts are the background workers that collect and aggregate data. This architecture also naturally aligns with free-tier limits, as compute is event-driven and storage is append-only until rollup.
Core Solution
The system operates as a one-way data pipeline: collection → aggregation → static generation. Each stage is independently deployable, stateless, and bound to a specific schedule.
Phase 1: Data Collection (The Pinger)
A Cloudflare Worker runs on an hourly cron trigger. It reads the active endpoints from D1 (Cloudflare's serverless SQLite database), performs an HTTP health check against each one, and writes the results directly back to D1. The worker exposes no public routes; it only runs when the cron fires.
```typescript
// worker/src/collector.ts
import type { D1Database } from '@cloudflare/workers-types';

interface EndpointConfig {
  id: string;
  targetUrl: string;
  method: 'GET' | 'HEAD';
  expectedStatus: number;
  timeoutMs: number;
}

interface CheckResult {
  endpointId: string;
  timestamp: string;
  statusCode: number; // 0 = request never completed (timeout, DNS, network error)
  latencyMs: number;
  isHealthy: boolean;
}

export async function runHealthChecks(env: { DB: D1Database }): Promise<void> {
  // .all() returns a D1Result; the rows live in .results.
  // Aliases map snake_case columns onto the camelCase EndpointConfig fields.
  const { results: endpoints } = await env.DB.prepare(
    `SELECT id, target_url AS targetUrl, method,
            expected_status AS expectedStatus, timeout_ms AS timeoutMs
       FROM endpoints WHERE status = ?`
  ).bind('active').all<EndpointConfig>();

  const results: CheckResult[] = [];

  // Sequential checks: one subrequest in flight at a time.
  for (const ep of endpoints) {
    const start = performance.now();
    let statusCode = 0;
    let isHealthy = false;
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), ep.timeoutMs);

    try {
      const response = await fetch(ep.targetUrl, {
        method: ep.method,
        signal: controller.signal,
        headers: { 'User-Agent': 'EdgeMonitor/1.0' }
      });
      statusCode = response.status;
      isHealthy = statusCode === ep.expectedStatus;
    } catch {
      // Abort (timeout), DNS/TLS failure, or subrequest limit reached.
      statusCode = 0;
      isHealthy = false;
    } finally {
      // Clear the timer on success and failure alike.
      clearTimeout(timeoutId);
    }

    results.push({
      endpointId: ep.id,
      timestamp: new Date().toISOString(),
      statusCode,
      latencyMs: Math.round(performance.now() - start),
      isHealthy
    });
  }

  if (results.length === 0) return;

  // Batch insert to minimize D1 round trips: one prepared statement, many bindings.
  const insert = env.DB.prepare(
    'INSERT INTO check_logs (endpoint_id, recorded_at, status_code, latency_ms, is_healthy) VALUES (?, ?, ?, ?, ?)'
  );
  await env.DB.batch(
    results.map(r => insert.bind(r.endpointId, r.timestamp, r.statusCode, r.latencyMs, r.isHealthy ? 1 : 0))
  );
}
```
Architecture Rationale:
- `AbortController` enforces a per-request timeout, so one slow endpoint cannot stall the checks queued behind it.
- Batch inserts reduce D1 overhead: `batch()` sends all statements in a single call and executes them as one transaction.
- No public routes are exposed. The worker runs exclusively on cron, so it has no HTTP attack surface.
Phase 2: Data Aggregation (The Builder)
A second Worker runs daily. It computes 24-hour uptime percentages, average latency, and state-change events. Raw logs older than 30 days are pruned to control storage growth. The worker generates a single JSON snapshot consumed by the static site.
```typescript
// worker/src/aggregator.ts
import type { D1Database } from '@cloudflare/workers-types';

interface RollupMetrics {
  endpointId: string;
  uptimePercent: number;
  avgLatencyMs: number;
  totalChecks: number;
  lastCheckedAt: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;
const RETENTION_DAYS = 30;

export async function generateSnapshot(env: { DB: D1Database }): Promise<RollupMetrics[]> {
  const cutoff = new Date(Date.now() - DAY_MS).toISOString();

  // Column aliases must match the RollupMetrics field names.
  const { results } = await env.DB.prepare(`
    SELECT
      endpoint_id                     AS endpointId,
      ROUND(AVG(is_healthy) * 100, 2) AS uptimePercent,
      ROUND(AVG(latency_ms), 2)       AS avgLatencyMs,
      COUNT(*)                        AS totalChecks,
      MAX(recorded_at)                AS lastCheckedAt
    FROM check_logs
    WHERE recorded_at >= ?
    GROUP BY endpoint_id
  `).bind(cutoff).all<RollupMetrics>();

  // Prune raw data older than the retention window.
  // ISO 8601 UTC strings sort chronologically, so text comparison is safe.
  const retentionCutoff = new Date(Date.now() - RETENTION_DAYS * DAY_MS).toISOString();
  await env.DB.prepare('DELETE FROM check_logs WHERE recorded_at < ?').bind(retentionCutoff).run();

  return results;
}
```
Architecture Rationale:
- SQL aggregation pushes computation to the database layer, reducing Worker CPU time.
- The retention policy keeps D1 storage bounded. The free tier offers 5GB of storage, but an unbounded table also inflates the rows read by every aggregation scan, which count against the daily read allowance.
- The snapshot is a single payload. The static site fetches it once at build time, avoiding per-request database queries.
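The query above computes uptime and latency but not the state-change events mentioned at the start of this phase. One way to derive them, sketched under the assumption that the aggregator loads the window's checks in chronological order, is to track each endpoint's last state and emit an event whenever `is_healthy` flips. `CheckRow`, `StateChange`, and `detectStateChanges` are hypothetical names.

```typescript
// Hypothetical row shape: one check from check_logs.
interface CheckRow {
  endpointId: string;
  recordedAt: string;
  isHealthy: boolean;
}

// Hypothetical event shape for the snapshot.
interface StateChange {
  endpointId: string;
  at: string; // timestamp of the first check in the new state
  from: 'up' | 'down';
  to: 'up' | 'down';
}

// Emits an event whenever two consecutive checks of the same endpoint disagree.
// Assumes rows are in chronological order; rows from different endpoints may
// be interleaved because state is tracked per endpoint.
export function detectStateChanges(rows: CheckRow[]): StateChange[] {
  const events: StateChange[] = [];
  const lastState = new Map<string, boolean>();

  for (const row of rows) {
    const previous = lastState.get(row.endpointId);
    if (previous !== undefined && previous !== row.isHealthy) {
      events.push({
        endpointId: row.endpointId,
        at: row.recordedAt,
        from: previous ? 'up' : 'down',
        to: row.isHealthy ? 'up' : 'down',
      });
    }
    lastState.set(row.endpointId, row.isHealthy);
  }
  return events;
}
```

D1 returns `is_healthy` as `0`/`1`, so convert with `Boolean(...)` when mapping rows from a query such as `SELECT endpoint_id, recorded_at, is_healthy FROM check_logs WHERE recorded_at >= ? ORDER BY recorded_at`.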
Phase 3: Static Presentation (Astro + Pages)
The monitoring dashboard is a 140+ page static site. During the build process, Astro fetches the JSON snapshot, generates HTML for each endpoint, and deploys to Cloudflare Pages. No server-side rendering occurs at request time.
```astro
---
// astro/src/pages/api/[endpoint].astro
// Frontmatter runs at build time only; each page is emitted as plain HTML.
interface RollupMetrics {
  endpointId: string;
  uptimePercent: number;
  avgLatencyMs: number;
  totalChecks: number;
  lastCheckedAt: string;
}

export async function getStaticPaths() {
  const res = await fetch('https://builder-worker.example.workers.dev/snapshot.json');
  // Fail the build rather than publish an empty dashboard.
  if (!res.ok) throw new Error(`Snapshot fetch failed: HTTP ${res.status}`);
  const snapshot = (await res.json()) as RollupMetrics[];
  return snapshot.map((ep) => ({
    params: { endpoint: ep.endpointId },
    props: { data: ep }
  }));
}

interface Props {
  data: RollupMetrics;
}

const { data } = Astro.props;
---
<html>
  <head><title>{data.endpointId} Status</title></head>
  <body>
    <h1>{data.endpointId}</h1>
    <p>Uptime: {data.uptimePercent}%</p>
    <p>Avg Latency: {data.avgLatencyMs}ms</p>
    <p>Checks: {data.totalChecks}</p>
  </body>
</html>
```
Architecture Rationale:
- Static generation eliminates runtime dependencies. The dashboard remains online regardless of API health.
- Cloudflare Pages serves the HTML from its edge network, so time to first byte stays low wherever visitors are.
- Build-time data fetching ensures consistency. All pages reflect the exact same snapshot.
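`getStaticPaths` fetches `snapshot.json` from the builder Worker, but the aggregator code only returns the rollups; nothing serves them. A minimal sketch of the missing route, assuming the builder Worker (unlike the route-less collector) keeps a small `fetch` handler and that a hypothetical `loadSnapshot` function reads the most recently stored rollups. `createSnapshotHandler` is an illustrative name, not a library API.

```typescript
// Hypothetical dependency: returns the latest stored rollups (e.g. from D1 or KV).
type LoadSnapshot = () => Promise<unknown[]>;

// Builds a fetch handler that serves only GET /snapshot.json.
export function createSnapshotHandler(loadSnapshot: LoadSnapshot) {
  return async (request: Request): Promise<Response> => {
    const { pathname } = new URL(request.url);
    if (request.method !== 'GET' || pathname !== '/snapshot.json') {
      return new Response('Not found', { status: 404 });
    }
    const snapshot = await loadSnapshot();
    return new Response(JSON.stringify(snapshot), {
      headers: {
        'Content-Type': 'application/json',
        // Rollups change daily, so a short cache is harmless for build-time reads.
        'Cache-Control': 'public, max-age=300',
      },
    });
  };
}
```

The handler reads stored rollups instead of calling `generateSnapshot` per request because `generateSnapshot` also deletes old rows, and a GET should not have that side effect.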
Pitfall Guide
1. The fetch Context Binding Trap
When using dependency injection to mock network calls in tests, storing the global `fetch` as an object property and calling it as `deps.fetcher(url)` throws `TypeError: Illegal invocation` in the Workers runtime (and in browsers). The call runs with `this` set to `deps` instead of the global object, and the runtime's receiver check rejects it.
Fix: Bind the function explicitly: `const client = { fetcher: fetch.bind(globalThis) }`, or wrap it in an arrow function. Alternatively, avoid DI for built-ins and use module mocking in your test runner.
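A minimal sketch of the binding fix, using a hypothetical `Deps` shape and `checkStatus` helper: production code receives a correctly bound `fetch`, while tests pass a stub with the same signature.

```typescript
// Hypothetical dependency bag for code that performs HTTP checks.
interface Deps {
  fetcher: (input: string, init?: RequestInit) => Promise<Response>;
}

// Bound to globalThis so calling deps.fetcher(...) keeps the receiver the runtime expects.
export const defaultDeps: Deps = { fetcher: fetch.bind(globalThis) };

// Example consumer: returns the status code of a HEAD request.
export async function checkStatus(url: string, deps: Deps = defaultDeps): Promise<number> {
  const response = await deps.fetcher(url, { method: 'HEAD' });
  return response.status;
}
```

In a test, `checkStatus(url, { fetcher: async () => new Response(null, { status: 503 }) })` exercises the code path without touching the network.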
2. The 50-Subrequest Ceiling
Cloudflare Workers on the free tier enforce a hard limit of 50 subrequests (outbound fetch calls) per invocation. Attempting to ping 77 endpoints in a single cron run fails after the 50th request: every later fetch throws, the collector's catch block records those endpoints as down, and the invocation still reports success.
Fix: Split the workload across multiple cron triggers. Schedule cron-1 for endpoints 1–40 and cron-2 for endpoints 41–77, offset by 5 minutes. Each invocation then stays within the limit, and the load is spread out.
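One way to route each cron to its slice, sketched under these assumptions: the cron strings mirror the `[triggers]` list in `wrangler.toml`, endpoints are ordered by `id`, and each run takes at most 40 rows. `CRON_SHARDS`, `shardFor`, and `SHARD_QUERY` are illustrative names. In a module Worker, the first argument of the `scheduled` handler exposes the firing expression as `controller.cron`.

```typescript
// Must mirror [triggers].crons in wrangler.toml; array order defines the shard index.
const CRON_SHARDS = ['0 * * * *', '5 * * * *'];

// Headroom under the free tier's 50-subrequest cap (retries count too).
const MAX_CHECKS_PER_RUN = 40;

export interface Shard {
  limit: number;
  offset: number;
}

// Maps the cron expression that fired to a LIMIT/OFFSET window over endpoints.
export function shardFor(cron: string): Shard {
  const index = CRON_SHARDS.indexOf(cron);
  if (index === -1) {
    throw new Error(`No shard configured for cron "${cron}"`);
  }
  return { limit: MAX_CHECKS_PER_RUN, offset: index * MAX_CHECKS_PER_RUN };
}

// ORDER BY keeps slices stable between runs.
export const SHARD_QUERY = `
  SELECT id, target_url AS targetUrl, method,
         expected_status AS expectedStatus, timeout_ms AS timeoutMs
  FROM endpoints
  WHERE status = 'active'
  ORDER BY id
  LIMIT ? OFFSET ?`;

// Usage inside a module Worker (sketch):
// export default {
//   async scheduled(controller, env) {
//     const { limit, offset } = shardFor(controller.cron);
//     const { results } = await env.DB.prepare(SHARD_QUERY).bind(limit, offset).all();
//     // ...run the checks for `results` only
//   },
// };
```

Two shards of 40 cover up to 80 endpoints; past that, add a third cron expression to both `wrangler.toml` and `CRON_SHARDS`. Inserting a new endpoint shifts later endpoints between shards, which is harmless for hourly checks.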
3. Deploy Hook Misalignment
Cloudflare Pages deploy hooks only work for projects connected to a Git repository. Projects created by direct upload (`wrangler pages deploy`) have no programmatic rebuild trigger, and calling a deploy hook URL returns 404.
Fix: Trigger builds from GitHub Actions or another CI/CD pipeline. The workflow runs the static generator, then executes `wrangler pages deploy`. Set `CLOUDFLARE_ACCOUNT_ID` explicitly so Wrangler does not have to look up the account with the API token, which fails when the token is narrowly scoped.
4. Status-Code Myopia
Relying solely on HTTP status codes creates false positives. Many free APIs return 200 OK with error payloads like {"error": "quota exceeded"} or {"message": "invalid key"}. The endpoint appears healthy but delivers broken data.
Fix: Implement response body validation. Parse JSON responses and check for known error keys, or validate against a minimal schema. Add an `expected_payload_keys` field to your endpoint configuration.
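A minimal sketch of body validation. It assumes `expected_payload_keys` is stored as a JSON array of top-level key names and that a per-endpoint list of error keys (defaulting to `error`) marks a failed response; `validatePayload` and `PayloadVerdict` are hypothetical. Reading the body also means the check must use `GET` rather than `HEAD`.

```typescript
export interface PayloadVerdict {
  ok: boolean;
  reason?: string;
}

// A parsed JSON body passes only if it is a plain object, contains none of the
// error keys, and contains every expected top-level key.
// Top-level arrays are rejected in this sketch.
export function validatePayload(
  body: unknown,
  expectedKeys: string[],
  errorKeys: string[] = ['error'],
): PayloadVerdict {
  if (typeof body !== 'object' || body === null || Array.isArray(body)) {
    return { ok: false, reason: 'body is not a JSON object' };
  }
  // Own properties only, so inherited names like "toString" never match.
  const has = (key: string) => Object.prototype.hasOwnProperty.call(body, key);

  const errorKey = errorKeys.find(has);
  if (errorKey !== undefined) {
    return { ok: false, reason: `error key present: ${errorKey}` };
  }
  const missing = expectedKeys.filter((key) => !has(key));
  if (missing.length > 0) {
    return { ok: false, reason: `missing keys: ${missing.join(', ')}` };
  }
  return { ok: true };
}
```

Keep `message` out of the default error keys: some healthy APIs return it, so add it only for endpoints known to use `{"message": "invalid key"}`-style failures.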
5. Unbounded D1 Growth
Raw check logs accumulate steadily. At hourly intervals across dozens of endpoints, the check_logs table grows linearly, and every aggregation query that scans it reads more rows. Without pruning, you will eventually hit free-tier storage caps or the daily rows-read allowance.
Fix: Implement a retention policy. Keep raw logs for 7–30 days, then delete them. Store only aggregated rollups for long-term historical tracking. Use DELETE FROM ... WHERE recorded_at < ? in your daily aggregation job.
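To see how quickly raw logs accumulate, a back-of-the-envelope estimate helps. This sketch assumes one `check_logs` row per endpoint per check; `estimateDailyInserts` and `estimateRetainedRows` are hypothetical helpers, and the results should be compared against D1's currently published free-tier limits rather than any figure hard-coded here.

```typescript
// Rows inserted per day, assuming one row per check.
export function estimateDailyInserts(endpoints: number, checksPerDay: number): number {
  return endpoints * checksPerDay;
}

// Rows retained at steady state once the retention job deletes anything older
// than retentionDays.
export function estimateRetainedRows(
  endpoints: number,
  checksPerDay: number,
  retentionDays: number,
): number {
  return estimateDailyInserts(endpoints, checksPerDay) * retentionDays;
}
```

For 77 endpoints checked hourly, that is 1,848 inserts per day, 12,936 retained rows with a 7-day window, and 55,440 with a 30-day window. Without pruning, the table instead grows by 1,848 rows every day, about 674,520 rows after a year.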
6. Cold Start Latency Skew
The first request to a Worker or external API after inactivity incurs cold start latency. Recording this in your metrics inflates average response times and triggers false degradation alerts.
Fix: Discard the first measurement after a cold start, or use a warm-up probe. Alternatively, track p95 latency instead of avg to reduce skew from outlier cold starts.
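A minimal sketch of p95 using the nearest-rank method. Rather than depend on a percentile function being available in D1's SQL dialect, it assumes the Worker loads one endpoint's latencies for the window and computes the value in code; `percentileNearestRank` is a hypothetical helper. Filter out failed checks (status code 0) first, since their latency is just the timeout.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of all
// samples are less than or equal to it.
export function percentileNearestRank(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('percentile of an empty sample set is undefined');
  }
  if (p <= 0 || p > 100) {
    throw new RangeError('p must be in (0, 100]');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank arithmetic exact for integer p.
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based
  return sorted[rank - 1];
}
```

For 23 hourly checks at 100 ms plus one 3,000 ms cold start, the mean is about 221 ms while p95 stays at 100 ms, because the rank for 24 samples is 23 and the single outlier sits at rank 24.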
7. Missing Retry/Timeout Logic
Network hiccups, DNS resolution delays, or temporary gateway errors cause transient failures. Without timeouts, a single hanging request holds the run open and blocks every check queued behind it; without retries, each transient error is recorded as an outage.
Fix: Always wrap external fetch calls in AbortController with explicit timeouts. Implement exponential backoff for retries, but cap them to avoid subrequest exhaustion. Log retry attempts separately from final results.
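A sketch of a capped retry loop with exponential backoff and a per-attempt `AbortController` timeout. `fetchWithRetry`, `RetryOptions`, and `RetryOutcome` are hypothetical; the `fetcher` parameter exists so the loop can be tested without the network. Each attempt is a separate subrequest, so `maxAttempts` multiplies worst-case usage, and the returned `attempts` count can be logged apart from the final result.

```typescript
export interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // wait before the 2nd attempt; doubles after each retry
  timeoutMs: number; // per-attempt timeout
}

export interface RetryOutcome {
  response: Response | null; // null if every attempt threw
  attempts: number;
  lastError?: unknown;
}

type Fetcher = (input: string, init?: RequestInit) => Promise<Response>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries network errors, timeouts, and 5xx responses; other statuses return immediately.
// On the final attempt a 5xx response is returned as-is.
export async function fetchWithRetry(
  url: string,
  opts: RetryOptions,
  fetcher: Fetcher = fetch.bind(globalThis),
): Promise<RetryOutcome> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), opts.timeoutMs);
    try {
      const response = await fetcher(url, { signal: controller.signal });
      if (response.status < 500 || attempt === opts.maxAttempts) {
        return { response, attempts: attempt };
      }
      lastError = new Error(`HTTP ${response.status}`);
      await response.body?.cancel(); // release the connection before retrying
    } catch (err) {
      lastError = err;
    } finally {
      clearTimeout(timer);
    }
    if (attempt < opts.maxAttempts) {
      await sleep(opts.baseDelayMs * 2 ** (attempt - 1));
    }
  }
  return { response: null, attempts: opts.maxAttempts, lastError };
}
```

With `maxAttempts: 2` and 40 endpoints per shard, the worst case is 80 subrequests, so on the free tier either lower the shard size or retry only on the final scheduled run.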
Production Bundle
Action Checklist
- Define the endpoint configuration schema: include URL, method, expected status, timeout, and payload validation rules.
- Split the collection workload across multiple cron triggers if monitoring more than 40 endpoints on the free tier.
- Implement strict timeouts using `AbortController` to prevent hanging requests from stalling the run.
- Add response body validation to catch silent failures where the status is 200 but the payload indicates an error.
- Configure a D1 retention policy that prunes raw logs after 30 days, and archive rollups separately.
- Use CI/CD pipelines for static site rebuilds instead of relying on Pages deploy hooks if you deploy by direct upload.
- Set `CLOUDFLARE_ACCOUNT_ID` in CI environments so Wrangler skips account lookups with the API token during deployment.
- Monitor D1 row writes and Workers subrequest usage weekly to stay within free-tier thresholds.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| <50 endpoints, strict $0 budget | Single Worker + hourly cron + D1 + Astro | Fits within 50-subrequest limit; free tier covers all usage | $0/month |
| 50–200 endpoints, free tier | Split into 2–4 offset cron triggers | Avoids subrequest ceiling; distributes load evenly | $0/month |
| >200 endpoints or real-time needs | Upgrade to Workers Paid ($5/mo) | Raises the subrequest limit to 1,000 per invocation | $5/month |
| Need historical trend analysis | Store rollups in separate D1 table + export to CSV | Raw logs are too granular; rollups enable charting without storage bloat | $0/month |
| Dashboard must survive API outages | Static generation at build time | Eliminates runtime dependencies; CDN caches HTML globally | $0/month |
Configuration Template
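```toml
# wrangler.toml
name = "api-monitor"
main = "src/index.ts"
compatibility_date = "2024-06-01"

[[d1_databases]]
binding = "DB"
database_name = "monitor-db"
database_id = "your-d1-database-id"

# Two hourly runs offset by 5 minutes, one per endpoint shard
[triggers]
crons = ["0 * * * *", "5 * * * *"]

[env.production]
name = "api-monitor-prod"
account_id = "your-account-id"

# Bindings are not inherited by environments; redeclare D1 for production
[[env.production.d1_databases]]
binding = "DB"
database_name = "monitor-db"
database_id = "your-d1-database-id"
```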
```sql
-- D1 Schema
CREATE TABLE IF NOT EXISTS endpoints (
  id TEXT PRIMARY KEY,
  target_url TEXT NOT NULL,
  method TEXT DEFAULT 'GET',
  expected_status INTEGER DEFAULT 200,
  timeout_ms INTEGER DEFAULT 5000,
  status TEXT DEFAULT 'active',
  created_at TEXT DEFAULT (datetime('now'))
);

CREATE TABLE IF NOT EXISTS check_logs (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  endpoint_id TEXT NOT NULL,
  recorded_at TEXT NOT NULL, -- ISO 8601 UTC; compares chronologically as text
  status_code INTEGER NOT NULL,
  latency_ms INTEGER NOT NULL,
  is_healthy INTEGER NOT NULL, -- 0 or 1
  FOREIGN KEY (endpoint_id) REFERENCES endpoints(id)
);

CREATE INDEX IF NOT EXISTS idx_logs_endpoint_time ON check_logs(endpoint_id, recorded_at);
-- The daily rollup and pruning queries filter on recorded_at alone,
-- which an index led by endpoint_id does not serve well.
CREATE INDEX IF NOT EXISTS idx_logs_recorded_at ON check_logs(recorded_at);
```
Quick Start Guide
- Initialize the project: Run `npm create cloudflare@latest api-monitor` and choose a Worker template, then create the database with `wrangler d1 create monitor-db` and copy its ID into the `[[d1_databases]]` binding.
- Seed endpoints: Apply the schema, then insert your target APIs into the `endpoints` table using `wrangler d1 execute monitor-db --remote --file=seed.sql` (without `--remote`, the command targets the local development database).
- Deploy the collection worker: Run `wrangler deploy` to push the pinger. Verify cron execution with `wrangler tail`.
- Configure the aggregator: Add the daily rollup job to the same worker or a separate deployment. Test snapshot generation locally with `wrangler dev` (add `--test-scheduled` to trigger the scheduled handler over HTTP).
- Build the static site: Clone the Astro template, point `getStaticPaths` to your builder endpoint, and deploy with `wrangler pages deploy dist/`. Schedule a GitHub Action to trigger rebuilds daily.
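The seeding step assumes a `seed.sql` file. A small sketch for generating one from a typed list, so SQL quoting is handled in one place; `SeedEndpoint` and `toSeedSql` are hypothetical, and the column list follows the D1 schema in the configuration template. Write the returned string to `seed.sql` and pass it to `wrangler d1 execute`.

```typescript
// Hypothetical input shape; optional fields fall back to the schema defaults.
export interface SeedEndpoint {
  id: string;
  targetUrl: string;
  method?: 'GET' | 'HEAD';
  expectedStatus?: number;
  timeoutMs?: number;
}

// SQL string literal: wrap in single quotes and double any embedded single quote.
const sqlString = (value: string): string => `'${value.replace(/'/g, "''")}'`;

// One INSERT OR IGNORE per endpoint, so re-running the file skips existing ids.
export function toSeedSql(endpoints: SeedEndpoint[]): string {
  return endpoints
    .map((ep) => {
      const values = [
        sqlString(ep.id),
        sqlString(ep.targetUrl),
        sqlString(ep.method ?? 'GET'),
        String(ep.expectedStatus ?? 200),
        String(ep.timeoutMs ?? 5000),
      ].join(', ');
      return `INSERT OR IGNORE INTO endpoints (id, target_url, method, expected_status, timeout_ms) VALUES (${values});`;
    })
    .join('\n');
}
```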
This architecture delivers production-grade external dependency monitoring without infrastructure overhead. By decoupling collection from presentation, enforcing strict resource boundaries, and validating beyond HTTP status codes, you gain visibility into the volatile layer of your stack while maintaining a zero-cost operational footprint.
