tagged with user_id, request_id, or pod_name can generate hundreds of thousands of series. The fix is to implement a cardinality guard that hashes or drops high-cardinality tags before emission.
import { StatsD } from 'hot-shots';
const DATADOG_AGENT_HOST = process.env.DD_AGENT_HOST || 'localhost';
const MAX_CARDINALITY_PER_METRIC = 500;
class MetricCardinalityGuard {
  // Tracks the distinct series already emitted for each metric name
  private seriesRegistry: Map<string, Set<string>> = new Map();
  private statsd: StatsD;

  constructor() {
    this.statsd = new StatsD({ host: DATADOG_AGENT_HOST, port: 8125 });
  }

  emit(metricName: string, value: number, tags: Record<string, string>) {
    // Drop known high-cardinality tags, then sort so tag order cannot create duplicate series IDs
    const safeTags = Object.entries(tags)
      .filter(([k]) => !this.isHighCardinality(k))
      .map(([k, v]) => `${k}:${v}`)
      .sort();
    const seriesId = `${metricName}|${safeTags.join(',')}`;

    if (!this.seriesRegistry.has(metricName)) {
      this.seriesRegistry.set(metricName, new Set());
    }
    const registry = this.seriesRegistry.get(metricName)!;

    if (registry.size >= MAX_CARDINALITY_PER_METRIC && !registry.has(seriesId)) {
      // Fallback: count the rejected emission in one generic metric instead of creating a new series
      this.statsd.increment('metric.cardinality_limit_reached', 1, [`metric_name:${metricName}`]);
      return;
    }

    registry.add(seriesId);
    // Pass each tag as its own array element and forward the caller's value
    this.statsd.increment(metricName, value, safeTags);
  }

  private isHighCardinality(tagKey: string): boolean {
    const highCardinalityKeys = ['user_id', 'request_id', 'session_id', 'pod_uid', 'transaction_hash'];
    return highCardinalityKeys.includes(tagKey);
  }
}
export const telemetry = new MetricCardinalityGuard();
Architecture Rationale: High-cardinality tags are filtered at the SDK level before they reach the Datadog agent, so the series are never created and no post-ingestion filtering is needed. When a metric hits its cap, the guard increments metric.cardinality_limit_reached instead of emitting a new series, which keeps the overflow visible while capping series growth. Set MAX_CARDINALITY_PER_METRIC according to your team's budget and the $0.10 per 100 series rate.
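Where dropping a tag outright loses too much signal, the "hash" option can be sketched as bucketing: each unbounded value is mapped to one of a fixed number of buckets, so the tag contributes at most that many distinct values. The helper names (`fnv1a32`, `bucketTagValue`, `bucketTags`), the choice of FNV-1a, and the default of 16 buckets are all assumptions for illustration and are not part of the guard above.

```typescript
// Hypothetical helpers: bound a high-cardinality tag to a fixed number of buckets.
// FNV-1a (32-bit) is used only because it is small and deterministic.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime in 32-bit space
  }
  return hash >>> 0; // force an unsigned result
}

// Returns a bucket index in [0, buckets) as a string.
function bucketTagValue(value: string, buckets = 16): string {
  if (!Number.isInteger(buckets) || buckets <= 0) {
    throw new RangeError('buckets must be a positive integer');
  }
  return String(fnv1a32(value) % buckets);
}

// Rewrites listed keys to "<key>_bucket:<n>" and leaves all other tags untouched.
function bucketTags(
  tags: Record<string, string>,
  highCardinalityKeys: readonly string[],
  buckets = 16,
): string[] {
  return Object.entries(tags).map(([key, value]) =>
    highCardinalityKeys.includes(key)
      ? `${key}_bucket:${bucketTagValue(value, buckets)}`
      : `${key}:${value}`,
  );
}
```

Bucketing keeps coarse distribution information (for example, whether load is skewed toward a few buckets) but cannot identify an individual user or request; that level of detail belongs in logs or traces.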
Step 2: Implement Log Sampling & Routing
Indexing costs $1.70 per million events. Ingesting unfiltered logs at scale is financially unsustainable. The solution is a two-tier log pipeline: sample debug/verbose logs, index only error/warn/critical paths, and route compliance logs to Flex or Archive tiers.
interface LogPayload {
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string;
  metadata: Record<string, unknown>;
}

class LogSamplingPipeline {
  private debugSampleRate: number;
  private infoSampleRate: number;

  constructor(debugRate = 0.1, infoRate = 0.5) {
    this.debugSampleRate = debugRate;
    this.infoSampleRate = infoRate;
  }

  // warn/error/fatal are always indexed; info and debug are sampled
  shouldIndex(payload: LogPayload): boolean {
    if (['error', 'fatal', 'warn'].includes(payload.level)) return true;
    if (payload.level === 'info') return Math.random() < this.infoSampleRate;
    if (payload.level === 'debug') return Math.random() < this.debugSampleRate;
    return false;
  }

  routeToDatadog(payload: LogPayload): void {
    const indexed = this.shouldIndex(payload);
    // Every event is still sent; the index tag records the sampling decision
    const ddPayload = {
      ...payload,
      ddtags: indexed ? 'env:prod,index:true' : 'env:prod,index:false',
      _dd: { sampling_priority: indexed ? 1 : 0 }
    };
    // Send to the local log intake (endpoint as given; adjust host and port to your deployment)
    fetch('http://localhost:10516/api/v2/logs', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify([ddPayload])
    }).catch(err => console.error('Datadog log delivery failed:', err));
  }
}

export const logPipeline = new LogSamplingPipeline(0.05, 0.3);
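Architecture Rationale: The sampling decision is made at the application layer, but routeToDatadog still transmits every event and only tags it with the outcome. That keeps ingestion complete while allowing indexing to be limited to the index:true subset, which is where the $1.70 per million events cost sits. The tag has an effect only if a matching rule exists on the Datadog side (for example, an index exclusion filter on index:false); treat the sampling_priority field as advisory metadata unless your pipeline is configured to act on it. To also reduce egress bandwidth and agent CPU overhead, skip the fetch call when indexed is false and the event has no retention requirement. Handled this way, compliance retention is decoupled from active debugging costs.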
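Because `Math.random()` makes a sampler awkward to unit test, one option is to inject the random source. The sketch below restates the level-based decision from `shouldIndex` as a standalone function; `makeIndexDecider`, `SampleRates`, and the `rng` parameter are hypothetical names introduced here, not part of the pipeline above.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error' | 'fatal';

interface SampleRates {
  debug: number; // fraction of debug events to index, between 0 and 1
  info: number;  // fraction of info events to index, between 0 and 1
}

// Returns a decision function. `rng` defaults to Math.random but can be
// swapped for a deterministic source in tests.
function makeIndexDecider(
  rates: SampleRates,
  rng: () => number = Math.random,
): (level: LogLevel) => boolean {
  return (level) => {
    switch (level) {
      case 'warn':
      case 'error':
      case 'fatal':
        return true; // actionable levels are always indexed
      case 'info':
        return rng() < rates.info;
      case 'debug':
        return rng() < rates.debug;
      default:
        return false;
    }
  };
}

// Usage with a fixed "random" value of 0.2:
const decideFixed = makeIndexDecider({ debug: 0.05, info: 0.3 }, () => 0.2);
console.log(decideFixed('info'));  // true, because 0.2 < 0.3
console.log(decideFixed('debug')); // false, because 0.2 is not below 0.05
console.log(decideFixed('error')); // true, regardless of the random value
```

With the random source fixed, the rates can be asserted directly instead of being checked statistically over many calls.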
Datadog counts containers and pods above a configurable threshold as billable hosts. Ephemeral workloads, CI runners, and batch jobs should be excluded from billing thresholds.
# datadog.yaml (Agent configuration)
container_exclude:
  - "image:.*ci-runner.*"
  - "kube_namespace:batch-jobs"
  - "kube_deployment:.*-ephemeral"
container_include_metrics:
  - "kube_namespace:production"
  - "kube_namespace:staging"
# Prevent high-watermark inflation from short-lived pods
process_config:
  enabled: "true"
  container_collection:
    enabled: true
    # Only count containers running > 300 seconds
    min_uptime_seconds: 300
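Architecture Rationale: Under the 99th-percentile billing model, only the top 1% of hours (roughly 7 hours in a 30-day month) are discarded. A brief spike in pod count is absorbed, but a sustained one, such as an overnight batch run or a weekend load test, sets the billed count for the whole month. By excluding CI/batch namespaces and enforcing a minimum uptime threshold, you ensure only sustained workloads contribute to the host count. This aligns billing with actual infrastructure consumption rather than transient orchestration events.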
Pitfall Guide
1. The 99th Percentile Host Trap
Explanation: Datadog bills infrastructure hosts on the 99th percentile of hourly host counts each month, which discards only the top 1% of hours (about 7 hours). A weekend stress test or a sustained auto-scaling event outlasts that allowance and locks in a higher billed count for the entire month.
Fix: Implement container uptime thresholds (min_uptime_seconds), exclude ephemeral namespaces, and schedule scale tests during off-peak billing windows or use dedicated test accounts.
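To see why the duration of a spike matters for this pitfall, the sketch below computes a billable host count from hourly samples by discarding the top 1% of hours and taking the highest remaining value. That is one straightforward reading of "99th percentile"; the exact rounding Datadog applies is an assumption here, `billableHosts` is a hypothetical helper, and the host counts are made-up illustration values.

```typescript
// Hypothetical helper: highest hourly count after discarding the top 1% of hours.
function billableHosts(hourlyCounts: readonly number[]): number {
  if (hourlyCounts.length === 0) return 0;
  const sorted = [...hourlyCounts].sort((a, b) => a - b);
  const discarded = Math.floor(sorted.length * 0.01); // 7 of 720 hours
  return sorted[sorted.length - 1 - discarded];
}

const HOURS = 720;    // 30-day month
const baseline = 40;  // steady-state host count (illustrative)
const spike = 120;    // host count during a scale test (illustrative)

// A 2-hour spike falls inside the discarded top 1% of hours.
const shortSpike = Array.from({ length: HOURS }, (_, h) => (h < 2 ? spike : baseline));
// A 48-hour weekend test does not.
const weekend = Array.from({ length: HOURS }, (_, h) => (h < 48 ? spike : baseline));

console.log(billableHosts(shortSpike)); // 40
console.log(billableHosts(weekend));    // 120
```

Under these assumptions, the weekend test triples the billed count for the month even though it covers under 7% of the hours.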
2. Kubernetes Pod Counting Blind Spot
Explanation: Every container or pod above the agent's threshold counts as a billable host. Deployments with frequent rollouts, job-style workloads, or sidecar proxies multiply host counts silently.
Fix: Use container_exclude rules for non-production workloads, consolidate sidecars where possible, and audit pod density per node to avoid over-provisioning that triggers additional host billing.
3. Unbounded Metric Cardinality
Explanation: Each unique metric+tag combination is a billable series. Tagging metrics with user_id, request_id, or other dynamic IDs makes series growth multiplicative: the worst-case count is the product of the distinct values of every tag. At $0.10 per 100 series, 2 million combinations cost $2,000 monthly.
Fix: Implement SDK-level cardinality guards, hash or bucket high-cardinality values, and move granular data to logs or traces instead of metrics.
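The arithmetic behind the $2,000 figure can be checked with a small estimator. It multiplies the distinct values per tag key to get a worst-case series count and applies the $0.10 per 100 series rate quoted in this pitfall. The function names and the tag counts in the usage lines are hypothetical illustration values.

```typescript
// $0.10 per 100 series is the same as 1,000 series per dollar.
const SERIES_PER_USD = 1000;

// Worst case: every combination of tag values actually occurs.
function estimateSeries(distinctValuesPerTag: Record<string, number>): number {
  return Object.values(distinctValuesPerTag).reduce((product, n) => product * n, 1);
}

function monthlyMetricCost(seriesCount: number): number {
  return seriesCount / SERIES_PER_USD;
}

// Bounded tags only: 50 * 8 * 5 = 2,000 series
const bounded = estimateSeries({ endpoint: 50, status_code: 8, region: 5 });
console.log(bounded, monthlyMetricCost(bounded)); // 2000 2

// Adding a user_id tag with 1,000 distinct values multiplies the count by 1,000
const withUserId = estimateSeries({ endpoint: 50, status_code: 8, region: 5, user_id: 1000 });
console.log(withUserId, monthlyMetricCost(withUserId)); // 2000000 2000
```

A single unbounded tag moves this one metric from $2 to $2,000 per month, which is why the guard belongs in the SDK rather than in a later cleanup pass.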
4. Index-Everything Log Strategy
Explanation: Ingestion is cheap ($0.10/GB), but indexing is expensive ($1.70/1M events). Indexing logs you never search or alert on is a direct budget leak.
Fix: Route only error/warn/critical logs to the index. Use Flex Logs for compliance retention and Archive tiers for long-term storage. Apply sampling rates to debug/info levels.
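The effect of per-level sampling on the indexing bill can be estimated before rollout. The sketch below applies an index rate to each level's monthly volume and prices the result at the $1.70 per million events figure from this pitfall. The volumes are invented for illustration, and `indexedEvents` and `indexingCost` are hypothetical names.

```typescript
const USD_PER_MILLION_INDEXED = 1.7;

type Level = 'debug' | 'info' | 'warn' | 'error' | 'fatal';
const LEVELS: readonly Level[] = ['debug', 'info', 'warn', 'error', 'fatal'];

// Expected number of indexed events given per-level volumes and index rates (0 to 1).
function indexedEvents(volume: Record<Level, number>, rates: Record<Level, number>): number {
  return LEVELS.reduce((sum, level) => sum + volume[level] * rates[level], 0);
}

function indexingCost(events: number): number {
  return (events / 1_000_000) * USD_PER_MILLION_INDEXED;
}

// Illustrative monthly volumes totalling 1 billion events.
const monthlyVolume: Record<Level, number> = {
  debug: 600_000_000,
  info: 300_000_000,
  warn: 60_000_000,
  error: 39_000_000,
  fatal: 1_000_000,
};

const indexEverything: Record<Level, number> = { debug: 1, info: 1, warn: 1, error: 1, fatal: 1 };
const sampled: Record<Level, number> = { debug: 0.05, info: 0.3, warn: 1, error: 1, fatal: 1 };

// Roughly $1,700 when everything is indexed, roughly $374 with sampling applied.
console.log(indexingCost(indexedEvents(monthlyVolume, indexEverything)).toFixed(2));
console.log(indexingCost(indexedEvents(monthlyVolume, sampled)).toFixed(2));
```

With this volume mix, the 0.05 and 0.3 sample rates cut the indexed event count from 1 billion to about 220 million while leaving every warn, error, and fatal event searchable.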
5. Enterprise Tier Creep
Explanation: Features like SSO, SAML, audit logs, extended retention, and Continuous Profiler require Enterprise tier upgrades. A single team's requirement can force the entire fleet onto Enterprise pricing ($23/$40/$41 per host vs. Pro).
Fix: Isolate compliance/security workloads to dedicated accounts or use open-source alternatives for non-critical features. Negotiate tier boundaries with Datadog sales to avoid fleet-wide upgrades.
6. Ignoring Annual Commitment Leverage
Explanation: Monthly billing carries a ~20% premium over annual commitments. Teams on month-to-month contracts pay significantly more for identical usage.
Fix: Forecast 12-month host/log/metric baselines, commit to annual contracts, and leverage volume tiers. Multi-year agreements typically yield 10-20% additional discounts.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / <50 hosts | Pro tier + Free logs | Low volume, minimal compliance needs | Baseline cost, predictable |
| High-volume logs (>500GB/mo) | Selective indexing + Flex Archive | Indexing cost dominates; Flex reduces $1.70/1M event exposure | 60-80% log cost reduction |
| Ephemeral K8s workloads | Container exclusion + uptime thresholds | Prevents 99th-percentile host inflation from batch/CI jobs | Eliminates 20-40% host billing spikes |
| Enterprise compliance (SSO/Audit) | Dedicated account or tier isolation | Avoids pulling entire fleet to Enterprise pricing | Saves $8-$16/host/mo on non-compliant nodes |
| Annual commitment ready | 12-month contract + volume tiers | Monthly billing carries ~20% premium; volume unlocks discounts | 10-20% list price reduction |
Configuration Template
# datadog.yaml - Production Cost Control Baseline

# Host & Container Billing Controls
container_exclude:
  - "kube_namespace:ci-cd"
  - "kube_namespace:batch"
  - "image:.*test-runner.*"
process_config:
  enabled: "true"
  container_collection:
    enabled: true
    min_uptime_seconds: 300

# Log Pipeline Routing
logs:
  - type: file
    path: /var/log/app/*.log
    service: "my-app"
    source: "nodejs"
    # These rules filter at the Agent: lines that fail the include rule or match
    # the exclude rule are dropped before ingestion and never reach Flex or Archive.
    # If DEBUG/INFO must be retained for compliance, filter at the index level instead.
    log_processing_rules:
      - type: include_at_match
        pattern: "(ERROR|FATAL|WARN)"
        name: "index_critical"
      - type: exclude_at_match
        pattern: "(DEBUG|INFO)"
        name: "flex_only"

# Metric Cardinality Safeguard (Agent-level)
# Note: SDK-level guards are preferred, but agent can drop high-cardinality tags
tags:
  - "env:prod"
  - "team:backend"
  # Avoid dynamic tags here; enforce in application code
Quick Start Guide
- Install & Configure Agent: Deploy the Datadog agent with the provided datadog.yaml template. Ensure container_exclude and min_uptime_seconds match your ephemeral workload patterns.
- Instrument SDK Guards: Replace direct metric emissions with the MetricCardinalityGuard class. Set MAX_CARDINALITY_PER_METRIC to 500 or your budget-aligned threshold.
- Deploy Log Sampler: Integrate LogSamplingPipeline into your application's logging framework. Set debugSampleRate to 0.05 and infoSampleRate to 0.3 for initial rollout.
- Validate Billing Impact: Monitor the Datadog billing dashboard for 7 days. Verify that host counts stabilize, metric series growth plateaus, and indexing costs drop by 50%+.
- Lock Annual Commitment: Once baselines are established, negotiate a 12-month contract with volume tiers. Disable monthly billing to capture the ~20% discount.