standard, a federated collector architecture, and a unified semantic convention for resource metadata.
Architecture Decisions
- Instrumentation Standard: OpenTelemetry is non-negotiable. It provides a vendor-neutral SDK for traces, metrics, and logs. This ensures that application code remains portable across clouds.
- Collector Topology: Use a two-tier collector model.
  - Edge Collectors: Deployed as DaemonSets or sidecars in each cloud environment. They handle data enrichment, filtering, and batching to minimize network calls.
  - Gateway Collectors: Centralized aggregators that merge data from edge collectors, apply global sampling rules, and export to backends.
- Resource Detection: Automate the injection of cloud-specific attributes (cloud.provider, cloud.region, k8s.cluster.name) to ensure every telemetry signal is contextualized regardless of origin.
- Backend Strategy: Decouple storage. Use a high-performance TSDB for metrics, a search-optimized store for logs, and a trace backend that supports tail-based sampling.
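The enrichment step performed by the edge tier can be sketched in a few lines. This is a minimal illustration, not collector code: the `EdgeContext` type and the `enrichResource` helper are hypothetical names, and the sample values are made up. It shows "insert" semantics, where the edge fills in missing cloud attributes but never overwrites what the application already reported.

```typescript
// Hypothetical types for illustration; not part of any OpenTelemetry package.
type Attributes = Record<string, string>;

interface EdgeContext {
  provider: string;
  region: string;
  clusterName: string;
}

// Add the required resource attributes without overwriting values that
// the application or an upstream detector already set.
function enrichResource(existing: Attributes, edge: EdgeContext): Attributes {
  const defaults: Attributes = {
    'cloud.provider': edge.provider,
    'cloud.region': edge.region,
    'k8s.cluster.name': edge.clusterName,
  };
  // Spread order matters: existing values win over the edge defaults.
  return { ...defaults, ...existing };
}

const enriched = enrichResource(
  { 'service.name': 'checkout', 'cloud.region': 'eu-west-1' },
  { provider: 'aws', region: 'us-east-1', clusterName: 'prod-eks-1' },
);
console.log(enriched);
```

Here the application-supplied cloud.region is preserved, while cloud.provider and k8s.cluster.name are filled in from the edge context.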
Implementation Steps
1. Standardize Resource Detection
Create a custom resource detector that dynamically identifies the cloud provider and enriches resources with consistent attributes.
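import { Resource, ResourceDetectionConfig, detectResources } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
// Provider detectors ship in contrib packages, not in @opentelemetry/resources.
// Assumption: SDK 1.x packages; verify these export names against your installed versions.
import { awsEc2Detector } from '@opentelemetry/resource-detector-aws';
import { gcpDetector } from '@opentelemetry/resource-detector-gcp';
import { azureVmDetector } from '@opentelemetry/resource-detector-azure';

// Custom detector to normalize multi-cloud attributes
export class MultiCloudResourceDetector {
  static async detect(config?: ResourceDetectionConfig): Promise<Resource> {
    // Run every provider detector; the ones that do not match the host contribute nothing
    const cloudResource = await detectResources({
      ...config,
      detectors: [awsEc2Detector, gcpDetector, azureVmDetector],
    });

    // Ensure consistent attribute naming across providers
    const attributes = {
      [SemanticResourceAttributes.CLOUD_PROVIDER]: cloudResource.attributes['cloud.provider'] || 'unknown',
      [SemanticResourceAttributes.CLOUD_REGION]: cloudResource.attributes['cloud.region'] || 'unknown',
      [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'production',
      [SemanticResourceAttributes.SERVICE_NAME]: process.env.OTEL_SERVICE_NAME || 'unknown-service',
    };

    return new Resource(attributes);
  }
}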
2. Configure the OTel Collector
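The collector configuration must handle multi-cloud routing and data reduction. Use the filter processor to keep only an allowlist of metrics, and the tail_sampling processor so that errors and slow traces are always kept while the remaining healthy traffic is sampled probabilistically. Tail sampling needs every span of a trace to reach the same collector instance, so these sampling rules belong on the gateway tier.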
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  # Normalize cloud attributes if missing.
  # cloud.provider is a resource attribute, so the resource processor is used here.
  resource:
    attributes:
      - key: cloud.provider
        value: "${env:CLOUD_PROVIDER}"
        action: insert
  # Reduce cardinality before export: only the listed metrics pass through
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
          - http.server.duration
          - http.server.request.size
          - cpu.usage
          - memory.usage
  # Tail-based sampling: keep 100% of errors and slow traces, 10% of the rest
  tail_sampling:
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 500 }
      - name: default-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlphttp/traces:
    endpoint: "${env:TRACE_BACKEND_URL}"
    headers:
      Authorization: "Bearer ${env:TRACE_API_KEY}"
  otlphttp/metrics:
    endpoint: "${env:METRICS_BACKEND_URL}"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, tail_sampling]
      exporters: [otlphttp/traces]
    metrics:
      receivers: [otlp]
      processors: [filter]
      exporters: [prometheus, otlphttp/metrics]
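To make the sampling policies easier to reason about, the decision they describe can be modelled as a small function. This is a simplified sketch, not the collector's implementation: the real tail_sampling processor buffers spans per trace before deciding, and the `bucketOf` hashing scheme and `TraceSummary` type below are hypothetical.

```typescript
// Simplified model of the three policies; a trace is kept if ANY policy matches.
interface TraceSummary {
  traceId: string; // hex string
  hasError: boolean;
  durationMs: number;
}

// Map a trace ID to a stable bucket in [0, 100) so the same trace always
// receives the same probabilistic decision (hypothetical hashing scheme).
function bucketOf(traceId: string): number {
  let hash = 0;
  for (let i = 0; i < traceId.length; i++) {
    hash = (hash * 31 + traceId.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

function shouldKeep(
  trace: TraceSummary,
  latencyThresholdMs = 500,
  samplePercent = 10,
): boolean {
  if (trace.hasError) return true; // error-policy
  if (trace.durationMs > latencyThresholdMs) return true; // latency-policy
  return bucketOf(trace.traceId) < samplePercent; // default-policy
}

const failed = { traceId: 'a1b2c3', hasError: true, durationMs: 20 };
const slow = { traceId: 'd4e5f6', hasError: false, durationMs: 900 };
console.log(shouldKeep(failed), shouldKeep(slow));
```

Failed and slow traces are always retained; only fast, successful traces are subject to the percentage-based decision.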
3. Initialize SDK in Application
Integrate the detector and configure the exporter to point to the local Edge Collector.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { MultiCloudResourceDetector } from './MultiCloudResourceDetector';

const sdk = new NodeSDK({
  // NodeSDK takes a list of detectors through `resourceDetectors` (SDK 1.x)
  resourceDetectors: [MultiCloudResourceDetector],
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    // Auto-instrumentations for http, express, pg, etc.
    getNodeAutoInstrumentations(),
  ],
});

sdk.start();

// Flush buffered telemetry before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
4. Define Cross-Cloud SLOs
Implement Service Level Objectives that account for cross-cloud latency. SLOs must be defined at the user journey level, not just the service level.
// Example SLO calculation logic using metrics
const SLO_TARGET = 0.999; // 99.9% availability

const crossCloudSLO = {
  target: SLO_TARGET,
  window: '30d',
  // Error budget calculation must include cross-cloud network failures
  errorBudget: (totalRequests: number) => totalRequests * (1 - SLO_TARGET),
};
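A worked example helps show what the error budget means in practice. The sketch below is illustrative only: the `SloWindow` type, the `errorBudgetReport` helper, and the request counts are assumptions, not figures from a real system. With a 99.9% target and 10 million requests in the window, roughly 10,000 failures are allowed; 4,000 observed failures consume about 40% of the budget.

```typescript
interface SloWindow {
  target: number; // e.g. 0.999
  totalRequests: number;
  failedRequests: number; // must include cross-cloud network failures
}

interface BudgetReport {
  allowedFailures: number;
  remainingFailures: number;
  budgetConsumed: number; // 1.0 means the budget is exhausted
  availability: number;
}

function errorBudgetReport(w: SloWindow): BudgetReport {
  const allowedFailures = w.totalRequests * (1 - w.target);
  return {
    allowedFailures,
    remainingFailures: allowedFailures - w.failedRequests,
    // Guard against a 100% target, which leaves no budget at all.
    budgetConsumed: allowedFailures === 0 ? Infinity : w.failedRequests / allowedFailures,
    availability: 1 - w.failedRequests / w.totalRequests,
  };
}

const report = errorBudgetReport({
  target: 0.999,
  totalRequests: 10000000,
  failedRequests: 4000,
});
console.log(report);
```

Because the inputs are floating-point values, the results are approximate and should be rounded before being displayed or compared.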
Pitfall Guide
1. Instrumenting Everything Without Sampling
Mistake: Sending 100% of traces and high-cardinality metrics to the backend.
Impact: Costs that grow with raw traffic volume and label cardinality, plus backend performance degradation.
Best Practice: Implement tail-based sampling. Drop traces for healthy, low-latency requests. Use cardinality limits on metrics. In multi-cloud, cross-service calls generate massive trace volumes; sampling is essential.
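The effect of a single high-cardinality attribute is easy to underestimate. As a rough sketch, the worst-case number of time series for one metric is the product of the distinct values of each attribute. The helper names and the attribute counts below are hypothetical and chosen only for illustration.

```typescript
// Worst-case series count: every combination of attribute values occurs.
function estimateSeries(distinctValues: Record<string, number>): number {
  return Object.keys(distinctValues).reduce(
    (product, key) => product * distinctValues[key],
    1,
  );
}

// Attributes whose distinct-value count exceeds the given limit.
function attributesOverLimit(
  distinctValues: Record<string, number>,
  limit: number,
): string[] {
  return Object.keys(distinctValues).filter((key) => distinctValues[key] > limit);
}

const bounded = { 'http.method': 5, 'http.status_code': 10, 'http.route': 50 };
const unbounded = { ...bounded, 'user.id': 100000 };

console.log(estimateSeries(bounded)); // 2500
console.log(estimateSeries(unbounded)); // 250000000
console.log(attributesOverLimit(unbounded, 1000)); // ['user.id']
```

A check like this can run in CI against a list of declared metric attributes, flagging anything unbounded before it reaches the backend.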
2. Ignoring Data Residency and Compliance
Mistake: Centralizing all telemetry in a single region without considering GDPR/CCPA requirements.
Impact: Legal violations and data sovereignty breaches.
Best Practice: Configure collectors to route PII or sensitive logs to region-specific storage buckets. Use the OTel Collector's routing connector (or the older routing processor it replaces) to direct data based on attributes like cloud.region.
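The routing rule itself is simple to express. The following sketch assumes a hypothetical region-to-backend map with placeholder URLs; it is not collector configuration, only an illustration of the decision. Note that it fails closed: refusing to export is safer than sending data to the wrong jurisdiction.

```typescript
// Hypothetical mapping; the URLs are placeholders, not real endpoints.
const regionalBackends = new Map<string, string>([
  ['eu-west-1', 'https://logs.eu.example.com'],
  ['us-east-1', 'https://logs.us.example.com'],
]);

function backendFor(attributes: Record<string, string | undefined>): string {
  const region = attributes['cloud.region'];
  const backend = region === undefined ? undefined : regionalBackends.get(region);
  if (backend === undefined) {
    // Fail closed rather than fall back to a default region.
    throw new Error(`No regional backend configured for region "${region ?? 'missing'}"`);
  }
  return backend;
}

console.log(backendFor({ 'cloud.region': 'eu-west-1' }));
```

Records without a cloud.region attribute are rejected as well, which is one more reason to enforce resource detection at the edge.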
3. Inconsistent Resource Tagging
Mistake: Relying on manual tags or inconsistent naming conventions across clouds.
Impact: Inability to correlate costs and performance by team, project, or environment.
Best Practice: Enforce a global tagging schema via CI/CD pipelines or infrastructure-as-code policies. Every resource must have team, cost-center, and environment tags. Automate injection via the OTel resource detector.
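A tagging schema is only useful if it is checked automatically. The sketch below shows one possible check for the three required tags; the `missingTags` helper is a hypothetical name, and in practice the same rule would more likely live in a policy engine or an infrastructure-as-code linter.

```typescript
const REQUIRED_TAGS = ['team', 'cost-center', 'environment'] as const;

// Return the required tags that are absent or blank.
function missingTags(tags: Record<string, string | undefined>): string[] {
  return REQUIRED_TAGS.filter((key) => (tags[key] ?? '').trim() === '');
}

const incomplete = missingTags({ team: 'payments', environment: ' ' });
console.log(incomplete); // ['cost-center', 'environment']
```

A CI step could fail the pipeline whenever the returned list is non-empty.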
4. Treating Cross-Cloud Network as "Free"
Mistake: Assuming network latency between clouds is negligible in traces.
Impact: Misleading latency breakdowns; engineers blame application code for network issues.
Best Practice: Instrument network spans explicitly. Use eBPF or sidecar proxies to capture network latency as distinct spans. Ensure trace context propagation includes cross-cloud hops.
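When both sides of a cross-cloud call are instrumented, the hop overhead can be estimated from the client and server spans. This is a minimal sketch with hypothetical types and timings. It compares durations rather than absolute timestamps, because clocks in different clouds are rarely synchronized well enough to subtract start times directly.

```typescript
interface SpanTiming {
  name: string;
  startMs: number;
  endMs: number;
}

// Time the client waited minus the time the server spent handling the call.
// The remainder approximates network transit, TLS setup, and queueing.
function hopOverheadMs(client: SpanTiming, server: SpanTiming): number {
  const clientDuration = client.endMs - client.startMs;
  const serverDuration = server.endMs - server.startMs;
  return Math.max(0, clientDuration - serverDuration);
}

// The server clock is skewed by about 1000 s; durations are unaffected.
const clientSpan = { name: 'HTTP GET /inventory', startMs: 0, endMs: 180 };
const serverSpan = { name: 'GET /inventory', startMs: 1000040, endMs: 1000100 };

console.log(hopOverheadMs(clientSpan, serverSpan)); // 120
```

In this example the server handled the request in 60 ms, so 120 ms of the 180 ms the caller observed was spent outside application code.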
5. Alert Fatigue from Threshold Mismatch
Mistake: Setting static thresholds that do not account for cloud-specific baselines.
Impact: Alerts fire during normal cloud provider fluctuations or maintenance windows.
Best Practice: Use anomaly detection or dynamic thresholds. Normalize metrics before alerting. For example, CPU usage on AWS Graviton vs. Azure AMD instances may have different performance profiles; alert on relative utilization or error rates, not absolute CPU %.
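One simple form of dynamic threshold compares the current value with a per-environment baseline instead of a fixed number. The sketch below uses a z-score against recent samples; the baselines are made-up values, and a production system would more likely rely on the alerting backend's built-in anomaly detection.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation of the baseline samples.
function stdDev(values: number[]): number {
  const m = mean(values);
  const variance = values.reduce((sum, v) => sum + (v - m) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// Flag a value that deviates from its own baseline by more than zThreshold
// standard deviations.
function isAnomalous(baseline: number[], current: number, zThreshold = 3): boolean {
  const m = mean(baseline);
  const sd = stdDev(baseline);
  if (sd === 0) return current !== m;
  return Math.abs(current - m) / sd > zThreshold;
}

// Hypothetical CPU utilization baselines (%) for two instance families.
const awsBaseline = [40, 42, 41, 39, 43];
const azureBaseline = [70, 72, 71, 69, 73];

console.log(isAnomalous(azureBaseline, 73)); // false: normal for this fleet
console.log(isAnomalous(awsBaseline, 73)); // true: far above its own baseline
```

A static threshold of 60% would page continuously for the second fleet and miss a genuine regression on the first; the relative check handles both.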
6. Vendor Lock-in via Proprietary Instrumentation
Mistake: Using cloud-specific SDKs or proprietary agents that bind the codebase to a provider.
Impact: Migration becomes prohibitively expensive; observability cost is tied to cloud spend.
Best Practice: Mandate OpenTelemetry. Prohibit direct usage of CloudWatch SDK or Azure Monitor SDK in application code. The OTel Collector is the only component allowed to interact with native APIs for metadata enrichment.
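A prohibition like this can be backed by a lightweight check in CI. The sketch below scans source text for imports of provider-specific monitoring packages. The deny-list prefixes are illustrative assumptions to be adapted to local policy, and the regular expression is a rough approximation that ignores side-effect imports; a lint rule built on a real parser would be more robust.

```typescript
// Illustrative deny-list; adjust the prefixes to your own policy.
const FORBIDDEN_PREFIXES = ['@aws-sdk/client-cloudwatch', '@azure/monitor-'];

// Collect module specifiers from `import ... from '...'` and `require('...')`.
function importedModules(source: string): string[] {
  const pattern = /(?:from\s+|require\(\s*)['"]([^'"]+)['"]/g;
  const modules: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    modules.push(match[1]);
  }
  return modules;
}

function forbiddenImports(source: string): string[] {
  return importedModules(source).filter((name) =>
    FORBIDDEN_PREFIXES.some((prefix) => name.indexOf(prefix) === 0),
  );
}

const sample = [
  "import { trace } from '@opentelemetry/api';",
  "import { CloudWatchClient } from '@aws-sdk/client-cloudwatch';",
].join('\n');

console.log(forbiddenImports(sample)); // ['@aws-sdk/client-cloudwatch']
```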
7. Neglecting Log Correlation
Mistake: Sending logs and traces separately without linking them.
Impact: Debugging requires manual search across tools; context is lost.
Best Practice: Inject trace_id and span_id into log records. Configure the OTel log exporter to include these attributes. Ensure the backend supports log-trace correlation views.
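The injection step can be as small as a wrapper around the logger. The sketch below is self-contained and uses a hypothetical `SpanContextLike` type and `correlatedLog` helper. In an instrumented service the context would typically come from the active span via @opentelemetry/api, for example `trace.getActiveSpan()?.spanContext()`; the IDs shown are sample values.

```typescript
interface SpanContextLike {
  traceId: string;
  spanId: string;
}

// Build a structured log line that carries the trace identifiers when present.
function correlatedLog(level: string, message: string, ctx?: SpanContextLike): string {
  const record: Record<string, string> = {
    timestamp: new Date().toISOString(),
    level,
    message,
  };
  if (ctx) {
    record.trace_id = ctx.traceId;
    record.span_id = ctx.spanId;
  }
  return JSON.stringify(record);
}

const line = correlatedLog('error', 'payment declined', {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
});
console.log(line);
```

With the identifiers in every record, the backend can join a log line to its trace without any text search.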
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / Low Volume | Unified SaaS with OTel SDK | Speed to value; low engineering overhead; SaaS handles scaling. | Medium. License costs dominate, but egress is manageable. |
| Enterprise / High Volume | OTel + Agnostic Backend | Control over data; reduced egress costs; no lock-in; custom retention policies. | Low to Medium. High initial engineering cost, but TCO decreases significantly at scale. |
| Strict Compliance / Data Sovereignty | Federated OTel + Regional Backends | Data stays within region; centralized policy enforcement; auditability. | Medium. Operational complexity increases; storage costs distributed. |
| Legacy Multi-Cloud | Hybrid: OTel SDK + Native Backends | Incremental migration; reduces agent bloat; prepares for future backend switch. | Low. Immediate reduction in agent count; native costs remain until migration complete. |
Configuration Template
Terraform Module for Multi-Cloud OTel Collector Deployment
This template demonstrates deploying a collector in AWS EKS and Azure AKS with consistent configuration.
# modules/otel-collector/main.tf
variable "cloud_provider" {
  type = string
}

variable "region" {
  type = string
}

variable "cluster_name" {
  type = string
}

resource "helm_release" "otel_collector" {
  name       = "opentelemetry-collector"
  repository = "https://open-telemetry.github.io/opentelemetry-helm-charts"
  chart      = "opentelemetry-collector"
  version    = "0.75.0"
  namespace  = "monitoring"

  set {
    name  = "mode"
    value = "daemonset"
  }

  set {
    name  = "config.service.pipelines.metrics.receivers[0]"
    value = "otlp"
  }

  # Inject cloud-specific environment variables for resource detection.
  # The chart takes container environment variables through its extraEnvs list.
  set {
    name  = "extraEnvs[0].name"
    value = "CLOUD_PROVIDER"
  }
  set {
    name  = "extraEnvs[0].value"
    value = var.cloud_provider
  }
  set {
    name  = "extraEnvs[1].name"
    value = "CLOUD_REGION"
  }
  set {
    name  = "extraEnvs[1].value"
    value = var.region
  }
  set {
    name  = "extraEnvs[2].name"
    value = "CLUSTER_NAME"
  }
  set {
    name  = "extraEnvs[2].value"
    value = var.cluster_name
  }

  # Values file for complex config
  values = [
    file("${path.module}/values/${var.cloud_provider}.yaml")
  ]
}
Quick Start Guide
- Install OTel SDK: Add @opentelemetry/sdk-node and relevant auto-instrumentations to your package.json.
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
- Configure Collector: Create an otel-collector-config.yaml with receivers, processors (filtering/sampling), and exporters pointing to your backend.
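- Deploy Collector: Run the collector locally for testing or deploy via Helm to your Kubernetes clusters in each cloud. Helm's -f flag expects a chart values file, so place the collector configuration under the chart's config key in a values.yaml rather than passing otel-collector-config.yaml directly.
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-collector open-telemetry/opentelemetry-collector -f values.yaml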
- Start Application: Run your app with OTel environment variables pointing to the collector.
OTEL_SERVICE_NAME=my-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
node index.js
- Verify: Check your backend for metrics, logs, and traces. Confirm that cloud.provider and cloud.region attributes are populated correctly.