AI/ML · 2026-05-14 · 92 min read

How I Monitor AI Agents: CloudWatch for Infra, Arize Phoenix for Traces and OpenTelemetry, LLM-as-Judge for Quality

By Carlos Cortez 🇵🇪 [AWS Hero]

Architecting Observability for Autonomous AI Agents: A Three-Tier Telemetry Framework

Current Situation Analysis

Traditional software observability was engineered for deterministic systems. When a REST endpoint returns 200 OK in 45ms, you know the request succeeded. When a database query times out, you know where the bottleneck lives. AI agents break this model entirely. They are non-deterministic, stateful reasoning loops that invoke external tools, maintain context windows, and make probabilistic decisions. An agent can complete a full execution cycle in under two seconds, return a perfectly formatted response, and still fail catastrophically from a user perspective. It might hallucinate a tool output, ignore a critical constraint, or enter a recursive loop that burns tokens without producing value.

This creates a dangerous blind spot in production environments. Engineering teams deploy agents with standard SRE dashboards tracking HTTP status codes, p95 latency, and error rates. The dashboards stay green while user satisfaction plummets. The industry overlooks this because most observability platforms were built for microservices, not reasoning chains. They capture network hops, not semantic intent.

Data from recent LLM application post-mortems indicates that over 65% of agent failures are semantic or behavioral, not infrastructural. A missing API key triggers an immediate alert. A subtle prompt drift that causes the agent to misinterpret tool schemas does not. Without telemetry that captures the decision chain, teams are flying blind. You cannot optimize what you cannot measure, and you cannot measure reasoning with CPU utilization graphs.

The solution requires a paradigm shift: observability must operate at three distinct layers. Infrastructure health tells you if the system is alive. Semantic tracing tells you what the agent thought. Quality evaluation tells you if the output actually solved the problem. Missing any single layer leaves critical failure modes undetected.

WOW Moment: Key Findings

The fundamental gap between traditional monitoring and AI-native observability becomes clear when you compare what each approach actually captures. The table below contrasts a standard microservice monitoring stack against a purpose-built AI agent telemetry framework.

| Dimension | Traditional SRE Monitoring | AI-Native Observability Stack |
| --- | --- | --- |
| Failure Detection | HTTP 5xx, timeouts, circuit breakers | Semantic drift, tool misuse, constraint violations, hallucination |
| Reasoning Visibility | Request/response payloads only | Full span tree: prompt construction, tool selection, intermediate states, final output |
| Cost Attribution | Compute hours, egress bandwidth | Token consumption per step, tool invocation cost, judge evaluation overhead |
| Quality Assurance | Uptime SLAs, latency percentiles | Automated LLM-as-judge scoring, rubric-based regression testing, output grounding |
| Alerting Triggers | Threshold breaches, anomaly detection | Quality score decay, reasoning loop detection, token budget exhaustion |

This comparison reveals why a single monitoring layer is insufficient. Traditional metrics answer operational questions: Is the service running? Is it fast? AI-native telemetry answers behavioral questions: Did the agent follow instructions? Did it use the correct tool? Was the response factually aligned with the prompt?

The finding matters because it shifts observability from reactive incident response to proactive quality governance. When you can correlate infrastructure latency with semantic quality scores, you can distinguish between a slow but accurate response and a fast but hallucinated one. This enables automated rollback triggers, dynamic prompt versioning, and cost-aware routing decisions that traditional stacks simply cannot support.

Core Solution

Building a production-grade observability pipeline for AI agents requires orchestrating three independent telemetry systems that share a common context identifier. The architecture leverages OpenTelemetry for semantic tracing, Amazon CloudWatch for operational metrics, and a provider-agnostic LLM evaluator for quality scoring. Below is the implementation breakdown.
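
Before diving into the three steps, it helps to pin down what that shared context identifier looks like in code. The sketch below is illustrative rather than part of any library: the AgentExecutionContext helper and its attribute names are assumptions, but the pattern of keeping high-cardinality IDs on spans and low-cardinality IDs on metrics is the one the rest of this article relies on.

import uuid
from dataclasses import dataclass, field

@dataclass
class AgentExecutionContext:
    """Correlation identifiers shared by traces, metrics, and evaluations."""
    agent_id: str
    execution_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def span_attributes(self) -> dict:
        # High-cardinality IDs belong on spans, not on CloudWatch dimensions.
        return {"agent.id": self.agent_id, "agent.execution_id": self.execution_id}

    def metric_dimensions(self) -> list:
        # Keep CloudWatch dimensions low-cardinality (see Pitfall 4 below).
        return [{"Name": "AgentID", "Value": self.agent_id}]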

Step 1: Semantic Tracing with OpenTelemetry and Phoenix

The first layer captures the agent's reasoning chain. OpenTelemetry provides a vendor-neutral instrumentation standard, while Arize Phoenix offers a zero-friction local UI for trace visualization. The goal is to record every LLM invocation, tool call, and context assembly step without polluting business logic.

import phoenix as px
from opentelemetry import trace as trace_api
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.bedrock import BedrockInstrumentor

class SemanticTraceRouter:
    """Boots the Phoenix UI and routes all OpenTelemetry spans to it."""

    def __init__(self, endpoint: str = "http://localhost:6006/v1/traces"):
        self._launch_ui()
        self._configure_pipeline(endpoint)
        self._instrument_model_layer()

    def _launch_ui(self) -> None:
        # Starts the local Phoenix trace viewer at http://localhost:6006.
        px.launch_app()

    def _configure_pipeline(self, endpoint: str) -> None:
        # Batch export keeps span delivery off the agent's hot path.
        exporter = OTLPSpanExporter(endpoint=endpoint)
        processor = BatchSpanProcessor(exporter)
        provider = TracerProvider()
        provider.add_span_processor(processor)
        trace_api.set_tracer_provider(provider)

    def _instrument_model_layer(self) -> None:
        # Auto-wraps every Bedrock invocation with OpenInference span attributes.
        provider = trace_api.get_tracer_provider()
        BedrockInstrumentor().instrument(tracer_provider=provider)

Architecture Rationale: We use BatchSpanProcessor instead of SimpleSpanProcessor to prevent blocking the agent's execution thread during high-throughput periods. The instrumentation is applied at the model layer (BedrockInstrumentor), which automatically wraps all Bedrock API calls with semantic attributes. This eliminates manual span creation and ensures consistent trace hierarchy across different agent frameworks.
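
The instrumentor only covers Bedrock calls; custom tools still need their own spans if you want tool latency attributed separately (see Pitfall 5). Below is a minimal sketch, assuming the SemanticTraceRouter above has already been initialized; the span name and attribute keys approximate the OpenInference conventions and should be checked against your installed version.

from opentelemetry import trace as trace_api

tracer = trace_api.get_tracer("agent.tools")

def traced_tool_call(tool_name: str, tool_fn, **kwargs):
    # Wrap a custom tool invocation in its own span so TOOL latency
    # is attributed separately from LLM inference time.
    with tracer.start_as_current_span(tool_name) as span:
        span.set_attribute("openinference.span.kind", "TOOL")
        span.set_attribute("tool.parameters", str(kwargs))
        result = tool_fn(**kwargs)
        span.set_attribute("tool.output_preview", str(result)[:200])
        return result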

Step 2: Operational Telemetry via CloudWatch

Semantic traces explain behavior, but they don't track cost or reliability at scale. CloudWatch provides native AWS integration for metric aggregation, alarm routing, and cost allocation. We build a dedicated emitter that publishes structured metrics per agent execution.

import boto3

class OperationalTelemetryEmitter:
    def __init__(self, region: str = "us-east-1", namespace: str = "AI/AgentRuntime"):
        self._client = boto3.client("cloudwatch", region_name=region)
        self._namespace = namespace
        self._dimensions = [{"Name": "Environment", "Value": "production"}]

    def publish_execution_metrics(
        self,
        agent_id: str,
        latency_ms: float,
        input_tokens: int,
        output_tokens: int,
        tool_invocations: int,
        execution_status: str
    ) -> None:
        metric_data = [
            {"MetricName": "ExecutionLatency", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "TokenConsumption", "Value": input_tokens + output_tokens, "Unit": "Count"},
            {"MetricName": "ToolInvocationCount", "Value": tool_invocations, "Unit": "Count"},
            {"MetricName": "ExecutionSuccess", "Value": 1 if execution_status == "completed" else 0, "Unit": "Count"},
        ]
        
        for metric in metric_data:
            metric["Dimensions"] = self._dimensions + [{"Name": "AgentID", "Value": agent_id}]
            
        self._client.put_metric_data(Namespace=self._namespace, MetricData=metric_data)

Architecture Rationale: Metrics are grouped into a single put_metric_data call to minimize API overhead. We track token consumption separately from latency because token costs scale differently than compute time. The ExecutionSuccess metric uses a binary flag to enable straightforward error rate calculations (1 - AVG(ExecutionSuccess)). This structure supports CloudWatch's built-in math expressions for deriving SLA compliance without custom dashboards.
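
For context, here is roughly how the emitter might wrap an agent run. This is a sketch, not a production harness: run_agent and the token-count attributes on its result are placeholders for whatever your agent framework actually exposes.

import time

emitter = OperationalTelemetryEmitter(region="us-east-1")

def monitored_run(agent_id: str, prompt: str):
    start = time.perf_counter()
    status = "completed"
    try:
        result = run_agent(prompt)  # hypothetical: your agent framework's entry point
    except Exception:
        result, status = None, "failed"
        raise
    finally:
        # Publish metrics whether the run succeeded or raised.
        emitter.publish_execution_metrics(
            agent_id=agent_id,
            latency_ms=(time.perf_counter() - start) * 1000.0,
            input_tokens=getattr(result, "input_tokens", 0),
            output_tokens=getattr(result, "output_tokens", 0),
            tool_invocations=getattr(result, "tool_calls", 0),
            execution_status=status,
        )
    return result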

Step 3: Automated Quality Evaluation with LLM-as-Judge

Infrastructure and traces tell you what happened. Quality evaluation tells you if it mattered. We implement an automated evaluator that scores agent outputs against predefined rubrics using a secondary LLM instance. This creates a regression testing loop for prompt and tool changes.

from typing import Callable

import pandas as pd
from phoenix.evals import LLM, create_evaluator, evaluate_dataframe
from phoenix.client import Client

class SemanticQualityEvaluator:
    def __init__(self, judge_model_id: str = "us.anthropic.claude-sonnet-4-6"):
        self._judge = LLM(provider="bedrock", model=judge_model_id)
        self._client = Client()

    def extract_llm_context(self) -> pd.DataFrame:
        # Pull all spans captured by Phoenix and keep only the LLM invocations.
        spans = self._client.spans.get_spans_dataframe()
        llm_spans = spans[spans["span_kind"] == "LLM"].copy()
        dataset = pd.DataFrame({
            "user_query": llm_spans["attributes.input.value"].fillna("").astype(str),
            "agent_response": llm_spans["attributes.output.value"].fillna("").astype(str)
        })
        # Boolean masking avoids DataFrame.query's engine limitations with .str accessors.
        return dataset[dataset["agent_response"].str.len() > 0].reset_index(drop=True)

    def build_helpfulness_rubric(self) -> Callable[..., float]:
        @create_evaluator(name="response_helpfulness", source="llm")
        def _evaluate_helpfulness(user_query: str, agent_response: str) -> float:
            rubric_prompt = (
                "Evaluate the following AI response against these criteria:\n"
                "1. Directly addresses the user's intent\n"
                "2. Uses provided tool outputs accurately\n"
                "3. Maintains factual consistency\n"
                "Return a single float between 0.0 (unhelpful) and 1.0 (highly helpful).\n"
                f"Query: {user_query}\nResponse: {agent_response}"
            )
            raw_score = self._judge.generate_text(prompt=rubric_prompt)
            try:
                return float(raw_score.strip())
            except ValueError:
                return 0.5
        return _evaluate_helpfulness

    def run_quality_audit(self) -> pd.DataFrame:
        dataset = self.extract_llm_context()
        evaluator = self.build_helpfulness_rubric()
        return evaluate_dataframe(dataframe=dataset, evaluators=[evaluator])

Architecture Rationale: The evaluator decouples trace extraction from scoring logic. By using a structured rubric prompt instead of open-ended judgment, we reduce LLM variance and improve score consistency. The create_evaluator decorator integrates directly with Phoenix's evaluation pipeline, enabling batch processing and historical trend tracking. We default to 0.5 on parse failures to prevent pipeline crashes while flagging malformed judge responses for manual review.
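
A natural follow-up is feeding the audit back into CloudWatch so quality decay can trigger an alarm alongside the operational metrics. The sketch below makes two assumptions: the score column name produced by evaluate_dataframe (which varies across phoenix-evals releases, hence the defensive lookup) and the ResponseHelpfulness metric name, which is illustrative rather than part of the configuration template.

import boto3
import pandas as pd

def publish_quality_baseline(namespace: str = "AI/AgentRuntime") -> float:
    # Run the audit, average the judge scores, and push the mean so a
    # CloudWatch alarm can watch for score decay over time.
    audit = SemanticQualityEvaluator().run_quality_audit()
    # Column naming differs between phoenix-evals releases; pick the first
    # column that looks like the helpfulness score.
    score_col = next((c for c in audit.columns if "helpfulness" in c.lower()), None)
    if score_col is None:
        raise RuntimeError("No helpfulness score column found in audit output")
    mean_score = float(pd.to_numeric(audit[score_col], errors="coerce").mean())
    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,
        MetricData=[{"MetricName": "ResponseHelpfulness", "Value": mean_score, "Unit": "None"}],
    )
    return mean_score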

Pitfall Guide

1. The HTTP 200 Fallacy

Explanation: Teams assume a successful API response equals a successful agent execution. AI agents can return valid JSON with correct HTTP status codes while hallucinating data, ignoring constraints, or failing to invoke necessary tools. Fix: Implement semantic validation gates that run post-execution. Compare tool outputs against expected schemas, verify constraint compliance, and trigger quality evaluators before marking an execution as successful.
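
As a concrete example, a gate along these lines can run before an execution is marked successful; the required fields and forbidden phrases below are placeholders you would derive from your own tool schemas and constraints.

import json
from typing import Any

def semantic_validation_gate(response_text: str, required_fields: list[str],
                             forbidden_phrases: list[str]) -> bool:
    # A cheap post-execution gate: schema presence plus constraint compliance.
    # Executions that fail it should be routed to deeper LLM-as-judge review
    # instead of being marked successful on HTTP 200 alone.
    try:
        payload: dict[str, Any] = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if any(field not in payload for field in required_fields):
        return False
    flat = json.dumps(payload).lower()
    return not any(phrase.lower() in flat for phrase in forbidden_phrases)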

2. Unbounded Local Trace Storage

Explanation: Arize Phoenix stores traces in a local SQLite database by default. In production or high-throughput staging, this fills disk space rapidly and degrades query performance. Fix: Configure periodic export jobs to S3 or a centralized object store. Implement span sampling strategies (e.g., keep 100% of error traces, 10% of success traces) and set retention policies. Never run Phoenix in production without an external storage backend.
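
One way to bias sampling toward failures without adding a collector is a custom span processor that forwards every error span but only a fraction of successful ones. This is a per-span sketch rather than true tail-based trace sampling, and the class below is an assumption, not a Phoenix or OpenTelemetry built-in.

import random
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor
from opentelemetry.trace import StatusCode

class ErrorBiasedSpanProcessor(SpanProcessor):
    """Keep every error span, forward only a fraction of successful ones."""

    def __init__(self, delegate: SpanProcessor, success_sample_rate: float = 0.10):
        self._delegate = delegate  # e.g. the BatchSpanProcessor from Step 1
        self._rate = success_sample_rate

    def on_start(self, span, parent_context=None) -> None:
        self._delegate.on_start(span, parent_context)

    def on_end(self, span: ReadableSpan) -> None:
        # Always export errors; sample successes at the configured rate.
        is_error = span.status.status_code is StatusCode.ERROR
        if is_error or random.random() < self._rate:
            self._delegate.on_end(span)

    def shutdown(self) -> None:
        self._delegate.shutdown()

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        return self._delegate.force_flush(timeout_millis)

Wiring it in means wrapping the batch processor from Step 1, i.e. provider.add_span_processor(ErrorBiasedSpanProcessor(BatchSpanProcessor(exporter))) instead of registering the batch processor directly.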

3. LLM-Judge Criteria Drift

Explanation: Using a single open-ended prompt for quality scoring causes the judge model to shift its evaluation standards over time or across different input distributions. Scores become incomparable across versions. Fix: Ground evaluations with explicit rubrics, provide few-shot examples in the judge prompt, and pin the judge model version. Implement score normalization and track judge confidence metrics to detect when the evaluator itself is uncertain.
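
To make the pinning concrete, a versioned rubric with scored anchor examples might look like the sketch below; the example queries, scores, and version string are invented purely for illustration.

RUBRIC_VERSION = "v2.1"  # bump on any wording change and pin alongside the judge model ID

FEW_SHOT_ANCHORS = (
    "Example A | Query: 'What is our refund SLA?' | "
    "Response: 'Refunds are processed within 14 days, per policy doc #42.' | Score: 0.9\n"
    "Example B | Query: 'What is our refund SLA?' | "
    "Response: 'I think refunds are probably instant.' | Score: 0.1\n"
)

def build_versioned_rubric(user_query: str, agent_response: str) -> str:
    # Scored anchor examples keep the judge's standards comparable across
    # input distributions and over time.
    return (
        f"[rubric {RUBRIC_VERSION}]\n"
        "Score the response from 0.0 to 1.0 for intent coverage, "
        "tool-output fidelity, and factual consistency.\n"
        f"{FEW_SHOT_ANCHORS}"
        f"Query: {user_query}\nResponse: {agent_response}\nScore:"
    )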

4. Metric Dimension Explosion

Explanation: Adding high-cardinality dimensions (like UserID, SessionID, or PromptHash) to CloudWatch metrics causes billing spikes and query timeouts. CloudWatch is optimized for aggregated metrics, not event-level tracking. Fix: Reserve CloudWatch dimensions for low-cardinality routing keys (AgentID, Environment, Region). Push high-cardinality data to traces or logs instead. Use metric math to derive composite metrics rather than creating new dimension combinations.
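
CloudWatch metric math makes the derived error rate concrete without any new dimensions. The alarm below is a hedged sketch: the alarm name, threshold, and the AgentID value are placeholders, and the dimensions inside MetricStat must match exactly what the Step 2 emitter publishes.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Derive an error rate (1 - average success) with metric math instead of
# publishing a separate high-cardinality error metric per user or session.
cloudwatch.put_metric_alarm(
    AlarmName="Agent-ErrorRate-Breach",
    ComparisonOperator="GreaterThanThreshold",
    Threshold=0.05,
    EvaluationPeriods=3,
    Metrics=[
        {
            "Id": "error_rate",
            "Expression": "1 - success_avg",
            "Label": "AgentErrorRate",
            "ReturnData": True,
        },
        {
            "Id": "success_avg",
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AI/AgentRuntime",
                    "MetricName": "ExecutionSuccess",
                    # Must mirror the dimensions published by the emitter.
                    "Dimensions": [
                        {"Name": "Environment", "Value": "production"},
                        {"Name": "AgentID", "Value": "support-agent"},  # placeholder
                    ],
                },
                "Period": 300,
                "Stat": "Average",
            },
        },
    ],
)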

5. Ignoring Tool-Use Latency Attribution

Explanation: When an agent appears slow, teams blame the LLM inference time. Often, the bottleneck is an external API call, database query, or file I/O operation wrapped in a tool. Fix: Ensure OpenTelemetry spans clearly separate LLM spans from TOOL spans. Track tool execution time independently and set up composite alarms that trigger when tool latency exceeds a threshold, even if LLM inference remains fast.

6. Context Window Blindness

Explanation: Agents silently truncate or drop context when approaching token limits, causing degraded reasoning without raising errors. Traditional metrics show normal latency and token counts, masking the degradation. Fix: Emit context_window_utilization as a custom metric (used tokens / max allowed tokens). Alert when utilization exceeds 85%. Implement proactive context summarization or sliding window strategies before hard limits are reached.
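
A utilization emitter can be a few lines; the 200,000-token default and the ContextWindowUtilization metric name below are assumptions to adapt to your model's context limit and your namespace.

import boto3

def publish_context_utilization(agent_id: str, used_tokens: int,
                                max_context_tokens: int = 200_000) -> float:
    # Emit used/max as a percentage so an alarm can fire above 85%
    # before silent truncation degrades reasoning.
    utilization = used_tokens / max_context_tokens
    boto3.client("cloudwatch").put_metric_data(
        Namespace="AI/AgentRuntime",
        MetricData=[{
            "MetricName": "ContextWindowUtilization",
            "Value": utilization * 100.0,
            "Unit": "Percent",
            "Dimensions": [{"Name": "AgentID", "Value": agent_id},
                           {"Name": "Environment", "Value": "production"}],
        }],
    )
    return utilization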

7. Judge Model Cost Overrun

Explanation: Running LLM-as-judge evaluations on every single agent response multiplies token costs. A 1000-request workload can easily trigger 2000+ judge calls, doubling inference expenses. Fix: Implement adaptive sampling for evaluations. Run full rubric scoring on a statistically significant subset (e.g., 20%), use lightweight rule-based checks for the remainder, and only trigger deep evaluation when baseline metrics indicate potential degradation.
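
A sampling gate for the judge can be as simple as the sketch below; the 0.92 baseline mirrors the quality-decay threshold in the configuration template, and the rule-check input is whatever cheap validation you already run (for example, the gate from Pitfall 1).

import random

def should_run_judge(baseline_success_rate: float,
                     rule_checks_passed: bool,
                     base_sample_rate: float = 0.20) -> bool:
    # Always escalate to the judge when cheap rule-based checks fail or the
    # rolling success baseline dips; otherwise sample a fixed fraction.
    if not rule_checks_passed:
        return True
    if baseline_success_rate < 0.92:  # mirrors the quality-decay alarm threshold
        return True
    return random.random() < base_sample_rate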

Production Bundle

Action Checklist

  • Initialize OpenTelemetry tracer provider with batch processing before agent startup
  • Configure Phoenix local endpoint and verify span ingestion via localhost:6006
  • Instrument the model layer using framework-native instrumentors to avoid manual span creation
  • Deploy CloudWatch metric emitter with standardized dimension sets and binary success flags
  • Implement LLM-as-judge evaluator with explicit rubrics and version-pinned judge models
  • Set up CloudWatch alarms for latency thresholds, error rate decay, and token budget exhaustion
  • Configure trace export pipeline to external storage with retention policies and sampling rules
  • Establish baseline quality scores and implement regression alerts for score degradation

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Development/Testing | Local Phoenix + Batch OTel | Zero infrastructure overhead, instant UI feedback, privacy-compliant | Minimal (local compute only) |
| Staging/Pre-Prod | Hosted Phoenix or S3-exported traces | Enables team collaboration, preserves trace history, supports load testing | Low-Moderate (storage + egress) |
| Production Monitoring | CloudWatch metrics + trace sampling | Scales to millions of requests, integrates with existing alerting, cost-optimized | Moderate (metric ingestion + sampled storage) |
| Quality Assurance | LLM-as-judge on 20% sample + rule-based checks | Balances evaluation depth with cost, catches semantic drift without doubling inference spend | Low-Moderate (judge tokens scaled by sample rate) |
| High-Throughput Agents | Async metric emission + span batching | Prevents blocking agent execution, maintains throughput under load | Neutral (infrastructure cost unchanged) |

Configuration Template

# telemetry-config.yaml
observability:
  tracing:
    provider: "opentelemetry"
    exporter:
      type: "otlp_http"
      endpoint: "http://localhost:6006/v1/traces"
      batch:
        max_queue_size: 2048
        schedule_delay_millis: 5000
        max_export_batch_size: 512
  metrics:
    provider: "cloudwatch"
    namespace: "AI/AgentRuntime"
    dimensions:
      - key: "Environment"
        value: "production"
      - key: "Region"
        value: "us-east-1"
    alarms:
      - name: "Agent-Latency-Breach"
        metric: "ExecutionLatency"
        threshold: 8000
        comparison: "GreaterThanThreshold"
        evaluation_periods: 3
        period_seconds: 300
      - name: "Agent-Quality-Decay"
        metric: "ExecutionSuccess"
        threshold: 0.92
        comparison: "LessThanThreshold"
        evaluation_periods: 2
        period_seconds: 600
  evaluation:
    judge_model: "us.anthropic.claude-sonnet-4-6"
    sampling_rate: 0.2
    rubric_version: "v2.1"
    fallback_score: 0.5

Quick Start Guide

  1. Install Dependencies: Run pip install opentelemetry-sdk opentelemetry-exporter-otlp openinference-instrumentation-bedrock arize-phoenix boto3 pandas to pull the core telemetry stack.
  2. Initialize Tracing: Instantiate SemanticTraceRouter() at application startup. Verify the Phoenix UI loads at http://localhost:6006 and displays incoming spans.
  3. Wire Metrics Emitter: Create an OperationalTelemetryEmitter instance and wrap your agent execution loop. Pass latency, token counts, and execution status to publish_execution_metrics() after each run.
  4. Run Quality Audit: Call SemanticQualityEvaluator().run_quality_audit() on a batch of recent traces. Review the scored DataFrame in Phoenix or export to CSV for trend analysis.
  5. Deploy Alarms: Use the provided CloudWatch alarm definitions or IaC template to create monitoring rules. Validate alert routing to your incident management channel before promoting to production.
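
Putting the quick-start steps together, a single entry point might look like the sketch below. It reuses the classes from the Core Solution and the monitored_run helper sketched in Step 2; run_agent inside that helper remains a placeholder for your framework's entry point.

def main() -> None:
    SemanticTraceRouter()                   # Step 1: spans stream to Phoenix
    evaluator = SemanticQualityEvaluator()  # Step 3: judge-based scoring

    # Step 2's monitored_run wraps the agent call and publishes CloudWatch metrics.
    monitored_run(agent_id="support-agent",
                  prompt="Summarize yesterday's deployment incidents.")

    # Batch-score whatever traces Phoenix has collected so far.
    print(evaluator.run_quality_audit().head())

if __name__ == "__main__":
    main()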