Monitoring LLM API Calls in Python: Latency, Token Usage, and Cost Tracking With OpenTelemetry

By Codcompass Team·2026-05-11·11 min read

LLM API calls are unlike any other external dependency in your Python application.

A database query takes milliseconds. A Redis call takes microseconds. An LLM call takes anywhere from half a second to thirty seconds, consumes a variable number of tokens on every invocation, costs real money on every request, and can fail in ways that have nothing to do with network connectivity — token limits, content filters, model refusals, context window exhaustion.

Standard application monitoring was not built for this. Your existing latency dashboards will show LLM calls as outliers. Your error rate alerts will fire on model refusals that aren't actually errors. Your cost monitoring won't exist at all unless you build it.

This article builds it. We'll instrument LLM API calls in Python with OpenTelemetry — capturing latency, token consumption, estimated cost, and finish reasons as structured telemetry that you can query, dashboard, and alert on.

## The Monitoring Gap in LLM Applications

When you add an LLM to a Python application, you typically get visibility into two things: whether the call succeeded, and how long it took. Everything else — how many tokens it consumed, what the model decided to do, how much it cost, whether it hit a limit — is invisible unless you instrument it explicitly.

This creates real operational problems:

-   A feature that works in testing starts timing out in production because prompts grew longer than expected and token counts climbed
-   Costs spike unexpectedly because one endpoint is generating unusually long completions
-   Users report bad responses but you can't tell whether the model refused, truncated, or hallucinated because `finish_reason` is never captured
-   You can't tell which of your ten LLM-powered features is responsible for 80% of your API spend

Structured telemetry on LLM calls fixes all of these. Let's build it.

## Prerequisites

-   Python 3.10+
-   An OpenAI or Anthropic API key
-   A running OpenTelemetry Collector or observability backend

## Installing Dependencies

pip install opentelemetry-sdk
pip install opentelemetry-api
pip install opentelemetry-exporter-otlp-proto-grpc
pip install opentelemetry-instrumentation-fastapi
pip install openai
pip install anthropic
pip install structlog
pip install fastapi uvicorn


## Project Structure

llm-monitoring/
├── tracing.py          # OpenTelemetry setup
├── llm_tracer.py       # LLM instrumentation layer
├── cost_estimator.py   # Token cost calculation
├── main.py             # FastAPI application
└── services.py         # LLM-powered features


## Step 1: OpenTelemetry Setup

**tracing.py**

import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME


def init_tracer(service_name: str) -> trace.Tracer:
    resource = Resource.create({
        SERVICE_NAME: service_name,
        "service.version": "1.0.0",
    })

    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
        insecure=True,
    )

    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    return trace.get_tracer(service_name)


## Step 2: Cost Estimation

Before building the instrumentation layer, we need a way to estimate costs. LLM providers charge per token, with different rates for input and output tokens.

**cost\_estimator.py**

from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelPricing:
    input_cost_per_token: float   # USD per token
    output_cost_per_token: float  # USD per token


# Pricing as of early 2026 — verify against provider pricing pages
# before building cost dashboards on these numbers
MODEL_PRICING: dict[str, ModelPricing] = {
    # OpenAI
    "gpt-4o": ModelPricing(
        input_cost_per_token=0.000005,
        output_cost_per_token=0.000015,
    ),
    "gpt-4o-mini": ModelPricing(
        input_cost_per_token=0.00000015,
        output_cost_per_token=0.0000006,
    ),
    "gpt-3.5-turbo": ModelPricing(
        input_cost_per_token=0.0000005,
        output_cost_per_token=0.0000015,
    ),
    # Anthropic
    "claude-sonnet-4-6": ModelPricing(
        input_cost_per_token=0.000003,
        output_cost_per_token=0.000015,
    ),
    "claude-haiku-4-5": ModelPricing(
        input_cost_per_token=0.00000025,
        output_cost_per_token=0.00000125,
    ),
}


def estimate_cost(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Optional[float]:
    """
    Estimate the cost of an LLM call in USD.
    Returns None if the model is not in the pricing table.
    """
    pricing = MODEL_PRICING.get(model)
    if not pricing:
        return None

    input_cost = prompt_tokens * pricing.input_cost_per_token
    output_cost = completion_tokens * pricing.output_cost_per_token
    return round(input_cost + output_cost, 8)
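
As a quick sanity check of the arithmetic, here is a minimal self-contained sketch. Providers usually publish prices per million tokens, so the hypothetical helper `per_million` (not part of `cost_estimator.py`) converts a published price into the per-token rate the table uses. The rates are the `gpt-4o-mini` values from the table above and are assumptions to verify against the provider's pricing page.

```python
def per_million(usd_per_million_tokens: float) -> float:
    """Convert a published per-1M-token price to USD per token."""
    return usd_per_million_tokens / 1_000_000


# gpt-4o-mini rates taken from the pricing table above (assumed, verify before use)
INPUT_RATE = per_million(0.15)
OUTPUT_RATE = per_million(0.60)


def example_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Same formula as estimate_cost: tokens times per-token rate, summed."""
    return round(prompt_tokens * INPUT_RATE + completion_tokens * OUTPUT_RATE, 8)


if __name__ == "__main__":
    # 87 prompt tokens and 52 completion tokens
    print(example_cost(87, 52))
```

Keeping the table in per-token units makes the multiplication trivial, at the price of hard-to-read literals. Storing per-million prices and converting once is one way to make the table easier to audit.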



* * *

## Step 3: The LLM Instrumentation Layer

This is the core of the setup — a context manager that wraps any LLM call and captures the telemetry we care about.

**llm\_tracer.py**

import time
from contextlib import contextmanager
from typing import Optional, Generator

from opentelemetry import trace
from opentelemetry.trace import Span, Status, StatusCode

from cost_estimator import estimate_cost

tracer = trace.get_tracer("llm-instrumentation")


@contextmanager
def llm_span(
    model: str,
    operation: str,
    feature: str,
    prompt_tokens: Optional[int] = None,
    temperature: float = 0.0,
    max_tokens: Optional[int] = None,
) -> Generator[Span, None, None]:
    """
    Context manager that creates a span for an LLM API call.

    Args:
        model: The model identifier (e.g. "gpt-4o", "claude-sonnet-4-6")
        operation: What this call is doing (e.g. "summarize", "classify", "generate")
        feature: Which product feature triggered this call (e.g. "order_summary", "search")
        prompt_tokens: Estimated prompt token count (if known before the call)
        temperature: Sampling temperature
        max_tokens: Maximum tokens requested
    """
    with tracer.start_as_current_span(f"llm.{operation}") as span:
        # Request attributes, known before the call
        span.set_attributes({
            "llm.model": model,
            "llm.operation": operation,
            "llm.feature": feature,
            "llm.temperature": temperature,
            "llm.request_time": time.time(),
        })

        if prompt_tokens is not None:
            span.set_attribute("llm.prompt_tokens", prompt_tokens)

        if max_tokens is not None:
            span.set_attribute("llm.max_tokens", max_tokens)

        start_time = time.perf_counter()

        try:
            yield span
        finally:
            # Runs on success and on failure, so latency is always recorded
            latency_ms = (time.perf_counter() - start_time) * 1000
            span.set_attribute("llm.latency_ms", round(latency_ms, 2))


def record_llm_response(
    span: Span,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str,
    cached: bool = False,
) -> None:
    """
    Record response attributes after an LLM call completes.
    Call this inside the llm_span context manager after the API call returns.
    """
    total_tokens = prompt_tokens + completion_tokens
    cost = estimate_cost(model, prompt_tokens, completion_tokens)

    span.set_attributes({
        "llm.prompt_tokens": prompt_tokens,
        "llm.completion_tokens": completion_tokens,
        "llm.total_tokens": total_tokens,
        "llm.finish_reason": finish_reason,
        "llm.cached": cached,
    })

    if cost is not None:
        span.set_attribute("llm.estimated_cost_usd", cost)

    # Set span status based on finish reason
    # Not all non-"stop" finish reasons are errors, but they need visibility
    if finish_reason == "length":
        # Response was cut off; may indicate the prompt is too long
        # or max_tokens is set too low
        span.set_status(Status(StatusCode.ERROR, "Response truncated by token limit"))
        span.set_attribute("llm.truncated", True)

    elif finish_reason == "content_filter":
        # Content policy triggered; usually a prompt design issue
        span.set_status(Status(StatusCode.ERROR, "Response blocked by content filter"))

    elif finish_reason == "stop":
        span.set_status(Status(StatusCode.OK))

    else:
        # tool_calls, function_call, or unknown: not an error
        span.set_status(Status(StatusCode.OK))


def record_llm_error(span: Span, error: Exception, error_type: str) -> None:
    """
    Record an LLM API error on the span.
    Use error_type to distinguish between different failure modes.
    """
    span.record_exception(error)
    span.set_attributes({
        "llm.error": True,
        "llm.error_type": error_type,
    })
    span.set_status(Status(StatusCode.ERROR, str(error)))



The `finish_reason` handling is worth examining. When an LLM response is truncated because of a token limit, most monitoring systems record it as a successful call — the HTTP request returned 200. But from a product perspective, the response is incomplete and the user may get a broken experience. Treating `finish_reason == "length"` as an error in the span means you can alert on it separately from network failures or API errors.
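
One caveat: the `"length"` and `"stop"` values are OpenAI's vocabulary. Anthropic reports the equivalent conditions through `stop_reason` values such as `max_tokens` and `end_turn`, so a truncated Anthropic response passed through unchanged would land in the final `else` branch and be marked OK. A minimal sketch of one way to normalize before recording follows; the helper name `normalize_finish_reason` and the mapping table are assumptions for illustration, not part of either SDK.

```python
from typing import Optional

# Assumed mapping from Anthropic `stop_reason` values to the OpenAI-style
# `finish_reason` vocabulary that record_llm_response branches on.
ANTHROPIC_TO_COMMON = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def normalize_finish_reason(provider: str, reason: Optional[str]) -> str:
    """Return a provider-independent finish reason (hypothetical helper)."""
    if reason is None:
        return "unknown"
    if provider == "anthropic":
        # Unmapped values pass through unchanged so new reasons stay visible
        return ANTHROPIC_TO_COMMON.get(reason, reason)
    return reason
```

With something like this in place, an Anthropic call could pass `normalize_finish_reason("anthropic", response.stop_reason)` as the `finish_reason` argument, and the truncation branch would apply to both providers.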

* * *

## Step 4: Instrumenting Real LLM Calls

Now let's use the instrumentation layer with actual API calls.

**services.py**

import json
import os

import structlog
from anthropic import AsyncAnthropic, APIStatusError
from openai import AsyncOpenAI, RateLimitError, APITimeoutError

from llm_tracer import llm_span, record_llm_response, record_llm_error

logger = structlog.get_logger()
openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


async def summarize_order(order_text: str, user_id: str) -> str:
    """Summarize an order for the customer dashboard."""
    model = "gpt-4o-mini"

    with llm_span(
        model=model,
        operation="summarize",
        feature="order_dashboard",
        temperature=0.0,
        max_tokens=200,
    ) as span:
        try:
            response = await openai_client.chat.completions.create(
                model=model,
                temperature=0.0,
                max_tokens=200,
                messages=[
                    {
                        "role": "system",
                        "content": "Summarize the following order in 2-3 sentences for a customer.",
                    },
                    {
                        "role": "user",
                        "content": order_text,
                    },
                ],
            )

            choice = response.choices[0]
            usage = response.usage

            record_llm_response(
                span=span,
                model=model,
                prompt_tokens=usage.prompt_tokens,
                completion_tokens=usage.completion_tokens,
                finish_reason=choice.finish_reason,
            )

            logger.info(
                "order_summarized",
                user_id=user_id,
                model=model,
                prompt_tokens=usage.prompt_tokens,
                completion_tokens=usage.completion_tokens,
                finish_reason=choice.finish_reason,
            )

            return choice.message.content

        except RateLimitError as e:
            record_llm_error(span, e, error_type="rate_limit")
            logger.warning("llm_rate_limited", model=model, feature="order_dashboard")
            raise

        except APITimeoutError as e:
            record_llm_error(span, e, error_type="timeout")
            logger.error("llm_timeout", model=model, feature="order_dashboard")
            raise

        except Exception as e:
            record_llm_error(span, e, error_type="unknown")
            logger.error("llm_error", model=model, exc_info=True)
            raise


async def classify_support_ticket(ticket_text: str) -> dict:
    """Classify a support ticket by category and urgency."""
    model = "claude-haiku-4-5"

    with llm_span(
        model=model,
        operation="classify",
        feature="support_triage",
        temperature=0.0,
        max_tokens=100,
    ) as span:
        try:
            response = await anthropic_client.messages.create(
                model=model,
                # Pass the same temperature the span records
                temperature=0.0,
                max_tokens=100,
                messages=[
                    {
                        "role": "user",
                        "content": f"""Classify this support ticket.

Respond with JSON only: {{"category": "...", "urgency": "low|medium|high"}}

Ticket: {ticket_text}""",
                    }
                ],
            )

            usage = response.usage
            finish_reason = response.stop_reason  # Anthropic uses stop_reason

            record_llm_response(
                span=span,
                model=model,
                prompt_tokens=usage.input_tokens,
                completion_tokens=usage.output_tokens,
                finish_reason=finish_reason or "stop",
            )

            result = json.loads(response.content[0].text)

            # Add classification result to span for filtering
            span.set_attributes({
                "ticket.category": result.get("category", "unknown"),
                "ticket.urgency": result.get("urgency", "unknown"),
            })

            return result

        except APIStatusError as e:
            record_llm_error(span, e, error_type=f"api_status_{e.status_code}")
            raise

        except Exception as e:
            record_llm_error(span, e, error_type="unknown")
            raise


* * *

## Step 5: Wiring Into FastAPI

**main.py**

import os

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

from tracing import init_tracer
from services import summarize_order, classify_support_ticket

init_tracer("llm-powered-api")

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)


class OrderSummaryRequest(BaseModel):
    order_text: str
    user_id: str


class SupportTicketRequest(BaseModel):
    ticket_text: str


@app.post("/orders/summarize")
async def summarize(request: OrderSummaryRequest):
    try:
        summary = await summarize_order(request.order_text, request.user_id)
        return {"summary": summary}
    except Exception:
        raise HTTPException(status_code=503, detail="Summary service unavailable")


@app.post("/support/classify")
async def classify(request: SupportTicketRequest):
    try:
        classification = await classify_support_ticket(request.ticket_text)
        return classification
    except Exception:
        raise HTTPException(status_code=503, detail="Classification service unavailable")



* * *

## What the Telemetry Looks Like

A successful call to `/orders/summarize` produces a span with these attributes:

{
  "name": "llm.summarize",
  "status": "OK",
  "attributes": {
    "llm.model": "gpt-4o-mini",
    "llm.operation": "summarize",
    "llm.feature": "order_dashboard",
    "llm.temperature": 0.0,
    "llm.max_tokens": 200,
    "llm.prompt_tokens": 87,
    "llm.completion_tokens": 52,
    "llm.total_tokens": 139,
    "llm.finish_reason": "stop",
    "llm.estimated_cost_usd": 0.00004425,
    "llm.latency_ms": 1243.5,
    "llm.cached": false
  }
}



A truncated response — where the model hit the token limit — looks like:

{
  "name": "llm.summarize",
  "status": "ERROR",
  "status_message": "Response truncated by token limit",
  "attributes": {
    "llm.model": "gpt-4o-mini",
    "llm.finish_reason": "length",
    "llm.truncated": true,
    "llm.prompt_tokens": 312,
    "llm.completion_tokens": 200,
    "llm.total_tokens": 512,
    "llm.estimated_cost_usd": 0.0001668,
    "llm.latency_ms": 3821.2
  }
}



* * *

## Dashboards and Alerts That Actually Matter

With this telemetry in place, here are the queries that become useful:

**Cost by feature:** Group spans by `llm.feature` and sum `llm.estimated_cost_usd`. This tells you which features are driving your LLM spend. In most applications, one or two features account for the majority of cost.
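
In a real backend this is a group-by query, and the syntax depends on the tool. To show the shape of the aggregation, here is a minimal sketch in plain Python over exported span attributes. The function name `cost_by_feature` and the list-of-dicts input format are assumptions for illustration.

```python
from collections import defaultdict


def cost_by_feature(spans: list[dict]) -> dict[str, float]:
    """Sum llm.estimated_cost_usd per llm.feature across span attribute dicts."""
    totals: dict[str, float] = defaultdict(float)
    for attrs in spans:
        cost = attrs.get("llm.estimated_cost_usd")
        if cost is None:
            # The model was missing from the pricing table, so no cost was recorded
            continue
        totals[attrs.get("llm.feature", "unknown")] += cost
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical span attributes, shaped like the examples above
    sample = [
        {"llm.feature": "order_dashboard", "llm.estimated_cost_usd": 0.25},
        {"llm.feature": "order_dashboard", "llm.estimated_cost_usd": 0.5},
        {"llm.feature": "support_triage", "llm.estimated_cost_usd": 0.125},
        {"llm.feature": "support_triage"},
    ]
    print(cost_by_feature(sample))
```

Spans without a cost attribute are skipped rather than counted as zero. It may be worth counting them separately so gaps in the pricing table do not go unnoticed.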

**Truncation rate by model:** Filter spans where `llm.truncated = true` and group by `llm.model`. A rising truncation rate on a specific model usually means prompts are growing — often because you've added more context or the input data has changed.

**Latency percentiles by operation:** P50 and P99 latency grouped by `llm.operation`. LLM latency distributions are wide — P50 might be 800ms while P99 is 12 seconds. Alerting on P99 rather than average catches the tail latency issues that users actually experience.
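
Most backends compute percentiles natively. The sketch below only illustrates the calculation on exported span attributes, using the standard library's `statistics.quantiles`. The function name `latency_percentiles` and the input format are assumptions, and the `inclusive` interpolation method is one choice among several.

```python
from collections import defaultdict
from statistics import quantiles


def latency_percentiles(spans: list[dict]) -> dict[str, tuple[float, float]]:
    """Return {operation: (p50_ms, p99_ms)} from span attribute dicts."""
    by_operation: dict[str, list[float]] = defaultdict(list)
    for attrs in spans:
        latency = attrs.get("llm.latency_ms")
        if latency is not None:
            by_operation[attrs.get("llm.operation", "unknown")].append(latency)

    result: dict[str, tuple[float, float]] = {}
    for operation, latencies in by_operation.items():
        if len(latencies) < 2:
            continue  # quantiles() needs at least two data points
        # n=100 yields 99 cut points; index 49 is P50 and index 98 is P99
        cuts = quantiles(latencies, n=100, method="inclusive")
        result[operation] = (cuts[49], cuts[98])
    return result
```

With only a handful of samples per operation, P99 is effectively the maximum, so percentile alerts are more meaningful over windows with enough calls.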

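**Error rate by error type:** Group spans by `llm.error_type`. Rate limit errors, timeouts, and other API failures have completely different remediation paths, and grouping them together hides what's actually wrong. Content filter blocks are recorded through `llm.finish_reason` rather than `llm.error_type` in the code above, so query those by finish reason.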

**Recommended alerts:**

| Alert | Condition | Threshold |
| --- | --- | --- |
| High latency | P99 `llm.latency_ms` | > 10,000 ms |
| Truncation spike | `llm.truncated = true` rate | > 5% of calls |
| Rate limiting | `llm.error_type = rate_limit` count | > 10 per minute |
| Cost spike | Sum `llm.estimated_cost_usd` per hour | > 2x baseline |
| Content filter | `llm.finish_reason = content_filter` count | > 3 per hour |
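
The cost spike threshold depends on what "baseline" means, which is left open here. One possible reading, sketched below, is the mean of the preceding hourly totals. The function name `is_cost_spike` and the trailing-mean baseline are assumptions, not a recommendation from any particular alerting tool.

```python
from statistics import mean


def is_cost_spike(
    current_hour_usd: float,
    previous_hours_usd: list[float],
    factor: float = 2.0,
) -> bool:
    """True when the current hour's spend exceeds `factor` times the trailing mean."""
    if not previous_hours_usd:
        return False  # no baseline yet, so do not alert
    baseline = mean(previous_hours_usd)
    return current_hour_usd > factor * baseline
```

A trailing mean is simple but ignores daily and weekly cycles. Comparing against the same hour on previous days is a common alternative when traffic is strongly periodic.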

* * *

## Handling Retries Without Double-Counting

If your application retries failed LLM calls, you need to track retry counts to avoid double-counting costs and misattributing errors.

import asyncio


async def summarize_with_retry(order_text: str, user_id: str, max_retries: int = 2) -> str:
    model = "gpt-4o-mini"
    last_error = None

    for attempt in range(max_retries + 1):
        with llm_span(
            model=model,
            operation="summarize",
            feature="order_dashboard",
        ) as span:
            span.set_attribute("llm.attempt", attempt)
            span.set_attribute("llm.is_retry", attempt > 0)

            try:
                response = await openai_client.chat.completions.create(
                    model=model,
                    max_tokens=200,
                    messages=[
                        {"role": "system", "content": "Summarize this order."},
                        {"role": "user", "content": order_text},
                    ],
                )

                usage = response.usage
                record_llm_response(
                    span=span,
                    model=model,
                    prompt_tokens=usage.prompt_tokens,
                    completion_tokens=usage.completion_tokens,
                    finish_reason=response.choices[0].finish_reason,
                )

                return response.choices[0].message.content

            except RateLimitError as e:
                record_llm_error(span, e, error_type="rate_limit")
                last_error = e

        # Back off outside the span so the sleep does not inflate llm.latency_ms
        if attempt < max_retries:
            await asyncio.sleep(2 ** attempt)

    raise last_error



With `llm.attempt` and `llm.is_retry` on every span, you can filter your cost dashboard to exclude retry attempts — or specifically query retried calls to understand which operations are flaky.
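
As an illustration of that filter, the sketch below splits exported span attributes into first attempts and retries and totals the cost of each. The function name `retry_summary` and the list-of-dicts input are assumptions. In practice the same split would be a filter on `llm.is_retry` in your backend's query language.

```python
def retry_summary(spans: list[dict]) -> dict:
    """Split cost and call counts between first attempts and retries."""
    summary = {
        "first_attempt_cost_usd": 0.0,
        "retry_cost_usd": 0.0,
        "first_attempts": 0,
        "retries": 0,
    }
    for attrs in spans:
        # Failed attempts usually carry no cost attribute; treat that as zero
        cost = attrs.get("llm.estimated_cost_usd") or 0.0
        if attrs.get("llm.is_retry", False):
            summary["retries"] += 1
            summary["retry_cost_usd"] += cost
        else:
            summary["first_attempts"] += 1
            summary["first_attempt_cost_usd"] += cost
    return summary
```

Dividing `retries` by `first_attempts` per operation gives a rough retry rate, which is one way to spot the flaky operations mentioned above.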

* * *

## Summary

LLM API calls require a different approach to monitoring than standard HTTP dependencies. The key attributes to capture are:

-   **Latency** — LLM calls are slow and variable; P99 matters more than average
-   **Token counts** — input and output separately, since they have different costs
-   **Finish reason** — `stop`, `length`, `content_filter`, and `tool_calls` each indicate different conditions
-   **Estimated cost** — per-call and aggregated by feature
-   **Error type** — rate limits, timeouts, and content filters need different responses

The instrumentation layer in this article wraps both OpenAI and Anthropic calls with a consistent span structure. As you add more models or providers, the pattern stays the same — `llm_span` as the context manager, `record_llm_response` after the call, `record_llm_error` in the exception handler.

Without this telemetry, LLM-powered features are a black box. With it, you can answer the questions that actually matter in production: what is this costing, why is it slow, and what is the model actually doing.

* * *

_Find me on [GitHub](https://github.com/Temitopeajao) or [LinkedIn](https://www.linkedin.com/in/temitope-ajao-4a8670302/)._
