(input_cost + output_cost, 8)
* * *
## Step 3: The LLM Instrumentation Layer
This is the core of the setup: a context manager that wraps any LLM call and captures the telemetry we care about.
**llm\_tracer.py**
```python
import time
from contextlib import contextmanager
from typing import Optional, Generator

from opentelemetry import trace
from opentelemetry.trace import Span, Status, StatusCode

from cost_estimator import estimate_cost

tracer = trace.get_tracer("llm-instrumentation")


@contextmanager
def llm_span(
    model: str,
    operation: str,
    feature: str,
    prompt_tokens: Optional[int] = None,
    temperature: float = 0.0,
    max_tokens: Optional[int] = None,
) -> Generator[Span, None, None]:
    """
    Context manager that creates a span for an LLM API call.

    Args:
        model: The model identifier (e.g. "gpt-4o", "claude-sonnet-4-6")
        operation: What this call is doing (e.g. "summarize", "classify", "generate")
        feature: Which product feature triggered this call (e.g. "order_summary", "search")
        prompt_tokens: Estimated prompt token count (if known before the call)
        temperature: Sampling temperature
        max_tokens: Maximum tokens requested
    """
    with tracer.start_as_current_span(f"llm.{operation}") as span:
        # Request attributes: known before the call
        span.set_attributes({
            "llm.model": model,
            "llm.operation": operation,
            "llm.feature": feature,
            "llm.temperature": temperature,
            "llm.request_time": time.time(),
        })
        if prompt_tokens is not None:
            span.set_attribute("llm.prompt_tokens", prompt_tokens)
        if max_tokens is not None:
            span.set_attribute("llm.max_tokens", max_tokens)

        start_time = time.perf_counter()
        try:
            yield span
        finally:
            # Runs on success and on failure, so latency is always recorded
            latency_ms = (time.perf_counter() - start_time) * 1000
            span.set_attribute("llm.latency_ms", round(latency_ms, 2))


def record_llm_response(
    span: Span,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str,
    cached: bool = False,
) -> None:
    """
    Record response attributes after an LLM call completes.
    Call this inside the llm_span context manager after the API call returns.
    """
    total_tokens = prompt_tokens + completion_tokens
    cost = estimate_cost(model, prompt_tokens, completion_tokens)

    span.set_attributes({
        "llm.prompt_tokens": prompt_tokens,
        "llm.completion_tokens": completion_tokens,
        "llm.total_tokens": total_tokens,
        "llm.finish_reason": finish_reason,
        "llm.cached": cached,
    })
    if cost is not None:
        span.set_attribute("llm.estimated_cost_usd", cost)

    # Set span status based on finish reason.
    # Not all non-"stop" finish reasons are errors, but they need visibility.
    if finish_reason == "length":
        # Response was cut off: may indicate the prompt is too long
        # or max_tokens is set too low
        span.set_status(Status(StatusCode.ERROR, "Response truncated by token limit"))
        span.set_attribute("llm.truncated", True)
    elif finish_reason == "content_filter":
        # Content policy triggered, usually a prompt design issue.
        # Tag the error type so it can be grouped with other llm.error_type values.
        span.set_status(Status(StatusCode.ERROR, "Response blocked by content filter"))
        span.set_attribute("llm.error_type", "content_filter")
    elif finish_reason == "stop":
        span.set_status(Status(StatusCode.OK))
    else:
        # tool_calls, function_call, or unknown: not an error
        span.set_status(Status(StatusCode.OK))


def record_llm_error(span: Span, error: Exception, error_type: str) -> None:
    """
    Record an LLM API error on the span.
    Use error_type to distinguish between different failure modes.
    """
    span.record_exception(error)
    span.set_attributes({
        "llm.error": True,
        "llm.error_type": error_type,
    })
    span.set_status(Status(StatusCode.ERROR, str(error)))
```
The `finish_reason` handling is worth examining. When an LLM response is truncated because of a token limit, most monitoring systems record it as a successful call, because the HTTP request returned 200. But from a product perspective, the response is incomplete and the user may get a broken experience. Treating `finish_reason == "length"` as an error in the span means you can alert on it separately from network failures or API errors.
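
To make the gap concrete, here is a small sketch that compares the two views over a handful of made-up call records. The `calls` list and its field names are hypothetical, not output from a real system; the only point is that an HTTP-based success rate and a `finish_reason`-based one can disagree.

```python
# Hypothetical call records: every request returned HTTP 200,
# but not every response is usable by the product.
calls = [
    {"http_status": 200, "finish_reason": "stop"},
    {"http_status": 200, "finish_reason": "length"},
    {"http_status": 200, "finish_reason": "stop"},
    {"http_status": 200, "finish_reason": "content_filter"},
]

# The same finish reasons that record_llm_response marks as errors.
ERROR_FINISH_REASONS = {"length", "content_filter"}


def http_success_rate(records):
    """Share of calls that look fine to an HTTP-level monitor."""
    ok = sum(1 for r in records if r["http_status"] == 200)
    return ok / len(records)


def product_success_rate(records):
    """Share of calls whose response was complete and not filtered."""
    ok = sum(1 for r in records if r["finish_reason"] not in ERROR_FINISH_REASONS)
    return ok / len(records)


print(http_success_rate(calls))     # 1.0
print(product_success_rate(calls))  # 0.5
```

In this toy data the HTTP view reports no failures at all, while half of the responses were truncated or blocked.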
* * *
## Step 4: Instrumenting Real LLM Calls
Now let's use the instrumentation layer with actual API calls.
**services.py**
```python
import json
import os

from openai import AsyncOpenAI, RateLimitError, APITimeoutError
from anthropic import AsyncAnthropic, APIStatusError
import structlog

from llm_tracer import llm_span, record_llm_response, record_llm_error

logger = structlog.get_logger()

openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Anthropic's stop_reason uses different values than OpenAI's finish_reason.
# Map them so record_llm_response can still detect truncation.
ANTHROPIC_FINISH_REASONS = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


async def summarize_order(order_text: str, user_id: str) -> str:
    """Summarize an order for the customer dashboard."""
    model = "gpt-4o-mini"
    with llm_span(
        model=model,
        operation="summarize",
        feature="order_dashboard",
        temperature=0.0,
        max_tokens=200,
    ) as span:
        try:
            response = await openai_client.chat.completions.create(
                model=model,
                temperature=0.0,
                max_tokens=200,
                messages=[
                    {
                        "role": "system",
                        "content": "Summarize the following order in 2-3 sentences for a customer.",
                    },
                    {
                        "role": "user",
                        "content": order_text,
                    },
                ],
            )
            choice = response.choices[0]
            usage = response.usage
            record_llm_response(
                span=span,
                model=model,
                prompt_tokens=usage.prompt_tokens,
                completion_tokens=usage.completion_tokens,
                finish_reason=choice.finish_reason,
            )
            logger.info(
                "order_summarized",
                user_id=user_id,
                model=model,
                prompt_tokens=usage.prompt_tokens,
                completion_tokens=usage.completion_tokens,
                finish_reason=choice.finish_reason,
            )
            return choice.message.content
        except RateLimitError as e:
            record_llm_error(span, e, error_type="rate_limit")
            logger.warning("llm_rate_limited", model=model, feature="order_dashboard")
            raise
        except APITimeoutError as e:
            record_llm_error(span, e, error_type="timeout")
            logger.error("llm_timeout", model=model, feature="order_dashboard")
            raise
        except Exception as e:
            record_llm_error(span, e, error_type="unknown")
            logger.error("llm_error", model=model, exc_info=True)
            raise


async def classify_support_ticket(ticket_text: str) -> dict:
    """Classify a support ticket by category and urgency."""
    model = "claude-haiku-4-5"
    with llm_span(
        model=model,
        operation="classify",
        feature="support_triage",
        temperature=0.0,
        max_tokens=100,
    ) as span:
        try:
            response = await anthropic_client.messages.create(
                model=model,
                temperature=0.0,  # match the value recorded on the span
                max_tokens=100,
                messages=[
                    {
                        "role": "user",
                        "content": (
                            "Classify this support ticket.\n"
                            'Respond with JSON only: {"category": "...", "urgency": "low|medium|high"}\n'
                            f"Ticket: {ticket_text}"
                        ),
                    }
                ],
            )
            usage = response.usage
            stop_reason = response.stop_reason  # Anthropic uses stop_reason
            record_llm_response(
                span=span,
                model=model,
                prompt_tokens=usage.input_tokens,
                completion_tokens=usage.output_tokens,
                # Unmapped values pass through unchanged; a missing value counts as "stop"
                finish_reason=ANTHROPIC_FINISH_REASONS.get(stop_reason, stop_reason or "stop"),
            )
            result = json.loads(response.content[0].text)
            # Add classification result to span for filtering
            span.set_attributes({
                "ticket.category": result.get("category", "unknown"),
                "ticket.urgency": result.get("urgency", "unknown"),
            })
            return result
        except APIStatusError as e:
            record_llm_error(span, e, error_type=f"api_status_{e.status_code}")
            raise
        except Exception as e:
            record_llm_error(span, e, error_type="unknown")
            raise
```
* * *
## Step 5: Wiring Into FastAPI
**main.py**
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

from tracing import init_tracer
from services import summarize_order, classify_support_ticket

init_tracer("llm-powered-api")

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)


class OrderSummaryRequest(BaseModel):
    order_text: str
    user_id: str


class SupportTicketRequest(BaseModel):
    ticket_text: str


@app.post("/orders/summarize")
async def summarize(request: OrderSummaryRequest):
    try:
        summary = await summarize_order(request.order_text, request.user_id)
        return {"summary": summary}
    except Exception:
        # The LLM span already carries the error details; return a generic 503
        raise HTTPException(status_code=503, detail="Summary service unavailable")


@app.post("/support/classify")
async def classify(request: SupportTicketRequest):
    try:
        classification = await classify_support_ticket(request.ticket_text)
        return classification
    except Exception:
        raise HTTPException(status_code=503, detail="Classification service unavailable")
```
* * *
## What the Telemetry Looks Like
A successful call to `/orders/summarize` produces a span with these attributes:
```json
{
  "name": "llm.summarize",
  "status": "OK",
  "attributes": {
    "llm.model": "gpt-4o-mini",
    "llm.operation": "summarize",
    "llm.feature": "order_dashboard",
    "llm.temperature": 0.0,
    "llm.max_tokens": 200,
    "llm.prompt_tokens": 87,
    "llm.completion_tokens": 52,
    "llm.total_tokens": 139,
    "llm.finish_reason": "stop",
    "llm.estimated_cost_usd": 0.0000913,
    "llm.latency_ms": 1243.5,
    "llm.cached": false
  }
}
```
A truncated response, where the model hit the token limit, looks like:
```json
{
  "name": "llm.summarize",
  "status": "ERROR",
  "status_message": "Response truncated by token limit",
  "attributes": {
    "llm.model": "gpt-4o-mini",
    "llm.finish_reason": "length",
    "llm.truncated": true,
    "llm.prompt_tokens": 312,
    "llm.completion_tokens": 200,
    "llm.total_tokens": 512,
    "llm.estimated_cost_usd": 0.0001672,
    "llm.latency_ms": 3821.2
  }
}
```
* * *
## Dashboards and Alerts That Actually Matter
With this telemetry in place, here are the queries that become useful:
**Cost by feature:** Group spans by `llm.feature` and sum `llm.estimated_cost_usd`. This tells you which features are driving your LLM spend. In most applications, one or two features account for the majority of cost.
**Truncation rate by model:** Filter spans where `llm.truncated = true` and group by `llm.model`. A rising truncation rate on a specific model usually means prompts are growing, often because you've added more context or the input data has changed.
**Latency percentiles by operation:** P50 and P99 latency grouped by `llm.operation`. LLM latency distributions are wide: P50 might be 800 ms while P99 is 12 seconds. Alerting on P99 rather than average catches the tail latency issues that users actually experience.
**Error rate by error type:** Group spans by `llm.error_type`. Rate limit errors, timeouts, and content filter triggers have completely different remediation paths. Grouping them together hides what's actually wrong.
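
If you want to prototype these queries before building dashboards, the aggregations can be sketched in plain Python over exported span attributes. The `spans` list below is made-up sample data shaped like the attributes recorded earlier, the function names are hypothetical, and the nearest-rank percentile is just one simple choice; your tracing backend may interpolate differently.

```python
import math
from collections import defaultdict

# Made-up span attributes, shaped like the ones this article records.
spans = [
    {"llm.feature": "order_dashboard", "llm.model": "gpt-4o-mini",
     "llm.operation": "summarize", "llm.estimated_cost_usd": 0.0002,
     "llm.latency_ms": 800.0},
    {"llm.feature": "order_dashboard", "llm.model": "gpt-4o-mini",
     "llm.operation": "summarize", "llm.estimated_cost_usd": 0.0003,
     "llm.latency_ms": 1200.0, "llm.truncated": True},
    {"llm.feature": "support_triage", "llm.model": "claude-haiku-4-5",
     "llm.operation": "classify", "llm.estimated_cost_usd": 0.0001,
     "llm.latency_ms": 400.0},
    # Failed call: no usage came back, so no cost attribute was set.
    {"llm.feature": "order_dashboard", "llm.model": "gpt-4o-mini",
     "llm.operation": "summarize", "llm.latency_ms": 12000.0,
     "llm.error_type": "rate_limit"},
    {"llm.feature": "support_triage", "llm.model": "claude-haiku-4-5",
     "llm.operation": "classify", "llm.estimated_cost_usd": 0.0001,
     "llm.latency_ms": 600.0},
]


def cost_by_feature(spans):
    """Sum estimated cost per feature; spans without a cost count as 0."""
    totals = defaultdict(float)
    for s in spans:
        totals[s["llm.feature"]] += s.get("llm.estimated_cost_usd", 0.0)
    return dict(totals)


def truncation_rate_by_model(spans):
    """Fraction of calls per model that were flagged llm.truncated."""
    calls, truncated = defaultdict(int), defaultdict(int)
    for s in spans:
        calls[s["llm.model"]] += 1
        if s.get("llm.truncated"):
            truncated[s["llm.model"]] += 1
    return {model: truncated[model] / calls[model] for model in calls}


def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of the samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_percentiles_by_operation(spans):
    """P50 and P99 of llm.latency_ms per operation."""
    by_op = defaultdict(list)
    for s in spans:
        by_op[s["llm.operation"]].append(s["llm.latency_ms"])
    return {op: {"p50": percentile(v, 50), "p99": percentile(v, 99)}
            for op, v in by_op.items()}


def errors_by_type(spans):
    """Count spans per llm.error_type, ignoring spans without one."""
    counts = defaultdict(int)
    for s in spans:
        if "llm.error_type" in s:
            counts[s["llm.error_type"]] += 1
    return dict(counts)


print(cost_by_feature(spans))
print(truncation_rate_by_model(spans))
print(latency_percentiles_by_operation(spans))
print(errors_by_type(spans))
```

With only three `summarize` samples, the P99 is simply the slowest call (12,000 ms) while the P50 is 1,200 ms, which is the kind of spread that an average would blur.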
**Recommended alerts:**
| Alert | Condition | Threshold |
| --- | --- | --- |
| High latency | P99 `llm.latency_ms` | > 10,000 ms |
| Truncation spike | `llm.truncated = true` rate | > 5% of calls |
| Rate limiting | `llm.error_type = rate_limit` count | > 10 per minute |
| Cost spike | Sum `llm.estimated_cost_usd` per hour | > 2x baseline |
| Content filter | `llm.error_type = content_filter` count | > 3 per hour |
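
Most backends let you configure rules like these directly, but the conditions are simple enough to write down as code, which can help when reviewing thresholds. The sketch below is illustrative only: `WindowStats` and `evaluate_alerts` are hypothetical names, the input is assumed to be pre-aggregated for one evaluation window, and the thresholds are the ones listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    """Pre-aggregated metrics for one evaluation window (hypothetical shape)."""
    p99_latency_ms: float
    total_calls: int
    truncated_calls: int
    rate_limit_errors_per_minute: float
    cost_usd_last_hour: float
    baseline_cost_usd_per_hour: float
    content_filter_errors_last_hour: int


def evaluate_alerts(stats: WindowStats) -> List[str]:
    """Return the names of the alerts whose threshold is exceeded."""
    fired = []
    if stats.p99_latency_ms > 10_000:
        fired.append("High latency")
    # Guard against an empty window before computing the rate
    truncation_rate = (
        stats.truncated_calls / stats.total_calls if stats.total_calls else 0.0
    )
    if truncation_rate > 0.05:
        fired.append("Truncation spike")
    if stats.rate_limit_errors_per_minute > 10:
        fired.append("Rate limiting")
    if stats.cost_usd_last_hour > 2 * stats.baseline_cost_usd_per_hour:
        fired.append("Cost spike")
    if stats.content_filter_errors_last_hour > 3:
        fired.append("Content filter")
    return fired


stats = WindowStats(
    p99_latency_ms=12_500.0,
    total_calls=200,
    truncated_calls=4,              # 2% of calls, below the 5% threshold
    rate_limit_errors_per_minute=2,
    cost_usd_last_hour=3.1,         # more than 2x the 1.5 baseline
    baseline_cost_usd_per_hour=1.5,
    content_filter_errors_last_hour=0,
)
print(evaluate_alerts(stats))  # ['High latency', 'Cost spike']
```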
* * *
## Handling Retries Without Double-Counting
If your application retries failed LLM calls, you need to track retry counts to avoid double-counting costs and misattributing errors.
```python
import asyncio

# Reuses llm_span, record_llm_response, record_llm_error, openai_client,
# and RateLimitError from services.py above.


async def summarize_with_retry(order_text: str, user_id: str, max_retries: int = 2) -> str:
    model = "gpt-4o-mini"
    last_error = None
    for attempt in range(max_retries + 1):
        # One span per attempt, so each attempt has its own latency and status
        with llm_span(
            model=model,
            operation="summarize",
            feature="order_dashboard",
            max_tokens=200,
        ) as span:
            span.set_attribute("llm.attempt", attempt)
            span.set_attribute("llm.is_retry", attempt > 0)
            try:
                response = await openai_client.chat.completions.create(
                    model=model,
                    max_tokens=200,
                    messages=[
                        {"role": "system", "content": "Summarize this order."},
                        {"role": "user", "content": order_text},
                    ],
                )
                usage = response.usage
                record_llm_response(
                    span=span,
                    model=model,
                    prompt_tokens=usage.prompt_tokens,
                    completion_tokens=usage.completion_tokens,
                    finish_reason=response.choices[0].finish_reason,
                )
                return response.choices[0].message.content
            except RateLimitError as e:
                record_llm_error(span, e, error_type="rate_limit")
                last_error = e
        # Back off outside the span so the sleep does not inflate llm.latency_ms
        if attempt < max_retries:
            await asyncio.sleep(2 ** attempt)
    raise last_error
```
With `llm.attempt` and `llm.is_retry` on every span, you can separate first attempts from retries in your cost dashboard, or query retried calls specifically to understand which operations are flaky.
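
As a rough illustration, that split can be computed from exported span attributes like this. The sample spans are made up, and `retry_summary` is a hypothetical helper rather than part of the tracer above.

```python
from collections import defaultdict

# Made-up spans: one summarize request was rate limited and then retried.
spans = [
    # Failed first attempt: no usage returned, so no cost was recorded.
    {"llm.operation": "summarize", "llm.attempt": 0, "llm.is_retry": False,
     "llm.error_type": "rate_limit"},
    {"llm.operation": "summarize", "llm.attempt": 1, "llm.is_retry": True,
     "llm.estimated_cost_usd": 0.0002},
    {"llm.operation": "summarize", "llm.attempt": 0, "llm.is_retry": False,
     "llm.estimated_cost_usd": 0.0001},
    {"llm.operation": "classify", "llm.attempt": 0, "llm.is_retry": False,
     "llm.estimated_cost_usd": 0.0001},
]


def retry_summary(spans):
    """Per operation: attempt counts and cost, split by llm.is_retry."""
    summary = defaultdict(lambda: {
        "first_attempts": 0,
        "retries": 0,
        "first_attempt_cost_usd": 0.0,
        "retry_cost_usd": 0.0,
    })
    for s in spans:
        row = summary[s["llm.operation"]]
        cost = s.get("llm.estimated_cost_usd", 0.0)
        if s["llm.is_retry"]:
            row["retries"] += 1
            row["retry_cost_usd"] += cost
        else:
            row["first_attempts"] += 1
            row["first_attempt_cost_usd"] += cost
    return dict(summary)


for operation, row in retry_summary(spans).items():
    # Retries per logical request; first attempts stand in for request count
    retry_rate = row["retries"] / row["first_attempts"]
    print(operation, row, f"retry_rate={retry_rate:.2f}")
```

Counting first attempts as the number of logical requests is an assumption that holds here because every request starts at `llm.attempt = 0`.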
* * *
## Summary
LLM API calls require a different approach to monitoring than standard HTTP dependencies. The key attributes to capture are:
- **Latency:** LLM calls are slow and variable; P99 matters more than average
- **Token counts:** input and output separately, since they have different costs
- **Finish reason:** `stop`, `length`, `content_filter`, and `tool_calls` each indicate different conditions
- **Estimated cost:** per-call and aggregated by feature
- **Error type:** rate limits, timeouts, and content filters need different responses
The instrumentation layer in this article wraps both OpenAI and Anthropic calls with a consistent span structure. As you add more models or providers, the pattern stays the same: `llm_span` as the context manager, `record_llm_response` after the call, `record_llm_error` in the exception handler.
Without this telemetry, LLM-powered features are a black box. With it, you can answer the questions that actually matter in production: what is this costing, why is it slow, and what is the model actually doing.
* * *
_Find me on [GitHub](https://github.com/Temitopeajao) or [LinkedIn](https://www.linkedin.com/in/temitope-ajao-4a8670302/)._