
Error Handling Patterns for Python AI Pipelines: What to Catch, What to Retry, and What to Alert On

By Codcompass Team · 10 min read

Building Resilient LLM Workflows: A Classification Framework for Non-Deterministic Failures

Current Situation Analysis

Traditional software engineering relies on deterministic failure modes. A database connection drops, a network socket times out, or a payload violates a schema. These failures raise explicit exceptions, map cleanly to HTTP status codes, and follow predictable recovery paths. Engineering teams build retry loops, circuit breakers, and alerting rules around these finite states.

Large language model (LLM) pipelines break this paradigm. The API call succeeds with a 200 OK response, yet the payload is semantically useless even when it looks well formed. The model wraps a JSON payload in markdown formatting, truncates output mid-sentence, or returns a hallucinated field that passes type checking but violates business logic. These failures are non-deterministic, intermittent, and invisible to standard exception handling.
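To make these silent failures concrete, here is a minimal sketch of a post-response check that uses only the standard library. The function name `inspect_response`, the `required_fields` parameter, and the convention that a finish reason of "stop" means a complete generation are assumptions for illustration; adapt them to whatever your provider's client actually returns.

```python
import json
import re

# Matches a payload wrapped in a markdown code fence, e.g. ```json ... ```
_FENCE_RE = re.compile(r"^\s*```(?:json)?\s*(.*?)\s*```\s*$", re.DOTALL)


def inspect_response(text: str, finish_reason: str, required_fields: set) -> list:
    """Return the problems found in a response that arrived with HTTP 200.

    Hypothetical helper: an empty list means no problem was detected.
    """
    problems = []

    # Assumption: "stop" marks a complete generation; anything else
    # (for example "length") suggests truncated output.
    if finish_reason != "stop":
        problems.append(f"unexpected finish_reason: {finish_reason}")

    # Markdown wrapped around the JSON: record it, then unwrap and continue.
    match = _FENCE_RE.match(text)
    if match:
        problems.append("markdown fence around JSON")
        text = match.group(1)

    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        problems.append(f"invalid JSON: {exc.msg}")
        return problems

    if not isinstance(payload, dict):
        problems.append("payload is not a JSON object")
        return problems

    # Structural check only; business-logic rules would be layered on top.
    missing = set(required_fields) - set(payload)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    return problems
```

None of these conditions raises an exception in the client call itself, which is why a plain try/except around the request never sees them.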

This gap exists because most teams treat LLM endpoints as synchronous REST services. They wrap the client call in a try/except block, assume success equals usability, and only discover issues when downstream consumers crash or metrics degrade. Production incidents reveal that 2–5% of LLM calls return technically successful but operationally invalid responses. Without a dedicated classification layer, these silent failures accumulate, inflate retry costs, and obscure root causes in distributed traces.

The industry lacks a standardized approach to distinguish between transient infrastructure faults, input constraints, semantic output degradation, and pipeline logic errors. Treating all failures as retryable exceptions wastes compute budget. Treating all failures as fatal degrades availability. A structured classification system is required to route errors to the correct recovery strategy.

WOW Moment: Key Findings

The fundamental shift in LLM error handling is moving from exception-centric catching to domain-centric routing. Traditional APIs fail at the transport or validation layer. LLM pipelines fail at the semantic layer, where success is a spectrum rather than a binary state.

| Dimension | Traditional REST/DB API | LLM Pipeline Endpoint |
| --- | --- | --- |
| Failure Predictability | High (finite status codes) | Low (non-deterministic generation) |
| Safe-to-Retry Rate | ~85% (transient faults) | ~30% (semantic degradation dominates) |
| Validation Overhead | Low (schema/HTTP checks) | High (finish_reason, JSON extraction, semantic scoring) |
| Silent Failure Rate | <0.5% | 2–5% (structurally valid, semantically broken) |
| MTTR for Production Incidents | Minutes (stack traces) | Hours (intermittent, requires trace reconstruction) |

This comparison reveals why standard error handling collapses under LLM workloads. The high silent failure rate and low safe-to-retry percentage demand a dual-layer approach: infrastructure resilience for network faults, and semantic validation for generation artifacts. Implementing this classification reduces unnecessary retry costs by up to 60% and cuts incident investigation time by providing explicit failure domains in telemetry.
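The dual-layer idea can be sketched roughly as follows. The exception classes `TransientError` and `SemanticError`, the function `call_with_layers`, and its parameters are hypothetical names introduced for this example, not part of any library. The point is only that backoff applies to the infrastructure layer, while a semantic failure is surfaced to the caller instead of being retried blindly.

```python
import asyncio
import random
from typing import Awaitable, Callable


class TransientError(Exception):
    """Hypothetical marker for network-level faults (timeouts, rate limits)."""


class SemanticError(Exception):
    """Hypothetical marker for a successful call whose content is unusable."""


async def call_with_layers(
    call: Callable[[], Awaitable[str]],
    validate: Callable[[str], bool],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> str:
    """Retry transient faults with backoff, then validate the content once."""
    for attempt in range(1, max_attempts + 1):
        try:
            text = await call()
        except TransientError:
            # Layer 1: infrastructure resilience.
            if attempt == max_attempts:
                raise
            # Exponential backoff with full jitter.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(delay)
            continue

        # Layer 2: semantic validation. A failure here is not retried by
        # this function; the caller decides how to recover.
        if not validate(text):
            raise SemanticError("response failed semantic validation")
        return text

    raise ValueError("max_attempts must be at least 1")
```

Keeping the two layers in separate branches also gives telemetry two distinct failure signals rather than one undifferentiated retry counter.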

Core Solution

Building a resilient LLM pipeline requires separating error detection, classification, and recovery into distinct architectural layers. The following implementation uses Python, asyncio, pydantic, and structlog to create a production-ready error routing system.

Step 1: Define the Error Domain Model

Instead of scattering exception handlers, centralize failure classification using an enum-driven strategy pattern. This allows the execution engine to route errors deterministically.

from enum import Enum, auto
from dataclasses import dataclass, field
from typing import Optional, Any
import structlog

logger = structlog.get_logger()

class FailureDomain(Enum):
    INFRASTRUCTURE = auto()
    INPUT_CONSTRAINT = auto()
    OUTPUT_SEMANTIC = auto()
    PIPELINE_LOGIC = auto()

class RecoveryAction(Enum):
    RETRY_WITH_BACKOFF = auto()
    REJECT_AND_FIX = auto()
    ALERT_AND_DEGRADE = auto()
    FALLBACK_TO_CACHE = auto()

@dataclass
class LLMErrorContext:
    domain: FailureDomain
    action: RecoveryAction
    message: str
    original_exception: Optional[Exception] = None
    metadata: dict = field(default_factory=dict)
    is_recoverable: bool = True

    def __post_init__(self):
        if self.action == RecoveryAction.REJECT_AND_FIX:
            self.is_recoverable = False

Architecture Rationale: Decoupling the error domain from the recovery action allows independent evolution. You can change retry policies without modifying classification logic. The metadata field carries trace context (model ID, token count, finish reason) for downstream observability.
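As a rough illustration of that decoupling, the sketch below keeps classification (`classify`) and policy (`DEFAULT_POLICY`) in separate places. The enums and dataclass are re-declared from Step 1 only so the example runs on its own. Everything else is an assumption made for illustration: the `PromptTooLargeError` class, the exception-to-domain mapping, and the specific domain-to-action pairings are not prescribed by the article.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional

# Re-declared from Step 1 so this sketch is self-contained.
FailureDomain = Enum(
    "FailureDomain",
    "INFRASTRUCTURE INPUT_CONSTRAINT OUTPUT_SEMANTIC PIPELINE_LOGIC",
)


class RecoveryAction(Enum):
    RETRY_WITH_BACKOFF = auto()
    REJECT_AND_FIX = auto()
    ALERT_AND_DEGRADE = auto()
    FALLBACK_TO_CACHE = auto()


@dataclass
class LLMErrorContext:
    domain: FailureDomain
    action: RecoveryAction
    message: str
    original_exception: Optional[Exception] = None
    metadata: dict = field(default_factory=dict)
    is_recoverable: bool = True

    def __post_init__(self):
        if self.action == RecoveryAction.REJECT_AND_FIX:
            self.is_recoverable = False


class PromptTooLargeError(Exception):
    """Hypothetical exception for an input that exceeds a model limit."""


# Illustrative policy only; changing it never touches classify().
DEFAULT_POLICY = {
    FailureDomain.INFRASTRUCTURE: RecoveryAction.RETRY_WITH_BACKOFF,
    FailureDomain.INPUT_CONSTRAINT: RecoveryAction.REJECT_AND_FIX,
    FailureDomain.OUTPUT_SEMANTIC: RecoveryAction.FALLBACK_TO_CACHE,
    FailureDomain.PIPELINE_LOGIC: RecoveryAction.ALERT_AND_DEGRADE,
}


def classify(exc: Exception) -> FailureDomain:
    """Map an exception to a failure domain (assumed mapping)."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureDomain.INFRASTRUCTURE
    if isinstance(exc, PromptTooLargeError):
        return FailureDomain.INPUT_CONSTRAINT
    if isinstance(exc, ValueError):  # includes json.JSONDecodeError
        return FailureDomain.OUTPUT_SEMANTIC
    return FailureDomain.PIPELINE_LOGIC


def build_context(exc: Exception, policy=None, **metadata) -> LLMErrorContext:
    """Classify first, then look the action up in the policy table."""
    policy = DEFAULT_POLICY if policy is None else policy
    domain = classify(exc)
    return LLMErrorContext(
        domain=domain,
        action=policy[domain],
        message=str(exc),
        original_exception=exc,
        metadata=dict(metadata),
    )
```

Swapping in a different policy, for example `{**DEFAULT_POLICY, FailureDomain.INFRASTRUCTURE: RecoveryAction.ALERT_AND_DEGRADE}`, changes the recovery behavior without editing the classification code.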

Step 2: Implemen
