Difficulty: Beginner
Read Time: 81 min

Checkpoint Your Agent Jobs So Crashes Don't Mean Starting Over

By Codcompass Team · 81 min read

Resilient LLM Batch Pipelines: Implementing Checkpoint-Driven Resumption

Current Situation Analysis

Large Language Model (LLM) batch jobs are inherently fragile. They combine long execution times, external API dependencies, and high computational costs. A single network timeout, rate limit spike, or out-of-memory error can terminate a process that has been running for hours.

The industry pain point is the restart penalty. When a job crashes at 90% completion, the naive approach forces a full restart. This wastes time, burns through token budgets, and delays downstream data availability.

Consider a realistic production scenario: A batch job processes 1,000 documents for semantic enrichment. Each item requires a multi-step LLM chain taking approximately 2.8 seconds, so a full run takes about 47 minutes. At item 847, a transient network error crashes the process. Without resilience mechanisms, the system restarts from item 1. You have lost roughly 40 minutes of wall-clock time and will pay a second time for the 846 LLM calls that had already completed.
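The figures in this scenario can be checked with a few lines of arithmetic. The sketch below assumes a constant 2.8 seconds per item and no other overhead; the variable names are illustrative only.

```python
SECONDS_PER_ITEM = 2.8   # per-item latency from the scenario
TOTAL_ITEMS = 1_000
CRASH_AT_ITEM = 847      # the item in flight when the process dies


def minutes(items: int) -> float:
    """Wall-clock minutes needed to process `items` items sequentially."""
    return items * SECONDS_PER_ITEM / 60


completed = CRASH_AT_ITEM - 1                      # items finished before the crash
full_run = minutes(TOTAL_ITEMS)                    # one uninterrupted run
lost_on_crash = minutes(completed)                 # work thrown away by a naive restart
resume_remaining = minutes(TOTAL_ITEMS - completed)

# Total time to finish the job after one crash, under each strategy.
naive_total = lost_on_crash + full_run
checkpointed_total = lost_on_crash + resume_remaining

print(f"full run:             {full_run:.1f} min")
print(f"lost on naive restart: {lost_on_crash:.1f} min, {completed} repeated calls")
print(f"naive total:          {naive_total:.1f} min")
print(f"checkpointed total:   {checkpointed_total:.1f} min")
```

Under these assumptions, a single crash nearly doubles the naive job's total duration, while the checkpointed job finishes in about the time of one uninterrupted run.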

This problem is often overlooked because developers optimize for the "happy path" during local testing. Local environments rarely exhibit the network instability or resource constraints of production. Furthermore, many teams assume that cloud infrastructure auto-restarts will handle failures, ignoring the fact that auto-restarts typically trigger full re-execution unless state is explicitly persisted.

Data from production LLM pipelines indicates that jobs exceeding 15 minutes have a non-trivial failure probability due to the cumulative risk of transient errors. For jobs processing thousands of items, the expected cost of a crash without checkpointing can exceed the cost of the job itself if failures occur frequently.

WOW Moment: Key Findings

Implementing a checkpoint-driven resumption strategy transforms failure modes from catastrophic to manageable. The following comparison illustrates the operational impact of adopting a checkpoint pattern versus a naive execution loop.

| Approach | Crash Recovery Time | Token Waste on Crash | Implementation Complexity | Result Persistence |
|---|---|---|---|---|
| Naive Loop | Full restart duration | 100% of processed tokens | Low | Manual/External |
| Checkpointed | Near-zero (resume) | 0% (skips completed) | Medium | Manual/External |
| Idempotent + Checkpoint | Near-zero (resume) | 0% | High | Automatic |

Why this matters: The checkpointed approach decouples progress from execution. By persisting completion state, you convert a crash into a pause. The system reads the checkpoint, identifies the gap, and resumes exactly where it stopped. This eliminates redundant API calls and ensures predictable completion times, even in unstable environments.
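The "crash becomes a pause" behavior can be illustrated with a minimal in-memory sketch. Here a Python set stands in for the persisted checkpoint, and `pending`, `run`, and `fake_llm_step` are hypothetical names chosen for illustration; a real job would record completions in durable storage.

```python
from typing import Callable, Iterable, Iterator, List, Set


def pending(item_ids: Iterable[str], completed: Set[str]) -> Iterator[str]:
    """Yield only the IDs that have no completion record yet."""
    for item_id in item_ids:
        if item_id not in completed:
            yield item_id


def run(item_ids: Iterable[str], completed: Set[str],
        process: Callable[[str], None]) -> None:
    """Process outstanding items, recording each one only after it succeeds."""
    for item_id in pending(item_ids, completed):
        process(item_id)        # may raise; the item then stays unrecorded
        completed.add(item_id)  # stand-in for a durable checkpoint write


# Simulate the scenario: 1,000 items, one transient failure at item 847.
ids = [f"doc-{n}" for n in range(1, 1001)]
completed: Set[str] = set()
calls: List[str] = []
fail_once = {"doc-847"}


def fake_llm_step(item_id: str) -> None:
    if item_id in fail_once:
        fail_once.discard(item_id)
        raise ConnectionError("simulated transient failure")
    calls.append(item_id)


try:
    run(ids, completed, fake_llm_step)
except ConnectionError:
    pass                        # the "crash"

calls_before_resume = len(calls)
run(ids, completed, fake_llm_step)  # the "restart": resumes at doc-847
```

The second call to `run` skips the 846 recorded items and issues calls only for the remaining 154, so no item is processed twice in this simulation.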

Critical Insight: Checkpointing does not make your processing logic idempotent. If a crash occurs during the processing of an item (after the LLM call starts but before the checkpoint is updated), that item will be retried. This is a feature, not a bug, but it requires your processing logic to handle potential duplicate executions safely.
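One way to make a retried item harmless is to key each stored result by its item ID, so a second execution overwrites the first instead of adding a duplicate. The sketch below assumes results are written as one JSON file per item and that item IDs are safe to use as file names; `save_result` and the directory layout are hypothetical and not taken from the article.

```python
import json
import os
import tempfile
from pathlib import Path


def save_result(results_dir: Path, item_id: str, result: dict) -> Path:
    """Store the result under a name derived from the item ID.

    Writing to a temporary file and renaming it over the final path means
    a reader sees either the old complete file or the new complete file.
    """
    results_dir.mkdir(parents=True, exist_ok=True)
    final_path = results_dir / f"{item_id}.json"
    tmp_path = results_dir / f"{item_id}.json.tmp"
    tmp_path.write_text(json.dumps(result), encoding="utf-8")
    os.replace(tmp_path, final_path)  # replaces any earlier copy in one step
    return final_path


# A retry of the same item leaves exactly one result file behind.
out_dir = Path(tempfile.mkdtemp()) / "results"
save_result(out_dir, "doc-847", {"summary": "first attempt"})
save_result(out_dir, "doc-847", {"summary": "second attempt"})

stored_files = sorted(p.name for p in out_dir.iterdir())
stored = json.loads((out_dir / "doc-847.json").read_text(encoding="utf-8"))
```

With this layout, processing the same item twice costs an extra LLM call but cannot produce duplicate output records.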

Core Solution

The robust pattern for resilient batch processing involves three components: a Checkpoint Store, a Session Manager, and a Filtering Iterator.

Architecture Decisions

  1. Append-Only JSONL Store: The checkpoint file uses JSON Lines format. Each completed item is recorded as a single line containing the item ID and a timestamp. This format is append-only, making it efficient for high-throughput writes.
  2. Atomic Appends: Writes to the checkpoint file must use the O_APPEND flag. With this flag the OS positions each write at the current end of the file as a single atomic step, so records are never written over one another. It does not guarantee that a whole record reaches disk: if a crash occurs mid-write, the result can be a partial final line. The loader must be designed to skip malformed lines, preventing corruption from halting the resume process.
  3. Session Abstraction: A context manager encapsulates the checkpoint lifecycle. It loads existing state on
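The decisions above can be sketched as follows. This is a minimal illustration, not the article's implementation: the `id` and `ts` field names, the `CheckpointSession` class, and everything the session does beyond loading existing state are assumptions made for the example.

```python
import json
import os
import tempfile
import time
from typing import Iterable, Iterator, Set


def load_completed(path: str) -> Set[str]:
    """Read completed item IDs, skipping partial or malformed lines."""
    completed: Set[str] = set()
    try:
        with open(path, "r", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                try:
                    completed.add(json.loads(line)["id"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    continue  # damaged record: treat the item as not done
    except FileNotFoundError:
        pass  # first run: nothing completed yet
    return completed


def _ends_mid_line(path: str) -> bool:
    """True if the file is non-empty and its last byte is not a newline."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        if fh.tell() == 0:
            return False
        fh.seek(-1, os.SEEK_END)
        return fh.read(1) != b"\n"


class CheckpointSession:
    """Hypothetical context manager around an append-only JSONL checkpoint."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.completed: Set[str] = set()
        self._fd = -1

    def __enter__(self) -> "CheckpointSession":
        self.completed = load_completed(self.path)
        flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT
        self._fd = os.open(self.path, flags, 0o644)
        if _ends_mid_line(self.path):
            os.write(self._fd, b"\n")  # terminate a partial line from a crash
        return self

    def __exit__(self, *exc_info: object) -> None:
        os.close(self._fd)

    def pending(self, item_ids: Iterable[str]) -> Iterator[str]:
        """Filtering iterator: yield only items without a completion record."""
        return (i for i in item_ids if i not in self.completed)

    def mark_done(self, item_id: str) -> None:
        record = json.dumps({"id": item_id, "ts": time.time()}) + "\n"
        os.write(self._fd, record.encode("utf-8"))  # one write per record
        os.fsync(self._fd)                          # flush to disk before moving on
        self.completed.add(item_id)


# Demo: a checkpoint whose last record was cut off mid-write.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.jsonl")
with open(path, "w", encoding="utf-8") as fh:
    fh.write('{"id": "a", "ts": 1.0}\n{"id": "b", "ts": 2.0}\n{"id": "c", "t')

with CheckpointSession(path) as session:
    todo = list(session.pending(["a", "b", "c", "d"]))
    for item_id in todo:
        session.mark_done(item_id)  # a real job would process the item first
```

In the demo, item `c` is treated as unfinished because its record is incomplete, so it is retried along with `d`. Calling `os.fsync` after every record trades throughput for durability; a job that tolerates re-running a few items after a crash could sync less often.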
