Difficulty: Beginner

Read Time: 81 min

Checkpoint Your Agent Jobs So Crashes Don't Mean Starting Over

By Codcompass Team·2026-05-26·81 min read

Resilient LLM Batch Pipelines: Implementing Checkpoint-Driven Resumption

Current Situation Analysis

Large Language Model (LLM) batch jobs are inherently fragile. They combine long execution times, external API dependencies, and high computational costs. A single network timeout, rate limit spike, or out-of-memory error can terminate a process that has been running for hours.

The industry pain point is the restart penalty. When a job crashes at 90% completion, the naive approach forces a full restart. This wastes time, burns through token budgets, and delays downstream data availability.

Consider a realistic production scenario: A batch job processes 1,000 documents for semantic enrichment. Each item requires a multi-step LLM chain taking approximately 2.8 seconds, so a full run takes about 47 minutes. At item 847, roughly 40 minutes in, a transient network error crashes the process. Without resilience mechanisms, the system restarts from item 1. You have lost about 40 minutes of wall-clock time and will pay a second time for the 846 LLM calls that had already completed.
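The arithmetic in this scenario can be checked with a short sketch. The helper name `restartPenalty` and its parameters are illustrative, not part of any library. Using the scenario's own numbers (2.8 seconds per item, crash at item 847), the work completed before the crash comes to about 39.5 minutes; the roughly 47-minute figure is the duration of the full 1,000-item run.

```typescript
// Hypothetical helper: estimates what a crash costs when nothing is checkpointed.
interface RestartPenalty {
  lostMinutes: number;    // wall-clock time spent on items before the crash
  redundantCalls: number; // completed items that must be processed again
}

function restartPenalty(crashAtItem: number, secondsPerItem: number): RestartPenalty {
  const completed = crashAtItem - 1; // items finished before the failing one
  return {
    lostMinutes: (completed * secondsPerItem) / 60,
    redundantCalls: completed,
  };
}

const penalty = restartPenalty(847, 2.8);
console.log(penalty.redundantCalls);         // 846
console.log(penalty.lostMinutes.toFixed(1)); // "39.5"
```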

This problem is often overlooked because developers optimize for the "happy path" during local testing. Local environments rarely exhibit the network instability or resource constraints of production. Furthermore, many teams assume that cloud infrastructure auto-restarts will handle failures, ignoring the fact that auto-restarts typically trigger full re-execution unless state is explicitly persisted.

Data from production LLM pipelines indicates that jobs exceeding 15 minutes have a non-trivial failure probability due to the cumulative risk of transient errors. For jobs processing thousands of items, the expected cost of a crash without checkpointing can exceed the cost of the job itself if failures occur frequently.
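To see why cumulative risk grows with job size, consider a simplified model in which every item fails independently with the same small probability. Both the independence assumption and the failure rate used below are illustrative; they are not measurements from the article.

```typescript
// Assumption: each item fails independently with the same probability.
function probabilityOfCleanRun(itemCount: number, perItemFailureRate: number): number {
  return Math.pow(1 - perItemFailureRate, itemCount);
}

// Illustrative rate only: one transient failure per 1,000 calls.
const cleanRun = probabilityOfCleanRun(1000, 0.001);
console.log(cleanRun.toFixed(3)); // "0.368"
```

Under these assumptions a 1,000-item job finishes without a single failure only about 37% of the time, even though each individual call succeeds 99.9% of the time.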

WOW Moment: Key Findings

Implementing a checkpoint-driven resumption strategy transforms failure modes from catastrophic to manageable. The following comparison illustrates the operational impact of adopting a checkpoint pattern versus a naive execution loop.

Approach	Crash Recovery Time	Token Waste on Crash	Implementation Complexity	Result Persistence
Naive Loop	Full restart duration	100% of processed tokens	Low	Manual/External
Checkpointed	Near-zero (resume)	Near 0% (skips completed; only the in-flight item is retried)	Medium	Manual/External
Idempotent + Checkpoint	Near-zero (resume)	Near 0% (in-flight item is retried safely)	High	Automatic

Why this matters: The checkpointed approach decouples progress from execution. By persisting completion state, you convert a crash into a pause. The system reads the checkpoint, identifies the gap, and resumes exactly where it stopped. This eliminates redundant API calls and ensures predictable completion times, even in unstable environments.

Critical Insight: Checkpointing does not make your processing logic idempotent. If a crash occurs during the processing of an item (after the LLM call starts but before the checkpoint is updated), that item will be retried. This is a feature, not a bug, but it requires your processing logic to handle potential duplicate executions safely.
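One common way to make a retried item harmless is to key result writes on the item ID, so a second write replaces the first instead of adding a duplicate. The sketch below uses a hypothetical in-memory `ResultStore`; a real system would more likely rely on a database upsert or an object-store key.

```typescript
// Hypothetical in-memory result store. A real one would typically be a
// database table with a unique key on the item ID.
class ResultStore {
  private results = new Map<string, string>();

  // Upsert: writing the same ID twice overwrites instead of duplicating.
  save(id: string, content: string): void {
    this.results.set(id, content);
  }

  get(id: string): string | undefined {
    return this.results.get(id);
  }

  count(): number {
    return this.results.size;
  }
}

const resultStore = new ResultStore();
resultStore.save('doc-847', 'first attempt');      // crash happens before the checkpoint
resultStore.save('doc-847', 'retry after resume'); // same item processed again
console.log(resultStore.count());                  // 1
console.log(resultStore.get('doc-847'));           // "retry after resume"
```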

Core Solution

The robust pattern for resilient batch processing involves three components: a Checkpoint Store, a Session Manager, and a Filtering Iterator.

Architecture Decisions

Append-Only JSONL Store: The checkpoint file uses JSON Lines format. Each completed item is recorded as a single line containing the item ID and a timestamp. This format is append-only, making it efficient for high-throughput writes.
Atomic Appends: Writes to the checkpoint file must use the O_APPEND flag. This makes the seek-to-end and the write a single step at the OS level, so a record is never written into the middle of the file. It does not guarantee that the whole record reaches disk: if a crash occurs mid-write, the result is a partial line. The loader must be designed to skip malformed lines, preventing corruption from halting the resume process.
Session Abstraction: A session object encapsulates the checkpoint lifecycle. It loads existing state on entry, provides an iterator that filters out completed items, and ensures the checkpoint is flushed on exit.
Separation of Concerns: The checkpoint store tracks progress, not results. Results should be persisted by the business logic. This keeps the checkpoint file small and fast, while allowing flexible result storage strategies.
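The JSONL format and the tolerant loader described in this list can be sketched as a pure function over the file's text. The names `parseCheckpoint` and `CheckpointLine` are illustrative; the sample input ends with a line cut off mid-write to mimic a crash.

```typescript
interface CheckpointLine {
  id: string;
  completedAt: number;
}

interface ParsedCheckpoint {
  completed: Set<string>; // IDs of items recorded as done
  malformed: number;      // lines that could not be parsed
}

function parseCheckpoint(text: string): ParsedCheckpoint {
  const completed = new Set<string>();
  let malformed = 0;
  for (const raw of text.split('\n')) {
    const line = raw.trim();
    if (line === '') continue; // blank line or trailing newline
    try {
      const record = JSON.parse(line) as Partial<CheckpointLine>;
      if (typeof record.id === 'string') {
        completed.add(record.id);
      } else {
        malformed++; // valid JSON, but not a checkpoint record
      }
    } catch {
      malformed++; // partial write or otherwise corrupt line
    }
  }
  return { completed, malformed };
}

const sample =
  JSON.stringify({ id: 'doc-1', completedAt: 1 }) + '\n' +
  JSON.stringify({ id: 'doc-2', completedAt: 2 }) + '\n' +
  '{"id":"doc-3","comple'; // truncated by a simulated crash

const parsed = parseCheckpoint(sample);
console.log(Array.from(parsed.completed)); // [ 'doc-1', 'doc-2' ]
console.log(parsed.malformed);             // 1
```

One consequence worth knowing: because the truncated line has no trailing newline, the first record appended after a resume lands on the same line and is also unreadable. That item is simply retried once more on the following run.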

Implementation Pattern

The following TypeScript example demonstrates a production-grade implementation of this pattern. Note the use of a generic CheckpointStore interface, allowing for different backends (file, database, S3).

import { appendFileSync, readFileSync } from 'fs';

// --- Interfaces ---

interface CheckpointRecord {
  id: string;
  completedAt: number;
}

interface CheckpointStore {
  load(): Promise<Set<string>>;
  save(record: CheckpointRecord): Promise<void>;
}

interface BatchSession<T> {
  pendingItems: AsyncIterable<T>;
  markComplete(id: string): Promise<void>;
  jobId: string;
}

// --- File-Based Store Implementation ---

class FileCheckpointStore implements CheckpointStore {
  constructor(private filePath: string) {}

  async load(): Promise<Set<string>> {
    const completedIds = new Set<string>();
    let contents: string;
    try {
      // The checkpoint holds only IDs and timestamps, so reading it whole is cheap
      contents = readFileSync(this.filePath, 'utf-8');
    } catch (err: any) {
      if (err.code !== 'ENOENT') throw err;
      // File doesn't exist yet; start fresh
      return completedIds;
    }
    // Split into lines explicitly; a raw read stream yields chunks, not lines
    for (const rawLine of contents.split('\n')) {
      const line = rawLine.trim();
      if (line === '') continue; // trailing newline or blank line
      try {
        const record: CheckpointRecord = JSON.parse(line);
        if (typeof record.id === 'string') completedIds.add(record.id);
      } catch {
        // Skip partial writes or malformed lines
        console.warn(`Skipping malformed checkpoint line: ${line.substring(0, 50)}...`);
      }
    }
    return completedIds;
  }

  async save(record: CheckpointRecord): Promise<void> {
    const line = JSON.stringify(record) + '\n';
    // The 'a' flag opens with O_APPEND, so each write lands at the end of the file
    appendFileSync(this.filePath, line, { flag: 'a' });
  }
}
}

// --- Session Manager ---

class ResilientBatchManager {
  constructor(private store: CheckpointStore) {}

  async createSession<T>(
    jobId: string,
    allItems: AsyncIterable<T>,
    getId: (item: T) => string
  ): Promise<BatchSession<T>> {
    const completedIds = await this.store.load();
    
    return {
      jobId,
      pendingItems: this.filterPending(allItems, completedIds, getId),
      markComplete: async (id: string) => {
        await this.store.save({ id, completedAt: Date.now() });
      },
    };
  }

  private async *filterPending<T>(
    source: AsyncIterable<T>,
    completed: Set<string>,
    getId: (item: T) => string
  ): AsyncIterable<T> {
    for await (const item of source) {
      const id = getId(item);
      if (!completed.has(id)) {
        yield item;
      }
    }
  }
}

// --- Usage Example ---

async function runEnrichmentPipeline() {
  const store = new FileCheckpointStore('./data/enrichment_run_42.chk');
  const manager = new ResilientBatchManager(store);
  
  // Simulate loading items from a database or API
  const sourceItems = loadDocumentStream(); 
  
  const session = await manager.createSession(
    'enrichment_run_42',
    sourceItems,
    (doc) => doc.uid
  );

  console.log(`Resuming job ${session.jobId}.`);

  for await (const doc of session.pendingItems) {
    try {
      // Expensive LLM operation
      const enrichedContent = await callLLMEnrichment(doc);
      
      // Persist result separately
      await saveEnrichedResult(doc.uid, enrichedContent);
      
      // Checkpoint progress
      await session.markComplete(doc.uid);
    } catch (error) {
      console.error(`Failed to process ${doc.uid}:`, error);
      // Decision: Fail fast or continue? 
      // For batch jobs, often better to log and continue.
    }
  }
}
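Because the manager depends only on the CheckpointStore interface, an alternative backend can be swapped in without touching the session logic. The following is a minimal sketch of an in-memory backend, which may be handy for unit tests. The class name `InMemoryCheckpointStore` and the `demo` function are hypothetical, and the interfaces are repeated so the snippet stands alone.

```typescript
interface CheckpointRecord {
  id: string;
  completedAt: number;
}

interface CheckpointStore {
  load(): Promise<Set<string>>;
  save(record: CheckpointRecord): Promise<void>;
}

// Hypothetical backend that keeps records in memory. State is lost when the
// process exits, so it is only suitable for tests.
class InMemoryCheckpointStore implements CheckpointStore {
  private records: CheckpointRecord[] = [];

  async load(): Promise<Set<string>> {
    return new Set(this.records.map((r) => r.id));
  }

  async save(record: CheckpointRecord): Promise<void> {
    this.records.push(record);
  }
}

async function demo(): Promise<string[]> {
  const store = new InMemoryCheckpointStore();
  await store.save({ id: 'doc-1', completedAt: Date.now() });
  await store.save({ id: 'doc-2', completedAt: Date.now() });

  // A resumed run would skip everything the store already knows about.
  const completed = await store.load();
  return ['doc-1', 'doc-2', 'doc-3'].filter((id) => !completed.has(id));
}

demo().then((pending) => console.log(pending)); // [ 'doc-3' ]
```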

Rationale for Choices

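O_APPEND Flag: Using appendFileSync with the a flag opens the file with O_APPEND, so the OS positions every write at the current end of the file. On local POSIX filesystems this keeps small single-call appends from overwriting one another. It does not make a write crash-proof (a process that dies mid-write can still leave a partial line), and it is not a substitute for coordination between concurrent workers (see Pitfalls).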
Malformed Line Handling: The loader catches JSON parse errors. This is critical for crash recovery. If the process dies while writing a line, the file contains a partial JSON object. The loader skips this line, ensuring the resume process doesn't crash on its own checkpoint.
Async Iterators: Using AsyncIterable allows the pipeline to process items as they stream in, rather than loading the entire dataset into memory. This supports jobs with millions of items.
Explicit markComplete: The checkpoint is updated after the result is saved. This ensures that if a crash occurs between saving the result and updating the checkpoint, the item is retried. While this may cause a duplicate result save, it guarantees no data loss. Idempotent result storage handles the duplicate safely.

Pitfall Guide

Production batch pipelines encounter specific failure modes. The following pitfalls are derived from real-world deployments of checkpointed systems.

The Idempotency Trap
- Explanation: Developers assume checkpointing prevents duplicate processing. It does not. If a crash occurs after callLLMEnrichment starts but before markComplete is called, the item is retried.
- Fix: Ensure your processing logic is idempotent. Use deterministic seeds for LLM calls, or design downstream consumers to handle duplicate updates gracefully. If the LLM call is non-deterministic and expensive, consider writing results to a temporary location and moving them to the final destination only after checkpointing.
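The temporary-location approach can be sketched with plain files: stage the result, record the checkpoint, then promote the staged file with a rename. Everything here (the `commitResult` name, the file layout, the scratch directory) is an assumption made for illustration, not the article's implementation.

```typescript
import { appendFileSync, existsSync, mkdtempSync, renameSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Scratch directory so the sketch does not touch real data.
const workDir = mkdtempSync(join(tmpdir(), 'two-phase-'));
const checkpointFile = join(workDir, 'job.chk');

function commitResult(id: string, content: string): string {
  const stagedPath = join(workDir, `${id}.tmp`);
  const finalPath = join(workDir, `${id}.json`);

  // 1. Stage the result where downstream consumers will not read it.
  writeFileSync(stagedPath, content);

  // 2. Record completion in the checkpoint.
  const record = JSON.stringify({ id, completedAt: Date.now() }) + '\n';
  appendFileSync(checkpointFile, record);

  // 3. Promote the staged file to its final name.
  renameSync(stagedPath, finalPath);
  return finalPath;
}

const promotedPath = commitResult('doc-1', JSON.stringify({ summary: 'example' }));
console.log(existsSync(promotedPath));                  // true
console.log(existsSync(join(workDir, 'doc-1.tmp')));    // false
```

A crash between steps 2 and 3 leaves a checkpointed item whose result is still staged, so a resume routine built on this sketch would also need to promote any leftover .tmp files for IDs that appear in the checkpoint.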
Concurrent Worker Collisions
- Explanation: The file-based store supports single-worker execution. Running multiple workers against the same checkpoint file without coordination leads to race conditions and corrupted state.
- Fix: Use one worker per job file. If parallelism is required, partition the input data by worker ID and use separate checkpoint files per worker, or switch to a database-backed store with row-level locking.
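Partitioning by worker can be as simple as hashing each item ID to a worker index and deriving one checkpoint path per worker. The hash function, helper names, and path layout below are assumptions; any hash that is stable across runs would serve.

```typescript
// Simple deterministic string hash (a djb2 variant). Stability across runs
// matters more than distribution quality for this purpose.
function stableHash(value: string): number {
  let hash = 5381;
  for (let i = 0; i < value.length; i++) {
    hash = ((hash * 33) ^ value.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Maps an item to exactly one worker, the same one on every run.
function workerFor(itemId: string, workerCount: number): number {
  return stableHash(itemId) % workerCount;
}

// Hypothetical naming scheme: one checkpoint file per worker.
function checkpointPathFor(jobId: string, workerIndex: number): string {
  return `./checkpoints/${jobId}.worker-${workerIndex}.chk`;
}

const workerCount = 4;
const assigned = workerFor('doc-847', workerCount);
console.log(assigned === workerFor('doc-847', workerCount)); // true
console.log(checkpointPathFor('enrichment_run_42', 2));
// ./checkpoints/enrichment_run_42.worker-2.chk
```

Each worker then filters the input to the items whose index matches its own and writes only to its own file. Note that changing the worker count between runs reassigns items, so the count should stay fixed for the life of a job.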
Result Orphaning
- Explanation: The checkpoint store tracks IDs, not results. A common mistake is assuming the checkpoint file contains the processed data.
- Fix: Always persist results to a dedicated storage system (database, object store) within the processing loop. The checkpoint should only record completion status.
Overhead on Micro-Jobs
- Explanation: Checkpointing introduces I/O overhead. For jobs that complete in seconds, the cost of file reads/writes may outweigh the benefit.
- Fix: Implement a threshold. Jobs that finish in under 30 seconds never need checkpointing, and the Decision Matrix in this article treats anything under 5 minutes as cheap enough to restart. For longer jobs, enable it. Monitor the ratio of checkpoint writes to processing time to tune this threshold.
Partial Write Blindness
- Explanation: While the loader skips partial lines, silent skipping can mask issues. If partial writes occur frequently, it indicates unstable storage or aggressive crash patterns.
- Fix: Instrument the loader to emit warnings or metrics when malformed lines are detected. Alert if the rate of partial writes exceeds a baseline, as this may indicate underlying infrastructure problems.
Input Drift
- Explanation: If the input data changes between runs (e.g., updated documents), the checkpoint may skip items that need reprocessing.
- Fix: Include a version hash or timestamp in the checkpoint record. When loading, compare the item's current version against the checkpoint. If they differ, treat the item as pending even if the ID matches.
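A version-aware pending check might look like the sketch below. It assumes each item exposes a `version` string (for example a content hash) and that the loader builds a map from ID to the version that was last completed; both are illustrative choices.

```typescript
interface VersionedItem {
  id: string;
  version: string; // e.g. a content hash or last-modified stamp
}

// An item is pending if it was never completed, or if it was completed
// under a different version than the one it has now.
function isPending(item: VersionedItem, completedVersions: Map<string, string>): boolean {
  return completedVersions.get(item.id) !== item.version;
}

// State as it might be loaded from a versioned checkpoint file.
const completedVersions = new Map<string, string>([
  ['doc-1', 'v1'],
  ['doc-2', 'v1'],
]);

const currentItems: VersionedItem[] = [
  { id: 'doc-1', version: 'v1' }, // unchanged, skipped
  { id: 'doc-2', version: 'v2' }, // changed since last run, reprocessed
  { id: 'doc-3', version: 'v1' }, // never processed
];

const pendingIds = currentItems
  .filter((item) => isPending(item, completedVersions))
  .map((item) => item.id);
console.log(pendingIds); // [ 'doc-2', 'doc-3' ]
```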
File Descriptor Exhaustion
- Explanation: In high-throughput loops, opening and closing the checkpoint file for every item can exhaust file descriptors or cause performance bottlenecks.
- Fix: Keep the checkpoint file open during the session. Use a buffered writer that flushes periodically or on explicit markComplete calls, rather than opening/closing per write.
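A writer that holds one descriptor for the whole session could be sketched as follows, using Node's synchronous `openSync`, `writeSync`, and `closeSync`. The class name `OpenCheckpointWriter` is hypothetical, and the demo writes to a scratch directory.

```typescript
import { closeSync, mkdtempSync, openSync, readFileSync, writeSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Opens the checkpoint once in append mode and reuses the descriptor.
class OpenCheckpointWriter {
  private fd: number;

  constructor(filePath: string) {
    this.fd = openSync(filePath, 'a');
  }

  markComplete(id: string): void {
    writeSync(this.fd, JSON.stringify({ id, completedAt: Date.now() }) + '\n');
  }

  close(): void {
    closeSync(this.fd);
  }
}

const scratchDir = mkdtempSync(join(tmpdir(), 'chk-writer-'));
const scratchFile = join(scratchDir, 'job.chk');

const writer = new OpenCheckpointWriter(scratchFile);
try {
  writer.markComplete('doc-1');
  writer.markComplete('doc-2');
} finally {
  writer.close(); // release the descriptor even if processing throws
}

const writtenLines = readFileSync(scratchFile, 'utf-8').trim().split('\n');
console.log(writtenLines.length); // 2
```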

Production Bundle

Action Checklist

Define Unique Identifiers: Ensure every item in your batch has a stable, unique ID that persists across runs.
Implement Idempotent Processing: Review your LLM calls and side effects to ensure they can be safely retried without corrupting data.
Add Deadline Integration: Wrap the batch loop with a wall-clock deadline to prevent runaway jobs. Resume on timeout.
Test Crash Recovery: Simulate crashes at random intervals during development to verify resume behavior.
Monitor Checkpoint Health: Track checkpoint file size and warn on malformed lines.
Version Checkpoints: Include input data versioning in checkpoints to handle data drift.
Isolate Workers: Ensure each parallel worker uses a distinct checkpoint file or partition.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Job Duration < 5 mins	Skip Checkpointing	Overhead exceeds benefit; restart is fast.	Low
Job Duration > 15 mins	File-Based Checkpoint	Prevents redundant LLM costs; resume is instant.	High savings on failure
Single Worker, High Volume	File-Based Checkpoint	Simple, low latency, sufficient for single process.	Low
Multiple Workers	Database-Backed Store	File locks are unreliable for concurrency; DB provides row locking.	Medium (DB cost)
Non-Idempotent Side Effects	Two-Phase Commit	Save results to temp, checkpoint, then promote. Prevents duplicates.	High complexity
Unstable Network	Checkpoint + Retry	Checkpoint prevents restart; retry handles transient errors.	Medium
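For the "Checkpoint + Retry" row, a retry wrapper with exponential backoff is one plausible shape for the per-item retry. The `withRetry` helper, its defaults, and the simulated flaky call are assumptions for illustration.

```typescript
// Retries an async operation, doubling the wait after each failed attempt.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        const delayMs = baseDelayMs * Math.pow(2, attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError; // every attempt failed
}

// Simulated flaky call: fails twice, then succeeds.
let attemptsMade = 0;
async function flakyCall(): Promise<string> {
  attemptsMade++;
  if (attemptsMade < 3) throw new Error('transient network error');
  return 'ok';
}

const retryOutcome = withRetry(flakyCall, 3, 1);
retryOutcome.then((value) => console.log(value, attemptsMade)); // ok 3
```

When all attempts fail, the error propagates to the batch loop, which can log the item and move on without checkpointing it, so the item is picked up again on the next run.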

Configuration Template

This template provides a robust setup with deadline enforcement and error handling, ready for production adaptation.

import { ResilientBatchManager, FileCheckpointStore } from './resilient-batch';
import { Deadline } from './deadline'; // Assume deadline library

async function productionBatchJob() {
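  // Use a stable path derived from the job ID. A timestamped path would create a
  // fresh checkpoint on every run, so a restart would never find prior progress.
  // The ./checkpoints directory must already exist.
  const checkpointPath = './checkpoints/prod_enrichment_v2.chk';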
  const store = new FileCheckpointStore(checkpointPath);
  const manager = new ResilientBatchManager(store);
  
  // Set a 1-hour deadline to prevent runaway costs
  const deadline = Deadline.fromNow({ hours: 1 });

  try {
    const session = await manager.createSession(
      'prod_enrichment_v2',
      loadItemsFromSource(),
      (item) => item.id
    );

    console.log(`Starting job ${session.jobId}.`);

    for await (const item of session.pendingItems) {
      // Check deadline before processing
      if (deadline.isExceeded()) {
        console.log('Deadline reached. Saving state and exiting.');
        break;
      }

      try {
        const result = await processItemWithRetry(item);
        await persistResult(item.id, result);
        await session.markComplete(item.id);
      } catch (err) {
        console.error(`Item ${item.id} failed permanently:`, err);
        // Optionally record failures to a separate log
      }
    }

    console.log('Job completed or paused successfully.');
  } catch (err) {
    console.error('Fatal error in batch pipeline:', err);
    process.exit(1);
  }
}
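The template imports a Deadline helper without showing it. One minimal sketch that would satisfy the two calls used there, `Deadline.fromNow` and `isExceeded`, is given below. The optional clock parameter is an addition made here so the behavior can be tested without waiting; it is not implied by the template.

```typescript
interface DurationSpec {
  hours?: number;
  minutes?: number;
  seconds?: number;
}

// Hypothetical stand-in for the './deadline' module assumed by the template.
class Deadline {
  private constructor(
    private readonly expiresAtMs: number,
    private readonly now: () => number,
  ) {}

  // The clock defaults to Date.now and can be replaced in tests.
  static fromNow(duration: DurationSpec, now: () => number = Date.now): Deadline {
    const totalSeconds =
      (duration.hours ?? 0) * 3600 +
      (duration.minutes ?? 0) * 60 +
      (duration.seconds ?? 0);
    return new Deadline(now() + totalSeconds * 1000, now);
  }

  isExceeded(): boolean {
    return this.now() >= this.expiresAtMs;
  }
}

// Drive the deadline with a fake clock instead of real time.
let fakeClockMs = 0;
const oneHour = Deadline.fromNow({ hours: 1 }, () => fakeClockMs);
const exceededAtStart = oneHour.isExceeded();
fakeClockMs = 60 * 60 * 1000;
const exceededAfterHour = oneHour.isExceeded();
console.log(exceededAtStart, exceededAfterHour); // false true
```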

Quick Start Guide

Initialize Store: Create a FileCheckpointStore pointing to a dedicated directory for checkpoint files.
Create Session: Call manager.createSession with your job ID, item stream, and ID extractor function.
Iterate Pending: Loop over session.pendingItems. This iterator automatically skips items recorded in the checkpoint.
Process and Checkpoint: After successfully processing and saving results for an item, call session.markComplete(id).
Resume: On the next run, the session loads the checkpoint and yields only unprocessed items. No code changes required.
