[3 Reliability Patterns That Stopped My AI Agent From Crashing Every 6 Hours]

Architecting Resilient AI Agent Workflows: Production Patterns for High Availability

Current Situation Analysis

Autonomous AI agents are rapidly moving from experimental demos to critical production workloads, handling tasks ranging from overnight data extraction and code triage to automated reporting and communication. However, the operational reality often diverges sharply from development environments. Agents deployed without robust infrastructure patterns exhibit high fragility, frequently succumbing to unhandled exceptions, hanging network requests, or memory exhaustion.

A pervasive misunderstanding among engineering teams is treating agent scripts as self-sustaining processes. Developers often wrap agent logic in a simple infinite loop and assume the process will run indefinitely. In practice, large language model (LLM) interactions introduce non-deterministic failure modes: malformed token streams that crash parsers, tool calls that hang indefinitely due to upstream latency, and gradual memory growth from context accumulation. Without supervision, a single crash halts the entire workflow. Scheduled triggers continue to fire and queue behind the dead process, leading to cascading delays and missed SLAs.

Production data from early adopters indicates that naive agent deployments can suffer uptime rates as low as 71%, with processes freezing every six to twelve hours. This instability forces engineering teams into a "babysitting" mode, manually restarting processes and investigating failures, which negates the automation value proposition. Furthermore, unbounded retry loops on failed tool calls can inflate inference costs by significant margins, as agents repeatedly attempt operations on broken endpoints without backoff or circuit breaking.

WOW Moment: Key Findings

Transitioning from script-based execution to a service-oriented architecture yields immediate, measurable improvements in reliability and cost efficiency. The following comparison highlights the impact of implementing process supervision, state persistence, and bounded execution.

Metric	Naive Script Approach	Production-Hardened Architecture	Delta
Uptime	~71%	99.4%	+28.4 percentage points
Mean Time to Recovery	Hours (Manual Detection)	<30 Seconds (Automated)	~99% Reduction
Token Efficiency	Baseline	-40% Spend	Significant Cost Savings
Operational Overhead	High (Manual Intervention)	Low (Self-Healing)	Drastic Reduction

Why this matters: The shift to a hardened architecture decouples agent reliability from model stability. By externalizing state and enforcing boundaries, the system absorbs LLM and tool failures without data loss or process death. The reduction in token spend stems from eliminating infinite retry loops and preventing redundant work after crashes, directly improving the unit economics of agent operations.

Core Solution

Building a resilient agent requires three foundational pillars: process supervision, external state persistence, and bounded execution with circuit breaking. The following implementation uses TypeScript to demonstrate these patterns, emphasizing type safety and modern async control flow.

1. Process Supervision via Service Managers

Never rely on while(true) loops for production agents. Instead, deploy the agent under a process manager like systemd or pm2. This ensures automatic restart on crash, log aggregation, and resource limits.

Architecture Decision: Use pm2 for Node/TypeScript environments to leverage its ecosystem configuration and built-in monitoring. This abstracts the restart logic and provides a unified view of agent health.

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'data-extractor-agent',
    script: 'dist/agent.js',
    instances: 1,
    autorestart: true,
    max_restarts: 10,
    restart_delay: 5000,
    error_file: '/var/log/agents/data-extractor.err.log',
    out_file: '/var/log/agents/data-extractor.out.log',
    max_memory_restart: '1G',
    env: {
      NODE_ENV: 'production',
      CHECKPOINT_DIR: '/data/checkpoints'
    }
  }]
};

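Rationale: With autorestart: true, pm2 restarts the process whenever it exits unexpectedly, and max_restarts caps consecutive unstable restarts so a crash loop surfaces as an errored app instead of spinning forever. max_memory_restart restarts the agent once it crosses the memory threshold, before a leak can degrade the host. Centralized logging allows correlation of crashes with specific tool calls or model responses.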
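Supervision works best when the agent fails loudly rather than limping along in a broken state. The following is a minimal sketch of a fail-fast entrypoint hook, assuming a hypothetical installFailFast helper and a ProcessLike interface introduced only so the logic can be exercised with a fake; neither appears in the original setup. Recent Node versions already exit on uncaught errors by default, so the main value here is an explicit, logged, non-zero exit.

```typescript
// Minimal slice of `process` this sketch depends on (hypothetical interface).
interface ProcessLike {
  on(event: string, listener: (...args: any[]) => void): unknown;
  exit(code?: number): void;
}

// Hypothetical helper: log fatal errors and exit non-zero so the supervisor restarts us.
export function installFailFast(
  proc: ProcessLike,
  log: (msg: string) => void = console.error
): void {
  const die = (kind: string) => (err: unknown) => {
    const detail = err instanceof Error ? err.stack ?? err.message : String(err);
    log(`${kind}: ${detail}`);
    proc.exit(1); // non-zero exit marks this run as a failure
  };

  proc.on('uncaughtException', die('uncaughtException'));
  proc.on('unhandledRejection', die('unhandledRejection'));

  // A stop request from the supervisor is not a failure, so exit cleanly.
  proc.on('SIGTERM', () => {
    log('SIGTERM received, shutting down');
    proc.exit(0);
  });
}

// In the real entrypoint this would be called once at startup:
// installFailFast(process);
```

Keeping the exit code meaningful matters because supervisors such as systemd with Restart=on-failure decide whether to restart based on it.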

2. External State Persistence and Checkpointing

In-memory state is volatile. Upon restart, an agent must resume exactly where it left off. This requires persisting the conversation history, pending tasks, and partial tool outputs to durable storage after every significant step.

Implementation: A CheckpointManager serializes state to disk or a lightweight database. The agent loads the checkpoint on startup and saves after each tool execution.

import fs from 'fs/promises';
import path from 'path';

interface AgentState {
  taskId: string;
  messages: Array<{ role: string; content: string }>;
  pendingTools: string[];
  lastToolOutput?: string;
  timestamp: number;
}

export class CheckpointManager {
  private dir: string;

  constructor(dir: string) {
    this.dir = dir;
  }

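  async save(taskId: string, state: AgentState): Promise<void> {
    const filePath = path.join(this.dir, `${taskId}.json`);
    const tmpPath = `${filePath}.tmp`;
    await fs.mkdir(this.dir, { recursive: true });
    // Write to a temp file, then rename it into place. A crash mid-write
    // then leaves the previous checkpoint intact instead of truncated JSON.
    await fs.writeFile(tmpPath, JSON.stringify(state, null, 2));
    await fs.rename(tmpPath, filePath);
  }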

  async load(taskId: string): Promise<AgentState | null> {
    const filePath = path.join(this.dir, `${taskId}.json`);
    try {
      const data = await fs.readFile(filePath, 'utf-8');
      return JSON.parse(data);
    } catch {
      return null;
    }
  }

  async clear(taskId: string): Promise<void> {
    const filePath = path.join(this.dir, `${taskId}.json`);
    try {
      await fs.unlink(filePath);
    } catch {
      // Ignore if not exists
    }
  }
}

Rationale: Checkpointing after every tool call limits the blast radius of a crash to a single operation. The overhead is negligible compared to inference latency. Checkpointing does not make execution idempotent by itself: if the process dies after a tool ran but before the checkpoint was written, the agent will re-run that tool on restart. That re-run is only safe when the tool itself is idempotent.
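The resume behavior described above can be sketched as a loop that skips steps a previous run already finished. This is an illustration under stated assumptions, not the author's agent loop: StepState, its completedSteps field, runWithCheckpoints, and MemoryStore are hypothetical names, and the in-memory store only stands in for durable storage so the sketch runs on its own.

```typescript
// Hypothetical state shape: tracks which named steps have already completed.
interface StepState {
  taskId: string;
  completedSteps: string[];
  timestamp: number;
}

// Same load/save/clear shape as a checkpoint manager backed by disk or a database.
interface CheckpointStore {
  load(taskId: string): Promise<StepState | null>;
  save(taskId: string, state: StepState): Promise<void>;
  clear(taskId: string): Promise<void>;
}

// In-memory stand-in for the sketch only; it would NOT survive a real crash.
class MemoryStore implements CheckpointStore {
  private data = new Map<string, string>();
  async load(taskId: string): Promise<StepState | null> {
    const raw = this.data.get(taskId);
    return raw ? (JSON.parse(raw) as StepState) : null;
  }
  async save(taskId: string, state: StepState): Promise<void> {
    this.data.set(taskId, JSON.stringify(state));
  }
  async clear(taskId: string): Promise<void> {
    this.data.delete(taskId);
  }
}

type Step = { name: string; run: () => Promise<void> };

// Runs steps in order, skipping any that an earlier run already completed.
// Returns the names of the steps executed during this call.
async function runWithCheckpoints(
  taskId: string,
  steps: Step[],
  store: CheckpointStore
): Promise<string[]> {
  const state: StepState =
    (await store.load(taskId)) ?? { taskId, completedSteps: [], timestamp: Date.now() };
  const executed: string[] = [];

  for (const step of steps) {
    if (state.completedSteps.indexOf(step.name) !== -1) continue; // done before the crash
    await step.run();
    state.completedSteps.push(step.name);
    state.timestamp = Date.now();
    await store.save(taskId, state); // persist before moving to the next step
    executed.push(step.name);
  }

  await store.clear(taskId); // task finished, drop the checkpoint
  return executed;
}
```

If a step throws, the checkpoint from the last successful step remains in the store, so the next call picks up at the failed step rather than at the beginning.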

3. Bounded Execution and Circuit Breaking

Tool calls must never hang indefinitely. SDK defaults often allow timeouts that are too long for production SLAs. Wrap all external calls in explicit timeouts and implement circuit breakers to prevent thrashing on failing dependencies.

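Timeout Utility: Use Promise.race with a timer, plus an optional AbortSignal so callers can cancel the wait early. Note that the race only stops waiting; the underlying operation keeps running unless it is cancelled separately.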

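export async function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  signal?: AbortSignal
): Promise<T> {
  let cleanup = () => {};

  const timeoutPromise = new Promise<never>((_, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms);
    const onAbort = () => reject(new Error('Aborted'));
    if (signal?.aborted) onAbort(); // already cancelled before we started
    else signal?.addEventListener('abort', onAbort, { once: true });
    cleanup = () => {
      clearTimeout(timer);
      signal?.removeEventListener('abort', onAbort);
    };
  });

  try {
    return await Promise.race([promise, timeoutPromise]);
  } finally {
    // Release the timer and listener so a settled call
    // does not keep the event loop alive or leak handlers.
    cleanup();
  }
}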
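A race-based timeout stops the caller from waiting, but the slow request itself may keep running in the background. One way to also cancel the work, sketched here with a hypothetical callWithDeadline helper, is to hand the operation an AbortSignal and abort it when the deadline passes. This assumes the operation accepts a signal, as fetch does.

```typescript
// Hypothetical helper: runs `operation` with a deadline and aborts it on expiry.
export async function callWithDeadline<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the operation to stop
      reject(new Error(`Timeout after ${ms}ms`)); // stop waiting even if it ignores the signal
    }, ms);
  });

  try {
    return await Promise.race([operation(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// Example usage (the URL is a placeholder):
// const res = await callWithDeadline(
//   (signal) => fetch('https://example.com/api', { signal }),
//   5000
// );
```

The race is kept as a backstop: an operation that ignores the signal still cannot block the caller past the deadline.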

Circuit Breaker Pattern: Track consecutive failures. If a tool fails repeatedly, open the circuit to skip calls for a cooldown period, routing to a fallback or logging the skip.

export class CircuitBreaker {
  private failures: Map<string, number> = new Map();
  private openUntil: Map<string, number> = new Map();
  private threshold: number;
  private cooldownMs: number;

  constructor(threshold: number = 3, cooldownMs: number = 600_000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  isOpen(toolName: string): boolean {
    const openUntil = this.openUntil.get(toolName);
    if (!openUntil) return false;
    if (Date.now() < openUntil) return true;
    this.openUntil.delete(toolName);
    this.failures.delete(toolName);
    return false;
  }

  recordFailure(toolName: string): void {
    const count = (this.failures.get(toolName) || 0) + 1;
    this.failures.set(toolName, count);
    if (count >= this.threshold) {
      this.openUntil.set(toolName, Date.now() + this.cooldownMs);
      console.warn(`Circuit breaker OPEN for ${toolName}`);
    }
  }

  recordSuccess(toolName: string): void {
    this.failures.delete(toolName);
    this.openUntil.delete(toolName);
  }
}

Rationale: Timeouts prevent resource starvation. Circuit breakers protect the agent from wasting tokens and blocking the queue on known-bad endpoints. The cooldown period allows upstream services to recover.
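To show how a breaker fits into a tool call, here is a minimal sketch of a wrapper that checks the circuit first, records the outcome afterwards, and returns a fallback when the circuit is open. The names guardedCall, BreakerLike, and ToolResult are hypothetical; BreakerLike simply mirrors the isOpen, recordFailure, and recordSuccess methods of the breaker above.

```typescript
// Structural type matching the three breaker methods used by the wrapper.
interface BreakerLike {
  isOpen(toolName: string): boolean;
  recordFailure(toolName: string): void;
  recordSuccess(toolName: string): void;
}

// Hypothetical result type so callers can tell a real answer from a fallback.
type ToolResult<T> =
  | { status: 'ok'; value: T }
  | { status: 'failed'; error: Error }
  | { status: 'skipped'; value: T };

// Hypothetical wrapper: consult the breaker before calling, record the outcome after.
export async function guardedCall<T>(
  breaker: BreakerLike,
  toolName: string,
  call: () => Promise<T>,
  fallback: T
): Promise<ToolResult<T>> {
  if (breaker.isOpen(toolName)) {
    // Circuit is open: skip the call entirely and hand back the fallback.
    return { status: 'skipped', value: fallback };
  }
  try {
    const value = await call();
    breaker.recordSuccess(toolName);
    return { status: 'ok', value };
  } catch (err) {
    breaker.recordFailure(toolName);
    const error = err instanceof Error ? err : new Error(String(err));
    return { status: 'failed', error };
  }
}
```

Returning a tagged result instead of throwing lets the agent tell the model that a tool was skipped, rather than prompting it to retry the same call.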

Pitfall Guide

Pitfall	Explanation	Fix
In-Memory State Loss	Storing conversation history or task lists in process memory causes total state loss on crash.	Implement checkpointing to disk/DB after every tool call. Load state on startup.
Trusting SDK Defaults	LLM SDKs often have generous timeouts (e.g., 3 minutes) that are unsuitable for production.	Wrap all model and tool calls with explicit, context-aware timeouts.
Infinite Retry Loops	Retrying a broken tool indefinitely burns tokens and blocks other tasks.	Implement circuit breakers with cooldowns and fallback strategies.
Blocking Webhooks	Directly invoking agents from webhooks couples trigger latency to agent execution.	Decouple triggers using a message queue (Redis, SQS). Agents pull from the queue.
Ignoring OOM	Agents can leak memory over time, eventually crashing the host or other processes.	Set memory limits in the process manager and configure auto-restart on OOM.
Non-Idempotent Tools	Re-running a tool after a crash may duplicate side effects (e.g., sending duplicate emails).	Design tools to be idempotent or use deduplication keys based on checkpoint state.
Log Silos	Relying on agent stdout for logs loses data if the process crashes before flushing.	Use a process manager that captures stderr/stdout to persistent files.
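For the Non-Idempotent Tools row, a deduplication key can be derived from the task, the tool name, and its arguments. The sketch below is one possible approach under stated assumptions: dedupKey and runOnce are hypothetical helpers, and the seen map would need to be persisted alongside the checkpoint to survive a restart.

```typescript
import { createHash } from 'crypto';

// Derives a stable key from the task, the tool, and its arguments.
// Assumption: args serialize with a consistent property order.
export function dedupKey(taskId: string, toolName: string, args: unknown): string {
  return createHash('sha256')
    .update(JSON.stringify([taskId, toolName, args]))
    .digest('hex');
}

// Hypothetical helper: runs the side effect only if this key has not been seen.
// `seen` maps keys to earlier results; persist it with the checkpoint in practice.
export async function runOnce<T>(
  seen: Map<string, T>,
  key: string,
  effect: () => Promise<T>
): Promise<T> {
  if (seen.has(key)) {
    return seen.get(key) as T; // replay the recorded result, skip the side effect
  }
  const result = await effect();
  seen.set(key, result);
  return result;
}
```

A gap remains between running the effect and recording the key, so this narrows the duplicate window rather than closing it. Where the downstream API accepts an idempotency key, passing the same key there is the stronger guarantee.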

Best Practice: Classify errors to distinguish between transient failures (network blips, rate limits) and deterministic errors (invalid input, missing permissions). Transient errors should trigger retries with backoff; deterministic errors should fail fast to avoid token waste.
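The classification above can be sketched as a small predicate plus a bounded retry loop. Everything here is illustrative: isTransient and retryWithBackoff are hypothetical helpers, and the status and code fields are assumptions about how an HTTP client or Node reports errors, so the checks should be adapted to the actual SDK in use. Jitter is omitted for brevity.

```typescript
// Assumed error shape: an optional HTTP status and an optional Node error code.
interface StatusError {
  status?: number;
  code?: string;
}

// Transient: rate limits, server errors, and dropped or timed-out connections.
export function isTransient(err: unknown): boolean {
  const e = err as StatusError | null | undefined;
  if (e?.status === 429) return true;
  if (typeof e?.status === 'number' && e.status >= 500) return true;
  return e?.code === 'ETIMEDOUT' || e?.code === 'ECONNRESET';
}

// Retries transient failures with exponential backoff; anything else fails fast.
export async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isTransient(err) || attempt >= maxAttempts) throw err;
      const delay = baseDelayMs * Math.pow(2, attempt - 1); // 500, 1000, 2000, ...
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Because the attempt count is bounded, a persistently failing call still surfaces as an error, which is what lets a circuit breaker count it.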

Production Bundle

Action Checklist

Deploy agents under a process manager (systemd or pm2) with autorestart enabled.
Implement a CheckpointManager to persist state after every tool execution.
Wrap all external tool calls and model invocations with explicit timeouts.
Add circuit breakers to tools that interact with external APIs or services.
Decouple agent triggers from execution using a message queue.
Configure memory limits and auto-restart thresholds in the process manager.
Ensure all tools are idempotent or use deduplication logic to handle retries safely.
Set up log aggregation to monitor crash patterns and token usage trends.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Low Volume / Prototype	`pm2` with local SQLite checkpoints	Simple setup, low overhead, sufficient for single-agent workloads.	Minimal infrastructure cost.
High Volume / Multi-Agent	`systemd` + Redis Queue + Postgres Checkpoints	Scalable, robust, supports distributed state and high concurrency.	Higher infra cost, but reduces token waste and ops overhead.
Critical SLA / Enterprise	Managed Agent Platform + Kubernetes	Fully managed supervision, auto-scaling, and observability.	Premium cost, but eliminates operational burden and ensures uptime.
Flaky External Tools	Circuit Breaker + Fallback Logic	Prevents cascading failures and token burn on broken endpoints.	Reduces token spend by ~40% in failure scenarios.

Configuration Template

Below is a production-ready systemd service file for a TypeScript agent. This ensures the agent starts on boot, restarts on failure, and logs to journald.

# /etc/systemd/system/agent-worker.service
[Unit]
Description=AI Agent Worker Service
After=network.target
# Start rate limiting is a [Unit] setting: at most 10 starts per 60 seconds
StartLimitBurst=10
StartLimitIntervalSec=60

[Service]
Type=simple
User=agentuser
Group=agentuser
WorkingDirectory=/opt/agents/worker
ExecStart=/usr/bin/node dist/agent.js
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
SyslogIdentifier=agent-worker

# Resource Limits
LimitNOFILE=65536
MemoryMax=1G
MemoryHigh=800M

[Install]
WantedBy=multi-user.target

Usage:

Save the file to /etc/systemd/system/agent-worker.service.
Run sudo systemctl daemon-reload.
Enable and start: sudo systemctl enable --now agent-worker.
Monitor: journalctl -u agent-worker -f.

Quick Start Guide

Scaffold the Agent: Initialize a TypeScript project with tsc and install dependencies (pm2, redis, sqlite3).
Add Checkpointing: Integrate the CheckpointManager class. Modify the agent loop to save state after each tool call and load state on startup.
Implement Timeouts: Replace direct tool calls with withTimeout(toolCall(), 30000). Configure timeouts based on tool characteristics (e.g., 5s for HTTP, 120s for LLM).
Deploy with Supervision: Create a pm2 ecosystem file or systemd service. Configure autorestart, max_memory_restart, and log paths.
Verify Resilience: Simulate a crash by killing the process. Confirm the agent restarts automatically, loads the checkpoint, and resumes without data loss. Monitor logs for timeout and circuit breaker events.
