[3 Reliability Patterns That Stopped My AI Agent From Crashing Every 6 Hours]
Architecting Resilient AI Agent Workflows: Production Patterns for High Availability
Current Situation Analysis
Autonomous AI agents are rapidly moving from experimental demos to critical production workloads, handling tasks ranging from overnight data extraction and code triage to automated reporting and communication. However, the operational reality often diverges sharply from development environments. Agents deployed without robust infrastructure patterns exhibit high fragility, frequently succumbing to unhandled exceptions, hanging network requests, or memory exhaustion.
A pervasive misunderstanding among engineering teams is treating agents as simple scripts rather than long-running services. Developers often wrap agent logic in simple infinite loops, assuming the process will persist indefinitely. In practice, large language model (LLM) interactions introduce non-deterministic failure modes: malformed token streams that crash parsers, tool calls that hang indefinitely due to upstream latency, and gradual memory growth from context accumulation. Without supervision, a single crash halts the entire workflow. Scheduled triggers continue to fire and queue behind the dead process, leading to cascading delays and missed SLAs.
Production data from early adopters indicates that naive agent deployments can suffer uptime rates as low as 71%, with processes freezing every six to twelve hours. This instability forces engineering teams into a "babysitting" mode, manually restarting processes and investigating failures, which negates the automation value proposition. Furthermore, unbounded retry loops on failed tool calls can inflate inference costs by significant margins, as agents repeatedly attempt operations on broken endpoints without backoff or circuit breaking.
WOW Moment: Key Findings
Transitioning from script-based execution to a service-oriented architecture yields immediate, measurable improvements in reliability and cost efficiency. The following comparison highlights the impact of implementing process supervision, state persistence, and bounded execution.
| Metric | Naive Script Approach | Production-Hardened Architecture | Delta |
|---|---|---|---|
| Uptime | ~71% | 99.4% | +28.4 percentage points |
| Mean Time to Recovery | Hours (Manual Detection) | <30 Seconds (Automated) | ~99% Reduction |
| Token Efficiency | Baseline | -40% Spend | Significant Cost Savings |
| Operational Overhead | High (Manual Intervention) | Low (Self-Healing) | Drastic Reduction |
Why this matters: The shift to a hardened architecture decouples agent reliability from model stability. By externalizing state and enforcing boundaries, the system absorbs LLM and tool failures without data loss or process death. The reduction in token spend stems from eliminating infinite retry loops and preventing redundant work after crashes, directly improving the unit economics of agent operations.
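To make the uptime row concrete, the percentages can be converted into expected downtime per day. The snippet below is illustrative arithmetic only; the helper name downtimeMinutesPerDay is an assumption and does not appear in the original code.

```typescript
// Illustrative arithmetic: converts an uptime percentage into expected downtime per day.
// `downtimeMinutesPerDay` is a hypothetical helper name used only for this example.
export function downtimeMinutesPerDay(uptimePercent: number): number {
  if (uptimePercent < 0 || uptimePercent > 100) {
    throw new RangeError('uptimePercent must be between 0 and 100');
  }
  const minutesPerDay = 24 * 60;
  return ((100 - uptimePercent) / 100) * minutesPerDay;
}

// 71% uptime works out to roughly 418 minutes (about 7 hours) of downtime per day.
console.log(downtimeMinutesPerDay(71).toFixed(1));
// 99.4% uptime works out to roughly 8.6 minutes of downtime per day.
console.log(downtimeMinutesPerDay(99.4).toFixed(1));
```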
Core Solution
Building a resilient agent requires three foundational pillars: process supervision, external state persistence, and bounded execution with circuit breaking. The following implementation uses TypeScript to demonstrate these patterns, emphasizing type safety and modern async control flow.
1. Process Supervision via Service Managers
Never rely on while(true) loops for production agents. Instead, deploy the agent under a process manager like systemd or pm2. This ensures automatic restart on crash, log aggregation, and resource limits.
Architecture Decision: Use pm2 for Node/TypeScript environments to leverage its ecosystem configuration and built-in monitoring. This abstracts the restart logic and provides a unified view of agent health.
// ecosystem.config.js
module.exports = {
apps: [{
name: 'data-extractor-agent',
script: 'dist/agent.js',
instances: 1,
autorestart: true,
max_restarts: 10,
restart_delay: 5000,
error_file: '/var/log/agents/data-extractor.err.log',
out_file: '/var/log/agents/data-extractor.out.log',
max_memory_restart: '1G',
env: {
NODE_ENV: 'production',
CHECKPOINT_DIR: '/data/checkpoints'
}
}]
};
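Rationale: With autorestart: true, pm2 restarts the process whenever it exits or crashes, while max_restarts and restart_delay limit how aggressively a crash loop is retried. max_memory_restart restarts the agent before a leak can exhaust host memory. Writing stdout and stderr to fixed log files allows correlation of crashes with specific tool calls or model responses.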
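Supervision works best when the agent fails fast instead of limping along in a broken state. The sketch below shows one way to do that: on an unexpected error it flushes state and exits with a non-zero code so the supervisor can restart it, and on a stop signal it exits cleanly. The names installFailFastHandlers, ProcessLike, and flush are assumptions for illustration; the process object is passed in so the logic can be exercised with a fake.

```typescript
// Minimal shape of the parts of Node's `process` this sketch relies on (assumed interface).
interface ProcessLike {
  on(event: string, listener: (...args: unknown[]) => void): unknown;
  exit(code?: number): void;
}

// Hypothetical helper: registers handlers that flush state and then exit.
export function installFailFastHandlers(
  proc: ProcessLike,
  flush: () => Promise<void>,
  log: (msg: string) => void = console.error
): void {
  let exiting = false;

  const shutdown = async (code: number, reason: string): Promise<void> => {
    if (exiting) return; // ignore repeated signals while a shutdown is in progress
    exiting = true;
    log(`shutting down: ${reason}`);
    try {
      await flush(); // e.g. save a final checkpoint
    } catch (err) {
      log(`flush failed: ${String(err)}`);
    }
    proc.exit(code);
  };

  // Unexpected errors: exit non-zero so the process manager treats it as a crash.
  proc.on('uncaughtException', (err) => void shutdown(1, `uncaughtException: ${String(err)}`));
  proc.on('unhandledRejection', (err) => void shutdown(1, `unhandledRejection: ${String(err)}`));
  // Stop signals: exit zero after flushing.
  proc.on('SIGTERM', () => void shutdown(0, 'SIGTERM'));
  proc.on('SIGINT', () => void shutdown(0, 'SIGINT'));
}

// Usage in the agent entry point (assuming a saveCheckpoint function exists):
// installFailFastHandlers(process, () => saveCheckpoint());
```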
2. External State Persistence and Checkpointing
In-memory state is volatile. Upon restart, an agent must resume exactly where it left off. This requires persisting the conversation history, pending tasks, and partial tool outputs to durable storage after every significant step.
Implementation: A CheckpointManager serializes state to disk or a lightweight database. The agent loads the checkpoint on startup and saves after each tool execution.
import fs from 'fs/promises';
import path from 'path';
interface AgentState {
taskId: string;
messages: Array<{ role: string; content: string }>;
pendingTools: string[];
lastToolOutput?: string;
timestamp: number;
}
export class CheckpointManager {
private dir: string;
constructor(dir: string) {
this.dir = dir;
}
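async save(taskId: string, state: AgentState): Promise<void> {
const filePath = path.join(this.dir, `${taskId}.json`);
const tmpPath = `${filePath}.tmp`;
await fs.mkdir(this.dir, { recursive: true });
// Write to a temp file first, then rename, so a crash mid-write never leaves a truncated checkpoint.
await fs.writeFile(tmpPath, JSON.stringify(state, null, 2));
await fs.rename(tmpPath, filePath);
}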
async load(taskId: string): Promise<AgentState | null> {
const filePath = path.join(this.dir, `${taskId}.json`);
try {
const data = await fs.readFile(filePath, 'utf-8');
return JSON.parse(data);
} catch {
return null;
}
}
async clear(taskId: string): Promise<void> {
const filePath = path.join(this.dir, `${taskId}.json`);
try {
await fs.unlink(filePath);
} catch {
// Ignore if not exists
}
}
}
Rationale: Checkpointing after every tool call limits the blast radius of a crash to a single operation. The overhead is negligible compared to inference latency. Checkpointing alone does not make resumption safe, however: if the process dies after a tool runs but before the checkpoint is written, the agent will re-run that tool on restart, so this is only harmless when the tool itself is idempotent.
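A resume loop built on this idea might look like the following sketch. It assumes a store with the same load/save/clear shape as the CheckpointManager above, but uses its own simplified StepState type; runWithCheckpoints, StateStore, and Tool are hypothetical names introduced for illustration.

```typescript
// Simplified state for this sketch (not the AgentState interface from the article).
interface StepState {
  taskId: string;
  pendingTools: string[];
  completed: Array<{ tool: string; output: string }>;
}

// Assumed storage shape, mirroring load/save/clear on a checkpoint manager.
interface StateStore {
  load(taskId: string): Promise<StepState | null>;
  save(taskId: string, state: StepState): Promise<void>;
  clear(taskId: string): Promise<void>;
}

type Tool = (taskId: string) => Promise<string>;

export async function runWithCheckpoints(
  store: StateStore,
  taskId: string,
  plan: string[],
  tools: Record<string, Tool>
): Promise<StepState> {
  // Resume from the last checkpoint if one exists; otherwise start from the full plan.
  const state: StepState =
    (await store.load(taskId)) ?? { taskId, pendingTools: [...plan], completed: [] };

  while (state.pendingTools.length > 0) {
    const toolName = state.pendingTools[0];
    const tool = tools[toolName];
    if (!tool) throw new Error(`Unknown tool: ${toolName}`);

    const output = await tool(taskId);

    // Mark the step done only after it succeeded, then persist before moving on.
    state.pendingTools.shift();
    state.completed.push({ tool: toolName, output });
    await store.save(taskId, state);
  }

  // The task finished, so the checkpoint is no longer needed.
  await store.clear(taskId);
  return state;
}
```

If a tool throws, the error propagates and the process can exit; on the next start the loop picks up at the first tool that has not been recorded as completed.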
3. Bounded Execution and Circuit Breaking
Tool calls must never hang indefinitely. SDK defaults often allow timeouts that are too long for production SLAs. Wrap all external calls in explicit timeouts and implement circuit breakers to prevent thrashing on failing dependencies.
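Timeout Utility: Use Promise.race with an optional AbortSignal so callers stop waiting once a deadline passes or the call is cancelled. Note that the race only ends the wait; the underlying operation keeps running unless it is handed the same signal.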
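export async function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  signal?: AbortSignal
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let onAbort: (() => void) | undefined;
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms);
    if (signal) {
      onAbort = () => reject(new Error('Aborted'));
      // Handle a signal that was already aborted before this call.
      if (signal.aborted) onAbort();
      else signal.addEventListener('abort', onAbort, { once: true });
    }
  });
  try {
    return await Promise.race([promise, timeoutPromise]);
  } finally {
    // Always release the timer and listener so a settled call leaves nothing pending.
    clearTimeout(timer);
    if (signal && onAbort) signal.removeEventListener('abort', onAbort);
  }
}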
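Because withTimeout receives an already-started promise, it cannot stop the work behind it. One possible variant, sketched below under the assumption that the wrapped operation accepts an AbortSignal, creates the controller itself and aborts it when the deadline passes. The name callWithDeadline is hypothetical.

```typescript
// Hypothetical helper: runs an operation with a deadline and cancels it on timeout.
export async function callWithDeadline<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  ms: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Reject first so callers see the timeout error, then ask the operation to stop.
      reject(new Error(`Timeout after ${ms}ms`));
      controller.abort();
    }, ms);
  });

  try {
    return await Promise.race([operation(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // do not keep the timer alive after the call settles
  }
}

// Usage sketch: fetch accepts a signal, so the HTTP request is cancelled on timeout.
// The URL is a placeholder.
// const res = await callWithDeadline((signal) => fetch('https://example.com/health', { signal }), 5_000);
```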
Circuit Breaker Pattern: Track consecutive failures. If a tool fails repeatedly, open the circuit to skip calls for a cooldown period, routing to a fallback or logging the skip.
export class CircuitBreaker {
private failures: Map<string, number> = new Map();
private openUntil: Map<string, number> = new Map();
private threshold: number;
private cooldownMs: number;
constructor(threshold: number = 3, cooldownMs: number = 600_000) {
this.threshold = threshold;
this.cooldownMs = cooldownMs;
}
isOpen(toolName: string): boolean {
const openUntil = this.openUntil.get(toolName);
if (!openUntil) return false;
if (Date.now() < openUntil) return true;
this.openUntil.delete(toolName);
this.failures.delete(toolName);
return false;
}
recordFailure(toolName: string): void {
const count = (this.failures.get(toolName) || 0) + 1;
this.failures.set(toolName, count);
if (count >= this.threshold) {
this.openUntil.set(toolName, Date.now() + this.cooldownMs);
console.warn(`Circuit breaker OPEN for ${toolName}`);
}
}
recordSuccess(toolName: string): void {
this.failures.delete(toolName);
this.openUntil.delete(toolName);
}
}
Rationale: Timeouts prevent resource starvation. Circuit breakers protect the agent from wasting tokens and blocking the queue on known-bad endpoints. The cooldown period allows upstream services to recover.
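To show how the breaker fits around a tool call, here is a minimal wrapper sketch. It depends only on the isOpen, recordFailure, and recordSuccess methods shown above; the names callThroughBreaker and BreakerLike, and the fallback parameter, are assumptions for illustration.

```typescript
// Structural type matching the three methods of the CircuitBreaker class above.
interface BreakerLike {
  isOpen(toolName: string): boolean;
  recordFailure(toolName: string): void;
  recordSuccess(toolName: string): void;
}

// Hypothetical wrapper: skips the call while the circuit is open and records the outcome otherwise.
export async function callThroughBreaker<T>(
  breaker: BreakerLike,
  toolName: string,
  call: () => Promise<T>,
  fallback: () => T
): Promise<T> {
  if (breaker.isOpen(toolName)) {
    // Circuit is open: do not touch the failing dependency, use the fallback instead.
    return fallback();
  }
  try {
    const result = await call();
    breaker.recordSuccess(toolName);
    return result;
  } catch (err) {
    breaker.recordFailure(toolName);
    throw err; // let the caller decide whether to retry or give up
  }
}
```

Rethrowing after recordFailure keeps retry policy out of the wrapper, so the breaker only answers one question: is this tool worth calling right now.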
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| In-Memory State Loss | Storing conversation history or task lists in process memory causes total state loss on crash. | Implement checkpointing to disk/DB after every tool call. Load state on startup. |
| Trusting SDK Defaults | LLM SDKs often have generous timeouts (e.g., 3 minutes) that are unsuitable for production. | Wrap all model and tool calls with explicit, context-aware timeouts. |
| Infinite Retry Loops | Retrying a broken tool indefinitely burns tokens and blocks other tasks. | Implement circuit breakers with cooldowns and fallback strategies. |
| Blocking Webhooks | Directly invoking agents from webhooks couples trigger latency to agent execution. | Decouple triggers using a message queue (Redis, SQS). Agents pull from the queue. |
| Ignoring OOM | Agents can leak memory over time, eventually crashing the host or other processes. | Set memory limits in the process manager and configure auto-restart on OOM. |
| Non-Idempotent Tools | Re-running a tool after a crash may duplicate side effects (e.g., sending duplicate emails). | Design tools to be idempotent or use deduplication keys based on checkpoint state. |
| Log Silos | Relying on agent stdout for logs loses data if the process crashes before flushing. | Use a process manager that captures stderr/stdout to persistent files. |
Best Practice: Classify errors to distinguish between transient failures (network blips, rate limits) and deterministic errors (invalid input, missing permissions). Transient errors should trigger retries with backoff; deterministic errors should fail fast to avoid token waste.
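The classification rule above can be sketched as a retry helper that takes the classifier as a parameter. Everything here is an assumption for illustration: the names retryTransient and isTransientHttpError, the option names, and the idea that errors carry a numeric status field. Real SDK errors may be shaped differently.

```typescript
export interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  isTransient: (err: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Example classifier, assuming errors expose an HTTP-like `status` number.
export function isTransientHttpError(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) return false;
  const status = (err as { status?: unknown }).status;
  return status === 429 || (typeof status === 'number' && status >= 500);
}

export async function retryTransient<T>(
  fn: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Deterministic errors fail fast; transient ones retry until attempts run out.
      if (!opts.isTransient(err) || attempt >= opts.maxAttempts) throw err;
      // Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
      const delay = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (attempt - 1));
      await sleep(delay);
    }
  }
}
```

Adding random jitter to the delay is a common refinement when many agents retry against the same endpoint; it is left out here to keep the sketch short.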
Production Bundle
Action Checklist
- Deploy agents under a process manager (systemd or pm2) with autorestart enabled.
- Implement a CheckpointManager to persist state after every tool execution.
- Wrap all external tool calls and model invocations with explicit timeouts.
- Add circuit breakers to tools that interact with external APIs or services.
- Decouple agent triggers from execution using a message queue.
- Configure memory limits and auto-restart thresholds in the process manager.
- Ensure all tools are idempotent or use deduplication logic to handle retries safely.
- Set up log aggregation to monitor crash patterns and token usage trends.
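For the idempotency item, one possible approach is to derive a deduplication key from the checkpointed task and step, and to record each side effect under that key. The sketch below assumes a simple key-value store; in practice it would need to be durable (for example the same disk or database as the checkpoints) to survive a restart. The names dedupKey, runOnce, and KeyValueStore are hypothetical.

```typescript
// Assumed storage shape; a real deployment would back this with durable storage.
interface KeyValueStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

// Builds a stable key from values that are already part of the checkpoint.
export function dedupKey(taskId: string, stepIndex: number, toolName: string): string {
  return `${taskId}:${stepIndex}:${toolName}`;
}

// Runs the side effect only if no result is recorded for this key yet.
export async function runOnce(
  store: KeyValueStore,
  key: string,
  effect: () => Promise<string>
): Promise<string> {
  const previous = await store.get(key);
  if (previous !== undefined) {
    return previous; // already done before the crash, so reuse the recorded result
  }
  const result = await effect();
  await store.set(key, result);
  return result;
}
```

This narrows the duplicate window but does not close it: a crash between the effect and the store write can still repeat the effect, so passing the same key to the downstream service as its own idempotency key, where supported, is the stronger guarantee.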
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low Volume / Prototype | pm2 with local SQLite checkpoints | Simple setup, low overhead, sufficient for single-agent workloads. | Minimal infrastructure cost. |
| High Volume / Multi-Agent | systemd + Redis Queue + Postgres Checkpoints | Scalable, robust, supports distributed state and high concurrency. | Higher infra cost, but reduces token waste and ops overhead. |
| Critical SLA / Enterprise | Managed Agent Platform + Kubernetes | Fully managed supervision, auto-scaling, and observability. | Premium cost, but eliminates operational burden and ensures uptime. |
| Flaky External Tools | Circuit Breaker + Fallback Logic | Prevents cascading failures and token burn on broken endpoints. | Reduces token spend by ~40% in failure scenarios. |
Configuration Template
Below is a production-ready systemd service file for a TypeScript agent. This ensures the agent starts on boot, restarts on failure, and logs to journald.
# /etc/systemd/system/agent-worker.service
[Unit]
Description=AI Agent Worker Service
After=network.target
# Start rate limiting belongs in [Unit]: stop restarting after 10 starts within 60 seconds
StartLimitBurst=10
StartLimitIntervalSec=60
[Service]
Type=simple
User=agentuser
Group=agentuser
WorkingDirectory=/opt/agents/worker
ExecStart=/usr/bin/node dist/agent.js
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
SyslogIdentifier=agent-worker
# Resource Limits
LimitNOFILE=65536
MemoryMax=1G
MemoryHigh=800M
[Install]
WantedBy=multi-user.target
Usage:
- Save the file to /etc/systemd/system/agent-worker.service.
- Run sudo systemctl daemon-reload.
- Enable and start: sudo systemctl enable --now agent-worker.
- Monitor: journalctl -u agent-worker -f.
Quick Start Guide
- Scaffold the Agent: Initialize a TypeScript project with tsc and install dependencies (pm2, redis, sqlite3).
- Add Checkpointing: Integrate the CheckpointManager class. Modify the agent loop to save state after each tool call and load state on startup.
- Implement Timeouts: Replace direct tool calls with withTimeout(toolCall(), 30000). Configure timeouts based on tool characteristics (e.g., 5s for HTTP, 120s for LLM).
- Deploy with Supervision: Create a pm2 ecosystem file or systemd service. Configure autorestart, max_memory_restart, and log paths.
- Verify Resilience: Simulate a crash by killing the process. Confirm the agent restarts automatically, loads the checkpoint, and resumes without data loss. Monitor logs for timeout and circuit breaker events.