oal);
    await this.board.initialize(tasks);

    // Anchor evaluation hook to prevent drift
    this.board.setEvaluationHook((action: AgentAction) => {
      return this.validateAgainstGoal(action, goal);
    });
  }

  // One task per acceptance criterion, each with a bounded retry policy
  private decomposeIntoTasks(goal: WorkflowGoal): Task[] {
    return goal.acceptanceCriteria.map((criteria, index) => ({
      taskId: `TK-${goal.id}-${index + 1}`,
      status: 'todo',
      description: criteria,
      retryPolicy: { maxAttempts: 3, backoff: 'exponential' }
    }));
  }
}
```
**Architecture Rationale:** Explicit goal anchoring forces the runtime to evaluate every tool call against the original objective. If a subtask diverges, the evaluation hook intercepts it before execution. This eliminates the common failure mode where agents chase tangential optimizations or lose track of the primary deliverable.
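The source does not show how `validateAgainstGoal` decides whether an action diverges. As a minimal sketch, assume each action declares which acceptance criterion it serves; the `servesCriterion` field, the `HookVerdict` type, and the simplified `WorkflowGoal` and `AgentAction` shapes below are hypothetical and exist only for illustration.

```typescript
// Hypothetical shapes; the article does not define these types in full.
interface WorkflowGoal {
  id: string;
  acceptanceCriteria: string[];
}

interface AgentAction {
  tool: string;
  // Assumed field: the acceptance criterion this action claims to advance.
  servesCriterion?: string;
}

interface HookVerdict {
  allowed: boolean;
  reason: string;
}

// Reject any action that cannot be tied to one of the goal's acceptance criteria.
function validateAgainstGoal(action: AgentAction, goal: WorkflowGoal): HookVerdict {
  if (action.servesCriterion === undefined) {
    return { allowed: false, reason: `${action.tool}: no criterion declared` };
  }
  if (!goal.acceptanceCriteria.includes(action.servesCriterion)) {
    return { allowed: false, reason: `${action.tool}: criterion is not part of goal ${goal.id}` };
  }
  return { allowed: true, reason: 'matches goal' };
}

const goal: WorkflowGoal = {
  id: 'G1',
  acceptanceCriteria: ['all unit tests pass', 'public API documented'],
};

const onTrack = validateAgainstGoal(
  { tool: 'run_tests', servesCriterion: 'all unit tests pass' },
  goal,
);
const tangent = validateAgainstGoal(
  { tool: 'reformat_repo', servesCriterion: 'tidy whitespace' },
  goal,
);
console.log(onTrack.allowed, tangent.allowed);
```

A real hook would likely need a softer match than exact string equality, for example a scored comparison, but the interception point stays the same: the verdict is computed before the tool call runs.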
### Step 2: Subagent Delegation & Workspace Isolation
Parallel execution requires strict boundaries. Each child agent must operate within an isolated context, with a restricted toolset and a dedicated terminal session. This prevents file-state collisions and limits blast radius.
```typescript
interface SubAgentConfig {
  taskId: string;
  allowedTools: string[];
  workspaceMount: string;
  maxConcurrency: number;
}

class DelegationEngine {
  private activeAgents: Map<string, AgentSession> = new Map();

  async spawnSubAgents(configs: SubAgentConfig[]): Promise<void> {
    for (const config of configs) {
      // Fail loudly instead of silently exceeding the concurrency ceiling
      if (this.activeAgents.size >= config.maxConcurrency) {
        throw new Error('Concurrency limit reached');
      }
      const session = await this.createIsolatedSession(config);
      this.activeAgents.set(config.taskId, session);

      // Attach heartbeat monitor
      this.attachHeartbeat(config.taskId, session);
    }
  }

  // Each session gets pruned context, a restricted toolset, and its own terminal
  private async createIsolatedSession(config: SubAgentConfig): Promise<AgentSession> {
    return {
      context: this.pruneContextForTask(config.taskId),
      tools: config.allowedTools,
      terminal: await this.allocateTerminal(config.workspaceMount),
      state: 'active'
    };
  }
}
```
**Architecture Rationale:** Isolation is non-negotiable for parallel agents. Shared file systems or unrestricted tool access cause race conditions and unpredictable state mutations. By scoping tools per task and mounting dedicated workspaces, you ensure that a failure in one subagent cannot corrupt another's output.
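Tool scoping and workspace mounting only help if something enforces them at call time. The sketch below shows one way such a guard might look: a path check that keeps writes inside the mounted workspace and an allowlist check for tools. Both function names are hypothetical and are not part of the `DelegationEngine` shown above.

```typescript
import * as path from 'node:path';

// Hypothetical guard: true only if `target` resolves to a location inside `workspaceMount`.
function isInsideWorkspace(workspaceMount: string, target: string): boolean {
  const root = path.resolve(workspaceMount);
  const resolved = path.resolve(root, target);
  const relative = path.relative(root, resolved);
  // '' is the root itself; '..' or '../x' means the path escaped the mount.
  const escaped = relative === '..' || relative.startsWith('..' + path.sep);
  return !escaped && !path.isAbsolute(relative);
}

// Hypothetical allowlist check against the per-task tool scope.
function isToolAllowed(allowedTools: readonly string[], tool: string): boolean {
  return allowedTools.includes(tool);
}

const mount = '/work/task-1';
console.log(isInsideWorkspace(mount, 'src/index.ts'));      // inside the mount
console.log(isInsideWorkspace(mount, '../task-2/out.txt')); // sibling workspace
console.log(isToolAllowed(['read_file', 'write_file'], 'shell'));
```

This check works on path strings only. It does not follow symlinks, so a production guard would also need to resolve real paths or rely on OS-level sandboxing.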
### Step 3: Heartbeat Monitoring & Zombie Reclamation
Long-running tasks require active liveness checks. The system must detect hung processes, network partitions, or silent crashes, and automatically reclaim stalled work.
```typescript
class HeartbeatMonitor {
  // Active liveness timers, keyed by task ID
  private timeouts: Map<string, NodeJS.Timeout> = new Map();
  private readonly CHECK_INTERVAL_MS = 5000;

  attachHeartbeat(taskId: string, session: AgentSession): void {
    const interval = setInterval(async () => {
      const isAlive = await this.pingSession(session);
      if (!isAlive) {
        // Stop the timer first so a slow reclaim cannot trigger a second one
        clearInterval(interval);
        this.timeouts.delete(taskId);
        await this.handleZombie(taskId, session);
      }
    }, this.CHECK_INTERVAL_MS);
    this.timeouts.set(taskId, interval);
  }

  private async handleZombie(taskId: string, session: AgentSession): Promise<void> {
    console.warn(`[Zombie Detected] Task ${taskId} unresponsive. Reclaiming...`);
    session.state = 'reclaimed';
    await this.requeueTask(taskId, 'blocked');
  }
}
```
**Architecture Rationale:** Heartbeat monitoring transforms passive failure observation into active recovery. Without it, `in_progress` tasks become permanent dead weight. The monitor detects liveness gaps, transitions the task to `blocked`, and triggers requeue logic, ensuring no work silently vanishes.
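The detect-and-transition step can also be expressed as a pure sweep over the board, which is easier to test than timer callbacks. This is a sketch under assumptions: the `TrackedTask` shape, the `lastHeartbeatAt` field, and the `staleAfterMs` parameter are hypothetical and do not appear in the `HeartbeatMonitor` above.

```typescript
type TaskStatus = 'todo' | 'in_progress' | 'blocked' | 'done' | 'failed';

interface TrackedTask {
  taskId: string;
  status: TaskStatus;
  lastHeartbeatAt: number; // epoch milliseconds of the last successful ping (assumed field)
}

// Move every silent in_progress task to blocked and return the reclaimed IDs.
function reclaimStalled(tasks: TrackedTask[], now: number, staleAfterMs: number): string[] {
  const reclaimed: string[] = [];
  for (const task of tasks) {
    const silentForMs = now - task.lastHeartbeatAt;
    if (task.status === 'in_progress' && silentForMs > staleAfterMs) {
      task.status = 'blocked'; // visible on the board and eligible for requeue
      reclaimed.push(task.taskId);
    }
  }
  return reclaimed;
}

const board: TrackedTask[] = [
  { taskId: 'TK-1', status: 'in_progress', lastHeartbeatAt: 1000 },
  { taskId: 'TK-2', status: 'in_progress', lastHeartbeatAt: 19000 },
  { taskId: 'TK-3', status: 'done', lastHeartbeatAt: 0 },
];

const reclaimedIds = reclaimStalled(board, 20000, 15000);
console.log(reclaimedIds); // prints [ 'TK-1' ]
```

Only `TK-1` is reclaimed: it has been silent for 19 seconds against a 15 second limit, `TK-2` pinged one second ago, and `TK-3` is already finished.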
### Step 4: Checkpoint Management & Safe Rollback
File mutations during agent execution carry inherent risk. A checkpoint system that snapshots state before writes, prunes old snapshots, and enables instant rollback is essential for production safety.
```typescript
import { randomUUID } from 'node:crypto';

class CheckpointManager {
  private snapshots: Map<string, Snapshot> = new Map();
  private readonly MAX_RETENTION = 10;

  async createSnapshot(taskId: string, workspacePath: string): Promise<string> {
    const snapshotId = randomUUID();
    const snapshot: Snapshot = {
      id: snapshotId,
      taskId, // needed so pruneOldSnapshots can filter by task
      timestamp: Date.now(),
      path: workspacePath,
      diff: await this.captureDiff(workspacePath)
    };
    this.snapshots.set(snapshotId, snapshot);
    this.pruneOldSnapshots(taskId);
    return snapshotId;
  }

  async rollback(taskId: string, targetSnapshotId: string): Promise<void> {
    const target = this.snapshots.get(targetSnapshotId);
    if (!target) throw new Error('Snapshot not found');
    await this.restoreWorkspace(target.path, target.diff);
    console.info(`[Rollback] Restored ${taskId} to ${targetSnapshotId}`);
  }

  // Keep only the newest MAX_RETENTION snapshots for the given task
  private pruneOldSnapshots(taskId: string): void {
    const taskSnapshots = Array.from(this.snapshots.entries())
      .filter(([_, s]) => s.taskId === taskId)
      .sort((a, b) => b[1].timestamp - a[1].timestamp);
    if (taskSnapshots.length > this.MAX_RETENTION) {
      taskSnapshots.slice(this.MAX_RETENTION).forEach(([id]) => {
        this.snapshots.delete(id);
      });
    }
  }
}
```
**Architecture Rationale:** Checkpoints must be proactive, not reactive. By capturing diffs before mutations and enforcing retention limits, you prevent disk exhaustion while maintaining a clean recovery surface. The rollback operation becomes deterministic, eliminating the need to manually untangle partially written files.
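To make the snapshot-before-write idea concrete without touching a real file system, the sketch below models a workspace as an in-memory map of file paths to contents. `InMemoryCheckpoints` is a hypothetical stand-in, not the `CheckpointManager` above, and it copies whole workspaces rather than diffs.

```typescript
// In-memory stand-in for a workspace: file path -> contents.
type Workspace = Map<string, string>;

class InMemoryCheckpoints {
  private readonly snapshots: Workspace[] = [];

  // Copy the workspace before a mutation and return the snapshot's index.
  snapshot(workspace: Workspace): number {
    this.snapshots.push(new Map(workspace));
    return this.snapshots.length - 1;
  }

  // Replace the workspace contents with the snapshot taken at `index`.
  rollback(workspace: Workspace, index: number): void {
    const saved = this.snapshots[index];
    if (saved === undefined) throw new Error('Snapshot not found');
    workspace.clear();
    saved.forEach((contents, file) => workspace.set(file, contents));
  }
}

const workspace: Workspace = new Map([['config.json', '{"retries":3}']]);
const checkpoints = new InMemoryCheckpoints();

const beforeWrite = checkpoints.snapshot(workspace); // taken before any mutation
workspace.set('config.json', '{"retr');              // simulate a partially written file
workspace.set('tmp.partial', '');                    // simulate a stray temp file

checkpoints.rollback(workspace, beforeWrite);
console.log(Array.from(workspace.entries()));
```

Because the snapshot was taken before the first write, rollback restores the original file and removes the stray one in a single step, with no need to reason about which writes had completed.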
## Pitfall Guide
### 1. Unbounded Retry Loops
**Explanation:** Configuring infinite retries without escalation causes the system to waste compute on fundamentally broken tasks, masking underlying tool failures or prompt misconfigurations.
**Fix:** Implement exponential backoff with a hard ceiling (typically 3 to 5 attempts). After the threshold, transition the task to `failed` and trigger an alert or manual review queue.
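A bounded retry decision might look like the sketch below. The `maxDelayMs` cap and the `RetryDecision` type are assumptions added for illustration; the 5000 ms base delay and three-attempt ceiling mirror the values used elsewhere in this article.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number; // assumed cap on a single delay; not part of the article's policy object
}

type RetryDecision =
  | { kind: 'retry'; delayMs: number }
  | { kind: 'fail' };

// `attemptsMade` is the number of attempts already made (1 after the first failure).
function nextRetry(attemptsMade: number, policy: RetryPolicy): RetryDecision {
  if (attemptsMade >= policy.maxAttempts) {
    return { kind: 'fail' }; // hard ceiling reached: escalate instead of retrying
  }
  const exponential = policy.baseDelayMs * Math.pow(2, attemptsMade - 1);
  return { kind: 'retry', delayMs: Math.min(exponential, policy.maxDelayMs) };
}

const policy: RetryPolicy = { maxAttempts: 3, baseDelayMs: 5000, maxDelayMs: 60000 };

const afterFirstFailure = nextRetry(1, policy);  // retry after 5000 ms
const afterSecondFailure = nextRetry(2, policy); // retry after 10000 ms
const afterThirdFailure = nextRetry(3, policy);  // fail: ceiling reached
console.log(afterFirstFailure, afterSecondFailure, afterThirdFailure);
```

Returning an explicit `fail` decision, rather than throwing, keeps the escalation path (alerting, review queue) in the caller's hands.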
### 2. Shared Workspace Collisions
**Explanation:** Allowing multiple subagents to write to the same directory without isolation causes race conditions, overwritten files, and corrupted build artifacts.
**Fix:** Enforce strict workspace partitioning. Mount read-only copies of shared dependencies and allocate isolated write directories per task. Use file locks or atomic rename operations for cross-agent handoffs.
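For the atomic-rename handoff, one common pattern is to write to a temporary file in the destination directory and then rename it into place, so a reader sees either the old file or the complete new one. The sketch below uses Node's `fs` module; the `publishArtifact` helper and the temp-file naming scheme are hypothetical.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Write to a temporary file in the same directory, then rename it into place.
function publishArtifact(handoffDir: string, name: string, contents: string): string {
  const finalPath = path.join(handoffDir, name);
  const tempPath = path.join(handoffDir, `.${name}.${process.pid}.tmp`);
  fs.writeFileSync(tempPath, contents);
  // rename is atomic on POSIX file systems when both paths are on the same volume
  fs.renameSync(tempPath, finalPath);
  return finalPath;
}

// Demonstration in a throwaway directory under the OS temp folder.
const handoffDir = fs.mkdtempSync(path.join(os.tmpdir(), 'handoff-'));
const published = publishArtifact(handoffDir, 'report.json', '{"status":"ok"}');
console.log(fs.readFileSync(published, 'utf8'));
```

Keeping the temporary file in the same directory matters: a rename across file systems degrades to copy-and-delete, which is no longer atomic.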
### 3. Ignoring Heartbeat Drift
**Explanation:** Using static timeout values for all tasks ignores the reality that complex operations (e.g., large file parsing, model inference) naturally take longer. This triggers false zombie detections.
**Fix:** Implement adaptive heartbeat windows based on task complexity scoring. Allow subagents to request timeout extensions for known long-running operations, with a maximum extension cap.
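One possible shape for an adaptive window is sketched below. The scoring scale (a complexity score between 0 and 1) and the way it scales the base timeout are assumptions made for illustration; only the idea of a maximum extension factor comes from this article.

```typescript
// Compute the liveness window for a task.
// complexityScore: hypothetical score in [0, 1]; higher means slower expected steps.
// requestedExtensionMs: extra time the subagent asked for (0 if none).
// maxExtensionFactor: hard cap expressed as a multiple of the base timeout.
function effectiveTimeoutMs(
  baseTimeoutMs: number,
  complexityScore: number,
  requestedExtensionMs: number,
  maxExtensionFactor: number,
): number {
  const score = Math.min(Math.max(complexityScore, 0), 1);
  const scaled = baseTimeoutMs * (1 + score);
  const ceiling = baseTimeoutMs * maxExtensionFactor;
  return Math.min(scaled + Math.max(requestedExtensionMs, 0), ceiling);
}

const simpleTask = effectiveTimeoutMs(15000, 0, 0, 2.0);         // 15000
const heavyTask = effectiveTimeoutMs(15000, 0.5, 0, 2.0);        // 22500
const greedyRequest = effectiveTimeoutMs(15000, 0.5, 20000, 2.0); // capped at 30000
console.log(simpleTask, heavyTask, greedyRequest);
```

The cap keeps a misbehaving subagent from extending its own deadline indefinitely, which would reintroduce the zombie problem the heartbeat exists to solve.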
### 4. Checkpoint Accumulation
**Explanation:** Failing to prune old snapshots causes storage usage to grow linearly with execution time, eventually exhausting disk space (or memory, if snapshots are held in process) on long-running deployments.
**Fix:** Configure automatic retention policies (e.g., keep the last 10 snapshots per task). Schedule periodic cleanup jobs that archive older snapshots to cold storage if audit compliance requires them.
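When older snapshots must be archived rather than deleted, the retention step can return a plan instead of mutating state directly. The `planRetention` function and its types below are a hypothetical sketch; moving the archived entries to cold storage is left to the caller.

```typescript
interface SnapshotMeta {
  id: string;
  taskId: string;
  timestamp: number;
}

interface RetentionPlan {
  keep: SnapshotMeta[];    // newest snapshots, up to maxPerTask for each task
  archive: SnapshotMeta[]; // everything older, to be moved or deleted by the caller
}

function planRetention(snapshots: SnapshotMeta[], maxPerTask: number): RetentionPlan {
  // Group snapshots by task so the limit applies per task, not globally.
  const byTask = new Map<string, SnapshotMeta[]>();
  for (const snapshot of snapshots) {
    const group = byTask.get(snapshot.taskId) ?? [];
    group.push(snapshot);
    byTask.set(snapshot.taskId, group);
  }

  const plan: RetentionPlan = { keep: [], archive: [] };
  byTask.forEach((group) => {
    const newestFirst = group.slice().sort((a, b) => b.timestamp - a.timestamp);
    plan.keep.push(...newestFirst.slice(0, maxPerTask));
    plan.archive.push(...newestFirst.slice(maxPerTask));
  });
  return plan;
}

const history: SnapshotMeta[] = [
  { id: 's1', taskId: 'TK-1', timestamp: 100 },
  { id: 's2', taskId: 'TK-1', timestamp: 200 },
  { id: 's3', taskId: 'TK-1', timestamp: 300 },
  { id: 's4', taskId: 'TK-2', timestamp: 150 },
];

const retention = planRetention(history, 2);
console.log(retention.keep.map((s) => s.id), retention.archive.map((s) => s.id));
```

With a limit of two per task, `TK-1` keeps its two newest snapshots and sends `s1` to the archive list, while `TK-2` is untouched.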
### 5. Goal Drift in Extended Contexts
**Explanation:** Even with goal anchoring, agents can gradually optimize for subtask completion rather than the original objective, especially when context windows refresh or tools return noisy data.
**Fix:** Inject periodic evaluation hooks that score current progress against the original acceptance criteria. If the score drops below a threshold, force a context reset and re-anchor the goal before continuing.
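The article does not define how the drift score is computed. One simple assumption is to score the share of recent actions that map to an acceptance criterion and compare it with the threshold. The `RecentAction` shape and both function names below are hypothetical.

```typescript
interface RecentAction {
  tool: string;
  servesCriterion?: string; // assumed field: the criterion this action claims to advance
}

// Share of recent actions that map to one of the original acceptance criteria.
function alignmentScore(recent: RecentAction[], acceptanceCriteria: string[]): number {
  if (recent.length === 0) return 1; // nothing to judge yet
  const aligned = recent.filter(
    (action) =>
      action.servesCriterion !== undefined &&
      acceptanceCriteria.includes(action.servesCriterion),
  ).length;
  return aligned / recent.length;
}

// Re-anchor when the score falls strictly below the threshold.
function needsReanchor(score: number, driftThreshold: number): boolean {
  return score < driftThreshold;
}

const criteria = ['all unit tests pass', 'public API documented'];
const recentWindow: RecentAction[] = [
  { tool: 'run_tests', servesCriterion: 'all unit tests pass' },
  { tool: 'write_docs', servesCriterion: 'public API documented' },
  { tool: 'rename_variables' },
  { tool: 'tune_linter', servesCriterion: 'zero lint warnings' },
];

const driftScore = alignmentScore(recentWindow, criteria); // 2 of 4 aligned = 0.5
console.log(driftScore, needsReanchor(driftScore, 0.75));
```

Exact string matching is the crudest possible scorer. A production system would more plausibly use a model-based or rubric-based score, but the threshold comparison would work the same way.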
### 6. Silent Hallucination Acceptance
**Explanation:** Agents may claim a step is complete when the underlying tool call failed or returned empty data. Without verification, the board incorrectly marks the task `done`.
**Fix:** Enable output verification against task logs. Compare declared completion states with actual tool return payloads. Flag discrepancies for review instead of auto-advancing the board.
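A verification pass over the recorded tool results could be as small as the sketch below. The `ToolResult` log format, the `needs_review` verdict, and the emptiness rules are assumptions; the article does not specify how task logs are structured.

```typescript
interface ToolResult {
  tool: string;
  ok: boolean;      // whether the tool call itself succeeded
  payload: unknown; // whatever the tool returned
}

interface Verification {
  taskId: string;
  verdict: 'done' | 'needs_review';
  issues: string[];
}

function isEmptyPayload(payload: unknown): boolean {
  if (payload === null || payload === undefined) return true;
  if (typeof payload === 'string') return payload.trim() === '';
  if (Array.isArray(payload)) return payload.length === 0;
  if (typeof payload === 'object') return Object.keys(payload as object).length === 0;
  return false;
}

// Accept a completion claim only if every recorded tool call succeeded with data.
function verifyCompletion(taskId: string, results: ToolResult[]): Verification {
  const issues: string[] = [];
  if (results.length === 0) issues.push('no tool calls recorded');
  for (const result of results) {
    if (!result.ok) issues.push(`${result.tool} failed`);
    else if (isEmptyPayload(result.payload)) issues.push(`${result.tool} returned empty data`);
  }
  return { taskId, verdict: issues.length === 0 ? 'done' : 'needs_review', issues };
}

const verified = verifyCompletion('TK-1', [
  { tool: 'write_file', ok: true, payload: { bytes: 120 } },
]);
const suspicious = verifyCompletion('TK-2', [
  { tool: 'fetch_data', ok: true, payload: [] },
  { tool: 'build', ok: false, payload: null },
]);
console.log(verified.verdict, suspicious.verdict, suspicious.issues);
```

The second task would have been marked complete on the agent's word alone; checking the payloads routes it to review with two concrete reasons attached.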
### 7. Over-Delegation for Simple Tasks
**Explanation:** Spawning multiple subagents for straightforward, sequential tasks introduces unnecessary orchestration overhead, increasing latency and resource consumption.
**Fix:** Implement a complexity threshold. Use single-agent execution for workflows under 15 minutes or with linear dependencies. Reserve Kanban delegation for parallelizable, multi-hour, or high-risk operations.
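The threshold rule can be encoded as a small routing function. The `WorkflowProfile` fields are hypothetical, and giving high-risk work priority over the 15-minute cutoff is one interpretation of the guidance above, not something the article states.

```typescript
interface WorkflowProfile {
  estimatedMinutes: number;
  hasParallelBranches: boolean; // false means strictly linear dependencies
  highRisk: boolean;
}

type ExecutionMode = 'single-agent' | 'kanban-delegation';

function chooseExecutionMode(profile: WorkflowProfile): ExecutionMode {
  // Assumption: high-risk work always gets isolation and checkpoints, even if short.
  if (profile.highRisk) return 'kanban-delegation';
  // Short or strictly sequential work does not repay the orchestration overhead.
  if (profile.estimatedMinutes < 15 || !profile.hasParallelBranches) return 'single-agent';
  return 'kanban-delegation';
}

const quickFix = chooseExecutionMode({ estimatedMinutes: 10, hasParallelBranches: true, highRisk: false });
const featureBuild = chooseExecutionMode({ estimatedMinutes: 180, hasParallelBranches: true, highRisk: false });
const longLinearJob = chooseExecutionMode({ estimatedMinutes: 180, hasParallelBranches: false, highRisk: false });
const riskyHotfix = chooseExecutionMode({ estimatedMinutes: 5, hasParallelBranches: false, highRisk: true });
console.log(quickFix, featureBuild, longLinearJob, riskyHotfix);
```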
## Production Bundle
### Action Checklist
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single-file refactor (<10 min) | Single-agent execution | Lower orchestration overhead, faster completion | Minimal compute cost |
| Multi-module feature build (2-4 hours) | Kanban delegation with 3 subagents | Parallelism reduces wall-clock time, isolation prevents collisions | Moderate compute, higher reliability |
| Unattended overnight batch | Kanban + gateway auto-resume + checkpoints | Survives restarts, auto-reclaims zombies, enables rollback | Higher storage for checkpoints, lower manual intervention cost |
| High-risk production deployment | Kanban + strict tool scoping + audit logging | Guarantees state verification, prevents silent hallucinations | Increased logging/storage, significantly reduced rollback risk |
| Rapid prototyping / exploration | Ephemeral single-agent | Fast iteration, no state persistence required | Lowest cost, highest failure tolerance acceptable |
### Configuration Template
```yaml
# hermes-kanban-config.yaml
workflow:
  goal_anchoring:
    enabled: true
    evaluation_interval_ms: 30000
    drift_threshold: 0.75
  task_board:
    states: [todo, in_progress, blocked, done, failed]
    default_retry_policy:
      max_attempts: 3
      backoff_strategy: exponential
      base_delay_ms: 5000
  subagent_delegation:
    max_concurrency: 3
    workspace_isolation: true
    tool_scoping: strict
  heartbeat:
    check_interval_ms: 5000
    adaptive_timeout: true
    max_extension_factor: 2.0
  checkpointing:
    enabled: true
    snapshot_before_mutation: true
    retention_policy:
      max_per_task: 10
      cleanup_schedule: "0 */6 * * *"
    rollback:
      enabled: true
      requires_confirmation: false
gateway:
  auto_resume: true
  state_persistence: durable
  crash_recovery: automatic
```
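Before handing a configuration like this to the runtime, it can help to sanity-check the numeric fields. The sketch below validates a flattened subset of the template. The `KanbanSettings` interface, its camelCase names, and the specific bounds are assumptions for illustration and not a documented schema.

```typescript
// Hypothetical flattened view of the numeric settings in the template above.
interface KanbanSettings {
  driftThreshold: number;
  maxAttempts: number;
  baseDelayMs: number;
  maxConcurrency: number;
  checkIntervalMs: number;
  maxExtensionFactor: number;
  maxSnapshotsPerTask: number;
}

function isPositiveInteger(value: number): boolean {
  return Number.isInteger(value) && value >= 1;
}

// Returns a list of problems; an empty list means the settings passed these checks.
function validateSettings(s: KanbanSettings): string[] {
  const problems: string[] = [];
  if (!(s.driftThreshold > 0 && s.driftThreshold <= 1)) problems.push('driftThreshold must be in (0, 1]');
  if (!isPositiveInteger(s.maxAttempts)) problems.push('maxAttempts must be a positive integer');
  if (!(s.baseDelayMs > 0)) problems.push('baseDelayMs must be positive');
  if (!isPositiveInteger(s.maxConcurrency)) problems.push('maxConcurrency must be a positive integer');
  if (!(s.checkIntervalMs > 0)) problems.push('checkIntervalMs must be positive');
  if (!(s.maxExtensionFactor >= 1)) problems.push('maxExtensionFactor must be at least 1');
  if (!isPositiveInteger(s.maxSnapshotsPerTask)) problems.push('maxSnapshotsPerTask must be a positive integer');
  return problems;
}

// Values copied from the template.
const templateSettings: KanbanSettings = {
  driftThreshold: 0.75,
  maxAttempts: 3,
  baseDelayMs: 5000,
  maxConcurrency: 3,
  checkIntervalMs: 5000,
  maxExtensionFactor: 2.0,
  maxSnapshotsPerTask: 10,
};

const templateProblems = validateSettings(templateSettings);
const brokenProblems = validateSettings({ ...templateSettings, driftThreshold: 1.5, maxAttempts: 0 });
console.log(templateProblems, brokenProblems);
```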
### Quick Start Guide
- **Initialize the runtime:** Install the Hermes Agent v0.12 release and verify gateway auto-resume is enabled in your environment configuration.
- **Declare your objective:** Use the goal declaration interface to define acceptance criteria, timeout windows, and retry policies. The system will automatically decompose the objective into trackable tasks.
- **Configure isolation boundaries:** Set workspace mounts, restrict tool access per subagent, and enable checkpointing before any file mutations occur.
- **Execute and monitor:** Launch the workflow. Use the board status endpoint to track state transitions, heartbeat liveness, and checkpoint availability. Intervene only when tasks transition to `blocked` or `failed`.
- **Validate and roll back if needed:** Compare final outputs against acceptance criteria. If state corruption occurs, invoke the rollback interface with the latest checkpoint ID to restore a clean workspace instantly.
This architecture transforms AI agent execution from a fragile, single-pass experiment into a production-grade automation layer. By treating state as a first-class citizen, enforcing isolation, and building explicit recovery paths, you eliminate the most common failure modes that plague long-running AI workflows. The result is a system that finishes what it starts, survives what it cannot prevent, and provides a verifiable record of every decision made along the way.