Building a Multi-Agent System from 23 Open-Source Projects — What Worked, What Broke

Zero-Infrastructure Multi-Agent Orchestration: A File-Based Handoff Architecture

Current Situation Analysis

The industry is currently chasing context window inflation as the primary solution to agent limitations. Developers assume that providing a single agent with 200K, 1M, or even 2M tokens will solve complex workflow challenges. This assumption is flawed. While larger context windows delay the inevitable, they do not eliminate the Single-Agent Ceiling.

A single agent operating within one context window faces inherent cognitive fragmentation. When tasked with a pipeline requiring research, architectural planning, code generation, debugging, and deployment, the agent must maintain state across all these distinct domains simultaneously. As the context fills, attention mechanisms dilute. The agent begins to lose fidelity on earlier instructions, hallucinates corrections for previously solved bugs, or fails to maintain the thread of a long-running deployment sequence. Increasing the token limit merely raises the ceiling; it does not remove the structural constraint of a monolithic processing unit.

The logical progression is multi-agent orchestration, where specialized agents hand off tasks to one another. However, the barrier to entry for multi-agent systems is disproportionately high. Most existing frameworks require a stack of external infrastructure: a message broker (Redis, RabbitMQ, Kafka), a shared database for state persistence, and often cloud-hosted vector stores. This introduces three critical failure modes for local and privacy-sensitive applications:

Infrastructure Friction: Developers must provision, secure, and maintain distributed systems just to test agent logic.
Data Egress Risks: Cloud-hosted brokers and databases move agent traffic off the host, which can conflict with strict privacy requirements.
Cost Complexity: Running a full message queue and vector database stack incurs recurring costs that are unjustified for local development or edge deployments.

The industry has overlooked a fundamental insight: Cross-process communication does not require a server. By leveraging the file system as a shared state medium, teams can build robust multi-agent architectures with zero external dependencies, eliminating infrastructure overhead while preserving data locality.

WOW Moment: Key Findings

The shift from broker-based to file-based orchestration fundamentally alters the cost, complexity, and privacy profile of multi-agent systems. The following comparison highlights the operational delta between a traditional cloud-native stack and a zero-infrastructure file bus architecture.

Metric	Broker-Based Stack (Redis/Kafka)	File-Based Bus Architecture
Setup Time	30–60 minutes (Provisioning, Config, Auth)	< 2 minutes (Directory creation)
Runtime Cost	$50–$200+/month (Managed services)	$0 (Local disk I/O)
Data Privacy	Risk of egress; requires encryption at rest/transit	100% Local; data never leaves the host
Failure Modes	Network partitions, broker crashes, auth token expiry	Disk full, file locks, permission errors
Observability	Requires external logging/metrics pipelines	Inspectable via standard text editors
Scalability	High (Horizontal scaling supported)	Moderate (Single-machine bound)

Why This Matters: The file-based approach lowers the barrier to multi-agent development. Engineers can prototype complex handoff patterns without provisioning infrastructure. For privacy-critical applications, such as local LLM inference or proprietary code analysis, agent communication stays within the host environment. The trade-off is clear: you give up horizontal scalability across machines in exchange for simplicity, no service cost, and data locality. For local, single-machine AI workflows, that trade-off is usually a good one.

Core Solution

The architecture replaces the message broker with a Shared State File. Agents communicate by reading and writing JSON payloads to a designated file on disk. This file acts as the single source of truth for task distribution and result aggregation.

Architecture Rationale

Decoupling via File System: Agents do not need to know each other's process IDs or network addresses. They only need access to the shared file path. This allows agents to be written in different languages, run in separate containers, or execute as distinct OS processes.
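Atomicity via Rename: To prevent readers from seeing a half-written file, agents write the new state to a temporary file in the same directory and rename it over the target path. On POSIX systems a rename within one file system is atomic, so a reader sees either the old file or the new one, never a partial write. Rename does not serialize the read-modify-write cycle, however: two agents that read the same state and then both rename will silently drop one update, so writers also need a lock (for example, an exclusively created lock file).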
Polling with Backoff: Since there is no push notification mechanism, agents poll the file. To prevent CPU saturation, the implementation must include exponential backoff or file-watching mechanisms.

Implementation (TypeScript)

The following implementation sketches a file bus in TypeScript. It includes atomic writes, TTL enforcement, schema validation, and recipient-based filtering.

```typescript
import fs from 'fs/promises';
import path from 'path';
import { v4 as uuidv4 } from 'uuid';
import { z } from 'zod';

// Schema definition for runtime validation and type inference
const AgentMessageSchema = z.object({
  id: z.string().uuid(),
  sender: z.string(),
  recipient: z.string(),
  type: z.enum(['TASK', 'RESULT', 'ACK']),
  payload: z.unknown(),
  ttl: z.number().positive(), // seconds
  timestamp: z.number(), // epoch milliseconds
  status: z.enum(['PENDING', 'DELIVERED', 'FAILED']).default('PENDING'),
});

export type AgentMessage = z.infer<typeof AgentMessageSchema>;

export class SharedStateBus {
  private busPath: string;
  private lockPath: string;

  constructor(busPath: string) {
    this.busPath = path.resolve(busPath);
    this.lockPath = `${this.busPath}.lock`;
  }

  /**
   * Runs fn while holding an exclusive lock file, so read-modify-write
   * cycles from different agents never interleave.
   * A crashed holder leaves the lock file behind; remove it manually
   * or add a staleness check before relying on this in long-running setups.
   */
  private async withLock<T>(fn: () => Promise<T>): Promise<T> {
    await fs.mkdir(path.dirname(this.busPath), { recursive: true });
    for (let attempt = 0; ; attempt++) {
      try {
        // 'wx' fails with EEXIST if another agent already holds the lock
        await (await fs.open(this.lockPath, 'wx')).close();
        break;
      } catch (err) {
        const code = (err as NodeJS.ErrnoException).code;
        if (code !== 'EEXIST' || attempt >= 200) throw err;
        await new Promise((r) => setTimeout(r, 25)); // retry for up to ~5 s
      }
    }
    try {
      return await fn();
    } finally {
      await fs.rm(this.lockPath, { force: true });
    }
  }

  /** Reads and validates the bus file. A missing file is an empty bus. */
  private async readState(): Promise<AgentMessage[]> {
    try {
      const raw = await fs.readFile(this.busPath, 'utf-8');
      return z.array(AgentMessageSchema).parse(JSON.parse(raw));
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code === 'ENOENT') return [];
      throw err; // surface bad JSON and schema violations instead of hiding them
    }
  }

  /** Atomic write: write to a temp file, then rename over the bus file. */
  private async writeState(state: AgentMessage[]): Promise<void> {
    const tempPath = `${this.busPath}.tmp.${uuidv4()}`;
    await fs.writeFile(tempPath, JSON.stringify(state, null, 2), 'utf-8');
    await fs.rename(tempPath, this.busPath);
  }

  /**
   * Posts a message to the bus.
   * Validates the message, then appends it under the lock.
   */
  async post(message: AgentMessage): Promise<void> {
    const validated = AgentMessageSchema.parse(message);
    await this.withLock(async () => {
      const state = await this.readState();
      state.push(validated);
      await this.writeState(state);
    });
  }

  /**
   * Fetches pending messages for a specific recipient.
   * Drops expired messages and marks the returned ones as DELIVERED.
   */
  async fetch(recipient: string): Promise<AgentMessage[]> {
    return this.withLock(async () => {
      const state = await this.readState();
      const now = Date.now();
      const validMessages: AgentMessage[] = [];
      const messagesForRecipient: AgentMessage[] = [];

      for (const msg of state) {
        // Enforce TTL (ttl is in seconds, timestamp in milliseconds)
        if (now - msg.timestamp > msg.ttl * 1000) {
          continue; // Drop expired messages
        }
        if (msg.recipient === recipient && msg.status === 'PENDING') {
          msg.status = 'DELIVERED';
          messagesForRecipient.push(msg);
        }
        validMessages.push(msg);
      }

      // Persist state changes (cleanup + status update)
      await this.writeState(validMessages);
      return messagesForRecipient;
    });
  }

  /**
   * Utility to purge all messages. Useful for testing or reset.
   */
  async purge(): Promise<void> {
    await this.withLock(() => this.writeState([]));
  }
}
```

Usage Example: Orchestrator and Worker Handoff

This example shows how an orchestrator dispatches a task to a worker and retrieves the result.

```typescript
import { v4 as uuidv4 } from 'uuid';
import { SharedStateBus, AgentMessage } from './shared-state-bus';

// Initialize buses pointing to the same file
const BUS_FILE = './data/agent_bus.json';
const orchestratorBus = new SharedStateBus(BUS_FILE);
const workerBus = new SharedStateBus(BUS_FILE);

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runOrchestrator(): Promise<void> {
  console.log('Orchestrator: Dispatching task...');

  const task: AgentMessage = {
    id: uuidv4(),
    sender: 'orchestrator',
    recipient: 'code-worker',
    type: 'TASK',
    payload: {
      action: 'analyze',
      target: 'src/core.ts',
      instructions: 'Check for race conditions in async handlers.',
    },
    ttl: 120, // 2 minutes TTL
    timestamp: Date.now(),
    status: 'PENDING',
  };

  await orchestratorBus.post(task);
  console.log('Orchestrator: Task posted. Waiting for result...');

  // Poll for the result. Awaiting each fetch keeps polls from overlapping,
  // which an async callback inside setInterval would not guarantee.
  for (;;) {
    await sleep(1000);
    const results = await orchestratorBus.fetch('orchestrator');
    if (results.length > 0) {
      console.log('Orchestrator: Received result:', results[0].payload);
      return;
    }
  }
}

async function runWorker(): Promise<void> {
  console.log('Worker: Starting...');

  // Work loop: runs until the process is stopped
  for (;;) {
    await sleep(500);
    const tasks = await workerBus.fetch('code-worker');

    // Handle every fetched task; fetch has already marked them DELIVERED
    for (const task of tasks) {
      console.log(`Worker: Processing task ${task.id}...`);

      // Simulate processing delay
      await sleep(2000);

      // Post result
      const result: AgentMessage = {
        id: uuidv4(),
        sender: 'code-worker',
        recipient: 'orchestrator',
        type: 'RESULT',
        payload: {
          findings: 'Potential race condition detected in line 42.',
          severity: 'HIGH',
        },
        ttl: 60,
        timestamp: Date.now(),
        status: 'PENDING',
      };

      await workerBus.post(result);
      console.log('Worker: Result posted.');
    }
  }
}

// Execute (both agents share one process here for demonstration)
runOrchestrator().catch(console.error);
runWorker().catch(console.error);
```

Architecture Decisions

Zod Validation: The bus enforces schema validation on every post. This prevents malformed messages from corrupting the state file, a common failure mode in loosely typed systems.
Atomic Renames: The post method writes to a temporary file and renames it. This ensures that if an agent crashes mid-write, the bus file remains valid. Readers never encounter partial JSON.
TTL Enforcement: The fetch method automatically removes messages older than their TTL. This prevents the bus file from growing indefinitely and ensures that stale tasks do not block the system.
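Status Tracking: Messages track their status (PENDING, DELIVERED). The fetch operation marks messages as DELIVERED in the same read-modify-write cycle that returns them, so multiple agents can poll the same file without duplicating work, provided that cycle is serialized with a lock. The atomic rename alone does not stop two agents from reading the same PENDING message. Note that DELIVERED means handed to a recipient, not processed: a worker that crashes mid-task leaves the message DELIVERED until its TTL expires.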
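Atomic renames protect readers from partial JSON, but a rename by itself says nothing about whether the bytes reached the disk before a power loss. Below is a minimal sketch of a more durable variant, assuming Node's synchronous fs API. The helper name `atomicWriteJsonSync` is hypothetical and is not part of the bus class above.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';
import { randomUUID } from 'crypto';

// Hypothetical helper: write JSON to a temp file, flush it, then rename it into place.
export function atomicWriteJsonSync(targetPath: string, data: unknown): void {
  // The temp file lives next to the target so the rename stays on one file system.
  const tempPath = `${targetPath}.tmp.${randomUUID()}`;
  const fd = fs.openSync(tempPath, 'w');
  try {
    fs.writeSync(fd, JSON.stringify(data, null, 2));
    fs.fsyncSync(fd); // ask the OS to flush the file contents before the rename
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tempPath, targetPath);
}

// Usage: write twice to a scratch directory and read the final state back.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'bus-demo-'));
const demoFile = path.join(demoDir, 'agent_bus.json');
atomicWriteJsonSync(demoFile, [{ id: 1 }]);
atomicWriteJsonSync(demoFile, [{ id: 1 }, { id: 2 }]);
const demoState = JSON.parse(fs.readFileSync(demoFile, 'utf-8')) as unknown[];
console.log(demoState.length);
```

Whether the extra `fsync` is worth its latency depends on how costly a lost bus file would be. For short-lived task queues it may be unnecessary.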

Pitfall Guide

Implementing file-based orchestration requires careful handling of concurrency and state management. The following pitfalls are derived from production experience with shared-state architectures.

1. Race Conditions on Concurrent Writes

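Explanation: If two agents write to the bus file at the same time, two things can go wrong. Without atomic writes, a reader may see interleaved or partial JSON. Even with atomic writes, both agents may read the same state, append their own message, and rename in turn, so the second write silently discards the first (a lost update). Fix: Write to a temporary file, then rename. A rename within one file system is atomic on POSIX; on Windows, a rename over an existing file can fail while another process has that file open, so be prepared to retry. Hold a lock around the whole read-modify-write cycle to prevent lost updates. Never append directly to the file.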
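The difference between a torn file and a lost update is easy to miss. The sketch below models the bus file as an in-memory string, an assumption made purely so the interleaving is deterministic. Every write produces complete, valid JSON, yet one message still disappears. All names are hypothetical.

```typescript
// In-memory stand-in for the bus file; each assignment models one atomic rename.
let busFile = '[]';

const readBus = (): string[] => JSON.parse(busFile) as string[];
const writeBus = (state: string[]): void => {
  busFile = JSON.stringify(state);
};

// Unserialized: both agents read the same snapshot before either one writes.
const snapshotA = readBus();
const snapshotB = readBus();
writeBus([...snapshotA, 'task-from-A']);
writeBus([...snapshotB, 'task-from-B']); // replaces A's write entirely
const unserialized = readBus();

// Serialized: each agent finishes its read-modify-write before the next starts,
// which is the ordering a lock around the whole cycle enforces.
busFile = '[]';
for (const msg of ['task-from-A', 'task-from-B']) {
  writeBus([...readBus(), msg]);
}
const serialized = readBus();

console.log(unserialized); // [ 'task-from-B' ]
console.log(serialized); // [ 'task-from-A', 'task-from-B' ]
```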

2. Polling Storms and CPU Saturation

Explanation: Agents polling the bus file in a tight loop (e.g., every 10ms) can consume excessive CPU resources, especially on systems with many agents. Fix: Implement exponential backoff for polling. Start with a short interval and increase it if no messages are found. Alternatively, use a file-watching library (e.g., chokidar) to trigger reads only when the file changes.
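One way to sketch the backoff described above, assuming a doubling schedule capped at a maximum delay. The function names and the 100 ms / 5000 ms defaults are illustrative choices, not values from the original design.

```typescript
// Hypothetical helper: delay before the next poll, doubling after each empty poll.
export function nextPollDelay(emptyPolls: number, baseMs = 100, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * 2 ** emptyPolls);
}

// Polls until shouldStop() returns true. A non-empty poll is handled immediately
// and resets the backoff; an empty poll waits before trying again.
export async function pollWithBackoff<T>(
  poll: () => Promise<T[]>,
  handle: (items: T[]) => Promise<void>,
  shouldStop: () => boolean,
): Promise<void> {
  let emptyPolls = 0;
  while (!shouldStop()) {
    const items = await poll();
    if (items.length > 0) {
      emptyPolls = 0;
      await handle(items);
    } else {
      const delay = nextPollDelay(emptyPolls);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
      emptyPolls += 1;
    }
  }
}

const schedule = [0, 1, 2, 3, 10].map((n) => nextPollDelay(n));
console.log(schedule); // [ 100, 200, 400, 800, 5000 ]
```

A worker could pass `() => bus.fetch('code-worker')` as `poll`. The cap bounds worst-case handoff latency, so it should be chosen from the latency the workflow can tolerate rather than from CPU budget alone.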

3. Zombie Messages and State Bloat

Explanation: If an agent crashes before processing a task, or if TTLs are not enforced, the bus file can accumulate thousands of stale messages. This degrades performance and increases memory usage during reads. Fix: Enforce strict TTLs on all messages. Implement a background cleanup routine that periodically purges expired messages. The fetch method should always filter out expired entries.
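A cleanup routine can be sketched as a pure function that splits messages into live and expired sets. Returning the expired set, rather than dropping it silently, lets the caller log or archive zombie tasks. The sketch assumes the same units as the bus implementation (TTL in seconds, timestamps in epoch milliseconds); the names are hypothetical.

```typescript
// Minimal message shape for this sketch.
interface TimedMessage {
  id: string;
  ttl: number; // seconds
  timestamp: number; // epoch milliseconds
}

// Hypothetical helper: split messages into live and expired at a given instant.
export function partitionByExpiry<T extends TimedMessage>(
  messages: T[],
  nowMs: number,
): { live: T[]; expired: T[] } {
  const live: T[] = [];
  const expired: T[] = [];
  for (const msg of messages) {
    const ageMs = nowMs - msg.timestamp;
    (ageMs > msg.ttl * 1000 ? expired : live).push(msg);
  }
  return { live, expired };
}

const now = 1_000_000;
const sample: TimedMessage[] = [
  { id: 'fresh', ttl: 120, timestamp: now - 30_000 }, // 30 s old, 120 s TTL
  { id: 'stale', ttl: 60, timestamp: now - 90_000 }, // 90 s old, 60 s TTL
];
const { live, expired } = partitionByExpiry(sample, now);
console.log(live.map((m) => m.id), expired.map((m) => m.id)); // [ 'fresh' ] [ 'stale' ]
```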

4. Schema Drift and Serialization Errors

Explanation: As agents evolve, the structure of the payload may change. If an older agent writes a payload that a newer agent cannot parse, the system may fail silently or crash. Fix: Use schema validation (e.g., Zod, JSON Schema) on both post and fetch. Include a version field in the message structure to allow agents to handle legacy payloads gracefully.
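A version field only helps if readers act on it. The sketch below assumes two hypothetical payload versions (v1 with a `file` field, v2 with `target` and `action`) and upgrades older payloads on read. It uses plain type guards so it runs without dependencies; a schema library could play the same role.

```typescript
// Hypothetical payload versions: v1 used `file`; v2 renamed it to `target` and added `action`.
interface PayloadV1 {
  version: 1;
  file: string;
}
interface PayloadV2 {
  version: 2;
  target: string;
  action: string;
}
type AnyPayload = PayloadV1 | PayloadV2;

// Type guard that checks the fields each known version requires.
function isAnyPayload(value: unknown): value is AnyPayload {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  if (v.version === 1) return typeof v.file === 'string';
  if (v.version === 2) return typeof v.target === 'string' && typeof v.action === 'string';
  return false;
}

// Upgrade any known version to the latest shape; reject everything else loudly.
export function upgradePayload(raw: unknown): PayloadV2 {
  if (!isAnyPayload(raw)) {
    throw new Error('Unsupported or malformed payload');
  }
  if (raw.version === 1) {
    // 'analyze' is an assumed default for payloads that predate the action field
    return { version: 2, target: raw.file, action: 'analyze' };
  }
  return raw;
}

const upgraded = upgradePayload({ version: 1, file: 'src/core.ts' });
console.log(upgraded.target, upgraded.action); // src/core.ts analyze
```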

5. Disk I/O Bottlenecks

Explanation: On systems with slow storage (e.g., network mounts or mechanical HDDs), frequent reads and writes can introduce latency, slowing down agent handoffs. Fix: Ensure the bus file resides on fast local storage (SSD/NVMe). For high-throughput scenarios, consider an in-memory simulation for testing, but validate performance on the target storage medium.

6. Security and Injection Risks

Explanation: If the payload contains executable commands or user input, malicious agents could inject harmful data. Additionally, file permissions may allow unauthorized processes to read sensitive agent communications. Fix: Sanitize all inputs within the payload. Restrict file permissions on the bus file to the specific user or group running the agents. Avoid executing raw payload data without validation.
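Restricting permissions can be sketched with Node's synchronous fs API. The helper name is hypothetical, and the modes shown (0700 for the directory, 0600 for the file) assume a single OS user runs every agent.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Hypothetical helper: create the bus directory and file so only the owning user can access them.
export function createPrivateBusFile(busPath: string): void {
  const dir = path.dirname(busPath);
  fs.mkdirSync(dir, { recursive: true, mode: 0o700 });
  if (!fs.existsSync(busPath)) {
    fs.writeFileSync(busPath, '[]', { mode: 0o600 });
  }
  // chmod explicitly in case either path already existed with looser modes.
  // On Windows, Node can only change the write permission, so treat this as a POSIX safeguard.
  fs.chmodSync(dir, 0o700);
  fs.chmodSync(busPath, 0o600);
}

const scratch = fs.mkdtempSync(path.join(os.tmpdir(), 'bus-perm-'));
const privateBus = path.join(scratch, 'data', 'agent_bus.json');
createPrivateBusFile(privateBus);
const fileMode = fs.statSync(privateBus).mode & 0o777;
console.log(fileMode.toString(8));
```

Because every atomic write replaces the bus file with a freshly created temp file, the file's own mode is reset on each rename. The directory permissions are what persist, unless the temp file is also created with a restrictive mode.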

7. State Desynchronization

Explanation: If the bus file is deleted, moved, or corrupted by an external process, agents may lose synchronization and fail to communicate. Fix: Implement health checks that verify the existence and validity of the bus file, and create backup copies periodically. File locks only coordinate cooperating agents; they are advisory on most systems and will not stop an unrelated process from modifying the file, so rely on file permissions for that.
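A health check along these lines might look as follows. This is a sketch under stated assumptions: "healthy" is taken to mean the file exists and parses as a JSON array, and the function names and backup path are hypothetical.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

type BusHealth = 'ok' | 'missing' | 'corrupt';

// Hypothetical health check: the bus is healthy if it exists and parses as a JSON array.
export function checkBusHealth(busPath: string): BusHealth {
  let raw: string;
  try {
    raw = fs.readFileSync(busPath, 'utf-8');
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === 'ENOENT') return 'missing';
    throw err; // permission errors and similar should not be masked
  }
  try {
    return Array.isArray(JSON.parse(raw)) ? 'ok' : 'corrupt';
  } catch {
    return 'corrupt';
  }
}

// Hypothetical recovery step: restore from a healthy backup if one exists, else start empty.
export function repairBus(busPath: string, backupPath: string): void {
  if (checkBusHealth(busPath) === 'ok') return;
  if (checkBusHealth(backupPath) === 'ok') {
    fs.copyFileSync(backupPath, busPath);
  } else {
    fs.writeFileSync(busPath, '[]');
  }
}

const healthDir = fs.mkdtempSync(path.join(os.tmpdir(), 'bus-health-'));
const healthBus = path.join(healthDir, 'agent_bus.json');
const before = checkBusHealth(healthBus);
fs.writeFileSync(healthBus, '{"truncated": ');
const damaged = checkBusHealth(healthBus);
repairBus(healthBus, path.join(healthDir, 'agent_bus.backup.json'));
const after = checkBusHealth(healthBus);
console.log(before, damaged, after); // missing corrupt ok
```

Starting from an empty bus discards in-flight messages, so an orchestrator that relies on this recovery path should be prepared to re-dispatch tasks.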

Production Bundle

Action Checklist

Implement Atomic Writes: Ensure every write uses the temp-file-rename pattern to prevent corruption, and serialize each read-modify-write cycle with a lock so concurrent updates are not lost.
Enforce TTLs: Set appropriate time-to-live values for all messages and implement cleanup logic in fetch.
Add Schema Validation: Use a validation library to enforce message structure on every read and write.
Configure Polling Intervals: Set polling intervals based on latency requirements; use backoff or file watchers to reduce CPU load.
Secure File Permissions: Restrict access to the bus file directory to authorized agents only.
Monitor Disk Usage: Track the size of the bus file and implement rotation or purging strategies if it grows too large.
Test Concurrency: Simulate multiple agents writing simultaneously to verify atomicity and race condition handling.
Define Agent Roles: Establish a clear registry of sender/recipient names to prevent misrouted messages.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Local Development / Prototyping	File-Based Bus	Instant setup, zero dependencies, easy debugging.	$0
Privacy-First Local LLM Apps	File-Based Bus	Data never leaves the host; no network egress.	$0
Single-Machine Multi-Agent Workflow	File-Based Bus	Sufficient throughput; simpler than Redis/Kafka.	$0
Multi-Region / Distributed Agents	Broker-Based (Redis/Kafka)	File bus cannot span machines; requires network transport.	$50–$200+/mo
High-Throughput Enterprise Pipeline	Broker-Based	File I/O becomes bottleneck; brokers handle scale better.	$50–$200+/mo
Edge Device with Limited Resources	File-Based Bus	Minimal memory footprint; no external service overhead.	$0

Configuration Template

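Use this JSON configuration to define the bus path, polling intervals, and TTL defaults for your agents. The SharedStateBus class shown earlier does not read this file itself; each agent is expected to load it and pass the values along.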

```json
{
  "bus": {
    "path": "./data/agent_bus.json",
    "maxFileSizeMB": 10,
    "cleanupIntervalSeconds": 60
  },
  "agents": {
    "orchestrator": {
      "role": "sender",
      "pollIntervalMs": 1000,
      "defaultTTLSeconds": 120
    },
    "code-worker": {
      "role": "receiver",
      "pollIntervalMs": 500,
      "defaultTTLSeconds": 60
    }
  },
  "security": {
    "filePermissions": "0600",
    "validateSchema": true
  }
}
```
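A loader for this template could be sketched as below. The interfaces mirror the JSON above, while `parseBusSettings` and `toFileMode` are hypothetical helpers with deliberately minimal validation.

```typescript
// Shapes mirror the JSON template; the loader itself is a hypothetical helper.
interface AgentSettings {
  role: string;
  pollIntervalMs: number;
  defaultTTLSeconds: number;
}

interface BusSettings {
  bus: { path: string; maxFileSizeMB: number; cleanupIntervalSeconds: number };
  agents: Record<string, AgentSettings>;
  security: { filePermissions: string; validateSchema: boolean };
}

const isPositive = (n: unknown): n is number => typeof n === 'number' && n > 0;

export function parseBusSettings(json: string): BusSettings {
  const cfg = JSON.parse(json) as BusSettings;
  if (!cfg || !cfg.bus || typeof cfg.bus.path !== 'string' || cfg.bus.path === '') {
    throw new Error('bus.path must be a non-empty string');
  }
  for (const name of Object.keys(cfg.agents || {})) {
    const agent = cfg.agents[name];
    if (!isPositive(agent.pollIntervalMs) || !isPositive(agent.defaultTTLSeconds)) {
      throw new Error(`agents.${name}: intervals and TTLs must be positive numbers`);
    }
  }
  if (!cfg.security || !/^0?[0-7]{3}$/.test(cfg.security.filePermissions)) {
    throw new Error('security.filePermissions must be an octal string such as "0600"');
  }
  return cfg;
}

// Convert the octal string from the config into the numeric mode that chmod expects.
export const toFileMode = (octal: string): number => parseInt(octal, 8);

const settings = parseBusSettings(`{
  "bus": { "path": "./data/agent_bus.json", "maxFileSizeMB": 10, "cleanupIntervalSeconds": 60 },
  "agents": { "code-worker": { "role": "receiver", "pollIntervalMs": 500, "defaultTTLSeconds": 60 } },
  "security": { "filePermissions": "0600", "validateSchema": true }
}`);
const workerPollMs = settings.agents['code-worker'].pollIntervalMs;
const mode = toFileMode(settings.security.filePermissions);
console.log(workerPollMs, mode === 0o600); // 500 true
```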

Quick Start Guide

Initialize the Project: Create a new directory and install dependencies:

```
mkdir multi-agent-bus && cd multi-agent-bus
npm init -y
npm install zod uuid
npm install --save-dev typescript ts-node @types/node
npx tsc --init
```

Create the Bus Implementation: Save the SharedStateBus class code from the Core Solution section into src/shared-state-bus.ts, the module path the usage example imports from.
Define Your Agents: Create src/orchestrator.ts and src/worker.ts using the usage examples. Configure the bus path to point to the same file.
Run the System: Start each agent in its own terminal window:
```
npx ts-node src/orchestrator.ts
npx ts-node src/worker.ts
```
Observe the console output as the orchestrator dispatches a task and the worker returns the result.
Inspect the State: Open data/agent_bus.json in a text editor to view the raw message history. This provides immediate visibility into agent communication without external tools.
