Building a Multi-Agent System from 23 Open-Source Projects — What Worked, What Broke
Zero-Infrastructure Multi-Agent Orchestration: A File-Based Handoff Architecture
Current Situation Analysis
The industry is currently chasing context window inflation as the primary solution to agent limitations. Developers assume that providing a single agent with 200K, 1M, or even 2M tokens will solve complex workflow challenges. This assumption is flawed. While larger context windows delay the inevitable, they do not eliminate the Single-Agent Ceiling.
A single agent operating within one context window faces inherent cognitive fragmentation. When tasked with a pipeline requiring research, architectural planning, code generation, debugging, and deployment, the agent must maintain state across all these distinct domains simultaneously. As the context fills, attention mechanisms dilute. The agent begins to lose fidelity on earlier instructions, hallucinates corrections for previously solved bugs, or fails to maintain the thread of a long-running deployment sequence. Increasing the token limit merely raises the ceiling; it does not remove the structural constraint of a monolithic processing unit.
The logical progression is multi-agent orchestration, where specialized agents hand off tasks to one another. However, the barrier to entry for multi-agent systems is disproportionately high. Most existing frameworks require a stack of external infrastructure: a message broker (Redis, RabbitMQ, Kafka), a shared database for state persistence, and often cloud-hosted vector stores. This introduces three critical failure modes for local and privacy-sensitive applications:
- Infrastructure Friction: Developers must provision, secure, and maintain distributed systems just to test agent logic.
- Data Egress Risks: Cloud-based brokers and databases create unavoidable data leakage points, violating strict privacy requirements.
- Cost Complexity: Running a full message queue and vector database stack incurs recurring costs that are unjustified for local development or edge deployments.
The industry has overlooked a fundamental insight: Cross-process communication does not require a server. By leveraging the file system as a shared state medium, teams can build robust multi-agent architectures with zero external dependencies, eliminating infrastructure overhead while preserving data locality.
WOW Moment: Key Findings
The shift from broker-based to file-based orchestration fundamentally alters the cost, complexity, and privacy profile of multi-agent systems. The following comparison highlights the operational delta between a traditional cloud-native stack and a zero-infrastructure file bus architecture.
| Metric | Broker-Based Stack (Redis/Kafka) | File-Based Bus Architecture |
|---|---|---|
| Setup Time | 30–60 minutes (Provisioning, Config, Auth) | < 2 minutes (Directory creation) |
| Runtime Cost | $50–$200+/month (Managed services) | $0 (Local disk I/O) |
| Data Privacy | Risk of egress; requires encryption at rest/transit | 100% Local; data never leaves the host |
| Failure Modes | Network partitions, broker crashes, auth token expiry | Disk full, file locks, permission errors |
| Observability | Requires external logging/metrics pipelines | Inspectable via standard text editors |
| Scalability | High (Horizontal scaling supported) | Moderate (Single-machine bound) |
Why This Matters: The file-based approach lowers the barrier to multi-agent development. It allows engineers to prototype complex handoff patterns without provisioning infrastructure. For privacy-critical applications, such as local LLM inference or proprietary code analysis, this architecture keeps agent communication entirely within the host environment. The trade-off is clear: you sacrifice horizontal scalability across machines in exchange for simplicity, zero service cost, and data locality. For local, single-machine AI workflows, this trade-off is usually the right one.
Core Solution
The architecture replaces the message broker with a Shared State File. Agents communicate by reading and writing JSON payloads to a designated file on disk. This file acts as the single source of truth for task distribution and result aggregation.
Architecture Rationale
- Decoupling via File System: Agents do not need to know each other's process IDs or network addresses. They only need access to the shared file path. This allows agents to be written in different languages, run in separate containers, or execute as distinct OS processes.
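- Atomicity via Rename: To prevent corruption during concurrent writes, the architecture uses an atomic write pattern. Agents write to a temporary file in the same directory and rename it to the target path. On POSIX systems, a rename within a single filesystem is atomic, so readers never see a partially written file. Rename alone does not merge concurrent updates: each write replaces the whole file, so writers still need to be serialized.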
- Polling with Backoff: Since there is no push notification mechanism, agents poll the file. To prevent CPU saturation, the implementation must include exponential backoff or file-watching mechanisms.
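The write-then-rename pattern described above can be sketched as a small standalone helper. This is a minimal illustration, not the article's bus class: the name `writeJsonAtomic` and the temp-file naming scheme are assumptions, and it uses only Node's built-in modules.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";
import { randomUUID } from "node:crypto";

// Hypothetical helper: write JSON so readers see either the old file or the new one,
// never a half-written file.
export async function writeJsonAtomic(targetPath: string, data: unknown): Promise<void> {
  const dir = path.dirname(targetPath);
  await fs.mkdir(dir, { recursive: true });
  // Keep the temp file in the same directory so the rename stays on one filesystem.
  const tempPath = path.join(dir, `.${path.basename(targetPath)}.${randomUUID()}.tmp`);
  try {
    await fs.writeFile(tempPath, JSON.stringify(data, null, 2), "utf-8");
    await fs.rename(tempPath, targetPath);
  } catch (err) {
    // Best-effort cleanup of the temp file if the write or the rename failed.
    await fs.rm(tempPath, { force: true });
    throw err;
  }
}
```

Placing the temp file next to the target matters: a rename across filesystems is not a single atomic step, so a temp file in a system-wide temp directory would weaken the guarantee.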
Implementation (TypeScript)
The following implementation sketches a file bus with atomic writes, TTL enforcement, schema validation on post, and recipient-based filtering. It does not serialize the read-modify-write cycle across processes (lockPath is declared but never used), so concurrent posts can still lose updates; treat it as a starting point rather than a finished production component.
import fs from 'fs/promises';
import path from 'path';
import { v4 as uuidv4 } from 'uuid';
import { z } from 'zod';
// Schema Definition for Type Safety
const AgentMessageSchema = z.object({
id: z.string().uuid(),
sender: z.string(),
recipient: z.string(),
type: z.enum(['TASK', 'RESULT', 'ACK']),
payload: z.unknown(),
ttl: z.number().positive(),
timestamp: z.number(),
status: z.enum(['PENDING', 'DELIVERED', 'FAILED']).default('PENDING'),
});
export type AgentMessage = z.infer<typeof AgentMessageSchema>;
export class SharedStateBus {
private busPath: string;
private lockPath: string;
constructor(busPath: string) {
this.busPath = path.resolve(busPath);
this.lockPath = `${this.busPath}.lock`;
this.ensureDirectory();
}
private async ensureDirectory(): Promise<void> {
const dir = path.dirname(this.busPath);
await fs.mkdir(dir, { recursive: true });
}
/**
* Posts a message to the bus using atomic write.
* Writes to a temp file and renames to prevent corruption.
*/
async post(message: AgentMessage): Promise<void> {
// Validate message structure
AgentMessageSchema.parse(message);
// Read existing state
let state: AgentMessage[] = [];
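try {
const raw = await fs.readFile(this.busPath, 'utf-8');
state = JSON.parse(raw);
} catch (err) {
// Only a missing file means "empty bus". Rethrow anything else (e.g. invalid JSON)
// so a damaged bus file is not silently overwritten.
if ((err as NodeJS.ErrnoException).code !== 'ENOENT') throw err;
}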
// Append new message
state.push(message);
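// The constructor cannot await ensureDirectory(), so make sure the directory exists before writing
await this.ensureDirectory();
// Atomic write: Write to temp, then rename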
const tempPath = `${this.busPath}.tmp.${uuidv4()}`;
await fs.writeFile(tempPath, JSON.stringify(state, null, 2), 'utf-8');
await fs.rename(tempPath, this.busPath);
}
/**
* Fetches messages for a specific recipient.
* Automatically filters expired messages and marks delivered.
*/
async fetch(recipient: string): Promise<AgentMessage[]> {
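let state: AgentMessage[] = [];
try {
state = JSON.parse(await fs.readFile(this.busPath, 'utf-8'));
} catch (err) {
// A missing bus file just means no messages have been posted yet
if ((err as NodeJS.ErrnoException).code === 'ENOENT') return [];
throw err;
}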
const now = Date.now();
// Filter logic
const validMessages: AgentMessage[] = [];
const messagesForRecipient: AgentMessage[] = [];
for (const msg of state) {
// Enforce TTL
if (now - msg.timestamp > msg.ttl * 1000) {
continue; // Drop expired messages
}
if (msg.recipient === recipient && msg.status === 'PENDING') {
messagesForRecipient.push(msg);
msg.status = 'DELIVERED';
}
validMessages.push(msg);
}
// Persist state changes (cleanup + status update)
const tempPath = `${this.busPath}.tmp.${uuidv4()}`;
await fs.writeFile(tempPath, JSON.stringify(validMessages, null, 2), 'utf-8');
await fs.rename(tempPath, this.busPath);
return messagesForRecipient;
}
/**
* Utility to purge all messages. Useful for testing or reset.
*/
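async purge(): Promise<void> {
// Write an empty array directly: passing [] through post() would fail schema validation
const tempPath = `${this.busPath}.tmp.${uuidv4()}`;
await fs.writeFile(tempPath, '[]', 'utf-8');
await fs.rename(tempPath, this.busPath);
}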
}
Usage Example: Orchestrator and Worker Handoff
This example shows how an orchestrator dispatches a task to a worker and retrieves the result.
import { v4 as uuidv4 } from 'uuid';
import { SharedStateBus, AgentMessage } from './shared-state-bus';
// Initialize buses pointing to the same file
const BUS_FILE = './data/agent_bus.json';
const orchestratorBus = new SharedStateBus(BUS_FILE);
const workerBus = new SharedStateBus(BUS_FILE);
async function runOrchestrator() {
console.log('Orchestrator: Dispatching task...');
const task: AgentMessage = {
id: uuidv4(),
sender: 'orchestrator',
recipient: 'code-worker',
type: 'TASK',
payload: {
action: 'analyze',
target: 'src/core.ts',
instructions: 'Check for race conditions in async handlers.'
},
ttl: 120, // 2 minutes TTL
timestamp: Date.now(),
status: 'PENDING',
};
await orchestratorBus.post(task);
console.log('Orchestrator: Task posted. Waiting for result...');
// Poll for result
const pollInterval = setInterval(async () => {
const results = await orchestratorBus.fetch('orchestrator');
if (results.length > 0) {
clearInterval(pollInterval);
console.log('Orchestrator: Received result:', results[0].payload);
}
}, 1000);
}
async function runWorker() {
console.log('Worker: Starting...');
// Simulate work loop
setInterval(async () => {
const tasks = await workerBus.fetch('code-worker');
if (tasks.length > 0) {
const task = tasks[0];
console.log(`Worker: Processing task ${task.id}...`);
// Simulate processing delay
await new Promise(r => setTimeout(r, 2000));
// Post result
const result: AgentMessage = {
id: uuidv4(),
sender: 'code-worker',
recipient: 'orchestrator',
type: 'RESULT',
payload: {
findings: 'Potential race condition detected in line 42.',
severity: 'HIGH'
},
ttl: 60,
timestamp: Date.now(),
status: 'PENDING',
};
await workerBus.post(result);
console.log('Worker: Result posted.');
}
}, 500);
}
// Execute
runOrchestrator();
runWorker();
Architecture Decisions
- Zod Validation: The bus enforces schema validation on every post. This prevents malformed messages from corrupting the state file, a common failure mode in loosely typed systems.
- Atomic Renames: The post method writes to a temporary file and renames it. This ensures that if an agent crashes mid-write, the bus file remains valid. Readers never encounter partial JSON.
- TTL Enforcement: The fetch method automatically removes messages older than their TTL. This prevents the bus file from growing indefinitely and ensures that stale tasks do not block the system.
- Status Tracking: Messages track their status (PENDING, DELIVERED). This lets multiple agents poll the same file without picking up work that has already been handed out. Note that fetch is a read-modify-write cycle, not an atomic operation: two agents fetching for the same recipient at the same moment can both receive a message unless access is serialized with a lock.
Pitfall Guide
Implementing file-based orchestration requires careful handling of concurrency and state management. The following pitfalls are derived from production experience with shared-state architectures.
1. Race Conditions on Concurrent Writes
Explanation: If two agents attempt to write to the bus file simultaneously without atomic operations, one write may overwrite the other, or the file may become corrupted with interleaved data.
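Fix: Always use the atomic write pattern: write to a temporary file in the same directory, then rename it over the target. POSIX guarantees that rename is atomic within a single filesystem; on Windows, replacing a file that another process holds open can fail, so be prepared to retry. Never append directly to the file. Atomic rename only prevents torn files: because post and fetch read, modify, and rewrite the whole file, two simultaneous writers can still overwrite each other's changes unless the cycle is guarded by a lock.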
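One way to serialize writers is an exclusive lock file. The sketch below is an assumption about how such a guard could look, not part of the original bus class: `withFileLock`, its retry defaults, and the lock path are hypothetical. It relies on the `wx` open flag, which fails with `EEXIST` when the file already exists.

```typescript
import { promises as fs } from "node:fs";

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run `fn` while holding an exclusive lock file.
export async function withFileLock<T>(
  lockPath: string,
  fn: () => Promise<T>,
  retries = 200,
  delayMs = 10,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      // "wx" creates the file only if it does not exist, so one caller wins.
      const handle = await fs.open(lockPath, "wx");
      await handle.close();
      break;
    } catch (err) {
      const code = (err as NodeJS.ErrnoException).code;
      if (code !== "EEXIST" || attempt >= retries) throw err;
      await sleep(delayMs); // someone else holds the lock; wait and retry
    }
  }
  try {
    return await fn();
  } finally {
    // Release the lock even if `fn` throws.
    await fs.rm(lockPath, { force: true });
  }
}
```

A caller would wrap the whole read-modify-write cycle, for example `withFileLock(busPath + ".lock", () => bus.post(message))`. One known gap in this sketch: if a process crashes while holding the lock, the lock file stays behind and must be removed manually or expired by age.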
2. Polling Storms and CPU Saturation
Explanation: Agents polling the bus file in a tight loop (e.g., every 10ms) can consume excessive CPU resources, especially on systems with many agents.
Fix: Implement exponential backoff for polling. Start with a short interval and increase it if no messages are found. Alternatively, use a file-watching library (e.g., chokidar) to trigger reads only when the file changes.
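A minimal sketch of polling with exponential backoff follows. The helper name `pollWithBackoff` and its option names and defaults are assumptions chosen for illustration; `fetchOnce` stands in for any call that returns pending messages, such as the bus's fetch method.

```typescript
// Hypothetical helper: poll `fetchOnce` until it returns at least one item,
// doubling the wait after every empty poll, capped at `maxDelayMs`.
export async function pollWithBackoff<T>(
  fetchOnce: () => Promise<T[]>,
  opts: { initialDelayMs?: number; maxDelayMs?: number; timeoutMs?: number } = {},
): Promise<T[]> {
  const { initialDelayMs = 100, maxDelayMs = 5000, timeoutMs = 60000 } = opts;
  const deadline = Date.now() + timeoutMs;
  let delay = initialDelayMs;
  while (Date.now() < deadline) {
    const items = await fetchOnce();
    if (items.length > 0) return items;
    await new Promise<void>((resolve) => setTimeout(resolve, delay));
    delay = Math.min(delay * 2, maxDelayMs);
  }
  return []; // timed out without receiving anything
}
```

Usage could look like `const results = await pollWithBackoff(() => bus.fetch('orchestrator'))`. Because each poll is awaited before the next one starts, this loop also avoids the overlapping callbacks that a `setInterval` with an async handler can produce when a poll takes longer than the interval.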
3. Zombie Messages and State Bloat
Explanation: If an agent crashes before processing a task, or if TTLs are not enforced, the bus file can accumulate thousands of stale messages. This degrades performance and increases memory usage during reads.
Fix: Enforce strict TTLs on all messages. Implement a background cleanup routine that periodically purges expired messages. The fetch method should always filter out expired entries.
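The cleanup step can be isolated as a pure function, which makes the expiry rule easy to test without touching the disk. This sketch assumes the same convention as the bus code (timestamp in epoch milliseconds, ttl in seconds); the name `partitionByTtl` and the `TimedMessage` shape are hypothetical.

```typescript
// Assumed minimal shape; the real message type carries more fields.
interface TimedMessage {
  timestamp: number; // epoch milliseconds
  ttl: number; // seconds
}

// Hypothetical helper: split messages into live and expired at time `now`.
export function partitionByTtl<M extends TimedMessage>(
  messages: M[],
  now: number = Date.now(),
): { live: M[]; expired: M[] } {
  const live: M[] = [];
  const expired: M[] = [];
  for (const msg of messages) {
    // Same rule as fetch: a message expires once its age exceeds ttl seconds.
    (now - msg.timestamp > msg.ttl * 1000 ? expired : live).push(msg);
  }
  return { live, expired };
}
```

Returning the expired messages instead of silently dropping them gives a periodic cleanup routine the option to log them or mark the corresponding tasks as failed.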
4. Schema Drift and Serialization Errors
Explanation: As agents evolve, the structure of the payload may change. If an older agent writes a payload that a newer agent cannot parse, the system may fail silently or crash.
Fix: Use schema validation (e.g., Zod, JSON Schema) on both post and fetch. Include a version field in the message structure to allow agents to handle legacy payloads gracefully.
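To illustrate the version field, here is a dependency-free sketch that upgrades a legacy payload on read. Everything in it is hypothetical: the two payload shapes, the rule that a missing version means version 1, and the default severity assigned during the upgrade. In a Zod-based codebase the same idea would normally be expressed with schemas instead of hand-written guards.

```typescript
type Severity = "LOW" | "MEDIUM" | "HIGH";

// Hypothetical current payload shape (version 2).
interface PayloadV2 {
  version: 2;
  findings: string[];
  severity: Severity;
}

const isRecord = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null;
const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every((item) => typeof item === "string");
const isSeverity = (v: unknown): v is Severity =>
  v === "LOW" || v === "MEDIUM" || v === "HIGH";

// Hypothetical reader: accepts v1 or v2 payloads and always returns the v2 shape.
export function parsePayload(raw: unknown): PayloadV2 {
  if (!isRecord(raw)) throw new Error("payload must be an object");
  const version = raw["version"] ?? 1; // assumption: legacy writers sent no version
  if (version === 1) {
    const findings = raw["findings"];
    if (typeof findings !== "string") throw new Error("v1 payload needs a string findings field");
    // Upgrade: wrap the single finding; "LOW" is an arbitrary default for this sketch.
    return { version: 2, findings: [findings], severity: "LOW" };
  }
  if (version === 2) {
    const findings = raw["findings"];
    const severity = raw["severity"];
    if (!isStringArray(findings)) throw new Error("v2 payload needs a string[] findings field");
    if (!isSeverity(severity)) throw new Error("v2 payload needs a valid severity");
    return { version: 2, findings, severity };
  }
  throw new Error(`unsupported payload version: ${String(version)}`);
}
```

Failing loudly on an unknown version is a deliberate choice here: an agent that cannot interpret a message should report it rather than act on a guess.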
5. Disk I/O Bottlenecks
Explanation: On systems with slow storage (e.g., network mounts or mechanical HDDs), frequent reads and writes can introduce latency, slowing down agent handoffs.
Fix: Ensure the bus file resides on fast local storage (SSD/NVMe). For high-throughput scenarios, consider an in-memory simulation for testing, but validate performance on the target storage medium.
6. Security and Injection Risks
Explanation: If the payload contains executable commands or user input, malicious agents could inject harmful data. Additionally, file permissions may allow unauthorized processes to read sensitive agent communications.
Fix: Sanitize all inputs within the payload. Restrict file permissions on the bus file to the specific user or group running the agents. Avoid executing raw payload data without validation.
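Restricting permissions can be sketched with Node's built-in `fs` calls. The helper names below are hypothetical. One detail worth noting under the temp-file-rename pattern: the renamed temp file replaces the bus file together with its mode, so permissions have to be set on every temp file (or on the containing directory), not once on the bus file.

```typescript
import { promises as fs } from "node:fs";

// Hypothetical helper: create the bus directory and restrict it to the owning user.
export async function ensurePrivateDir(dir: string): Promise<void> {
  await fs.mkdir(dir, { recursive: true, mode: 0o700 });
  // mkdir's mode is subject to umask and is ignored for existing directories.
  await fs.chmod(dir, 0o700);
}

// Hypothetical helper: write a file that only the owning user can read or write.
export async function writePrivateFile(filePath: string, contents: string): Promise<void> {
  await fs.writeFile(filePath, contents, { encoding: "utf-8", mode: 0o600 });
  // The mode option only applies when the file is created, so chmod covers existing files.
  await fs.chmod(filePath, 0o600);
}
```

These mode bits are meaningful on POSIX systems. On Windows, `chmod` can only toggle the read-only flag, so access control there has to come from the account and folder ACLs instead.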
7. State Desynchronization
Explanation: If the bus file is deleted, moved, or corrupted by an external process, agents may lose synchronization and fail to communicate.
Fix: Implement health checks that verify the existence and validity of the bus file. Create backup copies of the bus file periodically. Use file locking mechanisms if the OS supports them to prevent external interference.
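A health check along these lines can be small. The sketch below assumes the bus file holds a JSON array, as in the implementation above; the function name `checkBusHealth` and the reason codes are hypothetical.

```typescript
import { promises as fs } from "node:fs";

export type BusHealth =
  | { ok: true; messageCount: number }
  | { ok: false; reason: "missing" | "unreadable" | "invalid-json" | "not-an-array" };

// Hypothetical helper: verify that the bus file exists and holds a JSON array.
export async function checkBusHealth(busPath: string): Promise<BusHealth> {
  let raw: string;
  try {
    raw = await fs.readFile(busPath, "utf-8");
  } catch (err) {
    const code = (err as NodeJS.ErrnoException).code;
    return { ok: false, reason: code === "ENOENT" ? "missing" : "unreadable" };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "invalid-json" };
  }
  if (!Array.isArray(parsed)) return { ok: false, reason: "not-an-array" };
  return { ok: true, messageCount: parsed.length };
}
```

An agent could run this before each polling cycle and, on an `invalid-json` result, restore the most recent backup copy rather than overwrite the damaged file.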
Production Bundle
Action Checklist
- Implement Atomic Writes: Ensure all post operations use the temp-file-rename pattern to prevent corruption.
- Enforce TTLs: Set appropriate time-to-live values for all messages and implement cleanup logic in fetch.
- Add Schema Validation: Use a validation library to enforce message structure on every read and write.
- Configure Polling Intervals: Set polling intervals based on latency requirements; use backoff or file watchers to reduce CPU load.
- Secure File Permissions: Restrict access to the bus file directory to authorized agents only.
- Monitor Disk Usage: Track the size of the bus file and implement rotation or purging strategies if it grows too large.
- Test Concurrency: Simulate multiple agents writing simultaneously to verify atomicity and race condition handling.
- Define Agent Roles: Establish a clear registry of sender/recipient names to prevent misrouted messages.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Local Development / Prototyping | File-Based Bus | Instant setup, zero dependencies, easy debugging. | $0 |
| Privacy-First Local LLM Apps | File-Based Bus | Data never leaves the host; no network egress. | $0 |
| Single-Machine Multi-Agent Workflow | File-Based Bus | Sufficient throughput; simpler than Redis/Kafka. | $0 |
| Multi-Region / Distributed Agents | Broker-Based (Redis/Kafka) | File bus cannot span machines; requires network transport. | $50–$200+/mo |
| High-Throughput Enterprise Pipeline | Broker-Based | File I/O becomes bottleneck; brokers handle scale better. | $50–$200+/mo |
| Edge Device with Limited Resources | File-Based Bus | Minimal memory footprint; no external service overhead. | $0 |
Configuration Template
Use this JSON configuration to define bus paths, polling rates, and TTL defaults for your agents.
{
"bus": {
"path": "./data/agent_bus.json",
"maxFileSizeMB": 10,
"cleanupIntervalSeconds": 60
},
"agents": {
"orchestrator": {
"role": "sender",
"pollIntervalMs": 1000,
"defaultTTLSeconds": 120
},
"code-worker": {
"role": "receiver",
"pollIntervalMs": 500,
"defaultTTLSeconds": 60
}
},
"security": {
"filePermissions": "0600",
"validateSchema": true
}
}
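The SharedStateBus class shown earlier does not read this file itself, so a loader is needed to make the template useful. The sketch below is one hypothetical way to type and sanity-check it; the interface names, the validation rules, and the helper `permissionBits` are assumptions based only on the fields in the template.

```typescript
interface AgentConfig {
  role: string;
  pollIntervalMs: number;
  defaultTTLSeconds: number;
}

interface BusConfig {
  bus: { path: string; maxFileSizeMB: number; cleanupIntervalSeconds: number };
  agents: Record<string, AgentConfig>;
  security: { filePermissions: string; validateSchema: boolean };
}

// Hypothetical loader: parse the template and reject obviously invalid values.
export function parseBusConfig(json: string): BusConfig {
  const cfg = JSON.parse(json) as Partial<BusConfig>;
  if (!cfg.bus || typeof cfg.bus.path !== "string" || cfg.bus.path === "") {
    throw new Error("bus.path must be a non-empty string");
  }
  if (!cfg.agents) throw new Error("agents section is required");
  for (const [name, agent] of Object.entries(cfg.agents)) {
    if (!(agent.pollIntervalMs > 0)) throw new Error(`${name}: pollIntervalMs must be positive`);
    if (!(agent.defaultTTLSeconds > 0)) throw new Error(`${name}: defaultTTLSeconds must be positive`);
  }
  // Permissions are stored as an octal string such as "0600".
  if (!cfg.security || !/^0?[0-7]{3}$/.test(String(cfg.security.filePermissions))) {
    throw new Error("security.filePermissions must be an octal string like 0600");
  }
  return cfg as BusConfig;
}

// Convert the octal string into the numeric mode expected by fs.chmod.
export function permissionBits(cfg: BusConfig): number {
  return parseInt(cfg.security.filePermissions, 8);
}
```

The string form of `filePermissions` is easy to misuse: passing `"0600"` or the decimal number 600 to `fs.chmod` does not mean the same thing as the octal mode, which is why the conversion is made explicit.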
Quick Start Guide
1. Initialize the Project: Create a new directory and install dependencies:
mkdir multi-agent-bus && cd multi-agent-bus
npm init -y
npm install typescript zod uuid @types/node
npx tsc --init
2. Create the Bus Implementation: Save the SharedStateBus class code from the Core Solution section into src/bus.ts, and update the import path in the usage example (./shared-state-bus) to match.
3. Define Your Agents: Create src/orchestrator.ts and src/worker.ts using the usage examples. Configure the bus path to point to the same file.
4. Run the System: Run both agents in separate terminal windows:
npx ts-node src/orchestrator.ts
npx ts-node src/worker.ts
Observe the console output as the orchestrator dispatches a task and the worker returns the result.
5. Inspect the State: Open data/agent_bus.json in a text editor to view the raw messages currently on the bus. This provides immediate visibility into agent communication without external tools.
