We Ran 4 Claude Code Dialogs for 28 Hours. Here's What the Memory Layer Caught (and Missed).

Orchestrator-Free Agent Meshes: A Filesystem-Contract Protocol for Multi-Session Reliability

Current Situation Analysis

Building reliable multi-agent systems typically forces engineers into a binary choice: deploy a heavy orchestration runtime (like LangGraph or AutoGen) with complex state management, or accept fragile, ad-hoc communication patterns that break under load. The industry consensus assumes that agents require real-time event buses, shared APIs, or centralized message queues to coordinate effectively. This assumption introduces significant operational overhead, creates single points of failure, and obscures the audit trail of agent interactions.

However, for asynchronous LLM-based workflows—particularly those involving human-in-the-loop oversight or long-running tasks—this complexity is often unnecessary. The critical failure mode in multi-agent reliability is not communication latency; it is the drift-intervention gap. Agents can detect anomalies (drift) with high frequency, but without a structured mechanism to trigger and track interventions, detection metrics become vanity numbers.

Field data from concurrent multi-session deployments reveals a stark reality: open-loop drift detection yields negligible intervention rates. In a 28-hour stress test of four concurrent agent dialogs sharing a filesystem, raw drift detection fired 314 times over a 7-day window, yet the intervention rate (actions taken per detection) sat at a dismal 9.87%. The system was noisy but ineffective. Only when a closed-loop contract mechanism was introduced did the intervention rate jump to 40.79% within a 24-hour window, while simultaneously reducing noise by filtering alerts through structured contracts. This demonstrates that reliability in agent meshes is not about faster communication; it is about converting detection into accountable, tracked actions.

WOW Moment: Key Findings

The most significant insight from multi-session reliability engineering is the divergence between detection volume and actionable outcomes. The following data compares an open-loop monitoring regime against a closed-loop contract protocol.

Regime	Detection Volume	Intervention Rate	Noise Reduction	Latency to Action
Open-Loop (Alerts Only)	314 events	9.87%	None	Undefined / High
Closed-Loop (Contracts)	76 events	40.79%	75.8%	~17.92 hours

Why this matters: The closed-loop contract protocol cut detection volume by 75.8% while roughly quadrupling the intervention rate (9.87% to 40.79%). This suggests that structured contracts act as a high-fidelity filter: instead of reacting to every drift signal, agents engage only when a formal obligation exists. The 17.92-hour latency to action for a contract closure is acceptable in async workflows, and the contract files provide a deterministic audit trail that event buses cannot match without additional logging infrastructure. The approach also gives O(N+K) coordination complexity (where N is the number of agents and K is the number of contracts) rather than the O(N²) complexity of pairwise agent communication.
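
The percentages in the table can be reproduced with a few lines of arithmetic. This is a sketch, not the original instrumentation: the detection volumes come from the table, but the action count of 31 is back-derived from the reported rates and is an assumption, as is the helper name `rate`.

```python
def rate(actions: int, detections: int) -> float:
    """Share of detections that were followed by an action, as a percentage."""
    return 100.0 * actions / detections if detections else 0.0

# Detection volumes are taken from the table above.
open_loop_detections = 314
closed_loop_detections = 76
# Assumption: 31 actions in each regime, inferred from 9.87% and 40.79%.
actions = 31

open_rate = rate(actions, open_loop_detections)
closed_rate = rate(actions, closed_loop_detections)
noise_reduction = (
    100.0 * (open_loop_detections - closed_loop_detections) / open_loop_detections
)

print(f"open-loop rate:   {open_rate:.2f}%")        # 9.87%
print(f"closed-loop rate: {closed_rate:.2f}%")      # 40.79%
print(f"noise reduction:  {noise_reduction:.1f}%")  # 75.8%
print(f"rate multiplier:  {closed_rate / open_rate:.2f}x")  # 4.13x
```

If the inferred count is right, the improvement comes from the shrinking denominator: a similar number of actions spread over far fewer, better-targeted detections.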

Core Solution

The solution replaces centralized orchestrators with a Filesystem-Contract Protocol. Agents communicate via atomic file writes in a shared directory structure. Coordination is driven by YAML-based contract frontmatter that agents scan and inject into their context windows via recall hooks.

Architecture Decisions

Filesystem as Source of Truth: No shared memory or API. Each agent owns its session state but reads shared contract files. This avoids the contention of a shared database and provides native durability, as long as writes to the shared contract files are atomic.
Contract Frontmatter: Contracts are defined in YAML blocks. They specify issuer, target, deadline, deliverables, and status. This structure allows agents to parse obligations without natural language ambiguity.
Recall Hooks: Agents do not poll continuously. A recall hook scans the contract directory for files matching the agent's ID and injects active contracts into the prompt context. This ensures agents are aware of obligations at decision points without constant overhead.
Drift-Loop Triad: Reliability is measured by three independent counters and one metric derived from them:
- Detection: Drift events fired.
- Intervention: User CLI actions taken.
- Acknowledgment: Agent self-ack of drift.
- Metric: intervention_ratio = (interventions + acks) / detections. Target ≥70%.

Implementation Examples

1. Contract Schema Definition

Contracts replace ad-hoc messages with structured obligations. The schema enforces accountability.

# contracts/ctr_mem_exec_99x.yaml
metadata:
  contract_id: ctr_mem_exec_99x
  issuer: MemoryHub
  target: TaskRunner
  deadline: 2026-06-10T12:00:00Z
  priority: high
deliverables:
  - type: artifact
    path: ./output/processed_data.json
    checksum_algo: sha256
  - type: acknowledgment
    format: yaml
    fields: [status, gotchas, cycle_id]
constraints:
  - idempotency_required: true
  - max_retries: 3
status: outstanding
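
A schema only enforces accountability if something checks it. The sketch below validates a contract that has already been loaded into a dict. The function name `validate_contract`, the set of required keys, and the rule that every deliverable needs a `type` are assumptions drawn from the example above, not a published specification. The status values are the three this article uses.

```python
from datetime import datetime

# Assumption: these four metadata keys are mandatory for every contract.
REQUIRED_METADATA = ("contract_id", "issuer", "target", "deadline")
KNOWN_STATUSES = {"outstanding", "fulfilled", "expired"}

def validate_contract(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    meta = data.get("metadata") or {}
    for key in REQUIRED_METADATA:
        if not meta.get(key):
            problems.append(f"metadata.{key} is missing")

    # A quoted deadline arrives as a string; check that it parses as ISO 8601.
    deadline = meta.get("deadline")
    if isinstance(deadline, str):
        try:
            datetime.fromisoformat(deadline.replace("Z", "+00:00"))
        except ValueError:
            problems.append("metadata.deadline is not an ISO 8601 timestamp")

    # status is a top-level key in the schema, next to metadata.
    if data.get("status") not in KNOWN_STATUSES:
        problems.append(f"status must be one of {sorted(KNOWN_STATUSES)}")

    deliverables = data.get("deliverables")
    if not isinstance(deliverables, list) or not deliverables:
        problems.append("deliverables must be a non-empty list")
    elif any(not isinstance(d, dict) or "type" not in d for d in deliverables):
        problems.append("every deliverable needs a 'type'")
    return problems
```

Running this check when a contract is issued, rather than when it is read, keeps malformed files out of the shared directory in the first place.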

2. Recall Hook Implementation

The recall hook surfaces contracts to the agent. It filters by target agent, status, and deadline, ensuring only relevant obligations are injected.

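import logging
from datetime import datetime, timezone
from pathlib import Path

import yaml

log = logging.getLogger(__name__)

def _parse_deadline(raw) -> datetime:
    """Accept a datetime (PyYAML parses unquoted timestamps) or an ISO 8601 string."""
    if not isinstance(raw, datetime):
        raw = datetime.fromisoformat(str(raw).replace("Z", "+00:00"))
    # Treat naive timestamps as UTC so the comparison never mixes naive and aware values.
    return raw if raw.tzinfo else raw.replace(tzinfo=timezone.utc)

def inject_contracts(session_context: dict, memory_dir: Path) -> str:
    """Scan contracts and inject active ones into agent context."""
    contract_dir = memory_dir / "contracts"
    if not contract_dir.exists():
        return ""

    active_contracts = []
    now = datetime.now(timezone.utc)

    for file in sorted(contract_dir.glob("*.yaml")):
        try:
            with open(file, "r", encoding="utf-8") as f:
                data = yaml.safe_load(f) or {}

            meta = data.get("metadata", {})
            # status is a top-level key in the contract schema, not part of metadata.
            if (meta.get("target") == session_context["agent_id"] and
                data.get("status") == "outstanding" and
                _parse_deadline(meta["deadline"]) > now):
                active_contracts.append(data)
        except Exception as e:
            log.error(f"Failed to parse contract {file}: {e}")

    if not active_contracts:
        return ""

    # Format for prompt injection
    prompt_block = "## Active Contracts\n"
    for c in active_contracts:
        m = c["metadata"]
        prompt_block += f"- **{m['contract_id']}**: Deadline {m['deadline']}. "
        prompt_block += f"Deliverables: {', '.join(d['type'] for d in c.get('deliverables', []))}\n"

    return prompt_block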

3. The Verify-Gap Pattern

A critical reliability pattern is the Verify-Gap. Agents often claim success based on internal state that may not reflect the actual environment. Downstream agents must spot-check claims before relying on them.

import subprocess
import sys

def verify_handoff_claim(claim: str, test_suite: str) -> bool:
    """
    Pattern: Never trust a handoff claim blindly.
    Run the actual verification command to detect environment gaps.
    """
    print(f"Verifying claim: '{claim}' against suite: {test_suite}")
    
    # Example: Run pytest on specific files
    cmd = [sys.executable, "-m", "pytest", test_suite, "-q"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    
    if result.returncode != 0:
        # Detect common environment gaps (e.g., missing __init__.py).
        # pytest reports import errors on stdout, so check both streams.
        if "ModuleNotFoundError" in result.stdout + result.stderr:
            print("WARNING: Environment gap detected. Check package structure.")
        print(f"Verification FAILED. Output:\n{result.stdout}\n{result.stderr}")
        return False
        
    print("Verification PASSED.")
    return True

# Usage in agent workflow
# claim = "22 tests GREEN"
# if not verify_handoff_claim(claim, "tests/proof/test_tier_promotion.py"):
#     raise RuntimeError("Handoff claim unverified. Aborting downstream dependency.")

4. Drift Triad Instrumentation

Measure the closed-loop effectiveness.

class DriftTriad:
    def __init__(self):
        self.detections = 0
        self.cli_interventions = 0
        self.agent_acks = 0
        
    def record_detection(self):
        self.detections += 1
        
    def record_intervention(self):
        self.cli_interventions += 1
        
    def record_ack(self):
        self.agent_acks += 1
        
    @property
    def intervention_ratio(self) -> float:
        if self.detections == 0:
            return 0.0
        return (self.cli_interventions + self.agent_acks) / self.detections

# Production target: intervention_ratio >= 0.70

Pitfall Guide

1. Race Conditions on Contract Files

Explanation: Multiple agents writing to the same contract file can corrupt YAML or lose status updates.
Fix: Use atomic writes. Write to a temporary file and rename. Implement file locking or unique status files per agent (e.g., ctr_001.status.runner).
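
The write-then-rename fix can be sketched with the standard library alone. The helper name `atomic_write_text` is hypothetical, and the sketch assumes the temp file and the target sit on the same filesystem, which is what makes `os.replace` atomic.

```python
import os
import tempfile
from pathlib import Path

def atomic_write_text(path: Path, text: str) -> None:
    """Write text so readers see either the old file or the new one, never a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target: os.replace is only atomic
    # when source and destination are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file on any failure, then re-raise.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The `.tmp` suffix keeps half-written files out of any `*.yaml` scan. An atomic rename prevents torn reads, but it does not stop two writers from overwriting each other's update, which is why per-agent status files or a lock are still worth having.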

2. Blind Trust in Handoff Metrics

Explanation: Agents may report "22/22 tests passing" based on a cached run or an environment that differs from the receiver's. This leads to silent failures downstream.
Fix: Implement the Verify-Gap pattern. Always run a spot-check command on claimed metrics before proceeding. One missing __init__.py can invalidate an entire handoff.

3. Open-Loop Drift Monitoring

Explanation: Logging drift events without a mechanism to trigger action results in alert fatigue. High detection volume with low intervention ratio indicates a broken feedback loop.
Fix: Instrument the Drift Triad. Ensure every detection has a potential path to intervention. Target an intervention ratio ≥70%. If the ratio is low, the system is noisy, not reliable.

4. Contract Drift and Expiration

Explanation: Contracts accumulate over time. Agents may waste context window on expired or completed contracts.
Fix: Enforce deadline checks in the recall hook. Implement a garbage collection routine that archives completed contracts. Use status fields (outstanding, fulfilled, expired) rigorously.
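
A garbage collection pass might look like the sketch below. The function name and the regex-based status lookup are assumptions: the code expects a top-level `status: <word>` line as in the schema example, and it archives expired contracts as well as fulfilled ones. A version that parses the YAML properly would be more robust.

```python
import re
import shutil
from pathlib import Path

# Assumption: status is a top-level "status: <word>" line, as in the schema example.
STATUS_LINE = re.compile(r"^status:\s*(\w+)\s*$", re.MULTILINE)
# Assumption: both closed states are safe to move out of the active set.
ARCHIVABLE = {"fulfilled", "expired"}

def archive_closed_contracts(contract_dir: Path) -> list[str]:
    """Move fulfilled or expired contracts into contract_dir/archive; return their names."""
    archive_dir = contract_dir / "archive"
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # glob("*.yaml") is not recursive, so files already in archive/ are skipped.
    for file in sorted(contract_dir.glob("*.yaml")):
        match = STATUS_LINE.search(file.read_text(encoding="utf-8"))
        if match and match.group(1) in ARCHIVABLE:
            shutil.move(str(file), str(archive_dir / file.name))
            moved.append(file.name)
    return moved
```

Moving the files rather than deleting them keeps the audit trail intact while shrinking what the recall hook has to scan.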

5. Environment Gaps in Package Structure

Explanation: Code that runs in one agent's session may fail in another due to missing __init__.py files, path configurations, or dependency versions.
Fix: Validate package structure as part of the verification step. Ensure sys.path is consistent or use absolute imports. Treat environment validation as a first-class deliverable.
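
One way to validate package structure before trusting a handoff is to look for directories that contain Python modules but no `__init__.py`. This is a heuristic sketch with a hypothetical function name. Namespace packages legitimately omit `__init__.py`, so treat hits as prompts for a closer look rather than as errors.

```python
from pathlib import Path

def find_missing_init(package_root: Path) -> list[Path]:
    """Return directories at or under package_root that hold .py files but no __init__.py."""
    candidates = [package_root, *(p for p in package_root.rglob("*") if p.is_dir())]
    missing = []
    for directory in sorted(candidates):
        if directory.name == "__pycache__":
            continue
        has_modules = any(
            f.suffix == ".py" for f in directory.iterdir() if f.is_file()
        )
        if has_modules and not (directory / "__init__.py").exists():
            missing.append(directory)
    return missing
```

Running the check in both the sending and the receiving session, and comparing the results, is a cheap way to spot the environment differences described above.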

6. N² Communication Complexity

Explanation: When agents communicate directly with each other, the number of channels grows quadratically with the number of agents, and messages get missed.
Fix: Use the contract protocol. All communication flows through contracts. This reduces complexity to O(N+K). Agents only need to know how to read/write contracts, not how to talk to specific peers.
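
To make the scaling difference concrete, the sketch below counts channels in a full mesh against touchpoints under the contract protocol. The function names and the choice of K = 10 contracts are illustrative assumptions.

```python
def pairwise_channels(n_agents: int) -> int:
    """Distinct agent-to-agent links in a full mesh: N * (N - 1) / 2."""
    return n_agents * (n_agents - 1) // 2

def contract_touchpoints(n_agents: int, n_contracts: int) -> int:
    """Things to reason about under the contract protocol: N agents plus K contract files."""
    return n_agents + n_contracts

# Assumption: a fixed pool of 10 active contracts, chosen only for illustration.
for n in (4, 16, 64):
    print(n, pairwise_channels(n), contract_touchpoints(n, n_contracts=10))
# 4 6 14
# 16 120 26
# 64 2016 74
```

At four agents the full mesh is still the smaller number, so the complexity argument matters mainly as the mesh grows. At small scale, the stronger case for contracts is the audit trail.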

7. Missing Intervention Hooks

Explanation: Detection is useless if the agent cannot act on it. If drift is detected but the agent has no tool or permission to resolve it, the drift is ignored.
Fix: Equip agents with intervention tools. Ensure the recall hook surfaces not just the drift, but the available actions. Close the loop by linking detection to capability.

Production Bundle

Action Checklist

Define Contract Schema: Establish a YAML schema for contracts including ID, issuer, target, deadline, deliverables, and status.
Implement Atomic Writes: Ensure all file operations use atomic rename patterns to prevent corruption.
Deploy Recall Hooks: Configure each agent to scan the contract directory and inject active contracts into the prompt context.
Instrument Drift Triad: Add counters for detection, CLI intervention, and agent acknowledgment. Calculate intervention ratio.
Add Verify-Gap Step: Implement a spot-check routine that validates handoff claims before downstream execution.
Configure Deadline Alerts: Set up monitoring for contracts approaching deadlines to prevent SLA breaches.
Archive Completed Contracts: Implement a routine to move fulfilled contracts to an archive directory to keep the active set clean.
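
The deadline alert item in the checklist can start as a simple filter over loaded contracts. This is a sketch under stated assumptions: the function name `due_soon` and the alert window are hypothetical, contracts are already parsed into dicts with a top-level `status`, and naive timestamps are treated as UTC.

```python
from datetime import datetime, timedelta, timezone

def due_soon(contracts: list[dict], now: datetime, window: timedelta) -> list[str]:
    """IDs of outstanding contracts whose deadline falls within `window` after `now`."""
    flagged = []
    for contract in contracts:
        if contract.get("status") != "outstanding":
            continue
        meta = contract.get("metadata", {})
        deadline = meta.get("deadline")
        # Deadlines may arrive as ISO 8601 strings or as datetime objects.
        if isinstance(deadline, str):
            deadline = datetime.fromisoformat(deadline.replace("Z", "+00:00"))
        if not isinstance(deadline, datetime):
            continue  # missing or unusable deadline: nothing to alert on
        if deadline.tzinfo is None:
            deadline = deadline.replace(tzinfo=timezone.utc)
        if now <= deadline <= now + window:
            flagged.append(meta.get("contract_id", "<unknown>"))
    return flagged
```

Contracts that are already overdue fall outside the window on purpose. They are better handled by flipping their status to expired than by another alert.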

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Async LLM Workflows	Filesystem-Contract Protocol	Simplicity, auditability, no network dependency. Low latency requirements.	Low
Real-Time Trading	Event Bus / Message Queue	Sub-millisecond latency required. Filesystem too slow.	High
Multi-Org Collaboration	API Gateway + Contracts	Security boundaries require API. Contracts provide structure.	Medium
High-Volume Micro-Tasks	Batch Processing	Contract overhead per task is too high. Aggregate tasks.	Low
Human-in-the-Loop	Filesystem-Contract Protocol	Humans need audit trails and async review. Contracts provide clear obligations.	Low

Configuration Template

Directory structure for a filesystem-mediated agent mesh.

.agent_mesh/
├── contracts/
│   ├── ctr_001.yaml          # Active contract
│   ├── ctr_002.yaml
│   └── archive/              # Completed contracts
│       └── ctr_000.yaml
├── memory/
│   ├── hub/                  # MemoryHub session data
│   │   ├── session.md
│   │   └── drift_log.json
│   ├── runner/               # TaskRunner session data
│   │   ├── session.md
│   │   └── output/
│   └── strategy/             # StrategyCore session data
│       └── anchors.md
├── hooks/
│   ├── recall.py             # Contract injection logic
│   └── verify.py             # Verify-Gap implementation
└── config/
    ├── agents.yaml           # Agent definitions
    └── drift_config.yaml     # Drift thresholds

Quick Start Guide

Initialize Directory Structure: Create the .agent_mesh directory with contracts, memory, hooks, and config subdirectories.
Write First Contract: Create contracts/ctr_init_001.yaml defining an initial task for the TaskRunner.
Configure Recall Hook: Add the recall hook script to each agent's startup sequence. Ensure it reads contracts/ and injects active contracts.
Launch Agents: Start the agent sessions. Verify that contracts appear in the prompt context.
Monitor Metrics: Run the drift triad instrumentation. Check the intervention_ratio. Aim for ≥70%. If low, review detection noise and intervention capabilities.
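
The first step of the quick start can be scripted. The sketch below creates only the top-level layout shown in the configuration template. The function name `init_mesh` is hypothetical, and the per-agent folders under memory/ are left for each agent to create.

```python
from pathlib import Path

# Subdirectories from the configuration template; archive/ lives under contracts/.
MESH_SUBDIRS = ("contracts/archive", "memory", "hooks", "config")

def init_mesh(root: Path) -> Path:
    """Create the .agent_mesh skeleton under root and return its path."""
    mesh = root / ".agent_mesh"
    for sub in MESH_SUBDIRS:
        # exist_ok makes the call safe to repeat on an existing mesh.
        (mesh / sub).mkdir(parents=True, exist_ok=True)
    return mesh
```

Because the function is idempotent, every agent can call it at startup without coordinating over who creates the directories first.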

This protocol provides a robust, auditable, and scalable foundation for multi-agent reliability without the overhead of centralized orchestrators. By treating the filesystem as a structured communication layer and enforcing contracts with verification, teams can build agent meshes that are both simple and dependable.
