    repo_root: Path
    session_id: str
    # Derived in __post_init__, so callers never pass it (Python 3.10+ syntax).
    worktree_path: Path | None = None

    def __post_init__(self):
        self.worktree_path = self.repo_root / ".worktrees" / self.session_id
        self.worktree_path.mkdir(parents=True, exist_ok=True)
        self._provision()

    def _provision(self):
        # Check out the current HEAD into the session's own directory.
        subprocess.run(
            ["git", "worktree", "add", str(self.worktree_path), "HEAD"],
            cwd=self.repo_root, check=True
        )

    def teardown(self):
        # --force removes the worktree even if it holds uncommitted changes.
        subprocess.run(
            ["git", "worktree", "remove", "--force", str(self.worktree_path)],
            cwd=self.repo_root, check=True
        )
```
**Why this choice:** Worktrees avoid the overhead of full repository clones while guaranteeing filesystem isolation. The `HEAD` reference ensures the session starts from the exact commit state, and `--force` teardown prevents orphaned directories from accumulating.
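One way to make the teardown guarantee explicit is to wrap provisioning in a context manager, so removal runs even when the session body raises. The sketch below is an illustration layered on the class above, not part of it: `worktree_session` and its `run` parameter are hypothetical names, and `run` defaults to `subprocess.run` so that a test can substitute a recorder.

```python
import subprocess
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def worktree_session(repo_root: Path, session_id: str, run=subprocess.run):
    """Provision a worktree at HEAD and always remove it on exit."""
    path = repo_root / ".worktrees" / session_id
    run(["git", "worktree", "add", str(path), "HEAD"], cwd=repo_root, check=True)
    try:
        yield path
    finally:
        # Runs on normal exit and when the body raises.
        run(["git", "worktree", "remove", "--force", str(path)],
            cwd=repo_root, check=True)
```

Used as `with worktree_session(repo, "dev_01") as path: ...`, the session body cannot skip the removal step by returning early or raising.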
### Step 2: Context Budgeting and Sub-Agent Delegation
Instead of letting the primary agent consume tokens indefinitely, implement a context budget that triggers delegation when thresholds are approached. Sub-agents execute bounded tasks and return only finalized artifacts.
```python
from typing import Any

from pydantic import BaseModel, Field


class ContextBudget(BaseModel):
    max_tokens: int = Field(default=8192, description="Hard limit before delegation")
    current_usage: int = 0
    delegation_threshold: float = Field(default=0.75, description="Trigger delegation at 75%")


class TaskDelegation:
    def __init__(self, budget: ContextBudget):
        self.budget = budget

    def should_delegate(self, estimated_tokens: int) -> bool:
        projected = self.budget.current_usage + estimated_tokens
        return projected > (self.budget.max_tokens * self.budget.delegation_threshold)

    async def spawn_subtask(self, prompt: str, executor: Any) -> str:
        # Sub-agent runs in isolated context, returns only final output
        result = await executor.run(prompt)
        # Rough token estimate (~1.3 tokens per word), cast to int to match the field type
        self.budget.current_usage += int(len(result.split()) * 1.3)
        return result
```
**Why this choice:** Explicit budgeting prevents attention dilution. The 75% threshold leaves headroom for tool responses and error recovery. Sub-agents terminate after completion, returning only the delta needed by the primary context.
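To make the threshold arithmetic concrete, here is a small worked example. It uses a stdlib dataclass named `Budget` as a stand-in for `ContextBudget` (an assumption made only to keep the snippet dependency-free); the numbers are the defaults from the code above.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Stdlib stand-in for ContextBudget, used only for this illustration."""
    max_tokens: int = 8192
    current_usage: int = 0
    delegation_threshold: float = 0.75

    def should_delegate(self, estimated_tokens: int) -> bool:
        limit = self.max_tokens * self.delegation_threshold  # 8192 * 0.75 = 6144
        return self.current_usage + estimated_tokens > limit


budget = Budget(current_usage=5000)
print(budget.should_delegate(1000))  # False: 6000 does not exceed 6144
print(budget.should_delegate(1200))  # True: 6200 exceeds 6144
```

With the defaults, delegation starts once projected usage passes 6144 tokens, leaving 2048 tokens of headroom below the hard limit.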
### Step 3: Typed Tool Routing
Tools must be strictly typed to prevent malformed API calls. Pydantic models enforce parameter validation before execution, and async execution prevents I/O blocking.
```python
import glob
from abc import ABC, abstractmethod
from typing import Any, Dict, Type

from pydantic import BaseModel, Field, ValidationError


class BaseTool(ABC):
    name: str = ""
    description: str = ""
    schema_model: Any = None

    @abstractmethod
    async def execute(self, params: Any) -> Dict[str, Any]:
        pass


class FileSearchTool(BaseTool):
    name = "search_codebase"
    description = "Locate files matching a pattern or content query"

    class Params(BaseModel):
        pattern: str = Field(description="Glob pattern, e.g. '**/*.py'")
        max_results: int = Field(default=10, ge=1, le=50)

    schema_model = Params

    async def execute(self, params: Params) -> Dict[str, Any]:
        matches = glob.glob(params.pattern, recursive=True)
        return {"files": matches[:params.max_results], "error": None}


class ToolRouter:
    def __init__(self):
        self.registry: Dict[str, BaseTool] = {}

    def register(self, tool_cls: Type[BaseTool]):
        # Takes the tool class, not an instance; the router instantiates it.
        instance = tool_cls()
        self.registry[instance.name] = instance

    async def dispatch(self, tool_name: str, params: Dict[str, Any]) -> Dict[str, Any]:
        if tool_name not in self.registry:
            return {"error": f"Unknown tool: {tool_name}"}
        tool = self.registry[tool_name]
        try:
            validated = tool.schema_model(**params)
        except ValidationError as exc:
            # Report malformed parameters instead of crashing the agent loop.
            return {"error": f"Invalid parameters: {exc}"}
        return await tool.execute(validated)
```
**Why this choice:** Schema validation catches malformed requests before they hit the filesystem or network. The router pattern decouples tool discovery from execution, enabling dynamic registration and testing. Async execution ensures the agent loop remains responsive during I/O-heavy operations.
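The three outcomes a caller can expect from a validate-then-execute dispatcher (unknown tool, rejected parameters, successful call) can be sketched without any dependencies. This is an illustration only: `SearchParams`, `fake_search`, `REGISTRY`, and `dispatch` are hypothetical names, and a plain dataclass with a range check stands in for the Pydantic model.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SearchParams:
    """Hypothetical schema; a Pydantic model would replace this in practice."""
    pattern: str
    max_results: int = 10

    def __post_init__(self) -> None:
        if not 1 <= self.max_results <= 50:
            raise ValueError("max_results must be between 1 and 50")


async def fake_search(params: SearchParams) -> Dict[str, Any]:
    # Stub executor: echoes the validated parameters instead of touching disk.
    return {"pattern": params.pattern, "limit": params.max_results, "error": None}


REGISTRY = {"search_codebase": (SearchParams, fake_search)}


async def dispatch(name: str, raw: Dict[str, Any]) -> Dict[str, Any]:
    if name not in REGISTRY:
        return {"error": f"Unknown tool: {name}"}
    schema, execute = REGISTRY[name]
    try:
        # Raises on out-of-range values, missing fields, and unexpected keys.
        validated = schema(**raw)
    except (TypeError, ValueError) as exc:
        return {"error": f"Invalid parameters: {exc}"}
    return await execute(validated)
```

Because every failure comes back as a dictionary with an `error` key, the agent loop can feed the message to the model and retry rather than unwinding on an exception.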
### Step 4: Session Orchestration
Combine isolation, budgeting, and tool routing into a deterministic execution loop.
```python
from pathlib import Path


class LocalAgentSession:
    def __init__(self, repo_path: Path, model_name: str):
        self.repo_path = repo_path
        self.model_name = model_name
        self.worktree = WorktreeSession(repo_path, session_id="dev_01")
        self.budget = ContextBudget()
        self.delegator = TaskDelegation(self.budget)
        self.router = ToolRouter()
        self.router.register(FileSearchTool)

    async def run(self, instruction: str):
        # Outline of the loop; the steps are intentionally left unimplemented here.
        # Boot MLX server, load model, initialize inference context
        # Execute tool calls within worktree boundary
        # Delegate to sub-agents when budget threshold is crossed
        # Commit final state to worktree, merge if validated
        pass
```
**Architecture Rationale:** This design treats the agent as a state machine rather than a chatbot. Each component has a single responsibility: isolation guarantees safety, budgeting preserves performance, routing ensures reliability, and orchestration ties them together. The MLX server runs locally, eliminating network latency, while git worktrees provide atomic rollback capability.
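The state-machine framing can be made explicit with a transition table. The sketch below is one possible reading of the outline in `run`, not the article's implementation: the `Phase` names and the allowed transitions are assumptions chosen for illustration.

```python
from enum import Enum, auto


class Phase(Enum):
    PROVISION = auto()
    INFER = auto()
    TOOL_CALL = auto()
    DELEGATE = auto()
    COMMIT = auto()
    TEARDOWN = auto()


# Allowed transitions; anything not listed is rejected.
TRANSITIONS = {
    Phase.PROVISION: {Phase.INFER},
    Phase.INFER: {Phase.TOOL_CALL, Phase.DELEGATE, Phase.COMMIT},
    Phase.TOOL_CALL: {Phase.INFER},
    Phase.DELEGATE: {Phase.INFER},
    Phase.COMMIT: {Phase.TEARDOWN},
    Phase.TEARDOWN: set(),
}


def advance(current: Phase, nxt: Phase) -> Phase:
    """Return the next phase, or raise if the move is not in the table."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.name} -> {nxt.name}")
    return nxt
```

Rejecting unlisted moves means, for example, that a tool call can never lead straight to a commit without passing back through inference.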
## Pitfall Guide
### 1. Worktree Accumulation
**Explanation:** Forgetting to remove worktrees after sessions causes disk bloat and leaves stale worktree metadata behind. Each linked worktree has its own `HEAD` and index, stored under the main repository's `.git/worktrees/` directory, plus a `.git` file that points back to it.
**Fix:** Implement automatic lifecycle hooks. Run `git worktree prune` on session exit, and enforce a maximum concurrent worktree limit (typically 3 to 5). Add a cron job or pre-commit hook to clean orphaned directories older than 24 hours.
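A concurrency limit can be checked by counting entries in `git worktree list --porcelain`, which prints one `worktree <path>` line per entry with the main worktree first. The helper names below (`parse_worktrees`, `can_provision`) and the injectable `run` parameter are hypothetical, and the limit of 3 is an assumption that matches `max_concurrent` in the configuration template.

```python
import subprocess
from pathlib import Path
from typing import List

MAX_CONCURRENT = 3  # assumed limit


def parse_worktrees(porcelain: str) -> List[str]:
    """Extract worktree paths from `git worktree list --porcelain` output."""
    prefix = "worktree "
    return [line[len(prefix):] for line in porcelain.splitlines()
            if line.startswith(prefix)]


def can_provision(repo_root: Path, run=subprocess.run) -> bool:
    """Prune stale entries, then compare the linked-worktree count to the limit."""
    run(["git", "worktree", "prune"], cwd=repo_root, check=True)
    out = run(["git", "worktree", "list", "--porcelain"], cwd=repo_root,
              check=True, capture_output=True, text=True).stdout
    linked = parse_worktrees(out)[1:]  # skip the main worktree, listed first
    return len(linked) < MAX_CONCURRENT
```

Note that `git worktree prune` only clears metadata for worktrees whose directories are already gone; directories that still exist need `git worktree remove`.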
### 2. Context Window Over-Delegation
**Explanation:** Spawning sub-agents for trivial tasks introduces overhead that outweighs the benefit. Each delegation cycle requires context serialization, model re-initialization, and result parsing.
**Fix:** Set a minimum token threshold for delegation (e.g., >500 tokens of expected output). Use a cost-benefit calculator that compares delegation overhead against context retention gains. Keep simple edits in the primary context.
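A cost-benefit check of this kind can be very small. In the sketch below, the function name, the parameter names, and the 300-token overhead figure are all assumptions for illustration; only the 500-token minimum comes from the text above.

```python
def worth_delegating(expected_output_tokens: int,
                     intermediate_tokens: int,
                     overhead_tokens: int = 300,
                     min_output_tokens: int = 500) -> bool:
    """Decide whether a task should go to a sub-agent.

    intermediate_tokens: tokens of tool output and drafts that delegation
        would keep out of the primary context.
    overhead_tokens: assumed fixed cost of serializing context and parsing
        the sub-agent's result.
    """
    if expected_output_tokens <= min_output_tokens:
        return False  # too small to justify a delegation cycle
    return intermediate_tokens > overhead_tokens
```

For example, `worth_delegating(200, 5000)` is false because the expected output is below the minimum, while `worth_delegating(800, 5000)` is true.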
### 3. Blocking Calls in Async Tools
**Explanation:** Using synchronous database drivers, file locks, or network calls inside async tool methods blocks the event loop, causing the agent to hang during concurrent execution.
**Fix:** Wrap all blocking operations in `asyncio.to_thread()` or use native async libraries. Validate tool execution time with timeouts (`asyncio.wait_for`). Log execution duration to identify bottlenecks.
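The combination of `asyncio.to_thread` and `asyncio.wait_for` might look like the following minimal sketch (Python 3.9 or later). `blocking_lookup` and `run_blocking` are hypothetical names, and `time.sleep` stands in for a synchronous driver call.

```python
import asyncio
import time


def blocking_lookup(delay: float) -> str:
    """Stand-in for a synchronous driver call or file lock."""
    time.sleep(delay)
    return "done"


async def run_blocking(delay: float, timeout: float) -> dict:
    start = time.perf_counter()
    try:
        # to_thread moves the blocking call off the event loop;
        # wait_for bounds how long the caller waits for it.
        value = await asyncio.wait_for(
            asyncio.to_thread(blocking_lookup, delay), timeout
        )
        result = {"value": value, "error": None}
    except asyncio.TimeoutError:
        result = {"value": None, "error": f"timed out after {timeout}s"}
    # Record the duration so slow tools show up in logs.
    result["duration_s"] = round(time.perf_counter() - start, 3)
    return result
```

One caveat: a timeout stops the caller from waiting, but the worker thread keeps running until the blocking call itself returns, so the underlying operation should carry its own timeout where possible.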
### 4. MLX Memory Fragmentation
**Explanation:** Repeated inference calls without clearing the KV cache can fragment GPU memory on Apple Silicon. This leads to gradual slowdowns and eventual OOM errors during long sessions.
**Fix:** Implement explicit cache eviction between major task boundaries. Use `mlx.core.clear_cache()` or restart the inference server after context delegation cycles. Monitor memory usage with Metal profiling tools and set hard limits.
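Eviction at task boundaries can be driven by a small counter that is independent of the inference library. `CacheEvictor` is a hypothetical helper; the clearing function is injected so the sketch runs without MLX installed, and the default interval of 5 mirrors `cache_eviction_interval` in the configuration template.

```python
from typing import Callable


class CacheEvictor:
    """Calls `clear_fn` once every `interval` completed tasks."""

    def __init__(self, clear_fn: Callable[[], None], interval: int = 5):
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self.clear_fn = clear_fn
        self.interval = interval
        self.completed = 0

    def task_done(self) -> bool:
        """Record a finished task; return True if the cache was cleared."""
        self.completed += 1
        if self.completed % self.interval == 0:
            self.clear_fn()
            return True
        return False
```

With MLX available, `clear_fn` would be the cache-clearing call named in the fix above. It is worth checking which name the installed version exposes, since the location of this function has changed between MLX releases.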
### 5. Unversioned Tool Schemas
**Explanation:** Modifying tool parameters without versioning breaks existing agent prompts and causes validation failures. The model may generate outdated parameter names or omit required fields.
**Fix:** Version tool schemas explicitly (`v1`, `v2`). Include the schema version in the tool description. Implement backward-compatible deserialization that maps deprecated fields to new structures. Add integration tests that validate prompt-to-schema alignment.
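Backward-compatible deserialization can be as simple as a rename table applied before validation. The rename history below (`glob` to `pattern`, `limit` to `max_results`) is invented for illustration, as is the `upgrade_params` name.

```python
from typing import Any, Dict

# Hypothetical rename history for the search tool: old name -> current name.
DEPRECATED_FIELDS = {"glob": "pattern", "limit": "max_results"}


def upgrade_params(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map deprecated parameter names onto the current schema's names."""
    upgraded: Dict[str, Any] = {}
    for key, value in raw.items():
        new_key = DEPRECATED_FIELDS.get(key, key)
        # If a caller sends both spellings, the current name wins.
        if new_key in upgraded and key != new_key:
            continue
        upgraded[new_key] = value
    return upgraded
```

Running this step ahead of schema validation lets older prompts keep working while the tool description advertises only the current field names.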
### 6. Silent Rollback Failures
**Explanation:** Assuming `git worktree remove` always does the right thing. With `--force`, uncommitted changes are discarded without warning, and a locked worktree is not removed unless `--force` is given twice, so a single forced removal can either lose work or leave the worktree behind.
**Fix:** Verify worktree state before teardown. Run `git status --porcelain` to detect uncommitted changes. Implement a staged cleanup: attempt a commit, fall back to a stash, then force-remove. Log all teardown outcomes for audit trails.
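A reduced version of this cleanup, covering the status check, the stash fallback, and the removal (the commit attempt is omitted for brevity), might look like the sketch below. `safe_teardown` and its injectable `run` parameter are hypothetical names.

```python
import subprocess
from pathlib import Path


def safe_teardown(repo_root: Path, worktree: Path, run=subprocess.run) -> str:
    """Stash uncommitted work, if any, before removing the worktree."""
    status = run(["git", "status", "--porcelain"], cwd=worktree,
                 check=True, capture_output=True, text=True).stdout
    outcome = "clean"
    if status.strip():
        # --include-untracked also saves files that were never added.
        run(["git", "stash", "push", "--include-untracked",
             "-m", f"teardown {worktree.name}"], cwd=worktree, check=True)
        outcome = "stashed"
    run(["git", "worktree", "remove", "--force", str(worktree)],
        cwd=repo_root, check=True)
    return outcome  # log this value for the audit trail
```

Because `check=True` is set on every call, a failed step raises `subprocess.CalledProcessError` instead of passing unnoticed, and the returned outcome can be written to the session log.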
### 7. Mixed Execution Environments
**Explanation:** Running local tools against remote state (e.g., querying a production database from a local agent) creates inconsistent results and security risks. The agent assumes local determinism but receives external variance.
**Fix:** Enforce environment boundaries. Use mock data or local replicas for testing. Tag tools with `environment: local` or `environment: remote` and validate execution context before dispatch. Never allow local agents to mutate production state without explicit approval gates.
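The environment tag can be enforced with a check that runs before dispatch. `ToolSpec` and `allow_dispatch` are hypothetical names, and the policy shown (cross-environment calls need explicit approval) is one simple interpretation of the fix above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    environment: str  # "local" or "remote"


def allow_dispatch(tool: ToolSpec, session_env: str = "local",
                   approved: bool = False) -> bool:
    """Permit a call only inside its own environment, unless approved."""
    if tool.environment == session_env:
        return True
    # Crossing the boundary requires an explicit approval gate.
    return approved
```

A router would call `allow_dispatch` before validating parameters and return an error dictionary when it is false, so a blocked call never reaches the tool.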
## Production Bundle
### Action Checklist
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single-file refactor | Primary context + direct tool execution | Low token overhead, fast iteration | Minimal compute |
| Multi-module feature | Sub-agent delegation per module | Prevents context bloat, parallel execution | Moderate compute, higher memory |
| Research & synthesis | Async concurrent agents + reducer | Isolates data gathering, aggregates cleanly | High compute, linear scaling |
| CI/CD integration | Headless worktree + deterministic commit | Reproducible builds, safe rollback | Infrastructure overhead |
### Configuration Template
```yaml
agent_session:
  model: "mlx-community/Qwen3.6-27B-OptiQ-4bit"
  context_budget:
    max_tokens: 8192
    delegation_threshold: 0.75
    min_delegation_tokens: 500
  worktree:
    base_path: ".worktrees"
    max_concurrent: 3
    prune_after_hours: 24
  tools:
    - name: "search_codebase"
      enabled: true
      timeout_seconds: 10
    - name: "query_database"
      enabled: false
      environment: "local_only"
  inference:
    cache_eviction_interval: 5
    metal_memory_limit_gb: 12
    async_pool_size: 4
```
### Quick Start Guide
- **Initialize workspace isolation:** Create a dedicated worktree directory and run `git worktree add .worktrees/session_01 HEAD` to establish an atomic execution boundary.
- **Bootstrap the inference server:** Launch the MLX server with your target model, verify GPU memory allocation, and confirm token streaming latency under 50 ms.
- **Register tools and schemas:** Define Pydantic models for each tool, instantiate the router, and run a dry-run validation to ensure parameter alignment.
- **Execute with budgeting:** Start the agent loop, monitor context usage against the 75% threshold, and trigger sub-agent delegation when projected usage exceeds limits.
- **Validate and commit:** After task completion, verify worktree state, run automated tests, and merge the isolated commit into the main branch only after passing validation gates.