Updating Autonomous Agents Without Stopping Them: Implementing SIGHUP, plan.json Hot Reload, and Zero-Downtime Deployment
Zero-Downtime Orchestration: Hot-Reloading Long-Running Autonomous Agents
Current Situation Analysis
Autonomous agents that operate continuously—polling external APIs, maintaining MCP connections, or executing scheduled workflows—face a fundamental operational contradiction: they must evolve without interruption, yet traditional deployment models treat every change as a restart event.
The industry default remains systemctl restart or container recreation. This approach assumes cold starts are cheap and stateless. In practice, long-running agents violate both assumptions. A typical cold start involves re-authenticating with third-party services, re-establishing WebSocket or HTTP keep-alive connections, rebuilding in-memory caches, and resetting API rate-limit counters. Measured across production workloads, this initialization phase routinely consumes 30 to 60 seconds. When engineering teams patch logic or adjust thresholds multiple times daily, the cumulative initialization overhead frequently exceeds actual execution time.
This problem is systematically overlooked because development workflows prioritize local iteration over production continuity. Teams validate changes in isolated environments, then push updates that trigger full process termination. The hidden costs accumulate silently: fragmented audit logs, lost in-flight task states, duplicate API calls due to uncommitted writes, and cascading failures when dependent services reject reconnection storms.
The operational reality is clear: if an agent must run 24/7, treating it as a disposable process is a structural liability. The solution requires decoupling volatility from stability. Configuration, execution plans, and runtime logic must be swappable on the fly, while the core scheduler remains anchored. This layered hot-reload architecture eliminates downtime, preserves state continuity, and transforms patching from a disruptive event into a background operation.
WOW Moment: Key Findings
The performance and reliability gap between traditional restarts and layered hot-reloading is not marginal—it is architectural. By isolating mutable components from the execution kernel, teams can achieve continuous operation without sacrificing safety or observability.
| Approach | Downtime | State Integrity | Startup Latency | Implementation Risk |
|---|---|---|---|---|
| Full Process Restart | 30–60s per patch | Lost (cache, rate limits, in-flight tasks) | High (re-auth, reconnection) | Low (standard tooling) |
| In-Process Module Reload | 0s | Fragile (stale references, class mismatches) | None | High (debugging complexity) |
| Layered Hot-Reload | 0s | Preserved (atomic swaps, state decoupling) | None | Medium (requires architectural discipline) |
This finding matters because it shifts the deployment paradigm from "stop, patch, resume" to "observe, swap, continue." The layered approach enables continuous delivery for autonomous systems without sacrificing reliability. It also establishes a clear boundary: core scheduling logic remains stable, while peripheral components adapt dynamically. Teams gain the ability to adjust thresholds, swap execution plans, and update agent implementations without breaking active workflows or triggering external service rate limits.
Core Solution
The architecture divides the agent runtime into four distinct layers, each with a dedicated reload mechanism. The design prioritizes atomicity, fail-safe defaults, and process isolation.
1. Signal-Driven Configuration Swapping
Configuration values (API keys, polling intervals, retry thresholds) change frequently but require immediate propagation. UNIX SIGHUP provides a standardized, low-overhead trigger. The implementation uses a thread-safe vault to hold the active configuration, ensuring readers never observe partial updates.
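import json
import logging
import signal
import threading
from typing import Any, Dict


class ConfigVault:
    def __init__(self, path: str):
        self._path = path
        self._data: Dict[str, Any] = {}
        self._lock = threading.RLock()
        self._load()  # fail fast at startup if the file is missing or malformed

    def _load(self) -> None:
        # Parse outside the lock so a slow or failing read never blocks readers.
        with open(self._path, "r", encoding="utf-8") as fh:
            raw = json.load(fh)
        if not isinstance(raw, dict):
            raise ValueError("config root must be a JSON object")
        with self._lock:
            self._data = raw  # single reference swap

    def get(self, key: str, default: Any = None) -> Any:
        with self._lock:
            return self._data.get(key, default)

    def reload(self) -> None:
        try:
            self._load()
            logging.info("Configuration vault refreshed successfully.")
        except Exception as exc:
            # Any failure leaves self._data untouched.
            logging.error("Config reload failed. Retaining previous state. | %s", exc)


def install_reload_handler(vault: ConfigVault) -> None:
    def _handler(signum: int, frame: Any) -> None:
        # Runs in the main thread. It reloads inline for brevity;
        # Pitfall 5 describes the safer flag-based variant.
        vault.reload()

    signal.signal(signal.SIGHUP, _handler)  # SIGHUP is not defined on Windows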
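Rationale: _load() parses the file before taking the lock and then swaps a single reference under it, so readers see either the old dictionary or the new one, never a partial update. If the new file is malformed, the exception is raised before the swap and caught in reload(), so the previous configuration remains active. This fail-safe behavior ensures the scheduler never operates with corrupted parameters.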
2. File-System Polling for Execution Plans
Execution plans (plan.json) define agent topology, dependencies, and scheduling intervals. Instead of signals, we monitor the file's modification timestamp. A background thread polls at a safe interval, validates the structure, and performs an atomic swap.
import json
import logging
import os
import threading
import time
from typing import Any, Dict


class PlanRegistry:
    def __init__(self, path: str, poll_interval: float = 2.0):
        self._path = path
        self._interval = poll_interval
        self._active_plan: Dict[str, Any] = {}
        self._lock = threading.Lock()
        self._last_mtime = 0.0
        # Daemon thread: it exits with the process and never blocks shutdown.
        self._watcher = threading.Thread(target=self._poll_loop, daemon=True)
        self._watcher.start()

    def _poll_loop(self) -> None:
        while True:
            try:
                current_mtime = os.stat(self._path).st_mtime
                # "!=" also catches a rollback to a file with an older mtime.
                if current_mtime != self._last_mtime:
                    self._apply_update()
                    self._last_mtime = current_mtime
            except Exception as exc:
                # Unreadable or half-written file: keep the active plan, retry next tick.
                logging.warning("Plan poll error: %s", exc)
            time.sleep(self._interval)

    def _apply_update(self) -> None:
        with open(self._path, "r", encoding="utf-8") as fh:
            candidate = json.load(fh)
        if self._validate(candidate):
            with self._lock:
                self._active_plan = candidate
            logging.info("Execution plan updated atomically.")
        else:
            logging.error("Plan validation failed. Discarding update.")

    def _validate(self, plan: Dict[str, Any]) -> bool:
        # Placeholder: only checks the required top-level keys. A full check would
        # also cover field types and a topological sort of agent dependencies.
        return isinstance(plan, dict) and "agents" in plan and "schedule" in plan

    def get_plan(self) -> Dict[str, Any]:
        # Shallow copy; returns {} until the first poll has completed.
        with self._lock:
            return self._active_plan.copy()
Rationale: Polling avoids signal dependency and works across platforms. The validation step prevents invalid topologies from entering the scheduler. State separation is critical: runtime metrics (last_run, consecutive_failures) are stored in a separate persistence layer, not inside the plan file. This ensures that plan reloads never erase execution history.
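The `_validate` method above only checks for two top-level keys, while its comment mentions a topological sort check. The following is a minimal sketch of what such a check might look like. The plan schema is an assumption, since the article does not define one: `agents` is taken to be a list of objects with a `name` and an optional `depends_on` list, and `validate_plan` is a hypothetical helper name. It relies on `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter
from typing import Any, Dict, List


def validate_plan(plan: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the plan is acceptable.

    Assumed (hypothetical) schema: plan["agents"] is a list of
    {"name": str, "depends_on": [str, ...]} and plan["schedule"] is an object.
    """
    problems: List[str] = []
    agents = plan.get("agents")
    if not isinstance(agents, list):
        return ["'agents' must be a list"]
    if not isinstance(plan.get("schedule"), dict):
        problems.append("'schedule' must be an object")

    # Map each agent name to the names it depends on.
    graph: Dict[str, List[str]] = {}
    for agent in agents:
        if not isinstance(agent, dict) or not isinstance(agent.get("name"), str):
            problems.append(f"agent entry without a string 'name': {agent!r}")
            continue
        deps = agent.get("depends_on", [])
        if not isinstance(deps, list) or not all(isinstance(d, str) for d in deps):
            problems.append(f"{agent['name']}: 'depends_on' must be a list of names")
            deps = []
        graph[agent["name"]] = deps

    # Every dependency must refer to a declared agent.
    for name, deps in graph.items():
        for dep in deps:
            if dep not in graph:
                problems.append(f"{name} depends on unknown agent {dep}")

    # A cycle means no valid execution order exists.
    try:
        tuple(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        problems.append(f"dependency cycle: {exc.args[1]}")
    return problems
```

Returning a list of problems rather than a bare boolean lets the registry log why an update was discarded, which helps when a rejected plan would otherwise fail silently.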
3. Process Isolation for Runtime Code
Reloading Python modules in-process with importlib.reload leaves stale references behind. The reload rebinds the names inside the module, but objects created earlier still point to the old class objects, so isinstance checks against the reloaded class fail and dependency injection wired to the old definitions breaks. The production-safe alternative is subprocess isolation.
import json
import logging
import subprocess
import sys
import threading
from typing import Any, Dict, Set


class TaskSandbox:
    def __init__(self, timeout: int = 300):
        self._timeout = timeout
        self._running: Set[str] = set()
        self._guard = threading.Lock()  # makes the check-and-add below atomic

    def dispatch(self, task_id: str, payload: Dict[str, Any]) -> str:
        with self._guard:
            if task_id in self._running:
                logging.warning("Task %s already executing. Skipping duplicate.", task_id)
                return ""
            self._running.add(task_id)
        try:
            proc = subprocess.run(
                # sys.executable: the same interpreter and virtualenv as the daemon
                [sys.executable, "-m", "agent_runtime.executor", payload["module"]],
                input=json.dumps(payload),
                capture_output=True,
                text=True,
                encoding="utf-8",
                timeout=self._timeout,  # raises subprocess.TimeoutExpired on overrun
            )
            if proc.returncode != 0:
                raise RuntimeError(f"Executor failed: {proc.stderr.strip()}")
            return proc.stdout
        finally:
            with self._guard:
                self._running.discard(task_id)
Rationale: Each task spawns a fresh interpreter. Code changes on disk are immediately available for the next cycle without memory pollution. The startup overhead, cited here as 200–300 ms and dependent on how much each task imports, is negligible compared to typical task durations (seconds to minutes). The execution guard (_running set) prevents the same task_id from being dispatched twice at once, for example during rapid plan reloads.
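The claim that a fresh interpreter always sees the code currently on disk can be checked with a small experiment. The sketch below is illustrative only: `demo_agent` and its `answer()` function are hypothetical stand-ins for an agent module, not part of the article's `agent_runtime` package. It passes `-B` so that no bytecode cache is written, since a cached `.pyc` can mask an edit made within the same second that leaves the file size unchanged.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def run_in_fresh_interpreter(module_dir: Path, module_name: str) -> str:
    """Import module_name in a new Python process and return what it prints."""
    proc = subprocess.run(
        # -B: do not write .pyc files, so every run compiles the current source.
        [sys.executable, "-B", "-c",
         f"import {module_name}; print({module_name}.answer())"],
        cwd=module_dir,  # with -c, the working directory is on sys.path
        capture_output=True,
        text=True,
        timeout=30,
        check=True,
    )
    return proc.stdout.strip()


def demo() -> Tuple[str, str]:
    """Edit a throwaway module between two runs and return both outputs."""
    with tempfile.TemporaryDirectory() as tmp:
        module_dir = Path(tmp)
        source = module_dir / "demo_agent.py"  # hypothetical agent module
        source.write_text("def answer():\n    return 'v1'\n", encoding="utf-8")
        first = run_in_fresh_interpreter(module_dir, "demo_agent")
        # Patch the code on disk; no daemon state needs to be touched.
        source.write_text("def answer():\n    return 'v2'\n", encoding="utf-8")
        second = run_in_fresh_interpreter(module_dir, "demo_agent")
        return first, second
```

Calling `demo()` runs the module once per version, and the second run picks up the patched function without any reload logic in the parent process.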
4. Core Daemon Stability
The scheduler, signal handlers, and process managers constitute the core. These components are intentionally excluded from hot-reload mechanisms. Changes to the core require a full restart, which is acceptable given their low mutation frequency (typically monthly). This boundary enforces architectural discipline: volatility is pushed to the edges, stability remains at the center.
Pitfall Guide
1. In-Process Module Reloading
Explanation: Using importlib.reload leaves dangling references. Objects instantiated before the reload hold old class definitions, causing type mismatches and silent logic failures.
Fix: Isolate execution in subprocesses. Never reload modules that maintain state or are referenced by long-lived objects.
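The stale-reference problem is easy to reproduce. The sketch below writes a throwaway module (the name `demo_worker` is hypothetical), creates an instance, patches the file, and calls `importlib.reload`. Bytecode writing is switched off for the duration so that a cached `.pyc` cannot mask the quick edit.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import Any, Dict


def demo_stale_reference() -> Dict[str, Any]:
    """Reload a throwaway module and report what happens to an old instance."""
    previous_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # keep a cached .pyc from hiding the edit
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "demo_worker.py"  # hypothetical module name
        source.write_text("class Worker:\n    version = 1\n", encoding="utf-8")
        sys.path.insert(0, tmp)
        try:
            import demo_worker

            old_class = demo_worker.Worker
            old_instance = demo_worker.Worker()

            # Patch the source on disk, then reload in-process.
            source.write_text(
                "class Worker:\n    version = 2  # patched\n", encoding="utf-8"
            )
            importlib.reload(demo_worker)

            return {
                "new_class_version": demo_worker.Worker.version,
                "old_instance_version": old_instance.version,
                "isinstance_of_new_class": isinstance(old_instance, demo_worker.Worker),
                "classes_identical": old_class is demo_worker.Worker,
            }
        finally:
            # Undo the global changes made for the demonstration.
            sys.path.remove(tmp)
            sys.modules.pop("demo_worker", None)
            sys.dont_write_bytecode = previous_flag
```

The old instance keeps reporting version 1 and is no longer an instance of the reloaded `Worker`, which is the mismatch that breaks long-lived objects in a daemon.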
2. Non-Atomic Configuration Writes
Explanation: Writing directly to config.json while the daemon reads it causes partial reads, resulting in JSONDecodeError or missing keys.
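Fix: Write to a temporary file in the same directory as the target, then use os.replace() for an atomic rename. Because the rename is atomic only within a single filesystem, keeping both files together ensures readers see either the complete old file or the complete new one.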
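A minimal sketch of the temp-file-then-rename pattern follows. The helper name `write_json_atomically` is hypothetical, and the `fsync` call is an extra durability step beyond what the fix strictly requires.

```python
import json
import os
import tempfile
from typing import Any, Dict


def write_json_atomically(path: str, data: Dict[str, Any]) -> None:
    """Write data to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target: os.replace() is only atomic
    # when source and destination live on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())  # push the contents to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that `mkstemp` creates the file readable only by its owner, so the replaced file inherits those permissions. Adjust them with `os.chmod` before the rename if the daemon runs as a different user.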
3. Coupling Runtime State with Declarative Plans
Explanation: Storing last_run or retry_count inside plan.json means every reload resets execution history, breaking rate-limit tracking and backoff logic.
Fix: Maintain a separate state store (SQLite, Redis, or memory-mapped file). The plan file should only declare intent; the state store tracks reality.
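As one possible shape for such a state store, the sketch below keeps `last_run` and `consecutive_failures` in SQLite. The `StateStore` class, table name, and column layout are assumptions made for illustration. The connection is used from a single thread here, which matches `sqlite3`'s default same-thread check.

```python
import sqlite3
import time
from typing import Optional, Tuple


class StateStore:
    """Keeps per-agent runtime state outside plan.json (hypothetical schema)."""

    def __init__(self, path: str):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS agent_state ("
            "agent TEXT PRIMARY KEY, "
            "last_run REAL, "
            "consecutive_failures INTEGER NOT NULL DEFAULT 0)"
        )
        self._db.commit()

    def get(self, agent: str) -> Tuple[Optional[float], int]:
        """Return (last_run, consecutive_failures); (None, 0) if never run."""
        row = self._db.execute(
            "SELECT last_run, consecutive_failures FROM agent_state WHERE agent = ?",
            (agent,),
        ).fetchone()
        return (row[0], row[1]) if row else (None, 0)

    def record_run(self, agent: str, ok: bool, now: Optional[float] = None) -> None:
        """Store the outcome of one run; a success resets the failure streak."""
        now = time.time() if now is None else now
        _, failures = self.get(agent)
        failures = 0 if ok else failures + 1
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO agent_state "
                "(agent, last_run, consecutive_failures) VALUES (?, ?, ?)",
                (agent, now, failures),
            )
```

Because the table is keyed by agent name, swapping in a new plan leaves the history of every agent that still exists untouched, and backoff logic can keep reading `consecutive_failures` across reloads.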
4. Blocking the Event Loop with File Polling
Explanation: Synchronous time.sleep() in the main thread halts scheduling, causing missed intervals and task drift.
Fix: Run the poller in a daemon thread or use asynchronous file watchers (watchdog, asyncio loops). Keep the scheduler loop unblocked.
5. Signal Handler Race Conditions
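Explanation: Signal handlers execute asynchronously. In CPython, a Python-level handler runs in the main thread between bytecode instructions, so it can interrupt the scheduler at almost any point. Doing file I/O, logging, or lock acquisition inside the handler (as json.load and logging calls do) risks re-entrancy errors or deadlocks when the interrupted code was using the same resource.
Fix: Set a flag in the handler and perform the actual reload in the main loop. Alternatively, block the signal and wait for it with signal.sigwait() in a dedicated thread.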
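The flag-based fix can be sketched as follows. `ReloadRequest` and `install` are hypothetical names. The flag is a plain boolean rather than a `threading.Event`, because setting an `Event` takes an internal lock that the interrupted main thread might already hold.

```python
import signal
from typing import Callable


class ReloadRequest:
    """Lets a signal handler request a reload that the main loop performs."""

    def __init__(self) -> None:
        self._requested = False

    def handler(self, signum: int, frame: object) -> None:
        # Only flip a flag here: no file I/O, logging, or locking.
        self._requested = True

    def run_if_requested(self, reload: Callable[[], None]) -> bool:
        """Call once per scheduler iteration; returns True if a reload ran."""
        if not self._requested:
            return False
        # Clear before reloading so a signal arriving mid-reload is not lost.
        self._requested = False
        reload()
        return True


def install(request: ReloadRequest) -> bool:
    """Register the handler for SIGHUP where the platform defines it."""
    sighup = getattr(signal, "SIGHUP", None)
    if sighup is None:  # for example on Windows
        return False
    signal.signal(sighup, request.handler)  # must be called from the main thread
    return True
```

In the daemon, the scheduler loop would call something like `request.run_if_requested(vault.reload)` on each iteration, so the file read and logging happen in ordinary code rather than inside the handler.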
6. Ignoring Cross-Platform Signal Limitations
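Explanation: Windows does not support SIGHUP. Python's signal module does not define signal.SIGHUP there, so a daemon that registers the handler unconditionally raises AttributeError at startup, and mixed-environment deployments break.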
Fix: Implement a fallback mechanism. On non-UNIX systems, trigger configuration reloads via an HTTP endpoint, a named pipe, or a dedicated control file.
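Of the fallbacks listed, the control file is the simplest to sketch. In this illustration an operator creates a file to request a reload and the daemon consumes it on its next loop iteration. The function name and the `reload.request` file name are assumptions.

```python
import os
import tempfile
from typing import Callable, List, Tuple


def check_control_file(path: str, reload: Callable[[], None]) -> bool:
    """Run reload() once if the control file exists, consuming the file."""
    try:
        os.remove(path)  # removing first makes each request fire exactly once
    except FileNotFoundError:
        return False
    reload()
    return True


def demo() -> Tuple[bool, bool, List[str]]:
    """Request one reload, then poll twice to show the request is consumed."""
    events: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        control = os.path.join(tmp, "reload.request")  # hypothetical file name
        open(control, "w").close()  # the operator's "please reload" marker
        first = check_control_file(control, lambda: events.append("reloaded"))
        second = check_control_file(control, lambda: events.append("reloaded"))
    return first, second, events
```

The same check works on every platform, so it can run alongside the SIGHUP handler on UNIX hosts and replace it where the signal is unavailable.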
7. Unbounded Subprocess Spawning
Explanation: Rapid plan updates or misconfigured intervals can spawn hundreds of subprocesses, exhausting file descriptors and memory.
Fix: Implement a concurrency limiter. Track active PIDs, enforce maximum parallelism, and kill orphaned processes on daemon shutdown.
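One way to enforce such a limit is a bounded semaphore around subprocess launches. The sketch below is an assumption-laden illustration, not the article's implementation: `BoundedLauncher` is a hypothetical class, and `max_parallel` would plausibly come from a setting such as `max_concurrent_tasks`.

```python
import subprocess
import sys
import threading
from typing import List, Optional, Set


class BoundedLauncher:
    """Refuses to start more than max_parallel subprocesses at once."""

    def __init__(self, max_parallel: int):
        self._slots = threading.BoundedSemaphore(max_parallel)
        self._lock = threading.Lock()
        self._active: Set[subprocess.Popen] = set()

    def run(self, argv: List[str], timeout: float) -> Optional[int]:
        """Run argv to completion; return its exit code, or None if at capacity."""
        if not self._slots.acquire(blocking=False):
            return None  # caller decides whether to retry later or drop the task
        proc = None
        try:
            proc = subprocess.Popen(argv)
            with self._lock:
                self._active.add(proc)
            try:
                return proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                proc.kill()  # do not leave the child running after a timeout
                proc.wait()
                raise
        finally:
            if proc is not None:
                with self._lock:
                    self._active.discard(proc)
            self._slots.release()

    def kill_all(self) -> None:
        """Call on daemon shutdown so no child outlives the parent."""
        with self._lock:
            children = list(self._active)
        for proc in children:
            proc.kill()
```

Rejecting work when all slots are taken, instead of queueing it, keeps memory bounded. A scheduler that would rather wait can call `acquire` with a timeout instead.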
Production Bundle
Action Checklist
- Separate configuration, execution plans, and runtime code into distinct files
- Implement atomic file writes using temporary files and os.replace()
- Add execution guards to prevent duplicate task dispatches during reloads
- Decouple runtime state from declarative plan definitions
- Validate all incoming plans with topological sort and schema checks before swapping
- Run file pollers in background threads to avoid blocking the scheduler
- Establish a clear boundary: core daemon changes require restart, peripherals hot-reload
- Add observability hooks to log reload events, validation failures, and subprocess exits
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Adjust polling interval or API key | SIGHUP config reload | Immediate propagation, zero downtime, low risk | None |
| Change agent dependency graph or schedule | Plan file hot-reload | Validates topology, preserves state, atomic swap | Minimal (polling overhead) |
| Fix logic bug in agent implementation | Subprocess isolation | Fresh interpreter, no reference corruption, safe rollback | Low (200-300ms per task) |
| Modify scheduler core or signal handling | Full daemon restart | Core changes cannot be safely hot-swapped | High (30-60s downtime) |
Configuration Template
{
  "api_endpoint": "https://api.example.com/v2",
  "rate_limit_per_minute": 120,
  "retry_backoff_base": 2,
  "max_concurrent_tasks": 8,
  "plan_path": "/etc/agent/plan.json",
  "state_store": "/var/lib/agent/state.db",
  "reload_poll_interval_sec": 2.0
}
# daemon_entrypoint.py
import logging

# Assumes install_reload_handler lives in config_vault.py alongside ConfigVault.
from config_vault import ConfigVault, install_reload_handler
from plan_registry import PlanRegistry
from task_sandbox import TaskSandbox


def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
    vault = ConfigVault("/etc/agent/config.json")
    install_reload_handler(vault)
    registry = PlanRegistry(vault.get("plan_path", "plan.json"), vault.get("reload_poll_interval_sec", 2.0))
    # "max_task_timeout" is not in the template above, so the 300 s default applies.
    sandbox = TaskSandbox(timeout=vault.get("max_task_timeout", 300))
    logging.info("Agent daemon initialized. Listening for SIGHUP and plan updates.")
    # Main scheduling loop consumes registry.get_plan() and dispatches via sandbox.dispatch()
    # ...


if __name__ == "__main__":
    main()
Quick Start Guide
- Create the directory structure: /etc/agent/ for config and plans, /var/lib/agent/ for state, /opt/agent/runtime/ for executor modules.
- Deploy the base daemon: Install the core scheduler, config vault, plan registry, and task sandbox. Start the process normally.
- Trigger initial load: Place config.json and plan.json in the expected paths. Send kill -HUP <pid> to load configuration. The daemon will begin polling the plan file.
- Validate hot-reload: Modify plan.json or config.json. Verify logs show atomic swaps and successful validation. Dispatch a test task to confirm subprocess isolation works.
- Monitor and iterate: Track reload events, subprocess exit codes, and state persistence. Adjust polling intervals and concurrency limits based on workload characteristics.
