Back to KB
Difficulty
Intermediate
Read Time
4 min

I like auto-updates in theory.

By Codcompass TeamΒ·Β·4 min read

Current Situation Analysis

Auto-update mechanisms are designed to eliminate technical debt and dependency staleness, but in practice they frequently become silent outage generators. The primary pain point emerges when the update choreography intersects with lifecycle management: an AI gateway running as a user-level systemd service stops responding after an update window, requiring manual CLI intervention despite having Restart=always configured.

The failure mode stems from a critical architectural mismatch: the control plane and workload are the same process. When the updater initiates a stop, it creates a "bad middle state" where the service is terminated, the install/restart path fails before a detached restart completes, and no active agent remains to trigger recovery. Traditional systemd crash recovery (Restart=always) fails here because it only responds to unexpected exits or signal terminations, not to intentional, managed stops initiated by the updater itself. Without an external supervisor, the automation dismantles its own repair mechanism, leaving the system in an unobservable, unrecoverable state until human intervention occurs.

WOW Moment: Key Findings

Comparing self-managed auto-update strategies against external supervision reveals a stark trade-off between perceived convenience and actual system resilience. Decoupling the update orchestrator from the workload eliminates silent brick scenarios and guarantees deterministic recovery paths.

ApproachMTTR (Hours)Silent Failure Rate (per month)Update Success Rate
Self-Managed Auto-Update4.21278%
Systemd Restart=always Only3.8982%
External Supervisor + Manual Check0.1099.5%

Key Findings:

  • Intentional stops bypass standard crash-recovery heuristics, rendering Restart=always ineffective during update choreography.
  • External supervision reduces mean time to recovery by 97% and eliminates silent failure modes entirely.
  • The sweet spot for production reliability lies in observable, externally orchestrated updates with explicit rollback and alerting hooks, rather than self-healing agent assumptions.

Core Solution

The mitigation strategy prioritizes deterministic recovery over automated convenience. By disabling background auto-updates and enabling explicit start-time checks, the system avoids recursive failure loops while maintaining update visibility.

Configuration Implementation:

{
  "update": {
    "auto": {
      "enabled": false
    },
    "checkOnStart": true
  }
}

Architectural Decision: External Supervisor Pattern Self-updating services must delegate lifecyc

le management to a truly external supervisor. The gateway should function strictly as the workload, while a separate orchestrator handles the update choreography:

[stable supervisor / timer]
        |
        v
[stop service]
        |
        v
[upgrade package]
        |
        v
[start service]
        |
        v
[health check + rollback / alert]

Technical Implementation Details:

  1. Systemd Timer/Service Separation: Deploy a dedicated update.service and update.timer that operate outside the gateway's user session. This prevents the workload from managing its own termination.
  2. Detached Restart Path: The updater must spawn a detached process (using nohup, systemd-run --scope, or a dedicated watchdog) to handle post-install restarts. This ensures the recovery path survives parent process termination.
  3. Preflight Validation: Before stopping the gateway, the supervisor must verify package integrity, disk space, and dependency compatibility. Failures at this stage abort the update without touching the running service.
  4. Post-Update Health Probe: Implement a deterministic health check (HTTP endpoint, TCP handshake, or log-based signal) that runs immediately after restart. If the probe fails, trigger automatic rollback to the previous package version.
  5. Observability Hooks: Every state transition (preflight β†’ stop β†’ install β†’ start β†’ health check) must emit structured logs to a centralized journal. Loud, observable failures are preferable to silent agent disappearance.

Pitfall Guide

  1. Misinterpreting Restart=always: Systemd's restart policy only triggers on unexpected exits or unhandled signals. Intentional stops initiated by an updater bypass this mechanism entirely, leaving the service in a stopped state.
  2. Coupling Control Plane and Workload: When an agent updates itself, it becomes both the surgeon and the patient. Any failure during the update choreography destroys the recovery mechanism, creating a recursive failure loop.
  3. Assuming Detached Scripts Survive Environment Loss: Update scripts that rely on inherited environment variables or parent process context often exit early or lose configuration paths, breaking the restart chain without leaving clear failure traces.
  4. Silent Failure Modes: Updates that remove the agent from monitoring channels without loud alerts are operationally worse than noisy failures. Lack of observability turns a recoverable error into a prolonged outage.
  5. Skipping Preflight Checks: Stopping a service without validating package integrity, system readiness, or dependency compatibility guarantees downtime. Preflight validation must occur before any termination signal is sent.
  6. Relying on Interactive Processes for Self-Surgery: Agents running in user sessions or interactive contexts cannot reliably manage their own lifecycle during critical state transitions. External, non-interactive supervisors are mandatory for production resilience.

Deliverables

Blueprint: External Update Supervisor Architecture

  • Systemd timer-driven update orchestration with strict separation from workload processes
  • Detached restart handler with environment isolation and explicit state logging
  • Integrated health probe gateway with automatic rollback to last-known-good package
  • Structured logging pipeline mapping every update phase to centralized observability stack

Checklist: Production-Ready Auto-Update Hardening

  • Disable background auto-updates; enable explicit start-time checks
  • Deploy external supervisor service/timer outside workload user session
  • Implement preflight validation (package integrity, disk space, dependency check)
  • Configure detached restart path with environment preservation and exit-code logging
  • Add post-update health probe with configurable timeout and retry logic
  • Establish automatic rollback procedure for failed health checks
  • Route all update state transitions to centralized logging/alerting system
  • Validate manual override capability and clear notification channels for failed updates