
Kanban in Hermes Agent for Self Hosted LLM Workflows

By Codcompass Team · 8 min read

Deterministic Concurrency Control for Self-Hosted LLM Task Queues

Current Situation Analysis

Autonomous agent frameworks are typically architected around the assumption of infinitely scalable inference endpoints. When paired with self-hosted LLM runtimes like Ollama, vLLM, or llama.cpp, this architectural mismatch creates a critical failure mode: unbounded task dispatching. The Hermes Agent Kanban system provides a durable, SQLite-backed state machine (~/.hermes/kanban.db) to track task lifecycles across lanes. Its dispatcher component scans for ready cards, claims them atomically, and spawns isolated worker profiles. However, the default configuration lacks a global concurrency governor.

The configuration surface only exposes dispatch_in_gateway and dispatch_interval_seconds. There is no native max_active_tasks parameter wired into the dispatch path. When hermes kanban dispatch executes, it pulls every eligible card into execution during that tick. For cloud APIs, this is acceptable because rate limiting and auto-scaling occur upstream. For local GPU clusters, it triggers immediate resource exhaustion. The inference server's request queue fills, VRAM fragments, context switching thrashes, and latency spikes into timeout territory. Teams often mistake this for a model performance issue when it is actually a scheduling architecture problem.

This gap is frequently overlooked because agent frameworks prioritize task throughput over hardware preservation. The dispatcher treats the LLM gateway as a stateless function endpoint rather than a constrained compute resource. Without explicit pacing, background pipelines, interactive queries, and maintenance jobs compete for the same VRAM pool, causing cascading failures that are difficult to diagnose post-mortem.

WOW Moment: Key Findings

The core insight emerges when comparing how different dispatch strategies interact with fixed hardware constraints. The following table contrasts four common approaches against three critical operational metrics.

| Dispatch Strategy | GPU Utilization Stability | Request Latency Variance | Timeout Rate Under Load |
|---|---|---|---|
| Unbounded Gateway Dispatch | Spikes to 100%, then thrashes | High (200ms → 15s+) | Critical (>40%) |
| CLI --max Cap (Per-Tick) | Moderate, but drifts over time | Medium (500ms → 8s) | Elevated (~25%) |
| Slot-Aware Cron Controller | Stable (70-85% target range) | Low (predictable 200-600ms) | Minimal (<2%) |
| Dependency-Driven Sequencing | Predictable, phase-gated | Very Low (serialized) | Near Zero |

The data reveals a fundamental truth: limiting new spawns per tick does not equal limiting concurrent execution. The --max flag only restricts how many tasks the dispatcher claims during a single scan. It does not account for tasks already running. A slot-aware controller that calculates available_capacity = target_concurrency - active_tasks before dispatching maintains hardware within safe operating boundaries. This transforms the system from reactive crash recovery to proactive resource governance, enabling stable long-running pipelines without manual intervention.
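The difference is easy to demonstrate in a few lines of Python (an illustrative sketch, not part of the Hermes CLI):

```python
def per_tick_cap(active: int, max_flag: int) -> int:
    # --max only limits new claims this tick; it ignores tasks already running
    return max_flag

def slot_aware_cap(active: int, target: int) -> int:
    # dispatch only into remaining capacity, never below zero
    return max(0, target - active)

# With 5 tasks already running and a target concurrency of 4:
print(per_tick_cap(5, 3))    # claims 3 more -> 8 tasks running
print(slot_aware_cap(5, 4))  # claims 0 -> stays at the ceiling
```

The per-tick cap happily overshoots the target; the slot-aware calculation cannot.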

Core Solution

Implementing deterministic concurrency control requires decoupling the dispatch trigger from the gateway process, introducing state-aware capacity calculation, and modeling task relationships explicitly. The following architecture ensures the inference server never receives more concurrent requests than the hardware can sustain.

Step 1: Isolate the Dispatch Path

Gateway-embedded dispatch and external daemon dispatch cannot safely share the same SQLite board. Concurrent claim attempts create race conditions that corrupt task state. Disable the embedded dispatcher and route all scheduling through an external controller.

# ~/.hermes/config.yaml
kanban:
  dispatch_in_gateway: false
  dispatch_interval_seconds: 0

Setting dispatch_interval_seconds to 0 disables the internal ticker. The board remains fully functional for manual CLI operations, but automated promotion is now exclusively controlled by your external scheduler.
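To see why a single claim path matters, here is a minimal sketch of atomic claiming against a hypothetical cards table (illustrative only, not the actual Hermes schema). The UPDATE's WHERE clause ensures at most one claimant flips a given row:

```python
import sqlite3

def claim_card(conn: sqlite3.Connection, card_id: int) -> bool:
    # The UPDATE succeeds only if the card is still 'ready'; two competing
    # dispatchers cannot both transition the same row.
    cur = conn.execute(
        "UPDATE cards SET status = 'running' WHERE id = ? AND status = 'ready'",
        (card_id,),
    )
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO cards VALUES (1, 'ready')")

print(claim_card(conn, 1))  # True: first claim wins
print(claim_card(conn, 1))  # False: second claimant sees 'running' and backs off
```

Two independent schedulers racing on the same board break exactly this invariant, which is why the dispatch path must be singular.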

Step 2: Build a Slot-Aware Dispatch Controller

The original CLI --max parameter only caps spawns per execution. To enforce a hard concurrency ceiling, you must query the current running count, calculate remaining capacity, and pass that value to the dispatch command. The following Python controller implements this logic with robust error handling and structured logging.

#!/usr/bin/env python3
"""
kanban_flow_controller.py
Calculates available concurrency slots and dispatches tasks safely.
"""

import subprocess
import sys
import logging
import os

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [KANBAN-CTRL] %(levelname)s: %(message)s"
)

TARGET_CONCURRENCY = int(os.getenv("KANBAN_MAX_PARALLEL", "2"))
BOARD_NAME = os.getenv("KANBAN_BOARD_ID", "")
HERMES_BIN = os.getenv("HERMES_PATH", "hermes")

def run_hermes_cmd(args: list[str]) -> str:
    cmd = [HERMES_BIN, "kanban"] + args
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout.strip()
    except subprocess.CalledProcessError as exc:
        logging.error("Hermes CLI failed: %s", exc.stderr)
        sys.exit(1)

def get_active_task_count() -> int:
    output = run_hermes_cmd(["list", "--status", "running"])
    if not output or "(no matching tasks)" in output:
        return 0
    return len(output.splitlines())

def dispatch_remaining_slots(available: int) -> None:
    if available <= 0:
        logging.info("Concurrency limit reached. Skipping dispatch.")
        return
    
    board_flag = ["--board", BOARD_NAME] if BOARD_NAME else []
    logging.info("Dispatching up to %d tasks.", available)
    run_hermes_cmd(board_flag + ["dispatch", "--max", str(available)])

def main() -> None:
    active = get_active_task_count()
    remaining = TARGET_CONCURRENCY - active
    
    logging.info("Active: %d | Target: %d | Available: %d", active, TARGET_CONCURRENCY, remaining)
    dispatch_remaining_slots(remaining)

if __name__ == "__main__":
    main()


Architecture Rationale:

  • Python over Bash: native subprocess handling, structured logging, and cleaner environment-variable parsing; avoids shell quoting pitfalls in production cron environments.
  • State query first: list --status running reads the board's SQLite state before any new claims occur, so the count reflects tasks already in flight.
  • Dynamic --max injection: the controller computes remaining and passes it to dispatch --max, so the total number of running tasks never exceeds TARGET_CONCURRENCY, regardless of queue depth.

Step 3: Model Sequential Dependencies

Not all workloads benefit from parallelism. Data pipelines, infrastructure migrations, and shared-state operations require strict ordering. Hermes Kanban supports parent-child relationships that gate child dispatch until the parent reaches a terminal state.
#!/usr/bin/env bash
# setup_sequential_pipeline.sh
set -euo pipefail

echo "Creating parent ingestion task..."
PARENT_REF=$(hermes kanban add \
  --title "Raw log ingestion pipeline" \
  --profile "data-engineer" \
  --column backlog)

echo "Parent created: ${PARENT_REF}"

echo "Registering dependent analysis tasks..."
hermes kanban add \
  --title "Statistical anomaly detection" \
  --profile "ml-analyst" \
  --column backlog \
  --parent "${PARENT_REF}"

hermes kanban add \
  --title "Compliance report generation" \
  --profile "compliance-auditor" \
  --column backlog \
  --parent "${PARENT_REF}"

echo "Pipeline registered. Children will remain gated until parent completes."

Why This Works: The dispatcher evaluates dependency graphs before claiming cards. Children in ready state remain invisible to the claim algorithm until the parent transitions to done. This eliminates manual concurrency tuning for sequential workflows and prevents intermediate artifact corruption.
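The gating rule itself reduces to a small predicate. The following sketch uses a hypothetical in-memory board, not the Hermes internals:

```python
# Hypothetical board: each card has a status and an optional parent reference
board = {
    "parent-1": {"status": "running", "parent": None},
    "child-a":  {"status": "ready",   "parent": "parent-1"},
    "child-b":  {"status": "ready",   "parent": None},
}

def is_dispatchable(card_id: str) -> bool:
    card = board[card_id]
    if card["status"] != "ready":
        return False
    parent = card["parent"]
    # Children stay invisible to the claim pass until the parent is done
    return parent is None or board[parent]["status"] == "done"

print([c for c in board if is_dispatchable(c)])  # ['child-b']
board["parent-1"]["status"] = "done"
print([c for c in board if is_dispatchable(c)])  # ['child-a', 'child-b']
```

The dispatcher applies this filter on every scan, so sequencing needs no concurrency tuning at all.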

Step 4: Schedule with Cron and File Locking

External dispatch requires a reliable trigger. Linux cron provides deterministic execution, but minute-level granularity can feel sluggish for fast-failing tasks. A sub-minute loop wrapped in flock solves this without spawning overlapping processes.

#!/usr/bin/env bash
# kanban_subminute_scheduler.sh
set -euo pipefail

LOCK_PATH="/tmp/kanban_dispatch.lock"
EXEC_PATH="/opt/agents/scripts/kanban_flow_controller.py"
TICK_INTERVAL="${TICK_INTERVAL:-15}"
MAX_TICKS="${MAX_TICKS:-4}"

exec 9>"${LOCK_PATH}"
flock -n 9 || { echo "Scheduler already running. Exiting."; exit 0; }

for (( i=1; i<=MAX_TICKS; i++ )); do
  python3 "${EXEC_PATH}"
  if (( i < MAX_TICKS )); then
    sleep "${TICK_INTERVAL}"
  fi
done

The flock -n command acquires an exclusive non-blocking lock. If a previous tick is still executing, the new invocation exits immediately. This prevents SQLite contention and ensures only one dispatcher cycle runs at any given moment.
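If the controller ever needs to self-serialize without the shell wrapper, the same guarantee is available from Python's stdlib fcntl (a sketch using a throwaway lock path):

```python
import fcntl
import os

def try_acquire(lock_path: str):
    # Open (or create) the lock file and attempt a non-blocking exclusive
    # lock, mirroring `flock -n` in the shell wrapper.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd  # keep the fd open for the lifetime of the critical section
    except BlockingIOError:
        os.close(fd)
        return None

lock = try_acquire("/tmp/kanban_dispatch_demo.lock")
print("acquired" if lock is not None else "another cycle is running")
```

The lock is released automatically when the process exits and the descriptor closes, so a crashed dispatcher cannot wedge the queue.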

Pitfall Guide

1. Dual Dispatcher Race Conditions

Explanation: Running hermes kanban daemon alongside gateway-embedded dispatch creates concurrent SQLite readers/writers. Both processes attempt to claim the same ready cards, resulting in duplicate executions or corrupted state transitions. Fix: Explicitly set dispatch_in_gateway: false and verify with pgrep -af "hermes" that only one scheduler process exists. Use flock to serialize external dispatch attempts.

2. Misinterpreting the --max Flag

Explanation: hermes kanban dispatch --max 3 limits new spawns during that tick, not the total number of running tasks. If 5 tasks are already running, executing this command can push concurrency to 8. Fix: Always calculate available_slots = target_limit - active_count before invoking dispatch. Never pass a static number to --max in production.

3. Cron Environment Path Blind Spots

Explanation: Cron executes with a minimal $PATH. If hermes or python3 is installed in a user-specific directory (e.g., ~/.local/bin), the scheduler will fail silently or throw command not found. Fix: Export absolute paths in the cron environment or wrapper script. Use which hermes and which python3 to resolve binaries, then hardcode them or set PATH=/usr/local/bin:/usr/bin:/home/user/.local/bin at the top of the script.

4. Ignoring VRAM Fragmentation vs. Compute Saturation

Explanation: GPU utilization metrics (e.g., nvidia-smi 90% usage) do not guarantee stability. LLM inference suffers from KV-cache fragmentation. Multiple concurrent requests can fragment VRAM, causing OOM kills even when compute appears available. Fix: Monitor vLLM or Ollama memory allocation logs alongside utilization. Set concurrency limits based on worst-case context window requirements, not average token throughput. Use --max-model-len and --gpu-memory-utilization flags to reserve headroom.
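A back-of-envelope capacity calculation shows how to derive the limit from worst-case context size. All model parameters below (32 layers, 8 KV heads, head dimension 128, fp16 cache entries) are illustrative; substitute values from your model card:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    # K and V tensors per layer; fp16 = 2 bytes per element
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def max_safe_concurrency(vram_budget_gib: float, max_context: int,
                         layers: int, kv_heads: int, head_dim: int) -> int:
    # Size each request at its worst-case context, not the average
    per_request = kv_cache_bytes_per_token(layers, kv_heads, head_dim) * max_context
    return int(vram_budget_gib * 1024**3 // per_request)

# 12 GiB of KV-cache headroom, worst-case 8192-token contexts:
print(max_safe_concurrency(12, 8192, layers=32, kv_heads=8, head_dim=128))  # 12
```

With these assumed parameters, each worst-case request reserves about 1 GiB of KV cache, so the concurrency ceiling falls out directly from the headroom budget.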

5. Over-Reliance on Internal Agent Scheduling

Explanation: Using the agent's own LLM to schedule dispatch commands (e.g., prompting the model to run hermes kanban dispatch) introduces circular dependencies. When the model is busy, scheduling stalls. When the model crashes, the queue deadlocks. Fix: Keep scheduling entirely outside the inference loop. Use OS-level cron, systemd timers, or external workflow engines. The LLM should only consume tasks, never manage them.

6. Static Concurrency Caps on Dynamic Workloads

Explanation: Hardcoding TARGET_CONCURRENCY=2 works for uniform tasks but fails when mixing lightweight classification jobs with heavy reasoning pipelines. The GPU sits idle while waiting for long tasks to finish. Fix: Implement tiered concurrency pools. Route tasks by profile to separate dispatch controllers with different limits. Use hermes kanban list --profile "lightweight" vs --profile "reasoning" to calculate independent slot availability.
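Tiered pools reduce to running the slot calculation per profile instead of globally. The profile names and limits below are hypothetical:

```python
POOL_LIMITS = {"lightweight": 4, "reasoning": 1}  # assumed per-profile ceilings

def available_slots(active_by_profile: dict[str, int]) -> dict[str, int]:
    # Each profile gets its own ceiling: a long reasoning job cannot
    # starve the lightweight pool, and vice versa.
    return {
        profile: max(0, limit - active_by_profile.get(profile, 0))
        for profile, limit in POOL_LIMITS.items()
    }

# Counts would come from per-profile `list --status running` queries
print(available_slots({"lightweight": 1, "reasoning": 1}))
# {'lightweight': 3, 'reasoning': 0}
```

Each pool then drives its own dispatch --max value, exactly as the single-pool controller does.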

Production Bundle

Action Checklist

  • Disable gateway-embedded dispatch: Set dispatch_in_gateway: false in ~/.hermes/config.yaml
  • Verify single scheduler path: Run pgrep -af "hermes" and kill duplicate daemon processes
  • Deploy slot-aware controller: Install kanban_flow_controller.py and set KANBAN_MAX_PARALLEL environment variable
  • Configure cron with file locking: Add flock-wrapped scheduler to crontab -e with absolute paths
  • Model sequential dependencies: Use --parent flags for shared-state or pipeline tasks
  • Reserve GPU memory headroom: Configure inference server with --gpu-memory-utilization 0.85 to prevent KV-cache OOM
  • Enable SQLite WAL mode: Ensure ~/.hermes/kanban.db uses PRAGMA journal_mode=WAL; for concurrent read safety
  • Instrument monitoring: Export task queue depths, gateway latency, and GPU VRAM usage to your observability stack

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Mixed interactive + batch workloads | Slot-aware cron controller + tiered concurrency pools | Prevents background jobs from starving interactive queries | Low (CPU overhead for scheduler) |
| Strict data pipelines with shared artifacts | Parent-child dependency gating | Eliminates race conditions on intermediate storage | None (native Kanban feature) |
| Multi-GPU cluster with heterogeneous workloads | Profile-based dispatch routing + independent controllers | Matches task complexity to GPU capability | Medium (requires board partitioning) |
| Low-memory edge devices (CPU/integrated GPU) | Dependency sequencing + TARGET_CONCURRENCY=1 | Prevents context thrashing and swapping | None |
| High-throughput cloud API fallback | Gateway-embedded dispatch + provider rate limits | Leverages elastic scaling and upstream backpressure | High (API costs scale with concurrency) |

Configuration Template

# ~/.hermes/config.yaml
kanban:
  dispatch_in_gateway: false
  dispatch_interval_seconds: 0
  board_path: "~/.hermes/kanban.db"

# Environment variables for scheduler
KANBAN_MAX_PARALLEL=2
KANBAN_BOARD_ID=production-pipeline
HERMES_PATH=/usr/local/bin/hermes

# crontab -e
# Sub-minute dispatch scheduler with file locking
* * * * * /opt/agents/scripts/kanban_subminute_scheduler.sh >> /var/log/hermes/dispatch.log 2>&1

-- SQLite optimization for concurrent Kanban access
PRAGMA journal_mode=WAL;
PRAGMA busy_timeout=5000;
PRAGMA cache_size=-64000;
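The PRAGMAs can be applied and checked from Python's stdlib sqlite3. The sketch below uses a throwaway database path; the same function can be pointed at ~/.hermes/kanban.db while the gateway is stopped:

```python
import sqlite3

def tune_board(db_path: str) -> str:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA busy_timeout=5000")   # wait up to 5s on lock contention
    conn.execute("PRAGMA cache_size=-64000")   # ~64 MB page cache
    # journal_mode returns the resulting mode, so it doubles as verification
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.close()
    return mode

print(tune_board("/tmp/kanban_demo.db"))  # wal
```

WAL mode persists in the database file, so this only needs to run once per board.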

Quick Start Guide

  1. Isolate the dispatcher: Edit ~/.hermes/config.yaml and set dispatch_in_gateway: false. Restart any running gateway processes.
  2. Deploy the controller: Save kanban_flow_controller.py to /opt/agents/scripts/, make it executable, and set KANBAN_MAX_PARALLEL=2 in your environment.
  3. Schedule execution: Add the flock-wrapped scheduler to your crontab. Verify with crontab -l and monitor /var/log/hermes/dispatch.log for the first three ticks.
  4. Validate concurrency: Run hermes kanban list --status running while the queue is processing. The count should never exceed your TARGET_CONCURRENCY value.
  5. Tune for hardware: Adjust KANBAN_MAX_PARALLEL based on VRAM allocation logs. Increase only when latency remains stable under sustained load.