AI/ML · 2026-05-07 · 40 min read

I Tested My AI Pipeline 6 Times and Found 9 Bugs. The Model Caused Zero of Them.

By Ken Imoto


Current Situation Analysis

Autonomous AI pipelines routinely fail not due to model hallucinations or prompt engineering deficiencies, but because of orchestration fragility in the execution harness. Traditional deployment patterns treat LLM agents as deterministic scripts, relying on time-based cron scheduling, self-referential quality loops, and stateless job management. This approach breaks down under real-world conditions:

  • Variable Latency vs Fixed Scheduling: LLM inference times fluctuate based on context window size, API load, and reasoning depth. Time-based cron jobs assume fixed execution windows, creating race conditions where downstream agents start before upstream outputs are ready.
  • Stateless Execution: Without explicit state tracking, pipelines duplicate outputs, overwrite calendar entries, and queue conflicting publish jobs. Each run operates in isolation, ignoring system state.
  • Circular Quality Assurance: Self-grading evaluation loops produce false positives. Agents optimizing for prompt compliance will consistently rate their own output as "excellent," masking structural or tonal deficiencies.
  • Infrastructure Fragility: Dynamic prompt placeholders injected into shell environments trigger silent parsing errors (e.g., angle brackets interpreted as I/O redirection). Missing idempotency checks compound job duplication.

The failure mode is systemic: 100% of observed defects originated in the harness layer, not the model layer. Traditional AI engineering focuses heavily on prompt/context optimization while neglecting execution control, state management, and environmental isolation.

WOW Moment: Key Findings

Six iterative test runs revealed a consistent defect pattern. Transitioning from time-based orchestration to event-driven chaining, combined with state-aware deduplication and independent evaluation, eliminated all nine identified failure modes. The following experimental comparison demonstrates the performance delta across three orchestration paradigms:

Approach | Race Condition Incidents | Duplicate Outputs | False Quality Passes | Shell/Infra Failures | Successful Batch Runs
Time-Based Cron | 3 | 4 | 2 | 2 | 0
Staggered Cron | 1 | 3 | 2 | 2 | 0
Event-Driven Harness | 0 | 0 | 0 | 0 | 1 (Run #7)

Key Findings:

  • Execution Control: Event-driven "after" dependencies eliminated race conditions entirely, versus 1 remaining incident under staggered cron and 3 under plain time-based cron.
  • Data Integrity: Pre-flight state queries and deduplication logic eliminated duplicate calendar entries and topic overlaps.
  • Quality Assurance: Decoupling writer and evaluator sessions reduced false pass rates from 100% to 0%, exposing missing engagement metrics (e.g., wit/human tone).
  • Infrastructure: Shell-safe templating and job-id tracking resolved the silent parsing errors and eliminated duplicate entries in the at job queue.

Sweet Spot: The optimal architecture isolates model execution from orchestration logic, enforces idempotent state checks before every write operation, and routes quality evaluation through independent sessions with zero context carryover.

Core Solution

The resolution required a systematic shift from prompt-centric optimization to harness engineering. Implementation spans four layers:

1. Event-Driven Orchestration

Replace time-based cron with completion-triggered dependencies. Each phase writes output to a deterministic path. The next phase initializes only upon successful completion of the predecessor.

# Before: time-based cron, every agent fires at once
observer:
  schedule: "0 7 * * 1"   # Monday 07:00
strategist:
  schedule: "0 7 * * 1"
marketer:
  schedule: "0 7 * * 1"

# After: event-driven chaining (the target architecture)
observer:
  schedule: "0 7 * * 1"   # Monday 07:00
strategist:
  after: observer         # Starts when Observer completes
marketer:
  after: strategist       # Starts when Strategist completes
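
If your agent framework offers no declarative after clause, the same pattern can be hand-rolled around completion artifacts. The sketch below is illustrative only, not the pipeline's actual code: the agent script paths and output locations are hypothetical placeholders, and the only contract is that a phase refuses to start until its predecessor's output file exists.

# Illustrative chain driver: a phase starts only if its predecessor's output exists.
# Script paths and artifact locations are placeholders, not the real pipeline layout.
from pathlib import Path
import subprocess
import sys

CHAIN = [
    ("observer",   "agents/observer.py",   None),
    ("strategist", "agents/strategist.py", Path("out/observer.md")),
    ("marketer",   "agents/marketer.py",   Path("out/strategist.md")),
]

for name, script, required_input in CHAIN:
    if required_input is not None and not required_input.exists():
        # Upstream output never landed: halt instead of consuming partial input
        sys.exit(f"Missing {required_input} for {name}; halting chain")
    subprocess.run([sys.executable, script])

Whether the check lives in a framework or in a ten-line driver, the point is the same: the trigger is a completed artifact, not a clock.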

2. State-Aware Data Integrity

Inject exclusion lists and query existing system state before committing actions.

# Fix 1: inject an exclusion list before topic selection
existing = list_existing_articles()   # query the CMS: the source of truth, not the agent's memory
prompt = f"""
Select a topic. Do NOT pick any of these (already published):
{existing}
"""

# Fix 2: calculate available dates before scheduling anything
available = get_available_publish_dates(
    start=today,                       # run date
    count=batch_size,                  # articles in this batch
    existing=get_scheduled_dates(),    # query the calendar for dates already taken
)
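
The helper names above are pipeline-specific pseudo-code, but the date logic reduces to a plain diff against the calendar's current state. A minimal sketch under that assumption (function and argument names mirror the pseudo-code, not any real library):

from datetime import date, timedelta

def get_available_publish_dates(start: date, count: int, existing: set[date]) -> list[date]:
    # Walk forward from the start date and keep only slots not already on the calendar
    available: list[date] = []
    current = start
    while len(available) < count:
        if current not in existing:
            available.append(current)
        current += timedelta(days=1)
    return available

# Usage: the calendar, not the agent's memory, decides what is free
# available = get_available_publish_dates(date.today(), batch_size, set(get_scheduled_dates()))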

3. Independent Quality Evaluation

Decouple assessment from generation. Run quality checks in a separate Claude session with no memory of the writing context.
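
Concretely, the evaluator gets its own fresh session and an explicit rubric, and it never sees the writer's prompts or intermediate drafts. The snippet below is one possible wiring using the Anthropic Python SDK; the model name is a placeholder and the rubric is abbreviated, so treat it as a sketch rather than the pipeline's exact evaluator.

import json
from anthropic import Anthropic

EVAL_RUBRIC = """You are reviewing a draft you did not write. Score 1-5 per axis and justify:
- Structural coherence
- Tonal consistency
- Engagement: wit, self-deprecation, unexpected metaphors (flag their absence)
Return only JSON: {"scores": {...}, "pass": true|false, "issues": [...]}"""

def evaluate_independently(draft: str) -> dict:
    client = Anthropic()  # new client, zero carryover from the writer session
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; pin whatever evaluator model you use
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{EVAL_RUBRIC}\n\nDRAFT:\n{draft}"}],
    )
    return json.loads(response.content[0].text)  # assumes the model honours the JSON-only instruction

The separation is the point: the evaluator cannot rationalize its own choices because it never made any.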

4. Infrastructure Hardening

Escape dynamic placeholders and enforce job idempotency.

# Before: the placeholder lands unquoted in the shell, so bash treats <devto_id> as I/O redirection
echo Update article <devto_id> to published

# After: keep the string quoted and use a shell-safe placeholder, substituted after the shell has parsed the command
echo "Update article DEVTO_ID_PLACEHOLDER to published"

Architecture Decision: The pipeline adopts a producer-consumer model with explicit state gates. Failure in any phase halts downstream execution, preventing corruption propagation. This maps directly to the Prompt → Context → Harness progression, where harness engineering becomes the primary determinant of agent reliability.

Pitfall Guide

  1. Time-Based Orchestration Over Event-Driven Chaining: Cron scheduling assumes deterministic execution windows. LLM latency is probabilistic. Always use completion-triggered dependencies (after clauses) to prevent race conditions and partial input consumption.
  2. Self-Referential Quality Loops: Agents evaluating their own output will optimize for prompt compliance, not actual quality. Route evaluation through independent sessions with isolated context windows and explicit rubrics (e.g., wit, tone, structural coherence).
  3. Stateless Job Scheduling: Failing to query existing jobs, calendar entries, or published content before scheduling guarantees duplicates. Implement idempotent pre-flight checks: list existing → filter conflicts → commit only unique actions.
  4. Unescaped Dynamic Placeholders in Shell: Angle brackets (< >), pipes (|), and ampersands (&) in prompt templates trigger shell interpretation. Always quote variables, escape special characters, or use safe templating engines that bypass shell evaluation.
  5. Ignoring External System State: Autonomous pipelines must treat external systems (calendars, CMS, schedulers) as authoritative sources of truth. Never assume availability; always query, diff, and reconcile before writing.
  6. Missing Engagement Metrics in Evals: Standard quality checks catch AI slop vocabulary but miss human-centric signals (self-deprecation, unexpected metaphors, tonal deflation). Explicitly require engagement heuristics in evaluation prompts.
  7. Silent Failure Propagation: Without explicit success gates, downstream phases execute on corrupted or incomplete inputs. Implement strict exit codes and halt chains on non-zero returns.
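
For that last point, the gate does not need to be elaborate. A checked subprocess call per phase is enough to stop the chain on the first non-zero exit; the phase commands below are illustrative only.

import subprocess
import sys

def run_gated(phase: str, cmd: list[str]) -> None:
    # check=True raises on any non-zero exit, so the loop below never reaches later phases
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as err:
        sys.exit(f"{phase} failed (exit {err.returncode}); downstream phases skipped")

for phase, cmd in [("writer", ["python", "writer.py"]),
                   ("evaluator", ["python", "evaluator.py"]),
                   ("publisher", ["python", "publisher.py"])]:
    run_gated(phase, cmd)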

Deliverables

  • 📘 Blueprint: Event-Driven AI Harness Architecture. Complete system design covering phase chaining, state gating, independent evaluation routing, and idempotent job management. Includes failure isolation patterns and rollback strategies.
  • ✅ Checklist: 9-Point Pre-Flight Harness Validation. Actionable verification list covering execution dependencies, state deduplication, shell escaping, calendar reconciliation, topic exclusion, independent eval setup, wit/engagement metrics, job queue hygiene, and success/failure propagation gates.
  • ⚙️ Configuration Templates: Production-ready YAML orchestration schemas, state-query pseudo-code for deduplication, independent evaluator prompt structures, and shell-safe placeholder escaping patterns. Deployable across Claude, GPT, and open-weight agent frameworks.

The harness patterns detailed in this article are part of a broader framework: Harness Engineering: From Using AI to Controlling AI.