A mocked ad-copy CLI, real evals, and 30 Playwright cycles (tool 1027)

Deterministic AI Tooling: Building Reproducible CLI Contracts for Agent Workflows

Current Situation Analysis

The AI engineering landscape has heavily optimized for model capability while neglecting operational predictability. Teams routinely ship agent-driven CLIs and automation scripts that depend on live LLM calls, treating non-determinism as an acceptable trade-off for flexibility. This creates a critical blind spot: when outputs shift between runs, CI pipelines lose their ability to validate behavior, debugging becomes guesswork, and cost tracking evaporates.

The problem is overlooked because most frameworks prioritize prompt iteration over engineering boundaries. Developers assume that if a model generates plausible text, the tool is ready. In reality, agent workflows require hard contracts. Without frozen output shapes, reproducible stubs, and explicit execution boundaries, a CLI becomes a black box that works during demos but fails under regression load.

A single passing run proves very little: it says nothing about environment stability, permission handling, or process lifecycle management. Executing 30 identical cycles with 5 validation checks each (150 checks in total) exposes wiring faults, timeout misconfigurations, and silent payment-gate bypasses that a single run misses. Freezing the output contract as structured JSON rather than prose or markdown eliminates downstream parsing friction, enables automated assertion, and forces the engineering boundary to be narrower than the marketing narrative.

WOW Moment: Key Findings

The shift from model-dependent execution to hash-pinned deterministic stubs fundamentally changes how AI tooling behaves in CI/CD environments. The following comparison illustrates the operational delta:

Approach	Reproducibility	CI Latency	Debugging Overhead	Cost per 100 Runs
Live Model CLI	Low (varies by provider load)	High (network + inference)	High (trace logs + prompt diffs)	$12.00–$45.00
Deterministic Mock CLI	High (input-hash pinned)	Low (local compute only)	Low (schema validation + diff)	$0.00

This finding matters because it decouples validation from inference. When the output contract is frozen and the execution path is mockable, teams can run adversarial suites, static hygiene checks, and browser-driver integrations without paying for tokens or waiting for rate limits. It enables regression testing that actually catches environment drift, permission changes, and process-launch failures. Most importantly, it forces the engineering boundary to be explicit: the tool either satisfies the JSON contract or it fails deterministically.

Core Solution

Building a reproducible AI CLI requires three architectural decisions: strict output contracts, deterministic stub generation, and explicit execution boundaries. The implementation below demonstrates a TypeScript-based ad-copy generator that satisfies these requirements.

Step 1: Define the Output Contract

The contract must be narrower than prose. We enforce a strict JSON envelope containing an array of variants, a selected best option, and heuristic scores. This eliminates parsing ambiguity and enables downstream automation.

// types.ts
export interface AdVariant {
  id: string;
  text: string;
  channel: string;
  heuristicScore: number;
}

export interface CopyOutput {
  copies: AdVariant[];
  best: string;
  metadata: {
    generatedAt: string;
    inputHash: string;
    mockMode: boolean;
  };
}
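
A TypeScript interface only constrains code at compile time, so a consumer that reads the CLI's stdout still has to check the parsed JSON at runtime. The following is a minimal sketch of such a check, not a definitive validator. The helper names (isRecord, isAdVariant, isCopyOutput) are hypothetical, and the 0 to 1 range for heuristicScore is an assumption based on the clamping used by the mock engine in this article.

```typescript
// Shapes copied from types.ts so the sketch is self-contained.
interface AdVariant {
  id: string;
  text: string;
  channel: string;
  heuristicScore: number;
}

interface CopyOutput {
  copies: AdVariant[];
  best: string;
  metadata: { generatedAt: string; inputHash: string; mockMode: boolean };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Assumption: scores are clamped to [0, 1], as in the mock engine.
function isAdVariant(value: unknown): value is AdVariant {
  if (!isRecord(value)) return false;
  const { id, text, channel, heuristicScore } = value;
  return (
    typeof id === 'string' &&
    typeof text === 'string' &&
    typeof channel === 'string' &&
    typeof heuristicScore === 'number' &&
    heuristicScore >= 0 &&
    heuristicScore <= 1
  );
}

// Hypothetical helper: true only when the parsed JSON satisfies the envelope
// and `best` refers to an id that actually exists in `copies`.
export function isCopyOutput(value: unknown): value is CopyOutput {
  if (!isRecord(value)) return false;
  const { copies, best, metadata } = value;
  if (!Array.isArray(copies) || typeof best !== 'string' || !isRecord(metadata)) {
    return false;
  }
  return (
    copies.every(isAdVariant) &&
    copies.some((c: AdVariant) => c.id === best) &&
    typeof metadata.generatedAt === 'string' &&
    typeof metadata.inputHash === 'string' &&
    typeof metadata.mockMode === 'boolean'
  );
}
```

A CI step could call isCopyOutput(JSON.parse(stdout)) and fail the build when it returns false.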

Step 2: Implement Input Hashing for Deterministic Stubs

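Hashing the input tuple gives every request a stable key, so identical inputs always map to the same stub. This pins the demo payload and the regression payload to the same object and removes a common source of flaky tests. Note that the stub's generatedAt field records wall-clock time, so byte-for-byte comparisons should exclude it; copies, best, and inputHash are identical across runs.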

// hash.ts
import { createHash } from 'crypto';

export function pinInputHash(product: string, audience: string, channel: string): string {
  const raw = `${product}|${audience}|${channel}`;
  return createHash('sha256').update(raw).digest('hex').slice(0, 12);
}
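
One caveat with a pipe-joined key: if a field itself contains the "|" character, two different input tuples can produce the same raw string and therefore the same hash. The sketch below reproduces the pipe-joined form for comparison and shows one possible alternative that JSON-encodes the tuple. The name pinInputHashStrict is hypothetical, and whether the collision matters depends on what inputs the CLI accepts.

```typescript
import { createHash } from 'crypto';

// Pipe-joined form, as in hash.ts: a literal "|" inside a field shifts the boundaries.
function pipeJoinedHash(product: string, audience: string, channel: string): string {
  const raw = `${product}|${audience}|${channel}`;
  return createHash('sha256').update(raw).digest('hex').slice(0, 12);
}

// Hypothetical variant: JSON-encode the tuple so field boundaries stay unambiguous.
export function pinInputHashStrict(product: string, audience: string, channel: string): string {
  const raw = JSON.stringify([product, audience, channel]);
  return createHash('sha256').update(raw).digest('hex').slice(0, 12);
}
```

With the pipe-joined form, ("a|b", "c") and ("a", "b|c") hash identically; the JSON-encoded form keeps them apart. Switching encodings changes every existing hash, so it is easiest to do before any stubs are pinned.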

Step 3: Build the Mock Generator with Heuristic Scoring

The mock engine bypasses remote inference entirely. It fills fixed templates with the inputs and scores each variant with lightweight heuristics (channel baselines, a length penalty, keyword detection). This keeps the contract identical to the live path while removing the network dependency.

// mockEngine.ts
import { AdVariant, CopyOutput } from './types';
import { pinInputHash } from './hash';

const CHANNEL_BASELINES: Record<string, number> = {
  google: 0.85,
  linkedin: 0.78,
  twitter: 0.65,
};

function applyHeuristics(text: string, channel: string): number {
  const base = CHANNEL_BASELINES[channel] ?? 0.5;
  const lengthPenalty = text.length > 150 ? -0.1 : 0;
  const keywordBonus = /enterprise|scale|secure/i.test(text) ? 0.05 : 0;
  return Math.max(0, Math.min(1, base + lengthPenalty + keywordBonus));
}

export function generateMockCopies(
  product: string,
  audience: string,
  channel: string
): CopyOutput {
  const hash = pinInputHash(product, audience, channel);
  
  const variants: AdVariant[] = [
    { id: 'v1', text: `Accelerate ${product} adoption for ${audience} on ${channel}`, channel, heuristicScore: 0 },
    { id: 'v2', text: `Streamline ${audience} workflows with ${product} via ${channel}`, channel, heuristicScore: 0 },
    { id: 'v3', text: `Deploy ${product} securely across ${audience} teams using ${channel}`, channel, heuristicScore: 0 },
  ];

  const scored = variants.map(v => ({ ...v, heuristicScore: applyHeuristics(v.text, channel) }));
  const best = scored.reduce((a, b) => a.heuristicScore > b.heuristicScore ? a : b).id;

  return {
    copies: scored,
    best,
    metadata: { generatedAt: new Date().toISOString(), inputHash: hash, mockMode: true },
  };
}
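
Because generatedAt is taken from the system clock, two runs of the mock generator differ in that one field. A regression test can compare a fingerprint of the stable fields instead of the raw JSON. The sketch below assumes the envelope shape from types.ts; the Envelope type and the stableFingerprint helper are hypothetical names introduced for illustration.

```typescript
// Local copy of the envelope shape (mirrors CopyOutput) to keep the sketch self-contained.
interface Envelope {
  copies: unknown[];
  best: string;
  metadata: { generatedAt: string; inputHash: string; mockMode: boolean };
}

// Hypothetical helper: serialise everything except the wall-clock timestamp.
export function stableFingerprint(output: Envelope): string {
  return JSON.stringify({
    copies: output.copies,
    best: output.best,
    metadata: {
      inputHash: output.metadata.inputHash,
      mockMode: output.metadata.mockMode,
    },
  });
}
```

Two outputs produced from the same inputs should yield the same fingerprint even when they were generated at different times.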

Step 4: Wire CLI Flags and Execution Boundaries

The CLI exposes --mock and --bypass-payment flags. The Oracle/Permitted/Prohibited split enforces that payment gates, outbound calls, and filesystem writes are statically verifiable. This prevents accidental model calls during regression suites.

// cli.ts
import { Command } from 'commander';
import { generateMockCopies } from './mockEngine';
import { CopyOutput } from './types';

const program = new Command();

program
  .name('copygen')
  .description('Deterministic ad-copy CLI with mockable execution path')
  .requiredOption('--product <string>', 'Target product or service')
  .requiredOption('--audience <string>', 'Target demographic or role')
  .requiredOption('--channel <string>', 'Distribution channel')
  .option('--mock', 'Use deterministic stubs instead of live inference')
  .option('--bypass-payment', 'Skip transaction validation for local runs')
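  .action((opts) => {
    // Note: opts.bypassPayment is parsed but not consumed by this mock-only sketch.
    if (opts.mock) {
      const output: CopyOutput = generateMockCopies(opts.product, opts.audience, opts.channel);
      // Write the envelope and return so Node exits on its own with code 0.
      // Calling process.exit() right after a write can end the process
      // before stdout has been flushed on some platforms.
      process.stdout.write(JSON.stringify(output, null, 2) + '\n');
      return;
    }
    // Live path would route through payment gate + model provider here
    console.error('Live inference requires valid transaction token. Use --mock for local validation.');
    process.exitCode = 1;
  });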

program.parse();

Architecture Rationale

JSON Contract Over Prose: Spreadsheets, dashboards, and secondary agents require structured data. Markdown parsing introduces fragility; JSON schema validation eliminates it.
Hash-Pinned Stubs: Identical inputs must yield identical outputs in CI. Hashing removes randomness while preserving the ability to swap in a live generator later without breaking assertions.
Oracle/Permitted/Prohibited Split: Static analysis can only enforce boundaries if they are explicitly declared. Payment gates, outbound network calls, and filesystem mutations are isolated so linters and test runners can police them independently.
Explicit Bypass Flags: --bypass-payment and --mock are not hidden behind environment variables. They are CLI-level contracts that force developers to acknowledge the execution path.
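
The article does not spell out how the Prohibited boundary is enforced. As one possible reading, the sketch below treats it as a list of module specifiers that mock-path source files must not import, and scans source text for them. The module list and the findProhibitedImports name are assumptions for illustration, and a textual scan is only an approximation: it can be fooled by comments or string literals, which a parser-based check would handle correctly.

```typescript
// Hypothetical policy: modules the mock path must never import.
const PROHIBITED_MODULES = ['http', 'https', 'net', 'child_process'];

// Returns the prohibited module names referenced by import/require in `source`.
export function findProhibitedImports(
  source: string,
  prohibited: string[] = PROHIBITED_MODULES
): string[] {
  // Matches: from 'x', import 'x', import('x'), require('x'), with optional node: prefix.
  const importPattern =
    /(?:\bfrom\s+|\bimport\s*\(?\s*|\brequire\s*\(\s*)['"](?:node:)?([^'"]+)['"]/g;
  const hits: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = importPattern.exec(source)) !== null) {
    const specifier = match[1];
    if (specifier !== undefined && prohibited.indexOf(specifier) !== -1 && hits.indexOf(specifier) === -1) {
      hits.push(specifier);
    }
  }
  return hits;
}
```

A test runner or pre-commit hook could read each file under the mock path and fail when the returned list is not empty.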

Pitfall Guide

1. Treating LLM Output as Test Ground Truth

Explanation: Writing assertions that expect specific phrasing from a live model guarantees flaky tests. Model temperature, provider updates, and rate limiting will break the suite. Fix: Pin test expectations to deterministic stubs. Validate schema, field presence, and heuristic ranges instead of exact string matches.

2. Ignoring Static Analysis for Test Quality

Explanation: Test suites often accumulate trivial assertions (expect(true).toBe(true)) that inflate coverage metrics without validating behavior. Fix: Implement AST-based checks that reject always-true assertions and enforce meaningful validation patterns. Integrate this into pre-commit hooks.

3. Accidentally Skipping Adversarial Tests

Explanation: Hard tests that validate failure paths, payment refusals, or malformed inputs are often excluded from default CI runs to save time. This creates silent regression gaps. Fix: Tag adversarial suites explicitly and configure the test runner to fail if they are skipped. In JavaScript ecosystems, use custom reporters that block CI on unexecuted hard tags.
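
The fail-on-skip rule can be sketched as a post-run gate over the test results. The TestResult shape and the assertTaggedTestsRan name below are hypothetical; real runners expose comparable data through their reporter interfaces, and the mapping from a given runner's output to this shape is left as an assumption.

```typescript
// Hypothetical result shape produced by a custom reporter.
interface TestResult {
  name: string;
  tags: string[];
  status: 'passed' | 'failed' | 'skipped';
}

// Throws when tagged tests are missing from the run or were skipped.
export function assertTaggedTestsRan(results: TestResult[], tag = 'hard'): void {
  const tagged = results.filter((r) => r.tags.includes(tag));
  if (tagged.length === 0) {
    throw new Error(`No tests tagged "${tag}" were collected`);
  }
  const skipped = tagged.filter((r) => r.status === 'skipped').map((r) => r.name);
  if (skipped.length > 0) {
    throw new Error(`Skipped "${tag}" tests: ${skipped.join(', ')}`);
  }
}
```

Treating "no tagged tests collected" as a failure also catches the case where a filter or rename silently removes the whole adversarial suite.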

4. Over-Engineering Mock Data

Explanation: Building complex synthetic data generators for mocks introduces maintenance overhead and drifts from the live path's actual behavior. Fix: Use lightweight heuristics (channel baselines, length penalties, keyword detection) that mirror real scoring logic. Keep mocks deterministic and mathematically simple.

5. Mixing Output Formats in Logs

Explanation: Printing markdown, plain text, and JSON in the same CLI output breaks downstream automation and forces consumers to implement fragile parsers. Fix: Enforce a strict JSON envelope for stdout. Route debug logs, warnings, and progress indicators to stderr. This separation is non-negotiable for CI compatibility.
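
One way to keep the two streams apart is to route all output through a small reporter object, so that nothing else in the code base writes to stdout directly. This is a minimal sketch under that assumption; createReporter, the Sink type, and the "[copygen]" log prefix are hypothetical names introduced here.

```typescript
// Minimal structural type so tests can substitute in-memory sinks
// for process.stdout and process.stderr.
interface Sink {
  write(chunk: string): unknown;
}

export function createReporter(out: Sink = process.stdout, err: Sink = process.stderr) {
  return {
    // The JSON envelope is the only thing written to stdout.
    emit(payload: unknown): void {
      out.write(JSON.stringify(payload, null, 2) + '\n');
    },
    // Diagnostics go to stderr so they never corrupt the contract.
    log(message: string): void {
      err.write(`[copygen] ${message}\n`);
    },
  };
}
```

With this split, a consumer can pipe stdout straight into JSON.parse while still seeing progress messages in the terminal.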

6. Assuming Mock Mode Covers Environment Wiring

Explanation: Mock tests validate logic but miss process lifecycle, permission handling, timeout configuration, and driver integration. Fix: Pair mock suites with browser-driver cycles (e.g., Playwright) that execute the CLI as an external process. Run multiple iterations to expose race conditions and resource leaks.
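
The process-level part of such a cycle can be sketched without a browser driver. The example below uses Node's child_process module rather than Playwright: it launches a command repeatedly and records any run that fails to start, times out, exits non-zero, or prints something other than JSON. The runCycles name, the CycleReport shape, and the 10-second default timeout are assumptions for illustration.

```typescript
import { spawnSync } from 'child_process';

interface CycleReport {
  cycles: number;
  failures: string[];
}

// Hypothetical harness: run the command `cycles` times as an external process
// and check each run's launch status, exit code, and stdout.
export function runCycles(
  command: string,
  args: string[],
  cycles: number,
  timeoutMs = 10_000
): CycleReport {
  const failures: string[] = [];
  for (let i = 1; i <= cycles; i++) {
    const run = spawnSync(command, args, { encoding: 'utf8', timeout: timeoutMs });
    if (run.error) {
      // Covers launch failures (e.g. missing binary, permissions) and timeouts.
      failures.push(`cycle ${i}: ${run.error.message}`);
      continue;
    }
    if (run.status !== 0) {
      failures.push(`cycle ${i}: exit code ${run.status}`);
      continue;
    }
    try {
      JSON.parse(run.stdout);
    } catch {
      failures.push(`cycle ${i}: stdout is not valid JSON`);
    }
  }
  return { cycles, failures };
}
```

A call such as runCycles('node', ['cli.js', '--product', 'CloudSync', '--audience', 'IT Directors', '--channel', 'linkedin', '--mock'], 30) would cover the launch, exit-code, and output checks; the remaining checks per cycle depend on the driver in use.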

7. Hardcoding Channel or Audience Values

Explanation: Embedding specific strings in heuristics or stubs creates brittle contracts that break when marketing expands targeting parameters. Fix: Externalize channel baselines and keyword lists to configuration files. Load them at runtime so the CLI remains adaptable without code changes.

Production Bundle

Action Checklist

Define strict JSON schema for all CLI outputs and validate against it in CI
Implement input hashing to pin deterministic stubs for regression suites
Separate stdout (JSON contract) from stderr (debug/logs) to prevent parsing failures
Tag adversarial tests explicitly and configure the runner to fail on skip
Add AST-based static checks to reject trivial assertions and coverage theater
Run browser-driver cycles (30+ iterations) to validate process launch, permissions, and timeouts
Externalize channel baselines and keyword lists to configuration files
Document Oracle/Permitted/Prohibited boundaries in DESIGN.md for static enforcement

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Rapid Prototyping	Live Model CLI	Fast iteration, accepts non-determinism	High per-run cost, low setup time
CI Regression	Deterministic Mock CLI	Reproducible, fast, catches environment drift	Zero token cost, moderate setup time
Cost-Sensitive Production	Hash-Pinned Stubs + Fallback	Guarantees budget control, enables graceful degradation	Predictable spend, requires stub maintenance
Multi-Agent Handoff	Strict JSON Contract	Eliminates parsing friction, enables automated routing	Low integration cost, high reliability
Adversarial Validation	Explicit Hard Tags + Browser Cycles	Prevents silent skips, validates failure paths	Higher CI runtime, critical for safety

Configuration Template

// .copygenrc.json
{
  "contract": {
    "version": "1.0",
    "outputFormat": "json",
    "requiredFields": ["copies", "best", "metadata.inputHash"]
  },
  "mock": {
    "enabled": true,
    "hashAlgorithm": "sha256",
    "channelBaselines": {
      "google": 0.85,
      "linkedin": 0.78,
      "twitter": 0.65
    },
    "heuristics": {
      "lengthThreshold": 150,
      "keywordBoost": ["enterprise", "scale", "secure"]
    }
  },
  "ci": {
    "adversarialTag": "hard",
    "playwrightCycles": 30,
    "checksPerCycle": 5,
    "rejectTrivialAssertions": true
  }
}
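
To connect this template to the scoring logic, the mock section can be parsed and passed into the heuristic instead of relying on constants in code. The sketch below is one way to do that, assuming the file layout shown above. The names parseHeuristicConfig and scoreWithConfig are hypothetical; the 0.5 default baseline, the 0.1 length penalty, and the 0.05 keyword bonus are carried over from the mock engine in Step 3.

```typescript
interface HeuristicConfig {
  channelBaselines: Record<string, number>;
  lengthThreshold: number;
  keywordBoost: string[];
}

// Hypothetical loader for the "mock" section of .copygenrc.json; throws on malformed input.
export function parseHeuristicConfig(jsonText: string): HeuristicConfig {
  const parsed = JSON.parse(jsonText);
  const baselines = parsed?.mock?.channelBaselines;
  const heuristics = parsed?.mock?.heuristics;
  if (typeof baselines !== 'object' || baselines === null) {
    throw new Error('mock.channelBaselines must be an object');
  }
  const channelBaselines: Record<string, number> = {};
  for (const channel of Object.keys(baselines)) {
    const value: unknown = baselines[channel];
    if (typeof value !== 'number') {
      throw new Error(`Baseline for "${channel}" must be a number`);
    }
    channelBaselines[channel] = value;
  }
  if (typeof heuristics?.lengthThreshold !== 'number') {
    throw new Error('mock.heuristics.lengthThreshold must be a number');
  }
  if (!Array.isArray(heuristics.keywordBoost)) {
    throw new Error('mock.heuristics.keywordBoost must be an array');
  }
  return {
    channelBaselines,
    lengthThreshold: heuristics.lengthThreshold,
    keywordBoost: heuristics.keywordBoost.map((k: unknown) => String(k)),
  };
}

// Same arithmetic as applyHeuristics, but driven by the loaded configuration.
export function scoreWithConfig(text: string, channel: string, cfg: HeuristicConfig): number {
  const base = cfg.channelBaselines[channel] ?? 0.5;
  const lengthPenalty = text.length > cfg.lengthThreshold ? -0.1 : 0;
  const lower = text.toLowerCase();
  const keywordBonus = cfg.keywordBoost.some((k) => lower.includes(k.toLowerCase())) ? 0.05 : 0;
  return Math.max(0, Math.min(1, base + lengthPenalty + keywordBonus));
}
```

Failing loudly on a malformed file keeps a typo in configuration from silently changing scores in CI.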

Quick Start Guide

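Initialize the project: Run npm init -y && npm install commander to set up the CLI framework, then npm install --save-dev typescript @types/node for the compiler and Node typings. The crypto module used for hashing is built into Node.js and does not need to be installed.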
Create the schema and engine: Copy the TypeScript interfaces, hash function, and mock generator into your source directory. Ensure the output matches the JSON contract exactly.
Wire the CLI entry point: Implement the command parser with --mock and --bypass-payment flags. Route stdout to JSON and stderr to logs.
Run deterministic validation: Compile the TypeScript sources (for example with npx tsc), then execute node cli.js --product "CloudSync" --audience "IT Directors" --channel "linkedin" --mock to verify the stub output and hash pinning.
Integrate CI gates: Add the adversarial test tag configuration, AST assertion checker, and Playwright cycle runner to your pipeline. Verify that all 150 checks (30 cycles × 5 checks) are green before merging.
