Autonomous UI Generation via Runtime Evaluation Loops

Current Situation Analysis

Modern AI coding assistants have dramatically reduced the time required to scaffold interfaces, but they suffer from a fundamental limitation: statistical convergence. When left to generate UI code autonomously, large language models gravitate toward high-probability patterns. The result is a predictable aesthetic often referred to as AI-generated slop: uniform typography, gradient-heavy backgrounds, rigid card grids, and identical interaction models. This is less a defect in any one model than the expected outcome of likelihood-driven generation when nothing in the loop pushes it away from the most common patterns.

The industry has largely addressed this problem through static analysis. Developers run linters, type checkers, and static code reviews to catch syntax errors, missing props, or accessibility violations. While effective for correctness, static analysis completely misses runtime visual and behavioral defects. A component can pass every ESLint rule and TypeScript check while failing to render correctly on mobile viewports, misaligning under dynamic content, or breaking interactive states. The gap between static correctness and production-ready UI remains one of the most expensive bottlenecks in AI-assisted development.

Recent harness engineering experiments demonstrate that closing this gap requires shifting evaluation from the codebase to the runtime. By decoupling generation from evaluation and forcing the evaluator to interact with a live development server, teams can achieve production-grade interfaces in approximately 3.5 hours across 12 autonomous cycles with zero manual intervention. The critical insight is that visual and interaction quality cannot be reliably assessed through text alone. Runtime evaluation catches layout shifts, z-index conflicts, responsive breakpoints, and event handler failures that static review systematically misses.

WOW Moment: Key Findings

The most significant finding from runtime evaluation loops is that architectural divergence, not incremental refinement, drives quality gains. When the evaluator operates on a live server and enforces strict anti-generic constraints, the generator is forced to abandon safe patterns and explore unconventional layouts. This produces measurable jumps in design maturity that static loops cannot replicate.

Approach	Design Score (Avg)	Visual/Interaction Bug Detection	Context Pollution Risk	Time to Viable UI
Single-Pass Generation	4.2/10	Low (static only)	None	< 5 min
Static Code Review Loop	5.8/10	Medium (misses layout/runtime)	High	~45 min
Generator-Evaluator Runtime Loop	7.5/10	High (viewport + interaction)	None (clean slate)	~3.5 hrs

This finding matters because it redefines how autonomous development pipelines should be structured. Static review optimizes for correctness; runtime evaluation optimizes for craft. By treating the development server as the source of truth, the evaluator can validate responsive behavior, interaction states, and visual hierarchy in real time. The clean-slate architecture avoids context window degradation: each iteration starts from a fresh context instead of a long conversation full of accumulated, sometimes contradictory instructions. The result is a pipeline that prioritizes architectural pivots over cosmetic tweaks, producing interfaces that feel intentionally designed rather than statistically averaged.

Core Solution

The generator-evaluator loop operates on three principles: process isolation, file-based inter-process communication, and runtime validation. Each component runs as an independent CLI session, preventing context bleed and ensuring deterministic state transitions.

Architecture Overview

Planner (1x) → Generator ↔ Evaluator (N cycles)

The planner establishes the initial specification. The generator builds the interface. The evaluator critiques it against a live server. Communication flows exclusively through structured files in a shared harness directory. No shared memory, no persistent sessions, no implicit state.
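As a minimal sketch of the file-based hand-off, the helper below derives the files one cycle would read and write. The file names mirror the ones the orchestrator in Step 1 uses (spec.md, gen-cycle-N.md, eval-cycle-N.md); the `cycleFiles` function and the `CycleFiles` shape are hypothetical names introduced only for illustration.

```typescript
// Hypothetical helper: maps a cycle number to the files exchanged in that cycle.
interface CycleFiles {
  spec: string;
  previousEval: string | null; // null in cycle 1, when no critique exists yet
  generatorOutput: string;
  evaluatorOutput: string;
}

const HARNESS_ROOT = '.harness';

function cycleFiles(cycle: number): CycleFiles {
  if (Math.floor(cycle) !== cycle || cycle < 1) {
    throw new RangeError(`cycle must be a positive integer, got ${cycle}`);
  }
  return {
    spec: `${HARNESS_ROOT}/spec.md`,
    previousEval: cycle > 1 ? `${HARNESS_ROOT}/eval-cycle-${cycle - 1}.md` : null,
    generatorOutput: `${HARNESS_ROOT}/gen-cycle-${cycle}.md`,
    evaluatorOutput: `${HARNESS_ROOT}/eval-cycle-${cycle}.md`,
  };
}
```

Because every input and output is a named file, any cycle can be replayed or inspected after the fact without access to a session transcript.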

Step 1: Orchestrator Implementation

A TypeScript orchestrator manages process lifecycle, enforces clean slates, and handles iteration routing. Unlike shell scripts, a typed runner provides structured error handling, timeout management, and deterministic exit codes.

import { execFileSync, execSync } from 'child_process';
import { existsSync, mkdirSync, readFileSync } from 'fs';
import { join } from 'path';

const HARNESS_DIR = '.harness';
const MAX_CYCLES = 12;
const TARGET_SCORE = 8; // halt once the evaluator reports at least this score
const CLI_BINARY = 'kiro-cli';

function ensureHarnessDir(): void {
  if (!existsSync(HARNESS_DIR)) {
    mkdirSync(HARNESS_DIR, { recursive: true });
  }
}

// Runs one isolated CLI session; throws on a non-zero exit or after 10 minutes.
function invokeSession(promptFile: string, outputFile: string): void {
  const prompt = `Read ${promptFile} and execute all instructions. Write output to ${outputFile}.`;
  // Passing the prompt as a single argv entry avoids shell-quoting problems.
  execFileSync(CLI_BINARY, ['chat', '--no-interactive', '--trust-all-tools', prompt], {
    stdio: 'inherit',
    timeout: 600000,
  });
}

function killDevServer(): void {
  try {
    execSync('pkill -f "next dev" 2>/dev/null || true');
  } catch {
    // No dev server was running; nothing to clean up
  }
}

// Extracts N from a "SCORE: N/10" line, or returns null if none is present.
function parseScore(report: string): number | null {
  const match = report.match(/SCORE:\s*(\d+(?:\.\d+)?)\s*\/\s*10/);
  return match ? Number(match[1]) : null;
}

async function runHarness(): Promise<void> {
  ensureHarnessDir();

  // Phase 1: Specification
  invokeSession('prompts/specification.md', join(HARNESS_DIR, 'spec.md'));

  // Phase 2: Generate-Evaluate Loop
  for (let cycle = 1; cycle <= MAX_CYCLES; cycle++) {
    console.log(`\n[CYCLE ${cycle}/${MAX_CYCLES}]`);

    killDevServer();
    await new Promise(res => setTimeout(res, 2000));

    // Generator consumes spec + previous evaluation
    invokeSession('prompts/generator.md', join(HARNESS_DIR, `gen-cycle-${cycle}.md`));

    // Evaluator spins up server, tests runtime, writes critique
    const evalPath = join(HARNESS_DIR, `eval-cycle-${cycle}.md`);
    invokeSession('prompts/evaluator.md', evalPath);

    // A missing report counts as "no score" instead of crashing the loop
    const evalReport = existsSync(evalPath) ? readFileSync(evalPath, 'utf-8') : '';
    const score = parseScore(evalReport);
    if (score !== null && score >= TARGET_SCORE) {
      console.log('Target quality reached. Halting loop.');
      break;
    }
  }
}

runHarness().catch(err => {
  console.error(err);
  process.exitCode = 1; // surface failures to the calling shell or CI job
});

Why this structure? TypeScript provides explicit error boundaries and timeout controls that shell scripts lack. The orchestrator enforces a strict kill-restart cycle for the dev server, preventing port conflicts and stale builds. File-based output ensures every iteration leaves an auditable trail.

Step 2: Runtime Evaluator Configuration

The evaluator must interact with the application as a user would. Static code inspection cannot detect CSS cascade failures, JavaScript event delegation issues, or responsive breakpoint misalignments. Playwright MCP bridges this gap by providing programmatic browser control within the AI session.

# prompts/evaluator.md

## Server Initialization
Start the Next.js development server on port 3000. Wait 8 seconds for compilation.
nohup npx next dev --port 3000 > /dev/null 2>&1 & disown; sleep 8

## Runtime Validation Protocol
1. Navigate to http://localhost:3000
2. Capture DOM snapshots at 1440x900 (desktop), 768x1024 (tablet), 375x812 (mobile)
3. Click all interactive elements: buttons, links, form inputs, toggles
4. Verify no layout shifts occur during interaction
5. Check console for unhandled errors or hydration mismatches

## Scoring Rubric
- Design Quality (0-10): Visual hierarchy, spacing, typography cohesion
- Originality (0-10): Deviation from common AI patterns
- Craft (0-10): Animation smoothness, responsive behavior, accessibility
- Functionality (0-10): Interaction reliability, state management, error handling

## Output Requirements
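Write the critique to the output file named in the invoking instruction (the orchestrator passes .harness/eval-cycle-N.md). End the report with a single overall line in the form SCORE: N/10, which the orchestrator parses. Never output PASS without actionable feedback. If score < 6, mandate architectural pivot. If score >= 7, allow refinement.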

Why Playwright over static review? Browser automation catches runtime failures that linters miss. Hydration mismatches, z-index stacking contexts, and touch-event delegation only manifest in a live DOM. Multi-viewport snapshots force responsive validation. Interaction testing verifies event handlers actually fire.
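The rubric lists four dimensions, while the orchestrator looks for one overall score line. The source does not say how the four combine, so the sketch below assumes an equally weighted mean rounded to an integer; `RubricScores`, `overallScore`, `directiveFor`, and `scoreLine` are hypothetical names. It also makes visible that the prompt defines no directive for a score of exactly 6.

```typescript
// Assumption: the four rubric dimensions are weighted equally.
interface RubricScores {
  designQuality: number;
  originality: number;
  craft: number;
  functionality: number;
}

type Directive = 'pivot' | 'refine' | 'unspecified';

function overallScore(rubric: RubricScores): number {
  const values = [rubric.designQuality, rubric.originality, rubric.craft, rubric.functionality];
  for (const value of values) {
    if (!isFinite(value) || value < 0 || value > 10) {
      throw new RangeError(`rubric scores must be within 0-10, got ${value}`);
    }
  }
  const mean = values.reduce((sum, value) => sum + value, 0) / values.length;
  return Math.round(mean); // integer, so the line reads e.g. "SCORE: 7/10"
}

function directiveFor(score: number): Directive {
  if (score < 6) return 'pivot';   // prompt: mandate architectural pivot
  if (score >= 7) return 'refine'; // prompt: allow refinement
  return 'unspecified';            // the prompt gives no rule for 6 up to 7
}

function scoreLine(rubric: RubricScores): string {
  return `SCORE: ${overallScore(rubric)}/10`;
}
```

A weighted mean or a minimum across dimensions would be equally defensible; whichever rule is chosen, stating it in the evaluator prompt keeps scores comparable across cycles.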

Step 3: Anti-Generic Constraints

To prevent statistical convergence, the generator must operate under explicit anti-generic constraints. These are injected as a skill file that overrides default model tendencies.

# prompts/generator.md

## Design Constraints (Enforced)
- NEVER use Inter, Roboto, Arial, or Space Grotesk
- NEVER apply purple gradients on white backgrounds
- NEVER use predictable card grids or centered hero sections
- ALWAYS select distinctive typefaces with clear personality
- ALWAYS implement asymmetric layouts or unexpected spatial relationships
- ALWAYS commit fully to a single visual direction

## Iteration Strategy
If previous evaluation score < 6: Discard current layout. Pivot to new architectural pattern.
If previous evaluation score >= 7: Refine spacing, animation curves, and micro-interactions.

Why hard constraints? LLMs optimize for likelihood. Without explicit negative constraints, they default to training-distribution averages. The anti-slop rules push the model toward lower-probability, more distinctive outputs. The pivot mandate prevents endless cosmetic tweaking, which yields diminishing returns.
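Prompt-level constraints can be ignored by the model, so a cheap mechanical check is a useful backstop. The sketch below scans generated CSS for the four banned typefaces named in the generator prompt. `findBannedFonts` is a hypothetical helper, and it only inspects plain `font-family` declarations; fonts pulled in through utility classes or framework font loaders would need their own checks.

```typescript
// Typefaces the generator prompt forbids, lower-cased for comparison.
const BANNED_FONTS = ['inter', 'roboto', 'arial', 'space grotesk'];

function findBannedFonts(css: string): string[] {
  const declaration = /font-family\s*:\s*([^;}]+)/gi;
  const found = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = declaration.exec(css)) !== null) {
    const families = (match[1] ?? '').split(',');
    for (const family of families) {
      // Strip surrounding quotes and whitespace before comparing.
      const name = family.trim().replace(/^["']|["']$/g, '').toLowerCase();
      if (BANNED_FONTS.indexOf(name) !== -1) found.add(name);
    }
  }
  return Array.from(found).sort();
}
```

A non-empty result could be appended to the evaluator's critique or used to reject a cycle before the browser checks even start.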

Pitfall Guide

1. Context Window Degradation

Explanation: Running generator and evaluator in the same session causes instruction drift. The model forgets early constraints as conversation length increases, leading to contradictory outputs. Fix: Enforce strict process isolation. Each invocation must start with a clean context window. Pass state exclusively through files, not conversation history.

2. Evaluator Hallucination on Visuals

Explanation: Text-only evaluation cannot reliably assess visual quality. The model may claim a layout is responsive without verifying breakpoint behavior. Fix: Require DOM snapshots and interaction logs as proof. The evaluator must output specific viewport dimensions, element states, and console errors. Never accept subjective claims without runtime evidence.
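One lightweight way to enforce the evidence requirement is to reject reports that never mention the required viewports. The sketch below checks for the three sizes listed in the evaluator prompt; `missingViewports` is a hypothetical helper, and a textual mention is only a weak proxy for a snapshot actually having been taken.

```typescript
// Viewport sizes taken from the evaluator prompt's validation protocol.
const REQUIRED_VIEWPORTS = ['1440x900', '768x1024', '375x812'];

function missingViewports(report: string): string[] {
  // Normalise "1440 x 900" and "1440×900" to "1440x900" before searching.
  const normalised = report.replace(/(\d)\s*[x×]\s*(\d)/gi, '$1x$2');
  return REQUIRED_VIEWPORTS.filter(size => normalised.indexOf(size) === -1);
}
```

If the returned list is non-empty, the orchestrator could discard the report and rerun the evaluator instead of trusting an unverified score.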

3. Over-Optimization for Polish

Explanation: Generators tend to endlessly tweak margins, colors, and animations when scores plateau. This wastes compute and delays architectural improvements. Fix: Implement a plateau rule that triggers mandatory pivots. If the score remains within a 1-point range for 3 consecutive cycles, force a layout restructuring. Polish only applies after architectural stability is achieved.
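The plateau rule translates directly into a small check over the score history. This is a sketch; `isPlateaued` is a hypothetical name, and the defaults of 3 cycles and a 1-point range come from the rule as stated in this pitfall.

```typescript
// True when the last `window` scores all fall within `range` points of each other.
function isPlateaued(scores: number[], window: number = 3, range: number = 1): boolean {
  if (scores.length < window) return false; // not enough history yet
  const recent = scores.slice(-window);
  return Math.max(...recent) - Math.min(...recent) <= range;
}
```

The orchestrator could call this after each evaluation and, when it returns true, instruct the generator to restructure the layout regardless of the current score.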

4. Ignoring Interaction Realities

Explanation: Static review catches missing props but misses broken event handlers, missing focus states, or unhandled loading states. Fix: Mandate interaction testing in every evaluation cycle. Click every button, fill every form, toggle every switch. Verify state transitions and error boundaries. Runtime behavior > static structure.

5. Missing Rollback Mechanism

Explanation: Autonomous loops can degrade quality if a bad iteration overwrites a better one. Without version control, recovery is impossible. Fix: Commit to Git after each cycle. Tag commits with evaluation scores. Maintain a best-score branch that automatically updates when a new cycle exceeds the previous maximum. Enable instant rollback to peak quality.
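A sketch of the per-cycle Git bookkeeping follows. It only builds argument lists, so the side effects stay in the caller; `planGitCommands`, the tag naming scheme, and the branch name `best-score` are assumptions made for illustration.

```typescript
type GitArgs = string[];

// Returns the git invocations for one finished cycle, in execution order.
function planGitCommands(cycle: number, score: number, bestSoFar: number): GitArgs[] {
  const commands: GitArgs[] = [
    ['add', '-A'],
    // --allow-empty keeps one commit per cycle even if nothing changed
    ['commit', '--allow-empty', '-m', `cycle ${cycle}: score ${score}/10`],
    ['tag', `cycle-${cycle}-score-${score}`],
  ];
  if (score > bestSoFar) {
    // Move best-score to this commit. Git refuses to force-move the branch
    // that is currently checked out, so the loop should run on another branch.
    commands.push(['branch', '-f', 'best-score', 'HEAD']);
  }
  return commands;
}
```

Each entry can be passed to `execFileSync('git', args)`. Rolling back to peak quality is then a checkout of `best-score`. Re-running a harness in the same repository would collide with existing tags, so a run identifier in the tag name may be worth adding.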

6. Hardcoded Scoring Thresholds

Explanation: Fixed score targets (e.g., "stop at 8/10") ignore iteration context. Early cycles should prioritize exploration; late cycles should prioritize stability. Fix: Use dynamic thresholds. Cycle 1-4: Accept scores 5-6 to encourage divergence. Cycle 5-8: Require 7+ to validate direction. Cycle 9+: Demand 8+ for production readiness. Adjust expectations based on iteration maturity.
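The cycle bands above map onto a simple lookup. In this sketch, `requiredScore` and `meetsThreshold` are hypothetical names, and the lower bound of each band (5, 7, 8) is taken as the bar for that phase; what the orchestrator does when a cycle misses the bar is left open.

```typescript
// Minimum acceptable score for a given cycle, following the three bands:
// cycles 1-4 explore, cycles 5-8 validate direction, cycles 9+ target production.
function requiredScore(cycle: number): number {
  if (cycle < 1) throw new RangeError(`cycle must be >= 1, got ${cycle}`);
  if (cycle <= 4) return 5;
  if (cycle <= 8) return 7;
  return 8;
}

function meetsThreshold(cycle: number, score: number): boolean {
  return score >= requiredScore(cycle);
}
```

For example, a score of 6 passes in cycle 3 but not in cycle 6, which matches the intent of tolerating rough but divergent work early and demanding stability later.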

7. Generator-Evaluator Feedback Stagnation

Explanation: When both agents operate under identical constraints, they can enter a loop where the evaluator critiques the same issues and the generator applies identical fixes. Fix: Inject controlled variance. Rotate constraint emphasis each cycle. If stagnation is detected, temporarily relax one constraint to force exploration, then reapply it. Where the CLI or model API exposes a sampling temperature, raise it for the generator when convergence is detected.
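Stagnation has to be detected before variance can be injected. One possible signal, sketched here, is the overlap between the issue lists of two consecutive evaluator reports, measured as Jaccard similarity. The helper names, the assumption that issues appear as `- ` bullet lines, and the 0.8 cutoff are all hypothetical.

```typescript
// Collects bullet lines ("- ...") from a report as a normalised set of issues.
function issueSet(report: string): Set<string> {
  const issues = new Set<string>();
  for (const line of report.split('\n')) {
    const trimmed = line.trim();
    if (trimmed.indexOf('- ') === 0) issues.add(trimmed.slice(2).toLowerCase());
  }
  return issues;
}

// Jaccard similarity of the two issue sets: shared issues / all distinct issues.
function critiqueOverlap(previous: string, current: string): number {
  const a = issueSet(previous);
  const b = issueSet(current);
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach(issue => {
    if (b.has(issue)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

function isStagnating(previous: string, current: string, cutoff: number = 0.8): boolean {
  return critiqueOverlap(previous, current) >= cutoff;
}
```

Exact string matching will miss reworded versions of the same complaint, so this works best when the evaluator prompt asks for stable, terse issue labels.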

Production Bundle

Action Checklist

Isolate each agent in a separate CLI process with clean context windows
Configure file-based IPC using a dedicated .harness directory
Integrate Playwright MCP for runtime viewport and interaction testing
Implement anti-generic constraints to prevent statistical convergence
Set up Git versioning with automatic tagging per evaluation cycle
Define dynamic scoring thresholds that adapt to iteration maturity
Implement a kill-restart cycle for the dev server to prevent port conflicts
Log all evaluator outputs to a centralized dashboard for trend analysis

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Marketing/Landing Page	Frontend Design Loop (5-15 cycles)	Prioritizes visual craft, responsive behavior, and brand differentiation over complex state management	Low compute, high design ROI
SaaS Dashboard / App Shell	Full-Stack Sprint Loop	Requires contract negotiation, pass/fail gates, and correctness validation before UI generation	Higher compute, lower defect rate
Rapid Prototyping	Single-Pass + Static Review	Speed > polish. Accept generic patterns for internal validation or stakeholder demos	Minimal compute, fast turnaround
Production-Ready Public UI	Generator-Evaluator Runtime Loop	Catches visual/runtime bugs, enforces anti-slop constraints, validates responsive behavior	Moderate compute, high quality assurance

Configuration Template

// .kiro/settings/mcp.json
{
  "mcpServers": {
    "playwright-browser": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"],
      "env": {
        "BROWSER_VIEWPORT_WIDTH": "1440",
        "BROWSER_VIEWPORT_HEIGHT": "900",
        "ALLOW_LOCAL_NETWORK": "true"
      },
      "disabled": false,
      "autoApprove": ["browser_navigate", "browser_snapshot", "browser_click"]
    }
  },
  "orchestrator": {
    "maxCycles": 12,
    "devServerPort": 3000,
    "harnessDir": ".harness",
    "gitAutoCommit": true,
    "scoreThresholds": {
      "early": 5,
      "mid": 7,
      "late": 8
    }
  }
}

Quick Start Guide

1. Initialize the harness directory: Create .harness/, which will hold spec.md and the per-cycle generator and evaluator reports. Ensure Git is initialized with a .gitignore that excludes node_modules and .next/.
2. Configure MCP settings: Add the Playwright browser MCP to your CLI configuration. Set viewport dimensions, enable auto-approval for navigation/snapshot/click actions, and verify local network access.
3. Run the orchestrator: Execute the TypeScript runner. It will invoke the planner once, then cycle through generator and evaluator sessions. Monitor the .harness/eval-cycle-N.md reports for score progression and architectural pivots.
4. Validate and iterate: After the loop completes, review the best-score branch. Run manual cross-browser testing to verify runtime behavior. If scores plateaued, adjust anti-slop constraints or increase cycle count before production deployment.
