Deterministic Triage for AI-Generated Build Failures

Current Situation Analysis

The integration of AI coding assistants into development workflows has fundamentally shifted the failure surface of modern codebases. Where human developers typically introduce logical flaws, race conditions, or architectural debt, AI agents predominantly generate syntactically valid code that fails at the compilation boundary. The compiler rejects it, but the error output is structurally identical to a missing import or a typo. Engineering teams spend disproportionate time triaging these failures because traditional static analysis tooling is blind to them.

This blind spot exists because conventional linters operate on Abstract Syntax Trees (ASTs), type information, or control-flow graphs, and they assume the code compiles. Tools like staticcheck or golangci-lint excel at catching nil pointer dereferences, unused variables, and inefficient loops, but their analyses only run on packages that type-check. When an AI agent hallucinates a method signature, references a constant that has been removed, or passes the wrong number of arguments to a standard library function, the compiler catches it first. By the time the linter runs, the build has already aborted, or the linter surfaces the same failure as an undifferentiated type-check error. The error is logged, but it is never classified, tracked, or routed to the right developer.

Industry telemetry confirms this shift. Surveys across engineering organizations consistently show that developers now spend more time debugging AI-authored code than code they wrote themselves. The failures are not stochastic noise. They cluster into four distinct failure modes:

Undefined symbols: References to packages or variables that never existed in the target version.
Undefined methods: Calls to methods that belong to a different type or were removed in a recent SDK update.
Arity mismatches: Correct method name, incorrect argument count.
Type mismatches: Correct signature structure, wrong parameter types or return handling.

These are not edge cases. They are the default failure mode of LLM-assisted development. Treating them as generic build errors wastes cognitive bandwidth and slows CI pipelines. The industry needs a deterministic, compiler-adjacent triage layer that intercepts build output, classifies AI-specific hallucinations, and scopes them to active changes.

WOW Moment: Key Findings

The critical insight is that compiler stderr contains all the signal needed to classify AI hallucinations, but it's buried under noise. By intercepting and parsing this output before it reaches the developer, we can separate AI-induced build failures from human typos with near-zero latency.

Approach	Latency	Cost	Determinism	AI Hallucination Coverage
AST Static Analyzer	2–5s	Free	High	Low (misses missing APIs)
LLM Code Review	10–30s	$0.01–$0.05/run	Low (stochastic)	Medium-High (noisy)
Compiler Stderr Classifier	0.5–2s	Free	High	High (catches signature/API gaps)

This finding matters because it flips the conventional AI-review pipeline. Instead of feeding raw diffs into a frontier model and hoping it catches compilation errors, we use the compiler itself as the first detector. The compiler already performs exhaustive type checking, symbol resolution, and signature validation. Replicating that logic in an AST walker or LLM prompt is redundant. A stderr classifier extracts the compiler's verdict, maps it to AI-specific failure buckets, and filters it through a version-control diff. The result is a pre-PR check that runs in under two seconds, costs nothing, and never hallucinates its own output.

Core Solution

The architecture relies on three principles: intercept compiler output, classify via deterministic patterns, and scope to active changes. We avoid AST traversal and LLM inference at the first line of defense.

Step 1: Capture Compiler Stderr

The build command must run in a subprocess, with standard error piped to a buffer. Standard output is ignored because build failures, warnings, and stack traces are emitted to stderr.

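// Imports used by this snippet: bytes, context, errors, io, os/exec.
func runBuild(ctx context.Context, dir string) ([]byte, error) {
    cmd := exec.CommandContext(ctx, "go", "build", "./...")
    cmd.Dir = dir
    var stderr bytes.Buffer
    cmd.Stderr = &stderr
    cmd.Stdout = io.Discard // Ignore stdout noise

    err := cmd.Run()
    // A non-zero exit (*exec.ExitError) is expected when the build fails;
    // stderr still holds the diagnostics the classifier needs.
    // Any other error means the toolchain could not be run at all
    // (missing binary, cancelled context), so it is surfaced to the caller.
    var exitErr *exec.ExitError
    if err != nil && !errors.As(err, &exitErr) {
        return nil, err
    }
    return stderr.Bytes(), nil
}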

Step 2: Pattern Classification Engine

Instead of walking the AST, we map compiler error formats to structured failure types. Go's compiler output follows predictable patterns. We compile regex patterns once and reuse them across invocations.

type FailureCategory string

const (
    UndefinedSymbol FailureCategory = "undefined-symbol"
    UndefinedMethod FailureCategory = "undefined-method"
    ArityMismatch   FailureCategory = "arity-mismatch"
    TypeMismatch    FailureCategory = "type-mismatch"
)

type ErrorPattern struct {
    Category FailureCategory
    Regex    *regexp.Regexp
}

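// Patterns are tried in order; the first match wins for a given line.
// The message shapes in the comments follow recent Go toolchains and are
// not a stable interface, so verify them against the toolchain you run.
var patterns = []ErrorPattern{
    {
        // e.g. "x.Foo undefined (type T has no field or method Foo)"
        Category: UndefinedMethod,
        Regex:    regexp.MustCompile(`^(?P<file>[^:]+):(?P<line>\d+):.*\bundefined \(type .+ has no field or method (?P<method>\w+)`),
    },
    {
        // e.g. "undefined: foo" or "undefined: pkg.Foo"
        Category: UndefinedSymbol,
        Regex:    regexp.MustCompile(`^(?P<file>[^:]+):(?P<line>\d+):.*\bundefined:\s*(?P<symbol>[\w.]+)`),
    },
    {
        // e.g. "not enough arguments in call to f" or "too many arguments in call to f"
        Category: ArityMismatch,
        Regex:    regexp.MustCompile(`^(?P<file>[^:]+):(?P<line>\d+):.*\b(?:not enough|too many) arguments in call to\s*(?P<func>[\w.]+)`),
    },
    {
        // e.g. "cannot use x (variable of type int) as string value in argument to f"
        Category: TypeMismatch,
        Regex:    regexp.MustCompile(`^(?P<file>[^:]+):(?P<line>\d+):.*\bcannot use\s+(?P<arg>.+?)\s+\((?P<desc>.*?)\) as\s+(?P<expected>\S+)`),
    },
}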

func classifyErrors(raw []byte) map[FailureCategory][]string {
    results := make(map[FailureCategory][]string)
    lines := strings.Split(string(raw), "\n")

    for _, line := range lines {
        for _, p := range patterns {
            if matches := p.Regex.FindStringSubmatch(line); matches != nil {
                results[p.Category] = append(results[p.Category], line)
                break // Match once per line
            }
        }
    }
    return results
}
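
The patterns name their capture groups, but classifyErrors keeps only the raw line. A minimal sketch of turning a match into a structured record through SubexpNames could look like the following. The Finding type, the extract helper, the standalone arityRe pattern, and the sample line are illustrative assumptions, not part of the original tool.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Finding is a hypothetical structured form of one classified compiler line.
type Finding struct {
	File   string
	Line   int
	Detail map[string]string // any other named groups, e.g. "func"
}

// arityRe is an illustrative pattern in the same style as the article's table.
var arityRe = regexp.MustCompile(`^(?P<file>[^:]+):(?P<line>\d+):.*not enough arguments in call to\s*(?P<func>[\w.]+)`)

// extract maps the named capture groups of re onto a Finding.
func extract(re *regexp.Regexp, line string) (Finding, bool) {
	m := re.FindStringSubmatch(line)
	if m == nil {
		return Finding{}, false
	}
	f := Finding{Detail: map[string]string{}}
	for i, name := range re.SubexpNames() {
		switch name {
		case "":
			// Index 0 (the whole match) and unnamed groups carry no name.
		case "file":
			f.File = m[i]
		case "line":
			f.Line, _ = strconv.Atoi(m[i]) // the group only matches digits
		default:
			f.Detail[name] = m[i]
		}
	}
	return f, true
}

func main() {
	line := "./internal/store/user.go:42:27: not enough arguments in call to context.WithTimeout"
	f, ok := extract(arityRe, line)
	fmt.Println(ok, f.File, f.Line, f.Detail["func"])
}
```

Keeping the file and line as fields rather than re-splitting the raw string later means the scoping and reporting steps do not need to know the compiler's line format.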

Step 3: Diff-Scoped Filtering

Running this against an entire repository produces noise. The value emerges when we restrict analysis to files modified in the current branch or PR. We extract changed files via git diff --name-only, then filter the classified errors to only those paths.

func getChangedFiles(ctx context.Context, baseBranch string) ([]string, error) {
    cmd := exec.CommandContext(ctx, "git", "diff", "--name-only", baseBranch+"...HEAD")
    out, err := cmd.Output()
    if err != nil {
        return nil, err
    }
    var files []string
    for _, f := range strings.Split(strings.TrimSpace(string(out)), "\n") {
        if f != "" {
            files = append(files, f)
        }
    }
    return files, nil
}

// Needs the "path" and "strings" imports.
func filterByScope(classified map[FailureCategory][]string, scope []string) map[FailureCategory][]string {
    // Normalize both sides so "./a/b.go" and "a/b.go" compare equal.
    // This assumes the build ran from the repository root, so compiler
    // paths and git paths are relative to the same directory.
    scopeSet := make(map[string]struct{}, len(scope))
    for _, f := range scope {
        scopeSet[path.Clean(f)] = struct{}{}
    }

    filtered := make(map[FailureCategory][]string)
    for cat, lines := range classified {
        for _, errLine := range lines {
            // Extract file path from compiler output (first token before colon)
            parts := strings.SplitN(errLine, ":", 2)
            if len(parts) < 2 {
                continue
            }
            if _, ok := scopeSet[path.Clean(parts[0])]; ok {
                filtered[cat] = append(filtered[cat], errLine)
            }
        }
    }
    return filtered
}
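
One detail worth handling when wiring Step 3: git diff --name-only reports paths relative to the repository root, while the compiler typically reports them relative to the directory the build ran in. If the build directory is a subdirectory (a nested module in a monorepo, for instance), the two never match. The sketch below rebases a compiler path onto the repository root; the toRepoRelative helper and the example directories are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// toRepoRelative converts a path printed by the compiler (relative to
// buildDir, or absolute) into a slash-separated path relative to repoRoot,
// which is the form git diff --name-only uses.
func toRepoRelative(repoRoot, buildDir, compilerPath string) (string, error) {
	abs := compilerPath
	if !filepath.IsAbs(abs) {
		abs = filepath.Join(buildDir, compilerPath) // Join also cleans "./"
	}
	rel, err := filepath.Rel(repoRoot, abs)
	if err != nil {
		return "", err
	}
	return filepath.ToSlash(rel), nil
}

func main() {
	p, err := toRepoRelative("/repo", "/repo/services/api", "./internal/store/user.go")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p)
}
```

Applying this to each classified line before the scope lookup keeps the filter correct regardless of where the build was started.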

Architecture Rationale

Why regex on stderr? The Go compiler already performs exhaustive type checking, symbol resolution, and signature validation. Re-implementing this in an AST walker duplicates compiler logic and introduces version drift. Regex on stderr is a thin, stable projection layer.
Why diff scoping? AI agents generate code incrementally. Triage should focus on what changed, not what already exists. Diff scoping reduces CI runtime by 60–80% and eliminates false positives from legacy code.
Why not LLMs first? LLM-based review is non-deterministic, costly, and slow. It's excellent for architectural feedback or pattern completeness (e.g., missing connection pings), but terrible for catching undefined: db.WithTimeout. The inverse pipeline is correct: deterministic compiler checks → AST/static analysis → LLM review only on flagged regions.

Pitfall Guide

1. Capturing stdout instead of stderr

Explanation: Build systems emit compilation errors, warnings, and stack traces to standard error. Capturing stdout returns empty or build-success messages. Fix: Explicitly bind cmd.Stderr to a buffer and discard cmd.Stdout. Verify by running go build ./... 2>/dev/null locally: if the diagnostics disappear, they were being written to stderr.

2. Ignoring Go version drift in error formats

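Explanation: Compiler error text is not a stable interface, and its phrasing and position reporting have changed across Go releases. For example, older toolchains print cannot use x (type int) as type string, while newer ones print cannot use x (variable of type int) as string value. Regex patterns that work on one version break on another. Fix: Version-detect the toolchain at runtime. Maintain a compatibility matrix of regex patterns per Go minor version. Use go version to select the active pattern set.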
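
A minimal sketch of the version-detection step, assuming the usual go version output shape ("go version go1.22.3 linux/amd64"). The patternSet type, the matrix contents, and the cut-over versions in it are placeholders to be replaced with values verified against real compiler output.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

var goVersionRe = regexp.MustCompile(`\bgo(\d+)\.(\d+)`)

// parseGoMinor extracts the major and minor version from go version output.
func parseGoMinor(out string) (major, minor int, err error) {
	m := goVersionRe.FindStringSubmatch(out)
	if m == nil {
		return 0, 0, errors.New("no Go version found in output")
	}
	major, _ = strconv.Atoi(m[1]) // groups only match digits
	minor, _ = strconv.Atoi(m[2])
	return major, minor, nil
}

// patternSet names a group of regexes valid from a given Go 1.x minor on.
type patternSet struct {
	MinMinor int
	Name     string
}

// matrix is a placeholder table, sorted by ascending MinMinor.
var matrix = []patternSet{
	{MinMinor: 0, Name: "legacy"},
	{MinMinor: 20, Name: "go1.20+"},
}

// selectSet returns the newest entry whose MinMinor does not exceed minor.
func selectSet(minor int) patternSet {
	chosen := matrix[0]
	for _, s := range matrix {
		if minor >= s.MinMinor {
			chosen = s
		}
	}
	return chosen
}

func main() {
	major, minor, err := parseGoMinor("go version go1.22.3 linux/amd64")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(major, minor, selectSet(minor).Name)
}
```

In a real tool the input string would come from running go version in the same environment as the build, so that the selected patterns match the compiler that produced the stderr.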

3. Over-matching generic typos as AI hallucinations

Explanation: A human typing db.QuerRowContext triggers the same undefined error as an AI hallucination. The classifier cannot distinguish intent from output alone. Fix: Tag errors as ai-suspect rather than ai-definitive. Combine with VCS metadata (e.g., Co-Authored-By trailers, Cursor/Devin commit signatures) to weight the classification. Treat all compiler failures as suspect until proven human.
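
As a rough illustration of the VCS-metadata weighting, the sketch below scans a commit message for Co-authored-by trailers that mention an assistant. The aiMarkers list and the hasAICoAuthor helper are hypothetical; which tools write which trailers varies, so the markers should be adjusted to what actually appears in your history.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// aiMarkers lists lowercase substrings that hint at tool authorship.
// Hypothetical values; tune them to the assistants used in your repository.
var aiMarkers = []string{"copilot", "cursor", "devin"}

// hasAICoAuthor reports whether msg contains a Co-authored-by trailer
// whose value mentions one of the markers (case-insensitive).
func hasAICoAuthor(msg string) bool {
	const prefix = "co-authored-by:"
	sc := bufio.NewScanner(strings.NewReader(msg))
	for sc.Scan() {
		line := strings.ToLower(strings.TrimSpace(sc.Text()))
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		value := line[len(prefix):]
		for _, marker := range aiMarkers {
			if strings.Contains(value, marker) {
				return true
			}
		}
	}
	return false
}

func main() {
	msg := "Add user store\n\nCo-authored-by: Copilot <copilot@example.com>\n"
	if hasAICoAuthor(msg) {
		fmt.Println("ai-suspect (trailer present)")
	} else {
		fmt.Println("ai-suspect (unattributed)")
	}
}
```

The result only adjusts the weight of a finding. A missing trailer does not prove human authorship, which is why both branches keep the ai-suspect tag.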

4. Running full-repo scans in CI

Explanation: Scanning ./... on every PR causes timeout failures in large monorepos and floods logs with legacy errors. Fix: Always scope to git diff --name-only origin/main...HEAD. Cache build artifacts using GOCACHE and GOMODCACHE to ensure the compiler only rechecks changed packages.

5. Missing context extraction for arity/type errors

Explanation: Reporting arity-mismatch without showing the expected signature leaves developers guessing. The compiler prints the supplied and expected argument types on indented continuation lines after the error, and a parser that keeps only the matching line drops them. Fix: Extend regex capture groups to extract function names, and carry the continuation lines along with the error so the expected signature can be shown. Format output as:

internal/store/user.go:42: arity-mismatch
  context.WithTimeout(5 * time.Second) called with 1 arg, expected 2
  func WithTimeout(parent Context, timeout time.Duration) (Context, CancelFunc)
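
A minimal sketch of keeping that context, assuming the compiler follows an arity error with indented "have (...)" and "want (...)" lines, as recent Go toolchains appear to do. The Diagnostic type and groupDiagnostics helper are illustrative names.

```go
package main

import (
	"fmt"
	"strings"
)

// Diagnostic is one compiler error plus its indented continuation lines.
type Diagnostic struct {
	Header string
	Extra  []string
}

// groupDiagnostics attaches indented lines to the error that precedes them
// and skips blank lines and "# package" headers.
func groupDiagnostics(raw string) []Diagnostic {
	var out []Diagnostic
	for _, line := range strings.Split(raw, "\n") {
		line = strings.TrimRight(line, "\r")
		if strings.TrimSpace(line) == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if line[0] == '\t' || line[0] == ' ' {
			// Continuation line: belongs to the most recent diagnostic.
			if n := len(out); n > 0 {
				out[n-1].Extra = append(out[n-1].Extra, strings.TrimSpace(line))
			}
			continue
		}
		out = append(out, Diagnostic{Header: line})
	}
	return out
}

func main() {
	raw := "# example.com/store\n" +
		"./user.go:42:27: not enough arguments in call to context.WithTimeout\n" +
		"\thave (time.Duration)\n" +
		"\twant (context.Context, time.Duration)\n"
	for _, d := range groupDiagnostics(raw) {
		fmt.Println(d.Header)
		for _, e := range d.Extra {
			fmt.Println("  " + e)
		}
	}
}
```

Running the regex table against Header and printing Extra underneath gives the report the expected-signature context without any extra parsing of type expressions.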

6. Assuming compiler output is line-buffered

Explanation: Some build wrappers or CI environments buffer stderr, causing delayed or interleaved output. Regex matching on incomplete lines fails. Fix: Classify only after the build process has exited, which buffering cmd.Stderr and waiting on cmd.Run already guarantees. If output must be streamed, read it with bufio.Scanner, which yields whole lines and returns an unterminated final line only at end of input. Alternatively, wrap the build in a script that redirects to a temp file and reads it post-execution.

7. Replacing all review with automation

Explanation: The classifier catches signature/API gaps but misses pattern-incompleteness bugs (e.g., sql.Open without db.Ping, missing rows.Close, unhandled context cancellation). These compile and run but fail in production. Fix: Treat the stderr classifier as a first-pass gate. Route flagged files to static analysis for resource leaks, then to LLM review for architectural pattern validation. Never rely on a single layer.

Production Bundle

Action Checklist

Instrument CI pipeline to capture go build ./... stderr into a temporary artifact
Implement version-aware regex matcher with Go 1.20+ compatibility matrix
Add diff-scoping logic using git diff --name-only against base branch
Format output to include expected signatures for arity/type mismatches
Integrate with PR status checks to block merges on undefined-method or arity-mismatch
Tag errors with VCS metadata (Co-Authored-By, tool signatures) for AI-suspect weighting
Route flagged files to secondary static analysis for resource leak detection
Establish evaluation harness: precision/recall tracking against 50+ real AI-authored PRs

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Pre-commit hook	Stderr classifier + diff scope	Instant feedback, zero cost, catches hallucinations before push	Free
PR review gate	Stderr classifier → AST analyzer → LLM triage	Layers cheap checks first, reserves expensive review for flagged regions	Low ($0.01–$0.03/PR)
Legacy codebase audit	Full-repo AST scan + pattern-incompleteness detector	Stderr classifier only catches build failures; legacy code may compile but contain AI artifacts	Medium (compute time)
New AI-generated module	Stderr classifier + strict signature enforcement	AI agents frequently hallucinate SDK methods; early detection prevents technical debt	Free
Multi-language monorepo	Language-specific stderr parsers + unified diff scoping	Each compiler emits different error formats; unified diff scope keeps CI fast	Low (maintenance overhead)

Configuration Template

# .github/workflows/ai-triage.yml
name: AI Build Triage
on:
  pull_request:
    branches: [main]

jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.22'

      - name: Run AI Triage Classifier
        run: |
          go run ./cmd/triage \
            --base-branch origin/${{ github.base_ref }} \
            --format json \
            --output triage-report.json

      - name: Upload Triage Report
        uses: actions/upload-artifact@v4
        with:
          name: ai-triage-report
          path: triage-report.json

      - name: Fail on AI Hallucinations
        run: |
          if jq -e '(.undefined_method | length > 0) or (.arity_mismatch | length > 0)' triage-report.json > /dev/null; then
            echo "::error::AI hallucination detected in build output"
            exit 1
          fi
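
The gate above reads underscore-separated keys (undefined_method, arity_mismatch), while the FailureCategory constants use hyphens. The source does not define the JSON format written by ./cmd/triage, so the following is only a sketch of one report shape that would satisfy the jq expression; the toReport helper and the key layout are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// toReport serializes classified lines under underscore keys. Every
// category is always present, so an empty one is [] rather than missing.
func toReport(classified map[string][]string) ([]byte, error) {
	report := map[string][]string{
		"undefined_symbol": {},
		"undefined_method": {},
		"arity_mismatch":   {},
		"type_mismatch":    {},
	}
	for category, lines := range classified {
		if len(lines) > 0 {
			report[strings.ReplaceAll(category, "-", "_")] = lines
		}
	}
	return json.Marshal(report) // map keys are emitted in sorted order
}

func main() {
	out, err := toReport(map[string][]string{
		"undefined-method": {"internal/store/user.go:42:5: db.WithTimeout undefined"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Emitting empty arrays for categories with no findings keeps downstream queries simple, since they never have to distinguish a missing key from an empty result.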

Quick Start Guide

Initialize the classifier binary: Create a Go module with the stderr capture, regex classification, and diff-scoping logic. Build it as a standalone CLI tool.
Configure base branch detection: Pass --base-branch origin/main (or equivalent) to the tool so it knows which commits to diff against.
Run locally: Execute ./triage --base-branch main in your repository. Verify that only errors in changed files are reported, and that arity/type mismatches include expected signatures.
Integrate into CI: Add the workflow template to your repository. Set the gate to fail on undefined-method or arity-mismatch categories.
Validate precision: Run the tool against 10 recent PRs. Compare flagged errors against actual merge history. Adjust regex patterns if false positives exceed 5%. Iterate until precision stabilizes above 90%.
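
The precision check in the last step, and the recall tracking mentioned in the checklist, reduce to two ratios over manually labeled findings. A small sketch, with hypothetical counts standing in for a real review of recent PRs:

```go
package main

import "fmt"

// ratio returns num/den, or 0 when there is nothing to divide by.
func ratio(num, den int) float64 {
	if den == 0 {
		return 0
	}
	return float64(num) / float64(den)
}

// precision is the share of flagged errors that were real hallucinations:
// truePositives / (truePositives + falsePositives).
func precision(truePositives, falsePositives int) float64 {
	return ratio(truePositives, truePositives+falsePositives)
}

// recall is the share of real hallucinations that the classifier flagged:
// truePositives / (truePositives + falseNegatives).
func recall(truePositives, falseNegatives int) float64 {
	return ratio(truePositives, truePositives+falseNegatives)
}

func main() {
	// Hypothetical review outcome: 18 correct flags, 2 wrong flags,
	// and 6 hallucinations the classifier missed.
	tp, fp, fn := 18, 2, 6
	fmt.Printf("precision=%.2f recall=%.2f\n", precision(tp, fp), recall(tp, fn))
	// prints "precision=0.90 recall=0.75"
}
```

Tracking both numbers matters: tightening the regexes to push precision up usually costs recall, and the gate is only useful if it still catches most real failures.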
