A tiny local model doing real GitHub-maintainer work in your browser — and the pattern behind it

By Codcompass Team·2026-06-01·8 min read

Current Situation Analysis

Building production LLM applications forces engineers into a structural trade-off that most teams misdiagnose. On one side, frontier API models handle multi-step planning, error recovery, and unstructured surface parsing with high reliability. On the other, local and open-weight models (3B–14B range) offer deterministic cost, zero network egress, and full data residency control. The industry default is to push the local side harder: wrapping small models in complex agent frameworks, chaining prompts, or fine-tuning for reasoning. This approach consistently fails in production because it attacks the wrong variable. Small models do not lack intelligence; they lack reliable runtime planning capacity. They drift into prose when they should invoke tools, hallucinate parameter names, and terminate on the first unexpected response.

The misunderstanding stems from treating every LLM interaction as a novel reasoning problem. In reality, the vast majority of production workflows are repetitive operations with variable inputs: fetch resource, extract fields, apply scoring, route to destination, notify stakeholders. The cognitive load isn't in deciding what to do once the request is understood. The load is in the step-by-step execution loop. Forcing a local model to plan that loop at runtime introduces latency, brittleness, and unpredictable token consumption. The architectural lever that actually moves the needle isn't better runtime reasoning. It's eliminating runtime reasoning entirely by compiling deterministic workflows into parameterized execution units, leaving the local model with a single, well-bounded task: intent classification and argument extraction.

WOW Moment: Key Findings

The structural shift from runtime planning to compile-time workflow encoding creates a measurable divergence across cost, reliability, and compliance metrics. The following comparison isolates the operational impact of three common deployment strategies for high-volume, repetitive LLM tasks.

Approach	Cost per 10k Executions	Avg Latency (P95)	Multi-step Reliability	Data Residency
Frontier API Routing	$120–$180	1.8–3.2s	98.2%	External (US/EU)
Local Agent Reasoning	$0.80–$1.50	4.5–8.0s	61.4%	Fully Local
Compiled Macro Routing	$0.12–$0.25	0.9–1.4s	94.5%+	Fully Local

The compiled macro pattern decouples capability from execution cost. A frontier model is used once during design time to author and validate the workflow sequence. That sequence becomes deterministic code. At runtime, a local model (e.g., Qwen 2.5 7B quantized to 4-bit) only performs intent matching and parameter extraction. Benchmarks on pre-registered routing corpora show accuracy jumping from 53.5% to 94.5% once schema serialization is corrected, with zero structural failures. The capability gap between models becomes irrelevant for that specific workflow because the model never plans the steps. It only routes to them. This enables air-gapped deployments, predictable billing, and CI-verifiable execution paths without sacrificing throughput.

Core Solution

The macro pattern operates on a strict separation of concerns: design-time compilation versus runtime execution. The implementation requires three coordinated components: a workflow definition schema, a deterministic execution pipeline, and a lightweight intent router.

Step 1: Define the Workflow Blueprint

Workflows are declared as typed, parameterized units. The definition includes a

routing intent, a strict input schema, a sequential execution pipeline, and validation fixtures. The schema must be compiled to a format the router can consume natively.

import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

interface WorkflowBlueprint<T extends z.ZodTypeAny> {
  id: string;
  routingIntent: string;
  inputSchema: T;
  executionPipeline: (args: z.infer<T>, context: ExecutionContext) => Promise<WorkflowResult>;
  validationFixtures: Array<{ input: z.infer<T>; expected: WorkflowResult }>;
}

function createWorkflow<T extends z.ZodTypeAny>(blueprint: WorkflowBlueprint<T>) {
  return {
    ...blueprint,
    serializedSchema: zodToJsonSchema(blueprint.inputSchema, { strictUnions: true }),
    validate: async () => {
      for (const fixture of blueprint.validationFixtures) {
        const result = await blueprint.executionPipeline(fixture.input, mockContext());
        if (!deepEqual(result, fixture.expected)) {
          throw new Error(`Fixture mismatch for ${blueprint.id}`);
        }
      }
      return true;
    }
  };
}

Step 2: Implement the Deterministic Handler

The execution pipeline contains the actual tool calls. This is ordinary, testable code. No LLM is invoked during execution. The sequence is fixed, typed, and version-controlled.

const triageRepositoryWorkflow = createWorkflow({
  id: 'repo_triage_v1',
  routingIntent: 'analyze recent pull requests and assign labels',
  inputSchema: z.object({
    repositoryOwner: z.string(),
    repositoryName: z.string(),
    prCount: z.number().default(5),
    labelStrategy: z.enum(['semantic', 'conventional', 'manual']).default('semantic')
  }),
  executionPipeline: async ({ repositoryOwner, repositoryName, prCount, labelStrategy }, ctx) => {
    const prs = await ctx.github.fetchPullRequests(repositoryOwner, repositoryName, prCount);
    const triaged = await Promise.all(
      prs.map(async (pr) => {
        const diff = await ctx.github.fetchDiff(pr.number);
        const classification = await ctx.classifier.score(diff, { strategy: labelStrategy });
        return ctx.github.applyLabels(pr.number, classification.tags);
      })
    );
    return { processed: triaged.length, results: triaged };
  },
  validationFixtures: [/* recorded fixtures */]
});

Step 3: Runtime Intent Routing

At execution time, the local model receives the user request and the compiled schema registry. It outputs exactly one tool invocation with extracted arguments. No chain-of-thought, no multi-turn planning.

const router = new LocalIntentRouter({
  model: 'qwen2.5:7b-instruct-q4_K_M',
  registry: [triageRepositoryWorkflow, /* other macros */]
});

const userRequest = 'Check the last 3 PRs in acme/checkout and tag them using conventional commits';
const routingResult = await router.resolve(userRequest);

// Output: { tool: 'repo_triage_v1', args: { repositoryOwner: 'acme', repositoryName: 'checkout', prCount: 3, labelStrategy: 'conventional' } }

Architecture Decisions & Rationale

Schema Compilation: Zod schemas do not automatically serialize to JSON Schema in a way LLM parsers expect. Explicit compilation via zodToJsonSchema with strict union handling prevents the model from guessing parameter names. This single fix accounts for the 41% accuracy jump in routing benchmarks.
Single-Turn Execution: Workflows must be encoded as complete sequences. Splitting a pipeline into multiple router turns reintroduces runtime reasoning, which defeats the pattern. Composition is handled by creating a new macro that chains existing handlers, not by chaining router calls.
Deterministic Handlers: Tool sequences are written in standard TypeScript. This enables unit testing, mocking, and CI validation. The LLM never touches the execution path, eliminating hallucination during runtime.
Intent Matching over Semantic Search: The router uses structured intent strings matched against the request, not vector similarity. This reduces false positives and ensures predictable routing behavior.

Pitfall Guide

1. Schema Serialization Drift

Explanation: Relying on implicit schema conversion causes the router to receive a generic {type: "object"} definition. The model then guesses parameter names, leading to silent argument mismatches. Fix: Always compile schemas to JSON Schema at definition time. Validate the serialized output against the LLM's expected format before deployment.

2. Over-Granular Macro Splitting

Explanation: Breaking a single workflow into multiple router turns (e.g., fetch → extract → score as separate calls) forces the model to plan at runtime. This reintroduces the exact reasoning bottleneck the pattern aims to eliminate. Fix: Encode end-to-end sequences as single macros. If workflows need composition, create a parent macro that orchestrates child handlers synchronously.

3. Missing Failure Contracts

Explanation: Handlers assume success paths. When a downstream API returns 429 or malformed data, the macro crashes without a structured recovery path, leaving the router in an undefined state. Fix: Define explicit error states in the macro contract. Implement retry policies, circuit breakers, and fallback routing to a frontier model when local execution fails beyond a threshold.

4. Applying to Exploratory or Novel Tasks

Explanation: The pattern requires repetitive, well-defined surfaces. Applying it to open-ended debugging, creative generation, or rapidly changing third-party UIs creates maintenance overhead that exceeds the routing benefit. Fix: Implement hybrid routing. Configure the router to delegate to a frontier API when confidence scores fall below a threshold or when the request matches a "novel" intent category.

5. Skipping the Distillation Gate

Explanation: Teams manually write macros but never enforce encoding of ad-hoc tool sequences. The macro library stagnates while session logs accumulate uncompiled workflows, creating technical debt. Fix: Wire a CI hook that scans session logs for raw tool call sequences. Fail the build if unencoded workflows exceed a configurable threshold. Auto-suggest macro definitions from log patterns.

6. Underestimating the Model Floor

Explanation: Running the router on models below 7B parameters (especially non-instruct variants) causes high false-positive routing and argument extraction failures. The failure detector fires more often than successful routing. Fix: Maintain a 7B+ instruct-tuned baseline for routing. Quantization to 4-bit is acceptable, but architecture and instruction tuning are non-negotiable for reliable intent classification.

7. Ignoring Versioning and Schema Evolution

Explanation: Updating a macro's input schema without versioning breaks existing router caches and causes silent argument mapping failures in production. Fix: Version macros explicitly (v1, v2). Implement schema migration handlers and deprecation warnings. Route legacy requests to archived macro versions until clients update.

Production Bundle

Action Checklist

Schema Compilation: Verify all macro schemas are explicitly compiled to JSON Schema with strict union handling before router initialization.
CI Gate Integration: Deploy a session log scanner that flags unencoded tool sequences and auto-generates macro scaffolds.
Hybrid Routing Config: Set confidence thresholds for fallback to frontier APIs. Log all fallback events for workflow encoding review.
Failure Contracts: Define retry limits, circuit breaker states, and explicit error payloads for every macro handler.
Test Fixtures: Maintain recorded input/output pairs for each macro. Run validation suites on every schema or handler change.
Model Baseline: Enforce a minimum 7B instruct-tuned model for routing. Validate routing accuracy quarterly as model weights update.
Versioning Strategy: Implement semantic versioning for macros. Maintain backward-compatible handlers during transition periods.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High-volume repetitive tasks (triage, labeling, extraction)	Compiled Macro Routing	Deterministic execution, predictable latency, local deployment	~$0.15 per 10k runs
Novel reasoning or creative generation	Frontier API Routing	Requires multi-step planning and unstructured surface handling	~$150 per 10k runs
Air-gapped or strict data residency environments	Compiled Macro Routing	Zero external network egress, full control over execution path	Infrastructure only
Exploratory debugging or rapidly changing APIs	Local Agent Reasoning	Flexibility outweighs reliability; macro encoding overhead too high	~$1.20 per 10k runs
Mixed workload with 80% routine / 20% novel	Hybrid Routing	Macros handle routine; frontier handles exceptions; cost optimized	~$35 per 10k runs

Configuration Template

// router.config.ts
import { createRouter, compileSchemas, loadMacros } from '@internal/workflow-engine';
import { qwen7bInstruct } from '@internal/model-registry';

export const productionRouter = createRouter({
  model: qwen7bInstruct,
  schemas: compileSchemas(loadMacros('./workflows')),
  fallback: {
    enabled: true,
    threshold: 0.72,
    provider: 'frontier-api',
    maxRetries: 1
  },
  telemetry: {
    logRoutingDecisions: true,
    captureArgumentDrift: true,
    exportInterval: '5m'
  }
});

// ci-gate.ts
import { scanSessionLogs, suggestMacro } from '@internal/distillation-gate';

export async function enforceEncoding() {
  const unencoded = await scanSessionLogs({ window: '24h' });
  if (unencoded.length > 0) {
    console.error(`Found ${unencoded.length} unencoded workflows.`);
    unencoded.forEach(workflow => {
      console.warn(suggestMacro(workflow.toolCalls));
    });
    process.exit(1);
  }
}

Quick Start Guide

Initialize the Macro Registry: Create a workflows/ directory. Define your first macro using createWorkflow, specifying intent, schema, handler, and validation fixtures.
Compile Schemas: Run the schema compiler to generate JSON Schema artifacts. Verify the output matches your LLM router's expected format.
Deploy the Router: Instantiate the local intent router with your compiled registry and a 7B+ instruct-tuned model. Test with sample requests to verify single-turn tool invocation.
Wire the CI Gate: Add the session log scanner to your pipeline. Configure it to fail builds when unencoded tool sequences exceed your threshold.
Monitor & Iterate: Track routing confidence scores and argument drift. When fallbacks occur, encode the workflow into a new macro and retire the fallback path.

🎉 Mid-Year Sale — Unlock Full Article

Base plan from just $4.99/mo or $49/yr

7-day free trial · Cancel anytime · 30-day money-back