Difficulty: Intermediate

An AR procedure verifier in 50 lines (with Ollama or Claude vision)

By Codcompass Team · 9 min read

Architecting Compounding Vision Agents for Industrial Procedure Verification

Current Situation Analysis

Industrial AR verification systems have historically been built as stateless inference pipelines. A camera captures a frame, the frame is sent to a vision model or LLM, and a pass/fail verdict is returned. This approach treats every verification event as an isolated transaction. The immediate consequence is a complete absence of longitudinal learning. When an operator consistently skips a torque step or misaligns a bracket, the system has no mechanism to remember the pattern, adjust its reasoning, or flag systemic drift.

This problem is frequently overlooked because engineering teams optimize for initial inference accuracy rather than operational compounding. The focus remains on prompt engineering and model selection, while the feedback loop between deployment and model improvement is treated as an afterthought. In reality, the value of a verification system compounds only when it can retain context across sessions, correlate visual descriptions with procedural expectations, and convert operational deltas into structured training signals.

Data from production deployments reveals a clear pattern: zero-shot vision agents plateau quickly on domain-specific procedures. A base model such as Claude with vision input, or Moondream running locally via Ollama, can achieve 70-80% accuracy on generic assembly steps, but its accuracy drops significantly when it faces proprietary tooling, non-standard lighting, or subtle procedural deviations. Without a memory layer and trajectory export, the 200th run performs identically to the first. Conversely, systems that capture full verification trajectories and feed them into Direct Preference Optimization (DPO) pipelines consistently show 15-30% accuracy gains after fine-tuning. The missing link is not model capability; it is the architectural bridge between stateless inference and continuous improvement.

WOW Moment: Key Findings

The transition from isolated API calls to a memory-backed agent loop fundamentally changes the cost/accuracy trajectory of visual verification systems. The table below compares three architectural approaches commonly evaluated in production environments.

| Approach | Context Retention | Cost per Frame | Longitudinal Improvement | Training Data Yield |
|---|---|---|---|---|
| Stateless LLM Call | None | ~$0.01 (Claude) / $0 (Moondream) | Flat (0% gain over time) | Zero |
| Memory-Backed Agent Loop | FTS5/SQLite session history | ~$0.01 (Claude) / $0 (Moondream) | Moderate (10-15% via context injection) | High (ShareGPT trajectories) |
| Fine-Tuned Specialized Model | Embedded in weights | ~$0.005 (optimized routing) | High (20-30% via DPO) | Continuous (self-generating) |

This finding matters because it shifts the engineering focus from prompt iteration to data pipeline design. A memory-backed loop turns every operational session into a labeled dataset. The delta between the agent's verdict and the ground truth becomes a reward signal. When exported in standard formats like ShareGPT, this data feeds directly into DPO pipelines, enabling domain-specific fine-tuning without manual annotation. The system stops being a static checker and becomes a self-improving verification engine.

Core Solution

Building a compounding verification pipeline requires decoupling three concerns: visual description, procedural reasoning, and state management. The architecture below uses a TypeScript runtime for orchestration, a Python sidecar for persistent memory, and pluggable vision adapters.
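The separation of concerns described above can be sketched as three small interfaces wired together by one function. This is an illustrative Python sketch, not the package's API: the article's orchestration layer is TypeScript, and every name here (`VisionAdapter`, `Reasoner`, `SessionMemory`, `verify_step`, and the stub classes) is hypothetical. The stubs stand in for a real vision model and LLM so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Protocol


class VisionAdapter(Protocol):
    """Visual description: turns a frame into text."""
    def describe(self, frame: bytes) -> str: ...


class Reasoner(Protocol):
    """Procedural reasoning: compares a description with the expectation."""
    def judge(self, description: str, expected: str, history: list[str]) -> bool: ...


@dataclass
class SessionMemory:
    """State management: an in-memory stand-in for the persistent sidecar."""
    entries: list[str] = field(default_factory=list)

    def recall(self, step_id: str) -> list[str]:
        return [e for e in self.entries if e.startswith(step_id + ":")]

    def remember(self, step_id: str, note: str) -> None:
        self.entries.append(f"{step_id}: {note}")


class StubVision:
    """Pretends the frame bytes already contain a text description."""
    def describe(self, frame: bytes) -> str:
        return frame.decode("utf-8")


class KeywordReasoner:
    """Passes a step when the expected phrase appears in the description."""
    def judge(self, description: str, expected: str, history: list[str]) -> bool:
        return expected.lower() in description.lower()


def verify_step(step_id: str, frame: bytes, expected: str,
                vision: VisionAdapter, reasoner: Reasoner,
                memory: SessionMemory) -> bool:
    """Run one describe -> recall -> judge -> remember cycle."""
    description = vision.describe(frame)
    history = memory.recall(step_id)  # prior outcomes for this step
    passed = reasoner.judge(description, expected, history)
    memory.remember(step_id, f"{'pass' if passed else 'fail'} | {description}")
    return passed
```

Because each concern sits behind its own interface, a hosted or local vision adapter can be swapped in without touching the reasoning or memory code, which is the pluggability the architecture is aiming for.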

Step 1: Environment & Dependency Setup

The runtime requires Node.js 20+ for the agent loop and Python 3.10+ for the memory sidecar. The sidecar manages FTS5-indexed SQLite storage, enabling full-text search across historical sessions. Install the core dependencies:

npm install @codcompass/vision-agent-core
pip install -r node_modules/@codcompass/vision-agent-core/sidecar/r
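For readers unfamiliar with FTS5, the following is a minimal sketch of the kind of full-text session store the sidecar is described as managing, using only Python's standard `sqlite3` module. The table name, column names, and helper functions are assumptions for illustration; the sidecar's real schema is not shown in the article. It also assumes the SQLite build bundled with your Python has FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a session store backed by an FTS5 virtual table."""
    conn = sqlite3.connect(path)
    # Only 'description' is tokenized for search; the other columns are
    # stored alongside it but excluded from the full-text index.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS sessions USING fts5("
        "session_id UNINDEXED, step_id UNINDEXED, verdict UNINDEXED, description)"
    )
    return conn


def record(conn: sqlite3.Connection, session_id: str, step_id: str,
           verdict: str, description: str) -> None:
    """Persist one verification outcome."""
    conn.execute(
        "INSERT INTO sessions (session_id, step_id, verdict, description) "
        "VALUES (?, ?, ?, ?)",
        (session_id, step_id, verdict, description),
    )
    conn.commit()


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> list[tuple]:
    """Return the best-matching past outcomes for an FTS5 query string."""
    cursor = conn.execute(
        "SELECT session_id, step_id, verdict, description FROM sessions "
        "WHERE sessions MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cursor.fetchall()
```

A call such as `search(conn, "torque")` would then surface earlier sessions whose descriptions mention torque, which is the kind of context a memory-backed loop could inject into the next prompt.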
