An AR procedure verifier in 50 lines (with Ollama or Claude vision)
Building Self-Improving AR Quality Assurance Pipelines with Visual Memory and DPO Feedback
Current Situation Analysis
Augmented Reality (AR) quality assurance in industrial settings often hits a scalability wall. The standard approach involves capturing frames from a device camera and sending them to a Large Language Model (LLM) for verification. While functional for isolated checks, this "one-shot" pattern fails in production environments where procedures repeat and operators evolve.
The critical deficiency in naive implementations is the lack of state. When an LLM verifies a bolt torque or a bracket alignment, it processes the frame, returns a verdict, and discards the context. There is no memory of previous failures, no accumulation of domain-specific patterns, and no mechanism to improve based on historical data. If an operator is still skipping the same sub-step on the 200th assembly, a stateless model treats it exactly like the first occurrence and offers no adaptive feedback or trend analysis.
Furthermore, production AR systems require a feedback loop that extends beyond inference. Without capturing the delta between the expected procedure and the model's judgment, organizations miss the opportunity to generate high-value training data. This gap prevents the system from learning, leaving the verification accuracy static regardless of deployment duration.
The trade-offs are concrete. Local vision models such as Moondream, served through Ollama, have zero marginal cost per frame but may need more context engineering to match the precision of cloud-based alternatives such as Claude Vision, which typically cost approximately $0.01 per frame. The architectural choice must weigh this cost against the need for a persistent memory layer, often implemented with SQLite and its FTS5 extension, which provides full-text (keyword) search across historical sessions rather than embedding-based semantic search.
WOW Moment: Key Findings
The transition from stateless inference to a memory-backed loop fundamentally changes the value proposition of AR verification. By integrating trajectory capture and reward signal calculation, the system transforms from a passive checker into an active learning engine.
The following comparison illustrates the operational divergence between a standard implementation and a loop-based architecture using tools like OpenEye:
| Approach | Context Retention | Learning Capability | Reward Signal | Data Exportability |
|---|---|---|---|---|
| Naive One-Shot | None. Discards frame after response. | None. Accuracy remains static. | None. Binary pass/fail only. | Manual logging required. |
| Memory-Backed Loop | Persistent. FTS5/SQLite enables cross-session search. | High. DPO-ready trajectories generated automatically. | Weighted. (passes + 0.5 * uncertain) / total. | Native JSONL export for fine-tuning. |
This finding matters because it enables continuous improvement. The reward signal formula (passes + 0.5 * uncertain) / total provides a nuanced metric that accounts for model hesitation, which is crucial for safety-critical procedures. The exported JSONL files contain not just the verdict, but the reasoning chain and visual context, making them immediately usable for Direct Preference Optimization (DPO) fine-tuning. This allows organizations to train domain-specific models that outperform base models on proprietary procedures over time.
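As a minimal sketch of the reward formula above, the function below counts verdicts and applies the 0.5 weight to uncertain ones. The `Verdict` type, the function name, and the choice to score an empty session as 0 are assumptions made for illustration, not part of any library API.

```typescript
// Verdict labels follow the article; the type name is an assumption.
type Verdict = "pass" | "fail" | "uncertain";

// Reward = (passes + uncertainWeight * uncertain) / total,
// where total counts every verdict in the session.
function rewardSignal(verdicts: Verdict[], uncertainWeight = 0.5): number {
  if (verdicts.length === 0) return 0; // assumption: an empty session scores 0
  const passes = verdicts.filter((v) => v === "pass").length;
  const uncertain = verdicts.filter((v) => v === "uncertain").length;
  return (passes + uncertainWeight * uncertain) / verdicts.length;
}

// Two passes, one uncertain, one fail: (2 + 0.5 * 1) / 4 = 0.625
const sessionReward = rewardSignal(["pass", "uncertain", "fail", "pass"]);
console.log(sessionReward);
```

Exposing the weight as a parameter keeps it easy to compare alternative weightings against ground truth before committing to one.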
Core Solution
Implementing a self-improving AR verification pipeline requires three architectural components: a visual session manager, an agent loop with memory, and a trajectory exporter. The solution leverages a TypeScript frontend for orchestration and a Python sidecar for state management, utilizing SQLite with FTS5 for efficient memory retrieval.
Implementation Strategy
- Environment Setup: The system requires Node.js 20+ for the agent logic and Python 3.10+ for the sidecar that manages the SQLite memory store. Vision capabilities are provided by either a local Ollama instance running Moondream or an API key for Claude Vision.
- Agent Initialization: Create an agent instance configured with a system prompt that enforces precision. The prompt should instruct the model to use `verify_step` with a result of `uncertain` when visual evidence is ambiguous, preventing hallucination.
- Session Management: Initialize a visual session tied to a specific device type and procedure ID. This session acts as the container for all frame logs and verdicts.
- Verification Loop: Iterate through procedure steps. For each step:
  - Capture the frame.
  - Generate a scene description using the vision model.
  - Log the frame to the session with sequence numbering and step context.
  - Prompt the agent to verify the step based on the description.
- Trajectory Export: Upon session completion, export the conversation history, verdicts, and reward signals to a JSONL file. This file serves as the training dataset for DPO.
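The export step can be pictured as one JSON object per line. The sketch below is only an illustration of the JSONL layout: the `TrajectoryRecord` and `StepRecord` shapes are hypothetical, and the fields actually written by the library's exporter are not specified in this article.

```typescript
// Hypothetical record shapes, chosen to mirror what the article says is exported:
// verdicts, reasoning, visual context, and a reward signal.
interface StepRecord {
  stepId: string;
  sceneDescription: string;
  verdict: "pass" | "fail" | "uncertain";
  reasoning: string;
}

interface TrajectoryRecord {
  visualSessionId: string;
  procedureId: string;
  steps: StepRecord[];
  reward: number;
}

// JSONL: each record is serialized on its own line, terminated by a newline.
function toJsonl(records: TrajectoryRecord[]): string {
  return records.map((record) => JSON.stringify(record) + "\n").join("");
}

const sampleRecords: TrajectoryRecord[] = [
  {
    visualSessionId: "session-001",
    procedureId: "m6-bolt-assembly-v2",
    steps: [
      {
        stepId: "align-bracket",
        sceneDescription: "Bracket sits flush against the rail.",
        verdict: "pass",
        reasoning: "No visible gap between bracket and rail.",
      },
    ],
    reward: 1,
  },
];

console.log(toJsonl(sampleRecords));
```

DPO trainers such as TRL generally expect `prompt`, `chosen`, and `rejected` fields, so records like these would likely need a conversion step before training.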
Code Implementation
The following TypeScript example demonstrates the pipeline end to end: it creates the agent, opens a visual session, verifies each step, and exports the trajectory.
```typescript
import {
  QualityAgent,
  configureProviders,
  createStreamHandler,
  CLAUDE_SONNET_VISION,
} from "@dumbspacecookie/openeye";
import { analyzeFrameWithMoondream } from "./vision-adapter.js";
import { readFileSync } from "node:fs";

// Initialize provider configuration
configureProviders();

// Define the verification agent with strict uncertainty handling
const qaVerifier = await QualityAgent.create({
  model: CLAUDE_SONNET_VISION,
  streamFn: createStreamHandler(),
  systemPrompt:
    "You are an AR procedure verifier. Validate assembly steps with precision. " +
    "If visual evidence is insufficient, return result='uncertain'. Never guess.",
  tenantId: "manufacturing-floor-alpha",
});

// Establish a visual session for the specific procedure
const sessionHandle = await qaVerifier.client.createVisualSession({
  deviceType: "industrial-tablet",
  procedureId: "m6-bolt-assembly-v2",
  procedureName: "M6 Bolt Assembly Protocol",
});

// Define procedure steps with explicit expectations
const procedureSteps = [
  { id: "align-bracket", expectation: "Bracket flush against rail, no gap" },
  { id: "thread-bolt", expectation: "M6 bolt hand-threaded, two full turns" },
  { id: "torque-apply", expectation: "Torque wrench applied, reading 8 Nm" },
];

// Execute verification loop
for (const [index, step] of procedureSteps.entries()) {
  const framePath = `./captures/step-${index + 1}.jpg`;
  const frameData = readFileSync(framePath);

  // Generate scene description via vision model
  const sceneContext = await analyzeFrameWithMoondream(
    frameData,
    `Operator action: ${step.expectation}. Detail hand position, tool state, and bolt alignment.`
  );

  // Log frame to persistent memory
  await qaVerifier.client.logFrame({
    visualSessionId: sessionHandle.id,
    sequenceNum: index + 1,
    sceneDescription: sceneContext,
    stepContext: step.id,
  });

  // Request verification from agent
  await qaVerifier.prompt(
    `Frame ${index + 1}: ${sceneContext}\n\nVerify step ${step.id} (${step.expectation}).`
  );
}

// Finalize session and export training data
await qaVerifier.client.endVisualSession(sessionHandle.id, "completed");
await qaVerifier.captureAndClose({ completed: true, visualSessionId: sessionHandle.id });

// Export trajectory for DPO fine-tuning
await qaVerifier.exportTrajectories("./output/training-bundle.jsonl");
```
Architecture Decisions
- FTS5 over SQLite: The sidecar uses SQLite with FTS5 indexing to enable full-text search across historical sessions. This allows the agent to retrieve relevant past verifications when encountering ambiguous frames, improving accuracy over time.
- Sidecar Pattern: Offloading state management to a Python sidecar decouples the memory layer from the TypeScript runtime. This ensures robust handling of concurrent sessions and efficient database operations without blocking the main event loop.
- Reward Signal Engineering: The reward calculation incorporates `uncertain` verdicts with a weight of 0.5. This acknowledges that hesitation is valuable data, indicating edge cases where the model lacks confidence. Including these in the training set helps the fine-tuned model learn boundary conditions.
- Vision Adapter Abstraction: The code abstracts the vision model via an adapter interface. This allows swapping Moondream for Claude Vision by changing the import and configuration, facilitating cost-quality trade-offs without rewriting the core logic.
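The adapter abstraction described above could look roughly like the sketch below. The `VisionAdapter` interface, both stub adapters, and `selectAdapter` are hypothetical names invented for illustration; real adapters would call Ollama or the Anthropic API instead of returning canned strings.

```typescript
// Hypothetical adapter contract: one frame in, one text description out.
interface VisionAdapter {
  readonly name: string;
  describe(frame: Uint8Array, prompt: string): Promise<string>;
}

// Stubs standing in for real Moondream and Claude Vision clients.
const localStub: VisionAdapter = {
  name: "local-stub",
  describe: async (frame, prompt) => `[local-stub] ${frame.length} bytes; ${prompt}`,
};

const cloudStub: VisionAdapter = {
  name: "cloud-stub",
  describe: async (frame, prompt) => `[cloud-stub] ${frame.length} bytes; ${prompt}`,
};

// Configuration picks the adapter; the calling code never changes.
function selectAdapter(kind: "local" | "cloud"): VisionAdapter {
  return kind === "local" ? localStub : cloudStub;
}

// Core logic depends only on the interface, not on a specific model.
function describeStep(
  adapter: VisionAdapter,
  frame: Uint8Array,
  expectation: string
): Promise<string> {
  return adapter.describe(frame, `Operator action: ${expectation}.`);
}

describeStep(selectAdapter("local"), new Uint8Array([1, 2, 3]), "Bracket flush against rail, no gap")
  .then((description) => console.log(description));
```

Because `describeStep` only sees the interface, a cost-quality switch becomes a one-line configuration change.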
Pitfall Guide
Production deployments of AR verification systems encounter specific technical challenges. The following pitfalls and mitigations are derived from operational experience.
Ignoring Uncertainty Signals
- Explanation: Treating `uncertain` verdicts as failures or discarding them loses critical training data. Uncertainty often highlights procedural ambiguities or visual occlusions.
- Fix: Implement the weighted reward formula `(passes + 0.5 * uncertain) / total`. Log uncertain cases for manual review to refine procedure documentation or vision prompts.
Hardcoding Procedure Steps
- Explanation: Embedding steps directly in code limits flexibility. Procedures change, and hardcoding requires redeployment for updates.
- Fix: Store procedure definitions in a versioned database or configuration service. The agent should fetch the current procedure schema at session start, enabling dynamic updates without code changes.
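One way to move steps out of code is to load a versioned definition as JSON and validate it at session start. The sketch below assumes a hypothetical `ProcedureDefinition` shape with `procedureId`, `version`, and `steps`; where the JSON comes from (database, configuration service, file) is left open.

```typescript
// Hypothetical schema for an externally stored procedure definition.
interface ProcedureStep {
  id: string;
  expectation: string;
}

interface ProcedureDefinition {
  procedureId: string;
  version: number;
  steps: ProcedureStep[];
}

// Parse and validate a definition; throws with a specific message on bad input.
function parseProcedure(raw: string): ProcedureDefinition {
  const data = JSON.parse(raw); // untyped until the checks below pass
  if (data === null || typeof data !== "object") {
    throw new Error("procedure definition must be a JSON object");
  }
  if (typeof data.procedureId !== "string" || data.procedureId.length === 0) {
    throw new Error("procedureId is required");
  }
  if (typeof data.version !== "number") {
    throw new Error("version must be a number");
  }
  if (!Array.isArray(data.steps) || data.steps.length === 0) {
    throw new Error("at least one step is required");
  }
  for (const step of data.steps) {
    if (typeof step?.id !== "string" || typeof step?.expectation !== "string") {
      throw new Error("each step needs a string id and expectation");
    }
  }
  return data as ProcedureDefinition;
}

const definition = parseProcedure(
  JSON.stringify({
    procedureId: "m6-bolt-assembly-v2",
    version: 2,
    steps: [
      { id: "align-bracket", expectation: "Bracket flush against rail, no gap" },
      { id: "thread-bolt", expectation: "M6 bolt hand-threaded, two full turns" },
      { id: "torque-apply", expectation: "Torque wrench applied, reading 8 Nm" },
    ],
  })
);
console.log(definition.steps.map((step) => step.id));
```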
Vision Model Hallucination
- Explanation: Vision models may invent details not present in the frame, especially under poor lighting or occlusion.
- Fix: Use explicit prompting that constrains the model to observable facts. Require the model to describe specific attributes (e.g., "hand position," "tool state") rather than general interpretations. Cross-reference descriptions with expected step parameters.
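A crude way to cross-reference a description against the attributes the prompt asked for is a keyword check, sketched below. This is an assumption-laden illustration: substring matching only shows that an attribute was mentioned, not that the description is truthful, and the function name is hypothetical.

```typescript
// Returns the required attribute phrases that the description never mentions.
// Matching is case-insensitive substring search, which is deliberately simple.
function missingAttributes(description: string, requiredTerms: string[]): string[] {
  const text = description.toLowerCase();
  return requiredTerms.filter((term) => !text.includes(term.toLowerCase()));
}

const description =
  "The operator's hand position is on the wrench handle; bolt alignment looks straight.";
const missing = missingAttributes(description, ["hand position", "tool state", "bolt alignment"]);

// A non-empty result is a reasonable trigger for re-prompting or an 'uncertain' verdict.
console.log(missing); // ["tool state"]
```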
Memory Bloat and Performance Degradation
- Explanation: Accumulating frames indefinitely can degrade search performance and increase storage costs.
- Fix: Implement session-based boundaries and periodic archiving. Use FTS5 indexing efficiently by limiting search scope to relevant time windows or procedure IDs. Configure retention policies based on compliance requirements.
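The retention and scoping ideas above can be sketched as two small helpers operating on session metadata. The `SessionMeta` shape and both function names are hypothetical; in a real deployment the equivalent filters would more likely be expressed as SQL against the sidecar's database.

```typescript
// Hypothetical session metadata; only the fields needed for retention decisions.
interface SessionMeta {
  id: string;
  procedureId: string;
  endedAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Split sessions into those inside the retention window and those due for archiving.
function partitionByRetention(
  sessions: SessionMeta[],
  retentionDays: number,
  now: Date = new Date()
): { keep: SessionMeta[]; archive: SessionMeta[] } {
  const cutoff = now.getTime() - retentionDays * DAY_MS;
  const keep: SessionMeta[] = [];
  const archive: SessionMeta[] = [];
  for (const session of sessions) {
    (session.endedAt.getTime() >= cutoff ? keep : archive).push(session);
  }
  return { keep, archive };
}

// Narrow the candidate set to one procedure before running any text search.
function scopeToProcedure(sessions: SessionMeta[], procedureId: string): SessionMeta[] {
  return sessions.filter((session) => session.procedureId === procedureId);
}

const sessions: SessionMeta[] = [
  { id: "s-1", procedureId: "m6-bolt-assembly-v2", endedAt: new Date("2025-06-01T00:00:00Z") },
  { id: "s-2", procedureId: "m6-bolt-assembly-v2", endedAt: new Date("2025-01-15T00:00:00Z") },
  { id: "s-3", procedureId: "other-procedure", endedAt: new Date("2025-06-20T00:00:00Z") },
];

const { keep, archive } = partitionByRetention(sessions, 90, new Date("2025-06-30T00:00:00Z"));
console.log(keep.map((s) => s.id), archive.map((s) => s.id));
```

Passing `now` explicitly keeps the function deterministic in tests and lets compliance rules be replayed against historical dates.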
Reward Signal Distortion
- Explanation: Incorrect weighting of verdicts can skew the fine-tuning process, causing the model to over-penalize or over-reward certain behaviors.
- Fix: Validate the reward formula against ground truth data. Conduct A/B testing with different weights to determine the optimal balance for your specific domain. Monitor the distribution of verdicts over time.
Latency in Real-Time Feedback
- Explanation: Cloud-based vision models introduce network latency, which can disrupt the operator's workflow if feedback is delayed.
- Fix: Use local models like Moondream for low-latency requirements. If cloud models are necessary, implement asynchronous verification where the operator proceeds while the system validates in the background, alerting only on failures.
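The asynchronous pattern described in the fix can be sketched with plain promises: each step is submitted without being awaited, and only failures trigger a callback. `createBackgroundVerifier` and the `Verifier` type are hypothetical, the stand-in verifier replaces a real model call, and treating a verifier error as a failure is an assumption.

```typescript
type Verdict = "pass" | "fail" | "uncertain";
type Verifier = (stepId: string) => Promise<Verdict>;

function createBackgroundVerifier(verify: Verifier, onFail: (stepId: string) => void) {
  const pending: Promise<void>[] = [];
  return {
    // Start verification and return immediately so the operator is not blocked.
    submit(stepId: string): void {
      const task = verify(stepId).then(
        (verdict) => {
          if (verdict === "fail") onFail(stepId);
        },
        // Assumption: a verifier error is surfaced like a failure rather than dropped.
        () => onFail(stepId)
      );
      pending.push(task);
    },
    // Wait for every submitted verification, for example at session end.
    async drain(): Promise<void> {
      await Promise.all(pending);
      pending.length = 0;
    },
  };
}

// Stand-in verifier: flags only the torque step as failed.
const fakeVerify: Verifier = async (stepId) => (stepId === "torque-apply" ? "fail" : "pass");

const alerts: string[] = [];
const background = createBackgroundVerifier(fakeVerify, (stepId) => alerts.push(stepId));

for (const stepId of ["align-bracket", "thread-bolt", "torque-apply"]) {
  background.submit(stepId);
}

const done = background.drain().then(() => alerts);
done.then((failed) => console.log("failed steps:", failed));
```

The trade-off is that an operator may already be on a later step when a failure surfaces, so this pattern suits steps that can be reworked after the fact.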
Lack of Alerting Integration
- Explanation: Verifications that fail may go unnoticed if not integrated into operational workflows.
- Fix: Wire the SSE event bus to notification channels. Subscribe to verification events and trigger alerts for `fail` results. Integrate with Slack or PagerDuty to ensure immediate response to critical deviations.
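The subscription logic can be illustrated with a tiny in-memory publish/subscribe bus standing in for the SSE event stream. Everything here is hypothetical: the `VerificationEvent` shape, `createEventBus`, and the `notify` callback, which in practice would post to Slack or PagerDuty.

```typescript
type Verdict = "pass" | "fail" | "uncertain";

// Hypothetical event payload; the real SSE event format is not specified here.
interface VerificationEvent {
  sessionId: string;
  stepId: string;
  result: Verdict;
}

type Listener = (event: VerificationEvent) => void;

// Minimal in-memory bus used as a stand-in for the SSE event stream.
function createEventBus() {
  const listeners: Listener[] = [];
  return {
    subscribe(listener: Listener): void {
      listeners.push(listener);
    },
    publish(event: VerificationEvent): void {
      for (const listener of listeners) listener(event);
    },
  };
}

// Forward only failing verdicts to the notification channel.
function wireFailureAlerts(
  bus: { subscribe(listener: Listener): void },
  notify: (message: string) => void
): void {
  bus.subscribe((event) => {
    if (event.result === "fail") {
      notify(`Step ${event.stepId} failed in session ${event.sessionId}`);
    }
  });
}

const bus = createEventBus();
const sent: string[] = [];
wireFailureAlerts(bus, (message) => sent.push(message));

bus.publish({ sessionId: "s-1", stepId: "align-bracket", result: "pass" });
bus.publish({ sessionId: "s-1", stepId: "torque-apply", result: "fail" });
console.log(sent); // ["Step torque-apply failed in session s-1"]
```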
Production Bundle
Action Checklist
- Define Procedure Schema: Create a structured definition for each procedure, including step IDs, expectations, and tolerance thresholds.
- Configure Vision Adapter: Select and configure the vision model (Moondream for cost, Claude for quality). Ensure the adapter handles frame encoding and prompt injection.
- Initialize Memory Backend: Set up the Python sidecar with SQLite and FTS5. Verify database connectivity and indexing performance.
- Implement Reward Logic: Code the reward signal calculation into the trajectory exporter. Validate the formula against test data.
- Set Up DPO Pipeline: Prepare the fine-tuning environment. Ensure the JSONL export format is compatible with your training framework (e.g., TRL, LLaMA-Factory).
- Add Alerting Mechanism: Configure event subscriptions for failure verdicts. Test integration with notification channels.
- Establish Retention Policy: Define data retention rules for sessions and trajectories. Implement automated cleanup or archiving.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Volume, Cost-Sensitive | Local Moondream + FTS5 | Zero marginal cost per frame; sufficient accuracy for many procedures. | $0.00/frame. Infrastructure cost only. |
| Critical Safety, High Accuracy | Claude Vision + Cloud Memory | Superior visual reasoning; reduces false negatives in safety-critical steps. | ~$0.01/frame. API costs scale with volume. |
| Rapid Prototyping | Naive One-Shot | Fastest implementation; no memory or loop overhead. | Low initial dev cost. High long-term risk. |
| Domain-Specific Fine-Tuning | Memory-Backed Loop + DPO | Generates training data automatically; model improves over time. | Moderate dev cost. High long-term ROI. |
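To make the cost column concrete, the sketch below multiplies frame volume by a per-frame price. The $0.01 figure is the article's approximate cloud price; the 200 sessions per day and 30-day month are hypothetical inputs chosen only for the example.

```typescript
// Monthly API cost = frames per session * sessions per day * days * price per frame.
function estimateMonthlyCostUsd(
  framesPerSession: number,
  sessionsPerDay: number,
  costPerFrameUsd: number,
  daysPerMonth = 30
): number {
  const framesPerMonth = framesPerSession * sessionsPerDay * daysPerMonth;
  return framesPerMonth * costPerFrameUsd;
}

// Three frames per session matches the three-step M6 bolt procedure.
const cloudCost = estimateMonthlyCostUsd(3, 200, 0.01); // 18,000 frames, about $180
const localCost = estimateMonthlyCostUsd(3, 200, 0); // no per-frame fee; infrastructure not included
console.log(cloudCost.toFixed(2), localCost.toFixed(2));
```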
Configuration Template
Use this template to configure the agent and session parameters. Adjust values based on your environment.
```json
{
  "agent": {
    "model": "claude-sonnet-4-20250514",
    "systemPrompt": "Verify assembly steps. Use 'uncertain' for ambiguous frames.",
    "tenantId": "production-line-1",
    "rewardFormula": "(passes + 0.5 * uncertain) / total"
  },
  "session": {
    "deviceType": "android-tablet",
    "procedureId": "m6-bolt-assembly-v2",
    "retentionDays": 90
  },
  "vision": {
    "adapter": "ollama-moondream",
    "endpoint": "http://localhost:11434",
    "promptTemplate": "Operator should be: {expectation}. Describe {attributes}."
  },
  "export": {
    "format": "jsonl",
    "path": "./output/trajectories",
    "includeReasoning": true
  }
}
```
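The `promptTemplate` value uses `{expectation}` and `{attributes}` placeholders. How the library fills them is not documented here, so the sketch below simply assumes plain name-based substitution; `renderPrompt` is a hypothetical helper.

```typescript
// Replace each {name} placeholder with the matching value; fail loudly on gaps
// so a misconfigured template never reaches the vision model half-filled.
function renderPrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (_match: string, key: string) => {
    const value = values[key];
    if (value === undefined) {
      throw new Error(`Missing template value: ${key}`);
    }
    return value;
  });
}

const prompt = renderPrompt("Operator should be: {expectation}. Describe {attributes}.", {
  expectation: "Bracket flush against rail, no gap",
  attributes: "hand position, tool state",
});
console.log(prompt);
```

Throwing on a missing key is a design choice: a silent empty substitution would produce a vague prompt, which is exactly the condition that invites hallucinated detail.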
Quick Start Guide
- Install Dependencies: Run `npm install @dumbspacecookie/openeye` and `pip install -r node_modules/@dumbspacecookie/openeye/sidecar/requirements.txt` to set up the runtime and sidecar.
- Pull Vision Model: If using local inference, execute `ollama pull moondream` to download the vision model. For cloud, set the `ANTHROPIC_API_KEY` environment variable.
- Run Verification Script: Execute the TypeScript script with your procedure frames. The agent will process each step, log frames, and generate verdicts.
- Review Trajectory: Check the exported JSONL file in the output directory. Verify that verdicts, reasoning, and reward signals are correctly formatted.
- Fine-Tune Model: Use the JSONL file to fine-tune your base model via DPO. Deploy the fine-tuned model to improve verification accuracy for subsequent sessions.