equirements.txt
The Python sidecar auto-spawns when the agent initializes. It handles session tracking, memory indexing, and trajectory serialization. No manual process management is required.
### Step 2: Vision Adapter Abstraction
Vision models should never be called directly from the agent loop. Instead, implement a thin adapter that normalizes frame input into structured scene descriptions. This decouples inference cost from reasoning logic and allows seamless swapping between local and cloud providers.
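```typescript
import { VisionAdapter, FrameInput, SceneDescription } from "@codcompass/vision-agent-core";

export class OllamaMoondreamAdapter implements VisionAdapter {
  private readonly promptTemplate: string;
  private readonly endpoint: string;

  // The endpoint default assumes a local Ollama server on its standard port.
  constructor(promptTemplate: string, endpoint = "http://localhost:11434/api/generate") {
    this.promptTemplate = promptTemplate;
    this.endpoint = endpoint;
  }

  async analyze(frame: FrameInput): Promise<SceneDescription> {
    const base64 = frame.buffer.toString("base64");
    const payload = JSON.stringify({
      model: "moondream",
      prompt: this.promptTemplate,
      images: [base64],
      stream: false
    });

    // Send the payload over Ollama's HTTP API rather than interpolating it into
    // a shell command: a base64 image can exceed argument-length limits, and
    // quotes in the prompt would break (or inject into) the command line.
    const response = await fetch(this.endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: payload
    });
    if (!response.ok) {
      throw new Error(`Ollama request failed with status ${response.status}`);
    }

    const parsed = await response.json();
    return {
      rawText: parsed.response,
      // Ollama does not report a confidence score, so 0.85 acts as a fixed default.
      confidence: parsed.confidence ?? 0.85,
      metadata: { provider: "ollama", model: "moondream" }
    };
  }
}
```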
Architecture Rationale: The adapter returns a SceneDescription object rather than raw text. This enforces a contract that includes confidence scores and provider metadata, which the agent loop uses for reward calibration and fallback routing.
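The fallback routing mentioned above could be sketched as follows. The `FallbackAdapter` class, the 0.6 `minConfidence` threshold, and the local type declarations are assumptions made for illustration; they mirror the shapes used in this guide rather than the package's actual exports.

```typescript
// Local stand-ins for the package types so the sketch is self-contained (assumed shapes).
interface FrameInput {
  buffer: Uint8Array;
}
interface SceneDescription {
  rawText: string;
  confidence: number;
  metadata: { provider: string; model: string };
}
interface VisionAdapter {
  analyze(frame: FrameInput): Promise<SceneDescription>;
}

// Hypothetical wrapper: use the primary adapter unless it throws or reports
// a confidence below the threshold, in which case the secondary is consulted.
class FallbackAdapter implements VisionAdapter {
  private readonly primary: VisionAdapter;
  private readonly secondary: VisionAdapter;
  private readonly minConfidence: number;

  constructor(primary: VisionAdapter, secondary: VisionAdapter, minConfidence = 0.6) {
    this.primary = primary;
    this.secondary = secondary;
    this.minConfidence = minConfidence;
  }

  async analyze(frame: FrameInput): Promise<SceneDescription> {
    try {
      const first = await this.primary.analyze(frame);
      if (first.confidence >= this.minConfidence) return first;
    } catch {
      // Primary provider unavailable: fall through to the secondary.
    }
    return this.secondary.analyze(frame);
  }
}
```

Because the wrapper itself satisfies the adapter interface, the agent loop does not need to know whether one provider or two sit behind it.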
### Step 3: Agent Initialization & Memory Binding
The agent requires a system prompt that enforces strict verification boundaries, a tenant identifier for multi-tenant isolation, and a memory backend reference.
```typescript
import {
  ProcedureAgent,
  AgentConfig,
  MemoryBackend,
  TrajectoryExporter
} from "@codcompass/vision-agent-core";

const memoryBackend = new MemoryBackend({
  storagePath: "./data/verification_store.db",
  indexType: "FTS5",
  maxSessionRetention: 90 // days
});

const agentConfig: AgentConfig = {
  model: "claude-sonnet-4-20250514",
  systemPrompt: [
    "You verify industrial assembly steps against documented procedures.",
    "Return exactly one of: pass, fail, or uncertain.",
    "If visual evidence is ambiguous, output uncertain. Never infer missing data.",
    "Reference the current step ID and expected action in your reasoning."
  ].join(" "),
  tenantId: "manufacturing-floor-alpha",
  memoryBackend
};

const verifier = await ProcedureAgent.create(agentConfig);
```
Architecture Rationale: SQLite's FTS5 extension provides low-latency full-text search over historical sessions while the database keeps its ACID guarantees. The 90-day retention window prevents unbounded storage growth while preserving recent operational context for few-shot prompting.
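To make the historical lookup concrete, here is a minimal sketch of turning a step ID and a few keywords into an FTS5 `MATCH` expression. The table and column names in the SQL string are hypothetical, since the sidecar's schema is not documented here. Each term is wrapped in double quotes, with embedded quotes doubled, so that punctuation such as hyphens is not interpreted as FTS5 query syntax.

```typescript
// Quote one term as an FTS5 string: wrap it in double quotes and
// double any embedded double quotes.
function quoteFts5Term(term: string): string {
  return `"${term.replace(/"/g, '""')}"`;
}

// Space-separated phrases are combined with an implicit AND in FTS5.
function buildFts5Match(terms: string[]): string {
  return terms
    .map((t) => t.trim())
    .filter((t) => t.length > 0)
    .map(quoteFts5Term)
    .join(" ");
}

// Hypothetical table and column names, for illustration only.
const HISTORY_QUERY =
  "SELECT session_id, description FROM frame_log WHERE frame_log MATCH ? ORDER BY rank LIMIT 5";

const matchExpr = buildFts5Match(["hand-thread-bolt", "two rotations"]);
console.log(matchExpr); // "hand-thread-bolt" "two rotations"
```

The resulting string would be bound to the `?` placeholder by whichever SQLite driver the sidecar uses, which keeps user-supplied text out of the SQL itself.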
### Step 4: Verification Loop Execution
Define the procedure steps, iterate through captured frames, and route each through the vision adapter before agent evaluation.
```typescript
import { readFile } from "node:fs/promises";

// Reads a captured frame from disk as a Buffer.
const readFrameFromDisk = (path: string) => readFile(path);

const procedureSteps = [
  { stepId: "align-bracket", expectation: "Position bracket flush against mounting rail" },
  { stepId: "hand-thread-bolt", expectation: "Thread M6 bolt manually for two full rotations" },
  { stepId: "torque-application", expectation: "Apply calibrated torque to 8 Nm" }
];

// The vision prompt stays purely descriptive; the expectation is supplied
// to the agent in evaluateStep() below (see pitfall 5).
const visionAdapter = new OllamaMoondreamAdapter(
  "Describe hand position, tool engagement, and bolt state."
);

const sessionRef = await verifier.startSession({
  deviceId: "android-tablet-04",
  procedureId: "m6-assembly-v2",
  procedureName: "M6 Bracket Assembly"
});

for (let idx = 0; idx < procedureSteps.length; idx++) {
  const step = procedureSteps[idx];
  const frameBuffer = await readFrameFromDisk(`./captures/step_${idx + 1}.jpg`);
  const scene = await visionAdapter.analyze({ buffer: frameBuffer });

  // Log the frame with its sequence number and step context before evaluation.
  await verifier.logFrame({
    sessionId: sessionRef.id,
    sequence: idx + 1,
    description: scene.rawText,
    stepContext: step.stepId,
    visionMetadata: scene.metadata
  });

  const verdict = await verifier.evaluateStep({
    sessionId: sessionRef.id,
    stepId: step.stepId,
    expectation: step.expectation,
    sceneContext: scene.rawText
  });

  console.log(`Step ${step.stepId}: ${verdict.result} | Reason: ${verdict.reasoning}`);
}

await verifier.closeSession(sessionRef.id, "completed");
```
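Because the system prompt restricts the agent to `pass`, `fail`, or `uncertain`, downstream code benefits from normalizing whatever text actually comes back. The sketch below is one way this could be done; `normalizeVerdict` is a hypothetical helper, not part of the package, and it treats anything unrecognized as `uncertain` so that a malformed reply never becomes a silent pass.

```typescript
type Verdict = "pass" | "fail" | "uncertain";

// Hypothetical helper: map raw model text onto the three allowed verdicts.
function normalizeVerdict(raw: string): Verdict {
  // Lowercase and drop everything except letters, so "Pass." and " PASS\n" both match.
  const cleaned = raw.trim().toLowerCase().replace(/[^a-z]/g, "");
  switch (cleaned) {
    case "pass":
      return "pass";
    case "fail":
      return "fail";
    default:
      // Covers "uncertain" itself as well as any unexpected output.
      return "uncertain";
  }
}
```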
### Step 5: Trajectory Export & DPO Preparation
The agent automatically captures the full conversation history, visual descriptions, step contexts, and verdicts. Exporting this data generates a ShareGPT-formatted JSONL file ready for preference optimization.
```typescript
const exporter = new TrajectoryExporter(verifier);

await exporter.save({
  outputPath: "./exports/m6_assembly_trajectory.jsonl",
  rewardFormula: "pass_count + (uncertain_count * 0.5)",
  includeReasoning: true,
  format: "sharegpt"
});
```
Architecture Rationale: The reward formula weights `uncertain` verdicts at 0.5 so the model is not penalized for appropriate hesitation. ShareGPT is a widely supported conversation format, which keeps the export compatible with fine-tuning pipelines such as TRL and LLaMA-Factory. The exported file pairs each step's ID and expectation with the model's judgment, giving an initial training signal without manual labeling; operator-confirmed labels (pitfall 7) should back that signal before it drives DPO.
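As a worked illustration of the reward formula and the export layout, the sketch below computes `pass_count + (uncertain_count * 0.5)` for one session and serializes it as a single JSONL line. The `conversations` array of `from`/`value` turns follows the common ShareGPT convention; the `StepRecord` shape and the top-level `reward` field are assumptions, since the exporter's real output schema is not shown in this guide.

```typescript
type Verdict = "pass" | "fail" | "uncertain";

// Assumed per-step record; field names are illustrative.
interface StepRecord {
  stepId: string;
  expectation: string;
  sceneText: string;
  verdict: Verdict;
  reasoning: string;
}

// Implements the configured formula: pass_count + (uncertain_count * 0.5).
function trajectoryReward(records: StepRecord[]): number {
  const passCount = records.filter((r) => r.verdict === "pass").length;
  const uncertainCount = records.filter((r) => r.verdict === "uncertain").length;
  return passCount + uncertainCount * 0.5;
}

// Builds one JSONL line: alternating human/gpt turns plus the session reward.
function toShareGptLine(records: StepRecord[]): string {
  const conversations: { from: "human" | "gpt"; value: string }[] = [];
  for (const r of records) {
    conversations.push({
      from: "human",
      value: `Step ${r.stepId}: ${r.expectation}\nScene: ${r.sceneText}`
    });
    conversations.push({ from: "gpt", value: `${r.verdict}: ${r.reasoning}` });
  }
  return JSON.stringify({ conversations, reward: trajectoryReward(records) });
}
```

For a three-step session with two `pass` verdicts and one `uncertain`, the reward is 2 + 1 × 0.5 = 2.5.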
## Pitfall Guide
### 1. Treating Uncertainty as Failure
Explanation: Operators often misinterpret uncertain verdicts as system errors, leading to manual overrides or disabled verification. Uncertainty is a valid state indicating insufficient visual evidence.
Fix: Implement explicit routing for uncertain verdicts. Queue them for human review, log them separately in metrics, and use them as positive training examples during DPO to teach the model appropriate hesitation thresholds.
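A minimal sketch of such routing is shown below. The action names (`continue`, `halt`, `review`), the in-memory queue, and the `StepVerdict` shape are assumptions for illustration; a real deployment would likely persist the queue and emit the counters to its metrics system.

```typescript
type Verdict = "pass" | "fail" | "uncertain";
type RoutingAction = "continue" | "halt" | "review";

interface StepVerdict {
  stepId: string;
  result: Verdict;
  reasoning: string;
}

interface RoutingState {
  reviewQueue: StepVerdict[]; // verdicts waiting for a human decision
  counts: Record<Verdict, number>; // per-verdict counters, tracked separately
}

function createRoutingState(): RoutingState {
  return { reviewQueue: [], counts: { pass: 0, fail: 0, uncertain: 0 } };
}

// Uncertain verdicts are queued for review instead of being folded into failures.
function routeVerdict(state: RoutingState, verdict: StepVerdict): RoutingAction {
  state.counts[verdict.result] += 1;
  if (verdict.result === "uncertain") {
    state.reviewQueue.push(verdict);
    return "review";
  }
  return verdict.result === "pass" ? "continue" : "halt";
}
```

Keeping a separate `uncertain` counter makes it possible to report the hesitation rate on its own rather than as part of the failure rate.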
### 2. Frame-to-Step Misalignment
Explanation: Sending frames without explicit sequence numbering causes the agent to hallucinate step progression, especially when operators pause or repeat actions.
Fix: Enforce strict sequence tracking at the ingestion layer. Include `sequence` and `stepContext` in every log entry, matching the fields passed to `logFrame`. Validate that the vision adapter's description aligns with the expected step before agent evaluation.
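One possible ingestion-layer guard is sketched below. `SequenceGuard` and the `FrameLogEntry` shape are hypothetical; the policy chosen here (strictly increasing sequence numbers per session, gaps allowed) is an assumption that a stricter deployment might tighten.

```typescript
// Assumed shape of a frame log entry; mirrors the fields used in this guide.
interface FrameLogEntry {
  sessionId: string;
  sequence: number;
  stepContext: string;
  description: string;
}

// Hypothetical guard: rejects duplicate or out-of-order frames per session.
class SequenceGuard {
  private readonly lastSeen = new Map<string, number>();

  accept(entry: FrameLogEntry): boolean {
    const last = this.lastSeen.get(entry.sessionId) ?? 0;
    // Sequence numbers must be positive integers that strictly increase.
    if (!Number.isInteger(entry.sequence) || entry.sequence <= last) {
      return false;
    }
    this.lastSeen.set(entry.sessionId, entry.sequence);
    return true;
  }
}
```

Entries that fail the check can be dropped or logged for inspection before they ever reach the agent.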
### 3. Memory Bloat & Query Latency
Explanation: Storing every raw frame description indefinitely degrades FTS5 search performance and increases storage costs.
Fix: Implement session-based pruning. Archive completed sessions to cold storage after 90 days. Use FTS5 tokenization limits and compress historical descriptions using a lightweight summarization pass before indexing.
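The pruning decision itself is simple date arithmetic, sketched below. `splitByRetention` and the `SessionSummary` shape are hypothetical; the sketch only decides which sessions to archive and leaves the actual move to cold storage to the caller.

```typescript
// Assumed minimal session summary; closedAt is null while a session is open.
interface SessionSummary {
  id: string;
  closedAt: Date | null;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Splits sessions into those to keep and those past the retention window.
function splitByRetention(
  sessions: SessionSummary[],
  now: Date,
  retentionDays = 90
): { keep: SessionSummary[]; archive: SessionSummary[] } {
  const cutoff = now.getTime() - retentionDays * MS_PER_DAY;
  const keep: SessionSummary[] = [];
  const archive: SessionSummary[] = [];
  for (const session of sessions) {
    // Open sessions are never archived, however old they are.
    if (session.closedAt !== null && session.closedAt.getTime() < cutoff) {
      archive.push(session);
    } else {
      keep.push(session);
    }
  }
  return { keep, archive };
}
```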
### 4. Reward Signal Miscalibration
Explanation: Exporting trajectories with unweighted uncertain verdicts skews DPO training. The model learns to avoid uncertainty entirely, increasing false positives.
Fix: Apply a calibrated reward formula during export. Weight pass at 1.0, fail at 0.0, and uncertain at 0.5. Validate the distribution of exported rewards before feeding them into the fine-tuning pipeline.
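Validating the distribution before training could look like the sketch below. The weights are the ones given above; the 0.4 ceiling on the share of `uncertain` verdicts and the warning texts are assumptions chosen for illustration, not recommended values.

```typescript
type Verdict = "pass" | "fail" | "uncertain";

// Weights from the text: pass 1.0, fail 0.0, uncertain 0.5.
const REWARD_WEIGHTS: Record<Verdict, number> = { pass: 1.0, fail: 0.0, uncertain: 0.5 };

function rewardDistribution(verdicts: Verdict[]) {
  const counts: Record<Verdict, number> = { pass: 0, fail: 0, uncertain: 0 };
  let total = 0;
  for (const v of verdicts) {
    counts[v] += 1;
    total += REWARD_WEIGHTS[v];
  }
  const n = verdicts.length;
  return {
    counts,
    meanReward: n === 0 ? 0 : total / n,
    uncertainShare: n === 0 ? 0 : counts.uncertain / n
  };
}

// Hypothetical sanity checks to run before feeding an export into fine-tuning.
function distributionWarnings(verdicts: Verdict[], maxUncertainShare = 0.4): string[] {
  const warnings: string[] = [];
  if (verdicts.length === 0) {
    warnings.push("export contains no verdicts");
    return warnings;
  }
  const dist = rewardDistribution(verdicts);
  if (dist.counts.uncertain === 0) {
    warnings.push("no uncertain verdicts: the model gets no signal for hesitation");
  }
  if (dist.uncertainShare > maxUncertainShare) {
    warnings.push("uncertain share exceeds the configured ceiling");
  }
  return warnings;
}
```

For the verdicts `pass, pass, uncertain, fail`, the mean reward is (1 + 1 + 0.5 + 0) / 4 = 0.625 and the uncertain share is 0.25.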
### 5. Over-Prompting the Vision Adapter
Explanation: Asking the vision model to verify steps instead of describing them introduces reasoning bias. Vision models lack procedural context and will hallucinate compliance.
Fix: Keep vision prompts strictly descriptive. Inject procedural expectations only at the agent evaluation layer. For example, prompt with "Describe tool engagement and bolt position." rather than "Did they torque it correctly?"
### 6. Ignoring Latency Budgeting
Explanation: Synchronous vision and agent calls create cascading delays on the shop floor. Operators abandon the system if feedback exceeds 3 seconds.
Fix: Pipeline vision inference asynchronously. Cache frame descriptions and batch agent evaluations where procedural steps allow. Implement timeout fallbacks that default to uncertain rather than blocking.
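The timeout fallback can be sketched with a plain timer wrapped around the evaluation promise. `withTimeoutFallback` and the `StepVerdict` shape are hypothetical names; the sketch only handles slow evaluations, so a rejected evaluation still propagates its error to the caller.

```typescript
type Verdict = "pass" | "fail" | "uncertain";

interface StepVerdict {
  result: Verdict;
  reasoning: string;
}

// Resolves with the evaluation if it finishes in time, otherwise with an
// "uncertain" verdict so the operator is never left waiting.
function withTimeoutFallback(
  evaluation: Promise<StepVerdict>,
  timeoutMs: number
): Promise<StepVerdict> {
  return new Promise<StepVerdict>((resolve, reject) => {
    const timer = setTimeout(() => {
      resolve({ result: "uncertain", reasoning: `No verdict within ${timeoutMs} ms` });
    }, timeoutMs);

    evaluation.then(
      (verdict) => {
        clearTimeout(timer);
        resolve(verdict); // no effect if the timeout already resolved
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}
```

In the verification loop this would wrap the call to `evaluateStep`, with `timeoutMs` set from the latency budget (for example, the 3-second threshold mentioned above).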
### 7. Neglecting Ground Truth Validation
Explanation: Agent verdicts alone do not constitute training data. Without operator confirmation, the system optimizes against its own biases.
Fix: Log explicit ground truth alongside every verdict. Implement a lightweight confirmation UI for operators to accept/reject agent judgments. Use confirmed labels as the primary signal for DPO preference pairs.
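Turning operator confirmations into preference pairs might look like the following sketch. The `prompt`/`chosen`/`rejected` field names follow the layout commonly expected by DPO tooling such as TRL; the `ConfirmedJudgment` shape is an assumption about what the confirmation UI would log. Only disagreements produce a pair, because a case where operator and agent agree has no rejected alternative.

```typescript
type Verdict = "pass" | "fail" | "uncertain";

// Assumed record written when an operator accepts or corrects a verdict.
interface ConfirmedJudgment {
  prompt: string; // step expectation plus scene description
  agentVerdict: Verdict;
  operatorVerdict: Verdict; // label confirmed in the UI
}

interface PreferencePair {
  prompt: string;
  chosen: string;
  rejected: string;
}

// The operator-confirmed verdict is preferred over the agent's original one.
function toPreferencePairs(judgments: ConfirmedJudgment[]): PreferencePair[] {
  const pairs: PreferencePair[] = [];
  for (const j of judgments) {
    if (j.agentVerdict === j.operatorVerdict) continue; // agreement: nothing to contrast
    pairs.push({
      prompt: j.prompt,
      chosen: j.operatorVerdict,
      rejected: j.agentVerdict
    });
  }
  return pairs;
}
```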
## Production Bundle
### Action Checklist
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume shop floor with strict latency SLAs | Local Moondream via Ollama + Async Pipeline | Zero inference cost, sub-2s feedback, full data sovereignty | $0/frame, higher initial hardware cost |
| Low-volume, high-compliance verification (aerospace/medical) | Claude Vision + Fine-Tuned Agent | Superior visual reasoning, audit-ready uncertainty handling, regulatory compliance | ~$0.01/frame, fine-tuning compute costs |
| Rapid prototyping / PoC | Stateless LLM Call + Manual Logging | Fastest deployment, minimal infrastructure, easy to iterate | ~$0.01/frame, zero compounding value |
| Multi-tenant SaaS deployment | Cloud Vision + Centralized FTS5 Cluster | Scalable memory, tenant isolation, centralized trajectory aggregation | ~$0.008/frame (volume discount), storage scaling costs |
### Configuration Template
```typescript
// agent.config.ts
import { AgentConfig, MemoryBackend, TrajectoryExporter } from "@codcompass/vision-agent-core";

export const defaultAgentConfig: AgentConfig = {
  model: "claude-sonnet-4-20250514",
  systemPrompt: [
    "You verify industrial assembly steps against documented procedures.",
    "Return exactly one of: pass, fail, or uncertain.",
    "If visual evidence is ambiguous, output uncertain. Never infer missing data.",
    "Reference the current step ID and expected action in your reasoning."
  ].join(" "),
  tenantId: "production-tenant-01",
  memoryBackend: new MemoryBackend({
    storagePath: "./data/agent_memory.db",
    indexType: "FTS5",
    maxSessionRetention: 90, // days
    compressionEnabled: true
  }),
  telemetry: {
    enabled: true,
    metricsEndpoint: "/api/v1/verification/metrics",
    latencyThreshold: 3000 // ms
  }
};

export const trajectoryConfig = {
  outputPath: "./exports/trajectories",
  rewardFormula: "pass_count + (uncertain_count * 0.5)",
  includeReasoning: true,
  format: "sharegpt",
  rotationPolicy: "weekly"
};
```
### Quick Start Guide
- Initialize Environment: Ensure Node.js 20+ and Python 3.10+ are installed. Run `npm install @codcompass/vision-agent-core` and `pip install -r node_modules/@codcompass/vision-agent-core/sidecar/requirements.txt`.
- Configure Vision Adapter: Choose Ollama/Moondream for local inference or set `ANTHROPIC_API_KEY` for Claude Vision. Update the adapter import in your verification script.
- Define Procedure & Run: Create a JSON array of steps, each with a `stepId` and an `expectation`. Execute the verification loop against captured frames. The agent auto-spawns the memory sidecar and logs all events.
- Export & Fine-Tune: Call `save()` on a `TrajectoryExporter` instance after the session completes. Feed the generated JSONL into your DPO pipeline (TRL/LLaMA-Factory) to produce a domain-specialized model.
- Deploy & Monitor: Ship the updated agent to production. Monitor latency metrics, uncertain verdict rates, and operator confirmation patterns. Schedule weekly retraining cycles to compound accuracy gains.