ucturally sound but implementation-flawed code.
- Parallel DAG execution reduces wall-clock time by ~41% compared to sequential loops while maintaining verification rigor.
- Adaptive routing based on severity-classified verifier feedback cuts redundant solver regenerations by ~60%.
Core Solution
CLMA (Closed-Loop Multi-Agent) is engineered as a three-layer system that decouples performance-critical orchestration from LLM interaction and real-time visualization.
┌─────────────────────────────────────┐
│  Web UI (Flask + SSE + SVG)         │
│  Real-time flow graphs & gauges     │
├─────────────────────────────────────┤
│  Python Interface (pybind11)        │
│  Agent orchestration & scoring      │
├─────────────────────────────────────┤
│  C++17 Core Engine                  │
│  Orchestrator · DAG · Rule Engine   │
│  Token Monitor · Plugin Manager     │
└─────────────────────────────────────┘
The C++ core manages DAG processing, rule matching, and token tracking. The Python layer handles agent orchestration, LLM API routing, and scoring logic. The Web UI streams agent state transitions via Server-Sent Events (SSE).
The Five Agent Roles
Every query routes through a dynamic subset of five specialized agents:
| Agent | Role | Prompt Template |
|---|---|---|
| Refiner | Reformulates the user's query into a structured task. Extracts implicit requirements. | "Restate the task clearly. Identify edge cases." |
| Reasoner | Produces a solution strategy without writing code. Plans the approach. | "Outline the algorithm. Consider time/space complexity." |
| Solver | Generates the actual implementation code. | "Write production-quality code following the plan." |
| Verifier | Reviews the Solver's output. Checks correctness, completeness, and potential bugs. | "Review this code. List issues by severity." |
| Evaluator | Scores the final output on three dimensions. Decides if iteration is needed. | "Rate this solution on reasonableness, executability, and satisfaction." |
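The role table above can be sketched as a small lookup structure. This is a minimal illustration, not CLMA's actual code: the `Role` enum, `PROMPT_TEMPLATES`, `PIPELINE_ORDER`, and `build_prompt` are hypothetical names, and only the template strings come from the table.

```python
from enum import Enum


class Role(Enum):
    REFINER = "Refiner"
    REASONER = "Reasoner"
    SOLVER = "Solver"
    VERIFIER = "Verifier"
    EVALUATOR = "Evaluator"


# Template text copied from the role table; the mapping itself is an assumption.
PROMPT_TEMPLATES = {
    Role.REFINER: "Restate the task clearly. Identify edge cases.",
    Role.REASONER: "Outline the algorithm. Consider time/space complexity.",
    Role.SOLVER: "Write production-quality code following the plan.",
    Role.VERIFIER: "Review this code. List issues by severity.",
    Role.EVALUATOR: (
        "Rate this solution on reasonableness, executability, and satisfaction."
    ),
}

# Order of the full five-agent sequence; a query may use only a subset.
PIPELINE_ORDER = [
    Role.REFINER,
    Role.REASONER,
    Role.SOLVER,
    Role.VERIFIER,
    Role.EVALUATOR,
]


def build_prompt(role: Role, payload: str) -> str:
    """Prepend the role's instruction to the text the agent should act on."""
    return f"{PROMPT_TEMPLATES[role]}\n\n{payload}"
```

For example, `build_prompt(Role.VERIFIER, solver_output)` would produce the Verifier's instruction followed by the code under review.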
Three-Dimensional Scoring & Closed Loop
The Evaluator produces a multi-axis score:
- Reasonableness (0–1): Does the approach make sense for the problem?
- Executability (0–1): Would the code actually run without errors?
- Satisfaction (0–1): Does the output fully address the user's query?
Overall = Reasonableness × 0.4 + Executability × 0.4 + Satisfaction × 0.2
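As a worked illustration of the weighted formula, here is a minimal sketch. The `Score` class and `WEIGHTS` dictionary are hypothetical names introduced for this example; the weights and the 0–1 range are taken from the text.

```python
from dataclasses import dataclass

# Weights from the formula above; the dictionary layout is an assumption.
WEIGHTS = {"reasonableness": 0.4, "executability": 0.4, "satisfaction": 0.2}


@dataclass(frozen=True)
class Score:
    reasonableness: float
    executability: float
    satisfaction: float

    def __post_init__(self) -> None:
        # Each axis is defined on the closed interval [0, 1].
        for name in WEIGHTS:
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def overall(self) -> float:
        """Weighted sum of the three axes."""
        return sum(WEIGHTS[name] * getattr(self, name) for name in WEIGHTS)
```

A solution with a sound approach but shaky code, such as `Score(0.9, 0.4, 0.8)`, comes out at roughly 0.36 + 0.16 + 0.16 = 0.68 overall, even though two of its three axes look healthy.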
If the overall score falls below a configurable threshold (default 0.7), the framework loops back: Refiner receives Verifier's feedback, Solver generates an improved version, Verifier checks again, and Evaluator re-scores. This continues up to max_iterations (default 3).
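The loop described above might look roughly like the following sketch. It makes several simplifying assumptions: the Refiner, Reasoner, and Solver are collapsed into a single hypothetical `solve` callable, `verify` and `evaluate` stand in for the Verifier and Evaluator, and returning the best-scoring attempt (rather than the last one) is a design choice of this example, not something the text specifies.

```python
from typing import Callable, Optional, Tuple


def run_closed_loop(
    query: str,
    solve: Callable[[str, Optional[str]], str],   # (query, feedback) -> solution
    verify: Callable[[str], str],                 # solution -> feedback text
    evaluate: Callable[[str], float],             # solution -> overall score
    threshold: float = 0.7,
    max_iterations: int = 3,
) -> Tuple[str, float, int]:
    """Iterate until the score reaches the threshold or the cap is hit.

    Returns (best solution, its score, number of iterations run).
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")

    feedback: Optional[str] = None
    best_solution, best_score = "", float("-inf")
    iterations = 0

    for iterations in range(1, max_iterations + 1):
        solution = solve(query, feedback)   # first pass gets no feedback
        feedback = verify(solution)         # fed into the next pass, if any
        score = evaluate(solution)
        if score > best_score:
            best_solution, best_score = solution, score
        if score >= threshold:
            break                           # good enough: stop iterating

    return best_solution, best_score, iterations
```

Because the cap is enforced by the `for` range itself, the loop cannot run past `max_iterations` even if the score never reaches the threshold.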
Execution Mode Evolution
The initial Single Closed Loop proved effective but inefficient for varying complexity levels:
┌─────────────────────────────────────────────────────────┐
│  Query                                                  │
│    ↓                                                    │
│  Refiner → Reasoner → Solver → Verifier → Evaluator     │
│    ↑                                          │         │
│    └─────── score < threshold? ───────────────┘         │
└─────────────────────────────────────────────────────────┘
DAG Mode introduces parallel decomposition. The C++ DAG processor splits complex queries into independent sub-tasks, executes them concurrently, and aggregates verifier feedback before final evaluation. This eliminates sequential bottlenecks while preserving the self-verifying loop.
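To make the scheduling idea concrete, here is a small Python sketch of dependency-ordered concurrent execution. It is not the C++ DAG processor itself: `run_dag`, `deps`, and `run_task` are hypothetical names, and the example assumes sub-tasks are already decomposed into a dependency map. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) and `asyncio`.

```python
import asyncio
from graphlib import TopologicalSorter
from typing import Awaitable, Callable, Dict, Set


async def run_dag(
    deps: Dict[str, Set[str]],  # sub-task name -> names it depends on
    run_task: Callable[[str, Dict[str, str]], Awaitable[str]],
) -> Dict[str, str]:
    """Run sub-tasks concurrently, starting each once its inputs are ready."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if deps is not a DAG

    results: Dict[str, str] = {}
    running: Dict[asyncio.Task, str] = {}

    while sorter.is_active():
        # Launch every sub-task whose dependencies have all completed.
        for name in sorter.get_ready():
            inputs = {d: results[d] for d in deps.get(name, ())}
            running[asyncio.create_task(run_task(name, inputs))] = name

        # Wait for at least one to finish, without blocking on the rest.
        done, _ = await asyncio.wait(
            running, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            name = running.pop(task)
            results[name] = task.result()
            sorter.done(name)  # may unlock dependents on the next pass

    return results
```

Calling `asyncio.run(run_dag({"a": set(), "b": set(), "merge": {"a", "b"}}, run_task))` would start `a` and `b` together and run `merge` only after both have produced results.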
Pitfall Guide
- Coarse Single-Score Feedback: Relying on a single aggregate score masks failure modes. High reasonableness with low executability requires code-level fixes, not architectural rethinking. Best Practice: Always use multi-dimensional scoring to route targeted feedback to the correct agent.
- Uniform Pipeline for All Query Complexities: Applying the same 5-agent sequence to trivial and complex queries wastes tokens and latency. Best Practice: Implement adaptive execution routing (Single Loop vs. DAG) based on dependency analysis and task decomposition requirements.
- Unbounded Iteration Loops: Without explicit caps and monitoring, closed loops can run indefinitely or exhaust API budgets. Best Practice: Enforce configurable max_iterations, real-time token tracking at the C++ core level, and early-exit conditions when scores plateau.
- Synchronous Blocking in Parallel DAG Execution: Running independent sub-tasks sequentially negates DAG performance gains. Best Practice: Leverage asynchronous task scheduling with non-blocking result aggregation in the orchestrator layer.
- Ignoring Verifier-Solver Feedback Granularity: Passing raw error logs or unstructured critiques confuses the Refiner and triggers repeated regeneration. Best Practice: Standardize feedback schemas with severity classification, explicit improvement directives, and code-location references.
- Over-Optimizing for Satisfaction Over Executability: Prioritizing user intent alignment without enforcing runtime correctness produces polished but broken code. Best Practice: Weight Executability equally with Reasonableness in the scoring formula; block deployment until both exceed threshold.
- State Leakage Across Iterations: Carrying forward stale context or unverified assumptions from previous loops compounds errors. Best Practice: Reset solver context per iteration while preserving only structured verifier feedback and refiner constraints.
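Two of the practices above, structured verifier feedback and score-based routing, can be sketched together. Everything here is illustrative: the `Severity` levels, the `Issue` fields, the `actionable` filter, and the routing order in `route_feedback` (including sending low satisfaction to the Refiner) are assumptions of this example rather than CLMA's documented schema.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Issue:
    severity: Severity
    location: str   # code-location reference, e.g. a function name
    directive: str  # explicit instruction for the fix


def actionable(
    issues: List[Issue], min_severity: Severity = Severity.MAJOR
) -> List[Issue]:
    """Keep issues at or above min_severity, most severe first."""
    kept = [i for i in issues if i.severity >= min_severity]
    return sorted(kept, key=lambda i: i.severity, reverse=True)


def route_feedback(
    reasonableness: float,
    executability: float,
    satisfaction: float,
    threshold: float = 0.7,
) -> Optional[str]:
    """Pick the agent that should receive feedback, or None if all pass."""
    if reasonableness < threshold:
        return "Reasoner"  # the approach itself needs rethinking
    if executability < threshold:
        return "Solver"    # sound plan, broken code: fix at code level
    if satisfaction < threshold:
        return "Refiner"   # the task was likely misread or under-specified
    return None
```

With this routing, scores of 0.9, 0.4, and 0.8 send feedback straight to the Solver, which matches the first pitfall's point that a sound approach with failing code needs a code-level fix rather than a new plan.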
Deliverables
- CLMA Architecture Blueprint: Complete system design document covering C++17 core modules, Python orchestration layer, DAG processor implementation, and SSE streaming architecture.
- Agent Role & Prompt Template Checklist: Verified prompt schemas for Refiner, Reasoner, Solver, Verifier, and Evaluator, including edge-case extraction patterns and severity-classification rubrics.
- Configuration Templates: Production-ready YAML/JSON configs for scoring weights, threshold tuning, max_iterations, token budgets, and DAG decomposition rules.
- Deployment & Streaming Guide: Step-by-step instructions for setting up the Flask + SSE Web UI, configuring pybind11 bindings, and monitoring real-time agent flow graphs.