ost-per-decision and complicating bias testing. The scalable pattern is a hybrid agent: a small reasoning model orchestrates specialist tools. Each tool (e.g., fraud scorer, ledger writer) is independently testable, replaceable, and auditable. This decomposition makes bias testing tractable and reduces inference costs.
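To illustrate why decomposition makes bias testing tractable, here is a minimal sketch that exercises one tool in isolation and compares its flag rate across groups. The `Applicant` shape, the rule inside `fraudScorer`, and the helper names are all hypothetical, and the tool is synchronous only to keep the example short.

```typescript
// Hypothetical input shape; a real tool would define its own contract.
interface Applicant {
  id: string;
  group: string; // cohort label used only for the bias check
  amount: number;
  priorChargebacks: number;
}

// Minimal tool contract (assumed): a named unit with one entry point.
interface IsolatedTool<I, O> {
  name: string;
  execute(input: I): O;
}

// Stand-in scorer with a made-up rule, so the test is deterministic.
const fraudScorer: IsolatedTool<Applicant, { flagged: boolean }> = {
  name: "fraud_scorer",
  execute: (a) => ({ flagged: a.priorChargebacks >= 2 || a.amount > 10000 }),
};

// Share of each group that the tool flags.
function flagRateByGroup(
  tool: IsolatedTool<Applicant, { flagged: boolean }>,
  cohort: Applicant[]
): Record<string, number> {
  const totals: Record<string, number> = {};
  const flagged: Record<string, number> = {};
  cohort.forEach((applicant) => {
    totals[applicant.group] = (totals[applicant.group] || 0) + 1;
    if (tool.execute(applicant).flagged) {
      flagged[applicant.group] = (flagged[applicant.group] || 0) + 1;
    }
  });
  const rates: Record<string, number> = {};
  Object.keys(totals).forEach((group) => {
    rates[group] = (flagged[group] || 0) / totals[group];
  });
  return rates;
}

// Largest difference in flag rate between any two groups.
function maxRateGap(rates: Record<string, number>): number {
  const values = Object.keys(rates).map((group) => rates[group]);
  return Math.max(...values) - Math.min(...values);
}

const cohort: Applicant[] = [
  { id: "a1", group: "A", amount: 500, priorChargebacks: 0 },
  { id: "a2", group: "A", amount: 12000, priorChargebacks: 0 },
  { id: "b1", group: "B", amount: 300, priorChargebacks: 0 },
  { id: "b2", group: "B", amount: 800, priorChargebacks: 1 },
];
const rates = flagRateByGroup(fraudScorer, cohort);
console.log(rates, "gap:", maxRateGap(rates));
```

Because the check targets a single tool, it can run in CI against a fixed cohort whenever that tool changes, without involving the reasoning model at all.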
3. Structural Human-in-the-Loop
HITL is an architectural pattern, not a UI checkbox. In high-stakes contexts like credit or clinical eligibility, autonomous decisions invite maximum scrutiny. The architecture must include a review queue, an override path, and an audit log for human decisions. Agents should recommend actions, while humans review edge cases or high-risk decisions. This queue must be core infrastructure, capable of handling volume and ensuring no decision slips through without required review.
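The three pieces named above (review queue, override path, audit log for human decisions) can be sketched as a small in-memory structure. This is an illustration under assumptions: the class, field names, and status values are hypothetical, and a production queue would sit on durable storage rather than a `Map`.

```typescript
type Outcome = "APPROVED" | "REJECTED";
type ReviewStatus = "PENDING" | "UPHELD" | "OVERRIDDEN";

interface ReviewItem {
  decisionId: string;
  recommendation: Outcome; // what the agent recommended
  riskScore: number;
  status: ReviewStatus;
}

interface HumanAuditEntry {
  decisionId: string;
  reviewerId: string;
  recommendation: Outcome;
  finalOutcome: Outcome;
  overridden: boolean;
  justification: string;
  reviewedAt: Date;
}

class InMemoryReviewQueue {
  private items = new Map<string, ReviewItem>();
  private auditLog: HumanAuditEntry[] = [];

  // The agent only recommends; the item stays PENDING until a human acts.
  enqueue(decisionId: string, recommendation: Outcome, riskScore: number): void {
    if (this.items.has(decisionId)) {
      throw new Error(`Decision ${decisionId} is already queued`);
    }
    this.items.set(decisionId, { decisionId, recommendation, riskScore, status: "PENDING" });
  }

  // Override path: the reviewer's outcome is final, and every review
  // (upheld or overridden) is logged with attribution.
  resolve(
    decisionId: string,
    reviewerId: string,
    finalOutcome: Outcome,
    justification: string
  ): HumanAuditEntry {
    const item = this.items.get(decisionId);
    if (!item) throw new Error(`Decision ${decisionId} is not in the queue`);
    if (item.status !== "PENDING") throw new Error(`Decision ${decisionId} was already reviewed`);
    const overridden = finalOutcome !== item.recommendation;
    item.status = overridden ? "OVERRIDDEN" : "UPHELD";
    const entry: HumanAuditEntry = {
      decisionId,
      reviewerId,
      recommendation: item.recommendation,
      finalOutcome,
      overridden,
      justification,
      reviewedAt: new Date(),
    };
    this.auditLog.push(entry);
    return entry;
  }

  // Lets operators confirm that nothing is waiting without review.
  pendingCount(): number {
    let pending = 0;
    this.items.forEach((item) => {
      if (item.status === "PENDING") pending++;
    });
    return pending;
  }

  auditTrail(): readonly HumanAuditEntry[] {
    return this.auditLog;
  }
}

const queue = new InMemoryReviewQueue();
queue.enqueue("d-1", "APPROVED", 0.91);
const entry = queue.resolve("d-1", "reviewer-7", "REJECTED", "Income documents inconsistent");
console.log(entry.overridden, queue.pendingCount());
```

Rejecting a second `resolve` call on the same decision is a deliberate choice in this sketch: it keeps one attributable human decision per item, so the audit trail never has to be reconciled.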
4. Observability as a Launch Gate
Observability is a deployment requirement, equal to the feature itself. The system must be able to answer "what did this system do at 3 PM last Tuesday and why?" at any time. This requires request logs, decision traces, drift metrics, and incident timelines. Without this layer, the system cannot reconstruct its own decisions for an auditor, and it should not ship into a regulated workflow.
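Answering the "3 PM last Tuesday" question comes down to a time-window query over stored decision traces. The sketch below assumes traces are already loaded into memory and uses a hypothetical `DecisionTrace` shape; in practice the same filter would run as a query against the trace store.

```typescript
// Hypothetical trace shape; field names are assumptions for illustration.
interface DecisionTrace {
  traceId: string;
  timestamp: Date;
  modelVersion: string;
  status: "APPROVED" | "REJECTED" | "REVIEW_REQUIRED";
  reasoning: string;
  toolNames: string[];
}

// Returns traces with from <= timestamp < to, oldest first.
// filter() returns a new array, so the caller's array is not reordered.
function tracesInWindow(traces: DecisionTrace[], from: Date, to: Date): DecisionTrace[] {
  return traces
    .filter((t) => t.timestamp.getTime() >= from.getTime() && t.timestamp.getTime() < to.getTime())
    .sort((a, b) => a.timestamp.getTime() - b.timestamp.getTime());
}

// One-line "what and why" summary for a single decision.
function explainTrace(t: DecisionTrace): string {
  return `${t.timestamp.toISOString()} ${t.traceId} [${t.modelVersion}] ` +
    `${t.status}: ${t.reasoning} (tools: ${t.toolNames.join(", ")})`;
}

const sampleTraces: DecisionTrace[] = [
  {
    traceId: "t-3",
    timestamp: new Date("2024-03-05T15:40:00Z"),
    modelVersion: "small-reasoning-v2",
    status: "REVIEW_REQUIRED",
    reasoning: "Risk score above threshold",
    toolNames: ["kyc_checker", "fraud_scorer"],
  },
  {
    traceId: "t-1",
    timestamp: new Date("2024-03-05T14:59:59Z"),
    modelVersion: "small-reasoning-v2",
    status: "APPROVED",
    reasoning: "All checks passed",
    toolNames: ["kyc_checker"],
  },
  {
    traceId: "t-2",
    timestamp: new Date("2024-03-05T15:02:11Z"),
    modelVersion: "small-reasoning-v2",
    status: "REJECTED",
    reasoning: "KYC mismatch",
    toolNames: ["kyc_checker"],
  },
];

const hourStart = new Date("2024-03-05T15:00:00Z");
const hourEnd = new Date("2024-03-05T16:00:00Z");
tracesInWindow(sampleTraces, hourStart, hourEnd).forEach((t) => console.log(explainTrace(t)));
```

The half-open window (`from` inclusive, `to` exclusive) is a design choice that keeps adjacent hourly queries from double-counting a decision that lands exactly on the boundary.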
Implementation Example: TypeScript Orchestrator
The following TypeScript example demonstrates a compliance-first orchestrator. It integrates provenance tracking, tool execution, HITL routing, and audit persistence.
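import { createHash, randomUUID } from 'node:crypto';

// Core interfaces for compliance and provenance
interface ProvenanceRecord {
  traceId: string;
  modelVersion: string;
  promptHash: string;
  contextSources: string[];
  toolCalls: ToolInvocation[];
  timestamp: Date;
  complianceFlags: string[];
}

interface DecisionResult {
  status: 'APPROVED' | 'REJECTED' | 'REVIEW_REQUIRED';
  reasoning: string;
  riskScore: number;
  provenance: ProvenanceRecord;
}

interface ToolInvocation {
  toolName: string;
  input: Record<string, unknown>;
  output: Record<string, unknown>;
  latencyMs: number;
}

// HITL Queue Interface
interface HumanReviewQueue {
  enqueue(decision: DecisionResult, trace: ProvenanceRecord): Promise<void>;
  getPendingCount(): Promise<number>;
}

// Audit Store Interface
interface AuditStore {
  persistDecision(result: DecisionResult, trace: ProvenanceRecord): Promise<void>;
  persistError(error: Error, trace: ProvenanceRecord): Promise<void>;
}

// Supporting types and helpers the orchestrator references. The article does
// not define them, so the shapes below are assumptions for illustration only.
interface RiskRequest { sources: string[]; [field: string]: unknown }
interface ToolPlan { steps: { name: string; input: Record<string, unknown> }[] }
interface Tool { execute(input: Record<string, unknown>): Promise<Record<string, unknown>> }
interface ReasoningModel {
  version: string;
  generatePlan(request: RiskRequest, trace: ProvenanceRecord): Promise<ToolPlan>;
  synthesize(
    request: RiskRequest,
    toolResults: ToolInvocation[],
    trace: ProvenanceRecord
  ): Promise<DecisionResult>;
}

const generateUUID = (): string => randomUUID();
// JSON.stringify is key-order sensitive, so equal objects with different key
// order hash differently; canonicalize first if that matters.
const hashJSON = (value: unknown): string =>
  createHash('sha256').update(JSON.stringify(value)).digest('hex');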
// Orchestrator Implementation
class RegulatedOrchestrator {
  constructor(
    private planner: ReasoningModel,
    private tools: Map<string, Tool>,
    private auditStore: AuditStore,
    private reviewQueue: HumanReviewQueue,
    private riskThreshold: number
  ) {}

  async execute(request: RiskRequest): Promise<DecisionResult> {
    const trace = this.initProvenance(request);
    try {
      // 1. Planning: Small model generates tool sequence
      const plan = await this.planner.generatePlan(request, trace);
      // 2. Execution: Run specialist tools
      const toolResults = await this.runTools(plan, trace);
      // 3. Synthesis: Model combines results into decision
      const decision = await this.planner.synthesize(request, toolResults, trace);
      // 4. Compliance Routing: queue anything the model flagged for review
      //    or that meets the risk threshold
      const needsReview =
        decision.status === 'REVIEW_REQUIRED' || decision.riskScore >= this.riskThreshold;
      const result: DecisionResult = needsReview
        ? { ...decision, status: 'REVIEW_REQUIRED' }
        : decision;
      if (needsReview) {
        await this.reviewQueue.enqueue(result, trace);
      }
      // 5. Persistence: every outcome, including routed ones, reaches the audit log
      await this.auditStore.persistDecision(result, trace);
      return result;
    } catch (error) {
      // `error` is `unknown` under strict mode; normalize before persisting
      const err = error instanceof Error ? error : new Error(String(error));
      await this.auditStore.persistError(err, trace);
      throw error;
    }
  }

  private initProvenance(request: RiskRequest): ProvenanceRecord {
    return {
      traceId: generateUUID(),
      modelVersion: this.planner.version,
      promptHash: hashJSON(request),
      contextSources: request.sources,
      toolCalls: [],
      timestamp: new Date(),
      complianceFlags: ['EU_AI_ACT', 'SOC2']
    };
  }

  private async runTools(plan: ToolPlan, trace: ProvenanceRecord): Promise<ToolInvocation[]> {
    const results: ToolInvocation[] = [];
    for (const step of plan.steps) {
      const tool = this.tools.get(step.name);
      if (!tool) throw new Error(`Tool ${step.name} not found`);
      const start = Date.now();
      const output = await tool.execute(step.input);
      const latencyMs = Date.now() - start;
      // Record the call in both the return value and the provenance trace
      const invocation: ToolInvocation = { toolName: step.name, input: step.input, output, latencyMs };
      results.push(invocation);
      trace.toolCalls.push(invocation);
    }
    return results;
  }
}
Architecture Rationale:
- Small Reasoning Model: The `planner` uses a lightweight model to orchestrate tools. This reduces cost and latency while maintaining control.
- Tool Isolation: Tools are mapped and executed independently. This allows for unit testing, bias auditing, and replacement without retraining the orchestrator.
- Provenance Injection: Every step updates the `trace` object. This ensures complete lineage is captured before the decision is finalized.
- Threshold-Based HITL: Decisions above the risk threshold are routed to the `reviewQueue`. This ensures human oversight where it matters most, without bottlenecking low-risk transactions.
- Audit Persistence: All outcomes, including errors, are persisted to an immutable store. This satisfies regulatory requirements for reconstruction and accountability.
Pitfall Guide
- The "Checkbox" HITL Implementation
  - Mistake: Adding a "Review" button to the UI without backend queueing, override paths, or audit logging for human decisions.
  - Fix: Treat HITL as core infrastructure. Implement a message queue for review items, track human overrides with full attribution, and ensure the queue integrates with the audit store.
- Monolith Cost Blowout
  - Mistake: Using a large foundation model for every task, including simple classification or data extraction.
  - Fix: Adopt hybrid agents. Use small models for orchestration and specialist tools for specific tasks. Calculate cost-per-decision early and optimize tool selection based on complexity.
- Provenance Debt
  - Mistake: Logging raw text or relying on model outputs without structured lineage. This makes audit reconstruction impossible.
  - Fix: Define a `ProvenanceRecord` schema from day one. Capture model version, prompt hash, context, and tool calls for every inference. Store this in an immutable ledger.
- Observability Blind Spots
  - Mistake: Monitoring only uptime and latency while ignoring drift, bias metrics, and decision traces.
  - Fix: Implement comprehensive observability. Track input/output drift, tool performance, and decision distribution. Treat observability dashboards as launch requirements.
- Roadmap Silos
  - Mistake: Engineering and compliance teams work on separate timelines. Accreditation is treated as a final step.
  - Fix: Unify roadmaps. Integrate compliance checkpoints into the development lifecycle. Allow 4-9 months for accreditation in project planning.
- Over-Building Generic Capabilities
  - Mistake: Building in-house solutions for transcription, search, or standard NLP tasks.
  - Fix: Buy generic capabilities. Build only the IP that differentiates your business. Partner for high-stakes workflows where pre-certified solutions exist.
- Ignoring Operations Costs
  - Mistake: Scoping only the build phase and underestimating costs for monitoring, retraining, drift detection, and incident response.
  - Fix: Budget for operations from day one. Include MLOps, drift monitoring, and incident response in the total cost of ownership.
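For the drift-tracking fix above, one common way to quantify a shift in the decision distribution is the Population Stability Index (PSI). The source does not name a specific metric, so treat this as one possible choice; the bucket layout (approved, rejected, review) and the sample counts are made up for illustration.

```typescript
// Converts raw bucket counts into proportions that sum to 1.
function toProportions(counts: number[]): number[] {
  const total = counts.reduce((sum, c) => sum + c, 0);
  if (total === 0) throw new Error("counts must not all be zero");
  return counts.map((c) => c / total);
}

// PSI = sum over buckets of (actual - expected) * ln(actual / expected).
// epsilon guards against empty buckets, where the log would be undefined.
function populationStabilityIndex(
  baselineCounts: number[],
  currentCounts: number[],
  epsilon: number = 1e-6
): number {
  if (baselineCounts.length !== currentCounts.length) {
    throw new Error("baseline and current must use the same buckets");
  }
  const expected = toProportions(baselineCounts);
  const actual = toProportions(currentCounts);
  let psi = 0;
  for (let i = 0; i < expected.length; i++) {
    const e = Math.max(expected[i], epsilon);
    const a = Math.max(actual[i], epsilon);
    psi += (a - e) * Math.log(a / e);
  }
  return psi;
}

// Buckets: [APPROVED, REJECTED, REVIEW_REQUIRED]; counts are illustrative.
const baselineWeek = [700, 200, 100];
const currentWeek = [550, 250, 200];
console.log("decision PSI:", populationStabilityIndex(baselineWeek, currentWeek).toFixed(3));
```

A PSI of zero means the two distributions match. Alert thresholds vary by team (values around 0.1 and 0.25 are often quoted as rules of thumb), so any cutoff should be calibrated against your own history rather than taken from this sketch.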
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Core IP (e.g., Fraud Logic) | Build In-House | Differentiates business; requires deep customization. | High CapEx (18-24mo ramp), Low OpEx long-term. |
| Generic (Transcription/Search) | Buy Off-the-Shelf | Commodity capability; fast deployment. | Low CapEx, Subscription OpEx. |
| High-Stakes Workflow | Partner | Pre-certified solutions reduce accreditation risk; speed to market. | Shared Rev/Partner Cost, Lower Risk. |
| Regulated Data Processing | Build/Partner Hybrid | Build orchestration; partner for certified data handling. | Moderate CapEx, Compliance Savings. |
Configuration Template
Use this YAML template to scaffold a regulated orchestrator configuration. It defines tools, compliance settings, and HITL parameters.
orchestrator:
  model: "small-reasoning-v2"
  max_tokens: 1024
  temperature: 0.1
  risk_threshold: 0.85

tools:
  - name: "kyc_checker"
    endpoint: "internal://kyc/v1"
    timeout_ms: 500
    retry_policy: "exponential_backoff"
  - name: "fraud_scorer"
    endpoint: "internal://fraud/v2"
    timeout_ms: 200
    retry_policy: "none"
  - name: "ledger_writer"
    endpoint: "internal://ledger/v1"
    timeout_ms: 1000
    retry_policy: "idempotent"

compliance:
  provenance:
    store: "immutable_ledger"
    retention_days: 2555 # 7 years
    schema_version: "1.0"
  hitl:
    queue: "rabbitmq://audit-queue"
    escalation_timeout_hours: 4
    reviewer_roles: ["compliance_officer", "senior_analyst"]

observability:
  drift_detection: true
  bias_monitoring: true
  log_level: "DEBUG"
  metrics_endpoint: "prometheus://metrics"
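A configuration like this is worth validating at startup, before the orchestrator accepts traffic. The sketch below checks an already-parsed object, so no particular YAML library is assumed. It also assumes the template parses into top-level `orchestrator`, `tools`, and `compliance` keys, with `hitl` under `compliance`, and that risk scores are normalized to the 0 to 1 range; adjust both if your layout differs.

```typescript
interface ToolConfig {
  name: string;
  endpoint: string;
  timeout_ms: number;
  retry_policy: string;
}

interface OrchestratorConfig {
  orchestrator: { model: string; max_tokens: number; temperature: number; risk_threshold: number };
  tools: ToolConfig[];
  compliance: {
    provenance: { store: string; retention_days: number; schema_version: string };
    hitl: { queue: string; escalation_timeout_hours: number; reviewer_roles: string[] };
  };
}

// Returns a list of human-readable problems; an empty list means the config passed.
function validateConfig(cfg: OrchestratorConfig): string[] {
  const errors: string[] = [];

  const threshold = cfg.orchestrator.risk_threshold;
  if (!(threshold >= 0 && threshold <= 1)) {
    errors.push("orchestrator.risk_threshold must be between 0 and 1");
  }

  const seenNames = new Set<string>();
  cfg.tools.forEach((tool) => {
    if (seenNames.has(tool.name)) errors.push(`duplicate tool name: ${tool.name}`);
    seenNames.add(tool.name);
    if (!(tool.timeout_ms > 0)) errors.push(`tools.${tool.name}.timeout_ms must be positive`);
  });

  if (!(cfg.compliance.provenance.retention_days > 0)) {
    errors.push("compliance.provenance.retention_days must be positive");
  }
  // A review queue with nobody allowed to review would silently stall decisions.
  if (cfg.compliance.hitl.reviewer_roles.length === 0) {
    errors.push("compliance.hitl.reviewer_roles must list at least one role");
  }
  if (!(cfg.compliance.hitl.escalation_timeout_hours > 0)) {
    errors.push("compliance.hitl.escalation_timeout_hours must be positive");
  }
  return errors;
}

const exampleConfig: OrchestratorConfig = {
  orchestrator: { model: "small-reasoning-v2", max_tokens: 1024, temperature: 0.1, risk_threshold: 0.85 },
  tools: [
    { name: "kyc_checker", endpoint: "internal://kyc/v1", timeout_ms: 500, retry_policy: "exponential_backoff" },
    { name: "fraud_scorer", endpoint: "internal://fraud/v2", timeout_ms: 200, retry_policy: "none" },
  ],
  compliance: {
    provenance: { store: "immutable_ledger", retention_days: 2555, schema_version: "1.0" },
    hitl: { queue: "rabbitmq://audit-queue", escalation_timeout_hours: 4, reviewer_roles: ["compliance_officer"] },
  },
};
console.log(validateConfig(exampleConfig));
```

Collecting all problems into a list, rather than throwing on the first one, lets a deployment pipeline report every misconfiguration in a single run.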
Quick Start Guide
- Scaffold Provenance Logger: Implement the `ProvenanceRecord` structure and integrate it into your request pipeline. Ensure every inference generates a trace.
- Wrap Specialist Tools: Create interfaces for your existing tools (KYC, fraud, ledger). Ensure they return structured outputs and latency metrics.
- Deploy Orchestrator: Instantiate the `RegulatedOrchestrator` with a small reasoning model and your tool map. Configure the risk threshold.
- Connect HITL Queue: Set up the review queue and integrate it with your human review interface. Test override paths and audit logging.
- Run Audit Simulation: Execute test cases covering edge cases and high-risk scenarios. Verify lineage reconstruction, HITL routing, and observability metrics. Validate against regulatory requirements.