routed through vector search. These architectural shifts enable compliance auditing, horizontal scaling, and deterministic debugging.
Core Solution
Building a production-ready multi-agent system requires implementing five interdependent layers. Each layer solves a specific distributed systems problem. The implementation below uses TypeScript to demonstrate the architectural boundaries, state management, and routing logic.
Step 1: Implement Intent-Based Orchestration
Hardcoded routing fails when workflows evolve. Replace static chains with a classifier that evaluates incoming requests and routes them to the appropriate agent cluster.
```typescript
interface RoutingIntent {
  targetAgent: string;
  requiredTools: string[];
  confidence: number;
}

class IntentClassifier {
  async classify(prompt: string): Promise<RoutingIntent> {
    // Production: Replace with LLM call or fine-tuned NLU model
    const lower = prompt.toLowerCase();
    if (lower.includes('inventory') || lower.includes('stock')) {
      return { targetAgent: 'fulfillment-agent', requiredTools: ['erp-query'], confidence: 0.92 };
    }
    if (lower.includes('policy') || lower.includes('compliance')) {
      return { targetAgent: 'research-agent', requiredTools: ['vector-search'], confidence: 0.88 };
    }
    // No keyword matched: route to the general agent with low confidence
    return { targetAgent: 'general-agent', requiredTools: [], confidence: 0.65 };
  }
}
```
Rationale: Intent classification decouples request parsing from execution. It enables dynamic agent registration without modifying routing logic. Confidence thresholds allow fallback to human-in-the-loop or clarification agents when routing is ambiguous.
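The confidence fallback and dynamic registration described above can be sketched as follows. This is a minimal illustration, not the article's implementation: `AgentRegistry`, `clarification-agent`, and the 0.75 threshold are hypothetical names and values chosen for the example.

```typescript
// Hypothetical sketch: AgentRegistry and the agent names are illustrative.
interface RoutingIntent {
  targetAgent: string;
  requiredTools: string[];
  confidence: number;
}

type AgentHandler = (task: string) => string;

class AgentRegistry {
  private readonly agents = new Map<string, AgentHandler>();

  // Agents register themselves at startup; the routing code never changes.
  register(agentName: string, handler: AgentHandler): void {
    this.agents.set(agentName, handler);
  }

  // Pick the classified agent only if it is registered and the classifier
  // is confident enough; otherwise fall back.
  resolve(intent: RoutingIntent, threshold: number, fallbackAgent: string): string {
    const known = this.agents.has(intent.targetAgent);
    return known && intent.confidence >= threshold ? intent.targetAgent : fallbackAgent;
  }

  dispatch(agentName: string, task: string): string {
    const handler = this.agents.get(agentName);
    if (!handler) throw new Error(`No agent registered as "${agentName}"`);
    return handler(task);
  }
}

const registry = new AgentRegistry();
registry.register('fulfillment-agent', (task) => `fulfillment handled: ${task}`);
registry.register('clarification-agent', (task) => `please clarify: ${task}`);

const confident: RoutingIntent = { targetAgent: 'fulfillment-agent', requiredTools: ['erp-query'], confidence: 0.92 };
const unsure: RoutingIntent = { targetAgent: 'general-agent', requiredTools: [], confidence: 0.65 };

const chosenConfident = registry.resolve(confident, 0.75, 'clarification-agent');
const chosenUnsure = registry.resolve(unsure, 0.75, 'clarification-agent');
```

Here the low-confidence intent is redirected to the clarification agent, and adding a new agent is a single `register` call rather than a change to the routing branch.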
Step 2: Separate Knowledge Retrieval from Live Operations
RAG and MCP serve fundamentally different purposes. RAG retrieves semantic context from static or semi-static corpora. MCP (Model Context Protocol) provides authenticated, real-time access to external systems. Mixing them causes data staleness and permission leaks.
```typescript
class KnowledgeRouter {
  async resolveQuery(intent: RoutingIntent, query: string): Promise<string> {
    // Static knowledge goes to RAG; live system state goes to MCP
    if (intent.requiredTools.includes('vector-search')) {
      return this.performRAG(query);
    }
    if (intent.requiredTools.includes('erp-query')) {
      return this.executeMCPCall('inventory-service', query);
    }
    throw new Error('Unsupported tool requirement');
  }

  private async performRAG(query: string): Promise<string> {
    // Vector DB query over policy docs, FAQs, runbooks
    return `RAG context for: ${query}`;
  }

  private async executeMCPCall(service: string, params: string): Promise<string> {
    // MCP client handles auth, schema validation, and live API routing
    return `Live data from ${service}: ${params}`;
  }
}
```
Rationale: Vector search returns whatever was last indexed, so routing live operational queries through it produces answers that may already be stale. A dedicated MCP layer gives every external call one place for schema validation, authentication, and audit logging. This separation is non-negotiable for financial, inventory, or customer data workflows.
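To make the schema validation step concrete, the sketch below checks call parameters before a live request is dispatched. It is a simplified stand-in: MCP tool definitions describe their inputs with JSON Schema, whereas `ParamSchema`, `validateParams`, and the inventory fields here are hypothetical and only cover primitive types.

```typescript
// Hypothetical sketch: ParamSchema, validateParams, and the inventory schema are illustrative.
type FieldType = 'string' | 'number';
type ParamSchema = Record<string, FieldType>;

interface ValidationResult {
  ok: boolean;
  errors: string[];
}

// Require every declared field with the right primitive type, and reject
// undeclared fields so nothing unexpected reaches the live system.
function validateParams(schema: ParamSchema, params: Record<string, unknown>): ValidationResult {
  const errors: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    if (!(field in params)) {
      errors.push(`missing field: ${field}`);
    } else if (typeof params[field] !== expected) {
      errors.push(`field ${field} must be ${expected}`);
    }
  }
  for (const field of Object.keys(params)) {
    if (!(field in schema)) errors.push(`unexpected field: ${field}`);
  }
  return { ok: errors.length === 0, errors };
}

const inventoryLookupSchema: ParamSchema = { sku: 'string', warehouseId: 'number' };

const valid = validateParams(inventoryLookupSchema, { sku: 'A-100', warehouseId: 7 });
const invalid = validateParams(inventoryLookupSchema, { sku: 'A-100', warehouseId: '7', note: 'rush' });
```

The second call fails on two counts (a string where a number is required, and an undeclared `note` field), so the request would be rejected before any external system is touched.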
Step 3: Enforce Security Boundaries for Remote Agents
Local agents share memory space and trust context. Remote agents cross network boundaries, requiring explicit authentication, authorization, and payload auditing.
```typescript
interface AgentExecutionContext {
  agentId: string;
  scope: 'local' | 'remote';
  credentials?: string;
  auditTrail: string[];
}

class AgentExecutor {
  async run(context: AgentExecutionContext, task: string): Promise<string> {
    // Remote calls must pass the trust-boundary check before anything runs
    if (context.scope === 'remote') {
      this.validateTrustBoundary(context);
    }
    const result = await this.invokeAgent(context.agentId, task);
    context.auditTrail.push(`[${new Date().toISOString()}] ${context.agentId} executed: ${task}`);
    return result;
  }

  private validateTrustBoundary(ctx: AgentExecutionContext): void {
    if (!ctx.credentials) throw new Error('Remote execution requires scoped credentials');
    // Enforce mTLS, token validation, and least-privilege scope
  }

  // Placeholder: dispatch to the agent runtime (in-process call or remote request)
  private async invokeAgent(agentId: string, task: string): Promise<string> {
    return `${agentId} completed: ${task}`;
  }
}
```
Rationale: Remote agents must never inherit orchestrator permissions. Scoped credentials, network encryption, and mandatory audit logging prevent privilege escalation and data exfiltration. Protocols like A2A (Agent-to-Agent) standardize this communication, but the security policy must be enforced at the execution layer.
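The least-privilege check left as a comment in `validateTrustBoundary` could look roughly like this. The sketch assumes a decoded credential with a subject, a scope list, and an expiry; `ScopedCredential`, `authorizeRemoteCall`, and the `inventory:read` scope naming are hypothetical, and real deployments would verify a signed token rather than a plain object.

```typescript
// Hypothetical sketch: ScopedCredential and authorizeRemoteCall are illustrative names.
interface ScopedCredential {
  subject: string;   // agent the credential was issued to
  scopes: string[];  // granted permissions, e.g. 'inventory:read'
  expiresAt: number; // epoch milliseconds
}

interface AuthorizationDecision {
  allowed: boolean;
  reason: string;
}

// Deny by default: every check must pass before the remote call is allowed.
function authorizeRemoteCall(
  credential: ScopedCredential | undefined,
  agentId: string,
  requiredScope: string,
  now: number
): AuthorizationDecision {
  if (!credential) return { allowed: false, reason: 'missing credential' };
  if (credential.subject !== agentId) {
    return { allowed: false, reason: 'credential issued to a different agent' };
  }
  if (credential.expiresAt <= now) return { allowed: false, reason: 'credential expired' };
  if (!credential.scopes.includes(requiredScope)) {
    return { allowed: false, reason: `scope ${requiredScope} not granted` };
  }
  return { allowed: true, reason: 'ok' };
}

const now = 1700000000000;
const credential: ScopedCredential = {
  subject: 'fulfillment-agent',
  scopes: ['inventory:read'],
  expiresAt: now + 60000,
};

const readDecision = authorizeRemoteCall(credential, 'fulfillment-agent', 'inventory:read', now);
const writeDecision = authorizeRemoteCall(credential, 'fulfillment-agent', 'inventory:write', now);
```

The read is allowed and the write is denied, because the credential carries only the scope the agent needs and nothing from the orchestrator's own permissions.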
Step 4: Implement Durable State Management
In-memory state is acceptable for notebooks. Production requires checkpointing, crash recovery, and cross-agent context sharing.
```typescript
interface WorkflowCheckpoint {
  workflowId: string;
  stepIndex: number;
  agentState: Record<string, unknown>;
  conversationHistory: Array<{ role: string; content: string }>;
  timestamp: string;
}

class DurableStateBackend {
  async saveCheckpoint(checkpoint: WorkflowCheckpoint): Promise<void> {
    // Production: PostgreSQL/Redis with atomic writes
    console.log(`Persisting checkpoint ${checkpoint.workflowId} at step ${checkpoint.stepIndex}`);
  }

  async loadCheckpoint(workflowId: string): Promise<WorkflowCheckpoint | null> {
    // Production: Query storage backend
    return null;
  }
}
```
Rationale: Durable state enables partial retries, horizontal scaling, and compliance auditing. When an agent crashes, the workflow resumes from the last checkpoint rather than restarting. Conversation history accumulation prevents context amnesia across handoffs.
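The resume-from-checkpoint behavior can be demonstrated with a small synchronous sketch. It assumes an in-memory map standing in for PostgreSQL or Redis; `InMemoryCheckpointStore`, `runWorkflow`, and the three steps are hypothetical and exist only to show that completed steps are not re-run after a crash.

```typescript
// Hypothetical sketch: InMemoryCheckpointStore stands in for a durable backend.
interface Checkpoint {
  workflowId: string;
  stepIndex: number; // index of the next step to run
  outputs: string[];
}

class InMemoryCheckpointStore {
  private readonly rows = new Map<string, string>();

  save(checkpoint: Checkpoint): void {
    // Serialize so the stored copy is independent of the running workflow.
    this.rows.set(checkpoint.workflowId, JSON.stringify(checkpoint));
  }

  load(workflowId: string): Checkpoint | null {
    const raw = this.rows.get(workflowId);
    return raw === undefined ? null : (JSON.parse(raw) as Checkpoint);
  }
}

type Step = (previous: string[]) => string;

// Run the remaining steps, checkpointing after each one completes.
function runWorkflow(workflowId: string, steps: Step[], store: InMemoryCheckpointStore): string[] {
  const resumed = store.load(workflowId);
  const outputs = resumed ? resumed.outputs : [];
  let next = resumed ? resumed.stepIndex : 0;
  for (const step of steps.slice(next)) {
    outputs.push(step(outputs)); // a throw here simulates an agent crash
    next += 1;
    store.save({ workflowId, stepIndex: next, outputs });
  }
  return outputs;
}

const store = new InMemoryCheckpointStore();
let calls = 0;
let failOnce = true;
const steps: Step[] = [
  () => { calls += 1; return 'classified'; },
  () => {
    calls += 1;
    if (failOnce) { failOnce = false; throw new Error('agent crashed'); }
    return 'fetched inventory';
  },
  (previous) => { calls += 1; return `summarized ${previous.length} results`; },
];

let firstAttemptFailed = false;
try {
  runWorkflow('wf-1', steps, store);
} catch {
  firstAttemptFailed = true;
}
const finalOutputs = runWorkflow('wf-1', steps, store); // resumes at step 2 of 3
```

The second call picks up at the failed step: the first step runs once in total, which is the partial-retry property the rationale describes.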
Step 5: Add Observability and Distributed Tracing
Black-box failures are unfixable. Every agent handoff, tool call, and state mutation must emit structured traces.
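```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Spans are created from a tracer, not from `trace` directly.
// The tracer name is illustrative; use your service or library name.
const tracer = trace.getTracer('agent-orchestrator');

class ObservabilityMiddleware {
  async wrapExecution(workflowId: string, fn: () => Promise<string>): Promise<string> {
    const span = tracer.startSpan('agent-execution', {
      attributes: { workflowId, timestamp: new Date().toISOString() }
    });
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // Record the failure on the span, then rethrow so callers still see it
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      throw error;
    } finally {
      span.end();
    }
  }
}
```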
Rationale: OpenTelemetry integration provides distributed tracing across agent boundaries. Span metadata captures routing decisions, tool latency, and failure points. This transforms debugging from guesswork into deterministic analysis.
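To show what such span metadata might contain without pulling in an SDK, here is a dependency-free sketch. `SpanRecord` and `TraceRecorder` are hypothetical stand-ins, not OpenTelemetry types; the point is only the shape of the data: a name, attributes such as the routing decision, an outcome, and a duration.

```typescript
// Hypothetical sketch: SpanRecord and TraceRecorder are illustrative, not OpenTelemetry types.
interface SpanRecord {
  spanName: string;
  attributes: Record<string, string | number>;
  outcome: 'ok' | 'error';
  durationMs: number;
  error?: string;
}

class TraceRecorder {
  readonly spans: SpanRecord[] = [];

  // Run fn inside a "span": capture attributes, outcome, and elapsed time,
  // then rethrow any failure so callers still see it.
  record<T>(spanName: string, attributes: Record<string, string | number>, fn: () => T): T {
    const startedAt = Date.now();
    try {
      const result = fn();
      this.spans.push({ spanName, attributes, outcome: 'ok', durationMs: Date.now() - startedAt });
      return result;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      this.spans.push({ spanName, attributes, outcome: 'error', durationMs: Date.now() - startedAt, error: message });
      throw err;
    }
  }
}

const recorder = new TraceRecorder();

const answer = recorder.record(
  'agent-execution',
  { agentId: 'research-agent', routingDecision: 'vector-search' },
  () => 'policy summary'
);

let toolFailed = false;
try {
  recorder.record('tool-call', { agentId: 'fulfillment-agent', toolName: 'erp-query' }, () => {
    throw new Error('ERP timeout');
  });
} catch {
  toolFailed = true;
}
```

After these two calls, `recorder.spans` holds one successful record with the routing decision and one failed record naming the tool and the error, which is the information needed to locate a failure point.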
Pitfall Guide
1. Routing Live Data Through RAG
Explanation: Developers default to vector search because it's easier to configure. RAG retrieves semantic matches from indexed documents, not real-time system state.
Fix: Route operational queries (inventory, CRM, pricing) through MCP clients or direct HTTP/API calls. Reserve RAG for policy, documentation, and historical context.
2. Storing Workflow State in Memory
Explanation: In-memory objects disappear on process restart, container scaling, or agent crashes. Workflows lose context and cannot resume.
Fix: Implement a persistent checkpoint backend (PostgreSQL, Redis, or cloud storage). Serialize agent state and conversation history at every handoff.
3. Hardcoding Agent Routing Paths
Explanation: Sequential if/else or linear chains break when new agents are added or requirements change. Routing becomes a maintenance bottleneck.
Fix: Deploy an intent classifier with a dynamic agent registry. Route based on classified intent and required tool capabilities, not static paths.
4. Ignoring Trust Boundaries for Remote Agents
Explanation: When cross-network agent calls reuse the orchestrator's credentials, remote agents effectively inherit its permissions, creating privilege escalation risks and unaudited data access.
Fix: Enforce scoped credentials, mTLS, and explicit authorization checks for every remote invocation. Log all cross-boundary payloads for compliance.
5. Overloading Single Agents with Multiple Domains
Explanation: Prompting one agent to handle finance, coding, and support pollutes context windows and degrades accuracy across all domains.
Fix: Decompose into specialized workers. Each agent should own a narrow domain, a specific tool subset, and a clear success metric.
6. Skipping Observability Until Production Fails
Explanation: Console logs and unstructured errors provide almost no visibility into routing loops, state collisions, or tool failures.
Fix: Instrument every execution step with distributed tracing. Emit span attributes for agent ID, tool latency, and routing decisions from day one.
7. Treating MCP and RAG as Interchangeable
Explanation: Both return "context," but their guarantees differ fundamentally. RAG is probabilistic and read-only. MCP is deterministic and supports state mutations.
Fix: Document retrieval patterns explicitly. Use RAG for knowledge synthesis. Use MCP for live data, writes, and external system integration.
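Pitfalls 5 and 7 both come down to declaring what each agent and each tool is allowed to do. One way to make those declarations checkable is sketched below; `ToolDescriptor`, `AgentSpec`, `checkToolUse`, and the `erp-reserve-stock` tool are hypothetical names introduced for illustration.

```typescript
// Hypothetical sketch: ToolDescriptor, AgentSpec, and checkToolUse are illustrative names.
type ToolKind = 'rag' | 'mcp';

interface ToolDescriptor {
  toolName: string;
  kind: ToolKind;
  mutates: boolean; // true when the tool can change external state
}

interface AgentSpec {
  agentId: string;
  domain: string;
  allowedTools: string[];
}

const tools: ToolDescriptor[] = [
  { toolName: 'vector-search', kind: 'rag', mutates: false },
  { toolName: 'erp-query', kind: 'mcp', mutates: false },
  { toolName: 'erp-reserve-stock', kind: 'mcp', mutates: true },
];

// RAG tools are read-only; flag any descriptor that claims otherwise.
function findInvalidTools(catalog: ToolDescriptor[]): string[] {
  return catalog.filter((t) => t.kind === 'rag' && t.mutates).map((t) => t.toolName);
}

// An agent may only call tools inside its declared subset.
function checkToolUse(agent: AgentSpec, toolName: string, catalog: ToolDescriptor[]): string {
  const tool = catalog.find((t) => t.toolName === toolName);
  if (!tool) return `denied: unknown tool ${toolName}`;
  if (!agent.allowedTools.includes(toolName)) {
    return `denied: ${agent.agentId} is not allowed to use ${toolName}`;
  }
  return `allowed: ${tool.kind} tool ${toolName}${tool.mutates ? ' (writes)' : ''}`;
}

const researchAgent: AgentSpec = { agentId: 'research-agent', domain: 'policy', allowedTools: ['vector-search'] };
const fulfillmentAgent: AgentSpec = {
  agentId: 'fulfillment-agent',
  domain: 'inventory',
  allowedTools: ['erp-query', 'erp-reserve-stock'],
};

const researchWrite = checkToolUse(researchAgent, 'erp-reserve-stock', tools);
const fulfillmentWrite = checkToolUse(fulfillmentAgent, 'erp-reserve-stock', tools);
```

The research agent is denied the write tool while the fulfillment agent is allowed it, and the catalog check keeps anyone from registering a mutating tool under the RAG label.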
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Static policy/docs retrieval | RAG with vector DB | Semantic matching over unstructured text | Low compute, moderate storage |
| Live inventory/CRM queries | MCP client + direct API | Real-time state, schema validation, audit | Higher network overhead, strict auth |
| Single-node prototype | In-memory state + console logs | Fast iteration, minimal infra | Zero infra cost, high production risk |
| Multi-node production | Persistent checkpoint + distributed tracing | Crash recovery, horizontal scaling, debugging | Moderate infra cost, high reliability |
| Local agent execution | Shared memory + implicit trust | Low latency, simple routing | Minimal security overhead |
| Remote agent execution | Scoped tokens + mTLS + audit logs | Trust boundary enforcement, compliance | Higher auth/infra cost, mandatory |
Configuration Template
```typescript
// production-config.ts
export const workflowConfig = {
  orchestration: {
    classifierEndpoint: '/api/v1/classify',
    fallbackAgent: 'general-agent',
    confidenceThreshold: 0.75
  },
  storage: {
    backend: 'postgresql',
    checkpointTable: 'workflow_checkpoints',
    retentionDays: 30
  },
  security: {
    remoteAuth: 'scoped-oauth2',
    enforceMtls: true,
    auditLogEnabled: true
  },
  observability: {
    tracerProvider: 'opentelemetry',
    exportEndpoint: 'https://otel-collector.internal:4318',
    spanAttributes: ['agentId', 'toolName', 'latencyMs', 'routingDecision']
  },
  retrieval: {
    rag: { vectorDb: 'pgvector', index: 'policy_docs' },
    mcp: { registry: 'tool-registry.json', timeoutMs: 5000 }
  }
};
```
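A configuration like this is easier to trust if it is validated at startup. The sketch below checks a subset of the fields from the template; `validateWorkflowConfig` and the specific rules (threshold in the range 0 to 1, positive timeout, and so on) are assumptions chosen for illustration rather than requirements stated by the template.

```typescript
// Hypothetical sketch: validateWorkflowConfig and its rules are illustrative.
interface WorkflowConfigSubset {
  orchestration: { fallbackAgent: string; confidenceThreshold: number };
  storage: { retentionDays: number };
  security: { enforceMtls: boolean; auditLogEnabled: boolean };
  retrieval: { mcp: { timeoutMs: number } };
}

// Return a list of human-readable problems; an empty list means the config passed.
function validateWorkflowConfig(config: WorkflowConfigSubset): string[] {
  const problems: string[] = [];
  const { confidenceThreshold, fallbackAgent } = config.orchestration;

  if (!(confidenceThreshold >= 0 && confidenceThreshold <= 1)) {
    problems.push('orchestration.confidenceThreshold must be between 0 and 1');
  }
  if (fallbackAgent.trim() === '') {
    problems.push('orchestration.fallbackAgent must not be empty');
  }
  if (!Number.isInteger(config.storage.retentionDays) || config.storage.retentionDays <= 0) {
    problems.push('storage.retentionDays must be a positive integer');
  }
  if (!(config.retrieval.mcp.timeoutMs > 0)) {
    problems.push('retrieval.mcp.timeoutMs must be positive');
  }
  if (!config.security.enforceMtls || !config.security.auditLogEnabled) {
    problems.push('security: mTLS and audit logging should both be enabled for remote agents');
  }
  return problems;
}

const goodConfig: WorkflowConfigSubset = {
  orchestration: { fallbackAgent: 'general-agent', confidenceThreshold: 0.75 },
  storage: { retentionDays: 30 },
  security: { enforceMtls: true, auditLogEnabled: true },
  retrieval: { mcp: { timeoutMs: 5000 } },
};

const badConfig: WorkflowConfigSubset = {
  ...goodConfig,
  orchestration: { fallbackAgent: '', confidenceThreshold: 1.5 },
};

const goodProblems = validateWorkflowConfig(goodConfig);
const badProblems = validateWorkflowConfig(badConfig);
```

Failing fast on `badProblems` at boot is cheaper than discovering a threshold of 1.5 in production, where no intent could ever clear it and every request would hit the fallback agent.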
Quick Start Guide
- Initialize the routing layer: Deploy the IntentClassifier and connect it to your agent registry. Test with 5–10 representative prompts to validate confidence scores.
- Configure persistent storage: Set up a PostgreSQL or Redis instance. Implement saveCheckpoint and loadCheckpoint methods. Verify atomic writes and crash recovery.
- Instrument tracing: Add OpenTelemetry SDK to your execution wrapper. Emit spans for every agent invocation and tool call. Validate trace export to your observability backend.
- Test boundary conditions: Simulate agent crashes, network timeouts, and low-confidence routing. Verify checkpoint resume, fallback routing, and audit log completeness.
- Deploy with feature flags: Roll out the orchestrator behind a toggle. Monitor trace latency, checkpoint write success rate, and routing accuracy before full production cutover.
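For the feature-flag step, a percentage rollout keyed on the workflow ID keeps each workflow on the same code path across retries. The sketch below is one possible approach, not a prescribed one: `rolloutBucket` and `useNewOrchestrator` are hypothetical names, and the FNV-1a-style hash over UTF-16 code units is an arbitrary choice that any stable hash could replace.

```typescript
// Hypothetical sketch: the hash choice and function names are illustrative.

// Map a workflow ID to a stable bucket in [0, 100) using an FNV-1a-style
// hash applied to the string's UTF-16 code units.
function rolloutBucket(workflowId: string): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < workflowId.length; i++) {
    hash ^= workflowId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return (hash >>> 0) % 100; // force unsigned before taking the bucket
}

// A workflow is in the rollout when its bucket falls below the percentage,
// so raising the percentage only ever adds workflows.
function useNewOrchestrator(workflowId: string, rolloutPercent: number): boolean {
  return rolloutBucket(workflowId) < rolloutPercent;
}

const sampleIds = ['wf-1', 'wf-2', 'wf-3', 'order-42'];
const enabledAtTen = sampleIds.filter((id) => useNewOrchestrator(id, 10));
const enabledAtFifty = sampleIds.filter((id) => useNewOrchestrator(id, 50));
```

Because the bucket depends only on the ID, a workflow that resumes from a checkpoint after a crash lands on the same orchestrator it started with, and every workflow enabled at 10% remains enabled at 50%.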