ning provider, model identifier, endpoint configuration, and role-specific system instructions. This isolates capability boundaries and prevents prompt contamination across tiers.
import { MultiAgentOrchestrator } from '@open-multi-agent/core'
import type { AgentSpec, OrchestratorConfig } from '@open-multi-agent/core'
const architectSpec: AgentSpec = {
id: 'architect',
provider: 'anthropic',
model: 'claude-opus-4-7',
systemPrompt: 'Analyze requirements, decompose into modular components, and define API contracts. Prioritize correctness and extensibility.',
temperature: 0.2,
}
const builderSpec: AgentSpec = {
id: 'builder',
provider: 'openai',
model: 'gpt-5.4-mini',
systemPrompt: 'Implement the architectural blueprint. Generate production-ready code with error handling and type safety.',
temperature: 0.3,
}
const validatorSpec: AgentSpec = {
id: 'validator',
provider: 'openai',
model: 'llama3.1',
baseURL: 'http://localhost:11434/v1',
apiKey: 'local-placeholder',
systemPrompt: 'Review generated code for syntax errors, security anti-patterns, and compliance with the API contract. Return pass/fail with brief notes.',
temperature: 0.1,
}
Rationale: The architect requires high reasoning capacity and low temperature for deterministic planning. The builder handles implementation, which benefits from a cost-efficient model with strong code generation capabilities. The validator performs checklist-style review, making it ideal for a local model where marginal cloud cost is zero. The baseURL override routes the validator through an OpenAI-compatible inference server without requiring provider-specific adapters.
Step 2: Initialize the Orchestrator with Lazy Adapter Loading
The orchestrator instantiates LLM adapters on demand. This means you only load SDKs for providers you actually use, reducing bundle size and initialization overhead.
const orchestratorConfig: OrchestratorConfig = {
defaultModel: 'claude-opus-4-7',
defaultProvider: 'anthropic',
onProgress: (event) => {
telemetryCollector.processEvent(event)
},
}
const orchestrator = new MultiAgentOrchestrator(orchestratorConfig)
const pipeline = orchestrator.createTeam('code-generation-pipeline', {
name: 'code-generation-pipeline',
agents: [architectSpec, builderSpec, validatorSpec],
sharedMemory: true,
executionMode: 'dependency-aware',
})
Rationale: sharedMemory: true allows agents to reference previous outputs without re-prompting, reducing redundant input tokens. executionMode: 'dependency-aware' ensures the builder waits for the architect, and the validator waits for the builder, preventing race conditions. The onProgress callback hooks into the execution lifecycle, enabling real-time telemetry without blocking the main thread.
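The lazy adapter loading described in Step 2 isn't visible in this config. As a rough sketch of the idea only (the loadAdapter helper and cache below are illustrative assumptions, not the package's actual internals), a provider's SDK is imported the first time an agent using that provider runs:

```ts
// Illustrative only: dynamic import defers each provider SDK until first use.
const adapterCache = new Map<string, unknown>()

async function loadAdapter(provider: 'anthropic' | 'openai') {
  if (!adapterCache.has(provider)) {
    const sdk = provider === 'anthropic'
      ? await import('@anthropic-ai/sdk') // pulled in only if an Anthropic agent runs
      : await import('openai')            // covers OpenAI and OpenAI-compatible endpoints
    adapterCache.set(provider, sdk)
  }
  return adapterCache.get(provider)
}
```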
Step 3: Implement Event-Driven Telemetry
Cost and latency tracking must happen during execution, not after. We attach a ledger that captures start/end timestamps and aggregates token usage per agent.
class ExecutionLedger {
private activeSessions = new Map<string, { startTime: number; model: string }>()
private completedRuns: Array<{ agentId: string; model: string; latencyMs: number; inputTokens: number; outputTokens: number }> = []
processEvent(event: any) {
if (event.type === 'agent_start' && event.agentId) {
const spec = [architectSpec, builderSpec, validatorSpec].find(a => a.id === event.agentId)
this.activeSessions.set(event.agentId, {
startTime: Date.now(),
model: spec?.model ?? 'unknown',
})
}
if (event.type === 'agent_complete' && event.agentId) {
const session = this.activeSessions.get(event.agentId)
if (session) {
this.completedRuns.push({
agentId: event.agentId,
model: session.model,
latencyMs: Date.now() - session.startTime,
inputTokens: event.tokenUsage?.input ?? 0,
outputTokens: event.tokenUsage?.output ?? 0,
})
this.activeSessions.delete(event.agentId)
}
}
}
calculateCosts() {
const pricingMap: Record<string, { inputPerM: number; outputPerM: number }> = {
'claude-opus-4-7': { inputPerM: 5.00, outputPerM: 25.00 },
'gpt-5.4-mini': { inputPerM: 0.75, outputPerM: 4.50 },
'llama3.1': { inputPerM: 0, outputPerM: 0 },
}
return this.completedRuns.map(run => {
const rates = pricingMap[run.model] ?? { inputPerM: 0, outputPerM: 0 }
const cost = (run.inputTokens / 1_000_000) * rates.inputPerM + (run.outputTokens / 1_000_000) * rates.outputPerM
return { ...run, cost }
})
}
}
const telemetryCollector = new ExecutionLedger()
Rationale: The ledger decouples telemetry from business logic. By capturing agent_start and agent_complete events, we measure wall-clock latency accurately. Token usage is pulled from the orchestrator's native tokenUsage payload, ensuring consistency with provider billing. The pricing map is kept separate from the event-handling logic, so you can update rates without touching execution code. This pattern scales to dozens of agents and supports post-run cost attribution for chargeback or budgeting systems.
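For chargeback or budgeting reports, the per-run rows can be rolled up per agent. A small helper along these lines (not part of the library; it only consumes the ledger's output) would do it:

```ts
// Aggregate per-run costs into a per-agent summary for chargeback reports.
function summarizeByAgent(runs: ReturnType<ExecutionLedger['calculateCosts']>) {
  const byAgent = new Map<string, { runs: number; totalCost: number; totalLatencyMs: number }>()
  for (const run of runs) {
    const entry = byAgent.get(run.agentId) ?? { runs: 0, totalCost: 0, totalLatencyMs: 0 }
    entry.runs += 1
    entry.totalCost += run.cost
    entry.totalLatencyMs += run.latencyMs
    byAgent.set(run.agentId, entry)
  }
  return byAgent
}
```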
Step 4: Execute and Inspect Results
const runResult = await orchestrator.runTeam(pipeline, 'Implement a rate-limited HTTP client with exponential backoff and circuit breaker logic.')
const costReport = telemetryCollector.calculateCosts()
console.table(costReport)
console.log(`Total execution cost: $${costReport.reduce((sum, r) => sum + r.cost, 0).toFixed(4)}`)
The orchestrator handles provider switching, token accounting, and dependency resolution internally. Your application code remains provider-agnostic while gaining granular economic visibility.
Pitfall Guide
1. The Output Token Trap
Explanation: Developers optimize for input token reduction but ignore that output pricing is 4–8x higher. Agents that generate large responses (code, reports, schemas) will dominate costs regardless of input optimization.
Fix: Assign output-heavy roles to models with favorable output pricing or implement response length constraints. Use streaming with early termination when the output exceeds expected bounds.
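One possible sketch, calling the official openai SDK directly rather than through the orchestrator and reusing the builder model name from the specs above: cap output with max_tokens, stream the response, and stop reading if it blows past a budget anyway (in recent SDK versions, exiting the loop ends the underlying request).

```ts
import OpenAI from 'openai'

const client = new OpenAI()

// Cap output tokens up front; stream so a runaway generation can be cut off early.
const stream = await client.chat.completions.create({
  model: 'gpt-5.4-mini', // builder-tier model from the specs above
  max_tokens: 1500,
  stream: true,
  messages: [{ role: 'user', content: 'Generate the HTTP client module.' }],
})

let collected = ''
for await (const chunk of stream) {
  collected += chunk.choices[0]?.delta?.content ?? ''
  if (collected.length > 20_000) break // hard character ceiling as a safety net
}
```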
2. Local Endpoint Misconfiguration
Explanation: OpenAI-compatible local servers (Ollama, vLLM, LM Studio) often ignore API keys, but the official OpenAI SDK validates that the key is non-empty. Passing undefined or null throws a client-side error before the request reaches the local server.
Fix: Always provide a placeholder string like 'local-placeholder' or 'ollama' in the apiKey field when routing through baseURL.
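For example, pointing the official openai SDK at an Ollama server on its default port looks roughly like this:

```ts
import OpenAI from 'openai'

// Ollama ignores the key, but the SDK rejects an empty one client-side,
// so any non-empty placeholder string works.
const localClient = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
})

const review = await localClient.chat.completions.create({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'Return PASS or FAIL for the attached diff.' }],
})
```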
3. Synchronous Blocking on Local Models
Explanation: Local models run on consumer hardware and typically exhibit 2–5x higher latency than hosted APIs. If your pipeline waits synchronously for a local validator, you introduce unpredictable delays that break SLAs.
Fix: Run local validation asynchronously or implement a timeout fallback. If the local model exceeds a threshold (e.g., 15s), route to a budget cloud model instead.
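A minimal sketch of that fallback, assuming hypothetical validateLocally and validateWithCloud helpers standing in for your actual validator calls:

```ts
type Validator = (code: string) => Promise<string>

// Hypothetical stand-ins for the local and budget-cloud validation calls.
declare const validateLocally: Validator
declare const validateWithCloud: Validator

async function validateWithTimeout(code: string, timeoutMs = 15_000): Promise<string> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('local validator timed out')), timeoutMs),
  )
  try {
    // Race the local model against the timeout threshold.
    return await Promise.race([validateLocally(code), timeout])
  } catch {
    // Too slow or unavailable: degrade to a budget cloud model.
    return validateWithCloud(code)
  }
}
```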
4. Prompt Contamination Across Tiers
Explanation: System prompts tuned for frontier models often rely on implicit reasoning capabilities that budget or local models lack. Copy-pasting the same prompt across agents causes silent degradation in cheaper tiers.
Fix: Write tier-aware prompts. Frontier models can handle open-ended reasoning. Budget models require explicit constraints, step-by-step instructions, and output format enforcement. Local models need highly structured, deterministic tasks.
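As a hypothetical illustration, the same review task phrased for a frontier tier versus a local tier:

```ts
// Frontier tier: open-ended reasoning is fine.
const frontierReviewPrompt =
  'Review this module for correctness, security, and design quality. Explain any trade-offs you see.'

// Local tier: explicit checklist with a constrained output format.
const localReviewPrompt = [
  'Check the code against this list. Answer each item with PASS or FAIL and one sentence.',
  '1. Does it compile without syntax errors?',
  '2. Are all inputs validated?',
  '3. Does every exported function match the API contract?',
].join('\n')
```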
5. Over-Engineering with Dynamic Routing
Explanation: Teams attempt to build runtime model selectors that switch models per turn based on cost or latency signals. This adds significant complexity, introduces new failure modes, and often yields marginal savings compared to static per-agent assignment.
Fix: Start with static per-agent routing. It captures 80% of the economic benefit with 5% of the complexity. Only introduce dynamic routing when you have proven telemetry, clear cost thresholds, and automated fallback chains.
6. Ignoring Provider Rate Limits
Explanation: Mixing providers means managing multiple quota pools. A sudden spike in builder requests can exhaust OpenAI rate limits while Anthropic remains underutilized, causing silent failures or retries that inflate costs.
Fix: Implement per-provider rate limiting and circuit breakers. Distribute load evenly and monitor 429 responses. Use exponential backoff with jitter to prevent thundering herd effects.
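A minimal sketch of retry with exponential backoff and full jitter, wrapping whatever provider call your pipeline makes (the 429 check assumes the error object exposes a status field, as the major SDKs do):

```ts
// Retry a provider call on 429s with exponential backoff and full jitter.
async function withBackoff<T>(call: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call()
    } catch (err: any) {
      const rateLimited = err?.status === 429
      if (!rateLimited || attempt >= maxRetries) throw err
      const ceilingMs = 500 * 2 ** attempt      // exponential ceiling: 0.5s, 1s, 2s, ...
      const delayMs = Math.random() * ceilingMs // full jitter avoids thundering herds
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
}
```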
7. Missing Fallback Chains
Explanation: When a specific model tier fails (e.g., local server crashes, cloud API outage), the entire pipeline halts. Teams assume mixed-model setups are resilient because they use multiple providers, but they rarely implement graceful degradation.
Fix: Define fallback hierarchies. If the local validator fails, route to a budget cloud model. If the budget model fails, route to a frontier model with a cost warning. Log all fallbacks for post-mortem analysis.
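A sketch of such a chain, with hypothetical tier entries you would wire to your own local, budget, and frontier calls:

```ts
type Tier = { label: string; run: (task: string) => Promise<string> }

// Try each tier in order; every fallback is logged for post-mortem analysis.
async function runWithFallback(task: string, tiers: Tier[]): Promise<string> {
  for (const tier of tiers) {
    try {
      return await tier.run(task)
    } catch (err) {
      console.warn(`[fallback] ${tier.label} failed, escalating to next tier`, err)
    }
  }
  throw new Error('all tiers in the fallback chain failed')
}
```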
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-frequency validation (1000+ runs/day) | Local model via Ollama/vLLM | Zero marginal cloud cost; validation is checklist-driven | Reduces monthly spend by 60–80% |
| Strategic planning & architecture | Frontier model (Opus/GPT-5.5) | Requires complex reasoning and novel decomposition | Higher per-run cost, but low frequency minimizes impact |
| Code generation & implementation | Budget cloud model (GPT-mini/Gemini Flash) | Strong code capability at 5–10x lower output pricing | Balances quality and cost for high-output tasks |
| Strict SLA requirements (<2s latency) | All-hosted tiered routing | Local models introduce unpredictable latency | Slightly higher cost, but guarantees response time |
| Budget-constrained prototype | Local + single budget cloud model | Minimizes cloud spend while maintaining functionality | Near-zero marginal cost; acceptable for internal tools |
Configuration Template
import { MultiAgentOrchestrator } from '@open-multi-agent/core'
import type { AgentSpec, OrchestratorConfig } from '@open-multi-agent/core'
// Agent specifications with explicit tier assignment
const agents: AgentSpec[] = [
{
id: 'planner',
provider: 'anthropic',
model: 'claude-opus-4-7',
systemPrompt: 'Decompose complex requirements into modular tasks. Define interfaces and constraints.',
temperature: 0.2,
},
{
id: 'executor',
provider: 'openai',
model: 'gpt-5.4-mini',
systemPrompt: 'Implement tasks according to planner specifications. Output production-ready code.',
temperature: 0.3,
},
{
id: 'reviewer',
provider: 'openai',
model: 'llama3.1',
baseURL: 'http://localhost:11434/v1',
apiKey: 'local-placeholder',
systemPrompt: 'Validate output against planner constraints. Return pass/fail with concise notes.',
temperature: 0.1,
},
]
// Orchestrator with telemetry hook
const config: OrchestratorConfig = {
defaultModel: 'claude-opus-4-7',
defaultProvider: 'anthropic',
onProgress: (event) => {
// Route to your telemetry service
console.debug(`[${event.type}] Agent: ${event.agentId}`)
},
}
const orchestrator = new MultiAgentOrchestrator(config)
const team = orchestrator.createTeam('tiered-pipeline', {
name: 'tiered-pipeline',
agents,
sharedMemory: true,
executionMode: 'dependency-aware',
})
export { orchestrator, team }
Quick Start Guide
- Install dependencies: `npm install @open-multi-agent/core`
- Pull and serve a local model: Run `ollama pull llama3.1` followed by `ollama serve` in a separate terminal
- Define your agent specs: Assign provider, model, and system prompts per role using the configuration template
- Initialize the orchestrator: Pass the specs to `createTeam()` and attach an `onProgress` callback for telemetry
- Execute and monitor: Call `runTeam()` with your task prompt, then inspect the telemetry ledger for cost and latency breakdowns
This pattern transforms multi-agent systems from cost centers into economically sustainable automation pipelines. By aligning model capability with task complexity, you eliminate wasted token spend while preserving output quality. The architecture scales horizontally, supports continuous cost optimization, and provides the telemetry foundation required for production-grade AI engineering.