ning provider, model identifier, endpoint configuration, and role-specific system instructions. This isolates capability boundaries and prevents prompt contamination across tiers.
import { MultiAgentOrchestrator } from '@open-multi-agent/core'
import type { AgentSpec, OrchestratorConfig } from '@open-multi-agent/core'
const architectSpec: AgentSpec = {
id: 'architect',
provider: 'anthropic',
model: 'claude-opus-4-7',
systemPrompt: 'Analyze requirements, decompose into modular components, and define API contracts. Prioritize correctness and extensibility.',
temperature: 0.2,
}
const builderSpec: AgentSpec = {
id: 'builder',
provider: 'openai',
model: 'gpt-5.4-mini',
systemPrompt: 'Implement the architectural blueprint. Generate production-ready code with error handling and type safety.',
temperature: 0.3,
}
const validatorSpec: AgentSpec = {
id: 'validator',
provider: 'openai',
model: 'llama3.1',
baseURL: 'http://localhost:11434/v1',
apiKey: 'local-placeholder',
systemPrompt: 'Review generated code for syntax errors, security anti-patterns, and compliance with the API contract. Return pass/fail with brief notes.',
temperature: 0.1,
}
Rationale: The architect requires high reasoning capacity and low temperature for deterministic planning. The builder handles implementation, which benefits from a cost-efficient model with strong code generation capabilities. The validator performs checklist-style review, making it ideal for a local model where marginal cloud cost is zero. The baseURL override routes the validator through an OpenAI-compatible inference server without requiring provider-specific adapters.
Step 2: Initialize the Orchestrator with Lazy Adapter Loading
The orchestrator instantiates LLM adapters on demand. This means you only load SDKs for providers you actually use, reducing bundle size and initialization overhead.
const orchestratorConfig: OrchestratorConfig = {
defaultModel: 'claude-opus-4-7',
defaultProvider: 'anthropic',
onProgress: (event) => {
telemetryCollector.processEvent(event)
},
}
const orchestrator = new MultiAgentOrchestrator(orchestratorConfig)
const pipeline = orchestrator.createTeam('code-generation-pipeline', {
name: 'code-generation-pipeline',
agents: [architectSpec, builderSpec, validatorSpec],
sharedMemory: true,
executionMode: 'dependency-aware',
})
Rationale: sharedMemory: true allows agents to reference previous outputs without re-prompting, reducing redundant input tokens. executionMode: 'dependency-aware' ensures the builder waits for the architect, and the validator waits for the builder, preventing race conditions. The onProgress callback hooks into the execution lifecycle, enabling real-time telemetry without blocking the main thread.
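The lazy adapter loading described in Step 2 isn't visible in this config. As a rough sketch of the idea only (the loadAdapter helper and cache below are illustrative assumptions, not the package's actual internals), a provider's SDK is imported the first time an agent using that provider runs:

```ts
// Illustrative only: dynamic import defers each provider SDK until first use.
const adapterCache = new Map<string, unknown>()

async function loadAdapter(provider: 'anthropic' | 'openai') {
  if (!adapterCache.has(provider)) {
    const sdk = provider === 'anthropic'
      ? await import('@anthropic-ai/sdk') // pulled in only if an Anthropic agent runs
      : await import('openai')            // covers OpenAI and OpenAI-compatible endpoints
    adapterCache.set(provider, sdk)
  }
  return adapterCache.get(provider)
}
```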
Step 3: Implement Event-Driven Telemetry
Cost and latency tracking must happen during execution, not after. We attach a ledger that captures start/end timestamps and aggregates token usage per agent.
class ExecutionLedger {
private activeSessions = new Map<string, { startTime: number; model: string }>()
private completedRuns: Array<{ agentId: string; model: string; latencyMs: number; inputTokens: number; outputTokens: number }> = []
processEvent(event: any) {
if (event.type === 'agent_start' && event.agentId) {
const spec = [architectSpec, builderSpec, validatorSpec].find(a => a.id === event.agentId)
this.activeSessions.set(event.agentId, {
startTime: Date.now(),
model: spec?.model ?? 'unknown',
})
}
if (event.type === 'agent_complete' && event.agentId) {
const session = this.activeSessions.get(event.agentId)
if (session) {
this.completedRuns.push({
agentId: event.agentId,
model: session.model,
latencyMs: Date.now() - session.startTime,
inputTokens: event.tokenUsage?.input ?? 0,
outputTokens: event.tokenUsage?.output ?? 0,
})
this.activeSessions.delete(event.agentId)
}
}
}
calculateCosts() {
const pricingMap: Record<string, { inputPerM: number; outputPerM: number }> = {
'claude-opus-4-7': { inputPerM: 5.00, outputPerM: 25.00 },
'gpt-5.4-mini': { inputPerM: 0.75, outputPerM: 4.50 },
'llama3.1': { inputPerM: 0, outputPerM: 0 },
}
return this.completedRuns.map(run => {
const rates = pricingMap[run.model] ?? { inputPerM: 0, outputPerM: 0 }
const cost = (run.inputTokens / 1_000_000) * rates.inputPerM + (run.outputTokens / 1_000_000) * rates.outputPerM
return { ...run, cost }
})
}
}
const telemetryCollector = new ExecutionLedger()
Rationale: The ledger decouples telemetry from business logic. By capturing agent_start and agent_complete events, we measure wall-clock latency accurately. Token usage is pulled from the orchestrator's native tokenUsage payload, ensuring consistency with provider billing. The pricing map is kept separate from the event-handling logic, so you can update rates without touching execution code. This pattern scales to dozens of agents and supports post-run cost attribution for chargeback or budgeting systems.
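For chargeback or budgeting reports, the per-run rows can be rolled up per agent. A small helper along these lines (not part of the library; it only consumes the ledger's output) would do it:

```ts
// Aggregate per-run costs into a per-agent summary for chargeback reports.
function summarizeByAgent(runs: ReturnType<ExecutionLedger['calculateCosts']>) {
  const byAgent = new Map<string, { runs: number; totalCost: number; totalLatencyMs: number }>()
  for (const run of runs) {
    const entry = byAgent.get(run.agentId) ?? { runs: 0, totalCost: 0, totalLatencyMs: 0 }
    entry.runs += 1
    entry.totalCost += run.cost
    entry.totalLatencyMs += run.latencyMs
    byAgent.set(run.agentId, entry)
  }
  return byAgent
}
```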
Step 4: Execute and Inspect Results
const runResult = await orchestrator.runTeam(pipeline, 'Implement a rate-limited HTTP client with exponential backoff and circuit breaker logic.')
const costReport = telemetryCollector.calculateCosts()
console.table(costReport)
console.log(`Total execution cost: $${costReport.reduce((sum, r) => sum + r.cost, 0).toFixed(4)}`)
The orchestrator handles provider switching, token accounting, and dependency resolution internally. Your application code remains provider-agnostic while gaining granular economic visibility.
Pitfall Guide
1. The Output Token Trap
Explanation: Developers optimize for input token reduction but ignore that output pricing is 4–8x higher. Agents that generate large responses (code, reports, schemas) will dominate costs regardless of input optimization.
Fix: Assign output-heavy roles to models with favorable output pricing or implement response length constraints. Use streaming with early termination when the output exceeds expected bounds.
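One possible sketch, calling the official openai SDK directly rather than through the orchestrator and reusing the builder model name from the specs above: cap output with max_tokens, stream the response, and stop reading if it blows past a budget anyway (in recent SDK versions, exiting the loop ends the underlying request).

```ts
import OpenAI from 'openai'

const client = new OpenAI()

// Cap output tokens up front; stream so a runaway generation can be cut off early.
const stream = await client.chat.completions.create({
  model: 'gpt-5.4-mini', // builder-tier model from the specs above
  max_tokens: 1500,
  stream: true,
  messages: [{ role: 'user', content: 'Generate the HTTP client module.' }],
})

let collected = ''
for await (const chunk of stream) {
  collected += chunk.choices[0]?.delta?.content ?? ''
  if (collected.length > 20_000) break // hard character ceiling as a safety net
}
```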
2. Local Endpoint Misconfiguration
Explanation: OpenAI-compatible local servers (Ollama, vLLM, LM Studio) often ignore API keys, but the official OpenAI SDK validates that the key is non-empty. Passing undefined or null throws a client-side error before the request reaches the local server.
Fix: Always provide a placeholder string like 'local-placeholder' or 'ollama' in the apiKey field when routing through baseURL.
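For example, pointing the official openai SDK at an Ollama server on its default port looks roughly like this:

```ts
import OpenAI from 'openai'

// Ollama ignores the key, but the SDK rejects an empty one client-side,
// so any non-empty placeholder string works.
const localClient = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
})

const review = await localClient.chat.completions.create({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'Return PASS or FAIL for the attached diff.' }],
})
```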
3. Synchronous Blocking on Local Models
Explanation: Local models run on consumer hardware and typically exhibit 2–5x higher latency than hosted APIs. If your pipeline waits synchronously for a local validator, you introduce unpredictable delays that break SLAs.
Fix: Run local validation asynchronously or implement a timeout fallback. If the local model exceeds a threshold (e.g., 15s), route to a budget cloud model instead.
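A minimal sketch of that fallback, assuming hypothetical validateLocally and validateWithCloud helpers standing in for your actual validator calls:

```ts
type Validator = (code: string) => Promise<string>

// Hypothetical stand-ins for the local and budget-cloud validation calls.
declare const validateLocally: Validator
declare const validateWithCloud: Validator

async function validateWithTimeout(code: string, timeoutMs = 15_000): Promise<string> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('local validator timed out')), timeoutMs),
  )
  try {
    // Race the local model against the timeout threshold.
    return await Promise.race([validateLocally(code), timeout])
  } catch {
    // Too slow or unavailable: degrade to a budget cloud model.
    return validateWithCloud(code)
  }
}
```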
4. Prompt Contamination Across Tiers
Explanation: System prompts tuned for frontier models often rely on implicit reasoning capabilities that budget or local models lack. Copy-pasting the same prompt across agents causes silent degradation in cheaper tiers.
Fix: Write tier-aware prompts. Frontier models can handle open-ended reasoning. Budget models require explicit constraints, step-by-step instructions, and output format enforcement. Local models need highly structured, deterministic tasks.
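As a hypothetical illustration, the same review task phrased for a frontier tier versus a local tier:

```ts
// Frontier tier: open-ended reasoning is fine.
const frontierReviewPrompt =
  'Review this module for correctness, security, and design quality. Explain any trade-offs you see.'

// Local tier: explicit checklist with a constrained output format.
const localReviewPrompt = [
  'Check the code against this list. Answer each item with PASS or FAIL and one sentence.',
  '1. Does it compile without syntax errors?',
  '2. Are all inputs validated?',
  '3. Does every exported function match the API contract?',
].join('\n')
```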
5. Over-Engineering with Dynamic Routing
Explanation: Teams attempt to build runtime model selectors that switch models per turn based on cost or latency signals. This adds significant complexity, introduces new failure modes, and often yields marginal savings compared to static per-agent assignment.
Fix: Start with static per-agent routing. It captures 80% of the economic benefit with 5% of the complexity. Only introduce dynamic routing when you have proven telemetry, clear cost thresholds, and automated fallback chains.
6. Ignoring Provider Rate Limits
Explanation: Mixing providers means managing multiple quota pools. A sudden spike in builder requests can exhaust OpenAI rate limits while Anthropic remains underutilized, causing silent failures or retries that inflate costs.
Fix: Implement per-provider rate limiting and circuit breakers. Distribute load evenly and monitor 429 responses. Use exponential backoff with jitter to prevent thundering herd effects.
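A minimal sketch of retry with exponential backoff and full jitter, wrapping whatever provider call your pipeline makes (the 429 check assumes the error object exposes a status field, as the major SDKs do):

```ts
// Retry a provider call on 429s with exponential backoff and full jitter.
async function withBackoff<T>(call: () => Promise<T>, maxRetries = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call()
    } catch (err: any) {
      const rateLimited = err?.status === 429
      if (!rateLimited || attempt >= maxRetries) throw err
      const ceilingMs = 500 * 2 ** attempt      // exponential ceiling: 0.5s, 1s, 2s, ...
      const delayMs = Math.random() * ceilingMs // full jitter avoids thundering herds
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
}
```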
7. Missing Fallback Chains
Explanation: When a specific model tier fails (e.g., local server crashes, cloud API outage), the entire pipeline halts. Teams assume mixed-model setups are resilient because they use multiple providers, but they rarely implement graceful degradation.
Fix: Define fallback hierarchies. If the local validator fails, route to a budget cloud model. If the budget model fails, route to a frontier model with a cost warning. Log all fallbacks for post-mortem analysis.
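A sketch of such a chain, with hypothetical tier entries you would wire to your own local, budget, and frontier calls:

```ts
type Tier = { label: string; run: (task: string) => Promise<string> }

// Try each tier in order; every fallback is logged for post-mortem analysis.
async function runWithFallback(task: string, tiers: Tier[]): Promise<string> {
  for (const tier of tiers) {
    try {
      return await tier.run(task)
    } catch (err) {
      console.warn(`[fallback] ${tier.label} failed, escalating to next tier`, err)
    }
  }
  throw new Error('all tiers in the fallback chain failed')
}
```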
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-frequency validation (1000+ runs/day) | Local model via Ollama/vLLM | Zero marginal cloud cost; validation is checklist-driven | Reduces monthly spend by 60–80% |
| Strategic planning & architecture | Frontier model (Opus/GPT-5.5) | Requires complex reasoning and novel decomposition | Higher per-run cost, but low frequency minimizes impact |
| Code generation & implementation | Budget cloud model (GPT-mini/Gemini Flash) | Strong code capability at 5–10x lower output pricing | Balances quality and cost for high-output tasks |
| Strict SLA requirements (<2s latency) | All-hosted tiered routing | Local models introduce unpredictable latency | Slightly higher cost, but guarantees response time |
| Budget-constrained prototype | Local + single budget cloud model | Minimizes cloud spend while maintaining functionality | Near-zero marginal cost; acceptable for internal tools |
Configuration Template
import { MultiAgentOrchestrator } from '@open-multi-agent/core'
import type { AgentSpec, OrchestratorConfig } from '@open-multi-agent/core'
// Agent specifications with explicit tier assignment
const agents: AgentSpec[] = [
{
id: 'planner',
provider: 'anthropic',
model: 'claude-opus-4-7',
systemPrompt: 'Decompose complex requirements into modular tasks. Define interfaces and constraints.',
temperature: 0.2,
},
{
id: 'executor',
provider: 'openai',
model: 'gpt-5.4-mini',
systemPrompt: 'Implement tasks according to planner specifications. Output production-ready code.',
temperature: 0.3,
},
{
id: 'reviewer',
provider: 'openai',
model: 'llama3.1',
baseURL: 'http://localhost:11434/v1',
apiKey: 'local-placeholder',
systemPrompt: 'Validate output against planner constraints. Return pass/fail with concise notes.',
temperature: 0.1,
},
]
// Orchestrator with telemetry hook
const config: OrchestratorConfig = {
defaultModel: 'claude-opus-4-7',
defaultProvider: 'anthropic',
onProgress: (event) => {
// Route to your telemetry service
console.debug(`[${event.type}] Agent: ${event.agentId}`)
},
}
const orchestrator = new MultiAgentOrchestrator(config)
const team = orchestrator.createTeam('tiered-pipeline', {
name: 'tiered-pipeline',
agents,
sharedMemory: true,
executionMode: 'dependency-aware',
})
export { orchestrator, team }
Quick Start Guide
- Install dependencies: `npm install @open-multi-agent/core`
- Pull and serve a local model: Run `ollama pull llama3.1` followed by `ollama serve` in a separate terminal
- Define your agent specs: Assign provider, model, and system prompts per role using the configuration template
- Initialize the orchestrator: Pass the specs to `createTeam()` and attach an `onProgress` callback for telemetry
- Execute and monitor: Call `runTeam()` with your task prompt, then inspect the telemetry ledger for cost and latency breakdowns
This pattern transforms multi-agent systems from cost centers into economically sustainable automation pipelines. By aligning model capability with task complexity, you eliminate wasted token spend while preserving output quality. The architecture scales horizontally, supports continuous cost optimization, and provides the telemetry foundation required for production-grade AI engineering.