Deterministic AI Agent Testing: Mocking LLM SDKs and Streaming Events with Vitest 4

Current Situation Analysis

Building AI agents introduces a fundamental conflict with traditional testing methodologies. Unit tests rely on deterministic contracts: identical inputs must yield identical outputs. Large language models inherently violate this principle. They introduce external state dependencies, non-deterministic generation, and complex asynchronous streaming protocols. When developers attempt to test agent orchestration code, they quickly hit three architectural walls. First, every test execution requires a live API key, which creates credential leakage risks in version control or CI pipelines. Second, real inference calls introduce network latency, turning a fast test suite into a multi-minute bottleneck that breaks developer feedback loops. Third, streaming responses use AsyncIterable objects that cannot be stubbed with standard promise mocks, forcing developers to either skip stream testing or write fragile integration harnesses.

This gap is frequently overlooked because agent development prioritizes prompt engineering and workflow design over test infrastructure. Many teams defer testing until deployment, relying on manual validation or production monitoring. The cost compounds quickly. A single CI pipeline running 50 agent tests against a live endpoint can consume thousands of tokens per run, while feedback loops stretch beyond acceptable thresholds. The industry standard has shifted toward mocking, but the Anthropic SDK’s class-based architecture and streaming event model require precise interception strategies. Without them, tests either fail at runtime due to constructor mismatches or silently pass while masking broken orchestration logic. The result is a testing vacuum where agent code runs in production with unverified prompt formatting, untracked token usage, and unhandled stream edge cases.

WOW Moment: Key Findings

The most significant leverage point in AI agent testing is decoupling orchestration logic from inference execution. By intercepting the SDK at the constructor level and replacing network calls with deterministic stubs, you can validate agent behavior without touching production endpoints. The performance and cost differential is stark.

Testing Approach	Execution Time	Cost per Run	Determinism	CI/CD Viability
Live API Calls	200–800ms/test	$0.001–$0.005/test	Low (varies by model)	Poor (rate limits, latency)
Mocked Unit Suite	<150ms/suite	$0.00	High (fixed payloads)	Excellent (fast feedback)
Integration Tests	1–3s/test	$0.002–$0.008/test	Medium (controlled prompts)	Good (nightly/scheduled)

This finding matters because it shifts testing from a cost center to a development accelerator. Mocked suites run in under 150 milliseconds, enabling test-driven workflows for agent logic. You can verify prompt formatting, token counting, stream filtering, and fallback routing without waiting for network round-trips. The trade-off is intentional: mocked tests validate your code’s handling of the SDK, not the model’s actual intelligence. That boundary is critical. Unit tests cover orchestration; integration tests cover inference quality. Separating them keeps CI pipelines from becoming expensive, slow, and flaky.
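To make the table concrete, the per-run figures for the 50-test pipeline mentioned earlier follow from simple multiplication. The sketch below assumes the tests run sequentially and reuses the table's per-test ranges; the helper and variable names are illustrative only.

```typescript
// Rough per-run estimate for a live-API suite, using the table's per-test ranges.
// Costs are kept in thousandths of a dollar to avoid floating-point drift.
interface Range {
  min: number;
  max: number;
}

function scale(perTest: Range, testCount: number): Range {
  return { min: perTest.min * testCount, max: perTest.max * testCount };
}

const testCount = 50; // pipeline size used earlier in the article
const latencyMs = scale({ min: 200, max: 800 }, testCount); // sequential execution assumed
const costMilliUsd = scale({ min: 1, max: 5 }, testCount); // $0.001 to $0.005 per test

console.log(`latency: ${latencyMs.min / 1000} to ${latencyMs.max / 1000} s per run`);
console.log(`cost: $${costMilliUsd.min / 1000} to $${costMilliUsd.max / 1000} per run`);
```

Under those assumptions a single live run takes roughly 10 to 40 seconds and costs about $0.05 to $0.25, before any retries or rate-limit backoff.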

Core Solution

Implementing a deterministic test harness for Anthropic-based agents requires three architectural decisions: constructor interception, generator-based stream simulation, and output normalization. The following implementation uses Vitest 4.1.7 and @anthropic-ai/sdk 0.100.1. All examples assume ESM module resolution.

Step 1: Project Configuration

Vitest hoists vi.mock() calls to the top of each file during its own transform step, ahead of any imports. The examples below use ESM syntax throughout, so declare the package as ESM to keep Node.js resolution of your source and config files consistent.

// package.json
{
  "type": "module",
  "scripts": {
    "test": "vitest run"
  },
  "devDependencies": {
    "vitest": "^4.1.7",
    "@anthropic-ai/sdk": "^0.100.1"
  }
}

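Vitest’s mock hoisting does not itself depend on this field, but "type": "module" keeps Node.js from treating your own .js files as CommonJS, which matters because vi.mock() only applies to modules loaded with import. A ReferenceError: vi is not defined during test execution usually means vi was neither imported from 'vitest' nor exposed through globals: true.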

Step 2: Constructor Interception

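The Anthropic SDK exports its client class as the default export. The object returned by the vi.mock() factory must therefore expose a constructor-compatible function under default. Arrow functions lack a prototype property and the [[Construct]] internal method, so invoking one with new throws a TypeError stating that the value is not a constructor.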

// src/__mocks__/anthropic.ts
import { vi } from 'vitest';

export const mockCreate = vi.fn();
export const mockStream = vi.fn();

vi.mock('@anthropic-ai/sdk', () => {
  const MockClient = vi.fn().mockImplementation(function() {
    this.messages = {
      create: mockCreate,
      stream: mockStream
    };
  });
  return { default: MockClient };
});

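Declaring mockCreate and mockStream outside the factory lets every test file import the same mock functions and assert on them directly, without recreating them per test case. Because vi.mock() is hoisted above these declarations, the pattern relies on the factory running lazily: it executes the first time '@anthropic-ai/sdk' is imported, so a test file must import this mock module before ./agent. If you move the vi.mock() call into a test file that also imports the agent, create the mock functions with vi.hoisted() instead so they exist when the factory runs.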
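The constructor constraint from this step can be checked in isolation, without Vitest or the SDK. The following is a minimal sketch; ClientShape, arrowImpl, functionImpl, and tryConstruct are hypothetical names used only for illustration.

```typescript
// Why the mock implementation must be a `function`: only those can be called with `new`.
type ClientShape = { messages: { create: () => string } };

// Arrow functions have no [[Construct]] behaviour, so `new` throws a TypeError.
const arrowImpl = () => ({ messages: { create: () => 'arrow' } });

// A regular function can initialise `this`, exactly like the mock client does.
function functionImpl(this: ClientShape) {
  this.messages = { create: () => 'function' };
}

// Attempts `new impl()` and reports which implementation answered, or the error kind.
function tryConstruct(impl: unknown): string {
  try {
    const instance = new (impl as new () => ClientShape)();
    return instance.messages.create();
  } catch (error) {
    return error instanceof TypeError ? 'TypeError' : 'unexpected error';
  }
}

console.log(tryConstruct(arrowImpl));
console.log(tryConstruct(functionImpl));
```

When the code runs on a target that keeps native arrow functions (ES2015 or later), the first call reports a TypeError and the second returns the value from the constructed instance.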

Step 3: Non-Streaming Call Validation

Non-streaming calls to messages.create() return a promise. Use mockResolvedValue to simulate successful responses. Validate both the output payload and the request arguments to ensure prompt construction remains consistent.

// src/agent.test.ts
import { describe, it, expect, beforeEach, vi } from 'vitest';
import { mockCreate } from './__mocks__/anthropic';
import { buildAgent } from './agent';

beforeEach(() => vi.clearAllMocks());

describe('Agent orchestration', () => {
  it('formats requests and extracts token metrics', async () => {
    mockCreate.mockResolvedValue({
      content: [{ type: 'text', text: 'Acknowledged.' }],
      usage: { input_tokens: 22, output_tokens: 5 }
    });

    const runner = buildAgent('test-credential');
    const output = await runner.execute('Verify system status');

    expect(output.text).toBe('Acknowledged.');
    expect(output.metrics.inputTokens).toBe(22);
    expect(output.metrics.outputTokens).toBe(5);

    const requestPayload = mockCreate.mock.calls[0][0];
    expect(requestPayload.model).toBe('claude-haiku-4-5-20251001');
    expect(requestPayload.messages[0].content).toBe('Verify system status');
  });
});

This pattern verifies that your agent correctly maps user input to the SDK’s message array, selects the intended model, and parses token usage for cost tracking. mock.calls[0][0] is the first argument of the first call, that is, the request object your agent passed to messages.create() before any serialization. The same object lets you assert on system prompts, temperature settings, and tool definitions.
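The article does not show src/agent.ts, so here is one hypothetical shape for the request-building and response-parsing logic that a test like the one above exercises. The interfaces are minimal structural types rather than the SDK's own definitions, and buildRequest, parseResponse, and the default max_tokens value are assumptions made for illustration.

```typescript
// Hypothetical helpers behind runner.execute(); not taken from the SDK or the article.
interface ContentBlock {
  type: string;
  text?: string;
}
interface MessageResponse {
  content: ContentBlock[];
  usage: { input_tokens: number; output_tokens: number };
}
interface MessageRequest {
  model: string;
  max_tokens: number;
  messages: { role: 'user'; content: string }[];
}
interface AgentResult {
  text: string;
  metrics: { inputTokens: number; outputTokens: number };
}

const MODEL = 'claude-haiku-4-5-20251001'; // model id asserted in the test above

// Maps raw user input onto the request object handed to messages.create().
function buildRequest(userInput: string, maxTokens = 1024): MessageRequest {
  return {
    model: MODEL,
    max_tokens: maxTokens, // assumed default
    messages: [{ role: 'user', content: userInput }],
  };
}

// Concatenates text blocks and renames usage fields; other block types are skipped.
function parseResponse(response: MessageResponse): AgentResult {
  const text = response.content
    .filter((block) => block.type === 'text')
    .map((block) => block.text ?? '')
    .join('');
  return {
    text,
    metrics: {
      inputTokens: response.usage.input_tokens,
      outputTokens: response.usage.output_tokens,
    },
  };
}

const result = parseResponse({
  content: [{ type: 'text', text: 'Acknowledged.' }],
  usage: { input_tokens: 22, output_tokens: 5 },
});
console.log(result.text, result.metrics);
```

Keeping these two steps as pure functions is a design choice, not a requirement: it lets part of the orchestration be tested without any mock at all, while the mocked client covers the wiring between them.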

Step 4: Streaming Event Simulation

Streaming endpoints return AsyncIterable objects. A promise mock fails because for await...of expects an async iterable, not a promise that resolves to one. Use an async function* generator to yield event objects shaped like those the SDK emits after parsing the Server-Sent Events stream.

it('processes text deltas and ignores metadata events', async () => {
  async function* generateStream() {
    yield { type: 'message_start', message: { usage: { input_tokens: 14 } } };
    yield { type: 'content_block_start', content_block: { type: 'text', text: '' } };
    yield { type: 'content_block_delta', delta: { type: 'text_delta', text: 'Processing' } };
    yield { type: 'content_block_delta', delta: { type: 'text_delta', text: ' complete.' } };
    yield { type: 'message_delta', delta: { stop_reason: 'end_turn' } };
    yield { type: 'message_stop' };
  }

  mockStream.mockReturnValue(generateStream());

  const runner = buildAgent('test-credential');
  const collected: string[] = [];

  // runner.stream() yields plain text chunks, not raw SDK events.
  for await (const chunk of runner.stream('Start pipeline')) {
    collected.push(chunk);
  }

  expect(collected).toEqual(['Processing', ' complete.']);
  expect(collected.join('')).toBe('Processing complete.');
});

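Note the use of mockReturnValue instead of mockResolvedValue. The generator instance must be returned synchronously so the iterator protocol can consume it asynchronously. A generator instance can only be consumed once, so if the agent calls stream() more than once within a test, use mockImplementation(() => generateStream()) to hand out a fresh generator per call. This pattern reproduces the SDK’s event lifecycle closely enough to test filtering logic, chunk accumulation, and termination conditions without network overhead.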
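For reference, the filtering behavior this test expects from runner.stream() could look like the sketch below. It is a hypothetical implementation, since the article does not show the agent source: textDeltas, fakeStream, and collect are illustrative names, and StreamEvent is a simplified stand-in for the SDK's event types.

```typescript
// Simplified event shape; the real SDK events carry more fields.
interface StreamEvent {
  type: string;
  delta?: { type?: string; text?: string; stop_reason?: string };
}

// Yields only the text carried by text_delta events and ignores metadata events.
async function* textDeltas(events: AsyncIterable<StreamEvent>): AsyncGenerator<string> {
  for await (const event of events) {
    if (event.type === 'content_block_delta' && event.delta?.type === 'text_delta') {
      yield event.delta.text ?? '';
    }
  }
}

// Stand-in for the mocked SDK stream, mirroring the events in the test above.
async function* fakeStream(): AsyncGenerator<StreamEvent> {
  yield { type: 'message_start' };
  yield { type: 'content_block_delta', delta: { type: 'text_delta', text: 'Processing' } };
  yield { type: 'content_block_delta', delta: { type: 'text_delta', text: ' complete.' } };
  yield { type: 'message_delta', delta: { stop_reason: 'end_turn' } };
  yield { type: 'message_stop' };
}

// Drains an async iterable of strings into an array.
async function collect(source: AsyncIterable<string>): Promise<string[]> {
  const chunks: string[] = [];
  for await (const chunk of source) chunks.push(chunk);
  return chunks;
}

collect(textDeltas(fakeStream())).then((chunks) => console.log(chunks));
```

Because the filter accepts any AsyncIterable, the same function works against the generator mock in tests and against the real stream object in production.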

Step 5: Classifier Logic with Fallbacks

LLM-based routing requires defensive parsing. Models may return lowercase strings, extra whitespace, or unexpected phrases. Normalize output and enforce a whitelist to prevent runtime crashes.

it('normalizes classifier output and handles unknown responses', async () => {
  mockCreate.mockResolvedValue({
    content: [{ type: 'text', text: '  command  ' }],
    usage: { input_tokens: 18, output_tokens: 1 }
  });

  const runner = buildAgent('test-credential');
  const intent = await runner.classify('Execute backup routine');

  expect(intent).toBe('COMMAND');
});

it('falls back to default category on unrecognized output', async () => {
  mockCreate.mockResolvedValue({
    content: [{ type: 'text', text: 'I am not sure how to categorize this.' }],
    usage: { input_tokens: 20, output_tokens: 8 }
  });

  const runner = buildAgent('test-credential');
  const intent = await runner.classify('Random input');

  expect(intent).toBe('UNKNOWN');
});

This pattern isolates orchestration logic from model behavior. The tests verify that your code correctly trims, uppercases, and validates responses, ensuring the agent remains stable even when the model deviates from instructions. Production agents must treat LLM output as untrusted data until normalized and whitelisted.
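The normalization these two tests rely on can be sketched as a small pure function. The whitelist below is hypothetical: only COMMAND and UNKNOWN appear in the tests above, and the other categories and the name normalizeIntent are assumptions for illustration.

```typescript
// Hypothetical whitelist of intents the router knows how to handle.
const INTENTS = ['COMMAND', 'QUESTION', 'SMALLTALK'] as const;
type Intent = (typeof INTENTS)[number] | 'UNKNOWN';

// Trim, uppercase, then accept only whitelisted values; everything else is UNKNOWN.
function normalizeIntent(raw: string): Intent {
  const candidate = raw.trim().toUpperCase();
  const index = (INTENTS as readonly string[]).indexOf(candidate);
  return index >= 0 ? INTENTS[index] : 'UNKNOWN';
}

console.log(normalizeIntent('  command  '));
console.log(normalizeIntent('I am not sure how to categorize this.'));
```

Returning a value from the whitelist itself, rather than the model's string, means downstream code only ever sees one of a fixed set of values.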

Pitfall Guide

Arrow Function Constructor Trap
Explanation: Using an arrow function as the mock implementation produces a function without a prototype. When your agent instantiates the client with new, JavaScript throws a TypeError because arrow functions lack the [[Construct]] internal method.
Fix: Always use the function keyword for the mock implementation inside the factory. It preserves constructor compatibility and allows this binding within the mock implementation.

Promise Mocking for Async Iterables
Explanation: Using mockResolvedValue on a streaming method returns a Promise<AsyncIterable>. The for await...of loop then tries to iterate over the promise itself, which is not async iterable, and throws a TypeError.
Fix: Use mockReturnValue to return the generator instance directly. The iterator protocol handles asynchronous consumption internally; the mock should not wrap it in a promise.

Mock State Leakage Across Tests
Explanation: Mock functions declared at module scope persist across test cases. Unless they are reset, later tests inherit stale call counts and return values.
Fix: Add beforeEach(() => vi.clearAllMocks()) to clear call history. clearAllMocks does not remove values set with mockResolvedValue or mockReturnValue; call mockReset() on mockCreate and mockStream for that. Avoid a blanket vi.resetAllMocks() with this setup, since it would also reset the implementation attached to the mocked constructor. vi.restoreAllMocks() only restores spies created with vi.spyOn; fake timers are restored separately with vi.useRealTimers().

Over-Mocking SDK Internals
Explanation: Mocking deep SDK properties (e.g., client.messages.stream().controller.abort()) couples tests to implementation details. SDK updates can change internal class structures, breaking tests on minor version bumps.
Fix: Mock only the public interface your agent consumes (create, stream). Validate request payloads and response shapes, not internal SDK mechanics or private properties.

Ignoring Non-Deterministic Output Handling
Explanation: Testing only the "happy path" where the model returns exact expected strings leaves agents vulnerable to malformed responses in production. LLMs frequently add punctuation, change casing, or wrap output in markdown.
Fix: Always include test cases for lowercase output, extra whitespace, partial JSON, and fallback triggers. Defensive parsing is a production requirement, not an edge case.

Missing ESM Module Resolution
Explanation: Vitest hoists vi.mock() through static analysis of each file, and the mock only applies to modules loaded with import, not require. Mixing CommonJS and ESM, or using vi without importing it, leads to mocks that are silently not applied or to reference errors at runtime.
Fix: Explicitly set "type": "module", keep test file extensions consistent, and import vi from 'vitest'. Configure vitest.config.ts with test: { globals: true } if relying on global test APIs.

Confusing Unit Mocks with Integration Validation
Explanation: Mocked tests prove your code handles the SDK correctly. They do not prove the model understands your prompts or respects temperature settings. Relying solely on mocks creates a false sense of security.
Fix: Maintain a separate integration test suite that runs against a sandbox API key on a scheduled basis. Use mocks for CI/CD velocity and integration tests for prompt quality assurance and rate limit validation.
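The "Promise Mocking for Async Iterables" pitfall, together with a related one (a generator instance can only be consumed once), can be reproduced without Vitest. The sketch below uses plain values to stand in for what each mock style would hand back; chunks, drain, and demo are illustrative names.

```typescript
// Stand-in for the generator a streaming mock would return.
async function* chunks(): AsyncGenerator<string> {
  yield 'a';
  yield 'b';
}

// Consumes any candidate source and reports the joined text or the error kind.
async function drain(source: unknown): Promise<string> {
  try {
    let joined = '';
    for await (const chunk of source as AsyncIterable<string>) joined += chunk;
    return joined;
  } catch (error) {
    return error instanceof TypeError ? 'TypeError' : 'unexpected error';
  }
}

async function demo(): Promise<string[]> {
  const wrapped = Promise.resolve(chunks()); // like mockResolvedValue(gen())
  const shared = chunks(); // like mockReturnValue(gen()): same instance on every call
  const factory = () => chunks(); // like mockImplementation(() => gen()): fresh instance per call
  return [
    await drain(wrapped), // a promise is not async iterable
    await drain(shared), // the first consumer receives the data
    await drain(shared), // the second consumer finds the generator exhausted
    await drain(factory()),
    await drain(factory()),
  ];
}

demo().then((results) => console.log(results));
```

The promise-wrapped source fails with a TypeError, the shared instance yields its data only once, and the factory yields the full sequence each time, which is why a per-call factory is the safer choice when an agent opens more than one stream in a test.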

Production Bundle

Action Checklist

Configure "type": "module" in package.json so Node.js resolves the project as ESM
Implement constructor mock using function keyword inside vi.mock() factory
Declare mock functions (mockCreate, mockStream) outside the factory for cross-test access
Add beforeEach(() => vi.clearAllMocks()) to clear call history, and call mockReset() on mockCreate and mockStream when a test needs a clean implementation
Implement streaming mocks with async function* generators and mockReturnValue
Validate request payloads by inspecting mock.calls[0][0]
Implement output normalization (trim, uppercase, whitelist) for classifier logic
Schedule nightly integration tests against a sandbox API key for prompt validation

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
CI/CD Pipeline	Mocked Unit Suite	Sub-150ms execution, zero token cost, deterministic	$0.00
Prompt Engineering Iteration	Live Sandbox API	Requires actual model feedback to refine instructions	$0.001–$0.005/run
Regression Testing	Mocked Suite + Snapshot	Validates orchestration logic and response parsing	$0.00
Production Readiness	Integration Suite	Confirms end-to-end SDK compatibility and rate limit handling	$0.002–$0.008/run
Streaming Logic Validation	Generator Mocks	Simulates `AsyncIterable` without network overhead	$0.00

Configuration Template

// vitest.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    globals: true,
    environment: 'node',
    include: ['src/**/*.test.ts'],
    setupFiles: ['./src/test-setup.ts'],
    coverage: {
      provider: 'v8',
      reporter: ['text', 'lcov'],
      include: ['src/**/*.ts'],
      exclude: ['src/**/*.test.ts', 'src/__mocks__/**']
    }
  }
});

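// src/test-setup.ts
import { vi, beforeEach, afterEach } from 'vitest';

beforeEach(() => {
  vi.clearAllMocks(); // clears call history only
  vi.useFakeTimers({ shouldAdvanceTime: true });
});

// Restore real timers so fake-timer state does not leak between tests.
afterEach(() => {
  vi.useRealTimers();
});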

// src/__mocks__/anthropic.ts
import { vi } from 'vitest';

export const mockCreate = vi.fn();
export const mockStream = vi.fn();

vi.mock('@anthropic-ai/sdk', () => {
  const MockClient = vi.fn().mockImplementation(function() {
    this.messages = {
      create: mockCreate,
      stream: mockStream
    };
  });
  return { default: MockClient };
});

Quick Start Guide

Initialize your project with "type": "module" and install vitest@4 and @anthropic-ai/sdk.
Create src/__mocks__/anthropic.ts with the constructor factory pattern using the function keyword.
Write your first test using mockResolvedValue for non-streaming calls and async function* generators for streams.
Run npx vitest run to verify zero API calls, sub-150ms execution, and deterministic output parsing.
Add beforeEach mock resets and output normalization tests to cover edge cases before deploying to CI.
