Testing AI Agents with Vitest 4 — Mocking LLM Calls and Streaming Responses in Practice
Deterministic AI Agent Testing: Mocking LLM SDKs and Streaming Events with Vitest 4
Current Situation Analysis
Building AI agents introduces a fundamental conflict with traditional testing methodologies. Unit tests rely on deterministic contracts: identical inputs must yield identical outputs. Large language models inherently violate this principle. They introduce external state dependencies, non-deterministic generation, and complex asynchronous streaming protocols. When developers attempt to test agent orchestration code, they quickly hit three architectural walls. First, every test execution requires a live API key, which creates credential leakage risks in version control or CI pipelines. Second, real inference calls introduce network latency, turning a fast test suite into a multi-minute bottleneck that breaks developer feedback loops. Third, streaming responses use AsyncIterable objects that cannot be stubbed with standard promise mocks, forcing developers to either skip stream testing or write fragile integration harnesses.
This gap is frequently overlooked because agent development prioritizes prompt engineering and workflow design over test infrastructure. Many teams defer testing until deployment, relying on manual validation or production monitoring. The cost compounds quickly. A single CI pipeline running 50 agent tests against a live endpoint can consume thousands of tokens per run, while feedback loops stretch beyond acceptable thresholds. The industry standard has shifted toward mocking, but the Anthropic SDK’s class-based architecture and streaming event model require precise interception strategies. Without them, tests either fail at runtime due to constructor mismatches or silently pass while masking broken orchestration logic. The result is a testing vacuum where agent code runs in production with unverified prompt formatting, untracked token usage, and unhandled stream edge cases.
WOW Moment: Key Findings
The most significant leverage point in AI agent testing is decoupling orchestration logic from inference execution. By intercepting the SDK at the constructor level and replacing network calls with deterministic stubs, you can validate agent behavior without touching production endpoints. The performance and cost differential is stark.
| Testing Approach | Execution Time | Cost per Run | Determinism | CI/CD Viability |
|---|---|---|---|---|
| Live API Calls | 200–800ms/test | $0.001–$0.005/test | Low (varies by model) | Poor (rate limits, latency) |
| Mocked Unit Suite | <150ms/suite | $0.00 | High (fixed payloads) | Excellent (fast feedback) |
| Integration Tests | 1–3s/test | $0.002–$0.008/test | Medium (controlled prompts) | Good (nightly/scheduled) |
This finding matters because it shifts testing from a cost center to a development accelerator. Mocked suites run in under 150 milliseconds, enabling test-driven workflows for agent logic. You can verify prompt formatting, token counting, stream filtering, and fallback routing without waiting for network round-trips. The trade-off is intentional: mocked tests validate your code’s handling of the SDK, not the model’s actual intelligence. That boundary is critical: unit tests cover orchestration, while integration tests cover inference quality. Separating them prevents CI pipelines from becoming expensive, slow, and flaky.
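To make the table concrete, its per-test figures for live API calls can be scaled to the 50-test pipeline mentioned earlier. The sketch below is plain arithmetic on the table's ranges; the function name `estimateLiveSuite` and the assumption that tests run one after another are illustrative, not measurements from a real pipeline.

```typescript
// Scale the table's per-test ranges for live API calls to a whole suite.
// Assumption: tests run sequentially, so per-test latencies simply add up.
interface Range {
  min: number;
  max: number;
}

function estimateLiveSuite(testCount: number): { seconds: Range; usd: Range } {
  const latencyMs: Range = { min: 200, max: 800 }; // per test, from the table
  const costMilliUsd: Range = { min: 1, max: 5 }; // $0.001 to $0.005 per test

  return {
    seconds: {
      min: (testCount * latencyMs.min) / 1000,
      max: (testCount * latencyMs.max) / 1000,
    },
    // Multiply in thousandths of a dollar first to limit floating-point drift.
    usd: {
      min: (testCount * costMilliUsd.min) / 1000,
      max: (testCount * costMilliUsd.max) / 1000,
    },
  };
}

const liveEstimate = estimateLiveSuite(50);
console.log(liveEstimate);
```

For 50 tests this works out to 10 to 40 seconds and $0.05 to $0.25 per pipeline run, before retries or rate-limit backoff are taken into account.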
Core Solution
Implementing a deterministic test harness for Anthropic-based agents requires three architectural decisions: constructor interception, generator-based stream simulation, and output normalization. The following implementation uses Vitest 4.1.7 and @anthropic-ai/sdk 0.100.1. All examples assume ESM module resolution.
Step 1: Project Configuration
Ensure your environment supports top-level vi.mock() hoisting. Node.js requires explicit ESM configuration to resolve module boundaries correctly during test compilation.
// package.json
{
  "type": "module",
  "scripts": {
    "test": "vitest run"
  },
  "devDependencies": {
    "vitest": "^4.1.7",
    "@anthropic-ai/sdk": "^0.100.1"
  }
}
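Setting "type": "module" makes Node.js treat your .js and config files as ESM, which keeps module resolution consistent with the import syntax used throughout these examples. A ReferenceError: vi is not defined usually has a different cause: vi was neither imported from vitest nor exposed through globals: true in the Vitest config.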
Step 2: Constructor Interception
The Anthropic SDK exports a default class. Vitest’s vi.mock() factory must return a constructor-compatible function. Arrow functions lack a prototype property and will throw TypeError: is not a constructor when invoked with new.
// src/__mocks__/anthropic.ts
import { vi } from 'vitest';

// Shared mock functions, exported so tests can configure and inspect them.
export const mockCreate = vi.fn();
export const mockStream = vi.fn();

vi.mock('@anthropic-ai/sdk', () => {
  // A regular function (not an arrow) so the mock can be invoked with `new`.
  const MockClient = vi.fn().mockImplementation(function() {
    this.messages = {
      create: mockCreate,
      stream: mockStream
    };
  });
  return { default: MockClient };
});
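Declaring mockCreate and mockStream outside the factory and exporting them lets every test configure and inspect the same mock functions instead of recreating them per test case. This works because the factory runs lazily, when @anthropic-ai/sdk is first imported, by which point the helper module's top-level declarations have been initialized. If you move vi.mock() into a test file that also imports the agent, the hoisted factory can run before these declarations exist; in that case create the mocks with vi.hoisted().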
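The constructor requirement can be reproduced without Vitest or the SDK. The following is a minimal sketch in plain TypeScript; `ArrowClient`, `FunctionClient`, and the `ClientCtor` type are hypothetical names used only for this illustration, and the casts exist solely so the type checker allows the `new` calls.

```typescript
// Why the mock implementation must be a regular function: only functions
// created with the `function` keyword (or classes) can be invoked with `new`.
type ClientCtor = new () => { messages?: { create: () => string } };

// Arrow function: it has no [[Construct]] behaviour, so `new` throws.
const ArrowClient = (() => ({})) as unknown as ClientCtor;

// Regular function: `new` creates an object and binds it to `this`.
const FunctionClient = function (this: { messages: { create: () => string } }) {
  this.messages = { create: () => 'stubbed' };
} as unknown as ClientCtor;

let arrowError = '';
try {
  new ArrowClient();
} catch (err) {
  arrowError = err instanceof TypeError ? 'TypeError' : 'other';
}

const client = new FunctionClient();
const stubbedResult = client.messages?.create();

console.log(arrowError);    // 'TypeError'
console.log(stubbedResult); // 'stubbed'
```

The same distinction applies inside `mockImplementation`: the function passed there is what runs when the agent calls `new` on the mocked default export.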
Step 3: Non-Streaming Call Validation
Non-streaming API calls return promises. Use mockResolvedValue to simulate successful responses. Validate both the output payload and the request arguments to ensure prompt construction remains consistent.
// src/agent.test.ts
import { describe, it, expect, beforeEach, vi } from 'vitest';
import { mockCreate } from './__mocks__/anthropic';
import { buildAgent } from './agent';

// Clear recorded calls so assertions on mock.calls start fresh in every test.
beforeEach(() => vi.clearAllMocks());

describe('Agent orchestration', () => {
  it('formats requests and extracts token metrics', async () => {
    mockCreate.mockResolvedValue({
      content: [{ type: 'text', text: 'Acknowledged.' }],
      usage: { input_tokens: 22, output_tokens: 5 }
    });

    const runner = buildAgent('test-credential');
    const output = await runner.execute('Verify system status');

    expect(output.text).toBe('Acknowledged.');
    expect(output.metrics.inputTokens).toBe(22);
    expect(output.metrics.outputTokens).toBe(5);

    // First argument of the first call: the request object the agent built.
    const requestPayload = mockCreate.mock.calls[0][0];
    expect(requestPayload.model).toBe('claude-haiku-4-5-20251001');
    expect(requestPayload.messages[0].content).toBe('Verify system status');
  });
});
This pattern verifies that your agent correctly maps user input to the SDK’s message array, selects the intended model, and parses token usage for cost tracking. mock.calls[0][0] is the first argument of the first create call, that is, the request object your agent built, so the same approach supports assertions on system prompts, temperature settings, and tool definitions.
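The article does not show the agent module these tests exercise. As a rough sketch of what an `execute` method satisfying them might look like, the code below uses a hand-written `MessagesClient` interface and an injected fake client instead of the real SDK; `createExecutor`, the interface names, and the `max_tokens` value are assumptions made for illustration only.

```typescript
// Hypothetical shapes modelling only the fields the tests above read;
// these are not the SDK's full types.
interface MessageResponse {
  content: Array<{ type: string; text?: string }>;
  usage: { input_tokens: number; output_tokens: number };
}
interface CreateParams {
  model: string;
  max_tokens: number;
  messages: Array<{ role: 'user' | 'assistant'; content: string }>;
}
interface MessagesClient {
  messages: { create(params: CreateParams): Promise<MessageResponse> };
}

function createExecutor(client: MessagesClient, model: string) {
  return async function execute(prompt: string) {
    const response = await client.messages.create({
      model,
      max_tokens: 1024, // assumed limit, not taken from the article
      messages: [{ role: 'user', content: prompt }],
    });
    // Concatenate text blocks; any other block types are skipped.
    const text = response.content
      .filter((block) => block.type === 'text')
      .map((block) => block.text ?? '')
      .join('');
    return {
      text,
      metrics: {
        inputTokens: response.usage.input_tokens,
        outputTokens: response.usage.output_tokens,
      },
    };
  };
}

// A hand-rolled fake that records calls, standing in for vi.fn().
const recordedCalls: CreateParams[] = [];
const fakeClient: MessagesClient = {
  messages: {
    async create(params) {
      recordedCalls.push(params);
      return {
        content: [{ type: 'text', text: 'Acknowledged.' }],
        usage: { input_tokens: 22, output_tokens: 5 },
      };
    },
  },
};

const execute = createExecutor(fakeClient, 'claude-haiku-4-5-20251001');
const executeResult = execute('Verify system status');
```

Keeping the parsing in one small function like this is what makes the assertions on `output.text` and `output.metrics` meaningful: the test pins down the mapping from the SDK response shape to the agent's own result type.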
Step 4: Streaming Event Simulation
Streaming endpoints return AsyncIterable objects. Standard promise mocks fail because for await...of expects an async iterable, not a promise that resolves to one. Use an async function* generator to yield structured events shaped like the SDK’s parsed Server-Sent Events.
it('processes text deltas and ignores metadata events', async () => {
  async function* generateStream() {
    yield { type: 'message_start', message: { usage: { input_tokens: 14 } } };
    yield { type: 'content_block_start', content_block: { type: 'text', text: '' } };
    yield { type: 'content_block_delta', delta: { type: 'text_delta', text: 'Processing' } };
    yield { type: 'content_block_delta', delta: { type: 'text_delta', text: ' complete.' } };
    yield { type: 'message_delta', delta: { stop_reason: 'end_turn' } };
    yield { type: 'message_stop' };
  }
  mockStream.mockReturnValue(generateStream());

  const runner = buildAgent('test-credential');
  const collected: string[] = [];
  for await (const event of runner.stream('Start pipeline')) {
    collected.push(event);
  }

  expect(collected).toEqual(['Processing', ' complete.']);
  expect(collected.join('')).toBe('Processing complete.');
});
Note the use of mockReturnValue instead of mockResolvedValue. The generator instance must be returned synchronously so the iterator protocol can consume it asynchronously. This pattern accurately reproduces the SDK’s event lifecycle, allowing you to test filtering logic, chunk accumulation, and termination conditions without network overhead.
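The difference between returning the generator and returning a promise of it can be observed in isolation. The sketch below is a dependency-free illustration: `textDeltas` is a hypothetical filter of the kind `runner.stream` might apply, `StreamEvent` models only the fields it inspects, and the cast on the wrapped case imitates a mock bypassing the type checker.

```typescript
// Minimal event shape: only the fields this filter inspects.
interface StreamEvent {
  type: string;
  delta?: { type?: string; text?: string; stop_reason?: string };
}

// Yield only the text carried by text_delta events; skip lifecycle events.
async function* textDeltas(events: AsyncIterable<StreamEvent>): AsyncGenerator<string> {
  for await (const event of events) {
    if (event.type === 'content_block_delta' && event.delta?.type === 'text_delta') {
      yield event.delta.text ?? '';
    }
  }
}

async function* fakeEvents(): AsyncGenerator<StreamEvent> {
  yield { type: 'message_start' };
  yield { type: 'content_block_delta', delta: { type: 'text_delta', text: 'Processing' } };
  yield { type: 'content_block_delta', delta: { type: 'text_delta', text: ' complete.' } };
  yield { type: 'message_delta', delta: { stop_reason: 'end_turn' } };
  yield { type: 'message_stop' };
}

// Drain a stream into one string, or report the error type if iteration fails.
async function drain(source: AsyncIterable<StreamEvent>): Promise<string> {
  try {
    let text = '';
    for await (const chunk of textDeltas(source)) text += chunk;
    return text;
  } catch (err) {
    return err instanceof TypeError ? 'TypeError' : 'other error';
  }
}

// What mockReturnValue(fakeEvents()) delivers: the generator itself.
const direct = drain(fakeEvents());

// What mockResolvedValue(fakeEvents()) delivers: a promise wrapping it.
const wrapped = drain(Promise.resolve(fakeEvents()) as unknown as AsyncIterable<StreamEvent>);

// A generator instance is single-use: a second pass over it yields nothing.
const shared = fakeEvents();
const reused = drain(shared).then(() => drain(shared));
```

Here `direct` resolves to the joined text, `wrapped` resolves to `'TypeError'`, and `reused` resolves to an empty string. The last case is worth keeping in mind when a test calls the streaming method more than once: `mockReturnValue` hands out the same generator instance each time, so `mockImplementation(() => generateStream())` is the safer choice there.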
Step 5: Classifier Logic with Fallbacks
LLM-based routing requires defensive parsing. Models may return lowercase strings, extra whitespace, or unexpected phrases. Normalize output and enforce a whitelist to prevent runtime crashes.
it('normalizes classifier output and handles unknown responses', async () => {
  mockCreate.mockResolvedValue({
    content: [{ type: 'text', text: ' command ' }],
    usage: { input_tokens: 18, output_tokens: 1 }
  });
  const runner = buildAgent('test-credential');
  const intent = await runner.classify('Execute backup routine');
  expect(intent).toBe('COMMAND');
});

it('falls back to default category on unrecognized output', async () => {
  mockCreate.mockResolvedValue({
    content: [{ type: 'text', text: 'I am not sure how to categorize this.' }],
    usage: { input_tokens: 20, output_tokens: 8 }
  });
  const runner = buildAgent('test-credential');
  const intent = await runner.classify('Random input');
  expect(intent).toBe('UNKNOWN');
});
This pattern isolates orchestration logic from model behavior. The tests verify that your code correctly trims, uppercases, and validates responses, ensuring the agent remains stable even when the model deviates from instructions. Production agents must treat LLM output as untrusted data until normalized and whitelisted.
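The normalization these tests expect (trim, uppercase, whitelist) fits in a few lines. This is a sketch under stated assumptions: `normalizeIntent` and `KNOWN_INTENTS` are hypothetical names, and `QUESTION` is an invented second category, since the tests above only show `COMMAND` and the `UNKNOWN` fallback.

```typescript
// Hypothetical whitelist; only COMMAND and UNKNOWN appear in the tests above.
const KNOWN_INTENTS = ['COMMAND', 'QUESTION'] as const;
type Intent = (typeof KNOWN_INTENTS)[number] | 'UNKNOWN';

// Treat the model's text as untrusted: normalize it, then accept it only if
// it matches a whitelisted value exactly.
function normalizeIntent(raw: string): Intent {
  const candidate = raw.trim().toUpperCase();
  const match = KNOWN_INTENTS.find((intent) => intent === candidate);
  return match ?? 'UNKNOWN';
}

const padded = normalizeIntent(' command ');
const rambling = normalizeIntent('I am not sure how to categorize this.');

console.log(padded);   // 'COMMAND'
console.log(rambling); // 'UNKNOWN'
```

An exact match after normalization is deliberately strict: a substring check would classify a sentence such as "this is not a command" as `COMMAND`.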
Pitfall Guide
- Arrow Function Constructor Trap
  Explanation: Using an arrow function as the mock implementation gives the mock no `prototype` and no `[[Construct]]` internal method. When the agent instantiates the client with `new`, JavaScript throws a `TypeError` because arrow functions cannot be constructed.
  Fix: Always use the `function` keyword for the implementation inside the mock factory. It preserves constructor compatibility and allows `this` binding within the mock implementation.
- Promise Mocking for Async Iterables
  Explanation: Using `mockResolvedValue` on a streaming method returns a `Promise<AsyncIterable>`. The `for await...of` loop then attempts to iterate over the promise object itself, which fails with a `TypeError` because a promise is not async iterable.
  Fix: Use `mockReturnValue` to return the generator instance directly. The iterator protocol handles asynchronous consumption internally; the mock should not wrap it in a promise.
- Mock State Leakage Across Tests
  Explanation: Mock functions declared at module scope persist across test cases. Unless they are explicitly cleared or reset, subsequent tests may inherit stale return values, call counts, or spy states.
  Fix: Add `beforeEach(() => vi.clearAllMocks())` to reset call history. Note that `vi.clearAllMocks()` does not remove return values or implementations set with `mockResolvedValue` or `mockReturnValue`; reset those per mock with `mockCreate.mockReset()` and `mockStream.mockReset()`, or set the return value explicitly in every test. A blanket `vi.resetAllMocks()` would also reset the constructor mock's implementation. `vi.restoreAllMocks()` only restores spies created with `vi.spyOn`; fake timers are restored with `vi.useRealTimers()`.
- Over-Mocking SDK Internals
  Explanation: Mocking deep SDK properties (e.g., `client.messages.stream().controller.abort()`) couples tests to implementation details. SDK updates frequently change internal class structures, breaking tests on minor version bumps.
  Fix: Mock only the public interface your agent consumes (`create`, `stream`). Validate request payloads and response shapes, not internal SDK mechanics or private properties.
- Ignoring Non-Deterministic Output Handling
  Explanation: Testing only the "happy path" where the model returns exact expected strings leaves agents vulnerable to malformed responses in production. LLMs frequently add punctuation, change casing, or wrap output in markdown.
  Fix: Always include test cases for lowercase output, extra whitespace, partial JSON, and fallback triggers. Defensive parsing is a production requirement, not an edge case.
- Missing ESM Module Resolution
  Explanation: Vitest’s mock hoisting relies on static analysis of `vi.mock()` calls and static `import` statements. Without `"type": "module"` in `package.json`, Node.js may treat `.js` and config files as CommonJS, which leads to inconsistent module resolution. Reference errors on `vi` itself usually mean it was neither imported from `vitest` nor exposed as a global.
  Fix: Explicitly set `"type": "module"` and ensure test files use consistent extensions. Configure `vitest.config.ts` with `test: { globals: true }` if relying on global test APIs.
- Confusing Unit Mocks with Integration Validation
  Explanation: Mocked tests prove your code handles the SDK correctly. They do not prove the model understands your prompts or respects temperature settings. Relying solely on mocks creates a false sense of security.
  Fix: Maintain a separate integration test suite that runs against a sandbox API key on a scheduled basis. Use mocks for CI/CD velocity, integration tests for prompt quality assurance and rate limit validation.
Production Bundle
Action Checklist
- Configure `"type": "module"` in `package.json` so Node.js resolves the project as ESM
- Implement constructor mock using `function` keyword inside `vi.mock()` factory
- Declare mock functions (`mockCreate`, `mockStream`) outside the factory for cross-test access
- Add `beforeEach(() => vi.clearAllMocks())` to reset call history between tests
- Build streaming mocks with `async function*` generators and `mockReturnValue`
- Validate request payloads by inspecting `mock.calls[0][0]`
- Implement output normalization (trim, uppercase, whitelist) for classifier logic
- Schedule nightly integration tests against a sandbox API key for prompt validation
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| CI/CD Pipeline | Mocked Unit Suite | Sub-150ms execution, zero token cost, deterministic | $0.00 |
| Prompt Engineering Iteration | Live Sandbox API | Requires actual model feedback to refine instructions | $0.001–$0.005/run |
| Regression Testing | Mocked Suite + Snapshot | Validates orchestration logic and response parsing | $0.00 |
| Production Readiness | Integration Suite | Confirms end-to-end SDK compatibility and rate limit handling | $0.002–$0.008/run |
| Streaming Logic Validation | Generator Mocks | Simulates AsyncIterable without network overhead | $0.00 |
Configuration Template
// vitest.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    globals: true,
    environment: 'node',
    include: ['src/**/*.test.ts'],
    setupFiles: ['./src/test-setup.ts'],
    coverage: {
      provider: 'v8',
      reporter: ['text', 'lcov'],
      include: ['src/**/*.ts'],
      exclude: ['src/**/*.test.ts', 'src/__mocks__/**']
    }
  }
});
// src/test-setup.ts
import { vi, beforeEach } from 'vitest';

beforeEach(() => {
  vi.clearAllMocks();
  vi.useFakeTimers({ shouldAdvanceTime: true });
});
// src/__mocks__/anthropic.ts
import { vi } from 'vitest';

// Shared mock functions, exported so tests can configure and inspect them.
export const mockCreate = vi.fn();
export const mockStream = vi.fn();

vi.mock('@anthropic-ai/sdk', () => {
  // A regular function (not an arrow) so the mock can be invoked with `new`.
  const MockClient = vi.fn().mockImplementation(function() {
    this.messages = {
      create: mockCreate,
      stream: mockStream
    };
  });
  return { default: MockClient };
});
Quick Start Guide
- Initialize your project with `"type": "module"` and install `vitest@4` and `@anthropic-ai/sdk`.
- Create `src/__mocks__/anthropic.ts` with the constructor factory pattern using the `function` keyword.
- Write your first test using `mockResolvedValue` for non-streaming calls and `async function*` for streams.
- Run `npx vitest run` to verify zero API calls, sub-150ms execution, and deterministic output parsing.
- Add `beforeEach` mock resets and output normalization tests to cover edge cases before deploying to CI.
