Google's Gemini 3.5 Flash is 4x faster than other frontier models. Here is how to call it from TypeScript.

By Codcompass Team·2026-05-27·8 min read

High-Throughput Agentic Loops: Integrating Gemini 3.5 Flash in TypeScript

Current Situation Analysis

Modern AI architectures are shifting from single-turn question answering to autonomous, multi-step agentic workflows. Developers building coding assistants, data extraction pipelines, and interactive chat interfaces are increasingly constrained not by model accuracy, but by cumulative latency. When an agent must reason, call external tools, iterate on results, and stream feedback to a user, wall-clock time becomes the primary bottleneck.

This latency problem is frequently misunderstood. Engineering teams optimize for benchmark scores and per-token pricing while ignoring output throughput. In reality, the speed at which a model emits tokens directly dictates the responsiveness of streaming UIs and the stability of autonomous loops. A slower model increases timeout risks, degrades user experience, and forces developers to implement complex chunking or fallback strategies.

Google's May 19, 2026 release of Gemini 3.5 Flash addresses this gap by prioritizing output generation speed. The model delivers approximately four times the output tokens per second of other frontier models. Independent evaluations suggest this speed does not come at the expense of capability on complex workloads: Terminal-Bench 2.1 (76.2%), MCP Atlas (83.6%), and CharXiv Reasoning (84.2%). These benchmarks score task success in agentic execution, code synthesis, and multimodal reasoning rather than throughput, so they indicate that the model remains competitive on multi-step task execution while generating output faster.

The industry has historically treated throughput as a secondary metric. However, when an agentic loop requires multiple model calls, tool executions, and user-facing streams, reducing the time between each generation step shrinks the entire critical path. This shifts the cost model: a higher per-token price can be offset by fewer total iterations, reduced infrastructure wait times, and faster task completion.

WOW Moment: Key Findings

The critical insight is that throughput fundamentally changes the economics of multi-turn AI workflows. When measuring cost per task rather than cost per call, the premium for higher-speed models often compresses or disappears.

| Approach | Output Throughput | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Ideal Workload |
| --- | --- | --- | --- | --- |
| Gemini 3.5 Flash | ~4x faster than frontier baseline | $1.50 | $9.00 | Latency-sensitive streaming, agentic loops, coding assistants |
| Gemini 2.5 Flash | Standard frontier baseline | $0.30 | $2.50 | High-volume batch processing, cost-optimized reasoning, background ETL |
| Generic Frontier Model | Standard baseline | $2.50–$10.00 | $10.00–$30.00 | General chat, research, non-real-time applications |

This comparison reveals a structural trade-off. Gemini 2.5 Flash remains the most economical choice for tasks where users do not wait for output and where token volume dominates cost. Gemini 3.5 Flash commands a premium, but its accelerated output generation reduces the time agents spend in active loops. In practice, this means fewer retry cycles, tighter timeout margins, and a more responsive streaming experience. For applications where wall-clock time directly impacts user retention or infrastructure scaling, the throughput advantage justifies the per-token delta.
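To make the cost-per-task framing concrete, the sketch below estimates what a whole task costs at the list prices from the table. The `TaskProfile` shape, the `costPerTask` helper, and the sample workload (5 calls of 2,000 input and 6,000 output tokens each) are illustrative assumptions, not measurements.

```typescript
// Per-1M-token list prices, copied from the comparison table.
interface ModelPricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Hypothetical description of one agentic task; the numbers are assumptions.
interface TaskProfile {
  calls: number; // model invocations needed to finish the task
  inputTokensPerCall: number;
  outputTokensPerCall: number;
}

const flash35: ModelPricing = { inputPerMillion: 1.5, outputPerMillion: 9.0 };
const flash25: ModelPricing = { inputPerMillion: 0.3, outputPerMillion: 2.5 };

// Cost in USD of completing one task, summed over all of its model calls.
function costPerTask(pricing: ModelPricing, task: TaskProfile): number {
  const inputCost = (task.inputTokensPerCall / 1_000_000) * pricing.inputPerMillion;
  const outputCost = (task.outputTokensPerCall / 1_000_000) * pricing.outputPerMillion;
  return task.calls * (inputCost + outputCost);
}

const sampleTask: TaskProfile = {
  calls: 5,
  inputTokensPerCall: 2_000,
  outputTokensPerCall: 6_000,
};

const fastCost = costPerTask(flash35, sampleTask);
const cheapCost = costPerTask(flash25, sampleTask);
console.log(fastCost.toFixed(4), cheapCost.toFixed(4));
```

With these assumed numbers the task costs about $0.285 on 3.5 Flash and about $0.078 on 2.5 Flash. At an identical call count the premium remains, so it only narrows if the faster model finishes the task in fewer calls or retries. That is something to measure per workflow rather than assume.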

Core Solution

Integrating Gemini 3.5 Flash into a TypeScript codebase requires a deliberate architecture that prioritizes asynchronous streaming, explicit schema validation, and proper tool-response mapping. The following implementation demonstrates a production-ready pattern using the @google/genai SDK.

Step 1: SDK Initialization and Client Wrapper

The official package requires Node.js 18 or later. Wrap the SDK in a dedicated client class to centralize configuration, timeout handling, and error boundaries.

import { GoogleGenAI, Type, GenerateContentResponse } from "@google/genai";

interface GeminiClientConfig {
  apiKey: string;
  modelId: string;
  timeoutMs?: number;
}

export class GeminiOrchestrator {
  private readonly client: GoogleGenAI;
  private readonly model: string;
  private readonly timeout: number;

  constructor(config: GeminiClientConfig) {
    this.client = new GoogleGenAI({ apiKey: config.apiKey });
    this.model = config.modelId;
    this.timeout = config.timeoutMs ?? 30000;
  }

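  private async withTimeout<T>(promise: Promise<T>): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    // Reject if the request outlives the configured timeout.
    const deadline = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error("Gemini request timed out")), this.timeout);
    });
    try {
      return await Promise.race([promise, deadline]);
    } finally {
      // Clear the timer so a settled request does not leave it pending.
      clearTimeout(timer);
    }
  }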
}

Architecture Rationale: An explicit timeout keeps callers from waiting indefinitely during high-load periods. Note that racing a promise only stops the caller from waiting; it does not cancel the underlying HTTP request. Wrapping the SDK isolates vendor-specific logic, making it easier to swap models or implement fallback routing later.
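The fallback routing mentioned above could look roughly like the following. This is a minimal sketch: `TextGenerator` and `withFallback` are hypothetical names, and the two generators are stand-ins for wrapper instances configured with different model IDs.

```typescript
// Hypothetical abstraction: anything that turns a prompt into text.
type TextGenerator = (prompt: string) => Promise<string>;

// Try each generator in order and return the first successful result.
// If every generator fails, rethrow the last error that was seen.
function withFallback(generators: TextGenerator[]): TextGenerator {
  return async (prompt: string): Promise<string> => {
    let lastError: unknown = new Error("No generators configured");
    for (const generate of generators) {
      try {
        return await generate(prompt);
      } catch (error) {
        lastError = error; // remember the failure and move to the next model
      }
    }
    throw lastError;
  };
}

// Stand-in generators for illustration; real ones would call the SDK wrapper.
const primary: TextGenerator = async () => {
  throw new Error("Gemini request timed out");
};
const secondary: TextGenerator = async (prompt) => `fallback answer for: ${prompt}`;

const generateWithFallback = withFallback([primary, secondary]);
generateWithFallback("summarize the release notes").then((text) => console.log(text));
```

Keeping the routing outside the wrapper class means the wrapper stays a thin adapter, and the order of models becomes a configuration decision.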

Step 2: Streaming Implementation

Blocking calls are unsuitable for interactive interfaces. The SDK exposes an async generator for streaming, which must be consumed without blocking the event loop.

  async streamGeneration(prompt: string): Promise<AsyncIterable<string>> {
    const stream = await this.withTimeout(
      this.client.models.generateContentStream({
        model: this.model,
        contents: prompt,
      })
    );

    return {
      async *[Symbol.asyncIterator]() {
        for await (const chunk of stream) {
          const text = chunk.text;
          if (text) yield text;
        }
      },
    };
  }

Architecture Rationale: Returning an AsyncIterable decouples the generation layer from the transport layer. This allows the same stream to be piped into WebSocket connections, Server-Sent Events, or CLI outputs without modifying the core logic. The chunk.text guard prevents undefined values from leaking into downstream consumers.
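As one example of that decoupling, the sketch below adapts any `AsyncIterable<string>` into Server-Sent Events frames. `toSseFrames`, `fakeModelStream`, and the `done` event name are assumptions made for illustration; the fake stream stands in for the wrapper's streaming method so the snippet runs without network access.

```typescript
// Convert a stream of text chunks into Server-Sent Events frames.
// Each line of a chunk gets its own "data:" field, as the SSE format requires.
async function* toSseFrames(chunks: AsyncIterable<string>): AsyncGenerator<string> {
  for await (const chunk of chunks) {
    const fields = chunk.split("\n").map((line) => `data: ${line}`);
    yield fields.join("\n") + "\n\n";
  }
  // Hypothetical end-of-stream marker; the event name is an assumption.
  yield "event: done\ndata: \n\n";
}

// Stand-in for the wrapper's streaming method, used only for illustration.
async function* fakeModelStream(): AsyncGenerator<string> {
  yield "Hello";
  yield " world\nsecond line";
}

// Collect frames into an array; a real server would write each frame
// to the HTTP response as soon as it is produced.
async function collectFrames(): Promise<string[]> {
  const frames: string[] = [];
  for await (const frame of toSseFrames(fakeModelStream())) {
    frames.push(frame);
  }
  return frames;
}

collectFrames().then((frames) => console.log(frames));
```

Because the adapter only depends on `AsyncIterable<string>`, the same generation code could feed a WebSocket or CLI writer by swapping this one function.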

Step 3: Tool Calling with 3.x ID Echo

Gemini 3.x introduced a strict requirement: every function call response must include the exact id generated by the model. Omitting this field breaks the conversation state.

  async executeToolCall(
    prompt: string,
    toolDefinitions: Array<{ name: string; description: string; schema: any }>
  ): Promise<string> {
    const formattedTools = toolDefinitions.map((t) => ({
      functionDeclarations: [
        {
          name: t.name,
          description: t.description,
          parameters: {
            type: Type.OBJECT,
            properties: t.schema.properties,
            required: t.schema.required ?? [],
          },
        },
      ],
    }));

    const initialResponse = await this.withTimeout(
      this.client.models.generateContent({
        model: this.model,
        contents: prompt,
        config: { tools: formattedTools },
      })
    );

    const calls = initialResponse.functionCalls;
    if (!calls || calls.length === 0) {
      return initialResponse.text ?? "";
    }

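    // Replay the user prompt and the model's function-call turn verbatim.
    const modelTurn = initialResponse.candidates?.[0]?.content;
    if (!modelTurn) {
      throw new Error("Model response contained no content to replay");
    }
    const conversationHistory: any[] = [
      { role: "user", parts: [{ text: prompt }] },
      modelTurn,
    ];

    // Run every requested call, then answer them all in a single user turn.
    const responseParts: any[] = [];
    for (const call of calls) {
      const toolName = call.name ?? "";
      const executionResult = await this.invokeExternalTool(toolName, call.args);
      responseParts.push({
        functionResponse: {
          id: call.id, // echo the model-generated id unchanged
          name: toolName,
          response: { result: executionResult },
        },
      });
    }
    conversationHistory.push({ role: "user", parts: responseParts });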

    const finalResponse = await this.withTimeout(
      this.client.models.generateContent({
        model: this.model,
        contents: conversationHistory,
        config: { tools: formattedTools },
      })
    );

    return finalResponse.text ?? "";
  }

  private async invokeExternalTool(toolName: string, args: any): Promise<any> {
    switch (toolName) {
      case "fetch_repository_metrics":
        return { stars: 12400, openIssues: 42, lastCommit: "2026-05-18" };
      case "validate_schema":
        return { valid: true, errors: [] };
      default:
        throw new Error(`Unknown tool: ${toolName}`);
    }
  }

Architecture Rationale: The tool execution loop processes multiple parallel calls in a single turn, matching Gemini's capability to request several functions simultaneously. The id field is explicitly preserved and echoed back, satisfying the 3.x API contract. Using Type.OBJECT from the SDK enum ensures schema validation aligns with the model's expectations, preventing silent parsing failures.
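The id-echo mapping can also be isolated into a small pure function, which makes it easy to unit test without calling the API. The sketch below is one way to do that; the `ModelFunctionCall` and `FunctionResponsePart` types are local assumptions that mirror the shapes used above rather than SDK imports, and `buildToolResponseTurn` is a hypothetical helper name. Unlike a sequential loop, it runs the handlers concurrently with `Promise.all`, which is a design choice that only suits tools that are independent of each other.

```typescript
// Local structural types mirroring the shapes used in the article's code.
// They are assumptions for illustration, not imports from the SDK.
interface ModelFunctionCall {
  id?: string;
  name?: string;
  args?: Record<string, unknown>;
}

interface FunctionResponsePart {
  functionResponse: { id?: string; name: string; response: { result: unknown } };
}

type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

// Execute every requested call concurrently and build one user turn whose
// parts echo each call's id, in the same order the model requested them.
async function buildToolResponseTurn(
  calls: ModelFunctionCall[],
  handlers: Record<string, ToolHandler>
): Promise<{ role: "user"; parts: FunctionResponsePart[] }> {
  const parts = await Promise.all(
    calls.map(async (call): Promise<FunctionResponsePart> => {
      const name = call.name ?? "";
      const handler = handlers[name];
      if (!handler) throw new Error(`Unknown tool: ${name}`);
      const result = await handler(call.args ?? {});
      return { functionResponse: { id: call.id, name, response: { result } } };
    })
  );
  return { role: "user", parts };
}

// Illustrative calls and handlers; the ids are made up for the example.
const demoCalls: ModelFunctionCall[] = [
  { id: "call-1", name: "validate_schema", args: { payload: {} } },
  { id: "call-2", name: "fetch_repository_metrics", args: { owner: "o", repo: "r" } },
];
const demoHandlers: Record<string, ToolHandler> = {
  validate_schema: async () => ({ valid: true, errors: [] }),
  fetch_repository_metrics: async () => ({ stars: 12400, openIssues: 42 }),
};

buildToolResponseTurn(demoCalls, demoHandlers).then((turn) =>
  console.log(JSON.stringify(turn))
);
```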

Pitfall Guide

1. Omitting the `id` Field in Tool Responses

Explanation: The Gemini 3.x API attaches a unique identifier to every function call request. If your response does not echo this exact id, the model cannot correlate the result with the original request, causing state desynchronization. Fix: Always map call.id directly into the functionResponse.id field before appending to conversation history.

2. Budgeting Based Solely on Input Tokens

Explanation: Output generation typically consumes 2 to 4 times the token volume of the input prompt. Teams that calculate costs using only input pricing consistently underestimate monthly spend. Fix: Model your budget using a 1:3 input-to-output ratio as a baseline. Track actual output volume in production and adjust multipliers per workflow.

3. Blocking the Event Loop During Stream Consumption

Explanation: Using synchronous loops or awaiting entire responses before processing chunks defeats the purpose of streaming and increases perceived latency. Fix: Use for await...of with async generators. Pipe chunks directly to transport layers (SSE, WebSockets) without intermediate buffering unless explicitly required.

4. Passing String Literals for Schema Types

Explanation: The SDK expects the Type enum for parameter definitions. Passing raw strings like "object" or "string" may work in development but causes validation mismatches in production. Fix: Import Type from @google/genai and use Type.OBJECT, Type.STRING, etc., consistently across all tool declarations.

5. Ignoring the Search Grounding Quota

Explanation: The free tier enforces a 5,000 prompt monthly limit for Google Search grounding across all Gemini 3 models. Agentic workflows that repeatedly trigger search queries exhaust this cap quickly. Fix: Monitor grounding usage via the GCP console. Implement caching for repeated queries or switch to a paid tier with explicit per-query billing ($14 per 1,000 queries) before hitting the cap.
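A minimal sketch of the caching idea follows, assuming an in-memory store is acceptable for your deployment. `QueryCache` and `fakeGroundedSearch` are hypothetical names; the fetcher stands in for whatever function issues the grounded request.

```typescript
// Minimal in-memory TTL cache for grounded lookups, keyed by normalized query.
class QueryCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without real delays.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return a fresh cached value if present; otherwise call the fetcher once
  // and store its result until the TTL elapses.
  async getOrFetch(query: string, fetcher: (q: string) => Promise<T>): Promise<T> {
    const key = query.trim().toLowerCase();
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value; // no quota spent
    const value = await fetcher(query);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

// Stand-in for a grounded model call; counts how often it is invoked.
let groundedCalls = 0;
const fakeGroundedSearch = async (q: string): Promise<string> => {
  groundedCalls += 1;
  return `results for ${q}`;
};

const cache = new QueryCache<string>(60_000);

async function demo(): Promise<number> {
  await cache.getOrFetch("Gemini pricing", fakeGroundedSearch);
  await cache.getOrFetch("  gemini PRICING ", fakeGroundedSearch); // cache hit
  return groundedCalls;
}

demo().then((count) => console.log(`grounded calls made: ${count}`));
```

This version does not deduplicate concurrent in-flight requests for the same query, and a multi-instance deployment would need a shared store instead of a per-process `Map`.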

6. Assuming Single-Call Pricing Equals Task Cost

Explanation: Multi-turn agentic tasks involve multiple model invocations, tool executions, and context window management. Per-call pricing does not reflect the cumulative cost of a completed workflow. Fix: Instrument your application to track total tokens consumed per task completion. Compare 3.5 Flash and 2.5 Flash using task-level metrics, not individual API call pricing.
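One way to instrument this is a small accumulator keyed by task ID, sketched below. `TaskUsageTracker` is a hypothetical helper, and the `promptTokenCount` and `candidatesTokenCount` fields are assumed to mirror the usage metadata returned with each response; verify the exact shape against your SDK version.

```typescript
// Field names are assumed to mirror the usage metadata on Gemini responses;
// adapt them if your SDK version reports usage differently.
interface UsageSnapshot {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
}

interface TaskTotals {
  calls: number;
  inputTokens: number;
  outputTokens: number;
}

class TaskUsageTracker {
  private readonly totals = new Map<string, TaskTotals>();

  // Add one model call's usage to the running totals for a task.
  // Missing usage data still counts as a call, with zero tokens.
  record(taskId: string, usage: UsageSnapshot | undefined): void {
    const current = this.totals.get(taskId) ?? { calls: 0, inputTokens: 0, outputTokens: 0 };
    this.totals.set(taskId, {
      calls: current.calls + 1,
      inputTokens: current.inputTokens + (usage?.promptTokenCount ?? 0),
      outputTokens: current.outputTokens + (usage?.candidatesTokenCount ?? 0),
    });
  }

  // Totals for a task, or undefined if nothing was recorded for it.
  summary(taskId: string): TaskTotals | undefined {
    return this.totals.get(taskId);
  }
}

// Illustrative numbers only.
const tracker = new TaskUsageTracker();
tracker.record("task-42", { promptTokenCount: 1_200, candidatesTokenCount: 3_400 });
tracker.record("task-42", { promptTokenCount: 5_000, candidatesTokenCount: 800 });
console.log(tracker.summary("task-42"));
```

Multiplying these totals by each model's per-token prices gives the task-level cost that the comparison between 3.5 Flash and 2.5 Flash should be based on.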

7. Missing Monthly Spend Caps

Explanation: The free tier provides limited access, but paid tiers remove daily caps. Without explicit budget controls, unexpected traffic spikes can generate unbounded charges. Fix: Configure monthly spend limits in the Google Cloud Console. Implement application-level circuit breakers that degrade gracefully when approaching budget thresholds.
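An application-level circuit breaker of this kind might be sketched as follows. `BudgetGuard`, `chooseModel`, the 80% degrade threshold, and the policy of falling back to the cheaper model are all assumptions chosen for illustration. Real spend figures would come from your own token tracking, not from this class.

```typescript
type BudgetState = "ok" | "degraded" | "blocked";

// Application-level guard; thresholds and names are illustrative assumptions.
class BudgetGuard {
  private spentUsd = 0;
  private readonly monthlyLimitUsd: number;
  private readonly degradeAtFraction: number;

  constructor(monthlyLimitUsd: number, degradeAtFraction = 0.8) {
    this.monthlyLimitUsd = monthlyLimitUsd;
    this.degradeAtFraction = degradeAtFraction;
  }

  // Add the estimated cost of a completed call to this month's running total.
  recordSpend(usd: number): void {
    this.spentUsd += usd;
  }

  // "degraded" once the soft threshold is crossed, "blocked" at the hard limit.
  state(): BudgetState {
    if (this.spentUsd >= this.monthlyLimitUsd) return "blocked";
    if (this.spentUsd >= this.monthlyLimitUsd * this.degradeAtFraction) return "degraded";
    return "ok";
  }
}

// Example policy: use the cheaper model when degraded, refuse work when blocked.
function chooseModel(guard: BudgetGuard): string | null {
  switch (guard.state()) {
    case "ok":
      return "gemini-3.5-flash";
    case "degraded":
      return "gemini-2.5-flash";
    case "blocked":
      return null;
  }
}

const guard = new BudgetGuard(100);
guard.recordSpend(50);
console.log(guard.state(), chooseModel(guard));
```

This guard complements the console-level spend cap rather than replacing it: the cap is the hard stop, while the guard lets the application degrade before that stop is reached.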

Production Bundle

Action Checklist

Install @google/genai and verify Node.js 18+ runtime compatibility
Configure explicit API key injection via environment variables, avoiding hardcoded secrets
Implement async streaming with proper backpressure handling for user-facing interfaces
Validate all tool schemas using the Type enum instead of string literals
Echo the id field in every function response to maintain 3.x conversation state
Set monthly spend caps in the Google Cloud Console before enabling paid tiers
Instrument token tracking per task completion to validate cost assumptions
Cache repeated search grounding queries to preserve free tier quota

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Real-time chat or coding assistant | Gemini 3.5 Flash | Low latency improves UX; streaming throughput justifies premium | Higher per-token cost, lower infrastructure wait time |
| Batch document processing | Gemini 2.5 Flash | No user waiting; cost per token dominates economics | Significantly lower monthly spend at scale |
| Multi-step agentic loop | Gemini 3.5 Flash | Faster output reduces loop iterations and timeout risks | Premium offset by fewer total calls per task |
| Prototyping / internal tools | Free tier (either model) | Sufficient for development; rate limits prevent accidental overage | Zero cost until quota exhaustion |

Configuration Template

// config/gemini.ts
import { Type } from "@google/genai";
import { GeminiOrchestrator } from "../core/GeminiOrchestrator";

export const geminiClient = new GeminiOrchestrator({
  apiKey: process.env.GEMINI_API_KEY ?? "",
  modelId: "gemini-3.5-flash",
  timeoutMs: 25000,
});

// Property types use the SDK's Type enum rather than string literals,
// in line with pitfall 4 above.
export const toolRegistry = [
  {
    name: "fetch_repository_metrics",
    description: "Retrieves star count, open issues, and last commit date for a given repository.",
    schema: {
      properties: {
        owner: { type: Type.STRING, description: "GitHub username or organization" },
        repo: { type: Type.STRING, description: "Repository name" },
      },
      required: ["owner", "repo"],
    },
  },
  {
    name: "validate_schema",
    description: "Checks if a JSON payload conforms to the expected structure.",
    schema: {
      properties: {
        payload: { type: Type.OBJECT, description: "Raw JSON object to validate" },
      },
      required: ["payload"],
    },
  },
];
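The template falls back to an empty string when `GEMINI_API_KEY` is unset, which defers the failure to the first API call. A fail-fast alternative is sketched below; `requireEnv` is a hypothetical helper, and it takes the environment as a parameter so the snippet runs on its own (in the template you would pass `process.env`).

```typescript
// Read a required setting from an environment-like object, failing fast with
// a clear message instead of passing an empty key to the client.
function requireEnv(env: Record<string, string | undefined>, name: string): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Illustration with a literal object and a placeholder value, not a real key.
const exampleEnv: Record<string, string | undefined> = {
  GEMINI_API_KEY: "example-key-not-real",
};
const apiKey = requireEnv(exampleEnv, "GEMINI_API_KEY");
console.log(apiKey.length > 0);
```

Failing at startup surfaces a misconfigured deployment immediately instead of as an authentication error in the middle of a user request.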

Quick Start Guide

1. Install dependencies: Run npm install @google/genai and ensure your runtime is Node.js 18 or newer.
2. Set credentials: Export GEMINI_API_KEY in your environment or inject it via your deployment platform's secret manager.
3. Initialize the client: Import the wrapper class, pass your API key and target model ID, and configure a reasonable timeout.
4. Test streaming: Call the streaming method with a sample prompt and pipe the async iterator to your transport layer or console.
5. Validate tool calling: Register a simple function declaration, trigger a prompt that requires it, and verify the id echo in the response chain.
