ng_budget parameter has been replaced by a thinking_level string enum (low, medium, high). The default is now medium, which balances reasoning depth with latency and cost. Use low for speed-sensitive agentic loops, medium for general orchestration, and high only for complex mathematical or multi-step reasoning tasks.
Instead of sequential tool calls, leverage combined tool use to execute function calling, structured output, search grounding, and code execution in a single request. This reduces round-trip latency and token overhead.
Step 4: Handle Function Response Contracts
Every FunctionResponse part must include an id field and a name field that exactly matches the corresponding FunctionCall. Missing or mismatched identifiers break the tool execution loop.
TypeScript Implementation Example
import { GoogleGenAI, Type } from "@google/genai";

interface AgenticConfig {
  modelId: string;
  thinkingTier: "low" | "medium" | "high";
  maxOutputTokens: number;
  enableThoughtPreservation: boolean;
}

class AutonomousWorkflowEngine {
  private client: GoogleGenAI;
  private config: AgenticConfig;
  private conversationHistory: any[] = [];
  private tools: any[] = [];

  constructor(apiKey: string, config: AgenticConfig) {
    this.client = new GoogleGenAI({ apiKey });
    this.config = config;
  }

  async executeStep(userInput: string, tools: any[]) {
    this.tools = tools; // reused when tool results are sent back
    const userTurn = { role: "user", parts: [{ text: userInput }] };
    const requestPayload = {
      model: this.config.modelId,
      contents: [...this.conversationHistory, userTurn],
      // In the @google/genai SDK, tools, the system instruction, and
      // generation settings are all passed under `config`.
      config: {
        tools,
        systemInstruction: {
          parts: [{ text: "Maintain intermediate reasoning across turns. Preserve thought signatures automatically." }]
        },
        thinkingConfig: { thinkingLevel: this.config.thinkingTier },
        maxOutputTokens: this.config.maxOutputTokens,
        // temperature, top_p, top_k are deprecated for this model
      }
    };
    // Loosely typed payload, in keeping with the `any` types used in this example.
    const response = await this.client.models.generateContent(requestPayload as any);
    this.recordTurn(userTurn, response);
    return this.parseResponse(response);
  }

  // Keeps the history in sync after each request. This guide describes a
  // `response.history` field; if it is absent, the sent turn and the model's
  // reply are appended unmodified so thought signatures in the parts survive.
  private recordTurn(sentTurn: any, response: any) {
    if (Array.isArray(response.history)) {
      this.conversationHistory = response.history;
      return;
    }
    this.conversationHistory.push(sentTurn);
    const modelTurn = response.candidates?.[0]?.content;
    if (modelTurn) {
      this.conversationHistory.push(modelTurn);
    }
  }

  private parseResponse(response: any) {
    const candidate = response.candidates?.[0];
    const parts = candidate?.content?.parts || [];
    const toolCalls = parts.filter((p: any) => p.functionCall);
    // Thought summaries arrive as parts flagged `thought: true`,
    // not as a field on the candidate.
    const thoughtText = parts
      .filter((p: any) => p.thought && p.text)
      .map((p: any) => p.text)
      .join("\n");
    const textOutput = parts.find((p: any) => p.text && !p.thought)?.text || "";
    return {
      reasoning: thoughtText || null,
      output: textOutput,
      toolCalls,
      requiresContinuation: toolCalls.length > 0
    };
  }

  async resolveToolCalls(toolResponses: Array<{ id: string; name: string; result: any }>) {
    const functionResponseParts = toolResponses.map(resp => ({
      functionResponse: {
        id: resp.id,
        name: resp.name,
        response: { result: resp.result }
      }
    }));
    // Function responses go back in a user-role turn placed directly after
    // the model's function-call turn, with no other turn in between.
    const toolTurn = { role: "user", parts: functionResponseParts };
    const continuationPayload = {
      model: this.config.modelId,
      contents: [...this.conversationHistory, toolTurn],
      config: {
        tools: this.tools,
        thinkingConfig: { thinkingLevel: this.config.thinkingTier },
        maxOutputTokens: this.config.maxOutputTokens
      }
    };
    const response = await this.client.models.generateContent(continuationPayload as any);
    this.recordTurn(toolTurn, response);
    return this.parseResponse(response);
  }
}

// Usage Example
const workflow = new AutonomousWorkflowEngine(process.env.GEMINI_API_KEY!, {
  modelId: "gemini-3.5-flash",
  thinkingTier: "medium",
  maxOutputTokens: 65000,
  enableThoughtPreservation: true
});

const mcpTools = [
  {
    functionDeclarations: [
      {
        name: "fetch_market_data",
        description: "Retrieve structured financial metrics",
        parameters: { type: Type.OBJECT, properties: { ticker: { type: Type.STRING } } }
      },
      {
        name: "generate_report",
        description: "Compile analysis into structured output",
        parameters: { type: Type.OBJECT, properties: { format: { type: Type.STRING } } }
      }
    ]
  }
];

const step1 = await workflow.executeStep("Analyze TSLA market trends and generate a quarterly summary.", mcpTools);
if (step1.requiresContinuation) {
  // Echo each call's id and name back verbatim instead of hard-coding them.
  // The result object is placeholder data standing in for a real tool call.
  const resolved = await workflow.resolveToolCalls(
    step1.toolCalls.map((part: any) => ({
      id: part.functionCall.id,
      name: part.functionCall.name,
      result: { price: 245.3, volume: "12M" }
    }))
  );
  console.log(resolved.output);
}

Architecture Rationale
- Thought Preservation over Manual State: The model natively carries forward reasoning context when thought signatures exist in history. This eliminates external vector stores or summary prompts for multi-turn agentic loops, reducing token overhead and state drift.
- thinking_level Enum over Numeric Budgets: String enums provide predictable behavior across deployments. medium serves as the optimal default for orchestration, while high is reserved for tasks requiring deep mathematical or logical deduction. This prevents accidental over-provisioning of compute.
- Combined Tool Use: Executing function calling, structured output, and grounding in a single request minimizes network latency and token consumption. Sequential tool chaining introduces unnecessary round-trips that degrade real-time performance.
- Strict FunctionResponse Contracts: The platform enforces exact id and name matching between FunctionCall and FunctionResponse. This prevents silent failures in tool execution loops and ensures deterministic routing.
Pitfall Guide
1. Assuming temperature/top_p/top_k Still Control Output
Explanation: These sampling parameters are deprecated for this model. Requests that still set them either fail validation or have the values silently ignored, so tuning them no longer changes output variance in a predictable way.
Fix: Remove all sampling parameters from your configuration. Control output determinism exclusively through thinking_level and structured output schemas.
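One way to enforce this during migration is a small sanitizer that drops sampling fields before a request is built. The following is a minimal sketch: the helper name, the LooseConfig type, and the list of key spellings (both camelCase and snake_case) are assumptions made for illustration, not part of any SDK.

```typescript
// Hypothetical helper: remove sampling parameters from a generation config.
// The key list covers both spellings that commonly appear in older configs.
const SAMPLING_KEYS: string[] = ["temperature", "topP", "topK", "top_p", "top_k"];

type LooseConfig = { [key: string]: unknown };

interface SanitizedConfig {
  cleaned: LooseConfig; // config with sampling keys removed
  removed: string[];    // keys that were dropped, for logging
}

function stripSamplingParams(config: LooseConfig): SanitizedConfig {
  const cleaned: LooseConfig = {};
  const removed: string[] = [];
  for (const key of Object.keys(config)) {
    if (SAMPLING_KEYS.indexOf(key) !== -1) {
      removed.push(key);
    } else {
      cleaned[key] = config[key];
    }
  }
  return { cleaned, removed };
}

// Example: a legacy config carried over from an earlier model.
const legacyMigration = stripSamplingParams({
  temperature: 0.2,
  topP: 0.9,
  maxOutputTokens: 65000
});
console.log(legacyMigration.cleaned); // { maxOutputTokens: 65000 }
console.log(legacyMigration.removed); // [ 'temperature', 'topP' ]
```

Returning the removed keys alongside the cleaned config makes it easy to log which call sites still carry legacy settings.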
2. Forgetting FunctionResponse ID and Name Matching
Explanation: The platform requires every FunctionResponse to include an id and a name that exactly matches the originating FunctionCall. Mismatched or missing identifiers break the tool execution loop, causing the model to hallucinate or stall.
Fix: Implement a strict mapping layer that captures functionCall.id and functionCall.name during tool execution, then injects them verbatim into the response payload.
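The mapping layer described above could be sketched as follows. The part shapes mirror the functionCall and functionResponse objects used in this guide; the function names, the resultsById lookup, and the violation messages are hypothetical and exist only for this example.

```typescript
// Part shapes as used in this guide (simplified for the sketch).
interface FunctionCallPart {
  functionCall: { id: string; name: string; args?: { [key: string]: unknown } };
}
interface FunctionResponsePart {
  functionResponse: { id: string; name: string; response: { result: unknown } };
}

// Hypothetical mapping layer: build one response per call, copying id and
// name verbatim from the originating call. Results are looked up by call id.
function toFunctionResponses(
  calls: FunctionCallPart[],
  resultsById: { [id: string]: unknown }
): FunctionResponsePart[] {
  return calls.map(({ functionCall }) => {
    if (!Object.prototype.hasOwnProperty.call(resultsById, functionCall.id)) {
      throw new Error(`No result recorded for call ${functionCall.id} (${functionCall.name})`);
    }
    return {
      functionResponse: {
        id: functionCall.id,
        name: functionCall.name,
        response: { result: resultsById[functionCall.id] }
      }
    };
  });
}

// Guard to run before sending: every call needs exactly one response
// carrying the same id and the same name.
function findContractViolations(
  calls: FunctionCallPart[],
  responses: FunctionResponsePart[]
): string[] {
  const problems: string[] = [];
  for (const { functionCall } of calls) {
    const matches = responses.filter(r => r.functionResponse.id === functionCall.id);
    if (matches.length !== 1) {
      problems.push(`call ${functionCall.id}: expected 1 response, found ${matches.length}`);
    } else if (matches[0].functionResponse.name !== functionCall.name) {
      problems.push(`call ${functionCall.id}: name "${matches[0].functionResponse.name}" does not match "${functionCall.name}"`);
    }
  }
  return problems;
}
```

Keying results by call id rather than by tool name keeps the mapping correct when the model calls the same tool more than once in a turn.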
3. Ignoring the Default thinking_level Shift to medium
Explanation: The previous preview default was high. Migrating without adjusting expectations causes perceived quality regression on complex tasks, as the model now defaults to a faster, lighter reasoning tier.
Fix: Audit your workflow complexity. Explicitly set thinking_level: "high" for mathematical, multi-step debugging, or financial modeling tasks. Keep medium for standard orchestration and low for latency-critical loops.
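The tier guidance above can be made explicit in code so the default is never relied on by accident. This is a minimal sketch: the TaskKind categories and the function name are assumptions chosen to mirror the task types named in this guide.

```typescript
type ThinkingLevel = "low" | "medium" | "high";

// Hypothetical task categories, named after the workloads mentioned in this guide.
type TaskKind =
  | "latency_critical_loop"
  | "orchestration"
  | "math"
  | "multi_step_debugging"
  | "financial_modeling";

// Map each task category to an explicit thinking level.
function chooseThinkingLevel(task: TaskKind): ThinkingLevel {
  switch (task) {
    case "latency_critical_loop":
      return "low";
    case "math":
    case "multi_step_debugging":
    case "financial_modeling":
      return "high";
    default:
      // Standard orchestration stays on the platform default tier.
      return "medium";
  }
}

console.log(chooseThinkingLevel("financial_modeling")); // high
console.log(chooseThinkingLevel("orchestration"));      // medium
```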
4. Manually Reconstructing Conversation State
Explanation: Teams often write custom summarization logic or external memory managers to preserve context across turns. This duplicates platform functionality, consumes extra tokens, and introduces state synchronization bugs.
Fix: Rely on the SDK's automatic thought preservation. Pass the response.history array directly into subsequent requests. Only implement external state management when crossing session boundaries or persisting across user logins.
5. Deploying Computer Use Tasks on This Model
Explanation: Computer Use capabilities are explicitly unsupported in this release. Attempting to route UI automation or desktop control tasks to this model will fail or produce degraded results.
Fix: Route Computer Use workloads to gemini-3-flash-preview. Maintain a routing layer that dispatches tasks based on capability requirements, not just cost or latency.
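A capability-based routing layer might look like the sketch below. The capability names, the ModelProfile shape, and the capability lists are illustrative assumptions that encode only what this guide states (combined tool use on gemini-3.5-flash, Computer Use on gemini-3-flash-preview); a real deployment would fill in the full capability set for each model.

```typescript
type Capability =
  | "function_calling"
  | "structured_output"
  | "search_grounding"
  | "code_execution"
  | "computer_use";

interface ModelProfile {
  modelId: string;
  supports: Capability[];
}

// Illustrative profiles in preference order; extend these lists as needed.
const MODEL_PROFILES: ModelProfile[] = [
  {
    modelId: "gemini-3.5-flash",
    supports: ["function_calling", "structured_output", "search_grounding", "code_execution"]
  },
  { modelId: "gemini-3-flash-preview", supports: ["computer_use"] }
];

// Return the first model that supports every required capability.
function routeModel(required: Capability[], profiles: ModelProfile[] = MODEL_PROFILES): string {
  for (const profile of profiles) {
    const supportsAll = required.every(cap => profile.supports.indexOf(cap) !== -1);
    if (supportsAll) {
      return profile.modelId;
    }
  }
  throw new Error(`No configured model supports: ${required.join(", ")}`);
}

console.log(routeModel(["function_calling", "code_execution"])); // gemini-3.5-flash
console.log(routeModel(["computer_use"]));                       // gemini-3-flash-preview
```

Throwing when no profile matches surfaces unsupported capability combinations at dispatch time rather than as degraded model output.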
6. Overlooking Batch Inference Pricing
Explanation: Synchronous inference runs at standard rates. Asynchronous or non-real-time workflows (e.g., nightly financial report generation, bulk code refactoring) can cut costs by 50% using batch inference, but teams often miss this optimization.
Fix: Implement a dual-path execution strategy. Use synchronous calls for real-time agentic loops and batch endpoints for deferred, high-volume tasks.
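The dual-path split can be decided before any request is sent. The sketch below only partitions work; the WorkItem shape and its interactive flag are hypothetical, and actually submitting the batch group is left to whatever batch endpoint your deployment uses.

```typescript
// Hypothetical unit of work; `interactive` marks tasks a user is waiting on.
interface WorkItem {
  id: string;
  prompt: string;
  interactive: boolean;
}

interface ExecutionPlan {
  sync: WorkItem[];  // real-time agentic loops
  batch: WorkItem[]; // deferred, high-volume tasks
}

// Split work into the synchronous path and the batch path.
function planExecution(items: WorkItem[]): ExecutionPlan {
  const plan: ExecutionPlan = { sync: [], batch: [] };
  for (const item of items) {
    if (item.interactive) {
      plan.sync.push(item);
    } else {
      plan.batch.push(item);
    }
  }
  return plan;
}

const examplePlan = planExecution([
  { id: "chat-1", prompt: "Fix this failing test", interactive: true },
  { id: "nightly-1", prompt: "Generate the nightly financial report", interactive: false }
]);
console.log(examplePlan.sync.length, examplePlan.batch.length); // 1 1
```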
7. Token Budgeting Blind Spots with Thought Preservation
Explanation: Automatic thought preservation increases context window usage because intermediate reasoning tokens are retained across turns. Teams that don't account for this experience unexpected context overflow or cache misses.
Fix: Monitor usageMetadata.promptTokenCount and usageMetadata.cachedContentTokenCount on each response. Enable context caching ($0.15 per million tokens) for repetitive system instructions and tool definitions. Implement sliding window truncation only when crossing session boundaries, not within active agentic loops.
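A per-turn check along these lines can catch context growth before it overflows. This is a sketch under assumptions: the field names follow the usageMetadata object returned by the @google/genai SDK (promptTokenCount, cachedContentTokenCount), while the function name, the contextLimit argument, and the 0.8 warning threshold are hypothetical values supplied by the caller.

```typescript
// Subset of the usage metadata fields this check needs.
interface UsageSnapshot {
  promptTokenCount?: number;
  cachedContentTokenCount?: number;
}

interface ContextReport {
  promptTokens: number;
  cachedTokens: number;
  cacheHitRatio: number; // share of prompt tokens served from cache
  nearLimit: boolean;    // true once usage crosses the warning threshold
}

// Summarize one turn's usage against a caller-supplied context limit.
function assessContextUsage(
  usage: UsageSnapshot,
  contextLimit: number,
  warnAt: number = 0.8
): ContextReport {
  const promptTokens = usage.promptTokenCount ?? 0;
  const cachedTokens = usage.cachedContentTokenCount ?? 0;
  return {
    promptTokens,
    cachedTokens,
    cacheHitRatio: promptTokens > 0 ? cachedTokens / promptTokens : 0,
    nearLimit: promptTokens >= contextLimit * warnAt
  };
}

const exampleReport = assessContextUsage(
  { promptTokenCount: 1000, cachedContentTokenCount: 250 },
  2000
);
console.log(exampleReport.cacheHitRatio, exampleReport.nearLimit); // 0.25 false
```

A falling cache hit ratio across turns is a useful early signal that preserved reasoning tokens are crowding out the cached prefix.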
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time coding assistant | thinking_level: "medium", synchronous API | Balances latency with iterative debugging capability; thought preservation handles multi-turn refactoring | Baseline ($1.50/$9.00 per M) |
| Financial report generation | thinking_level: "high", batch inference | Complex multi-step reasoning requires deeper analysis; batch cuts costs by 50% for non-real-time output | Reduced ($0.75/$4.50 per M) |
| MCP tool orchestration | thinking_level: "medium", combined tool use | 83.6% MCP Atlas score proves reliable multi-tool chaining; combined execution minimizes round-trips | Baseline + cache savings |
| Desktop/UI automation | gemini-3-flash-preview, Computer Use enabled | This model explicitly lacks Computer Use support; preview version maintains UI control capabilities | Higher ($3.50/$10.50 per M) |
| High-volume log analysis | thinking_level: "low", batch + context caching | Speed-sensitive parsing doesn't require deep reasoning; caching eliminates redundant input token charges | Lowest ($0.75/$4.50 per M + $0.15 cache) |
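The cost column can be reproduced with simple arithmetic. The sketch below uses only the rates quoted in this guide ($1.50 input and $9.00 output per million tokens, 50% batch discount); the function name is hypothetical, and context caching and storage charges are deliberately not modeled.

```typescript
// Rates as quoted in this guide, in USD per million tokens.
const QUOTED_RATES = {
  inputPerMillion: 1.5,
  outputPerMillion: 9.0,
  batchDiscount: 0.5
};

// Estimate the cost of a workload on the synchronous or batch path.
function estimateCostUsd(inputTokens: number, outputTokens: number, useBatch: boolean): number {
  const base =
    (inputTokens / 1000000) * QUOTED_RATES.inputPerMillion +
    (outputTokens / 1000000) * QUOTED_RATES.outputPerMillion;
  return useBatch ? base * (1 - QUOTED_RATES.batchDiscount) : base;
}

// 2M input tokens and 0.5M output tokens:
// synchronous = 2 * 1.50 + 0.5 * 9.00 = 7.50, batch = 7.50 * 0.5 = 3.75
console.log(estimateCostUsd(2000000, 500000, false)); // 7.5
console.log(estimateCostUsd(2000000, 500000, true));  // 3.75
```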
Configuration Template
// production-config.ts
export const GEMINI_WORKFLOW_CONFIG = {
  model: "gemini-3.5-flash",
  generation: {
    thinkingLevel: "medium" as const,
    maxOutputTokens: 65000,
    // Sampling parameters intentionally omitted per platform guidelines
  },
  tools: {
    combinedExecution: true,
    strictFunctionResponse: true, // Enforces id/name matching
    cacheSystemInstructions: true // Leverages $0.15/M context caching
  },
  routing: {
    computerUse: "gemini-3-flash-preview",
    batchThreshold: 100, // Switch to batch endpoint for >100 concurrent async tasks
    thinkingEscalation: {
      default: "medium",
      complexMath: "high",
      latencyCritical: "low"
    }
  },
  pricing: {
    inputPerMillion: 1.50,
    outputPerMillion: 9.00,
    contextCachePerMillion: 0.15,
    storagePerMillionPerHour: 1.00,
    batchDiscount: 0.50
  }
};
Quick Start Guide
- Install SDK & Set Credentials: Run npm install @google/genai and export your API key as GEMINI_API_KEY.
- Initialize Client: Create a GoogleGenAI instance with your key. No additional configuration is required for thought preservation; it activates automatically when conversation history is passed.
- Define Tools & Thinking Tier: Structure your functionDeclarations array. Set thinking_level to medium for general workflows, or high for complex reasoning. Remove all sampling parameters.
- Execute & Chain: Call generateContent with your initial prompt and tools. Capture response.history and pass it to subsequent requests. Map functionCall.id and functionCall.name exactly when returning FunctionResponse payloads.
- Validate & Scale: Run a small MCP tool chain test. Monitor token usage and latency. Switch to batch endpoints for deferred tasks to capture the 50% pricing discount. Deploy to production with context caching enabled for static instructions.