Adding Voice to Your AI Bot: Speech-to-Text and Text-to-Speech with Gemini 3.1

By Codcompass Team·2026-05-07·4 min read

Current Situation Analysis

Teams attempting to voice-enable AI chatbots frequently encounter architectural fragmentation and API surface confusion. The primary pain point stems from building parallel voice pipelines alongside existing text pipelines, which doubles the maintenance surface, guarantees feature drift, and forces redundant implementation of downstream capabilities like RAG search, function calling, and intent routing.

Failure modes are heavily concentrated around Google's fragmented SDK ecosystem. Developers expect a unified API surface but encounter two distinct packages (google-generativeai vs google-genai) with inconsistent model naming conventions and endpoint routing rules. Traditional deployment strategies fail because the Live API requires strict regional endpoints (e.g., us-central1) and fails silently on global configurations. Additionally, raw PCM audio output from synthesis models requires external transcoding, and platform-specific constraints (like LINE's single-use reply tokens) break synchronous multi-modal response flows. Latency compounds rapidly across transcription, orchestration, and synthesis, easily exceeding the sub-3-second threshold users expect for natural voice interactions.

WOW Moment: Key Findings

Experimental validation confirms that a transcribe-first architecture drastically reduces integration complexity while maintaining acceptable latency. By treating voice as a transport layer rather than a parallel feature, teams eliminate duplicated orchestration logic and centralize error handling.

| Approach | End-to-End Latency | Maintenance Surface | Cost per Interaction | Integration Complexity | |----------|--

-----------------|---------------------|----------------------|------------------------| | Parallel Voice Pipeline | ~4.2s | High (duplicated logic) | Baseline | High | | Transcribe-First Architecture | ~2.8s | Low (single orchestrator) | 3-5x higher for TTS, optimized overall | Low |

Key Findings:

STT via gemini-3.1-flash achieves a median latency of ~2.1s for 15-second audio clips (n=50, us-central1), with robust mixed-language recognition.
TTS via the Live API introduces session-based pricing and WebSocket overhead, making it the primary cost and latency driver.
The sweet spot lies in converting voice to text at the entry point, routing through the existing text orchestrator, and applying TTS only at the delivery layer. This minimizes state management and isolates platform-specific delivery quirks.

Core Solution

The production-ready pattern follows a linear, single-responsibility flow: Voice Message → STT (Transcribe) → Text Orchestrator → Response Text → TTS → Audio Reply

Step 1: Transcribe Voice with the Standard API

For pre-recorded voice messages, use the standard Gemini API via google-generativeai. The Live API is unnecessary overhead for batch transcription.


Enter fullscreen mode Exit fullscreen mode

python

## [](#sdk-googlegenerativeai-pip-install-googlegenerativeai)SDK: google-generativeai (pip install google-generativeai)

import google.generativeai as genai

model = genai.GenerativeModel("gemini-3.1-flash")

audio\_bytes = download\_voice\_message(message\_id)

response = model.generate\_content(\[  
"Transcribe this audio to text accurately.",  
{"mime\_type": "audio/m4a", "data": audio\_bytes}  
\])

transcript = response.text

Step 2: Synthesize Speech with the Live API

Audio synthesis requires the Live API and the google-genai SDK. This endpoint streams raw PCM audio and demands strict regional configuration.


Enter fullscreen mode Exit fullscreen mode

python

## [](#sdk-googlegenai-pip-install-googlegenai)SDK: google-genai (pip install google-genai)

from google import genai

client = genai.Client(  
vertexai=True,  
project="your-project",  
location="us-central1" # MUST be regional, not global  
)

async with client.aio.live.connect(  
model="gemini-live-2.5-flash-native-audio"  
) as session:  
await session.send\_client\_content(  
turns=\[{"role": "user", "parts": \[{"text": response\_text}\]}\]  
)  
\# Collect PCM audio chunks from the stream  
pcm\_data = await collect\_audio\_response(session)

Step 3: Platform Delivery & Orchestration

The orchestrator handles all downstream logic. Platform adapters manage delivery constraints. For example, LINE's reply tokens are single-use; if the bot sends text first, audio must follow via the push message API. The transcribe-first architecture ensures the orchestrator remains platform-agnostic while adapters handle protocol-specific fallbacks.

Pitfall Guide

SDK Fragmentation: Google splits functionality across google-generativeai (standard) and google-genai (Live). Importing the wrong package yields cryptic errors with no clear resolution path.
Model Naming Inconsistency: Standard STT uses gemini-3.1-flash, while Live TTS requires gemini-live-2.5-flash-native-audio. Assuming naming parity causes immediate initialization failures.
Regional Endpoint Requirement: The Live API fails silently on global endpoints. You must explicitly hardcode regional locations like us-central1 during client initialization.
Argument Syntax Rigidity: Part.from_text() strictly requires keyword arguments. Passing positional arguments triggers unexpected runtime errors that are not documented in quickstart guides.
Raw PCM Output Handling: TTS returns raw PCM streams, not playable audio files. Deployment environments must include ffmpeg to transcode PCM to m4a/opus before delivery.
Pricing Model Divergence: Standard API charges per token; Live API charges per session-second. Live TTS typically runs 3-5x higher per interaction, requiring careful session lifecycle management to avoid cost overruns.
Platform Reply Token Limits: Messaging platforms enforce strict reply token constraints (e.g., single-use). Synchronous text-then-audio flows will fail without implementing a push API fallback for secondary media responses.

Deliverables

Transcribe-First Voice Integration Blueprint: Architecture diagram detailing the linear STT → Orchestrator → TTS flow, SDK mapping matrix, and endpoint configuration guidelines for production deployment.
Pre-Deployment Validation Checklist: 12-point verification covering ffmpeg availability in container images, regional endpoint hardcoding, WebSocket reconnection logic, latency monitoring thresholds (<3s E2E), and platform reply token fallback strategies.
Configuration Templates: Ready-to-use Python environment setup (requirements.txt), Dockerfile snippet for ffmpeg integration, and SDK initialization configs for both google-generativeai and google-genai clients.

🎉 Mid-Year Sale — Unlock Full Article

Base plan from just $4.99/mo or $49/yr

7-day free trial · Cancel anytime · 30-day money-back

Current Situation Analysis

WOW Moment: Key Findings

🎉 Mid-Year Sale — Unlock Full Article

Production Bundle