-----------------|---------------------|----------------------|------------------------|
| Parallel Voice Pipeline | ~4.2s | High (duplicated logic) | Baseline | High |
| Transcribe-First Architecture | ~2.8s | Low (single orchestrator) | 3-5x higher for TTS, optimized overall | Low |
Key Findings:
- STT via `gemini-3.1-flash` achieves a median latency of ~2.1s for 15-second audio clips (n=50, us-central1), with robust mixed-language recognition.
- TTS via the Live API introduces session-based pricing and WebSocket overhead, making it the primary cost and latency driver.
- The sweet spot lies in converting voice to text at the entry point, routing through the existing text orchestrator, and applying TTS only at the delivery layer. This minimizes state management and isolates platform-specific delivery quirks.
Core Solution
The production-ready pattern follows a linear, single-responsibility flow:
Voice Message → STT (Transcribe) → Text Orchestrator → Response Text → TTS → Audio Reply
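The flow above can be sketched as three injected stages, so that the orchestrator only ever sees text. This is a minimal illustration, not the article's implementation: `VoicePipeline`, `handle_voice`, and the stub lambdas are hypothetical names, and real stages would call the STT and TTS APIs described in the steps that follow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Linear voice flow: STT -> text orchestrator -> TTS."""

    transcribe: Callable[[bytes], str]   # STT stage (hypothetical callable)
    orchestrate: Callable[[str], str]    # existing text-only orchestrator
    synthesize: Callable[[str], bytes]   # TTS stage, applied only at delivery

    def handle_voice(self, audio: bytes) -> tuple[str, bytes]:
        transcript = self.transcribe(audio)        # voice becomes text at the entry point
        reply_text = self.orchestrate(transcript)  # unchanged text pipeline
        reply_audio = self.synthesize(reply_text)  # speech is produced last
        return reply_text, reply_audio


# Stub stages for illustration only; they do not call any real API.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    orchestrate=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
reply_text, reply_audio = pipeline.handle_voice(b"hello")
```

Because each stage is a plain callable, the STT or TTS provider can be swapped without touching the orchestrator.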
Step 1: Transcribe Voice with the Standard API
For pre-recorded voice messages, use the standard Gemini API via google-generativeai. The Live API is unnecessary overhead for batch transcription.
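```python
# SDK: google-generativeai (pip install google-generativeai)
import os

import google.generativeai as genai

# Assumption: the API key is supplied via the GOOGLE_API_KEY environment variable.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-3.1-flash")

# download_voice_message is the application's own helper; it returns raw audio bytes.
audio_bytes = download_voice_message(message_id)

# Send the instruction and the inline audio blob in a single request.
response = model.generate_content([
    "Transcribe this audio to text accurately.",
    {"mime_type": "audio/m4a", "data": audio_bytes},
])
transcript = response.text
```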
Step 2: Synthesize Speech with the Live API
Audio synthesis requires the Live API and the google-genai SDK. This endpoint streams raw PCM audio and demands strict regional configuration.
```python
# SDK: google-genai (pip install google-genai)
from google import genai

client = genai.Client(
    vertexai=True,
    project="your-project",
    location="us-central1",  # MUST be regional, not global
)


async def synthesize(response_text: str) -> bytes:
    # `async with` is only valid inside a coroutine; run this via asyncio.run(...).
    async with client.aio.live.connect(
        model="gemini-live-2.5-flash-native-audio",
        config={"response_modalities": ["AUDIO"]},  # request audio output
    ) as session:
        await session.send_client_content(
            turns=[{"role": "user", "parts": [{"text": response_text}]}]
        )
        # Collect PCM audio chunks from the stream.
        # collect_audio_response is the application's own helper.
        pcm_data = await collect_audio_response(session)
        return pcm_data
```
The orchestrator handles all downstream logic. Platform adapters manage delivery constraints. For example, LINE's reply tokens are single-use; if the bot sends text first, audio must follow via the push message API. The transcribe-first architecture ensures the orchestrator remains platform-agnostic while adapters handle protocol-specific fallbacks.
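The reply-then-push fallback can be kept entirely inside the platform adapter. The sketch below assumes a hypothetical adapter interface with `reply` and `push` methods. It is not the real LINE SDK, and `ReplyContext` and `RecordingClient` are illustrative names. The first message consumes the single-use reply token; anything sent afterwards goes through push.

```python
from typing import Optional


class ReplyContext:
    """Spends a single-use reply token once, then falls back to push."""

    def __init__(self, client, user_id: str, reply_token: Optional[str]):
        self._client = client            # hypothetical adapter with reply()/push()
        self._user_id = user_id
        self._reply_token = reply_token  # set to None once it has been used

    def send(self, payload: dict) -> str:
        if self._reply_token is not None:
            # Clear the token before sending so it can never be reused.
            token, self._reply_token = self._reply_token, None
            self._client.reply(token, payload)
            return "reply"
        self._client.push(self._user_id, payload)
        return "push"


class RecordingClient:
    """In-memory stand-in used only to show the call order."""

    def __init__(self):
        self.calls = []

    def reply(self, reply_token, payload):
        self.calls.append(("reply", reply_token, payload["type"]))

    def push(self, user_id, payload):
        self.calls.append(("push", user_id, payload["type"]))


client = RecordingClient()
ctx = ReplyContext(client, user_id="U123", reply_token="tok-1")
first = ctx.send({"type": "text", "text": "Here is your answer."})
second = ctx.send({"type": "audio", "url": "https://example.com/reply.m4a"})
```

The orchestrator only calls `send`; whether a message travels as a reply or a push stays an adapter detail.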
Pitfall Guide
- SDK Fragmentation: Google splits functionality across `google-generativeai` (standard) and `google-genai` (Live). Importing the wrong package yields cryptic errors with no clear resolution path.
- Model Naming Inconsistency: Standard STT uses `gemini-3.1-flash`, while Live TTS requires `gemini-live-2.5-flash-native-audio`. Assuming naming parity causes immediate initialization failures.
- Regional Endpoint Requirement: The Live API fails silently on global endpoints. You must explicitly hardcode regional locations like `us-central1` during client initialization.
- Argument Syntax Rigidity: `Part.from_text()` strictly requires keyword arguments. Passing positional arguments triggers unexpected runtime errors that are not documented in quickstart guides.
- Raw PCM Output Handling: TTS returns raw PCM streams, not playable audio files. Deployment environments must include `ffmpeg` to transcode PCM to m4a/opus before delivery.
- Pricing Model Divergence: Standard API charges per token; Live API charges per session-second. Live TTS typically runs 3-5x higher per interaction, requiring careful session lifecycle management to avoid cost overruns.
- Platform Reply Token Limits: Messaging platforms enforce strict reply token constraints (e.g., single-use). Synchronous text-then-audio flows will fail without implementing a push API fallback for secondary media responses.
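For the raw PCM pitfall, the sketch below shows two things: wrapping PCM in a WAV container with the standard library (handy for local inspection), and building an `ffmpeg` argument list for the m4a transcode. It assumes 24 kHz, 16-bit, mono, little-endian output, which should be verified against the actual Live API response. `pcm_to_wav` and `ffmpeg_pcm_to_m4a_cmd` are hypothetical helper names.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container (assumed 16-bit mono at 24 kHz)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def ffmpeg_pcm_to_m4a_cmd(out_path: str, sample_rate: int = 24000,
                          channels: int = 1) -> list[str]:
    """Argument list for ffmpeg reading signed 16-bit LE PCM from stdin."""
    return [
        "ffmpeg", "-y",
        "-f", "s16le",             # raw input format: signed 16-bit little-endian
        "-ar", str(sample_rate),
        "-ac", str(channels),
        "-i", "pipe:0",            # PCM bytes are piped via stdin
        "-c:a", "aac",
        out_path,
    ]


# One second of silence under the assumed format.
silence = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav(silence)
cmd = ffmpeg_pcm_to_m4a_cmd("reply.m4a")
```

The command can be run with `subprocess.run(cmd, input=pcm_bytes, check=True)` in an image where `ffmpeg` is installed.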
Deliverables
- Transcribe-First Voice Integration Blueprint: Architecture diagram detailing the linear STT → Orchestrator → TTS flow, SDK mapping matrix, and endpoint configuration guidelines for production deployment.
- Pre-Deployment Validation Checklist: 12-point verification covering `ffmpeg` availability in container images, regional endpoint hardcoding, WebSocket reconnection logic, latency monitoring thresholds (<3s E2E), and platform reply token fallback strategies.
- Configuration Templates: Ready-to-use Python environment setup (`requirements.txt`), Dockerfile snippet for `ffmpeg` integration, and SDK initialization configs for both `google-generativeai` and `google-genai` clients.
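A few of the checklist items lend themselves to an automated preflight check. The sketch below covers only three of them (`ffmpeg` on PATH, a regional location, and the <3s end-to-end latency threshold). The function name `preflight_checks` and its parameters are assumptions for illustration, not part of the deliverables themselves.

```python
import shutil
from typing import Callable, Optional


def preflight_checks(
    location: str,
    median_e2e_seconds: float,
    which: Callable[[str], Optional[str]] = shutil.which,
    max_e2e_seconds: float = 3.0,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if location.strip().lower() in ("", "global"):
        problems.append("Live API location must be a specific region, e.g. us-central1")
    if median_e2e_seconds >= max_e2e_seconds:
        problems.append(
            f"median E2E latency {median_e2e_seconds:.1f}s is not under {max_e2e_seconds:.1f}s"
        )
    return problems


# The `which` lookup is injectable so the check can be exercised without ffmpeg installed.
ok = preflight_checks("us-central1", 2.8, which=lambda name: "/usr/bin/ffmpeg")
bad = preflight_checks("global", 4.2, which=lambda name: None)
```

Running this at container start-up, and failing fast on a non-empty result, catches a missing `ffmpeg` or a global endpoint before the first voice message arrives.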