json.loads(data["text"])
if msg.get("type") == "stop_listening":
break
await connection.flush_audio()
await connection.end_audio()
async def process_events():
async for event in connection:
if isinstance(event, TranscriptionStreamTextDelta):
await ws.send_json({"type": "partial", "text": event.text})
elif isinstance(event, TranscriptionStreamDone):
return event.text
audio_task = asyncio.create_task(receive_audio())
events_task = asyncio.create_task(process_events())
await audio_task
return await asyncio.wait_for(events_task, timeout=10.0)
The browser sends raw PCM audio bytes (16-bit, mono, 16 kHz) over a WebSocket. The server forwards them to Voxtral and listens for transcript events. `TranscriptionStreamTextDelta` gives you partial results you can stream back to the UI; `TranscriptionStreamDone` gives you the final transcript.
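To make the audio format concrete, here is a minimal sketch of the byte arithmetic for 16-bit mono PCM at 16 kHz. The helper names (`pcm16_bytes_for`, `float_to_pcm16`) are illustrative and not part of the Mistral SDK, and a real browser client would do the float-to-integer conversion in JavaScript; the Python version only shows what the bytes on the wire look like.

```python
import struct

SAMPLE_RATE = 16_000      # samples per second (16 kHz)
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # mono


def pcm16_bytes_for(duration_ms: int) -> int:
    """Number of raw PCM bytes in `duration_ms` milliseconds of audio."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * duration_ms // 1000


def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)
```

One second of audio in this format is 32,000 bytes, so a 100 ms chunk is 3,200 bytes. That is a handy number when sizing WebSocket frames.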
One important constraint: `AudioFormat` only takes `encoding` and `sample_rate`. Don't pass `channels` or `bit_depth`; the SDK will raise an error.
## LLM + TTS
Once you have the transcript, the rest is simpler:
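```python
import base64

# `client` (the Mistral SDK client) and SYSTEM_PROMPT are defined elsewhere in the app.

async def respond(text, history, ref_audio_b64):
    # Keep only the 20 most recent messages to bound the prompt size
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history[-20:]
    messages.append({"role": "user", "content": text})

    llm_response = client.chat.complete(
        model="mistral-small-latest",
        messages=messages,
        max_tokens=300,
    )
    answer = llm_response.choices[0].message.content

    # ref_audio carries the base64-encoded reference clip used for voice cloning
    tts_response = client.audio.speech.complete(
        model="voxtral-mini-tts-2603",
        input=answer,
        ref_audio=ref_audio_b64,
        response_format="mp3",
    )
    # The audio comes back base64-encoded; decode it to raw MP3 bytes
    return answer, base64.b64decode(tts_response.audio_data)
```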
The `ref_audio` parameter is what makes voice cloning work on the free plan. You pass a base64-encoded audio clip and the model adapts the voice inline: no persistent voice profile, no paid subscription. If you want the full explanation of how that works, [part 1 of this series](#) covers it.
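Producing the value for `ref_audio` is plain standard-library work. The sketch below assumes the API wants a bare base64 string (no `data:` URI prefix), which is what the `ref_audio_b64` name in the snippet above suggests; `encode_ref_audio` and the file name in the usage note are hypothetical.

```python
import base64
from pathlib import Path


def encode_ref_audio(path: str) -> str:
    """Read an audio clip from disk and return it as a base64 string."""
    raw = Path(path).read_bytes()
    # b64encode returns bytes; decode to str so it can go into a JSON payload
    return base64.b64encode(raw).decode("ascii")
```

You would call this once per uploaded clip, for example `ref_audio_b64 = encode_ref_audio("reference.wav")`, and keep the result in the session rather than re-reading the file on every turn.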
## What it actually costs
Development is free. Mistral has a free tier with rate limits generous enough to build and test without paying anything.
When you move to paid usage, the numbers are:
- STT: $0.003/minute
- TTS: $16 per million characters
- LLM (mistral-small): $0.10/1M input tokens, $0.30/1M output tokens
A typical turn (10 seconds of speech, a 100-word answer) works out to roughly **$0.011**. The TTS step dominates; the LLM cost is negligible.
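The per-turn figure can be reproduced from the listed prices. This is a back-of-the-envelope sketch: the prices come from the list above, but the character and token counts are my assumptions (about 6.5 characters per word including spaces, about 1.3 tokens per word, and a 500-token prompt), and `cost_per_turn` is a hypothetical helper.

```python
STT_PER_MINUTE = 0.003          # USD per minute of audio
TTS_PER_MILLION_CHARS = 16.0    # USD per 1M characters
LLM_INPUT_PER_MILLION = 0.10    # USD per 1M input tokens
LLM_OUTPUT_PER_MILLION = 0.30   # USD per 1M output tokens


def cost_per_turn(
    speech_seconds: float = 10.0,
    answer_chars: int = 650,     # assumption: 100 words at ~6.5 chars each
    input_tokens: int = 500,     # assumption: system prompt + history + question
    output_tokens: int = 130,    # assumption: 100 words at ~1.3 tokens each
) -> dict[str, float]:
    """Break one conversational turn down into STT, TTS, and LLM cost in USD."""
    stt = speech_seconds / 60 * STT_PER_MINUTE
    tts = answer_chars / 1_000_000 * TTS_PER_MILLION_CHARS
    llm = (input_tokens * LLM_INPUT_PER_MILLION
           + output_tokens * LLM_OUTPUT_PER_MILLION) / 1_000_000
    return {"stt": stt, "tts": tts, "llm": llm, "total": stt + tts + llm}
```

With these defaults the total lands at about $0.011, and TTS accounts for well over 90% of it. Longer answers move the total almost linearly, which is why capping `max_tokens` also caps your speech bill.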
At scale, assuming 10 turns per user per month:
| Users | Turns/month | Cost/month |
| --- | --- | --- |
| 10 | 100 | ~$1 |
| 1,000 | 10,000 | ~$110 |
| 100,000 | 1M | ~$11,000 |
| 1,000,000 | 10M | ~$110,000 |
"Zero cost" is accurate at dev scale. At real user scale it's real money, but it's predictable, per-call money with no surprises.
The comparison with ElevenLabs has two parts. At low volume, ElevenLabs' subscription model works against you: you're paying $5–22/month before you've cloned a single voice. At high volume, the per-character rate matters more, and Mistral's TTS is roughly 73% cheaper per character than ElevenLabs. And with Mistral, voice cloning is a parameter, not a plan upgrade. `ref_audio` works on the free tier; ElevenLabs instant voice cloning requires at minimum the Starter plan.
## The sovereignty angle
This is an angle most AI tutorials don't mention, but it matters for a real slice of use cases.
Mistral is a French company. Their infrastructure is in Europe. All three API calls in this pipeline (speech recognition, LLM, and text-to-speech) stay within EU jurisdiction.
If you're building for European users, or in any regulated sector like healthcare, education, or legal, this is worth knowing before you choose a stack. GDPR requires that personal data be processed lawfully and, where it concerns EU residents, often requires understanding where it goes. US-based cloud providers are subject to the CLOUD Act, which means US authorities can compel disclosure of data on US-operated systems regardless of where the physical servers are. That creates a real compliance gap that some organizations simply can't accept.
Running voice data through Mistral sidesteps that entirely. If you're building a voice assistant for a school, a GP's surgery, or anything touching personal data in Europe, this isn't a nice-to-have; it can be a hard requirement. Worth knowing before you've already built on ElevenLabs.
## What's missing for production
The pipeline as described is a working app. A few things you'd add before running it in front of real users:
- Persistent session storage (currently in-memory, resets on restart)
- Per-user rate limiting
- Retry logic for intermittent 503s from Voxtral, which do happen occasionally
- Secure API key handling on the server side
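As an illustration of the retry item, here is a minimal sketch of exponential backoff with jitter around an async call. `ServiceUnavailable` is a hypothetical stand-in exception: I'm not assuming which error class the Mistral SDK raises on a 503, so you would pass the real one via `retry_on`.

```python
import asyncio
import random


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for whatever error your client raises on a 503."""


async def with_retries(call, *, attempts=3, base_delay=0.5,
                       retry_on=(ServiceUnavailable,)):
    """Await `call()` and retry with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return await call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            delay = base_delay * (2 ** attempt)
            # Jitter keeps many clients from retrying in lockstep
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

`call` must be a zero-argument function that returns an awaitable. Only retry errors that are safe to repeat; a request that failed validation will fail again no matter how long you wait.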
The core flow is solid. I've tested it over several hundred turns and the STT + LLM + TTS chain has been reliable.
* * *
If you want the full implementation (the FastAPI backend, WebSocket frontend, state machine, and voice upload endpoint), I cover it in my [Mistral AI: Voxtral TTS (text to speech), Vision & AI Agents course on Udemy](https://www.udemy.com/course/mistral-ai-text-to-speech-agents/?couponCode=MISTRALAIGO). Everything in the course runs on the free plan.