ction. Audio flows at 24 kHz mono PCM16 between the worker and the Voice Agent API. LiveKit’s native FFI resampler handles conversion between WebRTC’s internal 48 kHz and the API’s expected format.
Prerequisites
You don’t need a microphone or speakers on the worker machine — the worker is a server-side participant. All audio I/O happens in the browser/mobile client.
Quick Start
1. Clone and Install
git clone https://github.com/kelsey-aai/voice-agent-livekit
cd voice-agent-livekit
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
2. Configure Environment
cp .env.example .env
Fill in .env:
ASSEMBLYAI_API_KEY= # https://www.assemblyai.com/dashboard/signup
LIVEKIT_URL=wss://<project>.livekit.cloud
LIVEKIT_API_KEY= # LiveKit Cloud → Settings → Keys
LIVEKIT_API_SECRET=
ROOM_NAME=voice-agent-demo
For self-hosted LiveKit, run livekit-server --dev and use LIVEKIT_URL=ws://localhost:7880.
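Before starting the worker, it can help to sanity-check the values you just filled in. The sketch below is not part of the repository; `check_env` and `REQUIRED_KEYS` are hypothetical names, and the checks (non-empty, no stray whitespace, a `ws://` or `wss://` URL) are assumptions about what usually goes wrong rather than an official validation routine.

```python
import os

# Keys taken from the .env listing above.
REQUIRED_KEYS = (
    "ASSEMBLYAI_API_KEY",
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "ROOM_NAME",
)


def check_env(env):
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    problems = []
    for key in REQUIRED_KEYS:
        value = env.get(key, "")
        if not value:
            problems.append(f"{key} is missing or empty")
        elif value != value.strip():
            problems.append(f"{key} has leading or trailing whitespace")
    url = env.get("LIVEKIT_URL", "").strip()
    if url and not url.startswith(("ws://", "wss://")):
        problems.append("LIVEKIT_URL should start with ws:// or wss://")
    return problems


if __name__ == "__main__":
    for problem in check_env(os.environ):
        print("config problem:", problem)
```

Passing a plain mapping instead of reading `os.environ` directly keeps the function easy to exercise with sample values.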
3. Run the Worker
python worker.py
4. Connect a Client
The fastest way is the LiveKit Agents Playground:
- Open the playground.
- Paste your LIVEKIT_URL and a token. Generate the token from the LiveKit Cloud dashboard, setting the room to voice-agent-demo and the identity to anything other than voice-agent.
- Click Connect, allow microphone access, and start talking.
How it Works
The worker is one file (worker.py) and roughly 250 lines. Six steps do the actual work.
1. Mint a LiveKit Token and Join the Room
from livekit import api, rtc
token = (
    api.AccessToken(LIVEKIT_API_KEY, LIVEKIT_API_SECRET)
    .with_identity("voice-agent")
    .with_grants(api.VideoGrants(
        room_join=True, room=ROOM_NAME,
        can_publish=True, can_subscribe=True,
    ))
    .to_jwt()
)
room = rtc.Room()
await room.connect(LIVEKIT_URL, token)
AccessToken builds a signed JWT with the grants the worker needs: subscribe to incoming audio, publish a reply track. room.connect() opens the WebRTC signaling and media path.
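When a token does not behave as expected, looking at its claims is often the quickest check. The helper below is a debugging sketch using only the standard library; `jwt_claims` is a hypothetical name, it does not verify the signature, and the expectation that LiveKit places the grants under a `video` claim with the identity in `sub` is an assumption to confirm against your own tokens.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying the signature (debug use only)."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

Calling `jwt_claims(token)` on the worker token and on a client token lets you compare the room name and publish/subscribe grants side by side.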
2. Publish a Local Audio Track for the Agent’s Voice
audio_source = rtc.AudioSource(sample_rate=24_000, num_channels=1)
local_track = rtc.LocalAudioTrack.create_audio_track("agent-voice", audio_source)
await room.local_participant.publish_track(
    local_track,
    rtc.TrackPublishOptions(source=rtc.TrackSource.SOURCE_MICROPHONE),
)
AudioSource is LiveKit’s pump for sending audio into a room. We configure it at 24 kHz mono — the Voice Agent API’s default format — so reply audio goes straight in without resampling.
3. Subscribe to the User’s Audio Track
@room.on("track_subscribed")
def on_track_subscribed(track, publication, participant):
    if track.kind == rtc.TrackKind.KIND_AUDIO:
        asyncio.create_task(bridge_to_voice_agent(track))
LiveKit emits track_subscribed when a remote participant publishes a track and it gets routed to us. We only care about audio.
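One detail worth knowing about asyncio.create_task: the event loop keeps only a weak reference to the task, so the Python documentation recommends holding your own reference until it finishes. The sketch below shows one common way to do that; `spawn` and `background_tasks` are hypothetical names, not part of worker.py.

```python
import asyncio

# Strong references to in-flight tasks so they are not garbage-collected mid-run.
background_tasks: set[asyncio.Task] = set()


def spawn(coro) -> asyncio.Task:
    """Schedule a coroutine and keep a reference to its task until it completes."""
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    # Drop the reference once the task is done, whether it succeeded or failed.
    task.add_done_callback(background_tasks.discard)
    return task
```

Under this sketch, the handler would call `spawn(bridge_to_voice_agent(track))` instead of calling `asyncio.create_task` directly.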
4. Forward Microphone Audio to the Voice Agent API
import base64
import json

# mic_track is the remote audio track handed over by on_track_subscribed.
stream = rtc.AudioStream.from_track(
    track=mic_track,
    sample_rate=24_000,  # ask LiveKit to resample to 24 kHz
    num_channels=1,
)
async for event in stream:
    pcm16_bytes = bytes(event.frame.data)  # raw bytes of the int16 samples
    await ws.send(json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    }))
AudioStream does the resampling. WebRTC carries audio at 48 kHz internally, but we ask for 24 kHz mono and the LiveKit FFI resampler handles the conversion. Each AudioFrame exposes data as a memoryview of int16 samples — base64-encode and ship as input.audio.
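The encoding step can be pulled out into a small function that is easy to test without a room or a WebSocket. This is a sketch under the assumption that the message shape is exactly the one shown above; `encode_input_audio` is a hypothetical helper name, and the odd-length check is an added safeguard rather than something the API is known to require.

```python
import base64
import json


def encode_input_audio(pcm16_bytes: bytes) -> str:
    """Wrap raw PCM16 bytes in the input.audio JSON message shown above."""
    if len(pcm16_bytes) % 2:
        # PCM16 uses 2 bytes per sample, so an odd length means a torn sample.
        raise ValueError("PCM16 payload must contain whole 2-byte samples")
    return json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    })
```

Inside the loop, `await ws.send(encode_input_audio(bytes(event.frame.data)))` would then replace the inline `json.dumps` call.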
5. Play the Agent’s Reply Back into the Room
elif t == "reply.audio":
    pcm = base64.b64decode(event["data"])
    samples = len(pcm) // 2  # 2 bytes per int16, mono
    frame = rtc.AudioFrame(
        data=pcm,
        sample_rate=24_000,
        num_channels=1,
        samples_per_channel=samples,
    )
    await audio_source.capture_frame(frame)
The agent streams reply.audio events as soon as the LLM begins generating. Each chunk is wrapped in an AudioFrame and pushed into the AudioSource, which queues it up to 1 second deep and drains at 24 kHz on its own clock.
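To reason about how much of the 1-second queue a chunk occupies, it helps to convert bytes to samples and samples to milliseconds. The sketch below does that arithmetic with the standard library only; `decode_reply_audio` is a hypothetical helper, and the 24 kHz mono PCM16 format is taken from the text above.

```python
import base64

SAMPLE_RATE = 24_000   # Hz, as configured on the AudioSource
BYTES_PER_SAMPLE = 2   # PCM16, mono


def decode_reply_audio(b64_data: str) -> tuple[bytes, int, float]:
    """Return (pcm bytes, samples per channel, duration in ms) for one reply.audio chunk."""
    pcm = base64.b64decode(b64_data)
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("chunk does not contain whole PCM16 samples")
    samples = len(pcm) // BYTES_PER_SAMPLE
    duration_ms = samples * 1000 / SAMPLE_RATE
    return pcm, samples, duration_ms
```

For example, a 4,800-byte chunk holds 2,400 samples, which is 100 ms of audio, so ten such chunks would fill a 1,000 ms queue.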
6. Handle Barge-In
elif t == "input.speech.started":
    # User started talking; stop playback.
    audio_source.clear_queue()
elif t == "reply.done":
    if event.get("status") == "interrupted":
        audio_source.clear_queue()
AudioSource.clear_queue() immediately discards every queued frame so the user doesn’t hear stale agent audio after they’ve spoken over it.
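The barge-in rule is small enough to test in isolation. The sketch below restates it as one function and exercises it against a stand-in object; `FakeAudioSource` and `handle_barge_in` are hypothetical names, and the fake only mimics the `clear_queue()` call used above, not the real rtc.AudioSource.

```python
from collections import deque


class FakeAudioSource:
    """Stand-in for rtc.AudioSource exposing only the clear_queue() call used above."""

    def __init__(self):
        self.queue = deque()

    def clear_queue(self):
        self.queue.clear()


def handle_barge_in(event: dict, audio_source) -> bool:
    """Flush queued playback when the user interrupts; return True if a flush happened."""
    t = event.get("type")
    interrupted = t == "reply.done" and event.get("status") == "interrupted"
    if t == "input.speech.started" or interrupted:
        audio_source.clear_queue()
        return True
    return False
```

Any object with a `clear_queue()` method works here, so the same function could be called with the real `audio_source` from the worker.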
Tuning the Agent
Pick a Voice
"output": {"voice": "james"} # conversational US male
"output": {"voice": "sophie"} # clear UK female
"output": {"voice": "diego"} # Latin American Spanish
"output": {"voice": "arjun"} # Hindi/Hinglish
See the Voices catalog for samples. Multilingual voices code-switch automatically.
Adjust the System Prompt and Greeting
"session": {
    "system_prompt": (
        "You are a customer support agent for Acme. Speak in 1–2 short "
        "sentences. Confirm the user's question before answering."
    ),
    "greeting": "Hi, this is Acme support — what's going on?",
}
You can re-send session.update mid-conversation to swap the prompt or voice. greeting is locked once spoken, but system_prompt and voice are not.
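A mid-conversation update might be assembled as in the sketch below. The envelope (`"type": "session.update"` wrapping a `"session"` object, with the voice nested under `"output"`) is inferred from the fragments in this guide and should be treated as an assumption to check against the API reference; `build_session_update` is a hypothetical helper. The greeting is deliberately left out because it is locked once spoken.

```python
import json


def build_session_update(system_prompt=None, voice=None) -> str:
    """Build a session.update message that changes only the fields provided."""
    session = {}
    if system_prompt is not None:
        session["system_prompt"] = system_prompt
    if voice is not None:
        session["output"] = {"voice": voice}
    if not session:
        raise ValueError("nothing to update")
    # Assumed envelope: the message type plus the partial session object.
    return json.dumps({"type": "session.update", "session": session})
```

Sending `build_session_update(voice="sophie")` over the existing WebSocket would, under these assumptions, switch the voice without touching the prompt.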
Tune Turn Detection
"input": {
    "turn_detection": {
        "vad_threshold": 0.5,        # 0.0–1.0; higher = ignore more noise
        "min_silence": 600,          # ms before confident end-of-turn
        "max_silence": 1500,         # ms hard ceiling
        "interrupt_response": True,  # set False to disable barge-in
    }
}
For deliberate speech (eldercare, healthcare), raise max_silence to 2500. For fast-paced conversation, drop min_silence to 300.
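The two adjustments above can be captured as named profiles layered over the defaults. This is a sketch, not part of worker.py: the profile names and the `turn_detection` helper are hypothetical, the numbers come straight from the text, and the range checks are added safeguards rather than documented API constraints.

```python
# Defaults from the configuration shown above.
BASE = {
    "vad_threshold": 0.5,
    "min_silence": 600,
    "max_silence": 1500,
    "interrupt_response": True,
}

# Hypothetical profile names; each overrides only what the text suggests changing.
PROFILES = {
    "standard": {},
    "fast": {"min_silence": 300},
    "deliberate": {"max_silence": 2500},
}


def turn_detection(profile: str = "standard") -> dict:
    """Return the "input" fragment for the chosen profile."""
    cfg = {**BASE, **PROFILES[profile]}
    if not 0.0 <= cfg["vad_threshold"] <= 1.0:
        raise ValueError("vad_threshold must be between 0.0 and 1.0")
    if cfg["min_silence"] > cfg["max_silence"]:
        raise ValueError("min_silence must not exceed max_silence")
    return {"input": {"turn_detection": cfg}}
```

Keeping the overrides separate from the defaults makes it clear which knob each profile actually turns.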
Boost Domain-Specific Terms
If your conversation includes product names, medical terms, or rare proper nouns, add them to session.input.keyterms:
"input": { "keyterms": ["Universal-3 Pro Streaming", "AssemblyAI", "LiveKit"] }
Multiple Participants in One Room
This worker bridges one remote audio track to the Voice Agent API. Two ways to scale:
- One agent per room. Spin up a separate worker process per room. Best for 1-on-1 use cases like phone-style support agents.
- Mix participants before sending. If you want a meeting-style multi-talker agent, mix all remote audio with rtc.AudioMixer and send the mix to one Voice Agent API session.
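To make "mixing" concrete: combining mono PCM16 tracks amounts to summing them sample by sample and clipping the result to the int16 range. The sketch below illustrates that idea with the standard library only. It is not rtc.AudioMixer and makes no claim about how that class works internally; `mix_pcm16` is a hypothetical function that assumes all inputs share one sample rate and use the machine's native byte order.

```python
from array import array


def mix_pcm16(*tracks: bytes) -> bytes:
    """Sum same-rate mono PCM16 buffers sample by sample, clipping to the int16 range."""
    if not tracks:
        return b""
    decoded = [array("h", t) for t in tracks]  # "h" = signed 16-bit integers
    length = max(len(d) for d in decoded)
    mixed = array("h", bytes(2 * length))      # start from silence
    for i in range(length):
        # Shorter tracks simply stop contributing once they run out of samples.
        total = sum(d[i] for d in decoded if i < len(d))
        mixed[i] = max(-32768, min(32767, total))
    return mixed.tobytes()
```

Clipping keeps loud overlapping speech from wrapping around into noise, at the cost of some distortion when several participants talk at once.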
Pitfall Guide
- Token/Permission Misconfiguration: The worker connects but clients never hear the agent. Ensure can_subscribe=True is set on the client token and can_publish=True on the worker token. Mismatched grants break media routing.
- Sample Rate Mismatch: Audio sounds pitched up or down. Both AudioSource and AudioStream.from_track must be configured at sample_rate=24_000, num_channels=1. WebRTC carries audio at 48 kHz internally; stating the rate explicitly is what makes the FFI resampler convert to the 24 kHz the Voice Agent API expects.
- Audio Buffer Underrun: Choppy or robotic audio indicates the buffer ran dry. Run the worker close to your network egress. Increase queue_size_ms from 1000 to 2000 inside AudioSource(...) to add headroom against transient stalls.
- Echo Cancellation Neglect: Agent interrupts itself or creates feedback loops. Browser clients using getUserMedia({ audio: { echoCancellation: true } }) handle this automatically. Custom mobile clients must explicitly enable AEC on the capture side.
- WebSocket Authentication Failures: UNAUTHORIZED close on the AssemblyAI WebSocket. Verify ASSEMBLYAI_API_KEY is valid, unexpired, and pasted without trailing whitespace. Regenerate keys via the dashboard if rotation policies apply.
- Turn Detection Over-Tuning: Setting min_silence too low (<300 ms) causes premature cut-offs; setting max_silence too high (>2500 ms) creates unnatural pauses. Adjust based on domain: healthcare and eldercare need longer ceilings, while fast-paced chat needs aggressive endpointing.
- Multi-Participant Mixing Strategy: Attempting to bridge multiple raw tracks to a single WebSocket session causes audio collision. Use rtc.AudioMixer to combine tracks before forwarding, or isolate 1-on-1 conversations into separate rooms with dedicated worker processes.
Deliverables
- Blueprint: Complete architecture reference and deployment guide available at the official GitHub repository. Includes room topology diagrams, WebSocket event flowcharts, and scaling patterns for multi-room deployments.
- Checklist: Pre-flight validation script covering Python 3.10+ environment, API key generation, LiveKit project configuration, token grant verification, and client-side AEC confirmation.
- Configuration Templates: Production-ready .env scaffolding, turn detection JSON profiles (fast-paced, standard, deliberate), system prompt templates with safety guardrails, and voice selection matrices for multilingual deployments.