des enterprise-grade privacy without sacrificing developer velocity.
Core Solution
The architecture relies on three distinct layers: a mobile capture endpoint, an in-memory relay server, and a desktop consumption hook. The pairing mechanism uses cryptographic secrets embedded in a QR code to establish a secure, ephemeral channel. A 256-bit secret initializes the pairing, and the relay server issues a 384-bit session token upon successful scan. Sessions reside in memory with a 30-minute sliding TTL, resetting on each transcript POST. This eliminates database dependencies while maintaining strict session isolation.
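The secret and token sizes above map directly onto byte counts from a cryptographically secure random source. A minimal sketch, assuming a Node.js runtime; the helper names (createPairingSecret, createSessionToken, secretsMatch) are hypothetical and are not part of @dictation/core:

```typescript
import { Buffer } from 'node:buffer'
import { randomBytes, timingSafeEqual } from 'node:crypto'

// Hypothetical helper: 32 random bytes = 256 bits, encoded URL-safe for a QR code.
export function createPairingSecret(): string {
  return randomBytes(32).toString('base64url')
}

// Hypothetical helper: 48 random bytes = 384 bits for the session token.
export function createSessionToken(): string {
  return randomBytes(48).toString('base64url')
}

// Constant-time comparison so a mismatch does not reveal how many characters matched.
export function secretsMatch(expected: string, presented: string): boolean {
  const a = Buffer.from(expected)
  const b = Buffer.from(presented)
  return a.length === b.length && timingSafeEqual(a, b)
}
```

The base64url encoding is an assumption chosen because it needs no escaping inside a QR-encoded URL; any encoding that preserves the full entropy would do.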
Step 1: Relay Server Implementation
The relay server handles session creation, transcript ingestion, and SSE broadcasting. It must validate origins, manage in-memory state, and stream partial/final transcripts without blocking.
// app/api/dictation/[...relay]/route.ts
import { buildDictationRelay } from '@dictation/core/server'

export const { GET, POST, OPTIONS } = buildDictationRelay({
  sessionTTL: 1800, // seconds; 30-minute sliding window
  // Falls back to '*' when DICTATION_ORIGINS is unset; always set it in production
  allowedOrigins: process.env.DICTATION_ORIGINS?.split(',') || ['*'],
  maxPayloadSize: 4096, // Prevent oversized transcript bursts
})
Architecture Rationale: The relay uses an in-memory store rather than Redis or PostgreSQL because voice sessions are inherently ephemeral. A 30-minute sliding TTL ensures abandoned sessions are garbage-collected automatically. The maxPayloadSize guard prevents malformed or malicious transcript bursts from exhausting server memory. CORS validation is enforced at the route level to prevent cross-origin relay abuse.
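The internals of buildDictationRelay are not shown here, so the following is only a sketch of how the origin and payload checks described above could work. The function name checkRequest, the result shape, and the reading of maxPayloadSize as a byte count are all assumptions:

```typescript
export interface RelayGuardOptions {
  allowedOrigins: string[]
  maxPayloadSize: number // assumed to be a size in bytes
}

export type GuardResult =
  | { ok: true }
  | { ok: false; status: number; reason: string }

// Hypothetical guard, run before a transcript POST is accepted.
export function checkRequest(
  origin: string | null,
  body: string,
  options: RelayGuardOptions,
): GuardResult {
  const originAllowed =
    options.allowedOrigins.includes('*') ||
    (origin !== null && options.allowedOrigins.includes(origin))
  if (!originAllowed) {
    return { ok: false, status: 403, reason: 'origin not allowed' }
  }

  // Measure encoded bytes, not string length, so multi-byte characters count fully.
  const byteLength = new TextEncoder().encode(body).length
  if (byteLength > options.maxPayloadSize) {
    return { ok: false, status: 413, reason: 'payload too large' }
  }

  return { ok: true }
}
```

Rejecting before the body is parsed or stored keeps an oversized request from ever reaching the in-memory session state.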
Step 2: Mobile Capture Endpoint
The mobile page initializes the Web Speech API, manages microphone permissions, and forwards recognized text to the relay. It must handle continuous recognition, partial results, and browser-specific lifecycle events.
// app/mic/page.tsx
"use client"
export { DictationMic as default } from '@dictation/core/mobile'
Architecture Rationale: Abstracting the mobile endpoint into a dedicated export keeps the phone UI decoupled from desktop logic. The underlying implementation binds microphone activation to a user gesture (required by iOS Safari), initializes SpeechRecognition with continuous: true and interimResults: true, and POSTs transcript chunks to the relay endpoint. By keeping audio processing strictly client-side, the mobile page never transmits raw audio data.
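To illustrate the step between recognition and the relay POST, here is a sketch that turns a Web Speech API result event into transcript chunks. The event is typed structurally so the function can be exercised outside a browser; the TranscriptChunk shape and the name toChunks are assumptions, not the package's actual payload format:

```typescript
// Structural subset of SpeechRecognitionEvent, limited to the fields used here.
interface RecognitionAlternativeLike {
  transcript: string
}
interface RecognitionResultLike {
  isFinal: boolean
  0: RecognitionAlternativeLike
}
interface RecognitionEventLike {
  resultIndex: number
  results: ArrayLike<RecognitionResultLike>
}

// Assumed payload shape for one transcript chunk.
export interface TranscriptChunk {
  text: string
  final: boolean
}

// Collects the results that changed in this event, starting at resultIndex.
export function toChunks(event: RecognitionEventLike): TranscriptChunk[] {
  const chunks: TranscriptChunk[] = []
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i]
    if (!result) continue
    chunks.push({ text: result[0].transcript, final: result.isFinal })
  }
  return chunks
}
```

In the browser this would typically be called from the recognizer's onresult handler, with each chunk then POSTed to the relay. Only text leaves the phone at that point.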
Step 3: Desktop Consumption Hook
The desktop hook subscribes to the SSE stream, manages field registration, and injects text into the currently focused input. It abstracts partial/final transcript handling and provides a QR overlay for pairing.
// components/DictationInput.tsx
"use client" // hooks require a Client Component in the Next.js App Router
import { useDictationEngine, DictationOverlay } from '@dictation/core/web'
import { useRef, useEffect } from 'react'

export function DictationInput({ placeholder }: { placeholder: string }) {
  const fieldRef = useRef<HTMLInputElement>(null)
  const engine = useDictationEngine({
    relayEndpoint: '/api/dictation',
    locale: 'en-US',
  })

  // Register the input on mount and unregister it on unmount
  useEffect(() => {
    engine.bindField('primary-input', fieldRef)
    return () => engine.unbindField('primary-input')
  }, [engine])

  return (
    <div className="relative">
      <input ref={fieldRef} placeholder={placeholder} />
      <button onClick={engine.toggleOverlay}>
        {engine.isActive ? 'Stop Dictation' : 'Start Dictation'}
      </button>
      <DictationOverlay
        sessionToken={engine.sessionToken}
        pairingSecret={engine.pairingSecret}
        relayUrl={engine.relayEndpoint}
        mobileEndpoint={engine.mobileEndpoint}
        visible={engine.showOverlay}
        onClose={engine.toggleOverlay}
      />
    </div>
  )
}
Architecture Rationale: The hook uses a field registry pattern (bindField/unbindField) to track which input currently holds focus. When a transcript arrives via SSE, the engine checks document.activeElement and injects text at the cursor position using selectionStart and selectionEnd. This prevents accidental form submissions and maintains natural typing flow. The QR overlay is conditionally rendered and manages the cryptographic pairing handshake without blocking the main thread.
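The cursor-position injection can be reasoned about as a pure function over the field's value and selection. A minimal sketch, with FieldState and insertAtCursor as illustrative names rather than anything exported by the package:

```typescript
export interface FieldState {
  value: string
  selectionStart: number
  selectionEnd: number
}

// Replaces the current selection (or inserts at the caret) and moves the caret
// to the end of the inserted text.
export function insertAtCursor(field: FieldState, text: string): FieldState {
  const before = field.value.slice(0, field.selectionStart)
  const after = field.value.slice(field.selectionEnd)
  const caret = before.length + text.length
  return {
    value: before + text + after,
    selectionStart: caret,
    selectionEnd: caret,
  }
}
```

On a real input element, setRangeText(text, start, end, 'end') performs the same replacement and caret move in one call. The pure version is mainly useful for unit tests and for controlled React inputs where the value lives in state.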
Pitfall Guide
1. Ignoring Partial vs Final Transcript States
Explanation: The Web Speech API fires continuous events. If you only listen to isFinal: true, users experience delayed feedback and lose real-time correction capabilities.
Fix: Stream both partial and final transcripts. Debounce partials at 100-150ms intervals to prevent UI thrashing, then commit final transcripts immediately. Maintain a separate state buffer to reconstruct sentences when recognition restarts.
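One way to keep the separate state buffer is to store committed text and the current partial side by side, so a restarted recognizer can never overwrite what was already finalized. This is a sketch under that assumption; TranscriptState, applyChunk, and displayText are hypothetical names, and debouncing is left to the caller:

```typescript
export interface TranscriptState {
  committed: string // text from final results; never rewritten
  partial: string // latest interim guess; replaced on every partial
}

export const emptyTranscript: TranscriptState = { committed: '', partial: '' }

export function applyChunk(
  state: TranscriptState,
  chunk: { text: string; final: boolean },
): TranscriptState {
  if (chunk.final) {
    // Append the final text to what is already committed and clear the partial.
    const committed = [state.committed, chunk.text.trim()].filter(Boolean).join(' ')
    return { committed, partial: '' }
  }
  // A partial replaces the previous partial but leaves committed text alone.
  return { committed: state.committed, partial: chunk.text }
}

// Text to render: committed text followed by the in-progress partial.
export function displayText(state: TranscriptState): string {
  return [state.committed, state.partial.trim()].filter(Boolean).join(' ')
}
```

Because applyChunk is pure, the debounce can wrap only the rendering of partials while final chunks are applied immediately.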
2. Chrome Recognition Timeout Drops
Explanation: Chrome's SpeechRecognition instance terminates after ~30-60 seconds of silence or when the tab loses focus. Unhandled onend events break the streaming pipeline.
Fix: Implement an auto-restart wrapper with exponential backoff. Listen to the onend event, verify the session is still active, and reinitialize the recognizer after a delay that starts at 500ms and doubles on each consecutive failure. Track restart attempts and fall back to a manual retry UI after 3 consecutive failures.
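The restart policy can be sketched as two small functions. This assumes the 500ms figure is the base delay for the backoff; the 8-second cap and the function names are illustrative choices, not values from the package:

```typescript
// Delay before restart attempt number `attempt` (0-based): 500, 1000, 2000, ...
// capped so a long failure streak never waits unreasonably long.
export function restartDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt)
}

// After maxRestarts consecutive failures, stop retrying and show a manual retry UI.
export function shouldAutoRestart(consecutiveFailures: number, maxRestarts = 3): boolean {
  return consecutiveFailures < maxRestarts
}
```

An onend handler would call shouldAutoRestart first, then schedule the recognizer restart with setTimeout using restartDelayMs, and reset the failure counter once a result arrives.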
3. Focus Management Collisions
Explanation: Injecting text into a focused field can trigger unwanted form submissions, cursor jumps, or React controlled-component warnings if value is managed externally.
Fix: Use uncontrolled inputs, or for controlled inputs update React state first and restore the cursor in a requestAnimationFrame callback after the re-render. Before injection, verify document.activeElement === fieldRef.current. Use setRangeText() or manual selectionStart/selectionEnd manipulation to preserve cursor position. Disable form submission on Enter during active dictation.
4. Session TTL Mismatch
Explanation: If the relay's TTL is shorter than the user's dictation session, the relay drops the connection mid-sentence, causing transcript loss.
Fix: Implement a sliding TTL that resets on every POST request. Keep the idle window at 30 minutes for security, so a session expires only after 30 minutes without a transcript chunk. Log TTL expirations to identify users with unusually long sessions.
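A sliding TTL is straightforward to express as an in-memory map whose expiry is pushed forward on each access. The class below is a sketch, not the relay's actual store; SlidingSessionStore and its method names are hypothetical, and the clock is injectable so expiry can be tested without waiting:

```typescript
export class SlidingSessionStore<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>()
  private readonly ttlMs: number
  private readonly now: () => number

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs
    this.now = now
  }

  set(token: string, value: T): void {
    this.entries.set(token, { value, expiresAt: this.now() + this.ttlMs })
  }

  // Returns the session and slides its expiry forward; call on every transcript POST.
  touch(token: string): T | undefined {
    const entry = this.entries.get(token)
    if (!entry) return undefined
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(token)
      return undefined
    }
    entry.expiresAt = this.now() + this.ttlMs
    return entry.value
  }

  // Removes expired sessions and returns how many were dropped (useful for logging).
  sweep(): number {
    const current = this.now()
    let removed = 0
    for (const [token, entry] of this.entries) {
      if (entry.expiresAt <= current) {
        this.entries.delete(token)
        removed++
      }
    }
    return removed
  }
}
```

With a 30-minute window the store would be created as new SlidingSessionStore(30 * 60 * 1000), and sweep() run on a timer so abandoned sessions are released even if nobody touches them again.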
5. CORS and Relay Bottlenecks
Explanation: Overly permissive CORS headers or unoptimized relay routes cause latency spikes and potential relay abuse. SSE connections can also stall if the server buffers responses.
Fix: Validate Origin headers strictly in production. Disable response buffering on the relay route (res.flushHeaders() or equivalent). Use HTTP/2 multiplexing for SSE streams. Implement rate limiting per session token to prevent transcript flooding.
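Two pieces of this fix are small enough to sketch: framing a transcript as a Server-Sent Events message, and a per-token limiter. Both are illustrative; the event names, the fixed-window strategy, and the names formatSseEvent and TokenRateLimiter are assumptions rather than the relay's actual behavior:

```typescript
// Frames one SSE message. JSON.stringify never emits raw newlines, so the
// payload always fits on a single "data:" line.
export function formatSseEvent(event: string, payload: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`
}

// Fixed-window limiter: at most `limit` requests per `windowMs` per session token.
export class TokenRateLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>()
  private readonly limit: number
  private readonly windowMs: number

  constructor(limit: number, windowMs: number) {
    this.limit = limit
    this.windowMs = windowMs
  }

  allow(token: string, now: number = Date.now()): boolean {
    const current = this.windows.get(token)
    if (!current || now - current.start >= this.windowMs) {
      // First request, or the previous window has elapsed: start a new window.
      this.windows.set(token, { start: now, count: 1 })
      return true
    }
    if (current.count >= this.limit) return false
    current.count++
    return true
  }
}
```

A relay would typically respond with 429 when allow() returns false, and write each formatted frame to the stream immediately rather than letting it sit in a buffer.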
6. Mobile Mic Permission UX Failures
Explanation: iOS Safari and modern Chrome require explicit user gestures to activate the microphone. Auto-starting recognition on page load triggers permission denials.
Fix: Bind microphone initialization to a tap or click event. Display a clear permission prompt before calling navigator.mediaDevices.getUserMedia(). Handle NotAllowedError gracefully by guiding users to browser settings.
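Graceful handling mostly comes down to mapping the rejection's error name to guidance the user can act on. A minimal sketch follows; the function name and the message wording are placeholders, while the error names are the standard DOMException names that getUserMedia can reject with:

```typescript
// Maps a getUserMedia rejection name to user-facing guidance.
export function describeMicError(errorName: string): string {
  switch (errorName) {
    case 'NotAllowedError':
      return 'Microphone access was blocked. Enable it for this site in your browser settings, then try again.'
    case 'NotFoundError':
      return 'No microphone was found on this device.'
    case 'NotReadableError':
      return 'The microphone is in use by another app. Close it and try again.'
    default:
      return 'The microphone could not be started. Please try again.'
  }
}
```

The call to navigator.mediaDevices.getUserMedia({ audio: true }) would sit inside the tap handler, with the catch branch passing the error's name to describeMicError and rendering the result next to a retry button.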
7. Provider Lock-in and Accuracy Limits
Explanation: The Web Speech API varies in accuracy across languages and accents. Hardcoding it limits scalability for production workloads requiring higher precision.
Fix: Abstract the STT provider behind an interface. Design the mobile endpoint to accept a provider configuration that swaps between Web Speech API, Soniox, or cloud endpoints. Maintain a consistent transcript payload format so the relay and desktop hook remain provider-agnostic.
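The provider abstraction might look like the sketch below. The interface, the registry, and the scripted provider are all hypothetical; the three provider kinds are taken from the options named above, and real adapters for each would implement the same interface:

```typescript
export interface TranscriptChunk {
  text: string
  final: boolean
}

// Every STT backend emits the same chunk shape, so the relay and desktop hook
// never need to know which provider produced it.
export interface SttProvider {
  readonly name: string
  start(onChunk: (chunk: TranscriptChunk) => void): void
  stop(): void
}

export type ProviderKind = 'web-speech' | 'soniox' | 'cloud'
export type ProviderRegistry = Partial<Record<ProviderKind, () => SttProvider>>

export function resolveProvider(kind: ProviderKind, registry: ProviderRegistry): SttProvider {
  const factory = registry[kind]
  if (!factory) throw new Error(`No STT provider registered for "${kind}"`)
  return factory()
}

// Test double that replays a fixed script; handy for exercising the pipeline
// without a microphone.
export class ScriptedProvider implements SttProvider {
  readonly name = 'scripted'
  private readonly script: TranscriptChunk[]

  constructor(script: TranscriptChunk[]) {
    this.script = script
  }

  start(onChunk: (chunk: TranscriptChunk) => void): void {
    for (const chunk of this.script) onChunk(chunk)
  }

  stop(): void {}
}
```

Swapping providers then becomes a configuration change: the mobile page resolves a provider by kind and forwards whatever chunks it emits.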
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Privacy / Compliance Required | Local Web Speech API + Text Relay | Audio never leaves device; zero cloud storage | Low (text-only bandwidth) |
| High Accuracy / Multi-Language | Soniox or Cloud STT Provider | Superior phoneme modeling and language support | Medium-High (per-minute billing) |
| Low Latency / Real-Time Dictation | Optimized Local Relay | Sub-150ms text injection; no network audio roundtrip | Low |
| Enterprise SSO / Audit Logging | Cloud STT + Relay Middleware | Centralized logging, user attribution, compliance reporting | High (infra + licensing) |
Configuration Template
// config/dictation.ts
import type { DictationConfig } from '@dictation/core/types'

export const dictationConfig: DictationConfig = {
  relay: {
    endpoint: '/api/dictation',
    sessionTTL: 1800,
    maxPayloadSize: 4096,
    allowedOrigins: process.env.NODE_ENV === 'production'
      ? ['https://app.yourdomain.com']
      : ['http://localhost:3000'],
  },
  mobile: {
    locale: 'en-US',
    provider: 'web-speech', // 'web-speech' | 'soniox' | 'cloud'
    autoRestart: true,
    maxRestarts: 3,
  },
  desktop: {
    partialDebounceMs: 120,
    focusSync: true,
    overlayTheme: 'minimal',
  },
}
Quick Start Guide
- Install the core package: npm install @dictation/core
- Add the relay route at app/api/dictation/[...relay]/route.ts using the provided template
- Create the mobile capture page at app/mic/page.tsx exporting the mobile component
- Import useDictationEngine and DictationOverlay into your target component
- Bind your input field, configure the relay endpoint, and trigger the QR overlay on user action
The entire pipeline can be set up in under five minutes. Once paired, the mobile device streams recognized text through the relay into your desktop input field with zero audio transmission, full session isolation, and production-ready error handling.