Architecting Dual-Engine Local TTS for Desktop Applications

Current Situation Analysis

Building text-to-speech (TTS) capabilities into local-first desktop applications presents a persistent engineering trade-off: audio fidelity versus execution latency and privacy guarantees. Traditional cloud-based TTS APIs deliver consistent, high-quality voices but introduce network dependencies, recurring costs, and data exfiltration risks. Conversely, browser-native synthesis APIs eliminate external calls but suffer from fragmented voice libraries, inconsistent prosody across operating systems, and mechanical delivery that breaks immersion in conversational interfaces.

This problem is frequently misunderstood as a simple API integration task. In practice, production-grade local TTS requires careful orchestration of model lifecycle management, text preprocessing pipelines, isolated playback state, and cross-platform audio routing. Developers often underestimate the complexity of handling large neural model weights (~100MB ONNX files), managing asynchronous voice enumeration in browser environments, and preventing audio state collisions when multiple messages trigger playback simultaneously.

Data from local AI application deployments shows that users expect instant feedback for short interactions but demand natural prosody for extended reading. Relying on a single engine forces a compromise: either accept robotic output for zero-latency playback, or endure multi-second synthesis delays for premium quality. The architectural solution lies in decoupling engine selection from playback logic, implementing lazy resource loading, and isolating state per conversational unit. This approach preserves privacy, eliminates API dependencies, and delivers a tiered experience that adapts to user preferences and hardware constraints.

WOW Moment: Key Findings

The dual-engine architecture resolves the fidelity-latency paradox by routing requests through the most appropriate synthesis path based on context and configuration. The following comparison highlights the operational characteristics of each approach:

Approach	Audio Fidelity	First-Play Latency	Storage Footprint	Cross-Platform Consistency
Browser SpeechSynthesis	Low-Medium	<100ms	0MB	Low (OS-dependent voice libraries)
Supertonic ST-TTS	High	1.2–3.0s (initial)	~100MB (ONNX weights)	High (deterministic neural output)

This finding matters because it enables developers to offer a zero-friction fallback while reserving premium synthesis for users who prioritize natural delivery. The architectural implication is clear: abstract engine selection behind a unified interface, defer heavy resource allocation until first use, and isolate playback state to prevent UI desynchronization. This pattern transforms TTS from a monolithic feature into a composable, user-configurable subsystem.
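As a rough illustration of routing on both context and configuration, the sketch below falls back to the browser engine for short text when the neural model is not yet loaded. The `RoutingContext` shape, the `chooseEngine` name, and the 80-character threshold are assumptions made for this example; they are not part of the plugin or of the router shown later.

```typescript
type EngineId = 'browser' | 'neural';

interface RoutingContext {
  preferred: EngineId;        // the engine the user selected in settings
  neuralModelLoaded: boolean; // whether the ONNX model is already in memory
  textLength: number;         // characters in the sanitized message
}

// Hypothetical cutoff: below this length a cold model load is not worth the wait.
const SHORT_TEXT_THRESHOLD = 80;

function chooseEngine(ctx: RoutingContext): EngineId {
  if (ctx.preferred === 'browser') return 'browser';
  // Neural is preferred. Use the browser engine only when the text is short
  // and the model would have to be loaded first.
  if (!ctx.neuralModelLoaded && ctx.textLength < SHORT_TEXT_THRESHOLD) return 'browser';
  return 'neural';
}
```

Whether a silent fallback like this is desirable depends on the product; some applications may prefer to always honor the configured engine and show a loading state instead.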

Core Solution

Implementing a dual-engine TTS system requires separating configuration, state management, text preprocessing, and audio routing into distinct layers. The following architecture leverages Tauri 2's plugin system to isolate Rust/ONNX complexity while maintaining a lightweight TypeScript frontend.

Step 1: Plugin Abstraction & Interface Definition

Avoid custom Tauri command handlers. Instead, utilize the official plugin ecosystem to encapsulate model management, voice enumeration, and WAV synthesis at the Rust layer. The frontend interacts exclusively with a JavaScript facade.

// tts-facade.ts
export interface TtsEngineFacade {
  initializeModel(): Promise<void>;
  generateAudio(text: string, locale: string): Promise<string>;
  enumerateVoices(): Promise<VoiceProfile[]>;
  setActiveVoice(id: string): Promise<void>;
}

export interface VoiceProfile {
  id: string;
  name: string;
  locale: string;
  quality: 'standard' | 'neural';
}

Rationale: The plugin boundary prevents frontend code from directly managing ONNX runtime sessions or file I/O. Extending capabilities (e.g., streaming chunks, voice cloning) only requires backend updates, preserving frontend stability.
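One practical benefit of the facade is that it can be swapped for an in-memory stub in tests, so router and UI logic can be exercised without the native plugin. The following is a minimal sketch; `createStubFacade`, the `calls` log, and the placeholder return string are hypothetical and exist only for this example. The two interfaces are repeated so the block stands alone.

```typescript
interface VoiceProfile {
  id: string;
  name: string;
  locale: string;
  quality: 'standard' | 'neural';
}

interface TtsEngineFacade {
  initializeModel(): Promise<void>;
  generateAudio(text: string, locale: string): Promise<string>;
  enumerateVoices(): Promise<VoiceProfile[]>;
  setActiveVoice(id: string): Promise<void>;
}

// Builds a fake facade that records every call instead of touching native code.
function createStubFacade(voices: VoiceProfile[]): { facade: TtsEngineFacade; calls: string[] } {
  const calls: string[] = [];
  let ready = false;

  const facade: TtsEngineFacade = {
    async initializeModel() {
      calls.push('initializeModel');
      ready = true;
    },
    async generateAudio(_text: string, locale: string) {
      calls.push(`generateAudio:${locale}`);
      if (!ready) throw new Error('model not initialized');
      return 'stub-audio'; // placeholder, not real base64 WAV data
    },
    async enumerateVoices() {
      calls.push('enumerateVoices');
      return voices;
    },
    async setActiveVoice(id: string) {
      calls.push(`setActiveVoice:${id}`);
      if (!voices.some((v) => v.id === id)) throw new Error(`unknown voice: ${id}`);
    },
  };

  return { facade, calls };
}
```

A test can pass `facade` wherever the real plugin wrapper is expected and then inspect `calls` to confirm, for example, that the model was initialized before the first synthesis request.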

Step 2: State Isolation with Dual Stores

Separate configuration from playback execution. Use two independent state containers to prevent cross-contamination of settings and runtime data.

// config-registry.ts
import { create } from 'zustand';
import { persist } from 'zustand/middleware';

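type TtsConfig = {
  activeEngine: 'browser' | 'neural';
  browser: { voiceId: string; rate: number; pitch: number };
  neural: { voiceId: string; rate: number };
};

// The actions are part of the store type; without them the object below does not type-check.
type TtsConfigActions = {
  setEngine: (engine: TtsConfig['activeEngine']) => void;
  updateBrowserSettings: (patch: Partial<TtsConfig['browser']>) => void;
  updateNeuralSettings: (patch: Partial<TtsConfig['neural']>) => void;
};

// The curried create<T>()(...) form lets TypeScript infer types through the middleware.
export const useConfigRegistry = create<TtsConfig & TtsConfigActions>()(
  persist(
    (set) => ({
      activeEngine: 'browser',
      browser: { voiceId: 'default', rate: 1.0, pitch: 1.0 },
      neural: { voiceId: 'default', rate: 1.0 },
      setEngine: (engine) => set({ activeEngine: engine }),
      updateBrowserSettings: (patch) => set((state) => ({ browser: { ...state.browser, ...patch } })),
      updateNeuralSettings: (patch) => set((state) => ({ neural: { ...state.neural, ...patch } })),
    }),
    { name: 'tts-config-v1' }
  )
);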

// playback-orchestrator.ts
import { create } from 'zustand';

type PlaybackSession = {
  messageId: string;
  status: 'idle' | 'loading' | 'playing' | 'error';
  audioInstance: HTMLAudioElement | null;
};

export const usePlaybackOrchestrator = create<{
  sessions: Record<string, PlaybackSession>;
  startSession: (id: string) => void;
  updateSession: (id: string, patch: Partial<PlaybackSession>) => void;
  terminateSession: (id: string) => void;
}>((set) => ({
  sessions: {},
  startSession: (id) => set((state) => ({
    sessions: { ...state.sessions, [id]: { messageId: id, status: 'loading', audioInstance: null } }
  })),
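  updateSession: (id, patch) => set((state) => {
    const current = state.sessions[id];
    // Ignore late patches for sessions that were already terminated.
    if (!current) return state;
    return { sessions: { ...state.sessions, [id]: { ...current, ...patch } } };
  }),
  terminateSession: (id) => set((state) => {
    const next = { ...state.sessions };
    // Stop any audio still attached to this session before dropping it.
    next[id]?.audioInstance?.pause();
    delete next[id];
    return { sessions: next };
  }),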
}));

Rationale: Per-message session tracking prevents global playback conflicts. Each conversational unit maintains independent loading, playing, and error states. The persist middleware keeps configuration across application restarts, and the versioned storage key (tts-config-v1) leaves room to migrate older saved settings if the schema changes.
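To make the isolation property concrete, here is a small framework-free sketch of how a message bubble might derive its own status from the session map. The `statusFor` and `withStatus` helpers and the simplified `SessionSnapshot` type are assumptions for illustration; they mirror the store's shape but are not part of it.

```typescript
type PlaybackStatus = 'idle' | 'loading' | 'playing' | 'error';

interface SessionSnapshot {
  messageId: string;
  status: PlaybackStatus;
}

type SessionMap = Record<string, SessionSnapshot>;

// A message with no session entry has never been played, so it reads as idle.
function statusFor(sessions: SessionMap, messageId: string): PlaybackStatus {
  return sessions[messageId]?.status ?? 'idle';
}

// Immutable update that touches exactly one session and leaves the rest as-is.
function withStatus(sessions: SessionMap, messageId: string, status: PlaybackStatus): SessionMap {
  return { ...sessions, [messageId]: { messageId, status } };
}

// Two bubbles, one playing: the other keeps reporting idle.
const demoSessions = withStatus({}, 'msg-1', 'playing');
const firstBubble = statusFor(demoSessions, 'msg-1');  // 'playing'
const secondBubble = statusFor(demoSessions, 'msg-2'); // 'idle'
```

In a React component, the same lookup would typically sit inside a store selector so each bubble re-renders only when its own session changes.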

Step 3: Text Sanitization Pipeline

Neural and browser TTS engines fail predictably when exposed to raw markdown, code blocks, or mathematical notation. Implement a deterministic cleaning function before dispatch.

// text-sanitizer.ts
export function sanitizeForAudio(raw: string): string {
  let cleaned = raw;
  
  // Remove fenced code blocks
  cleaned = cleaned.replace(/```[\s\S]*?```/g, '');
  // Remove inline code
  cleaned = cleaned.replace(/`([^`]+)`/g, '$1');
  // Strip markdown formatting
  cleaned = cleaned.replace(/[*_~`#]/g, '');
  // Remove HTML tags
  cleaned = cleaned.replace(/<[^>]*>/g, '');
  // Normalize whitespace
  cleaned = cleaned.replace(/\s+/g, ' ').trim();
  
  return cleaned;
}

Rationale: Preprocessing occurs synchronously before engine dispatch. This prevents synthesis engines from attempting phonetic rendering of syntax characters, which causes audible glitches or silent failures.
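One gap worth noting: the character-stripping pass above does not remove square brackets or parentheses, so a markdown link such as [the docs](https://example.com/a) would reach the engine with its URL intact. A possible extra pass, run before the formatting step, is sketched below. The `stripMarkdownLinks` name is hypothetical, and the patterns are deliberately simple (they do not handle URLs that themselves contain parentheses).

```typescript
function stripMarkdownLinks(input: string): string {
  return input
    // Images carry nothing speakable, so drop them entirely.
    .replace(/!\[[^\]]*\]\([^)]*\)/g, '')
    // Inline links: keep the visible label and discard the URL.
    .replace(/\[([^\]]+)\]\([^)]*\)/g, '$1');
}

const spoken = stripMarkdownLinks('See [the docs](https://example.com/a) now.');
// spoken === 'See the docs now.'
```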

Step 4: Engine Routing & Playback Execution

The router evaluates the active configuration, initializes resources if necessary, and delegates to the appropriate synthesis path.

// tts-router.ts
import { useConfigRegistry } from './config-registry';
import { usePlaybackOrchestrator } from './playback-orchestrator';
import { sanitizeForAudio } from './text-sanitizer';
import type { TtsEngineFacade } from './tts-facade';

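let neuralModelReady = false;

export async function dispatchAudio(sessionId: string, rawText: string, facade: TtsEngineFacade): Promise<void> {
  const config = useConfigRegistry.getState();
  const orchestrator = usePlaybackOrchestrator.getState();

  // startSession already puts the session into the 'loading' state.
  orchestrator.startSession(sessionId);

  const cleanText = sanitizeForAudio(rawText);
  if (!cleanText) {
    // Nothing speakable is left after sanitization (e.g. a code-only message).
    orchestrator.updateSession(sessionId, { status: 'error' });
    return;
  }

  try {
    if (config.activeEngine === 'browser') {
      orchestrator.updateSession(sessionId, { status: 'playing' });
      await executeBrowserSynthesis(cleanText, config.browser);
      orchestrator.updateSession(sessionId, { status: 'idle' });
    } else {
      if (!neuralModelReady) {
        await facade.initializeModel();
        neuralModelReady = true;
      }
      // The locale is hard-coded here; a real app would derive it from the selected voice.
      const wavBase64 = await facade.generateAudio(cleanText, 'en');
      const audio = new Audio(`data:audio/wav;base64,${wavBase64}`);
      // Return the session to idle when playback finishes, and surface decode errors.
      audio.onended = () => orchestrator.updateSession(sessionId, { status: 'idle', audioInstance: null });
      audio.onerror = () => orchestrator.updateSession(sessionId, { status: 'error', audioInstance: null });
      orchestrator.updateSession(sessionId, { status: 'playing', audioInstance: audio });
      // play() returns a promise that rejects if playback cannot start.
      await audio.play();
    }
  } catch (err) {
    console.error('TTS dispatch failed:', err);
    orchestrator.updateSession(sessionId, { status: 'error' });
  }
}

// Resolves once the last queued utterance has finished (or the queue was cancelled).
function executeBrowserSynthesis(text: string, settings: { voiceId: string; rate: number; pitch: number }): Promise<void> {
  const synth = window.speechSynthesis;
  synth.cancel(); // drop anything still queued from a previous message

  // Split on sentence-ending punctuation so each sentence becomes its own utterance.
  const chunks = text.split(/(?<=[.!?])\s+/).filter(Boolean);
  if (chunks.length === 0) return Promise.resolve();

  // Look the voice up once instead of once per chunk.
  const target = synth.getVoices().find((v) => v.voiceURI === settings.voiceId);

  return new Promise((resolve, reject) => {
    chunks.forEach((chunk, index) => {
      const utterance = new SpeechSynthesisUtterance(chunk);
      utterance.rate = settings.rate;
      utterance.pitch = settings.pitch;
      if (target) utterance.voice = target;
      utterance.onerror = (event) => {
        // A later cancel() is not a failure of this request.
        if (event.error === 'canceled' || event.error === 'interrupted') resolve();
        else reject(new Error(`speech synthesis failed: ${event.error}`));
      };
      if (index === chunks.length - 1) utterance.onend = () => resolve();
      synth.speak(utterance);
    });
  });
}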

Rationale: The browser path splits the text at sentence-ending punctuation and queues one utterance per sentence, which gives the engine natural pause points. Neural synthesis returns base64-encoded WAV data, which HTMLAudioElement can play directly from a data URI without external dependencies. Lazy initialization avoids startup overhead: only the first neural request pays the model-loading cost, and later requests skip it.

Pitfall Guide

1. Blocking the Main Thread During Model Initialization

Explanation: Loading a ~100MB ONNX model synchronously freezes the UI, triggering browser watchdog timeouts or Tauri window unresponsiveness. Fix: Always invoke initializeModel() asynchronously. Display a non-blocking progress indicator and queue synthesis requests until initialization completes.
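One way to queue requests behind a single model load is to share the initialization promise rather than a boolean flag, so that concurrent callers all wait on the same load instead of each starting their own. The sketch below assumes only that initialization is exposed as a function returning a promise; `createLazyInitializer` is a hypothetical helper name.

```typescript
function createLazyInitializer(init: () => Promise<void>): () => Promise<void> {
  let pending: Promise<void> | null = null;

  return () => {
    if (pending === null) {
      pending = init().catch((err: unknown) => {
        // Forget the failed attempt so a later call can retry.
        pending = null;
        throw err;
      });
    }
    // Every caller, first or later, awaits the same promise.
    return pending;
  };
}

// Usage sketch (facade is assumed to expose initializeModel()):
//   const ensureModelReady = createLazyInitializer(() => facade.initializeModel());
//   await ensureModelReady(); // first call starts the load
//   await ensureModelReady(); // later calls reuse the same promise
```

Compared with a ready flag, this closes the window in which two dispatches both see the flag as false and trigger two loads.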

2. TTS Engines Choking on Unsanitized Markup

Explanation: Feeding raw markdown or LaTeX directly to synthesis engines produces phonetic garbage or silent failures. Neural models attempt to pronounce asterisks, brackets, and code syntax. Fix: Implement a deterministic sanitization pipeline that strips formatting, code blocks, and HTML before dispatch. Never trust raw message content.

3. Global Playback State Causing UI Desynchronization

Explanation: Using a single isPlaying boolean forces all message bubbles to reflect the same state. Clicking play on one message incorrectly updates others. Fix: Track playback sessions by unique message ID. Maintain a dictionary of session objects with independent status, audio instances, and error flags.

4. Browser Voice List Asynchronicity

Explanation: speechSynthesis.getVoices() can return an empty array during initial app load, because some engines populate the voice list asynchronously. Relying on a single synchronous call results in missing voice options. Fix: Check getVoices() once immediately, and if the list is empty, attach a listener to the voiceschanged event and populate the voice dropdown when it fires. Cache the result so the list is not re-enumerated on every render.
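A minimal sketch of that pattern follows. It is written against a small `SynthLike` interface (an assumption for this example) so it can be exercised outside a browser; `window.speechSynthesis` should satisfy the same shape structurally. The `watchVoices` name is hypothetical.

```typescript
interface VoiceLike {
  voiceURI: string;
  name: string;
  lang: string;
}

interface SynthLike {
  getVoices(): VoiceLike[];
  addEventListener(type: 'voiceschanged', listener: () => void): void;
  removeEventListener(type: 'voiceschanged', listener: () => void): void;
}

// Calls onReady once with a non-empty voice list; returns an unsubscribe function.
function watchVoices(synth: SynthLike, onReady: (voices: VoiceLike[]) => void): () => void {
  const initial = synth.getVoices();
  if (initial.length > 0) {
    // Some engines have the list available synchronously.
    onReady(initial);
    return () => {};
  }

  const listener = () => {
    const voices = synth.getVoices();
    if (voices.length === 0) return; // keep waiting for a populated list
    synth.removeEventListener('voiceschanged', listener);
    onReady(voices);
  };
  synth.addEventListener('voiceschanged', listener);
  return () => synth.removeEventListener('voiceschanged', listener);
}

// Usage sketch in the app:
//   const stop = watchVoices(window.speechSynthesis, (voices) => cacheVoices(voices));
//   (cacheVoices is a placeholder for whatever stores the list)
```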

5. Base64 WAV Memory Accumulation

Explanation: Generating multiple base64 audio strings without cleanup increases heap usage. Long conversations with frequent playback can trigger garbage collection pauses. Fix: Reuse Audio instances where possible, or explicitly nullify references after playback completes. For high-frequency usage, consider streaming chunks instead of full-message base64 payloads.
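To get a feel for the sizes involved, the arithmetic below estimates an uncompressed PCM WAV payload and its base64-encoded length. The audio parameters (44.1 kHz, mono, 16-bit) are illustrative assumptions, not confirmed output settings of the neural engine; substitute the values your plugin actually produces.

```typescript
const WAV_HEADER_BYTES = 44; // canonical PCM WAV header size

// Size in bytes of an uncompressed PCM WAV clip.
function pcmWavBytes(seconds: number, sampleRate: number, channels: number, bytesPerSample: number): number {
  return WAV_HEADER_BYTES + Math.round(seconds * sampleRate) * channels * bytesPerSample;
}

// Base64 emits 4 characters for every 3 input bytes, rounded up to a full group.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Assumed format: 10 seconds, 44.1 kHz, mono, 16-bit samples.
const clipBytes = pcmWavBytes(10, 44100, 1, 2); // 882044 bytes
const clipChars = base64Length(clipBytes);      // 1176060 characters
```

Under these assumptions, each ten-second message costs a bit over a megabyte as a string before the audio element decodes it, which is why references to finished clips should not be left lingering in session state.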

6. Ignoring Synthesis Request Concurrency

Explanation: Rapid user clicks trigger multiple overlapping synthesis calls. Neural engines may queue requests internally, causing delayed playback or state corruption. Fix: Implement a request queue or debounce mechanism. Cancel pending synthesis before initiating a new session. Expose an isSynthesizing flag to disable UI controls during processing.
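When the backend offers no way to cancel an in-flight synthesis call, a lightweight alternative is a "latest request wins" guard: each request takes a token, and a result is discarded if a newer token was issued while it was pending. This is a sketch of that idea only; `createRequestGate` is a hypothetical helper, not a plugin API.

```typescript
function createRequestGate() {
  let latest = 0;
  return {
    // Call at the start of each synthesis request.
    begin(): number {
      latest += 1;
      return latest;
    },
    // Call after the async work returns; false means a newer request superseded this one.
    isCurrent(token: number): boolean {
      return token === latest;
    },
  };
}

// Usage sketch inside an async dispatch function:
//   const token = gate.begin();
//   const wav = await facade.generateAudio(text, 'en');
//   if (!gate.isCurrent(token)) return; // stale result: do not play it
```

This does not stop the superseded synthesis from consuming CPU; it only prevents stale audio from playing over the newest request.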

7. Cross-Platform Pitch/Rate Inconsistency

Explanation: Browser engines interpret rate and pitch values differently across Windows, macOS, and Linux. A rate of 1.5 may sound normal on one OS and accelerated on another. Fix: Normalize parameters per engine. Document OS-specific limits, apply clamping functions, and provide a test playback button so users can calibrate expectations before committing.
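A clamping helper of the kind described might look like the sketch below. The ranges used in the example mirror the rateRange values from the Configuration Template later in this article; the `clampToRange` name and the fallback parameter are assumptions for illustration.

```typescript
type NumericRange = readonly [number, number];

// Keeps a user-supplied value inside [min, max]; non-finite input falls back to a default.
function clampToRange(value: number, range: NumericRange, fallback: number): number {
  if (!Number.isFinite(value)) return fallback;
  const [min, max] = range;
  return Math.min(max, Math.max(min, value));
}

const browserRate = clampToRange(3.0, [0.5, 2.0], 1.0); // 2
const neuralRate = clampToRange(0.2, [0.8, 1.5], 1.0);  // 0.8
```

Applying the clamp at the point where settings are written to the store, rather than at playback time, keeps persisted values valid for whichever engine reads them later.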

Production Bundle

Action Checklist

Initialize Tauri plugin for Supertonic ST-TTS and verify ONNX runtime bindings
Create dual Zustand stores: one for persistent configuration, one for ephemeral playback sessions
Implement text sanitization pipeline to strip markdown, code, and HTML before dispatch
Wire lazy model initialization with async loading state and user notification
Route synthesis requests through browser API or neural engine based on active configuration
Attach voiceschanged listener for browser voice enumeration and cache results
Implement per-message session tracking to prevent UI state collisions
Add error boundaries and fallback logging for synthesis failures

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Low-latency requirement (UI feedback, short prompts)	Browser SpeechSynthesis	Instant dispatch, zero storage, native OS optimization	$0, 0MB footprint
High-fidelity requirement (long-form reading, premium UX)	Supertonic ST-TTS	Deterministic neural prosody, consistent cross-platform output	~100MB initial download, higher CPU usage
Offline-only deployment	Supertonic ST-TTS	Fully local execution, no network dependency, privacy-compliant	One-time model fetch, no recurring costs
Memory-constrained devices (<4GB RAM)	Browser SpeechSynthesis	Zero model allocation, OS-managed audio pipeline	Minimal RAM/CPU overhead

Configuration Template

// app-config.ts
export const TTS_CONFIG = {
  engines: {
    browser: {
      enabled: true,
      defaultRate: 1.0,
      defaultPitch: 1.0,
      rateRange: [0.5, 2.0],
      pitchRange: [0.0, 2.0],
    },
    neural: {
      enabled: true,
      modelUrl: 'https://cdn.example.com/st-tts-v1.onnx',
      modelSizeMB: 100,
      supportedLocales: ['en', 'es', 'fr', 'de', 'ja'],
      defaultRate: 1.0,
      rateRange: [0.8, 1.5],
    },
  },
  state: {
    persistKey: 'tts-config-v1',
    sessionTimeoutMs: 30000,
    maxConcurrentSynthesis: 1,
  },
  ui: {
    showLoadingToast: true,
    autoStopOnNavigation: true,
    enableVoicePreview: true,
  },
};

Quick Start Guide

1. Install Plugin Dependencies: Add tauri-plugin-supertonic to your Rust backend and tauri-plugin-supertonic-api to your frontend package. Run cargo tauri dev to verify native bindings.
2. Initialize State Containers: Copy the dual-store pattern into your project. Configure persistence middleware and define session tracking interfaces.
3. Wire the Sanitizer & Router: Implement the text cleaning function and connect it to the dispatch router. Ensure browser synthesis and neural synthesis paths are isolated.
4. Test Cross-Platform Voices: Launch the application on Windows, macOS, and Linux. Verify voice enumeration, playback latency, and model download behavior. Adjust rate/pitch clamps per OS if necessary.
5. Deploy & Monitor: Ship the application with lazy model loading enabled. Monitor synthesis error rates and memory usage during extended sessions. Iterate on queueing strategies if users report overlapping playback.
