g Removal
The first step is to strip all unsupported paralinguistic tags. This prevents literal reading and reduces unnecessary character consumption.
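// Match [tag], (tag), or <tag>; each alternative needs a matching open/close pair.
const PARALINGUISTIC_TAG_REGEX = /\[[^\]]*\]|\([^)]*\)|<[^>]*>/g;
function sanitizeTtsInput(rawText: string): string {
  // Strip tags, then collapse the double spaces they leave behind.
  return rawText.replace(PARALINGUISTIC_TAG_REGEX, '').replace(/ {2,}/g, ' ').trim();
}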
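Stripping every angle-bracket tag is safe only while the text contains no SSML yet. The following is a minimal sketch of a sanitizer that keeps an allowlist of SSML elements, so it can be re-run on already-processed text. The names `ALLOWED_SSML_TAGS` and `sanitizePreservingSsml`, and the choice of `speak` and `break` as the allowlist, are assumptions for illustration rather than part of any Inworld API.

```typescript
// Hypothetical allowlist: SSML elements that should survive sanitization.
const ALLOWED_SSML_TAGS = new Set(['speak', 'break']);

// Matches [tag], (tag), or <tag ...>; only the angle-bracket form captures a name.
const TAG_PATTERN = /\[[^\]]*\]|\([^)]*\)|<\/?([a-zA-Z]+)[^>]*>/g;

function sanitizePreservingSsml(rawText: string): string {
  return rawText
    .replace(TAG_PATTERN, (match: string, ssmlName?: string) =>
      // Keep allowlisted SSML elements; drop every other tag.
      ssmlName !== undefined && ALLOWED_SSML_TAGS.has(ssmlName.toLowerCase()) ? match : ''
    )
    .replace(/ {2,}/g, ' ') // collapse gaps left behind by removed tags
    .trim();
}

const sample = 'Well [sigh] fine. <break time="0.3s"/> (laughs) Okay <whisper>then</whisper>.';
console.log(sanitizePreservingSsml(sample));
// → Well fine. <break time="0.3s"/> Okay then.
```

Because allowlisted elements pass through untouched, applying the function twice gives the same result as applying it once.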
2. Prosodic Injection
Inworld TTS interprets specific text patterns as prosodic cues. Implement a transformation layer to convert emotional intent into these patterns.
- Emphasis: Wrap stressed words in asterisks. The engine applies vocal stress without vocalizing the asterisks.
- Pauses: Convert ellipses to SSML breaks for precise timing, or retain ellipses for natural tonal drops.
- Vocalizations: Replace meta-tags with onomatopoeia. Use ha-ha for laughter, ahh for breath, and mmm for contemplation. Hyphens help the model render rhythmic sounds.
function injectProsody(text: string): string {
  // Convert ellipses to SSML breaks for hard pauses:
  // 3-4 dots -> 0.3s pause, 5 or more dots -> 0.6s pause
  // (matching 3-4 dots avoids leaving a stray "." after a four-dot ellipsis)
  let processed = text
    .replace(/\.{5,}/g, '<break time="0.6s"/>')
    .replace(/\.{3,4}/g, '<break time="0.3s"/>');
  // Replace common tag patterns with onomatopoeia
  processed = processed
    .replace(/\[laugh\]|\(laughs\)/gi, 'ha-ha')
    .replace(/\[sigh\]|\(sighs\)/gi, 'ahh')
    .replace(/\[breathe\]/gi, 'nnn');
  return processed;
}
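The emphasis rule from the list above is not covered by `injectProsody`. One possible helper, sketched under the assumption that the caller already knows which words to stress (for example from LLM metadata), is shown here. `applyEmphasis` and `escapeRegExp` are hypothetical names.

```typescript
// Escape regex metacharacters so a stressed word is matched literally.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function applyEmphasis(text: string, stressedWords: string[]): string {
  return stressedWords
    .filter((word) => word.trim().length > 0) // ignore empty entries
    .reduce((result, word) => {
      // The lookarounds skip words that already carry asterisks,
      // so repeated calls do not stack markers.
      const pattern = new RegExp(`(?<!\\*)\\b${escapeRegExp(word)}\\b(?!\\*)`, 'gi');
      return result.replace(pattern, (match) => `*${match}*`);
    }, text);
}

console.log(applyEmphasis('I never said that.', ['never']));
// → I *never* said that.
```

Matching is case-insensitive and preserves the original casing of the word in the text.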
3. SSML Construction
Inworld accepts a subset of SSML. All text containing SSML elements must be wrapped in a <speak> tag. Use <break> for precise pauses.
- 0.2s: Short beat.
- 0.4s: Sigh-like pause.
- 0.8s: Dramatic pause before a line.
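function wrapInSsml(text: string): string {
  // Leave already-wrapped input alone so repeated calls do not nest <speak> tags
  if (text.trimStart().startsWith('<speak')) return text;
  const hasSsml = text.includes('<break') || text.includes('<prosody');
  return hasSsml ? `<speak>${text}</speak>` : text;
}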
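String concatenation leaves characters such as `&` unescaped inside the `<speak>` wrapper. A sketch of an alternative that builds the SSML from structured parts is given below. It assumes the Inworld parser expects standard XML escaping in text nodes, which is worth verifying against the API; `SsmlPart`, `escapeXmlText`, and `buildSsml` are hypothetical names.

```typescript
// A part is either spoken text or a pause of the given length in seconds.
type SsmlPart = string | { breakSeconds: number };

// Escape the characters that are significant in XML text nodes.
function escapeXmlText(text: string): string {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function buildSsml(parts: SsmlPart[]): string {
  const hasBreak = parts.some((part) => typeof part !== 'string');
  const body = parts
    .map((part) => {
      if (typeof part !== 'string') return `<break time="${part.breakSeconds}s"/>`;
      // Escape only when the result will be parsed as SSML.
      return hasBreak ? escapeXmlText(part) : part;
    })
    .join('');
  // Plain text stays plain; SSML is wrapped exactly once.
  return hasBreak ? `<speak>${body}</speak>` : body;
}

console.log(buildSsml(['Wait', { breakSeconds: 0.8 }, ' Tom & Jerry?']));
// → <speak>Wait<break time="0.8s"/> Tom &amp; Jerry?</speak>
```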
4. Parameter Tuning
Use temperature to control vocal variance and speakingRate to adjust pacing. High-emotion scenes benefit from a slightly elevated temperature and a slightly faster rate; low-emotion scenes benefit from a lower temperature and a slower rate.
interface TtsRequestParams {
  temperature: number;
  speakingRate: number;
}
function determineParams(emotionIntensity: number): TtsRequestParams {
  // Base params
  const params: TtsRequestParams = { temperature: 0.7, speakingRate: 1.0 };
  if (emotionIntensity > 0.7) {
    // High emotion: increase variance, slight speed up
    params.temperature = 0.85;
    params.speakingRate = 1.1;
  } else if (emotionIntensity < 0.3) {
    // Low emotion: reduce variance, slow down
    params.temperature = 0.6;
    params.speakingRate = 0.9;
  }
  return params;
}
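The three-bucket mapping above jumps abruptly at intensities 0.3 and 0.7. If smoother transitions are wanted, one option is to interpolate linearly between the same low and high endpoints (0.6 to 0.85 for temperature, 0.9 to 1.1 for rate). This is a sketch of that variant, not the approach the function above uses; `interpolateParams` and `lerp` are hypothetical names.

```typescript
interface InterpolatedParams {
  temperature: number;
  speakingRate: number;
}

// Linear interpolation: t = 0 gives low, t = 1 gives high.
const lerp = (low: number, high: number, t: number): number => low + (high - low) * t;

function interpolateParams(emotionIntensity: number): InterpolatedParams {
  // Clamp to [0, 1] so out-of-range intensities cannot leave the tested range.
  const t = Math.min(1, Math.max(0, emotionIntensity));
  return {
    // Round to three decimals to avoid floating-point noise in the payload.
    temperature: Number(lerp(0.6, 0.85, t).toFixed(3)),
    speakingRate: Number(lerp(0.9, 1.1, t).toFixed(3)),
  };
}

console.log(interpolateParams(0.5));
// → { temperature: 0.725, speakingRate: 1 }
```

Note that the midpoint temperature is 0.725 rather than the 0.7 baseline, because the baseline does not sit exactly halfway between the two endpoints.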
5. Architecture: The TTS Preprocessor
Encapsulate these steps in a dedicated service. Keeping the pipeline in one place makes it easier to enforce idempotency and language-specific handling, and it separates concerns: the LLM generates raw text with tags, and the TTS layer translates them into Inworld-compatible prosody.
// Placeholder alias (an assumption); substitute the stream type your HTTP client returns.
type AudioStream = ReadableStream<Uint8Array>;
export class InworldTtsEngine {
  private readonly apiKey: string;
  private readonly baseUrl: string;
  constructor(config: { apiKey: string; baseUrl: string }) {
    this.apiKey = config.apiKey;
    this.baseUrl = config.baseUrl;
  }
  async synthesize(
    rawText: string,
    voiceId: string,
    gender: VoiceGenderEnum,
    emotionIntensity: number
  ): Promise<AudioStream> {
    // 1. Sanitize
    const cleanText = sanitizeTtsInput(rawText);
    // 2. Inject Prosody
    const prosodicText = injectProsody(cleanText);
    // 3. Wrap SSML
    const finalText = wrapInSsml(prosodicText);
    // 4. Determine Params
    const params = determineParams(emotionIntensity);
    // 5. Build Request
    const payload = {
      text: finalText,
      voice_id: voiceId,
      gender: gender,
      temperature: params.temperature,
      speaking_rate: params.speakingRate,
    };
    return this.executeRequest(payload);
  }
  private async executeRequest(payload: Record<string, unknown>): Promise<AudioStream> {
    // Implementation of API call to Inworld TTS endpoint
    // Returns audio stream or buffer
    throw new Error('API implementation placeholder');
  }
}
export enum VoiceGenderEnum {
  MALE = 'VOICE_GENDER_MALE',
  FEMALE = 'VOICE_GENDER_FEMALE',
}
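The order of the steps matters. A sanitizer that removes every bracketed tag leaves nothing for a later tag-to-onomatopoeia replacement to match, and a sanitizer that runs after break insertion would delete the `<break>` elements. The sketch below shows one ordering that avoids both problems: map known tags to sounds, strip what remains, then insert breaks and wrap. `preprocess`, `SOUND_MAP`, and `ANY_TAG` are hypothetical names, and the mappings are a subset of those used earlier.

```typescript
// Known tags and the sounds that replace them.
const SOUND_MAP: Array<[RegExp, string]> = [
  [/\[laugh\]|\(laughs\)/gi, 'ha-ha'],
  [/\[sigh\]|\(sighs\)/gi, 'ahh'],
];

// Any remaining [tag], (tag), or <tag>.
const ANY_TAG = /\[[^\]]*\]|\([^)]*\)|<[^>]*>/g;

function preprocess(rawText: string): string {
  // 1. Map known tags to sounds while the tags still exist.
  let text = SOUND_MAP.reduce((acc, [pattern, sound]) => acc.replace(pattern, sound), rawText);
  // 2. Strip every tag that has no mapping, then tidy the whitespace.
  text = text.replace(ANY_TAG, '').replace(/ {2,}/g, ' ').trim();
  // 3. Insert breaks only now, so step 2 cannot delete them.
  text = text
    .replace(/\.{5,}/g, '<break time="0.6s"/>')
    .replace(/\.{3,4}/g, '<break time="0.3s"/>');
  // 4. Wrap only when SSML is present.
  return text.includes('<break') ? `<speak>${text}</speak>` : text;
}

console.log(preprocess('[laugh] Oh... really? [whisper] Sure.'));
// → <speak>ha-ha Oh<break time="0.3s"/> really? Sure.</speak>
```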
Rationale:
- Enum for Gender: Inworld requires specific enum values (VOICE_GENDER_MALE, VOICE_GENDER_FEMALE). Passing string literals like "male" results in 400 errors that are easy to miss in logs. Using an enum enforces type safety.
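- Regex Sanitization: The regex removes tags written in square brackets, parentheses, or angle brackets, preventing leakage. It also removes ordinary parenthetical text, so spoken content should not rely on parentheses.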
- SSML Wrapper: Conditional wrapping ensures plain text remains plain text, avoiding unnecessary parsing overhead.
- Parameter Mapping: Dynamic parameter adjustment based on emotion intensity allows the model to adapt its output without changing the voice ID.
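A TypeScript enum protects only code that is type-checked. Values arriving from JSON, environment variables, or user input still need a runtime check. The following sketch shows one way to normalize such input; `parseVoiceGender` is a hypothetical helper, and the enum is repeated here only to keep the example self-contained.

```typescript
enum VoiceGender {
  MALE = 'VOICE_GENDER_MALE',
  FEMALE = 'VOICE_GENDER_FEMALE',
}

function parseVoiceGender(input: string): VoiceGender {
  const normalized = input.trim().toUpperCase();
  // Accept both the short form ("male") and the full enum value.
  const candidate = normalized.startsWith('VOICE_GENDER_')
    ? normalized
    : `VOICE_GENDER_${normalized}`;
  const match = Object.values(VoiceGender).find((value) => value === candidate);
  if (match === undefined) {
    // Fail loudly here instead of letting the API reject the request later.
    throw new Error(`Unsupported voice gender: "${input}"`);
  }
  return match;
}

console.log(parseVoiceGender('male'));
// → VOICE_GENDER_MALE
```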
Pitfall Guide
- Gender Enum Mismatch
  - Explanation: Passing "male" or "female" strings to the gender field causes the API to return a 400 error, often silently in logs.
  - Fix: Use the strict enum values VOICE_GENDER_MALE and VOICE_GENDER_FEMALE. Validate inputs against these constants.
- Tag Leakage
  - Explanation: If sanitization is incomplete, tags like [sigh] may be vocalized as literal text by certain voices, breaking immersion.
  - Fix: Implement robust regex sanitization that covers brackets, parentheses, and angle brackets. Test with a diverse set of voices to ensure no leakage.
- Missing SSML Wrapper
  - Explanation: Using <break> tags without wrapping the entire text in <speak> causes the parser to fail or ignore the SSML elements.
  - Fix: Always wrap text containing SSML elements in <speak>...</speak>. Ensure the wrapper is applied only when SSML is present.
- Overuse of Breaks
  - Explanation: Inserting too many <break> tags results in robotic, staccato speech that lacks natural flow.
  - Fix: Limit breaks to meaningful pauses. Use ellipses for natural tonal drops and reserve <break> for precise timing requirements. A/B test break density.
- Onomatopoeia Ambiguity
  - Explanation: Spelling variations like haha vs ha-ha can yield different results. haha might be read as a word, while ha-ha is interpreted as a sound.
  - Fix: Use hyphens for rhythmic sounds (ha-ha, ah-ah) and standard spelling for single sounds (ahh, mmm). Verify output with audio probes.
- Temperature Misconfiguration
  - Explanation: Setting temperature too high on neutral text can introduce vocal artifacts or hallucination-like variations. Setting it too low results in flat audio.
  - Fix: Map temperature to emotional intensity. Use a baseline of 0.7 and adjust within the range of 0.6 to 0.85 based on context.
- Ignoring Audio Verification
  - Explanation: Relying on log diffs or text output to verify TTS changes misses silent failures where tags produce silence.
  - Fix: Always perform side-by-side audio comparisons when testing new prosody patterns. Implement automated audio quality checks in CI/CD if possible.
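Audio comparison remains the real test, but two of the pitfalls above (tag leakage and break overuse) can also be caught cheaply at the text level before any audio is generated. The sketch below is one possible lint step for CI. `lintTtsText` is a hypothetical helper, and any threshold applied to its break-density figure would be a project-specific choice.

```typescript
interface TtsLintResult {
  leakedTags: string[];      // tags that survived preprocessing
  breaksPer100Words: number; // rough measure of break density
}

function lintTtsText(text: string): TtsLintResult {
  const breaks = text.match(/<break\b[^>]*>/g) ?? [];
  // Remove the expected SSML elements first so only unexpected tags are reported.
  const withoutSsml = text.replace(/<\/?(speak|break)\b[^>]*>/g, ' ');
  const tagPattern = /\[[^\]]*\]|\([^)]*\)|<[^>]*>/g;
  const leakedTags = withoutSsml.match(tagPattern) ?? [];
  // Count only spoken words, excluding any leaked tags.
  const wordCount = withoutSsml.replace(tagPattern, ' ').split(/\s+/).filter(Boolean).length;
  const breaksPer100Words = wordCount === 0 ? 0 : (breaks.length / wordCount) * 100;
  return { leakedTags, breaksPer100Words };
}

console.log(lintTtsText('<speak>Hello [sigh] there</speak>').leakedTags);
// → [ '[sigh]' ]
```

A CI job could fail the build when `leakedTags` is non-empty and warn when the break density exceeds whatever limit listening tests suggest.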
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Emotion Scene | Onomatopoeia + Elevated Temp | Natural vocalization; dynamic variance | Higher chars for onomatopoeia; better UX |
| Precise Timing | SSML <break> | Deterministic pause duration | No extra cost; parsing overhead |
| Quick Emphasis | Asterisks *word* | Simple stress marker; low overhead | No extra cost |
| Budget Constraints | gTTS Fallback | Free; no API key required | Zero API cost; lower quality |
| Multi-Language | Prosodic Text + Params | Works across 15 languages; robust | Consistent cost per char |
Configuration Template
// config/inworld-tts.config.ts
// Declared first: the config object below reads VoiceGenderEnum.FEMALE at load time,
// and TypeScript rejects an enum that is used before its declaration.
export enum VoiceGenderEnum {
  MALE = 'VOICE_GENDER_MALE',
  FEMALE = 'VOICE_GENDER_FEMALE',
}
export interface InworldTtsConfig {
  apiKey: string;
  baseUrl: string;
  defaultVoiceId: string;
  defaultGender: VoiceGenderEnum;
  fallbackProvider?: 'gtts';
  maxRetries: number;
  timeoutMs: number;
}
export const INWORLD_TTS_CONFIG: InworldTtsConfig = {
  apiKey: process.env.INWORLD_API_KEY || '',
  baseUrl: 'https://api.inworld.ai/v1/tts',
  defaultVoiceId: 'voice_archetype_01',
  defaultGender: VoiceGenderEnum.FEMALE,
  fallbackProvider: 'gtts',
  maxRetries: 3,
  timeoutMs: 5000,
};
// Prosody mapping configuration
export const PROSODY_CONFIG = {
  ellipsisToBreak: {
    short: { count: 3, time: '0.3s' },
    long: { count: 5, time: '0.6s' },
  },
  onomatopoeia: {
    laugh: 'ha-ha',
    sigh: 'ahh',
    breath: 'nnn',
  },
  params: {
    baseTemperature: 0.7,
    highEmotionTemp: 0.85,
    lowEmotionTemp: 0.6,
    baseRate: 1.0,
    highEmotionRate: 1.1,
    lowEmotionRate: 0.9,
  },
};
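The template defines `maxRetries` and `timeoutMs` but does not show how they are consumed. A minimal sketch of a generic retry wrapper with a per-attempt timeout follows. `withRetry` is a hypothetical helper; it retries immediately without backoff and treats every error as retryable, both of which a production version would likely refine.

```typescript
async function withRetry<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  options: { maxRetries: number; timeoutMs: number }
): Promise<T> {
  let lastError: unknown;
  // One initial attempt plus up to maxRetries further attempts.
  for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
    const controller = new AbortController();
    // The timeout only takes effect if the operation honors the signal.
    const timer = setTimeout(() => controller.abort(), options.timeoutMs);
    try {
      return await operation(controller.signal);
    } catch (error) {
      lastError = error; // remember the failure and try again
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

The signal can be passed to any client that supports cancellation, for example as the `signal` option of `fetch`, so that a request exceeding `timeoutMs` is aborted and counted as a failed attempt.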
Quick Start Guide
- Install Dependencies: Ensure your project has TypeScript and an HTTP client. No specific Inworld SDK is required; direct API calls work.
- Define Enums and Config: Copy the VoiceGenderEnum and INWORLD_TTS_CONFIG into your project. Set your API key in environment variables.
- Implement Preprocessor: Create the sanitizeTtsInput, injectProsody, and wrapInSsml functions. Integrate them into a synthesize method that builds the request payload.
- Execute Request: Send a POST request to the Inworld TTS endpoint with the enriched text and parameters. Handle the audio response stream.
- Test and Iterate: Run audio tests with various inputs. Verify that tags are stripped, prosody is applied, and audio quality meets expectations. Adjust temperature and speakingRate as needed.