Text-to-speech (TTS) technology converts written text into natural-sounding audio using a pipeline of linguistic preprocessing, acoustic modeling, and vocoding. Modern neural TTS systems achieve near-human prosody and can be fine-tuned to bespoke voices with only minutes of reference audio.
Architecture Overview
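The stages named in the introduction chain together in a fixed order: text normalization and grapheme-to-phoneme (G2P) conversion, an acoustic model that predicts a mel-spectrogram, and a vocoder that renders the waveform. Below is a minimal sketch of that pipeline; the function bodies are illustrative stubs, not a real implementation, and a production system would swap in a trained acoustic model (e.g., FastPitch) and vocoder (e.g., HiFi-GAN):

```python
# Minimal sketch of the three-stage TTS pipeline described above.
# All stage implementations are placeholder stubs.

def normalize(text: str) -> str:
    """Expand numbers/abbreviations into spoken words (stub)."""
    return text.replace("Dr.", "Doctor")

def g2p(text: str) -> list[str]:
    """Grapheme-to-phoneme: map words to phoneme symbols (stub).
    A real G2P module emits ARPAbet or IPA symbols."""
    return text.split()

def acoustic_model(phonemes: list[str]) -> list[list[float]]:
    """Predict mel-spectrogram frames from phonemes (stub).
    80 mel bins is a common choice."""
    return [[0.0] * 80 for _ in phonemes]

def vocoder(mel: list[list[float]]) -> list[float]:
    """Convert mel frames to waveform samples (stub).
    ~256 samples per frame is typical at 22.05 kHz."""
    return [0.0] * (len(mel) * 256)

def tts(text: str) -> list[float]:
    return vocoder(acoustic_model(g2p(normalize(text))))

print(len(tts("Dr. Smith arrives at 10 a.m.")))
```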
Key Quality Metrics
Design Trade-offs
- Latency vs. Naturalness: Autoregressive models such as Tacotron 2 deliver superior prosody but decode more slowly than parallel FastSpeech-style models.
- Memory Footprint: A compact HiFi-GAN vocoder variant weighs in at roughly 2 M parameters versus roughly 24 M for WaveRNN; choose based on device constraints.
- Voice Consistency: Few-shot speaker adaptation can drift over long passages; stabilize speaker embeddings across the utterance sequence.
Current Trends (2025)
- Zero-shot voice cloning via large audio language models (e.g., Voicebox) with 2-second reference clips.
- Expressive TTS: Style tokens and emotion embeddings enable whispered, shouted, or singing outputs.
- On-device TTS: The Apple Neural Engine and Qualcomm Sensing Hub run offline voices at roughly 30 ms latency.
- Multilingual Models: Seamless code-switching between 30+ languages in a single checkpoint.
Implementation Tips
- Normalize text (numbers, abbreviations) before G2P to avoid pronunciation errors; a minimal normalization sketch follows this list.
- Cache vocoder kernels on the GPU to reduce tail latency; see the warm-up sketch below.
- Post-process with dynamic range compression to equalize loudness across utterances; see the loudness sketch below.
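A minimal normalization sketch, assuming a tiny illustrative abbreviation table and naive digit-by-digit number expansion; production text normalizers handle cardinals, ordinals, dates, and currency as separate semiotic classes:

```python
import re

# Illustrative abbreviation table; a real normalizer's is far larger.
_ABBREV = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

_ONES = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

def spell_digits(match: re.Match) -> str:
    # Naive digit-by-digit expansion; real systems expand "221"
    # as a cardinal ("two hundred twenty-one") when appropriate.
    return " ".join(_ONES[int(d)] for d in match.group())

def normalize(text: str) -> str:
    for abbrev, full in _ABBREV.items():
        text = text.replace(abbrev, full)
    return re.sub(r"\d+", spell_digits, text)

print(normalize("Dr. Lee lives at 221 Baker St."))
# -> "Doctor Lee lives at two two one Baker Street"
```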
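A hedged warm-up sketch in PyTorch: a few dummy forward passes populate cuDNN's autotuned kernel cache before the first real request, which trims tail latency. The Conv1d here is a toy stand-in for any vocoder module:

```python
import torch

# Cache the fastest convolution algorithms for fixed input shapes.
torch.backends.cudnn.benchmark = True

vocoder = torch.nn.Conv1d(80, 1, kernel_size=7, padding=3)  # toy stand-in
if torch.cuda.is_available():
    vocoder = vocoder.cuda()

dummy_mel = torch.zeros(1, 80, 200)  # batch x mel-bins x frames
if torch.cuda.is_available():
    dummy_mel = dummy_mel.cuda()

with torch.no_grad():
    for _ in range(3):  # a few passes to trigger kernel autotuning
        vocoder(dummy_mel)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure warm-up actually completes
```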
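A simple loudness sketch using RMS gain matching plus a tanh soft limiter as a stand-in for a full dynamic range compressor; the target level here is an assumption, and production pipelines often use EBU R128 loudness normalization instead:

```python
import numpy as np

TARGET_RMS = 0.1  # assumed target level in linear amplitude

def equalize_loudness(wav: np.ndarray) -> np.ndarray:
    rms = np.sqrt(np.mean(wav ** 2))
    gain = TARGET_RMS / max(rms, 1e-8)  # avoid divide-by-zero on silence
    out = wav * gain
    return np.tanh(out)  # soft-clip any peaks pushed past +/-1.0

# One second of a 440 Hz tone at 22.05 kHz, as a stand-in utterance.
utterance = 0.5 * np.sin(np.linspace(0, 2 * np.pi * 440, 22050))
print(equalize_loudness(utterance).max())
```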