Speech-to-text (STT), also called automatic speech recognition (ASR), converts spoken language into machine-readable text. State-of-the-art systems pair self-supervised acoustic encoders with transformer decoders to reach word error rates below 5 % on conversational English.
Architecture Overview
Most modern systems follow an encoder–decoder design: an acoustic encoder maps audio (typically log-mel features) into high-level representations, and a transformer decoder generates text tokens from them.
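As a concrete illustration of that split, here is a minimal sketch using the Whisper classes from Hugging Face transformers (the transformers and librosa packages are assumed installed; the checkpoint and file name are illustrative choices, not a recommendation):

```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load a small checkpoint; the processor bundles the feature extractor and tokenizer.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# 16 kHz mono float audio (librosa resamples on load).
audio, sr = librosa.load("sample.wav", sr=16000, mono=True)

# Feature extraction: raw samples -> log-mel input features for the encoder.
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# The encoder produces acoustic states; the decoder generates text tokens from them.
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```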
Quality Metrics
The standard metric is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words.
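A dependency-free sketch of that calculation, using a word-level edit distance (the example strings are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first j hypothesis words into the first i reference words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution / 6 words ≈ 0.167
```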
Design Trade-offs
- Offline vs. Streaming: RNN-T models stream with roughly 200 ms chunk latency; Whisper-large-v3 runs offline only but is more accurate (a minimal offline sketch follows this list).
- Model Size: Tiny checkpoints (~120 M parameters) fit on mobile devices yet trail larger models by roughly 6 % WER.
- Domain Adaptation: Fine-tuning on in-domain audio can cut WER by 25 %.
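For the offline path in the first bullet, a minimal sketch using the open-source openai-whisper package (assumed installed, along with ffmpeg; the file name is illustrative):

```python
import whisper

# Offline decoding: the whole file is processed in one pass, so there is no
# latency constraint and accuracy is highest.
model = whisper.load_model("large-v3")
result = model.transcribe("meeting.wav")
print(result["text"])
```

A streaming deployment would instead feed short audio chunks to an RNN-T-style recognizer and emit partial hypotheses as they stabilize.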
Current Trends (2025)
- Multi-modal ASR: Fusing lip-reading frames reduces WER to 1.5 % in noisy cafés.
- Contextual Biasing: Shallow fusion injects custom vocabulary, such as contact names, on the fly (a conceptual sketch follows this list).
- Unified STT → Translation → TTS: SeamlessM4T outputs transcript and translated speech in one pass.
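The biasing idea can be sketched without tying it to any particular toolkit: during beam search, each candidate hypothesis gets a score bonus when its most recent words complete a user-supplied phrase. The phrase list, bonus weight, and function name below are illustrative assumptions.

```python
# Hypothetical shallow-fusion-style contextual biasing during beam search.
BIAS_PHRASES = [("anna", "kowalski"), ("dr", "okafor")]  # e.g. contact names
BIAS_BONUS = 2.0  # log-probability bonus for completing a biasing phrase

def biased_score(base_log_prob: float, hypothesis_words: list[str]) -> float:
    """Rescore one beam hypothesis after a new word has been appended."""
    score = base_log_prob
    for phrase in BIAS_PHRASES:
        n = len(phrase)
        if len(hypothesis_words) >= n and \
           tuple(w.lower() for w in hypothesis_words[-n:]) == phrase:
            score += BIAS_BONUS
    return score

# The decoder would call biased_score() on every candidate extension so that
# rare in-vocabulary names outrank acoustically similar common words.
```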
Implementation Tips
- Record at 16 kHz, 16-bit PCM; resampling artifacts hurt accuracy.
- Run voice-activity detection (VAD) to skip silence and lower costs (both this step and the 16 kHz conversion appear in the preprocessing sketch after this list).
- Request segment- or word-level timestamps to align subtitles (see the subtitle sketch after this list).
- Cache encoder states across overlapping windows for real-time streaming.
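A preprocessing sketch covering the first two tips, assuming the librosa and webrtcvad packages are installed; the file name and VAD aggressiveness are illustrative choices.

```python
import librosa
import numpy as np
import webrtcvad

# Load and resample to 16 kHz mono in one step (librosa resamples internally).
audio, sr = librosa.load("call.wav", sr=16000, mono=True)

# webrtcvad expects 16-bit PCM bytes, so convert the float samples.
pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)

vad = webrtcvad.Vad(2)              # aggressiveness 0 (least) to 3 (most)
frame_len = int(sr * 0.03)          # 30 ms frames, one of the sizes webrtcvad accepts

voiced_frames = []
for start in range(0, len(pcm) - frame_len + 1, frame_len):
    frame = pcm[start:start + frame_len]
    if vad.is_speech(frame.tobytes(), sr):
        voiced_frames.append(frame)

# Send only the voiced audio to the recognizer to cut silence (and cost).
speech_only = np.concatenate(voiced_frames) if voiced_frames else np.zeros(0, dtype=np.int16)
```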
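And a subtitle sketch for the timestamps tip, assuming openai-whisper's segment output, where each segment carries start, end, and text fields (file names are illustrative):

```python
import whisper

def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,345."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = whisper.load_model("small")
result = model.transcribe("lecture.wav")   # segments include start/end times

with open("lecture.srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n")
        srt.write(seg["text"].strip() + "\n\n")
```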