Speech-to-Text

Benched.ai Editorial Team

Speech-to-text (STT), also called automatic speech recognition (ASR), converts spoken language into machine-readable text. State-of-the-art systems pair self-supervised acoustic encoders with transformer decoders to reach word error rates below 5 % on conversational English.

  Architecture Overview

Stage                    | Function                                              | Representative Models
Feature Extraction       | Waveform → log-Mel, spectrogram, or raw audio tokens  | wav2vec 2.0 frontend
Acoustic Encoder         | Encodes features into a latent sequence               | Whisper encoder, Conformer-RNN-T
Decoder / Language Model | Generates tokens via CTC, Transducer, or Seq2Seq      | Whisper decoder, RNN-T head
Post-processing          | Punctuation, capitalization, diarization              | Punctuator-2, pyannote-audio
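
As a concrete end-to-end example, the sketch below runs the open-source whisper package (an assumption; any encoder-decoder ASR toolkit exposes a similar flow) over a local audio file and prints the transcript with segment timestamps:

```python
# pip install openai-whisper   (assumed available)
import whisper

# Load a small multilingual checkpoint; larger checkpoints trade speed for accuracy.
model = whisper.load_model("base")

# transcribe() covers the whole pipeline above: log-Mel features, encoding,
# decoding, and basic post-processing such as punctuation.
result = model.transcribe("meeting.wav")

print(result["text"])               # full transcript
for seg in result["segments"]:      # per-segment timestamps, useful for subtitles
    print(f"{seg['start']:.2f}-{seg['end']:.2f}s: {seg['text']}")
```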

  Quality Metrics

Metric                   | Typical Value                     | Notes
Word Error Rate (WER)    | 2 – 7 % on LibriSpeech test-clean | Lower is better
Real-time Factor (RTF)   | 0.2 – 1.0                         | RTF < 1 ⇒ faster than real time
90th-percentile Latency  | < 300 ms (streaming)              | End-to-end
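
WER is the word-level edit distance between reference and hypothesis divided by the number of reference words; a minimal self-contained computation (no external libraries assumed) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Real-time factor is simply processing time divided by audio duration, so an RTF of 0.5 means one hour of audio is transcribed in 30 minutes.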

  Design Trade-offs

  • Offline vs. Streaming: RNN-T supports chunked decoding at roughly 200 ms latency; Whisper-large-v3 is offline-only but more accurate (see the chunking sketch after this list).
  • Model Size: Tiny checkpoints (~120 M params) fit on mobile devices but trail larger models by roughly 6 % WER.
  • Domain Adaptation: Fine-tuning on in-domain audio can cut WER by 25 %.
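
To make the streaming trade-off concrete, the sketch below slices audio into overlapping 200 ms chunks of the kind a streaming recognizer would consume. The chunk and overlap sizes are illustrative assumptions, and `recognizer` is a hypothetical stand-in for an RNN-T style streaming decoder, not a real API.

```python
import numpy as np

def stream_chunks(waveform: np.ndarray, sample_rate: int = 16000,
                  chunk_ms: int = 200, overlap_ms: int = 40):
    """Yield overlapping chunks so a streaming decoder can emit partial results."""
    chunk = int(sample_rate * chunk_ms / 1000)
    hop = chunk - int(sample_rate * overlap_ms / 1000)
    for start in range(0, max(len(waveform) - chunk + 1, 1), hop):
        yield waveform[start:start + chunk]

# Hypothetical usage with a streaming recognizer:
# for audio_chunk in stream_chunks(mic_buffer):
#     partial_text = recognizer.accept_chunk(audio_chunk)  # illustration only
```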

  Current Trends (2025)

  • Multi-modal ASR: Fusing lip-reading frames reduces WER to 1.5 % in noisy cafés.
  • Contextual Biasing: Shallow fusion injects custom vocabulary, such as contact names, at decode time (see the shallow-fusion sketch after this list).
  • Unified STT → Translation → TTS: SeamlessM4T outputs transcript and translated speech in one pass.
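
The contextual-biasing idea can be sketched as shallow fusion: each ASR hypothesis score is combined with an external language-model score plus a bonus whenever a biasing phrase appears. The weights, phrase list, and log-probabilities below are illustrative assumptions, not values from any particular system.

```python
BIAS_PHRASES = {"anna kowalska", "benched.ai"}  # e.g. contact names, product terms
LM_WEIGHT = 0.3    # λ for the external LM (assumed value)
BIAS_BONUS = 2.0   # log-score boost per matched biasing phrase (assumed value)

def shallow_fusion_score(asr_logprob: float, lm_logprob: float, hypothesis: str) -> float:
    """Rescore one n-best hypothesis: ASR score + λ·LM score + biasing bonus."""
    bonus = sum(BIAS_BONUS for phrase in BIAS_PHRASES if phrase in hypothesis.lower())
    return asr_logprob + LM_WEIGHT * lm_logprob + bonus

# Rank an n-best list (log-probabilities are made-up numbers for illustration).
nbest = [("call anna kowalska", -4.1, -2.0), ("call anna kovalska", -3.9, -2.4)]
best = max(nbest, key=lambda h: shallow_fusion_score(h[1], h[2], h[0]))
print(best[0])  # the biasing bonus favors the correctly spelled contact name
```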

  Implementation Tips

  1. Record at 16 kHz, 16-bit PCM; resampling artifacts hurt accuracy.
  2. Run voice-activity detection (VAD) to skip silence and lower costs (see the sketch after this list).
  3. Request timestamps to align subtitles.
  4. Cache encoder states across overlapping windows for real-time streaming.
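
For tips 1 and 2, the sketch below resamples a 44.1 kHz recording to 16 kHz with scipy and applies a simple energy-threshold VAD. A production system would use a trained VAD model; the frame size and threshold here are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(waveform: np.ndarray, orig_sr: int = 44100) -> np.ndarray:
    """Polyphase resampling avoids the aliasing artifacts naive decimation introduces."""
    return resample_poly(waveform, up=16000, down=orig_sr)

def energy_vad(waveform: np.ndarray, sr: int = 16000,
               frame_ms: int = 30, threshold: float = 1e-4) -> np.ndarray:
    """Keep only frames whose mean energy exceeds the (assumed) threshold."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(waveform) // frame
    frames = waveform[: n_frames * frame].reshape(n_frames, frame)
    voiced = frames[(frames ** 2).mean(axis=1) > threshold]
    return voiced.reshape(-1)

# speech = energy_vad(resample_to_16k(raw_44k_audio))  # then send `speech` to the recognizer
```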