Speech-to-Text

Benched.ai Editorial Team

Speech-to-text (STT), also called automatic speech recognition (ASR), converts spoken language into machine-readable text. State-of-the-art systems pair self-supervised acoustic encoders with transformer decoders to reach word error rates below 5 % on conversational English.

  Architecture Overview

Stage                    | Function                                              | Representative Models
Feature Extraction       | Waveform → log-Mel, spectrogram, or raw audio tokens  | wav2vec 2.0 frontend
Acoustic Encoder         | Encodes features into a latent sequence               | Whisper encoder, Conformer-RNN-T
Decoder / Language Model | Generates tokens via CTC, Transducer, or Seq2Seq      | Whisper decoder, RNN-T head
Post-processing          | Punctuation, capitalization, diarization              | Punctuator-2, pyannote-audio
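
As a concrete end-to-end example, the sketch below runs the open-source whisper package (an assumption; any encoder-decoder ASR toolkit exposes a similar flow) over a local audio file and prints the transcript with segment timestamps:

```python
# pip install openai-whisper   (assumed available)
import whisper

# Load a small multilingual checkpoint; larger checkpoints trade speed for accuracy.
model = whisper.load_model("base")

# transcribe() covers the whole pipeline above: log-Mel features, encoding,
# decoding, and basic post-processing such as punctuation.
result = model.transcribe("meeting.wav")

print(result["text"])               # full transcript
for seg in result["segments"]:      # per-segment timestamps, useful for subtitles
    print(f"{seg['start']:.2f}-{seg['end']:.2f}s: {seg['text']}")
```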

  Quality Metrics

Metric                   | Typical Value                     | Notes
Word Error Rate (WER)    | 2 – 7 % on LibriSpeech test-clean | Lower is better
Real-time Factor (RTF)   | 0.2 – 1.0                         | RTF < 1 ⇒ faster than real time
90th-percentile Latency  | < 300 ms (streaming)              | End-to-end
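
WER is the word-level edit distance between reference and hypothesis divided by the number of reference words; a minimal self-contained computation (no external libraries assumed) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Real-time factor is simply processing time divided by audio duration, so an RTF of 0.5 means one hour of audio is transcribed in 30 minutes.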

  Design Trade-offs

  • Offline vs. Streaming: RNN-T supports chunked decoding at roughly 200 ms latency; Whisper-large-v3 is offline-only but more accurate (see the chunking sketch after this list).
  • Model Size: Tiny checkpoints (~120 M params) fit on mobile devices but trail larger models by roughly 6 % WER.
  • Domain Adaptation: Fine-tuning on in-domain audio can cut WER by 25 %.
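
To make the streaming trade-off concrete, the sketch below slices audio into overlapping 200 ms chunks of the kind a streaming recognizer would consume. The chunk and overlap sizes are illustrative assumptions, and `recognizer` is a hypothetical stand-in for an RNN-T style streaming decoder, not a real API.

```python
import numpy as np

def stream_chunks(waveform: np.ndarray, sample_rate: int = 16000,
                  chunk_ms: int = 200, overlap_ms: int = 40):
    """Yield overlapping chunks so a streaming decoder can emit partial results."""
    chunk = int(sample_rate * chunk_ms / 1000)
    hop = chunk - int(sample_rate * overlap_ms / 1000)
    for start in range(0, max(len(waveform) - chunk + 1, 1), hop):
        yield waveform[start:start + chunk]

# Hypothetical usage with a streaming recognizer:
# for audio_chunk in stream_chunks(mic_buffer):
#     partial_text = recognizer.accept_chunk(audio_chunk)  # illustration only
```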

  Current Trends (2025)

  • Multi-modal ASR: Fusing lip-reading frames reduces WER to 1.5 % in noisy cafés.
  • Contextual Biasing: Shallow fusion injects custom vocabulary, such as contact names, at decode time (see the shallow-fusion sketch after this list).
  • Unified STT → Translation → TTS: SeamlessM4T outputs transcript and translated speech in one pass.
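
The contextual-biasing idea can be sketched as shallow fusion: each ASR hypothesis score is combined with an external language-model score plus a bonus whenever a biasing phrase appears. The weights, phrase list, and log-probabilities below are illustrative assumptions, not values from any particular system.

```python
BIAS_PHRASES = {"anna kowalska", "benched.ai"}  # e.g. contact names, product terms
LM_WEIGHT = 0.3    # λ for the external LM (assumed value)
BIAS_BONUS = 2.0   # log-score boost per matched biasing phrase (assumed value)

def shallow_fusion_score(asr_logprob: float, lm_logprob: float, hypothesis: str) -> float:
    """Rescore one n-best hypothesis: ASR score + λ·LM score + biasing bonus."""
    bonus = sum(BIAS_BONUS for phrase in BIAS_PHRASES if phrase in hypothesis.lower())
    return asr_logprob + LM_WEIGHT * lm_logprob + bonus

# Rank an n-best list (log-probabilities are made-up numbers for illustration).
nbest = [("call anna kowalska", -4.1, -2.0), ("call anna kovalska", -3.9, -2.4)]
best = max(nbest, key=lambda h: shallow_fusion_score(h[1], h[2], h[0]))
print(best[0])  # the biasing bonus favors the correctly spelled contact name
```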

  Implementation Tips

  1. Record at 16 kHz, 16-bit PCM; resampling artifacts hurt accuracy.
  2. Run voice-activity detection (VAD) to skip silence and lower costs (see the sketch after this list).
  3. Request timestamps to align subtitles.
  4. Cache encoder states across overlapping windows for real-time streaming.
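
For tips 1 and 2, the sketch below resamples a 44.1 kHz recording to 16 kHz with scipy and applies a simple energy-threshold VAD. A production system would use a trained VAD model; the frame size and threshold here are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(waveform: np.ndarray, orig_sr: int = 44100) -> np.ndarray:
    """Polyphase resampling avoids the aliasing artifacts naive decimation introduces."""
    return resample_poly(waveform, up=16000, down=orig_sr)

def energy_vad(waveform: np.ndarray, sr: int = 16000,
               frame_ms: int = 30, threshold: float = 1e-4) -> np.ndarray:
    """Keep only frames whose mean energy exceeds the (assumed) threshold."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(waveform) // frame
    frames = waveform[: n_frames * frame].reshape(n_frames, frame)
    voiced = frames[(frames ** 2).mean(axis=1) > threshold]
    return voiced.reshape(-1)

# speech = energy_vad(resample_to_16k(raw_44k_audio))  # then send `speech` to the recognizer
```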