Tokenization is the process of breaking raw text, audio, or other sequential data into discrete units—tokens—that a model can ingest. In large language models these tokens commonly represent sub-word fragments, but they may also map to phonemes, bytes, or image patches depending on modality. Choosing the right tokenizer shapes model capacity, vocabulary efficiency, and downstream inference cost.
Definition and Scope
A tokenizer is any deterministic (or stochastic) algorithm that maps an input stream into an ordered sequence of integer IDs from a fixed vocabulary. The mapping must be invertible or near-invertible so that generated tokens can be reconstructed into the original domain. Tokenization spans three tasks (a toy sketch follows the list):
- Normalization (Unicode NFC, lower-casing, punctuation rules)
- Segmentation (how to split text into candidate units)
- Encoding (assigning IDs; often with added meta-tokens such as <s> or <pad>)
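The sketch below runs the three stages end to end. The tiny vocabulary, the greedy longest-match segmenter, and the meta-tokens are illustrative assumptions, not any particular library's implementation:

```python
import unicodedata

# Toy vocabulary for illustration only; real vocabularies hold tens of thousands of pieces.
VOCAB = {"<s>": 0, "<pad>": 1, "<unk>": 2, "token": 3, "iz": 4, "ation": 5, " ": 6}

def normalize(text: str) -> str:
    # Stage 1: Unicode NFC normalization and lower-casing.
    return unicodedata.normalize("NFC", text).lower()

def segment(text: str) -> list[str]:
    # Stage 2: greedy longest-match segmentation against the vocabulary.
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")
            i += 1
    return pieces

def encode(text: str) -> list[int]:
    # Stage 3: map pieces to integer IDs, framed by a start token.
    return [VOCAB["<s>"]] + [VOCAB.get(p, VOCAB["<unk>"]) for p in segment(normalize(text))]

print(encode("Tokenization"))  # [0, 3, 4, 5]
```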
Taxonomy of Algorithms
Sub-word Models
Sub-word approaches dominate LLMs because they compress the vocabulary while preserving morphological hints. BPE iteratively merges the most frequent adjacent symbol pairs, whereas the Unigram LM approach starts with an oversized vocabulary and prunes low-probability pieces to maximize likelihood on the training corpus.
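The core BPE merge loop fits in a few lines. The corpus and merge count below are illustrative; production tokenizers (e.g. SentencePiece, Hugging Face tokenizers) add pre-tokenization, byte fallback, and far more efficient pair counting:

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Minimal BPE training sketch: repeatedly merge the most frequent adjacent pair."""
    words = Counter(tuple(word) for word in corpus)  # word (as symbol tuple) -> frequency
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the frequency-weighted corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = Counter(merged)
    return merges

print(train_bpe(["lower", "lowest", "newer", "wider"], num_merges=4))
```

Unigram LM works in the opposite direction: it seeds a large candidate vocabulary, estimates piece probabilities with an EM-style procedure, and repeatedly drops the pieces whose removal hurts corpus likelihood the least.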
Vocabulary Size Effects
- Small vocab (<8k) ⇒ longer sequences, higher attention cost.
- Large vocab (>100k) ⇒ shorter sequences, but bigger embedding tables and more softmax work.
- Empirical sweet spot for English GPT-style models lies around 32k–50k tokens [6] (see the sketch after this list).
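A back-of-the-envelope sketch of this trade-off. The fertility values (tokens per word) and model dimension below are assumptions chosen for illustration, not measurements:

```python
# Rough cost model: smaller vocabularies inflate sequence length (and quadratic
# attention cost), larger ones inflate the embedding/softmax parameter count.
def footprint(vocab_size: int, fertility: float, words: int = 1000, d_model: int = 4096):
    seq_len = int(words * fertility)       # longer sequences with small vocabs
    embed_params = vocab_size * d_model    # embedding (and tied softmax) weights
    attn_cost = seq_len ** 2               # self-attention scales quadratically in length
    return seq_len, embed_params, attn_cost

for vocab, fert in [(8_000, 1.8), (32_000, 1.3), (100_000, 1.1)]:
    seq, emb, attn = footprint(vocab, fert)
    print(f"V={vocab:>7,}  seq_len={seq:>5}  embed_params={emb:>12,}  attn~{attn:,}")
```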
Performance Metrics
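Commonly reported metrics are fertility (tokens per word), compression ratio (characters per token), and raw throughput (tokens per second). A minimal measurement sketch, assuming the Hugging Face transformers package and the gpt2 tokenizer purely as stand-ins:

```python
import time
from transformers import AutoTokenizer  # assumed dependency, for illustration only

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization quality is usually judged by fertility and throughput. " * 200

start = time.perf_counter()
ids = tokenizer.encode(text)
elapsed = time.perf_counter() - start

fertility = len(ids) / len(text.split())   # tokens per whitespace-separated word
throughput = len(ids) / elapsed            # tokens per second (single, unwarmed run)
print(f"fertility={fertility:.2f} tokens/word, throughput={throughput:,.0f} tokens/s")
```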
Implementation Tips
- Cache tokenizer outputs for repeated prompts to avoid redundant work (see the sketch after this list).
- Align tokenizer normalization with downstream evaluation metrics (BLEU, ROUGE) to prevent mismatched scores.
- Emit offset mappings when serving API responses so front-end code can highlight source spans.
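A combined sketch of the first and last tips, assuming a Hugging Face fast tokenizer (which exposes offset mappings) and a simple in-process lru_cache as the caching layer:

```python
from functools import lru_cache
from transformers import AutoTokenizer  # assumed dependency, for illustration only

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

@lru_cache(maxsize=10_000)
def encode_cached(prompt: str) -> tuple[int, ...]:
    # Cache IDs for repeated prompts; tuples are hashable and safe to memoize.
    return tuple(tokenizer.encode(prompt))

# Offset mappings let front-end code highlight the source span of each token.
text = "Tokenization maps text to IDs."
enc = tokenizer(text, return_offsets_mapping=True)
for tok_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
    print(tok_id, repr(text[start:end]))
```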
Current Trends (2025)
- Multilingual Joint Vocabularies: 256k byte-level tokens shared across 200 languages simplify deployment [10] (a byte-level sketch follows this list).
- Audio & Vision Tokenizers: OpenAI Whisper pairs a spectrogram audio encoder with a text-side BPE vocabulary, while ViT-style vision models split images into 16×16 patches that act as visual tokens (optionally vector-quantized into a discrete codebook).
- Neural Tokenizers: DiffTok and DeepTokenizer learn segmentation jointly with the model, reducing the pre-training mismatch [11].
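The byte-level idea behind joint multilingual vocabularies can be shown with a minimal sketch: treating raw UTF-8 bytes as the base vocabulary guarantees every script is encodable without <unk> tokens. Real systems layer merged sub-words on top of this 256-entry byte table; the code below is only the fallback layer:

```python
# Minimal byte-level "vocabulary": every UTF-8 byte is a token ID (0-255),
# so any script in any language can be encoded without unknown tokens.
def byte_encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def byte_decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

for sample in ["hello", "こんにちは", "مرحبا"]:
    ids = byte_encode(sample)
    assert byte_decode(ids) == sample  # round-trip is lossless
    print(sample, "->", len(ids), "byte tokens")
```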
References
- Rico Sennrich et al., Neural Machine Translation of Rare Words with Subword Units, ACL 2016.
- Taku Kudo, Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates, ACL 2018.
- Radford et al., Language Models are Unsupervised Multitask Learners, OpenAI 2019.
- Benchmarked on Intel Xeon 8380, spaCy 3.6 tokenizer.
- Internal OpenAI analysis, 2024.