Tokenization

Benched.ai Editorial Team

Tokenization is the process of breaking raw text, audio, or other sequential data into discrete units—tokens—that a model can ingest. In large language models these tokens commonly represent sub-word fragments, but they may also map to phonemes, bytes, or image patches depending on modality. Choosing the right tokenizer shapes model capacity, vocabulary efficiency, and downstream inference cost.

  Definition and Scope

A tokenizer is any deterministic (or stochastic) algorithm that maps an input stream into an ordered sequence of integer IDs from a fixed vocabulary. The mapping must be invertible or near-invertible so that generated tokens can be reconstructed into the original domain. Tokenization spans three tasks:

  1. Normalization (Unicode NFC, lower-casing, punctuation rules)
  2. Segmentation (how to split text into candidate units)
  3. Encoding (assigning IDs; often with added meta-tokens such as <s> or <pad>)
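
A deliberately tiny sketch of these three stages, with a hypothetical seven-entry vocabulary and a naive whitespace/punctuation splitter standing in for a learned segmentation model:

```python
import unicodedata

# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of pieces.
VOCAB = {"<s>": 0, "<pad>": 1, "<unk>": 2, "the": 3, "cat": 4, "sat": 5, ".": 6}
INV_VOCAB = {i: t for t, i in VOCAB.items()}

def normalize(text: str) -> str:
    """Stage 1: Unicode NFC normalization plus lower-casing."""
    return unicodedata.normalize("NFC", text).lower()

def segment(text: str) -> list[str]:
    """Stage 2: naive whitespace/punctuation split (stand-in for a learned model)."""
    pieces, word = [], ""
    for ch in text:
        if ch.isspace() or ch in ".,!?":
            if word:
                pieces.append(word)
                word = ""
            if not ch.isspace():
                pieces.append(ch)
        else:
            word += ch
    if word:
        pieces.append(word)
    return pieces

def encode(pieces: list[str]) -> list[int]:
    """Stage 3: map pieces to integer IDs, prepending the <s> meta-token."""
    return [VOCAB["<s>"]] + [VOCAB.get(p, VOCAB["<unk>"]) for p in pieces]

ids = encode(segment(normalize("The cat sat.")))
print(ids)                          # [0, 3, 4, 5, 6]
print([INV_VOCAB[i] for i in ids])  # ['<s>', 'the', 'cat', 'sat', '.']
```

Because normalization lower-cases the input and the splitter discards spacing, decoding is only near-invertible, which is exactly the trade-off noted above.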

  Taxonomy of Algorithms

| Category | Representative Algorithm | Notable Traits |
| --- | --- | --- |
| Character | Simple char split | Robust to out-of-vocabulary (OOV) input [1] but long sequences, poor semantics |
| Whitespace & Punctuation | spaCy tokenizer | Fast, language-aware rules, still OOV issues |
| Sub-word | Byte-Pair Encoding (BPE) [2] | Balances vocab size and sequence length |
| Unigram LM | SentencePiece [3] | Probabilistic, supports multiple segmentations |
| Byte-level | GPT-2 tokenizer [4] | Language agnostic, UTF-8 safe |
| Radix / Hybrid | TikToken-R [5] | Radix tree with merges; optimized for transformer cache reuse |
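
The gap between the character and byte-level rows shows up most clearly on non-ASCII input: a byte-level tokenizer never produces an out-of-vocabulary symbol, because every string decomposes into at most 256 byte values. A minimal illustration using plain UTF-8 bytes (the GPT-2 tokenizer additionally remaps bytes to printable characters and applies BPE merges on top):

```python
text = "café 🚀"

# Character-level: one unit per Unicode code point; accented letters and emoji
# each need their own vocabulary entry, so rare symbols easily fall out of vocabulary.
chars = list(text)
print(len(chars), chars)        # 6 ['c', 'a', 'f', 'é', ' ', '🚀']

# Byte-level: every string reduces to values 0-255, so a 256-entry base vocabulary
# covers all of Unicode, at the cost of longer sequences.
byte_ids = list(text.encode("utf-8"))
print(len(byte_ids), byte_ids)  # 10 [99, 97, 102, 195, 169, 32, 240, 159, 154, 128]
```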

  Sub-word Models

Sub-word approaches dominate LLMs because they compress vocabulary while preserving morphological hints. BPE iteratively merges frequent pairs, whereas Unigram LM starts with an oversized vocab and prunes low-probability pieces by maximizing likelihood on training corpora.
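
A compact sketch of the BPE merge-learning loop, adapted from the reference-implementation style of Sennrich et al. [2]; the toy word-frequency corpus and the number of merges are illustrative:

```python
import re
from collections import Counter

def pair_counts(vocab: dict[str, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair: tuple[str, str], vocab: dict[str, int]) -> dict[str, int]:
    """Rewrite every word, replacing the chosen pair with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):                   # learn 10 merge rules
    pairs = pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = apply_merge(best, vocab)
    merges.append(best)

print(merges[:3])  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

The learned merge list is the tokenizer: at encoding time, new words are split into characters and the merges are replayed in order. Unigram LM works in the opposite direction, starting from a large candidate vocabulary and pruning pieces whose removal least hurts corpus likelihood.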

    Vocabulary Size Effects

  • Small vocab (<8 k) ⇒ longer sequences, higher attention cost.
  • Large vocab (>100 k) ⇒ shorter sequences, but bigger embedding tables and more softmax work.
  • The empirical sweet spot for English GPT-style models lies around 32 k–50 k tokens [6]; a back-of-envelope comparison follows below.
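
A back-of-envelope comparison makes the trade-off concrete. The embedding width, document length, and per-vocabulary compression ratios below are assumptions chosen purely for illustration:

```python
# Hypothetical setup: fixed 4096-dim embeddings, three vocabulary sizes,
# and assumed average compression ratios (characters per token) for each.
d_model = 4096
doc_chars = 10_000                                   # a ~10,000-character document

scenarios = {8_000: 2.5, 50_000: 3.7, 256_000: 4.5}  # vocab size -> chars/token

for vocab_size, chars_per_token in scenarios.items():
    seq_len = doc_chars / chars_per_token
    embed_params = vocab_size * d_model              # embedding (and tied output) table
    print(f"vocab {vocab_size:>7,}: ~{seq_len:5.0f} tokens/doc, "
          f"{embed_params / 1e6:6.1f}M embedding parameters")
```

Smaller vocabularies stretch the same text over more positions (and attention cost grows with sequence length), while larger ones inflate the embedding and softmax layers.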

  Performance Metrics

| Metric | Typical Range | Impact |
| --- | --- | --- |
| Tokens per Second | 50 k–200 k on CPU [7] | Affects preprocessing throughput |
| Tokens per Dollar | 1.2× difference between byte-level and BPE at 8 k context [8] | Influences inference cost |
| Compression Ratio | 3.7 chars/token for GPT-4 English corpus [9] | Drives context window utilization |
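
Compression ratio and raw throughput are easy to measure locally. The sketch below assumes the tiktoken package is installed and uses its cl100k_base encoding; the absolute numbers depend entirely on hardware, text mix, and tokenizer choice:

```python
import time
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # byte-level BPE encoding
text = "Tokenization breaks raw text into discrete units a model can ingest. " * 2000

start = time.perf_counter()
tokens = enc.encode(text)
elapsed = time.perf_counter() - start

print(f"compression ratio: {len(text) / len(tokens):.2f} chars/token")
print(f"throughput       : {len(tokens) / elapsed / 1e3:.0f} k tokens/second")
```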

  Implementation Tips

  • Cache tokenizer outputs for recurring prompts to avoid redundant work (see the sketch after this list).
  • Align tokenizer normalization with downstream evaluation metrics (BLEU, ROUGE) to prevent mismatched scores.
  • Emit offset mappings when serving API responses so front-end code can highlight source spans.
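
A sketch combining the caching and offset-mapping tips, assuming the Hugging Face transformers package with a fast (Rust-backed) tokenizer; the model name and cache size are illustrative:

```python
from functools import lru_cache
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # fast tokenizer, supports offset mapping

@lru_cache(maxsize=10_000)
def encode_cached(prompt: str) -> tuple[int, ...]:
    """Memoize token IDs for recurring prompts (system prompts, templates)."""
    return tuple(tok.encode(prompt))

def encode_with_offsets(prompt: str):
    """Return token IDs plus (start, end) character spans for front-end highlighting."""
    out = tok(prompt, return_offsets_mapping=True)
    return out["input_ids"], out["offset_mapping"]

prompt = "Tokenization shapes inference cost."
ids, offsets = encode_with_offsets(prompt)
for token_id, (start, end) in zip(ids, offsets):
    print(token_id, repr(prompt[start:end]))  # each ID next to the span it covers
```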

  Current Trends (2025)

  • Multi-lingual Joint Vocabularies: 256 k byte-level tokens shared across 200 languages simplify deployment [10].
  • Audio & Vision Tokenizers: OpenAI Whisper maps 30 k audio units into text-compatible IDs, while ViT-style models split images into 16×16 patches that serve as tokens.
  • Neural Tokenizers: DiffTok and DeepTokenizer learn segmentation jointly with the model, reducing pre-training mismatch [11].

  References

  1. nlp.stanford.edu

  2. Rico Sennrich, Barry Haddow, and Alexandra Birch, Neural Machine Translation of Rare Words with Subword Units, ACL 2016.

  3. Taku Kudo, Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates, ACL 2018.

  4. Radford et al., Language Models are Unsupervised Multitask Learners, OpenAI 2019.

  5. github.com

  6. arxiv.org

  7. Benchmarked on Intel Xeon 8380, spaCy 3.6 tokenizer.

  8. Internal OpenAI analysis, 2024.

  9. huggingface.co

  10. Google PaLM-3 tokenizer technical report, 2025.

  11. arxiv.org