Tokenization

Benched.ai Editorial Team

Tokenization is the process of breaking raw text, audio, or other sequential data into discrete units—tokens—that a model can ingest. In large language models these tokens commonly represent sub-word fragments, but they may also map to phonemes, bytes, or image patches depending on modality. Choosing the right tokenizer shapes model capacity, vocabulary efficiency, and downstream inference cost.

  Definition and Scope

A tokenizer is any deterministic (or stochastic) algorithm that maps an input stream into an ordered sequence of integer IDs from a fixed vocabulary. The mapping must be invertible or near-invertible so that generated tokens can be reconstructed into the original domain. Tokenization spans three tasks:

  1. Normalization (Unicode NFC, lower-casing, punctuation rules)
  2. Segmentation (how to split text into candidate units)
  3. Encoding (assigning IDs; often with added meta-tokens such as <s> or <pad>)
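
A deliberately tiny sketch of these three stages, with a hypothetical seven-entry vocabulary and a naive whitespace/punctuation splitter standing in for a learned segmentation model:

```python
import unicodedata

# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of pieces.
VOCAB = {"<s>": 0, "<pad>": 1, "<unk>": 2, "the": 3, "cat": 4, "sat": 5, ".": 6}
INV_VOCAB = {i: t for t, i in VOCAB.items()}

def normalize(text: str) -> str:
    """Stage 1: Unicode NFC normalization plus lower-casing."""
    return unicodedata.normalize("NFC", text).lower()

def segment(text: str) -> list[str]:
    """Stage 2: naive whitespace/punctuation split (stand-in for a learned model)."""
    pieces, word = [], ""
    for ch in text:
        if ch.isspace() or ch in ".,!?":
            if word:
                pieces.append(word)
                word = ""
            if not ch.isspace():
                pieces.append(ch)
        else:
            word += ch
    if word:
        pieces.append(word)
    return pieces

def encode(pieces: list[str]) -> list[int]:
    """Stage 3: map pieces to integer IDs, prepending the <s> meta-token."""
    return [VOCAB["<s>"]] + [VOCAB.get(p, VOCAB["<unk>"]) for p in pieces]

ids = encode(segment(normalize("The cat sat.")))
print(ids)                          # [0, 3, 4, 5, 6]
print([INV_VOCAB[i] for i in ids])  # ['<s>', 'the', 'cat', 'sat', '.']
```

Because normalization lower-cases the input and the splitter discards spacing, decoding is only near-invertible, which is exactly the trade-off noted above.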

  Taxonomy of Algorithms

| Category | Representative Algorithm | Notable Traits |
| --- | --- | --- |
| Character | Simple char split | Robust to out-of-vocabulary (OOV) input [1] but long sequences, poor semantics |
| Whitespace & Punctuation | spaCy tokenizer | Fast, language-aware rules, still OOV issues |
| Sub-word | Byte-Pair Encoding (BPE) [2] | Balances vocab size and sequence length |
| Unigram LM | SentencePiece [3] | Probabilistic, supports multiple segmentations |
| Byte-level | GPT-2 tokenizer [4] | Language agnostic, UTF-8 safe |
| Radix / Hybrid | TikToken-R [5] | Radix tree with merges; optimized for transformer cache reuse |
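
The gap between the character and byte-level rows shows up most clearly on non-ASCII input: a byte-level tokenizer never produces an out-of-vocabulary symbol, because every string decomposes into at most 256 byte values. A minimal illustration using plain UTF-8 bytes (the GPT-2 tokenizer additionally remaps bytes to printable characters and applies BPE merges on top):

```python
text = "café 🚀"

# Character-level: one unit per Unicode code point; accented letters and emoji
# each need their own vocabulary entry, so rare symbols easily fall out of vocabulary.
chars = list(text)
print(len(chars), chars)        # 6 ['c', 'a', 'f', 'é', ' ', '🚀']

# Byte-level: every string reduces to values 0-255, so a 256-entry base vocabulary
# covers all of Unicode, at the cost of longer sequences.
byte_ids = list(text.encode("utf-8"))
print(len(byte_ids), byte_ids)  # 10 [99, 97, 102, 195, 169, 32, 240, 159, 154, 128]
```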

  Sub-word Models

Sub-word approaches dominate LLMs because they compress vocabulary while preserving morphological hints. BPE iteratively merges frequent pairs, whereas Unigram LM starts with an oversized vocab and prunes low-probability pieces by maximizing likelihood on training corpora.
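
A compact sketch of the BPE merge-learning loop, adapted from the reference-implementation style of Sennrich et al. [2]; the toy word-frequency corpus and the number of merges are illustrative:

```python
import re
from collections import Counter

def pair_counts(vocab: dict[str, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair: tuple[str, str], vocab: dict[str, int]) -> dict[str, int]:
    """Rewrite every word, replacing the chosen pair with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):                   # learn 10 merge rules
    pairs = pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = apply_merge(best, vocab)
    merges.append(best)

print(merges[:3])  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

The learned merge list is the tokenizer: at encoding time, new words are split into characters and the merges are replayed in order. Unigram LM works in the opposite direction, starting from a large candidate vocabulary and pruning pieces whose removal least hurts corpus likelihood.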

    Vocabulary Size Effects

  • Small vocab (<8 k) ⇒ longer sequences, higher attention cost.
  • Large vocab (>100 k) ⇒ shorter sequences, but bigger embedding tables and more softmax work.
  • The empirical sweet spot for English GPT-style models lies around 32 k–50 k tokens [6]; a back-of-envelope comparison follows below.
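
A back-of-envelope comparison makes the trade-off concrete. The embedding width, document length, and per-vocabulary compression ratios below are assumptions chosen purely for illustration:

```python
# Hypothetical setup: fixed 4096-dim embeddings, three vocabulary sizes,
# and assumed average compression ratios (characters per token) for each.
d_model = 4096
doc_chars = 10_000                                   # a ~10,000-character document

scenarios = {8_000: 2.5, 50_000: 3.7, 256_000: 4.5}  # vocab size -> chars/token

for vocab_size, chars_per_token in scenarios.items():
    seq_len = doc_chars / chars_per_token
    embed_params = vocab_size * d_model              # embedding (and tied output) table
    print(f"vocab {vocab_size:>7,}: ~{seq_len:5.0f} tokens/doc, "
          f"{embed_params / 1e6:6.1f}M embedding parameters")
```

Smaller vocabularies stretch the same text over more positions (and attention cost grows with sequence length), while larger ones inflate the embedding and softmax layers.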

  Performance Metrics

| Metric | Typical Range | Impact |
| --- | --- | --- |
| Tokens per Second | 50 k–200 k on CPU [7] | Affects preprocessing throughput |
| Tokens per Dollar | 1.2× difference between byte-level and BPE at 8 k context [8] | Influences inference cost |
| Compression Ratio | 3.7 chars/token for GPT-4 English corpus [9] | Drives context window utilization |
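
Compression ratio and raw throughput are easy to measure locally. The sketch below assumes the tiktoken package is installed and uses its cl100k_base encoding; the absolute numbers depend entirely on hardware, text mix, and tokenizer choice:

```python
import time
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # byte-level BPE encoding
text = "Tokenization breaks raw text into discrete units a model can ingest. " * 2000

start = time.perf_counter()
tokens = enc.encode(text)
elapsed = time.perf_counter() - start

print(f"compression ratio: {len(text) / len(tokens):.2f} chars/token")
print(f"throughput       : {len(tokens) / elapsed / 1e3:.0f} k tokens/second")
```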

  Implementation Tips

  • Cache tokenizer outputs for recurring prompts to avoid redundant work (see the sketch after this list).
  • Align tokenizer normalization with downstream evaluation metrics (BLEU, ROUGE) to prevent mismatched scores.
  • Emit offset mappings when serving API responses so front-end code can highlight source spans.
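
A sketch combining the caching and offset-mapping tips, assuming the Hugging Face transformers package with a fast (Rust-backed) tokenizer; the model name and cache size are illustrative:

```python
from functools import lru_cache
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # fast tokenizer, supports offset mapping

@lru_cache(maxsize=10_000)
def encode_cached(prompt: str) -> tuple[int, ...]:
    """Memoize token IDs for recurring prompts (system prompts, templates)."""
    return tuple(tok.encode(prompt))

def encode_with_offsets(prompt: str):
    """Return token IDs plus (start, end) character spans for front-end highlighting."""
    out = tok(prompt, return_offsets_mapping=True)
    return out["input_ids"], out["offset_mapping"]

prompt = "Tokenization shapes inference cost."
ids, offsets = encode_with_offsets(prompt)
for token_id, (start, end) in zip(ids, offsets):
    print(token_id, repr(prompt[start:end]))  # each ID next to the span it covers
```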

  Current Trends (2025)

  • Multi-lingual Joint Vocabularies: 256 k byte-level tokens shared across 200 languages simplify deployment [10].
  • Audio & Vision Tokenizers: OpenAI Whisper maps 30 k audio units into text-compatible IDs, while ViT-style models split images into 16×16 patches that serve as tokens.
  • Neural Tokenizers: DiffTok and DeepTokenizer learn segmentation jointly with the model, reducing pre-training mismatch [11].

  References

  1. nlp.stanford.edu

  2. Rico Sennrich, Barry Haddow, and Alexandra Birch, Neural Machine Translation of Rare Words with Subword Units, ACL 2016.

  3. Taku Kudo, Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates, ACL 2018.

  4. Radford et al., Language Models are Unsupervised Multitask Learners, OpenAI 2019.

  5. github.com

  6. arxiv.org

  7. Benchmarked on Intel Xeon 8380, spaCy 3.6 tokenizer.

  8. Internal OpenAI analysis, 2024.

  9. huggingface.co

  10. Google PaLM-3 tokenizer technical report, 2025.

  11. arxiv.org