Here's something that surprises most people: when you feed text to an LLM, the model doesn't process whole words. It processes tokens, which are subword units. Sometimes a token is a whole word. Sometimes it's part of a word. Sometimes it's punctuation. The process of converting text to tokens is tokenization, and it's more important to understand than most people realize.
In English, text tokenizes at roughly one token per word. But "engineer" might be two tokens: "eng" and "ineer." "Tokenization" is often three. Special characters, punctuation, and numbers each get their own tokens. A simple sentence like "I love machine learning" might be 4 tokens, while a sentence with technical jargon like "SVM implementation on CUDA" might be 8, because "SVM" and "CUDA" are unusual. Tokenization algorithms struggle with words outside their training distribution.
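The subword splitting described above can be sketched with a toy BPE-style merge loop. The merge table below is invented purely for this demo; real tokenizers learn tens of thousands of merges from training data:

```python
# Toy BPE-style tokenizer sketch. MERGES is a HYPOTHETICAL merge table
# invented for illustration; real BPE tokenizers learn theirs from data.
MERGES = [("e", "n"), ("en", "g"), ("i", "n"), ("in", "e"),
          ("ine", "e"), ("inee", "r")]

def toy_bpe(word):
    """Start from characters, then greedily apply each learned merge."""
    tokens = list(word)
    for a, b in MERGES:
        merged = []
        i = 0
        while i < len(tokens):
            # Merge adjacent pair (a, b) into a single token a+b.
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(toy_bpe("engineer"))  # → ['eng', 'ineer']
```

With this particular merge table, "engineer" ends up as the two subwords "eng" and "ineer", mirroring the split described above. A word the merges don't cover well stays fragmented into many small pieces, which is exactly what happens to rare or out-of-distribution words in real tokenizers.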
Different models use different tokenizers. GPT models use byte-pair encoding (BPE); others use SentencePiece. The differences matter for cost and performance: some tokenizers handle code better than others, and some handle non-English languages better. GPT-4's tokenizer differs from GPT-3's, which changes how many tokens the same text encodes to. The same text costs different amounts in different models.
Here's why this matters: you pay per token. If your prompt is 10,000 tokens, you pay for 10,000 tokens. If the tokenizer encodes your data inefficiently, you're paying more than necessary. Japanese text fragments badly in many English-centric tokenizers, producing more tokens, and code often tokenizes inefficiently too. So a Japanese programmer needs more tokens, and therefore pays more, to process the same amount of meaning.
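The cost effect is simple arithmetic. A minimal sketch, using a hypothetical price and an assumed ~3x token inflation for Japanese (real numbers come from your provider's price list and your tokenizer):

```python
# Back-of-the-envelope prompt cost. The price is a HYPOTHETICAL
# placeholder, not any provider's real rate.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # assumed: $0.01 per 1,000 input tokens

def prompt_cost(num_tokens, price_per_1k=PRICE_PER_1K_INPUT_TOKENS):
    """Dollars to process num_tokens of input."""
    return num_tokens / 1000 * price_per_1k

# Same meaning, different tokenizer efficiency (inflation factor assumed):
english_tokens = 1000
japanese_tokens = 3000  # ~3x inflation from an English-centric tokenizer
print(prompt_cost(english_tokens))   # → 0.01
print(prompt_cost(japanese_tokens) / prompt_cost(english_tokens))  # → 3.0
```

The point isn't the specific numbers, which are made up here, but that the multiplier applies to every request: tokenizer inefficiency scales linearly with volume.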
Tokenization also affects model behavior. A word split across tokens behaves differently than a word encoded as a single token. This is one reason unusual company names or technical terms can confuse models: the tokenizer breaks them into unfamiliar token sequences, and the model has to reason about sequences it has rarely, if ever, seen during training.
There's also a phenomenon sometimes called tokenization leakage. When the tokenizer splits a word, the model's handling of that word depends on patterns learned for each fragment in unrelated contexts, so it may handle the complete word poorly. Anagrams and word variants can expose these tokenization artifacts. This shouldn't happen in theory, but it does in practice.
You can check tokenization for your model (OpenAI provides a tokenizer tool). Try tokenizing some text you plan to use in production. "Machine learning" might be 2 tokens, "ML" might be 1, and "machine-learning" might be 4, because the hyphen forces extra splits. These differences affect your actual costs.
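When you can't run the real tokenizer, a common rough heuristic for English text is about four characters per token. A sketch, with the caveat that this is an estimate only and drifts badly for code and non-English text:

```python
# Rough token-count estimate. The chars/4 rule of thumb is a common
# heuristic for English prose ONLY; for exact counts, use your model's
# actual tokenizer (e.g. the tokenizer tool mentioned above).
def estimate_tokens(text, chars_per_token=4):
    """Crude estimate: roughly one token per `chars_per_token` characters."""
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("machine learning"))  # → 4 (16 chars / 4)
```

This is fine for sanity-checking a budget, but bill-affecting decisions should use measured counts, since the heuristic can be off by several-fold on exactly the inputs (code, Japanese, jargon) where tokenization matters most.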
The context window constraints discussed earlier are measured in tokens, not characters or words, which is another reason tokenization matters. A 200k-token context window holds more or less actual text depending on the language and technical terms you're processing, because tokenization efficiency varies.
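Because the window is a token budget, a pre-flight check is worth doing before sending a large prompt. A minimal sketch, assuming token counts come from a real tokenizer and a hypothetical 200k window:

```python
# Pre-flight context budget check. Window size is an ASSUMED example;
# use your model's documented limit. Token counts should come from the
# model's actual tokenizer, not a character count.
def fits_in_window(prompt_tokens, max_output_tokens, window=200_000):
    """True if the prompt plus the reserved output budget fits the window."""
    return prompt_tokens + max_output_tokens <= window

print(fits_in_window(150_000, 40_000))  # → True
print(fits_in_window(190_000, 20_000))  # → False
```

Note that the output budget counts against the same window: a prompt that "fits" with nothing left over leaves the model no room to respond.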
Most users never think about tokenization because it's invisible. The model and API abstract it away. But if you're building production systems, especially ones handling diverse languages, code, or technical content, understanding tokenization helps you estimate costs, predict behavior, and debug unexpected patterns.
There's ongoing work on more efficient tokenizers that handle diverse languages and technical content better. But until tokenization becomes truly language-agnostic, these quirks persist.
Why It Matters
Tokenization directly affects both cost and performance. Text that tokenizes inefficiently is more expensive to process. Technical content, code, and non-English text often tokenize poorly, inflating costs. On the performance side, unusual words split across tokens can confuse the model. Understanding tokenization helps predict costs, estimate token budgets, and debug why a model struggles with certain inputs. For high-volume applications or multilingual systems, tokenization efficiency can significantly impact economics.
Example
A company processes customer feedback in English, Spanish, and Japanese. Its tokenizer is optimized for English, so English feedback tokenizes efficiently, and Spanish is relatively efficient too. Japanese feedback, run through the same tokenizer, fragments into many more tokens because the tokenizer wasn't trained on Japanese character patterns. The same amount of meaning costs 3-5x more to process in Japanese. Awareness of this inefficiency guides the decision: translate to English before processing, switch to a model with a multilingual tokenizer, or accept the cost and adjust budgets.
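The budgeting decision above can be sketched numerically. The per-item token counts here are assumptions invented for the demo (chosen to land in the 3-5x range described); in practice they would be measured by running real feedback samples through the tokenizer:

```python
# ASSUMED tokens per feedback item of equivalent meaning -- illustrative
# numbers, not measurements. Price is likewise a hypothetical placeholder.
TOKENS_PER_ITEM = {"English": 100, "Spanish": 120, "Japanese": 350}
PRICE_PER_1K = 0.01  # assumed: $0.01 per 1,000 tokens

def monthly_cost(items_per_month, language):
    """Dollars per month to process that language's feedback volume."""
    tokens = items_per_month * TOKENS_PER_ITEM[language]
    return tokens / 1000 * PRICE_PER_1K

for lang in TOKENS_PER_ITEM:
    print(lang, round(monthly_cost(10_000, lang), 2))
```

Under these assumptions, Japanese costs 3.5x what English does at the same volume, which is the kind of gap that justifies comparing the cost of a translate-first pipeline or a multilingual tokenizer against simply paying the inflated rate.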