Tokenization is the process of splitting text into smaller units called tokens. Depending on the strategy, a token may be a word, a subword, or a single character. Modern transformer language models such as GPT and BERT use subword tokenization (for example BPE or WordPiece, often through a library such as SentencePiece) to keep the vocabulary at a manageable size while still representing rare words as sequences of known pieces.
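To make the subword idea concrete, here is a minimal sketch of how BPE-style merges can be learned from a tiny word list. The function name `learn_bpe_merges` and the toy corpus are illustrative assumptions, not taken from any particular library; production tokenizers add details such as byte-level handling, end-of-word markers, and pre-tokenization rules.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE sketch: repeatedly merge the most frequent adjacent symbol pair."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; on ties, the first pair encountered wins.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word, fusing every occurrence of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus, chosen only for illustration.
merges, corpus = learn_bpe_merges(
    ["low", "low", "lower", "lowest", "new", "newer"], num_merges=2
)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single token, while rarer words such as "lowest" are still spelled out from smaller pieces. That is the trade-off subword methods aim for.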
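❌ Word-level: Very large vocabulary, and words not seen during training cannot be represented
❌ Character-level: Individual characters carry little meaning, and sequences become very long
✅ Subword: Moderate vocabulary size, with rare or unseen words split into known pieces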
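The trade-offs listed above can be seen side by side in a small sketch. The three functions and both vocabularies below are hypothetical and deliberately tiny; the subword function uses a greedy longest-match-first strategy with a "##" continuation prefix, in the style of WordPiece, and is not a reproduction of any specific model's tokenizer.

```python
def word_tokenize(text, vocab):
    # Split on whitespace; any word outside the vocabulary becomes "[UNK]".
    return [w if w in vocab else "[UNK]" for w in text.lower().split()]

def char_tokenize(text):
    # One token per character, spaces included.
    return list(text.lower())

def subword_tokenize(text, vocab):
    # Greedy longest-match-first per word; "##" marks a non-initial piece.
    tokens = []
    for word in text.lower().split():
        pieces, start = [], 0
        while start < len(word):
            end, piece = len(word), None
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                # No known piece matches at this position: give up on the word.
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens

# Hypothetical toy vocabularies for illustration only.
WORD_VOCAB = {"the", "model", "token"}
SUBWORD_VOCAB = {"the", "model", "token", "##ization", "##s"}

text = "the tokenization models"
print(word_tokenize(text, WORD_VOCAB))        # ['the', '[UNK]', '[UNK]']
print(len(char_tokenize(text)))               # 23
print(subword_tokenize(text, SUBWORD_VOCAB))  # ['the', 'token', '##ization', 'model', '##s']
```

With only two extra vocabulary entries, the subword tokenizer covers both words that the word-level tokenizer maps to "[UNK]", and it does so in 5 tokens rather than the 23 needed at character level.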
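In real transformer models, vocabularies commonly contain around 30,000 to 50,000 tokens (BERT's WordPiece vocabulary has about 30,000 entries and GPT-2's BPE vocabulary about 50,000), though some newer models use vocabularies of 100,000 tokens or more.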