Subword Tokenization – Lecture Notes
Problem with a fixed word-level vocabulary: unknown (out-of-vocabulary) words, emoji, etc. cannot be represented.
Typical vocabulary sizes: many 7B models use around 32,000 tokens, GPT-2 used 50,257, GPT-3.5/GPT-4 use about 100k, GPT-4o about 200k.
Subword tokenization: split words into smaller units, so that a single word may be represented by multiple tokens.
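
As an illustration (a sketch added to these notes, not from the original lecture), the open-source tiktoken library can be used to check the vocabulary sizes mentioned above and to see how a rare word or an emoji is split into several subword tokens instead of becoming "unknown":

    import tiktoken

    gpt2 = tiktoken.get_encoding("gpt2")          # GPT-2 tokenizer
    o200k = tiktoken.get_encoding("o200k_base")   # GPT-4o tokenizer

    print(gpt2.n_vocab)    # 50257
    print(o200k.n_vocab)   # roughly 200k

    # A rare word is split into several subword tokens ...
    ids = gpt2.encode("Tokenisierung")
    print([gpt2.decode([i]) for i in ids])   # the individual subword pieces

    # ... and an emoji is covered via its UTF-8 bytes, so nothing is "unknown".
    print(gpt2.encode("🙂"))
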
Learned from the training data: frequent character sequences become single tokens in the vocabulary, while rare words are split into several subword tokens.
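
The core idea can be sketched in a few lines (an illustrative toy example added here; the function name learn_bpe and the tiny corpus are made up, and real tokenizers work on bytes and much larger data): Byte-Pair Encoding starts from single characters and repeatedly merges the most frequent adjacent pair, so frequent sequences end up as single vocabulary entries.

    from collections import Counter

    def learn_bpe(corpus, num_merges):
        # Start from single characters; "</w>" marks the end of a word.
        words = Counter(tuple(w) + ("</w>",) for w in corpus.split())
        merges = []
        for _ in range(num_merges):
            # Count how often each adjacent symbol pair occurs in the corpus.
            pairs = Counter()
            for word, freq in words.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)   # most frequent pair
            merges.append(best)
            # Replace every occurrence of that pair by the merged symbol.
            new_words = Counter()
            for word, freq in words.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(word[i] + word[i + 1])
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                new_words[tuple(out)] += freq
            words = new_words
        return merges

    print(learn_bpe("low low low lower lowest new newer newest", 5))
    # e.g. ('l', 'o'), ('lo', 'w'), ... : frequent sequences become single tokens
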