Subword Tokenization – Lecture Notes
A fixed word-level vocabulary is never fully accurate and may miss words; a tokenizer also needs to handle unknown words, emoji, etc.
Typical vocabulary sizes: many 7B models use around 30,000 tokens; GPT-2 and GPT-3 use 50,257; GPT-3.5/GPT-4 use about 100k (cl100k_base); GPT-4o uses about 200k (o200k_base)
Subword tokenization: use multiple tokens for each [...]
[Devlin19]
Devlin, J. et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT (2019), 4171–4186.
[Gage94]
Gage, P. 1994. A new algorithm for data compression. C Users J. 12, 2 (Feb. 1994), 23–38.
[Kudo18]
Kudo, T. 2018. Subword regularization: Improving neural network translation …
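Since the notes cite [Gage94], the origin of byte-pair encoding, here is a minimal sketch of how BPE learns merge rules from word frequencies and then segments an unseen word. The toy corpus, function names, and merge count are illustrative assumptions, not part of the lecture notes.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict (minimal sketch)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count frequencies of adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

def bpe_encode(word, merges):
    """Segment a word by applying the learned merges in training order."""
    symbols = list(word)
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy corpus (hypothetical example frequencies):
words = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = bpe_train(words, 3)
# The unseen word "lowest" is segmented into known subwords
# instead of being mapped to an unknown-word token:
print(bpe_encode("lowest", merges))  # ['lo', 'w', 'est']
```

Note how the out-of-vocabulary word "lowest" decomposes into subwords learned from other words, which is exactly the motivation stated above: no word is ever fully unknown, only split into smaller pieces.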