You Should Probably Pay Attention to Tokenizers
Tokenization is the process of breaking a piece of text into smaller pieces called tokens. Each token is then assigned an integer ID that identifies it within the tokenizer's vocabulary, the set of all tokens the tokenizer learned during its training. There are several types of tokenizers, and it's worth knowing which one the large language model you're working with relies on. This article discusses the different types of tokenizers and how they influence the way text is processed.
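As a quick illustration, here is a minimal sketch using the Hugging Face `transformers` library with the GPT-2 tokenizer (the model choice and example string are just placeholders): it breaks a string into tokens and maps each one to its integer ID in the vocabulary.

```python
# A minimal sketch, assuming the Hugging Face `transformers` package is installed.
from transformers import AutoTokenizer

# Load the GPT-2 tokenizer (a byte-level BPE tokenizer).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenizers split text into pieces."

# Break the text into tokens; the "Ġ" prefix marks a leading space
# in GPT-2's byte-level representation.
tokens = tokenizer.tokenize(text)
print(tokens)  # e.g. ['Token', 'izers', 'Ġsplit', 'Ġtext', 'Ġinto', 'Ġpieces', '.']

# Map each token to its integer ID; the exact values depend on the
# tokenizer's vocabulary.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
```

A different tokenizer would split the same string into different tokens with different IDs, which is exactly why it matters which one your model uses.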