Click here for the results of pretraining 16 language models on different tokenizers including TokenMonster, GPT-2 & tiktoken p50k_base.
Click here for an interactive benchmark for all pretrained vocabularies, plus LLaMa Tokenizer, GPT2 Tokenizer, and tiktoken.
Click here for a chart comparing optimization mode, chr/tok and vocab size for all TokenMonster vocabularies.
Tokenizer | Range
---|---
TokenMonster | 5.0 - 13 MB / second
tiktoken | 4.0 - 7.0 MB / second
Hugging Face | 0.2 - 0.5 MB / second
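The throughput figures above come from timing each tokenizer over the test datasets and dividing by the text size. A minimal sketch of that kind of measurement is below; it assumes the `tokenmonster` and `tiktoken` Python packages, a hypothetical `the_pile.txt` extract, and an example pretrained vocabulary name.

```python
import time
import tiktoken
import tokenmonster

def mb_per_second(tokenize, text):
    """Time a tokenize callable over `text` and return (MB/s, token count)."""
    start = time.perf_counter()
    tokens = tokenize(text)
    elapsed = time.perf_counter() - start
    mb = len(text.encode("utf-8")) / 1e6
    return mb / elapsed, len(tokens)

# Hypothetical path to an extracted test dataset (one the vocabularies have not seen).
with open("the_pile.txt", "r", encoding="utf-8") as f:
    text = f.read()

enc = tiktoken.get_encoding("gpt2")                   # tiktoken's GPT-2 encoding
vocab = tokenmonster.load("english-32000-balanced-v1")  # example vocabulary name (assumption)

for name, fn in [("tiktoken gpt2", enc.encode_ordinary), ("tokenmonster", vocab.tokenize)]:
    speed, ntokens = mb_per_second(fn, text)
    print(f"{name}: {speed:.1f} MB/s, {ntokens} tokens")
```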
The LLaMa Tokenizer running on Hugging Face Transformers used 35210511 tokens to tokenize `instruct` and 38383671 for `scifi`, at an average speed of 0.29 MB/s. The TokenMonster import of the LLaMa Tokenizer (with the exact same vocabulary) used 35083428 tokens for `instruct` and 38124152 for `scifi` (0.5% fewer tokens) and ran at an average speed of 12.5 MB/s.
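That token-count comparison can be reproduced along the following lines. The Hugging Face model ID and the local `llama.vocab` file (the TokenMonster import of the LLaMa vocabulary) are illustrative assumptions, as is the `instruct.txt` extract.

```python
from transformers import AutoTokenizer
import tokenmonster

# Hypothetical plain-text extract of the instruct dataset.
with open("instruct.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Hugging Face LLaMa tokenizer (model ID is an assumption; any LLaMa checkpoint works).
hf_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
hf_ids = hf_tok(text)["input_ids"]

# TokenMonster import of the same LLaMa vocabulary, assumed to be a local vocab file.
tm_vocab = tokenmonster.load("llama.vocab")
tm_ids = tm_vocab.tokenize(text)

print(f"Hugging Face: {len(hf_ids)} tokens")
print(f"TokenMonster: {len(tm_ids)} tokens")
print(f"Reduction:    {100 * (len(hf_ids) - len(tm_ids)) / len(hf_ids):.2f}%")
```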
Compared with tiktoken's own benchmarks, which claim 6.2 MB/s with the GPT2 Tokenizer, I got the same performance for GPT2 on tiktoken. TokenMonster's import of the GPT2 Tokenizer performed at 13.3 MB/s on `instruct` and 11.3 MB/s on `the_pile`.
The Python TokenMonster implementation calls the Go implementation for tokenization, so they are the same speed, but Python has overhead due to serialization and deserialization. The real-world performance of TokenMonster-trained vocabularies, i.e. not `gpt2` or `llama` (which are simpler vocabularies), is closer to the 5 - 7 MB/s range, which is comparable with tiktoken's performance. It's worth mentioning that tiktoken tokenizes greedily, whilst TokenMonster achieves comparable or better speed whilst calculating up to 6 branches at any point in time.
For a fair test, the benchmarks were performed on datasets that the TokenMonster vocabularies had not previously seen.
`the_pile` is the test dataset from The Pile, with the text extracted using extract_text_from_jsonl_parquet.py. Size is 1.3 GB.
`github` is this random file (1.7 GB direct download) from urls.txt from Red Pajama. It was also extracted using extract_text_from_jsonl_parquet.py and represents code. Extracted size is 1.5 GB.
`instruct` is a collection of chat & instruct finetunes from WizardLM's alpaca_evol_instruct_70k.json. This was used as-is and represents chatbot conversational text. Size is 137 MB.
`scifi` is the Scifi Stories Text Corpus. I thought sci-fi would be a good test of fiction-tokenizing capability because the training datasets don't contain much sci-fi. Size is 149 MB.
`github` and `the_pile` were further processed with onlyvalidlatin.go to remove any invalid UTF-8 and non-Latin characters (e.g. Chinese). I made this decision because all of the pretrained vocabularies were trained with the `-only-latin` and `-only-valid` parameters, and hence must use single-byte tokens to tokenize any non-Latin characters. Because `github` and `the_pile` contained a lot of non-Latin script, whilst `scifi` and `instruct` did not, this would otherwise have skewed the benchmarks.
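For illustration, a rough Python approximation of that preprocessing step (not the onlyvalidlatin.go script itself, just a sketch of the same idea) would drop invalid UTF-8 and discard letters from non-Latin scripts:

```python
import unicodedata

def only_valid_latin(raw: bytes) -> str:
    """Approximation: drop invalid UTF-8, keep ASCII, Latin letters, and
    non-letter characters (whitespace, digits, punctuation, symbols)."""
    text = raw.decode("utf-8", errors="ignore")  # discard invalid UTF-8 sequences
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                       # ASCII always passes
            continue
        name = unicodedata.name(ch, "")
        cat = unicodedata.category(ch)
        # Keep Latin letters (e.g. accented characters) and anything that is not
        # a letter; drop letters from other scripts such as CJK.
        if "LATIN" in name or not cat.startswith("L"):
            out.append(ch)
    return "".join(out)

# Hypothetical file names for the extracted and filtered datasets.
with open("github_extracted.txt", "rb") as f:
    cleaned = only_valid_latin(f.read())
with open("github_latin.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)
```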