
Disable pretty-print when saving tokenizer.json files #1656

Open
xenova opened this issue Oct 7, 2024 · 1 comment

Comments

xenova commented Oct 7, 2024

Feature request

As the vocabulary of newer models, like Llama 3 or Gemma, increases in size, so does the size of the tokenizer, which includes the vocabulary as JSON (and merges for BPE tokenizers). Pretty-printing these files for serialization introduces a significant overhead as whitespace around the vocabulary and/or merges is added to the file.

This issue is even worse after the new BPE serialization update, which replaces merges like "s1 s2" with ["s1", "s2"], which is now formatted to be on separate lines:

[Screenshot: pretty-printed merges in tokenizer.json, with each ["s1", "s2"] pair spread across multiple lines]

From quick testing, not pretty-printing the tokenizer.json reduces the file size from 17MB to 7MB.
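The overhead is easy to reproduce with a mock vocabulary (a sketch using Python's `json` module; the data and sizes here are illustrative, not the actual tokenizer.json figures):

```python
import json

# Mock vocabulary and BPE merges in the new list-of-pairs format
# (illustrative only; real tokenizer.json files are far larger).
vocab = {f"token_{i}": i for i in range(10_000)}
merges = [[f"s{i}", f"s{i + 1}"] for i in range(10_000)]
data = {"model": {"vocab": vocab, "merges": merges}}

compact = json.dumps(data, separators=(",", ":"))
pretty = json.dumps(data, indent=2)  # each merge pair now spans several lines

print(f"compact: {len(compact):,} bytes")
print(f"pretty:  {len(pretty):,} bytes")
```

Both strings parse back to the identical object; the difference is purely whitespace, which is largest for deeply nested structures like the list-of-pairs merges.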

Understandably, pretty-printing the file can help with debugging, but for those cases it's probably better for the default to be unformatted output, with a flag to opt into formatting.
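A minimal sketch of the proposed behavior, using a hypothetical `save_tokenizer` helper (this is not the actual tokenizers API; the function name and `pretty` parameter are assumptions for illustration):

```python
import json
from pathlib import Path


def save_tokenizer(data: dict, path: str, pretty: bool = False) -> None:
    """Serialize tokenizer data to JSON.

    Compact output by default; a hypothetical `pretty` flag opts into
    human-readable formatting for debugging.
    """
    if pretty:
        text = json.dumps(data, indent=2, ensure_ascii=False)
    else:
        # separators=(",", ":") drops all inter-token whitespace
        text = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")
```

Either way, `json.load` reads the result back identically, so the compact default costs nothing except readability in a text editor.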

cc @ArthurZucker
(PS: I can move this to huggingface/tokenizers if it is more applicable there.)

Motivation

To reduce the file size (and therefore the bandwidth) required to download, serialize, and upload these files. This will particularly benefit Transformers.js users, for whom bandwidth is important.

Your contribution

ArthurZucker (Collaborator) commented

We already have a `pretty` argument in tokenizers, but we should give a bit more granularity.
