Deserializing BPE tokenizer failure #1541
Comments
As part of my search, I found that I was setting `tokenizer.model.dropout = 0.0`. A json file with this value changed to `null` loads correctly (see the sketch after this comment).
Thus, I think that this is where the bug is. I need to check the Rust code to see what the expected type of `dropout` is. If someone confirms that this is the root cause, I will close this issue and open another about the documentation.
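A minimal sketch of that json-side change, assuming the standard `tokenizer.json` layout where the BPE options live under the top-level `"model"` key:

```python
import json

from tokenizers import Tokenizer

# Rewrite "dropout": 0.0 to null and write the file back; assumes the
# usual tokenizer.json layout with BPE options under the "model" key.
with open("tokenizer.json") as f:
    data = json.load(f)

if data["model"].get("dropout") == 0.0:
    data["model"]["dropout"] = None  # serialized back as null

with open("tokenizer.json", "w") as f:
    json.dump(data, f)

tokenizer = Tokenizer.from_file("tokenizer.json")  # now deserializes
```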
Having looked into the Rust code (see below), my previous comment does seem to be true: it only accepts a `dropout` in the half-open interval (0.0, 1.0], or `None` (`tokenizers/tokenizers/src/models/bpe/model.rs`, lines 138 to 142 at `25aee8b`).
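In Python terms, the validation at those lines amounts to something like the following; the exact logic lives in the Rust source cited above, so treat this as an approximation:

```python
def validate_dropout(dropout):
    # Approximation of the builder's check: accept None or a value
    # strictly greater than 0.0 and at most 1.0; reject everything
    # else, including 0.0.
    if dropout is not None and not (0.0 < dropout <= 1.0):
        raise ValueError("dropout must be None or in (0.0, 1.0]")
```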
Thanks for the detailed issue, and yeah it makes sense to have dropout 0 equivalent to None.
Thanks for the comment. I opened #1550 to address this.
Original issue

I am trying to serialize and deserialize a tokenizer and am getting an error on the deserialization step.

First, I want to clarify that I have seen the other issues related to this (e.g., #1342, #566, #909, #1297, etc.), and none of the fixes in them apply here (I will detail that below). I have tested this on two different corpora and over 3 different versions of tokenizers (`v0.12`, `v0.13`, and `v0.19`) with the same results.

Here are the things I have checked/tried (to confirm that I have checked the other bugs):
- Different pre-tokenizers: `Whitespace()`, `WhitespaceSplit()`, and `Split(pattern='\w+|[^\w\s]+', behavior='isolated')`
- With and without a `post_processor` and a `Template` processor

My model training code is:
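The original snippet did not survive in this copy of the issue; what follows is a minimal sketch along the lines of the library's quicktour, with the `tokenizer.model.dropout = 0.0` assignment from the EDIT below folded in (the corpus file name is a placeholder):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build and train a BPE tokenizer, roughly as in the quicktour.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus file

# The assignment that turned out to cause the failure: 0.0 is serialized
# as a number, which the deserializer rejected at the time.
tokenizer.model.dropout = 0.0

tokenizer.save("tokenizer.json")
```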
It was taken basically verbatim from the documentation.
EDIT: I actually had `tokenizer.model.dropout = 0.0` in my code, which was the cause of the failure (see the closing comment).

Immediately reloading with `Tokenizer.from_file("tokenizer.json")` fails with the above error. Any ideas on how to work around this?
Below is an example json output from the tokenizer training code above. Note that I set the vocabulary size such that only one merge was added. I did not modify anything in the json file. Please try to load it and let me know if it works (it fails for me).
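The json attachment itself is not reproduced here; assuming it is saved next to the script as `tokenizer.json`, a load attempt that surfaces the failure looks like this:

```python
from tokenizers import Tokenizer

# On affected versions this raises during deserialization instead of
# returning a Tokenizer, because the serialized BPE model carries
# "dropout": 0.0.
try:
    tokenizer = Tokenizer.from_file("tokenizer.json")
    print("loaded OK")
except Exception as err:
    print(f"failed to deserialize: {err}")
```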