Exception upon attempting to load a Tokenizer from file #566
Hi, I'm attempting to simply serialize and then unserialize a trained tokenizer. When I run the following code:

I get the following traceback:

Comments
Hi @joepalermo, would you mind sharing the resulting file?
@n1t0 Thanks for your help. GitHub isn't letting me attach a .json file to a comment, so I'll just paste the contents of it here: {"version":"1.0","truncation":null,"padding":null,"added_tokens":[],"normalizer":null,"pre_tokenizer":null,"post_processor":null,"decoder":null,"model":{"dropout":null,"unk_token":null,"continuing_subword_prefix":null,"end_of_word_suffix":null,"fuse_unk":false,"vocab":{"\n":0," ":1,"(":2,")":3,"":4,"+":5,",":6,"-":7,".":8,"/":9,"0":10,"1":11,"2":12,"3":13,"4":14,"5":15,"6":16,"7":17,"8":18,"9":19,";":20,"=":21,"?":22,"C":23,"D":24,"F":25,"G":26,"I":27,"L":28,"S":29,"W":30,"a":31,"b":32,"c":33,"d":34,"e":35,"f":36,"g":37,"h":38,"i":39,"j":40,"k":41,"l":42,"m":43,"n":44,"o":45,"p":46,"q":47,"r":48,"s":49,"t":50,"u":51,"v":52,"w":53,"x":54,"y":55,"z":56," -":57,"e ":58,"t ":59," +":60," =":61," + ":62," - ":63,". ":64,";\n":65,"**":66,"Le":67,"Let ":68," = ":69,".;\n":70,"s ":71,"th":72," = -":73,"iv":74,"the ":75,"2":76,"r ":77,"of":78,". Let ":79,"d ":80,"?;\n":81,"at":82,"2":83,"of ":84,"3":85,"de":86,"or ":87,"4":88,"os":89,"pos":90,"(-":91,"5*":92,"Su":93,"ppos":94,"Suppos":95,"is ":96,"n ":97,"be ":98,"nd ":99,"co":100," a":101,"at ":102,"Wh":103,"What ":104,"ul":105," be ":106," - 1":107," + 1":108,"e -":109,"com":110,"3":111,"st ":112,") = ":113,"What is ":114,"ac":115,"act":116," f":117,"So":118,"lv":119,"Solv":120,"al":121,"ive ":122,") = -":123,"ate ":124,"mo":125,"commo":126,"common ":127,"in":128,"0":129,"Suppose ":130,"Cal":131,"cul":132,"Calcul":133,"Calculate ":134,"div":135,"divi":136," for ":137,"What is the ":138,"riv":139,"ative ":140,"deriv":141,"derivative ":142," and ":143,")/":144,"re":145,"or of ":146,"Is ":147,"). ":148,", ":149,"he":150,"im":151,"pr":152,"prim":153,"2 + ":154,"st common ":155,"fact":156,").;\n":157,"Suppose -":158,"Calculate the ":159," - 2":160,"6":161,"prime ":162," = 0":163," + 2":164,"Solve ":165,"2 - ":166,"or":167,", -":168,"derivative of ":169,"4":170,"10":171,"7":172,"ir":173,"y ":174,"r w":175,"d b":176,"ain":177,"main":178,"the prime ":179,"der w":180,"ded b":181,"is divi":182,"remain":183,"factor":184,"the prime factor":185,"der whe":186,"is divided b":187,"remainder whe":188,"the prime factors ":189,"12":190,"remainder when ":191,"the prime factors of ":192,"is divided by ":193,"min":194,"ti":195,"er":196," is divided by ":197,"Solve -":198,") be ":199,") be the ":200," w":201,"). Let ":202,"le ":203,"mul":204,"ple ":205," - 3":206,"tiple ":207,"multiple ":208,"rt ":209,"multiple of ":210,"8":211," + 3":212,"of -":213,"est common ":214,"11":215," a ":216," wrt ":217," - 2":218,"/2":219,". Suppose ":220," + 2":221,"(-2":222,". Is ":223,"9":224,". What is the ":225,"Fi":226,"Find ":227,"(-1":228,")?;\n":229," - 4":230,"/3":231,"derivative of -":232," + 4":233," - 3":234,"5":235,"eco":236,"seco":237,"second ":238," + 3":239,"0 = ":240,"0 = -":241,"Find the ":242," - -":243,"thir":244,"third ":245,"15":246,". Calculate the ":247,"13":248," + 4":249,"sor of ":250,"divisor of ":251," + -":252,"14":253," - 4*":254,"ghe":255,"hi":256,"ghest common ":257,"highest common ":258,". D":259,"no":260,"deno":261,"common deno":262,"minat":263,"common denominat":264,". Suppose -":265,"1*":266,"ar":267,"What ar":268,"What are ":269,"e?;\n":270,"16":271,"ber":272,"mber":273,"nu":274,"What are the prime factors of ":275,"mber?;\n":276,"number?;\n":277,"Li":278,"List ":279},"merges":[" -","e ","t "," +"," ="," + "," - ",". ","; \n","* ","L e","Le t "," = ",". ;\n","s ","t h"," = -","i v","th e ","2 ","r ","o f",". Let ","d ","? 
;\n","a t"," 2","of ","3 ","d e","o r ","4 ","o s","p os","( -","5 ","S u","p pos","Su ppos","i s ","n ","b e ","n d ","c o"," a","a t ","W h","Wh at ","u l"," be "," - 1"," + 1","e -","co m"," 3","s t ",") = ","What is ","a c","ac t"," f","S o","l v","So lv","a l","iv e ",") = -","at e ","m o","com mo","commo n ","i n","0 ","Suppos e ","C al","c ul","Cal cul","Calcul ate ","d iv","div i"," f or ","What is the ","r iv","at ive ","de riv","deriv ative "," a nd ",") /","r e","or of ","I s ",") . ",", ","h e","i m","p r","pr im","2 + ","st common ","f act",") .;\n","Suppos e -","Calculate the "," - 2","6 ","prim e "," = 0"," + 2","Solv e ","2 - ","o r",", -","derivative of "," 4","1 0","7 ","i r","y ","r w","d b","a in","m ain","the prime ","de r w","de d b","is divi","re main","fact or","the prime factor","der w he","is divi ded b","remain der whe","the prime factor s ","1 2","remainder whe n ","the prime factors of ","is divided b y ","m in","t i","e r"," is divided by ","Solv e -",") be ",") be the "," w",") . Let ","l e ","m ul","p le "," - 3","ti ple ","mul tiple ","r t ","multiple of ","8 "," + 3","of -","e st common ","1 1"," a "," w rt "," - 2","/ 2",". Suppose "," + 2","(- 2",". Is ","9 ",". What is the ","F i","Fi nd ","(- 1",") ?;\n"," - 4","/ 3","derivative of -"," + 4"," - 3"," 5","e co","s eco","seco nd "," + 3","0 = ","0 = -","Find the "," - -","th ir","thir d ","1 5",". Calculate the ","1 3"," + 4","s or of ","divi sor of "," + -","1 4"," - 4","g he","h i","ghe st common ","hi ghest common ",". D","n o","de no","common deno","min at","common deno minat",". Suppose -","1 *","a r","What ar","What ar e ","e ?;\n","1 6","b er","m ber","n u","What are the prime factors of ","mber ?;\n","nu mber?;\n","L i","Li st "]}} |
This is really confusing because I don't think I'm doing anything unusual. Also note that I tried unpickling the tokenizer object and it gives a similar error.
I've had the same issue. Try adding a pre-tokenizer:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=280)
tokenizer.train(trainer, ["preprocessing/corpus/corpus.txt"])

save_to_filepath = 'preprocessing/tokenizer.json'
tokenizer.save(save_to_filepath)
tokenizer = Tokenizer.from_file(save_to_filepath)
```
Any update on this problem? I've had the same issue.
Have you tried the solution proposed by @lukas-blecher to use a pre-tokenizer? I believe this issue is related to this one: #645
Yes, I've used a pre-tokenizer. I find this problem is caused by more than one space in the tokenizer's merges, as mentioned in #645.
Having the same problem. I already have a pre-tokenizer added.
After some fiddling, the problem occurs only when I remove
In case this might be of help to others: if anyone is getting this error, I recommend also taking a look at the dependency requirements (e.g., which version of the tokenizers library is required).
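For instance, a quick way to check which versions are actually installed and compare them against the project's requirements (a generic sketch, not specific to any particular setup):

```python
import tokenizers

# Print the installed versions so they can be compared against the
# versions pinned in the project's requirements.
print("tokenizers", tokenizers.__version__)

try:
    import transformers
    print("transformers", transformers.__version__)
except ImportError:
    print("transformers is not installed")
```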
Yes, @ejohb is right. The problem occurs when using pre_tokenizers.Split() :/
@duskybomb Does the problem still exist on the latest version?
@Narsil Yes, it is still there in the latest version. Also, I am not sure if this is desired or not, but the vocab had
Do you have a simple reproducible script?

```python
from tokenizers import trainers, models, Tokenizer, pre_tokenizers

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(
    special_tokens=["<unk>", "<pad>", "<sep>"],
    vocab_size=8000,
)
tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern=r"\w+|[^\w\s]+", behavior="isolated")
tokenizer.add_special_tokens(["<sep>"])
tokenizer.add_tokens(["<sep>"])


def iterator_over_seqs():
    with open("data/big.txt", "r") as f:
        for line in f:
            yield "ABCEFGH"


tokenizer.train_from_iterator(iterator=iterator_over_seqs(), trainer=trainer)
tokenizer.save("tok.json", pretty=True)
encoded = tokenizer.encode("ABCD<sep>EFGH")
tok = Tokenizer.from_file("tok.json")  # This is what is supposed to fail, no? It doesn't here.
print(encoded.ids)
```
I also encountered the same problem. The JSON file is as follows; please convert the .txt back to .json.
Hi @yechong316, it seems your file contains merges which are not acceptable in the currently deployed version of `tokenizers`. Those merges contain multiple spaces.
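As an illustration, here is a small diagnostic sketch that flags such merges in a saved tokenizer file; the file name is a placeholder, and it assumes the legacy format in which each merge is stored as a single space-separated string:

```python
import json

# Assumption: "tokenizer.json" is the file that fails to load, with
# model.merges stored as a list of "left right" strings (legacy format).
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

bad = [
    m for m in data["model"]["merges"]
    if isinstance(m, str) and len(m.split(" ")) != 2
]

print(f"{len(bad)} merges do not split cleanly on a single space")
for m in bad[:10]:
    print(repr(m))
```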
Just to complement Narsil's point: in case you removed some of your vocab entries, be sure all merges are still possible. If some merges can't be resolved after altering the vocab, it will throw the same error.
Hi, I'm running into the same issue. However, I explicitly want to have multiple whitespace characters in my merges. Could someone point me in the right direction on how I could do this?
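One possible workaround, not taken from this thread but consistent with the constraint described above: normalize runs of spaces into a visible placeholder before training, so the whitespace information is preserved without any merge containing a literal double space. The placeholder character and the pattern below are arbitrary choices.

```python
from tokenizers import Tokenizer, Regex, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Replace every run of two or more spaces with a placeholder token so that
# merges never contain more than one literal space (placeholder is an assumption).
tokenizer.normalizer = normalizers.Replace(Regex(" {2,}"), "\u2581\u2581")
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000)
tokenizer.train_from_iterator(["some  text   with  multiple   spaces"], trainer=trainer)

tokenizer.save("tokenizer.json")
Tokenizer.from_file("tokenizer.json")  # loads, since no merge contains multiple spaces
```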
This is still an issue. To reproduce:

```python
from tokenizers import Tokenizer, models, trainers

bpe_model = models.BPE(unk_token="[UNK]")
tokenizer = Tokenizer(model=bpe_model)
tokenizer.train_from_iterator(
    iterator=["test~ing lick~ing kick~ing"],
    trainer=trainers.BpeTrainer(),
)
path = "my_tokenizer.json"
tokenizer.save(path)
tok_loaded = Tokenizer.from_file(path)
```

In this particular case, there is no pre-tokenizer, so the trained merges end up containing tokens with spaces in them, which is what breaks loading.
Have you checked out the PR that fixes it? It's not going to be merged anytime soon, since it changes the on-disk format of the tokenizer, so we need a compelling reason to go through the pain of making this change. If any model that requires it gets merged into `transformers`, that would be such a reason. In the meantime, the PR should work.
Hi @Narsil, I think I have a very weird issue which seems similar to the error stack trace above in this issue. Here are the steps:
Is this expected? Are tokenizers supposed to be backwards incompatible across different transformers library versions? Installing Python 3.7 from scratch isn't trivial on this instance, so I'd appreciate any help if anything can be done here as a workaround. While training the tokenizer I didn't do anything extravagant, just initialised a tokenizer and trained it. Strangely, the trained model from the Python 3.7 instance is loading perfectly on the Python 3.6 instance, so the issue is only with the tokenizer. @Narsil, requesting your help on this. I can't post the tokenizer here due to confidentiality reasons, but if you need any other info from me to help with this, please feel free to ask.
Can you check your `tokenizers` version in both environments?
I can't tell you exactly what's going on, but it's probably not too hard to modify the 3.7 version of the file to be loadable in your 3.6 environment. Just train a dummy model in the same fashion and look at how it's saved on disk in the old version. Can you do exactly the same thing? I'm not sure if it depends on the options you choose, and whether they were only implemented later. Does that make sense? If you happen to modify a JSON manually, please double check the output of the tokenizer afterwards; it's easy to introduce subtle bugs without realizing.
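For illustration, here is a rough sketch of the dummy-model comparison described above; the file names and the idea of copying missing top-level keys over are assumptions, not the exact fix that ended up being used:

```python
import json

# Assumption: "old_dummy.json" was saved by the older tokenizers version and
# "new_tokenizer.json" by the newer one. Compare their top-level keys to see
# what the older loader expects that the newer file lacks (or vice versa).
with open("old_dummy.json", encoding="utf-8") as f:
    old = json.load(f)
with open("new_tokenizer.json", encoding="utf-8") as f:
    new = json.load(f)

print("keys only in the new file:", set(new) - set(old))
print("keys only in the old file:", set(old) - set(new))

# If the older loader simply requires a key that the newer format dropped or
# renamed, copying the dummy file's value back in may make the file loadable.
for key in set(old) - set(new):
    new[key] = old[key]

with open("patched_tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(new, f, ensure_ascii=False)
```

As noted below, always re-check the encodings produced by the patched file.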
Woohoo, editing the JSON worked! :D Many thanks @Narsil! As a suggestion: should these forward-compatibility changes across tokenizer versions be documented somewhere more specific, so they are easy to find? FYI, I just had to add a missing field.
There's a changelog + releases: https://github.com/huggingface/tokenizers/releases?page=2. That should be enough (but not necessarily easily discoverable). Please triple-check the output ids before claiming victory :)
Sorry, what do you mean by output ids? The output ids of a tokenised sentence on the Python 3.7 instance and the Python 3.6 instance should assert to be equal; do you mean that?
I mean that the encodings are exactly the same on a large enough subset of text (`tokenizer.encode(mystring)`).
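A minimal sketch of such a check; the file names and sample corpus are placeholders, and it assumes both tokenizer files can be loaded in the current environment (otherwise, dump the ids in each environment and diff them):

```python
from tokenizers import Tokenizer

original = Tokenizer.from_file("original_tokenizer.json")
patched = Tokenizer.from_file("patched_tokenizer.json")

# Compare token ids line by line over a sample corpus.
with open("sample_corpus.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        a = original.encode(line).ids
        b = patched.encode(line).ids
        assert a == b, f"mismatch on line {i}: {a} != {b}"

print("encodings match on the sample corpus")
```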
I am having this problem. Here is a reproducible script:

```python
from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Split

# https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
t = """First Citizen:
Before we proceed any further, hear me speak.
..."""

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000, min_frequency=2)
tokenizer.pre_tokenizer = Split(r"\w+|[^\w\s]+", behavior="isolated")
tokenizer.train_from_iterator(
    iterator=[t],
    trainer=trainer,
)
tokenizer.save("tokenizer.json")
```

It works fine if I use the trained tokenizer directly (not loading it from the file):

```python
print(tokenizer.encode("""especially against Caius Marcius?
All:
Against""").tokens)
```

But loading the tokenizer from the file fails:

```python
tokenizer = Tokenizer.from_file("tokenizer.json")
```

```
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[88], line 1
----> 1 tokenizer = Tokenizer.from_file("tokenizer.json")

Exception: data did not match any variant of untagged enum ModelWrapper at line 382 column 3
```

Version:
Can you open a new issue please? It's not really good practice to resurrect old threads, as it pollutes searches with potentially irrelevant content and makes your issue, which is likely a new bug, less discoverable for others. (Of course it's good to search beforehand to prevent duplicates, but when the thread is super old or closed, you can most likely create a new thread and link the old one you found, just in case we want to merge.)
OK, I looked at this issue (I will copy it into a new issue once there's one). The error is because the current tokenizer format expects each merge to be stored as a single string with one space separating the two tokens, so merges whose tokens themselves contain spaces cannot be represented. This wasn't implemented at the time, because changing the format is a pretty risky change for backward compatibility, and there didn't seem to be any real-world use case.
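To illustrate the constraint (a sketch based on the explanation above, not code from the library, whose parser is written in Rust and surfaces the `ModelWrapper` enum error instead): in the legacy format each merge is one string that gets split on a space, so a merge whose left token itself ends with a space cannot be parsed unambiguously.

```python
# Legacy-style merge entries are single strings, split on one space into (left, right).
def parse_merge(entry: str):
    left, right = entry.split(" ")  # fails unless there is exactly one space
    return left, right

print(parse_merge("th e"))  # fine: ('th', 'e')

try:
    # The left token ends with a space, so splitting yields three fields.
    print(parse_merge("ing  lick"))
except ValueError as err:
    print("cannot parse:", err)
```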
I had the same error when loading Llama 2 models. Upgrading to transformers==4.33.2 and tokenizers==0.13.3 solved it for me.