GemmaTokenizerFast word_ids() returns only zeros #31437
Comments
Hey! Will have a look, thanks for reporting!
It seems that we need this:

```python
from tokenizers.pre_tokenizers import Sequence, Split

tokenizer._tokenizer.pre_tokenizer = Sequence([Split("▁", "merged_with_next")])
encoded = tokenizer(sentence, return_tensors="pt")
print(encoded.word_ids())
# [None, 0, 1, 2, 3]
```
#32191 should fix this!
Not included in #32191 since the proposed fix breaks encoding. Will need to do it in a follow-up PR :)
Small update:

```python
from tokenizers import Regex, pre_tokenizers

pre_tokenizer = pre_tokenizers.Split(Regex('(?<!▁)▁'), "merged_with_next")
pre_tokenizer.pre_tokenize_str(sentence.replace(" ", "▁"))
# Out[41]: [('I', (0, 1)), ('▁love', (1, 6)), ('▁my', (6, 9)), ('▁cat', (9, 13))]

tokenizer(sentence).tokens()
# Out[48]: ['<bos>', 'I', '▁love', '▁my', '▁cat']

tokenizer(sentence).word_ids()
# Out[49]: [None, 0, 1, 2, 3]
```

This gives somewhat acceptable results.
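For completeness, a minimal end-to-end sketch of wiring this custom pre-tokenizer into the Gemma fast tokenizer. The checkpoint name and sentence are assumptions, and overriding `_tokenizer.pre_tokenizer` changes the tokenization output, so treat this as a workaround rather than a supported API:

```python
from transformers import AutoTokenizer
from tokenizers import Regex, pre_tokenizers

# Illustrative checkpoint; any GemmaTokenizerFast checkpoint should behave similarly.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

# Override the backing tokenizer's pre-tokenizer as suggested above.
tokenizer._tokenizer.pre_tokenizer = pre_tokenizers.Split(
    Regex('(?<!▁)▁'), "merged_with_next"
)

sentence = "I love my cat"  # assumed example sentence
encoded = tokenizer(sentence)
print(encoded.tokens())    # e.g. ['<bos>', 'I', '▁love', '▁my', '▁cat']
print(encoded.word_ids())  # e.g. [None, 0, 1, 2, 3]
```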
But I don't recommend word ids; the separations are "brittle", and this changes the output of the tokenization.
Is there a better way to link the tokens to words than word ids?
Offsets!
The offset mapping gives you the exact place in the original string that each token corresponds to.
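A short sketch of using offsets instead of word ids (the checkpoint name and sentence are assumptions; `return_offsets_mapping=True` requires a fast tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")  # illustrative checkpoint

sentence = "I love my cat"
encoded = tokenizer(sentence, return_offsets_mapping=True)

# Each (start, end) pair points back into the original string,
# so you can recover the exact text span each token covers.
for token, (start, end) in zip(encoded.tokens(), encoded["offset_mapping"]):
    print(token, (start, end), repr(sentence[start:end]))
```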
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

`transformers` version: 4.41.2

Who can help?

@ArthurZucker
Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
The method `word_ids()` only returns a list of zeros instead of the correct word ids. I tried several variations of the configurations stated in the linked issues in #28881, but for Gemma it doesn't change the result. The llama3 tokenizer outputs the correct values with this code.
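The reproduction code itself is not shown above; a minimal sketch of what such a reproduction might look like, with the checkpoint name and sentence as assumptions:

```python
from transformers import AutoTokenizer

# Assumed checkpoint; the report is about GemmaTokenizerFast in general.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

encoded = tokenizer("I love my cat", return_tensors="pt")
print(encoded.word_ids())
# Reported: a list of zeros (e.g. [None, 0, 0, 0, 0] after the <bos> token)
# Expected: [None, 0, 1, 2, 3]
```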
Expected behavior

The output of `word_ids()` should look like `[None, 0, 1, 2, 3]`.