GemmaTokenizerFast word_ids() returns only zeros #31437

Alienmaster · 2024-06-15T10:15:55Z

System Info


transformers version: 4.41.2
Platform: Linux-5.15.0-86-generic-x86_64-with-glibc2.31
Python version: 3.10.13
Huggingface_hub version: 0.23.1
Safetensors version: 0.4.2
Accelerate version: 0.28.0
Accelerate config: not found
PyTorch version (GPU?): 2.2.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

The method word_ids() returns only a list of zeros instead of the correct word IDs.

from transformers import AutoTokenizer

sentence = "I love my cat"
tokenizer = AutoTokenizer.from_pretrained("google/Gemma-7b")  # version a0eac5b
encoded = tokenizer(sentence, return_tensors="pt")
print(encoded.word_ids())
# Actual output: [None, 0, 0, 0, 0]

I tried several of the configuration variations described in the issues linked from #28881, but for Gemma they don't change the result. The llama3 tokenizer outputs the correct values with this code.

Expected behavior

The output of word_ids() should look like:
[None, 0, 1, 2, 3]


ArthurZucker · 2024-06-19T11:59:45Z

Hey! Will have a look, thanks for reporting.

ArthurZucker · 2024-07-16T08:42:58Z

It seems that we need this:

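from tokenizers.pre_tokenizers import Sequence, Split

# Split on the "▁" marker and attach each marker to the piece that follows it.
tokenizer._tokenizer.pre_tokenizer = Sequence([Split("▁", "merged_with_next")])
encoded = tokenizer(sentence, return_tensors="pt")
print(encoded.word_ids())
# [None, 0, 1, 2, 3]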

ArthurZucker · 2024-07-30T19:55:12Z

#32191 should fix this!

xenova · 2024-07-30T21:37:41Z

Not included in #32191 since the proposed fix breaks encoding. Will need to do this in a follow-up PR :)

ArthurZucker · 2024-07-31T07:21:37Z

Small update:

from tokenizers import Regex, pre_tokenizers

# Split only on a "▁" that is not itself preceded by "▁".
pre_tokenizer = pre_tokenizers.Split(Regex('(?<!▁)▁'), "merged_with_next")
pre_tokenizer.pre_tokenize_str(sentence.replace(" ", "▁"))
Out[41]: [('I', (0, 1)), ('▁love', (1, 6)), ('▁my', (6, 9)), ('▁cat', (9, 13))]
tokenizer(sentence).tokens()
Out[48]: ['<bos>', 'I', '▁love', '▁my', '▁cat']
tokenizer(sentence).word_ids()
Out[49]: [None, 0, 1, 2, 3]

This gives somewhat acceptable results.
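To make the effect of that pattern easier to see without loading the Gemma tokenizer, here is a rough pure-Python emulation of a "merged_with_next" split. This is only a sketch: `split_merged_with_next` is a hypothetical helper written for illustration, it uses Python's `re` module rather than the `tokenizers` library, and it may differ from the real pre-tokenizer on edge cases.

```python
import re


def split_merged_with_next(text, pattern):
    """Hypothetical helper: split `text` so that each match of `pattern`
    starts a new piece (the delimiter is merged with what follows it)."""
    # Every match start is a piece boundary; add the string's two ends.
    starts = [m.start() for m in re.finditer(pattern, text)]
    bounds = sorted(set([0] + starts + [len(text)]))
    # Pair consecutive boundaries into (substring, (start, end)) pieces.
    return [(text[a:b], (a, b)) for a, b in zip(bounds, bounds[1:]) if a < b]


sentence = "I love my cat"
# Same pattern as above: a "▁" that is not preceded by another "▁".
pieces = split_merged_with_next(sentence.replace(" ", "▁"), r"(?<!▁)▁")
print(pieces)
# [('I', (0, 1)), ('▁love', (1, 6)), ('▁my', (6, 9)), ('▁cat', (9, 13))]
```

Each piece becomes one "word" for word_ids(), which is why the tokens are then numbered 0 to 3 instead of all being assigned to word 0.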

ArthurZucker · 2024-07-31T09:36:28Z

But I don't recommend word IDs: the separations are "brittle", and this changes the output of the tokenization.

Alienmaster · 2024-08-01T12:15:10Z

Is there a better way to link the tokens to words than word_ids?

ArthurZucker · 2024-08-01T12:18:02Z

Offsets!

ArthurZucker · 2024-08-01T12:18:26Z

The offset mapping gives you the exact span of the original string that each token corresponds to.
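A minimal sketch of how offsets could be turned into word indices, assuming that "words" are whitespace-separated chunks of the input. `word_ids_from_offsets` is a hypothetical helper, not part of transformers, and the offsets below are hard-coded for illustration (they mirror the spans shown earlier in this thread, not verified tokenizer output). With a fast tokenizer, the real values would come from `tokenizer(text, return_offsets_mapping=True)["offset_mapping"]`.

```python
import re


def word_ids_from_offsets(text, offsets):
    """Hypothetical helper: map each token's (start, end) character span
    to the index of the whitespace-separated word it starts in."""
    # Character spans of the whitespace-separated words in the original text.
    word_spans = [m.span() for m in re.finditer(r"\S+", text)]
    word_ids = []
    for start, end in offsets:
        # A token span may include a leading space (e.g. "▁love"); skip it.
        while start < end and text[start].isspace():
            start += 1
        if start >= end:
            # Empty or whitespace-only span, e.g. a special token like <bos>.
            word_ids.append(None)
            continue
        # Find the word whose span contains the token's first real character.
        word_ids.append(
            next((i for i, (ws, we) in enumerate(word_spans) if ws <= start < we), None)
        )
    return word_ids


text = "I love my cat"
# Assumed offsets for ['<bos>', 'I', '▁love', '▁my', '▁cat'] (illustrative only).
offsets = [(0, 0), (0, 1), (1, 6), (6, 9), (9, 13)]
print(word_ids_from_offsets(text, offsets))
# [None, 0, 1, 2, 3]
```

Because the mapping is computed from character positions in the original string, it does not require changing the tokenizer's pre-tokenizer and so leaves the tokenization output untouched.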

github-actions · 2024-08-26T08:05:37Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

huggingface deleted a comment from github-actions bot Jul 16, 2024

github-actions bot closed this as completed Sep 13, 2024
