`word_to_chars()` doesn't work as expected for Llama3.1-8b #33904

neurothew · 2024-10-03T05:12:17Z

System Info

transformers version: 4.45.1
Platform: Linux-5.15.0-92-generic-x86_64-with-glibc2.35
Python version: 3.12.4
Huggingface_hub version: 0.25.1
Safetensors version: 0.4.3
Accelerate version: 0.32.1
Accelerate config: not found
PyTorch version (GPU?): 2.3.1 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?:
GPU type: NVIDIA RTX 6000 Ada Generation

Who can help?

@ArthurZucker @itazap

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

I am trying to recover each "word", as defined by word_ids(), by looking up its character span with word_to_chars().

from transformers import AutoTokenizer
model_name = "meta-llama/Meta-Llama-3.1-8B"
this_tokenizer = AutoTokenizer.from_pretrained(model_name)

this_sent = "Hello World!"
this_encode = this_tokenizer.encode_plus(this_sent)
print(this_encode.word_to_chars(0))

And the output is:

CharSpan(start=0, end=0)

It doesn't happen with some other models such as BERT:

from transformers import AutoTokenizer
model_name = "bert-base-uncased"
this_tokenizer = AutoTokenizer.from_pretrained(model_name)

this_sent = "Hello World!"
this_encode = this_tokenizer.encode_plus(this_sent)
print(this_encode.word_to_chars(0))

With the output being:

CharSpan(start=0, end=5)

The word "Hello" can then be extracted with this_sent[0:5]. I wonder whether this has something to do with the tokenizer? I have tried BERT, RoBERTa, GPT-2, and Qwen2.5 so far, and none of them had this problem.

For Llama models, I have tried llama3-8b, llama3.1-8b, llama3.2-1b and llama3.2-3b without success.
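To make the symptom easier to check across tokenizers, here is a minimal sketch of how the words could be pulled out of a sentence using only word_ids() and word_to_chars(), with a flag for empty spans such as CharSpan(start=0, end=0). The names extract_words, has_empty_span, and FakeEncoding are hypothetical helpers, not part of transformers; FakeEncoding and the local CharSpan are stand-ins so the sketch runs without downloading a model, and the span values beyond the two reported above are illustrative assumptions.

```python
from typing import NamedTuple


class CharSpan(NamedTuple):
    # Local stand-in with the same fields as the CharSpan printed above.
    start: int
    end: int


class FakeEncoding:
    """Hypothetical stand-in exposing only word_ids() and word_to_chars()."""

    def __init__(self, word_ids, spans):
        self._word_ids = word_ids
        self._spans = spans

    def word_ids(self):
        return self._word_ids

    def word_to_chars(self, word_index):
        return self._spans[word_index]


def extract_words(text, encoding):
    """Return (word, span) pairs, one per distinct word index."""
    seen = set()
    results = []
    for word_id in encoding.word_ids():
        # Special tokens have no word index; repeated ids are sub-word pieces.
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        span = encoding.word_to_chars(word_id)
        results.append((text[span.start:span.end], span))
    return results


def has_empty_span(results):
    """True if any word maps to a zero-length span (the symptom reported here)."""
    return any(span.start == span.end for _, span in results)


sent = "Hello World!"

# Illustrative values shaped like the BERT result (word 0 -> 0..5).
good = FakeEncoding(
    [None, 0, 1, 2, None],
    {0: CharSpan(0, 5), 1: CharSpan(6, 11), 2: CharSpan(11, 12)},
)
# Only the span reported for Llama above: word 0 -> 0..0.
bad = FakeEncoding([0], {0: CharSpan(0, 0)})

good_words = extract_words(sent, good)
bad_words = extract_words(sent, bad)
print([w for w, _ in good_words], has_empty_span(good_words))
print([w for w, _ in bad_words], has_empty_span(bad_words))
```

With a real tokenizer, the object returned by encode_plus() could be passed in place of FakeEncoding, since the helpers only rely on those two methods.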

Expected behavior

word_to_chars() should return the correct character span for Llama models.


ArthurZucker · 2024-10-03T15:20:57Z

Hey! This is a duplicate of #33675 and should now be fixed. I'll push a new version of tokenizers to propagate this (huggingface/tokenizers#1640)

ArthurZucker · 2024-10-10T09:56:57Z

https://github.com/huggingface/tokenizers/releases/tag/v0.20.1 Release is out, closing!
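For anyone landing here later, a rough way to check whether the installed tokenizers package is at least the 0.20.1 release mentioned above. This is only a sketch: parse_release and has_fix are hypothetical helper names, and the comparison looks at leading numeric components only, so it ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Parse leading numeric components, e.g. '0.20.1' -> (0, 20, 1)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(installed, fixed="0.20.1"):
    """True if the installed version is at or above the release named in this thread."""
    return parse_release(installed) >= parse_release(fixed)


try:
    installed = version("tokenizers")
    status = "includes" if has_fix(installed) else "predates"
    print(f"tokenizers {installed} {status} the 0.20.1 release")
except PackageNotFoundError:
    print("tokenizers is not installed")
```

If the check reports an older version, upgrading tokenizers and re-running the reproduction above should show whether the span is still empty.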

neurothew added the bug label Oct 3, 2024

ArthurZucker closed this as completed Oct 10, 2024
