word_to_chars() doesn't work as expected for Llama3.1-8b #33904
System Info
transformers version: 4.45.1
Who can help?
@ArthurZucker @itazap
Reproduction
I am trying to retrieve the "word" as defined by word_ids() by retrieving its character span.
And the output is:
It doesn't happen with some other models such as BERT:
With the output being:
And the word "Hello" can be extracted via this_sent[0:5] easily. I wonder if it might have something to do with the tokenizer? I have tried BERT, RoBERTa, GPT-2, and Qwen2.5 so far, and there were no problems.
For Llama models, I have tried llama3-8b, llama3.1-8b, llama3.2-1b and llama3.2-3b without success.
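For anyone needing the span in the meantime, a possible workaround is to derive it from word_ids() and the offset mapping rather than from word_to_chars(). The sketch below is only an illustration: the helper name word_span_from_offsets is hypothetical, the sample word_ids and offsets are hand-written to mimic a BERT-style encoding of "Hello world", and it assumes that word_ids() and the offset mapping are themselves correct for the tokenizer in question, which has not been verified here for the Llama tokenizers.

```python
from typing import List, Optional, Sequence, Tuple


def word_span_from_offsets(
    word_ids: Sequence[Optional[int]],
    offsets: Sequence[Tuple[int, int]],
    word_index: int,
) -> Tuple[int, int]:
    """Return the (start, end) character span covering every token of one word.

    Hypothetical helper: it merges the per-token offsets of all tokens whose
    word id equals `word_index`. Special tokens carry a word id of None and
    are skipped.
    """
    spans: List[Tuple[int, int]] = [
        (start, end)
        for wid, (start, end) in zip(word_ids, offsets)
        if wid == word_index
    ]
    if not spans:
        raise ValueError(f"no tokens found for word index {word_index}")
    # The word starts where its first token starts and ends where its last ends.
    return min(s for s, _ in spans), max(e for _, e in spans)


# Hand-written illustrative data (an assumption, not captured tokenizer output):
# [CLS] Hello world [SEP]
this_sent = "Hello world"
word_ids = [None, 0, 1, None]
offsets = [(0, 0), (0, 5), (6, 11), (0, 0)]

start, end = word_span_from_offsets(word_ids, offsets, 0)
print(this_sent[start:end])
```

With a real fast tokenizer, the two inputs would come from something like enc = tokenizer(this_sent, return_offsets_mapping=True), followed by enc.word_ids() and enc["offset_mapping"].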
Expected behavior
word_to_chars() should give the correct character span for Llama models.