Tokenizer does not split text according to newly added input tokens #35447

Open
jiongjiongli opened this issue Dec 29, 2024 · 0 comments
Labels
bug, Core: Tokenization (Internals of the library; Tokenization)

Comments

jiongjiongli commented Dec 29, 2024

System Info

  • transformers version: 4.47.1
  • Platform: Linux-6.1.85+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.27.0
  • Safetensors version: 0.4.5
  • Accelerate version: 1.2.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu121 (False)
  • Tensorflow version (GPU?): 2.17.1 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.8.5 (cpu)
  • Jax version: 0.4.33
  • JaxLib version: 0.4.33

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Repro Steps:

  1. Initialize a tokenizer with use_fast=False
  2. Add new tokens "red" and "e" to the tokenizer vocabulary using add_tokens()
  3. Try to tokenize the word "read"
  • Expected: The tokenizer should split "read" into ['r', 'e', 'ad'] since "e" is now a token
  • Actual: The tokenizer keeps "read" as a single token, ignoring the newly added vocabulary

Repro Code:

from transformers import AutoTokenizer


def test(text, input_tokens):
    # Load the slow (pure-Python) tokenizer; the bug is reported for use_fast=False.
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
    # Register the new tokens; added tokens should be split out of the input
    # before the model's own tokenization runs.
    tokenizer.add_tokens(input_tokens)
    output_tokens = tokenizer.tokenize(text)
    print(f"Output tokens: {output_tokens}")


model_id = "google-bert/bert-base-cased"
test("read", ["red", "e"])

  • Expected Output: Output tokens: ['r', 'e', 'ad']
  • Actual Output: Output tokens: ['read']

Expected behavior

The tokenizer should split "read" into ['r', 'e', 'ad'] since "e" is now a token.
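If the root cause is in the slow tokenizer's added-token trie (as the linked commits below suggest), the behavior should be reproducible directly on transformers.tokenization_utils.Trie. A minimal check, assuming the issue lives in Trie.split:

from transformers.tokenization_utils import Trie

# The slow tokenizer keeps added tokens in a Trie and uses Trie.split()
# to cut the input text on those tokens before the model's own
# tokenization (WordPiece for BERT) runs.
trie = Trie()
trie.add("red")
trie.add("e")

# Expected: ['r', 'e', 'ad'] -- the partial match for "red" fails at 'a',
# but the completed match for "e" should still apply.
# If Trie.split is at fault, this should instead return ['read'].
print(trie.split("read"))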

jiongjiongli added a commit to jiongjiongli/transformers that referenced this issue Dec 29, 2024
…g to newly added input tokens

The root cause is that the Trie.split method did not ignore a partial match that should have been removed.
@jiongjiongli jiongjiongli changed the title LlamaTokenizer does not split text according to newly added input tokens Tokenizer does not split text according to newly added input tokens Dec 29, 2024
@LysandreJik LysandreJik added the Core: Tokenization (Internals of the library; Tokenization) label Dec 29, 2024
jiongjiongli added further commits to jiongjiongli/transformers that referenced this issue between Dec 29, 2024 and Jan 1, 2025
…newly added input tokens

The root cause is that the Trie.split method did not ignore a partial match that should have been removed.

Add test case to token split
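For reference, a minimal greedy splitter over a plain list of added tokens (an illustrative sketch, not the library's Trie implementation) produces the split the report expects:

def split_on_added_tokens(text, added_tokens):
    """Greedy left-to-right scan: at each position take the longest added
    token that matches there; otherwise keep the character as plain text."""
    pieces, buffer, i = [], "", 0
    while i < len(text):
        # Longest added token starting at position i, if any.
        match = max(
            (tok for tok in added_tokens if text.startswith(tok, i)),
            key=len,
            default=None,
        )
        if match is None:
            buffer += text[i]
            i += 1
        else:
            if buffer:
                pieces.append(buffer)
                buffer = ""
            pieces.append(match)
            i += len(match)
    if buffer:
        pieces.append(buffer)
    return pieces


print(split_on_added_tokens("read", ["red", "e"]))  # ['r', 'e', 'ad']

This sketch sidesteps partial-match bookkeeping entirely by testing whole tokens at each offset, which is why it still finds the "e" split even though "red" never completes; the character-by-character Trie.split has to discard the failed partial match for "red" without also dropping the completed match for "e".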