Tokenizer does not split text according to newly added input tokens #35447

Open
jiongjiongli opened this issue Dec 29, 2024 · 0 comments
Labels
bug, Core: Tokenization (Internals of the library; Tokenization)

Comments

jiongjiongli commented Dec 29, 2024

System Info

  • transformers version: 4.47.1
  • Platform: Linux-6.1.85+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.27.0
  • Safetensors version: 0.4.5
  • Accelerate version: 1.2.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu121 (False)
  • Tensorflow version (GPU?): 2.17.1 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.8.5 (cpu)
  • Jax version: 0.4.33
  • JaxLib version: 0.4.33

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Repro Steps:

  1. Initialize a tokenizer with use_fast=False
  2. Add new tokens "red" and "e" to the tokenizer vocabulary using add_tokens()
  3. Try to tokenize the word "read"
  • Expected: The tokenizer should split "read" into ['r', 'e', 'ad'] since "e" is now a token
  • Actual: The tokenizer keeps "read" as a single token, ignoring the newly added vocabulary

Repro Code:

from transformers import AutoTokenizer


def test(text, input_tokens):
    # Load the slow (pure-Python) tokenizer; the bug is reported for use_fast=False.
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
    # Register the new tokens; added tokens should be split out of the input
    # before the model's own tokenization runs.
    tokenizer.add_tokens(input_tokens)
    output_tokens = tokenizer.tokenize(text)
    print(f"Output tokens: {output_tokens}")


model_id = "google-bert/bert-base-cased"
test("read", ["red", "e"])

  • Expected Output: Output tokens: ['r', 'e', 'ad']
  • Actual Output: Output tokens: ['read']

Expected behavior

The tokenizer should split "read" into ['r', 'e', 'ad'] since "e" is now a token.
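If the root cause is in the slow tokenizer's added-token trie (as the linked commits below suggest), the behavior should be reproducible directly on transformers.tokenization_utils.Trie. A minimal check, assuming the issue lives in Trie.split:

from transformers.tokenization_utils import Trie

# The slow tokenizer keeps added tokens in a Trie and uses Trie.split()
# to cut the input text on those tokens before the model's own
# tokenization (WordPiece for BERT) runs.
trie = Trie()
trie.add("red")
trie.add("e")

# Expected: ['r', 'e', 'ad'] -- the partial match for "red" fails at 'a',
# but the completed match for "e" should still apply.
# If Trie.split is at fault, this should instead return ['read'].
print(trie.split("read"))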

jiongjiongli added a commit to jiongjiongli/transformers that referenced this issue Dec 29, 2024
…g to newly added input tokens

The root cause is that the Trie.split method did not ignore a partial match that should have been removed.
@jiongjiongli jiongjiongli changed the title LlamaTokenizer does not split text according to newly added input tokens Tokenizer does not split text according to newly added input tokens Dec 29, 2024
@LysandreJik LysandreJik added the Core: Tokenization (Internals of the library; Tokenization) label Dec 29, 2024
jiongjiongli added further commits to jiongjiongli/transformers that referenced this issue between Dec 29, 2024 and Jan 1, 2025
…newly added input tokens

The root cause is that the Trie.split method did not ignore a partial match that should have been removed.

Add test case to token split
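For reference, a minimal greedy splitter over a plain list of added tokens (an illustrative sketch, not the library's Trie implementation) produces the split the report expects:

def split_on_added_tokens(text, added_tokens):
    """Greedy left-to-right scan: at each position take the longest added
    token that matches there; otherwise keep the character as plain text."""
    pieces, buffer, i = [], "", 0
    while i < len(text):
        # Longest added token starting at position i, if any.
        match = max(
            (tok for tok in added_tokens if text.startswith(tok, i)),
            key=len,
            default=None,
        )
        if match is None:
            buffer += text[i]
            i += 1
        else:
            if buffer:
                pieces.append(buffer)
                buffer = ""
            pieces.append(match)
            i += len(match)
    if buffer:
        pieces.append(buffer)
    return pieces


print(split_on_added_tokens("read", ["red", "e"]))  # ['r', 'e', 'ad']

This sketch sidesteps partial-match bookkeeping entirely by testing whole tokens at each offset, which is why it still finds the "e" split even though "red" never completes; the character-by-character Trie.split has to discard the failed partial match for "red" without also dropping the completed match for "e".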