
Custom USER_DEFINED symbols added to LlamaTokenizer are not tokenized correctly by LlamaTokenizerFast #26670

Closed
qiugen opened this issue Oct 8, 2023 · 3 comments

Comments


qiugen commented Oct 8, 2023

System Info

transformers-cli env

- `transformers` version: 4.33.0.dev0
- Platform: Linux-4.18.0-240.el8.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.0
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- Accelerate version: 0.20.3
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

The vocabulary was extended following this reference script:
https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py
Part of the USER_DEFINED symbol list is shown below; a minimal sketch of the merge approach follows the list.

<pad>
[INST]
[/INST]
[REWARD]
<<SYS>>
<</SYS>>
[CLS]
[SEP]
[RESERVED_0]
[RESERVED_1]
[RESERVED_2]
[RESERVED_3]
[RESERVED_4]
[RESERVED_5]
[RESERVED_6]
[RESERVED_7]
[RESERVED_8]
[RESERVED_9]
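
For context, here is a minimal sketch (not the exact referenced script) of how such USER_DEFINED pieces are typically appended to the base LLaMA sentencepiece model; all file paths here are illustrative:

```python
# Sketch of the merge approach: append USER_DEFINED pieces to the base
# LLaMA sentencepiece model. Paths are illustrative, not from the issue.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

sp = spm.SentencePieceProcessor()
sp.Load("llama/tokenizer.model")  # base LLaMA tokenizer (illustrative path)

proto = sp_pb2_model.ModelProto()
proto.ParseFromString(sp.serialized_model_proto())

new_symbols = ["<pad>", "[INST]", "[/INST]", "[REWARD]", "<<SYS>>", "<</SYS>>",
               "[CLS]", "[SEP]"] + [f"[RESERVED_{i}]" for i in range(10)]
existing = {p.piece for p in proto.pieces}
for sym in new_symbols:
    if sym not in existing:
        piece = sp_pb2_model.ModelProto.SentencePiece()
        piece.piece = sym
        piece.score = 0.0
        piece.type = sp_pb2_model.ModelProto.SentencePiece.USER_DEFINED
        proto.pieces.append(piece)

with open("merged/tokenizer.model", "wb") as f:  # illustrative output path
    f.write(proto.SerializeToString())
```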

I found that LlamaTokenizerFast cannot segment these user-defined symbols correctly.

text='''<<SYS>>背诵<</SYS>>[SEP]白日依山尽,黄河入海流。欲穷千里目,更上一层楼。通过学习这首诗掌握不了䏦䮰
The primary use of LLaMA is research on large language models, including[CLS]
[INST]test[/INST]
test of [REWARD]
test sp1 [RESERVED_0] 
test sp2 [RESERVED_1]
test sp2 [RESERVED_11]
<pad>
'''

Tokenizing the text above with LlamaTokenizer gives:

['▁', '<<SYS>>', '背', '诵', '<</SYS>>', '[SEP]', '白', '日', '依', '山', '尽', ',', '黄', '河', '入', '海', '流', '。', '欲', '穷', '千', '里', '目', ',', '更', '上', '一', '层', '楼', '。', '通', '过', '学', '习', '这', '首', '诗', '掌', '握', '不', '了', '<0xE4>', '<0x8F>', '<0xA6>', '<0xE4>', '<0xAE>', '<0xB0>', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including', '[CLS]', '<0x0A>', '[INST]', 'test', '[/INST]', '<0x0A>', 'test', '▁of', '▁', '[REWARD]', '<0x0A>', 'test', '▁sp', '1', '▁', '[RESERVED_0]', '▁', '<0x0A>', 'test', '▁sp', '2', '▁', '[RESERVED_1]', '<0x0A>', 'test', '▁sp', '2', '▁[', 'RE', 'SER', 'V', 'ED', '_', '1', '1', ']', '<0x0A>', '<pad>', '<0x0A>']

Tokenizing the same text with LlamaTokenizerFast gives:

['<s>', '▁<<', 'SY', 'S', '>>', '背', '诵', '<', '</', 'SY', 'S', '>>', '[', 'SE', 'P', ']', '白', '日', '依', '山', '尽', ',', '黄', '河', '入', '海', '流', '。', '欲', '穷', '千', '里', '目', ',', '更', '上', '一', '层', '楼', '。', '通', '过', '学', '习', '这', '首', '诗', '掌', '握', '不', '了', '<0xE4>', '<0x8F>', '<0xA6>', '<0xE4>', '<0xAE>', '<0xB0>', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including', '[', 'CL', 'S', ']', '<0x0A>', '[', 'INST', ']', 'test', '[', '/', 'INST', ']', '<0x0A>', 'test', '▁of', '▁[', 'RE', 'W', 'ARD', ']', '<0x0A>', 'test', '▁sp', '1', '▁[', 'RE', 'SER', 'V', 'ED', '_', '0', ']', '▁', '<0x0A>', 'test', '▁sp', '2', '▁[', 'RE', 'SER', 'V', 'ED', '_', '1', ']', '<0x0A>', 'test', '▁sp', '2', '▁[', 'RE', 'SER', 'V', 'ED', '_', '1', '1', ']', '<0x0A>', '<', 'pad', '>', '<0x0A>']

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

(screenshot of the reproduction code and its output)
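
A minimal text-only reproduction sketch of the comparison, assuming an illustrative path to the merged tokenizer:

```python
# Compare slow vs. fast tokenization of the user-defined symbols.
# "path/to/merged_tokenizer" is illustrative.
from transformers import LlamaTokenizer, LlamaTokenizerFast

slow = LlamaTokenizer.from_pretrained("path/to/merged_tokenizer")
fast = LlamaTokenizerFast.from_pretrained("path/to/merged_tokenizer")

text = "<<SYS>>test<</SYS>>[INST]test[/INST]"
print(slow.tokenize(text))  # keeps '<<SYS>>', '[INST]', ... as single pieces
print(fast.tokenize(text))  # splits them, e.g. '<<', 'SY', 'S', '>>'
```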

Expected behavior

For the same text, LlamaTokenizer and LlamaTokenizerFast should handle USER_DEFINED symbols identically and produce the same pieces.

amyeroberts (Collaborator) commented:

cc @ArthurZucker

ArthurZucker (Collaborator) commented:

Hey, this is a duplicate of #27132, #26871, #25232, and #23833. The token's normalized field should be set to False instead of True.
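
For anyone hitting this, a sketch of the suggested workaround (the tokenizer path is illustrative): re-register the symbols as AddedToken with normalized=False so the fast tokenizer matches them before normalization, like the slow tokenizer does.

```python
# Workaround sketch: re-add the user-defined symbols with normalized=False.
# Depending on your transformers version, you may instead need to edit the
# "normalized" flag of the added tokens in tokenizer.json directly.
from transformers import AddedToken, LlamaTokenizerFast

tok = LlamaTokenizerFast.from_pretrained("path/to/merged_tokenizer")
symbols = ["<<SYS>>", "<</SYS>>", "[INST]", "[/INST]", "[REWARD]",
           "[CLS]", "[SEP]"] + [f"[RESERVED_{i}]" for i in range(10)]
tok.add_tokens([AddedToken(s, normalized=False) for s in symbols],
               special_tokens=True)
```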


github-actions bot commented Dec 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
