I found that LlamaTokenizerFast cannot segment the following text correctly:
```python
text = '''<<SYS>>背诵<</SYS>>[SEP]白日依山尽,黄河入海流。欲穷千里目,更上一层楼。通过学习这首诗掌握不了䏦䮰
The primary use of LLaMA is research on large language models, including[CLS]
[INST]test[/INST]
test of [REWARD]
test sp1 [RESERVED_0]
test sp2 [RESERVED_1]
test sp2 [RESERVED_11]
<pad>'''
```
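A minimal reproduction sketch of the comparison; `merged_tokenizer` is a placeholder for the directory holding the extended tokenizer files, not a path from the original report:

```python
# Reproduction sketch: compare the slow (SentencePiece-based) and fast
# (tokenizers-based) LLaMA tokenizers on a string containing a USER_DEFINED
# symbol. "merged_tokenizer" is a placeholder path.
from transformers import LlamaTokenizer, LlamaTokenizerFast

path = "merged_tokenizer"
slow = LlamaTokenizer.from_pretrained(path)
fast = LlamaTokenizerFast.from_pretrained(path)

sample = "test sp1 [RESERVED_0]"
print(slow.tokenize(sample))  # pieces produced by the slow tokenizer
print(fast.tokenize(sample))  # pieces produced by the fast tokenizer
```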
The text shown above, tokenized by LlamaTokenizer, gives the following result:
System Info
`transformers-cli env`
The vocabulary was extended using the following reference method.
https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py
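The core of that reference method appends pieces from a second SentencePiece model into the LLaMA model proto. A rough sketch of the idea, with assumed file names rather than a copy of the script:

```python
# Sketch of the merge approach (file names are assumptions): pieces from a
# Chinese SentencePiece model that are missing from the LLaMA vocabulary are
# appended to the LLaMA model proto.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

llama_sp = spm.SentencePieceProcessor(model_file="llama/tokenizer.model")
chinese_sp = spm.SentencePieceProcessor(model_file="chinese_sp.model")

llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_sp.serialized_model_proto())
chinese_proto = sp_pb2.ModelProto()
chinese_proto.ParseFromString(chinese_sp.serialized_model_proto())

existing = {p.piece for p in llama_proto.pieces}
for p in chinese_proto.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        # Special markers like [SEP] would typically be added with the
        # USER_DEFINED piece type so they tokenize as single symbols.
        llama_proto.pieces.append(new_piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```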
Part of the USER_DEFINED character list is shown below:
Tokenized by LlamaTokenizerFast, the result is:
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
For the text above, LlamaTokenizer and LlamaTokenizerFast should behave identically on USER_DEFINED symbols and produce the same pieces.
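In other words, a check along these lines should pass (a sketch; the path is the same placeholder as above):

```python
from transformers import LlamaTokenizer, LlamaTokenizerFast

path = "merged_tokenizer"  # placeholder for the extended tokenizer directory
sample = "test sp1 [RESERVED_0]"
slow_pieces = LlamaTokenizer.from_pretrained(path).tokenize(sample)
fast_pieces = LlamaTokenizerFast.from_pretrained(path).tokenize(sample)
# Both tokenizers should yield identical pieces for USER_DEFINED symbols.
assert slow_pieces == fast_pieces, f"slow={slow_pieces} fast={fast_pieces}"
```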