System Info

transformers version: 4.35.2

Who can help?

@ArthurZucker

Information

Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

Expected behavior

I faced an issue similar to one raised in a question on the HF forum, where the OP trained the tokenizer with `user_defined_symbols`; in my case, I added the symbols to the SentencePiece model file directly, without training.

I note that I could just use the `add_tokens` method to achieve the same outcome, but because of another issue I raised (#28218), I would like to avoid `add_tokens` if possible.
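For concreteness, here is a minimal sketch of the kind of direct patch I mean, using the protobuf bindings that ship with sentencepiece; the file names and the token string are placeholders:

```python
from sentencepiece import sentencepiece_model_pb2 as sp_model

# Load the serialized SentencePiece model (placeholder path).
m = sp_model.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())

# Append a new piece marked USER_DEFINED -- the same piece type that
# training with user_defined_symbols would have produced.
piece = sp_model.ModelProto.SentencePiece()
piece.piece = "superlongword"
piece.score = 0.0
piece.type = sp_model.ModelProto.SentencePiece.USER_DEFINED
m.pieces.append(piece)

# Serialize the patched model back to disk (placeholder path).
with open("tokenizer_patched.model", "wb") as f:
    f.write(m.SerializeToString())
```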
Hey! A few things here. What you are trying to do is outside the scope of the supported features; adding a token should be done with the `tokenizer.add_tokens` function.

To me, the fast version is more correct than what you expect: if there are no merges, there is absolutely no reason for the BPE model to fuse '▁super', 'long', 'word' into 'superlongword'. So it is the slow version that seems wrong, specifically because sentencepiece does not really allow adding tokens that way.
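Roughly along these lines (the checkpoint name below is only a placeholder):

```python
from transformers import AutoTokenizer

# Supported route: register the new token on the tokenizer wrapper itself
# rather than editing the underlying sentencepiece model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder checkpoint
num_added = tokenizer.add_tokens(["superlongword"])
print(num_added)  # 1 if the token was not already in the vocabulary
print(tokenizer.tokenize("a superlongword appears"))

# If the tokenizer is paired with a model, resize its embeddings to match:
# model.resize_token_embeddings(len(tokenizer))
```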