The sample and full training/testing data contain tokenized sentences (by TweetTokenizer, I suppose):

what are you doing for a living ? i am a admin .

instead of untokenized text:

what are you doing for a living? i am a admin.
During inference, the model output seems to be correctly detokenized, whether or not the input is tokenized.
The 3rd-party decoding scripts in the README do not use any tokenization.
What is the correct way to use the model? Should I tokenize the input or detokenize the output? Is the tokenizer exactly the same as the GPT-2 tokenizer, or was it trained from scratch on the Reddit data?
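To make the question concrete, here is a rough sketch of the two transformations being asked about. This is only an illustration: the actual training data was reportedly produced with NLTK's TweetTokenizer, which handles far more cases (contractions, emoticons, URLs, etc.) than this naive regex version, and the helper names are mine.

```python
import re

def naive_tokenize(text: str) -> str:
    # Approximate the space-separated style seen in the training data:
    # put spaces around punctuation, then collapse runs of whitespace.
    spaced = re.sub(r"([?.!,;:])", r" \1 ", text)
    return " ".join(spaced.split())

def naive_detokenize(text: str) -> str:
    # Reattach punctuation to the preceding word.
    return re.sub(r"\s+([?.!,;:])", r"\1", text)

raw = "what are you doing for a living? i am a admin."
tok = naive_tokenize(raw)
print(tok)                    # what are you doing for a living ? i am a admin .
print(naive_detokenize(tok))  # what are you doing for a living? i am a admin.
```

The open question is whether either step is actually needed, or whether the GPT-2 BPE tokenizer shipped with the model is meant to handle raw text directly.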