The sample and full training/testing data contain tokenized sentences (by TweetTokenizer, I suppose):

what are you doing for a living ? i am a admin .

instead of untokenized text:

what are you doing for a living? i am a admin.
During inference, the model output seems to be correctly detokenized, whether or not the input is tokenized.
The 3rd-party decoding scripts in the README do not use any tokenization.
What is the correct way to use the model? Should I tokenize the input or detokenize the output? Is the tokenizer exactly the same as the GPT-2 tokenizer, or was it trained from scratch on the Reddit data?
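To make the question concrete, here is a rough sketch of the two transformations being asked about. This is only an illustration: the actual training data was reportedly produced with NLTK's TweetTokenizer, which handles far more cases (contractions, emoticons, URLs, etc.) than this naive regex version, and the helper names are mine.

```python
import re

def naive_tokenize(text: str) -> str:
    # Approximate the space-separated style seen in the training data:
    # put spaces around punctuation, then collapse runs of whitespace.
    spaced = re.sub(r"([?.!,;:])", r" \1 ", text)
    return " ".join(spaced.split())

def naive_detokenize(text: str) -> str:
    # Reattach punctuation to the preceding word.
    return re.sub(r"\s+([?.!,;:])", r"\1", text)

raw = "what are you doing for a living? i am a admin."
tok = naive_tokenize(raw)
print(tok)                    # what are you doing for a living ? i am a admin .
print(naive_detokenize(tok))  # what are you doing for a living? i am a admin.
```

The open question is whether either step is actually needed, or whether the GPT-2 BPE tokenizer shipped with the model is meant to handle raw text directly.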