We deleted the sentencepiece vocab file because the sentencepiece model file is fully self-contained, and the tokenizer never uses the vocab file. To the best of my knowledge, the vocab file itself is not very useful. Here is a sample vocab file:
<unk> 0
<s> 0
</s> 0
, -3.39764
. -3.53133
▁the -3.56031
s -3.70819
▁ -3.82609
▁I -3.90308
▁to -4.04041
▁a -4.08637
ed -4.16661
▁and -4.26836
▁of -4.27461
t -4.31782
e -4.43336
d -4.44333
ing -4.46929
a -4.53839
▁in -4.64852
o -4.71318
▁was -4.77909
▁" -4.81017
i -4.86229
...
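If a plain token list is still needed, a listing like the one above can be reduced to its pieces. This is only a minimal sketch: it assumes each line is `piece score` separated by whitespace, and the function name `pieces_from_vocab_listing` is hypothetical, not part of the library.

```python
def pieces_from_vocab_listing(text):
    """Return the pieces, in file order, from lines of the form 'piece score'.

    Assumption: the score is the last whitespace-separated field on the line.
    Lines without a numeric score (blank lines, a literal '...') are skipped.
    """
    pieces = []
    for line in text.splitlines():
        parts = line.rsplit(None, 1)  # split once, from the right
        if len(parts) != 2:
            continue
        piece, score = parts
        try:
            float(score)
        except ValueError:
            continue  # not a 'piece score' line
        pieces.append(piece)
    return pieces
```

The piece order matches the line order, so the position of a piece in the returned list corresponds to its position in the listing.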
@gpengzhi
But how do I use the model file in PairedTextData?
The model file seems to be usable only for restoring a tokenizer, so I created my own "PairedTextData" with two DataSource objects in order to use SentencePieceTokenizer in my project.
Is there a simpler way to do this?
Could you describe how you integrated the tokenizer with PairedTextData? There is a related issue, #256. I think we should provide an interface that accepts a tokenizer instead of a vocab file. Would you be able to contribute this feature enhancement? A pull request is welcome!
I want to use a vocab file in PairedDataloader, but the save_vocab function of SentencePieceTokenizer only saves the model file.
The model file can't be loaded by the Dataloader because of a decoding error.
In sentencepiece_tokenizer.py, I saw that you deleted the vocab file.
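The decoding error is likely because the model file is a serialized binary rather than UTF-8 text. One possible workaround is to write the pieces out as text from a loaded model. This is a rough sketch, not the library's own method: `write_plain_vocab` is a hypothetical helper, and `processor` is assumed to expose `GetPieceSize()` and `IdToPiece()` the way `sentencepiece.SentencePieceProcessor` does.

```python
def write_plain_vocab(processor, path):
    """Write one piece per line, ordered by id, as UTF-8 text.

    Assumption: `processor` behaves like a loaded
    sentencepiece.SentencePieceProcessor (GetPieceSize, IdToPiece).
    Returns the number of pieces written.
    """
    size = processor.GetPieceSize()
    with open(path, "w", encoding="utf-8") as f:
        for idx in range(size):
            f.write(processor.IdToPiece(idx) + "\n")
    return size
```

With the real library, `processor` would be a `sentencepiece.SentencePieceProcessor` on which `Load(...)` has been called with the saved model file. Whether the data loader accepts the resulting file as is, for example with respect to special tokens, is not confirmed here.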