Training a Tokenizer from large texts read from memory #465

Closed
nlpravi opened this issue Oct 14, 2020 · 5 comments

Comments

@nlpravi

nlpravi commented Oct 14, 2020

I have around 800M files where the text field is one of the fields in each file. I would like to train a new tokenizer only on the text field. I cannot extract the text and write it to new files because of the huge volume of files. Is there any way to train a new tokenizer on this data by reading only the text fields from each of these files and passing them to the training process?

Thanks,
Ravi.

@Narsil
Collaborator

Narsil commented Oct 15, 2020

Sorry, no, not at the moment. How much memory do your 800M files represent? And how much of that does the "text" field represent?

Training algorithms need the whole dataset to be readable in order to learn (the preprocessing helps avoid holding the whole array in memory where possible), so you would have to fit the whole thing in memory anyway. If it fits in memory, how come it can't fit on disk? Usually disk is cheaper than RAM, no? Just trying to understand your use case so we can design the training API better in the future.

@nlpravi
Author

nlpravi commented Oct 15, 2020

Thanks, Narsil. The 800M files represent around 800GB on disk; I'm not sure about the memory footprint. Since there is no way to load them into memory directly, we can definitely write the extracted text to disk and train a tokenizer from there.

Thanks again for your quick response.
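A minimal sketch of what that extraction step could look like. The thread doesn't say what format the files are in, so the JSON layout, the `"text"` key, the directory names, and the shard size below are all hypothetical; only standard-library calls are used:

```python
import json
from pathlib import Path

SRC_DIR = Path("data/raw")    # hypothetical location of the ~800M source files
OUT_DIR = Path("data/text")   # plain-text shards holding only the "text" field
OUT_DIR.mkdir(parents=True, exist_ok=True)

LINES_PER_SHARD = 1_000_000   # arbitrary shard size; adjust to taste

def new_shard(shard_id):
    return open(OUT_DIR / f"shard_{shard_id:05d}.txt", "w", encoding="utf-8")

shard_id, lines_in_shard = 0, 0
out = new_shard(shard_id)

for path in SRC_DIR.glob("*.json"):          # assumes one JSON document per file
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    text = record.get("text", "")
    if not text:
        continue
    # one document per line, with newlines flattened so lines stay self-contained
    out.write(text.replace("\n", " ") + "\n")
    lines_in_shard += 1
    if lines_in_shard >= LINES_PER_SHARD:
        out.close()
        shard_id += 1
        lines_in_shard = 0
        out = new_shard(shard_id)

out.close()
```

Consolidating millions of tiny files into a smaller number of large plain-text shards also keeps the later training step simple, since the tokenizer can then be pointed at a plain list of file paths.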

@Narsil
Collaborator

Narsil commented Oct 15, 2020

Also, a quick tip: you usually don't need to train the tokenizer on the whole huge dataset.

1GB - 10GB should be more than enough to get good heuristics for your tokenizer; going to the full dataset will only yield marginally better results. It's better to do more runs and carefully choose the normalizers, pre_tokenizers, and so on.
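As a rough sketch of that suggestion: the snippet below samples shards (as produced by the extraction step above) until roughly 5GB of text is collected, then trains a BPE tokenizer on those files with the tokenizers Python library. The vocabulary size, special tokens, and pre-tokenizer are placeholder choices, and the exact `train` signature has changed between versions, so treat this as indicative rather than definitive:

```python
import random
from pathlib import Path
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

SHARD_DIR = Path("data/text")              # shards from the extraction step above
shards = sorted(SHARD_DIR.glob("shard_*.txt"))

# Randomly sample shards until roughly 5 GB of text is collected
# (anywhere in the 1-10 GB range mentioned above).
random.seed(0)
random.shuffle(shards)
sample, total_bytes = [], 0
for shard in shards:
    sample.append(str(shard))
    total_bytes += shard.stat().st_size
    if total_bytes >= 5 * 1024**3:
        break

# A plain BPE tokenizer with whitespace pre-tokenization; swap in whatever
# normalizers / pre_tokenizers suit the data.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]"])

tokenizer.train(sample, trainer)   # signature has varied across library versions
tokenizer.save("tokenizer.json")
```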

@nlpravi
Author

nlpravi commented Oct 15, 2020

Thanks for the tip, Narsil. That really helps.

@n1t0
Member

n1t0 commented Oct 20, 2020

Duplicate of #198
