Training a Tokenizer from large texts read from memory #465

Closed
nlpravi opened this issue Oct 14, 2020 · 5 comments

Comments

@nlpravi

nlpravi commented Oct 14, 2020

I have around 800M files where the text field is one of the fields in each file. I would like to train a new tokenizer only on the text field. I cannot extract the text and write it to new files because of the huge volume of files. Is there any way to train a new tokenizer on this data by reading only the text fields from each of these files and passing them to the training process?

Thanks,
Ravi.

@Narsil
Collaborator

Narsil commented Oct 15, 2020

Sorry, no, not at the moment. How much memory do your 800M files represent? And how much of that does the "text" field represent?

Training algorithms need the whole dataset to be readable in order to learn (the preprocessing helps avoid holding the whole array in memory where possible), so you would have to fit the whole thing in memory anyway. If it fits in memory, how come it can't fit on disk? Usually disk is cheaper than RAM, no? Just trying to understand your use case so we can design the training API better in the future.

@nlpravi
Author

nlpravi commented Oct 15, 2020

Thanks, Narsil. The 800M files represent around 800GB on disk; I'm not sure about the memory footprint. Since there is no way to load them into memory directly, we can definitely write the extracted text to disk and train a tokenizer from there.

Thanks again for your quick response.
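A minimal sketch of what that extraction step could look like. The thread doesn't say what format the files are in, so the JSON layout, the `"text"` key, the directory names, and the shard size below are all hypothetical; only standard-library calls are used:

```python
import json
from pathlib import Path

SRC_DIR = Path("data/raw")    # hypothetical location of the ~800M source files
OUT_DIR = Path("data/text")   # plain-text shards holding only the "text" field
OUT_DIR.mkdir(parents=True, exist_ok=True)

LINES_PER_SHARD = 1_000_000   # arbitrary shard size; adjust to taste

def new_shard(shard_id):
    return open(OUT_DIR / f"shard_{shard_id:05d}.txt", "w", encoding="utf-8")

shard_id, lines_in_shard = 0, 0
out = new_shard(shard_id)

for path in SRC_DIR.glob("*.json"):          # assumes one JSON document per file
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    text = record.get("text", "")
    if not text:
        continue
    # one document per line, with newlines flattened so lines stay self-contained
    out.write(text.replace("\n", " ") + "\n")
    lines_in_shard += 1
    if lines_in_shard >= LINES_PER_SHARD:
        out.close()
        shard_id += 1
        lines_in_shard = 0
        out = new_shard(shard_id)

out.close()
```

Consolidating millions of tiny files into a smaller number of large plain-text shards also keeps the later training step simple, since the tokenizer can then be pointed at a plain list of file paths.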

@Narsil
Collaborator

Narsil commented Oct 15, 2020

Also, a quick tip: you usually don't need to train the tokenizer on the whole huge dataset.

1GB - 10GB should be more than enough to get good heuristics for your tokenizer; going to the full dataset will only yield marginally better results. It's better to do more runs and carefully choose the normalizers, pre_tokenizers, and so on.
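As a rough sketch of that suggestion: the snippet below samples shards (as produced by the extraction step above) until roughly 5GB of text is collected, then trains a BPE tokenizer on those files with the tokenizers Python library. The vocabulary size, special tokens, and pre-tokenizer are placeholder choices, and the exact `train` signature has changed between versions, so treat this as indicative rather than definitive:

```python
import random
from pathlib import Path
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

SHARD_DIR = Path("data/text")              # shards from the extraction step above
shards = sorted(SHARD_DIR.glob("shard_*.txt"))

# Randomly sample shards until roughly 5 GB of text is collected
# (anywhere in the 1-10 GB range mentioned above).
random.seed(0)
random.shuffle(shards)
sample, total_bytes = [], 0
for shard in shards:
    sample.append(str(shard))
    total_bytes += shard.stat().st_size
    if total_bytes >= 5 * 1024**3:
        break

# A plain BPE tokenizer with whitespace pre-tokenization; swap in whatever
# normalizers / pre_tokenizers suit the data.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]"])

tokenizer.train(sample, trainer)   # signature has varied across library versions
tokenizer.save("tokenizer.json")
```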

@nlpravi
Author

nlpravi commented Oct 15, 2020

Thanks for the tip, Narsil. That really helps.

@n1t0
Member

n1t0 commented Oct 20, 2020

Duplicate of #198
