The `generate_vocab` step keeps running without finishing for some datasets. Interestingly, this happens mostly with HPLT datasets. For larger datasets we sub-sample 10M sentences to generate the vocabulary, but the HPLT datasets do not have that many sentences, so we end up using all of them.
This is a dataset-related issue, but I feel our pipeline should be robust enough to handle such problems.
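
To make the sub-sampling behaviour concrete, here is a minimal sketch of capped sampling, assuming the sentences arrive as a stream of strings. The function name `sample_sentences`, the `cap` and `seed` parameters, and the use of reservoir sampling are illustrative assumptions, not the pipeline's actual code. It shows why a dataset smaller than the cap is passed through in full:

```python
import random
from typing import Iterable, List


def sample_sentences(sentences: Iterable[str], cap: int, seed: int = 0) -> List[str]:
    """Return at most `cap` sentences, chosen uniformly at random from a stream.

    Hypothetical helper: uses reservoir sampling (Algorithm R), so the
    stream is read once and never has to fit in memory.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, sentence in enumerate(sentences):
        if i < cap:
            # Fewer than `cap` sentences seen so far: keep every one.
            reservoir.append(sentence)
        else:
            # Replace a kept sentence with probability cap / (i + 1).
            j = rng.randint(0, i)
            if j < cap:
                reservoir[j] = sentence
    return reservoir


# A small dataset stays below the cap, so every sentence is kept.
small = ["a b c", "d e f", "g h i"]
kept_small = sample_sentences(small, cap=10)

# A larger dataset is cut down to exactly `cap` sentences.
kept_large = sample_sentences((f"sentence {i}" for i in range(1000)), cap=10)
```

With a cap of 10M, any dataset below that size reaches vocabulary generation unsampled, which matches what we see for HPLT.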