Hi there, thank you for this great release!

I'm wondering if it would be possible to use the quality filter to filter out documents below a certain length. For example, I'm looking to assemble a dataset where each sequence is between 64k and 128k tokens of context.
Is this easily configurable in the quality filter?
Would this filter be applied before or after tokenization?
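To make the request concrete, here is a rough sketch of the kind of predicate I have in mind. The names here are placeholders, not your actual API, and I'm using a trivial whitespace tokenizer as a stand-in for a real one:

```python
class WhitespaceTokenizer:
    """Stand-in tokenizer for illustration: splits on whitespace.
    A real subword tokenizer would produce different counts."""
    def encode(self, text):
        return text.split()

def in_token_range(text, tokenizer, min_tokens, max_tokens):
    """Keep only documents whose tokenized length falls in
    [min_tokens, max_tokens] (inclusive)."""
    n = len(tokenizer.encode(text))
    return min_tokens <= n <= max_tokens

# Example: keep documents with 3-10 tokens (stand-in for 64k-128k).
tok = WhitespaceTokenizer()
docs = ["short doc", "one two three four five"]
kept = [d for d in docs if in_token_range(d, tok, min_tokens=3, max_tokens=10)]
```

The key question is whether a check like this can hook into the existing quality-filter pipeline, and whether it would see raw text (so I'd be filtering on an approximate length) or tokenized sequences.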
Thank you for your help.