Add script for MDS conversion of bucket of text files #570
Conversation
I think you're already in the process of doing this, but I would like a manual test showing that this script produces data we can train on, and that the expected dataset size looks right, before merging.
LGTM pending a few small comments. Also, I requested a review from Karan to make sure the streaming stuff looks OK.
LGTM. Thank you!
LGTM! Let's start trying it out!
Similar to the other data conversion scripts in foundry, this one is aimed at a continued pretraining API: it accepts a remote bucket containing text files, pre-tokenizes and concatenates them, and converts the result to MDS for continued pretraining.
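The core of the conversion is the tokenize-and-concatenate step: token streams from the input text files are joined into one stream and sliced into fixed-length samples, which are then written out as MDS shards. A minimal sketch of that step (the function name, the whitespace "tokenizer", and the drop-the-tail behavior are illustrative assumptions, not the script's actual implementation):

```python
from typing import Iterable, Iterator, List


def concat_tokens(token_streams: Iterable[List[str]],
                  max_length: int) -> Iterator[List[str]]:
    """Concatenate token streams and yield fixed-length samples.

    Hypothetical sketch: the real script tokenizes with a proper tokenizer
    and writes each yielded sample to an MDS shard.
    """
    buffer: List[str] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= max_length:
            yield buffer[:max_length]
            buffer = buffer[max_length:]
    # Trailing tokens shorter than max_length are dropped in this sketch.


# Toy "tokenizer": whitespace split over three small documents.
docs = ["a b c d", "e f g", "h i j k l"]
samples = list(concat_tokens((d.split() for d in docs), max_length=4))
# Concatenation crosses document boundaries, so the 12 tokens become
# exactly 3 samples of length 4.
```

Note that samples can span document boundaries; that is what makes the total token count a clean multiple of the sequence length modulo the dropped tail.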
https://databricks.atlassian.net/browse/GRT-2273
Manual Test
https://gist.github.com/irenedea/7256479e61519b73bf7f231c7597610c
Total tokens found when writing to MDS: 3028992 (equivalent to 5 full batches plus a 6th batch containing the remaining samples).
Without data duplication (StreamingDataset with num_canonical_nodes=1): time/token is 3028992 at the end of training.
With data duplication: time/token is 3145728 at the end of training (number of tokens for 6 full batches: 256 * 6 * 2048 = 3145728).
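The batch arithmetic above can be checked directly. Taking the batch size (256) and sequence length (2048) from the numbers in the test:

```python
batch_size, seq_len = 256, 2048
tokens_found = 3_028_992  # total tokens reported when writing to MDS

# 6 full batches would contain 256 * 6 * 2048 tokens.
tokens_six_full = batch_size * 6 * seq_len

# Split the found tokens into full batches plus a remainder.
full_batches, rem = divmod(tokens_found, batch_size * seq_len)
partial_batch_samples = rem // seq_len
```

This confirms 5 full batches with a partial 6th batch (199 of 256 samples), and that padding/duplicating the 6th batch to full size yields the 3145728 time/token figure.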