Add script for MDS conversion of bucket of text files #570

Merged
merged 38 commits into from
Sep 15, 2023

@irenedea irenedea commented Aug 31, 2023

Like the other data conversion scripts in foundry, this one targets a continued pretraining API: it accepts a remote bucket of text files, pre-tokenizes and concatenates the text, and converts the result to MDS for continued pretraining.
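For readers unfamiliar with the tokenize-and-concatenate step, here is a minimal sketch of the idea, not the script's actual implementation. The function `concat_and_chunk`, the toy tokenizer, and the choice to drop trailing tokens are all assumptions made for illustration; the real script uses a Hugging Face tokenizer and writes MDS shards.

```python
from typing import Callable, Iterable, Iterator, List


def concat_and_chunk(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    max_length: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Yield fixed-length token samples built by concatenating documents.

    Hypothetical helper: each document is tokenized, terminated with
    eos_id, and appended to a running buffer. Whenever the buffer holds
    at least max_length tokens, one full sample is emitted.
    """
    buffer: List[int] = []
    for text in texts:
        buffer.extend(tokenize(text))
        buffer.append(eos_id)
        while len(buffer) >= max_length:
            yield buffer[:max_length]
            buffer = buffer[max_length:]
    # Assumption: tokens left in the buffer at the end are dropped,
    # so every emitted sample has exactly max_length tokens.


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token per word, id = word length."""
    return [len(word) for word in text.split()]


samples = list(
    concat_and_chunk(["a bb ccc", "dddd ee", "f"], toy_tokenize, max_length=4, eos_id=0)
)
print(samples)
```

Because every sample has the same length, the total token count of the resulting dataset is always a multiple of `max_length`, which is what makes the batch arithmetic in the manual test straightforward.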

https://databricks.atlassian.net/browse/GRT-2273

Manual Test

  • Trained mpt-125m on the first 100 text files in a remote s3 bucket for 1 epoch (i.e. 6 batches)
  • Text files were processed with 4 processes, EleutherAI/gpt-neox-20b tokenizer
  • Tested with both mds shards uploaded to s3 and local mds shards.

https://gist.github.com/irenedea/7256479e61519b73bf7f231c7597610c

Total tokens found when writing to MDS: 3028992 (equivalent to 5 full batches plus a partial 6th batch holding the remaining samples)

Without data duplication (StreamingDataset num_canonical_nodes=1):
time/token is 3028992 at the end of training

With data duplication:
time/token is 3145728 at the end of training

Number of tokens for 6 full batches: 256 * 6 * 2048 = 3145728
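The token counts above can be checked with a few lines of arithmetic. This sketch assumes that 256 is the global batch size and 2048 is the sequence length, as the formula above implies; the variable names are illustrative and do not come from the training config.

```python
# Assumed meaning of the numbers in the formula above.
GLOBAL_BATCH_SIZE = 256   # samples per batch (assumption)
MAX_SEQ_LEN = 2048        # tokens per sample (assumption)
TOKENS_WRITTEN = 3028992  # total reported when writing to MDS

tokens_per_batch = GLOBAL_BATCH_SIZE * MAX_SEQ_LEN

# Every sample has MAX_SEQ_LEN tokens, so the sample count is exact.
num_samples, remainder = divmod(TOKENS_WRITTEN, MAX_SEQ_LEN)

# Split the samples into full batches and a partial final batch.
full_batches, leftover_samples = divmod(num_samples, GLOBAL_BATCH_SIZE)

# Token count if the 6th batch is filled up to a full batch.
tokens_six_full_batches = 6 * tokens_per_batch

# Samples needed to fill the partial batch.
extra_samples = GLOBAL_BATCH_SIZE - leftover_samples

print(num_samples, full_batches, leftover_samples)
print(tokens_six_full_batches, extra_samples)
```

Under these assumptions, the gap between the two reported `time/token` values (3145728 - 3028992 = 116736 tokens) corresponds to 57 samples, which is consistent with the partial 6th batch being filled with repeated samples when duplication is enabled.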

@irenedea irenedea marked this pull request as ready for review September 1, 2023 16:44
@irenedea irenedea requested a review from dakinggg September 1, 2023 17:19
@irenedea irenedea force-pushed the convert_mds_script branch 2 times, most recently from c22102f to 432bf45 Compare September 1, 2023 20:03

@dakinggg dakinggg left a comment

I think you're already in the process of doing this, but before merging I would like a manual test showing that this script produces data we can train on and that the expected dataset size looks right.

@irenedea irenedea requested a review from dakinggg September 11, 2023 18:42

@dakinggg dakinggg left a comment

LGTM pending a few small comments. I also requested a review from Karan to make sure the streaming stuff looks OK.

@dakinggg dakinggg requested a review from karan6181 September 15, 2023 00:25
@irenedea irenedea requested a review from dakinggg September 15, 2023 04:11

@karan6181 karan6181 left a comment

LGTM. Thank You!

@dakinggg dakinggg left a comment

LGTM! Let's start trying it out!

@irenedea irenedea merged commit c308d10 into mosaicml:main Sep 15, 2023