How to support multi-threaded parallel data preprocessing? #870
Comments
Agree, this would be very useful. Would it be possible to implement sharding for |
I think the example conversion script is perhaps not very good. One thing that helps a lot is to use the Datasets Also, there is a bug in |
The text to MDS conversion script (https://github.com/mosaicml/llm-foundry/blob/main/scripts/data_prep/convert_text_to_mds.py) is parallelized, is that what you are looking for (or at least a good starting point)? |
Thanks, I will look into it. |
Isn't it enough to just run the script in parallel and merge the MDS shards with this method?
Currently, I am trying it like this. I have a large jsonl file. I used Lastly, I hope it will be enough to just call the mentioned merge method on (Will update once the process is finished.) |
Yes @MFajcik , that should work! |
It does work! Preprocessing was done in no time. Training is running right now. Thanks for the hint! |
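For anyone wanting to try the same split-and-merge approach, here is a minimal sketch of the splitting half only, using just the standard library. It cuts one large jsonl file into byte ranges that end on line boundaries, so each worker can read its own part independently. The names `line_aligned_ranges`, `iter_lines`, `process_part` and `run_parallel` are hypothetical and not part of llm-foundry or streaming; the worker body is a placeholder that only counts lines, where a real worker would tokenize and write MDS shards into its own output directory. Merging the per-worker shards afterwards is left to the merge method mentioned above.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Iterator, List, Tuple


def line_aligned_ranges(path: str, n_parts: int) -> List[Tuple[int, int]]:
    """Split a file into at most n_parts byte ranges ending on line boundaries."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_parts):
            target = size * i // n_parts
            if target <= cuts[-1]:
                continue
            f.seek(target)
            f.readline()  # skip the rest of the line the target landed in
            pos = f.tell()
            if cuts[-1] < pos < size:
                cuts.append(pos)
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))


def iter_lines(path: str, start: int, end: int) -> Iterator[str]:
    """Yield the lines stored in the byte range [start, end) of the file."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            yield line.decode("utf-8")


def process_part(job: Tuple[str, int, int, str]) -> int:
    """Placeholder worker (assumption): counts lines instead of writing shards."""
    path, start, end, out_dir = job
    # A real worker would tokenize each line and write MDS shards to out_dir.
    return sum(1 for _ in iter_lines(path, start, end))


def run_parallel(path: str, out_root: str, n_workers: int) -> List[int]:
    """Run one worker per byte range, each with its own output directory."""
    jobs = [
        (path, start, end, os.path.join(out_root, f"part_{i:05d}"))
        for i, (start, end) in enumerate(line_aligned_ranges(path, n_workers))
    ]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(process_part, jobs))
```

Giving every worker a separate output directory keeps the shard files from colliding, which is what makes a later merge step possible.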
I changed
def __iter__(self) -> Iterable[Dict[str, bytes]]:
    buffer = []
    # self.write_batch_size = 10_000
    shards = self.hf_dataset.num_rows // self.write_batch_size + 1
    for i in range(shards):
        shard = self.hf_dataset[
            i * self.write_batch_size : (i + 1) * self.write_batch_size
        ]
        encoded_shard = self.tokenizer(
            shard["text"], truncation=False, padding=False
        )
        for encoded in encoded_shard["input_ids"]:
            iids = encoded  # ['input_ids']
            buffer = buffer + self.bos_tokens + iids + self.eos_tokens
            while len(buffer) >= self.max_length:
                concat_sample = buffer[: self.max_length]
                buffer = buffer[self.max_length :] if self.should_wrap else []
                yield {
                    # convert to bytes to store in MDS binary format
                    "tokens": np.asarray(concat_sample).tobytes(),
                    "num_tokens": len(concat_sample),
                }
Processing 7B tokens takes around 20 hours with the original code and 30 min with this change. It's not very robust though and doesn't scale very well: a fast tokenizer hangs after a while with very long text, and using more than 16 threads doesn't seem to give any speedup. |
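The batching idea in the snippet above can be tried outside llm-foundry with a small standalone sketch. It is not the library's implementation: `concat_tokens_batched` and `toy_tokenize_batch` are hypothetical names, and the toy tokenizer (one id per whitespace-separated word, equal to the word's length) only stands in for a real batched tokenizer call. The sketch yields plain Python lists instead of bytes so the output is easy to inspect.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def concat_tokens_batched(
    texts: Sequence[str],
    tokenize_batch: Callable[[List[str]], List[List[int]]],
    max_length: int,
    batch_size: int = 10_000,
    bos: Sequence[int] = (),
    eos: Sequence[int] = (),
    should_wrap: bool = True,
) -> Iterator[Dict[str, object]]:
    """Tokenize texts one batch at a time and emit fixed-length token samples."""
    buffer: List[int] = []
    # Stepping by batch_size never produces an empty trailing batch.
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start : start + batch_size])
        for ids in tokenize_batch(batch):  # one tokenizer call per batch
            buffer.extend(bos)
            buffer.extend(ids)
            buffer.extend(eos)
            while len(buffer) >= max_length:
                sample = buffer[:max_length]
                # Keep the leftover tokens only when wrapping is enabled.
                buffer = buffer[max_length:] if should_wrap else []
                yield {"tokens": sample, "num_tokens": len(sample)}


def toy_tokenize_batch(batch: List[str]) -> List[List[int]]:
    """Hypothetical stand-in tokenizer: each word becomes its length."""
    return [[len(word) for word in text.split()] for text in batch]


samples = list(
    concat_tokens_batched(
        ["a bb ccc", "dddd", "ee f"],
        toy_tokenize_batch,
        max_length=3,
        batch_size=2,
        eos=(0,),
    )
)
```

Two small differences from the posted code: the loop steps through the data with `range(0, len(texts), batch_size)`, so there is no empty final slice when the number of rows is an exact multiple of the batch size, and the buffer is extended in place rather than rebuilt by list concatenation for every document.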
Thanks for your update! Did you modify other files to enable multithreading? |
Yes sorry, I also removed |
It helps a lot. I can process 100B tokens within 7 hours with your code! :) |
I want to pretrain an LLM with 2T tokens using llm-foundry. But before training, the data processing time is too long. Is there any way to accelerate it?