Faster/More efficient duplicate removal for exact/fuzzy dedup. #335

Open
ayushdg opened this issue Oct 29, 2024 · 1 comment
Assignees: praateekmahajan
Labels: enhancement (New feature or request)

Comments

@ayushdg (Collaborator) commented on Oct 29, 2024

Is your feature request related to a problem? Please describe.
The current deduplication examples call compute on the list of duplicate documents produced by exact/fuzzy deduplication and then use the computed list to filter out the input documents. This doesn't work when the duplicate list is too large to fit on the client.
Ideally, Curator would provide additional classes/methods to remove the documents in the duplicate list more efficiently.
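
For reference, a minimal sketch of the pattern the current examples follow (`dask_cudf` dataframes, an `id` column, and the paths are illustrative assumptions):

```python
import dask_cudf

dataset = dask_cudf.read_parquet("input_dataset/")      # full corpus
duplicates = dask_cudf.read_parquet("docs_to_remove/")  # output of exact/fuzzy dedup

# The duplicate list is materialized on the client before filtering;
# this is the step that fails once the list no longer fits in client memory.
duplicate_ids = duplicates["id"].compute().to_arrow().to_pylist()

deduped = dataset[~dataset["id"].isin(duplicate_ids)]
deduped.to_parquet("deduped_dataset/")
```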

Describe the solution you'd like
A broadcast merge approach, like the one suggested by @VibhuJawa, works well enough at the 4-8 TB scale, where the duplicate list is small enough to be broadcast to each worker, and is worth implementing first.
Longer term, there may be a need for smarter partitioning of the duplicate list so that different files/subsets can each handle their own list of duplicates independently.
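
A rough sketch of what such a broadcast merge could look like (this is not an existing Curator API; the `id` column, the `_is_dup` marker, and the paths are assumptions): materialize the small duplicate list once and do a left merge against it inside each partition, keeping only unmatched rows.

```python
import dask_cudf


def remove_duplicates_broadcast(part, dup_ids):
    # Left merge against the (small) duplicate list; rows with no match
    # have a null marker and are the ones we keep.
    marked = part.merge(dup_ids, on="id", how="left")
    return marked[marked["_is_dup"].isna()].drop(columns=["_is_dup"])


dataset = dask_cudf.read_parquet("input_dataset/")
duplicates = dask_cudf.read_parquet("docs_to_remove/")

# Materialize the duplicate ids once; dask ships them to every worker.
dup_ids = duplicates[["id"]].drop_duplicates().assign(_is_dup=True).compute()

deduped = dataset.map_partitions(remove_duplicates_broadcast, dup_ids)
deduped.to_parquet("deduped_dataset/")
```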

Describe alternatives you've considered
N/A

Additional context
The Zyda-2 tutorial and the pre-training data curation tutorial both contain alternative approaches to this compute step, since it is memory intensive.

@ayushdg added the enhancement (New feature or request) label on Oct 29, 2024
@VibhuJawa (Collaborator) commented:

> A broadcast merge approach, like the one suggested by @VibhuJawa, works well enough at the 4-8 TB scale, where the duplicate list is small enough to be broadcast to each worker, and is worth implementing first.

Examples of removing duplicates using a merge:

https://gist.github.com/VibhuJawa/7c780209bdcad9ac7615bd84b86cde58

> Longer term, there may be a need for smarter partitioning of the duplicate list so that different files/subsets can each handle their own list of duplicates independently.

My best suggestion here, if we want to skip doing a broadcast merge, is to do a batched index merge (like we do in the CC stage); I think that's the most scalable option.
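
A hypothetical sketch of a batched merge (function name, paths, and batch size are assumptions): walk the dataset a slice of partitions at a time so each merge against the duplicate list stays bounded.

```python
import dask_cudf


def remove_duplicates_batched(dataset_path, duplicates_path, output_path, batch_size=32):
    dataset = dask_cudf.read_parquet(dataset_path)
    duplicates = (
        dask_cudf.read_parquet(duplicates_path)[["id"]]
        .drop_duplicates()
        .assign(_is_dup=True)
    )

    # Merge a slice of the dataset's partitions at a time against the
    # (distributed) duplicate list, keeping only unmatched rows.
    for start in range(0, dataset.npartitions, batch_size):
        batch = dataset.partitions[start : start + batch_size]
        marked = batch.merge(duplicates, on="id", how="left")
        deduped = marked[marked["_is_dup"].isna()].drop(columns=["_is_dup"])
        deduped.to_parquet(f"{output_path}/batch_{start:05d}")
```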

We can also choose based on a heuristic: since the list of IDs is already in distributed GPU memory, we can switch between the two approaches, which means we don't compromise on performance in the short term.
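
For example (a hypothetical switch; the threshold and the `broadcast_remove`/`batched_merge_remove` helpers stand in for the broadcast and batched sketches above):

```python
def remove_duplicates(dataset, duplicates, broadcast_row_threshold=100_000_000):
    # Hypothetical heuristic: broadcast when the duplicate list is small enough
    # (judged here by row count), otherwise fall back to the batched merge.
    n_duplicates = len(duplicates)  # distributed count, cheap relative to the merge
    if n_duplicates < broadcast_row_threshold:
        return broadcast_remove(dataset, duplicates)   # broadcast merge path (sketched above)
    return batched_merge_remove(dataset, duplicates)   # batched index merge path (sketched above)
```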

We could also play tricks with filtering, if we wanted to, by first creating a map of file -> dataset_ids and then filtering each dataset against its own ids.
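
A hypothetical sketch of that idea, assuming each document id embeds its source file name (e.g. `<file>-<row>`), so the duplicate list can be split into a per-file map and each file is filtered against only its own ids:

```python
import cudf
import dask_cudf

duplicates = dask_cudf.read_parquet("docs_to_remove/")[["id"]].compute()

# Build the file -> duplicate-id map (assumed "<file>-<row>" id layout).
dup_pdf = duplicates.to_pandas()
dup_pdf["file"] = dup_pdf["id"].str.rsplit("-", n=1).str[0]
file_to_dup_ids = {f: g["id"].tolist() for f, g in dup_pdf.groupby("file")}

# Each file only sees its own (much smaller) duplicate list.
for fname, dup_ids in file_to_dup_ids.items():
    df = cudf.read_parquet(f"input_dataset/{fname}")
    df[~df["id"].isin(dup_ids)].to_parquet(f"deduped_dataset/{fname}")
```

For very large duplicate lists, the per-file id lists could be written to disk instead of being held on the client.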

@praateekmahajan self-assigned this on Nov 1, 2024