Faster/More efficient duplicate removal for exact/fuzzy dedup. #335

Open
ayushdg opened this issue Oct 29, 2024 · 1 comment
Assignees: praateekmahajan
Labels: enhancement (New feature or request)

Comments

@ayushdg (Collaborator) commented on Oct 29, 2024

Is your feature request related to a problem? Please describe.
The current deduplication examples call compute on the list of duplicate documents produced by exact/fuzzy deduplication and then use the computed list to filter out the input documents. This doesn't work when the duplicate list is too large to fit on the client.
Ideally, Curator would provide additional classes/methods to remove the documents in the duplicate list more efficiently.
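
For reference, a minimal sketch of the pattern the current examples follow (`dask_cudf` dataframes, an `id` column, and the paths are illustrative assumptions):

```python
import dask_cudf

dataset = dask_cudf.read_parquet("input_dataset/")      # full corpus
duplicates = dask_cudf.read_parquet("docs_to_remove/")  # output of exact/fuzzy dedup

# The duplicate list is materialized on the client before filtering;
# this is the step that fails once the list no longer fits in client memory.
duplicate_ids = duplicates["id"].compute().to_arrow().to_pylist()

deduped = dataset[~dataset["id"].isin(duplicate_ids)]
deduped.to_parquet("deduped_dataset/")
```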

Describe the solution you'd like
A broadcast merge approach, like the one suggested by @VibhuJawa, works well enough at the 4-8 TB scale, where the duplicate list is small enough to be broadcast to each worker, and is worth implementing first.
Longer term, there may be a need for smarter partitioning of the duplicate list so that different files/subsets can each handle their own list of duplicates independently.
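
A rough sketch of what such a broadcast merge could look like (this is not an existing Curator API; the `id` column, the `_is_dup` marker, and the paths are assumptions): materialize the small duplicate list once and do a left merge against it inside each partition, keeping only unmatched rows.

```python
import dask_cudf


def remove_duplicates_broadcast(part, dup_ids):
    # Left merge against the (small) duplicate list; rows with no match
    # have a null marker and are the ones we keep.
    marked = part.merge(dup_ids, on="id", how="left")
    return marked[marked["_is_dup"].isna()].drop(columns=["_is_dup"])


dataset = dask_cudf.read_parquet("input_dataset/")
duplicates = dask_cudf.read_parquet("docs_to_remove/")

# Materialize the duplicate ids once; dask ships them to every worker.
dup_ids = duplicates[["id"]].drop_duplicates().assign(_is_dup=True).compute()

deduped = dataset.map_partitions(remove_duplicates_broadcast, dup_ids)
deduped.to_parquet("deduped_dataset/")
```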

Describe alternatives you've considered
N/A

Additional context
The Zyda-2 tutorial and the pre-training data curation tutorial both contain alternative approaches to this compute step, since it is memory intensive.

@ayushdg added the enhancement (New feature or request) label on Oct 29, 2024
@VibhuJawa (Collaborator) commented:

> A broadcast merge approach, like the one suggested by @VibhuJawa, works well enough at the 4-8 TB scale, where the duplicate list is small enough to be broadcast to each worker, and is worth implementing first.

Examples of removing duplicates using a merge:

https://gist.github.com/VibhuJawa/7c780209bdcad9ac7615bd84b86cde58

> Longer term, there may be a need for smarter partitioning of the duplicate list so that different files/subsets can each handle their own list of duplicates independently.

My best suggestion here, if we want to skip doing a broadcast merge, is to do a batched index merge (like we do in the CC stage); I think that's the most scalable option.
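
A hypothetical sketch of a batched merge (function name, paths, and batch size are assumptions): walk the dataset a slice of partitions at a time so each merge against the duplicate list stays bounded.

```python
import dask_cudf


def remove_duplicates_batched(dataset_path, duplicates_path, output_path, batch_size=32):
    dataset = dask_cudf.read_parquet(dataset_path)
    duplicates = (
        dask_cudf.read_parquet(duplicates_path)[["id"]]
        .drop_duplicates()
        .assign(_is_dup=True)
    )

    # Merge a slice of the dataset's partitions at a time against the
    # (distributed) duplicate list, keeping only unmatched rows.
    for start in range(0, dataset.npartitions, batch_size):
        batch = dataset.partitions[start : start + batch_size]
        marked = batch.merge(duplicates, on="id", how="left")
        deduped = marked[marked["_is_dup"].isna()].drop(columns=["_is_dup"])
        deduped.to_parquet(f"{output_path}/batch_{start:05d}")
```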

We can also choose based on a heuristic: since the list of IDs is already in distributed GPU memory, we can switch between the two approaches, which means we don't compromise on performance in the short term.
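
For example (a hypothetical switch; the threshold and the `broadcast_remove`/`batched_merge_remove` helpers stand in for the broadcast and batched sketches above):

```python
def remove_duplicates(dataset, duplicates, broadcast_row_threshold=100_000_000):
    # Hypothetical heuristic: broadcast when the duplicate list is small enough
    # (judged here by row count), otherwise fall back to the batched merge.
    n_duplicates = len(duplicates)  # distributed count, cheap relative to the merge
    if n_duplicates < broadcast_row_threshold:
        return broadcast_remove(dataset, duplicates)   # broadcast merge path (sketched above)
    return batched_merge_remove(dataset, duplicates)   # batched index merge path (sketched above)
```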

We could also play tricks with filtering, if we wanted to, by first creating a map of file -> dataset_ids and then filtering each dataset against its own ids.
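
A hypothetical sketch of that idea, assuming each document id embeds its source file name (e.g. `<file>-<row>`), so the duplicate list can be split into a per-file map and each file is filtered against only its own ids:

```python
import cudf
import dask_cudf

duplicates = dask_cudf.read_parquet("docs_to_remove/")[["id"]].compute()

# Build the file -> duplicate-id map (assumed "<file>-<row>" id layout).
dup_pdf = duplicates.to_pandas()
dup_pdf["file"] = dup_pdf["id"].str.rsplit("-", n=1).str[0]
file_to_dup_ids = {f: g["id"].tolist() for f, g in dup_pdf.groupby("file")}

# Each file only sees its own (much smaller) duplicate list.
for fname, dup_ids in file_to_dup_ids.items():
    df = cudf.read_parquet(f"input_dataset/{fname}")
    df[~df["id"].isin(dup_ids)].to_parquet(f"deduped_dataset/{fname}")
```

For very large duplicate lists, the per-file id lists could be written to disk instead of being held on the client.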

@praateekmahajan self-assigned this on Nov 1, 2024