Graceful handling when no LSH duplicates found. #381

Open
davzoku opened this issue Nov 19, 2024 · 2 comments
Labels
duplicate This issue or pull request already exists

Comments

@davzoku (Contributor) commented Nov 19, 2024

In the current implementation, the __call__ method in nemo_curator/modules/fuzzy_dedup.py assumes that at least one LSH duplicate will be found and that the results will be saved as a parquet file. However, if the dataset is clean or too small to contain any fuzzy duplicates, the code throws an error when it tries to read the non-existent parquet file.

    def __call__(self, dataset: DocumentDataset) -> DocumentDataset:
        df = dataset.df

        write_path = os.path.join(self.cache_dir, "_buckets.parquet")
        t0 = time.time()
        with performance_report_if_with_ts_suffix(self.profile_dir, f"lsh-profile"):
            self.lsh(write_path=write_path, df=df)
        self._logger.info(
            f"Time taken for LSH = {time.time() - t0}s and output written at {write_path}"
        )

        buckets_df = dask_cudf.read_parquet(write_path, split_row_groups=False)
        return DocumentDataset(buckets_df)
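
For context, a minimal sketch of the failure mode (the relative path here is illustrative, and the exact exception type depends on the dask/fsspec versions in use):

import dask_cudf

# If LSH finds no duplicate buckets, "_buckets.parquet" is never written,
# so read_parquet fails on the missing path instead of returning empty data.
try:
    buckets_df = dask_cudf.read_parquet("_buckets.parquet", split_row_groups=False)
except (OSError, ValueError) as e:  # FileNotFoundError in most setups
    print(f"Could not read LSH buckets: {e}")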

A simple enhancement would be to log a warning or otherwise handle the situation gracefully, which would help users who are unfamiliar with the code base.

For example:

def __call__(self, dataset: DocumentDataset) -> DocumentDataset:
    df = dataset.df

    write_path = os.path.join(self.cache_dir, "_buckets.parquet")
    t0 = time.time()
    with performance_report_if_with_ts_suffix(self.profile_dir, f"lsh-profile"):
        self.lsh(write_path=write_path, df=df)
    self._logger.info(
        f"Time taken for LSH = {time.time() - t0}s and output written at {write_path}"
    )

    # If LSH found no duplicates, the parquet output was never written;
    # return an empty dataset instead of failing on read_parquet below.
    if not os.path.exists(write_path):
        self._logger.warning("No LSH duplicates found.")
        return DocumentDataset(dask_cudf.from_cudf(cudf.DataFrame(), npartitions=1))

    buckets_df = dask_cudf.read_parquet(write_path, split_row_groups=False)
    return DocumentDataset(buckets_df)
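
With the check in place, a caller can test for an empty result instead of catching an exception. A hypothetical usage sketch, where lsh stands in for a configured LSH instance and dataset for its input DocumentDataset:

buckets = lsh(dataset)
if len(buckets.df) == 0:  # len() triggers a compute on the dask_cudf DataFrame
    # An empty DataFrame now signals "no duplicates" instead of a crash.
    print("No LSH duplicates found; skipping downstream dedup stages.")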
@ayushdg (Collaborator) commented Nov 19, 2024

Thanks for raising this, @davzoku. I'm working on this as part of the refactor in #326. Feel free to share any opinions you might have on how the behavior should be handled in that PR.

@davzoku (Contributor, Author) commented Nov 19, 2024

I see, @ayushdg! I will take a look.

Mentioning issue #67, as the current issue might be a duplicate of that existing one.

@ayushdg added the duplicate label on Nov 22, 2024