Generation failing with infinite time for imbalanced datasets #22

Open
omaralvarez opened this issue Nov 2, 2024 · 5 comments
Labels
enhancement New feature or request


omaralvarez commented Nov 2, 2024

First of all, thanks for your contribution. I am using your model for dataset rebalancing, and I am running into a problem with small, imbalanced datasets: generation fails because it reaches the maximum number of tries. I have tried several approaches, such as increasing the number of epochs and using #7, to no avail; generation never completes. One of the datasets where this happens is:

https://imbalanced-learn.org/stable/datasets/index.html

from imblearn.datasets import fetch_datasets

ecoli = fetch_datasets()['ecoli']
ecoli.data.shape

omaralvarez changed the title from "Generation failing with infinite time for unbalanced datasets" to "Generation failing with infinite time for imbalanced datasets" on Nov 2, 2024

@omaralvarez
Author

I think I found the problem: it has to do with negative numeric target labels. With string labels it works. The same thing happens in your other model, CTABGAN.
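
The string-label workaround mentioned above could look roughly like the sketch below. It is only an illustration with plain pandas: the series `y`, the `class_N` names, and both mapping dictionaries are hypothetical and are not part of the model's API.

```python
import pandas as pd

# Hypothetical imbalanced target column with negative and positive numeric labels.
y = pd.Series([-1, -1, -1, -1, 1], name="target")

# Map every numeric label to a string name before handing the data to the model.
label_to_str = {
    label: f"class_{i}" for i, label in enumerate(sorted(y.unique().tolist()))
}
y_str = y.map(label_to_str)

# After generation, map the string names back to the original numeric labels.
str_to_label = {name: label for label, name in label_to_str.items()}
y_back = y_str.map(str_to_label).astype(y.dtype)
```

Keeping both dictionaries makes the round trip explicit, so the generated target column can be converted back to the numeric labels that downstream code expects.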

@zhao-zilong
Contributor

Hi @omaralvarez, thanks for your comment. What do you mean by negative numeric target labels? Values like "-1" and "-5.78" in the target label column?

@omaralvarez
Author

I finally found the issue: sometimes the original pandas dtypes are not restored when the model generates samples. It returns object columns (strings, I think), and that was causing a bug in my code. A simple cast back to the original dtypes:

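        # Draw one batch of synthetic rows from the trained synthesizer.
        sample = self.synthesizer.sample(self.batch_size)

        # Cast each column back to the dtype it had in the original DataFrame.
        return self.data_prep.inverse_prep(sample).astype(
            dtype=self.raw_df.dtypes.to_dict()
        )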

Fixes the issue.
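
For readers who want to try the dtype cast outside the class shown above, here is a minimal, self-contained sketch using only pandas. The two frames are made-up stand-ins for the original data and for generated output that came back as `object` columns, and `restore_dtypes` is a hypothetical helper name, not a function from the library.

```python
import pandas as pd

# Stand-in for the original training data (hypothetical columns and values).
raw_df = pd.DataFrame(
    {
        "feature": [0.12, 0.57, 0.33],
        "count": [3, 7, 1],
        "target": [-1, 1, -1],
    }
)

# Stand-in for generated rows where every column came back as object (strings).
sample = pd.DataFrame(
    {
        "feature": ["0.25", "0.4", "0.91"],
        "count": ["2", "5", "4"],
        "target": ["-1", "-1", "1"],
    },
    dtype=object,
)


def restore_dtypes(generated: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Cast each generated column to the dtype of the same column in reference."""
    return generated.astype(reference.dtypes.to_dict())


restored = restore_dtypes(sample, raw_df)
print(restored.dtypes)
```

One caveat with this approach: a direct cast to an integer dtype raises an error if the generated strings look like floats (for example "4.0"), so such columns may need `pd.to_numeric` first.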

@zhao-zilong
Contributor

OK, cool. Thanks, @omaralvarez. Would you like to create a pull request to improve this part of the code?

zhao-zilong added the "enhancement" (New feature or request) label on Nov 11, 2024

@omaralvarez
Author

Yep, no problem. I am short on time right now, but I will put together a pull request as soon as I can.
