Out of memory while fit #1381
Hi @saswat0, I'm curious if you can speak more about your use case. What are you hoping to use the synthetic data for? We can definitely look into why this is happening with CTGAN. But do note that SDV offers multiple synthesizers beyond CTGAN. Depending on your use case, another synthesizer might be a better option, and it would still let you use all of the SDV features such as constraints, conditional sampling, etc.
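For context, here is a minimal sketch of what swapping in another synthesizer looks like, assuming the SDV 1.x single-table API; `data` is assumed to be the real pandas DataFrame:

```python
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Assumed: `data` is the real pandas DataFrame.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

# GaussianCopulaSynthesizer follows the same fit/sample workflow as CTGAN
# and supports the same SDV features (constraints, conditional sampling, ...),
# while typically being much lighter on memory.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=len(data))
```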
@npatki Thanks for the response. I have a dataset containing sensitive (PII) customer information, and I wanted to synthesise a new dataset from it that could be made public. The generated data must be drawn from the same distribution, and the new rows should be indistinguishable from the real ones (some ML models are trained on and perform well against the real data, and the new data should keep their results consistent). I used CTGAN for this.
I'm facing the same issue.
Hi @saswat0, thanks for the details. One thing you can try with CTGAN is assigning a `LabelEncoder` transformer to the categorical columns instead of the default one-hot encoding:

```python
from rdt.transformers.categorical import LabelEncoder

synthesizer = CTGANSynthesizer(metadata)
synthesizer.auto_assign_transformers(data)
synthesizer.update_transformers(column_name_to_transformer={
    'categorical_column_name': LabelEncoder(add_noise=True),
    'categorical_column_name_2': LabelEncoder(add_noise=True),
    # ... and so on for the remaining categorical columns
})
```

I'm not sure yet what effect this would have on the quality. Do let us know if you experiment with this! I'd always recommend checking your quality using the SDMetrics quality report. (This blog post may be useful too.)
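As a companion to that suggestion, a minimal sketch of running the SDMetrics quality report through SDV's wrapper, assuming SDV 1.x and that `data`, `metadata`, and `synthetic_data` (the output of `synthesizer.sample(...)`) exist as in the snippets above:

```python
from sdv.evaluation.single_table import evaluate_quality

# Assumed: `data` is the real DataFrame, `synthetic_data` came from
# synthesizer.sample(...), and `metadata` is the SingleTableMetadata object.
quality_report = evaluate_quality(
    real_data=data,
    synthetic_data=synthetic_data,
    metadata=metadata,
)
print(quality_report.get_score())                   # overall 0-1 score
print(quality_report.get_details('Column Shapes'))  # per-column breakdown
```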
Hi @npatki, I tried this method as well, but it failed. I had some success after reducing the real dataset's size and moving to a bigger machine. Is there any provision for using a distributed setup (a Spark cluster with several nodes and shared memory) for this?
Hi @npatki, I have the same issue: SDV still seems to use one-hot encoding for categorical columns even after specifying LabelEncoder as the transformer. Thank you! ;)
Hi @saswat0, I don't believe the CTGAN model is currently set up to make use of distributed infrastructure. I see you filed CTGAN issue 290, which we can keep open to discuss this.
@thaddywu this is unexpected! Do you have any code or examples to suggest that the LabelEncoder is not being applied?
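For anyone wanting to check this themselves, here is a sketch of inspecting which transformer each column was assigned, assuming the SDV 1.x API (where `get_transformers()` reports the per-column transformers once they have been assigned); column names are illustrative:

```python
from rdt.transformers.categorical import LabelEncoder
from sdv.single_table import CTGANSynthesizer

# Assumed: `metadata` and `data` are defined as in the earlier snippet.
synthesizer = CTGANSynthesizer(metadata)
synthesizer.auto_assign_transformers(data)
synthesizer.update_transformers(column_name_to_transformer={
    'categorical_column_name': LabelEncoder(add_noise=True),
})

# get_transformers() reports the transformer assigned to each column; the
# updated column should now show LabelEncoder rather than a one-hot transformer.
print(synthesizer.get_transformers())
```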
Hi everyone, I think this original discussion has split into several different issues that are now tracked separately, so I'm closing this one off as a duplicate. Feel free to reply to any of the issues below based on your feedback.

- See CTGAN #290 for multi-GPU support.
- See #1450 for issues when applying the `LabelEncoder` transformer.
- See #1451, an umbrella issue for performance improvements to the CTGAN synthesizer.
Environment details
If you are already running SDV, please indicate the following details about the environment in which you are running it:
Problem description
I'm trying to generate some synthetic data using SDV's CTGAN, but the code terminates midway due to memory overflow. My dataset is 195,634 rows x 24 columns, and my system has 64 GiB of memory. While running `fit`, memory overflows.
What I already tried
This is my error
How can SDV be used for datasets of this size? Is there any provision for training the model incrementally on smaller subsets instead?
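SDV's `CTGANSynthesizer` does not appear to expose an incremental fit, but a hedged workaround is to fit on a random subsample to cap peak memory. A sketch follows; the subsample size and `epochs` value are illustrative, not tuned, and `data` is assumed to be the 195,634 x 24 DataFrame from the problem description:

```python
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Assumed: `data` is the full real DataFrame.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

# Workaround sketch: fit on a random subsample to cap peak memory usage.
# 50,000 rows is an illustrative size, not a recommendation.
subsample = data.sample(n=50_000, random_state=0)

synthesizer = CTGANSynthesizer(metadata, epochs=300)  # epochs value is illustrative
synthesizer.fit(subsample)

# Sampling can still produce as many rows as the full dataset.
synthetic_data = synthesizer.sample(num_rows=len(data))
```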