Out of memory while fit #1381

Closed
saswat0 opened this issue Apr 19, 2023 · 8 comments
Labels
question General question about the software resolution:duplicate This issue or pull request already exists

Comments

saswat0 commented Apr 19, 2023

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.0.0
  • Python version: 3.8.16
  • Operating System: Ubuntu 18.04.6 LTS

Problem description

I'm trying to generate synthetic data using SDV's CTGAN, but the code terminates midway because it runs out of memory. My dataset is 195634 (rows) x 24 (columns) and my system has 64 GiB of memory. The memory overflows while running fit.

What I already tried

from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata, verbose=True)
synthesizer.fit(df)

synthetic_data = synthesizer.sample(num_rows=10)

This is my error

---------------------------------------------------------------------------
TerminatedWorkerError                     Traceback (most recent call last)
Cell In[37], line 4
      1 from sdv.single_table import CTGANSynthesizer
      3 synthesizer = CTGANSynthesizer(metadata, verbose=True)
----> 4 synthesizer.fit(df)
      6 synthetic_data = synthesizer.sample(num_rows=10)

File ~/anaconda3/envs/synthetic/lib/python3.8/site-packages/sdv/single_table/base.py:457, in BaseSynthesizer.fit(self, data)
    455 self._random_state_set = False
    456 processed_data = self._preprocess(data)
--> 457 self.fit_processed_data(processed_data)

File ~/anaconda3/envs/synthetic/lib/python3.8/site-packages/sdv/single_table/base.py:441, in BaseSynthesizer.fit_processed_data(self, processed_data)
    434 def fit_processed_data(self, processed_data):
    435     """Fit this model to the transformed data.
    436 
    437     Args:
    438         processed_data (pandas.DataFrame):
    439             The transformed data used to fit the model to.
    440     """
--> 441     self._fit(processed_data)
    442     self._fitted = True
    443     self._fitted_date = datetime.datetime.today().strftime('%Y-%m-%d')
...
    392         self = None

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

How can I use SDV for datasets of this size? Is there any provision for training the model incrementally on smaller subsets instead?

@saswat0 saswat0 added new Automatic label applied to new issues question General question about the software labels Apr 19, 2023
Contributor

npatki commented Apr 20, 2023

Hi @saswat0, I'm curious if you can speak more about your use case. What are you hoping to use the synthetic data for?

We can definitely look into why this is happening with CTGAN. But do note that the SDV offers multiple different synthesizers outside of just CTGAN. Depending on your use case, another synthesizer might be a better option and it would still allow you to use all of the SDV features such as constraints, conditional sampling, etc.

Resources

@npatki npatki added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Apr 20, 2023
Author

saswat0 commented Apr 20, 2023

@npatki Thanks for the response

I have a dataset with sensitive (PII) information on customer data, and I wanted to synthesise a new dataset from this to make it public. The generated data must be drawn from the same distribution, and the new rows should be indistinguishable (since some ML models are trained and perform well on the real data, and this new data should also keep their results consistent).

I used GaussianCopulaSynthesizer as per your advice, and it gave reasonable results. But since the quality of generated data is of utmost concern, I'm leaning toward NN models rather than statistical ones.

Author

saswat0 commented Apr 20, 2023

I'm facing the same issue with TVAESynthesizer

Contributor

npatki commented Apr 25, 2023

Hi @saswat0, thanks for the details.

One thing you can try with CTGANSynthesizer is to preprocess all the categorical columns. You may want to try the LabelEncoder. This notebook has some more information.

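from rdt.transformers.categorical import LabelEncoder
from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata)
synthesizer.auto_assign_transformers(data)

# Override the auto-assigned transformer for each categorical column
synthesizer.update_transformers(column_name_to_transformer={
    'categorical_column_name': LabelEncoder(add_noise=True),
    'categorical_column_name_2': LabelEncoder(add_noise=True),
    # ... repeat for the remaining categorical columns
})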

I'm not sure yet what effect this would have on the quality. Do let us know if you experiment with this!
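
As a rough illustration of why the choice of categorical encoding can matter for memory, here is a back-of-the-envelope sketch. It assumes the encoded data is held as a dense float32 matrix, which this thread does not confirm, and the column cardinalities are hypothetical values chosen for illustration. Only the row count comes from the issue. The helper names `one_hot_width` and `dense_matrix_gib` are made up for this example.

```python
def one_hot_width(cardinalities):
    """Total number of columns after one-hot encoding every categorical column."""
    return sum(cardinalities)


def dense_matrix_gib(n_rows, n_cols, bytes_per_value=4):
    """Size in GiB of a dense n_rows x n_cols matrix (float32 by default)."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Row count taken from the issue; the cardinalities below are hypothetical.
n_rows = 195_634
hypothetical_cardinalities = [50_000, 12_000, 300, 40, 5]

# One-hot encoding: one output column per distinct category value.
one_hot_cols = one_hot_width(hypothetical_cardinalities)
# Label encoding: one numeric output column per categorical column.
label_cols = len(hypothetical_cardinalities)

one_hot_gib = dense_matrix_gib(n_rows, one_hot_cols)
label_gib = dense_matrix_gib(n_rows, label_cols)

print(f"one-hot: {one_hot_cols} columns, about {one_hot_gib:.1f} GiB")
print(f"label:   {label_cols} columns, about {label_gib:.4f} GiB")
```

Under these assumptions, a few high-cardinality columns dominate the one-hot width, so checking the number of unique values per column may be a useful first step before fitting.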

I'd always recommend checking your quality using the SDMetrics quality report. (This blog post may be useful too.)

Author

saswat0 commented May 4, 2023

Hi @npatki, I tried this method as well, but it failed. I had some success after reducing the real dataset's size and moving to a bigger machine. Is there any provision for using a distributed setup (a Spark cluster with several nodes and shared memory) for this?
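
Reducing the dataset size, as described above, could be done with a reproducible uniform random sample of rows. The sketch below is a minimal standard-library version. The helper name `sample_row_positions` and the sample size of 50,000 are hypothetical, and this is plain subsampling, not incremental training (the thread does not confirm that incremental fitting is supported).

```python
import random


def sample_row_positions(n_rows, n_sample, seed=0):
    """Return sorted positions of a uniform random row sample, without replacement."""
    if n_sample > n_rows:
        raise ValueError("n_sample cannot exceed n_rows")
    rng = random.Random(seed)  # local generator, so the sample is reproducible
    return sorted(rng.sample(range(n_rows), n_sample))


# Row count taken from the issue; the sample size is an arbitrary example.
positions = sample_row_positions(n_rows=195_634, n_sample=50_000, seed=42)
print(len(positions))
```

With pandas, `df.iloc[positions]` would select those rows, and `df.sample(n=50_000, random_state=42)` is a similar one-call alternative. The resulting subset can then be passed to `synthesizer.fit` in place of the full DataFrame.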

thaddywu commented May 12, 2023

Hi @npatki, I have the same issue: SDV still seems to use one-hot encoding for categorical columns even when LabelEncoder is specified as the transformer. Thank you! ;)

Contributor

npatki commented May 15, 2023

Hi @saswat0, I don't believe the CTGAN model is currently set up to make use of distributed infrastructure. I see you filed CTGAN issue 290, which we can keep open to discuss this.

> SDV still seems to use one-hot encoding for categorical columns even specifying LabelEncoder as the transformer

@thaddywu this is unexpected! Do you have any code or examples to suggest that the LabelEncoder is being ignored and that one-hot encoding is being used instead? I tried this with the demo data, and the synthesizer appears to preprocess the data correctly using label encoding. We can file a separate issue to look into this.

Contributor

npatki commented Jun 5, 2023

Hi everyone, I think this original discussion has been split into several different issues that are currently being tracked -- so I'm closing this off as a duplicate. Feel free to reply to any of the issues below based on your feedback.

See CTGAN #290 for multi GPU support.

See #1450 for issues when applying the LabelEncoder to CTGAN (this includes a workaround that you can use in the meantime).

See #1451 as an umbrella issue for performance improvements to the CTGAN synthesizer.

@npatki npatki closed this as completed Jun 5, 2023
@npatki npatki added resolution:duplicate This issue or pull request already exists and removed under discussion Issue is currently being discussed labels Jun 5, 2023