Out of memory while fit #1381

saswat0 · 2023-04-19T03:41:43Z

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

SDV version: 1.0.0
Python version: 3.8.16
Operating System: Ubuntu 18.04.6 LTS

Problem description

I'm trying to generate synthetic data using SDV's CTGAN, but the code terminates midway because it runs out of memory. My dataset is 195634 rows x 24 columns, and my system has 64 GiB of memory. The memory runs out while running fit.

What I already tried

from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata, verbose=True)
synthesizer.fit(df)

synthetic_data = synthesizer.sample(num_rows=10)

This is my error

---------------------------------------------------------------------------
TerminatedWorkerError                     Traceback (most recent call last)
Cell In[37], line 4
      1 from sdv.single_table import CTGANSynthesizer
      3 synthesizer = CTGANSynthesizer(metadata, verbose=True)
----> 4 synthesizer.fit(df)
      6 synthetic_data = synthesizer.sample(num_rows=10)

File ~/anaconda3/envs/synthetic/lib/python3.8/site-packages/sdv/single_table/base.py:457, in BaseSynthesizer.fit(self, data)
    455 self._random_state_set = False
    456 processed_data = self._preprocess(data)
--> 457 self.fit_processed_data(processed_data)

File ~/anaconda3/envs/synthetic/lib/python3.8/site-packages/sdv/single_table/base.py:441, in BaseSynthesizer.fit_processed_data(self, processed_data)
    434 def fit_processed_data(self, processed_data):
    435     """Fit this model to the transformed data.
    436 
    437     Args:
    438         processed_data (pandas.DataFrame):
    439             The transformed data used to fit the model to.
    440     """
--> 441     self._fit(processed_data)
    442     self._fitted = True
    443     self._fitted_date = datetime.datetime.today().strftime('%Y-%m-%d')
...
    392         self = None

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

How can I use SDV for datasets of this size? Is there any provision for training the model incrementally on smaller subsets instead?
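As a rough sanity check on the numbers above: stored densely as 64-bit floats, a 195634 x 24 table is small next to 64 GiB, which suggests the memory pressure comes from intermediate representations built during fit rather than from the raw table itself. The arithmetic below is only an illustration; the float64 assumption and the helper name `dense_size_mib` are assumptions made for this sketch, not anything taken from SDV.

```python
ROWS = 195_634          # rows reported in the issue
COLS = 24               # columns reported in the issue
BYTES_PER_FLOAT64 = 8   # assumption: every value stored as a 64-bit float


def dense_size_mib(rows, cols, bytes_per_value=BYTES_PER_FLOAT64):
    """Size in MiB of a dense rows x cols table with fixed-width values."""
    return rows * cols * bytes_per_value / 2**20


raw_mib = dense_size_mib(ROWS, COLS)
print(f"{raw_mib:.1f} MiB")  # prints: 35.8 MiB
```

Object (string) columns in a real DataFrame take more space than this, but not by the three orders of magnitude needed to reach 64 GiB.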

npatki · 2023-04-20T14:45:53Z

Hi @saswat0, I'm curious if you can speak more about your use case. What are you hoping to use the synthetic data for?

We can definitely look into why this is happening with CTGAN. But do note that the SDV offers multiple different synthesizers outside of just CTGAN. Depending on your use case, another synthesizer might be a better option and it would still allow you to use all of the SDV features such as constraints, conditional sampling, etc.

Resources

List of all single table synthesizers, including the pros and cons of different approaches
Gaussian Copula Synthesizer API, this is my recommendation for what you can try
Demos, some notebooks that you can use to get started. In particular, I recommend the Quickstart and Gaussian Copula demos

saswat0 · 2023-04-20T17:26:33Z

@npatki Thanks for the response

I have a dataset with sensitive (PII) information on customer data, and I wanted to synthesise a new dataset from this to make it public. The generated data must be drawn from the same distribution, and the new rows should be indistinguishable (since some ML models are trained and perform well on the real data, and this new data should also keep their results consistent).

I used GaussianCopulaSynthesizer as per your advice, and it gave reasonable results. But since the quality of generated data is of utmost concern, I'm leaning toward NN models rather than statistical ones.

saswat0 · 2023-04-20T17:37:18Z

I'm facing the same issue with TVAESynthesizer

npatki · 2023-04-25T20:07:15Z

Hi @saswat0, thanks for the details.

One thing you can try with CTGANSynthesizer is to preprocess all the categorical columns. You may want to try the LabelEncoder. This notebook has some more information.

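from rdt.transformers.categorical import LabelEncoder
from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata)
# Let SDV assign default transformers first, then override the categorical ones.
synthesizer.auto_assign_transformers(data)

synthesizer.update_transformers(column_name_to_transformer={
    'categorical_column_name': LabelEncoder(add_noise=True),
    'categorical_column_name_2': LabelEncoder(add_noise=True),
    # ... one entry per remaining categorical column
})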

I'm not sure yet what effect this would have on the quality. Do let us know if you experiment with this!

I'd always recommend checking your quality using the SDMetrics quality report. (This blog post may be useful too.)
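One plausible reason label encoding could help is column width: a one-hot encoding produces one column per distinct category, while a label encoding keeps a single numeric column per categorical column. The sketch below compares the two for a dense float64 array. It is a simplified model, not a description of what CTGAN does internally, and the split into 20 numeric and 4 categorical columns and the cardinalities are hypothetical values chosen for illustration (the issue does not report them).

```python
def encoded_width(cardinalities, numeric_cols, one_hot):
    """Column count after encoding the categorical columns.

    one_hot=True:  each categorical column becomes one column per category.
    one_hot=False: each categorical column stays a single numeric column.
    """
    categorical_width = sum(cardinalities) if one_hot else len(cardinalities)
    return numeric_cols + categorical_width


def dense_gib(rows, cols, bytes_per_value=8):
    """Size in GiB of a dense rows x cols array (float64 by default)."""
    return rows * cols * bytes_per_value / 2**30


ROWS = 195_634                                     # from the issue
NUMERIC_COLS = 20                                  # hypothetical
HYPOTHETICAL_CARDINALITIES = [50_000, 10_000, 500, 12]  # hypothetical

one_hot_cols = encoded_width(HYPOTHETICAL_CARDINALITIES, NUMERIC_COLS, one_hot=True)
label_cols = encoded_width(HYPOTHETICAL_CARDINALITIES, NUMERIC_COLS, one_hot=False)

print(one_hot_cols, f"{dense_gib(ROWS, one_hot_cols):.1f} GiB")
print(label_cols, f"{dense_gib(ROWS, label_cols):.3f} GiB")
```

Under these made-up cardinalities, the one-hot layout alone would exceed 64 GiB while the label-encoded layout stays well under 1 GiB. Counting the distinct values in each categorical column of the real data (for example with pandas `nunique()`) would show whether this effect is relevant here.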

saswat0 · 2023-05-04T10:14:43Z

Hi @npatki, I tried this method as well, but it failed. I had some success after reducing the real dataset's size and moving to a bigger machine. Is there any provision for using a distributed setup (a Spark cluster with several nodes and shared memory) for this?
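For anyone wanting to reproduce the "reduce the dataset's size" workaround, a minimal sketch of picking a reproducible random subset of rows follows. The helper name `sample_row_indices`, the seed, and the target of 50,000 rows are arbitrary choices for illustration, not values from this thread. Fitting on a subsample is not incremental training, and it can drop rare categories, so quality should be re-checked afterwards.

```python
import random


def sample_row_indices(n_rows, n_keep, seed=0):
    """Return a sorted, reproducible random subset of row positions."""
    if n_keep >= n_rows:
        return list(range(n_rows))
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return sorted(rng.sample(range(n_rows), n_keep))


indices = sample_row_indices(n_rows=195_634, n_keep=50_000)

# With a pandas DataFrame named df, the subset would then be:
#     subset = df.iloc[indices]
# and subset, rather than df, would be passed to synthesizer.fit().
```

With pandas available, `df.sample(n=50_000, random_state=0)` gives an equivalent random subset in one call.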

thaddywu · 2023-05-12T05:47:25Z

Hi @npatki, I have the same issue: SDV still seems to use one-hot encoding for categorical columns even specifying LabelEncoder as the transformer. Thank you! ;)

npatki · 2023-05-15T13:52:38Z

Hi @saswat0, I don't believe the CTGAN model is currently set up to make use of distributed infrastructure. I see you filed CTGAN issue 290, which we can keep open to discuss this.

SDV still seems to use one-hot encoding for categorical columns even specifying LabelEncoder as the transformer

@thaddywu this is unexpected! Do you have any code or examples suggesting that the LabelEncoder is being ignored and that one-hot encoding is being used instead? I tried this with the demo data, and it appears that the synthesizer is correctly preprocessing the data using label encoding. We can file a separate issue to look into this.

npatki · 2023-06-05T20:39:38Z

Hi everyone, I think this original discussion has been split into several different issues that are currently being tracked, so I'm closing this one as a duplicate. Feel free to reply to any of the issues below with your feedback.

See CTGAN #290 for multi GPU support.

See #1450 for issues when applying the LabelEncoder to CTGAN (this includes a workaround that you can use in the meantime).

See #1451 as an umbrella issue for performance improvements to CTGAN Synthesizer.

saswat0 added the "new" and "question" labels on Apr 19, 2023

npatki added the "under discussion" label and removed the "new" label on Apr 20, 2023

npatki mentioned this issue on Apr 20, 2023: CopulaGAN : Out of memory error #1382 (Closed)

thaddywu mentioned this issue on May 17, 2023: ArrayMemoryError when using CTGAN (assuming numerical columns are discrete) #1433 (Closed)

npatki closed this as completed on Jun 5, 2023

npatki added the "resolution:duplicate" label and removed the "under discussion" label on Jun 5, 2023
