
DataTransformer init parameters #146

Open

FlorentRamb wants to merge 6 commits into main

Conversation

FlorentRamb commented:

This PR solves issue #7. It allows two things (see the usage sketch below):

  1. the ability to fit the Gaussian mixtures on a subsample (this helps scale to big datasets while losing only a little accuracy)
  2. the ability to pass init arguments to the DataTransformer through CTGANSynthesizer.fit (and thus to change other parameters, such as max_clusters)
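A minimal usage sketch of the proposed API. The toy data is hypothetical, and the key names max_clusters and max_gm_samples are inferred from this diff; the exact keys accepted depend on DataTransformer.__init__:

    import numpy as np
    import pandas as pd

    from ctgan import CTGANSynthesizer

    # Toy data: one continuous and one discrete column.
    data = pd.DataFrame({
        'amount': np.random.normal(100, 15, size=50_000),
        'category': np.random.choice(['a', 'b', 'c'], size=50_000),
    })

    ctgan = CTGANSynthesizer(epochs=10)
    # Forward init arguments to the DataTransformer: cap the number of GMM
    # components and fit the mixtures on a 10k subsample of each column.
    ctgan.fit(
        data,
        discrete_columns=['category'],
        data_transformer_params={'max_clusters': 5, 'max_gm_samples': 10_000},
    )
    samples = ctgan.sample(1_000)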

@CLAassistant commented Apr 16, 2021:

CLA assistant check: all committers have signed the CLA.

@FlorentRamb changed the title from "Gh 7 feat gmparams" to "DataTransformer init parameters" on Apr 19, 2021.
@fealho (Member) left a comment:

In general I think this looks good. @pvk-developer @amontanez24 what do you think?

Two review comments on ctgan/data_transformer.py were marked outdated and resolved.
@@ -184,3 +184,14 @@ def test_wrong_sampling_conditions():

     with pytest.raises(ValueError):
         ctgan.sample(1, 'discrete', "d")
+
+
+def test_ctgan_data_transformer_params():
A project member commented:

I think you should also add a performance test, something simple just to make sure that our results are not worse than before because of this change.

@FlorentRamb (Author) replied:

I'm not sure about this one. Do you mean a performance test of the Gaussian mixture model, or of CTGAN? In terms of speed or accuracy?

A project member replied:

Accuracy for CTGAN. Basically, just a test to make sure the changes don't break the code. So something like changing your continuous column to be a normal distribution instead of random, then sampling from the model (after you fit) and making sure the samples loosely follow a normal distribution.
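A minimal sketch of such a test (not part of the PR; the epoch count and the loose tolerance thresholds are placeholder assumptions):

    import numpy as np
    import pandas as pd

    from ctgan import CTGANSynthesizer


    def test_ctgan_recovers_normal_column():
        """Fit on a normal continuous column; samples should loosely match it."""
        data = pd.DataFrame({'continuous': np.random.normal(0.0, 1.0, size=1_000)})

        ctgan = CTGANSynthesizer(epochs=10)
        ctgan.fit(data)

        samples = ctgan.sample(1_000)

        # Loose bounds: guard against gross regressions, not exact recovery.
        assert abs(samples['continuous'].mean()) < 0.5
        assert abs(samples['continuous'].std() - 1.0) < 0.5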

@pvk-developer self-requested a review on April 21, 2021, 16:53.
@@ -267,7 +267,8 @@ def _validate_discrete_columns(self, train_data, discrete_columns):

         if invalid_columns:
             raise ValueError('Invalid columns found: {}'.format(invalid_columns))

-    def fit(self, train_data, discrete_columns=tuple(), epochs=None):
+    def fit(self, train_data, discrete_columns=tuple(), epochs=None,
+            data_transformer_params={}):
A project member commented:

The data_transformer_params should be moved to the __init__ and be assigned as self.data_transformer_params. (Use deepcopy if needed.)
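A minimal sketch of that suggestion, assuming a simplified CTGANSynthesizer shape (all other __init__ parameters omitted; the DataTransformer.fit signature is taken from ctgan/data_transformer.py):

    from copy import deepcopy

    from ctgan.data_transformer import DataTransformer


    class CTGANSynthesizer:
        def __init__(self, data_transformer_params=None):
            # deepcopy so later mutations of the caller's dict don't leak in.
            self.data_transformer_params = deepcopy(data_transformer_params or {})

        def fit(self, train_data, discrete_columns=tuple(), epochs=None):
            transformer = DataTransformer(**self.data_transformer_params)
            transformer.fit(train_data, discrete_columns)
            # ... rest of training unchanged ...

Defaulting to None instead of {} also sidesteps Python's shared-mutable-default pitfall that the proposed fit signature above would introduce.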


    def _fit_continuous(self, column_name, raw_column_data):
        """Train Bayesian GMM for continuous column."""
        if self._max_gm_samples <= raw_column_data.shape[0]:
            raw_column_data = np.random.choice(raw_column_data,
                                               size=self._max_gm_samples,
                                               replace=False)
A project member commented:

I think that for this kind of line breaking, this indentation is better:

    raw_column_data = np.random.choice(
        raw_column_data,
        size=self._max_gm_samples,
        replace=False
    )

@candalfigomoro commented:

@fealho @pvk-developer
Can we merge this? It's basically impossible to fit CTGAN on a large dataset because the Gaussian mixture fit is a huge bottleneck (even using dozens of CPUs). This PR would allow speeding up the Gaussian mixture step. Thanks

@fealho (Member) commented on Feb 9, 2023:

@npatki not sure what you want to do with this?

@candalfigomoro commented:

Meanwhile, the library code has changed, so the PR should be updated.

For example, the _fit_continuous method now receives a pandas DataFrame, so np.random.choice() can be replaced by something like data = data.sample(self._max_gm_samples, replace=False, random_state=SEED) (a sketch follows below).

Also, I wonder whether ClusterBasedNormalizer could optionally be replaced by a power transform, which might be faster (although it might impact the quality of the generated data); see sdv-dev/RDT#613.
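A minimal, self-contained sketch of that updated subsampling (SEED and the 10,000 cap are placeholder values; inside DataTransformer the cap would be self._max_gm_samples):

    import numpy as np
    import pandas as pd

    SEED = 42                # placeholder; a fixed seed keeps the subsample reproducible
    MAX_GM_SAMPLES = 10_000  # stands in for self._max_gm_samples

    data = pd.DataFrame({'continuous': np.random.normal(size=1_000_000)})

    # Subsample before the expensive Bayesian GMM fit.
    if len(data) > MAX_GM_SAMPLES:
        data = data.sample(MAX_GM_SAMPLES, replace=False, random_state=SEED)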
