
Performance of SynDiffix

This section compares the performance of SynDiffix with one open source implementation of GAN synthesis (CTGAN), two proprietary commercial products (gretel.ai and mostly.ai), and the open source Synthpop package.

Synthesis methods

We use the open source SDV (Synthetic Data Vault) implementation of CTGAN from Datacebo. Note that we also tested other synthesizers from SDV, including Gaussian Copula, Copula GAN, and TVAE. We found CTGAN to be consistently the best of these (with TVAE a close second), and so show only results for CTGAN.

The two proprietary methods were chosen primarily because they are prominent vendors in the commercial market. Each service was run with its default settings. Synthpop was selected because it has seen significant academic usage, and because it represents a different technology base.

CTGAN, SynDiffix, and Synthpop were executed on the same hardware. The commercial systems were run on their servers.

Synthpop was run with method = 'cart' (the default), smoothing = 'spline', cart.minbucket = 5, and maxfaclevels = 500000. The smoothing prevented Synthpop from directly placing sampled values into the synthetic data (which would break anonymity for columns with unique values). maxfaclevels was set high enough to allow Synthpop to run on columns with many distinct values. Column order matters with Synthpop; we ordered the columns from lowest to highest cardinality.

Datasets

We compare the five methods over two separate sets of data:

  • 24 artificial 2-column datasets, 12 with 7K rows and 12 with 28K rows. They have a variety of continuous and categorical columns with varying marginal distributions and levels of inter-column dependence. Each group of 12 has the same set of distributions: the only difference is the larger number of rows. These are used for marginal and pairs quality measures. (download here)
  • 13 datasets based on real data taken from a variety of sources. These are used for the ML efficacy measures:
| Dataset | Rows | Columns | Source |
| --- | --- | --- | --- |
| adult | 32561 | 15 | SDV |
| age-weight-sex | 3205 | 3 | Kaggle Male and Female height and weight |
| alarm | 20000 | 37 | SDV |
| BankChurners | 10127 | 22 | Kaggle Bank Churners |
| census | 299285 | 41 | SDV |
| census_extended | 65122 | 19 | SDV |
| child | 20000 | 20 | SDV |
| credit | 284807 | 30 | SDV |
| expedia_hotel_logs | 1000 | 25 | SDV |
| fake_hotel_guests | 1000 | 9 | SDV |
| insurance | 20000 | 27 | SDV |
| intrusion | 494021 | 41 | SDV |
| KRK_v1 | 1000 | 9 | SDV |

While most of the datasets are available from SDV, some of the SDV datasets were not used because they are one-hot encoded. SynDiffix performs better with tables that are not one-hot encoded, so including these would not give a fair representation of SynDiffix's performance. Other than this, there was no attempt to select datasets that favor SynDiffix. Note that we had also selected two other datasets, the New York City taxi dataset and a dataset of traffic violations from the city of Moers in Germany, but none of the ML scores on these original datasets were of high enough quality to include in our measures.

Each of the above datasets can be downloaded from mpi-sws.org/~francis/sampleDatasets/<dataset_name>.tar.gz.

Measures

We use the SDMetrics library for marginal and pairs quality measures and ML efficacy measures. SDMetrics is also from Datacebo (the same group that produces the CTGAN synthesizer).

For its marginal column quality measures, SDMetrics uses the Kolmogorov-Smirnov statistic for continuous data, and the Total Variation Distance for categorical data.

For its column pairs quality measures, SDMetrics uses a Correlation Similarity score for continuous data, and a Contingency Similarity score for categorical data.

Scores for both marginals and column pairs quality range from 0.0 (worst quality) to 1.0 (best quality). Note that a score of 1.0 does not mean that the synthetic data is a perfect replica of the original data, and in particular does not necessarily imply weak anonymity.
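
For reference, here is a minimal sketch of how these overall quality scores can be obtained with SDMetrics' QualityReport. The file names, column names, and metadata schema are illustrative, not the ones used in this study, and the exact metadata format depends on the SDMetrics version:

```python
import pandas as pd
from sdmetrics.reports.single_table import QualityReport

# Original and synthetic tables; file names are illustrative.
real_df = pd.read_csv("original.csv")
synth_df = pd.read_csv("synthetic.csv")

# The metadata schema varies across SDMetrics versions; recent releases
# expect an 'sdtype' per column, as sketched here.
metadata = {
    "columns": {
        "age": {"sdtype": "numerical"},
        "sex": {"sdtype": "categorical"},
    }
}

report = QualityReport()
report.generate(real_df, synth_df, metadata)

print(report.get_score())                        # overall quality score, 0.0 to 1.0
print(report.get_details("Column Shapes"))       # marginals (KS / TV complement per column)
print(report.get_details("Column Pair Trends"))  # pairs (correlation / contingency similarity)
```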

To measure ML efficacy, we split the initial dataset into randomly selected original and test datasets at a ratio of 70/30. We generate a synthetic dataset from the original dataset.

SDMetrics measures ML efficacy by training an ML model over the synthetic data, and then measuring how well the model performs over the test dataset. SDMetrics supports four models for binary classification, two models for multiclass classification (categorical data), and two models for regression (continuous data). They are described here.

The ML efficacy score for binary and categorical data is the F1 test score, which combines precision and recall and ranges from 0.0 (worst) to 1.0 (perfect). The ML efficacy score for continuous data is the coefficient of determination. A score of 1.0 means perfect accuracy, but a bad score can be arbitrarily low (below 0.0).
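
A minimal sketch of this measurement pipeline, assuming SDMetrics' BinaryAdaBoostClassifier efficacy metric and scikit-learn's train_test_split; the file names are illustrative, and keyword arguments may differ slightly across SDMetrics versions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sdmetrics.single_table import BinaryAdaBoostClassifier

# Split the initial dataset 70/30 into 'original' and 'test' portions.
data = pd.read_csv("intrusion.csv")
original_df, test_df = train_test_split(data, test_size=0.3, random_state=0)

# A synthetic dataset is generated from original_df by the synthesizer under
# test; here it is assumed to have been written to this (hypothetical) file.
synth_df = pd.read_csv("intrusion_synthetic.csv")

# Train the model on the synthetic data, evaluate it on the held-out test data.
score = BinaryAdaBoostClassifier.compute(
    test_data=test_df,
    train_data=synth_df,
    target="land",    # example target column, as used in the ML Efficacy section
    metadata=None,    # or an SDMetrics metadata dict
)
print(score)          # F1 on the test set: 0.0 (worst) to 1.0 (perfect)
```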

Synthetic data samples

For all tests, the number of synthetic data samples produced is set to the number of rows in the original data.

The current implementation of SynDiffix does not allow the user to choose how many data samples are produced: it always tries to model the original data completely. This limitation, however, is not fundamental to SynDiffix. It could be designed to produce fewer or more samples.

High-accuracy synthesis (2-column tables)

We start by answering the question "how accurate is synthetic data when looking at a small number of columns?" This applies in cases where an analyst is interested in common statistics over specific columns of interest, as opposed to building ML prediction models over many columns. SynDiffix's anonymization method safely allows for multiple views of the same data, thus enabling this kind of statistical analytics.

A fundamental characteristic of any data anonymization is that increasing the number of columns decreases data accuracy: marginal or column-pair accuracy measured over a many-column synthesis is poorer than over a synthesis of just the columns of interest. It is for this reason that we limit ourselves to 2-column tables here.

The following figure gives the marginals and pairs data quality for all five methods, as measured by the overall marginals and pairs quality scores from the SDMetrics quality report. The figure is a standard Seaborn boxplot with 0, 25, 50, 75, and 100 percentile ticks plus outliers.

[Figure: boxplots of marginals and pairs quality scores for all five methods]

This table gives the median marginals and pairs overall quality scores.

| Method | Marginal Data Quality | Pairs Data Quality |
| --- | --- | --- |
| CTGAN | 0.958 | 0.950 |
| gretel.ai | 0.942 | 0.988 |
| mostly.ai | 0.983 | 0.992 |
| SynDiffix | 0.996 | 0.9994 |
| Synthpop | 0.991 | 0.998 |

We can compute the improvement in accuracy of one method over another as how much closer it is to a perfect score. So for instance 0.99 is 2x more accurate than 0.98, and 0.995 is 4x more accurate than 0.98. This measure is computed as (1 - lower_score) / (1 - higher_score). The following table gives the quality improvement of SynDiffix over the other methods.

| Method | Marginals Improvement | Pairs Improvement |
| --- | --- | --- |
| CTGAN | 12x | 79x |
| gretel.ai | 16x | 19x |
| mostly.ai | 5x | 13x |
| Synthpop | 2.6x | 3x |
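
For reference, this improvement factor is easy to recompute; here is a minimal sketch. Note that the improvement table above was presumably computed from unrounded medians, so recomputing from the rounded scores in the quality table gives somewhat different factors:

```python
def improvement(lower_score: float, higher_score: float) -> float:
    """Factor by which higher_score is closer to a perfect score of 1.0."""
    return (1 - lower_score) / (1 - higher_score)

# Examples from the prose above.
print(improvement(0.98, 0.99))    # 2.0
print(improvement(0.98, 0.995))   # 4.0

# Recomputed from the rounded median marginal scores in the quality table;
# the improvement table itself was presumably computed from unrounded medians,
# so its factors differ somewhat from these.
print(improvement(0.958, 0.996))  # CTGAN vs SynDiffix, roughly 10x
print(improvement(0.991, 0.996))  # Synthpop vs SynDiffix, roughly 2x
```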

In all of these measures, we see that SynDiffix is far more accurate than the other methods.

To get a better sense of what these measures mean, the following figure plots the actual synthetic data for one of the 28K-row, 2-column datasets (2dimAirportcluster.csv). The black points are the original data, and the blue-green points are the synthetic data (these plots are generated by SDMetrics).

[Figure overviewfig2: original (black) vs. synthetic (blue-green) data points for 2dimAirportcluster.csv, one panel per method]

From this we see that a very small difference in the quality score represents a large difference in the synthetic data quality. How much this impacts the analytic task at hand depends on the task itself, but the difference could often be important.

TODO: These SDMetrics quality scores are overly compressed into the high end, which is perhaps good for marketing but not ideal for understanding data quality. May consider using other measures.

This article compares SynDiffix, CTGAN, and mostly.ai synthesized over one day of the New York City taxi data.

ML Efficacy

The value of a given synthetic dataset for ML modeling depends on a variety of factors: for instance, the amount of predictive precision needed for the target application, the type of ML model used, and so on. Taking all of these factors into account is not feasible for this initial study. Rather, we simply aim for a fair comparison between the different methods.

To test ML efficacy, we use a set of table / target column / ML model combinations (e.g. the BinaryAdaBoostClassifier targeting the land column of the intrusion.csv table). We derive this set by running every appropriate ML model over every column of every table, and then discarding those where the ML efficacy measure is less than 0.7. In this way, we only measure ML efficacy on those cases where there is a chance of good ML performance. This avoids negatively biasing the synthetic data efficacy measure with models that don't work well in any event.

For all of the measures, prior to running the SDMetrics measure itself, we ran recursive feature elimination with cross-validation to select the K most important features for the target column, and removed the remaining columns. This improved the median ML efficacy scores for all methods.

For SynDiffix, this feature selection improved the median score from 0.903 to 0.920, while for mostly.ai the median improved from 0.920 to 0.924. In other words, SynDiffix benefits more from selecting the K features than mostly.ai does (mostly.ai being the only method with a better median ML efficacy score than SynDiffix). We don't know exactly how mostly.ai works, but SynDiffix does not correlate the non-selected columns with the K selected columns when it synthesizes, so it makes sense that SynDiffix benefits more from excluding them.
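
A minimal sketch of this feature-selection step, assuming scikit-learn's RFECV with an illustrative estimator; the estimator, scoring, and any categorical encoding actually used in the study are not specified here:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

def select_features(df: pd.DataFrame, target_column: str) -> pd.DataFrame:
    """Keep only the features that RFECV selects for the given target column.

    The estimator, scoring, and categorical encoding used in the study are not
    specified here; a random forest over already-numeric columns is purely
    illustrative.
    """
    X = df.drop(columns=[target_column])
    y = df[target_column]

    selector = RFECV(estimator=RandomForestClassifier(random_state=0), cv=5)
    selector.fit(X, y)

    selected = X.columns[selector.support_].tolist()
    return df[selected + [target_column]]
```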

The following gives the ML efficacy scores (286 scores per boxplot).

[Figure overviewfig3: boxplots of ML efficacy scores per synMethod, including noAnon]

The noAnon synMethod gives the scores for ML measures on the original data. This shows roughly the best expected ML scores against which the synthetic data methods can be compared. Note that some of the measures on the original data have scores lower than 0.7. This is because some of the ML scores using only the K selected features are worse than the corresponding scores using all features (although on average the scores with only the K features are better).

Note also that scores for Synthpop are not shown because there is too much missing data. For the scores we do have, Synthpop is very close to SynDiffix.

Note finally that there are a few very bad scores in all of the methods (including noAnon). We suspect that at least a few of these are due to measurement quirks, but we have not yet chased them down.

The following table gives the median ML efficacy scores, and the corresponding SynDiffix improvement (we use a negative value to indicate that SynDiffix is worse than the other method).

| Method | Median ML efficacy | ML efficacy improvement |
| --- | --- | --- |
| CTGAN | 0.822 | 2.13 |
| gretel.ai | 0.909 | 1.07 |
| mostly.ai | 0.916 | -1.04 |
| noAnon | 0.928 | -1.26 |
| SynDiffix | 0.914 | --- |

The main take-away here is that mostly.ai, SynDiffix, and gretel.ai all perform comparably well, with mostly.ai slightly better and gretel.ai slightly worse than SynDiffix. An important caveat here is that these ML efficacy measures require multiple synthesis operations for SynDiffix (one per target column), while they require only one synthesis operation for the other methods.

Note that it is not the case that one method performs consistently better than another. The following is a scatterplot of the individual mostly.ai and SynDiffix ML efficacy scores. The red line denotes where the two scores are equivalent (it is not a fitted curve).

[Figure overviewfig4: scatterplot of mostly.ai vs. SynDiffix ML efficacy scores, with the line of equal scores in red]

This scatterplot shows that, while SynDiffix and mostly.ai have similar scores most of the time, sometimes one is substantially better than the other.

Execution time

Boxplots of the execution times are shown here, for 2-column and real datasets. Execution times are measured from after the original data is loaded to before the synthesized data is output. Note the log scale.

[Figure overviewfig5: boxplots of execution times per method (log scale), for 2-column and real datasets]

The following table gives the median elapsed time improvement of SynDiffix over the other methods, computed as method_elapsed_time / syndiffix_elapsed_time, for both 2-column tables (representative of descriptive analytics use cases) and real tables (representative of ML use cases).

| Method | 2-column tables improvement | Real tables improvement |
| --- | --- | --- |
| CTGAN | 324x | 309x |
| gretel.ai | 153x | 111x |
| mostly.ai | 16x | 33x |
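
As a sketch, this improvement factor can be computed from a per-dataset table of elapsed times; the column names and the exact aggregation below are assumptions, not taken from the study:

```python
import pandas as pd

def median_time_improvement(times: pd.DataFrame, method: str) -> float:
    """Median of per-dataset elapsed-time ratios of `method` vs. SynDiffix.

    `times` is assumed to hold one row per dataset and one column of elapsed
    seconds per synthesis method (column names are illustrative). Whether the
    study takes the median of per-dataset ratios or a ratio of medians is not
    stated; this sketch takes the former reading.
    """
    return float((times[method] / times["syndiffix"]).median())
```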

Across datasets of different sizes and shapes, SynDiffix is consistently an order of magnitude faster than mostly.ai, and two orders of magnitude faster than CTGAN.

Finally, note that SynDiffix is highly parallelizable. Substantial improvements over these results are possible.

Missing measures

There are additional measures that would be useful.

Regarding data quality, it would be useful to directly measure various statistical properties like average, standard deviation, correlation, confidence intervals, and so on.

Regarding ML, it would be useful to measure the effectiveness of synthetic data for specific applications.

It would also be useful to measure memory usage.
