SynDiffix: Overview
This two-part article introduces SynDiffix V1, a new approach to creating statistically-accurate and strongly anonymous synthetic data from structured data. SynDiffix is based on Diffix anonymization which was co-developed by the Max Planck Institute for Software Systems (MPI-SWS) and the former Aircloak GmbH, and used in commercial settings. SynDiffix is developed by MPI-SWS and Open Diffix.
Compared to CTGAN and mostly.ai, the median performance of this first version of SynDiffix over a variety of datasets:
- is many times more accurate for single-column and column pair data accuracy (for few-column tables),
- has better ML efficacy than CTGAN, but worse than mostly.ai (for many-column tables),
- and executes two orders of magnitude faster than CTGAN and one order of magnitude faster than mostly.ai (for all tables).
The following table shows how the median scores for SynDiffix compare to CTGAN and mostly.ai for the datasets we tested:
How SynDiffix compares to: | CTGAN | mostly.ai |
---|---|---|
Single-column data accuracy (2-columns) | 12x better | 5x better |
Column-pair data accuracy (2-columns) | 80x better | 12x better |
ML Efficacy (many-columns) | 70% better | 60% worse |
Execution time | 120x faster | 9x faster |
These improvements are possible because, compared to existing approaches to synthetic data, the anonymization principles behind SynDiffix are both fundamentally more sound and better integrated into the data synthesis architecture.
All of the methods tested have very strong anonymity (certainly GDPR strength). We demonstrate this using the Anonymeter privacy risk measuring tool (github). Nevertheless, there is a rare but certainly possible inference attack that works against CTGAN and mostly.ai but not against SynDiffix. During its six years of development, there have been ongoing, thorough analyses of Diffix vulnerabilities, including two bounty programs, and at this time no effective attacks are known. Other than the attack mentioned here, we are likewise unaware of any effective attacks against CTGAN and mostly.ai.
SynDiffix' better anonymization principles lead to two important advantages:
- For any given synthesis operation, SynDiffix is able to better push the boundaries of accuracy while remaining strongly anonymous. It is able to make fine-grained adjustments to precision and noise in different parts of the data to maximize accuracy.
- It safely allows multiple different synthesis operations over the same data. SynDiffix remains anonymous even with multiple synthesis operations over different combinations of columns or different, overlapping ranges of the data, or for that matter repeated instances of the same synthesis operation.
We cannot overemphasize this second point. The ability to safely get multiple different views of the data, for instance focusing on individual columns or pairs of columns on one hand, or taking a view of many columns together on the other, is key to good data analysis. SynDiffix' dramatic improvement in data quality combined with the ability to get multiple views enables an entire class of new use cases for synthetic data; those that require basic statistical functions like count, sum, average, median, standard deviation, correlation, and so on. SynDiffix' improvement in execution time opens the door to interactive data exploration applications.
Indeed, SynDiffix' anonymization is strong enough that opening a SynDiffix query interface to the public is a realistic possibility.
SynDiffix' smarter anonymization lays the foundation for substantial improvements over version 1. Our current approach to producing synthetic data for ML applications is frankly simplistic. We have a long list of improvements in mind, and believe there is a good chance that we can go well beyond the ML efficacy of Generative Adversarial Network (GAN) learning approaches, while continuing to improve data quality and execution time.
In the following sections of Part 1, we look at the performance of SynDiffix V1, present the Anonymeter privacy risk scores, describe the key anonymization concepts and briefly discuss the inference attack on CTGAN and mostly.ai and why it doesn't work against SynDiffix, and end with a discussion of what kinds of improvements we can expect.
Note that the purpose of the measurements presented here is to make a rough comparison of SynDiffix with other state-of-the-art synthetic data methods. Any of the methods may or may not work well for a given use case.
Part 2 explores SynDiffix and Diffix anonymization mechanisms in more detail.
We start by looking at the performance of SynDiffix V1, and compare it with two current popular GAN-based alternatives, CTGAN and mostly.ai.
We use the open source SDV (Synthetic Data Vault) implementation of CTGAN from Datacebo. Note that we also tested other synthesizers from SDV, including Gaussian Copula, Copula GAN, and TVAE. We found CTGAN to be consistently the best of these (with TVAE a close second), and so show only results for CTGAN.
We use the commercial product mostly.ai partly out of convenience: they offer a free service. More importantly, mostly.ai performs substantially and consistently better than CTGAN. Mostly.ai has received a total of $33M in funding and has roughly 40 employees.
CTGAN and mostly.ai were run with their default settings (300 epochs for CTGAN; 200 epochs and the "Medium" model size for mostly.ai). SynDiffix was also run with its default settings.
CTGAN and SynDiffix were executed on the same hardware. mostly.ai was executed on mostly.ai servers.
Note that we also did initial comparisons against the synthpop R package for synthetic data using the non-parametric CART method. While the utility of synthpop is extremely good, it has not really been designed for strong anonymity by GDPR criteria. We found cases of high-confidence singling-out and inference. Since we are interested only in methods with strong anonymization, we chose not to include synthpop in our analysis.
We compare the three techniques over two separate sets of data:
- 24 artificial 2-column datasets, 12 with 7K rows and 12 with 28K rows. They have a variety of continuous and categorical columns with varying marginal distributions and levels of inter-column dependence. Each group of 12 has the same set of distributions: the only difference is the larger number of rows. These are used for marginal and pairs quality measures. (Download here)
- 14 based on real data taken from a variety of sources. These are used for ML efficacy measures:
Dataset | Rows | Columns | Source |
---|---|---|---|
adult | 32561 | 15 | SDV |
age-weight-sex | 3205 | 3 | Kaggle Male and Female height and weight |
alarm | 20000 | 37 | SDV |
BankChurners | 10127 | 22 | Kaggle Bank Churners |
census | 299285 | 41 | SDV |
census_extended | 65122 | 19 | SDV |
child | 20000 | 20 | SDV |
credit | 284807 | 30 | SDV |
expedia_hotel_logs | 1000 | 25 | SDV |
fake_hotel_guests | 1000 | 9 | SDV |
insurance | 20000 | 27 | SDV |
intrusion | 494021 | 41 | SDV |
KRK_v1 | 1000 | 9 | SDV |
taxi-one-day | 440257 | 22 | NYC Taxi City Data |
Each of the above datasets can be downloaded from mpi-sws.org/~francis/sampleDatasets/<dataset_name>.tar.gz.
We use the SDMetrics library for marginal and pairs quality measures and ML efficacy measures. SDMetrics is also from Datacebo (the same group that produces the CTGAN synthesizer).
For its marginal column quality measures, SDMetrics uses the Kolmogorov-Smirnov statistic for continuous data, and the Total Variation Distance for categorical data.
For its column pairs quality measures, SDMetrics uses a Correlation Similarity score for continuous data, and a Contingency Similarity for categorical data.
Scores for both marginals and column-pairs quality range from 0.0 (worst quality) to 1.0 (best quality). Note that a measure of 1.0 does not mean that the synthetic data is a perfect replica of the original data, and in particular does not necessarily imply weak anonymity.
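To make the categorical measure concrete, the Total Variation Distance comparison can be sketched in a few lines. This is a minimal sketch of the idea behind SDMetrics' Total Variation Distance score, not the library's actual implementation:

```python
from collections import Counter

def tv_complement(real_values, synthetic_values):
    """Quality score for a categorical column: 1 - Total Variation Distance.

    A sketch of the idea behind SDMetrics' categorical quality score;
    the library's own implementation may differ in details.
    """
    real_freq = Counter(real_values)
    syn_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(syn_freq)
    n_real, n_syn = len(real_values), len(synthetic_values)
    # TVD = half the L1 distance between the two frequency distributions.
    tvd = 0.5 * sum(
        abs(real_freq[c] / n_real - syn_freq[c] / n_syn) for c in categories
    )
    return 1.0 - tvd
```

Identical category distributions score 1.0, disjoint ones score 0.0; the Kolmogorov-Smirnov-based score plays the analogous role for continuous columns.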
To measure ML efficacy, we split the initial dataset into randomly selected `original` and `test` datasets at a ratio of 70/30. We generate a synthetic dataset from the `original` dataset.
SDMetrics measures ML efficacy by training an ML model over the synthetic data, and then measuring how well the model performs over the test dataset. SDMetrics supports four models for binary classification, two models for multiclass classification (categorical data), and two models for regressions (continuous data). They are described here.
The ML efficacy score for binary and categorical data is the F1 test score, which combines precision and recall and ranges from 0.0 (worst) to 1.0 (perfect). The ML efficacy score for continuous data is the coefficient of determination (R²). A score of 1.0 means 100% accurate, but a bad score can be arbitrarily low (less than 0.0).
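For reference, the F1 score used for classification tasks is the harmonic mean of precision and recall. A minimal sketch from raw prediction counts (SDMetrics computes it from the trained model's predictions on the test set):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, computed from
    true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        return 0.0  # no true positives: both precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, a model with perfect recall but 50% precision scores 2/3, reflecting that F1 penalizes whichever of the two components is weaker.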
For all tests, the number of synthetic data samples produced is set to the number of rows in the original data.
SynDiffix V1 itself does not allow the user to choose how many data samples are produced: it always tries to model the original data completely (though the final number of rows in the synthetic data is not exactly the same as the original number of rows due to added noise). This limitation, however, is not fundamental to SynDiffix. It could be designed to produce fewer or more samples.
We start by answering the question "how accurate is synthetic data when looking at a small number of columns?" This applies in cases where an analyst is interested in common statistics over specific columns of interest, as opposed to building ML prediction models over many columns. SynDiffix' anonymization method safely allows for multiple views of the same data, thus enabling this kind of statistical analytics.
A fundamental characteristic of any data anonymization is that increasing the number of columns decreases data accuracy. Measuring marginals or column pair accuracy from a table with many columns would therefore produce an overly pessimistic measure. It is for this reason that we limit ourselves to 2-column tables here.
The following figure gives an overview of the marginals and pairs data quality for all three methods. This is a standard Seaborn boxplot with 0, 25, 50, 75, and 100 percentile ticks plus outliers. The measures are the overall marginals and pairs quality scores from the SDMetrics quality report.
This table presents the median marginals and pairs overall quality scores (for the 7k and 28k datasets combined).
Method | Marginal Data Quality | Pairs Data Quality |
---|---|---|
CTGAN | 0.958 | 0.950 |
mostly.ai | 0.983 | 0.992 |
SynDiffix | 0.996 | 0.999 |
This table gives the improvement in quality of SynDiffix over CTGAN and mostly.ai, measured as `(1 - method2_score)/(1 - method1_score)`, where `method1_score` is the better of the two scores being compared.
Method | Marginals Improvement | Pairs Improvement |
---|---|---|
CTGAN | 12.0 | 79.2 |
mostly.ai | 4.8 | 12.6 |
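The improvement figures can be reproduced from the median quality scores. A small sketch; the slight mismatch with the table (e.g. roughly 10.5 rather than 12.0 for CTGAN marginals) presumably comes from the published ratios being computed from unrounded medians:

```python
def improvement(better_score: float, worse_score: float) -> float:
    """How much closer the better score is to perfect quality (1.0),
    measured as the ratio of remaining error: (1 - worse) / (1 - better)."""
    return (1.0 - worse_score) / (1.0 - better_score)

# Rounded medians from the tables above: SynDiffix vs CTGAN marginals.
print(improvement(0.996, 0.958))  # roughly 10.5 with these rounded inputs
```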
In all of these measures, we see that SynDiffix is far more accurate than mostly.ai, which is in turn far more accurate than CTGAN.
To get a better sense of what these measures mean, the following figure plots the actual synthetic data for one of the 28k-row, 2-column datasets (`2dimAirportcluster.csv`). The black points are the original data, and the blue-green points are the synthetic data (these plots are generated by SDMetrics).
From this we see that a very small difference in the quality score represents a large difference in the synthetic data quality. How much this impacts the analytic task at hand depends on the task itself, but the difference could often be important.
Note that these SDMetrics quality scores are compressed into the high end of the scale, which is perhaps good for marketing but not ideal for understanding data quality.
To test ML efficacy, we start by finding all of the cases where ML models over the original data perform well, having an SDMetrics score of 0.8 or better. The synthetic data ML efficacy measures are based on these high-quality cases only. This avoids negatively biasing the synthetic data efficacy measure with models that don't work well in any event.
The following figure plots all of the individual efficacy measures (one model on one target column) as boxplots. The plot on the right further separates the data into models for binary, categorical, and continuous columns.
The following table gives the median ML efficacy score and the improvement of SynDiffix compared to the other methods for the real datasets.
Method | Median ML efficacy | SynDiffix improvement |
---|---|---|
CTGAN | 0.800 | 1.71 |
mostly.ai | 0.928 | -1.62 |
SynDiffix (focus) | 0.883 | --- |
Note that the SynDiffix method in the plots above is labeled `syndiffix_focus`. This refers to a "focused" mode of SynDiffix whereby the user can specify which column will be the target of the ML model. This improves the ML efficacy for that model relative to the "general" mode of SynDiffix, which does not favor any particular column.
Mostly.ai's efficacy score is 60% better than SynDiffix', which is in turn 70% better than CTGAN's.
Boxplots of the execution times are shown here, for 2-column and real datasets. Execution times are measured from after the original data is loaded to before the synthesized data is output. Note the log scale.
Across datasets of different sizes and shapes, SynDiffix is consistently almost an order of magnitude faster than mostly.ai, and two orders of magnitude faster than CTGAN.
(It is on our TODO list to add memory usage measures.)
We use the Anonymeter tool to measure privacy. Anonymeter works by running attacks against the synthesized data and reporting how effective they are relative to a statistical baseline. We ran singling-out and inference attacks, which are two of the three GDPR criteria for anonymity (the third being linkability, which doesn't apply well to this setting).
The Anonymeter privacy risk score measures the attack's precision gain over a statistical baseline, rescaled by the room for improvement above that baseline. For example, suppose the attack is trying to infer sex, and the dataset has 50% males and 50% females. If the attack has 70% precision, the privacy risk is (0.70 − 0.50) / (1 − 0.50) = 0.4.
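A sketch of that computation (Anonymeter's actual implementation also reports confidence bounds, which we omit here):

```python
def privacy_risk(attack_precision: float, baseline_precision: float) -> float:
    """Attack's precision gain over the baseline, rescaled by the room for
    improvement above the baseline: 1.0 means the attack is always right,
    0.0 means it is no better than statistical guessing."""
    return (attack_precision - baseline_precision) / (1.0 - baseline_precision)

# The example from the text: 70% attack precision against a 50% baseline.
print(privacy_risk(0.70, 0.50))  # approximately 0.4
```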
When analyzing Diffix privacy, we used a similar precision improvement measure, and argued that any privacy risk below around 0.5 demonstrates acceptable anonymity: at 0.5, the target of the attack has substantial plausible deniability. Privacy risk below 0.25 or so is very good.
The following are boxplots of privacy risks for all three methods. These represent a mix of inference and singling-out attacks, with roughly 200 different attacks for each method. For inference attacks, we assume that the attacker knows everything about the target except the secret column. Anonymeter also computes confidence bounds, and we excluded those with confidence bounds higher than 0.2. (These usually occurred in inference attacks on columns where a large majority of rows have a single value.)
All three methods demonstrate very low privacy risk. SynDiffix has no attacks with a privacy risk above 0.5, and only three (of 195 attacks) with a privacy risk above 0.2. CTGAN has only one above 0.5, and mostly.ai has no attacks above 0.2.
SynDiffix is built on the anonymization mechanisms of Diffix. Diffix itself is quite mature, having gone through multiple iterations over six years, been used in demanding commercial settings, been evaluated by Data Protection Officers (DPO) and one Data Protection Authority (DPA) as being GDPR compliant, and having gone through rigorous internal privacy analyses as well as two bounty programs.
The key mechanisms of Diffix anonymization are aggregation, suppression, and noise.
Rather than release individual data points, Diffix releases aggregates like count or sum. Aggregation is the most common anonymization mechanism. The results of an election, for instance, are conveyed as aggregates (1029 votes for candidate A, 1262 votes for candidate B).
To help ensure that aggregates don't reveal information about individuals, Diffix suppresses aggregates that pertain to too few individuals. A candidate with one or two votes (or zero votes) would simply not be mentioned. Finally, Diffix adds noise to ensure that information about individuals cannot be inferred through intersection attacks. Revealing that 1262 people voted for B, and that 1261 men voted for B allows one to infer that one woman voted for B. Instead Diffix might say that 1264 people voted for B, and that 1259 men voted for B.
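The election example can be sketched as follows. This is purely illustrative: the threshold, noise distribution, and seeding below are simplified stand-ins, not Diffix's actual parameters (Diffix uses layered, "sticky" noise seeded from the query itself):

```python
import random

def anonymize_counts(counts, suppress_below=3, noise_sd=1.0, seed=0):
    """Illustrative sketch of two Diffix mechanisms: suppress aggregates
    that cover too few individuals, and perturb the rest with noise."""
    rng = random.Random(seed)
    result = {}
    for key, count in counts.items():
        if count < suppress_below:
            continue  # suppressed: the aggregate pertains to too few people
        noisy = count + rng.gauss(0.0, noise_sd)
        # Never report a noisy count below the suppression threshold.
        result[key] = max(suppress_below, round(noisy))
    return result

votes = {"A": 1029, "B": 1262, "C": 2}
print(anonymize_counts(votes))  # candidate C is suppressed; A and B are noisy
```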
A simple and intuitive description of these basic mechanisms can be found in this Open Diffix article.
Diffix has other mechanisms that protect against far more sophisticated attacks, some of them involving hundreds of queries. They are discussed briefly in Part 2 of this article. A complete description and privacy analysis of the Elm version of Diffix is documented in this ArXiv paper. The anonymity criteria used in this analysis are the same as those defined by the EU Article 29 Data Protection Working Party Opinion on Anonymization Techniques: singling-out, linkability, and inference.
A major drawback of Diffix (running alone, without SynDiffix) is that it forces the user to find good aggregates. If a user tries to release data pertaining to too few individuals, Diffix suppresses the output. The only way to get useful data from Diffix is to request aggregates that are large enough to avoid suppression. For instance, asking for exact salaries may lead to excessive suppression, whereas asking for salaries in aggregates of $10,000 avoids suppression except for the highest salaries.
To use Diffix effectively, a user needs to explore different-sized aggregates to find a good balance between precision and distortion (suppression and noise). This process works fine, but it can be tedious and time consuming.
SynDiffix relieves the user of this task by automating the exploration of aggregates, and then presenting the results as synthetic data (also known as microdata) instead of aggregates. Nevertheless, under the hood, SynDiffix is working with Diffix aggregates.
This is illustrated in the following figure. SynDiffix fits a set of aggregate bins to a column of continuous data values. The bins have more precision (smaller ranges) where the data is denser. Bins are large enough to avoid suppression. Noise is added to counts, and microdata values are assigned to values in the range of the aggregate bin.
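The density-adaptive binning idea can be sketched as a recursive range split: a range is halved only while the resulting bins remain large enough to avoid suppression, so denser regions end up with finer bins. This is a simplified sketch, not SynDiffix's actual tree-building (which also adds noise and handles column combinations):

```python
def adaptive_bins(values, lo, hi, threshold=10):
    """Recursively split the range [lo, hi) in half while the halves
    stay usable: bins with fewer than `threshold` values are suppressed
    entirely, bins with fewer than 2 * `threshold` values are kept
    whole, and larger bins are split further."""
    inside = [v for v in values if lo <= v < hi]
    if len(inside) < threshold:
        return []  # suppressed: too few values to report safely
    if len(inside) < 2 * threshold or hi - lo < 1e-6:
        return [(lo, hi, len(inside))]
    mid = (lo + hi) / 2.0
    return (adaptive_bins(inside, lo, mid, threshold)
            + adaptive_bins(inside, mid, hi, threshold))
```

With a dense cluster of values and a sparse tail, the cluster ends up covered by many narrow bins while the tail is reported as one wide bin (or suppressed entirely if too sparse).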
This process is repeated for combinations of columns. See Part 2 of this article for more details, including how to overcome scaling problems due to too many column combinations.
CTGAN and mostly.ai use a GAN approach to data synthesis. The key idea is to run a series of approximations of the original data, with each approximation getting closer to replicating the real data. The process terminates, however, before the original data is exactly replicated (i.e. before over-fitting).
The following figure illustrates this process (images taken from this paper).
This GAN approach accomplishes two important common anonymization properties:
- It modifies or hides rare values.
- It effectively adds noise in the sense that counts of categorical values, or counts of rows in a histogram of numeric values, are no longer exact.
These are both powerful properties, and can be found in virtually every strong anonymization system (including SynDiffix, which hides rare values through aggregation, and adds noise explicitly). For almost any practical use case, GAN-based anonymization by itself is sufficiently strong so long as over-fitting is avoided.
Nevertheless, GAN-based anonymization in and of itself is not inherently anonymous. For this reason, mostly.ai for instance applies additional anonymizing mechanisms to the dataset as a pre-processing step.
An example of this is what mostly.ai calls Extreme Sequence Length Protection. This protects against revealing information about data subjects that have an extreme number of rows in an event dataset (i.e. time-series). This could allow an attacker to infer an attribute of the data subject by observing an unusual number of rows associated with that attribute.
CTGAN doesn't protect against this, and indeed in simple tests on artificial data, we could reliably run such an inference attack.
Because mostly.ai checks for this globally, they can miss cases where a data subject does not have an extreme number of rows relative to the full dataset, but does have an extreme number of rows relative to a portion of the dataset. Imagine, for instance, a medical dataset covering multiple hospitals, where a data subject has an extreme number of rows relative to a given hospital, but not relative to data subjects in other hospitals. One might be able to infer an attribute of that user by looking at synthetic data for that hospital. We could reliably run this attack against mostly.ai using artificial data.
SynDiffix has an analogous protection mechanism, flattening. Flattening, however, is applied at every bin generated by SynDiffix. As a result, a correctly configured SynDiffix is not vulnerable to any form of this attack.
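A minimal sketch of the flattening idea: cap each protected entity's contribution before counting, so no single data subject can dominate a bin. (The fixed cap here is a simplified stand-in; SynDiffix's actual flattening adjusts extreme contributions relative to the other top contributors.)

```python
def flatten_counts(rows_per_subject, cap=5):
    """Cap each data subject's row contribution to a bin's count, so a
    subject with an extreme number of rows cannot dominate the bin."""
    return sum(min(n, cap) for n in rows_per_subject.values())

# A subject with 100 rows contributes no more than the cap.
print(flatten_counts({"alice": 100, "bob": 3, "carol": 4}))  # 12
```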
We wish to stress that this does not mean that CTGAN or mostly.ai are not anonymous practically speaking. The probability that a user might run this attack for malicious purposes is exceedingly small. Such a thing would require:
- That the dataset has the condition (rare).
- That a user would know that the condition exists and the identity of the associated data subject (rare, requires knowledge of both the data subject's number of rows and the fact that all other data subjects do not have a large number of rows).
- That, in spite of having this knowledge, the user would nevertheless not already know the attribute that can be discovered because of the condition.
- That the user would have an interest in learning the attribute.
- That the user would know that the attribute can be discovered with this method.
That all these conditions would exist even once seems remote, much less exist often. The point we want to make here is not that GAN-based approaches have weak anonymity from a practical perspective, but that integrating strong anonymization mechanisms throughout the data-modeling process is an even sounder approach to synthetic data.
Development on SynDiffix started in late 2022. In 10 months, we've gone from initial concept to a working implementation that far outperforms best-in-class commercial and open source designs in data accuracy, and is getting close in ML modeling efficacy.
We've gone through a process of designing mechanisms, testing and finding issues, and re-designing. This process is by no means over. The current design is really just a snapshot in the middle of that process. The design is pretty good, but frankly some ideas are quite naive and we believe there is significant room for improvement. For each of the steps of the process (tree building, refining, dependence measuring, sub-table selection, microdata assignment, and stitching), we have ideas on how to improve the mechanism.
SynDiffix' approach to anonymization makes it well-suited for adding additional features safely. A core design principle for adding new features to Diffix is that, so long as the identity of protected entities is preserved in the new feature, then the basic Diffix anonymization mechanisms (suppression, proportional sticky noise, etc.) will protect the entities.
An example of this is automatically capturing the structure of a text column while ensuring anonymity. One good reason for doing this is to make synthetic data more realistic as system/software testing data (without having to manually specify the structure, as is typical today). In principle, SynDiffix could run anonymized queries over text properties, like substrings, string lengths, and frequency of characters. In this way, SynDiffix could automatically discover structure like the part of the credit card that identifies the carrier, and represent that structure statistically (i.e. more Visa numbers than American Express). Because the underlying structure and statistics of the text is determined in an anonymized fashion, post-processing can be freely done without regard for anonymization.
Among the features that could be safely added to SynDiffix are:
- More datatypes, such as geographic data and the full spectrum of datetime types (currently we support numeric, text, and some datetime).
- Hierarchical categories.
- Auto-discovery of constraints between columns (e.g. start time is always less than end time).
- Better "look and feel" of data (same numeric precision as in original data, good distribution of text lengths, character frequency, and so on).
- Auto-discovery of structure in text.
- Accurate reconstruction of time-series data, including time between events and constraints between events (follow-up visit always comes after initial visit).
- Accurate statistics across the protected entity population (e.g. what fraction of individuals are responsible for 90% of violent crime).
In short, SynDiffix is a better basis for synthetic data, and has the potential to increase use cases for synthetic data as well as make synthetic data easier to create.