SynDiffix: Overview
This article introduces SynDiffix, a new approach to generating statistically accurate and strongly anonymous synthetic data from structured data. Compared to existing open-source and proprietary commercial approaches, SynDiffix:
- is many times more accurate,
- has equal or better ML efficacy,
- runs many times faster, and
- has equal or stronger anonymization.
In addition to existing use cases for synthetic data, these improvements open the door to descriptive analytics use cases, where existing synthetic data products perform poorly. They can also lead to easier-to-use test data and better time-series data, among other possibilities.
The underlying anonymization mechanisms for SynDiffix are based on Diffix, which was co-developed by the former Aircloak GmbH and the Max Planck Institute for Software Systems (MPI-SWS) and used in commercial settings. SynDiffix is developed by MPI-SWS and Open Diffix.
Data accuracy and ML efficacy are the two most common metrics for evaluating synthetic data. The following table shows how the median scores for SynDiffix compare to several prominent open source and proprietary synthetic data products over a variety of datasets.
|  | CTGAN | mostly.ai | gretel.ai | tonic.ai |
|---|---|---|---|---|
| Single-column data accuracy* | 14x better | 5x better | 19x better | 7x better |
| Column-pair data accuracy* | 250x better | 19x better | 25x better | 15x better |
| ML Efficacy** | 2x better | 4% worse | 5% better | 29% better |
| Execution time | 300x faster | 16x faster | 111x faster | N/A |
| Privacy properties | more robust | equivalent | more robust | more robust |
* For descriptive analytics use cases (in this case, synthetic data with two columns)
** For ML use cases (in this case, full synthesis of real datasets)
A common approach to anonymization for the other data products is Generative Adversarial Networks (GANs). Although there are differences in how and to what extent each of these products uses GANs, we will refer to them collectively as GAN synthesis.
This much improvement may seem surprising given all of the enthusiasm surrounding GAN synthesis in the last few years. The improvement is possible because SynDiffix takes a fundamentally different approach to anonymization. The key differences are:
- The basic building block of SynDiffix is strongly anonymizing for all data types, utilizing aggregation, generalization, suppression, and noise.
- SynDiffix uses a stopping point that allows it to more accurately model the underlying data while maintaining strong anonymity.
- SynDiffix uses a sticky noise mechanism that safely accommodates multiple views of the data, allowing syntheses that are tailored to the downstream use case.
The basic building block of SynDiffix is anonymizing: it generalizes data values, aggregates them into bins, suppresses bins with too few individuals, adds noise to the bin counts, and then samples from the bins.
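To make these steps concrete, here is a minimal sketch of such a pipeline for a single numeric column. It is illustrative only, not SynDiffix's actual implementation; the bin width, suppression threshold, and noise level are placeholder choices.

```python
import numpy as np

def anonymized_sample(values, bin_width=10, suppress_below=5, noise_sd=1.0, seed=0):
    """Illustrative pipeline: generalize -> aggregate -> suppress -> noise -> sample."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Generalize: map each value to the lower edge of its bin.
    edges = (values // bin_width) * bin_width
    # Aggregate: count the individuals in each bin.
    bins, counts = np.unique(edges, return_counts=True)
    # Suppress: drop bins with too few individuals.
    keep = counts >= suppress_below
    bins, counts = bins[keep], counts[keep]
    # Noise: perturb each bin count (never below zero).
    noisy = np.maximum(0, np.round(counts + rng.normal(0, noise_sd, len(counts)))).astype(int)
    # Sample: draw synthetic values uniformly within each surviving bin.
    return np.concatenate([rng.uniform(b, b + bin_width, n) for b, n in zip(bins, noisy)])

ages = np.random.default_rng(1).integers(0, 100, 1000)
synthetic_ages = anonymized_sample(ages)
```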
The basic building blocks of GAN synthesis are sampling from individual columns and randomly combining column values (using an iterative process to model the statistical properties of the data). When sampling from continuous data, the values themselves are modified, because curves are fitted to the data and sampling is from the curves. When sampling from text or categorical columns, however, the individual values themselves are retained. If these values contain PII (personally identifying information like names or credit card numbers), then anonymity is broken.
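The difference between the two cases can be seen in a toy sketch, using a kernel density estimate as a stand-in for whatever curve a given product actually fits:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Continuous column: fit a curve to the data, then sample from the curve.
incomes = rng.lognormal(10, 0.5, 1000)
synthetic_incomes = gaussian_kde(incomes).resample(1000)[0]
# The synthetic values are new; none match an original value exactly.

# Text/categorical column: sampling can only reuse the original values.
names = np.array(["Alice Smith", "Bob Jones", "Carol Wu"])  # imagine real PII
synthetic_names = rng.choice(names, size=10)
# Every synthetic "name" is a real name from the data, so anonymity is broken
# unless extra pre/postprocessing removes or replaces such values.
```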
Either GAN synthesis products must themselves implement additional pre- and postprocessing mechanisms to protect against these retained values, or users of GAN synthesis must understand this failure mode and preprocess or postprocess the data to prevent it. Existing GAN synthesis products that implement additional mechanisms do a good job, but nevertheless fail in rare corner cases.
All anonymization methods try to be as accurate as possible without revealing private information. A key aspect of any anonymization method is how it determines when to stop improving accuracy.
Machine learning models that replicate the training data too closely perform poorly because they don't generalize well to test data. GANs therefore have mechanisms for deciding when to stop improving accuracy relative to the training data, in order to avoid overfitting. Avoiding overfitting happens to also be good for privacy, because the original data is not exactly replicated and therefore re-identification is less likely.
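The stopping mechanism can be illustrated with a model much simpler than a GAN: keep increasing model capacity (here polynomial degree, standing in for training progress) until error on held-out data stops improving. This is a generic early-stopping sketch, not any particular product's criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.2, 200)            # noisy ground truth
x_tr, y_tr, x_val, y_val = x[:150], y[:150], x[150:], y[150:]

best_deg, best_err, patience = 0, float("inf"), 2
for degree in range(1, 16):                             # capacity grows each iteration
    coeffs = np.polyfit(x_tr, y_tr, degree)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    if val_err < best_err:
        best_deg, best_err, patience = degree, val_err, 2
    else:
        patience -= 1
        if patience == 0:
            break                                       # stop: more fitting means overfitting
print(f"stopped at degree {best_deg} (validation MSE {best_err:.4f})")
```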
GAN synthesis exploits this by using one mechanism, overfitting avoidance, to achieve both good data generalization and data anonymity. In other words, the privacy of GAN synthesis is a side effect of the normal overfitting avoidance of GANs.
If the downstream application is descriptive analytics, for instance histograms or basic statistics like average and standard deviation, then the stopping point for avoiding overfitting is more conservative than necessary and data accuracy suffers.
This is illustrated in the following figure. Preprocessing or postprocessing is needed to overcome the above-mentioned weaknesses in GAN synthesis. If the downstream application is ML modeling, then overfitting avoidance is applied again. If the downstream application is not ML modeling, for instance descriptive analytics, then the overfitting avoidance of GAN synthesis is too conservative and unnecessarily degrades accuracy.
Rather than piggy-backing on a mechanism not designed for privacy per se, SynDiffix automatically applies the classic anonymization mechanisms of generalization, aggregation, suppression, and noise. These mechanisms are applied as needed, according to the data itself, to maximize data quality while protecting anonymity. For instance, there is more generalization where data is sparse (e.g. the ages of very old people). If the downstream application is ML modeling, then overfitting avoidance can be applied at that stage.
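The "more generalization where data is sparse" behavior can be sketched as a recursive range split that stops as soon as a range holds too few individuals. The real SynDiffix tree construction is more involved; the threshold below is a placeholder.

```python
import numpy as np

def adaptive_bins(values, lo, hi, min_count=10):
    """Recursively halve a range, but only while it holds enough individuals.
    Sparse regions end up as wide bins, i.e. more generalization."""
    values = np.asarray(values)
    if len(values) < 2 * min_count or hi - lo <= 1:
        return [(lo, hi)]                    # too few individuals: keep the coarse bin
    mid = (lo + hi) / 2
    return (adaptive_bins(values[values < mid], lo, mid, min_count) +
            adaptive_bins(values[values >= mid], mid, hi, min_count))

rng = np.random.default_rng(0)
ages = np.concatenate([rng.normal(40, 10, 500),          # dense middle ages
                       np.array([95.0, 97.0, 102.0])])   # sparse very old individuals
print(adaptive_bins(ages, 0, 128))  # narrow bins near 40, one wide bin covering the oldest
```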
GAN synthesis takes a one-size-fits-all approach. The same synthetic dataset is generated regardless of the downstream use case.
This is illustrated in the figure above. With GAN synthesis, the intended use is that a single table is generated, and this is then used for multiple different purposes, for instance making a histogram of a single column A, or building a predictive ML model for a given target column F.
While SynDiffix can also operate in a one-size-fits-all mode, it is designed to operate in a way that tailors each data synthesis to the intended use case.
If the user wants a histogram of column A, then SynDiffix can maximize accuracy for that use case by synthesizing data for only column A. If the user wants a predictive ML model targeting column F, then SynDiffix can build a table that more accurately captures the relationships between column F and the other columns. This synthetic data would sacrifice the accuracy of, say, column A in order to maximize the predictive power of the ML model.
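As a usage sketch, assuming the Synthesizer API from the SynDiffix repository (names and parameters should be checked against the current release), tailoring is a matter of choosing which columns to synthesize:

```python
import pandas as pd
from syndiffix import Synthesizer  # API from the SynDiffix repository

df = pd.read_csv("patients.csv")   # hypothetical dataset with columns A..F

# Use case 1: a histogram of column A -- synthesize only that column.
df_hist = Synthesizer(df[["A"]]).sample()

# Use case 2: an ML model predicting F -- synthesize F together with only the
# feature columns the model needs, at the cost of accuracy elsewhere.
df_ml = Synthesizer(df[["B", "C", "D", "F"]]).sample()
```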
In principle, GAN synthesis could also generate multiple tailored syntheses by removing unnecessary columns before synthesizing. The problem with this is that multiple GAN syntheses of the same data erode privacy. GAN synthesis uses randomness: each random synthesized dataset reveals something different about the true data, and enough different views could allow re-identification of some of the true data.
For instance, if a user wanted to measure how column A correlates with the other columns, they could separately synthesize columns A and B, then A and C, then A and D, and so on. This would produce many different views of column A and a consequent loss of privacy.
To take advantage of tailored syntheses, the developers of GAN synthesis would either have to overcome this problem, or show that multiple syntheses do not lead to dangerous levels of re-identification.
SynDiffix doesn't have this problem. A key innovation of SynDiffix is sticky noise: the same noise value is applied to the data each time it is released. Multiple syntheses of the same data don't erode privacy, because the underlying noise is the same each time. SynDiffix is designed to remain anonymous even in use cases where a query interface to SynDiffix is open to the public.
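The idea can be sketched as deriving each bin's noise from a hash of a secret salt and the bin's definition, so that repeated syntheses reproduce exactly the same perturbation. This is a simplification of the actual Diffix mechanisms.

```python
import hashlib
import numpy as np

SECRET_SALT = b"per-deployment secret"   # hypothetical; kept private by the data holder

def sticky_noise(column, bin_label, sd=1.0):
    """Noise that depends only on the salt and the bin, not on when it is requested."""
    digest = hashlib.sha256(SECRET_SALT + f"{column}:{bin_label}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    return np.random.default_rng(seed).normal(0, sd)

# The same bin always receives the same noise, so releasing the data twice
# (or a thousand times) reveals nothing beyond the first release:
assert sticky_noise("age", "40-50") == sticky_noise("age", "40-50")
```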
- The Github repository for SynDiffix is here.
- This article gives the details of the performance measures for the five methods: SynDiffix, CTGAN, gretel.ai, mostly.ai, and tonic.ai.
- This article presents measures of anonymity for the five methods, and describes corner cases where some of the methods fail to protect privacy.
- This article describes how SynDiffix operates in more detail.
The open source release of SynDiffix is meant primarily as a proof-of-concept implementation for testing purposes. It lacks features such as robust handling of different datetime formats, and its ease of use is limited.
Despite the good performance of this release, we regard SynDiffix as a work in progress, and we are still innovating rapidly. For instance, the prior release of two months ago had half the ML efficacy of mostly.ai; now we are at parity, and we expect substantial improvements in the coming months.
Besides quality improvements, SynDiffix's ability to take multiple views of the data without eroding privacy makes it well-suited for adding additional features safely.
An example of this is automatically capturing the structure of a text column while ensuring anonymity. One good reason for doing this is to make synthetic data more realistic as system/software testing data (without having to manually specify the structure, as is typical today). In principle, SynDiffix could run anonymized queries over text properties, like substrings, string lengths, and character frequencies. In this way, SynDiffix could automatically discover structure like the part of a credit card number that identifies the carrier, and represent that structure statistically (e.g. more Visa numbers than American Express numbers).
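As a toy illustration of the kind of query this would involve (hypothetical; this feature does not exist in the current release):

```python
from collections import Counter

# Leading-digit prefixes of card numbers indicate the carrier
# (4 = Visa, 5 = Mastercard, 3 = American Express, ...).
cards = ["4111111111111111", "4000123412341234", "5500005555555559", "378282246310005"]
prefix_counts = Counter(card[0] for card in cards)
print(prefix_counts)  # Counter({'4': 2, '5': 1, '3': 1})
# In SynDiffix, such counts would pass through the anonymizing building block
# (suppression, noise) before driving the generation of synthetic card numbers.
```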
Among the features that could be safely added to SynDiffix are:
- More datatypes, such as geographic data and the full spectrum of datetime formats (currently we support numeric, text, and some datetime).
- Hierarchical categories.
- Auto-discovery of constraints between columns (e.g. start time is always less than end time).
- Better "look and feel" of data (same numeric precision as in original data, good distribution of text lengths, character frequency, and so on).
- Auto-discovery of structure in text.
- Accurate reconstruction of time-series data, including time between events and constraints between events (follow-up visit always comes after initial visit).
- Accurate statistics across the protected entity population (e.g. what fraction of individuals are responsible for 90% of violent crime).
In summary, SynDiffix is already competitive with existing products where the downstream use case is ML modeling. It is far superior where the downstream use case is descriptive analytics, and likely opens new business cases for descriptive analytics.