SynDiffix: Overview

Paul Francis edited this page Jul 17, 2023 · 23 revisions

SynDiffix: Better synthetic data quality from smarter anonymization

This article introduces SynDiffix, a new approach to generating statistically accurate and strongly anonymous synthetic data from structured data. Compared to existing open-source and proprietary commercial approaches, SynDiffix

  • is many times more accurate,
  • has equal or better ML efficacy,
  • runs many times faster, and
  • has stronger anonymization.

The underlying anonymization mechanisms for SynDiffix are based on Diffix, which was co-developed by the former Aircloak GmbH and the Max Planck Institute for Software Systems (MPI-SWS) and used in commercial settings. SynDiffix is developed by MPI-SWS and Open Diffix.

Data accuracy and ML efficacy are the two most common metrics for evaluating synthetic data. The following table shows how the median scores for SynDiffix compare to several prominent open source and proprietary synthetic data products over a variety of datasets.

How SynDiffix compares to:               | CTGAN       | mostly.ai  | gretel.ai | tonic.ai
-----------------------------------------|-------------|------------|-----------|---------
Single-column data accuracy (2-columns)  | 12x better  | 5x better  | zzzz      | zzzz
Column-pair data accuracy (2-columns)    | 80x better  | 12x better | zzzz      | zzzz
ML Efficacy (many-columns)               | 70% better  | 60% worse  | zzzz      | zzzz
Execution time                           | 120x faster | 9x faster  | zzzz      | zzzz

A common approach to anonymization among the other data products is the use of Generative Adversarial Networks (GANs). Although the products differ in how and to what extent they use GANs, we will refer to them collectively as GAN synthesis.

This much improvement may seem surprising given all of the enthusiasm surrounding GAN synthesis in the last few years. The improvement is possible because SynDiffix takes a fundamentally different approach to anonymization. The two key differences are:

  1. SynDiffix uses a stopping point that allows it to more accurately model the underlying data while maintaining strong anonymity.
  2. SynDiffix uses a sticky noise mechanism that safely accommodates multiple views of the data, allowing syntheses that are tailored to the downstream use case.

Better stopping point

All anonymization methods try to be as accurate as possible without revealing private information. A key aspect of any anonymization method is how it determines when to stop improving accuracy: the stopping criterion.

Machine learning models that replicate the training data too closely perform poorly because they don't generalize well to test data. GANs therefore have mechanisms for when to stop improving accuracy, relative to the training data, in order to avoid overfitting. Avoiding overfitting happens to also be good for privacy, because the original data is not exactly replicated and therefore re-identification is less likely.

GAN synthesis exploits this by using one mechanism, overfitting avoidance, to achieve both good data generalization and data anonymity. In other words, the privacy of GAN synthesis is a side effect of the normal overfitting avoidance of GANs.

There are two problems with coupling privacy and overfitting avoidance to one mechanism.

First, if the downstream application is descriptive analytics, for instance histograms or basic statistics like average and standard deviation, then the stopping point for avoiding overfitting is more conservative than necessary and data accuracy suffers.

Second, for data with certain characteristics, overfitting avoidance mechanisms fail to adequately protect anonymity. To deal with this, GAN synthesis products add a variety of pre-processing and post-processing mechanisms. While we believe that these mechanisms, combined with overfitting avoidance, adequately protect privacy in practice, there are in fact rare corner cases where they fail.

This is illustrated in the following figure. Preprocessing or postprocessing is needed to overcome weaknesses in the GAN synthesis. If the downstream application is ML modeling, then overfitting avoidance is applied again. If the downstream application is not ML modeling, for instance descriptive analytics, then the overfitting avoidance of GAN synthesis is too conservative and unnecessarily degrades accuracy.

[Figure: overview14]

Rather than piggy-backing on a mechanism not designed for privacy per se, SynDiffix automatically applies the classic anonymization mechanisms of generalization, aggregation, suppression, and noise. These mechanisms are applied as needed according to the data itself to maximize data quality while protecting anonymity. For instance, there is more generalization where data is sparse (e.g. the ages of very old people). If the downstream application is ML modeling, then overfitting avoidance can be applied then.

[Figure: overview15]
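
The interplay of generalization, suppression, and noise described above can be sketched for a single numeric column as follows. This is a minimal illustration, not SynDiffix's actual algorithm: the function name `anonymized_histogram`, the thresholds `split_threshold` and `min_count`, the Gaussian noise, and the example ages are all assumptions made for this sketch.

```python
import random


def anonymized_histogram(values, lo, hi, split_threshold=20, min_count=5,
                         noise_sd=1.0, seed=0):
    """Return (lo, hi, noisy_count) buckets for integer values in [lo, hi).

    Hypothetical sketch: dense ranges are split into narrow buckets, sparse
    ranges stay wide (generalization), tiny buckets are dropped (suppression),
    and every released count is perturbed (noise).
    """
    rng = random.Random(seed)

    def split(vals, b_lo, b_hi):
        # Stop refining when the bucket is sparse or already one unit wide.
        if len(vals) < split_threshold or b_hi - b_lo <= 1:
            return [(b_lo, b_hi, len(vals))]
        mid = (b_lo + b_hi) // 2
        left = [v for v in vals if v < mid]
        right = [v for v in vals if v >= mid]
        return split(left, b_lo, mid) + split(right, mid, b_hi)

    released = []
    for b_lo, b_hi, n in split([v for v in values if lo <= v < hi], lo, hi):
        if n < min_count:
            continue  # suppression: too few individuals to release safely
        noisy = max(0, round(n + rng.gauss(0, noise_sd)))
        released.append((b_lo, b_hi, noisy))
    return released


# Made-up ages: many people aged 20-59, only a handful of very old people.
ages = [age for age in range(20, 60) for _ in range(10)]
ages += [82, 85, 88, 90, 91, 93, 96, 99]

buckets = anonymized_histogram(ages, lo=0, hi=128)
for b_lo, b_hi, count in buckets:
    print(f"ages [{b_lo}, {b_hi}): about {count}")
```

With this made-up data, the densely populated ages end up in one-year buckets, while the few very old people are reported only as a single wide bucket covering ages 64 and up. Empty ranges are not released at all.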

Tailored syntheses

GAN synthesis takes a one-size-fits-all approach. The same synthetic dataset is generated regardless of the downstream use case.

[Figure: overview12]

This is illustrated in the figure above. With GAN synthesis, the intended use is that a single table is generated, and this is then used for multiple different purposes, for instance making a histogram of a single column A, or building a predictive ML model for a given target column F.

While SynDiffix can also operate in a one-size-fits-all mode, it is designed to operate in a way that tailors each data synthesis to the intended use case.

[Figure: overview13]

If the user wants a histogram of column A, then SynDiffix can maximize accuracy for that use case by synthesizing data for only column A. If the user wants a predictive ML model targeting column F, then SynDiffix can build a table that more accurately captures the relationships between column F and other columns. This synthetic data would sacrifice the accuracy of, for instance, column A in order to maximize the predictive power of the ML model.

In principle, GAN synthesis could also generate multiple tailored syntheses by removing unnecessary columns before synthesizing. The problem with this is that multiple GAN syntheses of the same data erode privacy. GAN synthesis uses randomness. Each randomly synthesized dataset reveals something different about the true data, and enough different views could allow re-identification of some of the true data.

For instance, if a user wanted to measure how column A correlates with the other columns, they could separately synthesize columns A and B, columns A and C, A and D, and so on. This would lead to many different views of column A and the subsequent loss of privacy.
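
The risk of many differently randomized views can be illustrated with a deliberately simplified stand-in: instead of a GAN, assume each release is the true count plus fresh Gaussian noise. The count, the noise level, and the number of releases below are made up for illustration. The sketch shows only the general statistical effect, not an attack on any particular product.

```python
import random
import statistics


def noisy_release(true_count, noise_sd, rng):
    """One release of a count, perturbed with fresh (non-repeating) noise."""
    return true_count + rng.gauss(0.0, noise_sd)


rng = random.Random(42)   # fixed seed so the sketch is reproducible
true_count = 17           # hypothetical private value
noise_sd = 2.0            # hypothetical noise level per release

# An analyst who obtains many independently randomized views of the same
# value can average them; the noise cancels out and the truth emerges.
releases = [noisy_release(true_count, noise_sd, rng) for _ in range(10_000)]
estimate = statistics.fmean(releases)

print(f"single release: {releases[0]:.2f}")
print(f"average of {len(releases)} releases: {estimate:.2f}")
```

A single release is off by a couple of units on average, but the average of many releases lands very close to the true value, which is why repeated randomized views of the same data need care.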

To take advantage of tailored syntheses, the developers of GAN synthesis would either have to overcome this problem, or show that multiple syntheses don't lead to dangerous levels of re-identification.

SynDiffix doesn't have this problem. A key innovation of SynDiffix is sticky noise: the same noise value is applied to a given piece of data each time it is released. Multiple syntheses of the same data don't erode privacy because the underlying noise is the same each time. SynDiffix is designed to remain anonymous even in use cases where a query interface to SynDiffix is open to the public.
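
One way to picture sticky noise is to derive the noise from a keyed hash of what is being released, so that the same release always receives the same perturbation. The sketch below is an illustration under stated assumptions, not SynDiffix's implementation: the secret salt, the string `bucket_key` used to identify a bucket, and the helper names are all hypothetical.

```python
import hashlib
import hmac
import random

# Hypothetical secret known only to the data holder.
SECRET_SALT = b"hypothetical-per-dataset-secret"


def sticky_noise(bucket_key, noise_sd=1.0):
    """Return noise that depends only on the secret salt and the bucket."""
    digest = hmac.new(SECRET_SALT, bucket_key.encode("utf-8"),
                      hashlib.sha256).digest()
    seed = int.from_bytes(digest[:8], "big")
    # Seeding a private generator makes the draw repeatable per bucket.
    return random.Random(seed).gauss(0.0, noise_sd)


def sticky_count(true_count, bucket_key, noise_sd=1.0):
    """Release a count whose noise is identical on every request."""
    return true_count + sticky_noise(bucket_key, noise_sd)


first = sticky_count(17, "age=[20,21)")
again = sticky_count(17, "age=[20,21)")
other = sticky_count(17, "age=[21,22)")

print(first == again)   # same bucket, same noisy answer
print(first == other)   # different bucket, different noise
```

Because repeated requests for the same bucket return the identical noisy value, asking again yields no new information, and averaging the answers does not bring an analyst any closer to the true count.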

For more information

  • The GitHub repository for SynDiffix is here.
  • This article gives the details of the performance measures for the five methods: SynDiffix, CTGAN, gretel.ai, mostly.ai, and tonic.ai.
  • This article presents a heatmap comparing the accuracy of SynDiffix, CTGAN, and mostly.ai for New York City taxi data.
  • This article presents measures of anonymity for the five methods, and describes corner cases where some of the methods fail to protect privacy.
  • This article describes how SynDiffix operates in more detail.

Project status

The open-source release of SynDiffix is meant primarily as a proof-of-concept implementation for testing purposes. It lacks features such as robust handling of different datetime formats, and it is not yet easy to use.

Despite the good performance of this release, we regard SynDiffix as a work in progress. We are still rapidly innovating. For instance, the prior release, from two months ago, had half the ML efficacy of mostly.ai. Now we are at parity and expect substantial improvements in the coming months.

Besides quality improvements, SynDiffix' ability to take multiple views of the data without eroding privacy makes it well-suited for safely adding new features.

An example of this is automatically capturing the structure of a text column while ensuring anonymity. One good reason for doing this is to make synthetic data more realistic as system/software testing data (without having to manually specify the structure, as is typical today). In principle, SynDiffix could run anonymized queries over text properties, like substrings, string lengths, and frequency of characters. In this way, SynDiffix could automatically discover structure like the part of the credit card that identifies the carrier, and represent that structure statistically (i.e. more Visa numbers than American Express).
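
As a rough illustration of what such a text-property query might look like, the sketch below counts leading substrings of a text column and releases only those shared by enough rows. It is a hypothetical simplification: the function name `common_prefixes`, the threshold `min_count`, and the made-up card-like numbers are assumptions, each row is assumed to belong to a different individual, and the noise that a real anonymized query would add is omitted for brevity.

```python
from collections import Counter


def common_prefixes(strings, prefix_len=1, min_count=5):
    """Count prefixes of a given length, suppressing rare ones."""
    counts = Counter(s[:prefix_len] for s in strings if len(s) >= prefix_len)
    # Suppression: a prefix seen in only a few rows could single someone out.
    return {prefix: n for prefix, n in counts.items() if n >= min_count}


# Made-up 16- and 15-digit strings, not real card numbers.
cards = ["4" + f"{i:015d}" for i in range(30)]    # prefix typical of Visa
cards += ["37" + f"{i:013d}" for i in range(8)]   # prefix typical of Amex
cards += ["9000000000000001"]                     # a single outlier

prefix_stats = common_prefixes(cards, prefix_len=1, min_count=5)
print(prefix_stats)
```

In this toy data, the two common leading digits are reported with their frequencies, while the prefix that appears only once is suppressed. A synthesizer could use such statistics to generate text whose structure resembles the original.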

Among the features that could be safely added to SynDiffix are:

  • More datatypes such as geographic and the full spectrum of datetime (currently we support numeric, text, and some datetime).
  • Hierarchical categories.
  • Auto-discovery of constraints between columns (e.g. start time is always less than end time).
  • Better "look and feel" of data (same numeric precision as in original data, good distribution of text lengths, character frequency, and so on).
  • Auto-discovery of structure in text.
  • Accurate reconstruction of time-series data, including time between events and constraints between events (follow-up visit always comes after initial visit).
  • Accurate statistics across the protected entity population (e.g. what fraction of individuals are responsible for 90% of violent crime).

In summary, SynDiffix is already competitive with existing products where the downstream use case is ML modeling. SynDiffix is already far superior where the downstream use case is descriptive analytics, and likely opens new business cases for descriptive analytics.

Beyond this, SynDiffix can serve as the basis for a wide range of anonymous data use cases.

Old

Compared to existing approaches to synthetic data, the anonymization principles behind SynDiffix are both fundamentally more sound and better integrated into the data synthesis architecture.

SynDiffix' better anonymization principles lead to two important advantages:

  1. For any given synthesis operation, SynDiffix is able to better push the boundaries of accuracy while remaining strongly anonymous. It is able to make fine-grained adjustments to precision and noise in different parts of the data to maximize accuracy.
  2. It safely allows multiple different synthesis operations over the same data. SynDiffix remains anonymous even with multiple synthesis operations over different combinations of columns or different, overlapping ranges of the data, or for that matter repeated instances of the same synthesis operation.

We cannot overemphasize this second point. The ability to safely get multiple different views of the data, for instance focusing on individual columns or pairs of columns on one hand, or taking a view of many columns together on the other, is key to good data analysis. SynDiffix' dramatic improvement in data quality combined with the ability to get multiple views enables an entire class of new use cases for synthetic data: those that require basic statistical functions like count, sum, average, median, standard deviation, correlation, and so on. SynDiffix' improvement in execution time opens the door to interactive data exploration applications.

SynDiffix' smarter anonymization lays the foundation for substantial improvements over version 1. Our current approach to producing synthetic data for ML applications is frankly simplistic. We have a long list of improvements in mind, and believe there is a good chance that we can go well beyond the ML efficacy of Generative Adversarial Network (GAN) learning approaches as well as continue to improve data quality and execution time.

In the following sections of Part 1, we look at the performance of SynDiffix, present the Anonymeter privacy risk scores, describe the key anonymization concepts, briefly discuss the inference attack on CTGAN and mostly.ai and why it doesn't work against SynDiffix, and end with a discussion of what kinds of improvements we can expect.

Note that the purpose of the measurements presented here is to make a rough comparison of SynDiffix with other state-of-the-art synthetic data methods. Any of the methods may or may not work well for a given use case.

Part 2 explores SynDiffix and Diffix anonymization mechanisms in more detail.
