
SynDiffix Privacy


SynDiffix anonymity

As described here, SynDiffix builds on the anonymizing mechanisms of Diffix. Unique values, and unique combinations of values, are hidden through aggregation and suppression; outliers are suppressed; noise is added to prevent users with deep knowledge of parts of the data from making inferences; and these properties persist even across multiple synthesis operations.

Diffix itself is quite mature: over six years it has gone through multiple iterations, been used in demanding commercial settings, been evaluated by Data Protection Officers (DPOs) and one Data Protection Authority (DPA) as GDPR compliant, and undergone rigorous internal privacy analyses as well as two bounty programs.

There have been ongoing thorough analyses of Diffix vulnerabilities, including the two bounty programs, and at this time no effective attacks are known. A complete description and privacy analysis of the Elm version of Diffix is documented in this ArXiv paper. The anonymity criteria used in this analysis are the same as those defined by the EU Article 29 Data Protection Working Party Opinion on Anonymization Techniques: singling-out, linkability, and inference.

GAN synthesis anonymity

Whereas the core anonymization mechanisms of SynDiffix are aggregation, suppression, and noise, the core anonymization mechanism of GAN synthesis is the GAN learning process itself. The key elements are the sampling of individual column values and the random assignment of combinations of column values, with iterative learning and overfitting avoidance.

Note in particular that individual sampled values are released in the synthetic data. There is a danger of privacy loss when the sampled values are unique to individuals: releasing such values violates the letter of the EU Article 29 singling-out criterion, and if the values are PII (e.g. email addresses or credit card numbers), then privacy is violated in practice.

If a column contains continuous values, then the values are protected, because GAN synthesis fits curves to the value distributions and samples from the curves rather than from the actual values. If, on the other hand, a column contains categorical or text values, then the actual values are sampled and released in the output.
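To illustrate the distinction, here is a minimal sketch of the two sampling behaviors (a generic illustration, not any product's actual code): continuous values are drawn from a fitted model, while categorical values can only be drawn from the set of actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous column: fit a simple parametric model (here a normal
# distribution) and sample from the model, not from the raw values.
salaries = np.array([52_000, 61_500, 58_200, 75_000, 49_900], dtype=float)
synthetic_salaries = rng.normal(salaries.mean(), salaries.std(), size=5)
# The synthetic values almost surely do not appear in the original data.

# Categorical column: sampling can only pick from the observed values,
# so every synthetic value is a verbatim copy of some original value.
emails = ["alice@example.com", "bob@example.com", "carol@example.com"]
synthetic_emails = rng.choice(emails, size=5)
# If a value (such as an email address) is unique to one individual,
# releasing it verbatim singles that individual out.
```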

To deal with this, GAN synthesis products require additional mechanisms to remove or mask PII. Datacebo (CTGAN), gretel.ai, and tonic.ai provide tools to transform or mask PII or other sensitive data columns. So long as these tools are used correctly, these products are strongly anonymous.
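The masking step itself can be as simple as consistently replacing each distinct PII value with a placeholder before synthesis. Below is a minimal sketch of the idea; it is a generic illustration, not the actual tooling of any of these products, and the column names are hypothetical.

```python
import pandas as pd

def mask_pii(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace each distinct value in `column` with a stable placeholder."""
    placeholders = {v: f"user_{i}" for i, v in enumerate(df[column].unique())}
    out = df.copy()
    out[column] = df[column].map(placeholders)
    return out

df = pd.DataFrame({"email": ["alice@x.com", "bob@y.com", "alice@x.com"],
                   "age": [34, 51, 34]})
print(mask_pii(df, "email"))  # the email column becomes user_0, user_1, user_0
```

Note that this replaces values consistently (pseudonymization); whether masking alone suffices depends on what the remaining columns reveal.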

Mostly.ai integrates these protections more deeply into its product. Like SynDiffix, it has the notion of a protected entity (see this article). It is able to detect when values unique to individual protected entities are being released, and masks those values. This makes mostly.ai less susceptible to accidental PII release. Also like SynDiffix, it requires that the column containing the identifiers for protected entities be properly configured. If it is not, then PII is not protected.

Other GAN synthesis weaknesses (rare)

PII leakage is not the only limitation of GAN synthesis that requires additional protective mechanisms. In extremely rare cases, it is theoretically possible to make inferences about individuals in event datasets such as time-series data. Because these cases are rare, we don't believe they constitute a meaningful risk. Nevertheless, we mention them here to emphasize that SynDiffix anonymization is fundamentally more sound than GAN synthesis.

Because of sampling, if a unique value appears in the synthetic data, one cannot tell whether the value is unique in the original data, because other instances of the value may simply not have been sampled. Likewise, because combinations of values are assigned randomly, one cannot tell whether a unique combination of values appears in the original data or not.

A problem arises with event datasets when a protected entity has an extreme number of events (i.e. rows in the dataset): such protected entities are over-sampled relative to other protected entities and have excessive influence on iterative learning, thereby bypassing overfitting avoidance.

Put another way, GAN synthesis alone cannot tell whether a repeated value (e.g. 1000 instances of "title=CEO") or combination of values (e.g. 1000 instances of "zip=14939" and "occupation=florist") belongs to 1000 different individuals or to a single individual.

Consider a case where an attacker knows the statistical distribution of a certain attribute (e.g. education level). Without additional protections, an event outlier (a protected entity with an extreme number of rows) would artificially raise the count for its education level. The education level of the event outlier could then be inferred as being the attribute value that most exceeds the expected value.
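To make the attack concrete, here is a sketch of the inference logic. The file and column names are hypothetical, and the expected distribution stands in for the attacker's background knowledge.

```python
import pandas as pd

# Background knowledge: the expected share of each education level,
# plus the fact that one protected entity is an event outlier.
expected_share = {"highschool": 0.50, "bachelor": 0.35, "phd": 0.15}

synthetic = pd.read_csv("synthetic_events.csv")  # hypothetical file
observed = synthetic["education"].value_counts(normalize=True)

# Infer the outlier's education level as the value whose observed
# share most exceeds its expected share.
excess = {level: observed.get(level, 0.0) - share
          for level, share in expected_share.items()}
print("Inferred education level:", max(excess, key=excess.get))
```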

Mostly.ai protects against this with a feature they call Extreme Sequence Length Protection. This feature identifies event outliers as a preprocessing step and removes them from the dataset.
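We don't know mostly.ai's actual implementation, but the general shape of such a preprocessing step can be sketched as a global filter on per-entity row counts (the threshold heuristic below is our assumption, purely for illustration):

```python
import pandas as pd

def drop_event_outliers(df: pd.DataFrame, entity_col: str,
                        factor: float = 10.0) -> pd.DataFrame:
    """Drop protected entities whose row count is extreme relative to
    the whole dataset. Illustrative heuristic only."""
    counts = df.groupby(entity_col).size()
    keep = counts[counts <= factor * counts.median()].index
    return df[df[entity_col].isin(keep)]
```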

The other GAN synthesis products in this article do not identify protected entities, and therefore cannot identify event outliers. We demonstrated this attack on CTGAN using simulated datasets, and assume that it would also work on the other GAN synthesis products besides mostly.ai.

Because mostly.ai checks for this globally, it can miss cases where a protected entity does not have an extreme number of rows relative to the full dataset, but does have an extreme number of rows relative to a portion of the dataset. Imagine, for instance, a medical dataset covering multiple hospitals, where a protected entity has an extreme number of rows relative to a given hospital, but not relative to protected entities in other hospitals. One might be able to infer an attribute of that protected entity by looking at the synthetic data for that hospital. We could reliably run this attack against mostly.ai using simulated datasets.
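The gap is that the filter sketched above is global. A group-aware variant (sketched below, with a hypothetical hospital column) would also catch entities that are extreme only within their own portion of the data:

```python
import pandas as pd

def drop_event_outliers_per_group(df: pd.DataFrame, entity_col: str,
                                  group_col: str,
                                  factor: float = 10.0) -> pd.DataFrame:
    """Apply the outlier filter within each group (e.g. each hospital),
    so an entity that is extreme only within its own group is still
    caught. Illustrative only."""
    parts = []
    for _, group in df.groupby(group_col):
        counts = group.groupby(entity_col).size()
        keep = counts[counts <= factor * counts.median()].index
        parts.append(group[group[entity_col].isin(keep)])
    return pd.concat(parts, ignore_index=True)
```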

SynDiffix has an analogous protection mechanism, flattening. Flattening, however, is applied to every bin generated by SynDiffix. As a result, a correctly configured SynDiffix is not vulnerable to any form of this attack.
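As a rough sketch of the flattening idea (simplified from the description in the Diffix Elm paper; in particular, the real mechanism chooses the outlier and top group sizes noisily, which is omitted here):

```python
def flatten_bin_count(rows_per_entity: list[int],
                      outlier_count: int = 2,
                      top_count: int = 3) -> int:
    """Flatten one bin's count: the heaviest `outlier_count` entities
    have their contributions reduced to the average contribution of
    the next `top_count` entities. Simplified sketch only."""
    contrib = sorted(rows_per_entity, reverse=True)
    top = contrib[outlier_count:outlier_count + top_count]
    top_avg = sum(top) / len(top) if top else 0.0
    flattened = [top_avg] * min(outlier_count, len(contrib)) + contrib[outlier_count:]
    return round(sum(flattened))

# An entity with 1000 rows no longer dominates the bin:
print(flatten_bin_count([1000, 12, 10, 9, 8]))  # 45 instead of 1039
```

Because this is applied per bin, an entity that is extreme only within one bin (e.g. one hospital) is flattened there, even if it is unremarkable globally.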

We wish to stress that this does not mean that GAN synthesis is not anonymous, practically speaking. The probability that a user would run this attack for malicious purposes is exceedingly small. Such an attack would require:

  1. That the dataset has the condition (rare).
  2. That a user would know that the condition exists and the identity of the associated data subject (rare, requires knowledge of both the data subject's number of rows and the fact that all other data subjects do not have a large number of rows).
  3. That, in spite of having this knowledge, the user would nevertheless not already know the attribute that can be discovered because of the condition.
  4. That the user would have an interest in learning the attribute.
  5. That the user would know that the attribute can be discovered with this method.

That all these conditions would hold even once seems remote, much less often. The point we want to make here is not that GAN-based approaches have weak anonymity from a practical perspective, but rather that integrating strong anonymization mechanisms throughout the data-modeling process is a sounder approach to synthetic data.

Anonymeter Privacy Risk Scores

When used correctly, SynDiffix and all of the GAN synthesis products have very strong anonymity.

In support of this statement, we use the Anonymeter tool to measure privacy. Anonymeter works by running attacks against the synthesized data and reporting how effective they are relative to a statistical baseline. We ran singling-out and inference attacks, which are two of the three GDPR criteria for anonymity (the third being linkability, which doesn't apply well to this setting).

The Anonymeter privacy risk score measures the attack's precision improvement over that of a statistical baseline, normalized by the maximum possible improvement. For example, suppose that the attack is trying to infer sex, and the dataset is 50% male and 50% female. If the attack has 70% precision, then the risk is (0.70 - 0.50) / (1 - 0.50) = 0.4 over the baseline of 50%.
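A one-line sketch of this calculation (our reading of Anonymeter's score, not code from the tool itself):

```python
def privacy_risk(attack_rate: float, baseline_rate: float) -> float:
    """Precision improvement over the baseline, normalized by the
    maximum possible improvement (1 - baseline)."""
    return (attack_rate - baseline_rate) / (1.0 - baseline_rate)

print(round(privacy_risk(0.70, 0.50), 2))  # 0.4, as in the example above
```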

Anonymeter does not test for every possible attack, but it covers the common set of attacks in which the attacker has information about a target individual in the dataset, as well as potentially other knowledge about the data, and tries to infer information about the target or determine whether the target is in the dataset.
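For reference, running such an attack with Anonymeter looks roughly like the following (based on our reading of the Anonymeter README; the file names and the secret column are hypothetical, and `control` is a set of original records held out from synthesis):

```python
import pandas as pd
from anonymeter.evaluators import InferenceEvaluator

ori = pd.read_csv("original.csv")        # records used for synthesis
syn = pd.read_csv("synthetic.csv")       # synthesized records
control = pd.read_csv("control.csv")     # held-out original records

# The attacker knows every column except the one being inferred.
secret = "education"
aux_cols = [c for c in ori.columns if c != secret]

evaluator = InferenceEvaluator(ori=ori, syn=syn, control=control,
                               aux_cols=aux_cols, secret=secret,
                               n_attacks=1000)
evaluator.evaluate(n_jobs=-2)
print(evaluator.risk())  # risk estimate with confidence interval
```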

When analyzing Diffix privacy, we used a similar precision improvement measure, and argued that any privacy risk below around 0.5 demonstrates acceptable anonymity: at 0.5, the target of the attack still has substantial plausible deniability. Privacy risk below 0.25 or so is very good.

The following are boxplots of privacy risks for SynDiffix, the four GAN synthesis products, and with no anonymization for contrast. The boxplots represent a mix of inference and singling-out attacks, with over 100 different attacks for each method. For inference attacks, we assume that the attacker knows everything about the target except the column being inferred. Anonymeter computes confidence bounds, and we excluded those with confidence bounds larger than 0.2 because they are not reliable measures. (These usually occurred in inference attacks on columns where a very large majority of rows have a single value.)

[Figure: boxplots of Anonymeter privacy risk scores for SynDiffix, the four GAN synthesis products, and no anonymization]

As expected, all methods demonstrate very low privacy risk. SynDiffix has no attacks with a privacy risk above 0.5, and only three with a privacy risk above 0.2. CTGAN, gretel.ai, and tonic.ai have only one each above 0.5, and mostly.ai has no attacks above 0.2.
