SynDiffix Details
SynDiffix is built on the anonymization mechanisms of Diffix. To understand SynDiffix anonymization, it is useful to first understand Diffix anonymization. The Elm version of Diffix is documented here, including a thorough analysis of its anonymity. A variety of open source tools running Diffix Elm are released by the Open Diffix project.
Diffix is a query-based anonymization mechanism: users can submit any number of queries to Diffix, and the answers are anonymous even when the user attempts to combine multiple queries to try to break anonymity. Diffix uses a limited subset of SQL for its query language, and Elm is substantially more limited than prior versions of Diffix.
The target use case for Diffix is the release of aggregated data (also known as tabular data). If a user tries to release data pertaining to too few individuals, Diffix suppresses the output. The only way to get useful data from Diffix is to request aggregates that are large enough to avoid suppression. For instance, asking for exact salaries may lead to excessive suppression, whereas asking for salaries in aggregates of $10,000 avoids suppression except for the highest salaries.
To use Diffix effectively, a user needs to explore different-sized aggregates to find a good balance between precision and distortion (suppression and noise). This process works fine, but it can be tedious and time consuming.
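To make this tradeoff concrete, here is a minimal sketch (not Diffix itself; the suppression threshold and the data are invented for illustration) that buckets salaries at two different widths and drops any bucket containing too few distinct protected entities:

```python
# Minimal sketch (not Diffix itself): shows why small buckets get suppressed
# while $10,000 buckets mostly survive. The threshold value is illustrative.
from collections import defaultdict

def bucket_counts(rows, width, suppress_below=4):
    """rows: list of (entity_id, salary). Returns surviving buckets only."""
    entities_per_bucket = defaultdict(set)
    for entity_id, salary in rows:
        bucket = (salary // width) * width
        entities_per_bucket[bucket].add(entity_id)
    return {
        b: len(ents)
        for b, ents in entities_per_bucket.items()
        if len(ents) >= suppress_below  # buckets with too few entities are suppressed
    }

rows = [(i, 30000 + (i * 137) % 90000) for i in range(500)]
print(len(bucket_counts(rows, width=100)))    # fine buckets: almost all suppressed
print(len(bucket_counts(rows, width=10000)))  # coarse buckets: most survive
```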
SynDiffix relieves the user of this task by automating the exploration of aggregates, and then presenting the results as synthetic data (also known as microdata) instead of aggregates. Nevertheless, under the hood, SynDiffix is working with Diffix aggregates.
Here we give a brief overview of the set of anonymization mechanisms used by Diffix. We use the term protected entity to refer to the thing whose anonymity is being protected by Diffix. This is normally an individual person, but could be a mobile phone, a car, a household, or even a company.
A key characteristic of Diffix is that it recognizes which rows are associated with which protected entities (mostly.ai does this too). This association is a configuration parameter. With the exception of datasets that have one row per protected entity (e.g. census or other survey data), Diffix requires that the column identifying the protected entity be configured. Indeed, Diffix can protect multiple protected entities at once, for instance the sender and receiver of a financial transaction, doctor and patient, or user and device.
Aggregation: Diffix only releases data as aggregates, for instance counts, sums, or averages. (SynDiffix uses only row-count aggregates from Diffix, so in what follows we refer to aggregate values simply as counts.)
Suppression: Diffix monitors the number of distinct protected entities in each aggregate, and suppresses those with too few.
Noise: Diffix distorts counts. For instance, a count of 28 might be modified to 30.
Proportional noise: The amount of noise added to counts is based on how much any given protected entity contributes to the count. The noise is enough to hide the presence of any given protected entity (after flattening).
Sticky noise: Any given aggregate, which may appear in the outputs of multiple queries, always has the same noise. This prevents the noise from being averaged away when the aggregate appears in multiple different outputs.
Sticky layered noise: Indeed there are two sticky noise values added to a count. One is defined by the columns and the value range (e.g. salaries between $10K and $20K). The other is defined by the distinct set of protected entities that contribute to the aggregate.
Snapping: All aggregate ranges are forced into predetermined sizes and offsets along exponentially growing values (... 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, ...). This prevents attacks that infer individual data values by incrementally sliding range values.
Outlier flattening: The row count contributed by a small number of extreme contributors is reduced before proportional noise is computed. This is done per aggregate, not just once globally. Flattening avoids situations where the amount of noise itself would otherwise reveal the presence of extreme contributors. The amount of reduction is enough to make the contribution similar to less extreme heavy contributors. This is similar to top and bottom coding.
Noisy thresholds: All thresholds (for instance the suppression threshold) are themselves noisy (and sticky) rather than hard-coded.
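As a rough illustration of two of these mechanisms, the sketch below shows one way snapping to the 1-2-5 sequence and sticky layered noise could be implemented. It is not the Diffix Elm implementation; the standard deviation, salt, and hashing scheme are assumptions chosen for clarity:

```python
# Illustrative sketch of two of the mechanisms above: snapping to a
# 1-2-5 grid and sticky (seeded) layered noise. Not the Diffix Elm
# implementation; parameter choices here are assumptions.
import hashlib, math, random

def snap_size(raw_size):
    """Snap a range size up to the nearest value in ... 1, 2, 5, 10, 20, 50 ..."""
    exponent = math.floor(math.log10(raw_size))
    for mantissa in (1, 2, 5, 10):
        snapped = mantissa * 10 ** exponent
        if snapped >= raw_size:
            return snapped

def sticky_noise(bucket_label, entity_ids, sd=2.0, secret=b"per-dataset-salt"):
    """Two noise layers: one seeded by the bucket definition, one by the
    distinct set of protected entities contributing to it."""
    def layer(seed_material):
        digest = hashlib.sha256(secret + seed_material).digest()
        return random.Random(digest).gauss(0, sd)
    bucket_layer = layer(bucket_label.encode())
    entity_layer = layer(",".join(sorted(map(str, set(entity_ids)))).encode())
    return bucket_layer + entity_layer

print(snap_size(13))                                   # -> 20
print(sticky_noise("salary[10000,20000)", [1, 2, 3]))  # same inputs -> same noise
```

Because the noise is derived deterministically from the bucket definition and the contributing entity set, repeating the same aggregate in many outputs always yields the same noisy count, so the noise cannot be averaged away.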
This article contains a simple and intuitive description of the most important of these mechanisms.
These mechanisms are deployed with the goal of minimizing the amount of distortion while still providing strong anonymity protection. This paper describes in detail how we determine how much distortion is enough. The anonymity criteria used in this analysis are the same as those defined by the EU Article 29 Data Protection Working Party Opinion on Anonymization Techniques: singling out, linkability, and inference.
The average suppression threshold can be as low as four protected entities while still providing strong anonymity.
The (normally distributed) noise for aggregates with one protected entity per row can have a standard deviation as low as two while still providing strong anonymity.
In other words, Diffix can drill quite deeply into the data while still providing strong anonymity.
Diffix itself is quite mature: over six years it has gone through multiple iterations, been used in demanding commercial settings, been evaluated as GDPR compliant by Data Protection Officers (DPOs) and one Data Protection Authority (DPA), and undergone rigorous internal privacy analyses as well as two bounty programs.
The goals of SynDiffix are to exploit the anonymization properties of Diffix as much as possible while
- being much easier to use than Diffix for descriptive analytics, and
- being useful for ML use cases.
A core design concept of SynDiffix is that, once the data is anonymized with Diffix mechanisms, then any operation over the anonymized data is safe.
At the core of SynDiffix are two main operations:
- Automate the process of finding good aggregates so that the user doesn't have to do it. The discovered aggregates strike a good balance between precision and distortion, and are more precise where data is dense. This is done both for individual columns and combinations of columns. Diffix anonymization mechanisms are applied to these aggregates (suppression is enforced, sticky noise is added, and so on).
- Generate microdata from the aggregates by assigning values within the range defined by the aggregates.
The first operation doesn't scale well with the number of columns. When using SynDiffix for descriptive analytics, this is not an issue, because descriptive analytics is normally limited to a relatively small number of columns. For ML use cases, or any use case where the user wants to create synthetic data with many columns, scaling is an issue.
To deal with this, the find-aggregates-generate-microdata process described above is embedded in a higher-level process that partitions a high-dimension (many columns) table into multiple lower-dimension sub-tables. It finds aggregates and generates microdata for each of the sub-tables, and then stitches the microdata tables together to create the complete synthetic data table.
For individual columns (one dimension), good aggregates are discovered by building a binary tree. At each branch, the range of data is cut in half. Note that text values are converted to integers, and datetime values are converted to seconds since some past date. (These are converted back when microdata is assigned.)
The range of the root node in the tree is set to the smallest snapped power-of-two range that encompasses the range of data for the column after outliers have been flattened. This prevents inference of upper and lower values in the data by reverse engineering the nodes of the tree.
The tree is truncated when either the count (number of rows) in a node falls below a threshold, or a node would be suppressed according to Diffix anonymity rules. Note that these binary trees are not balanced: they are deeper and more precise where data is denser. This allows us to maximize precision while adhering to Diffix anonymization rules. This is illustrated in the following figure.
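The following much-simplified sketch illustrates the idea. It is not the SynDiffix implementation: the fixed threshold stands in for Diffix's noisy suppression rules, and the singularity and minimum-width checks are simplifications.

```python
# A much-simplified sketch of 1-dimensional tree building: recursively halve
# the (snapped) root range, keeping a node only if enough distinct protected
# entities fall in it. The threshold logic is a stand-in for the real
# Diffix suppression rules and noisy thresholds.
def build_tree(rows, lo, hi, min_entities=4):
    """rows: list of (entity_id, value) with lo <= value < hi."""
    if len({e for e, _ in rows}) < min_entities:
        return None  # too few distinct protected entities: node is suppressed
    node = {"range": (lo, hi), "count": len(rows), "children": []}
    if hi - lo <= 1 or len({v for _, v in rows}) == 1:
        return node  # minimal range or a single ("singularity") value: stop splitting
    mid = (lo + hi) / 2
    left = [(e, v) for e, v in rows if v < mid]
    right = [(e, v) for e, v in rows if v >= mid]
    for child_rows, child_lo, child_hi in ((left, lo, mid), (right, mid, hi)):
        child = build_tree(child_rows, child_lo, child_hi, min_entities)
        if child is not None:
            node["children"].append(child)
    return node

# Denser regions of the data produce deeper subtrees, i.e. finer ranges.
rows = [(i, v) for i, v in enumerate([21, 21, 22, 23, 23, 24, 25, 40, 55, 60])]
tree = build_tree(rows, lo=0, hi=64)  # root snapped to a power-of-two range
```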
Because the offset and range of the root node is snapped to a power of two, and all subsequent nodes in the tree are halves of the parent node, any given data point appears in only a relatively small number of distinct offset/range bins. The noise associated with each such bin is sticky (always the same), and each such bin is always populated with at least a few other protected entities. As a result, no matter how many syntheses a given data point participates in, its real exposure is limited to a small number of bins.
For example, consider a single cell in an `age` column with the value `age=21`. Because of snapping, the cell can appear in at most 8 bins. For any given synthesis operation, the bin selected depends on how dense the data is. For instance, when combined with `marital_status='divorced'`, the `age=21` cell may appear in a larger bin than when combined with `marital_status='single'`.
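The sketch below illustrates why the number of bins is so small: with a snapped power-of-two root range and halving at every level, a value like 21 only ever falls into the bins on one root-to-leaf path. The root range of 0-128 and the stopping width of 1 are assumptions for illustration:

```python
# Sketch of the point above: with a snapped power-of-two root range and
# halving at each level, a given value can only ever fall into a small,
# fixed set of bins. Root range and stopping width here are illustrative.
def bins_containing(value, lo=0, hi=128):
    """Return every (lo, hi) bin on the path from the root to the value."""
    bins = []
    while hi - lo >= 1:
        bins.append((lo, hi))
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if value < mid else (mid, hi)
    return bins

print(bins_containing(21))
# [(0, 128), (0, 64.0), (0, 32.0), (16.0, 32.0), (16.0, 24.0),
#  (20.0, 24.0), (20.0, 22.0), (21.0, 22.0)]  -- 8 bins in total
```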
This use of snapping, sticky noise, and truncation is analogous to the query budget in Differential Privacy. Both mechanisms serve to limit the exposure of data.
To discover good aggregates for a pair of columns (two dimensions), a 4-branch tree is built. At each branch, the range of each column is cut in half. The same truncation rules are applied to the 2-dimension tree. As a result, the tree is not as deep as the corresponding 1-dimensional trees, and is therefore less precise.
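A hypothetical sketch of a single 4-way split is shown below; the function name and row layout are illustrative, and truncation is omitted:

```python
# Sketch of one 4-way split of a 2-dimensional node: each column's range is
# halved, giving four child ranges. Truncation would apply the same
# suppression rules as in the 1-dimensional case.
def split_2d(rows, r1, r2):
    """rows: list of (entity_id, v1, v2); r1, r2: (lo, hi) ranges."""
    m1 = sum(r1) / 2
    m2 = sum(r2) / 2
    children = {
        ((r1[0], m1), (r2[0], m2)): [],
        ((r1[0], m1), (m2, r2[1])): [],
        ((m1, r1[1]), (r2[0], m2)): [],
        ((m1, r1[1]), (m2, r2[1])): [],
    }
    for e, v1, v2 in rows:
        c1 = (r1[0], m1) if v1 < m1 else (m1, r1[1])
        c2 = (r2[0], m2) if v2 < m2 else (m2, r2[1])
        children[(c1, c2)].append((e, v1, v2))
    return children

rows = [(0, 1.0, 25.0), (1, 3.0, 30.0), (2, 0.5, 26.0)]
quadrants = split_2d(rows, r1=(0, 4), r2=(24, 32))
```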
In order to assign values proportional to the 2-dimensional aggregates, but with better precision than the 2-dimensional aggregate allows by itself, we use information from both the 2-dimensional tree and the corresponding 1-dimensional trees. This is illustrated in the example below.
This figure shows portions of a 2-dimensional tree for columns C1 and C2, and the corresponding 1-dimensional trees for C1 and C2. We know from the 2-dimensional node on the left that we want to assign 42 rows (noisy count) to C1 range 0-4, and C2 range 24-32. In the absence of information from the 1-dimensional trees, one could simply assign values uniformly across the two ranges. However, from the C1 tree, we can see that values are assigned to range 0-2 with higher probability (76%) than to range 2-4 (24%). Likewise we learn from the C2 tree that assignments to C2 range 24-28 are more likely than to 28-32. We can therefore use the more precise 1-dimensional information to refine the 2-dimensional node into four 2-dimensional nodes with better precision.
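The refinement arithmetic from this example can be written out directly; the code below simply multiplies the node's noisy count by the 1-dimensional proportions quoted in the text (76%/24% for C1, 77%/23% for C2):

```python
# Worked version of the refinement step above: the 42-row 2-dimensional node
# is split into four finer nodes in proportion to the 1-dimensional splits.
node_count = 42
c1_split = {(0, 2): 0.76, (2, 4): 0.24}      # from the 1-dimensional C1 tree
c2_split = {(24, 28): 0.77, (28, 32): 0.23}  # from the 1-dimensional C2 tree

refined = {
    (r1, r2): node_count * p1 * p2
    for r1, p1 in c1_split.items()
    for r2, p2 in c2_split.items()
}
for (r1, r2), count in refined.items():
    print(r1, r2, round(count, 1))
# (0, 2) (24, 28) 24.6    (0, 2) (28, 32) 7.3
# (2, 4) (24, 28) 7.8     (2, 4) (28, 32) 2.3
```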
This same concept applies at higher dimensions. 2-dimension trees can refine 3-dimension trees, and so on.
Because each individual tree is anonymous, the process of refinement does not weaken anonymity. Note also that the refinement process introduces additional error. It may not be, for instance, that the 77%/23% ratio for C2 ranges 24-28/28-32 holds perfectly at the two 2-dimensional nodes shown.
This added error leads to poorer accuracy with each additional dimension. Furthermore, high-dimension trees, with refinement from each lower-dimension tree, have an exponential scaling problem.
Because of the reduced accuracy and increased computational cost of high-dimension trees, we need to partition a high-dimension table into multiple smaller-dimension sub-tables, assign microdata, and then stitch the resulting sub-tables back together.
One limitation of this partitioning is that, when two columns are placed in separate sub-tables, we lose information about how they relate to each other. If the two columns are completely independent, there is no information to lose. The higher the dependence, however, the more information is lost.
We are still experimenting with how to best determine sub-tables. Our current approach differs on whether we are building one-size-fits-all synthetic data, or synthetic data for an ML use case that targets a specific column.
For the former case, we measure pair-wise dependence by first building every 1-dimension and 2-dimension tree. We then run a chi-square independence test on the nodes in the trees (a sketch follows the list below). In assigning sub-tables, we take several, sometimes conflicting, criteria into account:
- Sub-tables should not have too many columns
- High-dependence column pairs should be in the same sub-table
- Unless there is almost no dependence between columns in two different sub-tables, there should be some column overlap (two sub-tables should have one or more shared columns). This preserves some of the correlation between sub-tables.
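As referenced above, here is a hypothetical sketch of the pair-wise dependence measurement. The contingency counts are invented, and the real implementation may aggregate tree nodes differently:

```python
# Hypothetical sketch of the pair-wise dependence measurement: the counts in
# a 2-dimensional tree's nodes form a contingency table, and a chi-square
# independence test over that table scores how dependent the two columns are.
import numpy as np
from scipy.stats import chi2_contingency

# Noisy counts from (say) the four children of a 2-dimensional root node,
# arranged as rows = C1 halves, columns = C2 halves (values are made up).
contingency = np.array([[120, 35],
                        [ 40, 95]])

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(chi2, p_value)  # large chi2 / tiny p-value -> high dependence,
                      # so C1 and C2 should land in the same sub-table
```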
For the latter case, we are using Recursive feature elimination with cross-validation to select the K most important features for the target column. We then group the features (columns) into sub-tables of appropriate size, with each sub-table containing the target column. The target column is used to stitch the sub-tables. Any feature that is not important is either removed or randomly stitched into the table.
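As an illustration (not the SynDiffix code), the feature selection step might look roughly like the following with scikit-learn's RFECV. The file name, target column, and choice of estimator are assumptions, and the feature columns are assumed to be numeric:

```python
# Illustrative sketch of selecting the most important features for a target
# column with scikit-learn's RFECV (recursive feature elimination with CV).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

df = pd.read_csv("original.csv")  # hypothetical input table (numeric features)
target = "churned"                # hypothetical ML target column

X = df.drop(columns=[target])
y = df[target]

selector = RFECV(RandomForestClassifier(n_estimators=100), step=1, cv=5)
selector.fit(X, y)

important = list(X.columns[selector.support_])
# These features would then be grouped into sub-tables, each sub-table also
# containing the target column; unimportant columns are dropped or stitched
# in randomly.
print(important)
```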
After the refinement process, an N-dimensional tree will have a set of nodes, each with a range or single value associated with each of the node's dimensions, and with an associated count.
For each node, we create a number of rows equal to the count. The value of each column in the row is selected randomly if a range, or assigned as the single value otherwise.
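A minimal sketch of this assignment step, assuming nodes are represented as dictionaries of per-column ranges or single values:

```python
# Minimal sketch of microdata generation from refined nodes: each node
# contributes 'count' rows, with values drawn uniformly within a range or
# copied directly when the node holds a single value.
import random

def generate_microdata(nodes, columns):
    """nodes: list of dicts like
       {"count": 25, "ranges": {"age": (20, 24), "status": "single"}}"""
    rows = []
    for node in nodes:
        for _ in range(round(node["count"])):
            row = {}
            for col in columns:
                spec = node["ranges"][col]
                if isinstance(spec, tuple):   # a range: pick a random value in it
                    row[col] = random.uniform(*spec)
                else:                         # a single value: use it directly
                    row[col] = spec
            rows.append(row)
    return rows

nodes = [{"count": 25, "ranges": {"age": (20, 24), "status": "single"}},
         {"count": 7,  "ranges": {"age": (24, 32), "status": "divorced"}}]
print(len(generate_microdata(nodes, ["age", "status"])))  # 32 rows
```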
After microdata assignment, there will be one or more sub-tables each with one or more columns.
If two sub-tables share one or more columns, then the sub-tables are stitched by pairing a row in one sub-table with the row in the other sub-table whose values for the shared columns are most similar. If the shared-column values don't exactly match, then values may be chosen from either sub-table. The paired rows are concatenated.
If a sub-table has no common columns with other sub-tables, then its rows are randomly paired with rows from another sub-table and concatenated.
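The sketch below shows a simplified greedy version of this stitching for a single, numeric shared column; the real implementation may pair rows differently:

```python
# Simplified sketch of stitching two sub-tables that share a column: each
# row in the left sub-table is paired with the not-yet-used right row whose
# shared-column value is closest, and the two rows are concatenated.
def stitch(left_rows, right_rows, shared_col):
    remaining = list(right_rows)
    stitched = []
    for lrow in left_rows:
        # pick the right-hand row with the most similar shared-column value
        best = min(remaining, key=lambda r: abs(r[shared_col] - lrow[shared_col]))
        remaining.remove(best)
        merged = {**lrow, **best}   # on a mismatch, the right value "wins" here;
        stitched.append(merged)     # either side's value could be kept instead
    return stitched

left = [{"age": 21, "salary": 30000}, {"age": 45, "salary": 80000}]
right = [{"age": 44, "status": "married"}, {"age": 22, "status": "single"}]
print(stitch(left, right, "age"))
```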