This repository contains the implementation, documentation and results of the research project "Slice it up: Unmasking User Identities in Smartwatch Health Data". Building on the findings of the project "Privacy at Risk: Exploiting Similarities in Health Data for Identity Inference", a novel and modular attack framework is presented, which includes four dynamic time warping based re-identification attacks.
- Abstract
- Installation
- Attack Framework
- Data Sets
- Signal Preprocessing
- Complexity Reduction
- Data Model
- DTW Attacks
- Evaluation Pipeline
- Result Overview
- In-Out Threshold Attacks
- Mitigating DTW-Attacks
- Multiple Distances
- Ethical Principles
- License
Wearables are widely used for health data collection due to their availability and advanced sensors, enabling smart health applications like stress detection. However, the sensitivity of personal health data raises significant privacy concerns. While user de-identification by removing direct identifiers such as names and addresses is commonly employed to protect privacy, the data itself can still be exploited to re-identify individuals. We introduce a novel framework for similarity-based Dynamic Time Warping (DTW) re-identification attacks on time series health data. Using the WESAD dataset and two larger synthetic datasets, we demonstrate that even short segments of sensor data can achieve perfect re-identification with our Slicing-DTW-Attack. Our attack is independent of training data and computes similarity rankings in about 2 minutes for 10,000 subjects on a single CPU core. These findings highlight that de-identification alone is insufficient to protect privacy. As a defense, we show that adding random noise to the signals significantly reduces re-identification risk while only moderately affecting usability in stress detection tasks, offering a promising approach to balancing privacy and utility.
Python 3.9 was used to perform the experiments. To install the required dependencies, run the following command:
pip install -r requirements.txt
The following figure shows an overview of the eight-stage attack framework. The functionality and the parameters to be specified for the individual stages are briefly described in the following sections. All further information is explained in detail in corresponding paper.
This project is based on a total of three data sets. The multimodal WESAD data set contains smartwatch health time series data from a total of 15 subjects. The sensors blood volume pulse (BVP), elecrodermal activity (EDA), skin temperature (TEMP) and acceleration (ACC) are covered. At the same time, two synthetic data sets generated by Generative Adversarial Networks (GAN), each with 1000 subjects, are integrated into the project to test the scalability of DTW attacks.
- WESAD data set: Please download the WESAD data set and save it unzipped under experiments/dataset/WESAD to carry out the attacks. Further pre-processing of the data record is performed automatically during the first execution. This can take a few minutes.
- Conditional GAN (cGAN): The cGAN data set was created with the implementation of the linked project. Using the WESAD data set as a basis, 1000 synthetic subjects were created. The created cGAN data set can be found zipped under /dataset/WESAD_cGAN (lfs).
- DoppelGANger (DGAN): The DGAN data set was created with the implementation of the linked project. Using the WESAD data set as a basis, 1000 synthetic subjects were created. The created DGAN data set can be found zipped under /dataset/WESAD_dGAN (lfs).
dataset = Wesad(dataset_size=15)
dataset = WesadCGan(dataset_size=1000, resample_factor=1000)
dataset = WesadDGan(dataset_size=1000, resample_factor=1000)
dataset = WesadPrivate(dataset_size=15, noise_multiplier=1.0)
Parameters:
- dataset_size: Specify the number of subjects in the data set. The Wesad and WesadPrivate data set can contain a maximum of 15 subjects. The WesadCGan and WeasdDGan dataset can contain a maximum of 1000 subjects.
- noise_multiplier: A noise_multiplier must also be specified for the WesadPrivate data set. This corresponds to the scale parameter of the Laplace distribution and specifies the level of random noise to be added to the data points of the signals.
The three-stage signal preprocessing pipeline processes the selected data set during the first execution and saves it as a pickle file. No parameters need to be specified for the execution. Note that for further data sets to be included, the pipeline in the corresponding data set class must be adapted to the conditions of the new data set.
To reduce the runtime of DTW attacks, three complexity-reducing methods are integrated into the framework:
With the downsampling method, the number of data points in the signal is reduced using a resampling method based on a fast fourier transform. A downsampling factor (DSF) is specified for this purpose, which reflects the degree of reduction.
resample_factor = 1 # no downsampling
resample_factor = 10 # 10 data points in the original signal are represented by one data point in the downsampled signal
resample_factor = 100 # 100 data points in the original signal are represented by one data point in the downsampled signal
With the DBA method, the number of sensor signals is reduced by forming an average of all the signals contained.
data_processing = DbaProcessing() # application of DBA
data_processing = StandardProcessing() # standard sensor ranking without application of DBA or PCA
With the PCA method, the first principal component for a subject's sensor signals is calculated. Only the first principal component is used as a reduced signal.
data_processing = PcaProcessing() # application of PCA
data_processing = StandardProcessing() # standard sensor ranking without application of DBA or PCA
The data model specifies the purpose for which the DTW attack is to be executed.
Parameters:
- additional_windows: The attack set is cut out of the corresponding data set for the experiments. Specify how many additional windows (data points) are to be cut from the data set at the edges of the attack set in order to avoid an alignment at the edges in the experiments.
The external mode is used in this project to carry out the runtime experiments. This uses an attack set (sample of signal data of the person to be attacked) and the data set to calculate DTW distances between the attack set and the persons in the data set. After ranking, it is determined which person in the data set most likely corresponds to the attacked target.
dataset = Wesad(dataset_size=15)
resample_factor = 1000
data_processing = StandardProcessing()
additional_windows = 1000 # standard value; will be automatically scaled with resample_factor -> additional_windows / resample_factor
simulate_isolated_dtw_attack(dataset=dataset, resample_factor=resample_factor, data_processing=data_processing, additional_windows=additional_windows) # is performed for all four attacks.
The simulation mode is used for all further experiments. Here, each subject in the data set is selected once as a target and a re-identification attack is carried out. The results are then averaged across all subjects.
dtw_attack = SlicingDTWAttack()
dataset = Wesad(dataset_size=15)
data_processing = StandardProcessing()
test_window_sizes = [i for i in range(1, 37)] # specify for which attack window sizes the DTW attack should be performed
resample_factor = 1000
additional_windows = 1000
multi = 3 # only needed for MultiDTWAttack() and MultiSlicingDTWAttack()
subject_ids = None # use all subjects; else e.g. [2, 3, 4, 5] -> use only subjects in list as target
run_dtw_attack(dtw_attack=dtw_attack, dataset=dataset, data_processing=data_processing,
test_window_sizes=test_window_sizes, resample_factor=resample_factor, multi=3, subject_ids=subject_ids)
The project comprises four similarity-based re-identification attacks that use dynamic time warping as the basis for calculating distances between the signals of two people.
The Single-DTW-Attack is the simplest DTW attack. This calculates distance values for all signals of the target person and all corresponding signals of the persons in the data set. In general, the person in the data set with the smallest distance to the target corresponds to the target.
dtw_attack = SingleDTWAttack()
The Multi-DTW-Attack builds on this. The only difference here is that the signals of the attack set are first divided into multi attack sets of the same length. The attacks are then carried out for each multi attack set.
multi = 3 # Specify number of multi attack sets
dtw_attack = MultiDTWAttack()
With the Slicing-DTW-Attack, the signals of the data set are divided into overlapping slices instead of the attack set. The attacks are then carried out by calculating the distances between the attack set and each slice.
dtw_attack = SlicingDTWAttack()
The Multi-Slicing-DTW-Attack corresponds to a combination of the Multi-DTW-Attack and the Slicing-DTW-Attack.
multi = 3 # Specify number of multi attack sets
dtw_attack = MultiSlicingDTWAttack()
For the Multi-DTW-Attack, the Slicing-DTW-Attack and the Multi-Slicing-DTW-Attack, a method must be selected which reduces the multiple distances resulting from the multi- or slicing effect to one distance per person and sensor. This step is necessary for a uniform evaluation.
- Minimum Method Selection of the smallest distance for each person and sensor over all multiple distances. (Standard for Slicing-DTW-Attack)
- Average Methode Calculation of the average distance for each person and sensor over all multiple distances. (Standard for Multi-DTW-Attack)
result_selection_method = "min"
result_selection_method = "mean"
Two methods are required for the Multi-Slicing-DTW-Attack. One is to reduce the multi-effect and the other is to reduce the slicing effect.
result_selection_method = "min-min" # Standard for Multi-Slicing-DTW-Attack
result_selection_method = "min-mean"
result_selection_method = "mean-min"
result_selection_method = "mean-mean"
The DTW attacks are evaluated using a four-stage rank-based evaluation pipeline. For this purpose, the distances are converted into ranks in ascending order, with the smallest distance receiving rank 1. If several subjects receive the same rank, a realistic rank selection method is used to determine the rank. The quality criterion precision@k is calculated to determine the performance of a DTW attack.
dtw_attack = SlicingDTWAttack()
dataset = Wesad(dataset_size=15)
data_processing = StandardProcessing()
resample_factor = 1000
result_selection_method = "min"
run_optimization_evaluation(dtw_attack=dtw_attack, dataset=dataset, data_processing=data_processing,
resample_factor=resample_factor, result_selection_method=result_selection_method)
The following aspects are evaluated in the four stages, building on each other:
The ranking method rank or score can be used to calculate an average of a person's distances across several sensors / sensor combinations.
- Ranking method rank: Averages the ranks of the individual sensors and then re-ranks them.
- Ranking method score: Averages the distances of the individual sensors and then performs ranking. (default setting for simplified evaluation)
Evaluation of the classes (stress and non-stress data) separately and as a weighted average.
The sensors are evaluated by sensor combinations (standard sensor ranking) and by individual weights per sensor (weighted sensor ranking).
The attack window sizes are used to calculate for which of the specified attack window sizes the best re-identification performance can be achieved, i.e. the amount of data an attacker should ideally have available.
All results are saved in experiments/out. A memory structure is created for this in order to be able to evaluate all possible parameter combinations provided by the attack framework separately.
A complete attack according to simulation mode can be executed as follows:
dtw_attack = MultiDTWAttack()
dataset = Wesad(dataset_size=15)
data_processing = StandardProcessing()
resample_factor = 1000
result_selection_method = "mean"
test_window_sizes = [i for i in range(1, 13)]
additional_windows = 1000 # standard value; will be automatically scaled with resample_factor -> additional_windows / resample_factor
multi = 3
run_dtw_attack(dtw_attack=dtw_attack, dataset=dataset, data_processing=data_processing,
test_window_sizes=test_window_sizes, resample_factor=resample_factor, multi=3)
run_optimization_evaluation(dataset=dataset, resample_factor=resample_factor, data_processing=data_processing,
dtw_attack=dtw_attack, result_selection_method=result_selection_method)
run_calculate_max_precision(dataset=dataset, resample_factor=resample_factor, data_processing=data_processing,
dtw_attack=dtw_attack, result_selection_method=result_selection_method,
use_existing_weightings=False)
run_overall_evaluation(dataset=dataset, resample_factor=resample_factor, data_processing=data_processing,
dtw_attack=dtw_attack, result_selection_method=result_selection_method, save_weightings=True)
Conducting an exploratory data analysis that plots the signal data of the individuals in the data set:
dataset = Wesad(dataset_size=15)
data_processing = StandardProcessing()
resample_factor = 1000
plot_subject_data(dataset=dataset, data_processing=data_processing, resample_factor=resample_factor)
Creating DTW distance heatmaps:
dataset = Wesad(dataset_size=15)
data_processing = StandardProcessing()
resample_factor = 1000
run_dtw_alignments(dataset=dataset, data_processing=data_processing, resample_factor=resample_factor)
plot_distance_heatmap(dataset=dataset, data_processing=data_processing, resample_factor=resample_factor)
The DTW attacks only calculate which subject in the data set is most similar to the target being searched for. It is not possible to recognise whether the target is contained in the data set at all. The In-Out attacks introduce a threshold for this purpose. The distance between the target and a subject has to be <= the threshold for a match to be formed (classification target in the data set).
resample_factor = 1000
dataset = WesadCGan(dataset_size=120, resample_factor=resample_factor) # all data sets can be used
overlap = "medium" # or "small" or "high"
run_threshold_attack(dataset=dataset, overlap=overlap, resample_factor=resample_factor)
run_evaluation_threshold_attacks(datasets=[dataset], overlap=overlap)
Parameters:
- overlap: The overlap determines how high the proportion of targets is that are also contained in the data set as a subject. You can choose between "small" = 0.125, "medium" = 0.5 and "high" = 1.0.
The aim of the mitigation strategy should be to find a good trade-off between privacy (reduction of the re-identification risk) and utility (usability of the data for applications such as stress detection). Perform re-identification attacks on noisy WESAD data set. The data set becomes noisy by adding random noise according to a Laplace distribution. Compare precision@1 scores of the attacks with noisy stress detection F1 scores.
data_processing = StandardProcessing()
resample_factor = 1000
noise_multipliers = [i for i in range(0, 16)] # specify noise multipliers to be tested
runs = 10 # specify the number of runs for which the attacks are to be repeated on new noisy data sets
result_selection_method = "min"
run_noisy_attacks(dtw_attack=SingleDTWAttack(), data_processing=data_processing, resample_factor=resample_factor,
result_selection_method=result_selection_method, noise_multipliers=noise_multipliers, runs=runs)
result_selection_method = "mean"
run_noisy_attacks(dtw_attack=MultiDTWAttack(), data_processing=data_processing, resample_factor=resample_factor,
result_selection_method=result_selection_method, noise_multipliers=noise_multipliers, runs=runs)
result_selection_method = "min"
run_noisy_attacks(dtw_attack=SlicingDTWAttack(), data_processing=data_processing, resample_factor=resample_factor,
result_selection_method=result_selection_method, noise_multipliers=noise_multipliers, runs=runs)
result_selection_method = "min-min"
run_noisy_attacks(dtw_attack=MultiSlicingDTWAttack(), data_processing=data_processing, resample_factor=resample_factor,
result_selection_method=result_selection_method, noise_multipliers=noise_multipliers, runs=runs)
run_evaluation_privacy_usability(dtw_attacks=[SingleDTWAttack(), MultiDTWAttack(), SlicingDTWAttack(), MultiSlicingDTWAttack()])
- All health data originated from public sources provided for research purposes and was solely used within the limited scope of this work.
- This repository only contains the functionalities required to perform the described experiments. There are no methods included to perform the DTW attacks directly for an actual attack set of a target and a data set.