This document introduces the main steps for generating a de-identified dataset. The steps are:
- Create/Review/Delete de-identification jobs;
- Read data and pre-process it;
- Generate de-identified data.
An example with the UCI Adult dataset then demonstrates the entire de-identification process.
Note:
- Make sure you have gone through the starting tutorial and can access the dashboard correctly.
- The original UCI Adult dataset is incomplete: it has no header and contains some missing values. We copied the original dataset, added the corresponding header, and removed some missing values. The manipulated version can be found here.
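The preparation described in the note above can be sketched roughly as follows. This is a minimal illustration, not the exact procedure used to build the manipulated file: it assumes missing values are marked with `?` (as in the raw UCI file) and that a record is dropped entirely when any field is missing. The function names and the file paths in the usage comment are hypothetical, and the sketch does not reproduce the quoting style of the example file.

```python
import csv

# Column names copied from the header of the example dataset in this document.
HEADER = [
    "Age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "salary_class",
]


def add_header_and_drop_missing(raw_lines, missing_marker="?"):
    """Return (header, rows), skipping blank lines and rows with a missing value.

    Assumption: a missing value appears as a field equal to `missing_marker`.
    """
    reader = csv.reader(raw_lines, skipinitialspace=True)
    rows = []
    for row in reader:
        if not row:  # blank line
            continue
        values = [value.strip() for value in row]
        if missing_marker in values:
            continue
        rows.append(values)
    return HEADER, rows


def write_csv(header, rows, out_file):
    """Write the header followed by the cleaned rows to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(header)
    writer.writerows(rows)


# Usage (hypothetical paths):
# with open("adult.data", newline="") as src, open("adults.csv", "w", newline="") as dst:
#     write_csv(*add_header_and_drop_missing(src), dst)
```

Dropping whole records is only one way to handle missing values; imputing them would be another, and the source does not say which was used.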
- The dashboard lists all the created de-identification tasks.
- Create: click to initiate a new de-identification task.
- Review: select a created task to view its details; you can also modify its settings.
- Delete: select a created task to delete all of its related data.
- Fill in the textbox with the file name and click confirm. For example, adults. Note: only .csv files are currently supported, so make sure the dataset in the directory is in .csv format.
- Select the attributes that will be included in the generated dataset.
- For the selected attributes, specify the data type of each attribute. Example: workclass is categorical, so for workclass change the numerical/continuous type to categorical in the drop-down list.
- Click the Execute button to perform pre-processing.
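When deciding which attributes to switch to categorical, it can help to inspect the file before opening the dashboard. The sketch below is a simple heuristic and not part of the tool: it suggests "numerical" for a column only if every sampled value parses as a number. The function names and the `max_rows` parameter are hypothetical.

```python
import csv


def is_number(text):
    """Return True if text parses as an int or float."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def suggest_types(csv_lines, max_rows=1000):
    """Suggest 'numerical' or 'categorical' for each column of a CSV with a header.

    csv_lines is any iterable of text lines, such as an open file.
    """
    reader = csv.DictReader(csv_lines)
    fieldnames = reader.fieldnames or []
    all_numeric = {name: True for name in fieldnames}
    for index, row in enumerate(reader):
        if index >= max_rows:
            break
        for name in fieldnames:
            if not is_number(row.get(name)):
                all_numeric[name] = False
    return {
        name: "numerical" if numeric else "categorical"
        for name, numeric in all_numeric.items()
    }


# Usage (path taken from the example in this document):
# with open("/user/data/adults.csv", newline="") as f:
#     for column, kind in suggest_types(f).items():
#         print(column, kind)
```

The suggestion is only a starting point. A column such as education_num holds numbers but encodes an ordered category, so the final choice in the drop-down list remains a judgment call.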
After the pre-processing step, we can select the privacy level and generate the de-identified dataset.
- Select a privacy level from the drop-down list.
- Click the Execute button to generate the de-identified dataset.
- When Step 2 succeeds, you can download the generated de-identified dataset.
We demonstrate the process with a dataset copied from the UCI Adult dataset, to which a header has been added. You can download the manipulated dataset here.
Age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,salary_class
39,"State-gov",77516,"Bachelors",13,"Never-married","Adm-clerical","Not-in-family","White","Male",2174,0,40,"United-States","<=50K"
50,"Self-emp-not-inc",83311,"Bachelors",13,"Married-civ-spouse","Exec-managerial","Husband","White","Male",0,0,13,"United-States","<=50K"
38,"Private",215646,"HS-grad",9,"Divorced","Handlers-cleaners","Not-in-family","White","Male",0,0,40,"United-States","<=50K"
53,"Private",234721,"11th",7,"Married-civ-spouse","Handlers-cleaners","Husband","Black","Male",0,0,40,"United-States","<=50K"
28,"Private",338409,"Bachelors",13,"Married-civ-spouse","Prof-specialty","Wife","Black","Female",0,0,40,"Cuba","<=50K"
37,"Private",284582,"Masters",14,"Married-civ-spouse","Exec-managerial","Wife","White","Female",0,0,40,"United-States","<=50K"
...
Suppose you attach the directory /user/data/ to the Docker container. Put the downloaded dataset adults.csv under /user/data on the host machine, so that the full path of the dataset on the host machine is /user/data/adults.csv. The accompanying figure illustrates this layout.
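Before opening the dashboard, the placement of the dataset can be checked on the host with a few lines of Python. This is a sketch under assumptions: the helper name is hypothetical, and it assumes the dashboard textbox expects the file name without its extension, as the adults example in this document suggests.

```python
from pathlib import Path


def dataset_name_for_textbox(data_dir, filename):
    """Check that the dataset is a .csv file in data_dir and return its base name.

    Assumption: the dashboard expects the name without the .csv extension.
    """
    path = Path(data_dir) / filename
    if path.suffix.lower() != ".csv":
        raise ValueError(f"only .csv files are supported, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"dataset not found on host: {path}")
    return path.stem


# Usage (paths taken from the example in this document):
# print(dataset_name_for_textbox("/user/data", "adults.csv"))
```

The check runs against the host path; whether the same file is visible inside the container depends on how the directory was attached when the container was started.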
Once the Docker container has launched successfully, you can visit the dashboard in a web browser. For example, if the IP address of the host machine is 140.112.42.26 and the Docker container is listening on port 8888, visit the dashboard at http://140.112.42.26:8888.
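If the page does not load, a quick reachability check from another machine can help separate network problems from container problems. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the dashboard answers plain HTTP requests at its root URL.

```python
import urllib.error
import urllib.request


def dashboard_url(host_ip, port):
    """Build the dashboard URL from the host IP address and the published port."""
    return f"http://{host_ip}:{port}"


def is_reachable(url, timeout=3.0):
    """Return True if an HTTP request to url succeeds within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        return False


# Usage (address and port taken from the example in this document):
# url = dashboard_url("140.112.42.26", 8888)
# print(url, "reachable" if is_reachable(url) else "not reachable")
```

A False result only says that no successful HTTP response came back in time; it does not distinguish a stopped container from a blocked port or a wrong address.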
After the pre-processing step, we can select the privacy level and generate the de-identified dataset.
- Select a privacy level from the drop-down list.
- Click the Execute button to generate the de-identified dataset.
- When Step 2 succeeds, you can download the generated de-identified dataset.