Initial plan for deduplication #11

Open
reginafcompton opened this issue Jul 2, 2019 · 6 comments
Labels
enhancement New feature or request

Comments

@reginafcompton
Contributor

reginafcompton commented Jul 2, 2019

This document describes a solution using DataMade's dedupe. It only discusses deduplication as a batch process: it does not describe a solution for deduplicating on a per-user, per-POST-request basis.
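Roughly, the batch flow would look like this (a minimal sketch against dedupe's Python API; the record IDs and field names are placeholders, and the exact calls have shifted across dedupe versions, so treat it as illustrative):

```python
import dedupe

# Records keyed by ID; the field names here are made up.
data = {
    1: {"name": "Jane Doe", "email": "jane@example.com"},
    2: {"name": "Jane  Doe", "email": "jane@example.com"},
    3: {"name": "John Smith", "email": "john@example.com"},
}

# Declare which fields matter for matching.
fields = [
    {"field": "name", "type": "String"},
    {"field": "email", "type": "String"},
]

deduper = dedupe.Dedupe(fields)

# Sample candidate pairs, label some of them interactively as
# duplicate / not duplicate, then train the matcher.
deduper.prepare_training(data)
dedupe.console_label(deduper)
deduper.train()

# Cluster the full dataset; each cluster is (record_ids, confidences).
clusters = deduper.partition(data, threshold=0.5)
for record_ids, scores in clusters:
    print(record_ids, scores)
```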

@reginafcompton
Contributor Author

reginafcompton commented Jul 12, 2019

@gregmundy and I talked about the above document. We identified a few immediate next steps:

@robinsonkwame

Dedupe may be useful here for several of your requirements.

@reginafcompton
Contributor Author

@gregmundy see my research results at the end of the planning document: https://docs.google.com/document/d/12K9p7RgLwmAHXKM0lNG_kmsNGHbAzOhN90AU_rtn5C4/edit#heading=h.kt0y9act2nxn

If you think of other resources, please send them my way.

@reginafcompton
Contributor Author

@robinsonkwame read my document, friend! That's the tool I recommend...though admittedly, I am a little relieved that you recommend it, too.

@robinsonkwame

robinsonkwame commented Jul 13, 2019

Ah, I just read through the write-up now; apologies for not doing so before. Regarding retraining with unsupervised machine learning, I can easily envision a process that:

  • Analyzes current PII data on premise when asked to do so.
  • The analysis process learns a probability distribution of that dataset in an unsupervised manner.
  • The analysis process can be resource-constrained to fit within the compute available at the customer. (Note: investigate what online, out-of-core solutions exist; we want to provide the same quality of learning for any kind of compute, with time being the varying parameter.)
  • The learned probability distribution is then used to generate (or serve) hyper-realistic sample datasets, on demand and of arbitrary size, that contain PII-like fields but correspond to no real individuals, so there is no PII to leak, for downstream training and the like.

For learning probability distributions, I would recommend looking into TGAN first; although it does not appear to support online or out-of-core learning, it is designed for large-scale data.
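The fit/sample loop would look roughly like this (a sketch based on the TGAN project's README; the file path, columns, and sample size are hypothetical):

```python
import pandas as pd
from tgan.model import TGANModel

# Load the on-premise tabular data (path and columns are made up).
data = pd.read_csv("customer_records.csv")

# TGAN needs to know which columns are continuous, by index;
# the remaining columns are treated as categorical.
continuous_columns = [0, 3]

# Learn a generative model of the joint distribution, then sample
# synthetic rows that mimic the data without copying any individual.
tgan = TGANModel(continuous_columns)
tgan.fit(data)
synthetic = tgan.sample(10000)
```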

@reginafcompton
Contributor Author

We have a strong case for never merging records, and instead just linking them (in another table). See notes here: https://docs.google.com/document/d/1A3_zQxccHxuK6RMPvE562EosqYNF3E6Xk-wOBa0dGd0/edit#
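As a sketch of that linking approach, dedupe's cluster output could be written to a separate link table while the source records stay untouched (sqlite is used purely for illustration; the table and column names are hypothetical):

```python
import sqlite3

# Example cluster output in the shape dedupe's partition() returns:
# each cluster is (record_ids, confidence_scores).
clusters = [((1, 2), (0.97, 0.97)), ((3,), (1.0,))]

conn = sqlite3.connect("dedupe.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS record_links (
           cluster_id INTEGER,
           record_id  INTEGER,
           confidence REAL
       )"""
)

# Source records are never modified; cluster membership lives
# only in this table.
for cluster_id, (record_ids, scores) in enumerate(clusters):
    for record_id, score in zip(record_ids, scores):
        conn.execute(
            "INSERT INTO record_links VALUES (?, ?, ?)",
            (cluster_id, record_id, float(score)),
        )
conn.commit()
```

One upside of this design: a bad match can be undone by deleting a row from the link table, rather than trying to un-merge records.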

@gregmundy added the enhancement (New feature or request) label Oct 2, 2019