Initial plan for deduplication #11

Open
reginafcompton opened this issue Jul 2, 2019 · 6 comments
Labels
enhancement New feature or request

Comments

@reginafcompton
Contributor

reginafcompton commented Jul 2, 2019

This document describes a solution using DataMade's dedupe. It only discusses deduplication as a batch process: it does not describe a solution for deduplicating on a per-user, per-POST-request basis.
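Roughly, the batch flow would look like this (a minimal sketch against dedupe's Python API; the record IDs and field names are placeholders, and the exact calls have shifted across dedupe versions, so treat it as illustrative):

```python
import dedupe

# Records keyed by ID; the field names here are made up.
data = {
    1: {"name": "Jane Doe", "email": "jane@example.com"},
    2: {"name": "Jane  Doe", "email": "jane@example.com"},
    3: {"name": "John Smith", "email": "john@example.com"},
}

# Declare which fields matter for matching.
fields = [
    {"field": "name", "type": "String"},
    {"field": "email", "type": "String"},
]

deduper = dedupe.Dedupe(fields)

# Sample candidate pairs, label some of them interactively as
# duplicate / not duplicate, then train the matcher.
deduper.prepare_training(data)
dedupe.console_label(deduper)
deduper.train()

# Cluster the full dataset; each cluster is (record_ids, confidences).
clusters = deduper.partition(data, threshold=0.5)
for record_ids, scores in clusters:
    print(record_ids, scores)
```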

@reginafcompton
Contributor Author

reginafcompton commented Jul 12, 2019

@gregmundy and I talked about the above document. We identified a few immediate next steps:

@robinsonkwame

Dedupe may be useful here for several of your requirements.

@reginafcompton
Contributor Author

@gregmundy see my research results at the end of the planning document: https://docs.google.com/document/d/12K9p7RgLwmAHXKM0lNG_kmsNGHbAzOhN90AU_rtn5C4/edit#heading=h.kt0y9act2nxn

If you think of other resources, please send them my way.

@reginafcompton
Contributor Author

@robinsonkwame read my document, friend! That's the tool I recommend...though admittedly, I am a little relieved that you recommend it, too.

@robinsonkwame

robinsonkwame commented Jul 13, 2019

Ah, I just read through the write-up now; apologies for not doing so before. Regarding retraining with unsupervised machine learning, I can easily envision a process that:

  • Analyzes current PII data on premise when asked to do so.
  • The analysis process learns a probability distribution of that dataset in an unsupervised manner.
  • The analysis process can be resource-constrained to fit within the compute available at the customer. (Note: investigate what online, out-of-core solutions exist; we want to provide the same quality of learning for any kind of compute, with time being the varying parameter.)
  • The learned probability distribution is then used to generate (or serve) hyper-realistic sample datasets, on demand and of arbitrary size, that contain PII-like fields but correspond to no real individuals, so there is no PII to leak, for downstream training and the like.

For learning probability distributions, I would recommend looking into TGAN first; although it does not appear to support online or out-of-core learning, it is designed for large-scale data.
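The fit/sample loop would look roughly like this (a sketch based on the TGAN project's README; the file path, columns, and sample size are hypothetical):

```python
import pandas as pd
from tgan.model import TGANModel

# Load the on-premise tabular data (path and columns are made up).
data = pd.read_csv("customer_records.csv")

# TGAN needs to know which columns are continuous, by index;
# the remaining columns are treated as categorical.
continuous_columns = [0, 3]

# Learn a generative model of the joint distribution, then sample
# synthetic rows that mimic the data without copying any individual.
tgan = TGANModel(continuous_columns)
tgan.fit(data)
synthetic = tgan.sample(10000)
```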

@reginafcompton
Contributor Author

We have a strong case for never merging records, and instead just linking them (in another table). See notes here: https://docs.google.com/document/d/1A3_zQxccHxuK6RMPvE562EosqYNF3E6Xk-wOBa0dGd0/edit#
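As a sketch of that linking approach, dedupe's cluster output could be written to a separate link table while the source records stay untouched (sqlite is used purely for illustration; the table and column names are hypothetical):

```python
import sqlite3

# Example cluster output in the shape dedupe's partition() returns:
# each cluster is (record_ids, confidence_scores).
clusters = [((1, 2), (0.97, 0.97)), ((3,), (1.0,))]

conn = sqlite3.connect("dedupe.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS record_links (
           cluster_id INTEGER,
           record_id  INTEGER,
           confidence REAL
       )"""
)

# Source records are never modified; cluster membership lives
# only in this table.
for cluster_id, (record_ids, scores) in enumerate(clusters):
    for record_id, score in zip(record_ids, scores):
        conn.execute(
            "INSERT INTO record_links VALUES (?, ?, ?)",
            (cluster_id, record_id, float(score)),
        )
conn.commit()
```

One upside of this design: a bad match can be undone by deleting a row from the link table, rather than trying to un-merge records.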

@gregmundy added the enhancement (New feature or request) label Oct 2, 2019