Initial plan for deduplication #11
Comments
@gregmundy and I talked about the above document. We identified a few immediate new steps:
Dedupe may be useful here for several of your requirements.
@gregmundy see my research results at the end of the planning document: https://docs.google.com/document/d/12K9p7RgLwmAHXKM0lNG_kmsNGHbAzOhN90AU_rtn5C4/edit#heading=h.kt0y9act2nxn If you think of other resources, please send them my way.
@robinsonkwame read my document, friend! That's the tool I recommend... though admittedly, I am a little relieved that you recommend it, too.
Ah, I just read through the write-up now; apologies for not doing so before. Regarding retraining and unsupervised machine learning, I can easily envision a process that:
For learning probability distributions, I would recommend looking into TGAN first. Although it does not appear to support online or out-of-core learning, it is designed for large-scale data.
We have a strong case for never merging records, but just linking records (in another table). See notes here: https://docs.google.com/document/d/1A3_zQxccHxuK6RMPvE562EosqYNF3E6Xk-wOBa0dGd0/edit#
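To make the link-instead-of-merge idea concrete, here is a minimal sketch using Python's standard `sqlite3` module. Everything in it is an assumption for illustration: the `person` and `record_link` table names, the columns, and the shape of `clusters` (a list of `(record_ids, score)` pairs) are hypothetical and are not taken from the linked notes or from any particular library's output.

```python
import sqlite3


def build_link_table(conn, clusters):
    """Store duplicate clusters in a separate table; source rows are never modified.

    clusters: iterable of (record_ids, score) pairs (hypothetical shape).
    Returns the number of link rows written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS record_link ("
        " record_id INTEGER PRIMARY KEY,"
        " cluster_id INTEGER NOT NULL,"
        " score REAL NOT NULL)"
    )
    # One row per source record, pointing at the cluster it belongs to.
    rows = [
        (record_id, cluster_id, score)
        for cluster_id, (record_ids, score) in enumerate(clusters)
        for record_id in record_ids
    ]
    conn.executemany("INSERT INTO record_link VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical source table with two likely duplicates and one distinct record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, "Ann Lee"), (2, "Anne Lee"), (3, "Bob Ray")],
)

linked = build_link_table(conn, [((1, 2), 0.93), ((3,), 1.0)])
person_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
same_cluster = conn.execute(
    "SELECT record_id FROM record_link WHERE cluster_id = 0 ORDER BY record_id"
).fetchall()
```

Because the source table is left alone, a bad match can be undone by deleting or updating rows in `record_link`, with no need to reconstruct a merged record.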
This document describes a solution using DataMade's dedupe. It discusses deduplication only as a batch process: it does not describe a solution for deduplicating on a per-user, per-post request basis.