-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] A percentage threshold for array intersection #2382
Comments
Hello! Thanks - this is interest interesting. We have thought about this a bit in the past. I agree that there should be some measure that is able to account for the size of the arrays. I'm certainly willing to consider adding a percentage intersection according to your definition, but i wnated to raise some points for discussion first: I think the biggest challenge with using percentage intersection to categorise similarity is fact that the arrays may be a different size. For example:
and
both have 50% intersection, but in many cases, I think we might say that the first example is more similar. This is somewhat equivalent to the fact that NULLs are genearlly treated as no information, but explicit disagreement usually gets negative weight in probabilistic models. The percentage intersection level is treating them as equivalent A second issue is as follows:
Arguably the second example is a much 'bigger match' (e.g. if the arrays were postcodes it would be much less likely to occur by chance). Another aspect is that as the size of the arrays grow, I think the size of the percentage intersection may grow, and this is especially the case if the field is either low cardinality (a lot number of discrete values), or has high skew (e.g. name, where some names like John and David appear a lot, so once you have several values in the array, it become very likely a few match birthday paradox style). None of this is to say we shouldn't include a percentage threshold, I just thought I'd highlight to see whether you had further ideas. Perhaps there is a better or more nuanced way of defining it to get at what you're trying to achieve, but that has higher accuracy (is able to capture similarity between arrays with greater nuance). I'm not sure I have a better solution, but consider something like:
Any thoughts? |
Hey, Robin, thanks for dropping in! We had an internal discussion about this in our team meeting. We believe that the subset idea will probably help us solve our current problem, and the intersection-with-difference idea also shows good potential as a fallback. We threw together some code and are now testing it to see how it fares. Thanks again for your insight in solving this problem! I'm setting my PR back into a draft stage while we work this out, and we'll keep you posted about it. |
Hey, Robin, we just ended up testing these locally and found that exact subsets work really nicely for our use case. Could you take a look at the changes I've just committed and let me know about next steps? |
Hey, Robin, just checking in to ask if you've had the chance to look at my testing commit and my question regarding null sets? |
Is your proposal related to a problem?
We've been working with splink to do a bit of address matching, and have identified cases where:
Describe the solution you'd like
We've written a patch into our code that's based on splink.internals.comparison_level_library.ArrayIntersectLevel and defines a new class called ArrayIntersectPercentage that:
We would like to make a pull request to this effect, adding this new class to the toolkit that splink officially provides to users as of its current version.
Describe alternatives you've considered
As mentioned before, we've considered and attempted to use Levenshtein distance and splink.internals.comparison_level_library.ArrayIntersectLevel as possible ways of tackling our specific address matching problem, but the patch that we wrote in solved the problem for us, and we think it'll be a nice addition to the entity resolution toolkit.
Additional context
We're relatively new to open source, and any help in getting this feature on par with the rest of the codebase in terms of standards would be much appreciated!
The text was updated successfully, but these errors were encountered: