Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] A percentage threshold for array intersection #2382

Open
mihir-packmoose opened this issue Sep 4, 2024 · 4 comments
Open

[FEAT] A percentage threshold for array intersection #2382

mihir-packmoose opened this issue Sep 4, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@mihir-packmoose
Copy link
Contributor

mihir-packmoose commented Sep 4, 2024

Is your proposal related to a problem?

We've been working with splink to do a bit of address matching, and have identified cases where:

  1. We can't use Levenshtein distance because it wouldn't capture nuance (sometimes if most words match with different ordering, the distance is high when we want a metric that is okay with reordered words).
  2. We can't use splink.internals.comparison_level_library.ArrayIntersectLevel because addresses have different sizes and there's no good min_intersection that we can define.

Describe the solution you'd like

We've written a patch into our code that's based on splink.internals.comparison_level_library.ArrayIntersectLevel and defines a new class called ArrayIntersectPercentage that:

  1. Takes a percentage_threshold
  2. Performs the following pseudocode operation in the sql_base_dialect_sql statement:

$$\frac{size(intersection(left\_column,right\_column))}{greatest(size(left\_column),size(right\_column)}\leq percentage\_threshold$$

We would like to make a pull request to this effect, adding this new class to the toolkit that splink officially provides to users as of its current version.

Describe alternatives you've considered

As mentioned before, we've considered and attempted to use Levenshtein distance and splink.internals.comparison_level_library.ArrayIntersectLevel as possible ways of tackling our specific address matching problem, but the patch that we wrote in solved the problem for us, and we think it'll be a nice addition to the entity resolution toolkit.

Additional context

We're relatively new to open source, and any help in getting this feature on par with the rest of the codebase in terms of standards would be much appreciated!

@mihir-packmoose mihir-packmoose added the enhancement New feature or request label Sep 4, 2024
@RobinL
Copy link
Member

RobinL commented Sep 5, 2024

Hello!

Thanks - this is interest interesting. We have thought about this a bit in the past. I agree that there should be some measure that is able to account for the size of the arrays.

I'm certainly willing to consider adding a percentage intersection according to your definition, but i wnated to raise some points for discussion first:

I think the biggest challenge with using percentage intersection to categorise similarity is fact that the arrays may be a different size.

For example:

[A,B] vs [A,B,C,D]

and

[A,B, E, F] vs [A,B,C,D]

both have 50% intersection, but in many cases, I think we might say that the first example is more similar.

This is somewhat equivalent to the fact that NULLs are genearlly treated as no information, but explicit disagreement usually gets negative weight in probabilistic models. The percentage intersection level is treating them as equivalent

A second issue is as follows:

[A] vs [A,B]
[A,B,C,D,E,F] vs [A,B,C]

Arguably the second example is a much 'bigger match' (e.g. if the arrays were postcodes it would be much less likely to occur by chance).

Another aspect is that as the size of the arrays grow, I think the size of the percentage intersection may grow, and this is especially the case if the field is either low cardinality (a lot number of discrete values), or has high skew (e.g. name, where some names like John and David appear a lot, so once you have several values in the array, it become very likely a few match birthday paradox style).

None of this is to say we shouldn't include a percentage threshold, I just thought I'd highlight to see whether you had further ideas. Perhaps there is a better or more nuanced way of defining it to get at what you're trying to achieve, but that has higher accuracy (is able to capture similarity between arrays with greater nuance).

I'm not sure I have a better solution, but consider something like:

  • Arrays exactly equal
  • One array is a subset of the other
  • Intersection of size >5, with difference of <=1
  • Intersection of size >3, with difference of <=2
  • Intersection of size >1, with difference of <=3
  • Else

Any thoughts?

@mihir-packmoose
Copy link
Contributor Author

Hey, Robin, thanks for dropping in! We had an internal discussion about this in our team meeting. We believe that the subset idea will probably help us solve our current problem, and the intersection-with-difference idea also shows good potential as a fallback. We threw together some code and are now testing it to see how it fares.

Thanks again for your insight in solving this problem! I'm setting my PR back into a draft stage while we work this out, and we'll keep you posted about it.

@mihir-packmoose
Copy link
Contributor Author

Hey, Robin, we just ended up testing these locally and found that exact subsets work really nicely for our use case. Could you take a look at the changes I've just committed and let me know about next steps?

@mihir-packmoose
Copy link
Contributor Author

Hey, Robin, just checking in to ask if you've had the chance to look at my testing commit and my question regarding null sets?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants