[FEAT] A percentage threshold for array intersection #2382

mihir-packmoose · 2024-09-04T18:54:29Z

Is your proposal related to a problem?

We've been working with splink to do a bit of address matching, and have identified cases where:

We can't use Levenshtein distance because it wouldn't capture nuance (sometimes if most words match with different ordering, the distance is high when we want a metric that is okay with reordered words).
We can't use splink.internals.comparison_level_library.ArrayIntersectLevel because addresses have different sizes and there's no good min_intersection that we can define.

Describe the solution you'd like

We've written a patch into our code that's based on splink.internals.comparison_level_library.ArrayIntersectLevel and defines a new class called ArrayIntersectPercentage that:

Takes a percentage_threshold
Performs the following pseudocode operation in the sql_base_dialect_sql statement:

$$\frac{size(intersection(left\_column,right\_column))}{greatest(size(left\_column),size(right\_column)}\leq percentage\_threshold$$

We would like to make a pull request to this effect, adding this new class to the toolkit that splink officially provides to users as of its current version.

Describe alternatives you've considered

As mentioned before, we've considered and attempted to use Levenshtein distance and splink.internals.comparison_level_library.ArrayIntersectLevel as possible ways of tackling our specific address matching problem, but the patch that we wrote in solved the problem for us, and we think it'll be a nice addition to the entity resolution toolkit.

Additional context

We're relatively new to open source, and any help in getting this feature on par with the rest of the codebase in terms of standards would be much appreciated!

RobinL · 2024-09-05T13:45:25Z

Hello!

Thanks - this is interest interesting. We have thought about this a bit in the past. I agree that there should be some measure that is able to account for the size of the arrays.

I'm certainly willing to consider adding a percentage intersection according to your definition, but i wnated to raise some points for discussion first:

I think the biggest challenge with using percentage intersection to categorise similarity is fact that the arrays may be a different size.

For example:

[A,B] vs [A,B,C,D]

and

[A,B, E, F] vs [A,B,C,D]

both have 50% intersection, but in many cases, I think we might say that the first example is more similar.

This is somewhat equivalent to the fact that NULLs are genearlly treated as no information, but explicit disagreement usually gets negative weight in probabilistic models. The percentage intersection level is treating them as equivalent

A second issue is as follows:

[A] vs [A,B]

[A,B,C,D,E,F] vs [A,B,C]

Arguably the second example is a much 'bigger match' (e.g. if the arrays were postcodes it would be much less likely to occur by chance).

Another aspect is that as the size of the arrays grow, I think the size of the percentage intersection may grow, and this is especially the case if the field is either low cardinality (a lot number of discrete values), or has high skew (e.g. name, where some names like John and David appear a lot, so once you have several values in the array, it become very likely a few match birthday paradox style).

None of this is to say we shouldn't include a percentage threshold, I just thought I'd highlight to see whether you had further ideas. Perhaps there is a better or more nuanced way of defining it to get at what you're trying to achieve, but that has higher accuracy (is able to capture similarity between arrays with greater nuance).

I'm not sure I have a better solution, but consider something like:

Arrays exactly equal
One array is a subset of the other
Intersection of size >5, with difference of <=1
Intersection of size >3, with difference of <=2
Intersection of size >1, with difference of <=3
Else

Any thoughts?

mihir-packmoose · 2024-09-05T16:46:55Z

Hey, Robin, thanks for dropping in! We had an internal discussion about this in our team meeting. We believe that the subset idea will probably help us solve our current problem, and the intersection-with-difference idea also shows good potential as a fallback. We threw together some code and are now testing it to see how it fares.

Thanks again for your insight in solving this problem! I'm setting my PR back into a draft stage while we work this out, and we'll keep you posted about it.

mihir-packmoose · 2024-09-06T19:34:39Z

Hey, Robin, we just ended up testing these locally and found that exact subsets work really nicely for our use case. Could you take a look at the changes I've just committed and let me know about next steps?

mihir-packmoose · 2024-09-16T17:41:32Z

Hey, Robin, just checking in to ask if you've had the chance to look at my testing commit and my question regarding null sets?

mihir-packmoose added the enhancement New feature or request label Sep 4, 2024

mihir-packmoose mentioned this issue Sep 4, 2024

ArraySubset defined, first draft #2383

Closed

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEAT] A percentage threshold for array intersection #2382

[FEAT] A percentage threshold for array intersection #2382

mihir-packmoose commented Sep 4, 2024 •

edited

Loading

RobinL commented Sep 5, 2024 •

edited

Loading

mihir-packmoose commented Sep 5, 2024

mihir-packmoose commented Sep 6, 2024

mihir-packmoose commented Sep 16, 2024

[FEAT] A percentage threshold for array intersection #2382

[FEAT] A percentage threshold for array intersection #2382

Comments

mihir-packmoose commented Sep 4, 2024 • edited Loading

Is your proposal related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

RobinL commented Sep 5, 2024 • edited Loading

mihir-packmoose commented Sep 5, 2024

mihir-packmoose commented Sep 6, 2024

mihir-packmoose commented Sep 16, 2024

mihir-packmoose commented Sep 4, 2024 •

edited

Loading

RobinL commented Sep 5, 2024 •

edited

Loading