Zero trained m-values can lead to math domain error #2333

Open
ADBond opened this issue Aug 14, 2024 · 0 comments
Labels: bug, model training

Comments

ADBond (Contributor) commented Aug 14, 2024

Given the right circumstances in the data, trained m-values can end up as 0 (probably "really" just smaller than floating-point precision). This leads to a math domain error when we try to take the log of the (also zero) Bayes factor in ComparisonLevel._as_detailed_record, e.g. in charts such as match_weights_chart or m_u_parameters_chart.
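For context, a minimal sketch of the underlying failure mode (the log2 call stands in for the match-weight calculation; the u-value here is arbitrary and just for illustration):

import math

m_value, u_value = 0.0, 0.05    # a trained m-value that has collapsed to zero
bayes_factor = m_value / u_value  # 0.0
math.log2(bayes_factor)           # raises ValueError: math domain error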

Here is a not-very-elegant reprex:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

settings = SettingsCreator(
    "dedupe_only",
    comparisons=[
        cl.LevenshteinAtThresholds("first_name"),
        cl.LevenshteinAtThresholds("surname"),
        cl.ExactMatch("city"),
        cl.LevenshteinAtThresholds("dob"),
        cl.LevenshteinAtThresholds("email"),
        cl.ExactMatch("cluster"),
        cl.ExactMatch("cluster_1"),
        cl.ExactMatch("cluster_2"),
        cl.ExactMatch("cluster_3"),
        cl.ExactMatch("non_match_cat"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("dob"),
        block_on("city"),
    ]
)

df = splink_datasets.fake_1000
df["cluster_1"] = df["cluster"]
df["cluster_2"] = df["cluster"]
df["cluster_3"] = df["cluster"]

# specially chosen non-matchy things
df["non_match_cat"] = None
cats = {
    9: 1,
    192: 1,
    10: 1,
    7: 1,
    21: 2,
    287: 2,
    263: 6,
    273: 6,
    500: 7,
    729: 7,
}
for id_n, cat in cats.items():
    # use .loc rather than chained indexing so the assignment actually takes effect
    df.loc[df["unique_id"] == id_n, "non_match_cat"] = cat

db_api = DuckDBAPI()
linker = Linker(df, settings, db_api)

linker.training.estimate_probability_two_random_records_match(
    block_on("first_name", "surname", "dob"), recall=0.7
)
linker.training.estimate_u_using_random_sampling(max_pairs=1e8)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("cluster"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("city"))

linker.misc.save_model_to_json("mde.json", overwrite=True)

ch = linker.visualisations.match_weights_chart()

Related is #1889, which has the same root cause and can be reproduced if, in the above, we call linker.inference.predict() instead of generating the match weights chart. Opening this separately, though, as that issue occurs during SQL execution, whereas here we hit the issue in Python, so the two will potentially require different solutions.
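On the Python side, one possible mitigation (purely a sketch of a hypothetical helper, not Splink's actual implementation) would be to map a zero Bayes factor to a match weight of negative infinity rather than calling log2 on it directly:

import math

def bayes_factor_to_match_weight(bayes_factor: float) -> float:
    # Hypothetical guard: treat a zero Bayes factor as an infinitely
    # negative match weight instead of raising a math domain error.
    if bayes_factor == 0:
        return float("-inf")
    return math.log2(bayes_factor)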
