Zero trained m-values can lead to math domain error #2333

Open
ADBond opened this issue Aug 14, 2024 · 0 comments
Labels: bug, model training

Comments

ADBond (Contributor) commented Aug 14, 2024

Given the right circumstances in the data, trained m-values can end up as 0 (probably "really" just smaller than floating-point precision). This leads to a math domain error when we try to take the log of the (also zero) Bayes factor in ComparisonLevel._as_detailed_record, e.g. in charts such as match_weights_chart or m_u_parameters_chart.
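For context, a minimal sketch of the underlying failure mode (the log2 call stands in for the match-weight calculation; the u-value here is arbitrary and just for illustration):

import math

m_value, u_value = 0.0, 0.05    # a trained m-value that has collapsed to zero
bayes_factor = m_value / u_value  # 0.0
math.log2(bayes_factor)           # raises ValueError: math domain error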

Here is a not-very-elegant reprex:

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

settings = SettingsCreator(
    "dedupe_only",
    comparisons=[
        cl.LevenshteinAtThresholds("first_name"),
        cl.LevenshteinAtThresholds("surname"),
        cl.ExactMatch("city"),
        cl.LevenshteinAtThresholds("dob"),
        cl.LevenshteinAtThresholds("email"),
        cl.ExactMatch("cluster"),
        cl.ExactMatch("cluster_1"),
        cl.ExactMatch("cluster_2"),
        cl.ExactMatch("cluster_3"),
        cl.ExactMatch("non_match_cat"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("dob"),
        block_on("city"),
    ]
)

df = splink_datasets.fake_1000
df["cluster_1"] = df["cluster"]
df["cluster_2"] = df["cluster"]
df["cluster_3"] = df["cluster"]

# specially chosen non-matchy things
df["non_match_cat"] = None
cats = {
    9: 1,
    192: 1,
    10: 1,
    7: 1,
    21: 2,
    287: 2,
    263: 6,
    273: 6,
    500: 7,
    729: 7,
}
for id_n, cat in cats.items():
    # use .loc rather than chained indexing so the assignment actually takes effect
    df.loc[df["unique_id"] == id_n, "non_match_cat"] = cat

db_api = DuckDBAPI()
linker = Linker(df, settings, db_api)

linker.training.estimate_probability_two_random_records_match(
    block_on("first_name", "surname", "dob"), recall=0.7
)
linker.training.estimate_u_using_random_sampling(max_pairs=1e8)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("cluster"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
linker.training.estimate_parameters_using_expectation_maximisation(block_on("city"))

linker.misc.save_model_to_json("mde.json", overwrite=True)

ch = linker.visualisations.match_weights_chart()

Related is #1889, which has the same root cause and can be reproduced if, in the above, we call linker.inference.predict() instead of generating the match weights chart. Opening this separately, though, as that issue occurs during SQL execution, whereas here we hit the issue in Python, so the two will potentially require different solutions.
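On the Python side, one possible mitigation (purely a sketch of a hypothetical helper, not Splink's actual implementation) would be to map a zero Bayes factor to a match weight of negative infinity rather than calling log2 on it directly:

import math

def bayes_factor_to_match_weight(bayes_factor: float) -> float:
    # Hypothetical guard: treat a zero Bayes factor as an infinitely
    # negative match weight instead of raising a math domain error.
    if bayes_factor == 0:
        return float("-inf")
    return math.log2(bayes_factor)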
