
Realtime test sometime failing? #2515

Open
ADBond opened this issue Nov 20, 2024 · 7 comments

ADBond commented Nov 20, 2024

The test tests/test_realtime.py::test_realtime_cache_different_settings[duckdb] failed in Python 3.9 on this run. Re-running the job allowed the test to pass, which is odd, as I can't see where any indeterminacy could occur.

Error was: Binder Error: Values list "l" does not have a column named "tf_first_name".

Full test output
________________ test_realtime_cache_different_settings[duckdb] ________________

self = <splink.internals.duckdb.database_api.DuckDBAPI object at 0x7fb7780ef4c0>
final_sql = 'CREATE TABLE __splink__realtime_compare_records_aj2njrcn AS\nWITH __splink__compare_records_left AS (\n  SELECT\n    ..._adj_city,\n  "email_l",\n  "email_r",\n  gamma_email,\n  bf_email,\n  match_key\nFROM __splink__df_match_weight_parts'
templated_name = '__splink__realtime_compare_records'
physical_name = '__splink__realtime_compare_records_aj2njrcn'

    @final
    def _log_and_run_sql_execution(
        self, final_sql: str, templated_name: str, physical_name: str
    ) -> TablishType:
        """
        Log some sql, then call _run_sql_execution()
        Any errors will be converted to SplinkException with more detail
        names are only relevant for logging, not execution
        """
        logger.debug(execute_sql_logging_message_info(templated_name, physical_name))
        logger.log(5, log_sql(final_sql))
        try:
>           return self._execute_sql_against_backend(final_sql)

splink/internals/database_api.py:63: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <splink.internals.duckdb.database_api.DuckDBAPI object at 0x7fb7780ef4c0>
final_sql = 'CREATE TABLE __splink__realtime_compare_records_aj2njrcn AS \nWITH\n\n__splink__compare_records_left as (\nselect * f..._tf_adj_city,"email_l","email_r",gamma_email,bf_email,match_key \n    from __splink__df_match_weight_parts\n    \n    '

    def _execute_sql_against_backend(self, final_sql: str) -> duckdb.DuckDBPyRelation:
>       return self._con.sql(final_sql)
E       duckdb.duckdb.BinderException: Binder Error: Values list "l" does not have a column named "tf_first_name"
E       LINE 11: ...from __splink__compare_records_left_aj2njrcn), 
E       
E       __splink__compare_records_rig...
E                                                          ^

splink/internals/duckdb/database_api.py:102: BinderException

The above exception was the direct cause of the following exception:

test_helpers = {'duckdb': (<class 'tests.helpers.DuckDBTestHelper'>, []), 'spark': (<class 'tests.helpers.SparkTestHelper'>, [<functi...ite': (<class 'tests.helpers.SQLiteTestHelper'>, []), 'postgres': (<class 'tests.helpers.PostgresTestHelper'>, [None])}
dialect = 'duckdb'

    @mark_with_dialects_excluding()
    def test_realtime_cache_different_settings(test_helpers, dialect):
        helper = test_helpers[dialect]
        db_api = helper.extra_linker_args()["db_api"]
    
        df1 = pd.DataFrame(
            [
                {
                    "unique_id": 0,
                    "first_name": "Julia",
                    "surname": "Taylor",
                    "city": "London",
                    "email": "[email protected]",
                }
            ]
        )
    
        df2 = pd.DataFrame(
            [
                {
                    "unique_id": 1,
                    "first_name": "Julia",
                    "surname": "Taylor",
                    "city": "London",
                    "email": "[email protected]",
                }
            ]
        )
    
        settings_1 = SettingsCreator(
            link_type="dedupe_only",
            comparisons=[
                cl.ExactMatch("first_name"),
                cl.ExactMatch("surname"),
                cl.ExactMatch("city"),
            ],
            blocking_rules_to_generate_predictions=[block_on("first_name")],
        )
    
        settings_2 = SettingsCreator(
            link_type="dedupe_only",
            comparisons=[
                cl.ExactMatch("first_name"),
                cl.ExactMatch("surname"),
                cl.ExactMatch("email"),
            ],
            blocking_rules_to_generate_predictions=[block_on("first_name")],
        )
    
>       res1 = compare_records(
            df1, df2, settings_1, db_api, use_sql_from_cache=True
        ).as_record_dict()[0]["match_weight"]

tests/test_realtime.py:343: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
splink/internals/realtime.py:88: in compare_records
    return db_api._sql_to_splink_dataframe(
splink/internals/database_api.py:93: in _sql_to_splink_dataframe
    spark_df = self._log_and_run_sql_execution(sql, templated_name, physical_name)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <splink.internals.duckdb.database_api.DuckDBAPI object at 0x7fb7780ef4c0>
final_sql = 'CREATE TABLE __splink__realtime_compare_records_aj2njrcn AS\nWITH __splink__compare_records_left AS (\n  SELECT\n    ..._adj_city,\n  "email_l",\n  "email_r",\n  gamma_email,\n  bf_email,\n  match_key\nFROM __splink__df_match_weight_parts'
templated_name = '__splink__realtime_compare_records'
physical_name = '__splink__realtime_compare_records_aj2njrcn'

    @final
    def _log_and_run_sql_execution(
        self, final_sql: str, templated_name: str, physical_name: str
    ) -> TablishType:
        """
        Log some sql, then call _run_sql_execution()
        Any errors will be converted to SplinkException with more detail
        names are only relevant for logging, not execution
        """
        logger.debug(execute_sql_logging_message_info(templated_name, physical_name))
        logger.log(5, log_sql(final_sql))
        try:
            return self._execute_sql_against_backend(final_sql)
        except Exception as e:
            # Parse our SQL through sqlglot to pretty print
            try:
                final_sql = sqlglot.parse_one(
                    final_sql,
                    read=self.sql_dialect.sqlglot_dialect,
                ).sql(pretty=True)
                # if sqlglot produces any errors, just report the raw SQL
            except Exception:
                pass
    
>           raise SplinkException(
                f"Error executing the following sql for table "
                f"`{templated_name}`({physical_name}):\n{final_sql}"
                f"\n\nError was: {e}"
            ) from e
E           splink.internals.exceptions.SplinkException: Error executing the following sql for table `__splink__realtime_compare_records`(__splink__realtime_compare_records_aj2njrcn):
E           CREATE TABLE __splink__realtime_compare_records_aj2njrcn AS
E           WITH __splink__compare_records_left AS (
E             SELECT
E               *
E             FROM __splink__compare_records_left_aj2njrcn
E           ), __splink__compare_records_right AS (
E             SELECT
E               *
E             FROM __splink__compare_records_right_aj2njrcn
E           ), __splink__compare_two_records_blocked AS (
E             SELECT
E               "l"."unique_id" AS "unique_id_l",
E               "r"."unique_id" AS "unique_id_r",
E               "l"."first_name" AS "first_name_l",
E               "r"."first_name" AS "first_name_r",
E               "l"."tf_first_name" AS "tf_first_name_l",
E               "r"."tf_first_name" AS "tf_first_name_r",
E               "l"."surname" AS "surname_l",
E               "r"."surname" AS "surname_r",
E               "l"."city" AS "city_l",
E               "r"."city" AS "city_r",
E               "l"."tf_city" AS "tf_city_l",
E               "r"."tf_city" AS "tf_city_r",
E               "l"."email" AS "email_l",
E               "r"."email" AS "email_r",
E               0 AS match_key
E             FROM __splink__compare_records_left AS l
E             CROSS JOIN __splink__compare_records_right AS r
E           ), __splink__df_comparison_vectors AS (
E             SELECT
E               "unique_id_l",
E               "unique_id_r",
E               "first_name_l",
E               "first_name_r",
E               CASE
E                 WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL
E                 THEN -1
E                 WHEN "first_name_l" = "first_name_r"
E                 THEN 1
E                 ELSE 0
E               END AS gamma_first_name,
E               "tf_first_name_l",
E               "tf_first_name_r",
E               "surname_l",
E               "surname_r",
E               CASE
E                 WHEN "surname_l" IS NULL OR "surname_r" IS NULL
E                 THEN -1
E                 WHEN "surname_l" = "surname_r"
E                 THEN 1
E                 ELSE 0
E               END AS gamma_surname,
E               "city_l",
E               "city_r",
E               CASE
E                 WHEN "city_l" IS NULL OR "city_r" IS NULL
E                 THEN -1
E                 WHEN "city_l" = "city_r"
E                 THEN 1
E                 ELSE 0
E               END AS gamma_city,
E               "tf_city_l",
E               "tf_city_r",
E               "email_l",
E               "email_r",
E               CASE
E                 WHEN "email_l" IS NULL OR "email_r" IS NULL
E                 THEN -1
E                 WHEN "email_l" = "email_r"
E                 THEN 1
E                 ELSE 0
E               END AS gamma_email,
E               match_key
E             FROM __splink__compare_two_records_blocked
E           ), __splink__df_match_weight_parts AS (
E             SELECT
E               "unique_id_l",
E               "unique_id_r",
E               "first_name_l",
E               "first_name_r",
E               gamma_first_name,
E               "tf_first_name_l",
E               "tf_first_name_r",
E               CASE
E                 WHEN gamma_first_name = -1
E                 THEN CAST(1.0 AS DOUBLE)
E                 WHEN gamma_first_name = 1
E                 THEN CAST(1024.0 AS DOUBLE)
E                 WHEN gamma_first_name = 0
E                 THEN CAST(0.03125 AS DOUBLE)
E               END AS bf_first_name,
E               CASE
E                 WHEN gamma_first_name = -1
E                 THEN CAST(1 AS DOUBLE)
E                 WHEN gamma_first_name = 1
E                 THEN (
E                   CASE
E                     WHEN NOT COALESCE("tf_first_name_l", "tf_first_name_r") IS NULL
E                     THEN POWER(
E                       CAST(0.000927734375 AS DOUBLE) / NULLIF(
E                         (
E                           CASE
E                             WHEN COALESCE("tf_first_name_l", "tf_first_name_r") >= COALESCE("tf_first_name_r", "tf_first_name_l")
E                             THEN COALESCE("tf_first_name_l", "tf_first_name_r")
E                             ELSE COALESCE("tf_first_name_r", "tf_first_name_l")
E                           END
E                         ),
E                         0
E                       ),
E                       CAST(1.0 AS DOUBLE)
E                     )
E                     ELSE CAST(1 AS DOUBLE)
E                   END
E                 )
E                 WHEN gamma_first_name = 0
E                 THEN CAST(1 AS DOUBLE)
E               END AS bf_tf_adj_first_name,
E               "surname_l",
E               "surname_r",
E               gamma_surname,
E               CASE
E                 WHEN gamma_surname = -1
E                 THEN CAST(1.0 AS DOUBLE)
E                 WHEN gamma_surname = 1
E                 THEN CAST(1024.0 AS DOUBLE)
E                 WHEN gamma_surname = 0
E                 THEN CAST(0.03125 AS DOUBLE)
E               END AS bf_surname,
E               "city_l",
E               "city_r",
E               gamma_city,
E               "tf_city_l",
E               "tf_city_r",
E               CASE
E                 WHEN gamma_city = -1
E                 THEN CAST(1.0 AS DOUBLE)
E                 WHEN gamma_city = 1
E                 THEN CAST(1024.0 AS DOUBLE)
E                 WHEN gamma_city = 0
E                 THEN CAST(0.03125 AS DOUBLE)
E               END AS bf_city,
E               CASE
E                 WHEN gamma_city = -1
E                 THEN CAST(1 AS DOUBLE)
E                 WHEN gamma_city = 1
E                 THEN (
E                   CASE
E                     WHEN NOT COALESCE("tf_city_l", "tf_city_r") IS NULL
E                     THEN POWER(
E                       CAST(0.000927734375 AS DOUBLE) / NULLIF(
E                         (
E                           CASE
E                             WHEN COALESCE("tf_city_l", "tf_city_r") >= COALESCE("tf_city_r", "tf_city_l")
E                             THEN COALESCE("tf_city_l", "tf_city_r")
E                             ELSE COALESCE("tf_city_r", "tf_city_l")
E                           END
E                         ),
E                         0
E                       ),
E                       CAST(1.0 AS DOUBLE)
E                     )
E                     ELSE CAST(1 AS DOUBLE)
E                   END
E                 )
E                 WHEN gamma_city = 0
E                 THEN CAST(1 AS DOUBLE)
E               END AS bf_tf_adj_city,
E               "email_l",
E               "email_r",
E               gamma_email,
E               CASE
E                 WHEN gamma_email = -1
E                 THEN CAST(1.0 AS DOUBLE)
E                 WHEN gamma_email = 1
E                 THEN CAST(1024.0 AS DOUBLE)
E                 WHEN gamma_email = 0
E                 THEN CAST(0.03125 AS DOUBLE)
E               END AS bf_email,
E               match_key
E             FROM __splink__df_comparison_vectors
E           )
E           SELECT
E             LOG(
E               2,
E               CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
E             ) AS match_weight,
E             CASE
E               WHEN bf_first_name = CAST('infinity' AS DOUBLE)
E               OR bf_tf_adj_first_name = CAST('infinity' AS DOUBLE)
E               OR bf_surname = CAST('infinity' AS DOUBLE)
E               OR bf_city = CAST('infinity' AS DOUBLE)
E               OR bf_tf_adj_city = CAST('infinity' AS DOUBLE)
E               OR bf_email = CAST('infinity' AS DOUBLE)
E               THEN 1.0
E               ELSE (
E                 CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
E               ) / NULLIF(
E                 (
E                   1 + (
E                     CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
E                   )
E                 ),
E                 0
E               )
E             END AS match_probability,
E             "unique_id_l",
E             "unique_id_r",
E             "first_name_l",
E             "first_name_r",
E             gamma_first_name,
E             "tf_first_name_l",
E             "tf_first_name_r",
E             bf_first_name,
E             bf_tf_adj_first_name,
E             "surname_l",
E             "surname_r",
E             gamma_surname,
E             bf_surname,
E             "city_l",
E             "city_r",
E             gamma_city,
E             "tf_city_l",
E             "tf_city_r",
E             bf_city,
E             bf_tf_adj_city,
E             "email_l",
E             "email_r",
E             gamma_email,
E             bf_email,
E             match_key
E           FROM __splink__df_match_weight_parts
E           
E           Error was: Binder Error: Values list "l" does not have a column named "tf_first_name"
E           LINE 11: ...from __splink__compare_records_left_aj2njrcn), 
E           
E           __splink__compare_records_rig...
E                                                              ^

splink/internals/database_api.py:75: SplinkException
Test summary output
=========================== short test summary info ============================
FAILED tests/test_realtime.py::test_realtime_cache_different_settings[duckdb] - splink.internals.exceptions.SplinkException: Error executing the following sql for table `__splink__realtime_compare_records`(__splink__realtime_compare_records_aj2njrcn):
CREATE TABLE __splink__realtime_compare_records_aj2njrcn AS
WITH __splink__compare_records_left AS (
  SELECT
    *
  FROM __splink__compare_records_left_aj2njrcn
), __splink__compare_records_right AS (
  SELECT
    *
  FROM __splink__compare_records_right_aj2njrcn
), __splink__compare_two_records_blocked AS (
  SELECT
    "l"."unique_id" AS "unique_id_l",
    "r"."unique_id" AS "unique_id_r",
    "l"."first_name" AS "first_name_l",
    "r"."first_name" AS "first_name_r",
    "l"."tf_first_name" AS "tf_first_name_l",
    "r"."tf_first_name" AS "tf_first_name_r",
    "l"."surname" AS "surname_l",
    "r"."surname" AS "surname_r",
    "l"."city" AS "city_l",
    "r"."city" AS "city_r",
    "l"."tf_city" AS "tf_city_l",
    "r"."tf_city" AS "tf_city_r",
    "l"."email" AS "email_l",
    "r"."email" AS "email_r",
    0 AS match_key
  FROM __splink__compare_records_left AS l
  CROSS JOIN __splink__compare_records_right AS r
), __splink__df_comparison_vectors AS (
  SELECT
    "unique_id_l",
    "unique_id_r",
    "first_name_l",
    "first_name_r",
    CASE
      WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL
      THEN -1
      WHEN "first_name_l" = "first_name_r"
      THEN 1
      ELSE 0
    END AS gamma_first_name,
    "tf_first_name_l",
    "tf_first_name_r",
    "surname_l",
    "surname_r",
    CASE
      WHEN "surname_l" IS NULL OR "surname_r" IS NULL
      THEN -1
      WHEN "surname_l" = "surname_r"
      THEN 1
      ELSE 0
    END AS gamma_surname,
    "city_l",
    "city_r",
    CASE
      WHEN "city_l" IS NULL OR "city_r" IS NULL
      THEN -1
      WHEN "city_l" = "city_r"
      THEN 1
      ELSE 0
    END AS gamma_city,
    "tf_city_l",
    "tf_city_r",
    "email_l",
    "email_r",
    CASE
      WHEN "email_l" IS NULL OR "email_r" IS NULL
      THEN -1
      WHEN "email_l" = "email_r"
      THEN 1
      ELSE 0
    END AS gamma_email,
    match_key
  FROM __splink__compare_two_records_blocked
), __splink__df_match_weight_parts AS (
  SELECT
    "unique_id_l",
    "unique_id_r",
    "first_name_l",
    "first_name_r",
    gamma_first_name,
    "tf_first_name_l",
    "tf_first_name_r",
    CASE
      WHEN gamma_first_name = -1
      THEN CAST(1.0 AS DOUBLE)
      WHEN gamma_first_name = 1
      THEN CAST(1024.0 AS DOUBLE)
      WHEN gamma_first_name = 0
      THEN CAST(0.03125 AS DOUBLE)
    END AS bf_first_name,
    CASE
      WHEN gamma_first_name = -1
      THEN CAST(1 AS DOUBLE)
      WHEN gamma_first_name = 1
      THEN (
        CASE
          WHEN NOT COALESCE("tf_first_name_l", "tf_first_name_r") IS NULL
          THEN POWER(
            CAST(0.000927734375 AS DOUBLE) / NULLIF(
              (
                CASE
                  WHEN COALESCE("tf_first_name_l", "tf_first_name_r") >= COALESCE("tf_first_name_r", "tf_first_name_l")
                  THEN COALESCE("tf_first_name_l", "tf_first_name_r")
                  ELSE COALESCE("tf_first_name_r", "tf_first_name_l")
                END
              ),
              0
            ),
            CAST(1.0 AS DOUBLE)
          )
          ELSE CAST(1 AS DOUBLE)
        END
      )
      WHEN gamma_first_name = 0
      THEN CAST(1 AS DOUBLE)
    END AS bf_tf_adj_first_name,
    "surname_l",
    "surname_r",
    gamma_surname,
    CASE
      WHEN gamma_surname = -1
      THEN CAST(1.0 AS DOUBLE)
      WHEN gamma_surname = 1
      THEN CAST(1024.0 AS DOUBLE)
      WHEN gamma_surname = 0
      THEN CAST(0.03125 AS DOUBLE)
    END AS bf_surname,
    "city_l",
    "city_r",
    gamma_city,
    "tf_city_l",
    "tf_city_r",
    CASE
      WHEN gamma_city = -1
      THEN CAST(1.0 AS DOUBLE)
      WHEN gamma_city = 1
      THEN CAST(1024.0 AS DOUBLE)
      WHEN gamma_city = 0
      THEN CAST(0.03125 AS DOUBLE)
    END AS bf_city,
    CASE
      WHEN gamma_city = -1
      THEN CAST(1 AS DOUBLE)
      WHEN gamma_city = 1
      THEN (
        CASE
          WHEN NOT COALESCE("tf_city_l", "tf_city_r") IS NULL
          THEN POWER(
            CAST(0.000927734375 AS DOUBLE) / NULLIF(
              (
                CASE
                  WHEN COALESCE("tf_city_l", "tf_city_r") >= COALESCE("tf_city_r", "tf_city_l")
                  THEN COALESCE("tf_city_l", "tf_city_r")
                  ELSE COALESCE("tf_city_r", "tf_city_l")
                END
              ),
              0
            ),
            CAST(1.0 AS DOUBLE)
          )
          ELSE CAST(1 AS DOUBLE)
        END
      )
      WHEN gamma_city = 0
      THEN CAST(1 AS DOUBLE)
    END AS bf_tf_adj_city,
    "email_l",
    "email_r",
    gamma_email,
    CASE
      WHEN gamma_email = -1
      THEN CAST(1.0 AS DOUBLE)
      WHEN gamma_email = 1
      THEN CAST(1024.0 AS DOUBLE)
      WHEN gamma_email = 0
      THEN CAST(0.03125 AS DOUBLE)
    END AS bf_email,
    match_key
  FROM __splink__df_comparison_vectors
)
SELECT
  LOG(
    2,
    CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
  ) AS match_weight,
  CASE
    WHEN bf_first_name = CAST('infinity' AS DOUBLE)
    OR bf_tf_adj_first_name = CAST('infinity' AS DOUBLE)
    OR bf_surname = CAST('infinity' AS DOUBLE)
    OR bf_city = CAST('infinity' AS DOUBLE)
    OR bf_tf_adj_city = CAST('infinity' AS DOUBLE)
    OR bf_email = CAST('infinity' AS DOUBLE)
    THEN 1.0
    ELSE (
      CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
    ) / NULLIF(
      (
        1 + (
          CAST(0.00010001000100010001 AS DOUBLE) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_city * bf_tf_adj_city * bf_email
        )
      ),
      0
    )
  END AS match_probability,
  "unique_id_l",
  "unique_id_r",
  "first_name_l",
  "first_name_r",
  gamma_first_name,
  "tf_first_name_l",
  "tf_first_name_r",
  bf_first_name,
  bf_tf_adj_first_name,
  "surname_l",
  "surname_r",
  gamma_surname,
  bf_surname,
  "city_l",
  "city_r",
  gamma_city,
  "tf_city_l",
  "tf_city_r",
  bf_city,
  bf_tf_adj_city,
  "email_l",
  "email_r",
  gamma_email,
  bf_email,
  match_key
FROM __splink__df_match_weight_parts

Error was: Binder Error: Values list "l" does not have a column named "tf_first_name"
LINE 11: ...from __splink__compare_records_left_aj2njrcn), 

__splink__compare_records_rig...
                                                   ^
= 1 failed, 321 passed, 12 skipped, 395 deselected, 16 warnings in 116.40s (0:01:56) =
Environment info (3.9.20)
Package operations: 36 installs, 0 updates, 1 removal

  - Removing setuptools (75.5.0)
  - Installing six (1.16.0)
  - Installing attrs (24.2.0)
  - Installing python-dateutil (2.9.0.post0)
  - Installing rpds-py (0.20.1)
  - Installing referencing (0.35.1)
  - Installing jsonschema-specifications (2023.12.1)
  - Installing markupsafe (2.1.5)
  - Installing numpy (1.24.4)
  - Installing pytz (2024.2)
  - Installing tzdata (2024.2)
  - Installing exceptiongroup (1.2.2)
  - Installing greenlet (3.1.1)
  - Installing iniconfig (2.0.0)
  - Installing jinja2 (3.1.4)
  - Installing jsonschema (4.23.0)
  - Installing packaging (24.2)
  - Installing narwhals (1.14.1)
  - Installing pandas (2.2.3)
  - Installing pluggy (1.5.0)
  - Installing py4j (0.10.9.7)
  - Installing pyarrow (17.0.0)
  - Installing texttable (1.7.0)
  - Installing tomli (2.1.0)
  - Installing typing-extensions (4.12.2)
  - Installing altair (5.4.1)
  - Installing duckdb (1.1.3)
  - Installing igraph (0.11.8)
  - Installing networkx (3.1)
  - Installing psycopg2-binary (2.9.10)
  - Installing pyspark (3.5.3)
  - Installing pytest (8.3.3)
  - Installing rapidfuzz (3.9.7)
  - Installing ruff (0.4.10)
  - Installing sqlalchemy (2.0.36)
  - Installing sqlglot (25.31.4)
  - Installing tabulate (0.9.0)
ADBond added the testing label Nov 20, 2024

ADBond commented Nov 25, 2024

See also this run (Python 3.13 + Postgres), which failed with (psycopg2.errors.UndefinedColumn) column l.tf_first_name does not exist.


ADBond commented Nov 25, 2024

It looks like the test tests/test_realtime.py::test_realtime_cache_different_settings is trying to run SQL that corresponds to the settings from the earlier tests (either of them; they should generate the same SQL).

It's likely to do with the SQL caching mechanism. I've looked through the stack a bit to try to figure out what code might be non-deterministic. My current hypothesis centres on the fact that we use id(settings) to generate a cache key:

  • In test_realtime_cache_multiple_records or test_realtime_cache_two_records we define a settings object that has tf adjustments on first_name and city. Let's say this settings object has an id of 123.
  • We call compare_records. The SQL gets cached using the key 123.
  • The test finishes, and the settings object gets garbage collected.
  • test_realtime_cache_different_settings defines new settings objects that do not have term frequency adjustments at all (and in fact each have fewer columns). One of these is assigned an id of 123 by Python (as the previous settings object has been garbage collected, that id is no longer reserved, and Python decides to re-use it).
  • We call compare_records with use_sql_from_cache=True. Even on the first call we get a cache hit, as the cache is a global object and it has a SQL entry corresponding to the key 123. When we try to run this SQL, we get an error, as the input data lacks the corresponding term-frequency columns (we don't need them!).

So far this is just a theory. I think this scenario should be possible, but I haven't managed to generate an id collision like this in a toy example locally, nor to recreate the test failure. If this is the cause, then it may well be down to differences between my environment and CI (which could affect gc behaviour and memory layout).
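To make the hypothesised failure mode concrete, here is a toy simulation of an id()-keyed global SQL cache. The cache structure and function names are hypothetical stand-ins, not Splink's actual internals:

```python
# Toy simulation of an id()-keyed global SQL cache; the cache shape and
# function names here are hypothetical, not Splink's actual internals.
_sql_cache = {}

def cache_sql(settings_id, sql):
    _sql_cache[settings_id] = sql

def get_cached_sql(settings_id):
    return _sql_cache.get(settings_id)

# Test A: SQL for a settings object with tf adjustments gets cached.
# Suppose id(settings_a) happens to be 123.
cache_sql(123, "SELECT ... tf_first_name ...")

# Test A finishes; settings_a is garbage collected, so id 123 can be
# handed out again. If test B's settings object (no tf adjustments)
# receives the same id, the lookup hits the stale entry:
stale_sql = get_cached_sql(123)  # cache hit, but with the wrong SQL
```

Running that stale SQL against test B's input data would then fail, since the data has no tf_first_name column.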

If this is the issue, then it should be relatively easy to fix with a slight tweak to the caching strategy (and maybe we could do this anyway). But it would be nice to get to the bottom of this beforehand, as this kind of failure could mask more subtle bugs. If for instance the only difference between the two settings objects were in the comparison levels (or even just the parameter values!), then the 'incorrect' SQL would still execute. If we were only dealing with the match_probability (rather than looking at individual match weights or other intermediate columns) we might never be aware that we are in fact running a different model to what we thought.

@ADBond
Copy link
Contributor Author

ADBond commented Nov 25, 2024

Note this line from the Python docs on id():

Two objects with non-overlapping lifetimes may have the same id() value.

which suggests that this mechanism is at least plausible. I don't really have a sense of how likely it is.
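The guarantee (and its limit) can be illustrated directly; `Settings` here is just a stand-in class for the real settings object:

```python
class Settings:
    """Stand-in for a splink settings object."""

# Overlapping lifetimes: distinct ids are guaranteed
a, b = Settings(), Settings()
assert id(a) != id(b)

# Non-overlapping lifetimes: no such guarantee
old_id = id(a)
del a  # refcount hits zero, so CPython frees the object immediately
c = Settings()
# id(c) may equal old_id (and in CPython often does in practice),
# because the allocator is free to reuse the just-freed memory slot
collided = id(c) == old_id
```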


ADBond commented Nov 25, 2024

Okay, I believe that this is in fact what is occurring. I have run the tests repeatedly on my fork in CI to get a failure, with some additional logging in place. This run has a failure, and you can see in the logs that a settings id has been re-used. In particular, the settings object instantiated in test_realtime_cache_multiple_records has id 140110989126384. Then in test_realtime_cache_different_settings the second settings object also gets assigned the id 140110989126384 (the previous object having fallen out of scope, and apparently garbage collected at some point in the meantime), which means we look up the wrong SQL, and thus the test fails:

tests/test_realtime.py::test_realtime_cache_multiple_records[duckdb] 
-------------------------------- live log call ---------------------------------
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 140110989126384
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 140110989126384
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 140110989126384
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 140110989126384
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 140110989126384
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 140110989126384
PASSED                                                                   [ 66%]
tests/test_realtime.py::test_realtime_cache_different_settings[duckdb] 
-------------------------------- live log call ---------------------------------
WARNING  root:realtime.py:94 Settings object had id: 140110989243008
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 140110989243008
WARNING  root:realtime.py:94 Settings object had id: 140110989126384
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 140110989126384
FAILED     

For reference, here is the equivalent portion of the logs for a non-failing run. In this case the object gets an id distinct from those in previous tests, so we do not get a cache hit, and thus construct the SQL afresh:

tests/test_realtime.py::test_realtime_cache_multiple_records[duckdb] 
-------------------------------- live log call ---------------------------------
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 139678152719376
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 139678152719376
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 139678152719376
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 139678152719376
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 139678152719376
WARNING  root:realtime.py:94 Settings object had id: 139678152719376
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 139678152719376
PASSED                                                                   [ 66%]
tests/test_realtime.py::test_realtime_cache_different_settings[duckdb] 
-------------------------------- live log call ---------------------------------
WARNING  root:realtime.py:94 Settings object had id: 139678152854448
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 139678152854448
WARNING  root:realtime.py:94 Settings object had id: 139678152852816
WARNING  splink.internals.realtime:realtime.py:38 Setting cache for 139678152852816
WARNING  root:realtime.py:94 Settings object had id: 139678152854448
WARNING  splink.internals.realtime:realtime.py:27 Getting cache for 139678152854448
PASSED  

ADBond self-assigned this Nov 25, 2024

RobinL commented Nov 26, 2024

Thanks for looking into this, and excellent detective work! I checked back on this one because I just got the same failure here

The reason I used id(settings) was simply that it seemed like a low-overhead way of uniquely identifying a settings object. Obviously it didn't work as anticipated, but any other mechanism should do. It was just to avoid caching errors in the case where the user is making predictions from two different models, with the potential for one model to use the cached SQL from the other.


ADBond commented Nov 26, 2024

Update on this, to document my thinking:
Currently I have a small partial fix for this. Instead of passing in the id, we pass in the object itself. Internally we still use the id as a key, but alongside the SQL and uid we store a weak reference to the object itself. When we get a would-be cache hit, if the referent is dead we delete the cache entry and return nothing. The trouble with this is that:

  • it won't work for dict or str settings, as they can't be weakreffed, so we would need to handle these differently in some fashion
  • this feels like a bit of an indirect approach

I've run it with some hooks to confirm that in cases where this issue would have arisen this fixes things.

I could merge that in now, and it would solve the intermittent test failures. But as it's not currently a blocking issue, I'm going to explore another option based on hashing snapshots of the object. I realise it's probably a fairly niche bug, but I think it would be useful to try to get this right, as there is potential for quite a subtle (and thus hard-to-notice) error here.
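A minimal sketch of the weakref approach described above (class and method names are illustrative, not the actual Splink implementation):

```python
import weakref

class SettingsSQLCache:
    """SQL cache keyed by id(settings), guarded by a weak reference
    so that a recycled id cannot return another object's SQL."""

    def __init__(self):
        self._store = {}  # id(settings) -> (sql, weakref to settings)

    def set(self, settings, sql):
        # weakref.ref raises TypeError for dict/str settings -
        # the limitation noted above
        self._store[id(settings)] = (sql, weakref.ref(settings))

    def get(self, settings):
        entry = self._store.get(id(settings))
        if entry is None:
            return None
        sql, ref = entry
        if ref() is not settings:
            # referent is dead (the id was recycled), so drop the
            # stale entry and report a cache miss
            del self._store[id(settings)]
            return None
        return sql
```

If the original settings object has been garbage collected and a new object happens to receive the same id, `ref()` returns `None`, the stale entry is evicted, and the SQL gets rebuilt from scratch.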


RobinL commented Nov 26, 2024

Interesting - the weakref idea is definitely an improvement, and feels like the correct/principled solution for the object case; it's a shame about the dict and str issue

I agree that it's worth solving this 'properly'

Could we simply set our own uuid __settings_id on the settings object, check for existence and use that as the cache key if it exists?

  • This feels like it works for SettingsCreator, Settings, dict
  • If the input is a path/str, could use the path/str as the settings id?

But, I appreciate it's a bit hacky!

The loading directly from a path scenario is interesting because it's potentially a common production use case. The simplicity of 'if settings is a string, use it as a key' seems potentially nice - easy to understand and high performance.

(apologies for the mistake in the first place - I had assumed that because the id is a big number, the chance of collision was negligible, but I can see now why that's not the case)

Or maybe it's just:

  • if dict or string, use hash
  • if obj, use weakref.

That doesn't feel too bad...
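Combining the path-as-key, hash-the-dict, and uuid-attribute ideas above, a key-derivation helper might look something like this (a hypothetical sketch; `settings_cache_key` and the `_settings_id` attribute name are made up for illustration):

```python
import hashlib
import json
import uuid

def settings_cache_key(settings):
    """Derive a stable cache key per the strategies sketched above."""
    if isinstance(settings, str):
        # path/str: use it directly as the key
        return settings
    if isinstance(settings, dict):
        # dict: hash a canonical serialisation, so equal dicts share a key
        canonical = json.dumps(settings, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()
    # object: attach a uuid the first time and reuse it thereafter, so
    # the key lives exactly as long as the object and is never recycled
    key = getattr(settings, "_settings_id", None)
    if key is None:
        key = uuid.uuid4().hex
        settings._settings_id = key
    return key
```

The object branch could equally be replaced by the weakref-guarded lookup from the earlier comment; the dict/str branches are the part the weakref approach cannot cover.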
