Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: merging selections combines filters in incorrect way #9058

Closed
1 task done
NickCrews opened this issue Apr 25, 2024 · 17 comments · Fixed by #9065
Closed
1 task done

bug: merging selections combines filters in incorrect way #9058

NickCrews opened this issue Apr 25, 2024 · 17 comments · Fixed by #9065
Labels
bug Incorrect behavior inside of ibis
Milestone

Comments

@NickCrews
Copy link
Contributor

What happened?

I do a filter-mutate-filter, and the order needs to stay in that order to be correct, but ibis appears to jumble the order of these steps:

import ibis
from ibis import _

ibis.options.interactive = True

t = ibis.read_parquet(
    "https://github.com/NickCrews/election-data/releases/download/v2024-04-05_0/cleaned.parquet"
)
t = t.cache()
t = t.filter(_.votes != "*").mutate(votes=_.votes.cast("int64")).filter(_.votes > 0)
# occasionally, but not always, I get
# ConversionException: Conversion Error: Could not convert string '*' to INT64
t.votes.mean()

This gives the sql

SELECT
  AVG("t1"."votes") AS "Mean(votes)"
FROM (
  SELECT
    "t0"."year",
    "t0"."date",
    "t0"."state_po",
    "t0"."county_name",
    "t0"."county_fips",
    "t0"."jurisdiction_name",
    "t0"."jurisdiction_fips",
    "t0"."district",
    "t0"."office",
    "t0"."magnitude",
    "t0"."special",
    "t0"."stage",
    "t0"."precinct",
    "t0"."writein",
    "t0"."candidate",
    "t0"."party_detailed",
    "t0"."mode",
    CAST("t0"."votes" AS BIGINT) AS "votes",
    "t0"."readme_check",
    "t0"."county_fips2",
    "t0"."jurisdiction_fips2"
  FROM "ibis_cache_hys22fcdbnbmpdoasprivdb3pm" AS "t0"
  WHERE
    "t0"."votes" <> '*' AND CAST("t0"."votes" AS BIGINT) > CAST(0 AS TINYINT)
) AS "t1"

I think we need to keep these as two separate relations. I think this is somewhat related to #9014 (comment)

What version of ibis are you using?

main

What backend(s) are you using, if any?

duckdb

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Apr 25, 2024
@gforsyth
Copy link
Member

Thanks for the report @NickCrews -- we'll probably disable merging filters across selects to avoid this happening. In the interim (and we won't release 9.0 until we sort this out), you can probably use try_cast to handle this without the failure mode.

@kszucs
Copy link
Member

kszucs commented Apr 26, 2024

The produced expression then the sql specific Select operations looks correct to me. I understand that this data-dependent case works if the filters are not fused, but ibis expressions shouldn't have assumptions about the underlying data.

The minimal problem is the following:

t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0)

gets reduced to

t.filter(_.votes != "*", _.votes.cast("int64") > 0)

and the latter fails due to the fallible cast operation.

By not fusing the operations these two will behave differently depending on the data which might be considered a downside as well. This is not something we can make foolproof since the user may use the compound predicate directly t.filter(_.votes != "*", _.votes.cast("int64") > 0).

The predicate which should be really used for this case is:

t.filter((_.votes != "*").ifelse(_.votes.cast("int64") > 0), False)

Or rather a new API allowing to chain ands more conveniently:

t.filter((_.votes != "*").and_then(_.votes.cast("int64") > 0)

@cpcloud
Copy link
Member

cpcloud commented Apr 26, 2024

I understand that this data-dependent case works if the filters are not fused, but ibis expressions shouldn't have assumptions about the underlying data.

Removing filters for this case removes assumptions, it doesn't add any. Are you saying something else?

@cpcloud
Copy link
Member

cpcloud commented Apr 26, 2024

By not fusing the operations these two will behave differently depending on the data which might be considered a downside as well.

This can't be addressed statically (which means definitely not in any tool that generates code without knowing what the data are), since it's data dependent. Not fusing doesn't make that particular issue worse or better, but fusing definitely makes it worse.

@kszucs
Copy link
Member

kszucs commented Apr 26, 2024

but fusing definitely makes it worse

I kinda disagree with this, I think fusing makes it just as bad (not better not worse [semantically]).
Additionally there are clear workarounds to either use .ifelse() or .try_cast().

@cpcloud
Copy link
Member

cpcloud commented Apr 26, 2024

The workarounds aren't that clear IMO. It's not like we'd recognize the pattern and suggest an alternative. However, fusing makes the solution of "spell things out bit by bit" impossible to write.

@jcrist
Copy link
Member

jcrist commented Apr 26, 2024

and the latter fails due to the fallible cast operation.

Correct, as it should. I don't think we need a new method here. An ibis .filter call should semantically appear to the user to run as a subquery with a separate WHERE clause. The backend is free to fuse steps as needed, but we need to ensure the semantically the results are the same.

So

t.filter(_.x != "*").filter(_.x.cast("int64") > 0)

should compile to something semantically identical to (all sql typos mine):

SELECT * FROM (
  SELECT * FROM t WHERE x <> '*'
) WHERE CAST(x AS int64) > 0

Whereas

t.filter(_.x != "*", _.x.cast("int64") > 0)

should be semantically identical to:

SELECT * FROM t WHERE x <> '*" AND CAST(x as int64) > 0

These aren't semantically identical queries, and the latter may not work appropriately in this case since it's gating the CAST operation on an AND in the same filter (and backends provide no guarantee on AND shortcircuiting).

We can fuse if we want to produce prettier sql, with the caveat that we don't want to change the semantics. In the case of an operation like CAST which can error we can't fuse filters like this. Enumerating the cases where we can fuse properly seems tricky to do and probably not something we should introduce in 9.0 given all the recent bugs. I agree with @cpcloud here - we should just disable fusing to avoid this issue.

@kszucs
Copy link
Member

kszucs commented Apr 26, 2024

I would prefer catching that my query is fallible early.

Imagine that the nested query successfully executes then a new record gets inserted into the data source (not a parquet file, but a live database table) with _.x = "uncastable", the ibis query will fail the same way just later.

@kszucs
Copy link
Member

kszucs commented Apr 26, 2024

Just checked and this has been the behaviour for earlier ibis versions as well:

In [16]: t = ibis.read_parquet(
    ...:     "https://github.com/NickCrews/election-data/releases/download/v2024-04-05_0/cleaned.parquet"
    ...: )
    ...: t = t.cache()
100% ▕████████████████████████████████████████████████████████████▏

In [17]: t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0)
Out[17]:
r0 := DatabaseTable: ibis_cache_7s6eudf4vbfqzhs7dpm52nqafu
  year               int16
  date               date
  state_po           string
  county_name        string
  county_fips        string
  jurisdiction_name  string
  jurisdiction_fips  string
  district           string
  office             string
  magnitude          int64
  special            boolean
  stage              string
  precinct           string
  writein            boolean
  candidate          string
  party_detailed     string
  mode               string
  votes              string
  readme_check       boolean
  county_fips2       string
  jurisdiction_fips2 string

Selection[r0]
  predicates:
    r0.votes != '*'
    Cast(r0.votes, to=int64) > 0

In [19]: print(t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0).compile())
SELECT t0.year, t0.date, t0.state_po, t0.county_name, t0.county_fips, t0.jurisdiction_name, t0.jurisdiction_fips, t0.district, t0.office, t0.magnitude, t0.special, t0.stage, t0.precinct, t0.writein, t0.candidate, t0.party_detailed, t0.mode, t0.votes, t0.readme_check, t0.county_fips2, t0.jurisdiction_fips2
FROM ibis_cache_7s6eudf4vbfqzhs7dpm52nqafu AS t0
WHERE t0.votes != :param_1 AND CAST(t0.votes AS BIGINT) > CAST(:param_2 AS INTEGER)

@cpcloud
Copy link
Member

cpcloud commented Apr 26, 2024

I don't think that matters too much, given that this is incorrect behavior given what the user has written.

I'm still not understanding what the rationale is for keeping the merging given all the issues we have encountered so far.

@kszucs
Copy link
Member

kszucs commented Apr 27, 2024

  1. I still think that merging is correct. For example polars does the same thing (and I assume other systems as well):
In [1]: import polars as pl
   ...:
   ...: lf = pl.LazyFrame(
   ...:     {
   ...:         "foo": [1, 2, 3],
   ...:         "bar": [6, 7, 8],
   ...:         "ham": ["a", "b", "c"],
   ...:     }
   ...: )
   ...: hamint = pl.col("ham").cast(pl.Int64)
   ...: expr = lf.filter(pl.col("foo") > 1).filter(hamint > 2)
   ...:
   ...: print(expr.explain())
DF ["foo", "bar", "ham"]; PROJECT */3 COLUMNS; SELECTION: "[([(col(\"ham\").strict_cast(Int64)) > (2)]) & ([(col(\"foo\")) > (1)])]"
  1. Even if we consider it as an incorrect functionality, it is not a regression.
  2. I tried disabling the merge in refactor(sql): make select merging opt-in #9066 and there are new issues arising from it, especially with mssql, so we need to implement workaround rewrites for those cases.

@NickCrews
Copy link
Contributor Author

_.x = "uncastable"

I'm not sure I understand, can you give a more concrete example?

@NickCrews
Copy link
Contributor Author

Consider

def load():
    ...
    t = t.distinct()
    t = t.filter(t.votes != "*")
    return t

# many lines

t = load()
t = t.filter(t.votes.cast(int) > 0)
  1. If you look at each part of this program locally, it appears correct. It only breaks if these two distant parts of the program interact in an unlucky way.
  2. If you switched the order of the .distinct() to be the last statement in load(), it would succeed, even though semantically that yields the same data out of load(). Correctness hinging upon an implementation detail like this smells bad.

@kszucs
Copy link
Member

kszucs commented Apr 27, 2024

If you switched the order of the .distinct() to be the last statement in load(), it would succeed

Wouldn't it only succeed if the votes column contains valid integer strings or "*"? What I am trying to say that it is entirely data dependent and t.votes.cast(int) is fallible hence the query may or may not fail.

Do you have an example where the filters are not fallible end the merge alters the behaviour?

@NickCrews
Copy link
Contributor Author

Wouldn't it only succeed if the votes column contains valid integer strings or "*"?

Yes, that is the scenario I am considering here, the data only contains valid ints or "*". In that scenario:

  • t.filter(_.votes != "*").distinct().filter(_.votes.cast(int) > 0): This SHOULD never error, and in the current implementation, it never errors.
  • t.distinct().filter(_.votes != "*").filter(_.votes.cast(int) > 0): again, this SHOULD never error (per my above argument of "things that look locally correct should be globally correct" argument, and phillip's "spell things out bit by bit" argument), but in the current implementation, it sometimes errors.

@kszucs
Copy link
Member

kszucs commented Apr 29, 2024

things that look locally correct should be globally correct

The second part of your example looks fallible to me, so no surprise that I am getting a cast error since I cannot have any knowledge about what load does but I do know that t.votes is a string column.

t = load()
t = t.filter(t.votes.cast(int) > 0)

Apparently this only appears to be an issue for duckdb. At least I am unable to reproduce the problem on other backends which suggest that our .cast() behaviour if fairly inconsistent.

I would still like to have an example where the problem occurs without calling .cast().

@cpcloud
Copy link
Member

cpcloud commented Apr 29, 2024

Reposting my comment from #9065 (comment)

I took the weekend to think about some of the issues here and did a bit of research on the SQL standard specification of predicate short-circuiting.

  • According to this SO Q&A the SQL standard does not specify the evaluation order of predicates. That means that an engine is free to reorder (or not) predicates in any way AFAICT, including if it would change the semantics of the query.

For example, an engine can evaluate a fallible expression first, even if a subsequent expression with short circuiting would avoid the fallible expression evaluation failing.

In practice, DuckDB merges a fallible cast with an infallible operation in the same way producing the exact same plan regardless of how the SQL is specified:

D explain select * from (select * from t where cast(x as int) != 0) where x <> '*';

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ ((x != '*') AND (CAST(x AS│
│       INTEGER) != 0))     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             x             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 5           │
└───────────────────────────┘
D explain select * from (select * from t where x <> '*') where cast(x as int) != 0;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ ((x != '*') AND (CAST(x AS│
│       INTEGER) != 0))     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             x             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 5           │
└───────────────────────────┘

This means that even if Ibis decided to generate subqueries that map one-to-one to SQL selects it provides absolutely zero guarantee that generating the SQL in that way produces specific behavior.

IMO this means that Ibis should be free to merge predicates, assuming our implementation of that process is correct.

Now, I'm not trying to gloss over the fact that we have had a number of issues with select merging, and so I think we should provide a way to disable it if it can potentially help unblock users.

I think the opt-in behavior would be kind of pointless, since people will never opt-in without lots of documentation and if we're going to go that route we should disable it.

On the other hand, if we think this is long-term a thing we want enabled, then we definitely need feedback on it from users. The only way to get that feedback AFAICT is to enable this feature.

That does mean we have to accept that the current implementation may not be 100% finished and will elicit bug reports.

The opt-out behavior IMO seems like an acceptable balance here: people can disable select merging on an as-needed basis, and we can continuously evaluate whether keeping it around is causing overhead that isn't worth it.

Disabling select merging it turns out is also not without issue: SQLite fails due to a query we generate being too large for the parser stack, and MS SQL loses the ability to specify order by in many cases as well as LIKE functionality in certain positions.

cpcloud pushed a commit that referenced this issue Apr 30, 2024
Add a global option to enable or disable merging select statements where
possible. This can serve as an escape route if the optimization fails.
It is currently enabled by default, we may change to make it opt-in.

xref #9064 and #9058

Closes #9058.
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants