test: disallow empty structs #9310
Conversation
51e35ce to c2da371 (compare)
Indeed, it is currently only supported on BigQuery, Dask, and pandas. It could possibly be implemented in other backends with some effort. Still, if you stay in ibis-land, perhaps it is useful, e.g. for datatype manipulation or for intermediate calculations; it's not that hard for us to support, so why not. Based on your comment in the other PR, I thought your desired behavior was to allow empty structs, so I was working towards that. I'm fine switching these tests to check that empty structs are disallowed instead, if that is the desired behavior, but I want to nail down the semantics here before I go back to #8666.
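For illustration, a rough sketch of the "disallow" behavior under discussion; the constructor calls and the exception type here are assumptions for the sketch, not necessarily the exact ibis API:

```python
# Hypothetical sketch of the behavior being discussed: constructing an empty
# struct type or an empty schema should fail fast rather than silently
# producing struct<>. The exact exception type is an assumption.
import pytest

import ibis
import ibis.expr.datatypes as dt


def test_empty_struct_type_is_rejected():
    # A struct type with zero fields is rejected outright.
    with pytest.raises(Exception):  # exact exception type is an assumption
        dt.Struct({})


def test_empty_schema_is_rejected():
    # For consistency, an empty schema is rejected the same way.
    with pytest.raises(Exception):
        ibis.schema({})
```

The point is just that an empty field mapping errors at construction time instead of propagating a `struct<>` type through the expression system.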
Well, famous last words :)
I don't think I've seen a strong justification for their support in the form of a concrete use case, so I'd be in favor of dropping support for empty structs. If we go down that route, we should also drop support for empty schemas for consistency, since they also have no documented, concrete, user-facing use case.
I'm in favor of allowing both empty structs and empty schemas; this is the PyArrow behaviour.
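For context, PyArrow itself does accept zero-field struct types and empty schemas; a quick demonstration of PyArrow's behaviour only (not of any ibis API):

```python
# Demonstrates PyArrow's behaviour only: Arrow allows a struct type with zero
# fields and a schema with zero columns; an array of an empty struct type
# still carries a length.
import pyarrow as pa

empty_struct = pa.struct([])   # struct<> -- a struct type with no fields
empty_schema = pa.schema([])   # a schema with no columns

arr = pa.array([{}, {}], type=empty_struct)
print(arr.type)   # struct<>
print(len(arr))   # 2
```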
Someone really needs to provide a concrete use case for empty structs and schemas to justify keeping it around, given that effectively one backend supports it. "PyArrow supports it" is not a strong enough reason to do something that isn't well supported across our backends at all.
See ibis-project#9310 for a discussion of whether empty structs should be allowed or disallowed. We settled on disallowing them for now, since pyarrow is just about the only backend to support them, and it is much easier to change our minds and allow them later than to disallow them later. Perhaps empty structs are useful if you stay in ibis-land, e.g. for doing type manipulations or for intermediate calculations, but until we see a concrete use case, don't worry about it.

I had to adjust the dask test case generation because this change revealed a bug in our existing dask tests. `ibis.backends.dask.Backend.get_schema()` uses `PandasData.infer_table(df.head(1))`, so we were inferring the type of the `map_of_*` test column as `struct<>`, because the first element in that test array of values was `{}`. The simple fix was to put a "representative" value first. This doesn't really solve the root problem with how dask does type inference; we probably want to improve it by inferring the column type from a larger sample of values, since a sample size of 1 is absurdly small. I tried doing that, but it was trickier than I expected: with `head(10)` I was only getting two values back, even though there should be 3. That should get solved in a different PR.
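To illustrate the sampling issue (the column name and values below are made up for demonstration, not the actual test data):

```python
# Made-up data illustrating why sampling only the first row is fragile: if the
# first value is {}, a head(1)-based inference sees an "empty struct" and never
# learns the real key/value types.
import pandas as pd

bad_order = pd.DataFrame({"map_of_int64": [{}, {"a": 1, "b": 2}, None]})
print(bad_order.head(1))   # only {} is visible -> inferred as struct<>

# The workaround used here: put a representative value first.
good_order = pd.DataFrame({"map_of_int64": [{"a": 1, "b": 2}, {}, None]})
print(good_order.head(1))  # a populated dict is visible to the inference
```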
Hmm, our pyarrow type roundtrip fuzzing tests are failing. Can I get some help with Hypothesis to limit it to only generate structs with at least one field? I don't understand Hypothesis.
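A generic sketch of how Hypothesis collection strategies can be constrained; the field-name and child-type strategies below are placeholders, not the real strategies in the ibis test suite:

```python
# Generic sketch only: the field-name and child-type strategies are
# placeholders. The key point is that passing min_size=1 to st.dictionaries
# guarantees at least one field, so an empty struct type is never generated.
import hypothesis.strategies as st

field_names = st.text(min_size=1)
child_types = st.sampled_from(["int64", "float64", "string"])  # placeholder

non_empty_struct_fields = st.dictionaries(field_names, child_types, min_size=1)
```

If the fuzzing strategy builds struct fields from a list rather than a dictionary, `st.lists(..., min_size=1)` accepts the same argument.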
@kszucs Can you comment on why empty structs are useful, independent of their existence in PyArrow? I still don't follow what their utility is from the perspective of an end user.
Just to be clear, I swapped this PR to now check that empty structs are disallowed.
Yep, just wanting to keep the discussion all in one place!
Struct values might be the outcome of other operations, like converting a JSON-typed column to a strictly typed struct column, which would be fallible if empty struct types are not supported. User-defined functions may produce empty struct types as well.
None of these operations can actually be executed on anything other than BigQuery, so being able to spell them in the type system is entirely moot.
What's the status of this PR? Does it supersede the work in #8876?
I don't think so. This PR contains tests and a bit of implementation, while #8876 contains more implementation work.
That said, I'm going to close both of them out in an effort to set expectations on the amount of maintenance I can take on right now.
working towards #8289
I'm not sure how useful empty structs are, since it seems like only BigQuery, Dask, and pandas actually support them. Still, if you stay in ibis-land, perhaps they are useful, e.g. for datatype manipulation or for intermediate calculations. It's not that hard for us to support them, so why not.

I'm not sure of the history behind the specific disallowance that I am removing from the type inference.
Relevant context: