Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: disallow empty structs #9310

Closed
wants to merge 1 commit into from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Jun 4, 2024

working towards #8289

I'm not sure how useful empty structs are, since it seems like
only bigquery, dask, and pandas actually support them.
But still, if you stay in ibis-land, perhaps it is useful.
ie doing datatype kung-fu or only using them for intermediate calculations.
Not that hard for us to support it, so why not.

I'm not sure of the history of the specific disallowment
that I am removing from the type inference.

Relevant context:

@NickCrews NickCrews marked this pull request as draft June 4, 2024 18:47
@NickCrews NickCrews force-pushed the empty-struct branch 2 times, most recently from 51e35ce to c2da371 Compare June 4, 2024 19:30
@NickCrews NickCrews marked this pull request as ready for review June 4, 2024 20:29
@NickCrews
Copy link
Contributor Author

I don't think this is supported in many of our backends. How is this spelled in DuckDB for example?

Indeed, it is only supported on bigquery, dask, and pandas currently. Possibly could be implemented in other backends with some effort. But still, if you stay in ibis-land, perhaps it is useful. ie doing datatype kung-fu or only using them for intermediate calculations. Not that hard for us to support it, so why not.

Based off your comment in this other PR, I was thinking that your desired behavior was to allow empty structs, so I was working towards that.

But fine switching these tests to test for disallowment instead, if that is the desired behavior. but I just want to nail down the semantics here before I go back to #8666

@NickCrews
Copy link
Contributor Author

@jcrist @kszucs @gforsyth curious what yall think about the costs/benefits of allowing/disallowing empty structs

@cpcloud
Copy link
Member

cpcloud commented Jun 7, 2024

Not that hard for us to support it, so why not.

Well, famous last words :)

I was thinking that your desired behavior was to allow empty structs

I don't think I've seen a strong justification for their support in the form of a concrete use case, so I'd be in favor of dropping support for empty struct.

If we go down that route, we should also drop support for empty schemas for consistency, which also do not have any documented concrete user-facing use case.

@kszucs
Copy link
Member

kszucs commented Jun 10, 2024

I'm in favor of allowing both empty structs and empty schemas, this is the pyarrow behaviour.

@cpcloud
Copy link
Member

cpcloud commented Jun 10, 2024

Someone really needs to provide a concrete use case for empty structs and schemas to justify keeping it around, given that effectively one backend supports it.

"PyArrow supports it" is not a strong enough reason to do something that isn't well supported across our backends at all.

@cpcloud cpcloud added this to the 9.2 milestone Jun 13, 2024
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 29, 2024
See ibis-project#9310 for a discussion of whether
empty structs should be allowed or disallowed.
We settled on disallowing them for now,
since pyarrow is like the only backend to support them,
and it is definitely easier to change our mind and allow them later rather than
disallow them later.

Perhaps if you stay in ibis-land, emptry structs are useful ie for doing type manipulations, or maybe you
only use them for intermediate calculations?
But until we see this concrete use case, don't worry about it.
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 29, 2024
See ibis-project#9310 for a discussion of whether
empty structs should be allowed or disallowed.
We settled on disallowing them for now,
since pyarrow is like the only backend to support them,
and it is definitely easier to change our mind and allow them later rather than
disallow them later.

Perhaps if you stay in ibis-land, emptry structs are useful ie for doing type manipulations, or maybe you
only use them for intermediate calculations?
But until we see this concrete use case, don't worry about it.
@NickCrews NickCrews changed the title feat: allow empty structs test: disallow empty structs Jun 29, 2024
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 29, 2024
See ibis-project#9310 for a discussion of whether
empty structs should be allowed or disallowed.
We settled on disallowing them for now,
since pyarrow is like the only backend to support them,
and it is definitely easier to change our mind and allow them later rather than
disallow them later.

Perhaps if you stay in ibis-land, emptry structs are useful ie for doing type manipulations, or maybe you
only use them for intermediate calculations?
But until we see this concrete use case, don't worry about it.

I had to adjust the dask testcase generation because this change revealed a bug in our existing dask tests. `ibis.backends.dask.Backend.get_schema()` uses a method of `PandasData.infer_table(df.head(1))`, so we were infering the type of the `map_of_*` testing column as
`struct<>` since the first element in that test array of values was `{}`.
So the simple solution was to just put the "representative" value first. This doesn't really
solve the root problem in how dask does type inference, we probably want to improve this.
I'd guess we should infer the column type from a larger sample of values. 1 seems really absurdly small. I tried doing this but it was trickier than I would have thought,
with a head(10) I was only getting two values back, even thought there should be 3??
IDK that should get solved in a different PR.
NickCrews added a commit to NickCrews/ibis that referenced this pull request Jun 29, 2024
See ibis-project#9310 for a discussion of whether
empty structs should be allowed or disallowed.
We settled on disallowing them for now,
since pyarrow is like the only backend to support them,
and it is definitely easier to change our mind and allow them later rather than
disallow them later.

Perhaps if you stay in ibis-land, emptry structs are useful ie for doing type manipulations, or maybe you
only use them for intermediate calculations?
But until we see this concrete use case, don't worry about it.

I had to adjust the dask testcase generation because this change revealed a bug in our existing dask tests. `ibis.backends.dask.Backend.get_schema()` uses a method of `PandasData.infer_table(df.head(1))`, so we were infering the type of the `map_of_*` testing column as
`struct<>` since the first element in that test array of values was `{}`.
So the simple solution was to just put the "representative" value first. This doesn't really
solve the root problem in how dask does type inference, we probably want to improve this.
I'd guess we should infer the column type from a larger sample of values. 1 seems really absurdly small. I tried doing this but it was trickier than I would have thought,
with a head(10) I was only getting two values back, even thought there should be 3??
IDK that should get solved in a different PR.
@NickCrews
Copy link
Contributor Author

NickCrews commented Jun 29, 2024

hmm, our pyarrow type roundtrip fuzzing tests are failing. Can I get some help with hypothesis to limit it to only give us fields of length 1+? I don't understand hypothesis.

See ibis-project#9310 for a discussion of whether
empty structs should be allowed or disallowed.
We settled on disallowing them for now,
since pyarrow is like the only backend to support them,
and it is definitely easier to change our mind and allow them later rather than
disallow them later.

Perhaps if you stay in ibis-land, emptry structs are useful ie for doing type manipulations, or maybe you
only use them for intermediate calculations?
But until we see this concrete use case, don't worry about it.

I had to adjust the dask testcase generation because this change revealed a bug in our existing dask tests. `ibis.backends.dask.Backend.get_schema()` uses a method of `PandasData.infer_table(df.head(1))`, so we were infering the type of the `map_of_*` testing column as
`struct<>` since the first element in that test array of values was `{}`.
So the simple solution was to just put the "representative" value first. This doesn't really
solve the root problem in how dask does type inference, we probably want to improve this.
I'd guess we should infer the column type from a larger sample of values. 1 seems really absurdly small. I tried doing this but it was trickier than I would have thought,
with a head(10) I was only getting two values back, even thought there should be 3??
IDK that should get solved in a different PR.
@cpcloud
Copy link
Member

cpcloud commented Jun 30, 2024

@kszucs Can you comment on why empty structs are useful, independent of their existence in PyArrow? I still don't follow what their utility is from the perspective of an end user.

@NickCrews
Copy link
Contributor Author

Just to be clear, I swapped this PR to now be checking for disallowment

@cpcloud
Copy link
Member

cpcloud commented Jun 30, 2024

Yep, jut wanting to keep discussion all in one place!

@kszucs
Copy link
Member

kszucs commented Jul 1, 2024

@kszucs Can you comment on why empty structs are useful, independent of their existence in PyArrow? I still don't follow what their utility is from the perspective of an end user.

Struct values might be the outcome of other operations like converting a JSON typed column to a strictly typed struct typed column which would be fallible if empty struct typed are not supported. User defined functions may produce empty struct types as well.

@cpcloud cpcloud removed this from the 9.2 milestone Jul 15, 2024
@cpcloud
Copy link
Member

cpcloud commented Jul 16, 2024

Struct values might be the outcome of other operations like converting a JSON typed column to a strictly typed struct typed column which would be fallible if empty struct typed are not supported. User defined functions may produce empty struct types as well.

None of these operations can actually be executed on anything other than BigQuery, so being able to spell them in the type system is entirely moot.

@jcrist
Copy link
Member

jcrist commented Aug 12, 2024

What's the status of this PR? Does it supersede the work in #8876?

@cpcloud
Copy link
Member

cpcloud commented Sep 30, 2024

What's the status of this PR? Does it supersede the work in #8876?

I don't think so. This PR contains tests and a bit of implementation while 8876 contains more implementation work.

@cpcloud
Copy link
Member

cpcloud commented Sep 30, 2024

That said, I'm going to close both of them out in an effort to set expectations on the amount of maintenance I can take on right now.

@cpcloud cpcloud closed this Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants