Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check mapping logic for glue data type to arrow data type #4

Open
matthewmturner opened this issue Jun 4, 2022 · 19 comments
Open

Check mapping logic for glue data type to arrow data type #4

matthewmturner opened this issue Jun 4, 2022 · 19 comments

Comments

@matthewmturner
Copy link
Collaborator

I tried using this with some of my data and got errors when mapping to arrow data types.

Specifically for glue type date and then also some nested data i have where there are struct and arrays.

It looks like you have logic for most / all arrow data types - were you able to test on real data and real catalog? Anything I can provide to help?

@matthewmturner matthewmturner changed the title Add more mappings from glue data type to arrow data type Check mapping logic for glue data type to arrow data type Jun 4, 2022
@timvw
Copy link
Member

timvw commented Jun 6, 2022

I tested this on a real catalog with real data as well.
Don't think any of the tables contained "complex" (arrays and/or nested structs)

Some glue schema + sample data to reproduce the date issue would be helpful.

Will try to add some additional methods that allow the user to decide/specify what the schema strategy should be (map from glue schema or infer or ..)

@timvw
Copy link
Member

timvw commented Jun 8, 2022

It took a while to remember, but the biggest pain point is indeed in the parsing of nested structs and maps (as that requires more than the current naive regex matching).

@timvw
Copy link
Member

timvw commented Jun 8, 2022

Changing the implementation to use Pest-parser instead of naive regular expressions.. Seems to work better..

PR (might need some cleanup) -> #5.

Released v0.1.2 on crates.io -> https://crates.io/crates/datafusion-catalogprovider-glue

@matthewmturner
Copy link
Collaborator Author

thanks! will try it out and report back.

@matthewmturner
Copy link
Collaborator Author

got this error when trying with the v0.1.2

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { variant: ParsingError { positives: [Int, Boolean, BigInt, Float, Double, Binary, Timestamp, String, Decimal, ArrayType, MapType, StructType], negatives: [] }, location: Pos(0), line_col: Pos((1, 1)), path: None, line: "date", continued_line: None }', /Users/matth/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-catalogprovider-glue-0.1.2/src/catalog_provider/glue.rs:234:83

@timvw
Copy link
Member

timvw commented Jun 9, 2022

Seems that I changed the behavior (panic while trying to register a table instead of returning an Err).
Will fix that.

Also see that you have a column of datatype "date" which is currently not supported.
Will add support for that as well.

@matthewmturner
Copy link
Collaborator Author

also how does the table registration work? for example lets say i have 2 tables in my catalog. 1 you are able to parse correctly and the other you cant. should the one that is parsed correctly get registered? or if one fails to register will they all fail to register?

@timvw
Copy link
Member

timvw commented Jun 9, 2022

Failure of one registration will not block/affect registration of another one.
For each table that is (tried to be) registered, you get back a result:

  • Ok (when it's registered),
  • Err (when it's not).

@timvw
Copy link
Member

timvw commented Jun 9, 2022

Released v0.1.4 which:

  • contains support for additional datatypes (including date)
  • corrected fieldnames for arrays (should be element) and maps (should be key_value)
  • documented known mapping issues in README page
    • Not able to infer whether a column is nullable or not
    • Not able to infer whether a column is signed or unsigned xInt
  • (in v1.0.3) replace a lot of .unwrap() with ok_or_else to leverage Result flow instead of panicking

@matthewmturner
Copy link
Collaborator Author

thanks! making progress. was able to register a bunch of tables. having some other issues now.

i tried querying one of the tables that was registered and got:

read 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DataFusionError(ExternalError(Execution("Failed to map column projection for field listing_activities. Incompatiable data types List(Field { name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }) and List(Field { name: \"element\" data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None })"))))

Also it looks like non parquet format may not be enabled as none of my json tables were registered because of unsupported format. i can create separate issue for that though.

@timvw
Copy link
Member

timvw commented Jun 9, 2022

I also have a table where I run into the "Failed to map column projection where datafusion names the field "item" and the code names it "element". The strange/interesting issue is that for other tables they are named "element".

Currently there is only support for Parquet, Avro and CSV (JSON is not as high on my wishlist as Delta, but in order to support Delta I'm waiting for datafusion 9 to be released).

@matthewmturner
Copy link
Collaborator Author

Ok - i might be able to work on json. ill see if i have time this weekend. thats an important one for me.

@timvw
Copy link
Member

timvw commented Jun 9, 2022

Here is a headstart ;)

https://github.com/datafusion-contrib/datafusion-catalogprovider-glue/tree/support-json

It fails on the only json table I have at hand for testing though.. (Probably need to do something with SerDeInfo "parameters": {"paths": "array" } metadata in Glue)

@matthewmturner
Copy link
Collaborator Author

Thanks!

I will check it out and let you know if any questions / comments.

@timvw
Copy link
Member

timvw commented Jun 10, 2022

Closing this now that v0.1.6 is released

Currently it is possible to decide on the strategy to use to determine the table schema

1/ Derive from Glue

  • fast
  • not always correct (eg: impossible to infer nullability of a column or signedness of a number)

2/ Infer from data

  • slow
  • will match whatever datafusion believes that data looks like

@timvw timvw closed this as completed Jun 10, 2022
@matthewmturner
Copy link
Collaborator Author

inference is working well but i still end up with that item vs element issue. You think updating from element to item would break something else?

@matthewmturner
Copy link
Collaborator Author

when i just create the table myself using CREATE EXTERNAL TABLE ... my list array has name item so i would have thought that was the right one.

@timvw
Copy link
Member

timvw commented Jun 10, 2022

Looking at the arrow-datafusion code I only find entries for "item" and none for "element" so I will rename that to "item" again...

@AlJohri
Copy link

AlJohri commented Oct 30, 2022

@timvw in arrow-datafusion, I believe this pull request added the functionality to round trip the list element name: apache/datafusion#2893

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants