Check mapping logic for glue data type to arrow data type #4
Comments
I tested this on a real catalog with real data as well. A Glue schema plus some sample data to reproduce the date issue would be helpful. I will try to add some additional methods that let the user decide what the schema strategy should be (map from the Glue schema, infer from data, or ..) |
It took a while to remember, but the biggest pain point is indeed in the parsing of nested structs and maps (as that requires more than the current naive regex matching). |
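To illustrate why nested types need more than regex matching, here is a minimal hand-written recursive-descent sketch for Hive-style type strings as they appear in Glue. It is not the crate's implementation (which moved to a Pest grammar); the `GlueType` enum and the function names are hypothetical, and only `array`, `map`, `struct` and plain primitive names are handled.

```rust
/// Simplified type tree for Glue (Hive-style) column types.
/// Hypothetical stand-in, not the crate's actual representation.
#[derive(Debug, PartialEq)]
enum GlueType {
    Primitive(String),
    Array(Box<GlueType>),
    Map(Box<GlueType>, Box<GlueType>),
    Struct(Vec<(String, GlueType)>),
}

/// Parse a complete type string, e.g. `array<struct<a:int,b:map<string,date>>>`.
fn parse_glue_type(input: &str) -> Result<GlueType, String> {
    let (ty, rest) = parse_type(input.trim())?;
    if rest.trim().is_empty() {
        Ok(ty)
    } else {
        Err(format!("unexpected trailing input: {rest:?}"))
    }
}

/// Consume one expected delimiter from the front of `s`.
fn expect(s: &str, c: char) -> Result<&str, String> {
    let s = s.trim_start();
    s.strip_prefix(c)
        .ok_or_else(|| format!("expected '{c}' at {s:?}"))
}

/// Parse one type from the front of `s`; return it with the unconsumed rest.
fn parse_type(s: &str) -> Result<(GlueType, &str), String> {
    let s = s.trim_start();
    if let Some(rest) = s.strip_prefix("array<") {
        let (elem, rest) = parse_type(rest)?;
        let rest = expect(rest, '>')?;
        Ok((GlueType::Array(Box::new(elem)), rest))
    } else if let Some(rest) = s.strip_prefix("map<") {
        let (key, rest) = parse_type(rest)?;
        let rest = expect(rest, ',')?;
        let (value, rest) = parse_type(rest)?;
        let rest = expect(rest, '>')?;
        Ok((GlueType::Map(Box::new(key), Box::new(value)), rest))
    } else if let Some(mut rest) = s.strip_prefix("struct<") {
        let mut fields = Vec::new();
        loop {
            let colon = rest.find(':').ok_or("struct field is missing ':'")?;
            let name = rest[..colon].trim();
            if name.is_empty() || name.contains(|c: char| matches!(c, '<' | '>' | ',')) {
                return Err(format!("invalid struct field name: {name:?}"));
            }
            // The field type may itself be nested, hence the recursion.
            let (ty, after) = parse_type(&rest[colon + 1..])?;
            fields.push((name.to_string(), ty));
            let after = after.trim_start();
            if let Some(next) = after.strip_prefix(',') {
                rest = next;
            } else if let Some(next) = after.strip_prefix('>') {
                return Ok((GlueType::Struct(fields), next));
            } else {
                return Err(format!("expected ',' or '>' at {after:?}"));
            }
        }
    } else {
        // Primitive: read up to the next top-level ',' or '>',
        // skipping over parentheses as in `decimal(10,2)`.
        let mut depth = 0usize;
        let mut end = s.len();
        for (i, c) in s.char_indices() {
            match c {
                '(' => depth += 1,
                ')' => depth = depth.saturating_sub(1),
                ',' | '>' if depth == 0 => {
                    end = i;
                    break;
                }
                _ => {}
            }
        }
        let name = s[..end].trim();
        if name.is_empty() {
            return Err(format!("expected a type name at {s:?}"));
        }
        Ok((GlueType::Primitive(name.to_string()), &s[end..]))
    }
}
```

Calling `parse_glue_type("array<struct<a:int,b:map<string,date>>>")` yields a nested tree, while an unbalanced string such as `array<int` returns an `Err`. A single regular expression cannot match brackets nested to arbitrary depth, which is why a recursive parser or a grammar fits better here.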
Changing the implementation to use the Pest parser instead of naive regular expressions seems to work better. PR (might need some cleanup): #5. Released v0.1.2 on crates.io: https://crates.io/crates/datafusion-catalogprovider-glue |
thanks! will try it out and report back. |
Got this error when trying v0.1.2
|
It seems that I changed the behavior (it now panics while trying to register a table instead of returning an Err). I also see that you have a column of data type "date", which is currently not supported. |
Also, how does the table registration work? For example, let's say I have 2 tables in my catalog: one you are able to parse correctly and one you can't. Should the one that is parsed correctly get registered? Or if one fails to register, will they all fail to register? |
Failure of one registration will not block/affect registration of another one.
|
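The behavior described in the last two replies (return an Err rather than panic, and let each table succeed or fail on its own) could look roughly like the sketch below. `TableDef`, `SUPPORTED`, `register_table` and `register_all` are hypothetical names used only for illustration, and the list of supported types is an assumption, not the crate's actual list.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a table definition read from the Glue catalog.
struct TableDef {
    name: String,
    column_types: Vec<String>,
}

/// Illustrative subset of supported column types (an assumption).
const SUPPORTED: &[&str] = &["int", "bigint", "string", "double", "boolean"];

/// Try to register one table; report failure as an Err instead of panicking.
fn register_table(table: &TableDef) -> Result<(), String> {
    for ty in &table.column_types {
        if !SUPPORTED.contains(&ty.as_str()) {
            return Err(format!("unsupported column type \"{ty}\""));
        }
    }
    Ok(())
}

/// Register every table independently: a failure is recorded with its
/// reason and the loop moves on to the next table.
fn register_all(tables: &[TableDef]) -> (Vec<String>, HashMap<String, String>) {
    let mut registered = Vec::new();
    let mut failed = HashMap::new();
    for table in tables {
        match register_table(table) {
            Ok(()) => registered.push(table.name.clone()),
            Err(reason) => {
                failed.insert(table.name.clone(), reason);
            }
        }
    }
    (registered, failed)
}
```

Returning the failures alongside the successes lets the caller log or inspect why a given table was skipped without losing the tables that did register.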
Released v0.1.4 which:
|
Thanks! Making progress. I was able to register a bunch of tables, but I'm having some other issues now. I tried querying one of the tables that was registered and got:
Also, it looks like non-Parquet formats may not be enabled, as none of my JSON tables were registered because of an unsupported format. I can create a separate issue for that though. |
I also have a table where I run into the "Failed to map column projection" error, where datafusion names the field "item" and the code names it "element". The strange/interesting thing is that for other tables they are named "element". Currently there is only support for Parquet, Avro and CSV (JSON is not as high on my wishlist as Delta, but in order to support Delta I'm waiting for datafusion 9 to be released). |
Ok, I might be able to work on JSON. I'll see if I have time this weekend. That's an important one for me. |
Here is a head start ;) https://github.com/datafusion-contrib/datafusion-catalogprovider-glue/tree/support-json It fails on the only JSON table I have at hand for testing though. (I probably need to do something with the SerDeInfo "parameters": {"paths": "array" } metadata in Glue.) |
Thanks! I will check it out and let you know if any questions / comments. |
Closing this now that v0.1.6 is released. Currently it is possible to decide on the strategy used to determine the table schema:
1/ Derive from Glue
2/ Infer from data
|
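As a rough sketch of what choosing between the two strategies could look like from calling code: the enum, the `Columns` alias and `resolve_schema` below are hypothetical names and do not reflect the crate's real API. Columns are modelled as plain (name, type) string pairs to keep the example self-contained.

```rust
/// Hypothetical strategy selector mirroring the two options above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SchemaStrategy {
    DeriveFromGlue,
    InferFromData,
}

/// (column name, column type) pairs; a simplification of a real schema.
type Columns = Vec<(String, String)>;

/// Pick the schema according to the chosen strategy.
/// `infer_from_data` is only invoked when inference is requested,
/// so no data files are touched when the Glue schema is used.
fn resolve_schema<F>(
    strategy: SchemaStrategy,
    glue_columns: &[(String, String)],
    infer_from_data: F,
) -> Result<Columns, String>
where
    F: FnOnce() -> Result<Columns, String>,
{
    match strategy {
        SchemaStrategy::DeriveFromGlue => {
            if glue_columns.is_empty() {
                Err("Glue table definition has no columns".to_string())
            } else {
                Ok(glue_columns.to_vec())
            }
        }
        SchemaStrategy::InferFromData => infer_from_data(),
    }
}
```

Passing inference as a closure keeps the potentially expensive read of the underlying files lazy, which matters when most tables can be mapped directly from their Glue definition.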
Inference is working well, but I still end up with that item vs element issue. Do you think updating from element to item would break something else? |
when i just create the table myself using |
Looking at the arrow-datafusion code I only find entries for "item" and none for "element" so I will rename that to "item" again... |
@timvw in arrow-datafusion, I believe this pull request added the functionality to round trip the list element name: apache/datafusion#2893 |
I tried using this with some of my data and got errors when mapping to Arrow data types.
Specifically for the Glue type "date", and also for some nested data I have that contains structs and arrays.
It looks like you have logic for most / all Arrow data types. Were you able to test on real data and a real catalog? Is there anything I can provide to help?