json to parquet supported datatypes #1
Comments
I solved this in the CSV to Parquet conversion program by allowing an explicit schema to override the inferred types; doing the same thing in the JSON programs should be very straightforward. Have a look at https://github.com/HariSekhon/pytools/blob/master/spark_csv_to_parquet.py to see how it's done. I'll update the JSON programs with the same capability when I get time.
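To illustrate the explicit-schema idea described above, here is a minimal stdlib-Python sketch: a user-supplied mapping of field name to type name overrides whatever the loader would infer. The names (`coerce_record`, the type keywords, the timestamp format) are illustrative assumptions, not taken from the repo's actual code.

```python
from datetime import datetime

def coerce_record(record, schema):
    """Coerce each field of a parsed JSON record to the type named in schema.

    Fields not mentioned in the schema are left as-is, mirroring the
    "override only what you need" behaviour described in the comment.
    """
    casts = {
        "int": int,
        "double": float,
        "string": str,
        # Timestamps are parsed from a fixed format (an assumption here)
        # instead of being left as plain strings.
        "timestamp": lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S"),
    }
    return {k: casts[schema[k]](v) if k in schema else v
            for k, v in record.items()}

row = {"id": "42", "price": "9.99", "created": "2023-01-15 10:30:00"}
schema = {"id": "int", "price": "double", "created": "timestamp"}
print(coerce_record(row, schema))
```

In the real Spark scripts the equivalent step would be passing a `StructType` to the DataFrame reader rather than coercing rows by hand; this sketch only shows the shape of the override.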
Thanks @HariSekhon for the quick reply. I hope you will update the Docker image soon :)
I was thinking about implementing this right now, and one of the challenges is how to pass the schema types to the program in a generic way on the command line, either:
Btw, I believe the link you pasted is for writing out to JSON format, converting a field which is already a date in a DataFrame to a specific string format when writing a JSON file out, which is the inverse of what we're talking about here. However, I appreciate the same principle could apply to inferring a date field from a specific string found in JSON data, and we should incorporate that into one of the two solutions above (or another, if we come up with a better generic method). Ultimately the solution should be generic enough not to require rewriting code to make it work for everybody, as that's the most stable thing to do, and I can add suitable tests for it. Thanks, Hari
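The "inferring a date field from a specific string" idea mentioned above could look something like this hedged, stdlib-only sketch: try a list of candidate timestamp formats and convert a JSON string field to a real `datetime` if one matches, otherwise leave it as a string. The candidate format list and function name are assumptions for illustration.

```python
import json
from datetime import datetime

# Assumed candidate formats; a real tool would take these from the user.
CANDIDATE_FORMATS = ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%d/%m/%Y"]

def parse_maybe_timestamp(value):
    """Return a datetime if the value matches a known format, else the value."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return value

record = json.loads('{"event": "login", "when": "2023-01-15T10:30:00"}')
record["when"] = parse_maybe_timestamp(record["when"])
print(record["when"])
```

A format-driven approach like this stays generic: users supply formats rather than code, which matches the "no rewriting code per user" goal stated above.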
@HariSekhon Anyway, to address your suggestion, I think a file is more convenient. Lior
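The schema-file suggestion could be as simple as the following sketch: a small JSON file mapping field names to type names, loaded by the converter at startup. The file format and the `load_schema` name are hypothetical, not from the repo.

```python
import json
import tempfile

# Hypothetical schema-file contents: field name -> Spark-style type name.
schema_text = '{"id": "integer", "price": "double", "created": "timestamp"}'

# Write it to a temp file just to demonstrate the round trip.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(schema_text)
    path = f.name

def load_schema(path):
    """Read the field -> type mapping the converter would apply."""
    with open(path) as fh:
        return json.load(fh)

schema = load_schema(path)
print(schema)
```

A file avoids shell-quoting problems that a long `--schema field=type,...` command-line option would run into, which is presumably why it feels more convenient.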
I am trying to convert JSON files to Parquet in order to load them into AWS S3 and query them with Athena.
However, some fields (for example, timestamps in different formats) are automatically converted to strings (BYTE_ARRAYs in Parquet).
How can I control which field in the JSON is converted to which type in Parquet?
Can I just use a specific format, or is it simply not supported?
Thanks
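One workaround for the timestamp-becomes-BYTE_ARRAY problem (a sketch under assumptions, not the fix adopted in this thread) is to normalize timestamp strings to epoch milliseconds before conversion, so the field lands in Parquet as an integer that Athena can treat as a `bigint`. The ISO-8601 format and UTC assumption are illustrative.

```python
from datetime import datetime, timezone

def to_epoch_millis(ts):
    """Convert an assumed ISO-8601 UTC timestamp string to epoch milliseconds.

    An int64 column survives JSON-to-Parquet conversion with its type intact,
    unlike a free-form timestamp string, which is stored as a BYTE_ARRAY.
    """
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

print(to_epoch_millis("2023-01-15T10:30:00"))
```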