Fix .json.gz files
The problem with the .json.gz files
The original .json.gz data files (in v1_20180119 and v3_20200302) were not formatted correctly: their content is neither valid JSON nor valid JSON Lines.
See the following simplified examples:
JSON
JSON Lines
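As a minimal sketch of the difference between these two standard formats, the following uses hypothetical records (the keys `id` and `text` are illustrative assumptions, not the SmartData schema). A JSON file is parsed as one document, a JSON Lines file is parsed one line at a time, and applying the whole-file loader to JSON Lines content raises a json.decoder.JSONDecodeError.

```python
import json

# Hypothetical records for illustration only; not the SmartData schema.
records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]

# JSON: the whole file is a single document (here, one array).
json_text = json.dumps(records)

# JSON Lines: one complete JSON value per line.
jsonl_text = "\n".join(json.dumps(r) for r in records)


def load_json(text):
    """Parse the whole content as a single JSON document."""
    return json.loads(text)


def load_json_lines(text):
    """Parse each non-empty line as its own JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


assert load_json(json_text) == records
assert load_json_lines(jsonl_text) == records

# The whole-file loader cannot parse content that holds several documents.
try:
    load_json(jsonl_text)
except json.JSONDecodeError as err:
    print("whole-file load failed:", err.msg)
```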
The format in the SmartData .json.gz files:
Neither loading the whole file content at once as one JSON dict nor loading each line as one JSON dict works with this format. Both approaches result in a json.decoder.JSONDecodeError.
What's new?
In this pull request I read the .avro files using the fastavro library and re-exported them in the JSON Lines format.
As a side effect, these newly exported data files do not contain extra keys for string, array, etc. like the original JSON export that was done with the old Scala/Java codebase.
While the data files in v2_20190802 were already formatted correctly in the JSON Lines format, I also re-exported those files to get rid of the extra keys for string, array, etc.
I also added the conversion script.
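The re-export described above could be sketched roughly as follows. This is not the conversion script from this pull request; the function names and file paths are hypothetical, and the sketch assumes every Avro record contains only JSON-serializable values (no bytes, decimals, or timestamps). It relies on `fastavro.reader`, which iterates over the records of an Avro file as plain Python dicts.

```python
import gzip
import json


def write_json_lines(records, out_path):
    """Write an iterable of dicts to a gzip-compressed JSON Lines file."""
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for record in records:
            # One complete JSON document per line.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_json_lines(path):
    """Read a gzip-compressed JSON Lines file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]


def convert_avro_to_json_lines(avro_path, out_path):
    """Re-export one .avro file as .json.gz in JSON Lines format."""
    # Imported here so the two helpers above work without fastavro installed.
    from fastavro import reader

    with open(avro_path, "rb") as src:
        return write_json_lines(reader(src), out_path)
```

A call such as `convert_avro_to_json_lines("input.avro", "output.json.gz")` (placeholder paths) returns the number of records written, and `read_json_lines` can then be used to check that every line of the result parses on its own.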