Fix .json.gz files #1

@phucdev phucdev commented Jun 24, 2024

Fix .json.gz files

The problem with the .json.gz files

The original .json.gz data files in v1_20180119 and v3_20200302 are not formatted correctly: they are neither valid JSON nor valid JSON Lines.
See the following simplified examples:

JSON

[
    {
        "text": "lorem ipsum", 
        "label": "O"
    },
    {
        "text": "lorem ipsum", 
        "label": "O"
    }
]

JSON Lines

{"text": "lorem ipsum", "label": "O"}
{"text": "lorem ipsum", "label": "O"}

The format in the SmartData .json.gz files:

{
    "text": "lorem ipsum", 
    "label": "O"
}
{
    "text": "lorem ipsum", 
    "label": "O"
}

With this format, neither loading the whole file content at once as a single JSON document nor loading each line as its own JSON dict works. For example:

import gzip
import json

with gzip.open("v3_20200302/train.json.gz") as f:
    docs = json.load(f)

or

import gzip
import json

with gzip.open("v3_20200302/train.json.gz") as f:
    docs = []
    for line in f:
        docs.append(json.loads(line))

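Both result in a json.decoder.JSONDecodeError: the first because more data follows the first object, the second because a single line such as { is not a complete JSON value.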
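
For reference, files in the old layout can still be read by decoding one object at a time. The following is only a minimal sketch and not part of this PR: the helper names are hypothetical, UTF-8 encoding is an assumption, and the whole file is read into memory.

```python
import gzip
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so step over the
        # newlines and spaces between documents ourselves.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode parses one value and returns it with the index where it stopped.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def load_concatenated_json_gz(path):
    """Read a gzip-compressed file of concatenated JSON objects into a list.

    Assumes the file is UTF-8 encoded (hypothetical helper, not from this PR).
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return list(iter_concatenated_json(f.read()))
```

This relies on `json.JSONDecoder.raw_decode` from the standard library, which stops at the end of the first complete value instead of raising an error on the data that follows.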

What's new?

In this pull request I read the .avro files with the fastavro library and re-exported them in the JSON Lines format.
As a side effect, the newly exported data files no longer contain the extra keys for string, array, etc. that were present in the original JSON export produced by the old Scala/Java codebase.
The data files in v2_20190802 were already correctly formatted as JSON Lines, but I re-exported them as well to get rid of the extra keys for string, array, etc.

I also added the conversion script.
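
The actual conversion script is in the commits of this PR. As a rough illustration of the approach only, an Avro to JSON Lines re-export could look like the sketch below. The function names are hypothetical, UTF-8 output is an assumption, and the sketch assumes every record field is JSON-serializable (bytes or datetime values would need extra handling).

```python
import gzip
import json


def read_avro_records(avro_path):
    """Yield records from an Avro container file (needs the fastavro package)."""
    import fastavro  # imported lazily so the helpers below work without it

    with open(avro_path, "rb") as fo:
        # fastavro.reader iterates over the records as plain Python dicts.
        for record in fastavro.reader(fo):
            yield record


def write_jsonl_gz(records, out_path):
    """Write one compact JSON object per line to a gzip file; return the record count."""
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for record in records:
            # With the default indent=None, json.dumps emits no raw newlines,
            # so every record stays on exactly one line.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl_gz(path):
    """Load a gzip-compressed JSON Lines file into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With these helpers, a single file would be converted with something like `write_jsonl_gz(read_avro_records(src), dst)`, where `src` and `dst` stand for the input .avro path and the output .json.gz path, and the result can be read back with `read_jsonl_gz(dst)`.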

phucdev added 2 commits June 24, 2024 16:22
- The original .json.gz format was not valid JSON and also contained extra "array" keys for arrays etc.