Fix .json.gz files #1

phucdev · 2024-06-24T14:46:01Z

Fix .json.gz files

The problem with the .json.gz files

The original .json.gz data files in v1_20180119 and v3_20200302 were not formatted correctly: they are neither valid JSON nor valid JSON Lines.
See the following simplified examples:

JSON

[
    {
        "text": "lorem ipsum", 
        "label": "O"
    },
    {
        "text": "lorem ipsum", 
        "label": "O"
    }
]

JSON Lines

{"text": "lorem ipsum", "label": "O"}
{"text": "lorem ipsum", "label": "O"}

The format in the SmartData .json.gz files:

{
    "text": "lorem ipsum", 
    "label": "O"
}
{
    "text": "lorem ipsum", 
    "label": "O"
}

Neither loading the whole file content at once as a single JSON document nor loading each line as a separate JSON object works with this format. For example:

import gzip
import json

with gzip.open("v3_20200302/train.json.gz") as f:
    docs = json.load(f)

or

import gzip
import json

with gzip.open("v3_20200302/train.json.gz") as f:
    docs = []
    for line in f:
        docs.append(json.loads(line))

Both result in a json.decoder.JSONDecodeError.
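For anyone who still needs to read the old files, a stream of back-to-back JSON objects can be parsed incrementally with the standard library's `json.JSONDecoder.raw_decode`. This is only a minimal sketch, not part of this Pull Request: the helper name `iter_concatenated_json` is hypothetical, and the sample string mirrors the simplified example above rather than the real data.

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value from a string of back-to-back JSON documents.

    Hypothetical helper for illustration; it is not the conversion script from this PR.
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it manually.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # raw_decode returns the parsed value and the index where it stopped.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Simplified sample in the same shape as the old .json.gz content.
sample = """{
    "text": "lorem ipsum",
    "label": "O"
}
{
    "text": "lorem ipsum",
    "label": "O"
}
"""

docs = list(iter_concatenated_json(sample))
```

For a real file, the text would first be read with `gzip.open(path, "rt", encoding="utf-8")`, assuming the content is UTF-8 and fits in memory.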

What's new?

In this Pull Request I read the .avro files using the fastavro library and re-exported them in the JSON Lines format.
As a side effect, the newly exported data files no longer contain the extra keys for string, array, etc. that were present in the original JSON export produced by the old Scala/Java codebase.
While the data files in v2_20190802 were formatted correctly in the JSON Lines format, I also re-exported those files to get rid of the extra keys for string, array, etc.

I also added the conversion script (convert_avro2jsonl.py).

- The original .json.gz format was not valid JSON and also contained extra "array" keys for arrays etc.
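The re-export described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the contents of convert_avro2jsonl.py: the function names are hypothetical, `default=str` is an assumed fallback for values that `json` cannot serialize natively, and the round-trip demo uses made-up records in the shape of the simplified examples. The only fastavro call used is `fastavro.reader`, which iterates over the records of an open Avro file.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(records, out_path):
    """Write an iterable of dict-like records as gzip-compressed JSON Lines."""
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for record in records:
            # One compact JSON object per line; default=str is an assumption
            # for values (e.g. bytes, dates) that json cannot encode directly.
            out.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
            count += 1
    return count


def read_jsonl_gz(path):
    """Read a gzip-compressed JSON Lines file back into a list of objects."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def convert_avro_to_jsonl(avro_path, out_path):
    """Hypothetical wrapper: stream records from an .avro file into .jsonl.gz."""
    import fastavro  # third-party; imported lazily so the helpers above work without it

    with open(avro_path, "rb") as fo:
        return write_jsonl_gz(fastavro.reader(fo), out_path)


# Round-trip demo with made-up records (no Avro file needed).
demo_records = [
    {"text": "lorem ipsum", "label": "O"},
    {"text": "dolor sit amet", "label": "O"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl.gz")
    written = write_jsonl_gz(demo_records, demo_path)
    round_tripped = read_jsonl_gz(demo_path)
```

Files written this way load with the line-by-line pattern shown earlier, since every line is a complete JSON object.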

phucdev added 2 commits June 24, 2024 16:22

Re-exported the avro files to .jsonl.gz format via fastavro

fb21e8e

- The original .json.gz format was not valid JSON and also contained extra "array" keys for arrays etc.

Add convert_avro2jsonl.py script

3f85833
