-
It's been a long time since I used HDF databases. When you say performance degradation, do you mean compared to reading from a CSV file, or that creating the database is just slow? I think if you lower the compression, it might speed things up a bit.
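For instance, a minimal sketch of what lowering the compression looks like with h5py (file and dataset names here are just placeholders):

```python
import h5py
import numpy as np

data = np.arange(1_000_000)

with h5py.File("example.h5", "w") as h5f:
    # gzip level 9 is slow to write; lower levels or the lzf filter are much faster.
    h5f.create_dataset("slow", data=data, compression="gzip", compression_opts=9)
    h5f.create_dataset("faster", data=data, compression="gzip", compression_opts=1)
    h5f.create_dataset("fastest", data=data, compression="lzf")
    h5f.create_dataset("none", data=data)  # no filter at all
```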
-
Hi,
I'm trying to convert several TSV files from the C4 200M dataset into HDF5 format, basing my conversion on your notebook.
The dataset consists of 10 files, each containing approximately 18 million records with 2 string columns.
Given the size of the dataset, I thought that converting it to HDF5 would let me know the shape of each file up front and significantly speed up reading chunks of it.
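For example, once converted, the shape is available without scanning the file, and row ranges can be sliced cheaply (assuming the records end up in a dataset named "input"; the name is hypothetical):

```python
import h5py

with h5py.File("c4_file_00.h5", "r") as h5f:
    dset = h5f["input"]            # hypothetical dataset name
    print(dset.shape)              # known immediately, no full scan
    rows = dset[100_000:110_000]   # reads only the requested slice
```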
In a first trial I converted 1 million records in about 3 minutes; however, converting all 18 million records is taking more than 6 hours per file.
I am currently loading my TSV in the following way, where I set num_lines equal to the total number of lines in each file and chunksize = 10000.
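Roughly, the loop looks like this (paths, column indices, and dataset names here are illustrative, and the compression settings are a guess at the notebook's defaults):

```python
import pandas as pd
import h5py

num_lines = 18_000_000   # total lines of this TSV file
chunksize = 10_000

with h5py.File("c4_file_00.h5", "w") as h5f:
    # Pre-allocate one variable-length string dataset per column.
    dset1 = h5f.create_dataset("input", shape=(num_lines,),
                               dtype=h5py.string_dtype(),
                               compression="gzip", compression_opts=9)
    dset2 = h5f.create_dataset("target", shape=(num_lines,),
                               dtype=h5py.string_dtype(),
                               compression="gzip", compression_opts=9)

    reader = pd.read_csv("c4_file_00.tsv", sep="\t", header=None,
                         chunksize=chunksize)
    for i, chunk in enumerate(reader):
        start = i * chunksize
        stop = start + len(chunk)   # last chunk may be shorter
        dset1[start:stop] = chunk.iloc[:, 0].astype(str).values
        dset2[start:stop] = chunk.iloc[:, 1].astype(str).values
```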
I did not expect this performance degradation. Have you ever tried to use your code to convert a dataset of a similar size?
Thanks in advance.