-
It's been a long time since I used HDF databases. When you say performance degradation, do you mean compared to reading from a CSV file, or that creating the database is just slow? I think if you lower the compression, it might speed things up a bit.
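For instance, a minimal sketch of what lowering the compression looks like with h5py (file and dataset names here are just placeholders):

```python
import h5py
import numpy as np

data = np.arange(1_000_000)

with h5py.File("example.h5", "w") as h5f:
    # gzip level 9 is slow to write; lower levels or the lzf filter are much faster.
    h5f.create_dataset("slow", data=data, compression="gzip", compression_opts=9)
    h5f.create_dataset("faster", data=data, compression="gzip", compression_opts=1)
    h5f.create_dataset("fastest", data=data, compression="lzf")
    h5f.create_dataset("none", data=data)  # no filter at all
```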
-
Hi,
I'm trying to convert several TSV files from the C4 200M dataset into HDF5 format, basing my conversion on your notebook.
The dataset consists of 10 files, each containing approximately 18 million records with 2 string columns.
Given the size of the dataset, I thought that converting it to HDF5 would let me know the shape of each file up front and significantly speed up reading chunks of it.
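For example, once converted, the shape is available without scanning the file, and row ranges can be sliced cheaply (assuming the records end up in a dataset named "input"; the name is hypothetical):

```python
import h5py

with h5py.File("c4_file_00.h5", "r") as h5f:
    dset = h5f["input"]            # hypothetical dataset name
    print(dset.shape)              # known immediately, no full scan
    rows = dset[100_000:110_000]   # reads only the requested slice
```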
In a first trial I converted 1 million records in about 3 minutes; however, converting all 18 million records is taking more than 6 hours per file.
I am currently loading my TSV in the following way, where I set num_lines equal to the total number of lines in each file and chunksize = 10000.
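Roughly, the loop looks like this (paths, column indices, and dataset names here are illustrative, and the compression settings are a guess at the notebook's defaults):

```python
import pandas as pd
import h5py

num_lines = 18_000_000   # total lines of this TSV file
chunksize = 10_000

with h5py.File("c4_file_00.h5", "w") as h5f:
    # Pre-allocate one variable-length string dataset per column.
    dset1 = h5f.create_dataset("input", shape=(num_lines,),
                               dtype=h5py.string_dtype(),
                               compression="gzip", compression_opts=9)
    dset2 = h5f.create_dataset("target", shape=(num_lines,),
                               dtype=h5py.string_dtype(),
                               compression="gzip", compression_opts=9)

    reader = pd.read_csv("c4_file_00.tsv", sep="\t", header=None,
                         chunksize=chunksize)
    for i, chunk in enumerate(reader):
        start = i * chunksize
        stop = start + len(chunk)   # last chunk may be shorter
        dset1[start:stop] = chunk.iloc[:, 0].astype(str).values
        dset2[start:stop] = chunk.iloc[:, 1].astype(str).values
```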
I did not expect this performance degradation. Have you ever tried to use your code to convert a dataset of a similar size?
Thanks in advance.