The newly added zip file table to SQLite index conversion does not have a progress bar, and it can still take quite a long time, e.g., for this real dataset. This dataset is also a nice benchmark for rapidgzip integration for zip because all of the files are deflate compressed.
There probably is a lot of optimization that can still be done for this, e.g., batching the SQLite insertions, or moving the zip metadata reading to native code, for example by integrating it into rapidgzip.
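A quick sketch of what batched insertion could look like. The table and column names here are made up for illustration and are not ratarmount's actual index schema:

```python
import sqlite3

# Hypothetical schema for illustration only, not ratarmount's actual one.
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE files (name TEXT, header_offset INTEGER, size INTEGER)')

rows = [(f'file{i}.dcm', i * 512, 512) for i in range(100_000)]

# A single executemany inside one transaction avoids per-row statement
# preparation and per-row commit overhead compared to looping over execute().
with connection:
    connection.executemany('INSERT INTO files VALUES (?, ?, ?)', rows)

print(connection.execute('SELECT COUNT(*) FROM files').fetchone()[0])  # 100000
```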
Finally done:
> time ratarmount rsna-intracranial-hemorrhage-detection.zip
Creating new SQLite index database at rsna-intracranial-hemorrhage-detection.zip.index.sqlite
Creating offset dictionary for rsna-intracranial-hemorrhage-detection.zip ...
Creating offset dictionary for rsna-intracranial-hemorrhage-detection.zip took 993.41s

real    16m39.042s
user    0m24.294s
sys     0m9.148s
> time find rsna-intracranial-hemorrhage-detection/rsna-intracranial-hemorrhage-detection//stage_2_test/ | wc -l
121233

real    0m1.540s
user    0m0.041s
sys     0m0.040s
More granular timings show that SQLite index insertion is definitely not the bottleneck:
Create SQLite database for 874037 items
zipfile.infolist() took 0.000s
conversion to file infos took 1060.196s
conversion to file infos without calling findDataOffset took 2.401s
conversion to file infos without calling struct.Struct inside findDataOffset took
insertion into index took 6.220s
The problem seems to be that we have to seek to each local file header and read over the header to find out the data offset. Of course, this will take some time, especially on slow hard drives.
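The per-member work boils down to something like the following sketch of the data offset computation (the function name mirrors findDataOffset, but the body here is my own illustration). Each call costs one seek plus a small read, which adds up over 874037 members:

```python
import io
import struct
import zipfile

def find_data_offset(fileobj, header_offset):
    """Return the offset of a member's compressed data by reading its
    local file header (a sketch of the work that has to be done per member)."""
    fileobj.seek(header_offset)
    header = fileobj.read(30)  # fixed-size part of the local file header
    if header[:4] != b'PK\x03\x04':
        raise ValueError('not a local file header')
    name_length, extra_length = struct.unpack('<HH', header[26:30])
    # An encrypted member would additionally have a 12 B encryption header here.
    return header_offset + 30 + name_length + extra_length

# Demo on a small in-memory archive with a stored (uncompressed) member:
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_STORED) as archive:
    archive.writestr('a.txt', 'hello')
with zipfile.ZipFile(buf) as archive:
    info = archive.infolist()[0]
buf.seek(find_data_offset(buf, info.header_offset))
print(buf.read(info.file_size))  # b'hello'
```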
I'm not sure whether the data offset can be reliably determined from the central directory at the end of the zip, which is probably also what zipfile reads. Theoretically, it should be redundant with the local headers, but it feels unreliable. The data offset currently is computed as the header offset + 30 B of fixed local file header fields + file name length + extra field length + a 12 B encryption header if the member is encrypted. The extra field lengths may differ between the two records, e.g., the extra field may contain the 64-bit sizes in the zip64 format. The file name should probably be identical in the central directory, but per the specification it might differ:
4.4.17.2 If using the Central Directory Encryption Feature and
general purpose bit flag 13 is set indicating masking, the file
name stored in the Local Header will not be the actual file name.
A masking value consisting of a unique hexadecimal value will
be stored.
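As a sanity check, the offset guessed from central-directory fields can be compared against the local header's own length fields. A sketch; for a plain archive like this one they agree, but as quoted above the specification does not guarantee it:

```python
import io
import struct
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr('example.txt', 'some data')

with zipfile.ZipFile(buf) as archive:
    info = archive.infolist()[0]  # zipfile populates this from the central directory

# Offset guessed from central-directory fields alone:
guessed = info.header_offset + 30 + len(info.filename.encode()) + len(info.extra)

# Ground truth from the local file header's own length fields:
buf.seek(info.header_offset + 26)
name_length, extra_length = struct.unpack('<HH', buf.read(4))
actual = info.header_offset + 30 + name_length + extra_length

print(guessed == actual)  # True here, but not guaranteed by the spec
```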
Alternatively, we could store only the header offset and read over the header on ZipMountSource.open, i.e., a kind of lazy evaluation; that should suffice. It would not be backwards-compatible, though.
It seems like the data offset is unused anyway ... The file is opened via the ZipInfo corresponding to the header offset! The data offset was probably only intended for future use with manual reading of file members, even though that would be hazardous because these values are basically untested!
So yeah, it seems like setting all data offsets to 0 and deprecating them would be the best option. They are useless anyway without knowing the compression / encryption that was applied to the actual data, and this information is not stored in the index.
For reference, the archive is 195 GB compressed and 459 GB uncompressed. Also, during the conversion there is basically no CPU usage to speak of, but with Ctrl+C, I can see where it is breaking.