Add support for the Zstandard (aka zstd) library for compression #84
@data-man To move this forward, the best argument would be to just make a libzim/zimwriterfs/ZIM file with that algorithm and clearly demonstrate what kind of improvement we are talking about.
zstdmt also deserves attention.
@data-man It has been two years since this feature request was opened, and it is more attractive to me now than it was. Would you volunteer to make a POC integration of zstd in the libzim?
Implementing this would help to avoid problems like kiwix/kiwix-tools#345.
@kelson42 Some thoughts:
That would be really awesome!
Yes, but if zstd is better, then in the mid-term we would probably just switch to it by default.
This is an idea, please open a dedicated ticket in zim-tools.
@data-man Please see with me before trying to implement this. I've investigated zstandard a bit and it may provide a functionality we have wanted for a long time: random-access decompression. But doing this would require a lot of work in libzim: changing the cluster format, using an article cache (instead of a cluster cache), maybe changing the way we group articles into clusters, ... It is also (of course) possible to simply use zstd to compress the cluster as we already do now. But at the least we should prepare for the possibility that we change the way we store articles in clusters.
@mgautierfr What you are talking about corresponds to #76 and #78. If we could tackle these problems as well, this would be awesome^2!
@mgautierfr @veloman-yunkan is going to work on this. |
💯 BTW, I changed my mind. :)
@kelson42 Is there a test ZIM file that uses zstd compression, to work with on kiwix/kiwix-js#611?
@data-man Would you please upload one (a ZIM file with zstd compression) somewhere? I have created a ticket so that in the future we can easily create one with …
@data-man Perfect. Thx. I have updated it to http://tmp.kiwix.org/foo-zstd.zim as well.
@data-man I'm trying to implement zstd decompression in Kiwix JS, our JavaScript reader (using a Node version of the zstandard library), see kiwix/kiwix-js#611 (comment). Regarding the sample ZIM you kindly supplied, I have some questions:
You will see from the comment referenced above that I can present what I believe is the compressed cluster's data to the decompressor using the Simple API, but it returns …
@Jaifroid
No. createZimExample uses …
The API used to compress the data shouldn't matter. Your question is akin to asking about the application used to create a PNG file, in order to decide which viewer to use for opening it. Data compressed with any zstd API complies with the zstd codec specification, and should be correctly decoded with any of its decompression APIs.
Introducing zstd support at this stage doesn't bring new approaches to compression in the ZIM file format - it's just a new method of compressing the entire cluster as a whole.
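As context for "compressing the entire cluster as a whole": as I read the openZIM spec, a cluster's (possibly compressed) body is a list of blob offsets followed by the blob data, and the first offset also tells you how many blobs there are. A minimal sketch in Python; the helper names are mine, not libzim's, and this assumes the standard 4-byte offsets of a non-extended cluster (extended clusters use 8-byte offsets):

```python
import struct

def build_cluster_body(blobs):
    # Cluster body layout (before compression): n+1 little-endian 4-byte
    # offsets, then the blob data. offset[i+1] - offset[i] is blob i's size,
    # and offset[0] points just past the offset list itself, which is why
    # n = offset[0] / 4 - 1 can be recovered from the first offset alone.
    header = 4 * (len(blobs) + 1)
    offsets, pos = [], header
    for b in blobs:
        offsets.append(pos)
        pos += len(b)
    offsets.append(pos)
    return b"".join(struct.pack("<I", o) for o in offsets) + b"".join(blobs)

def read_blob(body, i):
    # Extract blob i from a (decompressed) cluster body.
    start = struct.unpack_from("<I", body, 4 * i)[0]
    end = struct.unpack_from("<I", body, 4 * (i + 1))[0]
    return body[start:end]
```

With zstd, it is this whole body that gets compressed in one piece, exactly as with lzma today.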
Thank you @data-man and @veloman-yunkan. The info is helpful for pinpointing where to look when troubleshooting our code (I'm getting a null result when I try to decompress a cluster in the test ZIM file using the JS version of the codec).
Just to mention that the clusters don't have to be written consecutively. So you cannot use the ClusterPointerList to get the size of a cluster.
@mgautierfr OK, thank you for that important clarification. This most likely explains the issue I'm having.
@mgautierfr, I understand from your comment in #210 (quoted below) that we cannot know the size-on-disk of a compressed cluster, and have to decompress the data chunk by chunk. Is the chunk size the same for zstandard as it is for xz? This seems to be 1024 x 5 in Kiwix JS for xz-compressed chunks.
(#210)
Yes, at best you can be sure that if something starts at an offset after the cluster (another cluster, or anything else such as a dirent, clusterPtrPos, ...), the compressed cluster will finish before this offset. But the actual size may be smaller.
There is no "chunk size" specific to zstandard or xz. As @veloman-yunkan said, how the content has been compress is irrelevant of how you should decompress it. Of course, you must decompress xz content with xz algorithm and decompress zstd algorithm with zstd. But you don't care about the chunck size used at compression time or other things like that. For decompression you can use the simple api (all in once) or the streaming api (chunck by chunck), it doesn't matter. On a performance/memory usage it may, but the result will be the same. |
Thank you very much for the hints @mgautierfr. As you can probably tell, I'm rather new to decompressing streams, but I think I've got my head round it now with your help. We're facing Kiwix JS becoming obsolete overnight once new ZIMs are produced with zstd compression, hence the effort to reproduce the libzim process in JavaScript. (We would much rather use libzim directly, but to date it has proved impossible to compile it to asm/WebAssembly with Emscripten in a usable state, mostly, we think, due to filesystem limitations.)
GoldenDict now supports ZIMs with zstd (commit).
I am a bit confused by the GoldenDict implementation. It appears to calculate the cluster size before decompressing the cluster by subtracting the beginning of the cluster from the beginning of the next cluster: https://github.com/goldendict/goldendict/blob/master/zim.cc#L322 Is this actually a useful heuristic to know "roughly" how much data we are dealing with?
Yes. This is a useful heuristic, but there is no guarantee that clusters are written sequentially.
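A safer variant of the GoldenDict heuristic, sketched in Python with illustrative names (not the libzim API): since clusters need not be laid out in index order, the nearest *later* offset of anything known (here, just other clusters, falling back to the file size) gives only an upper bound on the compressed size, never the exact value.

```python
def estimated_cluster_size(cluster_offsets, i, file_size):
    """Upper bound on cluster i's compressed size on disk.

    cluster_offsets: offsets from the ClusterPointerList, in cluster-index
    order, which is NOT necessarily ascending file order. The real
    compressed stream may end before the next known offset.
    """
    start = cluster_offsets[i]
    later = [o for o in cluster_offsets if o > start]
    next_boundary = min(later) if later else file_size
    return next_boundary - start   # >= the real compressed size
```

Subtracting offset[i+1] - offset[i] directly, as GoldenDict does, gives the same answer only when clusters happen to be written consecutively; the offsets are out of index order, it can even go negative.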
OK, thanks @mgautierfr. I'm now quite close to completing the Kiwix JS implementation of zstd decompression. The GoldenDict version suddenly made me think our implementation was over-complicated, and that we might just be able to decompress a whole cluster in one go, but that is clearly not safe even if it would work most of the time. So instead we decompress from the start of the cluster up to the end of the offset + data length we need. A side effect of this is that we have to restart decompression from the beginning of a cluster each time we want a blob from that cluster, even if the next blob we want is stored consecutively after a previously retrieved blob. This seems like a waste of CPU cycles even if the decompression is fast... (I know it can be ameliorated with cluster caching.)
Why is it not safe to decompress the whole cluster?
This is a complex problem (not on the code side, but on what is the best strategy). |
I meant it's not safe to do it using values calculated by subtracting the current cluster's offset from the next cluster's offset, for the reason you state (they are not guaranteed to be written consecutively). So in Kiwix JS we currently decompress just "enough" data for the blob requested, and then start again for the next blob. We have an experimental cluster cache made by peter-x some years ago, which works well, but of course, like all caches, it's hit-and-miss, and it has a high churn rate for large ZIMs.
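The "decompress just enough" strategy described above can be sketched as follows, again with stdlib zlib standing in for zstd (the stopping logic is the same for either codec). CHUNK mirrors the 1024 × 5 figure mentioned earlier for Kiwix JS, and `decompress_up_to` is a hypothetical helper name:

```python
import zlib

CHUNK = 1024 * 5  # input feed size per step; any value works

def decompress_up_to(compressed, needed):
    """Stream-decompress only enough input to yield `needed` output bytes.

    This is restarted from the cluster start for every blob request,
    which is the CPU cost a cluster cache would avoid.
    """
    d = zlib.decompressobj()
    out = b""
    for i in range(0, len(compressed), CHUNK):
        out += d.decompress(compressed[i:i + CHUNK])
        if len(out) >= needed:
            break          # stop early: remaining input is never touched
    return out[:needed]
```

For a blob at offset `off` with length `length` inside the decompressed cluster body, `needed` would be `off + length`; blobs near the start of a cluster are therefore much cheaper than blobs near the end.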
Indeed, if you use the next offset as the end offset, it is not safe. In libzim (for now), we decompress the whole compressed stream (and let lzma/zlib/... detect the end of the stream).
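Letting the codec detect the end of the stream looks like this in stdlib Python (zlib as a stand-in for lzma/zstd): the decompressor can safely be fed extra bytes that happen to follow the cluster on disk, because it stops by itself at the end of the compressed stream and leaves the surplus untouched.

```python
import zlib

# A compressed cluster whose size on disk we do not know.
cluster = zlib.compress(b"blob data " * 50)
# Whatever happens to follow it in the file: another cluster, a dirent, ...
trailing = b"next-cluster-or-dirent-bytes"

d = zlib.decompressobj()
out = d.decompress(cluster + trailing)  # over-read on purpose

assert d.eof                      # the codec found the stream end itself
assert d.unused_data == trailing  # the surplus bytes were not consumed
assert out == b"blob data " * 50
```

This is why no end offset is needed at all: an over-estimate of the read length is harmless as long as the decompressor honours the stream's own end marker.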