Items in an in-memory cluster as separate objects #395

veloman-yunkan · 2020-08-09T15:24:56Z

Currently the internal representation of an in-memory cluster (zim::Cluster, http://github.com/openzim/libzim/tree/6.2.0/src/cluster.h) is a single buffer (behind a zim::Reader interface) in which different ranges correspond to different articles/items. A better representation would use a separate buffer/blob for every article/item, which would allow more granular caching at the article/item level. The problem with coarse, cluster-level caching is that a whole cluster may have to be kept in the cache because of a single small item while its other items are never accessed.
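To make the two layouts concrete, here is a minimal sketch. The types `SingleBufferCluster` and `PerItemCluster` and the function `split` are hypothetical names invented for this illustration; they are not the libzim API and do not reflect how zim::Cluster or zim::Reader are actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the libzim API.

// Single-buffer layout: all items live back to back in one buffer and
// item i is the byte range [offsets[i], offsets[i + 1]).
struct SingleBufferCluster {
    std::string data;
    std::vector<std::size_t> offsets;  // itemCount + 1 entries

    std::size_t itemCount() const {
        return offsets.empty() ? 0 : offsets.size() - 1;
    }

    std::string item(std::size_t i) const {
        assert(i < itemCount());
        return data.substr(offsets[i], offsets[i + 1] - offsets[i]);
    }
};

// Per-item layout: every item owns its buffer, so a cache can keep one
// item alive while the buffers of its siblings are released.
struct PerItemCluster {
    std::vector<std::shared_ptr<const std::string>> items;
};

// Copy each range of the single buffer into its own blob.
PerItemCluster split(const SingleBufferCluster& c) {
    PerItemCluster out;
    for (std::size_t i = 0; i < c.itemCount(); ++i) {
        out.items.push_back(std::make_shared<const std::string>(c.item(i)));
    }
    return out;
}
```

With the per-item layout, holding a `std::shared_ptr` to one item keeps only that item's bytes alive after the rest of the cluster has been dropped, which is the granularity this issue asks for.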


mgautierfr · 2020-08-10T15:49:01Z

This would potentially save a bit of cache memory, but would not necessarily improve speed.

First, the cluster cache is only used for compressed clusters.

If we decompress a cluster to get one article, we have two options:

Split the cluster into "items" and store all of them in the cache. This would use the same amount of memory but would increase the cache size in number of items (see Optimization of zim::Cache #385 about how the number of items affects performance). It would help memory usage a bit, as we could drop unused items.
Store only the item for which we were decompressing the cluster. But then we decompress the whole cluster (or at least all items before the one we want) and never cache it. This may simply lead to decompressing the same cluster (or the beginning of it) several times in a row. (And we would have to ensure that the "cluster order" is also the "blob order".)
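The trade-off between the two options can be sketched with a toy model. Everything here is an assumption made for illustration: `FakeCluster` only pretends to decompress (it copies its items and counts the calls), and `ItemCache` is a hypothetical unbounded cache with no eviction, unrelated to the real zim::Cache.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a compressed cluster: "decompression" just
// copies the items and counts how often it was requested.
struct FakeCluster {
    explicit FakeCluster(std::vector<std::string> v)
        : items(std::move(v)), decompressions(0) {}

    std::vector<std::string> decompress() const {
        ++decompressions;
        return items;
    }

    std::vector<std::string> items;
    mutable int decompressions;
};

// Hypothetical item-level cache keyed by (cluster id, item index).
class ItemCache {
public:
    // cacheSiblings == true  : option 1, store every item of the cluster.
    // cacheSiblings == false : option 2, store only the requested item.
    explicit ItemCache(bool cacheSiblings) : cacheSiblings_(cacheSiblings) {}

    const std::string& get(int clusterId, const FakeCluster& cluster,
                           std::size_t idx) {
        const Key key(clusterId, idx);
        std::map<Key, std::string>::iterator it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // cache hit

        // Cache miss: the whole cluster has to be decompressed.
        const std::vector<std::string> all = cluster.decompress();
        assert(idx < all.size());
        if (cacheSiblings_) {
            for (std::size_t i = 0; i < all.size(); ++i)
                cache_[Key(clusterId, i)] = all[i];
        } else {
            cache_[key] = all[idx];
        }
        return cache_.at(key);
    }

    std::size_t size() const { return cache_.size(); }

private:
    typedef std::pair<int, std::size_t> Key;
    bool cacheSiblings_;
    std::map<Key, std::string> cache_;
};
```

In this model, reading all three items of a three-item cluster costs one decompression with option 1 and three with option 2. Real behaviour would also depend on eviction and on partial decompression, which the sketch leaves out.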

Yet again, we need measurements before changing this.

kelson42 · 2020-08-24T07:53:36Z

@veloman-yunkan Now that most of the work on the cache has been done, should we do these measurements now?

kelson42 · 2020-08-30T14:10:21Z

I support this

kelson42 · 2020-09-16T15:11:22Z

@mgautierfr @veloman-yunkan What is the status here? I believe this is not implemented, but do we still need it? Would that still bring an improvement?

mgautierfr · 2020-09-23T16:04:41Z

I don't know if we need this.
We have made great improvements to the cache system and to partial cluster decompression.

Caching items' data individually would allow the cache to drop unused items. We would save memory, but at the same time we might have to decompress cluster data several times.
I have no clue at all whether it would be a win or a loss.

veloman-yunkan · 2020-09-23T17:01:11Z

Then let's not chase it.

veloman-yunkan added the enhancement label Aug 9, 2020

veloman-yunkan self-assigned this Aug 9, 2020

veloman-yunkan mentioned this issue Aug 9, 2020

Streaming decompression #394

Closed

kelson42 added the question label Aug 10, 2020

veloman-yunkan mentioned this issue Aug 28, 2020

Partial/incremental decompression of clusters #411

Closed

kelson42 linked a pull request Aug 30, 2020 that will close this issue

Partial/incremental decompression of clusters #411

Closed

mgautierfr mentioned this issue Aug 31, 2020

Add support the Zstandard (aka zstd) library for compression #84

Closed

kelson42 assigned mgautierfr Sep 16, 2020

veloman-yunkan closed this as completed Sep 23, 2020
