Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Items in an in-memory cluster as separate objects #395

Closed
veloman-yunkan opened this issue Aug 9, 2020 · 6 comments
Closed

Items in an in-memory cluster as separate objects #395

veloman-yunkan opened this issue Aug 9, 2020 · 6 comments

Comments

@veloman-yunkan
Copy link
Collaborator

veloman-yunkan commented Aug 9, 2020

Currently the internal representation of an in-memory cluster (zim::Cluster, http://github.com/openzim/libzim/tree/6.2.0/src/cluster.h) is a single buffer (behind a zim::Reader interface) where different ranges correspond to different articles/items. A more optimal representation is one with a separate buffer/blob for every article/item, which will allow more granular caching on article/item level (the problem with coarse caching of clusters is that a cluster may need to be kept in cache because of a single small item, the other items not being accessed at all).

@mgautierfr
Copy link
Collaborator

This would potentially save a bit memory cache. But not necessary improve speed.

First the cluster cache is only used for compressed cluster.

If we decompress a cluster to get one article we have two options :

  • Split the cluster into "items" and store all of them in the cache. It would use the same amount of memory but will increase the cache size (in number of item, see Optimization of zim::Cache #385 about the number of item performance). It would help a bit the memory as we could drop unused items.
  • Store only the item for which we were decompressing the cluster. But then we decompress all the cluster (or at least all items before the one we want) and never cache it. It may simply leads to decompressing several time the same (beginning of a) cluster in a row. (And we should ensure than the "cluster order" is also "blob order").

Yet again, we need measurements before changing this.

@kelson42
Copy link
Contributor

@veloman-yunkan Now that the most of the work has been done in the cache. Should we do these measurements now?

@kelson42
Copy link
Contributor

I support this

@kelson42
Copy link
Contributor

kelson42 commented Sep 16, 2020

@mgautierfr @veloman-yunkan What is the status here? I believe this is not implemented, but do we still need it? Would that still bring an improvement?

@mgautierfr
Copy link
Collaborator

I don't know if we need this.
We have made great improvements on the cache system and the partial cluster decompression.

Caching item's data individually will allow the cache to drop the unused item. We will win memory but in the same time we may have to decompress cluster data several times.
I have no clue at all if it will be a win or a lost.

@veloman-yunkan
Copy link
Collaborator Author

Then let's not chase it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants