Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libzim cache size should be programmatically mutable #311

Closed
kelson42 opened this issue Apr 1, 2020 · 15 comments
Closed

libzim cache size should be programmatically mutable #311

kelson42 opened this issue Apr 1, 2020 · 15 comments

Comments

@kelson42
Copy link
Contributor

kelson42 commented Apr 1, 2020

Currently this depends from an ENV varaible. I propose to remove this and in place create a method which can be called whenever we want to be able to to increase/decrease it size.

This idea comes after discussing #303

@veloman-yunkan
Copy link
Collaborator

I don't think that making this enhancement in a standalone manner makes sense (then it may be either overdesigned or implemented the wrong way). A better approach is to do it in the context of an optimization that will immediately utilize the new feature.

@kelson42
Copy link
Contributor Author

kelson42 commented Jul 22, 2020

@veloman-yunkan Actually this is needed now for zimcheck and might I believe be the next ticket to implement as there is quite a consensus about it.
@MiguelRocha Would you be able to explain here what is needed please, for both caches?

@MiguelRocha
Copy link
Contributor

Context:

Currently libzim has 2 caches that are heavly utilized by several libzim API calls:

Cache<article_index_t, std::shared_ptr<const Dirent>> direntCache;
Cache<cluster_index_t, std::shared_ptr<Cluster>> clusterCache;

The number of max elements for each cache is fixed and is taken from 2 places:
Environment variables ZIM_DIRENTCACHE/ZIM_CLUSTERCACHE if they exists. If not, hard coded values are used:

	#define DIRENT_CACHE_SIZE 512
	#define CLUSTER_CACHE_SIZE 16

Problem:

This approach is not very smart since:

  • The environment variables have to be set up by the user running the applications. Most of the times they are not aware what are the appropriate values.

  • When the user does not set up the environment variables (which seem to be most of the cases), the default values for DIRENT_CACHE_SIZE/CLUSTER_CACHE_SIZE are fixed and very small for most of the zimfiles, and
    the application using libzim (example: zimcheck) has no control over this values).

Solution:

The proposed solutions allow the application to adapt the size of the caches considering the necessity of the specific zimfile being used.

Proposed solution 1:

Give control to the application using libzim.

Have in the libzim API some way to change the cache values. (this is already exposed by Cache::setMaxElements(size_type maxElements_))

Proposed solution 2:

Give control to libzim.

Libzim can have some inteligence and apply an algorithm to setup better appropriate values. Example: The dirent cache max elements can be 20-50% of the total nr of dirents.

We can also implement both solutions and give the option to the user to choose. In the following priority order:

  1. ENV variables
  2. Solution 1
  3. Solution 2
  4. Hard coded values

Use case:

zimcheck heavly use direntCache when running the full checks. When we set up ZIM_DIRENTCACHE env variable with high values we get very significant performance improvements (at an insignificant memory usage cost).

@mgautierfr
Copy link
Collaborator

We have to be sure that the smart solution is smart enough to not take all the memory when opening big zim file on small devices (android, ios, embedded).

In all case, the user should have the end word here. Even if they have a lot of memory, they may want to limit the cache to avoid for any reason (reserve memory for other process...)

@MiguelRocha
Copy link
Contributor

We have to be sure that the smart solution is smart enough to not take all the memory when opening big zim file on small devices (android, ios, embedded).

In all case, the user should have the end word here. Even if they have a lot of memory, they may want to limit the cache to avoid for any reason (reserve memory for other process...)

Yes we should care about the memory, but considering all the memory used by libzim, these caches doesn't seem to be to occupy a lot. (its basically a <primitive type, pointer type> pair for each entry.

With the priority order that I specified the end user has the end word.

@kelson42
Copy link
Contributor Author

To me this is important that:

  • Wee keep default values like they are currently
  • We remove any dependence to ENV
  • We don't have anything smart in the libzim (at least in the scope of this ticket).

So just make the two cache properly mutable for software using libzim. @veloman-yunkan Does that sounds good to you?

@veloman-yunkan
Copy link
Collaborator

To me this is important that:

* Wee keep default values like they are currently

* We remove any dependence to ENV

* We don't have anything smart in the libzim (at least in the scope of this ticket).

So just make the two cache properly mutable for software using libzim. @veloman-yunkan Does that sounds good to you?

@kelson42 From the implementation perspective it is trivial. However I am not sure that sticking with the Worse is better software development philosophy is optimal in this case. Even if we close this ticket in such a non-sophisticated manner, we should come up with a solution that is more client-friendly.

@kelson42
Copy link
Contributor Author

kelson42 commented Jul 31, 2020

Even if we close this ticket in such a non-sophisticated manner, we should come up with a solution that is more client-friendly.

@veloman-yunkan If you believe doing what I propose generates a new problem, please express it because this is not clear to me unfortunately. If you believe part of the problem is not fixed and it should, please open a new ticket.

@veloman-yunkan
Copy link
Collaborator

@kelson42 Your proposal doesn't generate a new problem per se. It just shifts the burden of struggling with a known problem to the clients. They either have to solve that problem (determining the optimal cache size) on their own, or expose that same API, enabling the end user to control that setting. The latter may be perceived either good or bad (depending on the user). On the other hand, if applications invent their own strategies for setting the cache size, those may become obsolete by later optimizations in libzim and turn into wasted effort. If we are aware of any good cache management strategies that may be common to multiple applications we better build those into libzim and update them as libzim evolves.

@kelson42
Copy link
Contributor Author

kelson42 commented Aug 1, 2020

@veloman-yunkan Thank you for clarifying your concern for me. What you have written makes sense to me but these questions are really not easy to answer now and I'm not sure about the importance of the problem... maybe the problems will stay theoritical. Who knows, at least I really don't know.

What I know:

  • I like to keep problems/tickets as small as possible
  • Move ahead step-by-step
  • We have a relatively urgent need for zimcheck/zimdump
  • I like that we solve real problems not hypothetical ones
  • In general we try to keep the libzim not too smart/complex
  • zimcheck/zimdump won't implement a smart strategy but just increase by a few factor the caches when needed which should not pose any problem anyway
  • If we need something smart, I would really prefer having us experimenting outside the libzim first.
  • I'm not aware about any usage outside openZIM/Kiwix which has complained about or would need this kind of feature
  • Whatever how we see it this is not TOP prio

For all these reasons, I prefer that we don't focus on this now. I can only reiterate my invitation to open an other ticket if you believe this worth a tracking.

@veloman-yunkan
Copy link
Collaborator

veloman-yunkan commented Aug 1, 2020

@kelson42

Reiterating my earlier comment:

I don't think that making this enhancement in a standalone manner makes sense (then it may be either overdesigned or implemented the wrong way). A better approach is to do it in the context of an optimization that will immediately utilize the new feature.

If there is a ticket for performance improvement in zimcheck/zimdump, may I be assigned to it?

@kelson42
Copy link
Contributor Author

kelson42 commented Aug 8, 2020

@veloman-yunkan As long you agree with @MiguelRocha about who does what and that this is clear in the tickets, this is good to me.
@MiguelRocha You have open a ticket about perf. Improvements at openzim/zim-tools#134 but nothing about tunning the memory. Would you be able to point out to a ticket about that?

@mgautierfr
Copy link
Collaborator

I second @veloman-yunkan here. We don't know what we need. It is useless to add a method if it will fix anything at all.

As @MiguelRocha points it in openzim/zim-tools#134, the dirent cache take a lot of times. But changing the cache size (increase or decrease it?) may not help. Maybe the solution is to fix the cache system itself (#385)

And I think it make less sens to configure a cache system by the number of item than by the total memory used. It all software using caches (and providing configuration), I always seen a configuration option about the allowed memory to the cache. Never about the number of entries. And it make sens, I know how to limit the cache usage depending of my configuration, I have no idea of how many items cache.

@kelson42
Copy link
Contributor Author

@veloman-yunkan What approach do you want to follow for this ticket? Do you want to finish openzim/zim-tools#69 (introducing multithreading) and see if this is readlly needed?

@kelson42
Copy link
Contributor Author

It seems this is not needed, so will close this ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants