SetOfActiveKeysKey is not cleaned up and hogs cpu/memory over time #21

Open
nomaddamon opened this issue May 31, 2016 · 11 comments

@nomaddamon
Contributor

Our situation:
Table "A" has ~100.000.000 records, and has rather specific load pattern - if a row is queried then it is likely it will be queried again few thousand times over next few minutes.
Around 5000 different rows are queried every minute (distribution is near random).
After a few minutes given row might not be accessed again for days.

A logical setup would be to set CacheRegion absolute expiry to ~ 5 minutes and keep only "hot" rows in cache (~ 5 * 5.000 = ~ 25.000 rows in cache).

Since entries in SetOfActiveKeysKey have no expiry, the set builds up over time (cached items are removed via expiry, but their :keys records are kept). With our load pattern this leads to 10M+ records in the :keys set while only ~25.000 items are actually cached. I'm quite certain this has a negative effect on lookup speed (not tested, though), and I'm certain about the memory consumption - purging the :keys sets frees a few GB of memory.

At the moment we use a scheduled cleanup as a workaround, but this feels like a hack that shouldn't be part of this project. At the same time I can't figure out how to address the issue cleanly, given the current cache architecture.

(The hack we use is a job that runs every minute: it does a full SCAN (preferably on a slave), finds the :keys entries, does an SSCAN on each of them, determines in memory which members are obsolete and deletes them in batches.)
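For reference, a compact sketch of that cleanup job, assuming StackExchange.Redis and the NHibernate-Cache:<region>:keys layout used by this library; the class and method names are ours and this is not the code we actually run:

```csharp
using System.Collections.Generic;
using System.Linq;
using StackExchange.Redis;

public static class ActiveKeysCleanup
{
    // SCAN for ":keys" index sets, SSCAN each one, and remove members whose
    // backing cache entry has already expired.
    public static void Run(ConnectionMultiplexer connection, int db = 0)
    {
        var server = connection.GetServer(connection.GetEndPoints().First());
        var database = connection.GetDatabase(db);

        foreach (var keysSet in server.Keys(db, pattern: "NHibernate-Cache:*:keys"))
        {
            var stale = new List<RedisValue>();
            foreach (var member in database.SetScan(keysSet))
            {
                // The set currently stores the full Redis key, so probe it directly.
                if (!database.KeyExists((string)member))
                    stale.Add(member);

                if (stale.Count >= 1000)
                {
                    database.SetRemove(keysSet, stale.ToArray(), CommandFlags.FireAndForget);
                    stale.Clear();
                }
            }

            if (stale.Count > 0)
                database.SetRemove(keysSet, stale.ToArray(), CommandFlags.FireAndForget);
        }
    }
}
```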

@nomaddamon
Contributor Author

A more common example of the problem is querying a database object with an expiry condition:
session.CreateCriteria<Banner>().Add<Banner>(x => x.ExpiresAt > expiryTime).SetCacheable(true).List(); where expiryTime is DateTime.Now rounded down to the full minute.

Reasoning: once a minute we query the database for the list of non-expired banners, and all other queries during that minute hit the StandardQueryCache.

As a result, every minute a new entry is added to NHibernate-Cache:NHibernate.Cache.StandardQueryCache:keys, and those entries are only removed on a Redis FLUSHDB.
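To make that concrete, here is an equivalent QueryOver form of the query (illustration only: the Banner stub, the method name and the use of QueryOver instead of the criteria/lambda extension above are ours):

```csharp
using System;
using System.Collections.Generic;
using NHibernate;

// Minimal stand-in for the entity in the example above.
public class Banner
{
    public virtual int Id { get; set; }
    public virtual DateTime ExpiresAt { get; set; }
}

public static class BannerQueries
{
    // expiryTime changes once per minute, so each minute produces a brand new
    // StandardQueryCache key - and a new member in
    // NHibernate-Cache:NHibernate.Cache.StandardQueryCache:keys that is never removed.
    public static IList<Banner> ActiveBanners(ISession session)
    {
        var now = DateTime.Now;
        var expiryTime = new DateTime(now.Year, now.Month, now.Day, now.Hour, now.Minute, 0);

        return session.QueryOver<Banner>()
            .Where(x => x.ExpiresAt > expiryTime)
            .Cacheable()
            .List();
    }
}
```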

@nomaddamon
Contributor Author

Went ahead and tried a different approach: instead of keeping track of keys via :keys, SCAN for them and DEL them on ICache.Clear() - see https://github.com/nomaddamon/NHibernate.Caches.Redis

This adds considerable overhead to ICache.Clear() (clearing a small region in a huge database might result in a 100x performance loss), but it does come with a few perks:

  • ICache.Get, ICache.Put, ICache.Remove are nominally faster
  • :keys sets no longer exist (and therefore no longer leak memory)
  • ICache.Clear no longer causes a fail-over on large data-sets (e.g. with 10M values in a :keys set, DEL on that key takes ~10s on our test machine, but it blocks the database and the sentinels promote another slave to master, making the DEL fail anyway... SCAN/DEL takes ~100 seconds but doesn't cause a Redis failure)

This might not be an acceptable approach for this project, but in our use case it works better (ICache.Clear() is not used in our solutions, since we try to avoid named queries). A simplified sketch of the approach follows.
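The core of the forked Clear() is roughly the following (a simplified sketch, not the actual code; it assumes StackExchange.Redis and that all of a region's entries share a common prefix):

```csharp
using System.Collections.Generic;
using System.Linq;
using StackExchange.Redis;

public class ScanBasedClear
{
    private readonly ConnectionMultiplexer connection;
    private readonly int db;

    public ScanBasedClear(ConnectionMultiplexer connection, int db = 0)
    {
        this.connection = connection;
        this.db = db;
    }

    // Instead of DEL-ing one huge ":keys" set, SCAN the keyspace for the
    // region's prefix and delete matches in small fire-and-forget batches.
    // Slower overall, but Redis is never blocked by a single giant delete.
    public void Clear(string regionPrefix)
    {
        var server = connection.GetServer(connection.GetEndPoints().First());
        var database = connection.GetDatabase(db);
        var batch = new List<RedisKey>();

        foreach (var key in server.Keys(db, pattern: regionPrefix + "*", pageSize: 500))
        {
            batch.Add(key);
            if (batch.Count >= 500)
            {
                database.KeyDelete(batch.ToArray(), CommandFlags.FireAndForget);
                batch.Clear();
            }
        }

        if (batch.Count > 0)
            database.KeyDelete(batch.ToArray(), CommandFlags.FireAndForget);
    }
}
```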

@TheCloudlessSky
Owner

TheCloudlessSky commented May 31, 2016

Thanks for bringing this up. This will be quite a long read, but hopefully it helps you.

This is a known disadvantage to using this library (and it's not particularly easy to fix).

Prerequisite thoughts

The general idea with this project is to be the lowest common denominator for using Redis as a cache for NHibernate. With that in mind, you may have to customize/fork this library to get the best performance with your own application's constraints (as I'm writing this I see you've already stumbled upon one of the solutions 😄).

Have you profiled to verify that adding a Redis cache in front of your database is actually faster than querying the database directly? This obviously depends on how your data is modeled and how you're using NHibernate. Adding a cache on top of NHibernate can actually reduce performance because of the serialization/deserialization/materialization costs inside NHibernate. In large applications I've created (with specific usage patterns/environments), I've found that using the DB directly was as fast as using the cache with full round trips. Obviously this highly depends on your DB, environment, data etc.

Do you need a distributed NHibernate cache (e.g. are you using only one web server)?

Have you considered using a non-ORM for high-performance reads (Dapper, Massive etc)? For my applications, we usually start with NHibernate (to get going quickly). As we find performance-sensitive areas, we use raw SQL queries with Dapper to avoid all of the costs associated with NHibernate (and extra round-tripping to the cache). We use MediatR to abstract away the implementation of these queries to consumers.

If you're using the NHibernate query cache, be aware that there are broad cases where an update will cause NHibernate to call ICache.Clear(). This, again, depends on how you're using NHibernate, but I've found that in situations like this the NHibernate query cache has no real benefit, so we leave it off completely. We then move the caching "up" (e.g. cache rendered HTML or JSON rather than actual DB entities).

Potential solutions

  1. Turn on keyspace notifications in Redis and listen for "expired" events in your application code (see the sketch after this list). When a key expires, remove it from the set of active keys (you may have race conditions, however).
    • Because this isn't enabled in Redis by default, this isn't something we can easily bake into this library for everyone to use.
    • It uses additional CPU in Redis.
    • What happens if you have more than one listener (e.g. multiple web servers would all receive the keyspace notification and each would try to remove the key from the set of active keys)?
  2. Use your external timer to "prune" old entries.
    • If we baked this into this library, it'd have to be environment-agnostic (e.g. HostingEnvironment.QueueBackgroundWorkItem is only useful in web scenarios).
    • Instead we could expose a RedisCache.Prune() so that you can determine how often you'd want to prune the entries.
    • The actual pruning, however, really depends on your use case. For example, SSCAN is a server operation, not a full database operation.
    • How many "N" do we query at one time?
    • Should we always target a slave or the master (or use PreferSlave)?
    • There are lots of questions that make Prune() less general and not a good fit to add to this library (however I'll still consider it if we can nail down a great implementation).
    • Could we just publicly expose the RedisNamespace so you can get direct access to the key formatting (and then write your own Prune())?
  3. Fork the library, manually do a SCAN/KEYS in ICache.Clear(), and don't track the set of active keys.
    • Edit: Just saw that you stumbled upon this!
    • This was one of the original ways this was implemented (by other libraries that inspired this one). The reason for not doing it here was to keep performance predictable.
    • It is usually very use-case specific and can have really bad performance.
    • It has potential race conditions.
    • NHibernate can call ICache.Clear() at different times (e.g. executing native SQL, manually evicting a whole collection/region/type) so you have to be careful about performance.
    • We could think about turning this into a boolean option (turned off by default) for each ICache and bake it into this library. I'd want it to be robust enough. We'd have to take into account that SCAN/KEYS are server (not database) operations.
  4. Fork the library and use "generations" (previous versions of this library used this approach).
    • Bake a "generation" into the key name: NHibernate-Cache:<region>:<generation>:<key>
    • Every ICache has to sync their generation with the other servers (meaning we have to store the generation in Redis and make sure they all update when the generation changes).
      • With versions <= 2, we synced the generation on every operation... so more commands hit Redis than with the approach used in v3.
    • Calling ICache.Clear() would be as simple as incrementing the generation and letting everyone else know the generation has been incremented (e.g. you could use Redis pub/sub to get notified and also sync your generation on startup).
    • There's a chance of a race condition since your ICache might have an out-of-date generation (maybe by a few seconds).
    • You'd need an LRU Redis eviction policy and to ensure that all NHibernate cache regions use a decent expiration (so entries expire once they're no longer used after the generation is updated).
    • This is a good lowest common denominator for most applications, but you have to keep generation syncing in mind.
  5. Fork the library and use a separate server/database for each of your NHibernate cache regions.
    • If you have more than 15 regions mapping to individual databases, you'd need to change the Redis config.
    • Map from region name to database number.
    • Won't need the set of active keys.
    • Clear would be as simple as a FLUSHDB for the specific server/database pair.
    • This is one of the best approaches to scale NHibernate cache. For example, you could slice your Users table cache off to its own dedicated server for even better performance.
    • We can't make this assumption for all applications.
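A rough sketch of option 1 (the keyspace-notification listener referenced above), assuming StackExchange.Redis, notify-keyspace-events configured to include expired events, and the NHibernate-Cache:<region>:<key> / NHibernate-Cache:<region>:keys layout; the class is illustrative, not part of the library:

```csharp
using StackExchange.Redis;

public static class ExpiredKeyListener
{
    // Requires: CONFIG SET notify-keyspace-events Ex (or the equivalent in redis.conf).
    public static void Start(ConnectionMultiplexer connection, int db = 0)
    {
        var subscriber = connection.GetSubscriber();
        var database = connection.GetDatabase(db);

        // Redis publishes each expired key's name on __keyevent@<db>__:expired.
        subscriber.Subscribe($"__keyevent@{db}__:expired", (channel, expiredKey) =>
        {
            var key = (string)expiredKey;

            // Cache entries look like "NHibernate-Cache:<region>:<key>"; the matching
            // index set is "NHibernate-Cache:<region>:keys". Assumes the region name
            // itself contains no ':'.
            const string prefix = "NHibernate-Cache:";
            if (key == null || !key.StartsWith(prefix) || key.EndsWith(":keys")) return;

            var regionEnd = key.IndexOf(':', prefix.Length);
            if (regionEnd < 0) return;

            var setOfActiveKeys = key.Substring(0, regionEnd) + ":keys";
            database.SetRemove(setOfActiveKeys, key, CommandFlags.FireAndForget);
        });
    }
}
```

The race mentioned in the list is visible here: several web servers may receive the same event and each issue the SREM, which is at least harmless since SREM is idempotent.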

Additional thoughts

  • The current solution for v3 is a balance between ICache.Clear(), memory, the number of commands and CPU.
  • Make sure your cache regions are very small (e.g. map "YourApp.Models.Users" to just "r1") so they take up less space.
  • One area where we can reduce memory in this library is storing only the NHibernate key in the set of active keys. Currently we store the full Redis key (from RedisNamespace.GetKey()). Because the Redis key duplicates the prefix/region information, that's wasted memory. The set would then no longer contain a full list of valid Redis keys, but it's something we can always re-create (see the short illustration below).
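A short illustration of that last bullet, with a made-up region "r1" and entity key "12345" (db is a StackExchange.Redis IDatabase; the exact member format is up for discussion):

```csharp
// Today: the set member repeats the prefix and region that are already part
// of the set's own key - duplicated information, wasted memory.
db.SetAdd("NHibernate-Cache:r1:keys", "NHibernate-Cache:r1:12345");

// Proposed: store only the NHibernate part; the full Redis key can always be
// re-created via RedisNamespace.GetKey() when it is needed.
db.SetAdd("NHibernate-Cache:r1:keys", "12345");
```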

Thanks!

@TheCloudlessSky
Owner

You should be able to improve your Clear() by using the IDatabase.KeyDelete(RedisKey[] keys) overload instead of doing a batch:

db.KeyDelete(new RedisKey[] { "key1", "key2", "keyN" }, CommandFlags.FireAndForget);

@nomaddamon
Contributor Author

Thanks for the long and informative reply! :)

First the plan:

  • I will do a few small commits to improve (in my opinion) a few things
  • I will try to go with solution 4 and try to bake it into the library in an acceptable way

Additional thoughts:

About our setup: we use a bunch of web servers (16 as of writing, but this can change in minutes) load-balanced by haproxy. The platform is a mix of WebForms and MVC, and the ORM is NHibernateX (NH for .NET 4.5 with async support). We used Memcached successfully for quite a while, but moved to Redis for its more reasonable failover architecture. The data structure in our DB is rather complex and some queries (which can be frequent) are really complex. Most of the data changes slowly; a random entity will probably get a few thousand hits before cache expiry / invalidation. Moving further away from relational data models is the way forward for scaling, but that would mean a huge rewrite and isn't planned at the moment.

Without caching we would need to improve our relational DB performance by a factor of 10 to keep current page latency, so yes, we need a cache (and since we have multiple servers, we need a distributed cache).

Moving some queries to Dapper or similar doesn't make much sense for us, since the relational DB is the hardest thing to scale out in a hurry and costs the most to massively over-provision.

We have been using the query cache with NHibernate for a few years and in our case it makes a huge impact on performance. (On a side note: are you sure about the query cache and ICache.Clear()? We haven't seen this in our logs and I can't trace it down in the NH source.)

We do use multiple caching levels, depending on the point at which the result becomes predictable (and re-use is likely).

About potential solutions

  1. Doesn't seem reasonable due to the increased load.
  2. Doesn't seem reasonable due to the increased load (we do use it in production at the moment, until we get 3 through QA). SSCAN/SREM noticeably slow down SISMEMBER on the same key.
  3. We found this to be the most reasonable for our workload. I will try to bake it into this project as a configurable option.
  4. Seems a reasonable approach, but generation syncing is either a hassle (as I understand it, the multiplexer can reconnect to the master after a momentary network failure without the end-user noticing, while the disconnect causes an unsubscribe in Redis) or performance-heavy (like it was done in v2).
  5. Doesn't work for us - we'd need a separate Redis instance/server for each region. If FLUSHDB blocks (e.g. while deleting a :keys set with 10M entries) it locks up the whole Redis instance, not just the given DB. SHUTDOWN NOSAVE on an instance can "flush" it fast, but having a Redis instance for each region seems a rather peculiar architecture.

@nomaddamon
Contributor Author

Please review this branch: https://github.com/nomaddamon/NHibernate.Caches.Redis/tree/disableIndexSetOnKeys and let me know what to improve in order to get it merged

@TheCloudlessSky
Owner

Regarding ICache.Clear(), it's not entirely obvious where it can be called, but if you do a "find all references" and work backwards you'll see at least the following places:

  • Executing certain HQL statements (that may delete/update) can clear all cache regions (or at least try to clear the affected cache regions):
    • See: AbstractStatementExecutor.CoordinateSharedCacheCleanup() (used by MultiTableDeleteExecutor, MultiTableUpdateExecutor and BasicExecutor)
  • Executing a native SQL query (e.g. session.CreateSQLQuery()):
    • See: NativeSQLQueryPlan.CoordinateSharedCacheCleanup()
  • Manually evicting:
    • SessionFactoryImpl.Evict(Type persistentClass)
    • SessionFactoryImpl.EvictEntity(string entityName)
    • SessionFactoryImpl.EvictCollection(string roleName, object id)
    • SessionFactoryImpl.EvictQueries()
    • SessionFactoryImpl.EvictQueries(string cacheRegion)

Here are my thoughts:

  • I'm going to bring in your change to make the cache key prefix configurable (I might rename it to CacheKeyPrefix or KeyPrefix).
  • I'd like to remove the prefix and region from what we store in the set of active keys (it's redundant and a waste of memory). I can do this or you're free to submit a PR. This will cause existing caches to be cleared the first time you deploy (since the new key without the prefix/region won't be in the set).
  • I recommend you use tiny prefix/region names so the cache keys are much smaller (less memory).
  • I've thought about it over the past ~30 hours and I don't think it's a good direction for this library to have a configurable option that makes ICache.Clear() potentially unsafe. I think this is a really good thing to have for your own use case, however. But I'm not sure of the value it would bring to the majority of people using this library. Perhaps it's something we can link to directly from the README?
  • An additional approach you may want to look at is still using the set of active keys, but only for Put() and Clear(). Clear() would do an SSCAN on the set-of-active-keys key rather than a full scan of the whole keyspace (potentially making this smaller).
  • I personally recommend the "prune" approach, where a timer is configured to look at each set of active keys and remove members whose associated key has expired. This would be done in batches and with SSCAN so as to not consume all of Redis. It's essentially what a garbage collector does in a managed language, and I think it provides the best balance of safety and performance (see the sketch below).
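To pin down what I mean, here is a sketch of the per-region core of such a Prune() - essentially the same loop as the cleanup hack sketched earlier in this thread, assuming StackExchange.Redis and that the set stores full Redis keys (as it does today):

```csharp
using System.Linq;
using StackExchange.Redis;

public static class RedisCachePruner
{
    // GC-style pass over one region: SSCAN the set of active keys, check which
    // backing cache entries have expired (reads can prefer a slave), then SREM
    // the stale members in one fire-and-forget call.
    public static void Prune(IDatabase db, RedisKey setOfActiveKeysKey)
    {
        var stale = db.SetScan(setOfActiveKeysKey, pageSize: 500)
            .Where(member => !db.KeyExists((string)member, CommandFlags.PreferSlave))
            .ToArray();

        if (stale.Length > 0)
            db.SetRemove(setOfActiveKeysKey, stale, CommandFlags.FireAndForget);
    }
}
```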

Also, I'm curious, how are you serializing cache objects (JSON, XML, etc)?

@nomaddamon
Contributor Author

Unrelated stuff first

We are currently using the default serializer. We played around with different options and found it acceptable (computing power for serialization is easy to scale, and the XML storage overhead in Redis is negligible bandwidth- and latency-wise).

Somewhat related stuff

We use CI and can basically do A/B testing (this is a coincidence of our deployment procedure: UpdateTimestampsCache lives in one Redis database for all servers, but when deploying, new servers start using a new database to avoid conflicts in data object definitions - this gives us a guaranteed non-stale cache, even if some servers are on version X and others on X+1 with different data structures). Over the last few weeks we have tested quite a lot in production :) - basically deploying 10-20% of the servers with some POC code (that has passed QA) to see how it affects Redis performance. We used the scheduled cleanup for a while, but it creates quite a lot of load (compared to minimizing ICache.Clear() usage and not using the index set).

Most ICache.Clear() calls come from usages we consider bad practice - hardcoded SQL/HQL queries, named queries etc. (and yes, a named query with no result type / sync set will call ICache.Clear() on all regions). We log all ICache.Clear() calls as errors...

Related stuff

Please see commit nomaddamon@a0df68f and run the new tests on your own machine :)

Test logic:
Set up a region with ~480k expired keys and ~20k active keys (we wait a bit to make sure Redis cleans up most of the expired keys). The 20k active keys expire after the test finishes, so we are only testing Get/Put/Clear.

Use 5 parallel tasks for testing:
2 tasks simulate cache misses (GET-PUT) on the load-test region
2 tasks simulate cache misses (GET-PUT) on an unrelated region
1 task calls Clear while the others are running

You might have to change the parameters depending on your test machine, but with the defaults the tests should take about ~50s combined.
There are no Asserts, console output only (the code is not meant to be part of this project and is not formatted according to your style - sorry :), ReSharper took over in some places).

My local output for Clear_preformance_without_index
Setup took 22156
Test db keyspace: [db15, keys=22593,expires=22593,avg_ttl=3717]
Read-Write started
Read-Write started
Read-Write started
Read-Write started
Clear starting
Clear finished, took 442
Read-Write ended
Read-Write ended
Read-Write ended
Read-Write ended
So... 22.5k unexpired keys (2.5k should expire while the test is running) and a 442ms Clear - an OK result if you don't use Clear :)

My local output for Clear_preformance_with_index
Setup took 20831
Test db keyspace: [db15, keys=20002,expires=20001,avg_ttl=816]
Read-Write started
Read-Write started
Read-Write started
Read-Write started
Clear starting
Clear finished, took 11
Read-Write took 474ms on region1
Read-Write took 472ms on region1
Read-Write took 471ms on region2
Read-Write took 475ms on region2
Read-Write ended
Read-Write ended
Read-Write ended
Read-Write ended
The multiplexer is already warmed up, so this round there was only 1 key that would expire while testing... Clear took 11ms (due to FireAndForget) but LOCKED up the ENTIRE Redis instance for 470+ms (all regions, all databases).

Play around with the parameters a bit and you will be able to get timeout exceptions instead of 400ms read-writes (for me it took 1.5M expired and 50k active keys).

This might not seem that bad, but here we were running at 80 ops per second (best case) with a rather small cache... we see 150k ops per second at peak load times.
The problem is that if Redis gets locked up, the sentinels start a failover and the current master gets demoted - meaning the slave will resume from the last known good state, which was prior to the Clear.
So my point is that you are not adding an option that potentially makes ICache.Clear() unsafe - it already is very unsafe under load. Under our load, calling ICache.Clear() results in ~500 requests delayed by 1 second (the failover timeout) against an average page load time of 50ms, ~500 error messages and a lot of work figuring out what happened... while not calling ICache.Clear() results in Redis DB growth of 1-2GB / day.

@TheCloudlessSky
Owner

About memory

  • I don't recommend using XML serialization. It's the default because it's pretty much the only serializer built into .NET that everyone will have (although it is "slow"). CPU/network might be negligible, but it will consume more memory. I provide a sample JSON.NET serializer that I use in all of my production code (a sketch of the general shape follows this list).
  • I still plan on bringing in the changes to change the key prefix which will further help reduce the memory. Similarly, reducing the data that's stored in the set is also beneficial.
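For anyone following along, the general shape of such a JSON.NET serializer is below. This is a hedged sketch, not the bundled sample: the ICacheSerializer name and its Serialize/Deserialize signatures are assumed here, so check the library source for the exact contract and settings.

```csharp
using Newtonsoft.Json;
using StackExchange.Redis;

// Assumed ICacheSerializer-style contract: object <-> RedisValue.
public class JsonCacheSerializer : ICacheSerializer
{
    private static readonly JsonSerializerSettings Settings = new JsonSerializerSettings
    {
        // NHibernate hands the cache arbitrary object graphs, so the concrete
        // type must round-trip with the payload.
        TypeNameHandling = TypeNameHandling.All
    };

    public RedisValue Serialize(object value)
    {
        return value == null
            ? RedisValue.Null
            : (RedisValue)JsonConvert.SerializeObject(value, Settings);
    }

    public object Deserialize(RedisValue value)
    {
        return value.IsNull
            ? null
            : JsonConvert.DeserializeObject((string)value, Settings);
    }
}
```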

Running your load tests

Our setup to run your tests was with a local Windows build of Redis that did not have any slaves. We found, obviously, that the non-index ICache.Clear() with SCAN was slower, but that makes sense because the SCAN is done on the master and not a slave. We added a slave, and saw performance improvements.

I found some variables were inconsistent between tests (e.g. changing the timeout). I wrote the same tests from scratch to fully understand your load tests.

Our results were similar in that calling DEL on the set of active keys can pause other connections for a few hundred milliseconds when it contains lots of values. This makes sense because, as of Redis 3.2, DEL is synchronous. An upcoming Redis release will add the UNLINK command, which enables asynchronous deletes. Eventually we'll be able to use it in the current implementation, depending on what your server supports.
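A minimal sketch of what that could look like (the serverSupportsUnlink check and helper name are ours; UNLINK is issued through the generic Execute API so the sketch doesn't depend on a specific client version):

```csharp
using StackExchange.Redis;

public static class LargeKeyDeletion
{
    // Drop a potentially huge key (e.g. a ":keys" set) without stalling Redis:
    // UNLINK reclaims the memory on a background thread. Older servers fall back
    // to DEL (fire-and-forget only avoids waiting on the reply; the delete itself
    // still blocks Redis).
    public static void Delete(IDatabase db, RedisKey key, bool serverSupportsUnlink)
    {
        if (serverSupportsUnlink)
            db.Execute("UNLINK", key.ToString());
        else
            db.KeyDelete(key, CommandFlags.FireAndForget);
    }
}
```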

Safety (at least strictly for this library) is defined in terms of having atomic operations. We need ICache.Clear() to be atomic for the general use case of this library. I don't think it's a good idea to have mixed safety (even as a config option, because it duplicates the complexity of this library). Our current implementation is slower, but guarantees that ICache.Clear() is atomic. I'd really like to keep this library as simple as possible so that everyone is free to customize it to their use case. You've done exactly that: having lots of keys has shown you need to take the library in a different direction and can sacrifice atomicity for performance.

I also wanted to note that having a separate Prune() that does an SSCAN on the set of active keys and asynchronously removes expired keys from the set will definitely improve the performance of ICache.Clear(), since it doesn't have to free up as much memory.

If forking this library is a pain, I've also thought about introducing something like an IRedisCacheStrategy where you'd be able to customize exactly how the values are stored/retrieved/cleared in Redis. It would take some thought to be generalized, but I'm open to adding that to reduce the pains (compilation/maintenance) of your own fork.

@nomaddamon
Contributor Author

Thanks again for detailed answer with a lot of insight.
We have switched to your JSON.NET serializer and saw ~30% memory savings on the Redis cluster :)

With some work you could make it the default serializer and offer NetDataContractSerializer as an alternative. It does have some issues at the moment, but nothing major; I've added comments about them to that file (they are also the reason we didn't start with that serializer).

We are fine with maintaining our own fork until (and if) we can make this project support both current (guaranteed atomicity) and our (high load, many keys, almost no Clear()) scenarios.

Some thoughts on alternative strategies:

  • UNLINK will be a nice addition to Redis - but using it on ICache.Clear() will not address situations where ICache.Clear() is not called
  • We did experiment with Prune() in production, but the overhead is too noticeable:
    we constantly had ~150k keys in a region with a 5-minute expiry and ran Prune() every minute, resulting in ~20k deletions per run. Both "SSCAN+MGET+SREM (for a batch)" and "SSCAN (to end)+SCAN+SREM (more atomic)" cause a noticeable latency increase for unrelated operations, and even more so for SISMEMBER on the same key. (And then there is the issue of getting a single client to run Prune() in a fault-tolerant way in a multi-server environment.)

@TheCloudlessSky
Owner

I'm not sure why, but GitHub doesn't send email notifications for Gists so I'm glad you mentioned it here. As I'm sure you're aware, I responded.

I've also merged in #22 (and cleaned it up to fit our code base). Thanks very much for your contribution!
