Simplify the cache #817

darkowlzz · 2024-10-17T15:07:48Z

Based on the cache implementation improvement/simplification plan discussed
before, this change refactors it along the same lines with additional
improvements. See
https://gist.github.com/darkowlzz/34bc8b37d016b8aeb1de6660173652eb for the
initial improvement idea. It also updates oci/auth package which used the cache,
to remove the auth token caching. Auth token caching will be done at a higher
level by an OCI client.

At the time of writing the initial improvement plan, I was focused on the cache
interface and the exposed API. But while implementing the changes, I realized
how tightly the cache and their metrics were coupled, which may have contributed
to the verbose API. I'll explain some of it below for the record along with some
other explanations related to the changes.

dependency on kubernetes

The initial implementation was based on client-go's cache.Store interface and
was implemented around kubernetes style objects with object metadata. Due to
this, the implementation used kubernetes related go packages. Any user of the
cache, for example the git package for caching the git auth tokens, when
importing the cache, also imports all the kubernetes dependencies. This is
undesirable to keep simple packages independent of kubernetes dependencies. The
simplification changes the design around objects and makes it based on generic
value.

verbose interface

The initial Store interface, which the cache implementations implement, had too
many behaviors defined. It seemed to leak the implementation detail on the
interface. A cache interface needs ways to perform Get, Set and Delete. All the
other operations like resize, listing the keys, etc can be done per
implementation or build on top of the simple base interface if needed. It's
possible that we may have use case for some of the extra behaviors in the
future, but at present, we don't seem to have them. To simplify the interface,
the Store interface now has Get(), Set() and Delete() which work on string key
and generic type value.

complexity due to StoreObject

Because the initial implementation was based on kubernetes like objects,
any/most operations require building a StoreObject which is passed to the
cache for setting Get/Set/Delete. This made the cache usage code complex for
simple use. Most of the cache use cases in flux will store string or byte slice.
There's no known use case for caching kubernetes style object. Which makes the
need to create StoreObject for every use unnecessary. The new implementation
explicitly requires a string key and generic type value.

cache metrics

The initial cache was implemented to be self-instrumented. For all the
operations, it records metrics on its own. But cache event metrics which
recorded cache hit or miss were defined to be associated with object being
reconciled, similar to the metrics in the runtime package. It is desirable to
provide metrics for cache usage by flux object reconciliation. Since the initial
implementation required a StoreObject for every operation, it extracted the
associated reconciling object details from the StoreObject being passed. This
creates a strong coupling between the cache and the metrics, which is
reflected in the API. It's a nice property but at the cost of complexity on the
API. In addition, the implementation wasn't consistent in regards to the
associated object information on the metric. Some operations like GetByKey()
didn't consider the need for the reconciled object.
The new implementation decouple the cache and the metrics, making the metrics
that require the associated reconciling object to be explicitly called by the
caller who may have the information about the reconciling object.

The new implementation also introduced methods to delete the metrics when the
associated objects are deleted, like we do for any metric that's associated with
flux objects.

cache metrics prefix

The cache Options now support setting a metrics prefix to be added to all the
metrics registered by a cache. This is to prevent metrics name conflict when
using multiple instances of the cache in a program with the same metrics
registerer. Refer fluxcd/source-controller#1644 for more details.

remove cache from OCI auth

Initially, the cache was used in OCI auth package to cache the obtained auth
token. But later on, it was decided to not cache at the login manager level and
let the caller of such login functions to perform the cache as they have more
context about the scenario. #776 introduced
the cache in oci/auth package along with expiry time from the auth providers.
The cache is now removed and to return the expiry time to the caller, a new
function LoginWithExpiry() is introduced. A auth token caching client can call
LoginWithExpiry() to obtain the TTL of the obtained token and use it in the
cache.

Following is a list of overall changes in brief:

Remove dependency on kubernetes.
Simplify the Store interface to only have Get, Set and Delete.
Add new RecordCacheEvent() method in the cache implementations for recording the cache events, hit or miss, along with labels of the associated flux object being reconciled. The labels name, namespace and kind for this metric are not required to be configured when creating the cache. They are predefined in the created metrics recorder, similar to the runtime metrics recorder for the rest of flux. RecordCacheEvent() has to be called by the caller of cache operations explicitly along with the label information of the associated object being reconciled. The cache no longer has the knowledge of the object being reconciled, decoupled.
With the decoupled cache and reconciling object for metrics, the storeOptions no longer need to have the generic type defined. Simplifying the usage.
Add new DeleteCacheEvent() method in the cache implementations for deleting the cache events, hit or miss, which are associated with the object being reconciled. When the reconciling object is deleted, these metrics associated with the object must also be deleted.
Updated all the tests to use simple string cache.
Get() now returns a pointer instead of a separate exists boolean. If the pointer is nil, the item is not found in the cache.
Get() takes a key and returns a cached item for it. GetByKey() is removed as Get() does the same now.
In order to prevent the cache metrics to conflict when multiple caches are used against a common metrics registerer, the metrics constructor now accepts a prefix. This can be used to add a prefix to all the registered metrics of a cache.
Revert the cache introduced in oci/auth in [OCI] Cache Login credentials #776, keeping the expiration time which can be used by a cache at a higher level. Introduce LoginWithExpiry() to return the auth expiry time, keeping the existing Login() signature as it is. Auth token caching for OCI will be implemented separately in the OCI client.
Add doc.go for the package to show its usage.

OCI integration test run with the above changes: https://github.com/fluxcd/pkg/actions/runs/11743637711

stefanprodan

LGTM

Thanks @darkowlzz 🥇

dipti-pai · 2024-11-26T20:07:19Z

cache/cache.go

 	}
 	item, found := c.index[key]
 	if !found {
 		c.mu.RUnlock()
 		recordRequest(c.metrics, StatusSuccess)
-		return res, false, nil
+		return nil, nil


In the previous implementation the bool return value (exists) indicated if the element is not present in the cache. With the current implementation, the caller cannot deduce a cache miss scenario since no error is returned when the cache element is not found. This leads to a segmentation fault in the caller while dereferencing the item. This can be fixed by returning ErrNotFound here if the item is not found in the cache. The caller can then check for ErrNotFound to figure out if this is a cache miss.

This is intentional with the expectation that the caller should always check if the returned value is nil as it's returning a pointer. When it's a nil, it's taken as not found. This is based on the initial proposal in https://gist.github.com/darkowlzz/34bc8b37d016b8aeb1de6660173652eb#other-minor-issues and I believe we discussed this behavior explicitly. I was trying to stay close to the initial implementation which doesn't treat not found as an error for Get(), but it does for GetExpiration() and SetExpiration(). I assumed that operations against an item that doesn't exist result in not found error, but when obtaining the item itself with a given key that doesn't exist, it's not an error.

If we are to reevaluate this assumption/behavior, and I can't justify not returning an error for Get(), I would also like it to return an error. But if we do that, there's no longer a need to return a pointer to the item. The purpose of pointer was to indicate that the item was not found without returning an error.

Ok, my vote is to return an error for not found and not return a pointer as it feels more intuitive to use and keeps the implementation in sync with the expiration methods.

Going through some other Go cache implementations from other projects like

https://github.com/hashicorp/golang-lru/blob/v2.0.7/expirable/expirable_lru.go#L149

https://github.com/patrickmn/go-cache/blob/v2.1.0/cache.go#L120

https://github.com/eko/gocache/blob/621044ba0d3536d13db320a23d31d9af364d2cbd/store/go_cache/go_cache.go#L45

it seems the signature for Get() varies without any clear reasoning. hashicorp/golang-lru and patrickmn/go-cache return a separate boolean to indicate existence of the item, whereas eko/gocache, which also wraps a patrickmn/go-cache client internally, prefers to return a not found error. I think the reasoning behind our current design is because it's based on patrickmn/go-cache and hashicorp/golang-lru, which is also a generic cache implementation. But since our cache initially did the object key derivation and a few other operations that aren't directly cache store operation, it needed to return error for those operations and we ended up with an exists value and also an error.

Returning not found error seems better to me too. And with that, there's no reason to return a pointer.

The consequences of this is, previously, in a way, it was nice to be able to do

if got == nil { // Item doesn't exist, record cache miss. }

But now, it'll have to be more specific

if err == ErrNotFound { // Item doesn't exist, record cache miss. }

The caller will have to do something like

got, err := cache.Get("foo") if err != nil { if err == ErrNotFound { // Record cache miss. } return err } else { // Record cache hit. }

It's just that the cache miss check becomes part of further error value analysis.

While previously, it would have been

got, err := cache.Get("foo") if err != nil { return err } if got == nil { // Record cache miss. } else { // Record cache hit. }

Leaving the observations here for now for any further feedback on this new usage.

Adding this as a separate commit in case we find convincing reasons to go back to how it was before.

cache/cache.go

dipti-pai · 2024-11-26T21:07:30Z

Reviewed the changes and tested the following scenarios by using the cache to store credentials used for github app authentication -

Cache miss - Credentials are not present in the cache.
Set credentials into the cache after successful authentication.
Cache hit - Credentials are present in the cache and retrieved successfully.

- Remove dependency on kubernetes. - Simplify the Store interface to only have Get, Set and Delete. - Add new RecordCacheEvent() method in the cache implementations for recording the cache events, hit or miss, along with labels of the associated flux object being reconciled. The labels name, namespace and kind for this metric are not required to be configured when creating the cache. They are predefined in the created metrics recorder, similar to the runtime metrics recorder for the rest of flux. RecordCacheEvent() has to be called by the caller of cache operations explicitly along with the label information of the associated object being reconciled. The cache no longer has the knowledge of the object being reconciled, decoupled. - With the decoupled cache and reconciling object for metrics, the storeOptions no longer need to have the generic type defined. Simplifying the usage. - Add new DeleteCacheEvent() method in the cache implementations for deleting the cache events, hit or miss, which are associated with the object being reconciled. When the reconciling object is deleted, these metrics associated with the object must also be deleted. - Updated all the tests to use simple string cache. - Get() now returns a pointer instead of a separate exists boolean. If the pointer is nil, the item is not found in the cache. - Get() takes a key and returns a cached item for it. GetByKey() is removed as Get() does the same. Signed-off-by: Sunny <[email protected]>

Custom prefix allows using multiple instances of the cache in an application with unique metrics. Signed-off-by: Sunny <[email protected]>

OCI auth token caching will be done in the OCI client. Remove the cache from auth package and its use in the integration tests. Since the cache related changes introduced the auth token expiry time to be returned from the auth provider logins, a new function LoginWithExpiry() is introduced which returns the expiry time. The original Login() remains as it is for backwards compatibility. A higher level client can use LoginWithExpiry() to obtain the TTL of the auth token to use with the cache. Signed-off-by: Sunny <[email protected]>

Previously, when an item is not found in the cache, no error was returned and instead the returned value was expected to be checked for nil to determine if the object was found or not. Considering that other cache functions like GetExpiration() and SetExpiration() return ErrNotFound for items not in cache, it's more consistent to return ErrNotFound for Get() as well. Returning ErrNotFound from Get() makes it clear that the object was not found, instead of needing to check the obtained value. There is no more need to return a pointer to the item. Return the item value instead. This changes the Store interface method signature for Get(). Signed-off-by: Sunny <[email protected]>

dipti-pai

LGTM! Thanks.

darkowlzz mentioned this pull request Oct 21, 2024

Cache metric issues fluxcd/source-controller#1644

Open

darkowlzz force-pushed the refactor-cache branch 3 times, most recently from 543974d to 02d44f7 Compare November 7, 2024 16:54

darkowlzz marked this pull request as ready for review November 7, 2024 20:06

darkowlzz requested review from stefanprodan and a team as code owners November 7, 2024 20:06

darkowlzz force-pushed the refactor-cache branch from 02d44f7 to 65b1039 Compare November 8, 2024 14:06

stefanprodan approved these changes Nov 9, 2024

View reviewed changes

dipti-pai reviewed Nov 26, 2024

View reviewed changes

cache/cache.go Outdated Show resolved Hide resolved

darkowlzz added 4 commits November 27, 2024 21:24

cache: Allow setting cache metrics prefix

d8e4eae

Custom prefix allows using multiple instances of the cache in an application with unique metrics. Signed-off-by: Sunny <[email protected]>

darkowlzz force-pushed the refactor-cache branch from 65b1039 to 2b64615 Compare November 27, 2024 17:10

dipti-pai approved these changes Nov 27, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify the cache #817

Simplify the cache #817

darkowlzz commented Oct 17, 2024 •

edited

Loading

stefanprodan left a comment

dipti-pai Nov 26, 2024

darkowlzz Nov 26, 2024 •

edited

Loading

dipti-pai Nov 26, 2024

darkowlzz Nov 27, 2024 •

edited

Loading

dipti-pai commented Nov 26, 2024 •

edited

Loading

dipti-pai left a comment

Simplify the cache #817

Are you sure you want to change the base?

Simplify the cache #817

Conversation

darkowlzz commented Oct 17, 2024 • edited Loading

dependency on kubernetes

verbose interface

complexity due to StoreObject

cache metrics

cache metrics prefix

remove cache from OCI auth

stefanprodan left a comment

Choose a reason for hiding this comment

dipti-pai Nov 26, 2024

Choose a reason for hiding this comment

darkowlzz Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

dipti-pai Nov 26, 2024

Choose a reason for hiding this comment

darkowlzz Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

dipti-pai commented Nov 26, 2024 • edited Loading

dipti-pai left a comment

Choose a reason for hiding this comment

darkowlzz commented Oct 17, 2024 •

edited

Loading

darkowlzz Nov 26, 2024 •

edited

Loading

darkowlzz Nov 27, 2024 •

edited

Loading

dipti-pai commented Nov 26, 2024 •

edited

Loading