
feat: enable cache for external tables using a cache control on metrics views #6265

Open · wants to merge 12 commits into base: main
Conversation

@k-anshul (Member) commented Dec 12, 2024

No description provided.

@k-anshul k-anshul self-assigned this Dec 12, 2024
Resolved review threads on:
- proto/rill/runtime/v1/resources.proto
- runtime/compilers/rillv1/parse_metrics_view.go
- runtime/metricsview/executor.go
- runtime/query.go

return runtime.ReconcileResult{Err: validateErr}
}

func metricsViewCacheControl(spec *runtimev1.MetricsViewSpec, streaming bool, updatedOn *timestamppb.Timestamp, dialect drivers.Dialect) *runtimev1.MetricsViewSpec_Cache {
Contributor commented:

I think it would be better to move the logic in this function (which basically just applies defaults) to the metricsview_cache_key resolver – to keep the cache key logic in a single place. (It also makes refactoring the defaults easier, because they will then only be applied at runtime, not persisted in the DB as they are here.)

@k-anshul (Member Author) replied:

  1. We still need to validate key_sql, i.e. whether we can resolve a valid key_sql for streaming metrics views. It is also ideal to do that validation in the reconciler and not at query time. Since we are doing some validation anyway, it made sense to also apply defaults.
  2. In the reconciler I always mark caching as enabled or disabled, which resolvers can use to check whether caching should be enabled and to get the metricsViewCacheKey. Otherwise, resolvers would need to query every time and handle an error thrown by the metrics_cache_key resolver.

> It also makes refactoring the defaults easier because they will only be applied at runtime, not persisted in the DB as here

Since the defaults can only be changed with a code change, the metrics view reconcile on restart will make sure the persisted defaults are also updated.

Having said that, I also thought about keeping this in the resolver and was indifferent between the two. I can move this logic to the resolver as well.

@k-anshul (Member Author) replied:

By the way, if we don't apply defaults in the reconciler, then we can't resolve this TODO, since the executor can't call the metrics_cache_key resolver:

// Cacheable returns whether the result of running the given query is cacheable.
func (e *Executor) Cacheable(qry *Query) bool {
// TODO: Get from OLAP instead of hardcoding
return e.olap.Dialect() == drivers.DialectDuckDB
}

Contributor replied:

In order to cache the cache key correctly (this is so confusing), it seems like some of the flows around Cacheable need to be refactored anyway?

In general, the idea is that the metricsview package implements logic related to metrics views (such as validation, query generation, exports, etc.). This is largely in order to centralize the core logic related to metrics views in one package to make it easier to understand how our metrics layer works.

In line with that, it would be nice if the metrics-related cache key resolving could also be implemented in the metricsview package. And then called by upstream resolvers.

Right now we have a nice flow: the resolver cache calls the resolver, which calls the metricsview package.

This easily gets mixed up with the new caching logic. But I think it should be possible to find a way to keep the logic flow here relatively simple. Can you see if you can think of a way to keep it as straightforward as possible?

@k-anshul (Member Author) commented Dec 17, 2024:

I can think of the following ideas:

  1. We discussed resolving this via a refresh schedule on the metrics_view. This could be evaluated; in that case we can compute the key and store it directly in the spec.
  2. Another option is to let the runtime expose APIs to set/get data in the query_cache, which the executor can use to set/get the metrics_cache_key. This could be useful for other use cases as well. For example, as of now we always query the watermark, so it could be cached too (maybe it is not queried for all API calls, just thinking out loud).
  3. Let the executor expose a CacheControl API instead of Cacheable, which the caller can use. If enabled, execute the KeySQL and save the result in the cache using the CacheKey.
type CacheControl struct {
	Enabled  bool
	KeySQL   string
	CacheKey string
}
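A minimal sketch of how a caller might use this CacheControl (option 3). The Executor wiring is omitted, and `resolveWithCacheControl` and `fakeCache` are hypothetical names for illustration, not the actual Rill API:

```go
package main

import (
	"fmt"
)

// CacheControl is the struct proposed above (field names as in the comment).
type CacheControl struct {
	Enabled  bool
	KeySQL   string
	CacheKey string
}

// fakeCache stands in for the runtime query cache (illustrative only).
var fakeCache = map[string]string{}

// resolveWithCacheControl shows how a caller might use CacheControl:
// if caching is disabled, always execute fresh; otherwise look up the
// entry under CacheKey and populate it on a miss.
func resolveWithCacheControl(cc CacheControl, execute func() string) string {
	if !cc.Enabled {
		return execute()
	}
	if v, ok := fakeCache[cc.CacheKey]; ok {
		return v // cache hit
	}
	res := execute()
	fakeCache[cc.CacheKey] = res
	return res
}

func main() {
	cc := CacheControl{Enabled: true, KeySQL: "SELECT MAX(updated_on) FROM t", CacheKey: "mv1:v42"}
	fmt.Println(resolveWithCacheControl(cc, func() string { return "fresh" }))      // miss: executes
	fmt.Println(resolveWithCacheControl(cc, func() string { return "recomputed" })) // hit: returns cached "fresh"
}
```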

@k-anshul (Member Author) commented Dec 17, 2024:

Option 2 sounds most convenient to me. There is a good chance we will come up with similar use cases in the future as well. It also sounds okay logically: if anybody can access the OLAP interface, then anybody should be able to leverage the cache as well. In the future, the OLAP APIs could accept a cache parameter and handle caching results (just thinking out loud).

@begelundmuller (Contributor) commented Dec 17, 2024:

Right now we basically only have caching happening in runtime.Runtime.Resolve (and the legacy runtime.Runtime.Query), which is nice because it's much simpler than when caching happens at multiple levels. I think it would be very nice to avoid caching happening inside the metricsview package – so the guarantee basically becomes that it always returns fresh data, which upstream callers can cache (we may need multi-level caching at some point, but I think we may be able to avoid it for now).

Looking at the resolvers, the Cacheable and Key functions are only called from here (rill/runtime/resolver.go, lines 126 to 136 in 8e7bca1):

// If not cacheable, just resolve and return
if !resolver.Cacheable() {
	return resolver.ResolveInteractive(ctx)
}

// Build cache key based on the resolver's key and refs
ctrl, err := r.Controller(ctx, opts.InstanceID)
if err != nil {
	return nil, err
}
hash := md5.New()
if _, err := hash.Write([]byte(resolver.Key())); err != nil {

What if we replaced these with a CacheKey(ctx context.Context) (key []byte, ok bool) function (where caching should be skipped if !ok)? Then the metrics, metrics_sql and metrics_time_range resolvers could call the metrics_cache_key resolver from their CacheKey implementation – if they pass that call through opts.Runtime.Resolve, then it in turn will also be cached.

Do you think that will work? The advantage of this approach is that there continues only to be caching happening in runtime.Runtime.Resolve, and resolvers are only being called from other resolvers (not from the metricsview package). This doesn't quite account for the legacy runtime.Runtime.Query implementation, but hopefully the above idea can also work for that in a similar way?

When it comes to the actual execution (but not caching) of the metrics_cache_key resolver, that in turn could actually be implemented inside the metricsview package (similar to how the logic for the metrics resolver is implemented there).

Lastly, to comment on this:

> Like as of now we are always querying watermark so it could be cached as well

You are thinking about this, right?

t, err := e.loadWatermark(ctx, executionTime)

I agree that it would be nice to have this cached (it hasn't been urgent since the UI currently always passes fixed time ranges). In this case, we could also avoid multi-level caching by instead moving this to the resolver – e.g. you could have something like this in the metrics resolver (pseudocode):

// Create an executor
executor := ...

// Call watermark resolver for the metrics view (can be cached in the resolver cache)
watermark := opts.Runtime.Resolve(...)

// Bind the query using the provided watermark (to avoid nested, uncached queries for the watermark)
boundQry = executor.BindWatermark(qry, watermark)

// Now execute the bound qry
res := executor.Execute(boundQry)

@k-anshul (Member Author) commented Dec 17, 2024:

Overall the CacheKey API seems good to me, but I am somewhat unconvinced about how it is computed in resolvers.

  1. It feels like too much indirection for CacheKey. It is hard to understand and mentally trace the calls: metrics and other resolvers need to call another resolver via the runtime.Resolve method to resolve the cache key so that data can be cached.
  2. Some APIs on the executor, like Query, can be called directly, but other APIs, like CacheKey, need to go via the resolver path. So the decision to cache something is split across multiple places.
  3. It would be ideal to not let resolvers call other resolvers, to prevent cyclic calls.

Maybe for internal use cases, instead of creating resolvers, there should be an uncomplicated way to get and set things in the query_cache, to avoid too many indirections.

@begelundmuller (Contributor) commented Dec 19, 2024:

> Feels like too much indirection for CacheKey

Agree, but I think it's intrinsic to the idea of a "cached cache key", not this implementation.

> Some APIs on executor like Query can be called directly but other APIs like CacheKey need to go via the resolver path. So the decision to cache something is split across multiple places.

You must have misunderstood Executor.Query – it never does any caching. So on the contrary, this ensures consistency, because caching is never implemented in the metricsview package and is only implemented in one place (the runtime/resolver.go function, plus the legacy runtime/query.go, which is being deprecated).

> It will be ideal to not let resolvers call other resolvers to prevent some cyclic calls.

I think resolvers calling other resolvers can actually be very powerful. Think of resolvers as memoized functions – it's completely fine for functions to call other functions, but of course, you should be careful if you call the same function recursively.

> there should be a uncomplicated way to get and set things in query_cache to avoid too many indirections.

It's subjective, but I'm conversely more worried about the potential for bugs if things get cached at many different levels in the call chain. Having caching only happen in Resolve IMO makes it easier to reason about when things are or are not cached. I guess there's a functional vs. imperative programming way to look at this.

@k-anshul (Member Author) replied:

> Agree, but I think it's intrinsic to the idea of a "cached cache key", not this implementation.

The same indirection will be present if resolvers need to load the watermark as well.

> You must have misunderstood Executor.Query – it never does any caching

No, I didn't mean that. I meant that the metrics resolver and other such resolvers need to call Executor.Query directly but call Executor.CacheKey via runtime.Resolve.

> I think resolvers calling other resolvers can actually be very powerful.

There is no denying that it is powerful. My reservations are more around ease of use and ease of understanding. Creating resolvers for internal usage feels like adding too much bloat. I think there has to be a tradeoff between "perfect" abstractions and ease of understanding/usage. Maybe we are now in "it is subjective" territory.

> It's subjective, but I'm conversely more worried about the potential for bugs if stuff is getting cached at many different levels in the call chain.

I don't know if the current approach is less bug-prone. It is easy to miss the fact that calling Executor.CacheKey in a resolver needs to go via runtime.Resolve and should not be called directly, which is not the case with the other APIs on the executor.

Resolved review threads on:
- runtime/reconcilers/metrics_view.go
- runtime/server/batch_query_test.go
- runtime/metricsview/executor.go
if res.Err() != nil {
return nil, false, err
}
return []byte(key), true, nil
Contributor commented:

Not sure if we want to protect against this, but often the key will be a timestamp, and people may forget to cast it to a string.

An easy way to protect here is to use var key any and then run it through jsonval.ToValue and then json.Marshal – that will almost always give you a safely serialized string representation.

Member Author replied:

Weirdly, I did not find this issue in my testing. On digging further, I realised that the sql package already does the conversion when the source is time.Time and the dest is *string: https://github.com/golang/go/blob/669d87a935536eb14cb2db311a83345359189924/src/database/sql/convert.go#L286
It does this conversion for most common types.

But I think it is good to run it through jsonval.ToValue since it handles more types.

runtime/query.go, comment on lines 83 to 88:
cacheKey, err := r.metricsViewCacheKey(ctx, instanceID, res.Meta.Name.Name, priority)
if err != nil {
// skip caching
// the cache_key_resolver should ideally only return an error if caching is disabled or context is cancelled
return query.Resolve(ctx, r, instanceID, priority)
}
Contributor commented:

Can we have a dedicated skip bool for skipping caching? Because if the err is a context cancellation, that's ideally an error we return immediately. It would also make it easier to notice if an unexpected error is actually returned.

Member Author replied:

Okay. In that case, I think it will be best to expose errCachingDisabled and check for it directly for now. Once the query package is removed, it can be made internal again.
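The sentinel-error approach described here could look roughly like this (the function shapes and names are illustrative, not the actual runtime/query.go code):

```go
package main

import (
	"errors"
	"fmt"
)

// errCachingDisabled is the sentinel the cache key resolver returns when
// caching should be skipped, as opposed to a genuine failure.
var errCachingDisabled = errors.New("caching is disabled")

// metricsViewCacheKey stands in for the cache key resolver.
func metricsViewCacheKey(enabled bool) (string, error) {
	if !enabled {
		return "", errCachingDisabled
	}
	return "key-123", nil
}

// query distinguishes "skip the cache" from genuine errors such as
// context cancellation, which should be propagated immediately.
func query(enabled bool) (string, error) {
	key, err := metricsViewCacheKey(enabled)
	if err != nil {
		if errors.Is(err, errCachingDisabled) {
			return "resolved without cache", nil
		}
		return "", err // e.g. context cancellation: return immediately
	}
	return "resolved with cache key " + key, nil
}

func main() {
	fmt.Println(query(false)) // resolved without cache <nil>
	fmt.Println(query(true))  // resolved with cache key key-123 <nil>
}
```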

return nil, false, nil
}

hasher := md5.New()
Contributor commented:

This is already being hashed in runtime/resolver.go so could we return the full pointer here?

Member Author replied:

The values were already being hashed in the older implementation:

func (r *metricsResolver) Key() string {

Since the key is now composed of two things (the cache_key and the query), I thought of just using an md5 hasher.
Do you mean we should create a new struct and return any from CacheKey instead of bytes?

Resolved review threads on runtime/resolvers/metricsview_cache_key.go.
Comment on lines 126 to 128
func (r *metricsViewCacheKeyResolver) Refs() []*runtimev1.ResourceName {
return []*runtimev1.ResourceName{}
}
Contributor commented:

I think it should have a ref to the metrics view to ensure invalidation if the underlying model is updated when streaming is false?

@k-anshul (Member Author) commented Dec 20, 2024:

Yes. Previously, without the cacheable interface, I was calling this for every metrics view, and adding a ref on the metrics_view would lead to a circular dependency. But now, given it is called by resolvers, it can add a ref on the metrics view.
That said, invalidation due to a model update still happens, because the queries also have the metrics view (and hence the model) as a ref.

Resolved review threads on:
- runtime/resolvers/metricsview_cache_key.go
- runtime/resolvers/sql.go
2 participants