Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experimental ability to cache References #8111

Merged
merged 3 commits into from
May 16, 2024

Conversation

snazy
Copy link
Member

@snazy snazy commented Feb 22, 2024

Adds ability to cache Reference objects, avoiding round-trips to the backend database, beneficial for read heavy workloads.

Reference-caching can be enabled via two new configuration options: to define the expiration time for References (holding the current HEAD/tip) and to define the expiration time for non-existing References.

Looking up references happens via the name of the reference, usually without the reference type (aka whether it is a branch or a tag), so Nessie has to look up both types - the given name as a branch and the given name as a tag. This is where negative-caching comes into play, because that caches the existing entry and the non-existing "other" reference type. Hence, if you enable reference-caching, it is recommended to also enable negative reference-caching.

Operations that are about to change a reference (committing and reference create/assign/delete operations), always consult the backing database, implicitly refreshing the cache.

Mutliple Nessie (against the same repository) do not communicate with each other. If for example a commit happened against one Nessie instance, the other instances may or may not return the new commit. This is why this feature is still experimental and only useful for Nessie setups with a single Nessie instance. This change will be later with another change to allow distributed cache-eviction, so that other Nessie instances accessing the same repository work fine.

@snazy snazy force-pushed the cache-references branch 9 times, most recently from 6dcb278 to 036a501 Compare February 27, 2024 09:42
@snazy snazy force-pushed the cache-references branch 2 times, most recently from 1fab3a0 to c3f1303 Compare March 18, 2024 10:18
@snazy snazy force-pushed the cache-references branch from c3f1303 to a2ca5b3 Compare March 28, 2024 17:50
@snazy snazy force-pushed the cache-references branch 7 times, most recently from 9672006 to ee9cca6 Compare May 6, 2024 10:34
@snazy snazy changed the title WIP: Experimental ability to cache References Experimental ability to cache References May 6, 2024
@snazy snazy marked this pull request as ready for review May 6, 2024 10:34
@snazy snazy force-pushed the cache-references branch 2 times, most recently from ba20f8d to b27e9b9 Compare May 14, 2024 07:31
* <p>Database specific implementations of {@link Persist} must not implement this function.
*/
@Nullable
default Reference fetchReferenceForUpdate(@Nonnull String name) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ForUpdate" looks a bit misleading to me... The client does not have to update it after fetching. How about fetchCurrentReference?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"bypassingCache" would be more correct - but do we care that much?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ForUpdate" looks really odd in code that just tries to load the latest value 🤷

Why not add a parameter to the old method: fetchReference(String name, boolean refresh), for example?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That caused like 5248 additional changed lines of code ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we could have it as an overloaded method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep the impact low (it's would be rather cosmetic).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding fetchReference(String name, boolean refresh) instead of fetchReferenceForUpdate(@Nonnull String name) is about the same in terms of impact, is it not?.. At the same time I find that fetchReference(name, true) is more intuitive (more precise) than fetchReferenceForUpdate() at the call sites. fetchReferenceBypassingCache(name) also works for me, but I think it will bring the term "cache" to implementation that do not care about caching, which is again a bit awkward.

It could also be fetchReference(String name, FetchMode mode) with FetchMode { LOCAL, FRESH }... or similar. WDYT?

snazy added 3 commits May 15, 2024 17:25
Adds ability to cache `Reference` objects, avoiding round-trips to the backend database, beneficial for read heavy workloads.

Reference-caching can be enabled via two new configuration options: to define the expiration time for `Reference`s (holding the current HEAD/tip) and to define the expiration time for non-existing `Reference`s.

Looking up references happens via the name of the reference, usually without the reference type (aka whether it is a branch or a tag), so Nessie has to look up both types - the given name as a branch and the given name as a tag. This is where negative-caching comes into play, because that caches the existing entry and the non-existing "other" reference type. Hence, if you enable reference-caching, it is recommended to also enable negative reference-caching.

Operations that are about to change a reference (committing and reference create/assign/delete operations), always consult the backing database, implicitly refreshing the cache.

Mutliple Nessie (against the same repository) do not communicate with each other. If for example a commit happened against one Nessie instance, the other instances may or may not return the new commit. This is why this feature is still experimental and only useful for Nessie setups with a _single_ Nessie instance. This change will be later with another change to allow distributed cache-eviction, so that other Nessie instances accessing the same repository work fine.
@snazy snazy force-pushed the cache-references branch from b27e9b9 to 8d9360a Compare May 15, 2024 15:55
@dimas-b
Copy link
Member

dimas-b commented May 15, 2024

Should we perhaps expose a way for getting a fresh Reference at the client API level?.. mostly as a way to allow clients to avoid references flipping back to an old version on consecutive API calls.

@dimas-b
Copy link
Member

dimas-b commented May 15, 2024

Also, do we want to keep it "experimental" (PR title) still?

@snazy
Copy link
Member Author

snazy commented May 16, 2024

Should we perhaps expose a way for getting a fresh Reference at the client API level?.. mostly as a way to allow clients to avoid references flipping back to an old version on consecutive API calls.

I don't understand, TBH. How should a client "flip back" to an old version? The cached value is always up to date (assuming a single-node Nessie instance, later with #8463 on all instances)

Also, do we want to keep it "experimental" (PR title) still?

At least until #8463 is not merged.

@snazy snazy merged commit cfb7f69 into projectnessie:main May 16, 2024
16 checks passed
@snazy snazy deleted the cache-references branch May 16, 2024 06:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants