
hash: A waiting list match is a formal revalidation #4032

Closed
wants to merge 10 commits

Conversation

@dridi (Member) commented Dec 20, 2023

From the main commit message:

Not only does the waiting list operate on an objcore, it also passes a reference to requests before they reembark a worker. This way, when an objcore is already present during lookup, we can attempt a cache hit without holding the objhead lock.

Instead of repurposing the req::hash_objhead field into an equivalent req::hash_objcore, the field is actually removed. In order to signal that a request comes back from its waiting list, the lifetime of the req::waitinglist flag is extended until cnt_lookup() is reentered.

This change creates a new first step in the lookup process:

  • rush match (when coming from the waiting list)
  • hash lookup (unless coming from the waiting list)
  • objhead lookup (unless there was a rush hit)

If the rush match is a hit, the objhead lock is briefly acquired to release the req's reference on the objhead, relying solely on the objcore's reference like a normal hit.

This shifts the infamous waiting list serialization phenomenon to the vary header match. Knowing that a rushed object is guaranteed to lead to a cache hit allows the rush policy to be applied wholesale, instead of exponentially. Only requests incompatible with the objcore vary header may reenter the waiting list, a scenario no different from the spurious rush wakeups when operating solely on objhead updates.
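
For illustration, the rush match step roughly takes the following shape. This is a sketch reconstructed from the review excerpts below, not the literal patch; hsh_rush_match(), ObjHasAttr(), ObjGetAttr() and VRY_Match() are existing names in the tree, but the body is simplified.

    static int
    hsh_rush_match(struct req *req)
    {
        struct objcore *oc;
        const uint8_t *vary;

        oc = req->objcore;              /* reference handed over at rush time */
        CHECK_OBJ_NOTNULL(oc, OBJCORE_MAGIC);
        AZ(oc->flags & OC_F_BUSY);      /* only unbusied objcores are rushed */

        if (oc->flags)                  /* failed, withdrawn, hfm/hfp... */
            return (0);

        if (!ObjHasAttr(req->wrk, oc, OA_VARY))
            return (1);                 /* no Vary: guaranteed hit */

        vary = ObjGetAttr(req->wrk, oc, OA_VARY, NULL);
        AN(vary);
        return (VRY_Match(req, vary));  /* the remaining serialization point */
    }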

If a cacheable object was inserted in the cache, but had already expired, this behavior enables cache hits. This can be common in multi-tier Varnish setups, where one Varnish server may serve a graced object to another, but it holds for any origin server that may serve stale yet valid responses.

The waiting list enables a proper response-wide no-cache behavior from now on, but the built-in VCL prevents it by default. This is also the first step towards implementing no-cache and private support at the header field granularity.

This is a mixed bag of cherry-picks from #3992 and new original commits. See individual commit messages for more details.

@dridi (Member, Author) commented Jan 2, 2024

Rebased against master and force-pushed to solve a merge conflict introduced by 0835051.

@dridi (Member, Author) commented Jan 3, 2024

Rebased against master and force-pushed to solve a merge conflict introduced by 7aeca4d.

@nigoroll (Member) left a comment

I agree with the overall goal to make rushes more specific. Also, the Vary check of the referenced oc upon waitinglist return seems to make a lot of sense to me.
But when it comes to the details of the changes, it seems I do not understand their motivation well enough, and until I do, they seem questionable to me.

bin/varnishd/cache/cache_hash.c (outdated, resolved)
bin/varnishd/cache/cache_req_fsm.c (outdated, resolved)
@@ -568,7 +570,7 @@ HSH_Lookup(struct req *req, struct objcore **ocp, struct objcore **bocp)
return (HSH_MISS_EXP);
}

AN(busy_found);
CHECK_OBJ_NOTNULL(busy_oc, OBJCORE_MAGIC);
Member:

nitpick: the oc was already magic-checked within the VTAILQ_FOREACH() (L426)

Member Author:

I agree, I did this to really mark the difference between a mere boolean and an actual struct pointer, but it's purely cosmetic.

case HSH_HITMISS_EXP:
CHECK_OBJ_NOTNULL(oc, OBJCORE_MAGIC);
AZ(busy_oc);
xid = VXID(ObjGetXID(wrk, oc));
Member:

existed before, just noticed it here: If we gained a reference, we can move ObjGetXID() (could be expensive) and EXP_Dttl (not so much) outside the critical region.

Member Author:

Yes, I definitely focused on preserving the same behavior during the split.
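
A sketch of the suggested reordering (illustrative only, assuming the split lookup returns with a reference gained on oc):

    Lck_Lock(&oh->mtx);
    lr = hsh_objhead_lookup(oh, req, &oc, &busy_oc);    /* gains a ref on oc */
    Lck_Unlock(&oh->mtx);

    if (lr == HSH_HITMISS_EXP) {
        CHECK_OBJ_NOTNULL(oc, OBJCORE_MAGIC);
        /* safe outside the critical region, we hold a reference */
        xid = VXID(ObjGetXID(wrk, oc));
        dttl = EXP_Dttl(req, oc);
    }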

oh = hash->lookup(wrk, req->digest, &wrk->wpriv->nobjhead);
}

lr = hsh_objhead_lookup(oh, req, &oc, &busy_oc);
Member:

see above: I think it should be clear when oc gained a reference and when not, and we might want to at least add a comment that busy_oc never gains a reference.

Member:

The way gh displays the comments, this "above" actually turned into "below". It was above when I added it.

bin/varnishd/cache/cache_fetch.c (resolved)
oh = oc->objhead;
CHECK_OBJ(oh, OBJHEAD_MAGIC);

Lck_Lock(&oh->mtx);
Member:

Take the oh->mtx a second time and duplicate the rush code for what exactly?

Member Author:

What do you mean by second time?

The point here is to unbusy the objcore on behalf of the nonexistent fetch task that would normally have ensured it.

Member Author:

I'm finally seeing the double lock problem, which happens below the code referenced by your comment.

Addressed in ea4ba6f.

bin/varnishd/cache/cache_hash.c (resolved)
CHECK_OBJ_NOTNULL(req->hash_objhead, OBJHEAD_MAGIC);
oh = req->hash_objhead;
oh = req->objcore->objhead;
(void)HSH_DerefObjCore(wrk, &req->objcore);
Member:

Adds to oh mtx contention

@dridi (Member, Author), Jan 3, 2024:

I agree, and I didn't find a clean solution to avoid that. In theory, this is compensated by less contention when rushing cacheable objects.

edit: which should only be prevented by a vary mismatch.

@@ -731,7 +731,12 @@ hsh_rush1(const struct worker *wrk, struct objcore *oc, struct rush *r)
Lck_AssertHeld(&oh->mtx);

AZ(oc->flags & OC_F_BUSY);
max = oc->flags ? cache_param->rush_exponent : INT_MAX;
if (oc->flags & OC_F_WITHDRAW)
Member:

So we have now removed the rush limit from HSH_DerefObjCore() and replaced it with a new HSH_Withdraw() function and a new flag, to basically just implement what was the rush limit before.

How are we not just shuffling stuff from left to right here?

Member Author:

I am driving this based on the objcore, so yes, I'm shifting the rush policy from call sites to the objcore state, and no, that's not just shuffling things around.

The "withdraw" logic should have been part of the "abandon" logic, but I forgot about it by the time I "completed" the pull request. It was intended from the beginning though, as a couple of commit messages announce an increase of waiting list activity that should be temporary.

The good news, I guess, is that when I realized I had forgotten this detail, I came up with "withdraw", which describes the use case better than "abandon" (and avoids overloading the latter).
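
For reference, a minimal sketch of the withdraw building block as described here and in the commit messages; hsh_rush1(wrk, oc, r) and the two-argument HSH_DerefObjCore() match the hunks quoted in this review, but the body itself is conjecture, not the patch.

    void
    HSH_Withdraw(struct worker *wrk, struct objcore **ocp)
    {
        struct objcore *oc;
        struct objhead *oh;
        struct rush rush;

        CHECK_OBJ_NOTNULL(wrk, WORKER_MAGIC);
        TAKE_OBJ_NOTNULL(oc, ocp, OBJCORE_MAGIC);
        oh = oc->objhead;
        CHECK_OBJ_NOTNULL(oh, OBJHEAD_MAGIC);
        INIT_OBJ(&rush, RUSH_MAGIC);

        Lck_Lock(&oh->mtx);
        AZ(oc->flags & ~OC_F_BUSY);     /* a plain busy objcore... */
        oc->flags = OC_F_WITHDRAW;      /* ...now encodes the rush-one policy */
        hsh_rush1(wrk, oc, &rush);      /* wakes at most one waiter */
        Lck_Unlock(&oh->mtx);
        hsh_rush2(wrk, &rush);

        (void)HSH_DerefObjCore(wrk, &oc);
    }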

@dridi (Member, Author) commented Jan 4, 2024

Discussion on IRC:

15:26 < dridi> slink: morning
15:26 < slink> bonjour dridi
15:26 < dridi> would you like to discuss waiting list? if you are not available don't worry, there's no rush
15:27 < slink> 1sec
15:27 * dridi counts to one
15:29 < slink> from "great evils of humanity" we present to you: the symbolic second
15:29 < slink> #4032 (comment)
15:30 * dridi finished counting to one
15:30 < slink> You might be right, that the back and forth was mostly due to the review, but in general, I am not so sure if the split is helpful. It might be. This is not paraphrasing "I think you are wrong", I am really not sure
15:31 < dridi> but at the same time i agree with you that if better, it's not all neat and tidy
15:32 < dridi> miss vs expired miss for example...
15:33 < dridi> do you need time to catch up with my other replies?
15:33 < slink> I was wondering if having a function boundary at an interface like #4032 (comment) made sense, where the split-out function would just return some objects and leave the decision logic in HSH_Lookup
15:34 < dridi> right
15:34 < slink> But my main concern really is that, with the abstraction we should be totally clear about when we take a reference and when not.
15:34 < dridi> so you want to return all of oc, exp_oc and busy_oc, right?
15:35 < dridi> it probably deserves to be tested
15:35 < slink> I have not tried it, so I do not know how the code looks, I just have the feeling that it might be cleaner
15:35 < dridi> oh btw a minute ago i found one more thing to squash, i'll push it now
15:35 < slink> so my main irritation is regarding OC_F_BUSY.
15:36 < dridi> done
15:36 < slink> My understanding is that we currently only clear this flag by calling HSH_Unbusy(), and I do not see why we should change that
15:37 < slink> How is the busy flag relevant for the waitinglist optimization?
15:37 < dridi> sure, i can take a stab at it now
15:37 < dridi> so, when i started looking at the waiting list a year ago, i noticed that we trigger rushes from many places
15:38 < dridi> directly, or indirectly like when we free an objcore when the last ref is gone
15:39 < slink> yes. This goes back to when I looked at this, I had a pr/patch which would reduce the rushes to what I thought was a correct minimum and it was feared by others that we would miss rushes, leading to what surfaces as hangs/lockups
15:39 < dridi> so my reasoning was that we were lacking a step to clearly signal objcore usability or lack thereof
15:39 < slink> So I agree on the premise that we have too many rushes.
15:39 < dridi> now, HSH_Unbusy() was the step to clearly signal objcore usability by other tasks
15:39 < dridi> but it was only the happy path
15:40 < dridi> my conclusion was that we needed something for the failing path, embodied by HSH_Fail()
15:40 < dridi> now HSH_Fail() is interesting because it may happen after HSH_Unbusy() was called because of streaming
15:41 < dridi> at this point i thought i was done with ensuring this step was always reached (triggering a rush)
15:41 < dridi> the test suite failed, that's when i realized we could never fetch a miss, that's the withdraw case
15:42 < dridi> at the same time i realized that a purge transition was acting like a withdrawn miss
15:42 < dridi> hence the introduction of HSH_Withdraw()
15:42 < slink> hmmm
15:42 < dridi> it was supposed to exist regardless to bring back the rush-just-one case
15:43 < dridi> and with that behind, all that was left failing in my test suite were synth responses and request bodies
15:43 < dridi> so to me, the BUSY flag meant that we were waiting for a caching outcome when parked on the waiting list
15:44 < dridi> since we'd never park for either a synth resp or a request body, they don't need the flag
15:45 < slink> My mental model has always been that Unbusy signals when the headers are complete such that we can reason on vary and potentially other criteria (catflap).
15:45 < dridi> yes, for the streaming case
15:45 < slink> also for non-streaming case, IMHO.
15:46 < slink> When we also have the body, we can still reason on the headers.
15:46 < dridi> yes, but you get notified when everything is ready, not just the headers
15:46 < dridi> ok, we agree
15:46 < slink> Now I am not sure if I devoured your patch enough, but if there are some OCs which basically never un-busy because they basically never busy, can't we rush right away?
15:47 < dridi> no, because if they are never looked up with a busy flag (synth and vrb) there's nothing on their waiting lists
15:47 < dridi> private_oh's waiting list should always be empty anyway
15:48 < dridi> we should maybe add assert(oh != private_oh) in hsh_rush
15:48 < slink> we should do that straight away
15:48 < dridi> i will push a commit to the branch
15:48 < slink> So sure, vrb can never wake up anything, and for synth, we should rush before the synth, not after
15:49 < slink> but I somehow don't understand why we need to change the BUSY semantics to do that
15:49 < dridi> yes, that's what HSH_Withdraw() does
15:50 < slink> I fear we'll be adding special casing
15:50 < dridi> if a miss doesn't fetch because it goes to synth, it rushes potential waiters
15:50 < dridi> special casing is already the norm
15:51 < slink> but it needs to, because another request now needs to fetch, right?
15:51 < dridi> whenever you deref an objcore you have to figure out what to rush between 0, 1, policy, nuggets, etc
15:51 < slink> the idea was that it should rush 1 only
15:51 < slink> the idea was that at the call site of the deref we know
15:52 < slink> but as I said, I am aware that the calls are not optimal
15:52 < dridi> knowledge at the call site is what makes the waiting list unwieldy today
15:52 < slink> my naive idea would be to just fix those call sites, add your vary check to lookup and be done?
15:53 < slink> I am not convinced that fixing the call sites is not sufficient.
15:54 < slink> Again, I KNOW that at the time my PR was "softened" out of fear that we could miss to rush where we need to
15:55 < dridi> having an exponential rush for cacheable objects is what forces us to keep rushes from everywhere
15:55 < slink> and I think your attempt is the first after like a decade to approach this topic again
15:55 -!- slink [[email protected]] has quit [Connection closed]
15:55 -!- slink [[email protected]] has joined #varnish-hacking
15:55 -!- slink1 [[email protected]] has joined #varnish-hacking
15:55 < dridi> what's the last reply you saw?
15:56 < slink1> flaky wifi.
15:56 < slink1> (15:52:21) dridi: knowledge at the call site is what makes the waiting list unwieldy today
15:56 < slink1> my naive idea would be to just fix those call sites, add your vary check to lookup and be done?
15:56 < slink1> I am not convinced that fixing the call sites is not sufficient.
15:56 < slink1> Again, I KNOW that at the time my PR was "softened" out of fear that we could miss to rush where we need to
15:56 < slink1> and I think your attempt is the first after like a decade to approach this topic again
15:56 < dridi> 15:55 < dridi> having an exponential rush for cacheable objects is what forces us to keep rushes from everywhere
15:57 < slink1> ah right, that is a good point indeed.
15:57 < dridi> so yes, there is this "unbusy" thing i mentioned earlier
15:58 < dridi> the mandatory step where unbusy is the happy path, fail is the sad path, and withdraw is the lack of fetch
15:58 < dridi> and in addition, the objcore conveys the rush outcome
15:58 < dridi> exponential or wholesale
15:59 < slink1> hmmm
15:59 < dridi> wholesale allows us to be done with it at the mandatory unbusy step
15:59 < slink1> wholesale could still wake up too many waiters which then race
15:59 < dridi> fail triggers an immediate exponential rush, whereas before it was delayed until the deref
16:00 < dridi> waiters don't race, they come back with an oc ref
16:00 < slink1> "race" to get the right object
16:00 < dridi> no
16:00 < dridi> that race is actually gone with this patch series
16:00 < dridi> you give the req the oc ref at rush time
16:01 < slink1> but that's still a bet right?
16:01 < dridi> so when they wake up and execute again, they already have everything they need
16:01 < slink1> it could still turn out to not match vary
16:01 < dridi> yes
16:01 < slink1> that's what i mean
16:02 < dridi> vary is the only reason to still see spurious wake ups and even waiting list serialization
16:02 < slink1> and the more waiting reqs we wake up, the more they will only sommon again on the next waiting list
16:02 < slink1> at least that is my understanding
16:02 < dridi> you bring an interesting point
16:02 < dridi> should we move the rush match at rush time?
16:03 < dridi> rush all compatible reqs, and up to rush_exponent incompatible variants?
16:03 < dridi> i'm not sure we can do that from the worker triggering the rush
16:04 < dridi> it also becomes an O(n) match for that worker vs O(1) for individual reqs
16:04 < slink1> oh and btw, I think the catflap needs to be considered in hsh_rush_match()
16:05 < dridi> oh, good catch
16:05 -!- slink [[email protected]] has quit [Ping timeout: 301 seconds]
16:05 < dridi> that one's going to be a bummer though, we need to lock the oh, don't we?
16:06 < slink1> yes
16:06 < dridi> i'm tempted to fail the match if there is a vcf
16:07 < slink1> At this point I feel like I need to think about this more. I am sure that we can optimize the rushes, and I absolutely agree with your analysis that there are too many
16:07 < dridi> not just too many rushes, also not enough wake ups!
16:07 < dridi> the test case is verbatim the same as the vtc of the previous pull request
16:07 < slink1> In an ideal world, what I would like to do is change the existing code (fix DerefObjCore, add earlier rushes where we can) and then understand what we can not do without more involved changes like changing the OC_F_BUSY semantics etc
16:08 < dridi> it enables more waiting list hits
16:08 < slink1> You might be right
16:09 < slink1> I am just not capable of seeing if you are.
16:09 < dridi> i appreciate the time you already spent on it
16:09 < slink1> it is interesting and I do value the time and effort you put into it

@nigoroll (Member) commented Jan 8, 2024

I looked at this again, and overall, I would like to ask for the following:

For one, I would prefer to split refactoring and semantic changes: I am not convinced of the HSH_Lookup() change as proposed, and IIUC it has nothing to do with the rest of the PR.

IIUC, the other changes can be summarized as:

  • Retire hash_objhead and just use req->objcore for the waitinglist
  • Optimize HSH_Lookup with hsh_rush_match()
  • Rush earlier when we transition to a private object

I do not see how we need to refactor so much to achieve these goals, and I would prefer if we first addressed these based on the existing interfaces (with some adjustments, maybe).

In particular, the OC_F_BUSY change looks dubious to me.

But most importantly, there is something about an argument from 0a87e86 which I do not understand:

Knowing that a rushed object is guaranteed to lead to a cache hit allows the rush policy to be applied wholesale

I really do not understand how this patch series achieves this, I don't even understand how it can be achieved at all for the general (Vary) case. Yes, this is true if the busy object has no Vary (we do not know if the busy object has a Vary before it gets unbusied), but in general, do I not understand or is it just not true?

@dridi (Member, Author) commented Jan 22, 2024

IIUC, the other changes can be summarized as:

  • Retire hash_objhead and just use req->objcore for the waitinglist
  • Optimize HSH_Lookup with hsh_rush_match()
  • Rush earlier when we transition to a private object

That's a summary capturing the essential changes, but slightly off.

The req->hash_objhead retirement is not so much about "just using" req->objcore (req->waitinglist's scope is also extended) but rather about driving the rush policy from an "unbusied" objcore.

The hsh_rush_match() function is not here as an optimization, but as part of the removal of waiting list serialization (modulo incompatible variants).

And the new rush "latches" are not so much here to rush earlier but really to nail consistent semantics for OC_F_BUSY (and in that regard other rush paths go away for the same reason).

Knowing that a rushed object is guaranteed to lead to a cache hit allows the rush policy to be applied wholesale

I really do not understand how this patch series achieves this, I don't even understand how it can be achieved at all for the general (Vary) case.

It does so as far as compatible variants go, so a Vary: User-Agent header could still trigger severe serialization when coupled with a zero TTL and grace. I never claimed otherwise; this is what my first presentation on the topic was about.

Please check slide 12: https://github.com/varnishcache/varnish-cache/wiki/VDD23Q1#compliance

but in general, do I not understand or is it just not true?

I think you missed it from 0a87e86's commit message:

This shifts the infamous waiting list serialization phenomenon to the vary header match.

Rewinding in your comment:

I do not see how we need to refactor so much to achieve these goals, and I would prefer if we first addressed these based on the existing interfaces (with some adjustments, maybe).

In particular, the OC_F_BUSY change looks dubious to me.

Some of the refactoring (like the lookup split in two steps) is me feeling uneasy about touching this part, and in particular the locking logic. Separating the lookup search from lookup outcome really helped me (and you seemed to appreciate one of the effects on is_hitmiss handling but I can no longer find your comment). The rest of the refactoring is me preparing the landscape to land the change.

I can try a patch series without the non-essential refactoring. You found a few things that require the change to be somehow amended anyway (like the overlooked VCF).

Now, regarding the OC_F_BUSY semantics as I understood them, which resulted in changes in this patch series, they are the following:

  • during a lookup we consider "busy" objcores as a fallback
  • busy objcores don't have a caching outcome, yet
  • only fetch tasks can insert cacheable objects the first time
    • they can later be revived by HSH_Insert(), and trigger a rush
  • therefore busy objcores can only belong to fetch tasks
  • therefore the following cases should not block a lookup:
    • synthetic responses
    • request bodies
    • purges (they actually do, until the purge is performed)
    • misses transitioning to vcl_synth

In the last case, the moment we know we won't trigger a fetch task, there is no reason (besides hurting latency to sell more licenses) to keep the OC_F_BUSY flag. Likewise, when a fetch task fails, the object should be "unbusied" immediately (we just got the cacheability outcome) instead of waiting for the last objcore reference to be dropped.
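
As a sketch, the miss-to-synth transition could then look like this in cnt_miss() (illustrative; the actual transition code in the patch differs):

    case VCL_RET_SYNTH:
        /* no fetch task will be scheduled: withdraw now instead of
         * keeping OC_F_BUSY until the last reference is dropped */
        HSH_Withdraw(wrk, &req->objcore);
        req->req_step = R_STP_SYNTH;
        return (REQ_FSM_MORE);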

I hope this answers all of your questions, thank you for your patience!

@dridi dridi marked this pull request as draft January 22, 2024 12:50
@nigoroll (Member):

FTR, need to devour this:

(20:07:05) slink: done reading once, I will need to get back to it again. There is still something about your "guaranteed to lead to a cache hit" argument which I do not get, but I am having trouble phrasing it. Maybe this regarding "It does so as far as compatible variants go": How do we know variants are compatible before doing a vary compare? How can we do better than wait for a response and then wake up waiters? Sure, we could check the waiters vary before waking them up, but that moves the burden of the vary check to a single thread vs. multiple, and if your patch does this, I must have overlooked it.
(20:13:20) dridi: also
(20:13:43) dridi: the deferred rush match happens outside of the objhead lock
(20:14:03) dridi: so i think it should remain the way i did it
(20:14:57) dridi: i suppose "guaranteed cache hit" should always be followed by "modulus compatible variant"
(20:15:44) dridi: i understand your confusion
(20:16:04) dridi: i still need to amend the rush match to bail out in the presence of a cat flap
(20:16:41) dridi: to answer your question about doing it better
(20:16:56) dridi: we could have some kind of predictive vary in the form of a hash
(20:17:24) dridi: some precomputed vary outcome
(20:19:43) dridi: still, i wouldn't check it in hsh_rush1() under the oh lock
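
Purely speculative: the "precomputed vary outcome" idea could boil down to a digest comparison along these lines (nothing below exists in the PR; the digest length and both parameters are hypothetical):

    #include <stdint.h>
    #include <string.h>

    #define VARY_DIGEST_LEN 32    /* hypothetical, e.g. a SHA-256 digest */

    /* Equal digests would make the rushed objcore a guaranteed hit;
     * unequal digests fall back to a full VRY_Match() at lookup time. */
    static int
    hsh_vary_digest_match(const uint8_t *req_digest, const uint8_t *oc_digest)
    {
        return (memcmp(req_digest, oc_digest, VARY_DIGEST_LEN) == 0);
    }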

@dridi (Member, Author) commented Jan 23, 2024

Yesterday I rewrote the patch series without the offending refactoring, and taking changes in a slightly different order helped better lay out the moving parts.

I tried to avoid confusing wording in the commit log, and tried to be more explicit about semantics, so I hope this will be overall easier to review.

@nigoroll (Member):

FTR, I understand you are waiting for my feedback, but this PR needs a longer attention span and I need to tend to other work for the remainder of this week. Sorry.

@dridi (Member, Author) commented Jan 31, 2024

My goal was to answer everything before you'd take another dive, so in that sense I succeeded ;)

@mbgrydeland (Contributor) left a comment

I believe this PR has a lot of merit. It cleans up the code, making it less magic, and will significantly optimize the case where there are multiple popular VARYants on an OH.

The PR does touch quite a few of the scary code paths though. I do wonder about places that rely on OC_F_BUSY being set while under lock, and whether there are gotchas there. I didn't do a complete search for that sort of fallout in this round.

const uint8_t *vary;

oc = req->objcore;
CHECK_OBJ_ORNULL(oc, OBJCORE_MAGIC);
Contributor:

This should be CHECK_OBJ_NOTNULL(). Callers should not call it if req->objcore == NULL. That makes it obvious at the call sites that this function needs an OC to operate.

Member Author:

No objection.

bin/varnishd/cache/cache_hash.c (resolved)
return (0);

AZ(oc->flags & OC_F_BUSY);
if (oc->flags)
Contributor:

This test feels wrong to me. Are any (both present and future) OC flags always a reason for the rushing to fail?

Member Author:

You caught me red handed.

As of today, yes, all OC flags imply that the object is not serviceable in the context of a rush. The future is much less certain. We could add an ObjRushable(oc) function; I believe there's another spot where it would be relevant.
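
A sketch of that hypothetical helper (not in the PR):

    /* hypothetical: centralize the "serviceable in the context of a
     * rush" decision so any future flag is handled in one place */
    static inline int
    ObjRushable(const struct objcore *oc)
    {
        CHECK_OBJ_NOTNULL(oc, OBJCORE_MAGIC);
        AZ(oc->flags & OC_F_BUSY);
        return (oc->flags == 0);
    }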

CHECK_OBJ_ORNULL(req->vcf, VCF_MAGIC);
AN(hash);

hsh_prealloc(wrk);
if (DO_DEBUG(DBG_HASHEDGE))
hsh_testmagic(req->digest);

if (req->hash_objhead != NULL) {
if (hsh_rush_match(req)) {
Contributor:

This should be if (req->objcore != NULL && hsh_rush_match(req)), ref comment above. Makes it obvious that this path is only taken if a req->objcore reference exists before calling.

Member Author:

No objection, this is basically the call site for your first comment.

* the objcore for pass objects. The other reference is held by
* the current fetch task.
*/
if (oh == private_oh) {
Contributor:

I think this should better be if (oc->flags & OC_F_PRIVATE). Paves the way for multiple private_ohs to reduce lock contention.

Member Author:

Noted.

bin/varnishd/cache/cache_hash.c (resolved)

if (oc->flags & (OC_F_WITHDRAW|OC_F_FAILED))
max = 1;
else if (oc->flags)
Contributor:

This test has me concerned. Why is this the right choice for any OC_F_* flag other than withdraw and failed? And would that reasoning also apply to any future flag?

Member Author:

This is the other candidate for ObjRushable(oc).

@@ -30,7 +30,8 @@

/*lint -save -e525 -e539 */

OC_FLAG(BUSY, busy, (1<<1)) //lint !e835
OC_FLAG(WITHDRAW, withdraw, (1<<0)) //lint !e835
Contributor:

Should this be named WITHDRAWN instead? When set I believe the past tense is the right one, as in it has already been withdrawn.

Member Author:

Good point.

@dridi (Member, Author) commented Feb 5, 2024

I added 6 SQUASHME commits to address @mbgrydeland's review and one more to add an assertion.

@mbgrydeland (Contributor) left a comment

I believe most of my concerns have been addressed.

Though I do have a new concern, and I'm uncertain how it will impact the changeset. What drives the rushing of HFP/HFM objects with this patch set? It looks to me like it will only rush once for this case, and not have those that were woken because of an HFP/HFM wake up new requests again. Or am I missing something?

@@ -369,7 +369,9 @@ hsh_rush_match(struct req *req)
CHECK_OBJ_NOTNULL(oc, OBJCORE_MAGIC);

AZ(oc->flags & OC_F_BUSY);
if (oc->flags)
if (oc->flags & (OC_F_WITHDRAWN|OC_F_HFM|OC_F_HFP|OC_F_CANCEL|
Contributor:

Why are HFM and HFP being excluded here? Aren't they supposed to explicitly do the rushing?

Member Author:

At this point the request has already been rushed, and it's failing to match an objcore with either HFP or HFM, so it will proceed with a regular lookup. In all likelihood it will proceed in either vcl_pass or vcl_miss accordingly, unless another compatible object landed in the cache between rush time and lookup time (when we grab the objhead lock).

Contributor:

I see it now how that comes together, at least for HFP objects. The rushing cascade will continue in HSH_Lookup() when the HFP flag is found and hsh_deref_objhead_unlock() is called at https://github.com/Dridi/varnish-cache/blob/objcore_rush/bin/varnishd/cache/cache_hash.c#L556. hsh_deref_objhead_unlock() does a wake of rush_exponent tasks when finding a HFP object.

Though I worry that there is no similar effect for HFM with this patch set. For those, only rush_exponent tasks will be woken during HSH_Unbusy(). But these will not rush again during their HSH_Lookup() (https://github.com/Dridi/varnish-cache/blob/objcore_rush/bin/varnishd/cache/cache_hash.c#L581), pausing the cascade. It will wake again when one of those does HSH_Unbusy().

Member Author:

HFM will cascade, but not as soon as HFP. Unlike HFP, HFM may break the exponential cascade.

Since we treat it as a miss, and retain the possibility of inserting a cacheable object on the next fetch, we will either have a wholesale rush (cacheable) or an exponential rush at "unbusy" time (if all rush_exponent fetch tasks trigger rush_exponent waiting list rushes of their own).
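
For reference, the resulting policy selection, reconstructed from the hsh_rush1() hunks quoted in this review (a sketch, not verbatim from the patch):

    AZ(oc->flags & OC_F_BUSY);
    if (oc->flags & (OC_F_WITHDRAWN | OC_F_FAILED))
        max = 1;                            /* replace exactly one fetch task */
    else if (oc->flags)                     /* hfm, hfp, cancel, ... */
        max = cache_param->rush_exponent;   /* exponential cascade */
    else
        max = INT_MAX;                      /* cacheable: rush wholesale */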

Contributor:

Checking this again, I guess this behaviour of HFM is the same as before this PR, and with that knowledge I don't have any objection to this PR.

Though I was caught by surprise wrt the delay inherent to the HFM rushing. My understanding was that HFM would behave the same as HFP: the rushing is only used to space out the task scheduling internal to Varnish, and is not a method of being gentle with the backend. I think it's worth revisiting this logic regardless of this PR, to see if we have the intended behaviour here.

@mbgrydeland mbgrydeland self-requested a review February 12, 2024 13:40
@nigoroll (Member):

I still did not find the hours which this PR requires for another review round, but I just tried it with one of my test setups and ran into this null pointer deref straight away:

#6  0x000055f99a94b07b in child_signal_handler (s=11, si=0x7f048982eaf0, c=0x7f048982e9c0) at cache/cache_main.c:326
        buf = "Signal 11 (Segmentation fault) received at 0x58 si_code 1", '\000' <repeats 966 times>
        sa = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        req = 0x7f0476664a60
        a = 0x58 <error: Cannot access memory at address 0x58>
        p = 0x7f0480d903f8 "8\003ـ\004\177"
        info = 0x0
#7  <signal handler called>
No locals.
#8  0x000055f99a93dab2 in HSH_Lookup (req=0x7f0476664a60, ocp=0x7f0480d8f670, bocp=0x7f0480d8f668) at cache/cache_hash.c:435
        wrk = 0x7f0480d90338
        oh = 0x0
        oc = 0x7f0468a03880
        exp_oc = 0x7f0480d8f620
        vr = 0x7f0476664bb8
        exp_t_origin = 0
        busy_found = 1
        vary = 0x7f0480d8f640 "\220\366؀\004\177"
        boc_progress = -1
        xid = 0
        dttl = 0
#9  0x000055f99a95d470 in cnt_lookup (wrk=0x7f0480d90338, req=0x7f0476664a60) at cache/cache_req_fsm.c:578
        oc = 0x7f0468a03880
        busy = 0x0
        lr = HSH_BUSY
        had_objcore = 1
#10 0x000055f99a959d07 in CNT_Request (req=0x7f0476664a60) at cache/cache_req_fsm.c:1192
        ctx = {{magic = 2161702624, syntax = 32516, vclver = 2594048900, method = 22009, vpi = 0x7f0480d903b8, msg = 0x7f0476201010, vsl = 0x5800000000, vcl = 0x7f0476200f60, 
            ws = 0x7f0480d8f710, sp = 0x55f99a95f747 <ses_get_attr+343>, req = 0x7f0480d8f740, http_req = 0x800000094, http_req_top = 0x7f0476200f20, http_resp = 0x128, bo = 0x7f0480d8f730, 
            http_bereq = 0x55f99a960327 <SES_Get_proto_priv+39>, http_beresp = 0x7f0480d8f740, now = 6.8999794284505472e-310, specific = 0x7f0480d8f750, 
            called = 0x55f99a99d16e <http1_getstate+30>}}
        wrk = 0x7f0480d90338
        nxt = REQ_FSM_MORE
#11 0x000055f99a99ce0a in HTTP1_Session (wrk=0x7f0480d90338, req=0x7f0476664a60) at http1/cache_http1_fsm.c:392
        hs = 22009
        sp = 0x7f0476200f20
        st = 0x55f99aa51e5e <H1PROC> "HTTP1::Proc"
        i = 32516
#12 0x000055f99a99c350 in http1_req (wrk=0x7f0480d90338, arg=0x7f0476664a60) at http1/cache_http1_fsm.c:87
        req = 0x7f0476664a60
#13 0x000055f99a98d382 in Pool_Work_Thread (pp=0x7f0482200000, wrk=0x7f0480d90338) at cache/cache_wrk.c:496
        tp = 0x7f0480d8f858
        tpx = {list = {vtqe_next = 0x7f0480d51360, vtqe_prev = 0x7f0482200078}, func = 0x55f99a99c1c0 <http1_req>, priv = 0x7f0476664a60}
        tps = {list = {vtqe_next = 0x0, vtqe_prev = 0x0}, func = 0x55f99a953320 <pool_stat_summ>, priv = 0x7f0482205000}
        tmo = 0
        now = 1707760746.3839386
        i = 5
        reserve = 6
...

(gdb) fr 8
#8  0x000055f99a93dab2 in HSH_Lookup (req=0x7f0476664a60, ocp=0x7f0480d8f670, bocp=0x7f0480d8f668) at cache/cache_hash.c:435
435			boc_progress = oc->boc == NULL ? -1 : oc->boc->fetched_so_far;

@dridi (Member, Author) commented Feb 13, 2024

It is embarrassing to find that I forgot to query the boc under the oh lock, especially since we discussed it when the construct was first introduced (see #3566 (comment)). At the very least we found out quickly, thanks!

Taken care of in 141308d, could you please give it another try?
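
For reference, the invariant the fix restores, in sketch form (the actual change is in 141308d):

    /* oc->boc can go away as soon as the fetch task completes, so the
     * progress snapshot must happen while oh->mtx is held */
    Lck_Lock(&oh->mtx);
    if (oc->boc != NULL)
        boc_progress = oc->boc->fetched_so_far;
    else
        boc_progress = -1;
    Lck_Unlock(&oh->mtx);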

dridi added a commit that referenced this pull request Feb 28, 2024
This adds coverage for a non-delivery transition from vcl_hit, where the
busy objcore would drop its sole reference in the event of a grace hit.
The lack of coverage was visible in the gcov dashboard:

    669   580   if (busy != NULL) {
    670   0     	(void)HSH_DerefObjCore(wrk, &busy, 0);
    671   0     	VRY_Clear(req);
    672   0     }

There should now be at least one pass inside this block.

Refs #4032
@dridi (Member, Author) commented Feb 28, 2024

Thanks to @simonvik, who is helping me get traffic through this pull request, we found that a test case for vmod_saintmode would panic. It quickly became obvious to me that this case, which had slipped through the cracks, shows that the model for this pull request is coherent.

It's the new HSH_Withdraw() building block that solves the problem, and in hindsight it's obvious that an objcore is effectively being withdrawn.

The one-liner fix is in cf932ca and coverage was added directly to the master branch in 3eeb8c4.

@nigoroll (Member):

FYI: For a multitude of reasons I still did not find the few hours I would like to have for quiet work to thoroughly review the new state of this PR. I could spend about half an hour with it, and I currently feel both under- and overwhelmed by it. Underwhelmed, because I still fail to see why the goals of this PR need so many changes (I see this as my own mistake), and I am overwhelmed by the fact that at this point the PR can effectively only be reviewed as one large diff due to the incremental changes.

I am still worried about the change of the busy state model and the general impact this might have, I wonder if we really need the withdrawn facility and I am not convinced that we actually DTRT in all the different places.

@nigoroll (Member) commented Mar 1, 2024

Just in case I will be away for more before-release bugwashing: I would really prefer to move this one to the next cycle to get the chance for a thorough review. See previous comments for reasons.

@dridi (Member, Author) commented Mar 4, 2024

I fixed an incorrect assertion that surfaced from production traffic. It otherwise ran successfully on Friday and over the weekend.

@dridi (Member, Author) commented Mar 4, 2024

From Friday's bugwash: I will squash all the commits that followed @mbgrydeland's review, close this pull request and open a new one based on this patch series to start a new discussion.

dridi added 10 commits March 4, 2024 13:53
This is not a nice caching outcome, but it is a good reason to trigger
a new fetch if there are clients on the waiting list. If the fetch task
failed before reaching HSH_Unbusy() we make sure to "unbusy" it anyway.

The OC_F_FAILED flag will prevent it from being looked up, so there is
no reason to delay the rush until the final objcore reference is dropped.

This also means that once the client task that triggered the fetch task
resumes it can only be operating on a non-busy object.
The BOS_PREP_STREAM state is retired. It was used as a stop-gap to
"unbusy" objects before waking up the client task for good. Instead,
the HSH_Unbusy() and HSH_Fail() functions are called once the gap is
closed, depending on the outcome.

This removes one unnecessary signal and consolidates multiple call
sites, ensuring that objects always drop the OC_F_BUSY flag from a
central location, for fetch tasks.
This is the counterpart of HSH_Unbusy() for the cases where the req task
will not schedule a fetch task, at which point we know we can already
wake up a request, if there is one on the waiting list.

This encapsulates the "wake up only one request" semantic using a new
OC_F_WITHDRAWN flag. Although this flag is not being used yet, it will
when the state of the objcore determines the rush policy instead of the
call site.
Objcores created in vcl_synth cannot be looked up, therefore there is
never a request going on the waiting list because it saw a synthetic
objcore's OC_F_BUSY flag.
Objcores created for request bodies cannot be looked up, therefore
there is never a request going on the waiting list because it saw a
req.body objcore's OC_F_BUSY flag.
Rushing the waiting list because an objcore was dropped was a source
of spurious wakeups, including the case where the dropped objcore is not
even relevant to the objhead waiting list. It is now guaranteed that an
object either dropped its OC_F_BUSY flag once ready (otherwise failed or
withdrawn) or never had this flag to begin with for objcores that can't
be looked up.
Operating on an objcore makes it possible to preserve the different
rush strategies that exist so far:

- rush one request for failed fetch tasks
- rush one request for withdrawn objects
- rush rush_exponent requests otherwise

For the cases where no rush is expected, a null objcore is passed.

For cacheable objects, all waiters are optimistically woken up at once.
From now on, the policy will be derived from the objcore.
This reduces a great deal of spurious activity on the private_oh
waiting list and removes unnecessary conditionals when dealing with
cacheable (or at least searchable) objects.

There is still some waiting list no-op activity via HSH_Fail() but it
can be considered negligible compared to the happy path.
Not only does the waiting list operate on an objcore, it also passes a
reference to requests before they reembark a worker. This way, when an
objcore is already present during lookup, we can attempt a cache hit
without holding the objhead lock.

Instead of repurposing the req::hash_objhead field into an equivalent
req::hash_objcore, the field is actually removed. In order to signal
that a request comes back from its waiting list, the lifetime of the
req::waitinglist flag is extended until cnt_lookup() is reentered.

If the rushed objcore matches a request, the lookup can result in a
hit without entering the regular lookup critical section. The objhead
lock is briefly acquired to release req's reference on the objhead, to
rely solely on the objcore's objhead reference like a normal hit.

This shifts the infamous waiting list serialization phenomenon to the
vary header match. The rush policy is applied wholesale on cacheable
objects instead of exponentially. This improves waiting list latency
when the outcome is a cache hit, but forces incompatible variants to
reenter a lookup and potentially disembark into the waiting list again.
Since most spurious rushes at every corner of objhead activity are
gone, this change confines the spurious activity to the incompatible
variants alone, instead of hitting all objects on many more occasions.

If a cacheable object was inserted in the cache, but had already
expired, this behavior enables cache hits. This can be common with
multi-tier Varnish setups where one Varnish server may serve a graced
object to another, but true of any origin server that may serve stale
yet valid responses.

The waiting list enables a proper response-wide no-cache behavior from
now on, but the built-in VCL prevents it by default. This is also the
first step towards implementing no-cache and private support at the
header field granularity.
@dridi (Member, Author) commented Mar 5, 2024

#4073

@dridi dridi closed this Mar 5, 2024
@dridi dridi deleted the objcore_rush branch March 5, 2024 11:12