vcl_vrt: Skip VCL execution if the client is gone #3998
Conversation
I intend to submit a port to the 6.0 branch if this change is approved.
    CHECK_OBJ_NOTNULL(req, REQ_MAGIC);
    CAST_OBJ_NOTNULL(r2, req->transport_priv, H2_REQ_MAGIC);

    Lck_Lock(&r2->h2sess->sess->mtx);
I am not convinced that we should take the lock here. Yes, if we don't, we might return a false positive, but only intermittently and only for bad requests, while the lock overhead hits everyone.
bugwash: +1
@dridi the memory situation improved even further after applying this change, and the CPU load improved significantly. I am trying to backport this to 6.6.2, but unfortunately the 02014 test is failing for me:

    *** c3 rx: stream: 0, type: WINDOW_UPDATE (8), flags: 0x00, size: 4

The test seems to expect RST_STREAM but for some reason does not get it. @dridi, any idea what could be wrong?
Thank you very much for your testing. Your observations match my expectations, and this change was approved.
I had similar problems while working on this patch series; changing the behavior sometimes breaks test cases without making them irrelevant. I suspect you are witnessing the lack of an h2 rxbuf for request bodies (#3661), and I don't think it will be easy to make this test case stable. Just ignore it and you should be fine.
Okay, will do. Thanks a lot for the suggestion.
It was particularly hard to follow once we reached client c3.
The goal is for top-level transports to report whether the client is still present or not.
Once a client is reportedly gone, processing its VCL task(s) is just a waste of resources. The execution of client-facing VCL is intercepted and an artificial return(fail) is injected in that scenario. Thanks to the introduction of the universal return(fail), proper error handling and resource teardown are already in place, which makes this change safe modulo unknown bugs.

This adds a circuit breaker anywhere in the client state machine where there is VCL execution. A new Reset timestamp is logged to convey when a task does not complete because the client is gone. This is a good complement to the walk away feature and its original circuit breaker for the waiting list, but the two have not been integrated yet.

While the request is technically failed, it will not increase the vcl_fail counter; instead, a new req_reset counter is incremented. This new behavior is guarded by a new vcl_req_reset feature flag, enabled by default.

Refs varnishcache#3835
Refs 61a15cb
Refs e5efc2c
Refs ba54dc9
Refs 6f50a00
Refs b881699
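As a rough illustration of the circuit-breaker idea described above (struct client, transport_poll and vcl_task are names invented for this sketch; this is not the Varnish source):

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Minimal sketch of the circuit breaker, with invented names;
     * not the Varnish source.
     */

    enum vcl_ret { VCL_RET_OK, VCL_RET_FAIL };

    struct client {
            bool gone;      /* set by the transport, e.g. on h2 RST_STREAM */
    };

    /* Top-level transport reports whether the client is still present. */
    static bool
    transport_poll(const struct client *clt)
    {
            return (clt->gone);
    }

    /* Every VCL execution point in the client state machine is gated. */
    static enum vcl_ret
    vcl_task(struct client *clt)
    {
            if (transport_poll(clt)) {
                    /* Would log a Reset timestamp and bump req_reset. */
                    return (VCL_RET_FAIL); /* artificial return(fail) */
            }
            /* ... run the VCL subroutine normally ... */
            return (VCL_RET_OK);
    }

    int
    main(void)
    {
            struct client clt = { .gone = true };

            printf("VCL %s\n",
                vcl_task(&clt) == VCL_RET_FAIL ? "skipped" : "executed");
            return (0);
    }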
The error check is not performed in a critical section to avoid contention, at the risk of not seeing the error until the next transport poll.
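A minimal sketch of that trade-off, assuming an atomic error field and invented names (stream_set_error, stream_check_error); this is not the actual Varnish code. Writers update the error under the session mutex, but the reader skips the lock: a stale read only delays detection until the next transport poll, whereas taking the mutex would cost every request.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    struct h2_stream {
            pthread_mutex_t mtx;    /* serializes writers */
            atomic_int error;       /* 0 = ok, non-zero = stream reset */
    };

    /* Writer side, e.g. the h2 session thread receiving RST_STREAM. */
    static void
    stream_set_error(struct h2_stream *s, int err)
    {
            pthread_mutex_lock(&s->mtx);
            atomic_store(&s->error, err);
            pthread_mutex_unlock(&s->mtx);
    }

    /* Reader side: no critical section, possibly one poll late. */
    static int
    stream_check_error(struct h2_stream *s)
    {
            return (atomic_load(&s->error));
    }

    int
    main(void)
    {
            struct h2_stream s = { PTHREAD_MUTEX_INITIALIZER, 0 };

            stream_set_error(&s, 1);
            printf("error seen: %d\n", stream_check_error(&s));
            return (0);
    }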
With #3998 we need to ensure that streams do not skip vcl_recv when they are reset faster than the request task reaches this step. The alternative, to prevent the vcl_req_reset feature from interfering, is simply to disable it.
Noticed while porting #3998 to the 6.0 branch with a varnishtest more sensitive to timing.
This is the main commit message:
This change is limited to h2 resets in a broader sense than just receiving an RST_STREAM frame. It is conceivable to achieve a similar attack on HTTP/1 sessions with a rapid TCP reset, but that would be less effective, and it should also be easier to detect and mitigate with firewall rules. However, implementing client polling for HTTP/1 clients is not as straightforward as for h2, so an HTTP/1 client can effectively trigger a workload and not stick around, yet the VCL will be processed until completion. We could probably use the waiter facility, but this needs some research.
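For illustration only, a naive liveness probe for an HTTP/1 socket might look like the sketch below (client_is_gone is a hypothetical helper, not a Varnish API). It also shows why such a probe is only a heuristic: a readable socket may hold a pipelined request rather than an EOF or reset.

    #include <errno.h>
    #include <poll.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Best-effort check whether an HTTP/1 client socket was closed
     * or reset; a sketch, not Varnish code. */
    static int
    client_is_gone(int fd)
    {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            char c;
            ssize_t r;

            if (poll(&pfd, 1, 0) <= 0)
                    return (0);     /* no event: assume still there */
            if (pfd.revents & (POLLHUP | POLLERR))
                    return (1);     /* hangup or error on the socket */
            r = recv(fd, &c, 1, MSG_PEEK);
            if (r == 0)
                    return (1);     /* orderly close (EOF) */
            if (r < 0 && errno == ECONNRESET)
                    return (1);     /* TCP reset */
            return (0);             /* data pending, e.g. pipelining */
    }

    int
    main(void)
    {
            int sv[2];

            if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
                    return (1);
            close(sv[1]);   /* peer goes away */
            printf("gone: %d\n", client_is_gone(sv[0]));
            return (0);
    }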