
Fix issue where successful streaming gRPC requests are incorrectly reported as cancelled #6471

Merged

Conversation

@charleskorn (Contributor) commented on Oct 24, 2023

What this PR does

This PR fixes multiple related issues:

  • It fixes the issue where query requests that stream chunks but return no series are reported as cancelled rather than successful in traces and metrics reported by queriers (77c191d)
  • It also fixes a similar issue as above for other server streaming gRPC calls (ae15605)
  • It fixes the issue where query requests that stream chunks and have some series are reported as cancelled (5f6d927 for streaming from ingesters, 91fd892 for streaming from store-gateways)

4453e0a separately refactors the ingester stream reader (SeriesChunksStreamReader) to follow the same structure as the store-gateway stream reader (storeGatewayStreamReader).

I'd suggest reviewing each commit individually.

Which issue(s) this PR fixes or relates to

(none)

Checklist

  • Tests updated
  • [n/a] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@charleskorn force-pushed the charleskorn/fix-incorrect-cancellation-metrics-and-traces branch 2 times, most recently from af9dd49 to af430e4 on October 24, 2023 08:44
@charleskorn force-pushed the charleskorn/fix-incorrect-cancellation-metrics-and-traces branch from af430e4 to 795f704 on October 24, 2023 09:07
@charleskorn marked this pull request as ready for review on October 24, 2023 09:14
@charleskorn requested a review from a team as a code owner on October 24, 2023 09:14
@charleskorn (Contributor, Author) commented

Putting back to draft to investigate flaky integration test introduced in this PR

@charleskorn marked this pull request as draft on October 24, 2023 09:44
pkg/util/grpc.go: review comment (outdated, resolved)
@colega (Contributor) left a comment

I'll trust the test. The "CloseAndExhaust" also looks good.

LGTM.

Co-authored-by: Oleg Zaytsev <[email protected]>
@pracucci (Collaborator) left a comment

The change makes sense to me. I'm just wondering if by draining the stream we may add extra latency, but on the other hand the gRPC doc you linked is very clear that if we don't do it, resources may be leaked.

Will wait for you to track down the issue causing the flaky integration test (whether it's just an issue in the test or something more serious).

@charleskorn force-pushed the charleskorn/fix-incorrect-cancellation-metrics-and-traces branch from ebd9ee0 to 11ddaa9 on October 26, 2023 02:21
@charleskorn changed the title from "Fix issue where successful query requests that return no series are incorrectly reported as cancelled" to "Fix issue where successful streaming gRPC requests are incorrectly reported as cancelled" on Oct 26, 2023
@charleskorn force-pushed the charleskorn/fix-incorrect-cancellation-metrics-and-traces branch from 532caff to 25f55a9 on October 26, 2023 03:25
@charleskorn force-pushed the charleskorn/fix-incorrect-cancellation-metrics-and-traces branch from 25f55a9 to 91fd892 on October 26, 2023 04:41
@charleskorn (Contributor, Author) commented

> I'm just wondering if by draining the stream we may add extra latency, but on the other hand the gRPC doc you linked is very clear that if we don't do it, resources may be leaked.

By the time we get to draining the stream, we've either:

  • successfully reached the end of the stream, in which case draining the stream would effectively be a no-op
  • encountered some kind of error that the gRPC library already knows about (eg. a timeout or an error returned by the server), in which case we've also reached the end of the stream and draining should again be a no-op
  • encountered some kind of error on the client side and decided to abort the request, in which case we might have to do some work to drain the stream

Only in the last case would I expect any noticeable latency difference, and this should be fairly quick: in the worst case, the client code closes the stream, the gRPC library sends a cancellation notification to the server, and the client then waits for the server to acknowledge the cancellation and close the stream, or times out.
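
For illustration, here is a minimal sketch of the "close, then drain" pattern being described; the name, signature and structure are assumptions for this example rather than the exact helper added in this PR:

```go
package util

import (
	"errors"
	"io"
)

// closeAndDrain is an illustrative sketch only, not necessarily the helper
// added in this PR. It closes the send side of a client stream and then keeps
// receiving until the server closes its side, so the gRPC library can release
// the stream's resources and record the call's real status.
func closeAndDrain[T any](stream interface {
	CloseSend() error
	Recv() (T, error)
}) error {
	if err := stream.CloseSend(); err != nil {
		return err
	}

	for {
		if _, err := stream.Recv(); err != nil {
			if errors.Is(err, io.EOF) {
				return nil // Server finished sending: clean end of stream.
			}
			return err // Timeout, server error, cancellation, etc.
		}
	}
}
```

A caller would invoke it with an explicit type argument (for example closeAndDrain[*SomeResponse](stream)) wherever it previously called CloseSend() on its own.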

@charleskorn (Contributor, Author) commented on Oct 26, 2023

I've fixed the flaky test. It was caused by a race while streaming chunks (see the sketch after the list below), between:

  • the gRPC stream being closed cleanly, and the request therefore being reported as successful
  • the context used by the PromQL engine being cancelled once query evaluation had succeeded, and the request therefore being reported as cancelled
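
For illustration only, the race-free shape is for the code reading the stream to close and drain the stream itself before query evaluation completes, instead of leaving teardown to the engine's context cancellation. The names and the synchronous structure below are invented for this sketch (per the commit messages later in this page, the real readers stream from a background goroutine):

```go
package example

import (
	"errors"
	"io"
)

// chunkBatch stands in for the real protobuf response message.
type chunkBatch struct{}

// chunkStream is the subset of a streaming gRPC client used by this sketch.
type chunkStream interface {
	CloseSend() error
	Recv() (*chunkBatch, error)
}

// readAllChunks is an illustrative sketch only. It closes the client's side of
// the stream up front and then receives until the server closes its side, so
// the server observes a clean end-of-stream before the PromQL engine ever
// cancels its context, and the request is consistently reported as successful.
func readAllChunks(stream chunkStream) ([]*chunkBatch, error) {
	// Nothing further to send, so close our side now rather than relying on
	// context cancellation to tear the stream down after evaluation finishes.
	if err := stream.CloseSend(); err != nil {
		return nil, err
	}

	var batches []*chunkBatch

	for {
		batch, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			return batches, nil // Clean end of stream.
		}
		if err != nil {
			return nil, err
		}
		batches = append(batches, batch)
	}
}
```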

@charleskorn marked this pull request as ready for review on October 26, 2023 04:54
@pracucci (Collaborator) left a comment

Very good work. The changes make sense to me. I left a couple of nits that I'm going to self-commit so we can merge this PR today and it will be included in next week's release.

For a follow-up PR: I would add a linting rule to forbid the direct use of CloseSend(), requiring CloseAndExhaust() instead.

pkg/ingester/client/streaming.go: review comment (outdated, resolved)
pkg/querier/block_streaming.go: review comment (outdated, resolved)
@pracucci (Collaborator) commented

Merging it to get it included in tomorrow's weekly release.

@pracucci merged commit 1c4cacc into main on Oct 29, 2023
27 checks passed
@pracucci deleted the charleskorn/fix-incorrect-cancellation-metrics-and-traces branch on October 29, 2023 09:05
@charleskorn (Contributor, Author) commented on Oct 30, 2023

> > I'm just wondering if by draining the stream we may add extra latency, but on the other hand the gRPC doc you linked is very clear that if we don't do it, resources may be leaked.
>
> By the time we get to draining the stream, we've either:
>
>   • ...
>   • encountered some kind of error on the client side and decided to abort the request, in which case we might have to do some work to drain the stream
>
> Only in the last case would I expect any noticeable latency difference, and this should be fairly quick: in the worst case, the client code closes the stream, the gRPC library sends a cancellation notification to the server, and the client then waits for the server to acknowledge the cancellation and close the stream, or times out.

I need to correct myself here: when the client closes the stream (i.e. calls CloseSend()), this does not necessarily mean the server will promptly terminate the stream: it's valid for the server to continue sending messages (and the client to continue receiving them) after this point. It would therefore be possible for a client to hang in CloseAndExhaust(), waiting for messages that may never come.
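
To make that risk concrete, one way to bound the wait is to drain in a goroutine and give up after a deadline. This is only an illustrative sketch; the timeout handling, names and structure are assumptions, and the "Ensure CloseAndExhaust never blocks forever" change in the follow-up commit listed below may well be implemented differently:

```go
package util

import (
	"errors"
	"io"
	"time"
)

var errTimedOutWaitingForStream = errors.New("timed out waiting for the server to close the stream")

// closeAndDrainWithTimeout is an illustrative sketch only. It closes the send
// side, then drains any remaining messages in a goroutine, so that a server
// which keeps sending (or never closes its side) cannot block the caller
// forever.
func closeAndDrainWithTimeout[T any](stream interface {
	CloseSend() error
	Recv() (T, error)
}, timeout time.Duration) error {
	if err := stream.CloseSend(); err != nil {
		return err
	}

	// Buffered so the draining goroutine can exit even if we stop waiting.
	done := make(chan error, 1)

	go func() {
		for {
			if _, err := stream.Recv(); err != nil {
				if errors.Is(err, io.EOF) {
					err = nil // Clean end of stream.
				}
				done <- err
				return
			}
		}
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errTimedOutWaitingForStream
	}
}
```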

I've checked all of the places that are now using CloseAndExhaust() and don't believe any of them will behave poorly in this case, but it may be wasteful for the server to continue processing the request after the client has abandoned it. Before CloseAndExhaust() was introduced, most callers would have cancelled the context used to create the stream after calling CloseSend(), so that would have aborted the request and communicated the cancellation to the server.

The main scenario I'm concerned about is store-gateway query requests that are abandoned due to hitting a query limit, which I will address in #6505. (Ingester query requests are not susceptible to the same behaviour as these use DoUntilQuorum, and DoUntilQuorum will cancel all request contexts once one returns an error.)

charleskorn added a commit that referenced this pull request Oct 30, 2023
* Fix typo in comment.

* Use more efficient `SpanLogger.DebugLog()` method in more places.

* Add linting rule to forbid the direct use of CloseSend().

* Add documentation to CloseAndExhaust.

* Update changelog entry.

* Report correct error when closing loop fails

* Avoid reading entire store-gateway query stream response when one or more calls fail.

* Silence linting warnings.

* Ensure CloseAndExhaust never blocks forever.

* Fix race in ingester streaming tests: there's no guarantee we'll close the stream synchronously after encountering an error

* Address PR feedback: reduce timeout.
grafanabot pushed a commit that referenced this pull request Oct 30, 2023

(same commit message as above; cherry picked from commit 0103407)
charleskorn added a commit that referenced this pull request Oct 30, 2023

(same commit message as above; cherry picked from commit 0103407)

Co-authored-by: Charles Korn <[email protected]>
francoposa pushed a commit that referenced this pull request Oct 30, 2023
Fix issue where successful streaming gRPC requests are incorrectly reported as cancelled (#6471)

* Add failing test.

* Fix issue where query requests that stream chunks from ingesters but return no series are reported as cancelled rather than successful in traces and metrics.

* Fix other instances of issue.

* Add changelog entry.

* Address PR feedback

Co-authored-by: Oleg Zaytsev <[email protected]>

* Add more details to assertions

* Extract method

* Fix race between query evaluation finishing and ingester chunks streaming goroutine closing the gRPC stream.

* Refactor SeriesChunksStreamReader to follow same structure as for store-gateway chunk stream reader

* Update changelog entry.

* Fix race between query evaluation finishing and store-gateway chunks streaming goroutine closing the gRPC stream.

* Apply suggestions from code review

---------

Co-authored-by: Oleg Zaytsev <[email protected]>
Co-authored-by: Marco Pracucci <[email protected]>
francoposa pushed a commit that referenced this pull request Oct 30, 2023

(same commit message as charleskorn's follow-up commit listed above)