Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address post-merge feedback for #6471 #6505

Merged
merged 11 commits into from
Oct 30, 2023
Merged

Conversation

charleskorn
Copy link
Contributor

@charleskorn charleskorn commented Oct 30, 2023

What this PR does

This PR addresses some feedback from #6471:

It also fixes an issue where the wrong error value was logged while shutting down the query-scheduler communication loop in query-frontends: df78279

I'd suggest reviewing each commit individually.

Which issue(s) this PR fixes or relates to

#6471

Checklist

  • Tests updated
  • [n/a] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Comment on lines -486 to +488
require.NoError(t, querierLoop.CloseSend())
cancel()
require.NoError(t, util.CloseAndExhaust[*schedulerpb.SchedulerToQuerier](querierLoop))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: moving the call to cancel() before closing the stream is required because otherwise CloseAndExhaust() will block: the stream has not ended, and the server does not know to close the stream without the cancellation. (Previously, CloseSend() returned immediately, and the call to cancel() was what made the test pass.)

@charleskorn charleskorn force-pushed the charleskorn/post-6471-feedback branch from 6dab7c3 to 2635d15 Compare October 30, 2023 02:24
…e the stream synchronously after encountering an error
@charleskorn charleskorn marked this pull request as ready for review October 30, 2023 03:27
@charleskorn charleskorn requested a review from a team as a code owner October 30, 2023 03:27
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes makes sense. I'm a bit worried that we're adding quite tricky logic which will be difficult to debug in production in case of issues (e.g. if a goroutine blocks, if a context leaks, etc...).

pkg/util/grpc.go Outdated Show resolved Hide resolved
return nil
go func() {
for {
if _, err := stream.Recv(); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After 3 seconds the function will return but this goroutine will continue to consume in the background. What if the consume blocks forever? We're going to leak goroutines. So question is: do we ensure that the stream context is canceled on the client side (this side) so that the Recv() will unblock and goroutine terminate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For almost all cases where we now use CloseAndExhaust, the streaming call is made in the context of an incoming HTTP or gRPC request (eg. querier evaluating a query), and that context will be cancelled when the response is returned. If for some reason that context isn't already being cancelled today, then we're potentially leaking the stream, but then CloseAndExhaust should help mitigate that by exhausting the stream and then terminating.

The only exception to this case is frontendSchedulerWorker which is used by query-frontends to queue requests to query-schedulers. In this case, the context will always be cancelled after CloseAndExhaust returns.

@charleskorn charleskorn enabled auto-merge (squash) October 30, 2023 08:26
@charleskorn charleskorn merged commit 0103407 into main Oct 30, 2023
27 checks passed
@charleskorn charleskorn deleted the charleskorn/post-6471-feedback branch October 30, 2023 08:34
grafanabot pushed a commit that referenced this pull request Oct 30, 2023
* Fix typo in comment.

* Use more efficient `SpanLogger.DebugLog()` method in more places.

* Add linting rule to forbid the direct use of CloseSend().

* Add documentation to CloseAndExhaust.

* Update changelog entry.

* Report correct error when closing loop fails

* Avoid reading entire store-gateway query stream response when one or more calls fail.

* Silence linting warnings.

* Ensure CloseAndExhaust never blocks forever.

* Fix race in ingester streaming tests: there's no guarantee we'll close the stream synchronously after encountering an error

* Address PR feedback: reduce timeout.

(cherry picked from commit 0103407)
charleskorn added a commit that referenced this pull request Oct 30, 2023
* Fix typo in comment.

* Use more efficient `SpanLogger.DebugLog()` method in more places.

* Add linting rule to forbid the direct use of CloseSend().

* Add documentation to CloseAndExhaust.

* Update changelog entry.

* Report correct error when closing loop fails

* Avoid reading entire store-gateway query stream response when one or more calls fail.

* Silence linting warnings.

* Ensure CloseAndExhaust never blocks forever.

* Fix race in ingester streaming tests: there's no guarantee we'll close the stream synchronously after encountering an error

* Address PR feedback: reduce timeout.

(cherry picked from commit 0103407)

Co-authored-by: Charles Korn <[email protected]>
francoposa pushed a commit that referenced this pull request Oct 30, 2023
* Fix typo in comment.

* Use more efficient `SpanLogger.DebugLog()` method in more places.

* Add linting rule to forbid the direct use of CloseSend().

* Add documentation to CloseAndExhaust.

* Update changelog entry.

* Report correct error when closing loop fails

* Avoid reading entire store-gateway query stream response when one or more calls fail.

* Silence linting warnings.

* Ensure CloseAndExhaust never blocks forever.

* Fix race in ingester streaming tests: there's no guarantee we'll close the stream synchronously after encountering an error

* Address PR feedback: reduce timeout.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants