Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RB] Always cancel remote bazel run at end of CLI to prevent missing retries #8098

Open
wants to merge 1 commit into
base: rb_fix
Choose a base branch
from

Conversation

maggie-lou
Copy link
Contributor

A customer reported a bug where the remote bazel invocation was canceled due to a "stream terminated by RST_STREAM" error (which often occurs when an executor loses connection to an app server due to a rollout). That error was immediately returned to the CLI, which reported that the remote run failed. However that execution was retried, and eventually succeeded.

We should just cancel the invocation in this case, so the CLI reports the correct status of the final results of the remote run. We don't want a case where the CLI reports that the run failed, which may lead the client to rerun the remote bazel command, which in this case would've caused duplicate runs

If the invocation did already complete running, the CancelExecutions will be a no-op

Fixes #2 here: https://buildbuddy-corp.slack.com/archives/C07GMM2VBLY/p1734595804049229?thread_ts=1734595606.837269&cid=C07GMM2VBLY

Copy link
Member

@bduffany bduffany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the remote bazel invocation was canceled due to a "stream terminated by RST_STREAM" error (which often occurs when an executor loses connection to an app server due to a rollout)

We should just cancel the invocation in this case

I think we should try to find a way to avoid failures during a rollout? Workflows don't result in "failed" GitHub statuses when the executors restart, so it seems like a worse experience if remote bazel runs do fail in this case.

Workflows are retried on the server currently, because there is no "client" that would be capable of doing retries. For remote bazel at least, maybe we could transparently retry the invocation in the CLI (instead of retrying on the server) if WaitExecution reports that the execution fails? I'm not sure how to handle this for the Run API though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants