DEVPROD-5498 Do not consider elapsed communication time during group teardown #8506
@@ -2169,6 +2169,12 @@ func (h *Host) Replace(ctx context.Context) error {
Can you please explain/add links for this staging test a bit more?
Tested in staging (executions 3 vs 4 for reference) and confirmed that without the change, a host that just ran a task group with a long teardown group is unable to pick up more tasks afterwards because it immediately gets marked by the idle termination job, whereas after the change, said host is able to continue picking up tasks.
I didn't see a difference in the host event logs between execution 3 and 4 for task 1.
Sure. For reference, task1+task2 are part of a task group, and task3 is a standalone task.
Execution 3 (with change):
task1+task2 complete on host i-0ba489af415acc35e, and then that same host is able to immediately pick up task 3 afterwards.
Execution 4 (without change):
task1+task2 complete on host i-0e34ab569dac24127, which then gets hit with an idle timeout because the task group's teardown group takes too long, so the host is decommissioned. task3 then comes up on a new host.
I only see task 3 running on i-0ba489af415acc35e; I don't see it running tasks 1, 2, and 3.
(I'm guessing you meant i-059849d0ff65edfc0)
Yes, I meant i-059849d0ff65edfc0, my bad
DEVPROD-5498
Description
Currently, a host can get hit with an idle timeout for lack of communication with the Evergreen app server if its teardown group runs long enough. The teardown group runs after task completion, so no heartbeat signals are sent during that period.
Since the change introduced in #7635 adds protections against long-running teardown groups, it should be sufficient to skip the last communicated time in the idle host check if a host is actively tearing down a task group.
Testing
Tested in staging (executions 3 vs 4 for reference) and confirmed that without the change, a host that just ran a task group with a long teardown group is unable to pick up more tasks afterwards because it immediately gets marked by the idle termination job, whereas after the change, said host is able to continue picking up tasks.