DEVPROD-5498 Do not consider elapsed communication time during group teardown #8506

Merged on Nov 27, 2024 (3 commits)

@hadjri (Contributor) commented Nov 22, 2024

DEVPROD-5498

Description

Currently, a host can hit an idle timeout for lack of communication with the Evergreen app server if its teardown group runs long enough. The teardown group runs after task completion, so no heartbeat signals are sent during that period.

Since #7635 added protections against long-running teardown groups, it should be sufficient for the idle host check to ignore the last communication time while a host is actively tearing down a task group.

Testing

Tested in staging (executions 3 vs 4 for reference). Without the change, a host that just ran a task group with a long teardown group cannot pick up more tasks afterwards, because the idle termination job marks it immediately. With the change, the host continues picking up tasks.

@hadjri hadjri requested a review from a team November 22, 2024 20:21
@@ -2169,6 +2169,12 @@ func (h *Host) Replace(ctx context.Context) error {

Member

Can you please explain/add links for this staging test a bit more?

> Tested in staging (executions 3 vs 4 for reference). Without the change, a host that just ran a task group with a long teardown group cannot pick up more tasks afterwards, because the idle termination job marks it immediately. With the change, the host continues picking up tasks.

I didn't see a difference in the host event logs between execution 3 and 4 for task 1.

@hadjri (Contributor Author), Nov 27, 2024

Sure. For reference, task1 and task2 are part of a task group, and task3 is a standalone task.

Execution 3 (with change):
task1+task2 complete on host i-0ba489af415acc35e, and then that same host is able to immediately pick up task 3 afterwards.

Execution 4 (without change):
task1+task2 complete on host i-0e34ab569dac24127, which gets hit with an idle timeout due to the task group's teardown group taking too long, and it decommissions. task3 then comes up on a new host.

Member

I only see task 3 running on i-0ba489af415acc35e; I don't see it running tasks 1, 2, and 3.

Member

(I'm guessing you meant i-059849d0ff65edfc0)

Contributor Author

Yes, I meant i-059849d0ff65edfc0, my bad

@hadjri hadjri requested a review from malikchaya2 November 27, 2024 17:35
@hadjri hadjri merged commit d0a13db into evergreen-ci:main Nov 27, 2024
10 checks passed