TRON-1850: Include starting pods in check for stuck jobs #940

Merged
merged 2 commits into from
Apr 9, 2024

Conversation

jfongatyelp
Contributor

We can sometimes get stuck in the 'starting' state if we miss events coming in from k8s (or if a pod itself somehow gets stuck before it starts running).

This change now treats these situations as stuck jobs and actions, alongside those that are actually running too long or waiting on a trigger, and adds some tests to verify. (These tests were confirmed to fail w/o the changes to include 'starting' jobs/actions.)
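
For illustration only, here is a minimal sketch of the behavior described above. It is not the actual `is_job_stuck` code in `check_tron_jobs.py`: the function name `find_stuck_runs`, the dict-based run shape (`id`, `state`, `start_time`), and the age threshold are all assumptions made up for this example. It just shows why a run stuck in 'starting' is missed by the old state set and caught by the new one.

```python
from typing import Iterable, List, Mapping, Set

# State sets before and after this change (taken from the diff).
OLD_STATES = {"running", "waiting"}
NEW_STATES = {"running", "waiting", "starting"}


def find_stuck_runs(
    job_runs: Iterable[Mapping],
    states_to_check: Set[str],
    now: float,
    max_age_seconds: float,
) -> List[str]:
    """Hypothetical helper: return ids of runs that are in one of
    `states_to_check` and started more than `max_age_seconds` ago."""
    stuck = []
    for run in job_runs:
        if run["state"] not in states_to_check:
            continue  # finished or otherwise uninteresting runs
        if now - run["start_time"] > max_age_seconds:
            stuck.append(run["id"])
    return stuck


# Made-up sample data: job.2 never left 'starting'.
sample_runs = [
    {"id": "job.1", "state": "running", "start_time": 0},
    {"id": "job.2", "state": "starting", "start_time": 0},
    {"id": "job.3", "state": "starting", "start_time": 950},
    {"id": "job.4", "state": "succeeded", "start_time": 0},
]

before = find_stuck_runs(sample_runs, OLD_STATES, now=1000, max_age_seconds=300)
after = find_stuck_runs(sample_runs, NEW_STATES, now=1000, max_age_seconds=300)
```

With the old set, only `job.1` is reported; with 'starting' included, `job.2` is reported as well, while the recently started `job.3` is still left alone.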

tron/bin/check_tron_jobs.py
@@ -251,7 +251,7 @@ def is_job_stuck(
 ):
     next_run_time = None
     for job_run in job_runs:
-        states_to_check = {"running", "waiting"}
+        states_to_check = {"running", "waiting", "starting"}
Member

heh, i like how we're pretty inconsistent re: sets and lists in this file :p

Member

we could also move this outside the loop so that we're not re-creating a set multiple times, but it's probably fine to punt on refactoring this script :p
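
A rough sketch of what hoisting the set could look like, assuming a module-level constant is acceptable in this script. The names `STUCK_CANDIDATE_STATES` and `count_candidate_runs` are hypothetical and do not exist in `check_tron_jobs.py`; the dict-based run shape is also just for illustration.

```python
# Built once at import time instead of on every loop iteration.
# frozenset makes it clear the set is not meant to be mutated.
STUCK_CANDIDATE_STATES = frozenset({"running", "waiting", "starting"})


def count_candidate_runs(job_runs):
    """Hypothetical helper: count runs whose state is worth checking."""
    return sum(1 for run in job_runs if run["state"] in STUCK_CANDIDATE_STATES)
```

The behavior is the same as building the set inside the loop; the only difference is that the set is created once.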

Contributor Author

Hm good call, though i also briefly saw we had https://github.com/Yelp/Tron/blob/master/tron/core/jobrun.py#L456 which makes me wonder if we should just reuse this and why we didn't already, but agreed, I'll leave the archaeology for another time.

@jfongatyelp jfongatyelp merged commit 110013b into master Apr 9, 2024
3 checks passed
3 participants