-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TRON-2208: Add toggle in tron config to disable retries on LOST k8s jobs #988
Conversation
The _exit_unsuccessful method in the KubernetesActionRun class has been refactored to include the is_lost_task parameter. This parameter is used to determine whether auto-retries should be skipped when the disable_retries_on_lost option is enabled in the Kubernetes configuration.
…ies_on_lost * origin/master: Adjust path after conf.py was moved
sidenote: can we add some additional context to the PR description? it can still reference the ticket, but having more context in |
yeah of course, this is just a draft PR to make sure that we agreed on the implementation 😄 |
The Kubernetes configuration has been updated to use the `non_retryable_exit_codes` option instead of `disable_retries_on_lost`. This change allows for more flexibility in specifying the exit codes that should not trigger auto-retries for Kubernetes tasks.
…ernetesClusterRepository The `non_retryable_exit_codes` attribute in the `KubernetesCluster` and `KubernetesClusterRepository` classes has been refactored to use a tuple of integers instead of a single integer. This change allows for more flexibility in specifying multiple exit codes that should not trigger auto-retries for Kubernetes tasks.
…ies_on_lost * origin/master: Minor fix to help text for kubecontext param Support running on devbox too Support multiple clusters in sync_tron_from_k8s Released 2.3.0 via make release More type fixes for defaults Initialize KubernetesClusterRepository watcher_kubeconfig_paths the same as KubernetsCluster and correct typing Released 2.2.7 via make release Only show disable warning on tronctl disable (#990) Add validation to watcher_kubeconfig_paths too, and a minimal test Update tools/sync_tron_state_from_k8s.py Bump task_processing to get watcher_kubeconfig_paths change Support old_kubeconfig_paths Update sync_tron_from_k8s to match sanitized instance labels, cleanup comments Update sync to handle in-progress retries and don't tronctl yet; add unit tests TRON-2210: Add a tool that can sync from pod state to tron w/ tronctl # Conflicts: # tron/kubernetes.py
The `debugpy` dependency in the `requirements-dev-minimal.txt` and `requirements-dev.txt` files has been updated to version 1.8.1. This update ensures compatibility and brings in the latest features and bug fixes for debugging purposes.
The .gitignore file has been updated to include the .fleet directory. This ensures that the .fleet directory is ignored by Git and not tracked in the repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a nice option, would be nice to see a test where k8s_options.non_retryable_exit_codes was non-empty and checking our code does the right thing tho
@@ -447,7 +447,7 @@ def test_handle_event_lost(mock_kubernetes_task): | |||
) | |||
) | |||
|
|||
assert mock_kubernetes_task.is_unknown | |||
assert mock_kubernetes_task.exit_status == exitcode.EXIT_KUBERNETES_TASK_LOST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to keep the old assert as well, since we do still want to verify 'lost' k8s events lead to marking the task as unknown
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sadly the way we're checking unknow is by checking exit code
def is_unknown(self):
return self.exit_status is None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh interesting; i didn't realize assigning the LOST's an exit code changed this behavior, thanks.
Did you happen to glance through all the other places where we use is_unknown
/unknown to check this change won't change any expected behavior there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not be a problem since we're only marking lost jobs with the special lost exit code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_broken() uses the is_unknown
property and it looks like there's some jobrun code that uses is_broken
to gate starting new action runs - but i'm not sure if this change impacts that in a meaningful way yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think there's any major blockers - it's probably fine to run this in non-prod for a bit
i'm not 100% sure about my 2 non-requirements.txt comments, but we should be able to sus any issues out by actually running this since the case we're handling here is pretty rare anyway :)
@@ -1,4 +1,5 @@ | |||
asynctest | |||
debugpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious: what is this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VSCode Python Debugger :)
@@ -447,7 +447,7 @@ def test_handle_event_lost(mock_kubernetes_task): | |||
) | |||
) | |||
|
|||
assert mock_kubernetes_task.is_unknown | |||
assert mock_kubernetes_task.exit_status == exitcode.EXIT_KUBERNETES_TASK_LOST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_broken() uses the is_unknown
property and it looks like there's some jobrun code that uses is_broken
to gate starting new action runs - but i'm not sure if this change impacts that in a meaningful way yet
Summary
This PR addresses TRON-2208 by adding a toggle in the Tron configuration to disable automatic retries for Kubernetes jobs that are marked as "LOST". This is particularly useful for non-idempotent jobs where retries can be dangerous.
Changes
Configuration
tron/config/config_parse.py
: Addednon_retryable_exit_codes
to the Kubernetes configuration schema.tron/config/schema.py
: Updated the schema to includenon_retryable_exit_codes
.Core Logic
tron/core/actionrun.py
:_exit_unsuccessful
method to check fornon_retryable_exit_codes
and skip retries if the exit code is in this list.non_retryable_exit_codes
to use a list instead of a tuple.tron/kubernetes.py
:handle_event
method to set the exit code toEXIT_KUBERNETES_TASK_LOST
for lost tasks.non_retryable_exit_codes
to the Kubernetes configuration.tron/utils/exitcode.py
: Added a new exit codeEXIT_KUBERNETES_TASK_LOST
with a value of-12
.Tests
tests/config/config_parse_test.py
: Updated tests to includenon_retryable_exit_codes
.tests/kubernetes_test.py
: Updated tests to check for the new exit code and ensure the correct behavior when a task is marked as lost.Miscellaneous
.gitignore
: Added.fleet
to the ignore list.requirements-dev-minimal.txt
: Addeddebugpy
.requirements-dev.txt
: Addeddebugpy
.Motivation
Given that "LOST" means Tron has lost track of a pod it already thought it had started for a job, attempting to retry/start a replacement can be dangerous for non-idempotent jobs. In the current state, these will consume retries, but with some of our EKS migration methods, LOST tasks are more likely. Therefore, we should have a way to temporarily pause retries on these.
Testing
Related Issues
Notes
non_retryable_exit_codes
option if you wish to use this feature.