
Add AWS Batch enhanced retry strategy to step functions. #68

Open
sharkinsspatial opened this issue Jan 14, 2021 · 2 comments

Comments

@sharkinsspatial
Collaborator

No description provided.

@ceholden
Collaborator

👋 I'm perusing the backlog and this one caught my eye, given my past positive experience with AWS Batch retry strategies. I've seen a few other tickets about things like catching SPOT instance interruptions versus other failures, which seem possibly related.

In the past I've used a more advanced retry policy to catch SPOT-related terminations (and some networking-related Docker start/stop timeouts 😭), similar to what's documented in this "best practices" doc under the "Use custom retries" section:
https://docs.aws.amazon.com/batch/latest/userguide/best-practices.html#bestpractice6
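
For concreteness, here's a rough sketch of that pattern via boto3 (untested, and not necessarily how this repo's constructs would wire it up): retry host/SPOT terminations at the job level, exit on everything else. The job name, image, and resource values are placeholders.

```python
import boto3

batch = boto3.client("batch")

# Sketch only: job-level retry strategy in the spirit of the "Use custom
# retries" best practice. "Host EC2*" status reasons cover SPOT reclamation
# and other host terminations; any other failure exits without retrying.
batch.register_job_definition(
    jobDefinitionName="hls-docker-job-example",  # placeholder name
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/example:latest",  # placeholder
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "2048"},
        ],
    },
    retryStrategy={
        "attempts": 5,
        "evaluateOnExit": [
            # Instance reclaimed / host failure -> retry the job
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Any other failure reason -> fail immediately
            {"onReason": "*", "action": "EXIT"},
        ],
    },
)
```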

It seems like we could improve our retry strategy in the DockerBatchJob construct along these lines. I'm less experienced with Step Functions, but it doesn't look like "SPOT interruption" is a known error state that can be handled in a step retry config (see the docs for the possible error States).
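
For comparison, the closest step-level equivalent I can see is a generic Retry on the Batch task state; since there's no SPOT-specific error name, it would match any task failure. Expressed here as a Python dict standing in for the ASL "Retry" field (the interval and attempt numbers are made up):

```python
# Sketch of an ASL "Retry" field for the Batch submit-job task state.
# "States.TaskFailed" matches any failed Batch job, not just SPOT
# interruptions, which is the limitation mentioned above.
batch_step_retry = [
    {
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 300,  # made-up delay to wait out a saturated market
        "BackoffRate": 2.0,
        "MaxAttempts": 3,
    }
]
```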

Assuming this ticket is still relevant,

  • Would this be useful now or later / is it worth doing?
  • Is my understanding close / far off?

Even if this isn't ready to work on, or isn't relevant anymore, there are some connections to the state tracking being done that are good for me to learn about (e.g., the failure tracking described in #212, which this ticket might help mitigate but not fully resolve if we're unlucky).

@sharkinsspatial
Collaborator Author

@ceholden 👋 🙇 Thanks for poking through some of these repos and having a look. I think your assessment here is "SPOT" on 😄, pun intended. We didn't pursue this initially for a few reasons:

  1. At the time this issue was created, Batch did not support "custom retries", which would allow us to execute retries only for specific exit codes. Unfortunately, due to the nature of the science code we rely on, we have several cases of "successful" non-zero exit codes from the containers that we would not want to retry. But now that this custom retry logic exists, we can submit jobs with all of the retry logic packaged in the job configuration rather than the state machine (see the sketch after this list). I'm curious whether, when retry logic is configured at the job level, the state machine step only produces an error state after all of the retries have failed (that would be my assumption).

  2. I also made a bit of a strategic decision about this after reviewing the patterns of SPOT interruptions we were seeing. In our case, we saw large spikes in interruptions when some large-volume on-demand users pre-empted the bulk of the instances in our clusters, followed by an hour or two of limited instance availability on the SPOT market. Batch retry strategies don't support any type of delay or backoff configuration, so most of our retry attempts would likely fail with interruptions as well, since they would be immediately re-submitted to the overloaded market. So instead we use a log-database querying approach to incrementally chunk through failed jobs and re-submit them in smaller batches. This has the advantage of delaying re-submission to a saturated market, but the disadvantage that it increases latency and requires our central log database to be scalable enough to handle heavy query traffic (which has fallen down on us at massive scale previously).
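
For point 1, the kind of job-level strategy I have in mind looks roughly like this (exit codes 3 and 4 are stand-ins for our "successful" non-zero codes, not the real ones; my understanding is that evaluateOnExit rules are matched in order):

```python
# Rough sketch for point 1: treat specific non-zero exit codes from the
# science code as terminal, retry only SPOT/host terminations, and fail
# everything else immediately.
retry_strategy = {
    "attempts": 4,
    "evaluateOnExit": [
        # "Successful" non-zero exit codes from the containers: never retry.
        {"onExitCode": "3", "action": "EXIT"},  # placeholder code
        {"onExitCode": "4", "action": "EXIT"},  # placeholder code
        # SPOT reclamation / host failure: retry.
        {"onStatusReason": "Host EC2*", "action": "RETRY"},
        # Any other failure: fail the job (and, per my assumption above,
        # surface the error to the state machine only after these attempts
        # are exhausted).
        {"onReason": "*", "action": "EXIT"},
    ],
}
```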

All of this could be a moot point, since we are exploring the possibility of switching to on-demand instances for some new reduced-latency HLS targets, and we also want to consider introducing some checkpointing logic into the containers so that failed jobs can resume from intermediate points.

All that to say, I don't think the current retry infrastructure we use is ideal, and I'd love it if you, @chuckwondo, and I could brainstorm some new and improved approaches that are better than what I cobbled together 5 years ago 😆
