Add AWS Batch enhanced retry strategy to step functions. #68
Comments
👋 I'm perusing the backlog and this one caught my eye because of my past positive experience with AWS Batch retry strategies. I've seen a few other tickets about things like distinguishing SPOT instance interruptions from other failures, which seem possibly related. In the past I've used a more advanced retry policy to catch SPOT-related terminations (and some networking-related Docker start/stop timeouts 😭), similar to what's documented in this "best practices" doc under the "Use custom retries" section. It seems like we could improve our retry strategy in the DockerBatchJob construct along these lines. I'm less experienced with Step Functions, but it doesn't look like "SPOT interruption" is a known error state that can be handled in a step retry config (see docs for possible Assuming this ticket is still relevant,
Even if this isn't ready to work on or isn't relevant anymore, there are some connections to the state tracking being done that are good for me to learn about (e.g., the failure tracking described in #212, which this ticket might help mitigate but not fully resolve if we're unlucky).
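For reference, the custom-retry idea described above could look roughly like the following. This is a minimal sketch, not the DockerBatchJob implementation: it only builds the `retryStrategy` dictionary in the shape AWS Batch job definitions accept (`attempts` plus `evaluateOnExit` rules). The function names `build_retry_strategy` and `decide` are hypothetical, the `DockerTimeoutError*` pattern is an assumption that should be checked against real failure reasons, and `decide` is a simplified local approximation of how Batch evaluates the rules in order, included only to make the behaviour easy to reason about.

```python
from __future__ import annotations

from typing import Any


def build_retry_strategy(attempts: int = 3) -> dict[str, Any]:
    """Build a retryStrategy dict for an AWS Batch job definition."""
    if not 1 <= attempts <= 10:
        raise ValueError("AWS Batch allows between 1 and 10 attempts")
    return {
        "attempts": attempts,
        "evaluateOnExit": [
            # Spot reclamation shows up as a status reason starting with "Host EC2".
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Assumed pattern for Docker timeouts; verify against observed failures.
            {"onReason": "DockerTimeoutError*", "action": "RETRY"},
            # Everything else (e.g. application errors) exits without a retry.
            {"onReason": "*", "action": "EXIT"},
        ],
    }


def _glob(pattern: str, value: str) -> bool:
    """Match a pattern that may end in '*', meaning 'starts with'."""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def decide(strategy: dict[str, Any], status_reason: str = "", reason: str = "") -> str:
    """Approximate Batch's rule evaluation: the first matching rule wins."""
    observed = {"onStatusReason": status_reason, "onReason": reason}
    for rule in strategy["evaluateOnExit"]:
        conditions = [key for key in observed if key in rule]
        if all(_glob(rule[key], observed[key]) for key in conditions):
            return rule["action"]
    # When rules are present but none match, Batch retries the job.
    return "RETRY"


if __name__ == "__main__":
    strategy = build_retry_strategy(attempts=3)
    print(decide(strategy, status_reason="Host EC2 (instance i-0abc) terminated."))
    print(decide(strategy, reason="Essential container in task exited"))
```

Rule order matters here: the catch-all `EXIT` rule has to come last, otherwise it would shadow the SPOT and Docker rules and nothing would ever be retried.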
@ceholden 👋 🙇 Thanks for poking through some of these repos and having a look. I think your assessment here is "SPOT" on 😄, pun intended. We didn't pursue this initially for a few reasons
All of this could be a moot point, since we are exploring the possibility of switching to on-demand instances for some new reduced-latency HLS targets, and we also want to consider introducing some checkpointing logic into the containers so that failed jobs can resume at intermediate points. All that to say, I don't think the current retry infrastructure we use is ideal, and I'd love it if you, @chuckwondo, and I could brainstorm some new and improved approaches for this that are better than what I cobbled together 5 years ago 😆