Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEVPROD-5384 Retry cloning modules and source individually and more selectively #7684

Merged
merged 14 commits into from
Apr 1, 2024

Conversation

ZackarySantana
Copy link
Contributor

@ZackarySantana ZackarySantana commented Mar 26, 2024

DEVPROD-5384

Description

This separates the retry logic in our git fetch command to be more selective and individual. First, the source will attempt to clone (with max retries of 5 times). If it fails the 5th time, it aborts and returns an error. Then, the modules will attempt to clone (with a max of 5 too). Each module clones in parallel and fails separately, one failing only restarts itself. If one fails 5 times, the task is considered unrecoverable and errors out.

Previously we did up to 5 attempts of: clone source, clone modules, apply patches (commit queue, uncommitted, etc), and other things and if any failed, restarted the entire process from the beginning. Now, only the cloning retries- and the others do not. Since the others do not retry, I removed the verbose option and made it such that they always output the verbose. I searched our splunk logs and found no errors for many of them in the last 30 days, so I don't think there's much risk. If we do notice a big uptick in task failures, we might want to consider reverting this or implementing a fix which just retries applying patches or whatever is causing the uptick.

Testing

Manual testing in staging (link if you look at the honeycomb span, you'll notice that the modules keep retrying and one beats out the other and cancels the context of said other- causing there to be only one error and only one retried a full 5 times. In this case, parsley tried cloning 5 times and stopped spruce which was at only its 3rd) and added unit testing

@ZackarySantana ZackarySantana self-assigned this Mar 26, 2024
@ZackarySantana ZackarySantana marked this pull request as ready for review March 28, 2024 14:45
@ZackarySantana ZackarySantana requested a review from a team March 28, 2024 14:45
ctx,
func() (bool, error) {
if attemptNum > 2 {
opts.useVerbose = true // use verbose for the last 2 attempts
logger.Task().Error(message.Fields{
"message": "running git clone with verbose output",
"message": fmt.Sprintf("running git %s clone with verbose output", fetchType),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we try to wrap strings with '%s' in errors just in case the string is "" so we would notice

attribute.String(cloneMethodAttribute, opts.method),
))
defer span.End()
// add retry fetch here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you left in a comment

if useVerbose {
apply = fmt.Sprintf("GIT_TRACE=1 %s", apply)
}
apply := fmt.Sprintf("GIT_TRACE=1 git apply --binary --index < '%s'", patchFile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason we want to apply verbose to everything?

Copy link
Contributor Author

@ZackarySantana ZackarySantana Apr 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched operations that we only run once to be verbose only. This is more as a cautious measure and to ensure debugging is possible. Before, the retry functionality captured applying patches- so only on retries it would apply verbose.

I made apply patches not retry since they're much more stable- if it failed to apply the patch, that means something is wrong with the patch of some kind. I searched our logs and it wasn't making our retry function before error.

@ZackarySantana ZackarySantana requested a review from bynn April 1, 2024 13:48
@ZackarySantana ZackarySantana merged commit 81efff6 into evergreen-ci:main Apr 1, 2024
3 of 4 checks passed
@ZackarySantana ZackarySantana deleted the DEVPROD-5384 branch April 1, 2024 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants