-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DEVPROD-5384 Retry cloning modules and source individually and more selectively #7684
DEVPROD-5384 Retry cloning modules and source individually and more selectively #7684
Conversation
agent/command/git.go
Outdated
ctx, | ||
func() (bool, error) { | ||
if attemptNum > 2 { | ||
opts.useVerbose = true // use verbose for the last 2 attempts | ||
logger.Task().Error(message.Fields{ | ||
"message": "running git clone with verbose output", | ||
"message": fmt.Sprintf("running git %s clone with verbose output", fetchType), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we try to wrap strings with '%s' in errors just in case the string is "" so we would notice
agent/command/git.go
Outdated
attribute.String(cloneMethodAttribute, opts.method), | ||
)) | ||
defer span.End() | ||
// add retry fetch here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you left in a comment
if useVerbose { | ||
apply = fmt.Sprintf("GIT_TRACE=1 %s", apply) | ||
} | ||
apply := fmt.Sprintf("GIT_TRACE=1 git apply --binary --index < '%s'", patchFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason we want to apply verbose to everything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched operations that we only run once to be verbose only. This is more as a cautious measure and to ensure debugging is possible. Before, the retry functionality captured applying patches- so only on retries it would apply verbose.
I made apply patches not retry since they're much more stable- if it failed to apply the patch, that means something is wrong with the patch of some kind. I searched our logs and it wasn't making our retry function before error.
DEVPROD-5384
Description
This separates the retry logic in our git fetch command to be more selective and individual. First, the source will attempt to clone (with max retries of 5 times). If it fails the 5th time, it aborts and returns an error. Then, the modules will attempt to clone (with a max of 5 too). Each module clones in parallel and fails separately, one failing only restarts itself. If one fails 5 times, the task is considered unrecoverable and errors out.
Previously we did up to 5 attempts of: clone source, clone modules, apply patches (commit queue, uncommitted, etc), and other things and if any failed, restarted the entire process from the beginning. Now, only the cloning retries- and the others do not. Since the others do not retry, I removed the verbose option and made it such that they always output the verbose. I searched our splunk logs and found no errors for many of them in the last 30 days, so I don't think there's much risk. If we do notice a big uptick in task failures, we might want to consider reverting this or implementing a fix which just retries applying patches or whatever is causing the uptick.
Testing
Manual testing in staging (link if you look at the honeycomb span, you'll notice that the modules keep retrying and one beats out the other and cancels the context of said other- causing there to be only one error and only one retried a full 5 times. In this case, parsley tried cloning 5 times and stopped spruce which was at only its 3rd) and added unit testing