Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More error tolerant web ui check #1040

Merged

Conversation

haoming29
Copy link
Contributor

Should fix #1033

The suspicion is that there's a racing condition at the registry start somehow, and the web engine is not "ready-ready" to serve requests, although it could be other issues. However, this PR will make the web ui check more robust in case of the error code returned

@haoming29 haoming29 added bug Something isn't working critical High priority for next release registry Issue relating to the registry component labels Apr 3, 2024
@haoming29 haoming29 added this to the v7.7.0 milestone Apr 3, 2024
Copy link
Collaborator

@bbockelm bbockelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this code is trying to do.

WaitUntilWorking tries a URL every 50ms until it succeeds or 10 seconds has passed. It seems the code adds another set of loops to retry on top of retries (which appears strange?) ... no?

@haoming29
Copy link
Contributor Author

haoming29 commented Apr 3, 2024

I don't understand what this code is trying to do.

WaitUntilWorking tries a URL every 50ms until it succeeds or 10 seconds has passed. It seems the code adds another set of loops to retry on top of retries (which appears strange?) ... no?

My guessing of William hitting that 404 error is that the gin server wasn't fully started before we hit the health check endpoint (since the error is intermittent/random). Since 404 instead of 200 was returned, WaitUntilWorking returns immediately and we never get a change to check it a bit later. I don't want to chance how WaitUntilWorking works because in most situations it makes sense to fail immediately if the status code is different from expected, but for the registry's case, it would probably worth waiting and checking again if the status code doesn't match

@bbockelm
Copy link
Collaborator

bbockelm commented Apr 3, 2024

Gotcha. I would rather add another option to WaitUntilWorking to make the wrong response code non-fatal. Layers of retries are a bit scary.

@haoming29
Copy link
Contributor Author

Gotcha. I would rather add another option to WaitUntilWorking to make the wrong response code non-fatal. Layers of retries are a bit scary.

Sounds good. I'll refactor the code

@jhiemstrawisc
Copy link
Member

@haoming29 is this ready for review or should I wait a bit longer?

@haoming29
Copy link
Contributor Author

@haoming29 is this ready for review or should I wait a bit longer?

Sorry I haven't get the chance to work on the refactoring. I will let you know when its' review for another round of review

@haoming29
Copy link
Contributor Author

@jhiemstrawisc Should be good for the review. The Ubuntu test failure should be a random one and re-running it will likely fix the issue

@haoming29 haoming29 requested a review from bbockelm April 4, 2024 20:47
Copy link
Collaborator

@bbockelm bbockelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@jhiemstrawisc jhiemstrawisc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small typos, otherwise LGTM

server_utils/server_utils.go Outdated Show resolved Hide resolved
server_utils/server_utils.go Outdated Show resolved Hide resolved
server_utils/server_utils.go Outdated Show resolved Hide resolved
server_utils/server_utils.go Outdated Show resolved Hide resolved
@haoming29
Copy link
Contributor Author

A few small typos, otherwise LGTM

Thanks for catching them!

@haoming29 haoming29 requested a review from jhiemstrawisc April 5, 2024 16:42
Copy link
Member

@jhiemstrawisc jhiemstrawisc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@jhiemstrawisc jhiemstrawisc merged commit 939eda9 into PelicanPlatform:main Apr 5, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working critical High priority for next release registry Issue relating to the registry component
Projects
None yet
Development

Successfully merging this pull request may close these issues.

v7.6.2 Registry repeatedly fails to start, until it doesn't
3 participants