watchdog: restart worker if failing #7

Merged
talset merged 1 commit into master from fl-watch
Jul 31, 2020


@talset talset commented Jul 27, 2020

Workaround for #5.
This commit will need to be reverted once the real watchdog is unblocked.

if [[ $FAIL -eq 1 ]]; then
#/bin/systemd-notify --pid=$WORKER_PID "WATCHDOG=1";
/bin/systemctl restart concourse-worker
sleep 1

@sdurrheimer sdurrheimer Jul 27, 2020


I think the sleep should be longer here, to give the concourse-worker time to clean up its mess before it is ready.
With a sleep 1 we might end up in situations where the watchdog restarts the concourse-worker in a loop.

The logic might need to be different from the systemd watchdog implementation, which allowed the healthcheck to fail several times before deciding to restart the concourse-worker.
Here, restarting at the first healthcheck failure is brutal when it might sometimes be a false alarm.
Ideally we would want to retry the healthcheck 2 or 3 times to be sure the concourse-worker is KO, restart the process if it is, then wait some time for the concourse-worker to recover and clean up old state before expecting the healthcheck to succeed.

WDYT?

Member Author


I'm not sure about the sleep; it should be fine right after a worker restart. But I added 3 retries (with 15 seconds between each) to be sure.
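
For readers following the thread, the retry-then-restart flow discussed above could look roughly like the sketch below. It is not the merged script: the function names (`healthcheck`, `restart_worker`, `worker_is_healthy`, `watchdog_pass`) and the `GRACE` value are hypothetical, and the health probe is only a placeholder. The 3 attempts and the 15-second spacing come from the reply above.

```shell
#!/usr/bin/env bash
# Sketch only: every name and default here is an assumption, not the merged code.

RETRIES="${RETRIES:-3}"            # attempts before declaring the worker down
RETRY_DELAY="${RETRY_DELAY:-15}"   # seconds between attempts
GRACE="${GRACE:-60}"               # assumed recovery time after a restart

# Placeholder probe; replace with the real health check. Returns 0 when healthy.
healthcheck() {
  /bin/systemctl is-active --quiet concourse-worker
}

restart_worker() {
  /bin/systemctl restart concourse-worker
}

# Returns 0 as soon as one attempt succeeds, 1 if all RETRIES attempts fail.
worker_is_healthy() {
  local attempt
  for (( attempt = 1; attempt <= RETRIES; attempt++ )); do
    if healthcheck; then
      return 0
    fi
    # No need to wait after the final failed attempt.
    if (( attempt < RETRIES )); then
      sleep "$RETRY_DELAY"
    fi
  done
  return 1
}

# One watchdog pass: restart only after every attempt failed, then give the
# worker time to recover before the next round of checks.
watchdog_pass() {
  if ! worker_is_healthy; then
    restart_worker
    sleep "$GRACE"
  fi
}
```

Calling `watchdog_pass` from the existing loop would replace the immediate restart on the first failure. Whether a grace period longer than `sleep 1` is actually needed is the open question in this thread, so `GRACE` is left configurable.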

@talset talset force-pushed the fl-watch branch 2 times, most recently from 96e806a to f735086 Compare July 27, 2020 11:47
@talset talset merged commit 3d69a40 into master Jul 31, 2020
@talset talset deleted the fl-watch branch July 31, 2020 09:57