Draining of concourse worker taking very long #44
Hi @Marc-Assmann, sorry it's taken a while for someone to get back to you. Without being able to poke around in this environment now, it's difficult to say what exactly went wrong. There are questions I would ask about the deployment, but I can appreciate that since this was 2 months ago, that data might not be around anymore! Have you run into this problem since, or is this an issue your team is still facing?
I was able to reproduce this issue several times on 5.2.0 and 5.5.0. Our setup contains two web nodes and six workers. I have analyzed the issue and this is my theory for what happens:
I think concourse/concourse#2523 might be the same issue.
The same issue also happens if the drain performs "landing" rather than "retiring". I am wondering how to get more details from the web node, for example the logs from the "old" web node (since the old VM is destroyed, its logs are gone). Any ideas?
Hi, I've debugged this and managed to reproduce it locally via docker-compose. It happens when
Then the workerBeacon fails with an error and is restarted. The bottom line is that, in the case of a BOSH deployment, the drain script continues to run until it reaches the drain timeout, and is then followed by TERM/15/QUIT/2/KILL.
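For anyone who wants to confirm this on a live BOSH deployment, something along these lines should show whether the drain script is what's hanging (the deployment name, instance ID, and log paths below are placeholders, not taken from this issue):

# On the director / jumpbox: check instance and process state
bosh -d concourse instances --ps

# SSH into the affected worker instance
bosh -d concourse ssh worker/0

# On the worker VM: see whether the drain script is still running
ps aux | grep '[d]rain'

# Inspect the worker job logs under the usual BOSH log location
sudo tail -n 100 /var/vcap/sys/log/worker/*.log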
It seems the behavior can be explained by the web and worker instance groups being updated in parallel. In some cases the timing is such that a web node and a worker node are updated simultaneously, which can cause the issue if the worker is connected to the specific web node being updated. In our case, having just 2 web nodes makes the probability of hitting this exact race condition quite high. It can be avoided by not running updates of both instance groups in parallel.
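In BOSH terms that means making the update serial, e.g. with an ops file like the one below (the file name, deployment name, and manifest name are just examples; /update/serial is the standard manifest setting):

cat > serial-update.yml <<'EOF'
# Force instance groups to be updated one at a time instead of in parallel
- type: replace
  path: /update/serial?
  value: true
EOF

bosh -d concourse deploy concourse.yml -o serial-update.yml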
Thanks @radoslav-tomov and @alexbakar for spending time investigating this!
I think the only way to get logs from a destroyed VM is to enable log forwarding at deploy time and set it up to export them somewhere like Papertrail (rough sketch at the end of this comment) 😕

The next part is speculative, not necessarily a request for contributors to implement: I wonder, since BOSH introduced a new lifecycle hook for

cc @cirocosta and @pivotal-jamie-klassen for additional input!
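Something like the following might do for the log forwarding, assuming the cloudfoundry syslog BOSH release (the release version, instance group name, and Papertrail destination below are placeholders, not tested against this deployment):

cat > syslog-forwarding.yml <<'EOF'
# Add the syslog release and ship the web instance group's logs off the VM
- type: replace
  path: /releases/-
  value:
    name: syslog
    version: latest
- type: replace
  path: /instance_groups/name=web/jobs/-
  value:
    name: syslog_forwarder
    release: syslog
    properties:
      syslog:
        address: logsN.papertrailapp.com   # example destination
        port: 12345                        # example port
        transport: tcp
EOF

bosh -d concourse deploy concourse.yml -o syslog-forwarding.yml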
Thanks @deniseyu for the help. To analyze the issue, I forwarded the logs and later checked them using Kibana.
Hi there!
Bug Report
When deploying an update of Concourse (using concourse/bosh), the update/recreation of the first worker did not finish for more than 4 hours.
BOSH shows the respective worker as failing (using bosh instances).
Logging in to the worker via bosh ssh, we observed:
So for some reason, the drain behaviour described at https://concourse-ci.org/concourse-worker.html#gracefully-removing-a-worker does not seem to work.
We then manually issued another USR2 signal to the worker process using
kill -USR2 <worker pid>
This made the worker finish its running jobs and shut down.
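For reference, the manual workaround looked roughly like this (the deployment name and instance ID are examples; the pgrep pattern is an assumption about how the worker process shows up on the VM):

# Find the stuck instance
bosh -d concourse instances

# SSH into the affected worker
bosh -d concourse ssh worker/0

# On the worker VM: find the concourse worker process and ask it to retire
WORKER_PID=$(pgrep -f 'concourse worker' | head -n 1)
sudo kill -USR2 "$WORKER_PID"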
You can see the log below; the worker recreation took 4 hrs 29 min: