Under high load, the BOSH deploy command fails because baggageclaim fails to stop.
We have observed this twice on Wings already.
Incident 1
We observed that HTTP responses were extremely slow on Wings. To rectify the issue, it was decided to restart the system and bump the stemcell.
Doing so via a BOSH deploy caused the baggageclaim stop to fail, which in turn caused the deploy to fail.
Manually SSHing onto the VM and issuing a monit restart didn't reliably resolve the issue on a subsequent BOSH deploy.
To resolve the failed deploy, all the workers had to be stopped via stop --hard, which took multiple tries to succeed (with each try, a few more workers were stopped). Finally, another BOSH deploy got the system back into a working state.
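For anyone hitting the same thing, the repeated stop --hard attempts described above could be scripted roughly as follows. This is only a sketch, not what was actually run on Wings: the `retry` helper, the deployment name `concourse`, the instance group name `worker`, and the attempt count and delay are all assumptions for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command until it succeeds or the
# attempt budget is exhausted.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1"; shift
  local delay="$1"; shift
  local attempt=1

  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}

# Example invocation (assumed names: deployment "concourse",
# instance group "worker"). Left commented out on purpose:
# retry 5 30 bosh -d concourse stop worker --hard
```

This only automates the workaround; it does nothing about the underlying baggageclaim stop failure.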
Concourse version: 4.2.2
Deployment type (BOSH/Docker/binary): BOSH
Infrastructure/IaaS: GCP
Steps to Reproduce
Unfortunately, there isn't a consistent way to reproduce the issue. What we observed was that the system was under high load (workers had ~200 containers each) and the workers were executing builds and resource checks.
Expected Results
BOSH commands such as deploy, stop, and start shouldn't fail because of baggageclaim. When they do, the system can't be restored via BOSH, and increasingly destructive actions are needed to eventually get it back into a working state.