Decommissioning a broken backend takes too long #549

Open
kinvaris opened this issue Jan 5, 2017 · 7 comments

kinvaris commented Jan 5, 2017

We have clusters A & B. In this setup, cluster A is connected to cluster B through a global backend and an external local backend (with a 2,1,2,1 preset).
We saw that cluster B was broken, so we unlinked the external local backend from the global backend. But when listing the osds on the global backend an hour later, we still saw the backend in decommissioned mode on the global proxies.

I investigated together with @domsj: the maintenance agent is not doing anything important and is not consuming many resources. What we do see is a lot of connections still going to the old backend (connections refused, because cluster B is completely dead).
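
For reference, checking the proxy looked roughly like the polling sketch below. This is only a sketch: the command used to list the OSDs known to a proxy, the `--to-json` flag and the JSON shape are placeholders/assumptions, not the exact alba invocation we used.

```python
#!/usr/bin/env python
"""Hypothetical polling check: ask a proxy which OSDs it still knows about
and report any that are still marked as decommissioned.

LIST_OSDS_CMD is a placeholder -- substitute the command you actually use to
list the OSDs known to a proxy; the JSON layout below is an assumption too.
"""
import json
import subprocess
import time

LIST_OSDS_CMD = ["alba", "proxy-list-osds",   # placeholder subcommand
                 "--host", "127.0.0.1", "--port", "10000",
                 "--to-json"]                 # assumed JSON output flag


def decommissioned_osds():
    out = subprocess.check_output(LIST_OSDS_CMD)
    osds = json.loads(out)["result"]          # assumed {"result": [...]} wrapper
    return [o for o in osds if o.get("decommissioned")]


if __name__ == "__main__":
    while True:
        stale = decommissioned_osds()
        if not stale:
            print("proxy no longer reports decommissioned OSDs")
            break
        print("still decommissioned: %s" % [o.get("long_id") for o in stale])
        time.sleep(60)  # re-check every minute
```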

kinvaris commented Jan 5, 2017

The alba version on the functional cluster A is 1.3.0; @domsj asked me to upgrade to 1.3.2 because of improvements in how alba handles disk/data loss (https://github.com/openvstorage/alba/releases/tag/1.3.2).

kinvaris commented Jan 5, 2017

After updating from alba 1.3.0 to 1.3.1, the decommissioned alba backend is gone and the proxy no longer tries to connect to the old backend.

To try to reproduce this issue with alba 1.3.1, I will recreate the situation on the current OVH setup, shut down one backend and remove it.

wimpers commented Jan 16, 2017

Please reproduce with the latest alba.

kinvaris commented Feb 14, 2017

I've tried to reproduce the issue, and today we observed the following:

Steps to reproduce

  • Create a global backend.
  • Add 2 local backends & 1 external local backend with policy (1, 2, 1, 3).
  • Create some vdisks and write some data to them (in my case approx. 10 GB).
  • Break 1 external local backend (lazy umount of the asd mountpoints).
  • Delete the external local backend (succeeded).
  • Check the proxy's list of osds to see whether the osd is gone; after 15 min. it was still present, but in decommissioned state.
  • After discussion with @domsj we saw that the old bucket was still present in some namespaces.
  • Once the old bucket was gone (after about 30 min.) the OSD disappeared from the proxy as well (see the sketch after this list).
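
Roughly how we checked for leftover buckets in the last two steps; a minimal sketch assuming there are commands to list namespaces and to show a namespace's bucket counts as JSON. The exact alba subcommands, flags, config path and output layout here are placeholders, not confirmed against this alba version.

```python
#!/usr/bin/env python
"""Hypothetical helper: for every namespace on the global backend, report
which buckets (policies) still hold fragments, so you can see whether the
old bucket is still around. Command names, flags and JSON layout are
placeholders; substitute the commands you actually use.
"""
import json
import subprocess

ABM_CONFIG = "/path/to/abm-config.ini"  # placeholder config path


def run_json(args):
    # All placeholder commands are assumed to support JSON output.
    return json.loads(subprocess.check_output(args))["result"]


def namespaces():
    return run_json(["alba", "list-namespaces",
                     "--config", ABM_CONFIG, "--to-json"])


def bucket_count(namespace):
    stats = run_json(["alba", "show-namespace", namespace,
                      "--config", ABM_CONFIG, "--to-json"])
    return stats.get("bucket_count", [])


if __name__ == "__main__":
    for ns in namespaces():
        name = ns["name"] if isinstance(ns, dict) else ns
        buckets = bucket_count(name)
        if buckets:
            print("%s still has buckets: %s" % (name, buckets))
```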

Conclusion

The maintenance agent should notify the namespaces more quickly that the old bucket is gone for good.

domsj commented Feb 14, 2017

I discussed this with @toolslive; we can (and will) make an improvement here in the near future.

@wimpers wimpers added this to the Gilbert milestone Feb 23, 2017

wimpers commented May 30, 2017

Is that near future already over? Near future sounds like days or weeks, not 3-4 months :)

@wimpers wimpers removed this from the G milestone May 30, 2017

domsj commented May 30, 2017

Sorry, I can't recall what improvements we had in mind. @toolslive, perhaps you remember?
Looking at the release notes, I don't see it there either.

@wimpers wimpers added this to the I milestone Jun 15, 2017
@wimpers wimpers modified the milestones: I, J Oct 19, 2017
@wimpers wimpers modified the milestones: J, Roadmap Mar 6, 2018