DataNode: Migration by Rolling Upgrade #17155

Closed
janheise opened this issue Nov 2, 2023 · 5 comments

@janheise
Contributor

janheise commented Nov 2, 2023

Rolling upgrade of a cluster (preliminary draft, nothing tested yet, with questions)

  • keep Graylog (GL) running

  • you need OS/ES instances whose data can be read by OS 2.10.0

  • TODO: test with ES7/OS1.3

  • on every instance, install the DataNode

  • configure the same admin credentials, certificates that you use in ES/OS

  • TODO: re-add basic credentials subsystem into the DataNode for migration

  • configure DataNode to read from existing data directory

  • use rolling upgrade procedure as described in the ES/OS docs (TODO: add links)

  • this should amount to: stop the ES/OS node, start the DataNode

  • DataNode should come up as a replacement

  • after replacing all existing nodes with the DataNode:

  • add a CA (TODO: new functionality in running GL)

  • add a provisioning policy (TODO: new functionality in running GL)

  • provision certificates (TODO: test/modify so it works on a running DataNode with stuff configured in the config file. Don't restart OpenSearch automatically)

  • TODO: Rolling Restart with new Config

  • remove the elastic config string from graylog.conf

  • remove the cert stuff from the datanode.conf

  • remove simple auth config from OS because we use JWT now (TODO: probably has to be removed from the cluster online)

  • design all steps so that they can be triggered manually multiple times, in case you have to fix things in between
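
A minimal sketch of what "steps that can be triggered multiple times" could look like. Nothing here is Graylog or DataNode code: `Step`, `run_steps`, and the step names in the usage note are hypothetical. The idea is that each step carries its own "already done?" check, so re-running the whole list after a manual fix only executes what is still missing.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One migration step (hypothetical type, not part of Graylog)."""
    name: str
    is_done: Callable[[], bool]   # cheap check: has this step already taken effect?
    apply: Callable[[], None]     # performs the step


def run_steps(steps: List[Step]) -> List[str]:
    """Run every step that is not done yet and return the names of the executed steps.

    Steps that report is_done() are skipped, so the function can be
    called again after a failure or a manual fix without redoing work.
    """
    executed = []
    for step in steps:
        if step.is_done():
            continue
        step.apply()
        if not step.is_done():
            # Stop here so the operator can fix things and re-trigger.
            raise RuntimeError(f"step {step.name!r} did not complete")
        executed.append(step.name)
    return executed
```

Usage, with an in-memory set standing in for the real system state: build `Step("remove_es_config", lambda: "es" in state, lambda: state.add("es"))` and call `run_steps([...])` twice; the second call executes nothing because the check already passes.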

@janheise
Contributor Author

After feedback session with Professional Services:

  • they usually stop processing and replace the complete OS/ES cluster in one step

  • doing this would simplify upgrade quite a bit

  • remove ES config from graylog.conf so that the DataNodes are found for next reboot

  • configure DataNodes on cluster machines with JWT secrets etc. and the existing directories

  • start DataNodes

  • configure DataNodes in running Graylog

  • start DataNodes, replace ES/OS clients with newly configured clients

@janheise
Contributor Author

After more feedback ;-)

  1. Graylog is started regularly after upgrading it to 6.0, pointing to the existing ES/OS cluster by means of the graylog.conf file
  2. we have added the ability to create a CA not only in preflight, but also while regularly started (see other issue)
  3. we create a CA, if none exists yet
  4. we create a certificate renewal policy
  5. on the old ES/OS cluster, the DataNode packages get installed (OS or Docker)
  6. in addition to the regular installation config settings, point the data_dir to the existing data directories
  7. start the DataNodes
  8. provision the DataNodes with certificates
  9. OpenSearch will not yet be started after provisioning
  10. check all indices in the old cluster for compatibility with OS 2.x
  11. calculate the necessary journal volume size for a given "default upgrade time"
  12. check whether the requirements are met
  13. stop processing in Graylog
  14. manually shut down the old ES/OS cluster
  15. press the "start DataNode cluster" button in Graylog
  16. re-inject the new clients with the new URL
  17. check whether everything looks fine
  18. re-enable processing in Graylog
  19. ask the user to remove the old elasticsearch_hosts setting from graylog.conf for future restarts
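
Steps 10 to 12 could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Graylog implementation: all names (`IndexInfo`, `incompatible_indices`, `required_journal_bytes`, `requirements_met`) are hypothetical, the compatibility rule (indices created by ES 7.x or OS 1.x/2.x are readable by OS 2.x) is an assumption that the issue itself still lists as untested, and the journal estimate is simply ingest rate times upgrade time times a safety factor.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexInfo:
    """Minimal description of an index in the old cluster (hypothetical type)."""
    name: str
    distribution: str      # "elasticsearch" or "opensearch"
    created_version: str   # version that created the index, e.g. "7.10.2"


def is_compatible(index: IndexInfo) -> bool:
    """ASSUMPTION: OS 2.x reads indices created by ES 7.x and OS 1.x/2.x only."""
    major = int(index.created_version.split(".")[0])
    if index.distribution == "elasticsearch":
        return major == 7
    if index.distribution == "opensearch":
        return major in (1, 2)
    return False


def incompatible_indices(indices: List[IndexInfo]) -> List[str]:
    """Step 10: names of indices that would need reindexing before the switch."""
    return [i.name for i in indices if not is_compatible(i)]


def required_journal_bytes(ingest_bytes_per_sec: float,
                           upgrade_seconds: float,
                           safety_factor: float = 1.5) -> int:
    """Step 11: journal space needed to buffer messages while processing is stopped."""
    return math.ceil(ingest_bytes_per_sec * upgrade_seconds * safety_factor)


def requirements_met(indices: List[IndexInfo],
                     journal_capacity_bytes: int,
                     ingest_bytes_per_sec: float,
                     upgrade_seconds: float) -> bool:
    """Step 12: all indices readable and the journal large enough."""
    needed = required_journal_bytes(ingest_bytes_per_sec, upgrade_seconds)
    return not incompatible_indices(indices) and journal_capacity_bytes >= needed
```

For example, at an assumed ingest rate of 2 MB/s and a one-hour upgrade window, `required_journal_bytes(2_000_000, 3600)` asks for 10.8 GB of journal space with the default safety factor of 1.5.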

@janheise
Contributor Author

Things to discuss:
After some research: we need to find a way to make OpenSearch reload its security information during the upgrade, because we switch from whatever was used before (no auth, basic auth) to JWT auth.
The info is held in a special index inside OpenSearch.
The "regular" way to trigger this is to run the securityadmin.sh script. This needs a working connection to OpenSearch, which we no longer have after we switch out implementations (because we also change certificates etc.). This approach will probably also be deprecated in 3.x.
I looked at the source and it seems that two approaches might work:

  • writing a small plugin that triggers the reload like it's done inside the security plugin
  • removing the security index and also the existence of said index from the metadata by file/filesystem modification

I'd like to discuss whether we want to go down one of the two routes or search for an alternative.

@janheise
Contributor Author

Decision: we ask users to update their security before migrating.

@janheise
Contributor Author

done
