DataNode: Migration by Rolling Upgrade #17155

janheise · 2023-11-02T07:58:00Z

Rolling upgrade of a cluster (preliminary draft, nothing tested yet, with questions)

keep GL running
you need ES/OS instances whose data can be read by OS 2.10.0
TODO: test with ES7/OS1.3
on every instance, install the DataNode
configure the same admin credentials and certificates that you use in ES/OS
TODO: re-add basic credentials subsystem into the DataNode for migration
configure the DataNode to read from the existing data directory
use the rolling upgrade procedure as described in the ES/OS docs (TODO: add links)
this should amount to: stop the ES/OS node, start the DataNode
the DataNode should come up as a replacement
after replacing all existing nodes with the DataNode:
add a CA (TODO: new functionality in running GL)
add a provisioning policy (TODO: new functionality in running GL)
provision certificates (TODO: test/modify so it works on a running DataNode with settings already present in the config file. Don't restart OpenSearch automatically)
TODO: Rolling Restart with new Config
remove the elastic config string from graylog.conf
remove the certificate settings from datanode.conf
remove simple auth config from OS because we use JWT now (TODO: probably has to be removed from the cluster online)
design all steps so that they can be triggered manually multiple times, in case you have to fix things in between
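
The requirement that every step can be re-triggered safely could be sketched roughly as below. This is only an illustration, not DataNode or Graylog code: `Step`, `run_steps`, and the toy `state` dictionary are hypothetical names, and the real checks would inspect the cluster and config files instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One migration step; both callables are hypothetical placeholders."""
    name: str
    is_done: Callable[[], bool]   # inspects current state, no side effects
    apply: Callable[[], None]     # performs the change


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order, skipping those whose goal is already reached.

    Returns the names of the steps that were actually applied, so a
    repeated run after a successful one returns an empty list.
    """
    applied = []
    for step in steps:
        if step.is_done():
            continue
        step.apply()
        if not step.is_done():
            raise RuntimeError(f"step {step.name!r} did not take effect")
        applied.append(step.name)
    return applied


# Toy state standing in for the real cluster and config files (assumption).
state: Dict[str, bool] = {"ca_created": False, "certs_provisioned": False}


def make_step(key: str) -> Step:
    return Step(
        name=key,
        is_done=lambda: state[key],
        apply=lambda: state.__setitem__(key, True),
    )


steps = [make_step("ca_created"), make_step("certs_provisioned")]
first_run = run_steps(steps)    # applies both steps
second_run = run_steps(steps)   # nothing left to do
```

Separating the "is it already done" check from the action is what makes a manual re-run harmless after something was fixed by hand in between.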


janheise · 2023-11-13T08:09:31Z

After feedback session with Professional Services:

they usually stop processing and replace the complete OS/ES cluster in one step
doing this would simplify the upgrade quite a bit
remove the ES config from graylog.conf so that the DataNodes are found on the next restart
configure DataNodes on cluster machines with JWT secrets etc. and the existing directories
start DataNodes
configure DataNodes in running Graylog
start DataNodes, replace ES/OS clients with newly configured clients

janheise · 2023-11-16T17:13:48Z

After more feedback ;-)

Graylog is started normally after upgrading it to 6.0, pointing to the existing ES/OS cluster via the graylog.conf file
we have added the ability to create a CA not only in preflight, but also during normal operation (see other issue)
we create a CA if none exists yet
we create a certificate renewal policy
on the machines of the old ES/OS cluster, the DataNode packages get installed (operating system packages or Docker)
in addition to the regular installation config settings, point data_dir to the existing data directories
start the DataNodes
provision the DataNodes with certificates
OpenSearch will not yet be started after provisioning
check all indices in the old cluster for compatibility with OS 2.x
calculate the necessary journal volume size for a given "default upgrade time"
check whether the requirements are met
stop processing in Graylog
manually shut down the old ES/OS cluster
press the "start DataNode cluster" button in Graylog
re-inject the new clients with the new URL
check whether everything looks fine
re-enable processing in Graylog
ask the user to remove the old elasticsearch_hosts setting from graylog.conf for future restarts
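
The pre-flight checks in the list above (index compatibility, journal volume, requirements) might look roughly like the following sketch. Everything here is an assumption for illustration: the function names are hypothetical, the compatibility set assumes that OS 2.x can read indices created by Elasticsearch 7.x or OpenSearch 1.x/2.x while older ones need reindexing, and the journal estimate is simply ingest rate times upgrade time times a safety factor.

```python
import math
from typing import Dict, List, Tuple

# Assumption: (distribution, major version the index was created with)
# pairs that OpenSearch 2.x can open without reindexing.
COMPATIBLE = {("elasticsearch", 7), ("opensearch", 1), ("opensearch", 2)}


def incompatible_indices(indices: Dict[str, Tuple[str, int]]) -> List[str]:
    """Return the names of indices that would have to be reindexed first."""
    return sorted(name for name, created in indices.items()
                  if created not in COMPATIBLE)


def required_journal_bytes(ingest_bytes_per_sec: float,
                           upgrade_seconds: int,
                           safety_factor: float = 1.5) -> int:
    """Estimate the journal space needed while processing is stopped."""
    return math.ceil(ingest_bytes_per_sec * upgrade_seconds * safety_factor)


def requirements_met(indices: Dict[str, Tuple[str, int]],
                     journal_capacity_bytes: int,
                     ingest_bytes_per_sec: float,
                     upgrade_seconds: int) -> bool:
    """True if no index blocks the upgrade and the journal is big enough."""
    needed = required_journal_bytes(ingest_bytes_per_sec, upgrade_seconds)
    return (not incompatible_indices(indices)
            and journal_capacity_bytes >= needed)


# Example values (made up): 2 MiB/s ingest, 30 minutes of downtime.
example_indices = {
    "graylog_0": ("elasticsearch", 6),
    "graylog_1": ("elasticsearch", 7),
    "graylog_2": ("opensearch", 1),
}
blocked = incompatible_indices(example_indices)
needed = required_journal_bytes(2 * 1024 * 1024, 30 * 60)
```

The resulting byte count would be compared against the configured journal size (`message_journal_max_size` in graylog.conf) and the free space on the journal volume before processing is stopped.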

janheise · 2023-12-11T10:25:22Z

Things to discuss:
After doing some research: we need to find a way to make OpenSearch reload the security information during the upgrade, because we switch from whatever was used before (no auth, basic auth) to JWT auth.
The info is held in a special index inside OpenSearch.
The "regular" way to trigger this is by running the securityadmin.sh script. This needs a working connection to OpenSearch, which we no longer have after we switch out implementations (because we also change certificates etc.). This approach will also probably be deprecated in 3.x.
I looked at the source and it seems that two ways might be successful:

writing a small plugin that triggers the reload like it's done inside the security plugin
removing the security index and also the existence of said index from the metadata by file/filesystem modification

I'd like to discuss if we want to go down one of the two routes or search for an alternative.

janheise · 2023-12-11T15:53:44Z

Decision: we ask users to update their security before migrating.

janheise · 2024-03-14T09:43:10Z

done

mako42 assigned janheise Dec 14, 2023

janheise closed this as completed Mar 14, 2024
