-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] TransportConfigUpdateAction causing a race condition during node bootstrap leading to 403/500 errors #3204
Comments
Thank you for filing this issue @shikharj05! The proposed fix is on the right track, but I think a call to Instead of relying on the
and over in TransportConfigUpdateAction, the nodeOperation method can be updated to:
|
Thanks for the quick response @cwperks !
Right, however - the method is called only in two cases-
Your fix works too! Should we maintain two flags or just rely on the same one? |
Oh, you're right. I'm mistaken. A single flag would be better. I see now that if the |
I like the idea of waiting until ConfigRepo sets DynamicConfigFactory to be available. Is the solution to add a getter for DCF once ConfigRepository is available, so that initialized flag can be accessed? |
I don't think we need to keep the |
Let's consider a scenario where the startup thread has loaded the configurations from index. Method call to A config update request comes to the node and sees that DCF is not yet initialised and skips the update flow. Startup thread gets scheduled for CPU again and finishes initialisation of in-memory security config state. We will miss the config update in this case. So, The wait can be done in |
[Triage] Thank you for filing this issue @shikharj05. This issue appears to have a clear path of action forward so marking triaged. |
Update: While trying to implement a CountDownLatch in place of isInitialized, I ran some preexisting tests to verify the new behavior. In doing so, I discovered that ConfigRepository's bgThread.start() which inherently calls the while loop is never triggered.(This was because the default init is set to false. Checkout initOnNodeStart() for details). The call directly passes goes to TransportConfigUpdateAction nodeOperations(). After manually modifying the test setup to trigger bgThread.start(), I encountered latch TimeoutException. This is due to following:
This tells that onChange needs to be called atleast once to set DynamicConfigFactory as initialized. I'm currently unsure of how to achieve this prior to TransportConfigUpdateAction's nodeOperations() call. |
A possible solution could be to set the countdown latch to 0 in DCF constructor indicating that this is initialized. However, I'm not sure about the effects of doing so. For this: @Override
protected ConfigUpdateNodeResponse nodeOperation(final NodeConfigUpdateRequest request) {
try {
dynamicConfigFactory.waitForInit();
} catch (InterruptedException e) {
// thread interrupted while waiting for DCF to be initialized
throw new OpenSearchException("Error while waiting for DynamicConfigFactory to be initialized: ", e);
}
configurationRepository.reloadConfiguration(CType.fromStringValues(request.request.getConfigTypes()));
backendRegistry.get().invalidateCache();
return new ConfigUpdateNodeResponse(clusterService.localNode(), request.request.getConfigTypes(), null);
} to work, DCF latch needs to be counted down in the constructor. |
One of the side-effects I see is that the security/src/main/java/org/opensearch/security/configuration/ConfigurationRepository.java Lines 321 to 325 in 2ad8272
From what I can see in the above condition, DCF is never initialized on NodeBootStrap and is only initialized (although partially) via TransportAction i.e. updates only. Should we keep this behavior consistent? |
If we keep as is then the latch await will be stuck in an infinite wait (or timeout) as DCF is never initialized before onChange call. This will only leave an option to keep the code as is for nodeOperation. |
@DarshitChanpura and @cwperks, were either one of you able to replicate Shikhar's issue? I am looking at this now but so far have not encountered it. I am going to try to run with docker one more time before swapping to a manual multinode setup in-case docker prevents this issue from arising. I was not able to directly kill the opensearch process with docker since the container host lacks most command line features--I bounced the container instead. So perhaps this is preventing the failure (though I want to say this should not be the case--opensearch restart is opensearch restart). I figured I would check if either of you had a specific configuration setup that worked for you when you were looking at this. Thanks! |
Another place this could be fixed is directly in DynamicConfigFactory.onChange which is called in both instances where the configCache can be invalidated/populated. The 2 instances are:
In particular, there is an issue on this line where it expects the In the problematic case where When that returns empty (i.e. |
With @cwperks' help, I have a repo where you can see the issue occur by following these steps:
You should now see the error of the node coming back up but failing to connect to the cluster and the script will throw 500s. |
For fixes, I have tried modifying
at line 173 (the top of the Sadly, that does not actually work so going to try the above. |
@scrawfor99 my bad, it would need to go in the |
So with Shikhar's change, the nodes fail to come up even with running the securityadmin tool. They just go into 503s saying Security is not initialized. I am going to try @cwperks revised idea and if that does not work look into the locks discussed by DC. |
I have tried several versions of both @cwperks and @shikharj05's suggestions and not yet been able to fix the issue. Here is an example of the logs even with the changes.
From what I have found the issue is that the statement:
will never trigger. Basically, it is never going to be null because even on the empty configurations created on line 63 of the SecurityDynamicConfiguration class, they are assumed to be v7. Going to look into figuring out a better fx |
@scrawfor99 Does securityadmin produce an error? If so, what is the error message? |
@scrawfor99 I opened a PR against my local fork here. There needs to be a different flag that TransportConfigUpdateAction can use that signals to it when its ok to start accepting transport requests. The securityadmin script can be used to initialize security and relies on running these TransportConfigUpdateActions to initialize security. Those use-cases need to be accounted for. |
@cwperks I'll see about getting this merged and backported |
@peternied Thank you for your help on this issue :). |
… has completed (#3810) ### Description Introduces another variable on DynamicConfigFactory called `bgThreadComplete` that behaves differently than the `initialized` variable. `bgThreadComplete` is a flag that signals to TransportConfigUpdateAction that it can start accepting updates. There are 2 ways the security index can be created from scratch: 1. If `plugins.security.allow_default_init_securityindex` is set to **true** it will create the security index and load all yaml files 2. If `plugins.security.allow_default_init_securityindex` is set to **false**, the security index is not created on bootstrap and requires a user to run securityadmin to initialize security. When securityadmin is utilized, the cluster does depend on [TransportConfigUpdateAction](https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/tools/SecurityAdmin.java#L975-L977) to initialize security so there still needs to be an avenue where this can update the config before `initialized` is set to **true** This PR sets `bgThreadComplete` to **false** on node startup and explicitly sets it to **true** once its ok for TransportConfigUpdateAction to start accepting transport actions. In case 2) up above, it can be set to **true** before DynamicConfigFactory is `initialized` so that it can accept requests from securityadmin. * Category (Enhancement, New feature, Bug fix, Test fix, Refactoring, Maintenance, Documentation) Bug fix ### Issues Resolved - Resolves #3204 ### Check List - [X] New functionality includes testing - [ ] New functionality has been documented - [X] Commits are signed per the DCO using --signoff By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check [here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin). --------- Signed-off-by: Craig Perkins <[email protected]> Signed-off-by: Peter Nied <[email protected]> Signed-off-by: Peter Nied <[email protected]> Co-authored-by: Peter Nied <[email protected]> Co-authored-by: Peter Nied <[email protected]>
… has completed (opensearch-project#3810) Introduces another variable on DynamicConfigFactory called `bgThreadComplete` that behaves differently than the `initialized` variable. `bgThreadComplete` is a flag that signals to TransportConfigUpdateAction that it can start accepting updates. There are 2 ways the security index can be created from scratch: 1. If `plugins.security.allow_default_init_securityindex` is set to **true** it will create the security index and load all yaml files 2. If `plugins.security.allow_default_init_securityindex` is set to **false**, the security index is not created on bootstrap and requires a user to run securityadmin to initialize security. When securityadmin is utilized, the cluster does depend on [TransportConfigUpdateAction](https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/tools/SecurityAdmin.java#L975-L977) to initialize security so there still needs to be an avenue where this can update the config before `initialized` is set to **true** This PR sets `bgThreadComplete` to **false** on node startup and explicitly sets it to **true** once its ok for TransportConfigUpdateAction to start accepting transport actions. In case 2) up above, it can be set to **true** before DynamicConfigFactory is `initialized` so that it can accept requests from securityadmin. * Category (Enhancement, New feature, Bug fix, Test fix, Refactoring, Maintenance, Documentation) Bug fix - Resolves opensearch-project#3204 - [X] New functionality includes testing - [ ] New functionality has been documented - [X] Commits are signed per the DCO using --signoff By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check [here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin). --------- Signed-off-by: Craig Perkins <[email protected]> Signed-off-by: Peter Nied <[email protected]> Signed-off-by: Peter Nied <[email protected]> Co-authored-by: Peter Nied <[email protected]> Co-authored-by: Peter Nied <[email protected]> (cherry picked from commit 045d4ef)
… has completed (opensearch-project#3810) Introduces another variable on DynamicConfigFactory called `bgThreadComplete` that behaves differently than the `initialized` variable. `bgThreadComplete` is a flag that signals to TransportConfigUpdateAction that it can start accepting updates. There are 2 ways the security index can be created from scratch: 1. If `plugins.security.allow_default_init_securityindex` is set to **true** it will create the security index and load all yaml files 2. If `plugins.security.allow_default_init_securityindex` is set to **false**, the security index is not created on bootstrap and requires a user to run securityadmin to initialize security. When securityadmin is utilized, the cluster does depend on [TransportConfigUpdateAction](https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/tools/SecurityAdmin.java#L975-L977) to initialize security so there still needs to be an avenue where this can update the config before `initialized` is set to **true** This PR sets `bgThreadComplete` to **false** on node startup and explicitly sets it to **true** once its ok for TransportConfigUpdateAction to start accepting transport actions. In case 2) up above, it can be set to **true** before DynamicConfigFactory is `initialized` so that it can accept requests from securityadmin. * Category (Enhancement, New feature, Bug fix, Test fix, Refactoring, Maintenance, Documentation) Bug fix - Resolves opensearch-project#3204 - [X] New functionality includes testing - [ ] New functionality has been documented - [X] Commits are signed per the DCO using --signoff By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check [here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin). --------- Signed-off-by: Craig Perkins <[email protected]> Signed-off-by: Peter Nied <[email protected]> Signed-off-by: Peter Nied <[email protected]> Co-authored-by: Peter Nied <[email protected]> Co-authored-by: Peter Nied <[email protected]>
What is the bug?
Partial configuration is loaded/cached in security plugin objects/memory in case of a race condition during OpenSearch process bootstrap on a node. For e.g. this can lead to missing role mappings in-memory causing invalid 403s to the requests landing on the buggy node.
Details:
During a node bootstrap, initOnNodeStart is kicked-off after Rest layer is ready. However, a node joins the cluster and replica shard for security index is assigned after transport channel is ready and cluster manager is discovered. Hence, if a TransportConfigUpdateAction triggers an update for a particular config (let's say securityconfig), DynamicConfigFactory object gets initialized and this condition is never true leading to partial config in memory.
How can one reproduce the bug?
Steps to reproduce the behavior:
A temporary fix here is to either flush the cache using
flush
API OR send a dummy config update request.What is the expected behavior?
Do not partially initialize objects.
What is your host/environment?
Do you have any screenshots?
NA
Do you have any additional context?
A fix here would be to check the state of DynamicConfigFactory here before calling reload.
The text was updated successfully, but these errors were encountered: