Kamaji broken after namespace removal #491
I'm a bit confused here, @gecube. May I ask you for a precise way to replicate this, and which Namespace has been deleted?
Hi @prometherion, I'm currently investigating this issue. I found that the namespace is stuck in the Terminating state:
In the describe output I can see that it is because of the Kamaji finalizer:
Inside this namespace I can see that the secret is not deleted:
From the Kamaji logs, the only thing seen is:
Some additional details: the issue was observed on a https://github.com/aenix-io/cozystack installation.
Checking the pods, I found that they could not start because of the absence of a kubeconfig:
...
We added the finalizer to the Datastore secret since this is required to delete the Datastore data, such as key prefixes for etcd, and schemas for RDBMS. I'm a bit lost with the Cozystack terminology, thanks for your patience here: may I ask you to confirm these are the right steps to reproduce?
Unfortunately, I cannot reproduce this behavior :-( I just see this secret is created, and Kamaji is not trying to remove it, nor the finalizer from it.
@gecube reading again the reported logs, it seems to me Kamaji is not able to delete the given Tenant since the connection with the related etcd is broken. Where is the Datastore located? Furthermore, what's the error causing the CrashLoopBackOff for the Kamaji pod? I wonder about some health checks, or is it a nil pointer dereference?
It seems the problem occurs only when the datastore is not available; I was able to reproduce it:
Check that the namespace is still holding the secret for accessing the database:
It has a finalizer which is blocking the namespace removal. Kamaji removes
I was able to reproduce this, but Kamaji is not in CrashLoopBackOff.
I see in the logs that Kamaji tries to connect to the given Datastore, and that's ok: the problem here is that Kamaji is not aware of your business logic. Not sure if it's the case, but let's take for granted you're deleting the Datastore/etcd in the same Namespace where the Tenant Control Plane resides: we know the Tenant Control Plane has a dependency on the Datastore that must be finalized prior to the deletion of the Datastore itself. Kamaji is not aware you're deleting the entire Namespace and the etcd is gone, so it constantly tries to reconcile the finalizer by performing the clean-up. I would suggest, if possible, having an order in the actions, as we have with the creation of a Tenant Control Plane where:
With the same principle, the deletion requires:
I don't think this is a bug report we have to address; it sounds more like an edge case where you have to orchestrate your platform on top of Kamaji more carefully.
@prometherion thanks for the reproduction. I think we cannot rely on the removal order anyway. We can implement an order for applying objects (there are many mechanisms for it, particularly in Helm itself or FluxCD), but for removal we can expect anything: a user may come and remove the namespace completely, because they are not aware of the complex logic under the hood, and we can do nothing about it at the platform level. The only option (as I believe) is to write all controllers in such a manner that:
I'd like to help here, but unfortunately, it's out of our control. If the user makes the Datastore unavailable for any reason, and then deletes the TenantControlPlane, Kamaji still relies on the clean-up of those resources. It makes sense since we don't want an etcd with orphaned keys, and given the context here (etcd unreachable, and the user deleting the Namespace) it sounds like an edge case. The addressable bug here is the CrashLoopBackOff, which is non-reproducible, at least with v1.0.0. As I said before, without being nasty, it's not a bug per se, but the typical Kubernetes scenario where there's a chain of dependencies that must be known by the user, or, if it's orchestrated by a third-party platform, it must be known and orchestrated accordingly. I'm going to close this issue but:
@prometherion Why does it use a finalizer, unlike other secret resources?
Predictable Reproduction Flow
As I understand, tenant data is deleted with admin privileges. Is the datastore secret necessary before deleting a tenant (#376)?
All the Datastore actions, besides creation, are achieved using the limited account. We could switch over to the root credentials given by the Datastore resource, and retrieve the scheme from the Tenant Control Plane status. It would require a refactoring though, something that I'm not able to manage right now: happy to receive contributions, as well as to provide guidance through the code base.
Do you consider it a bug that the datastore-secret still remains even after the TenantControlPlane has been deleted? It is related to the finalizer (finalizer.kamaji.clastix.io/datastore-secret). These are the logs about it.
In my view, the current TenantControlPlane controller cannot guarantee the removal of that finalizer consistently.
I'm unable to replicate the issue.
JFI, I'm running Kamaji on
Have there been any efforts or work related to this issue? If not, note that this behavior does not occur consistently. It occurs due to the following reason:
@jds9090 agreed, this does not occur every time, so it takes a lot of tries to confirm the existence of the issue.
For example, it might be necessary to add a finalizer to the tenantControlPlane to ensure the deletion of the datastore-secret. However, it's still unclear whether this approach is appropriate. |
@jds9090 trying to understand who's changing that.
Wondering if it eventually succeeds or not, since I'm not able to replicate. And I was thinking that maybe we could wrap this portion of code in a retry: kamaji/internal/resources/datastore/datastore_storage_config.go, lines 70 to 92 at fdd0035.
Unfortunately, there is no information regarding the last-applied-configuration.
A retry mechanism seems like it would be helpful here.
I believe it could occur in a scenario like this. (Action 1)
(Action 2)
I’ve been reviewing the approach for Tenant Kubernetes API server access to etcd, which relies solely on TLS-based authentication without using user and password information. This setup appears to apply to the Operator as well, as it also establishes etcd connections without using user and password credentials. I understand that schema information is used for tenant separation. I'd like to know if there are any scenarios where user and password information would be necessary for etcd connections.
Thank you! I will let you know if it happens again. |
Details will be below, @kvaps.