Kamaji broken after namespace removal #491

Closed
gecube opened this issue Jul 15, 2024 · 26 comments
Labels: wontfix (This will not be worked on)

Comments

@gecube

gecube commented Jul 15, 2024

2024-07-15T11:19:08Z	ERROR	controller-runtime.source.EventHandler	failed to get informer from cache	{"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: admissionregistration.k8s.io/v1: Get \"https://kubernetes-test0.tenant-test0.svc:6443/apis/admissionregistration.k8s.io/v1?timeout=10s\": context deadline exceeded"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:56
2024-07-15T11:19:09Z	INFO	starting CertificateLifecycle handling	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-konnectivity-certificate","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-konnectivity-certificate", "reconcileID": "ade0284e-a277-4c70-848a-5138b70dddfb"}
2024-07-15T11:19:09Z	INFO	certificate is still valid, enqueuing back	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-konnectivity-certificate","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-konnectivity-certificate", "reconcileID": "ade0284e-a277-4c70-848a-5138b70dddfb", "after": "87460h59m37.007380834s"}
{"level":"warn","ts":"2024-07-15T11:19:09.490703Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d82380/etcd-0.etcd-headless.tenant-root.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2024-07-15T11:19:11Z	INFO	starting CertificateLifecycle handling	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-front-proxy-client-certificate","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-front-proxy-client-certificate", "reconcileID": "b6da9941-a533-461d-bd5d-bb5245a328c5"}
2024-07-15T11:19:12Z	INFO	certificate is still valid, enqueuing back	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-front-proxy-client-certificate","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-front-proxy-client-certificate", "reconcileID": "b6da9941-a533-461d-bd5d-bb5245a328c5", "after": "8572h56m3.319115679s"}
2024-07-15T11:19:13Z	INFO	starting CertificateLifecycle handling	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-api-server-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-api-server-certificate", "reconcileID": "9455cd48-738a-4578-93a4-8aaff668e902"}
{"level":"warn","ts":"2024-07-15T11:19:11.887367Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d82380/etcd-0.etcd-headless.tenant-root.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2024-07-15T11:19:14Z	INFO	certificate is still valid, enqueuing back	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-api-server-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-api-server-certificate", "reconcileID": "9455cd48-738a-4578-93a4-8aaff668e902", "after": "7438h49m57.410941272s"}
2024-07-15T11:19:13Z	ERROR	unable to delete datastore data	{"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"kubernetes-test0","namespace":"tenant-georg"}, "namespace": "tenant-georg", "name": "kubernetes-test0", "reconcileID": "e1226656-9803-4468-87f4-735077b3dd88", "resource": "datastore-setup", "error": "unable to delete the datastore: cannot delete database: context deadline exceeded", "errorVerbose": "context deadline exceeded\ncannot delete database\ngithub.com/clastix/kamaji/internal/datastore/errors.NewCannotDeleteDatabaseError\n\t/workspace/internal/datastore/errors/errors.go:29\ngithub.com/clastix/kamaji/internal/datastore.(*EtcdClient).DeleteDB\n\t/workspace/internal/datastore/etcd.go:133\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:211\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email 
protected]/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nunable to delete the datastore\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:212\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
github.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete
	/workspace/internal/resources/datastore/datastore_setup.go:147
github.com/clastix/kamaji/internal/resources.HandleDeletion
	/workspace/internal/resources/resource.go:88
github.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile
	/workspace/controllers/tenantcontrolplane_controller.go:151
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231
2024-07-15T11:19:15Z	INFO	starting CertificateLifecycle handling	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-api-server-kubelet-client-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-api-server-kubelet-client-certificate", "reconcileID": "1df91c30-3fbe-4d91-8796-d79c011b8c18"}
2024-07-15T11:19:15Z	ERROR	resource deletion failed	{"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"kubernetes-test0","namespace":"tenant-georg"}, "namespace": "tenant-georg", "name": "kubernetes-test0", "reconcileID": "e1226656-9803-4468-87f4-735077b3dd88", "resource": "datastore-setup", "error": "unable to delete the datastore: cannot delete database: context deadline exceeded", "errorVerbose": "context deadline exceeded\ncannot delete database\ngithub.com/clastix/kamaji/internal/datastore/errors.NewCannotDeleteDatabaseError\n\t/workspace/internal/datastore/errors/errors.go:29\ngithub.com/clastix/kamaji/internal/datastore.(*EtcdClient).DeleteDB\n\t/workspace/internal/datastore/etcd.go:133\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:211\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email 
protected]/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nunable to delete the datastore\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:212\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
github.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile
	/workspace/controllers/tenantcontrolplane_controller.go:152
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231
2024-07-15T11:19:15Z	INFO	certificate is still valid, enqueuing back	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-api-server-kubelet-client-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-api-server-kubelet-client-certificate", "reconcileID": "1df91c30-3fbe-4d91-8796-d79c011b8c18", "after": "7438h49m56.211676939s"}
2024-07-15T11:19:16Z	INFO	starting CertificateLifecycle handling	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-admin-kubeconfig","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-admin-kubeconfig", "reconcileID": "7cbc598a-d89e-4ce6-9082-35d2c11a0d5f"}
2024-07-15T11:19:19Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "validatingwebhookconfiguration", "controllerGroup": "admissionregistration.k8s.io", "controllerKind": "ValidatingWebhookConfiguration", "source": "kind source: *v1.ValidatingWebhookConfiguration"}
2024-07-15T11:19:18Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet", "source": "kind source: *v1.DaemonSet"}
2024-07-15T11:19:19Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet", "source": "kind source: *v1.ServiceAccount"}
2024-07-15T11:19:19Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ClusterRoleBinding"}
2024-07-15T11:19:18Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "source": "kind source: *v1.ConfigMap"}
2024-07-15T11:19:19Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "source": "channel source: 0xc001b8ee00"}
2024-07-15T11:19:20Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "validatingwebhookconfiguration", "controllerGroup": "admissionregistration.k8s.io", "controllerKind": "ValidatingWebhookConfiguration", "source": "channel source: 0xc001afcb40"}
2024-07-15T11:19:20Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting Controller	{"controller": "validatingwebhookconfiguration", "controllerGroup": "admissionregistration.k8s.io", "controllerKind": "ValidatingWebhookConfiguration"}
2024-07-15T11:19:20Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "source": "kind source: *v1.ConfigMap"}
2024-07-15T11:19:20Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "source": "channel source: 0xc001b8e840"}
2024-07-15T11:19:20Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting Controller	{"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap"}
2024-07-15T11:19:20Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "source": "kind source: *v1.Secret"}
2024-07-15T11:19:20Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "source": "channel source: 0xc001b8f600"}
2024-07-15T11:19:20Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting Controller	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret"}
2024-07-15T11:19:19Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting Controller	{"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap"}
2024-07-15T11:19:20Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ClusterRoleBinding"}
2024-07-15T11:19:21Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ClusterRoleBinding"}
2024-07-15T11:19:21Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet", "source": "kind source: *v1.ClusterRoleBinding"}
2024-07-15T11:19:19Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ClusterRole"}
2024-07-15T11:19:21Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "channel source: 0xc001b8ff00"}
2024-07-15T11:19:21Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting Controller	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-07-15T11:19:22Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ServiceAccount"}
2024-07-15T11:19:22Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet", "source": "channel source: 0xc001afd300"}
2024-07-15T11:19:22Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ServiceAccount"}
2024-07-15T11:19:22Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting Controller	{"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet"}
2024-07-15T11:19:22Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.Service"}
2024-07-15T11:19:23Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ConfigMap"}
2024-07-15T11:19:22Z	INFO	certificate is still valid, enqueuing back	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-admin-kubeconfig","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-admin-kubeconfig", "reconcileID": "7cbc598a-d89e-4ce6-9082-35d2c11a0d5f", "after": "8572h56m10.304161855s"}
2024-07-15T11:19:22Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.Role"}
2024-07-15T11:19:23Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.RoleBinding"}
2024-07-15T11:19:23Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.Deployment"}
2024-07-15T11:19:24Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.ConfigMap"}
2024-07-15T11:19:24Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "channel source: 0xc001b8e380"}
2024-07-15T11:19:24Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting Controller	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-07-15T11:19:24Z	INFO	starting CertificateLifecycle handling	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-datastore-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-datastore-certificate", "reconcileID": "b8c0393d-c525-4a20-94d3-888de01f3b99"}
2024-07-15T11:19:25Z	ERROR	Reconciler error	{"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"kubernetes-test0","namespace":"tenant-georg"}, "namespace": "tenant-georg", "name": "kubernetes-test0", "reconcileID": "e1226656-9803-4468-87f4-735077b3dd88", "error": "unable to delete the datastore: cannot delete database: context deadline exceeded", "errorVerbose": "context deadline exceeded\ncannot delete database\ngithub.com/clastix/kamaji/internal/datastore/errors.NewCannotDeleteDatabaseError\n\t/workspace/internal/datastore/errors/errors.go:29\ngithub.com/clastix/kamaji/internal/datastore.(*EtcdClient).DeleteDB\n\t/workspace/internal/datastore/etcd.go:133\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:211\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email 
protected]/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nunable to delete the datastore\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).deleteDB\n\t/workspace/internal/resources/datastore/datastore_setup.go:212\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Setup).Delete\n\t/workspace/internal/resources/datastore/datastore_setup.go:146\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:333
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231
2024-07-15T11:19:26Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "kind source: *v1.DaemonSet"}
2024-07-15T11:19:26Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting EventSource	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding", "source": "channel source: 0xc001afdbc0"}
2024-07-15T11:19:26Z	INFO	soot_tenant-leotolstoi_kubernetes-test5	Starting Controller	{"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-07-15T11:19:27Z	INFO	certificate is still valid, enqueuing back	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-datastore-certificate","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-datastore-certificate", "reconcileID": "b8c0393d-c525-4a20-94d3-888de01f3b99", "after": "1842h32m50.20592689s"}
2024-07-15T11:19:29Z	INFO	starting CertificateLifecycle handling	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-scheduler-kubeconfig","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-scheduler-kubeconfig", "reconcileID": "6dbf9d38-869a-40f4-b1fc-c206bf6a8097"}
2024-07-15T11:19:32Z	INFO	certificate is still valid, enqueuing back	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test5-scheduler-kubeconfig","namespace":"tenant-leotolstoi"}, "namespace": "tenant-leotolstoi", "name": "kubernetes-test5-scheduler-kubeconfig", "reconcileID": "6dbf9d38-869a-40f4-b1fc-c206bf6a8097", "after": "8572h57m15.009970538s"}
2024-07-15T11:19:32Z	ERROR	controller-runtime.source.EventHandler	failed to get informer from cache	{"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: apps/v1: Get \"https://kubernetes-test5.tenant-leotolstoi.svc:6443/apis/apps/v1?timeout=10s\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:56
2024-07-15T11:19:33Z	INFO	starting CertificateLifecycle handling	{"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"kubernetes-test0-controller-manager-kubeconfig","namespace":"tenant-test0"}, "namespace": "tenant-test0", "name": "kubernetes-test0-controller-manager-kubeconfig", "reconcileID": "f1f2aaa1-2320-4565-b8c3-056465a8a2c6"}
kubectl get pods -n cozy-kamaji                                             
NAME                     READY   STATUS             RESTARTS          AGE
kamaji-c7448f786-v6cmt   0/1     CrashLoopBackOff   388 (2m14s ago)   6d19h

Details will follow below, @kvaps.
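The log above interleaves routine INFO entries with the errors that matter, so a small filter can help when triaging it. The following is only a sketch: it assumes the tab-separated layout seen in the paste (timestamp, level, optional logger name, message, JSON payload), the names `parse_line` and `summarize_errors` are hypothetical, and the sample lines are shortened from the log.

```python
import json
from collections import Counter

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


def parse_line(line):
    """Parse one tab-separated log entry; return None for stack frames and other lines."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3 or parts[1] not in LEVELS:
        return None
    timestamp, level, rest = parts[0], parts[1], parts[2:]
    fields = {}
    if rest[-1].startswith("{"):
        raw = rest.pop()
        try:
            fields = json.loads(raw)
        except json.JSONDecodeError:
            fields = {}  # payload was cut or wrapped; keep the entry anyway
    if not rest:
        return None
    return {
        "timestamp": timestamp,
        "level": level,
        # Assumption: when two fields remain, the first one is the logger name.
        "logger": rest[0] if len(rest) > 1 else "",
        "message": rest[-1],
        "fields": fields,
    }


def summarize_errors(lines):
    """Count ERROR entries per message."""
    entries = (parse_line(line) for line in lines)
    return Counter(e["message"] for e in entries if e and e["level"] == "ERROR")


# Shortened sample lines in the same layout as the log above.
sample = [
    '2024-07-15T11:19:08Z\tERROR\tcontroller-runtime.source.EventHandler\tfailed to get informer from cache\t{"error": "context deadline exceeded"}',
    '2024-07-15T11:19:09Z\tINFO\tcertificate is still valid, enqueuing back\t{"controller": "secret"}',
    '2024-07-15T11:19:13Z\tERROR\tunable to delete datastore data\t{"error": "unable to delete the datastore: cannot delete database: context deadline exceeded"}',
    'sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile',
]

for message, count in summarize_errors(sample).items():
    print(count, message)
```

Feeding it the output of `kubectl logs` for the Kamaji pod should reduce the noise to the distinct error messages, though wrapped or truncated lines will lose their JSON payload.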

@prometherion
Member

I'm a bit confused here, @gecube. Could you describe a precise way to reproduce this, and tell us which Namespace was deleted?

@kvaps
Contributor

kvaps commented Jul 15, 2024

Hi @prometherion, I'm currently investigating this issue.

I found that the namespace is stuck in the Terminating state:

NAME             STATUS        AGE
tenant-georg03   Terminating   28h

In the describe output, I can see that this is because of a Kamaji finalizer:

Status:       Terminating
Conditions:
  Type                                         Status  LastTransitionTime               Reason                Message
  ----                                         ------  ------------------               ------                -------
  NamespaceDeletionDiscoveryFailure            False   Mon, 15 Jul 2024 12:48:25 +0200  ResourcesDiscovered   All resources successfully discovered
  NamespaceDeletionGroupVersionParsingFailure  False   Mon, 15 Jul 2024 12:48:25 +0200  ParsedGroupVersions   All legacy kube types successfully parsed
  NamespaceDeletionContentFailure              False   Mon, 15 Jul 2024 12:48:55 +0200  ContentDeleted        All content successfully deleted, may be waiting on finalization
  NamespaceContentRemaining                    True    Mon, 15 Jul 2024 12:48:25 +0200  SomeResourcesRemain   Some resources are remaining: secrets. has 1 resource instances
  NamespaceFinalizersRemaining                 True    Mon, 15 Jul 2024 12:48:25 +0200  SomeFinalizersRemain  Some content in the namespace has finalizers remaining: finalizer.kamaji.clastix.io/datastore-secret in 1 resource instances

Inside this namespace I can see that the secret has not been deleted:

NAME                                      NAMESPACE       AGE
secret/kubernetes-test0-datastore-config  tenant-georg03  27h
apiVersion: v1
data:
  DB_CONNECTION_STRING: ""
  DB_PASSWORD: <redacted>
  DB_SCHEMA: <redacted>
  DB_USER: <redacted>
kind: Secret
metadata:
  annotations:
    kamaji.clastix.io/checksum: b476dd8320d286bd6ef6fdf0bde47c42
  creationTimestamp: "2024-07-14T08:09:43Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-07-14T08:10:39Z"
  finalizers:
  - finalizer.kamaji.clastix.io/datastore-secret
  labels:
    kamaji.clastix.io/component: datastore-config
    kamaji.clastix.io/name: kubernetes-test0
    kamaji.clastix.io/project: kamaji
  name: kubernetes-test0-datastore-config
  namespace: tenant-georg03
  ownerReferences:
  - apiVersion: kamaji.clastix.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: TenantControlPlane
    name: kubernetes-test0
    uid: 424d8606-235b-4ed6-9706-7497bf97f194
  resourceVersion: "108976218"
  uid: c9a61283-8283-472c-bdf7-f59fb6b3631d
type: Opaque
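
The Secret above carries both a deletionTimestamp and a finalizer, which is the combination that keeps a namespace in Terminating. As a rough sketch (not part of Kamaji; the function name `blocked_objects` is hypothetical), the following filters the JSON produced by `kubectl get <resource> -n <namespace> -o json` for objects in that state:

```python
import json


def blocked_objects(kube_list):
    """Return (kind, name, finalizers) for objects that are marked for
    deletion but are still held back by at least one finalizer.

    `kube_list` is the parsed JSON of a `kubectl get ... -o json` call,
    i.e. a dict with an "items" list.
    """
    blocked = []
    for item in kube_list.get("items", []):
        meta = item.get("metadata", {})
        finalizers = meta.get("finalizers") or []
        # An object is only stuck if deletion was requested AND finalizers remain.
        if meta.get("deletionTimestamp") and finalizers:
            blocked.append((item.get("kind", "?"), meta.get("name", "?"), list(finalizers)))
    return blocked


if __name__ == "__main__":
    import sys

    # Example: kubectl get secrets -n tenant-georg03 -o json | python blocked.py
    for kind, name, finalizers in blocked_objects(json.load(sys.stdin)):
        print(f"{kind}/{name} is held by: {', '.join(finalizers)}")
```

Running it against every namespaced resource type would show which objects the namespace controller is still waiting on; here it would be expected to report only the datastore-config Secret.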

@kvaps
Contributor

kvaps commented Jul 15, 2024

The Kamaji logs show only this:

2024-07-15T12:02:27Z    INFO    resource have been deleted, skipping    {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"kubernetes-test0","namespace":"tenant-georg03"}, "namespace": "tenant-georg03", "name": "kubernetes-test0", "reconcileID": "cafc76d2-0041-4e7e-ae70-085214c06675"}

@gecube
Author

gecube commented Jul 15, 2024

Some additional details. The issue was observed on a https://github.com/aenix-io/cozystack installation.
The actions that lead to the issue:

  • create a tenant in Cozystack terms (i.e. install a HelmRelease, which in turn creates a namespace, let's say tenant-georg03)
  • install a cluster via HelmRelease inside the tenant-georg03 namespace. Wait until it is initialised, and work with the cluster for several days.
  • remove the HelmRelease that manages the namespace, effectively sending a DELETE to the Kubernetes API
  • the namespace is stuck in the Terminating state and Kamaji is in CrashLoopBackOff, effectively BLOCKING the creation of new clusters:
kubectl get helmrelease -n tenant-georg25
NAME                               AGE   READY   STATUS
clickhouse-test4                   61m   True    Helm install succeeded for release tenant-georg25/clickhouse-test4.v1 with chart [email protected]
copy-kafka-secret                  61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
fluxcd-test4                       61m           
kafka-test4                        61m   True    Helm install succeeded for release tenant-georg25/kafka-test4.v1 with chart [email protected]
kubernetes-test4                   61m   False   Helm install failed for release tenant-georg25/kubernetes-test4 with chart [email protected]: client rate limiter Wait returned an error: context deadline exceeded
kubernetes-test4-cert-manager      61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
kubernetes-test4-cilium            61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
kubernetes-test4-csi               61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
kubernetes-test4-fluxcd            61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
kubernetes-test4-fluxcd-operator   61m   False   dependency 'tenant-georg25/kubernetes-test4' is not ready
stand-25                           61m   True    Helm install succeeded for release tenant-georg25/stand-25.v1 with chart [email protected]
kubectl describe helmrelease kubernetes-test4 -n tenant-georg25
Name:         kubernetes-test4
Namespace:    tenant-georg25
Labels:       app.kubernetes.io/managed-by=Helm
              helm.toolkit.fluxcd.io/name=stand-25
              helm.toolkit.fluxcd.io/namespace=tenant-georg25
Annotations:  meta.helm.sh/release-name: stand-25
              meta.helm.sh/release-namespace: tenant-georg25
API Version:  helm.toolkit.fluxcd.io/v2
Kind:         HelmRelease
Metadata:
  Creation Timestamp:  2024-07-15T11:13:21Z
  Finalizers:
    finalizers.fluxcd.io
  Generation:        1
  Resource Version:  110739693
  UID:               9b671d1a-9def-40b1-852b-fb2cf7158ce0
Spec:
  Chart:
    Spec:
      Chart:               kubernetes
      Reconcile Strategy:  ChartVersion
      Source Ref:
        Kind:       HelmRepository
        Name:       cozystack-apps
        Namespace:  cozy-public
      Version:      0.6.0
  Interval:         1m0s
  Release Name:     kubernetes-test4
  Values:
    Addons:
      Cert Manager:
        Enabled:  true
      Fluxcd:
        Enabled:  true
    Control Plane:
      Replicas:  2
    Host:        
    Node Groups:
      md0:
        Max Replicas:  3
        Min Replicas:  0
        Resources:
          Cpu:     4
          Memory:  8192Mi
Status:
  Conditions:
    Last Transition Time:  2024-07-15T11:18:24Z
    Message:               Failed to install after 1 attempt(s)
    Observed Generation:   1
    Reason:                RetriesExceeded
    Status:                True
    Type:                  Stalled
    Last Transition Time:  2024-07-15T11:18:23Z
    Message:               Helm install failed for release tenant-georg25/kubernetes-test4 with chart [email protected]: client rate limiter Wait returned an error: context deadline exceeded
    Observed Generation:   1
    Reason:                InstallFailed
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-07-15T11:18:23Z
    Message:               Helm install failed for release tenant-georg25/kubernetes-test4 with chart [email protected]: client rate limiter Wait returned an error: context deadline exceeded
    Observed Generation:   1
    Reason:                InstallFailed
    Status:                False
    Type:                  Released
  Failures:                1
  Helm Chart:              cozy-public/tenant-georg25-kubernetes-test4
  History:
    App Version:                  1.30.1
    Chart Name:                   kubernetes
    Chart Version:                0.6.0
    Config Digest:                sha256:ffeab28c6a02570626b22a3fc541f433f86ffc8c95b4b7500b8fce016871bc89
    Digest:                       sha256:0f0026055b565f13bc594fc5c829e93fafe166a66f8aa5c36b94a45751c3fc10
    First Deployed:               2024-07-15T11:13:23Z
    Last Deployed:                2024-07-15T11:13:23Z
    Name:                         kubernetes-test4
    Namespace:                    tenant-georg25
    Status:                       failed
    Version:                      1
  Install Failures:               1
  Last Attempted Config Digest:   sha256:ffeab28c6a02570626b22a3fc541f433f86ffc8c95b4b7500b8fce016871bc89
  Last Attempted Generation:      1
  Last Attempted Release Action:  install
  Last Attempted Revision:        0.6.0
  Observed Generation:            1
  Storage Namespace:              tenant-georg25
Events:
  Type     Reason         Age   From             Message
  ----     ------         ----  ----             -------
  Warning  InstallFailed  56m   helm-controller  Helm install failed for release tenant-georg25/kubernetes-test4 with chart [email protected]: client rate limiter Wait returned an error: context deadline exceeded

Last Helm logs:

2024-07-15T11:13:23.739042656Z: Starting delete for "kubernetes-test4-flux-teardown" Role
2024-07-15T11:13:23.741742858Z: Ignoring delete failure for "kubernetes-test4-flux-teardown" rbac.authorization.k8s.io/v1, Kind=Role: roles.rbac.authorization.k8s.io "kubernetes-test4-flux-teardown" not found
2024-07-15T11:13:23.741754738Z: beginning wait for 1 resources to be deleted with timeout of 5m0s
2024-07-15T11:13:23.761647555Z: creating 1 resource(s)
2024-07-15T11:13:23.765293151Z: Starting delete for "kubernetes-test4-flux-teardown" Role
2024-07-15T11:13:23.769583372Z: beginning wait for 1 resources to be deleted with timeout of 5m0s
2024-07-15T11:13:23.77429976Z: creating 27 resource(s)
2024-07-15T11:13:23.888462685Z: beginning wait for 27 resources with timeout of 5m0s
2024-07-15T11:13:23.894387715Z: Deployment is not ready: tenant-georg25/kubernetes-test4-cluster-autoscaler. 0 out of 1 expected pods are ready
2024-07-15T11:18:21.896596776Z: Deployment is not ready: tenant-georg25/kubernetes-test4-cluster-autoscaler. 0 out of 1 expected pods are ready (148 duplicate lines omitted)

Checking the pods, I found that they could not start because the kubeconfig secret is missing:

kubectl get -n tenant-georg25 all
Warning: kubevirt.io/v1 VirtualMachineInstancePresets is now deprecated and will be removed in v2.
NAME                                                       READY   STATUS              RESTARTS   AGE
pod/chi-clickhouse-test4-clickhouse-0-0-0                  1/1     Running             0          9m
pod/kafka-test4-entity-operator-6c765b8f96-pzf9p           2/2     Running             0          7m22s
pod/kafka-test4-kafka-0                                    1/1     Running             0          8m24s
pod/kafka-test4-kafka-1                                    1/1     Running             0          8m24s
pod/kafka-test4-kafka-2                                    1/1     Running             0          8m24s
pod/kafka-test4-zookeeper-0                                1/1     Running             0          9m2s
pod/kafka-test4-zookeeper-1                                1/1     Running             0          9m2s
pod/kafka-test4-zookeeper-2                                1/1     Running             0          9m2s
pod/kubernetes-test4-cluster-autoscaler-598b659b6c-tfrll   0/1     ContainerCreating   0          9m3s
pod/kubernetes-test4-kccm-8445bbb6bb-bmwp2                 0/1     ContainerCreating   0          9m3s
pod/kubernetes-test4-kcsi-controller-8bd74cc96-l64xc       0/4     ContainerCreating   0          9m3s

...

Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    10m                 default-scheduler  Successfully assigned tenant-georg25/kubernetes-test4-cluster-autoscaler-598b659b6c-tfrll to srv3
  Warning  FailedMount  17s (x13 over 10m)  kubelet            MountVolume.SetUp failed for volume "kubeconfig" : secret "kubernetes-test4-admin-kubeconfig" not found

@prometherion
Member

We added the finalizer to the Datastore secret since this is required to delete the Datastore data, such as key prefixes for etcd, and schemas for RDBMS.

I'm a bit lost with the Cozystack terminology, so thanks for your patience. Could you please confirm these are the right steps to reproduce?

  1. Install Kamaji
  2. Create a Tenant Control Plane in its own Namespace
  3. Delete the Namespace
  4. Kamaji crashes

@prometherion prometherion self-assigned this Jul 15, 2024
@kvaps
Contributor

kvaps commented Jul 15, 2024

Unfortunately I cannot reproduce this behavior :-(

I can only see that this secret is created, and Kamaji does not try to remove either the secret or its finalizer.

@prometherion
Member

@gecube reading again the reported logs, it seems to me Kamaji is not able to delete the given Tenant since the connection with the related etcd is broken (context deadline exceeded).

Where is the Datastore located? Furthermore, what's the error causing the CrashLoopBackOff for the Kamaji pod? Is it failing health checks, or is it a nil pointer dereference?

@kvaps
Contributor

kvaps commented Jul 15, 2024

It seems the problem occurs only when the datastore is not available. I was able to reproduce it:

  1. create ns
  2. create tenantcontrolplane
  3. make the datastore unavailable
  4. remove the namespace where the TCP is installed
  5. restart kamaji

Check that the namespace still holds the secret for accessing the database:

NAME                                    TYPE     DATA   AGE
kubernetes-qweqweqwe-datastore-config   Opaque   4      36m

It has a finalizer, which blocks namespace removal.

Kamaji removes tenantcontrolplanes.kamaji.clastix.io from the namespace even if the datastore is not available.
So if you recover the datastore later, the orphaned secret won't be reconciled.

@prometherion
Member

I was able to reproduce this, but Kamaji is not in CrashLoopBackOff (v1.0.0)

NAME                              READY   STATUS      RESTARTS       AGE
etcd-0                            1/1     Running     7 (6d8h ago)   36d
etcd-1                            1/1     Running     7 (6d8h ago)   36d
etcd-2                            1/1     Running     7 (6d8h ago)   36d
etcd-nvme-defrag-28684290-mmg24   0/1     Completed   0              2m39s
kamaji-56649dbd78-hx2gq           1/1     Running     0              2m28s

I see in the logs that Kamaji tries to connect to the given Datastore, and that's OK: the problem here is that Kamaji is not aware of your business logic.

Not sure if it's the case, but let's take for granted you're deleting the Datastore/etcd in the same Namespace where the Tenant Control Plane resides: we know the Tenant Control Plane has a dependency on the Datastore that must be finalized prior to the deletion of the Datastore itself. Kamaji is not aware that you're deleting the entire Namespace and that etcd is gone, so it constantly tries to reconcile the finalizer by performing the clean-up.

I would suggest, if possible, ordering the actions, as we do with the creation of a Tenant Control Plane, where:

  1. a Datastore is created
  2. then the Tenant Control Plane for the given Datastore is created

With the same principle, the deletion requires:

  1. Deletion of the Tenant Control Plane
  2. Eventually, deletion of the Datastore
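
The ordering above can be read as a dependency graph: resources are created dependencies-first and deleted in the reverse order. A minimal sketch using Python's standard `graphlib` module follows; the resource names are plain labels for illustration and the helper names are hypothetical, none of this calls the Kamaji or Kubernetes API.

```python
from graphlib import TopologicalSorter


def creation_order(depends_on):
    """Order resources so that every dependency comes before its dependents.

    `depends_on` maps a resource to the set of resources it depends on.
    """
    return list(TopologicalSorter(depends_on).static_order())


def deletion_order(depends_on):
    """Deletion is the reverse: dependents go first, dependencies last."""
    return list(reversed(creation_order(depends_on)))


if __name__ == "__main__":
    # The Tenant Control Plane depends on the Datastore.
    graph = {"TenantControlPlane": {"DataStore"}}
    print("create:", creation_order(graph))
    print("delete:", deletion_order(graph))
```

A platform built on top of Kamaji could apply the same idea when tearing down a tenant, deleting the Tenant Control Plane and waiting for it to disappear before touching the Datastore.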

It could be that your etcd is unreachable for a specific reason, as in my test (kubectl scale sts --replicas=0): even here, once scaled back up, Kamaji was able to connect to etcd and perform the clean-up, and then the TCP was removed, as well as the secrets and the Namespace.

I don't think this is a bug we have to address; it sounds more like an edge case where you have to better orchestrate your platform on top of Kamaji.

@gecube
Author

gecube commented Jul 15, 2024

@prometherion thanks for the reproduction. I think that we cannot rely on removal order anyway. We can implement ordering when applying objects (there are many mechanisms for it, particularly in Helm itself or FluxCD), but for removal we can expect anything. For example, a user comes and removes the namespace completely because they are not aware of the complex logic under the hood, and we can do nothing about it at the platform level. The only option (I believe) is to write all controllers in such a manner that:

  1. the controller never resides in the same namespace as the CR it manages. Otherwise it is easy to run into a situation where the controller pod has already been removed (as removal order is not strictly defined) and nobody is left to handle the finalizer. I have already faced the same issue even with FluxCD itself and the Victoria Logs operator.
  2. the controller properly handles the removal of all objects it manages.

@prometherion
Member

I'd like to help here, but unfortunately, it's out of our control.

If the user makes the Datastore unavailable for any reason and then deletes the TenantControlPlane, Kamaji still relies on the clean-up of those resources. That makes sense, since we don't want etcd left with orphaned keys, and given the context here (etcd unreachable, and the user deleting the Namespace) it sounds like an edge case.

The addressable bug here is the CrashLoopBackOff, which is not reproducible, at least with v1.0.0.

As I said before, without being nasty, it's not a bug per se, but the typical Kubernetes scenario where there's a chain of dependencies that must be known by the user or, if it's orchestrated by a third-party platform, the platform must know it and orchestrate accordingly.

I'm going to close this issue but:

  1. happy to reopen it if you're able to provide further details on how to replicate the CrashLoopBackOff
  2. happy to continue the discussion for an enhanced proposal
  3. note that once the Datastore connection was re-established, the deletion completed successfully.

@prometherion prometherion added the wontfix This will not be worked on label Jul 15, 2024
@jds9090
Contributor

jds9090 commented Oct 30, 2024

@prometherion Why does it use a finalizer, unlike other secret resources?

apiVersion: v1
data:
  DB_CONNECTION_STRING: ""
  DB_PASSWORD: xxx
  DB_SCHEMA: xxx
  DB_USER: xxx
kind: Secret
metadata:
  annotations:
    kamaji.clastix.io/checksum: 566325264c13ef54ad9af9190e64aa8e
  creationTimestamp: "2024-10-30T05:45:21Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-10-30T05:57:25Z"
  finalizers:
  - finalizer.kamaji.clastix.io/datastore-secret

Predictable Reproduction Flow

  • Request to delete tenantControlPlane
  • Request and completion of datastore-data deletion && Request and completion of soot deletion
  • Request to delete datastore-secret and failure: Conflict occurs due to object modification, retry initiated
  • Removal of tenantControlPlane's finalizers and completion of tenantControlPlane deletion
  • Unable to issue further datastore-secret deletion requests since tenantControlPlane is deleted
  • Request to delete namespace
  • Namespace status remains Terminating (datastore-secret with an unremoved finalizer still exists)
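
The Conflict step in this flow is the usual optimistic-concurrency failure: an update sent with a stale resourceVersion is rejected, and the caller has to re-read the object before trying again. The sketch below simulates that with an in-memory store; `FakeStore`, `Conflict` and `remove_finalizer` are hypothetical names for illustration and this is not Kamaji's code (in Go, client-go offers `retry.RetryOnConflict` for the same pattern).

```python
import copy


class Conflict(Exception):
    """Raised when an update carries a stale resourceVersion."""


class FakeStore:
    """In-memory stand-in for the API server holding a single object."""

    def __init__(self, obj, interfering_writes=0):
        self._obj = copy.deepcopy(obj)
        # How many times another writer modifies the object right after we read it.
        self._interfering_writes = interfering_writes

    def get(self):
        snapshot = copy.deepcopy(self._obj)
        if self._interfering_writes > 0:
            self._interfering_writes -= 1
            self._obj["resourceVersion"] += 1  # simulated concurrent update
        return snapshot

    def update(self, obj):
        if obj["resourceVersion"] != self._obj["resourceVersion"]:
            raise Conflict("the object has been modified")
        updated = copy.deepcopy(obj)
        updated["resourceVersion"] += 1
        self._obj = updated


def remove_finalizer(store, finalizer, max_attempts=5):
    """Remove `finalizer`, re-reading the object after every conflict.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        obj = store.get()  # always start from the latest version
        if finalizer not in obj["finalizers"]:
            return attempt  # nothing left to do
        obj["finalizers"] = [f for f in obj["finalizers"] if f != finalizer]
        try:
            store.update(obj)
            return attempt
        except Conflict:
            continue  # stale read, fetch again and retry
    raise Conflict(f"gave up after {max_attempts} attempts")
```

The point of the sketch is that the finalizer removal has to succeed before the owning object is allowed to go away; if the owner's finalizers are dropped after a failed attempt, nothing is left to trigger the retry.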

As I understand it, tenant data is deleted with admin privileges. Is the datastore secret necessary before deleting a tenant (#376)?

@prometherion
Member

All the Datastore actions, besides creation, are achieved using the limited account.

We could switch over the root credentials given by the Datastore resource, and retrieve the scheme from the Tenant Control Plane status.

It would require a refactoring tho, something that I'm not able to manage right now: happy to receive contributions tho, as well as providing guidance through the code base.

@jds9090
Contributor

jds9090 commented Oct 30, 2024

All the Datastore actions, besides creation, are achieved using the limited account.

We could switch over the root credentials given by the Datastore resource, and retrieve the scheme from the Tenant Control Plane status.

It would require a refactoring tho, something that I'm not able to manage right now: happy to receive contributions tho, as well as providing guidance through the code base.

Do you consider it a bug that the datastore-secret still remains even after the tenantControlPlane has been deleted?

It is related to the finalizer (finalizer.kamaji.clastix.io/datastore-secret).

Here are the logs:

[ubuntu@infra-bastion ~ (⎈|infra-cluster-admin@infra-cluster:N/A)]$ k get tcp -n ceabf732259b4dc4a06ad44c1f255c70
No resources found in ceabf732259b4dc4a06ad44c1f255c70 namespace.
[ubuntu@infra-bastion ~ (⎈|infra-cluster-admin@infra-cluster:N/A)]$ k get po -n ceabf732259b4dc4a06ad44c1f255c70
No resources found in ceabf732259b4dc4a06ad44c1f255c70 namespace.
[ubuntu@infra-bastion ~ (⎈|infra-cluster-admin@infra-cluster:N/A)]$ k get secret -n ceabf732259b4dc4a06ad44c1f255c70
NAME                                    TYPE     DATA   AGE
upgrading-ci-cp-test-datastore-config   Opaque   4      74m
[ubuntu@infra-bastion ~ (⎈|infra-cluster-admin@infra-cluster:N/A)]$ k get ns ceabf732259b4dc4a06ad44c1f255c70
NAME                               STATUS        AGE
ceabf732259b4dc4a06ad44c1f255c70   Terminating   75m
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Stopping and waiting for non leader election runnables
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Stopping and waiting for leader election runnables
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Shutdown signal received, waiting for all workers to finish     {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Shutdown signal received, waiting for all workers to finish     {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Shutdown signal received, waiting for all workers to finish     {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Shutdown signal received, waiting for all workers to finish     {"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Shutdown signal received, waiting for all workers to finish     {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Shutdown signal received, waiting for all workers to finish     {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Shutdown signal received, waiting for all workers to finish     {"controller": "validatingwebhookconfiguration", "controllerGroup": "admissionregistration.k8s.io", "controllerKind": "ValidatingWebhookConfiguration"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Shutdown signal received, waiting for all workers to finish     {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      All workers finished    {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      All workers finished    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      All workers finished    {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      All workers finished    {"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      All workers finished    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      All workers finished    {"controller": "daemonset", "controllerGroup": "apps", "controllerKind": "DaemonSet"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      All workers finished    {"controller": "validatingwebhookconfiguration", "controllerGroup": "admissionregistration.k8s.io", "controllerKind": "ValidatingWebhookConfiguration"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      All workers finished    {"controller": "clusterrolebinding", "controllerGroup": "rbac.authorization.k8s.io", "controllerKind": "ClusterRoleBinding"}
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Stopping and waiting for caches
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Stopping and waiting for webhooks
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Stopping and waiting for HTTP servers
2024-10-30T05:57:24Z    INFO    soot_ceabf732259b4dc4a06ad44c1f255c70_upgrading-ci-cp-test      Wait completed, proceeding to shutdown the manager
2024-10-30T05:57:24Z    INFO    marked for deletion, performing clean-up        {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"upgrading-ci-cp-test","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test", "reconcileID": "da7d140a-ecd2-4a40-ac67-03f4f9d8f527"}
2024-10-30T05:57:25Z    ERROR   resource deletion failed        {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"upgrading-ci-cp-test","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test", "reconcileID": "da7d140a-ecd2-4a40-ac67-03f4f9d8f527", "resource": "datastore-config", "error": "cannot remove DataStore Secret finalizers: Operation cannot be fulfilled on secrets \"upgrading-ci-cp-test-datastore-config\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on secrets \"upgrading-ci-cp-test-datastore-config\": the object has been modified; please apply your changes to the latest version and try again\ncannot remove DataStore Secret finalizers\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Config).Delete\n\t/workspace/internal/resources/datastore/datastore_storage_config.go:88\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email 
protected]/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
2024-10-30T05:57:25Z    ERROR   Reconciler error        {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"upgrading-ci-cp-test","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test", "reconcileID": "da7d140a-ecd2-4a40-ac67-03f4f9d8f527", "error": "cannot remove DataStore Secret finalizers: Operation cannot be fulfilled on secrets \"upgrading-ci-cp-test-datastore-config\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on secrets \"upgrading-ci-cp-test-datastore-config\": the object has been modified; please apply your changes to the latest version and try again\ncannot remove DataStore Secret finalizers\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Config).Delete\n\t/workspace/internal/resources/datastore/datastore_storage_config.go:88\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email 
protected]/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
2024-10-30T05:57:25Z    INFO    starting CertificateLifecycle handling  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"upgrading-ci-cp-test-front-proxy-client-certificate","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test-front-proxy-client-certificate", "reconcileID": "beea4d0d-dbe9-4567-b42b-d9b1fa6a2918"}
2024-10-30T05:57:25Z    INFO    resource have been deleted, skipping    {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"upgrading-ci-cp-test-front-proxy-client-certificate","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test-front-proxy-client-certificate", "reconcileID": "beea4d0d-dbe9-4567-b42b-d9b1fa6a2918"}
...
2024-10-30T05:57:25Z    INFO    resource have been deleted, skipping    {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"upgrading-ci-cp-test","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test", "reconcileID": "60b34419-f864-4200-9de6-ccbd70daecb0"}
2024-10-30T05:57:25Z    INFO    starting CertificateLifecycle handling  {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"upgrading-ci-cp-test-konnectivity-certificate","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test-konnectivity-certificate", "reconcileID": "ad6400d5-d545-4517-9fc9-e8dfbf9b5e05"}
2024-10-30T05:57:25Z    INFO    resource have been deleted, skipping    {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"upgrading-ci-cp-test-konnectivity-certificate","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test-konnectivity-certificate", "reconcileID": "ad6400d5-d545-4517-9fc9-e8dfbf9b5e05"}
2024-10-30T05:57:25Z    INFO    resource have been deleted, skipping    {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"upgrading-ci-cp-test","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test", "reconcileID": "279bfc6c-8942-40e8-8926-ce444a4d960c"}
2024-10-30T05:57:25Z    INFO    resource have been deleted, skipping    {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"upgrading-ci-cp-test","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test", "reconcileID": "f0e6238f-cdca-4939-a6b7-985fe5afbe5c"}
2024-10-30T05:57:25Z    INFO    resource have been deleted, skipping    {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"upgrading-ci-cp-test","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test", "reconcileID": "9ff1106f-6b73-4e9c-a56d-f992982f9178"}
2024-10-30T05:57:41Z    INFO    resource have been deleted, skipping    {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"upgrading-ci-cp-test","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test", "reconcileID": "370d2fb9-08c5-4f69-a2af-c34aaf037b45"}

In my view, the current TenantControlPlane controller cannot consistently guarantee the removal of that finalizer.

@prometherion
Member

Do you consider it a bug that the datastore-secret still remains even after the tenantControlPlane has been deleted?

I'm unable to replicate the issue.

NAMESPACE    NAME      VERSION   STATUS         CONTROL-PLANE ENDPOINT   KUBECONFIG                 DATASTORE   AGE
ns-removal   k8s-130   v1.30.0   Provisioning   172.18.255.100:6443      k8s-130-admin-kubeconfig   default     10s

$: kubectl delete ns ns-removal
namespace "ns-removal" deleted

$: sleep 60  # waiting for Namespace finalizer

$: kubectl get ns ns-removal
Error from server (NotFound): namespaces "ns-removal" not found

JFYI, I'm running Kamaji on latest.

@jds9090
Contributor

jds9090 commented Oct 30, 2024

Have there been any efforts or work related to this issue? If not, note that this behavior does not occur consistently.

This behavior occurs due to the following reason:
"The object has been modified; please apply your changes to the latest version and try again."

2024-10-30T05:57:25Z    ERROR   resource deletion failed        {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"upgrading-ci-cp-test","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test", "reconcileID": "da7d140a-ecd2-4a40-ac67-03f4f9d8f527", "resource": "datastore-config", "error": "cannot remove DataStore Secret finalizers: Operation cannot be fulfilled on secrets \"upgrading-ci-cp-test-datastore-config\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on secrets \"upgrading-ci-cp-test-datastore-config\": the object has been modified; please apply your changes to the latest version and try again\ncannot remove DataStore Secret finalizers\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Config).Delete\n\t/workspace/internal/resources/datastore/datastore_storage_config.go:88\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
2024-10-30T05:57:25Z    ERROR   Reconciler error        {"controller": "tenantcontrolplane", "controllerGroup": "kamaji.clastix.io", "controllerKind": "TenantControlPlane", "TenantControlPlane": {"name":"upgrading-ci-cp-test","namespace":"ceabf732259b4dc4a06ad44c1f255c70"}, "namespace": "ceabf732259b4dc4a06ad44c1f255c70", "name": "upgrading-ci-cp-test", "reconcileID": "da7d140a-ecd2-4a40-ac67-03f4f9d8f527", "error": "cannot remove DataStore Secret finalizers: Operation cannot be fulfilled on secrets \"upgrading-ci-cp-test-datastore-config\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on secrets \"upgrading-ci-cp-test-datastore-config\": the object has been modified; please apply your changes to the latest version and try again\ncannot remove DataStore Secret finalizers\ngithub.com/clastix/kamaji/internal/resources/datastore.(*Config).Delete\n\t/workspace/internal/resources/datastore/datastore_storage_config.go:88\ngithub.com/clastix/kamaji/internal/resources.HandleDeletion\n\t/workspace/internal/resources/resource.go:88\ngithub.com/clastix/kamaji/controllers.(*TenantControlPlaneReconciler).Reconcile\n\t/workspace/controllers/tenantcontrolplane_controller.go:151\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:270\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}

@gecube
Author

gecube commented Oct 30, 2024

@jds9090 agreed, this does not occur every time, so it takes a lot of tries to confirm the existence of the issue.

@jds9090
Contributor

jds9090 commented Oct 30, 2024

For example, it might be necessary to add a finalizer to the tenantControlPlane to ensure the deletion of the datastore-secret. However, it's still unclear whether this approach is appropriate.

@prometherion
Member

@jds9090 the "The object has been modified; please apply your changes to the latest version and try again." error is the root issue, I agree. May I ask you to share whether there's any trace of an update action in the annotations, such as last-applied-configuration?

Trying to understand who's changing that Secret is the key!

@prometherion
Member

so need a lot of tries to confirm the existence of the issue

Wondering if it eventually succeeds or not, since I'm not able to replicate.

And I was thinking that maybe we could wrap this portion of code in a Retry function.

func (r *Config) Delete(ctx context.Context, _ *kamajiv1alpha1.TenantControlPlane) error {
	secret := r.resource.DeepCopy()
	if err := r.Client.Get(ctx, types.NamespacedName{Name: r.resource.Name, Namespace: r.resource.Namespace}, secret); err != nil {
		if kubeerrors.IsNotFound(err) {
			return nil
		}
		return errors.Wrap(err, "cannot retrieve the DataStore Secret for removal")
	}
	secret.SetFinalizers(nil)
	if err := r.Client.Update(ctx, secret); err != nil {
		if kubeerrors.IsNotFound(err) {
			return nil
		}
		return errors.Wrap(err, "cannot remove DataStore Secret finalizers")
	}
	return nil
}

@jds9090
Contributor

jds9090 commented Oct 31, 2024

Unfortunately, there is no information regarding the last-applied-configuration.

apiVersion: v1
data:
  DB_CONNECTION_STRING: ""
  DB_PASSWORD: xxx
  DB_SCHEMA: xxx
  DB_USER: xxx
kind: Secret
metadata:
  annotations:
    kamaji.clastix.io/checksum: xxx
  creationTimestamp: "2024-10-30T05:45:21Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-10-30T05:57:25Z"
  finalizers:
  - finalizer.kamaji.clastix.io/datastore-secret
  labels:
    kamaji.clastix.io/component: datastore-config
    kamaji.clastix.io/name: upgrading-ci-cp-test
    kamaji.clastix.io/project: kamaji
  name: upgrading-ci-cp-test-datastore-config
  namespace: ceabf732259b4dc4a06ad44c1f255c70
  ownerReferences:
  - apiVersion: kamaji.clastix.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: TenantControlPlane
    name: upgrading-ci-cp-test
    uid: e0a1eb79-537c-44a0-ad7b-03f12b01216c
  resourceVersion: "66001921"
  uid: 2e7c4b16-96ac-4bb0-90cf-23ae5fa5bc2c
type: Opaque

@jds9090
Contributor

jds9090 commented Oct 31, 2024

A retry mechanism seems like it would be helpful here.

@jds9090
Contributor

jds9090 commented Oct 31, 2024

I believe it could occur in a scenario like this.

(Action 1)

  • Request to delete TenantControlPlane
  • TenantControlPlane deletion in progress
  • datastore-secret modified due to Action 2
  • TenantControlPlane deletion completed, datastore-secret deletion failed

(Action 2)

  • Request to delete namespace
  • deletionTimestamp value of datastore-secret modified

@jds9090
Contributor

jds9090 commented Oct 31, 2024

All the Datastore actions, besides creation, are achieved using the limited account.

We could switch over the root credentials given by the Datastore resource, and retrieve the scheme from the Tenant Control Plane status.

It would require a refactoring tho, something that I'm not able to manage right now: happy to receive contributions tho, as well as providing guidance through the code base.

I’ve been reviewing the approach for Tenant Kubernetes API server access to etcd, which relies solely on TLS-based authentication without using user and password information. This setup appears to apply to the Operator as well, as it also establishes etcd connections without utilizing user and password credentials. I understand that schema information is being used for tenant separation. Are there any scenarios where user and password information would be necessary for etcd connections?

@prometherion
Member

@jds9090 just merged #631, which adds a retry mechanism. Hope this will fix this race condition!

@jds9090
Contributor

jds9090 commented Nov 12, 2024

#631

Thank you! I will let you know if it happens again.
