Update CLI to apply resources using regular kubectl apply instead of with force #6784
Conversation
Skipping CI for Draft Pull Request.
Codecov Report
All modified lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6784      +/-   ##
==========================================
- Coverage   71.78%   71.76%   -0.02%
==========================================
  Files         531      531
  Lines       41295    41295
==========================================
- Hits        29642    29634       -8
- Misses       9993    10000       +7
- Partials     1660     1661       +1

☔ View full report in Codecov by Sentry.
/hold Running a few more e2e tests locally
/unhold
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cxbrowne1207

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/test eks-anywhere-presubmit
Issue #, if available:
#6453
Description of changes:
This PR addresses an issue where the CLI stalls on create/upgrade if it encounters a validation error at the point of force applying ("Applying eksa yaml resources to cluster") near the end of cluster creation or upgrade; it hangs indefinitely until it eventually times out. The proposal is to use a regular client-side `kubectl apply` operation instead of the current client-side `kubectl apply --force`, so that such issues are raised earlier and handled properly in the webhooks.

Normally, `kubectl apply --force` would delete the resource and simply recreate it, but instead the CLI gets stuck. This is because there is a finalizer on the Cluster object resource. Typically, the `eksa-controller-manager` controls deletion of this resource using the finalizer once the dependent objects have been cleaned up; however, for management clusters the controller short-circuits with an error before it gets to the point where it removes the finalizer. It instead throws an error saying `deleting self-managed clusters is not supported`, which is intended behavior. Currently, this seems to happen only for management clusters.

So far, this situation has primarily been observed when upgrading a cluster using GitOps Flux, but it can be present when not using GitOps Flux as well. Here are two cases in which we’ve encountered this problem:
Case 1
When upgrading from the latest minor release to v0.17.0, the CLI stalled while "Applying eksa yaml resources to cluster". In this case, both the newly added eksaVersion field and the bundlesRef were populated in the submission to the kube-apiserver; however, a webhook validation does not allow both to be set at the same time, so the kube-apiserver pod was throwing an error.

Though the Cluster object that was force applied was valid, with only the eksaVersion specified and no bundlesRef set, the client was performing a three-way strategic merge that added the old bundlesRef to the new spec as well.

We had decided that a solution addressing the root cause at that point would be risky and would require more discussion. Therefore, a temporary patch was made with this PR, which utilized an alternate ClusterSpec struct called ClusterSpecGenerate, where the bundlesRef is not omitted when empty. This addresses the issue at the time of force apply by forcing the inclusion of a nil bundlesRef field in the request to the kube-apiserver, avoiding the three-way strategic merge behavior that was happening here.
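As a minimal sketch of the serialization difference described above (the struct shapes here are simplified and hypothetical, not the actual eks-anywhere API types; only the field names mirror the ones mentioned in this PR):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BundlesRef is a simplified stand-in for the real reference type.
type BundlesRef struct {
	Name string `json:"name"`
}

// ClusterSpec drops bundlesRef entirely when it is nil, so a client-side
// three-way merge keeps the old value stored in the cluster.
type ClusterSpec struct {
	EksaVersion string      `json:"eksaVersion,omitempty"`
	BundlesRef  *BundlesRef `json:"bundlesRef,omitempty"`
}

// ClusterSpecGenerate keeps bundlesRef in the payload even when nil, which
// explicitly tells the kube-apiserver the field should be null.
type ClusterSpecGenerate struct {
	EksaVersion string      `json:"eksaVersion,omitempty"`
	BundlesRef  *BundlesRef `json:"bundlesRef"`
}

func main() {
	a, _ := json.Marshal(ClusterSpec{EksaVersion: "v0.17.0"})
	b, _ := json.Marshal(ClusterSpecGenerate{EksaVersion: "v0.17.0"})
	fmt.Println(string(a)) // {"eksaVersion":"v0.17.0"}
	fmt.Println(string(b)) // {"eksaVersion":"v0.17.0","bundlesRef":null}
}
```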
Case 2
A user reported an issue with a cluster in a similar state to the aforementioned one. They were trying to upgrade their EKS-A cluster, managed through GitOps, from Kubernetes version 1.26 to 1.27. However, they encountered a validation error.

In the cluster webhook, we skip the immutable field validations when the Cluster resource has the paused annotation. In this case, after investigating, we saw that the user's Cluster object in etcd did not have the paused annotation when retrieved at the point of failure. Upon further testing, the upgrade under the conditions described by the user succeeds, so we can conclude that the cause of the failure was that, for some reason, their cluster did not have the expected paused annotation. However, the force apply operation was attempting to delete the Cluster object, which could have had unintended effects, since the cluster did not have reconciliation paused.
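As a rough sketch of the webhook behavior described above (the annotation key and helper names here are assumptions for illustration, not the exact eks-anywhere implementation):

```go
package webhook

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pausedAnnotation is an assumed annotation key for illustration; the real
// project may use a different key.
const pausedAnnotation = "anywhere.eks.amazonaws.com/paused"

// isPaused reports whether reconciliation is paused for the object.
func isPaused(obj metav1.Object) bool {
	_, ok := obj.GetAnnotations()[pausedAnnotation]
	return ok
}

// validateImmutableField mirrors the behavior described above: when the
// Cluster carries the paused annotation the immutable-field check is skipped,
// otherwise a change to the field is rejected.
func validateImmutableField(cluster metav1.Object, field, oldVal, newVal string) error {
	if isPaused(cluster) {
		// Reconciliation is paused (e.g. during a CLI-driven upgrade), so
		// immutable-field validations are not enforced.
		return nil
	}
	if oldVal != newVal {
		return fmt.Errorf("%s is immutable", field)
	}
	return nil
}
```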
Solution
So, in the case of management clusters, the force apply mechanism is broken completely: if it encounters a validation error it never accomplishes its purpose, stalling and then eventually timing out with a vague error. For workload clusters, there may be situations where we unexpectedly bypass a valid webhook validation, as described in case 2 above. Instead of force applying the Cluster spec resources, we should update the code to do a regular `kubectl apply` instead of a `kubectl apply --force` of the Cluster spec resources and bubble up any errors that result. This makes it easier to debug and to identify, earlier, the validations that should be skipped in the webhooks when cluster reconciliation is paused. Moreover, this is the approach we currently take when validating the `cluster`, `machineconfig`, and `datacenterconfig` objects against updates in their respective webhooks across all providers. So, allowing the error to bubble up should have little or no impact on the current state of things.

For workload clusters, the controller would remove the finalizer when the cluster is managed by the CLI. If reconciliation is paused for the cluster after the deletion, the rest of the reconciliation process is short-circuited. During the upgrade process we pause reconciliation on the cluster object, so at this point there should be no side effects, those things considered. However, force apply may still be undesirable because it effectively ignores webhook errors that may be valid instead of properly handling them.

In the event that there is an immutable field validation that is currently not being skipped in the webhook when cluster reconciliation is paused, or not otherwise handled properly, we would have to address it or add fixes to the webhooks to skip those validations when the cluster object is paused for reconciliation, if not already.
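As a rough illustration of the intended change, assuming a hypothetical kubectl wrapper (the interface and method names below are placeholders, not the actual eks-anywhere code):

```go
package clustermanager

import (
	"context"
	"fmt"
)

// KubectlRunner is a hypothetical wrapper around the kubectl binary; the real
// CLI uses its own executables package with different method names.
type KubectlRunner interface {
	// Apply runs a plain client-side `kubectl apply -f -` with the given manifest.
	Apply(ctx context.Context, kubeconfig string, manifest []byte) error
	// ApplyForce runs `kubectl apply --force -f -`, which may delete and
	// recreate resources.
	ApplyForce(ctx context.Context, kubeconfig string, manifest []byte) error
}

// applyEksaResources applies the generated eksa yaml resources to the cluster.
// Before: a force apply that could hang on the Cluster finalizer.
// After: a regular apply whose webhook validation errors bubble up immediately.
func applyEksaResources(ctx context.Context, k KubectlRunner, kubeconfig string, manifest []byte) error {
	// Previously (roughly): return k.ApplyForce(ctx, kubeconfig, manifest)
	if err := k.Apply(ctx, kubeconfig, manifest); err != nil {
		return fmt.Errorf("applying eksa yaml resources to cluster: %v", err)
	}
	return nil
}
```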
Testing (if applicable):
Documentation added/planned (if applicable):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.