
Update CLI to apply resources using regular kubectl apply instead of force #6784

Merged
merged 1 commit into aws:main on Oct 11, 2023

Conversation

@cxbrowne1207 (Member) commented Oct 6, 2023

Issue #, if available:
#6453

Description of changes:

This PR addresses an issue where the CLI stalls during create/upgrade if it encounters a validation error while force-applying resources at the "Applying eksa yaml resources to cluster" step near the end of cluster creation or upgrade; it hangs indefinitely until it eventually times out. This PR proposes using a regular client-side kubectl apply instead of the current client-side kubectl apply --force, so that such issues surface earlier and are handled properly by the webhooks.

Normally, kubectl apply --force would delete the resource and recreate it, but instead the CLI gets stuck. This is because there is a finalizer on the Cluster object. Typically, the eksa-controller-manager controls deletion of this resource via the finalizer once the dependent objects have been cleaned up. However, for management clusters the controller short-circuits with an error before it reaches the point where it removes the finalizer: it reports that deleting self-managed clusters is not supported, which is the intended behavior. Currently, this seems to happen only for management clusters.
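For context, here is a minimal sketch of the finalizer pattern described above, using controller-runtime's controllerutil. The finalizer name and the selfManaged check are placeholders, not the actual eksa-controller-manager code:

```go
package controller

import (
	"context"
	"errors"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// clusterFinalizer is a placeholder finalizer name for illustration only.
const clusterFinalizer = "example.anywhere.eks.amazonaws.com/finalizer"

// reconcileDelete sketches why a force apply hangs on management clusters:
// the controller returns early for self-managed clusters, so the finalizer
// is never removed and the object's deletion never completes.
func reconcileDelete(ctx context.Context, c client.Client, cluster client.Object, selfManaged bool) error {
	if selfManaged {
		// Intended behavior: deleting self-managed (management) clusters is
		// not supported, so the reconcile short-circuits here.
		return errors.New("deleting self-managed clusters is not supported")
	}

	// ...clean up dependent objects here...

	// Only after cleanup does the controller drop the finalizer, letting the
	// API server finish deleting the Cluster object.
	controllerutil.RemoveFinalizer(cluster, clusterFinalizer)
	return c.Update(ctx, cluster)
}
```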

So far, this situation has primarily been observed when upgrading a cluster that uses GitOps Flux, but it can occur without GitOps Flux as well. Here are two cases in which we’ve encountered this problem:

Case 1

When upgrading from the latest minor release to v0.17.0, the CLI stalled while "Applying eksa yaml resources to cluster". In this case, both the newly added eksaVersion field and bundlesRef were populated in the submission to the kube-apiserver. However, there is a webhook validation that does not allow both to be set at the same time, so an error was being thrown in the kube-apiserver pod:

BundlesRef:(*v1alpha1.BundlesRef)(0xc000fce960), EksaVersion:(*v1alpha1.EksaVersion)(0xc000fc7a30)}: cannot pass both bundlesRef and eksaVersion. New clusters should use eksaVersion instead of bundlesRef
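The check behind that error is a simple mutual-exclusion validation; a minimal sketch follows (illustrative only, not the actual eks-anywhere webhook code, though the field names match the error above):

```go
package webhook

import (
	"errors"

	anywherev1 "github.com/aws/eks-anywhere/pkg/api/v1alpha1"
)

// validateEksaVersionAndBundlesRef sketches the mutual-exclusion rule the
// admission webhook enforces on the Cluster spec.
func validateEksaVersionAndBundlesRef(cluster *anywherev1.Cluster) error {
	if cluster.Spec.BundlesRef != nil && cluster.Spec.EksaVersion != nil {
		return errors.New("cannot pass both bundlesRef and eksaVersion. New clusters should use eksaVersion instead of bundlesRef")
	}
	return nil
}
```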

Though the Cluster object that was force-applied was valid, with only eksaVersion specified and no bundlesRef set, kubectl performed its three-way strategic merge and carried the old bundlesRef from the live object into the new spec as well.

We had decided that a solution addressing the root cause would, at that point, be risky and would require more discussion. Therefore, a temporary patch was made with this PR, which used an alternate ClusterSpec struct called ClusterSpecGenerate, in which bundlesRef is not omitted when empty. This addresses the issue at force-apply time by forcing the inclusion of a null bundlesRef field in the request to the kube-apiserver, avoiding the three-way strategic merge behavior that was happening here.
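A minimal sketch of the struct-tag difference that patch relies on; the type names below are illustrative, not the actual ClusterSpec/ClusterSpecGenerate definitions:

```go
package main

import (
	"encoding/json"
	"fmt"

	anywherev1 "github.com/aws/eks-anywhere/pkg/api/v1alpha1"
)

// With omitempty, a nil BundlesRef disappears from the marshaled manifest, so
// the three-way strategic merge preserves the old bundlesRef on the live object.
type specWithOmitEmpty struct {
	BundlesRef *anywherev1.BundlesRef `json:"bundlesRef,omitempty"`
}

// Without omitempty, a nil BundlesRef is sent as an explicit "bundlesRef": null,
// which clears the field instead of merging in the old value -- the idea behind
// the ClusterSpecGenerate workaround described above.
type specWithoutOmitEmpty struct {
	BundlesRef *anywherev1.BundlesRef `json:"bundlesRef"`
}

func main() {
	a, _ := json.Marshal(specWithOmitEmpty{})
	b, _ := json.Marshal(specWithoutOmitEmpty{})
	fmt.Println(string(a)) // {}
	fmt.Println(string(b)) // {"bundlesRef":null}
}
```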

Case 2

A user reported an issue with a cluster in a state similar to the one described above. They were trying to upgrade their GitOps-managed EKS-A cluster from Kubernetes version 1.26 to 1.27. However, they encountered a validation error:

W0809 17:39:26.913668       1 dispatcher.go:216] rejected by webhook "validation.cluster.anywhere.amazonaws.com": &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"admission webhook \"validation.cluster.anywhere.amazonaws.com\" denied the request: Cluster.anywhere.eks.amazonaws.com \"psv2-cdp-dev\" is invalid: spec.kubernetesVersion: Forbidden: field is immutable 1.27", Reason:"Invalid", Details:(*v1.StatusDetails)(0xc015c02360), Code:422}}

In the cluster webhook, we skip the immutable-field validations when the Cluster resource has the paused annotation. In this case, after investigating, we saw that the user's Cluster object in etcd did not have the paused annotation when retrieved at the point of failure.

Upon further testing, an upgrade under the conditions described by the user succeeds, so we can conclude that the failure occurred because, for some reason, their cluster did not have the expected paused annotation. However, the force apply operation was attempting to delete the Cluster object, which could have had unintended effects given that reconciliation was not paused.
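For reference, a minimal sketch of the paused-annotation gate mentioned above; the annotation key and function shape are placeholders, not the actual eks-anywhere webhook implementation:

```go
package webhook

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/validation/field"

	anywherev1 "github.com/aws/eks-anywhere/pkg/api/v1alpha1"
)

// pausedAnnotation is a placeholder key for the reconcile-paused annotation.
const pausedAnnotation = "anywhere.eks.amazonaws.com/paused"

// validateClusterUpdate skips the immutable-field checks when reconciliation
// is paused, which is why the CLI pauses the Cluster before applying changes.
func validateClusterUpdate(oldCluster, newCluster *anywherev1.Cluster) error {
	if _, paused := newCluster.Annotations[pausedAnnotation]; paused {
		return nil // immutable-field validations are skipped while paused
	}

	var allErrs field.ErrorList
	if newCluster.Spec.KubernetesVersion != oldCluster.Spec.KubernetesVersion {
		allErrs = append(allErrs, field.Forbidden(
			field.NewPath("spec", "kubernetesVersion"), "field is immutable"))
	}
	if len(allErrs) > 0 {
		return apierrors.NewInvalid(
			schema.GroupKind{Group: "anywhere.eks.amazonaws.com", Kind: "Cluster"},
			newCluster.Name, allErrs)
	}
	return nil
}
```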

Solution

So, for management clusters the force apply mechanism is completely broken: if it encounters a validation error it can never accomplish its purpose, stalling and eventually timing out with a vague error. For workload clusters, there may be situations where we unexpectedly bypass a valid webhook validation, as described in Case 2 above. Instead of force-applying the Cluster spec resources, we should update the code to do a regular kubectl apply instead of a kubectl apply --force and bubble up any errors that result. This makes it easier to debug and to identify, earlier, validations that should be skipped in the webhooks when cluster reconciliation is paused. Moreover, this is the approach we currently take when validating the Cluster, MachineConfig, and DatacenterConfig objects against updates in their respective webhooks across all providers. So, allowing the error to bubble up should have little or no impact on the current state of things.
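A minimal sketch of the change in spirit; the interface and method names below are hypothetical stand-ins, not the actual code in pkg/clustermanager/cluster_manager.go:

```go
package clustermanager

import "context"

// applier is a hypothetical stand-in for the CLI's cluster client interface.
type applier interface {
	ApplyKubeSpecFromBytes(ctx context.Context, cluster string, data []byte) error
	ApplyKubeSpecFromBytesForce(ctx context.Context, cluster string, data []byte) error
}

// applyEksaResources applies the rendered eks-a resources with a regular
// client-side apply so that webhook rejections bubble up immediately, instead
// of the force path deleting the Cluster object and hanging on its finalizer.
func applyEksaResources(ctx context.Context, c applier, cluster string, resources []byte) error {
	// Before: return c.ApplyKubeSpecFromBytesForce(ctx, cluster, resources)
	return c.ApplyKubeSpecFromBytes(ctx, cluster, resources)
}
```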

For workload clusters, the controller would remove the finalizer when the cluster is managed by the CLI. If reconciliation is paused for the cluster after the deletion, the rest of the reconciliation process is short-circuited. During the upgrade process we pause reconciliation on the Cluster object, so with those things considered there should be no side effects at this point. However, force apply may still be undesirable because it effectively ignores webhook errors that may be valid, instead of handling them properly.

In the event that there is an immutable-field validation that is not currently skipped in the webhook when cluster reconciliation is paused, or is not otherwise handled properly, we would have to address it or fix the webhooks to skip that validation when the Cluster object is paused for reconciliation.

Testing (if applicable):

  • Running a few e2e upgrade tests across the providers.

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot (Collaborator):

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@eks-distro-bot eks-distro-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 6, 2023
codecov bot commented Oct 6, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (f844b7f) 71.78% compared to head (c8b9e55) 71.76%.
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6784      +/-   ##
==========================================
- Coverage   71.78%   71.76%   -0.02%     
==========================================
  Files         531      531              
  Lines       41295    41295              
==========================================
- Hits        29642    29634       -8     
- Misses       9993    10000       +7     
- Partials     1660     1661       +1     
Files Coverage Δ
pkg/clustermanager/cluster_manager.go 71.50% <100.00%> (ø)

... and 2 files with indirect coverage changes


@cxbrowne1207 cxbrowne1207 marked this pull request as ready for review October 6, 2023 23:04
@cxbrowne1207 (Member, Author):

/hold

Running a few more e2e tests locally

@cxbrowne1207 (Member, Author):

/unhold

@cxbrowne1207 (Member, Author):

/approve

@eks-distro-bot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cxbrowne1207


@cxbrowne1207 (Member, Author):

/test eks-anywhere-presubmit

@eks-distro-bot eks-distro-bot merged commit ede3e1c into aws:main Oct 11, 2023
5 checks passed
@cxbrowne1207 cxbrowne1207 deleted the update-apply-resources branch October 11, 2023 20:04
Labels
approved lgtm size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.