
Ensure webhook availability during Konnectivity Agent rolling update #566

Open
dippynark opened this issue Feb 16, 2024 · 9 comments
Labels: lifecycle/frozen (Indicates that an issue or PR should not be auto-closed due to staleness.)

We are running Gatekeeper as a validating webhook on GKE (although I don't think the webhook implementation or cloud provider matters). We have a test where we perform a rolling update of Gatekeeper while continuously making Kubernetes API requests that should be rejected, to verify that requests/connections are drained properly.

However, if we also delete Konnectivity Agent Pods while rolling Gatekeeper (gradually, so that the Konnectivity Agent Pods aren't all down at the same time), or perform a rolling update (kubectl rollout restart deployment -n kube-system konnectivity-agent), then a few requests are allowed through (the ValidatingWebhookConfiguration is configured to fail open).

Is this an issue, or is Konnectivity Agent behaving as expected? I suspect this happens because the long-lived HTTP keepalive connections between the Kubernetes API server and Gatekeeper (via Konnectivity Server and Konnectivity Agent) are broken when a Konnectivity Agent terminates and are not drained properly (since Konnectivity Agent cannot inspect the encrypted requests and disable keepalive before shutting down).

Should the Kubernetes API server be able to detect such TCP disconnects and retry validation after reconnecting?
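For illustration, the probe we run is roughly like the sketch below (this is a simplified stand-in, not our actual test; the namespace, the "denied-by-policy" ConfigMap name and the assumption that the Gatekeeper policy denies it are all hypothetical):

```go
// Repeatedly issue a request that the validating webhook should deny and flag any
// request that is unexpectedly admitted (i.e. the webhook failed open).
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Hypothetical object that the Gatekeeper policy is expected to reject.
	cm := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: "denied-by-policy"}}
	for {
		_, err := client.CoreV1().ConfigMaps("default").Create(context.TODO(), cm, metav1.CreateOptions{})
		switch {
		case err == nil:
			// The webhook failed open: the request should have been denied.
			fmt.Println("UNEXPECTED: request was admitted")
			_ = client.CoreV1().ConfigMaps("default").Delete(context.TODO(), cm.Name, metav1.DeleteOptions{})
		case apierrors.IsForbidden(err):
			// Expected: the validating webhook denied the request.
		default:
			fmt.Printf("transient error: %v\n", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```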

jkh52 (Contributor) commented Feb 16, 2024

Short answer: a proxied TCP connection from the API Server to your Gatekeeper endpoint is not expected to survive a rolling restart of Agent pods.

What is your use case / goal? In GKE, you should be able to use maintenance windows to prevent Agent restarts at sensitive times; note that this should also cover control plane (API Server and Konnectivity Server) restarts / maintenance. Another strategy to reduce Agent churn is to use a dedicated / stable node pool for kube-system pods (especially if you often scale cluster nodes up and down).

dippynark (Author) commented Feb 17, 2024

@jkh52 In general, we try to configure all cluster workloads to gracefully handle being evicted and rescheduled to a different node without any user impact, so that we can change infrastructure configuration without users noticing. Specifically for Konnectivity Agent, we would like to avoid API server requests being unexpectedly rejected (when using fail-closed webhooks) or unexpectedly accepted (when using fail-open webhooks).

The following are the most common scenarios that would cause Konnectivity Agent to be evicted/rescheduled (and could therefore cause webhook request failures):

  • Cluster scale down (as you mentioned)
  • Google reclaiming a compute instance (we are heavy users of spot VMs so this happens regularly)
  • Changing the machine type of a node pool
  • Upgrading a node pool
  • Upgrading Konnectivity Agent

As you said, we can take steps to reduce the chance of Konnectivity Agent being evicted/rescheduled, but we should still expect webhook request failures under normal operation, so I think this issue can be closed unless there is value in keeping it open.

jkh52 (Contributor) commented Feb 17, 2024

Thanks for providing details. Sounds good, I will close this.

jkh52 closed this as completed Feb 17, 2024
jkh52 reopened this Feb 29, 2024
jkh52 (Contributor) commented Feb 29, 2024

Re-opening to track potential improvements around agent shutdown.

There is a fairly wide time interval between an agent's first SIGTERM and final termination (terminationGracePeriodSeconds defaults to 30s). (reference)

The historical decision has been for agents to stay connected to servers during this time and continue routing data packets.

The protocol could be extended to allow a given konnectivity-agent to tell konnectivity-server that it should be considered "draining". New dials should prefer non-draining agents.
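For illustration only, the agent side of such an extension could look roughly like the sketch below. The drain notice is a hypothetical protocol message that does not exist today, and sendDrainNotice / closeConnections are placeholders rather than existing konnectivity-agent functions:

```go
// Sketch: on SIGTERM, tell the servers we are draining (hypothetical), keep routing
// packets for existing connections for most of the grace period, then shut down.
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

const terminationGracePeriod = 30 * time.Second // matches the Pod default

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)

	<-sigCh // kubelet has begun terminating the Pod

	// 1. Mark this agent as draining so servers stop routing *new* dials to it.
	sendDrainNotice()

	// 2. Keep serving existing proxied connections for most of the grace period.
	time.Sleep(terminationGracePeriod - 5*time.Second)

	// 3. Close remaining tunnels and exit before SIGKILL arrives.
	closeConnections()
}

func sendDrainNotice()  { /* hypothetical DRAIN packet to each konnectivity-server */ }
func closeConnections() { /* close gRPC tunnels and proxied TCP connections */ }
```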

dippynark (Author) commented Mar 5, 2024

@jkh52 This would be amazing! An alternative would be to deploy Konnectivity Agent as a static kubelet Pod and dial only the Konnectivity Agent on the Node where the target Pod is running, so that the Pod can drain connections when the Node is drained but before Konnectivity Agent is terminated (I guess similar to what GKE used to do with the SSH daemon). However, I believe this would have scalability issues for large clusters (at least according to this video).

jkh52 self-assigned this Mar 7, 2024
jkh52 (Contributor) commented Mar 7, 2024

@dippynark Note that kube-apiserver re-uses established proxy TCP connections up to a certain TTL period, so if your webhook calls are frequent enough we should expect connection reuse. The proposal is therefore not "airtight", since konnectivity-server agent selection happens only at (initial) dial time.

The konnectivity-client library gives kube-apiserver a net.Conn object, so it is difficult to support "will Close() soon" semantics there.

References: DialContext and conn.go
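To illustrate the boundary (this is not the actual kube-apiserver code): the HTTP transport used for webhook calls only sees a net.Conn returned by whichever dialer it was given, so it pools and reuses connections with no knowledge of the tunnel underneath; dialViaKonnectivity below is a stand-in for the konnectivity-client tunnel dial:

```go
// The transport caches idle connections up to a TTL and has no way to learn that the
// underlying konnectivity tunnel is about to go away.
package main

import (
	"context"
	"net"
	"net/http"
	"time"
)

// Stand-in for the proxied dial performed via the konnectivity-client library.
func dialViaKonnectivity(ctx context.Context, network, addr string) (net.Conn, error) {
	var d net.Dialer
	return d.DialContext(ctx, network, addr) // in reality this would ride the gRPC tunnel
}

func main() {
	transport := &http.Transport{
		DialContext:     dialViaKonnectivity,
		IdleConnTimeout: 90 * time.Second, // idle conns are reused until this TTL
	}
	webhookClient := &http.Client{Transport: transport}
	_ = webhookClient // webhook calls reuse pooled conns; an agent restart mid-reuse just breaks them
}
```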

dippynark (Author) commented

@jkh52 I spoke to Google support and they closed my ticket related to this issue because a Google engineer on GitHub (who I am assuming is you) wanted to continue the discussion here.

Google support said there were workarounds for this issue, which I guess refers to your previous comment. I wouldn't really say they are workarounds, since they just reduce the chance of the issue occurring; however, I'm happy for this ticket to be closed, as I agree that supporting the suggested protocol improvements would be complex, and even implementing retries at the API server sounds tricky due to the multiple TCP hops from the API server to the webhook Pod(s).

We will likely also do the following to reduce the impact of this issue:

  • Fail closed when cluster stability wouldn't be impacted (see the failurePolicy sketch below)
  • Implement controllers to apply corrective controls for any invalid changes that get through fail open webhooks
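For reference, a minimal sketch of the two failure policies using the admissionregistration/v1 Go types (all names below are hypothetical and this is not our actual configuration):

```go
package main

import (
	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// webhookConfig returns a ValidatingWebhookConfiguration that either fails closed
// (unreachable webhook rejects requests) or fails open (unreachable webhook admits them).
func webhookConfig(failClosed bool) *admissionregistrationv1.ValidatingWebhookConfiguration {
	policy := admissionregistrationv1.Ignore // fail open
	if failClosed {
		policy = admissionregistrationv1.Fail // fail closed
	}
	sideEffects := admissionregistrationv1.SideEffectClassNone
	path := "/v1/admit" // hypothetical service path
	return &admissionregistrationv1.ValidatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "example-policy"},
		Webhooks: []admissionregistrationv1.ValidatingWebhook{{
			Name:                    "validation.example.com",
			FailurePolicy:           &policy,
			SideEffects:             &sideEffects,
			AdmissionReviewVersions: []string{"v1"},
			ClientConfig: admissionregistrationv1.WebhookClientConfig{
				Service: &admissionregistrationv1.ServiceReference{
					Namespace: "gatekeeper-system",
					Name:      "gatekeeper-webhook-service",
					Path:      &path,
				},
			},
		}},
	}
}

func main() { _ = webhookConfig(true) }
```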

jkh52 (Contributor) commented Mar 8, 2024

Your plan sounds good, especially in the near term, since OSS improvements will take a while to roll out to GKE (they will likely be introduced in a minor version, e.g. 1.30 or 1.31). I will still take a look at this low-hanging-fruit idea.

k8s-triage-robot commented

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Jun 6, 2024
jkh52 added the lifecycle/frozen label (Indicates that an issue or PR should not be auto-closed due to staleness.) and removed the lifecycle/stale label Jun 6, 2024