
Ensure webhook availability during Konnectivity Agent rolling update #566

Open
dippynark opened this issue Feb 16, 2024 · 9 comments
Labels: lifecycle/frozen (Indicates that an issue or PR should not be auto-closed due to staleness.)

We are running Gatekeeper as a validating webhook on GKE (although I don't think the webhook implementation or cloud provider matters). We have a test where we perform a rolling update of Gatekeeper while continuously making Kubernetes API requests that should be rejected, to verify that requests/connections are drained properly.

However, if we also delete Konnectivity Agent Pods while rolling Gatekeeper (gradually, so that the Konnectivity Agent Pods aren't all down at the same time), or perform a rolling update (kubectl rollout restart deployment -n kube-system konnectivity-agent), then a few requests are allowed through (the ValidatingWebhookConfiguration is configured to fail open).

Is this an issue, or is Konnectivity Agent behaving as expected? I suspect this happens because the long-lived HTTP keepalive connections between the Kubernetes API server and Gatekeeper (via Konnectivity Server and Konnectivity Agent) are broken when a Konnectivity Agent terminates and are not drained properly (since Konnectivity Agent cannot inspect the encrypted requests and disable keepalive before shutting down).

Should the Kubernetes API server be able to detect such TCP disconnects and retry validation after reconnecting?
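For illustration, the probe we run is roughly like the sketch below (this is a simplified stand-in, not our actual test; the namespace, the "denied-by-policy" ConfigMap name and the assumption that the Gatekeeper policy denies it are all hypothetical):

```go
// Repeatedly issue a request that the validating webhook should deny and flag any
// request that is unexpectedly admitted (i.e. the webhook failed open).
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Hypothetical object that the Gatekeeper policy is expected to reject.
	cm := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: "denied-by-policy"}}
	for {
		_, err := client.CoreV1().ConfigMaps("default").Create(context.TODO(), cm, metav1.CreateOptions{})
		switch {
		case err == nil:
			// The webhook failed open: the request should have been denied.
			fmt.Println("UNEXPECTED: request was admitted")
			_ = client.CoreV1().ConfigMaps("default").Delete(context.TODO(), cm.Name, metav1.DeleteOptions{})
		case apierrors.IsForbidden(err):
			// Expected: the validating webhook denied the request.
		default:
			fmt.Printf("transient error: %v\n", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```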

jkh52 (Contributor) commented Feb 16, 2024

Short answer: a proxied TCP connection from the API Server to your Gatekeeper endpoint is not expected to survive a rolling restart of Agent pods.

What is your use case / goal? In GKE, you should be able to use maintenance windows to prevent Agent restarts at sensitive times; note that this should also cover control plane (API Server and Konnectivity Server) restarts / maintenance. Another strategy to reduce Agent churn is to use a dedicated / stable node pool for kube-system pods (especially if you often scale cluster nodes up and down).

dippynark (Author) commented Feb 17, 2024

@jkh52 In general, we try to configure all cluster workloads to gracefully handle being evicted and rescheduled to a different node without any user impact, so that we can change infrastructure configuration without users noticing. Specifically for Konnectivity Agent, we would like to avoid API server requests being unexpectedly rejected (when using fail-closed webhooks) or unexpectedly accepted (when using fail-open webhooks).

The following are the most common scenarios that would cause Konnectivity Agent to be evicted/rescheduled (and could therefore cause webhook request failures):

  • Cluster scale down (as you mentioned)
  • Google reclaiming a compute instance (we are heavy users of spot VMs so this happens regularly)
  • Changing the machine type of a node pool
  • Upgrading a node pool
  • Upgrading Konnectivity Agent

As you said, we can take steps to reduce the chance of Konnectivity Agent being evicted/rescheduled, but we should still expect webhook request failures under normal operation, so I think this issue can be closed unless there is value in keeping it open.

jkh52 (Contributor) commented Feb 17, 2024

Thanks for providing details. Sounds good, I will close this.

jkh52 closed this as completed Feb 17, 2024
jkh52 reopened this Feb 29, 2024
jkh52 (Contributor) commented Feb 29, 2024

Re-opening to track potential improvements around agent shutdown.

There is a fairly wide time interval between an agent's first SIGTERM and final termination (terminationGracePeriodSeconds defaults to 30s). (reference)

The historical decision has been for agents to stay connected to servers during this time and continue routing data packets.

The protocol could be extended to allow a given konnectivity-agent to tell konnectivity-server that it should be considered "draining". New dials should prefer non-draining agents.
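For illustration only, the agent side of such an extension could look roughly like the sketch below. The drain notice is a hypothetical protocol message that does not exist today, and sendDrainNotice / closeConnections are placeholders rather than existing konnectivity-agent functions:

```go
// Sketch: on SIGTERM, tell the servers we are draining (hypothetical), keep routing
// packets for existing connections for most of the grace period, then shut down.
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

const terminationGracePeriod = 30 * time.Second // matches the Pod default

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)

	<-sigCh // kubelet has begun terminating the Pod

	// 1. Mark this agent as draining so servers stop routing *new* dials to it.
	sendDrainNotice()

	// 2. Keep serving existing proxied connections for most of the grace period.
	time.Sleep(terminationGracePeriod - 5*time.Second)

	// 3. Close remaining tunnels and exit before SIGKILL arrives.
	closeConnections()
}

func sendDrainNotice()  { /* hypothetical DRAIN packet to each konnectivity-server */ }
func closeConnections() { /* close gRPC tunnels and proxied TCP connections */ }
```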

dippynark (Author) commented Mar 5, 2024

@jkh52 This would be amazing! An alternative would be to deploy Konnectivity Agent as a static kubelet Pod and dial only the Konnectivity Agent on the Node where the target Pod is running, so that the Pod can drain connections when the Node is drained but before Konnectivity Agent is terminated (I guess similar to what GKE used to do with the SSH daemon). However, I believe this would have scalability issues for large clusters (at least according to this video).

jkh52 self-assigned this Mar 7, 2024
jkh52 (Contributor) commented Mar 7, 2024

@dippynark Note that kube-apiserver re-uses established proxy TCP connections up to a certain TTL period, so if your webhook calls are frequent enough we should expect connection reuse. The proposal is therefore not "airtight", since konnectivity-server agent selection happens only at (initial) dial time.

The konnectivity-client library gives kube-apiserver a net.Conn object, so it is difficult to support "will Close() soon" semantics there.

References: DialContext and conn.go
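To illustrate the boundary (this is not the actual kube-apiserver code): the HTTP transport used for webhook calls only sees a net.Conn returned by whichever dialer it was given, so it pools and reuses connections with no knowledge of the tunnel underneath; dialViaKonnectivity below is a stand-in for the konnectivity-client tunnel dial:

```go
// The transport caches idle connections up to a TTL and has no way to learn that the
// underlying konnectivity tunnel is about to go away.
package main

import (
	"context"
	"net"
	"net/http"
	"time"
)

// Stand-in for the proxied dial performed via the konnectivity-client library.
func dialViaKonnectivity(ctx context.Context, network, addr string) (net.Conn, error) {
	var d net.Dialer
	return d.DialContext(ctx, network, addr) // in reality this would ride the gRPC tunnel
}

func main() {
	transport := &http.Transport{
		DialContext:     dialViaKonnectivity,
		IdleConnTimeout: 90 * time.Second, // idle conns are reused until this TTL
	}
	webhookClient := &http.Client{Transport: transport}
	_ = webhookClient // webhook calls reuse pooled conns; an agent restart mid-reuse just breaks them
}
```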

dippynark (Author) commented

@jkh52 I spoke to Google support and they closed my ticket related to this issue because a Google engineer on GitHub (who I am assuming is you) wanted to continue the discussion here.

Google support said there were workarounds for this issue, which I guess refers to your previous comment. I wouldn't really say they are workarounds, since they just reduce the chance of the issue occurring; however, I'm happy for this ticket to be closed, as I agree that supporting the suggested protocol improvements would be complex, and even implementing retries at the API server sounds tricky due to the multiple TCP hops from the API server to the webhook Pod(s).

We will likely also do the following to reduce the impact of this issue:

  • Fail closed when cluster stability wouldn't be impacted (see the failurePolicy sketch below)
  • Implement controllers to apply corrective controls for any invalid changes that get through fail open webhooks
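For reference, a minimal sketch of the two failure policies using the admissionregistration/v1 Go types (all names below are hypothetical and this is not our actual configuration):

```go
package main

import (
	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// webhookConfig returns a ValidatingWebhookConfiguration that either fails closed
// (unreachable webhook rejects requests) or fails open (unreachable webhook admits them).
func webhookConfig(failClosed bool) *admissionregistrationv1.ValidatingWebhookConfiguration {
	policy := admissionregistrationv1.Ignore // fail open
	if failClosed {
		policy = admissionregistrationv1.Fail // fail closed
	}
	sideEffects := admissionregistrationv1.SideEffectClassNone
	path := "/v1/admit" // hypothetical service path
	return &admissionregistrationv1.ValidatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "example-policy"},
		Webhooks: []admissionregistrationv1.ValidatingWebhook{{
			Name:                    "validation.example.com",
			FailurePolicy:           &policy,
			SideEffects:             &sideEffects,
			AdmissionReviewVersions: []string{"v1"},
			ClientConfig: admissionregistrationv1.WebhookClientConfig{
				Service: &admissionregistrationv1.ServiceReference{
					Namespace: "gatekeeper-system",
					Name:      "gatekeeper-webhook-service",
					Path:      &path,
				},
			},
		}},
	}
}

func main() { _ = webhookConfig(true) }
```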

jkh52 (Contributor) commented Mar 8, 2024

Your plan sounds good, especially in the near term, since OSS improvements will take a while to roll out to GKE (they will likely be introduced in a minor version, e.g. 1.30 or 1.31). I will still take a look at this low-hanging-fruit idea.

k8s-triage-robot commented

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Jun 6, 2024
jkh52 added the lifecycle/frozen label (Indicates that an issue or PR should not be auto-closed due to staleness.) and removed the lifecycle/stale label Jun 6, 2024