[Roadmap] Improve kubeadm support for declarative approaches/git-ops #2317

Closed
fabriziopandini opened this issue Sep 30, 2020 · 14 comments
Labels
kind/design Categorizes issue or PR as related to design. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence.

@fabriziopandini
Member

Kubeadm, being a CLI, does not play well with declarative approaches/git-ops workflows.

Assuming that kubeadm is divided into two main parts:

  1. bootstrapping a node (transforming a machine into a node: init, join)
  2. managing an existing node (e.g. upgrades, renew certs, changing a node)

This issue is about collecting ideas and defining a viable path for making 2 possible using declarative approaches, sometimes also referred to as in-place mutations.

For this first iteration, I consider 1 out of scope, mainly because bootstrapping nodes with a declarative approach is already covered by Cluster API and it is clearly out of the scope of kubeadm.
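
To make the idea of declarative management of an existing node concrete, here is a minimal sketch of the usual pattern: compare a declared desired state with the observed state and derive the in-place mutations that close the gap. The `DesiredNode` and `ObservedNode` types, their fields, and the action strings are hypothetical names chosen for illustration. They are not part of kubeadm or of any proposal in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DesiredNode:
    """Hypothetical declared state for one node (e.g. kept in git)."""
    kubernetes_version: str
    min_cert_days_left: int


@dataclass(frozen=True)
class ObservedNode:
    """Hypothetical snapshot of what is currently on the node."""
    kubernetes_version: str
    cert_days_left: int


def plan_actions(desired: DesiredNode, observed: ObservedNode) -> List[str]:
    """Return the in-place mutations needed to reach the desired state."""
    actions = []
    if observed.kubernetes_version != desired.kubernetes_version:
        actions.append(f"upgrade to {desired.kubernetes_version}")
    if observed.cert_days_left < desired.min_cert_days_left:
        actions.append("renew certificates")
    return actions
```

An empty result means the node already matches the declaration, which is what lets the same check run repeatedly in a git-ops loop.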

@fabriziopandini
Member Author

Prior discussion from #1698

@timothysc

As a Kubernetes Operator I would like to be able to declaratively control configuration changes and upgrades in a systematic fashion.

@fabriziopandini

IMO the kubeadm operator should be responsible for two things

  • In place mutations of kubeadm generated artifacts
  • Orchestration of such mutations across nodes
    Instead, I think that we should consider out of scope everything that falls under the management of infrastructure or is related to the management of "immutable" nodes (where "immutable" = any operation done by deploying a new node and removing the old one).
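
The second responsibility above, orchestrating a mutation across nodes, could look roughly like the sketch below. It assumes (this is not stated in the thread) that control-plane nodes are handled before workers, that nodes are processed one at a time, and that the rollout stops at the first failure. The `mutate` callback stands in for whatever in-place mutation is being applied and is hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical node description: (name, is_control_plane)
Node = Tuple[str, bool]


def rollout_order(nodes: Iterable[Node]) -> List[Node]:
    """Sort control-plane nodes first, then workers, each group by name."""
    return sorted(nodes, key=lambda node: (not node[1], node[0]))


def orchestrate(nodes: Iterable[Node], mutate: Callable[[str], bool]) -> List[str]:
    """Apply `mutate` node by node and stop at the first failure.

    Returns the names of the nodes that were mutated successfully.
    """
    done = []
    for name, _is_control_plane in rollout_order(nodes):
        if not mutate(name):
            break  # leave the remaining nodes untouched
        done.append(name)
    return done
```

Stopping at the first failure keeps the remaining nodes in their previous state, so a bad change does not spread to the whole cluster.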

@neolit123

my other top question, this can end up being not-so-secure.

@fabriziopandini
Member Author

For the kubeadm operator, I think we should focus on the first use case, "declaratively control configuration changes", given that this is not supported by kubeadm today and it was a top priority in the recent survey.

@neolit123 neolit123 added this to the v1.21 milestone Nov 7, 2020
@neolit123 neolit123 added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. kind/design Categorizes issue or PR as related to design. labels Nov 7, 2020
@neolit123 neolit123 modified the milestones: v1.21, Next Feb 3, 2021
@fabriziopandini
Member Author

@jhughes2112

Interesting proposal. What I've struggled with over the past year using kubeadm is specifically what I addressed with my own scripts that (procedurally) build clusters (on GitHub: k8smaker). The best practices for configuring a bare metal cluster are pretty complex, and the same is true on AWS. The preconditioning script depends strongly on the underlying OS involved. But having built it modularly, I can see how the cluster construction (init, join) can be made completely extensible while providing a fully declarative interface to the user.

It requires:

  • ssh credentials to access any new node with sudo privileges (from wherever kubeadm or the proposed operator executes)
  • a preconditioning script that configures the OS from a clean state
  • a decommissioning script that resets the OS to an unused state
  • a configuration CRD that describes the nodes that should be part of the cluster
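
The four requirements above can be read as a membership reconciliation: nodes that are declared but not yet in the cluster get preconditioned and joined, and nodes that are in the cluster but no longer declared get decommissioned. A minimal sketch, assuming the declared and current members are available as plain sets of node names (the function name and the result keys are hypothetical):

```python
from typing import Dict, List, Set


def reconcile_membership(desired: Set[str], current: Set[str]) -> Dict[str, List[str]]:
    """Compare declared nodes with current members and list what to do."""
    return {
        # Declared but not yet members: run the preconditioning script, then join.
        "precondition_and_join": sorted(desired - current),
        # Members that are no longer declared: run the decommissioning script.
        "decommission": sorted(current - desired),
    }
```

For example, `reconcile_membership({"a", "b"}, {"b", "c"})` asks for `a` to be preconditioned and joined and for `c` to be decommissioned, while `b` is left alone.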

I offer an opinion: I realize this was specifically stated as out of scope for this proposal, but I'm suggesting it should be the focus instead of day 2 operations. It seems like a lot of k8s admins have a procedure where upgrading an existing production cluster tends to be (much) more dangerous than building a new one. Automating more of the upgrades adds a "magicalness" to that process, which results in inevitable breakage being more severe rather than less. Whereas automating the construction process drives towards a very desirable workflow for automating and simplifying the process: simply remove nodes from an existing production cluster description and add them to the new cluster.

Thanks for the consideration.

@neolit123
Member

neolit123 commented Mar 1, 2021

It seems like a lot of k8s admins have a procedure where upgrading an existing production cluster tends to be (much) more dangerous than building a new one.

i think no matter what we do with kubernetes upgrades we will not be able to fully guarantee zero failures to the users, unless this is fully managed by some high level tooling that understands everything that the user has and wants - including node host details, infrastructure availability and all caveats of the current and next k8s version.

kubeadm or the operator can encode some details about the next k8s version or the node host, but that's all.

the so-called "blue / green" cluster upgrades may seem like the better option in the eyes of the user, since the user has the control to scrap the old cluster only once the new cluster is fully working. but they also require infrastructure that some users on self-hosted bare metal simply don't have.

Whereas automating the construction process drives towards a very desirable workflow for automating and simplifying the process: simply remove nodes from an existing production cluster description and add them to the new cluster.

we call these node re-place upgrades and the Cluster API is doing them. your project may have recreated parts of Cluster API, kubespray or kops, which are tools that are higher level than kubeadm.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 30, 2021
@neolit123
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2021
@neolit123
Member

/remove-lifecycle stale

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 29, 2021
@neolit123 neolit123 added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Sep 29, 2021
@fabriziopandini
Member Author

/remove-lifecycle stale

@neolit123
Member

xref related discussion about cert rotation #2652

@pacoxu
Member

pacoxu commented Jun 20, 2022

Not sure if this is the right place to discuss the kubeadm operator. There are some threads in kubernetes/enhancements#2505.

I wrote a simple kubelet-reloader as a tool for the kubeadm operator.

  • kubelet-reloader watches /usr/bin/kubelet-new.
  • once a different version of kubelet-new appears, the reloader replaces /usr/bin/kubelet with it and restarts kubelet.
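
The reloader behaviour described above can be sketched as a single check that a watch loop would call repeatedly. This is not the actual kubelet-reloader code. It assumes that "a different version" can be approximated by comparing file contents, and the function names and the `restart` callback are hypothetical.

```python
import hashlib
import os
import shutil
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def reload_if_changed(staged: Path, live: Path, restart: Callable[[], None]) -> bool:
    """Replace `live` with `staged` and call `restart` if their contents differ.

    Returns True when a replacement happened, False otherwise.
    """
    if not staged.exists():
        return False
    if live.exists() and file_digest(staged) == file_digest(live):
        return False  # same binary, nothing to do
    tmp = live.with_name(live.name + ".tmp")
    shutil.copy2(staged, tmp)  # copy contents and permission bits
    os.replace(tmp, live)      # swap into place in one rename step
    restart()
    return True
```

On a real node the paths would be /usr/bin/kubelet-new and /usr/bin/kubelet, and `restart` could wrap something like `systemctl restart kubelet`. Passing the restart step in as a callback keeps the sketch testable without touching a running kubelet.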

Currently the kubeadm-operator v0.1.0 supports upgrades across versions, such as v1.22 to v1.24.

  • kubeadm operator will download kubectl/kubelet/kubeadm and upgrade.
  • kubelet will be placed in /usr/bin/kubelet-new for kubelet reloader.
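
The comment does not say how the operator gets from v1.22 to v1.24. Since kubeadm upgrades are documented to move one minor version at a time, one plausible approach is to plan the intermediate minor versions and upgrade through each in turn. The sketch below only computes that plan. The function names are hypothetical and patch versions are deliberately ignored.

```python
from typing import List, Tuple


def parse_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version such as 'v1.22' or 'v1.22.3'."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def minor_hops(current: str, target: str) -> List[str]:
    """List the minor versions to pass through, ending at the target minor."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError(f"cannot plan an upgrade from {current} to {target}")
    return [f"v{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]
```

Under these assumptions, `minor_hops("v1.22", "v1.24")` yields `["v1.23", "v1.24"]`, so the upgrade is performed twice rather than in one jump.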

See quick-start.

Some thoughts on the next steps

@neolit123 neolit123 added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Nov 8, 2023
@neolit123
Member

  1. bootstrapping a node (transforming a machine into a node: init, join)
  2. managing an existing node (e.g. upgrades, renew certs, changing a node)

for 2 we decided that this should be part of a kubeadm operator, but at the same time we decided to externalize it and not make SIG CL own the project. this means that there isn't anything actionable.

and yes for 1, tools that wrap kubeadm can create the declarative layer (e.g. like CAPI does) to define the topology of how many nodes etc.

7 participants