Author: Derek Carr (@derekwaynecarr)
Status: Proposed
Many Linux distributions have either adopted, or plan to adopt systemd
as their init system.
This document describes how the node should be configured, and a set of enhancements that should
be made to the kubelet
to better integrate with these distributions independent of container
runtime.
This proposal does not account for running the kubelet
in a container.
To help understand this proposal, we first provide a brief summary of systemd
behavior.
systemd manages a hierarchy of slice, scope, and service units.
- service - application on the server that is launched by systemd; how it should start/stop; when it should be started; under what circumstances it should be restarted; and any resource controls that should be applied to it.
- scope - a process or group of processes which are not launched by systemd (i.e. fork); like a service, resource controls may be applied.
- slice - organizes a hierarchy in which scope and service units are placed. A slice may contain slice, scope, or service units; processes are attached to service and scope units only, not to slices. The hierarchy is intended to be unified, meaning a process may only belong to a single leaf node.
Classical cgroup
hierarchies were split per resource group controller, and a process could
exist in different parts of the hierarchy.
For example, a process p1
could exist in each of the following at the same time:
/sys/fs/cgroup/cpu/important/
/sys/fs/cgroup/memory/unimportant/
/sys/fs/cgroup/cpuacct/unimportant/
In addition, controllers for one resource group could depend on another in ways that were not always obvious.
For example, the cpu
controller depends on the cpuacct
controller yet they were treated
separately.
Many found it confusing for a single process to belong to different nodes in the cgroup
hierarchy
across controllers.
The Kernel direction for cgroup
support is to move toward a unified cgroup
hierarchy, where the
per-controller hierarchies are eliminated in favor of hierarchies like the following:
/sys/fs/cgroup/important/
/sys/fs/cgroup/unimportant/
In a unified hierarchy, a process may only belong to a single node in the cgroup
tree.
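To make the difference between the two layouts concrete, the sketch below parses text in the format of /proc/<pid>/cgroup, where each line reads hierarchy-id:controller-list:path. The sample contents are hypothetical and mirror the p1 example above; the function name is illustrative only.

```python
def parse_proc_cgroup(text):
    """Map each controller to the cgroup path listed for it.

    Each line has the form "hierarchy-id:controller-list:path". On a unified
    hierarchy the controller list is empty, so the entry is keyed as "".
    """
    membership = {}
    for line in text.strip().splitlines():
        _hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            membership[controller] = path
    return membership


# Hypothetical contents for the process p1 on a split (per-controller) layout.
SPLIT = """\
4:cpu:/important
3:memory:/unimportant
2:cpuacct:/unimportant
"""

# Hypothetical contents for the same process on a unified layout.
UNIFIED = "0::/important\n"

split_view = parse_proc_cgroup(SPLIT)
unified_view = parse_proc_cgroup(UNIFIED)

print(split_view)    # one process, several different paths
print(unified_view)  # one process, exactly one path
```

In the split layout the same process resolves to more than one path depending on the controller, which is the source of confusion described above; in the unified layout there is a single answer.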
The Kernel direction for cgroup
management is to promote a single-writer model rather than
allowing multiple processes to independently write to parts of the file-system.
In distributions that run systemd
as their init system, the cgroup tree is managed by systemd
by default since it implicitly interacts with the cgroup tree when starting units. Manual changes
made by other cgroup managers to the cgroup tree are not guaranteed to be preserved unless systemd
is made aware. systemd
can be told to ignore sections of the cgroup tree by configuring the unit
to have the Delegate=
option.
See: http://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=
A slice
corresponds to an inner-node in the cgroup
file-system hierarchy.
For example, the system.slice
is represented as follows:
/sys/fs/cgroup/<controller>/system.slice
A slice
is nested in the hierarchy by its naming convention.
For example, the system-foo.slice
is represented as follows:
/sys/fs/cgroup/<controller>/system.slice/system-foo.slice/
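The naming convention can be sketched as a small function that expands a slice name into its path relative to /sys/fs/cgroup/<controller>/. This is a minimal illustration, not systemd's implementation: the function name is hypothetical, and unit-name escaping is deliberately not handled.

```python
def slice_to_cgroup_path(slice_name):
    """Expand a slice unit name into its nested cgroup path.

    Every dash-separated prefix of the name is itself a parent slice, so
    "system-foo.slice" lives inside "system.slice".
    """
    suffix = ".slice"
    if not slice_name.endswith(suffix):
        raise ValueError(f"not a slice unit name: {slice_name!r}")
    if slice_name == "-.slice":
        # The root slice maps to the root of the hierarchy.
        return ""
    parts = slice_name[: -len(suffix)].split("-")
    ancestors = ["-".join(parts[:i]) + suffix for i in range(1, len(parts) + 1)]
    return "/".join(ancestors)


print(slice_to_cgroup_path("system.slice"))
print(slice_to_cgroup_path("system-foo.slice"))
```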
A service
or scope
corresponds to leaf nodes in the cgroup
file-system hierarchy managed by
systemd
. Services and scopes can have child nodes managed outside of systemd
if they have been
delegated with the Delegate=
option.
For example, if the docker.service
is associated with the system.slice
, it is
represented as follows:
/sys/fs/cgroup/<controller>/system.slice/docker.service/
To demonstrate the use of scope
units using the docker
container runtime, if a
user launches a container via docker run -m 100M busybox
, a scope
will be created
because the process was not launched by systemd
itself. The scope
is parented by
the slice
associated with the launching daemon.
For example:
/sys/fs/cgroup/<controller>/system.slice/docker-<container-id>.scope
systemd
defines a set of slices. By default, service and scope units are placed in
system.slice
, virtual machines and containers registered with systemd-machined
are
found in machine.slice
, and user sessions handled by systemd-logind
in user.slice
.
The kubelet
reads and writes to the cgroup
tree during bootstrapping
of the node. In the future, it will write to the cgroup
tree to satisfy other
purposes around quality of service, etc.
The kubelet
must cooperate with systemd
in order to ensure proper function of the
system. The bootstrapping requirements for a systemd
system are different than one
without it.
The kubelet
will accept a new flag to control how it interacts with the cgroup
tree.
- --cgroup-driver= - cgroup driver used by the kubelet: cgroupfs or systemd.
The kubelet should default --cgroup-driver to systemd on systemd distributions.
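The proposal does not say how the kubelet would recognize a systemd distribution. One possible approach, sketched here as an assumption, mirrors the check documented for sd_booted(3): the host was booted with systemd if the directory /run/systemd/system exists. The function name and its parameter are hypothetical.

```python
import os


def default_cgroup_driver(systemd_runtime_dir="/run/systemd/system"):
    """Pick a default cgroup driver based on whether systemd booted the host.

    Assumption: presence of the systemd runtime directory is treated as the
    signal, as sd_booted(3) documents. The path is a parameter so the check
    can be exercised without a systemd host.
    """
    if os.path.isdir(systemd_runtime_dir):
        return "systemd"
    return "cgroupfs"


print(default_cgroup_driver())
```

An explicit --cgroup-driver value from the operator would still take precedence over whatever this kind of detection returns.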
The kubelet
should associate node bootstrapping semantics to the configured
cgroup driver
.
The proposal makes no changes to the definition as presented here: https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/node-allocatable.md
The node will report a set of allocatable compute resources defined as follows:
[Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]
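As a worked example of the formula, the sketch below subtracts both reservations from capacity for each resource. The resource names, units, and numbers are made up for illustration.

```python
def allocatable(capacity, kube_reserved, system_reserved):
    """Compute [Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved].

    A resource missing from a reservation is treated as reserving nothing.
    """
    return {
        name: total - kube_reserved.get(name, 0) - system_reserved.get(name, 0)
        for name, total in capacity.items()
    }


# Hypothetical node: 4 CPUs (in millicores) and 8 GiB of memory (in MiB).
capacity = {"cpu_millicores": 4000, "memory_mib": 8192}
kube_reserved = {"cpu_millicores": 200, "memory_mib": 512}
system_reserved = {"cpu_millicores": 300, "memory_mib": 1024}

result = allocatable(capacity, kube_reserved, system_reserved)
print(result)  # {'cpu_millicores': 3500, 'memory_mib': 6656}
```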
The kubelet
will continue to interface with cAdvisor
to determine node capacity.
The node may set aside a set of designated resources for non-Kubernetes components.
The kubelet
accepts the following flags that support this feature:
- --system-reserved= - A set of ResourceName=ResourceQuantity pairs that describe resources reserved for host daemons.
- --system-container= - Optional resource-only container in which to place all non-kernel processes that are not already in a container. Empty for no container. Rolling back the flag requires a reboot. (Default: "").
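A value for --system-reserved might be parsed along the following lines. This is a sketch under assumptions: the proposal does not state the pair separator, so a comma is assumed here, and quantities are kept as opaque strings rather than interpreted. The function name is hypothetical.

```python
def parse_reserved(flag_value):
    """Parse "ResourceName=ResourceQuantity" pairs into a dict.

    Assumption: pairs are comma-separated. Quantities such as "500m" or
    "1Gi" are returned as strings without validation of their units.
    """
    reserved = {}
    if not flag_value:
        return reserved
    for pair in flag_value.split(","):
        name, sep, quantity = pair.partition("=")
        name, quantity = name.strip(), quantity.strip()
        if not sep or not name or not quantity:
            raise ValueError(f"expected ResourceName=ResourceQuantity, got {pair!r}")
        reserved[name] = quantity
    return reserved


print(parse_reserved("cpu=500m,memory=1Gi"))
```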
The current meaning of system-container
is inadequate on systemd
environments.
The kubelet
should use the flag to know the location that has the processes that
are associated with system-reserved
, but it should not modify the cgroups of
existing processes on the system during bootstrapping of the node. This is
because systemd
is the cgroup manager
on the host and it has not delegated
authority to the kubelet
to change how it manages units
.
The following describes the type of things that can happen if this does not change: https://bugzilla.redhat.com/show_bug.cgi?id=1202859
As a result, the kubelet
needs to distinguish placement of non-kernel processes
based on the cgroup driver, and only do its current behavior when not on systemd
.
The flag should be modified as follows:
--system-container=
- Name of resource-only container that holds all non-kernel processes whose resource consumption is accounted under system-reserved. The default value is cgroup driver specific. systemd defaults to system, cgroupfs defines no default. Rolling back the flag requires a reboot.
The kubelet
will error if the defined --system-container
does not exist
on systemd
environments. It will verify that the appropriate cpu
and memory
controllers are enabled.
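One way to picture the controller check is to read the table exposed in /proc/cgroups, whose columns are subsys_name, hierarchy, num_cgroups, and enabled. The sketch below works on sample text rather than the live file, and the helper names are hypothetical; the proposal does not specify how the kubelet performs this verification.

```python
def enabled_controllers(proc_cgroups_text):
    """Return the set of controllers whose "enabled" column is 1."""
    enabled = set()
    for line in proc_cgroups_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        name, _hierarchy, _num_cgroups, flag = line.split()
        if flag == "1":
            enabled.add(name)
    return enabled


def missing_controllers(proc_cgroups_text, required=("cpu", "memory")):
    """List required controllers that are not enabled, in sorted order."""
    return sorted(set(required) - enabled_controllers(proc_cgroups_text))


# Hypothetical contents in which the memory controller is disabled.
SAMPLE = (
    "#subsys_name\thierarchy\tnum_cgroups\tenabled\n"
    "cpuset\t3\t1\t1\n"
    "cpu\t2\t40\t1\n"
    "cpuacct\t2\t40\t1\n"
    "memory\t0\t1\t0\n"
)

print(missing_controllers(SAMPLE))  # ['memory']
```

A non-empty result would correspond to the error case described above.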
The node may set aside a set of resources for Kubernetes components:
- --kube-reserved= - A set of ResourceName=ResourceQuantity pairs that describe resources reserved for Kubernetes components.
The kubelet
does not enforce --kube-reserved
at this time, but the ability
to distinguish the static reservation from observed usage is important for node accounting.
This proposal asserts that kubernetes.slice
is the default slice associated with
the kubelet
and kube-proxy
service units defined in the project. Keeping it
separate from system.slice
allows for accounting to be distinguished separately.
The kubelet
will detect its cgroup
to track kube-reserved
observed usage on systemd
.
If the kubelet
detects that it is a child of the system-container
based on the observed
cgroup
hierarchy, it will warn.
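The warning condition amounts to a path-ancestry test on the observed cgroup hierarchy. The sketch below shows that test in isolation; the function name and the example paths are hypothetical, and how the kubelet obtains its own cgroup path is outside the scope of the sketch.

```python
import posixpath


def is_within(cgroup_path, container_path):
    """Return True if cgroup_path equals container_path or lies below it.

    Comparison is done per path component, so "/system.slice-other" is not
    considered to be inside "/system.slice".
    """
    cgroup = posixpath.normpath("/" + cgroup_path.strip("/"))
    container = posixpath.normpath("/" + container_path.strip("/"))
    return cgroup == container or cgroup.startswith(container.rstrip("/") + "/")


kubelet_cgroup = "/system.slice/kubelet.service"  # hypothetical observed value
if is_within(kubelet_cgroup, "/system.slice"):
    print("warning: kubelet is running inside the system container")
```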
If the kubelet
is launched directly from a terminal, its most likely destination will
be in a scope
that is a child of user.slice
as follows:
/sys/fs/cgroup/<controller>/user.slice/user-1000.slice/session-1.scope
In this context, the parent scope
is what will be used to facilitate local developer
debugging scenarios for tracking kube-reserved
usage.
The kubelet
has the following flag:
--resource-container="/kubelet":
Absolute name of the resource-only container to create and run the Kubelet in (Default: /kubelet).
This flag will not be supported on systemd
environments since the init system has already
spawned the process and placed it in the corresponding container associated with its unit.
This proposal asserts that the reservation of compute resources for any associated
container runtime daemons is tracked by the operator under the system-reserved or
kube-reserved values, and that any enforced limits are set by the
operator specific to the container runtime.
Docker
If the kubelet
is configured with the container-runtime
set to docker
, the
kubelet
will detect the cgroup
associated with the docker
daemon and use that
to do local node accounting. If an operator wants to impose runtime limits on the
docker
daemon to control resource usage, the operator should set those explicitly in
the service
unit that launches docker
. The kubelet
will not set any limits itself
at this time and will assume whatever budget was set aside for docker
was included in
either --kube-reserved
or --system-reserved
reservations.
Many OS distributions package docker by default, and it will often belong to the
system.slice hierarchy. Operators will therefore need to budget for it there
by default unless they explicitly move it.
rkt
rkt has no client/server daemon, and therefore has no explicit requirements on container-runtime reservation.
The kubelet
does not enforce the system-reserved
or kube-reserved
values by default.
The kubelet
should support an additional flag to turn on enforcement:
- --system-reserved-enforce=false - Optional flag that if true tells the kubelet to enforce the system-reserved constraints defined (if any)
- --kube-reserved-enforce=false - Optional flag that if true tells the kubelet to enforce the kube-reserved constraints defined (if any)
Usage of these flags requires that end-user containers are launched in a separate part
of the cgroup hierarchy via cgroup-root.
If either flag is enabled, the kubelet will continually validate that the configured
resource constraints are applied on the associated cgroup.
The kubelet
supports a cgroup-root
flag which is the optional root cgroup
to use for pods.
This flag should be treated as a pass-through to the underlying configured container runtime.
If --cgroup-enforce=true
, this flag warrants special consideration by the operator depending
on how the node was configured. For example, if the container runtime is docker
and it is using
the systemd
cgroup driver, then docker
will take the daemon wide default and launch containers
in the same slice associated with the docker.service
. By default, this would mean system.slice
which could cause end-user pods to be launched in the same part of the cgroup hierarchy as system daemons.
In those environments, it is recommended that cgroup-root
is configured to be a subtree of machine.slice
.
$ROOT
|
+- system.slice
| |
| +- sshd.service
| +- docker.service (optional)
| +- ...
|
+- kubernetes.slice
| |
| +- kubelet.service
| +- docker.service (optional)
|
+- machine.slice (container runtime specific)
| |
| +- docker-<container-id>.scope
|
+- user.slice
| +- ...
- system.slice corresponds to --system-reserved, and contains any services the operator brought to the node as normal configuration.
- kubernetes.slice corresponds to --kube-reserved, and contains kube specific daemons.
- machine.slice should parent all end-user containers on the system and serve as the root of the end-user cluster workloads run on the system.
- user.slice is not explicitly tracked by the kubelet, but it is possible that users open ssh sessions to the node and launch actions directly. Any resource accounting reserved for those actions should be part of system-reserved.
The container runtime daemon, docker
in this outline, must be accounted for in either
system.slice
or kubernetes.slice
.
It is recommended that the container hierarchy not be rooted more than 2 layers below
the root, as deeper hierarchies have historically caused issues with node performance
in other cgroup aware systems (https://bugzilla.redhat.com/show_bug.cgi?id=850718). It
is anticipated that the kubelet will parent containers based on quality of service
in the future. In that environment, those changes will be relative to the configured
cgroup-root.
The kubelet
will set the following:
sysctl -w vm.overcommit_memory=1
sysctl -w vm.panic_on_oom=0
sysctl -w kernel/panic=10
sysctl -w kernel/panic_on_oops=1
The kubelet
at bootstrapping will set the oom_score_adj
value for Kubernetes
daemons, and any dependent container-runtime daemons.
If container-runtime
is set to docker
, then set its oom_score_adj=-999
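On Linux, a process's OOM score adjustment is set by writing an integer between -1000 and 1000 to /proc/<pid>/oom_score_adj. The sketch below shows that write; the function name is hypothetical, and the proc root is a parameter so the demonstration can run against a temporary directory instead of a real process.

```python
import os
import tempfile


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write an OOM score adjustment for the given pid and return the path."""
    if not -1000 <= value <= 1000:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(value))
    return path


# Demonstration against a fake proc tree; pid 1234 is made up.
with tempfile.TemporaryDirectory() as fake_proc:
    os.makedirs(os.path.join(fake_proc, "1234"))
    written_path = set_oom_score_adj(1234, -999, proc_root=fake_proc)
    with open(written_path) as f:
        written = f.read()

print(written)  # -999
```

Lowering the value on a real process generally requires elevated privileges, which the kubelet is assumed to have during bootstrapping.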
+----------+       +----------+    +----------+
|          |       |          |    | Pod      |
|  Node    <-------+ Container<----+ Lifecycle|
|  Manager |       | Manager  |    | Manager  |
|          +------->          |    |          |
+---+------+       +-----+----+    +----------+
    |                    |
    |                    |
    |  +-----------------+
    |  |                 |
    |  |                 |
+---v--v--+        +-----v----+
| cgroups |        | container|
| library |        | runtimes |
+---+-----+        +-----+----+
    |                    |
    |                    |
    +---------+----------+
              |
              |
  +-----------v-----------+
  |     Linux Kernel      |
  +-----------------------+
The kubelet
should move to an architecture that resembles the above diagram:
- The kubelet should not interface directly with the cgroup file-system, but instead should use a common cgroups library that has the proper abstraction in place to work with either cgroupfs or systemd. The kubelet should just use libcontainer abstractions to facilitate this requirement. The libcontainer abstractions as currently defined only support an Apply(pid) pattern, and we need to separate that abstraction to allow a cgroup to be created and then later joined.
- The existing ContainerManager should separate node bootstrapping into a separate NodeManager that is dependent on the configured cgroup-driver.
- The kubelet flags for cgroup paths will convert internally as part of the cgroup library, i.e. /foo/bar will just convert to foo-bar.slice
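The path conversion in the last item can be sketched as follows. This is an illustration under assumptions, not the cgroup library's behavior: the function name is hypothetical, and path components that already contain a dash are rejected rather than escaped, since the dash is the hierarchy separator in slice names.

```python
def cgroup_path_to_slice(path):
    """Convert a cgroupfs-style path such as "/foo/bar" to a slice unit name."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        # The root of the hierarchy corresponds to the root slice.
        return "-.slice"
    for part in parts:
        if "-" in part:
            raise ValueError(f"component {part!r} contains '-', which is not handled here")
    return "-".join(parts) + ".slice"


print(cgroup_path_to_slice("/foo/bar"))  # foo-bar.slice
```

Keeping the conversion inside the library means operators can pass the same path-style flag value regardless of which cgroup driver is configured.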
This proposal reinforces that it is inappropriate at this time to depend on --cgroup-root
as the
primary mechanism to distinguish and account for end-user pod compute resource usage.
Instead, the kubelet
can and should sum the usage of each running pod
on the node to account for
end-user pod usage separate from system-reserved and kube-reserved accounting via cAdvisor
.
Docker versions <= 1.0.9 did not have proper support for -cgroup-parent
flag on systemd
. This
was fixed in this PR (moby/moby#18612). As a result, it's expected
that containers launched by the docker
daemon may continue to go in the default system.slice
and
appear to be counted under system-reserved node usage accounting.
If operators run with later versions of docker
, they can avoid this issue via the use of cgroup-root
flag on the kubelet
, but this proposal makes no requirement on operators to do that at this time, and
this can be revisited if/when the project adopts docker 1.10.
Some OS distributions will fix this bug in versions of docker <= 1.0.9, so operators should
be aware of how their version of docker
was packaged when using this feature.