Presents a proposal for enhancing the security of Kubernetes clusters using SELinux and simplifying the implementation of SELinux support within the Kubelet by removing the need to label the Kubelet directory with an SELinux context usable from a container.
The current Kubernetes codebase relies upon the Kubelet directory being labeled with an SELinux context usable from a container. This means that a container escaping namespace isolation will be able to use any file within the Kubelet directory without defeating kernel MAC (mandatory access control). In order to limit the attack surface, we should enhance the Kubelet to relabel any bind-mounts into containers into a usable SELinux context without depending on the Kubelet directory's SELinux context.
- No API changes allowed
- Behavior must be fully backward compatible
- No new admission controllers - make incremental improvements without huge refactorings
- As a cluster operator, I want to avoid having to label the Kubelet directory with a label usable from a container, so that I can limit the attack surface available to a container escaping its namespace isolation
- As a user, I want to run a pod without an SELinux context explicitly specified and be isolated using MCS (multi-category security) on systems where SELinux is enabled, so that the pods on each host are isolated from one another
- As a user, I want to run a pod that uses the host IPC or PID namespace and want the system to do the right thing with regard to SELinux, so that no unnecessary relabel actions are performed
As previously stated, the current codebase relies on the Kubelet directory
being labeled with an SELinux context usable from a container. The Kubelet
uses the SELinux context of this directory to determine what SELinux context
`tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive.
The problem with this is that it opens an attack surface to a container that
escapes its namespace isolation; such a container would be able to use any
file in the Kubelet directory without defeating kernel MAC.
When no SELinux context is specified, Kubernetes should just do the right thing, where doing the right thing is defined as isolating pods with a node-unique set of categories. Node-uniqueness means unique among the pods scheduled onto the node. Long-term, we want to have a cluster-wide allocator for MCS labels. Node-unique MCS labels are a good middle ground that is possible without a large new feature.
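To make node-uniqueness concrete, here is a minimal sketch of an allocator that gives each pod on a node its own MCS category pair. Everything in it is hypothetical and for illustration only: the `mcsAllocator` type, the `pairAt` helper, the `s0` sensitivity level, and the assumed category range `c0`..`c1023`. It does not reclaim labels when pods go away, and it is not how Docker or rkt choose labels.

```go
package main

import "fmt"

// maxCategory is an assumption for this sketch: categories c0..c1023.
const maxCategory = 1024

// mcsAllocator is a hypothetical per-node allocator of MCS labels.
type mcsAllocator struct {
	next  int               // running index of the next unused category pair
	byPod map[string]string // pod UID -> MCS label already handed out
}

func newMCSAllocator() *mcsAllocator {
	return &mcsAllocator{byPod: map[string]string{}}
}

// pairAt maps a running index onto the index-th pair (a, b) with a < b,
// enumerated as (0,1), (0,2), ..., (0,1023), (1,2), ...
func pairAt(index int) (int, int, bool) {
	for a := 0; a < maxCategory-1; a++ {
		span := maxCategory - 1 - a // number of pairs whose first category is a
		if index < span {
			return a, a + 1 + index, true
		}
		index -= span
	}
	return 0, 0, false
}

// Allocate returns the label already assigned to podUID, or assigns the next
// unused category pair so that no two pods on the node share a label.
func (m *mcsAllocator) Allocate(podUID string) (string, error) {
	if label, ok := m.byPod[podUID]; ok {
		return label, nil
	}
	a, b, ok := pairAt(m.next)
	if !ok {
		return "", fmt.Errorf("no unused MCS category pairs left on this node")
	}
	m.next++
	label := fmt.Sprintf("s0:c%d,c%d", a, b)
	m.byPod[podUID] = label
	return label, nil
}

func main() {
	alloc := newMCSAllocator()
	for _, uid := range []string{"pod-a", "pod-b", "pod-a"} {
		label, err := alloc.Allocate(uid)
		fmt.Println(uid, label, err)
	}
}
```

Asking twice for the same pod returns the same label, which keeps every container and volume of that pod in one MCS compartment.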
Containers in pods that use the host IPC or PID namespaces need access to
other processes and IPC mechanisms on the host. Therefore, these containers
should be run with the `spc_t` SELinux type by the container runtime. The
`spc_t`
type is an unconfined type that other SELinux domains are allowed to
connect to. In the case where a pod uses one of these host namespaces, it
should be unnecessary to relabel the pod's volumes.
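The rule described above can be written as one small predicate. This is a sketch only: `podSpec` is a hypothetical, pared-down stand-in for the pod fields involved, and `volumesNeedRelabel` is a made-up name, not a Kubelet function.

```go
package main

import "fmt"

// podSpec is a hypothetical stand-in carrying only the two fields that
// matter for this decision.
type podSpec struct {
	HostIPC bool
	HostPID bool
}

// volumesNeedRelabel reports whether a pod's volumes should be relabeled.
// Pods that use a host namespace are expected to run as spc_t, which is
// unconfined, so relabeling their volumes would be wasted work.
func volumesNeedRelabel(selinuxEnabled bool, pod podSpec) bool {
	if !selinuxEnabled {
		return false
	}
	if pod.HostIPC || pod.HostPID {
		return false
	}
	return true
}

func main() {
	fmt.Println(volumesNeedRelabel(true, podSpec{}))
	fmt.Println(volumesNeedRelabel(true, podSpec{HostPID: true}))
	fmt.Println(volumesNeedRelabel(false, podSpec{}))
}
```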
Docker and rkt both use the libcontainer SELinux library. This library
provides a method, `GetLxcContexts`, that returns unique SELinux
contexts for container processes and the files they use. `GetLxcContexts`
reads the base SELinux context information from a file at
`/etc/selinux/<policy-name>/contexts/lxc_contexts`
and then adds a process-unique MCS label.
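The two steps described above (read base contexts, then add an MCS label) can be sketched as follows. This is not libcontainer's code: `parseLxcContexts` and `withMCS` are hypothetical helpers, the `key = "value"` layout and the sample type names are assumptions about what the file contains, and the MCS label is passed in rather than generated.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseLxcContexts extracts the "process" and "file" entries from text that
// is assumed to use a key = "value" layout, one entry per line.
func parseLxcContexts(content string) (process, file string) {
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.TrimSpace(parts[0])
		value := strings.Trim(strings.TrimSpace(parts[1]), `"`)
		switch key {
		case "process":
			process = value
		case "file":
			file = value
		}
	}
	return process, file
}

// withMCS replaces the level field of a user:role:type:level context with
// the given MCS label. Contexts without a level field are returned unchanged.
func withMCS(context, mcs string) string {
	fields := strings.SplitN(context, ":", 4)
	if len(fields) < 4 {
		return context
	}
	fields[3] = mcs
	return strings.Join(fields, ":")
}

func main() {
	// Sample content for illustration; real policies may differ.
	sample := "process = \"system_u:system_r:svirt_lxc_net_t:s0\"\n" +
		"file = \"system_u:object_r:svirt_sandbox_file_t:s0\"\n"
	process, file := parseLxcContexts(sample)
	fmt.Println(withMCS(process, "s0:c1,c2"))
	fmt.Println(withMCS(file, "s0:c1,c2"))
}
```

Giving the process context and the file context the same MCS label is what lets a container use its own files while staying isolated from other containers.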
Docker and rkt both leverage this call to determine the 'starting' SELinux contexts for containers.
Docker's behavior when no SELinux context is defined for a container is to give the container a node-unique MCS label.
On the Docker runtime, the containers in a Kubernetes pod share the IPC and PID namespaces of the pod's infra container.
Docker's behavior for containers sharing these namespaces is as follows: if a container B shares the IPC namespace of another container A, container B is given the SELinux context of container A. Therefore, for Kubernetes pods running on Docker, in a vacuum the containers in a pod should have the same SELinux context.
Known issue: When the seccomp profile is set on a Docker container that shares the IPC namespace of another container, that container will not receive the other container's SELinux context.
In the case of a pod that shares the host IPC or PID namespace, this flag is
simply ignored and the container receives the `spc_t` SELinux type. The
`spc_t`
type is unconfined, and so no relabeling needs to be done for volumes
for these pods. Currently, however, there is code which relabels volumes into
explicitly specified SELinux contexts for these pods. This code is unnecessary
and should be removed.
Docker is capable of relabeling bind-mounts into containers using the `:Z`
bind-mount flag. However, in the current implementation of the Docker runtime
in Kubernetes, the `:Z` option is only applied when the pod's `SecurityContext`
contains an SELinux context. We could easily implement the correct behaviors
by always setting `:Z` on systems where SELinux is enabled.
rkt's behavior when no SELinux context is defined for a pod is similar to Docker's -- an SELinux context with a node-unique MCS label is given to the containers of a pod.
Containers (apps, in rkt terminology) in rkt pods share an IPC and PID namespace by default.
Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux context.
Using the host IPC and PID namespaces is not currently supported by rkt.
- The `selinux` package should provide a method `SELinuxEnabled` that returns whether SELinux is enabled, and is built for all platforms (the libcontainer SELinux package is only built on Linux)
- The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner` and be changed to have the same method names and signatures as the libcontainer methods its implementations wrap
- The `SELinuxRunner` interface only needs `Getfilecon`, which is used by the rkt code
package selinux
// Note: the libcontainer SELinux package is only built for Linux, so it is
// necessary to have a NOP wrapper which is built for non-Linux platforms to
// allow code that links to this package not to differentiate its own methods
// for Linux and non-Linux platforms.
//
// SELinuxRunner wraps certain libcontainer SELinux calls. For more
// information, see:
//
// https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go
type SELinuxRunner interface {
	// Getfilecon returns the SELinux context for the given path or returns an
	// error.
	Getfilecon(path string) (string, error)
}
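As an illustration of the NOP wrapper the comment above calls for, here is a minimal sketch of what a non-Linux build could provide. The `nopRunner` type is a hypothetical name, the interface is repeated so the example compiles on its own, and `package main` is used only so it can run standalone; a Linux build would delegate to libcontainer instead.

```go
package main

import "fmt"

// SELinuxRunner is repeated here only to keep the sketch self-contained.
type SELinuxRunner interface {
	Getfilecon(path string) (string, error)
}

// nopRunner is a hypothetical implementation for platforms where the
// libcontainer SELinux package is not built.
type nopRunner struct{}

// Getfilecon reports no context and no error, so callers do not need to
// special-case non-Linux platforms.
func (nopRunner) Getfilecon(path string) (string, error) {
	return "", nil
}

// Compile-time check that nopRunner satisfies the interface.
var _ SELinuxRunner = nopRunner{}

// SELinuxEnabled is the non-Linux variant in this sketch: SELinux is never
// enabled there.
func SELinuxEnabled() bool {
	return false
}

func main() {
	var runner SELinuxRunner = nopRunner{}
	context, err := runner.Getfilecon("/var/lib/kubelet")
	fmt.Printf("enabled=%v context=%q err=%v\n", SELinuxEnabled(), context, err)
}
```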
- The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can be removed
- The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no longer call `relabelVolumes`
- The `makeHostsMount` method in `kubelet_pods.go` should set the `SELinuxRelabel` attribute of the mount for the pod's hosts file to `true`
- The `makeMountBindings` should be changed to:
  - No longer accept the `podHasSELinuxLabel` parameter
  - Always use the `:Z` bind-mount flag when SELinux is enabled and the mount has the `SELinuxRelabel` attribute set to `true`
- The `runContainer` method should be changed to always use the `:Z` bind-mount flag on the termination message mount when SELinux is enabled
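The proposed bind-mount rule can be sketched as below. The `mount` struct and `makeBind` function are hypothetical simplifications rather than the Kubelet's actual types; only the `SELinuxRelabel` attribute and the `Z` option come from the list above, and the example path is invented.

```go
package main

import (
	"fmt"
	"strings"
)

// mount is a hypothetical, trimmed-down description of one bind-mount.
type mount struct {
	HostPath       string
	ContainerPath  string
	ReadOnly       bool
	SELinuxRelabel bool
}

// makeBind renders a Docker-style bind string of the form
// host:container[:options]. The Z option is added only when SELinux is
// enabled and the mount asks to be relabeled; whether the pod specifies an
// SELinux context no longer plays a role.
func makeBind(m mount, selinuxEnabled bool) string {
	bind := m.HostPath + ":" + m.ContainerPath
	var opts []string
	if m.ReadOnly {
		opts = append(opts, "ro")
	}
	if selinuxEnabled && m.SELinuxRelabel {
		opts = append(opts, "Z")
	}
	if len(opts) > 0 {
		bind += ":" + strings.Join(opts, ",")
	}
	return bind
}

func main() {
	hosts := mount{
		HostPath:       "/var/lib/kubelet/pods/example/etc-hosts", // invented path
		ContainerPath:  "/etc/hosts",
		SELinuxRelabel: true,
	}
	fmt.Println(makeBind(hosts, true))
	fmt.Println(makeBind(hosts, false))
}
```

Keeping the decision per mount, rather than per pod, is what allows the hosts file and the termination message mount to be relabeled even when the pod specifies no SELinux context.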
There should not be any required changes for the rkt runtime; we should test to ensure things work as expected under rkt.
- The `VolumeHost` interface contains a method called `GetRootContext`; this is an artifact of the old assumptions about the Kubelet directory's SELinux context and can be removed
- The `empty_dir.go` file should be changed to be completely agnostic of SELinux; no behavior in this plugin needs to be differentiated when SELinux is enabled
The `VolumeHost` abstraction is used in a couple of PV controllers as NOP
implementations. These should be altered to no longer include `GetRootContext`.