- System requirements
- Setting up the cluster
- Running the `cmk isolate` Hello World Pod
- Validating the environment
- Troubleshooting and recovery
- Kubernetes >= v1.5.0 (excluding v1.8.0; details below)
All template manifests provided with CMK use the service account defined in the
cmk-serviceaccount manifest. Before the first CMK run, the operator should apply
this manifest to create the cmk-serviceaccount. This step is not obligatory on
Kubernetes 1.5, but it is strongly recommended. Kubernetes 1.6 requires it
because of the RBAC authorization method, which uses the service account to
grant API access from inside the CMK pod(s).
From Kubernetes 1.6 onwards, RBAC is the default authorization method. The
operator needs to create an additional ClusterRole and ClusterRoleBindings in
order to deploy CMK; these are provided in the cmk-rbac-rules manifest. In this
case the operator must also apply the provided service account manifest, as
shown in the example below.
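For example, assuming the two manifests have been saved locally under the
illustrative file names below, they can be applied with kubectl:

```bash
# Create the service account used by the CMK pod templates
# (file name is illustrative; use the manifest shipped with CMK).
kubectl create -f cmk-serviceaccount.yaml

# On Kubernetes 1.6+, also create the ClusterRole and ClusterRoleBindings
# that grant the service account API access under RBAC.
kubectl create -f cmk-rbac-rules.yaml
```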
From Kubernetes 1.7, Custom Resource Definitions (CRD) replace Third Party
Resources (TPR); only in Kubernetes 1.7 are both supported, so the operator
must migrate from TPR to CRD. ClusterRole and ClusterRoleBindings for CRDs have
been added to the cmk-rbac-rules manifest. CMK detects the Kubernetes version
itself and uses Custom Resource Definitions on Kubernetes 1.7+, and Third Party
Resources otherwise, to create the Nodereport and Reconcilereport objects.
Additionally, taints have moved from alpha to beta and are no longer stored in
node metadata but directly in the node `spec`. Please note that if the pod
manifest has a `nodeName: <nodename>` selector, taint tolerations are not
needed.
Kubernetes 1.8.0 is not supported due to an extended resources issue (it is impossible to create an extended resource). Use Kubernetes 1.8.1+ instead.
From Kubernetes 1.9.0, a mutating admission controller is used to update any
pod whose definition contains a container requesting CMK Extended Resources.
The CMK webhook modifies such a pod by injecting the environment variable
CMK_NUM_CORES, with its value set to the number of cores specified in the
Extended Resource request. This allows `cmk isolate` to assign multiple CPU
cores to a given process.
On top of that, the webhook applies additional changes to the pod as defined in
the configuration file. By default, the configuration deployed during `cmk
cluster-init` adds the CMK installation and host /proc filesystem volumes, the
CMK service account, and the tolerations required for a pod to be scheduled on
a CMK-enabled node, and annotates the pod appropriately. Container
specifications are updated with volume mounts (referencing the volumes added to
the pod) and the environment variable CMK_PROC_FS.
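As an illustration, a container that requests two exclusive cores might look as
follows after mutation (a minimal sketch; the container name is hypothetical,
and the exact set of injected volumes and mounts depends on the deployed
webhook configuration):

```yaml
spec:
  containers:
  - name: my-workload                  # hypothetical workload container
    resources:
      requests:
        cmk.intel.com/exclusive-cores: 2
    env:
    # Injected by the webhook: core count taken from the ER request.
    - name: CMK_NUM_CORES
      value: "2"
    # Injected by the webhook: location of the mounted host /proc.
    - name: CMK_PROC_FS
      value: "/host/proc"
```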
The mutating admission controller is set up by default using mutual TLS, where
the webhook service also authenticates the Kubernetes API server. This requires
that the Kubernetes API server be set up to pass webhooks a specified
certificate and key. By default, the webhook authenticates the certificate it
is passed against the CA file that the Kubernetes API server provides to each
pod when it is created. You can specify the CA file location to use when
running the webhook with the `--cafile` argument. You can also set the
`--insecure` argument to True, in which case the webhook service reverts to
regular TLS. To set up the Kubernetes API server to pass webhook services
certificates and keys, do the following:
When starting the Kubernetes API server, set `--admission-control-config-file`
to the location of your admission control configuration file, for example
/var/lib/kubernetes/cmk_config.yaml.
In the admission control configuration file, specify where the
WebhookAdmissionConfiguration controller should read its credentials, which are
stored in a kubeConfig file. This kubeConfig file contains the certificate and
key data, base64 encoded, that the webhook service will use. This certificate
should be the one used by your Kubernetes cluster or admin, as it needs to be
validated against the Kubernetes CA.
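A minimal sketch of the two files is shown below. The exact apiVersion and kind
values vary between Kubernetes releases, so treat them as an assumption and
consult the documentation linked below; the file paths are illustrative.

```yaml
# /var/lib/kubernetes/cmk_config.yaml (admission control configuration)
apiVersion: apiserver.k8s.io/v1alpha1
kind: AdmissionConfiguration
plugins:
- name: MutatingAdmissionWebhook
  configuration:
    apiVersion: apiserver.config.k8s.io/v1alpha1
    kind: WebhookAdmission
    # kubeConfig file holding the credentials the API server presents
    # to webhook services.
    kubeConfigFile: /var/lib/kubernetes/webhook_kubeconfig.yaml
```

```yaml
# /var/lib/kubernetes/webhook_kubeconfig.yaml
apiVersion: v1
kind: Config
users:
- name: '*'   # apply these credentials to all webhook services
  user:
    client-certificate-data: <base64-encoded certificate>
    client-key-data: <base64-encoded key>
```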
The official Kubernetes documentation for setting up the Kubernetes API server
to send webhook services certificates can be found here:
https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#authenticate-apiservers
More on RBAC roles and bindings:
https://kubernetes.io/docs/admin/authorization/rbac/#rolebinding-and-clusterrolebinding
This section describes the setup required to use the CMK software.
Notes:
- The recommended way to prepare Kubernetes nodes for the CMK software is to
  run `cmk cluster-init` as a Pod, as described in the cluster setup
  instructions using `cmk cluster-init`.
- The cluster setup instructions using manually created Pods should be used if
  and only if running `cmk cluster-init` fails for some reason.
Prepare the nodes by running `cmk cluster-init` using these instructions.
- Concepts
- Preparing nodes by running `cmk cluster-init` (recommended)
- Preparing nodes by running each CMK subcommand as a Pod (use only if required)
| Term | Meaning |
|---|---|
| CMK nodes | The operator can choose any number of nodes in the Kubernetes cluster to work with CMK. These participating nodes are referred to as CMK nodes. |
| Pod | A Pod is an abstraction in Kubernetes representing one or more containers and their configuration. It is the smallest schedulable unit in Kubernetes. |
| OIR | Acronym for Opaque Integer Resource. In Kubernetes, OIRs allow cluster operators to advertise new node-level resources that would otherwise be unknown to the system. |
| Volume | A volume is a directory (on the host file system). In Kubernetes, a volume has the same lifetime as the Pod that uses it. Many types of volumes are supported in Kubernetes. |
| hostPath | hostPath is a volume type in Kubernetes. It mounts a file or directory from the host file system into the Pod. |
CMK nodes can be prepared using the `cmk cluster-init` subcommand, which is
expected to be run as a Pod. The cmk-cluster-init-pod template can be used to
run `cmk cluster-init` on a Kubernetes cluster. When run on a Kubernetes
cluster, the Pod spawns at most two Pods per node in order to prepare each
node.
The only value that requires change in the cmk-cluster-init-pod template is the
`args` field, which can be modified to pass different options.
Following are some example modifications to the `args` field:
- args:
# Change this value to pass different options to cluster-init.
- "/cmk/cmk.py cluster-init --host-list=node1,node2,node3"
The above command prepares nodes "node1", "node2" and "node3" for the CMK
software using default options.
- args:
# Change this value to pass different options to cluster-init.
- "/cmk/cmk.py cluster-init --all-hosts"
The above command prepares all the nodes in the Kubernetes cluster for the CMK
software using default options.
- args:
# Change this value to pass different options to cluster-init.
- "/cmk/cmk.py cluster-init --host-list=node1,node2,node3 --cmk-cmd-list=init,discover"
The above command prepares nodes "node1", "node2" and "node3" but only runs the cmk init
and cmk discover
subcommands on each of those nodes.
- args:
# Change this value to pass different options to cluster-init.
- "/cmk/cmk.py cluster-init --host-list=node1,node2,node3 --num-exclusive-cores=3 --num-shared-cores=1 --excl-non-isolcpus=11-15"
The above command prepares nodes "node1", "node2" and "node3" to have 3 cores placed in the exclusive pool, 1 core placed in the shared pool, and the cores 11-15 placed in the exclusive-non-isolcpus pool. The exclusive-non-isolcpus pool will isolate pods from other pods in the cluster, but will not use cores that are governed by isolcpus.
For more details on the options provided by cmk cluster-init
, see this description.
Notes:
- The instructions provided in this section should be used if and only if
  running `cmk cluster-init` fails for some reason.
- The subcommands described below should be run in the order given.
- The documentation in this section assumes that the `cmk` binary is installed
  on the host under /opt/bin.
- In all the pod templates used in this section, the name of the container
  image used is cmk:v1.5.2. It is expected that the `cmk` container image is
  built and cached locally on the host. The `image` field will require
  modification if the container image is hosted remotely (e.g., on
  https://hub.docker.com/).
The CMK nodes in the Kubernetes cluster should be initialized with `cmk init`
before they can be used with the CMK software. To initialize the CMK nodes, the
cmk-init-pod template can be used.
`cmk init` takes the `--num-exclusive-cores` and `--num-shared-cores` flags. In
the cmk-init-pod template, the values passed to these flags can be modified:
the values for `--num-exclusive-cores` and `--num-shared-cores` can be set by
changing the NUM_EXCLUSIVE_CORES and NUM_SHARED_CORES environment variables,
respectively.
Values that might require modification in the cmk-init-pod template are shown as snippets below:
env:
- name: NUM_EXCLUSIVE_CORES
# Change this to modify the value passed to `--num-exclusive-cores` flag.
value: '4'
- name: NUM_SHARED_CORES
# Change this to modify the value passed to `--num-shared-cores` flag.
value: '1'
All the CMK nodes in the Kubernetes cluster should be patched with CMK OIR
slots using `cmk discover`. The OIR slots are advertised because the exclusive
pool's cores must be allocated exclusively. The number of slots advertised
should be equal to the number of CPU lists under the exclusive pool, as
determined by examining the CMK configuration configmap. The cmk-discover-pod
template can be used to advertise the CMK OIR slots.
After running this Pod on a node, the node will be patched with the
`pod.alpha.kubernetes.io/opaque-int-resource-cmk` OIR.
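To confirm that the patch was applied, the node's capacity can be inspected
(the same check appears in the validation section below):

```
kubectl get node <node-name> -o json | jq .status.capacity
```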
In order to recover from an outdated CMK configuration state, each CMK node
should run `cmk reconcile` periodically. `cmk reconcile` can be run
periodically using the cmk-reconcile-daemonset template.
In the cmk-reconcile-daemonset template, the time between each invocation of
`cmk reconcile` can be adjusted by changing the value of the
CMK_RECONCILE_SLEEP_TIME environment variable. The value specifies time in
seconds.
Values that might require modification in the cmk-reconcile-daemonset template are shown as snippets below:
env:
- name: CMK_RECONCILE_SLEEP_TIME
# Change this to modify the sleep interval between consecutive
# cmk reconcile runs. The value is specified in seconds.
value: '60'
`cmk install` is used to create a zero-dependency binary of the CMK software
and place it on the host filesystem. Subsequent containers can isolate
themselves by mounting the install directory from the host and then calling
`cmk isolate`. To run it on all the CMK nodes, the cmk-install-pod template can
be used.
`cmk install` takes the `--install-dir` flag. In the cmk-install-pod template,
the value for `--install-dir` can be configured by changing the `path` value of
the `hostPath` for cmk-install-dir.
Values that might require modification in the cmk-install-pod template are shown as snippets below:
volumes:
- hostPath:
# Change this to modify the CMK installation dir in the host file system.
path: "/opt/bin"
name: cmk-install-dir
`cmk webhook` is used to run the mutating admission webhook server. Whenever
there is a request to create a new pod, the webhook captures that request,
checks whether any of the containers requests or limits a number of CMK
Extended Resources, and updates the pod and its container specifications
appropriately. This simplifies the deployment of workloads that take advantage
of CMK by reducing the requirements to a minimum:
...
spec:
  containers:
  - resources:
      requests:
        cmk.intel.com/exclusive-cores: 2
...
In order to deploy the CMK mutating webhook, a number of resources need to be
created on the cluster. Before that, the operator needs an X509 private key and
TLS certificate generated in PEM format. The certificates can be self-signed,
although using certificates signed by a proper CA or by the Kubernetes
Certificates API is highly recommended (see the sketch after this list). After
meeting that requirement, the steps to deploy the webhook are as follows:
- Encode the PEM certificates to Base64 and place them in the Mutating
  Admission Configuration and Secret templates.
- Update the config map template. The config map contains 2 configuration
  files, server.yaml and mutations.yaml. The configuration options are
  described in the cmk command-line tool documentation.
- Create the secret, service and config map using the `kubectl create -f ...`
  command.
- Run the `cmk webhook` pod defined in the webhook pod template using the
  `kubectl create -f ...` command.
- If the `cmk webhook` pod is running correctly, create the Mutating Admission
  Configuration object.
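A minimal sketch of generating a self-signed key/certificate pair and encoding
it for the templates (illustrative only; the service name in the subject is an
assumption and must match your webhook service):

```bash
# Generate a self-signed key and certificate in PEM format.
# The CN must match the webhook service's in-cluster DNS name (assumed here).
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout webhook-key.pem -out webhook-cert.pem \
  -subj "/CN=cmk-webhook-service.default.svc"

# Base64-encode both files for the Secret and Mutating Admission
# Configuration templates.
base64 -w0 webhook-cert.pem
base64 -w0 webhook-key.pem
```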
CMK is able to use multiple sockets. During cluster initialization, the init
module distributes cores from all sockets across the pools. To prevent a
situation where the exclusive pool or shared pool ends up only on a single
socket, the operator can use one of two mode policies: packed and spread. These
policies define how cores are assigned to a specific pool:
- packed mode will put cores in the following order:
  Note: This policy is not topology aware, so there is a possibility that one
  pool won't spread over multiple sockets.
- spread mode will put cores in the following order:
  Note: This policy is topology aware, so CMK will try to spread pools over
  each socket.
The operator can select the appropriate mode during initialization with the
`--shared-mode` and `--exclusive-mode` parameters. These parameters can be used
with `cluster-init` and `init`. If the operator uses two different modes, the
policies are mixed; in that case the exclusive pool is resolved before the
shared pool. An example is shown below.
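For example, the `args` field of the cmk-cluster-init-pod template could select
spread for the exclusive pool and packed for the shared pool (an illustrative
combination):

```
- args:
  # Change this value to pass different options to cluster-init.
  - "/cmk/cmk.py cluster-init --all-hosts --exclusive-mode=spread --shared-mode=packed"
```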
CMK supports some power management capabilities on the latest Xeon processors,
one of which is Speed Select Technology - Base Frequency (SST-BF). CMK is able
to discover SST-BF-configured nodes through the use of node labels; it
discovers the SST-BF-configured cores and ensures these cores are placed in the
exclusive pool. This enables users to run their containerized workloads on
these special cores and get guaranteed performance.
- More information on SST-BF can be found here
- More information on configuring a Kubernetes cluster to take advantage of
  these power management capabilities can be found here
To utilize SST-CP cores with CMK, the cores need to be set up before CMK is
initialised. More information about setting up the cores can be found here. The
SST-CP-capable node must also be labeled correctly.
The node gets labeled using Node Feature Discovery (NFD), which runs a script
provided in the CMK Github repository (located at resources/scripts/sst-cp.sh)
to determine whether the node is configured to use SST-CP. This file needs to
be moved to the correct place so NFD can find it.
After NFD has been set up in your Kubernetes cluster, the folders
/etc/kubernetes/node-feature-discovery/source.d/ and
/etc/kubernetes/node-feature-discovery/features.d/ should have been created. To
move the SST-CP discovery script to the correct location, change into the
directory where you cloned the CMK repository. Then copy the file:
cp resources/scripts/sst-cp.sh /etc/kubernetes/node-feature-discovery/source.d/
NFD will look in this location and execute the script, labeling the node if SST-CP is correctly configured. Then simply initialise CMK with the recommended script, providing the correct number of cores for the exclusive and shared pools, and the correct cores will be placed in the correct pools.
After following the instructions in the previous section, the cluster is ready
to run the Hello World Pod. The Hello World cmk-isolate-pod template describes
a simple Pod with three containers requesting CPUs from the exclusive, shared
and infra pools, respectively, using `cmk isolate`. The pool is requested by
passing the desired value to the `--pool` flag when using `cmk isolate`, as
described in the documentation.
`cmk isolate` can use the `--socket-id` flag to target the socket on which the
application should be spawned. This flag is optional and only applies to the
exclusive pool; if it is not used, `cmk isolate` will use the first unreserved
core.
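For example, the container args below (an illustrative command following the
pattern used in the manifests in this document) would spawn the workload on an
exclusive core on socket 0:

```
- "/opt/bin/cmk isolate --pool=exclusive --socket-id=0 sleep -- 10000"
```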
`cmk isolate` also takes the `--install-dir` flag. In the cmk-isolate-pod
template, the value for `--install-dir` can be modified by changing the `path`
value of the `hostPath`.
Values that might require modification in the cmk-isolate-pod template are shown as snippets below:
volumes:
- hostPath:
# Change this to modify the CMK installation dir in the host file system.
path: "/opt/bin"
name: cmk-install-dir
Notes:
- The Hello World cmk-isolate-pod consumes the
  pod.alpha.kubernetes.io/opaque-int-resource-cmk Opaque Integer Resource (OIR)
  only in the container isolated using the exclusive pool. The CMK software
  assumes that only containers isolated using the exclusive pool request the
  OIR, and each of these containers should consume exactly one OIR. This
  restricts the number of such pods that can land on a Kubernetes node to the
  expected value.
- The `cmk isolate` Hello World Pod should only be run after following the
  instructions provided in the "Setting up the cluster" section.
Following is an example of validating the environment on one node.
- Pick a node to test. For illustration, we will use <node-name> as the name of
  the node.
- Check if the node has the appropriate label:
kubectl get node <node-name> -o json | jq .metadata.labels
Example output:
kubectl get node cmk-02-zzwt7w -o json | jq .metadata.labels
{
"beta.kubernetes.io/arch": "amd64",
"beta.kubernetes.io/os": "linux",
"cmk.intel.com/cmk-node": "true",
"kubernetes.io/hostname": "cmk-02-zzwt7w"
}
- Check if the node has the appropriate taint (Kubernetes < v1.7):
kubectl get node <node-name> -o json | jq .metadata.annotations
Example output:
kubectl get node cmk-02-zzwt7w -o json | jq .metadata.annotations
{
"scheduler.alpha.kubernetes.io/taints": "[{\"value\": \"true\", \"key\": \"cmk\", \"effect\": \"NoSchedule\"}]",
"volumes.kubernetes.io/controller-managed-attach-detach": "true"
}
- Check if the node has the appropriate taint (Kubernetes >= v1.7):
kubectl get node <node-name> -o json | jq .spec.taints
Example output:
kubectl get node cmk-02-zzwt7w -o json | jq .spec.taints
[
{
"effect": "NoSchedule",
"key": "cmk",
"timeAdded": null,
"value": "true"
}
]
- Check if the node has the appropriate OIR (Kubernetes < v1.8):
kubectl get node <node-name> -o json | jq .status.capacity
Example output:
kubectl get node cmk-02-zzwt7w -o json | jq .status.capacity
{
"alpha.kubernetes.io/nvidia-gpu": "0",
"cpu": "16",
"memory": "14778328Ki",
"pod.alpha.kubernetes.io/opaque-int-resource-cmk": "4",
"pods": "110"
}
- Check if the node has the appropriate ER (Kubernetes >= v1.8.1):
kubectl get node <node-name> -o json | jq .status.capacity
Example output:
kubectl get node cmk-02-zzwt7w -o json | jq .status.capacity
{
"alpha.kubernetes.io/nvidia-gpu": "0",
"cpu": "16",
"memory": "14778328Ki",
"cmk.intel.com/exclusive-cores": "4",
"pods": "110"
}
- Log in to the node and check that the CMK configuration directory and binary
  exist. Assuming default options were used for `cmk cluster-init`, you would
  do the following:
ls /opt/bin/
- Replace the nodeName in the Pod manifest below with the chosen node name and
  save it to a file.
apiVersion: v1
kind: Pod
metadata:
labels:
app: cmk-isolate-pod
name: cmk-isolate-pod
spec:
# Change this to the <node-name> you want to test.
nodeName: NODENAME
containers:
- args:
- "/opt/bin/cmk isolate --pool=infra sleep -- 10000"
command:
- "/bin/bash"
- "-c"
env:
- name: CMK_PROC_FS
value: "/host/proc"
image: cmk:v1.5.2
imagePullPolicy: "Never"
name: cmk-isolate-infra
volumeMounts:
- mountPath: "/host/proc"
name: host-proc
readOnly: true
- mountPath: "/opt/bin"
name: cmk-install-dir
restartPolicy: Never
volumes:
- hostPath:
# Change this to modify the CMK installation dir in the host file system.
path: "/opt/bin"
name: cmk-install-dir
- hostPath:
path: "/proc"
name: host-proc
- Run `kubectl create -f <file-name>`, where <file-name> is the name of the Pod
  manifest file with the nodeName field substituted as mentioned in the
  previous step.
- Check if any process is isolated in the infra pool using the NodeReport for
  that node.
  If you are using Third Party Resources (Kubernetes 1.6.x and older):
kubectl get NodeReport <node-name> -o json | jq .report.description.pools.infra
  If you are using Custom Resource Definitions (Kubernetes 1.7.x and newer):
kubectl get cmk-nodereport <node-name> -o json | jq .spec.report.description.pools.infra
- Follow all the above steps, but use the simplified Pod manifest below:
apiVersion: v1
kind: Pod
metadata:
labels:
app: cmk-isolate-pod
name: cmk-isolate-pod
spec:
# Change this to the <node-name> you want to test.
nodeName: NODENAME
containers:
- args:
- "/opt/bin/cmk isolate --pool=exclusive sleep -- 10000"
command:
- "/bin/bash"
- "-c"
image: cmk:v1.5.2
imagePullPolicy: "Never"
name: cmk-isolate-exclusive
resources:
requests:
cmk.intel.com/exclusive-cores: 1
restartPolicy: Never
- Run `kubectl create -f <file-name>`, where <file-name> is the name of the Pod
  manifest file with the nodeName field substituted as mentioned in the
  previous section.
- Run `kubectl get pod cmk-isolate-pod -o json | jq .metadata.annotations` and
  verify that the annotation has been added:
{
"cmk.intel.com/resources-injected": "true"
}
- Run
kubectl get pod cmk-isolate-pod -o json | jq .spec.volumes
and verify that extra volumes have been injected:
[
{
"name": "default-token-xfd8q",
"secret": {
"defaultMode": 420,
"secretName": "default-token-xfd8q"
}
},
{
"hostPath": {
"path": "/proc",
"type": ""
},
"name": "cmk-host-proc"
},
{
"hostPath": {
"path": "/opt/bin",
"type": ""
},
"name": "cmk-install-dir"
}
]
- Run
kubectl get pod cmk-isolate-pod -o json | jq .spec.containers[0].env
and verify that env variables have been added to the container spec:
[
{
"name": "CMK_PROC_FS",
"value": "/host/proc"
},
{
"name": "CMK_NUM_CORES",
"value": "1"
}
]
Dynamic reconfiguration allows you to change the pool setup of the CMK nodes in
your cluster without having to tear down CMK and clean up any of the
configuration directories or the configmap associated with CMK. The reconfigure
command will look at every pod in every namespace on all of the CMK nodes in
your cluster, but will only reassign those pods that have been assigned cores
using CMK. This takes a considerable amount of time off the operation and makes
it much easier. It also means that you don't have to stop any currently running
processes in order to reconfigure, as this method automatically reassigns
running processes to the new cores in the new configuration.
For example, consider the following CMK pool configuration:
data:
config: |
exclusive:
0:
3,11: []
4,12:
- '3001'
5,13: []
6,14: []
1: {}
infra:
0:
0-2,8-10: []
1: {}
shared:
0:
7,15:
- '2000, 2001'
1: {}
After a reconfiguration that shrinks the exclusive pool and enlarges the shared
pool, the updated configuration might look like this:
data:
config: |
exclusive:
0:
3,11: []
4,12:
- '3001'
1: {}
infra:
0:
0-2,8-10: []
1: {}
shared:
0:
6,14,7,15:
- '2000, 2001'
1: {}
The processes 2000 and 2001 in the shared pool will have their CPU affinity
changed from the original ["7,15"] to the updated ["6,14,7,15"] when the
reconfiguration has completed. In the case of the exclusive pool, you can see
that process 3001 remained on the core list 4,12 instead of being reassigned to
the core list 3,11. This avoids unnecessary interruption to the processes
running on those cores, which are expected to be high-priority processes
requiring low latency and zero interrupts. If the core list that a process is
running on is not available in the updated configuration (for example, if only
one exclusive core list was requested in the new setup, meaning only core list
3,11 would be assigned), then of course the exclusive process will have to be
reassigned to a new core list.
To use this reconfigure method, you simply run a pod and use the
reconfigure_setup option in cmk.py. The reconfigure option requires the
following parameters:
num-exclusive-cores, num-shared-cores, excl-non-isolcpus, exclusive-mode,
shared-mode, cmk-img, cmk-img-pol, install-dir, saname, namespace
An example PodSpec is provided in the resources/pods folder of the repository. An example command would look like the following:
"/opt/bin/cmk isolate --pool=infra /opt/bin/cmk -- reconfigure_setup --num-exclusive-cores=2 --num-shared-cores=2 --namespace=cmk-namespace"
The parameters that are not listed in this example take their default value, which can be seen by running the cmk --help
command.
What happens if there aren't enough cores to house all of the processes in the
current configuration?
This scenario happens when, for example, your CMK configuration has three cores
assigned to the exclusive pool, all of which have a process running on them,
and you try to reconfigure CMK to have only two cores assigned to the exclusive
pool. The reconfigure command will recognise that one of the processes cannot
be reassigned to an exclusive core and will fail out of the operation before
any changes have been made to the configuration files.
The reconfigure operation automatically detects which nodes in your cluster are
CMK nodes and reconfigures all of them without you having to specify them. It
does this detection by looking for the following label on the node:
"cmk.intel.com/cmk-node" == "true"
This label is added by the discover operation, which occurs as part of
cluster-init, so you don't have to add the label yourself.
If running `cmk cluster-init` using the cmk-cluster-init-pod template ends in
an error, the recommended way to start troubleshooting is to look at the logs
using `kubectl logs POD_NAME [CONTAINER_NAME] -f`.
For example, assuming you ran the cmk-cluster-init-pod template with default
options, it should create two pods on each node, named
cmk-init-install-discover-pod-<node-name> and
cmk-reconcile-nodereport-pod-<node-name>, where <node-name> should be replaced
with the name of the node.
If you want to look at the logs from the container which ran the `discover`
subcommand in the pod, you can use:
kubectl logs -f cmk-init-install-discover-pod-<node-name> discover
If you want to look at the logs from the container which ran the `reconcile`
subcommand in the pod, you can use:
kubectl logs -f cmk-reconcile-nodereport-pod-<node-name> reconcile
If you want to remove CMK, use cmk-uninstall-pod.yaml. A `nodeSelector` can be
used to restrict the removal to specific nodes, as sketched below.
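A minimal sketch, assuming the uninstall template lives alongside the other pod
templates in the repository's resources/pods directory:

```bash
# Remove CMK from the cluster using the provided uninstall pod template.
kubectl create -f resources/pods/cmk-uninstall-pod.yaml
```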