[TDX] Simplified the descriptions; added Getting Started
Signed-off-by: Jakub Ledworowski <[email protected]>
JakubLedworowski committed Dec 12, 2024
1 parent d7e3771 commit 608e869
Showing 1 changed file with 30 additions and 99 deletions.
129 changes: 30 additions & 99 deletions ChatQnA/kubernetes/intel/README_tdx.md
@@ -5,86 +5,42 @@ This document outlines the deployment process for an example application utilizi
The deployment process is intended for users who want to deploy an example application:

- with pods protected by Intel TDX,
- on a single node in a cluster (acting as a master and worker) that is a Xeon 5th Gen platform or later,
- on a single node in a cluster (acting as a master and worker) that is a Xeon 4th Gen platform or later,
- running Ubuntu 24.04,
- using images pushed to a public registry, such as quay.io or Docker Hub.

It's split into 3 sections:

1. [Cluster Configuration](#cluster-configuration) - steps required to prepare components in the cluster required to use Intel TDX.
2. [Node configuration](#node-configuration) - additional steps to be performed on the node that are required to run heavy applications like OPEA ChatQnA.
3. [Deployment of services protected with Intel TDX](#deployment-of-services-protected-with-intel-tdx) - describes how to deploy an example application with services protected using Intel TDX.
## Getting Started

> [!NOTE]
> Running TDX-protected services requires the user to define the pod's resources request (cpu, memory).
>
> Because TDX lacks a hotplugging feature, the assigned resources cannot be changed after the pod is scheduled, and they are not shared with any other pod.
>
> This means that the total amount of resources assigned to all TDX-protected pods must be less than the total amount of resources available on the node, leaving room for the requests of non-TDX pods.
Follow the steps below on the Xeon server node to deploy the example application:

1. [Install Ubuntu 24.04 and enable Intel TDX](https://github.com/canonical/tdx/blob/noble-24.04/README.md#setup-host-os)
2. [Install a Kubernetes cluster](https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/)
3. [Install the Confidential Containers Operator](https://cc-enabling.trustedservices.intel.com/intel-confidential-containers-guide/02/infrastructure_setup/#install-confidential-containers-operator)
4. Increase the kubelet timeout:

## Cluster Configuration

To prepare the cluster to run Intel TDX-protected workloads, follow the [Intel Confidential Computing Documentation](https://cc-enabling.trustedservices.intel.com/intel-confidential-containers-guide/01/introduction/index.html).


## Node Configuration

This section outlines the changes required on each node.
These steps can be automated with configuration management tools such as Ansible, Puppet, or Chef.


### Kubelet Configuration

To run a complex and heavy application like OPEA, the cluster administrator must increase the kubelet timeout for container creation; otherwise pod creation may fail with the timeout error `Context deadline exceeded`.
This is required because the container creation process can take a long time due to the size of pod images and the need to download the AI models.
Run the following script on all nodes to increase the kubelet timeout to 30 minutes and restart the kubelet automatically if the setting was applied (sudo required):

```bash
echo "Setting up the environment..."
kubelet_config="/var/lib/kubelet/config.yaml"
# save the current kubelet timeout setting
previous=$(sudo grep runtimeRequestTimeout "${kubelet_config}")
# Increase kubelet timeout
sudo sed -i 's/runtimeRequestTimeout: .*/runtimeRequestTimeout: 30m/' "${kubelet_config}"
new=$(sudo grep runtimeRequestTimeout "${kubelet_config}")
# Check if the kubelet timeout setting was updated
if [[ "$previous" == "$new" ]]; then
  echo "kubelet runtimeRequestTimeout setting was not updated."
else
  echo "kubelet runtimeRequestTimeout setting was updated."
  echo "Updated kubelet runtimeRequestTimeout setting:"
  sudo grep runtimeRequestTimeout "${kubelet_config}"
  echo "Restarting kubelet..."
  sudo systemctl daemon-reload && sudo systemctl restart kubelet
  echo "Waiting 30s for kubelet to restart..."
  sleep 30
  echo "kubelet restarted."
fi
```
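Note that the `sed` substitution above only rewrites an existing `runtimeRequestTimeout` line; if the key is absent from the file, nothing changes. Below is a minimal sketch of a variant that also handles the missing-key case. The function name `set_runtime_request_timeout` is hypothetical (it is not part of the repository), and it takes the config path as an argument so it can be tried on a copy of the file first. It assumes GNU `sed`, as shipped with Ubuntu.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: set runtimeRequestTimeout
# in a kubelet config file, appending the key if it is not present yet.
set_runtime_request_timeout() {
  local config_file="$1"
  local timeout="${2:-30m}"

  if grep -q '^runtimeRequestTimeout:' "${config_file}"; then
    # Key exists: rewrite its value in place.
    sed -i "s/^runtimeRequestTimeout: .*/runtimeRequestTimeout: ${timeout}/" "${config_file}"
  else
    # Key missing: make sure the file ends with a newline, then append the key
    # as a top-level field.
    if [[ -n "$(tail -c 1 "${config_file}")" ]]; then
      printf '\n' >> "${config_file}"
    fi
    printf 'runtimeRequestTimeout: %s\n' "${timeout}" >> "${config_file}"
  fi
}
```

For example, `cp /var/lib/kubelet/config.yaml ./config.yaml && set_runtime_request_timeout ./config.yaml 30m` edits only the copy, which can be inspected before the real file is changed as root and the kubelet is restarted.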

> [!NOTE]
> The script is prepared for a vanilla Kubernetes installation.
> If you are using a different Kubernetes distribution, the kubelet configuration file location may differ or the setting may be managed differently.
>
> After the kubelet restarts, some internal pods in the `kube-system` namespace might be restarted automatically.
All kubelet configuration options can be found [here](https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/).


## Deployment of services protected with Intel TDX
```bash
sudo sed -i 's/runtimeRequestTimeout: .*/runtimeRequestTimeout: 30m/' "/var/lib/kubelet/config.yaml"
sudo systemctl daemon-reload && sudo systemctl restart kubelet
```

5. Deploy ChatQnA:

This section describes how to deploy an example application with services protected using Intel TDX:
```bash
kubectl apply -f cpu/xeon/manifest/chatqna_tdx.yaml
```

6. Verify all pods are running:

1. [Overview of the changes needed](#overview-of-the-changes-needed) - describes the changes required to protect a single component with Intel TDX.
2. [Example deployment of ChatQnA with TDX protection](#example-deployment-of-chatqna-with-tdx-protection) - provides a quick start to run ChatQnA example application with all services protected with Intel TDX.
3. [Customization of deployment configuration](#customization-of-deployment-configuration) - describes how to manually modify the deployment configuration to protect a single component with Intel TDX.
```bash
kubectl get pods
```
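The verification in step 6 can also be turned into a check that fails while any pod is not yet running. This is a minimal sketch, assuming the default `kubectl get pods --no-headers` column layout (NAME, READY, STATUS, ...); the function name `list_not_running` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print the name and status of every pod whose STATUS column is not "Running".
# Returns exit status 1 if at least one such pod is found, 0 otherwise.
list_not_running() {
  awk 'NF >= 3 && $3 != "Running" { print $1, $3; found = 1 }
       END { exit (found ? 1 : 0) }'
}
```

Usage: `kubectl get pods --no-headers | list_not_running`. Pods that legitimately finish (status `Completed`) are also reported by this sketch, so adjust the comparison if the deployment includes such pods.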


### Overview of the changes needed
## Advanced configuration

To protect a single component with Intel TDX, the user must modify its manifest file.
The process is described in detail in the [Demo Workload Deployment](https://cc-enabling.trustedservices.intel.com/intel-confidential-containers-guide/03/demo_workload_deployment/#pod-isolated-by-kata-containers-protected-with-intel-tdx-and-quote-verified-using-intel-trust-authority).
The details are described in the [Demo Workload Deployment](https://cc-enabling.trustedservices.intel.com/intel-confidential-containers-guide/03/demo_workload_deployment/#pod-isolated-by-kata-containers-and-protected-by-intel-tdx).

Here, we show the required changes using the example Deployment definition below:

@@ -120,39 +76,11 @@ spec:
```
### Example deployment of ChatQnA with TDX protection
As an example we will use the ChatQnA application.
If you want to just give it a try, simply run:
```bash
kubectl apply -f cpu/xeon/manifest/chatqna_tdx.yaml
```

After a few minutes, the ChatQnA services should be up and running in the cluster and all of them will be protected with Intel TDX.
You can verify that the pods are running with TDX protection by checking the runtime class name, e.g.:

```bash
# Take the first pod whose line matches chatqna-tgi, so POD_NAME is a single name
POD_NAME=$(kubectl get pods --no-headers | awk '/chatqna-tgi/ {print $1; exit}')
kubectl get pod "$POD_NAME" -o jsonpath='{.spec.runtimeClassName}'
```

In the output you should see:

```text
kata-qemu-tdx
```

This is a simple indicator that the pod is running in a Trust Domain protected by Intel TDX.
However, for a production use-case, the attestation process is crucial to verify the integrity of the pod.
You may read more about how to enable attestation [here](https://cc-enabling.trustedservices.intel.com/intel-confidential-containers-guide/03/demo_workload_deployment/#pod-isolated-by-kata-containers-protected-with-intel-tdx-and-quote-verified-using-intel-trust-authority).


### Customization of deployment configuration
If you want more control over what is protected with Intel TDX, or want to use a different deployment file, you can manually modify the deployment configuration by following the steps below:
1. Run the script to modify the chosen services with the changes described in [previous section](#overview-of-the-changes-needed):
1. Run the script to apply changes only to the chosen `SERVICES` on the `FILE` of your choice:

```bash
SERVICES=("llm-uservice")
@@ -167,11 +95,10 @@ If you want to have more control over what is protected with Intel TDX or use a
done
```

2. For each service, edit the deployment file to define the resources that must be assigned to the pod to run the service efficiently:
2. For each service from `SERVICES`, edit the deployment `FILE` to define the resources that must be assigned to the pod to run the service efficiently:

- The resources must be defined in the `resources` section of the pod's container definition.
- The `memory` must be at least 2x the image size.
- The `cpu` and `memory` resources must be defined at least in the `limits` section.
- By default, the pod is assigned 1 CPU and 2048 MiB of memory, but half of that memory is used for the filesystem.

3. Apply the changes to the deployment configuration:
@@ -180,6 +107,10 @@ If you want to have more control over what is protected with Intel TDX or use a
kubectl apply -f chatqna.yaml
```
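The sizing rules for TDX-protected pods (memory of at least 2x the image size, and the total assigned to TDX-protected pods staying below what the node has) are simple arithmetic. The following is a minimal sketch, assuming all sizes are already known in MiB; the function names `min_tdx_memory_mib` and `tdx_pods_fit` are hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the sizing rules; all values are in MiB.

# Print the minimum memory limit for a pod, given its image size,
# following the "at least 2x the image size" rule.
min_tdx_memory_mib() {
  local image_mib="$1"
  echo $(( image_mib * 2 ))
}

# Succeed only if the memory of all TDX-protected pods together stays strictly
# below the node's memory. Arguments: node memory, then one value per pod.
tdx_pods_fit() {
  local node_mib="$1"
  shift
  local total=0 pod_mib
  for pod_mib in "$@"; do
    total=$(( total + pod_mib ))
  done
  (( total < node_mib ))
}
```

For example, `min_tdx_memory_mib 6000` prints `12000`, and `tdx_pods_fit 65536 12000 4096` succeeds because 16096 MiB is below the node's 65536 MiB. How much headroom to leave for non-TDX pods is a judgment call that this sketch does not encode.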

### Troubleshooting
> [!IMPORTANT]
> The total amount of resources assigned to all TDX-protected pods must be less than the total amount of resources available on the node, leaving room for the requests of non-TDX pods.


## Troubleshooting

If you run into problems with pod creation, refer to the [Troubleshooting guide](https://cc-enabling.trustedservices.intel.com/intel-confidential-containers-guide/04/troubleshooting/).
