From efc13e955786d59867005d59c266c27105c21f84 Mon Sep 17 00:00:00 2001
From: Jiayi Wang
Date: Fri, 6 Oct 2023 16:18:34 -0400
Subject: [PATCH] Add instruction of restoring clusters from backup

---
 .../cluster-backup-restore/_index.md          |   9 ++
 .../cluster-backup-restore/restore-cluster.md | 146 ++++++++++++++++++
 2 files changed, 155 insertions(+)
 create mode 100644 docs/content/en/docs/clustermgmt/cluster-backup-restore/_index.md
 create mode 100644 docs/content/en/docs/clustermgmt/cluster-backup-restore/restore-cluster.md

diff --git a/docs/content/en/docs/clustermgmt/cluster-backup-restore/_index.md b/docs/content/en/docs/clustermgmt/cluster-backup-restore/_index.md
new file mode 100644
index 0000000000000..a2b9e52307e3b
--- /dev/null
+++ b/docs/content/en/docs/clustermgmt/cluster-backup-restore/_index.md
@@ -0,0 +1,9 @@
+---
+title: "Backup and restore cluster"
+linkTitle: "Backup and restore cluster"
+weight: 50
+aliases:
+    /docs/tasks/cluster/cluster-backup-restore/
+description: >
+  How to back up and restore your cluster
+---
diff --git a/docs/content/en/docs/clustermgmt/cluster-backup-restore/restore-cluster.md b/docs/content/en/docs/clustermgmt/cluster-backup-restore/restore-cluster.md
new file mode 100644
index 0000000000000..51affec741372
--- /dev/null
+++ b/docs/content/en/docs/clustermgmt/cluster-backup-restore/restore-cluster.md
@@ -0,0 +1,146 @@
+---
+title: "Restore cluster from backup"
+linkTitle: "Restore cluster from backup"
+weight: 20
+aliases:
+    /docs/tasks/cluster/cluster-backup-restore/restore-cluster/
+description: >
+  How to restore your EKS Anywhere cluster from backup
+---
+
+An EKS Anywhere cluster can end up in an unrecoverable state, for example after a failed cluster upgrade, underlying infrastructure problems, or network issues, leaving the cluster inaccessible through conventional means.
+This document describes how to restore a failed cluster from backups in these situations.
+
+## Restore a management cluster
+
+Because an EKS Anywhere management cluster contains the management components for itself and for all the workload clusters it manages, restoring it can be more complicated than restoring all the objects from the etcd backup. Specifically, all the core EKS Anywhere and Cluster API custom resources that manage the lifecycle (provisioning, upgrading, operating, etc.) of the management cluster and its workload clusters are stored in the management cluster. These resources also describe the supporting infrastructure, such as virtual machines, networks, and load balancers. After a failed cluster upgrade, for example, the infrastructure components may have changed since the etcd backup was taken. Because the backup does not contain the new state of the half-upgraded cluster, simply restoring it can create virtual machine UUID and IP mismatches, leaving EKS Anywhere unable to heal the cluster.
+
+Depending on whether the infrastructure components changed after the etcd backup was taken (for example, machines were rolled out and recreated, or new IP addresses were assigned to the machines), a different strategy is needed to restore the management cluster.
+
+### Infrastructure components not changed after etcd backup was taken
+
+If the underlying infrastructure machines have not changed since the etcd backup was taken, follow the [External etcd backup and restore]({{< relref "../etcd-backup-restore/etcdbackup" >}}) guide to restore the management cluster from the backup.
+
+TODO: what if cluster is not accessible?
+
+### Infrastructure components changed after etcd backup was taken
+
+If the infrastructure machines have changed since the etcd backup was taken, restoring this management cluster from the outdated etcd backup may not work.
+Instead, you need to create a new management cluster and migrate all the EKS Anywhere resources of the old workload clusters to it, so that the new management cluster takes over ownership of the existing workload clusters.
+
+1. Create a new management cluster to which you will be migrating your workload clusters later.
+
+    You can define a cluster config similar to that of your old management cluster, with a different cluster name, and create the new management cluster from it.
+
+    ```sh
+    eksctl anywhere create cluster -f mgmt-new.yaml
+    ```
+
+1. Move the custom resources of all the workload clusters to the new management cluster created above.
+
+    Using the vSphere provider as an example, this step moves the Cluster API custom resources of the **workload clusters**, such as `vspherevms`, `vspheremachines` and `machines`, from the old management cluster to the new management cluster created in the step above. Use the `--filter-cluster` flag with the `clusterctl move` command so that only the custom resources of the workload clusters are targeted.
+
+
+    ```bash
+    MGMT_CLUSTER_OLD="mgmt-old"
+    MGMT_CLUSTER_NEW="mgmt-new"
+    MGMT_CLUSTER_NEW_KUBECONFIG=${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
+    WORKLOAD_CLUSTER_1="w01"
+    WORKLOAD_CLUSTER_2="w02"
+
+    # Substitute the workspace path with the workspace you are using
+    WORKSPACE_PATH="/home/ec2-user/eks-a"
+    CLUSTER_STATE_BACKUP_LATEST=$(ls -Art ${WORKSPACE_PATH}/${MGMT_CLUSTER_OLD} | grep 'cluster-state-backup' | tail -1)
+    CLUSTER_STATE_BACKUP_LATEST_PATH=${WORKSPACE_PATH}/${MGMT_CLUSTER_OLD}/${CLUSTER_STATE_BACKUP_LATEST}/
+
+    # Substitute the container version with whatever EKS Anywhere CLI version you are using
+    CONTAINER=public.ecr.aws/eks-anywhere/cli-tools:v0.16.2-eks-a-41
+
+
+    # The clusterctl move command needs to be executed on each workload cluster you have.
+    # It only moves the workload cluster resources from the EKS Anywhere backup to the new management cluster. If you have multiple workload clusters, run the command for each cluster as shown below.
+
+    # Move workload cluster w01 resources to the new management cluster mgmt-new
+    docker run -i --network host -w $(pwd) -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/$(pwd) --entrypoint clusterctl ${CONTAINER} move \
+        --namespace eksa-system \
+        --filter-cluster ${WORKLOAD_CLUSTER_1} \
+        --from-directory ${CLUSTER_STATE_BACKUP_LATEST_PATH} \
+        --to-kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
+
+    # Move workload cluster w02 resources to the new management cluster mgmt-new
+    docker run -i --network host -w $(pwd) -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/$(pwd) --entrypoint clusterctl ${CONTAINER} move \
+        --namespace eksa-system \
+        --filter-cluster ${WORKLOAD_CLUSTER_2} \
+        --from-directory ${CLUSTER_STATE_BACKUP_LATEST_PATH} \
+        --to-kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
+    ```
+
+1. Edit the cluster config file of the workload clusters.
+
+    Update the cluster config file of every workload cluster to point to the new management cluster created in step 1. For example:
+
+    workload cluster w01
+
+    ```yaml
+    apiVersion: anywhere.eks.amazonaws.com/v1alpha1
+    kind: Cluster
+    metadata:
+      name: w01
+      namespace: default
+    spec:
+      managementCluster:
+        name: mgmt-new # This needs to be updated with the new management cluster name.
+      ...
+    ```
+
+    workload cluster w02
+
+    ```yaml
+    apiVersion: anywhere.eks.amazonaws.com/v1alpha1
+    kind: Cluster
+    metadata:
+      name: w02
+      namespace: default
+    spec:
+      managementCluster:
+        name: mgmt-new # This needs to be updated with the new management cluster name.
+      ...
+    ```
+
+    Apart from the `managementCluster` field you updated above, make sure that all other fields in the workload cluster configs stay the same as they were in the old workload cluster resources when the old management cluster failed.
+
+1. Apply the updated cluster config of the workload clusters in the new management cluster.
+
+    ```bash
+    MGMT_CLUSTER_NEW="mgmt-new"
+    MGMT_CLUSTER_NEW_KUBECONFIG=${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
+
+    kubectl apply -f w01/w01-eks-a-cluster.yaml --kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
+    kubectl apply -f w02/w02-eks-a-cluster.yaml --kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
+    ```
+
+1. Validate that all clusters are in the desired state.
+
+    ```bash
+    kubectl get clusters -n default --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
+
+    NAME       AGE
+    mgmt-new   13h
+    w01        11h
+    w02        11h
+
+    kubectl get clusters.cluster.x-k8s.io -n eksa-system --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
+
+    NAME       PHASE         AGE
+    mgmt-new   Provisioned   11h
+    w01        Provisioned   11h
+    w02        Provisioned   11h
+
+    kubectl get kcp -n eksa-system --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
+
+    NAME       CLUSTER    INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
+    mgmt-new   mgmt-new   true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
+    w01        w01        true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
+    w02        w02        true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
+    ```
+
+## Restore a workload cluster
+
+As with a failed management cluster whose infrastructure components have not changed, follow the [External etcd backup and restore]({{< relref "../etcd-backup-restore/etcdbackup" >}}) guide to restore the workload cluster from the backup.
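
The management cluster procedure in this patch repeats the same `clusterctl move` invocation once per workload cluster. A minimal sketch of how that repetition could be wrapped in a loop follows. It is not part of the patch, and it reuses the example values from the guide (`mgmt-old`, `mgmt-new`, `w01`, `w02`, the workspace path, and the container tag). The names `WORKLOAD_CLUSTERS`, `DRY_RUN`, and `move_workload_cluster` are hypothetical and introduced here for illustration only, as is the `<cluster-state-backup-dir>` placeholder used when no backup path is provided. By default the sketch only prints the commands it would run.

```bash
#!/bin/sh
# Sketch only: loops the guide's `clusterctl move` command over several
# workload clusters. WORKLOAD_CLUSTERS, DRY_RUN and move_workload_cluster
# are hypothetical names, not part of EKS Anywhere or clusterctl.
set -eu

MGMT_CLUSTER_NEW="${MGMT_CLUSTER_NEW:-mgmt-new}"
MGMT_CLUSTER_NEW_KUBECONFIG="${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig"

# Substitute the container version with the EKS Anywhere CLI version you use.
CONTAINER="${CONTAINER:-public.ecr.aws/eks-anywhere/cli-tools:v0.16.2-eks-a-41}"

# Assumption: the latest backup directory was already resolved as in the guide
# (ls | grep 'cluster-state-backup' | tail -1). The default is a placeholder.
CLUSTER_STATE_BACKUP_LATEST_PATH="${CLUSTER_STATE_BACKUP_LATEST_PATH:-<cluster-state-backup-dir>}"

# Space-separated list of workload cluster names (example values).
WORKLOAD_CLUSTERS="${WORKLOAD_CLUSTERS:-w01 w02}"

# Build the same command line as in the guide for one workload cluster.
# With DRY_RUN=1 (the default) the command is printed instead of executed.
move_workload_cluster() {
  cluster_name="$1"
  set -- docker run -i --network host -w "$(pwd)" \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v "$(pwd):/$(pwd)" \
    --entrypoint clusterctl "${CONTAINER}" move \
    --namespace eksa-system \
    --filter-cluster "${cluster_name}" \
    --from-directory "${CLUSTER_STATE_BACKUP_LATEST_PATH}" \
    --to-kubeconfig "${MGMT_CLUSTER_NEW_KUBECONFIG}"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Word splitting on the unquoted list is intentional here.
for name in ${WORKLOAD_CLUSTERS}; do
  move_workload_cluster "${name}"
done
```

Running the sketch with `DRY_RUN=0` and `CLUSTER_STATE_BACKUP_LATEST_PATH` set to the real backup directory would execute the moves. Reviewing the printed commands first is the safer choice, since a move changes which management cluster owns the workload cluster resources.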