Add instruction of management cluster backup and restore (#6958)
jiayiwang7 authored Nov 2, 2023
1 parent 668f7ff commit 08fa812
Showing 4 changed files with 233 additions and 1 deletion.
@@ -0,0 +1,9 @@
---
title: "Backup and restore cluster"
linkTitle: "Backup and restore cluster"
weight: 50
aliases:
/docs/tasks/cluster/cluster-backup-restore/
description: >
How to backup and restore your cluster
---
@@ -0,0 +1,50 @@
---
title: "Backup cluster"
linkTitle: "Backup cluster"
weight: 20
aliases:
/docs/tasks/cluster/cluster-backup-restore/backup-cluster/
description: >
How to backup your EKS Anywhere cluster
---

We strongly advise performing regular backups of all your EKS Anywhere clusters. This ensures that you always have an up-to-date cluster state available for restoration in case a cluster experiences issues or becomes unrecoverable. This document outlines the steps for creating the two essential types of backups required for the [EKS Anywhere cluster restore process]({{< relref "./restore-cluster" >}}).

## Etcd backup

For optimal cluster maintenance, it is crucial to perform regular etcd backups on all your EKS Anywhere management and workload clusters. **Always** take an etcd backup before performing an upgrade so it can be used to restore the cluster to a previous state in the event of a cluster upgrade failure. To create an etcd backup for your cluster, follow the guidelines provided in the [External etcd backup and restore]({{< relref "../etcd-backup-restore/etcdbackup" >}}) section.
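
That section covers the provider-specific steps in detail. As a rough, minimal sketch only: an etcd snapshot is typically taken with `etcdctl` on one of the etcd machines. The endpoint and certificate paths below are placeholders and will differ in your environment, so use the values from the linked guide.

```bash
# Illustrative sketch only: take an etcd snapshot with etcdctl.
# The endpoint and certificate paths below are placeholders; use the values
# documented in the External etcd backup and restore guide for your provider.
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup-$(date +%Y-%m-%dT%H_%M_%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/server.crt \
  --key=/etc/etcd/pki/server.key
```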


## Cluster API backup

Since cluster failures primarily occur following unsuccessful cluster upgrades, EKS Anywhere proactively creates automatic backups of the Cluster API objects that capture the state of both the management cluster and its workload clusters, provided all the clusters are in a ready state. If one of the workload clusters is not ready, EKS Anywhere makes a best-effort attempt to back up the management cluster itself. These backups are stored in the management cluster folder on the Admin machine from which the upgrade command is run, and are generated before each management cluster upgrade. For example, after executing a cluster upgrade command on `mgmt-cluster`, a backup folder is generated with the naming convention `cluster-state-backup-${timestamp}`:

```bash
mgmt-cluster/
├── cluster-state-backup-2023-10-11T02_55_56 <------ Folder with a backup of the CAPI objects
├── mgmt-cluster-eks-a-cluster.kubeconfig
├── mgmt-cluster-eks-a-cluster.yaml
└── generated
```

Although the likelihood of a cluster failure occurring without any associated cluster upgrade operation is relatively low, it is still recommended to manually back up these Cluster API objects on a routine basis. For example, to create a Cluster API backup of a cluster:


```bash
MGMT_CLUSTER="mgmt"
MGMT_CLUSTER_KUBECONFIG=${MGMT_CLUSTER}/${MGMT_CLUSTER}-eks-a-cluster.kubeconfig
BACKUP_DIRECTORY=backup-mgmt

# Substitute the EKS Anywhere release version with whatever CLI version you are using
EKSA_RELEASE_VERSION=v0.17.3
BUNDLE_MANIFEST_URL=$(curl -s https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml | yq ".spec.releases[] | select(.version==\"$EKSA_RELEASE_VERSION\").bundleManifestUrl")
CLI_TOOLS_IMAGE=$(curl -s $BUNDLE_MANIFEST_URL | yq ".spec.versionsBundles[0].eksa.cliTools.uri")


docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} move \
--namespace eksa-system \
--kubeconfig $MGMT_CLUSTER_KUBECONFIG \
--to-directory ${BACKUP_DIRECTORY}
```

This saves the Cluster API objects of the management cluster `mgmt`, along with all of its workload clusters, to a local directory under the `backup-mgmt` folder.
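
You can verify that the backup was written by listing the target directory. The exact file names depend on the objects in your cluster, but you should see one YAML file per Cluster API object:

```bash
# Each Cluster API object moved by clusterctl is written out as a YAML file.
ls -l backup-mgmt/
```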
@@ -0,0 +1,164 @@
---
title: "Restore cluster"
linkTitle: "Restore cluster"
weight: 20
aliases:
/docs/tasks/cluster/cluster-backup-restore/restore-cluster/
description: >
How to restore your EKS Anywhere cluster from backup
---

An EKS Anywhere cluster can end up in an unrecoverable state due to factors such as a failed cluster upgrade, underlying infrastructure problems, or network issues, rendering the cluster inaccessible through conventional means. This document outlines detailed steps to guide you through the process of restoring a failed cluster from backups in these critical situations.

## Prerequisite

Always back up your EKS Anywhere cluster. Refer to the [Backup cluster]({{< relref "./backup-cluster" >}}) page and make sure you have up-to-date etcd and Cluster API backups at hand.

## Restore a management cluster

Because an EKS Anywhere management cluster contains the management components for itself and for all the workload clusters it manages, the restoration process can be more complicated than simply restoring all the objects from the etcd backup. To be more specific, the management cluster stores all the core EKS Anywhere and Cluster API custom resources that manage the lifecycle (provisioning, upgrading, operating, etc.) of the management cluster and its workload clusters. These resources represent all the supporting infrastructure, such as virtual machines, networks, and load balancers. For example, after a failed cluster upgrade, the infrastructure components may have changed since the etcd backup was taken. Because the backup does not contain the new state of the half-upgraded cluster, simply restoring it can create virtual machine UUID and IP mismatches, leaving EKS Anywhere unable to heal the cluster.

Depending on whether the infrastructure components have changed since the etcd backup was taken (for example, if machines were rolled out and recreated and new IP addresses were assigned to them), a different strategy must be applied to restore the management cluster.

### Cluster accessible and the infrastructure components not changed after etcd backup was taken

If the management cluster is still accessible through the API server, and the underlying infrastructure layer (nodes, machines, VMs, etc.) has not changed since the etcd backup was taken, simply follow the [External etcd backup and restore]({{< relref "../etcd-backup-restore/etcdbackup" >}}) guide to restore the management cluster from the backup.

{{% alert title="Warning" color="warning" %}}

Do not apply the etcd restore unless you are certain that the infrastructure layer has not changed since the etcd backup was taken. In other words, the nodes, machines, VMs, and their assigned IPs need to be exactly the same as when the backup was taken.

{{% /alert %}}
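
One way to sanity-check this, assuming the API server is still reachable, is to list the current machines and their addresses and compare them against the state at the time the backup was taken (the cluster name below follows the earlier examples):

```bash
MGMT_CLUSTER="mgmt"
MGMT_CLUSTER_KUBECONFIG=${MGMT_CLUSTER}/${MGMT_CLUSTER}-eks-a-cluster.kubeconfig

# List the Cluster API machines with their provider IDs and internal IPs.
# These should match what existed when the etcd backup was taken.
kubectl get machines -n eksa-system \
    -o custom-columns="NAME:.metadata.name,PROVIDERID:.spec.providerID,IP:.status.addresses[?(@.type=='InternalIP')].address" \
    --kubeconfig ${MGMT_CLUSTER_KUBECONFIG}
```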

### Cluster not accessible or infrastructure components changed after etcd backup was taken

If the cluster is no longer accessible by any means, or the infrastructure machines have changed since the etcd backup was taken, restoring the management cluster from the outdated etcd backup will not work. Instead, you need to create a new management cluster and migrate all the EKS Anywhere resources of the existing workload clusters to it, so that the new management cluster takes over ownership of managing the existing workload clusters. Below is an example of migrating a failed management cluster `mgmt-old`, with its workload clusters `w01` and `w02`, to a new management cluster `mgmt-new`:

1. Create a new management cluster to which you will be migrating your workload clusters later.

You can define a cluster config similar to that of your old management cluster, and create the new management cluster with the **exact same EKS Anywhere version** that was used to create the old management cluster.

If the original management cluster still exists with old infrastructure running, you need to create a new management cluster with a **different cluster name** to avoid conflict.

```sh
eksctl anywhere create cluster -f mgmt-new.yaml
```
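
Before moving any resources, it can help to confirm that the new management cluster is up and reachable, for example:

```bash
MGMT_CLUSTER_NEW="mgmt-new"
MGMT_CLUSTER_NEW_KUBECONFIG=${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig

# All nodes of the new management cluster should report Ready before you
# migrate the workload cluster resources into it.
kubectl get nodes --kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
```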

1. Move the custom resources of all the workload clusters to the new management cluster created above.

Using the vSphere provider as an example, we are moving the Cluster API custom resources, such as `vspherevms`, `vspheremachines` and `machines` of the **workload clusters**, from the old management cluster to the new management cluster created in the step above. By using the `--filter-cluster` flag with the `clusterctl move` command, we only target the custom resources of the workload clusters.


```bash
# Use the same cluster name if the newly created management cluster has the same cluster name as the old one
MGMT_CLUSTER_OLD="mgmt-old"
MGMT_CLUSTER_NEW="mgmt-new"
MGMT_CLUSTER_NEW_KUBECONFIG=${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
WORKLOAD_CLUSTER_1="w01"
WORKLOAD_CLUSTER_2="w02"
# Substitute the workspace path with the workspace you are using
WORKSPACE_PATH="/home/ec2-user/eks-a"
# Retrieve the Cluster API backup folder path that was automatically generated during the cluster upgrade
# This folder contains all the resources that represent the cluster state of the old management cluster along with its workload clusters
CLUSTER_STATE_BACKUP_LATEST=$(ls -Art ${WORKSPACE_PATH}/${MGMT_CLUSTER_OLD} | grep 'cluster-state-backup' | tail -1)
CLUSTER_STATE_BACKUP_LATEST_PATH=${WORKSPACE_PATH}/${MGMT_CLUSTER_OLD}/${CLUSTER_STATE_BACKUP_LATEST}/
# Substitute the EKS Anywhere release version with the EKS Anywhere version of the original management cluster
EKSA_RELEASE_VERSION=v0.17.3
BUNDLE_MANIFEST_URL=$(curl -s https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml | yq ".spec.releases[] | select(.version==\"$EKSA_RELEASE_VERSION\").bundleManifestUrl")
CLI_TOOLS_IMAGE=$(curl -s $BUNDLE_MANIFEST_URL | yq ".spec.versionsBundles[0].eksa.cliTools.uri")
# The clusterctl move command needs to be executed for each workload cluster.
# It will only move the workload cluster resources from the EKS Anywhere backup to the new management cluster.
# If you have multiple workload clusters, you have to run the command for each cluster as shown below.
# Move workload cluster w01 resources to the new management cluster mgmt-new
docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} move \
--namespace eksa-system \
--filter-cluster ${WORKLOAD_CLUSTER_1} \
--from-directory ${CLUSTER_STATE_BACKUP_LATEST_PATH} \
--to-kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
# Move workload cluster w02 resources to the new management cluster mgmt-new
docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} move \
--namespace eksa-system \
--filter-cluster ${WORKLOAD_CLUSTER_2} \
--from-directory ${CLUSTER_STATE_BACKUP_LATEST_PATH} \
--to-kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
```
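
After the move completes, you can optionally spot-check that the workload cluster objects now exist in the new management cluster (the provider-specific kinds shown here are the vSphere examples mentioned above):

```bash
# The machines and vSphere-specific objects of w01 and w02 should now be
# listed in the eksa-system namespace of the new management cluster.
kubectl get machines,vspheremachines,vspherevms -n eksa-system \
    --kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
```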

1. (Optional) Update the cluster config file of the workload clusters if the new management cluster has a different cluster name than the original management cluster.

You can **skip this step** if the new management cluster has the same cluster name as the old management cluster.

```yaml
# workload cluster w01
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: w01
  namespace: default
spec:
  managementCluster:
    name: mgmt-new # This needs to be updated with the new management cluster name.
  ...
```

```yaml
# workload cluster w02
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: w02
  namespace: default
spec:
  managementCluster:
    name: mgmt-new # This needs to be updated with the new management cluster name.
  ...
```

Make sure that, apart from the `managementCluster` field you updated above, all the other fields in the workload cluster configs stay the same as in the old workload cluster resources after the old management cluster failure.

Apply the updated cluster config of each workload cluster to the new management cluster.

```bash
MGMT_CLUSTER_NEW="mgmt-new"
MGMT_CLUSTER_NEW_KUBECONFIG=${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
kubectl apply -f w01/w01-eks-a-cluster.yaml --kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
kubectl apply -f w02/w02-eks-a-cluster.yaml --kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
```

1. Validate all clusters are in the desired state.

```bash
kubectl get clusters -n default -o custom-columns="NAME:.metadata.name,READY:.status.conditions[?(@.type=='Ready')].status" --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
NAME       READY
mgmt-new   True
w01        True
w02        True
kubectl get clusters.cluster.x-k8s.io -n eksa-system --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
NAME       PHASE         AGE
mgmt-new   Provisioned   11h
w01        Provisioned   11h
w02        Provisioned   11h
kubectl get kcp -n eksa-system --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
NAME       CLUSTER    INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
mgmt-new   mgmt-new   true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
w01        w01        true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
w02        w02        true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
```

## Restore a workload cluster

Restoring a workload cluster is a delicate process. If you have an [EKS Anywhere Enterprise Subscription](https://aws.amazon.com/eks/eks-anywhere/pricing/), please contact the AWS support team if you wish to perform such an operation.
@@ -11,7 +11,16 @@ External ETCD topology is supported for vSphere, CloudStack and Snow clusters, b

This page contains steps for backing up a cluster by taking an ETCD snapshot, and restoring the cluster from a snapshot.

## Use case

EKS-Anywhere clusters use ETCD as the backing store. Taking a snapshot of ETCD backs up the entire cluster data. This can later be used to restore a cluster back to an earlier state if required.

ETCD backups can be taken prior to cluster upgrade, so if the upgrade doesn't go as planned, you can restore from the backup.

{{% alert title="Important" color="warning" %}}

Restoring to a previous cluster state is a destructive and destabilizing action to take on a running cluster. It should be considered only when all other options have been exhausted.

If you are able to retrieve data using the Kubernetes API server, then etcd is available and you should not restore using an etcd backup.

{{% /alert %}}
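
A quick way to confirm that the API server and its etcd backend are still healthy, assuming the cluster endpoint is reachable:

```bash
# The verbose readiness endpoint reports an individual check for etcd.
# If the output includes "[+]etcd ok", data is retrievable through the API
# server and an etcd restore should not be necessary.
kubectl get --raw='/readyz?verbose'
```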
