Add disabled PV re-provisioning by StorageClasses option on restore #8287

Open · wants to merge 2 commits into base: main
Conversation

@clcondorcet commented Oct 11, 2024

Thank you for contributing to Velero!

Please add a summary of your change

This PR addresses a specific disaster recovery use case:

When restoring a cluster after a disaster, PVs may or may not have associated snapshots.
However, the underlying volumes from the CSI driver remain intact. I want to relink the restored PVs to the existing volumes instead of creating new ones. This PR proposes a way to retrieve existing volumes without recreating them.

In my case, the PVs have a reclaim policy of Delete enforced by the StorageClass.
Velero currently does not restore PVs with a Delete policy and no snapshot, which makes sense for regular backup/restore scenarios but can be limiting in disaster recovery situations.

Proposed solution:

This PR introduces a new feature that allows Velero to restore PVs as-is when they meet the following conditions:

  • No snapshot is available.
  • The PV's StorageClass is listed in a newly introduced field in the CRD specifications called disabledPVReprovisioningStorageClasses.

This ensures that Velero can relink the PV to its existing volume and rebind the associated PVC, similar to how it restores PVs with a Reclaim Policy Retain.
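
For illustration, a Restore that opts specific StorageClasses out of re-provisioning might look roughly like the sketch below. The field name comes from this PR's diff; the backup name and StorageClass names are placeholders, and the exact schema depends on how the change lands.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: dr-restore
  namespace: velero
spec:
  backupName: daily-backup          # placeholder backup name
  # Hypothetical values: PVs that have no snapshot and whose StorageClass is
  # listed here would be restored as-is and relinked to their existing volumes
  # instead of being re-provisioned.
  disabledPVReprovisioningStorageClasses:
    - ceph-rbd
    - longhorn
```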

Implementation Details:

This solution uses StorageClass names rather than PV names, as some CSI drivers generate random PV names.
Also, using StorageClass makes sense in this context because PVs sharing the same StorageClass generally have similar configurations.

This solution does not bypass snapshots. If snapshots are available, restoring from them is preferred as it is generally more reliable. However, this is open for discussion.

Does your change fix a particular issue?

There is no direct issue, but I found two existing issues that are close to what I've added.


@BarthV commented Oct 11, 2024

Awesome! We are really interested in this annotation in order to recover all our existing volumes.

@Lyndon-Li (Contributor) commented:

However, the underlying volumes from the CSI driver remain intact

Does this mean a coincidental and unsteady situation -- the CSI driver simply didn't get a chance to reclaim the storage volumes of the PVs because of the disaster?
If so, the fate of those volumes is determined by the specific storage backend, and it is unsafe to reuse them.

@Depyume commented Oct 12, 2024

However, the underlying volumes from the CSI driver remain intact

Does this mean a coincidental and unsteady situation -- the CSI driver simply didn't get a chance to reclaim the storage volumes of the PVs because of the disaster? If so, the fate of those volumes is determined by the specific storage backend, and it is unsafe to reuse them.

I think the goal here is to support cases where the entire Kubernetes cluster is lost (e.g. burnt bare-metal servers) and you still want to recover volumes that may still exist on the storage backend, regardless of the reclaim policies of the backed-up volume objects.

Currently Velero doesn't allow "recovering" / "reattaching" such volumes; it either ignores them or forces an empty volume to be recreated (expecting it to be restored from a snapshot).

@reasonerjt added the "Needs triage" label (We need discussion to understand problem and decide the priority) on Oct 14, 2024
@felfa01 commented Nov 20, 2024

Stumbled upon this PR and it seems to be solving a use case for Disaster Recovery that I have. All my PVs are created with reclaimPolicy: Delete via the StorageClass but for DR scenarios where the backend storage is left intact I'd like to have the option of PVs reconnecting to the backend storage instead of being re-created.

I am able to do this with some manual intervention pre-backup:

  • Modify the PV to remove any external-provisioner finalizers and change it to reclaimPolicy: Retain
  • Run the Backup
  • Mimic a disaster by wiping the cluster
  • Re-create the cluster and run a restore
  • See that all PVs are recreated but are reconnected to the intact backend storage

I have tried automating the manual step of modifying the PVs pre-backup, but this does not appear to be supported. From what I can see, this PR is in line with what I am after. Looking forward to it.
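
For reference, the pre-backup modification in the first step above roughly amounts to leaving each PV in a state like the fragment below before the backup runs (the PV name, StorageClass, CSI driver, and volumeHandle are placeholders, and only the relevant fields are shown):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0123abcd                      # placeholder PV name
  finalizers: []                          # external-provisioner finalizers removed
spec:
  persistentVolumeReclaimPolicy: Retain   # changed from Delete before the backup
  storageClassName: ceph-rbd              # placeholder StorageClass
  csi:
    driver: rbd.csi.ceph.com              # placeholder CSI driver name
    volumeHandle: vol-0123abcd            # existing volume on the storage backend
```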

@Depyume commented Nov 20, 2024

A rebase on this PR would be great :)

@clcondorcet (Author) commented:

Stumbled upon this PR and it seems to be solving a use case for Disaster Recovery that I have. All my PVs are created with reclaimPolicy: Delete via the StorageClass but for DR scenarios where the backend storage is left intact I'd like to have the option of PVs reconnecting to the backend storage instead of being re-created.

That's exactly the idea! I'm glad to see that this feature is wanted by others.

Any update from the maintainers?

@kaovilai (Member) left a review comment:

disabledPVReprovisioningStorageClasses:
  description: |-
    DisabledPVReprovisioningStorageClasses is a slice of StorageClasses names.
    PV without snaptshot and having one of these StorageClass will not be

Suggested change:
- PV without snaptshot and having one of these StorageClass will not be
+ PV without snapshot and having one of these StorageClass will not be

@@ -129,6 +129,13 @@ type RestoreSpec struct {
    // +optional
    // +nullable
    UploaderConfig *UploaderConfigForRestore `json:"uploaderConfig,omitempty"`

    // DisabledPVReprovisioningStorageClasses is a slice of StorageClasses names.
    // PV without snaptshot and having one of these StorageClass will not be

Suggested change:
- // PV without snaptshot and having one of these StorageClass will not be
+ // PV without snapshot and having one of these StorageClass will not be

    obj *unstructured.Unstructured,
    logger logrus.FieldLogger,
) (*unstructured.Unstructured, error) {
    logger.Infof("Restoring persistent volume as-is because it doesn't have a snapshot and it's storage class has re-provisionning disabled.")

Suggested change:
- logger.Infof("Restoring persistent volume as-is because it doesn't have a snapshot and it's storage class has re-provisionning disabled.")
+ logger.Infof("Restoring persistent volume as-is because it doesn't have a snapshot and restore has storage class re-provisionning disabled.")

@Lyndon-Li (Contributor) commented:

However, the underlying volumes from the CSI driver remain intact

Does this mean a coincidental and unsteady situation -- the CSI driver simply didn't get a chance to reclaim the storage volumes of the PVs because of the disaster? If so, the fate of those volumes is determined by the specific storage backend, and it is unsafe to reuse them.

I think the goal here is to support cases where the entire Kubernetes cluster is lost (e.g. burnt bare-metal servers) and you still want to recover volumes that may still exist on the storage backend, regardless of the reclaim policies of the backed-up volume objects.

Currently Velero doesn't allow "recovering" / "reattaching" such volumes; it either ignores them or forces an empty volume to be recreated (expecting it to be restored from a snapshot).

The thing is, the volumes may or may not still exist in the storage, and the situation is not steady -- operations may or may not succeed.
From the perspective of Kubernetes, the volumes in the storage are already abandoned objects. Therefore, provisioning a new volume and restoring data into it is the more rational approach.

Could you clarify why this relink approach is a must-have, and what the problem is with the current approach of creating a new volume and restoring data into it?

Labels: has-changelog, has-unit-tests, Needs triage