Error VolumeSnapshot vaicloud-dev/cephfs-pvc-snapshot does not have a velero.io/csi-volumesnapshot-handle annotation #8444

Open
erichevers opened this issue Nov 22, 2024 · 12 comments
Assignees
Labels
area/datamover Needs info Waiting for information

Comments

@erichevers

erichevers commented Nov 22, 2024

What steps did you take and what happened:
I did a restore to cluster dr01 from a backup of a PVC on rook-ceph on cluster prod01, with:

  • velero restore create restore-test --include-namespaces vaicloud-dev --from-backup vaicloud-dev-backup22112024-2
    The backup was made using --snapshot-move-data to S3-compatible storage

What did you expect to happen:
I expected the restore to succeed, but I got the following error:

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    vaicloud-dev:  error preparing volumesnapshots.snapshot.storage.k8s.io/vaicloud-dev/cephfs-pvc-snapshot: rpc error: code = Unknown desc = VolumeSnapshot vaicloud-dev/cephfs-pvc-snapshot does not have a velero.io/csi-volumesnapshot-handle annotation

I have set the requested annotation on rbd and cephfs, on both clusters. Also, the volumes that need to be restored use the rook-ceph-block StorageClass, not cephfs as the error message indicates, so I'm wondering why this restore fails with a reference to cephfs.
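For reference, as far as I understand the annotation Velero complains about is derived from the snapshot's status at backup time, so a quick sanity check on the source cluster (just a sketch, assuming the snapshot still exists there) would be:

# prints the bound VolumeSnapshotContent name and whether the snapshot is ready to use
kubectl get volumesnapshot cephfs-pvc-snapshot -n vaicloud-dev \
  -o jsonpath='{.status.boundVolumeSnapshotContentName}{"\n"}{.status.readyToUse}{"\n"}'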

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, refer to velero debug --help.
bundle-2024-11-22-16-16-21.tar.gz

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version): Version 1.15.0 on both clusters
  • Velero features (use velero client config get features): features: EnableCSI
  • Kubernetes version (use kubectl version): 1.30.0 on the prod01 cluster (backup) and 1.31.1 on the dr01 (restore)
  • Kubernetes installer & version: RKE2
  • Cloud provider or hardware configuration: Bare-metal
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" to the upper right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@blackpiglet
Contributor

This is not expected.
The restore's referenced backup already enabled the SnapshotMoveData flag, so the restore should not use the CSI plugin to restore the volume data.
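One way to confirm the flag on the restore-referenced backup (a sketch, assuming the Backup CR still exists in the velero namespace):

# should print "true" if data movement was enabled for this backup
kubectl get backup vaicloud-dev-backup22112024-2 -n velero -o jsonpath='{.spec.snapshotMoveData}{"\n"}'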

@blackpiglet
Contributor

@erichevers
Could you help collect the debug bundle of the restore-referenced backup?
IMO, the error happened due to the restore trying to restore the VolumeSnapshot CR.
This is not expected.

  • If the VolumeSnapshot was created during the backup, the correct behavior is for Velero to delete it at the end of the backup, so something must have gone wrong there.
  • If the VolumeSnapshot existed before Velero ran the backup, that is also unexpected, because the CSI BackupItemAction (BIA) should have updated it with the needed information (see the check sketched below the log excerpt).
time="2024-11-22T15:01:11Z" level=info msg="Executing item action for volumesnapshots.snapshot.storage.k8s.io" logSource="pkg/restore/restore.go:1321" restore=velero/restore-test
time="2024-11-22T15:01:11Z" level=info msg="Starting VolumeSnapshotRestoreItemAction" cmd=/velero logSource="pkg/restore/actions/csi/volumesnapshot_action.go:78" pluginName=velero restore=velero/restore-test

@erichevers
Author

Hi @blackpiglet ,
Thanks for looking into this. The debug backup logs are here:
bundle-2024-11-25-10-16-20.tar.gz

@blackpiglet
Contributor

Thanks for collecting the debug bundle.

There were three VolumeSnapshots included in the backup, and they were not created by the backup.

  snapshot.storage.k8s.io/v1/VolumeSnapshot:
    - vaicloud-dev/cephfs-pvc-snapshot
    - vaicloud-dev/velero-vaicloud-mq-volume-rsjnj
    - vaicloud-dev/velero-vaicloud-postgresql-volume-bgfhz

Velero also ran the VolumeSnapshot BackupItemAction against them.

The only reason the restore failed to restore the VolumeSnapshots is that the backed-up VolumeSnapshots either didn't have a Status or their Status didn't contain the SnapshotHandle when the backup ran.
Could you please check the content of those VolumeSnapshots?
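For example, something like this should show whether a snapshot handle was ever populated (a sketch; substitute the content name printed by the first command):

# shows the bound VolumeSnapshotContent, if any
kubectl get volumesnapshot cephfs-pvc-snapshot -n vaicloud-dev -o jsonpath='{.status.boundVolumeSnapshotContentName}{"\n"}'
# shows the CSI snapshot handle on that content
kubectl get volumesnapshotcontent <content-name> -o jsonpath='{.status.snapshotHandle}{"\n"}'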

@erichevers
Author

erichevers commented Nov 25, 2024

Hi @blackpiglet ,

On the prod01 cluster I've checked the VolumeSnapshots, and indeed there are three:

kubectl get volumesnapshots
NAME                                      READYTOUSE   SOURCEPVC                    SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                SNAPSHOTCONTENT                                    CREATIONTIME   AGE
cephfs-pvc-snapshot                       false        cephfs-pvc                                                         csi-cephfsplugin-snapclass                                                                     41d
velero-vaicloud-mq-volume-rsjnj           true         vaicloud-mq-volume                                   2Gi           csi-rbdplugin-snapclass      snapcontent-131fbc1b-92f6-4980-87e9-997d4aef74c3   3d7h           3d7h
velero-vaicloud-postgresql-volume-bgfhz   true         vaicloud-postgresql-volume                           30Gi          csi-rbdplugin-snapclass      snapcontent-89dd62b0-7dae-4297-83cb-dc4d1b97db86   3d7h           3d7h

I don't know where the cephfs snapshot came from, but the describe shows that it is in a failed state:

kubectl describe volumesnapshot cephfs-pvc-snapshot
Status:
  Error:
    Message:     Failed to create snapshot content with error snapshot controller failed to update cephfs-pvc-snapshot on API server: cannot get claim from snapshot
    Time:        2024-10-15T13:35:37Z
  Ready To Use:  false

I also looked at the other two snapshots and they came from another backup.
I've deleted all three VolumeSnapshots and started a new backup:
velero backup create vaicloud-dev-backup22112024-4 --include-namespaces vaicloud-dev --snapshot-move-data

Snapshots got created and removed, as they should be.
Then I moved over to the dr01 cluster and did a restore:

velero restore create restore-test --include-namespaces vaicloud-dev --from-backup vaicloud-dev-backup22112024-4

This time there was no message about the VolumeSnapshot as in the original case. However, the restore stays in:
WaitingForPluginOperations

Below is the debug logfile of the restore:
bundle-2024-11-25-21-09-59.tar.gz

Regards

@blackpiglet
Contributor

From the log, I think the restore worked as expected.
How long did the restore take to complete?

The data mover restore may take longer than the CSI snapshot restore, because it needs to create a temporary pod and PVC to host the restored data.
The restore time also depends on the amount of restored volume data.
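If you want to see what the data mover is doing while the restore is in WaitingForPluginOperations, something like this should help (a sketch, assuming Velero's data mover CRs in the default velero namespace):

# DataDownload CRs track the data mover restore progress
kubectl get datadownloads -n velero
# temporary restore pods and PVCs created by the node-agent also show up here
kubectl get pods,pvc -n velero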

@blackpiglet
Contributor

blackpiglet commented Nov 26, 2024

To clarify the scenario of this issue:

  • The error reported by this issue is not a common case for the Velero CSI snapshot data mover.
  • The error was triggered by a failed VolumeSnapshot. The VolumeSnapshot didn't have a snapshot handle, and it was not created by the restore-referenced backup.

Although this is a rainy-day case, we may also consider whether Velero should handle it instead of reporting an error.

@blackpiglet blackpiglet assigned blackpiglet and unassigned Lyndon-Li Nov 26, 2024
@erichevers
Author

erichevers commented Nov 26, 2024

From the log, I think the restore worked as expected. How long did the restore take to complete?

The data mover restore may take longer than the CSI snapshot restore, because it needs to create a temporary pod and PVC to host the restored data. The restore time also depends on the amount of restored volume data.

Hi @blackpiglet ,
I just checked, and the job failed after the standard 4-hour timeout, with the following error:

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    vaicloud-dev:  fail to patch dynamic PV, err: context deadline exceeded, PVC: vaicloud-postgresql-volume, PV: pvc-59e5c022-2214-4ef4-a24b-f8afff278041
                   fail to patch dynamic PV, err: context deadline exceeded, PVC: vaicloud-mq-volume, PV: pvc-b6de9f16-8a9d-41ae-ae2e-a5fc715377c0

And the pods are still in Pending.

Regards

@blackpiglet
Contributor

@erichevers
I found a similar issue #7866
Could you check the status of the PVCs vaicloud-postgresql-volume and vaicloud-mq-volume?
IMO, they did not end up in the Bound phase after being created by the restore.
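For example (a sketch):

# the PVCs created by the restore should reach the Bound phase
kubectl get pvc -n vaicloud-dev
kubectl describe pvc vaicloud-mq-volume -n vaicloud-dev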

@erichevers
Author

erichevers commented Nov 26, 2024

@blackpiglet ,
The PVCs are in Pending.

kubectl describe pvc vaicloud-postgresql-volume -n vaicloud-dev
gives:
Name:          vaicloud-postgresql-volume
Namespace:     vaicloud-dev
StorageClass:  rook-ceph-block
Status:        Pending
Volume:        
Labels:        velero.io/backup-name=vaicloud-dev-backup22112024-4
               velero.io/restore-name=restore-test
               velero.io/volume-snapshot-name=velero-vaicloud-postgresql-volume-jkmhl
Annotations:   backup.velero.io/must-include-additional-items: true
               velero.io/csi-volumesnapshot-class: csi-rbdplugin-snapclass
               volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
               volume.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Used By:       vaicloud-db-7846b4c4cd-25k8w
Events:
  Type    Reason                Age                     From                                                                                                        Message
  ----    ------                ----                    ----                                                                                                        -------
  Normal  ExternalProvisioning  3m53s (x2923 over 12h)  persistentvolume-controller                                                                                 Waiting for a volume to be created either by the external provisioner 'rook-ceph.rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal  Provisioning          27s (x205 over 12h)     rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-54b4855f96-b95cx_0ee214d7-199b-4fcc-8748-dfa6b513df21  External provisioner is provisioning volume for claim "vaicloud-dev/vaicloud-postgresql-volume"

Regards

@blackpiglet blackpiglet added Needs info Waiting for information and removed Needs investigation labels Nov 26, 2024
@blackpiglet
Contributor

Waiting for a volume to be created either by the external provisioner 'rook-ceph.rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

External provisioner is provisioning volume for claim "vaicloud-dev/vaicloud-postgresql-volume"

To me, the error seems related to Rook Ceph not creating a volume for the PVC in time.
Could you check whether there are any error logs in the Rook Ceph pods?
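For example, something like this (a sketch, assuming the default rook-ceph namespace and Rook's usual deployment names; adjust to your install):

# look for provisioning errors from the RBD CSI provisioner and the operator
kubectl -n rook-ceph get pods
kubectl -n rook-ceph logs deploy/csi-rbdplugin-provisioner -c csi-provisioner --tail=100
kubectl -n rook-ceph logs deploy/rook-ceph-operator --tail=100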

@blackpiglet
Contributor

To clarify the scenario of this issue:

  • The error reported by this issue is not a common case for the Velero CSI snapshot data mover.
  • The error was triggered by a failed VolumeSnapshot. The VolumeSnapshot didn't have a snapshot handle, and it was not created by the restore-referenced backup.

Although this is a rainy-day case, we may also consider whether Velero should handle it instead of reporting an error.

Created a new issue #8460 to address this comment.
