Error VolumeSnapshot vaicloud-dev/cephfs-pvc-snapshot does not have a velero.io/csi-volumesnapshot-handle annotation #8444

Open
erichevers opened this issue Nov 22, 2024 · 12 comments
Assignees
Labels
area/datamover Needs info Waiting for information

Comments

@erichevers

erichevers commented Nov 22, 2024

What steps did you take and what happened:
I did a restore to cluster dr01 from a backup of a PVC on rook-ceph on cluster prod01, with:

  • velero restore create restore-test --include-namespaces vaicloud-dev --from-backup vaicloud-dev-backup22112024-2
    The backup was made using --snapshot-move-data to S3-compatible storage

What did you expect to happen:
I expected the restore to succeed, but I got the following error:

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    vaicloud-dev:  error preparing volumesnapshots.snapshot.storage.k8s.io/vaicloud-dev/cephfs-pvc-snapshot: rpc error: code = Unknown desc = VolumeSnapshot vaicloud-dev/cephfs-pvc-snapshot does not have a velero.io/csi-volumesnapshot-handle annotation

I have set the requested annotation on rbd and cephfs, on both clusters. Also, the volumes that need to be restored use the rook-ceph-block StorageClass, not cephfs as the error message indicates, so I'm wondering why this restore fails with a reference to cephfs.
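For reference, as far as I understand the annotation Velero complains about is derived from the snapshot's status at backup time, so a quick sanity check on the source cluster (just a sketch, assuming the snapshot still exists there) would be:

# prints the bound VolumeSnapshotContent name and whether the snapshot is ready to use
kubectl get volumesnapshot cephfs-pvc-snapshot -n vaicloud-dev \
  -o jsonpath='{.status.boundVolumeSnapshotContentName}{"\n"}{.status.readyToUse}{"\n"}'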

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, refer to velero debug --help.
bundle-2024-11-22-16-16-21.tar.gz

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version): Version 1.15.0 on both clusters
  • Velero features (use velero client config get features): features: EnableCSI
  • Kubernetes version (use kubectl version): 1.30.0 on the prod01 cluster (backup) and 1.31.1 on the dr01 (restore)
  • Kubernetes installer & version: RKE2
  • Cloud provider or hardware configuration: Bare-metal
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" to the upper right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@blackpiglet
Contributor

This is not expected.
The restore's referenced backup already enabled the SnapshotMoveData flag, so the restore should not use the CSI plugin to restore the volume data.
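One way to confirm the flag on the restore-referenced backup (a sketch, assuming the Backup CR still exists in the velero namespace):

# should print "true" if data movement was enabled for this backup
kubectl get backup vaicloud-dev-backup22112024-2 -n velero -o jsonpath='{.spec.snapshotMoveData}{"\n"}'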

@blackpiglet
Contributor

@erichevers
Could you help collect the debug bundle of the restore-referenced backup?
IMO, the error happened due to the restore trying to restore the VolumeSnapshot CR.
This is not expected.

  • If the VolumeSnapshot was created during the backup, the correct behavior is for Velero to delete it at the end of the backup, so something must have gone wrong there.
  • If the VolumeSnapshot existed before Velero ran the backup, that is also unexpected, because the CSI BackupItemAction (BIA) should have updated it with the needed information (see the check sketched below the log excerpt).
time="2024-11-22T15:01:11Z" level=info msg="Executing item action for volumesnapshots.snapshot.storage.k8s.io" logSource="pkg/restore/restore.go:1321" restore=velero/restore-test
time="2024-11-22T15:01:11Z" level=info msg="Starting VolumeSnapshotRestoreItemAction" cmd=/velero logSource="pkg/restore/actions/csi/volumesnapshot_action.go:78" pluginName=velero restore=velero/restore-test

@erichevers
Author

Hi @blackpiglet ,
Thanks for looking into this. The debug backup logs are here:
bundle-2024-11-25-10-16-20.tar.gz

@blackpiglet
Contributor

Thanks for collecting the debug bundle.

There were three VolumeSnapshots included in the backup, and they were not created by the backup.

  snapshot.storage.k8s.io/v1/VolumeSnapshot:
    - vaicloud-dev/cephfs-pvc-snapshot
    - vaicloud-dev/velero-vaicloud-mq-volume-rsjnj
    - vaicloud-dev/velero-vaicloud-postgresql-volume-bgfhz

Velero also ran the VolumeSnapshot BackupItemAction against them.

The only reason the restore failed to restore the VolumeSnapshots is that the backed-up VolumeSnapshots either didn't have a Status or their Status didn't contain the SnapshotHandle when the backup ran.
Could you please check the content of those VolumeSnapshots?
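For example, something like this should show whether a snapshot handle was ever populated (a sketch; substitute the content name printed by the first command):

# shows the bound VolumeSnapshotContent, if any
kubectl get volumesnapshot cephfs-pvc-snapshot -n vaicloud-dev -o jsonpath='{.status.boundVolumeSnapshotContentName}{"\n"}'
# shows the CSI snapshot handle on that content
kubectl get volumesnapshotcontent <content-name> -o jsonpath='{.status.snapshotHandle}{"\n"}'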

@erichevers
Author

erichevers commented Nov 25, 2024

Hi @blackpiglet ,

On the prod01 cluster I've checked the VolumeSnapshots, and indeed there are three:

kubectl get volumesnapshots
NAME                                      READYTOUSE   SOURCEPVC                    SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                SNAPSHOTCONTENT                                    CREATIONTIME   AGE
cephfs-pvc-snapshot                       false        cephfs-pvc                                                         csi-cephfsplugin-snapclass                                                                     41d
velero-vaicloud-mq-volume-rsjnj           true         vaicloud-mq-volume                                   2Gi           csi-rbdplugin-snapclass      snapcontent-131fbc1b-92f6-4980-87e9-997d4aef74c3   3d7h           3d7h
velero-vaicloud-postgresql-volume-bgfhz   true         vaicloud-postgresql-volume                           30Gi          csi-rbdplugin-snapclass      snapcontent-89dd62b0-7dae-4297-83cb-dc4d1b97db86   3d7h           3d7h

I don't know where the cephfs snapshot came from, but the describe shows that it is in a failed state:

kubectl describe volumesnapshot cephfs-pvc-snapshot
Status:
  Error:
    Message:     Failed to create snapshot content with error snapshot controller failed to update cephfs-pvc-snapshot on API server: cannot get claim from snapshot
    Time:        2024-10-15T13:35:37Z
  Ready To Use:  false

I also looked at the other two snapshots and they came from another backup.
I've deleted all three VolumeSnapshots and started a new backup:
velero backup create vaicloud-dev-backup22112024-4 --include-namespaces vaicloud-dev --snapshot-move-data

Snapshots got created and removed, as they should be.
Then I moved over to the dr01 cluster and did a restore:

velero restore create restore-test --include-namespaces vaicloud-dev --from-backup vaicloud-dev-backup22112024-4

This time there was no message about the VolumeSnapshot as in the original case. However, the restore stays in:
WaitingForPluginOperations

Below is the debug logfile of the restore:
bundle-2024-11-25-21-09-59.tar.gz

Regards

@blackpiglet
Contributor

From the log, I think the restore worked as expected.
How long did the restore take to complete?

The data mover restore may take longer than the CSI snapshot restore, because it needs to create a temporary pod and PVC to host the restored data.
The restore time also depends on the amount of restored volume data.
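If you want to see what the data mover is doing while the restore is in WaitingForPluginOperations, something like this should help (a sketch, assuming Velero's data mover CRs in the default velero namespace):

# DataDownload CRs track the data mover restore progress
kubectl get datadownloads -n velero
# temporary restore pods and PVCs created by the node-agent also show up here
kubectl get pods,pvc -n velero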

@blackpiglet
Contributor

blackpiglet commented Nov 26, 2024

To clarify the scenario of this issue:

  • The error reported by this issue is not a common case for the Velero CSI snapshot data mover.
  • The error was triggered by a failed VolumeSnapshot. The VolumeSnapshot didn't have a snapshot handle, and it was not created by the restore-referenced backup.

Although this is a rainy-day case, we may also consider whether Velero should handle it instead of reporting an error.

@blackpiglet blackpiglet assigned blackpiglet and unassigned Lyndon-Li Nov 26, 2024
@erichevers
Author

erichevers commented Nov 26, 2024

From the log, I think the restore worked as expected. How long did the restore take to complete?

The data mover restore may take longer than the CSI snapshot restore, because it needs to create a temporary pod and PVC to host the restored data. The restore time also depends on the amount of restored volume data.

Hi @blackpiglet ,
I just checked, and the job failed after the standard 4-hour timeout, with the following error:

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    vaicloud-dev:  fail to patch dynamic PV, err: context deadline exceeded, PVC: vaicloud-postgresql-volume, PV: pvc-59e5c022-2214-4ef4-a24b-f8afff278041
                   fail to patch dynamic PV, err: context deadline exceeded, PVC: vaicloud-mq-volume, PV: pvc-b6de9f16-8a9d-41ae-ae2e-a5fc715377c0

And the pods are still in Pending.

Regards

@blackpiglet
Contributor

@erichevers
I found a similar issue #7866
Could you check the status of the PVCs vaicloud-postgresql-volume and vaicloud-mq-volume?
IMO, they did not end up in the Bound phase after being created by the restore.
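For example (a sketch):

# the PVCs created by the restore should reach the Bound phase
kubectl get pvc -n vaicloud-dev
kubectl describe pvc vaicloud-mq-volume -n vaicloud-dev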

@erichevers
Author

erichevers commented Nov 26, 2024

@blackpiglet ,
The PVCs are in Pending.

kubectl describe pvc vaicloud-postgresql-volume -n vaicloud-dev
gives:
Name:          vaicloud-postgresql-volume
Namespace:     vaicloud-dev
StorageClass:  rook-ceph-block
Status:        Pending
Volume:        
Labels:        velero.io/backup-name=vaicloud-dev-backup22112024-4
               velero.io/restore-name=restore-test
               velero.io/volume-snapshot-name=velero-vaicloud-postgresql-volume-jkmhl
Annotations:   backup.velero.io/must-include-additional-items: true
               velero.io/csi-volumesnapshot-class: csi-rbdplugin-snapclass
               volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
               volume.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Used By:       vaicloud-db-7846b4c4cd-25k8w
Events:
  Type    Reason                Age                     From                                                                                                        Message
  ----    ------                ----                    ----                                                                                                        -------
  Normal  ExternalProvisioning  3m53s (x2923 over 12h)  persistentvolume-controller                                                                                 Waiting for a volume to be created either by the external provisioner 'rook-ceph.rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal  Provisioning          27s (x205 over 12h)     rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-54b4855f96-b95cx_0ee214d7-199b-4fcc-8748-dfa6b513df21  External provisioner is provisioning volume for claim "vaicloud-dev/vaicloud-postgresql-volume"

Regards

@blackpiglet blackpiglet added Needs info Waiting for information and removed Needs investigation labels Nov 26, 2024
@blackpiglet
Contributor

Waiting for a volume to be created either by the external provisioner 'rook-ceph.rbd.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

External provisioner is provisioning volume for claim "vaicloud-dev/vaicloud-postgresql-volume"

To me, the error seems related to Rook Ceph not creating a volume for the PVC in time.
Could you check whether there are any error logs in the Rook Ceph pods?
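For example, something like this (a sketch, assuming the default rook-ceph namespace and Rook's usual deployment names; adjust to your install):

# look for provisioning errors from the RBD CSI provisioner and the operator
kubectl -n rook-ceph get pods
kubectl -n rook-ceph logs deploy/csi-rbdplugin-provisioner -c csi-provisioner --tail=100
kubectl -n rook-ceph logs deploy/rook-ceph-operator --tail=100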

@blackpiglet
Contributor

To clarify the scenario of this issue:

  • The error reported by this issue is not a common case for the Velero CSI snapshot data mover.
  • The error was triggered by a failed VolumeSnapshot. The VolumeSnapshot didn't have a snapshot handle, and it was not created by the restore-referenced backup.

Although this is a rainy-day case, we may also consider whether Velero should handle it instead of reporting an error.

Created a new issue #8460 to address this comment.
