
Large PVCs are canceled #8454

Open

filipe-silva-magalhaes-alb opened this issue Nov 25, 2024 · 3 comments

Comments

filipe-silva-magalhaes-alb commented Nov 25, 2024

What steps did you take and what happened:
The data uploads of the largest PVCs failed.

velero backup create velero-schedule-s3-20241125000006 --resource-policies-configmap velero-efs-resourcepolicy --snapshot-move-data

kubectl get configmap velero-efs-resourcepolicy -n velero -o yaml

apiVersion: v1
data:
  efs-resourcepolicy.yaml: |
    version: v1
    volumePolicies:
    - conditions:
        csi:
          driver: efs.csi.aws.com
      action:
        type: skip
kind: ConfigMap
metadata:
  name: velero-efs-resourcepolicy
  namespace: velero
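
As an aside (a hedged check, not from the original report): whether the policy actually matched the EFS volumes can be verified from the backup details, since volumes skipped by the policy should not appear among the data mover operations:

velero backup describe velero-schedule-s3-20241125000006 --details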

What did you expect to happen:
Backup runs without problems.

The following information will help us better understand what's going on:

velero debug --backup velero-schedule-s3-20241125000006
bundle-2024-11-25-14-30-47.tar.gz

Parameters of backup:

csiSnapshotTimeout: 10m0s
itemOperationTimeout: 6h0m0s
uploaderConfig:
  parallelFilesUpload: 2

Parameters of daemonset (running in privileged mode):

  - --features=EnableCSI 
  - --data-mover-prepare-timeout=190m 

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.14.1
  • Velero features (use velero client config get features): None
  • Kubernetes version (use kubectl version): v1.28.15-eks-7f9249a
  • Kubernetes installer & version: EKS
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Amazon Linux 2

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li (Contributor) commented:

By default, Velero's data mover backup has a 4-hour timeout for each volume. If that is not enough, you can configure default-item-operation-timeout in the Velero server parameters.
Meanwhile, if you want to accelerate the backup, especially for large/complex volumes, you can configure the uploader concurrency through the parallel-files-upload backup flag. By default, it is the number of CPU cores of the node where the data mover backup is running.
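
For illustration, a minimal sketch of that second suggestion (the --parallel-files-upload flag and its placement are as I understand them from the velero CLI help; the value 4 is only an example matching a 4-core node):

velero backup create <backup-name> --snapshot-move-data --parallel-files-upload 4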

filipe-silva-magalhaes-alb (Author) commented Nov 26, 2024

Hello @Lyndon-Li, in the schedule I configured "itemOperationTimeout" to "6h". The instance type of our nodes is "r6a.xlarge" (4 vCPUs and 32 GB).

The backup shouldn't fail after 3h:5x mins.

Schedule configuration:

spec:
  schedule: 0 0 * * *
  skipImmediately: false
  template:
    itemOperationTimeout: 6h0m0s
    resourcePolicy:
      apiGroup: v1
      kind: ConfigMap
      name: velero-efs-resourcepolicy
    snapshotMoveData: true
    ttl: 166h
    uploaderConfig:
      parallelFilesUpload: 2
  useOwnerReferencesInBackup: true

@Lyndon-Li (Contributor) commented:

itemOperationTimeout should be configured in the Velero server parameters.
parallelFilesUpload could also be increased to accelerate the upload.
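
Concretely (an illustrative snippet, not from this thread; the surrounding args will differ per install), that means putting the flag on the velero server Deployment rather than in the schedule template:

# velero Deployment, spec.template.spec.containers[0].args (illustrative)
args:
  - server
  - --default-item-operation-timeout=8h   # replaces the 4h default for data mover operations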
