
Large PVCs are canceled #8454

Open

filipe-silva-magalhaes-alb opened this issue Nov 25, 2024 · 3 comments

Comments

filipe-silva-magalhaes-alb commented Nov 25, 2024

What steps did you take and what happened:
The data uploads of the largest PVCs failed.

velero backup create velero-schedule-s3-20241125000006 --resource-policies-configmap velero-efs-resourcepolicy --snapshot-move-data

kubectl get configmap velero-efs-resourcepolicy -n velero -o yaml

apiVersion: v1
data:
  efs-resourcepolicy.yaml: |
    version: v1
    volumePolicies:
    - conditions:
        csi:
          driver: efs.csi.aws.com
      action:
        type: skip
kind: ConfigMap
metadata:
  name: velero-efs-resourcepolicy
  namespace: velero
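
As an aside (a hedged check, not from the original report): whether the policy actually matched the EFS volumes can be verified from the backup details, since volumes skipped by the policy should not appear among the data mover operations:

velero backup describe velero-schedule-s3-20241125000006 --details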

What did you expect to happen:
Backup runs without problems.

The following information will help us better understand what's going on:

velero debug --backup velero-schedule-s3-20241125000006
bundle-2024-11-25-14-30-47.tar.gz

Parameters of backup:

csiSnapshotTimeout: 10m0s
itemOperationTimeout: 6h0m0s
uploaderConfig:
  parallelFilesUpload: 2

Parameters of daemonset (running in privileged mode):

  - --features=EnableCSI 
  - --data-mover-prepare-timeout=190m 

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.14.1
  • Velero features (use velero client config get features): None
  • Kubernetes version (use kubectl version): v1.28.15-eks-7f9249a
  • Kubernetes installer & version: EKS
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Amazon Linux 2

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li (Contributor) commented:

By default, Velero's data mover backup has a 4-hour timeout for each volume. If that is not enough, you can configure default-item-operation-timeout in the Velero server parameters.
Meanwhile, if you want to accelerate the backup, especially for large/complex volumes, you can configure the uploader concurrency through the parallel-files-upload backup flag. By default, it is the number of CPU cores of the node where the data mover backup is running.
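
For illustration, a minimal sketch of that second suggestion (the --parallel-files-upload flag and its placement are as I understand them from the velero CLI help; the value 4 is only an example matching a 4-core node):

velero backup create <backup-name> --snapshot-move-data --parallel-files-upload 4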

filipe-silva-magalhaes-alb (Author) commented Nov 26, 2024

Hello @Lyndon-Li, in the schedule I configured "itemOperationTimeout" to "6h". The instance type of our nodes is "r6a.xlarge" (4 vCPUs and 32 GB).

The backup shouldn't fail after 3h:5x mins.

Schedule configuration:

spec:
  schedule: 0 0 * * *
  skipImmediately: false
  template:
    itemOperationTimeout: 6h0m0s
    resourcePolicy:
      apiGroup: v1
      kind: ConfigMap
      name: velero-efs-resourcepolicy
    snapshotMoveData: true
    ttl: 166h
    uploaderConfig:
      parallelFilesUpload: 2
  useOwnerReferencesInBackup: true

@Lyndon-Li (Contributor) commented:

itemOperationTimeout should be configured in the Velero server parameters.
parallelFilesUpload could also be increased to accelerate the upload.
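
Concretely (an illustrative snippet, not from this thread; the surrounding args will differ per install), that means putting the flag on the velero server Deployment rather than in the schedule template:

# velero Deployment, spec.template.spec.containers[0].args (illustrative)
args:
  - server
  - --default-item-operation-timeout=8h   # replaces the 4h default for data mover operations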
