
Velero Collector and Analyzer #806

Closed · diamonwiggins opened this issue Oct 27, 2022 · 10 comments · Fixed by #1366
Labels: type::feature New feature or request

Comments

diamonwiggins (Member) commented Oct 27, 2022

Describe the rationale for the suggested feature.

Velero is a toolset for backing up and restoring Kubernetes resources and persistent volumes. Troubleshoot and Velero are commonly used in the same Kubernetes clusters, yet there is often a lack of information and analysis about the state of Velero in those environments.

Describe the feature

A Velero Collector and Analyzer can be added to Support Bundle and Preflight specs.

apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: velero
spec:
  collectors:
    - velero: {}
  analyzers:
    - velero: {}

The Velero Collector can collect information such as:

  • kubectl logs deploy/velero -n <velero-namespace> -c velero
  • kubectl get bsl -n <velero-namespace>
  • kubectl get bsl -n <velero-namespace> -oyaml
  • kubectl get resticrepositories -n <velero-namespace>
  • kubectl get resticrepositories -n <velero-namespace> -oyaml
  • velero get backups
  • ???

The Velero Analyzer can provide the following analysis:

  • ???

Additional context

sgalsaleh (Member):

initial thoughts:

some more collectors:

  • kubectl logs daemonset/restic -n velero
  • velero get restores
  • velero describe backups --details
  • velero describe restores --details
  • kubectl get podvolumebackups -n velero -oyaml
  • kubectl get podvolumerestores -n velero -oyaml

important note: when the local volume provider plugin is configured, it runs as an init container in the velero deployment, so we should collect its logs as well when possible. That also means you'd have to pass -c velero to the kubectl logs command; I'm not sure whether that matters with the client-go package (see the sketch below).
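
For reference, a minimal sketch of the client-go side, showing that the container name is just a field on PodLogOptions, so the -c velero part carries over fine. The clientset construction is omitted, and the function is illustrative rather than troubleshoot's actual collector code:

package collect

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// containerLogs fetches the logs of one named container in a pod, the
// client-go equivalent of `kubectl logs <pod> -n <ns> -c <container>`.
func containerLogs(ctx context.Context, cs kubernetes.Interface, ns, pod, container string) (string, error) {
	req := cs.CoreV1().Pods(ns).GetLogs(pod, &corev1.PodLogOptions{
		Container: container, // the "-c velero" part; also works for init containers like the lvp plugin
	})
	rc, err := req.Stream(ctx)
	if err != nil {
		return "", err
	}
	defer rc.Close()
	b, err := io.ReadAll(rc)
	return string(b), err
}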

ideas for analyzers:

  • detect that there is at least 1 bsl
  • detect whether the bsl is available and, if not, report why (see the sketch at the end of this comment)
  • detect that there is at least 1 restic repository
  • detect whether the restic repository is in a "Ready" phase (not 100% sure about this one)
  • restic can sometimes have memory issues; I don't remember the symptoms off the top of my head, but it would be nice to have an analyzer that detects them

I'll add more if I think of anything else.
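
A minimal sketch of the first two ideas (at least one bsl exists, and whether it is Available), assuming the collector has already saved the output of kubectl get bsl -n <velero-namespace> -oyaml into the bundle. The .status.phase and .status.message fields match the velero.io/v1 BackupStorageLocation schema; the Result type is a stand-in, not troubleshoot's real analyzer API:

package analyze

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

// bslList mirrors just the fields we need from the collected
// "kubectl get bsl -oyaml" output (a v1 List of BackupStorageLocations).
type bslList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Phase   string `json:"phase"`
			Message string `json:"message"`
		} `json:"status"`
	} `json:"items"`
}

// Result is a stand-in for troubleshoot's analyzer outcome type.
type Result struct {
	IsPass  bool
	IsFail  bool
	Message string
}

func analyzeBSLs(collected []byte) ([]Result, error) {
	var list bslList
	if err := yaml.Unmarshal(collected, &list); err != nil {
		return nil, err
	}
	// idea 1: there must be at least one bsl
	if len(list.Items) == 0 {
		return []Result{{IsFail: true, Message: "no BackupStorageLocation found"}}, nil
	}
	// idea 2: each bsl should be Available; if not, surface velero's own reason
	results := []Result{}
	for _, bsl := range list.Items {
		if bsl.Status.Phase == "Available" {
			results = append(results, Result{IsPass: true,
				Message: fmt.Sprintf("bsl %s is Available", bsl.Metadata.Name)})
		} else {
			results = append(results, Result{IsFail: true,
				Message: fmt.Sprintf("bsl %s is %q: %s",
					bsl.Metadata.Name, bsl.Status.Phase, bsl.Status.Message)})
		}
	}
	return results, nil
}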

xavpaice added the type::feature New feature or request label Oct 30, 2022
CpuID self-assigned this Jan 9, 2023
CpuID (Contributor) commented Jan 9, 2023

I'll start getting into this over the next couple of days, once I've familiarised myself with Velero itself 👍

CpuID removed their assignment Jan 27, 2023
banjoh (Member) commented Jul 19, 2023

Running out of memory: detect how many objects are in storage and warn if the count is high enough that restic is likely to run out of memory.

xavpaice (Member):

Some sample logs from recent support issues that this analyzer would have helped with:

Permissions issues on the backup location:

open /var/velero-local-volume-provider/velero-lvp-471ddcf356bb/restic/default/index/3a62dce588bba9f315ba1b2fa86f2c73781f42a365ef747e609b27f9ac4c943a: permission denied\n: exit status 1"

Files were removed during the backup (e.g. a database backed up without an application-level backup):

# kubectl logs velero-5cb7cffdc9-8pllw -n velero -f

time="2023-07-20T09:37:09Z" level=error msg="Error backing up item" backup=velero/instance-278xp error="pod volume backup failed: running Restic backup, stderr={\"message_type\":\"error\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-e31a4031-8c5a-4449-b852-3612a3cf22ed/mount/lost+found\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival\",\"item\":\"lost+found\"}\nWarning: at least one source file could not be read\n: exit status 3" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:199" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=vault-0
time="2023-07-20T09:37:09Z" level=error msg="Error backing up item" backup=velero/instance-278xp error="pod volume backup failed: running Restic backup, stderr={\"message_type\":\"error\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/lost+found\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/raft\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival\",\"item\":\"lost+found\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"node-id\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/node-id\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival\",\"item\":\"raft\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"vault.db\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/vault.db\"}\nWarning: at least one source file could not be read\n: exit status 3" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:199" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=vault-0

Inconsistent state between the backupstoragelocation and the object store that Velero uses; we resolved this by deleting the default restic repository manually and restarting Velero so that it would be recreated:

running Restic backup, stderr=Fatal: invalid id "7278389d": no matching ID found for prefix "7278389d"

Some other ideas:

  • restic uses memory; watch for that limit being an issue. In the velero log it shows up as: level=info msg="stderr: /bin/bash: line 1: 20109 Killed (a log-matching sketch covering this and the signatures above follows this list)
  • Check for supported config, e.g. hostpath is not supported on multi-node clusters
  • Check how much data is present and how much space is available; remember that pruning needs space and is delayed by a number of days.
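
A minimal sketch of matching these failure signatures against a collected velero log; the patterns are lifted from the sample logs above, while the package and function names are illustrative:

package analyze

import "regexp"

// knownFailures maps a diagnosis to a signature seen in the velero log;
// every pattern below comes from the sample logs in this thread.
var knownFailures = map[string]*regexp.Regexp{
	"permission denied on the backup location":        regexp.MustCompile(`restic/.*: permission denied`),
	"files changed or removed during backup":          regexp.MustCompile(`at least one source file could not be read`),
	"restic repository out of sync with object store": regexp.MustCompile(`no matching ID found for prefix`),
	"restic killed, likely out of memory":             regexp.MustCompile(`/bin/bash: line 1: \d+ Killed`),
}

// diagnose returns every known failure signature present in the log text.
func diagnose(veleroLog string) []string {
	var found []string
	for diagnosis, pattern := range knownFailures {
		if pattern.MatchString(veleroLog) {
			found = append(found, diagnosis)
		}
	}
	return found
}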

banjoh (Member) commented Aug 24, 2023

@adamancini Velero has a velero debug command which generates a Velero debug bundle. You might want to check it out and see whether it collects anything we don't yet collect. It's a command that runs on the host, so I think it can only be run as a host collector, unless we add velero as a dependency (last option, IMO).

The codebase is https://github.com/vmware-tanzu/velero/blob/main/pkg/cmd/cli/debug/debug.go

diamonwiggins (Member, Author):

Recently discovered the following while working a support issue:

Error getting volume snapshotter for volume snapshot location

When this error is thrown, the volume in question won't be backed up, likely due to a plugin issue with a particular storage provider. The full error was:

time="2023-08-23T03:10:38Z" level=error msg="Error getting volume snapshotter for volume snapshot location" backup=velero/my-backup-22-08-6 error="rpc error: code = Unknown desc = faile to get address for maya-apiserver/cvc-server service" error.file="/home/travis/gopath/src/github.com/openebs/velero-plugin/pkg/cstor/cstor.go:233" error.function="github.com/openebs/velero-plugin/pkg/cstor.(*Plugin).Init" logSource="pkg/backup/item_backupper.go:524" name=pvc-3e3ada5e-2361-48c7-bcd6-366b698c6207 namespace= persistentVolume=pvc-3e3ada5e-2361-48c7-bcd6-366b698c6207 resource=persistentvolumes volumeSnapshotLocation=local-default

The OpenEBS plugin, which only supports cstor and not localpv, was being used instead of a filesystem backup.
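
A pattern along these lines could be added to the same style of log matching sketched earlier in the thread; the variable name is hypothetical:

package analyze

import "regexp"

// a match in the velero log means a volume was likely skipped because no
// working snapshotter plugin could be initialized for its snapshot location
var volumeSnapshotterErr = regexp.MustCompile(
	`Error getting volume snapshotter for volume snapshot location`)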

adamancini moved this from Next to In Progress in Troubleshoot Roadmap Sep 7, 2023
adamancini (Member) commented Sep 19, 2023

  • velero pod logs
  • init container logs
  • all velero custom resources (a collection sketch follows this list)
    • backuprepositories
    • backups
    • backupstoragelocations
    • deletebackuprequests
    • downloadrequests
    • podvolumebackups
    • podvolumerestores
    • restores
    • schedules
    • serverstatusrequests
    • volumesnapshotlocations
  • velero commands
    • velero get backups
    • velero get restores
    • velero describe backups
    • velero describe restores

adamancini (Member):

ada@ada-kurl:~/support-bundle-2023-09-19T17_40_36$ tree -L 4
.
├── analysis.json
├── cluster-resources
│   └── pods
│       └── logs
│           └── velero
├── execution-data
│   └── summary.txt
├── velero
│   ├── backuprepositories
│   │   ├── default-default-restic-6bwck.yaml
│   │   └── kurl-default-restic-2lfp4.yaml
│   ├── backups
│   │   ├── annarchy-mfvpt.yaml
│   │   ├── instance-f2m6f.yaml
│   │   └── instance-g9ccf.yaml
│   ├── backupstoragelocations
│   │   └── default.yaml
│   ├── describe-backups-errors.json
│   ├── describe-backups-stderr.txt
│   ├── describe-restores-errors.json
│   ├── describe-restores-stderr.txt
│   ├── get-backups.yaml
│   ├── get-restores.yaml
│   ├── logs
│   │   ├── node-agent-j4zvz
│   │   │   └── node-agent.log -> ../../../cluster-resources/pods/logs/velero/node-agent-j4zvz/node-agent.log
│   │   └── velero-787c5b44b9-8vzth
│   │       ├── replicated-kurl-util.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/replicated-kurl-util.log
│   │       ├── replicated-local-volume-provider.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/replicated-local-volume-provider.log
│   │       ├── velero.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero.log
│   │       ├── velero-velero-plugin-for-aws.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero-velero-plugin-for-aws.log
│   │       ├── velero-velero-plugin-for-gcp.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero-velero-plugin-for-gcp.log
│   │       └── velero-velero-plugin-for-microsoft-azure.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero-velero-plugin-for-microsoft-azure.log
│   ├── podvolumebackups
│   │   ├── annarchy-mfvpt-tclgv.yaml
│   │   ├── instance-f2m6f-2rdkx.yaml
│   │   ├── instance-f2m6f-5tgvb.yaml
│   │   ├── instance-f2m6f-xf6z9.yaml
│   │   ├── instance-f2m6f-xxh6m.yaml
│   │   ├── instance-g9ccf-qpfn2.yaml
│   │   ├── instance-g9ccf-qt9wh.yaml
│   │   └── instance-g9ccf-w6mgw.yaml
│   ├── podvolumerestores
│   │   └── annarchy-mfvpt-h5f2c.yaml
│   └── restores
│       └── annarchy-mfvpt.yaml
└── version.yaml

adamancini (Member) commented Sep 19, 2023

Installing an older Velero (1.9.x) to check custom resource differences; note resticrepositories below, which newer releases rename to backuprepositories (a sketch handling the rename follows the output):

ada@ada-velero-collector:~$ kubectl api-resources | grep velero
backups                                        velero.io/v1                           true         Backup
backupstoragelocations            bsl          velero.io/v1                           true         BackupStorageLocation
deletebackuprequests                           velero.io/v1                           true         DeleteBackupRequest
downloadrequests                               velero.io/v1                           true         DownloadRequest
podvolumebackups                               velero.io/v1                           true         PodVolumeBackup
podvolumerestores                              velero.io/v1                           true         PodVolumeRestore
resticrepositories                             velero.io/v1                           true         ResticRepository
restores                                       velero.io/v1                           true         Restore
schedules                                      velero.io/v1                           true         Schedule
serverstatusrequests              ssr          velero.io/v1                           true         ServerStatusRequest
volumesnapshotlocations                        velero.io/v1                           true         VolumeSnapshotLocation
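
A minimal sketch of bridging that rename (resticrepositories on 1.9 and earlier became backuprepositories in newer releases): ask the discovery client which plural the cluster actually serves and collect that one. The function name is illustrative:

package collect

import (
	"fmt"

	"k8s.io/client-go/discovery"
)

// repositoryResource returns the served plural name for velero's repository CR,
// preferring the newer backuprepositories when present.
func repositoryResource(dc discovery.DiscoveryInterface) (string, error) {
	resources, err := dc.ServerResourcesForGroupVersion("velero.io/v1")
	if err != nil {
		return "", err
	}
	served := map[string]bool{}
	for _, r := range resources.APIResources {
		served[r.Name] = true
	}
	if served["backuprepositories"] {
		return "backuprepositories", nil // velero >= 1.10
	}
	if served["resticrepositories"] {
		return "resticrepositories", nil // velero <= 1.9
	}
	return "", fmt.Errorf("no velero repository resource served by the cluster")
}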

arcolife added commits to arcolife/troubleshoot that referenced this issue Oct 11, 2023:

  * test velero spec

  * local watch script for building troubleshoot

  * need to get an older velero library to support restic
arcolife (Contributor):

Analyzer work: #1366

arcolife added further commits to arcolife/troubleshoot that referenced this issue between Oct 11 and Nov 3, 2023:

  * test velero spec

  * need to get an older velero library to support restic

  * updated schemas

  * velero analyzer without collector
arcolife pushed a commit that referenced this issue Nov 3, 2023
* feat: add velero analyzer (#806)

  * updated schema
  * analyzer without collector
  * tests
  * covers deprecated Restic repository type
  * velero version from deployment image to check deprecated type
  * read for both velero pod kinds (velero*, node-agent*)

---------

Signed-off-by: Archit Sharma <[email protected]>
github-project-automation bot moved this from In Progress to Done in Troubleshoot Roadmap Nov 3, 2023