
Mechanism to run maintenance of Kopia's cache directory within the node-agent #8443

Open

mpryc opened this issue Nov 22, 2024 · 10 comments
Labels: Needs info (Waiting for information)

@mpryc (Contributor) commented Nov 22, 2024:

What steps did you take and what happened:

During backup/restore with node-agents using Kopia, it was observed that Kopia's cache folder keeps growing:

$ kubectl get pods -n velero-ns -l name=node-agent -o name | while read pod; do echo -e "$(kubectl exec -n velero-ns $pod -- du -hs /var | awk '{print $1}')\t$pod"; done | sort -h | awk 'BEGIN {print "KOPIA CACHE SIZE\tPOD\n-------------------------"} {print $0}'
KOPIA CACHE SIZE	POD
-------------------------
11M	pod/node-agent-7zqm9
11M	pod/node-agent-dzl6s
244M	pod/node-agent-9tmdw

What did you expect to happen:
Velero should have a mechanism to automatically clean up, or run maintenance on, the node-agent's Kopia cache after a backup or restore operation.
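
For illustration, the kind of cleanup being requested would look roughly like the following. This is a minimal sketch using the standalone kopia CLI and assumes the CLI is already connected to the affected repository; the node-agent image does not ship the kopia binary, so this is not something that can be run inside the pod today:

# Inspect the current cache usage and configured limits (assumes an already-connected repository)
$ kopia cache info

# Drop the local cache entirely; it is rebuilt on demand by later operations
$ kopia cache clear

# Optionally run full repository maintenance (per the discussion below, immediate maintenance does not delete backups)
$ kopia maintenance run --full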

@weshayutin (Contributor) commented:

@msfrucht please review and work w/ @mpryc on this

@kaovilai (Member) commented:

So we had already confirmed that running immediate maintenance won't delete backups. Are we saying there is a benefit to reducing the cache directory size?

@weshayutin added this to OADP Nov 22, 2024
@Lyndon-Li (Contributor) commented:

The above tests cannot prove that Velero needs to intervene in the Kopia repo's cache management:

  1. The Kopia repo has its own policy to manage the cache, e.g., there are several thresholds that decide when and how cache entries are removed
  2. Cache management is repo-wide, not operation-wide. In other words, the cache stays effective across more than one upper-level repo operation, i.e., backup/restore or maintenance
  3. At present, Velero's default (hard) cache limit is 5G for data and metadata respectively
  4. Compared to that cache limit, the above tests cannot prove that Velero needs to manage the cache itself, because the observed size is still within the scope of the Kopia repo's own management; nor can we prove this theoretically, as mentioned above

Generally speaking, setting the cache limit, which we already support, is a more graceful way to control the Kopia repo's cache, and it should be enough. Unless it is really necessary, we would rather have the Kopia repo itself manage the cache all the time.

We only need to consider Velero intervening in the corner case where the cache is out of the Kopia repo's control and is left behind, e.g., a repo is deleted or is never visited again for a long time.
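
As a concrete reference for the cache limit mentioned above, here is a minimal sketch of tuning it per repository type via the backup repository configuration ConfigMap available in recent Velero releases. The cacheLimitMB key and the --backup-repository-configmap server flag are assumptions here and should be verified against the documentation for your Velero version:

# Hypothetical example: cap the local kopia repo cache at 2 GiB
$ cat > kopia-repo-config.json <<EOF
{
    "cacheLimitMB": 2048
}
EOF
$ kubectl create configmap backup-repository-config -n velero --from-file=kopia=kopia-repo-config.json
# The velero server then needs to be pointed at this ConfigMap, e.g. via the
# (assumed) --backup-repository-configmap=backup-repository-config server flag.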

@reasonerjt added the Needs info (Waiting for information) label Nov 25, 2024
@reasonerjt (Contributor) commented:

@mpryc
Per the comment from @Lyndon-Li, it seems the cache growth is expected. Please elaborate on why this is a severe issue.

@mpryc (Contributor, Author) commented Nov 25, 2024:

It was observed that the cache folder grows well above 5G:

# Before node-agent pod restart (after backup & restore operation):

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G  237G  211G  53% /var

# After node-agent pod restart:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G   25G  422G   6% /var

While this growth is expected, the cache is not being freed, even though the data used for the restore is no longer needed. This causes disk usage to accumulate until the node-agent is restarted.

Possibly, dynamically setting the hard limit would be sufficient here? I know there is a way to set the limit, but it would help if the limit were calculated dynamically based on the available disk space on the node. As it stands, the cache can grow to the point where the node dies due to a full disk, and the only option at that point is to restart the node-agent. If that happens during a backup or restore operation, the backup or restore never succeeds.
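
To make the suggestion concrete, here is a rough sketch of the kind of calculation meant, assuming the cache lives on the filesystem backing /var as in the df output above (this is not existing Velero behavior, just an illustration):

# Derive a cache hard limit as, say, 10% of the space currently available
# on the filesystem that holds the cache directory
$ AVAIL_KB=$(df --output=avail /var | tail -n 1)
$ CACHE_LIMIT_MB=$(( AVAIL_KB / 1024 / 10 ))
$ echo "proposed kopia cache hard limit: ${CACHE_LIMIT_MB} MB"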

@Lyndon-Li (Contributor) commented:

How many BSLs do you have and how many namespaces are you backing up?

@mpryc (Contributor, Author) commented Nov 25, 2024:

How many BSLs do you have and how many namespaces are you backing up?

It's a single namespace, a single BSL, and a single pod with a big PV (1TB) that holds a lot of data: many files of around 10GB each.

@Lyndon-Li (Contributor) commented:

Are you using 1.15?

@mpryc (Contributor, Author) commented Nov 25, 2024:

Are you using 1.15?

Forgot to mention that in the bug report, sorry: it's not 1.15, it's 1.14.

@Lyndon-Li (Contributor) commented:

This is a known issue in 1.14 and has been fixed in 1.15, so you could upgrade to 1.15 and do another test.
