Mechanism to run maintenance of Kopia's cache directory within the node-agent #8443
Comments
So we have already confirmed that running immediate maintenance won't delete backups. Are we saying there's a benefit to reducing the cache directory size?
The above tests cannot prove that Velero needs to intervene in the Kopia repo's cache management.
Generally speaking, setting the cache limit, which we already have, is a more graceful way to control the Kopia repo's cache, and it should be enough: unless really necessary, we should let the Kopia repo itself manage the cache all the time. We only need to consider intervention by Velero in the corner case where the cache is out of the Kopia repo's control and so is left behind, e.g., when a repo is deleted or is never visited again for a long time.
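For reference, a minimal sketch of setting that per-repository cache limit, assuming the backup repository ConfigMap introduced in Velero 1.15 (the ConfigMap name, the `kopia` data key, the `cacheLimitMB` field, and the `--backup-repository-configmap` server flag reflect my reading of that feature; verify against the docs for your version):

```sh
# Sketch: cap the Kopia repo cache at 2 GB via the backup repository
# ConfigMap. Assumes the Velero server was started with
# --backup-repository-configmap=backup-repository-config.
kubectl -n velero apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-repository-config
  namespace: velero
data:
  kopia: |
    {
      "cacheLimitMB": 2048
    }
EOF
```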
@mpryc
Currently it is observed that the cache folder grows well above 5 GB:

```
# Before node-agent pod restart (after backup & restore operation):
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G  237G  211G  53% /var

# After node-agent pod restart:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G   25G  422G   6% /var
```

While this growth is expected, the cache is not freed even when the data used for the restore is no longer needed, so disk usage keeps accumulating until the node-agent is restarted. Possibly setting the hard limit dynamically would be sufficient here? I know there is a way to set the limit; however, it would help if the limit were calculated dynamically from the available disk size on the node, because as things stand the cache can grow until the node dies from a full disk, and the only option at that point is to restart the node-agent. If that happens during a backup or restore operation, that backup or restore never completes successfully.
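To make the "dynamically calculated limit" idea concrete, here is a hypothetical shell sketch, not an existing Velero feature: it derives a limit from the free space on the node's /var filesystem and patches it into the same assumed ConfigMap as above. The 20% factor and all names are illustrative.

```sh
# Hypothetical sketch only: size the Kopia cache limit from the node's
# free space on /var, then patch the assumed ConfigMap with the result.
avail_mb=$(df --output=avail -BM /var | tail -n 1 | tr -dc '0-9')
limit_mb=$((avail_mb / 5))  # illustrative: cap the cache at ~20% of free space

kubectl -n velero patch configmap backup-repository-config --type merge \
  -p "{\"data\":{\"kopia\":\"{\\\"cacheLimitMB\\\": ${limit_mb}}\"}}"
```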
How many BSLs do you have and how many namespaces are you backing up?
It's a single namespace, a single BSL, and a single pod with a large (1 TB) PV that holds a lot of data: many files of around 10 GB each.
Are you using 1.15?
Forgot to mention that in the bug, sorry: it's not 1.15, it's 1.14.
This is a known issue in 1.14 and has been fixed in 1.15, so you could upgrade to 1.15 and do another test. |
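Before and after the upgrade it is worth confirming which versions are actually in play, since the client and server can differ:

```sh
# Print both the Velero client and server versions.
velero version
```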
What steps did you take and what happened:
During backup/restore with the node-agent using Kopia, it was observed that the size of Kopia's cache folder keeps growing (see the df output in the comments above).
What did you expect to happen:
Velero should have a mechanism to automatically clean up or run maintenance on the node-agent's Kopia cache after a backup or restore operation.
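Until such a mechanism exists, the only workaround observed in this thread is restarting the node-agent pods, which releases the cached space; avoid doing it while a backup or restore is in flight. Assuming the default DaemonSet name and namespace:

```sh
# Workaround from this thread: restarting the node-agent frees the cache.
kubectl -n velero rollout restart daemonset/node-agent
```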