
Mechanism to run maintenance of Kopia's cache directory within the node-agent #8443

Open

mpryc opened this issue Nov 22, 2024 · 10 comments
Labels: Needs info (Waiting for information)

@mpryc (Contributor) commented Nov 22, 2024:

What steps did you take and what happened:

During backup/restore with node-agents using Kopia, it was observed that Kopia's cache folder keeps growing:

$ kubectl get pods -n velero-ns -l name=node-agent -o name | while read pod; do echo -e "$(kubectl exec -n velero-ns $pod -- du -hs /var | awk '{print $1}')\t$pod"; done | sort -h | awk 'BEGIN {print "KOPIA CACHE SIZE\tPOD\n-------------------------"} {print $0}'
KOPIA CACHE SIZE	POD
-------------------------
11M	pod/node-agent-7zqm9
11M	pod/node-agent-dzl6s
244M	pod/node-agent-9tmdw

What did you expect to happen:
Velero should have a mechanism to automatically clean up, or run maintenance on, the node-agent's Kopia cache after a backup or restore operation.
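
For illustration, the kind of cleanup being requested would look roughly like the following. This is a minimal sketch using the standalone kopia CLI and assumes the CLI is already connected to the affected repository; the node-agent image does not ship the kopia binary, so this is not something that can be run inside the pod today:

# Inspect the current cache usage and configured limits (assumes an already-connected repository)
$ kopia cache info

# Drop the local cache entirely; it is rebuilt on demand by later operations
$ kopia cache clear

# Optionally run full repository maintenance (per the discussion below, immediate maintenance does not delete backups)
$ kopia maintenance run --full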

@weshayutin (Contributor) commented:

@msfrucht please review and work w/ @mpryc on this

@kaovilai (Member) commented:

So we had already confirmed that running immediate maintenance won't delete backups. Are we saying there is a benefit to reducing the cache directory size?

@weshayutin added this to OADP Nov 22, 2024
@Lyndon-Li (Contributor) commented:

The above tests cannot prove that Velero needs to intervene in the Kopia repo's cache management:

  1. The Kopia repo has its own policy to manage the cache, e.g., there are several thresholds that decide when and how cache entries are removed
  2. Cache management is repo-wide, not operation-wide. In other words, the cache stays effective across more than one upper-level repo operation, i.e., backup/restore or maintenance
  3. At present, Velero's default (hard) cache limit is 5G for data and metadata respectively
  4. Compared to that cache limit, the above tests cannot prove that Velero needs to manage the cache itself, because the observed size is still within the scope of the Kopia repo's own management; nor can we prove this theoretically, as mentioned above

Generally speaking, setting the cache limit, which we already support, is a more graceful way to control the Kopia repo's cache, and it should be enough. Unless it is really necessary, we would rather have the Kopia repo itself manage the cache all the time.

We only need to consider Velero intervening in the corner case where the cache is out of the Kopia repo's control and is left behind, e.g., a repo is deleted or is never visited again for a long time.
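
As a concrete reference for the cache limit mentioned above, here is a minimal sketch of tuning it per repository type via the backup repository configuration ConfigMap available in recent Velero releases. The cacheLimitMB key and the --backup-repository-configmap server flag are assumptions here and should be verified against the documentation for your Velero version:

# Hypothetical example: cap the local kopia repo cache at 2 GiB
$ cat > kopia-repo-config.json <<EOF
{
    "cacheLimitMB": 2048
}
EOF
$ kubectl create configmap backup-repository-config -n velero --from-file=kopia=kopia-repo-config.json
# The velero server then needs to be pointed at this ConfigMap, e.g. via the
# (assumed) --backup-repository-configmap=backup-repository-config server flag.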

@reasonerjt added the Needs info (Waiting for information) label Nov 25, 2024
@reasonerjt (Contributor) commented:

@mpryc
Per the comment from @Lyndon-Li, it seems the cache growth is expected. Please elaborate on why this is a severe issue.

@mpryc (Contributor, Author) commented Nov 25, 2024:

It was observed that the cache folder grows well above 5G:

# Before node-agent pod restart (after backup & restore operation):

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G  237G  211G  53% /var

# After node-agent pod restart:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb4       447G   25G  422G   6% /var

While this growth is expected, the cache is not being freed, even though the data used for the restore is no longer needed. This causes disk usage to accumulate until the node-agent is restarted.

Possibly, dynamically setting the hard limit would be sufficient here? I know there is a way to set the limit, but it would help if the limit were calculated dynamically based on the available disk space on the node. As it stands, the cache can grow to the point where the node dies due to a full disk, and the only option at that point is to restart the node-agent. If that happens during a backup or restore operation, the backup or restore never succeeds.
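
To make the suggestion concrete, here is a rough sketch of the kind of calculation meant, assuming the cache lives on the filesystem backing /var as in the df output above (this is not existing Velero behavior, just an illustration):

# Derive a cache hard limit as, say, 10% of the space currently available
# on the filesystem that holds the cache directory
$ AVAIL_KB=$(df --output=avail /var | tail -n 1)
$ CACHE_LIMIT_MB=$(( AVAIL_KB / 1024 / 10 ))
$ echo "proposed kopia cache hard limit: ${CACHE_LIMIT_MB} MB"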

@Lyndon-Li (Contributor) commented:

How many BSLs do you have and how many namespaces are you backing up?

@mpryc (Contributor, Author) commented Nov 25, 2024:

How many BSLs do you have and how many namespaces are you backing up?

It's a single namespace, a single BSL, and a single pod with a big PV (1TB) that holds a lot of data: many files of around 10GB each.

@Lyndon-Li (Contributor) commented:

Are you using 1.15?

@mpryc (Contributor, Author) commented Nov 25, 2024:

Are you using 1.15?

Forgot to mention that in the bug report, sorry: it's not 1.15, it's 1.14.

@Lyndon-Li (Contributor) commented:

This is a known issue in 1.14 and has been fixed in 1.15, so you could upgrade to 1.15 and do another test.
