ReplicationSource cacheCapacity space available metric #1159

reefland · 2024-03-07T18:44:56Z

Describe the feature you'd like to have.
I was trying to use existing PrometheusRules to alert me if the VolSync cache PVCs are not sized appropriately.

Then I noticed that the PVCs created by VolSync are not visible in Prometheus. Perhaps this is just how VolSync uses PVCs, and kubelet can't gather metrics on PVCs that are not actively mounted.

What is the value to the end user? (why is it a priority?)
The docs state: "This volume contains cached metadata from the backup repository. It must be large enough to hold the non-pruned repository metadata."

I do not know how much space is being used by the restic metadata or how that changes over time.
I would like to increase the cache size before the volume fills up and VolSync backups are impacted.

How will we know we have a good solution? (acceptance criteria)
I'm going to assume that VolSync does not normally mount cache PVCs (and thus kubelet cannot report on them). If this is true, perhaps when a trigger.schedule event happens, VolSync could emit its own metric with the cache capacity, perhaps as percent free? Something like volsync_cache_capacity_available.

Maybe "-1" if unknown (no event triggered yet), otherwise a number between 0 and 100 giving the percentage of capacity left.
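A minimal Python sketch of the semantics proposed above, assuming the used and total byte counts of the cache volume are available from somewhere. volsync_cache_capacity_available is the hypothetical metric name suggested in this issue, not an existing VolSync metric, and the function below is illustrative only.

```python
from typing import Optional

UNKNOWN = -1.0  # sentinel proposed in this issue for "no data yet"


def cache_capacity_available(used_bytes: Optional[int],
                             capacity_bytes: Optional[int]) -> float:
    """Percentage of the cache volume still free, or -1 if unknown.

    Mirrors the semantics proposed for the hypothetical
    volsync_cache_capacity_available metric.
    """
    if used_bytes is None or capacity_bytes is None or capacity_bytes <= 0:
        return UNKNOWN
    # Clamp at zero in case reported usage exceeds capacity.
    free = max(capacity_bytes - used_bytes, 0)
    return 100.0 * free / capacity_bytes


# Example: a 1 GiB cache with 900 MiB used has about 12.1% left.
print(round(cache_capacity_available(900 * 2**20, 2**30), 1))
```

A value like 12.1 would satisfy the `> -1 and < 15` condition in the alert rule, while the -1 sentinel keeps volumes with no data yet from firing it.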

Then I can have an alert like:

- alert: VolSyncCacheVolumeCapacityLow
  annotations:
    summary: >-
      {{ $labels.obj_namespace }}/{{ $labels.obj_name }} cache volume space is almost full.
      Increase size of cacheCapacity value.
    description: >-
      {{ $labels.obj_namespace }}/{{ $labels.obj_name }} cache volume space is < 15%.
      VALUE = {{ $value }}
  expr: |
    volsync_cache_capacity_available > -1 and volsync_cache_capacity_available < 15
  for: 15m
  labels:
    severity: critical


tesshuflower · 2024-03-07T21:53:07Z

The VolSync controller doesn't mount the restic cache PVC itself; it is mounted into the mover pod by the job that runs during a sync. Can you see stats while the mover job is running?

As such, I'm not sure we want to try to capture this usage data and have it sent back to the controller to emit as events.
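As a rough illustration of the measurement being discussed: a process that has the cache volume mounted (such as the mover pod during a sync) could read the filesystem's usage directly. This is a minimal Python sketch using `os.statvfs` (Unix only), assuming a hypothetical mount path `/cache`; it is not how VolSync's mover actually behaves.

```python
import os


def volume_percent_free(mount_path: str) -> float:
    """Return the percentage of space free on the filesystem at mount_path.

    Uses f_bavail (blocks available to unprivileged users), which is what
    `df` reports as "Avail". Returns -1 if the filesystem reports no blocks.
    """
    st = os.statvfs(mount_path)
    total = st.f_blocks * st.f_frsize
    if total == 0:
        return -1.0  # e.g. pseudo filesystems that report no blocks
    available = st.f_bavail * st.f_frsize
    return 100.0 * available / total


if __name__ == "__main__":
    # "/cache" is a hypothetical mount point; fall back to "/" so the
    # sketch runs anywhere.
    path = "/cache" if os.path.isdir("/cache") else "/"
    print(f"{path}: {volume_percent_free(path):.1f}% free")
```

The catch raised in this thread still applies: the number only exists while the mover pod runs, so it would have to be reported back somewhere to be scraped between syncs.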

Depending on your CSI driver, maybe it's possible to get some stats via volume health monitoring? https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1432-volume-health-monitor#kubelet-metrics-changes

I've never looked into this myself, but it looks like VolumeUsage could potentially be reported.

reefland · 2024-03-08T00:40:08Z

The kubelet_volume_stats_* metrics contain the data I want, such as used_bytes and capacity_bytes, but none of the PVCs created by VolSync are listed. Perhaps the mover pods have the cache volume mounted so briefly that kubelet never collects stats while it is mounted?

kube_persistentvolume_capacity_bytes does include the volumes backing the PVCs created by VolSync, but only the total capacity of each volume is available; the kube_persistentvolume_* metrics do not contain any usage information.
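One way to confirm whether VolSync's PVCs ever appear is to fetch kubelet's metrics output and group the kubelet_volume_stats_* samples by their `namespace` and `persistentvolumeclaim` labels. Below is a minimal stdlib Python sketch under that assumption. It only handles simple text-exposition lines like those in the example and is not a full Prometheus parser. The namespace and PVC names are made up. In PromQL the same ratio would be `100 * kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes`.

```python
import re
from typing import Dict, Tuple

# Matches simple exposition lines such as:
# kubelet_volume_stats_used_bytes{namespace="app",persistentvolumeclaim="data"} 1.5e+06
SAMPLE_RE = re.compile(
    r'^(?P<name>kubelet_volume_stats_\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')
PREFIX = "kubelet_volume_stats_"


def volume_stats(text: str) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Group kubelet_volume_stats_* samples by (namespace, pvc)."""
    stats: Dict[Tuple[str, str], Dict[str, float]] = {}
    for line in text.splitlines():
        m = SAMPLE_RE.match(line)
        if not m:
            continue  # skip # HELP / # TYPE comments and other metrics
        labels = dict(LABEL_RE.findall(m.group("labels")))
        key = (labels.get("namespace", ""), labels.get("persistentvolumeclaim", ""))
        metric = m.group("name")[len(PREFIX):]
        stats.setdefault(key, {})[metric] = float(m.group("value"))
    return stats


def percent_available(sample: Dict[str, float]) -> float:
    """Percent of capacity still available, or -1 if the data is missing."""
    cap = sample.get("capacity_bytes", 0.0)
    if cap <= 0 or "available_bytes" not in sample:
        return -1.0
    return 100.0 * sample["available_bytes"] / cap


EXAMPLE = """\
# TYPE kubelet_volume_stats_capacity_bytes gauge
kubelet_volume_stats_capacity_bytes{namespace="app",persistentvolumeclaim="data"} 1e+09
kubelet_volume_stats_available_bytes{namespace="app",persistentvolumeclaim="data"} 2.5e+08
"""

stats = volume_stats(EXAMPLE)
print(percent_available(stats[("app", "data")]))  # prints 25.0
```

If a VolSync cache PVC never shows up as a key in `stats`, kubelet simply isn't reporting it, which matches what is described above.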

I was unable to find anything about "VolumeUsage" beyond the KEP linked above.

reefland added the enhancement (New feature or request) label Mar 7, 2024

JohnStrunk added this to VolSync project tracking Mar 7, 2024
