Report disk usage, fix low disk usage recovery, update EVE overhead limits to avoid running out #3667

eriknordmark · 2023-12-13T21:10:15Z

This PR addresses several issues related to trying to use "all" storage on the device. They include:

reporting all of the used directory paths and zvols in /persist; that is important to track down disk usage in the field
fix the recovery login in onboot.sh to work with ZFS and also change it to trigger at <4Gb of space left (and not 70% used as today)
Make the EVE overhead be more dynamic (include 2Gb for EVE + configured limit for newlog) and update the EVE number if we are currently using more in /persist e.g., for netdump or whatever

Separately we should probably also make it so that the system containerd can run without needing to allocate space in /persist.

If something uses up too much space in /persist, the console output during boot will include something like this:
Free space in /persist: 2369 MBytes
Free space in /persist is only 2369 hence below the limit 4096 MBytes
Free space in /persist after clearing /persist/log: 24917 MBytes
Free space in /persist after clearing /persist/log and 2s sleep: 24917 MBytes
Free space in /persist after recovery: 24917 MBytes

[The sleep is needed due to ZFS delayed deletes. The above output is with an ext4 /persist.]

codecov · 2023-12-13T21:23:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 17.51%. Comparing base (2c5fb18) to head (04892db).
Report is 72 commits behind head on master.

❗ Current head 04892db differs from pull request most recent head 81f192d

Please upload reports for the commit 81f192d to get more accurate results.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #3667   +/-   ##
=======================================
  Coverage   17.51%   17.51%           
=======================================
  Files           3        3           
  Lines         805      805           
=======================================
  Hits          141      141           
  Misses        629      629           
  Partials       35       35

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pkg/pillar/types/diskmetrics.go

pkg/pillar/diskmetrics/usage.go

pkg/mkimage-raw-efi/install

pkg/pillar/diskmetrics/usage.go

eriknordmark · 2024-07-31T23:58:53Z

@OhmSpectator @rouming I've taken care of all of the comments (and sanitized the set of commits). Please review and approve as appropriate.

Also, we should presumably use a directory in /persist for the eve-info to make it easier to clean that up. I can do a separate PR for that if folks agree.

OhmSpectator · 2024-08-01T12:44:13Z

Let's find a time to chat. Not clear what they would want to remove (e.g., is it removing everything in /persist?)

@eriknordmark, as far as I understand, they want to clean up some space during the tests. Initially, they only wanted to clean up the logs. I recommended removing the contents of the directories mentioned in the onboot.sh script, and it was fine for them. So, if we add it as an eve command, it could help.

Yep, we can discuss it briefly during the call on Thursday (the Community Call) or on Friday (we have a 1-to-1 call scheduled, as I remember).

pkg/pillar/scripts/onboot.sh

pkg/pillar/types/diskmetrics.go

pkg/pillar/diskmetrics/usage.go

pkg/pillar/types/diskmetrics.go

pkg/pillar/diskmetrics/usage.go

milan-zededa · 2024-08-02T10:02:22Z

pkg/pillar/diskmetrics/usage.go

+			return nil
+		}
+
+		info, err := os.Stat(path)


Btw. @andrewd-zededa made a change recently where he limited the number of Stat calls since it apparently can be expensive to call on each file: #4065
I don't know if the same consideration needs to apply for FindLargeFiles as well..

This is my only concern as well. os.Stat is expensive, as you have to read each file inode. Unfortunately, there is no easier way to do it that I can think of. The way ext4 is built, the directory entries include only the name of the entry and the type. If you want the size, you need to read the inode, for which you need os.Stat().

As far as I can see #4065 just saves by not calling os.Stat() on files outside of a given path, which I think this does already.

If you really want to get fancy, I have a library that knows how to read ext4 natively. It can load the entire inode table into memory, and you can use that to replace the os.Stat() for each one with an in-memory lookup. It won't save you on the directory tree reading, but it will save you all of the os.Stat() calls. It doesn't have a public interface for doing that, but I don't mind adding it.

I would suggest that we put this through its paces, including performance test. If that becomes an issue, worry about it then.

We're only doing this in unknown directories in /persist - all of the known places are in the exclude paths.
This is to catch somebody manually e.g., putting huge tar/iso/img files in /persist/foo.iso or /persist/bar/foo.iso.
Thus the number of things we stat should be very small unless somebody manually dumps a huge directory tree manually into /persist.
Let me get a count on the number of calls.

If it's that small, then don't worry too much about it. And definitely don't use that library.

@milan-zededa @deitch I checked a default freshly installed EVE device and there are only two files which get Stat'ed in this loop (/persist/SMART_details.json and /persist/SMART_details_previous.json). And the FindLargeFiles is called every 20 seconds. So this doesn't seem to be a risk for load.
But it was a good idea to check this so thanks for the comment!

milan-zededa

LGTM, just minor comments
Let's get test results

milan-zededa · 2024-08-02T11:57:07Z

We do have some regression here:
Smoke tests with tpm=true fail here:

> test eden.lim.test -test.v -timewait 5m -test.run TestInfo -out InfoContent.dinfo.HSMStatus 'InfoContent.dinfo.HSMStatus:ENABLED'

HSMStatus continues being published as DISABLED even though we run with (sw)tpm.
(you can see what eve published inside the artifact in the file info.log)

Edit: I see this failed already here: #4115
@shjala Can you please investigate?

Signed-off-by: eriknordmark <[email protected]>

eriknordmark requested review from OhmSpectator and rouming December 13, 2023 21:10

eriknordmark requested a review from milan-zededa December 14, 2023 10:01

OhmSpectator reviewed Dec 14, 2023

View reviewed changes

pkg/pillar/types/diskmetrics.go Show resolved Hide resolved

eriknordmark force-pushed the emerg branch 2 times, most recently from 2db4646 to 1df5061 Compare December 15, 2023 22:24

eriknordmark requested a review from OhmSpectator December 15, 2023 22:41

OhmSpectator approved these changes Dec 15, 2023

View reviewed changes

pkg/pillar/types/diskmetrics.go Outdated Show resolved Hide resolved

eriknordmark force-pushed the emerg branch from 1df5061 to fdccf5e Compare December 17, 2023 22:50

eriknordmark requested a review from deitch as a code owner December 25, 2023 01:21

github-actions bot requested a review from OhmSpectator December 25, 2023 01:22

OhmSpectator reviewed Dec 25, 2023

View reviewed changes

pkg/pillar/diskmetrics/usage.go Outdated Show resolved Hide resolved

pkg/pillar/diskmetrics/usage.go Show resolved Hide resolved

eriknordmark force-pushed the emerg branch from 701d78f to f597d91 Compare January 3, 2024 21:05

eriknordmark requested a review from zedi-pramodh as a code owner January 3, 2024 21:05

github-actions bot requested review from jsfakian, naiming-zededa and OhmSpectator January 3, 2024 21:06

zedi-pramodh reviewed Jan 3, 2024

View reviewed changes

pkg/mkimage-raw-efi/install Outdated Show resolved Hide resolved

github-actions bot requested a review from zedi-pramodh January 11, 2024 21:41

eriknordmark force-pushed the emerg branch 2 times, most recently from 37c8734 to 49902b4 Compare January 12, 2024 21:00

eriknordmark changed the title ~~volumemgr: Update list of paths used to calculate disk usage~~ [WIP] Report disk usage, fix low disk usage recovery, update EVE overhead limits to avoid running out Jan 12, 2024

eriknordmark force-pushed the emerg branch from 49902b4 to eda47f9 Compare January 19, 2024 09:12

eriknordmark force-pushed the emerg branch 2 times, most recently from 6608f86 to 7e2d51e Compare February 6, 2024 17:06

rouming reviewed Feb 12, 2024

View reviewed changes

pkg/pillar/diskmetrics/usage.go Outdated Show resolved Hide resolved

eriknordmark force-pushed the emerg branch from 7e2d51e to 2519cc8 Compare February 13, 2024 22:51

github-actions bot requested a review from rouming February 13, 2024 22:52

eriknordmark force-pushed the emerg branch from 2519cc8 to 2802369 Compare February 23, 2024 13:22

OhmSpectator reviewed Aug 1, 2024

View reviewed changes

pkg/pillar/scripts/onboot.sh Outdated Show resolved Hide resolved

pkg/pillar/scripts/onboot.sh Show resolved Hide resolved

pkg/pillar/types/diskmetrics.go Outdated Show resolved Hide resolved

rouming approved these changes Aug 1, 2024

View reviewed changes

christoph-zededa reviewed Aug 1, 2024

View reviewed changes

pkg/pillar/diskmetrics/usage.go Outdated Show resolved Hide resolved

deitch reviewed Aug 1, 2024

View reviewed changes

pkg/pillar/diskmetrics/usage.go Outdated Show resolved Hide resolved

eriknordmark force-pushed the emerg branch from d776579 to 51d37aa Compare August 1, 2024 18:05

github-actions bot requested review from deitch, OhmSpectator and rouming August 1, 2024 18:06

eriknordmark force-pushed the emerg branch from 51d37aa to 726ddf9 Compare August 1, 2024 18:14

milan-zededa reviewed Aug 2, 2024

View reviewed changes

pkg/pillar/types/diskmetrics.go Outdated Show resolved Hide resolved

milan-zededa reviewed Aug 2, 2024

View reviewed changes

pkg/pillar/diskmetrics/usage.go Outdated Show resolved Hide resolved

milan-zededa reviewed Aug 2, 2024

View reviewed changes

milan-zededa approved these changes Aug 2, 2024

View reviewed changes

milan-zededa mentioned this pull request Aug 2, 2024

Move tpm2.OpenTPM call to CreateKey #4127

Merged

eriknordmark added 11 commits August 2, 2024 09:55

volumemgr: Remove reporting of old /persist dirs

2901990

Signed-off-by: eriknordmark <[email protected]>

diskmetrics: Update list of paths used to calculate disk usage

57413f1

Signed-off-by: eriknordmark <[email protected]>

Introduce new location constants

e9f67b0

Signed-off-by: eriknordmark <[email protected]>

diskmetrics: Add FindLargeFiles()

1d6109e

Signed-off-by: eriknordmark <[email protected]>

volumemgr: report large files in /persist to controller

8c3c114

Signed-off-by: eriknordmark <[email protected]>

Report (as DiskMetric) all ZFS datasets using unique names

0938d47

Signed-off-by: eriknordmark <[email protected]>

volumemgr: Publish remainingspace for nodeagent to use

6bce3dc

Signed-off-by: eriknordmark <[email protected]>

nodeagent: If RemainingSpace is zero then go into MaintenanceMode

9a25e5f

Signed-off-by: eriknordmark <[email protected]>

diskmetrics: Dom0DiskReservedSize track actual; use newlog upper limit

fcfe247

Signed-off-by: eriknordmark <[email protected]>

volumemgr: sizemgmt with currentAppDiskUsage

7a5d713

Signed-off-by: eriknordmark <[email protected]>

onboot: recover from low diskspace using more dirs and for ZFS

5d327ed

Signed-off-by: eriknordmark <[email protected]>

eriknordmark force-pushed the emerg branch from 726ddf9 to 5d327ed Compare August 2, 2024 17:45

eriknordmark merged commit ed3eff9 into lf-edge:master Aug 2, 2024
39 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Report disk usage, fix low disk usage recovery, update EVE overhead limits to avoid running out #3667

Report disk usage, fix low disk usage recovery, update EVE overhead limits to avoid running out #3667

eriknordmark commented Dec 13, 2023 •

edited

Loading

codecov bot commented Dec 13, 2023 •

edited

Loading

eriknordmark commented Jul 31, 2024

OhmSpectator commented Aug 1, 2024 •

edited

Loading

milan-zededa Aug 2, 2024

deitch Aug 2, 2024

eriknordmark Aug 2, 2024

deitch Aug 2, 2024

eriknordmark Aug 2, 2024 •

edited

Loading

milan-zededa left a comment •

edited

Loading

milan-zededa commented Aug 2, 2024 •

edited

Loading

Report disk usage, fix low disk usage recovery, update EVE overhead limits to avoid running out #3667

Report disk usage, fix low disk usage recovery, update EVE overhead limits to avoid running out #3667

Conversation

eriknordmark commented Dec 13, 2023 • edited Loading

codecov bot commented Dec 13, 2023 • edited Loading

Codecov Report

eriknordmark commented Jul 31, 2024

OhmSpectator commented Aug 1, 2024 • edited Loading

milan-zededa Aug 2, 2024

Choose a reason for hiding this comment

deitch Aug 2, 2024

Choose a reason for hiding this comment

eriknordmark Aug 2, 2024

Choose a reason for hiding this comment

deitch Aug 2, 2024

Choose a reason for hiding this comment

eriknordmark Aug 2, 2024 • edited Loading

Choose a reason for hiding this comment

milan-zededa left a comment • edited Loading

Choose a reason for hiding this comment

milan-zededa commented Aug 2, 2024 • edited Loading

eriknordmark commented Dec 13, 2023 •

edited

Loading

codecov bot commented Dec 13, 2023 •

edited

Loading

OhmSpectator commented Aug 1, 2024 •

edited

Loading

eriknordmark Aug 2, 2024 •

edited

Loading

milan-zededa left a comment •

edited

Loading

milan-zededa commented Aug 2, 2024 •

edited

Loading