Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report disk usage, fix low disk usage recovery, update EVE overhead limits to avoid running out #3667

Merged
merged 11 commits into from
Aug 2, 2024

Conversation

eriknordmark
Copy link
Contributor

@eriknordmark eriknordmark commented Dec 13, 2023

This PR addresses several issues related to trying to use "all" storage on the device. They include:

  • reporting all of the used directory paths and zvols in /persist; that is important to track down disk usage in the field
  • fix the recovery login in onboot.sh to work with ZFS and also change it to trigger at <4Gb of space left (and not 70% used as today)
  • Make the EVE overhead be more dynamic (include 2Gb for EVE + configured limit for newlog) and update the EVE number if we are currently using more in /persist e.g., for netdump or whatever

Separately we should probably also make it so that the system containerd can run without needing to allocate space in /persist.

If something uses up too much space in /persist, the console output during boot will include something like this:
Free space in /persist: 2369 MBytes
Free space in /persist is only 2369 hence below the limit 4096 MBytes
Free space in /persist after clearing /persist/log: 24917 MBytes
Free space in /persist after clearing /persist/log and 2s sleep: 24917 MBytes
Free space in /persist after recovery: 24917 MBytes

[The sleep is needed due to ZFS delayed deletes. The above output is with an ext4 /persist.]

Copy link

codecov bot commented Dec 13, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 17.51%. Comparing base (2c5fb18) to head (04892db).
Report is 72 commits behind head on master.

Current head 04892db differs from pull request most recent head 81f192d

Please upload reports for the commit 81f192d to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3667   +/-   ##
=======================================
  Coverage   17.51%   17.51%           
=======================================
  Files           3        3           
  Lines         805      805           
=======================================
  Hits          141      141           
  Misses        629      629           
  Partials       35       35           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@eriknordmark eriknordmark force-pushed the emerg branch 2 times, most recently from 2db4646 to 1df5061 Compare December 15, 2023 22:24
pkg/pillar/types/diskmetrics.go Outdated Show resolved Hide resolved
pkg/pillar/diskmetrics/usage.go Outdated Show resolved Hide resolved
pkg/pillar/diskmetrics/usage.go Show resolved Hide resolved
@github-actions github-actions bot requested a review from zedi-pramodh January 11, 2024 21:41
@eriknordmark eriknordmark force-pushed the emerg branch 2 times, most recently from 37c8734 to 49902b4 Compare January 12, 2024 21:00
@eriknordmark eriknordmark changed the title volumemgr: Update list of paths used to calculate disk usage [WIP] Report disk usage, fix low disk usage recovery, update EVE overhead limits to avoid running out Jan 12, 2024
@eriknordmark eriknordmark force-pushed the emerg branch 2 times, most recently from 6608f86 to 7e2d51e Compare February 6, 2024 17:06
@eriknordmark
Copy link
Contributor Author

@OhmSpectator @rouming I've taken care of all of the comments (and sanitized the set of commits). Please review and approve as appropriate.

Also, we should presumably use a directory in /persist for the eve-info to make it easier to clean that up. I can do a separate PR for that if folks agree.

@OhmSpectator
Copy link
Member

OhmSpectator commented Aug 1, 2024

Let's find a time to chat. Not clear what they would want to remove (e.g., is it removing everything in /persist?)

@eriknordmark, as far as I understand, they want to clean up some space during the tests. Initially, they only wanted to clean up the logs. I recommended removing the contents of the directories mentioned in the onboot.sh script, and it was fine for them. So, if we add it as an eve command, it could help.

Yep, we can discuss it briefly during the call on Thursday (the Community Call) or on Friday (we have a 1-to-1 call scheduled, as I remember).

pkg/pillar/scripts/onboot.sh Outdated Show resolved Hide resolved
pkg/pillar/scripts/onboot.sh Show resolved Hide resolved
pkg/pillar/types/diskmetrics.go Outdated Show resolved Hide resolved
return nil
}

info, err := os.Stat(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw. @andrewd-zededa made a change recently where he limited the number of Stat calls since it apparently can be expensive to call on each file: #4065
I don't know if the same consideration needs to apply for FindLargeFiles as well..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my only concern as well. os.Stat is expensive, as you have to read each file inode. Unfortunately, there is no easier way to do it that I can think of. The way ext4 is built, the directory entries include only the name of the entry and the type. If you want the size, you need to read the inode, for which you need os.Stat().

As far as I can see #4065 just saves by not calling os.Stat() on files outside of a given path, which I think this does already.

If you really want to get fancy, I have a library that knows how to read ext4 natively. It can load the entire inode table into memory, and you can use that to replace the os.Stat() for each one with an in-memory lookup. It won't save you on the directory tree reading, but it will save you all of the os.Stat() calls. It doesn't have a public interface for doing that, but I don't mind adding it.

I would suggest that we put this through its paces, including performance test. If that becomes an issue, worry about it then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're only doing this in unknown directories in /persist - all of the known places are in the exclude paths.
This is to catch somebody manually e.g., putting huge tar/iso/img files in /persist/foo.iso or /persist/bar/foo.iso.
Thus the number of things we stat should be very small unless somebody manually dumps a huge directory tree manually into /persist.
Let me get a count on the number of calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's that small, then don't worry too much about it. And definitely don't use that library.

Copy link
Contributor Author

@eriknordmark eriknordmark Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milan-zededa @deitch I checked a default freshly installed EVE device and there are only two files which get Stat'ed in this loop (/persist/SMART_details.json and /persist/SMART_details_previous.json). And the FindLargeFiles is called every 20 seconds. So this doesn't seem to be a risk for load.
But it was a good idea to check this so thanks for the comment!

Copy link
Contributor

@milan-zededa milan-zededa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just minor comments
Let's get test results

@milan-zededa
Copy link
Contributor

milan-zededa commented Aug 2, 2024

We do have some regression here:
Smoke tests with tpm=true fail here:

> test eden.lim.test -test.v -timewait 5m -test.run TestInfo -out InfoContent.dinfo.HSMStatus 'InfoContent.dinfo.HSMStatus:ENABLED'

HSMStatus continues being published as DISABLED even though we run with (sw)tpm.
(you can see what eve published inside the artifact in the file info.log)

Edit: I see this failed already here: #4115
@shjala Can you please investigate?

@eriknordmark eriknordmark merged commit ed3eff9 into lf-edge:master Aug 2, 2024
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants