CI: Runner disk usage #869
This is becoming urgent: PRs are blocked on CI runs that are failing because the runners are out of disk space. Is there a short-term fix we can put in place? E.g. could we pay $$ for a month and buy some time to reorganise/refine? Or is the cost prohibitive / would it require significant setup?
I've cleaned up the saved caches (ranging from 1.5-2.5 GB each), any one of which may have been mounted into the build agent. Let's see if builds can make further progress. As a potential next step, what do we think about breaking the current do-all build.yml into more workflows? Each workflow we break out means a new runner host to spread the disk usage across. I could see having a compile/build workflow, a unit-test workflow, and an integration/e2e workflow. Thoughts?
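For illustration, a rough sketch of what one of the split-out workflows might look like (the file name, job layout, and steps below are hypothetical, not taken from the current build.yml):

```yaml
# .github/workflows/unit-test.yml (hypothetical) -- one of several smaller
# workflows split out of the do-all build.yml, so each gets its own runner
# and its own 14 GB of disk.
name: unit-test
on:
  pull_request:
    branches: [main]
jobs:
  unit-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Rust toolchain
        uses: actions-rs/toolchain@v1
        with:
          toolchain: stable
      - name: Run unit tests
        run: cargo test --lib --all-features
```

The compile/build and integration/e2e workflows would follow the same shape, each carrying only the dependencies it actually needs.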
Nice one @vdice - things are looking back on track! Thanks! Do you have a sense of how long this buys us - e.g. days, weeks, months?
Yay! Alas, I don't know how long this will help... and I see that we've already gone over the cache max with the recent CI (re-)runs. We can view a list here: https://github.com/fermyon/spin/actions/caches It doesn't look like GitHub automagically deletes caches for PR merge branches once the PR itself is merged; so to clear up space, we'd need to delete those manually. I'm still unclear on the interaction between the caches and runner disk usage... it needs more research.
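As a side note, stale PR caches could in principle be pruned automatically. A minimal sketch of a scheduled cleanup job, assuming the documented Actions cache REST API and the gh CLI (we don't run anything like this today):

```yaml
name: cache-cleanup
on:
  schedule:
    - cron: "0 6 * * *"          # once a day
jobs:
  prune:
    runs-on: ubuntu-latest
    permissions:
      actions: write             # required to delete caches
    steps:
      - name: Delete caches scoped to PR merge refs
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # List up to 100 caches, keep only those created on refs/pull/...,
          # then delete each one via the cache REST API.
          gh api "repos/${{ github.repository }}/actions/caches?per_page=100" \
            --jq '.actions_caches[] | select(.ref | startswith("refs/pull/")) | .id' |
          while read -r id; do
            gh api -X DELETE "repos/${{ github.repository }}/actions/caches/${id}"
          done
```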
Note: the following numbers were captured when running the action with the ubuntu runner only, on a fork in my user account.
From the above table it seems (probably as expected) that we can clean up some of the downloaded modules after these steps to reclaim disk space. I am wondering how many of these we may need during later steps.
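A sketch of the kind of mid-workflow cleanup this suggests (the paths and step placement below are illustrative; the real candidates would come from the measurements above):

```yaml
# Fragment of a job's steps, slotted in after the heavy download/build steps.
      - name: Report and reclaim disk space
        if: always()
        run: |
          df -h /                                   # how full are we?
          rm -rf ~/.cargo/registry/cache || true    # downloaded .crate archives
          rm -rf target/debug/incremental || true   # incremental build artifacts
          df -h /                                   # ...and what we got back
```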
Whomping us again already!
I've deleted all caches again. Please retry the failed workflows. It would be odd if actually persisting caches were causing the issue, but apparently mounting a given cache into a runner seems to be tipping the scales into maxing out the disk. If that helps yet again, I suggest we temporarily remove our use of these caches (or tweak the configuration, perhaps to not cache a subset of the filesystem?) until we have cycles to work on some of the items mentioned in #869 (comment)
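To make the "cache a smaller subset of the filesystem" option concrete, here is a sketch using actions/cache directly with a narrower path, so the large target/ directory is no longer cached (the key names and paths are illustrative, not what build.yml currently does):

```yaml
# Step fragment: cache only the cargo registry and git checkouts, not target/.
      - name: Cache cargo registry only
        uses: actions/cache@v3
        with:
          path: |
            ~/.cargo/registry/index
            ~/.cargo/registry/cache
            ~/.cargo/git/db
          key: cargo-registry-${{ hashFiles('**/Cargo.lock') }}
          restore-keys: |
            cargo-registry-
```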
Something is blocking the integration tests, and I'm not sure if it's this or something else, because the symptom is timing out while waiting for Spin to start serving. It works locally, but it does seem reasonably consistent about the point at which it fails. I don't think it's anything in my PR, because that doesn't touch the integration tests (and the integration tests shouldn't be affected by changes to the templates system). Other PRs have passed (and these did, before), so maybe we're just so close to the limit that something builds up between morning and afternoon and that's enough to tip it over?
@itowlson I'm also getting this sporadic CI failure on macOS in my PRs.
I did more analysis on this and tried removing the caching from the workflow; without it, the build takes approximately 10 minutes longer. We can possibly go ahead with this (to unblock the PRs) and continue investigating. If that sounds OK, I will open a PR for this and continue my investigation in the meantime.
I had a chat with @radu-matei about this, and we propose bypassing the caching and doing fully clean builds every time (not sure if this is the same as your suggestion, @rajatjindal). Builds will be a lot slower but hopefully more reliable. Then we can take stock and look at longer-term optimisation (it sounds like @endocrimes has some ideas on this). @vdice I don't know how to do this - is it something you could set up for us, please? The urgency is fairly high, I'm afraid, because CI is really fouling up the PR pipeline right now. Thanks!
I'm afraid disabling the cache step doesn't help -- or at least it is not guaranteed to avoid the behavior: https://github.com/fermyon/spin/actions/runs/3483294334/jobs/5826621776
#904 brings a much-needed refactor which should greatly help avoid the issues seen here, now that good chunks of work have been split out into separate runners for a given workflow. Big thanks to @endocrimes for the improvements 🎉 There may be refinements in this area we still wish to make, e.g. to improve build times, but I'll close this one as completed. Feel free to break out specific follow-ups as you see fit.
With the growing list of test dependencies, it appears we're running into the 14 GB disk size limit of the default GitHub-hosted runner.
Ref #798 (comment)
One initial question: are we getting bitten by our use of the Rust cache? Are we able to clear it? Even with the growing list of dependencies for e2e testing (bindle, hippo, soon Vault), I wouldn't expect to exceed 14 GB of disk.
Failing that, do we want to get creative in our tests and install deps just in time, then remove them after the corresponding tests have completed? (E.g. install bindle, run the tests that require bindle, remove bindle, move on to the next tests that require external deps.) Or would this get unwieldy/fragile?
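For what it's worth, a sketch of that just-in-time pattern for a single external dependency; the install/remove commands are placeholders (a hypothetical scripts/install-bindle.sh helper and install paths), and the test filter is illustrative:

```yaml
# Step fragment: install a dependency, run only the tests that need it,
# then remove it before the next dependency is installed.
      - name: Install bindle
        run: ./scripts/install-bindle.sh          # hypothetical helper script
      - name: Run bindle-dependent tests
        run: cargo test --test integration -- bindle
      - name: Remove bindle
        if: always()
        run: |
          rm -rf ~/.bindle                        # hypothetical data directory
          sudo rm -f /usr/local/bin/bindle /usr/local/bin/bindle-server
```

The if: always() guard matters here: if the bindle tests fail, the cleanup still runs, so later steps start from a known amount of free disk.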
If other methods like the above don't help, do we want to: