CI: Runner disk usage #869

Closed
vdice opened this issue Nov 1, 2022 · 13 comments

@vdice
Member

vdice commented Nov 1, 2022

With the growing list of test dependencies, it appears we're running into the 14GB disk space limit of the default GitHub-hosted runner.

Ref #798 (comment)

One initial question: are we getting bitten by our use of the Rust cache? Are we able to clear it? Even with the growing list of dependencies for e2e testing (bindle, hippo, and soon Vault), I wouldn't expect to exceed 14GB of disk.

Failing that, do we want to get creative in our tests and install deps just in time, then remove them after the corresponding tests have completed (e.g. install bindle, run the tests that require bindle, remove bindle, then move on to the next tests that need external deps)? Or would this get unwieldy/fragile? There's a rough sketch of this at the end of this comment.

If other methods like the above don't help, do we want to:

  1. Pay $$ for larger runners
  2. Use self-hosted runners with deps pre-installed
  3. Other?
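
To make the just-in-time idea above concrete, here is a minimal sketch of what such a sequence of steps could look like. The step names, URL variable, make target, and binary paths are placeholders, not the actual workflow contents:

```yaml
# Illustrative only: install a dependency, run just the tests that need it,
# then remove it before the next group of tests runs.
- name: Install bindle
  run: curl -sSL "$BINDLE_RELEASE_URL" | tar xz -C /usr/local/bin   # placeholder URL
- name: Run bindle-dependent tests
  run: make test-e2e-bindle                                         # placeholder target
- name: Remove bindle to reclaim disk space
  run: rm -f /usr/local/bin/bindle /usr/local/bin/bindle-server
```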
@vdice vdice added this to Spin Triage Nov 4, 2022
@vdice vdice moved this to 🆕 Triage Needed in Spin Triage Nov 4, 2022
@itowlson itowlson added the ci label Nov 6, 2022
@itowlson itowlson moved this from 🆕 Triage Needed to 🔖 Backlog in Spin Triage Nov 6, 2022
@itowlson
Contributor

This is becoming urgent - PRs are blocking on CI runs that are failing due to running out of disk space.

Is there a short-term fix we can get in place? E.g. could we pay $$ for a month and buy some time to reorganise/refine? Or is the cost prohibitive/would it require significant setup?

@vdice
Member Author

vdice commented Nov 14, 2022

I've cleaned up the saved caches (ranging from 1.5-2.5GB), any one of which may have been mounted into the build agent. Let's see if builds can make further progress.

As a potential next step, what do we think about breaking up the current do-all build.yml into more workflows? Each workflow we break out means a new runner host to spread the disk usage load. I could see having a compile/build workflow, a unit-test workflow and an integration/e2e workflow. Thoughts?
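
A rough sketch of what one of the split-out workflows could look like; the file name, triggers, and the integration make target are illustrative assumptions, not the actual configuration:

```yaml
# .github/workflows/test.yml (illustrative sketch)
name: test
on: [push, pull_request]

jobs:
  # Each job runs on its own runner, so each job gets its own ~14GB of free disk.
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: make test-unit

  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: make test-integration   # placeholder target name
```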

@itowlson
Contributor

Nice one @vdice - things looking back on track! Thanks!

Do you have a sense for how long this buys us - e.g. days, weeks, months?

@vdice
Member Author

vdice commented Nov 14, 2022

Yay!

Alas, I don't know how long this will help... and I see that we've already gone over the cache max with the recent CI (re-)runs. We can view a list here: https://github.com/fermyon/spin/actions/caches

It doesn't look like GitHub automagically deletes caches for PR merge branches once the PR itself is merged; so to clear up space, we'd need to manually delete those.

I'm still unclear on the interaction between the caches and runner disk usage... needs more researching.

@rajatjindal
Collaborator

Note: the following numbers were captured when running the action on ubuntu only, on a fork in my user account.

| Step | Disk space remaining after step |
| --- | --- |
| Determining the checkout info | 30G |
| Run actions-rs/toolchain@v1 | 30G |
| Run rustup target add wasm32-wasi | 30G |
| Run engineerd/[email protected] | 30G |
| Run engineerd/[email protected] | 30G |
| Run curl -L https://github.com/deislabs/hippo/releases/download/v0.19.0/hippo-server-linux-x64.tar.gz -o hippo-server.tar.gz | 29G |
| Run Swatinem/rust-cache@v2 | 29G |
| Run cargo fmt --all -- --check | 29G |
| Run cargo clippy --workspace --all-targets --all-features -- -D warnings | 26G |
| Run cargo build --workspace --all-targets --all-features --features openssl/vendored | 17G |
| Run make test-unit | 9.2G |
| Run actions/setup-go@v3 | 9.2G |
| Run wget https://github.com/tinygo-org/tinygo/releases/download/v0.22.0/tinygo_0.22.0_amd64.deb | 8.4G |
| Run engineerd/[email protected] | 8.4G |
| Run make test-sdk-go | 8.4G |
| Run actions/upload-artifact@v3 | 8.4G |

From the above table it seems (probably as expected) that cargo clippy, cargo build, and make test-unit are the major disk space consumers here.

Can we clean up some of the downloaded modules and build artifacts after these steps to reclaim disk space? I'm wondering how much of this we still need during the Go SDK test, and whether removing it would increase the execution time. I will investigate this more today.
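
For anyone wanting to reproduce these measurements, a simple approach is to interleave a disk-report step between the existing steps; this is just a sketch, and the step name is arbitrary:

```yaml
- name: Report remaining disk space
  run: df -h /
```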

@itowlson
Contributor

Whomping us again already!

@vdice
Member Author

vdice commented Nov 15, 2022

I've deleted all caches again. Please retry failed workflows.

It would be odd if persisting the caches were itself causing the issue, but mounting a given cache into a runner does seem to be tipping the scales toward maxing out the disk.

If that helps yet again, I suggest we temporarily remove our use of these caches (or tweak the configuration, perhaps to exclude a subset of the filesystem from the cache) until we have cycles to work on some of the items mentioned in #869 (comment)
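
A sketch of what that could look like in build.yml; note that the `cache-targets` input is recalled from memory and should be verified against the rust-cache documentation before relying on it:

```yaml
# Option A: drop the cache step entirely for now.
# - uses: Swatinem/rust-cache@v2

# Option B (assumption): keep the step but cache less. `cache-targets: false` is
# believed to skip caching the target/ directory; verify this input exists.
- uses: Swatinem/rust-cache@v2
  with:
    cache-targets: false
```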

@itowlson
Contributor

itowlson commented Nov 16, 2022

Something is blocking integration tests and I'm not sure if it's this or something else, because the symptom is timing out waiting for Spin to start serving. It works locally but it does seem reasonably consistent about the point at which it fails. I don't think it's anything in my PR because that doesn't touch the integration tests (and the integration tests shouldn't be affected by changes to the templates system). Other PRs have passed (and these did, before) so maybe we're just so close to the limit that something builds up between morning and afternoon and that's enough to tip it over?

@etehtsea
Contributor

@itowlson I’m also getting this sporadic CI failure on macOS in my PRs.

@rajatjindal
Collaborator

I did more analysis on this and tried removing the target directory after the cargo steps:

| Command | Time taken if target dir was cleaned before the command | Time taken if target dir was retained |
| --- | --- | --- |
| cargo clippy | ~360s | 700s |
| cargo build | ~660s | 520s |
| make test-unit | ~920s | 614s |

So, approximately 10 minutes longer overall.

We could possibly go ahead with this (to unblock the PRs) and continue investigating. If that sounds OK, I will open a PR for this and continue my investigation in the meantime.
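
A sketch of the proposed cleanup step, placed after the heavy cargo steps; `cargo clean` removes the workspace's target directory, at the cost of recompiling anything needed later:

```yaml
- name: Reclaim disk space after cargo steps
  run: |
    df -h /
    cargo clean    # removes the target/ directory
    df -h /
```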

@itowlson
Contributor

I had a chat with @radu-matei about this, and we propose to bypass the caching and do fully clean builds every time (not sure if this is the same as your suggestion @rajatjindal). Builds will be a lot slower but hopefully more reliable. Then we can take stock and look at longer-term optimisation (it sounds like @endocrimes has some ideas on this).

@vdice I don't know how to do this - is it something you could set up for us please? The urgency is fairly high, I'm afraid, because CI is really fouling up the PR pipeline right now. Thanks!

@vdice
Member Author

vdice commented Nov 16, 2022

I'm afraid disabling the cache step doesn't help -- or at least is not guaranteed to avoid the behavior: https://github.com/fermyon/spin/actions/runs/3483294334/jobs/5826621776

@vdice
Member Author

vdice commented Nov 17, 2022

#904 brings a much-needed refactor which should greatly help avoid the issues seen here, now that good chunks of work have been split out into separate runners for a given workflow. Big thanks to @endocrimes for the improvements 🎉

There may be refinements in this area we still wish to make, e.g. to improve build times, but I'll close this one as completed. Feel free to break out specific follow-up issues as you see fit.

@vdice vdice closed this as completed Nov 17, 2022
Repository owner moved this from 🔖 Backlog to ✅ Done in Spin Triage Nov 17, 2022