CI: Runner disk usage #869

Closed
vdice opened this issue Nov 1, 2022 · 13 comments

@vdice
Member

vdice commented Nov 1, 2022

With the growing list of test dependencies, it appears we're running into the 14GB disk space limit of the default GitHub-hosted runner.

Ref #798 (comment)

One initial question: are we getting bitten by our use of the Rust cache? Are we able to clear it? Even with the growing list of dependencies for e2e testing (bindle, hippo, and soon Vault), I wouldn't expect to exceed 14GB of disk.

Failing that, do we want to get creative in our tests and install deps just in time, then remove them after the corresponding tests have completed (e.g. install bindle, run the tests that require bindle, remove bindle, then move on to the next tests that need external deps)? Or would this get unwieldy/fragile? There's a rough sketch of this at the end of this comment.

If other methods like the above don't help, do we want to:

  1. Pay $$ for larger runners
  2. Use self-hosted runners with deps pre-installed
  3. Other?
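
To make the just-in-time idea above concrete, here is a minimal sketch of what such a sequence of steps could look like. The step names, URL variable, make target, and binary paths are placeholders, not the actual workflow contents:

```yaml
# Illustrative only: install a dependency, run just the tests that need it,
# then remove it before the next group of tests runs.
- name: Install bindle
  run: curl -sSL "$BINDLE_RELEASE_URL" | tar xz -C /usr/local/bin   # placeholder URL
- name: Run bindle-dependent tests
  run: make test-e2e-bindle                                         # placeholder target
- name: Remove bindle to reclaim disk space
  run: rm -f /usr/local/bin/bindle /usr/local/bin/bindle-server
```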
@vdice vdice added this to Spin Triage Nov 4, 2022
@vdice vdice moved this to 🆕 Triage Needed in Spin Triage Nov 4, 2022
@itowlson itowlson added the ci label Nov 6, 2022
@itowlson itowlson moved this from 🆕 Triage Needed to 🔖 Backlog in Spin Triage Nov 6, 2022
@itowlson
Contributor

This is becoming urgent - PRs are blocking on CI runs that are failing due to running out of disk space.

Is there a short-term fix we can get in place? E.g. could we pay $$ for a month and buy some time to reorganise/refine? Or is the cost prohibitive/would it require significant setup?

@vdice
Member Author

vdice commented Nov 14, 2022

I've cleaned up the saved caches (ranging from 1.5-2.5GB), any one of which may have been mounted into the build agent. Let's see if builds can make further progress.

As a potential next step, what do we think about breaking up the current do-all build.yml into more workflows? Each workflow we break out means a new runner host to spread the disk usage load. I could see having a compile/build workflow, a unit-test workflow and an integration/e2e workflow. Thoughts?
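
A rough sketch of what one of the split-out workflows could look like; the file name, triggers, and the integration make target are illustrative assumptions, not the actual configuration:

```yaml
# .github/workflows/test.yml (illustrative sketch)
name: test
on: [push, pull_request]

jobs:
  # Each job runs on its own runner, so each job gets its own ~14GB of free disk.
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: make test-unit

  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: make test-integration   # placeholder target name
```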

@itowlson
Contributor

Nice one @vdice - things looking back on track! Thanks!

Do you have a sense for how long this buys us - e.g. days, weeks, months?

@vdice
Member Author

vdice commented Nov 14, 2022

Yay!

Alas, I don't know how long this will help... and I see that we've already gone over the cache max with the recent CI (re-)runs. We can view a list here: https://github.com/fermyon/spin/actions/caches

It doesn't look like GitHub automagically deletes caches for PR merge branches once the PR itself is merged; so to clear up space, we'd need to manually delete those.

I'm still unclear on the interaction between the caches and runner disk usage... needs more researching.

@rajatjindal
Collaborator

Note: the following numbers were captured when running the action on ubuntu only, on a fork in my user account.

| Step | Disk space remaining after step |
| --- | --- |
| Determining the checkout info | 30G |
| Run actions-rs/toolchain@v1 | 30G |
| Run rustup target add wasm32-wasi | 30G |
| Run engineerd/[email protected] | 30G |
| Run engineerd/[email protected] | 30G |
| Run curl -L https://github.com/deislabs/hippo/releases/download/v0.19.0/hippo-server-linux-x64.tar.gz -o hippo-server.tar.gz | 29G |
| Run Swatinem/rust-cache@v2 | 29G |
| Run cargo fmt --all -- --check | 29G |
| Run cargo clippy --workspace --all-targets --all-features -- -D warnings | 26G |
| Run cargo build --workspace --all-targets --all-features --features openssl/vendored | 17G |
| Run make test-unit | 9.2G |
| Run actions/setup-go@v3 | 9.2G |
| Run wget https://github.com/tinygo-org/tinygo/releases/download/v0.22.0/tinygo_0.22.0_amd64.deb | 8.4G |
| Run engineerd/[email protected] | 8.4G |
| Run make test-sdk-go | 8.4G |
| Run actions/upload-artifact@v3 | 8.4G |

From the above table it seems (probably as expected) that cargo clippy, cargo build, and make test-unit are the major disk space consumers here.

Can we clean up some of the downloaded modules and build artifacts after these steps to reclaim disk space? I'm wondering how much of this we still need during the Go SDK test, and whether removing it would increase the execution time. I will investigate this more today.
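
For anyone wanting to reproduce these measurements, a simple approach is to interleave a disk-report step between the existing steps; this is just a sketch, and the step name is arbitrary:

```yaml
- name: Report remaining disk space
  run: df -h /
```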

@itowlson
Contributor

Whomping us again already!

@vdice
Member Author

vdice commented Nov 15, 2022

I've deleted all caches again. Please retry failed workflows.

It would be odd if persisting the caches were itself causing the issue, but mounting a given cache into a runner does seem to be tipping the scales toward maxing out the disk.

If that helps yet again, I suggest we temporarily remove our use of these caches (or tweak the configuration, perhaps to exclude a subset of the filesystem from the cache) until we have cycles to work on some of the items mentioned in #869 (comment)
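
A sketch of what that could look like in build.yml; note that the `cache-targets` input is recalled from memory and should be verified against the rust-cache documentation before relying on it:

```yaml
# Option A: drop the cache step entirely for now.
# - uses: Swatinem/rust-cache@v2

# Option B (assumption): keep the step but cache less. `cache-targets: false` is
# believed to skip caching the target/ directory; verify this input exists.
- uses: Swatinem/rust-cache@v2
  with:
    cache-targets: false
```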

@itowlson
Contributor

itowlson commented Nov 16, 2022

Something is blocking integration tests and I'm not sure if it's this or something else, because the symptom is timing out waiting for Spin to start serving. It works locally but it does seem reasonably consistent about the point at which it fails. I don't think it's anything in my PR because that doesn't touch the integration tests (and the integration tests shouldn't be affected by changes to the templates system). Other PRs have passed (and these did, before) so maybe we're just so close to the limit that something builds up between morning and afternoon and that's enough to tip it over?

@etehtsea
Contributor

@itowlson I’m also getting this sporadic CI failure on macOS in my PRs.

@rajatjindal
Collaborator

I did more analysis on this and tried removing the target directory after the cargo steps:

| Command | Time taken if target dir was cleaned before the command | Time taken if target dir was retained |
| --- | --- | --- |
| cargo clippy | ~360s | 700s |
| cargo build | ~660s | 520s |
| make test-unit | ~920s | 614s |

So, approximately 10 minutes longer overall.

We could possibly go ahead with this (to unblock the PRs) and continue investigating. If that sounds OK, I will open a PR for this and continue my investigation in the meantime.
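
A sketch of the proposed cleanup step, placed after the heavy cargo steps; `cargo clean` removes the workspace's target directory, at the cost of recompiling anything needed later:

```yaml
- name: Reclaim disk space after cargo steps
  run: |
    df -h /
    cargo clean    # removes the target/ directory
    df -h /
```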

@itowlson
Contributor

I had a chat with @radu-matei about this, and we propose to bypass the caching and do fully clean builds every time (not sure if this is the same as your suggestion @rajatjindal). Builds will be a lot slower but hopefully more reliable. Then we can take stock and look at longer-term optimisation (it sounds like @endocrimes has some ideas on this).

@vdice I don't know how to do this - is it something you could set up for us please? The urgency is fairly high, I'm afraid, because CI is really fouling up the PR pipeline right now. Thanks!

@vdice
Member Author

vdice commented Nov 16, 2022

I'm afraid disabling the cache step doesn't help -- or at least is not guaranteed to avoid the behavior: https://github.com/fermyon/spin/actions/runs/3483294334/jobs/5826621776

@vdice
Member Author

vdice commented Nov 17, 2022

#904 brings a much-needed refactor which should greatly help avoid the issues seen here, now that good chunks of work have been split out into separate runners for a given workflow. Big thanks to @endocrimes for the improvements 🎉

There may be refinements in this area we still wish to make, e.g. to improve build times, but I'll close this one as completed. Feel free to break out specific follow-up issues as you see fit.

@vdice vdice closed this as completed Nov 17, 2022
Repository owner moved this from 🔖 Backlog to ✅ Done in Spin Triage Nov 17, 2022