Rework GitHub Actions workflows to build packages --> test packages #584
Comments
Proof of concept migration of one workflow: #625. This added 1 minute to total workflow time but has a few scaling benefits. Going to let that sit for a bit and run some more experiments. The main time sink is installing Python packages (even if already downloaded/cached). Workflows that use persistent self-hosted runners currently don't use venvs, so they risk having packages left over from previous jobs and either installing conflicting versions of packages or failing to install the requested versions entirely. The new |
You may want to look at using uv as a pip replacement when latency is a concern. I dislike forked tool flows, but it seems like a lot of folks are having a good experience there. |
Recipes for using |
If you want to build a package you want to use |
The bottleneck I'd like to optimize is the 2m30s spent installing packages (including deps), not the 1m30s building the shortfin/sharktank/shark-ai packages. See logs at https://github.com/nod-ai/shark-ai/actions/runs/12059301876/job/33628235219?pr=625#step:5:35 :
The build steps can be optimized too, but 1m30s on a standard runner with a (very low) 40% cache hit rate is pretty respectable already. |
Very hyped about how this improves our CI dependencies, especially the part where we can pin IREE versions so that most CI tasks don't suffer from IREE regressions. |
Switching from pip to uv saved about 1 minute of job time on #625. Probably worth it given the relative time scales here. |
For uv, we can also use https://github.com/astral-sh/setup-uv |
Progress on #584.

This is expected to save around 10-20 seconds when building packages on standard GitHub-hosted runners:

```
Tue, 03 Dec 2024 11:07:18 GMT [372/380] Linking CXX shared library src/libshortfin.so.3.1.0
Tue, 03 Dec 2024 11:07:18 GMT [373/380] Creating library symlink src/libshortfin.so.1 src/libshortfin.so
Tue, 03 Dec 2024 11:07:23 GMT [374/380] Linking CXX executable src/shortfin/support/shortfin_support_test
Tue, 03 Dec 2024 11:07:23 GMT [375/380] Linking CXX executable src/shortfin/array/shortfin_array_test
Tue, 03 Dec 2024 11:07:36 GMT [376/380] Building CXX object python/CMakeFiles/shortfin_python_extension.dir/array_host_ops.cc.o
Tue, 03 Dec 2024 11:07:45 GMT [377/380] Linking CXX shared module python/_shortfin_default/lib.cpython-311-x86_64-linux-gnu.so
```

(from these logs: https://github.com/nod-ai/shark-ai/actions/runs/12138320160/job/33843543941#step:6:738)

IREE also disables its tests when building packages:

* https://github.com/iree-org/iree/blob/cbb11f220c69e0106dbfd1533a00237c3a74e7e3/compiler/setup.py#L260
* https://github.com/iree-org/iree/blob/cbb11f220c69e0106dbfd1533a00237c3a74e7e3/runtime/setup.py#L278
The `mi300-sdxl-kernel` runner has been offline for a few weeks, so runs of this workflow have been queued: https://github.com/nod-ai/shark-ai/actions/workflows/ci-sdxl.yaml. This `mi300x-4` runner is probably fit to run this workflow. Also refactored the workflow to not use explicit build steps, which loosens the requirements on installed software and helps make progress on #584.
Many of these workflows use persistent self-hosted runners, so it looks like they have been reusing the same system-wide Python environment between workflow runs (plus a layer of caching on top). This switches to using venvs at `${{ github.workspace }}/.venv` that should be ephemeral, giving us more explicit control over which packages are installed.

More work is planned as part of #584 to refactor these workflows further: replacing package install code like `pip install --no-compile -r requirements.txt -r sharktank/requirements-tests.txt -e sharktank/` with a `setup_venv.py` script that uses dev/nightly/stable packages (from an appropriate source).

This also disables pip caching, since that is not directly compatible with using venvs. As a result, some workflows are slower now, but they are more predictable in what they install. Good reading for adding caching back:

* https://adamj.eu/tech/2023/11/02/github-actions-faster-python-virtual-environments/
* https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages
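To make the "ephemeral venv" idea concrete, here is a minimal sketch using only the Python standard library. It is not the planned `setup_venv.py` script; the function names (`create_ephemeral_venv`, `install_requirements`, `venv_python`) are hypothetical and chosen for illustration only.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(venv_dir: Path) -> Path:
    """Return the interpreter path inside a venv (POSIX or Windows layout)."""
    if sys.platform == "win32":
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def create_ephemeral_venv(venv_dir: Path, with_pip: bool = True) -> Path:
    """Create a fresh venv at venv_dir, wiping anything left by a previous job."""
    # clear=True empties an existing directory first, so packages installed by
    # earlier runs on a persistent runner cannot leak into this one.
    venv.EnvBuilder(clear=True, with_pip=with_pip).create(venv_dir)
    return venv_python(venv_dir)


def install_requirements(python: Path, requirement_files: list[str]) -> None:
    """Install requirement files with the venv's own interpreter (hypothetical helper)."""
    cmd = [str(python), "-m", "pip", "install", "--no-compile"]
    for req in requirement_files:
        cmd += ["-r", req]
    subprocess.run(cmd, check=True)
```

In a workflow step this could be called with the workspace path (for example `Path(os.environ["GITHUB_WORKSPACE"]) / ".venv"`), after which every later step runs the interpreter returned by `create_ephemeral_venv` rather than the system Python.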
#646) Splitting this off from #589 to make progress on #584.

Tested with

```
CACHE_DIR=/tmp/shortfin/ sudo -E ./shortfin/build_tools/build_linux_package.sh

+ ccache --show-stats
Cacheable calls: 626 / 636 (98.43%)
  Hits: 2 / 626 ( 0.32%)
    Direct: 2 / 2 (100.0%)
    Preprocessed: 0 / 2 ( 0.00%)
  Misses: 624 / 626 (99.68%)
Uncacheable calls: 10 / 636 ( 1.57%)
Local storage:
  Cache size (GB): 0.1 / 2.0 ( 3.10%)
  Hits: 2 / 626 ( 0.32%)
  Misses: 624 / 626 (99.68%)

+ ccache --show-stats
ccache stats:
Cacheable calls: 1252 / 1272 (98.43%)
  Hits: 550 / 1252 (43.93%)
    Direct: 550 / 550 (100.0%)
    Preprocessed: 0 / 550 ( 0.00%)
  Misses: 702 / 1252 (56.07%)
Uncacheable calls: 20 / 1272 ( 1.57%)
Local storage:
  Cache size (GB): 0.1 / 2.0 ( 4.11%)
  Hits: 550 / 1252 (43.93%)
  Misses: 702 / 1252 (56.07%)

+ ccache --show-stats
Cacheable calls: 1878 / 1908 (98.43%)
  Hits: 1098 / 1878 (58.47%)
    Direct: 1098 / 1098 (100.0%)
    Preprocessed: 0 / 1098 ( 0.00%)
  Misses: 780 / 1878 (41.53%)
Uncacheable calls: 30 / 1908 ( 1.57%)
Local storage:
  Cache size (GB): 0.1 / 2.0 ( 5.12%)
  Hits: 1098 / 1878 (58.47%)
  Misses: 780 / 1878 (41.53%)

CACHE_DIR=/tmp/shortfin/ sudo -E ./shortfin/build_tools/build_linux_package.sh

+ ccache --show-stats
ccache stats:
Cacheable calls: 3756 / 3816 (98.43%)
  Hits: 2820 / 3756 (75.08%)
    Direct: 2820 / 2820 (100.0%)
    Preprocessed: 0 / 2820 ( 0.00%)
  Misses: 936 / 3756 (24.92%)
Uncacheable calls: 60 / 3816 ( 1.57%)
Local storage:
  Cache size (GB): 0.1 / 2.0 ( 5.19%)
  Hits: 2820 / 3756 (75.08%)
  Misses: 936 / 3756 (24.92%)
```

So we have multiple configurations getting built (Python versions, tracing enabled/disabled), but we still get a reasonable number of cache hits. Definitely room to improve there, but better than nothing.
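The percentages in the stats above can be recomputed, and if the counters are cumulative across invocations (an assumption, suggested by the cacheable-call totals growing in steps of 626), the hit rate of a single build can be estimated from the difference between two snapshots. A small sketch with hypothetical helper names:

```python
import re

# Excerpt of the final `ccache --show-stats` output quoted above.
SAMPLE = """\
Cacheable calls: 3756 / 3816 (98.43%)
  Hits: 2820 / 3756 (75.08%)
  Misses: 936 / 3756 (24.92%)
"""


def parse_hits(stats_text: str) -> tuple[int, int]:
    """Return (hits, cacheable_calls) from the first 'Hits:' line."""
    # The first "Hits:" line is the overall count; "Local storage" repeats it later.
    match = re.search(r"^\s*Hits:\s+(\d+)\s*/\s*(\d+)", stats_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Hits:' line found")
    return int(match.group(1)), int(match.group(2))


def hit_rate(hits: int, total: int) -> float:
    """Hit rate as a percentage; 0.0 when there were no cacheable calls."""
    return 100.0 * hits / total if total else 0.0


def incremental_hit_rate(prev: tuple[int, int], curr: tuple[int, int]) -> float:
    """Hit rate of the work done between two cumulative snapshots."""
    return hit_rate(curr[0] - prev[0], curr[1] - prev[1])
```

Under that assumption, the step from 550 / 1252 to 1098 / 1878 corresponds to 548 hits out of 626 calls, so an individual configuration may be hitting the cache more often than the cumulative figure suggests.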
I've landed some incremental changes that prepare us for package-based workflows, but I've been a bit skeptical of the complexity that they will introduce. Here's another data point as motivation: installing into a fresh venv, with package downloads already cached on the system, https://github.com/nod-ai/shark-ai/actions/runs/12206496774/job/34056060889#step:5:32 took 9m20s on the
|
This simplification will help with #584. Nightly releases of iree-turbine are now being built thanks to iree-org/iree-turbine#314 and published at the same index as the other IREE packages thanks to iree-org/iree#19391.
The pip install step isn't consistently that slow. Recent runs took ~4m30s instead of that 9m+. For workflows that only use sharktank and not shortfin, the setup is already fast enough: 27s at https://github.com/nod-ai/shark-ai/actions/runs/12243785411/job/34154097452, for example. The install steps are simpler now, so I'm skeptical about going "full pkgci" across all workflows from a complexity point of view. For workflows that run integration tests using shortfin, a dedicated package build job will make more sense as we make the build more complex, for example by adding a Rust dependency for tokenizers. |
Progress on #584. ~~Depends on #666 (the first commit).~~

This refactors the `build_packages.yml` workflow so it can be used via `workflow_call` as part of a "pkgci" setup, as an alternative to creating a new `pkgci_build_packages.yml` workflow as originally proposed in #589. This lets us reuse the same workflow for building stable, nightly, and dev packages, all across the same matrix of Python versions and operating systems. Package builds take about 2 minutes (wall time) across the full matrix, so we might as well build them all, instead of artificially constraining ourselves to a subset like only Linux on Python 3.11.

Triggers for the workflow are now this:

Trigger | Scenario | Build type(s)
-- | -- | --
`schedule` | Nightly pre-release build | `rc`
`workflow_dispatch` | Workflow testing, manual releasing | `rc` default, `stable` and `dev` possible
`workflow_call` | Pull request or push "pkgci" dev builds | `dev` default, `stable` and `rc` possible

With this workflow behavior:

Build type | Version suffix | Cache enabled? | Tracing enabled? | Pushes to release?
-- | -- | -- | -- | --
`stable` | None | No | Yes | No
`rc` | `rcYYYYMMDD` | No | Yes | Yes
`dev` | `.dev0+${{ github.sha }}` | Yes | No | No

Tested over at https://github.com/ScottTodd/shark-ai/actions/workflows/build_packages.yml. Example run: https://github.com/ScottTodd/shark-ai/actions/runs/12245900071 (warm cache)
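The trigger and build-type tables above can be read as a small lookup. The sketch below only encodes what the tables state; the names (`DEFAULT_BUILD_TYPE`, `BuildOptions`, `version_suffix`) are hypothetical and are not taken from `build_packages.yml` or any script in the repository.

```python
from dataclasses import dataclass
from datetime import date

# Default build type per workflow trigger, per the first table.
DEFAULT_BUILD_TYPE = {
    "schedule": "rc",
    "workflow_dispatch": "rc",
    "workflow_call": "dev",
}


@dataclass(frozen=True)
class BuildOptions:
    cache_enabled: bool
    tracing_enabled: bool
    pushes_to_release: bool


# Per-build-type behavior, per the second table.
BUILD_OPTIONS = {
    "stable": BuildOptions(cache_enabled=False, tracing_enabled=True, pushes_to_release=False),
    "rc": BuildOptions(cache_enabled=False, tracing_enabled=True, pushes_to_release=True),
    "dev": BuildOptions(cache_enabled=True, tracing_enabled=False, pushes_to_release=False),
}


def version_suffix(build_type: str, today: date, sha: str = "") -> str:
    """Return the version suffix for a build type (rcYYYYMMDD, .dev0+<sha>, or none)."""
    if build_type == "stable":
        return ""
    if build_type == "rc":
        return "rc" + today.strftime("%Y%m%d")
    if build_type == "dev":
        return ".dev0+" + sha
    raise ValueError(f"unknown build type: {build_type!r}")
```

For example, a `workflow_call` from a pull request defaults to `dev`, which enables the cache, disables tracing, and produces a suffix derived from the commit SHA.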
The https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/ci-shark-ai.yml workflow has massive overhead on
|
These workflows all currently build shortfin from source, duplicating all the boilerplate to fetch dependencies in some carefully balanced order:
For workflows that run on `pull_request` and `push` triggers, we can add a `build_dev_packages` job similar to https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/build_packages.yml that builds the packages and then have those workflows install artifacts from that job. For workflows that run on `schedule`, we can either do the same thing, or we can use the already built nightly packages (docs: https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md).

In both cases, the complexity of package building will be isolated to a few package-oriented workflows and we'll gain confidence that the test jobs are compatible with our releases, so users will be able to use them without needing to build from source either.
Once we have something working, we can optimize the package build to improve CI turnaround times:
* shark-ai/shortfin/build_tools/build_linux_package.sh, lines 94 to 97 in 06599e9
* shark-ai/shortfin/setup.py, line 78 in 06599e9
* shark-ai/shortfin/setup.py, lines 260 to 263 in 06599e9
See https://github.com/iree-org/iree/blob/main/.github/workflows/pkgci.yml for the shape of this sort of setup in IREE.