Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework GitHub Actions workflows to build packages --> test packages #584

Open
ScottTodd opened this issue Nov 21, 2024 · 11 comments
Open

Rework GitHub Actions workflows to build packages --> test packages #584

ScottTodd opened this issue Nov 21, 2024 · 11 comments
Assignees

Comments

@ScottTodd
Copy link
Member

These workflows all currently build shortfin from source, duplicating all the boilerplate to fetch dependencies in some carefully balanced order:

For workflows that run on pull_request and push triggers, we can add a build_dev_packages job similar to https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/build_packages.yml that builds the packages and then have those workflows install artifacts from that job. For workflows that run on schedule, we can either do the same thing, or we can use the already built nightly packages (docs: https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md).

In both cases, the complexity of package building will be isolated to a few package-oriented workflows and we'll gain confidence that the test jobs are compatible with our releases, so users will be able to use them without needing to build from source either.

Once we have something working, we can optimize the package build to improve CI turnaround times:

  • cache pip dependencies
  • cache CMake builds (or the entire build - see what IREE does)
  • cache Dockerfiles
  • skip the tracy build variant:
    function build_shortfin() {
    export SHORTFIN_ENABLE_TRACING=ON
    python -m pip wheel --disable-pip-version-check -v -w "${OUTPUT_DIR}" "${REPO_ROOT}/shortfin"
    }
    ENABLE_TRACY = get_env_boolean("SHORTFIN_ENABLE_TRACING", False)

    shark-ai/shortfin/setup.py

    Lines 260 to 263 in 06599e9

    try:
    self.build_default_configuration()
    if ENABLE_TRACY:
    self.build_tracy_configuration()

See https://github.com/iree-org/iree/blob/main/.github/workflows/pkgci.yml for the shape of this sort of setup in IREE.

@ScottTodd
Copy link
Member Author

Proof of concept migration of one workflow: #625. This added 1 minute to total workflow time but has a few scaling benefits. Going to let that sit for a bit and run some more experiments.

The main time sink is installing Python packages (even if already downloaded/cached). Workflows that use persistent self-hosted runners currently don't use venvs, so they risk having packages left over from previous jobs and either installing conflicting versions of packages or failing to install the requested versions entirely. The new setup_venv.py code (forked from IREE) installs the dev packages and requirements sequentially, but we might be able to optimize that a bit while still retaining predictability.

@stellaraccident
Copy link
Contributor

You may want to look at using uv as a pip replacement when latency is a concern. I dislike forked tool flows, by it seems like a lot of folks are having a good experience there.

@ScottTodd
Copy link
Member Author

Recipes for using uv: https://github.com/astral-sh/uv?tab=readme-ov-file#a-pip-compatible-interface . Definitely worth trying out.

@marbre
Copy link
Collaborator

marbre commented Nov 28, 2024

If you want to build a package you want to use uv build and not uv pip. The equivalent for python -m pip wheel -v -w wheeldir . would be uv build --wheel -v -o wheeldir .. I would say uv is definitely an alternative, especially as uv venv is a really nice alternative. Furthermore, uv allows to install different Python versions and therefore is also a replacement for pyenv. However, I also faced some issues in the past but it's for sure worth to give it another try.

@ScottTodd
Copy link
Member Author

The bottleneck I'd like to optimize is the 2m30s spent installing packages (including deps), not the 1m30s building the shortfin/sharktank/shark-ai packages. See logs at https://github.com/nod-ai/shark-ai/actions/runs/12059301876/job/33628235219?pr=625#step:5:35 :

Wed, 27 Nov 2024 23:02:34 GMT
Installing collected packages: mpmath, typing-extensions, sympy, networkx, MarkupSafe, fsspec, filelock, jinja2, torch
Wed, 27 Nov 2024 23:03:02 GMT
Successfully installed MarkupSafe-2.1.5 filelock-3.13.1 fsspec-2024.6.1 jinja2-3.1.4 mpmath-1.3.0 networkx-3.3 sympy-1.13.1 torch-2.3.0+cpu typing-extensions-4.12.2
...
Wed, 27 Nov 2024 23:03:13 GMT
Installing collected packages: pytz, xxhash, urllib3, tzdata, tqdm, sniffio, six, safetensors, regex, pyyaml, pydantic-core, pyarrow, propcache, packaging, numpy, multidict, idna, h11, frozenlist, dill, click, charset-normalizer, certifi, attrs, annotated-types, aiohappyeyeballs, yarl, uvicorn, requests, python-dateutil, pydantic, multiprocess, iree-base-runtime, iree-base-compiler, gguf, anyio, aiosignal, starlette, pandas, iree-turbine, huggingface-hub, aiohttp, tokenizers, fastapi, transformers, datasets
Wed, 27 Nov 2024 23:04:07 GMT
Successfully installed aiohappyeyeballs-2.4.3 aiohttp-3.11.8 aiosignal-1.3.1 annotated-types-0.7.0 anyio-4.6.2.post1 attrs-24.2.0 certifi-2024.8.30 charset-normalizer-3.4.0 click-8.1.7 datasets-3.0.1 dill-0.3.8 fastapi-0.112.2 frozenlist-1.5.0 gguf-0.10.0 h11-0.14.0 huggingface-hub-0.22.2 idna-3.10 iree-base-compiler-3.0.0 iree-base-runtime-3.0.0 iree-turbine-3.0.0 multidict-6.1.0 multiprocess-0.70.16 numpy-1.26.4 packaging-24.2 pandas-2.2.3 propcache-0.2.0 pyarrow-18.1.0 pydantic-2.10.2 pydantic-core-2.27.1 python-dateutil-2.9.0.post0 pytz-2024.2 pyyaml-6.0.2 regex-2024.11.6 requests-2.32.3 safetensors-0.4.5 six-1.16.0 sniffio-1.3.1 starlette-0.38.6 tokenizers-0.19.1 tqdm-4.67.1 transformers-4.40.0 tzdata-2024.2 urllib3-2.2.3 uvicorn-0.30.6 xxhash-3.5.0 yarl-1.18.0

The build steps can be optimized too, but 1m30s on a standard runner with a (very low) 40% cache hit rate is pretty respectable already.

@renxida
Copy link
Contributor

renxida commented Dec 3, 2024

Very hype about how this improves our CI dependencies, especially about the part where we can pin IREE versions such that most CI tasks don't doesn't suffer from IREE regressions.

@ScottTodd
Copy link
Member Author

Switching from pip to uv saved about 1 minute of job time on #625. Probably worth it given the relative time scales here.

@ScottTodd
Copy link
Member Author

For uv, we can also use https://github.com/astral-sh/setup-uv

ScottTodd added a commit that referenced this issue Dec 4, 2024
Progress on #584.

This is expected to save around 10-20 seconds when building packages on
standard GitHub-hosted runners:

```
Tue, 03 Dec 2024 11:07:18 GMT   [372/380] Linking CXX shared library src/libshortfin.so.3.1.0
Tue, 03 Dec 2024 11:07:18 GMT   [373/380] Creating library symlink src/libshortfin.so.1 src/libshortfin.so
Tue, 03 Dec 2024 11:07:23 GMT   [374/380] Linking CXX executable src/shortfin/support/shortfin_support_test
Tue, 03 Dec 2024 11:07:23 GMT   [375/380] Linking CXX executable src/shortfin/array/shortfin_array_test
Tue, 03 Dec 2024 11:07:36 GMT   [376/380] Building CXX object python/CMakeFiles/shortfin_python_extension.dir/array_host_ops.cc.o
Tue, 03 Dec 2024 11:07:45 GMT   [377/380] Linking CXX shared module python/_shortfin_default/lib.cpython-311-x86_64-linux-gnu.so
```
(from these logs:
https://github.com/nod-ai/shark-ai/actions/runs/12138320160/job/33843543941#step:6:738)

IREE also disables its tests when building packages:
*
https://github.com/iree-org/iree/blob/cbb11f220c69e0106dbfd1533a00237c3a74e7e3/compiler/setup.py#L260
*
https://github.com/iree-org/iree/blob/cbb11f220c69e0106dbfd1533a00237c3a74e7e3/runtime/setup.py#L278
ScottTodd added a commit that referenced this issue Dec 4, 2024
The `mi300-sdxl-kernel` runner has been offline for a few weeks, so runs
of this workflow have been queued:
https://github.com/nod-ai/shark-ai/actions/workflows/ci-sdxl.yaml. This
`mi300x-4` runner is probably fit to run this workflow.

Also refactored the workflow to not use explicit build steps, which
loosens the requirements on installed software and helps make progress
on #584.
ScottTodd added a commit that referenced this issue Dec 5, 2024
Many of these workflows are using persistent self-hosted runners, so it
looks like they have been reusing the same system-wide Python
environment between workflow runs (plus layer of caching on top). This
switches to using venvs at `${{ github.workspace }}/.venv` that should
be ephemeral, giving us more explicit control over which packages are
installed.

More work is planned as part of
#584 to refactor these
workflows further - replacing the package installs code like `pip
install --no-compile -r requirements.txt -r
sharktank/requirements-tests.txt -e sharktank/` with a `setup_venv.py`
script that uses dev/nightly/stable packages (from an appropriate
source).

This also disables pip caching, since that is not directly compatible
with using venvs. As a result, some workflows are slower now, but they
are more predictable in what they install. Good reading for adding
caching back:
*
https://adamj.eu/tech/2023/11/02/github-actions-faster-python-virtual-environments/
*
https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages
ScottTodd added a commit that referenced this issue Dec 5, 2024
#646)

Splitting this off from #589 to
make progress on #584.

Tested with
```
CACHE_DIR=/tmp/shortfin/ sudo -E ./shortfin/build_tools/build_linux_package.sh

+ ccache --show-stats
Cacheable calls:   626 / 636 (98.43%)
  Hits:              2 / 626 ( 0.32%)
    Direct:          2 /   2 (100.0%)
    Preprocessed:    0 /   2 ( 0.00%)
  Misses:          624 / 626 (99.68%)
Uncacheable calls:  10 / 636 ( 1.57%)
Local storage:
  Cache size (GB): 0.1 / 2.0 ( 3.10%)
  Hits:              2 / 626 ( 0.32%)
  Misses:          624 / 626 (99.68%)

+ ccache --show-stats
ccache stats:
Cacheable calls:   1252 / 1272 (98.43%)
  Hits:             550 / 1252 (43.93%)
    Direct:         550 /  550 (100.0%)
    Preprocessed:     0 /  550 ( 0.00%)
  Misses:           702 / 1252 (56.07%)
Uncacheable calls:   20 / 1272 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 4.11%)
  Hits:             550 / 1252 (43.93%)
  Misses:           702 / 1252 (56.07%)

+ ccache --show-stats
Cacheable calls:   1878 / 1908 (98.43%)
  Hits:            1098 / 1878 (58.47%)
    Direct:        1098 / 1098 (100.0%)
    Preprocessed:     0 / 1098 ( 0.00%)
  Misses:           780 / 1878 (41.53%)
Uncacheable calls:   30 / 1908 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 5.12%)
  Hits:            1098 / 1878 (58.47%)
  Misses:           780 / 1878 (41.53%)

CACHE_DIR=/tmp/shortfin/ sudo -E ./shortfin/build_tools/build_linux_package.sh

+ ccache --show-stats
ccache stats:
Cacheable calls:   3756 / 3816 (98.43%)
  Hits:            2820 / 3756 (75.08%)
    Direct:        2820 / 2820 (100.0%)
    Preprocessed:     0 / 2820 ( 0.00%)
  Misses:           936 / 3756 (24.92%)
Uncacheable calls:   60 / 3816 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 5.19%)
  Hits:            2820 / 3756 (75.08%)
  Misses:           936 / 3756 (24.92%)
```

So we have multiple configurations getting built (Python versions,
tracing enable/disabled), but we still get a reasonable number of cache
hits. Definitely room to improve there, but better than nothing.
@ScottTodd
Copy link
Member Author

I've landed some incremental changes that prepare us for package-based workflows, but I've been a bit skeptical of the complexity that they will introduce.

Here's another data point as motivation: installing into a fresh venv, with package downloads already cached on the system, https://github.com/nod-ai/shark-ai/actions/runs/12206496774/job/34056060889#step:5:32 took 9m20s on the Install pip deps step:

  • 2 minutes for pip install --no-compile -r pytorch-cpu-requirements.txt
  • 6 minutes for pip install --no-compile -r requirements.txt -e sharktank/ shortfin/
  • 20 seconds for pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps -e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine"
  • 15 seconds for pip install -f https://iree.dev/pip-release-links.html --upgrade --pre iree-base-compiler iree-base-runtime

ScottTodd added a commit that referenced this issue Dec 9, 2024
This simplification will help with
#584.

Nightly releases of iree-turbine are now being built thanks to
iree-org/iree-turbine#314 and published at the
same index as the other IREE packages thanks to
iree-org/iree#19391.
@ScottTodd
Copy link
Member Author

The pip install step isn't consistently that slow. Recent runs took ~4m30s instead of that 9m+.
Sample workflow run history: https://github.com/nod-ai/shark-ai/actions/workflows/ci-shark-ai.yml?query=branch%3Amain

For workflows that only use sharktank and not shortfin, the setup is already fast enough: 27s at https://github.com/nod-ai/shark-ai/actions/runs/12243785411/job/34154097452, for example.

The install steps are simpler now, so I'm skeptical about going "full pkgci" across all workflows from a complexity point of view. For workflows that run integration tests using shortfin, a dedicated package build job will make more sense as we make the build more complex... like adding a rust dependency for tokenizers.

ScottTodd added a commit that referenced this issue Dec 10, 2024
Progress on #584. ~~Depends on
#666 (the first commit).~~

This is refactors the `build_packages.yml` workflow so it can be used
via `workflow_call` as part of a "pkgci" setup, as an alternative to
creating a new `pkgci_build_packages.yml` workflow as originally
proposed in #589. This lets us
reuse the same workflow for building stable, nightly, and dev packages,
all across the same matrix of Python versions and operating systems.
Package builds take about 2 minutes (wall time) across the full matrix,
so we might as well build them all, instead of artificially constraining
ourselves to a subset like only Linux on Python 3.11.

Triggers for the workflow are now this:

Trigger | Scenario | Build type(s)
-- | -- | --
`schedule` | Nightly pre-release build | `rc`
`workflow_dispatch` | Workflow testing, manual releasing | `rc` default,
`stable` and `dev` possible
`workflow_call` | Pull request or push "pkgci" dev builds | `dev`
default, `stable` and `rc` possible

With this workflow behavior:

Build type | Version suffix | Cache enabled? | Tracing enabled? | Pushes
to release?
-- | -- | -- | -- | --
`stable` | None | No | Yes | No
`rc` | `rcYYYYMMDD` | No | Yes | Yes
`dev` | `.dev0+${{ github.sha }}` | Yes | No | No

Tested over at
https://github.com/ScottTodd/shark-ai/actions/workflows/build_packages.yml.
Example run:
https://github.com/ScottTodd/shark-ai/actions/runs/12245900071 (warm
cache)
monorimet pushed a commit that referenced this issue Dec 13, 2024
Progress on #584.

This is expected to save around 10-20 seconds when building packages on
standard GitHub-hosted runners:

```
Tue, 03 Dec 2024 11:07:18 GMT   [372/380] Linking CXX shared library src/libshortfin.so.3.1.0
Tue, 03 Dec 2024 11:07:18 GMT   [373/380] Creating library symlink src/libshortfin.so.1 src/libshortfin.so
Tue, 03 Dec 2024 11:07:23 GMT   [374/380] Linking CXX executable src/shortfin/support/shortfin_support_test
Tue, 03 Dec 2024 11:07:23 GMT   [375/380] Linking CXX executable src/shortfin/array/shortfin_array_test
Tue, 03 Dec 2024 11:07:36 GMT   [376/380] Building CXX object python/CMakeFiles/shortfin_python_extension.dir/array_host_ops.cc.o
Tue, 03 Dec 2024 11:07:45 GMT   [377/380] Linking CXX shared module python/_shortfin_default/lib.cpython-311-x86_64-linux-gnu.so
```
(from these logs:
https://github.com/nod-ai/shark-ai/actions/runs/12138320160/job/33843543941#step:6:738)

IREE also disables its tests when building packages:
*
https://github.com/iree-org/iree/blob/cbb11f220c69e0106dbfd1533a00237c3a74e7e3/compiler/setup.py#L260
*
https://github.com/iree-org/iree/blob/cbb11f220c69e0106dbfd1533a00237c3a74e7e3/runtime/setup.py#L278
monorimet pushed a commit that referenced this issue Dec 13, 2024
The `mi300-sdxl-kernel` runner has been offline for a few weeks, so runs
of this workflow have been queued:
https://github.com/nod-ai/shark-ai/actions/workflows/ci-sdxl.yaml. This
`mi300x-4` runner is probably fit to run this workflow.

Also refactored the workflow to not use explicit build steps, which
loosens the requirements on installed software and helps make progress
on #584.
monorimet pushed a commit that referenced this issue Dec 13, 2024
Many of these workflows are using persistent self-hosted runners, so it
looks like they have been reusing the same system-wide Python
environment between workflow runs (plus layer of caching on top). This
switches to using venvs at `${{ github.workspace }}/.venv` that should
be ephemeral, giving us more explicit control over which packages are
installed.

More work is planned as part of
#584 to refactor these
workflows further - replacing the package installs code like `pip
install --no-compile -r requirements.txt -r
sharktank/requirements-tests.txt -e sharktank/` with a `setup_venv.py`
script that uses dev/nightly/stable packages (from an appropriate
source).

This also disables pip caching, since that is not directly compatible
with using venvs. As a result, some workflows are slower now, but they
are more predictable in what they install. Good reading for adding
caching back:
*
https://adamj.eu/tech/2023/11/02/github-actions-faster-python-virtual-environments/
*
https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages
monorimet pushed a commit that referenced this issue Dec 13, 2024
#646)

Splitting this off from #589 to
make progress on #584.

Tested with
```
CACHE_DIR=/tmp/shortfin/ sudo -E ./shortfin/build_tools/build_linux_package.sh

+ ccache --show-stats
Cacheable calls:   626 / 636 (98.43%)
  Hits:              2 / 626 ( 0.32%)
    Direct:          2 /   2 (100.0%)
    Preprocessed:    0 /   2 ( 0.00%)
  Misses:          624 / 626 (99.68%)
Uncacheable calls:  10 / 636 ( 1.57%)
Local storage:
  Cache size (GB): 0.1 / 2.0 ( 3.10%)
  Hits:              2 / 626 ( 0.32%)
  Misses:          624 / 626 (99.68%)

+ ccache --show-stats
ccache stats:
Cacheable calls:   1252 / 1272 (98.43%)
  Hits:             550 / 1252 (43.93%)
    Direct:         550 /  550 (100.0%)
    Preprocessed:     0 /  550 ( 0.00%)
  Misses:           702 / 1252 (56.07%)
Uncacheable calls:   20 / 1272 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 4.11%)
  Hits:             550 / 1252 (43.93%)
  Misses:           702 / 1252 (56.07%)

+ ccache --show-stats
Cacheable calls:   1878 / 1908 (98.43%)
  Hits:            1098 / 1878 (58.47%)
    Direct:        1098 / 1098 (100.0%)
    Preprocessed:     0 / 1098 ( 0.00%)
  Misses:           780 / 1878 (41.53%)
Uncacheable calls:   30 / 1908 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 5.12%)
  Hits:            1098 / 1878 (58.47%)
  Misses:           780 / 1878 (41.53%)

CACHE_DIR=/tmp/shortfin/ sudo -E ./shortfin/build_tools/build_linux_package.sh

+ ccache --show-stats
ccache stats:
Cacheable calls:   3756 / 3816 (98.43%)
  Hits:            2820 / 3756 (75.08%)
    Direct:        2820 / 2820 (100.0%)
    Preprocessed:     0 / 2820 ( 0.00%)
  Misses:           936 / 3756 (24.92%)
Uncacheable calls:   60 / 3816 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 5.19%)
  Hits:            2820 / 3756 (75.08%)
  Misses:           936 / 3756 (24.92%)
```

So we have multiple configurations getting built (Python versions,
tracing enable/disabled), but we still get a reasonable number of cache
hits. Definitely room to improve there, but better than nothing.
IanNod pushed a commit to IanNod/SHARK-Platform that referenced this issue Dec 17, 2024
This simplification will help with
nod-ai#584.

Nightly releases of iree-turbine are now being built thanks to
iree-org/iree-turbine#314 and published at the
same index as the other IREE packages thanks to
iree-org/iree#19391.
IanNod pushed a commit to IanNod/SHARK-Platform that referenced this issue Dec 17, 2024
Progress on nod-ai#584. ~~Depends on
nod-ai#666 (the first commit).~~

This is refactors the `build_packages.yml` workflow so it can be used
via `workflow_call` as part of a "pkgci" setup, as an alternative to
creating a new `pkgci_build_packages.yml` workflow as originally
proposed in nod-ai#589. This lets us
reuse the same workflow for building stable, nightly, and dev packages,
all across the same matrix of Python versions and operating systems.
Package builds take about 2 minutes (wall time) across the full matrix,
so we might as well build them all, instead of artificially constraining
ourselves to a subset like only Linux on Python 3.11.

Triggers for the workflow are now this:

Trigger | Scenario | Build type(s)
-- | -- | --
`schedule` | Nightly pre-release build | `rc`
`workflow_dispatch` | Workflow testing, manual releasing | `rc` default,
`stable` and `dev` possible
`workflow_call` | Pull request or push "pkgci" dev builds | `dev`
default, `stable` and `rc` possible

With this workflow behavior:

Build type | Version suffix | Cache enabled? | Tracing enabled? | Pushes
to release?
-- | -- | -- | -- | --
`stable` | None | No | Yes | No
`rc` | `rcYYYYMMDD` | No | Yes | Yes
`dev` | `.dev0+${{ github.sha }}` | Yes | No | No

Tested over at
https://github.com/ScottTodd/shark-ai/actions/workflows/build_packages.yml.
Example run:
https://github.com/ScottTodd/shark-ai/actions/runs/12245900071 (warm
cache)
@ScottTodd
Copy link
Member Author

The https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/ci-shark-ai.yml workflow has massive overhead on nodai-amdgpu-mi250-x86-64 runners right now. Sample logs: https://github.com/nod-ai/shark-ai/actions/runs/12654550570/job/35262707224

  • 30 minutes total
  • 6 minutes for actions/checkout
    Tue, 07 Jan 2025 15:48:38 GMT /usr/bin/git clean -ffdx
    Tue, 07 Jan 2025 15:54:24 GMT Removing .pytest_cache/
    Tue, 07 Jan 2025 15:54:24 GMT Removing .venv/
    
  • 47 seconds for python -m venv ${VENV_DIR}
  • 16 minutes for pip install (including building shortfin)
  • 8 minutes for the actual tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants