
Hosting jit_stencils.h #115869

Open
brandtbucher opened this issue Feb 23, 2024 · 14 comments
Labels: 3.13, build, dependencies, topic-JIT

@brandtbucher (Member) commented Feb 23, 2024

While this is probably desirable, I'm not quite sure if it's feasible. With that said, several people (@vstinner at the sprint and @zooba during PR review) have expressed a desire to remove the LLVM build-time dependency for JIT builds. Let's have that conversation here.

Background

When building CPython with the JIT enabled, LLVM 18 is used to compile Tools/jit/template.c many times and to process the resulting object files into a file called jit_stencils.h in the build directory.
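
For illustration, here is a heavily simplified, hypothetical sketch of that build step. It is not the actual Tools/jit implementation (the real scripts parse the object files for section data and relocations, handle many targets, and more), and the -D_JIT_OPCODE macro name and opcode list are assumptions:

import pathlib
import subprocess

CLANG = "clang-18"                                       # the pinned LLVM/Clang version
TEMPLATE = pathlib.Path("Tools/jit/template.c")
OPCODES = ["_LOAD_FAST", "_STORE_FAST", "_BINARY_OP"]    # really: every micro-op

def compile_stencil(opcode, workdir):
    # Compile the template once for a single micro-op; the real tool then
    # extracts .text bytes and relocations, but here we just grab the raw
    # object file to illustrate the shape of the pipeline.
    obj = workdir / f"{opcode}.o"
    subprocess.run(
        [CLANG, "-O3", "-c", f"-D_JIT_OPCODE={opcode}", "-o", str(obj), str(TEMPLATE)],
        check=True,
    )
    return obj.read_bytes()

def emit_header(path, workdir):
    # Write each stencil out as an initialised C byte array.
    with path.open("w") as f:
        for opcode in OPCODES:
            body = compile_stencil(opcode, workdir)
            f.write(f"static const unsigned char {opcode}_body[] = {{\n    ")
            f.write(", ".join(f"0x{b:02x}" for b in body))
            f.write("\n};\n")

if __name__ == "__main__":
    emit_header(pathlib.Path("jit_stencils.h"), pathlib.Path("."))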

A useful analogy

Because this file depends on Python.h (and thus pyconfig.h and many build-specific configuration options, including things like _DEBUG/NDEBUG/Py_DEBUG/etc.) and contains binary code, it is probably most useful to think of jit_stencils.h as a binary extension module.

If we could build, host, and manage compiled versions of, say, itertoolsmodule.c somewhere and have it work correctly for those who need it, then such a scheme would probably work for jit_stencils.h.

Open questions

  • Can this be done in a way that actually works correctly and is worth the trouble (the status quo being "download LLVM 18 if you want to build the JIT")?
  • Should we just try to host these for the most common build configurations? Or "everything"?
  • Should all platforms be in one file (with each platform guarded by #ifdefs), or many files?
  • Should these files be checked in? Or hosted somewhere? Who/what builds them? How often?
  • Does this introduce any new attack vectors?
  • What should the workflow look like for:
    • ...those developing the JIT?
    • ...those changing header files that the JIT depends on?
    • ...those building CPython with a JIT from a random commit?
    • ...those building CPython with a JIT from a release tag?

@brandtbucher added the build, 3.13, and dependencies labels on Feb 23, 2024
@terryjreedy (Member) commented Feb 23, 2024

A couple of naive questions:

  1. Is the part of LLVM needed to produce jit_stencils.h small enough to consider extracting it somehow into our repository? And is the algorithm for doing so stable enough that it would not become a maintenance burden?

  2. What is the risk of the status quo? If a change to LLVM broke our usage of it, would that necessarily be considered a breakage of LLVM itself? Do we only depend on documented (and hopefully tested) behavior?

@zooba (Member) commented Feb 26, 2024

Should all platforms be in one file (with each platform guarded by #ifdefs), or many files?

This one is easy enough for me to answer right now: many files. If we have to download all the platforms every time, it may as well just be checked into the main repo.

The advantage here is space saved by not including every platform in the main repo (and then not having to decide which ones we do include). If that's not an interesting saving, then this isn't a question about hosting, it's a question about caching the code.

(And just for context, my sensitivity to time taken by a new process that runs in every clean CI build is about 15 seconds. If it takes more than that, I want a workaround that takes less than that. And my sensitivity for local development is basically 0 more installers to run - time matters less because there'll be caching.)

@savannahostrowski (Member) commented

I'd be interested to understand folks' goals for this and any additional rationale. Are there specific pain points we hope to resolve by eliminating the dependency?

@zooba (Member) commented May 1, 2024

We don't like having build-time dependencies that can fail due to networking issues or installation issues. Virtually all our network access is only to github.com (the exception being apt install on Linux, which is the most flaky part of that build), and is only accessing our own repositories (where we implicitly trust the potential contributors).

Docs builds are also a minor exception for now, but those are being separated out in official builds so that our supply chain is as clean and tight as we can make it.

So basically, the goal is build reliability, and our way of achieving that is to have all of our build-time dependencies somewhere under github.com/python.

A secondary goal is build time, which is why we have checked in generated files and regen them when their sources change. This is primarily for local dev builds (CI has already gotten way out of hand; I don't think we'll ever get PR builds down to a reasonably fast check anymore, but that used to be a goal). We also don't assume that our contributors have the ability/desire to install additional apps beyond their system compilers, or that they have sufficient internet capacity for anything non-essential or large.

So for the sake of contribution, we don't want contributors to have to locate/download/install anything outside of their main compiler unless it's scripted as a normal part of the build and highly reliable.¹

Footnotes

  1. Or at least "as reliable as the rest of our dependencies", which more or less means if GitHub is down, builds can fail, but we don't want that list getting longer - not even PyPI.

@vstinner (Member) commented May 1, 2024

I'd be interested to understand folks' goals for this and any additional rationale. Are there specific pain points we hope to resolve by eliminating the dependency?

Usually Linux distributions only include one LLVM version, like LLVM 17 and clang 17 (the version used by Fedora 39). Previously, the Python JIT compiler required clang 16, so it didn't work; now it requires clang 18, so it still doesn't work.

If Python source code (ex: in the Git repository) contains code generated by LLVM, Python doesn't have to attempt to use the same LLVM version as the one used by Debian (stable / old-stable), Ubuntu (latest / LTS), Fedora (Rawhide / stable), etc. Spoiler: there is no single LLVM version available on all Linux distributions if you consider all flavors (especially development versions vs stable versions).

@eli-schwartz (Contributor) commented May 1, 2024

Some distros actually do provide multiple LLVM versions, but those tend to be more "advanced" distros. Even there, llvm is generally a fairly hefty burden to install. Especially if it's the only software you have that uses llvm, because your system uses a GCC toolchain. In contrast, cpython is an extremely fundamental package that is used extensively by the system stack.

(llvm is currently needed by... okay, well, I do need llvm 17 for a) mesa and b) gnome gjs / cinnamon cjs, which use mozilla spidermonkey. I don't need any other version of llvm, and I wouldn't need either of those if I were running a server system.)

The real kicker is that llvm depends on cpython. If cpython also depends on llvm, then which one do you build first? Answer: you have to build cpython twice, once without the JIT and once with the JIT. Dependency cycles are dreary and depressing to deal with, and may not be fully automatable at all. They are best avoided where possible, and every package that has to be added into the bootstrap set for extra-special handling is an extra burden.

Hosting the stencils, just like any other generated code cpython uses, would allow sidestepping this worry. No need to pull llvm into the bootstrap set or add special cases to build it twice.

@encukou (Member) commented Sep 26, 2024

FWIW, given my experience as a Fedora maintainer: for the system Python package, Fedora will want to avoid pre-generated files (checked in or downloaded), which also means trying rather hard to have a working system-packaged version of LLVM. (You can ping @hroncok if you want details/confirmation.)

@hroncok (Contributor) commented Sep 26, 2024

We also have multiple llvm versions available in Fedora.

@zooba (Member) commented Sep 27, 2024

If we wrote all the stencils by hand and had them checked in, would you insist on recreating them?

What if we generated them and then validated them by hand, and then checked them in? Would you recreate them and revalidate before accepting your regeneration?

At what point do you decide to trust the sources provided by the project vs. trusting your own generated/unvendored ones more?

@hroncok (Contributor) commented Sep 27, 2024

At what point do you decide to trust the sources provided by the project vs. trusting your own generated/unvendored ones more?

At a point where it's more of a binary blob than "source". I understand that line is fuzzy.

@zooba (Member) commented Sep 27, 2024

That's a very reasonable line, and I don't see any way around it for this case (assuming an initialised C array is more of a binary blob, which I'd sure say it is).

Though in those circumstances, having to build CPython without the JIT in order to bootstrap clang so that you can rebuild CPython with the JIT seems entirely reasonable to me. Yes, it's an annoying amount of work, but it happens once at the distro level (whose job is to do the annoying work) rather than happening every time at the user/contributor level.

@brandtbucher (Member, Author) commented

So, following up on the discussion that we had at the core dev sprint: there is definitely an appetite for removing the LLVM requirement for "normal" JIT builds, so we should begin work on making that a reality (this will also unblock shipping the JIT off-by-default in 3.14, since it makes the release process a lot cleaner).

The general consensus we reached was:

  • Let's provide pre-generated stencils for non-debug builds on tier-one platforms (more on this later).
  • It's probably okay to check these into the repo (more on this later).

What this means is that, if you are compiling a release build of CPython from a checkout that hasn't modified any of the JIT's input files, and you would also like to build the JIT, you should be able to do so without either Python or LLVM already installed. For everyone else, the process will work exactly as it does today.

As far as changing the JIT itself goes, the change is pretty simple: #include "jit_stencils.h" will be replaced with something like this:

#if defined(__APPLE__) && defined(__aarch64__)
    #ifdef USE_PREBUILT_JIT_STENCILS
        #include "somewhere/else/jit_stencils_aarch64-apple-darwin.h"
    #else
        #include "jit_stencils_aarch64-apple-darwin.h"
    #endif
#elif defined(__APPLE__) && defined(__x86_64__)
    #ifdef USE_PREBUILT_JIT_STENCILS
        #include "somewhere/else/jit_stencils_x86_64-apple-darwin.h"
    #else
        #include "jit_stencils_x86_64-apple-darwin.h"
    #endif
#elif defined(__linux__) && defined(__aarch64__)
    #ifdef USE_PREBUILT_JIT_STENCILS
        #include "somewhere/else/jit_stencils_aarch64-unknown-linux-gnu.h"
    #else
        #include "jit_stencils_aarch64-unknown-linux-gnu.h"
    #endif
#elif defined(__linux__) && defined(__x86_64__)
    #ifdef USE_PREBUILT_JIT_STENCILS
        #include "somewhere/else/jit_stencils_x86_64-unknown-linux-gnu.h"
    #else
        #include "jit_stencils_x86_64-unknown-linux-gnu.h"
    #endif
#elif defined(_M_ARM64)
    #ifdef USE_PREBUILT_JIT_STENCILS
        #include "somewhere/else/jit_stencils_aarch64-pc-windows-msvc.h"
    #else
        #include "jit_stencils_aarch64-pc-windows-msvc.h"
    #endif
#elif defined(_M_X64)
    #ifdef USE_PREBUILT_JIT_STENCILS
        #include "somewhere/else/jit_stencils_x86_64-pc-windows-msvc.h"
    #else
        #include "jit_stencils_x86_64-pc-windows-msvc.h"
    #endif
#elif defined(_M_IX86)
    #ifdef USE_PREBUILT_JIT_STENCILS
        #include "somewhere/else/jit_stencils_i686-pc-windows-msvc.h"
    #else
        #include "jit_stencils_i686-pc-windows-msvc.h"
    #endif
#else
    #error "Unsupported platform!"
#endif

When re-generating the JIT stencils locally, the process will work the same as today, except that the target triple will be appended to the generated file's name, as shown above (this scheme solves GH-114809, whether the files are pre-generated or not).

So how do the pre-built ones get regenerated? A scheme that makes sense to me would be to piggyback on our CI in GitHub Actions:

  1. When somebody modifies any of the JIT inputs, JIT CI runs.
  2. This CI re-generates the JIT stencils for each platform.
  3. If the files match the ones present in the repo, the build continues and tests are run.
  4. If a file doesn't match, the job fails and the newly-generated file is made available to the author of the PR to add to their branch (either as some sort of build artifact, or as a bot PR against the author's branch on their fork).

This combines two closely related things that we want to do: run JIT CI on the PR for a bunch of platforms, and check that the files are up-to-date for each of those same platforms. Note also that only step 4 is new; the rest of the process is the same as what we already have.
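
A hypothetical sketch of what the step-3/step-4 check could look like (the file names, directory layout, and artifact-upload step are all assumptions, not existing CPython tooling):

import hashlib
import pathlib
import sys

def digest(path):
    # Hash a file so comparisons don't depend on timestamps or paths.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def main(committed_dir, regenerated_dir):
    stale = []
    for fresh in sorted(pathlib.Path(regenerated_dir).glob("jit_stencils_*.h")):
        committed = pathlib.Path(committed_dir) / fresh.name
        if not committed.exists() or digest(committed) != digest(fresh):
            stale.append(fresh.name)
    if stale:
        # In real CI, this is where the fresh files would be uploaded as a
        # build artifact (or pushed as a bot PR) for the PR author to pick up.
        print("Out-of-date stencils:", ", ".join(stale))
        return 1
    print("All pre-generated stencils are up to date.")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))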

Putting them somewhere outside of the repo

A few core devs expressed concern about the impact of the generated files on the repo size. While checking them directly into the repo is easiest, they're big (and once they're in there, they're in there forever). Alternatives do exist, should we wish to explore them. The most viable option appeared to be putting them in another repository in the Python org and referencing that. There was strong pushback at the sprint against using git submodules, so if we did go this route, we would probably end up rolling our own version of submodules that's better suited to our workflow.

One possible scheme, based on an idea shared by @nascheme, is to have a "manifest" file that contains a hash referencing a set of files in the stencils repo. Rather than committing the generated files into a PR author's branch, the stencil-generating CI jobs would instead commit the generated files into the stencils repo and provide the PR author with a new hash to update the manifest with on their branch. The upside is that the files are kept out of the repo. The downside is that the tooling would need to clone a separate repo as part of the build process (or the user would need to have a cloned copy somewhere and would need to communicate its location to the build scripts).
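
Purely as an illustration of that idea (the manifest location, repo URL, and cache directory below are all hypothetical), the build tooling could pin the stencils repo to a single commit recorded in the main repo:

import pathlib
import subprocess

MANIFEST = pathlib.Path("Tools/jit/stencils.manifest")     # hypothetical: holds one commit hash
STENCILS_REPO = "https://github.com/python/jit-stencils"   # hypothetical repo
CACHE = pathlib.Path("build/jit-stencils")

def fetch_stencils():
    # Clone the stencils repo once, then check out exactly the commit that
    # the manifest in the main repo points at.
    pinned = MANIFEST.read_text().strip()
    if not CACHE.exists():
        subprocess.run(["git", "clone", "--no-checkout", STENCILS_REPO, str(CACHE)], check=True)
    subprocess.run(["git", "-C", str(CACHE), "fetch", "origin", pinned], check=True)
    subprocess.run(["git", "-C", str(CACHE), "checkout", pinned], check=True)
    return CACHE

if __name__ == "__main__":
    print("Pre-generated stencils available under", fetch_stencils())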

Providing pre-generated stencils for debug builds

In theory, builds configured with --with-pydebug should be binary-compatible with release builds, so the pre-generated release stencils should "just work". There are three main issues with this:

  • While the builds are compatible from an "official ABI" standpoint, the stencils themselves are part of the interpreter core, and use something of an "internal ABI". Currently, trying to use release stencils on a debug build crashes, I suspect due to some internal struct members being conditionally defined differently on debug builds (but I'm not sure).
  • --with-pydebug implies Py_REF_DEBUG, which puts _Py_INCREF_IncRefTotal and _Py_DECREF_DecRefTotal calls all over the place (and also adds checks for negative refcounts). Omitting these in the JIT code results in incorrect sys.gettotalrefcount values, and negative refcount checks can't be relied on to abort the process anymore. So we would probably need to decouple the two configure options (but this could mean that refleaks in JIT builds are much harder to detect and hunt down).
  • --with-pydebug implies --with-assertions. If we used release stencils on debug builds, assertions would not be active in JIT code. This isn't incorrect, but it's probably unexpected.

So, if we wanted to remove the LLVM requirement for debug builds by just re-using the release stencils, we would need to address these issues. The simpler, but more wasteful option would just be to generate separate debug stencils for every platform.


Regarding next steps, I think the following sequence makes sense:

  • Start suffixing the generated files with the target triple, as shown above.
  • Optional: we can update the Makefile and/or the JIT build scripts to generate multiple different files on "fat" macOS builds, to fix JIT & macOS fat builds #114809. It might be simpler to just have the Makefile call the build script twice? Not sure.
  • Then, we can experiment with using CI to check in files for a couple of platforms. We can either put them in the repo on someone's fork, or try the separate-stencils-repo option for prototyping. Either way, we shouldn't merge this until we have consensus regarding the approach; this step is more about figuring out the CI portion of the problem. This will also involve solving the potentially-tricky problem of detecting whether the current build is even able to use the pre-generated files.

Curious to hear others' thoughts. @savannahostrowski, alright if I assign you?

@dolfinus commented Nov 28, 2024

Rather than committing the generated files into a PR author's branch, the stencil-generating CI jobs would instead commit the generated files into the stencils repo, and provide the PR author with a new hash to update the manifest with on their branch.

That sounds almost the same as Git LFS - large files are stored somewhere else (usually some S3 instance), and git tracks only the file hashes. But unlike a separate repo, git clone could fetch files stored in LFS without any changes to the CI process (the git-lfs plugin is usually already installed on CI runners).

@savannahostrowski savannahostrowski self-assigned this Nov 28, 2024
@nascheme (Member) commented

After thinking about it more, I don't think a manifest file is required. Here is a rough idea:

  • make the path to the stencil file depend on the hash of the inputs and the platform triple

  • use something similar to _compute_digest() in Tools/jit/_targets.py to generate a directory or file path (see the sketch after this list), for example:

jit_stencils/3.14/x86_64-pc-linux-gnu/ba5e6e3182fa318dd7fdab91fa5d5600c1f9068798c1baa52e779fdcacfc0f7b.h

  • the jit_stencils/ folder is just another git repo stored on github.com/python/..., similar to other external deps

  • we can have a CI tool that detects when stencils are missing, generates them, and commits them to the jit_stencils repo.

  • for end users, we can also provide a tool that works if you have a network connection, populating jit_stencils/ with only the stencil file(s) you need to build the source version you have. I'm not familiar with it but maybe a git sparse checkout would work.

  • for cpython devs, we could have some tooling to make life easier, something like pull-stencils.py and commit-stencils.py. It might be best to let only the CI system commit to the stencils repo. If you have LLVM, you could just generate updated stencils like you currently do and not commit them.
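
A rough sketch of how the content-addressed path could be derived, in the spirit of the layout above (this is not the actual _compute_digest() code, and the list of input files is an assumption):

import hashlib
import pathlib

JIT_INPUTS = [                                  # assumption: whatever files the stencils depend on
    pathlib.Path("Tools/jit/template.c"),
    pathlib.Path("Include/Python.h"),
]

def stencil_path(version, triple):
    # Hash the JIT's input files and map (version, triple, digest) to a path
    # like jit_stencils/3.14/x86_64-pc-linux-gnu/<digest>.h.
    digest = hashlib.sha256()
    for source in JIT_INPUTS:
        digest.update(source.read_bytes())
    return pathlib.Path("jit_stencils") / version / triple / f"{digest.hexdigest()}.h"

if __name__ == "__main__":
    print(stencil_path("3.14", "x86_64-pc-linux-gnu"))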
