Hosting jit_stencils.h
#115869
A couple of naive questions:
This one is easy enough for me to answer right now: many files. If we have to download all the platforms every time, it may as well just be checked into the main repo. The advantage here is space saved by not including every platform in the main repo (and then not having to decide which ones we do include). If that's not an interesting saving, then this isn't a question about hosting, it's a question about caching the code. (And just for context, my sensitivity to time taken by a new process that runs in every clean CI build is about 15 seconds. If it takes more than that, I want a workaround that takes less than that. And my sensitivity for local development is basically zero additional installers to run - time matters less because there'll be caching.)
I'd be interested to understand folks' goals for this and any additional rationale. Are there specific pain points we hope to resolve by eliminating the dependency?
We don't like having build-time dependencies that can fail due to networking issues or installation issues. Virtually all our network access is only to github.com (Docs builds are a minor exception for now, but those are being separated out in official builds), so that our supply chain is as clean and tight as we can make it. So basically, the goal is build reliability, and our way of achieving that is to have all of our build-time dependencies somewhere under github.com/python. A secondary goal is build time, which is why we check in generated files and regenerate them when their sources change. This is primarily for local dev builds (CI has already gotten way out of hand; I don't think we'll ever get PR builds down to a reasonably fast check anymore, but that used to be a goal). We also don't assume that our contributors have the ability or desire to install additional apps beyond their system compilers, or that they have sufficient internet capacity for anything non-essential or large. So for the sake of contribution, we don't want contributors to have to locate/download/install anything outside of their main compiler unless it's scripted as a normal part of the build and highly reliable.
Usually Linux distributions only include one LLVM version, like LLVM 17 and clang 17 (the versions used by Fedora 39). Before, the Python JIT compiler required clang 16, and so it didn't work. Now it requires clang 18, and so it still doesn't work. If the Python source code (e.g. in the Git repository) contains the code generated by LLVM, Python doesn't have to attempt to use the same LLVM version as the one used by Debian (stable / old-stable), Ubuntu (latest / LTS), Fedora (Rawhide / stable), etc. Spoiler: there is no single LLVM version available on all Linux distributions if you consider all flavors (especially development versions vs stable versions).
Some distros actually do provide multiple LLVM versions, but those tend to be more "advanced" distros. Even there, llvm is generally a fairly hefty burden to install, especially if it's the only software you have that uses llvm because your system uses a GCC toolchain. In contrast, cpython is an extremely fundamental package that is used extensively by the system stack. (llvm is currently needed by... okay, well, I do need llvm 17 for a) mesa and b) gnome gjs / cinnamon cjs, which use mozilla spidermonkey. I don't need any other version of llvm, and I wouldn't need either of those if I were running a server system.) The real kicker is that llvm depends on cpython. If cpython also depends on llvm, then which one do you build first? Answer: you have to build cpython twice, once without the JIT and once with the JIT. Dependency cycles are dreary and depressing to deal with, and may not be able to be fully automated at all. They are best avoided where possible, and every package that has to be added to the bootstrap set for extra-special handling is an extra burden. Hosting the stencils, just like any other generated code cpython uses, would allow sidestepping this worry: no need to pull llvm into the bootstrap set or add special cases to build it twice.
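To make that cycle concrete, here is a toy model in Python (the package names and the stage-1 split are illustrative, not real distro tooling):

# Toy model of the bootstrap cycle: a topological sort cannot order mutually
# dependent packages, and the usual fix is a stage-1 cpython built without
# the JIT (and therefore without the llvm dependency).
from graphlib import TopologicalSorter, CycleError

cyclic = {"cpython": {"llvm"}, "llvm": {"cpython"}}
try:
    print(list(TopologicalSorter(cyclic).static_order()))
except CycleError as error:
    print("unbuildable as-is:", error.args[1])

staged = {
    "cpython-stage1": set(),        # no JIT, so no llvm needed
    "llvm": {"cpython-stage1"},     # llvm's build uses the stage-1 python
    "cpython-jit": {"llvm"},        # full rebuild with the JIT enabled
}
print(list(TopologicalSorter(staged).static_order()))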
FWIW, given my experience as a Fedora maintainer: for the system Python package, Fedora will want to avoid pre-generated files (checked in or downloaded), which also means trying rather hard to have a working system-packaged version of LLVM. (You can ping @hroncok if you want details/confirmation.)
We also have multiple llvm versions available in Fedora.
If we wrote all the stencils by hand and had them checked in, would you insist on recreating them? What if we generated them, validated them by hand, and then checked them in? Would you recreate them and revalidate before accepting your regeneration? At what point do you decide to trust the sources provided by the project vs. trusting your own generated/unvendored ones more?
At a point where it's more of a binary blob than "source". I understand that line is fuzzy.
That's a very reasonable line, and I don't see any way around it for this case (assuming an initialised C array is more of a binary blob, which I'd sure say it is). Though in those circumstances, having to build CPython without the JIT in order to bootstrap clang so that you can rebuild CPython with the JIT seems entirely reasonable to me. Yes, it's an annoying amount of work, but it happens once at the distro level (whose job is to do the annoying work) rather than every time at the user/contributor level.
So, following up on the discussion that we had at the core dev sprint: there is definitely an appetite for removing the LLVM requirement for "normal" JIT builds, so we should begin work on making that a reality (this will also unblock shipping the JIT off-by-default in 3.14, since it makes the release process a lot cleaner). The general consensus we reached was:
What this means is that, if you are compiling a release build of CPython from a checkout that hasn't modified any of the JIT's input files, and you would also like to build the JIT, you should be able to do so without either Python or LLVM already installed. For everyone else, the process will work exactly as it does today. As far as changing the JIT itself, the change is pretty simple:

#if defined(__APPLE__) && defined(__aarch64__)
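/* In each branch below, USE_PREBUILT_JIT_STENCILS selects a pre-built header
   hosted outside the source tree ("somewhere/else" is a placeholder path)
   instead of the locally generated one. */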
#ifdef USE_PREBUILT_JIT_STENCILS
#include "somewhere/else/jit_stencils_aarch64-apple-darwin.h"
#else
#include "jit_stencils_aarch64-apple-darwin.h"
#endif
#elif defined(__APPLE__) && defined(__x86_64__)
#ifdef USE_PREBUILT_JIT_STENCILS
#include "somewhere/else/jit_stencils_x86_64-apple-darwin.h"
#else
#include "jit_stencils_x86_64-apple-darwin.h"
#endif
#elif defined(__linux__) && defined(__aarch64__)
#ifdef USE_PREBUILT_JIT_STENCILS
#include "somewhere/else/jit_stencils_aarch64-unknown-linux-gnu.h"
#else
#include "jit_stencils_aarch64-unknown-linux-gnu.h"
#endif
#elif defined(__linux__) && defined(__x86_64__)
#ifdef USE_PREBUILT_JIT_STENCILS
#include "somewhere/else/jit_stencils_x86_64-unknown-linux-gnu.h"
#else
#include "jit_stencils_x86_64-unknown-linux-gnu.h"
#endif
#elif defined(_M_ARM64)
#ifdef USE_PREBUILT_JIT_STENCILS
#include "somewhere/else/jit_stencils_aarch64-pc-windows-msvc.h"
#else
#include "jit_stencils_aarch64-pc-windows-msvc.h"
#endif
#elif defined(_M_X64)
#ifdef USE_PREBUILT_JIT_STENCILS
#include "somewhere/else/jit_stencils_x86_64-pc-windows-msvc.h"
#else
#include "jit_stencils_x86_64-pc-windows-msvc.h"
#endif
#elif defined(_M_IX86)
#ifdef USE_PREBUILT_JIT_STENCILS
#include "somewhere/else/jit_stencils_i686-pc-windows-msvc.h"
#else
#include "jit_stencils_i686-pc-windows-msvc.h"
#endif
#else
#error "Unsupported platform!"
#endif

When re-generating the JIT stencils locally, the process will work the same as today, except that the target triple will be appended to the end of the generated file's name, as shown above (this scheme solves GH-114809, whether the files are pre-generated or not). So how do the pre-built ones get regenerated? A scheme that makes sense to me would be to piggyback on our CI in GitHub Actions:
This combines two closely related things that we want to do: run JIT CI on the PR for a bunch of platforms, and check that the files are up-to-date for each of those same platforms. Note also that only step 4 is new; the rest of the process is the same as what we already have.

Putting them somewhere outside of the repo

A few core devs expressed concern about the impact of the generated files on the repo size. While checking them directly into the repo is easiest, they're big (and once they're in there, they're in there forever). Alternatives do exist, should we wish to explore them. The most viable appeared to be putting them in another Python org repository, and referencing that. There was strong pushback at the sprint against using git submodules, so if we did go this route, we would probably end up rolling our own version of submodules that's better suited to our workflow. One possible scheme, based on an idea shared by @nascheme, is one where we have a "manifest" file that contains a hash referencing a set of files in the stencils repo. Rather than committing the generated files into a PR author's branch, the stencil-generating CI jobs would instead commit the generated files into the stencils repo, and provide the PR author with a new hash to update the manifest with on their branch (see the sketch at the end of this comment). The upside is that the files are kept out of the repo. The downside is that the tooling could need to clone a separate repo as part of the build process (or the user would need to have a cloned copy somewhere and would need to communicate its location to the build scripts).

Providing pre-generated stencils for debug builds

In theory, builds configured
So, if we wanted to remove the LLVM requirement for debug builds by just re-using the release stencils, we would need to address these issues. The simpler but more wasteful option would just be to generate separate debug stencils for every platform. Regarding next steps, I think the following sequence makes sense:
Curious to hear others' thoughts. @savannahostrowski, alright if I assign you?
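As a concrete illustration of the manifest scheme sketched above (the repository name, manifest path, and URL layout are all hypothetical, not actual tooling):

# Hypothetical sketch of the manifest scheme: the main repo tracks only a
# commit hash, and the build fetches pinned stencils from a separate
# python/jit-stencils repository. None of these paths or URLs are real.
import pathlib
import urllib.request

MANIFEST = pathlib.Path("Tools/jit/stencils.manifest")  # holds one commit hash
REPO_RAW = "https://raw.githubusercontent.com/python/jit-stencils"  # made up

def fetch_stencils(triple: str, dest: pathlib.Path) -> None:
    """Download the pre-generated header for one target triple, pinned by the manifest."""
    commit = MANIFEST.read_text().strip()
    url = f"{REPO_RAW}/{commit}/jit_stencils_{triple}.h"
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())

fetch_stencils("x86_64-unknown-linux-gnu",
               pathlib.Path("jit_stencils_x86_64-unknown-linux-gnu.h"))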
That sounds almost the same as Git LFS: large files are stored somewhere else (usually some S3 instance), and git tracks only file hashes. But unlike a separate repo, git clone could fetch files stored in LFS without any changes to the CI process (the git-lfs plugin is usually already installed on CI runners).
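For reference, an LFS pointer really is tiny; this sketch prints the pointer content git would track in place of a stencil header (the file name here is illustrative):

# The repo stores a small pointer (hash + size) while the blob lives in LFS
# storage; this is the standard LFS pointer format.
import hashlib
import pathlib

blob = pathlib.Path("jit_stencils_x86_64-unknown-linux-gnu.h").read_bytes()
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{hashlib.sha256(blob).hexdigest()}\n"
    f"size {len(blob)}\n"
)
print(pointer)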
After thinking more about it, I don't think a manifest file is required. Here is a rough idea:
While this is probably desirable, I'm not quite sure if it's feasible. With that said, several people (@vstinner at the sprint and @zooba during PR review) have expressed a desire to remove the LLVM build-time dependency for JIT builds. Let's have that conversation here.
Background
When building CPython with the JIT enabled, LLVM 18 (previously LLVM 16) is used to compile Tools/jit/template.c many times and to process the resulting object files into a file called jit_stencils.h in the build directory.
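As a rough sketch of that pipeline (illustrative only: the real logic lives in Tools/jit/, and the micro-op list, clang flags, and output format here are simplified stand-ins):

# Illustrative sketch of the stencil build step, not the real Tools/jit code.
# A real build compiles the template once per micro-op and then parses the
# object files (machine code + relocations); here we just dump raw object
# bytes as C arrays to show the shape of the process.
import pathlib
import subprocess

TEMPLATE = pathlib.Path("Tools/jit/template.c")
MICRO_OPS = ["_LOAD_FAST", "_STORE_FAST", "_BINARY_OP"]  # really: every uop

def compile_once(opcode: str) -> bytes:
    """Compile the template for a single micro-op and return the object bytes."""
    obj = pathlib.Path(f"{opcode}.o")
    subprocess.run(
        ["clang", "-c", "-O2", f"-D_JIT_OPCODE={opcode}",
         "-o", str(obj), str(TEMPLATE)],
        check=True,
    )
    return obj.read_bytes()

with open("jit_stencils.h", "w") as out:
    for op in MICRO_OPS:
        data = ", ".join(f"0x{b:02x}" for b in compile_once(op))
        out.write(f"static const unsigned char {op}_body[] = {{{data}}};\n")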
A useful analogy

Because this file depends on Python.h (and thus pyconfig.h and many build-specific configuration options, including things like _DEBUG/NDEBUG/Py_DEBUG/etc.) and contains binary code, it is probably most useful to think of jit_stencils.h as a binary extension module.

If we could build, host, and manage compiled versions of, say, itertoolsmodule.c somewhere and have it work correctly for those who need it, then such a scheme would probably work for jit_stencils.h.

Open questions
One big file (with platform #ifdefs), or many files?

Linked PRs
jit_stencils.h reproducible #127166