Tracking issue for speeding up rustc via its build configuration #103595
Comments
Another thing to try that was brought up on Zulip is aligning the text segment to 2 MiB so it can be loaded in a way that makes transparent huge pages kick in. I'm not 100% sure this even works yet, though: last time I checked, large page support in the page cache was still a WIP, and I haven't seen it in release notes.
An update for Windows:
Given that it's not even clear that this works, let's leave it off this issue for now.
It looks like file-backed huge pages are supported now. https://www.kernel.org/doc/html/latest/filesystems/proc.html#meminfo
I think it was introduced with torvalds/linux@793917d, so it needs at least 5.18. It also depends on filesystem support.
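To make the huge-page idea concrete, the constraint is that a text segment can only be backed by 2 MiB transparent huge pages if its load address is 2 MiB-aligned; a minimal sketch of that check (the addresses below are illustrative, not rustc's actual load addresses):

```rust
// x86-64 transparent huge pages are 2 MiB, so a segment qualifies only
// when its load address is a multiple of 2 MiB.
const HUGE_PAGE: u64 = 2 * 1024 * 1024;

fn is_thp_aligned(addr: u64) -> bool {
    addr % HUGE_PAGE == 0
}

fn main() {
    // A typical 4 KiB page alignment does not qualify; 4 MiB does.
    println!("{}", is_thp_aligned(0x1000));   // false
    println!("{}", is_thp_aligned(0x400000)); // true
}
```

In practice the alignment would come from a linker flag such as `-z max-page-size=0x200000`, rather than from application code.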
I read, and actually checked, that rustc links dynamically against rustc_driver. The same article said we dlopen codegen backends. Isn't there a way to ship, for the common case, a static binary: rustc + rustc_driver + LLVM?
We could do that in theory, but I'm not sure it would be that useful. Static linking has some benefits, but mostly only for tiny crates; the diminishing returns kick in early.
I would enable LTO over the full binary. Startup time may also be better. |
Well, the "full binary" is one function call into `rustc_driver`. Shipping a statically linked…
My LLVM folder is full of large statically linked LTO'd binaries (OSX). Could you also statically link in the LLVM backend? |
In theory, yes. In practice, I'm not sure if our current build system supports it (will check).
No worries. `-rwxr-xr-x 1 xxx staff 129M Jul 28 18:44 clang-15` It only links against system libraries and no LLVM libraries.
I'm not sure if it helps, but on Windows LLVM can be built with a different allocator. The LLVM PR (https://reviews.llvm.org/D71786) also showed some pretty major performance gains when it was implemented.
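For reference, a hedged sketch of what such a build might look like; `LLVM_INTEGRATED_CRT_ALLOC` is the CMake option that D71786 added, and the rpmalloc path below is a placeholder:

```bat
:: Windows-only configuration fragment: replace the default CRT allocator
:: with rpmalloc. Requires a local rpmalloc checkout; the path is hypothetical.
cmake -G Ninja ..\llvm ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DLLVM_USE_CRT_RELEASE=MT ^
  -DLLVM_INTEGRATED_CRT_ALLOC=C:\src\rpmalloc
```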
@michaelwoerister also suggested in #49180 that we could set…
Note that there's a bit of a catch-22. We could start adding specialized SIMD impls for some important core routines if std were built with a higher baseline, which would increase the performance delta. But as long as such builds don't exist, it's hardly worth it, because it'll only benefit users of `build-std`.

Do any of the ARM targets offer a baseline that's high enough to include fancy SIMD features? Maybe some generic SIMD-ification work on those could show potential benefits that would be enabled on top of what's gained by compiling with a higher baseline. Or maybe an AVX2 codepath could be added to hashbrown, since that can benefit some users even without `build-std`. @Amanieu, have there been any experiments in that direction?
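The runtime-dispatch pattern such SIMD code paths typically use looks roughly like the sketch below; the AVX2 branch is a stand-in (simdutf8 and hashbrown do something similar but far more elaborate):

```rust
// Sketch: pick a vectorized code path at runtime when the CPU supports it,
// falling back to scalar otherwise. With a higher compile-time baseline
// (e.g. x86-64-v3), the runtime check could be resolved at compile time.
fn validate_utf8(bytes: &[u8]) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // an AVX2 implementation would go here; this sketch just
            // falls through to the scalar path
        }
    }
    std::str::from_utf8(bytes).is_ok() // scalar fallback
}

fn main() {
    println!("{}", validate_utf8(b"hello"));      // true
    println!("{}", validate_utf8(&[0xff, 0xfe])); // false
}
```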
Apple aarch64 should have a very modern baseline. |
That's a very good point, I agree with that! I think that we could start with the compiler (and its stdlib), to potentially make it faster, but still keep the actual Linux x64 target without v2/v3/v4 CPU features by default. Do you have any ideas where we could start using x86 v2 CPU features?
Doesn't the compiler link the same stdlib it uses to build programs?
Adopting the UTF-8 validation impl from simdutf8. Other than that, it'll need some exploration. Maybe the stdsimd folks have some ideas on tap. Maybe rustc's hash could be replaced with a different mixing function? ...odd, I can't find the feature level of the PCLMULQDQ instruction.
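For context on the mixing-function idea: rustc's FxHasher folds each word in with a rotate-xor-multiply step roughly like this sketch (the constant and structure follow the widely known FxHash scheme, but this is an illustration, not the rustc-hash source):

```rust
// One FxHash-style mixing step: rotate the running hash, xor in the new
// word, multiply by a large odd constant. Very fast, but with weaker
// diffusion than e.g. a PCLMULQDQ- or AES-based mixer would give.
fn fx_mix(hash: u64, word: u64) -> u64 {
    const SEED: u64 = 0x51_7c_c1_b7_27_22_0a_95;
    (hash.rotate_left(5) ^ word).wrapping_mul(SEED)
}

fn main() {
    let h = [1u64, 2, 3].iter().fold(0, |h, &w| fx_mix(h, w));
    println!("{:#x}", h);
}
```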
IIRC it doesn't; it has its own copy. Conceptually, it needs to be able to add any (target) stdlib to the resulting program in order to cross-compile. But I might be wrong.
The targets on the Platform Support page look like legit target triples. Could you encode v4 in the triple and ship two Linux versions for x86?
In theory yes, but I'm not sure it's the best solution. Maybe we could bump the default target to v2/v3 and keep an unoptimized v1 target for people with old CPUs. In any case, maintaining a target is not free, so maybe there are better solutions. To clarify, it's a very different thing to ship a v2 compiler and to make the x86 Linux target v2 by default. We're really only considering the first thing for now.
Shipping a highly tuned v4 compiler with -mtune=icelake should give some speedup, and you could use current CPU features: AVX-512, PCLMULQDQ, etc. Haswell was launched June 4, 2013.
Yes, our measurements show that v3 produces a ~1-3% speedup for the compiler. But on its own, that hasn't been worth it so far, because there are non-trivial maintenance costs for doing that, plus we would drop some existing users. We'll need to tread carefully.
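For local experiments, the baseline can be raised per-build with rustc's generic `target-cpu` flag (a sketch of the mechanism, not the dist configuration):

```shell
# Build with the x86-64-v3 baseline (AVX2, BMI2, FMA, ...);
# the resulting binary won't run on CPUs older than the Haswell class.
RUSTFLAGS="-C target-cpu=x86-64-v3" cargo build --release

# List the target features a named CPU level enables:
rustc --print cfg -C target-cpu=x86-64-v3
```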
AVX-512 is not really viable for broad use because AMD only started shipping it recently, and Intel has only shipped it to some of their market segments (workstation/server chips) and even disabled it on some recent chips due to inconsistencies between P and E cores. See https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam?platform=linux
Build `rustc` with 1CGU on `x86_64-pc-windows-msvc`

Distribute `x86_64-pc-windows-msvc` artifacts built with `rust.codegen-units=1`, like we already do on Linux.

1) effect on code size on `x86_64-pc-windows-msvc`: it's a 3.67% reduction on `rustc_driver.dll`
   - before, [`41d97c8a5dea2731b0e56fe97cd7cb79e21cff79`](https://ci-artifacts.rust-lang.org/rustc-builds/41d97c8a5dea2731b0e56fe97cd7cb79e21cff79/rustc-nightly-x86_64-pc-windows-msvc.tar.xz): 137605632
   - after, [`704aaa875e4acccc973cbe4579e66afbac425691`](https://ci-artifacts.rust-lang.org/rustc-builds/704aaa875e4acccc973cbe4579e66afbac425691/rustc-nightly-x86_64-pc-windows-msvc.tar.xz): 132551680
2) time it took on CI
   - the [first `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8155647651/job/22291592507) took: 1h 31m
   - the [second `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8157043594/job/22295790552) took: 1h 32m
3) most recent perf results:
   - on a slightly noisy desktop [here](rust-lang#112267 (comment))
   - ChrisDenton's results [here](rust-lang#112267 (comment))

Related tracking issue for build configuration: rust-lang#103595
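As a sanity check on the quoted percentage, the reduction can be recomputed from the two artifact sizes listed above:

```rust
// Percentage reduction between the before/after artifact sizes (bytes)
// quoted in the PR description above.
fn reduction_pct(before: u64, after: u64) -> f64 {
    (before - after) as f64 / before as f64 * 100.0
}

fn main() {
    // rustc_driver.dll on x86_64-pc-windows-msvc
    println!("{:.2}", reduction_pct(137_605_632, 132_551_680)); // prints 3.67
}
```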
Build `rustc` with 1CGU on `x86_64-apple-darwin`

Distribute `x86_64-apple-darwin` artifacts built with `rust.codegen-units=1`, like we already do on Linux.

1) effect on code size on `x86_64-apple-darwin`: it's an 11.14% reduction on `librustc_driver.dylib`
   - before, [`41d97c8a5dea2731b0e56fe97cd7cb79e21cff79`](https://ci-artifacts.rust-lang.org/rustc-builds/41d97c8a5dea2731b0e56fe97cd7cb79e21cff79/rustc-nightly-x86_64-apple-darwin.tar.xz): 161232048
   - after, [`7549dbdc09f0c4f6cc84002ac03081828054784b`](https://ci-artifacts.rust-lang.org/rustc-builds/7549dbdc09f0c4f6cc84002ac03081828054784b/rustc-nightly-x86_64-apple-darwin.tar.xz): 143256928
2) time it took on CI:
   - the [first `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8155512915/job/22291187124) took: 1h 33m
   - the [second `try` build](https://github.com/rust-lang-ci/rust/actions/runs/8157057880/job/22295839911) took: 1h 45m
3) most recent perf results on (a noisy) x64 mac are [here](rust-lang#112268 (comment)).

Related tracking issue for build configuration: rust-lang#103595
Attempted to tune the ColdFuncOpt option for PGO (llvm/llvm-project#69030) in #132779 to decrease file size, but sadly, no good results. Possible reasons: the current Linux dist uses BOLT on top of PGO, which increases size; and I forgot to apply the same optimization to libLLVM (it was applied only to rustc_driver).
With the recent-ish promotion of `aarch64-apple-darwin`: it has LTO and jemalloc, but not the 1CGU config. It's not going to be earth-shattering or anything, but I'll make some local tests and a PR for try builds soon, to see the potential improvements there.
done in #133747 -- I've updated the OP now that it has merged. |
There are several ways to speed up rustc by changing its build configuration, without changing its code: using a single codegen unit (CGU), profile-guided optimization (PGO), link-time optimization (LTO), post-link optimization (via BOLT), and a better allocator (e.g. jemalloc or mimalloc).
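Several of these knobs correspond to bootstrap settings in `config.toml`; a hedged sketch of a local build using some of them (PGO and BOLT are applied on CI via the separate `opt-dist` pipeline, not via these options):

```toml
# Partial config.toml sketch; the option names are bootstrap's, but the
# combination is illustrative rather than the exact CI configuration.
[rust]
codegen-units = 1   # single CGU for rustc
lto = "thin"        # thin LTO across rustc crates
jemalloc = true     # link jemalloc instead of the system allocator
```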
This is a tracking issue for doing these for the most popular Tier 1 platforms: Linux64 (`x86_64-unknown-linux-gnu`), Win64 (`x86_64-pc-windows-msvc`), and Mac (`x86_64-apple-darwin`, and more recently `aarch64-apple-darwin`).

Items marked with [2022] are on the Compiler performance roadmap for 2022.
Single CGU

Benefits: rustc is faster, uses less memory, has a smaller binary.
Costs: rustc takes longer to build.

- Build `rustc` with a single CGU on x64 Linux #115554, merged 2023-10-01.
- Build `rustc` with 1CGU on `x86_64-pc-windows-msvc` #112267, merged 2024-03-12.
- Build `rustc` with 1CGU on `x86_64-apple-darwin` #112268, merged 2024-03-12.
- Build `rustc` with 1 CGU on `aarch64-apple-darwin` #133747, merged 2024-12-03.

PGO
Benefits: rustc is faster.
Costs: rustc takes longer to build.
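For reference, the generic two-phase PGO cycle looks like this (plain rustc flags; CI drives this through dedicated tooling rather than by hand):

```shell
# 1. Build instrumented, 2. run a representative workload,
# 3. merge the raw profiles, 4. rebuild using the profile data.
rustc -Cprofile-generate=/tmp/pgo-data -O app.rs
./app
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
rustc -Cprofile-use=/tmp/pgo-data/merged.profdata -O app.rs
```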
Other PGO attempts:
- `libstd`: Apply PGO to libstd on CI #97038, no speed-up measured.

LTO
Benefits: rustc is faster.
Costs: rustc takes longer to build.
- `x86_64-apple-darwin` #103647 and Re-enable ThinLTO for rustc on `x86_64-apple-darwin` #105845, merged 2022-12-19.

This is all thin LTO, which gets most of the benefits of fat LTO with a much lower link-time cost.
Other LTO attempts:
- `rustdoc`: [perftest] Use LTO for compiling `rustdoc` #102885, no speed-up measured.
- `rustc`: #103453, no speed-up measured, large CI build cost.

BOLT
Benefits: rustc is faster.
Costs: rustc takes longer to build.
- `librustc_driver.so` with BOLT #116352, merged 2023-10-14.

BOLT only works on ELF binaries, and thus is Linux-only.
Instruction set
Benefits: rustc is faster?
Costs: rustc won't run on old CPUs.
Linker
Benefits: rustc (linking) is faster.
Costs: hard to get working.
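For local experimentation ahead of the default switch, one common way to opt into lld (generic flags, not the shipped nightly's plumbing):

```shell
# Ask the C compiler driver performing the final link to use lld.
RUSTFLAGS="-C link-arg=-fuse-ld=lld" cargo build --release

# On recent nightlies, the self-contained rust-lld can be selected with:
RUSTFLAGS="-Z linker-features=+lld" cargo build --release
```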
- Use `lld` by default on `x86_64-unknown-linux-gnu`: Enable `rust-lld` on nightly `x86_64-unknown-linux-gnu` #124129, merged 2024-05-17.

Better allocator
Benefits: rustc is faster.
Costs: rustc uses more memory?
Note: #92249 and #92317 tried using two different versions of mimalloc (one 1.7-based, one 2.0-based) instead of jemalloc, but the speed/memory tradeoffs in both cases were deemed inferior (the max-rss regressions expected to be fixed in the 2.x series still exist as of 2.0.6, see #103944).
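For contrast with the malloc/free override described in the note below, the usual crate-level mechanism looks like this (a trivial counting wrapper around the system allocator, purely illustrative):

```rust
// A #[global_allocator] swap only covers Rust allocations in this binary;
// rustc instead overrides malloc/free at link time, which also redirects
// allocations made from LLVM's C++ code.
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

struct Counting;

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static A: Counting = Counting;

fn main() {
    let v = vec![1, 2, 3]; // heap allocation goes through Counting
    println!("{}", v.len());                            // 3
    println!("{}", ALLOCS.load(Ordering::Relaxed) > 0); // true
}
```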
Note: we use a better allocator by simply overriding malloc/free, rather than using `#[global_allocator]`. See this Zulip thread for some discussion about the sub-optimality of this.

About tracking issues
Tracking issues are used to record the overall progress of implementation.
They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions.
A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature.
Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.