compose: add a cache for parsed files #220

linkmauve · 2021-03-21T00:44:48Z

Parsing /usr/share/X11/locale/en_US.UTF-8/Compose takes about 21% of the startup of GLFW on my PinePhone according to perf, this patch lowers it by about 33ms down to approximately 2%.

This adds an optional dependency on xxhash, its BSD two clauses license is compatible and it was the fastest non-cryptographic hash I could find.

I went for a fopen()/fread() approach in order not to change the internal structs in any way, but with some simple changes we could do a mmap() and avoid the reading cost altogether, additionally letting multiple programs share a single read-only memory page.

Signed-off-by: Emmanuel Gil Peyrot [email protected]

Parsing /usr/share/X11/locale/en_US.UTF-8/Compose takes about 21% of the startup of GLFW on my PinePhone according to perf, this patch lowers it by about 33ms down to approximately 2%. This adds an optional dependency on xxhash. Signed-off-by: Emmanuel Gil Peyrot <[email protected]>

Signed-off-by: Emmanuel Gil Peyrot <[email protected]>

bluetech · 2021-03-28T13:40:49Z

Parsing /usr/share/X11/locale/en_US.UTF-8/Compose takes about 21% of the startup of GLFW on my PinePhone according to perf, this patch lowers it by about 33ms down to approximately 2%.

Interesting. I would definitely want this to work well on PinePhone and these numbers do sound like a lot.

Some historical context: the original libX11 Compose implementation does in fact have a cache (~/.compose-cache/XCOMPOSECACHE, code is here). The libxkbcommon implementation is new, and I explicitly chose not to add a cache, for the following reasons:

Complicates the code
Cache invalidation trouble
Filesystem interaction trouble
Still doesn't help for the first run

In that spirit, I'd like to first see if we can make the Compose parsing itself fast enough so that a cache is not necessary.

I had a quick look now and committed 7d84809 which brings down the timing from 8ms to 5.3ms on my old laptop, according to bench/compose. I think we can optimize it more.

I'd be interested to see the perf output from the phone, as well as the absolute timings. Maybe something is particularly slow there.

linkmauve · 2021-03-28T14:29:35Z

Hi, and thanks for your interest! I just profiled again, before any optimisation (on 95e2907) xkb_compose_table_new_from_locale() used to take 38~43ms (I didn’t change anything in the scheduler or frequency scaling of the phone, hence the quite big variance), after them (7d84809) 35~39ms, and with the cache from this PR ~3.4ms.

I’ve attached the perf and flamegraph of three of these runs: perf+flamegraph.tar.gz

It doesn’t seem like the case-sensitive optimisation had such a big effect on AArch64, I’ll have a look whether glibc has the same optimisation there as on x86, since that’s a bit weird there was such a big change on your laptop and almost none on my phone.

As for the issues with the cache, we might have a look how other libraries such as Mesa are handling it, I’d trust them on that. :)

Note that xkb_keymap_new_from_string() also takes 15~18ms and would be my next target for caching.

Edit: it seems glibc has no asm whatsoever for strcmp, it’s all relying on the compiler doing a good job at autovectorisation.

bluetech · 2021-03-29T22:56:09Z

xkb_compose_table_new_from_locale() used to take 38~43ms (I didn’t change anything in the scheduler or frequency scaling of the phone, hence the quite big variance), after them (7d84809) 35~39ms, and with the cache from this PR ~3.4ms.

Hmm pretty interesting that it had no effect.

Unfortunately when I try to perf report -f the perf files, it's missing all the symbols, maybe it needs the original binaries, not sure how that works. But I can't quite see the details there.

I refreshed my memory on the code a bit more. Basically, at least on my machines, the time is mostly split up between these 3 parts:

Lexing -- I don't have any simple ideas on how to optimize that at the moment.
Keysym name to keysym value mapping -- the commit in master was a first shot but apparently not enough. Currently it is using a binary search which does a lot of strcmps. I changed it to use a perfect hash in this branch: https://github.com/bluetech/libxkbcommon/commits/perfect-hash (last commit, not polished but sufficient as a POC). This does just a hash+one strcmp.
Parsing and constructing the data structure -- currently using a simple trie with linear search for each chain. The chains can get pretty long. I changed it to use a ternary search tree here: https://github.com/bluetech/libxkbcommon/commits/ternary-tree. Also unfinished but good enough for testing.

I'd be curious to know how it does with these two patches applied. (Also I assume you're using an optimized build, maybe also LTO).

Note that xkb_keymap_new_from_string() also takes 15~18ms and would be my next target for caching.

Amusingly it was the same thing for the keymaps -- the original xkbcomp has a cache, we got rid of it.

Edit: it seems glibc has no asm whatsoever for strcmp, it’s all relying on the compiler doing a good job at autovectorisation.

glibc is rather unreadable but the ASM for arm64 is found here:

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/aarch64/strcmp.S;h=f225d718f4bef0781896d7331b3c4e6667df135c;hb=HEAD

If for some reason it's not used that can explain some of the slowness.

linkmauve · 2021-04-02T17:45:54Z

I’ve just tested your latest improvements from master on my phone, they bring it down from ~34ms to ~19ms, nice!

As for perf not giving you relevant function names, do you want me to send you the DWARF data from my built library?

I checked and the strcmp() I’m running is indeed from that one AArch64 source, so no worries on that end (that’d have been surprising!).

bluetech · 2021-04-02T19:26:06Z

I’ve just tested your latest improvements from master on my phone, they bring it down from ~34ms to ~19ms, nice!

For me it's down to 2.8ms so I hoped it would be better, but still not bad. Thanks for testing again.

As for perf not giving you relevant function names, do you want me to send you the DWARF data from my built library?

Sure, maybe that will do the trick. Still curious what is slowing things down - cache misses, branch mispredictions, memory fetches, or just plain CPU time (if I'm reading the right place, the PinePhone has a 1.2GHz CPU). So BTW the output of perf stat -ddd ./build/bench-compose would also be good for comparison.

If possible, maybe also a callgrind output file? (Don't know it that's portable, hopefully yes).

I checked and the strcmp() I’m running is indeed from that one AArch64 source, so no worries on that end (that’d have been surprising!).

Great (even if now strcmp is not called as much...)

linkmauve force-pushed the compose-cache branch 8 times, most recently from fcf7de4 to 564a25f Compare March 21, 2021 01:32

linkmauve added 2 commits March 21, 2021 12:05

ci: install or disable Compose cache

8c96caf

Signed-off-by: Emmanuel Gil Peyrot <[email protected]>

linkmauve force-pushed the compose-cache branch from 564a25f to 8c96caf Compare March 21, 2021 11:05

linkmauve mentioned this pull request Mar 21, 2021

Cache mappings on disk glfw/glfw#1868

Closed

wismill added enhancement Indicates new feature requests compose Indicates a need for improvements or additions to Compose handling labels May 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

compose: add a cache for parsed files #220

compose: add a cache for parsed files #220

linkmauve commented Mar 21, 2021 •

edited

Loading

bluetech commented Mar 28, 2021

linkmauve commented Mar 28, 2021 •

edited

Loading

bluetech commented Mar 29, 2021

linkmauve commented Apr 2, 2021 •

edited

Loading

bluetech commented Apr 2, 2021

compose: add a cache for parsed files #220

Are you sure you want to change the base?

compose: add a cache for parsed files #220

Conversation

linkmauve commented Mar 21, 2021 • edited Loading

bluetech commented Mar 28, 2021

linkmauve commented Mar 28, 2021 • edited Loading

bluetech commented Mar 29, 2021

linkmauve commented Apr 2, 2021 • edited Loading

bluetech commented Apr 2, 2021

linkmauve commented Mar 21, 2021 •

edited

Loading

linkmauve commented Mar 28, 2021 •

edited

Loading

linkmauve commented Apr 2, 2021 •

edited

Loading