Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up identifier detection #89

Open
wants to merge 32 commits into
base: main
Choose a base branch
from
Open

Conversation

jserv
Copy link
Contributor

@jserv jserv commented Nov 16, 2024

Both 'is_ident1' and 'is_ident2' are now macros instead of function calls, and they are tweaked for ASCII detection in advance with the fallback to table lookup for non-ASCII characters.

The original code in function 'read_punct' relies on heavy string
specific function calls, resulting in slower execution. Instead, this
function can be faster using straightforward control flow.
@fuhsnn
Copy link
Owner

fuhsnn commented Nov 16, 2024

Thank you for the patches. Honestly performance hasn't been a priority, I consider the lexer less significant in the total pipeline compared to cache-misses from leaking and the overhead of external assembler. I do have some ideas as well, so a branch for performance experiments should make sense.

In terms of performance I want to ensure a common basis to work on: which projects to benchmark and what kind of environment, is it self-hosted, another simple C compiler or full-on gcc/clang optimizations with custom allocator etc. Do you have a preference?

@fuhsnn
Copy link
Owner

fuhsnn commented Nov 16, 2024

On the pull request: I think the short-circuiting of ASCII characters can be done earlier, skipping decode_utf8() altogether.

@jserv
Copy link
Contributor Author

jserv commented Nov 16, 2024

In terms of performance I want to ensure a common basis to work on: which projects to benchmark and what kind of environment, is it self-hosted, another simple C compiler or full-on gcc/clang optimizations with custom allocator etc.

Using uftrace, I identified the performance bottlenecks in the compilation process.

Here's what I found:
First, apply the following changes:

--- a/GNUmakefile
+++ b/GNUmakefile
@@ -1,4 +1,4 @@
-CFLAGS=-std=c99 -g -fno-common -Wall -pedantic -Wno-switch
+CFLAGS=-std=c99 -g -pg -fno-common -Wall -pedantic -Wno-switch
 
 SRCS=$(wildcard *.c)
 OBJS=$(SRCS:.c=.o)

After running uftrace ./slimcc -c -o stage2/parse.o parse.c, the trace showed the following call stack:

Copy[6] is_ident2
[5] read_ident
[4] tokenize
[3] tokenize_file
[2] must_tokenize_file
[1] cc1
[0] main

Based on this analysis, I tweaked the is_ident1 and is_ident2 functions.

@jserv
Copy link
Contributor Author

jserv commented Nov 16, 2024

By the way, I am working with my students to enhance the shecc project, a self-compiling C compiler with some optimizations. Since shecc only implements a subset of C, we are planning to modify slimcc to output compatible C code that shecc can process, rather than having slimcc generate x86-64 assembly directly. This way, slimcc would serve as a full C language frontend for shecc, while allowing shecc to focus on optimization and backend support. To achieve this, we plan to contribute improvements to slimcc's parsing, tokenization, and preprocessing components.

@fuhsnn
Copy link
Owner

fuhsnn commented Nov 16, 2024

I'm thrilled to know that the project has some real world value and am happy to target different backends as long as the IR is stable, but I imagine memory usage will be a major pain point on embedded targets that shecc currently self-hosts on. I thought about re-architecting, but then the code wouldn't be as accessible and one might as well work on other more optimized C frontend like tcc, cproc, tyfkda/xcc, libfirm/cparser etc.

If C++ is allowed in your classroom, I do have an unfinished C++ port sitting around that use RAII for nearly identical code with much better memory consumption, a minimal C++ front-end bootstrappable with C99 is also something I've always wanted to work on.

@fuhsnn
Copy link
Owner

fuhsnn commented Nov 16, 2024

The C++ version is uploaded in partial_c++_port branch, it's less C++ than I remembered and still able to build as C with just a little macro and typedef's.

If that particular style is okay to you, I'd prefer basing performance optimizations on that version, since it feels a bit bike shedding to optimize parsing while compiling larger projects hit OOM on an embedded device.

@jserv
Copy link
Contributor Author

jserv commented Nov 22, 2024

I defer this pull request to @ChAoSUnItY, who is assigned to proceed with the above research prototype under my supervision.

fuhsnn and others added 7 commits November 22, 2024 16:51
Previously, HashMap relied on division operations, which are expensive.
This update replaces divisions with logical AND operations using an
additional mask, improving performance.
Avoid expensive division in HashMap
@fuhsnn
Copy link
Owner

fuhsnn commented Nov 23, 2024

I believe identifier detection is of decent speed after efcbb85, feel free to open new issues regarding the C backend.

Btw thanks for mentioning uftrace, logging exact call counts is incredible. Can't believe it's not more well known.

jserv and others added 2 commits November 23, 2024 14:26
In 'preprocess2' function, 'expand_macro' was called multiple times,
some of which were unnecessary. This change ensures that 'expand_macro'
is called only when the preprocessor token is an identifier.
Expand macro only when preprocessor token is identifier
@ChAoSUnItY
Copy link

I've adopted identifier reading logic from efcbb85 and further refine it with the macro functions introduced previously in this PR, and also applied fool-proofing logic to the function. This way, it improves readability.

@jserv
Copy link
Contributor Author

jserv commented Nov 28, 2024

Let's rebase the latest main branch and squash the commits.

Both 'is_ident1' and 'is_ident2' are now macros instead of function
calls, and they are tweaked for ASCII detection in advance with the
fallback to table lookup for non-ASCII characters.

Also discard unnecessary variable and add safe guard:
- Variable is_first can be replaced with boolean expression "p == start"
- Add safe guard for ascii character checking to ensure starting identifier
  character must not be numberic

Additionally, replace first ascii character check with macro
is_ident2_ascii to keep readability. The later is_ident2 function call
is replaced with is_ident2_non_ascii because the expanded macro
function will result in multiple decode_utf8 function call, also
it's redundant to check if it's an ascii character or not.

Co-authored-by:  Jim Huang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants