Copyright (C) 2022 Bryan A. Jones.
This file is part of the CodeChat Editor.
The CodeChat Editor is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
The CodeChat Editor is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with the CodeChat Editor. If not, see https://www.gnu.org/licenses/.
This walkthrough shows how the lexer parses the following Python code fragment:
print("""¶
# This is not a comment! It's a multi-line string.¶
""")¶
# This is a comment.
Paragraph marks (the ¶ character) are included to show how the lexer handles
newlines. To explain the operation of the lexer, the code is highlighted in
yellow to mark the unlexed source code, which is the contents of
source_code[source_code_unlexed_index..], and in green to mark the current code
block, which is source_code[current_code_block_index..source_code_unlexed_index].
Code that is classified by the lexer is placed in the classified_code array.
The unlexed source code holds all the code (everything is highlighted in yellow); the current code block is empty (there is no green highlight).
print("""¶
# This is not a comment! It's a multi-line string.¶
""")¶
# This is a comment.
classified_code = [
]
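As a rough illustration of this starting state, the two highlighted regions can be written as slices. This is a Python sketch, not the lexer's actual code: the variable names come from the walkthrough, and the range source_code[a..b] is written source_code[a:b].

```python
# The code fragment from the walkthrough (each ¶ is a newline).
source_code = (
    'print("""\n'
    "# This is not a comment! It's a multi-line string.\n"
    '""")\n'
    "# This is a comment."
)

# Both indexes start at 0 and nothing has been classified yet.
current_code_block_index = 0
source_code_unlexed_index = 0
classified_code = []

# Yellow region: everything not yet examined by the lexer.
unlexed = source_code[source_code_unlexed_index:]
# Green region: code examined but not yet placed in classified_code.
current_code_block = source_code[current_code_block_index:source_code_unlexed_index]
```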
The lexer begins by searching for the regex in
language_lexer_compiled.next_token, which is (\#)|(""")|(''')|(")|('). The first
token found is """. Everything up to the match (here, the text print( that
precedes it) is moved from the unlexed source code to the current code block,
giving:
print("""¶
# This is not a comment! It's a multi-line string.¶
""")¶
# This is a comment.
classified_code = [
]
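The search step above can be approximated with Python's re module. This is a sketch under stated assumptions: the pattern is the next_token regex quoted in the walkthrough, but the use of Python and the choice to advance source_code_unlexed_index to the start of the match are illustrative rather than taken from the lexer's source.

```python
import re

source_code = (
    'print("""\n'
    "# This is not a comment! It's a multi-line string.\n"
    '""")\n'
    "# This is a comment."
)

# The next_token regex from the walkthrough: (\#)|(""")|(''')|(")|(')
next_token = re.compile(r"(\#)|(\"\"\")|(''')|(\")|(')")

current_code_block_index = 0
source_code_unlexed_index = 0

# Find the first token at or after the unlexed index.
token = next_token.search(source_code, source_code_unlexed_index)

# Everything before the match joins the current code block.
source_code_unlexed_index = token.start()
current_code_block = source_code[current_code_block_index:source_code_unlexed_index]
unlexed = source_code[source_code_unlexed_index:]
```

Because the alternatives are tried in order at each position, the triple-quote group is tested before the single-quote group, so the opening of a multi-line string is not mistaken for an ordinary string delimiter.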
The regex is accompanied by a map named language_lexer_compiled.map, which
records, for each capture group, the type of token that group matches (see
struct RegexDelimType):
Regex:    (#)            | (""")  | (''')  | (")    | (')
Mapping:  Inline comment   String   String   String   String
Group:    1                2        3        4        5
Since group 2 matched, looking up this group in the map tells the lexer that the token opens a string, and also supplies a regex which identifies the end of that string. The lexer uses this regex to find the end of the string, then moves the entire string from the unlexed source code to the current code block. This correctly skips the text that looks like a comment but is actually part of the string. After this step, the lexer’s state is:
print("""¶
# This is not a comment! It's a multi-line string.¶
""")¶
# This is a comment.
classified_code = [
]
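The map lookup and the end-of-string search might look like the following Python sketch. The token_map dictionary and its closing regexes are hypothetical stand-ins for language_lexer_compiled.map; in particular, the closing regexes here ignore escape sequences, which a real string lexer would need to handle.

```python
import re

source_code = (
    'print("""\n'
    "# This is not a comment! It's a multi-line string.\n"
    '""")\n'
    "# This is a comment."
)

next_token = re.compile(r"(\#)|(\"\"\")|(''')|(\")|(')")

# Hypothetical map: capture group number -> (token kind, closing regex).
# The closing regexes are simplified assumptions (no escape handling).
token_map = {
    1: ("inline comment", None),
    2: ("string", re.compile(r'"""')),
    3: ("string", re.compile(r"'''")),
    4: ("string", re.compile(r'"')),
    5: ("string", re.compile(r"'")),
}

opening = next_token.search(source_code)
# lastindex is the number of the group that matched.
kind, closing_regex = token_map[opening.lastindex]

# Search for the end of the string, starting just past the opening delimiter.
closing = closing_regex.search(source_code, opening.end())

# The whole string, delimiters included, joins the current code block.
source_code_unlexed_index = closing.end()
current_code_block = source_code[0:source_code_unlexed_index]
unlexed = source_code[source_code_unlexed_index:]
```

Searching only for the closing delimiter is what lets the # inside the string pass through unexamined.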
Now, the lexer is back to its state of looking through code (as opposed to
looking inside a string, comment, etc.). It uses the next_token regex as before
to identify the next token, #, and moves all the preceding characters from the
unlexed source code to the current code block. The lexer state is now:
print("""¶
# This is not a comment! It's a multi-line string.¶
""")¶
# This is a comment.
classified_code = [
]
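Resuming the search from the unlexed index, rather than from the start of the file, is what keeps the lexer from rediscovering the string. A short Python sketch of this step follows; locating the end of the string with rindex is a shortcut used only to keep the example self-contained.

```python
import re

source_code = (
    'print("""\n'
    "# This is not a comment! It's a multi-line string.\n"
    '""")\n'
    "# This is a comment."
)

next_token = re.compile(r"(\#)|(\"\"\")|(''')|(\")|(')")

current_code_block_index = 0
# Shortcut for this example: the position just past the closing """.
source_code_unlexed_index = source_code.rindex('"""') + 3

# Search only the unlexed source code, so the # inside the string is never seen.
token = next_token.search(source_code, source_code_unlexed_index)

source_code_unlexed_index = token.start()
current_code_block = source_code[current_code_block_index:source_code_unlexed_index]
unlexed = source_code[source_code_unlexed_index:]
```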
Based on the map, the lexer identifies this as an inline comment. The inline
comment lexer first identifies the end of the comment (the next newline or, as
in this case, the end of the file), putting the entire inline comment except
for the comment opening delimiter # into full_comment. It then splits the
current code block into two groups: code_lines_before_comment (the lines in the
current code block which come before the current line) and
comment_line_prefix (the current line up to the start of the comment). The
classification is:
print("""¶
# This is not a comment! It's a multi-line string.¶
""")¶
# This is a comment.
classified_code = [
]
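The split described above can be sketched in Python with plain string searches. Two assumptions are made here: the final comment is given one leading space so that comment_line_prefix is non-empty, and a trailing newline, when present, is treated as part of the comment.

```python
# Assumption: one space of indentation before the final comment.
source_code = (
    'print("""\n'
    "# This is not a comment! It's a multi-line string.\n"
    '""")\n'
    " # This is a comment."
)

delimiter = "#"
current_code_block_index = 0
# Position of the comment delimiter found by next_token (the last # in the file).
comment_start = source_code.rindex(delimiter)

# The comment ends at the next newline or, failing that, at the end of the file.
newline = source_code.find("\n", comment_start)
comment_end = len(source_code) if newline == -1 else newline + 1

# Everything in the comment except the opening delimiter.
full_comment = source_code[comment_start + len(delimiter):comment_end]

# Split the current code block at its last newline.
current_code_block = source_code[current_code_block_index:comment_start]
split = current_code_block.rfind("\n") + 1  # 0 when there is no newline
code_lines_before_comment = current_code_block[:split]
comment_line_prefix = current_code_block[split:]
```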
Because comment_line_prefix contains only whitespace and full_comment begins
with a space (that is, a space follows the comment delimiter), the lexer
classifies this as a doc block. It adds code_lines_before_comment as a code
block, then the text of the comment as a doc block:
classified_code = [
  Item 0 = CodeDocBlock {
    indent: "", delimiter: "", contents = "print("""¶
# This is not a comment! It's a multi-line string.¶
""")¶
"},
  Item 1 = CodeDocBlock {
    indent: " ", delimiter: "#", contents = "This is a comment"
  },
]
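The doc block test and the two resulting entries can be expressed as a small Python function. This is a sketch of the rule as stated in the walkthrough only: the CodeDocBlock dataclass mirrors the three fields shown above, and classify_inline_comment is a hypothetical helper name, not a function from the lexer.

```python
from dataclasses import dataclass


@dataclass
class CodeDocBlock:
    indent: str
    delimiter: str
    contents: str


def classify_inline_comment(code_lines_before_comment, comment_line_prefix,
                            delimiter, full_comment):
    """Return the blocks to append, or [] when the comment is not a doc block."""
    # Doc block rule: only whitespace before the comment, and a space after
    # the delimiter.
    is_doc_block = (comment_line_prefix.strip() == ""
                    and full_comment.startswith(" "))
    if not is_doc_block:
        return []
    blocks = []
    # A code block is represented with an empty indent and delimiter.
    if code_lines_before_comment:
        blocks.append(CodeDocBlock("", "", code_lines_before_comment))
    # Drop the single space that follows the delimiter.
    blocks.append(CodeDocBlock(comment_line_prefix, delimiter, full_comment[1:]))
    return blocks


blocks = classify_inline_comment("print(1)\n", " ", "#", " This is a comment.")
not_doc = classify_inline_comment("", "x = 1 ", "#", " a trailing comment")
```

A comment that follows code on the same line, such as the second call, fails the whitespace test and so stays part of the surrounding code.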
After this, the unlexed source code is empty, since classifying the inline
comment moved the rest of the unlexed source code into classified_code. The
function exits.
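Putting the steps together, the whole walkthrough can be approximated by one loop. The Python sketch below is a simplification under several assumptions: strings are closed by a plain search for the same delimiter (no escape handling), an unterminated string runs to the end of the file, blocks are stored as (indent, delimiter, contents) tuples, and any code left over at the end is appended as a final code block. None of these details are confirmed by the walkthrough.

```python
import re

NEXT_TOKEN = re.compile(r"(\#)|(\"\"\")|(''')|(\")|(')")
# Simplified closing delimiters for string groups 2 to 5 (assumption).
CLOSERS = {2: '"""', 3: "'''", 4: '"', 5: "'"}


def lex(source_code):
    classified_code = []  # (indent, delimiter, contents) tuples
    current_code_block_index = 0
    source_code_unlexed_index = 0
    while True:
        token = NEXT_TOKEN.search(source_code, source_code_unlexed_index)
        if token is None:
            break
        if token.lastindex != 1:
            # String: skip past its closing delimiter.
            closer = CLOSERS[token.lastindex]
            end = source_code.find(closer, token.end())
            source_code_unlexed_index = (
                len(source_code) if end == -1 else end + len(closer))
            continue
        # Inline comment: it ends at the next newline or at end of file.
        newline = source_code.find("\n", token.end())
        comment_end = len(source_code) if newline == -1 else newline + 1
        full_comment = source_code[token.end():comment_end]
        block = source_code[current_code_block_index:token.start()]
        split = block.rfind("\n") + 1
        before, prefix = block[:split], block[split:]
        if prefix.strip() == "" and full_comment.startswith(" "):
            if before:
                classified_code.append(("", "", before))
            classified_code.append((prefix, "#", full_comment[1:]))
            current_code_block_index = comment_end
        # Otherwise the comment simply stays in the current code block.
        source_code_unlexed_index = comment_end
    rest = source_code[current_code_block_index:]
    if rest:
        classified_code.append(("", "", rest))
    return classified_code


sample = (
    'print("""\n'
    "# This is not a comment! It's a multi-line string.\n"
    '""")\n'
    " # This is a comment."  # one leading space is assumed here
)
result = lex(sample)
```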