-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable general parse-time code execution, to facilitate writing a bl:mp parser in bl:mp #14
Comments
The new protocol encodes terminals differently: as objects that send no sub-trees and which return the terminal symbol, rather than as the terminal symbol itself. As a result, empty non-terminals (technically not supported but used as a convention in macro productions, where only the grammar symbol of the parse tree is relevant) are represented differently as well, as objects with a single sub-tree, the empty terminal {``}. In addition, the mechanism for interpeting the production of a macro definition is changed. Instead of taking just the top-level grammar symbols of the production (which is all that is relevant to the macro definition itself) using an ad hoc visitor, we use the normal ObjectToParseTree conversion, which deeply evaluates its argument, and _then_ take only the top-level grammar symbols from the evaluated parse tree. This is an important semantic distinction when the production has side effects. These two changes represent the first two steps of the roadmap to completing #14.
Allow parse tree symbols to be arbitrary objects. Implement untether expressions, which evaluate their arguments to object literals at parse time. This is step 3 of the roadmap towards completing #14.
This ensures that the std prelude is fully evaluated when any part of it is used in application code, which means we don't need to worry about whether std has been evaluated in the expansion of `syntax`, and can do away with __std_imported. This is step 5 of the roadmap towards #14.
Like all other built-in expressions (except !) macros only execute when the expression containing them executes. A macro definition can be wrapped in !() to execute it immediately and make the new syntax available for parsing further down in the file. This is step 4 towards #14.
Decided not to include step 6 (check non-terminals in All that's left to do, then, is step 7:
and then this issue can be closed. |
With the addition of ! giving full control over execution order to the programmer, there is no longer a need for exeuction to be a built-in step in interpreting a bl:mp program. Now, the semantics of a bl:mp program are defined as the semantics of parsing the program, and the interpreter (in non-interactive) mode no longer executes the parsed expression. The std prelude is modified such that if it is prepended to a file, it will automatically execute the contents of that file when it is parsed. The interpreter is extended with a --preload option to prepend files in this way, and the std prelude is implicitly prepended if --core is not passed. This is the first part of the final step towards #14 (the other part being the analagous change for the test runner).
Untethering
Lighter-than-air expressions for parse-time code execution
Design Principles
Motivation
Macros are currently used for two purposes: adding rules to the grammar, and executing code (including macro definitions themselves) at parse time. This behavior is confusing and arbitrary, requiring strange idioms like enclosing parse-time macro definitions in
{}
to prevent them from executing at run-time. Further, it stands in direct opposition to the principles of orthogonality (since one feature, macros, has been hacked to accomplish two unrelated goals) and extensibility (since it is hard to build abstractions that behave like macros, due to how magical macros are).This proposal aims to decouple parse-time execution from the ability to extend the grammar. Providing a more general parse-time execution mechanism also makes it easier to build abstractions that behave like macros, since it allows macro handlers to execute their own code that, among other things, defines more macros.
Goals
lazy_static!
from Rust) that can themselves be used to execute code at parse time.bl:mp
parser inbl:mp
(note that because of macros and parse-time execution,bl:mp
parsers need to interpretbl:mp
code)std
prelude as soon as it is parsed, before the module importing it is parsed. This would remove the need for the__std_imported
check in the expansion of thesyntax
macro, because we are guaranteed thatstd
is already imported and evaluated when asyntax
macro expands.Syntax and Semantics
This proposal makes
bl:mp
expressions lighter-than-air, so that they take flight (execute) as soon as they are untethered. It adds a new built-in untether expression, notated!<expr>
, which causes its sub-expression to execute as soon as the untether expression is parsed. Each time the parser produces a parse tree (including when it is the result of a macro handler), any untethered expressions within the parse tree spontaneously take flight and become object literal expressions.Untether expressions
Semantically (with reference to a hypothetical semantic model for the parser which, in its current form, does not exist) untether expressions are evaluated in a new predicate
parse-tree
with the signatureThat is, the
parse-tree
predicate acts on the machine state (the heapH
andglobal scope
r
) to construct new parse trees (T
) from a non-terminal symbolnt
and a sequence of sub-treesTs
, according to the rules:The
parse-tree
predicate is used to construct all parse trees produced by theparses-to
predicate which, again, is not currently specified. This includes parse trees generated by normal reductions, parse trees generated to pass in to macro handlers, and parse trees returned by macro handlers.Note that the
parse-tree
predicate considers a parse tree untethered only if the non-terminal used to construct that parse tree is the non-terminal of built-in expressions. This allows the user to produce tethered! <expr>
parse trees in their macro expansions by giving the parse trees a different non-terminal. They can also produce any expression that they would have been able to produce before in the final, executed, parse tree, including sending an expression to the symbol!
, since such a send would eventually be spelled< ! <expr>
, not! <expr>
. These two properties preserve grammar modularity.Parse tree to expression conversion
Since
!
-evaluation depends on the non-terminal of the parse tree being evaluated, the parse-tree-to-expression conversion should do so as well, for consistency. Currently, any parse tree of the right shape that ends up in the final, evaluated parse tree can be converted to an expression and evaluated. This should be changed so that only parse trees with the built-in non-terminal can be evaluated; any other parse tree would produce aBLIMP_INVALID_PARSE_TREE
error. This should not pose a burden, as the user must already control which parse trees end up in the final tree in order to ensure they have the right shape; additionally ensuring that they have the right non-terminal should be no harder.Parse tree protocol
The presence of untether expressions in turn changes the structure of a parse tree. Previously, a parse tree consisted of either a non-terminal symbol and a sequence of sub-trees, or a terminal symbol (represented by a
Symbol
inbl:mp
). Now, the grammar symbol associated with a parse tree may be any valuev
, including symbol objects as well as block objects, since the use of!
can lead to parse trees where the symbol is the evaluation of an arbitrary expression. This requires changes in the parse tree protocol used to represent parse trees as objects.The grammar symbol associated with a parse tree may now be any object, so it is no longer possible to distinguish between terminal and non-terminal parse tree objects by checking if the object is a symbol. Instead, the parse tree protocol is changed so that terminals are represented by parse trees with no sub-trees, and the result value of the parse tree object's method handler is the terminal object (or symbol).
Even apart from the rest of this proposal, this is an improvement over the previous parse tree protocol. It is conceptually simpler and more consistent, since the result value of a parse tree object is always the grammar symbol (whether terminal or not), and since a terminal parse tree really is just a parse tree with no sub-trees (i.e. a leaf).
However, this change does make the parse trees for
s
(wheres
is a terminal, previouslys
and now{s}
) ands -> ε
(wheres
is a non-terminal, previously{s}
) ambiguous. While nullable productions likes -> ε
are not permitted as grammar rules, it is idiomatic to use the parse tree object{s}
as an example of the non-terminals
as a child of a macro production. For example, we would define a new grammar ruleN -> s
(wheres
is a non-terminal) with an object like{{s}; N}
. Furthermore, support for nullable productions in general is planned, so we must have a way to distinguish betweens
ands -> ε
, both in the production of a macro definition and in a macro expansion when we convert it to an expression.In this proposal, we adopt for
bl:mp
the same notational convention as traditional grammar notations (e.g. BNF) and introduce a pseudo-terminal ε which represents the empty list of symbols. To avoid choosing an arbitrary name for this token (built-inbl:mp
constructs tend to avoid alphabetical names) we use the empty token``
, which makes some sense because this token does literally match nothing. Sos -> ε
would be represented in the parse tree protocol by{^``; s}
, while the parse tree for the terminals
would be represented by the object{s}
.This solution may also pay dividends when it is time to implement nullable productions, because it may be possible to think of ε as an actual terminal which matches no input. Adaptive lexing should be able to support this. Nullable productions could thus be implemented in the parser as well as in
bl:mp
code by a production with a single terminal ε, which would preserve the invariant that all productions contain at least one symbol, which the parser currently relies on and for which there is not yet an alternative plan. However, this idea requires more investigation in the empty productions proposal. For example, does this implementation really work with the LALR algorithm? And, is it allowed to mix ε freely with other terminals in a production? For now, it should at least be apparent that this change solves the problems presented here while making it no more difficult to implement nullable productions later, because we can trivially reduce to the current parsing algorithm by internally removing all empty terminals from productions.Macros
Macros are no longer evaluated by the parser at parse-time (unless inside a
!
expression, of course). Instead, they only have the run-time behavior that is currently implemented in the interpreter.In addition, the semantics of macros are changed so that the production of a macro is deeply evaluated as a parse tree, in exactly the same way that the parse tree object returned by a macro is converted to a parse tree. Currently, the production is only evaluated one level deep, since we only need to inspect the top-level grammar symbols to generate the new grammar rule. This change is motivated by the fact that the conversion from object to parse tree (the
parse-tree
predicate) now has very obvious side-effects in the form of!
evaluations, and these side-effects should be handled consistently everywhere the parse tree protocol is used. The truth is, though, that this conversion has always had side-effects, since it involves sending a message to the parse tree object. By convention, the parse tree object's method handler usually has no side-effects, but this is not required. Thus, this is more of a regression fix than a fundamental change.No more run-time
Untether expressions give the user full control over when code is evaluated, which means we no longer need separate parsing and interpreting steps. The semantics of an entire
bl:mp
program are changed so that the result of evaluating abl:mp
program is not a value plus side-effects; instead it is simply the side-effects of parsing the program. It falls to the user to wrap the top-level parse tree with an untether operator in order to force the program to execute.In a sense, we are not adding an entirely new feature with this proposal. We are taking an existing feature (the ability to execute
bl:mp
code) and making it more general, extensible, and explicit by giving it first-class syntax.Prelude
For convenience, since
bl:mp
programs are no longer executed after parsing, the bootstrap prelude is modified and extended to wrap the top-level expression in an untether, so that thebl:mp
program is actually executed. This ensures that the user does not have to think about manually executing their program whenever the bootstrap prelude is implicitly included (i.e. the default interpreter mode).The bootstrap prelude should also add a parse-time macro expression to the core language, which expands to a run-time macro wrapped in a
!
.An untether is used to evaluate the
std
prelude when it is imported, which ensures that it will always be fully imported whenever it is fully parsed. In other words, the expansion of standard macros can assume thatstd
is imported, and do away with the__std_imported
complexity that is currently inmacro.bli
.Technical Details
component:parser
The new parse tree structure no longer requires a union to represent different structures for terminals and non-terminals. (In a way, the removal of a union from the core interpreter is a good argument for the improved simplicity of the new parse tree protocol). We should now be able to represent the
ParseTree
data type with a unified structure, containing:The GrammarSymbol is necessary only for reparsing, so eventually it should be removed, leaving the
ParseTree
struct very simple indeed.This allows us to implement the semantics of the
parse-tree
predicate from the formal outline above quite simply using a new function likeThis function simply checks if
symbol == BUILT_IN_NON_TERMINAL && children == {"!", _}
. If so, it evaluates the inner expression to an object, and generates an object literal parse tree, which is a terminal parse tree whose symbol Object is the evaluated Object and which has no children.As we do this refactor, the parser will be updated to handle non-Symbol objects as the symbol of a terminal parse tree (these correspond to object literal expressions). However, we will run into the question of what to do if the symbol of a non-terminal parse tree is a non-Symbol object. Such a parse tree can only be created via a macro expansion, and macro expansion parse trees are generally opaque except for the cases of reparsing and conversion to expressions. For now, it is fine to error in reparsing if we encounter a non-Symbol non-terminal symbol, since reparsing is being removed anyways. In
ParseTree_Eval
, it should always be an error if we encounter a non-Symbol non-terminal, because any non-terminal parse trees that get converted to expressions are required to have the built-in non-terminal symbol.There is an open question as to how parse trees containing non-Symbol object terminals should behave when used as the production of a macro definition. There is no obvious answer because the current parser requires that terminals be symbols. There are a few ways we could go here:
!
expressions that evaluate to the same object literal. This is probably the most elegant behavior, because it makes terminal symbols truly polymorphic: both symbols and non-symbol objects behave in analagous ways. However, this might be quite tricky to implement in the parser.Eventually, we should settle on 2 or 3. For now, though, 1 is acceptable, because this is very a niche use case that is neither common nor required for moving forward. It is okay to address this in future work.
component:compiler
The compiler must be extended to compile object literal expressions. Fortunately, the target language already has an
OBJI
instruction, so this change is easy, and further downstream changes are not required.component:interpreter
To conform to the new parse-time-only semantics, the interpeter driver must be modified so that when executing non-interactively, it does not execute the program after parsing; it simply executes the side-effects of parsing. For interactive evaluation, expressions are still evaluated as they are entered and parsed, for convenience.
In addition, a
--prepend
option is added, which prepends the given file to the input stream which is parsed. It is important that this happens before parsing, so that a bootstrap prelude that ends in!
can be prepended to a normal.bli
file to generate a parse tree which is executed after it is parsed. Preset options are also added for common configurations of this option; for example,--prepend bootstrap.bli
may be implied if no other options are given, and--raw
(similar to--core
) may be used to omit the implicit bootstrap prelude.component:tests
The test runner is modified to prepend
bootstrap.bli
(in a manner similar to the driver) and to time the parsing of the test files rather than the execution of them. In fact, the tests should not be executed at all; the entire running process can consist of parsing the test.Alternatives
Implement a
bl:mp
interpreter inbl:mp
.This is necessary to allow macros that define macros without reparsing, because
syntax
macro, which expands to a parse tree for a primitive macro definition) have no way to achieve side-effects (such as expanding their own macros) since they are neither reparsed nor interpreted.Implementing a
bl:mp
interpreter would solve either problem (by manually parsing the parse tree returned from a macro handler and interpreting any macro definitions therein, or by evaluating input parse trees for use in a run-time macro definition within a macro handler) and it would also enable the implementation of an untether macro that executes the sub-parse-tree of the untether expression. However, this method carries considerable performance cost, as abl:mp
interpreter implemented inbl:mp
is never going to be nearly as efficient as the optimizing C compiler/interpreter. Performance may not be a huge concern for the first use case, as macro handlers are usually small, and are executed once and done. But code executed statically with the untether operator may be quite large (for example, the entire prelude, or a static typechecking system). Moreover, this approach is just extremely complicated and ugly. Finally, it still relies on macros being executable at both parse-time and run-time in order to execute any code at all at parse time (even the interpreter which then executes the rest of the code).Make expressions untethered by default, and add a built-in "tether" operator
/<expr>
.This is arguably a bit more elegant, because it doesn't require the addition of a purely semantic
parse-tree
step before the construction of new parse trees to explain why untethered expressions within a parse tree are evaluated immediately. Immediate evaluation is simply the natural behavior ofbl:mp
expressions, and it is the tethers (the things manually inserted by the user) that halt or control this natural process.However, this approach has some serious technical problems. First of all, parse trees are supposed to take flight from the bottom up, since parse trees are generated from the bottom up (you can't construct a new parse tree until you have constructed its children). But if tethers are expressions wrapping lighter-than-air expressions, then the sub-expressions will take flight before the parser even has the chance to construct the tethered parse tree. This behavior is obviously unacceptable, as it makes tethers useless, but it's not at all clear how to fix it. One possible solution is to make the tether part of the expression it is tethering, e.g. for each built-in feature we would have a tethered version
/ <operator> <expr>
as well as an untethered version<operator> <expr>
. But this is really just the same as the solution in this proposal, we have just factored out the untethered variants to cut the size of the grammar in half.The issue above is fatal to this approach, but for completeness, there are a few more serious problems. For one, tethers should be "sticky" in a way that untethers need not be. If a parse tree is tethered, the user probably wants it to stay tethered if it is re-evaluated; for instance, it becomes an argument to a macro and then is returned as a sub-tree of the parse tree returned by the macro. For this stickiness to work, we would need to do one of two things:
add_tethers
macro, but this is cumbersome.Contrast this with the current proposal: tethered expressions are trivially sticky, since tethered is the default. And untethers need not remain in the parse tree, because re-evaluating an object literal tree (the result of a previously evaluated untethered tree) does not change the tree, since the evaluation of an object literal is the object.
Finally, there is an argument from simplicity: tethered expressions are by far the more common case, so making untethered expressions the default and requiring all of the common tethers to be added manually (even if automatable by a macro) seems like an odd choice. And, if the user does want to work in a language where untethered-ness is default and "tethering" is an explicit instruction (seemingly a rare situation) they can always define this language in terms of the reverse using macros, and handle the technical problems above however makes sense for their specific use case.
Roadmap
ObjectToParseTree
to evaluate the production of a macro definition!
!
to define macros at compile time!
to executestd
as soon as it is parsed, and remove__std_imported
ParseTree_Eval
!
The text was updated successfully, but these errors were encountered: