Enable general parse-time code execution, to facilitate writing a bl:mp parser in bl:mp #14

jbearer · 2021-08-19T03:43:33Z

Untethering

Lighter-than-air expressions for parse-time code execution

Design Principles

Orthogonality
- Each feature does exactly one thing, and does not overlap any other feature
Extensibility
- New behavior can be added using existing features so seamlessly that it is indistinguishable from a built-in feature

Motivation

Macros are currently used for two purposes: adding rules to the grammar, and executing code (including macro definitions themselves) at parse time. This behavior is confusing and arbitrary, requiring strange idioms like enclosing parse-time macro definitions in {} to prevent them from executing at run-time. Further, it stands in direct opposition to the principles of orthogonality (since one feature, macros, has been hacked to accomplish two unrelated goals) and extensibility (since it is hard to build abstractions that behave like macros, due to how magical macros are).

This proposal aims to decouple parse-time execution from the ability to extend the grammar. Providing a more general parse-time execution mechanism also makes it easier to build abstractions that behave like macros, since it allows macro handlers to execute their own code that, among other things, defines more macros.

Goals

Enable parse-time evaluation of code
- Remove the need for macros to execute at both parse-time and run-time.
- Allow macro expansions to have side-effects (such as defining new macros themselves, or executing code).
- Allow the insertion of objects directly into the parse tree; for example, singleton objects or static local variables.
Enable abstractions (e.g. macros) that behave like macros, and abstractions (e.g. lazy_static! from Rust) that can themselves be used to execute code at parse time.
- Facilitate writing a bl:mp parser in bl:mp (note that because of macros and parse-time execution, bl:mp parsers need to interpret bl:mp code)
- Evaluating the std prelude as soon as it is parsed, before the module importing it is parsed. This would remove the need for the __std_imported check in the expansion of the syntax macro, because we are guaranteed that std is already imported and evaluated when a syntax macro expands.
- Setting up global state at parse time; for example, state needed by a parse-time static typechecker implemented using macros.
- Evaluating code at parse time when possible to improve run-time performance; for example, constant folding.
Remove the need for non-parse-time execution of code
- Since code can be evaluated at parse time, we no longer need separate parsing and executing steps. We can leave it up to the user to untether the top-level parse tree if they want it to execute.
Keep parse trees inspectible
- This proposal introduces a new kind of terminal parse tree, an object literal. It should be possible to write code in such a way that both symbol and object terminal parse trees can be detected, so that code does not attempt to use a symbol or object literal as if it were an object implementing the parse tree protocol.
Maintain grammar modularity
- The new feature of the built-in grammar should not preclude the user from implementing a similar-in-appearence-but-different feature in a self-contained sub-grammar.

Syntax and Semantics

This proposal makes bl:mp expressions lighter-than-air, so that they take flight (execute) as soon as they are untethered. It adds a new built-in untether expression, notated !<expr>, which causes its sub-expression to execute as soon as the untether expression is parsed. Each time the parser produces a parse tree (including when it is the result of a macro handler), any untethered expressions within the parse tree spontaneously take flight and become object literal expressions.

Untether expressions

Semantically (with reference to a hypothetical semantic model for the parser which, in its current form, does not exist) untether expressions are evaluated in a new predicate parse-tree with the signature

#:mode     (parse-tree I I I  I  O O)
#:contract (parse-tree H r nt Ts H T)

That is, the parse-tree predicate acts on the machine state (the heap H and
global scope r) to construct new parse trees (T) from a non-terminal symbol nt and a sequence of sub-trees Ts, according to the rules:

; Tethered parse trees are transformed recursively.
 (side-condition (tethered v Ts))
-------------------------------------------------
 (parse-tree H r v Ts H (v Ts))

; Untethered parse trees are evaluated to objects.
; The resulting parse tree is nested, with a non-terminal
; parse tree (so that macros can inspect the parse tree and
; determine from the non-terminal symbol that the sub-tree
; is an object literal) on the outside, and an object literal
; parse tree on the inside.
 (steps-to H r e H_2 v)
-----------------------
 (parse-tree H r built-in-non-terminal (! e) H_2 (v ()))

where
    (tethered built-in-non-terminal (! _)) -> #f
    (tethered _ _) -> #t

    T ::= (v Ts) ; A parse tree is a grammar symbol (which can be any object)
                 ; and a (possibly empty) sequence of sub-trees.

The parse-tree predicate is used to construct all parse trees produced by the parses-to predicate which, again, is not currently specified. This includes parse trees generated by normal reductions, parse trees generated to pass in to macro handlers, and parse trees returned by macro handlers.

Note that the parse-tree predicate considers a parse tree untethered only if the non-terminal used to construct that parse tree is the non-terminal of built-in expressions. This allows the user to produce tethered ! <expr> parse trees in their macro expansions by giving the parse trees a different non-terminal. They can also produce any expression that they would have been able to produce before in the final, executed, parse tree, including sending an expression to the symbol !, since such a send would eventually be spelled < ! <expr>, not ! <expr>. These two properties preserve grammar modularity.

Parse tree to expression conversion

Since !-evaluation depends on the non-terminal of the parse tree being evaluated, the parse-tree-to-expression conversion should do so as well, for consistency. Currently, any parse tree of the right shape that ends up in the final, evaluated parse tree can be converted to an expression and evaluated. This should be changed so that only parse trees with the built-in non-terminal can be evaluated; any other parse tree would produce a BLIMP_INVALID_PARSE_TREE error. This should not pose a burden, as the user must already control which parse trees end up in the final tree in order to ensure they have the right shape; additionally ensuring that they have the right non-terminal should be no harder.

Parse tree protocol

The presence of untether expressions in turn changes the structure of a parse tree. Previously, a parse tree consisted of either a non-terminal symbol and a sequence of sub-trees, or a terminal symbol (represented by a Symbol in bl:mp). Now, the grammar symbol associated with a parse tree may be any value v, including symbol objects as well as block objects, since the use of ! can lead to parse trees where the symbol is the evaluation of an arbitrary expression. This requires changes in the parse tree protocol used to represent parse trees as objects.

The grammar symbol associated with a parse tree may now be any object, so it is no longer possible to distinguish between terminal and non-terminal parse tree objects by checking if the object is a symbol. Instead, the parse tree protocol is changed so that terminals are represented by parse trees with no sub-trees, and the result value of the parse tree object's method handler is the terminal object (or symbol).

Even apart from the rest of this proposal, this is an improvement over the previous parse tree protocol. It is conceptually simpler and more consistent, since the result value of a parse tree object is always the grammar symbol (whether terminal or not), and since a terminal parse tree really is just a parse tree with no sub-trees (i.e. a leaf).

However, this change does make the parse trees for s (where s is a terminal, previously s and now {s}) and s -> ε (where s is a non-terminal, previously {s}) ambiguous. While nullable productions like s -> ε are not permitted as grammar rules, it is idiomatic to use the parse tree object {s} as an example of the non-terminal s as a child of a macro production. For example, we would define a new grammar rule N -> s (where s is a non-terminal) with an object like {{s}; N}. Furthermore, support for nullable productions in general is planned, so we must have a way to distinguish between s and s -> ε, both in the production of a macro definition and in a macro expansion when we convert it to an expression.

In this proposal, we adopt for bl:mp the same notational convention as traditional grammar notations (e.g. BNF) and introduce a pseudo-terminal ε which represents the empty list of symbols. To avoid choosing an arbitrary name for this token (built-in bl:mp constructs tend to avoid alphabetical names) we use the empty token ``, which makes some sense because this token does literally match nothing. So s -> ε would be represented in the parse tree protocol by {^``; s}, while the parse tree for the terminal s would be represented by the object {s}.

This solution may also pay dividends when it is time to implement nullable productions, because it may be possible to think of ε as an actual terminal which matches no input. Adaptive lexing should be able to support this. Nullable productions could thus be implemented in the parser as well as in bl:mp code by a production with a single terminal ε, which would preserve the invariant that all productions contain at least one symbol, which the parser currently relies on and for which there is not yet an alternative plan. However, this idea requires more investigation in the empty productions proposal. For example, does this implementation really work with the LALR algorithm? And, is it allowed to mix ε freely with other terminals in a production? For now, it should at least be apparent that this change solves the problems presented here while making it no more difficult to implement nullable productions later, because we can trivially reduce to the current parsing algorithm by internally removing all empty terminals from productions.

Macros

Macros are no longer evaluated by the parser at parse-time (unless inside a ! expression, of course). Instead, they only have the run-time behavior that is currently implemented in the interpreter.

In addition, the semantics of macros are changed so that the production of a macro is deeply evaluated as a parse tree, in exactly the same way that the parse tree object returned by a macro is converted to a parse tree. Currently, the production is only evaluated one level deep, since we only need to inspect the top-level grammar symbols to generate the new grammar rule. This change is motivated by the fact that the conversion from object to parse tree (the parse-tree predicate) now has very obvious side-effects in the form of ! evaluations, and these side-effects should be handled consistently everywhere the parse tree protocol is used. The truth is, though, that this conversion has always had side-effects, since it involves sending a message to the parse tree object. By convention, the parse tree object's method handler usually has no side-effects, but this is not required. Thus, this is more of a regression fix than a fundamental change.

No more run-time

Untether expressions give the user full control over when code is evaluated, which means we no longer need separate parsing and interpreting steps. The semantics of an entire bl:mp program are changed so that the result of evaluating a bl:mp program is not a value plus side-effects; instead it is simply the side-effects of parsing the program. It falls to the user to wrap the top-level parse tree with an untether operator in order to force the program to execute.

In a sense, we are not adding an entirely new feature with this proposal. We are taking an existing feature (the ability to execute bl:mp code) and making it more general, extensible, and explicit by giving it first-class syntax.

Prelude

For convenience, since bl:mp programs are no longer executed after parsing, the bootstrap prelude is modified and extended to wrap the top-level expression in an untether, so that the bl:mp program is actually executed. This ensures that the user does not have to think about manually executing their program whenever the bootstrap prelude is implicitly included (i.e. the default interpreter mode).

The bootstrap prelude should also add a parse-time macro expression to the core language, which expands to a run-time macro wrapped in a !.

An untether is used to evaluate the std prelude when it is imported, which ensures that it will always be fully imported whenever it is fully parsed. In other words, the expansion of standard macros can assume that std is imported, and do away with the __std_imported complexity that is currently in macro.bli.

Technical Details

`component:parser`

The new parse tree structure no longer requires a union to represent different structures for terminals and non-terminals. (In a way, the removal of a union from the core interpreter is a good argument for the improved simplicity of the new parse tree protocol). We should now be able to represent the ParseTree data type with a unified structure, containing:

An Object representing the symbol of the parse tree
A GrammarSymbol, which is the parser's interpretation of that symbol (e.g. it's unique numeric code, and whether it is a terminal)
A (possibly empty) sequence of sub-trees

The GrammarSymbol is necessary only for reparsing, so eventually it should be removed, leaving the ParseTree struct very simple indeed.

This allows us to implement the semantics of the parse-tree predicate from the formal outline above quite simply using a new function like

Status ParseTree_Init(
    Blimp *blimp, GrammarSymbol symbol, ParseTree *children, ParseTree *tree);

This function simply checks if symbol == BUILT_IN_NON_TERMINAL && children == {"!", _}. If so, it evaluates the inner expression to an object, and generates an object literal parse tree, which is a terminal parse tree whose symbol Object is the evaluated Object and which has no children.

As we do this refactor, the parser will be updated to handle non-Symbol objects as the symbol of a terminal parse tree (these correspond to object literal expressions). However, we will run into the question of what to do if the symbol of a non-terminal parse tree is a non-Symbol object. Such a parse tree can only be created via a macro expansion, and macro expansion parse trees are generally opaque except for the cases of reparsing and conversion to expressions. For now, it is fine to error in reparsing if we encounter a non-Symbol non-terminal symbol, since reparsing is being removed anyways. In ParseTree_Eval, it should always be an error if we encounter a non-Symbol non-terminal, because any non-terminal parse trees that get converted to expressions are required to have the built-in non-terminal symbol.

There is an open question as to how parse trees containing non-Symbol object terminals should behave when used as the production of a macro definition. There is no obvious answer because the current parser requires that terminals be symbols. There are a few ways we could go here:

Require that terminals in the rule are symbols, and error if we get a non-symbol object. This matches the current behavior, but requiring something to be a symbol is always inelegant and this is a good opportunity to remove this restriction. In fact, this whole proposal should be a good step towards removing symbols at all in the representation of productions and parse trees, and just using Objects (which may be symbols).
Accept the rule, but never match it since we will never lex an object literal. This seems reasonable and should be very easy to implement.
Accept the rule and match against ! expressions that evaluate to the same object literal. This is probably the most elegant behavior, because it makes terminal symbols truly polymorphic: both symbols and non-symbol objects behave in analagous ways. However, this might be quite tricky to implement in the parser.

Eventually, we should settle on 2 or 3. For now, though, 1 is acceptable, because this is very a niche use case that is neither common nor required for moving forward. It is okay to address this in future work.

`component:compiler`

The compiler must be extended to compile object literal expressions. Fortunately, the target language already has an OBJI instruction, so this change is easy, and further downstream changes are not required.

`component:interpreter`

To conform to the new parse-time-only semantics, the interpeter driver must be modified so that when executing non-interactively, it does not execute the program after parsing; it simply executes the side-effects of parsing. For interactive evaluation, expressions are still evaluated as they are entered and parsed, for convenience.

In addition, a --prepend option is added, which prepends the given file to the input stream which is parsed. It is important that this happens before parsing, so that a bootstrap prelude that ends in ! can be prepended to a normal .bli file to generate a parse tree which is executed after it is parsed. Preset options are also added for common configurations of this option; for example, --prepend bootstrap.bli may be implied if no other options are given, and --raw (similar to --core) may be used to omit the implicit bootstrap prelude.

`component:tests`

The test runner is modified to prepend bootstrap.bli (in a manner similar to the driver) and to time the parsing of the test files rather than the execution of them. In fact, the tests should not be executed at all; the entire running process can consist of parsing the test.

Alternatives

Implement a bl:mp interpreter in bl:mp.

This is necessary to allow macros that define macros without reparsing, because
- Macro expansions (such as the expansion of the syntax macro, which expands to a parse tree for a primitive macro definition) have no way to achieve side-effects (such as expanding their own macros) since they are neither reparsed nor interpreted.
- Macro handlers cannot evaluate new macros in their run-time if part of the handler or production of the new macro is one of the input parse trees to the outer parse tree, because the handler and production of a macro must be evaluated before the macro can be defined, and there is currently no way to evaluate parse trees at run-time.
Implementing a bl:mp interpreter would solve either problem (by manually parsing the parse tree returned from a macro handler and interpreting any macro definitions therein, or by evaluating input parse trees for use in a run-time macro definition within a macro handler) and it would also enable the implementation of an untether macro that executes the sub-parse-tree of the untether expression. However, this method carries considerable performance cost, as a bl:mp interpreter implemented in bl:mp is never going to be nearly as efficient as the optimizing C compiler/interpreter. Performance may not be a huge concern for the first use case, as macro handlers are usually small, and are executed once and done. But code executed statically with the untether operator may be quite large (for example, the entire prelude, or a static typechecking system). Moreover, this approach is just extremely complicated and ugly. Finally, it still relies on macros being executable at both parse-time and run-time in order to execute any code at all at parse time (even the interpreter which then executes the rest of the code).
Make expressions untethered by default, and add a built-in "tether" operator /<expr>.

This is arguably a bit more elegant, because it doesn't require the addition of a purely semantic parse-tree step before the construction of new parse trees to explain why untethered expressions within a parse tree are evaluated immediately. Immediate evaluation is simply the natural behavior of bl:mp expressions, and it is the tethers (the things manually inserted by the user) that halt or control this natural process.

However, this approach has some serious technical problems. First of all, parse trees are supposed to take flight from the bottom up, since parse trees are generated from the bottom up (you can't construct a new parse tree until you have constructed its children). But if tethers are expressions wrapping lighter-than-air expressions, then the sub-expressions will take flight before the parser even has the chance to construct the tethered parse tree. This behavior is obviously unacceptable, as it makes tethers useless, but it's not at all clear how to fix it. One possible solution is to make the tether part of the expression it is tethering, e.g. for each built-in feature we would have a tethered version / <operator> <expr> as well as an untethered version <operator> <expr>. But this is really just the same as the solution in this proposal, we have just factored out the untethered variants to cut the size of the grammar in half.

The issue above is fatal to this approach, but for completeness, there are a few more serious problems. For one, tethers should be "sticky" in a way that untethers need not be. If a parse tree is tethered, the user probably wants it to stay tethered if it is re-evaluated; for instance, it becomes an argument to a macro and then is returned as a sub-tree of the parse tree returned by the macro. For this stickiness to work, we would need to do one of two things:
- Keep tethers in the parse tree when a tethered parse tree is created, which clutters parse trees and makes traversing them pretty annoying, or...
- Manually add tethers back in to macro arguments used in macro expansions. This could be automated by an add_tethers macro, but this is cumbersome.
Contrast this with the current proposal: tethered expressions are trivially sticky, since tethered is the default. And untethers need not remain in the parse tree, because re-evaluating an object literal tree (the result of a previously evaluated untethered tree) does not change the tree, since the evaluation of an object literal is the object.

Finally, there is an argument from simplicity: tethered expressions are by far the more common case, so making untethered expressions the default and requiring all of the common tethers to be added manually (even if automatable by a macro) seems like an odd choice. And, if the user does want to work in a language where untethered-ness is default and "tethering" is an explicit instruction (seemingly a rare situation) they can always define this language in terms of the reverse using macros, and handle the technical problems above however makes sense for their specific use case.

Roadmap

Change parse tree protocol and update tests
Use ObjectToParseTree to evaluate the production of a macro definition
Implement !
Make macros execute only at run-time, and add a new prelude macro that uses ! to define macros at compile time
Use ! to execute std as soon as it is parsed, and remove __std_imported
Check non-terminals in ParseTree_Eval
Update the interpreter and test runner to not execute the parsed tree when running non-interactively. Add a bootstrap prelude which forces the entire program to be executed after parsing using !

The text was updated successfully, but these errors were encountered:

The new protocol encodes terminals differently: as objects that send no sub-trees and which return the terminal symbol, rather than as the terminal symbol itself. As a result, empty non-terminals (technically not supported but used as a convention in macro productions, where only the grammar symbol of the parse tree is relevant) are represented differently as well, as objects with a single sub-tree, the empty terminal {``}. In addition, the mechanism for interpeting the production of a macro definition is changed. Instead of taking just the top-level grammar symbols of the production (which is all that is relevant to the macro definition itself) using an ad hoc visitor, we use the normal ObjectToParseTree conversion, which deeply evaluates its argument, and _then_ take only the top-level grammar symbols from the evaluated parse tree. This is an important semantic distinction when the production has side effects. These two changes represent the first two steps of the roadmap to completing #14.

Allow parse tree symbols to be arbitrary objects. Implement untether expressions, which evaluate their arguments to object literals at parse time. This is step 3 of the roadmap towards completing #14.

This ensures that the std prelude is fully evaluated when any part of it is used in application code, which means we don't need to worry about whether std has been evaluated in the expansion of `syntax`, and can do away with __std_imported. This is step 5 of the roadmap towards #14.

Like all other built-in expressions (except !) macros only execute when the expression containing them executes. A macro definition can be wrapped in !() to execute it immediately and make the new syntax available for parsing further down in the file. This is step 4 towards #14.

jbearer · 2021-10-11T16:56:47Z

Decided not to include step 6 (check non-terminals in ParseTree_Eval) as part of this change because it is a large, breaking change that is made complicated by the complexity of the built-in grammar. It should get much simpler after the built-in grammar is simplified down to a single built-in non-terminal (#6). Therefore, this work will be tracked under a separate issue (#15).

All that's left to do, then, is step 7:

Update the interpreter and test runner to not execute the parsed tree when running non-interactively. Add a bootstrap prelude which forces the entire program to be executed after parsing using !

and then this issue can be closed.

With the addition of ! giving full control over execution order to the programmer, there is no longer a need for exeuction to be a built-in step in interpreting a bl:mp program. Now, the semantics of a bl:mp program are defined as the semantics of parsing the program, and the interpreter (in non-interactive) mode no longer executes the parsed expression. The std prelude is modified such that if it is prepended to a file, it will automatically execute the contents of that file when it is parsed. The interpreter is extended with a --preload option to prepend files in this way, and the std prelude is implicitly prepended if --core is not passed. This is the first part of the final step towards #14 (the other part being the analagous change for the test runner).

jbearer self-assigned this Aug 19, 2021

jbearer added component:compiler component:interpreter component:parser component:prelude component:tests feature New feature or request labels Aug 19, 2021

jbearer added this to the 0.5 Roadmap milestone Aug 19, 2021

jbearer added a commit that referenced this issue Oct 9, 2021

Untether expressions

475ed1e

Allow parse tree symbols to be arbitrary objects. Implement untether expressions, which evaluate their arguments to object literals at parse time. This is step 3 of the roadmap towards completing #14.

jbearer mentioned this issue Oct 11, 2021

Check non-terminals in parse-tree-to-expression conversion #15

Open

jbearer closed this as completed in 98b8c2d Oct 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable general parse-time code execution, to facilitate writing a bl:mp parser in bl:mp #14

Enable general parse-time code execution, to facilitate writing a bl:mp parser in bl:mp #14

jbearer commented Aug 19, 2021

jbearer commented Oct 11, 2021 •

edited

Loading

Enable general parse-time code execution, to facilitate writing a bl:mp parser in bl:mp #14

Enable general parse-time code execution, to facilitate writing a bl:mp parser in bl:mp #14

Comments

jbearer commented Aug 19, 2021

Untethering

Design Principles

Motivation

Goals

Syntax and Semantics

Untether expressions

Parse tree to expression conversion

Parse tree protocol

Macros

No more run-time

Prelude

Technical Details

component:parser

component:compiler

component:interpreter

component:tests

Alternatives

Roadmap

jbearer commented Oct 11, 2021 • edited Loading

`component:parser`

`component:compiler`

`component:interpreter`

`component:tests`

jbearer commented Oct 11, 2021 •

edited

Loading