From a191f385878fd8a7725b3a95b0af1a8b0cef6225 Mon Sep 17 00:00:00 2001 From: David Van Horn Date: Mon, 2 Dec 2024 12:30:52 -0500 Subject: [PATCH] Overhaul of notes from Abscond to Hustle. --- www/notes/abscond.scrbl | 113 ++++---- www/notes/blackmail.scrbl | 240 +++++++++++------ www/notes/con.scrbl | 76 +++--- www/notes/dodger.scrbl | 62 ++--- www/notes/dupe.scrbl | 464 +++++++++++++++++++++----------- www/notes/evildoer.scrbl | 545 ++++++++++++++++++-------------------- www/notes/extort.scrbl | 315 ++++++++++++++++++---- www/notes/fraud.scrbl | 243 +++++++++-------- www/notes/hustle.scrbl | 352 ++++++++++++++---------- 9 files changed, 1442 insertions(+), 968 deletions(-) diff --git a/www/notes/abscond.scrbl b/www/notes/abscond.scrbl index 149ceae4..760c2ea8 100644 --- a/www/notes/abscond.scrbl +++ b/www/notes/abscond.scrbl @@ -15,7 +15,7 @@ @(ev '(require rackunit a86)) @(for-each (λ (f) (ev `(require (file ,(path->string (build-path langs "abscond" f)))))) - '("interp.rkt" "ast.rkt" "compile.rkt")) + '("main.rkt" "correct.rkt")) @(define (shellbox . s) (parameterize ([current-directory (build-path langs "abscond")]) @@ -80,7 +80,15 @@ implementation artifacts. Formal definitions balance precision while allowing for under-specification, but require detailed definitions and training to understand. -We will use a combination of each. +For the purposes of this course, we will use interpreters to specify +the meaning of programs. The interpreters provide a specification for +the compilers we write and make precise what means for a compiler to +be @emph{correct}. Any time the compiler produces code that, when +run, produces a different result that the interpreter produces for the +same program, the compiler is broken (or the specification is wrong). +Interpreters are useful for specifying what the compiler should do and +sometimes writing interpreters is also useful for informing @emph{how} +it should do it. To begin, let's start with a dead simple programming language called @@ -151,13 +159,6 @@ While not terribly useful for a language as overly simplistic as Abscond, we use an AST datatype for representing expressions and another syntactic categories. For each category, we will have an appropriate constructor. In the case of Abscond all expressions are integers, so we have a single constructor, @racket[Lit]. - -@(define-language A-concrete - (e ::= (Lit i)) - (i ::= integer)) - -@centered{@render-language[A-concrete]} - A datatype for representing expressions can be defined as: @codeblock-include["abscond/ast.rkt"] @@ -168,6 +169,12 @@ an integer and constructs an instance of the AST datatype if it is, otherwise it signals an error: @codeblock-include["abscond/parse.rkt"] +@ex[ +(parse 5) +(parse 42) +(eval:error (parse #t))] + + @section{Meaning of Abscond programs} The meaning of an Abscond program is simply the number itself. So @@ -184,6 +191,14 @@ produces it's meaning: (interp (Lit -8)) ) +The @racket[interp] function specifies the meaning of expressions, +i.e. elements of the type @tt{Expr}. This language is so simple, the +@racket[interp] function really doesn't @emph{do} much of anything, +but this will change as the langauge grows. + + + + We can add a command line wrapper program for interpreting Abscond programs from stdin: @@ -198,6 +213,7 @@ the result. For example, interpreting the program @tt{42.rkt} shown above: @shellbox["cat 42.rkt | racket -t interp-stdin.rkt -m"] +@;{ Even though the semantics is obvious, we can provide a formal definition of Abscond using @bold{operational semantics}. @@ -249,6 +265,7 @@ and integers @racket[i], if (@racket[e],@racket[i]) in @render-term[A We now have a complete (if overly simple) programming language with an operational semantics and an interpreter, which is (obviously) correct. Now let's write a compiler. +} @section{Toward a Compiler for Abscond} @@ -283,7 +300,7 @@ computer; it's interpreter is implemented in hardware on your computer's CPU,} @item{it is one of the two dominant computing architectures (the other -being ARM), and} +being ARM) in use today, and} @item{it is a mature technology with good tools and materials.} ] @@ -303,7 +320,7 @@ as follows: Separating out @tt{print_result}, which at this point is just a simple @tt{printf} statement, seems like overkill, but it will be useful in -the future as the language gets more complicated. +the future as the language and its set of values gets more complicated. The runtime must be linked against an object file that provides the definition of @tt{entry}; this is the code our compiler will emit. @@ -496,75 +513,69 @@ Moreover, we can compare our compiled code to code compiled by Racket: @section{But is it @emph{Correct}?} At this point, we have a compiler for Abscond. But is it correct? +What does that even mean, to be correct? -Here is a statement of compiler correctness: +First, let's formulate an alternative implementation of +@racket[interp] that composes our compiler and a86 interpreter to define +a (hopefully!) equivalent function to @racket[interp]: -@bold{Compiler Correctness}: @emph{For all expressions @racket[e] and -integers @racket[i], if (@racket[e],@racket[i]) in @render-term[A -𝑨], then @racket[(asm-interp (compile e))] equals -@racket[i].} +@codeblock-include["abscond/exec.rkt"] -Ultimately, we want the compiler to capture the operational semantics -of our language (the ground truth of what programs mean). However, -from a practical stand-point, relating the compiler to the intepreter -may be more straightforward. What's nice about the interpreter is we -can run it, so we can @emph{test} the compiler against the -interpreter. Moreover, since we claimed the interpreter is correct -(w.r.t. to the semantics), testing the compiler against the interpreter -is a way of testing it against the semantics, indirectly. If the -compiler and interpreter agree on all possible inputs, then the -compiler is correct with respect to the semantics since it is -equivalent to the interpreter, and the interpreter is correct. +This function can be used as a drop-in replacement to @racket[interp]: -So, in this setting, means we have the following equivaluence: +@ex[ +(exec (Lit 42)) +(exec (Lit 19))] -@verbatim{ -(interp e) @emph{equals} (asm-interp (compile e)) -} +It captures the idea of a phase-distinction in that you can first +compile a program into a program in another language---in this case +a86---and can then interpret @emph{that} program to get the result. +If the compiler is correct, the result should be the same: + +@bold{Compiler Correctness}: @emph{For all @racket[e] @math{∈} +@tt{Expr} and @racket[i] @math{∈} @tt{Integer}, if @racket[(interp e)] +equals @racket[i], then @racket[(exec e)] equals +@racket[i].} + +One thing that is nice about specifying our language with an +interpreter is that we can run it. So we can @emph{test} the compiler +against the interpreter. If the compiler and interpreter agree on all +possible inputs, then the compiler is correct. -But we don't actually have @racket[asm-interp], a function that -interprets the Asm code we generate. Instead we printed the code and -had @tt{gcc} assembly and link it into an executable, which the OS -could run. But this is a minor distinction. We can use -@racket[asm-interp] to interact with the OS to do all of these steps. This is actually a handy tool to have for experimenting with compilation within Racket: -@examples[#:eval ev -(asm-interp (compile (Lit 42))) -(asm-interp (compile (Lit 37))) -(asm-interp (compile (Lit -8))) -] +@ex[ +(exec (Lit 42)) +(exec (Lit 37)) +(exec (Lit -8))] This of course agrees with what we will get from the interpreter: -@examples[#:eval ev +@ex[ (interp (Lit 42)) (interp (Lit 37)) -(interp (Lit -8)) -] +(interp (Lit -8))] We can turn this in a @bold{property-based test}, i.e. a function that computes a test expressing a single instance of our compiler correctness claim: -@examples[#:eval ev -(define (check-compiler e) - (check-eqv? (interp e) - (asm-interp (compile e)))) +@codeblock-include["abscond/correct.rkt"] + +@ex[ (check-compiler (Lit 42)) (check-compiler (Lit 37)) -(check-compiler (Lit -8)) -] +(check-compiler (Lit -8))] This is a powerful testing technique when combined with random generation. Since our correctness claim should hold for @emph{all} Abscond programs, we can randomly generate @emph{any} Abscond program and check that it holds. -@examples[#:eval ev +@ex[ (check-compiler (Lit (random 100))) ; test 10 random programs diff --git a/www/notes/blackmail.scrbl b/www/notes/blackmail.scrbl index 6b2e05c1..0492230f 100644 --- a/www/notes/blackmail.scrbl +++ b/www/notes/blackmail.scrbl @@ -13,7 +13,7 @@ @(ev '(require rackunit a86)) @(for-each (λ (f) (ev `(require (file ,(path->string (build-path langs "blackmail" f)))))) - '("interp.rkt" "compile.rkt" "random.rkt" "ast.rkt")) + '("main.rkt" "random.rkt" "correct.rkt")) @(define (shellbox . s) (parameterize ([current-directory (build-path langs "blackmail")]) @@ -41,9 +41,9 @@ @section{Refinement, take one} -We've seen all the essential pieces (a grammar, an AST data type -definition, an operational semantics, an interpreter, a compiler, -etc.) for implementing a programming language, albeit for an amazingly +We've seen all the essential pieces---a grammar, an AST data type +definition, a semantic specification (the interpreter), a compiler, +etc.---for implementing a programming language, albeit for an amazingly simple language. We will now, through a process of @bold{iterative refinement}, grow @@ -73,24 +73,27 @@ An example concrete program: @section{Abstract syntax for Blackmail} -The grammar of abstract Backmail expressions is: +The datatype for abstractly representing expressions can be defined +as: -@centered{@render-language[B]} +@codeblock-include["blackmail/ast.rkt"] So, @racket[(Lit 0)], @racket[(Lit 120)], and @racket[(Lit -42)] are Blackmail AST expressions, but so are @racket[(Prim1 'add1 (Lit 0))], @racket[(Sub1 (Lit 120))], @racket[(Prim1 'add1 (Prim1 'add1 (Prim1 'add1 (Lit -42))))]. -A datatype for representing expressions can be defined as: - -@codeblock-include["blackmail/ast.rkt"] The parser is more involved than Abscond, but still straightforward: @codeblock-include["blackmail/parse.rkt"] +@ex[ +(parse '42) +(parse '(add1 42)) +(parse '(add1 (sub1 (add1 42))))] + @section{Meaning of Blackmail programs} @@ -101,6 +104,7 @@ The meaning of a Blackmail program depends on the form of the expression: @item{the meaning of an increment expression is one more than the meaning of its subexpression, and} @item{the meaning of a decrement expression is one less than the meaning of its subexpression.}] +@;{ The operational semantics reflects this dependence on the form of the expression by having three rules, one for each kind of expression: @@ -150,13 +154,27 @@ This may seem a bit strange at the moment, but it helps to view the semantics through its correspondence with an interpreter, which given an expression @math{e}, computes an integer @math{i}, such that @math{(e,i)} is in @render-term[B 𝑩]. +} -Just as there are three rules, there will be three cases to the -interpreter, one for each form of expression: +To compute the meaning of an expression, the @racket[interp] +function does a case analysis of the expression: @codeblock-include["blackmail/interp.rkt"] -@examples[#:eval ev +In the case of a @racket[(Prim1 p e)] expression, the interpreter +first recursively computes the meaning of the subexpression @racket[e] +and then defers to a helper function @racket[interp-prim1] which +interprets the meaning of a given unary operation and value: + +@codeblock-include["blackmail/interp-prim.rkt"] + +If the given operation is @racket['add1], the function adds 1; +if it's @racket['sub1], it subtracts 1. + +We can write examples of the @racket[interp] function, writing inputs +in abstract syntax: + +@ex[ (interp (Lit 42)) (interp (Lit -7)) (interp (Prim1 'add1 (Lit 42))) @@ -164,6 +182,20 @@ interpreter, one for each form of expression: (interp (Prim1 'add1 (Prim1 'add1 (Prim1 'add1 (Lit 8))))) ] +We could also write examples using concrete syntax, using the parser +to construct the appropriate abstract syntax for us: + +@ex[ +(interp (parse '42)) +(interp (parse '-7)) +(interp (parse '(add1 42))) +(interp (parse '(sub1 8))) +(interp (parse '(add1 (add1 (add1 8))))) +] + + + +@;{ Here's how to connect the dots between the semantics and interpreter: the interpreter is computing, for a given expression @math{e}, the integer @math{i}, such that @math{(e,i)} is in @render-term[B 𝑩]. The @@ -195,56 +227,95 @@ induction of the interpreter's correctness: @render-term[B 𝑩], then @racket[(interp e)] equals @racket[i].} +} + @section{An Example of Blackmail compilation} Just as we did with Abscond, let's approach writing the compiler by first writing an example. -Suppose we want to compile @racket[(add1 (add1 40))]. We already -know how to compile the @racket[40]: @racket[(Mov 'rax 40)]. To do -the increment (and decrement) we need to know a bit more x86-64. In -particular, the @tt{add} (and @tt{sub}) instruction is relevant. It -increments the contents of a register by some given amount. - -Concretely, the program that adds 1 twice to 40 looks like: +Suppose we want to compile @racket[(add1 (add1 40))]. We already know +how to compile the @racket[40]: @racket[(Mov 'rax 40)]. To do the +increment (and decrement) we need to know a bit more a86. In +particular, the @racket[Add] instruction is relevant. It increments +the contents of a register by some given amount. -@filebox-include[fancy-nasm blackmail "add1-add1-40.s"] +So, a program that adds 1 twice to 40 looks like: -The runtime stays exactly the same as before. -@shellbox["make add1-add1-40.run" "./add1-add1-40.run"] +@ex[ +(asm-interp + (prog (Global 'entry) + (Label 'entry) + (Mov 'rax 40) + (Add 'rax 1) + (Add 'rax 1) + (Ret)))] + -@section{A Compiler for Blackmail} +@;{filebox-include[fancy-nasm blackmail "add1-add1-40.s"]} -To compile Blackmail, we make use of two more a86 -instructions, @racket[Add] and @racket[Sub]: -@ex[ -(asm-display - (list (Label 'entry) - (Mov 'rax 40) - (Add 'rax 1) - (Add 'rax 1) - (Ret))) -] +@section{A Compiler for Blackmail} -The compiler consists of two functions: the first, which is given a -program, emits the entry point and return instructions, invoking -another function to compile the expression: +To compile Blackmail, we make use of two more a86 instructions, +@racket[Add] and @racket[Sub]. The compiler consists of two +functions: the first, which is given a program, emits the entry point +and return instructions, invoking another function to compile the +expression: @codeblock-include["blackmail/compile.rkt"] Notice that @racket[compile-e] is defined by structural recursion, much like the interpreter. +In the case of a unary primitive @racket[(Prim1 p e)], the compiler +first compiles the subexpression @racket[e] obtaining a list of +instructions that, when executed, will place @racket[e]'s value in the +@racket['rax] register. After that sequence of instructions, the +compiler emits instructions for carrying out the operation @racket[p], +defering to a helper function @racket[compile-op1]: + +@codeblock-include["blackmail/compile-ops.rkt"] + +This function either emits an @racket[Add] or @racket[Sub] +instruction, depending upon @racket[p]. We can now try out a few examples: @ex[ -(compile (Prim1 'add1 (Prim1 'add1 (Lit 40)))) -(compile (Prim1 'sub1 (Lit 8))) -(compile (Prim1 'add1 (Prim1 'add1 (Prim1 'sub1 (Prim1 'add1 (Lit -8)))))) -] +(compile-e (parse '(add1 (add1 40)))) +(compile-e (parse '(sub1 8))) +(compile-e (parse '(add1 (add1 (sub1 (add1 -8))))))] + +To see the complete code for these examples, we can use the +@racket[compile] function: + +@ex[ +(compile (parse '(add1 (add1 40)))) +(compile (parse '(sub1 8))) +(compile (parse '(add1 (add1 (sub1 (add1 -8))))))] + +We can also run the code produced in these examples in order to see +what they each produce: + +@ex[ +(asm-interp (compile (parse '(add1 (add1 40))))) +(asm-interp (compile (parse '(sub1 8)))) +(asm-interp (compile (parse '(add1 (add1 (sub1 (add1 -8)))))))] + +Based on this, it's useful to define an @racket[exec] function that +(should) behave like @racket[interp], just as we did for Abscond: + +@codeblock-include["blackmail/exec.rkt"] + +@ex[ +(exec (parse '(add1 (add1 40)))) +(exec (parse '(sub1 8))) +(exec (parse '(add1 (add1 (sub1 (add1 -8))))))] + +This function will be the basis of our compiler correctness statement +and a primary tool for testing the compiler. And give a command line wrapper for parsing, checking, and compiling in @link["code/blackmail/compile-stdin.rkt"]{@tt{compile-stdin.rkt}}, @@ -259,34 +330,26 @@ single command: @void[(shellbox "touch add1-add1-40.rkt")] @shellbox["make add1-add1-40.run" "./add1-add1-40.run"] -Likewise, to test the compiler from within Racket, we use -the same @racket[asm-interp] function to encapsulate running -assembly code: - -@ex[ -(asm-interp (compile (Prim1 'add1 (Prim1 'add1 (Lit 40))))) -(asm-interp (compile (Prim1 'sub1 (Lit 8)))) -(asm-interp (compile (Prim1 'add1 (Prim1 'add1 (Prim1 'add1 (Prim1 'add1 (Lit -8))))))) -] @section{Correctness and random testing} We can state correctness similarly to how it was stated for Abscond: -@bold{Compiler Correctness}: @emph{For all expressions @racket[e] and -integers @racket[i], if (@racket[e],@racket[i]) in @render-term[B -𝑩], then @racket[(asm-interp (compile e))] equals +@bold{Compiler Correctness}: @emph{For all @racket[e] @math{∈} +@tt{Expr} and @racket[i] @math{∈} @tt{Integer}, if @racket[(interp e)] +equals @racket[i], then @racket[(exec e)] equals @racket[i].} +(This statement is actually identical to the statement of correctness +for Abscond, however, it should be noted that the meaning of +@tt{Expr}, @racket[interp], @racket[exec] refer to their Blackmail +definitions.) And we can test this claim by comparing the results of running compiled and interpreted programs, leading to the following property, which hopefully holds: -@ex[ -(define (check-compiler e) - (check-eqv? (interp e) - (asm-interp (compile e))))] +@codeblock-include["blackmail/correct.rkt"] The problem, however, is that generating random Blackmail programs is less obvious compared to generating random Abscond programs @@ -311,6 +374,8 @@ e (check-compiler (random-expr))) ] +@section[#:tag "broken"]{A Broken Compiler} + It's now probably time to acknowledge a short-coming in our compiler. Although it's great that random testing is confirming the correctness of the compiler on @@ -332,10 +397,10 @@ x86 does. Let's see: @ex[ (define max-int (sub1 (expt 2 63))) (define min-int (- (expt 2 63))) -(asm-interp (compile (Lit max-int))) -(asm-interp (compile (Prim1 'add1 (Lit max-int)))) -(asm-interp (compile (Lit min-int))) -(asm-interp (compile (Prim1 'sub1 (Lit min-int))))] +(exec (Lit max-int)) +(exec (Prim1 'add1 (Lit max-int))) +(exec (Lit min-int)) +(exec (Prim1 'sub1 (Lit min-int)))] Now there's a fact you didn't learn in grade school: in the first example, adding 1 to a number made it smaller; in the @@ -358,6 +423,29 @@ correctness: (check-compiler (Prim1 'sub1 (Lit min-int))) ] +The problem also exists in Abscond in that we can write literals that +exceed these bounds. The interpreter has no problem with such +literals: + +@ex[ +(interp (Lit (add1 max-int)))] + +But the compiler will produce the wrong result: + +@ex[ +(exec (Lit (add1 max-int)))] + +It's also possible to exceed the bounds so thoroughly, that the +program can't even be compiled: + +@ex[ +(interp (Lit (expt 2 64))) +(eval:error (exec (Lit (expt 2 64))))] + +The issue here being that a @racket[Mov] instruction can only take an +argument that can be represented in 64-bits. + + What can we do? This is the basic problem of a program not satisfying its specification. We have two choices: @@ -366,17 +454,16 @@ satisfying its specification. We have two choices: @item{change the program (i.e. the compiler)} ] -We could change the spec to make it match the behaviour of -the compiler. This would involve writing out definitions -that match the ``wrapping'' behavior we see in the compiled -code. Of course if the specification is meant to capture -what Racket actually does, taking this route would be a -mistake. Even independent of Racket, this seems like a -questionable design choice. Wouldn't it be nice to reason -about programs using the usual laws of mathematics (or at -least something as close as possible to what we think of as -math)? For example, wouldn't you like know that -@racket[(< i (add1 i))] for all integers @racket[i]? +We could change the spec to make it match the behaviour of the +compiler. This would involve writing out definitions that match the +``wrapping'' behavior we see in the compiled code. Of course if the +specification is meant to capture what Racket actually does, taking +this route would be a mistake. Even independent of Racket, this seems +like a questionable design choice. Wouldn't it be nice to reason about +programs using the usual laws of mathematics (or at least something as +close as possible to what we think of as math)? For example, wouldn't +you like know that @racket[(< i (add1 i))] for all integers +@racket[i]? Unforunately, the other choice seems to paint us in to a corner. How can we ever hope to represent all possible @@ -402,15 +489,13 @@ these pieces in the two compilers we've written: @itemlist[@item{we use @racket[parse] to convert an s-expression into an AST}]} -@item{@bold{Checked} to make sure code is well-formed (and well-typed)} - -@item{@bold{Simplified} into some convenient @bold{Intermediate Representation} +@item{@bold{Checked} to make sure code is well-formed -@itemlist[@item{we don't do any; the AST is the IR}]} +@itemlist[@item{we don't current do any, but more checking will come with more sophisticated langauges.}]} @item{@bold{Optimized} into (equivalent) but faster program -@itemlist[@item{we don't do any}]} +@itemlist[@item{we don't do any yet}]} @item{@bold{Generated} into assembly x86 @@ -428,7 +513,8 @@ Our recipe for building compiler involves: @itemlist[#:style 'ordered @item{Build intuition with @bold{examples},} @item{Model problem with @bold{data types},} -@item{Implement compiler via @bold{type-transforming-functions},} +@item{Implement compiler via @bold{type-guided-functions},} + @item{Validate compiler via @bold{tests}.} ] diff --git a/www/notes/con.scrbl b/www/notes/con.scrbl index c2a1ebac..5365b33c 100644 --- a/www/notes/con.scrbl +++ b/www/notes/con.scrbl @@ -18,7 +18,7 @@ @(ev '(require rackunit a86)) @(for-each (λ (f) (ev `(require (file ,(path->string (build-path langs "con" f)))))) - '("interp.rkt" "compile.rkt" "parse.rkt" "ast.rkt" "random.rkt")) + '("main.rkt" "random.rkt" "correct.rkt")) @title[#:tag "Con"]{Con: branching with conditionals} @@ -44,10 +44,6 @@ This leads to the following grammar for concrete Con: And abstract grammar: -@centered{@render-language[C]} - -Which can be modeled with the following definitions: - @codeblock-include["con/ast.rkt"] @;{ @@ -59,6 +55,11 @@ The parser is similar to what we've seen before: @codeblock-include["con/parse.rkt"] +@ex[ +(parse '(if (zero? 42) 1 2)) +(parse '(if (zero? (sub1 1)) (add1 2) (sub1 7))) +(parse '(if (zero? 0) (if (zero? 1) 2 3) 4))] + @section{Meaning of Con programs} @@ -85,7 +86,7 @@ Let's consider some examples (using concrete notation): ] -The semantics is inductively defined as before. There are @emph{two} +@;{The semantics is inductively defined as before. There are @emph{two} new rules added for handling if-expressions: one for when the test expression means @racket[0] and one for when it doesn't. @@ -160,19 +161,14 @@ according to @render-term[C 𝑪𝒓]: @(show-judgment 𝑪 0 1) } +} -The interpreter has an added case for if-expressions, which -recursively evaluates the test expression and branches based on its -value. +The semantics is defined by extending the interpreter to add a case +for if-expressions, which recursively evaluates the test expression +and branches based on its value. @codeblock-include["con/interp.rkt"] -We've also made one trivial change, which is to move @racket[interp-prim1] to its -own module. This will be useful in the future when more primitive operations are -added, we won't have to clutter up the interpreter: - -@codeblock-include["con/interp-prim.rkt"] - We can confirm the interpreter computes the right result for the examples given earlier (using @racket[parse] to state the examples with concrete notation): @@ -184,9 +180,6 @@ with concrete notation): (interp (parse '(if (zero? (add1 0)) (add1 2) (if (zero? (sub1 1)) 1 0)))) ] -The argument for the correctness of the interpreter follows the same -structure as for @seclink["Blackmail"]{Blackmail}, but with an added case for -if-expressions. @section{An Example of Con compilation} @@ -247,14 +240,14 @@ this, we arrive at the following code for the compiler: @racketblock[ (let ((l0 (gensym 'if)) (l1 (gensym 'if))) - (append (compile-e e1) - (list (Cmp 'rax 0) - (Je l0)) - (compile-e e3) - (list (Jmp l1) - (Label l0)) - (compile-e e2) - (list (Label l1)))) + (seq (compile-e e1) + (Cmp 'rax 0) + (Je l0) + (compile-e e3) + (Jmp l1) + (Label l0) + (compile-e e2) + (Label l1))) ] @@ -266,11 +259,6 @@ The complete compiler code is: @codeblock-include["con/compile.rkt"] -Mirroring the change we made to the interpreter, we separate out a -module for compiling primitives: - -@codeblock-include["con/compile-ops.rkt"] - Let's take a look at a few examples: @ex[ (define (show s) @@ -284,13 +272,10 @@ Let's take a look at a few examples: And confirm they are running as expected: @ex[ -(define (tell s) - (asm-interp (compile (parse s)))) - -(tell '(if (zero? 8) 2 3)) -(tell '(if (zero? 0) 1 2)) -(tell '(if (zero? 0) (if (zero? 0) 8 9) 2)) -(tell '(if (zero? (if (zero? 2) 1 0)) 4 5)) +(exec (parse '(if (zero? 8) 2 3))) +(exec (parse '(if (zero? 0) 1 2))) +(exec (parse '(if (zero? 0) (if (zero? 0) 8 9) 2))) +(exec (parse '(if (zero? (if (zero? 2) 1 0)) 4 5))) ] @@ -298,17 +283,12 @@ And confirm they are running as expected: The statement of correctness follows the same outline as before: -@bold{Compiler Correctness}: @emph{For all expressions @racket[e] and -integers @racket[i], if (@racket[e],@racket[i]) in @render-term[C 𝑪], -then @racket[(asm-interp (compile e))] equals @racket[i].} +@bold{Compiler Correctness}: @emph{For all @racket[e] @math{∈} @tt{Expr}, +@racket[(interp e)] equals @racket[(exec e)].} Again, we formulate correctness as a property that can be tested: -@ex[ -(define (check-compiler e) - (check-equal? (asm-interp (compile e)) - (interp e) - e))] +@codeblock-include["con/correct.rkt"] Generating random Con programs is essentially the same as Blackmail programs, and are provided in a @link["con/random.rkt"]{random.rkt} @@ -323,3 +303,7 @@ module. (for ([i (in-range 10)]) (check-compiler (random-expr))) ] + +This compiler has continues to have the issues identified in +@secref{broken}, but appears correct in its implementation of +conditional expressions. \ No newline at end of file diff --git a/www/notes/dodger.scrbl b/www/notes/dodger.scrbl index 83945c16..ae85114a 100644 --- a/www/notes/dodger.scrbl +++ b/www/notes/dodger.scrbl @@ -15,7 +15,7 @@ @(ev '(require rackunit a86)) @(for-each (λ (f) (ev `(require (file ,(path->string (build-path langs "dodger" f)))))) - '("interp.rkt" "compile.rkt" "ast.rkt" "parse.rkt" "types.rkt")) + '("main.rkt" "random.rkt" "correct.rkt")) @title[#:tag "Dodger"]{Dodger: addressing a lack of character} @@ -55,6 +55,12 @@ The s-expression parser is defined as follows: @codeblock-include["dodger/parse.rkt"] +@ex[ +(parse #\a) +(parse '(char? #\λ)) +(parse '(char->integer #\λ)) +(parse '(integer->char 97))] + @section{Characters in Racket} @@ -98,8 +104,6 @@ system, described below, takes care of printing them. @section{Meaning of Dodger programs} -The semantics are omitted for now (there's really nothing new that's interesting). - The interpeter is much like that of Dupe, except we have a new base case: @codeblock-include["dodger/interp.rkt"] @@ -154,41 +158,40 @@ We can use the following encoding scheme: Notice that each kind of value is disjoint. -We can write an interpreter that operates at the level of bits just as we did for Dupe; -notice that it only ever constructs characters at the very end when converting from -bits to values. Let's first define our bit encodings: +We can write down functions for encoding into and decoding out of bits: @codeblock-include["dodger/types.rkt"] -And now the interpreter: - -@codeblock-include["dodger/interp-bits.rkt"] - - @section{A Compiler for Dodger} -Compilation is pretty easy, particularly since we took the -time to develop the bit-level interpreter. The compiler uses -the same bit-level representation of values and uses logical -operations to implement the same bit manipulating operations. -Most of the work happens in the compilation of primitives: +Compilation is pretty easy. The compiler uses the bit-level +representation of values described earlier and uses logical operations +to implement the bit manipulating operations. Most of the work +happens in the compilation of primitives: @codeblock-include["dodger/compile-ops.rkt"] -The top-level compiler for expressions now has a case for character -literals, which are compiled like other kinds of values: +In fact the @racket[compile] is identical to its Dodger predecessor, +since all the new work is done in @racket[value->bits] and +@racket[compile-op1]: @codeblock-include["dodger/compile.rkt"] We can take a look at a few examples: @ex[ -(define (show e) - (displayln (asm-string (compile-e (parse e))))) -(show '#\a) -(show '#\λ) -(show '(char->integer #\λ)) -(show '(integer->char 97))] +(compile-e (parse #\a)) +(compile-e (parse #\λ)) +(compile-e (parse '(char->integer #\λ))) +(compile-e (parse '(integer->char 97)))] + +We can run them: + +@ex[ +(exec (parse #\a)) +(exec (parse #\λ)) +(exec (parse '(char->integer #\λ))) +(exec (parse '(integer->char 97)))] @section{A Run-Time for Dodger} @@ -210,13 +213,4 @@ the case of printing characters: @filebox-include[fancy-c dodger "print.c"] -Will these pieces in place, we can try out some examples: - -@ex[ - (define (run e) - (bits->value (asm-interp (compile (parse e))))) - (run '#\a) - (run '(integer->char (add1 (char->integer #\a)))) - (run '(integer->char 955)) -] - +@;{FIXME: examples should be creating executable at the command-line, not exec.} \ No newline at end of file diff --git a/www/notes/dupe.scrbl b/www/notes/dupe.scrbl index 5c52a18c..20535ce0 100644 --- a/www/notes/dupe.scrbl +++ b/www/notes/dupe.scrbl @@ -18,7 +18,7 @@ @(ev '(require rackunit a86)) @(for-each (λ (f) (ev `(require (file ,(path->string (build-path langs "dupe" f)))))) - '("interp.rkt" "interp-prim.rkt" "compile.rkt" "ast.rkt" "parse.rkt" "random.rkt" "types.rkt")) + '("main.rkt" "random.rkt" "correct.rkt")) @title[#:tag "Dupe"]{Dupe: a duplicity of types} @@ -39,7 +39,7 @@ To start, we will consider two: integers and booleans. We'll call it @bold{Dupe}. We will use the following syntax, which relaxes the syntax of Con to -make @racket['(zero? _e)] its own expression form and conditionals can +make @racket[(zero? _e)] its own expression form and conditionals can now have arbitrary expressions in test position: @racket[(if _e0 _e1 _e2)] instead of just @racket[(if (zero? _e0) _e1 _e2)]. We also add syntax for boolean literals. @@ -50,21 +50,18 @@ Together this leads to the following grammar for concrete Dupe. And abstract Dupe: -@centered{@render-language[D]} - - -One thing to take note of is the new nonterminal @math{v} -which ranges over @bold{values}, which are integers and -booleans. - -Abstract syntax is modelled with the following datatype definition: - @codeblock-include["dupe/ast.rkt"] The s-expression parser is defined as follows: @codeblock-include["dupe/parse.rkt"] +@ex[ +(parse '#t) +(parse '(if #t 1 2)) +(parse '(zero? 8)) +(parse '(if (zero? 8) 1 2))] + @section{Meaning of Dupe programs} To consider the meaning of Dupe programs, we must revisit the meaning @@ -117,7 +114,7 @@ Languages adopt several approaches: We are going to start by taking the last approach. Later we can reconsider the design, but for now this is a simple approach. - +@;{ The semantics is still a binary relation between expressions and their meaning, however the type of the relation is changed to reflect the values of Dupe, which may either be integers or booleans: @@ -179,66 +176,72 @@ meaning to an @racket[(Prim1 'add1 _e0)] expression and it's premise is that @math{(@racket[(Lit #f)], i) ∉ 𝑫} for any @math{i}. So there's no value @math{v} such that @math{(@racket[(Prim1 'add1 (Lit #f))], v) ∈ 𝑫}. This expression is @bold{undefined} according to the semantics. +} - -The interpreter follows the rules of the semantics closely and is -straightforward: +The interpreter follows a similar pattern to what we've done so far, +although notice that the result of interpretation is now a @tt{Value}: +either a boolean or an integer: @codeblock-include["dupe/interp.rkt"] -And the interpretation of primitives closely matches @math{𝑫-𝒑𝒓𝒊𝒎}: +The interpretation of primitives is extended to account for the new +@racket[zero?] primitive: @codeblock-include["dupe/interp-prim.rkt"] + We can confirm the interpreter computes the right result for the examples given earlier: @ex[ -(interp (Lit #t)) -(interp (Lit #f)) -(interp (If (Lit #f) (Lit 1) (Lit 2))) -(interp (If (Lit #t) (Lit 1) (Lit 2))) -(interp (If (Lit 0) (Lit 1) (Lit 2))) -(interp (If (Lit 7) (Lit 1) (Lit 2))) -(interp (If (Prim1 'zero? (Lit 7)) (Lit 1) (Lit 2))) -] +(interp (parse #t)) +(interp (parse #f)) +(interp (parse '(if #f 1 2))) +(interp (parse '(if #t 1 2))) +(interp (parse '(if 0 1 2))) +(interp (parse '(if 7 1 2))) +(interp (parse '(if (zero? 7) 1 2)))] + +@section{(Lack of) Meaning for some Dupe programs} -Correctness follows the same pattern as before, although it is worth -keeping in mind the ``hypothetical'' form of the statement: @emph{if} -the expression has some meaning, then the interpreter must produce it. -In cases where the semantics of the expression is undefined, the -interpreter can do whatever it pleases; there is no specification. +Viewed as a specification, what is this interpreter saying about programs that +do nonsensical things like @racket[(add1 #f)]? +First, let's revise the statement of compiler correctness to reflect +the fact that @racket[interp] can return different kinds of values: -@bold{Interpreter Correctness}: @emph{For all Dupe expressions -@racket[e] and values @racket[v], if (@racket[e],@racket[v]) in -@render-term[D 𝑫], then @racket[(interp e)] equals +@bold{Compiler Correctness}: @emph{For all @racket[e] @math{∈} +@tt{Expr} and @racket[v] @math{∈} @tt{Value}, if @racket[(interp e)] +equals @racket[v], then @racket[(exec e)] equals @racket[v].} -Consider what happens with @racket[interp] on undefined programs such -as @racket[(add1 #f)]: the interpretation of this expression is just -the application of the Racket @racket[add1] function to @racket[#f], -which results in the @racket[interp] program crashing and Racket -signalling an error: +Now, the thing to notice here is that this specification only +obligates the compiler to produce a result consistent with the +interpreter when the interpreter produces a @emph{value}. It says +nothing about what the compiler must produce if the intepreter fails +to produce a value, which is exactly what happens when the interpreter +is run on examples like @racket[(add1 #f)]: @ex[ -(eval:error (interp (Prim1 'add1 (Lit #f)))) -] - -This isn't a concern for correctness, because the interpreter is free -to crash (or do anything else) on undefined programs; it's not in -disagreement with the semantics, because there is no semantics. - -From a pragmatic point of view, this is a concern because it -complicates testing. If the interpreter is correct and every -expression has a meaning (as in all of our previous languages), it -follows that the interpreter can't crash on any @tt{Expr} input. That -makes testing easy: think of an expression, run the interpreter to -compute its meaning. But now the interpreter may break if the -expression is undefined. We can only safely run the interpreter on -expressions that have a meaning. - -We'll return to this point in the design of later langauges. +(eval:error (interp (parse '(add1 #f))))] + +Since the intepreter does not return a value on such an input, the +meaning of such expression is undefined and the compiler is +unconstrained and may do whatever it likes: crash at compile-time, +crash at run-time, return 7, format your hard drive; there are no +wrong answers. (These possibilities hopeful make clear while +undefined behavior opens up all kinds of bad outcomes and has been the +source of many costly computing failures and vulnerabilities and why +as a language designer, leaving the behavior of some programs +undefined may not be a great choice. As a compiler implementor, +however, it makes our life easier: we simply don't need to worry about +what happens on these kinds of programs.) + +This will, however complicate testing the correctness of the compiler, +which is addressed in @secref["Correctness_and_testing"]. + +Let's now turn to the main technical challenge introduced by the Dupe +langauge. @section{Ex uno plures: Out of One, Many} @@ -253,30 +256,98 @@ as a 64-bit integer. But we put off worrying about this until later.) The problem now is how to represent integers @emph{and} booleans, which should be @bold{disjoint} sets of values. Representing these -things in the interpreter as Racket values was easy: we used booleans -and integers. Representing this things in x86 will be more -complicated. x86 doesn't have a notion of ``boolean'' per se and it -doesn't have a notion of ``disjoint'' datatypes. There is only one -data type and that is: bits. - -We chose 64-bit integers to represent Con integers, because that's the -kind of value that can be stored in a register or used as an argument -to a instruction. It's the kind of thing that can be returned to the -C run-time. It's (more or less) the only kind of thing we have to -work with. So had we started with booleans instead of integers, we -still would have represented values as a sequence of bits -@emph{because that's all there is}. Now that we have booleans -@emph{and} integers, we will have to represent both as bits (64-bit -integers). The bits will have to encode both the value and the -@emph{type} of value. - -@margin-note{As discussed in the lecture video, there are many possible ways of -representing multiple types, this is just how we've decided to do it.} +things in the interpreter as Racket values was easy: we Racket +booleans to represent Dupe booleans; we used Racket integers to +represent Dupe integers. Since Racket booleans and integers are +disjoint types, everything was easy and sensible. + +Representing this things in x86 will be more complicated. x86 doesn't +have a notion of ``boolean'' per se and it doesn't have a notion of +``disjoint'' datatypes. There is only one data type and that is: +bits. + +To make the problem concrete, consider the Dupe expression +@racket[5]; we know the compiler is going to emit an single instruction +that moves this value into the @racket[rax] register: + +@ex[ +(Mov 'rax 5)] + +But now consider @racket[#t]. The compiler needs to emit an +instruction that moves ``@racket[#t]'' into @racket[rax], but the +@racket[Mov] instruction doesn't take booleans: + +@ex[ +(eval:error (Mov 'rax #t))] + +We have to move some 64-bit integer into @racket[rax], but the +question is: which one? + +The immediate temptation is to just pick a couple of integers, one for +representing @racket[#t] and one for @racket[#f]. We could follow the +C tradition and say @racket[#f] will be @racket[0] and @racket[#t] +will be 1. So compiling @racket[#t] would emit: + +@ex[ +(Mov 'rax 1)] + +And compiling @racket[#f] would emit: + +@ex[ +(Mov 'rax 0)] + +Seems reasonable. Well except that the specification of @racket[if] +in our interpreter requires that @racket[(if 0 1 2)] evaluates to +@racket[1] and @racket[(if #f 1 2)] evaluates to @racket[2]. But +notice that @racket[0] and @racket[#f] compile to exactly the same +thing under the above scheme. How could the code emitted for +@racket[if] possibly distinguish between whether the test expression +produced @racket[#f] or @racket[0] if they are represented by the same +bits? + +So the compiler is doomed to be incorrect on some expressions under +this scheme. Maybe we should revise our choice and say @racket[#t] +will be represented as @racket[0] and @racket[#f] should be +@racket[1], but now the compiler must be incorrect either on +@racket[(if #f 1 2)] or @racket[(if 1 1 2)]. We could choose +different integers for the booleans, but there's no escaping it: just +picking some bits for @racket[#t] and @racket[#f] without also +changing the representation of integers is bound to fail. + +Here's another perspective on the same problem. Our compiler emits +code that leaves the value of an expression in the @racket[rax] +register, which is how the value is returned to the run-time system +that then prints the result. Things were easy before having only +integers: the run-time just prints the integer. But with booleans, +the run-time will need to print either @tt{#t} or @tt{#f} if the +result is the true or false value. But if we pick integers to +represent @racket[#t], how can the run-time know whether it should +print the result as an integer or as @tt{#t}? It can't! + +The fundamental problem here is that according to our specification, +@racket[interp], these values need to be disjoint. No value can be +both an integer @emph{and} a boolean. Yet in x86 world, only one kind +of thing exists: bits, so how can we make values of different kinds? + +Ultimately, the only solution is to incorporate an explicit +representation of the @emph{type} of a value and use this information +to distinguish between values of different type. We could do this in +a number of ways: we could desginate another register to hold (a +representation of) the type of the value in @racket[rax]. This would +mean you couldn't know what the value in @racket[rax] meant without +consulting this auxiliary register. In essence, this is just using +more than 64-bits to represent values. Alternatively, we could +instead encode this information within the 64-bits so that only a +single register is needed to completely determine a value. This does +come at a cost though: if some bits are needed to indicate the type, +there are fewer bits for the values! Here is the idea of how this could be done: We have two kinds of data: integers and booleans, so we could use one bit to indicate whether a value is a boolean or an integer. The remaining 63 bits can be used to represent the value itself, either true, false, or some integer. +There are other approaches to solving this problem, but this is a +common approach called @emph{type-tagging}. Let's use the least significant bit to indicate the type and let's use @binary[type-int] for integer and @@ -298,9 +369,29 @@ is represented by the number @racket[#,(value->bits #f)] One nice thing about our choice of encoding: @racket[0] is represented as @racket[0] (@binary[0 2]). + +To encode a value as bits: + +@itemlist[ + +@item{If the value is an integer, shift the value to the left one bit. (Mathematically, this has the effect of doubling the number.)} +@item{If the value is a boolean, + @itemlist[ + @item{if it's the boolean @racket[#t], encode as @racket[#,(value->bits #t)],} + @item{if it's the value @racket[#f], encode as @racket[#,(value->bits #f)].}]}] + +To decode bits as a value: + +@itemlist[ +@item{If the least significant bit is @racket[0], shift to the right one bit. (Mathematically, this has the effect of halving the number.)} +@item{If the bits are @racket[#,(value->bits #t)], decode to @racket[#t].} +@item{If the bits are @racket[#,(value->bits #f)], decode to @racket[#f].} +@item{All other bits don't encode a value.}] + + If you wanted to determine if a 64-bit integer represented -an integer or a boolean, you simply need to inquire about -the value of the least significant bit. At a high-level, +an integer value or a boolean value, you simply need to inquire about +the value of the least significant bit. Mathematically, this just corresponds to asking if the number is even or odd. Odd numbers end in the bit (@binary[1]), so they reprepresent booleans. Even numbers represent integers. Here @@ -347,7 +438,11 @@ We can also write the inverse: ) The interpreter operates at the level of @tt{Value}s. The compiler -will have to work at the level of @tt{Bits}. Of course, we could, +will have to work at the level of @tt{Bits}. + + +@;{ +Of course, we could, as an intermediate step, define an interpreter that works on bits, which may help us think about how to implement the compiler. @@ -817,13 +912,12 @@ Notice the last two examples. What's going on? The @racket[interp.v2] function is also a correct interpreter for Dupe, and importantly, it sheds light on how to implement the compiler since it uses the same representation of values. +} @section{An Example of Dupe compilation} The most significant change from Con to Dupe for the compiler is the -change in representation, but having sorted those issues out at the -level of @racket[interp-bits], it should be pretty easy to write the -compiler. +change in representation of values. Let's consider some simple examples: @@ -831,30 +925,36 @@ Let's consider some simple examples: @item{@racket[42]: this should compile just like integer literals before, but needs to use the new representation, i.e. the compiler -should produce @racket[(Mov 'rax 84)], which is @racket[(* 42 2)].} +should produce @racket[(Mov 'rax #,(value->bits 42))], which is +@racket[42] shifted to the left @racket[#,int-shift]-bit.} @item{@racket[#f]: this should produce @racket[(Mov 'rax #,(value->bits #f))].} @item{@racket[#t]: this should produce @racket[(Mov 'rax #,(value->bits #t))].} @item{@racket[(add1 _e)]: this should produce the instructions for -@racket[_e] followed by an instruction to add @racket[#,(value->bits 1)], which is -just how @racket[interp-bits] interprets an @racket[add1].} +@racket[_e], which when executed would leave @emph{the encoding of the +value of @racket[_e]} in the @racket[rax] register. To these +instructions, the compiler needs to append instructions that will +leave the encoding of one more than the value of @racket[_e] in +@racket[rax]. In other words, it should add @racket[#,(value->bits +1)] to @racket[rax]!} @item{@racket[(sub1 _e)]: should work like @racket[(add1 _e)] but subtracting @racket[#,(value->bits 1)].} -@item{@racket[(zero? _e)]: this should produce the - instructions for @racket[_e] followed by instructions that - compare @racket['rax] to 0 and set @racket['rax] to - @racket[#t] (i.e. @binary[(value->bits #t) 2]) if true and - @racket[#f] (i.e. @binary[(value->bits #f) 2]) otherwise. +@item{@racket[(zero? _e)]: this should produce the instructions for + @racket[_e] followed by instructions that compare @racket[rax] to + the encoding of the value @racket[0], which is just the bits + @racket[0], and set @racket[rax] to @racket[#t] + (i.e. @binary[(value->bits #t) 2]) if true and @racket[#f] + (i.e. @binary[(value->bits #f) 2]) otherwise. This is a bit different from what we saw with Con, which combined conditional execution with testing for equality to @racket[0]. Here there is no need to @emph{jump} anywhere based on whether @racket[_e] produces @racket[0] or not. Instead we want to move either the -encoding of @racket[#t] or @racket[#f] into @racket['rax] depending on +encoding of @racket[#t] or @racket[#f] into @racket[rax] depending on what @racket[_e] produces. To accomplish that, we can use a new kind of instruction, the @bold{conditional move} instruction: @racket[Cmov]. @@ -864,7 +964,7 @@ of instruction, the @bold{conditional move} instruction: @racket[Cmov]. compiling each subexpression, generating some labels and the appropriate comparison and conditional jump. The only difference is we now want to compare the result of executing @racket[_e0] with -@racket[#f] (i.e. @binary[(value->bits #f) 2]) and jumping to the code for @racket[_e2] when +(the encoding of the value) @racket[#f] (i.e. @binary[(value->bits #f) 2]) and jumping to the code for @racket[_e2] when they are equal.} ] @@ -880,7 +980,14 @@ they are equal.} @section{A Compiler for Dupe} -Based on the examples, we can write the compiler: +Based on the examples, we can write the compiler. Notice that the +compiler uses the @racket[value->bits] function we wrote earlier for +encoding values as bits. This helps make the code more readable and +easier to maintain should the encoding change in the future. But it's +important to note that this function is used only at compile-time. By +the time the assemble code executes (i.e. run-time) the +@racket[value->bits] function (and indeed all Racket functions) no +longer exists. @codeblock-include["dupe/compile.rkt"] @@ -905,34 +1012,43 @@ but you'll notice the results are a bit surprising: The reason for this is @racket[asm-interp] doesn't do any interpretation of the bits it gets back; it is simply -producing the integer that lives in @racket['rax] when the +producing the integer that lives in @racket[rax] when the assembly code finishes. This suggests adding a call to @racket[bits->value] can be added to interpret the bits as values: @ex[ -(define (interp-compile e) - (bits->value (asm-interp (compile e)))) - -(interp-compile (Lit #t)) -(interp-compile (Lit #f)) -(interp-compile (parse '(zero? 0))) -(interp-compile (parse '(zero? -7))) -(interp-compile (parse '(if #t 1 2))) -(interp-compile (parse '(if #f 1 2))) -(interp-compile (parse '(if (zero? 0) (if (zero? 0) 8 9) 2))) -(interp-compile (parse '(if (zero? (if (zero? 2) 1 0)) 4 5))) -] +(bits->value (asm-interp (compile (Lit #t))))] + +Which leads us to the following definition of @racket[exec]: +@codeblock-include["dupe/exec.rkt"] +@ex[ +(exec (parse #t)) +(exec (parse #f)) +(exec (parse '(zero? 0))) +(exec (parse '(zero? -7))) +(exec (parse '(if #t 1 2))) +(exec (parse '(if #f 1 2))) +(exec (parse '(if (zero? 0) (if (zero? 0) 8 9) 2))) +(exec (parse '(if (zero? (if (zero? 2) 1 0)) 4 5)))] The one last peice of the puzzle is updating the run-time system to incorporate the new representation. The run-time system is essentially playing the role of @racket[bits->value]: it determines what is being represented and prints it appropriately. -For the run-time system, we define the bit representations in a header -file corresponding to the definitions given in @tt{types.rkt}: +@section{Updated Run-time System for Dupe} + +Any time there's a change in the representation or set of values, +there's going to be a required change in the run-time system. From +Abscond through Con, there were no such changes, but now we have to +udpate our run-time system to reflect the changes made to values in +Dupe. + +We define the bit representations in a header file corresponding to +the definitions given in @tt{types.rkt}: @filebox-include[fancy-c dupe "types.h"] @@ -965,68 +1081,112 @@ type of the result and print accordingly: @section{Correctness and testing} -We can randomly generate Dupe programs. The problem is many randomly -generated programs will have type errors in them: +We already established our definition of correctness: + +@bold{Compiler Correctness}: @emph{For all @racket[e] @math{∈} +@tt{Expr} and @racket[v] @math{∈} @tt{Value}, if @racket[(interp e)] +equals @racket[v], then @racket[(exec e)] equals +@racket[v].} + +As a starting point for testing, we can consider a revised version of +@racket[check-compiler] that also uses @racket[bits->value]: @ex[ -(eval:alts (require "random.rkt") (void)) -(random-expr) -(random-expr) -(random-expr) -(random-expr) -(random-expr) -(random-expr) -] +(define (check-compiler* e) + (check-equal? (interp e) + (exec e)))] + +This works just fine for meaningful expressions: -When interpreting programs with type errors, we get @emph{Racket} -errors, i.e. the Racket functions used in the implementation of the -interpreter will signal an error: @ex[ -(eval:error (interp (parse '(add1 #f)))) -(eval:error (interp (parse '(if (zero? #t) 7 8)))) -] +(check-compiler* (parse #t)) +(check-compiler* (parse '(add1 5))) +(check-compiler* (parse '(zero? 4))) +(check-compiler* (parse '(if (zero? 0) 1 2)))] + +But, as discussed earlier, the hypothetical form of the compiler +correctness statement is playing an important role: @emph{if} the +interpreter produces a value, the compiler must produce code that when +run produces the (representation of) that value. But if the +interpeter does not produce a value, all bets are off. + +From a testing perspective, this complicates how we use the +interpreter to test the compiler. If every expression has a meaning +according to @racket[interp] (as in all of our previous languages), it +follows that the interpreter cannot crash on any @tt{Expr} input. +That makes testing easy: think of an expression, run the interpreter +to compute its meaning and compare it to what running the compiled +code produces. But now the interpreter may break if the expression is +undefined. We can only safely run the interpreter on expressions that +have a meaning. + +This means the above definition of @racket[check-compiler] won't +suffice, because this function crashes on inputs like @racket[(add1 +#f)], even though such an example doesn't actually demonstrate a +problem with the compiler: -On the other hand, the compiler may produce bits that are illegal +@ex[ +(check-compiler* (parse '(add1 #f)))] + +To overcome this issue, we can take advantage of Racket's exception +handling mechanism to refine the @racket[check-compiler] function: + +@codeblock-include["dupe/correct.rkt"] + +This version installs an exception handler around the call to +@racket[interp] and in case the interpreter raises an exception (such +as when happens when interpreting @racket[(add1 #f)]), then the +handler simply returns the exception itself. + +The function then guards the @racket[check-equal?] test so that the +test is run @emph{only} when the result was not an exception. (It's +important that the compiler is not run within the exception handler +since we don't want the compiler crashing to circumvent the test: if +the interpreter produces a value, the compiler is not allowed to +crash!) + +We can confirm that the compiler is still correct on meaningful expressions: + +@ex[ +(check-compiler (parse #t)) +(check-compiler (parse #f)) +(check-compiler (parse '(if #t 1 2))) +(check-compiler (parse '(if #f 1 2))) +(check-compiler (parse '(if 0 1 2))) +(check-compiler (parse '(if 7 1 2))) +(check-compiler (parse '(if (zero? 7) 1 2)))] + + +For meaningless expressions, the compiler may produce bits that are illegal or, even worse, simply do something by misinterpreting the meaning of the bits: @ex[ -(eval:error (interp-compile (parse '(add1 #f)))) -(interp-compile (parse '(if (zero? #t) 7 8))) +(eval:error (exec (parse '(add1 #f)))) +(exec (parse '(if (zero? #t) 7 8))) ] -@;codeblock-include["dupe/correct.rkt"] +Yet these are not counter-examples to the compilers correctness: -This complicates testing the correctness of the compiler. Consider -our usual appraoch: @ex[ -(define (check-correctness e) - (check-equal? (interp-compile e) - (interp e))) +(check-compiler (parse '(add1 #f))) +(check-compiler (parse '(if (zero? #t) 7 8)))] -(check-correctness (parse '(add1 7))) -;;(eval:error (check-correctness (parse '(add1 #f)))) -] - -This isn't a counter-example to correctness because @racket['(add1 -#f)] is not meaningful according to the semantics. Consequently the -interpreter and compiler are free to do anything on this input. -Since we know Racket will signal an error when the interpreter tries -to interpret a meaningless expression, we can write an alternate -@racket[check-correctness] function that first runs the interpreter -with an exception handler installed. Should an error occur, -the test is ignored, otherwise the value produced is compared -to that of the compiler: +With this set up, we can randomly generate Dupe programs and throw +them at @racket[check-compiler]. Many randomly generated programs will +have type errors in them, but with our revised version +@racket[check-compiler], it won't matter (although it's worth thinking about +how well this is actually testing the compiler). @ex[ -(define (check-correctness e) - (with-handlers ([exn:fail? void]) - (let ((v (interp e))) - (check-equal? v (interp-compile e))))) - -(check-correctness (parse '(add1 7))) -(check-correctness (parse '(add1 #f))) +(eval:alts (require "random.rkt") (void)) +(random-expr) +(random-expr) +(random-expr) +(random-expr) +(random-expr) +(random-expr) +(for ([i (in-range 10)]) + (check-compiler (random-expr))) ] -Using this approach, we check the equivalence of the results only when -the interpreter runs without causing an error. diff --git a/www/notes/evildoer.scrbl b/www/notes/evildoer.scrbl index 7f771bdb..3ae9feb8 100644 --- a/www/notes/evildoer.scrbl +++ b/www/notes/evildoer.scrbl @@ -4,6 +4,7 @@ @(require redex/pict racket/runtime-path scribble/examples + evildoer/types "../fancyverb.rkt" "utils.rkt" "ev.rkt" @@ -19,7 +20,7 @@ @(ev '(require rackunit a86)) @(for-each (λ (f) (ev `(require (file ,(path->string (build-path langs "evildoer" f)))))) - '("interp.rkt" "interp-io.rkt" "compile.rkt" "ast.rkt" "parse.rkt")) + '("main.rkt" "compile-ops.rkt" "correct.rkt")) @(ev `(current-directory ,(path->string (build-path langs "evildoer")))) @(void (ev '(with-output-to-string (thunk (system "make runtime.o"))))) @@ -154,6 +155,16 @@ The s-expression parser is defined as follows: @codeblock-include["evildoer/parse.rkt"] +@ex[ +(parse 'eof) +(parse '(void)) +(parse '(read-byte)) +(parse '(peek-byte)) +(parse '(write-byte 97)) +(parse '(eof-object? eof)) +(parse '(begin (write-byte 97) + (write-byte 98)))] + @section{Reading and writing bytes in Racket} @@ -297,16 +308,6 @@ can then use to assert the expected behavior: @section{Meaning of Evildoer programs} -Formulating the semantics of Evildoer is more complicated -than the languages we've developed so far. Let's put it off -for now and instead focus on the interpreter, which remains -basically as simple as before. The reason for this disparity -is that math doesn't have side-effects. The formal semantics -will need to account for effectful computations without -itself having access to them. Racket, on the other hand, can -model effectful computations directly as effectful Racket -programs. - Here's an interpreter for Evildoer: @codeblock-include["evildoer/interp.rkt"] @@ -328,8 +329,9 @@ read and write: (interp (parse '(write-byte (read-byte)))))) ] -We can also build a useful utility for interpreting programs -with strings representing stdin and stdout: +Using @racket[with-input-from-string] and +@racket[with-output-to-string], we can also build a useful utility for +interpreting programs with strings representing stdin and stdout: @codeblock-include["evildoer/interp-io.rkt"] @@ -347,6 +349,7 @@ computation into a pure one: (cons (void) "h")) ] +@;{ OK, so now, what about the formal mathematical model of Evildoer? We have to reconsider the domain of program meanings. No longer does an expression just mean a value; @@ -366,251 +369,201 @@ facility, but instead of capturing the effects with string ports, we will define the meaning of effects directly. (Semantics omitted for now.) +} -@section{A Run-Time for Evildoer} +@section{Encoding values in Evildoer} -With new values comes the need to add new bit encodings. So -we add new encodings for @racket[eof] and @racket[void]: +With new values, namely the void and eof values, comes the need to add +new bit encodings. So we add new encodings for @racket[eof] and +@racket[void], for which we simply pick two unused bit patterns: +@binary[(value->bits eof)] and @binary[(value->bits (void))], +respectively. -@filebox-include[fancy-c evildoer "types.h"] +@codeblock-include["evildoer/types.rkt"] -The main run-time file is extended slightly to take care of -printing the new kinds of values (eof and void). Note that a -void result causes nothing to be printed: +@section[#:tag "calling-c"]{Detour: Calling external functions} -@filebox-include[fancy-c evildoer "runtime.h"] -@filebox-include[fancy-c evildoer "main.c"] +Some aspects of the Evildoer compiler will be straightforward, e.g., +adding @racket[eof], @racket[(void)], @racket[eof-object?], etc. +There's conceptually nothing new going on there. But what about +@racket[read-byte], @racket[write-byte] and @racket[peek-byte]? These +will require a new set of tricks to implement. -But the real novelty of the Evildoer run-time is that there -will be new functions that implement @racket[read-byte], -@racket[peek-byte], and @racket[write-byte]; these will be C -functions called @racket[read_byte], @racket[peek_byte] and -@racket[write_byte]: +We have a couple of options for how to approach these primitives: -@filebox-include[fancy-c evildoer "io.c"] +@itemlist[#:style 'ordered -The main novely of the @emph{compiler} will be that emits code -to make calls to these C functions. +@item{generate assembly code for issuing operating system calls to +do I/O operations, or} -@section[#:tag "calling-c"]{Calling C functions from a86} +@item{add C code for I/O primitives in the run-time and generate +assembly code for calling them.} -If you haven't already, be sure to read up on how calls work -in @secref{a86}. +] -Once you brushed up on how calls work, you'll know you can -define labels that behave like functions and call them. +The first option will require looking up details for system calls on +the particular operating system in use, generating code to make those +calls, and adding logic to check for errors. For the second option, +we can simply write C code that calls standard functions like +@tt{getc}, @tt{putc}, etc. and let the C compiler do the heavy lifting +of generating robust assembly code for calling into the operating +system. The compiler would then only need to generate code to call +those functions defined in the run-time system. This is the simpler +approach and the one we adopt. -Let's start by assuming we have a simple stand-in for the -run-time system, which is this C program that invokes an -assembly program with a label called @tt{entry} and prints -the result: +Up to this point, we've seen how C code can call code written in +assembly as though it were a C function. To go the other direction, +we need to explore how to make calls to functions written in C from +assembly. Let's look at that now. -@filebox-include[fancy-c evildoer "simple.c"] +@margin-note{If you haven't already, be sure to read up on how calls +work in @secref{a86}.} -Now, here is a little program that has a function called @tt{meaning} -that returns @tt{42}. The main entry point calls @tt{meaning}, -adds 1 to the result, and returns: -@ex[ -(define p - (prog (Global 'entry) - (Label 'entry) - (Call 'meaning) - (Add 'rax 1) - (Ret) - (Label 'meaning) - (Mov 'rax 42) - (Ret))) -] -Let's save it to a file called @tt{p.s}: +Once you brushed up on how calls work, you'll know you can +define labels that behave like functions and call them. -@ex[ - (with-output-to-file "p.s" - (λ () - (asm-display p)) - #:exists 'truncate)] -We can assemble it, link it together with the printer, and run it: +Instead of @racket[read-byte] and friends, let's first start with +something simpler. Imagine we want a function to compute the greatest +common divisor of two numbers. We could of course write such a +function in assembly, but it's convenient to be able to write it +in a higher-level language like C: + +@filebox-include[fancy-c evildoer "gcd.c"] + +We can compile this into an object file: @(define format (if (eq? (system-type 'os) 'macosx) "macho64" "elf64")) -@shellbox["gcc -c simple.c -o simple.o" - (string-append "nasm -f " format " p.s -o p.o") - "gcc simple.o p.o -o simple" - "./simple"] - -In this case, the @tt{meaning} label is defined @emph{within} the same -assembly program as @tt{entry}, although that doesn't have to be the case. -We can separate out the definition of @tt{meaning} into its own file, -so long as we declare in this one that @tt{meaning} is an external label: +@shellbox["gcc -c gcd.c -o gcd.o"] -@ex[ -(define p - (prog (Extern 'meaning) - (Global 'entry) - (Label 'entry) - (Call 'meaning) - (Add 'rax 1) - (Ret))) -(define life - (prog (Global 'meaning) - (Label 'meaning) - (Mov 'rax 42) - (Ret))) -] - -By declaring an external label, we're saying this program -makes use of that label, but doesn't define it. The -definition will come from a later phase where the program is -linked against another that provides the definition. - -There is an important invariant that has to be maintained -once these programs are moved into separate object files -though. According to the System V ABI, the stack address -must be aligned to 16-bytes before the call instruction. Not -maintaining this alignment can result in a segmentation -fault. Since the @racket[p] program is the one doing the -calling, it is the one that has to worry about the issue. - -Now keep in mind that the @racket[p] program is itself -called by the C program that prints the result. So when the -call @emph{to} @racket[p] was made, the stack was aligned. -In executing the @racket[Push] instruction, a word, which is -8-byte, was pushed. This means at the point that control -transfers to @racket['entry], the stack is not aligned to a -16-byte boundary. To fix the problem, we can push another -element to the stack, making sure to pop it off before -returning. We opt to decrement (remember the stack grows -toward low memory) and increment to make clear we're not -saving anything; this is just about alignment. The revised -@racket[p] program is: - -@ex[ -(define p - (prog (Extern 'meaning) - (Global 'entry) - (Label 'entry) - (Sub 'rsp 8) - (Call 'meaning) - (Add 'rax 1) - (Add 'rsp 8) - (Ret)))] +Now, how can we call @tt{gcd} from aseembly code? Just as there is a +convention that a return value is communicated through @racket[rax], +there are conventions governing the communication of arguments. The +conventions are known as an @bold{Application Binary Interface} or +ABI. The set of conventions we're following is called the +@bold{System V} ABI, and it used by Unix variants like Mac OS, Linux, +and BSD systems. (Windows follows a different ABI.) +The convention for arguments is that the first six integer +or pointer parameters are passed in the registers +@racket['rdi], @racket['rsi], @racket['rdx], @racket['rcx], +@racket['r8], @racket['r9]. Additional arguments and large +arguments such as @tt{struct}s are passed on the stack. -Now save each program in its nasm format: +So we will pass the two arguments of @tt{gcd} in registers +@racket[rdi] and @racket[rsi], respectively, then we use the +@racket[Call] instruction to call @tt{gcd}. Suppose we want to +compute @tt{gcd(36,60)}: @ex[ -(with-output-to-file "p.s" - (λ () - (asm-display p)) - #:exists 'truncate) -(with-output-to-file "life.s" - (λ () - (asm-display life)) - #:exists 'truncate)] - -And assemble: - -@shellbox[(string-append "nasm -f " format " p.s -o p.o") - (string-append "nasm -f " format " life.s -o life.o")] - - -Then we can link all the pieces together and run it: +(define p + (prog (Global 'entry) + (Label 'entry) + (Mov 'rdi 36) + (Mov 'rsi 60) + (Sub 'rsp 8) + (Extern 'gcd) + (Call 'gcd) + (Sal 'rax int-shift) + (Add 'rsp 8) + (Ret)))] + +A few things to notice in the above code: + +@itemlist[#:style 'ordered + +@item{The label @racket['gcd] is declared to be external with +@racket[Extern], this means the label is used but not defined in this +program. We placed this declaration immediately before the @racket[Call] +instruction, but it can appear anywhere in the program.} + +@item{The stack pointer register is decremented by @racket[8] before +the @racket[Call] instruction and then incremented by @racket[8] after +the call returns. This is to ensure the stack is aligned to 16-bytes +for call; a requirement of the System V ABI.} + +@item{For consistency with our run-time system, we return the result +encoded as an integer value, which is accomplished by shifting the +result in @racket[rax] to the left by @racket[#,int-shift].} -@shellbox["gcc simple.o p.o life.o -o simple" - "./simple"] - -Now if we look at @tt{life.s}, this is an assembly program -that defines the @tt{meaning} label. We defined it by -writing assembly code, but we could've just as easily -defined it in any other language that can compile to an -object file. So let's write it in C: +] +We could attempt to run this program with @racket[asm-interp], but it +will complain about @tt{gcd} being an undefined label: +@ex[ +(eval:error (bits->value (asm-interp p)))] -@filebox-include[fancy-c evildoer "life.c"] +The problem is that @racket[asm-interp] doesn't know anything about +the @tt{gcd.o} file, which defines the @tt{gcd} symbol, however, +there is a mechanism for linking in object files to the assembly +interprer: -We can compile it to an object file: +@ex[ +(current-objs '("gcd.o")) +(bits->value (asm-interp p))] -@shellbox["gcc -c life.c -o life.o"] +We also could create an executable using the run-time system. +To do this, first, let's save the assembly code to a file: -This object file will have a single globally visible label -called @tt{meaning}, just like our previous implementation. -@;{ -To confirm this, the standard @tt{nm} utility can be used to -list the defined symbols of an object file: +@ex[ + (with-output-to-file "p.s" + (λ () + (asm-display p)) + #:exists 'truncate)] -@shellbox["nm -j life.o"] -} -We can again link together the pieces and confirm that it -still produces the same results: +Now we can assemble it into an object file, link the objects together +to make an executable, and then run it: -@shellbox["gcc simple.o p.o life.o -o simple" - "./simple"] +@shellbox[(string-append "nasm -f " format " p.s -o p.o") + "gcc runtime.o gcd.o p.o -o p.run" + "./p.run"] +So now we've seen the essence of how to call functions from assembly +code, which opens up an implementation strategy for implementing +features: write C code as part of the run-time system and call it from +the compiled code. -At this point, we've written a little assembly program (@tt{ - p.s}) that calls a function named @tt{meaning}, that was -written in C. +@section{A Run-Time for Evildoer} -One thing that you can infer from this example is that the C -compiler generates code for @tt{meaning} that is like the -assembly code we wrote, namely it ``returns'' a value to the -caller by placing a value in @racket['rax]. +With new values comes the need to add new bit encodings. So +we add new encodings for @racket[eof] and @racket[void]: -The next natural question to ask is, how does an assembly -program provide arguments to the call of a C function? +@filebox-include[fancy-c evildoer "types.h"] -Just as there is a convention that a return value is -communicated through @racket['rax], there are conventions -governing the communication of arguments. The conventions -are known as an @bold{Application Binary Interface} or ABI. -The set of conventions we're following is called the @bold{ - System V} ABI, and it used by Unix variants like Mac OS, -Linux, and BSD systems. (Windows follows a different ABI.) +The interface for the run-time system is extended to include +file pointers for the input and output ports: -The convention for arguments is that the first six integer -or pointer parameters are passed in the registers -@racket['rdi], @racket['rsi], @racket['rdx], @racket['rcx], -@racket['r8], @racket['r9]. Additional arguments and large -arguments such as @tt{struct}s are passed on the stack. - -So now let's try calling a C function that takes a -parameter. Here we have a simple C function that doubles -it's input: +@filebox-include[fancy-c evildoer "runtime.h"] -@filebox-include[fancy-c evildoer "double.c"] +The main entry point for the run-time sets up the input and output +pointers to point to @tt{stdin} and @tt{stdout} and is updated +to handle the proper printing of a void result: -We can compile it to an object file: +@filebox-include[fancy-c evildoer "main.c"] -@shellbox["gcc -c double.c -o double.o"] +But the real novelty of the Evildoer run-time is that there +will be new functions that implement @racket[read-byte], +@racket[peek-byte], and @racket[write-byte]; these will be C +functions called @racket[read_byte], @racket[peek_byte] and +@racket[write_byte]: -Now, to call it, the assembly program should put the value -of its argument in @racket['rdi] before the call: +@filebox-include[fancy-c evildoer "io.c"] -@ex[ - (define q - (prog (Extern 'dbl) - (Global 'entry) - (Label 'entry) - (Mov 'rdi 21) - (Call 'dbl) - (Add 'rax 1) - (Ret))) -(with-output-to-file "q.s" - (λ () - (asm-display q)) - #:exists 'truncate)] +This functionality is implemented in terms of standard C library +functions @tt{getc}, @tt{ungetc}, @tt{putc} and the run-time system's +functions for encoding and decoding values such as +@tt{val_unwrap_int}, @tt{val_wrap_void}, etc. -We can assemble it into an object file: +As we'll see in the next section, the main novely of the +@emph{compiler} will be that emits code to make calls to these C +functions. -@shellbox[(string-append "nasm -f " format " q.s -o q.o")] -And linking everything together and running shows it works -as expected: -@shellbox["gcc simple.o q.o double.o -o simple" - "./simple"] +@;{ Now we have all the tools needed to interact with libraries @@ -627,29 +580,6 @@ result of calling @tt{dbl}. Now we need to save it before writing the argument. All we need to do is add a push and pop around the call: -@ex[ - (define q - (prog (Extern 'dbl) - (Global 'entry) - (Label 'entry) - (Sub 'rsp 8) - (Mov 'rdi 1) - (Push 'rdi) - (Mov 'rdi 21) - (Call 'dbl) - (Pop 'rdi) - (Add 'rax 'rdi) - (Add 'rsp 8) - (Ret))) -(with-output-to-file "q.s" - (λ () - (asm-display q)) - #:exists 'truncate)] - -@shellbox[(string-append "nasm -f " format " q.s -o q.o") - "gcc simple.o q.o double.o -o simple" - "./simple"] - The wrinkle is actually a bit deeper than this too. Suppose we are using other registers, maybe some that are not used for parameters, but nonetheless are registers that the @@ -687,6 +617,10 @@ registers. OK, now let's use these new powers to write the compiler. +} + + + @section{A Compiler for Evildoer} @@ -715,67 +649,110 @@ The primitive operation compiler: @codeblock-include["evildoer/compile-ops.rkt"] + +Notice how expressions like @racket[(read-byte)] and @racket[(write-byte)] +compile to calls into the run-time system: + +@ex[ +(compile-op0 'read-byte) +(compile-op1 'write-byte)] + + + +@section{Testing and correctness} + + + We can continue to interactively try out examples with @racket[asm-interp], although there are two issues we need to deal with. -The first is that the @racket[asm-interp] utility doesn't -know anything about the Evildoer run-time. Hence we need to -tell @racket[asm-interp] to link it in when running -an example; otherwise labels like @tt{byte_write} will be -undefined. +The first is that the @racket[asm-interp] utility doesn't know +anything about the Evildoer run-time. Hence we need to tell +@racket[asm-interp] to link it in when running an example; otherwise +labels like @tt{byte_write} will be undefined. We saw how to do this +in @secref["calling-c"] using the @racket[current-objs] parameter to +link in object files to @racket[asm-interp]. This time, the object +file we want to link in is the Evildoer run-time. + +The other is that we need to have an @racket[asm-interp/io] analog of +@racket[interp/io], i.e. we need to be able to redirect input and +output so that we can run programs in a functional way. The +@secref["a86"] library provides this functionality by providing +@racket[asm-interp/io]. The way this function works is @emph{if} +linked objects define an @tt{in} and @tt{out} symbol, it will set +these appropriately to read input from a given string and collect +output into a string. -The other is that we need to have an @racket[asm-interp/io] -counterpart that is analogous to @racket[interp/io], i.e. we -need to be able to redirect input and output so that we can -run programs in a functional way. +@ex[ +(current-objs '("runtime.o")) +(asm-interp/io (compile (parse '(write-byte (read-byte)))) "a")] -There is a parameter that @racket[asm-interp] uses called -@racket[current-objs] that can be used to add additional -object files to be linked against when running examples. +Notice though, that @racket[asm-interp/io] gives back a pair +consisting of the @emph{bits} and the output string. To match the +return type of @racket[interp/io] we need to convert the bits to a +value: + +@ex[ +(match (asm-interp/io (compile (parse '(write-byte (read-byte)))) "a") + [(cons b o) (cons (bits->value b) o)])] + +Using these pieces, we can write a function that matches the type signature +of @racket[interp/io]: -So for example, to make an example with the @tt{dbl} -function from before, we can do the following: +@codeblock-include["evildoer/exec-io.rkt"] @ex[ - (current-objs '("double.o")) - (asm-interp - (prog (Extern 'dbl) - (Global 'entry) - (Label 'entry) - (Mov 'rdi 21) - (Call 'dbl) - (Ret)))] - - -The other issue is bit uglier to deal with. We need to do -this redirection at the C-level. Our solution is write an -alternative version of @tt{byte.o} that has functions for -setting the input and out streams that are used in @tt{ - write_byte} etc. The implementation of -@racket[asm-interp/io] is expected to be linked against a -library that implements these functions and will use them to -set up temporary files and redirect input and output there. -It's a hack, but a useful one. +(exec/io (parse '(write-byte (read-byte))) "z")] -@;{ -You can see the alternative implementation of @tt{io.c} in -@link["code/evildoer/byte-shared.c"]{@tt{byte-shared.c}} if -interested. Once compiled, it can be used with -@racket[current-objs] in order to interactively run examples -involving IO: -} +Note that we still provide an @racket[exec] function that works for +programs that don't do I/O: + +@ex[ +(exec (parse '(eof-object? #f)))] + +But it will fail if executing a program that uses I/O: + +@ex[ +(eval:error (exec (parse '(write-byte 97))))] + +We can now state the correctness property we want of the compiler: + +@bold{Compiler Correctness}: @emph{For all @racket[e] @math{∈} +@tt{Expr}, @racket[i], @racket[o] @math{∈} @tt{String}, and @racket[v] +@math{∈} @tt{Value}, if @racket[(interp/io e i)] equals @racket[(cons +v o)], then @racket[(exec/io e i)] equals +@racket[(cons v o)].} + +Testing compiler correctness is updated as follows, notice it now +takes an additional parameter representing the state of the input +stream: + +@codeblock-include["evildoer/correct.rkt"] + +@ex[ +(check-compiler (parse '(void)) "") +(check-compiler (parse '(read-byte)) "a") +(check-compiler (parse '(write-byte 97)) "")] + +The @racket[random-expr] function generates random expressions and +@racket[random-good-expr] generates random expressions that are +guaranteed to be well-defined, as usual. Additionally, the +@racket[random-input] function produces a random string that can be +used as the input. + +@ex[ +(require "random.rkt") +(random-expr) +(random-good-expr) +(random-input)] + +Together, these can be used to randomly test the correctness of the +compiler: @ex[ - (current-objs '("runtime.o")) - (asm-interp/io - (prog (Extern 'read_byte) - (Extern 'write_byte) - (Global 'entry) - (Label 'entry) - (Call 'read_byte) - (Mov 'rdi 'rax) - (Call 'write_byte) - (Mov 'rax 42) - (Ret)) - "a")] +(for ((i 100)) + (check-compiler (random-expr) (random-input))) +(for ((i 100)) + (check-compiler (random-good-expr) (random-input)))] + \ No newline at end of file diff --git a/www/notes/extort.scrbl b/www/notes/extort.scrbl index 9d83100f..0e4d0c03 100644 --- a/www/notes/extort.scrbl +++ b/www/notes/extort.scrbl @@ -7,6 +7,7 @@ "../fancyverb.rkt" "utils.rkt" "ev.rkt" + extort/types extort/semantics "../utils.rkt") @@ -16,10 +17,14 @@ @(ev '(require rackunit a86)) @(for-each (λ (f) (ev `(require (file ,(path->string (build-path langs "extort" f)))))) - '("interp.rkt" "ast.rkt" "parse.rkt" "compile.rkt" "types.rkt")) + '("main.rkt" "correct.rkt" "compile-ops.rkt")) @(ev `(current-directory ,(path->string (build-path langs "extort")))) @(void (ev '(with-output-to-string (thunk (system "make runtime.o"))))) +@;{Hack to get un-provided functions from compile-ops} +@(ev '(require (only-in rackunit require/expose))) +@(ev '(require/expose extort/compile-ops [assert-integer assert-char assert-byte assert-codepoint])) + @(define this-lang "Extort") @@ -67,7 +72,11 @@ does and signal an error. The meaning of @this-lang programs that have type errors will now be -defined as @racket['err]: +defined as @racket['err]. (You shouldn't make too much out of how we +choose to represent the ``this program caused an error'' answer; all +that really matters at this point is that it is disjoint from values. +Since the langauge doesn't yet include symbols, using a symbol is a +fine choice.): @itemlist[ @@ -79,6 +88,7 @@ defined as @racket['err]: ] +@;{ @(define ((rewrite s) lws) (define lhs (list-ref lws 2)) (define rhs (list-ref lws 3)) @@ -110,35 +120,247 @@ And there are four rules for propagating errors from subexpressions: Now what does the semantics say about @racket[(add1 #f)]? What about @racket[(if 7 #t -2)]? +} -The signature of the interpreter is extended to produce answers. Each -use of a Racket primitive is guarded by checking the type of the -arguments and an error is produced if the check fails. Errors are -also propagated when a subexpression produces an error: +In order to define the semantics, we first introduce the type of +results that may be given by the interpretation function: + +@#reader scribble/comment-reader +(ex +;; type Answer = Value | 'err +) + +Type mismatches can arise as the result of primitive operations being +applied to arguments for which the primitive is undefined, so we +revise @racket[interp-prim1] to check all necessary preconditions +before carrying out an operation, and producing an error in case +those conditions are not met: -@codeblock-include["extort/interp.rkt"] @codeblock-include["extort/interp-prim.rkt"] +Within the interpreter, we update the type signature to reflect the +fact that interpreting an expression produces an answer, no longer +just an expression. We must also take care to observe that evaluating +a subexpression may produce an error and as such it should prevent +further evaluation. To do this, the interpreter is written to check +for an error result of any subexpression it evaluates before +proceeding to evaluate another subexpression: + +@codeblock-include["extort/interp.rkt"] + We can confirm the interpreter computes the right result for the examples given earlier: @ex[ -(interp (Prim1 'add1 (Lit #f))) -(interp (Prim1 'zero? (Lit #t))) -(interp (If (Prim1 'zero? (Lit #f)) (Lit 1) (Lit 2))) +(interp (parse '(add1 #f))) +(interp (parse '(zero? #t))) +(interp (parse '(if (zero? #f) 1 2))) ] -The statement of correctness stays the same, but now observe that -there is no way to crash the interpreter with any @tt{Expr} value. +This interpreter implicitly relies on the state of the input and +output port, but we can define a pure interpreter like before, +which we take as the specification of our language: + +@codeblock-include["extort/interp-io.rkt"] + +An important property of this semantics is that it provides a meaning +for all possible expressions; in other words, @racket[interp/io] is a +total function, and our language has no undefined behaviors. + +@bold{Total Semantics}: @emph{For all @racket[e] @math{∈} @tt{Expr}, +there exists @racket[a] @math{∈} @tt{Answer}, @racket[o] @math{∈} +@tt{String}, such that @racket[(interp/io e i)] equals @racket[(cons +a o)].} + +The statement of correctness, which is revised to use the answer type +in place of values, reads: + +@bold{Compiler Correctness}: @emph{For all @racket[e] @math{∈} +@tt{Expr}, @racket[i], @racket[o] @math{∈} @tt{String}, and @racket[a] +@math{∈} @tt{Answer}, if @racket[(interp/io e i)] equals @racket[(cons +a o)], then @racket[(exec/io e i)] equals +@racket[(cons a o)].} + +By virtue of the semantics being total, we have a complete +specification for the compiler. There are no longer programs for +which it is free to do arbitrary things; it is always obligated to +produce the same answer as the interpreter. -@section{A Compiler for @this-lang} +@section{Checking and signalling errors at run-time} Suppose we want to compile @racket[(add1 #f)], what needs to happen? Just as in the interpreter, we need to check the integerness of the argument's value before doing the addition operation. +This checking needs to be emitted as part of the compilation of the +@racket[add1] primitive. It no longer suffices to simply do an +@racket[Add] instruction on whatever is in the @racket[rax] register; +the compiled code should first check that the value in @racket[rax] +encodes an integer, and if it doesn't, it should somehow stop the +computation and signal that an error has occurred. + +The checking part is fairly easy. Our encoding of values, first +discussed in @secref["dupe"], devotes some number of bits within a +value to indicate the type. Checking whether something is an integer +involves inspecting just those parts of the value. + +Suppose we have an arbitrary value in @racket[rax]. If it's an +integer, the least significant bit has to be @racket[0]. We could +mask out just that bit by @racket[And]ing the value with the bit +@racket[#,mask-int] and compare the result to type tag for integers +(i.e. @racket[#,type-int]). Let's write a little function to play +around with this idea: + +@#reader scribble/comment-reader +(ex +;; Produces 0 if v is an integer value +(define (is-int? v) ;; Value -> 0 | 1 + (asm-interp + (prog (Global 'entry) + (Label 'entry) + (Mov 'rax (value->bits v)) + (And 'rax mask-int) + (Ret)))) + +(is-int? 0) +(is-int? 1) +(is-int? 2) +(is-int? #t) +(is-int? #f)) + +Unfortunately, the use of @racket[And] here destroys the value in +@racket[rax] in order to determine if it is an integer. That's fine +for this predicate, but if we wanted to compute something with the +integer, we'd be toast as soon as checked it's type. + +Suppose we wanted to write a function that did @racket[add1] in case +the argument is an integer value, otherwise it produces false. For +that, we would need to use a temporary register for the type tag +check: + +@#reader scribble/comment-reader +(ex +;; Produces (add1 v) if v is an integer value, #f otherwise +(define (plus1 v) ;; Value -> Integer | Boolean + (bits->value + (asm-interp + (prog (Global 'entry) + (Label 'entry) + (Mov 'rax (value->bits v)) + (Mov 'r9 'rax) + (And 'r9 mask-int) + (Cmp 'r9 type-int) + (Jne 'err) + (Add 'rax (value->bits 1)) + (Ret) + (Label 'err) + (Mov 'rax (value->bits #f)) + (Ret))))) + +(plus1 0) +(plus1 1) +(plus1 2) +(plus1 #t) +(plus1 #f)) + +This is pretty close to how we can implement primitives like +@racket[add1]. The only missing piece is that instead of returning a +specific @emph{value}, like @racket[#f], we want to stop computation +and signal that an error has occurred. To accomplish this we add a +function called @tt{raise_error} to our run-time system that, when +called, prints @tt{err} and exits with a non-zero number to signal an +error. The @racket[asm-interp] intercepts these calls are returns the +@racket['err] symbol to match what the interpreter does: + +@ex[ +(current-objs '("runtime.o")) +(asm-interp + (prog (Global 'entry) + (Label 'entry) + (Extern 'raise_error) + (Call 'raise_error)))] + +Now we can make a function that either does the addition or signals an +error: + +@#reader scribble/comment-reader +(ex +;; Produces (add1 v) if v is an integer, 'err otherwise +(define (plus1 v) ;; Value -> Integer | 'err + (match + (asm-interp + (prog (Global 'entry) + (Label 'entry) + (Mov 'rax (value->bits v)) + (Mov 'r9 'rax) + (And 'r9 mask-int) + (Cmp 'r9 type-int) + (Jne 'err) + (Add 'rax (value->bits 1)) + (Ret) + (Label 'err) + (Extern 'raise_error) + (Call 'raise_error))) + ['err 'err] + [b (bits->value b)])) + +(plus1 0) +(plus1 1) +(plus1 2) +(plus1 #t) +(plus1 #f)) + +This can form the basis of how primitives with error checking can be +implemented. Looking at @racket[interp-prim1], we can see that in +addition to checking for whether an argument is an integer, we will +also need checks for characters, bytes, and unicode codepoints; the +latter two are refinements of the integer check: they require there +arguments to be integers in a certain range. + +To support these checks, we develop a small library of type assertion +functions that, given a register name, produce code to check the type +of value held in the register, jumping to @racket[err] whenever the +value is not of the asserted type: + +@itemlist[ +@item{@racket[assert-integer] @tt{: Register -> Asm} produces code to check that the value in the given register is an integer,} +@item{@racket[assert-char] @tt{: Register -> Asm} produces code to check that the value in the given register is a character,} +@item{@racket[assert-byte] @tt{: Register -> Asm} produces code to check that the value in the given register is an integer and in the range [0,256).} +@item{@racket[assert-codepoint] @tt{: Register -> Asm} produces code to check that the value in the given register is an integer and either in the range [0,55295] or [57344, 1114111].} +] + +@codeblock-include["extort/assert.rkt"] + +The compiler for primitive operations is updated to include +appropriate type assertions: + +@codeblock-include["extort/compile-ops.rkt"] + +@ex[ +(compile-op1 'add1) +(compile-op1 'char->integer) +(compile-op1 'write-byte)] + + +The top-level compiler largely stays the same, but it now declares an +external label @racket['raise_error] that will be defined by the +run-time system and defines a label called @racket['err] that calls +@tt{raise_error}: + +@codeblock-include["extort/compile.rkt"] + +@ex[ +(compile-e (parse '(add1 #f))) +(compile-e (parse '(char->integer #\a))) +(compile-e (parse '(write-byte 97)))] + + + +@section{Run-time for @this-lang} + + We extend the run-time system with a C function called @tt{raise_error} that prints "err" and exits with a non-zero status to indicate something has gone wrong. @margin-note{The runtime system is @@ -149,60 +371,37 @@ but it can be ignored.} @filebox-include[fancy-c extort "main.c"] -Most of the work of error checking happens in the code emitted for -primitive operations. Whenever an error is detected, control jumps to -a label called @racket['err] that immediately calls @tt{raise_error}: -@codeblock-include["extort/compile-ops.rkt"] +Linking in the run-time allows us to define the @racket[exec] and +@racket[exec/io] functions: -All that's left for the top-level compile function to declare an -external label @racket['raise_error] that will be defined by the -run-time system and to emit a label called @racket['err] that calls -@tt{raise_error}, otherwise this part of the compiler doesn't change: +@codeblock-include["extort/exec.rkt"] +@codeblock-include["extort/exec-io.rkt"] -@codeblock-include["extort/compile.rkt"] +We can run examples: -Here's the code we generate for @racket['(add1 #f)]: @ex[ -(define (show e) - (displayln (asm-string (compile-e (parse e))))) +(exec (parse '(add1 8))) +(exec (parse '(add1 #f)))] + -(show '(add1 #f)) -] -@(void (ev '(current-objs '("runtime.o")))) +@section{Correctness, revisited} + +This allows to re-formulate the check for compiler correctness check +in its earlier, simpler form that just blindly runs the interpreter +and compiler and checks that the results are the same, thanks to the +totality of the semantics: + +@codeblock-include["extort/correct.rkt"] -Here are some examples running the compiler: @ex[ -(define (tell e) - (match (asm-interp (compile (parse e))) - ['err 'err] - [b (bits->value b)])) -(tell #t) -(tell #f) -(tell '(zero? 0)) -(tell '(zero? -7)) -(tell '(if #t 1 2)) -(tell '(if #f 1 2)) -(tell '(if (zero? 0) (if (zero? 0) 8 9) 2)) -(tell '(if (zero? (if (zero? 2) 1 0)) 4 5)) -(tell '(add1 #t)) -(tell '(sub1 (add1 #f))) -(tell '(if (zero? #t) 1 2)) -] +(check-compiler (parse '(add1 8)) "") +(check-compiler (parse '(add1 #f)) "")] -Since the interpreter and compiler have well defined specifications -for what should happen when type errors occur, we can test in the -usual way again: +And again, we can randomly test the compiler by generating programs and inputs: @ex[ -(define (check-correctness e) - (check-equal? (match (asm-interp (compile e)) - ['err 'err] - [b (bits->value b)]) - (interp e) - e)) - -(check-correctness (Prim1 'add1 (Lit 7))) -(check-correctness (Prim1 'add1 (Lit #f))) -] +(require "random.rkt") +(for ((i 100)) + (check-compiler (random-expr) (random-input)))] diff --git a/www/notes/fraud.scrbl b/www/notes/fraud.scrbl index 12cbc880..2ee0f6f0 100644 --- a/www/notes/fraud.scrbl +++ b/www/notes/fraud.scrbl @@ -16,7 +16,7 @@ @(ev `(current-directory ,(path->string (build-path langs "fraud")))) @(void (ev '(with-output-to-string (thunk (system "make runtime.o"))))) @(for-each (λ (f) (ev `(require (file ,f)))) - '("interp.rkt" "compile.rkt" "ast.rkt" "parse.rkt" "types.rkt" "translate.rkt")) + '("main.rkt" "translate.rkt")) @(define this-lang "Fraud") @@ -133,6 +133,83 @@ We can model it as a datatype as usual: } + +@section{Syntax matters} + +With the introduction of variables comes the issue of expressions that +have @bold{free} and @bold{bound variables}. A bound variable is a +variable that occurs within the scope of some binder for that +variable. A free variable is one that occurs within an expression +that does not bind that variable. So for example, in @racket[(let ((x +5)) (add1 x))], the occurrence of @racket[x] is bound because it +occurs within the body of a @racket[let] expression that binds +@racket[x]. Although, if the expression were just @racket[(add1 x)], +then the occurrence of @racket[x] is free (or unbound). Note that an +expression might include both free and bound occurrences of a +variable, e.g. in @racket[(+ (let ((x 5)) x) x)] there are two +occurrences of @racket[x]: one is bound and one is free. + +An important set of expressions are those that contain no free +variables. + +@#reader scribble/comment-reader +(racketblock +;; type ClosedExpr = { e ∈ Expr | e contains no free variables } +) + +This set is important because only closed expressions should be +interpreted (or compiled). We will consider free variables to be a +syntax error. To this end, we provide two versions of the parser. +The @racket[parse-closed] parser raises an error when it encounters an +unbound variable and therefore guarantees to always produce an element +of @tt{ClosedExpr}. The @racket[parse] parser on the other hand +produces elements of @tt{Expr} and parse unbound variables as +variables: + +@ex[ +(parse 'x) +(parse '(add1 x)) +(parse '(let ((x 5)) (add1 x))) +(eval:error (parse-closed 'x)) +(eval:error (parse-closed '(add1 x))) +(parse-closed '(let ((x 5)) (add1 x)))] + +Another issue that comes up now is the potential to bind variables +that conflict with other keywords used in the language such as +@racket[add1], @racket[sub1], @racket[if], etc. Racket, following +Scheme, adopts a flexible approach that allows variables bindings to +shadow @emph{any} keyword in the language. We can do the same. + +This means that parsing an expression depends on the binding structure +parsed so far. For example @racket[add1] might be a variable +occurrence if it appears in a context that binds the variable +@racket[add1]. The parser has been updated to take this in to account. +Consider the following examples: + +@ex[ +(parse '(let ((add1 1)) (sub1 add1))) +(eval:error (parse '(let ((add1 1)) (add1 add1)))) +(parse '(let ((let 1)) let)) +(parse 'let)] + + +The heart of this revised parsing strategy is the function +@racket[parse/acc] with takes an s-expression to be parsed into an +expression as well as a list of bound and free variables. It computes +both the parsed expression and a list of variables that occur free in +it. + +When encountering a keyword like @racket[if], @racket[let], etc., we +check that it is not in the set of bound variables before parsing the +input as that particular form of expression. When we encounter a +variable occurrence, if it is not in the set of bound variables, we +add it to the set of free variables in the result. When binding a +variable, we add it to the set of bound variables before parsing the +relevant part of the input where the variable is bound: + +@codeblock-include["fraud/parse.rkt"] + + @section{Meaning of @this-lang programs} The meaning of @this-lang programs depends on the form of the expression and @@ -226,77 +303,22 @@ expressions, we will need to keep track of some number of pairs of variables and their meaning. We will refer to this contextual information as an @bold{environment}. -@margin-note{To keep things simple, we omit the treatment of - IO in the semantics, but it's easy enough to incorporate - back in if desired following the template of @secref{ - Evildoer}.} - The meaning of a variable is resolved by looking up its meaning in the environment. The meaning of a @racket[let] will depend on the meaning of its body with an extended environment that associates its variable binding to the value of the right hand side. -The heart of the semantics is an auxiliary relation, @render-term[F -𝑭-𝒆𝒏𝒗], which relates an expression and an environement to the integer -the expression evaluates to (in the given environment): - -@(define ((rewrite s) lws) - (define lhs (list-ref lws 2)) - (define rhs (list-ref lws 3)) - (list "" lhs (string-append " " (symbol->string s) " ") rhs "")) - -@(require (only-in racket add-between)) -@(define-syntax-rule (show-judgment name cases) - (with-unquote-rewriter - (lambda (lw) - (build-lw (lw-e lw) (lw-line lw) (lw-line-span lw) (lw-column lw) (lw-column-span lw))) - (with-compound-rewriters (['+ (rewrite '+)] - ['- (rewrite '–)] - ['< (rewrite '<)] - ['= (rewrite '=)]) - (apply centered - (add-between - (map (λ (c) (parameterize ([judgment-form-cases (list c)] - [judgment-form-show-rule-names #f]) - (render-judgment-form name))) - cases) - (hspace 4)))))) - -The rules for dealing with the new forms (variables and lets) are: -@(show-judgment 𝑭-𝒆𝒏𝒗 '("var" "let")) - -These rely on two functions: one for extending an environment with a -variable binding and one for lookup up a variable binding in an -environment: - -@centered{ -@render-metafunction[sem:ext #:contract? #t] - -@(with-atomic-rewriter - 'undefined - "⊥" - (render-metafunction sem:lookup #:contract? #t))} - -The remaining rules are just an adaptation of the existing rules from -Extort to thread the environment through. For example, here are just -a couple: -@(show-judgment 𝑭-𝒆𝒏𝒗 '("prim")) -@(show-judgment 𝑭-𝒆𝒏𝒗 '("if-true" "if-false")) -And rules for propagating errors through let: -@(show-judgment 𝑭-𝒆𝒏𝒗 '("let-err")) - - - -The operational semantics for @this-lang is then defined as a binary relation -@render-term[F 𝑭], which says that @math{(e,i)} in @render-term[F 𝑭], -only when @math{e} evaluates to @math{i} in the empty environment -according to @render-term[F 𝑭-𝒆𝒏𝒗]: - -@(show-judgment 𝑭 '("mt-env")) +The heart of the semantics is a function @racket[interp-env] the +provides the meaning of an expression under a given environment. The +top-level @racket[interp] function simply calls @racket[interp-env] +with an empty enivornment. +These rely on two functions: one for extending an environment with a +variable binding and one for lookup up a variable binding in an +environment. With the semantics of @racket[let] and variables out of the @@ -307,37 +329,10 @@ of @racket[_e0] and @racket[_e1], when they mean integers, otherwise the meaning is an error. -The handling of primitives occurs in the following rule: - -@(show-judgment 𝑮-𝒆𝒏𝒗 '("prim")) - -It makes use of an auxiliary judgment for interpreting primitives: - -@centered[ - - (with-compound-rewriters (['+ (rewrite '+)] - ['- (rewrite '–)] - ['< (rewrite '<)] - ['= (rewrite '=)] - ['= (rewrite '=)] - ['!= (rewrite '≠)]) - (render-metafunction 𝑭-𝒑𝒓𝒊𝒎 #:contract? #t)) - - #;(with-unquote-rewriter - (lambda (lw) - (build-lw (lw-e lw) (lw-line lw) (lw-line-span lw) (lw-column lw) (lw-column-span lw))) - (render-metafunction 𝑮-𝒑𝒓𝒊𝒎 #:contract? #t))] - - - -The interpreter closely mirrors the semantics. The top-level -@racket[interp] function relies on a helper function -@racket[interp-env] that takes an expression and environment and -computes the result. It is defined by structural recursion on the -expression. Environments are represented as lists of associations -between variables and values. There are two helper functions for -@racket[ext] and @racket[lookup]: +It is defined by structural recursion on the expression. Environments +are represented as lists of associations between variables and values. +There are two helper functions for @racket[ext] and @racket[lookup]: @codeblock-include["fraud/interp.rkt"] @@ -352,22 +347,12 @@ examples given earlier: (interp (parse '(let ((x 7)) (let ((y 2)) x)))) (interp (parse '(let ((x 7)) (let ((x 2)) x)))) (interp (parse '(let ((x 7)) (let ((x (add1 x))) x)))) -] - -We can see that it works as expected: - -@ex[ (interp (parse '(+ 3 4))) (interp (parse '(+ 3 (+ 2 2)))) (interp (parse '(+ #f 8))) ] -@bold{Interpreter Correctness}: @emph{For all @this-lang expressions -@racket[e] and values @racket[v], if (@racket[e],@racket[v]) in -@render-term[F 𝑭], then @racket[(interp e)] equals -@racket[v].} - @section{Lexical Addressing} Just as we did with @seclink["Dupe"], the best way of understanding @@ -443,10 +428,8 @@ variables, but just lexical addresses: @#reader scribble/comment-reader (racketblock ;; type IExpr = +;; | (Lit Datum) ;; | (Eof) -;; | (Int Integer) -;; | (Bool Boolean) -;; | (Char Character) ;; | (Prim0 Op0) ;; | (Prim1 Op1 IExpr) ;; | (Prim2 Op2 IExpr IExpr) @@ -840,10 +823,7 @@ stack-alignment issues, but is otherwise the same as before: We can now take a look at the main compiler for expressions. Notice the compile-time environment which is weaved through out the @racket[compile-e] function and its subsidiaries, which is critical in -@racket[compile-variable] and extended in @racket[compile-let]. It is -passed to the @racket[compile-op0], @racket[compile-op1] and -@racket[compile-op2] functions for the purposes of stack alignment -before calls into the runtime system. +@racket[compile-variable] and extended in @racket[compile-let]. @filebox-include[codeblock fraud "compile.rkt"] @@ -872,18 +852,13 @@ Let's take a look at some examples of @racket[let]s and variables: And running the examples: @ex[ -(current-objs '("runtime.o")) -(define (tell e) - (match (asm-interp (compile (parse e))) - ['err 'err] - [b (bits->value b)])) -(tell '(let ((x 7)) x)) -(tell '(let ((x 7)) 2)) -(tell '(let ((x 7)) (add1 x))) -(tell '(let ((x (add1 7))) x)) -(tell '(let ((x 7)) (let ((y 2)) x))) -(tell '(let ((x 7)) (let ((x 2)) x))) -(tell '(let ((x 7)) (let ((x (add1 x))) x))) +(exec (parse '(let ((x 7)) x))) +(exec (parse '(let ((x 7)) 2))) +(exec (parse '(let ((x 7)) (add1 x)))) +(exec (parse '(let ((x (add1 7))) x))) +(exec (parse '(let ((x 7)) (let ((y 2)) x)))) +(exec (parse '(let ((x 7)) (let ((x 2)) x)))) +(exec (parse '(let ((x 7)) (let ((x (add1 x))) x)))) ] Here are some examples of binary operations: @@ -896,9 +871,9 @@ Here are some examples of binary operations: And running the examples: @ex[ -(tell '(+ 1 2)) -(tell '(+ (+ 3 4) (+ 1 2))) -(tell '(let ((y 3)) (let ((x 2)) (+ x y)))) +(exec (parse '(+ 1 2))) +(exec (parse '(+ (+ 3 4) (+ 1 2)))) +(exec (parse '(let ((y 3)) (let ((x 2)) (+ x y))))) ] Finally, we can see the stack alignment issues in action: @@ -909,3 +884,21 @@ Finally, we can see the stack alignment issues in action: (show '(add1 #f) '()) (show '(add1 #f) '(x)) ] + +@section{Correctness} + +For the statement of compiler correctness, we must now restrict the +domain of expressions to be just @bold{closed expressions}, i.e. those +that have no unbound variables. + +@bold{Compiler Correctness}: @emph{For all @racket[e] @math{∈} +@tt{ClosedExpr}, @racket[i], @racket[o] @math{∈} @tt{String}, and @racket[v] +@math{∈} @tt{Value}, if @racket[(interp/io e i)] equals @racket[(cons +v o)], then @racket[(exec/io e i)] equals +@racket[(cons v o)].} + +The check for correctness is the same as before, although the check should only be applied +to elements of @tt{ClosedExpr}: + +@filebox-include[codeblock fraud "correct.rkt"] + diff --git a/www/notes/hustle.scrbl b/www/notes/hustle.scrbl index 6630826b..0ad22987 100644 --- a/www/notes/hustle.scrbl +++ b/www/notes/hustle.scrbl @@ -17,7 +17,7 @@ @(ev `(current-directory ,(path->string (build-path langs "hustle")))) @(void (ev '(with-output-to-string (thunk (system "make runtime.o"))))) @(for-each (λ (f) (ev `(require (file ,f)))) - '("interp.rkt" "compile.rkt" "compile-ops.rkt" "ast.rkt" "parse.rkt" "types.rkt")) + '("main.rkt" "heap.rkt" "unload.rkt" "interp-prims-heap.rkt")) @(define this-lang "Hustle") @@ -100,66 +100,57 @@ just need another distinguished value to designate it. Using @racket[cons] and @racket['()] in a structured way we can form @emph{proper list}, among other useful data structures. -We use the following grammar for @|this-lang|: - -@centered[(render-language H)] - -We can model this as an AST data type: +We use the following AST data type for @|this-lang|: @filebox-include-fake[codeblock "hustle/ast.rkt"]{ #lang racket -;; type Expr = ... -;; | (Empty) -;; type Op1 = ... -;; | 'box | 'car | 'cdr | 'unbox | 'box? | 'cons? -;; type Op2 = ... -;; | 'cons +;; type Expr = ... | (Lit Datum) +;; type Datum = ... | '() +;; type Op1 = ... | 'box | 'car | 'cdr | 'unbox | 'box? | 'cons? +;; type Op2 = ... | 'cons } @section{Meaning of @this-lang programs, implicitly} -The meaning of @this-lang programs is just a slight update to the -prior language, namely we add a few new primitives. - -The update to the semantics is just an extension of the semantics of -primitives: - -@(judgment-form-cases #f) +The interpreter has an update to the @racket[interp-prim] +module: -@;centered[(render-judgment-form 𝑯-𝒆𝒏𝒗)] +@codeblock-include["hustle/interp-prim.rkt"] -@(define ((rewrite s) lws) - (define lhs (list-ref lws 2)) - (define rhs (list-ref lws 3)) - (list "" lhs (string-append " " (symbol->string s) " ") rhs "")) +The interpreter doesn't really shed light on how constructing +inductive data works because it simply uses the mechanism of the +defining language to construct it. Inductively defined data is easy +to model in this interpreter because we can rely on the mechanisms +provided for constructing inductively defined data at the meta-level +of Racket. -@centered[ - (with-compound-rewriters (['+ (rewrite '+)] - ['- (rewrite '–)] - ['= (rewrite '=)] - ['!= (rewrite '≠)]) - (render-metafunction 𝑯-𝒑𝒓𝒊𝒎 #:contract? #t)) -] - -The interpreter similarly has an update to the @racket[interp-prim] -module: +The real trickiness comes when we want to model such data in an +impoverished setting that doesn't have such things, which of course is +the case in assembly. -@codeblock-include["hustle/interp-prim.rkt"] +The problem is that a value such as @racket[(box _v)] has a value +inside it. Pairs are even worse: @racket[(cons _v0 _v1)] has +@emph{two} values inside it. If each value is represented with 64 +bits, it would seem a pair takes @emph{at a minimum} 128-bits to +represent (plus we need some bits to indicate this value is a pair). +What's worse, those @racket[_v0] and @racket[_v1] may themselves be +pairs or boxes. The great power of inductive data is that an +arbitrarily large piece of data can be constructed. But it would seem +impossible to represent each piece of data with a fixed set of bits. -Inductively defined data is easy to model in the semantics and -interpreter because we can rely on inductively defined data at the -meta-level in math or Racket, respectively. +The solution is to @bold{allocate} such data in memory, which can in +principle be arbitrarily large, and use a @bold{pointer} to refer to +the place in memory that contains the data. -In some sense, the semantics and interpreter don't shed light on -how constructing inductive data works because they simply use -the mechanism of the defining language to construct inductive data. -Let's try to address that. +Before tackling the compiler, let's look at an alternative version of +the interpreter that makes explicit a representation of memory and is +able to interpret programs that construct and manipulate inductive +data without itself relying on those mechanisms. @section{Meaning of @this-lang programs, explicitly} -Let's develop an alternative semantics and interpreter that -describes constructing inductive data without itself -constructing inductive data. +Let's develop an alternative interpreter that describes constructing +inductive data without itself constructing inductive data. The key here is to describe explicitly the mechanisms of memory allocation and dereference. Abstractly, memory can be @@ -168,120 +159,188 @@ values stored in those addresses. As programs run, there is a current state of the memory, which can be used to look up values (i.e. dereference memory) or to extend by making a new association between an available address and a value -(i.e. allocating memory). Memory will be assumed to be -limited to some finite association, but we'll always assume -programs are given a sufficiently large memory to run to -completion. +(i.e. allocating memory). -In the semantics, we can model memory as a finite function -from addresses to values. The datatype of addresses is left -abstract. All that matters is we can compare them for -equality. +The representation of values changes to represent inductive data +through pointers to memory: -We now change our definition of values to make it -non-recursive: +@#reader scribble/comment-reader +(racketblock +;; type Value* = +;; | Integer +;; | Boolean +;; | Character +;; | Eof +;; | Void +;; | '() +;; | (box-ptr Address) +;; | (cons-ptr Address) +(struct box-ptr (i)) +(struct cons-ptr (i)) -@centered{@render-language[Hm]} +;; type Address = Natural +) -We define an alternative semantic relation equivalent to 𝑯 called -𝑯′: +Here we have two kinds of pointer values, @emph{box pointers} and +@emph{cons pointers}. A box value is represented by an address (some +natural number) and a tag, the @racket[box-ptr] constructor, which +indicates that the address should be interpreted as the contents of a +box. A cons is represented by an address tagged with +@racket[cons-ptr], indicating that the memory contains a pair of +values. + +To model memory, we use a list of @tt{Value*} values. When memory is +allocated, new elements are placed at the front of the list. To model +memory locations, use the distance from the element to the end of the +list, this way addresses don't change as memory is allocated. + +For example, suppose we have allocated memory to hold four values +@racket['(97 98 99 100)]. The address of 100 is 0; the address of 99 +is 1; etc. When a new value is allocated, say, @racket['(96 97 98 +99)], the address of 99 is still 0, and so on. The newly allocated +value 96 is at address 4. In this way, memory grows toward higher +addresses and the next address to allocate is given by the size of the +heap currently in use. -@centered[(render-judgment-form 𝑯′)] +@#reader scribble/comment-reader +(racketblock +;; type Heap = (Listof Value*) +) -Like 𝑯, it is defined in terms of another relation. Instead -of 𝑯-𝒆𝒏𝒗, we define a similar relation 𝑯-𝒎𝒆𝒎-𝒆𝒏𝒗 that has an -added memory component both as input and out: +When a program is intepreted, it results in a @tt{Value*} paired +together with a @tt{Heap} that gives meaning to the addresses in the +value, or an error: -@centered[(render-judgment-form 𝑯-𝒎𝒆𝒎-𝒆𝒏𝒗)] +@#reader scribble/comment-reader +(racketblock +;; type Answer* = (cons Heap Value*) | 'err +) -For most of the relation, the given memory σ is simply -threaded through the judgment. When interpreting a primitive -operation, we also thread the memory through a relation -analagous to 𝑯-𝒑𝒓𝒊𝒎 called 𝑯-𝒎𝒆𝒎-𝒑𝒓𝒊𝒎. The key difference -for 𝑯-𝒎𝒆𝒎-𝒑𝒓𝒊𝒎 is that @racket[cons] and @racket[box] -operations allocate memory by extending the given memory σ -and the @racket[car], @racket[cdr], and @racket[unbox] -operations dereference memory by looking up an association -in the given memory σ: +So for example, to represent a box @racket[(box 99)] we could have +a box value, i.e. a tagged pointer that points to memory containing 99: -@centered[(render-metafunction 𝑯-𝒎𝒆𝒎-𝒑𝒓𝒊𝒎 #:contract? #t)] +@#reader scribble/comment-reader +(racketblock +(cons (list 99) (box-ptr 0))) -There are only two unexplained bits at this point: +The value at list index 0 is 99 and the box value points to that +element of the heap and indicates it is the contents of a box. -@itemlist[ - @item{the metafunction -@render-term[Hm (alloc σ (v ...))] which consumes a memory -and a list of values. It produces a memory and an address -@render-term[Hm (σ_′ α)] such that @render-term[Hm σ_′] is -like @render-term[Hm σ] except it has a new association for -some @render-term[Hm α] and @render-term[Hm α] is @bold{ - fresh}, i.e. it does not appear in the domain of -@render-term[Hm σ].} - - @item{the metafunction @render-term[Hm (unload σ a)] used - in the conclusion of @render-term[Hm 𝑯′]. This function does - a final unloading of the answer and memory to obtain a answer - in the style of 𝑯.}] - - -The definition of @render-term[Hm (alloc σ (v ...))] is -omitted, since it depends on the particular representation -chosen for @render-term[Hm α], but however you choose to -represent addresses, it will be easy to define appropriately. - -The definition of @render-term[Hm (unload σ a)] just traces -through the memory to reconstruct an inductive piece of data: - -@centered[(render-metafunction unload #:contract? #t)] - - -With the semantics of explicit memory allocation and -dereference in place, we can write an interepreter to match -it closely. - -We could define something @emph{very} similar to the -semantics by threading through some representation of a -finite function serving as the memory, just like the -semantics. Or we could do something that will produce the -same result but using a more concrete mechanism that is like -the actual memory on a computer. Let's consider the latter -approach. - -We can use a Racket @racket[list] to model the memory. - -@;{ -We will use a @racket[vector] of some size to model the -memory used in a program's evaluation. We can think of -@racket[vector] as giving us a continguous array of memory -that we can read and write to using natural number indices -as addresses. The interpreter keeps track of both the -@racket[vector] and an index for the next available memory -address. Every time the interpreter allocates, it writes in -to the appropriate cell in the @racket[vector] and bumps the -current address by 1.} +It's possible that other memory was used in computing this result, so +we might end up with an answer like: -@codeblock-include["hustle/interp-heap.rkt"] +@#reader scribble/comment-reader +(racketblock +(cons (list 97 98 99) (box-ptr 0))) +Or: +@#reader scribble/comment-reader +(racketblock +(cons (list 97 98 99 100 101) (box-ptr 2))) -The real trickiness comes when we want to model such data in an -impoverished setting that doesn't have such things, which of course is -the case in assembly. +Both of which really mean the same value: @racket[(box 99)]. + +A pair contains two values, so a @racket[cons-ptr] should point to the +start of elements that comprise the pair. For example, this answer +represents a pair @racket[(cons 100 99)]: + +@#reader scribble/comment-reader +(racketblock +(cons (list 99 100) (cons-ptr 0))) + +Note that the @racket[car] of this pair is at address 0 and the +@racket[cdr] is at address 1. + +Note that we could have other things residing in memory, but so long +as the address points to same values as before, these answers mean the +same thing: + +@#reader scribble/comment-reader +(racketblock +(cons (list 97 98 99 100 101) (cons-ptr 1))) + +In fact, we can reconstruct a @tt{Value} from a @tt{Value*} and +@tt{Heap}: + +@codeblock-include["hustle/unload.rkt"] + +Which relies on our interface for heaps: + +@codeblock-include["hustle/heap.rkt"] + +Try it out: + +@ex[ +(unload-value (box-ptr 0) (list 99)) +(unload-value (cons-ptr 0) (list 99 100)) +(unload-value (cons-ptr 1) (list 97 98 99 100 101))] + +What about nested pairs like @racket[(cons 1 (cons 2 (cons 3 '())))]? +Well, we already have all the pieces we need to represent values like +these. + +@ex[ +(unload-value (cons-ptr 0) + (list '() 3 (cons-ptr 4) 2 (cons-ptr 2) 1))] + +Notice that this list could laid out in many different ways, but when +viewed through the lens of @racket[unload-value], they represent the +same list: + +@ex[ +(unload-value (cons-ptr 4) + (list (cons-ptr 2) 1 (cons-ptr 0) 2 '() 3))] + +The idea of the interpreter that explicitly models memory will be +thread through a heap that is used to represent the memory allocated +by the program. Operations that manipulate or create boxes and pairs +will have to be updated to work with this new representation. + +So for example, the @racket[cons] operation should allocate two new +memory locations and produce a tagged pointer to the address of the +first one. The @racket[car] operation should dereference the memory +pointed to by the given @racket[cons-ptr] value. + +@ex[ +(unload (alloc-cons 100 99 '())) +(unload (alloc-box 99 '())) + +#; +(unload + (match (alloc-cons 3 '() '()) + [(cons h v) + (match (alloc-cons 2 v h) + [(cons h v) + (alloc-cons 1 v h)])]))] + + +Much of the work is handled in the new @tt{interp-prims-heap} module: + +@codeblock-include["hustle/interp-prims-heap.rkt"] + + +@ex[ +(unload (interp-prim1 'box 99 '())) +(unload (interp-prim2 'cons 100 99 '())) +(unload + (match (interp-prim1 'box 99 '()) + [(cons h v) + (interp-prim1 'unbox v h)]))] + + +Finally, we can write the overall interpreter, which threads a heap +throughout the interpretation of a program in +@racket[interp-env-heap]. The top-level @racket[interp] function, +which is intended to be equivalent to the original @racket[interp] +function that modelled memory implicitly, calls +@racket[interp-env-heap] with an initially empty heap and the unloads +the final answer from the result: + +@codeblock-include["hustle/interp-heap.rkt"] -The problem is that a value such as @racket[(box _v)] has a value -inside it. Pairs are even worse: @racket[(cons _v0 _v1)] has -@emph{two} values inside it. If each value is represented with 64 -bits, it would seem a pair takes @emph{at a minimum} 128-bits to -represent (plus we need some bits to indicate this value is a pair). -What's worse, those @racket[_v0] and @racket[_v1] may themselves be -pairs or boxes. The great power of inductive data is that an -arbitrarily large piece of data can be constructed. But it would seem -impossible to represent each piece of data with a fixed set of bits. -The solution is to @bold{allocate} such data in memory, which can in -principle be arbitrarily large, and use a @bold{pointer} to refer to -the place in memory that contains the data. @;{ Really deserves a "bit" level interpreter to bring this idea across. } @@ -551,3 +610,14 @@ values to print them. It also must account for the wrinkle of how the printing of proper and improper lists is different: @filebox-include[fancy-c hustle "print.c"] + +@section{Correctness} + +The statement of correctness for the @|this-lang| compiler is the same +as the previous one: + +@bold{Compiler Correctness}: @emph{For all @racket[e] @math{∈} +@tt{ClosedExpr}, @racket[i], @racket[o] @math{∈} @tt{String}, and @racket[v] +@math{∈} @tt{Value}, if @racket[(interp/io e i)] equals @racket[(cons +v o)], then @racket[(exec/io e i)] equals +@racket[(cons v o)].}