Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support string literals in pattern matching #16

Open
dvanhorn opened this issue Nov 14, 2024 · 0 comments
Open

Support string literals in pattern matching #16

dvanhorn opened this issue Nov 14, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@dvanhorn
Copy link
Member

The current implementation of pattern matching in Knock is incorrect for string literal patterns because the match boils down to an eq? test. All other literal patterns are immediates, but strings are not.

So for example (match "foo" ["foo" #t] [_ #f]) will produce #f in the current implementation. String literal interning would fix this particular problem, but won't fix the problem in general, as (match (make-string #\f 3) ["fff" #t] [_ #f]) would continue to produce #f.

A fix was discussed on Piazza:


Today in class we spent some time talking about what it would take to support string literal patterns correctly. The main issue is that unlike other kinds of literals we've considered so far, string literals are memory allocated (either statically or dynamically) and what it means to match is to be a string that is "equal" in the string=? sense to the string in the pattern; other literals are immediates, and thus can be compared with a single Cmp instruction.

I thought about it more on my bike ride home this evening and this is what I came up with.

I added a special case for matching the empty string because (assuming the empty string is represented by the type tag alone), it is actually an immediate:

;; Pat CEnv Symbol -> (list Asm CEnv)
(define (compile-pattern p cm next)
  (match p
    #;...
    [(Lit "")
     (list
      (let ((ok (gensym)))
        (seq (Cmp rax type-str)
             (Je ok)
             (Add rsp (* 8 (length cm)))
             (Jmp next)
             (Label ok)))
      cm)]))

So the next case applies only to non-empty string patterns. It statically allocates memory for the string literal of the pattern in a data section. The code then checks whether the value in rax is:

  • a string
  • not the empty string
  • of the same length as the pattern string

If all of these are true, then it's time to compare the contents of the two strings for equality. To do this, I call the C standard library function:

int memcmp(const void *str1, const void *str2, size_t n)

This function is given two pointers and a size (in bytes) and returns 0 if the pointers point to n bytes of equal data.

Here's the code:

;; Pat CEnv Symbol -> (list Asm CEnv)
(define (compile-pattern p cm next)
  (match p
    #;...
    [(Lit (? string? s))
     (list      
      (let ((string (gensym))
            (ok     (gensym))
            (bail   (gensym)))
        (seq (Data)
             (Label string)
             (map Dd (map char->integer (string->list s)))
             (if (odd? (string-length s)) (seq (Dd 0)) (seq))
             (Text)
             (Mov r8 rax)
             (And r8 ptr-mask)
             (Cmp r8 type-str)
             (Jne bail) ; value is not a string
             (Cmp rax type-str)
             (Je bail) ; value is the empty string (but pattern is not)
             (Xor rax type-str)
             (Mov r8 (Offset rax 0))
             (Mov 'rdx (string-length s))
             (Cmp r8 'rdx)
             (Jne bail) ; value is not an equal length string
             ;; time to compare the contents
             ;; let's use the C standard library
             (Extern 'memcmp)
             (Sar 'rdx 2) ; 4 bytes per character
             (Mov 'rdi rax)
             (Add 'rdi 8)
             (Lea 'rsi string)
             pad-stack
             (Call 'memcmp)
             unpad-stack
             (Cmp rax 0)
             (Je ok)                  
             (Label bail)
             (Add rsp (* 8 (length cm)))
             (Jmp next)
             (Label ok)))
      cm)]))

Some examples:

> (require "run.rkt" "parse.rkt")
> (run (compile (parse '(match "a"
                          ["a" 1]
                          [_ 2]))))
1
> (run (compile (parse '(match ""
                          ["" 1]
                          [_ 2]))))
1
> (run (compile (parse '(match "a"
                          ["" 1]
                          [_ 2]))))
2
> (run (compile (parse '(match ""
                          ["a" 1]
                          [_ 2]))))
2

And there you go.

One thing I think this code suggests is that making the empty string equal to type-str was a wrong move and better is to statically allocate a canonical empty string so that this code can be simplified and unified into a single case.


This should be incorporated into the Knock compiler, but after changing empty strings (and vectors) to be statically allocated so that this code can be simpler and in a single case. Similar code would be in a string=? primitive, so it probably make sense to define a function and call it from here and string=?.

@dvanhorn dvanhorn added the bug Something isn't working label Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant