Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selective auto-whitespace #64

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sbelak
Copy link

@sbelak sbelak commented Dec 30, 2013

Added the option to auto-add whitespace only to select tags or to exclude certain tags from having whitespace added.
See tests for examples.

@Engelberg
Copy link
Owner

Looks like a useful addition.

Can you elaborate on whether inclusions/exclusions for whitespace rules "trickle down" to subrules?

For example,

S = A
A = 'a'+

If you say :only S, does that mean no whitespace is inserted (since the tokens actually occur in rule A), or the whitespace gets added because it occurs within S?

How about:

S = A B
A = C
B = C
C = 'ccc'+

If you :only A, would that suggest that the C you get to via A includes whitespace, but the version you get to via B does not?

@sbelak
Copy link
Author

sbelak commented Dec 31, 2013

Hm, good question. I would say we want the most precise (and least surprising) meaning, so in the first case it should accept "aaa ", but not "a a a"; and in the second case it should accept "ccc ccc", but not "ccc ccc ". The latter case currently does not hold, as auto whitespace skips alias nodes. Will fix (provided you agree with the direction).

Oh, and a happy new year! :)

@Engelberg
Copy link
Owner

I've been thinking about this again.

So the way it works right now is that optional whitespace is added in front of all tokens (strings or regexes) and at the end of the start rule.

It does seem that allowing some sort of "trickling down" could be quite confusing, so one way to do this is to say that when you add :only and :except, you are saying whether optional whitespace is automatically added in front of every token, i.e., string or regex, in that rule.

But if you don't include the start rule, it's totally unclear what should happen with the space at the end. And if you only cover select rules, should those rules allow for spaces at the end of those productions?

Based on your answer above to my "second case", it sounds like you're advocating that optional whitespace also be inserted in front of every non-terminal as well. I'm somewhat concerned about how much that would balloon the grammar, but it's an option, if that's the most intuitive behavior.

I'd welcome another round of input from you on the matter.

@sbelak
Copy link
Author

sbelak commented Nov 18, 2014

I think the clearest rule would be, if we treat autowhitespacing as a series of rewrite rules; that is, if there is well defined expansion the user could do on her own that would work the same. Furthermore to allow composability (my use-case for instance leans heavily on composability, overwriting and aliasing) these rewrite rules should be local to each rule.

As for ballooning, I think it's first of all an ideological question: is Instaparse first and foremost fast (part of infrastructure) or is it convenient (a hammer you reach for first). My current needs are right now in the latter camp and so is my bias.

@danielytics
Copy link

What about having some way to mark a () grouping as "don't insert auto whitespace here" and otherwise having auto-whitespace behave as it currently does?

Eg:

S = A B #(C D)

would insert whitespace before A and B but not before C and D. (assuming # is the character used to mark the grouping)

@hlship
Copy link

hlship commented Mar 1, 2016

I suspect this would have helped us, as we liked using the whitespace, but our application (parsing a GraphQL query document) included literal text in quotes; the auto-whitespace removed that whitespace and we did not find a way to exclude auto-whitespace logic from the productions that were such literal text, so we reverted to a frequently used hidden ws element.

@cjsauer
Copy link

cjsauer commented Oct 3, 2021

Here is a somewhat rough solution to this problem that I came up with for processing string literals. Remembering that an instaparse parser is just data, you can transform it prior to invocation to strip out the automatically injected whitespace rules. Given a grammar with a string literal rule:

string =
    <'"'> inner-string* <'"'> ;
<inner-string> = 
    #'[^"\\]'
    | ( <'\\'> escape-char )
    ;
escape-char = #'[trn\\"]' ;

Define the parser as usual, and then strip out the unwanted auto-whitespace rules:

(defparser my-cool-parser
  "resources/grammar.instaparse"
  :auto-whitespace :standard)

(def ^:private disable-auto-ws
  (memoize
    (fn [parser]
      (walk/postwalk
        (fn [form]
          (if (and (map-entry? form) (= :parsers (first form)))
            [:parsers
             (filter #(not= :whitespace (-> % :parser :keyword))
               (second form))]
            form))
        parser))))

(defn parse
  [text]
  (let [parser (update-in my-cool-parser [:grammar :inner-string] disable-auto-ws)]
    (parser text)))

By splitting string and inner-string rules, you can very accurately target whitespace disabling to the inside of the string literal, leaving all other automatic ws rules intact.

@cjsauer
Copy link

cjsauer commented Oct 3, 2021

That said, I like @danielytics approach of wrapping a rule in an "atomic construct" like #( ... ) in his example. I have used other parser generators to great effect with this feature. For example, the pest parser generator in Rust uses this approach, and calls them "atomic rules".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants