Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(ABNF) instaparse considers forms in reverse order #136

Open
Whoops opened this issue May 6, 2016 · 4 comments
Open

(ABNF) instaparse considers forms in reverse order #136

Whoops opened this issue May 6, 2016 · 4 comments
Assignees

Comments

@Whoops
Copy link

Whoops commented May 6, 2016

given grammars
test1.abnf

test = a / b / unknown
a = "a" "=" ALPHA
b = "b" "=" ALPHA
unknown = ALPHA "=" ALPHA

and
test2.abnf

test = a / b
a = "a" "=" ALPHA
b = "b" "=" ALPHA
unknown = ALPHA "=" ALPHA

and parsers

dmarced.core> (def parse1 (parser (clojure.java.io/resource "test1.abnf") :input-format :abnf))
#'dmarced.core/parse1
dmarced.core> (def parse2 (parser (clojure.java.io/resource "test2.abnf") :input-format :abnf))
#'dmarced.core/parse2

It looks to me like instaparse in considering the forms for test in reverse order

dmarced.core> (parse1 "a=b")
[:test [:unknown [:ALPHA "a"] "=" [:ALPHA "b"]]]
dmarced.core> (parse2 "a=b")
[:test [:a "a" "=" [:ALPHA "b"]]]
@Engelberg
Copy link
Owner

The ABNF specification uses the / symbol for alternation, i.e., unordered choice (this is documented on the [https://github.com/Engelberg/instaparse/blob/master/docs/ABNF.md](ABNF page). The notion of ordered choice really comes from PEG grammars, and although instaparse is mainly aimed at representing context-free grammars, I included ordered choice in the EBNF grammar because the / symbol wasn't already being used so I threw it in there. But for ABNF, the / symbol was already taken by regular unordered choice.

I didn't really expect this to be a problem because I figured most users of ABNF would be using it to deal with official specs written in that notation, and I assumed official specs would not be ambiguous grammars. This is the first I've heard of an official spec being ambiguous, so I'll need to think about how to deal with that. (Can you point me to the spec where you are having this problem, just so I know for reference)?

As a little background, let me explain why instaparse processes unordered choice in reverse order. My hope has always been that one of these days, I'd like to release a version of instaparse that can process unordered choices concurrently using multiple threads. So I really wanted to drive home the point that users should think of unordered choice as non-deterministic, because someday it likely will be. I feared that if I processed the choices in order, users would start to rely on that left-to-right behavior, and then eventually be upset if they can't take advantage of parallelism later because the non-determinism breaks their grammar. So I decided rather intentionally to process the choices right-to-left, so that if you have an ambiguous grammar but you are mentally making the assumption that it will be processed in a specific order, you'll get a surprising result and realize that your grammar is ambiguous. This forces people to either make their grammar unambiguous or use ordered choice (which will explicitly limit parallelism later).

I can see how this might be frustrating in the context of ABNF, where you don't have easy access to ordered choice, but like I said, I didn't expect it to be an issue with ABNF grammars.

One option would be, like you said, to make this into ordered choice. Another option would be to just make ABNF process choices left to right. The advantage of the first option is that the feature was designed to behave correctly when parallelism is implemented. The disadvantage is that this parallel-proof design of ordered choice is consequently rather complex, and was meant to be used judiciously. Using it everywhere probably wouldn't be good performance-wise and has some subtle interactions with negative lookahead that can be surprising. A simple left-to-right processing would perform the best, but then I'd have to special case ABNF and not apply parallelism for such grammars in the future. Dilemmas.

So this is something I'd need to think about. I'd welcome more input from you about your specific problem and what solution would work best for you.

@Whoops
Copy link
Author

Whoops commented May 7, 2016

I was rather afraid of that since I couldn't actually find any documentation specifying how ABNF should be evaluated, just it's specification. The grammar I'm working with is the SPF Specification, where it looks like it makes the left-to-right assumption in two places:

modifier         = redirect / explanation / unknown-modifier
redirect         = "redirect" <"="> domain-spec
explanation      = "exp" <"="> domain-spec
unknown-modifier = name <"="> macro-string
...
name             = ALPHA *( ALPHA / DIGIT / "-" / "_" / "." )
key              = "client-ip" / "envelope-from" / "helo" /
                   "problem" / "receiver" / "identity" /
                   "mechanism" / name

I haven't run into the second yet, but it looks less painful to me since key is basically an alphanumeric string with a few extra allowed characters anyway. The first is what prompted me to open this bug report, and means that all modifiers are parsed as unknown modifiers.

I can work around that one too, but it's worth noting that in trying to fix it, I found one of your tests has a similar issue, in abnf_uri.txt you have host = IP-literal / IPv4address / reg-name, which looks like it's coming from RFC 3986

The result of this is that in the test where you parse telnet://192.0.2.16:80/ the ip address is instead parsed as reg-name.

Looking ahead on my project to the dkim spec, it looks like it's making similar assumptions. I spotted a few examples there, but here's the simplest.

sig-c-tag-alg   = "simple" / "relaxed" / x-sig-c-tag-alg
x-sig-c-tag-alg = hyphenated-word    ; for later extension

Even SMTP isn't immune:

Link           = "TCP" / Addtl-Link
Addtl-Link     = Atom
                  ; Additional standard names for links are
                  ; registered with the Internet Assigned Numbers
                  ; Authority (IANA).  "Via" is primarily of value
                  ; with non-Internet transports.  SMTP servers
                  ; SHOULD NOT use unregistered names.
Protocol       = "ESMTP" / "SMTP" / Attdl-Protocol
Attdl-Protocol = Atom
                  ; Additional standard names for protocols are
                  ; registered with the Internet Assigned Numbers
                  ; Authority (IANA) in the "mail parameters"
                  ; registry [9].  SMTP servers SHOULD NOT
                  ; use unregistered names.

ATOM isn't fully specified in that document, but assuming it's the same as I pieced together for SPF, Link/Protocol would always parse as Attdl-Link/Protocol

So it's looking to me like the order may be unspecified, but the assumption is widespread.

@Engelberg Engelberg mentioned this issue Mar 14, 2017
@Engelberg Engelberg self-assigned this May 27, 2017
@f-f
Copy link

f-f commented Mar 19, 2018

Hi, I just hit this bug - using this grammar, that also assumes that alternatives are considered left-to-right. (the grammar is quite complex so I won't explain where the problem was, but I can detail if needed)

I was able to get the desidered behaviour by applying the fix in PR #137 (alt -> ord).

@Engelberg what's your opinion on this?
Since the assumption in multiple places seems to contradict the RFC behaviour, maybe the best way forward is to make this configurable? (e.g. with a dynamic var, as with case-insensitive?)
I can put together a PR if you think it's a good idea.

@Engelberg
Copy link
Owner

I'm glad you were able to apply the workaround. I've decided that in the next major release of instaparse, I'm going to make left-to-right processing the default, but provide some to-be-determined mechanism for backwards compatibility. Maybe, like you suggested, a dynamic variable is the way to go. If you have any other thoughts on the best way to incorporate this without screwing people who have come to rely on right-to-left processing, I'm all ears. My plan is for this to be the new default in both EBNF and ABNF grammars, so no special casing of ABNF will be needed. I'm not ready to commit to a release date, but my work load is getting a little lighter soon so I hope to have time to revisit this in the near future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants