-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect graphs not always corrected #143
Comments
I wonder if this is the same issue as this issue here, where someone fed an invalid AMR into smatchpp and remarked that the script crashed. Indeed, this AMR seems to have the same problem with the ":
I just checked this now with penman, and indeed it seems this is the same issue. The graph is also read without warning/issue, even though it is actually invalid. [Edit] For reference:
|
Thanks for raising this. There are a few things going on here. To help with explaining, let's assume the following function: def list_triples(s: str) -> None:
for t in penman.decode(s).triples:
print(t) The main thing is that the lexer (and grammar) are not sufficiently constrained, so when it first sees a Constant <- String / Symbol
Symbol <- NameChar+
[...]
String <- '"' (!'"' ('\\' . / .))* '"'
NameChar <- ![ \n\t\r\f\v()/:~] . This explains the following: >>> list_triples('(a / alpha :op1 " :ARG0 (b / beta))')
('a', ':instance', 'alpha')
('a', ':op1', '"')
('a', ':ARG0', 'b')
('b', ':instance', 'beta') If you have two quotes on a line, it will parse it as a string, regardless if, as a human, it appears to be two solitary quote symbols: >>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 "))')
('a', ':instance', 'alpha')
('a', ':op1', '" :ARG0 (b / beta) :op2 "') The above is a perfectly valid PENMAN graph, even if the string The next issue is that the lexer for Penman operates on lines, and this fact is not captured by the grammar. This is why the following parses the two quotes as separate symbols instead of a large multiline string, even though it seems to be the same graph as above: >>> list_triples('''
... (a / alpha
... :op1 "
... :ARG0 (b / beta)
... :op2 "))''')
('a', ':instance', 'alpha')
('a', ':op1', '"')
('a', ':ARG0', 'b')
('b', ':instance', 'beta')
('a', ':op2', '"') Finally, if you have a solitary quote followed by a normal string on the same line, it will fail because it will initially succeed in matching the first solitary quote with the starting quote of the string, then the following string contents might not be valid PENMAN: >>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 "gamma"))')
Traceback (most recent call last):
[...]
penman.exceptions.DecodeError:
line 1
(a / alpha :op1 " :ARG0 (b / beta) :op2 "gamma"))
^
DecodeError: Expected: ROLE This would work, however, if the contents of the string were appropriate in that context (this would be similar to an SQL injection): >>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 " :op3 "))')
('a', ':instance', 'alpha')
('a', ':op1', '" :ARG0 (b / beta) :op2 "')
('a', ':op3', '"') Note that the above abuses the fact that the last So I think we could change the grammar to better match the expected behavior by simply adding a - NameChar <- ![ \n\t\r\f\v()/:~] .
+ NameChar <- ![ \n\t\r\f\v"()/:~] . To be explicit about the line-based parsing, we might also constrain the string characters: - String <- '"' (!'"' ('\\' . / .))* '"'
+ String <- '"' (!'"' ('\\' . / ![\n\r\f\v] .))* '"' |
Indeed, you are right, this is a valid graph and correct behaviour! So maybe the issue of the multi-line construction yielding a different construction can simply be fixed by putting every graph on one line. From my understanding of the notation, the difference of multi vs. one-line string is only visual/human. Do you think there is any real difference? Maybe the other issue, that is reading invalid graphs without warning, can be handled by checking if there is a mismatch of closing / Opening brackets, and then a warning can be issued. |
I agree that it is the correct behaviour to accept a graph like
I also agree that the presence or absence of LineFeeds should not result in a different parsing |
Interesting thought, but I don't think such preprocessing would be very useful. If we want to allow unescaped newlines within string values, the operation would be destructive. If we don't, then we don't gain much because the issue only occurs with unterminated string values.
No, in general there is no difference in meaning when a graph is presented on one line or multiple, but whitespace within string values is meaningful.
I'm not ready to agree here because we first need to decide what to do with unescaped newlines within quoted strings. We are currently in a limbo situation because the grammar and lexer do not rule out newlines, but the parser operates on one line at a time, so multiline strings are implicitly disallowed. Arguments in favor of multiline string values:
Arguments against multiline string values:
While I'm not opposed to allowing Penman to represent graphs beyond what's possible in AMR, I'm also not inclined to add additional capabilities without a use case. Considering the above, I'm leaning more toward explicitly disallowing multiline string values for now.
This is already possible: >>> penman.decode(r'(s / say-01 :op1 "say, \"cheese\"!")').triples
[('s', ':instance', 'say-01'), ('s', ':op1', '"say, \\"cheese\\"!"')]
Yes, I think something like this could be used to optionally allow multiline strings in the future. |
Everything what you said makes sense to me @goodmami . It's kind of funny that these edge cases are so extremely rare, but still, some day they do occur. At a certain level of detail there may not even be a proper formulation possible anymore of what is "acceptable" and what is not "acceptable" in the Penman notation. Also interesting is for me, what do in the case of "not acceptable", as may occur when evaluating a parser. |
Hi,
and
will produce different results/error messages ? |
Apologies for the delay. See #146.
Same results, slightly different error messages, because in the latter it tries to match the second quotation mark with the first: >>> penman.decode("""(b / bear-02
... :tense Past
... :ARG1 (p / person
... :name (n / name
... :op2 "))
... :time (d / date-entity
... :year 1980)
... :location (c / city
... :name (n2 / name
... :op1 "London")))""")
Traceback (most recent call last):
[...]
penman.exceptions.DecodeError:
line 5
:op2 "))
^
DecodeError: Expected: SYMBOL, STRING, LPAREN versus >>> penman.decode("""(b / bear-02 :tense Past :ARG1 (p / person :name (n / name :op2 ")) :time (d / date-entity :year 1980) location (c / city :name (n2 / name :op1 "London")))""")
Traceback (most recent call last):
[...]
penman.exceptions.DecodeError:
line 1
(b / bear-02 :tense Past :ARG1 (p / person :name (n / name :op2 ")) :time (d / date-entity :year 1980) location (c / city :name (n2 / name :op1 "London")))
^
DecodeError: Expected: ROLE You will see a difference in the following, though, even though the only difference is whitespace:
and
This is because multiline strings are not allowed. The code already treated them as such, but the PEG grammar in the documentation did not explicitly say so. It now does. |
Hi,
came across an invalid penman aMR graph (created by a badly trained parser)
which get parsed without problem by
penman.parse()
andpenman.decode()
. Shouldn't we see here an exception since the:op2 "
is clearly wrong?A played with this graph and extended it a bit
This time the error got detected, but only if I replace all linefeeds by a space and use
I definitly think something's odd here. This was tested with version 1.3.0
The text was updated successfully, but these errors were encountered: