Incorrect graphs not always corrected #143

jheinecke · 2024-06-07T13:04:19Z

Hi,

came across an invalid penman aMR graph (created by a badly trained parser)

(b / bear-02
    :tense Past
    :ARG1 (p / person
        :name (n / name
            :op2 "))
    :time (d / date-entity
        :year 1980))

which get parsed without problem by penman.parse() and penman.decode(). Shouldn't we see here an exception since the :op2 " is clearly wrong?

A played with this graph and extended it a bit

(b / bear-02
    :tense Past
    :ARG1 (p / person
        :name (n / name
            :op2 "))
    :time (d / date-entity
        :year 1980)
    :location (c / city
        :name (n2 / name
            :op1 "London")))

This time the error got detected, but only if I replace all linefeeds by a space and use

(b / bear-02    :tense Past    :ARG1 (p / person        :name (n / name            :op2 "))    :time (d / date-entity        :year 1980)    :location (c / city         :name (n2 / name             :op1 "London")))

I definitly think something's odd here. This was tested with version 1.3.0

The text was updated successfully, but these errors were encountered:

flipz357 · 2024-06-07T17:15:00Z

I wonder if this is the same issue as this issue here, where someone fed an invalid AMR into smatchpp and remarked that the script crashed.

Indeed, this AMR seems to have the same problem with the ":

(o / offer-01
      :ARG0 (c / country
            :name (n / name
                  :op1 "France"))
      :ARG1 (a / and
            :op1 (a2 / aircraft-type
                  :name (n2 / name
                        :op1 "Mirage"
                        :op2 " ) :poss country_0 :ARG1-of ( upgrade-01 ) ) :op2 ( transfer-01 :ARG0 country_0 :ARG1 ( technology ) :ARG2 ( country_1 :name ( name_2 :op1 "))))

I just checked this now with penman, and indeed it seems this is the same issue. The graph is also read without warning/issue, even though it is actually invalid.

[Edit] For reference:

>>> import penman
>>> penman.__version__
'1.3.0'
>>> penman.decode("""(o / offer-01 :ARG0 (c / country :name (n / name :op1 "France")) :ARG1 (a / and :op1 (a2 / aircraft-type :name (n2 / name :op1 "Mirage" :op2 " ) :poss country_0 :ARG1-of ( upgrade-01 ) ) :op2 ( transfer-01 :ARG0 country_0 :ARG1 ( technology ) :ARG2 ( country_1 :name ( name_2 :op1 "))))""")
<Graph object (top=o) at 139786655128928>

goodmami · 2024-06-09T20:18:20Z

Thanks for raising this. There are a few things going on here. To help with explaining, let's assume the following function:

def list_triples(s: str) -> None:
    for t in penman.decode(s).triples:
        print(t)

The main thing is that the lexer (and grammar) are not sufficiently constrained, so when it first sees a " it tries to parse a string and consumes tokens until the end of the line or end of the string and fails to find a matching ", so it next tries to parse it as an unquoted symbol. In this case, the production for symbols does not exclude ", so it succeeds. Here are the relevant parts of the grammar:

Constant  <- String / Symbol
Symbol    <- NameChar+
[...]
String    <- '"' (!'"' ('\\' . / .))* '"'
NameChar  <- ![ \n\t\r\f\v()/:~] .

This explains the following:

>>> list_triples('(a / alpha :op1 " :ARG0 (b / beta))')
('a', ':instance', 'alpha')
('a', ':op1', '"')
('a', ':ARG0', 'b')
('b', ':instance', 'beta')

If you have two quotes on a line, it will parse it as a string, regardless if, as a human, it appears to be two solitary quote symbols:

>>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 "))')
('a', ':instance', 'alpha')
('a', ':op1', '" :ARG0 (b / beta) :op2 "')

The above is a perfectly valid PENMAN graph, even if the string " :ARG0 (b / beta) :op2 " is an unlikely value for :op1.

The next issue is that the lexer for Penman operates on lines, and this fact is not captured by the grammar. This is why the following parses the two quotes as separate symbols instead of a large multiline string, even though it seems to be the same graph as above:

>>> list_triples('''
... (a / alpha
...    :op1 "
...    :ARG0 (b / beta)
...    :op2 "))''')
('a', ':instance', 'alpha')
('a', ':op1', '"')
('a', ':ARG0', 'b')
('b', ':instance', 'beta')
('a', ':op2', '"')

Finally, if you have a solitary quote followed by a normal string on the same line, it will fail because it will initially succeed in matching the first solitary quote with the starting quote of the string, then the following string contents might not be valid PENMAN:

>>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 "gamma"))')
Traceback (most recent call last):
[...]
penman.exceptions.DecodeError: 
  line 1
    (a / alpha :op1 " :ARG0 (b / beta) :op2 "gamma"))
                                             ^
DecodeError: Expected: ROLE

This would work, however, if the contents of the string were appropriate in that context (this would be similar to an SQL injection):

>>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 " :op3 "))')
('a', ':instance', 'alpha')
('a', ':op1', '" :ARG0 (b / beta) :op2 "')
('a', ':op3', '"')

Note that the above abuses the fact that the last " is parsed as a symbol; it probably wouldn't work otherwise.

So I think we could change the grammar to better match the expected behavior by simply adding a " as forbidden in NameChar (symbol) productions:

- NameChar  <- ![ \n\t\r\f\v()/:~] .
+ NameChar  <- ![ \n\t\r\f\v"()/:~] .

To be explicit about the line-based parsing, we might also constrain the string characters:

- String    <- '"' (!'"' ('\\' . / .))* '"'
+ String    <- '"' (!'"' ('\\' . / ![\n\r\f\v] .))* '"'

flipz357 · 2024-06-10T06:23:24Z

>>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 "))')
('a', ':instance', 'alpha')
('a', ':op1', '" :ARG0 (b / beta) :op2 "')

Indeed, you are right, this is a valid graph and correct behaviour!

So maybe the issue of the multi-line construction yielding a different construction can simply be fixed by putting every graph on one line. From my understanding of the notation, the difference of multi vs. one-line string is only visual/human. Do you think there is any real difference?

Maybe the other issue, that is reading invalid graphs without warning, can be handled by checking if there is a mismatch of closing / Opening brackets, and then a warning can be issued.

jheinecke · 2024-06-10T06:37:08Z

I agree that it is the correct behaviour to accept a graph like

(b / bear-02
    :tense Past
    :ARG1 (p / person
        :name (n / name
            :op2 "))
    :time (d / date-entity
        :year 1980)
    :location (c / city
        :name (n2 / name
            :op1 ")))

I also agree that the presence or absence of LineFeeds should not result in a different parsing
However, I think the lexer should reject literals which have only a single quote (as in this example) or graphs with an impair number of quotes. In addition quotes which may appear in a string should be escaped (in any case the AMR 3.0 data does not contains literals with quotes, only quotes around string values. I did not check the BIOAMR corpus, though). Another solution could be to add an optional parameter to the parse() and decode() methods which enforces or not a stricter parsing of quotes ?

goodmami · 2024-06-11T03:20:30Z

So maybe the issue of the multi-line construction yielding a different construction can simply be fixed by putting every graph on one line.

Interesting thought, but I don't think such preprocessing would be very useful. If we want to allow unescaped newlines within string values, the operation would be destructive. If we don't, then we don't gain much because the issue only occurs with unterminated string values.

From my understanding of the notation, the difference of multi vs. one-line string is only visual/human. Do you think there is any real difference?

No, in general there is no difference in meaning when a graph is presented on one line or multiple, but whitespace within string values is meaningful.

I agree that it is the correct behaviour to accept a graph like (b / bear-02 ...

I'm not ready to agree here because we first need to decide what to do with unescaped newlines within quoted strings. We are currently in a limbo situation because the grammar and lexer do not rule out newlines, but the parser operates on one line at a time, so multiline strings are implicitly disallowed.

Arguments in favor of multiline string values:

➕ More expressive power
?

Arguments against multiline string values:

❌ Currently not used (perhaps due to lack of tool support)? I also could not find any discussion or requests on the AMR Guidelines repo specifically about multiline strings.
❌ Might miss some erroneous AMRs like those above because they successfully parse differently than intended
❌ The AMR Editor does not allow you to create multiline string values
❌ Nathan Schneider's comment here (emphasis added, link updated)

That said, the string-entity docs here only show an :op1 and it can contain spaces within the double quotes. As long as it's all on one line I think it should be fine.

While I'm not opposed to allowing Penman to represent graphs beyond what's possible in AMR, I'm also not inclined to add additional capabilities without a use case. Considering the above, I'm leaning more toward explicitly disallowing multiline string values for now.

In addition quotes which may appear in a string should be escaped

This is already possible:

>>> penman.decode(r'(s / say-01 :op1 "say, \"cheese\"!")').triples
[('s', ':instance', 'say-01'), ('s', ':op1', '"say, \\"cheese\\"!"')]

Another solution could be to add an optional parameter to the parse() and decode() methods which enforces or not a stricter parsing of quotes ?

Yes, I think something like this could be used to optionally allow multiline strings in the future.

flipz357 · 2024-06-11T08:59:36Z

Everything what you said makes sense to me @goodmami . It's kind of funny that these edge cases are so extremely rare, but still, some day they do occur. At a certain level of detail there may not even be a proper formulation possible anymore of what is "acceptable" and what is not "acceptable" in the Penman notation. Also interesting is for me, what do in the case of "not acceptable", as may occur when evaluating a parser.

jheinecke · 2024-08-05T09:43:57Z

Hi,
so what is the conclusion ? Will

(b / bear-02
    :tense Past
    :ARG1 (p / person
        :name (n / name
            :op2 "))
    :time (d / date-entity
        :year 1980)
    :location (c / city
        :name (n2 / name
            :op1 "London")))

and

(b / bear-02    :tense Past    :ARG1 (p / person        :name (n / name            :op2 "))    :time (d / date-entity        :year 1980)    location (c / city        :name (n2 / name            :op1 "London")))

will produce different results/error messages ?

goodmami · 2024-08-07T05:54:49Z

Apologies for the delay. See #146.

Will ... and ... produce different results/error messages ?

Same results, slightly different error messages, because in the latter it tries to match the second quotation mark with the first:

>>> penman.decode("""(b / bear-02
...     :tense Past
...     :ARG1 (p / person
...         :name (n / name
...             :op2 "))
...     :time (d / date-entity
...         :year 1980)
...     :location (c / city
...         :name (n2 / name
...             :op1 "London")))""")
Traceback (most recent call last):
  [...]
penman.exceptions.DecodeError: 
  line 5
                :op2 "))
                     ^
DecodeError: Expected: SYMBOL, STRING, LPAREN

versus

>>> penman.decode("""(b / bear-02    :tense Past    :ARG1 (p / person        :name (n / name            :op2 "))    :time (d / date-entity        :year 1980)    location (c / city        :name (n2 / name            :op1 "London")))""")
Traceback (most recent call last):
  [...]
penman.exceptions.DecodeError: 
  line 1
    (b / bear-02    :tense Past    :ARG1 (p / person        :name (n / name            :op2 "))    :time (d / date-entity        :year 1980)    location (c / city        :name (n2 / name            :op1 "London")))
                                                                                                                                                                                                            ^
DecodeError: Expected: ROLE

You will see a difference in the following, though, even though the only difference is whitespace:

(a / alpha
   :op1 "
   :op2 ")

and

(a / alpha :op1 " :op2 ")

This is because multiline strings are not allowed. The code already treated them as such, but the PEG grammar in the documentation did not explicitly say so. It now does.

goodmami mentioned this issue Aug 7, 2024

Disallow quotes outside strings and multiline strings #146

Merged

goodmami closed this as completed in 0dfbcf3 Aug 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect graphs not always corrected #143

Incorrect graphs not always corrected #143

jheinecke commented Jun 7, 2024

flipz357 commented Jun 7, 2024 •

edited

Loading

goodmami commented Jun 9, 2024

flipz357 commented Jun 10, 2024

jheinecke commented Jun 10, 2024

goodmami commented Jun 11, 2024

flipz357 commented Jun 11, 2024

jheinecke commented Aug 5, 2024

goodmami commented Aug 7, 2024

Incorrect graphs not always corrected #143

Incorrect graphs not always corrected #143

Comments

jheinecke commented Jun 7, 2024

flipz357 commented Jun 7, 2024 • edited Loading

goodmami commented Jun 9, 2024

flipz357 commented Jun 10, 2024

jheinecke commented Jun 10, 2024

goodmami commented Jun 11, 2024

flipz357 commented Jun 11, 2024

jheinecke commented Aug 5, 2024

goodmami commented Aug 7, 2024

flipz357 commented Jun 7, 2024 •

edited

Loading