Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect graphs not always corrected #143

Closed
jheinecke opened this issue Jun 7, 2024 · 8 comments
Closed

Incorrect graphs not always corrected #143

jheinecke opened this issue Jun 7, 2024 · 8 comments

Comments

@jheinecke
Copy link

Hi,

came across an invalid penman aMR graph (created by a badly trained parser)

(b / bear-02
    :tense Past
    :ARG1 (p / person
        :name (n / name
            :op2 "))
    :time (d / date-entity
        :year 1980))

which get parsed without problem by penman.parse() and penman.decode(). Shouldn't we see here an exception since the :op2 " is clearly wrong?

A played with this graph and extended it a bit

(b / bear-02
    :tense Past
    :ARG1 (p / person
        :name (n / name
            :op2 "))
    :time (d / date-entity
        :year 1980)
    :location (c / city
        :name (n2 / name
            :op1 "London")))

This time the error got detected, but only if I replace all linefeeds by a space and use

(b / bear-02    :tense Past    :ARG1 (p / person        :name (n / name            :op2 "))    :time (d / date-entity        :year 1980)    :location (c / city         :name (n2 / name             :op1 "London")))

I definitly think something's odd here. This was tested with version 1.3.0

@flipz357
Copy link

flipz357 commented Jun 7, 2024

I wonder if this is the same issue as this issue here, where someone fed an invalid AMR into smatchpp and remarked that the script crashed.

Indeed, this AMR seems to have the same problem with the ":

(o / offer-01
      :ARG0 (c / country
            :name (n / name
                  :op1 "France"))
      :ARG1 (a / and
            :op1 (a2 / aircraft-type
                  :name (n2 / name
                        :op1 "Mirage"
                        :op2 " ) :poss country_0 :ARG1-of ( upgrade-01 ) ) :op2 ( transfer-01 :ARG0 country_0 :ARG1 ( technology ) :ARG2 ( country_1 :name ( name_2 :op1 "))))

I just checked this now with penman, and indeed it seems this is the same issue. The graph is also read without warning/issue, even though it is actually invalid.

[Edit] For reference:

>>> import penman
>>> penman.__version__
'1.3.0'
>>> penman.decode("""(o / offer-01 :ARG0 (c / country :name (n / name :op1 "France")) :ARG1 (a / and :op1 (a2 / aircraft-type :name (n2 / name :op1 "Mirage" :op2 " ) :poss country_0 :ARG1-of ( upgrade-01 ) ) :op2 ( transfer-01 :ARG0 country_0 :ARG1 ( technology ) :ARG2 ( country_1 :name ( name_2 :op1 "))))""")
<Graph object (top=o) at 139786655128928>

@goodmami
Copy link
Owner

goodmami commented Jun 9, 2024

Thanks for raising this. There are a few things going on here. To help with explaining, let's assume the following function:

def list_triples(s: str) -> None:
    for t in penman.decode(s).triples:
        print(t)

The main thing is that the lexer (and grammar) are not sufficiently constrained, so when it first sees a " it tries to parse a string and consumes tokens until the end of the line or end of the string and fails to find a matching ", so it next tries to parse it as an unquoted symbol. In this case, the production for symbols does not exclude ", so it succeeds. Here are the relevant parts of the grammar:

Constant  <- String / Symbol
Symbol    <- NameChar+
[...]
String    <- '"' (!'"' ('\\' . / .))* '"'
NameChar  <- ![ \n\t\r\f\v()/:~] .

This explains the following:

>>> list_triples('(a / alpha :op1 " :ARG0 (b / beta))')
('a', ':instance', 'alpha')
('a', ':op1', '"')
('a', ':ARG0', 'b')
('b', ':instance', 'beta')

If you have two quotes on a line, it will parse it as a string, regardless if, as a human, it appears to be two solitary quote symbols:

>>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 "))')
('a', ':instance', 'alpha')
('a', ':op1', '" :ARG0 (b / beta) :op2 "')

The above is a perfectly valid PENMAN graph, even if the string " :ARG0 (b / beta) :op2 " is an unlikely value for :op1.

The next issue is that the lexer for Penman operates on lines, and this fact is not captured by the grammar. This is why the following parses the two quotes as separate symbols instead of a large multiline string, even though it seems to be the same graph as above:

>>> list_triples('''
... (a / alpha
...    :op1 "
...    :ARG0 (b / beta)
...    :op2 "))''')
('a', ':instance', 'alpha')
('a', ':op1', '"')
('a', ':ARG0', 'b')
('b', ':instance', 'beta')
('a', ':op2', '"')

Finally, if you have a solitary quote followed by a normal string on the same line, it will fail because it will initially succeed in matching the first solitary quote with the starting quote of the string, then the following string contents might not be valid PENMAN:

>>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 "gamma"))')
Traceback (most recent call last):
[...]
penman.exceptions.DecodeError: 
  line 1
    (a / alpha :op1 " :ARG0 (b / beta) :op2 "gamma"))
                                             ^
DecodeError: Expected: ROLE

This would work, however, if the contents of the string were appropriate in that context (this would be similar to an SQL injection):

>>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 " :op3 "))')
('a', ':instance', 'alpha')
('a', ':op1', '" :ARG0 (b / beta) :op2 "')
('a', ':op3', '"')

Note that the above abuses the fact that the last " is parsed as a symbol; it probably wouldn't work otherwise.

So I think we could change the grammar to better match the expected behavior by simply adding a " as forbidden in NameChar (symbol) productions:

- NameChar  <- ![ \n\t\r\f\v()/:~] .
+ NameChar  <- ![ \n\t\r\f\v"()/:~] .

To be explicit about the line-based parsing, we might also constrain the string characters:

- String    <- '"' (!'"' ('\\' . / .))* '"'
+ String    <- '"' (!'"' ('\\' . / ![\n\r\f\v] .))* '"'

@flipz357
Copy link

>>> list_triples('(a / alpha :op1 " :ARG0 (b / beta) :op2 "))')
('a', ':instance', 'alpha')
('a', ':op1', '" :ARG0 (b / beta) :op2 "')

Indeed, you are right, this is a valid graph and correct behaviour!

So maybe the issue of the multi-line construction yielding a different construction can simply be fixed by putting every graph on one line. From my understanding of the notation, the difference of multi vs. one-line string is only visual/human. Do you think there is any real difference?

Maybe the other issue, that is reading invalid graphs without warning, can be handled by checking if there is a mismatch of closing / Opening brackets, and then a warning can be issued.

@jheinecke
Copy link
Author

I agree that it is the correct behaviour to accept a graph like

(b / bear-02
    :tense Past
    :ARG1 (p / person
        :name (n / name
            :op2 "))
    :time (d / date-entity
        :year 1980)
    :location (c / city
        :name (n2 / name
            :op1 ")))

I also agree that the presence or absence of LineFeeds should not result in a different parsing
However, I think the lexer should reject literals which have only a single quote (as in this example) or graphs with an impair number of quotes. In addition quotes which may appear in a string should be escaped (in any case the AMR 3.0 data does not contains literals with quotes, only quotes around string values. I did not check the BIOAMR corpus, though). Another solution could be to add an optional parameter to the parse() and decode() methods which enforces or not a stricter parsing of quotes ?

@goodmami
Copy link
Owner

So maybe the issue of the multi-line construction yielding a different construction can simply be fixed by putting every graph on one line.

Interesting thought, but I don't think such preprocessing would be very useful. If we want to allow unescaped newlines within string values, the operation would be destructive. If we don't, then we don't gain much because the issue only occurs with unterminated string values.

From my understanding of the notation, the difference of multi vs. one-line string is only visual/human. Do you think there is any real difference?

No, in general there is no difference in meaning when a graph is presented on one line or multiple, but whitespace within string values is meaningful.

I agree that it is the correct behaviour to accept a graph like (b / bear-02 ...

I'm not ready to agree here because we first need to decide what to do with unescaped newlines within quoted strings. We are currently in a limbo situation because the grammar and lexer do not rule out newlines, but the parser operates on one line at a time, so multiline strings are implicitly disallowed.

Arguments in favor of multiline string values:

  • ➕ More expressive power
  • ?

Arguments against multiline string values:

  • ❌ Currently not used (perhaps due to lack of tool support)? I also could not find any discussion or requests on the AMR Guidelines repo specifically about multiline strings.
  • ❌ Might miss some erroneous AMRs like those above because they successfully parse differently than intended
  • ❌ The AMR Editor does not allow you to create multiline string values
  • ❌ Nathan Schneider's comment here (emphasis added, link updated)

    That said, the string-entity docs here only show an :op1 and it can contain spaces within the double quotes. As long as it's all on one line I think it should be fine.

While I'm not opposed to allowing Penman to represent graphs beyond what's possible in AMR, I'm also not inclined to add additional capabilities without a use case. Considering the above, I'm leaning more toward explicitly disallowing multiline string values for now.

In addition quotes which may appear in a string should be escaped

This is already possible:

>>> penman.decode(r'(s / say-01 :op1 "say, \"cheese\"!")').triples
[('s', ':instance', 'say-01'), ('s', ':op1', '"say, \\"cheese\\"!"')]

Another solution could be to add an optional parameter to the parse() and decode() methods which enforces or not a stricter parsing of quotes ?

Yes, I think something like this could be used to optionally allow multiline strings in the future.

@flipz357
Copy link

Everything what you said makes sense to me @goodmami . It's kind of funny that these edge cases are so extremely rare, but still, some day they do occur. At a certain level of detail there may not even be a proper formulation possible anymore of what is "acceptable" and what is not "acceptable" in the Penman notation. Also interesting is for me, what do in the case of "not acceptable", as may occur when evaluating a parser.

@jheinecke
Copy link
Author

Hi,
so what is the conclusion ? Will

(b / bear-02
    :tense Past
    :ARG1 (p / person
        :name (n / name
            :op2 "))
    :time (d / date-entity
        :year 1980)
    :location (c / city
        :name (n2 / name
            :op1 "London")))

and

(b / bear-02    :tense Past    :ARG1 (p / person        :name (n / name            :op2 "))    :time (d / date-entity        :year 1980)    location (c / city        :name (n2 / name            :op1 "London")))

will produce different results/error messages ?

@goodmami
Copy link
Owner

goodmami commented Aug 7, 2024

Apologies for the delay. See #146.

Will ... and ... produce different results/error messages ?

Same results, slightly different error messages, because in the latter it tries to match the second quotation mark with the first:

>>> penman.decode("""(b / bear-02
...     :tense Past
...     :ARG1 (p / person
...         :name (n / name
...             :op2 "))
...     :time (d / date-entity
...         :year 1980)
...     :location (c / city
...         :name (n2 / name
...             :op1 "London")))""")
Traceback (most recent call last):
  [...]
penman.exceptions.DecodeError: 
  line 5
                :op2 "))
                     ^
DecodeError: Expected: SYMBOL, STRING, LPAREN

versus

>>> penman.decode("""(b / bear-02    :tense Past    :ARG1 (p / person        :name (n / name            :op2 "))    :time (d / date-entity        :year 1980)    location (c / city        :name (n2 / name            :op1 "London")))""")
Traceback (most recent call last):
  [...]
penman.exceptions.DecodeError: 
  line 1
    (b / bear-02    :tense Past    :ARG1 (p / person        :name (n / name            :op2 "))    :time (d / date-entity        :year 1980)    location (c / city        :name (n2 / name            :op1 "London")))
                                                                                                                                                                                                            ^
DecodeError: Expected: ROLE

You will see a difference in the following, though, even though the only difference is whitespace:

(a / alpha
   :op1 "
   :op2 ")

and

(a / alpha :op1 " :op2 ")

This is because multiline strings are not allowed. The code already treated them as such, but the PEG grammar in the documentation did not explicitly say so. It now does.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants