Swallowing up text in the parser #122

stuartlangridge · 2020-07-06T23:08:43Z

parglare version: master
Python version: 3.8.2
Operating System: Ubuntu 20.04

I have a document which contains a heading, which is a quoted string, and then a series of "sentences" which end with a "." and may have newlines in. I'd like to parse the document into Heading and Sentences. I tried to do it this way:

import parglare

grammar = r"""
Document: Heading Body;

Heading: QuotedString;
Body: Anything;

Sentence: Anything DOT;

terminals

QuotedString: /"(?P<qs>.*?)"/;
Anything: /.*/;
DOT: ".";
"""

text = """

"This is the heading"

This is sentence one.
This is sentence two
which has newlines in.
"""

g = parglare.Grammar.from_string(grammar)
p = parglare.Parser(g, debug=True)
result = p.parse(text)

However, this fails with parglare.exceptions.ParseError: Error at 6:0:"ence one.\n **> This is se" => Expected: DOT but found <Anything(This is sentence two)>.

All I care about is the Heading, and parsing the Body into separate sentences, but I can't work out how to do that; what's the best way to express this in a parglare grammar? The sentences can contain anything at all; I don't need a structure or parsing for them at this stage, just a list with ["This is sentence one.", "This is sentence two which has newlines in."] as the return; sentences might contain any characters at all.

(Apologies if this isn't actually an issue, but I hope it's the best place to ask questions about parglare. I'm happy to ask it somewhere else if that's better.)

igordejanovic · 2020-07-07T09:04:59Z

In Python regexes, . does not match newline characters by default. To change that, you can use the (?s) inline flag (see re.DOTALL in the Python docs). So your grammar will work correctly with this:

Anything: /(?s).*/;
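
To see the difference outside parglare, here is a minimal sketch using only Python's standard re module. The sample string is taken from the body text in the question; the variable names are illustrative only.

```python
import re

text = "This is sentence two\nwhich has newlines in."

# Without DOTALL, "." stops at the first newline.
plain = re.match(r".*", text).group(0)

# With the inline (?s) flag, "." also matches "\n",
# so the pattern runs to the end of the string.
dotall = re.match(r"(?s).*", text).group(0)

print(repr(plain))
print(repr(dotall))
```

The first match covers only the first line, while the second covers the whole string, which is the behaviour the Anything terminal needs in order to cross line boundaries.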

igordejanovic · 2020-07-07T09:05:34Z

BTW, here is the right place to ask questions about parglare.

stuartlangridge · 2020-07-07T09:26:02Z

Ah, now, I tried (?s) (this bug report was originally going to mention DOTALL until I actually read the re documentation and discovered the inline (?s) version, which I didn't know existed :-)) but when I tried it I still got errors, presumably because I don't quite understand it. Example:

import parglare

grammar = r"""
Program: al=AuthorLine sentences=Sentences;
AuthorLine: title=Identifier "by" author=Identifier DOT;

Sentences: Sentence*;
Sentence: Anything DOT;
Identifier: IdentifierWord*;

terminals

IdentifierWord: /\w+/;
DOT: ".";
Anything: /(?s).*?/;
"""

text = """
Program by Stuart.

This is sentence one.
This is sentence two
which has newlines in.
"""

g = parglare.Grammar.from_string(grammar)
p = parglare.Parser(g, debug=True)
result = p.parse(text)

This fails with error:
parglare.exceptions.ParseError: Error at 4:0:" Stuart.\n\n **> This is se" => Expected: Anything or STOP but found <IdentifierWord(This)>

I don't know how to tell parglare "just swallow up the rest of the document, I don't care about parsing it", or "please only detect an IdentifierWord in the context of an AuthorLine and once you've got the AuthorLine, stop parsing" -- I can't boost or decrease the relevance of IdentifierWord with {1} or {99} because it's a terminal, and even then I want to boost it while parsing an AuthorLine and decrease it when not, which I don't understand how to do. Maybe I'm attacking this problem completely the wrong way?

igordejanovic · 2020-07-07T10:15:47Z

The problem is that Anything collects... well, anything, even dots :) so the Sentence rule never matches, as it expects DOT after Anything. You can do this:

Anything: /(?s)[^\.]*/;

which means Anything is anything except dot.
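
As a rough illustration of why excluding the dot helps, the sketch below applies the same pattern with Python's standard re module rather than parglare. The regex combines the Anything pattern with a literal dot standing in for DOT; the whitespace normalisation and variable names are assumptions added for the example and are not part of the grammar. Note that a negated character class such as [^\.] already matches newlines, so (?s) is harmless here but not strictly required.

```python
import re

body = "\nThis is sentence one.\nThis is sentence two\nwhich has newlines in.\n"

# A run of non-dot characters (like Anything) followed by a literal dot (like DOT).
sentence = re.compile(r"(?s)([^\.]*)\.")

# Collapse internal whitespace and newlines, then put the dot back.
sentences = [" ".join(m.group(1).split()) + "." for m in sentence.finditer(body)]

print(sentences)
```

Because the non-dot run can no longer consume the terminating ".", each match ends at the first dot it meets, which yields one entry per sentence.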

Another feature you might find useful, depending on what you are trying to achieve, is incomplete parsing.

stuartlangridge · 2020-07-07T10:21:54Z

Incomplete parsing looks like exactly what I want! Thank you!

igordejanovic added the question label Jul 7, 2020
