
Text fragmentation/segmentation based on formal grammar #34

Open
akolonin opened this issue Jul 30, 2020 · 0 comments
Labels: enhancement, help wanted, on hold

@akolonin (Member)

Based on the progress with issue #22, we want to use the formal grammar to identify sentence boundaries in token (word) streams in two cases:

  1. When the token (word) stream is provided by a speech recognition engine.
  2. When the token (word) stream is provided by an HTML stripper applied to HTML texts where the natural language sentences are separated not by conventional periods, exclamation marks and question marks, but by arbitrary HTML tags with custom styles applied to them.

The solution would have at least two applications:
A) Split the stream of tokens/words into sentences for further linguistic processing such as parsing and entity extraction.
B) Split the stream of tokens/words into sentences for selecting "featured" sentences containing "hot" keywords for summarization purposes.
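
As a rough illustration of the idea (not the actual algorithm in Segment.java), a greedy segmenter could keep extending the current sentence while the token prefix still parses under the grammar, and emit a boundary as soon as appending the next token would make it unparseable. The Grammar interface and its isSentence method below are hypothetical placeholders for whatever formal-grammar parser gets plugged in (e.g. a Link Grammar dictionary):

```java
import java.util.ArrayList;
import java.util.List;

public class GreedySegmenter {

    /** Hypothetical hook for a formal-grammar parser (e.g. Link Grammar). */
    public interface Grammar {
        /** Returns true if the token sequence forms a complete, grammatical sentence. */
        boolean isSentence(List<String> tokens);
    }

    /**
     * Greedily grows a candidate sentence token by token; whenever the candidate
     * parses as a complete sentence and adding the next token would break the parse,
     * a boundary is emitted. A sketch of the idea only, not the Segment.java algorithm.
     */
    public static List<List<String>> segment(List<String> stream, Grammar grammar) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < stream.size(); i++) {
            current.add(stream.get(i));
            boolean lastToken = (i == stream.size() - 1);
            List<String> extended = null;
            if (!lastToken) {
                extended = new ArrayList<>(current);
                extended.add(stream.get(i + 1));
            }
            // Emit a boundary when the current candidate parses but its extension would not.
            if (grammar.isSentence(current) && (lastToken || !grammar.isSentence(extended))) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens that never formed a full sentence
        }
        return sentences;
    }
}
```

A real implementation would also have to handle ambiguity (a grammatical prefix may still be extendable into a longer grammatical sentence), so some lookahead or scoring over alternative split points is probably unavoidable.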

Initial progress has been made with
https://github.com/aigents/aigents-java-nlp/blob/master/src/main/java/org/aigents/nlp/gen/Segment.java
in
aigents/aigents-java-nlp#11

Still, there is more work to do to improve the accuracy.

For testing purposes, we can use (for example) the SingularityNET extract from the Gutenberg Children corpus used in the Unsupervised Language Learning project, starting from the "cleaned" corpus: http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/capital/
Then we create an "extra-cleaned" corpus by removing all sentences containing quotes and brackets like [ ] ( ) { } ' " as well as all sentences with inner periods like "CHAPTER I. A NEW DEPARTURE".
Then we glue the remaining sentences together on a per-file or per-chapter basis and evaluate accuracy based on the number of correctly identified sentence boundaries, e.g. along the lines of the sketch below.
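
As a sketch of what that evaluation could look like, assuming one sentence per line in the "cleaned" files, stripping the terminal punctuation so the segmenter cannot cheat, and counting a predicted boundary as correct when it coincides with a gold sentence end (the input file name is a placeholder, and the predicted boundaries would come from the segmenter under test, e.g. Segment.java):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BoundaryEvaluation {

    public static void main(String[] args) throws IOException {
        // "Extra-clean" the corpus: drop sentences with quotes/brackets or inner periods.
        List<String> sentences = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("cleaned_chapter.txt"))) { // placeholder path
            String s = line.trim();
            if (s.isEmpty()) continue;
            if (s.matches(".*[\\[\\](){}'\"].*")) continue;   // drop sentences with quotes or brackets
            String body = s.replaceAll("[.!?]+$", "");        // strip terminal punctuation before gluing
            if (body.contains(".")) continue;                 // drop sentences with inner periods
            sentences.add(body);
        }

        // Glue the surviving sentences into one token stream and record the gold boundaries.
        List<String> stream = new ArrayList<>();
        Set<Integer> goldBoundaries = new HashSet<>();
        for (String s : sentences) {
            stream.addAll(Arrays.asList(s.split("\\s+")));
            goldBoundaries.add(stream.size()); // boundary index right after the sentence's last token
        }

        // Predicted boundaries would come from the segmenter under test (placeholder here).
        Set<Integer> predictedBoundaries = new HashSet<>();

        int correct = 0;
        for (Integer b : predictedBoundaries) {
            if (goldBoundaries.contains(b)) correct++;
        }
        double precision = predictedBoundaries.isEmpty() ? 0 : (double) correct / predictedBoundaries.size();
        double recall = goldBoundaries.isEmpty() ? 0 : (double) correct / goldBoundaries.size();
        System.out.printf("precision=%.3f recall=%.3f%n", precision, recall);
    }
}
```

Reporting precision and recall over boundary positions (rather than a single accuracy number) makes it easier to see whether the segmenter over-splits or under-splits the glued stream.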

Alternative corpora for testing, and comparison against baseline results achieved by other authors, may be considered as well.

References:
https://www.researchgate.net/publication/321227216_Text_Segmentation_Techniques_A_Critical_Review
https://www.google.com/search?q=natural+language+segmentation%20papers

akolonin added the enhancement and progress labels on Jul 30, 2020
akolonin added the help wanted and on hold labels and removed the progress label on Aug 31, 2020