
Text fragmentation/segmentation based on formal grammar #34

Open
akolonin opened this issue Jul 30, 2020 · 0 comments
Labels: enhancement, help wanted, on hold

@akolonin (Member)

Based on the progress with issue #22, we want to use the formal grammar to identify sentence boundaries in token (word) streams in two cases:

  1. When the token (word) stream is provided by a speech recognition engine.
  2. When the token (word) stream is provided by an HTML stripper applied to HTML texts where the natural language sentences are separated not by conventional periods, exclamation marks and question marks, but by arbitrary HTML tags with custom styles applied to them.

The solution would have at least two applications:
A) Split the stream of tokens/words into sentences for further linguistic processing such as parsing and entity extraction.
B) Split the stream of tokens/words into sentences for selecting "featured" sentences containing "hot" keywords for summarization purposes.
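
As a rough illustration of the idea (not the actual algorithm in Segment.java), a greedy segmenter could keep extending the current sentence while the token prefix still parses under the grammar, and emit a boundary as soon as appending the next token would make it unparseable. The Grammar interface and its isSentence method below are hypothetical placeholders for whatever formal-grammar parser gets plugged in (e.g. a Link Grammar dictionary):

```java
import java.util.ArrayList;
import java.util.List;

public class GreedySegmenter {

    /** Hypothetical hook for a formal-grammar parser (e.g. Link Grammar). */
    public interface Grammar {
        /** Returns true if the token sequence forms a complete, grammatical sentence. */
        boolean isSentence(List<String> tokens);
    }

    /**
     * Greedily grows a candidate sentence token by token; whenever the candidate
     * parses as a complete sentence and adding the next token would break the parse,
     * a boundary is emitted. A sketch of the idea only, not the Segment.java algorithm.
     */
    public static List<List<String>> segment(List<String> stream, Grammar grammar) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < stream.size(); i++) {
            current.add(stream.get(i));
            boolean lastToken = (i == stream.size() - 1);
            List<String> extended = null;
            if (!lastToken) {
                extended = new ArrayList<>(current);
                extended.add(stream.get(i + 1));
            }
            // Emit a boundary when the current candidate parses but its extension would not.
            if (grammar.isSentence(current) && (lastToken || !grammar.isSentence(extended))) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens that never formed a full sentence
        }
        return sentences;
    }
}
```

A real implementation would also have to handle ambiguity (a grammatical prefix may still be extendable into a longer grammatical sentence), so some lookahead or scoring over alternative split points is probably unavoidable.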

Initial progress has been made with
https://github.com/aigents/aigents-java-nlp/blob/master/src/main/java/org/aigents/nlp/gen/Segment.java
in
aigents/aigents-java-nlp#11

Still, there is more work to do to improve the accuracy.

For testing purposes, we can use (for example) the SingularityNET extract from the Gutenberg Children corpus used in the Unsupervised Language Learning project, starting from the "cleaned" corpus: http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/capital/
Then we create an "extra-cleaned" corpus by removing all sentences containing quotes and brackets like [ ] ( ) { } ' " as well as all sentences with inner periods like "CHAPTER I. A NEW DEPARTURE".
Then we glue the remaining sentences together on a per-file or per-chapter basis and evaluate accuracy based on the number of correctly identified sentence boundaries, e.g. along the lines of the sketch below.
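
As a sketch of what that evaluation could look like, assuming one sentence per line in the "cleaned" files, stripping the terminal punctuation so the segmenter cannot cheat, and counting a predicted boundary as correct when it coincides with a gold sentence end (the input file name is a placeholder, and the predicted boundaries would come from the segmenter under test, e.g. Segment.java):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BoundaryEvaluation {

    public static void main(String[] args) throws IOException {
        // "Extra-clean" the corpus: drop sentences with quotes/brackets or inner periods.
        List<String> sentences = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("cleaned_chapter.txt"))) { // placeholder path
            String s = line.trim();
            if (s.isEmpty()) continue;
            if (s.matches(".*[\\[\\](){}'\"].*")) continue;   // drop sentences with quotes or brackets
            String body = s.replaceAll("[.!?]+$", "");        // strip terminal punctuation before gluing
            if (body.contains(".")) continue;                 // drop sentences with inner periods
            sentences.add(body);
        }

        // Glue the surviving sentences into one token stream and record the gold boundaries.
        List<String> stream = new ArrayList<>();
        Set<Integer> goldBoundaries = new HashSet<>();
        for (String s : sentences) {
            stream.addAll(Arrays.asList(s.split("\\s+")));
            goldBoundaries.add(stream.size()); // boundary index right after the sentence's last token
        }

        // Predicted boundaries would come from the segmenter under test (placeholder here).
        Set<Integer> predictedBoundaries = new HashSet<>();

        int correct = 0;
        for (Integer b : predictedBoundaries) {
            if (goldBoundaries.contains(b)) correct++;
        }
        double precision = predictedBoundaries.isEmpty() ? 0 : (double) correct / predictedBoundaries.size();
        double recall = goldBoundaries.isEmpty() ? 0 : (double) correct / goldBoundaries.size();
        System.out.printf("precision=%.3f recall=%.3f%n", precision, recall);
    }
}
```

Reporting precision and recall over boundary positions (rather than a single accuracy number) makes it easier to see whether the segmenter over-splits or under-splits the glued stream.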

Alternative corpora for testing, and comparison against baseline results achieved by other authors, may be considered as well.

References:
https://www.researchgate.net/publication/321227216_Text_Segmentation_Techniques_A_Critical_Review
https://www.google.com/search?q=natural+language+segmentation%20papers

akolonin added the enhancement and progress labels on Jul 30, 2020
akolonin added the help wanted and on hold labels and removed the progress label on Aug 31, 2020