Based on the progress with issue #22, we want to use the formal grammar to identify sentence boundaries in token (word) streams in two cases (a minimal interface sketch follows this list):
When the token (word) stream is provided by the speech recognition engine.
When the token (word) stream is provided by the HTML stripper applied to HTML texts where the natural language sentences are delimited not by conventional periods, exclamation marks and question marks, but by arbitrary HTML tags with custom styles applied to them.
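For illustration, here is a minimal sketch of how such a detector could be shaped in Java; the interface and its names are hypothetical, not part of the existing aigents-java-nlp code:

```java
import java.util.List;

/**
 * Hypothetical interface for a grammar-based sentence boundary detector
 * working on a raw token (word) stream that carries no punctuation cues.
 */
public interface SentenceBoundaryDetector {

    /**
     * Splits the token stream into sentences, using the formal grammar to
     * decide where one sentence ends and the next one begins.
     *
     * @param tokens the token (word) stream, e.g. from a speech recognition
     *               engine or from an HTML stripper
     * @return the same tokens grouped into consecutive sentences
     */
    List<List<String>> split(List<String> tokens);
}
```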
The solution would have at least two applications:
A) Split the stream of tokens/words into sentences for further linguistic processing such as parsing and entity extraction.
B) Split the stream of tokens/words into sentences for selecting the "featured" sentences containing some "hot" keywords for summarization purposes.
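As an illustration of application B, here is a hypothetical helper (not part of the repository) that selects the sentences containing at least one of the "hot" keywords:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class FeaturedSentences {

    /**
     * Returns the "featured" sentences: those containing at least one of
     * the "hot" keywords. Keywords are expected in lower case; tokens are
     * lowercased before matching so the comparison is case-insensitive.
     */
    public static List<List<String>> select(List<List<String>> sentences,
                                            Set<String> hotKeywords) {
        return sentences.stream()
                .filter(s -> s.stream()
                        .anyMatch(w -> hotKeywords.contains(w.toLowerCase())))
                .collect(Collectors.toList());
    }
}
```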
Initial progress has been reached with https://github.com/aigents/aigents-java-nlp/blob/master/src/main/java/org/aigents/nlp/gen/Segment.java in aigents/aigents-java-nlp#11. Still, there is more work to do to improve the accuracy.
For testing purposes, we can use (for example) the SingularityNET extract from the Gutenberg Children corpus used in the Unsupervised Language Learning project:
1) Start from the "cleaned" corpus: http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/capital/
2) Create an "extra-cleaned" corpus by removing all sentences containing quotes or brackets ([ ] ( ) { } ' ") and all sentences with inner periods, such as "CHAPTER I. A NEW DEPARTURE".
3) Glue the remaining sentences together on a per-file or per-chapter basis and evaluate the accuracy based on the number of correctly identified sentence boundaries (a sketch of steps 2 and 3 follows below).
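A rough sketch of steps 2 and 3, assuming one sentence per line in the "cleaned" corpus files; the class and the exact filtering rules are illustrative assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class CorpusPreparation {

    // Characters that disqualify a sentence from the "extra-cleaned" corpus.
    private static final String FORBIDDEN = "[](){}'\"";

    /** Keeps only sentences without quotes, brackets, or inner periods. */
    static boolean isExtraClean(String sentence) {
        for (char c : FORBIDDEN.toCharArray()) {
            if (sentence.indexOf(c) >= 0) {
                return false;
            }
        }
        // Reject inner periods like "CHAPTER I. A NEW DEPARTURE":
        // any '.' before the final character disqualifies the sentence.
        int period = sentence.indexOf('.');
        return period < 0 || period == sentence.length() - 1;
    }

    /** Glues the surviving sentences of one corpus file into a single stream. */
    public static String glueFile(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file); // one sentence per line
        return lines.stream()
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .filter(CorpusPreparation::isExtraClean)
                .collect(Collectors.joining(" "));
    }
}
```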
Alternative corpora, and baseline results achieved by other authors to test against, may be considered as well.
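For the accuracy evaluation in step 3 above, one simple convention is to represent boundaries as token offsets into the glued stream and count exact matches; this scoring choice is an assumption, not something fixed by the issue:

```java
import java.util.Set;

public class BoundaryAccuracy {

    /**
     * Fraction of gold-standard sentence boundaries (token offsets into the
     * glued stream) that the segmenter reproduced exactly.
     */
    public static double recall(Set<Integer> gold, Set<Integer> predicted) {
        if (gold.isEmpty()) {
            return 1.0;
        }
        long hits = gold.stream().filter(predicted::contains).count();
        return (double) hits / gold.size();
    }
}
```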
References:
https://www.researchgate.net/publication/321227216_Text_Segmentation_Techniques_A_Critical_Review
https://www.google.com/search?q=natural+language+segmentation%20papers