OPENNLP-912: Rule based sentence detector #390

Alanscut · 2021-01-28T10:38:43Z

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?

For code changes:

Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

Alanscut · 2021-02-04T01:06:07Z

For some reason the Travis CI build always times out here, but it succeeded in my fork: https://travis-ci.org/github/Alanscut/opennlp/builds/757336242

jzonthemtn · 2021-02-07T14:04:30Z

Thanks a lot for this contribution! This is something OpenNLP has needed. I will take a closer look.

Built and tested successfully.

Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /opt/apache-maven
Java version: 11.0.9.1, vendor: Ubuntu, runtime: /usr/lib/jvm/java-11-openjdk-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "5.8.0-40-generic", arch: "amd64", family: "unix"

jzonthemtn · 2022-03-28T14:30:43Z

opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizer.java

+/**
+ * The interface for rule based sentence detector
+ */
+public interface SentenceTokenizer {


There is an opennlp.tools.sentdetect.SentenceDetector interface that resembles this interface. Since the purpose of the two interfaces seems to be the same (breaking text into sentences), is it possible to reuse the existing interface?

jzonthemtn · 2022-03-28T14:33:07Z

opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java

+import opennlp.tools.util.StringUtil;
+
+
+public class SentenceTokenizerME implements SentenceTokenizer {


The ME suffix in the class names denotes "maximum entropy" as the implementation method. Since this implementation doesn't use a trained model, could it be named something like RulesBasedSentenceDetector? (Open to other names, too.)

jzonthemtn · 2022-03-28T14:40:29Z

opennlp-tools/src/main/resources/opennlp/tools/sentdetect/segment/rules.xml

@@ -0,0 +1,131 @@
+<?xml version="1.0" encoding="UTF-8"?>


What is the origin of these rules?

jzonthemtn · 2022-03-28T14:41:28Z

I would like to better understand the origins of the rules used. Does there need to be license attribution?

rzo1 · 2023-11-29T07:42:38Z

> I would like to better understand the origins of the rules used. Does there need to be license attribution?

@jzonthemtn It looks like this "golden-rules.txt" comes from https://github.com/diasks2/pragmatic_segmenter#the-golden-rules (at least if we trust the textual description). It is also available for some other languages: https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt. The library itself, including this content, is MIT-licensed, so there is no compliance issue, but we would need to attribute it accordingly.

rzo1 · 2023-11-29T07:43:22Z

opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/Clean.java

+
+package opennlp.tools.sentdetect.segment;
+
+public class Clean {


Could this be a record?
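To illustrate the suggestion, here is a minimal sketch of what a record-based version might look like. It assumes Java 16 or later. The name `CleanRule` and the `replacement` component are hypothetical; only the `regex` field is visible in the diff.

```java
import java.util.Objects;

// Hypothetical sketch: the record name and the "replacement" component are
// assumptions for illustration, not the PR's actual fields.
record CleanRule(String regex, String replacement) {

  // Compact constructor: reject null components early.
  CleanRule {
    Objects.requireNonNull(regex, "regex");
    Objects.requireNonNull(replacement, "replacement");
  }
}
```

A record generates the accessors, `equals`, `hashCode`, and `toString`, and its components are private and final, which would also cover the visibility remarks elsewhere in this review.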

rzo1 · 2023-11-29T07:44:11Z

opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java

+import opennlp.tools.util.StringUtil;
+
+
+public class SentenceTokenizerME implements SentenceTokenizer {


+1 to @jzonthemtn's comment.

rzo1 · 2023-11-29T07:44:22Z

opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java

+
+  private Matcher afterMatcher;
+
+  boolean found;


Should be private.

rzo1 · 2023-11-29T07:51:21Z

opennlp-tools/src/main/resources/opennlp/tools/sentdetect/segment/rules.xml

@@ -0,0 +1,131 @@
+<?xml version="1.0" encoding="UTF-8"?>


How did you generate this XML file? It does not seem to originate from https://github.com/diasks2/pragmatic_segmenter.

rzo1 · 2023-11-29T07:54:31Z

opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/Clean.java

+
+public class Clean {
+
+  String regex;


It might be worth using a precompiled Pattern here to avoid recompiling the regex on every replaceAll(...) call.
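A minimal sketch of the idea, assuming the class holds one regex and one replacement string. The class name `PrecompiledClean`, the `replacement` field, and the `apply` method are hypothetical stand-ins, not the PR's code.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the PR's Clean class; names are assumptions.
final class PrecompiledClean {

  private final Pattern pattern;     // compiled once, in the constructor
  private final String replacement;

  PrecompiledClean(String regex, String replacement) {
    this.pattern = Pattern.compile(regex);
    this.replacement = replacement;
  }

  // Reuses the compiled Pattern. String.replaceAll(regex, ...) would
  // compile the regex again on every call.
  String apply(CharSequence text) {
    return pattern.matcher(text).replaceAll(replacement);
  }
}
```

`Pattern` instances are immutable and safe to share between threads, so a rule object like this can be created once when the rules are loaded and reused for every input.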

rzo1 · 2023-11-29T08:06:41Z

opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/Section.java

+
+public class Section {
+
+  int left;


These fields should be private.

rzo1 · 2023-11-29T08:07:34Z

opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizer.java

+/**
+ * The interface for rule based sentence detector
+ */
+public interface SentenceTokenizer {


+1 (and the method needs a proper description). From an implementor's perspective, it is totally unclear what the provided method does or is supposed to do.

rzo1 · 2023-11-29T08:08:45Z

opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java

+
+  private List<Section> noBreakSections;
+
+  public SentenceTokenizerME(LanguageTool languageTool, CharSequence text) {


I wonder if we can restructure this to avoid creating a tokenizer for every piece of text. Wouldn't it be more valuable to pass the text as a method parameter and compute everything on the fly? It would also allow us to make it thread-safe in the future.
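A rough sketch of that shape, under stated assumptions: the class name, the `split` method, and the single break rule below are hypothetical and far simpler than the PR's rule set. The point is only that the instance holds immutable configuration while all per-text state lives inside the method call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and the single break rule are assumptions.
final class StatelessSentenceSplitter {

  // Immutable configuration shared by all calls:
  // break on whitespace that follows '.', '!' or '?'.
  private final Pattern breakRule = Pattern.compile("(?<=[.!?])\\s+");

  // All per-text state (the Matcher and the result list) is local to this call.
  List<String> split(CharSequence text) {
    List<String> sentences = new ArrayList<>();
    Matcher m = breakRule.matcher(text);
    int start = 0;
    while (m.find()) {
      sentences.add(text.subSequence(start, m.start()).toString());
      start = m.end();
    }
    if (start < text.length()) {
      sentences.add(text.subSequence(start, text.length()).toString());
    }
    return sentences;
  }
}
```

With this layout one instance can serve many texts, and because no field is mutated after construction, concurrent calls do not interfere with each other.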

rzo1 · 2023-11-29T08:09:24Z

opennlp-tools/src/main/java/opennlp/tools/sentdetect/segment/SentenceTokenizerME.java

+        start += count;
+      }
+    } catch (IOException e) {
+      e.printStackTrace();


We shouldn't just print the stack trace: either rethrow it as a runtime exception or at least log it.
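One possible way to rethrow, sketched with a hypothetical helper (`ReaderText.readAll` and its read loop are assumptions, not the PR's code). It wraps the checked `IOException` in the JDK's `UncheckedIOException` so the original cause is preserved.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper; the name and the read loop are assumptions.
final class ReaderText {

  static String readAll(Reader reader) {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[1024];
    try {
      int count;
      while ((count = reader.read(buf)) != -1) {
        sb.append(buf, 0, count);
      }
    } catch (IOException e) {
      // Propagate instead of calling e.printStackTrace(); the cause is kept.
      throw new UncheckedIOException("Failed to read input text", e);
    }
    return sb.toString();
  }
}
```

Callers then see the failure instead of silently continuing with partially read text.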

rzo1 · 2023-11-29T08:11:24Z

opennlp-tools/src/test/java/opennlp/tools/sentdetect/segment/GoldenRulesTest.java

+      text = cleaner.clean(text);
+    }
+
+    InputStream inputStream = getClass().getResourceAsStream(


We should close the stream, read it once, and reuse the cached result for every test run.
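A minimal sketch of both points, assuming the resource fits in memory. `CachedResource` and its `Supplier` parameter are hypothetical. In the actual test the supplier would wrap the existing `getClass().getResourceAsStream(...)` call, and a JUnit 5 `@BeforeAll` method would be another reasonable place to do the one-time read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

// Hypothetical sketch: class and method names are assumptions.
final class CachedResource {

  private final Supplier<InputStream> source;
  private String cached; // filled on first access, reused afterwards

  CachedResource(Supplier<InputStream> source) {
    this.source = source;
  }

  synchronized String text() {
    if (cached == null) {
      // try-with-resources closes the stream even if reading fails.
      try (InputStream in = source.get()) {
        cached = new String(in.readAllBytes(), StandardCharsets.UTF_8);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return cached;
  }
}
```

The stream is opened at most once per `CachedResource` instance, so holding the instance in a static field of the test class would give every test the same cached text.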

Alanscut force-pushed the segment branch from 238554d to 01420d7 on February 1, 2021 02:54

jzonthemtn reviewed Mar 28, 2022

Alanscut and others added 3 commits November 29, 2023 08:46

OPENNLP-912: Rules based sentence detector

a6a68d5

OPENNLP-912: Move rules into rules.xml

49f51ef

Rebase with origin/main, fix JUnit 5 migration errors

342a025

rzo1 force-pushed the segment branch from ae25942 to 342a025 on November 29, 2023 08:03

rzo1 requested changes Nov 29, 2023
