Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rita.untokenize incorrectly handling apostrophe's at end of word #477

Open
dhowe opened this issue Jul 5, 2017 · 15 comments
Open

Rita.untokenize incorrectly handling apostrophe's at end of word #477

dhowe opened this issue Jul 5, 2017 · 15 comments
Assignees

Comments

@dhowe
Copy link
Owner

dhowe commented Jul 5, 2017

please see my fix (in RiTaJS b1e70577a8fc5c86) and apply the same fix to the failing test in RiTa-java

@dhowe
Copy link
Owner Author

dhowe commented Jul 5, 2017

Here is problematic case in our untokenize tests:

a. The student said 'learning is fun'
b. We should consider the students' learning

In each case, when we get to the single-quote, we don't know whether it is the start of a new quotation or an apostrophe at the end of the previous word

@cqx931 any thoughts?

@cqx931
Copy link
Collaborator

cqx931 commented Jul 6, 2017

When I test RiTa.tokenize(), student’s stays as one token while 'students'' becomes two.
But if student’s is regarded as one token, shouldn't 'students'' also stays as one? If this is ok, then it would be an issue for tokenize(), rather than untokenize().

If we do think 'students'' should be broke into two tokens, then are words ending in 's' the only case here? (I can't come up with more cases... ) If it is the only case we could check whether the token before is ending in 's' to differentiate case a and b.

@dhowe
Copy link
Owner Author

dhowe commented Jul 6, 2017

This is a good question. I think punctuation that is internal to a word (e.g., student's) should be a single token, while punctuation at end of word (e.g., students' or students" or students, or students:) should be two. One reason is that we don't want the 's' character in "student's" to be a separate token... though we also need to consider the SPLIT_CONTRACTIONS flag.

In any case @cqx931, can you check/fix the known-issues test for tokenize() in RiTaJS? then lets discuss

dhowe added a commit that referenced this issue Jul 7, 2017
@cqx931
Copy link
Collaborator

cqx931 commented Jul 7, 2017

First fixed the issue in tokenize():
dhowe/ritajs@cfa2e90

@cqx931
Copy link
Collaborator

cqx931 commented Jul 8, 2017

Please check: https://github.com/dhowe/RiTaJS/pull/89
Fixed the rest issues with untokenize() and added other curly quotes to tokenize().

@dhowe
Copy link
Owner Author

dhowe commented Jul 8, 2017

Excellent.

We still have this failing case:

      var expected = "The student said 'learning is fun'";
      var input = ["The", "student", "said", "'", "learning", "is", "fun", "'"];
      var output = RiTa.untokenize(input);
      deepEqual(output, expected);

But lets leave that alone for now.

Next, please 1) update the RiTa java tests to match those in JS, then 2) update RiTa.tokenize()/untokenize() functions in java to match the behavior of RiTaJS... you may want to simply rewrite those functions as ports from JS, rather than trying to adapt what's there now

@cqx931
Copy link
Collaborator

cqx931 commented Jul 9, 2017

Updated Java and fixed the new failing case.

Please check https://github.com/dhowe/RiTa/pull/480
and https://github.com/dhowe/RiTaJS/pull/90

@cqx931
Copy link
Collaborator

cqx931 commented Aug 12, 2017

Can this be closed? @dhowe

@dhowe
Copy link
Owner Author

dhowe commented Aug 12, 2017

Do we have this case in KnownISsues for both platforms?

     var expected = "The student said 'learning is fun'";
     var input = ["The", "student", "said", "'", "learning", "is", "fun", "'"];
     var output = RiTa.untokenize(input);
     deepEqual(output, expected);

@cqx931
Copy link
Collaborator

cqx931 commented Aug 12, 2017

This test case is already solved for both platforms (in RiTaTest).

@dhowe
Copy link
Owner Author

dhowe commented Aug 14, 2017

So we have both of these arrays as inputs to untokenize() tests (which do pass):

  1. ["We", "should", "consider", "the", "students", "'", "learning" ]
  2. ["The", "student", "said", "'", "learning", "is", "fun", "'"]

Pls explain how the algorithm groups the first single-quote in 1 with the previous word (students'), but does the opposite in 2, and groups it with the subsequent word ('learning)?

@dhowe
Copy link
Owner Author

dhowe commented Nov 24, 2017

@cqx931 ?

@cqx931
Copy link
Collaborator

cqx931 commented Jun 17, 2018

What do you mean by grouping here? Isn't both in case 1 and case 2, that the single quote is considered as a single token?

@dhowe
Copy link
Owner Author

dhowe commented Jun 17, 2018

Yes, but we are talking about RiTa.untokenize here, so the output is a single string.

@dhowe
Copy link
Owner Author

dhowe commented Sep 24, 2019

@kennyviperhk Please verify that both of the tests below pass (in both RiTa and RiTaJS) for RiTa.untokenize:

    expected = "We should consider the students' learning";
    input = ["We", "should", "consider", "the", "students", "'", "learning" ]
    var output = RiTa.untokenize(input);
    equal(output, expected);

    expected = "The student said 'learning is fun'";
    input = ["The", "student", "said", "'", "learning", "is", "fun", "'"]
    var output = RiTa.untokenize(input);
    equal(output, expected);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants