Rita.untokenize incorrectly handling apostrophe's at end of word #477

dhowe · 2017-07-05T02:52:36Z

please see my fix (in RiTaJS b1e70577a8fc5c86) and apply the same fix to the failing test in RiTa-java

dhowe · 2017-07-05T09:15:32Z

Here is problematic case in our untokenize tests:

a. The student said 'learning is fun'
b. We should consider the students' learning

In each case, when we get to the single-quote, we don't know whether it is the start of a new quotation or an apostrophe at the end of the previous word

@cqx931 any thoughts?

cqx931 · 2017-07-06T07:01:42Z

When I test RiTa.tokenize(), student’s stays as one token while 'students'' becomes two.
But if student’s is regarded as one token, shouldn't 'students'' also stays as one? If this is ok, then it would be an issue for tokenize(), rather than untokenize().

If we do think 'students'' should be broke into two tokens, then are words ending in 's' the only case here? (I can't come up with more cases... ) If it is the only case we could check whether the token before is ending in 's' to differentiate case a and b.

dhowe · 2017-07-06T21:02:18Z

This is a good question. I think punctuation that is internal to a word (e.g., student's) should be a single token, while punctuation at end of word (e.g., students' or students" or students, or students:) should be two. One reason is that we don't want the 's' character in "student's" to be a separate token... though we also need to consider the SPLIT_CONTRACTIONS flag.

In any case @cqx931, can you check/fix the known-issues test for tokenize() in RiTaJS? then lets discuss

cqx931 · 2017-07-07T17:56:55Z

First fixed the issue in tokenize():
dhowe/ritajs@cfa2e90

cqx931 · 2017-07-08T16:25:24Z

Please check: https://github.com/dhowe/RiTaJS/pull/89
Fixed the rest issues with untokenize() and added other curly quotes to tokenize().

dhowe · 2017-07-08T18:29:19Z

Excellent.

We still have this failing case:

      var expected = "The student said 'learning is fun'";
      var input = ["The", "student", "said", "'", "learning", "is", "fun", "'"];
      var output = RiTa.untokenize(input);
      deepEqual(output, expected);

But lets leave that alone for now.

Next, please 1) update the RiTa java tests to match those in JS, then 2) update RiTa.tokenize()/untokenize() functions in java to match the behavior of RiTaJS... you may want to simply rewrite those functions as ports from JS, rather than trying to adapt what's there now

cqx931 · 2017-07-09T15:01:21Z

Updated Java and fixed the new failing case.

Please check https://github.com/dhowe/RiTa/pull/480
and https://github.com/dhowe/RiTaJS/pull/90

cqx931 · 2017-08-12T20:58:42Z

Can this be closed? @dhowe

dhowe · 2017-08-12T21:25:12Z

Do we have this case in KnownISsues for both platforms?

     var expected = "The student said 'learning is fun'";
     var input = ["The", "student", "said", "'", "learning", "is", "fun", "'"];
     var output = RiTa.untokenize(input);
     deepEqual(output, expected);

cqx931 · 2017-08-12T21:49:57Z

This test case is already solved for both platforms (in RiTaTest).

dhowe · 2017-08-14T02:49:41Z

So we have both of these arrays as inputs to untokenize() tests (which do pass):

["We", "should", "consider", "the", "students", "'", "learning" ]
["The", "student", "said", "'", "learning", "is", "fun", "'"]

Pls explain how the algorithm groups the first single-quote in 1 with the previous word (students'), but does the opposite in 2, and groups it with the subsequent word ('learning)?

dhowe · 2017-11-24T09:40:05Z

@cqx931 ?

cqx931 · 2018-06-17T18:02:14Z

What do you mean by grouping here? Isn't both in case 1 and case 2, that the single quote is considered as a single token?

dhowe · 2018-06-17T18:05:38Z

Yes, but we are talking about RiTa.untokenize here, so the output is a single string.

dhowe · 2019-09-24T20:44:45Z

@kennyviperhk Please verify that both of the tests below pass (in both RiTa and RiTaJS) for RiTa.untokenize:

    expected = "We should consider the students' learning";
    input = ["We", "should", "consider", "the", "students", "'", "learning" ]
    var output = RiTa.untokenize(input);
    equal(output, expected);

    expected = "The student said 'learning is fun'";
    input = ["The", "student", "said", "'", "learning", "is", "fun", "'"]
    var output = RiTa.untokenize(input);
    equal(output, expected);

dhowe assigned cqx931 Jul 5, 2017

dhowe mentioned this issue Jul 5, 2017

Quotes not handled correctly in RiMarkov.generateSentences() #467

Closed

dhowe added PRIORITY: High and removed PRIORITY: High labels Jul 6, 2017

dhowe added a commit that referenced this issue Jul 7, 2017

removed failing tests for #477

22e3ea8

dhowe added the PRIORITY: Medium label Jul 8, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rita.untokenize incorrectly handling apostrophe's at end of word #477

Rita.untokenize incorrectly handling apostrophe's at end of word #477

dhowe commented Jul 5, 2017 •

edited

Loading

dhowe commented Jul 5, 2017

cqx931 commented Jul 6, 2017 •

edited

Loading

dhowe commented Jul 6, 2017 •

edited

Loading

cqx931 commented Jul 7, 2017

cqx931 commented Jul 8, 2017

dhowe commented Jul 8, 2017 •

edited

Loading

cqx931 commented Jul 9, 2017 •

edited

Loading

cqx931 commented Aug 12, 2017 •

edited

Loading

dhowe commented Aug 12, 2017 •

edited

Loading

cqx931 commented Aug 12, 2017 •

edited

Loading

dhowe commented Aug 14, 2017 •

edited

Loading

dhowe commented Nov 24, 2017

cqx931 commented Jun 17, 2018

dhowe commented Jun 17, 2018 •

edited

Loading

dhowe commented Sep 24, 2019

Rita.untokenize incorrectly handling apostrophe's at end of word #477

Rita.untokenize incorrectly handling apostrophe's at end of word #477

Comments

dhowe commented Jul 5, 2017 • edited Loading

dhowe commented Jul 5, 2017

cqx931 commented Jul 6, 2017 • edited Loading

dhowe commented Jul 6, 2017 • edited Loading

cqx931 commented Jul 7, 2017

cqx931 commented Jul 8, 2017

dhowe commented Jul 8, 2017 • edited Loading

cqx931 commented Jul 9, 2017 • edited Loading

cqx931 commented Aug 12, 2017 • edited Loading

dhowe commented Aug 12, 2017 • edited Loading

cqx931 commented Aug 12, 2017 • edited Loading

dhowe commented Aug 14, 2017 • edited Loading

dhowe commented Nov 24, 2017

cqx931 commented Jun 17, 2018

dhowe commented Jun 17, 2018 • edited Loading

dhowe commented Sep 24, 2019

dhowe commented Jul 5, 2017 •

edited

Loading

cqx931 commented Jul 6, 2017 •

edited

Loading

dhowe commented Jul 6, 2017 •

edited

Loading

dhowe commented Jul 8, 2017 •

edited

Loading

cqx931 commented Jul 9, 2017 •

edited

Loading

cqx931 commented Aug 12, 2017 •

edited

Loading

dhowe commented Aug 12, 2017 •

edited

Loading

cqx931 commented Aug 12, 2017 •

edited

Loading

dhowe commented Aug 14, 2017 •

edited

Loading

dhowe commented Jun 17, 2018 •

edited

Loading