-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rita.untokenize incorrectly handling apostrophe's at end of word #477
Comments
Here is problematic case in our untokenize tests: a. The student said 'learning is fun' In each case, when we get to the single-quote, we don't know whether it is the start of a new quotation or an apostrophe at the end of the previous word @cqx931 any thoughts? |
When I test RiTa.tokenize(), If we do think 'students'' should be broke into two tokens, then are words ending in 's' the only case here? (I can't come up with more cases... ) If it is the only case we could check whether the token before is ending in 's' to differentiate case a and b. |
This is a good question. I think punctuation that is internal to a word (e.g., student's) should be a single token, while punctuation at end of word (e.g., students' or students" or students, or students:) should be two. One reason is that we don't want the 's' character in "student's" to be a separate token... though we also need to consider the SPLIT_CONTRACTIONS flag. In any case @cqx931, can you check/fix the known-issues test for tokenize() in RiTaJS? then lets discuss |
First fixed the issue in tokenize(): |
Please check: https://github.com/dhowe/RiTaJS/pull/89 |
Excellent. We still have this failing case:
But lets leave that alone for now. Next, please 1) update the RiTa java tests to match those in JS, then 2) update RiTa.tokenize()/untokenize() functions in java to match the behavior of RiTaJS... you may want to simply rewrite those functions as ports from JS, rather than trying to adapt what's there now |
Updated Java and fixed the new failing case. Please check https://github.com/dhowe/RiTa/pull/480 |
Can this be closed? @dhowe |
Do we have this case in KnownISsues for both platforms?
|
This test case is already solved for both platforms (in RiTaTest). |
So we have both of these arrays as inputs to untokenize() tests (which do pass):
Pls explain how the algorithm groups the first single-quote in 1 with the previous word (students'), but does the opposite in 2, and groups it with the subsequent word ('learning)? |
@cqx931 ? |
What do you mean by grouping here? Isn't both in case 1 and case 2, that the single quote is considered as a single token? |
Yes, but we are talking about RiTa.untokenize here, so the output is a single string. |
@kennyviperhk Please verify that both of the tests below pass (in both RiTa and RiTaJS) for RiTa.untokenize:
|
please see my fix (in RiTaJS b1e70577a8fc5c86) and apply the same fix to the failing test in RiTa-java
The text was updated successfully, but these errors were encountered: