Proposed wordcheck API #1371

70ray · 2024-11-13T12:33:18Z

There are two endpoints:

PUT v1/projects/:projectid/wordcheck This does not require the user to own the page. Needs accepted words and languages. Returns two arrays:
- wc_array: this consists of chunks of text and a flag indicating the type of chunk
- scripts: the uncommon scripts encountered.
PUT v1/projects/:projectid/pages/:pagename with pageaction = report. This requires the user to own the page and executes save_wordcheck_event(). The logic of when to call this is left to the client. It can be as follows: When the user does any kind of save, if he has previously entered wordcheck mode and a wordcheck report has not previously been submitted then submit a report before the save even if the user has not suggested any words to be accepted. On any subsequent saves submit a report only if the user has suggested some words since the previous report and include only these words. In this way the same word will not be reported more than once for each page.

Sandbox at: https://www.pgdp.org/~rp31/c.branch/api_wc1

SETUP/tests/unittests/ApiTest.php

70ray · 2024-11-14T09:13:02Z

SETUP/tests/unittests/ApiTest.php

+        $this->assertEquals($expected, $response);
+    }
+    /*
+    public function test_wordcheck_site_bad_word()


This works on the test server but I don't know how to make it work in the Github ci. Perhaps not worth the effort.

70ray · 2024-11-14T09:16:25Z

pinc/ProofProject.inc

+            $this->append_good($wordIndex);
+            if (in_array($word, $badWords)) {
+                $this->append_word_to_array($word, $badWordHash[$word]);
+            } elseif (in_array($word, $accepted_words)) {


Do we really need to mark words the user has suggested?

srjfoo

Is there a good way to test this PR via the API, or is the main testing just to make sure that in the proofreading interface and/or PM's WC interface, everything works as it should?

pinc/ProofProject.inc

70ray · 2024-11-16T06:35:57Z

Is there a good way to test this PR via the API, or is the main testing just to make sure that in the proofreading interface and/or PM's WC interface, everything works as it should?

It would be possible to test it using curl. It could be possible to test it from inside the Swagger editor although the server may not be set up so that this will work from my sandbox. I tested it by using it with a basic proofreading interface connected to it.

70ray · 2024-11-16T07:10:57Z

Sandbox updated.

cpeel

There are a couple of things that give me pause about this implementation:

We're duplicating a lot of logic -- some of it literal copy/paste -- from spellcheck_text.inc without replacing the old one. Now we have two places to fix bugs and keep in sync. I'd rather we have a new class (like you're building) that does the logic the API needs and update the current functionality to include it.
The current WordChecker class merges WordCheck (which is a word-centric feature) with highlighting (punctuation and scripts) so we're cramming the data and the UI layers together somehow. The current code does this too when creating the HTML, but the API is a data layer. I need to think about this some more because I see why you're doing this but it feels wrong.

Rather than address the above yet, I suggest let's start by having you address my comment about documenting the returned structure and how you see the UI / API caller using it to make sure we're addressing the key use-case.

pinc/wordcheck_engine.inc

pinc/ProofProject.inc

SETUP/tests/unittests/ApiTest.php

pinc/ProofProject.inc

cpeel · 2024-11-16T15:24:53Z

pinc/ProofProject.inc

+        $this->puncArray = make_grapheme_array($puncCharacters);
+    }
+
+    public function analyse($project, $page_text, $languages, $accepted_words): array


We need some documentation here for what the array returned from this function looks like. API callers need it, and frankly I the PR reviewer needs it. Please add a comment here with a very small example for reviewers and future devs and we can figure out how to best get that available to the API docs.

pinc/ProofProject.inc

70ray · 2024-11-17T05:43:50Z

There are a couple of things that give me pause about this implementation:

1. We're duplicating a lot of logic -- some of it literal copy/paste -- from `spellcheck_text.inc` without replacing the old one. Now we have two places to fix bugs and keep in sync. I'd rather we have a new class (like you're building) that does the logic the API needs and update the current functionality to include it.

2. The current WordChecker class merges WordCheck (which is a word-centric feature) with highlighting (punctuation and scripts) so we're cramming the data and the UI layers together somehow. The current code does this too when creating the HTML, but the API is a data layer. I need to think about this some more because I see why you're doing this but it feels wrong.

Rather than address the above yet, I suggest let's start by having you address my comment about documenting the returned structure and how you see the UI / API caller using it to make sure we're addressing the key use-case.

Thanks Casey.
I wrote this api envisaging it would be used in a mode similar to what we have now rather than an interactive mode but I was surprised how well it works in an interactive mode. (I was using a 2 second window after the user had finished making changes before refreshing the whole page.) But rewriting the whole page might not be the best approach and in view of your comments in the pi_wg thread it might be better to consider other ways of doing it.
Another motivation for the whole-page approach was to give the server most of the work and leave the client with just presentation of the results. The task of deciding what exactly is a word, as you know, is not so simple as might first appear. But as you suggested, sending a line would avoid this difficulty and this api would work in the same way whether the piece of text was a line or a page. Letting the client deal with punctuation would also reduce the communication traffic.
Making it interactive might also mean we need to change how we deal with invalid characters.
I'll make the changes you suggested.

so it can be used in spellcheck_text

70ray · 2024-11-18T17:11:29Z

Sandbox updated

70ray · 2024-11-20T14:45:20Z

pinc/WordChecker.inc

+     *  an array of languages to use for dictionaries
+     * @param array $accepted_words
+     *  the words the user wishes to be accepted as good
+     * TODO consider how to handle characters not enabled for the project


If we remove validate_text() from api_v1_project_wordcheck() then they will be highlit as uncommon scripts. We could adjust the way highlighting works to reserve a code/colour for them. An exception would still be thrown on saving the page.

cpeel · 2024-11-21T22:08:06Z

I've put my finger on what's bugging me about this PR -- we're munging the data and presentation layers together in the API. The API should be a data layer first and foremost.

The primary WordCheck API should return the list of words and what set they belong in (bad world, site, project, page). A PM or PP (or UI) could query this endpoint with some text and get back the list of words to do something with them.

Similarly, we could have an endpoint that returns words in uncommon scripts for a similar use by PM, PP, or UI.

eg:

$router->add_route("PUT", "v1/projects/:projectid/wordcheck", "api_v1_project_wordcheck");
$router->add_route("PUT", "v1/projects/:projectid/uncommon_scripts", "api_v1_project_uncommon_scripts");

That doesn't mean we can't or shouldn't have server-side APIs that facilitate the presentation layer like the code in this PR does, but we should think about why we're not just putting them in the UI code to begin with and keep all of the presentation bits together. Just because we're doing things server-side now doesn't mean we should be doing it server-side in the future.

I'd like to see the first WordCheck API be useful for the general case and decide if we want this data+presentation logic also as an API like you have here.

70ray · 2024-11-22T14:24:06Z

I've put my finger on what's bugging me about this PR -- we're munging the data and presentation layers together in the API. The API should be a data layer first and foremost. ...

I see what you are saying.
For the purpose of the interactive PI the client could use the endpoints you suggest if it also had all the words in the text with their offsets. How about giving the client the word pattern and letting it find the words in the text? Then send the set of unique words to the server for those endpoints instead of the whole text.

cpeel · 2024-11-23T02:11:11Z

For the purpose of the interactive PI the client could use the endpoints you suggest if it also had all the words in the text with their offsets. How about giving the client the word pattern and letting it find the words in the text? Then send the set of unique words to the server for those endpoints instead of the whole text.

I don't know that we need to pass around the word pattern. If we send the text blob to the API (like you have now) and the API returns a list of words, the caller should be able to treat those as opaque strings, search for them in the text it sent, and do something with them (highlight them, etc). I think in that case the caller doesn't need to know anything about "words" at all (not that it couldn't, just that I don't see that as a prereq).

70ray · 2024-11-23T16:12:37Z

For the purpose of the interactive PI the client could use the endpoints you suggest if it also had all the words in the text with their offsets. How about giving the client the word pattern and letting it find the words in the text? Then send the set of unique words to the server for those endpoints instead of the whole text.

I don't know that we need to pass around the word pattern. If we send the text blob to the API (like you have now) and the API returns a list of words, the caller should be able to treat those as opaque strings, search for them in the text it sent, and do something with them (highlight them, etc). I think in that case the caller doesn't need to know anything about "words" at all (not that it couldn't, just that I don't see that as a prereq).

I don't think you can do it quite like that because a bad word can be a substring of a good word.

cpeel · 2024-11-24T02:10:13Z

I don't think you can do it quite like that because a bad word can be a substring of a good word.

Yup, good point. So the server and the UI need to agree on what a "word" is. It's probably sufficient that we define it / document it and not try to have the client request it from the server.

70ray · 2024-11-26T19:53:56Z

I don't think you can do it quite like that because a bad word can be a substring of a good word.

Yup, good point. So the server and the UI need to agree on what a "word" is. It's probably sufficient that we define it / document it and not try to have the client request it from the server.

I've tried using a similar regular expression in the client to find words. One thing I came across is that in unicode mode \w works differntly in php and javascript. In php it matches the ṭ character (in projectID5690905894029) but not in js. Changing js to use \p{L} matches it. Perhaps we should change the php expression in the same way so they are more likely to agree in all cases.

70ray · 2024-11-27T16:06:26Z

Changing to returning an array of bad words.

adding wordcheck api

1f63b57

70ray marked this pull request as draft November 13, 2024 12:34

70ray added 6 commits November 13, 2024 14:41

changes to api yaml file

1cc47b4

comment out failing test pending investigation

6ac4a12

declare class variables

d95eb80

adjustments

8126125

comment out another test

1675efe

add break;

0364d28

chrismiceli approved these changes Nov 13, 2024

View reviewed changes

70ray commented Nov 14, 2024

View reviewed changes

70ray marked this pull request as ready for review November 14, 2024 09:33

70ray requested review from jchaffraix, chrismiceli, cpeel and srjfoo November 14, 2024 09:35

chrismiceli approved these changes Nov 14, 2024

View reviewed changes

srjfoo reviewed Nov 16, 2024

View reviewed changes

pinc/ProofProject.inc Outdated Show resolved Hide resolved

pinc/ProofProject.inc Outdated Show resolved Hide resolved

pinc/ProofProject.inc Outdated Show resolved Hide resolved

spell out some words more clearly

f9c58cc

cpeel reviewed Nov 16, 2024

View reviewed changes

70ray added 8 commits November 17, 2024 07:14

let split_multiscript_string() return script also

fb91b36

use split_multiscript_string() in api wordcheck

e3089f9

move wordchecker to a new file

77f6583

move validate_text out of WordChecker, adust parameters

daeee0d

so it can be used in spellcheck_text

use UTF8 function to make punctation array, add messages

ebd1527

type WordChecker, adjust tests

b7828b1

change page_text to text, since not necessarily a whole page

aa5cdf4

throw exception if preg_match_all fails

17c0ce6

70ray added 4 commits November 18, 2024 16:09

document WordChecker

33eb050

add french dictionary for tests, enable aux lang test

af29707

correct return type for split_multiscript_string()

1404ad5

try again

8318a7f

70ray commented Nov 20, 2024

View reviewed changes

70ray closed this Nov 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposed wordcheck API #1371

Proposed wordcheck API #1371

70ray commented Nov 13, 2024 •

edited

Loading

70ray Nov 14, 2024

70ray Nov 14, 2024

srjfoo left a comment

70ray commented Nov 16, 2024

70ray commented Nov 16, 2024

cpeel left a comment

cpeel Nov 16, 2024

70ray commented Nov 17, 2024

70ray commented Nov 18, 2024

70ray Nov 20, 2024 •

edited

Loading

cpeel commented Nov 21, 2024

70ray commented Nov 22, 2024 •

edited

Loading

cpeel commented Nov 23, 2024

70ray commented Nov 23, 2024

cpeel commented Nov 24, 2024

70ray commented Nov 26, 2024 •

edited

Loading

70ray commented Nov 27, 2024 •

edited

Loading

Proposed wordcheck API #1371

Proposed wordcheck API #1371

Conversation

70ray commented Nov 13, 2024 • edited Loading

70ray Nov 14, 2024

Choose a reason for hiding this comment

70ray Nov 14, 2024

Choose a reason for hiding this comment

srjfoo left a comment

Choose a reason for hiding this comment

70ray commented Nov 16, 2024

70ray commented Nov 16, 2024

cpeel left a comment

Choose a reason for hiding this comment

cpeel Nov 16, 2024

Choose a reason for hiding this comment

70ray commented Nov 17, 2024

70ray commented Nov 18, 2024

70ray Nov 20, 2024 • edited Loading

Choose a reason for hiding this comment

cpeel commented Nov 21, 2024

70ray commented Nov 22, 2024 • edited Loading

cpeel commented Nov 23, 2024

70ray commented Nov 23, 2024

cpeel commented Nov 24, 2024

70ray commented Nov 26, 2024 • edited Loading

70ray commented Nov 27, 2024 • edited Loading

70ray commented Nov 13, 2024 •

edited

Loading

70ray Nov 20, 2024 •

edited

Loading

70ray commented Nov 22, 2024 •

edited

Loading

70ray commented Nov 26, 2024 •

edited

Loading

70ray commented Nov 27, 2024 •

edited

Loading