Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed wordcheck API #1371

Closed
wants to merge 20 commits into from
Closed

Conversation

70ray
Copy link
Collaborator

@70ray 70ray commented Nov 13, 2024

There are two endpoints:

  • PUT v1/projects/:projectid/wordcheck This does not require the user to own the page. Needs accepted words and languages. Returns two arrays:
    • wc_array: this consists of chunks of text and a flag indicating the type of chunk
    • scripts: the uncommon scripts encountered.
  • PUT v1/projects/:projectid/pages/:pagename with pageaction = report. This requires the user to own the page and executes save_wordcheck_event(). The logic of when to call this is left to the client. It can be as follows: When the user does any kind of save, if he has previously entered wordcheck mode and a wordcheck report has not previously been submitted then submit a report before the save even if the user has not suggested any words to be accepted. On any subsequent saves submit a report only if the user has suggested some words since the previous report and include only these words. In this way the same word will not be reported more than once for each page.

Sandbox at: https://www.pgdp.org/~rp31/c.branch/api_wc1

@70ray 70ray marked this pull request as draft November 13, 2024 12:34
SETUP/tests/unittests/ApiTest.php Outdated Show resolved Hide resolved
$this->assertEquals($expected, $response);
}
/*
public function test_wordcheck_site_bad_word()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works on the test server but I don't know how to make it work in the Github ci. Perhaps not worth the effort.

$this->append_good($wordIndex);
if (in_array($word, $badWords)) {
$this->append_word_to_array($word, $badWordHash[$word]);
} elseif (in_array($word, $accepted_words)) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to mark words the user has suggested?

@70ray 70ray marked this pull request as ready for review November 14, 2024 09:33
Copy link
Member

@srjfoo srjfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good way to test this PR via the API, or is the main testing just to make sure that in the proofreading interface and/or PM's WC interface, everything works as it should?

pinc/ProofProject.inc Outdated Show resolved Hide resolved
pinc/ProofProject.inc Outdated Show resolved Hide resolved
pinc/ProofProject.inc Outdated Show resolved Hide resolved
@70ray
Copy link
Collaborator Author

70ray commented Nov 16, 2024

Is there a good way to test this PR via the API, or is the main testing just to make sure that in the proofreading interface and/or PM's WC interface, everything works as it should?

It would be possible to test it using curl. It could be possible to test it from inside the Swagger editor although the server may not be set up so that this will work from my sandbox. I tested it by using it with a basic proofreading interface connected to it.

@70ray
Copy link
Collaborator Author

70ray commented Nov 16, 2024

Sandbox updated.

Copy link
Member

@cpeel cpeel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of things that give me pause about this implementation:

  1. We're duplicating a lot of logic -- some of it literal copy/paste -- from spellcheck_text.inc without replacing the old one. Now we have two places to fix bugs and keep in sync. I'd rather we have a new class (like you're building) that does the logic the API needs and update the current functionality to include it.
  2. The current WordChecker class merges WordCheck (which is a word-centric feature) with highlighting (punctuation and scripts) so we're cramming the data and the UI layers together somehow. The current code does this too when creating the HTML, but the API is a data layer. I need to think about this some more because I see why you're doing this but it feels wrong.

Rather than address the above yet, I suggest let's start by having you address my comment about documenting the returned structure and how you see the UI / API caller using it to make sure we're addressing the key use-case.

pinc/wordcheck_engine.inc Outdated Show resolved Hide resolved
pinc/ProofProject.inc Outdated Show resolved Hide resolved
SETUP/tests/unittests/ApiTest.php Outdated Show resolved Hide resolved
pinc/ProofProject.inc Outdated Show resolved Hide resolved
pinc/ProofProject.inc Outdated Show resolved Hide resolved
$this->puncArray = make_grapheme_array($puncCharacters);
}

public function analyse($project, $page_text, $languages, $accepted_words): array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need some documentation here for what the array returned from this function looks like. API callers need it, and frankly I the PR reviewer needs it. Please add a comment here with a very small example for reviewers and future devs and we can figure out how to best get that available to the API docs.

pinc/ProofProject.inc Outdated Show resolved Hide resolved
@70ray
Copy link
Collaborator Author

70ray commented Nov 17, 2024

There are a couple of things that give me pause about this implementation:

1. We're duplicating a lot of logic -- some of it literal copy/paste -- from `spellcheck_text.inc` without replacing the old one. Now we have two places to fix bugs and keep in sync. I'd rather we have a new class (like you're building) that does the logic the API needs and update the current functionality to include it.

2. The current WordChecker class merges WordCheck (which is a word-centric feature) with highlighting (punctuation and scripts) so we're cramming the data and the UI layers together somehow. The current code does this too when creating the HTML, but the API is a data layer. I need to think about this some more because I see why you're doing this but it feels wrong.

Rather than address the above yet, I suggest let's start by having you address my comment about documenting the returned structure and how you see the UI / API caller using it to make sure we're addressing the key use-case.

Thanks Casey.
I wrote this api envisaging it would be used in a mode similar to what we have now rather than an interactive mode but I was surprised how well it works in an interactive mode. (I was using a 2 second window after the user had finished making changes before refreshing the whole page.) But rewriting the whole page might not be the best approach and in view of your comments in the pi_wg thread it might be better to consider other ways of doing it.
Another motivation for the whole-page approach was to give the server most of the work and leave the client with just presentation of the results. The task of deciding what exactly is a word, as you know, is not so simple as might first appear. But as you suggested, sending a line would avoid this difficulty and this api would work in the same way whether the piece of text was a line or a page. Letting the client deal with punctuation would also reduce the communication traffic.
Making it interactive might also mean we need to change how we deal with invalid characters.
I'll make the changes you suggested.

@70ray
Copy link
Collaborator Author

70ray commented Nov 18, 2024

Sandbox updated

* an array of languages to use for dictionaries
* @param array $accepted_words
* the words the user wishes to be accepted as good
* TODO consider how to handle characters not enabled for the project
Copy link
Collaborator Author

@70ray 70ray Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove validate_text() from api_v1_project_wordcheck() then they will be highlit as uncommon scripts. We could adjust the way highlighting works to reserve a code/colour for them. An exception would still be thrown on saving the page.

@cpeel
Copy link
Member

cpeel commented Nov 21, 2024

I've put my finger on what's bugging me about this PR -- we're munging the data and presentation layers together in the API. The API should be a data layer first and foremost.

The primary WordCheck API should return the list of words and what set they belong in (bad world, site, project, page). A PM or PP (or UI) could query this endpoint with some text and get back the list of words to do something with them.

Similarly, we could have an endpoint that returns words in uncommon scripts for a similar use by PM, PP, or UI.

eg:

$router->add_route("PUT", "v1/projects/:projectid/wordcheck", "api_v1_project_wordcheck");
$router->add_route("PUT", "v1/projects/:projectid/uncommon_scripts", "api_v1_project_uncommon_scripts");

That doesn't mean we can't or shouldn't have server-side APIs that facilitate the presentation layer like the code in this PR does, but we should think about why we're not just putting them in the UI code to begin with and keep all of the presentation bits together. Just because we're doing things server-side now doesn't mean we should be doing it server-side in the future.

I'd like to see the first WordCheck API be useful for the general case and decide if we want this data+presentation logic also as an API like you have here.

@70ray
Copy link
Collaborator Author

70ray commented Nov 22, 2024

I've put my finger on what's bugging me about this PR -- we're munging the data and presentation layers together in the API. The API should be a data layer first and foremost. ...

I see what you are saying.
For the purpose of the interactive PI the client could use the endpoints you suggest if it also had all the words in the text with their offsets. How about giving the client the word pattern and letting it find the words in the text? Then send the set of unique words to the server for those endpoints instead of the whole text.

@cpeel
Copy link
Member

cpeel commented Nov 23, 2024

For the purpose of the interactive PI the client could use the endpoints you suggest if it also had all the words in the text with their offsets. How about giving the client the word pattern and letting it find the words in the text? Then send the set of unique words to the server for those endpoints instead of the whole text.

I don't know that we need to pass around the word pattern. If we send the text blob to the API (like you have now) and the API returns a list of words, the caller should be able to treat those as opaque strings, search for them in the text it sent, and do something with them (highlight them, etc). I think in that case the caller doesn't need to know anything about "words" at all (not that it couldn't, just that I don't see that as a prereq).

@70ray
Copy link
Collaborator Author

70ray commented Nov 23, 2024

For the purpose of the interactive PI the client could use the endpoints you suggest if it also had all the words in the text with their offsets. How about giving the client the word pattern and letting it find the words in the text? Then send the set of unique words to the server for those endpoints instead of the whole text.

I don't know that we need to pass around the word pattern. If we send the text blob to the API (like you have now) and the API returns a list of words, the caller should be able to treat those as opaque strings, search for them in the text it sent, and do something with them (highlight them, etc). I think in that case the caller doesn't need to know anything about "words" at all (not that it couldn't, just that I don't see that as a prereq).

I don't think you can do it quite like that because a bad word can be a substring of a good word.

@cpeel
Copy link
Member

cpeel commented Nov 24, 2024

I don't think you can do it quite like that because a bad word can be a substring of a good word.

Yup, good point. So the server and the UI need to agree on what a "word" is. It's probably sufficient that we define it / document it and not try to have the client request it from the server.

@70ray
Copy link
Collaborator Author

70ray commented Nov 26, 2024

I don't think you can do it quite like that because a bad word can be a substring of a good word.

Yup, good point. So the server and the UI need to agree on what a "word" is. It's probably sufficient that we define it / document it and not try to have the client request it from the server.

I've tried using a similar regular expression in the client to find words. One thing I came across is that in unicode mode \w works differntly in php and javascript. In php it matches the ṭ character (in projectID5690905894029) but not in js. Changing js to use \p{L} matches it. Perhaps we should change the php expression in the same way so they are more likely to agree in all cases.

@70ray
Copy link
Collaborator Author

70ray commented Nov 27, 2024

Changing to returning an array of bad words.

@70ray 70ray closed this Nov 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants