
Use consistent definition of word in WF #634

Merged

Conversation

windymilla
Collaborator

Because regexes treat underscore as a word character, "whole word" search (which wrapped the search string in \b...\b) would not find an occurrence of a WF-listed word that was preceded by an underscore. So, the word would be listed in WF, but when the user clicked it, the word would not be found.

Fixed by changing search's "whole word" to wrap the search string with a more complex regex, rather than \b...\b
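As a rough illustration of the idea (the regex actually used in this PR is not shown in the thread), a lookaround-based wrapper could look like the sketch below. The helper name `whole_word_pattern` and the exact character class are assumptions made for this example only.

```python
import re


def whole_word_pattern(word: str) -> str:
    """Hypothetical helper: wrap a literal search string so that it
    matches as a whole word, treating "_" as a non-word character."""
    # [^\W_] means "any word character except underscore", so the
    # lookarounds only reject a neighbouring letter or digit.
    return rf"(?<![^\W_]){re.escape(word)}(?![^\W_])"


text = "dog _dog_ _dog dog_ hotdog dogs"
old_hits = re.findall(r"\bdog\b", text)                 # plain \b...\b wrapping
new_hits = re.findall(whole_word_pattern("dog"), text)  # underscore-tolerant wrapping
print(len(old_hits), len(new_hits))  # prints: 1 4
```

With this wrapping, `hotdog` and `dogs` are still rejected, while occurrences next to an underscore are accepted.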

Also changed WF's idea of a word to include embedded underscores (in the same way as embedded periods are included), e.g. "i_001.jpg", but to strip leading or trailing underscores, which signify italics, e.g. "_dog_".
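A minimal sketch of that word definition, assuming a simplified tokenizer (the real WF code also has to deal with apostrophes, hyphens and so on, which are ignored here). `WORD_RE` and `wf_words` are hypothetical names, not taken from the Guiguts source.

```python
import re

# Assumed simplification: runs of letters/digits, optionally joined by a
# single embedded "." or "_". Leading and trailing underscores are never
# part of the match, so italic markup is stripped automatically.
WORD_RE = re.compile(r"[^\W_]+(?:[._][^\W_]+)*")


def wf_words(text: str) -> list[str]:
    """Return the words found in text under the definition above."""
    return WORD_RE.findall(text)


print(wf_words("See _dog_ in i_001.jpg."))
# prints: ['See', 'dog', 'in', 'i_001.jpg']
```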

Fixes #628

@windymilla
Collaborator Author

Note that one (I think inconsequential) consequence of this PR is that doing a Whole Word search for dog will find a match in _dog or _dog_ or dog, but attempting the "same" thing manually by doing a Regex search for \bdog\b will not find the occurrences with underscores, and will only find dog. This is because \b considers underscore to be a word character, so there is no word break between _ and dog. This is not fixable, except by inspecting the regex in the code, detecting use of \b (but avoiding \\b, but not avoiding \\\b, etc!) and replacing it with something more complicated. I don't think this would be a good use of time.
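To show why the rewrite is fiddly, here is a sketch of what "detecting use of \b" would involve. It is not something this PR does; `relax_word_boundaries`, `BOUNDARY` and `UNESCAPED_B` are hypothetical names, and the sketch deliberately ignores other cases such as `[\b]` inside a character class, where `\b` means backspace.

```python
import re

# Assumed replacement for \b: a boundary between a letter/digit and
# anything else, so that "_" counts as a non-word character.
BOUNDARY = r"(?:(?<![^\W_])(?=[^\W_])|(?<=[^\W_])(?![^\W_]))"

# An unescaped \b is "b" preceded by an odd-length run of backslashes:
# zero or more escaped pairs (\\) followed by one more backslash.
UNESCAPED_B = re.compile(r"(?<!\\)((?:\\\\)*)\\b")


def relax_word_boundaries(pattern: str) -> str:
    """Replace each unescaped \\b in a user regex with BOUNDARY,
    keeping any preceding escaped backslashes unchanged."""
    return UNESCAPED_B.sub(lambda m: m.group(1) + BOUNDARY, pattern)


relaxed = relax_word_boundaries(r"\bdog\b")
print(bool(re.search(r"\bdog\b", "_dog_")), bool(re.search(relaxed, "_dog_")))
# prints: False True
```

Even this small version needs backslash-parity handling, which supports the view that the rewrite is not worth the effort.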

Also note that Guiguts 1 has had the same unnoticed inconsistency between Whole Word search and \b...\b regex search for many years. Also that, apart from italic markup, our books do not generally contain underscores.

@okrick

okrick commented Jan 1, 2025

I'm pleased you found a solution for GG2. However, I disagree with the assessment that GG1 was also inconsistent. 000.jpg is present in the GG1 WF Alpha/num list, and double-clicking it successfully locates the subsequent occurrence of 000.jpg.

@windymilla
Collaborator Author

> I'm pleased you found a solution for GG2. However, I disagree with the assessment that GG1 was also inconsistent. 000.jpg is present in the GG1 WF Alpha/num list, and double-clicking it successfully locates the subsequent occurrence of 000.jpg.

Yes, GG1 was consistent between the WF word list and clicking on it, and that is what is being fixed in GG2.

However, I was pointing out that the inconsistency between whole word search and \b search is the same as in GG1, and is not being fixed in GG2 (at time of writing). To see this inconsistency in GG1, bring up the Search dialog and type 001.jpg into the search field. Click Whole Word and search; it will find those occurrences. Now click Regex instead of Whole Word, change the search term to \b001\.jpg\b, and you will find the search fails.

Member

@srjfoo srjfoo left a comment


> Note that one (I think inconsequential) consequence of this PR is that doing a Whole Word search for dog will find a match in _dog or _dog_ or dog, but attempting the "same" thing manually by doing a Regex search for \bdog\b will not find the occurrences with underscores, and will only find dog. This is because \b considers underscore to be a word character, so there is no word break between _ and dog. This is not fixable, except by inspecting the regex in the code, detecting use of \b (but avoiding \\b, but not avoiding \\\b, etc!) and replacing it with something more complicated. I don't think this would be a good use of time.

Should this potentially be documented in known issues? Or just in the WF section of the manual?

@windymilla
Collaborator Author

> Should this potentially be documented in known issues? Or just in the WF section of the manual?

Thanks. Added to Known Issues. It's not really a WF thing since WF behavior is now self-consistent - it's a Search dialog thing.

@windymilla windymilla merged commit 4ec915c into DistributedProofreaders:master Jan 2, 2025
1 check passed
@windymilla windymilla deleted the word-boundaries branch January 2, 2025 11:05
@okrick

okrick commented Jan 2, 2025

Upon further review, this behavior might simply be a documentation issue. It could be confusing, but likely not widespread. Unless, of course, numerous PPers frequently use WF to edit UTF-8 files where underscores are intended to represent italics.

@windymilla
Collaborator Author

> Upon further review, this behavior might simply be a documentation issue. It could be confusing, but likely not widespread. Unless, of course, numerous PPers frequently use WF to edit UTF-8 files where underscores are intended to represent italics.

I've added a bit to the manual to describe the unavoidable regex inconsistency (\b compared with Whole Word) which shouldn't generally be a problem for people. I think what we now have as a consequence of your report is better than we had previously, particularly for the case you describe of underscores for italics, which non-ppgen PPers would quite likely be doing.

Development

Successfully merging this pull request may close these issues.

Word Frequency fails to position cursor
3 participants