Use consistent definition of word in WF #634

windymilla · 2025-01-01T14:58:40Z

Because regexes include underscore in their idea of word characters, an occurrence preceded by underscore of a word found by WF would not be found by "whole word" search. So, the word would be listed in WF, but when the user clicked it, the word would not be found.

Fixed by changing search's "whole word" option to wrap the search string in a more complex regex, rather than \b...\b.
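To illustrate the difference, here is a minimal Python sketch. It is not the code from this PR: the helper name `whole_word_pattern` and the lookaround-based wrapper are assumptions chosen only to show one way a "whole word" wrap can treat underscore as a non-word character.

```python
import re


def whole_word_pattern(term: str) -> re.Pattern:
    """Hypothetical helper: wrap a literal search term so that it only
    matches when not adjacent to a letter or digit. Unlike \\b, an
    adjacent underscore does not block the match."""
    word_char = r"[^\W_]"  # any \w character except underscore
    return re.compile(rf"(?<!{word_char}){re.escape(term)}(?!{word_char})")


text = "a _dog_ and a dog, but no dogs or hotdog"

# \b sees no boundary between "_" and "d", so "_dog_" is missed.
plain = re.findall(r"\bdog\b", text)

# The wrapped pattern finds both "_dog_" and the bare "dog".
wrapped = whole_word_pattern("dog").findall(text)

print(len(plain), len(wrapped))  # prints "1 2"
```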

Also changed WF's idea of a word to include embedded underscores (in the same way as embedded periods are included), e.g. "i_001.jpg", but to strip leading or trailing underscores, which signify italics, e.g. "_dog_".
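The word definition described above can be sketched roughly as follows. This is an assumption-laden illustration, not WF's actual tokenizer: the names `WORD_RE` and `wf_words` are hypothetical, and real WF handles more cases (apostrophes, hyphens and so on).

```python
import re

# Assumed definition: runs of letters/digits, optionally joined by a single
# embedded period or underscore. Leading and trailing underscores are never
# part of the match, so italic markup is stripped.
WORD_RE = re.compile(r"[^\W_]+(?:[._][^\W_]+)*")


def wf_words(text: str) -> list[str]:
    """Hypothetical helper returning the words found in text."""
    return WORD_RE.findall(text)


print(wf_words("See i_001.jpg for the _dog_ picture."))
# ['See', 'i_001.jpg', 'for', 'the', 'dog', 'picture']
```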

Fixes #628


windymilla · 2025-01-01T15:09:30Z

Note that one (I think inconsequential) consequence of this PR is that doing a Whole Word search for dog will find a match in _dog or _dog_ or dog, but attempting the "same" thing manually by doing a Regex search for \bdog\b will not find the occurrences with underscores, and will only find dog. This is because \b considers underscore to be a word character, so there is no word break between _ and dog. This is not fixable, except by inspecting the regex in the code, detecting use of \b (but avoiding \\b, but not avoiding \\\b, etc!) and replacing it with something more complicated. I don't think this would be a good use of time.
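As a rough illustration of why rewriting \b inside a user's regex is fiddly, the sketch below detects a genuine \b by counting backslashes: it is a word-boundary assertion only when an even number of backslashes precedes it. The names `BOUNDARY_RE` and `has_word_boundary` are hypothetical, and the sketch deliberately ignores further complications such as \b inside a character class, where Python treats it as a backspace.

```python
import re

# Zero or more escaped backslashes (pairs), then a backslash followed by "b".
# The lookbehind ensures the run of backslashes is counted from its start.
BOUNDARY_RE = re.compile(r"(?<!\\)(?:\\\\)*\\b")


def has_word_boundary(pattern: str) -> bool:
    """Hypothetical check: does the regex source contain a real \\b?"""
    return BOUNDARY_RE.search(pattern) is not None


print(has_word_boundary(r"\bdog\b"))  # True: plain \b
print(has_word_boundary(r"\\b"))      # False: escaped backslash, then "b"
print(has_word_boundary(r"\\\b"))     # True: escaped backslash, then \b
```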

Also note that Guiguts 1 has had the same unnoticed inconsistency between Whole Word search and \b...\b regex search for many years, and that, apart from italic markup, our books do not generally contain underscores.

okrick · 2025-01-01T16:52:11Z

I'm pleased you found a solution for GG2. However, I disagree with the assessment that GG1 was also inconsistent. 000.jpg is present in the GG1 WF Alpha/num list, and double-clicking it successfully locates the subsequent occurrence of 000.jpg.

windymilla · 2025-01-01T17:12:40Z

> I'm pleased you found a solution for GG2. However, I disagree with the assessment that GG1 was also inconsistent. 000.jpg is present in the GG1 WF Alpha/num list, and double-clicking it successfully locates the subsequent occurrence of 000.jpg.

Yes, GG1 was consistent between the WF word list and clicking on it, and that is what is being fixed in GG2.

However, I was pointing out that the inconsistency between whole word search and \b search was the same as GG1, and is not being fixed in GG2 (at time of writing). To see this inconsistency in GG1, bring up the Search dialog and type 001.jpg into the search field. Click Whole Word, and search - it will find those occurrences. Now, click Regex instead of Whole Word and change the search term to \b001.jpg\b, and you will find the search fails.

srjfoo

> Note that one (I think inconsequential) consequence of this PR is that doing a Whole Word search for dog will find a match in _dog or _dog_ or dog, but attempting the "same" thing manually by doing a Regex search for \bdog\b will not find the occurrences with underscores, and will only find dog. This is because \b considers underscore to be a word character, so there is no word break between _ and dog. This is not fixable, except by inspecting the regex in the code, detecting use of \b (but avoiding \\b, but not avoiding \\\b, etc!) and replacing it with something more complicated. I don't think this would be a good use of time.

Should this potentially be documented in known issues? Or just in the WF section of the manual?

windymilla · 2025-01-02T11:04:35Z

> Should this potentially be documented in known issues? Or just in the WF section of the manual?

Thanks. Added to Known Issues. It's not really a WF thing since WF behavior is now self-consistent - it's a Search dialog thing.

okrick · 2025-01-02T16:23:57Z

Upon further review, this behavior might simply be a documentation issue. It could be confusing, but likely not widespread. Unless, of course, numerous PPers frequently use WF to edit UTF-8 files where underscores are intended to represent italics.

windymilla · 2025-01-02T16:28:57Z

> Upon further review, this behavior might simply be a documentation issue. It could be confusing, but likely not widespread. Unless, of course, numerous PPers frequently use WF to edit UTF-8 files where underscores are intended to represent italics.

I've added a bit to the manual to describe the unavoidable regex inconsistency (\b compared with Whole Word) which shouldn't generally be a problem for people. I think what we now have as a consequence of your report is better than we had previously, particularly for the case you describe of underscores for italics, which non-ppgen PPers would quite likely be doing.

windymilla requested a review from srjfoo January 1, 2025 14:58

windymilla mentioned this pull request Jan 1, 2025

Word Frequency fails to position cursor #628

Closed

srjfoo approved these changes Jan 2, 2025

windymilla merged commit 4ec915c into DistributedProofreaders:master Jan 2, 2025
1 check passed

windymilla deleted the word-boundaries branch January 2, 2025 11:05
