Use consistent definition of word in WF #634
Because regexes include underscore in their idea of word characters, an occurrence of a word found by WF that is preceded by an underscore would not be found by "whole word" search. So, the word would be listed in WF, but when the user clicked it, the word would not be found.

Fixed by changing search's "whole word" to wrap the search string with a more complex regex, rather than `\b...\b`.

Also changed WF's idea of a word to include embedded underscores (in the same way as embedded periods are included), e.g. "i_001.jpg", but to strip leading or trailing underscores, which signify italics, e.g. "_dog_".

Fixes DistributedProofreaders#628
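To make the two changes concrete, here is a minimal sketch in Python using the standard `re` module. It is not the code from this PR: the names `WORD_RE`, `wf_words` and `whole_word_pattern` are hypothetical, and the assumption that a "word" consists only of letters and digits joined by single embedded `_` or `.` is a simplification (the real WF rules presumably also cover apostrophes, hyphens and so on).

```python
import re

# Assumption: a word is one or more letters/digits, optionally followed by
# groups of a single "." or "_" plus more letters/digits. Leading and
# trailing underscores or periods therefore never become part of the word.
WORD_RE = re.compile(r"[^\W_]+(?:[._][^\W_]+)*")


def wf_words(text: str) -> list[str]:
    """Return the words in text, with italic-marker underscores stripped."""
    return WORD_RE.findall(text)


def whole_word_pattern(word: str) -> re.Pattern:
    """Build a 'whole word' regex that does not treat "_" as a word character.

    Instead of wrapping the word in \\b...\\b, use lookarounds that only
    reject a letter or digit immediately before or after the match.
    """
    return re.compile(rf"(?<![^\W_]){re.escape(word)}(?![^\W_])")


if __name__ == "__main__":
    print(wf_words("_dog_ and i_001.jpg."))
    print(bool(whole_word_pattern("dog").search("an _italic dog_ here")))
```

With this shape, every word that `wf_words` reports can also be located by `whole_word_pattern`, which is the consistency the PR is after.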
Note that one (I think inconsequential) consequence of this PR is that doing a Whole Word search for … Also note that Guiguts 1 has had the same unnoticed inconsistency between Whole Word search and …
I'm pleased you found a solution for GG2. However, I disagree with the assessment that GG1 was also inconsistent. 000.jpg is present in the GG1 WF Alpha/num list, and double-clicking it successfully locates the subsequent occurrence of 000.jpg.
Yes, GG1 was consistent between the WF word list and clicking on it, and that is what is being fixed in GG2. However, I was pointing out that the inconsistency between whole word search and …
Note that one (I think inconsequential) consequence of this PR is that doing a Whole Word search for `dog` will find a match in `_dog` or `_dog_` or `dog`, but attempting the "same" thing manually by doing a Regex search for `\bdog\b` will not find the occurrences with underscores, and will only find `dog`. This is because `\b` considers underscore to be a word character, so there is no word break between `_` and `dog`. This is not fixable, except by inspecting the regex in the code, detecting use of `\b` (but avoiding `\\b`, but not avoiding `\\\b`, etc!) and replacing it with something more complicated. I don't think this would be a good use of time.
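For illustration only, the snippet below shows both halves of that remark in Python's `re` module: `\b` refusing to see a boundary next to an underscore, and why spotting a genuine `\b` in a user's pattern means tracking which backslashes are themselves escaped. The helper `unescaped_boundaries` is hypothetical and deliberately incomplete (for example, it ignores `[\b]` inside a character class, where it means backspace), which is part of why rewriting user regexes is unattractive.

```python
import re

text = "dog _dog _dog_ hotdog"

# \b treats "_" as a word character, so only the bare "dog" is matched.
plain_hits = [m.start() for m in re.finditer(r"\bdog\b", text)]


def unescaped_boundaries(pattern: str) -> list[int]:
    # Return the indexes of backslash-b tokens whose backslash is not itself
    # escaped. Escapes are consumed two characters at a time, so a doubled
    # backslash followed by "b" is skipped, while a tripled backslash
    # followed by "b" is reported.
    hits = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            if pattern[i + 1] == "b":
                hits.append(i)
            i += 2  # skip the escaped character, whatever it is
        else:
            i += 1
    return hits


if __name__ == "__main__":
    print(plain_hits)
    print(unescaped_boundaries(r"\bdog\b"))
    print(unescaped_boundaries(r"\\bdog"))
    print(unescaped_boundaries(r"\\\bdog"))
```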
Should this potentially be documented in known issues? Or just in the WF section of the manual?
Thanks. Added to Known Issues. It's not really a WF thing since WF behavior is now self-consistent - it's a Search dialog thing.
Upon further review, this behavior might simply be a documentation issue. It could be confusing, but likely not widespread. Unless, of course, numerous PPers frequently use WF to edit UTF-8 files where underscores are intended to represent italics.
I've added a bit to the manual to describe the unavoidable regex inconsistency ( …