Use consistent definition of word in WF (#634)

Because regexes treat underscore as a word character, `\b` does not match between an underscore and a letter. As a result, an occurrence of a word found by WF that was preceded or followed by an underscore would not be found by "whole word" search: the word would be listed in WF, but when the user clicked it, the search would fail. Fixed by changing search's "whole word" to wrap the search string in lookaround assertions on alphanumeric characters, rather than `\b...\b`. Also changed WF's idea of a word to include embedded underscores (in the same way as embedded periods are included), e.g. "i_001.jpg", but to strip leading or trailing underscores, which signify italics, e.g. "_dog_". Fixes #628
DistributedProofreaders · Jan 2, 2025 · 4ec915c
1 parent 0f1a65f
Showing 2 changed files with 8 additions and 3 deletions.
diff --git a/src/guiguts/maintext.py b/src/guiguts/maintext.py
@@ -2149,8 +2149,13 @@ def find_match_in_range(
                     slurp_text = slurp_text + "\n"
         else:
             search_string = re.escape(search_string)
+        # Don't use \b for word boundary because it includes underscore
         if wholeword:
-            search_string = r"\b" + search_string + r"\b"
+            search_string = (
+                r"(?=[[:alnum:]])(?<![[:alnum:]])"
+                + search_string
+                + r"(?<=[[:alnum:]])(?![[:alnum:]])"
+            )
         # Preferable to use flags rather than prepending "(?i)", for example,
         # because if we need to report bad regex to user, it's better if it's
         # the regex they typed.
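
The effect of the whole-word change above can be illustrated with a minimal sketch. It is not the project's code: the diff uses the POSIX class `[[:alnum:]]`, which the standard-library `re` module does not support, so this sketch assumes `[^\W_]` (a word character that is not an underscore) as a stand-in. The helper name `wholeword_pattern` and the sample text are hypothetical.

```python
import re

# Assumption: stand-in for [[:alnum:]] from the diff, which stdlib `re` lacks.
# [^\W_] matches any word character except underscore.
ALNUM = r"[^\W_]"


def wholeword_pattern(search_string: str) -> str:
    """Wrap an escaped search string in alphanumeric lookarounds (hypothetical helper)."""
    escaped = re.escape(search_string)
    return (
        rf"(?={ALNUM})(?<!{ALNUM})"    # match starts on an alnum, not preceded by one
        + escaped
        + rf"(?<={ALNUM})(?!{ALNUM})"  # match ends on an alnum, not followed by one
    )


text = "the _dog_ saw a dog and a dogfish"

# Old behaviour: \b sees "_d" as two word characters, so "_dog_" is missed.
old = [m.start() for m in re.finditer(r"\bdog\b", text)]

# New behaviour: underscore no longer counts as part of the word.
new = [m.start() for m in re.finditer(wholeword_pattern("dog"), text)]

print(old)  # [16]
print(new)  # [5, 16]
```

One consequence of this shape of pattern is that the match must begin and end on an alphanumeric character, so a whole-word search for a string that starts or ends with punctuation would find nothing.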

diff --git a/src/guiguts/word_frequency.py b/src/guiguts/word_frequency.py
@@ -101,11 +101,11 @@ def ensure_file_analyzed(self) -> None:
                 line = line.lower()
             line = re.sub(r"<\/?[a-z]*>", " ", line)  # throw away DP tags
             # get rid of nonalphanumeric (retaining combining characters)
-            line = re.sub(r"[^'’\.,\p{Alnum}\p{Mark}*-]", " ", line)
+            line = re.sub(r"[^'’\.,\p{Alnum}\p{Mark}*_-]", " ", line)
 
             def strip_punc(word: str) -> str:
                 """Strip relevant leading/trailing punctuation from word."""
-                return re.sub(r"^[\.,'’-]+|[\.,'’-]+$", "", word)
+                return re.sub(r"^[\.,'’_-]+|[\.,'’_-]+$", "", word)
 
             # Build a list of emdash words, i.e. "word1--word2"
             words = re.split(r"\s+", line)
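
The word_frequency.py change can be sketched in the same spirit. This is a simplified illustration, not the actual `ensure_file_analyzed` logic: it assumes stdlib `re`, so `\w` (alphanumerics plus underscore) replaces `\p{Alnum}\p{Mark}` together with the newly added `_`, which means combining characters are not retained here. It also skips the tag removal and emdash handling. The function name `words_in` is hypothetical.

```python
import re


def strip_punc(word: str) -> str:
    """Strip leading/trailing punctuation, now including underscore."""
    return re.sub(r"^[.,'’_-]+|[.,'’_-]+$", "", word)


def words_in(line: str) -> list[str]:
    """Split a line into words, keeping embedded underscores (simplified sketch)."""
    # Assumption: \w stands in for \p{Alnum}\p{Mark} plus "_" from the diff.
    line = re.sub(r"[^'’.,\w*-]", " ", line)
    stripped = (strip_punc(token) for token in line.split())
    return [word for word in stripped if word]


print(words_in("See _dog_ in i_001.jpg, (page 3)."))
# ['See', 'dog', 'in', 'i_001.jpg', 'page', '3']
```

The italic markup around "_dog_" is stripped, while the underscore inside "i_001.jpg" survives, matching the behaviour described in the commit message.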