Use consistent definition of word in WF (#634)
Because the regex `\b` word boundary treats underscore as a word
character, an occurrence of a word that is preceded by an underscore
would be counted by WF but not found by a "whole word" search. So the
word would be listed in WF, but when the user clicked it, the word
would not be found.

Fixed by changing search's "whole word" option to wrap the
search string in lookaround assertions on alphanumeric
characters, rather than `\b...\b`.

Also changed WF's idea of a word to include embedded
underscores (in the same way as embedded periods are
included), e.g. "i_001.jpg", but to strip leading or trailing
underscores, which signify italics, e.g. "_dog_".

Fixes #628
windymilla authored Jan 2, 2025
1 parent 0f1a65f commit 4ec915c
Showing 2 changed files with 8 additions and 3 deletions.
7 changes: 6 additions & 1 deletion src/guiguts/maintext.py
@@ -2149,8 +2149,13 @@ def find_match_in_range(
 slurp_text = slurp_text + "\n"
 else:
 search_string = re.escape(search_string)
+# Don't use \b for word boundary because it includes underscore
 if wholeword:
-search_string = r"\b" + search_string + r"\b"
+search_string = (
+r"(?=[[:alnum:]])(?<![[:alnum:]])"
++ search_string
++ r"(?<=[[:alnum:]])(?![[:alnum:]])"
+)
 # Preferable to use flags rather than prepending "(?i)", for example,
 # because if we need to report bad regex to user, it's better if it's
 # the regex they typed.
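
To illustrate why `\b...\b` misses a word wrapped in italic underscores while the lookaround form finds it, here is a minimal sketch. It is not the project's code: the standard-library `re` module does not support the POSIX class `[[:alnum:]]` used in the diff, so the sketch substitutes `[^\W_]` (word characters minus underscore) as a stand-in, and `whole_word_pattern` and `use_b` are hypothetical names introduced only for this example.

```python
import re

# Assumption: stand-in for [[:alnum:]], which stdlib `re` does not support.
# [^\W_] matches any word character except underscore.
ALNUM = r"[^\W_]"


def whole_word_pattern(search_string: str, use_b: bool = False) -> str:
    """Build a whole-word regex for a literal search string (hypothetical helper)."""
    escaped = re.escape(search_string)
    if use_b:
        # Old behaviour: \b counts underscore as a word character
        return r"\b" + escaped + r"\b"
    # New behaviour: the match must start and end on an alphanumeric,
    # with no alphanumeric immediately before or after it
    return (
        rf"(?={ALNUM})(?<!{ALNUM})"
        + escaped
        + rf"(?<={ALNUM})(?!{ALNUM})"
    )


text = "The _dog_ barked at the dogs."
old = re.search(whole_word_pattern("dog", use_b=True), text)
new = re.search(whole_word_pattern("dog"), text)
print(old)          # None: "_dog_" has no \b next to the underscores
print(new.start())  # 5: the "dog" inside "_dog_"
```

In both forms "dogs" is still rejected, because the character after "dog" is alphanumeric.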
4 changes: 2 additions & 2 deletions src/guiguts/word_frequency.py
@@ -101,11 +101,11 @@ def ensure_file_analyzed(self) -> None:
 line = line.lower()
 line = re.sub(r"<\/?[a-z]*>", " ", line) # throw away DP tags
 # get rid of nonalphanumeric (retaining combining characters)
-line = re.sub(r"[^'’\.,\p{Alnum}\p{Mark}*-]", " ", line)
+line = re.sub(r"[^'’\.,\p{Alnum}\p{Mark}*_-]", " ", line)

 def strip_punc(word: str) -> str:
 """Strip relevant leading/trailing punctuation from word."""
-return re.sub(r"^[\.,'’-]+|[\.,'’-]+$", "", word)
+return re.sub(r"^[\.,'’_-]+|[\.,'’_-]+$", "", word)

 # Build a list of emdash words, i.e. "word1--word2"
 words = re.split(r"\s+", line)
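
The effect of the two changed patterns on word extraction can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the standard-library `re` module does not support `\p{Alnum}` or `\p{Mark}`, so the sketch uses `\w` (which already includes underscore, but does not retain combining marks), it skips the DP-tag and emdash handling, and `wf_words` is a hypothetical name.

```python
import re


def wf_words(line: str) -> list[str]:
    """Split a line into WF-style words (hypothetical, simplified helper)."""
    line = line.lower()
    # Assumption: \w stands in for \p{Alnum} plus underscore; everything
    # except word characters and the retained punctuation becomes a space.
    line = re.sub(r"[^'’.,\w*-]", " ", line)
    words = []
    for word in line.split():
        # Strip leading/trailing punctuation, now including underscores
        word = re.sub(r"^[.,'’_-]+|[.,'’_-]+$", "", word)
        if word:
            words.append(word)
    return words


print(wf_words("See i_001.jpg for the _dog_, not the _Cat._"))
# ['see', 'i_001.jpg', 'for', 'the', 'dog', 'not', 'the', 'cat']
```

The embedded underscore in "i_001.jpg" survives, while the italic markers around "_dog_" and "_Cat._" are stripped, which matches the behaviour described in the commit message.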
