Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search Pattern Improvement #32

Open
Vysci opened this issue Jun 11, 2021 · 1 comment
Open

Search Pattern Improvement #32

Vysci opened this issue Jun 11, 2021 · 1 comment
Labels
enhancement New feature or request

Comments

@Vysci
Copy link
Owner

Vysci commented Jun 11, 2021

Currently search patterns is per word basis, enhance it to support phrases (including spaces)

@Vysci Vysci added the enhancement New feature or request label Jun 11, 2021
@Vysci Vysci changed the title Support phrases in search patterns Search Pattern Improvement Jun 11, 2021
@Vysci Vysci mentioned this issue Jun 11, 2021
@juemrami juemrami pinned this issue Jun 26, 2024
@juemrami
Copy link
Collaborator

juemrami commented Jun 28, 2024

I've been thinking about how to do this and have started really looking at the code and how the message parsing works. Ive got a rough solution coded out but still needs alot of work.

Here's a couple of notes on the issue:

  • The locale specific "tagStrings"(what i call them) are already space separated, so we'd need another special/magic char to indicate "space"
    • my preference is to use + since it makes the most sense.
    • i.e; the message should match "this" plus "that", or "other => tagString = "this+that other"
  • We could use a word version of a Suffix Trie, for "efficient" pattern matching where we want to find sequences of words.
    • This might only be viable for roman and cyrillic alphabets (maybe korean but thats not supported by addon atm).
    • I think an actual suffix trie might work for chinese since each "character" is usually a word.
  • The Current algorithm for categorizing looks something like:
    • pre-process tags and create a [tag] => CategoryKey map for everything in Tags.lua
    • New chat messages passed to GBB.GetMessageWordList (formerly named SplitToNoNb)
      • creates differently processed versions (9) of the message. ie numbers removed, punctuation removed, spaces over punctuations, etc
      • splits every processed copy @ word level and collects all unique words as a list
    • list of words from the message are iterated and checked against the [tag] => CategoryKey map to see what categories matches the message.
    • limitations:
      • the map only supports single word keys, in order to support multiple words there would need to be another copied string that splits the message at every other "space". ie "this is a test" => "this is", "is a", "a test"
  • The trie based algorithm works similarly.
    • process Tags.lua and create Trie data structure for all phrases. A phrase would be 1 or more words connected by + (or any punctuation)
    • in the case where there a no multi word sequences the trie pretty much behaves like the previous map. [tag] => CategoryKey
    • however when a build the trie if a tag with sequence is seen (say "molten+core mc rag"):
      • instead of mapping from ["molten"] => CategoryKey, theres a mapping from ["molten"] => ["core"] => CategoryKey
      • image
    • Incoming message is processed in similarly to before, instead of making an entire copy of the string, we only copy the bits a string and append to the original.
    • eg: before "lf1m zul'aman" -> ["lf1m zul'aman", "lf m zul'aman", "lfm zul'aman", "lf1m zul aman", "lf1m zulaman] (separate strings to parse through)
    • after '"lf1m zul'aman" -> "lf1m lf m lfm zul'aman zul aman zulaman" 1 string
    • limitations:
    • not as accurate since its processing alot less string variations (can be improved with better regex pattern [WIP])
  • There needs to be some guidelines for how tags should be written:
    • ex1: incoming message have all their "apostrophes" normalized so "zul'aman zul`aman zul´aman are redundant tags.
    • ex2: We should make note that any punctuation characters on incoming messages (barring chinese punctuation) are replaced with a single space.
      • Therefore:"zul'aman, zul`aman" => zul aman
      • so tags should just use the sequence of zul+aman as that would match any form of zul{%p}aman where %p is an appropriate punctuation character.
    • For the time being we can do some backward compatibility when pre-processing tags to build the Trie by treating [´] as the + character which we will be using for joining sequences/chains of words as single patterns for categorization.
    • ex3: Incoming message are also "transliterated" which removes any accents from latin characters. For this reason, there makes no sense in having tags with accents in them

performance profiling
Trie (with no sequences) vs Regular Map. Note similar performance scaling with slight overhead. The scaling remains linear with sequences added since the incomming message is parsed incrementally.
image

GBB.GetMessageWordList optimization (slightly less accurate but better performant)
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants