Skip to content

Latest commit

 

History

History
215 lines (154 loc) · 5.66 KB

section_fuzzy_search.md

File metadata and controls

215 lines (154 loc) · 5.66 KB

Fuzzy search

Find documents with words similar to bock

e.g. book, rock, spock.

bock~

Notes:


Levenshtein

  • Edit distance between two words
  • Count inserts, deletes, replaces, transpositions / swaps

Notes:


Levenshtein example

Operation Result
0. Start iformmetoin~
­ 1. Add n ­ informmetoin
­ 2. Delete m ­ informetoin
­ 3. Replace e with a ­ informatoin
­ 4. Swap o and i ­ information
­ The End

­ Levenshtein distance = 4

Notes:

  • Audience question
  • How can this be used to find similar terms?

bock~

Term Doc IDs Levenshtein distance (not in index)
book #1 1
information #2 >2
rock #3 1
retrieval #4 >2
spock #5 2

(consider only terms with Levenshtein distance <2)

book OR rock OR spock

#1, #3, #5

Notes:

  • What is the complexity?

Levenshtein complexity

  • Expensive: Cannot be precomputed*
  • Compare every query term with every vocabular term
  • num(query terms) × num(vocabulary terms)

*Except with some highly complex finite state machines

Notes:


Levenshtein improvements

  • Weighted (keyboard distance)
  • Maximum allowed Levenshtein distance based on query term length
    • E.g. 0-2 must match exactly; 3-5 one edit allowed; >5 two edits allowed
    • Otherwise e~ would match everything

Notes:


N-Grams

Notes:

  • How can this be used for fuzzy search?

N-Gram index

#1: book, #2: rock, #3: spock

Term Doc IDs
^bo #1
boo #1
ook #1
ok^ #1
^ro #2
roc #2
^sp #3
spo #3
poc #3
ock #2 ,#3
ck^ #2, #3

Notes:


bock

^bo OR boc OR ock OR ck^

Term Doc IDs
^bo #1
boo #1
poc #3
ock #2 ,#3
ck^ #2, #3

Which document is the best match?

Notes:


Term Doc IDs
^bo #1
boo #1
poc #3
ock #2 ,#3
ck^ #2, #3

  1. [^bo, boo, ook, ok^] → 25%
  2. [^ro, roc, <span>ock</span><!-- .element: class="highlight-blue" -->, <span>ck^</span><!-- .element: class="highlight-blue" -->] → 50%
  3. [^sp, spo, poc, <span>ock</span><!-- .element: class="highlight-blue" -->, <span>ck^</span><!-- .element: class="highlight-blue" -->] → 40%

Notes: