Find documents with words similar to bock
e.g. book, rock, spock.
↓
bock~
Notes:
- Edit distance between two words
- Count inserts, deletes, replaces, transpositions / swaps
Notes:
Operation | Result |
---|---|
0. Start | iformmetoin~ |
1. Add n | informmetoin |
2. Delete m | informetoin |
3. Replace e with a | informatoin |
4. Swap o and i | information |
The End |
Levenshtein distance = 4
Notes:
- Audience question
- How can this be used to find similar terms?
bock~
↓
Term | Doc IDs | Levenshtein distance (not in index) |
---|---|---|
book | #1 | 1 |
information | #2 | >2 |
rock | #3 | 1 |
retrieval | #4 | >2 |
spock | #5 | 2 |
↓
(consider only terms with Levenshtein distance <2)
↓
book OR rock OR spock
↓
#1, #3, #5
Notes:
- What is the complexity?
- Expensive: Cannot be precomputed*
- Compare every query term with every vocabular term
num(query terms) × num(vocabulary terms)
*Except with some highly complex finite state machines
Notes:
- Weighted (keyboard distance)
- Maximum allowed Levenshtein distance based on query term length
- E.g. 0-2 must match exactly; 3-5 one edit allowed; >5 two edits allowed
- Otherwise
e~
would match everything
Notes:
Notes:
- How can this be used for fuzzy search?
#1: book
, #2: rock
, #3: spock
Term | Doc IDs |
---|---|
^bo | #1 |
boo | #1 |
ook | #1 |
ok^ | #1 |
^ro | #2 |
roc | #2 |
^sp | #3 |
spo | #3 |
poc | #3 |
ock | #2 ,#3 |
ck^ | #2, #3 |
Notes:
bock
↓
^bo OR boc OR ock OR ck^
↓
Term | Doc IDs |
---|---|
^bo | #1 |
boo | #1 |
… | |
poc | #3 |
ock | #2 ,#3 |
ck^ | #2, #3 |
Which document is the best match?
Notes:
Term | Doc IDs |
---|---|
^bo | #1 |
boo | #1 |
… | |
poc | #3 |
ock | #2 ,#3 |
ck^ | #2, #3 |
↓
- [^bo, boo, ook, ok^] → 25%
- [^ro, roc, <span>ock</span><!-- .element: class="highlight-blue" -->, <span>ck^</span><!-- .element: class="highlight-blue" -->] → 50%
- [^sp, spo, poc, <span>ock</span><!-- .element: class="highlight-blue" -->, <span>ck^</span><!-- .element: class="highlight-blue" -->] → 40%
Notes: