Notes:
More matches for query term = more relevant document
Wikipedia article on book vs. paper
Notes:
Infrequent terms in corpus are more relevant
Searching apple.com for apple macbook: macbook more relevant than apple
Notes:
- Count term occurrences per doc
- Ignore order of terms
- Bag of words
Notes:
- Where to save TF info?
- #1: a book providing information about information retrieval
- #2: a book about the search for books
- #3: a book about information
Term | Doc IDs |
---|---|
Book | #1:1, #2:2, #3:1 |
Information | #1:2, #3:1 |
Retrieval | #1:1 |
Search | #2:1 |
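
A minimal sketch of how the table above could be built, assuming term frequencies are stored next to each doc ID in the inverted index. The names `VOCABULARY`, `tokenize`, and `build_index` are hypothetical, and limiting the index to four terms with a crude plural rule is an assumption made only to reproduce the table.

```python
from collections import Counter, defaultdict

docs = {
    1: "a book providing information about information retrieval",
    2: "a book about the search for books",
    3: "a book about information",
}

# Assumption: only these four terms are indexed, matching the table.
VOCABULARY = {"book", "information", "retrieval", "search"}


def tokenize(text):
    """Lowercase, split on whitespace, and keep only vocabulary terms.

    A trailing "s" is stripped when the remainder is a vocabulary term
    (a crude stand-in for stemming, so "books" counts as "book").
    """
    tokens = []
    for word in text.lower().split():
        if word.endswith("s") and word[:-1] in VOCABULARY:
            word = word[:-1]
        if word in VOCABULARY:
            tokens.append(word)
    return tokens


def build_index(docs):
    """Map each term to {doc_id: term frequency}. Term order is ignored."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return dict(index)


index = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Keeping the count inside the posting (`doc_id: count`) means a lookup by term returns both the matching documents and their term frequencies in one step.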
Notes:
- Audience participation
- Searching apple.com for apple OR macbook
- Fewer documents with macbook than apple
- macbook more important
- Rank uncommon terms higher
- Only relevant for OR search
- Store inverse document frequency per term
Notes:
- Why is it only relevant for OR search?
- Why is it stored per term?
- What is the min and max IDF? Why?
$$\begin{aligned} \text{idf}(\text{term}) & = \frac{\text{num_docs}}{\text{document_frequency}(\text{term})}\\ \\ \text{idf}(\text{apple}) & = \frac{10}{9} \approx 1.1 \\ \\ \text{idf}(\text{macbook}) & = \frac{10}{2} = 5 \end{aligned}$$
- #1: a book providing information about information retrieval
- #2: a book about the search for books
- #3: a book about information
Term | IDF | Doc IDs |
---|---|---|
Book | 1 | #1:1, #2:2, #3:1 |
Information | 1.5 | #1:2, #3:1 |
Retrieval | 3 | #1:1 |
Search | 3 | #2:1 |
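
The IDF column can be reproduced with a short sketch, assuming the simple ratio `num_docs / document_frequency(term)` from the formula and an index literal copied from the table. The names `index`, `NUM_DOCS`, `document_frequency`, and `idf` are hypothetical.

```python
# Inverted index from the table: term -> {doc_id: term frequency}
index = {
    "book": {1: 1, 2: 2, 3: 1},
    "information": {1: 2, 3: 1},
    "retrieval": {1: 1},
    "search": {2: 1},
}
NUM_DOCS = 3


def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return len(index[term])


def idf(term):
    """Plain ratio form of IDF: num_docs / document_frequency(term)."""
    return NUM_DOCS / document_frequency(term)


for term in index:
    print(term, idf(term))
```

With this ratio, a term that appears in every document gets the minimum value of 1, and a term that appears in exactly one document gets the maximum value of `NUM_DOCS`.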
Notes:
- idf(t) = 1 is a special case
- Audience participation
Notes:
- Explain formula in human-speak.
Term | IDF | Doc IDs |
---|---|---|
Book | 1 | #1:1, #2:2, #3:1 |
Information | 1.5 | #1:2, #3:1 |
Retrieval | 3 | #1:1 |
Search | 3 | #2:1 |
Query: information retrieval search
- #1: 2 × 1.5 + 1 × 3 + 0 × 3 = 6
- #2: 0 × 1.5 + 0 × 3 + 1 × 3 = 3
- #3: 1 × 1.5 + 0 × 3 + 0 × 3 = 1.5
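
The scores for this query can be sketched as a sum of term frequency times IDF over the query terms, assuming an OR search and the index from the table. The names `index`, `NUM_DOCS`, `idf`, `score`, and `rank` are hypothetical.

```python
# Inverted index from the table: term -> {doc_id: term frequency}
index = {
    "book": {1: 1, 2: 2, 3: 1},
    "information": {1: 2, 3: 1},
    "retrieval": {1: 1},
    "search": {2: 1},
}
NUM_DOCS = 3


def idf(term):
    """Plain ratio form of IDF: num_docs / document_frequency(term)."""
    return NUM_DOCS / len(index[term])


def score(query_terms, doc_id):
    """Sum of tf(term, doc) * idf(term) over query terms found in the index."""
    total = 0.0
    for term in query_terms:
        if term in index:
            tf = index[term].get(doc_id, 0)  # 0 if the doc lacks the term
            total += tf * idf(term)
    return total


def rank(query_terms, doc_ids):
    """Return (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, score(query_terms, doc_id)) for doc_id in doc_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


query = ["information", "retrieval", "search"]
for doc_id, value in rank(query, [1, 2, 3]):
    print(f"#{doc_id}: {value}")
```

A document that lacks a query term simply contributes 0 for that term, which is why the ranking works for an OR search without any extra filtering.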
Notes: