section_ranked_retrieval.md

Ranked retrieval

Notes:


Idea I

More matches for query term = more relevant document

book book

book paper

Wikipedia article on book vs. paper

Notes:


Idea II

Infrequent terms in corpus are more relevant

apple.com search apple

apple.com search macbook

Searching apple.com for "apple macbook": macbook is the more relevant term, because fewer documents contain it than apple

Notes:


I: Term frequency

At index time:

  • Count term occurrences per doc
  • Ignore order of terms
  • Bag of words

Tag cloud
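
A minimal sketch of the bag-of-words idea, assuming plain whitespace tokenization and lowercasing (both are simplifications chosen for illustration; the helper name `bag_of_words` is hypothetical):

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count how often each term occurs; word order is discarded."""
    return Counter(text.lower().split())


# Two texts with the same words in a different order yield the same bag.
a = bag_of_words("information about a book")
b = bag_of_words("a book about information")
print(a == b)  # True

# Repeated terms are counted, which is what a tag cloud visualizes.
print(bag_of_words("book paper book")["book"])  # 2
```
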

Notes:

  • Where to save TF info?

TF

  • #1: a book providing information about information retrieval
  • #2: a book about the search for books
  • #3: a book about information

| Term | Postings (doc ID:TF) |
|---|---|
| Book | #1:1, #2:2, #3:1 |
| Information | #1:2, #3:1 |
| Retrieval | #1:1 |
| Search | #2:1 |
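
One possible way to build the table above is to store the term frequency next to each doc ID in the inverted index. The sketch below assumes whitespace tokenization and a crude plural-stripping `normalize` helper (a hypothetical stand-in for a real stemmer, needed so that "books" counts as "book"). It indexes every word, while the table lists only four of the terms.

```python
from collections import Counter, defaultdict

docs = {
    1: "a book providing information about information retrieval",
    2: "a book about the search for books",
    3: "a book about information",
}


def normalize(token: str) -> str:
    # Assumption: strip a trailing plural "s" instead of real stemming.
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def build_index(docs: dict) -> dict:
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        counts = Counter(normalize(t) for t in text.split())
        for term, tf in counts.items():
            index[term][doc_id] = tf
    return dict(index)


index = build_index(docs)
for term in ("book", "information", "retrieval", "search"):
    print(term, index[term])
# book {1: 1, 2: 2, 3: 1}
# information {1: 2, 3: 1}
# retrieval {1: 1}
# search {2: 1}
```
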

Notes:

  • Audience participation

II: Inverse document frequency

  • Searching apple.com for apple OR macbook
    • fewer documents with macbook than apple
    • macbook more important
  • Rank uncommon terms higher
  • Only relevant for OR search
  • Store inverse document frequency per term

Notes:

  • Why is it only relevant for OR search?
  • Why is it stored per term?
  • What is the min and max IDF? Why?

Inverse Document Frequency

$$\begin{aligned} \text{idf}(\text{term}) & = \frac{\text{num_docs}}{\text{document_frequency}(\text{term})}\\ \\ \text{idf}(\text{apple}) & = \frac{10}{9} \approx 1.11 \\ \\ \text{idf}(\text{macbook}) & = \frac{10}{2} = 5 \end{aligned}$$
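
The formula can be checked with a few lines of code. This sketch uses the plain ratio from the slide; many search engines use a logarithmic variant instead, which is not shown here. The function name and the error handling are illustrative assumptions.

```python
def idf(num_docs: int, document_frequency: int) -> float:
    """Inverse document frequency as the plain ratio used on this slide."""
    if document_frequency <= 0:
        raise ValueError("term must occur in at least one document")
    return num_docs / document_frequency


print(round(idf(10, 9), 2))  # 1.11  (apple: in 9 of 10 documents)
print(idf(10, 2))            # 5.0   (macbook: in 2 of 10 documents)
```

With this definition the value ranges from 1 (term in every document) up to `num_docs` (term in exactly one document).
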

IDF

  • #1: a book providing information about information retrieval
  • #2: a book about the search for books
  • #3: a book about information

| Term | IDF | Postings (doc ID:TF) |
|---|---|---|
| Book | 1 | #1:1, #2:2, #3:1 |
| Information | 1.5 | #1:2, #3:1 |
| Retrieval | 3 | #1:1 |
| Search | 3 | #2:1 |
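
The IDF column can be derived from the postings alone, since the document frequency of a term is the length of its posting list. A small sketch, with the index written out by hand from the table (the variable names are illustrative):

```python
# Postings copied from the table: term -> {doc_id: term frequency}
index = {
    "book": {1: 1, 2: 2, 3: 1},
    "information": {1: 2, 3: 1},
    "retrieval": {1: 1},
    "search": {2: 1},
}
num_docs = 3

# One IDF value per term: num_docs divided by the number of docs containing it.
idf = {term: num_docs / len(postings) for term, postings in index.items()}
print(idf)
# {'book': 1.0, 'information': 1.5, 'retrieval': 3.0, 'search': 3.0}
```
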

Notes:

  • idf(t) = 1 is a special case: the term occurs in every document
  • Audience participation

TF-IDF Ranking

$$\text{score}(\text{query}, \text{document}) = \sum_{\text{term} \in \text{query}} \left( \text{tf}(\text{term}, \text{document}) \times \text{idf}(\text{term}) \right)$$
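
Read as code, the formula is a sum over the query terms of TF times IDF. The sketch below assumes the term frequencies of one document and the per-term IDF values are already available as dictionaries; the function and parameter names are hypothetical.

```python
def score(query_terms, doc_tf, idf):
    """Sum tf(term, document) * idf(term) over all query terms.

    Terms missing from the document contribute 0 because their tf is 0.
    """
    return sum(doc_tf.get(term, 0) * idf.get(term, 0.0) for term in query_terms)


# IDF values from the example corpus; term frequencies of document #3.
idf = {"book": 1.0, "information": 1.5, "retrieval": 3.0, "search": 3.0}
doc3_tf = {"book": 1, "information": 1}

print(score(["book", "information"], doc3_tf, idf))  # 2.5
```
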

Notes:

  • Explain formula in human-speak.

| Term | IDF | Postings (doc ID:TF) |
|---|---|---|
| Book | 1 | #1:1, #2:2, #3:1 |
| Information | 1.5 | #1:2, #3:1 |
| Retrieval | 3 | #1:1 |
| Search | 3 | #2:1 |

Query: information retrieval search

#1

2 × 1.5 + 1 × 3 + 0 × 3 = 6

#2

0 × 1.5 + 0 × 3 + 1 × 3 = 3

#3

1 × 1.5 + 0 × 3 + 0 × 3 = 1.5
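
The three sums above can be reproduced by walking the posting list of each query term and accumulating TF × IDF per document. This is one common way to evaluate an OR query, sketched here with the index written out by hand; the `rank` helper is hypothetical.

```python
from collections import defaultdict

# term -> (idf, {doc_id: term frequency}), copied from the table
index = {
    "book": (1.0, {1: 1, 2: 2, 3: 1}),
    "information": (1.5, {1: 2, 3: 1}),
    "retrieval": (3.0, {1: 1}),
    "search": (3.0, {2: 1}),
}


def rank(query: str):
    """Return (doc_id, score) pairs, best match first."""
    scores = defaultdict(float)
    for term in query.lower().split():
        if term not in index:
            continue  # unknown terms contribute nothing
        idf, postings = index[term]
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(rank("information retrieval search"))
# [(1, 6.0), (2, 3.0), (3, 1.5)]
```
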

Notes: