Ranked retrieval

Notes:

Idea I

More matches for query term = more relevant document

Wikipedia article on book vs. paper

Notes:

Idea II

Infrequent terms in corpus are more relevant

Searching apple.com for apple macbook: macbook more relevant than apple

Notes:

I: Term frequency

At index time:

Count term occurrences per doc
Ignore order of terms
Bag of words
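
A minimal sketch of the counting step. It assumes lowercasing and whitespace tokenization, which the slides do not specify, and `term_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count how often each term occurs in one document (bag of words)."""
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return Counter(text.lower().split())


doc = "a book about the search for books"
shuffled = "books for search the about book a"

# Word order is discarded: both texts produce the same bag of words.
print(term_frequencies(doc) == term_frequencies(shuffled))
```

Without stemming, `book` and `books` are counted as separate terms here.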

Notes:

Where to save TF info?

TF

#1: a book providing information about information retrieval
#2: a book about the search for books
#3: a book about information

| Term | Doc IDs |
| --- | --- |
| Book | #1:1, #2:2, #3:1 |
| Information | #1:2, #3:1 |
| Retrieval | #1:1 |
| Search | #2:1 |
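
One way to arrive at such a table is to store the term frequency inside each posting of the inverted index. The sketch below is illustrative only: the stop-word list and the toy plural stripping in `normalize` are assumptions chosen so that the output matches the table, and `build_index` is a hypothetical name.

```python
from collections import defaultdict

DOCS = {
    1: "a book providing information about information retrieval",
    2: "a book about the search for books",
    3: "a book about information",
}

# Assumption: the table omits these words, so treat them as stop words.
STOP_WORDS = {"a", "about", "for", "providing", "the"}


def normalize(token: str) -> str:
    """Toy normalizer: lowercase and strip a plural "s" (not a real stemmer)."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def build_index(docs: dict[int, str]) -> dict[str, dict[int, int]]:
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.split():
            term = normalize(token)
            if term not in STOP_WORDS:
                index[term][doc_id] += 1
    return {term: dict(postings) for term, postings in index.items()}


index = build_index(DOCS)
for term, postings in sorted(index.items()):
    print(term, postings)
```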

Notes:

Audience participation

II: Inverse document frequency

Searching apple.com for apple OR macbook
- fewer documents with macbook than apple
- macbook more important
Rank uncommon terms higher
Only relevant for OR search
Store inverse document frequency per term

Notes:

Why is it only relevant for OR search?
Why is it stored per term?
What is the min and max IDF? Why?

Inverse Document Frequency

$$\begin{aligned} \text{idf}(\text{term}) & = \frac{\text{num_docs}}{\text{document_frequency}(\text{term})}\\ \\ \text{idf}(\text{apple}) & = \frac{10}{9} \approx 1.11 \\ \\ \text{idf}(\text{macbook}) & = \frac{10}{2} = 5 \end{aligned}$$
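
The ratio can be sketched as a small function. The name `idf` and its two parameters mirror the symbols in the formula; the guard against a zero document frequency is an added assumption.

```python
def idf(num_docs: int, document_frequency: int) -> float:
    """Plain ratio from the formula: num_docs / document_frequency(term)."""
    if document_frequency <= 0:
        raise ValueError("term must occur in at least one document")
    return num_docs / document_frequency


print(round(idf(10, 9), 2))  # apple: 10 / 9
print(idf(10, 2))            # macbook: 10 / 2
```

Many systems take a logarithm of this ratio; the sketch keeps the unscaled form used here.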

IDF

#1: a book providing information about information retrieval
#2: a book about the search for books
#3: a book about information

| Term | IDF | Doc IDs |
| --- | --- | --- |
| Book | 1 | #1:1, #2:2, #3:1 |
| Information | 1.5 | #1:2, #3:1 |
| Retrieval | 3 | #1:1 |
| Search | 3 | #2:1 |
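
The IDF column can be derived from the postings alone, which is one reason it is cheap to store per term. A minimal sketch, with the postings hard-coded from the table (`INDEX`, `NUM_DOCS`, and `idf_by_term` are illustrative names):

```python
# Postings copied from the table: term -> {doc_id: term frequency}.
INDEX = {
    "book": {1: 1, 2: 2, 3: 1},
    "information": {1: 2, 3: 1},
    "retrieval": {1: 1},
    "search": {2: 1},
}
NUM_DOCS = 3

# Document frequency is the number of postings, not the sum of the counts.
idf_by_term = {term: NUM_DOCS / len(postings) for term, postings in INDEX.items()}
print(idf_by_term)
# {'book': 1.0, 'information': 1.5, 'retrieval': 3.0, 'search': 3.0}
```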

Notes:

idf(t) = 1 is a special case
Audience participation

TF-IDF Ranking

$$\text{score}(\text{query}, \text{document}) = \sum_{\text{term} \in \text{query}} \left( \text{tf}(\text{term}, \text{document}) \times \text{idf}(\text{term}) \right)$$
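
Read as code, the formula is a sum over the query terms. The sketch below assumes the term frequencies of one document and the IDF values are available as plain dictionaries; the example numbers are hypothetical and only exercise the function.

```python
def score(query_terms, tf_in_doc, idf_by_term):
    """Sum tf(term, document) * idf(term) over the query terms."""
    return sum(
        tf_in_doc.get(term, 0) * idf_by_term.get(term, 0.0)
        for term in query_terms
    )


# Hypothetical numbers, chosen only to exercise the function.
tf_in_doc = {"apple": 3, "macbook": 1}
idf_by_term = {"apple": 1.1, "macbook": 5.0}
print(score(["apple", "macbook"], tf_in_doc, idf_by_term))
```

Terms that do not occur in the document contribute zero, so they need no special handling.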

Notes:

Explain formula in human-speak.

| Term | IDF | Doc IDs |
| --- | --- | --- |
| Book | 1 | #1:1, #2:2, #3:1 |
| Information | 1.5 | #1:2, #3:1 |
| Retrieval | 3 | #1:1 |
| Search | 3 | #2:1 |

information retrieval search

#1

2 × 1.5 + 1 × 3 + 0 × 3 = 6

#2

0 × 1.5 + 0 × 3 + 1 × 3 = 3

#3

1 × 1.5 + 0 × 3 + 0 × 3 = 1.5
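
The three calculations above can be reproduced by walking the postings of each query term and accumulating a score per document. This is a sketch under the assumption that the index holds the IDF next to each postings list; `INDEX` and `rank` are illustrative names.

```python
from collections import defaultdict

# Term -> (idf, {doc_id: term frequency}), copied from the table above.
INDEX = {
    "book": (1.0, {1: 1, 2: 2, 3: 1}),
    "information": (1.5, {1: 2, 3: 1}),
    "retrieval": (3.0, {1: 1}),
    "search": (3.0, {2: 1}),
}


def rank(query: str) -> list[tuple[int, float]]:
    """Return (doc_id, score) pairs, best first, for an OR query."""
    scores = defaultdict(float)
    for term in query.lower().split():
        idf, postings = INDEX.get(term, (0.0, {}))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(rank("information retrieval search"))
# [(1, 6.0), (2, 3.0), (3, 1.5)]
```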

Notes:
