TF pitfalls

What issues arise when term frequency is used as a relevance indicator?

Not always true

$$\text{tf} (\text{book}) = 4$$

$$\text{tf} (\text{book}) = 5$$

Spam

$$\text{tf}(\text{"money"}, \text{"money money money"}) > \text{tf}(\text{"money"}, \text{"Legit document about money"})$$

Repeating a term should not be enough to outrank a legitimate document.
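A minimal sketch of the inequality above, assuming tf is a raw count over lowercased, whitespace-split tokens. The helper name `tf` and the tokenization are illustrative choices, not something the slides prescribe.

```python
def tf(term: str, document: str) -> int:
    """Raw term frequency: how often `term` occurs in `document`.

    Assumption: tokens are lowercased and split on whitespace.
    """
    return document.lower().split().count(term.lower())


spam = "money money money"
legit = "Legit document about money"

spam_tf = tf("money", spam)    # 3 occurrences
legit_tf = tf("money", legit)  # 1 occurrence

# Ranking by raw tf alone puts the spam document first.
ranking = sorted([spam, legit], key=lambda d: tf("money", d), reverse=True)
```

With raw counts as the only signal, `ranking[0]` is the spam document.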

Notes:

Who can explain what the formula means and why this is an issue?

Multi-term queries

$$\text{tf}(\text{"cat OR dog"}, \text{"cat cat cat"}) > \text{tf}(\text{"cat OR dog"}, \text{"cat dog"})$$

Documents that match all query terms should be ranked higher.
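A small sketch of why this happens, assuming the tf of an OR query is simply the sum of the raw counts of its terms. That summation, and the helper names `query_tf` and `matched_terms`, are assumptions made for illustration.

```python
def query_tf(query_terms: list[str], document: str) -> int:
    """Sum of raw term counts over all query terms (assumed scoring)."""
    tokens = document.lower().split()
    return sum(tokens.count(term) for term in query_terms)


def matched_terms(query_terms: list[str], document: str) -> int:
    """Number of distinct query terms that occur in the document."""
    tokens = set(document.lower().split())
    return sum(1 for term in query_terms if term in tokens)


query = ["cat", "dog"]

only_cat_tf = query_tf(query, "cat cat cat")  # 3, but matches one term
both_tf = query_tf(query, "cat dog")          # 2, but matches both terms

only_cat_matches = matched_terms(query, "cat cat cat")  # 1
both_matches = matched_terms(query, "cat dog")          # 2
```

The summed count prefers "cat cat cat", while the number of matched terms prefers "cat dog".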

Notes:

Who can explain what the formula means and why this is an issue?

Document length

$$\text{tf}(\text{"cat"}, \text{"cat"}) < \text{tf}(\text{"cat"}, \text{"cat dog mouse elephant cat"})$$

Term frequency should be normalized by document length.
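One possible normalization, sketched under the assumption that document length means the number of whitespace-split tokens. Dividing by the length is only one of several schemes, and the helper names are hypothetical.

```python
def raw_tf(term: str, document: str) -> int:
    """Raw count of `term` among lowercased, whitespace-split tokens."""
    return document.lower().split().count(term.lower())


def normalized_tf(term: str, document: str) -> float:
    """Raw count divided by the number of tokens in the document."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # avoid dividing by zero for empty documents
    return tokens.count(term.lower()) / len(tokens)


short_doc = "cat"
long_doc = "cat dog mouse elephant cat"

raw_short = raw_tf("cat", short_doc)          # 1
raw_long = raw_tf("cat", long_doc)            # 2
norm_short = normalized_tf("cat", short_doc)  # 1 / 1 = 1.0
norm_long = normalized_tf("cat", long_doc)    # 2 / 5 = 0.4
```

After normalization the document that is entirely about "cat" scores higher than the longer one that mentions it twice.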

Notes:

Who can explain what the formula means and why this is an issue?

Meanwhile, in the real world...

Elasticsearch's default similarity is now Okapi BM25
Also built on term frequency and inverse document frequency
Much less intuitive to read:

$$\text{score}(D,Q)=\sum_{i=1}^{n}\text{IDF}(q_i)\cdot \frac{\text{tf}(q_i,D)\cdot (k_1+1)}{\text{tf}(q_i,D)+k_1\cdot \left(1-b+b\cdot \frac{|D|}{\text{avgdl}}\right)}$$
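The formula can be sketched in a few lines of Python. Several things here are assumptions rather than part of the slide: the formula leaves IDF undefined, so this sketch uses the smoothed form $\ln\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)$, where $N$ is the number of documents and $n$ the number containing the term. The values `k1=1.2` and `b=0.75` are commonly used defaults, and the tokenization, the tiny corpus, and the function name `bm25_score` are purely illustrative.

```python
import math


def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score `doc` against `query_terms` following the BM25 formula.

    Assumptions: whitespace tokenization, lowercase query terms,
    and a smoothed IDF (the slide does not define IDF).
    """
    tokenized = [d.lower().split() for d in corpus]
    doc_tokens = doc.lower().split()
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs

    score = 0.0
    for term in query_terms:
        # Number of documents in the corpus containing the term.
        df = sum(1 for tokens in tokenized if term in tokens)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_tokens.count(term)
        # Documents longer than average are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(doc_tokens) / avgdl
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


corpus = [
    "money money money",
    "Legit document about money",
    "cat dog",
    "cat cat cat",
]

only_cat = bm25_score(["cat", "dog"], "cat cat cat", corpus)
both = bm25_score(["cat", "dog"], "cat dog", corpus)
```

On this toy corpus `both` is larger than `only_cat`: repeated occurrences of one term saturate, the rarer term "dog" carries a higher IDF, and the shorter document is not penalized for its length.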

Notes:
