TF pitfalls

What are some issues with using term frequency as a relevance indicator?


Not always true

book book

$$\text{tf} (\text{book}) = 4$$

book library

$$\text{tf} (\text{book}) = 5$$


Spam

$$\text{tf}(\text{"money"}, \text{"money money money"}) > \text{tf}(\text{"money"}, \text{"Legit document about money"})$$

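Raw term frequency rewards keyword stuffing: repeating a term inflates the score without making the document more relevant. Spam like this should not outrank legitimate documents.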
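As a minimal sketch of the inequality above, the snippet below counts raw term frequency for both documents. The `tf` helper and its lowercase, whitespace-based tokenization are assumptions made for illustration; real analyzers tokenize differently.

```python
from collections import Counter


def tf(term: str, document: str) -> int:
    """Raw term frequency: number of times `term` occurs in `document`.

    Assumption: tokens are produced by lowercasing and splitting on whitespace.
    """
    tokens = document.lower().split()
    return Counter(tokens)[term.lower()]


spam_tf = tf("money", "money money money")
legit_tf = tf("money", "Legit document about money")

# The spam document wins on raw tf even though it carries no information.
print(spam_tf, legit_tf)
```

With raw counts as the only signal, the stuffed document scores three times as high as the legitimate one.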

Notes:

  • Who can explain what the formula means and why this is an issue?

Multi-term queries

$$\text{tf}(\text{"cat OR dog"}, \text{"cat cat cat"}) > \text{tf}(\text{"cat OR dog"}, \text{"cat dog"})$$

Documents that match all query terms should rank higher than documents that merely repeat one of them.
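The formula above can be sketched in a few lines, assuming an `OR` query is scored by summing the raw counts of its terms. Both helper names (`or_query_tf`, `matched_terms`) are hypothetical and exist only for this illustration.

```python
def or_query_tf(terms: list[str], document: str) -> int:
    """Sum of raw term counts over all query terms (assumed OR semantics)."""
    tokens = document.lower().split()
    return sum(tokens.count(term.lower()) for term in terms)


def matched_terms(terms: list[str], document: str) -> int:
    """Number of distinct query terms that occur in the document."""
    vocabulary = set(document.lower().split())
    return sum(1 for term in terms if term.lower() in vocabulary)


query = ["cat", "dog"]
repeated = "cat cat cat"
complete = "cat dog"

repeated_score = or_query_tf(query, repeated)
complete_score = or_query_tf(query, complete)
repeated_coverage = matched_terms(query, repeated)
complete_coverage = matched_terms(query, complete)

print(repeated_score, complete_score)        # summed tf per document
print(repeated_coverage, complete_coverage)  # query terms matched per document
```

The summed count prefers the document that repeats `cat`, while the coverage count shows that only the second document matches both query terms.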

Notes:

  • Who can explain what the formula means and why this is an issue?

Document length

$$\text{tf}(\text{"cat"}, \text{"cat"}) < \text{tf}(\text{"cat"}, \text{"cat dog mouse elephant cat"})$$

Longer documents contain more term occurrences simply because they contain more words, so term frequency should be normalized by document length.
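One simple way to normalize, shown here as a sketch rather than the only option, is to divide the raw count by the number of tokens in the document. The `normalized_tf` helper and the whitespace tokenization are assumptions for illustration.

```python
def normalized_tf(term: str, document: str) -> float:
    """Raw count of `term` divided by the document length in tokens.

    Assumption: tokens are produced by lowercasing and splitting on whitespace.
    """
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


short_doc = "cat"
long_doc = "cat dog mouse elephant cat"

short_score = normalized_tf("cat", short_doc)
long_score = normalized_tf("cat", long_doc)

# After normalization the short, fully on-topic document ranks first.
print(short_score, long_score)
```

Under this normalization the ordering from the formula above flips: the one-word document scores 1.0, the five-word document 0.4.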

Notes:

  • Who can explain what the formula means and why this is an issue?

Meanwhile, in the real world...

  • Elasticsearch default is now Okapi BM25
  • Also based on TF-IDF
  • Much less intuitive to read:

$${\displaystyle {\text{score}}(D,Q)=\sum_{i=1}^{n}{\text{IDF}}(q_i)\cdot {\frac {TF(q_i,D)\cdot (k_1+1)}{TF(q_i,D) +k_1\cdot \left(1-b+b\cdot {\frac {|D|}{\text{avgdl}}}\right)}}}$$
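To make the formula more concrete, here is a minimal Python sketch of it. The formula above leaves IDF unspecified; the sketch assumes the variant $\ln\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)$, where $N$ is the number of documents and $n$ the number containing the term. The defaults `k1=1.2` and `b=0.75` match Elasticsearch's defaults. The function name, the toy corpus, and the whitespace tokenization are illustrative assumptions, not Elasticsearch's implementation.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query, following the formula above.

    `corpus` is a list of tokenized documents used for IDF and avgdl.
    """
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        # Assumed IDF variant; the formula above does not define IDF.
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_tokens.count(term)
        # Length normalization: documents longer than avgdl are penalized.
        denominator = tf + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf * (k1 + 1) / denominator
    return score


# Toy corpus reusing the earlier examples (whitespace tokenization assumed).
texts = ["money money money", "Legit document about money", "cat dog"]
corpus = [text.lower().split() for text in texts]

spam_score = bm25_score(["money"], corpus[0], corpus)
legit_score = bm25_score(["money"], corpus[1], corpus)
missing_score = bm25_score(["zebra"], corpus[0], corpus)

print(spam_score, legit_score, missing_score)
```

On this toy corpus the spam document still scores higher, but by well under the 3x gap that raw term frequency produces: the $k_1$ term makes repeated occurrences saturate, and the $b$ term folds in document length.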

Notes: