What are some issues with using term frequency as a relevance indicator?
$$\text{tf} (\text{book}) = 4$$
$$\text{tf} (\text{book}) = 5$$
$$\text{tf}(\text{"money"}, \text{"money money money"}) > \text{tf}(\text{"money"}, \text{"Legit document about money"})$$
Keyword stuffing (spam) should not be rewarded.
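The inequality can be reproduced with a minimal sketch of raw term frequency. The helper `tf` and its whitespace tokenization are assumptions made for illustration, not how any particular search engine counts terms.

```python
def tf(term: str, document: str) -> int:
    """Raw term frequency: occurrences of `term` among whitespace-separated tokens.

    Hypothetical helper; lowercasing and splitting on whitespace are
    simplifying assumptions.
    """
    return document.lower().split().count(term.lower())


spam = "money money money"
legit = "Legit document about money"

print(tf("money", spam))   # 3
print(tf("money", legit))  # 1
```

Repeating the term is enough to outrank the legitimate document when the raw count is the only signal.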
Notes:
Who can explain what the formula means and why this is an issue?
$$\text{tf}(\text{"cat OR dog"}, \text{"cat cat cat"}) > \text{tf}(\text{"cat OR dog"}, \text{"cat dog"})$$
Documents which match all query terms should be ranked higher.
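A small sketch of why this happens, assuming the score of an OR query is the sum of the per-term counts (an assumption for illustration; the slide does not define tf for multi-term queries). The helpers `query_tf` and `matched_terms` are hypothetical.

```python
def query_tf(terms: list[str], document: str) -> int:
    """Sum of raw term frequencies over all query terms (assumed OR semantics)."""
    tokens = document.lower().split()
    return sum(tokens.count(t.lower()) for t in terms)


def matched_terms(terms: list[str], document: str) -> int:
    """Number of distinct query terms that occur in the document at least once."""
    tokens = set(document.lower().split())
    return sum(1 for t in terms if t.lower() in tokens)


query = ["cat", "dog"]
for doc in ["cat cat cat", "cat dog"]:
    print(doc, query_tf(query, doc), matched_terms(query, doc))
```

The summed count favors "cat cat cat", while the number of matched query terms favors "cat dog".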
Notes:
Who can explain what the formula means and why this is an issue?
$$\text{tf}(\text{"cat"}, \text{"cat"}) < \text{tf}(\text{"cat"}, \text{"cat dog mouse elephant cat"})$$
Term frequency should be normalized by document length.
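One simple way to do this, sketched under the assumption that document length is the number of whitespace-separated tokens, is to divide the raw count by that length. The helper `normalized_tf` is hypothetical, and dividing by length is only one of several possible normalizations.

```python
def normalized_tf(term: str, document: str) -> float:
    """Raw count of `term` divided by the number of tokens in the document."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty document
    return tokens.count(term.lower()) / len(tokens)


print(normalized_tf("cat", "cat"))                         # 1.0
print(normalized_tf("cat", "cat dog mouse elephant cat"))  # 0.4
```

With normalization, the one-word document about "cat" now scores higher than the longer document that also mentions other animals.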
Notes:
Who can explain what the formula means and why this is an issue?
Meanwhile, in the real world...
Elasticsearch's default similarity is now Okapi BM25
Also based on TF-IDF
Much less intuitive to read:
$$\text{score}(D,Q)=\sum_{i=1}^{n}\text{IDF}(q_i)\cdot \frac{\text{TF}(q_i,D)\cdot (k_1+1)}{\text{TF}(q_i,D)+k_1\cdot \left(1-b+b\cdot \frac{|D|}{\text{avgdl}}\right)}$$
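The formula can be read more easily as code. The following is a rough sketch, not Elasticsearch's actual implementation: the function name `bm25_score`, the toy corpus, and the IDF variant $\ln\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)$ are assumptions, since the slide does not define IDF. The values `k1=1.2` and `b=0.75` are commonly used defaults.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query, term by term.

    `corpus` is a non-empty list of tokenized documents, used for IDF and avgdl.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # assumed IDF variant
        tf = doc_tokens.count(term)
        length_norm = 1 - b + b * len(doc_tokens) / avgdl
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Toy corpus reusing the example documents from the previous slides.
corpus = [d.split() for d in
          ["cat", "cat dog mouse elephant cat", "money money money", "dog"]]

short_doc = bm25_score(["cat"], corpus[0], corpus)
long_doc = bm25_score(["cat"], corpus[1], corpus)
print(short_doc > long_doc)  # True
```

The `length_norm` factor penalizes documents longer than average, and the fraction saturates below `k1 + 1` as `tf` grows, so repeating a term yields diminishing returns.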
Notes: