What are some issues with term frequency as a relevance indicator?
$$\text{tf} (\text{book}) = 4$$
$$\text{tf} (\text{book}) = 5$$
$$\text{tf}(\text{"money"}, \text{"money money money"}) > \text{tf}(\text{"money"}, \text{"Legit document about …"})$$
Spam should be prevented.
Who can explain what the formula means and why this is an issue?
$$\text{tf}(\text{"cat OR dog"}, \text{"cat cat cat"}) > \text{tf}(\text{"cat OR dog"}, \text{"cat dog"})$$
Documents which match all query terms should be ranked higher.
Who can explain what the formula means and why this is an issue?
$$\text{tf}(\text{"cat"}, \text{"cat"}) < \text{tf}(\text{"cat"}, \text{"cat dog mouse elephant cat"})$$
Term frequency should be normalized with the document length.
Who can explain what the formula means and why this is an issue?
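The three inequalities above can be reproduced with a minimal sketch. It assumes that raw term frequency is a plain count over whitespace-separated, lowercased tokens, and that a multi-term query sums the counts of its terms while ignoring the `OR` operator. The names `tf` and `tf_normalized` are illustrative and do not come from any library.

```python
from collections import Counter


def tf(query: str, document: str) -> int:
    """Raw term frequency: total occurrences of the query terms in the document."""
    counts = Counter(document.lower().split())
    # Assumption: "OR" is a query operator, not a term to be counted.
    terms = [t for t in query.lower().split() if t != "or"]
    return sum(counts[t] for t in terms)


def tf_normalized(query: str, document: str) -> float:
    """Term frequency divided by document length (number of tokens)."""
    length = len(document.split())
    return tf(query, document) / length if length else 0.0


# Issue 1: repeating a term inflates the score (spam).
spam_score = tf("money", "money money money")

# Issue 2: a document matching one term many times beats one matching all terms.
one_term = tf("cat OR dog", "cat cat cat")
all_terms = tf("cat OR dog", "cat dog")

# Issue 3: a longer document collects more matches simply because it is longer.
short_doc = tf("cat", "cat")
long_doc = tf("cat", "cat dog mouse elephant cat")

# Length normalization reverses the ordering for issue 3.
short_norm = tf_normalized("cat", "cat")
long_norm = tf_normalized("cat", "cat dog mouse elephant cat")
```

In this sketch, dividing by document length only addresses the third issue: the spam document "money money money" still gets the maximum normalized score of 1.0, so normalization alone does not prevent spam.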
Meanwhile, in the real world...
Elasticsearch's default scoring function is now Okapi BM25.
It is also based on TF-IDF.
Its formula is much less self-explanatory:
$$\text{score}(D,Q)=\sum_{i=1}^{n}\text{IDF}(q_i)\cdot \frac{\text{TF}(q_i,D)\cdot (k_1+1)}{\text{TF}(q_i,D)+k_1\cdot \left(1-b+b\cdot \frac{|D|}{\text{avgdl}}\right)}$$
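To make the formula more concrete, here is a minimal sketch of it in Python. It is not Elasticsearch's implementation. The function name `bm25_score` and the toy corpus are illustrative. The slide does not define IDF, so the sketch assumes the smoothed variant `ln(1 + (N - n + 0.5) / (n + 0.5))`, where `N` is the number of documents and `n` the number of documents containing the term. The values `k1 = 1.2` and `b = 0.75` are the commonly used defaults.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query using the BM25 formula."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs  # average document length
    counts = Counter(doc_tokens)
    # Length normalization: above 1 for long documents, below 1 for short ones.
    length_norm = 1 - b + b * len(doc_tokens) / avgdl

    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        # Assumed IDF variant (the slide leaves IDF undefined).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = counts[term]
        # The TF part saturates: it can never exceed k1 + 1.
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Toy corpus built from the example documents used earlier.
corpus = [doc.split() for doc in
          ["cat", "cat dog", "cat cat cat", "cat dog mouse elephant cat"]]

both_terms = bm25_score(["cat", "dog"], "cat dog".split(), corpus)
repeated = bm25_score(["cat", "dog"], "cat cat cat".split(), corpus)
short_doc = bm25_score(["cat"], "cat".split(), corpus)
long_doc = bm25_score(["cat"], "cat dog mouse elephant cat".split(), corpus)
```

On this toy corpus, the document matching both query terms outscores the one that repeats "cat" three times, and the one-word document "cat" outscores the longer document that contains "cat" twice. These are the orderings that raw term frequency got wrong.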