Before you begin, it is recommended that you normalize the vector data used in the test.
Assume there is an initial vector with n dimensions:

$$X = (x_1, x_2, \ldots, x_n)$$

The L2 norm of the vector is represented as:

$$\|X\| = \sqrt{\sum_{i=1}^{n} x_i^2}$$

Each dimension is normalized as follows:

$$x_i' = \frac{x_i}{\|X\|}$$

After normalization, the L2 norm of the vector is 1:

$$\|X'\| = \sqrt{\sum_{i=1}^{n} x_i'^2} = 1$$
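The normalization step above can be sketched in plain Python (the function name `l2_normalize` is illustrative, not part of any particular library):

```python
import math

def l2_normalize(v):
    """Divide every dimension by the vector's L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

u = l2_normalize([3.0, 4.0])  # norm of [3, 4] is 5, so u = [0.6, 0.8]
print(u)
print(math.sqrt(sum(x * x for x in u)))  # L2 norm of u is now ~1.0
```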
ANNS (approximate nearest neighbor search) is the mainstream approach to vector search. Its key idea is to compute and search only within subspaces of the original vector space, which significantly increases overall search speed.
Assume the search space (a subspace of the original vector space) contains vectors of the form:

$$Y = (y_1, y_2, \ldots, y_n)$$

The dot product of two vectors is defined as follows:

$$X \cdot Y = \sum_{i=1}^{n} x_i y_i$$

The cosine similarity of two vectors is represented as:

$$\cos\theta = \frac{X \cdot Y}{\|X\| \|Y\|}$$
Similarity is measured by the cosine of the angle between two vectors: the greater the cosine, the higher the similarity:

$$\mathrm{sim}(X, Y) = \cos\theta$$
Assume that after normalization, the original vectors $X$ and $Y$ are converted to $X'$ and $Y'$:

$$X' = \left(\frac{x_1}{\|X\|}, \ldots, \frac{x_n}{\|X\|}\right), \quad Y' = \left(\frac{y_1}{\|Y\|}, \ldots, \frac{y_n}{\|Y\|}\right)$$

Then:

$$\cos\theta' = \frac{X' \cdot Y'}{\|X'\| \|Y'\|} = X' \cdot Y' = \frac{X \cdot Y}{\|X\| \|Y\|} = \cos\theta$$

Thus, the cosine similarity of two vectors remains unchanged after normalization. In particular, since $\|X'\| = \|Y'\| = 1$, it can be concluded that cosine similarity equals dot product for normalized vectors.
The Euclidean distance between two vectors is represented as:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Similarity is measured by comparing the Euclidean distance between two vectors: the smaller the Euclidean distance, the higher the similarity.

If you further expand the formula above, you get:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} x_i^2 - 2\sum_{i=1}^{n} x_i y_i + \sum_{i=1}^{n} y_i^2} = \sqrt{\|X\|^2 + \|Y\|^2 - 2\, X \cdot Y}$$

For normalized vectors, where $\|X\| = \|Y\| = 1$, this simplifies to:

$$d(X, Y) = \sqrt{2 - 2\, X \cdot Y}$$
It is obvious that the square of the Euclidean distance is negatively correlated with the dot product. Since the Euclidean distance is a non-negative real number, and the ordering of two non-negative real numbers is the same as the ordering of their squares, the Euclidean distance itself decreases exactly as the dot product increases.

Therefore, we can conclude that after vector normalization, searching for the same vector in the same vector space with Euclidean distance returns the same top-k results as searching with the dot product.
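The top-k equivalence can be demonstrated with a small sketch in plain Python over an arbitrary toy search space (all names and data here are illustrative):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A toy search space of normalized 2-d vectors (values chosen arbitrarily).
space = [l2_normalize(v) for v in ([1.0, 0.0], [0.9, 0.4], [0.5, 0.5],
                                   [0.0, 1.0], [-0.3, 0.8])]
query = l2_normalize([1.0, 0.2])
k = 3

# Rank by ascending Euclidean distance vs. descending dot product:
top_by_distance = sorted(range(len(space)),
                         key=lambda i: euclidean(query, space[i]))[:k]
top_by_dot = sorted(range(len(space)),
                    key=lambda i: -dot(query, space[i]))[:k]
print(top_by_distance, top_by_dot)  # the two rankings are identical
```

Because $d(X, Y) = \sqrt{2 - 2\, X \cdot Y}$ on normalized vectors, the two orderings are guaranteed to agree, not merely likely to.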