# Data Normalization

## L2 Normalization

Before you begin, it is recommended to normalize the data used in the test.

Assume there is an initial vector with n dimensions:

Initial vector:

$$X = (x_1, x_2, \ldots, x_n)$$

The L2 norm of the vector is represented as:

$$\|X\| = \sqrt{\sum_{i=1}^{n} x_i^2}$$

Normalized vector:

$$X' = (x_1', x_2', \ldots, x_n')$$

Each dimension is calculated as follows:

$$x_i' = \frac{x_i}{\|X\|}, \quad i = 1, 2, \ldots, n$$

After normalization, the L2 norm is 1:

$$\|X'\| = \sqrt{\sum_{i=1}^{n} (x_i')^2} = 1$$
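As a minimal sketch of the calculation above (assuming NumPy; `l2_normalize` is an illustrative helper, not part of any particular library):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a vector so that its L2 norm becomes 1."""
    norm = np.linalg.norm(x)             # ||X|| = sqrt(sum(x_i ** 2))
    return x / norm                      # x_i' = x_i / ||X||

x = np.array([3.0, 4.0])
x_normalized = l2_normalize(x)           # array([0.6, 0.8])
print(np.linalg.norm(x_normalized))      # 1.0
```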

## Compute Vector Similarity

ANNS (approximate nearest neighbor search) is the mainstream approach in vector search. Its key concept is that computing and searching are done only in sub-spaces of the initial vector space, which significantly increases the overall search speed.

Assume the search space (a sub-space of the initial vector space) is $\gamma$. In the formulas below, $X = (x_1, x_2, \ldots, x_n)$ denotes the query vector and $Y = (y_1, y_2, \ldots, y_n)$ denotes a vector in $\gamma$.

### Inner Product (Dot Product)

The dot product of two vectors is defined as follows:

$$X \cdot Y = \sum_{i=1}^{n} x_i y_i$$
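As a brief illustration (assuming NumPy; the example vectors are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Sum of element-wise products: 1*4 + 2*5 + 3*6 = 32.0
print(np.dot(x, y))
```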

### Cosine Similarity

The cosine similarity of two vectors is represented as:

$$\cos\theta = \frac{X \cdot Y}{\|X\| \, \|Y\|}$$

Similarity is measured by the cosine of the angle between two vectors: the greater the cosine, the higher the similarity.
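A minimal sketch of this formula (assuming NumPy; the vectors are the same arbitrary examples as above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# cos(theta) = (X . Y) / (||X|| * ||Y||), always within [-1, 1]
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cosine)  # ~0.9746: the vectors point in nearly the same direction
```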

Assume that after vector normalization, the original vectors $X$ and $Y$ are converted to $X'$ and $Y'$:

$$X' = \frac{X}{\|X\|}, \quad Y' = \frac{Y}{\|Y\|}$$

Normalization only rescales each vector, so the angle between them does not change. Thus, the cosine similarity of two vectors remains unchanged after vector normalization. In particular, since $\|X'\| = \|Y'\| = 1$,

$$\cos\theta = \frac{X' \cdot Y'}{\|X'\| \, \|Y'\|} = X' \cdot Y'$$

it can be concluded that Cosine similarity equals Dot product for normalized vectors.
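A quick check of this conclusion (a sketch assuming NumPy; the vectors are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Cosine similarity of the original vectors.
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Dot product after L2 normalization.
x_n = x / np.linalg.norm(x)
y_n = y / np.linalg.norm(y)
dot_normalized = np.dot(x_n, y_n)

print(np.isclose(cosine, dot_normalized))  # True
```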

### Euclidean Distance

The Euclidean distance of two vectors is represented as:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Similarity is measured by comparing the Euclidean distance between two vectors: the smaller the Euclidean distance, the higher the similarity.

If you further unfold the above formula, you will get:

$$d(X, Y)^2 = \sum_{i=1}^{n} x_i^2 - 2 \sum_{i=1}^{n} x_i y_i + \sum_{i=1}^{n} y_i^2 = \|X\|^2 + \|Y\|^2 - 2\, X \cdot Y$$

For normalized vectors, $\|X'\| = \|Y'\| = 1$, so the formula reduces to $d(X', Y')^2 = 2 - 2\, X' \cdot Y'$. It is obvious that the square of the Euclidean distance is negatively correlated with the dot product. The Euclidean distance is a non-negative real number, and the size relationship between two non-negative real numbers is the same as the size relationship between their squares.

Therefore, we can conclude that after vector normalization, if you search for the same vector in the same vector space, Euclidean distance and Dot product return the same top-k results.
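A minimal sketch of this equivalence (assuming NumPy; the data is randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random search space and query vector, L2-normalized row by row.
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = rng.normal(size=64)
query /= np.linalg.norm(query)

k = 10
dot = vectors @ query                             # dot product (== cosine here)
dist = np.linalg.norm(vectors - query, axis=1)    # Euclidean distance

top_k_by_dot = np.argsort(-dot)[:k]               # largest dot product first
top_k_by_dist = np.argsort(dist)[:k]              # smallest distance first

print(np.array_equal(top_k_by_dot, top_k_by_dist))  # True (ties aside)
```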