Vector Search

  • Encode documents.
  • Encode query.
  • Find nearest neighbors.

Index

"a book about information retrieval" → LLM → [1.3, 2.7, 1.1]

| Document | Vector |
| --- | --- |
| a book about information retrieval | [1.3, 2.7, 1.1] |
| a book about the search for information | [2.4, 0.3, 3.5] |
| a book about retrieving information | [0.1, 2.0, 1.1] |

Search

"finding stuff" → LLM → [1.1, 0.4, 2.3]

| Document | Vector | Cosine Similarity |
| --- | --- | --- |
| a book about information retrieval | [1.3, 2.7, 1.1] | 0.6 |
| a book about the search for information | [2.4, 0.3, 3.5] | 0.56 |
| a book about retrieving information | [0.1, 2.0, 1.1] | 0.58 |
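The scoring step can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: `cosine_similarity` and the `documents` dictionary are names chosen for this example, and the vectors are the toy values from the table. The similarity values on the slide are illustrative, so the computed numbers need not match them exactly.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy document vectors from the index table above.
documents = {
    "a book about information retrieval": [1.3, 2.7, 1.1],
    "a book about the search for information": [2.4, 0.3, 3.5],
    "a book about retrieving information": [0.1, 2.0, 1.1],
}
query_vector = [1.1, 0.4, 2.3]  # encoded query "finding stuff"

# Score every document against the query and sort best-first.
ranked = sorted(
    ((cosine_similarity(vec, query_vector), text) for text, vec in documents.items()),
    reverse=True,
)
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```

Cosine similarity only looks at the angle between vectors, so scaling a vector by a positive factor does not change its score.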

Performance

  • Brute-force cosine similarity compares the query with every document vector: $O(n)$ per query.
  • This naive linear scan does not scale.
  • Vector search must scale to millions of documents.

Notes:

  • What is the complexity for brute force cosine similarity?
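To make the cost concrete, here is a hedged sketch of a brute-force top-k search. The function name `brute_force_top_k`, the random test data, and the choice of `k=5` are assumptions for illustration. Each query touches all $n$ vectors, and each comparison costs time proportional to the vector dimension $d$, so the work per query grows as $O(n \cdot d)$.

```python
import heapq
import math
import random


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def brute_force_top_k(query, vectors, k):
    """Score every vector against the query: n similarity computations for n vectors."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    # nlargest keeps only the k best (score, index) pairs while scanning all n.
    return heapq.nlargest(k, scored)


random.seed(0)
dims = 3
# Hypothetical corpus of random vectors; components stay away from zero
# so that no vector has zero length.
vectors = [[random.uniform(0.1, 1.0) for _ in range(dims)] for _ in range(10_000)]
query = [random.uniform(0.1, 1.0) for _ in range(dims)]

top = brute_force_top_k(query, vectors, k=5)
for score, index in top:
    print(f"vector {index}: {score:.4f}")
```

Doubling the number of vectors doubles the number of similarity computations, which is what makes this approach impractical for very large collections.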

Approximate Nearest Neighbors

  • Vector Search is fuzzy anyway.
  • Finding approximately the best results is good enough.
  • And it's much faster!

Navigable Small World Graph

Figure: Navigable Small World Graph

Source: pinecone.io

$O(\log n)$ search for graphs with fewer than a few thousand nodes.
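The idea of navigating such a graph can be sketched as a greedy walk: start at an entry node and keep moving to the neighbor closest to the query until no neighbor is closer. The sketch below is a simplification under stated assumptions: `build_graph` and `greedy_search` are hypothetical helpers, the graph is built by linking each inserted point to its `m` nearest earlier points (found by brute force for brevity), and Euclidean distance stands in for cosine similarity.

```python
import math
import random


def distance(a, b):
    return math.dist(a, b)  # Euclidean distance (Python 3.8+)


def build_graph(points, m=3):
    """Insert points one by one, linking each to its m nearest earlier points."""
    neighbors = {i: set() for i in range(len(points))}
    for i in range(1, len(points)):
        nearest = sorted(range(i), key=lambda j: distance(points[i], points[j]))[:m]
        for j in nearest:
            neighbors[i].add(j)
            neighbors[j].add(i)  # links are bidirectional
    return neighbors


def greedy_search(points, neighbors, query, entry=0):
    """Walk to whichever neighbor is closest to the query until none is closer."""
    current = entry
    hops = 0
    while True:
        best = min(
            neighbors[current],
            key=lambda j: distance(points[j], query),
            default=current,
        )
        if distance(points[best], query) >= distance(points[current], query):
            return current, hops  # local minimum: an approximate nearest neighbor
        current = best
        hops += 1


random.seed(42)
points = [(random.random(), random.random()) for _ in range(500)]
neighbors = build_graph(points)
query = (0.5, 0.5)
found, hops = greedy_search(points, neighbors, query)
print(f"reached node {found} in {hops} hops")
```

The walk can stop at a local minimum that is not the true nearest neighbor, which is why the result is only approximate.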


Hierarchical Navigable Small World Graph (HNSW)

Figure: Hierarchical Navigable Small World Graph

Source: pinecone.io


Relevance

  • For keyword search, only documents that contain query terms are returned.
  • For vector search, every document vector is more or less similar to every query vector.

Notes:

  • Why is every document vector more or less similar?

"cats and dogs" → LLM

| Document | Vector | Cosine Similarity |
| --- | --- | --- |
| a book about information retrieval | [1.3, 2.7, 1.1] | 0.12 |
| a book about the search for information | [2.4, 0.3, 3.5] | 0.05 |
| a book about retrieving information | [0.1, 2.0, 1.1] | 0.37 |

Relevance

  • Where to cut off the results?
  • Just return the top 50 most similar documents?
  • What if there are just no meaningful results?
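One possible answer to these questions is to combine a top-k limit with a minimum similarity. The sketch below is only an illustration: `select_results` is a hypothetical helper, and the threshold of 0.5 is an arbitrary assumption, since a useful cutoff depends on the embedding model and the data. The scores are the ones from the two example tables.

```python
def select_results(scored_docs, top_k=50, min_similarity=0.5):
    """Keep at most top_k documents whose similarity reaches min_similarity."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(doc, score) for doc, score in ranked[:top_k] if score >= min_similarity]


# Scores from the "finding stuff" example.
finding_stuff = [
    ("a book about information retrieval", 0.6),
    ("a book about the search for information", 0.56),
    ("a book about retrieving information", 0.58),
]

# Scores from the "cats and dogs" example.
cats_and_dogs = [
    ("a book about information retrieval", 0.12),
    ("a book about the search for information", 0.05),
    ("a book about retrieving information", 0.37),
]

print(select_results(finding_stuff))  # all three documents pass the threshold
print(select_results(cats_and_dogs))  # nothing passes, so the result is empty
```

Without the threshold, the "cats and dogs" query would still return all three books, even though none of them is about cats or dogs.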

Hybrid Scoring

  • Combine vector search results and keyword search results into a single result set.

Reciprocal Rank Fusion

search algorithms

| Document | Keyword Rank | Vector Rank | Total |
| --- | --- | --- | --- |
| #1 a book about the search for information | 1 | 3 | $\frac{1}{1} + \frac{1}{3} \approx 1.33$ |
| #2 a book about information retrieval | 2 | 1 | $\frac{1}{2} + \frac{1}{1} = 1.5$ |
| #3 a book about retrieving information | - | 2 | $0 + \frac{1}{2} = 0.5$ |

Result: #2, #1, #3
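The calculation in the table can be sketched as follows. `reciprocal_rank_fusion` is a name chosen for this example. The table sums plain $\frac{1}{rank}$, which corresponds to `k=0` here; Reciprocal Rank Fusion is commonly stated as $\frac{1}{k + rank}$ with a smoothing constant such as $k = 60$, so `k` is exposed as a parameter.

```python
def reciprocal_rank_fusion(rankings, k=0):
    """Fuse several best-first rankings of document ids into one scored list.

    A document missing from a ranking contributes nothing for that ranking.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Rankings from the table: "#3" has no keyword match.
keyword_ranking = ["#1", "#2"]
vector_ranking = ["#2", "#3", "#1"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc, score in fused:
    print(f"{doc}: {score:.2f}")
```

Because only ranks are used, the keyword scores and the cosine similarities never have to be put on a common scale.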