- Encode documents.
- Encode query.
- Find nearest neighbors.
a book about information retrieval
↓
LLM
↓
[1.3, 2.7, 1.1]
↓
Document | Vector |
---|---|
a book about information retrieval | [1.3, 2.7, 1.1] |
a book about the search for information | [2.4, 0.3, 3.5] |
a book about retrieving information | [0.1, 2.0, 1.1] |
finding stuff
↓
LLM
↓
[1.1, 0.4, 2.3]
↓
Document | Vector | Cosine Similarity |
---|---|---|
a book about information retrieval | [1.3, 2.7, 1.1] | 0.6 |
a book about the search for information | [2.4, 0.3, 3.5] | 0.56 |
a book about retrieving information | [0.1, 2.0, 1.1] | 0.58 |
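
The ranking step above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names are made up for this example, and the vectors are the toy values from the table. The scores on the slide are illustrative, so the numbers computed from these toy vectors need not match them.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query_vector, documents):
    """Score every document against the query: one comparison per document."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors copied from the slide.
documents = [
    ("a book about information retrieval", [1.3, 2.7, 1.1]),
    ("a book about the search for information", [2.4, 0.3, 3.5]),
    ("a book about retrieving information", [0.1, 2.0, 1.1]),
]
query_vector = [1.1, 0.4, 2.3]  # "finding stuff"

results = brute_force_search(query_vector, documents)
for text, score in results:
    print(f"{score:.2f}  {text}")
```

Every query touches every document vector, which is why this approach stops being practical as the collection grows.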
- $O(n)$ time per query for brute-force cosine similarity over $n$ documents.
- Naive cosine similarity does not scale.
- Must scale for millions of documents.
Notes:
- What is the complexity of brute-force cosine similarity?
- Vector Search is fuzzy anyway.
- Finding approximately the best results is good enough.
- And it's much faster!
Source: pinecone.io
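
To make "approximately the best results" concrete, here is a minimal sketch of one simple approximate technique, random-hyperplane locality-sensitive hashing. It is an assumption chosen for brevity and not necessarily the method the slides illustrate; all function names, the number of hyperplanes, and the random data are hypothetical. Vectors that fall on the same side of every random hyperplane share a bucket, and only the query's bucket is searched.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Random hyperplanes through the origin, each given by its normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0.0 for plane in hyperplanes
    )

def build_index(vectors, hyperplanes):
    """Group document ids by signature."""
    buckets = {}
    for doc_id, vec in enumerate(vectors):
        buckets.setdefault(signature(vec, hyperplanes), []).append(doc_id)
    return buckets

def candidates(query_vector, buckets, hyperplanes):
    """Only documents in the query's bucket are candidates; the rest are skipped."""
    return buckets.get(signature(query_vector, hyperplanes), [])

# Hypothetical data: 1000 random 8-dimensional "document" vectors.
rng = random.Random(42)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]
hyperplanes = make_hyperplanes(num_planes=4, dim=8)
buckets = build_index(vectors, hyperplanes)

shortlist = candidates(vectors[0], buckets, hyperplanes)
print(f"{len(shortlist)} of {len(vectors)} documents need an exact comparison")
```

The shortlist can then be ranked with exact cosine similarity. A true nearest neighbor that lands in a different bucket is missed, which is the accuracy traded for speed.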
- For keyword search, only documents that contain query terms are returned.
- For vector search, every document vector is more or less similar to every query vector.
Notes:
- Why is every document vector more or less similar?
cats and dogs
↓
LLM
↓
Document | Vector | Cosine Similarity |
---|---|---|
a book about information retrieval | [1.3, 2.7, 1.1] | 0.12 |
a book about the search for information | [2.4, 0.3, 3.5] | 0.05 |
a book about retrieving information | [0.1, 2.0, 1.1] | 0.37 |
- Where to cut off the results?
- Just return the top 50 most similar documents?
- What if there are just no meaningful results?
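
One possible answer to the cutoff questions above is to combine a top-k limit with a minimum score. The sketch below assumes a hypothetical `cut_off` helper and an arbitrary threshold of 0.5; a sensible threshold depends on the embedding model and the data and would have to be tuned.

```python
def cut_off(scored_documents, top_k=50, min_score=0.5):
    """Keep at most top_k results, and only those scoring at least min_score.

    min_score=0.5 is an arbitrary value chosen for illustration.
    """
    ranked = sorted(scored_documents, key=lambda pair: pair[1], reverse=True)
    return [(doc, score) for doc, score in ranked[:top_k] if score >= min_score]

# Scores from the "cats and dogs" slide.
cats_and_dogs = [
    ("a book about information retrieval", 0.12),
    ("a book about the search for information", 0.05),
    ("a book about retrieving information", 0.37),
]

print(cut_off(cats_and_dogs))  # → []
```

With the threshold in place, a query with no meaningful matches returns an empty list instead of three irrelevant books.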
- Combine vector search results and keyword search results into one result set
search algorithms
↓
Document | Keyword Rank | Vector Rank | Total Result |
---|---|---|---|
#1 a book about the search for information | 1 | 3 | |
#2 a book about information retrieval | 2 | 1 | |
#3 a book about retrieving information | - | 2 | |
↓
#2, #1, #3
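
The slide does not name the method used to merge the two rankings. One common choice that reproduces the order shown is reciprocal rank fusion (RRF), sketched below as an assumption. The function name and the constant `k=60` (a frequently used default) are not taken from the slides.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Sum 1 / (k + rank) over every ranking in which a document appears.

    k=60 is an assumed, commonly used smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for doc_id, rank in ranking.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Ranks from the table above; #3 has no keyword match, so it is absent there.
keyword_ranks = {"#1": 1, "#2": 2}
vector_ranks = {"#2": 1, "#3": 2, "#1": 3}

fused = reciprocal_rank_fusion([keyword_ranks, vector_ranks])
print(fused)  # → ['#2', '#1', '#3']
```

RRF uses only rank positions, so keyword scores and cosine similarities never have to be put on a common scale.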