-
Notifications
You must be signed in to change notification settings - Fork 14
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
4 changed files
with
482 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,239 @@ | ||
--- | ||
title: "Semantic Search with Embeddings" | ||
slug: "docs/semantic-search-embed" | ||
|
||
hidden: false | ||
description: >- | ||
Examples on how to use the Embed endpoint to perform semantic search (API v1). | ||
image: "../../assets/images/fa074c3-cohere_docs_preview_image_1200x630_copy.jpg" | ||
keywords: "vector embeddings, embeddings, natural language processing" | ||
|
||
--- | ||
|
||
This section provides examples on how to use the Embed endpoint to perform semantic search. | ||
|
||
Semantic search solves the problem faced by the more traditional approach of lexical search, which is great at finding keyword matches, but struggles to capture the context or meaning of a piece of text. | ||
|
||
|
||
```python PYTHON | ||
import cohere | ||
import numpy as np | ||
co = cohere.Client(api_key="YOUR_API_KEY") # Get your free API key: https://dashboard.cohere.com/api-keys | ||
``` | ||
|
||
The Embed endpoint takes in texts as input and returns embeddings as output. | ||
|
||
For semantic search, there are two types of documents we need to turn into embeddings. | ||
|
||
- The list of documents to search from. | ||
- The query that will be used to search the documents. | ||
|
||
### Step 1: Embed the documents | ||
We call the Embed endpoint using `co.embed()` and pass the required arguments: | ||
- `texts`: The list of texts | ||
- `model`: Here we choose `embed-english-v3.0`, which generates embeddings of size 1024 | ||
- `input_type`: We choose `search_document` to ensure the model treats these as the documents for search | ||
- `embedding_types`: We choose `float` to get a float array as the output | ||
|
||
### Step 2: Embed the query | ||
Next, we add and embed a query. We choose `search_query` as the `input_type` to ensure the model treats this as the query (instead of documents) for search. | ||
|
||
### Step 3: Return the most similar documents | ||
Next, we calculate and sort similarity scores between a query and document embeddings, then display the top N most similar documents. Here, we are using the numpy library for calculating similarity using a dot product approach. | ||
|
||
|
||
|
||
```python PYTHON | ||
### STEP 1: Embed the documents | ||
|
||
# Define the documents | ||
documents = [ | ||
"Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.", | ||
"Finding Coffee Spots: For your caffeine fix, head to the break room's coffee machine or cross the street to the café for artisan coffee.", | ||
"Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!", | ||
"Working Hours Flexibility: We prioritize work-life balance. While our core hours are 9 AM to 5 PM, we offer flexibility to adjust as needed.", | ||
] | ||
|
||
# Embed the documents | ||
doc_emb = co.embed( | ||
texts=documents, | ||
model="embed-english-v3.0", | ||
input_type="search_document", | ||
embedding_types=["float"] | ||
).embeddings.float | ||
|
||
### STEP 2: Embed the query | ||
|
||
# Add the user query | ||
query = "How to connect with my teammates?" | ||
|
||
# Embed the query | ||
query_emb = co.embed( | ||
texts=[query], | ||
model="embed-english-v3.0", | ||
input_type="search_query", | ||
embedding_types=["float"] | ||
).embeddings.float | ||
|
||
### STEP 3: Return the most similar documents | ||
|
||
# Calculate similarity scores | ||
scores = np.dot(query_emb, np.transpose(doc_emb))[0] | ||
|
||
# Sort and filter documents based on scores | ||
top_n = 2 | ||
top_doc_idxs = np.argsort(-scores)[:top_n] | ||
|
||
# Display search results | ||
for idx, docs_idx in enumerate(top_doc_idxs): | ||
print(f"Rank: {idx+1}") | ||
print(f"Document: {documents[docs_idx]}\n") | ||
``` | ||
``` | ||
Rank: 1 | ||
Document: Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime! | ||
Rank: 2 | ||
Document: Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged. | ||
``` | ||
|
||
|
||
## Content quality measure with Embed v3 | ||
|
||
A standard text embeddings model is optimized for only topic similarity between a query and candidate documents. But in many real-world applications, you have redundant information with varying content quality. | ||
|
||
For instance, consider a user query of “COVID-19 Symptoms” and compare that to candidate document, “COVID-19 has many symptoms”. This document does not offer high-quality and rich information. However, with a typical embedding model, it will appear high on search results because it is highly similar to the query. | ||
|
||
|
||
The Embed v3 model is trained to capture both content quality and topic similarity. Through this approach, a search system can extract richer information from documents and is robust against noise. | ||
|
||
As an example below, give a query ("COVID-19 Symptoms"), the document with the highest quality ("COVID-19 symptoms can include: a high temperature or shivering...") is ranked first. | ||
|
||
Another document ("COVID-19 has many symptoms") is arguably more similar to the query based on what information it contains, yet it is ranked lower as it doesn’t contain that much information. | ||
|
||
This demonstrates how Embed v3 helps to surface high-quality documents for a given query. | ||
|
||
|
||
```python PYTHON | ||
### STEP 1: Embed the documents | ||
|
||
documents = [ | ||
"COVID-19 has many symptoms.", | ||
"COVID-19 symptoms are bad.", | ||
"COVID-19 symptoms are not nice", | ||
"COVID-19 symptoms are bad. 5G capabilities include more expansive service coverage, a higher number of available connections, and lower power consumption.", | ||
"COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others.", | ||
"COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more", | ||
"Dementia has the following symptom: Experiencing memory loss, poor judgment, and confusion.", | ||
"COVID-19 has the following symptom: Experiencing memory loss, poor judgment, and confusion.", | ||
] | ||
|
||
# Embed the documents | ||
doc_emb = co.embed( | ||
texts=documents, | ||
model="embed-english-v3.0", | ||
input_type="search_document", | ||
embedding_types=["float"] | ||
).embeddings.float | ||
|
||
### STEP 2: Embed the query | ||
|
||
# Add the user query | ||
query = "COVID-19 Symptoms" | ||
|
||
# Embed the query | ||
query_emb = co.embed( | ||
texts=[query], | ||
model="embed-english-v3.0", | ||
input_type="search_query", | ||
embedding_types=["float"] | ||
).embeddings.float | ||
|
||
### STEP 3: Return the most similar documents | ||
|
||
# Calculate similarity scores | ||
scores = np.dot(query_emb, np.transpose(doc_emb))[0] | ||
|
||
# Sort and filter documents based on scores | ||
top_n = 5 | ||
top_doc_idxs = np.argsort(-scores)[:top_n] | ||
|
||
# Display search results | ||
for idx, docs_idx in enumerate(top_doc_idxs): | ||
print(f"Rank: {idx+1}") | ||
print(f"Document: {documents[docs_idx]}\n") | ||
``` | ||
``` | ||
Rank: 1 | ||
Document: COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more | ||
Rank: 2 | ||
Document: COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others. | ||
Rank: 3 | ||
Document: COVID-19 has the following symptom: Experiencing memory loss, poor judgment, and confusion. | ||
Rank: 4 | ||
Document: COVID-19 has many symptoms. | ||
Rank: 5 | ||
Document: COVID-19 symptoms are not nice | ||
``` | ||
|
||
|
||
## Multilingual semantic search | ||
|
||
The Embed endpoint also supports multilingual semantic search via the `embed-multilingual-...` models. This means you can perform semantic search on texts in different languages. | ||
|
||
Specifically, you can do both multilingual and cross-lingual searches using one single model. | ||
|
||
|
||
```python PYTHON | ||
### STEP 1: Embed the documents | ||
|
||
documents = [ | ||
"Remboursement des frais de voyage : Gérez facilement vos frais de voyage en les soumettant via notre outil financier. Les approbations sont rapides et simples.", | ||
"Travailler de l'étranger : Il est possible de travailler à distance depuis un autre pays. Il suffit de coordonner avec votre responsable et de vous assurer d'être disponible pendant les heures de travail.", | ||
"Avantages pour la santé et le bien-être : Nous nous soucions de votre bien-être et proposons des adhésions à des salles de sport, des cours de yoga sur site et une assurance santé complète.", | ||
"Fréquence des évaluations de performance : Nous organisons des bilans informels tous les trimestres et des évaluations formelles deux fois par an.", | ||
] | ||
|
||
# Embed the documents | ||
doc_emb = co.embed( | ||
texts=documents, | ||
model="embed-english-v3.0", | ||
input_type="search_document", | ||
embedding_types=["float"] | ||
).embeddings.float | ||
|
||
### STEP 2: Embed the query | ||
|
||
# Add the user query | ||
query = "What's your remote-working policy?" | ||
|
||
# Embed the query | ||
query_emb = co.embed( | ||
texts=[query], | ||
model="embed-english-v3.0", | ||
input_type="search_query", | ||
embedding_types=["float"] | ||
).embeddings.float | ||
|
||
### STEP 3: Return the most similar documents | ||
|
||
# Calculate similarity scores | ||
scores = np.dot(query_emb, np.transpose(doc_emb))[0] | ||
|
||
# Sort and filter documents based on scores | ||
top_n = 1 | ||
top_doc_idxs = np.argsort(-scores)[:top_n] | ||
|
||
# Display search results | ||
for idx, docs_idx in enumerate(top_doc_idxs): | ||
print(f"Rank: {idx+1}") | ||
print(f"Document: {documents[docs_idx]}\n") | ||
``` | ||
``` | ||
Rank: 1 | ||
Document: Travailler de l'étranger : Il est possible de travailler à distance depuis un autre pays. Il suffit de coordonner avec votre responsable et de vous assurer d'être disponible pendant les heures de travail. | ||
``` |
Oops, something went wrong.