Skip to content

Commit

Permalink
add semantic search docs (#174)
Browse files Browse the repository at this point in the history
  • Loading branch information
mrmer1 authored Oct 21, 2024
1 parent e724750 commit 0ff5f65
Show file tree
Hide file tree
Showing 4 changed files with 482 additions and 0 deletions.
239 changes: 239 additions & 0 deletions fern/pages/text-embeddings/semantic-search-embed.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,239 @@
---
title: "Semantic Search with Embeddings"
slug: "docs/semantic-search-embed"

hidden: false
description: >-
Examples on how to use the Embed endpoint to perform semantic search (API v1).
image: "../../assets/images/fa074c3-cohere_docs_preview_image_1200x630_copy.jpg"
keywords: "vector embeddings, embeddings, natural language processing"

---

This section provides examples on how to use the Embed endpoint to perform semantic search.

Semantic search solves the problem faced by the more traditional approach of lexical search, which is great at finding keyword matches, but struggles to capture the context or meaning of a piece of text.


```python PYTHON
import cohere
import numpy as np
co = cohere.Client(api_key="YOUR_API_KEY") # Get your free API key: https://dashboard.cohere.com/api-keys
```

The Embed endpoint takes in texts as input and returns embeddings as output.

For semantic search, there are two types of documents we need to turn into embeddings.

- The list of documents to search from.
- The query that will be used to search the documents.

### Step 1: Embed the documents
We call the Embed endpoint using `co.embed()` and pass the required arguments:
- `texts`: The list of texts
- `model`: Here we choose `embed-english-v3.0`, which generates embeddings of size 1024
- `input_type`: We choose `search_document` to ensure the model treats these as the documents for search
- `embedding_types`: We choose `float` to get a float array as the output

### Step 2: Embed the query
Next, we add and embed a query. We choose `search_query` as the `input_type` to ensure the model treats this as the query (instead of documents) for search.

### Step 3: Return the most similar documents
Next, we calculate and sort similarity scores between a query and document embeddings, then display the top N most similar documents. Here, we are using the numpy library for calculating similarity using a dot product approach.



```python PYTHON
### STEP 1: Embed the documents

# Define the documents
documents = [
"Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.",
"Finding Coffee Spots: For your caffeine fix, head to the break room's coffee machine or cross the street to the café for artisan coffee.",
"Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!",
"Working Hours Flexibility: We prioritize work-life balance. While our core hours are 9 AM to 5 PM, we offer flexibility to adjust as needed.",
]

# Embed the documents
doc_emb = co.embed(
texts=documents,
model="embed-english-v3.0",
input_type="search_document",
embedding_types=["float"]
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "How to connect with my teammates?"

# Embed the query
query_emb = co.embed(
texts=[query],
model="embed-english-v3.0",
input_type="search_query",
embedding_types=["float"]
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 2
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
print(f"Rank: {idx+1}")
print(f"Document: {documents[docs_idx]}\n")
```
```
Rank: 1
Document: Team-Building Activities: We foster team spirit with monthly outings and weekly game nights. Feel free to suggest new activity ideas anytime!
Rank: 2
Document: Joining Slack Channels: You will receive an invite via email. Be sure to join relevant channels to stay informed and engaged.
```


## Content quality measure with Embed v3

A standard text embeddings model is optimized for only topic similarity between a query and candidate documents. But in many real-world applications, you have redundant information with varying content quality.

For instance, consider a user query of “COVID-19 Symptoms” and compare that to candidate document, “COVID-19 has many symptoms”. This document does not offer high-quality and rich information. However, with a typical embedding model, it will appear high on search results because it is highly similar to the query.


The Embed v3 model is trained to capture both content quality and topic similarity. Through this approach, a search system can extract richer information from documents and is robust against noise.

As an example below, give a query ("COVID-19 Symptoms"), the document with the highest quality ("COVID-19 symptoms can include: a high temperature or shivering...") is ranked first.

Another document ("COVID-19 has many symptoms") is arguably more similar to the query based on what information it contains, yet it is ranked lower as it doesn’t contain that much information.

This demonstrates how Embed v3 helps to surface high-quality documents for a given query.


```python PYTHON
### STEP 1: Embed the documents

documents = [
"COVID-19 has many symptoms.",
"COVID-19 symptoms are bad.",
"COVID-19 symptoms are not nice",
"COVID-19 symptoms are bad. 5G capabilities include more expansive service coverage, a higher number of available connections, and lower power consumption.",
"COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others.",
"COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more",
"Dementia has the following symptom: Experiencing memory loss, poor judgment, and confusion.",
"COVID-19 has the following symptom: Experiencing memory loss, poor judgment, and confusion.",
]

# Embed the documents
doc_emb = co.embed(
texts=documents,
model="embed-english-v3.0",
input_type="search_document",
embedding_types=["float"]
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "COVID-19 Symptoms"

# Embed the query
query_emb = co.embed(
texts=[query],
model="embed-english-v3.0",
input_type="search_query",
embedding_types=["float"]
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 5
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
print(f"Rank: {idx+1}")
print(f"Document: {documents[docs_idx]}\n")
```
```
Rank: 1
Document: COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more
Rank: 2
Document: COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others.
Rank: 3
Document: COVID-19 has the following symptom: Experiencing memory loss, poor judgment, and confusion.
Rank: 4
Document: COVID-19 has many symptoms.
Rank: 5
Document: COVID-19 symptoms are not nice
```


## Multilingual semantic search

The Embed endpoint also supports multilingual semantic search via the `embed-multilingual-...` models. This means you can perform semantic search on texts in different languages.

Specifically, you can do both multilingual and cross-lingual searches using one single model.


```python PYTHON
### STEP 1: Embed the documents

documents = [
"Remboursement des frais de voyage : Gérez facilement vos frais de voyage en les soumettant via notre outil financier. Les approbations sont rapides et simples.",
"Travailler de l'étranger : Il est possible de travailler à distance depuis un autre pays. Il suffit de coordonner avec votre responsable et de vous assurer d'être disponible pendant les heures de travail.",
"Avantages pour la santé et le bien-être : Nous nous soucions de votre bien-être et proposons des adhésions à des salles de sport, des cours de yoga sur site et une assurance santé complète.",
"Fréquence des évaluations de performance : Nous organisons des bilans informels tous les trimestres et des évaluations formelles deux fois par an.",
]

# Embed the documents
doc_emb = co.embed(
texts=documents,
model="embed-english-v3.0",
input_type="search_document",
embedding_types=["float"]
).embeddings.float

### STEP 2: Embed the query

# Add the user query
query = "What's your remote-working policy?"

# Embed the query
query_emb = co.embed(
texts=[query],
model="embed-english-v3.0",
input_type="search_query",
embedding_types=["float"]
).embeddings.float

### STEP 3: Return the most similar documents

# Calculate similarity scores
scores = np.dot(query_emb, np.transpose(doc_emb))[0]

# Sort and filter documents based on scores
top_n = 1
top_doc_idxs = np.argsort(-scores)[:top_n]

# Display search results
for idx, docs_idx in enumerate(top_doc_idxs):
print(f"Rank: {idx+1}")
print(f"Document: {documents[docs_idx]}\n")
```
```
Rank: 1
Document: Travailler de l'étranger : Il est possible de travailler à distance depuis un autre pays. Il suffit de coordonner avec votre responsable et de vous assurer d'être disponible pendant les heures de travail.
```
Loading

0 comments on commit 0ff5f65

Please sign in to comment.