feat(inference): add tutorial how to implement RAG with managed infer… #3744

Merged
merged 27 commits
Oct 4, 2024
Changes from 5 commits
Commits
27 commits
2ff288b
feat(inference): add tutorial how to implement RAG with managed infer…
Laure-di Sep 24, 2024
73f71c2
configure dev env part
Laure-di Sep 24, 2024
83490a3
Managed database setup
Laure-di Sep 24, 2024
ca9beb9
last part of the first version
Laure-di Sep 24, 2024
d9f4ef2
add more informations on the tuto
Laure-di Sep 26, 2024
d1c41bc
fix title
Laure-di Sep 27, 2024
969b299
fix: remove unused env var and add missing one in code
Laure-di Sep 27, 2024
08218ba
database management improve
Laure-di Sep 27, 2024
6f66eda
develop Document loader
Laure-di Sep 27, 2024
a04d4c3
format
Laure-di Sep 27, 2024
4a72e84
Update tutorials/how-to-implement-rag/index.mdx
Laure-di Oct 3, 2024
004699b
Update tutorials/how-to-implement-rag/index.mdx
Laure-di Oct 3, 2024
b298435
Update tutorials/how-to-implement-rag/index.mdx
Laure-di Oct 3, 2024
52f5265
Update tutorials/how-to-implement-rag/index.mdx
Laure-di Oct 3, 2024
93b07a4
Update tutorials/how-to-implement-rag/index.mdx
Laure-di Oct 3, 2024
bf83379
Update tutorials/how-to-implement-rag/index.mdx
Laure-di Oct 3, 2024
a3f42df
Update tutorials/how-to-implement-rag/index.mdx
Laure-di Oct 3, 2024
29aa648
Update tutorials/how-to-implement-rag/index.mdx
Laure-di Oct 3, 2024
ea655a8
Update tutorials/how-to-implement-rag/index.mdx
Laure-di Oct 3, 2024
f5b4d4d
switch metodo
Laure-di Oct 3, 2024
69bc345
pre-defined prompt
Laure-di Oct 3, 2024
403fd5e
add custom prompt
Laure-di Oct 4, 2024
69c9974
exemple of endpoint
Laure-di Oct 4, 2024
8419eb7
Apply suggestions from code review
Laure-di Oct 4, 2024
bb72b38
Apply suggestions from code review
Laure-di Oct 4, 2024
d35ff83
Apply suggestions from code review
bene2k1 Oct 4, 2024
a77d197
Apply suggestions from code review
bene2k1 Oct 4, 2024
219 changes: 219 additions & 0 deletions tutorials/how-to-implement-rag/index.mdx
@@ -0,0 +1,219 @@
---
meta:
  title: How to implement RAG with managed inference
  description: Learn how to implement Retrieval-Augmented Generation (RAG) using Scaleway's managed inference, PostgreSQL, pgvector, and object storage.
content:
  h1: How to implement RAG with managed inference
tags: inference, managed, postgresql, pgvector, object storage, RAG
categories:
  - inference
---

Retrieval-Augmented Generation (RAG) enhances the power of language models by enabling them to retrieve relevant information from external datasets. In this tutorial, we’ll implement RAG using Scaleway’s Managed Inference, PostgreSQL, pgvector, and Scaleway’s Object Storage.

With Scaleway's fully managed services, integrating RAG becomes a streamlined process. You'll use a sentence transformer for embedding text, store embeddings in a PostgreSQL database with pgvector, and leverage object storage for scalable data management.

<Macro id="requirements" />

- A Scaleway account logged into the [console](https://console.scaleway.com)
- [Owner](/identity-and-access-management/iam/concepts/#owner) status or [IAM permissions](/identity-and-access-management/iam/concepts/#permission) allowing you to perform actions in the intended Organization
- [Inference Deployment](/ai-data/managed-inference/how-to/create-deployment/): Set up an inference deployment using [sentence-transformers/sentence-t5-xxl](/ai-data/managed-inference/reference-content/sentence-t5-xxl/) on an L4 instance to efficiently process embeddings.
- [Inference Deployment](/ai-data/managed-inference/how-to/create-deployment/) with the model of your choice.
- [Object Storage Bucket](/storage/object/how-to/create-a-bucket/) to store all the data you want to inject into your LLM model.
- [Managed Database](/managed-databases/postgresql-and-mysql/how-to/create-a-database/) to securely store all your embeddings.

## Configure your development environment

1. Install the necessary packages by running the following command:

```sh
pip install langchain psycopg2 python-dotenv scaleway
```
2. Configure your environment variables: create a .env file and add the following variables. These will store your API keys, database connection details, and other configuration values.

```sh
# .env file

# Scaleway API credentials
SCW_ACCESS_KEY=your_scaleway_access_key
SCW_SECRET_KEY=your_scaleway_secret_key
SCW_API_KEY=your_scaleway_api_key

# Scaleway project and region
SCW_DEFAULT_PROJECT_ID=your_scaleway_project_id
SCW_DEFAULT_REGION=your_scaleway_region

# Scaleway managed database (PostgreSQL) credentials
SCW_DB_NAME=your_scaleway_managed_db_name
SCW_DB_USER=your_scaleway_managed_db_username
SCW_DB_PASSWORD=your_scaleway_managed_db_password
SCW_DB_HOST=your_scaleway_managed_db_host # The IP address of your database instance
SCW_DB_PORT=your_scaleway_managed_db_port # The port number for your database instance

# Scaleway S3 bucket configuration
SCW_BUCKET_NAME=your_scaleway_bucket_name
SCW_BUCKET_ENDPOINT=your_scaleway_bucket_endpoint # S3 endpoint, e.g., https://s3.fr-par.scw.cloud

# Scaleway Inference API configuration (Embeddings)
SCW_INFERENCE_EMBEDDINGS_ENDPOINT=your_scaleway_inference_embeddings_endpoint # Endpoint for sentence-transformers/sentence-t5-xxl deployment

# Scaleway Inference API configuration (LLM deployment)
SCW_INFERENCE_DEPLOYMENT_ENDPOINT=your_scaleway_inference_endpoint # Endpoint for your LLM deployment
```
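
The code samples in the following sections assume the environment variables are loaded and the relevant classes are imported. Below is a minimal setup sketch: the import paths assume recent langchain-community, langchain-openai, and langchain-postgres releases (plus boto3 and unstructured for the S3 loaders), which you may need to install in addition to the command above; adjust them to the LangChain version you use.

```python
# Setup sketch: load the .env file and import the classes used in the next sections.
# Import paths assume recent langchain-community / langchain-openai / langchain-postgres
# releases; adjust them to the LangChain version you installed.
import os

import psycopg2
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import S3DirectoryLoader, S3FileLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_postgres import PGVector
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()  # make the variables from .env available through os.getenv()
```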

### Set Up Managed Database

1. Connect to your PostgreSQL instance and install the pgvector extension, which is used for storing high-dimensional embeddings.

```python
# Connect to the Managed Database using the credentials from the .env file
conn = psycopg2.connect(
    database=os.getenv("SCW_DB_NAME"),
    user=os.getenv("SCW_DB_USER"),
    password=os.getenv("SCW_DB_PASSWORD"),
    host=os.getenv("SCW_DB_HOST"),
    port=os.getenv("SCW_DB_PORT")
)

cur = conn.cursor()

# Install the pgvector extension
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()
```
The command above ensures that pgvector is installed on your database if it hasn't been already.

2. To avoid reprocessing documents that have already been loaded and vectorized, create a table in your PostgreSQL database to track them. This ensures that new documents added to your object storage bucket are processed only once, preventing duplicate downloads and redundant vectorization.

```python
cur.execute("CREATE TABLE IF NOT EXISTS object_loaded (id SERIAL PRIMARY KEY, object_key TEXT)")
conn.commit()
```
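
Optionally, you can run two small additions against the database before moving on. This is a minimal sketch that is not part of the tutorial's required setup: it confirms the pgvector extension is installed and adds a unique index so the same object key can never be recorded twice.

```python
# Optional sanity check: the pgvector extension should now appear in pg_extension
cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
print(cur.fetchone())  # installed pgvector version, or None if the extension is missing

# Optional: a unique index on object_key guarantees each object is tracked only once
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS object_loaded_object_key_idx ON object_loaded (object_key)")
conn.commit()
```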

### Set Up Document Loaders for Object Storage

The document loader pulls documents from your Scaleway Object Storage bucket and retrieves the contents of each document for further processing.

```python
document_loader = S3DirectoryLoader(
    bucket=os.getenv("SCW_BUCKET_NAME"),
    endpoint_url=os.getenv("SCW_BUCKET_ENDPOINT"),
    aws_access_key_id=os.getenv("SCW_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("SCW_SECRET_KEY")
)
```

### Embeddings and Vector Store Setup

1. We will use the OpenAIEmbeddings class from LangChain and store the embeddings in PostgreSQL using the PGVector integration.

```python
embeddings = OpenAIEmbeddings(
    openai_api_key=os.getenv("SCW_API_KEY"),
    openai_api_base=os.getenv("SCW_INFERENCE_EMBEDDINGS_ENDPOINT"),
    model="sentence-transformers/sentence-t5-xxl",
    tiktoken_enabled=False,
)
```

Key parameters:
- `openai_api_key`: The API key used to access the embeddings service, in this case deployed via Scaleway’s Managed Inference.
- `openai_api_base`: The base URL that points to your deployment of the sentence-transformers/sentence-t5-xxl model on Scaleway's Managed Inference. This URL serves as the entry point for the API calls that generate embeddings.
- `model="sentence-transformers/sentence-t5-xxl"`: The specific model used for text embeddings. It is optimized for generating high-quality sentence embeddings, making it well suited for document retrieval in RAG systems.
- `tiktoken_enabled=False`: An important parameter that disables TikToken tokenization within the embeddings process.

What is tiktoken_enabled?

tiktoken is a tokenization library developed by OpenAI, which is optimized for working with GPT-based models (like GPT-3.5 or GPT-4). It transforms text into smaller token units that the model can process.

Why set tiktoken_enabled=False?

In the context of using Scaleway’s Managed Inference and the sentence-t5-xxl model, TikToken tokenization is not necessary because the model you are using (sentence-transformers) works with raw text and handles its own tokenization internally.
Moreover, leaving tiktoken_enabled as True causes issues when sending data to Scaleway’s API because it results in tokenized vectors being sent instead of raw text. Since Scaleway's endpoint expects text and not pre-tokenized data, this mismatch can lead to errors or incorrect behavior.
By setting tiktoken_enabled=False, you ensure that raw text is sent to Scaleway's Managed Inference endpoint, which is what the sentence-transformers model expects to process. This guarantees that the embedding generation process works smoothly with Scaleway's infrastructure.

2. Next, configure the connection string for your PostgreSQL instance and create a PGVector store to store these embeddings.

```python

connection_string = f"postgresql+psycopg2://{conn.info.user}:{conn.info.password}@{conn.info.host}:{conn.info.port}/{conn.info.dbname}"
vector_store = PGVector(connection=connection_string, embeddings=embeddings)
```

PGVector: This creates the vector store in your PostgreSQL database to store the embeddings.

### Load and Process Documents

Use the S3FileLoader to load documents and split them into chunks. Then, embed and store them in your PostgreSQL database.

1. Lazy load documents: This method is designed to efficiently load and process documents one by one from Scaleway Object Storage. Instead of loading all documents at once, it loads them lazily, allowing us to inspect each file before deciding whether to embed it.
```python
files = document_loader.lazy_load()
```
Why lazy loading?
The key reason for using lazy loading here is to avoid reprocessing documents that have already been embedded. In the context of Retrieval-Augmented Generation (RAG), reprocessing the same document multiple times is redundant and inefficient. Lazy loading enables us to check if a document has already been embedded (by querying the database) before actually loading and embedding it.

```python
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=20)

for file in files:
    # Skip objects that have already been embedded (tracked in the object_loaded table)
    cur.execute("SELECT object_key FROM object_loaded WHERE object_key = %s", (file.metadata["source"],))
    if cur.fetchone() is None:
        fileLoader = S3FileLoader(
            bucket=os.getenv("SCW_BUCKET_NAME"),
            key=file.metadata["source"].split("/")[-1],
            endpoint_url=os.getenv("SCW_BUCKET_ENDPOINT"),
            aws_access_key_id=os.getenv("SCW_ACCESS_KEY"),
            aws_secret_access_key=os.getenv("SCW_SECRET_KEY")
        )
        file_to_load = fileLoader.load()
        chunks = text_splitter.split_text(file_to_load[0].page_content)

        embeddings_list = [embeddings.embed_query(chunk) for chunk in chunks]
        for chunk, embedding in zip(chunks, embeddings_list):
            vector_store.add_embeddings(texts=[chunk], embeddings=[embedding])

        # Record the object key so the file is not reprocessed on the next run
        cur.execute("INSERT INTO object_loaded (object_key) VALUES (%s)", (file.metadata["source"],))
        conn.commit()
```

- S3FileLoader: Loads each file individually from the object storage bucket.
- RecursiveCharacterTextSplitter: Splits the document into smaller text chunks. This is important for embedding, as models typically work better with smaller chunks of text.
- embeddings_list: Stores the embeddings for each chunk.
- vector_store.add_embeddings(): Stores each chunk and its corresponding embedding in the PostgreSQL vector store.

The code iterates over each file retrieved from object storage using lazy loading.
For each file, a query is made to check if its corresponding object_key (a unique identifier from the file metadata) exists in the object_loaded table in PostgreSQL.
If the document has already been processed and embedded (i.e., the object_key is found in the database), the system skips loading the file and moves on to the next one.
If the document is new (not yet embedded), the file is fully loaded and processed.

This approach ensures that only new or modified documents are loaded into memory and embedded, saving significant computational resources and reducing redundant work.

Why store both chunk and embedding?

Storing both the chunk and its corresponding embedding allows for efficient document retrieval later.
When a query is made, the RAG system will retrieve the most relevant embeddings, and the corresponding text chunks will be used to generate the final response.
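
As a quick illustration (a minimal sketch, not part of the tutorial's pipeline), you can already query the store directly: `similarity_search` embeds the question with the same model and returns the closest stored chunks. The example question below is only a placeholder.

```python
# Sketch: retrieve the stored chunks most similar to a question
docs = vector_store.similarity_search("How do I create an Object Storage bucket?", k=3)
for doc in docs:
    print(doc.page_content[:100])
```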

### Query the RAG System

Now, set up the RAG system to handle queries using RetrievalQA and the LLM.

```python
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
llm = ChatOpenAI(
    base_url=os.getenv("SCW_INFERENCE_DEPLOYMENT_ENDPOINT"),
    api_key=os.getenv("SCW_API_KEY"),
    model="your_llm_model_name",  # the name of the model served by your LLM deployment
)

qa_stuff = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

query = "What are the commands to set up a database with the CLI of Scaleway?"
response = qa_stuff.invoke(query)

print(response['result'])
```
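
If you also want to see which chunks grounded the answer, `RetrievalQA` accepts a `return_source_documents` option. A minimal sketch:

```python
# Sketch: return the retrieved chunks alongside the generated answer
qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
response = qa_with_sources.invoke(query)
print(response["result"])
for doc in response["source_documents"]:
    print(doc.metadata.get("source"))
```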


### Conclusion

This step is essential for efficiently processing and storing large document datasets for RAG. By using lazy loading, the system handles large datasets without overwhelming memory, while chunking ensures that each document is processed in a way that maximizes the performance of the LLM. The embeddings are stored in PostgreSQL via pgvector, allowing for fast and scalable retrieval when responding to user queries.

By combining Scaleway’s Managed Object Storage, PostgreSQL with pgvector, and LangChain’s embedding tools, you can implement a powerful RAG system that scales with your data and offers robust information retrieval capabilities.