feat(examples): Add RAGAS evaluation to RAG chain (#50)
* feat(module): add genai example

* feat: add README.md

* edit README.md

* add known issues

* add fix

* add service agent role assignment

* update terraform.tfvars

* Update README with VPC-SC Instruction

* fix render

* Update module

* add regional bucket and fix deployed index access

* add Google Header to file

* update README.md with terraform docs

* lint fixes

* implement PR review changes

* update host_vpc projectid

* add RAGAS evaluation

* add readme steps

* add missing header

* update version 5.34

* update outputs.tf
caetano-colin authored Jun 19, 2024
1 parent e95edf3 commit 8f23e99
Showing 4 changed files with 298 additions and 4 deletions.
102 changes: 102 additions & 0 deletions examples/genai-rag-multimodal/README.md
@@ -59,6 +59,108 @@ When running the Notebook, you will reach a step that downloads an example PDF f
- projects/200612033880 # Google Cloud Example Project
```
## Deploying infrastructure using Machine Learning Infra Pipeline
### Required Permissions for pipeline Service Account
- Give `roles/compute.networkUser` to the Service Account that runs the Pipeline.

```bash
SERVICE_ACCOUNT=$(terraform -chdir="./gcp-projects/ml_business_unit/shared" output -json terraform_service_accounts | jq -r '."ml-machine-learning"')
gcloud projects add-iam-policy-binding <INSERT_HOST_VPC_NETWORK_PROJECT_HERE> --member="serviceAccount:$SERVICE_ACCOUNT" --role="roles/compute.networkUser"
```

- Add the following ingress rule to the Service Perimeter.

```yaml
ingressPolicies:
- ingressFrom:
    identities:
    - serviceAccount:<SERVICE_ACCOUNT>
    sources:
    - accessLevel: '*'
  ingressTo:
    operations:
    - serviceName: '*'
    resources:
    - '*'

### Deployment steps

**IMPORTANT:** The steps below assume you run them from the directory that contains `terraform-google-enterprise-genai/` and the other repos (`gcp-bootstrap`, `gcp-org`, `gcp-projects`...).

- Retrieve the ID of the project that hosts the Machine Learning Pipeline repository.

```bash
export INFRA_PIPELINE_PROJECT_ID=$(terraform -chdir="gcp-projects/ml_business_unit/shared/" output -raw cloudbuild_project_id)
echo ${INFRA_PIPELINE_PROJECT_ID}
```

- Clone the repository.

```bash
gcloud source repos clone ml-machine-learning --project=${INFRA_PIPELINE_PROJECT_ID}
```

- Navigate into the repo and check out the desired branch. Create the directories if they don't exist.

```bash
cd ml-machine-learning
git checkout -b development
mkdir -p ml_business_unit/development
mkdir -p modules
```

- Copy required files to the repository.

```bash
cp -R ../terraform-google-enterprise-genai/examples/genai-rag-multimodal ./modules
cp ../terraform-google-enterprise-genai/build/cloudbuild-tf-* .
cp ../terraform-google-enterprise-genai/build/tf-wrapper.sh .
chmod 755 ./tf-wrapper.sh
cat ../terraform-google-enterprise-genai/examples/genai-rag-multimodal/terraform.tfvars >> ml_business_unit/development/genai_example.auto.tfvars
cat ../terraform-google-enterprise-genai/examples/genai-rag-multimodal/variables.tf >> ml_business_unit/development/variables.tf
```

> NOTE: Make sure the variables in `terraform-google-enterprise-genai/examples/genai-rag-multimodal/variables.tf` do not collide with variable names that already exist in your repository, and that your `terraform.tfvars` is updated with values from your environment.

- Validate that variables under `ml_business_unit/development/genai_example.auto.tfvars` are correct.

```bash
cat ml_business_unit/development/genai_example.auto.tfvars
```
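Because the `cat ... >>` commands above append to files that may already exist, two things are worth checking before committing: duplicate `variable` declarations in `variables.tf`, and values in `genai_example.auto.tfvars` that are still empty or placeholders. The following is a minimal Python sketch of both checks, not part of the repository. It uses regular expressions rather than a real HCL parser, only understands single-line string assignments, and assumes placeholders contain `<` (as in `<SERVICE_ACCOUNT>`). The helper names and sample contents are hypothetical; the list of inputs is taken from the module call in this README.

```python
import re
from collections import Counter

# Opening line of a Terraform variable block: variable "name" {
VARIABLE_DECL = re.compile(r'^\s*variable\s+"([^"]+)"\s*\{', re.MULTILINE)
# Single-line string assignment: name = "value"
STRING_ASSIGNMENT = re.compile(r'^([A-Za-z_][A-Za-z0-9_-]*)\s*=\s*"(.*)"$')

# Inputs passed to the module call shown in this README.
MODULE_INPUTS = [
    "kms_key",
    "network",
    "subnet",
    "machine_learning_project",
    "vector_search_vpc_project",
]


def duplicate_variables(variables_tf: str) -> list[str]:
    """Return variable names declared more than once (regex scan, not an HCL parse)."""
    counts = Counter(VARIABLE_DECL.findall(variables_tf))
    return sorted(name for name, count in counts.items() if count > 1)


def tfvars_problems(tfvars: str) -> list[str]:
    """Report module inputs that are missing, empty, or look like placeholders."""
    values = {}
    for line in tfvars.splitlines():
        match = STRING_ASSIGNMENT.match(line.strip())
        if match:
            values[match.group(1)] = match.group(2)
    problems = []
    for name in MODULE_INPUTS:
        if name not in values:
            problems.append(f"{name}: missing")
        elif not values[name] or "<" in values[name]:
            # Assumption: placeholders use angle brackets.
            problems.append(f"{name}: empty or placeholder")
    return problems


# Hypothetical file contents, for illustration only.
merged_variables_tf = '''
variable "network" {
  type = string
}
variable "kms_key" {
  type = string
}
variable "network" {
  type = string
}
'''

sample_tfvars = '''
# values for the development environment
kms_key = "projects/p/locations/l/keyRings/r/cryptoKeys/k"
network = "<INSERT_NETWORK_SELF_LINK>"
subnet = ""
machine_learning_project = "my-ml-project"
'''

print(duplicate_variables(merged_variables_tf))
print(tfvars_problems(sample_tfvars))
```

To run the checks against real files, read them with `pathlib.Path(...).read_text()` and pass the text to the two functions. `terraform validate` remains the authoritative check for duplicate declarations.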

- Create a file named `genai_example.tf` under the `ml_business_unit/development` path that calls the module.

```terraform
module "genai_example" {
  source                    = "../../modules/genai-rag-multimodal"
  kms_key                   = var.kms_key
  network                   = var.network
  subnet                    = var.subnet
  machine_learning_project  = var.machine_learning_project
  vector_search_vpc_project = var.vector_search_vpc_project
}
```

- Commit and push.

```bash
git add .
git commit -m "Add GenAI example"
git push origin development
```

## Deploying infrastructure using terraform locally

Run `terraform init && terraform apply -auto-approve`.

## Usage

Once all the requirements are set up, you can start by running and adjusting the notebook step-by-step.
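
The notebook's RAGAS evaluation cells collect, for each question, the retrieved contexts, the generated answer and a reference answer into a column-oriented dict with the keys `contexts`, `question`, `answer` and `ground_truth`. Retrieved images are summarized to text first, so every context is a string. The sketch below only illustrates that shape and a basic consistency check. The helper names and the sample strings are hypothetical stand-ins for real retriever and chain output, and nothing here calls RAGAS itself.

```python
def build_samples(rows):
    """Build the column-oriented dict from (question, contexts, answer, ground_truth) rows."""
    samples = {"contexts": [], "question": [], "answer": [], "ground_truth": []}
    for question, contexts, answer, ground_truth in rows:
        # Every context must already be text (image contexts are summarized beforehand).
        if not all(isinstance(context, str) for context in contexts):
            raise TypeError(f"non-text context for question: {question!r}")
        samples["contexts"].append(list(contexts))
        samples["question"].append(question)
        samples["answer"].append(answer)
        samples["ground_truth"].append(ground_truth)
    return samples


def sample_count(samples):
    """Return the number of samples, raising if the columns have different lengths."""
    lengths = {name: len(column) for name, column in samples.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()))


# Hypothetical rows, for illustration only.
rows = [
    ("What drove revenue?", ["text chunk A", "summary of chart B"], "Advertising.", "Advertising revenue."),
    ("How did costs change?", ["text chunk C"], "They increased.", "Costs increased."),
]

data_samples = build_samples(rows)
print(sample_count(data_samples))
```

Checking the column lengths before building the dataset makes a mismatch between questions and reference answers fail early, instead of during evaluation.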
173 changes: 171 additions & 2 deletions examples/genai-rag-multimodal/multimodal_rag_langchain.ipynb
@@ -599,7 +599,7 @@
"\n",
"\n",
"# Image summaries\n",
-"img_base64_list, image_summaries = generate_img_summaries(\".\")"
+"img_base64_list, image_summaries = generate_img_summaries(\"./intro_multimodal_rag_old_version\")"
]
},
{
@@ -824,8 +824,17 @@
" for i, s in enumerate(text_summaries + table_summaries + image_summaries)\n",
"]\n",
"\n",
"retriever_multi_vector_img.docstore.mset(list(zip(doc_ids, doc_contents)))\n",
"list_of_docs = list(zip(doc_ids, doc_contents))\n",
"\n",
"retriever_multi_vector_img.docstore.mset(list_of_docs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If using Vertex AI Vector Search, this will take a while to complete.\n",
"# You can cancel this cell and continue later.\n",
"retriever_multi_vector_img.vectorstore.add_documents(summary_docs)"
@@ -1000,6 +1009,166 @@
"Markdown(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### RAGAS Evaluation\n",
"\n",
"In the cells below, we use RAGAS to evaluate the RAG pipeline for text-based context."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install ragas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"questions = [\n",
" \"How did COVID-19 initially impact Google's advertising revenue in 2020?\",\n",
" \"How did Google's advertising revenue recover from the initial COVID-19 impact?\",\n",
" \"What was the primary driver of Google's operating cash flow in 2020?\",\n",
" \"How did Google's share repurchases compare to the previous year in 2020?\"\n",
"]\n",
"\n",
"golden_answers = [\n",
" \"COVID-19 initially impacted Google's advertising revenue in 2020 in two ways, Users searched for less commercially-driven topics, reducing the relevance and value of ads displayed and Businesses cut back on advertising budgets due to the economic downturn caused by the pandemic.\",\n",
" \"Google's advertising revenue recovered from the initial COVID-19 impact through a combination of factors, User search activity shifted back to more commercially-driven topics, increasing the effectiveness of advertising and As the economic climate improved, businesses began to invest more heavily in advertising again.\",\n",
" \"The primary driver of Google's operating cash flow in 2020 was revenue generated from its advertising products, totaling $91.7 billion\",\n",
" \"Google's share repurchases in 2020 were $50.3 billion, reflecting a significant increase of 62% compared to the prior year.\"\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def summarize_image_context(doc_base64):\n",
" prompt = \"\"\"You are an assistant tasked with summarizing images for retrieval. \\\n",
" These summaries will be embedded and used to retrieve the raw image. \\\n",
" Give a concise summary of the image that is well optimized for retrieval.\n",
" If it's a table, extract all elements of the table.\n",
" If it's a graph, explain the findings in the graph.\n",
" Do not include any numbers that are not mentioned in the image.\n",
" \"\"\"\n",
" return image_summarize(doc_base64, prompt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_samples = {\n",
" \"contexts\": [],\n",
" \"question\": [],\n",
" \"answer\": [],\n",
" \"ground_truth\": []\n",
" }\n",
"\n",
"for i, question in enumerate(questions): \n",
" docs = retriever_multi_vector_img.get_relevant_documents(question, limit=10) \n",
" image_contexts = []\n",
" \n",
" source_docs = split_image_text_types(docs)\n",
" \n",
" if len(source_docs[\"images\"]) > 0: \n",
" for image in source_docs[\"images\"]:\n",
" image_contexts.append(summarize_image_context(image))\n",
" \n",
" text_context = source_docs[\"texts\"]\n",
" \n",
" data_samples[\"contexts\"].append(text_context + image_contexts)\n",
" data_samples[\"question\"].append(question)\n",
" data_samples[\"answer\"].append(chain_multimodal_rag.invoke(question))\n",
" data_samples[\"ground_truth\"].append(golden_answers[i])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datasets import Dataset\n",
"\n",
"dataset = Dataset.from_dict(data_samples)\n",
"\n",
"\n",
"from ragas.metrics import (\n",
" context_precision,\n",
" answer_relevancy,\n",
" faithfulness,\n",
" context_recall,\n",
" answer_similarity,\n",
" answer_correctness,\n",
")\n",
"from ragas.metrics.critique import harmfulness\n",
"\n",
"# list of metrics we're going to use\n",
"metrics = [\n",
" faithfulness,\n",
" answer_relevancy,\n",
" context_recall,\n",
" context_precision,\n",
" harmfulness,\n",
" answer_similarity,\n",
" answer_correctness,\n",
"]\n",
"\n",
"from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings\n",
"\n",
"config = { \n",
" \"chat_model_id\": \"gemini-1.0-pro-002\",\n",
" \"embedding_model_id\": \"textembedding-gecko\",\n",
"}\n",
"\n",
"\n",
"vertextai_llm = ChatVertexAI(model_name=config[\"chat_model_id\"],)\n",
"vertextai_embeddings = VertexAIEmbeddings(model_name=config[\"embedding_model_id\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset.to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"from ragas import evaluate\n",
"\n",
"result = evaluate(\n",
" dataset, # using 1 as example due to quota constraints\n",
" metrics=metrics,\n",
" llm=vertextai_llm,\n",
" embeddings=vertextai_embeddings,\n",
")\n",
"\n",
"result.to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {
4 changes: 2 additions & 2 deletions examples/genai-rag-multimodal/outputs.tf
@@ -26,12 +26,12 @@ output "host_vpc_project_id" {

output "host_vpc_network" {
  description = "This is the Self-link of the Host VPC network"
-  value       = var.network
+  value       = google_workbench_instance.instance.gce_setup[0].network_interfaces[0].network
}

output "notebook_project_id" {
  description = "The Project ID where the notebook will be run on"
-  value       = var.machine_learning_project
+  value       = google_workbench_instance.instance.project
}

output "vector_search_bucket_name" {
23 changes: 23 additions & 0 deletions examples/genai-rag-multimodal/versions.tf
@@ -0,0 +1,23 @@
/**
* Copyright 2024 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

terraform {
  required_providers {
    google = {
      version = "~> 5.34.0"
    }
  }
}
