Merge pull request #156 from cohere-ai/rerank-json

Rerank 3 JSON example
cohere-ai · May 4, 2024 · 4199b7b · 4199b7b
2 parents 4e742f0 + 6cedf61
commit 4199b7b
Showing 1 changed file with 287 additions and 0 deletions.
diff --git a/notebooks/llmu/Rerank_Endpoint.ipynb b/notebooks/llmu/Rerank_Endpoint.ipynb
@@ -0,0 +1,287 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "56b0e523",
+   "metadata": {},
+   "source": [
+    "# Reranking Unstructured and Semi-Structured Data\n",
+    "\n",
+    "Enterprise data is often complex, and current systems have difficulty searching through multi-aspect and semi-structured data sources. The most useful data at companies is not often in simple document format, and semi-structured data formats such as JSON are common across enterprise applications.\n",
+    "\n",
+    "The Rerank API Endpoint, powered by the [Rerank 3](https://docs.cohere.com/docs/rerank-2) model, can search over semi-structured data that is represented as JSON. You can simply take your JSON documents, e.g. from an Elasticsearch or MongoDB, and pass it to the Rerank 3 model. By setting the rank fields, you can select which fields should be considered by the model for the reranking.\n",
+    "\n",
+    "In this notebook, we'll see how to use Rerank 3 to rank complex, multi-aspect data (e.g. emails) based on all of their relevant metadata fields. The diagram below provides an overview of what we'll build.\n",
+    "\n",
+    "![rerank-email-example](https://cohere.com/_next/image?url=https%3A%2F%2Flh7-us.googleusercontent.com%2Fwvf252whPErQDMQi8LiS_5werbHEIKWWfyDZYinKcrQnNe2CX6rRmm8ahfyPT101I0YggS-h0nkaIENBysIHJfy8ztNSJIl1Q4LQaJWZeDMTO0bLNk9iREvcWBI3wd-q1Q_qdDCZQPL5L53vqAKi6P4&w=1920&q=75)\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "We'll do the following steps: \n",
+    "- **Step 1: Define a JSON Dataset** - Define the email dataset.\n",
+    "- **Step 2: Define Fields for the Reranking** - Select which fields should be considered by the model.\n",
+    "- **Step 3: Call the Rerank Endpoint** - Pull results deemed most relevant to two sample queries.\n",
+    "\n",
+    "## Setup\n",
+    "\n",
+    "We start by installing the Cohere SDK."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "6e543d03",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip install cohere -q"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4b7ca061",
+   "metadata": {},
+   "source": [
+    "Fill in your Cohere API key in the next cell. To do this, begin by [signing up to Cohere](https://os.cohere.ai/) (for free!) if you haven't yet. Then get your API key [here](https://dashboard.cohere.com/api-keys)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "cb297cba",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import cohere\n",
+    "\n",
+    "# Paste your API key here. Remember to not share publicly\n",
+    "co = cohere.Client(\"COHERE_API_KEY\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7b4d5e16",
+   "metadata": {},
+   "source": [
+    "## Step 1: Define a JSON Dataset\n",
+    "\n",
+    "We begin by creating our dataset, which is a Python list of JSON objects.  Each JSON object is a different email with five fields. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "id": "6dbedb4b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "emails = [\n",
+    "    {\n",
+    "        \"from\": \"Paul Doe <[email protected]>\",\n",
+    "        \"to\": [\"Steve <[email protected]>\", \"[email protected]\"],\n",
+    "        \"date\": \"2024-03-27\",\n",
+    "        \"subject\": \"Follow-up\",\n",
+    "        \"text\": \"We are happy to give you the following pricing for your project.\"\n",
+    "    },\n",
+    "    {\n",
+    "        \"from\": \"John McGill <[email protected]>\",\n",
+    "        \"to\": [\"Steve <[email protected]>\"],\n",
+    "        \"date\": \"2024-03-28\",\n",
+    "        \"subject\": \"Missing Information\",\n",
+    "        \"text\": \"Sorry, but here is the pricing you asked for for the newest line of your models.\"\n",
+    "    },\n",
+    "    {\n",
+    "        \"from\": \"John McGill <[email protected]>\",\n",
+    "        \"to\": [\"Steve <[email protected]>\"],\n",
+    "        \"date\": \"2024-02-15\",\n",
+    "        \"subject\": \"Commited Pricing Strategy\",\n",
+    "        \"text\": \"I know we went back and forth on this during the call but the pricing for now should follow the agreement at hand.\"\n",
+    "    },\n",
+    "    {\n",
+    "        \"from\": \"Generic Airline Company<no_reply@generic_airline_email.com>\",\n",
+    "        \"to\": [\"Steve <[email protected]>\"],\n",
+    "        \"date\": \"2023-07-25\",\n",
+    "        \"subject\": \"Your latest flight travel plans\",\n",
+    "        \"text\": \"Thank you for choose to fly Generic Airline Company. Your booking status is confirmed.\"\n",
+    "    },\n",
+    "    {\n",
+    "        \"from\": \"Generic SaaS Company<marketing@generic_saas_email.com>\",\n",
+    "        \"to\": [\"Steve <[email protected]>\"],\n",
+    "        \"date\": \"2024-01-26\",\n",
+    "        \"subject\": \"How to build generative AI applications using Generic Company Name\",\n",
+    "        \"text\": \"Hey Steve! Generative AI is growing so quickly and we know you want to build fast!\"\n",
+    "    },\n",
+    "    {\n",
+    "        \"from\": \"Paul Doe <[email protected]>\",\n",
+    "        \"to\": [\"Steve <[email protected]>\", \"[email protected]\"],\n",
+    "        \"date\": \"2024-04-09\",\n",
+    "        \"subject\": \"Price Adjustment\",\n",
+    "        \"text\": \"Re: our previous correspondence on 3/27 we'd like to make an amendment on our pricing proposal. We'll have to decrease the expected base price by 5%.\"\n",
+    "    },\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "db3453a3",
+   "metadata": {},
+   "source": [
+    "## Step 2: Define Fields for the Reranking\n",
+    "\n",
+    "Next, we define which fields we want to include for the reranking. In this case, we will use all fields."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "60ed6532",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rank_fields = [\"from\", \"to\", \"date\", \"subject\", \"text\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "be3594bc",
+   "metadata": {},
+   "source": [
+    "## Step 3: Call the Rerank Endpoint\n",
+    "\n",
+    "Now we are ready to call the Rerank endpoint, which will help us to find emails that are relevant to a particular query.\n",
+    "\n",
+    "We'll begin with a query about pricing from Microsoft (MS). "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "id": "3a92ffc6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = \"What is the pricing that we received from MS?\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "28664c16",
+   "metadata": {},
+   "source": [
+    "To pull relevant emails, the model needs to combine information from the `\"to\"` field (e.g., `\"[email protected]\"`) and the `\"text\"` field (e.g., `\"Sorry, but here is the pricing you asked for ...\"`). \n",
+    "\n",
+    "We call the Rerank endpoint with `co.rerank()` and use five parameters: \n",
+    "- `query` - The query that we will use to find relevant documents\n",
+    "- `documents` -  The set of documents (or, in this case, emails) to rerank\n",
+    "- `top_n` - The number of most relevant documents to return\n",
+    "- `model` - For this data, we need to use 'rerank-english-v3.0', Cohere's latest English reranking model \n",
+    "- `rank_fields` - The keys we would like to have considered for reranking\n",
+    "\n",
+    "The next code cell prints the two emails deemed most relevant by the Rerank endpoint, in decreasing order of relevance. It correctly pulled the relevant emails, and in the correct order."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "id": "4cc5b622",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'from': 'John McGill <[email protected]>', 'to': ['Steve <[email protected]>'], 'date': '2024-03-28', 'subject': 'Missing Information', 'text': 'Sorry, but here is the pricing you asked for for the newest line of your models.'}\n",
+      "{'from': 'John McGill <[email protected]>', 'to': ['Steve <[email protected]>'], 'date': '2024-02-15', 'subject': 'Commited Pricing Strategy', 'text': 'I know we went back and forth on this during the call but the pricing for now should follow the agreement at hand.'}\n"
+     ]
+    }
+   ],
+   "source": [
+    "results = co.rerank(query=query, \n",
+    "                    documents=emails, \n",
+    "                    top_n=2,\n",
+    "                    model='rerank-english-v3.0', \n",
+    "                    rank_fields=rank_fields)\n",
+    "\n",
+    "for hit in results.results:\n",
+    "    email = emails[hit.index]\n",
+    "    print(email)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dcef93a0",
+   "metadata": {},
+   "source": [
+    "Now we'll work with another query, this time asking for the pricing from Oracle."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "id": "ee7dcfcc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = \"Which pricing did we get from Oracle\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2d5326fa",
+   "metadata": {},
+   "source": [
+    "We call the Rerank endpoint a second time and print the results. Again, the model returns a correct result."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "id": "650fc281",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Query: Which pricing did we get from Oracle\n",
+      "{'from': 'Paul Doe <[email protected]>', 'to': ['Steve <[email protected]>', '[email protected]'], 'date': '2024-03-27', 'subject': 'Follow-up', 'text': 'We are happy to give you the following pricing for your project.'}\n",
+      "{'from': 'Paul Doe <[email protected]>', 'to': ['Steve <[email protected]>', '[email protected]'], 'date': '2024-04-09', 'subject': 'Price Adjustment', 'text': \"Re: our previous correspondence on 3/27 we'd like to make an amendment on our pricing proposal. We'll have to decrease the expected base price by 5%.\"}\n"
+     ]
+    }
+   ],
+   "source": [
+    "results = co.rerank(query=query, \n",
+    "                    documents=emails, \n",
+    "                    top_n=2, \n",
+    "                    model='rerank-english-v3.0', \n",
+    "                    rank_fields=rank_fields)\n",
+    "\n",
+    "print(\"Query:\", query)\n",
+    "for hit in results.results:\n",
+    "    email = emails[hit.index]\n",
+    "    print(email)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}