Merge pull request #186 from cohere-ai/jj-pdf

Notebook for agents extracting PDF text.
cohere-ai · Jun 6, 2024 · 5c254af · 5c254af
2 parents 82dae9a + 40ea9a9
commit 5c254af
Show file tree

Hide file tree

Showing 3 changed files with 366 additions and 0 deletions.
diff --git a/notebooks/agents/pdf-extractor/output.csv b/notebooks/agents/pdf-extractor/output.csv
@@ -0,0 +1,2 @@
+total_amount,invoice_number
+$115.00,0852
diff --git a/notebooks/agents/pdf-extractor/pdf_extractor.ipynb b/notebooks/agents/pdf-extractor/pdf_extractor.ipynb
@@ -0,0 +1,364 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "# PDF Extractor with Agents\n",
+    "\n",
+    "## Objective\n",
+    "\n",
+    "Generally, users are limited to text inputs when using large language models (LLMs), but agents enable the model to do more than ingest or output text information. Using tools, LLMs can call other APIs, save data, and much more. In this notebook, we will explore how we can leverage agents to extract information from PDFs. We will mimic an application where the user uploads PDF files and the agent extracts useful information. This can be useful when the text information has varying formats and you need to extract various types of information.\n",
+    "\n",
+    "In the directory, we have a simple_invoice.pdf file. Everytime a user uploads the document, the agent will extract key information total_amount and invoice_number and then save it as CSV which then can be used in another application. We only extract two pieces of information in this demo, but users can extend the example and extract a lot more information.\n",
+    "\n",
+    "## Steps \n",
+    "\n",
+    "1. extract_pdf() function extracts text data from the PDF using [unstructured](https://unstructured.io/) package. You can use other packages like PyPDF2 as well.\n",
+    "2. This extracted text is added to the prompt so the model can \"see\" the document.\n",
+    "3. The agent summarizes the document and passes that information to convert_to_json() function. This function makes another call to command model to convert the summary to json output. This separation of tasks is useful when the text document is complicated and long. Therefore, we first distill the information and ask another model to convert the text into json object. This is useful so each model or agent focuses on its own task without suffering from long context.\n",
+    "4. Then the json object goes through a check to make sure all keys are present and gets saved as a csv file. When the document is too long or the task is too complex, the model may fail to extract all information. These checks are then very useful because they give feedback to the model so it can adjust it's parameters to retry.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "import cohere\n",
+    "import pandas as pd\n",
+    "import json\n",
+    "from unstructured.partition.pdf import partition_pdf"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# uncomment to install dependencies\n",
+    "# !pip install cohere unstructured"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "cohere version: 5.5.1\n"
+     ]
+    }
+   ],
+   "source": [
+    "# versions\n",
+    "print('cohere version:', cohere.__version__)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "COHERE_API_KEY = os.environ.get(\"CO_API_KEY\")\n",
+    "COHERE_MODEL = 'command-r-plus'\n",
+    "co = cohere.Client(api_key=COHERE_API_KEY)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data \n",
+    "\n",
+    "The sample invoice data is from https://unidoc.io/media/simple-invoices/simple_invoice.pdf. \n",
+    "\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Tool \n",
+    "\n",
+    "Here we define the tool which converts summary of the pdf into json object. Then, it checks to make sure all necessary keys are present and saves it as csv. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def convert_to_json(text: str) -> dict:\n",
+    "    \"\"\"\n",
+    "    Given text files, convert to json object and saves to csv.\n",
+    "\n",
+    "    Args:\n",
+    "        text (str): The text to extract information from.\n",
+    "\n",
+    "    Returns:\n",
+    "        dict: A dictionary containing the result of the conversion process.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    MANDATORY_FIELDS = [\n",
+    "        \"total_amount\",\n",
+    "        \"invoice_number\",\n",
+    "    ]\n",
+    "\n",
+    "    message = \"\"\"# Instruction\n",
+    "    Given the text, convert to json object with the following keys:\n",
+    "    total_amount, invoice_number\n",
+    "\n",
+    "    # Output format json:\n",
+    "    {{\n",
+    "        \"total_amount\": \"<extracted invoice total amount>\",\n",
+    "        \"invoice_number\": \"<extracted invoice number>\",\n",
+    "    }}\n",
+    "\n",
+    "    Do not output code blocks.\n",
+    "\n",
+    "    # Extracted PDF\n",
+    "    {text}\n",
+    "    \"\"\"\n",
+    "\n",
+    "    result = co.chat(\n",
+    "        message=message.format(text=text), model=COHERE_MODEL, preamble=None\n",
+    "    ).text\n",
+    "\n",
+    "    try:\n",
+    "        result = json.loads(result)\n",
+    "        # check if all keys are present\n",
+    "        if not all(i in result.keys() for i in MANDATORY_FIELDS):\n",
+    "            return {\"result\": f\"ERROR: Keys are missing. Please check your result {result}\"}\n",
+    "\n",
+    "        df = pd.DataFrame(result, index=[0])\n",
+    "        df.to_csv(\"output.csv\", index=False)\n",
+    "        return {\"result\": \"SUCCESS. All steps have been completed.\"}\n",
+    "\n",
+    "    except Exception as e:\n",
+    "        return {\"result\": f\"ERROR: Could not load the result as json. Please check the result: {result} and ERROR: {e}\"}\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Cohere Agent \n",
+    "\n",
+    "Below is a cohere agent that leverages multi-step API. It is equipped with convert_to_json tool. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def cohere_agent(\n",
+    "    message: str,\n",
+    "    preamble: str,\n",
+    "    verbose: bool = False,\n",
+    ") -> str:\n",
+    "    \"\"\"\n",
+    "    Function to handle multi-step tool use api.\n",
+    "\n",
+    "    Args:\n",
+    "        message (str): The message to send to the Cohere AI model.\n",
+    "        preamble (str): The preamble or context for the conversation.\n",
+    "        verbose (bool, optional): Whether to print verbose output. Defaults to False.\n",
+    "\n",
+    "    Returns:\n",
+    "        str: The final response from the call.\n",
+    "    \"\"\"\n",
+    "\n",
+    "    functions_map = {\n",
+    "        \"convert_to_json\": convert_to_json,\n",
+    "    }\n",
+    "\n",
+    "    tools = [\n",
+    "        {\n",
+    "            \"name\": \"convert_to_json\",\n",
+    "            \"description\": \"Given a text, convert it to json object.\",\n",
+    "            \"parameter_definitions\": {\n",
+    "                \"text\": {\n",
+    "                    \"description\": \"text to be converted into json\",\n",
+    "                    \"type\": \"str\",\n",
+    "                    \"required\": True,\n",
+    "                },\n",
+    "            },\n",
+    "        }\n",
+    "    ]\n",
+    "\n",
+    "    counter = 1\n",
+    "\n",
+    "    response = co.chat(\n",
+    "        model=COHERE_MODEL,\n",
+    "        message=message,\n",
+    "        preamble=preamble,\n",
+    "        tools=tools,\n",
+    "    )\n",
+    "\n",
+    "    if verbose:\n",
+    "        print(f\"\\nrunning step 0\")\n",
+    "        print(response.text)\n",
+    "\n",
+    "    while response.tool_calls:\n",
+    "        tool_results = []\n",
+    "\n",
+    "        if verbose:\n",
+    "            print(f\"\\nrunning step {counter}\")\n",
+    "        for tool_call in response.tool_calls:\n",
+    "            print(\"tool_call.parameters:\", tool_call.parameters)\n",
+    "            if tool_call.parameters:\n",
+    "                output = functions_map[tool_call.name](**tool_call.parameters)\n",
+    "            else:\n",
+    "                output = functions_map[tool_call.name]()\n",
+    "\n",
+    "            outputs = [output]\n",
+    "            tool_results.append({\"call\": tool_call, \"outputs\": outputs})\n",
+    "\n",
+    "            if verbose:\n",
+    "                print(\n",
+    "                    f\"= running tool {tool_call.name}, with parameters: {tool_call.parameters}\"\n",
+    "                )\n",
+    "                print(f\"== tool results: {outputs}\")\n",
+    "\n",
+    "        response = co.chat(\n",
+    "            model=COHERE_MODEL,\n",
+    "            message=\"\",\n",
+    "            chat_history=response.chat_history,\n",
+    "            preamble=preamble,\n",
+    "            tools=tools,\n",
+    "            tool_results=tool_results,\n",
+    "        )\n",
+    "\n",
+    "        if verbose:\n",
+    "            print(response.text)\n",
+    "            counter += 1\n",
+    "\n",
+    "    return response.text"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### main "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "running step 0\n",
+      "I will summarise the text and then use the convert_to_json tool to format the summary.\n",
+      "\n",
+      "running step 1\n",
+      "tool_call.parameters: {'text': 'Total amount billed: $115.00\\nInvoice number: 0852'}\n",
+      "= running tool convert_to_json, with parameters: {'text': 'Total amount billed: $115.00\\nInvoice number: 0852'}\n",
+      "== tool results: [{'result': 'SUCCESS. All steps have been completed.'}]\n",
+      "SUCCESS.\n",
+      "Finished extracting: simple_invoice.pdf\n",
+      "Please check the output below\n",
+      "  total_amount  invoice_number\n",
+      "0      $115.00             852\n"
+     ]
+    }
+   ],
+   "source": [
+    "def extract_pdf(path):\n",
+    "    \"\"\"\n",
+    "    Function to extract text from a PDF file.\n",
+    "    \"\"\"\n",
+    "    elements = partition_pdf(path)\n",
+    "    return \"\\n\".join([str(el) for el in elements])\n",
+    "\n",
+    "\n",
+    "def pdf_extractor(pdf_path):\n",
+    "    \"\"\"\n",
+    "    Main function that extracts pdf and calls the cohere agent.\n",
+    "    \"\"\"\n",
+    "    pdf_text = extract_pdf(pdf_path)\n",
+    "\n",
+    "    prompt = f\"\"\"\n",
+    "    # Instruction\n",
+    "    You are expert at extracting invoices from PDF. The text of the PDF file is given below.\n",
+    "\n",
+    "    You must follow the steps below:\n",
+    "    1. Summarize the text and extract only the most information: total amount billed and invoice number.\n",
+    "    2. Using the summary above, call convert_to_json tool, which uses the summary from step 1.\n",
+    "    If you run into issues. Identifiy the issue and retry.\n",
+    "    You are not done unless you see SUCCESS in the tool output.\n",
+    "\n",
+    "    # File Name:\n",
+    "    {pdf_path}\n",
+    "\n",
+    "    # Extracted Text:\n",
+    "    {pdf_text}\n",
+    "    \"\"\"\n",
+    "    output = cohere_agent(prompt, None, verbose=True)\n",
+    "    print(f\"Finished extracting: {pdf_path}\")\n",
+    "\n",
+    "    print('Please check the output below')\n",
+    "    print(pd.read_csv('output.csv'))\n",
+    "\n",
+    "\n",
+    "pdf_extractor('simple_invoice.pdf')\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As shown above, the model first summarized the extracted pdf as `Total amount: $115.00\\nInvoice number: 0852` and sent this to `conver_to_json()` function. \n",
+    "`conver_to_json()` then converts it to json format and saves it into a csv file. "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "base",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/notebooks/agents/pdf-extractor/simple_invoice.pdf b/notebooks/agents/pdf-extractor/simple_invoice.pdf