Niketkumardheeryan · Niketkumardheeryan · Jun 2, 2024 · May 20, 2024 · May 20, 2024 · May 22, 2024
diff --git a/QnA Conversational Agent/PDFChat.ipynb b/QnA Conversational Agent/PDFChat.ipynb
@@ -0,0 +1,297 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# QnA Conversational Agent"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 📚 Chat with Your Data Bot - Jupyter Notebook 🤖\n",
+    "\n",
+    "Welcome to the **Chat with Your Data Bot** Jupyter Notebook! This notebook demonstrates how to set up and use a conversational AI model to interact with your PDF or TXT documents using Azure OpenAI and the LangChain library. Follow along to understand the process of initializing the model, uploading documents, and asking questions to get insightful answers.\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "In this notebook, we will cover the following steps:\n",
+    "\n",
+    "1. **Setting Up the Environment**: Installing the necessary libraries and configuring the Azure OpenAI credentials.\n",
+    "2. **Loading and Preprocessing Documents**: Uploading PDF or TXT files and splitting them into smaller chunks for efficient processing.\n",
+    "3. **Creating Embeddings and Vector Store**: Using Hugging Face embeddings to convert document chunks into vectors and storing them in an in-memory vector store.\n",
+    "4. **Initializing the Conversational Retrieval Chain**: Setting up the retrieval chain to interact with the documents based on user queries.\n",
+    "5. **Interacting with the Bot**: Asking questions and receiving answers along with references to the source documents.\n",
+    "\n",
+    "Let's get started! 🚀\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Prerequisites\n",
+    "\n",
+    "Before running this notebook, ensure you have the following:\n",
+    "\n",
+    "- **Azure OpenAI API Key**: Your Azure OpenAI API key, endpoint, deployment name, and API version.\n",
+    "- **Python Environment**: A Python environment with the necessary libraries installed (as listed in the `requirements.txt` file).\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## Note\n",
+    "\n",
+    "### **main.py and utils.py needs to be put in a seperate files .py files,** \n",
+    "### **once it is done, go to the terminal and type:**\n",
+    "\n",
+    "```streamlit run main.py```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "### Table of Contents\n",
+    "\n",
+    "1. [Setting Up the Environment](#setting-up-the-environment)\n",
+    "2. [Loading and Preprocessing Documents](#loading-and-preprocessing-documents)\n",
+    "3. [Creating Embeddings and Vector Store](#creating-embeddings-and-vector-store)\n",
+    "4. [Initializing the Conversational Retrieval Chain](#initializing-the-conversational-retrieval-chain)\n",
+    "5. [Interacting with the Bot](#interacting-with-the-bot)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setting Up the Environment"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import streamlit as st\n",
+    "from langchain.chat_models import AzureChatOpenAI\n",
+    "from langchain.chains import ConversationalRetrievalChain\n",
+    "from langchain.document_loaders import PyPDFLoader, TextLoader\n",
+    "from utils import load_db\n",
+    "import tempfile\n",
+    "from langchain.vectorstores import DocArrayInMemorySearch\n",
+    "from langchain.embeddings import HuggingFaceEmbeddings\n",
+    "from langchain.chains import ConversationalRetrievalChain\n",
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "\n",
+    "\n",
+    "#I have used Azure OpenAI API Key for this bot, if ypu are using OpenAI, please follow the syntax from OpenAI Website.\n",
+    "# Initialize Azure OpenAI client\n",
+    "openai_api_key = \"\"\n",
+    "azure_endpoint = \"\"\n",
+    "azure_deployment_name = \"\"\n",
+    "azure_api_version = \"\"\n",
+    "\n",
+    "llm = AzureChatOpenAI(\n",
+    "    api_key=openai_api_key,\n",
+    "    azure_endpoint=azure_endpoint,\n",
+    "    deployment_name=azure_deployment_name,\n",
+    "    api_version=azure_api_version,\n",
+    "    temperature=0,\n",
+    "    max_tokens=500\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading and Processing the Documents"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### I have added those processes in the streamlit app, in the below code."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Streamlit app\n",
+    "# main.py\n",
+    "\n",
+    "import streamlit as st\n",
+    "from langchain.chat_models import AzureChatOpenAI\n",
+    "from langchain.chains import ConversationalRetrievalChain\n",
+    "from langchain.document_loaders import PyPDFLoader, TextLoader\n",
+    "from utils import load_db\n",
+    "import tempfile\n",
+    "from langchain.vectorstores import DocArrayInMemorySearch\n",
+    "from langchain.embeddings import HuggingFaceEmbeddings\n",
+    "from langchain.chains import ConversationalRetrievalChain\n",
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "\n",
+    "\n",
+    "#I have used Azure OpenAI API Key for this bot, if you are using OpenAI, please follow the syntax from OpenAI Website.\n",
+    "# Initialize Azure OpenAI client\n",
+    "openai_api_key = \"\"\n",
+    "azure_endpoint = \"\"\n",
+    "azure_deployment_name = \"\"\n",
+    "azure_api_version = \"\"\n",
+    "\n",
+    "llm = AzureChatOpenAI(\n",
+    "    api_key=openai_api_key,\n",
+    "    azure_endpoint=azure_endpoint,\n",
+    "    deployment_name=azure_deployment_name,\n",
+    "    api_version=azure_api_version,\n",
+    "    temperature=0,\n",
+    "    max_tokens=500\n",
+    ")\n",
+    "\n",
+    "st.title('Chat with Your Data Bot')\n",
+    "st.sidebar.header('Upload a PDF or TXT file')\n",
+    "\n",
+    "uploaded_file = st.sidebar.file_uploader(\"Choose a file\", type=[\"pdf\", \"txt\"])\n",
+    "\n",
+    "\n",
+    "\n",
+    "# Once the file is uploaded, the code creates a local db file, here it is creating a Chroma based DB.\n",
+    "if uploaded_file is not None:\n",
+    "    file_extension = uploaded_file.name.split('.')[-1]\n",
+    "    \n",
+    "    # Create a temporary file to save the uploaded file\n",
+    "    with tempfile.NamedTemporaryFile(delete=False, suffix=f\".{file_extension}\") as tmp_file:\n",
+    "        tmp_file.write(uploaded_file.read())\n",
+    "        tmp_file_path = tmp_file.name\n",
+    "\n",
+    "    if file_extension == 'pdf':\n",
+    "        loader = PyPDFLoader(tmp_file_path)\n",
+    "    elif file_extension == 'txt':\n",
+    "        loader = TextLoader(tmp_file_path)\n",
+    "    else:\n",
+    "        st.error('Unsupported file type!')\n",
+    "\n",
+    "    documents = loader.load()\n",
+    "    qa = load_db(documents, llm)\n",
+    "\n",
+    "    st.session_state.qa = qa\n",
+    "    st.session_state.chat_history = []\n",
+    "\n",
+    "    st.success(f'Loaded file: {uploaded_file.name}')\n",
+    "\n",
+    "\n",
+    "if 'qa' in st.session_state:\n",
+    "    user_query = st.text_input(\"Enter your question:\")\n",
+    "\n",
+    "    if st.button('Submit'):\n",
+    "        result = st.session_state.qa({\"question\": user_query, \"chat_history\": st.session_state.chat_history})\n",
+    "        st.session_state.chat_history.extend([(user_query, result[\"answer\"])])\n",
+    "\n",
+    "# The output shows the answer, the original query and the sourced paragraphs with in the file.\n",
+    "        st.write(f\"**User:** {user_query}\")\n",
+    "        st.write(f\"**ChatBot:** {result['answer']}\")\n",
+    "\n",
+    "        st.write(\"**DB Query:**\", result[\"generated_question\"])\n",
+    "        st.write(\"**Source Documents:**\")\n",
+    "        for doc in result[\"source_documents\"]:\n",
+    "            st.write(doc)\n",
+    "\n",
+    "    if st.button('Clear History'):\n",
+    "        st.session_state.chat_history = []\n",
+    "        st.write(\"Chat history cleared!\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setting up Vector DB through --> Chroma & Initalizing Conversation Chain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#utils.py\n",
+    "\n",
+    "def load_db(documents, llm, k=4, chain_type=\"stuff\"):\n",
+    "    # Initialize a text splitter with specific chunk size and overlap\n",
+    "    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)\n",
+    "    \n",
+    "    # Split the input documents into smaller chunks\n",
+    "    docs = text_splitter.split_documents(documents)\n",
+    "\n",
+    "    # Ensure each document chunk has text and metadata\n",
+    "    for doc in docs:\n",
+    "        if not hasattr(doc, 'metadata'):  # If the document does not have metadata, add an empty metadata dictionary\n",
+    "            doc.metadata = {}\n",
+    "        doc.metadata['text'] = doc.page_content  # Add the page content as a metadata field\n",
+    "\n",
+    "    # Define embeddings using a pre-trained Hugging Face model\n",
+    "    embedding = HuggingFaceEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n",
+    "\n",
+    "    # Create an in-memory vector database from the document chunks using the embeddings\n",
+    "    db = DocArrayInMemorySearch.from_documents(docs, embedding)\n",
+    "\n",
+    "    # Create a retriever from the vector database that retrieves documents based on similarity\n",
+    "    retriever = db.as_retriever(search_type=\"similarity\", search_kwargs={\"k\": k})\n",
+    "\n",
+    "    # Initialize a conversational retrieval chain with the given LLM and retriever\n",
+    "    qa = ConversationalRetrievalChain.from_llm(\n",
+    "        llm=llm,  # Language model to use for generating responses\n",
+    "        chain_type=chain_type,  # Type of chain to use (default is \"stuff\")\n",
+    "        retriever=retriever,  # Retriever for fetching relevant documents\n",
+    "        return_source_documents=True,  # Whether to return the source documents\n",
+    "        return_generated_question=True,  # Whether to return the generated question\n",
+    "    )\n",
+    "\n",
+    "    return qa  # Return the initialized conversational retrieval chain\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Its time to interact with the chatbot!\n",
+    "### Hope you liked the bot\n",
+    "### Please visit [@hiteshhhh007](https://github.com/hiteshhhh007) for more such projects and codes!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/QnA Conversational Agent/readme.md b/QnA Conversational Agent/readme.md
@@ -0,0 +1,66 @@
+# 📚 Chat with Your Data Bot 🤖
+
+Welcome to the **Chat with Your Data Bot** project! This powerful Streamlit application allows you to interact with your PDF or TXT documents using a conversational AI model powered by Azure OpenAI. Upload your files, ask questions, and get insightful answers along with relevant source document references. Let's dive in! 🚀
+
+## Features ✨
+
+- **User-Friendly Interface**: Easily upload PDF or TXT files via a clean and intuitive sidebar.
+- **Conversational AI**: Leverage the capabilities of Azure OpenAI to interact with your documents.
+- **Support for PDF and TXT Files**: Upload documents in PDF or TXT formats.
+- **Efficient Document Retrieval**: Utilize Hugging Face embeddings and in-memory vector search for quick and accurate document retrieval.
+- **Source Document Reference**: Get the answer along with the source document for better context and understanding.
+- **Persistent Chat History**: Keep track of your questions and answers during the session.
+- **Clear Chat History**: Option to clear the chat history and start fresh at any time.
+
+## How to Use 🛠️
+
+### Prerequisites
+
+- **Azure OpenAI API Key**: Make sure you have an Azure OpenAI API key, endpoint, deployment name, and API version.
+
+### Setup
+
+1. **Clone the Repository**:
+    ```sh
+    git clone https://github.com/Niketkumardheeryan/ML-CaPsule.git
+    ```
+
+2. **Install Dependencies**:
+    ```sh
+    pip install -r requirements.txt
+    ```
+
+3. **Configure Azure OpenAI**:
+    - Open the `main.ipynb` file.
+    - Replace the placeholders with your Azure OpenAI credentials:
+      ```python
+      openai_api_key = "your_api_key_here"
+      azure_endpoint = "your_endpoint_here"
+      azure_deployment_name = "your_deployment_name_here"
+      azure_api_version = "your_api_version_here"
+      ```
+
+### Running the Application
+
+1. **Start the Streamlit App**
+2. **Upload Your File**:
+    - Click on the sidebar and upload your PDF or TXT file.
+
+3. **Ask Questions**:
+    - Type your question in the text input field and hit "Submit".
+    - The bot will provide an answer along with the source document references.
+
+4. **Clear Chat History**:
+    - Click the "Clear History" button to start a new session.
+
+## Requirements 📋
+
+Make sure to have the following libraries installed. These are specified in the `requirements.txt` file:
+
+```plaintext
+streamlit==1.7.0
+langchain==0.0.92
+PyPDF2==1.26.0
+docarray==0.0.1
+torch==1.10.1
+transformers==4.15.0
diff --git a/QnA Conversational Agent/requirements.txt b/QnA Conversational Agent/requirements.txt
@@ -0,0 +1,6 @@
+streamlit==1.7.0
+langchain==0.0.92
+PyPDF2==1.26.0
+docarray==0.0.1
+torch==1.10.1
+transformers==4.15.0