IntelliDocs

Overview

IntelliDocs is a Retrieval-Augmented Generation (RAG) based project designed to assist users in querying and extracting information from their PDF documents. By leveraging advanced natural language processing techniques, IntelliDocs enables users to efficiently retrieve relevant content from large volumes of text within PDFs.

Project Objectives

PDF Extraction: Implement methods to extract text from PDF files, ensuring the preservation of formatting and structure.
Text Processing: Clean and tokenize extracted text to prepare it for chunking and embedding.
Chunking: Divide the processed text into manageable chunks to facilitate efficient querying.
Embedding: Use Sentence Transformers to generate embeddings for the text chunks, enabling semantic similarity searches.
Querying: Develop a retrieval system that allows users to input queries and receive relevant chunks of text based on semantic similarity.

Technologies Used

Programming Language: Python
Libraries:
- pandas: For data manipulation and embedding storage in csv format (no vector database used).
- sentence-transformers: For embedding text chunks.
- fitz: For PDF text extraction.
- Streamlit: For creating user interface (temporary).
Machine Learning: Utilizes pre-trained embedding model for vector embeddings but does not use vector database for storage.

Project Structure

├── CSV_db
## your vector embeddings go here
├── README.md
├── model
│   ├── __init__.py
│   ├── intellidocs_main.py ## run this to get the gist of how the project 
works
│   ├── intellidocs_rag_final ## final RAG version
│   │   ├── __init__.py
│   │   ├── chunk_processor.py
│   │   ├── cosine_similarity.py
│   │   ├── embedding_process.py
│   │   ├── intellidocs_rag_constants.py
│   │   ├── pdf_loader.py
│   │   └── retrieval_process.py
│   ├── intellidocs_rag_v2
│   │   ├── __init__.py
│   │   └── intellidocs_RAG_V2.py
│   └── rag_gemini_v1
│       ├── __init__.py
│       ├── document_processor.py
│       ├── faiss_saver_and_responser.py
│       └── text_processor.py
├── notebooks
│   ├── RAG_from_scratch.ipynb
├── pdfs
## your pdf files go here
├── requirements.txt
├── ui.py
└── utils
    ├── __init__.py
    └── constants.py ## project constants inc. paths

Step-by-Step Guide to Clone and Run IntelliDocs

Prerequisites

Ensure you have the following installed on your system:

Python (version 3.7 or higher)
pip (Python package installer)
Git

Step 1: Clone the Repository

Open your terminal or command prompt and run the following command:

git clone https://github.com/anishka07/intellidocs.git

Step 2: Create a virtual environment using conda or virtual env and activate it

Run the following command:

## Example:
conda create -n your_env_name python=3.11 pip -y
conda activate your_env_name

Step 3: Install Requirements

Run the following command:

pip install -r requirements.txt

Step 4: Run IntelliDocs from terminal or streamlit

To run IntelliDocs from terminal:

cd model
python intellidocs_main.py (make sure to checkt the file)

To run IntelliDocs from it's streamlit UI:

streamlit run ui.py

Streamlit Interface

Usage

Input PDF: Upload your PDF document using the Streamlit interface (for now).
Querying: Enter your query in the provided input field and submit.
Results: The system will return the most relevant text chunks extracted from the PDF based on your query.

Future Work

Expand Support: Extend support to other document formats (e.g., DOCX, TXT).
Web Application: Create a full stack web application with apis.
Summarization: Extracted text summarization using Tf-Idf.

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
images		images
milvus-docker		milvus-docker
model		model
notebooks		notebooks
static		static
templates		templates
utils		utils
.gitignore		.gitignore
README.md		README.md
app.py		app.py
requirements.txt		requirements.txt
ui.py		ui.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IntelliDocs

Overview

Project Objectives

Technologies Used

Project Structure

Step-by-Step Guide to Clone and Run IntelliDocs

Prerequisites

Step 1: Clone the Repository

Step 2: Create a virtual environment using conda or virtual env and activate it

Step 3: Install Requirements

Step 4: Run IntelliDocs from terminal or streamlit

Streamlit Interface

Usage

Future Work

About

Releases

Packages

Languages

anishka07/Intellidocs

Folders and files

Latest commit

History

Repository files navigation

IntelliDocs

Overview

Project Objectives

Technologies Used

Project Structure

Step-by-Step Guide to Clone and Run IntelliDocs

Prerequisites

Step 1: Clone the Repository

Step 2: Create a virtual environment using conda or virtual env and activate it

Step 3: Install Requirements

Step 4: Run IntelliDocs from terminal or streamlit

Streamlit Interface

Usage

Future Work

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages