IntelliDocs is a Retrieval-Augmented Generation (RAG) based project designed to assist users in querying and extracting information from their PDF documents. By leveraging advanced natural language processing techniques, IntelliDocs enables users to efficiently retrieve relevant content from large volumes of text within PDFs.
- PDF Extraction: Implement methods to extract text from PDF files, ensuring the preservation of formatting and structure.
- Text Processing: Clean and tokenize extracted text to prepare it for chunking and embedding.
- Chunking: Divide the processed text into manageable chunks to facilitate efficient querying.
- Embedding: Use Sentence Transformers to generate embeddings for the text chunks, enabling semantic similarity searches.
- Querying: Develop a retrieval system that allows users to input queries and receive relevant chunks of text based on semantic similarity.
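The chunking and querying steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the project's implementation: `chunk_sentences`, `toy_embed`, and `retrieve` are hypothetical names, and `toy_embed` is a plain word-count vector standing in for a Sentence Transformers embedding so the example runs without downloading a model. Only the chunking and cosine-similarity ranking logic is meant to carry over.

```python
import math
import re
from collections import Counter


def chunk_sentences(text, sentences_per_chunk=2):
    """Split text into sentences, then group them into fixed-size chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=1):
    """Rank chunks by similarity to the query and return the top_k (score, chunk) pairs."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


text = (
    "Invoices are due within thirty days. Late invoices incur a fee. "
    "The office is closed on public holidays. Staff may work remotely on Fridays."
)
chunks = chunk_sentences(text, sentences_per_chunk=2)
best_score, best_chunk = retrieve("When are invoices due?", chunks, top_k=1)[0]
print(best_chunk)
```

With a real embedding model, `toy_embed` would return a dense vector and the same ranking step would capture semantic similarity rather than simple word overlap.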
- Programming Language: Python
- Libraries:
  - pandas: For data manipulation and for storing embeddings in CSV format (no vector database is used).
  - sentence-transformers: For embedding text chunks.
  - fitz: For PDF text extraction.
  - Streamlit: For creating the user interface (temporary).
- Machine Learning: Uses a pre-trained embedding model to generate vector embeddings, but does not use a vector database for storage.
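To illustrate what storing embeddings in a CSV file instead of a vector database can look like, here is a small round-trip sketch. The column names (`chunk`, `embedding`) and the choice to serialize each vector as a JSON string are assumptions, since the project's actual CSV layout is not shown here. The sketch uses the standard-library `csv` module so it runs without dependencies; with pandas, `DataFrame.to_csv` and `pandas.read_csv` would play the same role.

```python
import csv
import io
import json


def save_embeddings(rows, file_obj):
    """Write (chunk, embedding) pairs as CSV; each vector is stored as a JSON string."""
    writer = csv.DictWriter(file_obj, fieldnames=["chunk", "embedding"])
    writer.writeheader()
    for chunk, vector in rows:
        writer.writerow({"chunk": chunk, "embedding": json.dumps(vector)})


def load_embeddings(file_obj):
    """Read the CSV back and decode each JSON string into a list of floats."""
    return [
        (row["chunk"], json.loads(row["embedding"]))
        for row in csv.DictReader(file_obj)
    ]


rows = [
    ("First chunk, with a comma.", [0.12, -0.5, 0.33]),
    ("Second chunk.", [0.9, 0.1, -0.2]),
]
buffer = io.StringIO()  # in-memory stand-in for a file under CSV_db
save_embeddings(rows, buffer)
buffer.seek(0)
loaded = load_embeddings(buffer)
print(loaded[0])
```

Serializing the vector into a single quoted cell keeps one row per chunk, at the cost of parsing every vector on load. That is workable for a handful of PDFs but scales poorly compared with a vector index.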
├── CSV_db                          ## your vector embeddings go here
├── README.md
├── model
│   ├── __init__.py
│   ├── intellidocs_main.py         ## run this to get the gist of how the project works
│   ├── intellidocs_rag_final       ## final RAG version
│   │   ├── __init__.py
│   │   ├── chunk_processor.py
│   │   ├── cosine_similarity.py
│   │   ├── embedding_process.py
│   │   ├── intellidocs_rag_constants.py
│   │   ├── pdf_loader.py
│   │   └── retrieval_process.py
│   ├── intellidocs_rag_v2
│   │   ├── __init__.py
│   │   └── intellidocs_RAG_V2.py
│   └── rag_gemini_v1
│       ├── __init__.py
│       ├── document_processor.py
│       ├── faiss_saver_and_responser.py
│       └── text_processor.py
├── notebooks
│   └── RAG_from_scratch.ipynb
├── pdfs                            ## your pdf files go here
├── requirements.txt
├── ui.py
└── utils
    ├── __init__.py
    └── constants.py                ## project constants inc. paths
Ensure you have the following installed on your system:
- Python (version 3.7 or higher)
- pip (Python package installer)
- Git
Open your terminal or command prompt and run the following command:
git clone https://github.com/anishka07/intellidocs.git
Create and activate a virtual environment. For example, with conda:
conda create -n your_env_name python=3.11 pip -y
conda activate your_env_name
Run the following command to install the dependencies:
pip install -r requirements.txt
To run IntelliDocs from the terminal:
cd model
python intellidocs_main.py  ## make sure to check the file first
To run IntelliDocs from its Streamlit UI:
streamlit run ui.py
- Input PDF: Upload your PDF document using the Streamlit interface (for now).
- Querying: Enter your query in the provided input field and submit.
- Results: The system will return the most relevant text chunks extracted from the PDF based on your query.
- Expand Support: Extend support to other document formats (e.g., DOCX, TXT).
- Web Application: Create a full-stack web application with APIs.
- Summarization: Summarize extracted text using TF-IDF.
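As a rough idea of what the planned TF-IDF summarization could look like, the sketch below scores each sentence by the sum of its terms' TF-IDF weights and keeps the top-scoring ones. This is an assumption about one possible approach, not an existing part of the project: `tokenize` and `summarize` are hypothetical names, each sentence is treated as its own "document" for the IDF calculation, and the scoring is deliberately simple.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize(sentences, max_sentences=1):
    """Return the highest-scoring sentences by TF-IDF, kept in original order."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        # Term frequency is normalized by sentence length; IDF is log(n / df).
        score = sum(
            (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        )
        scores.append(score)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:max_sentences]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The report covers sales.",
    "The report covers costs.",
    "Quarterly revenue grew sharply across regions.",
]
print(summarize(sentences, max_sentences=1))
```

Sentences made of terms that are rare across the document score highest, so repeated boilerplate phrasing is ranked lower than distinctive content.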