This project implements a Multimodal Retrieval-Augmented Generation (RAG) System designed to process and integrate text, images, audio, video, and tables. By leveraging diverse data formats, the system enhances the accuracy and relevance of generated responses for a wide range of applications.
The system utilizes the following datasets:
- Images:
- Text:
  - Wikipedia Example Dataset (25,000 articles)
  - OpenWebText (high-quality web articles)
- Audio:
  - ESC-50 (Environmental Sound Classification)
  - TAU Urban Acoustic Scenes 2020 Mobile
- Video:
  - Videos from Pexels, converted into frames for processing.
  - Frames extracted from each video:
    - First frame.
    - One frame every 5 seconds.
    - Last frame.
  - Frames embedded using CLIP and linked back to the original video file (see the sketch below).
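The frame-extraction and embedding step can be sketched as follows. This is a minimal illustration, assuming OpenCV for decoding and the `openai/clip-vit-base-patch32` checkpoint from Hugging Face; the function names and checkpoint are assumptions, not the project's exact code.

```python
# Sketch: extract the first frame, one frame every 5 seconds, and the last frame,
# then embed each frame with CLIP. Checkpoint and helper names are assumptions.
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_frames(video_path, interval_s=5):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Frame indices: first frame, one every `interval_s` seconds, and the last frame.
    indices = sorted({0, *range(0, total, int(fps * interval_s)), max(total - 1, 0)})
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

def embed_frames(frames):
    # One CLIP image embedding per extracted frame; these can be stored with
    # metadata pointing back to the source video file.
    inputs = clip_processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        return clip_model.get_image_features(**inputs)
```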
- Embedding:
  - Used CLIP for embedding text, images, and video frames.
  - Used the CLAP model for embedding audio.
- VLM:
  - Integrated with GPT-4o for answering queries.
- Workflow:
  - Query and data embeddings stored in ChromaDB.
  - Cosine similarity used to retrieve the top-k results.
  - Retrieved results passed to GPT-4o for final response generation (see the sketch below).
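A minimal sketch of this pipeline, assuming the `laion/clap-htsat-unfused` CLAP checkpoint, a persistent ChromaDB collection configured for cosine similarity, and the OpenAI Python client for GPT-4o; the collection name, prompt wording, and helper functions are illustrative rather than the project's exact implementation.

```python
# Sketch of the CLIP/CLAP + GPT-4o workflow: embed audio with CLAP, store
# embeddings in ChromaDB (cosine space), retrieve top-k matches for a query
# embedding, and generate the final answer with GPT-4o.
# Checkpoint names, collection name, and prompt format are assumptions.
import chromadb
import torch
from openai import OpenAI
from transformers import ClapModel, ClapProcessor

clap_model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
clap_processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

chroma = chromadb.PersistentClient(path="./chroma_store")
collection = chroma.get_or_create_collection(
    name="multimodal_rag", metadata={"hnsw:space": "cosine"}
)

def embed_audio(waveform, sampling_rate=48_000):
    # CLAP expects mono audio resampled to the model's sampling rate.
    inputs = clap_processor(audios=waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        return clap_model.get_audio_features(**inputs)[0].tolist()

def index_item(item_id, embedding, metadata):
    collection.add(ids=[item_id], embeddings=[embedding], metadatas=[metadata])

def answer_query(question, query_embedding, k=5):
    hits = collection.query(query_embeddings=[query_embedding], n_results=k)
    context = "\n".join(str(m) for m in hits["metadatas"][0])
    llm = OpenAI()  # requires OPENAI_API_KEY in the environment
    reply = llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the retrieved context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return reply.choices[0].message.content
```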
- Embedding:
  - Leveraged ColPali, a Vision-Language Model (VLM) architecture based on PaliGemma-3B that generates ColBERT-style multi-vector representations of text and images.
- Token Matching:
  - Used ColBERT-style late-interaction retrieval with token-level matching.
- VLM:
  - Responses generated using Llama 3.2.
- Workflow:
  - Each document page is treated as an image, divided into patches, and indexed.
  - Queries are embedded, matched against the stored embeddings, and the retrieved pages are processed by Llama 3.2 (see the scoring sketch below).
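The late-interaction step can be illustrated with a small MaxSim scoring sketch: each query token embedding is compared against every page-patch embedding, the best match per token is kept, and the per-token maxima are summed into the page score. Tensor shapes, normalization, and function names below are assumptions for illustration, not the project's exact code.

```python
# Sketch: ColBERT-style late-interaction (MaxSim) scoring between a query's
# token embeddings and a page's patch embeddings, as produced by a ColPali-like
# model. Shapes and normalization are assumptions.
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, dim); page_emb: (num_patches, dim)."""
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    p = torch.nn.functional.normalize(page_emb, dim=-1)
    sim = q @ p.T                       # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()  # best patch per query token, summed

def rank_pages(query_emb, page_embs):
    # Score every indexed page and return page indices from best to worst;
    # the top-ranked pages would then be handed to Llama 3.2 for generation.
    scores = torch.stack([maxsim_score(query_emb, p) for p in page_embs])
    return torch.argsort(scores, descending=True)
```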
- Programming Language: Python
- Database: ChromaDB
- Models:
  - CLIP: For embedding text, images, and video frames.
  - CLAP: For embedding audio features.
  - ColPali: For document embeddings, leveraging multi-vector representations of text and visual features.
  - ColBERT: For token-level matching and late-interaction retrieval.
  - PaliGemma-3B extension (ColPali): A specialized Vision-Language Model for document indexing.
  - GPT-4o and Llama 3.2: For vision-language reasoning and query responses.
- CLIP: Learning Transferable Visual Models from Natural Language Supervision
- CLAP: Learning Audio Concepts from Natural Language Supervision
- ColPali: Efficient Document Retrieval with Vision-Language Models
- HuggingFace Datasets Documentation
This project was inspired by or uses resources from the following repository: