We investigate how well Multimodal Large Language Models (MLLMs) perform when applied to graphical perception tasks.
To answer this research question, our study evaluates MLLMs' performance on graphical perception, inspired by two influential studies by Cleveland and McGill and by Haehn et al. We include both pretrained and fine-tuned MLLMs in our experiments. In particular, we build our fine-tuned models on top of pretrained models and examine their performance in depth, using efficient training configurations and scalable adaptation techniques to reduce compute and memory costs.
You can view 👉 our Paper | Supplemental | Fast Forward | Video | Results | User Study 👈
Multimodal Large Language Models (MLLMs) have remarkably progressed in analyzing and understanding images. Despite these advancements, accurately regressing values in charts remains an underexplored area for MLLMs.
We used the latest multimodal large language models, including GPT-4o, Gemini 1.5 Flash, Gemini Pro Vision, and Llama 3.2 Vision.
We also produced 15 fine-tuned MLLMs for this study (three per experiment), all built on top of Llama 3.2 Vision; a loading sketch follows the links below.
👉 Feel free to access our fine-tuned models below:
EXP1: Model 1 | Model 2 | Model 3
EXP2: Model 1 | Model 2 | Model 3
EXP3: Model 1 | Model 2 | Model 3
EXP4: Model 1 | Model 2 | Model 3
EXP5: Model 1 | Model 2 | Model 3 👈
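For reference, here is a minimal, hypothetical sketch of how one of these fine-tuned checkpoints could be loaded and queried with Hugging Face Transformers. The repository id, image path, and prompt below are placeholders, and the exact classes may differ from the ones used in our scripts.

```python
# Hypothetical sketch: loading a fine-tuned Llama 3.2 Vision checkpoint with
# Hugging Face Transformers. The repository id is a placeholder, not our model name.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "your-hf-username/llmp-exp1-model1"   # placeholder repository id

model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("stimulus.png")               # a chart image to evaluate
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": "What percentage is the smaller marked bar of the larger one?"},
    ],
}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))
```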
Our study compares the performance of fine-tuned and pretrained models to see which perform better. We also want to understand where MLLMs succeed or fail at data visualization tasks.
We recreated stimuli for five experiments based on Cleveland and McGill's foundational 1984 study and compared the models' results with human task performance.
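The actual stimulus generators live in our LLMP scripts; the following is only a minimal sketch of what a Cleveland-and-McGill-style position/length stimulus could look like, assuming matplotlib and illustrative parameters.

```python
# Minimal, hypothetical sketch of a bar-chart stimulus in the style of
# Cleveland & McGill's position/length comparisons; our real generators differ.
import random
import matplotlib.pyplot as plt

def make_stimulus(path="stimulus.png"):
    heights = [random.randint(10, 90) for _ in range(5)]
    marked = random.sample(range(5), 2)            # two bars to compare
    fig, ax = plt.subplots(figsize=(3, 3), dpi=100)
    ax.bar(range(5), heights, color="black", width=0.6)
    for i in marked:                               # dot-mark the compared bars
        ax.plot(i, 2, marker="o", color="white", markersize=4)
    ax.axis("off")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
    smaller, larger = sorted(heights[i] for i in marked)
    return smaller / larger                        # ground-truth ratio

print(make_stimulus())                             # e.g. 0.42
```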
We built our study on the following packages and techniques (a minimal sketch of how they fit together follows this list):
- PyTorch
- PyTorch Lightning
- Hugging Face Transformers
- Distributed Data Parallel
- Vision Encoder-Decoder Models
- Parameter-Efficient Fine-Tuning (PEFT)
- Low-Rank Adaptation of Large Language Models (LoRA)
- bitsandbytes
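The hyperparameters in our fine-tuning scripts may differ; this is a minimal sketch of how Transformers, bitsandbytes 4-bit quantization, and PEFT/LoRA typically combine, with an illustrative base-model id and target modules.

```python
# Minimal sketch (illustrative hyperparameters): a 4-bit quantized Llama 3.2 Vision
# model wrapped with a LoRA adapter via PEFT; our actual training configs may differ.
import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # bitsandbytes 4-bit quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",   # base model (placeholder id)
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                   # low-rank update dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # only the LoRA weights stay trainable
model.print_trainable_parameters()
```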
We also designed a comprehensive script that keeps everything in one place, from generating datasets and fine-tuning models to evaluating MLLM performance at interpreting and analyzing visual data.
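To make "evaluating performance" concrete: following Cleveland and McGill and Haehn et al., per-task error in this line of work is commonly summarized as a midmean log absolute error (MLAE). The sketch below shows one way to compute it; the exact aggregation in our scripts may differ.

```python
# Sketch of an MLAE-style error as used in graphical-perception studies
# (Cleveland & McGill 1984; Haehn et al. 2018); our scripts may aggregate differently.
import numpy as np
from scipy import stats

def mlae(predicted, true):
    """Midmean (25% trimmed mean) of log2(|predicted - true| + 1/8)."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    log_errors = np.log2(np.abs(predicted - true) + 0.125)
    return stats.trim_mean(log_errors, proportiontocut=0.25)

print(mlae([30.0, 52.0, 18.0], [33.0, 50.0, 20.0]))
```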
You'll need to have the following installed:
- Python 3.10 or higher
- Pip
- Conda
After you have forked and cloned our project, we recommend using Conda to set up a new environment:
conda env create -f conda-environment.yml
pip install -r pip-requirements.txt
After installing all packages and dependencies:
- Navigate to the LLMP directory:
cd LLMP
- Create a `.env` file with your API keys (a sketch of loading these keys follows the demonstration below):
chatgpt_api_key="add your API key"
gemini_api_key="add your API key"
- Navigate to the experiment directory:
cd LLMP/EXP/EXP1
- Run the experiment:
python EXP1fullprogressrun.py
We use Experiment 1 for this demonstration.
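How the experiment scripts pick up the `.env` keys is repo-specific; assuming a python-dotenv-style setup, loading them could look like the sketch below, with variable names mirroring the `.env` entries above.

```python
# Hypothetical sketch: reading the .env keys with python-dotenv.
# The actual LLMP scripts may load configuration differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

chatgpt_api_key = os.getenv("chatgpt_api_key")
gemini_api_key = os.getenv("gemini_api_key")

assert chatgpt_api_key and gemini_api_key, "Missing API keys in .env"
```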
We would like to thank:
- Daniel Haehn for helping us build the foundation for this research and advising us to integrate the latest technology into our paper.
- Kenichi Maeda and Masha Geshvadi for their assistance in writing and reviewing multiple scripts and for helping put the paper together.
We have also learned a great deal from this project, spanning #programming, #datavisualization, #imageprocessing, #machinelearning, and #computergraphics, and applied these insights to our study. Most importantly, we all contributed to this CS460 - Computer Graphics project at the University of Massachusetts Boston. Here's the link to the course website: CS460.org.
Any questions? Cool - feel free to reach out and discuss them with me 🟢
Visit my LinkedIn Profile | Send me an Email | View My Resume