Image Captioning Model - BLIP (Bootstrapping Language-Image Pre-training). This model is designed for unified vision-language understanding and generation tasks. It is trained on the COCO (Common Objects in Context) dataset and uses the BLIP base architecture with a ViT (Vision Transformer) large backbone.
The project combines the BLIP architecture, which bootstraps language-image pre-training, with the OpenAI ChatGPT API to turn the raw model output into more creative captions.
The image captioning model is implemented with the PyTorch framework and loaded through the Hugging Face Transformers library.
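For reference, the snippet below is a minimal sketch of how these pieces could fit together: BLIP produces a plain caption, and ChatGPT rewrites it. The checkpoint name (Salesforce/blip-image-captioning-large), the prompt wording, and the gpt-3.5-turbo model choice are assumptions, not taken from this repository's code.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from openai import OpenAI

# Load a pre-trained BLIP captioning checkpoint (ViT-L backbone, fine-tuned on COCO).
# The exact checkpoint name is an assumption.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

# Generate a plain caption for a local image.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)

# Ask ChatGPT to turn the plain caption into a more creative one.
client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Rewrite this image caption as a short, creative caption: {caption}",
    }],
)
print(response.choices[0].message.content)
```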
The interactive web interface is built with the Streamlit Python library.
A caption can be generated for any image through the app at the link.
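A minimal Streamlit wrapper might look like the sketch below; the file name app.py and the UI details are assumptions. Run it with `streamlit run app.py`.

```python
import streamlit as st
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

@st.cache_resource
def load_model():
    # Cache the BLIP model so it is loaded only once per session.
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
    return processor, model

st.title("BLIP Image Captioning")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])

if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image)
    processor, model = load_model()
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    st.write(processor.decode(output_ids[0], skip_special_tokens=True))
```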
Here are some example images along with the captions generated by the BLIP image captioning model:
Generated Caption: "Nothing beats the joy of a sunny day spent playing soccer with friends."
Generated Caption:
- Nature is calling, so answer the call with your Jeep and let the adventure begin.
- Live life on the wild side and take the road less traveled.
Generated Caption:
- Take a moment to appreciate the beauty of a sunset by the beach.
- The beach is the perfect place to end the day and enjoy the beauty of the sunset.
To run the image captioning model, the following dependencies are required:
- Python (version 3.7 or above)
- PyTorch (version 1.8 or above)
- Transformers library (version 4.3 or above)
You can install the necessary libraries using the following command:
pip install -r requirements.txt
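For reference, a requirements.txt consistent with the versions listed above might look like the following; the exact entries are an assumption (Streamlit, OpenAI, and Pillow are included because the web interface, the ChatGPT step, and image loading depend on them):

```
torch>=1.8
transformers>=4.3
streamlit
openai
Pillow
```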
We would like to express our gratitude to the researchers and developers who have contributed to the development and implementation of the BLIP image captioning model. Their dedication and hard work are greatly appreciated.
If you have any questions, issues, or feedback regarding the image captioning model, please feel free to contact us at [email protected].
We hope you find the BLIP image captioning model useful and enjoy experimenting with it!