Moondream is a tiny multimodal vision-language model with 1.6B parameters. It accepts both text and image inputs and generates text outputs such as captions or answers to questions about an image. In this repository, the inference model is implemented entirely in Mojo, while pre-processing (such as tokenization) and post-processing are handled in Python. The original PyTorch implementation can be found here
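The division of labor between Python and Mojo can be illustrated with a minimal sketch: Python turns text into token IDs (pre-processing) and token IDs back into text (post-processing), while the IDs in between would be fed to the Mojo inference model. The toy vocabulary and function names below are hypothetical stand-ins, not the repository's actual interface or Moondream's real tokenizer.

```python
# Toy illustration of the Python pre/post-processing layer.
# The real repo uses Moondream's tokenizer; this vocabulary and
# these function names are hypothetical stand-ins.

VOCAB = {"<pad>": 0, "what": 1, "is": 2, "the": 3, "flower": 4, "wearing": 5, "?": 6}
INV_VOCAB = {i: t for t, i in VOCAB.items()}

def tokenize(text: str) -> list[int]:
    # Pre-processing (Python): text -> token IDs.
    return [VOCAB[tok] for tok in text.lower().replace("?", " ?").split()]

def detokenize(ids: list[int]) -> str:
    # Post-processing (Python): token IDs -> text.
    return " ".join(INV_VOCAB[i] for i in ids)

ids = tokenize("What is the flower wearing?")
print(ids)  # -> [1, 2, 3, 4, 5, 6]
# In the real pipeline, these IDs (plus the encoded image) would be
# passed to the Mojo inference model, and its output IDs detokenized:
print(detokenize(ids))
```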
- Clone the repo:
  ```shell
  git clone https://github.com/taalhaataahir0102/MoonDreamMojo.git
  ```
- Download the weights file from here.
- Extract the weights file into the same directory.
- Make the run script executable:
  ```shell
  chmod +x run.sh
  ```
- Run the model:
  ```shell
  ./run.sh
  ```
The model will then prompt for an input image and a question.
Requirements:
- RAM: 16 GB
- Disk space: 10 GB
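A quick pre-flight check against these requirements might look like the following sketch, assuming a Linux host with GNU `free` and `df` available; the thresholds mirror the numbers above.

```shell
#!/bin/sh
# Pre-flight check: compare available RAM and free disk space in the
# current directory against the stated requirements (illustrative only).
REQUIRED_RAM_GB=16
REQUIRED_DISK_GB=10

ram_gb=$(free -g | awk '/^Mem:/ {print $2}')
disk_gb=$(df -BG --output=avail . | tail -1 | tr -dc '0-9')

echo "RAM: ${ram_gb} GB (need ${REQUIRED_RAM_GB})"
echo "Free disk: ${disk_gb} GB (need ${REQUIRED_DISK_GB})"

[ "$ram_gb" -ge "$REQUIRED_RAM_GB" ] || echo "Warning: less than ${REQUIRED_RAM_GB} GB of RAM"
[ "$disk_gb" -ge "$REQUIRED_DISK_GB" ] || echo "Warning: less than ${REQUIRED_DISK_GB} GB of free disk space"
```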
Question: What is the flower wearing?
Answer: The flower is wearing sunglasses.
Question: Describe the image
Answer: The image features a group of three paper sculptures of animals, including an elephant, a zebra, and a lion, set against a backdrop of a sunset. The sculptures are arranged in a way that showcases the animals together in a natural setting.