Large Scale Image Captioning With Dataflow

Intro

This repo prepares a Dataflow job for large scale image captioning using BLIP and CLIP to generate and rank image captions.

Dataflow is a fully managed data processing service based on Apache Beam. It runs both streaming and batch pipelines and uses autoscaling to reduce latency, processing time, and cost.

The model licenses can be found in the BLIP and CLIP repositories.

Features:

Creates image captions using BLIP.
Ranks the captions using CLIP and keeps the top one.
Parallelizes a job with lots of images across multiple workers.
Saves image/caption pairs as JSONL, in a format compatible with Hugging Face Datasets.
Can run a small subsample in a local environment before deploying the Dataflow job.
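
The ranking step can be pictured as a similarity search: each candidate caption and the image are mapped to embedding vectors, and the caption closest to the image wins. The sketch below shows only that selection logic with plain Python lists. The function names and the use of cosine similarity over precomputed embeddings are assumptions for illustration; the repo's actual ranking lives in clip_processing.py and may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def pick_top_caption(
    image_embedding: Sequence[float],
    captions: List[str],
    caption_embeddings: List[Sequence[float]],
) -> Tuple[str, float]:
    """Return the caption closest to the image, with its score.

    Hypothetical helper: the embeddings are assumed to come from a
    model such as CLIP, which is not invoked here.
    """
    if not captions or len(captions) != len(caption_embeddings):
        raise ValueError("need one embedding per caption and at least one caption")
    scores = [cosine_similarity(image_embedding, e) for e in caption_embeddings]
    best = max(range(len(captions)), key=scores.__getitem__)
    return captions[best], scores[best]
```

With real models, the embeddings would be produced per image and per generated caption, and only the winning caption would be written to the output file.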

Setup

Clone the repo if you haven't. Navigate to the image-captioning-dataflow folder.

Install Python 3.8 and the dependencies.

conda create -n py38 python=3.8
conda activate py38
pip install -r requirements.txt

Install BLIP, download the weights, and save the state dict. Change <your-blip-location> to the absolute path of the folder where BLIP was cloned.

git clone https://github.com/salesforce/BLIP
export PYTHONPATH=$PYTHONPATH:<your-blip-location>/BLIP
gdown 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base_caption.pth'
python load_blip_weights.py

Copy BLIP configs

mkdir configs
cp BLIP/configs/med_config.json configs/

Download CLIP weights

git lfs install
git clone https://huggingface.co/openai/clip-vit-base-patch32

Create a dataset.txt file. First, upload the images you want to caption to Google Cloud Storage. For example, I created a bucket jfacevedo-demos-datasets with a folder me and uploaded all my images to that folder. The commands below generate dataset.txt and upload it to the same directory as the images. Test with a few images first before using the full image dataset.
```
export BUCKET_ID="jfacevedo-demos-datasets"
export PREFIX="me"
python create_dataset_file.py
gsutil cp dataset.txt gs://$BUCKET_ID/$PREFIX/
```
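
To make the role of dataset.txt concrete, here is a minimal sketch of how such a file could be assembled. It assumes the file holds one `gs://` URI per line and that only common image extensions are kept; neither detail is confirmed by this README, and `build_dataset_lines` is a hypothetical helper, not the code in create_dataset_file.py.

```python
from typing import Iterable, List

# Assumed set of image types; adjust to match your data.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def build_dataset_lines(
    bucket_id: str, prefix: str, object_names: Iterable[str]
) -> List[str]:
    """Turn object names under a prefix into gs:// URIs, keeping only images.

    Hypothetical helper: object_names is a plain list of file names here,
    so no Cloud Storage call is made.
    """
    lines = []
    for name in sorted(object_names):
        if name.lower().endswith(IMAGE_EXTENSIONS):
            lines.append(f"gs://{bucket_id}/{prefix}/{name}")
    return lines
```

Writing `"\n".join(lines)` to dataset.txt would then give a file in the assumed one-URI-per-line layout.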
Next, move the weights to a local directory, /captioning. The Dataflow job itself doesn't read these local files, but they are needed to deploy the job and to test the pipeline locally before deploying.
```
chmod 755 clip-vit-base-patch32/
sudo mkdir /captioning/
sudo chmod 755 /captioning/
sudo cp -r clip-vit-base-patch32/ /captioning/
```
Test the pipeline locally. This works without GPUs but takes longer.
```
python pipeline.py --dataset-filename gs://$BUCKET_ID/$PREFIX/dataset.txt --output-filename gs://$BUCKET_ID/$PREFIX/metadata.jsonl
```
If you look at the output, Beam has sharded it into multiple files, which improves performance when the workload runs in parallel. You can join the files as follows.
```
gsutil compose \
gs://${BUCKET_ID}/$PREFIX/metadata* \
gs://${BUCKET_ID}/$PREFIX/metadata.jsonl
```
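
If the shards have already been copied to a local directory, they can also be joined in Python, which gives a chance to check that every line parses as JSON. This is a sketch under assumptions: `merge_jsonl_shards` is a hypothetical helper, and it expects the shards to match a local glob pattern such as `metadata*`.

```python
import glob
import json
import os


def merge_jsonl_shards(shard_pattern: str, merged_path: str) -> int:
    """Concatenate JSONL shards into one file and return the record count."""
    merged_abs = os.path.abspath(merged_path)
    # Skip the merged file itself in case it already matches the pattern.
    shard_paths = sorted(
        p for p in glob.glob(shard_pattern) if os.path.abspath(p) != merged_abs
    )
    count = 0
    with open(merged_path, "w", encoding="utf-8") as out:
        for path in shard_paths:
            with open(path, encoding="utf-8") as shard:
                for line in shard:
                    line = line.strip()
                    if not line:
                        continue  # ignore blank lines between records
                    json.loads(line)  # raises ValueError on a malformed record
                    out.write(line + "\n")
                    count += 1
    return count
```

For example, `merge_jsonl_shards("metadata*", "metadata.jsonl")` run in the download directory would write a single metadata.jsonl next to the shards.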

We'll use a custom container to run the Dataflow job. Build and push the container. Make sure you set <project-id> to your own project ID.

export PROJECT_ID=<project-id>
docker build . -t gcr.io/$PROJECT_ID/dataflow-captioning:latest
docker push gcr.io/$PROJECT_ID/dataflow-captioning:latest

Run the Dataflow job. First, you'll need a service account with the Dataflow Admin, Dataflow Worker, and Compute Network User roles. You can either use the default service account or create a new one. If you are on the default network that comes with your project, you can omit --subnetwork. If you're using the default service account, you can omit --service_account_email. In the following snippet, I'm using a custom service account and a VPC network. If you're using the same --temp_location as the command below, make sure to create a bucket named $PROJECT_ID-bucket.

This job uses a T4 GPU.

python pipeline.py \
--dataset-filename gs://$BUCKET_ID/$PREFIX/dataset.txt \
--output-filename gs://$BUCKET_ID/$PREFIX/metadata.jsonl \
--runner=DataflowRunner \
--project=$PROJECT_ID \
--region=us-central1 \
--job_name=captioning \
--temp_location=gs://$PROJECT_ID-bucket/ \
--sdk_container_image=gcr.io/$PROJECT_ID/dataflow-captioning:latest \
--machine_type=n1-standard-16 \
--experiment="worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver" \
--experiment=use_runner_v2 \
--disk_size_gb=200 \
--subnetwork=https://www.googleapis.com/compute/v1/projects/$PROJECT_ID/regions/us-central1/subnetworks/jfacevedo-demo-subnet \
--service_account_email=vertex-ai@$PROJECT_ID.iam.gserviceaccount.com \
--sdk_location=container
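
Since --subnetwork and --service_account_email are optional, it can help to assemble the argument list programmatically and add those two flags only when needed. The sketch below mirrors the flags from the command above; `build_dataflow_args` is a hypothetical helper and is not part of the repo.

```python
from typing import List, Optional


def build_dataflow_args(
    project_id: str,
    bucket_id: str,
    prefix: str,
    subnetwork: Optional[str] = None,
    service_account_email: Optional[str] = None,
) -> List[str]:
    """Assemble pipeline.py arguments, skipping unset optional flags."""
    args = [
        "--dataset-filename", f"gs://{bucket_id}/{prefix}/dataset.txt",
        "--output-filename", f"gs://{bucket_id}/{prefix}/metadata.jsonl",
        "--runner=DataflowRunner",
        f"--project={project_id}",
        "--region=us-central1",
        "--job_name=captioning",
        f"--temp_location=gs://{project_id}-bucket/",
        f"--sdk_container_image=gcr.io/{project_id}/dataflow-captioning:latest",
        "--machine_type=n1-standard-16",
        "--experiment=worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
        "--experiment=use_runner_v2",
        "--disk_size_gb=200",
        "--sdk_location=container",
    ]
    # Only needed when not using the project's default network.
    if subnetwork:
        args.append(f"--subnetwork={subnetwork}")
    # Only needed when not using the default service account.
    if service_account_email:
        args.append(f"--service_account_email={service_account_email}")
    return args
```

The result can be passed to `subprocess.run(["python", "pipeline.py", *args], check=True)`. Passing a list avoids shell quoting issues with the semicolons in the accelerator experiment.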

You can view the job's progress through the Dataflow console.

Don't forget to consolidate the sharded files into one before using them for training, for example, with Stable Diffusion.

Running with multiple GPUs

If you're deploying this with multiple workers and multiple GPUs, check that your project's quota allows for it, or the job won't run efficiently. Leaving max_num_workers at its default of 100 will likely saturate the single GPU in the job above for large workloads. Instead, set it to something reasonable and increase the number of GPUs. For example:

```bash
python pipeline.py \
--dataset-filename gs://$BUCKET_ID/$PREFIX/dataset.txt \
--output-filename gs://$BUCKET_ID/$PREFIX/metadata.jsonl \
--runner=DataflowRunner \
--project=$PROJECT_ID \
--region=us-central1 \
--job_name=captioning \
--temp_location=gs://$PROJECT_ID-bucket/ \
--sdk_container_image=gcr.io/$PROJECT_ID/dataflow-captioning:latest \
--machine_type=n1-standard-16 \
--experiment="worker_accelerator=type:nvidia-tesla-t4;count:4;install-nvidia-driver" \
--experiment=use_runner_v2 \
--disk_size_gb=200 \
--subnetwork=https://www.googleapis.com/compute/v1/projects/$PROJECT_ID/regions/us-central1/subnetworks/jfacevedo-demo-subnet \
--service_account_email=vertex-ai@$PROJECT_ID.iam.gserviceaccount.com \
--sdk_location=container \
--max_num_workers=4
```
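
Before launching, it is worth working out how many GPUs the job can request at peak and comparing that with your quota. The sketch below assumes that `count` in `worker_accelerator` applies to each worker VM, so peak demand is GPUs per worker times `max_num_workers`. Both helper names are hypothetical.

```python
def accelerator_experiment(gpu_type: str, gpus_per_worker: int) -> str:
    """Build the worker_accelerator experiment value used in the command above."""
    return (
        f"worker_accelerator=type:{gpu_type};"
        f"count:{gpus_per_worker};install-nvidia-driver"
    )


def peak_gpu_demand(gpus_per_worker: int, max_num_workers: int) -> int:
    """Upper bound on GPUs requested, assuming count applies per worker VM."""
    return gpus_per_worker * max_num_workers


# Values taken from the command above: 4 T4 GPUs per worker, up to 4 workers.
experiment = accelerator_experiment("nvidia-tesla-t4", 4)
needed = peak_gpu_demand(gpus_per_worker=4, max_num_workers=4)
print(f"--experiment={experiment}")
print(f"Peak T4 GPUs to budget for in us-central1: {needed}")
```

Under that assumption, the command above can request up to 16 T4 GPUs, so the regional T4 quota should be at least that high.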
