PyTorch implementation of Image Captioning Paper
- python
- pytorch
- pytorch-vision
- pillow
- nltk
- pickle
- cuda (highly recommended)
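A minimal environment can be set up roughly as follows (pickle ships with the Python standard library; install PyTorch via the selector on pytorch.org if you need a specific CUDA build):
pip install torch torchvision pillow nltk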
Clone the repo to your local system:
git clone link-to-repo
NOTE: If you are using Colab, use !git instead.
- The dataset used is Flickr8k (download here).
- Extract the archive, then move the images to a folder named images and the text files to a folder named text.
NOTE: Place both folders alongside the Python scripts, as in the layout sketch below.
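After extraction, the project layout should look roughly like this:
```
.
├── images/          # extracted Flickr8k images
├── text/            # Flickr8k caption/text files
├── preprocess.py
├── train.py
├── validation.py
└── test.py
```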
- Run the following command (on Windows):
python preprocess.py
NOTE: All python commands in this README are for Windows. On Linux/Mac use python3 instead; on Colab use !python.
- Launch TensorBoard (optional):
tensorboard --logdir=runs
NOTE: On Colab, use the commands below instead.
%load_ext tensorboard
%tensorboard --logdir=runs
- Run the following command to train the model:
python train.py
usage: train.py [-h] [-model MODEL] [-dir DIR] [-save_epoch SAVE_EPOCH]
[-learning_rate LEARNING_RATE] [-num_epoch NUM_EPOCH]
[-hidden_dim HIDDEN_DIM] [-embedding_dim EMBEDDING_DIM]
optional arguments:
-h, --help show this help message and exit
-model MODEL Encoder CNN architecture. Default: 'resnet18', other
option is 'inception' (Inception_v3). Model dir is
automatically saved with name of model +
current_datetime.
-dir DIR Training Directory path, default: 'train'
-save_epoch SAVE_EPOCH
Epochs after which model saves checkpoint, default : 2
-learning_rate LEARNING_RATE
Adam optimizer learning rate, default : 1e-3 (0.001)
-num_epoch NUM_EPOCH Number of epochs, default : 10
-hidden_dim HIDDEN_DIM
Dimensions in hidden state of LSTM decoder, default :
512
-embedding_dim EMBEDDING_DIM
Dimensions of encoder output, default : 512
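For example, the following invocation simply spells out the documented defaults, so it is equivalent to running train.py with no arguments:
python train.py -model resnet18 -dir train -save_epoch 2 -learning_rate 0.001 -num_epoch 10 -hidden_dim 512 -embedding_dim 512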
- Run the following command to generate a caption for an image (the positional arguments follow the order shown in the usage below):
python test.py <FILENAME> <EPOCH> <MODEL_DIR>
usage: test.py [-h] [-model MODEL] [-test_dir TEST_DIR]
filename epoch model_dir
positional arguments:
filename Image filename.
epoch Number of epochs model has been trained for.
model_dir Saved model directory, which has name of format: model +
current_datetime.
optional arguments:
-h, --help show this help message and exit
-model MODEL Encoder CNN architecture. Default: 'resnet18', other
option is 'inception' (Inception_v3).
-test_dir TEST_DIR Test dataset directory name, default: 'test'.
- Run the following command to evaluate on the validation (dev) set:
python validation.py <MODEL_DIR>
usage: validation.py [-h] [--model MODEL] [--dir DIR]
[--save_epoch SAVE_EPOCH] [--num_epoch NUM_EPOCH]
model_dir
positional arguments:
model_dir Saved model directory, which has name of format: model
+ current_datetime.
optional arguments:
-h, --help show this help message and exit
--model MODEL Default: 'resnet18', other option is 'inception'
(Inception_v3).
--dir DIR Development Directory path, default: 'dev'
--save_epoch SAVE_EPOCH
Epochs after which trained model has saved checkpoint,
default : 2
--num_epoch NUM_EPOCH
Number of epochs model was trained for, default : 10
NOTE: The save_epoch and num_epoch values should match the ones used when training the corresponding model.
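For example, for a checkpoint directory trained with the default settings above (checkpoints every 2 epochs, 10 epochs in total):
python validation.py <MODEL_DIR> --save_epoch 2 --num_epoch 10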
- Title: Deep Visual-Semantic Alignments for Generating Image Descriptions
- Dated: 2015
- Authors: Andrej Karpathy, Li Fei-Fei
- University: Department of Computer Science, Stanford University
- Field: Deep Learning, NLP, CNN, Image Captioning
The main motivation behind the paper was to design a model capable of reasoning about the various parts of an image and using them to generate a rich description. This is easy for humans, even for a small child, but a machine does not inherently relate text tokens to the image itself. Such a capability would be of great help for CCTV cameras, aids for the blind, or even search engines. Earlier models tackling the same task exist, but they rely on hard-coded visual concepts and sentence templates, which limits their scope. The two main contributions of this model are:
- A deep neural network model that learns the latent alignment between segments of sentences and the regions of the image that they describe, associating the two modalities through a common multimodal embedding.
- A multimodal Recurrent Neural Network architecture that takes an input image and generates its textual description, surpassing retrieval-based baselines and producing sensible qualitative predictions.
The model consists of an encoder and a decoder that communicate through a latent space vector. Essentially, we map an image into a latent space via the encoder and map that latent representation to the sentence space via the decoder.
We need to feed the image to the text generator as a fixed-size vector, hence a convolutional neural network is used to encode the images. Transfer learning is preferred here, with a network pretrained on the ImageNet dataset, to cater to the constraints on resources and computation. Torchvision provides many pretrained models, and any of them can be used, e.g. ResNet, AlexNet, Inception_v3 or DenseNet. We remove the last softmax layer, which is there for classification, as we only need an encoded vector, usually of size 512 or 1024.
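As an illustration, a minimal encoder along these lines might look like the sketch below. This is a sketch of the idea, not necessarily the exact module used in this repo; the default embedding_dim of 512 matches the -embedding_dim flag of train.py.
```python
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Sketch of an encoder: a pretrained ResNet-18 with the final
    classification (softmax) layer replaced by a linear projection
    to the desired embedding dimension."""

    def __init__(self, embedding_dim=512):
        super().__init__()
        resnet = models.resnet18(pretrained=True)   # transfer learning from ImageNet
        # Keep everything except the last fully-connected classification layer.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for param in self.backbone.parameters():
            param.requires_grad = False             # freeze the pretrained weights
        self.fc = nn.Linear(resnet.fc.in_features, embedding_dim)

    def forward(self, images):
        features = self.backbone(images)            # (B, 512, 1, 1)
        features = features.flatten(start_dim=1)    # (B, 512)
        return self.fc(features)                    # (B, embedding_dim)
```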
This part is a Recurrent Neural Network with LSTM (Long Short-Term Memory) cells. In the paper, a Bidirectional Recurrent Neural Network (BRNN) is used to compute the word representations. The BRNN takes a sequence of N words (encoded in a 1-of-k representation) and transforms each one into an h-dimensional vector. However, the representation of each word is enriched by a variably-sized context around that word. Using the index t = 1 ... N to denote a word's position in the sentence, the precise form of the BRNN is as follows (here f is the ReLU activation function, hence f(x) = max(0, x)):
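(Equations reproduced from the paper; $\mathbb{I}_t$ is the 1-of-k indicator column vector of the t-th word, $W_w$ is the word embedding matrix, and $h_t^f$, $h_t^b$ are the forward and backward hidden states.)

$$
\begin{aligned}
x_t &= W_w \mathbb{I}_t \\
e_t &= f(W_e x_t + b_e) \\
h_t^f &= f(e_t + W_f h_{t-1}^f + b_f) \\
h_t^b &= f(e_t + W_b h_{t+1}^b + b_b) \\
s_t &= f\left(W_d (h_t^f + h_t^b) + b_d\right)
\end{aligned}
$$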
Since the supervision is at the level of entire images and sentences, the strategy is to formulate an image-sentence score as a function of the individual region-word scores. Intuitively, a sentence-image pair should have a high matching score if its words have confident support in the image. The dot product between the i-th region vector v_i and the t-th word vector s_t is used as a measure of similarity and defines the score between image k and sentence l as:
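(Score as given in the paper; the simplified form on the right is the one actually used for training.)

$$
S_{kl} = \sum_{t \in g_l} \sum_{i \in g_k} \max\left(0,\, v_i^{T} s_t\right)
\qquad \Longrightarrow \qquad
S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^{T} s_t
$$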
Here, g_k is the set of image fragments in image k and g_l is the set of sentence fragments in sentence l. The indices k, l range over the images and sentences in the training set.
- Training time (model = resnet18):
Epoch : 0 , Avg_loss = 3.141907, Time = 9.89 mins
Epoch : 1 , Avg_loss = 2.978030, Time = 9.89 mins
Epoch : 2 , Avg_loss = 2.879061, Time = 9.88 mins
Epoch : 3 , Avg_loss = 2.800483, Time = 9.90 mins
Epoch : 4 , Avg_loss = 2.734463, Time = 9.88 mins
Epoch : 5 , Avg_loss = 2.676081, Time = 9.90 mins
Epoch : 6 , Avg_loss = 2.625130, Time = 9.89 mins
Epoch : 7 , Avg_loss = 2.579518, Time = 9.90 mins
Epoch : 8 , Avg_loss = 2.538572, Time = 9.90 mins
Epoch : 9 , Avg_loss = 2.501715, Time = 9.90 mins
NOTE: The plotted loss is the total loss over the dataset, which covers 6000 × 5 = 30000 captions; the mean loss is the plotted loss divided by 30000.
- Validation time (model = resnet18):
Epoch   Avg_loss
0       3.007883
2       3.035437
4       3.088141
6       3.161836
8       3.253206
NOTE: Train for more than 50 epochs for optimal results. Training is highly GPU-intensive!