The main idea behind this research is to try improving the Image captioning results, the change that I made is using BERT token embeddings for the text (captions) and I used Inception_resnet_v2 for the images and I successfully implemented them. Also, It is proposed to use BERTScore[2] for captions evaluation, but this is not included in the implementation.
Dataset: I downloaded it using torrent because it's quite big and the provided forum for downloading it is not working.
Presentarion: https://docs.google.com/presentation/d/10kZi5rQVZ-Tkrt5IQh_1J5iZoWbjTUtJMPaCpHAkMlk/edit#slide=id.g57487e59e2_0_26
Documentation: https://docs.google.com/document/d/1W1FD2-nDMSW6x4HnuopuOBJO4qPl3-lSSklyPb5dRiQ/edit