This is a code release for captioning videos with Neuraltalk2. We provide a way to extract deep VGG-16 image features, detect shot boundaries from those features, fine-tune the MS-COCO model, annotate the key frames, and attach the captions back to the video sequence. A sample output can be found here on YouTube.
- Caffe: https://github.com/BVLC/caffe
- neuraltalk2: https://github.com/karpathy/neuraltalk2
- ffmpeg: https://www.ffmpeg.org/
Below we show an example of generating captions for a test video sequence. (The test video is part of the 'santa' video shown on YouTube.) You should be able to replicate the workflow with your own video.
We first extract the frames from the video using ffmpeg. A few parameters need to be set up: '-ss' gives the start time, '-t' the duration you want to process, and '-i' the input video (replace the directory and file name with your own). '-r' sets the frame rate; here we use 5 frames per second, so the 30-second clip yields roughly 150 frames. Finally, specify the name pattern for the extracted image sequence in a new directory.
$ ffmpeg -ss 00:00:00 -t 00:00:30 -i YOUR_WORKING_DIRECTORY/data/test.mp4 -r 5.0 YOUR_WORKING_DIRECTORY/data/santa/img/s%4d.jpg
Besides the Caffe package itself, we use the pre-trained VGG-16 model, a very deep convolutional network with 16 weight layers. You should download its weights and layer configuration into your Caffe directory.
Now you can extract visual features from the video frames. We provide a script called 'caffe_feat.py' for that. Open the file, change 'caffe_root' and 'input_path' to your own paths, and then run:
python caffe_feat.py
It will generate a feature file called 'feat.txt' in svm-light format in the 'input_path' folder.
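For reference, here is a minimal sketch of what the feature extraction looks like with pycaffe. The model file names, the 'fc7' layer, and the preprocessing constants are assumptions for illustration; 'caffe_feat.py' contains the actual settings.

```python
# Minimal sketch of VGG-16 feature extraction with pycaffe, writing one
# feature vector per frame to feat.txt in SVM-light format.
# Model file names, the 'fc7' layer and the preprocessing constants are
# assumptions for illustration; see caffe_feat.py for the actual settings.
import os
import glob
import numpy as np
import caffe

caffe_root = '/path/to/caffe/'           # change to your Caffe directory
input_path = '/path/to/data/santa/img/'  # frames extracted by ffmpeg

net = caffe.Net(caffe_root + 'models/vgg16/VGG_ILSVRC_16_layers_deploy.prototxt',
                caffe_root + 'models/vgg16/VGG_ILSVRC_16_layers.caffemodel',
                caffe.TEST)
net.blobs['data'].reshape(1, 3, 224, 224)          # one frame at a time

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))       # HxWxC -> CxHxW
transformer.set_mean('data', np.array([103.939, 116.779, 123.68]))  # BGR mean
transformer.set_raw_scale('data', 255)             # [0,1] -> [0,255]
transformer.set_channel_swap('data', (2, 1, 0))    # RGB -> BGR

with open(os.path.join(input_path, 'feat.txt'), 'w') as out:
    for i, frame in enumerate(sorted(glob.glob(os.path.join(input_path, '*.jpg')))):
        net.blobs['data'].data[...] = transformer.preprocess('data', caffe.io.load_image(frame))
        net.forward()
        feat = net.blobs['fc7'].data[0].flatten()
        # SVM-light line: <label> <index>:<value> ... (1-based, sparse)
        out.write(str(i) + ' ' +
                  ' '.join('%d:%.6f' % (j + 1, v) for j, v in enumerate(feat) if v != 0) +
                  '\n')
```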
Now that we have extracted visual features from all frames in the video, we can find the key frames that separate the video shots. Change to your working directory and run the script below.
python caption.py 'YOUR_WORKING_DIRECTORY' 'genKeyframes'
A group of key frames will be stored under 'YOUR_WORKING_DIRECTORY/key/'.
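As a rough idea of how key frames can be found, the sketch below marks a shot boundary whenever the cosine distance between consecutive frame features jumps above a threshold, and keeps the first frame of each shot. This is an illustration under assumed details (SVM-light parsing, 4096-dim fc7 features, a 0.3 threshold), not the exact logic of 'caption.py'.

```python
# Sketch of a simple shot-boundary detector: a new shot starts whenever the
# cosine distance between consecutive frame features exceeds a threshold, and
# the first frame of each shot is kept as a key frame.
# The SVM-light parser, 4096-dim features and 0.3 threshold are assumptions.
import numpy as np

def load_svmlight(path, dim=4096):
    feats = []
    with open(path) as f:
        for line in f:
            vec = np.zeros(dim)
            for tok in line.split()[1:]:           # skip the leading label
                idx, val = tok.split(':')
                vec[int(idx) - 1] = float(val)
            feats.append(vec)
    return np.array(feats)

def key_frames(feats, threshold=0.3):
    keys = [0]                                     # the first frame starts a shot
    for i in range(1, len(feats)):
        a, b = feats[i - 1], feats[i]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if 1.0 - cos > threshold:                  # large change -> shot boundary
            keys.append(i)
    return keys

feats = load_svmlight('data/santa/img/feat.txt')
print(key_frames(feats))                           # indices of detected key frames
```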
Next, use the tools from Neuraltalk2 to generate captions for the key frames. Find the path to your installed Neuraltalk2 package and run 'eval.lua' as below:
th eval.lua -model /YOUR_NEURALTALK2_MODEL_PATH/model_coco.t7 -image_folder YOUR_WORKING_PATH/data/santa/key -num_images -1 > caplog.txt
Here, we redirect the output into a log file to store the captions; this is a workaround (hack) for the 'vis.json' file that Neuraltalk2 generates by default. We then make one additional edit to the log before creating the .srt file: open 'caplog.txt', remove the header and footer lines, and keep only the caption information, which looks like the following (a small parsing sketch follows the example):
cp "/homeappl/home/gcao/tmp/Video-Caption/data/santa/key/s0001.jpg" vis/imgs/img1.jpg
image 1: a black and white photo of a car parked on the side of the road
evaluating performance... 1/-1 (0.000000)
cp "/homeappl/home/gcao/tmp/Video-Caption/data/santa/key/s0037.jpg" vis/imgs/img2.jpg
image 2: an airplane is parked on the tarmac at an airport
evaluating performance... 2/-1 (0.000000)
cp "/homeappl/home/gcao/tmp/Video-Caption/data/santa/key/s0013.jpg" vis/imgs/img3.jpg
image 3: a car is parked on the side of the road
evaluating performance... 3/-1 (0.000000)
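If you would rather not edit the log by hand, a few lines of Python can pull out the same (key frame, caption) pairs; the regular expressions below assume exactly the log format shown above.

```python
# Sketch: pull (key-frame file, caption) pairs out of caplog.txt,
# assuming the 'cp "...jpg" ...' and 'image N: ...' lines shown above.
import re

pairs = []
with open('caplog.txt') as f:
    frame = None
    for line in f:
        m = re.search(r'cp "([^"]+\.jpg)"', line)
        if m:
            frame = m.group(1)
            continue
        m = re.match(r'image \d+: (.+)', line.strip())
        if m and frame is not None:
            pairs.append((frame, m.group(1)))
            frame = None

for frame, caption in pairs:
    print(frame, '->', caption)
```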
Next, we create a subtitle file whose time stamps correspond to the video. Below is how we do that.
python caption.py 'YOUR_WORKING_PATH/' 'genSrt'
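Conceptually, 'genSrt' maps each key frame index back to a time stamp (frame index divided by the 5 fps extraction rate) and writes standard .srt entries. The sketch below illustrates this with the captions from the example log; the fixed 2-second display time and the helper function are assumptions, and the real logic lives in 'caption.py'.

```python
# Sketch: turn (frame index, caption) pairs into SRT entries. With frames
# extracted at 5 fps, frame s0037.jpg starts at (37 - 1) / 5 = 7.2 s.
# The 2-second display time per caption is an assumption.
def srt_time(seconds):
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return '%02d:%02d:%02d,%03d' % (h, m, s, ms)   # SRT format HH:MM:SS,mmm

fps = 5.0
captions = [(1, 'a black and white photo of a car parked on the side of the road'),
            (13, 'a car is parked on the side of the road'),
            (37, 'an airplane is parked on the tarmac at an airport')]

with open('santa.srt', 'w') as out:
    for n, (idx, text) in enumerate(sorted(captions), 1):
        start = (idx - 1) / fps
        end = start + 2.0                          # each caption shown for 2 s
        out.write('%d\n%s --> %s\n%s\n\n' % (n, srt_time(start), srt_time(end), text))
```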
Finally, we can attach the captions to the video and view the result in 'capped.mp4'.
ffmpeg -i YOUR_WORKING_PATH/data/test.mp4 -vf subtitles=santa.srt capped.mp4
Voila! Now you can caption your videos with Neuraltalk2. Note that the subtitles you generated come from the pre-trained model provided by Karpathy; you can follow the Neuraltalk2 instructions to train new language models. In the future, we may update this with our own model.