This project aims to create a prototype video search engine that returns similar videos based on several different methods of measuring similarity.
1. Color based similarity - color_similarity.ipynb
A faster, GPU-based implementation of k-means clustering is used to extract the dominant colors.
I used Facebook's faiss GPU implementation and compared it with scikit-learn's vanilla k-means. Below is a table of runtimes.
Number of clusters (k) | Runtime on CPU, sklearn (s) | Runtime on GPU, faiss (s) |
---|---|---|
5 | 4 | 0.3 |
10 | 18 | 0.3 |
20 | 45 | 0.3 |
50 | 141 | 0.3 |
As we can see, the GPU gives a massive performance boost in this case, from roughly 13x at k = 5 to nearly 500x at k = 50. sklearn's runtime grows rapidly with the number of clusters, while the faiss GPU runtime stays essentially flat. Although the two algorithms are not exactly the same, the quality of the results is comparable, and the speedup is worth the minor loss in accuracy.
PS: In all experiments, the runtime includes the time to load the image and cluster it.
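For reference, below is a minimal sketch of the kind of benchmark behind the table above, assuming a single 200x200 frame on disk (the file name, k-means iteration count, and sklearn settings are assumptions, not the notebook's exact configuration). `faiss.Kmeans(..., gpu=True)` requires the faiss-gpu build.

```python
import time

import cv2
import numpy as np
import faiss
from sklearn.cluster import KMeans


def load_pixels(image_path, size=200):
    """Load an image, resize it to size x size, and flatten it into an (N, 3) float32 array."""
    img = cv2.resize(cv2.imread(image_path), (size, size))
    return img.reshape(-1, 3).astype(np.float32)


def sklearn_kmeans(pixels, k):
    return KMeans(n_clusters=k, n_init=10).fit(pixels).cluster_centers_


def faiss_kmeans(pixels, k):
    # faiss runs k-means on the GPU when gpu=True (requires the faiss-gpu build)
    km = faiss.Kmeans(d=pixels.shape[1], k=k, niter=20, gpu=True)
    km.train(pixels)
    return km.centroids


pixels = load_pixels("frame.jpg")  # hypothetical example frame
for k in (5, 10, 20, 50):
    start = time.time()
    sklearn_kmeans(pixels, k)
    mid = time.time()
    faiss_kmeans(pixels, k)
    end = time.time()
    print(f"k={k:3d}  sklearn: {mid - start:6.2f}s  faiss-gpu: {end - mid:6.2f}s")
```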
We then perform a similarity search using these dominant colors to find similarly colored images, and by concatenating the dominant colors of multiple frames, we find similarly colored videos.
Explanation:
We get such a 1 x 100 image for each frame we cluster, i.e. we cluster a resized 200 x 200 image and create a smaller 100-pixel image (0.25% of the original pixels) that contains its dominant colors in descending order of dominance.
**Histogram of K-dominant colors as extracted by using k-means, ordered by cluster size**

The next step is to concatenate these 100-pixel images for each frame (sampled at n fps from a video of length L seconds) to create a new image of size (L x n) x 100 pixels that represents the entire video.
This compressed image representation is thereafter used to find similar images using either the MSE or SSIM similarity metric or by using the KNN algorithm.
PS: For the current experiments, L and n are variable, but I have used a script to extract a fixed number of frames (20) per video. Thus the new image representation is 20 x 100 x 3 [since there are 3 channels (RGB)].
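The strip construction can be sketched as follows. This is a minimal version under the assumptions that frames are already resized 200x200 BGR arrays and that strip pixels are allocated to each cluster in proportion to its size; the notebook's exact weighting may differ.

```python
import numpy as np
import faiss


def dominant_color_strip(frame, k=5, strip_len=100):
    """Cluster one resized frame's pixels and return a (1, strip_len, 3) row whose
    colors are the cluster centroids, ordered (and sized) by cluster dominance."""
    pixels = frame.reshape(-1, 3).astype(np.float32)
    km = faiss.Kmeans(d=3, k=k, gpu=True)
    km.train(pixels)
    _, labels = km.index.search(pixels, 1)            # nearest centroid per pixel
    counts = np.bincount(labels.ravel(), minlength=k)
    order = np.argsort(-counts)                       # most dominant cluster first
    widths = np.floor(counts[order] / counts.sum() * strip_len).astype(int)
    widths[0] += strip_len - widths.sum()             # give rounding leftovers to the top cluster
    strip = np.concatenate([np.tile(km.centroids[c], (w, 1))
                            for c, w in zip(order, widths)])
    return strip.reshape(1, strip_len, 3).astype(np.uint8)


def video_to_image(frames, k=5):
    """Stack one strip per frame -> an (n_frames x 100 x 3) image representing the video."""
    return np.vstack([dominant_color_strip(f, k) for f in frames])
```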
I have extracted this image representation for the videos in data/videos and will now use KNN to return the k most similar neighbors. We cluster the smaller images/features using the faiss GPU implementation instead of sklearn and store them in data/vid2img.
For 10 x 7 = 70 videos, the process took 437 s to execute, i.e. approximately 6.2 s per video. This process includes the following steps (a sketch of the frame-extraction step follows the list):
- Extracting frames from videos and writing them to disk.
- Clustering the extracted frames (20 per video).
- Converting the histogram of clusters into an image and writing to disk.
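The frame-extraction step can be sketched with OpenCV as follows. The actual script may use different tooling; uniform sampling of 20 frames follows the text, while the 200x200 resize is an assumption carried over from the clustering step.

```python
import cv2
import numpy as np


def sample_frames(video_path, n_frames=20, size=200):
    """Uniformly sample n_frames from a video and resize each to size x size."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), n_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, (size, size)))
    cap.release()
    return frames
```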
Thereafter, I ran the KNN GPU implementation on the image representations in data/vid2img.
The dimensionality is 6000, since 20 x 100 = 2000 pixels with RGB (3) values for each pixel.
Since we need a merged numpy array, we also create a merged feature matrix of shape 13220 x 6000 and store it in color_UCF_vid2img.h5. The features are stored in color_vid2imgs and the corresponding labels in vid_labels.
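A minimal sketch of how the merged matrix and labels might be written with h5py, using the file and dataset names mentioned above (the label dtype and exact layout are assumptions):

```python
import h5py
import numpy as np


def save_vid2img_features(strips, labels, out_path="color_UCF_vid2img.h5"):
    """Flatten each 20x100x3 strip to a 6000-d vector and store the merged matrix
    plus labels in a single HDF5 file (dataset names follow the text above)."""
    merged = np.stack([s.reshape(-1) for s in strips]).astype(np.float32)   # (N, 6000)
    with h5py.File(out_path, "w") as f:
        f.create_dataset("color_vid2imgs", data=merged)
        f.create_dataset("vid_labels", data=np.asarray(labels, dtype="S"))  # label strings as bytes
    return merged
```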
The runtime for the KNN search with K=3 (building the index and querying all videos) on this subset of 70 videos is 132 ms, and based on the previous experiments we can expect this to scale well.
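The search itself can be sketched with an exact (brute-force) faiss index. Whether the notebook drops the trivial self-match and whether it moves the index to the GPU are assumptions here.

```python
import h5py
import numpy as np
import faiss

with h5py.File("color_UCF_vid2img.h5", "r") as f:
    features = np.asarray(f["color_vid2imgs"], dtype=np.float32)

index = faiss.IndexFlatL2(features.shape[1])        # exact (brute-force) L2 index
# index = faiss.index_cpu_to_all_gpus(index)        # optionally move the index to GPU(s)
index.add(features)

k = 3
distances, neighbors = index.search(features, k + 1)  # +1 because each query matches itself
neighbors = neighbors[:, 1:]                           # drop the self-match
print(neighbors[:5])
```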
The class-wise accuracy for this method is not great, with a test accuracy of only 20%. However, when looking at similar videos based on color, we see that many classes share similarly colored backgrounds.
True Label | Predicted Label | % of error |
---|---|---|
Skiing | Horse Riding | 32.7 |
Skiing | Skijet | 44.1 |
Skiing | Swing | 29.4 |
As we can see, there are some issues with this approach, but it is able to capture some color features. For example, one of the swimming classes had a test accuracy of 82%.
2. Feature based similarity - feature_similarity.ipynb
The extract_features.py script (PyTorch) extracts CNN features/embeddings using a pre-trained ResNet50 model; the feature vector is 2048-dimensional. Since the UCF101 dataset has a median video length of 8 seconds, we sample 8 frames per video.
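A minimal sketch of this kind of per-frame extraction with torchvision's ResNet50; the actual extract_features.py may differ in preprocessing and batching.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# ResNet50 with the final classifier replaced by identity -> 2048-d output per frame
model = models.resnet50(pretrained=True)
model.fc = torch.nn.Identity()
model.eval().to(device)

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def frame_embedding(image_path):
    """Return a 2048-d feature vector for one frame."""
    img = Image.open(image_path).convert("RGB")
    batch = preprocess(img).unsqueeze(0).to(device)
    return model(batch).squeeze(0).cpu().numpy()
```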
The KNN search with K=3 over 1000 image feature vectors takes 94.9 ms. We extract feature vectors for all 13320 videos, with 8 uniformly sampled frames per video. The process takes nearly 1 hour and the results are stored in features_UCF_resnet50.h5, which holds an 8 x 2048 matrix for each video with the video path as the key (data/UCF101/<video_name>.avi).
Since we need a merged numpy array, we also create a merged feature matrix of shape 106557 x 2048 and store it in merged_features_UCF_resnet50.h5. The features are stored in feature_vectors and the corresponding labels in feature_labels.
Performing the KNN similarity search with K=3 takes 18 seconds. (This includes 100,000 queries as well, one for each frame).
The class-wise accuracy is then calculated and comes out to 96.4%.
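One way to compute such a class-wise accuracy from the KNN output is sketched below, assuming a majority vote over the K neighbors and an average of the per-class accuracies; the notebook's exact metric definition may differ (e.g. per-frame vs per-video evaluation).

```python
from collections import Counter

import numpy as np


def knn_classwise_accuracy(neighbor_ids, labels):
    """Predict each query's class by majority vote over its K nearest neighbors,
    then average the per-class accuracies.

    neighbor_ids: (N, K) integer array from the faiss search.
    labels:       length-N array of ground-truth class labels.
    """
    labels = np.asarray(labels)
    preds = np.array([Counter(labels[row]).most_common(1)[0][0] for row in neighbor_ids])
    per_class = [np.mean(preds[labels == c] == c) for c in np.unique(labels)]
    return float(np.mean(per_class))
```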
KNN feature similarity tests:-
1. Validation using train-test split (70:30):
By taking a 70:30 train-test split, we still get a class-wise accuracy of 91.6% over the 101 classes (compared to 96.4% when using all of the data), which is still good!
Moreover, the runtime drops from 17.3 seconds to 3.9 seconds as the number of queries drops from 100,000 to 30,000.
2. Number of queries = 100, 10, 1
The time reduces to approx 0.2 seconds for 100, 10 and 1 query. This will allow us to run our KNN similarity search in real-time!
No. of Queries | Runtime (s) |
---|---|
1 | 0.2 |
10 | 0.2 |
100 | 0.2 |
3. With K = 3, 10, 100, 200
Value of K | Runtime (s) |
---|---|
3 | 0.2 |
10 | 0.2 |
100 | 0.2 |
200 | 0.2 |
As we can see, the KNN similarity search scales well with the value of K. A single query over the 100,000 datapoints of dimensionality 2048 takes approximately 0.2 s for each of K = 3, 10, 100, and 200. Still runnable in real time!
PS: Only 3.2% of the values on the diagonal of the confusion matrix are less than 0.9 (even though the overall accuracy is 91.6%).
True Label | Predicted Label | % of error |
---|---|---|
Rock Climbing Indoor | Biking | 5.2 |
Kayaking | Rafting | 6.9 |
Rock Climbing Indoor | Skiing | 5.4 |
Kayaking | Skijet | 9.2 |
3. Sound based similarity - sound_similarity.ipynb
The work is based on the audioset dataset and the VGGish model trained by Google (TensorFlow). The pipeline subsamples the audio to a standard form and then creates a log mel-spectrogram of size (96, 64) for each second. This is fed into the pre-trained VGGish model, which returns a 128-dimensional embedding for each second of audio. It is important to note that all audio clips in this dataset are 10 seconds long. We use the balanced train split of the audioset data, which comprises 21,782 audio clips, to test performance.
The embeddings from the .tfrecord files are read from disk and preprocessed. Since we need a merged numpy array, we created a merged audio embedding matrix of shape 21,782 x 10 x 128 and stored it in audioset_balanced_features_vggish.h5. The features are stored in audio_embeddings and the corresponding label(s) for each audio clip are stored in audioset_balanced_labels.npy.
Steps for extracting audio features (a sketch follows this list):
- Extract the audio signal from the video (if it has one) and save it as a wav (raw audio) file.
- Subsample the audio signal at a fixed standard rate.
- Build a log mel spectrogram (one per second) after applying corrections to it.
- Pass the log mel spectrograms to the pre-trained VGGish model and extract a 128-dimensional audio embedding per second.
- Store these in an HDF5 file.
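A hedged sketch of these steps, assuming ffmpeg is available and that the vggish_input / vggish_slim / vggish_params modules and the vggish_model.ckpt checkpoint from Google's VGGish release are on the path; the TF1-style session plumbing may need adjusting for newer TensorFlow versions.

```python
import subprocess

import numpy as np
import tensorflow.compat.v1 as tf

# Modules shipped with Google's VGGish release (tensorflow/models research repo)
import vggish_input
import vggish_params
import vggish_slim

tf.disable_v2_behavior()


def extract_wav(video_path, wav_path, sample_rate=16000):
    """Pull the audio track out of a video as a mono wav at a fixed sample rate."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ac", "1",
                    "-ar", str(sample_rate), wav_path], check=True)


def vggish_embeddings(wav_path, checkpoint="vggish_model.ckpt"):
    """Return one 128-d VGGish embedding per second of audio in the wav file."""
    examples = vggish_input.wavfile_to_examples(wav_path)   # (n_seconds, 96, 64) log mels
    with tf.Graph().as_default(), tf.Session() as sess:
        vggish_slim.define_vggish_slim(training=False)
        vggish_slim.load_vggish_slim_checkpoint(sess, checkpoint)
        features = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
        embedding = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)
        [out] = sess.run([embedding], feed_dict={features: examples})
    return np.asarray(out)                                   # (n_seconds, 128)
```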
Only 51.3% of the videos in the UCF dataset (covering 51 of the 101 classes) have an audio channel, and we use only these for the KNN similarity search.
We repeat the transfer-learning process as before to extract audio features. Extracting the set of per-second audio embeddings for a video takes around 1 second, which includes the time to extract the audio from the video and then extract the embeddings from the audio. We store these audio embeddings in an HDF5 file, audio_sec_UCF_vggish.h5.
Performing the KNN similarity search with K=3 takes 0.5 seconds. (This includes 20,000 queries as well, one for each audio clip).
No. of Queries | Runtime (s) |
---|---|
1 | 0.16 |
10 | 0.16 |
100 | 0.16 |
Value of K | Runtime (s) |
---|---|
3 | 0.16 |
10 | 0.16 |
100 | 0.16 |
200 | 0.16 |
When using 100 percent of the data, we get a class-wise accuracy of 78% on the UCF101 data.
After taking a 70:30 train-test split, the class-wise test accuracy on UCF101 (the 51 of 101 classes that have an audio channel) is 63.5%.
PS: The class-wise accuracy for AudioSet (448 classes, assuming only the first label) is 66.3%, and drops to 47.3% with a 70:30 train-test split.
Misclassification errors for KNN above 10 percent:
True Label | Predicted Label | % of error |
---|---|---|
Kayaking | Balance Beam | 11.1 |
Baby Crawling | Basketball Dunk | 11.3 |
Front Crawl | Basketball Dunk | 10.2 |
Boxing Punching Bag | Bowling | 10 |
4. 3D feature based similarity - cnn3d_similarity.ipynb
Feature similarity in 3D is similar (pun intended) to 2D feature similarity. However, the layers in the CNN model are 3D, meaning the convolution, batch normalization, and max pooling are all done in three dimensions. This means that a 3D CNN model can capture the spatiotemporal aspects of videos in an end-to-end pipeline.
More information about 3D convolutional neural nets is in this paper from Facebook Research and another paper, with an implementation, from a Japanese research group.
An implementation of 3D CNNs in PyTorch was forked into this repo as the submodule 3d-cnn, with some modifications.
Feature extraction was done using a pre-trained model based on ResNet-34 from this Google Drive, which also contains other pre-trained models, from the authors of the PyTorch implementation. The pre-trained models are trained using the Kinetics-400 dataset, which has 400 classes instead of the 101 in the UCF-101 dataset.
The length of the 3D CNN feature vector is 512 (the 2D CNN ResNet50 feature vector for a single frame is 2048-dimensional). The videos are sampled at 8 frames per video.
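The extraction itself is done with the forked 3d-cnn code. As an illustration only, the same idea can be sketched with torchvision's 3D ResNet-18 (also pretrained on Kinetics-400 and also 512-dimensional before the classifier); this is a stand-in, not the project's actual ResNet-34 model.

```python
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"

# torchvision's 3D ResNet-18, pretrained on Kinetics-400; its penultimate layer is 512-d
model = torchvision.models.video.r3d_18(pretrained=True)
model.fc = torch.nn.Identity()          # strip the 400-way Kinetics classifier
model.eval().to(device)


@torch.no_grad()
def clip_embedding(frames):
    """frames: float tensor of shape (3, T, H, W), e.g. T=8 sampled frames already
    resized (e.g. 112x112) and normalized with the Kinetics statistics.
    Returns a 512-d feature vector for the whole clip."""
    batch = frames.unsqueeze(0).to(device)   # (1, 3, T, H, W)
    return model(batch).squeeze(0).cpu().numpy()
```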
The total feature extraction process for 13,200 videos in the UCF-101 dataset was done in an AWS GPU instance with at least 32GB of memory, and took approximately 8 hours.
Please refer to the cnn3d_similarity.ipynb notebook for running the pipeline with extracted features.
kNN Accuracy
The class-wise accuracy using kNN with k=3 of 3D CNN features using the UCF-101 dataset is 89.7%. When the UCF-101 dataset is split into a test/train set using a 70/30 split similar to the 2D CNN feature based similarity, the accuracy drops to 74.7%.
While the k=3 accuracy for 3D CNN based features is lower than for the 2D based similarities, it is important to note that the 3D CNN features capture spatiotemporal information about the entire video instead of a single frame, and do so in an even smaller vector. A more complex 3D CNN model (such as a DenseNet) would likely give better accuracy.
Confusion Matrix
**Below is the confusion matrix for KNN on 3D CNN features**

Looking at some of the most mischaracterized labels:
True Label | Predicted Label | % error |
---|---|---|
FrontCrawl | BreastStroke | 29.3 |
Kayaking | Rafting | 22.9 |
BandMarching | MilitaryParade | 19.2 |
Surfing | Skijet | 18.6 |
HammerThrow | ThrowDiscus | 17.6 |
It looks like some of the confusion is between similar activities, such as BandMarching and MilitaryParade.
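The table above can be produced from the predictions with a row-normalized confusion matrix; a minimal sketch (assuming every class appears at least once among the true labels):

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def most_confused_pairs(y_true, y_pred, top=5):
    """Return the (true, predicted) class pairs with the highest error rate,
    i.e. the largest off-diagonal entries of the row-normalized confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    cm = confusion_matrix(y_true, y_pred, labels=classes).astype(float)
    cm /= cm.sum(axis=1, keepdims=True)        # each row sums to 1 (per true class)
    np.fill_diagonal(cm, 0)                    # ignore correct predictions
    flat_order = np.argsort(-cm, axis=None)[:top]
    rows, cols = np.unravel_index(flat_order, cm.shape)
    return [(classes[i], classes[j], round(100 * cm[i, j], 1)) for i, j in zip(rows, cols)]
```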
Useful Links:
- https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5 - Uses imagehash, a Python library, to compute 1 of 4 different hashes and use the hashes for comparison.