A Two-Branch DCNN model for Face Recognition in low-resolution images (24x24) to determine employee presence. An implementation of the paper - "Low Resolution Face Recognition for Employee Detection" by Hrishikesh Kusneniwar and Arsalan Malik, 2021.
- Python - 3.8
- Tensorflow - 2.5.0
- Keras - 2.4.3
The training was performed on a laptop with an Nvidia GTX 1060 Max-Q GPU and an Intel Core i7-8750H CPU. The time taken to train on the dataset of 209 subjects for 45 epochs is given below.
- FERET - https://www.nist.gov/itl/products-and-services/color-feret-database
- Georgia Tech Face Database - http://www.anefian.com/research/face_reco.htm
- KomNET Face Dataset - https://data.mendeley.com/datasets/hsv83m5zbb/2
subjects - 111
total images - 1665 (15 per subject)
train images - 1110 (10 per subject)
val images - 222 ( 2 per subject)
test images - 333 ( 3 per subject)
subjects - 48
total images - 720 (15 per subject)
train images - 480 (10 per subject)
val images - 96 ( 2 per subject)
test images - 144 ( 3 per subject)
subjects - 50
total images - 750 (15 per subject)
train images - 500 (10 per subject)
val images - 100 ( 2 per subject)
test images - 150 ( 3 per subject)
In the preprocessing stage, the faces were cropped using the Python implementation of Multi-task Cascaded Convolutional Networks (MTCNN). As shown in Figure 4, MTCNN places a bounding box around the face, and the image is then cropped to that bounding box. The cropped face was resized to 160 x 160 via bicubic interpolation before being given as input to our model. These steps were performed for all images in the acquired databases, which constitute the high resolution gallery images.
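A minimal sketch of this step, assuming the `mtcnn` and Pillow packages; the file handling and the use of the first detected face are illustrative assumptions, not necessarily how this repository implements it.

```python
import numpy as np
from PIL import Image
from mtcnn import MTCNN

detector = MTCNN()

def crop_and_resize(image_path, size=(160, 160)):
    """Detect a face with MTCNN, crop to its bounding box, resize with bicubic interpolation."""
    img = np.asarray(Image.open(image_path).convert('RGB'))
    detections = detector.detect_faces(img)
    if not detections:
        return None                          # no face found in this image
    x, y, w, h = detections[0]['box']        # bounding box of the first detected face
    x, y = max(x, 0), max(y, 0)              # MTCNN can return slightly negative coordinates
    face = img[y:y + h, x:x + w]
    return Image.fromarray(face).resize(size, Image.BICUBIC)
```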
The training set consists of 2090 pairs of images, where each pair contains an HR and an LR version of the same image. After preprocessing, we took 10 face images of each subject from our combined database to form the HR images in the 2090 pairs. Their LR counterparts were created by downsampling the HR images to the desired dimensions using bicubic interpolation.
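A minimal sketch of LR pair generation under these assumptions (Pillow images, a 24 x 24 probe resolution); the LR image is resized back to 160 x 160 because both branches of the model expect 160 x 160 inputs, as described below.

```python
from PIL import Image

def make_lr(hr_face, lr_size=(24, 24), model_size=(160, 160)):
    """Downsample a 160x160 HR face to the probe resolution, then back to the model input size."""
    lr = hr_face.resize(lr_size, Image.BICUBIC)       # simulate a low-resolution probe
    return lr.resize(model_size, Image.BICUBIC)       # resize back for LRFECNN input
```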
The validation set consists of 418 pairs of HR and LR images. After preprocessing, we took 2 face images of each subject from our combined database to form the HR images in the 418 pairs. Their LR counterparts were obtained via bicubic interpolation in the same way as for the training set.
Test set 1 contains 627 LR images that were obtained by taking 3 face images of each subject from our combined database. Bicubic interpolation was used to obtain the desired low resolution dimensions.
Test set 2 contains the same 627 LR images as test set 1, as well as 600 additional LR face images of people outside our combined database of 209 subjects. These 600 'out-database' face images are obtained from Totally Looks Like Data [5]. The same preprocessing and downsampling steps described earlier were applied to these out-database images. Therefore, the total size of test set 2 is 1227 LR images.
'Sample Data' consists of the first ten subjects of the created dataset.
We make use of a two-branch architecture [1] with two DCNNs that extract features from low resolution probe images and high resolution gallery images and map them to a 512-dimensional common space. The DCNN we use is FaceNet [2], pretrained on the VGGFace2 dataset. The FaceNet model was obtained from the GitHub repository of Hiroki Taniai [3].
Fig. 1
The top branch of the model (Fig. 1) is the high resolution feature extraction convolutional neural network (HRFECNN), which takes HR face images (Ihr) as its input. The bottom branch is the low resolution feature extraction convolutional neural network (LRFECNN), which takes the corresponding LR face images (Ilr) as its input. Both the HR and LR images must be 160 x 160 before they can be fed into the two branches of the model; bicubic interpolation is used to resize an image that is not of the required dimensions. The 512-dimensional feature vectors of Ihr and Ilr are then extracted from the last layers of HRFECNN and LRFECNN respectively and mapped to the common space.
Fig. 2
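A minimal sketch of this two-branch setup, assuming the keras-facenet weights are stored as `facenet_keras.h5` (the file name is an assumption); each branch maps a 160 x 160 x 3 face image to a 512-dimensional embedding in the common space.

```python
from tensorflow.keras.models import load_model

# Two copies of the pretrained FaceNet: one per branch.
hrfecnn = load_model('facenet_keras.h5')   # top branch: HR feature extractor (kept frozen)
lrfecnn = load_model('facenet_keras.h5')   # bottom branch: LR feature extractor (trained)
hrfecnn.trainable = False

# Both branches map standardized 160x160x3 faces to the 512-d common space:
#   y_hr = hrfecnn.predict(hr_batch)   # shape (batch, 512)
#   y_lr = lrfecnn.predict(lr_batch)   # shape (batch, 512)
```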
The parameters of HRFECNN (top branch) are frozen, i.e. they are not updated during training. Therefore yhr is fixed, while LRFECNN is trained so that ylr minimizes the distance between the HR and LR mappings of the same person in the common space. The mean-squared-error (MSE) objective function is used for training the model:
Fig. 3
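For reference, a plausible form of this objective over a mini-batch of N HR/LR pairs, where y_hr^(i) and y_lr^(i) are the 512-dimensional embeddings of the i-th pair, is:

```latex
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert y_{hr}^{(i)} - y_{lr}^{(i)} \right\rVert_2^{2}
```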
FaceNet model and weights can be obtained here: https://drive.google.com/drive/folders/1vCWyI_M3KcEuOF2yuksS24bzSmPrj6VW?usp=sharing
Pairs of HR and LR images from the training set were used to train the model. The weights of LRFECNN were updated via gradient descent using the Adam optimizer with a batch size of 64 images, while the weights of HRFECNN were kept fixed. Training was run for 45 epochs, until the decrease in training loss became insignificant. The learning rate was decayed as described in Table 1.
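A training sketch under the same assumptions as the setup above: because HRFECNN is frozen, its embeddings can be precomputed once and used as fixed regression targets for LRFECNN. Here `hr_train`, `lr_train`, `hr_val`, and `lr_val` are assumed to be aligned arrays of preprocessed 160 x 160 x 3 faces, and the decay values are placeholders for the actual schedule in Table 1.

```python
import tensorflow as tf

# Fixed 512-d targets from the frozen HR branch.
y_hr_train = hrfecnn.predict(hr_train, batch_size=64)
y_hr_val = hrfecnn.predict(hr_val, batch_size=64)

# Placeholder learning-rate decay; the actual schedule is given in Table 1.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)

# Train LRFECNN so that its embeddings of LR images regress the HR embeddings.
lrfecnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=schedule), loss='mse')
lrfecnn.fit(lr_train, y_hr_train,
            batch_size=64, epochs=45,
            validation_data=(lr_val, y_hr_val))
```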
Table 1
After training, the feature vectors of the LR images in the training set were obtained from the last layer of LRFECNN. These feature vectors, together with the true labels of the images, were then used to train a Logistic Regression classifier. We used the Logistic Regression model from the Scikit-Learn library with the ’lbfgs’ solver. Training the classifier for 1020 iterations took approximately 2 seconds.
The 512-dimensional feature vectors of LR probe images were obtained from the last layer of LRFECNN. These feature vectors were then fed into the Logistic Regression classifier to obtain the predicted labels corresponding to the probe images.
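A sketch of the classification stage, continuing the variables from the sketches above; `train_labels` (true subject labels) and `lr_probes` (preprocessed LR probe images) are assumed inputs.

```python
from sklearn.linear_model import LogisticRegression

# Embeddings of the training LR images from the last layer of LRFECNN.
train_features = lrfecnn.predict(lr_train, batch_size=64)     # shape (2090, 512)

clf = LogisticRegression(solver='lbfgs', max_iter=1020)
clf.fit(train_features, train_labels)

# Embeddings of the LR probe images, classified into predicted subject labels.
probe_features = lrfecnn.predict(lr_probes, batch_size=64)
predicted_labels = clf.predict(probe_features)
```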
Fig. 4 Visualization of feature vectors of LR face images from an untrained LRFECNN
Fig. 5 Visualization of feature vectors of HR face images from HRFECNN
Fig. 6 Variation of training loss and validation loss by training LRFECNN on 24 x 24 images
Fig. 7 Variation of recognition accuracy of the Logistic Regression classifier on 24 x 24 images
Fig. 8 Visualization of feature vectors of LR probe images from LRFECNN post-training
Table 2
Table 3
Fig. 9 Configurations with different super resolution methods. Blue blocks are involved in the training phase. Super resolution via a) Bicubic Interpolation, b) EDSR [6], c) WDSR [7], d) SRGAN [8]
Table 4
Fig. 10 Variation of recognition accuracy with different super resolution methods
- E. Zangeneh, M. Rahmati, and Y. Mohsenzadeh, “Low resolution face recognition using a two-branch deep convolutional neural network architecture,” Expert Systems with Applications, vol. 139, p. 112854, 2020.
- F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
- H. Taniai, “keras-facenet,” 2018. [Online]. Available: https://github.com/nyoki-mtl/keras-facenet
- Astawa, I Nyoman Gede Arya (2020), “KomNET: Face Image Dataset from Various Media”, Mendeley Data, V2, doi:10.17632/hsv83m5zbb.2
- A. Rosenfeld, M. D. Solbach, and J. K. Tsotsos, “Totally looks like-how humans compare, compared to machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1961–1964.
- B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 136–144.
- J. Yu, Y. Fan, J. Yang, N. Xu, Z. Wang, X. Wang, and T. Huang, “Wide activation for efficient and accurate image super-resolution,” arXiv preprint arXiv:1808.08718, 2018.
- C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4681–4690.