Principle Introduction and Theoretical Support Analysis: Armor Plate Detector
This page summarizes our approach to developing a single-stage object detector, based on convolutional neural networks, for our armor plate detection algorithm.
Our detector project is based on the following research/reports:
- Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2015), You Only Look Once: Unified, Real-Time Object Detection
- Joseph Redmon, Ali Farhadi (2016), YOLO9000: Better, Faster, Stronger
- Joseph Redmon, Ali Farhadi (2018), YOLOv3: An Incremental Improvement
- Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi (2016), XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
- Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht (2017), The Marginal Value of Adaptive Gradient Methods in Machine Learning
Our project is mainly derived from YOLO (You Only Look Once), a state-of-the-art single-stage object detector. The overarching idea behind all three versions of YOLO is to use FC (Fully Convolutional) layers to generate bounding box predictions in a single forward pass of the network, which makes the algorithm much more efficient than previous approaches such as Fast R-CNN and Faster R-CNN.
Figure 1: Bounding box prediction of YOLO (source: https://towardsdatascience.com/review-yolov3-you-only-look-once-object-detection-eab75d7a1ba6)
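As a rough illustration of the bounding box prediction in Figure 1, the sketch below decodes one raw network output into a normalized box using the parameterization from the YOLOv2/YOLOv3 papers. The function name, argument names, and the choice to express anchors in grid-cell units are assumptions made for this example and are not taken from our darknet fork.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, used to keep the box center inside its grid cell."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, grid_w, grid_h):
    """Decode raw outputs (tx, ty, tw, th) into a box normalized to [0, 1].

    cell_x, cell_y     -- index of the grid cell that made the prediction
    anchor_w, anchor_h -- anchor (prior) size in grid-cell units (assumption)
    grid_w, grid_h     -- number of grid cells along each axis
    """
    bx = (cell_x + sigmoid(tx)) / grid_w    # center x, offset within the cell
    by = (cell_y + sigmoid(ty)) / grid_h    # center y, offset within the cell
    bw = anchor_w * math.exp(tw) / grid_w   # width scales the anchor width
    bh = anchor_h * math.exp(th) / grid_h   # height scales the anchor height
    return bx, by, bw, bh


# Hypothetical example: zero offsets in cell (3, 2) of a 13x13 grid.
box = decode_box(0.0, 0.0, 0.0, 0.0, 3, 2, 1.5, 1.0, 13, 13)
print(box)
```

With all raw outputs at zero, the center falls in the middle of the cell and the box takes the size of its anchor, which is why anchors that resemble typical armor plate shapes would make the regression easier.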
Furthermore, our proposed method uses a set of recurrent layers to integrate temporal understanding, which improved overall performance. To optimize inference speed, the encoder aggressively downsamples the input RGB image into a small latent space. Lastly, XNOR-Net's binary-weight operations and FP16 inference on Tensor Cores further reduce computational cost.
Our proposed model is shown below:
Figure 2: High-level diagram of the proposed model. The encoder aggressively downsamples the input image, greatly reducing its spatial size while increasing its dimensionality. The recurrent embedding in the latent space allows the model to learn temporal information, while the embedding at the input layer provides high-resolution input integrated over time. Predictions are generated by the YOLO-FC layers, both at the encoded latent space and after an upsampling layer.
Our encoder is an 11-layer CNN. In the initial stage, 9x9 kernels with a stride of 4 aggressively downsample the input into a much smaller spatial domain. The RCNN layer, a recurrent layer that operates on convolutional feature maps, provides temporal information at both the input and the latent space. Upsampling is added at the end to distribute FLOPs across the model while allowing the YOLO layer to receive high-resolution input for smaller objects.
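To make the effect of the 9x9, stride-4 stage concrete, the sketch below applies the standard convolution output-size formula. The 416-pixel input, the padding of 4, and the follow-up 3x3 stride-2 layer are hypothetical values chosen for illustration; the page does not state the actual input resolution or padding.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(size: int, layers):
    """Apply a list of (kernel, stride, padding) layers and record each output size."""
    sizes = []
    for kernel, stride, padding in layers:
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


# Hypothetical 416x416 input through a 9x9 stride-4 layer (padding 4 assumed),
# followed by an assumed 3x3 stride-2 layer.
print(trace_shapes(416, [(9, 4, 4), (3, 2, 1)]))  # -> [104, 52]
```

Under these assumptions the first layer alone reduces the number of spatial positions by a factor of 16, which is where most of the early savings in FLOPs would come from.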
Our model was trained on two RTX 2080 Ti GPUs in parallel. The model is first trained with the Adam optimizer to reduce overall training time, and is later fine-tuned with SGD with momentum for improved generalization (Ashia C. Wilson et al.). For augmentation, saturation, exposure, and hue adjustments were used to strengthen the model against overfitting. For the loss, we used the same loss as YOLO V3/V2:
Figure 3: Optimization loss for YOLO (source: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b)
The first line of the loss computes the squared L2 distance (squared error) between the model's predicted centroids (x, y) and the ground truth, while the second line computes the squared L2 distance between the predicted width and height of each bounding box and the ground truth. The last three lines compute the logistic loss of the model's predictions of whether a given region does or does not contain an object, and of the corresponding class probabilities.
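The structure described above can be sketched for a single grid cell as follows. This is a minimal illustration of the terms as this page describes them, not the darknet implementation; the dictionary layout and the weights `lambda_coord=5.0` and `lambda_noobj=0.5` (the values from the original YOLO paper) are assumptions.

```python
import math


def bce(p: float, y: float, eps: float = 1e-7) -> float:
    """Logistic (binary cross-entropy) loss for probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def yolo_cell_loss(pred, target, lambda_coord=5.0, lambda_noobj=0.5):
    """Loss for one cell/anchor. pred and target are dicts with keys
    x, y, w, h, obj (objectness) and cls (list of class probabilities)."""
    if target["obj"] == 1:
        # Lines 1 and 2: squared error on the center and on the box size.
        center = (pred["x"] - target["x"]) ** 2 + (pred["y"] - target["y"]) ** 2
        size = (pred["w"] - target["w"]) ** 2 + (pred["h"] - target["h"]) ** 2
        # Remaining lines: logistic loss on objectness and class probabilities.
        objectness = bce(pred["obj"], 1.0)
        classes = sum(bce(p, t) for p, t in zip(pred["cls"], target["cls"]))
        return lambda_coord * (center + size) + objectness + classes
    # Cells without an object contribute only a down-weighted objectness term.
    return lambda_noobj * bce(pred["obj"], 0.0)


truth = {"x": 0.5, "y": 0.5, "w": 0.2, "h": 0.1, "obj": 1, "cls": [1.0, 0.0]}
guess = {"x": 0.6, "y": 0.4, "w": 0.25, "h": 0.1, "obj": 0.8, "cls": [0.7, 0.2]}
print(yolo_cell_loss(guess, truth))
```

Localization terms are only computed for cells that contain an object, while empty cells contribute just the objectness term, which keeps the two kinds of error separate.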
It is important to note that the loss computes the classification loss and the localization loss separately, which reduces the need to resolve positive-negative class imbalance with techniques such as hard sample mining or focal loss. The model with the best validation loss during training was saved as the checkpoint for inference.
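The two-phase optimizer schedule used in training (Adam first, then SGD with momentum) can be illustrated on a toy problem. The sketch below minimizes a simple quadratic in pure Python; the objective, learning rates, and step counts are hypothetical and unrelated to our actual training configuration.

```python
import math


def loss(w, target):
    """Toy objective: squared distance between the weights and a target."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def grad(w, target):
    """Gradient of the toy objective."""
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]


def adam_phase(w, target, steps, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Phase 1: Adam, for fast initial progress."""
    m = [0.0] * len(w)
    v = [0.0] * len(w)
    for t in range(1, steps + 1):
        g = grad(w, target)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]  # bias-corrected first moment
        v_hat = [vi / (1 - b2 ** t) for vi in v]  # bias-corrected second moment
        w = [wi - lr * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w


def sgd_momentum_phase(w, target, steps, lr=0.05, momentum=0.9):
    """Phase 2: SGD with momentum, used here as the fine-tuning optimizer."""
    velocity = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w, target)
        velocity = [momentum * vi - lr * gi for vi, gi in zip(velocity, g)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return w


def train(w0, target, adam_steps=50, sgd_steps=200):
    """Run both phases and report the loss after each one."""
    w = adam_phase(list(w0), target, adam_steps)
    loss_after_adam = loss(w, target)
    w = sgd_momentum_phase(w, target, sgd_steps)
    return loss_after_adam, loss(w, target), w


print(train([5.0, -3.0], [1.0, 2.0]))
```

The optimizer state is not carried over between phases: the SGD phase starts from the weights Adam produced but with a fresh velocity, which mirrors restarting training from a saved checkpoint with a different optimizer.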
Neural networks are normally trained in FP32. A popular approach to increasing inference speed and reducing model size is quantization: instead of FP32, the model computes in FP16, INT8, or even INT4, which creates a trade-off between inference speed and accuracy. XNOR-Net takes this approach to its extreme by using binary values as the model's main datatype; the paper reports up to 58x faster convolutional operations and 32x memory savings on a CPU-based platform. Inspired by this, we use XNOR-based quantization in certain layers to further reduce computational cost.
Figure 4: NVIDIA's diagram comparing inference speed of different datatypes (source: https://devblogs.nvidia.com/deploying-deep-learning-nvidia-tensorrt/)
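The core trick behind XNOR-based layers can be sketched in a few lines: once weights and inputs are reduced to their signs, a dot product becomes a bitwise XNOR followed by a population count. The code below is an illustrative sketch, not the darknet kernel; the function names are hypothetical, and using a single mean-magnitude scale for the inputs is a simplification of the per-region scaling described in the XNOR-Net paper.

```python
def binarize(values):
    """Pack the signs of `values` into an integer bitmask and return it
    together with the mean absolute value (the scaling factor)."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i  # bit i is 1 for +1, 0 for -1
    alpha = sum(abs(v) for v in values) / len(values)
    return bits, alpha


def xnor_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1, +1} vectors of length n stored as bitmasks."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # XNOR: 1 where the signs match
    matches = bin(agree).count("1")    # population count
    return 2 * matches - n             # matches minus mismatches


def approx_dot(weights, inputs):
    """Approximate a real-valued dot product using binary operations."""
    w_bits, w_alpha = binarize(weights)
    x_bits, x_beta = binarize(inputs)
    return w_alpha * x_beta * xnor_dot(w_bits, x_bits, len(weights))


weights = [0.9, -1.1, 1.0, -0.8]
inputs = [1.2, -0.7, 0.9, 1.0]
exact = sum(w * x for w, x in zip(weights, inputs))
print(exact, approx_dot(weights, inputs))
```

The binary result is only an approximation of the exact dot product, which is why applying it selectively, in layers that tolerate the loss of precision, is a reasonable way to trade accuracy for speed.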
Given this understanding, the mixed-precision inference support (Tensor Core FP16 inference and XNOR) that came with our initial fork of darknet (https://github.com/AlexeyAB/darknet) allowed relatively fast experimentation with mixed-precision inference. We concluded that we can use binary computation in layers as needed to maximize inference speed, for smoothness in turret control and robustness to high variance in test data during deployment.
- Integration of TensorFlow and TensorRT
While we made good use of mixed-precision inference using Tensor Cores and XNOR, we aim to experiment further with TensorRT. To achieve this, we plan to port our model to TensorFlow for cross-platform experimentation alongside darknet.
- Further research into model architecture and subsequent improvement (with increase in data)
We aim to further improve our model through continued research and the introduction of additional data from the 2019 competition.