Principle Introduction and Theoretical Support Analysis: Armor Plate Detector
This page summarizes our approach to developing a single-stage object detector, based on Convolutional Neural Networks, for our armor plate detection algorithm.
Our detector project is based on the following research/reports:
- Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2015), You Only Look Once: Unified, Real-Time Object Detection
- Joseph Redmon, Ali Farhadi (2016), YOLO9000: Better, Faster, Stronger
- Joseph Redmon, Ali Farhadi (2018), YOLOv3: An Incremental Improvement
- Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi (2016), XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
- Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht (2017), The Marginal Value of Adaptive Gradient Methods in Machine Learning
Our project is mainly derived from YOLO (You Only Look Once), a state-of-the-art single-stage object detector. The overarching idea behind all three versions of YOLO is to generate bounding box predictions in a single forward pass of the network using FC (Fully Convolutional) layers, which makes the algorithm much more efficient than earlier approaches such as Fast R-CNN and Faster R-CNN.
Figure 1: Bounding box prediction of YOLO (source: https://towardsdatascience.com/review-yolov3-you-only-look-once-object-detection-eab75d7a1ba6)
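As a rough illustration of how a single forward pass yields boxes, the sketch below decodes one raw prediction using the parameterization described in the YOLO9000 and YOLOv3 papers: sigmoid offsets inside a grid cell, and exponential scaling of an anchor box. The function name `decode_box`, the 13x13 grid, and the anchor size are assumptions chosen for the example, not values taken from our model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t, cell, anchor, grid_size):
    """Decode raw outputs (tx, ty, tw, th) of one grid cell into a box.

    Returns (centre x, centre y, width, height), normalised to [0, 1].
    """
    tx, ty, tw, th = t
    cx, cy = cell      # column and row index of the grid cell
    pw, ph = anchor    # anchor (prior) size, normalised to the image size
    bx = (sigmoid(tx) + cx) / grid_size   # centre stays inside its cell
    by = (sigmoid(ty) + cy) / grid_size
    bw = pw * math.exp(tw)                # anchor scaled by the prediction
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Hypothetical example: all-zero outputs in the centre cell of a 13x13 grid.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor=(0.2, 0.1), grid_size=13)
print(box)
```

With all-zero raw outputs the box sits at the centre of its cell and has exactly the anchor's size, which is why well-chosen anchors make the regression easier.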
Our proposed method is inspired by the latest version, YOLOv3, but uses a smaller encoder to suit limited hardware. To optimize inference speed, the encoder aggressively downsamples the input RGB image to a small latent space while keeping skip connections for improved performance. Finally, XNOR-Net's binary weight operations and FP16 inference on Tensor Cores further reduce computational cost.
Our proposed model is as below:
Figure 2: High-level diagram of the proposed model. The encoder aggressively downsamples the input image, greatly reducing its spatial size while increasing its dimensionality. Predictions are generated by the YOLO-FC layers at the encoded latent space and after each upsampling layer.
Our encoder is a mix of Darknet19 and Darknet53 and largely follows their use of 3x3 kernels with a stride of 1. To reduce spatial resolution, max pooling is applied after a set of convolution blocks. To prevent vanishing gradients and improve overall performance, a skip connection is attached to each convolution block. The encoder downsamples the input to a much smaller spatial domain, which is used to generate the first set of bounding box predictions. Upsampling layers are added at the end to distribute FLOPs across the model while feeding the YOLO layers higher-resolution input for smaller objects. A leaky ReLU activation function is applied at the end of each layer.
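The spatial bookkeeping of such an encoder can be sketched as follows. This only tracks feature-map sizes; it does not build the network. The 416-pixel input, the five pooling stages, and the two upsampling stages are hypothetical values picked for illustration, since this page does not state the actual layer counts.

```python
def conv3x3_same(size):
    """3x3 convolution, stride 1, padding 1: spatial size is unchanged."""
    kernel, stride, pad = 3, 1, 1
    return (size + 2 * pad - kernel) // stride + 1

def max_pool2x2(size):
    """2x2 max pooling with stride 2 halves the spatial size."""
    return size // 2

def upsample2x(size):
    """Nearest-neighbour style upsampling doubles the spatial size."""
    return size * 2

def trace_shapes(input_size, num_pool_stages, num_upsample_stages):
    """Return the spatial sizes at which YOLO layers would predict."""
    size = input_size
    for _ in range(num_pool_stages):
        size = conv3x3_same(size)   # convolution block keeps the resolution
        size = max_pool2x2(size)    # pooling reduces it
    prediction_sizes = [size]       # first prediction at the latent space
    for _ in range(num_upsample_stages):
        size = upsample2x(size)     # one more prediction after each upsample
        prediction_sizes.append(size)
    return prediction_sizes

sizes = trace_shapes(416, num_pool_stages=5, num_upsample_stages=2)
print(sizes)  # [13, 26, 52]
```

The coarsest grid handles large plates cheaply, while the upsampled grids give the later YOLO layers finer cells for small or distant plates.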
Our model was trained on two RTX 2080 Ti GPUs in parallel. The model is first trained with the Adam optimizer to reduce overall training time and is later fine-tuned with SGD with momentum for improved generalization (Ashia C. Wilson et al.). For augmentation, changes in saturation, exposure, and hue were used to strengthen the model against overfitting. For the loss, we used the same loss as YOLO V3/V2:
Figure 3: Optimization loss for YOLO (source: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b)
The first line of the loss computes the L2 distance (Mean-Squared Error) between the model's predicted centroids (x, y) and the ground truth, while the second line computes the L2 distance between the predicted width and height of each bounding box and the ground truth. The last three lines compute the logistic loss of the model's predictions of whether a given region contains an object, and of the corresponding class probabilities.
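The structure described above can be sketched for a single anchor in a single grid cell. This is a simplified illustration, not the darknet implementation: the dictionary layout, the function names, and the weights `lambda_coord=5.0` and `lambda_noobj=0.5` (the values given in the original YOLO paper) are assumptions made for the example.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy (logistic loss) for probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def cell_loss(pred, target, lambda_coord=5.0, lambda_noobj=0.5):
    """Loss for one anchor in one grid cell.

    pred and target are dicts; a cell without an object only needs "obj".
    """
    if target["obj"] == 1:
        # Localization: squared error on centre and on size.
        xy = (pred["x"] - target["x"]) ** 2 + (pred["y"] - target["y"]) ** 2
        wh = (pred["w"] - target["w"]) ** 2 + (pred["h"] - target["h"]) ** 2
        # Objectness and per-class logistic losses.
        obj = bce(pred["obj"], 1.0)
        cls = sum(bce(p, y) for p, y in zip(pred["cls"], target["cls"]))
        return lambda_coord * (xy + wh) + obj + cls
    # Empty cell: only the objectness term, down-weighted.
    return lambda_noobj * bce(pred["obj"], 0.0)

target = {"x": 0.5, "y": 0.5, "w": 0.2, "h": 0.1, "obj": 1, "cls": [1.0, 0.0]}
good = {"x": 0.52, "y": 0.49, "w": 0.21, "h": 0.1, "obj": 0.9, "cls": [0.9, 0.1]}
bad = {"x": 0.9, "y": 0.1, "w": 0.5, "h": 0.5, "obj": 0.2, "cls": [0.3, 0.7]}
print(cell_loss(good, target), cell_loss(bad, target))
print(cell_loss({"obj": 0.1}, {"obj": 0}))
```

Because empty cells contribute only a down-weighted objectness term, the many background cells do not swamp the few cells that contain a plate.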
It is important to note that the loss computes object probability, classification, and localization separately, which reduces the need to address positive-negative class imbalance with techniques such as hard sample mining or focal loss. The model with the best validation loss during training was saved as the checkpoint for inference.
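The two-phase optimization schedule used during training (Adam first, then SGD with momentum) can be illustrated on a toy problem. The sketch below minimizes a simple quadratic rather than a detection loss, and every hyperparameter in it (learning rates, step counts, momentum) is an assumption chosen so the example converges, not a setting from our training runs.

```python
import math

def loss_and_grad(w, target):
    """Toy quadratic loss sum((w - target)^2) and its gradient."""
    loss = sum((wi - ti) ** 2 for wi, ti in zip(w, target))
    grad = [2.0 * (wi - ti) for wi, ti in zip(w, target)]
    return loss, grad

def train_two_phase(w, target, adam_steps=100, sgd_steps=200):
    w = list(w)
    # Phase 1: Adam, for fast initial progress.
    lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
    m = [0.0] * len(w)   # first-moment estimate
    v = [0.0] * len(w)   # second-moment estimate
    for t in range(1, adam_steps + 1):
        _, g = loss_and_grad(w, target)
        for i, gi in enumerate(g):
            m[i] = b1 * m[i] + (1 - b1) * gi
            v[i] = b2 * v[i] + (1 - b2) * gi * gi
            m_hat = m[i] / (1 - b1 ** t)   # bias correction
            v_hat = v[i] / (1 - b2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    # Phase 2: SGD with momentum, for fine-tuning.
    lr, mu = 0.05, 0.9
    vel = [0.0] * len(w)
    for _ in range(sgd_steps):
        _, g = loss_and_grad(w, target)
        for i, gi in enumerate(g):
            vel[i] = mu * vel[i] - lr * gi
            w[i] += vel[i]
    return w

target_w = [1.0, 2.0]
start_w = [5.0, -3.0]
final_w = train_two_phase(start_w, target_w)
print(loss_and_grad(start_w, target_w)[0], loss_and_grad(final_w, target_w)[0])
```

On a convex toy loss both optimizers reach the same minimum; the generalization difference reported by Wilson et al. only shows up on real, over-parameterized models.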
Neural networks are normally trained in FP32. A popular way to increase inference speed and reduce model size is quantization: instead of FP32, the model computes in FP16, INT8, or even INT4, trading accuracy for inference speed. XNOR-Net pushes this approach to its extreme by using binary values as the model's main datatype; the paper reports about 32x memory savings and up to 58x faster convolutions on a CPU. Inspired by this, we use XNOR-based quantization in certain layers to further reduce computational cost.
Figure 4: NVIDIA's diagram comparing inference speed of different datatypes (source: https://devblogs.nvidia.com/deploying-deep-learning-nvidia-tensorrt/)
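To make the binary idea concrete, the sketch below computes a dot product the way XNOR-Net describes it: both vectors are reduced to their signs, the signs are compared with XNOR, the agreements are counted with a popcount, and the result is rescaled by the mean absolute values of the two vectors. It is plain Python for readability and says nothing about how darknet implements its XNOR layers; all function names are hypothetical.

```python
def pack_signs(values):
    """Pack a list of numbers into an int bitmask: bit i is 1 where values[i] >= 0."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors of length n stored as bitmasks."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # 1 where the signs agree
    matches = bin(xnor).count("1")     # popcount
    return 2 * matches - n             # agreements minus disagreements

def xnor_dot(weights, activations):
    """Approximate w . x as alpha * beta * (sign(w) . sign(x))."""
    n = len(weights)
    alpha = sum(abs(w) for w in weights) / n      # weight scaling factor
    beta = sum(abs(x) for x in activations) / n   # activation scaling factor
    return alpha * beta * binary_dot(pack_signs(weights), pack_signs(activations), n)

w = [0.5, -0.2, 0.1, -0.7]
x = [1.0, -1.0, 0.5, -0.3]
exact = sum(wi * xi for wi, xi in zip(w, x))
print(exact, xnor_dot(w, x))
```

The approximation is coarse for a single short vector, which is one reason to binarize only selected layers and keep the rest in higher precision.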
Given this understanding, the mixed-precision inference support (FP16 inference on Tensor Cores, and XNOR) that came with our initial fork of darknet (https://github.com/AlexeyAB/darknet) allowed relatively fast experimentation. We concluded that we can use binary computation in selected layers as needed to maximize inference speed, for smooth turret control and robustness to high variance in test data during deployment.
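The accuracy side of the FP16 trade-off can be seen by rounding a few values to half precision. The sketch below uses the IEEE 754 half-precision format available through Python's `struct` module; it only illustrates rounding error and does not measure inference speed. The sample values are arbitrary.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Relative rounding error for a few arbitrary magnitudes.
for value in (0.1, 1.0, 1000.1, 2048.5):
    half = to_fp16(value)
    print(value, half, abs(half - value) / value)
```

Half precision keeps roughly three significant decimal digits, so small weights and activations survive well, while large values lose their fractional part.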
- Integration of TensorFlow and TensorRT
  While we made good use of mixed-precision inference with Tensor Cores and XNOR, we aim to experiment further with TensorRT. To achieve this, we plan to port our model to TensorFlow for cross-platform experimentation alongside darknet.
- Further research into model architecture and subsequent improvement (with more data)
  We aim to further improve our model through continued research and the introduction of additional data from the 2019 competition.