Principle Introduction and Theoretical Support Analysis: Armor Plate Detector
This page summarizes our approach in developing a single-stage object detector based on Convolutional Neural Networks.
Our plate detector project is based on the following research/projects:
- Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2015), You Only Look Once: Unified, Real-Time Object Detection
- Joseph Redmon, Ali Farhadi (2016), YOLO9000: Better, Faster, Stronger
- Joseph Redmon, Ali Farhadi (2018), YOLOv3: An Incremental Improvement
- Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi (2016), XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
- Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht (2017), The Marginal Value of Adaptive Gradient Methods in Machine Learning
Our project is mainly derived from YOLO (You Only Look Once), a state-of-the-art single-stage object detector. The overarching idea is the same across all three versions: use fully convolutional layers to generate a set of bounding boxes, each with a predicted location and class probabilities. Our proposed method additionally integrates a set of recurrent layers to enforce temporal understanding, which improved overall performance at deployment. To reduce computational cost and achieve optimal runtime speed, the encoder aggressively downsamples the input RGB image into a small latent space that still holds sufficient information for bounding box prediction. Finally, inference speed was further optimized by implementing the XNOR operation with binary weights in a subset of layers.
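The XNOR-Net idea can be illustrated with a toy dot product: each weight tensor W is approximated by alpha * sign(W), where alpha is the mean absolute weight, so a multiply-accumulate over binarized inputs reduces to an XNOR followed by a popcount on binary hardware. The values below are made up for illustration, not taken from our trained model:

```python
import numpy as np

def binarize(w):
    """Approximate real-valued weights w by alpha * sign(w),
    where alpha = mean(|w|), as in XNOR-Net."""
    alpha = np.mean(np.abs(w))
    return alpha, np.sign(w)

def binary_dot(x_bin, w_bin, alpha):
    """Dot product of two +/-1 vectors; in hardware this reduces
    to an XNOR followed by a population count (popcount)."""
    return alpha * np.dot(x_bin, w_bin)

# Hypothetical weights and an already-binarized input vector.
w = np.array([0.5, -1.2, 0.3, -0.7])
alpha, w_bin = binarize(w)                        # alpha = 0.675
x_bin = np.sign(np.array([1.0, -2.0, 0.5, 0.8]))  # [1, -1, 1, 1]
approx = binary_dot(x_bin, w_bin, alpha)          # binary approximation
exact = np.dot(x_bin, w)                          # full-precision reference
print(approx, exact)
```

The scaling factor alpha keeps the binary result close to the full-precision one (here 1.35 vs. 1.3) while the inner loop only needs 1-bit operations.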
Our proposed model is as below:
Figure 1: High-level diagram of the proposed model. The encoder aggressively downsamples the input image, greatly reducing it spatially while maintaining high channel dimensionality. The recurrent embedding at the latent space allows the model to learn temporal information, while the embedding at the input layer provides high-resolution input integrated over time. A YOLO layer is added at the latent space, and another after one upsampling step, so that predictions cover objects of various sizes, including smaller ones.
Key Ideas:
- Encoder uses large kernels with large strides to aggressively downsample the input, reducing computational cost
- Recurrent convolutional (RCNN) layers provide temporal information to the model at both low and high resolution
- YOLO layers (fully convolutional layers) generate the bounding box predictions
- Upsampling allows prediction at a higher resolution for smaller objects while maintaining relatively low FLOPs
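How quickly large-stride kernels shrink the feature map can be checked with the standard convolution output-size formula. The 416×416 input and the kernel/stride choices below are illustrative assumptions, not the project's actual configuration:

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size of a convolution:
    floor((n - k + 2p) / s) + 1."""
    return (size - kernel + 2 * pad) // stride + 1

# Hypothetical encoder: a few large-kernel, large-stride convolutions.
size = 416
for k, s in [(7, 4), (5, 2), (3, 2), (3, 2)]:
    size = conv_out(size, k, s, pad=k // 2)
    print(size)   # 104 -> 52 -> 26 -> 13
```

Four layers already take 416 down to a 13×13 grid (a 32× reduction), so every subsequent layer operates on roughly 1/1000th of the original number of spatial positions.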
Our model was trained on 2 RTX 2080 Ti GPUs in parallel. For improved generalization (Wilson et al., 2017), the model is first trained with the Adam optimizer to reduce overall training time, then fine-tuned with the SGD optimizer with momentum. Saturation, exposure, and hue were augmented for increased model generalization. For the loss, we used the same loss as YOLOv3 (and v2):
(source: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b)
The first line of the loss computes the L2 distance (mean squared error) between the model's predicted centroids (x, y) and the ground truth, while the second line computes the L2 distance between the predicted width and height of each bounding box and the ground truth. The last three lines compute the logistic loss of the model's prediction of whether a given region contains an object, and of the class probabilities.
It is important to note that the loss calculates classification loss and localization loss separately, reducing the need to resolve positive-negative class imbalance with techniques such as hard example mining or focal loss.
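As a rough sketch of the terms described above, the per-cell loss for a single responsible box might be computed as follows. All values and the two-class setup are hypothetical, and the λ weighting factors of the actual YOLO loss are omitted:

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy (the 'logistic loss' of the last three terms)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# One responsible cell with made-up predictions and targets.
pred_xy, true_xy = np.array([0.4, 0.6]), np.array([0.5, 0.5])
pred_wh, true_wh = np.array([0.3, 0.2]), np.array([0.25, 0.25])
pred_obj, true_obj = 0.8, 1.0
pred_cls, true_cls = np.array([0.7, 0.2]), np.array([1.0, 0.0])

loc_xy = np.sum((pred_xy - true_xy) ** 2)   # line 1: centroid MSE
loc_wh = np.sum((pred_wh - true_wh) ** 2)   # line 2: width/height MSE
obj = bce(pred_obj, true_obj)               # lines 3-4: objectness
cls = np.sum(bce(pred_cls, true_cls))       # line 5: class probabilities

total = loc_xy + loc_wh + obj + cls
print(total)
```

Because localization (MSE) and classification/objectness (logistic) terms are summed independently, each term only sees the cells responsible for it, which is what sidesteps the positive-negative imbalance problem.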
The model with the best validation loss during training was saved as the checkpoint for inference.
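The two-phase schedule described above (Adam for fast initial convergence, then SGD with momentum for better generalization) can be sketched on a toy quadratic loss. The step counts, learning rates, and the loss itself are made up for illustration:

```python
import numpy as np

def grad(w):
    """Gradient of a toy quadratic loss 0.5 * ||w||^2."""
    return w

w = np.array([5.0, -3.0])
m = np.zeros_like(w)     # Adam first moment
v = np.zeros_like(w)     # Adam second moment
vel = np.zeros_like(w)   # SGD momentum buffer
b1, b2, eps = 0.9, 0.999, 1e-8

# Phase 1: Adam for fast early progress (hypothetical lr and step count).
for t in range(1, 51):
    g = grad(w)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    w -= 0.1 * m_hat / (np.sqrt(v_hat) + eps)

# Phase 2: fine-tune with SGD + momentum.
for _ in range(50):
    vel = 0.9 * vel + grad(w)
    w -= 0.01 * vel

print(np.linalg.norm(w))
```

In practice the switch happens once Adam's validation loss plateaus; the momentum buffer starts fresh at the hand-off, just as it does here.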
[TODO BINARY WEIGHT QUANTIZATION(XNOR)] [TODO POSSIBLE IMPROVEMENTS]