
Principle Introduction and Theoretical Support Analysis: Armor Plate Detector

JoonHo Lee edited this page Aug 24, 2019 · 26 revisions

Object Detection based armor plate detector

Introduction

This page summarizes our approach to developing a single-stage object detector based on Convolutional Neural Networks.

Theory

Our plate detector project is based on the following research/projects:

  • Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2015), You Only Look Once: Unified, Real-Time Object Detection
  • Joseph Redmon, Ali Farhadi (2016), YOLO9000: Better, Faster, Stronger
  • Joseph Redmon, Ali Farhadi (2018), YOLOv3: An Incremental Improvement
  • Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi (2016), XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
  • Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht (2017), The Marginal Value of Adaptive Gradient Methods in Machine Learning

Our project is mainly derived from the underlying approach of YOLO (You Only Look Once), a state-of-the-art single-stage object detector. The overarching idea behind all three versions is the same: use fully convolutional layers to generate a set of bounding boxes together with their class probabilities and locations. In addition, our proposed method integrates a set of recurrent layers to enforce temporal understanding, which improved its overall performance at deployment. To reduce computational cost and improve run-time speed, the encoder aggressively downsamples the input RGB image to a small latent space while retaining sufficient information for bounding box predictions. Finally, inference speed was further optimized by implementing XNOR operations with binary weights in a selected set of layers.
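To make the bounding box generation concrete, here is a minimal sketch of how a YOLOv2/YOLOv3-style head turns raw network outputs into a box, following the parameterization in the cited YOLO papers. Whether this project uses exactly this parameterization is an assumption; the function name, the anchor size, and the stride below are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, used to keep the box centre inside its grid cell."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride):
    """Decode raw outputs (tx, ty, tw, th) for one anchor in one grid cell.

    cell_x, cell_y     : column and row index of the grid cell
    anchor_w, anchor_h : anchor (prior) box size in pixels (hypothetical values)
    stride             : input pixels covered by one grid cell
    Returns the box centre and size in input-image pixels.
    """
    center_x = (sigmoid(tx) + cell_x) * stride
    center_y = (sigmoid(ty) + cell_y) * stride
    width = anchor_w * math.exp(tw)
    height = anchor_h * math.exp(th)
    return center_x, center_y, width, height


# All-zero outputs place the box at the centre of the cell with the anchor's size.
box = decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=2,
                 anchor_w=30.0, anchor_h=60.0, stride=32)
print(box)
```

Because the centre offsets pass through a sigmoid, each prediction stays within its own grid cell, which is what lets a fully convolutional layer predict every cell in a single pass.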

Detailed Breakdown (Model architecture)

Our proposed model is as below:

Figure 1: High-level diagram of the proposed model. The encoder aggressively downsamples the input image, greatly reducing its spatial size while maintaining high dimensionality. The recurrent embedding at the latent space allows the model to learn temporal information, while the embedding at the input layer provides high-resolution input integrated over time. To predict objects of various sizes, including smaller ones, a YOLO layer is added at the latent space and another after the latent space is upsampled once.

Key Ideas:

  • The encoder uses large kernels with large strides to aggressively downsample the input and reduce computational cost
  • A recurrent convolutional (RCNN) layer provides temporal information to the model at both low and high resolution
  • The YOLO layer, more commonly known as a fully convolutional layer, generates bounding box predictions
  • Upsampling allows prediction at a higher resolution for smaller objects while maintaining relatively low FLOPs
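The effect of large kernels and strides on spatial size can be checked with the standard convolution output-size formula. The sketch below is illustrative only: the input size of 416 and the layer list are hypothetical and are not the actual encoder configuration of this project.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical encoder: (kernel, stride, padding) per layer.
ENCODER_LAYERS = [(7, 4, 3), (5, 2, 2), (3, 2, 1), (3, 2, 1)]


def encoder_sizes(input_size, layers=ENCODER_LAYERS):
    """Return the spatial size after each layer, starting with the input."""
    sizes = [input_size]
    for kernel, stride, padding in layers:
        sizes.append(conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes


print(encoder_sizes(416))  # [416, 104, 52, 26, 13]
```

With these assumed layers, a 416-pixel side shrinks to a 13-cell latent grid in four convolutions, and upsampling that grid once by a factor of 2 would give a 26-cell grid for the second prediction scale.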

Detailed Breakdown (Model training)

Our model was trained on two RTX 2080 Ti GPUs in parallel. The model is first trained with the Adam optimizer to reduce overall training time and is later fine-tuned with SGD with momentum for improved generalization (Ashia C. Wilson et al.). Saturation, exposure, and hue were augmented to further improve generalization. For the loss, we used the same loss as YOLOv3 (v2):

(source: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b)

The first line of the loss computes the L2 distance (Mean-Squared Error) between the model's predicted centroids (x, y) and the ground truth, while the second line computes the L2 distance between the predicted width and height of each bounding box and the ground truth. The last three lines compute the logistic loss of the model's predictions of whether a given region contains an object or not, and of the class probabilities.
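For reference, a loss of the form described above can be written as follows. This is a reconstruction from the description and the cited YOLO papers, not a copy of the original figure; the weights \lambda_{\text{coord}} and \lambda_{\text{noobj}} and the exact parameterization of width and height are assumptions.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_{ij} - \hat{x}_{ij})^2 + (y_{ij} - \hat{y}_{ij})^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (w_{ij} - \hat{w}_{ij})^2 + (h_{ij} - \hat{h}_{ij})^2 \right] \\
& - \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \log \hat{C}_{ij} \\
& - \lambda_{\text{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{noobj}} \log \left( 1 - \hat{C}_{ij} \right) \\
& - \sum_{i=1}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}}
      \left[ p_i(c) \log \hat{p}_i(c) + \left( 1 - p_i(c) \right) \log \left( 1 - \hat{p}_i(c) \right) \right]
\end{aligned}
```

Here S^2 is the number of grid cells, B the number of boxes per cell, \mathbb{1}_{ij}^{\text{obj}} indicates that box j in cell i is responsible for an object, \hat{C}_{ij} is the predicted objectness, and \hat{p}_i(c) is the predicted probability of class c.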

It is important to note that the loss calculates classification loss and localization loss separately, reducing the need for resolving positive-negative class imbalance with techniques such as hard sample mining or focal loss.

The model with the best validation loss during training was saved as the checkpoint for inference.
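The two-phase schedule (Adam first, then SGD with momentum) and the best-checkpoint rule can be illustrated on a toy problem. The sketch below is not the project's training code: it minimizes a hypothetical one-parameter loss in plain Python, and every hyperparameter in it is an assumption chosen only for illustration.

```python
import math


def loss(w):
    """Toy stand-in for the validation loss, with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Gradient of the toy loss."""
    return 2.0 * (w - 3.0)


def train_two_phase(w=0.0, adam_steps=200, sgd_steps=200,
                    adam_lr=0.05, sgd_lr=0.01, momentum=0.9):
    best_w, best_loss = w, loss(w)

    def keep_best(candidate):
        # "Checkpoint" the parameter whenever the loss improves.
        nonlocal best_w, best_loss
        current = loss(candidate)
        if current < best_loss:
            best_w, best_loss = candidate, current

    # Phase 1: Adam, for fast initial progress.
    m = v = 0.0
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for t in range(1, adam_steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        w -= adam_lr * m_hat / (math.sqrt(v_hat) + eps)
        keep_best(w)

    # Phase 2: SGD with momentum, starting from the Adam result.
    velocity = 0.0
    for _ in range(sgd_steps):
        velocity = momentum * velocity - sgd_lr * grad(w)
        w += velocity
        keep_best(w)

    return best_w, best_loss


best_w, best_loss = train_two_phase()
print(best_w, best_loss)
```

In a real training run the same structure applies, with the optimizer of a deep learning framework swapped between phases and the model weights saved to disk whenever the validation loss improves.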

Detailed Breakdown (Quantization of weights)

Neural networks are normally trained in FP32. A popular approach to increasing inference speed and reducing model size is quantization: instead of FP32, the model computes in FP16, INT8, or even INT4, at a slight cost in accuracy that depends on the chosen datatype. XNOR-Net takes this approach to the extreme by using binary values as the model's main datatype; its authors report about 32x memory savings and up to 58x faster convolution operations on a CPU. Inspired by this, we use XNOR-based quantization in certain layers to further reduce computational cost.
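The core trick behind XNOR-based layers can be sketched in a few lines. Following the XNOR-Net paper, a real-valued weight vector is approximated by a scaling factor (the mean absolute value) times its signs, and the dot product of two sign vectors is computed with XNOR and a bit count instead of multiplications. The helper names below are hypothetical, and this is a plain-Python illustration, not the optimized kernel used at deployment.

```python
def binarize(weights):
    """Approximate weights as alpha * signs, with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def pack_bits(signs):
    """Pack a +1/-1 vector into an integer: bit i is 1 where signs[i] is +1."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks positions where the signs agree; each agreement contributes +1
    and each disagreement -1, so the result is 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * matches - n


alpha, w_signs = binarize([0.5, -1.5, 1.0, -1.0])
x_signs = [1, 1, -1, -1]
binary_dot = xnor_dot(pack_bits(w_signs), pack_bits(x_signs), len(x_signs))
print(alpha * binary_dot)  # scaled approximation of the real-valued dot product
```

On real hardware the packed words are 32 or 64 bits wide, so a single XNOR plus popcount instruction replaces that many multiply-accumulate operations.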

[TODO BINARY WEIGHT QUANTIZATION(XNOR)] [TODO POSSIBLE IMPROVEMENTS]