Principle Introduction and Theoretical Support Analysis: Armor Plate Detector
This page summarizes our approach to developing a single-stage object detector based on Convolutional Neural Networks.
Our plate detector project is based on the following research/projects:
- Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi (2015), You Only Look Once: Unified, Real-Time Object Detection
- Joseph Redmon, Ali Farhadi (2016), YOLO9000: Better, Faster, Stronger
- Joseph Redmon, Ali Farhadi (2018), YOLOv3: An Incremental Improvement
- Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi (2016), XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
Our project is derived mainly from YOLO (You Only Look Once), a state-of-the-art single-stage object detector. All three versions of YOLO share the same overarching idea: use fully convolutional layers to generate a set of bounding boxes, each with a location and class probabilities. In addition, our proposed method integrates a set of recurrent layers to enforce temporal understanding, which improved its overall performance at deployment. To reduce computational cost and maximize run-time speed, the encoder aggressively downsamples the input RGB image to a small latent space that still holds sufficient information for bounding box predictions. Finally, inference speed was further optimized by implementing XNOR operations with binary weights in a selected set of layers.
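To make the binary-weight idea concrete, here is a minimal sketch of the approximation described in the XNOR-Net paper: a real-valued weight vector is replaced by a scaling factor `alpha` (the mean absolute weight) times its signs, and the dot product of two sign vectors is computed with XNOR and a bit count. The helper names (`binarize`, `pack_bits`, `xnor_dot`) and the sample values are hypothetical illustrations, not code from this project, and the page does not state which layers are binarized.

```python
def binarize(weights):
    """Approximate real weights as alpha * sign(w), as in XNOR-Net.

    alpha is the mean absolute value of the weights; signs are +1 or -1.
    """
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def pack_bits(signs):
    """Pack a list of +1/-1 values into an integer (a set bit means +1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches minus mismatches


# Hypothetical example values (not taken from the project).
weights = [0.5, -1.5, 0.25, -0.75]
inputs = [1, -1, 1, 1]  # activations assumed to be binarized already

alpha, w_signs = binarize(weights)
binary_result = alpha * xnor_dot(pack_bits(w_signs), pack_bits(inputs), len(weights))
reference = alpha * sum(w * x for w, x in zip(w_signs, inputs))
```

Here `binary_result` and `reference` agree, which shows that the multiply-accumulate over sign vectors can be replaced by bitwise operations. A real deployment would pack bits into machine words and use hardware popcount instructions.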
Our final proposed model is shown below:
Figure 1: High-level diagram of the proposed model. The recurrent embedding at the latent space allows the model to learn temporal information, while the embedding at the input layer provides high-resolution input (inspired by U-Net's concatenation of low- and high-dimensional information). To predict smaller objects, one YOLO layer is added at the latent space and another after the latent space is upsampled once, giving predictions over a wider range of object sizes.
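As a rough illustration of what a YOLO layer outputs, the sketch below decodes one raw prediction into a normalized bounding box. It assumes the YOLOv2/YOLOv3 parameterization with anchor priors (sigmoid offsets within a grid cell, exponential scaling of the anchor); this page does not state which parameterization the project uses. The function name `decode_box` and all numeric values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor, grid_size):
    """Decode raw outputs (tx, ty, tw, th) into a box in [0, 1] image units.

    cell      -- (cx, cy) index of the grid cell that made the prediction
    anchor    -- (pw, ph) anchor prior, in grid-cell units (assumed)
    grid_size -- number of cells along one side of the (square) grid
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = (sigmoid(tx) + cx) / grid_size   # box centre x
    by = (sigmoid(ty) + cy) / grid_size   # box centre y
    bw = pw * math.exp(tw) / grid_size    # box width
    bh = ph * math.exp(th) / grid_size    # box height
    return bx, by, bw, bh


# Hypothetical prediction: zero offsets in cell (3, 2) of an 8x8 grid.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 2), anchor=(1.0, 2.0), grid_size=8)
```

With zero raw outputs, the box centre sits in the middle of its cell and the box takes the size of the anchor. On an upsampled grid each cell covers a smaller image region, which is why the higher-resolution head is better suited to small objects.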
Key Ideas:
- The encoder uses large kernels with large strides to aggressively downsample the input and reduce computational cost
- A recurrent convolutional (RCNN) layer provides temporal information to the model at both low and high resolution
- The YOLO layer, more commonly known as a fully convolutional layer, generates bounding box predictions
- Upsampling allows prediction at a higher resolution for smaller objects while keeping FLOPs relatively low
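The effect of large strides on spatial size and cost can be checked with the standard convolution output-size formula, `out = floor((in + 2 * padding - kernel) / stride) + 1`. The sketch below applies it to a hypothetical three-layer encoder. The input size, kernel sizes, strides, channel counts, and the 2x upsampling factor are all assumptions chosen for illustration; the page does not list the project's actual layer configuration.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_macs(h_out, w_out, kernel, c_in, c_out):
    """Multiply-accumulate count of one square-kernel convolution layer."""
    return h_out * w_out * kernel * kernel * c_in * c_out


# Hypothetical encoder: (kernel, stride, padding, out_channels) per layer.
layers = [(7, 4, 3, 16), (5, 2, 2, 32), (3, 2, 1, 64)]

h, w, c = 256, 256, 3  # assumed RGB input resolution
trace = []
for k, s, p, c_out in layers:
    h, w = conv_out(h, k, s, p), conv_out(w, k, s, p)
    trace.append((h, w, c_out, conv_macs(h, w, k, c, c_out)))
    c = c_out

latent_grid = h                 # grid used by the first YOLO layer
upsampled_grid = latent_grid * 2  # grid after one assumed 2x upsampling

for h_i, w_i, c_i, macs in trace:
    print(f"{h_i}x{w_i}x{c_i}: {macs:,} MACs")
```

In this configuration the first stride-4 layer already reduces the image to a quarter of its side length, so every later layer operates on far fewer positions. The second prediction head then works on a grid twice as fine as the latent one without rerunning the encoder at higher resolution.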