SSD in Deep Learning: A Detailed Illustrated Guide to the SSD Algorithm (Paper Overview, Architecture Details, and Example Applications)


Source: Chongqing Software Legalization Service Center (重庆市软件正版化服务中心) | Date: 2022-09-19


Introduction to the SSD algorithm (paper overview)

SSD stands for Single Shot MultiBox Detector.

Abstract
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component.
Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at: https://github.com/weiliu89/caffe/tree/ssd .
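To make the abstract's "default boxes over different aspect ratios and scales per feature map location" concrete, here is a minimal sketch in Python. The function names, the 4 × 4 feature map, the scale of 0.3, and the three aspect ratios are illustrative assumptions, not values from this article. The offset parameterization (center shifts scaled by the box size, exponential scaling for width and height) follows the general form described in the SSD paper and omits implementation details such as variance factors.

```python
import math
from itertools import product


def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default boxes (cx, cy, w, h), normalized to [0, 1], for a square feature map."""
    boxes = []
    for row, col in product(range(fmap_size), repeat=2):
        # Center of the feature map cell in normalized image coordinates.
        cx = (col + 0.5) / fmap_size
        cy = (row + 0.5) / fmap_size
        for ar in aspect_ratios:
            # Every aspect ratio keeps the same area: w * h == scale ** 2.
            w = scale * math.sqrt(ar)
            h = scale / math.sqrt(ar)
            boxes.append((cx, cy, w, h))
    return boxes


def apply_offsets(box, offsets):
    """Adjust one default box with predicted offsets (tx, ty, tw, th)."""
    cx, cy, w, h = box
    tx, ty, tw, th = offsets
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))


boxes = default_boxes(fmap_size=4, scale=0.3)
print(len(boxes))  # 4 * 4 cells * 3 aspect ratios = 48
print(apply_offsets(boxes[0], (0.0, 0.0, 0.0, 0.0)))  # zero offsets leave the box unchanged
```

In a real detector the offsets and the per-class scores come from convolutional layers applied to each feature map; here they are passed in by hand to show only the geometry.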
Conclusions
This paper introduces SSD, a fast single-shot object detector for multiple categories. A key feature of our model is the use of multi-scale convolutional bounding box outputs attached to multiple feature maps at the top of the network. This representation allows us to efficiently model the space of possible box shapes. We experimentally validate that given appropriate training strategies, a larger number of carefully chosen default bounding boxes results in improved performance. We build SSD models with at least an order of magnitude more box predictions sampling location, scale, and aspect ratio, than existing methods [5,7]. We demonstrate that given the same VGG-16 base architecture, SSD compares favorably to its state-of-the-art object detector counterparts in terms of both accuracy and speed. Our SSD512 model significantly outperforms the state-of-the-art Faster R-CNN [2] in terms of accuracy on PASCAL VOC and COCO, while being 3× faster. Our real time SSD300 model runs at 59 FPS, which is faster than the current real time YOLO [5] alternative, while producing markedly superior detection accuracy.
Apart from its standalone utility, we believe that our monolithic and relatively simple SSD model provides a useful building block for larger systems that employ an object detection component. A promising future direction is to explore its use as part of a system using recurrent neural networks to detect and track objects in video simultaneously.
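The claim of "at least an order of magnitude more box predictions" can be checked with a quick count. The layer configuration below (six square feature maps for SSD300, with 4 or 6 default boxes per location) and the 7 × 7 grid with 2 boxes per cell for YOLO are taken from the respective papers as I recall them, not from this article, so treat them as assumptions.

```python
# Assumed from the SSD paper (not stated in this article):
# (feature map side length, default boxes per location) for SSD300.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Assumed from the YOLO (v1) paper: 7 x 7 grid, 2 boxes per cell.
YOLO_V1_LAYERS = [(7, 2)]


def total_boxes(layers):
    """Sum box predictions over all square feature maps."""
    return sum(size * size * per_location for size, per_location in layers)


ssd300_boxes = total_boxes(SSD300_LAYERS)
yolo_boxes = total_boxes(YOLO_V1_LAYERS)
print(ssd300_boxes)  # 8732
print(yolo_boxes)    # 98
print(round(ssd300_boxes / yolo_boxes, 1))  # 89.1
```

Under these assumptions SSD300 makes roughly 89 times as many box predictions per image as YOLO, which is consistent with the "order of magnitude" wording.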

Paper
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016
https://arxiv.org/abs/1512.02325

Paper PDF: https://arxiv.org/pdf/1512.02325v5.pdf

0. SSD experimental results

Training: VOC2007 trainval and VOC2012 trainval (16551 images)
Testing: VOC2007 test (4952 images)

1. Single-stage vs. two-stage detectors on VOC2007

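The two models, SSD300 and SSD512, reach 77% mAP at 46 FPS and 80% mAP at 19 FPS, respectively.
SSD outperforms YOLOv1 in both speed and accuracy, and it also outperforms two-stage models such as Faster R-CNN.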

2. SSD300 and SSD512 models: PASCAL VOC2007 test detection results

Here is the accuracy comparison for different methods. For SSD, the input image size is 300 × 300 or 512 × 512.

The model is trained using SGD with initial learning rate 0.001, 0.9 momentum, 0.0005 weight decay, and batch size 32.
On VOC2007 test with an Nvidia Titan X, SSD achieves 59 FPS at 74.3% mAP, versus 7 FPS at 73.2% mAP for Faster R-CNN and 45 FPS at 63.4% mAP for YOLO.
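The training hyperparameters above can be read as the following update rule. This is a minimal sketch of one common formulation of SGD with momentum and L2 weight decay, written in plain Python. The exact update used by the authors' Caffe solver is not given in this article, and `sgd_step` is a hypothetical helper name.

```python
# Hyperparameters quoted in the text above.
LEARNING_RATE = 0.001
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005
BATCH_SIZE = 32  # gradients are assumed to be averaged over this many images


def sgd_step(weights, grads, velocity):
    """One SGD step with momentum and L2 weight decay (one common formulation)."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + WEIGHT_DECAY * w              # weight decay acts as an extra gradient term
        v = MOMENTUM * v + LEARNING_RATE * g  # momentum accumulates past updates
        new_weights.append(w - v)
        new_velocity.append(v)
    return new_weights, new_velocity


weights, velocity = sgd_step([1.0], [0.5], [0.0])
print(weights, velocity)
```

With zero initial velocity, the first step moves each weight by the learning rate times the decayed gradient; later steps also carry 90% of the previous update.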

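Both Fast R-CNN and Faster R-CNN use input images whose minimum dimension is 600. The two SSD models have exactly the same settings except for their input sizes (300 × 300 vs. 512 × 512). Clearly, a larger input size gives better results, and more data always helps.
The results table shows that training on the combined "07+12" dataset yields 76.8% mAP, while the "07+12+COCO" combination performs best at 81.6% mAP.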

Notes:
Data: "07": VOC2007 trainval.
"07+12": union of VOC2007 and VOC2012 trainval.
"07+12+COCO": first train on COCO trainval35k, then fine-tune on 07+12.

3. Detection speed (frames per second)

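This is a recap of the speed performance in frames per second.
Results on Pascal VOC2007 test. SSD300 is the only real-time detection method that achieves above 70% mAP. By using a larger input image, SSD512 outperforms all other methods on accuracy while maintaining close to real-time speed.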
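Frame rates are easier to compare as per-image latency. The sketch below converts the FPS figures quoted earlier in this article (59 FPS for SSD with 300 × 300 input, 7 FPS for Faster R-CNN, 45 FPS for YOLO) and filters for real-time methods above 70% mAP. The 30 FPS real-time threshold is an assumption chosen for illustration, not a value from the article.

```python
# (FPS, mAP %) on VOC2007 test, as quoted earlier in this article.
RESULTS = {
    "SSD300": (59, 74.3),
    "Faster R-CNN": (7, 73.2),
    "YOLO": (45, 63.4),
}
REAL_TIME_FPS = 30  # assumed threshold, not taken from the article


def latency_ms(fps):
    """Average time per image in milliseconds at a given frame rate."""
    return 1000.0 / fps


for name, (fps, m_ap) in RESULTS.items():
    print(f"{name}: {latency_ms(fps):.1f} ms/image, {m_ap}% mAP")

real_time_over_70 = [
    name
    for name, (fps, m_ap) in RESULTS.items()
    if fps >= REAL_TIME_FPS and m_ap > 70
]
print(real_time_over_70)  # ['SSD300']
```

With these figures SSD300 takes about 16.9 ms per image against about 142.9 ms for Faster R-CNN, and it is the only entry that passes both the speed and the accuracy filter.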