目标检测之RCNN套路

Selective Search

《Selective Search for Object Recognition》
matlab code（论文作者提供） :
- http://disi.unitn.it/~uijlings/MyHomepage/index.php#page=projects1
python code :
- https://github.com/AlpacaDB/selectivesearch
- https://github.com/belltailjp/selective_search_py
解决物体识别画框的问题：
- Instead of an exhaustive search we use segmentation as selective search yielding a small set of class independent object locations.
- In contrast to the segmentation, instead of focusing on the best segmentation algorithm, we use a variety of strategies to deal with as many image conditions as possible, thereby severely reduc-ing computational costs while potentially capturing more objects accurately.
- Instead of learning an objectness measure on randomly sampled boxes, we use a bottom-up grouping procedure to generate good object locations.
设计原则：
- 能捕捉任意scale的物体，bottom-up
- 区分方法多样性，物体在图片里区分可能需要有n种区分方法，都要能识别出来
- 快速，尽可能少的在这一步浪费算力
技术方案
- Hierarchical Grouping Algorithm
- 提出四种计算similarity的方案，进行加权融合
  The Region-based CNN (R-CNN)
《Rich feature hierarchies for accurate object detection and semantic segmentation》
https://github.com/rbgirshick/rcnn
贡献
- 把神经网络从图片分类取得的好效果迁移到目标检测上了
Approach
- 对于输入图片用上面的Selective Search技术选差不多2000个bottom-up region
- 把图片resize到cnn接受的尺寸
- cnn抽取feature
- SVM判断分类
- bounding-box regressors改善定位
这篇文章作者非常良心，工程化细节介绍的很详细，值得大家反复摩擦，搞清楚每一个细节，而且也给出了code（matlab晕）
- selective search到cnn的resize方案调研
- fine-tuning阶段和svm训练阶段positive sample的选取原则不同
- pre-training和fine-tuning的learning rate的设置
- mini-batch的选择方案
- 为什么还要单独训练svm而不是fine-tuning一个softmax分类器就够了
- 增加bounding box regression可以提升效果
- 错误分析工具
  - 《Diagnosing error in object detectors》
- 训练及验证集的使用
  - 训练集未完整标注
  - 验证集进行了完整标注

Fast-RCNN

《Fast R-CNN》
https://github.com/rbgirshick/fast-rcnn
RCNN的问题：
- multi-stage：卷积神经网络算feature + SVM 做分类 + bounding-box regressors 改善定位
- 速度慢：训练慢，推理慢
Fast-RCNN贡献：
- single-stage + multi-task loss
- 更高的准确度
- 更快的速度
Fast-RCNN网络架构(Figure 1)：
- network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map.
  - 用cnn算整个图片的feature vector，然后用ROI pooling layer去对应区域的feature map进行pooling操作
- for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
- Each feature vector is fed into a sequence of fully connected (fc) layers
- branch into two sibling output layers:
  - one that produces softmax probability estimates over K object classes plus a catch-all “background” class
  - another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes
小trick
- 保证缩放不变性，引入图像金字塔来辅助
- 用SVD压缩卷积参数W，从而加速计算
  - Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percent- age point) drop in mAP and without needing to perform additional fine-tuning after model compression
  - 《Exploiting linear structure within convolutional networks for efficient evaluation》
经验
- multi-task training的效果要比multi-stage training的效果要好
- 图像金字塔引入对效果提升很小，但是计算量的消耗很大
- 更多的region候选集合虽然理论上提高召回率但是准确率会有下降

Faster-RCNN

《Faster R-CNN: Towards real-time object detection with region proposal networks》
https://github.com/ShaoqingRen/faster_rcnn
https://github.com/rbgirshick/py-faster-rcnn
计算时proposal是计算瓶颈，提出region proposal network（RPN）来提供proposal，而不再使用之前的selective search，且RPN和预测的fast cnn模型是share feature的，从而达到加速的目的
VGG模型可以达到5fps
RPN模型上是一种fully-convolutional network
整体模型架构可见Figure2，基本思路是：
- 对整图用卷积神经网络算全局feature
- 基于图片的全局feature计算proposal
- 基于计算的proposal和全局feature计算物体识别
在faster-RCNN之前如果要做尺度不变性要训练时进行1）图像缩放或2）多个卷积层的filter size来处理，对训练时长增加很明显，而faster-rcnn只需要对“anchor”进行处理
anchor的计算从算法上保证了一定是平移不变的
训练过程：
- 1、用ImageNet pre-trained的模型初始化RPN，end-to-end对RPN进行fine-tuning
- 2、用ImageNet pre-trained的模型初始化Fast RCNN，基于第一步RPN算的proposal训练fast rcnn，注意截止到第二步，fast rcnn和rpn还没有share feature
- 3、用第二部训练的模型来初始化RPN，训练的时候shared feature那些层是固定权重的，不参与训练，只有专属RPN的部分参与模型训练
- 4、同第三步，share feature层不参与训练，fine tuning专属fast rcnn的那些层

Mask R-CNN

https://github.com/facebookresearch/Detectron
inference速度：5fps，基本等同于faster-rcnn
训练：单台机器配备8个GPU，训练1～2天
主要改进：
- multi-task loss增加一个branch，用于输出像素级的mask
- 增加RoIAlign层来辅助mask预测（对比faster RCNN的ROIPool层）

联系我: personal email address

目标检测之RCNN套路

Selective Search

The Region-based CNN (R-CNN)

Fast-RCNN

Faster-RCNN

Mask R-CNN