Reading Notes 3 - FCNT: Visual Tracking with Fully Convolutional Networks

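This paper, published at ICCV 2015, applies CNNs to visual object tracking. It makes two main contributions:

Contribution 1: It examines how CNN features from different layers differ, and how they relate, in the target properties they represent.

Contribution 2: It selects features from different CNN layers and "sparsifies" them (implemented with a tiny convolutional neural network), yielding more discriminative features for tracking.

The two parts are described in turn below: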

I. Deep Feature Analysis for Visual Tracking

From their experiments, the authors make three observations:

(a) Although the receptive field of CNN feature maps is large, the activated feature maps are sparse and localized. The activated regions are highly correlated to the regions of semantic objects. That is, only a few CNN neurons are activated, and the activated regions carry the semantic information of the target.

(b) Many CNN feature maps are noisy or unrelated for the task of discriminating a particular target from its background. That is, many of the CNN feature maps are noise and cannot be used to separate foreground from background.

(c) Different layers encode different types of features. Higher layers capture semantic concepts on object categories, whereas lower layers encode more discriminative features to capture intra class variations. That is, features from higher layers better express the category of the target, while features from lower layers are more discriminative and capture the differences between objects within the same class.

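The authors verify observations (b) and (c) experimentally.

For (b), the authors use sparse representation to compute a weight for each CNN feature map and then linearly combine all the feature maps with these weights. Note that the weights are obtained by solving an L1-norm problem, so that only the most discriminative feature maps receive non-zero weight.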

Here c is the vector of sparse coefficients, F holds the CNN feature maps, and π is the foreground mask of the target.
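The formula these symbols belong to is not reproduced in these notes. As a rough sketch, assume it has the usual L1-regularized least-squares form: minimize 0.5 * ||π - F c||^2 + λ * ||c||_1 subject to c >= 0. The regularization weight `lam`, the non-negativity constraint, the toy data, and the solver (projected ISTA) are all assumptions made for illustration, not details taken from the notes.

```python
def sparse_map_weights(F, pi, lam=0.1, iters=500):
    """Non-negative L1-regularized least squares, solved with projected ISTA.

    F  : list of d rows with n entries each; column j is feature map j, flattened
    pi : list of d entries, the flattened foreground mask
    lam: hypothetical regularization weight (larger -> sparser weights)
    """
    d, n = len(F), len(F[0])
    # The squared Frobenius norm upper-bounds the Lipschitz constant of the
    # gradient, so its inverse is a safe (if conservative) step size.
    step = 1.0 / sum(v * v for row in F for v in row)
    c = [0.0] * n
    for _ in range(iters):
        residual = [sum(F[i][j] * c[j] for j in range(n)) - pi[i] for i in range(d)]
        grad = [sum(F[i][j] * residual[i] for i in range(d)) for j in range(n)]
        # Gradient step, L1 shrinkage, and projection onto c >= 0 in one line.
        c = [max(0.0, c[j] - step * (grad[j] + lam)) for j in range(n)]
    return c


# Toy data: 4 pixels, 3 feature maps (columns).
# Map 0 matches the mask, map 1 fires only on background, map 2 is mixed.
F = [
    [1.0, 0.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 1.0, 0.0],
]
pi = [1.0, 1.0, 0.0, 0.0]
weights = sparse_map_weights(F, pi)
```

On this toy input only the map that matches the mask ends up with a non-zero weight, which is the behavior the notes describe: uninformative maps are switched off.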

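In the paper's figure (not reproduced here), the first column shows the images used to compute the sparse coefficients, and columns 2, 3, and 4 show the result of applying the coefficients from column 1 to the feature maps of other images. The results indicate that the feature maps have an inherent structure, and that many of them are useless for representing the target.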

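For (c), the claim is that CNN features from different layers express target characteristics at different levels. To test it, the authors annotated 1,800 face images together with a set of non-face images; each of the 1,800 face images is labeled with the person's ID.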

The results support the authors' hypothesis.

II. The FCNT Tracking Framework

1. For a given target, a feature map selection process is performed on the conv4-3 and conv5-3 layers of the VGG network to select the most relevant feature maps and avoid overfitting on noisy ones.

2. A general network (GNet) that captures the category information of the target is built on top of the selected feature maps of the conv5-3 layer.

3. A specific network (SNet) that discriminates the target from background with similar appearance is built on top of the selected feature maps of the conv4-3 layer.

4. Both GNet and SNet are initialized in the first frame to perform foreground heat map regression for the target and adopt different online update strategies.

5. For a new frame, a region of interest (ROI) centered at the last target location containing both target and background context is cropped and propagated through the fully convolutional network.

6. Two foreground heat maps are generated by GNet and SNet, respectively. Target localization is performed independently based on the two heat maps.

7. The final target is determined by a distracter detection scheme that decides which heat map in step 6 to be used.
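Step 5 can be illustrated with a small helper that computes the ROI rectangle. This is only a sketch: the `context` factor that enlarges the target box, the clamping to the image border, and the function name `crop_box` are hypothetical choices, not values taken from the paper.

```python
def crop_box(center_x, center_y, target_w, target_h, image_w, image_h, context=2.0):
    """Return (x0, y0, x1, y1) of an ROI centered on the last target position.

    `context` (hypothetical) scales the target size so that the ROI also
    covers surrounding background. The box is clamped to the image bounds,
    so near a border it is no longer exactly centered on the target.
    """
    half_w = context * target_w / 2.0
    half_h = context * target_h / 2.0
    x0 = max(0, int(round(center_x - half_w)))
    y0 = max(0, int(round(center_y - half_h)))
    x1 = min(image_w, int(round(center_x + half_w)))
    y1 = min(image_h, int(round(center_y + half_h)))
    return x0, y0, x1, y1


# A 20x10 target in the middle of a 100x100 frame, and one near the corner.
roi_center = crop_box(50, 50, 20, 10, 100, 100)
roi_corner = crop_box(5, 5, 20, 10, 100, 100)
```

A real tracker might pad the image instead of clamping, so that the target stays centered in the ROI; that choice is left open here.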

(1) Feature selection network: sel-CNN

This network consists of only a dropout layer and a convolutional layer. It is trained so that the predicted foreground is as close as possible to the target mask.

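Once sel-CNN is trained, each input feature map is kept or discarded according to its significance. The significance is computed from the change in loss when a neuron of the feature map is set to zero; these changes are summed over the whole map, and the maps are then selected by thresholding the summed value.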

In the end, 384 feature maps are kept (out of the original 512).
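The selection step above can be sketched as follows. To keep the example small, the trained sel-CNN is replaced by a hypothetical linear predictor with one weight per feature map and a squared-error loss, and the loss change is computed exactly for each neuron rather than approximated. The final cut keeps the top-k maps by score, which corresponds to keeping 384 of 512; the names `map_significance` and `select_top_k` are made up for this illustration.

```python
def map_significance(maps, weights, mask):
    """Sum, over all neurons of each map, the loss change caused by zeroing it.

    maps   : K feature maps, each a flat list of P activations
    weights: K weights of a hypothetical linear (1x1) foreground predictor
    mask   : flat list of P foreground-mask values
    """
    num_pixels = len(mask)
    pred = [sum(w * m[i] for w, m in zip(weights, maps)) for i in range(num_pixels)]
    scores = []
    for w, m in zip(weights, maps):
        total = 0.0
        for i in range(num_pixels):
            base = (pred[i] - mask[i]) ** 2
            # Zeroing neuron i of this map removes its contribution w * m[i].
            zeroed = (pred[i] - w * m[i] - mask[i]) ** 2
            total += zeroed - base
        scores.append(total)
    return scores


def select_top_k(scores, k):
    """Indices of the k maps whose removal hurts the loss the most."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])


# Toy data: map 0 follows the mask, maps 1 and 2 are weak noise.
maps = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.1],
    [0.1, 0.0, 0.0, 0.0],
]
weights = [1.0, 0.5, 0.5]
mask = [1.0, 1.0, 0.0, 0.0]
scores = map_significance(maps, weights, mask)
kept = select_top_k(scores, 2)
```

A large positive score means the loss rises sharply when the map is removed, so the map is worth keeping; a score near zero or below marks a map that contributes nothing or only noise.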

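(2) Target localization

This step defines two small CNNs with identical structure, SNet and GNet. Each takes the feature maps selected by sel-CNN as input and outputs a heat map of the target. This heat map is the key to localizing the target.

Localization procedure: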

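(a) The GNet heat map is used to produce the target candidate. GNet takes the conv5_3 output as its input, which copes with rotation and occlusion of the target and localizes objects of the target's category well.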

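(b) Step (a) does not cope well when other objects of the same category appear, so the tracker checks whether the target may have drifted. It does this by computing the probability P that a similar object appears outside the candidate region. If P > 0.2, a distracter is considered present and SNet is then used to localize the target; otherwise the GNet result is taken as the final result. SNet takes the conv4_3 feature maps as its input, which better express the intra-class characteristics of the target.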
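The decision in (b) can be sketched as below. The notes do not say how P is computed, so this example assumes it is the share of GNet heat-map confidence that falls outside the candidate box. That definition, and the names `distracter_probability` and `choose_heat_map`, are assumptions for illustration; only the 0.2 threshold comes from the notes.

```python
def distracter_probability(heat, box):
    """Share of total confidence lying outside the box (assumed definition of P).

    heat: 2-D list of non-negative confidences, indexed heat[y][x]
    box : (x0, y0, x1, y1) candidate region, end-exclusive
    """
    x0, y0, x1, y1 = box
    total = sum(sum(row) for row in heat)
    if total == 0:
        return 0.0
    inside = sum(heat[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return (total - inside) / total


def choose_heat_map(gnet_heat, snet_heat, box, threshold=0.2):
    """Fall back to SNet only when a distracter is likely."""
    p = distracter_probability(gnet_heat, box)
    if p > threshold:
        return "SNet", snet_heat
    return "GNet", gnet_heat


box = (1, 1, 2, 2)  # a single cell at x=1, y=1
clean_gnet = [[0, 0, 0, 0], [0, 0.9, 0, 0.1], [0, 0, 0, 0]]
distracted_gnet = [[0, 0, 0, 0], [0, 0.8, 0, 0.4], [0, 0, 0, 0]]
snet = [[0, 0, 0, 0], [0, 1.0, 0, 0.0], [0, 0, 0, 0]]

source_clean, _ = choose_heat_map(clean_gnet, snet, box)
source_distracted, _ = choose_heat_map(distracted_gnet, snet, box)
```

In the second case a strong response two cells away from the candidate pushes P above the threshold, so the SNet heat map is used instead.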

(3) Online update

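GNet is kept fixed; only SNet is updated.

SNet update rules:

1. Adaptive update

SNet is updated every 20 frames, using the target from the frame with the highest confidence among those 20 frames.

2. Discriminative update

This update is triggered when the situation in step (b) of target localization occurs, i.e., when a distracter is detected.

The update uses the first-frame target to constrain the model and prevent it from degrading, while the new tracking results are used to model changes in the target's appearance.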
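The two update rules can be sketched as a small scheduler that decides when SNet should be updated and with which frame. It covers only the timing described above, not the training itself, and the class name and its interface are hypothetical.

```python
class SNetUpdateScheduler:
    """Decides when to update SNet; the update itself is left to the caller."""

    def __init__(self, interval=20):
        self.interval = interval
        self.buffer = []  # (confidence, frame_index) since the last adaptive update

    def step(self, frame_index, confidence, distracter_detected):
        """Return the list of (rule, frame_index) updates due after this frame."""
        updates = []
        self.buffer.append((confidence, frame_index))
        if distracter_detected:
            # Rule 2: update immediately with the current frame.
            updates.append(("discriminative", frame_index))
        if len(self.buffer) == self.interval:
            # Rule 1: every `interval` frames, use the most confident frame.
            _, best_frame = max(self.buffer)
            updates.append(("adaptive", best_frame))
            self.buffer = []
        return updates


# Simulate 20 frames: frame 7 is the most confident, frame 12 has a distracter.
scheduler = SNetUpdateScheduler(interval=20)
log = []
for frame in range(1, 21):
    conf = 0.9 if frame == 7 else 0.5
    log.extend(scheduler.step(frame, conf, distracter_detected=(frame == 12)))
```

In a full tracker, each entry in `log` would trigger a few training iterations on SNet, combining the first-frame target with the selected frame as described above.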

0 0