Traditional visual tracking methods (e.g., the L1 tracker) generally model the target directly with pixel-level features in each frame of the video sequence and ignore the deep visual features within image patches. In real-world fixed-camera surveillance scenes, one can usually find a region in which targets have a clear, easily distinguishable appearance. This paper therefore pre-selects such a reference region in each video scene to construct training samples, and builds a deep convolutional neural network with two symmetric, weight-sharing paths. The network is trained so that the output features of a target outside the reference region become as similar as possible to the features of the same target inside it, transferring the strong target representation available in the reference region to the rest of the scene. The trained model enhances the distinguishability of targets and can be plugged into trackers that rely on shallow features, such as the L1 tracker, to improve their robustness. Within the L1 tracking framework, the pre-trained network extracts features of target candidates for sparse representation, making the tracker robust to challenges such as occlusion and illumination change. Evaluated on 25 pedestrian videos against 9 state-of-the-art trackers, the proposed approach achieves an average overlap rate 0.11 higher, and an average center location error 1.0 lower, than those of the second-best method.
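The two-path, weight-sharing design described above is essentially a Siamese architecture. The following PyTorch sketch illustrates the training objective, assuming a small toy backbone and a squared feature-difference loss; the layer sizes, learning rate, and exact loss form are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a two-path, weight-sharing (Siamese) CNN trained to make
# features of a target outside the reference region match features of the
# same target inside it. Architecture and hyperparameters are assumptions.
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Shared backbone: both paths run through the same weights."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, 128)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        return self.fc(h)

net = FeatureNet()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)

def training_step(patch_in_region, patch_out_region):
    """One update: pull the feature of a target patch observed outside the
    reference region toward the feature of the same target inside it."""
    f_ref = net(patch_in_region)          # path 1: reference-region patch
    f_out = net(patch_out_region)         # path 2: same weights, outside patch
    loss = ((f_ref - f_out) ** 2).mean()  # feature-similarity objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the two paths share one set of weights, a single module can serve both; the "two paths" exist only in how training pairs are fed.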
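For the sparse-representation step, the sketch below shows one plausible way the learned deep features could be scored against a dictionary of target templates under an L1 penalty, using an ISTA solver. The dictionary D, the weight lam, and the residual-based candidate scoring are assumptions for illustration; the original L1 tracker additionally uses trivial templates, omitted here for brevity.

```python
# Minimal sketch: L1-regularized sparse coding of candidate deep features
# against target templates, with the best candidate chosen by reconstruction
# error. Names (D, lam, ista) are illustrative, not from the paper.
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(D, f, lam=0.01, n_iter=200):
    """Solve min_c 0.5*||f - D c||_2^2 + lam*||c||_1 by iterative shrinkage."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ c - f)         # gradient of the quadratic term
        c = soft_threshold(c - grad / L, lam / L)
    return c

def score_candidates(D, features):
    """Pick the candidate whose deep feature is best reconstructed by the
    target templates, i.e. has the smallest residual under its sparse code."""
    errors = []
    for f in features:                   # one deep feature per candidate
        c = ista(D, f)
        errors.append(np.linalg.norm(f - D @ c))
    return int(np.argmin(errors))
```

Here D would hold the deep features of target templates as columns, and features would be the deep features extracted from the candidate patches of the current frame.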