Objective To address the problem of background interference in fine-grained image classification, a classification model that segments images using top-down attention maps is proposed. Method First, a convolutional neural network (CNN) is trained on the fine-grained image dataset for initial classification, yielding a basis network model. Visualization analysis of this network reveals that only part of each image contributes to the target class. The learned basis network is then used to compute the spatial support of the image pixels for the relevant class, producing a top-down attention map that detects the key regions of the image. The attention map initializes the GraphCut algorithm, which segments out the key object regions and thereby improves the discriminability of the image. Finally, CNN features are extracted from the segmented images to perform fine-grained classification. Result Using only image-level class labels, the model is evaluated on the public fine-grained datasets Cars196 and Aircrafts100, achieving average classification accuracies of 86.74% and 84.70%, respectively. These results show that introducing attention information on top of the GoogLeNet model further improves the accuracy of fine-grained image classification. Conclusion The semantic segmentation strategy based on top-down attention maps improves fine-grained classification performance. Because neither object bounding boxes nor part annotations are required, the model is general and robust, and it is applicable to salient object detection, foreground segmentation, and fine-grained image classification.
Objective This paper addresses the problem of fine-grained visual categorization, a task that requires domain-specific and expert knowledge. The task is challenging because all the objects in a database belong to the same basic-level category, with very fine differences between classes. These subtle differences are easily overwhelmed by image background information, which is seldom discriminative and often serves as a distractor, reducing recognition performance in fine-grained image categorization. Therefore, segmenting the background regions and localizing discriminative image parts are important precursors to fine-grained image categorization. To this end, a new segmentation algorithm based on top-down attention maps is proposed to detect discriminative objects and discount the influence of background. Once the objects are localized, they are described by convolutional neural network (CNN) features to predict the corresponding class. Method The proposed method first classifies a dataset with a CNN model to obtain a ConvNet basis with the GoogLeNet structure. The ConvNet basis is visualized to build reliable intuitions about the visual information contained in the trained CNN representations, and the visualization results show that the salient parts of the images correspond to the object regions, whereas the activations of the background regions are very small. Then, according to the learned ConvNet, we predict the top-1 class label of a given image and determine the spatial support of the predicted class among the image pixels. The spatial support is rearranged to produce a top-down attention map, which can effectively locate the informative object regions in the image. Next, given an image and its corresponding image-specific class attention map, we compute the object segmentation mask with the GraphCut segmentation algorithm. The high-quality foreground segmentations are then used to encode the image appearance into a highly discriminative visual representation by fine-tuning the ConvNet basis on the segmented images; CNN features extracted from the segmented images finally predict the fine-grained class. Result Using only image-level class labels, the model achieves average classification accuracies of 86.74% and 84.70% on the public fine-grained datasets Cars196 and Aircrafts100, respectively, which shows that introducing attention information into the GoogLeNet model further improves fine-grained categorization accuracy. Conclusion The semantic segmentation strategy based on top-down attention maps improves fine-grained categorization performance. Because neither bounding-box nor part annotations are required, the model is general and robust and is applicable to salient object detection, foreground segmentation, and fine-grained image categorization.
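The top-down attention map described in the Method corresponds to an image-specific class saliency map: the gradient of the top-1 class score with respect to the input pixels, reduced over colour channels. Below is a minimal sketch of that computation, assuming a PyTorch setup; the pretrained GoogLeNet stand-in and the function name attention_map are illustrative assumptions, not the authors' implementation.

```python
import torch
from torchvision import models

# Stand-in for the fine-tuned ConvNet basis (the paper uses a GoogLeNet structure).
model = models.googlenet(pretrained=True).eval()

def attention_map(image: torch.Tensor) -> torch.Tensor:
    """image: normalized float tensor of shape (3, H, W).
    Returns an (H, W) top-down attention (class saliency) map."""
    x = image.unsqueeze(0).requires_grad_(True)      # (1, 3, H, W), leaf tensor
    scores = model(x)                                # class scores (logits)
    top1 = scores.argmax(dim=1).item()               # predicted top-1 class label
    scores[0, top1].backward()                       # spatial support: d(score)/d(pixel)
    # Maximum absolute gradient over colour channels gives per-pixel attention.
    return x.grad.abs().max(dim=1).values.squeeze(0)
```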
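For the segmentation step, one plausible way to initialize a graph-cut solver from the attention map is to seed probable foreground from high-attention pixels and background from low-attention pixels. The sketch below substitutes OpenCV's GrabCut for the paper's GraphCut formulation; the thresholds fg_thresh and bg_thresh and the helper name segment_with_attention are illustrative assumptions, not values or names from the paper.

```python
import cv2
import numpy as np

def segment_with_attention(image_bgr, attn, fg_thresh=0.6, bg_thresh=0.1, iters=5):
    """image_bgr: uint8 array of shape (H, W, 3); attn: float (H, W) map scaled to [0, 1].
    Returns a binary (H, W) foreground mask."""
    mask = np.full(attn.shape, cv2.GC_PR_BGD, dtype=np.uint8)  # default: probable background
    mask[attn >= fg_thresh] = cv2.GC_PR_FGD    # strong attention seeds probable foreground
    mask[attn <= bg_thresh] = cv2.GC_BGD       # near-zero attention seeds definite background
    bgd_model = np.zeros((1, 65), np.float64)  # internal GMM state required by grabCut
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model, iters, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```

The resulting mask can then be used to crop or blank out the background before extracting CNN features from the segmented image, as the abstract describes.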