Introduction
Fine-grained image categorization extends traditional classification and aims to distinguish the subclasses of a given category, for example, species of birds\cite{bird:cub,bird:na} or breeds of dogs\cite{dog:stanford}. It remains a challenging problem in computer vision because it places high demands on the discriminative power of the features. Traditional image categorization algorithms generally describe an image by extracting features and then apply a classifier. Early methods relied on the color, texture, edges, and similar properties of the entire image, combined with supervised or unsupervised classification algorithms. The visual features used by such methods have very high dimension and are expensive to compute, and they reflect the spatial relationships within an image poorly. Later, some scholars divided the image into independent regions to extract local features, a pipeline that can be summarized in four stages: local feature extraction, feature coding, feature aggregation, and classification. Modeling with local features offers great flexibility and expressiveness, but these features depend heavily on manual selection, which requires considerable time, experience, and professional knowledge.
The development and success of neural networks\cite{network:2,network:1,network:3} in recent years have given researchers hope for solving the problem. Scholars have started to use a Region Proposal Network (RPN)\cite{rpn:rpn} to automatically generate candidate regions, which are then scaled to a fixed size and fed into a feature extraction network. Although an RPN can automatically select local regions of the input image, it has some shortcomings. First, an RPN generates a large number of candidate regions to ensure coverage of key features. Second, an RPN usually needs to be trained separately, so using it not only increases the computational cost but also complicates the training of the network. To alleviate these problems, some scholars have proposed new methods, such as those based on attention mechanisms\cite{sknet:sknet,senet:senet} and feature coding\cite{vit:vit}. These methods do not require complex annotation and can easily be trained end to end. They are increasingly becoming the mainstream approach in the community for fine-grained image categorization.
In recent years, the transformer\cite{vit:vit,swin:swin}, a method originally applied to natural language processing, has been applied to computer vision. A vision transformer splits an image into a sequence of patches and captures the important regions of the image through its self-attention mechanism. Transformers have shown promising performance, and a series of follow-up work on downstream tasks, such as image classification, object detection\cite{detection:detection}, and semantic segmentation\cite{seg:1,seg:2}, has also demonstrated their effectiveness in the field of vision.
Squeeze-and-Excitation Networks (SENet)\cite{senet:senet} is a network structure proposed by Hu et al. for convolutional neural networks. It learns the importance of each feature channel automatically and explicitly models the interdependencies between feature channels. Based on this importance, useful features are promoted and useless features are suppressed.
Inspired by the above work, we propose a feature weight unit. Specifically, we treat the encoded features as high-dimensional representations and use them to generate a set of weight coefficients. Each weight coefficient indicates the importance of the corresponding feature in each network layer. Finally, all features are weighted by their coefficients. The performance of our model on the CUB-200-2011 and Stanford Dogs datasets is shown in Fig. 1. Overall, our contributions are:
1. Proposing a feature weight unit to represent the importance of each feature.
2. Verifying the feasibility of our proposed method on public datasets.
Related Work
In this section, we briefly review the methods that have been developed for fine-grained image categorization.
Traditional Methods
Early research on fine-grained images relied mainly on hand-crafted features. In 2011, Wah et al. compiled and published the CUB-200-2011 dataset\cite{bird:cub}, which opened up fine-grained research. Subsequently, Farrell et al.\cite{trandition1} proposed first training a pose classifier at a coarse-grained level to extract local position information and then building a fine-grained model on top of it. Liu et al.\cite{trandition2} used part localization to extract features: they built contour models of the whole dog and its face and then used feature matching for localization. Yang et al.\cite{trandition3} proposed an unsupervised template learning method to capture locally common shape patterns and the interrelationships of objects. Berg et al.\cite{trandition4} proposed POOF features, which make pairwise comparisons between categories to obtain representations of specific parts. Because hand-crafted features have weak representational power, these methods did not greatly improve classification performance, and their limited part-localization ability reduced it further.
Deep Learning Methods
The emergence of deep learning has fundamentally changed image classification research. Deep models mine information from data more effectively and learn more powerful and robust features, which has substantially improved classification performance.
The attention mechanism is a data processing method widely used in computer vision. Depending on how and where the attention weights are applied, attention mechanisms can be broadly classified into three kinds: spatial domain, channel domain, and hybrid domain.
A convolutional neural network outputs a feature map of size $C \times H \times W$ at each layer, where $C$ is the number of channels and $H$ and $W$ denote the height and width of the feature map. Spatial attention learns an $H \times W$ weight matrix that assigns a weight to each position of the feature map. Examples include Self-Attention\cite{self-attention}, Non-local Attention\cite{non-local}, and Spatial Transformer\cite{sp}.
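As a rough illustration of the spatial-domain idea, the sketch below applies a single H x W weight matrix to every channel of a C x H x W feature map. It is not taken from any of the cited methods: the function name `spatial_attention` and the choice of a softmax over channel-averaged activations are assumptions made purely for illustration, whereas real modules learn the weight map.

```python
import math


def spatial_attention(feature_map):
    """Reweight a C x H x W feature map (nested lists) per spatial position.

    Assumption: the weight map is a softmax over channel-averaged
    activations. Learned spatial-attention modules replace this step.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])

    # Channel-averaged activation at every (h, w) position.
    scores = [
        [sum(feature_map[c][h][w] for c in range(channels)) / channels
         for w in range(width)]
        for h in range(height)
    ]

    # Softmax over all H*W positions, so the weights sum to 1.
    max_score = max(max(row) for row in scores)
    exp_scores = [[math.exp(s - max_score) for s in row] for row in scores]
    total = sum(sum(row) for row in exp_scores)
    weights = [[e / total for e in row] for row in exp_scores]

    # The same H x W weight matrix is applied to every channel.
    weighted = [
        [[feature_map[c][h][w] * weights[h][w] for w in range(width)]
         for h in range(height)]
        for c in range(channels)
    ]
    return weights, weighted


# Toy 2 x 2 x 2 feature map: the bottom-right position is the most active.
fmap = [[[0.0, 1.0], [2.0, 3.0]], [[0.0, 1.0], [2.0, 3.0]]]
weights, weighted = spatial_attention(fmap)
```

In this toy input the bottom-right position receives the largest weight, because it has the highest channel-averaged activation.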
Channel-domain attention applies a weight to each channel to represent the relevance of that channel to the key information: the larger the weight, the higher the relevance. Representative methods include SENet\cite{senet:senet}, SKNet\cite{sknet:sknet}, and ECANet\cite{eca}.
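The following minimal sketch shows how a squeeze-and-excitation style channel weight could be computed: global average pooling per channel (squeeze), two small fully connected layers with ReLU and sigmoid (excitation), and a per-channel rescaling. The helper names, the omission of bias terms, and the hand-picked weight matrices are assumptions for illustration; in a real network the matrices are learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def se_channel_weights(feature_map, w_reduce, w_expand):
    """Return one weight in (0, 1) per channel of a C x H x W feature map."""
    # Squeeze: global average pooling reduces each channel to a scalar.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: FC -> ReLU -> FC -> sigmoid (bias terms omitted here).
    hidden = [max(0.0, h) for h in matvec(w_reduce, squeezed)]
    return [sigmoid(z) for z in matvec(w_expand, hidden)]


def apply_channel_weights(feature_map, weights):
    """Scale every value of channel c by weights[c]."""
    return [
        [[value * weight for value in row] for row in channel]
        for channel, weight in zip(feature_map, weights)
    ]


# Toy example: 2 channels of size 2 x 2, bottleneck of size 1.
fmap = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
w_reduce = [[1.0, -1.0]]   # hypothetical weights, shape 1 x 2
w_expand = [[2.0], [-2.0]]  # hypothetical weights, shape 2 x 1
channel_weights = se_channel_weights(fmap, w_reduce, w_expand)
rescaled = apply_channel_weights(fmap, channel_weights)
```

With these hand-picked matrices the first channel is promoted and the second is suppressed, which mirrors the "larger weight, higher relevance" reading of channel attention.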
Hybrid-domain attention combines spatial and channel attention and improves network performance by exploiting both kinds of information. Examples include CBAM\cite{cbam}, DANet\cite{danet}, CCNet\cite{ccnet}, and Residual Attention\cite{resattention}.
In recent years, the transformer, originally used in natural language processing, has been applied to computer vision. ViT\cite{vit:vit} was the first work to show that applying a pure transformer directly to a sequence of image patches can produce satisfactory results. Transformers were later extended to other tasks, such as object detection and semantic segmentation.
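To make the notion of a "sequence of image patches" concrete, the sketch below splits a single-channel image into non-overlapping square patches and flattens each one. The function name `split_into_patches` and the single-channel, nested-list representation are assumptions for illustration; an actual ViT additionally projects each flattened patch linearly and adds position embeddings.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened square patches.

    Patches are non-overlapping and returned in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a vector.
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# A 4 x 4 image with 2 x 2 patches yields a sequence of 4 patches.
image = [[4 * row + col for col in range(4)] for row in range(4)]
sequence = split_into_patches(image, 2)
# sequence == [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

The same arithmetic explains the sequence length in practice: a 224 x 224 input with 16 x 16 patches gives 14 x 14 = 196 patches.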
Building on ViT, Zheng et al. proposed SETR\cite{seg:2}, which uses ViT as an encoder for segmentation. He et al. proposed TransReID\cite{transid}, which embeds side information and adds a jigsaw patch module (JPM) to the transformer to improve the performance of object re-identification. He et al. proposed TransFG\cite{transfg}, which introduces the transformer into the field of fine-grained classification.
In the context of deep learning optimization relevant to fine-grained image classification, Yang et al.\cite{9376703} proposed a two-stage selective ensemble of CNN branches via deep tree training to mitigate vanishing gradients and overfitting. Du et al.\cite{du2023global} developed a global and local mixture consistency cumulative learning strategy to handle long-tailed data and alleviate head-class bias. Wang et al.\cite{9873970} designed an information-maximizing adaptation network with label distribution priors to enhance model generalization across domains.
Experiment Analysis
To evaluate our method more thoroughly, this section adopts additional evaluation metrics and analyzes the experimental results in detail. All experiments are conducted on the CUB-200-2011 dataset. We use accuracy, macro-precision, and macro-recall as our evaluation metrics; the experimental results are shown in Table 3.
Table 3
Results of our method on accuracy, macro-precision and macro-recall
| Method | Accuracy | Macro-precision | Macro-recall |
|---|---|---|---|
| Ours | 91.4% | 90.6% | 90.4% |
Table 4
The number of categories in different accuracy ranges
| Accuracy | Number |
|---|---|
| 1.0 | 97 |
|  | 42 |
|  | 38 |
|  | 10 |
|  | 6 |
|  | 6 |
|  | 1 |
In a multi-class classification task, macro-precision and macro-recall are calculated in two steps: (i) calculate the precision and recall for each category; (ii) average the precision and recall over all categories. Precision indicates how many of the samples predicted as positive are actually positive, and recall indicates how many of the actual positive samples are correctly predicted. The experimental results show that our method performs well on both macro-precision and macro-recall.
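The two steps can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the function name `macro_metrics`, the convention of scoring a class with no predictions (or no samples) as 0, and the toy labels are all assumptions.

```python
def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro-precision, macro-recall) for label lists."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []

    # Step (i): precision and recall for each category.
    for cls in classes:
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)  # predicted as cls
        actual = sum(1 for t in y_true if t == cls)     # truly cls
        # Assumed convention: an undefined ratio counts as 0.
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)

    # Step (ii): unweighted average over all categories.
    macro_precision = sum(precisions) / len(classes)
    macro_recall = sum(recalls) / len(classes)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return accuracy, macro_precision, macro_recall


# Toy example with three classes and six samples.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, macro_precision, macro_recall = macro_metrics(y_true, y_pred)
# Per-class precision: 1/2, 2/3, 1/1; per-class recall: 1/2, 2/2, 1/2.
```

Because every category contributes equally to the average, a few poorly recognized subclasses lower the macro scores even when overall accuracy is high.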
Since macro evaluation metrics are affected by every subclass, we conjectured that the precision and recall of our method on individual subclasses would also be satisfactory. We therefore counted the number of categories in different accuracy ranges; the results are shown in Table 4. We offer a possible explanation for these results by comparing images in the dataset. As shown in Figure 3, birds that the model predicts well tend to have distinctive features, such as the head of d and the beaks of e and f. For images on which the model performs poorly, the bird's distinguishing features are not obvious enough (e.g., a) or the background occupies a large part of the image (e.g., b and c).
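A per-category breakdown of this kind could be produced along the following lines. The sketch is hypothetical: the function names, the separate bucket for categories with accuracy exactly 1.0, and in particular the range boundaries passed as `edges` are assumptions, since they are not the boundaries used for Table 4.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples for each true category."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: correct[cls] / total[cls] for cls in total}


def count_by_accuracy_range(class_accuracy, edges):
    """Count categories per half-open range [lo, hi), plus a bucket for 1.0.

    `edges` must start at 0.0 and end at 1.0 so every accuracy is covered.
    """
    ranges = list(zip(edges[:-1], edges[1:]))
    counts = {"1.0": 0}
    for lo, hi in ranges:
        counts[f"[{lo}, {hi})"] = 0

    for acc in class_accuracy.values():
        if acc == 1.0:
            counts["1.0"] += 1
            continue
        for lo, hi in ranges:
            if lo <= acc < hi:
                counts[f"[{lo}, {hi})"] += 1
                break
    return counts


# Toy example: three categories with accuracies 1.0, 0.5 and 0.75.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2, 2, 1]
accuracies = per_class_accuracy(y_true, y_pred)
# Hypothetical boundaries, chosen only for this example.
histogram = count_by_accuracy_range(accuracies, (0.0, 0.5, 0.9, 1.0))
```

The counts over all buckets always sum to the number of categories, which gives a quick sanity check on the breakdown.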
In the ablation study, we selected TransFG as the backbone of our network and ran two groups of experiments: one uses our patch impact factor module (PIMF) and the other does not. All other experimental settings are kept identical. From the results in Table 3, we can see that using the PIMF module improves accuracy from 90.2% to 91.4%, a gain of 1.2 percentage points. The ablation study indicates that our proposed module helps improve the performance of fine-grained image categorization models.
References:
Van Horn, Grant and Branson, Steve and Farrell, Ryan and Haber, Scott and Barry, Jessie and Ipeirotis, Panos and Perona, Pietro and Belongie, Serge (2015) Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. 595--604, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Wah, Catherine and Branson, Steve and Welinder, Peter and Perona, Pietro and Belongie, Serge (2011) The caltech-ucsd birds-200-2011 dataset. California Institute of Technology
Khosla, Aditya and Jayadevaprakash, Nityananda and Yao, Bangpeng and Li, Fei-Fei (2011) Novel dataset for fine-grained image categorization: Stanford dogs. Citeseer, 1, 2, Proc. CVPR workshop on fine-grained visual categorization (FGVC)
Simonyan, Karen and Zisserman, Andrew (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E (2017) Imagenet classification with deep convolutional neural networks. Communications of the ACM 60(6): 84--90 ACM New York, NY, USA
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian (2016) Deep residual learning for image recognition. 770--778, Proceedings of the IEEE conference on computer vision and pattern recognition
Ding, Yao and Zhou, Yanzhao and Zhu, Yi and Ye, Qixiang and Jiao, Jianbin (2019) Selective sparse sampling for fine-grained image recognition. 6599--6608, Proceedings of the IEEE/CVF International Conference on Computer Vision
Liu, Chuanbin and Xie, Hongtao and Zha, Zheng-Jun and Ma, Lingfeng and Yu, Lingyun and Zhang, Yongdong (2020) Filtration and distillation: Enhancing region attention for fine-grained visual categorization. 11555--11562, 07, 34, Proceedings of the AAAI Conference on Artificial Intelligence
Ren, Shaoqing and He, Kaiming and Girshick, Ross and Sun, Jian (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28
Huang, Xin and Wang, Xinxin and Lv, Wenyu and Bai, Xiaying and Long, Xiang and Deng, Kaipeng and Dang, Qingqing and Han, Shumin and Liu, Qiwen and Hu, Xiaoguang and others (2021) PP-YOLOv2: A practical object detector. arXiv preprint arXiv:2104.10419
Hu, Jie and Shen, Li and Sun, Gang (2018) Squeeze-and-excitation networks. 7132--7141, Proceedings of the IEEE conference on computer vision and pattern recognition
Li, Xiang and Wang, Wenhai and Hu, Xiaolin and Yang, Jian (2019) Selective kernel networks. 510--519, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wang, Qilong and Wu, Banggu and Zhu, Pengfei and Li, Peihua and Zuo, Wangmeng and Hu, Qinghua (2020) ECA-Net: Efficient channel attention for deep convolutional neural networks. 13--19, Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, WA, USA
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia (2017) Attention is all you need. Advances in neural information processing systems 30
Wang, Xiaolong and Girshick, Ross and Gupta, Abhinav and He, Kaiming (2018) Non-local neural networks. 7794--7803, Proceedings of the IEEE conference on computer vision and pattern recognition
Jaderberg, Max and Simonyan, Karen and Zisserman, Andrew and others (2015) Spatial transformer networks. Advances in neural information processing systems 28
Woo, Sanghyun and Park, Jongchan and Lee, Joon-Young and Kweon, In So (2018) Cbam: Convolutional block attention module. 3--19, Proceedings of the European conference on computer vision (ECCV)
Fu, Jun and Liu, Jing and Tian, Haijie and Li, Yong and Bao, Yongjun and Fang, Zhiwei and Lu, Hanqing (2019) Dual attention network for scene segmentation. 3146--3154, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Huang, Zilong and Wang, Xinggang and Huang, Lichao and Huang, Chang and Wei, Yunchao and Liu, Wenyu (2019) Ccnet: Criss-cross attention for semantic segmentation. 603--612, Proceedings of the IEEE/CVF international conference on computer vision
Wang, Fei and Jiang, Mengqing and Qian, Chen and Yang, Shuo and Li, Cheng and Zhang, Honggang and Wang, Xiaogang and Tang, Xiaoou (2017) Residual attention network for image classification. 3156--3164, Proceedings of the IEEE conference on computer vision and pattern recognition
Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and others (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining (2021) Swin transformer: Hierarchical vision transformer using shifted windows. 10012--10022, Proceedings of the IEEE/CVF International Conference on Computer Vision
Carion, Nicolas and Massa, Francisco and Synnaeve, Gabriel and Usunier, Nicolas and Kirillov, Alexander and Zagoruyko, Sergey (2020) End-to-end object detection with transformers. Springer, 213--229, European conference on computer vision
Chen, Jieneng and Lu, Yongyi and Yu, Qihang and Luo, Xiangde and Adeli, Ehsan and Wang, Yan and Lu, Le and Yuille, Alan L and Zhou, Yuyin (2021) Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306
Zheng, Sixiao and Lu, Jiachen and Zhao, Hengshuang and Zhu, Xiatian and Luo, Zekun and Wang, Yabiao and Fu, Yanwei and Feng, Jianfeng and Xiang, Tao and Torr, Philip HS and others (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. 6881--6890, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
He, Shuting and Luo, Hao and Wang, Pichao and Wang, Fan and Li, Hao and Jiang, Wei (2021) Transreid: Transformer-based object re-identification. 15013--15022, Proceedings of the IEEE/CVF international conference on computer vision
He, Ju and Chen, Jie-Neng and Liu, Shuai and Kortylewski, Adam and Yang, Cheng and Bai, Yutong and Wang, Changhu (2022) Transfg: A transformer architecture for fine-grained recognition. 852--860, 1, 36, Proceedings of the AAAI Conference on Artificial Intelligence
Farrell, Ryan and Oza, Om and Zhang, Ning and Morariu, Vlad I and Darrell, Trevor and Davis, Larry S (2011) Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. IEEE, 161--168, 2011 International Conference on Computer Vision
Liu, Jiongxin and Kanazawa, Angjoo and Jacobs, David W and Belhumeur, Peter N (2012) Dog Breed Classification Using Part Localization. 172--185, ECCV (1)
Yang, Shulin and Bo, Liefeng and Wang, Jue and Shapiro, Linda (2012) Unsupervised template learning for fine-grained object recognition. Advances in neural information processing systems 25
Berg, Thomas and Belhumeur, Peter N (2013) Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. 955--962, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Yang, Yun and Hu, Yuanyuan and Zhang, Xingyi and Wang, Song (2022) Two-Stage Selective Ensemble of CNN via Deep Tree Training for Medical Image Classification. IEEE Transactions on Cybernetics 52(9): 9194--9207 https://doi.org/10.1109/TCYB.2021.3061147
Wang, Pei and Yang, Yun and Xia, Yuelong and Wang, Kun and Zhang, Xingyi and Wang, Song (2023) Information Maximizing Adaptation Network With Label Distribution Priors for Unsupervised Domain Adaptation. IEEE Transactions on Multimedia 25: 6026--6039 https://doi.org/10.1109/TMM.2022.3203574
Du, Fei and Yang, Peng and Jia, Qi and Nan, Fengtao and Chen, Xiaoting and Yang, Yun (2023) Global and Local Mixture Consistency Cumulative Learning for Long-tailed Visual Recognitions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), arXiv preprint arXiv:2305.08661