Feature Fusion Units for Fine-grained Image Categorization

Hua Zhao (1), Email: iris5280@126.com

Zujun Liu (1), Email: liuzj@smartsteps.com

Bin Yang (2), Email: researcher_yang@outlook.com

Tianyu Lu (3), Email: lty11037@bupt.edu.cn

Ying Xing (3) ✉, Email: xingying@bupt.edu.cn

(1) Smart Steps Digital Technology Co., Ltd., Chengfang Street, 100033 Beijing, China

(2) China Unicom Research Institute, Shouti South Road, 100037 Beijing, China

(3) School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Xitucheng Road, 100876 Beijing, China

Abstract

Fine-grained image categorization aims to distinguish subclasses within a category by exploiting detailed features. It remains a challenging problem in computer vision because the differences between subclasses are small. Traditional methods usually locate discriminative features through manual annotation, fixed sliding windows, hand-tuned thresholds, and similar techniques. These methods are both costly and of limited effectiveness. In computer vision, the transformer greatly improves categorization accuracy by repeatedly computing attention scores between parts of the image and weighting them accordingly. In this paper, we propose a feature weight unit. Specifically, a transformer is used as the backbone to capture image features (called patches in the transformer), and all patches are then weighted by our feature weight unit. The output of the unit represents how much attention each patch should receive. To verify the effectiveness of our method, we conducted experiments on the CUB-200-2011 and Stanford Dogs datasets.

Keywords

Fine-grained image categorization

transformer

computer vision

feature weight unit

Hua Zhao, Zujun Liu, Bin Yang and Tianyu Lu: These authors contributed equally to this work.

Introduction

Fine-grained image categorization is an extension of traditional classification that aims to classify the subclasses of a given category, for example, subclasses of birds\cite{bird:cub,bird:na} or subclasses of dogs\cite{dog:stanford}. It has long been a pressing problem in computer vision because it places high demands on the features used for classification. Traditional image categorization algorithms generally describe images by extracting features and then use a classifier to classify them. Early methods were based on the color, texture, edges, and similar properties of the entire image, combined with supervised or unsupervised classification algorithms. The visual features used by such methods are very high-dimensional and expensive to compute, and they struggle to reflect the spatial relationships within an image. Later, some scholars divided the image into independent regions to extract local features. This approach can be summarized in four stages: local feature extraction, feature coding, feature aggregation, and classification. Modeling with local features offers great flexibility and expressiveness, but these features depend heavily on manual selection, which requires considerable time, experience, and professional knowledge.

The development and success of neural networks\cite{network:2,network:1,network:3} in recent years have given researchers hope of solving the problem. Scholars have started to use the RPN (Region Proposal Network)\cite{rpn:rpn} to automatically generate candidate regions. The candidate regions are then scaled to a specific size and fed into a feature extraction network. Although the RPN can automatically select local areas of the input image, it still has shortcomings. First, the RPN generates a large number of candidate regions to ensure coverage of key features. Second, the RPN usually needs to be trained separately, so using it not only increases the computational effort but also complicates the training of the network. To alleviate these problems, some scholars have proposed new methods, such as methods based on attention mechanisms\cite{sknet:sknet,senet:senet} and feature coding\cite{vit:vit}. These methods do not require complex annotation information and can easily be trained end to end. This approach is increasingly becoming the mainstream way for the community to solve fine-grained image categorization problems.

In recent years, the transformer\cite{vit:vit,swin:swin}, a method originally applied to natural language processing, has been applied to computer vision. The transformer segments an image into a series of patches and captures important regions of the image using its self-attention mechanism. It has shown promising performance, and a series of extensions to downstream tasks, such as object classification, object detection\cite{detection:detection}, and semantic segmentation\cite{seg:1,seg:2}, has also demonstrated the effectiveness of the transformer in the field of vision.

Squeeze-and-Excitation Networks (SENet)\cite{senet:senet} is a network structure proposed by Hu et al. for convolutional neural networks. It learns the importance of each feature channel automatically and explicitly models the interdependencies between feature channels. Based on this importance, useful features are promoted and useless features are suppressed.

Inspired by the above work, we propose a feature weight unit. Specifically, we take the encoded features as unique high-dimensional representations and use them to generate a set of weight coefficients. The weight coefficients indicate the importance of the corresponding feature in each network layer. Finally, all features are weighted by their weight coefficients. The performance of our model on the CUB-200-2011 and Stanford Dogs datasets is shown in Fig. 1. Overall, our contributions are:

1. Proposing a feature weight unit to represent the importance of feature.

2. Verifying the feasibility of our proposed method on public datasets.

Fig. 1

Performance of our model on CUB-200-2011 dataset and Stanford Dogs dataset

Related Work

In this section, we will briefly review the methods used to develop fine-grained image categorization.

Traditional Method

Early research on fine-grained images was mainly based on traditional hand-crafted features. In 2011, Wah et al. compiled and published the CUB-200-2011 dataset\cite{bird:cub}, which opened the way for fine-grained research. Subsequently, Farrell et al.\cite{trandition1} proposed first training a pose classifier at a coarse-grained level to extract local position information, and then building a fine-grained model on top of it. Liu et al.\cite{trandition2} used part localization to extract features: they built contour models of the whole dog and its face, and then used feature matching for localization. Yang et al.\cite{trandition3} proposed unsupervised template learning to capture locally common shape patterns and the interrelationships of objects. Berg et al.\cite{trandition4} proposed POOF features, which make pairwise comparisons between different categories to obtain characteristic representations of specific parts. Because hand-crafted features have weak representational ability, classification performance did not improve greatly. In addition, the lack of part localization ability further reduced classification performance.

Deep Learning Method

The emergence of deep learning has brought earth-shaking changes to image classification research. Deep models can better mine information from data and learn more powerful and robust features, bringing revolutionary improvement in classification effect.

The attention mechanism is a data processing method used in computer vision. According to how and where the attention weights are applied, attention mechanisms can be broadly classified into three kinds: spatial domain, channel domain, and hybrid domain.

A convolutional neural network (CNN) outputs a feature map of size C x H x W at each layer, where C is the number of channels and H and W denote the height and width of the feature map. Spatial attention learns a weight matrix that represents the weight of each position on the H x W feature map. Examples include Self-Attention\cite{self-attention}, Non-local Attention\cite{non-local}, and Spatial Transformer\cite{sp}.

Channel domain attention applies a weight to each channel to represent the relevance of that channel to the key information. The larger the weight, the higher the relevance. Representative methods include SENet\cite{senet:senet}, SKNet\cite{sknet:sknet}, and ECANet\cite{eca}.

The hybrid domain attention mechanism combines spatial and channel attention. It improves network performance by combining channel and spatial information. Examples include CBAM\cite{cbam}, DANet\cite{danet}, CCNet\cite{ccnet}, and Residual Attention\cite{resattention}.

In recent years, the transformer, originally used for natural language processing, has been applied to computer vision. ViT\cite{vit:vit} is the first work to show that applying a pure transformer directly to a sequence of image patches can produce satisfactory results. Later, the transformer was extended to other tasks, such as object detection and semantic segmentation.

Based on ViT, Zheng et al. proposed SETR\cite{seg:2}, which uses ViT as an encoder for segmentation. He et al. proposed TransReID\cite{transid}, which embeds side information with JPM into the transformer to improve the performance of object re-identification. He et al. proposed TransFG\cite{transfg}, which introduces the transformer into the field of fine-grained classification.

In the context of deep learning optimization relevant to fine-grained image classification, Yang et al.\cite{9376703} proposed a two-stage selective ensemble of CNN branches via deep tree training to mitigate vanishing gradients and overfitting. Du et al.\cite{9873970} developed a global and local mixture consistency cumulative learning strategy to handle long-tailed data and alleviate head class bias. Wang et al.\cite{du2023global} designed an information maximization adaptation network with label distribution priors to enhance model generalization across domains.

Method

In this section, we introduce our approach in two parts. The first part briefly introduces our backbone. The second part introduces our method.

Backbone

Patch Embedding

The patch embedding in ViT (Vision Transformer) transforms the original 2-dimensional image into a sequence of 1-dimensional patch embeddings. Assume that the input image x has dimension $H \times W \times C$, denoting the height, width, and number of channels, respectively. The patch embedding operation divides the input image into N patches, each of size $P^2C$, where

$N=\frac{H \times W}{P^2}$

$x^{N \times P^2C}=PatchEmbedding(x^{H \times W \times C})$

The patches are then projected onto a space of dimension D by a mapping:

$x^{N \times D}=Project(x^{N \times P^2C})$

After the mapping, position encoding is added to all patches:

$z_0=x+E_{position}$

where $z_0$ is the input to the first encoding and $E_{position}\in\mathbb{R}^{N \times D}$ denotes the position embedding.
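The patch embedding steps can be illustrated with a minimal NumPy sketch. The patch size P = 16, the embedding dimension D = 768, the random projection matrix, and the zero position embedding are assumptions chosen for illustration (the class token used by ViT is omitted); only the 448 x 448 input size comes from the experimental setup.

```python
import numpy as np

def patch_embed(image, patch_size, proj, pos_embed):
    """Split an H x W x C image into N flattened patches, project to D, add positions."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be divisible by the patch size"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)
    tokens = patches @ proj        # projection to (N, D)
    return tokens + pos_embed      # z_0 = x + E_position

rng = np.random.default_rng(0)
H, W, C, P, D = 448, 448, 3, 16, 768      # P and D are illustrative assumptions
N = (H * W) // (P * P)                    # N = HW / P^2
image = rng.standard_normal((H, W, C))
proj = rng.standard_normal((P * P * C, D)) * 0.02   # stand-in for learned weights
pos_embed = np.zeros((N, D))                        # stand-in for learned positions
z0 = patch_embed(image, P, proj, pos_embed)
print(N, z0.shape)  # 784 (784, 768)
```

With these assumed values, a 448 x 448 image yields 28 x 28 = 784 patches, which shows why higher input resolutions increase the sequence length seen by the encoder.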


Encoder

In the transformer, the encoder applies the encoding operation multiple times, and each encoding consists of a multi-head self-attention (MSA) block followed by a multi-layer perceptron (MLP) block. The l-th encoding can be written as follows:

$z^{\prime}_{l}=MSA(LN(z_{l-1}))+z_{l-1}$

$z_l=MLP(LN(z_{l}^{\prime}))+z_{l}^{\prime}$

where LN() denotes the layer normalization operation, and $z_l^{\prime}$ and $z_l$ denote the features after MSA and MLP encoding, respectively.
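The two residual equations of one encoding can be sketched in NumPy as follows. This is a simplified illustration rather than the backbone's actual implementation: the weight names, the tiny dimensions, the omission of biases and learnable LN parameters, and the tanh approximation of GELU are all assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LN over the feature dimension (learnable scale and bias omitted for brevity)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def msa(x, w_q, w_k, w_v, w_o, num_heads):
    n, d = x.shape
    d_head = d // num_heads
    def split(t):  # (N, D) -> (heads, N, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (heads, N, N)
    merged = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return merged @ w_o, attn

def encoder_layer(z_prev, p, num_heads):
    attn_out, attn = msa(layer_norm(z_prev), p["w_q"], p["w_k"], p["w_v"], p["w_o"], num_heads)
    z_mid = attn_out + z_prev                                       # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
    z_out = gelu(layer_norm(z_mid) @ p["w_1"]) @ p["w_2"] + z_mid   # z_l = MLP(LN(z'_l)) + z'_l
    return z_out, attn

rng = np.random.default_rng(0)
N, D, HEADS = 5, 8, 2   # toy sizes for illustration
params = {name: rng.standard_normal((D, D)) * 0.1 for name in ("w_q", "w_k", "w_v", "w_o")}
params["w_1"] = rng.standard_normal((D, 4 * D)) * 0.1
params["w_2"] = rng.standard_normal((4 * D, D)) * 0.1
z_l, attn = encoder_layer(rng.standard_normal((N, D)), params, HEADS)
print(z_l.shape, attn.shape)  # (5, 8) (2, 5, 5)
```

The function also returns the per-head attention weights, since later stages of the pipeline operate on them.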

Part Selection Module

To fully exploit the information in MSA, the weights of each attention head are fused and selected. Assuming that the input of the last encoding is $z_{last-1}$, the weight of each attention head in each encoding can be expressed as:

$A_i=[p_1,p_2,\ldots,p_N], \qquad i \in \{1,2,\ldots,k\}$

$W=[W_{A_1},W_{A_2},\ldots,W_{A_k}]=[W_{p_j^i}], \qquad i \in \{1,\ldots,k\},\ j\in \{1,\ldots,N\}$

where A denotes an attention head, k denotes the number of attention heads in MSA, p denotes a patch, N denotes the number of patches, and W denotes the weight of the corresponding attention head in each encoding. All weights are then fused by cumulative multiplication:

$Score_j^{i}=\prod_{t=1}^{last-1} W^{t}_{p_j^i}, \qquad i \in \{1,\ldots,k\},\ j \in \{1,\ldots,N\}$

where i indexes the attention head, j indexes the patch, t indexes the encoding, and $W^{t}$ denotes the weights in the t-th encoding. Once the scores of all patches are available, the index of the highest-scoring patch in each attention head is obtained, and the patches at those indices in $z_{last-1}$ are input to the classification network.
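The fusion and selection step can be sketched as below. The sketch reads the cumulative multiplication literally, as an element-wise product of per-patch weights over the encodings; this reading, the array layout (T, k, N), and the toy numbers are assumptions, and the backbone's own implementation may combine attention maps differently.

```python
import numpy as np

def select_patches(weights, z_last):
    """weights: (T, k, N) patch weights per encoding and head; z_last: (N, D) features."""
    scores = np.prod(weights, axis=0)   # (k, N): product over the T encodings
    idx = scores.argmax(axis=1)         # highest-scoring patch index for each head
    return idx, z_last[idx]             # (k,), (k, D)

# Toy example: T = 2 encodings, k = 2 heads, N = 3 patches, D = 2 features
weights = np.array([
    [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]],
    [[0.1, 0.7, 0.2], [0.2, 0.2, 0.6]],
])
z_last = np.arange(6, dtype=float).reshape(3, 2)
idx, selected = select_patches(weights, z_last)
print(idx.tolist())       # [1, 0]
print(selected.tolist())  # [[2.0, 3.0], [0.0, 1.0]]
```

In the toy example, head 0 accumulates the scores [0.02, 0.35, 0.06] and therefore selects patch 1, while head 1 accumulates [0.12, 0.06, 0.06] and selects patch 0.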

Our Method

The feature weight unit is a tool for calculating the importance of patches. It assumes that each patch has a unique high-dimensional feature representation and that these high-dimensional features can represent the contribution of each patch to the final result. The feature weight unit can be summarized as two operations: mapping and fusion. Mapping means that the high-dimensional features are mapped to generate the impact factor of each patch. Fusion scales the high-dimensional features of each patch by multiplying them by the impact factor.

Our method is shown in Figure 2. After the input image is segmented into patches and passed through the projection, position embedding is applied to generate the features that are fed into the encoder layer. At the end of the encoder layer, we use the feature weight unit to calculate the importance of the patches; after processing by the feature weight unit, the patches are fed into the next encoder layer. Finally, the patches selected by the score calculation are used as the final features for classification.
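The mapping and fusion operations can be sketched as follows. The text does not specify the form of the mapping, so the single linear layer followed by a sigmoid used here (in the spirit of the gating in SENet) is purely an assumption, as are the names `w` and `b`.

```python
import numpy as np

def feature_weight_unit(patches, w, b=0.0):
    """patches: (N, D) patch features; w: (D,) hypothetical mapping weights."""
    logits = patches @ w + b                   # mapping: one scalar per patch
    factor = 1.0 / (1.0 + np.exp(-logits))     # impact factor in (0, 1) (assumed sigmoid)
    return patches * factor[:, None], factor   # fusion: scale each patch by its factor

patches = np.ones((3, 4))
w = np.zeros(4)            # zero weights give a neutral impact factor of 0.5
weighted, factor = feature_weight_unit(patches, w)
print(factor.tolist())     # [0.5, 0.5, 0.5]
print(weighted[0].tolist())  # [0.5, 0.5, 0.5, 0.5]
```

Because the output has the same shape as the input, a unit of this kind can be placed at the end of an encoder layer without changing what the next layer expects.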

Fig. 2

Overview of our method

Experiment

In this section, we will introduce the dataset and then detail the parameter settings for our experiments.

Experiment Setup

Dataset

We conducted our experiments on the CUB-200-2011 and Stanford Dogs datasets. The CUB-200-2011 dataset contains 200 bird classes with a total of 11,788 images, and the Stanford Dogs dataset contains 120 dog classes with a total of 20,580 images. The division into training and validation sets is shown in Table 1.

Table 1
Detailed information on the datasets
Name	Train	Validation	Total
CUB-200-2011	5996	5792	11788
Stanford Dogs	12000	8580	20580

Implementation detail

We train our model with the following settings: an image size of $448 \times 448$; data augmentation using random horizontal flipping, random vertical flipping, and random cropping; a model pre-trained on ImageNet-21k; the SGD optimizer with a learning rate of 0.01 and a momentum of 0.9; and cosine annealing as the learning rate scheduler. Our hardware consists of an A6000 GPU (48 GB) and an i9-12900KF CPU.
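For reference, a cosine annealing schedule starting from the learning rate of 0.01 can be sketched in plain Python. The total number of steps and the minimum learning rate of 0 are assumptions, since neither is stated in the text.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr=0.01, min_lr=0.0):
    """Learning rate at `step`, decaying from base_lr to min_lr along a half cosine."""
    cos = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)

TOTAL_STEPS = 10000  # hypothetical training length
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, round(cosine_annealing_lr(step, TOTAL_STEPS), 6))
```

The schedule starts at the base learning rate, reaches half of it at the midpoint, and decays to the minimum at the final step.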

Quantitative Analysis

We compared the proposed method with state-of-the-art (SOTA) methods on the Stanford Dogs and CUB-200-2011 datasets. The experimental results are shown in Table 2.

Table 2
Accuracy comparison of different methods on datasets
Method	Dataset	Acc
TransFG	CUB-200-2011	90.2%
TransFG	Stanford Dogs	91.8%
OSME-MAMC	CUB-200-2011	86.2%
OSME-MAMC	Stanford Dogs	84.8%
Ours	CUB-200-2011	91.4%
Ours	Stanford Dogs	91.2%

In the comparison study, we chose accuracy (Acc) as our evaluation metric. The experimental results are shown in Table 2. On the CUB-200-2011 dataset, our method achieves the highest accuracy, improving on TransFG and OSME-MAMC by 1.2 and 5.2 percentage points, respectively. On the Stanford Dogs dataset, our method improves on OSME-MAMC by 6.4 percentage points, while TransFG remains 0.6 percentage points higher. This shows that our proposed method is effective in classifying fine-grained images.

Experiment analysis

To better evaluate our method, we adopt more evaluation metrics in this section and analyze the experimental results in detail. All studies are done on the CUB-200-2011 dataset. We used accuracy, macro-precision, and macro-recall as our evaluation metrics; the experimental results are shown in Table 3.

Table 3
Results of our method on accuracy, macro-precision and macro-recall
method	accuracy	macro-precision	macro-recall
our	91.4%	90.6%	90.4%

Table 4
The number of categories in different accuracy ranges
accuracy	number
1.0	97
0.9 ~ 1.0	42
0.8 ~ 0.9	38
0.7 ~ 0.8	10
0.6 ~ 0.7	6
0.5 ~ 0.6	6
0.4 ~ 0.5	1

In a multi-class classification task, macro-precision and macro-recall are calculated in two steps: (i) calculate the precision and recall for each category; (ii) average the precision and recall over all categories. Precision indicates how many of the samples predicted as positive are actually positive, and recall indicates how many of the actual positive samples are correctly predicted. The experimental results show that our method performs well on both macro-precision and macro-recall.
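The two-step calculation of the macro metrics can be sketched in plain Python. The toy labels are made up for illustration, and scoring an undefined per-class precision or recall as 0 is an assumed convention.

```python
from collections import Counter

def macro_precision_recall(y_true, y_pred):
    """Average per-class precision and recall over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # actual samples per class
    pred_count = Counter(y_pred)   # predicted samples per class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    # Step (i): per-class precision and recall (0.0 when the denominator is zero)
    precisions = [tp[c] / pred_count[c] if pred_count[c] else 0.0 for c in classes]
    recalls = [tp[c] / true_count[c] if true_count[c] else 0.0 for c in classes]
    # Step (ii): unweighted average over classes
    return sum(precisions) / len(classes), sum(recalls) / len(classes)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
print(round(macro_p, 4), round(macro_r, 4))  # 0.7222 0.6667
```

Because every class contributes equally to the average, a few poorly recognized subclasses lower the macro metrics even when overall accuracy is high.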

Fig. 3

Categories that the model finds difficult to categorize (a-c) and easy to categorize (d-f)

Since macro evaluation metrics are affected by every subclass, we expected the precision and recall of our method on individual subclasses to be satisfactory. Therefore, we counted the number of categories in different accuracy ranges; the results are shown in Table 4. We offer a possible explanation for these results by comparing images in the dataset. As shown in Figure 3, birds that the model predicts well tend to have distinctive features, such as the head in (d) and the beaks in (e) and (f). For images on which the model performs poorly, the bird's distinctive features are not obvious enough (e.g., a) or the image contains a large background area (e.g., b and c).

In the ablation study, we selected TransFG as the backbone of our network and ran two groups of experiments. One group uses our patch impact factor module (PIFM), and the other does not. All other experimental settings are kept the same. From the results in Table 3, we can see that with the PIFM module, accuracy improved from 90.2% to 91.4%, an improvement of 1.2 percentage points. The ablation study shows that our proposed module plays a positive role in improving the performance of fine-grained image categorization models.

Conclusion

In this work, we propose a novel fine-grained image categorization module, PIFM, and achieve state-of-the-art results on the CUB-200-2011 dataset and competitive results on the Stanford Dogs dataset. We take the encoded features of each patch as a unique high-dimensional representation and use these representations to generate a set of weight coefficients. The weight coefficients indicate the importance of the corresponding patch in each network layer. At the end of the network, all weight coefficients are fused, indicating the contribution of the corresponding patch to the classification. Experiments on standard academic datasets demonstrate the effectiveness of our module. The experimental results show that the transformer has great potential for fine-grained classification and is worth exploring further. In future work, we will further explore the potential of the transformer and experiment on more datasets (academic datasets and large-scale competition datasets) to fully validate our approach.

Author Contribution

H.Z.: Conceptualization, Methodology, Investigation; Z.L.: Methodology, Formal analysis; B.Y.: Investigation, Formal analysis, Writing - original draft; T.L.: Writing - original draft, Writing - review & editing; Y.X.: Conceptualization, Supervision, Funding acquisition. All authors reviewed the manuscript.

Data Availability

The data used in this study are publicly available from the following official repositories. The CUB-200-2011 dataset can be accessed from the California Institute of Technology’s official website: https://www.vision.caltech.edu/datasets/cub_200_2011/ The Stanford Dogs dataset can be obtained from Stanford University’s Computer Vision Laboratory repository: http://vision.stanford.edu/aditya86/ImageNetDogs/

References:

Van Horn, Grant and Branson, Steve and Farrell, Ryan and Haber, Scott and Barry, Jessie and Ipeirotis, Panos and Perona, Pietro and Belongie, Serge (2015) Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. 595--604, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Wah, Catherine and Branson, Steve and Welinder, Peter and Perona, Pietro and Belongie, Serge (2011) The caltech-ucsd birds-200-2011 dataset. California Institute of Technology

Khosla, Aditya and Jayadevaprakash, Nityananda and Yao, Bangpeng and Li, Fei-Fei (2011) Novel dataset for fine-grained image categorization: Stanford dogs. Citeseer, 1, 2, Proc. CVPR workshop on fine-grained visual categorization (FGVC)

Simonyan, Karen and Zisserman, Andrew (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E (2017) Imagenet classification with deep convolutional neural networks. Communications of the ACM 60(6): 84--90 ACM New York, NY, USA

He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian (2016) Deep residual learning for image recognition. 770--778, Proceedings of the IEEE conference on computer vision and pattern recognition

Ding, Yao and Zhou, Yanzhao and Zhu, Yi and Ye, Qixiang and Jiao, Jianbin (2019) Selective sparse sampling for fine-grained image recognition. 6599--6608, Proceedings of the IEEE/CVF International Conference on Computer Vision

Liu, Chuanbin and Xie, Hongtao and Zha, Zheng-Jun and Ma, Lingfeng and Yu, Lingyun and Zhang, Yongdong (2020) Filtration and distillation: Enhancing region attention for fine-grained visual categorization. 11555--11562, 07, 34, Proceedings of the AAAI Conference on Artificial Intelligence

Ren, Shaoqing and He, Kaiming and Girshick, Ross and Sun, Jian (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28

Huang, Xin and Wang, Xinxin and Lv, Wenyu and Bai, Xiaying and Long, Xiang and Deng, Kaipeng and Dang, Qingqing and Han, Shumin and Liu, Qiwen and Hu, Xiaoguang and others (2021) PP-YOLOv2: A practical object detector. arXiv preprint arXiv:2104.10419

Hu, Jie and Shen, Li and Sun, Gang (2018) Squeeze-and-excitation networks. 7132--7141, Proceedings of the IEEE conference on computer vision and pattern recognition

Li, Xiang and Wang, Wenhai and Hu, Xiaolin and Yang, Jian (2019) Selective kernel networks. 510--519, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, Qilong and Wu, Banggu and Zhu, Pengfei and Li, Peihua and Zuo, Wangmeng and Hu, Qinghua (2020) Supplementary material for ‘ECA-Net: Efficient channel attention for deep convolutional neural networks’. 13--19, Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, WA, USA

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia (2017) Attention is all you need. Advances in neural information processing systems 30

Wang, Xiaolong and Girshick, Ross and Gupta, Abhinav and He, Kaiming (2018) Non-local neural networks. 7794--7803, Proceedings of the IEEE conference on computer vision and pattern recognition

Jaderberg, Max and Simonyan, Karen and Zisserman, Andrew and others (2015) Spatial transformer networks. Advances in neural information processing systems 28

Woo, Sanghyun and Park, Jongchan and Lee, Joon-Young and Kweon, In So (2018) Cbam: Convolutional block attention module. 3--19, Proceedings of the European conference on computer vision (ECCV)

Fu, Jun and Liu, Jing and Tian, Haijie and Li, Yong and Bao, Yongjun and Fang, Zhiwei and Lu, Hanqing (2019) Dual attention network for scene segmentation. 3146--3154, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Huang, Zilong and Wang, Xinggang and Huang, Lichao and Huang, Chang and Wei, Yunchao and Liu, Wenyu (2019) Ccnet: Criss-cross attention for semantic segmentation. 603--612, Proceedings of the IEEE/CVF international conference on computer vision

Wang, Fei and Jiang, Mengqing and Qian, Chen and Yang, Shuo and Li, Cheng and Zhang, Honggang and Wang, Xiaogang and Tang, Xiaoou (2017) Residual attention network for image classification. 3156--3164, Proceedings of the IEEE conference on computer vision and pattern recognition

Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and others (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining (2021) Swin transformer: Hierarchical vision transformer using shifted windows. 10012--10022, Proceedings of the IEEE/CVF International Conference on Computer Vision

Carion, Nicolas and Massa, Francisco and Synnaeve, Gabriel and Usunier, Nicolas and Kirillov, Alexander and Zagoruyko, Sergey (2020) End-to-end object detection with transformers. Springer, 213--229, European conference on computer vision

Chen, Jieneng and Lu, Yongyi and Yu, Qihang and Luo, Xiangde and Adeli, Ehsan and Wang, Yan and Lu, Le and Yuille, Alan L and Zhou, Yuyin (2021) Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306

Zheng, Sixiao and Lu, Jiachen and Zhao, Hengshuang and Zhu, Xiatian and Luo, Zekun and Wang, Yabiao and Fu, Yanwei and Feng, Jianfeng and Xiang, Tao and Torr, Philip HS and others (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. 6881--6890, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

He, Shuting and Luo, Hao and Wang, Pichao and Wang, Fan and Li, Hao and Jiang, Wei (2021) Transreid: Transformer-based object re-identification. 15013--15022, Proceedings of the IEEE/CVF international conference on computer vision

He, Ju and Chen, Jie-Neng and Liu, Shuai and Kortylewski, Adam and Yang, Cheng and Bai, Yutong and Wang, Changhu (2022) Transfg: A transformer architecture for fine-grained recognition. 852--860, 1, 36, Proceedings of the AAAI Conference on Artificial Intelligence

Farrell, Ryan and Oza, Om and Zhang, Ning and Morariu, Vlad I and Darrell, Trevor and Davis, Larry S (2011) Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. IEEE, 161--168, 2011 International Conference on Computer Vision

Liu, Jiongxin and Kanazawa, Angjoo and Jacobs, David W and Belhumeur, Peter N (2012) Dog Breed Classification Using Part Localization. 172--185, ECCV (1)

Yang, Shulin and Bo, Liefeng and Wang, Jue and Shapiro, Linda (2012) Unsupervised template learning for fine-grained object recognition. Advances in neural information processing systems 25

Berg, Thomas and Belhumeur, Peter N (2013) Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. 955--962, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Yang, Yun and Hu, Yuanyuan and Zhang, Xingyi and Wang, Song (2022) Two-Stage Selective Ensemble of CNN via Deep Tree Training for Medical Image Classification. IEEE Transactions on Cybernetics 52(9): 9194--9207 https://doi.org/10.1109/TCYB.2021.3061147

Wang, Pei and Yang, Yun and Xia, Yuelong and Wang, Kun and Zhang, Xingyi and Wang, Song (2023) Information Maximizing Adaptation Network With Label Distribution Priors for Unsupervised Domain Adaptation. IEEE Transactions on Multimedia 25: 6026--6039 https://doi.org/10.1109/TMM.2022.3203574

Du, Fei and Yang, Peng and Jia, Qi and Nan, Fengtao and Chen, Xiaoting and Yang, Yun (2023) Global and Local Mixture Consistency Cumulative Learning for Long-tailed Visual Recognitions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), arXiv preprint arXiv:2305.08661, https://doi.org/10.48550/arXiv.2305.08661

Declarations

Funding: Not applicable.

Conflict of interest: All authors declare no conflict of interest.

Ethics approval and consent to participate: Not applicable.

Consent for publication: Not applicable.

Materials availability: Not applicable.

Code availability: The custom code generated during the current study to support the findings reported herein is not publicly deposited at this stage but is available from the corresponding author upon reasonable request. Interested researchers may contact the corresponding author (Ying Xing, E-mail: xingying@bupt.edu.cn) with a brief description of their research purpose to obtain the code.
