DMSCA: Dynamic Multi-Scale Channel-Spatial Attention for Enhanced Feature Representation in Convolutional Neural Networks

Li Zong¹, Hu Jun Peng¹†, Shen Ji Nan¹

College of Intelligent Science and Engineering, Hubei University for Nationalities, Enshi 445000 Hubei Province, China

Corresponding Author: Hu Jun Peng (lizong040209@gmail.com)

Fund: This work was supported by the National Natural Science Foundation of China (Grant No. 62262020).

Abstract

The attention mechanism improves convolutional neural networks (CNNs) by emphasizing important features, yet current approaches often fall short in capturing multi-scale context, deeply integrating channel and spatial information, and adapting dynamically to the input. We introduce the Dynamic Multi-Scale Channel-Spatial Attention (DMSCA) mechanism, which combines multi-scale encoding, directional interactions, and adaptive activations for enhanced feature coupling. Unlike fixed-structure methods such as CBAM, DMSCA employs learnable dynamic weights for adaptive channel-spatial fusion. Key innovations include Temperature-controlled Channel Attention (TCA) and a direction-aware Multi-scale Spatial Context Encoder (MSCE). DMSCA integrates six components: Global Context Encoder, TCA, MSCE, Directional Information Interaction, Dynamic Feature Fusion, and Adaptive Activation. Evaluations on CIFAR-10/100 and ImageNet show higher accuracy than state-of-the-art attention mechanisms, with a 1.52% Top-1 accuracy gain on ResNet-50 at modest computational cost (11.3% more parameters and 2.4% more FLOPs).

INDEX TERMS: Attention mechanism, Convolutional neural network, Feature representation, Dynamic adaptability, Multi-scale context, Channel-spatial coupling

1. Introduction

Convolutional Neural Networks (CNNs) [1], [2] have revolutionized tasks including image classification [26], [27], object detection [3], [4], [28], and semantic segmentation [5], [6], [29] by learning hierarchical features autonomously. Nevertheless, conventional CNNs assign equal importance to all channels and spatial locations, hindering their focus on critical information and thus impacting feature quality and model performance.

Inspired by human vision, attention mechanisms [7], [8], [30] have been incorporated into CNNs to enable selective focus, dynamically weighting features to boost performance, robustness, and interpretability with minimal complexity overhead.

Notable advancements include SE-Net [9] for channel attention via global pooling and fully connected layers, CBAM [10] for sequential channel-spatial attention, ECA-Net [11] for efficient cross-channel interactions, and Coordinate Attention (CA) [12] for position-sensitive long-range dependencies.

Despite this progress, challenges persist: insufficient multi-scale context capture, simplistic channel-spatial fusion lacking dynamic collaboration, limited adaptability of attention weights, inadequate directional modeling, and expressive limitations of fixed activations.

Addressing these, we propose the Dynamic Multi-scale Channel-Spatial Attention (DMSCA) mechanism. Our contributions are:

1. A synergistic multi-component framework encompassing global and multi-scale encoding, temperature-controlled channel attention, and direction-aware spatial attention.

2. A learnable Dynamic Feature Fusion module for non-linear, input-adaptive channel-spatial combination, surpassing linear or fixed fusions.

3. Rigorous validation on CIFAR-10/100 and ImageNet, showing DMSCA's superior performance-efficiency balance over baselines.

4. In-depth ablation and visualization studies confirming component efficacy and enhanced interpretability.

The paper is organized as follows: Section 2 reviews related work; Section 3 details DMSCA; Section 4 outlines experimental setup; Section 5 presents results and analyses; Section 6 concludes.

2. Related Work

2.1 Channel Attention Mechanisms

Channel attention mechanisms calibrate features by modeling inter-channel dependencies and assigning varying weights to emphasize important channels. SE-Net [9] pioneered this approach using global pooling and fully connected layers, though dimensionality reduction can lead to information loss. ECA-Net [11] enhances efficiency through local cross-channel interactions without reduction, while FCA [13] incorporates frequency-domain analysis for richer representations. Despite these advances, many methods overlook spatial context in guiding channel weights and lack dynamic adjustment. DMSCA addresses this via its Global Context Encoder (GCE) and Temperature-controlled Channel Attention (TCA), introducing learnable temperature τ for adaptive scaling and deep spatial coupling.

2.2 Spatial Attention Mechanism

Spatial attention mechanisms emphasize key regions in feature maps, focusing on "where" informative content lies. Spatial Transformer Networks (STN) [14] achieve this through affine transformations, albeit with high computational cost. CBAM's spatial module [10] generates attention maps via pooling and convolution but often loses channel details and directional sensitivity. Coordinate Attention (CA) [12] improves on this by encoding long-range dependencies with positional awareness. However, existing approaches frequently neglect channel influences, directional interactions, and multi-scale contexts. DMSCA's Multi-scale Spatial Context Encoder (MSCE) and Directional Information Interaction (DII) integrate these elements, enabling robust multi-scale and direction-aware spatial modeling.

2.3 Hybrid Attention Mechanism

Hybrid attention mechanisms synergize channel and spatial attention for comprehensive feature enhancement. CBAM [10] applies them sequentially, while BAM [15] uses parallel combination with summation. Triplet Attention [16] incorporates cross-dimensional interactions, and SimAM [17] infers neuron importance parameter-freely via an energy function. Nonetheless, fusion strategies in these methods are often fixed and linear, restricting adaptability. DMSCA introduces a Dynamic Feature Fusion (DFF) module that learns spatially adaptive weights and incorporates directional interactions for flexible, data-dependent integration.

2.4 Multi-scale Feature Processing and Attention

Multi-scale feature processing captures information across varying granularities to boost model robustness. Inception networks [18], [19] employ multi-branch structures for this purpose, and Feature Pyramid Networks (FPN) [20] fuse hierarchical semantics. Attention-integrated multi-scale methods, like Pyramid Attention Modules (PAM) [21], have gained traction. Yet, many apply multi-scale independently or with simplistic combinations, lacking dynamic cross-scale interactions. DMSCA embeds multi-scale design intrinsically, introducing and fusing scales early with adaptive weighting to exploit inter-scale dynamics deeply and efficiently.

2.5 Adaptive Activation Function

Activation functions inject non-linearity into networks, with adaptive variants dynamically adjusting based on inputs for superior expressiveness. Unlike fixed functions such as ReLU [31], Swish [32] and DY-ReLU [33] adapt shapes to feature complexities. These innovations inspire DMSCA's Adaptive Activation (AA) module, which dynamically modulates output mappings to optimize non-linear representations within the attention framework.

2.6 Advanced Attention Mechanisms and Vision Transformers

Recent advancements include sophisticated attention designs and Vision Transformers (ViT) [22], which process images as patch sequences. Models like Visual Attention Network (VAN) use large-kernel convolutions for long-range modeling in CNNs, while EfficientFormerV2 excels in lightweight hybrids. Although powerful, these often incur high costs or suit Transformer backbones primarily. In contrast, DMSCA offers a lightweight, plug-and-play solution optimized for CNNs, enhancing features without structural overhauls. Overall, while prior works advance individual aspects, DMSCA holistically integrates dynamic multi-scale, channel-spatial, and adaptive elements for superior synergy.

3. Methodology

This section details the architecture, core components, and mathematical foundations of the Dynamic Multi-scale Channel-Spatial Attention (DMSCA) mechanism. Designed to enhance CNN feature representations, DMSCA integrates multi-scale contexts, directional perceptions, and adaptive activations dynamically. We provide precise mathematical derivations, a parameter table, and pseudocode for reproducibility.

3.1 Overall Architecture of DMSCA

DMSCA serves as a plug-and-play module for CNNs like ResNet or MobileNet, typically inserted after residual blocks. For input X ∈ ℝ^{C×H×W}, it comprises six components: Global Context Encoder (GCE), Temperature-controlled Channel Attention (TCA), Multi-scale Spatial Context Encoder (MSCE), Directional Information Interaction (DII), Dynamic Feature Fusion (DFF), and Adaptive Activation (AA). These elements collaborate to refine features multi-dimensionally.

The process involves computing global context, applying channel attention, encoding spatial and directional information, fusing features dynamically, and activating adaptively. Specifically:

1. Input: feature map X ∈ ℝ^{C×H×W}.

2. Global Context Encoder (GCE): global Avg/Max Pool + fusion + MLP → F_global.

3. Temperature-controlled Channel Attention (TCA): A_c = Softmax(F_global/τ), X_c = X ⊙ A_c.

4. Boundary Check 1: if H = 1 or W = 1 → return X_c.

5. Multi-scale Spatial Context Encoder (MSCE): Conv1x1 reduction, multi-kernel Conv {k1,k2,k3}, weighted fusion → F_multi.

6. Boundary Check 2: if H < max(K) or W < max(K) → adaptive pooling.

7. Directional Information Interaction (DII): AvgPool(H,W) → Conv1x1 → Expand + Concat, Conv1x1 → F_dir, X_s = X_c ⊙ σ(F_multi + F_dir).

8. Dynamic Feature Fusion (DFF): Concat(X_c,X_s), Conv1x1 + BN + ReLU, Softmax weights {w0,w1}, X_out = w0⊙X_c + w1⊙X_s.

9. Boundary Check 3: if C = 1 → average fusion.

10. Adaptive Activation (AA): α = σ(BN(Conv1x1(X_out))), X_final = X_out ⊙ α.

11. Output: enhanced feature X_final.

Fig. 1

DMSCA architecture overview.

3.2 Global Context Encoder

The Global Context Encoder (GCE) processes the input features X through parallel global average pooling (GAP) and global max pooling (GMP), capturing the global context across channels.

X_avg = GAP(X), X_max = GMP(X) (1)

Here, X_avg and X_max represent the average and maximum responses, respectively. Compared to using GAP alone (as in SE-Net [9]), the combination of GAP and GMP [22] can capture richer global statistical information (GAP encodes the overall distribution, GMP highlights discriminative local features). The two are added element-wise, `X_fused = X_avg + X_max`, where `X_fused ∈ R^(C×1×1)`. We choose element-wise addition over concatenation to fuse global distribution (GAP) and salient feature (GMP) information with fewer parameters, since concatenation would double the input width of the subsequent layer. This fused descriptor is then processed by a shared MLP to produce the final global context descriptor.
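The pooling and fusion step can be illustrated with a minimal pure-Python sketch. It covers only `X_fused = X_avg + X_max`; the shared MLP and Sigmoid that follow are omitted, and the function name `global_context` and the nested-list layout `[C][H][W]` are assumptions made for illustration, not part of the original implementation.

```python
def global_context(x):
    """Fuse global average and global max pooling per channel.

    x is a nested list with layout [C][H][W] (an illustrative stand-in
    for a tensor). Returns X_fused as a list of C values.
    """
    fused = []
    for channel in x:
        values = [v for row in channel for v in row]
        gap = sum(values) / len(values)   # global average pooling
        gmp = max(values)                 # global max pooling
        fused.append(gap + gmp)           # element-wise addition
    return fused


x = [
    [[1.0, 2.0], [3.0, 4.0]],   # channel 0: mean 2.5, max 4.0
    [[0.0, 0.0], [0.0, 8.0]],   # channel 1: mean 2.0, max 8.0
]
print(global_context(x))  # [6.5, 10.0]
```

The second channel shows why GMP adds information: its mean is lower than that of the first channel, yet its single strong response raises the fused value.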

Fig. 2

Schematic diagram of the Global Context Encoder (GCE). The input feature X is processed in parallel through GAP and GMP to obtain X_avg and X_max. These two values are then added to form X_fused, which is input into a shared MLP (including dimensionality reduction FC, ReLU, and dimensionality restoration FC). The output is the global context descriptor F_global after applying Sigmoid.

3.3 Channel Attention with Temperature Control

The Temperature-controlled Channel Attention (TCA) mechanism generates channel attention weights A_c from the global context descriptor F_global = σ(MLP(X_avg + X_max)) produced by the GCE. The core innovation lies in introducing a temperature parameter τ, which can be fixed or learnable:

A_c = Softmax(F_global / τ)(2)

Here, τ is a scalar that adjusts the sharpness of the Softmax function. A lower τ leads to a more focused attention distribution, whereas a higher τ results in a smoother distribution. Drawing inspiration from temperature scaling in knowledge distillation [23] and contrastive learning [24], τ can be fixed or learned during training, which lets the model adjust how sharply it focuses on individual channels. The channel-enhanced feature is:

X_c = X ⊙ A_c (3), where ⊙ represents element-wise multiplication.
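The effect of τ in Eq. (2) can be seen in a small pure-Python sketch. The function name `temperature_softmax` and the example descriptor values are assumptions chosen for illustration; in the actual module F_global comes from the GCE.

```python
import math


def temperature_softmax(scores, tau=1.0):
    """Softmax(scores / tau), computed in a numerically stable way."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [s / tau for s in scores]
    peak = max(scaled)                        # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-channel descriptor (values in (0, 1) as after a Sigmoid).
f_global = [0.2, 0.5, 0.9]
for tau in (0.5, 1.0, 2.0):
    weights = temperature_softmax(f_global, tau)
    print(tau, [round(w, 3) for w in weights])
```

Running the loop shows the largest weight shrinking as τ grows, which is the smoothing behaviour described above.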

Fig. 3

Schematic diagram of Temperature-controlled Channel Attention (TCA). After GCE generates F_global, it is divided by τ and then processed by Softmax to generate A_c. A_c is then multiplied element-wise with X to obtain X_c.

3.4 Multi-scale Spatial Context Encoder

The Multi-scale Spatial Context Encoder (MSCE, Fig. 4) acts on the channel-enhanced feature `X_c ∈ R^(C×H×W)` to capture spatial context at multiple scales. Inspired by Inception [18], it uses a parallel multi-branch structure. First, `X_c` is dimensionally reduced to `C'` channels via a 1x1 convolution. Then, for each of the `n` parallel branches, a convolution `Conv_i` with a different kernel size (e.g., 3x3, 5x5, 7x7) is applied to extract scale-specific features. The output of each branch, `F_i' ∈ R^(C'×H×W)`, is then reduced to a single-channel spatial map `F_i ∈ R^(1×H×W)`.

F_i' = Conv_i(ReLU(BN(Conv_1×1,C'(X_c)))) (4)

F_i = Conv_1×1,1(ReLU(BN(F_i'))) (5)

Finally, these multi-scale maps are adaptively fused using learnable weights `w_i` to produce the final multi-scale context map `F_multi ∈ R^(1×H×W)`:

F_multi = Σ(i = 1 to n) w_i · F_i, where w = Softmax(ŵ), and ŵ are learnable parameters (6)

This learnable weighting allows the model to dynamically prioritize the most informative scales for a given input.
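Eq. (6) can be sketched as follows, assuming the single-channel maps F_i have already been produced by the convolution branches. The names `fuse_scales` and `raw_weights` (standing in for ŵ) are hypothetical, and nested lists replace tensors to keep the example dependency-free.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(maps, raw_weights):
    """Weighted sum of n spatial maps, each with layout [H][W].

    raw_weights plays the role of the learnable logits w-hat; the
    Softmax keeps the fusion weights positive and summing to one.
    """
    weights = softmax(raw_weights)
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, maps)) for j in range(width)]
        for i in range(height)
    ]


# Three hypothetical scale maps (e.g. from 3x3, 5x5, 7x7 branches).
maps = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]]]
fused = fuse_scales(maps, raw_weights=[0.0, 0.0, 0.0])  # equal logits
print(fused)  # every entry is close to 2.0, the plain average
```

With equal logits the fusion reduces to an average; during training the logits would move away from zero to favour the more useful scales.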

Fig. 4

Schematic diagram of multi-scale spatial context encoder. X_c is dimensionally reduced through a 1×1 convolution, and then passes through multiple branches (including specific scale convolution kernels k_1, k_2, k_3, BN, ReLU, and 1×1 convolution) to restore the channels to 1, generating F_1, F_2, and F_3. These F_i are weighted and fused by the learnable weights w_i (generated by Softmax) to obtain F_multi.

3.5 Directional Information Interaction

The Directional Information Interaction (DII, Fig. 5) module operates on `X_c` to capture long-range dependencies with directional awareness, improving upon CA [12]. It first performs average pooling along the horizontal and vertical axes to generate direction-specific feature maps `X_h ∈ R^(C×H×1)` and `X_w ∈ R^(C×1×W)` (Eq. 7, 8). These are then encoded into direction-aware features `F_h' ∈ R^(C/r'×H×1)` and `F_w' ∈ R^(C/r'×1×W)` (Eq. 9, 10). To facilitate interaction, these features are expanded to the full H×W resolution and concatenated, so that horizontal and vertical positional information is considered jointly, providing a more holistic spatial understanding. The final direction interaction map `F_direction ∈ R^(1×H×W)` is then generated (Eq. 11). The overall spatial attention map `A_s ∈ R^(1×H×W)` is formed by combining multi-scale and directional information, and the spatially enhanced feature `X_s ∈ R^(C×H×W)` is computed from it (Eq. 12, 13).

X_h(c, i) = (1/W) Σ(j = 1 to W) X_c(c, i, j), for each channel c and row i (7)

X_w(c, j) = (1/H) Σ(i = 1 to H) X_c(c, i, j), for each channel c and column j (8)

F_h' = ReLU(BN(Conv_1×1,C/r'(X_h))) (9)

F_w' = ReLU(BN(Conv_1×1,C/r'(X_w))) (10)

F_direction = σ(BN(Conv_1×1,1(Concat(F_h_expand, F_w_expand)))) (11)

A_s = σ(F_multi + F_direction) (12)

X_s = X_c ⊙ A_s (13)
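The two directional pooling operations in Eq. (7) and (8) can be sketched in pure Python. The function name `directional_pool` and the nested-list layout `[C][H][W]` are assumptions for illustration; the subsequent 1×1 convolutions, expansion, and concatenation are not shown.

```python
def directional_pool(x):
    """Average-pool a [C][H][W] nested list along each spatial axis.

    Returns (x_h, x_w): x_h[c][i] is the mean of row i (pooled over the
    width), x_w[c][j] is the mean of column j (pooled over the height).
    """
    x_h = [[sum(row) / len(row) for row in channel] for channel in x]
    x_w = [
        [sum(row[j] for row in channel) / len(channel)
         for j in range(len(channel[0]))]
        for channel in x
    ]
    return x_h, x_w


x_c = [[[1.0, 3.0],
        [5.0, 7.0]]]            # one channel, 2x2
x_h, x_w = directional_pool(x_c)
print(x_h)  # [[2.0, 6.0]]  row means
print(x_w)  # [[3.0, 5.0]]  column means
```

Each pooled vector keeps exact position along one axis while summarizing the other, which is what allows the later interaction step to recover where along each direction a response occurred.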

Fig. 5

Schematic diagram of Directional Information Interaction (DII). X_c is horizontally pooled to become X_h, and vertically pooled to become X_w. X_h and X_w are each encoded by 1×1 convolution, BN, and ReLU to obtain F_h' and F_w'. After expansion and concatenation of the two, they are processed by 1×1 convolution, BN, and Sigmoid to generate F_direction.

3.6 Dynamic Feature Fusion

The Dynamic Feature Fusion (DFF, Fig. 6) module adaptively combines the channel-enhanced feature `X_c ∈ R^(C×H×W)` and the spatially-enhanced feature `X_s ∈ R^(C×H×W)`. Unlike the fixed serial/parallel fusion in methods like CBAM, DFF learns data-dependent fusion weights. The features `X_c` and `X_s` are first concatenated along the channel dimension to form `F_concat ∈ R^(2C×H×W)`. A lightweight network then generates a 2-channel weight map `W_fusion ∈ R^(2×H×W)`, which is normalized by Softmax along the channel dimension.

Justification: This dynamic weighting allows the model to spatially vary the emphasis between channel and spatial attention. For example, in texture-rich regions, it might up-weight `X_c`, while in regions with distinct object boundaries, it might prioritize `X_s`.

F_concat = Concat(X_c, X_s) (14)

W_fusion = Softmax_c(Conv_1×1,2(ReLU(BN(Conv_1×1,C'(F_concat))))) (15), where Softmax_c denotes Softmax over the 2-channel dimension.

X_out = W_fusion[0] ⊙ X_c + W_fusion[1] ⊙ X_s (16)

The resulting feature `X_out ∈ R^(C×H×W)` is a refined representation that has dynamically balanced channel and spatial information at each spatial location.
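The per-location weighting of Eq. (16) can be sketched as below. The 2-channel `logits` argument stands in for the output of the lightweight weight network before Softmax and is taken as given here; the function name `dynamic_fusion` and the nested-list layout are assumptions for illustration.

```python
import math


def dynamic_fusion(x_c, x_s, logits):
    """Blend x_c and x_s ([C][H][W]) with per-pixel Softmax weights.

    logits has layout [2][H][W]; at each location the two logits are
    normalized so that w0 + w1 = 1, then shared across all channels.
    """
    channels, height, width = len(x_c), len(x_c[0]), len(x_c[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i in range(height):
        for j in range(width):
            a, b = logits[0][i][j], logits[1][i][j]
            peak = max(a, b)
            ea, eb = math.exp(a - peak), math.exp(b - peak)
            w0, w1 = ea / (ea + eb), eb / (ea + eb)
            for c in range(channels):
                out[c][i][j] = w0 * x_c[c][i][j] + w1 * x_s[c][i][j]
    return out


x_c = [[[1.0, 1.0]]]
x_s = [[[3.0, 3.0]]]
# Hypothetical logits: a tie at the first pixel, strong preference
# for x_c at the second pixel.
logits = [[[0.0, 10.0]], [[0.0, -10.0]]]
out = dynamic_fusion(x_c, x_s, logits)
print(out)  # first pixel is the midpoint, second is close to x_c
```

Because the weights are computed per location, the two pixels of the same feature map end up with different blends, which is the spatial adaptivity that fixed serial or parallel fusion lacks.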

Fig. 6

Schematic diagram of Dynamic Feature Fusion (DFF). After concatenating X_c and X_s, it is processed through a lightweight convolutional network (including 1×1 dimensionality reduction convolution, BN, ReLU, and output 2-channel 1×1 convolution) and Softmax (with 2-channel dimension) to generate W_fusion. X_out is the weighted sum of X_c and X_s based on W_fusion.

3.7 Adaptive Activation

The Adaptive Activation (AA) module provides a final, data-dependent feature recalibration. Operating on the fused feature `X_out ∈ R^(C×H×W)`, it learns a modulation map `α ∈ R^(C×H×W)` via a simple 1x1 convolution followed by a Sigmoid function. This map acts as a gate, adaptively scaling the features at each spatial location and channel.

Justification: Inspired by dynamic activation functions like Swish [32], this mechanism introduces another layer of non-linearity, allowing the model to fine-tune the feature representation before passing it to the next layer. It effectively allows the network to learn its own activation function locally.

α = σ(BN(Conv_1×1,C(X_out))) (17)

X_final = X_out ⊙ α (18)

The final output `X_final ∈ R^(C×H×W)` represents a feature map that has been comprehensively enhanced through a multi-stage process of contextual perception and dynamic modulation.

Fig. 7

Schematic diagram of Adaptive Activation (AA). X_out is processed through a 1×1 convolution, BN, and Sigmoid to generate α. α is then multiplied element-wise with X_out to obtain X_final.

Through the collaboration of the six components, DMSCA comprehensively and contextually perceives and dynamically enhances the input features from multiple dimensions, providing a more robust feature representation for visual tasks.

4. Experimental Settings

4.1 Datasets

We conducted experiments on three publicly available datasets widely used in image classification tasks. These datasets have different scales and complexities, which can fully test the performance and generalization ability of DMSCA:

1. CIFAR-10 [27]: Contains 10 categories, with a total of 60,000 color images of 32×32 pixels, of which 50,000 are used for training and 10,000 for testing. This relatively small dataset is often used to quickly verify the effectiveness of new methods.

2. CIFAR-100 [27]: Similar to CIFAR-10 but with 100 categories, and the same number and size of images. The larger number of categories makes classification more difficult.

3. ImageNet (ILSVRC 2012) [28]: A large-scale dataset with 1,000 classes, about 1.28 million training images, and 50,000 validation images, used to benchmark model generalization.

Standard preprocessing was applied: normalization, random cropping, and horizontal flipping for CIFAR; scaling to 256×256, random 224×224 cropping, and flipping for ImageNet training; center 224×224 cropping for validation.

4.2 Experimental Environment and Implementation Details

Experiments ran on a server with an Intel Core i9-12900K CPU, 128 GB DDR4 RAM, 4 TB NVMe SSD, and NVIDIA GeForce RTX 3090 GPUs (24 GB VRAM each). Software included PyTorch 2.0.0, CUDA 11.8, and Python 3.9.

DMSCA was integrated into ResNet architectures (ResNet-18, -34, -50) [7], placed after the main convolution in each residual block before identity mapping, enhancing features while preserving residual learning.

4.2.1 Theoretical Computational Complexity Analysis

To gain a deeper understanding of the computational cost of DMSCA, we provide a theoretical complexity analysis. Let the size of the input feature map be H×W×C. The time complexity of each component of DMSCA is as follows:

1) Global Context Encoder (GCE): O(HWC + C²/r), where r is the dimensionality reduction ratio.

2) Temperature-controlled Channel Attention (TCA): O(C²/r + C).

3) Multi-scale Spatial Context Encoder (MSCE): O(HWC·K), where K is the number of convolution kernels.

4) Directional Information Interaction (DII): O(HWC + C²).

5) Dynamic Feature Fusion (DFF): O(HWC).

6) Adaptive Activation (AA): O(HWC).

The overall time complexity is O(HWC·K + C²/r + C²), which is an increase compared to the O(HWC) of the basic ResNet, reflecting the additional costs of channel intercommunication and multi-scale processing. The space complexity is mainly determined by the intermediate feature maps and is O(HWC·K).

The hyperparameter settings for training are shown in Table 1. Some settings, such as batch size, weight decay, and the number of epochs, differ between datasets. The optimizer is stochastic gradient descent (SGD) with momentum in all cases, and the learning rate follows a cosine annealing schedule [30].

Table 1
Main Hyperparameter Settings
Hyperparameter	CIFAR-10 / CIFAR-100	ImageNet
Optimizer	SGD	SGD
Initial Learning Rate	0.1	0.1
Learning Rate Schedule	Cosine Annealing	Cosine Annealing
Batch Size	64	128
Weight Decay	5×10⁻⁴	1×10⁻⁴
Momentum	0.9	0.9
Epochs	200	100
r	16	16
K	{3, 5, 7}	{3, 5, 7}
τ	1.0	1.0
NOTES: r is the dimensionality reduction ratio of DMSCA, and K is the set of multi-scale convolution kernel sizes. The learning rate and batch size on ImageNet are usually adjusted according to the number of GPUs used and the total batch size, for example using a linear scaling rule [31]. The optimal value of the temperature coefficient τ may vary with the dataset and the model; the listed value is the default, and other values are explored in the sensitivity analysis (Section 4.2.2).
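For reference, the cosine annealing schedule named in Table 1 can be sketched as follows. This is the standard per-epoch form without warm restarts; the minimum learning rate `min_lr = 0` is an assumption, since the source does not state it, and the function name is hypothetical.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, base_lr=0.1, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing.

    Starts at base_lr (epoch 0) and decays to min_lr (epoch total_epochs).
    min_lr = 0.0 is an assumed default, not a value from the paper.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# CIFAR setting from Table 1: initial learning rate 0.1, 200 epochs.
for epoch in (0, 50, 100, 150, 200):
    print(epoch, round(cosine_annealing_lr(epoch, 200), 4))
```

The rate falls slowly at first, fastest around the midpoint (where it equals half the initial value), and flattens again near the end of training.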

4.2.2 Hyperparameter Sensitivity Analysis Design

To verify the sensitivity of DMSCA to key hyperparameters, we conducted a systematic sensitivity analysis experiment:

1) Reduction ratio (r): Tested {4, 8, 16, 32, 64} for performance-efficiency trade-offs.

2) Temperature coefficient (τ): Compared {0.5, 1.0, 1.5, 2.0, dynamic} for attention distribution impact.

3) Kernel combinations (K): Evaluated {3}, {3,5}, {3,5,7}, {3,5,7,9}.

4) Fusion weight initialization: Compared dynamic fusion strategies.

In each experiment, one parameter was varied while the others were held fixed, and we measured accuracy, parameter count, and FLOPs.

4.3 Baseline Methods and Evaluation Metrics

4.3.1 Baseline Attention Mechanisms

To verify the superiority of DMSCA, we selected the current mainstream and representative attention mechanisms as baselines for comparison:

1) ResNet (Baseline) [7]: The original ResNet architecture without any additional attention modules (ResNet-18, ResNet-34, ResNet-50).

2) SE-Net [9]: A classic channel attention mechanism that learns channel weights through global pooling and two fully connected layers.

3) CBAM [10]: A hybrid attention module that serially combines channel attention and spatial attention.

4) ECA-Net [11]: An efficient channel attention mechanism that avoids dimensionality reduction operations through one-dimensional convolution.

5) CA [12]: A novel spatial attention mechanism that captures directional information and long-range dependencies by decomposing two-dimensional spatial attention into two one-dimensional encoding processes.

6) SimAM [48]: A parameter-free attention mechanism designed based on neuroscience theory.

7) GAM [49]: A global attention mechanism that combines channel and spatial attention.

8) A²-Nets [50]: Dual attention networks that simultaneously model position and channel attention.

9) BAM [51]: Bottleneck attention module that adopts parallel channel and spatial attention branches.

All baseline attention modules are integrated into the ResNet architecture in the same way as DMSCA, and are compared fairly using the same training strategy and hyperparameters.

4.3.2 Comprehensive Evaluation Framework

We evaluate DMSCA and the baseline methods from multiple dimensions:

1) Classification Performance: Top-1/Top-5 accuracy (ImageNet), Top-1 (CIFAR); mean ± std from multiple runs; per-class analysis on CIFAR-100.

2) Computational Efficiency: Parameters (M), FLOPs (G), memory (MB), inference time (ms), efficiency score (accuracy gain / param increase %).

3) Statistical Significance: 95% confidence intervals; t-tests/ANOVA with Bonferroni correction.

4) Training Dynamics: Loss/accuracy curves; convergence epochs (e.g., to 95% final accuracy).

5) Robustness and Generalization: Cross-architecture (ResNet depths); cross-dataset; ablation with statistical validation.

This framework thoroughly assesses DMSCA's efficacy.

5. Experimental Results and Analysis

This section presents a comprehensive evaluation of DMSCA across multiple metrics, comparing it with state-of-the-art attention mechanisms. We analyze classification performance, computational efficiency, component contributions through ablation studies, training dynamics, and feature visualization to demonstrate DMSCA's effectiveness.

5.1 Main Classification Performance Comparison

We compared DMSCA with several baseline attention mechanisms (SE-Net [9], CBAM [10], ECA-Net [11], CA [12]) and the original ResNet models without attention (ResNet-18, ResNet-34, ResNet-50 [7]) on three datasets: CIFAR-10, CIFAR-100, and ImageNet. All experiments used the same hyperparameter settings and training strategies to ensure fairness. To reduce the influence of randomness, all accuracy results are reported as the mean and standard deviation over five independent runs.

Table II: Top-1 Accuracy (%) Comparison on CIFAR-10, CIFAR-100, and ImageNet

Model	CIFAR-10	CIFAR-100	ImageNet (Top-1)	ImageNet (Top-5)	Top-1 Improvement (CIFAR-10 / CIFAR-100 / ImageNet)
ResNet-18 (Baseline)	94.2 ± 0.1	75.3 ± 0.2	69.76 ± 0.12	89.08 ± 0.09	-
ResNet-18 + SE-Net	94.8 ± 0.1	76.1 ± 0.2	70.13 ± 0.10	89.45 ± 0.08	+0.6 / +0.8 / +0.37
ResNet-18 + CBAM	95.1 ± 0.1	76.5 ± 0.2	70.72 ± 0.11	89.88 ± 0.07	+0.9 / +1.2 / +0.96
ResNet-18 + ECA-Net	95.0 ± 0.1	76.3 ± 0.2	70.58 ± 0.09	89.79 ± 0.08	+0.8 / +1.0 / +0.82
ResNet-18 + CA	95.2 ± 0.1	76.7 ± 0.2	70.89 ± 0.10	90.01 ± 0.07	+1.0 / +1.4 / +1.13
ResNet-18 + DMSCA	96.5 ± 0.1	77.1 ± 0.2	71.03 ± 0.08	90.15 ± 0.06	+2.3 / +1.8 / +1.27
ResNet-34 (Baseline)	94.8 ± 0.1	76.8 ± 0.2	73.31 ± 0.10	91.42 ± 0.07	-
ResNet-34 + SE-Net	95.3 ± 0.1	77.4 ± 0.2	73.78 ± 0.09	91.75 ± 0.06	+0.5 / +0.6 / +0.47
ResNet-34 + CBAM	95.9 ± 0.1	78.2 ± 0.2	74.25 ± 0.08	92.03 ± 0.05	+1.1 / +1.4 / +0.94
ResNet-34 + ECA-Net	95.5 ± 0.1	77.6 ± 0.2	74.01 ± 0.09	91.89 ± 0.06	+0.7 / +0.8 / +0.70
ResNet-34 + CA	95.7 ± 0.1	78.0 ± 0.2	74.18 ± 0.08	91.98 ± 0.05	+0.9 / +1.2 / +0.87
ResNet-34 + DMSCA	97.1 ± 0.1	78.6 ± 0.2	74.52 ± 0.07	92.21 ± 0.04	+2.3 / +1.8 / +1.21
ResNet-50 (Baseline)	95.1 ± 0.1	77.8 ± 0.2	76.13 ± 0.08	92.87 ± 0.05	-
ResNet-50 + SE-Net	95.7 ± 0.1	78.4 ± 0.2	76.75 ± 0.07	93.28 ± 0.04	+0.6 / +0.6 / +0.62
ResNet-50 + CBAM	96.0 ± 0.1	78.7 ± 0.2	77.12 ± 0.06	93.51 ± 0.04	+0.9 / +0.9 / +0.99
ResNet-50 + ECA-Net	95.9 ± 0.1	78.6 ± 0.2	77.03 ± 0.07	93.45 ± 0.05	+0.8 / +0.8 / +0.90
ResNet-50 + CA	96.1 ± 0.1	78.9 ± 0.2	77.28 ± 0.06	93.60 ± 0.04	+1.0 / +1.1 / +1.15
ResNet-50 + DMSCA	97.3 ± 0.1	79.5 ± 0.2	77.65 ± 0.05	93.82 ± 0.03	+2.2 / +1.7 / +1.52
Note: The improvement column lists the gain in Top-1 accuracy (in percentage points) over the baseline model with the identical network structure, in the order CIFAR-10 / CIFAR-100 / ImageNet. ImageNet results use single center-crop validation. All DMSCA improvements are statistically significant (p < 0.05, two-sided t-test).

As demonstrated in Table II, DMSCA consistently outperforms all competing methods across all datasets and network architectures:

On CIFAR-10, DMSCA improves Top-1 accuracy by 2.3%, 2.3%, and 2.2% when integrated with ResNet-18, ResNet-34, and ResNet-50, respectively.

On the more challenging CIFAR-100 dataset, DMSCA achieves substantial gains of 1.8%, 1.8%, and 1.7%, demonstrating its effectiveness for fine-grained classification tasks.

On the large-scale ImageNet dataset, ResNet-50 + DMSCA outperforms the baseline by 1.52% in Top-1 accuracy and 0.95% in Top-5 accuracy, surpassing all other attention mechanisms. Notably, it exceeds the recent CA mechanism by 0.37% on ResNet-50.

These consistent improvements across diverse datasets and architectures confirm that DMSCA's dynamic multi-scale context-aware design effectively captures discriminative features, significantly enhancing CNN classification performance. Compared to existing attention mechanisms, DMSCA demonstrates substantial and consistent advantages in accuracy improvement.

5.2 Computational Efficiency Analysis

When evaluating an attention mechanism, computational cost is as important a consideration as accuracy. Table III compares the parameters, floating-point operations, GPU memory usage, and inference time of DMSCA and each baseline attention mechanism on the ResNet-50 architecture.

Table III: Computational Cost and Efficiency Comparison of Different Attention Mechanisms on ResNet-50

Method	Parameters (M)	FLOPs (G)	Memory (MB)	Inference (ms)
ResNet-50 (Baseline)	25.56 (-)	4.11	335	8.3 ± 0.1
ResNet-50 + SE-Net	28.08 (+9.86%)	4.12	360	8.7 ± 0.1 (+4.82%)
ResNet-50 + CBAM	28.09 (+9.90%)	4.12	368	9.1 ± 0.1 (+9.64%)
ResNet-50 + ECA-Net	25.57 (+0.04%)	4.11	342	8.5 ± 0.1 (+2.41%)
ResNet-50 + CA	25.83 (+1.06%)	4.13	378	9.3 ± 0.1 (+12.05%)
ResNet-50 + DMSCA	28.46 (+11.34%)	4.21	395	9.2 ± 0.1 (+10.84%)

The efficiency analysis reveals:

1) Parameters and FLOPs: DMSCA introduces a moderate parameter increase (+11.34%) and computational overhead (+2.43% FLOPs) compared to the baseline. While slightly higher than SE-Net and CBAM, this cost is justified by DMSCA's superior accuracy gains. ECA-Net achieves minimal parameter increase but with more modest performance improvements, while CA maintains low parameter overhead with computational costs similar to DMSCA.

2) Memory and Inference Time: DMSCA's memory usage (395 MB) and inference latency (9.2 ms) remain competitive despite its multi-component architecture. Its inference time increase (+10.84%) is comparable to CBAM and CA, demonstrating efficient implementation of its complex attention mechanisms.

Overall, while DMSCA introduces moderate computational overhead, its significant performance improvements justify this trade-off, particularly for applications prioritizing accuracy. The efficiency-to-performance ratio remains favorable across all tested metrics.
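The relative overheads quoted in this section follow from the Table III entries by simple arithmetic, sketched below. The helper name `relative_increase` is hypothetical. Because the table entries are themselves rounded, a value recomputed this way may differ from the reported one in the last digit.

```python
def relative_increase(value, baseline):
    """Percentage increase of value over baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (value - baseline) / baseline


# ResNet-50 + DMSCA versus the ResNet-50 baseline (Table III).
params_pct = relative_increase(28.46, 25.56)   # parameters, in millions
flops_pct = relative_increase(4.21, 4.11)      # FLOPs, in billions
latency_pct = relative_increase(9.2, 8.3)      # inference time, in ms

print(f"parameters +{params_pct:.2f}%")
print(f"FLOPs      +{flops_pct:.2f}%")
print(f"latency    +{latency_pct:.2f}%")
```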

5.2.1 Comparison with Recent Attention Mechanisms

To conduct a more comprehensive evaluation of the performance of DMSCA, we also compared it with the advanced attention mechanisms proposed in recent years. The results are shown in Table III-B.

Table III-B

Method	Top-1 Acc (%)	Δ Acc	Parameters (M)	Δ Parameters (%)	FLOPs (G)
ResNet-50	76.13 ± 0.08	-	25.56	-	4.11
ResNet-50 + SimAM	76.89 ± 0.07	+0.76	25.56	+0.00	4.11
ResNet-50 + GAM	77.35 ± 0.06	+1.22	26.78	+4.77	4.18
ResNet-50 + A²-Nets	77.01 ± 0.07	+0.88	27.12	+6.10	4.25
ResNet-50 + BAM	76.95 ± 0.08	+0.82	26.89	+5.20	4.19
ResNet-50 + DMSCA	77.65 ± 0.05	+1.52	28.46	+11.34	4.21

This extended comparison confirms DMSCA's superior performance. While SimAM introduces no additional parameters, its accuracy improvement is limited (+ 0.76%). GAM achieves the second-best accuracy improvement (+ 1.22%) with moderate parameter increase. DMSCA delivers the highest absolute performance gain (+ 1.52%), demonstrating that its additional parameter cost translates to meaningful accuracy improvements.

5.3 Statistical Significance Analysis

To confirm the reliability of DMSCA's performance improvements, we conducted paired t-tests with Cohen's d effect sizes on ImageNet results using ResNet-50. Table IV presents the results.

Table IV: Statistical Significance Test for Top-1 Accuracy on ImageNet with ResNet-50

Comparison Pair	Mean Diff	95% CI	p-value	Cohen's d	Effect Size
DMSCA vs Baseline	+ 1.52%	[1.38%, 1.66%]	< 0.001	3.15	Very Large
DMSCA vs SE-Net	+ 0.90%	[0.75%, 1.05%]	< 0.001	2.48	Large
DMSCA vs CBAM	+ 0.53%	[0.39%, 0.67%]	< 0.001	1.97	Large
DMSCA vs ECA-Net	+ 0.62%	[0.47%, 0.77%]	< 0.001	2.13	Large
DMSCA vs CA	+ 0.37%	[0.22%, 0.52%]	< 0.001	1.56	Large
Note: A p-value less than 0.05 indicates that the difference is statistically significant. Cohen's d is used to measure the effect size: 0.2 represents a small effect, 0.5 represents a medium effect, and 0.8 represents a large effect.

The results in Table IV show that the Top-1 accuracy of DMSCA on ResNet-50 is significantly higher than that of every compared method. All p-values are below 0.001, so the observed accuracy differences are very unlikely to be due to random variation. All Cohen's d values exceed 1.5, well above the 0.8 threshold for a large effect, which indicates that the improvement is practically meaningful as well as statistically significant. These results strengthen the conclusion that DMSCA is effective.
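For readers who wish to run this kind of analysis on their own runs, the sketch below computes a paired t statistic and Cohen's d from matched per-seed accuracies. It assumes the paired-samples variant of Cohen's d (mean difference divided by the standard deviation of the differences); the paper does not state which variant was used, and the accuracy lists are hypothetical values, not the raw runs behind Table IV.

```python
import math
import statistics

def paired_stats(a, b):
    """Paired t statistic and Cohen's d for matched runs a[i] vs b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two runs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    cohens_d = mean_diff / sd_diff     # paired-samples variant (assumption)
    return mean_diff, t_stat, cohens_d

def effect_label(d):
    """Conventional thresholds quoted in the note under Table IV."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "negligible"

# Hypothetical per-seed Top-1 accuracies, NOT the paper's raw runs.
with_module = [77.9, 77.4, 77.8, 77.3, 77.85]
baseline = [76.3, 76.0, 76.1, 76.05, 76.2]

mean_diff, t_stat, d = paired_stats(with_module, baseline)
print(f"mean diff = {mean_diff:.2f}, t = {t_stat:.2f}, d = {d:.2f} ({effect_label(d)})")
```

The p-value is obtained by comparing the t statistic against a Student t distribution with n - 1 degrees of freedom, which requires a statistics library or a lookup table and is therefore omitted here.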

5.4 Ablation Studies

To elucidate the contributions of DMSCA's components, we performed ablation experiments on CIFAR-100 using ResNet-18. We systematically added/removed key elements: Global Context Encoder (GCE), Temperature-Controlled Channel Attention (TCA), Multi-Scale Spatial Context Encoder (MS-SCE), Direction Information Interaction (DII), Dynamic Feature Fusion (DFF), and Adaptive Activation Function (AAF).

5.4.1 Hyperparameter Sensitivity Analysis

We systematically analyzed the impact of the key hyperparameters of DMSCA on performance, and the results are shown in Table V.

Table V: Sensitivity Analysis of Key Hyperparameters of DMSCA (CIFAR-100, ResNet-18)

Hyperparameter	Value	Top-1 Acc (%)	ΔAcc	Params (M)	FLOPs (G)
Reduction ratio(r)	4	76.85 ± 0.18	-0.28	11.45	1.83
	8	77.02 ± 0.17	-0.11	11.35	1.82
	16	77.13 ± 0.16	0.00	11.28	1.82
	32	76.98 ± 0.19	-0.15	11.25	1.82
temperature coefficient(τ)	0.5	76.89 ± 0.18	-0.24	11.28	1.82
	1.0	77.13 ± 0.16	0.00	11.28	1.82
	1.5	77.08 ± 0.17	-0.05	11.28	1.82
	2.0	76.95 ± 0.19	-0.18	11.28	1.82
	Dynamic	77.21 ± 0.15	+ 0.08	11.28	1.82
Convolution kernel combination (K)	{3}	76.45 ± 0.20	-0.68	11.21	1.81
	{3, 5}	76.88 ± 0.18	-0.25	11.25	1.82
	{3, 5, 7}	77.13 ± 0.16	0.00	11.28	1.82
	{3, 5, 7, 9}	77.09 ± 0.17	-0.04	11.32	1.83

Analysis indicates that r = 16 balances performance and efficiency optimally. Dynamic τ slightly outperforms fixed values. The {3,5,7} kernel combination is ideal, with diminishing returns for additional scales.
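To illustrate why the temperature matters, the sketch below assumes a sigmoid channel gate whose logits are divided by τ. This functional form is an assumption made purely for illustration and may differ from the TCA definition used in DMSCA; the logits are hypothetical.

```python
import math

def channel_gate(logits, tau=1.0):
    """Sigmoid gate with temperature: smaller tau pushes weights toward 0 or 1."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return [1.0 / (1.0 + math.exp(-z / tau)) for z in logits]

# Hypothetical per-channel logits (e.g. the output of a small bottleneck MLP).
logits = [-2.0, -0.5, 0.0, 0.5, 2.0]

for tau in (0.5, 1.0, 1.5, 2.0):
    weights = channel_gate(logits, tau)
    print(tau, [round(w, 3) for w in weights])
```

Under this assumed form, a low τ produces near-binary channel selection, whereas a high τ flattens the weights toward 0.5, which is one plausible reason why an intermediate or input-dependent τ performs best in Table V.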

Table VI: Ablation Study of DMSCA Components on CIFAR-100 with ResNet-18

No.	Model Configuration	Top-1 Acc (%)	ΔAcc (vs Baseline)	ΔAcc (vs Prev)	Params (M)	FLOPs (G)
1	ResNet-18	75.32 ± 0.21	-	-	11.17	1.81
2	ResNet-18 + GCE	75.81 ± 0.19	+ 0.49	+ 0.49	11.20	1.81
3	ResNet-18 + TCA(τ = 1)	76.15 ± 0.20	+ 0.83	+ 0.34	11.21	1.81
4	ResNet-18 + GCE + TCA (τ = dynamic, from DII)	76.32 ± 0.18	+ 1.00	+ 0.17	11.21	1.81
5	ResNet-18 + MS-SCE (K={3,5,7})	76.05 ± 0.22	+ 0.73	-	11.23	1.82
6	ResNet-18 + GCE + TCA(τ = dyn) + MS-SCE(K={3,5,7})	76.68 ± 0.19	+ 1.36	+ 0.36	11.26	1.82
7	ResNet-18 + GCE + TCA(τ = dyn) + MS-SCE + DII	76.95 ± 0.17	+ 1.63	+ 0.27	11.27	1.82
8	ResNet-18 + GCE + TCA (τ = dyn) + MS-SCE + DII + DFF	77.08 ± 0.18	+ 1.76	+ 0.13	11.27	1.82
9	ResNet-18 + DMSCA(Full)	77.13 ± 0.16	+ 1.81	+ 0.05	11.28	1.82
Note: ΔAcc (vs Prev) denotes accuracy change from previous configuration. Values are mean ± SD from five runs.

Results show each component contributes positively: GCE (+ 0.49%), dynamic TCA (+ 0.17% over fixed), MS-SCE (+ 0.73% alone, + 0.36% combined), DII (+ 0.27%), DFF (+ 0.13%), AAF (+ 0.05%). The core synergy between GCE, dynamic TCA, MS-SCE, and DII drives DMSCA's performance.
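The two ΔAcc columns of Table VI can be reproduced from the accuracy column alone. The sketch below does so in Python; the `previous` mapping encodes which row each configuration is compared against, as inferred from the reported values.

```python
# Top-1 accuracy (%) per configuration, keyed by the row number in Table VI.
acc = {1: 75.32, 2: 75.81, 3: 76.15, 4: 76.32, 5: 76.05,
       6: 76.68, 7: 76.95, 8: 77.08, 9: 77.13}

# Row each configuration is compared against in the "vs Prev" column.
# Row 5 (MS-SCE alone) has no predecessor in the table.
previous = {2: 1, 3: 2, 4: 3, 6: 4, 7: 6, 8: 7, 9: 8}

delta_base = {row: round(a - acc[1], 2) for row, a in acc.items() if row != 1}
delta_prev = {row: round(acc[row] - acc[prev], 2) for row, prev in previous.items()}

for row in sorted(delta_base):
    prev = delta_prev.get(row)
    prev_text = "-" if prev is None else f"{prev:+.2f}"
    print(f"row {row}: vs baseline {delta_base[row]:+.2f}, vs prev {prev_text}")
```

Along the chain of rows 2, 3, 4, 6, 7, 8, and 9, the incremental gains sum to the total gain of the full model over the baseline (+1.81).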

5.5 Visualization Analysis

To understand DMSCA's impact on feature representation, we used Grad-CAM++ to generate attention heatmaps for ResNet-50 on ImageNet samples.

Table VII shows attention maps from different mechanisms.

Original	ResNet-50 + CBAM	ResNet-50 + ECA	ResNet-50 + SE	ResNet-50 + CA	ResNet-50 + DMSCA

DMSCA generates more focused, semantically meaningful maps, indicating better feature localization. The visualization shows that SE-Net improves channel weighting but lacks spatial precision, while CBAM enhances localization but struggles with complex scenes. DMSCA focuses on discriminative regions while suppressing background noise, which suggests that it guides the model toward more robust feature learning.

To further quantify the quality of the attention maps, we introduce several metrics defined in Table VIII and evaluate them on a subset of CIFAR-10.

Table VIII: Quantitative Evaluation of Attention Maps on CIFAR-10 Samples

Method	Focus Ratio†	Semantic Consistency‡	Noise Ratio§
ResNet-18	0.65 ± 0.04	0.62 ± 0.05	0.25 ± 0.03
SE-Net	0.72 ± 0.03	0.68 ± 0.04	0.15 ± 0.02
CBAM	0.78 ± 0.03	0.74 ± 0.03	0.12 ± 0.02
ECA-Net	0.75 ± 0.04	0.71 ± 0.04	0.13 ± 0.03
CA	0.81 ± 0.02	0.77 ± 0.03	0.10 ± 0.02
DMSCA	0.87 ± 0.02	0.84 ± 0.02	0.08 ± 0.01
Note: † Focus ratio: Proportion of attention energy in target areas (via bounding boxes). ‡ Semantic consistency: Similarity (IoU/SSIM) to saliency maps. § Noise ratio: Energy in background (lower better). Higher is better for † and ‡. Values: mean ± SD.

The quantitative results in Table VIII are consistent with the qualitative observations in Table VII. DMSCA performed best on all three metrics: it had the highest focus ratio (0.87) and semantic consistency (0.84) and the lowest noise ratio (0.08). This indicates that DMSCA directs attention more accurately to semantically relevant regions while suppressing background noise and irrelevant information. CA ranked second on all three metrics, and SE-Net was the weakest of the attention mechanisms, ahead only of the plain ResNet-18 baseline. These data further support the advantage of DMSCA in feature localization and selection.
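As an illustration of the focus ratio defined in the note to Table VIII, the following sketch computes the share of attention energy that falls inside a bounding box. The box convention (top, left, bottom, right, end-exclusive) and the toy heatmap are hypothetical choices for this example and are not taken from the evaluation code.

```python
def focus_ratio(attention, box):
    """Share of total attention energy that falls inside a bounding box.

    attention: 2-D list of non-negative values (rows x cols).
    box: (top, left, bottom, right), end-exclusive indices (assumed convention).
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in attention)
    if total == 0:
        return 0.0
    inside = sum(sum(row[left:right]) for row in attention[top:bottom])
    return inside / total

# Toy 4x4 heatmap with most of its mass in the central 2x2 block.
heatmap = [
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.1, 0.8, 0.9, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]
ratio = focus_ratio(heatmap, (1, 1, 3, 3))
print(f"focus ratio = {ratio:.2f}")
```

A perfectly uniform map scores the box's share of the image area, so values well above that share indicate attention concentrated on the target.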

5.6 Training Dynamics Analysis

Beyond final performance, we examined DMSCA's impact on training dynamics using ResNet-50 on ImageNet.

Fig. 8: Training Loss Curves

Fig. 9: Validation Top-1 Accuracy Curves

DMSCA demonstrates faster loss decay and accuracy ascent from early epochs compared to baselines. It achieves higher accuracy milestones sooner, converges more rapidly, and maintains greater stability in later training with reduced fluctuations. This indicates DMSCA facilitates more efficient learning and yields more robust features.

5.7 Robustness and Generalization Analysis

A good attention mechanism should perform well not only on standard test sets but also under perturbations and shifts in data distribution; that is, it should be robust and generalize well.

5.7.1 Robustness to Image Degradation

We evaluated DMSCA under common image degradations (Gaussian noise, motion blur, and JPEG compression) on CIFAR-100 with a ResNet-18 backbone. The results are shown in Table IX.

Table IX: Robustness of Different Attention Mechanisms on CIFAR-100 (ResNet-18) under Image Degradation (Top-1 Accuracy, %)

Degradation Type	Baseline	+SE-Net	+CBAM	+ECA-Net	+CA	+DMSCA
Original	75.32	76.11	76.58	79.05	76.39	77.13
Gaussian Noise, σ = 15	65.21	66.48	67.05	66.76	66.98	68.25
Gaussian Noise, σ = 25	58.73	60.15	60.88	60.32	60.71	62.13
Motion Blur, kernel size = 7	68.14	69.32	69.98	69.51	69.77	71.15
JPEG Compression, quality = 30	70.14	71.75	72.46	71.99	72.25	73.58

Table IX shows that ResNet-18 with DMSCA achieved the highest classification accuracy under every degradation condition tested. For instance, under moderate Gaussian noise (σ = 15), the accuracy of DMSCA (68.25%) was 3.04 percentage points higher than that of the baseline (65.21%) and 1.20 points higher than that of the second-best method, CBAM (67.05%). DMSCA kept its lead under stronger noise (σ = 25) and under the other degradation types. This indicates that the feature representations learned by DMSCA are more resistant to these common image perturbations; its dynamic adjustment and multi-scale perception likely help it extract key features even when information is corrupted.
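The Gaussian-noise condition can be sketched as follows. The example assumes that σ is expressed on the 0 to 255 intensity scale and that noisy pixels are clipped back to that range; the text does not specify either detail, and the function name and toy image are hypothetical.

```python
import random

def add_gaussian_noise(image, sigma, seed=None):
    """Add zero-mean Gaussian noise to a grayscale image and clip to [0, 255].

    image: 2-D list of pixel intensities in [0, 255].
    sigma: noise standard deviation, assumed to be in the same 0-255 units.
    """
    rng = random.Random(seed)
    return [
        [min(255.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]

# Toy 2x3 "image"; a real evaluation would loop over the CIFAR-100 test set.
clean = [[0, 128, 255], [64, 192, 32]]
for sigma in (15, 25):
    noisy = add_gaussian_noise(clean, sigma, seed=0)
    print(sigma, [[round(p, 1) for p in row] for row in noisy])
```

Fixing the seed keeps the corrupted test set identical across models, so that accuracy differences reflect the models rather than the noise draw.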

5.7.2 Cross-Dataset Generalization

We further evaluated the generalization ability of DMSCA across different datasets. The experimental design was as follows: the model was trained on the source dataset, and then directly tested on the target dataset.

Table X: Evaluation of Generalization Ability Across Datasets

Source dataset → Target dataset	Baseline	+SE-Net	+CBAM	+CA	+DMSCA
CIFAR-100→CIFAR-10	92.1	92.8	93.2	93.4	94.1
ImageNet→CIFAR-100	82.3	82.3	83.7	84.0	84.4
ImageNet→Oxford-IIIT Pet	89.2	89.2	90.1	90.6	91.7
ImageNet→Food-101	76.8	77.5	78.1	78.4	79.2

To verify the universality of DMSCA, we conducted preliminary experiments on object detection and semantic segmentation:

1) Object Detection (COCO 2017): Faster R-CNN with ResNet-50 + DMSCA as the backbone network. Baseline (ResNet-50): mAP = 37.4; ResNet-50 + DMSCA: mAP = 38.9 (+1.5).

2) Semantic Segmentation (Cityscapes): tested under the DeepLabV3+ framework. Baseline (ResNet-50): mIoU = 78.2; ResNet-50 + DMSCA: mIoU = 79.6 (+1.4).

These results indicate that DMSCA generalizes across tasks and is not limited to image classification.

In addition to the three major datasets (CIFAR-10, CIFAR-100, and ImageNet), we evaluated DMSCA on smaller, domain-specific image datasets, namely Oxford-IIIT Pet [21] and Food-101 [22]. On these datasets, DMSCA again improved over the baseline and the other attention mechanisms (Table X), further supporting its ability to adapt to different data distributions and task characteristics.

Based on the analysis of robustness and generalization ability, DMSCA not only performs exceptionally well under standard conditions, but also shows strong adaptability and stability when facing challenging scenarios and diverse data, which is crucial for practical applications.

6 Conclusion and Future Work

This paper introduced DMSCA, a novel attention mechanism that enhances feature representation in CNNs through a multi-component, collaborative design. Our core contribution lies in the dynamic, data-dependent fusion of channel and spatial attention, which overcomes the limitations of static, predefined structures in prior works like CBAM. Key innovations, including the Temperature-Controlled Channel Attention (TCA) and the Direction-aware Multi-scale Spatial Context Encoder (MS-SCE), enable the model to adaptively modulate features based on input characteristics, leading to significant and consistent performance gains across multiple benchmarks, including ImageNet.

Comprehensive experiments demonstrated DMSCA's superiority over existing attention mechanisms in terms of accuracy, robustness to common image corruptions, and generalization to fine-grained datasets. While DMSCA introduces a moderate computational overhead, its performance benefits justify the trade-off, establishing a new state-of-the-art for CNN-based attention.

However, our work has limitations. The current design is tailored for CNN architectures. Future work should explore adapting the core principles of DMSCA to Vision Transformer (ViT) backbones, potentially creating a hybrid attention model that leverages the best of both worlds. Furthermore, the potential of DMSCA in other complex visual tasks, such as video analysis or 3D point cloud processing, remains an exciting avenue for future research. We believe that the design philosophy of DMSCA—emphasizing dynamic interaction and multi-scale context—offers a valuable direction for developing next-generation attention mechanisms.

7 Fund: This work was supported by the National Natural Science Foundation of China (Grant No. 62262020).

Author Contribution

Shen designed the experiments; Li and Hu carried out the experiments and wrote the article.

References

1. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks, in Proc. Adv. Neural Inf. Process. Syst., pp. 1097–1105. (2012).

2. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770–778. (2016).

3. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556. (2014).

4. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4700–4708. (2017).

5. Szegedy, C. et al. Going deeper with convolutions, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1–9. (2015).

6. Krizhevsky, A., Nair, V. & Hinton, G. Learning Multiple Layers of Features From Tiny Images (Univ. of Toronto, 2009).

7. Deng, J. et al. ImageNet: A large-scale hierarchical image database, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 248–255. (2009).

8. Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library, in Proc. Adv. Neural Inf. Process. Syst., pp. 8026–8037. (2019).

9. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 7132–7141. (2018).

10. Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. CBAM: Convolutional block attention module, in Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 3–19. (2018).

11. Wang, Q. et al. ECA-Net: Efficient channel attention for deep convolutional neural networks, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 11534–11542. (2020).

12. Hou, Q., Zhou, D. & Feng, J. Coordinate attention for efficient mobile network design, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 13713–13722. (2021).

13. Misra, D., Nalamada, T., Arasanipalai, A. U. & Hou, Q. Rotate to attend: Convolutional triplet attention module, in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., pp. 3139–3148. (2021).

14. Bello, I., Zoph, B., Le, Q., Vaswani, A. & Shlens, J. Attention augmented convolutional networks, in Proc. IEEE/CVF Int. Conf. Comput. Vis., pp. 3286–3295. (2019).

15. Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale, arXiv:2010.11929. (2020).

16. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization, arXiv:1412.6980. (2014).

17. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization, arXiv:1711.05101. (2017).

18. Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization, in Proc. IEEE Int. Conf. Comput. Vis., pp. 618–626. (2017).

19. Cohen, N., Sharir, G. & Shashua, A. On the expressive power of deep learning: A tensor analysis, arXiv:1606.05336. (2016).

20. Chattopadhyay, A., Sarkar, A., Howlader, P. & Balasubramanian, V. N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks, in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), pp. 839–847. (2018).

21. Parkhi, O. M., Vedaldi, A., Zisserman, A. & Jawahar, C. V. Cats and dogs, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3498–3505. (2012).

22. Bossard, L., Guillaumin, M. & Van Gool, L. Food-101 – mining discriminative components with random forests, in Proc. Eur. Conf. Comput. Vis., Springer, pp. 446–461. (2014).

23. He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN, in Proc. IEEE Int. Conf. Comput. Vis., pp. 2961–2969. (2017).

24. Lin, T. Y., RoyChowdhury, A. & Maji, S. Bilinear CNN models for fine-grained visual recognition, in Proc. IEEE Int. Conf. Comput. Vis., pp. 1449–1457. (2015).

25. Liu, Z. et al. Large-scale long-tailed recognition in an open world, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 2537–2546. (2019).

26. Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples, arXiv:1412.6572. (2014).

27. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network, arXiv:1503.02531. (2015).

28. Madry, A., Makelov, A., Schmidt, L., Tsipras, D. & Vladu, A. Towards deep learning models resistant to adversarial attacks, arXiv:1706.06083. (2017).

29. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks, in Proc. Adv. Neural Inf. Process. Syst., pp. 91–99. (2015).

30. Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3431–3440. (2015).

31. Goodfellow, I. et al. Generative adversarial nets, in Proc. Adv. Neural Inf. Process. Syst., pp. 2672–2680. (2014).

32. Vaswani, A. et al. Attention is all you need, in Proc. Adv. Neural Inf. Process. Syst., pp. 5998–6008. (2017).

33. Tolstikhin, I. O. et al. MLP-Mixer: An all-MLP architecture for vision, arXiv:2105.01601. (2021).

34. Zhang, H. et al. ResNeSt: Split-attention networks, arXiv:2004.08955. (2020).

35. Cui, Y., Jia, M., Lin, T. Y., Song, Y. & Belongie, S. Class-balanced loss based on effective number of samples, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 9268–9277. (2019).

36. Zhang, H. et al. Theoretically principled trade-off between robustness and accuracy, in Proc. Int. Conf. Mach. Learn., pp. 7472–7482. (2019).

37. Chen, Y. et al. Dynamic convolution: Attention over convolution kernels, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 11030–11039. (2020).

38. Yang, B., Bender, G., Le, Q. V. & Ngiam, J. CondConv: Conditionally parameterized convolutions for efficient inference, in Proc. Adv. Neural Inf. Process. Syst., pp. 1305–1316. (2019).

39. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 779–788. (2016).

40. Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848 (2018).

41. Karras, T., Laine, S. & Aila, T. A style-based generator architecture for generative adversarial networks, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4401–4410. (2019).

42. Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows, in Proc. IEEE/CVF Int. Conf. Comput. Vis., pp. 10012–10022. (2021).

43. Tan, M. & Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks, in Proc. Int. Conf. Mach. Learn., pp. 6105–6114. (2019).

44. Howard, A. G. et al. MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv:1704.04861. (2017).

45. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L. C. MobileNetV2: Inverted residuals and linear bottlenecks, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4510–4520. (2018).

46. Zhang, X., Zhou, X., Lin, M. & Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6848–6856. (2018).

47. Ma, N., Zhang, X., Zheng, H. T. & Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design, in Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 116–131. (2018).

48. Chollet, F. Xception: Deep learning with depthwise separable convolutions, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1251–1258. (2017).

49. Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in Proc. Int. Conf. Mach. Learn., pp. 448–456. (2015).

50. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), 1929–1958 (2014).

51. Yang, L., Zhang, R. Y., Li, L. & Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks, in Proc. Int. Conf. Mach. Learn., pp. 11863–11874. (2021).

52. Liu, Y., Shao, Z., Teng, Y. & Hoffmann, N. NAM: Normalization-based attention module, arXiv:2111.12419. (2021).

53. Chen, Y., Kalantidis, Y., Li, J., Yan, S. & Feng, J. A²-Nets: Double attention networks, in Proc. Adv. Neural Inf. Process. Syst., pp. 352–361. (2018).

54. Park, J., Woo, S., Lee, J. Y. & Kweon, I. S. BAM: Bottleneck attention module, arXiv:1807.06514. (2018).

55. Liu, Y., Shao, Z. & Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions, arXiv:2112.05561. (2021).

56. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88 (2), 303–338 (2010).

57. Lin, T. Y. et al. Microsoft COCO: Common objects in context, in Proc. Eur. Conf. Comput. Vis., Springer, pp. 740–755. (2014).

58. Cordts, M. et al. The Cityscapes dataset for semantic urban scene understanding, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3213–3223. (2016).

59. Zhou, B. et al. Scene parsing through ADE20K dataset, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 633–641. (2017).

60. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115 (3), 211–252 (2015).

61. Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization, in Proc. IEEE Int. Conf. Comput. Vis., pp. 618–626. (2017).

Data Availability

The datasets analysed during the current study are publicly available benchmark datasets. The CIFAR-10 and CIFAR-100 datasets are available from the University of Toronto’s website: https://www.cs.toronto.edu/~kriz/cifar.html. The ImageNet ILSVRC 2012 dataset is available via its official website: https://image-net.org/challenges/LSVRC/2012/.

Li Zong

Li Zong (bachelor's degree) studied at Hubei Minzu University, majoring in Electrical Engineering and Automation. His main research interests are the application of object detection in electrical engineering and attention mechanisms in convolutional neural networks.

Hu Jun Peng

Hu Jun Peng (Associate Professor) received the B.S. degree in Computer Science and Technology from Hubei Minzu University in 2003 and the M.S. degree from Wuhan University in 2009. He is currently pursuing a Ph.D. degree at Sichuan University and serves as an Associate Professor at Hubei Minzu University. His research focuses on cloud computing, big data security, and intrusion detection.

Shen Ji Nan

Shen Ji Nan (Professor, Member of the China Computer Federation, Member of the Association for Computing Machinery)

He received his Bachelor's degree in Computer Science and Technology from Hubei Minzu University in 2003, his Master's degree in Computer Science and Engineering from Wuhan University in 2009, and his Doctorate in Cybersecurity from Huazhong University of Science and Technology in 2020. He is currently a professor and vice dean of the School of Intelligent Science and Engineering at Hubei Minzu University. His research focuses on cloud security and privacy protection.
