Research on Few-Shot Fault Diagnosis and Feature Extraction Mechanism Based on ADPCW-ELCNN
Pengfei Pang 1, Tengji Xia 2,*
1 Army Engineering University, Nanjing, 210001, China
2 Nanjing University of Posts and Telecommunications, Nanjing, 210007, China
*Correspondence: 3235287357@qq.com (Tengji Xia)
This work was supported by the National Natural Science Foundation of China under Grant 51705531.
ABSTRACT
Aiming at the challenges in few-shot fault diagnosis of rotating machinery, such as weak features being easily masked by noise, the disconnection between parameter and feature adaptation, and the difficulty in balancing generalization ability with efficiency, an Adaptive Dual-Parameter Collaborative Wavelet Convolutional Neural Network is proposed. This model focuses on dual-parameter collaborative optimization, dynamic feature adaptation, and lightweight generalization enhancement. Through innovative designs including quantitative correlation between wavelet and convolutional kernel parameters, as well as global-local feature adaptation, it effectively integrates the advantages of time-frequency analysis with lightweight requirements. Experimental validation on bearing and gearbox datasets under complex working conditions shows that the model size is only 55 KB, with an average diagnostic accuracy of 99.98%, significantly outperforming traditional models. Even under strong noise and extreme few-shot scenarios with only 20 samples, it maintains excellent performance, providing an efficient and practical new solution for intelligent fault diagnosis of rotating machinery with limited samples.
INDEX TERMS
rotating machinery, few-shot, fault diagnosis, wavelet, convolutional neural network
1. Introduction
As the key power unit in intelligent manufacturing and energy equipment, the health status of core components in rotating machinery, such as industrial gearboxes and rolling bearings, directly affects production continuity, system safety, and maintenance costs [1]. Due to prolonged exposure to loads, shocks, and friction, these components exhibit a significantly higher probability of failure compared to others. Moreover, early fault signals, characterized by weak low-frequency impulses, are easily masked by background noise, load coupling, and electromagnetic interference. As a result, fault signals often display low signal-to-noise ratios (SNR) and strong non-stationary characteristics [2]. Additionally, collecting fault samples in industrial settings typically requires equipment disassembly during downtime, which disrupts production and increases costs. These challenges collectively create an engineering bottleneck defined by small sample sizes and high noise interference, rendering traditional diagnostic techniques inadequate for precise fault identification [3].
Fault diagnosis technology has evolved through three distinct stages in response to complex working conditions, each of which exhibits limitations in scenarios with limited samples and strong noise interference. The first stage, which relies on physical models, offers clear physical interpretability but depends on precise equipment parameters [4]. Under few-shot conditions, this approach tends to overfit, while its diagnostic accuracy drops sharply in the presence of strong noise [5]. The second stage involves signal processing-based methods [6], which extract features through techniques such as Fourier Transform (FT) [7], Wavelet Transform (WT) [8], and Variational Mode Decomposition (VMD) [9]. Although these methods eliminate the need for complex physical models, they require manual setting of time-frequency parameters. Consequently, they struggle to cover multiple fault types with limited samples, are prone to misidentification under strong noise, and lack adaptability to varying working conditions [10]. The third stage comprises intelligent diagnosis methods [11]. Shallow models, such as Support Vector Machines (SVM) [12] and Random Forests (RF) [13], demonstrate certain classification capabilities under simple conditions but suffer from inadequate feature representation in few-shot scenarios and a significant decline in accuracy amid strong noise [13]. Among deep learning approaches, models like Autoencoders (AE) [14] and Restricted Boltzmann Machines (RBM) [15] exhibit low sensitivity to local weak features, while architectures such as Recurrent Neural Networks (RNN) [16], Long Short-Term Memory Networks (LSTM) [17], and Transformers [18] emphasize global temporal relationships but demonstrate limited focus on localized fault characteristics.
To address the limitations of traditional technologies, few-shot fault diagnosis technology has emerged as a research focus, and the primary objective is to resolve the problems of overfitting and weak generalization ability caused by insufficient samples through three technical approaches, namely data augmentation, model optimization, and transfer learning, thereby achieving accurate fault identification. For data augmentation, the data volume is expanded through signal transformations such as Gramian Angular Field transformation or generative models, among which the GAN-based virtual sample generation method proposed by Yu [3] is regarded as representative. For model optimization, the dependence on large-scale sample data is reduced by improving the structure of Convolutional Neural Network (CNN), with the deformable convolution proposed by Wang [19] and the lightweight Transformer-CNN architecture designed by Fang [20] as typical optimization schemes. For transfer learning, adaptation to few-shot fault diagnosis scenarios is achieved by leveraging pre-trained models from the source domain, where meta-learning-based parameter transfer [21] and ResNet18 pre-training with fine-tuning [22] are commonly used implementation strategies.
However, limitations persist in existing technologies. Data augmentation does not sufficiently mine fault features. Model optimization lacks a collaborative mechanism between parameters and features. Transfer learning depends too heavily on the source domain and does not adequately consider lightweight performance. A comprehensive collaborative mechanism that integrates data, model, and transfer has not been established. Consequently, it is difficult to balance feature extraction efficiency under small-sample scenarios against robustness to strong noise.
To address the aforementioned challenges, two strategies combining wavelet transform with CNN are commonly adopted in existing research [23], but both suffer from inherent limitations. The first is a serial architecture, where wavelet transform is employed as an independent pre-signal processing step, and the processed features are subsequently fed into the CNN [24]. Although multi-scale feature adaptability is enhanced by this method, end-to-end optimization cannot be achieved due to the decoupling of the front-end and back-end modules. Additionally, parameter tuning relies on manual experience, resulting in unsatisfactory performance under small sample conditions. The second is an embedded architecture, where learnable wavelet functions are used to replace the convolution kernels of CNN, enabling implicit extraction of time-frequency features [25], [26]. However, existing methods are generally plagued by issues such as fixed wavelet scales and difficulty in balancing time-frequency resolution, with analysis accuracy degrading under strong noise conditions. Meanwhile, most models still adopt complex deep-layer structures, making it impossible to balance the preservation of weak features and the requirement for model lightweighting under small sample scenarios.
Based on a comprehensive review of existing research, the core bottlenecks in few-shot and strong-noise fault diagnosis of rotating machinery can be summarized in four aspects. First, the absence of end-to-end collaboration between the wavelet transform and the CNN hinders overall optimization. Second, convolution kernel parameters are insufficiently adapted to key wavelet parameters such as scale and length, resulting in low feature extraction efficiency. Third, deep network architectures distort the transmission of weak features, and the excessive model size makes them unsuitable for edge deployment. Fourth, existing methods lack weak feature enhancement mechanisms for few-shot scenarios, making fault features difficult to separate from strong noise.
Based on the aforementioned bottlenecks, and building upon the existing Explainable Lightweight CNN (ELCNN) framework[27], an Adaptive Dual-Parameter Collaborative Wavelet ELCNN (ADPCW-ELCNN) is proposed. The model achieves technological breakthroughs through the following innovative designs. First, a quantitative correlation mechanism between wavelet scale and convolution kernel length is established to address the disconnection between parameter adaptation and feature extraction. Second, a global-local collaborative parameter adaptation strategy is designed to enhance the model’s matching capability for full-frequency-band features. Third, a nonlinear mapping unit is introduced to simultaneously achieve weak feature enhancement and noise suppression. Finally, a single-layer lightweight architecture is adopted, which avoids feature transmission distortion while reducing the number of model parameters to a level suitable for deployment on edge devices.
Compared with existing research, the proposed method is not only verified experimentally on fault diagnosis tasks but is also supported at the mechanism level by a collaborative optimization theory covering parameter collaboration, feature evolution, and architecture fidelity, so that the effectiveness of the model rests on both experiment and theory. The remainder of the paper is organized as follows: Section 2 elaborates on the model architecture and mathematical derivation; Section 3 conducts experimental verification based on the Ottawa and MCC5-THU datasets; Section 4 carries out an in-depth mechanism analysis of the aforementioned collaborative optimization theory; and Section 5 summarizes the paper and outlines future research directions.
2. ADPCW-ELCNN proposed for few-shot signals
As shown in Fig. 1, ADPCW-ELCNN is constructed based on ELCNN to form an end-to-end fault diagnosis process. Vibration signals are processed through an adaptive wavelet convolutional layer for multi-scale feature extraction. By establishing a quantitative scale-kernel length relationship and employing a global-local collaboration mechanism, an adaptive mapping is achieved where small scales are matched with short kernels and large scales with long kernels, thereby enabling dynamic parameter adjustment and full-frequency-band feature coverage. Subsequently, the features are enhanced in signal-to-noise ratio via a square nonlinear enhancement unit. Finally, the diagnosis results are output by a single-layer lightweight architecture. The principles of each module will be described in the following sections.
2.1 Joint optimization design of wavelet scale and convolution kernel length
In conventional CNNs, convolutional kernels are generally initialized through random strategies, resulting in uncertain frequency response characteristics during the early training phase, which makes it difficult to directly match the fault features with clear physical meanings in vibration signals. To incorporate prior knowledge of time-frequency analysis into the model, wavelet functions are employed as the initialization basis for the convolutional kernels. Wavelet functions possess inherent time-frequency localization capability, and their center frequencies and bandwidths can be adjusted via scaling factors, thereby providing an effective mathematical foundation for constructing convolutional kernels related to the physical mechanisms of fault features. The discrete convolution form is expressed as follows.
$$y_s(n) = \sum_{m=0}^{L-1} x(n+m)\,\psi_s(m)$$
where $x$ is the input signal, $L$ is the length of the convolutional kernel, $n$ represents the scanning position of the convolutional sliding operation, $m$ denotes the step index of the sliding process, and $s$ stands for the wavelet scale parameter.
After wavelet functions are introduced into convolution operations, their scale parameters and kernel lengths become the two key parameters affecting feature extraction. A quantitative relationship between the two is established to achieve collaborative optimization, and the matching criterion is derived from the energy concentration property of wavelet functions. The energy of wavelet functions is highly concentrated in both the time domain and the frequency domain [28]. For a given mother wavelet, a time range containing most of its energy can be defined [29]. Drawing on the idea of $3\sigma$ interval quantization in statistics, this energy concentration range can be used as the theoretical basis for determining kernel lengths [30].
To ensure that signals within this energy range are effectively captured by convolution operations, the length $L$ of the convolutional kernel should be no less than the time-domain width of this range. Based on this, the following matching relationship is established.
$$L = 2\lceil k\,s \rceil + 1$$
where $k$ is the energy concentration range coefficient of the mother wavelet, which is defined as the temporal half-width containing a specific proportion of energy, such as 99.7% [31]; $s$ denotes the wavelet scale; the coefficient 2 is used to cover the positive and negative symmetric portions of the energy interval; $\lceil\cdot\rceil$ represents the ceiling operation; and the addition of 1 is performed to ensure that the length of the convolutional kernel is odd, thereby avoiding temporal shift.
This formula reveals the positive correlation between wavelet scale and convolutional kernel length, where larger scales correspond to low-frequency wide intervals and thus require long convolutional kernels for coverage, while smaller scales correspond to high-frequency narrow intervals and can be matched with short convolutional kernels. This quantitative relationship provides the theoretical basis and parameter initialization foundation for the subsequent construction of a global-local collaborative optimization mechanism, enabling dynamic parameter adaptation and accurate multi-scale feature extraction.
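The scale-to-length matching rule can be illustrated with a short Python sketch. It assumes the rule takes the form L = 2*ceil(k*s) + 1, uses k = 3 as an illustrative coefficient (by analogy with the 99.7% interval), and samples a Mexican Hat wavelet as the kernel shape. The function names `kernel_length` and `mexican_hat_kernel` are hypothetical and not taken from the original implementation.

```python
import math

def kernel_length(scale, k=3.0):
    # Odd length covering the interval [-k*scale, +k*scale] in samples.
    return 2 * math.ceil(k * scale) + 1

def mexican_hat_kernel(scale, k=3.0):
    # Mexican Hat wavelet sampled at integer offsets centred on zero.
    # The 1/sqrt(scale) factor is the usual energy normalisation (assumed here).
    half = math.ceil(k * scale)
    kernel = []
    for m in range(-half, half + 1):
        t = m / scale
        kernel.append((1.0 - t * t) * math.exp(-0.5 * t * t) / math.sqrt(scale))
    return kernel

# Larger scales lead to longer kernels; every length is odd.
lengths = {s: kernel_length(s) for s in (1.0, 4.0, 16.0)}
kernel = mexican_hat_kernel(4.0)
```

With these assumed values, a scale of 1 maps to a 7-point kernel and a scale of 16 to a 97-point kernel, which reflects the pairing of small scales with short kernels and large scales with long kernels.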
2.2 Scale adaptive design for global-local collaboration
To achieve adaptive extraction of multi-scale features, a global-local collaborative scale adaptation mechanism is adopted in the model. Through hierarchical parameter adjustment, the convolutional kernels are configured to exhibit the cooperative relationship illustrated in Fig. 2, where small scales are paired with short kernels and large scales with long kernels. Specifically, the mechanism encompasses hierarchical learning of convolutional kernel scales and coordinated optimization of their lengths.
1) Hierarchical learning at convolutional kernel scale
Scale learning is divided into fundamental scales and local scales, with the former serving as a globally unified benchmark and the latter being fine-tuned for individual adaptation. The specific analysis is as follows.
The first approach is global guidance, where a learnable global scale factor $g$ is introduced and mapped into an interpolation weight $w$ through the Sigmoid function $\sigma(\cdot)$, followed by normalization to the range of [-1, 1]. Subsequently, linear interpolation is applied between the predefined maximum fundamental scale $s_{\max}$ and the minimum fundamental scale $s_{\min}$ to generate the dynamic fundamental scale $s_{\text{base}}$. The relevant formulas are as follows.
$$w = 2\,\sigma(g) - 1, \qquad s_{\text{base}} = \frac{s_{\max} + s_{\min}}{2} + w\,\frac{s_{\max} - s_{\min}}{2}$$
The second approach is local adaptation, where a learnable local factor $\lambda_i$ is introduced for the i-th convolutional kernel to perform differentiated scaling on the fundamental scale, thereby adapting to the scale characteristics of local features. The calculation formula is as follows.
$$s_i = \lambda_i\, s_{\text{base}}, \qquad i = 1, 2, \dots, C$$
where $C$ is the number of output channels. High-frequency impact features drive the adjustment toward small scales, while low-frequency periodic features drive the adjustment toward large scales, enabling the scale of each convolutional kernel to be precisely matched to the scale characteristics of the local features.
2) Collaborative optimization of convolution kernel length
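To ensure the continuity of feature extraction and computational stability, the length of the convolutional kernels is determined by a global baseline constraint $L_{\text{base}}$ combined with a local fine-tuning $\Delta L_i$ mechanism. The calculation formula is provided below.
$$L_{\text{base}} = 2\lfloor k\,s_{\text{base}} \rfloor + 1, \qquad \Delta L_i = 2\lfloor \beta\,k\,(s_i - s_{\text{base}}) \rfloor, \qquad L_i = \min\big(\max(L_{\text{base}} + \Delta L_i,\ L_{\min}),\ L_{\max}\big)$$
where $k$ represents the learnable energy concentration range coefficient of the wavelet, $\beta$ denotes the length adjustment factor, $s_{\text{base}}$ and $s_i$ are the fundamental scale and the scale of the i-th kernel, $\lfloor\cdot\rfloor$ corresponds to the floor operation, $L_{\max}$ is the maximum constraint on the convolutional kernel length, and $L_{\min}$ is the minimum constraint on the convolutional kernel length.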
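The global-local mechanism can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the Sigmoid output is rescaled to (-1, 1) and used to interpolate between the minimum and maximum fundamental scales, each kernel multiplies the fundamental scale by its own local factor, and the length is a floor-based baseline plus a local offset clipped to fixed bounds. All names and numeric values (`base_scale`, `kernel_scales`, `kernel_lengths`, the bounds 15 and 75) are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def base_scale(g, s_min, s_max):
    # Global guidance: weight in (-1, 1) interpolates between s_min and s_max.
    w = 2.0 * sigmoid(g) - 1.0
    return 0.5 * (s_max + s_min) + 0.5 * w * (s_max - s_min)

def kernel_scales(s_base, local_factors):
    # Local adaptation: one multiplicative factor per convolutional kernel.
    return [lam * s_base for lam in local_factors]

def kernel_lengths(s_base, scales, k, beta, l_min, l_max):
    # Global baseline length plus an even local offset, clipped to [l_min, l_max].
    l_base = 2 * math.floor(k * s_base) + 1
    lengths = []
    for s in scales:
        delta = 2 * math.floor(beta * k * (s - s_base))
        lengths.append(max(l_min, min(l_max, l_base + delta)))
    return lengths

s_base = base_scale(g=0.0, s_min=2.0, s_max=18.0)
scales = kernel_scales(s_base, [0.5, 1.0, 1.5])
lengths = kernel_lengths(s_base, scales, k=3.0, beta=1.0, l_min=15, l_max=75)
```

In this example the three kernels receive scales 5, 10, and 15; the unclipped lengths would be 31, 61, and 91, and the upper bound trims the last one to 75.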
2.3 Design of square nonlinear mapping module
The squared nonlinear mapping module is designed to enhance the model's ability to capture high-frequency feature components, particularly transient signals such as fault impacts, while optimizing the stability of the gradient update process. This module can reshape the feature energy distribution and suppress noise interference, while synergistically optimizing the scale and length parameters with the adaptive wavelet convolutional layer, enhancing the model's feature discrimination ability in noisy environments and under few-shot conditions.
1) Optimization of convolutional kernel scale
The square nonlinear mapping optimizes model performance through the following dual mechanisms.
First, scale parameter adjustment is performed. The amplitude of the features is nonlinearly enhanced by the square transformation, where the energy enhancement for high-frequency fault features is significantly greater than that for low-frequency background signals. This asymmetric enhancement effect drives smaller scale parameters to be adaptively selected by the model, enabling local details of high-frequency faults to be more accurately captured by small-scale convolutional kernels.
Second, scale gradient optimization is carried out. By introducing a dedicated gradient computation path, the gradient distribution during backpropagation is restructured by this module. This optimization not only accelerates the convergence process of the model but also guides the scale parameters to be updated in a direction that favors the extraction of high-frequency features. The relevant formula is provided below.
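$$z = y^{2}, \qquad \frac{\partial J}{\partial s_i} = \frac{\partial J}{\partial z}\cdot 2y \cdot \frac{\partial y}{\partial s_i}$$
where $z$ is the feature extracted by the square mapping, $y$ is the feature learned by the convolutional layer, $J$ is the training loss, and $s_i$ is the scale of the i-th convolutional kernel.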
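A small numerical sketch shows the effect of squaring on a convolutional output that contains one large impulse on a small background. It assumes an element-wise square, whose derivative with respect to its input is twice the input, so large responses receive proportionally larger gradients. The helper names and the sample values are illustrative only.

```python
def square_map(features):
    # Element-wise square of the convolutional output.
    return [v * v for v in features]

def square_grad(features):
    # Derivative of v**2 with respect to v, evaluated element-wise.
    return [2.0 * v for v in features]

def peak_to_mean(values):
    # Ratio of the largest magnitude to the average magnitude.
    mags = [abs(v) for v in values]
    return max(mags) / (sum(mags) / len(mags))

# One impulse of amplitude 1.0 on a background of amplitude 0.1.
conv_out = [0.1, -0.1, 0.1, 1.0, -0.1, 0.1, -0.1, 0.1]
before = peak_to_mean(conv_out)
after = peak_to_mean(square_map(conv_out))
grads = square_grad(conv_out)
```

The peak-to-mean ratio rises after squaring, and the gradient at the impulse position is ten times larger in magnitude than at a background position. This is the amplitude-dependent enhancement described above.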
2) Optimization of convolutional kernel length
The square nonlinear mapping module collaboratively optimizes convolutional kernel length parameters through the following mechanisms.
First, length parameter adjustment is performed. A correspondence between scale and receptive field is established by the module. When high-frequency features corresponding to small scales are processed, a smaller receptive field is required by the model to focus on local details. Therefore, short convolutional kernels are utilized to avoid the cross-region blurring effect that may be caused by long convolutional kernels. In contrast, when low-frequency features corresponding to large scales are handled, a larger receptive field is needed to integrate global information, for which long convolutional kernels are effectively employed to cover wide-bandwidth features.
Second, length gradient optimization is implemented. The gradient calculation term introduced by the square mapping directionally optimizes the update process of the length parameters. In regions with high-frequency fault features, the gradient update is accelerated, driving the length parameters to be adaptively shortened to meet the requirements of small-scale analysis. In low-frequency background regions, the gradient update remains gentle, allowing the length parameters to be kept stable and ensuring the integration capability for global features. The relevant gradient optimization formula is provided below.
2.4 Design of single-layer convolutional architecture
To avoid the distortion in the transmission of underlying fault features caused by the iterative learning of multi-layer convolutional networks, ADPCW-ELCNN adopts an extremely simple single-layer architecture consisting of adaptive dual-parameter wavelet convolution, square nonlinear mapping, pooling, and classification. This ensures accurate preservation and adaptive optimization of fault features in the time-frequency domain. The complete computational expression is provided below.
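$$\hat{y} = F_{\text{cls}}\Big(F_{\text{pool}}\big(F_{\text{sq}}\big(F_{\text{conv}}(x;\ \psi)\big)\big)\Big)$$
where $x$ is the input vibration signal, $\hat{y}$ is the diagnosis output, $\psi$ is the adaptive dual-parameter cooperative wavelet, $F_{\text{conv}}$ is the convolutional layer, $F_{\text{sq}}$ is the square nonlinear mapping, $F_{\text{pool}}$ is the pooling layer, and $F_{\text{cls}}$ is the classification layer.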
The main advantages of this architecture are reflected in three aspects. First, the feature transmission path is effectively streamlined to reduce information loss. While mechanisms such as wavelet scale-length collaborative optimization and global-local parameter adaptation are preserved, the capability for adaptive adjustment of time-frequency resolution is maintained. Second, significant lightweight efficiency is achieved through a non-redundant single-layer architecture, enabling it to meet the stringent requirements for real-time fault diagnosis on edge devices. Finally, all modules work in coordination to form a complete chain from parameter optimization and feature fidelity to lightweight deployment.
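The single-layer pipeline of convolution, square mapping, pooling, and classification can be sketched in plain Python. The sketch assumes a valid (no padding) convolution, global average pooling, and a linear classifier; the pooling type and classifier form are assumptions, and the tiny kernels and weights are placeholders rather than trained parameters.

```python
def conv1d_valid(signal, kernel):
    # Sliding inner product without padding.
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_out)
    ]

def forward(signal, kernels, weights, biases):
    pooled = []
    for kernel in kernels:
        feat = conv1d_valid(signal, kernel)   # wavelet convolution
        squared = [v * v for v in feat]       # square nonlinear mapping
        pooled.append(sum(squared) / len(squared))  # global average pooling (assumed)
    # Linear classification layer: one logit per class.
    return [
        sum(w * p for w, p in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]

signal = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
kernels = [[1.0, -1.0], [1.0, 1.0]]
logits = forward(signal, kernels, weights=[[1.0, 0.0], [0.0, 2.0]], biases=[0.0, 0.0])
predicted = max(range(len(logits)), key=lambda c: logits[c])
```

Because there is only one convolutional stage, each pooled value is a direct energy measure of one wavelet kernel's response, which is what keeps the feature path short.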
3. Experiment
3.1 Data description
To validate the diagnostic performance under strong time-varying conditions with small samples, two types of data covering complex scenarios with variable speeds and variable loads are selected, the bearing dataset from the University of Ottawa [32] and the gearbox dataset from MCC5 Group and Tsinghua University (MCC5-THU) [33].
1) Ottawa
This dataset is collected from a rolling bearing experimental platform operating under strongly time-varying speeds, with ER16K bearings, and it characterizes bearing fault features under dynamically varying speed conditions. The experimental design includes four typical dynamic speed scenarios: monotonic increase (A0), monotonic decrease (A1), increase then decrease (A2), and decrease then increase (A3). It covers three bearing health states, Healthy (H), Inner Race Fault (IF), and Outer Race Fault (OF), giving a 4×3 combination of working conditions and health states, with sample labels 0 to 2. The data sampling frequency is 200 kHz. Four sample sizes of 20, 40, 80, and 100 are set up for small-sample learning scenarios, and the training, validation, and test sets are divided in a ratio of 7:2:1. The specific data configuration is shown in Table 1.
Table 1. Description of the Ottawa dataset.
| Working condition | Description of working condition | Rotational speed (rpm) | Fault type | Label |
| A0 | Monotonic increase | From 846 to 1428; from 750 to 1668; from 888 to 1628 | H / IF / OF | 0 / 1 / 2 |
| A1 | Monotonic decrease | From 1734 to 822; from 1458 to 594; from 1494 to 588 | H / IF / OF | 0 / 1 / 2 |
| A2 | Increase then decrease | From 882 to 1518, and then to 1260; from 906 to 1464, and then to 1122; from 840 to 1302, and then to 870 | H / IF / OF | 0 / 1 / 2 |
| A3 | Decrease then increase | From 1452 to 888, and then to 1236; from 1518 to 888, and then to 1164; from 1560 to 1134, and then to 1470 | H / IF / OF | 0 / 1 / 2 |
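The 7:2:1 division for the small-sample sizes can be worked through with a short sketch. It assumes integer division for the training and validation counts with the remainder assigned to the test set; the rounding rule and the function name `split_counts` are assumptions, since the source gives only the ratio.

```python
def split_counts(n_samples, ratios=(7, 2, 1)):
    # Floor the training and validation shares; the remainder goes to test.
    total = sum(ratios)
    train = n_samples * ratios[0] // total
    val = n_samples * ratios[1] // total
    test = n_samples - train - val
    return train, val, test

splits = {n: split_counts(n) for n in (20, 40, 80, 100)}
```

Under this rule the smallest setting of 20 samples leaves 14 for training, 4 for validation, and 2 for testing.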
2) MCC5-THU
This dataset is collected through a motor-gearbox experimental platform, simulating dynamic operating conditions with varying speeds and loads. Data from the vertical Z-direction sensor is adopted, whose vibration amplitude is approximately 30% higher than that of other directions, aiding in the reduction of signal distortion. The dataset comprises four operating condition combinations, which include B0 with speed 0→1000→500 rpm and load 10 Nm; B1 with speed 0→1000→500 rpm and load 20 Nm; B2 with speed 1000 rpm and load 0→10 Nm; and B3 with speed 1000 rpm and load 0→20 Nm. The speed/load curves are illustrated in Fig. 3. The experiment covers three types of gear fault states, including Gear Pitting (GP), Miss Teeth (MT), and Tooth Break (TB), forming a 4×3 combination of fault conditions, with sample labels corresponding to 0–2. The data sampling frequency is 12.8 kHz. The sample splitting ratio is consistent with that of the Ottawa dataset. The detailed dataset configuration is provided in Table 2.
Table 2. Description of the MCC5-THU dataset.
| Working condition | Rotational speed (rpm) | Load (Nm) | Fault type | Label |
| B0 | 0 ~ 1000 ~ 500 | 10 | GP / MT / TB | 0 / 1 / 2 |
| B1 | 0 ~ 1000 ~ 500 | 20 | GP / MT / TB | 0 / 1 / 2 |
| B2 | 1000 | 0 ~ 10 | GP / MT / TB | 0 / 1 / 2 |
| B3 | 1000 | 0 ~ 20 | GP / MT / TB | 0 / 1 / 2 |
3.2 Model parameters and evaluation indicators
To ensure the diagnostic stability of the constructed model, the key structural parameters, training parameters, and evaluation metrics of the ADPCW-ELCNN model are systematically elaborated in this subsection.
1) Model structural parameters
The structural parameters of ADPCW-ELCNN are detailed in Table 3. The optimization rationale for the key parameters is outlined below.
First, the number of convolution kernels is considered. As shown in Fig. 4, when the number of convolution kernels is set to 8, the model achieves a classification accuracy of 100% at the 9th step of training, with the fastest convergence speed of training loss and the lowest final loss value. Therefore, the number of convolution kernels is determined to be 8.
Second, the signal sampling length is determined. In the Ottawa dataset, to completely cover at least two rotational cycles, the sampling length is set to 32768 points, with the minimum rotational frequency being 12.5 Hz; in the MCC5-THU dataset, to ensure a frequency resolution of 0.05 Hz, the sampling length is set to 256000 points.
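The two sampling lengths can be checked with simple arithmetic. The sketch below uses the stated sampling frequencies (200 kHz and 12.8 kHz), the stated minimum rotational frequency of 12.5 Hz, and the 0.05 Hz resolution target. Rounding the Ottawa length up to a power of two is an assumption made here to explain why 32768 is used rather than 32000.

```python
import math

def points_for_cycles(fs_hz, min_rot_hz, cycles):
    # Samples needed to span the given number of rotations at the slowest speed.
    return math.ceil(cycles * fs_hz / min_rot_hz)

def next_power_of_two(n):
    # Smallest power of two that is greater than or equal to n.
    return 1 << (n - 1).bit_length()

def points_for_resolution(fs_hz, resolution_hz):
    # Frequency resolution of an N-point record is fs / N.
    return round(fs_hz / resolution_hz)

ottawa_min = points_for_cycles(200_000, 12.5, 2)
ottawa_len = next_power_of_two(ottawa_min)
mcc5_len = points_for_resolution(12_800, 0.05)
```

Two rotations at 12.5 Hz last 0.16 s, which is 32000 points at 200 kHz, and 32768 is the next power of two above that. For MCC5-THU, 12800 / 0.05 gives the 256000 points quoted above.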
2) Model training parameters
The training parameters of ADPCW-ELCNN are kept consistent with those of ELCNN, with the Adam optimizer and cross-entropy loss function adopted. The batch size is set to 64, and a dynamically adjusted learning rate strategy with an initial value of 0.01 is employed. Meanwhile, the early stopping patience is set to 10, which helps prevent overfitting and shortens training time.
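The early stopping rule with a patience of 10 can be sketched as below. The sketch assumes that the monitored quantity is the validation loss and that any strict decrease counts as an improvement; neither detail is stated in the source, and `early_stop_epoch` is a hypothetical helper.

```python
def early_stop_epoch(val_losses, patience=10):
    # Return the epoch index at which training would stop, or None if it never does.
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            waited = 0
        else:
            waited += 1      # no improvement this epoch
            if waited >= patience:
                return epoch
    return None

# Loss improves for three epochs, then plateaus at a worse value.
history = [1.0, 0.8, 0.7] + [0.75] * 12
stop_at = early_stop_epoch(history, patience=10)
```

Here the best loss occurs at epoch 2 and training stops at epoch 12, after ten consecutive epochs without improvement.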
3) Evaluation indicators
Four metrics, namely Accuracy (Acc), Precision (Pre), Recall (Rec), and F1 Score (F1) [24], are introduced to comprehensively evaluate the fault diagnosis performance of the model, through which the classification performance of the model is quantified from multiple dimensions.
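The four metrics can be computed from a confusion matrix as sketched below. Macro averaging over classes is assumed for Precision, Recall, and F1, since the source does not state the averaging scheme, and the example matrix is made up for illustration.

```python
def macro_metrics(confusion):
    # confusion[r][c] counts samples of true class r predicted as class c.
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))
        actual = sum(confusion[c])
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    acc = correct / total
    return acc, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical 3-class result with two misclassified samples out of 30.
example = [[9, 1, 0], [0, 10, 0], [1, 0, 9]]
acc, pre, rec, f1 = macro_metrics(example)
```

A perfectly diagonal matrix yields 1.0 for all four values, which corresponds to the 100% diagonal reported for the complete model.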
Table 3. Network structure parameters of ADPCW-ELCNN.
3.3 Effectiveness analysis of improvement strategy
To verify the necessity of the wavelet convolutional kernel, adaptive scale, and adaptive kernel length in ADPCW-ELCNN, ablation experiments are conducted on the Ottawa dataset with an SNR of -10 dB.
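A noise level such as -10 dB is commonly produced by adding white Gaussian noise scaled to the signal power. The sketch below follows that common procedure; the noise model, the synthetic sine signal, and the helper names are assumptions and do not describe how the original experiments injected noise.

```python
import math
import random

def add_noise(signal, snr_db, rng):
    # Scale Gaussian noise so that signal power / noise power matches snr_db.
    power = sum(v * v for v in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]

def measured_snr_db(clean, noisy):
    # Empirical SNR from the clean signal and the added noise.
    p_signal = sum(v * v for v in clean)
    p_noise = sum((a - b) ** 2 for a, b in zip(noisy, clean))
    return 10 * math.log10(p_signal / p_noise)

rng = random.Random(0)
clean = [math.sin(2 * math.pi * 50 * n / 20000) for n in range(20000)]
noisy = add_noise(clean, -10.0, rng)
snr = measured_snr_db(clean, noisy)
```

At -10 dB the noise carries ten times the power of the signal, which is why the ablation at this level is a demanding test.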
The results presented in Table 4 indicate that each component exerts a significant influence on model performance. First, removing the wavelet convolutional kernel causes the F1-score to drop sharply to 16.17%, as the model lacks prior frequency knowledge and cannot effectively capture fault features. Second, with the adaptive scale module removed, the F1-score decreases to 83.77%, which is 15.91 percentage points lower than that of the full model, indicating that this module plays a crucial role in adapting to fault frequency drift. Third, when the adaptive kernel length is removed, the F1-score is 88.15%, an 11.53 percentage point decline compared to the full model, indicating that this module effectively matches transient impact intervals. Finally, the full model achieves an F1-score of 99.68%, with Acc, Pre, and Rec all approaching 100%, and its performance is significantly superior to all ablated variants.
In summary, the wavelet convolutional kernel, adaptive scale, and adaptive kernel length all play an irreplaceable role in the model, collectively ensuring the accurate extraction and effective identification of fault features under strong noise conditions.
The confusion matrix in Fig. 5 and the t-SNE visualization results in Fig. 6 further validate the superiority of the complete model in feature learning and classification discrimination. Samples from the complete model form distinct clusters in the feature space, with the diagonal of the confusion matrix reaching 100%, whereas the comparative models exhibit severe sample misclassification and feature overlap.
Taken together, the prior-guided capability of the wavelet convolutional kernel, the frequency adaptation achieved by the adaptive scale mechanism, and the effective coverage of transient impact intervals provided by the adaptive kernel length work synergistically. This collaboration enhances the performance and robustness of the model in fault diagnosis, while also significantly improving the discriminative power of the features.
Table 4. Effectiveness verification results of different improvement strategies for ADPCW-ELCNN.
3.4 Superiority analysis of improvement strategy
Based on the Ottawa dataset with 100 samples across a -10 dB to 10 dB SNR range, the dual-parameter global-local coordination mechanism is validated in three aspects: wavelet functions, scale learning mechanisms, and convolutional kernel lengths.
1) Comparison of different wavelet functions
To evaluate the applicability of different wavelet functions for fault diagnosis, four wavelets were examined in terms of diagnostic performance and feature representation: Db [34], Coif [35], Morlet [36], and Mexican Hat [37].
Regarding diagnostic performance, Fig. 7 shows that under strong noise, the model using the Mexican Hat wavelet achieves an F1-score of 96.98%, significantly outperforming the other three wavelets. The Db4 wavelet exhibits time-frequency localization bias due to its asymmetric structure. The Coif4 wavelet has limited high-frequency attenuation capability. The Morlet wavelet struggles to suppress low-frequency noise owing to its lack of vanishing moments. In contrast, the Mexican Hat wavelet demonstrates superior robustness because it is the second derivative of the Gaussian function.
In terms of feature representation, Fig. 8 illustrates that the Db4 and Coif4 wavelets, with their large time-frequency window areas, result in energy dispersion and blurred identification of fault sidebands. The Morlet wavelet is affected by low-frequency noise, which weakens the fault frequency. The Mexican Hat wavelet, as the second derivative of the Gaussian function, provides the best detection capability for impact signals. Fault frequencies and sideband energy are sharply focused, high-frequency noise is effectively suppressed, and time-frequency localization is achieved without deviation.
2) Comparison of learning methods at different scales
After determining the optimal wavelet function, the performance differences among fixed-scale, single-parameter adaptive-scale, and the proposed dual-parameter scale-level learning methods are compared to reveal the influence of scale learning on feature representation.
Figure 9 shows that the dual-parameter model achieves an F1-score of 94.03% at SNR = -10 dB, significantly outperforming the other approaches. Figure 10 further confirms that the dual-parameter mechanism captures a richer distribution of scales, while fixed-scale methods remain constant and single-parameter methods exhibit limited scale adaptability. Consequently, dual-parameter learning is more suitable for high-noise scenarios.
3) Comparison of different convolutional kernel lengths
To investigate the influence of convolutional kernel length on model performance, the generalization ability of the proposed dual-parameter collaborative optimization mechanism under small-sample conditions was systematically validated by comparing it with fixed-size convolutional kernels in scenarios with decreasing sample sizes. Three fixed convolutional kernel sizes, namely 128, 256, and 512, were configured as controls, and performance was evaluated under few-shot conditions with sample sizes of 20, 40, 60, and 80, respectively.
The experimental results in Fig. 11 show that under a sufficient sample size of 100, the performance of fixed kernel sizes of 256 and 512 is comparable to that of the collaborative optimization mechanism. However, as shown in Fig. 12, when the sample size is reduced to 20, the F1-score of the collaborative optimization model remains as high as 94.03%, significantly outperforming all fixed-size schemes. It can thus be concluded that small kernels struggle to capture global features, while large kernels are prone to overfitting. The collaborative optimization mechanism effectively balances the demands of local and global feature extraction.
3.5 Comprehensive performance analysis with existing models
To comprehensively validate the fault diagnosis performance of ADPCW-ELCNN, five representative models were selected as benchmarks: a conventional CNN, a wavelet kernel-based convolutional neural network (W-CNN) [25], a scale-learnable wavelet kernel network (WKN) [26], a deep convolutional neural network with wide first-layer kernels (WDCNN) [38], and a dual-path convolution with attention mechanism and bidirectional gated recurrent unit (DCA-BiGRU) [39]. The results in Table 5 are averages over 10 repeated experiments on the Ottawa-A0 and MCC5-THU-B0 datasets at a signal-to-noise ratio of 10 dB. The models are then compared along four dimensions: lightweight degree, computational efficiency, diagnostic accuracy, and training speed.
The results indicate that ADPCW-ELCNN demonstrates significant advantages in all aspects. Its parameter count is only 9.19K, equivalent to 0.02 percent to 0.21 percent of the comparative models, and its model size is merely 55KB, ranging from 0.09 percent to 83.33 percent of the comparative models. The floating-point operations are as low as 0.02×10³M, representing only 0.254 percent to 3.448 percent of the comparative models. The diagnostic accuracy reaches 99.98 percent, slightly outperforming the second-best WKN model. The training convergence time requires only 71.26 seconds. Furthermore, the multi-metric detailed comparison in Fig. 13 shows that the model comprehensively leads in Acc, Pre, Rec, and F1-score, fully confirming its overall performance superiority.
Table 5
Comparison of lightweight performance of different improved CNNs.
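For readers who wish to reproduce the Acc, Pre, Rec, and F1-score comparison, the following is a minimal sketch of how these four metrics can be computed from predicted and true labels. Macro averaging over classes is an assumption here, and the function name `macro_metrics` and the toy labels are hypothetical.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall, and F1 (illustrative)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows: true class, columns: predicted class
    tp = np.diag(cm).astype(float)
    pred_count = cm.sum(axis=0)
    true_count = cm.sum(axis=1)
    # Classes that are never predicted (or never occur) contribute 0
    pre = np.divide(tp, pred_count, out=np.zeros_like(tp), where=pred_count > 0)
    rec = np.divide(tp, true_count, out=np.zeros_like(tp), where=true_count > 0)
    denom = pre + rec
    f1 = np.divide(2.0 * pre * rec, denom, out=np.zeros_like(tp), where=denom > 0)
    return {"acc": tp.sum() / cm.sum(), "pre": pre.mean(),
            "rec": rec.mean(), "f1": f1.mean()}

# Toy labels for three fault classes (hypothetical data)
scores = macro_metrics([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2], n_classes=3)
```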
3.6 Robustness analysis under few-shot
To verify the reliability of ADPCW-ELCNN under high-noise and few-shot conditions, four operational scenarios from two types of datasets were examined. Different noise intensities were simulated by setting SNR gradients from −10 dB to 10 dB, while few-shot conditions were established by varying sample sizes from 20 to 80. Comparative analyses were conducted with five other improved CNN models from two perspectives: anti-noise performance and few-shot performance. Details are as follows.
1) Analysis of anti-noise performance
Figure 14 and Fig. 15 compare the anti-noise performance of the six CNN models. As the SNR decreases, the performance of models such as the conventional CNN and W-CNN declines significantly. In contrast, ADPCW-ELCNN maintains excellent performance across all operational scenarios on both the Ottawa and MCC5-THU datasets, with the highest F1-score reaching 99.97% and the lowest exceeding 88%, significantly outperforming other models. This validates the effectiveness and diagnostic reliability of its dual-parameter adaptive collaborative wavelet convolutional architecture under high-noise environments.
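The noise conditions used in such a comparison can be sketched as follows. This example assumes additive white Gaussian noise scaled to a target SNR, which is a common choice but is not confirmed by the text; the function name `add_noise_at_snr` and the test tone are hypothetical.

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise so that the result has the requested SNR in dB."""
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal**2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
t = np.arange(20000) / 20000.0
clean = np.sin(2.0 * np.pi * 50.0 * t)        # stand-in for a vibration signal
noisy = add_noise_at_snr(clean, -10.0, rng)   # strongest noise level in the sweep
```

At −10 dB the noise power is ten times the signal power, which is why weak fault impulses are difficult to see in the raw waveform.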
2) Analysis of few-shot performance
Figure 16 and Fig. 17 further compare the performance of the six models under few-shot conditions. When the sample size is only 20, the performance of the conventional CNN, W-CNN, WDCNN, and DCA-BiGRU declines significantly due to their reliance on large amounts of training data. Although WKN exhibits a certain level of generalization capability, its F1-score still falls below that of ADPCW-ELCNN. Benefiting from its collaborative feature extraction mechanism, ADPCW-ELCNN achieves F1-scores of 94.03 percent, 94.37 percent, 96.84 percent, and 98.38 percent under the four operational scenarios of the Ottawa dataset, and also maintains a high F1-score on the MCC5-THU dataset, demonstrating excellent generalization capability in small-sample learning.
3.7 Analysis of representational capability based on convolutional kernels and feature maps
To explore the mechanism by which ADPCW-ELCNN captures fault features under few-shot conditions, the convolution kernel morphologies and time-frequency feature maps of different CNNs are visualized and compared on the MCC5-THU dataset under the B0 working condition with an SNR of -5 dB and a sample size of 100. This process reveals the sources of its representational advantages.
1) Morphological analysis of convolutional kernel
The time-domain waveforms of convolutional kernels reflect the signal processing logic of models, with significant performance differences observed among models under high-noise and few-shot conditions. As shown in Fig. 18, the convolutional kernel waveforms of conventional models such as CNN, WDCNN, and DCA-BiGRU appear cluttered. In W-CNN, some convolutional kernels are distorted due to backpropagation, while WKN exhibits insufficient sensitivity to impulse signals. In contrast, by leveraging the second-derivative property of the Mexican Hat wavelet, the dual-parameter learning strategy, and the collaborative optimization mechanism of convolutional kernel length, ADPCW-ELCNN dynamically adapts to input signals, enabling more targeted feature extraction.
2) Time-frequency analysis of feature maps
The previous section analyzed the signal processing logic of each model from the perspective of convolutional kernel morphology. This section further examines the models’ feature extraction capabilities for fault characterization via time-frequency feature maps. Under high-noise conditions, three typical gearbox faults show clear characteristics: GP manifests as periodic meshing impacts; MT shows irregular impact intervals and amplitude fluctuations from tooth shape mutations; TB mainly exhibits high-frequency transient impacts due to local tooth surface damage. These fault features are severely masked in noisy environments, which traditional methods struggle to identify. Figures 19–21 respectively present the time-frequency transformation results of each improved CNN under identical noise and few-shot conditions.
It is shown in the experiments that all fault signals are affected by the meshing frequency and its harmonics. The time-frequency diagrams of comparative models such as CNN, W-CNN, WDCNN, and DCA-BiGRU are significantly affected by noise, only roughly reflecting the meshing frequency components and failing to extract the fault frequencies. Although WKN can respond to some fault frequencies, its ability to identify high-frequency impacts such as TB is weak. In contrast, the time-frequency performance of ADPCW-ELCNN is outstanding. The GP fault exhibits regular clustered areas. The MT fault has clear features at 138.4 Hz. The TB fault forms distinct and clean transient impact areas at 286.3 Hz and 1896.4 Hz.
This difference arises from the underlying architecture designs of the models. Traditional CNN lacks signal priors and thus has relatively blind feature extraction. The wavelet convolution kernel of W-CNN is prone to distortion during backpropagation. The Morlet wavelet used in WKN is insufficiently matched to high-frequency impacts. WDCNN tends to smooth details under strong noise. The attention mechanism of DCA-BiGRU struggles to focus on effective features in noise. ADPCW-ELCNN, based on the time-frequency localization characteristics of the Mexican Hat wavelet and combined with the dual-parameter adaptive mechanism, is capable of achieving clear and robust feature extraction under extreme operating conditions, which verifies its effectiveness and advantages in fault diagnosis.
4. Analysis of feature extraction mechanism
To reveal the mechanism of ADPCW-ELCNN in few-shot fault diagnosis, verification experiments are conducted along three dimensions: feature evolution, parameter collaboration, and architecture fidelity. The experiments are performed on the Ottawa dataset under few-shot and strong-noise working conditions with a sample size of 20 and an SNR of -10 dB, to ensure that the conclusions are consistent with actual application requirements.
4.1 Feature evolution mechanism
In few-shot strong-noise scenarios, fault features are often masked by background noise, making it a diagnostic challenge to achieve the precise balance between noise suppression and feature enhancement. Based on this, A0-IF fault samples with an SNR of -10 dB from the Ottawa dataset are selected. By tracking the time-frequency feature changes of the signal after passing through the adaptive wavelet convolutional layer, the square layer, and the pooling layer, the transformation patterns of each module on fault features are quantitatively analyzed.
To quantitatively evaluate feature extraction performance, three metrics are introduced. Fault-energy ratio denotes the percentage of energy at the 67.9 Hz main frequency and its sidebands relative to total energy. Noise suppression rate reflects the attenuation of high-frequency noise energy above 300 Hz versus the original signal. Feature sharpness is measured by the standard deviation of energy distribution in the fault frequency region, with smaller values indicating more concentrated features.
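The three metrics can be sketched on an FFT power spectrum as follows. This is only one possible reading of the definitions above: the band edges around 67.9 Hz, the assumption that input and output spectra share the same frequency grid, and all function names are hypothetical choices for this example.

```python
import numpy as np

def power_spectrum(x, fs):
    """One-sided power spectrum and its frequency axis in Hz."""
    x = np.asarray(x, dtype=float)
    return np.fft.rfftfreq(x.size, d=1.0 / fs), np.abs(np.fft.rfft(x)) ** 2

def fault_energy_ratio(freqs, spec, bands):
    """Percentage of total energy that falls inside the given (lo, hi) bands."""
    mask = np.zeros(freqs.shape, dtype=bool)
    for lo, hi in bands:
        mask |= (freqs >= lo) & (freqs <= hi)
    return 100.0 * spec[mask].sum() / spec.sum()

def noise_suppression_rate(freqs, spec_in, spec_out, cutoff_hz=300.0):
    """Percentage reduction of energy above `cutoff_hz` relative to the input."""
    hf = freqs > cutoff_hz
    return 100.0 * (1.0 - spec_out[hf].sum() / spec_in[hf].sum())

def feature_sharpness(freqs, spec, band):
    """Standard deviation of the energy distribution inside the fault band."""
    lo, hi = band
    return float(np.std(spec[(freqs >= lo) & (freqs <= hi)]))

# Synthetic check: a 67.9 Hz tone plus a 1000 Hz interference tone
fs = 4000.0
t = np.arange(4000) / fs
x_in = np.sin(2 * np.pi * 67.9 * t) + np.sin(2 * np.pi * 1000.0 * t)
x_out = np.sin(2 * np.pi * 67.9 * t) + 0.5 * np.sin(2 * np.pi * 1000.0 * t)
freqs, spec_in = power_spectrum(x_in, fs)
_, spec_out = power_spectrum(x_out, fs)
ratio = fault_energy_ratio(freqs, spec_in, bands=[(60.0, 76.0)])
nsr = noise_suppression_rate(freqs, spec_in, spec_out)
sharp = feature_sharpness(freqs, spec_in, band=(60.0, 76.0))
```

In this synthetic case, halving the amplitude of the interference tone removes about three quarters of the energy above 300 Hz, so the suppression rate is close to 75%.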
The time-frequency domain distributions in Fig. 22 and quantitative results in Table 6 demonstrate that each module exerts a cascaded transformation effect in feature extraction. The adaptive wavelet convolutional layer achieves preliminary noise reduction with a 55.9% noise suppression rate, laying the foundation for subsequent processing. However, its fault-energy ratio drops from 6.7% to 3.1%, indicating this layer primarily plays a foundational denoising role. The square layer acts as a key turning point for feature enhancement, raising the fault-energy ratio to 11.0% and noise suppression rate to 77.6% via nonlinear transformation. Feature sharpness is significantly improved, clearly highlighting the fault frequency components. The pooling layer preserves feature integrity during compression: the fault-energy ratio decreases by only 0.1%, the noise suppression rate is further improved to 81.8%, and feature sharpness remains stable, demonstrating dual functions of optimization and fidelity preservation.
Table 6
Quantitative indicators of features learned in each layer of ADPCW-ELCNN.
| No. | Layer | Proportion of fault energy (%) | Noise suppression rate (%) | Feature sharpness (standard deviation) |
| --- | --- | --- | --- | --- |
| 1 | Input layer | 6.7 | 0 | 26.1 |
| 2 | Wavelet convolutional layer | 3.1 | 55.9 | 20.6 |
| 3 | Square layer | 11.0 | 77.6 | 260.1 |
| 4 | Pooling layer | 10.9 | 81.8 | 259.5 |
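The cascade analyzed in this subsection (wavelet convolution, squaring, pooling) can be sketched for a single kernel as follows. Average pooling, the pooling width, the "same" padding mode, and the function names are assumptions made for illustration; the pooling type and hyperparameters of ADPCW-ELCNN are not specified in this section.

```python
import numpy as np

def mexican_hat(scale, length):
    """Mexican Hat kernel sampled on a centered time axis (illustrative)."""
    t = np.arange(length) - (length - 1) / 2.0
    u = t / scale
    return (1.0 - u**2) * np.exp(-0.5 * u**2) / np.sqrt(scale)

def wavelet_square_pool(x, scale, length, pool=4):
    """Return the output of each stage: convolution, squaring, average pooling."""
    x = np.asarray(x, dtype=float)
    # The kernel is symmetric, so convolution and correlation coincide here
    conv = np.convolve(x, mexican_hat(scale, length), mode="same")
    squared = conv ** 2                      # non-negative, energy-like response
    n = (squared.size // pool) * pool        # drop a ragged tail, if any
    pooled = squared[:n].reshape(-1, pool).mean(axis=1)
    return conv, squared, pooled

rng = np.random.default_rng(0)
x = rng.normal(size=1024)
x[::128] += 8.0                              # periodic impulses buried in noise
conv, squared, pooled = wavelet_square_pool(x, scale=4.0, length=33)
```

Squaring makes every response non-negative and emphasizes large filter outputs over small ones, which is consistent with the role of the square layer as the main enhancement step in Table 6.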
4.2 Parameter collaboration mechanism
The frequency resolution is determined by the wavelet scale A. The time-domain coverage is influenced by the convolution kernel length B. The time-frequency feature capture accuracy is jointly determined by these two parameters. This section analyzes the collaborative mechanism between these two parameters and reveals the advantages of dual-parameter optimization.
1) Analysis of parameter training characteristics
Fig. 23 shows the converged scale and length of each convolution kernel after training, and these parameters exhibit clear regularities. The parameters of different convolution kernels converge to different intervals; kernel_4 and kernel_8, for example, settle in distinct ranges. Further analysis indicates that the wavelet scale and the kernel length exhibit a significant positive correlation, with a Pearson correlation coefficient of 0.79, and that the two parameters differ by a factor of approximately 2 to 4. This indicates that the adjustment of frequency resolution needs to be matched with the corresponding time-domain coverage, avoiding feature capture deviations caused by parameter mismatches.
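The correlation analysis can be reproduced with a few lines of code once the learned parameters are exported. The sketch below implements the Pearson coefficient directly; the arrays `scales` and `lengths` contain made-up values for eight kernels and do not correspond to the trained model.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()
    db = b - b.mean()
    return float((da * db).sum() / np.sqrt((da**2).sum() * (db**2).sum()))

# Hypothetical learned parameters for eight kernels (not values from the paper)
scales = np.array([3.1, 4.8, 6.0, 7.5, 9.2, 10.4, 12.9, 15.0])
lengths = np.array([11.0, 13.0, 21.0, 19.0, 30.0, 27.0, 41.0, 38.0])
r = pearson_r(scales, lengths)
```

A value close to 1 means that kernels with a larger scale also tend to learn a longer support, which is the kind of coupling that the dual-parameter mechanism is designed to exploit.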
2) Analysis of differences in feature extraction
The necessity of the synergy between wavelet scale and kernel length is verified by comparing the time-frequency feature extraction of the single-parameter fixed strategies with that of the dual-parameter synergistic strategy, as shown in Fig. 24. The single-parameter fixed strategies have obvious defects: fixing one parameter leads to the loss of high-frequency features, while fixing the other results in blurred features. In contrast, the dual-parameter synergistic strategy can comprehensively capture multi-band features, including low-frequency fundamental frequencies such as 42.9 Hz and 67.9 Hz as well as high-frequency harmonics such as 160.8 Hz and 228.7 Hz. The feature boundaries are clear, and there is no obvious noise interference.
3) Analysis of performance comparison rules
Table 7 further quantifies the diagnostic performance under different parameter optimization strategies. The dual-parameter synergistic strategy reaches an average accuracy of 96.98%, significantly higher than the 82.53% of the fixed-scale strategy and the 87.26% of the fixed-length strategy. Meanwhile, its fault-energy ratio rises to 10.9%. The main reason is that the single-parameter fixed strategy disrupts the dynamic correlation between scale and length, making it difficult to adapt to wide-band fault features. In contrast, the dual-parameter synergistic strategy achieves accurate coverage of full-band features through the adaptive linkage between scale and length. In summary, the dual-parameter synergistic mechanism of wavelet scale and convolution kernel length effectively addresses the problem of incomplete feature extraction in the single-parameter strategy, providing important support for the efficient capture of multi-band fault features.
Table 7
Comparison of dual parameter and single parameter optimization performance of convolutional kernel.
| No. | Optimization method | Average accuracy (%) | Correlation coefficient between scale and length | Proportion of fault energy (%) |
| --- | --- | --- | --- | --- |
| 1 | Dual-parameter collaboration | 96.98 | 0.79 | 10.9 |
| 2 | Fixed scale | 82.53 | - | 8.5 |
| 3 | Fixed length | 87.26 | - | 7.1 |
4.3 Architecture fidelity mechanism
Under few-shot conditions, due to the inherently limited data features, increasing the number of convolutional layers can lead to attenuation or distortion of key features during transmission, potentially introducing pseudo-features unrelated to actual faults. To explore the impact of network depth on feature fidelity, two deeper variant models were constructed, namely a single-layer plus one convolutional layer and a single-layer plus two convolutional layers, based on the ADPCW-ELCNN architecture. By analyzing the feature transmission process focusing on the critical fault frequency of 42.9 Hz on the Ottawa few-shot subset, a quantitative evaluation was conducted from three key dimensions: the feature transmission loss rate, reflecting the retention level of key features; the feature distortion degree, measuring the deviation of features from the actual fault morphology; and the non-fault feature occurrence rate, indicating the extent of pseudo-features introduced by the model.
From the time-frequency feature distribution in Fig. 25, it was observed that the single-layer convolutional structure maintained the key frequency component of 42.9 Hz relatively well. In contrast, the single-layer plus one convolutional layer structure exhibited a noticeable loss of this feature, accompanied by the emergence of a pseudo-feature at 52.6 Hz. The single-layer plus two convolutional layer structure further displayed feature blurring and enhanced energy of pseudo-features. Quantitative indicator analysis further confirmed this trend.
From the quantitative indicators in Table 8, as network depth increased, the feature transmission loss rate rose sharply from 15.85 percent in the single-layer structure to 68.29 percent in the single-layer plus two convolutional layers structure. Simultaneously, the feature distortion degree decreased from 0.74 to 0.36, indicating an exacerbation of feature distortion. More notably, the non-fault feature occurrence rate increased progressively from zero in the single-layer structure to 7.5 percent, suggesting that deeper structures introduced more pseudo-features unrelated to faults. These changes were directly reflected in the diagnostic performance, with the average accuracy declining sharply from 96.98 percent in the single-layer structure to 33.33 percent in the single-layer plus two convolutional layers structure.
Experimental results indicate that under few-shot conditions, increasing network depth leads to issues such as loss of key features, aggravated feature distortion, and increased introduction of pseudo-features. The single-layer convolutional design adopted in ADPCW-ELCNN effectively avoids defect accumulation in multi-layer iterations through a streamlined architecture, achieving a good balance between feature extraction and overfitting control. This provides important insight for few-shot fault diagnosis tasks, suggesting that under limited data conditions, priority should be given to ensuring the fidelity of feature transmission rather than indiscriminately increasing network depth.
Table 8
Comparison of feature transfer indicators for different architectures.
| No. | Architecture type | Average accuracy (%) | Loss rate (%) | Feature distortion degree | Non-fault feature occurrence rate (%) |
| --- | --- | --- | --- | --- | --- |
| 1 | Single layer | 96.98 | 15.85 | 0.74 | 0 |
| 2 | Single layer + 1 layer | 72.51 | 56.61 | 0.43 | 2.6 |
| 3 | Single layer + 2 layers | 33.33 | 68.29 | 0.36 | 7.5 |
5. Conclusion
To address the challenges of insufficient feature extraction and limited model generalization in few-shot fault diagnosis for rotating machinery, the ADPCW-ELCNN few-shot fault diagnosis model is proposed. A quantitative correlation mechanism between wavelet scales and convolutional kernel lengths is established, by which adaptive parameter synergy optimization is enabled and the inherent challenge of fragmented parameter adaptation in traditional models is overcome. Through the designed cascaded feature enhancement path, the recognition of weak fault features is effectively improved, while feature fidelity is ensured and efficient model compression is achieved via the single-layer lightweight architecture.
The value of the model lies in a diagnostic paradigm that deeply integrates physical priors with data-driven approaches. This paradigm not only addresses the core bottleneck of few-shot fault diagnosis in high-noise environments but also provides a technical solution that balances interpretability and deployability for intelligent diagnostics. Its technical advantages offer important insights for reliable diagnosis in edge computing scenarios.
Building on these research findings, future efforts can be directed toward two main areas: enhancing model perception through multi-source information fusion technology, and improving engineering generality via cross-device transfer learning mechanisms. These directions aim to advance the application of intelligent diagnostic technologies across broader scenarios, thereby providing more robust technical support for equipment health management.
Acknowledgements and Funding Information
This work was supported by the National Natural Science Foundation of China (Grant No.: 52375412).