ECG-Based Automated Detection of Sleep Apnea Using Deep Neural Networks and Hidden Markov Models
Sara Qnatyan a,b, Hojat Ghimatgar a,b, Ahmad Keshavarz a,b
a Computational Neuroscience Laboratory, ICT Research Institute, Faculty of Intelligent Systems Engineering and Data Science, Persian Gulf University, 7516913817 Bushehr, Iran
b Department of Electrical Engineering, Faculty of Intelligent Systems Engineering and Data Science, Persian Gulf University, Bushehr 75169-138, Iran
* Corresponding author:
Hojat Ghimatgar
Department of Electrical Engineering,
Faculty of Intelligent Systems Engineering and Data Science, Persian Gulf University,
Bushehr, Iran.
Tel: (+98) 77 31222150
Fax: (+98) 77 33440376
Email: ghimatgar@pgu.ac.ir
Abstract
Sleep disorders constitute a substantial global health burden, with more than eighty distinct conditions currently recognized. Among these, obstructive sleep apnea (OSA) represents the most prevalent sleep-related respiratory disorder, characterized by recurrent episodes of complete or partial upper airway obstruction during sleep. Although often asymptomatic, OSA exerts profound detrimental effects on cardiovascular, neurological, and pulmonary systems, thereby necessitating timely and accurate diagnosis. Conventional diagnostic approaches rely on polysomnography (PSG), which, despite its high diagnostic accuracy, remains costly, time-intensive, and dependent on specialized equipment and expert clinical supervision. These limitations in accessibility and operational complexity have prompted the development of more practical alternatives. Electrocardiogram (ECG)-based approaches have consequently attracted considerable attention owing to their capacity for continuous cardiac monitoring and sensitivity to subtle physiological alterations associated with sleep-disordered breathing. However, the inherent nonstationarity of ECG signals and substantial inter-subject variability continue to constrain model generalizability, thereby underscoring the critical need for robust and computationally efficient deep learning architectures. In this study, we propose a deep learning framework that integrates multiple surface-level ECG-derived features—including R-R intervals (RRI), ECG-derived respiration (EDR), and respiratory amplitude (RAMP)—extracted from the PhysioNet Apnea-ECG database. A record-wise data partitioning strategy was implemented to rigorously prevent data leakage across training, validation, and test sets. A hidden Markov model (HMM) was further incorporated as a post-processing module to refine apnea episode detection. 
Across five repeated hold-out validation experiments with varying training and validation partitions, the CNN-Transformer-LSTM architecture achieved an accuracy of 89.16 ± 0.94%, sensitivity of 81.42 ± 3.27%, and specificity of 94.00 ± 1.16%. Five-fold cross-validation yielded enhanced performance with accuracy of 90.62 ± 1.54%, sensitivity of 84.15 ± 3.80%, and specificity of 94.44 ± 2.22%. Integration of the HMM module further improved classification performance by approximately 5.00–6.00%, demonstrating the efficacy of the proposed framework for reliable and efficient OSA screening in both clinical and home-based monitoring applications.
Keywords:
Obstructive Sleep Apnea
ECG-Derived Respiration
Hybrid Deep Neural Networks
CNN-Transformer-LSTM
Hidden Markov Model
Physiological Signal Processing
1. Introduction
Sleep is essential for maintaining physiological balance, and respiratory disorders such as apnea and hypopnea can severely disrupt it. Obstructive sleep apnea (OSA), the most common sleep-related breathing disorder, is characterized by repeated obstruction of the upper airway and reduction or cessation of airflow during sleep. If left undiagnosed and untreated, it increases the risk of cardiovascular diseases, stroke, cognitive impairments, and chronic fatigue [1], [2].
The apnea index, defined as the number of apnea events per hour of sleep, is typically obtained through polysomnography (PSG). In this method, various physiological signals are recorded overnight at a sleep center and subsequently interpreted by a specialist or operator. Despite its high accuracy, this process is time-consuming, depends on individual judgment, and may be prone to subjective errors. Additionally, the unfamiliar environment may create a "first-night effect" and alter the patient's sleep quality [3]. With the advancement of wearable sensors and IoT technologies, single-lead ECG has emerged as a promising option for OSA detection due to its non-invasive nature, simplicity, and portability [4].
However, inter-individual differences in physiological signals, the impact of concurrent cardiac diseases, and the trade-off between model complexity and computational efficiency remain significant challenges in the design and implementation of ECG-based diagnostic systems [5].
In the initial stages of research on sleep apnea detection, traditional machine learning methods were predominantly employed based on manually extracted features from ECG signals. Algorithms such as Support Vector Machine (SVM), Decision Tree, and Artificial Neural Networks (ANN) demonstrated acceptable performance in classifying Obstructive Sleep Apnea (OSA) events. For instance, Cartwright et al. [6], Varon et al. [7], and Babaeizadeh et al. [8] reported accuracies ranging from 84.00% to 85.00%. Furthermore, Pinho et al. [9], combining HRV and EDR features using ANN and SVM, achieved an accuracy of 82.12%, while Sharma et al. [10], utilizing the BAWFB wavelet filter and SVM algorithm, attained an accuracy of 90.11%. Hassan et al. [11] also achieved an accuracy of 88.88% and specificity of 91.49% using TQWT wavelet transform and the RUSBoost algorithm. Nevertheless, the reliance on manually extracted features and high sensitivity to noise resulted in reduced generalizability and robustness of these methods on real-world data.
To address these limitations, several studies shifted toward unsupervised learning methods and hybrid models to achieve more automated and efficient feature extraction. Feng et al. [12] introduced the FSSAE model combined with HMM and the MetaCost algorithm, achieving an accuracy of 85.10%. Viswabhargav et al. [13], using a combination of EDR and SRE features with fuzzy clustering and SVM, reported an accuracy of approximately 78.00%. Li et al. [14], employing sparse autoencoder and HMM model, achieved 85.00% accuracy at the segment level and 100.00% at the record level. Despite improvements in automatic feature extraction, these models still face limitations in generalizability and clinical application due to computational complexity and high sensitivity to parameter tuning.
In recent years, the remarkable advancement of Deep Learning has brought about a fundamental transformation in obstructive sleep apnea detection. Networks such as CNN, RNN, and hybrid CNN-RNN architectures, with their capability to automatically extract spatial and temporal features from raw ECG signals, have achieved considerable accuracies. Cheng et al. [15] attained 97.80% accuracy using RNN, and Urtasun et al. [16] reported 90.80% accuracy employing CNN on 10-second epochs. Chen et al. [17] developed an end-to-end spatio-temporal model based on CNN and BiGRU combination, achieving 91.22% accuracy at the segment level and 97.10% at the record level. Yang et al. [18] using SE networks and Cheng et al. [19] through multi-band decomposition of ECG signals improved model performance. Zhou et al. [20] also converted ECG signals into spectral images and employed modified AlexNet and ResNet architectures to identify visual patterns associated with apnea events.
Despite these advances, data imbalance and difficulty in generalization under clinical conditions remain among the main challenges. Liu et al. [21], despite achieving an overall accuracy of 88.20%, reported a recall of only 78.50%, indicating the difficulty in reliably detecting apnea events in real-world conditions. Qin et al. [22] using a hybrid CNN-BiGRU model achieved 91.10% accuracy, and Tyagi et al. [23] with the FT-EDBN network attained 89.11% accuracy at the segment level. Subsequently, attention-based networks such as RAFNet [24] and BAFNet [25], by considering temporal dependencies between adjacent segments, demonstrated superior performance in contextual learning. RAFNet achieved 91.40% accuracy and BAFNet obtained 91.30% accuracy with perfect accuracy at the record level.
Overall, deep learning-based approaches exhibit superior capability compared to traditional and hybrid methods in modeling temporal dependencies and learning complex patterns from physiological signals. However, challenges such as data imbalance, model interpretability, and clinical validation remain active and open research avenues in this field.
The various factors causing sleep apnea and individual differences in physiological signals, coupled with the complexity of existing methods that require powerful computational resources to achieve high accuracy, motivated us to develop a practical and optimized approach. Our objective is to find a solution that, using relatively modest resources, can deliver acceptable and reliable results in sleep apnea detection for implementation on an individual basis.
The structure of this article is as follows: Section 2 presents the proposed method, including data preprocessing, feature extraction, model architecture, post-processing strategy using Hidden Markov Model, and training; Section 3 is dedicated to results; Section 4 includes discussion, limitations, and suggestions for future work; and finally, Section 5 provides conclusions and a summary of key achievements.
The overall workflow of the proposed study is depicted as a flowchart in Fig. 1. It covers the key stages in sequence: ECG signal preprocessing, feature extraction, design and training of the deep learning models, and evaluation of model performance.
Fig. 1
Workflow overview of the OSA detection model
2. Materials and methods
2.1. Dataset Description
This study utilizes the Apnea-ECG database, a publicly available dataset comprising 70 nocturnal recordings of high-quality, single-lead electrocardiogram signals. These recordings were collected from two independent clinical studies and include patients diagnosed with sleep apnea as well as healthy control subjects. Each participant underwent monitoring of biosignals under supervised conditions for one or two consecutive nights.
The raw ECG signals, initially recorded at a sampling frequency of 200 Hz, were downsampled to 100 Hz during preprocessing to maintain consistency with the original experimental sampling rate. Following quality assessment, one or more suitable recordings were selected for each participant. The final dataset consists of 70 recordings: 27 recordings from 9 participants in the first study and 43 recordings from 23 participants in the second study.
Respiratory events in the Apnea-ECG database are annotated using a binary labeling system at one-minute intervals. During expert annotation and subsequent reviews, no distinction was made between apneas and hypopneas, with both marked as abnormal respiratory events (label: A), while intervals of normal breathing were recorded as normal (label: N). Additionally, the cumulative duration of abnormal breathing periods was calculated for each participant to support stratified analyses.
To evaluate model performance across different levels of respiratory disturbance severity, the dataset was divided into three classes based on the total duration of abnormal breathing in each recording:
Class C (control group): less than 5 minutes of abnormal breathing.
Class B (borderline group): 10 to 96 minutes of abnormal breathing.
Class A (apnea group): more than 100 minutes of abnormal breathing.
For training and evaluation, the dataset was split into two equal subsets. The training set comprises 35 labeled recordings from participants with identifiers a01 through c10, while the test set includes 35 independent recordings labeled x01 through x35. The assignment of test recordings to their respective groups is presented in Table 1. This stratified and structured partitioning provides a robust framework for developing and validating machine learning models, ensuring balanced representation of physiological diversity and apnea severity across both training and testing phases. Key characteristics of the dataset are summarized in Table 2 [26].
Table 1
Assignment of Test Set Recordings to Apnea, Borderline, and Control Groups
| Group | Identifiers |
|---|---|
| A | X01, X02, X05, X07, X08, X09, X13, X14, X15, X19, X20, X21, X23, X25, X26, X27, X28, X30, X31, X32 |
| B | X03, X10, X11, X12, X16 |
| C | X04, X06, X17, X18, X22, X24, X29, X33, X34, X35 |
Table 2
Descriptive Statistics of the Sleep Apnea-ECG Dataset
| Dataset Characteristics | Training Data | Testing Data | Overall Mean |
|---|---|---|---|
| Patients | 65.71% | 65.71% | 65.71% |
| Healthy controls | 34.29% | 34.29% | 34.29% |
| Male participants | 85.71% | 77.14% | 81.43% |
| Female participants | 14.29% | 22.86% | 18.57% |
| Mean participant age | 46.00 years | 44.00 years | 45.00 years |
| Mean body mass index | 28.00 | 28.20 | 28.10 |
| Mean recording duration | 489.29 min | 494.37 min | 491.83 min |
| Mean recording hours | 8.15 hrs | 8.24 hrs | 8.20 hrs |
Additional considerations regarding the dataset are as follows:
All apneas in this dataset are either OSA or mixed.
Pure central apneas, including Cheyne-Stokes respiration, are not present.
Hypopneas are treated equivalently to apneas; specifically, they are defined as a ≥ 50% reduction in airflow accompanied by a ≥ 4% decrease in oxygen saturation and subsequent compensatory breaths.
In summary, the majority of recordings correspond to OSA, although some may contain mixed apneas, and no recordings of pure central sleep apnea (CSA) are present.
2.2. Data Preprocessing
Initially, the ECG signals were segmented into one-minute intervals to enable more precise signal analysis and improve the performance of learning models. At this stage, 17,045 one-minute segments were extracted from the training data and 17,268 segments from the test data. Following the methodology proposed by Zhu Zhaokun et al. [27], the preprocessing procedure was meticulously implemented to eliminate interfering noise and facilitate the extraction of robust features.
Given the non-stationary nature of ECG signals, the use of simple linear filters for noise removal exhibits limited efficacy, as these filters cannot adapt to the gradual variations of the signal over time. One of the most significant noise components is baseline wander, which occurs in the frequency range of 0 to 1.5 Hz and is primarily caused by respiratory movements, electrode displacement, and changes in body position. The presence of this drift may adversely affect the accurate detection of critical points such as the R-peak, as well as the morphology of the T-wave.
To address this issue, the discrete wavelet transform (DWT) was employed in this study [28]. This transform enables time-frequency domain analysis of the signal and is highly effective in separating the slow and fast components of the signal. In the preprocessing stage, the signals were decomposed using the Daubechies wavelet of order 6 (db6) up to the sixth level, and the approximation coefficients corresponding to low frequencies were removed. Subsequently, through inverse wavelet transform reconstruction, a filtered version free from baseline wander was obtained, in which the QRS complex structure and other electrophysiological cardiac components were well preserved.
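As a rough check on which frequencies this step removes, the band occupied by the discarded approximation coefficients can be estimated. The calculation assumes the 100 Hz sampling rate given in Section 2.1 and an idealized dyadic filter bank; real db6 filters roll off gradually, so the band edge is only approximate. For decomposition level $J$ and sampling frequency $f_s$:

$$
\left[\,0,\ \frac{f_s}{2^{J+1}}\,\right] = \left[\,0,\ \frac{100}{2^{7}}\,\right] \approx [\,0,\ 0.78\,]\ \text{Hz}, \qquad J = 6 .
$$

Under these assumptions, zeroing the level-6 approximation coefficients before reconstruction behaves like a high-pass filter with a cutoff near 0.78 Hz.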
Following signal filtering, three primary features were extracted, namely:
R-R interval (RRI): an indicator of heart rate variability and a reflection of autonomic nervous system activity;
R-wave amplitude (RAMP): associated with ventricular contraction strength and the physiological state of cardiac activity;
ECG-derived respiration (EDR): representing respiratory pattern and depth, which is important for detecting apnea patterns.
For RRI extraction, the robust Pan-Tompkins algorithm [29] was utilized to identify R-peaks, and abnormal intervals with heart rates below 30 or above 180 beats per minute were excluded. The R-wave amplitude was determined by computing the signal maximum within a 25-sample window around each peak. Additionally, the EDR signal was reconstructed from the analysis of dominant QRS complex components and extraction of the principal component using PCA [30].
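The RRI screening and R-wave amplitude steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes R-peak sample indices are already available (for example from a Pan-Tompkins detector), that the signal is sampled at 100 Hz, and that the 25-sample window is centered on each peak. The function names `rr_intervals` and `r_amplitudes` are hypothetical.

```python
def rr_intervals(r_peaks, fs=100.0, min_bpm=30.0, max_bpm=180.0):
    """Return (time_s, rr_s) pairs for beats with a plausible heart rate.

    r_peaks: R-peak positions as ascending sample indices.
    fs: sampling frequency in Hz (assumed 100 Hz here).
    """
    min_rr = 60.0 / max_bpm  # 180 bpm corresponds to about 0.333 s
    max_rr = 60.0 / min_bpm  # 30 bpm corresponds to 2.0 s
    kept = []
    for prev, curr in zip(r_peaks, r_peaks[1:]):
        rr = (curr - prev) / fs
        if min_rr <= rr <= max_rr:
            # Timestamp each interval at the beat that closes it.
            kept.append((curr / fs, rr))
    return kept


def r_amplitudes(ecg, r_peaks, window=25):
    """Signal maximum in a window of `window` samples centered on each peak.

    Centering the window is an assumption; the text only gives its length.
    """
    half = window // 2
    amps = []
    for p in r_peaks:
        lo = max(0, p - half)
        hi = min(len(ecg), p + half + 1)
        amps.append(max(ecg[lo:hi]))
    return amps
```

With peaks at samples 0, 80, 100, 190, and 500, the 0.2 s interval (300 bpm) and the 3.1 s interval (about 19 bpm) are discarded, and the 0.8 s and 0.9 s intervals are kept.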
One of the challenges was the difference in sampling rates among features. RRI and RAMP are event-driven and naturally have much lower sampling rates compared to EDR. To integrate the data and create a coherent input for the model, all three signals were resampled to 4 Hz. To this end, RRI and RAMP were upsampled using quadratic spline interpolation, and EDR was aligned using rational downsampling. This process provided a suitable foundation for use in deep learning models without compromising the dynamic information of the signals.
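The resampling of an event-driven series onto a uniform 4 Hz grid can be sketched as below. Note the substitution: this sketch uses linear interpolation from the standard library to stay dependency-free, whereas the study reports quadratic spline interpolation. The function name `resample_linear` and the choice to hold edge values constant are assumptions made for illustration.

```python
from bisect import bisect_right


def resample_linear(times, values, fs_out=4.0, duration_s=60.0):
    """Resample an irregular series onto a uniform grid of fs_out Hz.

    times: strictly increasing sample times in seconds.
    values: series values at those times.
    Grid points outside the observed range take the nearest observed value.
    """
    n_out = int(round(fs_out * duration_s))  # 4 Hz x 60 s = 240 samples
    out = []
    for k in range(n_out):
        t = k / fs_out
        if t <= times[0]:
            out.append(values[0])
        elif t >= times[-1]:
            out.append(values[-1])
        else:
            j = bisect_right(times, t)  # times[j-1] <= t < times[j]
            t0, t1 = times[j - 1], times[j]
            w = (t - t0) / (t1 - t0)
            out.append(values[j - 1] + w * (values[j] - values[j - 1]))
    return out
```

Each one-minute segment yields 240 samples per feature regardless of how many beats it contains, which is what allows RRI, RAMP, and EDR to be aligned sample by sample.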
In the normalization phase, Z-Score-based standardization was initially examined. Subsequently, according to the reference article methodology, the mean of each signal was removed and feature-wise scaling was applied. Due to the inherent amplitude differences among features, only RAMP and EDR underwent amplification scaling to prevent saturation of activation functions in the network and establish balance in feature magnitudes. Table 3 provides an overview of the PhysioNet Apnea-ECG dataset following preprocessing. Figure 2 illustrates the preprocessing pipeline for ECG signals.
Table 3
Description of the PhysioNet Apnea-ECG Dataset After Preprocessing
| Dataset | Sleep Apnea | Normal | Total |
|---|---|---|---|
| Train set | 6129 | 9832 | 15961 |
| Test set | 6100 | 9838 | 15938 |
| Total | 12229 | 19670 | 31899 |
Fig. 2
Preprocessing Pipeline Framework
2.3. Utilized Models
Following the preprocessing stage and extraction of surface-level features, each physiological feature was represented as a 240×1 vector, describing the variations of that feature over 60 seconds of the ECG signal after resampling to 4 Hz (60 s × 4 Hz = 240 samples). The three extracted features were concatenated column-wise to form a 240×3 matrix, which, while preserving temporal synchronization, enabled the modeling of interactions between cardiac and respiratory dynamics. This representation provided a standardized input structure for all deep learning models.
Given the sequential nature of ECG signals, recurrent neural networks (RNNs) were initially employed to model temporal dynamics. Long Short-Term Memory (LSTM) architectures [31] were utilized to capture long-term temporal dependencies through their gating mechanisms and memory cells, demonstrating robust performance in detecting apnea-related events. However, the prolonged training time and high computational cost associated with LSTMs constrained their practical applicability. To enhance computational efficiency, alternative architectures including Gated Recurrent Units (GRU) [17], Bidirectional GRU (BiGRU) [32], and Bidirectional LSTM (BiLSTM) [33] were adopted. Several key optimized hyperparameters for these networks are summarized in Table 4, and their architectural schematics are illustrated in Figs. 3–6.
Fig. 3
The architecture of the LSTM
Fig. 4
The architecture of the GRU
Fig. 5
The architecture of the BiGRU
Fig. 6
The architecture of the BiLSTM
Concurrently, Convolutional Neural Networks (CNNs) were employed to extract local features and repetitive morphological patterns from the ECG signals, with the architectural schematic illustrated in Fig. 7 [34]. Building upon these feature extraction capabilities, four hybrid architectures (CNN–BiLSTM, CNN–GRU, CNN–LSTM, and CNN–BiGRU) were developed to further enhance temporal modeling performance. The key optimized hyperparameters of these hybrid models are summarized in Table 5.
Fig. 7
The architecture of the CNNs
To achieve enhanced computational efficiency and accelerated training, a hybrid CNN–Transformer–LSTM architecture was designed and implemented, inspired by the framework presented in [35]. This architecture integrates the CNN's capability for extracting local features, the Transformer's ability to model long-range dependencies through multi-head attention mechanisms—as illustrated schematically in Fig. 8—and the LSTM's capacity for capturing temporal dynamics within a unified framework. The convolutional component comprises three one-dimensional convolutional layers with 64, 128, and 128 filters, respectively, followed by max-pooling and dropout layers. To preserve temporal information, positional encoding with a dimensionality of 128 was applied. The Transformer encoder incorporates a two-head attention mechanism, a feedforward network, layer normalization, and residual connections. Subsequently, an LSTM layer with 128 units and a dropout layer was employed, followed by fully connected layers with sigmoid activation for binary classification. Model training was performed using the Adam optimizer with binary cross-entropy as the loss function, and performance evaluation was conducted using accuracy as the primary metric.
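The training objective named above, a sigmoid output scored with binary cross-entropy, can be written out in a few lines. This is a plain-Python sketch of the loss for illustration only; a deep learning framework would supply its own implementation, and the clipping constant `eps` is an assumption rather than a value taken from the study.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping a logit to [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between labels (0/1) and probabilities.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

For one apnea segment predicted at 0.9 and one normal segment predicted at 0.1, each term equals -ln(0.9), so the mean loss is about 0.105.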
Fig. 8
The architecture of the Transformer Encoder
Table 4
Hyperparameter Settings of the LSTM, GRU, BiLSTM, and BiGRU Networks Used in This Study.
| Parameter | LSTM | GRU | BiLSTM | BiGRU |
|---|---|---|---|---|
| Number of LSTM Layers | 3 | - | - | - |
| Number of GRU Layers | - | 3 | - | - |
| Number of BiLSTM Layers | - | - | 2 | - |
| Number of BiGRU Layers | - | - | - | 3 |
| Units per Layer | 384, 384, 384 | 64, 32, 16 | 64, 64 | 256, 128, 64 |
| Dropout Rate | 0.1, 0.2, 0.3 | - | 0.5 | 0.4, 0.3, 0.2 |
| Dense Layers | 128, 64, 32 | 32 | 32 | 128, 64 (with L2 regularization) |
| Batch Normalization | - | - | - | Used (after BiGRU layers) |
Table 5
Architectural Hyperparameters of CNN-Based Deep Learning Models.
| Parameter | CNN | CNN-LSTM | CNN-GRU | CNN-BiLSTM | CNN-BiGRU |
|---|---|---|---|---|---|
| Number of Conv 1D Layers | 3 | 1 | 2 | 2 | 2 |
| Filters per Conv 1D Layer | 32, 64, 128 | 32 | 64, 128 | 64, 128 | 64, 128 |
| Kernel Size | 3 | 3 | 3 | 3 | 3 |
| Number of LSTM Layers | - | 1 | - | - | - |
| Number of GRU Layers | - | - | 2 | - | - |
| Number of BiLSTM Layers | - | - | - | 2 | - |
| Number of BiGRU Layers | - | - | - | - | 2 |
| Units per Layer | - | 64 | 128, 64 | 128, 64 | 128, 64 |
| Max Pooling | 2 | 2 | 2 | 2 | 2 |
| Dropout Rate | 0.3, 0.5 | - | 0.5 | 0.5 | 0.5 |
| Dense Layers | 64 | 1 | 64 | 64 | 64 |
2.4. Post-Processing with Hidden Markov Model (HMM)
In this section, we introduce the Hidden Markov Model (HMM) as a post-processing technique to refine sleep apnea classification results. As illustrated in Fig. 9, each one-minute ECG segment is associated with a hidden state (Normal or Apnea), with observations represented as feature vectors. We employ an HMM where emission probabilities remain consistent across all records, while transition matrices are record-specific. Maximum Likelihood Estimation (MLE) is used to estimate model parameters, integrating short-term classifier outputs with long-term transition dynamics to improve detection accuracy [36].
An HMM is formally specified by the following elements:
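The hidden Markov state sequence over time $t = 1, \dots, T$:
$$ S = (s_1, s_2, \dots, s_T) \tag{1} $$
The corresponding sequence of observed variables:
$$ O = (o_1, o_2, \dots, o_T) \tag{2} $$
A discrete state space $\mathcal{Q} = \{q_1, \dots, q_K\}$, where each hidden state $s_t \in \mathcal{Q}$. In this study $K = 2$ (Normal and Apnea).
Each observation $o_t$ is modeled as a continuous-valued feature vector.
The model parameters include:
Initial state distribution:
$$ \pi_i = P(s_1 = q_i) \tag{3} $$
State transition probability matrix:
$$ a_{ij} = P(s_{t+1} = q_j \mid s_t = q_i) \tag{4} $$
Emission distribution (observation likelihood given the hidden state):
$$ b_j(o_t) = p(o_t \mid s_t = q_j) \tag{5} $$
Marginal distribution of hidden states:
$$ P(s_t = q_j) \tag{6} $$
Posterior probability of hidden states given observations:
$$ P(s_t = q_j \mid o_t) \tag{7} $$
These parameters satisfy the following normalization constraints:
$$ \sum_{i=1}^{K} \pi_i = 1, \qquad \sum_{j=1}^{K} a_{ij} = 1 \ \ \forall i, \qquad \sum_{j=1}^{K} P(s_t = q_j) = 1, \qquad \sum_{j=1}^{K} P(s_t = q_j \mid o_t) = 1 \tag{8} $$
Applying Bayes’ theorem, the conditional probability of an observation given a hidden state can be written as:
$$ p(o_t \mid s_t = q_j) = \frac{P(s_t = q_j \mid o_t)\, p(o_t)}{P(s_t = q_j)} \tag{9} $$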
Fig. 9
A Schematic Representation of a Hidden Markov Model
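One common way to turn these quantities into a refined label sequence is Viterbi decoding, sketched below. The text does not state which decoding algorithm was used, so this is an illustrative assumption rather than the authors' procedure. Following Bayes' theorem as described above, the emission likelihood is replaced by the classifier posterior divided by the state prior; the term p(o_t) is identical for every state and does not affect the most likely path. The function name `viterbi_smooth` and all parameter values in the usage note are hypothetical.

```python
import math


def viterbi_smooth(posteriors, priors, trans, init):
    """Most likely state path given per-segment classifier posteriors.

    posteriors: list of [P(s=0|o_t), P(s=1|o_t)], one per one-minute segment.
    priors:     [P(s=0), P(s=1)], the marginal state distribution.
    trans:      trans[i][j] = P(s_{t+1}=j | s_t=i).
    init:       [P(s_1=0), P(s_1=1)].
    """
    n_states = len(priors)
    tiny = 1e-12  # guards against log(0)

    def log_emission(t, j):
        # Scaled likelihood: P(s=j|o_t) / P(s=j), up to the constant p(o_t).
        return math.log(posteriors[t][j] + tiny) - math.log(priors[j] + tiny)

    score = [math.log(init[j] + tiny) + log_emission(0, j)
             for j in range(n_states)]
    back = []
    for t in range(1, len(posteriors)):
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor state for landing in state j at time t.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + math.log(trans[i][j] + tiny))
            pointers.append(best_i)
            new_score.append(score[best_i]
                             + math.log(trans[best_i][j] + tiny)
                             + log_emission(t, j))
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With sticky transitions such as `[[0.9, 0.1], [0.1, 0.9]]`, a single weakly supported apnea minute surrounded by confident normal minutes is relabeled as normal, which is the kind of temporal correction the post-processing stage is meant to provide.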
2.5. Training phase
The training procedure was designed to ensure rigorous, unbiased evaluation of the proposed framework while preserving the temporal and subject-specific characteristics of the ECG recordings. All models were trained using a subject-wise validation strategy to prevent data leakage between training and evaluation sets—a critical requirement in physiological signal modeling, where segments from the same subject are inherently correlated.
Two complementary validation schemes were employed. First, a five-fold cross-validation protocol was implemented at the subject level. In each fold, the available recordings were partitioned into mutually exclusive training and validation groups, ensuring that no subject appeared in more than one subset per fold. This approach enabled the model to be exposed to a wide range of physiological variability while maintaining strict independence between training and validation data. The order of subjects within each subset was randomized prior to training to avoid unintentional bias arising from fixed subject sequences, while randomization seeds were controlled to guarantee reproducibility. For each fold, all segments belonging to the selected subjects were concatenated sequentially to form the final training and validation matrices. This procedure preserved the native temporal ordering of the ECG segments, which is essential for learning physiologically meaningful temporal dependencies.
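A subject-level five-fold split of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions: the mapping from record names to subject identifiers (`record_to_subject`) is hypothetical and must be supplied by the user, since record names alone do not identify subjects, and the round-robin assignment of subjects to folds is one of several reasonable choices.

```python
import random


def subject_wise_folds(record_to_subject, n_folds=5, seed=0):
    """Yield (train_records, val_records) with no subject in both sets.

    record_to_subject: dict mapping a record name to a subject identifier.
    seed: fixed seed so the shuffled subject order is reproducible.
    """
    subjects = sorted(set(record_to_subject.values()))
    random.Random(seed).shuffle(subjects)  # controlled randomization
    # Deal subjects into folds round-robin so fold sizes differ by at most 1.
    folds = [subjects[k::n_folds] for k in range(n_folds)]
    for val_subjects in folds:
        val_set = set(val_subjects)
        train = [r for r, s in record_to_subject.items() if s not in val_set]
        val = [r for r, s in record_to_subject.items() if s in val_set]
        yield train, val
```

Because the split is made over subjects rather than segments or records, two recordings from the same person always land on the same side of the train/validation boundary.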
Within each fold, the model was trained iteratively, and its predictions on the validation subjects were used to derive subject-specific temporal statistics, including transition and emission probabilities required for the Hidden Markov Model (HMM) post-processing module. These statistics were aggregated across subjects to construct the final HMM parameters for each fold. All runs—including intermediate model checkpoints, prediction traces, and HMM parameter sets—were archived systematically to ensure complete reproducibility and facilitate downstream comparative analysis.
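The transition statistics mentioned above can be estimated by counting label changes between consecutive one-minute segments. The sketch below is an assumption about how such counting might be done, not the authors' code. With `smoothing=0.0` it is the plain maximum-likelihood estimate; the default pseudo-count of 1.0 is an added safeguard so that unseen transitions do not receive probability zero. The function name `estimate_transitions` is hypothetical.

```python
def estimate_transitions(label_sequences, n_states=2, smoothing=1.0):
    """Count-based estimate of a[i][j] = P(s_{t+1}=j | s_t=i).

    label_sequences: one list of per-minute labels per record, in temporal
    order (0 = normal, 1 = apnea). Transitions are never counted across
    record boundaries.
    """
    counts = [[smoothing] * n_states for _ in range(n_states)]
    for seq in label_sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    # Normalize each row so that it sums to one.
    return [[c / sum(row) for c in row] for row in counts]
```

Passing a single record's sequence gives a record-specific matrix, while passing all sequences from a fold pools the counts across records.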
In addition to cross-validation, a hold-out evaluation was conducted to provide an independent assessment of generalization performance. The dataset was partitioned into an 80% training portion and a 20% test portion at the subject level. To enhance robustness, the training set was further divided into five rotating internal validation subsets, yielding five independent training runs. In each run, the model was trained from scratch, validated on a distinct subset of training subjects, and subsequently applied to the fixed test set. The final performance of the hold-out evaluation was computed as the average across the five runs, ensuring that the reported results were not influenced by a specific sampling configuration.
Throughout all experiments, the test set remained completely unseen during model development, training, and parameter tuning. This strict separation guaranteed an unbiased and clinically meaningful estimate of real-world performance. The combination of subject-wise cross-validation, multi-run hold-out evaluation, controlled randomization, and systematic recording of temporal statistics provided a robust methodological foundation for reliable assessment of the proposed deep learning–HMM framework.
2.6. Performance metrics
The model's performance was comprehensively evaluated using a range of statistical metrics derived from the confusion matrix, which comprises four fundamental quantities: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). Specifically, TP represents correctly identified OSA cases, while FP corresponds to non-OSA instances erroneously classified as OSA. TN denotes correctly classified normal instances, and FN refers to OSA cases misclassified as normal. These metrics are defined as follows:
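$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10} $$
$$ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{11} $$
$$ \text{Specificity} = \frac{TN}{TN + FP} \tag{12} $$
$$ \text{Precision} = \frac{TP}{TP + FP} \tag{13} $$
$$ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{14} $$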
Through this comprehensive set of evaluation metrics, a thorough analysis of the model's strengths and limitations in classifying OSA from ECG data was conducted, providing detailed insights into its classification performance.
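For reference, the confusion-matrix metrics can be computed as in the sketch below. It assumes the standard definitions of accuracy, sensitivity, specificity, precision, and F1-score, with apnea as the positive class; the function name `classification_metrics` is hypothetical, and the function does not guard against empty classes (a zero denominator raises `ZeroDivisionError`).

```python
def classification_metrics(tp, tn, fp, fn):
    """Per-segment metrics from confusion-matrix counts (apnea = positive)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # fraction of apnea segments detected
    specificity = tn / (tn + fp)  # fraction of normal segments kept normal
    precision = tp / (tp + fp)    # fraction of apnea calls that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }
```

For example, with TP = 80, TN = 90, FP = 10, and FN = 20, the accuracy is 0.85, sensitivity 0.80, specificity 0.90, and F1-score 160/190, or about 0.842.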
3. Results
3.1. Per-segment performance
Preliminary analyses indicated that conventional Z-score normalization, despite its widespread adoption, is unsuitable for signals derived from surface-level time-series features. This method reduces the dynamic range of signals and attenuates subtle variations in the RRI, RAMP, and EDR components, leading to distortion of discriminative patterns, convergence difficulties, and reduced classification accuracy in the initial models.
Among the nine initial models reported in Tables 4 and 5, only the results obtained from RRI_RAMP_EDR data with a 240×3 structure, normalized according to the reference method, were utilized for subsequent analyses. This preprocessing approach ensured enhanced model stability and performance. At this stage, only the hold-out validation strategy was applied, and the mean performance of each model across five independent runs is presented in Table 6. Evaluation of models based on recurrent, convolutional, and hybrid architectures indicated that application of a postprocessing stage consistently led to substantial improvement in the performance of all networks. Comparison of mean values and standard deviations of the metrics demonstrates that this stage, in addition to increasing classification accuracy, plays a critical role in stabilizing model outputs and reducing variability across different runs. Overall, the optimal performance was observed in the BiLSTM model following postprocessing, achieving an accuracy of 89.04%, sensitivity of 87.69%, and F1-score of 86.00%. These results demonstrate the inherent capability of the bidirectional LSTM architecture to extract both long-term and short-term temporal patterns, coupled with improved decision-making through the postprocessing stage.
Among the unidirectional recurrent models, both LSTM and GRU exhibited increased F1-scores following activation of the postprocessing stage: from 77.45% to 84.98% and from 77.18% to 85.96%, respectively. Notably, the standard deviation of the F1-score decreased substantially in both models, indicating that postprocessing enhances decision stability and reduces sensitivity to run-to-run variations. However, GRU outperformed LSTM overall, which may be attributed to its simpler architecture and superior capacity for modeling short-term ECG dependencies.
Among CNN-RNN hybrid models, CNN-GRU and CNN-LSTM demonstrated competitive performance; however, within this group, CNN-GRU outperformed CNN-LSTM, achieving an F1-score of 85.13% and accuracy of 88.68%. This is likely attributable to the GRU's ability to prevent overfitting in deeper architectures, as the reduced parameter count of GRU, when combined with CNN, confers greater robustness against signal variations.
Conversely, bidirectional hybrid architectures such as CNN-BiLSTM and CNN-BiGRU, despite relative performance gains compared to models without postprocessing, exhibited higher metric variability, particularly in the standard deviations of sensitivity and specificity. This phenomenon suggests that bidirectional models, in settings with limited experimental runs, are more susceptible to fluctuations arising from architectural complexity or sensitivity to noisy ECG patterns. Nonetheless, CNN-BiGRU delivered satisfactory performance with 87.56% accuracy and 83.78% F1-score, and demonstrated greater stability than CNN-BiLSTM.
Examination of the poorest performance revealed that the CNN-only model, both without and with postprocessing, produced the lowest results. With an initial F1-score of 73.86% and only modest improvement to 81.47% following postprocessing, this model demonstrated that feature extraction based solely on CNN is insufficient for temporally dependent ECG signals. The high standard deviation of sensitivity confirms that architectures without memory mechanisms cannot reliably identify apnea patterns.
Regarding standard deviation analysis, deeper or multi-component models such as BiGRU and CNN-BiLSTM exhibited the highest variability in sensitivity and specificity, indicating their greater dependence on run-specific patterns and susceptibility to overfitting particular temporal structures. Conversely, GRU and CNN-GRU, following postprocessing, presented the lowest standard deviations, demonstrating high stability, robust generalization, and reduced sensitivity to variations in data distribution.
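The mean ± standard deviation entries discussed above can be reproduced from per-run scores along the following lines. This is a minimal sketch: the function names and the five F1 values are hypothetical, and the sample standard deviation is assumed (the text does not state whether the sample or the population form was used).

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of one metric over independent runs."""
    return mean(scores), stdev(scores)


def format_summary(scores, digits=2):
    """Format a metric as 'mean ± std', the layout used in the result tables."""
    m, s = summarize_runs(scores)
    return f"{m:.{digits}f} ± {s:.{digits}f}"


# Hypothetical F1-scores (%) from five runs; illustrative values, not taken from the paper.
f1_runs = [84.0, 86.0, 85.0, 87.0, 83.0]
summary = format_summary(f1_runs)
print(summary)  # 85.00 ± 1.58
```

A smaller second term after postprocessing corresponds to the lower run-to-run variability described in the text.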
Overall, the results indicate:
Best overall performance: BiLSTM (with postprocessing)
Most stable models: GRU and CNN-GRU
Weakest model (in terms of error rate and sensitivity): CNN (without postprocessing)
Highest run-to-run variability: CNN-BiLSTM and BiGRU
Largest performance improvement following postprocessing: GRU and CNN-GRU
These findings indicate that combining recurrent networks with short- and medium-term memory capacity (e.g., GRU) with a rule-based postprocessing stage provides the optimal balance between accuracy, sensitivity, and stability for ECG-based apnea detection. Implementation of the proposed model from [35] also confirmed accelerated training and inference while achieving performance comparable to the best initial model, reaching 89.18% accuracy. Considering computational efficiency and processing speed, this model was selected as the final architecture.
Table 7 reports the performance of the final model under two normalization approaches, with and without postprocessing. The results indicate that feature extraction quality and the normalization method are crucial for system stability and classification accuracy. The same three features (RRI, RAMP, and EDR) were used in all experiments; however, the input standardization method and the application of postprocessing changed model behavior substantially.
Using the reference method (without Z-score normalization), the model without postprocessing achieved an F1-score of 79.45% and accuracy of 85.20%. Specificity was relatively high at 91.60%, but sensitivity was lower at 74.89%, indicating superior performance in identifying normal samples while committing more errors in detecting apneic events. With postprocessing enabled, performance improved substantially, with F1-score reaching 85.19%, accuracy 89.18%, and sensitivity 81.42%. The concurrent reduction in standard deviations demonstrates that postprocessing not only increases overall accuracy but also stabilizes model performance across different runs, reflecting the beneficial impact of temporal smoothing and decision correction rules on network outputs.
Conversely, application of Z-score normalization resulted in notable performance degradation for models without postprocessing. Under this condition, F1-score decreased to 73.80% and accuracy to 77.60%; importantly, the standard deviations of sensitivity and specificity increased compared to the reference method (± 4.40 and ± 5.09 versus ± 3.43 and ± 1.48). This increase in variability suggests that Z-score normalization does not stably standardize the RRI, RAMP, and EDR features and may disrupt critical dynamic characteristics of ECG signals. Postprocessing improved performance modestly, with F1-score increasing to 78.72% and accuracy to 82.11%; however, overall performance remained inferior to the reference method. With postprocessing, the standard deviation of sensitivity decreased slightly (± 3.39) while that of specificity did not (± 5.55), and both remained elevated relative to the reference approach, indicating that the primary limitation lies in Z-score normalization and that postprocessing can only partially mitigate its adverse effects.
In summary:
1. The optimal overall performance corresponds to the reference method with postprocessing, offering the highest F1-score, accuracy, and stability with minimal variability. This demonstrates that RRI, RAMP, and EDR features in their appropriately scaled form provide maximum discriminative power.
2. Z-score normalization is unsuitable for these features, as it reduces accuracy, F1-score, and specificity and increases standard deviations. This suggests that standard normalization cannot be applied to these features without distorting their temporal or morphological relationships.
3. Postprocessing constitutes a critical component for enhancing performance, improving both model accuracy and stability across runs. Its effect is substantially more pronounced with the reference method than with Z-score normalization.
4. Higher specificity relative to sensitivity under the reference method indicates that detecting normal conditions is less challenging, while apneic events introduce greater complexity in ECG temporal patterns.
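One way Z-score normalization can discard information is illustrated by the following minimal sketch. It assumes that standardization is applied per segment, which is an assumption made here for illustration; the granularity used in the experiments and the reference method itself are not reproduced. The function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def zscore(segment):
    """Standardize one segment to zero mean and unit variance."""
    mu = mean(segment)
    sigma = pstdev(segment)
    if sigma == 0:
        # A constant segment carries no variation to rescale.
        return [0.0 for _ in segment]
    return [(x - mu) / sigma for x in segment]


# Two hypothetical amplitude segments: the second has the same shape as the
# first but half the amplitude and a shifted baseline.
seg_a = [1.0, 1.2, 0.8, 1.2, 0.8]
seg_b = [0.5 * x + 0.3 for x in seg_a]

za, zb = zscore(seg_a), zscore(seg_b)

# After per-segment standardization the two segments are indistinguishable,
# so any information carried by absolute level or amplitude is lost.
identical = all(abs(p - q) < 1e-9 for p, q in zip(za, zb))
print(identical)
```

Under this assumption, differences in amplitude or baseline between segments are no longer visible to the classifier, which is consistent with the degradation reported for amplitude-dependent features.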
Table 8 demonstrates that the type of input features, the normalization method, and the presence or absence of postprocessing all exert direct and significant effects on the performance of ECG-based apnea detection systems. Observed patterns indicate that combining diverse features, avoiding Z-score normalization, and applying postprocessing constitutes the most effective approach for improving model accuracy and stability.
In the optimal scenario—utilizing all three features (RRI, RAMP, and EDR) according to the reference method with active postprocessing—the model achieved an F1-score of 87.25%, specificity of 94.44%, and accuracy of 90.62%. These values surpass all other configurations in both mean performance and standard deviation, indicating that combining complementary features provides maximum discriminative power for apnea patterns. Additionally, postprocessing reduces decision noise, improving model stability across experimental runs. The substantial increase in sensitivity (from 75.36% to 84.15%) further highlights the critical role of postprocessing in reducing false negatives and improving detection of apneic events.
In contrast, when individual features were utilized independently without postprocessing, average performance declined significantly and standard deviations increased. RRI alone, inherently dependent on heart rate, performed considerably worse than the three-feature combination, achieving an F1-score of 75.66% with high variability (± 6.24), indicating poor stability across subjects. RAMP exhibited similar behavior; despite containing respiratory-related information, its standalone performance reached only 73.78% F1-score with relatively low sensitivity (71.77%) and high standard deviation (± 10.07), demonstrating substantial variability in detecting apneic events. EDR alone produced moderate accuracy; however, decreased sensitivity (69.85%) and elevated standard deviations confirmed that absence of complementary features reduces stability and increases model fluctuations.
Application of postprocessing improved performance for all individual features—for instance, EDR sensitivity increased from 69.85% to 81.84%—yet average performance remained inferior to the combined features scenario, and elevated standard deviations indicated the inherent limitation of relying on a single feature to address inter-subject variability.
Z-score normalization consistently induced performance degradation. In the RRI–RAMP–EDR combination without postprocessing, F1-score decreased to 75.39% and sensitivity to 71.85%, indicating that standard normalization not only failed to improve performance but disrupted the statistical structure of the features—particularly for nonlinear, derived features such as EDR and RAMP, which are highly dependent on signal dynamics. Increased standard deviations across all Z-score metrics further demonstrate that this normalization approach is unsuitable for derived ECG features.
Even for individual features such as RRI or EDR, Z-score normalization decreased mean accuracy and increased inter-run variability. Sensitivity values, critical in clinical applications, were lower than those observed with the reference method. Although postprocessing improved performance relative to the no-postprocessing condition, results remained significantly inferior to the reference approach, confirming that the primary limitation lies in the normalization method rather than the decision-making stage.
Overall, increased standard deviations directly reflect instability across subjects and model sensitivity to inherent signal variations. The highest standard deviations were observed when Z-score normalization and single-feature inputs were employed, emphasizing the importance of appropriate preprocessing and feature integration.
In conclusion, the results demonstrate:
1. The optimal performance corresponds to the RRI + RAMP + EDR combination with the reference method and active postprocessing.
2. Z-score normalization is unsuitable for surface-level and derived features, resulting in decreased accuracy and stability.
3. Individual features alone cannot provide sufficient stability, as evidenced by their elevated standard deviations and inter-subject variability.
4. Postprocessing is effective across all scenarios and exerts the greatest impact on sensitivity, which is critical for clinical apnea detection.
5. Combining multiple complementary features, particularly RRI, RAMP, and EDR, enhances the discriminative capacity for apnea patterns and minimizes model fluctuations.
Table 6
Mean Classification Metrics Over Five Independent Hold-Out Executions for the Initial Nine Models
| Model | Postprocessing stage | Accuracy (%) | Specificity (%) | Sensitivity (%) | F1-Score (%) |
|---|---|---|---|---|---|
| LSTM | off | 83.29 ± 0.56 | 88.23 ± 3.95 | 75.28 ± 5.54 | 77.45 ± 1.11 |
| LSTM | on | 88.82 ± 0.70 | 92.52 ± 3.52 | 82.85 ± 4.89 | 84.98 ± 0.95 |
| GRU | off | 82.97 ± 0.74 | 87.72 ± 2.06 | 75.32 ± 2.97 | 77.18 ± 1.13 |
| GRU | on | 89.48 ± 0.70 | 92.60 ± 1.73 | 84.44 ± 3.48 | 85.96 ± 1.61 |
| BiLSTM | off | 82.38 ± 0.70 | 84.62 ± 2.40 | 78.77 ± 3.12 | 77.36 ± 0.89 |
| BiLSTM | on | 89.04 ± 0.71 | 89.87 ± 1.38 | 87.69 ± 2.44 | 86.00 ± 1.08 |
| BiGRU | off | 82.96 ± 0.84 | 92.21 ± 1.68 | 68.05 ± 4.51 | 75.26 ± 2.10 |
| BiGRU | on | 87.43 ± 0.97 | 92.75 ± 4.05 | 78.84 ± 8.26 | 82.58 ± 2.51 |
| CNN | off | 80.91 ± 0.87 | 87.14 ± 2.31 | 70.87 ± 5.54 | 73.86 ± 2.28 |
| CNN | on | 86.08 ± 1.89 | 89.64 ± 4.89 | 80.35 ± 7.21 | 81.47 ± 2.51 |
| CNN-LSTM | off | 82.52 ± 0.46 | 83.85 ± 3.05 | 80.37 ± 4.23 | 77.84 ± 0.82 |
| CNN-LSTM | on | 88.09 ± 0.77 | 88.29 ± 2.82 | 87.79 ± 3.07 | 84.95 ± 0.68 |
| CNN-GRU | off | 82.95 ± 0.87 | 84.04 ± 2.00 | 81.19 ± 1.65 | 78.47 ± 0.81 |
| CNN-GRU | on | 88.68 ± 0.46 | 90.72 ± 2.55 | 85.38 ± 3.83 | 85.13 ± 0.77 |
| CNN-BiLSTM | off | 82.20 ± 1.04 | 85.75 ± 3.74 | 76.48 ± 6.47 | 76.58 ± 2.13 |
| CNN-BiLSTM | on | 86.68 ± 2.25 | 87.02 ± 5.96 | 86.12 ± 7.12 | 83.16 ± 2.53 |
| CNN-BiGRU | off | 82.40 ± 1.23 | 86.22 ± 4.44 | 76.23 ± 6.13 | 76.80 ± 1.84 |
| CNN-BiGRU | on | 87.56 ± 1.94 | 89.57 ± 4.51 | 84.32 ± 6.83 | 83.78 ± 2.68 |
Table 7
Hold-Out Classification Results of the Final Model Using RRI_AMP_EDR Inputs Under Two Normalization Schemes
| Input Features | Normalization | Postprocessing stage | Accuracy (%) | Sensitivity (%) | Specificity (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| RRI_Ramp_EDR | Reference Method | off | 85.20 ± 0.95 | 74.89 ± 3.43 | 91.60 ± 1.48 | 79.45 ± 1.68 |
| RRI_Ramp_EDR | Reference Method | on | 89.18 ± 0.94 | 81.42 ± 3.27 | 94.00 ± 1.16 | 85.19 ± 1.57 |
| RRI_Ramp_EDR | Z-score | off | 77.60 ± 1.82 | 82.38 ± 4.40 | 74.63 ± 5.09 | 73.80 ± 1.27 |
| RRI_Ramp_EDR | Z-score | on | 82.11 ± 2.73 | 85.83 ± 3.39 | 79.86 ± 5.55 | 78.72 ± 2.41 |
Table 8
Cross-Validation Results (5-Fold) for the Final Selected Model
| Input Features | Normalization | Postprocessing stage | Accuracy (%) | Sensitivity (%) | Specificity (%) | F1-Score (%) |
|---|---|---|---|---|---|---|
| RRI_Ramp_EDR | Reference Method | off | 86.35 ± 1.24 | 75.36 ± 4.71 | 92.82 ± 1.92 | 80.65 ± 2.53 |
| RRI_Ramp_EDR | Reference Method | on | 90.62 ± 1.54 | 84.15 ± 3.80 | 94.44 ± 2.22 | 87.25 ± 1.74 |
| RRI | Reference Method | off | 81.94 ± 4.24 | 74.40 ± 6.55 | 86.27 ± 4.29 | 75.66 ± 6.24 |
| RRI | Reference Method | on | 87.55 ± 4.73 | 80.84 ± 6.62 | 91.37 ± 4.07 | 83.08 ± 6.53 |
| Ramp | Reference Method | off | 80.99 ± 2.31 | 71.77 ± 10.07 | 86.39 ± 4.13 | 73.78 ± 5.35 |
| Ramp | Reference Method | on | 85.49 ± 3.78 | 77.31 ± 15.59 | 90.31 ± 4.52 | 79.16 ± 9.06 |
| EDR | Reference Method | off | 83.18 ± 2.09 | 69.85 ± 4.76 | 91.24 ± 1.01 | 75.86 ± 3.53 |
| EDR | Reference Method | on | 88.37 ± 2.15 | 81.84 ± 7.98 | 92.23 ± 1.39 | 83.94 ± 4.36 |
| RRI_Ramp_EDR | Z-score | off | 82.42 ± 1.92 | 71.85 ± 7.85 | 88.24 ± 3.72 | 75.39 ± 4.14 |
| RRI_Ramp_EDR | Z-score | on | 87.57 ± 2.49 | 83.00 ± 6.44 | 90.00 ± 3.87 | 83.57 ± 2.74 |
| RRI | Z-score | off | 78.58 ± 4.09 | 70.58 ± 6.50 | 83.36 ± 3.62 | 71.29 ± 6.70 |
| RRI | Z-score | on | 83.52 ± 4.31 | 73.11 ± 8.70 | 89.86 ± 5.26 | 76.92 ± 6.76 |
| EDR | Z-score | off | 78.61 ± 4.45 | 64.72 ± 12.07 | 87.08 ± 2.78 | 69.22 ± 7.38 |
| EDR | Z-score | on | 83.74 ± 3.61 | 72.73 ± 14.81 | 90.19 ± 4.28 | 76.41 ± 7.86 |
3.2. Feature visualization
The t-SNE algorithm was employed to visualize the original RRI, RAMP, and EDR signals, as well as the features extracted by the CNN–Transformer–LSTM model for both training and validation sets (Figs. 10–11). The results indicate that the original signals are widely scattered, making it challenging to distinguish between normal and sleep apnea cases. In contrast, the features extracted by the CNN–Transformer–LSTM model exhibit clear clustering, demonstrating a distinct separation between the two classes in the feature representation space.
Fig. 10
Visualization of raw input signal feature maps using the t-SNE algorithm
Fig. 11
Visualization of extracted feature maps using the t-SNE algorithm
3.3. Sample First-Run
For each run, the order of training records was randomized while maintaining reproducibility. For example, in Run 1 for the apnea group, 32 records were assigned to the training set and 8 to the validation set, where the first training record was a05 and the first validation record was a10.
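A seeded, record-level split of the kind described above can be sketched as follows. The function name, the seed value, and the record identifiers are placeholders chosen for illustration and do not reproduce the actual record names or allocation used in the experiments.

```python
import random


def split_records(records, n_train, seed):
    """Shuffle record IDs with a fixed seed and split them at the record level."""
    rng = random.Random(seed)   # local generator, so the split is reproducible
    shuffled = list(records)    # copy to leave the caller's list untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder IDs for a 40-record group; real database record names differ.
apnea_records = [f"rec{i:02d}" for i in range(1, 41)]
train, val = split_records(apnea_records, n_train=32, seed=1)

print(len(train), len(val))
```

Because whole records are assigned to one subset, no segments from the same recording appear in both training and validation data, and reusing the same seed regenerates the same split.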
The model was subsequently executed for each of the remaining records in the borderline and control groups, following the identical procedure applied to the apnea group, and the corresponding transition and emission matrices were computed. Each model run was stored in a dedicated directory to ensure reproducibility and facilitate subsequent analyses. This procedure was repeated across all five folds of the cross-validation process.
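The computation of transition and emission matrices mentioned above is not detailed in this section. One common approach, shown here as a hedged sketch, is to count label-to-label transitions in the ground-truth sequences and to count how often each true state co-occurs with each classifier output. The function name, the add-one smoothing, and the example sequences are assumptions made for illustration.

```python
def estimate_hmm_matrices(true_seqs, pred_seqs, n_states=2, smoothing=1.0):
    """Estimate HMM transition and emission matrices by counting.

    Hidden states are the true per-segment labels (0 = normal, 1 = apnea);
    observations are the classifier's predicted labels for the same segments.
    """
    trans = [[smoothing] * n_states for _ in range(n_states)]
    emit = [[smoothing] * n_states for _ in range(n_states)]
    for truth, pred in zip(true_seqs, pred_seqs):
        # Count consecutive true-label pairs for the transition matrix.
        for prev, curr in zip(truth, truth[1:]):
            trans[prev][curr] += 1
        # Count (true state, predicted label) pairs for the emission matrix.
        for state, obs in zip(truth, pred):
            emit[state][obs] += 1

    def normalize(rows):
        return [[count / sum(row) for count in row] for row in rows]

    return normalize(trans), normalize(emit)


# Hypothetical label sequences for one record (illustrative only).
truth = [0, 0, 0, 1, 1, 1, 0, 0]
pred = [0, 1, 0, 1, 1, 0, 0, 0]
trans, emit = estimate_hmm_matrices([truth], [pred])
print(trans)
print(emit)
```

Each row of both matrices sums to one, and the smoothing term keeps every probability non-zero even when a transition never occurs in the training records.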
3.4. Training and Validation Performance
Figures 12 and 13 illustrate the training and validation accuracy and loss curves for both the hold-out and five-fold cross-validation approaches. These results correspond to the second experimental run, providing detailed insight into the model's learning dynamics. The accuracy curves demonstrate progressive improvement across both training and validation datasets, while the loss curves indicate convergence and stability throughout the training process. Comparison of hold-out and cross-validation approaches highlights the model's consistency and generalization capability across different validation strategies.
Fig. 12
The accuracy and loss of training and validation sets for hold-out validation
Fig. 13
The accuracy and loss of training and validation sets for five-fold cross validation
3.5. Hypnogram comparison
Figure 14 presents two hypnograms from the same subject: one comparing the ground truth with the initial classifier output, and the other comparing the ground truth with predictions refined by a hidden Markov model (HMM). The comparison clearly demonstrates the HMM's effectiveness in smoothing predictions and enhancing temporal consistency. These results confirm that incorporating HMMs into the postprocessing stage improves both the accuracy and coherence of sequential classifications in clinical applications.
Fig. 14
Hypnogram Comparison Before and After HMM Postprocessing
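The smoothing effect of HMM postprocessing can be illustrated with a small Viterbi decoder. This is a minimal sketch under stated assumptions: the start, transition, and emission probabilities below are illustrative values, not the matrices estimated in this study, and the function name is hypothetical.

```python
import math


def viterbi(observations, start, trans, emit):
    """Return the most likely hidden-state sequence for a discrete HMM."""
    n_states = len(start)
    log = math.log
    # Log-probability of the best path ending in each state at the first step.
    score = [log(start[s]) + log(emit[s][observations[0]]) for s in range(n_states)]
    backptr = []
    for obs in observations[1:]:
        prev_best, new_score = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + log(trans[p][s]))
            prev_best.append(best_prev)
            new_score.append(score[best_prev] + log(trans[best_prev][s]) + log(emit[s][obs]))
        backptr.append(prev_best)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for prev_best in reversed(backptr):
        state = prev_best[state]
        path.append(state)
    return path[::-1]


# Illustrative parameters: states tend to persist, and the classifier is right 80% of the time.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]

raw = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1]  # per-minute classifier output (0 = normal, 1 = apnea)
smoothed = viterbi(raw, start, trans, emit)
print(smoothed)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The isolated positive at the fourth segment is removed because a brief switch between states is less probable than a single classifier error, while the sustained run of positives at the end is preserved.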
3.6. Confusion matrix
Confusion matrices constitute essential tools for evaluating model performance, as they clearly illustrate the distribution of correct and erroneous predictions across classes. In clinical applications such as apnea detection, accurate classification of both positive and negative cases is critically important, as misclassifications can have significant diagnostic implications.
As demonstrated in Fig. 15, which corresponds to the second experimental run of the hold-out approach, application of the postprocessing stage resulted in substantial reductions in both false-positive and false-negative predictions. This reduction in classification errors directly contributed to improvements in overall accuracy, sensitivity, and model reliability. Such enhancements are crucial for minimizing diagnostic errors and ensuring reliable identification of apneic events.
Fig. 15
Confusion matrix for hold-out validation
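The metrics reported throughout this section follow from the four confusion-matrix counts by their standard definitions. The sketch below uses hypothetical counts for illustration; they are not the values shown in Fig. 15.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary metrics from confusion-matrix counts.

    The positive class is apnea; the negative class is normal breathing.
    """
    sensitivity = tp / (tp + fn)   # recall on apneic segments
    specificity = tn / (tn + fp)   # recall on normal segments
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts (illustrative only).
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Reducing false negatives raises sensitivity and reducing false positives raises specificity, so a postprocessing stage that lowers both error types improves accuracy and F1-score at the same time.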
4. Discussion
The findings of this study, based on the analysis of 70 ECG records from the PhysioNet Apnea-ECG database, demonstrate that deep learning models can overcome the inherent limitations of traditional methods relying on manual feature extraction—limitations that have been repeatedly identified in prior literature as the primary obstacle to accurate and robust detection of apnea and hypopnea events. One of the fundamental strengths of this research is its precise and controlled data-driven design. Unlike many previous studies that employed fixed or sometimes arbitrary percentage-based splits for constructing training and validation sets, the present study managed class distribution, segment counts, and record allocation with high precision and transparency. This approach not only minimized the possibility of data leakage between different subsets but also enabled the models to learn genuine ECG signal patterns and prevented bias arising from random sampling—an issue reported as one of the common shortcomings in physiological signal-based research.
The use of a rigorous hold-out strategy, whereby the test set was kept completely separate from the training process, model selection, and hyperparameter tuning, provided grounds for independent model evaluation. This aspect has been less frequently observed in the existing literature, limiting the validity of comparisons with previous findings. However, one of the main challenges observed in analyzing the results is the confusion arising from label quality and the overlap between apnea and hypopnea events—a problem that inherently affects the model's decision boundaries and in some cases reduces the accuracy of detecting positive (patient) samples. In such an unstable context, the application of a post-processing stage based on Hidden Markov Models (HMM) plays an important role in improving temporal coherence, reducing sensitivity to noisy labels, and enhancing discriminability between respiratory subtypes.
Analysis of Table 6 revealed that GRU and BiLSTM architectures demonstrated superior performance compared to other models, although identifying patients remained challenging. The reduction in F1-score indicates that models tend to identify healthy samples with greater accuracy, while the drop in detection for the positive class stems from the same labeling complexities and event overlaps. Moreover, comparison of model training times shows that although BiLSTM provided accurate results (with approximately 2 to 3 minutes per epoch), CNN-based architectures with significantly shorter training times (13 to 48 seconds) represent attractive options for real-time applications and systems deployable in clinical environments. Given that this study applied more rigorous standards than previous research, the stability of model performance under such challenging conditions holds greater scientific value.
Comparison with previous work demonstrated that despite this study's focus on precise data-driven design and independent evaluation, model performance under noisy conditions and complex data remains competitive and in some cases more stable. Nevertheless, the overlap between different types of respiratory events and inter-individual variations in physiological patterns increasingly highlight the need for developing multi-stage, hierarchical models or multi-task learning approaches—approaches that can establish clearer distinctions between obstructive apnea, hypopnea, and related transient events.
From a methodological perspective, the use of early stopping played an important role in reducing overfitting and improving model generalizability. The results presented in Table 9 demonstrate that the proposed model achieved acceptable performance in terms of the combination of accuracy and sensitivity, confirming the importance of proper selection of validation strategies (5-fold and hold-out), precise data configurations, and the presence of efficient post-processing.
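The early stopping mechanism referred to above can be sketched as a small helper that monitors validation loss. The patience value, the monitored quantity, and the loss values below are assumptions made for illustration; the settings used in the experiments are not stated in this section.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


# Hypothetical validation losses per epoch (illustrative only).
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]

stopper = EarlyStopping(patience=3)
stopped_epoch = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_epoch = epoch
        break

print(stopped_epoch, stopper.best)
```

In practice the model weights from the best epoch are usually restored after stopping, so that the evaluated model is the one with the lowest validation loss rather than the last one trained.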
Despite significant achievements, the present study illuminates multiple avenues for future research. First, replacing the Pan-Tompkins algorithm with more advanced methods such as Hamilton, adaptive monitoring filters, or time-frequency domain techniques could enhance the accuracy of surface feature extraction. Second, extending the temporal window length from 60 seconds to longer intervals (3 to 5 minutes) could provide the model with information related to slower physiological dynamics and increase its sensitivity to more complex events. Third, combining deep learning-based models with hierarchical classification, probabilistic models, or Transformer architectures could improve decision boundaries under noisy data conditions and enhance the model's ability to detect subtypes. Finally, conducting cross-database validation using independent datasets could more precisely determine the model's generalizability when confronted with real clinical conditions—an essential requirement for practical deployment of such systems in medical environments.
Overall, the results of this study demonstrate that the combination of precise data design, use of rigorous validation strategies, employment of HMM as post-processing, and utilization of early stopping provides a powerful and efficient framework for ECG-based sleep apnea detection. However, evolution of feature extraction methods, development of multi-stage classifiers, and validation under real-world conditions can lead to significant improvements in the efficiency and stability of automated diagnostic systems in subsequent steps [19, 21, 23, 33, 37–46].
Table 9
Comparison and Analysis of Research Findings in Relation to Previous Studies
| Study | Feature extracted | Classifier | Evaluation type | Accuracy (%) | Recall (%) | Specificity (%) |
|---|---|---|---|---|---|---|
| Hung-Yu Chang et al. 2020 [37] | Raw ECG | CNN | Shuffled segments | 87.90 | 81.10 | 92.00 |
| Sharan et al. 2020 [38] | RRI, time and frequency domain | CNN | Shuffled segments | 88.23 | 82.74 | 91.62 |
| Mukherjee et al. 2021 [39] | RRI, RAMP, EDR | Trainable ensemble using MLP | Shuffled segments | 85.58 | 84.43 | 88.26 |
| Bahrami et al. 2021 [33] | RRI, RAMP | LeNet + LSTM | Shuffled segments | 80.67 | 75.04 | 84.13 |
| Rajabrundha et al. 2022 [40] | RRI | LSTM | Shuffled segments | 85.62 | 82.71 | - |
| Bahrami et al. 2022 [41] | RRI, RAMP | ZFNet-BiLSTM | Shuffled segments | 88.13 | 81.49 | 92.27 |
| Fang et al. 2022 [42] | RRI | ResNet-Multiscale | Shuffled segments | 86.00 | 84.10 | 87.10 |
| Cheng et al. 2022 [19] | ECG (18.75–25 Hz subband) | CNN | Shuffled segments | 88.60 | 83.80 | 91.50 |
| Tyagi et al. 2023 [23] | HRV, EDR | FT-EDBN | Shuffled segments | 89.11 | 83.89 | 92.28 |
| Liu et al. 2023 [21] | Raw ECG | CNN-Transformer | Shuffled segments | 88.20 | 78.50 | 94.10 |
| Kollu Praveen Kumar et al. 2024 [43] | Raw ECG | CNN-LSTM | Segment-based evaluation | 86.18 | - | - |
| D. Padovano et al. 2025 [44] | Custom CNN, HRV’s DM | AlexNet | External validation | 74.72 | 73.99 | 75.17 |
| J. Gupta et al. 2025 [45] | RRI, RA | MLP with a time window | Shuffled segments | 86.80 | 82.40 | 89.50 |
| M. Scarpetta et al. 2025 [46] | Raw ECG | CNN | Shuffled segments | 81.00 | - | - |
| Proposed method | RRI, RAMP, EDR | CNN-Transformer-LSTM | Cross record | 90.62 | 84.15 | 94.44 |
5. Conclusion
This study proposes a principled framework for improving automated sleep apnea detection by integrating temporal consistency into deep learning pipelines through Hidden Markov Model (HMM)-based postprocessing. Rather than relying solely on architectural modifications, the findings emphasize the pivotal role of temporal modeling in bridging the persistent gap between high algorithmic accuracy and practical clinical usability. By addressing the issue of fragmented and physiologically implausible predictions, the proposed method shifts the focus from static, frame-level classification toward more coherent and clinically interpretable decision-making processes.
The consistent performance gains observed across a variety of deep learning models underscore the generalizability and flexibility of the proposed postprocessing approach. These enhancements suggest that temporal refinement is not a superficial adjustment but rather a core element necessary for building trustworthy, real-world decision-support systems in sleep medicine. The method's compatibility with different network architectures reinforces its potential for widespread adoption across diverse clinical and technical environments.
From a broader perspective, the paradigm of coupling sequence-level temporal smoothing with data-driven models opens new research directions in biomedical signal analysis, especially in domains where temporal fidelity is as crucial as classification accuracy. By establishing a robust foundation for temporal consistency, this work contributes to the development of more reliable, context-aware diagnostic systems capable of functioning in real-time clinical workflows.
Future efforts will aim to further enhance the model’s interpretability and generalization through the incorporation of attention mechanisms and ensemble learning strategies. Additionally, advanced preprocessing techniques—including data augmentation and targeted feature selection—will be explored to mitigate challenges related to dataset size limitations and inter-subject variability. Collectively, these advancements are expected to foster the development of intelligent, adaptive diagnostic platforms with high translational value in both sleep medicine and broader biomedical applications.
Acknowledgements
The authors appreciate the publicly accessible Apnea-ECG database, which enabled the development and evaluation of the proposed methodology.
Author Contribution
H.GH. and A.K. conceived and supervised the study, defined the research objectives, and guided the methodological framework. S.Q. designed key components of the CNN-Transformer-LSTM architecture, implemented ECG feature extraction modules (RRI, EDR, RAMP), and performed statistical analyses. Also, S.Q. handled data preprocessing, implemented the record-wise partitioning strategy, and conducted all training and evaluation experiments. H.GH. contributed to the development of signal-integration mechanisms and carried out comparative analyses with baseline models. H.GH. and S.Q. interpreted the results and contributed to scientific discussions. All authors participated in manuscript preparation, critically reviewed the content, and approved the final version.
Data Availability
The dataset used in this study is publicly available on PhysioNet and can be accessed at: https://physionet.org/content/apnea-ecg/1.0.0/
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Funding
Consent to publish
All authors approve the manuscript and give their consent for submission and publication as open access.
References
1.
Abbasi, A. et al. A comprehensive review of obstructive sleep apnea. Brazilian Association of Sleep and Latin American Federation of Sleep Societies. 10.5935/1984-0063.20200056 (2021).
2.
Salman, L. A., Shulman, R. & Cohen, J. B. Obstructive Sleep Apnea, Hypertension, and Cardiovascular Risk: Epidemiology, Pathophysiology, and Management, Feb. 01, Springer. (2020). 10.1007/s11886-020-1257-y
3.
Goldstein, C. et al. Polysomnography validation of SANSA to detect obstructive sleep apnea. Front. Neurol. 16 10.3389/fneur.2025.1592690 (2025).
4.
Ozkan, H. et al. A portable wearable tele-ECG monitoring system. IEEE Trans. Instrum. Meas. 69 (1), 173–182. 10.1109/TIM.2019.2895484 (Jan. 2020).
5.
Wang, L., Lin, Y. & Wang, J. A RR interval based automated apnea detection approach using residual network. Comput. Methods Programs Biomed. 176, 93–104. 10.1016/j.cmpb.2019.05.002 (Jul. 2019).
6.
Cartwright, R. Obstructive Sleep Apnea: A Sleep Disorder With Major Effects on Health Disease-a-Month ® Information for Readers, [Online]. (2001). Available: www.mosby.com/disamonth
7.
Varon, C., Caicedo, A., Testelmans, D., Buyse, B. & Van Huffel, S. A Novel Algorithm for the Automatic Detection of Sleep Apnea from Single-Lead ECG. IEEE Trans. Biomed. Eng. 62 (9), 2269–2278. 10.1109/TBME.2015.2422378 (Sep. 2015).
8.
Babaeizadeh, S., White, D. P., Pittman, S. D. & Zhou, S. H. Automatic detection and quantification of sleep apnea using heart rate variability, in Journal of Electrocardiology, Nov. pp. 535–541. (2010). 10.1016/j.jelectrocard.2010.07.003
9.
Pinho, A., Pombo, N., Silva, B. M. C., Bousson, K. & Garcia, N. Towards an accurate sleep apnea detection based on ECG signal: The quintessential of a wise feature selection. Appl. Soft Comput. J. 83. 10.1016/j.asoc.2019.105568 (Oct. 2019).
10.
Sharma, M., Agarwal, S. & Acharya, U. R. Application of an optimal class of antisymmetric wavelet filter banks for obstructive sleep apnea diagnosis using ECG signals. Comput. Biol. Med. 100, 100–113. 10.1016/j.compbiomed.2018.06.011 (Sep. 2018).
11.
Hassan, A. R. & Haque, M. A. An expert system for automated identification of obstructive sleep apnea from single-lead ECG using random under sampling boosting. Neurocomputing 235, 122–130. 10.1016/j.neucom.2016.12.062 (Apr. 2017).
12.
Feng, K., Qin, H., Wu, S., Pan, W. & Liu, G. A Sleep Apnea Detection Method Based on Unsupervised Feature Learning and Single-Lead Electrocardiogram. IEEE Trans. Instrum. Meas. 70 10.1109/TIM.2020.3017246 (2021).
13.
Viswabhargav, C. S. S. S., Tripathy, R. K. & Acharya, U. R. Automated detection of sleep apnea using sparse residual entropy features with various dictionaries extracted from heart rate and EDR signals. Comput. Biol. Med. 108, 20–30. 10.1016/j.compbiomed.2019.03.016 (May 2019).
14.
Li, K., Pan, W., Li, Y., Jiang, Q. & Liu, G. A method to detect sleep apnea based on deep neural network and hidden Markov model using single-lead ECG signal. Neurocomputing 294, 94–101. 10.1016/j.neucom.2018.03.011 (Jun. 2018).
15.
Cheng, M., Sori, W. J., Jiang, F., Khan, A. & Liu, S. Recurrent Neural Network Based Classification of ECG Signal Features for Obstruction of Sleep Apnea Detection, in Proceedings – 2017 IEEE International Conference on Computational Science and Engineering and IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, CSE and EUC 2017, Institute of Electrical and Electronics Engineers Inc., Aug. pp. 199–202. (2017). 10.1109/CSE-EUC.2017.220
16.
Urtnasan, E., Park, J. U. & Lee, K. J. Multiclass classification of obstructive sleep apnea/hypopnea based on a convolutional neural network from a single-lead electrocardiogram. Physiol. Meas. 39 (6). 10.1088/1361-6579/aac7b7 (Jun. 2018).
17.
Zhou, H., Chen, S., Xu, Z. & Zheng, W. A spatio-temporal learning-based model for sleep apnea detection using single-lead ECG signals.
18.
Yang, Q., Zou, L., Wei, K. & Liu, G. Obstructive sleep apnea detection from single-lead electrocardiogram signals using one-dimensional squeeze-and-excitation residual group network. Comput. Biol. Med. 140 10.1016/j.compbiomed.2021.105124 (Jan. 2022).
19.
Yeh, C. Y., Chang, H. Y., Hu, J. Y. & Lin, C. C. Contribution of Different Subbands of ECG in Sleep Apnea Detection Evaluated Using Filter Bank Decomposition and a Convolutional Neural Network. Sensors 22 (2). 10.3390/s22020510 (Jan. 2022).
20.
Zhou, Y., He, Y. & Kang, K. OSA-CCNN: Obstructive Sleep Apnea Detection Based on a Composite Deep Convolution Neural Network Model using Single-Lead ECG signal, in Proceedings – 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022, Institute of Electrical and Electronics Engineers Inc., pp. 1840–1845. (2022). 10.1109/BIBM55620.2022.9995675
21.
Liu, H., Cui, S., Zhao, X. & Cong, F. Detection of obstructive sleep apnea from single-channel ECG signals using a CNN-transformer architecture. Biomed. Signal. Process. Control. 82 10.1016/j.bspc.2023.104581 (Apr. 2023).
22.
Qin, H. & Liu, G. A dual-model deep learning method for sleep apnea detection based on representation learning and temporal dependence. Neurocomputing 473, 24–36. 10.1016/j.neucom.2021.12.001 (Feb. 2022).
23.
Kumar Tyagi, P. & Agrawal, D. Automatic detection of sleep apnea from single-lead ECG signal using enhanced-deep belief network model. Biomed. Signal. Process. Control. 80. 10.1016/j.bspc.2022.104401 (Feb. 2023).
24.
Chen, Y. et al. RAFNet: Restricted attention fusion network for sleep apnea detection. Neural Netw. 162, 571–580. 10.1016/j.neunet.2023.03.019 (May 2023).
25.
Fan, X., Chen, X., Ma, W. & Gao, W. BAFNet: Bottleneck Attention Based Fusion Network for Sleep Apnea Detection. IEEE J. Biomed. Health Inf. 28 (5), 2473–2484. 10.1109/JBHI.2023.3278657 (May 2024).
26.
Penzel, T., Moody, G. B., Mark, R. G., Goldberger, A. L. & Peter, J. H. The Apnea-ECG Database. Computers in Cardiology 27, 255–258 (2000).
27.
Zhaokun, Z. & Jinbao, L. Multi-Feature Information Fusion LSTM-RNN Detection for OSA. J. Comput. Res. Dev. 57 (12), 2547–2555. 10.7544/issn1000-1239.2020.20190583 (2020).
28.
Alfaouri, M. & Daqrouq, K. ECG Signal Denoising By Wavelet Transform Thresholding. Am. J. Appl. Sci. 5 (3), 276–281 (2008).
29.
Pan, J. & Tompkins, W. J. A Real-Time QRS Detection Algorithm. IEEE Trans. Biomed. Eng. BME-32 (3), 230–236 (1985).
30.
Han, D., Rao, Y. N., Principe, J. C. & Gugel, K. Real-Time PCA (Principal Component Analysis) implementation on DSP.
31.
Faust, O., Barika, R., Shenfield, A., Ciaccio, E. J. & Acharya, U. R. Accurate detection of sleep apnea with long short-term memory network based on RR interval signals. Knowl. Based Syst. 212, 106591 (2021).
32.
Qin, H. & Liu, G. A dual-model deep learning method for sleep apnea detection based on representation learning and temporal dependence. Neurocomputing 473, 24–36. 10.1016/j.neucom.2021.12.001 (Feb. 2022).
33.
Bahrami, M. Detection of Sleep Apnea from Single-Lead ECG: Comparison of Deep Learning Algorithms.
34.
Chen, L. et al. Review of image classification algorithms based on convolutional neural networks. Remote Sens. 13 (22). 10.3390/rs13224712 (Nov. 2021).
35.
Pham, D. T. & Mouček, R. Efficient sleep apnea detection using single-lead ECG: A CNN-Transformer-LSTM approach. Comput. Biol. Med. 196. 10.1016/j.compbiomed.2025.110655 (Sep. 2025).
36.
McShane, B. B. Machine Learning Methods with Time Series Dependence. [Online]. Available: http://repository.upenn.edu/edissertations/122 (2010).
37.
Chang, H. Y., Yeh, C. Y., Lee, C. T. & Lin, C. C. A sleep apnea detection system based on a one-dimensional deep convolution neural network model using single-lead electrocardiogram. Sensors (Switzerland) 20 (15), 1–15. 10.3390/s20154157 (Aug. 2020).
38.
Sharan, R. V., Berkovsky, S., Xiong, H. & Coiera, E. ECG-derived heart rate variability interpolation and 1-D convolutional neural networks for detecting sleep apnea, in 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 637–640. (2020).
39.
Mukherjee, D., Dhar, K., Schwenker, F. & Sarkar, R. Ensemble of deep learning models for sleep apnea detection: An experimental study. Sensors 21 (16). 10.3390/s21165425 (Aug. 2021).
40.
Rajabrundha, A., Lakshmisangeetha, A. & Balajiganesh, A. Analysis of Sleep apnea Considering Electrocardiogram Data Using Deep learning Algorithms, in Journal of Physics: Conference Series, Institute of Physics, (2022). 10.1088/1742-6596/2318/1/012009
41.
Bahrami, M. & Forouzanfar, M. Sleep Apnea Detection From Single-Lead ECG: A Comprehensive Analysis of Machine Learning and Deep Learning Algorithms. IEEE Trans. Instrum. Meas. 71, 1–11. 10.1109/TIM.2022.3151947 (2022).
42.
Fang, H., Lu, C., Hong, F., Jiang, W. & Wang, T. Sleep Apnea Detection Based on Multi-Scale Residual Network. Life 12 (1). 10.3390/life12010119 (Jan. 2022).
43.
Kumar, K. P., Vijay, K., Harshitha, G. B., Kumar, V. H. & Abhishek, P. ECG-based Sleep Apnea Detection using Convolutional and Long Short-Term Memory Networks, in 5th International Conference for Emerging Technology, INCET 2024, Institute of Electrical and Electronics Engineers Inc., (2024). 10.1109/INCET61516.2024.10593282
44.
Padovano, D., Martinez-Rodrigo, A., Pastor, J. M., Rieta, J. J. & Alcaraz, R. Deep Learning and Recurrence Information Analysis for the Automatic Detection of Obstructive Sleep Apnea. Appl. Sci. (Switzerland). 15 (1). 10.3390/app15010433 (Jan. 2025).
45.
Gupta, J. & Seeja, K. R. An Explainable AI Approach Towards Automatic Sleep Apnea Detection Based on ECG Signal, in Procedia Computer Science, Elsevier B.V., pp. 937–946. (2025). 10.1016/j.procs.2025.04.332
46.
Scarpetta, M., Ragolia, M. A., Pietro Pau, D., Andria, G. & Giaquinto, N. A Tiny Deep Learning Model for Sleep Apnea Detection Based on ECG Signals, in IEEE International Symposium on Medical Measurements and Applications, MeMeA, Institute of Electrical and Electronics Engineers Inc., (2025). 10.1109/MeMeA65319.2025.11068052