ECG-Based Automated Detection of Sleep Apnea Using Deep Neural Networks and Hidden Markov Models

Bushehr, Iran.

Email: ghimatgar@pgu.ac.ir

Sara Qnatyan^a,b, Hojat Ghimatgar^a,b, Ahmad Keshavarz^a,b

^a Computational Neuroscience Laboratory, ICT Research Institute, Faculty of Intelligent Systems Engineering and Data Science, Persian Gulf University, 7516913817 Bushehr, Iran

^b Department of Electrical Engineering, Faculty of Intelligent Systems Engineering and Data Science, Persian Gulf University, Bushehr 75169-138, Iran

* Corresponding author:

HOJAT GHIMATGAR

Department of Electrical Engineering,

Faculty of Intelligent Systems Engineering and Data Science, Persian Gulf University

Tel: (+ 98) 77 31222150

Fax: (+ 98) 77 33440376

Abstract

Sleep disorders constitute a substantial global health burden, with more than eighty distinct conditions currently recognized. Among these, obstructive sleep apnea (OSA) represents the most prevalent sleep-related respiratory disorder, characterized by recurrent episodes of complete or partial upper airway obstruction during sleep. Although often asymptomatic, OSA exerts profound detrimental effects on cardiovascular, neurological, and pulmonary systems, thereby necessitating timely and accurate diagnosis. Conventional diagnostic approaches rely on polysomnography (PSG), which, despite its high diagnostic accuracy, remains costly, time-intensive, and dependent on specialized equipment and expert clinical supervision. These limitations in accessibility and operational complexity have prompted the development of more practical alternatives. Electrocardiogram (ECG)-based approaches have consequently attracted considerable attention owing to their capacity for continuous cardiac monitoring and sensitivity to subtle physiological alterations associated with sleep-disordered breathing. However, the inherent nonstationarity of ECG signals and substantial inter-subject variability continue to constrain model generalizability, thereby underscoring the critical need for robust and computationally efficient deep learning architectures. In this study, we propose a deep learning framework that integrates multiple surface-level ECG-derived features—including R-R intervals (RRI), ECG-derived respiration (EDR), and respiratory amplitude (RAMP)—extracted from the PhysioNet Apnea-ECG database. A record-wise data partitioning strategy was implemented to rigorously prevent data leakage across training, validation, and test sets. A hidden Markov model (HMM) was further incorporated as a post-processing module to refine apnea episode detection. 
Across five repeated hold-out validation experiments with varying training and validation partitions, the CNN-Transformer-LSTM architecture achieved an accuracy of 89.16 ± 0.94%, sensitivity of 81.42 ± 3.27%, and specificity of 94.00 ± 1.16%. Five-fold cross-validation yielded enhanced performance with accuracy of 90.62 ± 1.54%, sensitivity of 84.15 ± 3.80%, and specificity of 94.44 ± 2.22%. Integration of the HMM module further improved classification performance by approximately 5.00–6.00%, demonstrating the efficacy of the proposed framework for reliable and efficient OSA screening in both clinical and home-based monitoring applications.

Keywords:

Obstructive Sleep Apnea

ECG-Derived Respiration

Hybrid Deep Neural Networks

CNN-Transformer-LSTM

Hidden Markov Model

Physiological Signal Processing

1. Introduction

Sleep is essential for maintaining physiological balance, and respiratory disorders such as apnea and hypopnea can severely disrupt it. Obstructive sleep apnea (OSA), the most common sleep-related breathing disorder, is characterized by repeated obstruction of the upper airway and reduction or cessation of airflow during sleep. If left undiagnosed and untreated, it increases the risk of cardiovascular diseases, stroke, cognitive impairments, and chronic fatigue [1], [2].

The apnea index indicates the number of apnea events per hour of sleep and is typically obtained through polysomnography (PSG). In this method, various physiological signals are recorded overnight at a sleep center and subsequently interpreted by a specialist or operator. Despite its high accuracy, this process is time-consuming, depends on individual judgment, and may be prone to subjective errors. Additionally, the unfamiliar environment may induce a "first-night effect" and alter the patient's sleep quality [3]. With the advancement of wearable sensors and IoT technologies, single-lead ECG has emerged as a promising option for OSA detection due to its non-invasive nature, simplicity, and portability [4].

However, inter-individual differences in physiological signals, the impact of concurrent cardiac diseases, and the need to balance model complexity against computational efficiency remain significant challenges in the design and implementation of ECG-based diagnostic systems [5].

In the initial stages of research on sleep apnea detection, traditional machine learning methods were predominantly employed based on manually extracted features from ECG signals. Algorithms such as Support Vector Machine (SVM), Decision Tree, and Artificial Neural Networks (ANN) demonstrated acceptable performance in classifying Obstructive Sleep Apnea (OSA) events. For instance, Cartwright et al. [6], Varon et al. [7], and Babaeizadeh et al. [8] reported accuracies ranging from 84.00% to 85.00%. Furthermore, Pinho et al. [9], combining HRV and EDR features using ANN and SVM, achieved an accuracy of 82.12%, while Sharma et al. [10], utilizing the BAWFB wavelet filter and SVM algorithm, attained an accuracy of 90.11%. Hassan et al. [11] also achieved an accuracy of 88.88% and specificity of 91.49% using TQWT wavelet transform and the RUSBoost algorithm. Nevertheless, the reliance on manually extracted features and high sensitivity to noise resulted in reduced generalizability and robustness of these methods on real-world data.

To address these limitations, several studies shifted toward unsupervised learning methods and hybrid models to achieve more automated and efficient feature extraction. Feng et al. [12] introduced the FSSAE model combined with HMM and the MetaCost algorithm, achieving an accuracy of 85.10%. Viswabhargav et al. [13], using a combination of EDR and SRE features with fuzzy clustering and SVM, reported an accuracy of approximately 78.00%. Li et al. [14], employing sparse autoencoder and HMM model, achieved 85.00% accuracy at the segment level and 100.00% at the record level. Despite improvements in automatic feature extraction, these models still face limitations in generalizability and clinical application due to computational complexity and high sensitivity to parameter tuning.

In recent years, advances in deep learning have fundamentally transformed obstructive sleep apnea detection. Networks such as CNN, RNN, and hybrid CNN-RNN architectures, which automatically extract spatial and temporal features from raw ECG signals, have achieved high accuracies. Cheng et al. [15] attained 97.80% accuracy using RNN, and Urtasun et al. [16] reported 90.80% accuracy employing CNN on 10-second epochs. Chen et al. [17] developed an end-to-end spatio-temporal model combining CNN and BiGRU, achieving 91.22% accuracy at the segment level and 97.10% at the record level. Yang et al. [18] improved model performance using SE networks, and Cheng et al. [19] did so through multi-band decomposition of ECG signals. Zhou et al. [20] converted ECG signals into spectral images and employed modified AlexNet and ResNet architectures to identify visual patterns associated with apnea events.

Despite these advances, data imbalance and difficulty in generalization under clinical conditions remain among the main challenges. Liu et al. [21], despite achieving an overall accuracy of 88.20%, reported a recall of only 78.50%, indicating the difficulty in reliably detecting apnea events in real-world conditions. Qin et al. [22] using a hybrid CNN-BiGRU model achieved 91.10% accuracy, and Tyagi et al. [23] with the FT-EDBN network attained 89.11% accuracy at the segment level. Subsequently, attention-based networks such as RAFNet [24] and BAFNet [25], by considering temporal dependencies between adjacent segments, demonstrated superior performance in contextual learning. RAFNet achieved 91.40% accuracy and BAFNet obtained 91.30% accuracy with perfect accuracy at the record level.

Overall, deep learning-based approaches exhibit superior capability compared to traditional and hybrid methods in modeling temporal dependencies and learning complex patterns from physiological signals. However, challenges such as data imbalance, model interpretability, and clinical validation remain active and open research avenues in this field.

The various factors causing sleep apnea and individual differences in physiological signals, coupled with the complexity of existing methods that require powerful computational resources to achieve high accuracy, motivated us to develop a practical and optimized approach. Our objective is to find a solution that, using relatively modest resources, can deliver acceptable and reliable results in sleep apnea detection for implementation on an individual basis.

The structure of this article is as follows: Section 2 presents the proposed method, including data preprocessing, feature extraction, model architecture, the post-processing strategy using a Hidden Markov Model, and training; Section 3 is dedicated to results; Section 4 includes discussion, limitations, and suggestions for future work; and finally, Section 5 provides conclusions and a summary of key achievements.

The overall workflow of the proposed study, from ECG signal preprocessing and feature extraction through the design and training of the deep learning models to the evaluation of model performance, is depicted in Fig. 1.

Fig. 1

Workflow overview of the OSA detection model

2. Materials and methods

2.1. Dataset Description

This study utilizes the Apnea-ECG database, a publicly available dataset comprising 70 nocturnal recordings of high-quality, single-lead electrocardiogram signals. These recordings were collected from two independent clinical studies and include patients diagnosed with sleep apnea as well as healthy control subjects. Each participant underwent monitoring of biosignals under supervised conditions for one or two consecutive nights.

The raw ECG signals, initially recorded at a sampling frequency of 200 Hz, were downsampled to 100 Hz during preprocessing to maintain consistency with the original experimental sampling rate. Following quality assessment, one or more suitable recordings were selected for each participant. The final dataset consists of 70 recordings: 27 recordings from 9 participants in the first study and 43 recordings from 23 participants in the second study.

Respiratory events in the Apnea-ECG database are annotated using a binary labeling system at one-minute intervals. During expert annotation and subsequent reviews, no distinction was made between apneas and hypopneas, with both marked as abnormal respiratory events (label: A), while intervals of normal breathing were recorded as normal (label: N). Additionally, the cumulative duration of abnormal breathing periods was calculated for each participant to support stratified analyses.

To evaluate model performance across different levels of respiratory disturbance severity, the dataset was divided into three classes based on the total duration of abnormal breathing in each recording:

Class C (control group): less than 5 minutes of abnormal breathing.

Class B (borderline group): 10 to 96 minutes of abnormal breathing.

Class A (apnea group): more than 100 minutes of abnormal breathing.

For training and evaluation, the dataset was split into two equal subsets. The training set comprises 35 labeled recordings from participants with identifiers a01 through c10, while the test set includes 35 independent recordings labeled x01 through x35. The assignment of test recordings to their respective groups is presented in Table 1. This stratified and structured partitioning provides a robust framework for developing and validating machine learning models, ensuring balanced representation of physiological diversity and apnea severity across both training and testing phases. Key characteristics of the dataset are summarized in Table 2 [26].

Table 1

Assignment of Test Set Recordings to Apnea, Borderline, and Control Groups
Group	Identifiers
A	X01, X02, X05, X07, X08, X09, X13, X14, X15, X19, X20, X21, X23, X25, X26, X27, X28, X30, X31, X32
B	X03, X10, X11, X12, X16
C	X04, X06, X17, X18, X22, X24, X29, X33, X34, X35

Table 2

Descriptive Statistics of the Sleep Apnea-ECG Dataset
Dataset Characteristics	Training Data	Testing Data	Overall Mean
Patients	65.71%	65.71%	65.71%
Healthy controls	34.29%	34.29%	34.29%
Male participants	85.71%	77.14%	81.43%
Female participants	14.29%	22.86%	18.57%
Mean participants age	46.00 years	44.00 years	45.00 years
Mean body mass index	28.00	28.20	28.10
Mean recording duration	489.29 min	494.37 min	491.83 min
Mean recording hours	8.15 hrs	8.24 hrs	8.20 hrs

Additional considerations regarding the dataset are as follows:

All apneas in this dataset are either OSA or mixed.

Pure central apneas, including Cheyne-Stokes respiration, are not present.

Hypopneas are treated equivalently to apneas; specifically, they are defined as a ≥ 50% reduction in airflow accompanied by a ≥ 4% decrease in oxygen saturation and subsequent compensatory breaths.

In summary, the majority of recordings correspond to OSA, although some may contain mixed apneas, and no recordings of pure central sleep apnea (CSA) are present.

2.2. Data Preprocessing

Initially, the ECG signals were segmented into one-minute intervals to enable more precise signal analysis and improve the performance of learning models. At this stage, 17,045 one-minute segments were extracted from the training data and 17,268 segments from the test data. Following the methodology proposed by Zhu Zhaokun et al. [27], the preprocessing procedure was meticulously implemented to eliminate interfering noise and facilitate the extraction of robust features.

Given the non-stationary nature of ECG signals, the use of simple linear filters for noise removal exhibits limited efficacy, as these filters cannot adapt to the gradual variations of the signal over time. One of the most significant noise components is baseline wander, which occurs in the frequency range of 0 to 1.5 Hz and is primarily caused by respiratory movements, electrode displacement, and changes in body position. The presence of this drift may adversely affect the accurate detection of critical points such as the R-peak, as well as the morphology of the T-wave.

To address this issue, the discrete wavelet transform (DWT) was employed in this study [28]. This transform enables time-frequency domain analysis of the signal and is highly effective in separating the slow and fast components of the signal. In the preprocessing stage, the signals were decomposed using the Daubechies wavelet of order 6 (db6) up to the sixth level, and the approximation coefficients corresponding to low frequencies were removed. Subsequently, through inverse wavelet transform reconstruction, a filtered version free from baseline wander was obtained, in which the QRS complex structure and other electrophysiological cardiac components were well preserved.
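The baseline-removal step can be written compactly. The following is a sketch under two assumptions: a sampling rate of 100 Hz and idealized dyadic band splitting (real db6 filters overlap at the band edges). The symbols $A_6$, $D_j$, and $\tilde{x}$ are introduced here for illustration only.

```latex
% Six-level decomposition: one approximation plus six detail components
x[n] = A_6[n] + \sum_{j=1}^{6} D_j[n]

% Baseline-corrected signal: drop the approximation, keep all details
\tilde{x}[n] = \sum_{j=1}^{6} D_j[n] = x[n] - A_6[n]

% Nominal frequency bands for f_s = 100 Hz
A_6:\ \left[0,\ \frac{f_s}{2^{7}}\right] \approx [0,\ 0.78]\ \text{Hz},
\qquad
D_j:\ \left[\frac{f_s}{2^{j+1}},\ \frac{f_s}{2^{j}}\right],\quad j = 1, \dots, 6
```

Under this idealization, discarding $A_6$ suppresses content below roughly 0.78 Hz, while drift between about 0.78 and 1.56 Hz falls into $D_6$ and is retained.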

Following signal filtering, three primary features were extracted, namely:

R-R interval (RRI): an indicator of heart rate variability and a reflection of autonomic nervous system activity;

R-wave amplitude (RAMP): associated with ventricular contraction strength and the physiological state of cardiac activity;

ECG-derived respiration (EDR): representing respiratory pattern and depth, which is important for detecting apnea patterns.

For RRI extraction, the robust Pan-Tompkins algorithm [29] was utilized to identify R-peaks, and abnormal intervals with heart rates below 30 or above 180 beats per minute were excluded. The R-wave amplitude was determined by computing the signal maximum within a 25-sample window around each peak. Additionally, the EDR signal was reconstructed from the analysis of dominant QRS complex components and extraction of the principal component using PCA [30].
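The interval and amplitude rules described above can be sketched as follows. This is a minimal illustration, assuming that R-peak positions have already been detected and are given as sample indices at 100 Hz, and that the 25-sample window is centered on each peak; the function names `rr_intervals` and `r_amplitudes` are hypothetical and not taken from the original implementation.

```python
def rr_intervals(r_peaks, fs=100.0, min_bpm=30.0, max_bpm=180.0):
    """Return (times, rri) keeping only physiologically plausible intervals.

    r_peaks: ascending R-peak positions in samples.
    times:   time (s) of the beat that closes each kept interval.
    rri:     kept R-R intervals in seconds.
    """
    times, rri = [], []
    for prev, curr in zip(r_peaks, r_peaks[1:]):
        interval = (curr - prev) / fs
        if interval <= 0:
            continue  # skip duplicated or unordered peaks
        bpm = 60.0 / interval
        if min_bpm <= bpm <= max_bpm:  # exclude HR below 30 or above 180 bpm
            times.append(curr / fs)
            rri.append(interval)
    return times, rri


def r_amplitudes(ecg, r_peaks, window=25):
    """Maximum ECG value in a window centered on each R-peak (assumed centering)."""
    half = window // 2
    return [max(ecg[max(0, p - half): p + half + 1]) for p in r_peaks]


# Illustrative peak positions: one spuriously short and one overly long interval
peaks = [0, 80, 165, 180, 260, 700]
times, rri = rr_intervals(peaks)
```

In this toy input, the 0.15 s interval (400 bpm) and the 4.4 s interval (about 14 bpm) are discarded, leaving three intervals.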

One of the challenges was the difference in sampling rates among features. RRI and RAMP are event-driven and naturally have much lower sampling rates compared to EDR. To integrate the data and create a coherent input for the model, all three signals were resampled to 4 Hz. To this end, RRI and RAMP were upsampled using quadratic spline interpolation, and EDR was aligned using rational downsampling. This process provided a suitable foundation for use in deep learning models without compromising the dynamic information of the signals.
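The resampling of an event-driven series onto a uniform 4 Hz grid can be illustrated as below. Note the substitution: for brevity this sketch uses linear interpolation instead of the quadratic spline interpolation described above, and it holds the first and last values constant outside the observed range. The function name `resample_uniform` and its parameters are hypothetical.

```python
def resample_uniform(times, values, fs_out=4.0, duration=60.0):
    """Linearly interpolate an irregularly sampled series onto a uniform grid.

    times:  strictly increasing sample times in seconds.
    values: series values at those times (e.g. RRI or RAMP).
    Returns fs_out * duration samples (240 for one minute at 4 Hz).
    """
    n_out = int(round(fs_out * duration))
    out = []
    j = 0  # index of the left neighbor; only moves forward
    for k in range(n_out):
        t = k / fs_out
        if t <= times[0]:
            out.append(values[0])       # hold first value before the first beat
        elif t >= times[-1]:
            out.append(values[-1])      # hold last value after the last beat
        else:
            while times[j + 1] < t:
                j += 1
            t0, t1 = times[j], times[j + 1]
            w = (t - t0) / (t1 - t0)
            out.append(values[j] + w * (values[j + 1] - values[j]))
    return out


# Toy series with three beats; a real segment would contain roughly 40 to 120
rri_4hz = resample_uniform([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
```

Each one-minute segment then contributes 240 samples per feature, which is the per-feature length used as model input.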

In the normalization phase, Z-Score-based standardization was initially examined. Subsequently, according to the reference article methodology, the mean of each signal was removed and feature-wise scaling was applied. Due to the inherent amplitude differences among features, only RAMP and EDR underwent amplification scaling to prevent saturation of activation functions in the network and establish balance in feature magnitudes. Table 3 provides an overview of the PhysioNet Apnea-ECG dataset following preprocessing. Figure 2 illustrates the preprocessing pipeline for ECG signals.

Table 3

Description of the PhysioNet Apnea-ECG Dataset After Preprocessing
Dataset	Sleep Apnea	Normal	Total
Train set	6129	9832	15961
Test set	6100	9838	15938
Total	12229	19670	31899

Fig. 2

Preprocessing Pipeline Framework

2.3. Utilized Models

Following the preprocessing stage and extraction of surface-level features, each physiological feature was represented as a 240×1 vector describing the variations of that feature over 60 seconds of the ECG signal after resampling to 4 Hz (60 s × 4 Hz = 240 samples). The three extracted features were concatenated column-wise to form a 240×3 matrix, which, while preserving temporal synchronization, enabled the modeling of interactions between cardiac and respiratory dynamics. This representation provided a standardized input structure for all deep learning models.

Given the sequential nature of ECG signals, recurrent neural networks (RNNs) were initially employed to model temporal dynamics. Long Short-Term Memory (LSTM) architectures [31] were utilized to capture long-term temporal dependencies through their gating mechanisms and memory cells, demonstrating robust performance in detecting apnea-related events. However, the prolonged training time and high computational cost associated with LSTMs constrained their practical applicability. To enhance computational efficiency, alternative architectures including Gated Recurrent Units (GRU) [17], Bidirectional GRU (BiGRU) [32], and Bidirectional LSTM (BiLSTM) [33] were adopted. Several key optimized hyperparameters for these networks are summarized in Table 4, and their architectural schematics are illustrated in Figs. 3–6.

Fig. 3

The architecture of the LSTM

Fig. 4

The architecture of the GRU

Fig. 5

The architecture of the BiGRU

Fig. 6

The architecture of the BiLSTM

Concurrently, Convolutional Neural Networks (CNNs) were employed to extract local features and repetitive morphological patterns from the ECG signals, with the architectural schematic illustrated in Fig. 7 [34]. Building upon these feature extraction capabilities, four hybrid architectures (CNN–BiLSTM, CNN–GRU, CNN–LSTM, and CNN–BiGRU) were developed to further enhance temporal modeling performance. The key optimized hyperparameters of these hybrid models are summarized in Table 5.

Fig. 7

The architecture of the CNNs

To achieve enhanced computational efficiency and accelerated training, a hybrid CNN–Transformer–LSTM architecture was designed and implemented, inspired by the framework presented in [35]. This architecture integrates the CNN's capability for extracting local features, the Transformer's ability to model long-range dependencies through multi-head attention mechanisms—as illustrated schematically in Fig. 8—and the LSTM's capacity for capturing temporal dynamics within a unified framework. The convolutional component comprises three one-dimensional convolutional layers with 64, 128, and 128 filters, respectively, followed by max-pooling and dropout layers. To preserve temporal information, positional encoding with a dimensionality of 128 was applied. The Transformer encoder incorporates a two-head attention mechanism, a feedforward network, layer normalization, and residual connections. Subsequently, an LSTM layer with 128 units and a dropout layer was employed, followed by fully connected layers with sigmoid activation for binary classification. Model training was performed using the Adam optimizer with binary cross-entropy as the loss function, and performance evaluation was conducted using accuracy as the primary metric.

Fig. 8

The architecture of the Transformer Encoder

Table 4

Hyperparameter Settings of the LSTM, GRU, BiLSTM, and BiGRU Networks Used in This Study.
Parameter	Model
Parameter	LSTM	GRU	BiLSTM	BiGRU
Number of LSTM Layers	3	-	-	-
Number of GRU Layers	-	3	-	-
Number of BiLSTM Layers	-	-	2	-
Number of BiGRU Layers	-	-	-	3
Units per Layer	384, 384, 384	64, 32, 16	64, 64	256, 128, 64
Dropout Rate	0.1, 0.2, 0.3	-	0.5	0.4, 0.3, 0.2
Dense Layers	128, 64, 32	32	32	128, 64 (with L2 regularization)
Batch Normalization	-	-	-	Used (after BiGRU layers)

Table 5

Architectural Hyperparameters of CNN-Based Deep Learning Models.
Parameter	Model
Parameter	CNN	CNN_LSTM	CNN_GRU	CNN_BiLSTM	CNN_BiGRU
Number of Conv 1D Layers	3	1	2	2	2
Filters per conv 1D Layer	32, 64, 128	32	64, 128	64, 128	64, 128
Kernel Size	3	3	3	3	3
Number of LSTM Layers	-	1	-	-	-
Number of GRU Layers	-	-	2	-	-
Number of BiLSTM Layers	-	-	-	2	-
Number of BiGRU Layers	-	-	-	-	2
Units per Layer	-	64	128, 64	128, 64	128, 64
Max pooling	2	2	2	2	2
Dropout Rate	0.3, 0.5	-	0.5	0.5	0.5
Dense Layers	64	1	64	64	64

2.4. Post-Processing with Hidden Markov Model (HMM)

In this section, we introduce the Hidden Markov Model (HMM) as a post-processing technique to refine sleep apnea classification results. As illustrated in Fig. 9, each one-minute ECG segment is associated with a hidden state (Normal or Apnea), with observations represented as feature vectors. We employ an HMM where emission probabilities remain consistent across all records, while transition matrices are record-specific. Maximum Likelihood Estimation (MLE) is used to estimate model parameters, integrating short-term classifier outputs with long-term transition dynamics to improve detection accuracy [36].

An HMM is formally specified by the following elements:

The hidden Markov state sequence over time $t = 1, \dots, T$:

$Y_{1:T} = \{Y_1, Y_2, \dots, Y_T\}$

The corresponding sequence of observed variables:

$X_{1:T} = \{X_1, X_2, \dots, X_T\}$

A discrete state space $S = \{0, 1, \dots, k-1\}$, where each hidden state $Y_t \in S$.

Each observation $X_t$ is modeled as a continuous-valued feature vector.

The model parameters include:

Initial state distribution:

$\pi = (\pi_0, \pi_1, \dots, \pi_{k-1}), \quad \text{where} \quad \pi_i = p(Y_1 = i)$

State transition probability matrix:

$A = [a_{i,j}]_{k \times k}, \quad \text{where} \quad a_{i,j} = p(Y_{t+1} = j \mid Y_t = i)$

Emission distribution (observation likelihood given the hidden state):

$\mu_i(x) = p(X_t = x \mid Y_t = i)$

Marginal distribution of hidden states:

$p = (p_0, p_1, \dots, p_{k-1}), \quad \text{where} \quad p_i = p(Y_t = i)$

Posterior probability of hidden states given observations:

$f_i(x) = p(Y_t = i \mid X_t = x)$

These parameters satisfy the following normalization constraints:

$\sum_{i=0}^{k-1} \pi_i = 1, \quad \sum_{i=0}^{k-1} p_i = 1, \quad \sum_{i=0}^{k-1} a_{j,i} = 1 \quad \forall\, j \in S$

Applying Bayes’ theorem, the conditional probability of an observation given a hidden state can be written as:

$p(X_t = x \mid Y_t = i) = \frac{p(Y_t = i \mid X_t = x)\, p(X_t = x)}{p(Y_t = i)} \propto \frac{p(Y_t = i \mid X_t = x)}{p(Y_t = i)}$

Fig. 9

A Schematic Representation of a Hidden Markov Model
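One way such a post-processing step could operate is Viterbi decoding over scaled likelihoods, where the classifier posterior is divided by the state prior as in the Bayes relation above. The sketch below is illustrative only: it assumes a single shared transition matrix, and the function name `viterbi_smooth` and all numerical values are hypothetical rather than taken from the study.

```python
import math


def viterbi_smooth(posteriors, priors, trans, init):
    """Most likely state path given per-segment classifier posteriors.

    posteriors[t][i] ~ p(Y_t = i | X_t), priors[i] ~ p(Y_t = i),
    trans[i][j] ~ p(Y_{t+1} = j | Y_t = i), init[i] ~ p(Y_1 = i).
    """
    n_states = len(priors)

    def log(v):
        return math.log(max(v, 1e-12))  # guard against log(0)

    # Scaled log-likelihoods: log posterior minus log prior
    emis = [[log(f[i]) - log(priors[i]) for i in range(n_states)]
            for f in posteriors]

    delta = [log(init[i]) + emis[0][i] for i in range(n_states)]
    back = []
    for t in range(1, len(posteriors)):
        new_delta, ptr = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: delta[i] + log(trans[i][j]))
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log(trans[best_i][j]) + emis[t][j])
        delta = new_delta
        back.append(ptr)

    # Backtrack from the best final state
    state = max(range(n_states), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical classifier outputs [p(Normal), p(Apnea)] for five minutes
posteriors = [[0.1, 0.9], [0.2, 0.8], [0.55, 0.45], [0.1, 0.9], [0.15, 0.85]]
smoothed = viterbi_smooth(posteriors, priors=[0.6, 0.4],
                          trans=[[0.9, 0.1], [0.1, 0.9]], init=[0.5, 0.5])
```

In this toy sequence, a per-segment threshold at 0.5 would label the third minute as Normal, whereas the decoded path keeps it as Apnea because two state switches are less probable than one weakly contradicting observation.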

2.5. Training phase

The training procedure was designed to ensure rigorous, unbiased evaluation of the proposed framework while preserving the temporal and subject-specific characteristics of the ECG recordings. All models were trained using a subject-wise validation strategy to prevent data leakage between training and evaluation sets—a critical requirement in physiological signal modeling, where segments from the same subject are inherently correlated.

Two complementary validation schemes were employed. First, a five-fold cross-validation protocol was implemented at the subject level. In each fold, the available recordings were partitioned into mutually exclusive training and validation groups, ensuring that no subject appeared in more than one subset per fold. This approach enabled the model to be exposed to a wide range of physiological variability while maintaining strict independence between training and validation data. The order of subjects within each subset was randomized prior to training to avoid unintentional bias arising from fixed subject sequences, while randomization seeds were controlled to guarantee reproducibility. For each fold, all segments belonging to the selected subjects were concatenated sequentially to form the final training and validation matrices. This procedure preserved the native temporal ordering of the ECG segments, which is essential for learning physiologically meaningful temporal dependencies.
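The subject-level partitioning described above can be sketched as follows. This is a minimal illustration using only the standard library; the subject identifiers, the seed value, and the function name `subject_folds` are hypothetical.

```python
import random


def subject_folds(subject_ids, n_folds=5, seed=0):
    """Split subjects into n_folds (train, validation) pairs with no overlap."""
    ids = sorted(set(subject_ids))
    rng = random.Random(seed)      # fixed seed for reproducible shuffling
    rng.shuffle(ids)
    folds = [ids[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        val = folds[k]
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        splits.append((train, val))
    return splits


# Hypothetical identifiers for 32 participants
subjects = [f"s{i:02d}" for i in range(32)]
splits = subject_folds(subjects)
```

All segments of a subject follow that subject into either the training or the validation group, so correlated segments never straddle the split.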

Within each fold, the model was trained iteratively, and its predictions on the validation subjects were used to derive subject-specific temporal statistics, including transition and emission probabilities required for the Hidden Markov Model (HMM) post-processing module. These statistics were aggregated across subjects to construct the final HMM parameters for each fold. All runs—including intermediate model checkpoints, prediction traces, and HMM parameter sets—were archived systematically to ensure complete reproducibility and facilitate downstream comparative analysis.
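As an illustration of how transition statistics might be derived from a label sequence, the sketch below counts consecutive state pairs and normalizes each row, which is the maximum likelihood estimate when `smoothing` is zero. The optional additive smoothing and the uniform fallback for unseen states are assumptions added here, and the function name `estimate_transitions` is hypothetical.

```python
def estimate_transitions(labels, n_states=2, smoothing=0.0):
    """Row-normalized transition counts from a sequence of state labels.

    labels: per-minute states in temporal order (0 = Normal, 1 = Apnea).
    smoothing: pseudo-count added to every cell (0.0 gives the plain MLE).
    """
    counts = [[smoothing] * n_states for _ in range(n_states)]
    for prev, curr in zip(labels, labels[1:]):
        counts[prev][curr] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # State never left in this sequence: fall back to a uniform row
            matrix.append([1.0 / n_states] * n_states)
        else:
            matrix.append([c / total for c in row])
    return matrix


# Toy sequence: 0->0 twice, 0->1 once, 1->1 once, 1->0 once
A_hat = estimate_transitions([0, 0, 0, 1, 1, 0])
```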

In addition to cross-validation, a hold-out evaluation was conducted to provide an independent assessment of generalization performance. The dataset was partitioned into an 80% training portion and a 20% test portion at the subject level. To enhance robustness, the training set was further divided into five rotating internal validation subsets, yielding five independent training runs. In each run, the model was trained from scratch, validated on a distinct subset of training subjects, and subsequently applied to the fixed test set. The final performance of the hold-out evaluation was computed as the average across the five runs, ensuring that the reported results were not influenced by a specific sampling configuration.

Throughout all experiments, the test set remained completely unseen during model development, training, and parameter tuning. This strict separation guaranteed an unbiased and clinically meaningful estimate of real-world performance. The combination of subject-wise cross-validation, multi-run hold-out evaluation, controlled randomization, and systematic recording of temporal statistics provided a robust methodological foundation for reliable assessment of the proposed deep learning–HMM framework.

2.6. Performance metrics

The model's performance was comprehensively evaluated using a range of statistical metrics derived from the confusion matrix, which comprises four fundamental quantities: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). Specifically, TP represents correctly identified OSA cases, while FP corresponds to non-OSA instances erroneously classified as OSA. TN denotes correctly classified normal instances, and FN refers to OSA cases misclassified as normal. These metrics are defined as follows:

$Acc = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$

$Sen = \frac{TP}{TP + FN} \times 100\%$

$Spe = \frac{TN}{TN + FP} \times 100\%$

$Pre = \frac{TP}{TP + FP} \times 100\%$

$F1\text{-}score = 2 \times \frac{Pre \times Sen}{Pre + Sen}$

Through this comprehensive set of evaluation metrics, a thorough analysis of the model's strengths and limitations in classifying OSA from ECG data was conducted, providing detailed insights into its classification performance.
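For concreteness, the metrics can be computed from the four confusion-matrix counts as sketched below, with all values expressed as percentages. The counts in the example are invented for illustration and are not results from this study; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision and F1-score in percent."""
    acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    sen = 100.0 * tp / (tp + fn)
    spe = 100.0 * tn / (tn + fp)
    pre = 100.0 * tp / (tp + fp)
    f1 = 2.0 * pre * sen / (pre + sen)
    return {"Acc": acc, "Sen": sen, "Spe": spe, "Pre": pre, "F1": f1}


# Invented counts: 100 apnea segments and 100 normal segments
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

With these counts the accuracy is 85%, the sensitivity 80%, and the specificity 90%.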

3. Results

3.1. Per-segment performance

Preliminary analyses indicated that conventional Z-score normalization, despite its widespread adoption, is unsuitable for signals derived from surface-level time-series features. This method reduces the dynamic range of signals and attenuates subtle variations in the RRI, RAMP, and EDR components, leading to distortion of discriminative patterns, convergence difficulties, and reduced classification accuracy in the initial models.

Among the nine initial models reported in Tables 4 and 5, only the results obtained from RRI_RAMP_EDR data with a 240×3 structure, normalized according to the reference method, were utilized for subsequent analyses. This preprocessing approach ensured enhanced model stability and performance. At this stage, only the hold-out validation strategy was applied, and the mean performance of each model across five independent runs is presented in Table 6. Evaluation of models based on recurrent, convolutional, and hybrid architectures indicated that application of a postprocessing stage consistently led to substantial improvement in the performance of all networks. Comparison of mean values and standard deviations of the metrics demonstrates that this stage, in addition to increasing classification accuracy, plays a critical role in stabilizing model outputs and reducing variability across different runs. Overall, the optimal performance was observed in the BiLSTM model following postprocessing, achieving an accuracy of 89.04%, sensitivity of 87.69%, and F1-score of 86.00%. These results demonstrate the inherent capability of the bidirectional LSTM architecture to extract both long-term and short-term temporal patterns, coupled with improved decision-making through the postprocessing stage.

Among the unidirectional recurrent models, both LSTM and GRU exhibited increased F1-scores following activation of the postprocessing stage, from 77.45% to 84.98% and from 77.18% to 85.96%, respectively. Notably, the standard deviation of the F1-score decreased substantially in both models, indicating that postprocessing enhances decision stability and reduces sensitivity to run-to-run variations. However, GRU outperformed LSTM overall, which may be attributed to its simpler architecture and superior capacity for modeling short-term ECG dependencies.

Among CNN-RNN hybrid models, CNN-GRU and CNN-LSTM demonstrated competitive performance; however, within this group, CNN-GRU outperformed CNN-LSTM, achieving an F1-score of 85.13% and accuracy of 88.68%. This is likely attributable to the GRU's ability to prevent overfitting in deeper architectures, as the reduced parameter count of GRU, when combined with CNN, confers greater robustness against signal variations.

Conversely, bidirectional hybrid architectures such as CNN-BiLSTM and CNN-BiGRU, despite relative performance gains compared to models without postprocessing, exhibited higher metric variability, particularly in the standard deviations of sensitivity and specificity. This phenomenon suggests that bidirectional models, in settings with limited experimental runs, are more susceptible to fluctuations arising from architectural complexity or sensitivity to noisy ECG patterns. Nonetheless, CNN-BiGRU delivered satisfactory performance with 87.56% accuracy and 83.78% F1-score, and demonstrated greater stability than CNN-BiLSTM.

Examination of the poorest performance revealed that the CNN-only model, both without and with postprocessing, produced the lowest results. With an initial F1-score of 73.86% and only modest improvement to 81.47% following postprocessing, this model demonstrated that feature extraction based solely on CNN is insufficient for temporally dependent ECG signals. The high standard deviation of sensitivity confirms that architectures without memory mechanisms cannot reliably identify apnea patterns.

Regarding standard deviation analysis, deeper or multi-component models such as BiGRU and CNN-BiLSTM exhibited the highest variability in sensitivity and specificity, indicating their greater dependence on run-specific patterns and susceptibility to overfitting particular temporal structures. Conversely, GRU and CNN-GRU, following postprocessing, presented the lowest standard deviations, demonstrating high stability, robust generalization, and reduced sensitivity to variations in data distribution.

Overall, the results indicate:

Best overall performance: BiLSTM (with postprocessing)

Most stable models: GRU and CNN-GRU

Weakest model (in terms of error rate and sensitivity): CNN (without postprocessing)

Highest run-to-run variability: CNN-BiLSTM and BiGRU

Largest performance improvement following postprocessing: GRU and BiLSTM (F1-score gains of 8.78 and 8.64 percentage points, respectively)

These findings indicate that combining recurrent networks with short- and medium-term memory capacity (e.g., GRU) with a rule-based postprocessing stage provides the optimal balance between accuracy, sensitivity, and stability for ECG-based apnea detection. Implementation of the proposed model from [35] also confirmed accelerated training and inference while achieving performance comparable to the best initial model, reaching 89.18% accuracy. Considering computational efficiency and processing speed, this model was selected as the final architecture.

Table 7 reports the performance of the final model under two normalization approaches, with and without postprocessing. The results show that feature extraction quality and the normalization method are crucial for system stability and classification accuracy. The same three features (RRI, RAMP, and EDR) were used in all experiments; however, the input standardization method and the application of postprocessing induced marked changes in model behavior.

Using the reference method (without Z-score normalization), the model without postprocessing achieved an F1-score of 79.45% and accuracy of 85.20%. Specificity was relatively high at 91.60%, but sensitivity was lower at 74.89%, indicating superior performance in identifying normal samples while committing more errors in detecting apneic events. With postprocessing enabled, performance improved substantially, with F1-score reaching 85.19%, accuracy 89.18%, and sensitivity 81.42%. The concurrent reduction in standard deviations demonstrates that postprocessing not only increases overall accuracy but also stabilizes model performance across different runs, reflecting the beneficial impact of temporal smoothing and decision correction rules on network outputs.

Conversely, application of Z-score normalization resulted in notable performance degradation for models without postprocessing. Under this condition, F1-score decreased to 73.80% and accuracy to 77.60%; importantly, the standard deviations of sensitivity and specificity increased compared to the reference method. This increase in variability suggests that Z-score normalization does not standardize the RRI, RAMP, and EDR features stably and may disrupt critical dynamic characteristics of ECG signals. Postprocessing improved performance modestly, with F1-score increasing to 78.72% and accuracy to 82.11%; however, overall performance remained inferior to the reference method. Relative to the no-postprocessing condition, the standard deviation of sensitivity decreased (from 4.40 to 3.39) while that of specificity increased slightly (from 5.09 to 5.55), and both remained above the corresponding values for the reference approach. This pattern indicates that the primary limitation lies in Z-score normalization and that postprocessing can only partially mitigate its adverse effects.
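Z-score normalization rescales a feature to zero mean and unit variance. The sketch below applies it per segment and per channel, which is an assumption: the text does not state whether the statistics were computed per segment, per record, or over the training set. The values are hypothetical. The example shows one way such scaling can discard amplitude information, which is a possible reading of the degradation described above rather than a verified cause.

```python
from statistics import fmean, pstdev


def zscore(channel, eps=1e-8):
    """Standardize one feature channel to zero mean and unit variance."""
    mu = fmean(channel)
    sigma = pstdev(channel)
    return [(x - mu) / (sigma + eps) for x in channel]


# Hypothetical RRI values (seconds) for two short segments; the second has
# the same shape but twice the scale of the first.
segment_a = [0.80, 0.82, 0.95, 1.10, 0.78]
segment_b = [2.0 * x for x in segment_a]

z_a = zscore(segment_a)
z_b = zscore(segment_b)

# After per-segment standardization the two segments become nearly identical,
# so differences in absolute level and amplitude are no longer visible.
max_diff = max(abs(a - b) for a, b in zip(z_a, z_b))
print(f"largest difference after Z-scoring: {max_diff:.2e}")
```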

In summary:

The optimal overall performance corresponds to the reference method with postprocessing, offering the highest F1-score, accuracy, and stability with minimal variability. This demonstrates that RRI, RAMP, and EDR features in their appropriately scaled form provide maximum discriminative power.

Z-score normalization is unsuitable for these features, as it reduces accuracy and F1-score, lowers specificity (from 91.60% to 74.63% without postprocessing), and increases standard deviations. This suggests that standard normalization of these features does not preserve their temporal or morphological relationships.

Postprocessing constitutes a critical component for enhancing performance, improving both model accuracy and stability across runs. Its effect is substantially more pronounced with the reference method than with Z-score normalization.

Higher specificity relative to sensitivity under the reference method indicates that detecting normal conditions is less challenging, while apneic events introduce greater complexity in ECG temporal patterns.

Table 8 demonstrates that the type of input features, the normalization method, and the presence or absence of postprocessing all exert direct and significant effects on the performance of ECG-based apnea detection systems. Observed patterns indicate that combining diverse features, avoiding Z-score normalization, and applying postprocessing constitutes the most effective approach for improving model accuracy and stability.

In the optimal scenario—utilizing all three features (RRI, RAMP, and EDR) according to the reference method with active postprocessing—the model achieved an F1-score of 87.25%, specificity of 94.44%, and accuracy of 90.62%. These values surpass all other configurations in both mean performance and standard deviation, indicating that combining complementary features provides maximum discriminative power for apnea patterns. Additionally, postprocessing reduces decision noise, improving model stability across experimental runs. The substantial increase in sensitivity (from 75.36% to 84.15%) further highlights the critical role of postprocessing in reducing false negatives and improving detection of apneic events.

In contrast, when individual features were utilized independently without postprocessing, average performance declined significantly and standard deviations increased. RRI alone, inherently dependent on heart rate, performed considerably worse than the three-feature combination, achieving an F1-score of 75.66% with high variability (± 6.24), indicating poor stability across subjects. RAMP exhibited similar behavior; despite containing respiratory-related information, its standalone performance reached only 73.78% F1-score with relatively low sensitivity (71.77%) and high standard deviation (± 10.07), demonstrating substantial variability in detecting apneic events. EDR alone produced moderate accuracy; however, decreased sensitivity (69.85%) and elevated standard deviations confirmed that absence of complementary features reduces stability and increases model fluctuations.

Application of postprocessing improved performance for all individual features—for instance, EDR sensitivity increased from 69.85% to 81.84%—yet average performance remained inferior to the combined features scenario, and elevated standard deviations indicated the inherent limitation of relying on a single feature to address inter-subject variability.

Z-score normalization consistently induced performance degradation. In the RRI–RAMP–EDR combination without postprocessing, F1-score decreased to 75.39% and sensitivity to 71.85%, indicating that standard normalization not only failed to improve performance but disrupted the statistical structure of the features—particularly for nonlinear, derived features such as EDR and RAMP, which are highly dependent on signal dynamics. Increased standard deviations across all Z-score metrics further demonstrate that this normalization approach is unsuitable for derived ECG features.

Even for individual features such as RRI or EDR, Z-score normalization decreased mean accuracy and increased inter-run variability. Sensitivity values, critical in clinical applications, were lower than those observed with the reference method. Although postprocessing improved performance relative to the no-postprocessing condition, results remained significantly inferior to the reference approach, confirming that the primary limitation lies in the normalization method rather than the decision-making stage.

Overall, increased standard deviations directly reflect instability across subjects and model sensitivity to inherent signal variations. The highest standard deviations were observed when Z-score normalization and single-feature inputs were employed, emphasizing the importance of appropriate preprocessing and feature integration.

In conclusion, the results demonstrate:

The optimal performance corresponds to the RRI + RAMP + EDR combination with the reference method and active postprocessing.

Z-score normalization is unsuitable for surface-level and derived features, resulting in decreased accuracy and stability.

Individual features alone cannot provide sufficient stability, as evidenced by their elevated standard deviations and inter-subject variability.

Postprocessing is effective across all scenarios and exerts the greatest impact on sensitivity, which is critical for clinical apnea detection.

Combining multiple complementary features, particularly RRI, RAMP, and EDR, enhances the discriminative capacity for apnea patterns and minimizes model fluctuations.

Table 6

Mean Classification Metrics Over Five Independent Hold-Out Executions for the Initial Nine Models
Model	Postprocessing stage	Accuracy (%)	Specificity (%)	Sensitivity (%)	F1-Score (%)
LSTM	off	83.29 ± 0.56	88.23 ± 3.95	75.28 ± 5.54	77.45 ± 1.11
LSTM	on	88.82 ± 0.70	92.52 ± 3.52	82.85 ± 4.89	84.98 ± 0.95
GRU	off	82.97 ± 0.74	87.72 ± 2.06	75.32 ± 2.97	77.18 ± 1.13
GRU	on	89.48 ± 0.70	92.60 ± 1.73	84.44 ± 3.48	85.96 ± 1.61
BiLSTM	off	82.38 ± 0.70	84.62 ± 2.40	78.77 ± 3.12	77.36 ± 0.89
BiLSTM	on	89.04 ± 0.71	89.87 ± 1.38	87.69 ± 2.44	86.00 ± 1.08
BiGRU	off	82.96 ± 0.84	92.21 ± 1.68	68.05 ± 4.51	75.26 ± 2.10
BiGRU	on	87.43 ± 0.97	92.75 ± 4.05	78.84 ± 8.26	82.58 ± 2.51
CNN	off	80.91 ± 0.87	87.14 ± 2.31	70.87 ± 5.54	73.86 ± 2.28
CNN	on	86.08 ± 1.89	89.64 ± 4.89	80.35 ± 7.21	81.47 ± 2.51
CNN-LSTM	off	82.52 ± 0.46	83.85 ± 3.05	80.37 ± 4.23	77.84 ± 0.82
CNN-LSTM	on	88.09 ± 0.77	88.29 ± 2.82	87.79 ± 3.07	84.95 ± 0.68
CNN-GRU	off	82.95 ± 0.87	84.04 ± 2.00	81.19 ± 1.65	78.47 ± 0.81
CNN-GRU	on	88.68 ± 0.46	90.72 ± 2.55	85.38 ± 3.83	85.13 ± 0.77
CNN-BiLSTM	off	82.20 ± 1.04	85.75 ± 3.74	76.48 ± 6.47	76.58 ± 2.13
CNN-BiLSTM	on	86.68 ± 2.25	87.02 ± 5.96	86.12 ± 7.12	83.16 ± 2.53
CNN-BiGRU	off	82.40 ± 1.23	86.22 ± 4.44	76.23 ± 6.13	76.80 ± 1.84
CNN-BiGRU	on	87.56 ± 1.94	89.57 ± 4.51	84.32 ± 6.83	83.78 ± 2.68

Table 7

Hold-Out Classification Results of the Final Model Using RRI_RAMP_EDR Inputs Under Two Normalization Schemes
Input Features	Normalization	Postprocessing stage	Accuracy (%)	Sensitivity (%)	Specificity (%)	F1-Score (%)
RRI_Ramp_EDR	Reference Method	off	85.20 ± 0.95	74.89 ± 3.43	91.60 ± 1.48	79.45 ± 1.68
RRI_Ramp_EDR	Reference Method	on	89.18 ± 0.94	81.42 ± 3.27	94.00 ± 1.16	85.19 ± 1.57
RRI_Ramp_EDR	Z-score	off	77.60 ± 1.82	82.38 ± 4.40	74.63 ± 5.09	73.80 ± 1.27
RRI_Ramp_EDR	Z-score	on	82.11 ± 2.73	85.83 ± 3.39	79.86 ± 5.55	78.72 ± 2.41

Table 8

Cross-Validation Results (5-Fold) for the Final Selected Model
Input Features	Normalization	Postprocessing stage	Accuracy (%)	Sensitivity (%)	Specificity (%)	F1-Score (%)
RRI_Ramp_EDR	Reference Method	off	86.35 ± 1.24	75.36 ± 4.71	92.82 ± 1.92	80.65 ± 2.53
RRI_Ramp_EDR	Reference Method	on	90.62 ± 1.54	84.15 ± 3.80	94.44 ± 2.22	87.25 ± 1.74
RRI	Reference Method	off	81.94 ± 4.24	74.40 ± 6.55	86.27 ± 4.29	75.66 ± 6.24
RRI	Reference Method	on	87.55 ± 4.73	80.84 ± 6.62	91.37 ± 4.07	83.08 ± 6.53
Ramp	Reference Method	off	80.99 ± 2.31	71.77 ± 10.07	86.39 ± 4.13	73.78 ± 5.35
Ramp	Reference Method	on	85.49 ± 3.78	77.31 ± 15.59	90.31 ± 4.52	79.16 ± 9.06
EDR	Reference Method	off	83.18 ± 2.09	69.85 ± 4.76	91.24 ± 1.01	75.86 ± 3.53
EDR	Reference Method	on	88.37 ± 2.15	81.84 ± 7.98	92.23 ± 1.39	83.94 ± 4.36
RRI_Ramp_EDR	Z-score	off	82.42 ± 1.92	71.85 ± 7.85	88.24 ± 3.72	75.39 ± 4.14
RRI_Ramp_EDR	Z-score	on	87.57 ± 2.49	83.00 ± 6.44	90.00 ± 3.87	83.57 ± 2.74
RRI	Z-score	off	78.58 ± 4.09	70.58 ± 6.50	83.36 ± 3.62	71.29 ± 6.70
RRI	Z-score	on	83.52 ± 4.31	73.11 ± 8.70	89.86 ± 5.26	76.92 ± 6.76
EDR	Z-score	off	78.61 ± 4.45	64.72 ± 12.07	87.08 ± 2.78	69.22 ± 7.38
EDR	Z-score	on	83.74 ± 3.61	72.73 ± 14.81	90.19 ± 4.28	76.41 ± 7.86

3.2. Feature visualization

The t-SNE algorithm was employed to visualize the original RRI, RAMP, and EDR signals, as well as the features extracted by the CNN-Transformer-LSTM model for both training and validation sets (Figs. 10–11). The results indicate that the original signals are widely scattered, making it challenging to distinguish between normal and sleep apnea cases. In contrast, the features extracted by the CNN-Transformer-LSTM model exhibit clear clustering, demonstrating a distinct separation between the two classes in the feature representation space.

Fig. 10

Visualization of the raw input signals using the t-SNE algorithm

Fig. 11

Visualization of the extracted feature maps using the t-SNE algorithm

3.3. Sample First-Run

For each run, the order of training records was randomized while maintaining reproducibility. For example, in Run 1 for the apnea group, 32 records were assigned to the training set and 8 to the validation set, where the first training record was a05 and the first validation record was a10.
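A reproducible record-wise split of this kind can be sketched as follows. The record identifiers and the seed value are hypothetical placeholders, and the use of a seeded shuffle is an assumption about how reproducibility was obtained. Because whole records are assigned to one subset, all segments of a record stay together, which is what prevents leakage between training and validation data.

```python
import random


def split_records(record_ids, n_val, seed):
    """Shuffle record IDs with a fixed seed and split them record-wise."""
    rng = random.Random(seed)      # local generator keeps the split reproducible
    shuffled = list(record_ids)
    rng.shuffle(shuffled)
    return shuffled[n_val:], shuffled[:n_val]   # (training, validation)


# Placeholder IDs standing in for the 40 apnea-group records.
apnea_records = [f"rec{i:02d}" for i in range(1, 41)]
train_ids, val_ids = split_records(apnea_records, n_val=8, seed=1)

print(len(train_ids), len(val_ids))       # 32 8
print(set(train_ids) & set(val_ids))      # set()
```

Calling `split_records` again with the same seed returns the same assignment, while a different seed per run gives a different but still repeatable ordering.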

The model was subsequently executed for each of the remaining records in the borderline and control groups, following the identical procedure applied to the apnea group, and the corresponding transition and emission matrices were computed. Each model run was stored in a dedicated directory to ensure reproducibility and facilitate subsequent analyses. This procedure was repeated across all five folds of the cross-validation process.
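One common way to obtain transition and emission matrices for a two-state HMM is to count events in labelled sequences, as sketched below. This is an assumption about the estimation procedure, not a description of the authors' implementation; the function name, the additive smoothing constant, and the example labels are hypothetical.

```python
def estimate_hmm(true_states, observed_labels, n_classes=2, smoothing=1.0):
    """Estimate row-normalized transition and emission matrices by counting."""
    trans = [[smoothing] * n_classes for _ in range(n_classes)]
    emit = [[smoothing] * n_classes for _ in range(n_classes)]

    # Transition counts: annotated state at step t -> annotated state at step t + 1.
    for prev, curr in zip(true_states, true_states[1:]):
        trans[prev][curr] += 1
    # Emission counts: annotated state -> label produced by the classifier.
    for state, obs in zip(true_states, observed_labels):
        emit[state][obs] += 1

    def normalize(rows):
        return [[count / sum(row) for count in row] for row in rows]

    return normalize(trans), normalize(emit)


# Hypothetical per-segment labels for a short record (0 = normal, 1 = apnea).
annotations = [0, 0, 0, 1, 1, 1, 0, 0]
classifier_output = [0, 1, 0, 1, 1, 0, 0, 0]

transition, emission = estimate_hmm(annotations, classifier_output)
print(transition)
print(emission)
```

The smoothing term keeps every probability strictly positive, which avoids taking the logarithm of zero when the matrices are later used for decoding.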

3.4. Training and Validation Performance

Figures 12 and 13 illustrate the training and validation accuracy and loss curves for both the hold-out and five-fold cross-validation approaches. These results correspond to the second experimental run, providing detailed insight into the model's learning dynamics. The accuracy curves demonstrate progressive improvement across both training and validation datasets, while the loss curves indicate convergence and stability throughout the training process. Comparison of hold-out and cross-validation approaches highlights the model's consistency and generalization capability across different validation strategies.

Fig. 12

The accuracy and loss of training and validation sets for hold-out validation

Fig. 13

The accuracy and loss of training and validation sets for five-fold cross validation

3.5. Hypnogram comparison

Figure 14 presents two hypnograms from the same subject: one comparing the ground truth with the initial classifier output, and the other comparing the ground truth with predictions refined by a hidden Markov model (HMM). The comparison clearly demonstrates the HMM's effectiveness in smoothing predictions and enhancing temporal consistency. These results confirm that incorporating HMMs into the postprocessing stage improves both the accuracy and coherence of sequential classifications in clinical applications.
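The smoothing effect of HMM postprocessing can be illustrated with Viterbi decoding over the classifier's hard labels. The sketch below is a minimal two-state example; the start, transition, and emission probabilities are hypothetical values chosen for illustration, and the decoding scheme itself is an assumption, since the text does not specify how the HMM was applied.

```python
import math


def viterbi(observations, start, trans, emit):
    """Return the most likely hidden-state sequence for the observed labels."""
    n_states = len(start)
    log = math.log
    score = [log(start[s]) + log(emit[s][observations[0]]) for s in range(n_states)]
    backpointers = []

    for obs in observations[1:]:
        step, new_score = [], []
        for s in range(n_states):
            # Best predecessor for state s at this time step.
            best_prev = max(range(n_states), key=lambda p: score[p] + log(trans[p][s]))
            step.append(best_prev)
            new_score.append(score[best_prev] + log(trans[best_prev][s]) + log(emit[s][obs]))
        backpointers.append(step)
        score = new_score

    # Trace the best path backwards from the most likely final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters: states tend to persist, classifier is right 80% of the time.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]

raw = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]   # classifier output (0 = normal, 1 = apnea)
smoothed = viterbi(raw, start, trans, emit)
print(smoothed)   # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

The two isolated label flips are removed because following them would require two unlikely state changes each, while the sustained run of apnea labels is preserved.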

Fig. 14

Hypnogram Comparison Before and After HMM Postprocessing

3.6. Confusion matrix

Confusion matrices constitute essential tools for evaluating model performance, as they clearly illustrate the distribution of correct and erroneous predictions across classes. In clinical applications such as apnea detection, accurate classification of both positive and negative cases is critically important, as misclassifications can have significant diagnostic implications.
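The reported metrics follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions with apnea as the positive class; the counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity, and F1 (as fractions)."""
    sensitivity = tp / (tp + fn)          # recall on apnea segments
    specificity = tn / (tn + fp)          # recall on normal segments
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts; not taken from the paper's confusion matrices.
metrics = classification_metrics(tp=80, fp=20, tn=180, fn=20)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Because normal segments outnumber apnea segments in this example, accuracy (86.67%) exceeds the F1-score (80.00%), which depends only on the positive class.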

As shown in Fig. 15, which corresponds to the second experimental run of the hold-out approach, application of the postprocessing stage resulted in substantial reductions in both false-positive and false-negative predictions. This reduction in classification errors directly contributed to improvements in overall accuracy, sensitivity, and model reliability. Such enhancements are crucial for minimizing diagnostic errors and ensuring reliable identification of apneic events.

Fig. 15

The confusion matrix for hold-out validation

4. Discussion

The findings of this study, based on the analysis of 70 ECG records from the PhysioNet Apnea-ECG database, demonstrate that deep learning models can overcome the inherent limitations of traditional methods relying on manual feature extraction—limitations that have been repeatedly identified in prior literature as the primary obstacle to accurate and robust detection of apnea and hypopnea events. One of the fundamental strengths of this research is its precise and controlled data-driven design. Unlike many previous studies that employed fixed or sometimes arbitrary percentage-based splits for constructing training and validation sets, the present study managed class distribution, segment counts, and record allocation with high precision and transparency. This approach not only minimized the possibility of data leakage between different subsets but also enabled the models to learn genuine ECG signal patterns and prevented bias arising from random sampling—an issue reported as one of the common shortcomings in physiological signal-based research.

The use of a rigorous hold-out strategy, whereby the test set was kept completely separate from the training process, model selection, and hyperparameter tuning, provided grounds for independent model evaluation. This aspect has been less frequently observed in the existing literature, limiting the validity of comparisons with previous findings. However, one of the main challenges observed in analyzing the results is the confusion arising from label quality and the overlap between apnea and hypopnea events—a problem that inherently affects the model's decision boundaries and in some cases reduces the accuracy of detecting positive (patient) samples. In such an unstable context, the application of a post-processing stage based on Hidden Markov Models (HMM) plays an important role in improving temporal coherence, reducing sensitivity to noisy labels, and enhancing discriminability between respiratory subtypes.

Analysis of Table 6 revealed that GRU and BiLSTM architectures demonstrated superior performance compared to other models, although identifying positive (apneic) samples remained challenging. The lower F1-score relative to accuracy indicates that models tend to identify healthy samples with greater accuracy, while the weaker detection of the positive class stems from the same labeling complexities and event overlaps. Moreover, comparison of model training times shows that although BiLSTM provided accurate results (at approximately 2 to 3 minutes per epoch), CNN-based architectures with significantly shorter training times (13 to 48 seconds) represent attractive options for real-time applications and systems deployable in clinical environments. Given that this study applied more rigorous standards than previous research, the stability of model performance under such challenging conditions holds greater scientific value.

Comparison with previous work demonstrated that despite this study's focus on precise data-driven design and independent evaluation, model performance under noisy conditions and complex data remains competitive and in some cases more stable. Nevertheless, the overlap between different types of respiratory events and inter-individual variations in physiological patterns increasingly highlight the need for developing multi-stage, hierarchical models or multi-task learning approaches—approaches that can establish clearer distinctions between obstructive apnea, hypopnea, and related transient events.

From a methodological perspective, the use of early stopping played an important role in reducing overfitting and improving model generalizability. The results presented in Table 9 demonstrate that the proposed model achieved acceptable performance in terms of the combination of accuracy and sensitivity, confirming the importance of proper selection of validation strategies (5-fold and hold-out), precise data configurations, and the presence of efficient post-processing.
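The early-stopping rule mentioned above can be sketched as a patience counter on the validation loss. The patience value, the `min_delta` threshold, and the loss values below are hypothetical; the text does not report the settings that were used.

```python
def run_early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember this epoch (its weights would be kept).
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch      # stop training here
    return len(val_losses) - 1, best_epoch    # stopping rule never triggered


# Hypothetical validation losses, one value per epoch (epochs counted from 0).
losses = [0.60, 0.52, 0.48, 0.49, 0.50, 0.51, 0.47]
stop_epoch, best_epoch = run_early_stopping(losses, patience=3)
print(stop_epoch, best_epoch)   # 5 2
```

In this example training halts at epoch 5 and the weights from epoch 2 would be restored, so the later improvement at epoch 6 is never reached; a larger patience trades longer training for a lower risk of stopping too early.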

Despite significant achievements, the present study illuminates multiple avenues for future research. First, replacing the Pan-Tompkins algorithm with more advanced methods such as Hamilton, adaptive monitoring filters, or time-frequency domain techniques could enhance the accuracy of surface feature extraction. Second, extending the temporal window length from 60 seconds to longer intervals (3 to 5 minutes) could provide the model with information related to slower physiological dynamics and increase its sensitivity to more complex events. Third, combining deep learning-based models with hierarchical classification, probabilistic models, or Transformer architectures could improve decision boundaries under noisy data conditions and enhance the model's ability to detect subtypes. Finally, conducting cross-database validation using independent datasets could more precisely determine the model's generalizability when confronted with real clinical conditions—an essential requirement for practical deployment of such systems in medical environments.

Overall, the results of this study demonstrate that the combination of precise data design, use of rigorous validation strategies, employment of HMM as post-processing, and utilization of early stopping provides a powerful and efficient framework for ECG-based sleep apnea detection. However, evolution of feature extraction methods, development of multi-stage classifiers, and validation under real-world conditions can lead to significant improvements in the efficiency and stability of automated diagnostic systems in subsequent steps [19], [21], [23], [33], [37]–[46].

Table 9

Comparison and Analysis of Research Findings in Relation to Previous Studies
Study	Feature extracted	Classifier	Evaluation type	Accuracy (%)	Recall (%)	Specificity (%)
(Hung-Yu Chang et al. 2020) [37]	Raw ECG	CNN	Shuffled segments	87.90	81.10	92.00
(Sharan et al. 2020) [38]	RRI, time and frequency domain	CNN	Shuffled segments	88.23	82.74	91.62
(Mukherjee et al. 2021) [39]	RRI, RAMP, EDR	Trainable ensemble using MLP	Shuffled segments	85.58	84.43	88.26
(Bahrami et al. 2021) [33]	RRI, RAMP	LeNet + LSTM	Shuffled segments	80.67	75.04	84.13
(Rajabrundha et al. 2022) [40]	RRI	LSTM	Shuffled segments	85.62	82.71	-
(Bahrami et al. 2022) [41]	RRI, RAMP	ZFNet-BiLSTM	Shuffled segments	88.13	81.49	92.27
(Fang et al. 2022) [42]	RRI	ResNet-Multiscale	Shuffled segments	86.00	84.10	87.10
(Yeh et al. 2022) [19]	ECG (18.75–25 Hz subband)	CNN	Shuffled segments	88.60	83.80	91.50
(Tyagi et al. 2023) [23]	HRV, EDR	FT-EDBN	Shuffled segments	89.11	83.89	92.28
(Liu et al. 2023) [21]	Raw ECG	CNN-Transformer	Shuffled segments	88.20	78.50	94.10
(Kollu Praveen Kumar et al. 2024) [43]	Raw ECG	CNN-LSTM	Segment-based evaluation	86.18	-	-
(D. Padovano et al. 2025) [44]	Custom CNN, HRV’s DM	AlexNet	External validation	74.72	73.99	75.17
(J. Gupta et al. 2025) [45]	RRI, RA	MLP with a time window	Shuffled segments	86.80	82.40	89.50
(M. Scarpetta et al. 2025) [46]	Raw ECG	CNN	Shuffled segments	81.00	-	-
Proposed method	RRI, RAMP, EDR	CNN-Transformer-LSTM	Cross record	90.62	84.15	94.44

5. Conclusion

This study proposes a principled framework for improving automated sleep apnea detection by integrating temporal consistency into deep learning pipelines through Hidden Markov Model (HMM)-based postprocessing. Rather than relying solely on architectural modifications, the findings emphasize the pivotal role of temporal modeling in bridging the persistent gap between high algorithmic accuracy and practical clinical usability. By addressing the issue of fragmented and physiologically implausible predictions, the proposed method shifts the focus from static, frame-level classification toward more coherent and clinically interpretable decision-making processes.

The consistent performance gains observed across a variety of deep learning models underscore the generalizability and flexibility of the proposed postprocessing approach. These enhancements suggest that temporal refinement is not a superficial adjustment but rather a core element necessary for building trustworthy, real-world decision-support systems in sleep medicine. The method's compatibility with different network architectures reinforces its potential for widespread adoption across diverse clinical and technical environments.

From a broader perspective, the paradigm of coupling sequence-level temporal smoothing with data-driven models opens new research directions in biomedical signal analysis, especially in domains where temporal fidelity is as crucial as classification accuracy. By establishing a robust foundation for temporal consistency, this work contributes to the development of more reliable, context-aware diagnostic systems capable of functioning in real-time clinical workflows.

Future efforts will aim to further enhance the model’s interpretability and generalization through the incorporation of attention mechanisms and ensemble learning strategies. Additionally, advanced preprocessing techniques—including data augmentation and targeted feature selection—will be explored to mitigate challenges related to dataset size limitations and inter-subject variability. Collectively, these advancements are expected to foster the development of intelligent, adaptive diagnostic platforms with high translational value in both sleep medicine and broader biomedical applications.

Acknowledgements

The authors appreciate the publicly accessible Apnea-ECG database, which enabled the development and evaluation of the proposed methodology.

Author Contribution

H.GH. and A.K. conceived and supervised the study, defined the research objectives, and guided the methodological framework. S.Q. designed key components of the CNN-Transformer-LSTM architecture, implemented ECG feature extraction modules (RRI, EDR, RAMP), and performed statistical analyses. S.Q. also handled data preprocessing, implemented the record-wise partitioning strategy, and conducted all training and evaluation experiments. H.GH. contributed to the development of signal-integration mechanisms and carried out comparative analyses with baseline models. H.GH. and S.Q. interpreted the results and contributed to scientific discussions. All authors participated in manuscript preparation, critically reviewed the content, and approved the final version.

Data Availability

The dataset used in this study is publicly available on PhysioNet and can be accessed at: https://physionet.org/content/apnea-ecg/1.0.0/

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding

Consent to publish

All authors approve the manuscript and give their consent for submission and publication as open access.

References

Abbasi, A. et al. A comprehensive review of obstructive sleep apnea. Brazilian Association of Sleep and Latin American Federation of Sleep Societies. 10.5935/1984-0063.20200056 (2021).

Salman, L. A., Shulman, R. & Cohen, J. B. Obstructive Sleep Apnea, Hypertension, and Cardiovascular Risk: Epidemiology, Pathophysiology, and Management, Feb. 01, Springer. (2020). 10.1007/s11886-020-1257-y

Goldstein, C. et al. Polysomnography validation of SANSA to detect obstructive sleep apnea. Front. Neurol. 16 10.3389/fneur.2025.1592690 (2025).

Ozkan, H. et al. A portable wearable tele-ECG monitoring system. IEEE Trans. Instrum. Meas. 69 (1), 173–182. 10.1109/TIM.2019.2895484 (Jan. 2020).

Wang, L., Lin, Y. & Wang, J. A RR interval based automated apnea detection approach using residual network. Comput. Methods Programs Biomed. 176, 93–104. 10.1016/j.cmpb.2019.05.002 (Jul. 2019).

Cartwright, R. Obstructive Sleep Apnea: A Sleep Disorder With Major Effects on Health. Disease-a-Month. [Online]. Available: www.mosby.com/disamonth (2001).

Varon, C., Caicedo, A., Testelmans, D., Buyse, B. & Van Huffel, S. A Novel Algorithm for the Automatic Detection of Sleep Apnea from Single-Lead ECG, IEEE Trans Biomed Eng, vol. 62, no. 9, pp. 2269–2278, Sep. (2015). 10.1109/TBME.2015.2422378

Babaeizadeh, S., White, D. P., Pittman, S. D. & Zhou, S. H. Automatic detection and quantification of sleep apnea using heart rate variability, in Journal of Electrocardiology, Nov. pp. 535–541. (2010). 10.1016/j.jelectrocard.2010.07.003

Pinho, A., Pombo, N., Silva, B. M. C., Bousson, K. & Garcia, N. Towards an accurate sleep apnea detection based on ECG signal: The quintessential of a wise feature selection, Applied Soft Computing Journal, vol. 83, Oct. (2019). 10.1016/j.asoc.2019.105568

10.

Sharma, M., Agarwal, S. & Acharya, U. R. Application of an optimal class of antisymmetric wavelet filter banks for obstructive sleep apnea diagnosis using ECG signals. Comput. Biol. Med. 100, 100–113. 10.1016/j.compbiomed.2018.06.011 (Sep. 2018).

11.

Hassan, A. R. & Haque, M. A. An expert system for automated identification of obstructive sleep apnea from single-lead ECG using random under sampling boosting. Neurocomputing 235, 122–130. 10.1016/j.neucom.2016.12.062 (Apr. 2017).

12.

Feng, K., Qin, H., Wu, S., Pan, W. & Liu, G. A Sleep Apnea Detection Method Based on Unsupervised Feature Learning and Single-Lead Electrocardiogram. IEEE Trans. Instrum. Meas. 70 10.1109/TIM.2020.3017246 (2021).

13.

Viswabhargav, C. S. S. S., Tripathy, R. K. & Acharya, U. R. Automated detection of sleep apnea using sparse residual entropy features with various dictionaries extracted from heart rate and EDR signals. Comput. Biol. Med. 108, 20–30. 10.1016/j.compbiomed.2019.03.016 (May 2019).

14.

Li, K., Pan, W., Li, Y., Jiang, Q. & Liu, G. A method to detect sleep apnea based on deep neural network and hidden Markov model using single-lead ECG signal, Neurocomputing, vol. 294, pp. 94–101, Jun. (2018). 10.1016/j.neucom.2018.03.011

15.

Cheng, M., Sori, W. J., Jiang, F., Khan, A. & Liu, S. Recurrent Neural Network Based Classification of ECG Signal Features for Obstruction of Sleep Apnea Detection, in Proceedings – 2017 IEEE International Conference on Computational Science and Engineering and IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, CSE and EUC 2017, Institute of Electrical and Electronics Engineers Inc., Aug. pp. 199–202. (2017). 10.1109/CSE-EUC.2017.220

16.

Urtnasan, E., Park, J. U. & Lee, K. J. Multiclass classification of obstructive sleep apnea/hypopnea based on a convolutional neural network from a single-lead electrocardiogram. Physiol. Meas. 39 (6). 10.1088/1361-6579/aac7b7 (Jun. 2018).

17.

Zhou, H., Chen, S., Xu, Z. & Zheng, W. A spatio-temporal learning-based model for sleep apnea detection using single-lead ECG signals.

18.

Yang, Q., Zou, L., Wei, K. & Liu, G. Obstructive sleep apnea detection from single-lead electrocardiogram signals using one-dimensional squeeze-and-excitation residual group network. Comput. Biol. Med. 140. 10.1016/j.compbiomed.2021.105124 (Jan. 2022).

19.

Yeh, C. Y., Chang, H. Y., Hu, J. Y. & Lin, C. C. Contribution of Different Subbands of ECG in Sleep Apnea Detection Evaluated Using Filter Bank Decomposition and a Convolutional Neural Network. Sensors 22 (2). 10.3390/s22020510 (Jan. 2022).

20.

Zhou, Y., He, Y. & Kang, K. OSA-CCNN: Obstructive Sleep Apnea Detection Based on a Composite Deep Convolution Neural Network Model using Single-Lead ECG signal, in Proceedings – 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022, pp. 1840–1845. Institute of Electrical and Electronics Engineers Inc. 10.1109/BIBM55620.2022.9995675 (2022).

21.

Liu, H., Cui, S., Zhao, X. & Cong, F. Detection of obstructive sleep apnea from single-channel ECG signals using a CNN-transformer architecture. Biomed. Signal. Process. Control. 82. 10.1016/j.bspc.2023.104581 (Apr. 2023).

22.

Qin, H. & Liu, G. A dual-model deep learning method for sleep apnea detection based on representation learning and temporal dependence. Neurocomputing 473, 24–36. 10.1016/j.neucom.2021.12.001 (Feb. 2022).

23.

Kumar Tyagi, P. & Agrawal, D. Automatic detection of sleep apnea from single-lead ECG signal using enhanced-deep belief network model. Biomed. Signal. Process. Control. 80. 10.1016/j.bspc.2022.104401 (Feb. 2023).

24.

Chen, Y. et al. RAFNet: Restricted attention fusion network for sleep apnea detection. Neural Netw. 162, 571–580. 10.1016/j.neunet.2023.03.019 (May 2023).

25.

Fan, X., Chen, X., Ma, W. & Gao, W. BAFNet: Bottleneck Attention Based Fusion Network for Sleep Apnea Detection. IEEE J. Biomed. Health Inf. 28 (5), 2473–2484. 10.1109/JBHI.2023.3278657 (May 2024).

26.

Penzel, T., Moody, G. B., Mark, R. G. & Goldberger, A. L. The Apnea-ECG Database. Computers in Cardiology 27 (2000).

27.

Zhaokun, Z. & Jinbao, L. Multi-Feature Information Fusion LSTM-RNN Detection for OSA. J. Comput. Res. Dev. 57 (12), 2547–2555. 10.7544/issn1000-1239.2020.20190583 (2020).

28.

Alfaouri, M. & Daqrouq, K. ECG Signal Denoising By Wavelet Transform Thresholding. Am. J. Appl. Sci. 5 (3), 276–281 (2008).

29.

Pan, J. & Tompkins, W. J. A Real-Time QRS Detection Algorithm (1985).

30.

Han, D., Rao, Y. N., Principe, J. C. & Gugel, K. Real-Time PCA (Principal Component Analysis) implementation on DSP.

31.

Faust, O., Barika, R., Shenfield, A., Ciaccio, E. J. & Acharya, U. R. Accurate detection of sleep apnea with long short-term memory network based on RR interval signals. Knowl. Based Syst. 212, 106591 (2021).

32.

33.

Bahrami, M. Detection of Sleep Apnea from Single-Lead ECG: Comparison of Deep Learning Algorithms.

34.

Chen, L. et al. Review of image classification algorithms based on convolutional neural networks. Remote Sens. 13 (22). 10.3390/rs13224712 (Nov. 2021).

35.

Pham, D. T. & Mouček, R. Efficient sleep apnea detection using single-lead ECG: A CNN-Transformer-LSTM approach. Comput. Biol. Med. 196. 10.1016/j.compbiomed.2025.110655 (Sep. 2025).

36.

McShane, B. B. Machine Learning Methods with Time Series Dependence. [Online]. Available: http://repository.upenn.edu/edissertations/122 (2010).

37.

Chang, H. Y., Yeh, C. Y., Lee, C. T. & Lin, C. C. A sleep apnea detection system based on a one-dimensional deep convolution neural network model using single-lead electrocardiogram. Sensors 20 (15), 1–15. 10.3390/s20154157 (Aug. 2020).

38.

Sharan, R. V., Berkovsky, S., Xiong, H. & Coiera, E. ECG-derived heart rate variability interpolation and 1-D convolutional neural networks for detecting sleep apnea, in 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 637–640 (2020).

39.

Mukherjee, D., Dhar, K., Schwenker, F. & Sarkar, R. Ensemble of deep learning models for sleep apnea detection: An experimental study. Sensors 21 (16). 10.3390/s21165425 (Aug. 2021).

40.

Rajabrundha, A., Lakshmisangeetha, A. & Balajiganesh, A. Analysis of Sleep apnea Considering Electrocardiogram Data Using Deep learning Algorithms, in Journal of Physics: Conference Series. Institute of Physics. 10.1088/1742-6596/2318/1/012009 (2022).

41.

Bahrami, M. & Forouzanfar, M. Sleep Apnea Detection From Single-Lead ECG: A Comprehensive Analysis of Machine Learning and Deep Learning Algorithms. IEEE Trans. Instrum. Meas. 71, 1–11. 10.1109/TIM.2022.3151947 (2022).

42.

Fang, H., Lu, C., Hong, F., Jiang, W. & Wang, T. Sleep Apnea Detection Based on Multi-Scale Residual Network. Life 12 (1). 10.3390/life12010119 (Jan. 2022).

43.

Kumar, K. P., Vijay, K., Harshitha, G. B., Kumar, V. H. & Abhishek, P. ECG-based Sleep Apnea Detection using Convolutional and Long Short-Term Memory Networks, in 5th International Conference for Emerging Technology, INCET 2024. Institute of Electrical and Electronics Engineers Inc. 10.1109/INCET61516.2024.10593282 (2024).

44.

Padovano, D., Martinez-Rodrigo, A., Pastor, J. M., Rieta, J. J. & Alcaraz, R. Deep Learning and Recurrence Information Analysis for the Automatic Detection of Obstructive Sleep Apnea. Appl. Sci. 15 (1). 10.3390/app15010433 (Jan. 2025).

45.

Gupta, J. & Seeja, K. R. An Explainable AI Approach Towards Automatic Sleep Apnea Detection Based on ECG Signal, in Procedia Computer Science, pp. 937–946. Elsevier B.V. 10.1016/j.procs.2025.04.332 (2025).

46.

Scarpetta, M., Ragolia, M. A., Pietro Pau, D., Andria, G. & Giaquinto, N. A Tiny Deep Learning Model for Sleep Apnea Detection Based on ECG Signals, in IEEE International Symposium on Medical Measurements and Applications, MeMeA. Institute of Electrical and Electronics Engineers Inc. 10.1109/MeMeA65319.2025.11068052 (2025).
