Comparing Algorithm Effectiveness in Health Data Analysis

Northern Border University

Abdulmalik Hazaa Amer Al-Shammari

Arar, Saudi Arabia

lolyham889@gmail.com

Abstract—

Stroke remains one of the leading causes of death and long-term disability worldwide, highlighting the critical need for early detection and prevention. Recent advancements in machine learning (ML) offer promising solutions for identifying individuals at high risk based on clinical and demographic factors. This study presents a comparative analysis of six supervised ML algorithms, namely Logistic Regression (LR), Random Forest (RF), XGBoost (XGB), K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), and Support Vector Classifier (SVC), for predicting stroke occurrence using the Healthcare Stroke Dataset from Kaggle. Comprehensive preprocessing steps were applied, including handling missing values, encoding categorical variables, normalization, and addressing class imbalance through the Synthetic Minority Oversampling Technique (SMOTE). Each model was evaluated using five-fold cross-validation based on Accuracy, Precision, Recall, and F1-score. The results show that RF achieved the highest overall accuracy (0.928), while SVC achieved the highest recall (0.790), indicating its superior sensitivity in detecting true stroke cases. The findings demonstrate that integrating data balancing with multi-metric evaluation significantly enhances predictive performance and clinical reliability in stroke prediction systems.

Keywords—

Stroke prediction

Machine Learning

Random Forest

Support Vector Classifier

SMOTE

Cross-validation

Healthcare Data

Data Science

Medical Analytics

1. Introduction

Stroke is among the most severe and widespread medical conditions, representing a major global cause of mortality and long-term disability. According to the World Health Organization (WHO), approximately 15 million people suffer a stroke every year, of which nearly one-third die, while another one-third are left permanently disabled. Stroke occurs when the blood supply to the brain is interrupted or significantly reduced, depriving brain tissue of oxygen and essential nutrients. Since brain cells begin to die within minutes, early detection and prediction are critical to saving lives and preventing permanent neurological damage.

Recent advances in Machine Learning (ML) and data analytics have provided the medical field with new predictive tools capable of identifying high-risk patients based on clinical and demographic features. ML algorithms can analyze large datasets, detect subtle nonlinear relationships between medical variables, and generate early warnings that help healthcare professionals make timely and informed decisions.

In this study, we focus on developing a comprehensive stroke prediction model based on multiple supervised machine learning algorithms. The analysis utilizes the Healthcare Stroke Dataset from Kaggle, which contains various health indicators such as age, gender, hypertension, heart disease, BMI, average glucose level, smoking status, and residence type.

A total of six classification algorithms were implemented and compared to identify the most effective one for stroke prediction:

1.Logistic Regression, a probabilistic linear model that serves as the baseline classifier.

2.Random Forest Classifier, an ensemble model based on decision trees designed to reduce overfitting and improve prediction stability.

3.XGBoost Classifier, a gradient boosting algorithm that enhances accuracy through sequential model optimization.

4.K-Nearest Neighbors (KNN), a distance-based classifier that predicts the outcome based on the similarity to neighboring samples.

5.Multilayer Perceptron (MLP), a neural-network-based model capable of capturing nonlinear dependencies between features.

6.Support Vector Classifier (SVC), a kernel-based classifier suited to high-dimensional data with non-linear class boundaries.

These models were trained, tested, and evaluated using standard performance metrics including Accuracy, Precision, Recall, F1-Score, and Confusion Matrix. Additionally, preprocessing steps such as missing value handling, outlier detection, and categorical feature encoding were performed to enhance model accuracy and consistency.

The primary objective of this research is to perform a comparative analysis of multiple ML algorithms in predicting stroke occurrence and to determine the most efficient model in terms of both accuracy and computational cost. This study aims to demonstrate the practical role of artificial intelligence in preventive healthcare by providing clinicians with a reliable data-driven system for stroke risk assessment.

2. Research Question (RQ)

Which supervised machine learning algorithm provides the most accurate and clinically reliable predictions for stroke occurrence when applied to the Healthcare Stroke Dataset after addressing data imbalance through SMOTE and using cross-validation evaluation?

3. Literature Review

In recent years, the application of machine learning (ML) in the healthcare domain has gained significant momentum, particularly in predicting stroke risk and patient outcomes. Stroke is one of the leading causes of mortality and disability worldwide, and early detection of potential cases can drastically improve treatment outcomes. Consequently, researchers have increasingly leveraged machine learning algorithms to develop predictive models that analyze patient demographics, medical history, and lifestyle factors [1],[2].

Wang et al. [1] conducted a systematic review of ML-based models used for stroke outcome prediction. Their study compared logistic regression, support vector machines (SVM), and ensemble methods, concluding that ensemble-based approaches achieved superior generalization across clinical datasets. Similarly, Asadi et al. [2] reviewed over 70 studies between 2019 and 2023 and reported that Random Forest (RF) and XGBoost (XGB) algorithms consistently outperformed traditional classifiers, highlighting their robustness to noise and non-linear feature interactions. Heo et al. [3] discussed the integration of deep learning (DL) and conventional ML models, emphasizing the potential of neural networks to identify subtle medical patterns when sufficient data are available.

Zhou et al. [4] proposed a stacked ML framework that combined multiple classifiers for stroke prediction. Their approach integrated feature selection and preprocessing techniques to handle high-dimensional healthcare data, achieving improved accuracy compared to single-model baselines. Zhang et al. [5] applied ML models to the MIMIC database to predict postoperative stroke risk among elderly patients, finding that RF and logistic regression yielded the best balance between interpretability and predictive power. Choi et al. [6] extended this approach through a multicenter study demonstrating that ML-based stroke outcome prediction models could outperform clinical scoring systems when trained on diverse datasets.

Khosla et al. [7] provided a comprehensive review of AI-driven risk assessment methods for stroke prevention, emphasizing the importance of data quality, feature engineering, and model interpretability. In a similar vein, Miah et al. [8] employed ensemble learning with heterogeneous clinical features, demonstrating that integrating multiple models improved reliability and minimized overfitting. Jani et al. [9] utilized electronic health records (EHR) to build predictive models and found that including behavioral variables such as smoking and BMI significantly improved recall for stroke detection. Jabbar et al. [10] evaluated common ML classifiers for stroke prediction and concluded that tree-based models exhibited higher diagnostic accuracy than linear classifiers, aligning with trends observed in later studies.

3.1 Handling Data Imbalance in Medical Prediction

A common challenge in medical prediction is class imbalance, where stroke cases represent a small fraction of the dataset. The seminal work by Chawla et al. [11] introduced the Synthetic Minority Oversampling Technique (SMOTE) to address this issue by synthetically generating minority samples. Subsequent studies [12], [13] confirmed that SMOTE improves recall and F1-score without significantly compromising accuracy. In healthcare analytics, SMOTE has become a standard pre-processing step for rare disease prediction tasks.

He and Garcia [12] presented one of the earliest evaluations of learning algorithms under imbalanced distributions, showing that rebalancing techniques greatly influence model fairness and sensitivity. Fernández et al. [13] extended this discussion through a comprehensive review, noting that combining SMOTE with ensemble learners—such as RF or boosting algorithms—provides an optimal balance between recall and precision in medical datasets.

3.2 Algorithmic Foundations

Each ML algorithm offers unique advantages for stroke prediction. The Support Vector Classifier (SVC), formulated by Cortes and Vapnik [14], performs well in high-dimensional data and remains robust against overfitting, particularly when kernel functions are applied. Breiman [15] introduced the Random Forest (RF) algorithm, which aggregates multiple decision trees to reduce variance and increase generalization — a property that makes it suitable for heterogeneous medical data.

The Extreme Gradient Boosting (XGBoost) framework proposed by Chen and Guestrin [16] refined traditional boosting by incorporating regularization and parallelization, resulting in superior computational efficiency. The K-Nearest Neighbors (KNN) classifier [17], though simple, is valuable for small and well-separated datasets, while Multilayer Perceptrons (MLP), described by Rumelhart et al. [18], model complex non-linear dependencies and are effective for capturing intricate health patterns. Logistic Regression (LR), a baseline probabilistic model detailed in Hastie et al. [21], remains a benchmark due to its interpretability and efficiency on structured medical data.

3.3 Model Evaluation and Cross-Validation

To ensure model robustness, multiple studies have recommended using cross-validation and performance metrics beyond accuracy [19], [20], [23]. Fawcett [19] presented ROC analysis as a means to assess classifier performance, whereas Saito and Rehmsmeier [20] argued that Precision-Recall curves are more informative for imbalanced datasets. Kohavi [23] provided an extensive analysis of cross-validation techniques, which have since become a standard approach for evaluating ML models, including in medical settings.

Guyon and Elisseeff [22] highlighted the importance of feature selection, showing that irrelevant variables can degrade model generalization and interpretability. Branco et al. [24] and Raschka [25] emphasized that model evaluation and algorithm selection should account for domain-specific class distribution, feature relevance, and metric prioritization. These insights justify the methodological decisions made in this study — specifically the adoption of five-fold cross-validation, SMOTE balancing, and multi-metric evaluation (Accuracy, Precision, Recall, and F1).

3.4 Dataset and Data Sources

This study employed the Healthcare Stroke Dataset published on Kaggle by Alsawy [26]. The dataset covers a broad mix of demographic, clinical, and behavioral attributes, making it suitable for benchmarking ML-based stroke prediction models. Previous works [1], [5], [9] have demonstrated the dataset’s effectiveness in replicating clinical findings, while its open-source nature supports reproducibility and comparison across studies.

3.5 Summary and Research Gap

Across the reviewed literature, a consistent pattern emerges: ensemble and kernel-based algorithms (RF, XGB, SVC) achieve superior results in stroke prediction compared to simple linear models. However, most prior studies either focused on single-model evaluation or lacked comprehensive treatment of data imbalance and validation consistency. Moreover, while several works explored neural networks [3], [18], few have systematically compared classical ML algorithms under balanced datasets generated by SMOTE.

Another observed gap is the limited integration of cross-validation with oversampling methods: many studies evaluated models on a single test split, which may inflate performance estimates. This research addresses these limitations by:

1.Implementing six diverse ML algorithms under consistent preprocessing and validation conditions.

2.Combining SMOTE balancing with five-fold cross-validation for robust evaluation.

3.Prioritizing clinically relevant metrics (Recall and F1) over pure accuracy, aligning with medical practice priorities.

In summary, this work advances prior literature by offering a comparative, reproducible, and clinically interpretable framework for stroke prediction using machine learning.

4. Methodology

This section describes the methodological framework adopted in this study to predict stroke occurrence using multiple supervised machine learning algorithms. The process was systematically divided into data collection, preprocessing, handling class imbalance, model development, and performance evaluation. Figure 1 (conceptual) summarizes the workflow of the study, starting from dataset acquisition to model validation.

4.1 Dataset Description

The dataset used in this study is the Healthcare Stroke Dataset developed by Alsawy [26] and obtained from the Kaggle repository. It consists of 5,110 patient records with 12 attributes, including both numerical and categorical variables.

The attributes encompass demographic, medical, and behavioral factors such as:

•Age (in years)

•Gender (Male, Female, Other)

•Hypertension (0 = No, 1 = Yes)

•Heart disease (0 = No, 1 = Yes)

•Marital status (recorded as “ever married”)

•Work type (Private, Self-employed, Government, etc.)

•Residence type (Urban or Rural)

•Average glucose level

•Body mass index (BMI)

•Smoking status (formerly smoked, never smoked, etc.)

•Stroke (Target variable) – 0 for non-stroke and 1 for stroke cases.

The dataset is highly imbalanced, with approximately 95% non-stroke cases and 5% stroke cases, necessitating specialized techniques to balance the target variable before training.

Fig. 2

Visual representation of the class distribution in the Healthcare Stroke Dataset, showing that 95% of samples belong to the non-stroke class and only 5% represent stroke cases. This imbalance motivated the use of SMOTE for data balancing before training.

4.2 Data Preprocessing

Data preprocessing was a crucial step to ensure quality and consistency. Missing values in BMI and smoking status were imputed using mean and mode imputation techniques respectively.

Categorical features were encoded using Label Encoding for binary variables (e.g., gender, residence type) and One-Hot Encoding for multi-class variables (e.g., work type, smoking status).

All numerical attributes were rescaled to a common range using Min–Max normalization, ensuring uniform scale across features and improving model convergence.
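The imputation, encoding, and scaling steps described in this subsection can be sketched roughly as follows. This is a minimal illustration on a toy table, not the study's actual code: the column names, the toy values, and the helper name preprocess are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute, encode, and rescale a stroke-style table."""
    out = df.copy()
    # Mean imputation for the numerical BMI column.
    out["bmi"] = out["bmi"].fillna(out["bmi"].mean())
    # Mode imputation for the categorical smoking-status column.
    most_common = out["smoking_status"].mode().iloc[0]
    out["smoking_status"] = out["smoking_status"].fillna(most_common)
    # Label encoding for a binary variable.
    out["residence_type"] = out["residence_type"].map({"Rural": 0, "Urban": 1})
    # One-hot encoding for multi-class variables.
    out = pd.get_dummies(out, columns=["work_type", "smoking_status"], dtype=int)
    # Min-Max normalization of the numerical columns to the [0, 1] range.
    num_cols = ["age", "avg_glucose_level", "bmi"]
    out[num_cols] = MinMaxScaler().fit_transform(out[num_cols])
    return out


# Toy records (assumed values, not taken from the dataset).
toy = pd.DataFrame({
    "age": [67.0, 45.0, 80.0, 23.0],
    "avg_glucose_level": [228.7, 95.1, 105.9, 70.2],
    "bmi": [36.6, np.nan, 32.5, 24.0],
    "residence_type": ["Urban", "Rural", "Rural", "Urban"],
    "work_type": ["Private", "Self-employed", "Private", "Govt_job"],
    "smoking_status": ["formerly smoked", "never smoked", None, "never smoked"],
})
clean = preprocess(toy)
print(clean.head())
```

In a leakage-sensitive setup, the scaler and the imputation statistics would normally be fitted on the training portion only and then applied to the held-out portion; the sketch fits them on the whole toy table purely for brevity.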

Furthermore, correlations between numerical variables were examined using a heatmap matrix, confirming moderate linear relationships between features like age, glucose level, and stroke incidence. Redundant or highly correlated variables were removed to minimize multicollinearity and overfitting.
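A correlation-based filter of this kind might look like the sketch below. The 0.9 cut-off, the helper name drop_highly_correlated, and the derived column glucose_mmol are all assumptions for illustration; the source does not state which threshold or which variables were involved.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9):
    """Hypothetical helper: drop one column from each highly correlated pair."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: two independent features plus a redundant unit conversion.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.uniform(20, 80, 200),
    "avg_glucose_level": rng.uniform(60, 250, 200),
})
toy["glucose_mmol"] = toy["avg_glucose_level"] / 18.0  # perfectly correlated copy

reduced, dropped = drop_highly_correlated(toy)
print(dropped)  # ['glucose_mmol']
```

The same absolute correlation matrix is what a heatmap would display, so the filter and the figure can share one computation.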

Fig. 3

Correlation Heatmap between Dataset Variables:

The heatmap visualizes the relationships among numerical features such as age, glucose level, BMI, and stroke status, highlighting moderate positive correlations between age and stroke occurrence.

4.3 Handling Data Imbalance

To address the significant class imbalance in the dataset, the Synthetic Minority Oversampling Technique (SMOTE) was employed [11], [13].

SMOTE works by generating synthetic examples of the minority class (stroke) through interpolation between existing minority samples, effectively balancing the target variable without duplicating data.

This approach ensured that the models did not bias toward the majority (non-stroke) class. The balanced dataset after SMOTE achieved approximately a 1:1 ratio, significantly improving recall and F1-score in preliminary testing.
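The interpolation idea behind SMOTE can be illustrated with a short, hand-written simplification. This is not the reference implementation from [11] or from any library; the function name smote_sketch, the neighbor count, and the toy points are assumptions for the example.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between minority points and their neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other minority points
    # Pairwise Euclidean distances among minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base point and one of its k nearest minority neighbors.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place the synthetic point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Toy example: 95 majority vs 5 minority records, so 90 synthetic points
# are needed to reach a 1:1 ratio.
rng = np.random.default_rng(1)
X_minority = rng.normal(loc=2.0, scale=0.5, size=(5, 2))
synthetic = smote_sketch(X_minority, n_new=90)
print(synthetic.shape)  # (90, 2)
```

Because every synthetic point lies on a segment between two real minority points, no record is duplicated exactly, which matches the description above.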

4.4 Model Development

A total of six supervised machine learning algorithms were implemented and compared under identical experimental settings:

1.Logistic Regression (LR) – a linear baseline classifier for comparison.

2.Random Forest (RF) – an ensemble model using multiple decision trees for stability and accuracy.

3.Extreme Gradient Boosting (XGB) – a boosting-based ensemble technique optimizing residual errors iteratively.

4.K-Nearest Neighbors (KNN) – a distance-based non-parametric classifier effective for localized patterns.

5.Multilayer Perceptron (MLP) – a feed-forward neural network trained with backpropagation to capture complex nonlinear relations.

6.Support Vector Classifier (SVC) – a kernel-based method optimized for high-dimensional data with non-linear separation.

All models were implemented using Scikit-learn and XGBoost libraries in Python 3.10. Hyperparameters were initially tuned using grid search and validated through cross-validation to prevent overfitting.

Each model was trained on 80% of the dataset (training set) and tested on the remaining 20% (test set) after SMOTE resampling.
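A setup of this kind, with six classifiers trained under one shared 80/20 split, could be sketched as follows. The data here is synthetic and roughly balanced, standing in for the resampled dataset. All hyperparameter values are assumptions, and scikit-learn's GradientBoostingClassifier is used as a stand-in for XGBoost so that the example depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the stroke dataset).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=42
)

# Stratified 80/20 split so both parts keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumed hyperparameters; the tuned values are not given in the text.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGB (stand-in)": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=42),
    "SVC": SVC(kernel="rbf"),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

for name, scores in results.items():
    print(f"{name:15s} recall={scores['recall']:.3f} f1={scores['f1']:.3f}")
```

Keeping the split and the random seeds fixed across all models is what makes the comparison "identical experimental settings" in practice.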

4.5 Model Evaluation

Model performance was assessed using multiple metrics to ensure comprehensive evaluation beyond mere accuracy.

The following evaluation metrics were computed for each model:

•Accuracy (ACC): Overall correctness of predictions.

•Precision: Proportion of correctly predicted stroke cases out of all predicted stroke cases.

•Recall (Sensitivity): Ability of the model to correctly identify actual stroke cases.

•F1-Score: Harmonic mean of precision and recall, providing a balance between the two.

•Confusion Matrix: Visual representation of classification outcomes.
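For reference, the four scalar metrics above can be written in terms of confusion-matrix counts. The symbols TP, TN, FP, and FN (true/false positives and negatives, with stroke as the positive class) are notation introduced here and do not appear in the original text.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN}
\end{align}
```

Only Accuracy involves TN, which is why it can remain high on imbalanced data even when few stroke cases are detected.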

Additionally, five-fold cross-validation was performed to enhance the reliability and robustness of performance estimation [23].

This process splits the dataset into five folds, training on four and testing on one iteratively, ensuring that each data point is used for both training and validation.
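The fold procedure can be sketched with scikit-learn's cross_validate, as shown below. The data is synthetic with a roughly 95/5 class ratio, the classifier settings are assumptions, and the resampling step is deliberately left out to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (NOT the stroke dataset).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.95, 0.05], random_state=0
)

# Stratified folds preserve the class ratio inside every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metric_names = ["accuracy", "precision", "recall", "f1"]

scores = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X,
    y,
    cv=cv,
    scoring=metric_names,
)

# Average each metric over the five held-out folds.
mean_scores = {m: scores[f"test_{m}"].mean() for m in metric_names}
for metric, value in mean_scores.items():
    print(f"{metric:10s} {value:.3f}")
```

When oversampling is combined with cross-validation, a common precaution is to resample only the training folds so that each validation fold contains real records only.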

All results were tabulated and visualized for comparison (see Table 1), and the best-performing model was selected based on the highest F1-score and recall — the most clinically significant indicators for stroke detection.

4.6 Summary of Methodological Approach

The adopted methodology integrates both data-centric and model-centric improvements.

On the data side, rigorous preprocessing, encoding, and SMOTE balancing ensured dataset reliability.

On the model side, using six distinct algorithms under uniform evaluation settings provided a fair comparison and reproducibility.

The integration of cross-validation and multi-metric evaluation ensured statistical soundness and clinical interpretability, making this framework robust for healthcare predictive modeling.

5. Results and Discussion

This section presents and discusses the experimental results obtained from the six machine learning models applied to the stroke prediction dataset. Each model was evaluated based on Accuracy, Precision, Recall, and F1-Score. The experiments were conducted after balancing the dataset using SMOTE and validating performance through five-fold cross-validation to ensure reliability and generalizability.

5.1 Overview of Experimental Results

The overall performance metrics for all models are summarized in Table 1.

The Random Forest (RF) model achieved the highest accuracy (0.928), followed closely by XGBoost (0.918) and MLP (0.891). However, the Support Vector Classifier (SVC) outperformed the other models in recall (0.790), indicating superior sensitivity in detecting actual stroke cases. While Logistic Regression (LR) and KNN achieved relatively lower recall, their precision values remained consistent with the dataset’s characteristics.

Table 1. Performance comparison of ML classifiers for stroke prediction.

Fig. 4

Comparison of machine learning model performance metrics (Accuracy, Precision, Recall, and F1-Score) after applying SMOTE balancing.

As illustrated, ensemble-based models (RF and XGBoost) achieved strong overall performance in terms of accuracy but demonstrated a trade-off with recall, indicating a conservative classification tendency. On the other hand, SVC demonstrated a higher ability to identify stroke cases (Recall = 0.79) at the cost of slightly reduced accuracy.

These variations highlight the importance of metric diversity in medical applications, where recall often holds greater clinical significance than accuracy — detecting true stroke cases is more valuable than minimizing false positives.

5.2 Comparison and Model Insights

The Random Forest model’s superior accuracy can be attributed to its ensemble nature, which reduces variance and captures complex feature interactions.

Its moderate recall indicates that while it classifies most non-stroke cases correctly, it may miss some positive cases due to decision tree bias toward the majority class.

The SVC, conversely, shows exceptional recall, suggesting that the kernel transformation successfully separated minority samples after SMOTE balancing.

However, its lower accuracy and precision imply occasional false positives — a trade-off acceptable in medical prediction contexts where sensitivity is prioritized over specificity.

The KNN classifier achieved the second-highest recall (0.597), reinforcing its effectiveness for locally distributed patterns. However, its accuracy remained relatively low, likely due to the model’s dependence on distance metrics, which can be sensitive to feature scaling and noise.

The MLP exhibited balanced performance across metrics, validating its capability to model non-linear relationships while maintaining stability through regularization.

5.3 Impact of SMOTE Balancing and Cross-Validation

The integration of SMOTE had a measurable impact on model performance.

Before oversampling, the minority class (stroke) was underrepresented, leading to biased models that predicted mostly “non-stroke.”

After applying SMOTE, both recall and F1-score improved notably across all algorithms, particularly for SVC and KNN, whose recall increased by roughly 25–40%.

The application of five-fold cross-validation ensured that the reported results were not dependent on a single train-test split.

This validation approach produced consistent F1-scores across folds, confirming that the models generalized well to unseen data and that overfitting was minimized.

5.4 Error Analysis and Interpretation

A deeper inspection of the confusion matrices revealed that most errors arose from false negatives (missed stroke cases) in models with high accuracy but lower recall (e.g., RF and XGB).

This observation emphasizes that high accuracy alone can be misleading in imbalanced healthcare datasets.

Models like SVC and KNN, although less accurate, contributed significantly to the correct identification of minority (stroke) cases — an outcome more aligned with medical diagnostic priorities.

These findings align with prior studies [2], [4], [7], which observed similar trade-offs between precision and recall in stroke prediction tasks.
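The point that accuracy alone can mislead on imbalanced data can be checked with simple arithmetic. The counts below describe a hypothetical cohort of 1,000 patients with 50 stroke cases; they are illustrative assumptions, not results from this study.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
    }


# Hypothetical model A: always predicts "non-stroke".
always_negative = metrics(tp=0, fp=0, fn=50, tn=950)

# Hypothetical model B: flags more patients and catches 40 of 50 strokes.
sensitive = metrics(tp=40, fp=100, fn=10, tn=850)

print(always_negative)  # accuracy 0.95, recall 0.0
print(sensitive)        # accuracy 0.89, recall 0.8
```

Model A scores higher on accuracy (0.95 versus 0.89) while detecting no stroke cases at all, which is the failure mode that recall and F1 expose.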

5.5 Summary of Findings

The comparative analysis confirms that no single algorithm universally dominates across all metrics.

However, the Random Forest model offers the best general performance, while SVC demonstrates the highest recall and is thus most clinically reliable.

The integration of SMOTE balancing and cross-validation significantly improved the robustness and interpretability of all models.

The results validate the effectiveness of multi-algorithm comparison in healthcare predictive modeling and demonstrate how combining recall-focused metrics with balanced training data leads to clinically meaningful insights.

6. Conclusion and Future Work

6.1 Conclusion

This study presented a comprehensive machine learning framework for predicting stroke occurrence based on demographic, medical, and behavioral data.

Six supervised learning algorithms — Logistic Regression (LR), Random Forest (RF), XGBoost (XGB), K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), and Support Vector Classifier (SVC) — were implemented, trained, and evaluated under consistent preprocessing and validation conditions.

To address the issue of class imbalance, the SMOTE oversampling technique was applied, and model robustness was further ensured through five-fold cross-validation.

The experimental results revealed that while Random Forest achieved the highest accuracy (0.928), the SVC exhibited the highest recall (0.790), highlighting its superior ability to identify stroke cases correctly — a critical factor in medical prediction systems where sensitivity outweighs precision.

The findings underscore the importance of combining data balancing, cross-validation, and multi-metric evaluation in healthcare predictive modeling.

Unlike prior studies that focused on a single classifier or lacked structured data validation, this research provides a comparative and reproducible analysis across multiple algorithms, offering practical insights for real-world clinical applications.

Furthermore, this study confirms that traditional ML algorithms, when systematically tuned and evaluated, can deliver clinically meaningful predictions even without deep learning architectures — reinforcing their continued relevance in data-driven healthcare research.

6.2 Limitations

Despite the promising outcomes, several limitations must be acknowledged:

•The dataset was sourced from a single open-access repository (Kaggle), which may limit model generalizability to broader or regional populations.

•Certain medically relevant attributes (e.g., diet, exercise habits, family history) were unavailable, potentially restricting model interpretability.

•The study focused solely on classical ML algorithms; incorporating deep learning or hybrid frameworks could further improve predictive performance.

•External clinical validation with real patient records remains necessary before deployment in healthcare systems.

Recognizing these limitations helps contextualize the results and guides directions for future research.

6.3 Future Work

Future research should aim to extend this work in the following directions:

1.Dataset Expansion:

Incorporate larger, multi-center, and longitudinal datasets to enhance external validity and reduce population bias.

2.Advanced Feature Engineering:

Include new medical variables such as blood pressure history, cholesterol level, and lifestyle factors to enrich the predictive model.

3.Integration of Deep Learning and Ensemble Approaches:

Explore hybrid frameworks combining CNN, LSTM, or attention-based architectures with traditional ML models for improved accuracy and adaptability.

4.Explainable AI (XAI):

Employ interpretability tools like SHAP or LIME to help clinicians understand the decision-making process of the models.

5.Deployment in Clinical Decision Support Systems (CDSS):

Implement the trained models in real-time diagnostic systems to assist healthcare professionals in early stroke risk assessment and preventive interventions.

Through these directions, the proposed framework can evolve into a fully interpretable, data-driven clinical tool capable of supporting precision medicine and improving patient outcomes in stroke prevention.

7. Funding

This research received no external funding.

8. Consent to Publish

Consent to publish: Not applicable. The study uses publicly available and fully anonymized data.

9. Ethics Approval and Consent to Participate

Ethics approval: Not applicable.

Consent to participate: Not applicable.

The dataset used in this study is publicly available, fully anonymized, and does not involve any direct human subjects.

Author Contribution

The author solely contributed to the study design, data preprocessing, analysis, model development, interpretation of results, and manuscript writing.

Data Availability

The dataset used in this study is publicly available on Kaggle at https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset. The data is open access and can be freely used for academic and research purposes.

References

[1] Wang WJ, Kiik M, Peek N, Curcin V, Marshall IJ, Rudd AG, Wang Y, Douiri A, Wolfe CDA. A systematic review of machine learning models for predicting outcomes after stroke. PLoS ONE. 2020;15(2):e0229016.

[2] Asadi F, Rahimi M, Daeechini AH, Paghe A. The most efficient machine learning algorithms in stroke prediction: A systematic review. Health Sci Rep. 2023;6(9):e1188.

[3] Heo H, Lee D, Nam S, Kim B. Machine Learning and Deep Learning Algorithms in Stroke Medicine. Front Neurol. 2021;12:734345.

[4] Zhou Y, Zhang X, Li H, et al. Predicting stroke occurrences: A stacked machine learning approach with feature selection and data preprocessing. BMC Bioinformatics. 2024;25:329.

[5] Zhang Y, Fei X, Zhang X, Wang Q, Fang Z. Machine learning prediction models for postoperative stroke in elderly patients: Analyses of the MIMIC database. Front Aging Neurosci. 2022;14:897611.

[6] Choi SK, Park HH, Byeon DH, et al. Machine learning-based ischemic stroke outcome prediction: A multicenter study. Sci Rep. 2020;10:18707.

[7] Khosla S, Jamthikar AN, et al. AI in stroke risk assessment and prevention: Current state and future directions. Comput Methods Programs Biomed. 2021;198:105754.

[8] Miah AHA, Ahmed MU, et al. Stroke risk prediction with heterogeneous clinical features using ensemble machine learning. IEEE Access. 2021;9:115870–83.

[9] Jani R, Patel P, Singh A. Assessing stroke risk using electronic health records and machine learning. J Biomed Inf. 2020;112:103611.

[10] Jabbar MA, Aluvalu R, Chandra P. Prediction of stroke disease using machine learning algorithms. Procedia Comput Sci. 2020;167:1230–40.

[11] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.

[12] He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.

[13] Fernández A, García S, Herrera F. SMOTE for imbalanced classification: A comprehensive review. Inf Sci. 2018;465:80–101.

[14] Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.

[15] Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

[16] Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min.; 2016. p. 785–94.

[17] Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13(1):21–7.

[18] Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323:533–6.

[19] Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett. 2006;27(8):861–74.

[20] Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 2015;10(3):e0118432.

[21] Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. New York, NY, USA: Springer; 2009.

[22] Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.

[23] Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proc. IJCAI; 1995. p. 1137–45.

[24] Branco P, Torgo L, Ribeiro RP. A survey of predictive modeling on imbalanced domains. ACM Comput Surv. 2016;49(2).

[25] Raschka S. Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint arXiv:1811.12808; 2018.

[26] Alsawy M. Healthcare Stroke Dataset. Kaggle; 2023. Available: https://www.kaggle.com/datasets/alsawymohamed/helthcare-stroke
