Enhancing Logistic Regression Performance Through Hyperparameter Tuning: A Comparative Evaluation Across Datasets
Mueed Ahmad 1,5
Noman Javed 2
Awais Muzafar 4
Mateen Muzafar 3,5
Hadia Naseer 5
Guantian Huang 1 ✉
Dianning He 6 ✉
1 College of Medicine and Biological Information Engineering, Northeastern University, Wenhua Road, Shenyang 110004, China
2 Department of Computer Science, University of the West of Scotland, London Campus, 58 Waterloo Road, London, United Kingdom
3 Department of Bioinformatics, Université d'Évry Paris-Saclay, 23 Bd François Mitterrand, 91000 Évry-Courcouronnes, France
4 Department of Computer Science, Beijing Institute of Graphic Communication, Qingyuan Rd, Daxing District, Beijing, China
5 Department of Computer Science, University of Agriculture, Faisalabad, Jail Road, near Gulberg Police Station, Faisalabad, Punjab, Pakistan
6 School of Health Management, China Medical University, No. 77 Puhe Road, Shenyang North New Area, Shenyang, Liaoning, China
Mueed Ahmad and Noman Javed: These authors contributed equally to this work.
Corresponding author: Dianning He (hedn@cmu.edu.cn)
Co-corresponding author: Guantian Huang (huanggt@mails.neu.edu.cn)
Keywords: Machine Learning, Logistic Regression, Hyperparameter Tuning, Generalization, Model Performance
Abstract
Background: Logistic regression (LR) is widely used in binary and multi-class classification tasks, yet its predictive performance is highly sensitive to hyperparameter configuration. Suboptimal choices can lead to overfitting, underfitting, reduced generalization, and inconsistent model behavior across datasets. This study aims to systematically enhance LR performance by applying a comprehensive hyperparameter optimization framework and evaluating its impact across four diverse datasets: breast cancer, heart disease, liver disorders, and handwritten digits.
Methods: A Python-based experimental framework was developed using Scikit-learn, NumPy, and Pandas to examine how hyperparameters influence LR performance. A combinatorial optimization strategy was applied to tune regularization strength (C), penalty type (L1), solver choice (liblinear, saga), class-weight settings, and maximum iterations. Model evaluation was conducted using both train–test splits (20%, 30%, 40%) and k-fold cross-validation (k = 3, 5, 10). Performance was assessed using accuracy, F1-score, AUC, and cross-validation accuracy. Tableau-based visual analytics were used to compare model behaviors under different configurations.
Results: Optimized hyperparameters consistently improved model performance across all datasets. The breast cancer and digits datasets achieved the most substantial gains, with maximum test accuracies of 97% and 98%, respectively, and AUC values up to 0.99. Cross-validation scores indicated strong generalization, with the best-performing models showing CV accuracies above 0.90. In contrast, performance improvements on heart disease and liver disorder datasets were present but more modest due to noisier features and class imbalance. Hyperparameter combinations involving L1 penalty, balanced class weights, and the liblinear solver produced the highest accuracy and F1-scores across several datasets.
Conclusions: Systematic hyperparameter tuning significantly enhances logistic regression performance, generalization, and discrimination ability. The results demonstrate that even simple models can achieve high accuracy when appropriately optimized. This framework provides practical guidance for improving LR across heterogeneous datasets and highlights the importance of penalty choice, regularization strength, and solver selection. Future work should explore advanced optimization techniques such as Bayesian optimization and evolutionary algorithms to further improve efficiency and performance.
Introduction
Many investigations have compared standard regression models with super-learning models; however, it is still unclear which is better \cite{R6}. Using an ensemble machine learning technique called super learning, the final model performs at least as well as any of its component parts by combining the predictions from multiple algorithms and assigning optimal weights to each of them \cite{R9}. This improves prediction ability by allowing researchers to examine multiple machine learning techniques simultaneously instead of relying solely on one.
Typically, super-learning applications also tune the hyperparameters of the underlying machine learning tools. Frequently modified before training, hyperparameters can have a substantial impact on the behavior of an algorithm by changing its structure or complexity \cite{R40}. Although many studies use default settings \cite{R5,R9}, the best outcomes in the super-learning framework require careful selection of hyperparameter values.
This is particularly important because these hyperparameters have a large impact on how well algorithms like logistic regression work. Iterating through multiple configurations and using techniques such as grid search to find the optimal hyperparameters can improve the performance of the model \cite{R34}.
In order to apply super-learning to a high-dimensional dataset, we combined logistic regression with other conventional model-building techniques. Due to its interpretability, logistic regression is a widely used and appreciated technique for binary classification problems, such as the prediction of liver disorders, heart disease, and breast cancer \cite{R10}. By adjusting factors such as regularization strength and solver parameters, hyperparameter tuning can significantly improve the performance of these models \cite{R11}. It remains unclear, however, whether hyperparameter tuning is applied uniformly across datasets \cite{R40}.
To tackle this, we use datasets from a variety of domains, including healthcare and digit recognition, to investigate how hyperparameter tuning affects logistic regression models. The goal is to identify the optimal hyperparameter combinations to increase accuracy and generalisation capacity. The model's hyperparameters, which regulate different elements like regularisation strength and solver strategies, must be adjusted in order to enhance performance. This study compares the performance of a ``tuned'' and ``untuned'' super learner, the latter employing default hyperparameters. A separate logistic regression model is also used as a comparison for the tuned super learner. We investigate possible combinations of class weights, solver settings, penalty types, and C values using a methodical, combinatorial approach to hyperparameter optimisation.
A thorough assessment method that employs multiple train–test split techniques (with validation sizes of 20%, 30%, and 40%) and cross-validation (with 3, 5, and 10 folds) supports this strategy. A more comprehensive analysis of model performance is made possible by the additional assessment metrics provided, which include test accuracy, training accuracy, F1 score, and AUC score. Through the establishment of a methodical approach to logistic regression hyperparameter tuning, this technique can enhance model fit, accuracy, and generalisation in binary classification scenarios. Both researchers and practitioners will benefit from our findings, which offer crucial information for improving logistic regression models in real-world scenarios.
Literature Review
To increase model performance and accuracy, numerous researchers have studied optimisation strategies and machine learning methodologies. A number of studies have concentrated on assessing model parameters, choosing suitable algorithms, and implementing tuning techniques for improved outcomes.
Schratz et al. investigated the impact of spatial autocorrelation on hyperparameter tuning and model evaluation. They compared GLM, GAM, SVM, RF, and k-NN under random and geographic cross-validation using ecological disease mapping as a case study. The findings demonstrated that whereas spatial CV yields more accurate performance estimates, random CV produces unduly optimistic measures. They concluded that both tuning and evaluation should make use of spatial segmentation \cite{R48}.
Sun et al. \cite{R49} used Bayesian hyperparameter optimisation to perform a comparative study on landslide susceptibility mapping. To assess how optimised parameters affected model accuracy, they contrasted logistic regression and random forest models. The results demonstrated that both models' predictive performance was greatly enhanced by Bayesian optimisation, with random forest attaining slightly superior accuracy. The study concluded that improving model reliability for geospatial risk mapping applications requires hyperparameter optimisation.
Raschka \cite{R50} provided a comprehensive tutorial on model assessment, model selection, and performance metrics in machine learning. He emphasised the importance of nested cross-validation for obtaining unbiased performance estimates and discussed the need to select evaluation metrics based on problem type and class imbalance. The tutorial also advised against relying solely on accuracy—particularly for imbalanced datasets—and advocated the use of ROC-AUC, precision–recall curves, and F1-scores. This work serves as a critical resource for ensuring comprehensive model evaluation in machine learning research.
Pfob et al. \cite{R51} focused on model comparison while addressing data preprocessing, hyperparameter tuning, and the practical implementation of machine learning in medical research. To classify breast masses using mammographic imaging features and patient age, they evaluated logistic regression, XGBoost, SVM, and neural networks. All models yielded similar AUROC values (0.88–0.89), illustrating that simple models can perform as well as complex models when proper preprocessing and hyperparameter tuning are applied.
Ambesange et al. \cite{R52} developed a heart disease prediction system that combined logistic regression with ensemble learning techniques and GridSearch-based hyperparameter optimisation. By refining model parameters and preprocessing methods, the study achieved improved classification accuracy on the UCI heart disease dataset. The authors concluded that effective medical diagnosis systems require careful parameter optimisation and robust data preprocessing.
Erden et al. \cite{R53} investigated advanced hyperparameter optimisation strategies to enhance machine-learning model performance. Compared with traditional grid search and random search, modern Bayesian and evolutionary algorithms achieved superior performance with fewer iterations. These techniques proved particularly effective for complex models that demand substantial computational resources. The study highlighted the efficiency and flexibility of intelligent optimisation algorithms in producing high-performing machine-learning models.
Background
Logistic Regression
Logistic regression is a popular supervised learning technique frequently employed for classification tasks \cite{R17}. It works well for binary classification, but it can be extended to support multiclass scenarios. Based on the values of its features, the logistic regression model determines the likelihood that an input instance belongs to a particular class \cite{R23}.
Because of its interpretability and proven efficacy in binary classification scenarios, logistic regression was chosen as the main classification model for this investigation. Logistic regression is a widely used linear model for predicting binary outcomes \cite{R10}. Based on the input attributes, it calculates the probability that the outcome will belong to a specific class. Its simplicity and the ease with which the coefficients of each feature can be interpreted make logistic regression a strong baseline model. This interpretability allows for a deeper understanding of the relationships between input features and the target variable.
The logistic regression model can be expressed as follows:
\[P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top} \mathbf{x})\]
where:
$P(y = 1 \mid \mathbf{x})$ represents the probability of the positive class given the input features,
$\sigma$ is the logistic (sigmoid) function that transforms the linear combination of input features,
$\mathbf{x}$ is the vector of input features, and
$\mathbf{w}$ is the vector of coefficients or weights assigned to each input feature.
The logistic (sigmoid) function is defined as:
\[\sigma(z) = \frac{1}{1 + e^{-z}}\]
Maximum likelihood estimation (MLE) or other optimisation methods are used to estimate the coefficients $\mathbf{w}$ in logistic regression. The goal is to find the values of $\mathbf{w}$ that maximise the likelihood of the observed data. During training, the model learns the coefficient values that best fit the data. Based on the predicted probability of the positive class, instances can be classified into binary categories using an appropriate threshold. Logistic regression can also be extended to multiclass classification problems using methods such as one-vs-rest or multinomial logistic regression \cite{R19}.
Model Evaluation Techniques
Two complementary techniques are used for selecting and evaluating Logistic Regression (LR) models: cross-validation and the train–test split.
Cross-Validation
In cross-validation, the dataset is divided into multiple smaller subsets, or "folds," and the model is trained on some of the folds and tested on others. A reliable measure of the model's capacity for generalisation is obtained by averaging its performance across all folds. Common cross-validation methods such as 3-fold, 5-fold, and 10-fold have been demonstrated to produce reliable performance estimates \cite{R12}.
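As a brief illustration of this procedure, the sketch below averages accuracy over the 3-, 5-, and 10-fold settings used in this study, with the Breast Cancer dataset as an example; the \texttt{max_iter} value is an assumption chosen only to ensure convergence.
\begin{verbatim}
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=10_000)

# Average accuracy over each fold setting used in the study
for k in (3, 5, 10):
    scores = cross_val_score(model, X, y, cv=k, scoring="accuracy")
    print(f"{k}-fold CV accuracy: {scores.mean():.3f}")
\end{verbatim}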
Train–Test Split
The train–test split methodology is a widely used method for assessing machine learning models. A portion of the dataset is used for model training, while the remaining portion is used for performance evaluation. In practice, this approach is frequently applied to assess a model's generalisability to new, unseen data \cite{R20}. Test sizes of 20%, 30%, and 40% may be used in different scenarios \cite{R13,R14}.
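A minimal sketch of this evaluation follows, assuming stratified splits and a fixed random seed for reproducibility (neither detail is specified in the text):
\begin{verbatim}
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Evaluate the same model under the three hold-out sizes used in the study
for test_size in (0.2, 0.3, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y)
    model = LogisticRegression(max_iter=10_000).fit(X_train, y_train)
    print(f"test_size={test_size}: "
          f"test accuracy={model.score(X_test, y_test):.3f}")
\end{verbatim}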
Hyperparameter Tuning
Logistic regression and other machine learning models require hyperparameter optimisation. The model's performance is significantly impacted by hyperparameters such as class weights, solver selection, penalty type, and regularisation strength ($C$). Grid search, random search, and Bayesian optimisation are popular techniques for hyperparameter optimisation, and they have been demonstrated to increase the generalisability and accuracy of models, especially when applied consistently across several datasets \cite{R21}. The degree of regularisation in the model is controlled by the regularisation parameter $C$: smaller $C$ values impose stronger regularisation and reduce overfitting \cite{R22}. Two other hyperparameters that impact the model's performance are the solver and penalty type. L1 regularisation encourages sparsity and makes it easier to identify significant features \cite{R8}, whereas the solver selection determines the optimisation strategy for obtaining model coefficients \cite{R23}.
C Parameter
The inverse regularisation strength $C$ determines how heavily the model's coefficients are penalised. Smaller values of $C$ impose stronger regularisation, shrinking the coefficients and reducing overfitting, while larger values allow the model to fit the training data more closely \cite{R22}. In this study, $C$ values spanning several orders of magnitude (including 0.001, 0.01, and 10) were evaluated to balance underfitting and overfitting.
Penalty Values
The regularisation applied to the model's coefficients is determined by the penalty term. The "L1" penalty, which corresponds to L1 regularisation, was applied in this work to encourage sparsity in the coefficient estimations. By controlling the model's complexity, this penalty reduces overfitting and enhances the model's capacity to generalise to new data \cite{R22}.
Solver
When estimating coefficients, logistic regression provides a number of solvers that affect convergence and optimisation behaviour. We selected the \texttt{liblinear} and \texttt{saga} solvers for this investigation to evaluate their effect on model performance. Since the solver selection determines the optimisation procedure used to reduce the model's error function, it can have a substantial impact on convergence speed and accuracy \cite{R23}.
Class Weight
By altering how various classes impact the model, class weights are utilised to address class imbalances. This study examined how the \texttt{dict} and \texttt{balanced} class weight settings affected the model's capacity to handle unbalanced datasets. In the case of class imbalances, this correction increases model accuracy by reducing bias towards the over-represented class \cite{R30}.
Maximum Iterations
The maximum number of iterations determines the convergence criterion for the optimisation process. Three values were investigated: 10,000; 100,000; and 1,000,000, to examine how the iteration limit affects convergence and model performance \cite{R41}.
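The combinatorial search over these hyperparameters can be sketched as follows. The candidate values are assumptions drawn from those reported later in the paper, and the \texttt{StandardScaler} step reflects the feature scaling described in the Methods rather than a confirmed detail of the original script.
\begin{verbatim}
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

grid = {
    "C": [0.001, 0.01, 10],               # inverse regularisation strength
    "solver": ["liblinear", "saga"],      # both support the L1 penalty
    "class_weight": [None, "balanced"],
    "max_iter": [10_000, 100_000, 1_000_000],
}

best_config, best_acc = None, -1.0
for C, solver, class_weight, max_iter in product(*grid.values()):
    model = make_pipeline(
        StandardScaler(),                 # scaling speeds up saga convergence
        LogisticRegression(penalty="l1", C=C, solver=solver,
                           class_weight=class_weight, max_iter=max_iter),
    )
    acc = cross_val_score(model, X, y, cv=5).mean()
    if acc > best_acc:
        best_config, best_acc = (C, solver, class_weight, max_iter), acc

print(best_config, round(best_acc, 4))
\end{verbatim}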
Evaluation of the Classification Model
In addition to cross-validation accuracy, a number of assessment metrics, including train accuracy, test accuracy, F1 score, and AUC score, were used to gauge the effectiveness of the logistic regression model. These metrics provide comprehensive details regarding the predictive performance of the model.
Cross-Validation
By training and testing the model on several data subsets (folds) and averaging the outcomes, cross-validation accuracy offers a trustworthy assessment of the generalisability of the model \cite{R46}. It is computed as:
\[\text{Accuracy}_{CV} = \frac{1}{K} \sum_{k=1}^{K} Acc_k\]
where $K$ is the number of folds and $Acc_k$ is the accuracy of the $k$-th fold. This metric evaluates the model's robustness by testing it on different subsets of the data \cite{R24}.
Test Accuracy
The model's performance on unseen data is measured by the test accuracy. It is computed as follows:
\[\text{Test Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]
Here, $TP$, $TN$, $FP$, and $FN$ represent the counts of true positives, true negatives, false positives, and false negatives, respectively. A higher test accuracy indicates that the model can generalise well to new data \cite{R32}.
Train Accuracy
The model's performance on the training data is gauged by the train accuracy. It is calculated as follows:
\[\text{Train Accuracy} = \frac{\text{Correctly Predicted Instances}}{\text{Total Instances in the Training Set}}\]
If the train accuracy is high, the model has effectively learned the patterns in the training data. However, this does not always imply proper generalisation to unseen data, so test accuracy must also be considered \cite{R33}.
F1 Score
The F1 score combines precision and recall into a single metric, providing a balanced evaluation of model performance, especially for unbalanced datasets. It is defined as:
\[F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]
An F1 score close to 1 indicates a better balance between precision and recall, reflecting a well-performing model \cite{R47}.
AUC Score
The Area Under the Curve (AUC) is a performance metric for binary classification that measures how well the model distinguishes between positive and negative classes. The Receiver Operating Characteristic (ROC) curve is obtained by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) for various classification thresholds \cite{R25}. The AUC is the area under this curve, that is, the integral of the TPR with respect to the FPR:
\[AUC = \int_{0}^{1} TPR \, d(FPR)\]
AUC values range from 0 to 1, with 0.5 denoting no discriminatory ability and 1 indicating perfect discrimination. Higher AUC values represent better model performance and class separability \cite{R4}.
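The four metrics above map directly onto Scikit-learn helpers. The following is a minimal sketch, assuming a 30% hold-out split on the Breast Cancer dataset and a fixed random seed (both illustrative choices):
\begin{verbatim}
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=10_000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # positive-class probabilities

print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))   # AUC needs probabilities
\end{verbatim}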
Materials and Methods
Methodology
A new model was developed using Python programming to enhance the performance of Logistic Regression (LR) on four distinct datasets. Numerous important Python libraries, including \texttt{NumPy}, \texttt{Pandas}, and \texttt{Scikit-learn}, were used in the model implementation. To evaluate model performance, two model selection methods were used: cross-validation and train–test split.
Hyperparameter adjustments were made to a number of crucial components that have a major influence on the accuracy of Scikit-learn's Logistic Regression implementation. These comprised the $C$ parameter, maximum iterations, solver, class weight, and L1 penalty. The best configurations for each hyperparameter were found via combinatorial optimisation.
The study combines model selection techniques with the tuning of these hyperparameters in order to comprehensively evaluate each possible combination. This process identified the model selection techniques and hyperparameter setups that enhanced the performance of Logistic Regression across all datasets.
A flow diagram illustrating the model concept is presented in Figure 1.
Fig. 1
A comprehensive framework for optimizing Logistic Regression performance.
Dataset Description
The initial step in our inquiry was selecting pertinent datasets for evaluation. Since each dataset had unique characteristics and challenges that made it suitable for classification tasks, we chose to work with four distinct datasets. These datasets were selected because of their range of features, accessibility, and classification-related value:
1.
Heart Disease Dataset: The heart disease dataset consists of characteristics associated with heart disease patients, including age, sex, blood pressure, cholesterol, and the presence of other risk factors. It is primarily numerical and often used to predict cardiovascular risk. The dataset is accessible at: https://www.openml.org/search?type=data&status=active&id=43823
2.
Liver Disorder Dataset: This dataset includes a number of enzymes as well as factors linked to liver issues, such as patient age, gender, alcohol consumption, and blood test results, including albumin and bilirubin levels. The dataset presents a variety of challenges for classification algorithms because it includes both numerical and categorical data. It is accessible at: https://archive.ics.uci.edu/ml/datasets/liver+disorders
3.
Digit Dataset: Images of handwritten numbers from 0 to 9 make up the digit dataset. Each image is represented by pixel values that correspond to the greyscale intensity of the digit. Since the majority of this dataset is made up of image data, categorisation presents a special difficulty, especially when it comes to image processing. The dataset is accessible at: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html
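4.
Breast Cancer Dataset: The breast cancer dataset consists of 569 samples with 30 numerical features describing tumour characteristics, with the target variable indicating whether a tumour is malignant or benign. It is accessible at: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html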
Fig. 2
Sample and feature composition of the four datasets.
Figure 2 illustrates the sample distribution across all four datasets used in this study. The Breast Cancer dataset consists of 569 samples, each with 30 features, where the target variable indicates whether the tumor is malignant or benign. The Digits dataset includes 1,797 samples with 64 features, representing handwritten digits, with the target variable corresponding to the digit in the image. The Heart Disease dataset has 270 samples with 13 features, where the target variable denotes the presence of heart disease. The Liver Disorder dataset predicts whether a patient has a liver disorder using six variables from 345 samples. Each dataset can be utilised for classification problems because it contains instances with distinctive features. Logistic regression, a binary classification technique that can be used to predict the presence of heart disease or to distinguish malignant from benign tumours, was employed to examine all datasets; the Digits dataset illustrates a multi-class classification task. The investigation was conducted using a Python script that imports libraries such as csv, Pandas, NumPy, and Scikit-learn and specifies the hyperparameters of the logistic regression model, including C, penalty, solver, class_weight, max_iter, and CV values.
Experimental Setup
Using Python tools like \texttt{Scikit-learn}, \texttt{NumPy}, and \texttt{Pandas}, a Logistic Regression model was created and assessed in this study. To thoroughly evaluate the model's performance and determine the ideal configuration, test sizes and hyperparameters were carefully selected. The following steps were taken:
Importing Libraries
Prior to training the machine learning models, the datasets were preprocessed to ensure data compatibility and quality. This process included feature scaling, normalization, missing value resolution, and, if required, categorical variable encoding. The following Python libraries were essential for these tasks:
NumPy:
A core Python library for numerical computation, \texttt{NumPy} provides efficient array and matrix operations for data management. It was used to perform mathematical calculations, handle missing values, and convert data into numerical representations that machine learning models could utilise \cite{R42}.
Pandas:
\texttt{Pandas} is an open-source library for data analysis and manipulation. Its \texttt{DataFrame} and \texttt{Series} data structures simplify data cleaning, filtering, and reshaping. \texttt{Pandas} played an essential role in preprocessing and preparing datasets for model training \cite{R44}.
Scikit-learn:
For applications such as regression and classification, this widely used machine learning library offers efficient implementations of key algorithms. It was used in this study to train, evaluate, and tune the hyperparameters of the Logistic Regression models. The \texttt{LogisticRegression} class from \texttt{Scikit-learn} was used to construct and assess Logistic Regression models for the selected datasets \cite{R41}.
Model Selection Method
The \texttt{LogisticRegression} class was used to implement the Logistic Regression model. The \texttt{Scikit-learn} implementation applies the sigmoid function in its mathematical formulation. The model learns the optimal values of the coefficients $\mathbf{w}$ from the training data in order to fit the logistic regression equation to the observed data \cite{R28}.
A Python script was created to extract four datasets using the functions \texttt{load_breast_cancer()}, \texttt{load_digits()}, \texttt{fetch_openml()}, and \texttt{pd.read_csv()}. The program iterates through each dataset, and for each one, it tests all combinations of hyperparameters. It performs k-fold cross-validation using the \texttt{cross_val_score()} function and calculates the mean cross-validation accuracy (\texttt{cv_acc}) and F1 score (\texttt{f1_acc}). The results are then exported to a CSV file for further analysis.
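A condensed sketch of this experiment loop is shown below. The reduced hyperparameter grid, the weighted F1 averaging, and the output column names are assumptions made for illustration, and the liver-disorder CSV loading is left as a placeholder since its file path is not given in the text.
\begin{verbatim}
import csv

from sklearn.datasets import fetch_openml, load_breast_cancer, load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

datasets = {
    "breast_cancer": lambda: load_breast_cancer(return_X_y=True),
    "digits": lambda: load_digits(return_X_y=True),
    "heart_disease": lambda: fetch_openml(
        data_id=43823, return_X_y=True, as_frame=False),
    # "liver_disorder": loaded with pd.read_csv(...) in the original script
}

rows = []
for name, load in datasets.items():
    X, y = load()
    for C in (0.001, 0.01, 10):
        for solver in ("liblinear", "saga"):
            for k in (3, 5, 10):
                model = LogisticRegression(penalty="l1", C=C, solver=solver,
                                           class_weight="balanced",
                                           max_iter=10_000)
                cv_acc = cross_val_score(model, X, y, cv=k).mean()
                f1_acc = cross_val_score(model, X, y, cv=k,
                                         scoring="f1_weighted").mean()
                rows.append([name, C, solver, k, cv_acc, f1_acc])

# Export all configurations and scores for later analysis
with open("results.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["dataset", "C", "solver", "cv_folds",
                     "cv_acc", "f1_acc"])
    writer.writerows(rows)
\end{verbatim}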
Visualization
The model evaluation results were visualized using \texttt{Tableau} \cite{R39}. This allowed for clear interpretation and comparison of model performance across various datasets and experimental conditions.
Results
Cross-Validation
Logistic Regression can also be applied to multi-class classification problems, such as determining the digit represented in the Digits dataset \cite{R29}. In such cases, the model is trained independently for each class, selecting the class with the highest predicted probability. K-fold cross-validation was used to evaluate the model's performance, with the \texttt{cross_val_score()} function determining accuracy and the \texttt{f1_score()} function calculating the F1 score, which incorporates both precision and recall \cite{R43,R46}.
Fig. 3
Average CV Accuracies and F1 Scores.
The maximum cross-validation (CV) accuracy achieved across the datasets was 0.961 for Breast Cancer, 0.7014 for Liver Disorder, 0.8407 for Heart Disease, and 0.930 for Digits. The corresponding average maximum F1 score was 0.6911 (Figure 3). These values demonstrate the overall performance of the Logistic Regression models; however, it is important to note that performance can vary depending on factors such as dataset size, data quality, preprocessing techniques, and feature engineering.
The models' mean F1 score was 0.6911, which shows a balanced performance between precision and recall, and their average CV accuracy was 0.7559, indicating an overall accuracy of 75.59% during cross-validation. Higher F1 scores and CV accuracies suggest strong generalisation capacity and accurate class prediction, while lower values may indicate imbalance or underfitting.
Since the F1 score accounts for both precision and recall, its harmonic mean provides a more comprehensive evaluation of model performance, with higher values representing more robust predictive ability.
Fig. 4
Average CV Accuracy and F1 Score with different CV fold values.
The number of cross-validation (CV) folds directly influences the F1 score, as increasing the number of folds generally provides a more precise estimate of model performance. As shown in Figure 4, the effect varies across datasets. For instance, a CV value of 10 yields the highest F1 score for the Heart Disease dataset (0.5148), while the Digits dataset also achieves its peak F1 score at CV = 10. In contrast, the Breast Cancer dataset maintains consistently high F1 scores across all tested CV values.
In general, higher CV values can improve the reliability of performance estimates on unseen data but at the cost of increased computational complexity. However, the performance gain may not always be substantial, and the optimal CV choice often depends on the dataset and model configuration.
The overall mean F1 score across all datasets and CV values was 0.6584, while the average maximum F1 score reached 0.6911. These results highlight the trade-off between precision and recall: precision reflects the proportion of correctly identified positive predictions, while recall measures the ability to detect all actual positive cases. Together, the harmonic mean of these metrics, represented by the F1 score, offers a balanced evaluation of model performance.
Fig. 5
Minimum and Maximum F1 Scores with different CV fold values.
A higher F1 score reflects stronger model performance, with a value of 1 representing perfect precision and recall, while a score of 0 indicates poor performance. As shown in Figure 5, the average F1 scores varied across datasets and cross-validation (CV) folds. For example, the Breast Cancer dataset achieved consistently high F1 scores ranging from 0.897 to 0.959, whereas the Digits dataset showed greater variability, with scores ranging from 0.143 to 0.929. The overall mean F1 score of 0.6584 suggests moderate accuracy with room for improvement, while the average F1 score of 0.6911 indicates balanced performance across all datasets and CV values.
It is important to interpret these values in the context of the specific classification problem, as the acceptable threshold for F1 may vary depending on application requirements. Notably, the Heart Disease dataset exhibited comparatively lower F1 scores across all CV values, whereas the Liver Disorder dataset demonstrated relatively higher scores under certain CV configurations.
Fig. 6
F1 Score and Cross-Validation (CV) Accuracy with varied CV fold values.
A higher F1 score reflects stronger model performance, while higher cross-validation (CV) values generally indicate better generalisation to unseen data. As shown in Figure 6, the model performed poorly on the Heart Disease and Liver Disorder datasets, with CV accuracy values between 0.645 and 0.651. In contrast, performance was stronger on the Digits and Breast Cancer datasets, with CV accuracies ranging from 0.772 to 0.925. F1 scores ranged from 0.515 to 0.918, with the Breast Cancer dataset achieving the highest value (0.91).
Overall, these results suggest that the model achieved only moderate performance, particularly on the medical datasets. To gain deeper insights, precision and recall should be considered alongside the F1 score, and future improvements may require hyperparameter refinement or the adoption of alternative algorithms.
Fig. 7
Optimal hyperparameter combinations and CV fold values for maximum scores.
The mean cross-validation (CV) accuracy of Logistic Regression models across various hyperparameter settings is displayed in Figures 6 and 7. The parameters evaluated included regularisation penalty (L1 or L2), class weights to handle unbalanced data, inverse regularisation strength ($C$), and solver algorithms (\texttt{liblinear} or \texttt{saga}). The findings show that prediction performance is significantly impacted by the choice of hyperparameters. For the Digits dataset, the highest accuracy was achieved with a $C$ value of 0.001, balanced class weights, an L1 penalty, and either the \texttt{liblinear} or \texttt{saga} solver. CV accuracies ranged from 0.287 to 0.918.
The Heart Disease dataset performed best when the class weights were balanced or set to 10 and the $C$ values were 0.001 or 0.01, with accuracies between 0.533 and 0.556; the \texttt{liblinear} and \texttt{saga} solvers were used with an L1 penalty. Similarly, the Liver Disorder dataset performed best when $C$ was set to 0.01 or 10 and class weights were set to 5 or 10, achieving accuracies ranging from 0.579 to 0.679; the solver was either \texttt{liblinear} or \texttt{saga}.
The Breast Cancer dataset consistently outperformed the others when trained using the L1 penalty and \texttt{liblinear} solver under 10-fold CV, with an average CV accuracy of 0.9315 and an F1 score of 0.9246. The average F1 score for all datasets combined was 0.6911, while the mean CV accuracy was 0.7559. These results demonstrate that selecting the appropriate penalty and solver is essential for optimising model performance and that hyperparameter tuning is highly dataset-dependent. The specific dataset characteristics and the required degree of regularisation dictate the optimal hyperparameter configuration for Logistic Regression models.
Using more complex models or refining hyperparameters can significantly increase predictive precision. The penalty hyperparameter defines the model’s regularisation technique—L1 for sparsity, L2 for smoothness—while the solver hyperparameter determines the optimisation algorithm used during model training. In this study, both \texttt{liblinear} and \texttt{saga} solvers were evaluated, demonstrating robust performance across different configurations. The data analysis indicates that the Breast Cancer dataset achieved the highest predictive stability under these tuned settings.
Fig. 8
Cross-validation (CV) accuracy and F1-score scatter plot with power trend line.
While cross-validation (CV) accuracy reflects the model's average performance across multiple training and testing splits, the F1 score provides a balanced measure of model accuracy by considering both precision and recall. In general, higher values of both metrics indicate better model performance. However, the outcomes exhibit substantial variation, as shown in Figure 8. Certain models perform considerably better than others, as evidenced by the wide range of CV accuracy values from 0.281 to 30.556 and F1 scores from 0.1428 to 0.8998.
Inconsistencies between the two metrics are further revealed upon closer inspection. The model that achieves the highest F1 score of 0.8998 also reports a strong CV accuracy of 0.8954, illustrating close alignment between the two measures. Conversely, the model with the highest CV accuracy (30.556) exhibits a comparatively low F1 score of 0.3571, indicating poor predictive reliability despite high validation accuracy. These discrepancies underscore the importance of evaluating multiple performance metrics rather than relying solely on a single indicator, as a comprehensive assessment ensures more accurate insights into model behaviour and generalisation capability.
Fig. 9
Cross-validation (CV) accuracy and F1 score with minimum and maximum output values.
Figure 9 presents the minimum and maximum CV accuracy and F1 score values across the models. The coefficient of determination ($R^2$) measures the percentage of variance in the dependent variable that can be explained by the independent variables. In this study, $R^2$ values ranged from 0.833 to 0.959, with higher values indicating a stronger model fit. Similarly, the p-value evaluates the statistical significance of associations between variables, with smaller values (typically $p < 0.05$) suggesting more reliable relationships. The observed p-values varied widely, from 0.7072 to 30.55, reflecting differences in variable significance.
Collectively, higher $R^2$ values, alongside improvements in cross-validation (CV) accuracy and F1 scores, indicate better model performance, while smaller p-values reinforce the validity of the estimated coefficients.
Train--Test Evaluation
To evaluate the model's efficacy, the dataset was divided into training and testing subsets according to a defined test size. The test size represents the portion of data reserved exclusively for testing, while the remaining data is used for model training. This separation ensures that model performance is assessed on unseen data, thereby providing an unbiased measure of generalisation capability.
Fig. 10
Test sizes used for model evaluation.
Specifically, test size proportions of 20%, 30%, and 40% were selected, taking into account dataset size and model complexity (Figure 10). With a 20% test size, 80% of the data is used for training while 20% is reserved for testing. Although this split provides more data for training, it may increase the risk of overfitting and yield overly optimistic results.
In contrast, a 30% test size offers a balanced approach by allocating 70% of the data for training and 30% for testing, which is widely regarded as a standard practice in model evaluation. This configuration produces more robust performance estimates by ensuring there is sufficient data for both training and validation.
When the test size is increased to 40%, the model is trained on 60% of the data and evaluated on the remaining 40%. This approach reduces the likelihood of overfitting by employing a larger test set; however, it may also lead to increased score variability and slightly weaker generalisation due to reduced training data. By combining these different splits, the study provides a comprehensive evaluation of the Logistic Regression model's resilience across datasets with varying feature characteristics.
Fig. 11
Average AUC and F1 scores with optimal hyperparameter configurations.
Regardless of the classification criterion, the Area Under the Curve (AUC) statistic evaluates a model's ability to discriminate between positive and negative classes. As shown in Figure 11, the Breast Cancer dataset achieved the best performance, with an F1 score of 96% and an AUC of 99% across various test sizes. The Digits dataset also performed strongly, recording an AUC of 98% and an F1 score of 88%, though both metrics were slightly lower than those of the Breast Cancer dataset.
In contrast, the Heart Disease dataset demonstrated the weakest results, with an AUC of 62% and an F1 score of only 26%. The Liver Disorder dataset achieved moderate results, with an AUC of 68% and an F1 score of 35%, surpassing Heart Disease but remaining below Breast Cancer and Digits.
These differences can be attributed to dataset characteristics. The Breast Cancer dataset contains a balanced distribution of samples and highly informative features, enabling strong class discrimination. The Digits dataset's high AUC is supported by its large number of observations, simple numerical features, and easily distinguishable categorical targets. Conversely, the Heart Disease dataset's lower discriminatory strength likely stems from overlapping or less distinct features. Although it still performs below Breast Cancer and Digits, the Liver Disorder dataset outperforms Heart Disease due to clearer class separation and fewer, more meaningful attributes that help mitigate overfitting.
Fig. 12
Average and median F1 scores across different test size values.
The Breast Cancer dataset achieved the highest F1 score of 96% across various test sizes, likely due to its highly relevant features, substantial data variability, and balanced class distribution. With an F1 score of 81% at a 0.2 test size, the Digits dataset outperformed both the Liver Disorder and Heart Disease datasets (Figure 12). This superior performance can be attributed to its numerical features, larger sample size, and categorical target variable, all of which facilitate easier learning and classification.
In contrast, the Heart Disease dataset recorded the lowest F1 score of 24% at a 0.4 test size. This poor performance may result from less informative features, class imbalance, overlapping feature spaces, and limited data variability, all of which restrict generalisation. The Liver Disorder dataset demonstrated intermediate performance, achieving an F1 score 45% higher than Heart Disease at a 0.2 test size, yet remaining below Breast Cancer and Digits. This relative improvement may be explained by clearer class separation and a more balanced feature distribution.
On average, the F1 score across all datasets was 0.5995, with a median of 0.6151, indicating moderate overall performance of the Logistic Regression model under varying test size configurations.
Fig. 13
Train and test accuracy of each dataset.
Figure 13 presents the training and testing accuracies for the four datasets: Breast Cancer, Digits, Heart Disease, and Liver Disorder. The Breast Cancer dataset achieved 97% accuracy on both training and testing, indicating strong generalisation to unseen data. The Digits dataset reached 100% training accuracy, suggesting potential overfitting, although its high test accuracy demonstrates that it can still perform well on new inputs.
For the Heart Disease dataset, both training and testing accuracies were 90%, reflecting consistent performance across seen and unseen data. In contrast, the Liver Disorder dataset achieved 69% training accuracy and 79% testing accuracy, indicating possible underfitting. However, the relatively higher test accuracy suggests modest generalisation capability.
Overall, balanced training and testing accuracies are essential for reliable model performance, ensuring that the model captures meaningful patterns without overfitting or underfitting.
Fig. 14
AUC, F1 score, test accuracy, and train accuracy with optimal hyperparameter configurations.
Figure 14 shows that the Breast Cancer and Digits datasets achieved higher AUC, F1, test, and train accuracies compared to the Heart Disease and Liver Disorder datasets when using the penalty (\texttt{l1}), class weight (\texttt{balanced}, \texttt{dict}), and solver (\texttt{liblinear}, \texttt{saga}) parameters. The Breast Cancer dataset achieved accuracies ranging from 90% to 99%, while the Digits dataset ranged from 70% to 98%. In contrast, the Heart Disease and Liver Disorder datasets performed considerably lower, with accuracies between 26%–64% and 35%–67%, respectively. On the Digits dataset, the \texttt{liblinear} solver consistently outperformed \texttt{saga}, whereas \texttt{saga} proved more effective for the Breast Cancer dataset. For Heart Disease and Liver Disorder, performance remained relatively weaker across parameter configurations.
These findings highlight the significant influence of penalties, class weights, and solver choices on model performance. Appropriate hyperparameter tuning, such as applying \texttt{class_weight = balanced} for imbalanced datasets, is essential for improving generalisation and maintaining a balance between bias and variance. Overall, the optimal combination of penalty and solver parameters plays a crucial role in maximising the Logistic Regression model's predictive stability and interpretability.
Fig. 15
Maximum AUC, F1 score, and train--test accuracy across datasets.
With superior AUC, F1, training, and testing accuracies, the Breast Cancer dataset performed best overall, as shown in Figure 15. The Liver Disorder dataset exhibited the lowest performance, followed by the Digits and Heart Disease datasets, both of which demonstrated moderate results. The Breast Cancer dataset showed strong generalisation to unseen data with an exceptional test accuracy of 0.9766, while the Digits dataset achieved a maximum training accuracy of 1.0000, suggesting well-optimised parameter selection.
The model's ability to effectively balance sensitivity and specificity is demonstrated by the highest AUC value of 0.992, observed for the Digits dataset. Similarly, the Breast Cancer dataset achieved the highest F1 score of 0.9815, reflecting an ideal balance between recall and precision. These findings emphasise the critical role of meticulous hyperparameter tuning in enhancing Logistic Regression model performance, ensuring that predictions remain accurate and reliable across diverse datasets.
Fig. 16
Train and test accuracy scatter plot with power trend line.
A fitted power trend line illustrating the general pattern of the data is shown in the scatter plot in Figure 16, which represents the model's performance. With a closely matched test accuracy of 0.94 and a strong training accuracy of 0.985, the model exhibited minimal overfitting and excellent generalisation. Although a small cluster of points between 0.55 and 0.80 indicates moderate performance in certain cases, the majority of observations are concentrated between 0.85 and 1.00, suggesting high predictive reliability.
The sparse distribution of points between 0.25 and 0.50 further highlights specific regions where model optimisation could yield improvements. The statistical indicators reinforce the model's robustness: an $R^2$ value of 0.95 indicates that 95% of the variance in the outcome variable is explained by the predictors, demonstrating an excellent model fit. Furthermore, the low p-value of 0.0001 establishes the statistical significance of this relationship, confirming that the observed results are highly unlikely to have occurred by chance.
Fig. 17
Test accuracy, AUC, and F1 score scatter plot with polynomial trend line.
AUC and F1 are key metrics for evaluating classification models, where the F1 score reflects the balance between precision and recall, and AUC measures the model's ability to distinguish between classes. In regression analysis, $R^2$ quantifies the proportion of variance in the outcome explained by the predictors, while a p-value below 0.05 indicates statistical significance. Our analysis shows moderate model performance, with an AUC of 0.81 and an F1 score of 0.59. After applying a polynomial trend line, the $R^2$ value improved to 0.88, suggesting a strong model fit, while the very low p-value ($p < 0.05$) confirms the statistical significance of the predictor–outcome relationship.
As illustrated in Figure 17, the scatter plot further depicts this relationship: most data points cluster between 0.68 and 1.00, indicating relatively strong performance across these instances, while fewer points fall within the 0.50–0.67 range, reflecting weaker predictive accuracy in certain cases.
Discussion
Our study demonstrates that hyperparameter tuning plays a critical role in improving the performance of Logistic Regression models across multiple datasets, including Breast Cancer, Heart Disease, Liver Disorder, and Digits. By carefully adjusting hyperparameters such as the regularisation parameter ($C$), penalty functions, class weight settings, solver algorithms, and maximum iteration values, we consistently improved accuracy, generalisation, and discrimination. Prior research supports that fine-tuning is essential for maximising predictive performance in both medical and computational applications \cite{R9}.
The results showed that, for every evaluation method, hyperparameter optimisation significantly enhanced model fit. Cross-validation consistently yielded higher accuracy and F1 scores when test splits were 20%, 30%, and 40%, and when folds were 3, 5, and 10. Success was measured using multiple assessment metrics, including AUC, F1 score, test accuracy, train accuracy, and cross-validation (CV) accuracy.
The Digits dataset also exhibited promising results, achieving high training accuracy and AUC values of up to 0.98, despite occasional signs of overfitting. However, the Liver Disorder and Heart Disease datasets showed comparatively lower overall accuracies, with CV values ranging from 0.53 to 0.70. The Breast Cancer dataset, with F1 scores exceeding 0.92 and CV accuracies above 0.93, demonstrated strong generalisation capability to new data. These findings indicate that Logistic Regression performance is highly sensitive to hyperparameter configurations, particularly the choice of penalty type and solver method.
For instance, the \texttt{liblinear} solver with an L1 penalty and balanced class weights achieved the best results on the Digits dataset, whereas the \texttt{saga} solver performed better for larger datasets like Breast Cancer. Class weight balancing proved especially beneficial in mitigating dataset imbalance, leading to more equitable class representation and improved F1 scores \cite{R25}.
Despite these positive results, certain limitations must be acknowledged. As the study focused solely on Logistic Regression models, it remains uncertain how these findings would generalise to more complex algorithms such as Support Vector Machines, Random Forests, or deep learning models. Additionally, although datasets from different domains were utilised, expanding to a broader range of datasets would allow for a more comprehensive analysis of tuning effects in diverse scenarios. Finally, future optimisation strategies, such as Bayesian or evolutionary approaches, may yield superior results compared to the grid-search method employed here \cite{R18}.
In conclusion, our findings underscore that hyperparameter optimisation markedly enhances the performance of Logistic Regression models across multiple datasets and evaluation metrics. Through improved fit, generalisation, and discrimination, hyperparameter tuning ensures more accurate and reliable predictions, reinforcing its importance in both practical machine learning applications and academic research.
Conclusion
Model performance was evaluated using train accuracy, test accuracy, F1 score, AUC, and CV accuracy. The results demonstrate that Logistic Regression performance is significantly improved by tuning hyperparameters such as $C$, penalty type, class weight, number of iterations, and solver choice. This highlights the crucial role of hyperparameter optimisation in improving predictive precision and model robustness. For a more comprehensive assessment, future research should explore alternative tuning strategies and larger, more diverse datasets to further validate and generalise these findings.
List of Abbreviations
LR: Logistic Regression
CV: Cross-Validation
AUC: Area Under the Curve
F1: F1 Score (harmonic mean of precision and recall)
MLE: Maximum Likelihood Estimation
ROC: Receiver Operating Characteristic
TPR: True Positive Rate
FPR: False Positive Rate
TP: True Positives
TN: True Negatives
FP: False Positives
FN: False Negatives
R2: Coefficient of Determination
SVM: Support Vector Machine
RF: Random Forest
k-NN: k-Nearest Neighbors
ANN: Artificial Neural Network
Declarations
Funding: This work was partly supported by the National Natural Science Foundation of China under Grant 82001781, the Science and Technology Foundation of Liaoning Province under Grant 2023MSBA-096, and the Fundamental Research Funds for the Central Universities under Grant N2419003.
Competing interests: The authors declare that they have no competing interests.
Ethics approval and consent to participate: Not applicable.
Consent for publication: Not applicable.
Data availability: All datasets used in this study are publicly available and were obtained from open-access machine learning repositories. The breast cancer, heart disease, liver disorder, and handwritten digit datasets analysed during the current study can be accessed at the following sources: Breast Cancer Dataset, available at the Scikit-learn repository (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html); Heart Disease Dataset, available on OpenML, Data ID 43823 (https://www.openml.org/search?type=data&status=active&id=43823); Liver Disorder Dataset, available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/liver+disorders); and Digits Dataset, available in the Scikit-learn package (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html). All data used in this study are openly accessible, and no restrictions apply. No additional datasets were generated during the study.
Author contributions: M.A. and N.J. contributed equally by designing the study, implementing the models, conducting experiments, and writing the manuscript; A.M. and M.M. supported data preparation, validation, and evaluation; H.N. assisted with literature review and manuscript editing; and G.H. and D.H. supervised the research and provided critical guidance.
Acknowledgements: The authors would like to thank all colleagues and collaborators for their valuable discussions and feedback. The authors also gratefully acknowledge Professor Dianning He for providing funding support that made this research possible, as well as the open-source machine learning libraries and computational resources used in this work.