Developing a Logistic Regression-Based Scoring Tool for Coronary Heart Disease Using the Cleveland Dataset: A Methodological Study

Logistic regression scoring tool for CHD

Heba Ramadan¹

Bachelor of Pharmacy

¹Pharmacist, Pharmacy Department, Agamy Medical District, Ministry of Health and Population, Agamy, Alexandria, Egypt

Abstract

This study proposes a methodological approach for transforming statistical models into a practical screening tool for coronary heart disease (CHD). The Cleveland Heart Disease dataset, which includes 13 clinical and demographic variables, was analyzed using logistic regression. Two models were developed: a binary logistic regression to predict the presence or absence of CHD, and an ordinal logistic regression to examine disease severity across angiographic categories. Regression coefficients were then adapted into a simplified points-based system suitable for implementation in quiz-style Google Forms. Each response option was assigned a weighted score, enabling automated calculation of risk upon completion of the form. The aim was to create an educational, doctor-oriented tool that highlights short-term risk factors and provides immediate feedback on both disease presence and potential severity. While this approach demonstrates how statistical modeling can be translated into an interactive and user-friendly format, limitations include the modest size and historical nature of the Cleveland dataset and the absence of external validation. The tool is not proposed as a validated instrument but as an example of how statistical models can be transformed into interactive and user-friendly screening formats for educational use.

Keywords

Cleveland dataset

logistic regression

coronary heart disease

screening tool

questionnaire

1. Introduction

Coronary heart disease (CHD) remains one of the leading causes of morbidity and mortality worldwide, and its early detection continues to be a challenge for both clinicians and health systems (Celermajer, Chow, Marijon, Anstey, & Woo, 2012). A variety of risk prediction models and questionnaires have been developed over the years to support screening and guide preventive interventions. These models often rely on long-term cohort data and are designed to predict outcomes such as 10-year cardiovascular risk. While clinically valuable, such approaches may not directly address short-term diagnostic or educational needs.

The Heart Disease dataset collection provides a unique opportunity for developing alternative approaches. This collection comprises four subsets (Cleveland, Hungary, Switzerland, and VA Long Beach), each based on patients undergoing coronary angiography. Among them, the Cleveland dataset has become the most frequently used in research, largely because it contains 303 well-documented cases with minimal missing values and higher internal consistency than the other subsets. According to the UCI Machine Learning Repository, it has been cited more than 60 times and viewed thousands of times as a benchmark dataset. However, in nearly all these instances it has served primarily as a testbed for statistical and machine learning algorithms, with little emphasis on its medical context or potential application in clinical education (Janosi, Steinbrunn, Pfisterer, & Detrano, 1989).

Revisiting the Cleveland dataset from a medical perspective offers an opportunity to bridge this gap. By applying logistic regression to all 13 variables and translating the results into a scoring-based questionnaire, it is possible to demonstrate how traditional statistical modeling can be repurposed into a practical screening and educational tool. Such an approach may provide doctors with a simple way to explore the short-term correlates of CHD, while also serving as a proof of concept for integrating data-driven methods into user-friendly formats.

2. Methods

2.1. Data sources

Data were obtained from the Cleveland Heart Disease dataset, which includes 303 patients evaluated by coronary angiography. The primary outcome was the presence of coronary heart disease, defined as ≥ 50% narrowing in at least one major coronary vessel. In addition, disease severity was coded on an ordinal 0–4 scale, where 0 represented no disease, 1 mild disease (one vessel affected), 2 moderate disease (two vessels affected), 3 significant disease (three vessels affected), and 4 severe disease (four vessels affected, including cases with left main coronary artery stenosis).

2.2. Independent variables

Thirteen variables were considered:

age: in years

sex: (0 = female, 1 = male)

cp: chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)

trestbps: resting systolic blood pressure (mmHg)

chol: serum cholesterol (mg/dl)

fbs: fasting blood sugar (1 = fasting blood sugar > 120 mg/dl, 0 = fasting blood sugar ≤ 120 mg/dl)

restecg: resting electrocardiographic results (0 = normal, 1 = ST-T abnormality, 2 = left ventricular hypertrophy)

thalach: maximum heart rate achieved during exercise (beats per minute, bpm)

exang: exercise-induced angina (0 = no, 1 = yes)

oldpeak: ST depression induced by exercise relative to rest (0 = no ST depression, 0–1 mm = healthy individuals, ≥ 1 mm suggestive for ischemia, ≥ 2 mm strong indication of coronary artery disease)

slope: slope of peak exercise ST segment (1 = upsloping; normal, 2 = flat; early sign of ischemia, 3 = down-sloping; strongly associated with myocardial ischemia or coronary artery disease)

ca: number of major vessels colored by fluoroscopy (0–3)

thal: thalassemia status (3 = normal, 6 = fixed defect, 7 = reversible defect)

2.3. Statistical analysis

All analyses were conducted using IBM SPSS 25. Two models were applied:

Binary logistic regression, with the dependent variable coded as 0 (no disease) versus 1 (disease present, categories 1–4 combined).

Ordinal logistic regression, with the dependent variable coded on the original 0–4 severity scale.

All 13 variables were entered as independent predictors. For categorical variables, dummy coding was applied. Model performance was evaluated using goodness-of-fit statistics, odds ratios, and predictive accuracy measures. Regression coefficients were then rescaled into integer points to create a simplified scoring system suitable for implementation in an interactive questionnaire. This scoring system was designed to provide immediate feedback regarding the presence of coronary heart disease.

2.4. Questionnaire development

Regression-based weights were adapted into a simplified scoring system. Each variable response was assigned a point value, and total scores were divided into categories representing disease risk. The system was implemented in quiz-style Google Forms, allowing automatic scoring and immediate feedback.

3. Results

3.1. Descriptive statistics

Among the 303 participants, about two-thirds were male (68.0%), while females represented 32.0%. The mean age was 54.4 ± 9.0 years (range: 29–77). Nearly half of the sample (47.5%) were asymptomatic, whereas 28.4% experienced non-anginal pain, 16.5% atypical angina, and only 7.6% typical angina.

Most participants had normal fasting blood sugar levels (85.1%), with 14.9% showing elevated levels (> 120 mg/dl). Resting electrocardiographic results were almost equally distributed between normal findings (49.8%) and left ventricular hypertrophy (48.8%), while ST-T abnormalities were uncommon (1.3%).

The mean resting systolic blood pressure was 131.7 ± 17.6 mmHg (range: 94–200). Categorically, 44.6% had normal systolic pressure (< 130 mmHg), 23.1% had stage 1 hypertension (130–139 mmHg), and 32.3% had stage 2 hypertension (≥ 140 mmHg). Mean serum cholesterol was 246.7 ± 51.8 mg/dl, with 16.2% below 200 mg/dl, 32.3% between 200 and 239 mg/dl, and 51.5% at or above 240 mg/dl. Maximum heart rate averaged 149.6 ± 22.9 bpm (range: 71–202). The mean ST-segment depression during exercise was 1.04 ± 1.16 mm, with 32.7% showing no depression, 22.1% showing less than 1 mm, and 45.2% showing 1 mm or more.

Exercise-induced angina was reported in about one-third of patients (32.7%), while the majority did not experience angina on exertion (67.3%). The slope of the peak exercise ST segment was either upsloping (46.9%) or flat (46.2%) in most patients, with only 6.9% demonstrating the clinically adverse down-sloping pattern.

Fluoroscopy revealed no major vessels colored in 179 patients (59.1%), while one, two, and three vessels were colored in 66 (21.8%), 38 (12.5%), and 20 (6.6%) patients, respectively.

Thalassemia-related results showed that 55.1% of participants were classified as normal, 38.9% exhibited a reversible defect, and 5.9% a fixed defect.

Regarding coronary heart disease (CHD) outcomes, 54.1% had no angiographic evidence of disease, while 45.9% had some degree of disease. On the ordinal severity scale, 18.2% had mild, 11.9% moderate, 11.6% significant, and 4.3% severe disease (Tables 1, 2).

Table 1
Descriptive statistics of categorical variables in the Cleveland Heart Disease dataset (N = 303)
Variable	Category	Frequency (n)	Percent (%)
Sex	Female	97	32.0
Sex	Male	206	68.0
Chest pain type	Typical angina	23	7.6
	Atypical angina	50	16.5
	Non-anginal pain	86	28.4
	Asymptomatic	144	47.5
Fasting blood sugar	Normal (≤ 120 mg/dl)	258	85.1
Fasting blood sugar	Elevated (> 120 mg/dl)	45	14.9
Resting ECG results	Normal	151	49.8
	ST-T abnormality	4	1.3
	Left ventricular hypertrophy	148	48.8
Resting systolic blood pressure	Normal (< 130 mmHg)	135	44.6
	Stage 1 hypertension (130–139 mmHg)	70	23.1
	Stage 2 hypertension (≥ 140 mmHg)	98	32.3
Serum cholesterol	Normal (< 200 mg/dl)	49	16.2
	Elevated (200–239 mg/dl)	98	32.3
	High (≥ 240 mg/dl)	156	51.5
ST depression	No ST depression	99	32.7
	> 0 to < 1 mm	67	22.1
	≥ 1 mm	137	45.2
Exercise-induced angina	No	204	67.3
Exercise-induced angina	Yes	99	32.7
Slope of ST-segment	Upsloping (normal)	142	46.9
	Flat	140	46.2
	Down-sloping	21	6.9
Thalassemia status	Normal	167	55.1
	Fixed defect	18	5.9
	Reversible defect	118	38.9
No. of major vessels (fluoroscopy)	0	179	59.1
	1	66	21.8
	2	38	12.5
	3	20	6.6
CHD severity (0–4 scale)	No disease (0)	164	54.1
	Mild (1 vessel)	55	18.2
	Moderate (2 vessels)	36	11.9
	Significant (3 vessels)	35	11.6
	Severe (4 vessels)	13	4.3
CHD presence (binary)	Absent	164	54.1
CHD presence (binary)	Present	139	45.9

Table 2
Descriptive statistics of continuous variables
Variable	Mean ± SD	Median	Minimum	Maximum
Age (years)	54.44 ± 9.04	56.00	29	77
Resting systolic blood pressure (mmHg)	131.69 ± 17.60	130	94	200
Serum cholesterol (mg/dl)	246.69 ± 51.78	241	126	564
Maximum heart rate achieved (bpm)	149.61 ± 22.88	153	71	202
ST depression (mm)	1.04 ± 1.16	0.8	0	6.2
No. of major vessels (fluoroscopy)	0.67 ± 0.93	0	0	3

3.2. Logistic regression

3.2.1. Assumption checks

Prior to regression analysis, data were screened for suitability (Schober & Vetter, 2021). All categorical predictors and dependent variables were pre-coded in SPSS v25, and no missing values were present in the dataset. The independence of observations was assumed, as each case represented a unique patient.

For continuous variables (age, resting blood pressure, serum cholesterol, maximum heart rate, and ST-segment depression), the linearity of the logit was assessed using the Box-Tidwell approach. None of the interaction terms were statistically significant (p > 0.05), indicating that the assumption of linearity was met.

Multicollinearity among predictors was assessed using variance inflation factors (VIF) derived from an auxiliary linear regression. All predictors showed tolerance values greater than 0.4 and VIF values below 2.5, confirming the absence of problematic multicollinearity among predictors.
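The tolerance and VIF thresholds quoted above describe the same boundary seen from two sides. A minimal sketch, assuming the standard definitions (tolerance is 1 minus the R² obtained when a predictor is regressed on the other predictors, and VIF is the reciprocal of tolerance); the function names are illustrative and do not come from the SPSS output:

```python
def tolerance_from_r2(r_squared: float) -> float:
    """Tolerance of a predictor, given the R^2 from regressing it on the other predictors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 - r_squared


def vif_from_tolerance(tolerance: float) -> float:
    """Variance inflation factor, defined as the reciprocal of tolerance."""
    if not 0.0 < tolerance <= 1.0:
        raise ValueError("tolerance must be in (0, 1]")
    return 1.0 / tolerance


# A tolerance of 0.4 corresponds to a VIF of 2.5, so "tolerance > 0.4"
# and "VIF < 2.5" are equivalent statements.
boundary_vif = vif_from_tolerance(0.4)
print(f"VIF at tolerance 0.4: {boundary_vif:.2f}")
```

Under these definitions, a predictor meeting the reported thresholds shares less than 60% of its variance with the remaining predictors.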

3.2.2. Binary logistic regression

The overall fit of the binary logistic regression model was highly satisfactory. The omnibus test of model coefficients indicated that the model significantly improved prediction compared with the null model (χ²(18) = 224.197, p < 0.001), confirming that the included predictors were collectively associated with the presence of coronary heart disease. The Cox & Snell R² was 0.523 and the Nagelkerke R² was 0.699, suggesting that the predictors explained approximately 52% to 70% of the variance in disease status. Model calibration was also acceptable, as evidenced by the Hosmer-Lemeshow test (χ²(8) = 7.093, p = 0.527), which indicated no evidence of poor fit. The classification table showed good in-sample classification performance, with an overall accuracy of 87.5%, sensitivity of 82.7%, and specificity of 91.5%.
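As a consistency check, the three classification figures can be recomputed from a 2×2 classification table. The sketch below is illustrative only: the cell counts are an assumption, back-calculated from the reported percentages and the group sizes in Table 1 (139 patients with CHD, 164 without), and are not taken from the SPSS output.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity, and accuracy from the cells of a 2x2 classification table."""
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),   # correctly classified among patients with CHD
        "specificity": tn / (tn + fp),   # correctly classified among patients without CHD
        "accuracy": (tp + tn) / total,   # correctly classified overall
    }


# ASSUMED counts, back-calculated from the reported percentages:
# 115 of 139 patients with CHD and 150 of 164 patients without CHD classified correctly.
metrics = classification_metrics(tp=115, fn=24, tn=150, fp=14)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these assumed counts the sketch reproduces the reported 82.7%, 91.5%, and 87.5%.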

The reference groups for categorical variables were: asymptomatic (chest pain type), reversible defect (thalassemia status), left ventricular hypertrophy (resting ECG results), down-sloping (ST segment slope), yes (exercise-induced angina), high blood sugar (fasting blood sugar), and male (sex).

Significant predictors of CHD were sex, chest pain type, systolic blood pressure, number of major vessels, and thalassemia status.

Sex: females had 78.3% lower odds of CHD compared to males (B = -1.526, OR = 0.217, p = 0.004).

Chest pain: compared to asymptomatic patients, those with typical angina had 87.9% lower odds (B = -2.11, OR = 0.121, p = 0.001), and those with anginal pain had 84.7% lower odds (B = -1.876, OR = 0.153, p < 0.001).

Resting systolic blood pressure: each 1 mmHg increase was associated with a 2.5% increase in the odds of CHD (B = 0.024, OR = 1.025, p = 0.031).

Number of major vessels: each additional vessel increased the odds of CHD by 271.2% (B = 1.312, OR = 3.712, p < 0.001).

Thalassemia status: patients with normal perfusion had 74.7% lower odds of CHD compared to those with a reversible defect (B = -1.373, OR = 0.253, p = 0.001).
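The odds ratios and percentage changes listed above follow from the coefficients through OR = exp(B). A minimal sketch of that conversion, using the coefficients as printed; the labels and function names are illustrative, and because the printed B values are rounded, the recomputed figures can differ from the reported ones in the last digit:

```python
import math


def odds_ratio(b: float) -> float:
    """Odds ratio for a one-unit increase (or for a category versus its reference)."""
    return math.exp(b)


def percent_change_in_odds(b: float) -> float:
    """Percentage change in the odds implied by coefficient b (negative means lower odds)."""
    return (math.exp(b) - 1.0) * 100.0


# Coefficients copied from the binary logistic regression results in the text.
reported_b = {
    "female vs. male": -1.526,
    "typical angina vs. asymptomatic": -2.11,
    "resting systolic blood pressure (per mmHg)": 0.024,
    "major vessels (per vessel)": 1.312,
    "normal perfusion vs. reversible defect": -1.373,
}

for label, b in reported_b.items():
    print(f"{label}: OR = {odds_ratio(b):.3f}, change in odds = {percent_change_in_odds(b):+.1f}%")
```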

3.2.3. Ordinal logistic regression

An ordinal logistic regression model was fitted to identify predictors of CHD severity, which was categorized as no disease (54.1%), mild (18.2%), moderate (11.9%), significant (11.6%), and severe (4.3%). Among the clinical variables, the number of major vessels visualized by fluoroscopy (Estimate = 0.890, p < 0.001) was the strongest predictor, with higher vessel counts markedly increasing the likelihood of falling into a more severe CHD category. Chest pain type was also significantly associated with CHD severity: compared to asymptomatic patients, those presenting with typical angina (Estimate = -1.715, p = 0.002), atypical angina (Estimate = -1.090, p = 0.023), and non-anginal pain (Estimate = -1.536, p < 0.001) had significantly lower odds of being in a higher severity category. Male sex was independently associated with more severe disease (Estimate for female relative to male = -0.997, p = 0.006). Thalassemia status showed a strong relationship with severity: a reversible perfusion defect was associated with greater odds of more severe disease than a normal scan (Estimate for normal relative to reversible defect = -1.310, p < 0.001). Resting systolic blood pressure (Estimate = 0.013, p = 0.099), ST-segment depression during exercise (oldpeak, Estimate = 0.276, p = 0.054), and thalassemia with fixed defect (Estimate = -0.946, p = 0.058) approached statistical significance. Other variables, including age, serum cholesterol, fasting blood sugar, maximum heart rate, resting ECG results, exercise-induced angina, and slope of the ST segment, did not show significant associations.

3.3. Development of a scoring system from ordinal logistic regression

Based on the ordinal logistic regression model, the predictors of CHD severity that were significant or approached significance were the number of major vessels visualized by fluoroscopy, chest pain type, sex, thalassemia status, resting systolic blood pressure, and ST-segment depression during exercise. To translate these findings into a practical scoring system, regression coefficients were rescaled into integer points reflecting their relative contribution to CHD severity. Points were assigned by treating each 0.3 increase in the log-odds (B coefficient) as approximately one point, so that predictors with larger B coefficients contributed proportionally more points to the total score. For categorical variables, the lowest-risk category was assigned zero points and the other categories were scored relative to it.

Male sex was assigned + 3 points and female sex 0 points. For chest pain type, asymptomatic patients had the highest risk and were assigned + 6 points, whereas typical angina was scored 0, atypical angina + 1, and non-anginal pain + 1. Thalassemia status was scored as reversible defect + 4 points (highest risk), fixed defect + 1, and normal perfusion 0. The number of major vessels was weighted as + 3 points for each affected vessel (range 0–9). For resting systolic blood pressure (B = 0.013, p = 0.099), each 1 mmHg increase adds 0.013 to the log-odds, so about 23 mmHg (0.3/0.013 ≈ 23) is needed per point; for simplicity, this was rounded to 20 mmHg per point. For oldpeak (B = 0.276, p = 0.054), each 1-unit increase adds 0.276 to the log-odds, which is close to 0.3 and therefore translates to roughly 1 point per 1-unit increase.

This produced a total score ranging from 0 (lowest risk: female, typical angina, normal perfusion, no vessel involvement, resting systolic blood pressure < 120 mmHg, ST-depression during exercise < 1.3 mm) to 28 (highest risk: male, asymptomatic, reversible defect, three-vessel involvement, resting systolic blood pressure > 160 mmHg, ST-depression during exercise > 3.5 mm) (Table 3).
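The conversion rule described above (about one point per 0.3 log-odds) can be sketched for the predictors with a single coefficient. This is an illustration under that stated rule, not the exact procedure used for every category; the constant and function names are hypothetical, and the multi-category predictors (chest pain type, thalassemia status) are left out because their points are simply read from Table 3.

```python
LOG_ODDS_PER_POINT = 0.3  # scaling rule stated in the text


def points_per_unit(coefficient: float) -> float:
    """Unrounded points contributed by a one-unit change in the predictor."""
    return abs(coefficient) / LOG_ODDS_PER_POINT


def units_per_point(coefficient: float) -> float:
    """Change in the predictor needed to accumulate one point."""
    return LOG_ODDS_PER_POINT / abs(coefficient)


# Coefficients copied from the ordinal logistic regression results.
vessel_points = round(points_per_unit(0.890))    # 2.97 rounds to 3 points per vessel
sex_points = round(points_per_unit(-0.997))      # 3.32 rounds to 3 points for male sex
oldpeak_points = round(points_per_unit(0.276))   # 0.92 rounds to 1 point per mm
sbp_mmhg_per_point = units_per_point(0.013)      # about 23 mmHg, simplified to 20 in the text

print(vessel_points, sex_points, oldpeak_points, round(sbp_mmhg_per_point, 1))
```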

Table 3
Weighted scoring system for predictors of cardiovascular risk
Predictor	Category	Points
Sex	Female	0
Sex	Male	+ 3
Chest pain type	Typical angina	0
	Atypical angina	+ 1
	Non-anginal pain	+ 1
	Asymptomatic	+ 6
Thalassemia status	Normal perfusion	0
	Fixed defect	+ 1
	Reversible defect	+ 4
Number of major vessels	0	0
	1	+ 3
	2	+ 6
	3	+ 9
Resting systolic blood pressure (mmHg)	< 120	0
	120–140	+ 1
	140–160	+ 2
	> 160	+ 3
ST-depression during exercise (mm)	< 1.3	0
	1.3–2.3	+ 1
	2.3–3.5	+ 2
	> 3.5	+ 3
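The weights in Table 3 can be applied programmatically, which is essentially what the questionnaire does when it sums the selected options. The following is a minimal sketch rather than the form's actual implementation; all names are hypothetical. Because adjacent bands in the table share an endpoint, the sketch assumes that a value lying exactly on a boundary (for example 140 mmHg) is scored in the higher band.

```python
SEX_POINTS = {"female": 0, "male": 3}
CHEST_PAIN_POINTS = {
    "typical angina": 0,
    "atypical angina": 1,
    "non-anginal pain": 1,
    "asymptomatic": 6,
}
THAL_POINTS = {"normal": 0, "fixed defect": 1, "reversible defect": 4}
VESSEL_POINTS = {0: 0, 1: 3, 2: 6, 3: 9}

# Lower bounds of the 1-, 2-, and 3-point bands (ASSUMED lower-inclusive).
SBP_BOUNDS = (120, 140, 160)
OLDPEAK_BOUNDS = (1.3, 2.3, 3.5)


def banded_points(value: float, lower_bounds: tuple) -> int:
    """One point for every band boundary the value has reached."""
    return sum(value >= bound for bound in lower_bounds)


def total_score(sex: str, chest_pain: str, thal: str,
                vessels: int, sbp: float, oldpeak: float) -> int:
    """Sum the Table 3 points for one patient (possible range 0 to 28)."""
    return (
        SEX_POINTS[sex]
        + CHEST_PAIN_POINTS[chest_pain]
        + THAL_POINTS[thal]
        + VESSEL_POINTS[vessels]
        + banded_points(sbp, SBP_BOUNDS)
        + banded_points(oldpeak, OLDPEAK_BOUNDS)
    )


# Hypothetical patient: male, asymptomatic, normal perfusion, one vessel,
# systolic pressure 145 mmHg, ST depression 1.5 mm.
example = total_score("male", "asymptomatic", "normal", 1, 145, 1.5)
print(example)  # 3 + 6 + 0 + 3 + 2 + 1 = 15
```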

After applying this scoring system to the 303 patients, the score distribution within each diagnostic category was examined. Patients with no disease had a mean score of 6.64 and a median of 6.00 (range: 1–22). Those with mild disease had a mean of 12.58 and a median of 12.00 (range: 4–24), while moderate disease corresponded to a mean of 16.44 and a median of 17.00 (range: 9–24). For significant disease, the mean score was 17.63 and the median was 17.00 (range: 8–27), and patients with severe disease had the highest scores, with a mean of 18.82 and a median of 19.00 (range: 10–25). Using these observed distributions, cutoff points were derived to reflect the progression of CHD severity: scores of 0–9 were categorized as no disease, 10–13 as mild disease, 14–16 as moderate disease, 17–18 as significant disease, and 19–28 as severe disease. These cutoffs were therefore determined empirically from the patient data, so that the scoring system aligns with the actual distribution of disease categories in the study population (Table 4).

Table 4
Disease severity classification based on total score
Category	Cutoff point
No disease	0–9
Mild disease	10–13
Moderate disease	14–16
Significant disease	17–18
Severe disease	19–28
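The cutoffs in Table 4 amount to a simple lookup from total score to severity category. A minimal sketch of that lookup, with hypothetical names, assuming integer scores in the 0 to 28 range:

```python
# (lowest score, highest score, category) taken from Table 4.
SEVERITY_BANDS = (
    (0, 9, "No disease"),
    (10, 13, "Mild disease"),
    (14, 16, "Moderate disease"),
    (17, 18, "Significant disease"),
    (19, 28, "Severe disease"),
)


def classify_severity(score: int) -> str:
    """Return the Table 4 category for a total score."""
    for lowest, highest, category in SEVERITY_BANDS:
        if lowest <= score <= highest:
            return category
    raise ValueError(f"score {score} is outside the 0-28 range")


print(classify_severity(15))  # Moderate disease
```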

The significant variables and categories have been converted into questions in this Google Form: https://docs.google.com/forms/d/e/1FAIpQLSfT1-pi_WM5lSH0OwNkzn-Iu1SHajNxaMm00kivlzpPkhbTOg/viewform. Each choice is assigned a weight or point value. Google Forms, by default, does not support weighted questions, so the grades or final scores are not displayed immediately after submission. However, there are add-ons for Google Forms, such as Formfacade, that can be linked to the form to provide immediate, interactive, and user-friendly results.

4. Discussion

This study illustrates how logistic regression models can be transformed into interactive questionnaires for educational use in CHD screening. The binary model allows distinction between individuals with and without disease, while the ordinal model provides additional insight into severity categories. Embedding the scoring system into Google Forms demonstrates how statistical results can be translated into practical, user-friendly formats accessible to doctors in routine settings.

Compared with established tools such as the Framingham Risk Score and QRISK, the proposed questionnaire differs in both scope and intention. Framingham and QRISK are validated risk calculators designed to estimate long-term cardiovascular risk. In contrast, this tool is exploratory, based on a smaller dataset, and targets short-term indicators of angiographically defined CHD. Its primary value lies in demonstrating methodological feasibility rather than replacing existing instruments.

Key limitations of this study should be acknowledged. First, the dataset itself is modest in size, with only 303 patients, and originates from the 1980s. As a historical dataset, it reflects the medical practices and population characteristics of that era, which may differ from contemporary lifestyles, risk factor distributions, and diagnostic standards. Second, regression coefficients were simplified into integer points to allow scoring, a step that may reduce precision. Third, the model has not undergone external validation, which restricts its generalizability. Finally, several predictors in the Cleveland dataset require specialized clinical testing (e.g., exercise ECG, fluoroscopy), limiting the practicality of the derived tool outside professional or research contexts.

5. Conclusion

Transforming logistic regression models into questionnaire-based scoring systems provides a practical example of how statistical results can be applied in user-friendly formats. While this tool is limited by data size, simplification, and lack of validation, it serves as a proof of concept that bridges statistical modeling with interactive screening approaches.

Conflict of interest

No conflict of interest

Funding sources

No funding was received for this research

Acknowledgement

Nil

References

Celermajer DS, Chow CK, Marijon E, Anstey NM, Woo KS (2012) Cardiovascular Disease in the Developing World. JACC 60(14):1207–1216. https://doi.org/10.1016/j.jacc.2012.03.074

Janosi A, Steinbrunn W, Pfisterer M, Detrano R (1989) Heart Disease [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X

Schober P, Vetter TR (2021) Logistic Regression in Medical Research. Anesth Analg 132(2):365–366. https://doi.org/10.1213/ane.0000000000005247
