Risk factors for osteoporotic fractures in postmenopausal women: Evidence from the China Health and Nutrition Survey
XiaoJuan Zeng1+, YiQiang Zhang2+, Xia Liu1*
1 Department of Gynecology, Fujian Provincial Hospital, Fuzhou University, No. 134, Dongjie Street, Gulou District, Fuzhou City, Fujian Province, 350001, China
2 Department of Rehabilitation, Fuzhou First General Hospital, No. 190, Dadao Road, Taijiang District, Fuzhou City, Fujian Province, 350009, China
Abstract
Objective
To develop and validate a machine learning model that integrates indicators of educational level and nutritional intake to predict fractures associated with postmenopausal osteoporosis, while also clarifying the roles these factors play in disease prediction.
Methods
Data were sourced from the China Health and Nutrition Survey, with a focus on important aspects such as nutritional intake and educational levels. To improve the model's accuracy, additional factors like physical body shape indicators, blood biochemical markers, and pain conditions were also included. To simplify the data and uncover underlying patterns, principal component analysis (PCA) was applied to a dataset that included various variables. The models constructed for analysis comprised decision trees, k-nearest neighbors (KNN), logistic regression, Gaussian naive Bayes, random forests, and support vector machines (SVM). To prevent overfitting, a ten-fold cross-validation method was utilized to systematically evaluate and compare the performance of these models. Furthermore, SHapley Additive exPlanation (SHAP) values were calculated to assess the predictive contribution of each feature in the model that performed the best.
Results
This analysis involved 1,157 participants, among whom 558 experienced fractures related to postmenopausal osteoporosis. Principal component analysis yielded five components, which were used as input features for the machine learning models. The random forest classifier achieved the highest accuracy (0.6695) and the highest area under the receiver operating characteristic curve (0.6852). It also showed balanced sensitivity (0.6548) and specificity (0.6833). SHAP analysis indicated that educational level and nutritional intake indicators contributed most to the model's predictions.
Conclusion
The random forest model proved to be the most effective tool for predicting the risk of fractures related to postmenopausal osteoporosis. The analysis using SHAP values underscored the significance of educational level and nutritional intake as key factors influencing the model's predictions.
Background
Postmenopausal osteoporosis (PMOP) primarily arises from a significant decline in estrogen levels in women following menopause (1). This hormonal shift results in a decrease in bone mineral content and adversely affects the intricate microstructure of bones, leading to diminished bone strength and an elevated risk of fractures and associated complications. In the initial stages, many patients may remain asymptomatic; however, as the condition advances, a range of symptoms can emerge. Patients often experience a reduction in height due to vertebral body compression, and over time, this can lead to kyphosis, characterized by a hunched-back deformity. Additionally, generalized pain, particularly in the back and hips, is a common complaint. The most severe consequence of osteoporosis is fractures, which typically result in intense pain and can significantly restrict a patient's mobility. Together, these symptoms profoundly affect the physical and mental well-being of postmenopausal women. As the population continues to age, the incidence and mortality rates associated with osteoporosis and its resultant fractures are rising annually, creating a substantial and increasing economic burden on healthcare systems worldwide (2–5).
+ XiaoJuan Zeng and YiQiang Zhang contributed equally to this work.
* Corresponding author.
E-mail addresses: 2351830540@qq.com (XiaoJuan Zeng), 1040863673@qq.com (YiQiang Zhang), liuxia200808@163.com (Xia Liu)
This study analyzed risk factors for osteoporosis-related fractures in postmenopausal women using data from the China Health and Nutrition Survey (CHNS) from 1993 to 2015 to enhance early monitoring and intervention.
Materials and methods
Data source
The China Health and Nutrition Survey (CHNS) is a longitudinal study carried out by an international research consortium. It began in 1989 and has since included follow-up surveys in 1991, 1993, 1997, 2000, 2004, 2006, 2009, 2011, and 2015, with each survey wave typically lasting seven days. The CHNS uses a multi-stage random cluster sampling process that covers approximately 7,200 households and more than 30,000 individuals across 15 provinces and municipalities in China (6). The study integrates various disciplines such as health science, nutrition, sociology, demography, economics, and public policy. Its dataset encompasses community, household, individual, and health surveys, as well as nutrition and physical fitness assessments, food market evaluations, and questions pertaining to health and family planning (7). The CHNS instruments and protocols, along with the informed consent process, were approved by the appropriate institutional review boards, and all participants provided written consent before the surveys began. The CHNS has become a valuable resource for researchers and policymakers, offering insights into dietary patterns, nutritional health, and the occurrence of chronic diseases within the Chinese population.
Study population
This study focused on women who had gone through menopause by 1993. We used data from the eight survey waves conducted between 1993 and 2015 (1993, 1997, 2000, 2004, 2006, 2009, 2011, and 2015). Women with missing osteoporosis data during these follow-ups were excluded. Since menopausal status was only recorded in the 1993 survey, we used the average menopausal age for Chinese women, which is 50 years, to define our study population, selecting women aged 50 and older. The CHNS is a nationwide study that includes several provinces across China, such as Guizhou in the west, Hunan, Hubei, and Henan in the central region, and Liaoning, Jiangsu, Shandong, and Guangxi in the east. The study received approval from the Ethics Review Committee of the Provincial Hospital Affiliated with Fuzhou University, as it utilized data from the publicly accessible CHNS database.
Outcome measures and data extraction
Body mass index (BMI) was calculated as weight in kilograms divided by the square of height in meters. Participants were categorized into two groups based on the occurrence of postmenopausal osteoporosis-related fractures: a control group (Group A) and a fracture group (Group B). The variables collected included age in years, height in centimeters, weight in kilograms, and educational attainment, which was classified as no education, primary school, junior high school, senior high school, or university and above. The study also extracted years of education and dietary intake averaged over three days. Dietary intake included energy intake in kilocalories per day and macronutrient intakes (carbohydrates, fats, and proteins) in grams per day. Other variables included levels of apolipoprotein A (APO-A in mg/dL), high-density lipoprotein cholesterol (HDL-C in mg/dL), low-density lipoprotein cholesterol (LDL-C in mg/dL), and apolipoprotein B (APO-B in mg/dL), as well as the presence of pain conditions over the past four weeks, recorded as yes or no.
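As an illustration of the BMI definition above, the following minimal Python sketch computes BMI from weight and height and assigns the three BMI bands that appear in Table 1 (below 18.5, 18.5 to 24.0, above 24.0). The function names are hypothetical and are not taken from the authors' analysis code.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def bmi_category(value: float) -> str:
    """Assign the BMI band used in Table 1 (cut-offs 18.5 and 24.0)."""
    if value < 18.5:
        return "underweight"
    if value <= 24.0:
        return "normal weight"
    return "overweight"


example = bmi(60.0, 160.0)  # a hypothetical participant
print(round(example, 2), bmi_category(example))
```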
Statistical analysis
We used principal component analysis (PCA) and survey-weighted statistical models to characterize participants with and without postmenopausal osteoporosis-related fractures (8, 9). Normally distributed continuous data were compared using t-tests and are presented as mean ± standard deviation (x̄ ± s). Non-normally distributed continuous data were compared using the Mann-Whitney U rank-sum test and are expressed as median and interquartile range [M (Q1, Q3)]. Categorical data, presented as counts and percentages [n (%)], were compared using either the chi-square test or Fisher's exact test.
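To make the chi-square comparison concrete, the sketch below computes the Pearson chi-square statistic for a 2×2 table using the pain-condition counts reported in Table 1 (17 of 599 in Group A and 33 of 558 in Group B). It assumes no continuity correction, which is an assumption about the analysis rather than something stated in the text; the function name is illustrative.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the table [[a, b], [c, d]] without continuity correction."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Rows: Group A, Group B; columns: pain reported, no pain reported.
chi2 = chi_square_2x2(17, 599 - 17, 33, 558 - 33)
print(round(chi2, 3))
```

Under this assumption the statistic comes out at roughly 6.61, which agrees with the value listed for pain condition in Table 1.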
PCA was employed to reduce the dimensionality of a dataset comprising age, education level, dietary intake (calories, carbohydrates, fat, and protein), body shape indicators (height, weight, and BMI), blood biochemistry markers (APO-A, HDL-C, LDL-C, and APO-B), and pain conditions. Sampling adequacy was evaluated with the Kaiser-Meyer-Olkin (KMO) test, and Bartlett's test of sphericity was used to ascertain the suitability of the data for factor analysis (10). PCA reduces the dimensionality of the dataset while taking into account both categorical and continuous variables, which helps to reduce multicollinearity and improves the efficiency of the model. The features extracted through PCA were subsequently used to develop the models.
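The dimensionality-reduction step can be sketched as follows. This is a minimal NumPy illustration, not the authors' pipeline: it runs PCA on the correlation matrix of standardized variables and keeps components with eigenvalues greater than 1 (the Kaiser criterion). The synthetic data, the function name `pca_kaiser`, and the omission of varimax rotation and of the KMO and Bartlett tests are all simplifying assumptions.

```python
import numpy as np


def pca_kaiser(X: np.ndarray):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each column
    corr = np.corrcoef(Z, rowvar=False)                # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                               # Kaiser criterion
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])  # unrotated loadings
    scores = Z @ eigvecs[:, keep]                      # component scores
    explained = eigvals / eigvals.sum()                # share of total variance
    return eigvals, loadings, scores, explained


# Synthetic example: four variables driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = np.column_stack([
    latent[:, 0] + 0.1 * rng.normal(size=200),
    latent[:, 0] + 0.1 * rng.normal(size=200),
    latent[:, 1] + 0.1 * rng.normal(size=200),
    latent[:, 1] + 0.1 * rng.normal(size=200),
])
eigvals, loadings, scores, explained = pca_kaiser(X)
print(loadings.shape, scores.shape)
```

The component scores returned here play the role of the extracted features that are passed on to the classifiers.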
This study utilized a variety of machine learning algorithms, such as decision trees, k-nearest neighbors (KNN), logistic regression, Gaussian naive Bayes, random forests, and support vector machines (SVM). To avoid overfitting, we implemented a ten-fold cross-validation technique, where 90% of the data was allocated for training and the remaining 10% was set aside for validation. This process was repeated ten times to ensure robustness. During the training phase, we fine-tuned the hyperparameters of all algorithms to enhance their performance and further reduce the risk of overfitting. We assessed the effectiveness of the models using metrics like accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). Once the models were developed, we performed a feature importance analysis on the random forest model to confirm its interpretability.
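The ten-fold scheme described above can be sketched in plain Python as follows. The helper `k_fold_indices` is hypothetical, and the sketch uses simple shuffled folds; whether the original analysis stratified folds by outcome is not stated, so no stratification is assumed here.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# With 1,157 participants, each validation fold holds about 10% of the data.
splits = list(k_fold_indices(1157, k=10))
for train, validation in splits[:2]:
    print(len(train), len(validation))
```

Each participant appears in exactly one validation fold, so every model is evaluated on data it was not trained on in that round.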
We used SHapley Additive exPlanations (SHAP) values to assess global feature importance in our best-performing machine learning model (11). SHAP is a notable advance in the interpretability of tree-based models: it uses a game-theoretic framework that aggregates local feature contributions to describe overall model behavior, and it is often regarded as superior to other global approximation methods. The SHAP algorithm quantifies feature importance and shows the influence of each feature on specific predictions. All statistical tests were two-sided, with p < 0.05 deemed statistically significant. The statistical analyses were carried out using SAS software (SAS Institute Inc., Cary, NC, USA; version 9.4).
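The game-theoretic idea behind SHAP can be illustrated with an exact, brute-force Shapley computation on a toy model. This sketch is not the SHAP library and not the tree-specific algorithm used for random forests; the toy model, the all-zero baseline, and the function names are assumptions made purely for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values: features absent from a coalition take baseline values."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


def toy_model(z):
    # Two main effects plus an interaction between features 0 and 2.
    return 2.0 * z[0] + 1.0 * z[1] + 0.5 * z[0] * z[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split equally between the two features involved. Averaging the absolute contributions over many observations is one common way to obtain a global importance ranking.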
Results
Baseline characteristics of study population
This study involved 1,157 participants: 599 in Group A (control group) and 558 in Group B (those with postmenopausal osteoporosis-related fractures) (Fig. 1). Compared with non-fracture controls, the fracture group had significantly lower height (155.51 ± 5.86 vs 156.50 ± 5.68 cm), weight (57.54 ± 9.12 vs 59.44 ± 9.77 kg), BMI (23.76 ± 3.32 vs 24.23 ± 3.52 kg/m²), years of education (5.05 ± 4.60 vs 6.93 ± 4.65), and fat intake (66.56 ± 34.80 vs 72.53 ± 35.80 g/day). The fracture group had higher protein intake (64.18 ± 21.72 vs 61.90 ± 21.20 g/day), energy intake (2134.23 ± 577.75 vs 2025.00 ± 567.14 kcal/day), carbohydrate intake (317.83 ± 105.07 vs 280.08 ± 94.35 g/day), LDL-C levels (3.25 ± 0.97 vs 3.12 ± 0.92 mmol/L), and APO-B levels (0.98 ± 0.27 vs 0.96 ± 0.25 mmol/L), and a larger proportion reported a pain condition (5.91% vs 2.84%) (all p < 0.05; Table 1). There were no significant differences in age, APO-A, or HDL-C levels between the groups. The baseline characteristics of the study population are listed in Table 1.
Table 1
Baseline characteristics of women with and without postmenopausal osteoporosis-related fractures, n (%)

| Characteristics | Group A (n = 599) | Group B (n = 558) | t/Z/F/χ2 | p |
|---|---|---|---|---|
| Age, years, M (Q1, Q3) | 51 (50, 52) | 51 (50, 51) | 1.344 | 0.179 |
| Height, cm (x̄ ± s) | 156.499 ± 5.6830 | 155.513 ± 5.8635 | 2.905 | <0.05 |
| Weight, kg (x̄ ± s) | 59.436 ± 9.7690 | 57.536 ± 9.1207 | 3.413 | <0.05 |
| BMI, kg/m2 (x̄ ± s) | 24.2298 ± 3.5189 | 23.7571 ± 3.3204 | 2.346 | <0.05 |
| Underweight (< 18.5) | 16 (2.67) | 14 (2.50) | | |
| Normal weight (18.5–24.0) | 286 (47.7) | 303 (55.29) | | |
| Overweight (> 24.0) | 297 (49.58) | 241 (43.80) | | |
| Educational years (x̄ ± s) | 6.93 ± 4.652 | 5.05 ± 4.598 | 6.913 | <0.05 |
| Educational levels | | | 6.625 | <0.05 |
| Illiteracy | 151 (25.21) | 224 (40.14) | | |
| Primary school | 126 (21.04) | 150 (26.88) | | |
| Junior high school | 187 (31.22) | 108 (19.35) | | |
| Senior high school | 87 (14.52) | 46 (8.24) | | |
| University or above | 48 (8.01) | 30 (5.38) | | |
| Energy intake, kcal (x̄ ± s) | 2025.0016 ± 567.1406 | 2134.2314 ± 577.7516 | −3.244 | <0.05 |
| Carbohydrate intake, g (x̄ ± s) | 280.0829 ± 94.3517 | 317.8323 ± 105.0676 | −6.438 | <0.05 |
| Fat intake, g (x̄ ± s) | 72.5305 ± 35.8002 | 66.5599 ± 34.8003 | 2.857 | <0.05 |
| Protein intake, g (x̄ ± s) | 61.8956 ± 21.2001 | 64.1831 ± 21.7181 | −1.812 | <0.05 |
| APO-A (x̄ ± s) | 1.2161 ± 0.5193 | 1.1804 ± 0.2894 | 1.428 | 0.154 |
| HDL-C (x̄ ± s) | 1.4673 ± 0.4149 | 1.5000 ± 0.6721 | −1.003 | 0.316 |
| LDL-C (x̄ ± s) | 3.1170 ± 0.9156 | 3.2511 ± 0.9708 | −2.417 | <0.05 |
| APO-B (x̄ ± s) | 0.9568 ± 0.2548 | 0.9844 ± 0.2710 | −1.782 | <0.05 |
| Pain condition | 17 (2.84) | 33 (5.91) | 6.610 | <0.05 |
PCA
Before building the machine learning model, we visualized the feature distributions of the participants' survey data. Supplementary Fig. 1 displays the distribution of categorical features, while Supplementary Fig. 2 shows the distribution of continuous features. We extracted the variable loadings of each observation index using principal component analysis. This process yielded five main components through Kaiser normalization and varimax rotation. The rotated component matrix shows the main components along with their high-loading variables. PCA 1 includes calories (0.980), carbohydrates (0.740), and protein (0.871), which relate to dietary nutritional intake. PCA 2 includes educational attainment (0.950) and years of education (0.951), reflecting educational level. PCA 3 includes height (0.437), weight (0.979), and BMI (0.876), reflecting physical body shape indicators. PCA 4 includes APO-A (0.789) and HDL-C (0.642), reflecting blood biochemical indicators. PCA 5 includes pain condition (0.601), representing the physical pain condition. The rotated component matrix for the principal component analysis, highlighting key indicators, is presented in Table 2.
Fig. 1
Participants screening flow chart.
Table 2
Principal Component Analysis of Key Indicators: Rotated Component Matrix

| Characteristics | PCA1 | PCA2 | PCA3 | PCA4 | PCA5 |
|---|---|---|---|---|---|
| Age | 0.044 | -0.261 | 0.008 | 0.173 | 0.272 |
| Height | -0.228 | 0.449 | 0.193 | 0.105 | -0.208 |
| Weight | -0.290 | 0.453 | 0.818 | 0.060 | -0.025 |
| BMI | -0.210 | 0.279 | 0.822 | 0.008 | 0.085 |
| Educational year | -0.455 | 0.629 | -0.459 | -0.330 | -0.036 |
| Educational levels | -0.455 | 0.629 | -0.459 | -0.330 | -0.036 |
| Energy intake | 0.863 | 0.468 | 0.014 | -0.014 | 0.013 |
| Carbohydrate intake | 0.837 | 0.039 | 0.016 | -0.149 | 0.030 |
| Fat intake | 0.263 | 0.651 | 0.003 | 0.166 | -0.016 |
| Protein intake | 0.695 | 0.534 | -0.008 | -0.022 | -0.024 |
| APO-A | -0.129 | 0.249 | -0.336 | 0.629 | 0.223 |
| HDL-C | 0.023 | 0.149 | -0.266 | 0.621 | -0.110 |
| LDL-C | -0.015 | -0.059 | -0.040 | 0.320 | -0.561 |
| APO-B | -0.272 | 0.282 | 0.029 | 0.273 | 0.461 |
| Pain condition | 0.095 | 0.008 | -0.050 | -0.095 | 0.596 |
Table 3 and Fig. 2 show the contribution of each principal component to the overall variance of the dataset. Nutritional intake indicators had an eigenvalue of 2.685, explaining 17.901% of the total variance; educational level had an eigenvalue of 2.454, accounting for 16.361%; physical constitution indicators had an eigenvalue of 1.982, explaining 13.213%; biochemical indicators had an eigenvalue of 1.283, accounting for 8.555%; and pain condition had an eigenvalue of 1.073, explaining 7.156%. After extraction and rotation, the cumulative variance explained by the first five components reached 63.186%. This indicates that these five components together captured a substantial portion of the information in the original variables, allowing for a more concise representation of the data.
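The link between the eigenvalues and the percentages quoted above can be checked with a short calculation. For a PCA on standardized variables, each component's share of the total variance equals its eigenvalue divided by the number of variables. The sketch below assumes the 15 variables listed in the rotated component matrix (Table 2); the variable names in the code are illustrative.

```python
# Eigenvalues of the five retained components, as quoted in the text.
eigenvalues = {
    "Nutritional intake indicators": 2.685,
    "Educational level": 2.454,
    "Physical constitution indicators": 1.982,
    "Biochemical indicators": 1.283,
    "Pain condition": 1.073,
}
N_VARIABLES = 15  # assumption: the 15 variables listed in Table 2


def variance_explained(eigenvalue: float, n_variables: int = N_VARIABLES) -> float:
    """Percentage of total variance for a component of a standardized PCA."""
    return 100.0 * eigenvalue / n_variables


percent = {name: variance_explained(v) for name, v in eigenvalues.items()}
cumulative = sum(percent.values())
for name, p in percent.items():
    print(f"{name}: {p:.2f}%")
print(f"cumulative: {cumulative:.2f}%")
```

The computed shares match the reported percentages to within rounding, and the cumulative figure is about 63.2%.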
Table 3
Principal Component Analysis of Key Indicators: Variance Explained and Rotated Loadings

| Variables | Variance explained: eigenvalue | Variance explained: variance (%) | Variance explained: cumulative variance (%) | Rotated loadings: eigenvalue | Rotated loadings: variance (%) | Rotated loadings: cumulative variance (%) |
|---|---|---|---|---|---|---|
| Nutritional intake indicators | 2.685 | 17.901 | 17.901 | 2.621 | 17.472 | 17.472 |
| Educational level | 2.454 | 16.361 | 34.262 | 2.178 | 14.518 | 31.990 |
| Physical constitution indicators | 1.982 | 13.213 | 47.475 | 2.123 | 14.152 | 46.1 |
| Biochemical indicators | 1.283 | 8.555 | 56.030 | 1.467 | 9.781 | 55.923 |
| Pain condition | 1.072 | 7.156 | 63.186 | 1.089 | 7.263 | 63.186 |

*: Components with eigenvalues greater than 1 were selected as principal components.
Fig. 2
The clustering heatmap after principal component analysis
Model performance
We developed and validated six machine learning models to assess the risk of osteoporotic fractures in postmenopausal women. Table 4 and the receiver operating characteristic (ROC) curves in Fig. 3 show how well each model discriminates fracture risk. Among the evaluated models, the Random Forest classifier had the highest accuracy (0.6695), sensitivity (0.6548), and AUC (0.6852), together with a specificity of 0.6833, outperforming the Decision Tree, KNN, Logistic Regression, Gaussian Naive Bayes, and SVM models overall. Logistic Regression showed competitive performance, with an AUC of 0.6526 and a specificity of 0.6556, while KNN had the lowest accuracy, specificity, and AUC.
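For readers less familiar with these metrics, the following sketch shows how accuracy, sensitivity, specificity, and AUC can be computed from true labels and predicted scores. The labels and scores are toy values, not study data, and the 0.5 classification threshold and function names are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall for class 1), and specificity (recall for class 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc_from_scores(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = fracture, 0 = no fracture.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5
metrics = classification_metrics(y_true, y_pred)
auc = auc_from_scores(y_true, scores)
print(metrics, round(auc, 4))
```

Unlike accuracy, sensitivity, and specificity, the AUC does not depend on the chosen threshold, which is why a model can rank first on AUC without ranking first on every other metric.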
The random forest model has several strengths. First, it has a powerful nonlinear modeling ability: by combining multiple decision trees, it automatically captures complex nonlinear relationships and high-order interactions among features, which enhances predictive ability (11). In contrast, linear models such as Logistic Regression struggle with complex nonlinear relationships and do not effectively capture interaction effects among features. Second, each decision tree within the forest is built independently of the others. This makes the random forest well suited to training on multi-core processors or in distributed computing environments, where parallel processing can substantially reduce training time. The advantages of parallel computing become especially clear when working with large and complex datasets.
Table 4
Model performance metrics

| Model | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| Decision Tree | 0.5920 | 0.6012 | 0.5833 | 0.5923 |
| KNN | 0.5575 | 0.5611 | 0.5611 | 0.5831 |
| Logistic Regression | 0.6034 | 0.5417 | 0.6556 | 0.6526 |
| Gaussian Naive Bayes | 0.5776 | 0.4017 | 0.7333 | 0.6367 |
| Random Forest | 0.6695 | 0.6548 | 0.6833 | 0.6852 |
| SVM | 0.5948 | 0.5298 | 0.6660 | 0.6139 |
Fig. 3
Comparison of the ROC curve for multiple models
ROC curves are used to compare the performance of various machine learning models in predicting osteoporotic fractures in postmenopausal women. The AUC for each model is indicated in the legend. Random Forest (AUC = 0.6852) showed the highest discriminative ability, followed by Logistic Regression (AUC = 0.6526). The KNN model performed the worst, with an AUC of 0.5831.
*ROC, receiver operating characteristic; AUC, area under the curve; KNN, k-nearest neighbors; SVM, support vector machine.
Feature importance analysis
The SHAP plots (Fig. 4 and Fig. 5) show the contribution of each feature to the random forest model's fracture predictions. SHAP values quantify the impact of each feature on the model's predictions, with larger values indicating a greater impact. In descending order, the SHAP values were: educational level (0.1017), nutritional intake indicators (0.0472), physical constitution indicators (0.0366), biochemical indicators (0.0315), and pain condition (0.0220).
Fig. 4
SHAP value distribution for each feature
This figure shows the distribution of SHAP values for each feature in the random forest model, aiming to clarify the importance and impact of each feature in predicting the risk of osteoporosis-related fractures in postmenopausal women. Each subplot reflects the influence of a specific feature on the model's output, with the magnitude of the SHAP value being positively correlated with the feature's contribution to the model's decision-making. Educational level and nutritional intake indicators exhibit notably high SHAP values, highlighting their crucial roles in predicting the risk of osteoporosis-related fractures in postmenopausal women. Conversely, the box plot for pain condition is primarily concentrated around zero, indicating that this feature has a relatively small impact on the model's predictions.
Fig. 5
SHAP value impact on the random forest model output
The SHAP plot depicts the impact of various features on the prediction of osteoporosis-related fractures in postmenopausal women by the random forest model. The horizontal axis represents SHAP values, where the magnitude of positive or negative values indicates the strength and direction of each feature's influence on the model's output. Features such as educational level and nutritional intake indicators have the most significant impact on the predictions.
Discussion
We used interpretable machine learning methods to investigate the association between osteoporotic fractures in postmenopausal women and factors such as educational level, nutritional intake, and physical body shape indicators. Our analysis used data from eight waves of follow-up surveys (1993–2015) from the China Health and Nutrition Survey (CHNS) database. Among the six machine learning models considered, the random forest model performed the best, achieving an average AUC of 0.685, the highest discriminative performance among the models compared. We used the SHAP game-theoretic approach to clarify the importance of each selected feature in the model. The key contributing factors identified were educational level, nutritional intake, and physical body shape.
This study differs from previous research, which focused mainly on clinical prediction models for osteoporotic fractures in postmenopausal women, by incorporating educational and dietary factors to assess fracture risk, thus offering a new perspective and methodology(12). Our main focus was on analyzing the roles of education and diet. However, the model also included easily accessible demographic characteristics and biochemical test results, which improved its predictive performance. Additionally, we used cross-validation to compare the performances of the various models.
Machine learning models are increasingly used to study factors related to postmenopausal osteoporotic fractures. For instance, Chen YC et al. used the XGBoost algorithm to create a predictive model that examines differences in immune cell profiles based on xCell signatures between osteoporotic patients with and without vertebral fractures (13). Similarly, Qing Wu and JingYuan Dai used four machine learning (ML) models (SVM, random forest, XGBoost, and artificial neural network [ANN]) to improve the predictive accuracy for major osteoporotic and hip fractures (14). Alessandro de Sire and colleagues conducted a cross-sectional study using machine learning methods to assess the predictive value of handgrip strength and the Short Physical Performance Battery for fragility fractures. They also evaluated the correlation with the Fracture Risk Assessment Tool. Their findings suggest that precise assessments of muscle strength and physical performance may be considered in the multidisciplinary evaluation of fracture risk in postmenopausal women (15). These studies demonstrate the application of machine learning in exploring risk factors for postmenopausal osteoporosis-related fractures, providing valuable insights for this field. Although multiple models were used, these studies generally lacked cross-validation for a comparative evaluation of their performance.
We selected KNN, SVM, decision trees, random forests, logistic regression, and Gaussian naive Bayes to create predictive models. We then used cross-validation to evaluate each model's discriminative characteristics and identify the best one for predicting fractures related to postmenopausal osteoporosis. Compared with traditional statistical methods such as regression analysis, machine learning algorithms identify complex patterns in a data-driven manner. They can directly model nonlinear relationships and high-dimensional feature interactions, reducing reliance on manual assumptions. These algorithms improve modeling efficiency by supporting automated feature engineering and hyperparameter optimization. They perform well in large-scale data scenarios, typically achieving higher prediction accuracy, and enable real-time predictions through distributed computing and edge deployment. When paired with interpretability tools such as SHAP and LIME, these algorithms maintain predictive performance while providing insights for decision-making. This makes them well suited to complex domains such as image recognition, natural language processing, and medical diagnosis, and positions machine learning as a core technology for addressing modern scientific and engineering challenges.
Our research indicates that the Random Forest model is the most effective. The Random Forest algorithm is an ensemble learning method that improves model performance by constructing multiple decision trees and integrating their predictions. Compared to single models, the Random Forest has three main advantages. First, it reduces the risk of overfitting and improves generalization by using bootstrap sampling and random feature selection. Second, the algorithm supports parallel computing, which increases training efficiency and allows it to manage high-dimensional data effectively. Most importantly, the Random Forest automatically evaluates feature importance and effectively handles non-linear relationships and missing data. This capability offers a degree of interpretability while maintaining high prediction accuracy (16). These advantages give it considerable application value in fields such as medical diagnosis, financial prediction, and bioinformatics (17).
We used SHAP values to enhance the explainability of our random forest model, making it more interpretable and intuitive while evaluating the impact of key features. SHAP values are well known in the machine learning field, especially in medical applications such as predicting cardiovascular disease. They are valued for providing strong model interpretability because they quantitatively assess the contribution of each feature to the model's output (18). SHAP decision plots provide explicit visualization of instance-level prediction mechanisms in the random forest model. The results indicated that major factors included nutritional intake indicators (such as calories, carbohydrates, and protein), educational level (measured in years of education), physical constitution indicators (such as height, weight, and BMI), and biochemical indicators (such as APO-A and HDL-C). These parameters significantly influence outcomes and align with current clinical practices, which highlight the importance of educational level and nutritional intake in preventing fractures related to postmenopausal osteoporosis (19). Osteoporosis, the most common bone disorder, profoundly impacts women's health, especially during postmenopausal phases (20). Clinicians can dynamically monitor these parameters to assess the risk of fractures related to postmenopausal osteoporosis and adjust treatment strategies accordingly. For instance, early dietary pattern modifications can assist clinicians in optimizing treatment plans and improving patient outcomes.
In this study, the main components affecting postmenopausal osteoporosis-related fractures are classified into five categories: nutritional intake, educational level, physical constitution, biochemical factors, and pain conditions. Each of these components influences the development of fractures in distinct ways.
Bone is living tissue, and its growth and maintenance rely on various essential nutrients (21). Good nutrition is crucial for maintaining optimal bone health. Small daily benefits accumulated over several decades can significantly reduce fracture risk (22). Nutrition has become a key factor in reducing bone loss and fracture risk. Numerous studies recommend a diet high in calcium, protein, vitamin C, and vitamin D to prevent osteoporosis (23–25). Previous research has substantiated the correlation between micronutrients and the prevention of osteoporosis, emphasizing the significance of dietary factors (26). Conversely, poor nutrition has been identified as a contributing factor to osteoporosis and fractures (27).
A study involving 1,424 Mexican-American women aged 67 and older identified a significant link between higher education levels and a reduced incidence of osteoporosis (OR = 1.13, 95% CI: 1.05–1.20) (28). Okbay et al. (29) revealed that the polygenic index of educational attainment (EA PGI) was a significant predictor of osteoporosis (OR = 0.030, 95% CI: 0.017–0.050, p = 2.985E-08). It is important to note that much of the predictive power of the EA PGI arises from factors outside of its direct influence. Research indicates that individuals with lower education levels are more likely to engage in unhealthy behaviors related to osteoporosis, such as inadequate dairy consumption and insufficient exercise (30, 31). The Global Longitudinal Study of Osteoporosis in Women revealed significant associations between body height, weight, BMI and incident clinical fractures. Compared to men, women, particularly those with taller stature, demonstrated a greater increase in fracture risk, with hip fractures being especially prevalent (32). Body weight and BMI are significant factors influencing bone mineral density (BMD). While higher BMI may generally correlate with greater BMD, excessive weight can paradoxically increase fracture risk. Conversely, low BMI, particularly in postmenopausal women, is strongly associated with osteopenia and osteoporosis (33, 34). In summary, obesity is associated with a higher risk of fractures among postmenopausal women.
Multiple studies have indicated a positive correlation between HDL-C levels and the risk of lumbar osteoporosis in postmenopausal women, and this association remains stable in female populations (35, 36). Apolipoprotein A1 (ApoA1) is a key regulatory component in lipid metabolism. As the primary apolipoprotein of high-density lipoprotein cholesterol (HDL-C), it has been demonstrated to possess multiple cardiovascular protective effects (37). Studies have demonstrated that ApoA1 plays crucial roles in various biological processes, including the regulation of bone metabolic homeostasis, systemic inflammatory responses, nitric oxide production, and oxidative stress (38). Additionally, research has shown that ApoA1 may serve as a predictor for the onset and progression of osteoporosis (OP) (39). ApoA1 deficiency may alter osteoprogenitor cell populations and impair bone metabolism, which remodels the phenotypic and molecular characteristics of bone marrow adipocytes. Serum ApoA1 levels reflect osteoblast activity and may consequently modulate osteoporosis through the regulation of osteoblast function (40, 41). Osteoporotic fractures can induce both acute and chronic nociceptive and neuropathic pain (42).
This study presents several clinically relevant findings with potential applications in medical practice. First, our predictive model demonstrated satisfactory performance in disease risk assessment. This suggests that dietary composition analysis could serve as a non-invasive method for evaluating individual disease susceptibility. Second, the results underscore the importance of dietary antioxidant properties. As a modifiable risk factor, these findings could inform the development of evidence-based dietary interventions aimed at reducing disease risk.
Despite the predictive ability of this model, the study has several limitations. First, this was a single-center retrospective study, and the data are mainly derived from survey research in a specific region, which may introduce selection bias and limit the generalizability of the findings. Consequently, these results may not fully reflect the situations in other populations or environments, and multicenter studies are needed to establish the external validity of the model. Second, although various data imputation techniques and outlier handling methods were used, the processed data may deviate from actual clinical data, potentially affecting the predictive accuracy of the model. In addition, some variables in the CHNS database may have incomplete or inconsistent records, which in turn affects data reliability. Owing to the retrospective nature of this study, we could not control for all potential confounding factors, and unmeasured variables may have influenced the results. Moreover, changes in clinical practice over time may limit the applicability of these findings in the current context. It is worth noting that although a higher AUC value in some models can indicate strong predictive performance within the current dataset, it may also reflect overfitting of the model to specific patterns in the training data. This phenomenon can limit the generalizability of the model to independent datasets or other patient populations. Future studies should use multicenter datasets for external validation, rigorously assess both the stability and applicability of the model, and optimize its training and validation processes to reduce the risk of overfitting and enhance its potential for application in a broader clinical setting.
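The overfitting concern raised above can be illustrated with a minimal sketch that compares the apparent (training-set) AUC of a model with its ten-fold cross-validated AUC. The example is not the pipeline used in this study: it uses purely synthetic noise features with random labels, a deliberately overfitting 1-nearest-neighbor classifier, and hypothetical helper names (`auc`, `one_nn_predict`, `apparent_and_cv_auc`), all chosen so that the code runs with the Python standard library alone.

```python
import random

def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney statistic); tied scores count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_nn_predict(train_X, train_y, x):
    """Return the label of the training point closest to x (1-nearest neighbor)."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[nearest]

def apparent_and_cv_auc(X, y, k=10, seed=0):
    """Return (training-set AUC, pooled out-of-fold AUC from k-fold CV)."""
    n = len(X)
    # Apparent performance: the model is scored on the data it memorized.
    apparent = auc(y, [one_nn_predict(X, y, x) for x in X])

    # k-fold cross-validation: each point is scored by a model that never saw it.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    cv_scores = [None] * n
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in fold:
            cv_scores[i] = one_nn_predict(train_X, train_y, X[i])
    return apparent, auc(y, cv_scores)

# Synthetic data: 200 samples, 5 noise features, labels unrelated to features.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [rng.randint(0, 1) for _ in range(200)]

apparent_auc, cv_auc = apparent_and_cv_auc(X, y)
print(f"apparent AUC = {apparent_auc:.3f}, 10-fold CV AUC = {cv_auc:.3f}")
```

Because the labels carry no signal, the cross-validated AUC is expected to fall near chance level while the apparent AUC is perfect. A large gap of this kind is the pattern that external, multicenter validation is intended to detect.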
Finally, although SHAP analysis improves model interpretability, the complexity of machine learning models still hinders their widespread clinical application, especially in decision-making processes that require transparency. Future research should focus on improving the interpretability of models to enhance their clinical utility.
Future research should explore the integration of additional observation indicators (such as inflammatory factors and metabolic products) to improve the accuracy and reliability of predictions. In addition, incorporating imaging features and using artificial intelligence to analyze imaging data will help enable comprehensive multimodal assessment methods. Optimizing the model's ability to provide personalized predictions for specific patient groups (such as perimenopausal women or patients with comorbidities) may offer more precise clinical guidance.
Conclusions
This study combined principal component analysis with machine learning (ML) techniques, particularly the random forest model, to develop a potential method for predicting the risk of postmenopausal osteoporosis-related fractures. The random forest model showed the best predictive performance among the models evaluated (accuracy 0.6695, AUC 0.6852) in identifying high-risk patients, which provides valuable insights for personalized management strategies and early intervention. However, the clinical applicability of this model requires further investigation. Future prospective multicenter studies involving larger and more diverse patient populations are therefore needed to assess the robustness and generalizability of the model and to provide scientific evidence for its translation into a practical clinical tool.
CRediT authorship contribution statement
XiaoJuan Zeng: Writing – review & editing, Writing – original draft, Software, Methodology, Data curation, Conceptualization. YiQiang Zhang: Writing – original draft, Data curation. Xia Liu: Writing – review & editing, Methodology, Data curation, Conceptualization.
Data Availability
The datasets used and/or analyzed in the present study are primarily sourced from the China Health and Nutrition Survey (CHNS). The original CHNS datasets, detailed documentation, and the most recent data releases are available from the official CHNS website: https://www.cpc.unc.edu/projects/china. Selected processed data supporting the key findings of this study are available within the main text of the article or its supplementary materials. The complete raw and baseline data may be obtained via the official CHNS platform, subject to the access protocols and guidelines stipulated by the database.
Funding information
Project of Fujian Provincial Department of Finance (0060092364).
Electronic Supplementary Material
Below is the link to the electronic supplementary material
Author Contribution
Xiaojuan Zeng, Yiqiang Zhang, and Xia Liu wrote the main text of the manuscript, and Xia Liu prepared the figures and tables. All authors have reviewed the manuscript.
References
1.
Bujoreanu, F. C. et al. Cutaneous Changes Beyond Psoriasis: The Impact of Biologic Therapies on Angiomas and Solar Lentigines. Med. (Kaunas). 61 (4), 565 (2025).
2.
Clynes, M. A. et al. The epidemiology of osteoporosis. Br. Med. Bull. 133 (1), 105–117 (2020).
3.
Barcelos, A., Gonçalves, J., Mateus, C., Canhão, H. & Rodrigues, A. M. Costs of incident non-hip osteoporosis-related fractures in postmenopausal women from a payer perspective. Osteoporos. Int. 34 (12), 2111–2119 (2023).
4.
Office of the Surgeon G. Reports of the Surgeon General. Bone Health and Osteoporosis: A Report of the Surgeon General (Office of the Surgeon General (US), 2004).
5.
van Staa, T. P., Dennison, E. M., Leufkens, H. G. & Cooper, C. Epidemiology of fractures in England and Wales. Bone 29, 517–522 (2001).
6.
Popkin, B. M., Du, S., Zhai, F. & Zhang, B. Cohort Profile: The China Health and Nutrition Survey–monitoring and understanding socio-economic and health change in China, 1989–2011. Int. J. Epidemiol. 39 (6), 1435–1440 (2010).
7.
Chen, C. & Lu, F. C. The guidelines for prevention and control of overweight and obesity in Chinese adults. Biomed. Environ. Sci. 17 (Suppl), 1–36 (2004).
8.
Qi, X. et al. Machine learning and SHAP value interpretation for predicting comorbidity of cardiovascular disease and cancer with dietary antioxidants. Redox Biol. 79, 103470 (2025).
9.
Torralba, V., Maradei, F. & Castellanos, J. Ergonomic criteria related to perceived comfort when using by-pass-type cutting tools. Work 80 (3), 1040–1052 (2025).
10.
Chowdhury, S. U. et al. Shapley-Additive-Explanations-Based factor analysis for dengue severity prediction using machine learning. J. Imaging 8 (9) (2022).
11.
Yan, L. et al. Random forest-based model for the recurrence prediction of borderline ovarian tumor: clinical development and validation. J. Cancer Res. Clin. Oncol. 151 (5), 160 (2025).
12.
Wu, Y. et al. Nomogram for predicting cemented vertebral refracture after percutaneous kyphoplasty in postmenopausal women with osteoporotic vertebral compression fractures. Clin. Neurol. Neurosurg. 250, 108789 (2025).
13.
Chen, Y. C. et al. Immune cell profiles and predictive modeling in osteoporotic vertebral fractures using XGBoost machine learning algorithms. BioData Min. 18 (1), 13 (2025).
14.
Wu, Q. & Dai, J. Enhanced osteoporotic fracture prediction in postmenopausal women using Bayesian optimization of machine learning models with genetic risk score. J. Bone Min. Res. 39 (4), 462–472 (2024).
15.
de Sire, A. et al. Influence of hand grip strength test and short physical performance battery on FRAX in post-menopausal women: a machine learning cross-sectional study. J. Sports Med. Phys. Fit. 64 (3), 293–300 (2024).
16.
Yu, F., Wei, C., Deng, P., Peng, T. & Hu, X. Deep exploration of random forest model boosts the interpretability of machine learning studies of complicated immune responses and lung burden of nanoparticles. Sci. Adv. 7 (22), eabf4130 (2021).
17.
Cabrera, A. et al. Use of random forest machine learning algorithm to predict short term outcomes following posterior cervical decompression with instrumented fusion. J. Clin. Neurosci. 107, 167–171 (2023).
18.
Li, X. et al. Development of an interpretable machine learning model associated with heavy metals' exposure to identify coronary heart disease among US adults via SHAP: Findings of the US NHANES from 2003 to 2018. Chemosphere 311 (Pt 1), 137039 (2023).
19.
Kurt, S. et al. Prevalence and Risk Factors of Osteoporosis: A Cross-Sectional Study in a Tertiary Center. Med. (Kaunas). 60 (12), 2109 (2024).
20.
Yordanov, A., Vasileva-Slaveva, M., Tsoneva, E., Kostov, S. & Yanachkova, V. Bone Health for Gynaecologists. Med. (Kaunas). 61 (3), 530 (2025).
21.
Li, M. et al. Trends and hotspots in research on osteoporosis and nutrition from 2004 to 2024: a bibliometric analysis. J. Health Popul. Nutr. 43 (1), 204 (2024).
22.
Weaver, C. M. Nutrition and bone health. Oral Dis. 23 (4), 412–415 (2017).
23.
Shi, Y., Zhan, Y., Chen, Y. & Jiang, Y. Effects of dairy products on bone mineral density in healthy postmenopausal women: a systematic review and meta-analysis of randomized controlled trials. Arch. Osteoporos. 15 (1), 48 (2020).
24.
Park, S. J., Jung, J. H., Kim, M. S. & Lee, H. J. High dairy products intake reduces osteoporosis risk in Korean postmenopausal women: A 4 year follow-up study. Nutr. Res. Pract. 12 (5), 436–442 (2018).
25.
Ilesanmi-Oyelere, B. L. & Kruger, M. C. Nutrient and Dietary Patterns in Relation to the Pathogenesis of Postmenopausal Osteoporosis-A Literature Review. Life (Basel). 10 (10), 220 (2020).
26.
Malmir, H., Saneei, P., Larijani, B. & Esmaillzadeh, A. Adherence to Mediterranean diet in relation to bone mineral density and risk of fracture: a systematic review and meta-analysis of observational studies. Eur. J. Nutr. 57 (6), 2147–2160 (2018).
27.
Feng, L. et al. Preoperative malnutrition as an independent risk factor for the postoperative mortality in elderly Chinese individuals undergoing hip surgery: a single-center observational study. Ther. Adv. Chronic Dis. 13, 20406223221102739 (2022).
28.
Kanis, J. A. et al. Algorithm for the management of patients at low, high and very high risk of osteoporotic fractures. Osteoporos. Int. 31 (1), 1–12 (2020).
29.
Okbay, A. et al. Polygenic prediction of educational attainment within and between families from genome-wide association analyses in 3 million individuals. Nat. Genet. 54 (4), 437–449 (2022).
30.
Lanyan, A., Marques-Vidal, P., Gonzalez-Rodriguez, E., Hans, D. & Lamy, O. Postmenopausal women with osteoporosis consume high amounts of vegetables but insufficient dairy products and calcium to benefit from their virtues: the CoLaus/OsteoLaus cohort. Osteoporos. Int. 31 (5), 875–886 (2020).
31.
Crawford, S. L. Parity, education, and postmenopausal cognitive function. Menopause 27 (12), 1348–1349 (2020).
32.
Compston, J. E. et al. GLOW Investigators. Relationship of weight, height, and body mass index with fracture risk at different sites in postmenopausal women: the Global Longitudinal study of Osteoporosis in Women (GLOW). J. Bone Min. Res. 29 (2), 487–493 (2014).
33.
De Laet, C. et al. Body mass index as a predictor of fracture risk: a meta-analysis. Osteoporos. Int. 16 (11), 1330–1338 (2005).
34.
Liu, H. F., Meng, D. F., Yu, P., De, J. C. & Li, H. Y. Obesity and risk of fracture in postmenopausal women: a meta-analysis of cohort studies. Ann. Med. 55 (1), 2203515 (2023).
35.
Wang, Y. et al. Association between serum cholesterol level and osteoporotic fractures. Front. Endocrinol. 9, 30 (2018).
36.
Jeong, I. et al. Lipid profiles and bone mineral density in pre- and postmenopausal women in Korea. Calcif Tissue Int. 87, 507–512 (2010).
37.
Sun, X. & Wu, X. Association of apolipoprotein A1 with osteoporosis: a cross-sectional study. BMC Musculoskelet. Disord. 24 (1), 157 (2023).
38.
Xu, J. et al. Proteome-wide profiling reveals dysregulated molecular features and accelerated aging in osteoporosis: a 9.8‐year prospective study. Aging Cell 23 (2023).
39.
Sun, X. & Wu, X. Association of apolipoprotein A1 with osteoporosis: a cross-sectional study. BMC Musculoskelet. Disord. 24 (1), 157 (2023).
40.
Fan, L. et al. The diagnostic value of the combined application of blood lipid metabolism markers and interleukin-6 in osteoporosis and osteopenia. Lipids Health Dis. 24 (1), 38 (2025).
41.
Wang, W., Chen, Z. Y., Lv, F. Y., Tu, M. & Guo, X. L. Apolipoprotein A1 is associated with osteocalcin and bone mineral density rather than high-density lipoprotein cholesterol in Chinese postmenopausal women with type 2 diabetes mellitus. Front. Med. (Lausanne). 10, 1182866 (2023).
42.
Vellucci, R. et al. Understanding osteoporotic pain and its pharmacological treatment. Osteoporos. Int. 29 (7), 1477–1491 (2018).