An interpretable machine learning model for assessing the risk of Talaromycosis in HIV patients lacking skin lesions
Jiaguang Hu1,3,9,10#, Wenming He1,4#, Qun Tian7#, Yanqiu Lu5, Peng Zhang1,4, Jinyu Qin1,4, Chuan Qin1,4, Ying Wu4, Cheng Huang8, Xu Li4, Luhuai Feng11, Linghua Li6*, Zhongsheng Jiang1,4*, Jianning Jiang2,12*
1 Division of Infectious Diseases, Liuzhou People’s Hospital Affiliated to Guangxi Medical University, Liuzhou, Guangxi, China
2Department of Infectious Diseases, The First Affiliated Hospital of Guangxi Medical University, Nanning 530021, Guangxi, China
3Department of Infection Control Management, Liuzhou People’s Hospital Affiliated to Guangxi Medical University, Liuzhou, Guangxi, China
4Liuzhou Key Laboratory of Infection Disease and Immunology, Liuzhou People’s Hospital Affiliated to Guangxi Medical University, Liuzhou, Guangxi, China
5Clinical Research Center, Chongqing Public Health Medical Center, Shapingba, China
6Infectious Disease Center, Guangzhou Eighth People's Hospital, Guangzhou Medical University
7 Internal Medicine Ward One, The Third People's Hospital of Guilin, Guilin, Guangxi, China
8 Oncology Department, Liuzhou People's Hospital Affiliated to Guangxi Medical University, Liuzhou, Guangxi, China.
9Liuzhou Key Laboratory of Severe Abdominal Infection Research, Liuzhou, Guangxi, China
10Guangxi Key Laboratory of Clinical Disease Biotechnology Research, Liuzhou People's Hospital, Liuzhou, Guangxi, China
11Department of Endocrinology and Metabolism, Nephrology, Guangxi Medical University Cancer Hospital, Nanning, China
12Key Laboratory of Early Prevention and Treatment for Regional High-Frequency Tumor (Guangxi Medical University), Ministry of Education, Nanning, Guangxi, 530021, P. R. China
*Corresponding authors:
Jianning Jiang, E-mail: gxjjianning@163.com
Zhongsheng Jiang, E-mail: jiangzs1111@126.com
Linghua Li, E-mail: llheliza@126.com
Jiaguang Hu, Wenming He and Qun Tian contributed equally to this work.
Abstract
Introduction:
The existing predictive models for talaromycosis in HIV-infected patients without skin lesions are limited by established risk factors and traditional statistical approaches. This study aims to develop an interpretable machine learning (ML) model for predicting the risk of talaromycosis in HIV patients without skin lesions and to validate its clinical applicability.
Methods
This retrospective multicenter study involved the analysis of electronic medical records (EMR) from four tertiary hospitals in China, covering the period from 2010 to 2019. The training dataset comprised 1,009 HIV patients with opportunistic infections, while external validation was conducted using data from 305 patients at an independent center. From an initial set of 36 variables, twelve key features were selected, including albumin, absolute lymphocyte count, hemoglobin, alanine aminotransferase (ALT), aspartate aminotransferase (AST), AST/ALT ratio, C-reactive protein, white blood cell count, platelet count, peripheral or abdominal lymphadenopathy, CD4+ T-cell count, and age. Five ML algorithms were evaluated using 10-fold cross-validation. Model performance was measured using the AUC, ACC, and F1-score. Calibration curves and decision curve analysis (DCA) were employed to assess the model's reliability and clinical net benefit. The optimal model was subsequently implemented as a web-based tool.
Results
The Support Vector Machine (SVM) exhibited superior performance compared to other models, achieving an AUC of 0.809 (95% CI: 0.778–0.838), an ACC of 0.714, and an F1-score of 0.689. External validation demonstrated enhanced performance metrics, with an AUC of 0.921 (95% CI: 0.889–0.951), ACC of 0.853, and an F1-score of 0.819. DCA indicated a significant net clinical benefit across various risk thresholds, and calibration curves showed strong concordance between predicted and observed risks.
Conclusion
This interpretable SVM model effectively stratifies the risk of talaromycosis in HIV patients in endemic regions, aligning with WHO recommendations for targeted prophylaxis. Its integration into a web-based tool enhances clinical accessibility for early intervention in resource-constrained settings. Ongoing prospective trials (ChiCTR1900021195) are anticipated to further substantiate its real-world impact.
Keywords:
HIV infection
Talaromycosis
Machine learning
Predictive model
Data visualization
Introduction
Talaromycosis is an invasive fungal infection that predominantly affects immunocompromised individuals, particularly those with HIV. It is endemic in the tropical and subtropical regions of Asia, notably Vietnam, Thailand, and southern China [1]. It accounts for 6.4% to 11% of HIV-related hospital admissions in Vietnam and 3.3% in Thailand. In southern China the burden is higher, with Guangxi and Guangdong reporting rates of 16.1% and 17.3%, respectively [1–6]. Even when antifungal therapy is available, delayed diagnosis contributes to a high in-hospital mortality rate among patients with talaromycosis, ranging from 16.7% to 30% [7–9]. The efficacy of antifungal treatment therefore relies on the timely identification of high-risk individuals.
The typical clinical features of Talaromyces marneffei (TM) infection include fever, weight loss, low hemoglobin (Hb) levels, generalized fatigue, skin lesions, peripheral or abdominal lymphadenopathy (POAL), and hepatosplenomegaly [10]. Skin lesions in TM infection often present as centrally umbilicated lesions, which can serve as critical diagnostic indicators for early detection. However, only approximately one-third of patients with TM infection exhibit skin lesions [11]. Microbiological culture is currently considered the reference standard for diagnosing talaromycosis, but it is time-consuming: isolating or identifying the pathogen from clinical specimens often takes more than 10 days. Such delays may postpone the initiation of antifungal therapy and adversely affect patient outcomes [12–14]. Although several highly specific serological diagnostic techniques have been developed, false-negative results continue to hamper the prompt identification of TM infection [13].
Early initiation of antifungal therapy in talaromycosis can significantly reduce mortality and mitigate disease severity [7, 15–17]. However, the efficacy of targeted interventions depends on the timely identification of high-risk individuals. Although recent studies have used dynamic nomograms to evaluate the risk of TM infection in hospitalized individuals with HIV, their reliance on skin lesions as a risk factor may reduce diagnostic accuracy for patients without skin lesions [18]. There is therefore a critical need for quantitative, accessible, and efficient approaches to evaluating the risk of talaromycosis in HIV-infected patients lacking skin lesions.
In recent years, machine learning (ML) techniques that use electronic medical records (EMR) have gained significant traction and acceptance among healthcare professionals. Unlike conventional statistical approaches, ML algorithms impose fewer requirements on the data and excel at modeling intricate datasets [19], which is driving their growing application in medicine. However, the inherent complexity of ML models, especially their "black-box" nature, often limits transparency in understanding how they reach decisions [20]. Model interpretation tools are therefore essential for elucidating the underlying mechanisms of ML models. This study used the Shapley Additive Explanations (SHAP) methodology to develop an interpretable model. SHAP enhances the interpretability and clinical utility of ML-based predictions by visualizing each variable's contribution to the model's output. Additionally, this study developed an accessible web-based application for the predictive model, enabling clinicians to obtain predictions without installing Python or acquiring programming skills.
Methods
Design of the Study
This study analyzed data from a multicenter retrospective cohort of HIV-positive patients in China, including those hospitalized at Liuzhou People's Hospital, the Third People's Hospital of Guilin, Guangzhou Eighth People's Hospital, and Chongqing Public Health Medical Center. The model development dataset comprised 751 patients from Liuzhou People's Hospital, 122 from Chongqing Public Health Medical Center, and 136 from Guangzhou Eighth People's Hospital, spanning August 3, 2010, to January 17, 2019. The external validation cohort of 305 cases was retrospectively collected from the Third People's Hospital of Guilin between December 16, 2015, and December 12, 2018. The study was divided into three main phases: (1) selecting participants and screening variables; (2) developing and evaluating models; and (3) creating a web application for the best model.
The Ethics Committee of the Chongqing Public Health Medical Center (2019-003-02-KY) approved the study. The ethics committees waived the need for informed consent from participants because of the study's retrospective and anonymized nature. Eligible participants were individuals aged 18 years or older with a confirmed diagnosis of HIV. Participants were also required to have a CD4+ T-cell count of less than 200 cells/µL at admission and to have been hospitalized for more than three days because of suspected opportunistic infections. Exclusion criteria were pregnancy, lactation, the presence of skin lesions, and incomplete medical records (> 20% missing data).
Outcomes
The outcome variable of the diagnostic model was the presence of talaromycosis. Talaromycosis was diagnosed when TM was isolated from clinical specimens using standard culture techniques [12] or identified in biopsy tissue by histopathology.
Sample size calculation
The sample size was determined using the events per variable (EPV) metric [21, 22], a widely recognized approach in statistical analyses. In southern China, the proportion of HIV patients co-infected with Talaromyces marneffei is 0.16. Given our objective of including twelve predictor variables with an EPV of 10, the required sample size was calculated from these three quantities.
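The formula itself does not survive in this version of the text. A standard EPV-based calculation that is consistent with the stated inputs is sketched below; it is an assumption about the form used, not necessarily the authors' exact expression. Here $N$ is the minimum sample size, $k$ the number of predictor variables, and $p$ the expected proportion of patients with the outcome.

```latex
N = \frac{\mathrm{EPV} \times k}{p} = \frac{10 \times 12}{0.16} = 750
```

Under this assumption, the development cohort of 1,009 patients exceeds the minimum of 750.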
Data collection and model predictors
The patient data were retrospectively extracted from hospital admission records covering the period from August 3, 2010, to January 17, 2019. Missing data in the clinical records were managed through mean substitution.
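As a minimal illustration of mean substitution, the sketch below replaces each missing value with the mean of the observed values in the same column. The function name `impute_column_means` and the toy records are hypothetical and are not taken from the study's code or data.

```python
from statistics import mean


def impute_column_means(rows, columns):
    """Return copies of `rows` in which None is replaced by the column mean.

    The mean of each column is computed from the observed (non-None) values only.
    """
    col_means = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        col_means[col] = mean(observed)
    return [
        {**row, **{c: col_means[c] for c in columns if row[c] is None}}
        for row in rows
    ]


# Hypothetical toy records (not study data).
records = [
    {"CRP": 10.0, "ALB": 30.0},
    {"CRP": None, "ALB": 26.0},
    {"CRP": 50.0, "ALB": None},
]
filled = impute_column_means(records, ["CRP", "ALB"])
print(filled)
```

With pandas, the same operation is usually written as `df.fillna(df.mean(numeric_only=True))`. Mean substitution is simple but shrinks the variance of the imputed variables, which is worth keeping in mind for columns with many missing values.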
Initially, candidate variables were identified through a systematic review, meta-analysis, and expert consensus, yielding a comprehensive set of predictors. These included the presence of oral candidiasis, tuberculosis, bacterial pneumonia, pneumocystis pneumonia, cryptococcosis, cytomegalovirus disease, herpes simplex virus disease, and lymphoma. Demographic and clinical factors such as sex, age, body mass index (BMI), nationality, occupation, marital status, and injection drug use were also included. Finally, the set covered clinical symptoms and laboratory findings, including fever, cough, poor appetite, hepatomegaly, splenomegaly, hemoglobin (Hb) levels, white blood cell count (WBC), platelet count (PLT), absolute lymphocyte count (ALC), C-reactive protein (CRP), AST, ALT, AST/ALT ratio, albumin (ALB), blood urea nitrogen (BUN), creatinine (Cr), (1–3)-β-D glucan (G) levels, CD4+ T-cell count (CD4), and the presence of hepatitis C and hepatitis B.
Furthermore, potential variables were subjected to an additional screening process based on dimensionality reduction criteria to evaluate their appropriateness for inclusion in the model [23]. The Random Forest (RF) algorithm is an ensemble method that combines several decision trees. It generates models by applying random sampling to the training dataset and determining optimal split points. Individual decision trees in RF are constructed based on feature metrics derived from dataset attributes [24, 25], facilitating a robust assessment of each feature's significance [26]. Lasso regression performs variable selection and controls model complexity through regularization, effectively preventing overfitting. The parameter λ controls regularization strength, leading to a simplified model with fewer variables. The optimal value of λ(λmin) is identified using 10-fold cross-validation, where the point of minimum error is used to select the most relevant predictive variables [27, 28]. Unlike traditional feature selection methods, Boruta uses a wrapper-based method to select features. It aims to determine the feature set that demonstrates the strongest association with the dependent variable, prioritizing comprehensive relevance over the creation of a minimal, model-specific subset [29]. Through the iterative elimination of low-correlation features, this method significantly reduces noise and enhances the consistency of classification outcomes [30]. Meanwhile, as a highly potent ensemble learning technique, Extreme Gradient Boosting (XGBoost) is established based on classification trees. By integrating multiple weak learners into a robust ensemble model through sequential boosting steps, it constructs a tree-based classifier, providing a reliable approach for precise classification tasks [31]. 
Finally, Mutual Information (MI) effectively captures non-linear relationships, complementing RF for robust feature importance, Lasso for sparsity, Boruta for comprehensive relevance, and XGBoost for high-precision ensemble learning in medical data analysis [32].
Specifically, we utilized Lasso regression, RF, Boruta, XGBoost, and MI to analyze the 36 potential risk factors. Variables with non-zero coefficients were ranked according to their contribution to the outcome, and common variables were identified by intersecting the results from all five methods.
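The intersection step described above can be sketched with plain set operations. In this sketch the 12 shared variables are the ones reported in this paper, but the extra variables attached to each method are invented purely for illustration and do not reproduce the actual per-method selections.

```python
from collections import Counter

# The 12 predictors reported as common to all five methods.
FINAL_12 = {
    "ALB", "ALC", "Hb", "AST/ALT", "CRP", "ALT",
    "WBC", "PLT", "POAL", "CD4", "AST", "age",
}

# Hypothetical per-method selections: the shared 12 plus made-up extras.
selected = {
    "lasso": FINAL_12 | {"fever", "hepatomegaly"},
    "random_forest": FINAL_12 | {"BMI", "BUN"},
    "boruta": set(FINAL_12),
    "xgboost": FINAL_12 | {"BMI", "fever", "Cr"},
    "mutual_information": FINAL_12 | {"splenomegaly", "fever"},
}

# Keep only the variables that every method selected.
shared = set.intersection(*selected.values())

# Count how many methods voted for each variable (useful for a Venn-style summary).
votes = Counter(name for names in selected.values() for name in names)

print(sorted(shared))
print(votes.most_common(3))
```

Requiring agreement across all methods is a conservative rule: a variable that only one or two methods favor is dropped even if it carries some signal.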
Statistical Analysis
Demographic, clinical, and laboratory data from the derivation cohort were analyzed upon hospital admission. Continuous variables are reported as medians with interquartile ranges (IQRs) and categorical variables as frequencies and percentages. A two-sided p-value of less than 0.05 was used to establish statistical significance. The gaze function in the panda's package enhances data analysis efficiency in medical research by automatically identifying variable types, such as continuous or categorical, and applying appropriate statistical methods to generate comprehensive descriptive statistics. Data analyses were performed utilizing SPSS version 29.0 (IBM Analytics, USA) and Anaconda 3, incorporating Python version 3.12.
Model development and comparison
In the study's final cohort of 1,314 patients, 379 had missing data in variables such as BMI (198), CRP (7), ALB (1), BUN (8), Cr (8), and G-test (157). The missingness was random. We handled it with the pandas library by imputing the column mean for each variable using the fillna() method during preprocessing. The dataset of 1,009 patients was divided into 80% training and 20% testing subsets using the pandas library for internal validation. Preprocessing steps included encoding categorical variables as binary indicator variables, eliminating features with minimal variance, and standardizing continuous variables to minimize overfitting. Subsequently, a prediction model was built on the 12 identified predictor variables. Eight machine learning models were used to assess talaromycosis risk: logistic regression (LR), SVM, RF, multi-layer perceptron (MLP), naive Bayes (NB), decision tree (DT), K-nearest neighbor (KNN), and XGBoost. The models were assessed and compared using accuracy (ACC), area under the receiver operating characteristic curve (AUC), specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), F1 score, and the Matthews correlation coefficient (MCC). The same metrics were used to assess the optimal model in external validation. Calibration curves were used to assess the predictive performance of the optimal model, whereas clinical decision curve analysis (DCA) was used to evaluate its clinical utility.
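The evaluation metrics listed above all derive from the confusion matrix, except for the AUC, which depends on how the scores rank positive cases against negative ones. The sketch below computes them on a small set of made-up labels and scores; the 0.5 cut-off and the helper names are assumptions for illustration only.

```python
from math import sqrt


def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for 0/1 labels (assumes no empty denominators)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / len(y_true),
        "sensitivity": sens,
        "specificity": spec,
        "PPV": ppv,
        "NPV": npv,
        "F1": 2 * ppv * sens / (ppv + sens),
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities (not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.1, 0.2, 0.3, 0.35, 0.7, 0.65]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 decision threshold

m = binary_metrics(y_true, y_pred)
m["AUC"] = auc(y_true, scores)
print({k: round(v, 3) for k, v in m.items()})
```

Reporting MCC alongside ACC and F1 is useful because it uses all four cells of the confusion matrix and stays informative when the classes are imbalanced.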
Interpretation of the model and its web application
Interpretability refers to the clarification of how ML models generate outcomes. The inherent opacity of machine learning models often hinders their clinical application, prompting significant research to improve interpretability [33, 34]. This study presents an intuitive model explanation framework employing the SHAP (Shapley Additive exPlanations) BreakDown package, which quantifies the extent of individual predictors' contributions to model predictions. Furthermore, we developed an intuitive model interpreter that works independently of any specific model and uses input predictor variables to interpret results. The Gradio package facilitated the incorporation of predictive variables and optimal modeling into an interactive web application.
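To make the idea of per-feature contributions concrete, the sketch below computes exact Shapley values for a three-feature toy model by averaging each feature's marginal contribution over all feature orderings. The toy model, its coefficients, and the single reference patient are invented for illustration. The SHAP library estimates the same quantity against a background dataset rather than one reference point, so this is not the study's pipeline.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    phi = {f: 0.0 for f in features}
    for order in permutations(features):
        present = frozenset()
        for f in order:
            with_f = present | {f}
            phi[f] += value_fn(with_f) - value_fn(present)
            present = with_f
    n_orders = factorial(len(features))
    return {f: total / n_orders for f, total in phi.items()}


def toy_model(x):
    """Hypothetical risk score with one interaction term (not the trained SVM)."""
    return (0.1 + 0.3 * x["POAL"] + 0.2 * x["high_AST"]
            + 0.1 * x["POAL"] * x["high_AST"] + 0.1 * x["low_PLT"])


patient = {"POAL": 1, "high_AST": 1, "low_PLT": 1}
reference = {"POAL": 0, "high_AST": 0, "low_PLT": 0}


def value(known):
    """Model output when only `known` features take the patient's values."""
    x = {f: (patient[f] if f in known else reference[f]) for f in patient}
    return toy_model(x)


phi = shapley_values(value, list(patient))
base = value(frozenset())
print(phi, base, toy_model(patient))
```

The contributions add up to the difference between the patient's prediction and the baseline prediction. This additivity is what allows a plot to start at a baseline value and stack signed feature contributions until it reaches f(x).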
Results
Demographic Attributes
In this retrospective study, an initial screening identified 1,675 hospitalized patients diagnosed with HIV who met the inclusion criteria. Of these, 361 patients (21.5%) were excluded under the predefined criteria because more than 20% of their variables had missing data. The modeling cohort of 1,009 patients was randomly divided into training and testing subsets at an 80:20 ratio for internal validation, and a distinct cohort of 305 patients was used for external validation. The study's structure is illustrated in Fig. 1. Within the modeling group, patients were categorized into two subgroups according to the presence of talaromycosis, as shown in Table 1. Significant differences (p < 0.05) were observed between the talaromycosis group (n = 512) and the non-talaromycosis group (n = 497) in age, BMI, fever, poor appetite, POAL, hepatomegaly, splenomegaly, AST, ALT, AST/ALT ratio, ALB, Hb, WBC count, ALC, CRP, PLT count, and CD4+ T-cell count. Supplementary Figure S1 shows the correlation between variables, while Supplementary Figure S2 depicts the distribution of the 36 independent variables.
Table 1
Baseline characteristics of the modeling group
| Predictive variables | Levels | Total (n = 1009) | Without TM infection (n = 497) | With TM infection (n = 512) | P |
| --- | --- | --- | --- | --- | --- |
| Sex, N (%) | Male | 747 (74.1%) | 372 (74.8%) | 375 (73.2%) | 0.561 |
| | Female | 262 (25.9%) | 125 (25.2%) | 137 (26.8%) | |
| Age (years) | median (IQR) | 51.0 (40.0–64.0) | 55.0 (44.0–68.0) | 48.0 (38.3–59.0) | < 0.001 |
| BMI (kg/m²) | median (IQR) | 19.4 (18.0–20.0) | 19.4 (18.0–20.2) | 19.3 (18.0–20.0) | < 0.001 |
| Nationality (Han), N (%) | No | 219 (21.7%) | 101 (20.3%) | 118 (23.0%) | 0.294 |
| | Yes | 790 (78.3%) | 396 (79.7%) | 394 (77.0%) | |
| Occupation, N (%) | Farmer | 400 (39.6%) | 194 (39.0%) | 206 (40.2%) | 0.091 |
| | Unemployed | 156 (15.5%) | 66 (13.3%) | 90 (17.6%) | |
| | Other | 453 (44.9%) | 237 (47.7%) | 216 (42.2%) | |
| Marital status, N (%) | Married | 718 (71.1%) | 354 (71.2%) | 364 (71.1%) | 0.233 |
| | Single | 143 (14.2%) | 63 (12.7%) | 80 (15.6%) | |
| | Other | 148 (14.7%) | 80 (16.1%) | 68 (13.3%) | |
| Injection drug user, N (%) | No | 976 (96.8%) | 481 (96.8%) | 495 (96.7%) | 0.928 |
| | Yes | 33 (3.2%) | 16 (3.2%) | 17 (3.3%) | |
| Oral candidiasis, N (%) | No | 402 (39.9%) | 198 (39.8%) | 204 (39.8%) | 0.999 |
| | Yes | 607 (60.1%) | 299 (60.2%) | 308 (60.2%) | |
| Tuberculosis, N (%) | No | 852 (84.4%) | 422 (85.0%) | 430 (84.0%) | 0.685 |
| | Yes | 157 (15.6%) | 75 (15.0%) | 82 (16.0%) | |
| Bacterial pneumonia, N (%) | No | 710 (70.4%) | 352 (70.8%) | 358 (69.9%) | 0.753 |
| | Yes | 299 (29.6%) | 145 (29.2%) | 154 (30.1%) | |
| Pneumocystis pneumonia, N (%) | No | 768 (76.1%) | 378 (76.0%) | 390 (76.2%) | 0.966 |
| | Yes | 241 (23.9%) | 119 (24.0%) | 122 (23.8%) | |
| Cryptococcosis, N (%) | No | 988 (97.9%) | 486 (97.8%) | 502 (98.0%) | 0.772 |
| | Yes | 21 (2.1%) | 11 (2.2%) | 10 (2.0%) | |
| Cytomegalovirus disease, N (%) | No | 923 (91.5%) | 452 (91.0%) | 471 (92.0%) | 0.552 |
| | Yes | 86 (8.5%) | 45 (9.0%) | 41 (8.0%) | |
| Herpes simplex virus disease, N (%) | No | 961 (95.2%) | 475 (95.6%) | 486 (94.9%) | 0.627 |
| | Yes | 48 (4.8%) | 22 (4.4%) | 26 (5.1%) | |
| Lymphoma, N (%) | No | 996 (99.0%) | 490 (98.6%) | 506 (98.8%) | 0.739 |
| | Yes | 13 (1.0%) | 7 (1.4%) | 6 (1.2%) | |
| Hepatitis B, N (%) | No | 911 (90.3%) | 453 (91.1%) | 458 (89.5%) | 0.364 |
| | Yes | 98 (9.7%) | 44 (8.9%) | 54 (10.5%) | |
| Hepatitis C, N (%) | No | 979 (97.0%) | 485 (97.6%) | 494 (96.5%) | 0.303 |
| | Yes | 30 (3.0%) | 12 (2.4%) | 18 (3.5%) | |
| Fever, N (%) | No | 482 (47.7%) | 291 (58.5%) | 191 (37.3%) | < 0.001 |
| | Yes | 527 (52.2%) | 206 (41.4%) | 321 (62.7%) | |
| Cough, N (%) | No | 478 (47.3%) | 223 (44.8%) | 255 (49.8%) | 0.116 |
| | Yes | 531 (52.6%) | 274 (55.1%) | 257 (50.2%) | |
| Poor appetite, N (%) | No | 691 (68.4%) | 401 (80.6%) | 290 (56.6%) | < 0.001 |
| | Yes | 318 (31.5%) | 96 (19.3%) | 222 (43.4%) | |
| POAL, N (%) | No | 700 (69.3%) | 433 (87.1%) | 267 (52.2%) | < 0.001 |
| | Yes | 309 (30.6%) | 64 (12.8%) | 245 (47.8%) | |
| Hepatomegaly, N (%) | No | 885 (87.7%) | 478 (96.1%) | 407 (79.5%) | < 0.001 |
| | Yes | 124 (12.2%) | 19 (3.8%) | 105 (20.5%) | |
| Splenomegaly, N (%) | No | 886 (87.8%) | 472 (94.9%) | 414 (80.9%) | < 0.001 |
| | Yes | 123 (12.1%) | 25 (5.0%) | 98 (19.1%) | |
| Hb (g/L) | median (IQR) | 100.0 (84.0–114.0) | 105.0 (92.0–119.5) | 93.0 (78.0–108.0) | < 0.001 |
| WBC (×10⁹/L) | median (IQR) | 4.6 (3.2–6.7) | 5.2 (3.9–7.2) | 4.1 (2.8–6.2) | < 0.001 |
| ALC (×10⁹/L) | median (IQR) | 0.6 (0.3–0.9) | 0.7 (0.5–1.1) | 0.4 (0.3–0.7) | < 0.001 |
| CRP (mg/L) | median (IQR) | 47.9 (17.5–69.6) | 34.7 (11.0–61.5) | 62.0 (29.4–75.0) | < 0.001 |
| PLT (×10⁹/L) | median (IQR) | 188.0 (110.0–257.5) | 216.0 (158.5–284.5) | 143.0 (74.0–220.0) | < 0.001 |
| AST (U/L) | median (IQR) | 43.0 (26.0–78.0) | 31.0 (22.0–47.0) | 62.5 (35.8–110.5) | < 0.001 |
| ALT (U/L) | median (IQR) | 26.0 (16.0–46.0) | 22.0 (14.0–35.4) | 34.0 (20.0–59.0) | < 0.001 |
| AST/ALT | median (IQR) | 1.7 (1.1–2.5) | 1.5 (1.1–2.1) | 1.9 (1.3–3.0) | < 0.001 |
| ALB (g/L) | median (IQR) | 28.7 (24.2–32.6) | 30.5 (26.6–34.3) | 26.1 (22.6–30.1) | < 0.001 |
| BUN (mmol/L) | median (IQR) | 4.6 (3.4–6.3) | 4.7 (3.4–6.3) | 4.4 (3.3–6.3) | 0.357 |
| Creatinine (mg/L) | median (IQR) | 70.9 (58.0–85.3) | 69.9 (59.3–86.2) | 71.0 (57.0–85.0) | 0.885 |
| G (pg/mL) | median (IQR) | 190.0 (38.5–255.0) | 182.0 (35.0–251.5) | 202.0 (43.0–256.8) | 0.651 |
| CD4+ T-cell count (cells/µL) | median (IQR) | 20.0 (9.0–43.5) | 28.0 (13.0–58.5) | 14.0 (6.3–29.0) | < 0.001 |
Abbreviations: BMI, body mass index; POAL, peripheral or abdominal lymphadenopathy; Hb, hemoglobin; WBC, white blood cell; ALC, absolute lymphocyte count; CRP, C-reactive protein; PLT, platelet; AST, aspartate aminotransferase; ALT, alanine transaminase; ALB, albumin; BUN, blood urea nitrogen; G, (1–3)-β-D glucan; IQR, interquartile range.
Variable screening
In this study, the RF algorithm was employed to generate an ensemble of 100 decision trees, using random splitting at each node and a predetermined random seed (seed = 42) to ensure reproducibility. The out-of-bag (OOB) error rate was 22.8%, reflecting a relatively low error rate on the OOB samples and suggesting robust generalization. Notably, the model exhibited particularly low error rates when classifying cases without Talaromyces marneffei infection, further substantiating its efficacy for this category (Supplementary Fig. S3). Compared with the Mean Decrease Accuracy, the Mean Decrease Gini (MDG) is better suited to high-dimensional or noisy datasets [35], so our study used MDG. The 36 predictive variables influencing talaromycosis risk in the training set were ranked in descending order of importance (Supplementary Fig. S4).
In the Lasso regression model, 18 predictors retained non-zero coefficients at the minimal lambda (λmin = 0.016). The variable selection process was evaluated through penalization path analysis (Fig. 2) and coefficient trajectory visualization (Fig. 3). The model bias fluctuated negligibly across the regularization interval from λ1se to λmin, indicating stable parameter estimation within this range of penalty values [36]. Supplementary Figure S5 displays the ranking of the non-zero predictive variables. POAL had the largest absolute coefficient (β = 0.263, positive), making it the strongest determinant of the outcome in this model. ALC followed, with a negative association with the outcome (β = -0.038). Conversely, variables such as AST had coefficients near zero, suggesting a minimal contribution to the model's predictive capability.
The Boruta algorithm, a random forest-based feature selection technique, employs permutation importance analysis to iteratively eliminate insignificant predictors while retaining both strong and moderate associations with clinical outcomes [37]. The algorithm identified 12 significant variables after 20 iterations. Supplementary Figs. S6 and S7 illustrate the fluctuations in variable importance scores during the Boruta process and the shifts in feature importance rankings across classifier runs.
XGBoost's computational efficiency enables accurate training with less data, while its strong generalization and scalability make it well suited to medical data analysis, enhancing diagnostic and prognostic reliability [38]. In total, XGBoost identified 32 predictor variables, whose importance is depicted in Supplementary Fig. S8.
MI enables distribution-free detection of non-linear relationships, facilitating robust feature selection without requiring data distribution assumptions. It effectively identifies complex feature interactions and handles mixed data types [39], conclusively identifying 31 predictive features with high discriminative power from an initial set of 36 candidate variables. The feature importance is shown in Supplementary Fig. S9.
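The sketch below shows why MI can detect relationships that a linear correlation misses. It computes MI for two discrete variables directly from the definition I(X;Y) = Σ p(x,y) log[p(x,y) / (p(x)p(y))], using made-up toy data. Continuous laboratory values would first need discretization or a nearest-neighbor estimator such as scikit-learn's `mutual_info_classif`. Whether the study used that particular estimator is not stated, so treat this as an illustration only.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def pearson(xs, ys):
    """Plain Pearson correlation, for comparison with MI."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Toy non-linear relationship: the outcome is positive only at the middle level.
level = [0, 1, 2, 0, 1, 2]
outcome = [0, 1, 0, 0, 1, 0]

mi_nonlinear = mutual_information(level, outcome)
r_nonlinear = pearson(level, outcome)
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(mi_nonlinear, r_nonlinear, mi_independent)
```

In the toy data the correlation is zero although the outcome is fully determined by the level, whereas MI is clearly positive. This is the property that makes MI a useful complement to coefficient-based selection such as Lasso.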
The features used for model construction were those selected by all five methods (RF, XGBoost, Lasso, MI, and Boruta). They comprised 12 variables: ALB, ALC, Hb, AST/ALT ratio, CRP, ALT, WBC count, PLT count, POAL, CD4+ T-cell count, AST, and age. For further details, see the Venn diagram in Fig. 4 and Supplementary Table S1.
Development and Evaluation of the Model
In this study, we developed and assessed eight machine-learning models to predict TM infection, aiming to identify the most effective model based on comprehensive diagnostic performance across cohorts. The models were evaluated using key metrics, including AUC, accuracy, sensitivity, specificity, and F1-score. The SVM model emerged as the best performer, with consistent and high metric values across the training, testing, and external validation cohorts. In external validation, the SVM model achieved an AUC of 0.921 and an accuracy of 0.853, with well-balanced sensitivity (0.823) and specificity (0.873). LR and naive Bayes (NB) were the next best performers; LR showed an excellent balance between precision and recall, while NB showed strong generalization. It is also essential to evaluate whether the models overfit the training data. For example, DT, RF, and XGBoost all achieved an AUC of 1.000 and an accuracy close to 1.000 in the training set, but their performance declined in the test and external validation sets, indicating a potential risk of overfitting. The model selection process prioritized consistent performance across multiple datasets over excelling in a single metric. Ultimately, the SVM was selected for its stability and superior overall performance, making it the most dependable option for predicting TM infection in varied clinical environments. A summary of the performance metrics for each model is presented in Supplementary Table S2.
Validation of the final model was conducted both internally and externally.
The reliability and clinical utility of the SVM model were validated using calibration curves and DCA in both the internal and external cohorts. Internally, the calibration curve (Fig. 5A) aligned closely with the ideal line (AUC = 0.84), indicating accurate risk estimation. The DCA (Fig. 6A) showed a higher net benefit than the default strategies, particularly at thresholds from 10% to 50%, supporting its use for prioritizing high-risk patients. Externally, the model maintained robust calibration (Fig. 5B) with minimal deviation in high-risk ranges and sustained a positive net benefit (Fig. 6B) across thresholds from 20% to 100%. These findings underscore the model's generalizability, as it consistently balanced sensitivity and specificity while minimizing unnecessary interventions. The concordance between internal and external validation supports its readiness for clinical implementation in diverse settings, providing a reliable tool for the early detection of Talaromyces marneffei infection in HIV patients.
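For readers unfamiliar with DCA, the net benefit at a risk threshold p_t is conventionally defined as TP/N − (FP/N) × p_t/(1 − p_t) and is compared with the "treat all" and "treat none" default strategies. The sketch below applies that standard definition to made-up labels and predicted risks. It is not the study's analysis code, and the function names are hypothetical.

```python
def net_benefit(y_true, risks, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risks) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risks) if r >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Made-up outcomes and predicted risks (not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
risks = [0.9, 0.8, 0.6, 0.4, 0.1, 0.2, 0.3, 0.35, 0.7, 0.65]

for pt in (0.2, 0.5):
    model_nb = net_benefit(y_true, risks, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    # "Treat none" always has a net benefit of 0.
    print(f"threshold={pt:.1f} model={model_nb:.3f} treat_all={all_nb:.3f} treat_none=0.000")
```

A model is clinically useful at a given threshold only if its net benefit exceeds both default strategies there, which is what the decision curves in Fig. 6 display across the threshold range.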
Analysis of the model's implications
The SHAP framework offers a visualization tool tailored for clinicians, aimed at clarifying the influence of specific clinical features on infection risk predictions. As illustrated in Supplementary Figure S10, the 12 selected features have differential impacts on the prediction of TM infection. Among these, POAL, AST, and ALC are particularly noteworthy because of their large contributions to the model's predictions, which marks them as essential diagnostic markers. Figure 7A illustrates a high-risk patient (f(x) = 0.86), whose prediction is primarily driven by elevated AST (101 U/L; normal range 7–40 U/L) and severe thrombocytopenia (platelet count 33×10⁹/L; normal range 100–300×10⁹/L), contributing SHAP values of +0.15 and +0.12, respectively. These biomarkers are consistent with established pathophysiological mechanisms, as hepatic injury and bone marrow suppression are characteristic features of disseminated Talaromyces marneffei infection in immunocompromised individuals. In contrast, Fig. 7B depicts a low-risk scenario (f(x) = 0.45), where protective factors such as a normal platelet count (PLT = 249×10⁹/L, SHAP −0.04) and a higher albumin level (ALB = 36.7 g/L, within the normal range of 35–55 g/L, SHAP −0.05) reduce the predicted likelihood of infection. Both predictions are contextualized against the baseline risk (E[f(X)] = 0.505), which represents the average predicted infection probability across the cohort.
Development of user-friendly applications
Our predictive model can be used locally or accessed online via Gradio, a platform that enables application sharing without requiring users to write Python code. The application computes the probability of talaromycosis in HIV-infected patients without skin lesions from the specific values of the 12 predictor variables, as shown in Supplementary Figures S11A and S11B. The web application is available to users in China (https://modelscope.cn/studios/LRYHJG/rf/summary) and internationally (https://huggingface.co/spaces/HuJiaGuang/LRYHJG-TM).
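A minimal sketch of how a 12-input Gradio form can wrap a trained classifier is shown below. It is not the deployed application: `DummyModel`, `make_predict_fn`, and `build_demo` are hypothetical names, the dummy model returns a fixed probability, and the real tool would load the trained SVM instead. With scikit-learn, an SVM exposes `predict_proba` only when it is trained with `probability=True`.

```python
# The 12 predictors reported in this paper, in an assumed input order.
FEATURES = ["ALB", "ALC", "Hb", "AST", "ALT", "AST/ALT",
            "CRP", "WBC", "PLT", "POAL", "CD4", "age"]


class DummyModel:
    """Stand-in for the trained classifier; always returns the same probability."""

    def predict_proba(self, rows):
        return [[0.5, 0.5] for _ in rows]


def make_predict_fn(model):
    """Wrap a model exposing predict_proba into a function of 12 numeric inputs."""
    def predict(*values):
        if len(values) != len(FEATURES):
            raise ValueError(f"expected {len(FEATURES)} values, got {len(values)}")
        row = [float(v) for v in values]
        return float(model.predict_proba([row])[0][1])
    return predict


def build_demo(predict_fn):
    """Build the web form; gradio is imported lazily so the rest runs without it."""
    import gradio as gr
    return gr.Interface(
        fn=predict_fn,
        inputs=[gr.Number(label=name) for name in FEATURES],
        outputs=gr.Number(label="Predicted probability of talaromycosis"),
    )


predict = make_predict_fn(DummyModel())
example_probability = predict(*[1.0] * len(FEATURES))
print(example_probability)

# To serve the form locally (requires gradio to be installed):
# build_demo(predict).launch()
```

Keeping the prediction function separate from the interface makes it possible to check the input handling without starting a web server.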
Discussion
This multicenter retrospective cohort study used electronic medical record data from endemic and non-endemic regions in China to develop a machine learning model that predicts talaromycosis risk in HIV-infected patients without skin lesions, demonstrating strong discriminative capacity and clinical applicability. The model's performance was evaluated at an independent medical center. The model offers personalized risk prediction through a multi-step process that includes variable selection, model development, validation, analysis, interpretation, and online deployment, surpassing previously published models in terms of accuracy and usability. With its user-friendly interface, it can help clinicians make swift decisions, such as whether to initiate antifungal treatment. Such targeted interventions may substantially decrease patient mortality and economic burden.
It is commonly recognized that ML models provide enhanced predictive capabilities compared to conventional models in the medical field [40]. This benefit facilitates the development of a more robust model from complex datasets [41]. This study underscores the proficiency of machine learning in managing heterogeneous multidimensional data through the integration of twelve predictive variables. For example, the elevation of AST and the reduction of PLT show high heterogeneity. Additionally, we employed visualization techniques to explain the predictive model and built a functional web application. This step is crucial for addressing concerns related to the opacity of black-box models and for building trust in machine learning within the medical sector [42]. To enhance reproducibility, the model development process incorporated cross-validation, internal validation, and external validation, all conducted with predetermined random seeds. This is a crucial step to ensure the clinical significance of the model through its interpretability [43].
In recent years, studies on HIV-infected patients hospitalized with TM infection have gradually increased. This infection is more common in Southeast Asia, especially among immunocompromised patients. However, while many studies focus on the prognostic risks of these patients, relatively few predict the risk of infection. Li X et al. [18] used multivariable logistic regression to build a prediction model from ten key clinical indicators in 1,564 cases from four tertiary hospitals in Southwest China, with the aim of evaluating talaromycosis risk in hospitalized HIV-positive patients. The model attained AUC values of 0.883 in training and 0.889 in validation and highlighted skin lesions as a key variable for disease prediction. However, for HIV-infected patients with TM infection who do not have skin lesions, the use of this model may have certain limitations.
We created and externally validated a talaromycosis prediction model for HIV-infected patients lacking skin lesions, using 12 clinical variables from the electronic medical records of four hospitals. To address collinearity in the model development data, we applied RF, Lasso (L1 regularization), XGBoost, Boruta, and MI to the 36 candidate variables to select the most influential ones. The study aimed to create a practical clinical application by selecting a limited number of readily accessible patient markers. Taking practicality into account, if a single marker can fulfill the prediction objective, there is no need to incorporate additional variables. More significantly, these predictive variables serve as baseline markers. Additionally, according to previous literature, these 36 variables are frequently used indicators in similar research studies. The study retrospectively analyzed 9 years of patient data. Moreover, these 36 variables represent the most extensive and readily accessible data within our electronic medical records system. We fitted eight machine learning algorithms to the 12 predictive variables, aiming to achieve optimal fitting results. The SVM algorithm generated a model with a high ACC (0.714) and an AUC of 0.809. This model performed well on both internal and external validation datasets.
Li X et al. [18] reported that AST, PLT, age, POAL, and CD4+ T-cell count were independent factors influencing the diagnosis of TM infection and subsequently formulated a clinically applicable predictive model based on these findings. Our study arrived at similar conclusions, with these five variables standing out as key contributors to the diagnosis of TM infection. This finding is further corroborated by the web application we developed using Gradio.
Studies indicate that TM infections are more prevalent and deadly among individuals infected with HIV in Southeast Asia [1]. Upon diagnosis, immediate and active treatment measures must be taken. High model ACC is essential to reduce false negatives (missed diagnoses) and false positives (misdiagnoses). The SVM model demonstrated robust diagnostic performance during the validation process. During the training phase, it achieved an ACC of 0.714, an AUC of 0.809, a specificity of 0.795, and a sensitivity of 0.631. Notably, the model's performance metrics, including ACC, AUC, specificity, sensitivity, and F1-score, showed improvement in both the testing and external validation datasets compared to the training phase. For instance, in the external validation cohort, the SVM model attained an AUC of 0.921, an ACC of 0.853, and an F1-score of 0.819. These enhancements underscore the model's strong generalization capability and reliability. The model's high accuracy, specificity, and sensitivity in detecting TM infection render it highly promising for clinical application in the early diagnosis and risk stratification of TM infection in HIV-positive patients without skin lesions. Its consistent performance across multiple validation stages further emphasizes its potential utility in real-world clinical settings.
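For readers less familiar with the metrics quoted above, the sketch below shows how ACC, sensitivity, specificity, and F1-score follow from the four cells of a confusion matrix. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not data from this study. The sketch also assumes that every denominator is non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute ACC, sensitivity, specificity, and F1 from confusion-matrix counts.

    tp/fn: infected patients flagged / missed (false negatives = missed diagnoses)
    tn/fp: uninfected patients cleared / flagged (false positives = misdiagnoses)
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # share of infected patients detected
    specificity = tn / (tn + fp)   # share of uninfected patients cleared
    precision = tp / (tp + fp)     # share of flagged patients truly infected
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "ACC": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "F1": f1,
    }

# Hypothetical counts for illustration only (not study data).
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity depends on false negatives and specificity on false positives, a single ACC value can hide an imbalance between the two, which is why the metrics are reported side by side.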
Our study has several limitations that we must acknowledge. First, the study was carried out in China, with the selection of participants mainly based on local populations. Consequently, applying these results to global populations may introduce potential biases, a frequent issue with predictive models [44–46]. The method can be readily adapted for application in other countries, and we foresee future opportunities for its implementation [47]. Consistent with other studies, HIV-infected individuals with talaromycosis often exhibit severe immunosuppression, as indicated by lower CD4+ T-cell counts [6, 9, 14]. This study targeted individuals with CD4+ T-cell counts under 200 cells/µL, who are more susceptible to opportunistic infections and health issues. Utilizing a multi-stage validation approach, which encompassed 10-fold cross-validation and independent external validation, we achieved a substantial enhancement in the predictive accuracy of the model for assessing the risk of Talaromyces marneffei infection in HIV-infected individuals lacking skin lesions. The calibration curve exhibited a slope approaching 1.0, signifying close agreement between predicted and observed probabilities, which is considered optimal. The model's capability for risk stratification is consistent with the clinical objectives outlined in the World Health Organization's fungal priority pathogens list [48]. This alignment provides evidence-based support for decision-making regarding the early administration of antifungal treatment to high-risk patients in endemic regions. In addition, the model's 12 predictive variables are all available upon admission. In brief, clinicians need to be alert to possible co-infections with talaromycosis in patients with lower CD4+ T-cell counts, especially in areas where such infections are common, even in the absence of skin lesions.
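The calibration slope mentioned above is commonly estimated by regressing the observed outcome on the logit of the predicted probability. The sketch below is one minimal way to do this and is not the procedure used in this study: it fits the two-parameter logistic model by Newton-Raphson in plain Python, and it uses synthetic, perfectly calibrated predictions generated with a fixed seed. The function names and the simulated data are assumptions made for illustration.

```python
import math
import random

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def calibration_slope(probs, outcomes, iterations=25):
    """Fit outcome ~ sigmoid(a + b * logit(p)) by Newton-Raphson; return (a, b).

    b is the calibration slope and a the calibration intercept.
    """
    x = [logit(p) for p in probs]
    a, b = 0.0, 1.0
    for _ in range(iterations):
        g_a = g_b = 0.0            # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0   # Fisher information (negative Hessian)
        for xi, yi in zip(x, outcomes):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g_a += yi - mu
            g_b += (yi - mu) * xi
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Synthetic, well-calibrated predictions (illustration only, not study data):
# each outcome is drawn with exactly the predicted probability.
rng = random.Random(0)
probs = [rng.uniform(0.05, 0.95) for _ in range(5000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

intercept, slope = calibration_slope(probs, outcomes)
print(f"calibration intercept = {intercept:.3f}, slope = {slope:.3f}")
```

For well-calibrated predictions the slope should lie near 1 and the intercept near 0. A slope below 1 generally indicates predictions that are too extreme, as often happens with overfitting, and a slope above 1 indicates predictions that are too conservative.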
However, in real-world clinical settings, the diagnosis of TM infection is influenced by a variety of factors, such as disease severity, physician experience, and the choice of diagnostic tools. Fortunately, in recent years, with the growing attention to rare infectious diseases, awareness of TM infection diagnosis has also gradually increased. By optimizing these factors, the accuracy and timeliness of the diagnosis can be improved, thus enhancing patient prognosis. In addition, when applied to populations in non-endemic areas, our model may have a higher misdiagnosis rate. In recent years, we have expanded data collection from additional hospitals and aim to enhance the model using the latest data from non-endemic regions and various centers. Simultaneously, we are undertaking prospective multicenter studies (ChiCTR1900021195).
Conclusions
To conclude, we have created a clinically useful and user-friendly web application. While prospective validation remains essential, predicting the risk of talaromycosis in HIV-infected patients without skin lesions is now possible. This advancement facilitates proactive and targeted clinical decision-making.
Statement on Data Availability
The data from the study can be obtained from the corresponding author upon a reasonable request.
Author Contribution
Jianning Jiang and Zhongsheng Jiang were responsible for the study's conception and design. Jiaguang Hu, Wenming He, Qun Tian, Yanqiu Lu, Peng Zhang, Jinyu Qin, Chuan Qin, Ying Wu, Xu Li, and Linghua Li conducted the material preparation and data collection. Jiaguang Hu, Cheng Huang, Luhuai Feng, and Xu Li conducted the primary analysis, with Jiaguang Hu also drafting the initial manuscript. All authors have approved the final manuscript.
Financial support
This study was supported by funding from multiple sources, including the National Science and Technology Major Project of China during the 13th Five-Year Plan Period (2018ZX10302104), Liuzhou Key Laboratory of Severe Abdominal Infection Research/Guangxi Key Laboratory of Clinical Disease Biotechnology Research (LRYFQ202501), the National Natural Science Foundation (82260124 and 81960115), the Science and Technology Project of Liuzhou (2022SB009), the Scientific Research Project of Liuzhou People's Hospital affiliated to Guangxi Medical University (Lry202327), and the Chinese Preventive Medicine Association's Hospital Infection Control Branch's Young Talent Support Program (CPMA-HAIC-2024012900113).
Conflict of interest
All the authors declare that they have no conflicts of interest.
References
1.
QIN Y, HUANG X, CHEN H, et al. Burden of Talaromyces marneffei infection in people living with HIV/AIDS in Asia during ART era: a systematic review and meta-analysis [J]. BMC Infect Dis, 2020, 20(1): 551. https://doi.org/10.1186/s12879-020-05260-8.
2.
CAO C, XI L, CHATURVEDI V. Talaromycosis (Penicilliosis) Due to Talaromyces (Penicillium) marneffei: Insights into the Clinical Trends of a Major Fungal Disease 60 Years After the Discovery of the Pathogen [J]. Mycopathologia, 2019, 184(6): 709–20. https://doi.org/10.1007/s11046-019-00410-2.
3.
LARSSON M, NGUYEN L H T, WERTHEIM H F L, et al. Clinical characteristics and outcome of Penicillium marneffei infection among HIV-infected patients in northern Vietnam [J]. AIDS Research and Therapy, 2012, 9(1): 24. http://www.aidsrestherapy.com/content/9/1/24.
4.
CHAYAKULKEEREE M, DENNING D W. Serious fungal infections in Thailand [J]. European Journal of Clinical Microbiology & Infectious Diseases, 2017, 36(6): 931–5. https://doi.org/10.1007/s10096-017-2927-6.
5.
JIANG J, MENG S, HUANG S, et al. Effects of Talaromyces marneffei infection on mortality of HIV/AIDS patients in southern China: a retrospective cohort study [J]. Clinical Microbiology and Infection, 2019, 25(2): 233–41. https://doi.org/10.1016/j.cmi.2018.04.018.
6.
YING R S, LE T, CAI WP, et al. Clinical epidemiology and outcome of HIV-associated talaromycosis in Guangdong, China, during 2011–2017 [J]. HIV Medicine, 2020, 21(11): 729–38. https://doi.org/10.1111/hiv.13024.
7.
HU Y, ZHANG J, LI X, et al. Penicillium marneffei Infection: An Emerging Disease in Mainland China [J]. Mycopathologia, 2012, 175(1–2): 57–67. https://doi.org/10.1007/s11046-012-9577-0.
8.
CHEN J, ZHANG R, SHEN Y, et al. Clinical Characteristics and Prognosis of Penicilliosis Among Human Immunodeficiency Virus-Infected Patients in Eastern China [J]. The American Journal of Tropical Medicine and Hygiene, 2017, 96(6): 1350–4. https://doi.org/10.4269/ajtmh.16-0521.
9.
SON V T, KHUE P M, STROBEL M. Penicilliosis and AIDS in Haiphong, Vietnam: Evolution and predictive factors of death [J]. Médecine et Maladies Infectieuses, 2014, 44(11–12): 495–501. http://dx.doi.org/10.1016/j.medmal.2014.09.008.
10.
RANJANA K H, PRIYOKUMAR K, SINGH T J, et al. Disseminated Penicillium marneffei Infection among HIV-Infected Patients in Manipur State, India [J]. Journal of Infection, 2002, 45(4): 268–71. http://dx.doi.org/10.1053/jinf.2002.1062.
11.
SHI J, YANG N, QIAN G. Case Report: Metagenomic Next-Generation Sequencing in Diagnosis of Talaromycosis of an Immunocompetent Patient [J]. Front Med (Lausanne), 2021, 8: 656194. https://doi.org/10.3389/fmed.2021.656194.
12.
VANITTANAKOM N, COOPER C R, FISHER M C, SIRISANTHANA T. Penicillium marneffei Infection and Recent Advances in the Epidemiology and Molecular Biology Aspects [J]. Clinical Microbiology Reviews, 2006, 19(1): 95–110. https://doi.org/10.1128/CMR.19.1.95-110.2006.
13.
RUAN Z, NING C, LAI J, et al. Accuracy of rapid diagnosis of Talaromyces marneffei: A systematic review and meta-analysis [J]. Plos One, 2018, 13(4). https://doi.org/10.1371/journal.pone.0195569.
14.
KAWILA R, CHAIWARITH R, SUPPARATPINYO K. Clinical and laboratory characteristics of penicilliosis marneffei among patients with and without HIV infection in Northern Thailand: a retrospective study [J]. AIDS Research and Therapy, 2013, 10(1): 12. https://doi.org/10.1186/1471-2334-13-464.
15.
WONG S Y N, WONG K F. Penicillium marneffei Infection in AIDS [J]. Pathology Research International, 2011, 2011: 1–10. https://doi.org/10.4061/2011/764293.
16.
SHAHARUDDIN N H, SAHLAWATI M, KUMAR C S, CHRISTOPHER LEE K C. A Retrospective Review on Successful Management of Penicillium Marneffei Infections in Patients with Advanced HIV [J]. Med J Malaysia, 2012, 67(1): 66–70.
17.
LE T, KINH N V, CUC N T K, et al. A Trial of Itraconazole or Amphotericin B for HIV-Associated Talaromycosis [J]. New England Journal of Medicine, 2017, 376(24): 2329–40. https://doi.org/10.1056/NEJMoa1613306.
18.
LI X, JIANG Z, MO S, et al. A web-based dynamic nomogram for estimating talaromycosis risk in hospitalized HIV-positive patients [J]. Epidemiology and Infection, 2024, 152. https://doi.org/10.1017/S0950268824001456.
19.
REN W, LI D, WANG J, et al. Prediction and Evaluation of Machine Learning Algorithm for Prediction of Blood Transfusion during Cesarean Section and Analysis of Risk Factors of Hypothermia during Anesthesia Recovery [J]. Computational and Mathematical Methods in Medicine, 2022, 2022: 1–9. https://doi.org/10.1155/2022/8661324.
20.
AZODI C B, TANG J, SHIU S-H. Opening the Black Box: Interpretable Machine Learning for Geneticists [J]. Trends in Genetics, 2020, 36(6): 442–55. https://doi.org/10.1016/j.tig.2020.03.005.
21.
VITTINGHOFF E, MCCULLOCH C E. Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression [J]. American Journal of Epidemiology, 2007, 165(6): 710–8. https://doi.org/10.1093/aje/kwk052.
22.
VAN SMEDEN M, MOONS K G M, DE GROOT J A H, et al. Sample size for binary logistic prediction models: Beyond events per variable criteria [J]. Statistical Methods in Medical Research, 2018, 28(8): 2455–74. https://doi.org/10.1177/0962280218784726.
23.
VENKATESH K K, JELOVSEK J E, HOFFMAN M, et al. Postpartum readmission for hypertension and pre-eclampsia: development and validation of a predictive model [J]. BJOG: An International Journal of Obstetrics & Gynaecology, 2023, 130(12): 1531–40. https://doi.org/10.1111/1471-0528.17572.
24.
FARHADIAN M, TORKAMAN S, MOJARAD F. Random forest algorithm to identify factors associated with sports-related dental injuries in 6 to 13-year-old athlete children in Hamadan, Iran-2018 a cross-sectional study [J]. BMC Sports Science, Medicine and Rehabilitation, 2020, 12(1). https://doi.org/10.1186/s13102-020-00217-5.
25.
SHI G, LIU G, GAO Q, et al. A random forest algorithm-based prediction model for moderate to severe acute postoperative pain after orthopedic surgery under general anesthesia [J]. BMC Anesthesiology, 2023, 23(1). https://doi.org/10.1186/s12871-023-02328-1.
26.
NACHOUKI M, MOHAMED E A, MEHDI R, ABOU NAAJ M. Student course grade prediction using the random forest algorithm: Analysis of predictors' importance [J]. Trends in Neuroscience and Education, 2023, 33. https://doi.org/10.1016/j.tine.2023.100214.
27.
WANG J, XU Y, LIU L, et al. Comparison of LASSO and random forest models for predicting the risk of premature coronary artery disease [J]. BMC Medical Informatics and Decision Making, 2023, 23(1). https://doi.org/10.1186/s12911-023-02407-w.
28.
KANG J, CHOI Y J, KIM I-K, et al. LASSO-Based Machine Learning Algorithm for Prediction of Lymph Node Metastasis in T1 Colorectal Cancer [J]. Cancer Research and Treatment, 2021, 53(3): 773–83. https://doi.org/10.4143/crt.2020.974.
29.
ZHOU H, XIN Y, LI S. A diabetes prediction model based on Boruta feature selection and ensemble learning [J]. BMC Bioinformatics, 2023, 24(1). https://doi.org/10.1186/s12859-023-05300-5.
30.
SUN Y, ZHANG Q, YANG Q, et al. Screening of Gene Expression Markers for Corona Virus Disease 2019 Through Boruta_MCFS Feature Selection [J]. Frontiers in Public Health, 2022, 10. https://doi.org/10.3389/fpubh.2022.901602.
31.
MOORE A, BELL M. XGBoost, A Novel Explainable AI Technique, in the Prediction of Myocardial Infarction: A UK Biobank Cohort Study [J]. Clinical Medicine Insights: Cardiology, 2022, 16. https://doi.org/10.1177/11795468221133611.
32.
XU J, TANG B, HE H, MAN H. Semisupervised Feature Selection Based on Relevance and Redundancy Criteria [J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(9): 1974–84. https://doi.org/10.1109/TNNLS.2016.2562670.
33.
RUDIN C. Stop explaining black-box machine learning models for high-stakes decisions and use interpretable models instead [J]. Nature Machine Intelligence, 2019, 1(5): 206–15. https://doi.org/10.1038/s42256-019-0048-x.
34.
HUNTER D J, DRAZEN J M, KOHANE I S, et al. Where Medical Statistics Meets Artificial Intelligence [J]. New England Journal of Medicine, 2023, 389(13): 1211–9. https://doi.org/10.1056/NEJMra2212850.
35.
GREENER J G, KANDATHIL S M, MOFFAT L, JONES D T. A guide to machine learning for biologists [J]. Nature Reviews Molecular Cell Biology, 2021, 23(1): 40–55. https://doi.org/10.1038/s41580-021-00407-0.
36.
YAQI W, YIWEI S, YIMIN L. Prognosis prediction of paraquat poisoning with lasso-logistic regression [J]. Occup Health & Emerg Rescue, 2022, 40(3): 259–64. https://doi.org/10.16369/j.oher.issn.1007-1326.2022.03.001.
37.
HAMIDI F, GILANI N, ARABI BELAGHI R, et al. Identifying potential circulating miRNA biomarkers for the diagnosis and prediction of ovarian cancer using the machine-learning approach: application of Boruta [J]. Frontiers in Digital Health, 2023, 5. https://doi.org/10.3389/fdgth.2023.1187578.
38.
GUAN X, DU Y, MA R, et al. Construction of the XGBoost model for early lung cancer prediction based on metabolic indices [J]. BMC Medical Informatics and Decision Making, 2023, 23(1). https://doi.org/10.1186/s12911-023-02171-x.
39.
WANG X, ZHOU Y, IQBAL N. Multi-Label Feature Selection with Conditional Mutual Information [J]. Computational Intelligence and Neuroscience, 2022, 2022: 1–13. https://doi.org/10.1155/2022/9243893.
40.
OTA R, YAMASHITA F. Application of machine learning techniques to the analysis and prediction of drug pharmacokinetics [J]. Journal of Controlled Release, 2022, 352: 961–9. https://doi.org/10.1016/j.jconrel.2022.11.014.
41.
ZHOU S-N, JV D-W, MENG X-F, et al. Feasibility of machine learning-based modeling and prediction using multiple centers data to assess intrahepatic cholangiocarcinoma outcomes [J]. Annals of Medicine, 2022, 55(1): 215–23. https://doi.org/10.1080/07853890.2022.2160008.
42.
PARIKH R B, OBERMEYER Z, NAVATHE A S. Regulation of predictive analytics in medicine [J]. Science, 2019, 363(6429): 810–2. https://doi.org/10.1126/science.aaw0029.
43.
ADALI T L, CALHOUN V D. Reproducibility and replicability in neuroimaging data analysis [J]. Current Opinion in Neurology, 2022, 35(4): 475–81. https://doi.org/10.1097/WCO.0000000000001081.
44.
SHI Y, ZHANG G, MA C, et al. Machine learning algorithms to predict intraoperative hemorrhage in surgical patients: a modeling study of real-world data in Shanghai, China [J]. BMC Medical Informatics and Decision Making, 2023, 23(1). https://doi.org/10.1186/s12911-023-02253-w.
45.
HU J, XU J, LI M, et al. Identification and validation of an explainable prediction model of acute kidney injury with prognostic implications in critically ill children: a prospective multicenter cohort study [J]. EClinicalMedicine, 2024, 68: 102409. https://doi.org/10.1016/j.eclinm.2023.102409.
46.
YU Y, LIN J, MUTO C, et al. Assessment of the Utility of Physiologically-based Pharmacokinetic Model for Prediction of Pharmacokinetics in Chinese and Japanese Populations [J]. International Journal of Medical Sciences, 2021, 18(16): 3718–27. https://doi.org/10.7150/ijms.65040.
47.
KITSON S J, CROSBIE E J, EVANS D G, et al. Predicting risk of endometrial cancer in asymptomatic women (PRECISION): Model development and external validation [J]. BJOG: An International Journal of Obstetrics & Gynaecology, 2023, 131(7): 996–1005. https://doi.org/10.1111/1471-0528.17729.
48.
WHO. WHO fungal priority pathogens list to guide research, development, and public health action. [J]. 2022. (accessed July 31, 2023).