Machine learning prediction and interpretive analysis of multidrug-resistant microbial infection risk in septicemia patients: A study from the MIMIC-IV database

Qianqian Zhang 1

Nianzhi Zhang 2 ✉ Email: zhangnianzhi5566@126.com

Ying Zheng 1

Jing Zhou 1,2

Ling Liu 1,2

1 The First Clinical Medical College of Anhui University of Chinese Medicine 230000 Hefei China

2 Department of Respiratory Medicine First Affiliated Hospital of Anhui University of Chinese Medicine 230000 Hefei China

Corresponding author: Nianzhi Zhang, Email: zhangnianzhi5566@126.com

Abstract

Objective: To construct and compare six machine learning models for identifying high-risk factors of multidrug-resistant organism (MDRO) infection in sepsis patients using the MIMIC-IV (v3.1) database.

Methods: We conducted a retrospective cohort study of ICU patients meeting Sepsis 3.0 diagnostic criteria from the MIMIC-IV database. Data underwent preprocessing including missing value handling, constant variable removal, and standardization. Key predictors were selected using LASSO regression and the Boruta algorithm. Six machine learning models (LGBM, RF, CatBoost, GBDT, MLP, KNNC) were developed, with SHAP applied for interpretability. Performance was evaluated via AUC, sensitivity, specificity, F1-score, and accuracy. Decision curve analysis (DCA) and calibration curves assessed clinical utility.

Results: Among 23,191 patients, 2,806 (12.1%) had MDRO infections. Two-stage feature selection (LASSO + Boruta) identified nine core predictors: age, platelet count, red cell distribution width (RDW), blood glucose, lactic acid, partial pressure of oxygen (PO2), Acute Physiology Score III (APS III), hypertension (HTN), and acute kidney injury (AKI). The LGBM model achieved optimal performance (test AUC = 0.964, accuracy = 0.904, F1-score = 0.925). DCA demonstrated significant net clinical benefit for the LGBM and CatBoost models across thresholds of 0.2–0.6. SHAP analysis revealed HTN and AKI as top risk drivers for MDRO infection, while higher PO2 was the primary protective factor.

Conclusion: Machine learning models, particularly LGBM, effectively identify ICU sepsis patients at high risk of MDRO infection. Key clinical features (e.g., HTN, AKI, PO2, RDW, lactic acid, APS III) coupled with SHAP interpretability provide a robust decision-support tool for early risk stratification and antimicrobial stewardship optimization.

Introduction

Sepsis is a life-threatening organ dysfunction caused by a dysregulated host response to infection. As a common critical condition in emergency departments and ICUs, its mortality rate exceeds that of myocardial infarction and stroke [1]. Of the approximately 49 million sepsis cases worldwide each year, 11 million patients die from sepsis-related complications, accounting for 20% of global deaths [2]. The latest epidemiological study in China reports that the incidence of sepsis in ICU patients reaches 20.6%, with a mortality rate as high as 35.5%. Among Gram-negative bacterial infection cases, 42% involve multidrug-resistant organisms (MDROs), which are significantly associated with mortality [3]. More alarmingly, the risk of death in MDRO-infected patients is 64% higher than in those with non-resistant infections [4], making MDRO infection a core challenge in critical care.

MDROs are pathogens resistant to three or more classes of commonly used antimicrobial agents, encompassing extensively drug-resistant (XDR) and pan-drug-resistant (PDR) strains. Clinically common MDROs include methicillin-resistant Staphylococcus aureus (MRSA), vancomycin-resistant enterococci (VRE), extended-spectrum β-lactamase (ESBL)-producing Enterobacteriaceae, and multidrug-resistant Pseudomonas aeruginosa (MDR-PA) [5]. Data from the 2020 China Bacterial Resistance Surveillance Report show that the five pathogens with the highest clinical isolation rates are Escherichia coli (18.96%), Klebsiella pneumoniae (14.12%), Staphylococcus aureus (8.93%), Pseudomonas aeruginosa (7.96%), and Acinetobacter baumannii (7.28%) [4]. Notably, detection rates of MRSA (28.5%→30.2%) and carbapenem-resistant Klebsiella pneumoniae (10.4%→13.3%) continued to rise between 2018 and 2021 [6], reflecting an increasingly severe resistance crisis [7, 8].

Given the complexity and rapid progression of sepsis, early and accurate prediction of MDRO infection risk poses significant challenges [9]. Traditional scoring systems (e.g., SOFA, APACHE II) are limited in capturing complex nonlinear relationships among variables. Therefore, there is an urgent need to develop dynamic risk prediction tools by integrating machine learning (ML) techniques. The widespread adoption of electronic health records (EHRs) in healthcare institutions provides rich clinical data resources for such risk predictions. ML excels at processing complex high-dimensional data and identifying nonlinear patterns, demonstrating great potential in disease prediction [10]. Owing to its efficiency, accuracy, and ability to handle high-dimensional data, ML is increasingly applied in healthcare [11–13]. Preliminary studies have confirmed the feasibility of ML in predicting MDRO infections [14, 15]; however, the "black-box" nature of ML models (lack of interpretability) limits their clinical adoption [16]. The SHapley Additive exPlanations (SHAP) method quantifies feature contributions to provide intuitive explanations for model predictions [17], thereby addressing the black-box problem. Thus, this study aims to: (1) utilize the large-scale critical care database MIMIC-IV; (2) systematically identify key risk factors for MDRO infections in ICU sepsis patients; (3) develop and compare multiple ML prediction models; and (4) apply SHAP to elucidate model prediction mechanisms, enhance transparency and clinical acceptance, and optimize prevention, control, and management decisions for MDRO infections in ICU sepsis patients.

2. Methods

2.1. Data Sources and Study Cohort

This retrospective analysis was conducted using the Medical Information Mart for Intensive Care IV database (MIMIC-IV v3.1), jointly developed by the Massachusetts Institute of Technology (MIT) Computational Physiology and Artificial Intelligence Laboratory and Beth Israel Deaconess Medical Center (BIDMC). The database incorporates significant improvements, including data updates and structural optimizations. The dataset encompasses over 360,000 patient care trajectories from the BIDMC intensive care unit in Boston, USA, between 2008 and 2022, involving more than 540,000 hospitalization records and over 90,000 ICU stays. It includes multidimensional clinical features, such as demographic information, laboratory test results, medication records, continuous vital sign monitoring data, surgical procedure codes, ICD-standardized diagnostic information, therapeutic regimens, and post-discharge survival follow-up. The BIDMC Institutional Review Board approved the study as meeting the criteria for data use exemption. The research team obtained access (ID: 14280276) after completing the National Institutes of Health (NIH) Human Subjects Protection Course and the Collaborative Institutional Training Initiative (CITI) program. The database employs dual de-identification techniques, with all protected health information (PHI) removed, complying with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor standards, thus waiving the need for informed consent.

2.2. Inclusion and Exclusion Criteria

Inclusion criteria:

Patients meeting both of the following criteria based on the Sepsis-3.0 definition were included:

Suspected or confirmed infection

SOFA score ≥ 2 points within 24 hours of ICU admission

Exclusion criteria:

Subjects were excluded if any of the following applied:

i. Multiple ICU admissions during the same hospitalization (only the first admission was retained)

ii. ICU length of stay < 24 hours

iii. Age < 18 or > 90 years

iv. SOFA score not documented within 24 hours of ICU admission

v. No microbiological culture performed within 48 hours of admission

A detailed patient selection flowchart is presented in Fig. 1.
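The inclusion and exclusion rules above can be expressed as a single eligibility check. The following is a minimal sketch, not the SQL actually run against MIMIC-IV: the `IcuStay` record and all of its field names are hypothetical and do not correspond to MIMIC-IV column names, and the example stays are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IcuStay:
    """One ICU stay; every field name here is hypothetical."""
    subject_id: int
    hadm_id: int                 # hospital admission identifier
    icu_order: int               # 1 = first ICU stay within the hospitalization
    age: float
    icu_los_hours: float
    suspected_infection: bool
    sofa_24h: Optional[int]      # None when no SOFA score was documented
    culture_within_48h: bool


def is_eligible(stay: IcuStay) -> bool:
    """Apply the five exclusion criteria, then the Sepsis-3 inclusion rule."""
    if stay.icu_order != 1:                  # keep only the first ICU admission
        return False
    if stay.icu_los_hours < 24:
        return False
    if stay.age < 18 or stay.age > 90:
        return False
    if stay.sofa_24h is None:
        return False
    if not stay.culture_within_48h:
        return False
    return stay.suspected_infection and stay.sofa_24h >= 2


# Invented example stays, one per rule.
stays = [
    IcuStay(1, 10, 1, 67.0, 72.0, True, 5, True),     # eligible
    IcuStay(2, 11, 1, 45.0, 12.0, True, 4, True),     # ICU stay < 24 h
    IcuStay(3, 12, 2, 58.0, 96.0, True, 6, True),     # not the first ICU stay
    IcuStay(4, 13, 1, 93.0, 48.0, True, 3, True),     # age > 90
    IcuStay(5, 14, 1, 71.0, 48.0, True, None, True),  # SOFA not documented
    IcuStay(6, 15, 1, 60.0, 48.0, True, 1, True),     # SOFA < 2
]
cohort = [s for s in stays if is_eligible(s)]
```

Keeping each criterion as its own early return makes it straightforward to count how many stays each rule removes, which is the information a selection flowchart reports.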

Fig. 1

Participant Selection Flowchart.

2.3 Data Extraction

Data extraction was performed using Navicat Premium (Version 16.1.15) and Structured Query Language (SQL). The following dimensions were extracted for sepsis patients in the MIMIC-IV database: (1) Demographic characteristics: age, sex, weight, marital status, ethnicity, language. (2) Comorbidities: hypertension (HTN), acute kidney injury (AKI), chronic kidney disease (CKD), type 2 diabetes mellitus (T2DM). (3) Initial vital signs upon ICU admission: heart rate (HR), blood pressure parameters (systolic blood pressure SBP, diastolic blood pressure DBP, mean arterial pressure MBP), respiratory rate (RR), body temperature (T), and blood oxygen saturation (SpO2). (4) Laboratory indicators: blood gas analysis (tCO2, iCa, Lac, PaCO2, pH, PaO2), RDW, serum albumin (ALB), complete blood cell count (red blood cells, white blood cells, platelets), blood glucose and electrolytes (Na+, K+, anion gap), and microbiological culture results (positive/negative, specific pathogens, and drug resistance). (5) Disease severity scores: Sequential Organ Failure Assessment (SOFA), APS III. (6) Interventions: duration of mechanical ventilation. (7) Outcome measures: in-hospital mortality, ICU mortality, and 28-day mortality; the primary endpoint was the incidence of MDRO infection in sepsis patients during ICU hospitalization.

2.4 Data Processing

Variables with more than 25% missing values were removed. Continuous variables were winsorized at the 1st and 99th percentiles to handle outliers, while missing categorical variables were imputed using the mode. Categorical variables with category percentages below 5% or containing ambiguous classifications were excluded. The retained variables were used for further analysis. To avoid data leakage, the dataset was first randomly divided into training and test sets in a 7:3 ratio. The training set was used for feature selection and model training, while the test set was used solely for final performance evaluation. Subsequently, the interpolate function in Python was used to impute missing data in the training and test sets separately using the spline method.
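A minimal sketch of the split, winsorization, and mode-imputation steps is given below. It assumes one leakage-averse ordering (split first, then derive the percentile bounds from the training rows and reuse them on the held-out rows), which may differ from the workflow actually used. The helper names and the synthetic lactate-like variable are illustrative only, and the spline interpolation step is not reproduced.

```python
from collections import Counter

import numpy as np


def split_indices(n_rows, train_frac=0.7, seed=0):
    """Randomly split row indices into training and held-out parts (7:3)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    cut = int(round(train_frac * n_rows))
    return order[:cut], order[cut:]


def winsor_bounds(values, lower_pct=1.0, upper_pct=99.0):
    """Percentile bounds computed from the given values, ignoring NaN."""
    return np.nanpercentile(values, lower_pct), np.nanpercentile(values, upper_pct)


def winsorize(values, bounds):
    """Clip values to the bounds; NaN entries stay NaN."""
    low, high = bounds
    return np.clip(values, low, high)


def impute_mode(categories):
    """Replace None entries with the most frequent observed category."""
    mode = Counter(c for c in categories if c is not None).most_common(1)[0][0]
    return [mode if c is None else c for c in categories]


# Synthetic stand-in for one skewed laboratory variable (not MIMIC-IV data).
rng = np.random.default_rng(42)
lactate = rng.lognormal(mean=0.8, sigma=0.5, size=1000)

train_idx, test_idx = split_indices(len(lactate))
bounds = winsor_bounds(lactate[train_idx])            # fitted on training rows only
lactate_train = winsorize(lactate[train_idx], bounds)
lactate_test = winsorize(lactate[test_idx], bounds)   # training bounds reused
sex = impute_mode(["M", "F", None, "M"])
```

Deriving the bounds from the training rows alone keeps the held-out rows from influencing any preprocessing parameter.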

2.5 Statistical Analysis and Model Development

Baseline characteristics were described using statistical tests appropriate to data distribution. Continuous variables underwent normality testing with the Kolmogorov-Smirnov test, with intergroup comparisons performed using t-tests for normally distributed data. Categorical variables were presented as percentages (%) and compared using Pearson's chi-square test.
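As a worked illustration of the categorical comparison, the sketch below computes Pearson's chi-square statistic for the AKI counts reported in Table 1 (non-MDRO: 12,245 without and 8,140 with AKI; MDRO: 1,252 without and 1,554 with AKI). It is a plain-Python calculation without continuity correction, not the DecisionLinnc routine used in the study, and the function name is hypothetical.

```python
import math


def pearson_chi2_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    Returns the statistic and its p-value for 1 degree of freedom, using
    the identity P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Rows: non-MDRO, MDRO; columns: AKI absent, AKI present (counts from Table 1).
aki_table = ((12245, 8140), (1252, 1554))
chi2, p_value = pearson_chi2_2x2(aki_table)
print(f"chi2 = {chi2:.1f}, p < 0.001: {p_value < 0.001}")
```

The statistic lies far above 3.84, the critical value for one degree of freedom at the 0.05 level, consistent with the P < 0.001 reported for AKI.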

To address class imbalance (12.1% MDRO-positive rate), we implemented the Synthetic Minority Oversampling Technique (SMOTE). Oversampling was applied exclusively during five-fold cross-validation partitioning, which divided the sample data into training and internal validation sets.
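The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration rather than the implementation used in the study: it assumes the minority class is labelled 1, that oversampling is applied only to the training portion of a fold, and that Euclidean distance on already standardized features is appropriate. The function names and the synthetic fold are hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)        # seed sample for each new row
    partner = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[partner] - X_min[base])


def oversample_training_fold(X, y, seed=0):
    """Balance the classes of one training fold; validation rows are untouched."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())
    if n_new <= 0 or len(X_min) < 2:
        return X, y
    X_new = smote_sketch(X_min, n_new, seed=seed)
    y_new = np.ones(n_new, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


# Synthetic training fold with 12% positives (not study data).
rng = np.random.default_rng(1)
X_fold = rng.normal(size=(200, 3))
y_fold = np.zeros(200, dtype=int)
y_fold[:24] = 1
X_bal, y_bal = oversample_training_fold(X_fold, y_fold)
```

Because every synthetic row is a convex combination of two real minority rows, oversampling cannot create values outside the observed minority range, which is one reason the technique may under-represent unusual clinical presentations.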

Six machine learning algorithms were employed for model construction: Light Gradient Boosting Machine (LGBM), Random Forest (RF), Categorical Boosting (CatBoost), Gradient Boosting Decision Tree (GBDT), K-Nearest Neighbor Classification (KNNC), and Multilayer Perceptron (MLP). Variables selected by LASSO regression comprised the candidate feature set for subsequent Boruta algorithm screening and model input. Hyperparameter tuning optimized models by maximizing the area under the receiver operating characteristic (ROC) curve (AUC).

Model performance was comprehensively evaluated using AUC, sensitivity (recall), specificity, accuracy, F1-score, and the Matthews correlation coefficient (MCC). Clinical utility was further assessed through decision curve analysis (DCA) and calibration curve plotting.
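For reference, the threshold-based metrics and the AUC can be computed directly from predicted probabilities, as in the sketch below. The labels and probabilities are invented toy values, the 0.5 cut-off is an assumption, and the function names are illustrative; the AUC uses the rank (Mann-Whitney) formulation, counting ties as one half.

```python
import math


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, accuracy, F1 and MCC at one probability cut-off."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0   # avoid division by zero

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(y_true)),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


def auc_rank(y_true, y_prob):
    """AUC as the probability that a positive case outranks a negative case."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
metrics = classification_metrics(y_true, y_prob)
auc = auc_rank(y_true, y_prob)
```

Unlike accuracy and F1, the MCC uses all four cells of the confusion matrix, so it is less flattering when one class dominates.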

Model interpretation incorporated three SHapley Additive exPlanations (SHAP) visualization techniques: summary plots for global feature importance, dependence plots to illustrate nonlinear relationships between key continuous variables and predicted risk, and swarm plots (beeswarm plots) for individual sample-level interpretation.
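The quantity underlying all three plots is the Shapley value of each feature for a given prediction. The sketch below computes exact Shapley values by enumerating feature subsets for a three-feature toy risk score. It illustrates the definition only: the SHAP library uses faster model-specific algorithms, and the toy score, its coefficients, and the baseline and patient values are invented for illustration and are not estimates from this study.

```python
from itertools import combinations
from math import factorial

FEATURES = ["HTN", "AKI", "PO2"]
BASELINE = {"HTN": 0, "AKI": 0, "PO2": 100.0}   # hypothetical reference patient
PATIENT = {"HTN": 1, "AKI": 1, "PO2": 300.0}    # hypothetical patient to explain


def toy_risk_score(x):
    """Invented additive score with one HTN x AKI interaction term."""
    return (1.0 * x["HTN"] + 0.8 * x["AKI"] + 0.5 * x["HTN"] * x["AKI"]
            - 0.005 * (x["PO2"] - 100.0))


def coalition_value(present):
    """Score when features in `present` take the patient's values and the
    remaining features stay at the baseline."""
    x = {f: (PATIENT[f] if f in present else BASELINE[f]) for f in FEATURES}
    return toy_risk_score(x)


def shapley_values():
    """Exact Shapley values by enumerating all coalitions of other features."""
    n = len(FEATURES)
    phi = {}
    for feature in FEATURES:
        others = [f for f in FEATURES if f != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = coalition_value(set(subset) | {feature})
                without_feature = coalition_value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[feature] = total
    return phi


phi = shapley_values()
```

The contributions sum to the difference between the patient's score and the baseline score, which is the additivity property that waterfall and force plots rely on; the interaction term is shared equally between HTN and AKI.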

All analyses were performed using DecisionLinnc 1.0 software (Decision Medicine Inc.), which provides a visual statistical workflow interface [18]. Statistical significance was defined as p < 0.05.

3. Results

3.1 Baseline Characteristics

A total of 23,191 participants were enrolled from the MIMIC-IV dataset: 20,385 (87.9%) in the non-MDRO group and 2,806 (12.1%) in the MDRO group. Table 1 provides a detailed comparative analysis of baseline characteristics.

Patients with MDRO infection were slightly younger (64.57 ± 15.16 years vs. 65.45 ± 15.42 years, P < 0.001), while weight showed no significant difference (85.44 ± 26.76 kg vs. 83.81 ± 23.56 kg, P = 0.111).

Laboratory analysis revealed that the MDRO group had significantly lower hemoglobin (10.20 ± 2.27 g/dL vs. 10.51 ± 2.27 g/dL, P < 0.001) and sodium levels (137.98 ± 6.12 mmol/L vs. 138.35 ± 5.41 mmol/L, P < 0.001). Conversely, this group exhibited higher platelet counts (209.29 ± 126.39 ×10³/µL vs. 196.06 ± 108.63 ×10³/µL, P < 0.001), RDW (15.86 ± 2.63% vs. 15.05 ± 2.38%, P < 0.001), glucose (153.72 ± 82.36 mg/dL vs. 149.93 ± 81.47 mg/dL, P = 0.002), lactate (2.50 ± 2.10 mmol/L vs. 2.44 ± 1.93 mmol/L, P = 0.044), anion gap (15.01 ± 4.69 mmol/L vs. 14.52 ± 4.75 mmol/L, P < 0.001), creatinine (1.71 ± 1.61 mg/dL vs. 1.50 ± 1.60 mg/dL, P < 0.001), and BUN (32.72 ± 26.37 mg/dL vs. 27.69 ± 23.15 mg/dL, P < 0.001). Partial pressure of oxygen (PaO₂) was significantly lower in the MDRO group (130.07 ± 99.00 mmHg vs. 167.26 ± 123.93 mmHg, P < 0.001).

Disease severity scores were significantly higher in the MDRO infection group: SOFA (6.76 ± 3.93 vs. 5.99 ± 3.46, P < 0.001) and APS III (57.25 ± 23.13 vs. 49.58 ± 21.99, P < 0.001). Comorbidity analysis showed significantly higher prevalence of AKI (55.38% vs. 39.93%, P < 0.001), CKD (21.28% vs. 18.01%, P < 0.001), and T2DM (33.54% vs. 28.60%, P < 0.001) among MDRO patients, while hypertension prevalence was lower (37.28% vs. 42.05%, P < 0.001). The MDRO group also exhibited higher utilization of mechanical ventilation (63.26% vs. 51.90%, P < 0.001).

Clinical outcomes demonstrated significantly higher 28-day mortality in the MDRO infection group (22.77% vs. 18.14%, P < 0.001).

Characteristics of MDRO-Sepsis and non-MDRO-Sepsis patients in the MIMIC-IV database. Continuous variables are expressed as mean ± SD, and categorical variables are expressed as n(%).

Table 1
Variable	Overall (N = 23,191)	Non-MDRO-Sepsis (N = 20,385)	MDRO-Sepsis (N = 2,806)	p-value
Age (years)	65.34 ± 15.39	65.45 ± 15.42	64.57 ± 15.16	< 0.001
Weight (kg)	84.01 ± 23.97	83.81 ± 23.56	85.44 ± 26.76	0.111
Hemoglobin (g/dL)	10.48 ± 2.27	10.51 ± 2.27	10.20 ± 2.27	< 0.001
Platelet (K/µL)	197.66 ± 111.01	196.06 ± 108.63	209.29 ± 126.39	< 0.001
RDW (%)	15.15 ± 2.43	15.05 ± 2.38	15.86 ± 2.63	< 0.001
RBC (M/µL)	3.50 ± 0.79	3.51 ± 0.78	3.43 ± 0.80	< 0.001
WBC (K/µL)	13.47 ± 10.64	13.35 ± 9.86	14.33 ± 15.11	0.102
Anion gap (mEq/L)	14.58 ± 4.75	14.52 ± 4.75	15.01 ± 4.69	< 0.001
Glucose (mg/dL)	150.38 ± 81.59	149.93 ± 81.47	153.72 ± 82.36	0.002
Potassium (mEq/L)	4.24 ± 0.77	4.24 ± 0.76	4.25 ± 0.82	0.733
Sodium (mEq/L)	138.30 ± 5.50	138.35 ± 5.41	137.98 ± 6.12	< 0.001
Lactate (mmol/L)	2.45 ± 1.95	2.44 ± 1.93	2.50 ± 2.10	0.044
Pco2 (mmHg)	42.46 ± 11.36	42.39 ± 11.14	42.98 ± 12.86	0.735
PO2 (mmHg)	162.76 ± 121.79	167.26 ± 123.93	130.07 ± 99.00	< 0.001
Creatinine (mg/dL)	1.53 ± 1.60	1.50 ± 1.60	1.71 ± 1.61	< 0.001
Urea nitrogen (mg/dL)	28.30 ± 23.62	27.69 ± 23.15	32.72 ± 26.37	< 0.001
SOFA (score)	6.08 ± 3.53	5.99 ± 3.46	6.76 ± 3.93	< 0.001
APSIII (score)	50.51 ± 22.27	49.58 ± 21.99	57.25 ± 23.13	< 0.001
ICU survival time (days)	46.64 ± 205.73	47.52 ± 211.59	40.31 ± 156.57	0.077
Gender				< 0.001
Female	9,391 (40.49%)	8,134 (39.90%)	1,257 (44.80%)
Male	13,800 (59.51%)	12,251 (60.10%)	1,549 (55.20%)
Ventilation				< 0.001
No	10,837 (46.73%)	9,806 (48.10%)	1,031 (36.74%)
Yes	12,354 (53.27%)	10,579 (51.90%)	1,775 (63.26%)
HTN				< 0.001
No	13,573 (58.53%)	11,813 (57.95%)	1,760 (62.72%)
Yes	9,618 (41.47%)	8,572 (42.05%)	1,046 (37.28%)
AKI				< 0.001
No	13,497 (58.20%)	12,245 (60.07%)	1,252 (44.62%)
Yes	9,694 (41.80%)	8,140 (39.93%)	1,554 (55.38%)
CKD				< 0.001
No	18,923 (81.60%)	16,714 (81.99%)	2,209 (78.72%)
Yes	4,268 (18.40%)	3,671 (18.01%)	597 (21.28%)
T2DM				< 0.001
No	16,419 (70.80%)	14,554 (71.40%)	1,865 (66.46%)
Yes	6,772 (29.20%)	5,831 (28.60%)	941 (33.54%)
28-day mortality				< 0.001
No	18,855 (81.30%)	16,688 (81.86%)	2,167 (77.23%)
Yes	4,336 (18.70%)	3,697 (18.14%)	639 (22.77%)

3.2 Feature Selection

From the initial pool of 40 clinical features, we performed a two-stage feature selection using LASSO regression followed by the Boruta algorithm. Figure 2A presents the LASSO coefficient shrinkage path, demonstrating how feature coefficients converged to zero with increasing regularization strength (λ). This process identified 24 features for further consideration. Subsequent Boruta analysis, illustrated in Fig. 2B, confirmed 20 features with importance scores significantly exceeding those of permuted shadow features. The intersection of features identified by both methods yielded nine final predictors: age, platelet count, RDW, glucose, lactate, PaO₂, APS III, HTN, and AKI.

Fig. 2

(A) LASSO path of selected features. (B) Boruta feature importance plot.
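The shrinkage behaviour shown in the LASSO path can be reproduced on synthetic data with a short coordinate-descent routine. The sketch below is a simplification under stated assumptions: it uses a squared-error loss on standardized features for brevity, whereas a binary outcome such as MDRO infection would more typically use an L1-penalized logistic model, and the data, λ grid, and function names are invented for illustration.

```python
import numpy as np


def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma, returning zero when |z| <= gamma."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes the columns of X are standardized and y is centred.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
    return beta


# Synthetic data: only the first two of six features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=500)
y = y - y.mean()

lambdas = [0.01, 0.1, 0.5, 1.0, 3.0]
path = {lam: lasso_coordinate_descent(X, y, lam) for lam in lambdas}
selected = {lam: np.flatnonzero(path[lam]).tolist() for lam in lambdas}
```

As λ grows, coefficients of uninformative features reach exactly zero first, and at a sufficiently large λ every coefficient is zero. Features that remain non-zero at the chosen λ are the ones passed on to the second selection stage.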

3.3 Model Performance Comparison

In our study, six ML models were developed to assess the risk of MDRO infection in ICU sepsis patients (Table 2). The LGBM model demonstrated the best performance on the test set: AUC = 0.964, accuracy = 0.904, F1-score = 0.925, MCC = 0.79.

The CatBoost and KNNC models also performed well (AUC = 0.930 and 0.914, respectively). The GBDT, MLP, and RF models showed relatively weaker performance (AUC = 0.843, 0.867, and 0.831, respectively). The ROC curve (Fig. 3A) visually illustrates the discriminative power of each model, with the LGBM curve being closest to the top-left corner. Decision curve analysis (DCA, Fig. 3B) revealed that across a wide range of risk thresholds (particularly 0.2–0.6), the clinical net benefits provided by the LGBM and CatBoost models were significantly higher than those of other models and the "all-intervention" or "no-intervention" strategies. The calibration curve (Fig. 3C) indicated good agreement between the predicted probabilities and actual risks for all models, with Brier scores < 0.1 for all models, demonstrating excellent calibration.

A comparative evaluation of performance metrics among the six models for internal validation.

Table 2
Model Name	Accuracy	Prevalence	Recall	F1-Score	MCC	AUROC
LGBMTEST	0.90	0.65	0.91	0.93	0.79	0.96
RFTEST	0.75	0.65	0.86	0.82	0.43	0.83
CatBoostTEST	0.85	0.65	0.88	0.88	0.67	0.93
GBDTTEST	0.84	0.65	0.89	0.88	0.64	0.84
MLPTEST	0.83	0.65	0.94	0.88	0.61	0.87
KNNCTEST	0.86	0.65	0.98	0.90	0.69	0.91
mean_scores	0.84	0.65	0.91	0.88	0.64	0.89

Fig. 3

Construction and diagnostic performance evaluation of the machine learning models. (A) ROC curve; (B) DCA curve; (C) Calibration plot.
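The quantities behind panels (B) and (C) follow standard definitions: at a risk threshold p_t, net benefit is TP/n − (FP/n) × p_t/(1 − p_t), and the Brier score is the mean squared difference between predicted probability and observed outcome. The sketch below applies these definitions to invented toy predictions. The function names are illustrative, and this is not the DecisionLinnc implementation.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the 'all-intervention' reference strategy."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


def brier_score(y_true, y_prob):
    """Mean squared error of the predicted probabilities (lower is better)."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy labels and predicted probabilities (not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

curve = {t: net_benefit(y_true, y_prob, t) for t in (0.2, 0.3, 0.4, 0.5, 0.6)}
nb_model = net_benefit(y_true, y_prob, 0.5)
nb_all = net_benefit_treat_all(y_true, 0.5)
brier = brier_score(y_true, y_prob)
```

The 'no-intervention' strategy has a net benefit of zero at every threshold, so a model is clinically useful at a given threshold only when its net benefit exceeds both zero and the treat-all value.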

3.4. Model Interpretation (SHAP Analysis)

The global feature importance ranking (Fig. 4A) shows that HTN and AKI are the top two risk drivers (mean absolute SHAP value > 1.2), followed by RDW and APS III; the influence of platelet count, glucose, and age is relatively minor. The SHAP beeswarm plot (Fig. 4B) illustrates the direction (positive/negative) and magnitude of each feature's contribution to the model output. HTN = 1 (presence of hypertension) and AKI = 1 (presence of acute kidney injury) significantly increase MDRO infection risk, with SHAP values concentrated in the positive range and of higher magnitude. High PO2 values significantly reduce risk, with SHAP values concentrated in the negative range and of lower magnitude. High RDW, lactic acid, and APS III values tend to increase risk.

Fig. 4

(A) SHAP variable importance ranking of the LGBMTEST model. (B) SHAP variable swarm plot.

Figure 5 displays the SHAP dependence plots for the top six variables in the LGBMTEST model: HTN, AKI, lactic acid, RDW, APS III, and PO2. When HTN and AKI are positive (value = 1), the SHAP values are predominantly distributed in the positive range (0–4), indicating that patients with these conditions have an infection risk increased by more than 1.8 times (Figs. 5A, 5B). When lactate exceeds 4 mmol/L (Fig. 5C), the SHAP value surpasses +2.0, marking the critical risk threshold for tissue hypoxia. When RDW exceeds 15% (Fig. 5D), the SHAP value rises above +3.0, making it the strongest independent risk factor and suggesting that this metric may serve as a biomarker for immune dysregulation. Oxygenation status exhibits a nonlinear protective effect: when PO2 exceeds 300 mmHg, SHAP ≈ −2.5 (Fig. 5F).

Fig. 5

The SHAP dependency plots for the top six variables in the LGBMTEST model are displayed, including HTN, AKI, lactic acid, RDW, APSIII, and PO2.

The waterfall plot (Fig. 6A) shows a markedly reduced predicted risk for an individual patient (total SHAP contribution = −5.17), with the key drivers being the absence of hypertension (HTN = 0, SHAP = −0.956) and the absence of acute kidney injury (AKI = 0, SHAP = −0.734). The force plot (Fig. 6B) further confirms that all features contribute negatively.

Fig. 6

(A) SHAP waterfall plot (B) SHAP force plot

4. Discussion

This study developed an interpretable predictive model for assessing the risk of MDRO infection in sepsis patients, employing SHAP values to unravel the model's complexity and identify key predictive factors. The results highlight the significance of HTN, AKI, lactic acid, RDW, APS III, and PO2 in ICU sepsis patients, underscoring their close association with MDRO infection.

4.1 Discussion on Pathological Mechanisms

MDRO infection in sepsis patients poses a significant clinical challenge, as these infections are often resistant to multiple antibiotics and are associated with prolonged ICU stays and increased mortality, directly impacting treatment outcomes and patient prognosis [19]. Early identification of high-risk patients enables more targeted interventions, potentially improving recovery rates and alleviating the burden on healthcare resources. Given the rising prevalence of antibiotic-resistant infections in critical care settings, predictive models are crucial for optimizing clinical decision-making and enhancing patient management [20].

SHAP dependence analysis identified the combination of HTN and AKI as a key factor in the risk of MDRO sepsis. HTN [21] is one of the core risk factors for stroke, ischemic heart disease, hypertensive heart disease, aortic aneurysm, and various other conditions. Although traditional pathogenic factors (involving the neuroendocrine, renal, and vascular systems) still play a dominant role in the development and progression of HTN, increasing evidence suggests that the immune system plays a non-negligible role in its pathogenesis and maintenance [22]. Immune cells (e.g., macrophages, CD4+ T cells, and CD8+ T cells), interleukins (e.g., IL-6, IL-17A, IL-1β), and interferons (e.g., interferon-γ) have been found to mediate hypertension and target organ damage [23]. Relevant studies indicate that hypertension may lead to immune homeostasis imbalance, vascular endothelial barrier disruption, and microbiota-drug metabolic disorder, which are associated with an increased risk of MDRO sepsis [24], highlighting the potential link between altered immune status and susceptibility to MDRO infection.

Experimental models and clinical studies have demonstrated that inflammation-related molecules play a significant role in the development of sepsis-associated AKI. Gomez et al. [25] proposed that several sepsis-induced mechanisms collectively contribute to acute kidney injury, including inflammation and oxidative stress, microcirculatory dysfunction, tubular cell adaptation, and tubuloglomerular feedback dysfunction. Other research [26] suggests that the pathophysiology of sepsis-associated acute kidney injury involves alternating pro-inflammatory and anti-inflammatory changes, leading to microcirculatory alterations, endothelial dysfunction, thrombosis, cytokine release, and inflammatory response. AKI is often associated with altered immune status, treatment complexity, and potential polypharmacy, which may increase the risk of MDRO colonization and infection [27].

However, no studies have yet explored the relationship between comorbid hypertension and AKI and the etiology of MDRO sepsis. We hypothesize that comorbid hypertension and AKI may reflect an immune homeostasis imbalance, rendering patients more susceptible to MDRO infection. Although this hypothesis is compelling, further investigation is necessary to establish a definitive link.

Low PO2 is a marker of sepsis severity and organ dysfunction and is associated with poor prognosis. Increasing blood oxygen concentration and arterial oxygen partial pressure can reduce postoperative wound infection rates and mortality [28]. This study further indicates that low PO2 is an independent risk factor for MDRO infection, while high PO2 is a protective factor. This may reflect the association between oxygenation status and host defense capability or infection severity [29].

RDW reflects red blood cell volume variability and is significantly elevated in chronic inflammation, oxidative stress, and tissue hypoxia. It has been shown to independently predict mortality in patients with cardiovascular diseases [30], respiratory diseases [31], and cancer [32]. Previous studies [33] indicate that RDW is a useful predictor of mortality in sepsis patients. However, whether RDW can predict the risk of MDRO infection in sepsis patients has remained unclear. To our knowledge, this study is the first to show that high RDW is a strong independent predictor of MDRO infection, a finding that warrants further investigation.

Lactic acid participates in the regulation of numerous biological and pathological processes. Hypoxia, inflammation, viral infection, and tumor microenvironments are known to stimulate lactic acid production [34]. Serum lactic acid is an important biomarker for sepsis, correlating positively with the incidence and mortality of sepsis and septic shock; high levels indicate tissue hypoperfusion and shock. APS III is a comprehensive scoring system for assessing disease severity. Elevated lactic acid and APS III values align with increased MDRO infection risk, highlighting the importance of underlying disease severity in prediction.

4.2 Comparison with Existing Studies and Innovations

For the early prediction of MDRO infection in sepsis patients, this study established six machine learning models (LGBM, RF, CatBoost, GBDT, MLP, KNNC) based on the MIMIC-IV database and achieved model transparency through the SHAP method. In contrast to traditional scoring systems (such as SOFA or APACHE II), the ML models capture nonlinear relationships, and SHAP opens their "black box" by providing both global and local interpretations, enabling clinicians to understand the prediction logic intuitively. DCA confirmed that the LGBM and CatBoost models demonstrated significant clinical net benefits within the risk threshold range of 20%–60% (Fig. 3B), supporting their practical value in ICU settings.

The integration of the SHAP method enhanced interpretability, clarified the model's decision-making process, and provided a transparent and credible basis for assessing MDRO infection risk in sepsis patients. Similarly, another study [35] utilized SHAP values to evaluate the importance of features in their machine learning model for sepsis prediction, demonstrating that SHAP not only improved model interpretability but also strengthened clinicians' trust in the model's predictions. These findings underscore the value of SHAP in promoting the practical application of machine learning models in clinical settings, particularly for complex infections such as MDRO in sepsis patients.

4.3 Advantages and Limitations

This study utilized detailed clinical data from the MIMIC-IV database, employing database management and statistical software to streamline data extraction and processing, thereby reducing the workload associated with clinical data collection. Using six machine learning algorithms, a predictive model was constructed to assess the risk of MDRO infection in ICU sepsis patients, an area with limited prior research. By incorporating the SHAP method, our study provided comprehensive global and local interpretations of the machine learning model and its internal mechanisms. This work enhances interpretability, promotes a better understanding of the model's decision-making process, and establishes a transparent and credible foundation for evaluating the risk of MDRO infection in sepsis patients.

However, certain limitations should be acknowledged. First, this was a single-center retrospective study, which may introduce hospital-specific biases. Healthcare practices, including ventilator settings, laboratory testing frequency, and infection control measures, may vary across hospitals, potentially affecting the results. Future multicenter studies are needed to account for these inter-hospital differences and yield more generalizable findings. Second, the outcome definition relied on microbial cultures, which may have missed culture-negative but clinically highly suspected MDRO infection cases (e.g., those already receiving broad-spectrum antibiotic therapy). The exclusion of antibiotic exposure history and invasive procedure data may also limit the model's generalizability. ICU stays shorter than 24 hours were excluded, potentially omitting sepsis patients who died or were discharged very early. Third, owing to a large number of missing values, certain variables in the MIMIC-IV database were excluded, and the feature selection process may have omitted variables with significant impacts, potentially affecting the model's predictive performance. Finally, the use of SMOTE to address class imbalance may introduce biases related to synthetic data generation. Although SMOTE helps balance datasets, the generated synthetic samples may not fully reflect the complexity of real clinical scenarios. Subsequent research will include multicenter external validation and incorporate additional factors, particularly antibiotic exposure, to further verify model stability and performance.

5. Summary

Six machine learning models (LGBM, RF, CatBoost, GBDT, MLP, KNNC) were constructed and their predictive performances were compared, revealing LGBM as the optimal model in terms of accuracy, discriminative ability, and identification of high-risk patients. The application of the SHAP method elucidated the key contributors to this model, including HTN, AKI, PO2, RDW, and APS III, which were closely associated with MDRO infection in ICU sepsis patients. These indicators not only played pivotal roles in the model but also demonstrated significant diagnostic and predictive value in clinical practice. By considering these indicators together, the model facilitates the accurate identification of high-risk patients, providing a reliable clinical tool for early detection of and intervention in infection risks. Furthermore, upon prospective validation in future studies, this model is expected to serve as a robust support tool for individualized medical decision-making (such as enhanced monitoring, targeted prevention, or early tailored treatment for high-risk patients) and for improving patient care.

Data Availability

The data are available from the corresponding author upon reasonable request.

Funding

This work was supported by grants from the National Natural Science Foundation of China (NSFC), project number 82174312.

Abbreviations

RDW: Red Cell Distribution Width

RBC: Red Blood Cell

WBC: White Blood Cell

PCO2: Partial Pressure of Carbon Dioxide

PO2: Partial Pressure of Oxygen

SOFA: Sequential Organ Failure Assessment

APSIII: Acute Physiology Score III

HTN: Hypertension

AKI: Acute Kidney Injury

CKD: Chronic Kidney Disease

T2DM: Type 2 Diabetes Mellitus

Author Contributions

Qianqian Zhang: Methodology, Investigation, Data curation. Nianzhi Zhang: Formal analysis, Data curation. Ying Zheng: Methodology, Formal analysis, Investigation. Jing Zhou: Writing – review & editing, Supervision, Conceptualization, Funding acquisition. Ling Liu: Writing – original draft, Conceptualization, Formal analysis.

References

1. Piedmont, S. et al. Sepsis incidence, suspicion, prediction and mortality in emergency medical services: a cohort study related to the current international sepsis guideline. Infection 52, 1325–1335 (2024).

2. Rhodes, A. et al. Surviving Sepsis Campaign: International Guidelines for Management of Sepsis and Septic Shock: 2016. Intensive Care Med. 43, 304–377 (2017).

3. Xie, J. et al. The Epidemiology of Sepsis in Chinese ICUs: A National Cross-Sectional Survey. Crit. Care Med. 48, e209–e218 (2020).

4. Yangjia et al. Consensus among experts in traditional Chinese and Western medicine diagnosis and treatment of severe multidrug-resistant bacterial infections. J. Emerg. Traditional Chin. Med. 32, 565–570 (2023).

5. MBBS, N. M., MBBS, P. A. & MD, A.-C. U. & Y. P. Multidrug-resistant Gram-negative bacterial infections. Lancet 405, 257–272 (2025).

6. Liqin et al. Overview of Resistance Mechanisms and Treatment of Multidrug Resistant Bacteria. Chin. J. New. Clin. Med. 15, 921–927 (2022).

7. System, C. A. R. S. Report on Multi Drug Resistant Bacteria Monitoring in Traditional Chinese Medicine Hospitals from 2018 to 2021 by the National Bacterial Resistance Monitoring Network. Chin. J. Infect. Control. 22, 1148–1158 (2023).

8. Busani, S. et al. Mortality in Patients With Septic Shock by Multidrug Resistant Bacteria: Risk Factors and Impact of Sepsis Treatments. J. Intensive Care Med. 34, 48–54 (2019).

9. Al-Sunaidar, K. A., Aziz, N. A., Hassan, Y., Jamshed, S. & Sekar, M. Association of Multidrug Resistance Bacteria and Clinical Outcomes of Adult Patients with Sepsis in the Intensive Care Unit. Trop. Med. Infect. Disease. 7, 365 (2022).

10. Choi, R. Y., Coyner, A. S., Kalpathy-Cramer, J., Chiang, M. F. & Campbell, J. P. Introduction to Machine Learning, Neural Networks, and Deep Learning. Transl Vis. Sci. Technol. 9, 14 (2020).

11. Dong, J. et al. Machine learning model for early prediction of acute kidney injury (AKI) in pediatric critical care. Crit. Care. 25, 288 (2021).

12. Guan, C. et al. Interpretable machine learning model for new-onset atrial fibrillation prediction in critically ill patients: a multi-center study. Crit. Care. 28, 349 (2024).

13. Heo, J. et al. Machine Learning-Based Model for Prediction of Outcomes in Acute Stroke. Stroke 50, 1263–1265 (2019).

14. Saliba, J. G. et al. Enhanced diagnosis of multi-drug-resistant microbes using group association modeling and machine learning. Nature Communications 16, (2025).

15. Chiu, L. W. et al. Machine learning algorithms to predict colistin-induced nephrotoxicity from electronic health records in patients with multidrug-resistant Gram-negative infection. Int. J. Antimicrob. Agents. 64, 107175 (2024).

16. Lundberg, S. M. et al. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat. Mach. Intell. 2, 56–67 (2020).

17. Petch, J., Di, S. & Nelson, W. Opening the Black Box: The Promise and Limitations of Explainable Machine Learning in Cardiology. Can. J. Cardiol. 38, 204–213 (2022).

18. DecisionLinnc Core Team. DecisionLinnc is a platform that integrates multiple programming language environments and enables data processing, data analysis, and machine learning through a visual interface (Statsape Co.Ltd, 2023).

19. Oxman, D. et al. Incidence of Multidrug Resistant Infections in Emergency Department Patients with Suspected Sepsis. Am. J. Med. Sci. 360, 650–655 (2020).

20. González-Anleo, C. et al. Risk factors for multidrug-resistant bacteria in critically ill children and MDR score development. Eur. J. Pediatrics. 183, 5255–5265 (2024).

21. Diseases, N. C. F. C. & China, T. W. C. O. T. R. O. C. H. A. D. I. Summary of China Cardiovascular Health and Disease Report 2024. Chin. Circulation J. 40, 521–559 (2025).

22. Murray, E. C. et al. Therapeutic targeting of inflammation in hypertension: from novel mechanisms to translational perspective. Cardiovasc. Res. 117, 2589–2609 (2021).

23. Douna, H. et al. B- and T-lymphocyte attenuator stimulation protects against atherosclerosis by regulating follicular B cells. Cardiovasc. Res. 116, 295–305 (2020).

24. Guzik, T. J., Nosalski, R., Maffia, P. & Drummond, G. R. Immune and inflammatory mechanisms in hypertension. Nat. Reviews Cardiol. 21, 396–416 (2024).

25. Gomez, H. et al. A unified theory of sepsis-induced acute kidney injury: inflammation, microcirculatory dysfunction, bioenergetics, and the tubular cell adaptation to injury. Shock 41, 3–11 (2014).

26. Zarbock, A., Gomez, H. & Kellum, J. A. Sepsis-induced acute kidney injury revisited: pathophysiology, prevention and future therapies. Curr. Opin. Crit. Care. 20, 588–595 (2014).

27. Oweis, A. O., Zeyad, H. N., Alshelleh, S. A. & Alzoubi, K. H. Acute Kidney Injury Among Patients with Multi-Drug Resistant Infection: A Study from Jordan. J. Multidisciplinary Healthc. 15, 2759–2766 (2022).

28. Thille, A. W. et al. Effect of Postextubation High-Flow Nasal Oxygen With Noninvasive Ventilation vs High-Flow Nasal Oxygen Alone on Reintubation Among Patients at High Risk of Extubation Failure: A Randomized Clinical Trial. JAMA 322, 1465–1475 (2019).

29. Rosenberg, K. Lower Reintubation Risk with Noninvasive Ventilation Plus High-Flow Nasal Oxygen. Am. J. Nurs. 120, 50 (2020).

30. Li, N., Zhou, H. & Tang, Q. Red Blood Cell Distribution Width: A Novel Predictive Indicator for Cardiovascular and Cerebrovascular Diseases. Dis. Markers 2017, 7089493 (2017).

31. Yčas, J. W. Toward a Blood-Borne Biomarker of Chronic Hypoxemia: Red Cell Distribution Width and Respiratory Disease. Adv. Clin. Chem. 82, 105–197 (2017).

32. Lu, X. et al. Prognostic significance of increased preoperative red cell distribution width (RDW) and changes in RDW for colorectal cancer. Cancer Med. 12, 13361–13373 (2023).

33. Wu, H. et al. Diagnostic value of RDW for the prediction of mortality in adult sepsis patients: A systematic review and meta-analysis. Front Immunol 13, (2022).

34. Kvacskay, P. et al. Increase of aerobic glycolysis mediated by activated T helper cells drives synovial fibroblasts towards an inflammatory phenotype: new targets for therapy? Arthritis Res. Ther. 23, 56 (2021).

35. Yue et al. Machine learning for the prediction of acute kidney injury in patients with sepsis. Journal Translational Medicine 20, (2022).
