Background
Blood donor recruitment currently faces significant challenges, as declining donor willingness continues to place increasing pressure on recruitment efforts. Developing precise strategies to improve both efficiency and quality has therefore become essential. With the ongoing digitalization of blood services and the emergence of large-scale databases, machine learning (ML) solutions are increasingly being adopted in transfusion medicine1,2. ML techniques offer promising tools to address challenges across the transfusion supply chain—enhancing donor recruitment efficiency and optimizing blood resource management. Compared to traditional approaches, which randomly select a list of donors who meet the requirements for blood donation, ML models provide substantial advantages in processing large volumes of data and uncovering latent patterns. Notably, they can effectively capture complex nonlinear relationships without relying on linear assumptions, handle high-dimensional feature spaces, and reveal intricate interdependencies among variables. Furthermore, through incremental learning3, ML models can be continuously updated in real time, enabling rapid adaptation to new data. These capabilities confer greater efficiency, adaptability, and analytical power for identifying hidden trends in large-scale datasets. Although several ML-based approaches have been explored for blood and hematopoietic stem cell donor recruitment, most prior studies have focused primarily on evaluating model performance from a data modeling standpoint, without extending to real-world applications4,5.
Our previous work6 developed a machine learning–based blood donor recruitment model using data from Yangzhou, with accuracy as the primary evaluation metric. The results demonstrated that machine learning–based precise recruitment at a single center outperformed traditional methods. However, that study did not evaluate the generalizability of the recruitment framework. Moreover, because donor recruitment constitutes a highly imbalanced classification problem—where non-donors greatly outnumber successfully recruited donors—accuracy, which reflects the overall proportion of correctly predicted samples, may not be the most appropriate evaluation metric. In contrast, recall more effectively captures the proportion of true positive cases identified, making it a more suitable metric for this task. To better address this class imbalance, the present study adopts the principle of “recall as the primary metric, supported by other indicators” in model construction and performance evaluation. Specifically, models were initially trained on data from Nanjing and subsequently fine-tuned using 10% of the data from Yangzhou and Suzhou to assess cross-center transferability. Building on this foundation, optimized models were developed and applied to targeted SMS-based recruitment in all three cities—Nanjing, Suzhou, and Yangzhou—resulting in more effective and accurate donor recruitment outcomes.
Construction and Validation of Recruitment Models in Nanjing
Recruitment models, with recall as the primary evaluation metric, were initially trained and fine-tuned using data from Nanjing. Donation records from 2017 to 2021 were analyzed in conjunction with SMS recruitment data to capture historical donor behavior and characterize donation patterns—including donation count, frequency, total volume, intervals, and age—thus generating comprehensive donor profiles. SMS data from 2022 were subsequently used to assess donor willingness by linking response outcomes to these profiles. This enabled the identification of features distinguishing high- and low-willingness donors and the construction of a labeled dataset comprising effective donors (those who donated within seven days of receiving the SMS) and ineffective donors (those who did not).
The SMS recruitment data collected from Nanjing in 2022 were divided into a training set and a test set using a 70:30 split. Several machine learning models—including eXtreme Gradient Boosting (XGBoost)7, Support Vector Machine (SVM)8, K-Nearest Neighbors (KNN), Logistic Regression (LR), Decision Tree (DT), Random Forest (RF)9, and Multi-Layer Perceptron (MLP)10—were trained on the training set. To address class imbalance, sampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE)11 and under-sampling (US) were employed, in combination with cost-sensitive learning approaches, including MFE, MSFE, and weighted mean squared error loss functions12,13. Each model (XGBoost, SVM, KNN, LR, RF, DT, MLP) was evaluated under different sampling strategies (raw data, US, SMOTE) using a grid search over hyperparameters, with four performance metrics recorded: accuracy, precision, recall, and F1-score. Model selection was based primarily on the recall value, with the remaining metrics used for reference. All models were implemented in Python using the XGBoost, Scikit-learn14, and PyTorch15 libraries.
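As an illustrative sketch (not the authors' code), the following pure-Python snippet expresses the "recall as the primary metric, other metrics for reference" selection rule over hypothetical (model, sampling) results; the accuracy floor of 0·5 mirrors the rule stated later in the paper, and all metric values below are placeholders, not the study's results.

```python
# Hypothetical grid of (model, sampling strategy) -> evaluation metrics.
results = {
    ("MLP", "SMOTE"):   {"accuracy": 0.71, "precision": 0.12, "recall": 0.66, "f1": 0.20},
    ("RF", "US"):       {"accuracy": 0.68, "precision": 0.10, "recall": 0.70, "f1": 0.18},
    ("XGBoost", "raw"): {"accuracy": 0.95, "precision": 0.50, "recall": 0.05, "f1": 0.09},
}

def select_by_recall(results, min_accuracy=0.5):
    """Discard candidates below the accuracy floor, then rank by recall;
    ties are broken by F1 as a secondary reference."""
    viable = {k: v for k, v in results.items() if v["accuracy"] >= min_accuracy}
    return max(viable, key=lambda k: (viable[k]["recall"], viable[k]["f1"]))

best = select_by_recall(results)
print(best)  # the RF + under-sampling combination, which has the highest recall
```

Note that the high-accuracy, low-recall XGBoost row is deliberately passed over: in an imbalanced setting it would miss almost all willing donors.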
Research on Precise Recruitment in Nanjing, Suzhou and Yangzhou
Assuming that blood donation records and SMS recruitment data can be used to rank and predict donor willingness, the selected MLP and RF models were jointly employed to conduct prospective blood donor recruitment in Nanjing, Suzhou, and Yangzhou, China.
The procedure was as follows: (1) Candidates were drawn from SMS recruitment lists provided by the local blood centers, and their key characteristics were input into the trained MLP and RF models to generate donation probability scores reflecting predicted willingness; (2) Donors with scores above 0·5 in both models were selected, prioritizing those appearing in the intersection of MLP and RF predictions, followed by those identified by either model individually; (3) To ensure comparability, the number of selected donors matched that of conventional recruitment efforts. These high-willingness donors were recruited for a controlled study, and donation outcomes within seven days of SMS delivery were recorded. Based on these trained models, an incremental learning paradigm 3 was further applied to fine-tune and iteratively update the MLP model using the most recent donation data. In total, 13 prospective recruitment studies were conducted across the three cities, leveraging the updated MLP and RF models and comparing their performance against traditional recruitment methods.
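A minimal sketch of selection steps (1)–(3) above, assuming donor scores are held in plain dictionaries (the donor IDs and scores are hypothetical, and this is not the deployed implementation):

```python
def select_candidates(mlp_scores, rf_scores, n_target, threshold=0.5):
    """Pick donors scoring above the threshold in both models first,
    then donors flagged by either model alone, up to n_target donors."""
    both = [d for d in mlp_scores
            if mlp_scores[d] > threshold and rf_scores.get(d, 0) > threshold]
    either = [d for d in mlp_scores
              if d not in both
              and (mlp_scores[d] > threshold or rf_scores.get(d, 0) > threshold)]
    # Rank each tier by the higher of the two scores, most willing first.
    key = lambda d: max(mlp_scores[d], rf_scores.get(d, 0))
    ranked = sorted(both, key=key, reverse=True) + sorted(either, key=key, reverse=True)
    return ranked[:n_target]

mlp = {"d1": 0.9, "d2": 0.6, "d3": 0.4, "d4": 0.7}
rf  = {"d1": 0.8, "d2": 0.3, "d3": 0.6, "d4": 0.55}
print(select_candidates(mlp, rf, n_target=3))  # ['d1', 'd4', 'd2']
```

Here d1 and d4 clear the threshold in both models and so are taken before d2, which only the MLP flags; capping the list at `n_target` keeps the comparison with conventional recruitment fair.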
Research on prediction thresholds and on repetition ratios
To further enhance model performance, we conducted a threshold adjustment experiment in which the prediction thresholds of the MLP and RF models were systematically varied to generate different levels of overlap between the model-recommended donor list and the conventional recruitment list. For each threshold setting, the corresponding improvement in recruitment performance was evaluated, defined as the increase in successfully recruited high-willingness donors. This experiment assessed whether reducing the redundancy between model-based recommendations and conventional recruitment could improve the overall effectiveness of machine learning–driven donor recruitment.
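The repetition ratio in this experiment can be sketched as the share of the model-recommended list that also appears on the conventional list; the snippet below uses hypothetical donor IDs and scores purely to illustrate how raising the threshold changes the overlap.

```python
def repetition_ratio(model_list, conventional_list):
    """Fraction of model-recommended donors who also appear
    on the conventional recruitment list."""
    model_set, conv_set = set(model_list), set(conventional_list)
    return len(model_set & conv_set) / len(model_set) if model_set else 0.0

scores = {"d1": 0.9, "d2": 0.7, "d3": 0.55, "d4": 0.3}
conventional = ["d1", "d2"]

for threshold in (0.5, 0.8):
    recommended = [d for d, s in scores.items() if s > threshold]
    print(threshold, repetition_ratio(recommended, conventional))
# At 0.5, two of three recommended donors overlap (ratio ~0.67);
# at 0.8, the single recommended donor overlaps fully (ratio 1.0).
```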
Statistical Analysis
For the analysis of blood donor feature means, the total sample was divided into two groups: successful recruitment and failed recruitment. A z-test was performed to evaluate the mean differences for each feature. In the modeling phase, features with p-values less than 10⁻⁵ were selected for Nanjing and Suzhou, while a threshold of 10⁻² was applied for Yangzhou. IBM SPSS Statistics 27·0 was used to analyze recruitment data during the practical implementation of the model. For categorical variables, the chi-square test or Fisher’s exact test was employed, with a p-value of less than 0·05 considered statistically significant.
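The two-sample z-test used for feature screening can be sketched in a few lines of standard-library Python; the group means, variances, and sizes below are illustrative placeholders, not the study's data.

```python
import math

def two_sample_z(mean1, var1, n1, mean2, var2, n2):
    """z statistic and two-sided p-value for a difference in means,
    using the normal approximation appropriate for large samples."""
    z = (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p

# Hypothetical example: mean donation count of 2.9 among successful
# recruitments versus 1.8 among failed recruitments.
z, p = two_sample_z(2.9, 1.2, 500, 1.8, 1.1, 9500)
print(round(z, 2), p < 1e-5)
```

A feature like this one, with p well below the 10⁻⁵ threshold used for Nanjing and Suzhou, would be retained as a candidate variable.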
Results
Data Collected
In Nanjing, a total of 622,368 blood donation records from 2017 to 2022 were analyzed, comprising 63·61% male and 36·38% female donors. Repeat donors accounted for 50·37%, and first-time donors for 49·63%, with an SMS response rate of 3·15%. In 2022, 11,968 recruitment messages were sent, of which 11,616 (8,131 male and 3,485 female) were used for model training and testing. The average volume of the most recent donation was 368 mL; total donation volume averaged 684 mL; the mean number of donations was 1·84; the average interval since the last donation was 434 days; and the mean age of donors was 34 years. In Suzhou, 907,230 records from 2018 to 2023 were included, with 67·70% male and 32·30% female donors. Repeat donors comprised 63·36% of the sample. Between 2022 and 2023, 47,922 recruitment messages were sent, with 32,669 (19,567 male and 13,102 female) used for model development. The average last donation volume was 306 mL, total donation volume was 726·52 mL, the mean number of donations was 2·32, the average interval was 781·5 days, and the mean donor age was 36 years. In Yangzhou, 456,955 records spanning 1998 to 2024 were analyzed, with 57·75% male and 42·24% female donors. Between 2022 and 2023, 20,459 recruitment messages were sent, all of which were used for modeling (11,581 male and 8,878 female). The average last donation volume was 370 mL, total donation volume was 1,052 mL, the mean number of donations was 2·6, the average interval since the last donation was 560 days, and the mean donor age was 40 years. The recruitment and model evaluation procedures are illustrated in Fig. 1.
Model Development Results in Nanjing
Individuals who responded and did not respond to SMS recruitment in 2022 were treated as two distinct populations. A two-sample mean test (z-test) was performed to compare the distributions of donor features, and the resulting p-values were ranked from smallest to largest, as shown in Table
S1 of the Supplementary Material. Features identified through this analysis were considered candidate variables and selected based on recruitment objectives and model performance. For the final recruitment model in Nanjing, seven features were used: repeat donor status, total donation volume, number of donations, donation interval, donation frequency, age, and whether the donor had AB blood type.
The performance of various combinations of models and sampling methods on the test dataset is presented in Fig.
2, with detailed results provided in Table
S2 of the Supplementary Material. Based on these results, the integrated MLP and RF models were selected for practical deployment, following the principle of prioritizing recall as the primary evaluation metric, with other metrics serving as secondary references to determine the final recruitment outcomes.
Results of the Generalizability Study
The performance of the Nanjing model and its fine-tuned counterparts in Suzhou and Yangzhou—detailed in Table S3 of the Supplementary Material—provides empirical support for the strong generalizability of machine learning models in precision donor recruitment. Although both cities adopted Nanjing’s standardized 7-feature framework—differing from Suzhou’s original 8 features and Yangzhou’s 11 features—the models still achieved clinically acceptable recall rates (≥ 0·58), closely approaching those of their locally optimized baselines. Specifically, the RF model achieved a recall of 0·67 in Suzhou (compared to 0·77 with local optimization) and 0·63 in Yangzhou (versus 0·64 locally), while the MLP model attained recall scores of 0·63 and 0·58, respectively. Notably, the observed improvements in SMS recruitment efficiency following minimal fine-tuning with only 10% of local data further confirm that cross-center model transfer requires limited adaptation to achieve near-native performance.
Performance of the Models on Precise Recruitment in Nanjing, Suzhou and Yangzhou
The recruitment classification methods used for Nanjing, Suzhou, and Yangzhou are detailed in Tables S1, S4, and S5 of the Supplementary Material. Candidate features were initially selected based on recruitment objectives and model training performance. As previously described, the final recruitment model for Nanjing was constructed using seven features. For Suzhou, the final model incorporated eight features: last donation volume, repeat donor status, total donation volume, number of donations, donation frequency, age, donation interval, and whether the individual was classified as a worker. For Yangzhou, the final model included eleven features: number of donations, repeat donor status, last donation volume, donation interval, age, total donation volume, blood type A, blood type O, occupation (student or worker), and educational level (middle school and below).
The performance of various combinations of models and sampling methods across the three blood centers is illustrated in Fig.
2, with detailed results provided in Tables S2, S6, S7 of the Supplementary Material. Based on these results, we selected an integrated approach combining MLP and RF models for practical deployment, following the principle of prioritizing recall as the primary evaluation metric, with other metrics used as supplementary references to determine the final recruitment outcomes.
On average, the recruitment success rate for high-willingness donors was 408·28%, 25·19%, and 47·31% higher than that of conventional methods in Nanjing (χ² = 35·471, p < 0·001), Suzhou (χ² = 9·384, p = 0·009), and Yangzhou (χ² = 96·701, p < 0·001), respectively, as shown in Table 1. When SMS messages were sent exclusively to donors recommended by the model, recruitment improvements remained statistically significant in Suzhou (χ² = 34·499, p < 0·001) and Yangzhou (χ² = 136·110, p < 0·001), but not in Nanjing (χ² = 0·451, p = 0·536). Moreover, conventional methods required 18·71%, 121·52%, and 138·75% more SMS messages than the model-based approach in Nanjing, Suzhou, and Yangzhou, respectively, while recruitment efficiency per SMS increased by 9·54%, 53·61%, and 38·98% compared with conventional methods (Table 2).
Relation between the prediction thresholds and the repetition ratios
The experiments involved systematically adjusting the prediction thresholds of the MLP and RF models to produce varying repetition ratios, with corresponding changes in model performance evaluated, as shown in Fig. 3. The results reveal a clear inverse relationship between repetition ratio and performance improvement across Nanjing, Suzhou, and Yangzhou. Specifically, as the overlap between the model-recommended donor list and the conventional recruitment list decreases, recruitment performance gains become more pronounced. These findings highlight the potential of reducing redundancy in donor selection to enhance the effectiveness of machine learning–based recruitment strategies.
Discussion
Blood donor recruitment in China currently faces several critical challenges: (1) a nationwide decline in donations—voluntary blood donations (15·821 million) and total blood volume collected (26·924 million units) decreased by 6·9% compared to 2023; (2) few institutions seeing growth in blood collection—only 4 out of 395 prefecture-level or higher blood collection institutions reported positive growth in 2024; and (3) an aging donor population—with a noticeable decline in younger donors, as the proportion of donors under age 35 dropped from 65% in 2019 to 52% in 2024. These developments raise serious concerns about the sustainability and security of the national blood supply, underscoring the urgent need for targeted recruitment strategies, data-driven interventions, and improved engagement of younger populations to ensure a safe, stable, and clinically sufficient blood supply.
In recent years, data-driven statistical and artificial intelligence (AI) methods have become increasingly important for predicting donor willingness and forecasting future blood supply needs. Hanapi WHWH et al. employed logistic regression to identify key factors influencing donation intention, including donation interval, frequency, and total volume 16. Salazar-Concha C et al. applied decision trees, achieving 84·17% accuracy in predicting donor re-engagement 17. Marade C et al. utilized SVM, decision trees (DT), and KNN to forecast monthly blood donations, aiding inventory management 18. Ding X et al. proposed a hybrid SARIMAX-LSTM neural network for accurate donation estimation 19, while El-Rashidy N et al. and Selvaraj P et al. reported that random forest models yielded the best performance in predicting donation and re-donation behaviors 20,21. Other studies have leveraged SVM for donor classification and donation outcome prediction 22,23, and Zulfikar W.B. et al. found decision trees to outperform naive Bayes classifiers 24. Cloutier M. focused on predicting return rates among young donors 25. Several reviews—including those by Kiarie et al. 26, Li Y et al. 27, Buturovic L et al. 28, Sivasankaran A et al. 29, and Gupta V et al. 30—have highlighted the growing application of machine learning in donor retention, recruitment optimization, post-transplant outcomes, and hematopoietic stem cell transplantation (HSCT), while also acknowledging current limitations and challenges. Collectively, these studies underscore the value of AI and big data analytics in enhancing blood donor recruitment and ensuring a safe, sufficient, and sustainable clinical blood supply.
This study demonstrates the generalizability and practical effectiveness of our machine learning–based recruitment framework by adopting recall as the primary evaluation metric, in contrast with previous studies that largely evaluated performance from a data modeling perspective. In earlier work, we developed seven recruitment models using XGBoost and SVM on a dataset comprising 697,174 blood donation records (October 1998–2019) and 95,476 SMS recruitment records (April 2016–July 2019) from Yangzhou, incorporating 13 donor features. This laid the groundwork for the Nanjing recruitment model. In the present study, we extend this foundation into a prospective setting, employing multi-layer perceptron (MLP) and random forest (RF) models, with recall as the central metric. While MLP, as a deep learning algorithm, offers strong representational capacity, its performance is sensitive to data scale and resource demands. To address this, we complemented MLP with RF, a more lightweight and robust traditional machine learning method. Given the limited size of available blood donation data, we adopted an incremental learning paradigm: following each recruitment round, newly collected data were used to fine-tune the MLP model, improving its performance before deployment in subsequent rounds. As illustrated by Table S8 (Supplementary Material), we began with an initial dataset D0. After conducting recruitments at several universities in Nanjing, China in our previous work, we obtained up-to-date blood donation data, denoted as D1, yielding an augmented dataset D0∪D1 (denoted as D01). We randomly split D01 into training and test sets in a 7:3 ratio, and retained a new model trained on this training set only if it achieved a higher recall than the old model on the test set. This new model was applied in the 1st recruitment of this work.
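The accept-only-if-recall-improves update rule can be sketched as follows; the "models" here are stand-in callables rather than the actual MLP, the data are toy (feature, label) pairs, and for brevity the 7:3 split is deterministic rather than random.

```python
def recall_on(model, test_set):
    """Recall of a binary classifier on (feature, label) pairs."""
    tp = sum(1 for x, y in test_set if y == 1 and model(x) == 1)
    fn = sum(1 for x, y in test_set if y == 1 and model(x) == 0)
    return tp / (tp + fn) if tp + fn else 0.0

def incremental_update(deployed, candidate, augmented_data):
    """Replace the deployed model only if the candidate, trained on the
    augmented dataset, scores higher recall on the held-out 30% split."""
    split = int(0.7 * len(augmented_data))
    test = augmented_data[split:]
    return candidate if recall_on(candidate, test) > recall_on(deployed, test) else deployed

# Toy models: the old one flags donors with >1 prior donation, the new one >0.
old = lambda x: 1 if x > 1 else 0
new = lambda x: 1 if x > 0 else 0
data = [(0, 0), (1, 1), (2, 1), (3, 1), (0, 0), (1, 1), (2, 1), (0, 0), (1, 1), (2, 1)]
chosen = incremental_update(old, new, data)
print(chosen is new)  # True: the candidate has higher recall on the test split
```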
After the 1st recruitment, we obtained new data D2 and used D01∪D2 (denoted as D11) to fine-tune the MLP via the same procedure. As shown in Tables S9 and S10, the updated models consistently outperformed their predecessors in identifying high-willingness donors and improving recruitment efficiency (measured by success rate per SMS). Some exceptions were observed during simultaneously conducted recruitments (e.g., the 1st and 2nd, 4th and 5th, and 6th and 7th rounds), where a single model could not optimize performance across both settings. Nonetheless, the overall trend confirmed the effectiveness of the incremental learning paradigm. Detailed fine-tuning results are provided in Tables S11–S14. Furthermore, models initially trained on Nanjing data were successfully fine-tuned with data from Suzhou and Yangzhou, validating the cross-center generalizability of our approach. Across 13 recruitment rounds, the machine learning framework improved the recruitment success rate for high-willingness donors by an average of 151·57% compared with conventional methods, reduced SMS volume by 96·52%, and increased recruitment efficiency per SMS by 34·42%. These results highlight the framework’s ability to identify high-potential donors, reduce unnecessary messaging, and lower overall recruitment costs. Analysis of the Venn diagrams in Figures
S1–S8 revealed minimal overlap between donors identified by both MLP and RF and those recruited via conventional methods—likely due to high redundancy in traditional approaches. By adjusting prediction thresholds to vary repetition ratios, we observed that lower overlap corresponded to greater performance improvements, as shown in Fig.
3. Model hyperparameters were as follows: the MLP consisted of three layers with 6, 8, and 2 neurons for the three cities, ReLU activation after the second layer, and a Softmax layer for classification. The RF model used the Gini impurity criterion, a maximum depth of 9, a minimum of 5 samples per leaf, and balanced class weights (1:1). As shown in Tables S13 and S14, MLP consistently achieved higher overall recruitment success than RF—both for high-willing donors and the general donor population—aligning with its superior dataset-level performance.
Blood donor recruitment presents a highly imbalanced classification problem, with substantially more non-responders than responders. In such scenarios, simultaneously achieving high accuracy and high recall is challenging, and the F1 score may not always offer meaningful insight. Because the primary goal is to identify as many willing donors as possible, recall is prioritized as the key evaluation metric. This section explains the rationale for adopting the principle of “recall as the primary metric, with other metrics serving as auxiliary criteria” in model selection. We begin by reviewing standard evaluation metrics in binary classification. A correctly predicted positive sample is termed a True Positive (TP), while a negative sample incorrectly predicted as positive is a False Positive (FP). A positive sample incorrectly predicted as negative is a False Negative (FN), and a correctly predicted negative sample is a True Negative (TN). Based on these definitions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 score = 2 × (Precision × Recall) / (Precision + Recall)

Accuracy reflects the overall proportion of correct predictions, precision measures the correctness among predicted positives, and recall quantifies the proportion of actual positives that are successfully identified. The F1 score is the harmonic mean of precision and recall, balancing the two. However, these metrics may be misleading in imbalanced settings. Consider a population of 100 individuals, where 20 are true positives and 80 are negatives.
Suppose Model A predicts all 100 as positive, yielding a recall of 1·0 but an accuracy of only 0·20. In contrast, Model B correctly identifies 10 of the 20 positive cases and misclassifies only 5 of the 80 negatives as positives, resulting in a recall of 0·50 and an accuracy of 0·85. Although Model A achieves perfect recall, it produces many unnecessary messages; Model B, by contrast, strikes a more practical balance by recruiting half of the willing donors while minimizing redundant outreach. Therefore, while recall remains our primary metric, given the high stakes of missing willing donors, it must be interpreted alongside precision and accuracy to ensure practical and efficient recruitment.
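These figures can be checked numerically; the confusion-matrix counts below follow directly from the worked example (Model A labels everyone positive, Model B finds 10 of the 20 positives with 5 false positives).

```python
def metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

model_a = metrics(tp=20, fp=80, fn=0, tn=0)    # predicts all 100 as positive
model_b = metrics(tp=10, fp=5, fn=10, tn=75)   # 10 of 20 positives, 5 FPs
print(model_a["recall"], model_a["accuracy"])  # 1.0 0.2
print(model_b["recall"], model_b["accuracy"])  # 0.5 0.85
```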
In our previous work 6, model selection was based on accuracy. However, in the current datasets from Nanjing, Suzhou, and Yangzhou, the proportion of positive cases is extremely low—only 4·71%, 0·49%, and 3·67%, respectively. In such imbalanced settings, a model that predicts all samples as negative would still achieve high accuracy (approximately 95%, 99%, and 96% for the three cities), but would be of little practical value for identifying potential donors. Consequently, we adopt the principle of prioritizing recall as the primary evaluation metric, while treating accuracy, precision, and other metrics as auxiliary. In practical implementation, models with accuracy below 0·5 are discarded, and the optimal model is selected from the remaining candidates based primarily on recall. Additionally, recruitment contexts differ notably across cities. In Nanjing, recruitment efforts primarily target university students, resulting in a relatively small and concentrated donor pool. In contrast, recruitment in Suzhou and Yangzhou is community-based, encompassing a broader and more diverse population. These differences in target populations and recruitment volumes are likely to influence the extent of performance improvements achieved by machine learning models across regions.
In future work, the performance of machine learning models for donor recruitment can be further improved through several avenues. For example, model parameters can be optimized using additional evaluation metrics—such as accuracy and F1 score—to identify models that perform well across multiple criteria, beyond recall alone. Moreover, building on the insights from the threshold adjustment experiment, heuristic algorithms could be employed to dynamically determine the optimal prediction threshold for each recruitment round. This approach would aim to minimize overlap with conventional recommendation lists, thereby improving the overall effectiveness and efficiency of the recruitment strategy.
Limitations
This study has several limitations. First, the proposed models are designed to recruit individuals with existing donation records and are not capable of identifying potential first-time donors. Second, certain contextual factors—such as donation location and the involvement of specific collection staff—may influence donor willingness, but could not be incorporated due to limited data availability. Finally, while we propose a unified recruitment framework to standardize the overall process, separate models were trained for each city to account for regional differences, rather than developing a single model applicable across all settings.