Study Design and Sample Collection
Following ethical clearance, data were collected from individuals attending the National Thalassemia Centre in Kurunegala, Sri Lanka. Participants were recruited through three main pathways: voluntary screening, which included individuals who independently sought thalassemia carrier testing; referrals from primary care physicians or hospital wards for those with clinically suspected anaemia or microcytic hypochromic red cell indices; and mass screening programs conducted in schools across the Kurunegala district, particularly targeting adolescent populations. The routine diagnostic workflow at the centre was used for the study without any alterations.
In routine (non-programmatic) screening, a structured clinical history was first obtained, followed by a Full Blood Count (FBC). Individuals with microcytic, hypochromic red cell indices (Mean Corpuscular Volume [MCV] < 77 fL and Mean Corpuscular Haemoglobin [MCH] < 27 pg) were further investigated by peripheral blood smear examination. Smears were prepared by trained laboratory technicians from venous blood collected in ethylenediaminetetraacetic acid (EDTA) tubes, stained with Leishman stain, and examined by haematologists for morphological features suggestive of thalassemia, including target cells, pencil cells, and tear-drop cells. Suspected cases underwent quantitative haemoglobin analysis by High-Performance Liquid Chromatography (HPLC), which is the standard confirmatory test for β-thalassemia in Sri Lanka. Individuals with an HbA₂ level > 3.5% were classified as β-thalassemia carriers.
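The two numeric decision rules in this workflow can be written as a short sketch. The cut-offs are those stated above; the function and constant names are illustrative and are not part of the centre's software.

```python
# Cut-offs taken from the text; names are hypothetical.
MCV_CUTOFF_FL = 77.0    # MCV < 77 fL
MCH_CUTOFF_PG = 27.0    # MCH < 27 pg
HBA2_CUTOFF_PCT = 3.5   # HbA2 > 3.5 %


def needs_smear(mcv_fl: float, mch_pg: float) -> bool:
    """True when the FBC shows microcytic AND hypochromic indices."""
    return mcv_fl < MCV_CUTOFF_FL and mch_pg < MCH_CUTOFF_PG


def is_carrier(hba2_pct: float) -> bool:
    """True when the HPLC HbA2 fraction exceeds the carrier cut-off."""
    return hba2_pct > HBA2_CUTOFF_PCT
```

Both comparisons are strict, so a value exactly at a cut-off (for example HbA₂ = 3.5%) is not flagged.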
Once diagnostic confirmation was complete, the Full Blood Count (FBC) reports, blood smears, and HPLC results, which would otherwise have been discarded, were collected for analysis. Sensitive personal identifiers were removed, and only anonymised report numbers were retained to link FBC, smear, and HPLC data.
Written informed consent was obtained from all participants prior to data use.
For minors and individuals unable to provide consent, proxy consent was obtained from parents or legal guardians. All data were stored securely and were accessible only to authorised research personnel. Documentation of consent and ethical approval will be made available to the ethics review committee or journal upon request.
Exclusion criteria included incomplete FBC reports, inability to retrieve the corresponding HPLC result, or absence of usable blood smear images. In addition, individuals who were HPLC-negative for β-thalassaemia but subsequently underwent further investigations for other haemoglobinopathies or alternative causes of anaemia were excluded to avoid diagnostic heterogeneity. No exclusions were made on the basis of prior blood transfusions or iron therapy.
Population Characteristics and Generalisability Considerations
Although screening services are available in multiple tertiary hospitals and dedicated haematology clinics across the country, the National Thalassemia Centre at Teaching Hospital Kurunegala functions as the largest national referral and screening hub. It hosts the highest number of registered patients and conducts the most extensive screening programmes in Sri Lanka, reflecting the highest reported regional prevalence of thalassaemia in areas surrounding Kurunegala. As a result, the study cohort is likely representative of the population most affected by β-thalassaemia within the national context.
Age groups were approximately equally represented, with a near-balanced sex distribution (79 females, 73 males), reducing the likelihood of sex- or age-related sampling bias within the cohort. However, the dataset was collected from a single geographical region, and demographic characteristics beyond age and sex, such as ethnicity, socioeconomic background, and place of residence, were not recorded. It is not known whether these factors influence red blood cell indices or peripheral smear morphology; to date, no scientific evidence has demonstrated systematic variation in these haematological parameters attributable to such demographic variables in Sri Lankan populations. Nonetheless, the absence of these data limits the ability to fully assess potential demographic bias.
Furthermore, while some sources of variability were incorporated into the dataset through differences in operators and acquisition devices, the model has not yet been evaluated in populations served by other screening centres, laboratories using different equipment, or regions with differing genetic backgrounds. Therefore, future work should aim to include a more demographically diverse cohort through a multi-centre, nationwide study to enhance external validity and support broader clinical deployment.
Two types of data were acquired: Red Blood Cell Indices and Blood Smear Imaging
The following red cell indices and haematological parameters were collected for each participant: Haemoglobin (Hb), Red Blood Cell count (RBC), Mean Corpuscular Volume (MCV), Mean Corpuscular Haemoglobin (MCH), Mean Corpuscular Haemoglobin Concentration (MCHC), and Red cell Distribution Width (RDW). In addition, patient age and sex were recorded as shown in Table 1. The dataset comprised 152 participants, including 54 BTT-positive and 98 BTT-negative individuals. For each RBC parameter and age, the mean ± standard deviation (SD) and observed range are presented for the total cohort, and separately for male and female participants within each group.
Table 1
Descriptive summary of the study dataset: Red Blood Cell indices and age stratified by β-thalassemia trait (BTT) status and sex.
| Parameter | Total dataset | Positive (BTT): Male | Positive (BTT): Female | Positive (BTT): Total | Negative (Normal): Male | Negative (Normal): Female | Negative (Normal): Total |
|---|---|---|---|---|---|---|---|
| MCV (fL) | 71.46 ± 10.63 (51.4–90.5) | 59.57 ± 4.39 (51.4–70.2) | 60.21 ± 4.06 (53.9–74.4) | 59.89 ± 4.20 (51.4–74.4) | 77.26 ± 8.14 (54.3–89.9) | 77.71 ± 6.96 (56.1–90.5) | 77.51 ± 7.47 (54.3–90.5) |
| MCH (pg) | 23.47 ± 4.22 (15.1–33.0) | 19.01 ± 1.42 (16.4–21.6) | 19.09 ± 1.34 (17.0–23.9) | 19.05 ± 1.37 (16.4–23.9) | 25.75 ± 3.56 (15.1–31.2) | 25.80 ± 3.01 (17.7–33.0) | 25.78 ± 3.25 (15.1–33.0) |
| MCHC | 32.59 ± 1.94 (14.3–35.7) | 31.97 ± 0.64 (30.5–32.9) | 31.83 ± 0.76 (29.6–33.2) | 31.90 ± 0.70 (29.6–33.2) | 33.25 ± 1.58 (27.9–35.7) | 32.72 ± 2.67 (14.3–34.9) | 32.96 ± 2.26 (14.3–35.7) |
| Hb | 12.00 ± 2.11 (5.7–19.9) | 11.24 ± 1.63 (8.5–13.8) | 10.58 ± 1.07 (8.5–12.4) | 10.91 ± 1.40 (8.5–13.8) | 13.30 ± 2.59 (5.7–19.9) | 12.00 ± 1.63 (9.4–17.3) | 12.57 ± 2.19 (5.7–19.9) |
| RBC | 5.16 ± 0.73 (3.2–7.1) | 5.91 ± 0.77 (3.9–7.1) | 5.52 ± 0.46 (4.6–6.1) | 5.72 ± 0.65 (3.9–7.1) | 5.12 ± 0.60 (3.2–6.3) | 4.69 ± 0.52 (3.5–5.8) | 4.88 ± 0.60 (3.2–6.3) |
| RDW | 15.56 ± 2.17 (11.4–24.7) | 17.31 ± 1.24 (14.8–20.0) | 16.45 ± 1.23 (12.5–19.0) | 16.88 ± 1.30 (12.5–20.0) | 14.39 ± 1.74 (11.4–21.0) | 15.24 ± 2.48 (12.7–24.7) | 14.86 ± 2.21 (11.4–24.7) |
| Age (years) | 22.29 ± 12.85 (1.0–63.0) | 24.02 ± 15.81 (1.0–63.0) | 19.29 ± 12.46 (2.8–45.0) | 21.65 ± 14.30 (1.0–63.0) | 22.37 ± 13.21 (1.0–48.0) | 22.84 ± 11.22 (1.0–59.0) | 22.63 ± 12.07 (1.0–59.0) |
BTT-positive individuals had markedly lower mean MCV (59.89 vs 77.51 fL) and MCH (19.05 vs 25.78 pg) than BTT-negative participants. They also had a higher mean RBC count (5.72 vs 4.88), consistent with the compensatory erythropoiesis characteristic of thalassemia carriers. Mean RDW was higher in BTT-positive individuals (16.88 vs 14.86), indicating greater variability in red cell size. Differences in MCHC were more modest, with slightly lower values among BTT-positive cases (31.90 vs 32.96).
The age distribution of participants ranged from 1 to 63 years, with similar mean ages across groups. BTT-positive individuals had a mean age of 21.65 ± 14.30 years, closely aligned with the mean age of 22.63 ± 12.07 years observed in the negative group, suggesting no notable age-related sampling bias. Sex distribution was also balanced across the cohort, with male and female participants represented in approximately equal proportions in both BTT-positive and BTT-negative groups. Minor sex-related differences were noted in Hb and RBC, with males showing slightly higher values. However, key screening parameters such as MCV and MCH demonstrated minimal variation by sex, indicating that these indices are largely independent of sex within this dataset.
The dataset provides a comprehensive representation of RBC indices for both BTT-positive and BTT-negative individuals, making it suitable for training and validating the dual-modal machine learning pipeline. The distribution of values also highlights the discriminative potential of FBC-derived red cell indices for initial β-thalassemia screening.
Blood smear images were captured using two approaches: smartphone-assisted microscopy and a standard microscope camera.
Smartphone-assisted microscopy – Three different smartphones were used: Samsung Galaxy M21, Apple iPhone 15, and Xiaomi Redmi Note 13. Three operators independently captured images, with slides randomly assigned in approximately equal proportions to simulate real-world variability in community and laboratory settings. Each phone camera was manually aligned with the microscope eyepiece without customised adapters or camera settings, and the smears were focused under the ×100 oil immersion objective. The microscope lamp served as the sole illumination source, and no exposure, colour, or white-balance calibration was performed prior to acquisition. Images were saved in the native formats of the devices (HEIC or PNG) and subsequently converted to JPEG during preprocessing. This method emulates routine haematological practice, with the smartphone effectively substituting for the human eye.
Standard microscope camera – An AxioCam ERc5s camera attached to a standard light microscope was used to acquire reference-quality images at the same ×100 magnification. Relevant camera metadata, including pixel accuracy (0.0039 mm), image size (up to 5712 × 4284 pixels), light path, and lens information, were retained for reproducibility.
For each smear, 25–30 non-overlapping images were obtained from the morphological “zone of interest,” i.e., the area between the body and tail of the smear where red cells are evenly distributed, non-overlapping, and morphologically intact. The head and tail regions, where cells are either too crowded or distorted, were excluded following standard haematology practice. Smear examination was performed systematically using a Z-pattern scanning approach to ensure comprehensive coverage. Representative blood smear images used in the study are shown in Fig. 1.
Captured images were reviewed, and any blurred or artefact-containing images were excluded. The final dataset comprised 5,198 high-quality images from 152 blood smear samples, each sample labelled as thalassemia carrier or non-carrier according to the corresponding HPLC results.
Haematological Data (Red Blood Cell indices)
The FBC reports obtained from the laboratory information system were exported into a structured format and underwent several preprocessing steps. Personal identifiers were removed to ensure anonymisation, with only the report ID retained to match smear images and HPLC results. Reports containing incomplete or missing red cell indices were excluded, and any outliers falling outside physiologically plausible limits were flagged and manually verified against the original records. Numerical variables, including MCV, MCH, MCHC, haemoglobin (Hb), RDW, and RBC count, were normalised using z-score standardization. Patient sex was encoded as a binary variable (0 = female, 1 = male). HPLC-confirmed carrier status served as the ground truth label, with carriers coded as 1 and non-carriers as 0. These steps resulted in a structured tabular dataset of RBC indices aligned with each participant’s corresponding smear images.
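The standardisation and encoding steps can be illustrated with a minimal standard-library sketch. The record layout (dictionary keys such as "sex" and "carrier") and the function names are assumptions made for illustration. The population standard deviation is used here, which is also an assumption, chosen because it matches the default behaviour of scikit-learn's StandardScaler.

```python
from statistics import mean, pstdev

# Numerical variables named in the text.
NUMERIC = ["MCV", "MCH", "MCHC", "Hb", "RDW", "RBC"]


def zscore_columns(rows):
    """Return copies of the rows with each numeric column z-scored."""
    stats = {}
    for col in NUMERIC:
        values = [r[col] for r in rows]
        stats[col] = (mean(values), pstdev(values))
    out = []
    for r in rows:
        new = dict(r)
        for col, (mu, sd) in stats.items():
            # Guard against a constant column (zero spread).
            new[col] = (r[col] - mu) / sd if sd > 0 else 0.0
        out.append(new)
    return out


def encode(row):
    """Encode sex (0 = female, 1 = male) and label (1 = carrier)."""
    new = dict(row)
    new["sex"] = 1 if row["sex"] == "male" else 0
    new["label"] = 1 if row["carrier"] else 0
    return new
```

In a train/test setting, the mean and standard deviation would normally be estimated on the training partition only and then reused for the test partition.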
The raw smear images underwent several preprocessing steps to ensure consistency and suitability for deep learning analysis. Initial quality control filtering was performed to remove images affected by motion blur, poor focus, or uneven staining, using both visual inspection and automated quality checks. To minimise background artefacts, a rectangular Region of Interest (ROI) corresponding to the largest inscribed area within the circular microscope field was extracted using the Hough Circle Transform in OpenCV, allowing automatic cropping of the maximal area containing red blood cells. All images were then resized to 224 × 224 pixels to meet the input requirements of the convolutional neural network (CNN) and maintain computational efficiency. To address class imbalance and improve model generalisation, data augmentation techniques—including random rotations (± 15°), horizontal and vertical flips, minor translations and zooming, and controlled brightness and contrast variations—were applied (Fig. 2). Following preprocessing, the dataset was divided into training (70%) and independent testing (30%) subsets, with patient-level stratification to prevent data leakage by ensuring that images from the same individual did not appear across different partitions.
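The geometry behind the ROI crop can be sketched as follows. The sketch assumes the circular field has already been located as a centre (cx, cy) and radius r, for example by OpenCV's HoughCircles, and that the rectangle of maximal area inside a circle is the axis-aligned square of side r√2. The function name and the clipping to the image bounds are illustrative choices, not details taken from the study code.

```python
import math


def inscribed_square(cx, cy, r, width, height):
    """Bounding box (x0, y0, x1, y1) of the largest axis-aligned square
    inside a circle, clipped to an image of the given width and height."""
    # Half the side length of the inscribed square is r / sqrt(2).
    half = int(math.floor(r / math.sqrt(2)))
    x0 = max(0, int(cx) - half)
    y0 = max(0, int(cy) - half)
    x1 = min(width, int(cx) + half)
    y1 = min(height, int(cy) + half)
    return x0, y0, x1, y1
```

With a NumPy-style image array, the crop would then be `image[y0:y1, x0:x1]`, followed by resizing to 224 × 224 pixels.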
Model Architecture and Training
The proposed model architecture was designed with careful consideration of the requirements of a population-level screening tool. In such settings, maximising sensitivity is paramount to ensure that no β-thalassemia carrier is missed, even at the expense of specificity. A false-negative result could delay diagnosis and increase the risk of undetected carrier marriages, whereas false positives can be resolved at the confirmatory testing stage.
In addition to achieving high diagnostic accuracy, the architecture was intentionally designed to meet the practical challenges of real-world healthcare delivery, especially in resource-limited and rural settings where trained haematologists and advanced laboratory infrastructure may be scarce. The system therefore aimed to minimise the workload on clinicians and laboratory staff by reducing the need for unnecessary smear preparation and microscopic examination, while also lowering overall costs and reagent consumption to support large-scale screening initiatives. Furthermore, portability and scalability were prioritised to enable deployment across multiple platforms—from mobile devices to desktop systems—ensuring broad accessibility. Finally, the pipeline was optimised for rapid turnaround, delivering near real-time results suitable for screening during outreach activities and school-based screening programmes.
Overview of the Dual-Modal Framework
A two-step dual-modal pipeline was implemented.
Step 1 of the screening pipeline involves haematological analysis based on RBC indices together with patient age and sex. This stage was optimised for high specificity to ensure that individuals flagged as positive could be confidently referred directly for confirmatory HPLC testing without requiring additional intermediate assessments. Only those classified as negative at this stage proceed to the secondary evaluation.
Step 2 of the pipeline focuses on morphological screening using blood smear images. For individuals not identified as positive in the first stage, peripheral blood smears are prepared, digitised, and analysed using a CNN-based model trained to detect subtle morphological features associated with thalassemia. This stage was intentionally designed to maximise sensitivity, ensuring that borderline or atypical cases are captured and the risk of missed carriers is minimised.
By integrating the two modalities in a sequential architecture, the pipeline ensures high diagnostic reliability while maintaining operational efficiency. At the same time, the overall workload for laboratory staff is reduced, as only individuals who screen negative in Step 1 require smear preparation and further microscopic analysis. This design also optimises time, cost, and reagent consumption, making the workflow well suited for mass-screening programmes where preserving resources without compromising clinical robustness is essential.
The dual-modal strategy is shown in Fig. 3.
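The sequential logic of the two steps can be summarised in a short sketch. The models are represented as plain callables returning a carrier probability. The threshold values and the averaging of per-image scores into a patient-level score are assumptions for illustration, since neither is specified above.

```python
from typing import Callable, Dict, Sequence


def two_step_screen(
    indices: Dict[str, float],
    get_smear_images: Callable[[], Sequence[object]],
    index_model: Callable[[Dict[str, float]], float],
    image_model: Callable[[object], float],
    step1_threshold: float = 0.8,   # hypothetical: high, favours specificity
    step2_threshold: float = 0.3,   # hypothetical: low, favours sensitivity
) -> str:
    """Return 'refer_for_hplc' or 'screen_negative' for one individual."""
    # Step 1: RBC indices with age and sex.
    if index_model(indices) >= step1_threshold:
        return "refer_for_hplc"  # no smear is prepared for Step 1 positives

    # Step 2: smear images are requested only for Step 1 negatives.
    images = get_smear_images()
    if not images:
        raise ValueError("Step 2 requires at least one smear image")
    scores = [image_model(img) for img in images]
    patient_score = sum(scores) / len(scores)  # assumed aggregation: mean
    if patient_score >= step2_threshold:
        return "refer_for_hplc"
    return "screen_negative"
```

Passing the image source as a callable keeps smear preparation lazy: it is triggered only when Step 1 does not already refer the individual for HPLC.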
Processing of Red Blood Cell Indices and Blood Smear Images
The steps followed in processing the Red Blood Cell Indices and Blood Smear Images are described below.
Processing Red Blood Cell Indices
The first component of the dual-modal model processes Red Blood Cell (RBC) indices, along with patient age and sex.
The preprocessing and model selection workflow included the following steps:
I.Dimensionality Reduction (PCA):
Principal Component Analysis (PCA) was applied to the RBC dataset to identify the most informative combinations of red cell indices. PCA helped reduce redundancy between correlated parameters while retaining the variance most relevant for discriminating carriers from non-carriers. The resulting principal components were analysed to guide feature selection for downstream model training.
II.Model Selection:
To identify the optimal classifier for RBC-based screening, nine machine learning algorithms were trained and compared using different combinations of RBC parameters: Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Tree (DT), Random Forest (RF), Gradient Boosting Machine (GBM), AdaBoost, Support Vector Machine (SVM), Naive Bayes (NB), and Multilayer Perceptron (MLP). Each model was evaluated on training and validation sets using cross-validation to ensure robustness, and performance metrics, including accuracy, precision, recall, and area under the ROC curve (AUC), were compared.
III.Feature Selection & Optimisation:
Based on performance, the most informative subset of RBC parameters was selected to maximise specificity in this first step, ensuring that predicted positives could be confidently sent for confirmatory HPLC without unnecessary smear preparation. Hyperparameters for each algorithm were tuned using grid search or Bayesian optimisation to achieve optimal predictive performance.
Processing Blood Smear Images
For model development, blood smears were prepared for 152 out of 163 participants, comprising 98 negative and 54 positive cases. All images underwent the preprocessing steps described previously, including ROI extraction using Hough Circle Transform, resolution standardisation (224 × 224 pixels), and data augmentation to increase variability and reduce class imbalance.
The image component employed transfer learning to leverage pretrained convolutional neural networks (CNNs) for feature extraction. Multiple architectures were evaluated: VGG16, ResNet50, DenseNet121, MobileNetV2 and ConvNeXtTiny. Each architecture was fine-tuned and hyperparameters were systematically varied using grid search or random search to identify the optimal configuration for each model.
Models were trained using training and validation subsets, with patient-level stratification to prevent data leakage. Performance metrics, including accuracy, sensitivity, specificity, F1-score, and area under the ROC curve (AUC), were calculated to identify the model with the best sensitivity, ensuring that borderline and atypical positive cases were reliably detected.
The selected image model forms the second-stage screening component.
Model Training and Optimisation
The training and optimisation of each model are described below.
The processed RBC dataset was used to train a machine learning model for β-thalassemia carrier classification. The model was implemented in Python 3.9, using scikit-learn v1.2, along with NumPy, Pandas, Matplotlib, and Seaborn for evaluation and visualisation. The MLP architecture comprised hidden layers with ReLU activation, fine-tuned to optimise performance for binary classification. To ensure balanced representation of carriers and non-carriers and to prevent data leakage, stratified 5-fold cross-validation was employed throughout the optimisation process. Performance of the model was evaluated using a comprehensive set of custom metrics, including sensitivity, specificity, accuracy, F1-score, ROC–AUC, and the Matthews correlation coefficient.
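For reference, the threshold-dependent metrics listed above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch with illustrative names; it is not the study's evaluation code. ROC–AUC is omitted because it requires continuous scores rather than counts.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute sensitivity, specificity, accuracy, F1 and MCC from counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if (precision + sensitivity)
        else 0.0
    )
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }
```

As a check on the formulas with made-up counts (not study results), tp = 40, fp = 10, tn = 90, fn = 10 gives a sensitivity of 0.8, a specificity of 0.9, and an MCC of 0.7.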
The second component of the dual-modal model processes blood smear images and was implemented in Python 3.9 using TensorFlow/Keras 2.x alongside standard scientific libraries such as NumPy, Pandas, and Matplotlib. This component was designed to ensure patient-level stratified training and evaluation, preventing data leakage across folds or between images belonging to the same individual. Pretrained convolutional neural networks were used as feature extractors. Their base layers were initially frozen, and a custom classifier head was added, consisting of a global average pooling layer, a dense layer with 256 ReLU-activated units, a dropout layer with a rate of 0.5, and a final softmax output layer with two units for binary classification. After initial training, the top four layers of each CNN backbone were unfrozen for fine-tuning. Training followed a two-stage strategy: 50 epochs with a frozen backbone using the Adam optimiser (learning rate = 0.001), followed by 50 fine-tuning epochs with a reduced learning rate of 1e-5. Images were processed at 224 × 224 resolution with a batch size of 32, and extensive data augmentation—such as random flips, brightness and contrast adjustments, saturation shifts, and rotations—was applied to improve generalisation.
To prevent data leakage and keep the evaluation reproducible, patient-level splitting was enforced using train_test_split so that no images from the same patient appeared in both training and validation sets. A custom PatientFolderDataGenerator, implemented as a subclass of tf.keras.utils.Sequence, handled batch generation, preprocessing, augmentation, and GPU-accelerated loading. Training incorporated callbacks including EarlyStopping (patience = 10, monitoring validation accuracy), ReduceLROnPlateau (factor = 0.5, patience = 5), and ModelCheckpoint for saving the best-performing model.
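The idea of patient-level splitting can be illustrated with a standard-library sketch that partitions patients rather than images and stratifies by carrier label. It is a stand-in for illustration and does not reproduce the study's train_test_split call. The record layout (patient_id, label, image_path) and the function name are assumptions.

```python
import random
from collections import defaultdict


def patient_level_split(image_records, test_fraction=0.3, seed=42):
    """Split (patient_id, label, image_path) records so that all images
    from one patient fall in exactly one partition."""
    # One label per patient (labels come from the patient's HPLC result).
    label_of = {}
    for pid, label, _ in image_records:
        label_of[pid] = label

    by_label = defaultdict(list)
    for pid, label in label_of.items():
        by_label[label].append(pid)

    rng = random.Random(seed)
    held_out = set()
    for pids in by_label.values():
        pids = sorted(pids)          # fixed order before shuffling
        rng.shuffle(pids)
        n_held_out = round(len(pids) * test_fraction)
        held_out.update(pids[:n_held_out])

    train = [r for r in image_records if r[0] not in held_out]
    test = [r for r in image_records if r[0] in held_out]
    return train, test
```

Because the shuffle operates on patient identifiers, the image-level proportions in each partition only approximate the requested fraction when patients contribute different numbers of images.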
Implementation and Deployment
The proposed dual-modal pipeline was designed with practical deployment in low-resource and primary care settings in mind. A key feature of the system is that it does not require specialised imaging hardware; smartphone-based microscopy was performed using standard laboratory microscopes without custom adapters, modified optics, or controlled imaging environments. This approach reflects the conditions under which peripheral smear evaluation is routinely carried out in many regional hospitals and rural clinics in Sri Lanka, enabling potential adoption without additional capital cost.
In a real-world workflow, peripheral smear images captured using a smartphone or existing microscope camera could be uploaded to a local workstation or mobile application interface, where automated image analysis and FBC feature processing would be performed. The computational requirements of the final model allow inference to be executed either on-device for newer smartphones or via a low-resource edge or cloud-based server, making the system adaptable to settings with variable technical infrastructure. This supports potential deployment in rural primary care clinics that lack on-site haematologists, providing decision support for early carrier screening and referral.
The intended operational use-case involves assisting healthcare workers by flagging individuals with a high likelihood of β-thalassaemia trait, thereby prioritising confirmatory testing and improving screening throughput. However, the current implementation remains a research prototype, and several constraints must be addressed prior to clinical integration. These include regulatory approval, standardisation of smear preparation and staining quality, operator training for consistent image capture, and integration with electronic laboratory information systems. Connectivity limitations in remote settings may also affect workflows reliant on cloud-based processing.
Future development will focus on a full mobile application interface, streamlined clinician-facing reporting, and incorporation into existing national thalassaemia screening programmes. Prospective field evaluation and usability testing will be required to ensure that the system is safe, scalable, and operationally feasible in real-world clinical environments.