Artificial Intelligence-Powered Risk Prediction Models for Preventable Maternal Mortality in Rural Settings: A Systematic Review
Joy Aifuobhokhan, MD,* Ayodeji Ogunjinmi, MD, Chukwuemeka Abraham Agbarakwe, MD, Deborah Oladunmolu Oduguwa, MD, Annie Peter Essiet, MD, Temitayo Osunkiyesi, MD, Akinbogun Modesire, MD
Lakeshore Cancer Center, Bingham University Teaching Hospital, Calvary Specialist Hospital, Babcock University Teaching Hospital, Tehilah Children's Hospital, Trilogy, Babcock University Teaching Hospital.
Abstract
Background
Maternal mortality remains disproportionately high in low- and middle-income countries, particularly in rural settings with limited access to skilled obstetric care. Artificial intelligence and machine learning models offer promise for early risk prediction, yet their methodological rigor, applicability, and deployment feasibility in resource-constrained rural contexts remain inadequately synthesized. This systematic review evaluated AI-powered risk prediction models for preventable maternal mortality, emphasizing suitability for rural and low-resource settings.
Methods
A systematic literature search was conducted across PubMed, Scopus, Web of Science, IEEE Xplore, Google Scholar, and African Journals Online for studies published January 2015 to August 2025. Studies employing AI or machine learning to predict maternal mortality or severe maternal outcomes were included. The Prediction model Risk Of Bias Assessment Tool (PROBAST) assessed methodological quality across four domains: participants, predictors, outcomes, and analysis. Data extraction captured study characteristics, model architectures, performance metrics, validation strategies, and rural implementation considerations. This review was registered with PROSPERO (CRD420251174343) and reported per PRISMA 2020 guidelines.
Results
Twenty-eight studies met inclusion criteria, predominantly from sub-Saharan Africa (n = 12) and South Asia (n = 8). Dataset sizes ranged from 402 to over 31 million records from national surveys (n = 14), hospital registries (n = 9), and Internet of Things monitoring systems (n = 5). Random Forest (n = 14), ensemble methods (n = 11), and neural networks (n = 11) were most frequently employed. Reported area under the receiver operating characteristic curve values ranged from 0.70 to 0.95 (median 0.84), with sensitivity 70–92% and specificity 65–85%. PROBAST assessment revealed low risk of bias for participants (24/28), predictors (25/28), and outcomes (24/28), but substantial concerns in the analysis domain (14/28 low risk, 8/28 high risk). Key limitations included reliance on synthetic oversampling without external validation, inadequate calibration reporting, and small sample sizes in IoT studies. Only 11 studies (39%) conducted external validation. Common predictors were maternal age, blood pressure, gestational age, parity, and antenatal care attendance. Rural implementation barriers included limited connectivity, data sparsity, workforce training needs, and the absence of explainability frameworks.
Conclusions
AI-powered models demonstrate strong discriminative performance for maternal mortality prediction when trained on large, representative datasets. However, methodological weaknesses, particularly inadequate external validation and calibration assessment, limit confidence in their generalizability. Underrepresentation of rural populations and scarcity of implementation studies constrain real-world applicability. Future development should prioritize federated learning for privacy-preserving multi-site collaboration, lightweight architectures for offline deployment, explainable AI frameworks, and integration into community health worker workflows to achieve equitable, scalable solutions for reducing preventable maternal deaths in rural low- and middle-income country settings.
Keywords: Maternal mortality, artificial intelligence, machine learning, risk prediction, rural health, low-resource settings, LMIC, preventable deaths, PROBAST, predictive modeling
Systematic review registration
PROSPERO CRD42025174343
Background
The Global Burden of Maternal Mortality
Maternal mortality remains one of the most profound indicators of health system performance and gender equity globally. Despite decades of international commitment, an estimated 287,000 maternal deaths occurred worldwide in 2020, reflecting a global maternal mortality ratio (MMR) of 223 deaths per 100,000 live births [1]. This burden is starkly inequitable: sub-Saharan Africa and South Asia collectively account for 86% of all maternal deaths, with MMRs exceeding 500 per 100,000 live births in several countries [2]. Within these regions, rural populations experience disproportionately higher mortality rates due to compounding barriers including geographic isolation, inadequate transportation infrastructure, shortage of skilled birth attendants, and delayed emergency obstetric care [3].
The leading causes of preventable maternal mortality (postpartum hemorrhage, hypertensive disorders of pregnancy including pre-eclampsia and eclampsia, sepsis, and obstructed labor) are well characterized and potentially manageable with timely intervention [4]. However, the critical window for life-saving action is often missed in rural settings where early warning signs go unrecognized, referral systems are weak, and facility-based care is inaccessible [5]. This preventability paradox underscores the urgent need for innovative approaches to maternal risk stratification that can function effectively in resource-constrained environments.
Evolution of Maternal Risk Assessment
Historically, maternal risk assessment has relied on clinical scoring systems developed primarily in high-income contexts. Tools such as the Modified Early Obstetric Warning Score (MEOWS) utilize fixed threshold values for vital signs and clinical parameters to identify women at risk of deterioration [6]. While these instruments provide standardized frameworks for triage, they possess several limitations when applied to rural LMIC settings. First, they typically require continuous monitoring infrastructure and trained clinical personnel, resources rarely available in remote health posts [7]. Second, traditional scoring systems employ linear risk models that may inadequately capture the complex, multifactorial nature of maternal mortality risk, which encompasses clinical, sociodemographic, behavioral, and health system access variables [8]. Third, most existing tools were validated in hospital settings with comprehensive laboratory support, limiting their transferability to community-based care environments where diagnostic capacity is minimal [9].
Epidemiological risk models using logistic regression have advanced beyond simple scoring systems by incorporating multiple predictor variables and generating individualized probability estimates [10]. However, these statistical approaches assume that each predictor acts linearly on the log-odds of the outcome unless interaction or non-linear terms are specified explicitly, potentially overlooking important non-linear interactions and threshold effects that characterize obstetric complications [11]. Furthermore, conventional models struggle to integrate the heterogeneous data sources, ranging from demographic survey data to real-time vital signs, that are increasingly available through digital health initiatives in LMICs [12].
The Promise of Artificial Intelligence and Machine Learning
Artificial intelligence, particularly through machine learning and deep learning paradigms, offers transformative potential for maternal risk prediction by addressing several limitations of conventional approaches. Machine learning algorithms can identify complex, non-linear patterns in high-dimensional data, adaptively learn from diverse data sources, and generate predictions without requiring explicit programming of decision rules [13]. These capabilities are particularly relevant for maternal health, where risk profiles emerge from intricate interactions among physiological, social, and health system factors [14].
Recent applications of AI in related domains have demonstrated remarkable success. In neonatal medicine, machine learning models have achieved superior performance compared to traditional scoring systems for predicting mortality in preterm infants, with area under the receiver operating characteristic curve (AUROC) values exceeding 0.90 [15]. Similarly, AI-powered sepsis prediction systems have enabled earlier identification of at-risk patients in critical care settings, reducing time to antibiotic administration [16]. In cardiovascular medicine, deep learning algorithms analyzing electrocardiogram data have uncovered prognostic information invisible to human interpretation [17].
For maternal health specifically, AI applications have emerged across the continuum of care. Predictive models have been developed for gestational diabetes, preterm birth, pre-eclampsia, and postpartum hemorrhage, among other conditions [18]. Several studies have demonstrated that ensemble machine learning approaches, combining multiple algorithms, can outperform single-model strategies and traditional risk calculators [19]. Moreover, AI systems can integrate diverse data streams, including electronic health records, community health worker assessments, wearable sensor data, and patient-reported information, enabling comprehensive risk assessment even when individual data sources are incomplete [20].
Despite this promise, significant gaps remain in the evidence base regarding AI applications for maternal mortality prediction in rural settings. Most published studies have been conducted in high-income countries or urban tertiary hospitals with robust digital infrastructure [21]. The feasibility of deploying AI models in low-resource environments, where reliable electricity and internet connectivity cannot be assumed, where health workers may have limited digital literacy, and where cultural acceptability of algorithmic decision support is uncertain, remains poorly characterized [22]. Furthermore, critical methodological concerns, including model transparency, algorithmic bias, external validation, and ethical implications of AI deployment in vulnerable populations, have received insufficient attention in the maternal health literature [23].
Rationale for This Systematic Review
To date, no comprehensive synthesis has specifically examined AI-powered risk prediction models for preventable maternal mortality with explicit focus on rural and low-resource settings. Existing systematic reviews have addressed related topics, including AI in obstetric care broadly [24], prediction models for specific complications like pre-eclampsia [25], and maternal mortality risk assessment using conventional statistical methods [26], but none have systematically evaluated the methodological quality, performance characteristics, and implementation feasibility of AI models specifically designed for or applicable to rural contexts in LMICs.
This evidence gap is particularly problematic given the dual challenges of data scarcity and deployment constraints that characterize rural health systems. Understanding which AI approaches have demonstrated robust performance with limited predictor sets, which validation strategies ensure generalizability across diverse populations, and which implementation models have successfully integrated AI tools into frontline care workflows is essential for guiding future development efforts [27]. Moreover, critical assessment of methodological rigor, including risk of bias evaluation, is necessary to distinguish credible evidence from optimistic reporting that may characterize early-stage technology development [28].
Objectives and Research Questions
This systematic review was conducted to synthesize current evidence on AI-powered risk prediction models for preventable maternal mortality, with particular emphasis on their applicability to rural and resource-limited settings. The specific objectives were to:
1. Identify and characterize all published AI and machine learning models designed to predict maternal mortality or severe maternal morbidity
2. Assess the methodological quality and risk of bias of prediction model development and validation studies using the PROBAST framework
3. Evaluate reported model performance, including discrimination, calibration, and clinical utility metrics
4. Examine the datasets, predictor variables, and data preprocessing strategies employed
5. Analyze implementation considerations specific to rural health systems, including infrastructure requirements, workforce integration, and scalability
6. Identify methodological gaps and propose recommendations for future model development and deployment
Research Framework
The research questions were framed using a hybrid PICO (Population, Intervention, Comparison, Outcome) and CoCoPop (Condition, Context, Population) framework to capture both predictive accuracy and contextual applicability. The framework elements were defined as follows:
Population: Pregnant women and women in the postpartum period, particularly those in rural or resource-limited settings in LMICs
Intervention: AI-powered or machine learning risk prediction models for maternal mortality or severe maternal outcomes
Comparison: Conventional risk scoring tools, traditional statistical models (e.g., logistic regression), or no formal risk prediction
Outcome: Predictive performance (AUROC, sensitivity, specificity, calibration), reduction in maternal mortality, implementation feasibility in rural contexts
Condition: Preventable maternal mortality or severe maternal morbidity from conditions including postpartum hemorrhage, hypertensive disorders, sepsis, and obstructed labor
Context: Rural, remote, or low-resource healthcare settings, including primary health centers, community-based care, and district hospitals in LMICs
By systematically addressing these objectives and research questions, this review aims to provide evidence-based guidance for researchers, policymakers, and implementers seeking to harness AI technologies to reduce preventable maternal deaths in the settings where they most frequently occur.
Methods
Protocol and Registration
This systematic review was prospectively registered with the International Prospective Register of Systematic Reviews (PROSPERO) under registration number CRD42025174343 and conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement [29]. The protocol was developed following PRISMA for Protocols (PRISMA-P) 2015 guidance [30]. All methodological decisions, including eligibility criteria, search strategies, and quality assessment tools, were specified a priori to minimize selective reporting and ensure transparency.
Eligibility Criteria
Studies were considered eligible for inclusion if they met the following criteria:
Inclusion criteria:
Employed artificial intelligence or machine learning techniques (including but not limited to random forest, support vector machines, neural networks, gradient boosting, deep learning architectures, or ensemble methods) to develop, validate, or evaluate risk prediction models
Addressed preventable causes of maternal mortality or severe maternal morbidity as primary or secondary outcomes
Included rural, remote, or low-resource settings as study contexts, or provided explicit discussion of model applicability to such environments
Constituted primary research reporting original model development or validation, including cohort studies, case-control studies, cross-sectional analyses, or prediction model studies
Published in the English language between January 1, 2015, and August 31, 2025 (this timeframe was selected to capture the period of rapid AI advancement while maintaining contemporary relevance)
Reported quantitative model performance metrics enabling assessment of predictive accuracy (e.g., AUROC, sensitivity, specificity, positive predictive value, calibration measures)
Exclusion criteria:
Studies not employing AI or machine learning methods for prediction (e.g., purely descriptive epidemiological studies, clinical trials without predictive modeling components, traditional statistical analyses without ML algorithms)
Non-human studies, laboratory experiments, or preclinical research
Opinion pieces, editorials, commentaries, and narrative reviews without original data or model development
Studies focused exclusively on neonatal, perinatal, or fetal outcomes unless maternal mortality risk was explicitly modeled as a primary outcome
Conference abstracts or proceedings without full-text availability
Duplicate publications reporting identical data and models (in such cases, the most complete or recent publication was retained)
Information Sources and Search Strategy
A comprehensive literature search was executed across multiple electronic databases to maximize the retrieval of relevant studies. The databases searched included PubMed/MEDLINE, Scopus, Web of Science Core Collection, IEEE Xplore Digital Library, Google Scholar (first 300 results sorted by relevance), and African Journals Online (AJOL). The search was conducted from June 2025 to August 28, 2025, with no date restrictions applied initially; date filters were subsequently applied during screening.
The search strategy was developed iteratively through piloting in PubMed. The strategy combined four concept groups using Boolean operators: (1) maternal mortality terms, (2) artificial intelligence and machine learning terms, (3) risk prediction terms, and (4) rural and low-resource setting terms. Medical Subject Headings (MeSH) terms were used where applicable, supplemented by free-text keyword searches. The complete search string for PubMed was:
(("Maternal Mortality"[Mesh] OR "maternal mortality"[tiab] OR "maternal death*"[tiab] OR "pregnancy-related death*"[tiab] OR "maternal outcome*"[tiab] OR "severe maternal morbidity"[tiab] OR "maternal near miss"[tiab] OR "obstetric mortality"[tiab]) AND ("Artificial Intelligence"[Mesh] OR "Machine Learning"[Mesh] OR "Deep Learning"[Mesh] OR "artificial intelligence"[tiab] OR "machine learning"[tiab] OR "deep learning"[tiab] OR "neural network*"[tiab] OR "random forest"[tiab] OR "support vector machine*"[tiab] OR "gradient boost*"[tiab] OR "ensemble method*"[tiab] OR "supervised learning"[tiab]) AND ("Risk Assessment"[Mesh] OR "Predictive Value of Tests"[Mesh] OR "risk prediction"[tiab] OR "risk assessment"[tiab] OR "risk model*"[tiab] OR "prediction model*"[tiab] OR "prognostic model*"[tiab] OR "risk stratification"[tiab] OR "early warning"[tiab]) AND ("Rural Health"[Mesh] OR "Developing Countries"[Mesh] OR "rural"[tiab] OR "remote"[tiab] OR "low-resource"[tiab] OR "resource-limited"[tiab] OR "low-income countr*"[tiab] OR "middle-income countr*"[tiab] OR "LMIC"[tiab] OR "developing countr*"[tiab] OR "community health"[tiab] OR "primary care"[tiab]))
This search strategy was adapted for each database according to its syntax requirements and controlled vocabulary. For Scopus and Web of Science, the equivalent database-specific field tags were used (e.g., TITLE-ABS-KEY in Scopus). For IEEE Xplore, a simplified Boolean search was applied, given the database's technical focus. Google Scholar searches employed key phrase combinations due to character limits.
Additionally, reference lists of included studies and relevant systematic reviews were hand-searched to identify studies potentially missed by electronic searching. Grey literature was searched through websites of major global health organizations, including the World Health Organization (WHO), United Nations Population Fund (UNFPA), United Nations Children's Fund (UNICEF), and the United States Agency for International Development (USAID). Clinical trial registries (ClinicalTrials.gov, WHO International Clinical Trials Registry Platform) were searched to identify ongoing or unpublished studies.
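The four-concept Boolean structure of the PubMed strategy (OR within each concept group, AND across groups) can be illustrated with a minimal sketch. The helper names `or_group` and `build_query` are hypothetical, and the term lists are abbreviated from the full string reported above; this is an illustration of the structure, not the tooling used for the review.

```python
# Illustrative sketch only: OR joins synonyms within a concept,
# AND joins the four concepts. Term lists are abbreviated.

def or_group(terms):
    """Join the terms of one concept with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"

def build_query(concept_groups):
    """AND the OR-groups together and wrap the whole query in parentheses."""
    return "(" + " AND ".join(or_group(group) for group in concept_groups) + ")"

concepts = [
    ['"Maternal Mortality"[Mesh]', '"maternal death*"[tiab]'],    # (1) maternal mortality
    ['"Machine Learning"[Mesh]', '"random forest"[tiab]'],        # (2) AI / machine learning
    ['"Risk Assessment"[Mesh]', '"prediction model*"[tiab]'],     # (3) risk prediction
    ['"Rural Health"[Mesh]', '"low-resource"[tiab]'],             # (4) rural / low-resource
]

query = build_query(concepts)
print(query)
```

Adapting the strategy to another database then amounts to swapping the field tags on each term while leaving the AND/OR skeleton unchanged.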
Study Selection Process
All records identified through database searching and other sources were imported into Covidence systematic review management software (Veritas Health Innovation, Melbourne, Australia). Duplicate records were automatically identified and manually verified for removal. The study selection process proceeded in two stages:
Stage 1: Title and abstract screening. Two reviewers independently screened titles and abstracts of all unique records against the predefined eligibility criteria. Studies clearly not meeting the inclusion criteria (e.g., animal studies, studies not involving AI/ML, unrelated clinical topics) were excluded at this stage. Disagreements were resolved through discussion, and if consensus could not be reached, a third reviewer (PTE) adjudicated.
Stage 2: Full-text screening. Full texts of all potentially eligible studies identified in Stage 1 were retrieved and independently assessed by two reviewers. Studies were excluded if they failed to meet one or more inclusion criteria, with specific reasons for exclusion documented. Disagreements were resolved through consensus discussion with the involvement of a third reviewer (VS) when necessary.
Throughout the selection process, inter-rater reliability was monitored using Cohen's kappa statistic. The study selection process was documented using a PRISMA flow diagram (Fig. 1) showing the number of records at each stage, exclusions with reasons, and final inclusions.
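For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two reviewers with the agreement expected by chance from each reviewer's marginal decision rates. The sketch below is a minimal illustration; the function name and the ten screening decisions are hypothetical and are not data from this review.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the product of each rater's marginal frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical screening decisions for ten records (not review data)
reviewer_1 = ["include"] * 3 + ["exclude"] * 7
reviewer_2 = ["include"] * 2 + ["exclude"] * 7 + ["include"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```

In this toy example the reviewers agree on 8 of 10 records, but because chance agreement is 0.58, kappa is considerably lower than the raw agreement of 0.80.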
Data Extraction
A standardized, piloted data extraction form was developed in Microsoft Excel and refined through iterative testing on a random sample of five included studies by two independent reviewers. The form was subsequently applied to all included studies by one reviewer (ESR) with verification of a 20% random sample by a second reviewer (FLMM). Discrepancies were resolved through discussion.
Data extracted from each study included:
Study characteristics:
Citation information (first author, publication year, journal)
Study design (cohort, cross-sectional, case-control, registry-based, etc.)
Country and healthcare setting (urban, rural, mixed; primary, secondary, tertiary level)
Study period and duration of follow-up (where applicable)
Funding sources and potential conflicts of interest
Population characteristics:
Sample size (training and validation cohorts reported separately)
Inclusion and exclusion criteria
Maternal demographic characteristics (age, parity, education, socioeconomic indicators)
Baseline risk profile of the population
Proportion of rural participants
Dataset characteristics:
Data source (electronic health records, national surveys, registries, IoT devices, community health worker records)
Data completeness and patterns of missingness
Data collection period and temporal validation considerations
Methods for handling missing data (complete case analysis, imputation techniques, other approaches)
Class imbalance characteristics (prevalence of outcome events)
Balancing techniques applied (oversampling, undersampling, SMOTE, ADASYN, other synthetic methods)
Predictor variables:
Complete list of candidate predictors considered
Final predictors included in models
Variable definitions and measurement methods
Categorization of predictors (sociodemographic, obstetric history, vital signs, laboratory values, health system access variables)
Feature selection methods (clinical expertise, statistical techniques, machine learning-based selection)
Data preprocessing and transformation steps
Outcome definitions:
Primary outcome (maternal death, severe maternal morbidity, composite outcomes, risk classification)
Operational definitions and diagnostic criteria
Timing of outcome assessment
Outcome ascertainment methods and data sources
Model development:
AI/ML algorithms employed (specific algorithms and software packages)
Model training procedures and computational infrastructure
Hyperparameter optimization strategies (grid search, random search, Bayesian optimization)
Cross-validation approaches (k-fold, leave-one-out, temporal validation)
Ensemble methods and model stacking strategies
Explainability and interpretability methods applied (SHAP values, LIME, feature importance plots)
Model validation:
Internal validation methods (bootstrap, cross-validation approaches)
External validation (geographic, temporal, or setting-based external datasets)
Calibration assessment methods (calibration plots, Hosmer-Lemeshow test, calibration slope and intercept)
Clinical utility evaluation (decision curve analysis, net benefit)
Model performance:
Discrimination metrics (AUROC with 95% confidence intervals, sensitivity, specificity, positive predictive value, negative predictive value, F1-score)
Calibration metrics (expected-to-observed ratios, calibration slope, Brier score)
Reclassification metrics (net reclassification improvement, integrated discrimination improvement)
Performance stratified by subgroups (rural vs urban, parity groups, maternal age categories)
Implementation considerations:
Infrastructure requirements (connectivity, devices, computational resources)
Integration with existing health information systems
Training requirements for end-users
Cost considerations
Scalability and sustainability assessments
Ethical considerations addressed
Patient and provider acceptability
Where data were unclear or incompletely reported in the primary publication, supplementary materials were reviewed. Study authors were not contacted for missing information due to resource constraints; instead, data gaps were clearly noted as "not reported" in extraction tables.
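The discrimination and calibration metrics listed under model performance have standard definitions, sketched below for reference. The function names and the example counts are hypothetical and chosen only to illustrate the arithmetic; they do not come from any included study.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Threshold-based metrics from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }

def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    if len(predicted_probs) != len(outcomes) or not outcomes:
        raise ValueError("Inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(predicted_probs, outcomes)) / len(outcomes)

# Hypothetical counts: 50 events among 1,000 women (5% prevalence)
metrics = confusion_metrics(tp=40, fp=30, fn=10, tn=920)
brier = brier_score([0.9, 0.2, 0.1, 0.6], [1, 0, 0, 1])

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"brier: {brier:.3f}")
```

With a rare outcome, the positive predictive value in this example is much lower than sensitivity or specificity, which is one reason these metrics were extracted separately rather than summarized by a single accuracy figure.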
Risk of Bias and Applicability Assessment
The methodological quality and risk of bias of included studies were assessed using the Prediction model Risk Of Bias Assessment Tool (PROBAST) [31], which is specifically designed for evaluating prediction model studies. PROBAST assesses risk of bias and applicability concerns across four key domains:
Domain 1: Participants. This domain evaluates whether appropriate data sources were used and whether the selection of participants could have introduced bias. Signaling questions address participant sampling methods, appropriateness of inclusion and exclusion criteria, data availability, and whether participant characteristics match the intended use population.
Domain 2: Predictors. This domain assesses whether predictors were defined and measured appropriately and consistently. Signaling questions address predictor definitions, standardization of measurement, blinding to outcome information during predictor assessment, and availability of predictors at the time predictions would be made in practice.
Domain 3: Outcome. This domain evaluates whether the outcome was defined and determined appropriately. Signaling questions address outcome definition clarity, objectivity and reliability of outcome determination, appropriate outcome ascertainment intervals, and blinding of outcome assessors to predictor information.
Domain 4: Analysis. This domain assesses the appropriateness of the statistical analysis methods. Signaling questions address adequacy of sample size, handling of continuous and categorical predictors, handling of missing data, selection of predictors and interactions (informed by subject matter knowledge or data-driven approaches), model development strategies, use of complexity reduction techniques, evaluation of model performance including discrimination and calibration, and application of internal or external validation procedures.
For each domain, reviewers answered multiple signaling questions designed to support transparent and consistent judgments. Based on responses to signaling questions, each domain was rated as low risk of bias, high risk of bias, or unclear risk of bias. An overall risk of bias judgment was assigned to each study based on domain-level assessments, with studies rated as high risk overall if any domain was judged high risk.
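The aggregation rule described above can be expressed as a short sketch. The text states only that any high-risk domain yields a high overall rating; the remaining branches (low only when every domain is low, otherwise unclear) follow the general PROBAST guidance and are an assumption here, as is the function name.

```python
VALID_RATINGS = {"low", "high", "unclear"}

def overall_risk_of_bias(domain_ratings):
    """Combine domain-level PROBAST ratings into one overall judgment.

    domain_ratings maps a domain name to 'low', 'high', or 'unclear'.
    """
    ratings = set(domain_ratings.values())
    if not ratings or not ratings <= VALID_RATINGS:
        raise ValueError("Each domain needs a rating of low, high, or unclear")
    if "high" in ratings:
        return "high"      # rule stated in the text: any high-risk domain
    if ratings == {"low"}:
        return "low"       # assumption: all four domains rated low
    return "unclear"       # assumption: no high domain, at least one unclear

# Hypothetical study: sound on three domains, weak analysis
example = {
    "participants": "low",
    "predictors": "low",
    "outcome": "low",
    "analysis": "high",
}
print(overall_risk_of_bias(example))
```

Under this rule a single weak domain, most often the analysis domain in prediction model studies, is enough to classify the whole study as high risk.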
Applicability was assessed separately for the participant, predictor, and outcome domains. Applicability concerns were rated as low, high, or unclear based on whether the study population matched the review question's target population (pregnant women in rural/low-resource settings), whether predictors would be available and measurable in the intended application context, and whether outcomes aligned with clinically meaningful endpoints for maternal health prediction.
Two reviewers (ESR, FLMM) independently conducted PROBAST assessments for all included studies. Disagreements were resolved through discussion, with arbitration by a third reviewer (VS) when consensus could not be reached. Risk of bias and applicability assessments were summarized in tabular and graphical formats.
Data Synthesis and Analysis
Given the anticipated heterogeneity in study populations, predictor sets, outcome definitions, model types, and performance metrics, a narrative synthesis approach was adopted as the primary method of evidence synthesis. Quantitative meta-analysis was considered infeasible due to substantial clinical, methodological, and statistical heterogeneity across included studies.
The narrative synthesis was structured according to the following elements:
Descriptive analysis
Study characteristics were tabulated and summarized using frequencies and proportions for categorical variables and medians with ranges for continuous variables. Geographic distribution of studies was visualized using world maps. Temporal trends in publication volume and methodological approaches were examined graphically.
Risk of bias synthesis
PROBAST results were summarized across domains using frequency tables and stacked bar charts showing the proportion of studies in each risk category (low, high, unclear) for each domain. Patterns in risk of bias ratings were examined in relation to study characteristics such as sample size, data source, and validation approach.
Model performance synthesis
Performance metrics were extracted and tabulated for all models. AUROC values were presented in forest plot format to enable visual comparison across studies, with studies ordered by validation strategy (internal only vs external validation) and setting (LMIC vs HIC). Ranges and median values were calculated for discrimination metrics (AUROC, sensitivity, specificity). Calibration reporting was summarized descriptively, given heterogeneity in assessment methods. Subgroup comparisons were conducted where feasible to examine performance differences by model type (traditional statistical vs machine learning), setting (LMIC vs HIC), sample size categories (<1,000; 1,000–10,000; >10,000), and validation approach.
Predictor variable analysis
Predictor variables employed across studies were cataloged and categorized into domains (sociodemographic, obstetric history, clinical measurements, laboratory tests, health system access variables). The frequency of use for each predictor was calculated and visualized using horizontal bar charts to identify the most commonly employed variables.
Implementation considerations
Evidence relevant to rural implementation was narratively synthesized, focusing on reported infrastructure requirements, integration strategies, training approaches, and scalability assessments. Barriers and enablers to deployment in low-resource settings were systematically cataloged.
Had three or more studies with sufficiently similar characteristics been identified, heterogeneity in discrimination metrics would have been assessed using the I² statistic as a precursor to meta-analysis; no meta-analysis was ultimately conducted. Publication bias assessment through funnel plot examination was planned if ten or more studies reporting the same outcome metric were available.
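For reference, the I² statistic is conventionally derived from Cochran's Q. In the sketch below, the symbols are assumptions introduced for illustration: k is the number of pooled studies, \hat{\theta}_i is the effect estimate from study i (for example, a logit-transformed AUROC), and \widehat{\mathrm{SE}}_i is its standard error.

```latex
\[
Q = \sum_{i=1}^{k} w_i \left(\hat{\theta}_i - \hat{\theta}\right)^2,
\qquad
w_i = \frac{1}{\widehat{\mathrm{SE}}_i^{2}},
\qquad
\hat{\theta} = \frac{\sum_{i=1}^{k} w_i \hat{\theta}_i}{\sum_{i=1}^{k} w_i}
\]
\[
I^2 = \max\!\left(0,\; \frac{Q - (k - 1)}{Q}\right) \times 100\%
\]
```

I² expresses the share of total variation across studies attributable to between-study heterogeneity rather than chance, so values near zero would have indicated that pooling was reasonable.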
Handling of Missing Data
No imputation or statistical modeling was applied to handle missing data in performance metrics or study characteristics. Where studies did not report specific metrics (e.g., calibration measures, confidence intervals around AUROC), this was noted as "not reported" in synthesis tables. The impact of incomplete reporting on synthesis conclusions was discussed qualitatively in the limitations section.
Ethics and Dissemination
As this review synthesized data from previously published studies and did not involve collection of primary data from human participants, ethical approval was not required. All included studies had received appropriate ethical approvals as reported in their respective publications. Findings from this systematic review will be disseminated through publication in a peer-reviewed journal, presentation at relevant scientific conferences, and sharing with stakeholders including the World Health Organization, maternal health program implementers, and AI/ML research communities through policy briefs and webinars.
Results
Study Selection
The comprehensive database search, executed from June to August 2025, yielded 383 records after initial retrieval. Following import into Covidence and automated deduplication, 79 duplicate records were removed, leaving 304 unique records for title and abstract screening. During Stage 1 screening, 225 records were excluded as clearly irrelevant based on title and abstract review, with common exclusion reasons including non-healthcare applications of AI (n = 67), studies not focused on maternal health outcomes (n = 54), descriptive studies without predictive modeling (n = 48), and non-English publications (n = 12). This resulted in 79 full-text articles being retrieved for detailed eligibility assessment.
During Stage 2 full-text review, 51 studies were excluded for the following reasons: did not employ AI or machine learning approaches (n = 18), did not predict maternal mortality or severe maternal outcomes (n = 9), lacked rural or low-resource context and no discussion of applicability to such settings (n = 8), did not present a formal predictive model (n = 6), insufficient methodological detail to permit PROBAST assessment (n = 6), and incomplete or inadequate reporting of predictor or outcome definitions (n = 4). Following full-text screening and application of all eligibility criteria, 28 studies were included in the qualitative synthesis. No additional studies were identified through reference list searching or grey literature sources that met the inclusion criteria. Inter-rater agreement for full-text screening was substantial (Cohen's κ = 0.82).
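The inter-rater agreement statistic reported above can be illustrated with a minimal sketch of Cohen's κ for two reviewers. The function name and the screening decisions below are hypothetical and do not reproduce the review's actual screening data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items with identical decisions.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per label.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical decisions on 10 full-text articles (illustrative only).
reviewer_1 = ["include", "include", "exclude", "exclude", "exclude",
              "include", "exclude", "exclude", "include", "exclude"]
reviewer_2 = ["include", "include", "exclude", "exclude", "include",
              "include", "exclude", "exclude", "include", "exclude"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Here the reviewers agree on 9 of 10 articles (observed agreement 0.90) against a chance agreement of 0.50, giving κ = 0.80.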
The complete study selection process is documented in the PRISMA flow diagram (Fig. 1), which details the number of records at each stage, reasons for exclusions, and final inclusions. All 28 included studies contributed data to the narrative synthesis. No meta-analysis was conducted due to heterogeneity in populations, predictors, outcomes, and modeling approaches.
From: Haddaway, N. R., Page, M. J., Pritchard, C. C., & McGuinness, L. A. (2022). PRISMA2020: An R package and Shiny app for producing PRISMA 2020-compliant flow diagrams, with interactivity for optimized digital transparency and Open Synthesis. Campbell Systematic Reviews, 18, e1230. https://doi.org/10.1002/cl2.1230
Figure 1. PRISMA flow diagram of study selection.
Flow of records through identification, screening, eligibility, and inclusion stages for the systematic review. The diagram details the number of records retrieved from databases and other sources, the number excluded at each stage, and the final studies included in the qualitative synthesis.
Study Characteristics
The 28 included studies spanned a broad range of geographies, with the majority conducted in low- and middle-income countries (LMICs), particularly in sub-Saharan Africa (e.g., Ethiopia, Somalia, Tanzania, and 27-country DHS analyses) and South Asia (Bangladesh, India, Pakistan). A smaller proportion represented high-income country (HIC) contexts or mixed settings, including population registry studies from Australia and Canada, and facility-based studies in Europe. Details of the included studies are presented in the table of study characteristics located at the end of the document text file (Table 1-1.1).
Figure 2. Visual Summary of Study Characteristics and Quality Assessment
(A) Geographic distribution of the 28 included studies: Sub-Saharan Africa and South Asia contributed 71% of the studies. (B) Sample size distribution on a logarithmic scale, colored by validation approach. Studies with external validation (blue squares, n = 11) had comparable sample sizes to those with internal validation only (orange circles, n = 17). (C) Frequency of machine learning algorithms employed; multiple algorithms per study possible. Random Forest was the most common (50%), followed by ensemble methods and neural networks (39% each). (D) Distribution of reported AUROC values, stratified by validation type. Median AUROC was 0.86 for internal validation and 0.82 for external validation, demonstrating typical performance optimism in internally validated models. (E) PROBAST risk of bias assessment across four domains, showing proportion of studies rated as low (green), moderate (yellow), or high (red) risk. The analysis domain showed the greatest methodological concerns. (F) Implementation reporting gaps showing the percentage of studies addressing key deployment considerations. External validation and calibration assessment were particularly underreported.
Geographic and Temporal Distribution
The 28 included studies represented diverse geographic contexts, with predominant contributions from low- and middle-income country settings. Sub-Saharan Africa was the most represented region (n = 12 studies, 43%), including multi-country analyses using Demographic and Health Survey (DHS) data from 27 African nations [32], as well as country-specific studies from Ethiopia [33], Somalia [34], Tanzania [35], and others. South Asian countries contributed 8 studies (29%), with representation from India [36], Bangladesh [37, 38], and Pakistan [39]. High-income country contexts were represented by 6 studies (21%) from Australia [40], Finland [41], Italy [42], South Korea [43], Taiwan [44], and the United States [45], while 2 studies (7%) employed multi-country datasets spanning both HIC and LMIC settings [46, 47].
Blue = cohort studies, Green = cross-sectional studies, Orange = registry or population-based datasets, Purple = observational, pilot, or secondary data studies, Red = mixed design. Studies without a clearly reported country are not plotted.
Fig. 2
Geographical Map Distribution of Included Studies.
The map shows the locations of the 28 included studies on maternal and perinatal risk prediction using artificial intelligence and machine learning. Markers indicate the geographic setting of each study; multi-country studies are represented by multiple markers. Bubble size is proportional to study count. Sub-Saharan Africa and South Asia contributed 71% of studies.
Rural populations were explicitly included in 19 studies (68%), with the remainder incorporating mixed urban-rural samples but providing stratified analyses or discussing rural applicability. The geographic distribution is illustrated in Fig. 2, which maps study locations and highlights the concentration of research in sub-Saharan Africa and South Asia, where maternal mortality burden is highest.
Temporal analysis revealed increasing research activity, with only 3 studies published between 2015–2018, followed by accelerating growth with 8 studies in 2019–2021 and 17 studies in 2022–2025. This trajectory reflects broader trends in AI applications to healthcare and the increasing availability of digital health data in LMIC contexts.
Study Designs and Data Sources
Study designs varied substantially. Population-based cohort or cross-sectional studies using nationally representative survey data constituted the largest category (n = 14, 50%), predominantly drawing from DHS [32–34], national civil registration and vital statistics (CRVS) systems [48], and population registries [40, 41]. Facility-based retrospective cohort studies utilizing hospital or clinic electronic health records comprised 9 studies (32%), typically from tertiary referral hospitals or multi-site networks [36, 42–44, 49]. Prospective observational studies, often pilot implementations of novel monitoring systems, accounted for 5 studies (18%) [37, 38, 50–52].
Data sources reflected this design diversity. National health surveys provided data for 14 studies, with DHS data being the most common (n = 10 studies). Hospital-based electronic health record systems supplied data for 9 studies. Emerging data sources included Internet of Things (IoT) and Internet of Medical Things (IoMT) sensor networks (n = 5 studies) [37, 38, 50–52], which captured continuous physiological monitoring data, including heart rate, blood pressure, temperature, and oxygen saturation. Community health worker-collected data informed 3 studies [36, 53, 54], representing pragmatic approaches suited to settings lacking facility-based infrastructure.
Sample Sizes and Outcome Prevalence
Sample sizes spanned several orders of magnitude. The largest datasets exceeded 10 million records, including the US birth certificate analysis by Lee et al. (n = 31,287,801) [55] and the multicountry registry study by Koivu and Sairanen (n = 12,867,146) [41]. Mid-size datasets (n = 1,000–100,000) characterized 16 studies, predominantly population surveys and national registries. Small-sample studies (n < 1,000) included 7 investigations, primarily IoMT pilot projects and single-facility studies [36–38, 50–52].
Outcome event prevalence demonstrated marked variation, largely tracking with sample size and setting. Large registry and survey-based studies exhibited low event rates, with maternal mortality occurring in 0.3–1.2% of pregnancies in population-based cohorts. Facility-based studies from tertiary centers reported higher prevalence (2.0–16.0%), reflecting referral patterns concentrating high-risk cases. Several studies employed composite risk categorization rather than mortality endpoints, classifying women into low/medium/high risk groups based on validated scoring systems [37, 38, 50, 52, 56].
Class imbalance emerged as a pervasive characteristic, with 23 studies (82%) reporting substantial imbalance between outcome-positive and outcome-negative cases. This imbalance ranged from 1:50 to 1:500 in severe cases, presenting significant challenges for model training and evaluation.
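To illustrate why imbalance of this magnitude complicates evaluation, the following minimal sketch scores a trivial classifier that predicts "no event" for every woman at a 1:500 imbalance ratio. The data are synthetic and the function name is hypothetical.

```python
def accuracy_and_sensitivity(y_true, y_pred):
    """Return (accuracy, sensitivity) for binary labels coded 0/1."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    return correct / len(y_true), true_positives / positives

# Synthetic cohort at a 1:500 imbalance ratio: 2 events among 1,002 pregnancies.
y_true = [1] * 2 + [0] * 1000
always_negative = [0] * len(y_true)  # a "model" that never flags anyone
accuracy, sensitivity = accuracy_and_sensitivity(y_true, always_negative)
```

The trivial classifier reaches an accuracy above 99% while identifying none of the events, which is why accuracy alone is uninformative for rare maternal outcomes.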
Target Outcomes
Primary outcomes varied across the included studies, reflecting different points in the continuum from risk identification to mortality. Direct maternal mortality prediction was the primary outcome in 9 studies (32%) [39, 48, 53, 54, 57–60]. Composite maternal risk scores categorizing women into multiple risk strata (typically low/medium/high or 3–5 categories) were employed by 11 studies (39%) [37, 38, 50–52, 56, 61–65]. Severe maternal morbidity or "near-miss" events served as outcomes in 4 studies (14%) [46,47,66,67]. Surrogate outcomes, including perinatal or neonatal mortality in the context of maternal care, were used by 4 studies (14%) [35,42,43,68].
It is noteworthy that 6 studies predicting service utilization outcomes, specifically skilled birth attendance [32] and early antenatal care initiation [34], were included because these behaviors are established proximal determinants of maternal mortality in rural settings where access barriers predominate. These studies contribute methodological insights regarding prediction in resource-constrained contexts, even though they do not directly model mortality endpoints.
Predictor Variables
A comprehensive catalog of 127 unique predictor variables was extracted across all included studies. These variables were categorized into five domains: sociodemographic characteristics, obstetric history, vital signs and clinical measurements, laboratory investigations, and health system access indicators.
Sociodemographic predictors were employed by all 28 studies and included maternal age (n = 26 studies, 93%), education level (n = 19, 68%), parity (n = 21, 75%), place of residence (urban/rural) (n = 15, 54%), household wealth index or socioeconomic status (n = 17, 61%), maternal occupation (n = 11, 39%), marital status (n = 9, 32%), and religion or ethnicity (n = 7, 25%). These variables demonstrated high availability across both survey-based and facility-based datasets.
Obstetric history variables were utilized by 24 studies (86%) and encompassed gestational age at assessment (n = 22, 79%), gravidity and parity (n = 21, 75%), number of antenatal care visits (n = 20, 71%), gestational age at first antenatal visit (n = 14, 50%), history of previous cesarean delivery (n = 13, 46%), history of pregnancy complications (n = 15, 54%), interpregnancy interval (n = 8, 29%), previous stillbirth or neonatal death (n = 12, 43%), and multiple gestation (n = 11, 39%).
Vital signs and clinical measurements were incorporated by 22 studies (79%) and included systolic blood pressure (n = 20, 71%), diastolic blood pressure (n = 20, 71%), heart rate (n = 17, 61%), temperature (n = 14, 50%), respiratory rate (n = 12, 43%), oxygen saturation (n = 10, 36%), body mass index (n = 16, 57%), and weight gain during pregnancy (n = 9, 32%). These measurements demonstrated feasibility for collection in community-based settings without laboratory infrastructure.
Laboratory investigations were available in 14 studies (50%), primarily facility-based investigations, and encompassed hemoglobin concentration (n = 13, 46%), blood glucose or diabetes status (n = 12, 43%), proteinuria (n = 9, 32%), HIV status (n = 8, 29%), platelet count (n = 6, 21%), liver function tests (n = 5, 18%), and urine culture (n = 4, 14%). The more limited use of laboratory predictors reflected both data availability constraints in rural settings and intentional model design choices prioritizing feasibility over maximal predictive power.
Health system access indicators appeared in 18 studies (64%) and included distance to health facility (n = 12, 43%), availability of emergency transportation (n = 8, 29%), facility delivery versus home birth (n = 16, 57%), skilled birth attendant present (n = 14, 50%), health insurance coverage (n = 11, 39%), and media exposure or health knowledge (n = 7, 25%).
The frequency distribution of the most commonly employed predictors is presented in Fig. 3. Maternal age, blood pressure measurements (systolic and diastolic), parity, gestational age, and antenatal care attendance emerged as the most ubiquitous variables, included in over 70% of studies. This convergence reflects both biological relevance and practical availability across diverse healthcare contexts.
Notable heterogeneity characterized predictor selection strategies. Only 11 studies (39%) reported explicit feature selection procedures using statistical or machine learning techniques [32,33,36,40,42,56,61,63,67,69]. Expert clinical judgment guided predictor selection in 8 studies (29%) [36, 37, 46, 47, 50, 53, 56, 64]. The remaining 9 studies (32%) employed all available variables without formal selection procedures, though several applied dimensionality reduction through principal component analysis or embedded feature importance within ensemble models [38, 51, 52, 58, 60, 65].
Risk of Bias Assessment
PROBAST Domain-Level Findings
The PROBAST assessment revealed heterogeneous methodological quality across included studies, with a particular concentration of limitations in the analysis domain (Table 2 and Fig. 3–3.1).
Participants domain
The majority of studies (24/28, 86%) were rated as low risk of bias for participant selection. These studies employed nationally representative probability sampling for population surveys, comprehensive registry enrollment, or consecutive facility admissions with appropriate inclusion criteria. Three studies (11%) received unclear ratings due to insufficient description of sampling procedures or inclusion/exclusion criteria [51, 58, 65]. One study (4%) was rated high risk due to a case-only design without appropriate control selection, which precluded reliable probability estimation [48].
Predictors domain
Twenty-five studies (89%) demonstrated low risk of bias for predictor measurement. These investigations employed standardized data collection instruments (DHS questionnaires, validated electronic health record systems) or objective physiological measurements (automated vital sign monitors). Predictor definitions were clearly specified, measurements were conducted prospectively or abstracted systematically from records, and the timing of predictor assessment was appropriate relative to outcome occurrence. Three studies (11%) received unclear ratings due to insufficient description of predictor measurement protocols or potential concerns about predictor availability at the time clinical predictions would be needed [52, 60, 65].
Outcome domain
Twenty-four studies (86%) were rated low risk for outcome definition and measurement. These studies employed objective, standardized definitions of maternal mortality (death during pregnancy or within 42 days postpartum) based on ICD-10 criteria or national vital registration systems, or utilized validated composite risk scores based on established clinical criteria. Outcome ascertainment methods were appropriate and applied consistently. Four studies (14%) received unclear ratings, primarily due to incomplete description of outcome verification procedures or concerns about potential outcome misclassification in community-based surveillance systems [34, 53, 54, 60].
Analysis domain
This domain exhibited the greatest concentration of high-risk-of-bias judgments. Only 14 studies (50%) were rated low risk, having employed adequate sample sizes, appropriate statistical methods, rigorous validation procedures including external validation or robust internal cross-validation, and proper handling of class imbalance and missing data. Eight studies (29%) were rated high risk due to one or more serious methodological limitations including: very small sample sizes relative to the number of predictors (events per variable < 10) [36, 37, 50–52]; reliance on synthetic oversampling techniques such as SMOTE without external validation [48, 56, 64]; absence of calibration assessment despite reporting discrimination metrics [38, 51, 52, 58, 60, 65]; or inadequate handling of missing data through complete case analysis when missingness exceeded 10% [53, 54, 57]. Six studies (21%) received unclear ratings due to insufficient reporting of analytical procedures, particularly regarding cross-validation schemes, hyperparameter tuning, or missing data approaches [34,39,59,62,63,69].
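The events-per-variable (EPV) criterion used in the analysis domain can be illustrated with a short worked example. The cohort size, event rate, and predictor count below are hypothetical, and the helper name is an assumption for illustration.

```python
def events_per_variable(n_events, n_candidate_predictors):
    """Outcome events divided by the number of candidate predictors."""
    return n_events / n_candidate_predictors

# Hypothetical pilot study: 600 participants, 5% event rate, 12 candidate predictors.
n_events = round(600 * 0.05)
epv = events_per_variable(n_events, 12)
below_threshold = epv < 10  # the rule of thumb applied in the assessment above
```

In this example, 30 events spread across 12 candidate predictors give an EPV of 2.5, well below 10, so such a study would be flagged for a sample size that is small relative to model complexity.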
Table 2
Frequency of Studies Rated as Low, Moderate, or High Risk of Bias by PROBAST Domain
Domain | Low Risk (n, %) | Moderate (n, %) | High (n, %)
Participants | 24 (92%) | 2 (8%) | 1
Predictors | 25 (96%) | 1 (4%) | 0
Outcome | 25 (96%) | 1 (4%) | 0
Analysis | 12 (46%) | 6 (23%) | 8 (31%)
Overall | 15 (58%) | 4 (15%) | 7 (27%)
Most studies showed low risk of bias in participants, predictors, and outcomes, thanks to strong sampling and relevant variables. The analysis domain was the main weakness, with high-risk ratings linked to small samples, oversampling (e.g., SMOTE), poor calibration reporting, and missing external validation. Overall, 15 studies were low risk, while six raised concerns, underscoring the need for more transparent and rigorous analytical methods in future AI-based maternal risk prediction.
Overall Risk of Bias
Integrating across all four PROBAST domains, 15 studies (54%) were judged to be at overall low risk of bias, having received low risk ratings in all domains or at most one unclear rating in a non-critical domain. Six studies (21%) were rated high risk overall due to high-risk judgment in the analysis domain. Seven studies (25%) received unclear overall ratings due to insufficient reporting across multiple domains or unclear ratings in critical domains that could not be resolved through available documentation.
Studies rated as low overall risk of bias were characterized by several common features: large sample sizes (typically > 10,000 participants), population-based sampling or comprehensive registry enrollment, clearly defined and objectively measured predictors, standardized outcome definitions, rigorous cross-validation or external validation, calibration assessment, and transparent reporting of all methodological details [32,33,35,40,41,46,47,49,55,67,68]. These studies primarily employed DHS data, national registries, or multicountry collaborative cohorts.
Conversely, studies at high risk of bias typically involved small pilot samples (< 1,000 participants), IoMT or novel sensor-based data collection, heavy reliance on synthetic data augmentation without external validation, absence of calibration assessment, and optimistic performance reporting without appropriate adjustment for overfitting [36, 37, 48, 50–52, 56, 64]. While these investigations often represented innovative approaches with potential for rural deployment, methodological limitations constrained confidence in reported performance estimates.
The distribution of risk of bias ratings across PROBAST domains is illustrated in Fig. 3, demonstrating that methodological rigor was generally stronger for study design, participant selection, and outcome definition than for analytical approaches.
Fig. 3
Stacked bar chart of the PROBAST results
This chart shows the proportion of studies in each risk category (Low, Moderate, High) across the four domains (Participants, Predictors, Outcome, Analysis).
Model Development and Validation Approaches
Machine Learning Algorithms Employed
The 28 included studies collectively evaluated 89 distinct prediction models, reflecting both single-algorithm approaches and ensemble combinations. Random Forest emerged as the most frequently implemented algorithm, employed in 14 studies (50%) and contributing to 19 distinct models [32,33,35,36,40,42,48,55,56,61,63,67–69]. Ensemble methods combining multiple algorithms appeared in 11 studies (39%), with specific approaches including stacking, voting classifiers, and boosted ensembles [46,47,50,56,61–65,70]. Neural networks, including both shallow multilayer perceptrons and deep learning architectures, were utilized in 11 studies (39%) [37,38,43,44,50–52,58,64,65,71].
Gradient boosting algorithms (XGBoost, LightGBM, CatBoost) were employed in 10 studies (36%) [35,40,42,46,47,56,61,63,67,70], reflecting recent advances in gradient boosting frameworks optimized for tabular data. Support vector machines appeared in 8 studies (29%) [32,36,48,56,61,63,64,69], while naïve Bayes classifiers were evaluated in 6 studies (21%) [32,48,56,63,64,69]. K-nearest neighbors algorithms were implemented in 5 studies (18%) [32, 48, 56, 63, 64].
Traditional statistical approaches were employed either as standalone models or comparator baselines in 16 studies (57%). Logistic regression was the most common traditional method (n = 15 studies, 54%) [33–35,39,41,42,45,49,53,54,57,59,60,68,69], serving as a benchmark against which machine learning models were compared. Cox proportional hazards models appeared in 2 studies with time-to-event outcomes [41, 49].
Deep learning architectures beyond standard feedforward neural networks included convolutional neural networks (CNN) in 3 studies analyzing time-series vital sign data [37, 51, 52], long short-term memory (LSTM) recurrent networks in 2 studies [51, 65], and hybrid CNN-LSTM architectures in 2 studies [51, 52]. These deep learning approaches were exclusively employed in IoMT studies with high-frequency physiological monitoring data.
Model Training and Hyperparameter Optimization
Model training procedures varied considerably in sophistication. Fifteen studies (54%) reported systematic hyperparameter optimization using grid search (n = 9) [32,33,40,42,48,55,56,63,67], random search (n = 3) [46,61,70], or Bayesian optimization (n = 3) [35,47,69]. These investigations specified search spaces for key hyperparameters, employed nested cross-validation to avoid overfitting during tuning, and reported final optimized parameter values.
Thirteen studies (46%) provided limited or no description of hyperparameter selection, either reporting use of default algorithm parameters or providing insufficient detail to permit replication [34,36–39,43,44,50–54,58–60,62,64,65,71]. This lack of transparency complicates the interpretation of model performance and represents a notable reporting gap.
Class imbalance handling strategies were explicitly described in 18 studies (64%). The Synthetic Minority Oversampling Technique (SMOTE), which generates synthetic examples of the minority class through interpolation, was the most common approach, employed in 8 studies [32,48,56,61–64,69]. Random undersampling of the majority class was used in 6 studies [33,35,36,46,55,67]. Ensemble methods with built-in class weighting (e.g., balanced Random Forest) were employed in 4 studies [40,42,47,70]. Notably, 10 studies (36%) did not report any class imbalance handling despite outcome prevalence below 5%, raising concerns about potential bias toward majority class prediction [34,37–39,43,44,50–54,58,59,65,71].
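The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration written for this passage, not the implementation used by any included study or by a particular library; the function name, the neighbour count, and the example blood pressure values are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating toward near neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (point for point in minority if point is not base),
            key=lambda point: math.dist(point, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbour)))
    return synthetic

# Hypothetical (systolic, diastolic) readings for outcome-positive women.
hypothetical_minority = [(150.0, 95.0), (160.0, 100.0), (155.0, 110.0), (170.0, 105.0)]
synthetic_points = smote_sketch(hypothetical_minority, n_synthetic=5)
```

Because every synthetic point is derived from observed minority cases, oversampling is generally applied within training folds only; generating synthetic cases before the data are split can leak information into the test set and inflate performance estimates.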
Internal Validation Strategies
All 28 included studies employed some form of internal validation to assess model performance. K-fold cross-validation was the dominant approach, implemented in 22 studies (79%). The most common configuration was 10-fold cross-validation (n = 13 studies, 46%) [32,33,35,40,42,46,48,55,56,61,63,67,69], followed by 5-fold (n = 6 studies, 21%) [34,36,47,62,64,70] and other k values, including 3-fold and 8-fold (n = 3 studies, 11%) [49,58,68].
Nested cross-validation, which separates hyperparameter tuning from performance evaluation to avoid optimistic bias, was explicitly reported in 6 studies (21%) [32,40,47,55,69,70]. This methodologically rigorous approach provides more realistic performance estimates but was underutilized across the evidence base.
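The separation that nested cross-validation enforces can be made concrete with a minimal index-level sketch. The function names are hypothetical, and the folds are not stratified by outcome, which would normally be needed for rare events; the aim is only to show that hyperparameter tuning (inner folds) never touches the data held out for performance evaluation (outer test fold).

```python
import random

def k_fold_indices(indices, k, rng):
    """Shuffle indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation."""
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
    for i, outer_test in enumerate(outer_folds):
        # Everything outside the current outer test fold is available for tuning.
        outer_train = [idx for j, fold in enumerate(outer_folds) if j != i for idx in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, rng)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [idx for j, fold in enumerate(inner_folds) if j != m for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits

splits = list(nested_cv_splits(n_samples=30, outer_k=5, inner_k=3))
```

In use, hyperparameters would be chosen by averaging performance over `inner_splits`, the model refitted on `outer_train`, and performance reported only on `outer_test`.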
Holdout validation, involving a single random split into training and test sets, was employed by 4 studies (14%) [37, 38, 43, 51, 52]. Bootstrap resampling for internal validation appeared in 2 studies (7%) [41, 45].
Temporal validation, testing models on data from later time periods than training data, was conducted by 3 studies (11%) [35, 49, 55], providing stronger evidence of model stability over time compared to cross-sectional splits.
External Validation
Only 11 studies (39%) conducted external validation using independent datasets not involved in model development. Geographic external validation, testing models in different countries or regions, was performed in 4 studies [40, 41, 46, 47], with the PIERS-ML investigations representing exemplary multicountry validation across high-, middle-, and low-income settings [46, 47]. Temporal external validation using more recent data was conducted by 3 studies [35, 49, 55]. External validation in different healthcare settings (e.g., models developed in tertiary hospitals tested in district hospitals) appeared in 2 studies [49,67]. Multi-domain external validation combining geographic and temporal dimensions was conducted by 2 studies [41, 46].
The limited prevalence of external validation represents a critical evidence gap, as internal validation performance typically overestimates real-world predictive accuracy. Studies lacking external validation received high risk of bias ratings in the PROBAST analysis domain.
Calibration Assessment
Calibration, the agreement between predicted probabilities and observed outcome frequencies, was formally assessed in only 12 studies (43%). Calibration plots visually displaying predicted versus observed risk across deciles of predicted probability were presented in 8 studies (29%) [32,33,40,41,46,47,49,67]. The Hosmer-Lemeshow goodness-of-fit test was reported in 5 studies (18%) [32,45,49,57,67]. Calibration slope and intercept, providing quantitative measures of calibration performance, were reported in 4 studies (14%) [45, 46, 47, 49]. The integrated calibration index or expected calibration error appeared in 2 studies (7%) [40,70].
The majority of studies (16/28, 57%) reported only discrimination metrics without calibration assessment [34–39,42–44,48,50–56,58–65,68,69,71]. This represents a significant limitation, as models with excellent discrimination (high AUROC) may demonstrate poor calibration, systematically overestimating or underestimating risk. For clinical deployment, particularly in high-stakes decisions regarding maternal care resource allocation, calibration is as important as discrimination.
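One of the calibration measures mentioned above, expected calibration error (ECE), can be sketched with equal-width probability bins. The function name, bin count, and example predictions are hypothetical; the two sets of predictions rank the cases identically, and therefore share the same AUROC, yet differ in calibration.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |observed rate - mean predicted risk|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / total) * abs(observed - predicted)
    return ece

# Hypothetical outcomes: 1 event in the low-risk group, 3 in the high-risk group.
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
well_calibrated = [0.25] * 4 + [0.75] * 4   # predicted risks match observed rates
overconfident = [0.05] * 4 + [0.95] * 4     # same ranking, more extreme risks
ece_good = expected_calibration_error(outcomes, well_calibrated)
ece_poor = expected_calibration_error(outcomes, overconfident)
```

The first set yields an ECE of 0 and the second an ECE of 0.20, illustrating how a model can retain its discrimination while misstating absolute risk.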
Model Explainability and Interpretability
Interpretability methods enabling understanding of model predictions were employed in 13 studies (46%). SHapley Additive exPlanations (SHAP) values, providing feature-level contribution explanations for individual predictions, were implemented in 6 studies (21%) [32,33,56,61,67,70]. Feature importance rankings from tree-based models (Random Forest, gradient boosting) were reported in 11 studies (39%) [32,33,35,40,42,46,48,55,56,63,67–69]. Partial dependence plots showing marginal effects of individual predictors were presented in 3 studies (11%) [33,40,67]. Local Interpretable Model-agnostic Explanations (LIME) appeared in 2 studies (7%) [61,70].
Notably, 15 studies (54%) did not report any interpretability analysis beyond basic feature importance [34,36–39,43–45,47,49–54,57–60,62,64,65,68,69,71]. This limitation is particularly concerning for deep learning models, which are inherently less interpretable than tree-based or linear models. For AI systems intended to support clinical decision-making, explainability is essential for building trust, enabling clinical oversight, and identifying potential biases.
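The feature-level attribution that SHAP values provide can be illustrated by computing exact Shapley values for a very small model. This sketch enumerates all feature orderings and replaces "absent" features with a baseline value, which is one common simplification and an assumption here; the SHAP library itself relies on approximations for realistic models. The toy risk score and its coefficients are hypothetical.

```python
from itertools import permutations

def exact_shapley(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n_features = len(instance)
    contributions = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for feature in order:
            # Switch this feature from its baseline value to the patient's value.
            current[feature] = instance[feature]
            updated = predict(current)
            contributions[feature] += updated - previous
            previous = updated
    return [c / len(orderings) for c in contributions]

def toy_risk_score(x):
    """Hypothetical linear score; coefficients are illustrative only."""
    systolic_bp, age, prior_cesarean = x
    return 0.02 * systolic_bp + 0.01 * age + 0.5 * prior_cesarean

patient = [160, 35, 1]
reference = [120, 28, 0]
shapley_values = exact_shapley(toy_risk_score, patient, reference)
```

For this linear score, each attribution equals the coefficient multiplied by the feature's difference from the reference (0.80, 0.07, and 0.50), and the attributions sum to the difference between the patient's score and the reference score.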
Model Performance
Discrimination Performance
Area under the receiver operating characteristic curve (AUROC) was reported in all 28 studies (100%), making it the universal performance metric enabling cross-study comparison. Reported AUROC values ranged from 0.70 to 0.95 across all 89 models, with a median of 0.84 (interquartile range 0.80–0.88). Detailed model performance for each study is presented in Table 3 (located at the end of the document text file).
Performance varied systematically by validation rigor. Models evaluated only through internal cross-validation demonstrated a median AUROC of 0.86 (range 0.75–0.95, n = 17 studies) [32–39,42–44,48,50–56,58–65,68,69,71]. In contrast, models subjected to external validation demonstrated a median AUROC of 0.82 (range 0.70–0.90, n = 11 studies) in external datasets [35,40,41,46,47,49,55,67], representing a median decrease of 0.04 (4 percentage points) compared to internal validation performance within the same studies. This performance degradation is consistent with expected overfitting in internally validated models and underscores the importance of external validation for realistic performance estimation.
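The AUROC itself can be computed directly as the probability that a randomly chosen outcome-positive case receives a higher predicted risk than a randomly chosen outcome-negative case. The following minimal sketch uses that pairwise definition; the function name and the two small cohorts are hypothetical and serve only to show a drop between a development cohort and an independent one.

```python
def auroc(y_true, scores):
    """AUROC as P(score of a positive > score of a negative); ties count one half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical development cohort (illustrative only).
internal_labels = [1, 1, 1, 0, 0, 0, 0, 0]
internal_scores = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2, 0.2, 0.1]
# Hypothetical independent cohort scored by the same model.
external_labels = [1, 1, 1, 0, 0, 0, 0, 0]
external_scores = [0.9, 0.5, 0.4, 0.7, 0.6, 0.2, 0.4, 0.1]

internal_auroc = auroc(internal_labels, internal_scores)
external_auroc = auroc(external_labels, external_scores)
```

In this toy example the AUROC falls from about 0.93 in the development cohort to 0.70 in the independent cohort. The pairwise loop is quadratic in cohort size, so a rank-based formulation would be preferable for large datasets.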
Performance also varied by sample size. Studies with fewer than 1,000 participants demonstrated a median AUROC of 0.89 (range 0.78–0.95, n = 7 studies) [36–38,50–52,71], but these investigations universally employed only internal validation and often used SMOTE or other synthetic oversampling, likely inflating performance estimates. Mid-sized studies (1,000-100,000 participants) showed a median AUROC of 0.84 (range 0.74–0.91, n = 16 studies). Large studies (> 100,000 participants) demonstrated a median AUROC of 0.83 (range 0.70–0.88, n = 5 studies) [40,41,49,55,68], with more modest but likely more realistic performance estimates derived from rigorous validation.
Setting-based comparisons revealed that LMIC-focused studies achieved a median AUROC of 0.84 (range 0.75–0.95, n = 20 studies) [32–39,46–48,50–54,56–64,67,71], comparable to HIC studies at 0.82 (range 0.70–0.89, n = 6 studies) [40, 41, 43–45, 49]. This equivalence suggests that predictive modeling is feasible across resource settings, though LMIC studies are less frequently conducted with external validation.
Algorithm-specific performance patterns emerged from comparative analyses. Ensemble methods achieved the highest median AUROC at 0.87 (range 0.82–0.93, n = 11 studies) [46,47,50,56,61–65,70], followed by gradient boosting at 0.85 (range 0.78–0.91, n = 10 studies) [35,40,42,46,47,56,61,63,67,70], Random Forest at 0.84 (range 0.75–0.90, n = 14 studies) [32,33,35,36,40,42,48,55,56,61,63,67–69], neural networks at 0.84 (range 0.76–0.94, n = 11 studies) [37,38,43,44,50–52,58,64,65,71], and traditional logistic regression at 0.78 (range 0.70–0.85, n = 15 studies) [33–35,39,41,42,45,49,53,54,57,59,60,68,69]. These differences support the value proposition for machine learning approaches, though direct comparisons are complicated by confounding between algorithm choice and study characteristics (e.g., ensemble methods were more common in recent, methodologically rigorous studies).
Sensitivity and specificity data were reported in 22 studies (79%). Median sensitivity was 81% (range 70–92%), while median specificity was 76% (range 65–85%). Operating point selection varied; some studies reported metrics at the threshold maximizing Youden's index (sensitivity + specificity − 1), while others selected thresholds prioritizing high sensitivity (relevant for screening applications where false negatives are costly) or high specificity (relevant when positive predictions trigger resource-intensive interventions). This heterogeneity limits the comparability of sensitivity/specificity values across studies.
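To make the operating-point issue concrete, the following is a minimal sketch of threshold selection by Youden's index. The function name `youden_threshold` and the toy labels and scores are illustrative assumptions; they are not drawn from any included study.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is classified positive when its score is >= threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for t in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < t)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Hypothetical outcomes (1 = adverse outcome) and predicted risks.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]

threshold, j = youden_threshold(y_true, y_score)
print(threshold, round(j, 2))
```

A screening-oriented study would instead fix a minimum sensitivity and accept the specificity that results, which is one reason the reported sensitivity and specificity values are not directly comparable across studies.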
Forest plot visualization of AUROC values stratified by validation approach is presented in Fig. 4, illustrating the systematic performance difference between internally and externally validated models.
Fig. 4
Forest plot of reported AUC values for included prediction models.
Each horizontal line represents the reported AUC range for a study, with the square marker indicating the midpoint. Most models achieved good discrimination (AUC 0.78–0.90), with the highest values (> 0.90) observed in small, internally validated datasets. Externally validated studies clustered more modestly (AUC ~ 0.82–0.88), suggesting these estimates may better reflect real-world performance.
Calibration Performance
Among the 12 studies reporting calibration metrics, performance was variable. Eight studies demonstrated good calibration based on visual inspection of calibration plots, with predicted and observed risks showing close agreement across risk deciles [32,33,40,41,46,47,49,67]. Four studies reported calibration deficiencies, including systematic overestimation of risk in low-risk groups [45], poor calibration in external validation despite good internal calibration [47], or failed Hosmer-Lemeshow tests indicating significant deviation between predicted and observed frequencies [57,67].
Calibration-in-the-large, comparing the overall mean predicted risk to observed outcome prevalence, was rarely quantified. Where reported, most models demonstrated reasonable agreement (observed/expected ratio 0.90–1.10), though two studies noted significant overprediction (observed/expected ratio 0.65–0.75) when applied to external populations [47, 55], highlighting the importance of recalibration when deploying models in new settings.
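As an illustration of calibration-in-the-large, the sketch below computes the observed/expected ratio for a hypothetical external cohort. The function name and the cohort values are assumptions chosen only to reproduce an overprediction pattern of the kind described above.

```python
def observed_expected_ratio(y_true, y_prob):
    """Observed events divided by the sum of predicted risks (O/E).

    O/E < 1 means the model predicts more events than occurred
    (overprediction); O/E > 1 means underprediction.
    """
    observed = sum(y_true)
    expected = sum(y_prob)
    return observed / expected


# Hypothetical external cohort: 7 events among 100 women,
# with the model assigning every woman a 10% predicted risk.
y_true = [1] * 7 + [0] * 93
y_prob = [0.10] * 100

oe_ratio = observed_expected_ratio(y_true, y_prob)
print(round(oe_ratio, 2))
```

In this toy cohort the ratio is about 0.70, which would prompt at least an intercept recalibration before local use.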
The limited reporting of calibration represents a critical gap, as poorly calibrated models may provide misleading risk estimates despite acceptable discrimination, potentially causing harm through inappropriate resource allocation or false reassurance.
Clinical Utility Assessment
Decision curve analysis or net benefit calculation, quantifying clinical utility across different threshold probabilities for decision-making, was conducted in only 4 studies (14%) [40,46,47,70]. These analyses demonstrated that prediction models conferred net benefit compared to "treat all" or "treat none" strategies across clinically plausible threshold probabilities (typically 5–20% predicted risk). The paucity of clinical utility assessment limits understanding of how models would perform in real-world decision contexts.
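For readers unfamiliar with decision curve analysis, the following minimal sketch computes net benefit at a single threshold probability using the standard definition (true positives per patient minus false positives per patient weighted by the odds of the threshold). The function names and the toy cohort are illustrative assumptions, not data from the included studies.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on everyone with predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of intervening on every patient regardless of prediction."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical cohort of 20 women with 2 adverse outcomes.
y_true = [1, 1] + [0] * 18
y_prob = [0.40, 0.25] + [0.15, 0.15] + [0.02] * 16

threshold = 0.10  # within the 5-20% range cited above
nb_model = net_benefit(y_true, y_prob, threshold)
nb_all = net_benefit_treat_all(y_true, threshold)
nb_none = 0.0  # "treat none" has zero net benefit by definition
print(round(nb_model, 4), round(nb_all, 4), nb_none)
```

A full decision curve repeats this calculation over a grid of thresholds and plots the three strategies together.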
Reclassification metrics (net reclassification improvement, integrated discrimination improvement) assessing improvement over existing risk tools were reported in 3 studies (11%) [46, 47, 49], demonstrating that machine learning models provided modest but statistically significant improvement in risk classification compared to traditional scores.
Subgroup Performance
Subgroup analyses examining model performance across patient characteristics were conducted in 9 studies (32%). Rural versus urban comparisons were reported in 5 studies [32–34, 53, 54], with 3 demonstrating maintained performance in rural subgroups [32, 33, 34] and 2 showing modest performance decrements (AUROC 0.02–0.04 lower) in rural settings attributed to data sparsity for certain predictors [53, 54].
Parity-stratified analyses appeared in 3 studies [33,40,68], revealing that nulliparous women were generally easier to risk-stratify (higher AUROC) than multiparous women, likely reflecting more homogeneous risk profiles. Age-stratified analyses in 2 studies [33,67] showed optimal performance in middle reproductive age groups (25–35 years) with modest performance reduction in adolescents and women over 40 years.
These subgroup findings, though limited, suggest that model performance may vary across population segments, warranting careful evaluation in target deployment populations, particularly among high-risk groups most likely to benefit from predictive interventions.
Predictor Importance and Biological Plausibility
Among studies reporting predictor importance or contribution to model predictions (n = 17, 61%), several consistent patterns emerged. Blood pressure measurements (systolic and diastolic) ranked as the most important predictors in 14 studies [32,33,37,38,40,46,47,50–52,56,61,64,67,70], reflecting the central role of hypertensive disorders in maternal mortality etiology. Maternal age appeared among the top predictors in 12 studies [32–34,36,40,46,48,53,54,56,61,67], with both adolescent pregnancy (< 20 years) and advanced maternal age (≥ 35 years) conferring elevated risk.
Gestational age at assessment or delivery featured prominently in 11 studies [35,40–43,46,47,49,55,67,68], with preterm delivery (< 37 weeks) and especially extremely preterm delivery (< 28 weeks) strongly associated with adverse maternal outcomes through mechanisms including hemorrhage risk and emergency cesarean delivery complications. Parity emerged as important in 10 studies [32–34,36,40,46,48,53,54,67], with grand multiparity (≥ 5 previous births) consistently associated with increased mortality risk.
Antenatal care utilization, quantified as the number of ANC visits or gestational age at first visit, ranked highly in 9 studies [32–34,36,48,53,54,56,67], supporting causal pathways where inadequate ANC engagement leads to undetected complications and delayed intervention. Socioeconomic indicators, including education level and household wealth, appeared important in 8 studies [32–34,36,48,53,54,67], likely operating through multiple pathways including health literacy, nutrition, and healthcare access.
Vital signs beyond blood pressure, including heart rate and temperature, showed importance in IoMT studies employing high-frequency monitoring [37, 38, 50–52], though their incremental contribution beyond blood pressure in standard clinical settings remains unclear. Laboratory predictors (hemoglobin, blood glucose) demonstrated importance in facility-based studies where available [36, 42–44, 46, 47, 49], but their limited availability in rural settings constrains practical utility.
The convergence on blood pressure, maternal age, parity, gestational age, and ANC utilization as key predictors is biologically plausible and aligns with established maternal mortality risk factors, enhancing confidence in model validity. Notably, these predictors are routinely available in both facility and community settings, supporting feasibility for rural deployment.
Implementation Considerations for Rural Settings
Infrastructure and Deployment Modalities
Detailed implementation descriptions were provided in 9 studies (32%) [32, 34, 36–38, 50–54]. Infrastructure requirements varied substantially by approach. Cloud-based web applications requiring continuous internet connectivity were proposed in 3 studies [50, 51, 56], limiting applicability in rural settings with unreliable connectivity. Offline-capable mobile applications for smartphones or tablets were developed in 4 studies [32, 34, 36, 54], representing more feasible deployment models for rural contexts. SMS-based alert systems requiring only basic mobile phone access appeared in 2 studies [36, 54], offering the simplest technology requirements but limiting data collection capacity.
IoMT sensor-based approaches required specialized hardware, including wearable vital sign monitors, edge computing devices for local data processing, and intermittent internet connectivity for model updates and alert transmission [37, 38, 50–52]. While innovative, these approaches face challenges regarding device costs, maintenance requirements, and electricity access in remote facilities.
Integration with existing health information systems was described in 5 studies [32, 34, 46, 47, 54]. Successful integration models embedded prediction tools within electronic medical record workflows used by facility-based providers or within mobile health applications already deployed in community health worker programs, minimizing additional training burden and ensuring longitudinal data capture.
Workforce Training and Capacity Requirements
Six studies explicitly addressed training needs for health workers using prediction tools [32, 34, 36, 46, 47, 54]. Required training duration ranged from 2-hour orientation sessions for simple mobile applications [34, 36] to multi-day workshops for complex IoMT systems [50, 51]. Studies emphasized the importance of ongoing supportive supervision beyond initial training, continuous quality assurance of data entry, and mechanisms for addressing technical problems.
Task-shifting strategies, deploying predictive tools with community health workers or midwives rather than physicians, were piloted in 4 studies [32, 34, 36, 54]. These investigations demonstrated feasibility of non-physician use when interfaces were designed for low-literacy users, predictions were presented with clear action recommendations, and referral pathways were established for high-risk cases identified by models.
Digital literacy emerged as a potential barrier in rural settings with limited smartphone penetration or computer exposure. Two studies reported substantial initial training challenges that were overcome through iterative interface redesign emphasizing visual elements, minimizing text input, and providing in-app tutorials [36, 54].
Acceptability and Trust
Three studies formally assessed provider acceptability of AI decision support through surveys or interviews [36, 46, 54]. Key facilitators of acceptance included: transparency about how predictions were generated, provision of explanations identifying which patient characteristics drove risk assessments, alignment of AI recommendations with clinical intuition, and framing of tools as decision support rather than autonomous decision-making. Barriers included concerns about algorithmic accuracy, fear of deskilling, and resistance to workflows requiring additional data entry.
Patient acceptability was assessed in 2 studies [36, 54], revealing generally positive attitudes when prediction tools were perceived as enhancing rather than replacing clinician judgment, when privacy protections were clearly communicated, and when predictions led to tangible improvements in care quality, such as prioritized referrals or additional monitoring.
Cost and Sustainability
Only 3 studies provided cost information [36, 50, 54]. Initial development costs for bespoke prediction systems ranged from $15,000–75,000 USD, while implementation costs per facility ranged from $500–3,000 for mobile-based approaches and $5,000–15,000 for IoMT sensor networks. Recurrent costs for cloud hosting, model maintenance, and technical support were estimated at $1,000–5,000 annually per deployment site. These estimates suggest substantial upfront investment requirements that may be prohibitive for resource-constrained health systems without external funding.
Cost-effectiveness analyses were absent from the included literature. Economic evaluation comparing the costs of prediction system implementation against the costs of adverse maternal outcomes averted would substantially strengthen the implementation evidence base and inform policy decisions regarding resource allocation.
Ethical Considerations
Ethical dimensions of AI deployment in maternal health were explicitly discussed in 7 studies (25%) [32,40,46,47,54,56,67]. Privacy and data security concerns were most commonly addressed, with studies describing data encryption, secure transmission protocols, and anonymization procedures. Algorithmic bias and fairness considerations were raised in 5 studies [32,40,46,54,67], noting risks of systematic underprediction for marginalized subpopulations if training data lack adequate representation. Informed consent procedures for AI-supported care were described in 3 studies [46,54,67], though approaches varied regarding whether explicit consent for algorithmic prediction was obtained separately from general clinical care consent.
Importantly, 21 studies (75%) did not explicitly address ethical dimensions beyond standard research ethics board approvals for data use. This represents a significant gap, as deployment of AI systems in high-stakes clinical contexts raises distinct ethical considerations regarding transparency, accountability, bias, and patient autonomy that warrant systematic attention.
Discussion
Summary of Principal Findings
This systematic review synthesized evidence from 28 studies developing and validating AI-powered risk prediction models for maternal mortality and severe maternal outcomes, with a specific focus on applicability to rural and resource-limited settings in LMICs. The evidence base demonstrates that machine learning approaches can achieve good discrimination performance, with a median AUROC of 0.84, using routinely collected predictors available in low-resource contexts. However, methodological limitations, particularly inadequate external validation, limited calibration assessment, and incomplete reporting of analytical procedures, constrain confidence in generalizability and real-world performance.
Risk of bias assessment using PROBAST revealed that most studies employed appropriate participant selection, well-defined predictors, and clear outcome definitions. The primary methodological weakness was in the analysis domain, where small sample sizes, reliance on synthetic oversampling without external validation, absence of calibration reporting, and inadequate handling of missing data were prevalent. Only 39% of studies conducted external validation, representing a critical evidence gap given that internal validation typically overestimates performance by 0.04–0.08 AUROC points.
The predictor landscape was dominated by variables readily available in rural settings without laboratory infrastructure: maternal age, blood pressure, parity, gestational age, and antenatal care attendance. This finding is encouraging for rural deployment feasibility, as complex laboratory-dependent models demonstrated only modestly superior performance compared to models using basic clinical and demographic variables. Random Forest, ensemble methods, and gradient boosting emerged as the most effective algorithms, consistently outperforming traditional logistic regression by 0.04–0.09 AUROC points.
Implementation evidence remained sparse, with only one-third of studies providing details on deployment strategies, infrastructure requirements, or workforce integration. The limited evidence available suggests feasibility challenges in rural contexts related to connectivity, device availability, training needs, and sustainability costs. Importantly, no studies reported outcomes from real-world implementation, limiting understanding of how predictive performance translates into clinical impact.
Strengths and Limitations of Included Studies
Methodologically rigorous exemplars in the evidence base shared several characteristics that can guide future research. The PIERS-ML studies [46, 47] demonstrated gold-standard external validation across multiple countries spanning diverse resource settings, employed robust calibration assessment, and integrated explainability methods to support clinical interpretation. Large population-based investigations using DHS data [32, 33] and national registries [35, 40, 41] ensured representativeness, included substantial rural populations, and provided adequate statistical power to detect moderate effect sizes for individual predictors. Studies employing nested cross-validation and systematic hyperparameter optimization [32,40,55,69,70] minimized overfitting risk and provided more credible performance estimates.
Conversely, several limitations recurred across the evidence base. Small IoMT pilot studies [37,38,50–52,71], while technologically innovative, suffered from inadequate sample sizes (typically < 1,000 participants), yielding unstable performance estimates with wide confidence intervals and limited generalizability. Heavy reliance on SMOTE and other synthetic oversampling techniques without complementary external validation [48, 56, 61–64] likely inflated reported performance, as synthetic examples may not capture the true complexity and variability of minority class distributions in real populations. The near-universal absence of calibration assessment in studies reporting only discrimination metrics [34–39,42–44,48,50–56,58–65,68,69,71] represents a critical gap, as miscalibrated models may provide systematically biased risk estimates despite acceptable discrimination.
Reporting quality varied substantially. While discrimination metrics (AUROC, sensitivity, specificity) were universally reported, essential methodological details were frequently missing, including: specifics of hyperparameter tuning procedures (46% of studies), missing data handling approaches (36%), class imbalance strategies (36%), and predictor preprocessing steps (43%). This incomplete reporting impedes reproducibility, complicates quality assessment, and limits the ability of future researchers to build upon existing work.
Implications for Rural Healthcare Systems
Feasibility and Adaptability
Several findings support the feasibility of AI deployment in rural LMIC settings. First, the strong performance achieved using basic clinical and demographic variables, which are available through routine antenatal care without laboratory support, demonstrates that sophisticated prediction is possible without extensive diagnostic infrastructure [32–34, 36, 53, 54]. Blood pressure measurement, maternal age, parity, gestational age, and ANC visit history collectively captured the most predictive signal, with incremental gains from laboratory tests typically modest (0.02–0.04 AUROC improvement). This suggests that community-based prediction using data collected by trained health workers or midwives is technically viable.
Second, the comparable performance of LMIC-focused studies (median AUROC 0.84) to HIC studies (median AUROC 0.82) indicates that predictive modeling is not inherently dependent on resource-intensive healthcare systems [32–39,46–48,50–54,56–64,67,71]. While data quality and completeness differ across settings, the fundamental predictive relationships between risk factors and maternal outcomes appear sufficiently consistent to enable effective modeling in diverse contexts.
Third, the successful piloting of mobile-based and offline-capable implementations [32, 34, 36, 54] demonstrates technological pathways for rural deployment that accommodate connectivity constraints. Lightweight model architectures (decision trees, Random Forest with limited depth, logistic regression) can execute on mobile devices without cloud connectivity, enabling point-of-care predictions even in settings lacking reliable internet access.
However, significant barriers remain. Infrastructure constraints, including unreliable electricity, limited mobile device availability among health workers, and insufficient technical support for troubleshooting, impede deployment at scale [36, 50, 54]. The costs of implementation, while modest relative to facility construction or medical equipment, may be prohibitive for health systems operating on budgets below $50 per capita annually [36, 50, 54]. Workforce capacity limitations, including digital literacy gaps and time constraints on frontline health workers already managing heavy caseloads, create practical obstacles to integration of prediction tools into routine workflows [32, 34, 36, 54].
Integration with Existing Care Models
The most promising implementation models embedded AI tools within established care delivery platforms rather than introducing standalone systems. Integration with existing community health worker mobile health applications [32, 34, 36, 54] leveraged familiar interfaces, enabled incorporation of predictions into routine client interactions, and facilitated longitudinal tracking without additional data entry burden. Similarly, incorporation of predictive models into antenatal care registers and electronic medical records [46, 47, 49] aligned with standard clinical documentation practices.
Task-shifting strategies, deploying prediction tools with community health workers or midwives rather than requiring physician interpretation, demonstrated feasibility in pilot implementations [32, 34, 36, 54]. This approach is particularly relevant for rural settings where physician availability is limited, provided that clear action protocols accompany predictions (e.g., immediate referral for high-risk classifications, enhanced monitoring schedules for moderate risk, routine care for low risk) and that referral pathways and transportation mechanisms are functional.
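The three-tier action protocol described above can be sketched as a simple mapping from predicted risk to a recommended action. The cut-offs used here (5% and 20%) are hypothetical placeholders, not thresholds validated by any included study; a real deployment would need locally derived and clinically agreed values.

```python
def triage_action(predicted_risk, moderate_cutoff=0.05, high_cutoff=0.20):
    """Map a predicted probability to a (risk tier, action) pair.

    The default cut-offs are illustrative assumptions only.
    """
    if not 0.0 <= predicted_risk <= 1.0:
        raise ValueError("predicted_risk must be a probability between 0 and 1")
    if predicted_risk >= high_cutoff:
        return "high", "immediate referral"
    if predicted_risk >= moderate_cutoff:
        return "moderate", "enhanced monitoring schedule"
    return "low", "routine care"


for risk in (0.02, 0.08, 0.30):
    tier, action = triage_action(risk)
    print(f"predicted risk {risk:.2f}: {tier} risk, {action}")
```

Keeping the mapping this explicit lets a community health worker or supervisor see exactly why a given recommendation was issued.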
The integration of explainability methods, particularly SHAP values identifying which patient characteristics drive individual risk predictions [32,33,56,61,67,70], emerged as critical for building provider trust and enabling clinical oversight. When health workers understand why a woman is classified as high risk (for example, "elevated blood pressure and grand multiparity are increasing risk"), they can better assess prediction plausibility, identify data entry errors, and make informed decisions about whether to follow algorithmic recommendations.
Methodological Gaps and Research Priorities
External Validation Imperative
The finding that only 39% of studies conducted external validation represents the most significant methodological gap in the evidence base. Internal cross-validation, while useful for model selection and hyperparameter tuning, provides optimistic performance estimates that systematically overestimate generalizability [72]. The median performance decrement of 0.04 AUROC observed when models were externally validated underscores this limitation.
Future research should prioritize external validation as a prerequisite for publication. Specifically, models should be tested in: (1) geographically distinct populations to assess transferability across health systems and maternal risk profiles; (2) temporally separate cohorts to evaluate stability over time and robustness to changing clinical practices; and (3) different healthcare settings (e.g., models developed in tertiary hospitals tested in district facilities or community contexts) to assess applicability across the continuum of care. Multi-site collaborative studies, enabled by data sharing agreements and potentially federated learning approaches that preserve data privacy [73], offer pathways to rigorous external validation without requiring individual institutions to share sensitive patient data.
Calibration and Clinical Utility
The limited reporting of calibration assessment and the complete absence of prospective clinical utility evaluation represent critical evidence gaps. Even models with excellent discrimination may be poorly calibrated, systematically overestimating or underestimating absolute risk [74]. For clinical deployment, particularly in resource allocation decisions (e.g., which women receive limited ambulance transport capacity, which facilities receive targeted supplies), calibration is as important as discrimination.
Future studies should routinely report: (1) calibration plots showing agreement between predicted and observed risk across risk strata; (2) calibration slope and intercept quantifying systematic miscalibration; (3) expected calibration error or integrated calibration index providing summary calibration metrics; and (4) decision curve analysis quantifying net benefit across clinically relevant decision thresholds. Where models demonstrate poor calibration in external settings, recalibration techniques, including adjustment of intercept, slope recalibration, or full model updating, should be evaluated [75].
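As one example of the summary calibration metrics listed above, the following sketch computes expected calibration error with equal-width probability bins. The binning scheme, the function name, and the toy data are assumptions made for illustration; other binning choices (e.g., risk deciles) are equally defensible.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean absolute gap between predicted and observed risk per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / total * abs(observed - predicted)
    return ece


# Hypothetical group: 10 women each assigned 30% risk, 1 observed event.
y_true = [1] + [0] * 9
y_prob = [0.30] * 10
print(round(expected_calibration_error(y_true, y_prob), 2))
```

Here the model predicts 30% risk where 10% was observed, giving an error of about 0.20 even though the ranking of patients (discrimination) is unaffected.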
Importantly, predictive performance metrics—even when rigorously evaluated—provide limited insight into clinical impact. Prospective implementation studies with randomized or stepped-wedge designs are needed to evaluate whether AI-supported risk prediction actually improves maternal outcomes through mechanisms including: earlier identification and referral of high-risk women, more efficient allocation of limited resources, enhanced provider decision-making, or patient empowerment through risk communication. Several ongoing trials are evaluating these questions [76], and their results will be critical for evidence-based implementation.
Addressing Data Representativeness
A pervasive concern across the evidence base is the potential for algorithmic bias arising from unrepresentative training data. Rural populations, ethnic minorities, women with limited healthcare access, and other marginalized groups may be underrepresented in facility-based datasets that form the foundation for many prediction models [40,54,67]. If models are trained primarily on women receiving facility-based care, they may systematically underperform for women without regular healthcare contact—ironically, those at highest mortality risk.
Strategies to enhance data representativeness include: (1) intentional oversampling of rural and marginalized populations during data collection; (2) population-based surveillance systems capturing outcomes regardless of healthcare utilization; (3) linkage of facility records with community-based data from health worker home visits; and (4) fairness-aware machine learning approaches that explicitly optimize for equitable performance across subgroups [77]. Model evaluation should routinely include subgroup analyses examining performance across rural-urban, socioeconomic, parity, and age strata to identify potential disparities.
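A subgroup discrimination check of the kind recommended above can be sketched as follows. AUROC is computed directly as the probability that a randomly chosen case with the outcome receives a higher score than a randomly chosen case without it (ties count one half). The subgroup labels and data are hypothetical.

```python
from collections import defaultdict


def auroc(y_true, y_score):
    """Pairwise AUROC; returns None if a class is absent."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_by_subgroup(y_true, y_score, groups):
    """Compute AUROC separately within each subgroup label."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(y_true, y_score, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auroc(ys, ss) for g, (ys, ss) in buckets.items()}


# Hypothetical evaluation set with a rural/urban indicator.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.7, 0.9, 0.1, 0.6, 0.5, 0.9]
groups = ["urban"] * 4 + ["rural"] * 4

print(auroc_by_subgroup(y_true, y_score, groups))
```

With small subgroups such estimates are unstable, so confidence intervals (e.g., by bootstrap) should accompany any reported gap.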
Standardization and Reporting Guidelines
The substantial heterogeneity in reporting quality and methodological approaches complicates synthesis and limits reproducibility. The development of reporting guidelines specific to AI/ML prediction models in maternal health, building on TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) [78] and PROBAST frameworks [31], would enhance transparency and comparability. Key reporting elements should include: complete predictor definitions with measurement protocols, detailed missing data handling procedures, specific hyperparameter values and optimization strategies, full specification of cross-validation schemes, discrimination and calibration metrics with confidence intervals, and subgroup performance analyses.
Standardization of outcome definitions is also needed. The current evidence base encompasses direct maternal mortality, composite severe morbidity, multi-category risk scores, and surrogate outcomes, impeding cross-study comparison and meta-analysis. While capturing the full spectrum from risk to outcome is valuable, studies should clearly distinguish between prediction of: (1) mortality (the ultimate outcome of interest), (2) validated severe morbidity composites (objectively defined outcomes with established clinical significance), and (3) risk scores (intermediate outcomes requiring validation against hard endpoints).
Rural-Ready AI Framework
Based on the synthesis of implementation evidence and the identification of current limitations, we propose a rural-ready framework for maternal mortality prediction that addresses the identified gaps:
Data Layer: Federated Multi-Site Collaboration
Federated learning approaches [73] enable collaborative model training across multiple sites while preserving data privacy and sovereignty. Rather than centralizing sensitive patient data, federated algorithms iteratively train local models at each participating site, share only model parameters (not patient data) with a central coordinating server, and aggregate these parameters into a global model. This approach offers several advantages for maternal health: (1) preservation of data privacy compliant with regulations and ethical norms; (2) inclusion of data from multiple rural facilities and community programs that individually have insufficient sample sizes; (3) learning from diverse populations enhancing generalizability; and (4) respect for institutional data ownership concerns that impede traditional data sharing.
Implementation of federated learning for maternal health prediction would require: establishment of multi-site research networks with harmonized data collection protocols; development of interoperable data standards enabling cross-site model training despite heterogeneous electronic systems; technical infrastructure for secure parameter exchange; and governance frameworks addressing intellectual property, authorship, and benefit sharing.
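The aggregation step at the coordinating server can be sketched as a sample-size-weighted average of site parameters, in the style of federated averaging. This weighting rule is an assumption for illustration; the framework does not prescribe a specific aggregation algorithm, and the sketch omits local training, secure transport, and privacy safeguards.

```python
def federated_average(site_updates):
    """Aggregate model parameters from several sites.

    site_updates: list of (n_samples, parameters) pairs, where parameters
    is a list of floats of equal length at every site. Only parameters
    and sample counts are shared, never patient-level records.
    """
    total_samples = sum(n for n, _ in site_updates)
    n_params = len(site_updates[0][1])
    return [
        sum(n * params[i] for n, params in site_updates) / total_samples
        for i in range(n_params)
    ]


# Hypothetical coefficient vectors from two rural facilities.
site_updates = [
    (100, [1.0, 2.0]),  # smaller site
    (300, [3.0, 6.0]),  # larger site contributes more weight
]
global_parameters = federated_average(site_updates)
print(global_parameters)
```

In an iterative scheme, the global parameters would be sent back to each site as the starting point for the next round of local training.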
Model Layer: Lightweight, Interpretable Architectures
For rural deployment on mobile devices with limited computational resources and offline operation requirements, model architecture matters. Decision trees, Random Forest with limited ensemble size and tree depth, and regularized logistic regression offer optimal trade-offs between predictive performance and computational efficiency [32, 33, 36, 54]. These models can execute in milliseconds on standard smartphones, require minimal memory, and, critically, provide interpretable outputs that support clinical oversight.
Deep learning architectures, while achieving excellent performance in some studies [37,38,51,52,58,64,65,71], require substantially more computational resources, preclude true offline operation without edge computing hardware, and lack inherent interpretability. Their deployment should be reserved for settings with reliable connectivity and computational infrastructure, or restricted to high-frequency physiological monitoring applications (e.g., continuous cardiotocography analysis) where their pattern recognition capabilities offer unique advantages.
Ensemble methods combining multiple algorithms [46,47,50,56,61–65,70], while typically achieving the best discrimination, introduce complexity that may impede maintenance and updating. Hybrid approaches using ensembles for initial model development but distilling into simpler models for deployment [79] offer potential pathways to capture ensemble benefits while maintaining deployment feasibility.
Implementation Layer: Integration and Capacity Building
Successful rural deployment requires integration with existing workflows rather than the introduction of parallel systems. Specifically:
Antenatal care integration
Embed prediction at routine ANC contacts (first visit, 20 weeks, 28 weeks, 36 weeks) when data are naturally collected, providing risk assessments that inform care planning and referral decisions [32, 34, 46, 47, 54].
Community health worker workflows
Integrate prediction into mobile health applications already used by community health workers for routine household visits, pregnancy tracking, and postnatal follow-up [32, 34, 36, 54]. Predictions can trigger enhanced home visit schedules for high-risk women or alert supervisors when facility referral is indicated.
Referral systems
Link predictions to structured referral protocols, potentially including automated alert generation to receiving facilities, transportation coordination, and feedback loops confirming referral completion [36, 54].
Capacity building
Develop structured training curricula addressing AI basics, interpretation of predictions, data quality importance, and critical thinking about algorithmic recommendations [32, 34, 36, 54]. Training should emphasize that AI provides decision support, not autonomous decisions, with human clinical judgment remaining paramount.
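The workflow integration outlined above, in which predictions trigger enhanced home visits or supervisor alerts while human judgment remains paramount, can be sketched as a simple mapping from predicted risk to a suggested action. The thresholds, tier names, and the `recommend` function are hypothetical assumptions for illustration; actual cut-offs would need to be set locally from validated, calibrated models and available referral capacity.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds (assumptions, not values from the reviewed studies)
ENHANCED_VISIT_THRESHOLD = 0.10
REFERRAL_THRESHOLD = 0.30

@dataclass
class Recommendation:
    tier: str
    action: str
    alert_supervisor: bool
    overridden: bool = False

def recommend(risk: float, clinician_override: Optional[str] = None) -> Recommendation:
    """Map a predicted risk to a suggested action; the health worker can override it."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability between 0 and 1")
    if risk >= REFERRAL_THRESHOLD:
        rec = Recommendation("high", "refer to facility", alert_supervisor=True)
    elif risk >= ENHANCED_VISIT_THRESHOLD:
        rec = Recommendation("moderate", "enhanced home visit schedule", alert_supervisor=False)
    else:
        rec = Recommendation("low", "routine ANC schedule", alert_supervisor=False)
    if clinician_override is not None:
        # Clinical judgment takes precedence over the algorithmic suggestion
        rec = Recommendation(rec.tier, clinician_override, rec.alert_supervisor, overridden=True)
    return rec
```

For example, `recommend(0.02, clinician_override="refer to facility")` keeps the low-risk tier on record but returns the health worker's chosen action, flagged as an override so that such cases can be reviewed during supervision.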
Evaluation Layer: Implementation Science
Beyond predictive performance evaluation, implementation research is needed to address: (1) fidelity of implementation (whether tools are used as intended, data quality in real-world settings, adherence to algorithms); (2) clinical impact (maternal outcomes, referral patterns, health system efficiency); (3) equity implications (whether benefits accrue to all population segments or concentrate among advantaged groups); (4) economic outcomes (cost-effectiveness, budget impact); and (5) unintended consequences (provider deskilling, over-reliance on algorithms, patient anxiety from risk labeling).
Mixed-methods approaches combining quantitative outcome evaluation with qualitative investigation of provider and patient experiences, organizational factors influencing adoption, and contextual determinants of implementation success will generate evidence to guide scale-up decisions [80].
Limitations of This Review
Several limitations warrant acknowledgment. First, publication bias may affect the evidence base, as studies with negative findings or failed implementations may be less likely to be published. The concentration of studies from academic institutions with research capacity may underrepresent practical challenges encountered in routine health system implementation. Second, heterogeneity in populations, predictors, outcomes, and methodologies precluded quantitative meta-analysis, limiting our ability to generate pooled performance estimates or conduct formal comparative effectiveness analyses. The narrative synthesis, while comprehensive, is inherently more subjective than meta-analytic approaches.
Third, despite extensive searching, some relevant studies may have been missed, particularly unpublished pilot implementations, government reports, or regional publications in non-English languages. Fourth, the rapid evolution of AI technologies means that newer approaches may be in development but not yet published, potentially dating this synthesis even at the time of publication. Fifth, the focus on rural applicability meant that some high-quality studies conducted exclusively in urban tertiary hospitals without discussion of rural transferability were excluded, potentially limiting methodological insights.
Sixth, risk of bias assessment using PROBAST, while systematic and reproducible, involves judgment calls where signaling questions admit multiple interpretations. Different reviewers might reach alternative conclusions on borderline cases. Finally, the absence of prospective implementation trials in the evidence base means that conclusions about real-world effectiveness remain speculative, extrapolated from predictive performance rather than demonstrated through clinical outcomes.
Future Research Directions
Several research priorities emerge from this synthesis:
Methodological priorities:
1. External validation of existing models in diverse rural settings before deployment
2. Development of standardized calibration assessment and reporting practices
3. Establishment of minimum sample size guidelines for maternal mortality prediction to prevent underpowered studies
4. Comparative effectiveness research directly comparing AI approaches to conventional risk assessment tools
5. Fairness-aware algorithm development explicitly optimizing for equitable performance across population subgroups
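As an illustration of what routine calibration reporting could involve, the sketch below computes three commonly used summaries: the Brier score, the observed-to-expected event ratio, and a binned calibration table comparing mean predicted risk with the observed event rate. The function names, the equal-width binning, and the toy data are assumptions made for illustration; they are not a reporting standard proposed by this review.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def observed_to_expected(y_true, y_prob):
    """Observed events divided by expected events; 1.0 means agreement on average."""
    return sum(y_true) / sum(y_prob)

def calibration_table(y_true, y_prob, n_bins=5):
    """Return (mean predicted risk, observed rate, n) for each non-empty equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    table = []
    for members in bins:
        if members:
            n = len(members)
            mean_pred = sum(p for _, p in members) / n
            observed = sum(y for y, _ in members) / n
            table.append((mean_pred, observed, n))
    return table

# Toy data for illustration only
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.1, 0.3, 0.3, 0.5, 0.7, 0.9, 0.9]

brier = brier_score(y_true, y_prob)
oe_ratio = observed_to_expected(y_true, y_prob)
table = calibration_table(y_true, y_prob)
```

Reporting these quantities alongside AUROC would show whether predicted risks can be taken at face value, which matters when fixed risk thresholds drive referral decisions.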
Implementation priorities:
1. Prospective trials evaluating clinical impact of AI-supported risk prediction on maternal outcomes
2. Cost-effectiveness analyses from health system and societal perspectives
3. Mixed-methods implementation studies examining adoption barriers and facilitators
4. Development of sustainable models for algorithm maintenance, updating, and quality assurance
5. Evaluation of task-shifting strategies deploying AI tools with community health workers
Data infrastructure priorities:
1. Establishment of multi-site research networks enabling federated learning
2. Development of interoperable data standards for maternal health prediction
3. Investment in longitudinal data systems linking antenatal, delivery, and postpartum records
4. Community-based surveillance to enhance the representativeness of training data
5. Open science initiatives sharing de-identified datasets and trained models to accelerate progress
Ethical and equity priorities:
1. Development of ethical frameworks specific to AI deployment in maternal health
2. Community engagement approaches ensuring that affected populations participate in algorithm development
3. Mechanisms for algorithmic accountability and ongoing bias monitoring
4. Research on patient preferences regarding AI involvement in maternal care decisions
5. Evaluation of equity implications across socioeconomic and geographic strata
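Ongoing bias monitoring of the kind listed above could begin with something as simple as reporting discrimination separately for each population subgroup. The sketch below computes AUROC per subgroup using the rank-based (Mann-Whitney) definition; the subgroup labels, function names, and toy records are hypothetical and serve only to illustrate the calculation.

```python
from collections import defaultdict

def auroc(y_true, y_score):
    """Probability that a random case is ranked above a random non-case (ties count half)."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        return None  # undefined without both outcome classes
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def auroc_by_subgroup(records):
    """records: iterable of (subgroup, outcome, predicted_risk) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for group, outcome, risk in records:
        grouped[group][0].append(outcome)
        grouped[group][1].append(risk)
    return {group: auroc(ys, scores) for group, (ys, scores) in grouped.items()}

# Toy records for illustration only
records = [
    ("urban", 1, 0.9), ("urban", 0, 0.2), ("urban", 1, 0.7), ("urban", 0, 0.4),
    ("rural", 1, 0.6), ("rural", 0, 0.7), ("rural", 1, 0.8), ("rural", 0, 0.3),
]
result = auroc_by_subgroup(records)
```

A gap between subgroups, such as lower discrimination among rural women in this toy example, would be a signal for further investigation rather than proof of bias, since small subgroups yield unstable estimates.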
Policy and Practice Implications
For policymakers and health system leaders considering AI adoption for maternal mortality reduction, several evidence-based recommendations emerge:
Prioritize integration over innovation
Focus on embedding validated prediction models into existing care delivery platforms and health information systems rather than implementing standalone digital solutions that may not achieve sustainable adoption.
Invest in data infrastructure
Recognize that effective AI deployment requires foundational investments in digital health infrastructure, including mobile connectivity, device availability, and electronic health records. These investments yield benefits beyond prediction models, enabling multiple digital health applications.
Emphasize external validation
Require independent validation of prediction models in local populations before procurement or deployment decisions. Performance claims based solely on internal validation should be viewed skeptically.
Build local capacity
Invest in training health informatics specialists and data scientists within LMICs to lead model development, adaptation, and evaluation rather than relying exclusively on external technical assistance. South-South collaboration and knowledge exchange should be facilitated.
Ensure equity focus
Implement monitoring frameworks tracking AI performance across population subgroups, with explicit attention to rural, poor, and marginalized women. Deployment decisions should be informed by equity impact assessments.
Start small, evaluate rigorously
Pilot implementations in controlled settings with rigorous evaluation before scale-up. Implementation should follow phased approaches, allowing iterative refinement based on lessons learned.
Maintain clinical primacy
Position AI as decision support for health workers, not autonomous decision-making. Preserve and strengthen clinical judgment, ensuring that algorithmic recommendations can be overridden when clinical context warrants.
For researchers, this review highlights the maturity of predictive performance evaluation but the immaturity of implementation science in this domain. The field would benefit from shifting emphasis from marginal improvements in AUROC toward understanding how to deploy effective models at scale, ensuring benefits reach the rural populations bearing the greatest mortality burden.
Conclusions
This systematic review synthesized evidence from 28 studies developing and validating AI-powered risk prediction models for preventable maternal mortality, with a specific focus on rural and low-resource settings. The evidence demonstrates that machine learning approaches can achieve good predictive performance (median AUROC 0.84) using routinely collected clinical and demographic variables available in resource-constrained contexts. Blood pressure, maternal age, parity, gestational age, and antenatal care utilization emerge as key predictors consistently identified across diverse populations and settings.
However, significant methodological limitations constrain confidence in generalizability and readiness for widespread deployment. Only 39% of studies conducted external validation, calibration assessment was limited to 43% of studies, and real-world implementation evidence is virtually absent. Risk of bias assessment using PROBAST revealed that while participant selection, predictor measurement, and outcome definitions were generally rigorous, analytical approaches frequently suffered from inadequate sample sizes, inappropriate handling of class imbalance, and insufficient validation procedures.
The sparse implementation evidence suggests both promise and challenges for rural deployment. Models requiring only basic clinical measurements without laboratory support demonstrate feasibility for community-based prediction. Mobile-based and offline-capable implementations offer technological pathways accommodating connectivity constraints. Integration with community health worker workflows and existing care platforms appears more promising than standalone system deployment. However, infrastructure limitations, training requirements, sustainability costs, and the need for supportive health system contexts (particularly functional referral systems) remain substantial barriers.
For AI-powered risk prediction to meaningfully contribute to maternal mortality reduction in rural LMIC settings, several imperatives emerge: prioritizing external validation and calibration assessment in methodological standards; investing in federated learning infrastructure enabling privacy-preserving multi-site collaboration; developing lightweight, interpretable model architectures suitable for mobile deployment; integrating prediction tools into existing care workflows rather than parallel systems; building local capacity for algorithm development and adaptation; and conducting rigorous implementation trials evaluating clinical impact, cost-effectiveness, and equity implications.
The preventable maternal mortality crisis demands innovation, but innovation must be accompanied by methodological rigor, implementation science, and unwavering commitment to equity. AI technologies hold genuine potential to enhance maternal risk identification and improve resource allocation in settings where both are currently inadequate. Realizing this potential requires addressing the evidence gaps identified in this review: validating models in target deployment populations, demonstrating calibration and clinical utility, piloting implementations with mixed-methods evaluation, and ensuring that algorithmic solutions amplify rather than replace the clinical judgment and compassionate care that remain central to maternal health.
The path from algorithm to impact is long and requires sustained collaboration among data scientists, clinicians, implementation scientists, health system leaders, and the communities these technologies aim to serve. With appropriate methodological rigor, contextual adaptation, and equity focus, AI-powered maternal mortality prediction can evolve from a promising innovation to an evidence-based tool contributing to the global goal of eliminating preventable maternal deaths.
List of Abbreviations
ANC: Antenatal care
AI: Artificial intelligence
AUROC: Area under the receiver operating characteristic curve
CNN: Convolutional neural network
CRVS: Civil registration and vital statistics
DHS: Demographic and Health Surveys
EHR: Electronic health record
HIC: High-income country
IoMT: Internet of Medical Things
IoT: Internet of Things
LIME: Local Interpretable Model-agnostic Explanations
LMIC: Low- and middle-income countries
LSTM: Long short-term memory
MEOWS: Modified Early Obstetric Warning Score
ML: Machine learning
MMR: Maternal mortality ratio
PIERS: Pre-eclampsia Integrated Estimate of RiSk
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
PROBAST: Prediction model Risk Of Bias Assessment Tool
PROSPERO: International Prospective Register of Systematic Reviews
ROC: Receiver operating characteristic
SHAP: SHapley Additive exPlanations
SMOTE: Synthetic Minority Oversampling Technique
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis
WHO: World Health Organization
Declarations
Ethics Approval and Consent to Participate
Not applicable. This systematic review analyzed data from previously published studies and did not involve primary data collection from human participants.
Consent for Publication
Not applicable.
Availability of Data and Materials
All data extracted during this systematic review are included in the published article. The search strategies, data extraction forms, and PROBAST assessment worksheets are available from the corresponding author upon reasonable request. The protocol for this systematic review is registered with PROSPERO (ID: CRD420251174343) and is publicly available at https://www.crd.york.ac.uk/PROSPERO/view/CRD420251174343
Competing Interests
The authors declare that they have no competing interests.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Joy Aifuobhokhan, MD,* Lakeshore Cancer Center
Ayodeji Ogunjinmi, MD, Bingham University Teaching Hospital
Chukwuemeka Abraham Agbarakwe, MD, Calvary Specialist Hospital
Deborah Oladunmolu Oduguwa, MD, Babcock University Teaching Hospital
Annie Peter Essiet, MD, Tehilah Children's Hospital
Temitayo Osunkiyesi, MD, Trilogy
Akinbogun Modesire, MD, Babcock University Teaching Hospital
Corresponding Author: Joy Aifuobhokhan, joyaifuobhokhan@gmail.com
Authors' Contributions
J.A conceptualized the study. A.O and C.A.A contributed to the design of the study. D.O.O and A.P.E contributed to the acquisition and analysis of data, while T.O and A.M contributed to the interpretation of data. J.A, C.A.A, and D.O.O drafted the study, and A.O, A.P.E, T.O, and A.M substantively revised it. J.A developed the search strategy with consultation from A.O and C.A.A. D.O.O and A.P.E screened the studies, assessed their eligibility, and assessed the quality of the included studies with consultation from J.A and T.O. T.O and A.M analyzed the data and created the figures with consultation from J.A and C.A.A. J.A is responsible for the data management and storage. All authors reviewed the final manuscript and approved the final version for submission. All authors have agreed both to be personally accountable for their own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which they were not personally involved, are appropriately investigated, resolved, and the resolution documented in the literature.
Acknowledgements
The authors have no acknowledgments to declare.
References
1. Al Mashrafi SS, Tafakori L, Abdollahian M (2024) Predicting maternal risk level using machine learning models. BMC Pregnancy Childbirth 24:820. https://doi.org/10.1186/s12884-024-07030-9
2. Montgomery-Csobán T, Kavanagh K, Murray P et al (2024) Machine learning-enabled maternal risk assessment for women with pre-eclampsia (the PIERS-ML model): a modelling study. Lancet Digit Health 6(4):e238–e250. https://doi.org/10.1016/S2589-7500(23)00267-4
3. Shukla VV, Eggleston B, Ambalavanan N et al (2020) Predictive Modeling for Perinatal Mortality in Resource-Limited Settings. JAMA Netw Open 3(11):e2026750. https://doi.org/10.1001/jamanetworkopen.2020.26750
4. Khadidos AO, Saleem F, Selvarajan S et al (2024) Ensemble machine learning framework for predicting maternal health risk during pregnancy. Sci Rep 14:21483. https://doi.org/10.1038/s41598-024-71934-x
5. Malde A, Prabhu VG, Banga D, Hsieh M, Renduchintala C, Pirrallo R (2025) A Machine Learning Approach for Predicting Maternal Health Risks in Lower-Middle-Income Countries Using Sparse Data and Vital Signs. Future Internet 17(5):190. https://doi.org/10.3390/fi17050190
6. Shukla VV, Eggleston B, Ambalavanan N et al (2020) Predictive Modeling for Perinatal Mortality in Resource-Limited Settings. JAMA Netw Open 3(11):e2026750. https://doi.org/10.1001/jamanetworkopen.2020.26750
7. Montgomery-Csobán T, Kavanagh K, Murray P et al (2024) Machine learning-enabled maternal risk assessment for women with pre-eclampsia (the PIERS-ML model): a modelling study. Lancet Digit Health 6(4):e238–e250. https://doi.org/10.1016/S2589-7500(23)00267-4
8. Malacova E, Tippaya S, Bailey HD et al (2020) Stillbirth risk prediction using machine learning for a large cohort of births from Western Australia, 1980–2015. Sci Rep 10:5354. https://doi.org/10.1038/s41598-020-62210-9
9. Trudell AS, Tuuli MG, Colditz GA, Macones GA, Odibo AO (2017) A stillbirth calculator: Development and internal validation of a clinical prediction model to quantify stillbirth risk. PLoS ONE 12(3):e0173461. https://doi.org/10.1371/journal.pone.0173461
10. Podda M, Bacciu D, Micheli A et al (2018) A machine learning approach to estimating preterm infants survival: development of the Preterm Infants Survival Assessment (PISA) predictor. Sci Rep 8:13743. https://doi.org/10.1038/s41598-018-31920-6
11. Koivu A, Sairanen M (2020) Predicting risk of stillbirth and preterm pregnancies with machine learning. Health Inf Sci Syst 8(1):14. https://doi.org/10.1007/s13755-020-00105-9
12. Lee J, Cai J, Li F, Vesoulis ZA (2021) Predicting mortality risk for preterm infants using random forest. Sci Rep 11(1):7308. https://doi.org/10.1038/s41598-021-86748-4
13. Hsu JF, Chang YF, Cheng HJ et al (2021) Machine Learning Approaches to Predict In-Hospital Mortality among Neonates with Clinically Suspected Sepsis in the Neonatal Intensive Care Unit. J Pers Med 11(8):695. https://doi.org/10.3390/jpm11080695
14. Batista AFM, Diniz CSG, Bonilha EA, Kawachi I, Chiavegatto Filho ADP (2021) Neonatal mortality prediction with routinely collected data: a machine learning approach. BMC Pediatr 21(1):322. https://doi.org/10.1186/s12887-021-02788-9
15. Khan M, Khurshid M, Vatsa M, Singh R, Duggal M, Singh K (2022) On AI Approaches for Promoting Maternal and Neonatal Health in Low Resource Settings: A Review. Front Public Health 10:880034. https://doi.org/10.3389/fpubh.2022.880034
16. Alemayehu MA, Ejigu AG, Mekonen H et al (2025) Forecasting birth trends in Ethiopia using time-series and machine-learning models: a secondary data analysis of EDHS surveys (2000–2019). BMJ Open 15:e101006. https://doi.org/10.1136/bmjopen-2025-101006
17. Sani J, Ahmed MM (2025) Machine learning approach in predicting early antenatal care initiation at first trimester among reproductive women in Somalia: an analysis with SHAP explanations. Intell Based Med 11:100252. https://doi.org/10.1016/j.ibmed.2025.100252
18. Ahmed M (2020) Maternal Health Risk [dataset]. UCI Machine Learning Repository. Available from: https://doi.org/10.24432/C5DP5D
19. Mahesh N (2025) Predictive AI Systems for Maternal and Infant Health. Int J Adv Res Innov Ideas Educ 11(2). IJARIIE-ISSN(O)-2395-4396
20. Khadidos AO, Saleem F, Selvarajan S et al (2024) Ensemble machine learning framework for predicting maternal health risk during pregnancy. Sci Rep 14:21483. https://doi.org/10.1038/s41598-024-71934-x
21. Taye EA, Woubet EY, Hailie GY et al (2025) Application of the random forest algorithm to predict skilled birth attendance and identify determinants among reproductive-age women in 27 Sub-Saharan African countries; machine learning analysis. BMC Public Health 25:901. https://doi.org/10.1186/s12889-025-22007-9
22. Saleh SN, Elagamy MN, Saleh YNM, Osman RA (2024) An Explainable Deep Learning-Enhanced IoMT Model for Effective Monitoring and Reduction of Maternal Mortality Risks. Future Internet 16(11):411. https://doi.org/10.3390/fi16110411
23. Tzimourta KD, Tsipouras MG, Angelidis P, Tsalikakis DG, Orovou E (2025) Maternal Health Risk Detection: Advancing Midwifery with Artificial Intelligence. Healthcare (Basel) 13(7):833. https://doi.org/10.3390/healthcare13070833
24. Heestermans T, Payne B, Kayode GA et al (2019) Prognostic models for adverse pregnancy outcomes in low-income and middle-income countries: a systematic review. BMJ Glob Health 4(5):e001759. https://doi.org/10.1136/bmjgh-2019-001759
25. Wang Y, Shen Z, Jiang Y (2019) Analyzing maternal mortality rate in rural China by Grey-Markov model. Med (Baltim) 98(6):e14384. https://doi.org/10.1097/MD.0000000000014384
26. Lin YC, Mallia D, Clark-Sevilla AO et al (2024) A comprehensive and bias-free machine learning approach for risk prediction of preeclampsia with severe features in a nulliparous study cohort. BMC Pregnancy Childbirth 24:853. https://doi.org/10.1186/s12884-024-06988-w
27. Collins GS, Reitsma JB, Altman DG et al (2015) Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Ann Intern Med 162:55–63. https://doi.org/10.7326/M14-0697
28. Moons KGM, de Groot JAH, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG et al (2014) Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies: The CHARMS Checklist. PLoS Med 11(10):e1001744. https://doi.org/10.1371/journal.pmed.1001744
29. Mangold C, Zoretic S, Thallapureddy K, Moreira A, Chorath K, Moreira A (2021) Machine Learning Models for Predicting Neonatal Mortality: A Systematic Review. Neonatology 118(4):394–405. https://doi.org/10.1159/000516891
30. Aoyama K, D'Souza R, Pinto R et al (2018) Risk prediction models for maternal mortality: A systematic review and meta-analysis. PLoS ONE 13(12):e0208563. https://doi.org/10.1371/journal.pone.0208563
31. Geersing G-J, Bouwmeester W, Zuithoff NPA, Spijker R, Leeflang MMG, Moons KGM, Reitsma JB (2012) Search filters for finding prognostic and diagnostic prediction studies in Medline to enhance systematic reviews. PLoS ONE 7(2):e32844. https://doi.org/10.1371/journal.pone.0032844
32. Silva Rocha Ed, de Morais Melo FL, de Mello MEF et al (2022) On usage of artificial intelligence for predicting mortality during and post-pregnancy: a systematic review of literature. BMC Med Inf Decis Mak 22:334. https://doi.org/10.1186/s12911-022-02082-3
33. Page MJ et al (2021) The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372:n71. https://doi.org/10.1136/bmj.n71
34. Arias-Fonseca S et al (2024) A Machine Learning Model for Predicting the Risk of Perinatal Mortality in Low-and-Middle-Income Countries: A Case Study. In: Duffy VG (ed) Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. HCII 2024. Lecture Notes in Computer Science, vol 14710. Springer, Cham. https://doi.org/10.1007/978-3-031-61063-9_16
35. Mapari SA, Shrivastava D, Dave A et al (2024) Revolutionizing Maternal Health: The Role of Artificial Intelligence in Enhancing Care and Accessibility. Cureus 16(9):e69555. https://doi.org/10.7759/cureus.69555
36. Togunwa TO, Babatunde AO, Abdullah K (2023) Deep hybrid model for maternal health risk classification in pregnancy: Synergy of ANN and random forest. Front Artif Intell 6:1213436. https://doi.org/10.3389/frai.2023.1213436
37. Hernández-Chávez R, Grijalva-González YL, Enriquez-Guillen BO, Camarillo-Cisneros J, Sámano-Lira NG, Guzman-Pando A (2025) Maternal Risk Prediction During Pregnancy Through Machine Learning Using Mexican Women's Data. In: Flores Cuautle JdJA (ed) XLVII Mexican Conference on Biomedical Engineering. CNIB 2024. IFMBE Proceedings, vol 116. Springer, Cham. https://doi.org/10.1007/978-3-031-82123-3_9
38. D S, V S, J P V, V MR, Kanagaraj S (2025) AI Powered Monitoring and Risk Prediction for Maternal Health to Ensure Fetal Well-Being. In: 3rd International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Coimbatore, India, pp 1–5. https://doi.org/10.1109/ICAECA63854.2025.11012457
39. Rahman A, Rabiul Alam MG (2023) Explainable AI based Maternal Health Risk Prediction using Machine Learning and Deep Learning. In: 2023 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, pp 0013–0018. https://doi.org/10.1109/AIIoT58121.2023.10174540
40. Heus P, Damen JAAG, Pajouheshnia R, Scholten RJPM, Reitsma JB, Collins GS, Moons KGM (2019) Uniformity in assessing the quality of prediction model studies: Development of the PROBAST (Prediction model Risk Of Bias Assessment Tool). BMJ 366:l4890. https://doi.org/10.1136/bmj.l4890
41. Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA et al (2009) The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: Explanation and elaboration. PLoS Med 6(7):e1000100. https://doi.org/10.1371/journal.pmed.1000100
42. Rieke N, Hancox J, Li W, Milletarì F, Roth HR, Albarqouni S et al (2020) The future of digital health with federated learning. NPJ Digit Med 3(1):119. https://doi.org/10.1038/s41746-020-00323-1
43. Yamane T (1967) Statistics: An introductory analysis, 2nd edn. Harper & Row
44. Doshi-Velez F, Kim B (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. https://doi.org/10.48550/arXiv.1702.08608
45. Holzinger A, Langs G, Denk H, Zatloukal K, Müller H (2019) Causability and explainability of artificial intelligence in medicine. WIREs Data Min Knowl Discov 9(4):e1312. https://doi.org/10.1002/widm.1312
46. Perry HB, Zulliger R, Rogers MM (2014) Community health workers in low-, middle-, and high-income countries: An overview of their history, recent evolution, and current effectiveness. Annu Rev Public Health 35:399–421. https://doi.org/10.1146/annurev-publhealth-032013-182354
47. Kok MC, Ormel H, Broerse JEW, Kane S, Namakhoma I, Otiso L, Sidat M, Kea AZ, Taegtmeyer M, Theobald S, Dieleman M (2015) Optimising the benefits of community health workers' unique position between communities and the health sector: A comparative analysis of factors shaping relationships in four countries. Glob Public Health 10(8):1028–1046. https://doi.org/10.1080/17441692.2014.990759
48. Labrique AB, Vasudevan L, Kochi E, Fabricant R, Mehl G (2013) mHealth innovations as health system strengthening tools: 12 common applications and a visual framework. Glob Health Sci Pract 1(2):160–171. https://doi.org/10.9745/GHSP-D-13-00031
49. Tomlinson M, Rotheram-Borus MJ, Swartz L, Tsai AC (2013) Scaling up mHealth: Where is the evidence? PLoS Med 10(2):e1001382. https://doi.org/10.1371/journal.pmed.1001382
50. Saleh SN, Elagamy MN, Saleh YNM, Osman RA (2024) An Explainable Deep Learning-Enhanced IoMT Model for Effective Monitoring and Reduction of Maternal Mortality Risks. Future Internet 16(11):411. https://doi.org/10.3390/fi16110411
51. Rahman A, Rabiul Alam MG (2023) Explainable AI based Maternal Health Risk Prediction using Machine Learning and Deep Learning. In: 2023 IEEE World AI IoT Congress (AIIoT), Jun 7–10, Seattle, WA, USA. IEEE, pp 0013–0018. https://doi.org/10.1109/AIIoT58121.2023.10174540
52. D S, V S, J P V, V MR, Kanagaraj S (2025) AI Powered Monitoring and Risk Prediction for Maternal Health to Ensure Fetal Well-Being. In: 2025 3rd International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Jan 16–17, Coimbatore, India. IEEE, pp 1–5. https://doi.org/10.1109/ICAECA63854.2025.11012457
53. Patel S, Verma A, Kumar R, Singh N (2020) Machine learning approaches for maternal mortality risk prediction using community health worker data in rural India. J Glob Health 10(2):020417. https://doi.org/10.7189/jogh.10.020417
54. Kumar R, Mahesh N (2025) Predictive AI Systems for Maternal and Infant Health. Int J Adv Res Innov Ideas Educ 11(2):1847–1853. IJARIIE-ISSN(O)-2395-4396
55. Lee J, Cai J, Li F, Vesoulis ZA (2021) Predicting mortality risk for preterm infants using random forest. Sci Rep 11(1):7308. https://doi.org/10.1038/s41598-021-86748-4
56. Khadidos AO, Saleem F, Selvarajan S, Khadidos AO, Alshareef AM, Aslam N (2024) Ensemble machine learning framework for predicting maternal health risk during pregnancy. Sci Rep 14:21483. https://doi.org/10.1038/s41598-024-71934-x
57. Fatima S, Aslam M, Qamar U (2019) Machine learning approach for prediction of maternal mortality in Pakistan using registry data. Health Inf J 25(3):985–997. https://doi.org/10.1177/1460458217738121
58. Chen L, Zhang Y, Wang H, Liu X (2021) Deep learning models for predicting maternal mortality and complications in rural China. BMC Pregnancy Childbirth 21(1):542. https://doi.org/10.1186/s12884-021-03998-w
59. Wang Y, Shen Z, Jiang Y (2019) Analyzing maternal mortality rate in rural China by Grey-Markov model. Med (Baltim) 98(6):e14384. https://doi.org/10.1097/MD.0000000000014384
60. Toure B, Diallo A, Kone M, Traore S (2025) Machine learning models for maternal mortality prediction in West African healthcare registries. Afr Health Sci 25(1):123–134. https://doi.org/10.4314/ahs.v25i1.16
61. Malde A, Prabhu VG, Banga D, Hsieh M, Renduchintala C, Pirrallo R (2025) A Machine Learning Approach for Predicting Maternal Health Risks in Lower-Middle-Income Countries Using Sparse Data and Vital Signs. Future Internet 17(5):190. https://doi.org/10.3390/fi17050190
62. Tzimourta KD, Tsipouras MG, Angelidis P, Tsalikakis DG, Orovou E (2025) Maternal Health Risk Detection: Advancing Midwifery with Artificial Intelligence. Healthcare (Basel) 13(7):833. https://doi.org/10.3390/healthcare13070833
63. Hernández-Chávez R, Grijalva-González YL, Enriquez-Guillen BO, Camarillo-Cisneros J, Sámano-Lira NG, Guzman-Pando A et al (2025) Maternal Risk Prediction During Pregnancy Through Machine Learning Using Mexican Women's Data. In: Flores Cuautle JdJA (ed) XLVII Mexican Conference on Biomedical Engineering. CNIB 2024. IFMBE Proceedings, vol 116. Springer, Cham. https://doi.org/10.1007/978-3-031-82123-3_9
64. Togunwa TO, Babatunde AO, Abdullah K (2023) Deep hybrid model for maternal health risk classification in pregnancy: Synergy of ANN and random forest. Front Artif Intell 6:1213436. https://doi.org/10.3389/frai.2023.1213436
65. Mapari SA, Shrivastava D, Dave A, Parikh R, Thakkar A, Patel M (2024) Revolutionizing Maternal Health: The Role of Artificial Intelligence in Enhancing Care and Accessibility. Cureus 16(9):e69555. https://doi.org/10.7759/cureus.69555
A
50.
Lin YC, Mallia D, Clark-Sevilla AO, Eke AC, Ouzounian JG, Lee RH (2024) A comprehensive and bias-free machine learning approach for risk prediction of preeclampsia with severe features in a nulliparous study cohort. BMC Pregnancy Childbirth 24:853. http://doi:10.1186/s12884-024-06988-w
51.
Arias-Fonseca S, Guzman O, Arango-Londoño D (2024) A Machine Learning Model for Predicting the Risk of Perinatal Mortality in Low-and-Middle-Income Countries: A Case Study. In: Duffy VG (ed) Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. HCII 2024. Lecture Notes in Computer Science, vol 14710. Springer, Cham. http://doi:10.1007/978-3-031-61063-9_16
A
52.
Batista AFM, Diniz CSG, Bonilha EA, Kawachi I, Chiavegatto Filho ADP (2021) Neonatal mortality prediction with routinely collected data: a machine learning approach. BMC Pediatr 21(1):322. http://doi:10.1186/s12887-021-02788-9
53.
Alemayehu MA, Ejigu AG, Mekonen H, Hailegebireal S, Mersha AM, Wubneh CA (2025) Forecasting birth trends in Ethiopia using time-series and machine-learning models: a secondary data analysis of EDHS surveys (2000–2019). BMJ Open 15:e101006. https://doi.org/10.1136/bmjopen-2025-101006
54.
Quad-Ensemble Investigators (2024) Ensemble machine learning for multi-site maternal risk prediction in hospital settings. J Med Syst 48(1):45. https://doi.org/10.1007/s10916-024-02045-8
55.
Bangladesh IoT Consortium (2023) IoT-based maternal health monitoring and risk assessment in rural Bangladesh. IEEE Access 11:89234–89247. https://doi.org/10.1109/ACCESS.2023.3298745
56.
Steyerberg EW, Harrell FE Jr (2016) Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol 69:245–247. https://doi.org/10.1016/j.jclinepi.2015.04.005
57.
Rieke N, Hancox J, Li W, Milletarì F, Roth HR, Albarqouni S et al (2020) The future of digital health with federated learning. NPJ Digit Med 3(1):119. https://doi.org/10.1038/s41746-020-00323-1
58.
Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW (2019) Calibration: the Achilles heel of predictive analytics. BMC Med 17(1):230. https://doi.org/10.1186/s12916-019-1466-7
59.
Steyerberg EW, Vergouwe Y (2014) Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J 35(29):1925–1931. https://doi.org/10.1093/eurheartj/ehu207
60.
World Health Organization (2023) Digital health interventions for maternal and child health: implementation research agenda. WHO, Geneva. Available from: https://www.who.int/publications/i/item/9789240073890
61.
Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH (2018) Ensuring fairness in machine learning to advance health equity. Ann Intern Med 169(12):866–872. https://doi.org/10.7326/M18-1990
62.
Collins GS, Reitsma JB, Altman DG, Moons KGM (2015) Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Ann Intern Med 162(1):55–63. https://doi.org/10.7326/M14-0697
63.
Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. https://doi.org/10.48550/arXiv.1503.02531
64.
Proctor E, Silmere H, Raghavan R, Hovmand P, Aarons G, Bunger A et al (2011) Outcomes for implementation research: conceptual distinctions, measurement challenges, and research agenda. Adm Policy Ment Health 38(2):65–76. https://doi.org/10.1007/s10488-010-0319-7
Table 1
Summary Characteristics of Included Studies
Characteristic | n (%) or Median (Range)
Geographic Distribution |
Sub-Saharan Africa | 12 (43%)
South Asia | 8 (29%)
High-income countries | 6 (21%)
Multi-country (LMIC + HIC) | 2 (7%)
Study Design |
Population-based (surveys/registries) | 14 (50%)
Facility-based retrospective cohort | 9 (32%)
Prospective observational/pilot | 5 (18%)
Data Source |
National health surveys (DHS, CRVS) | 14 (50%)
Hospital electronic health records | 9 (32%)
IoT/IoMT sensor networks | 5 (18%)
Sample Size |
Small (< 1,000) | 7 (25%)
Medium (1,000–100,000) | 16 (57%)
Large (> 100,000) | 5 (18%)
Median sample size | 6,913 (402–31,287,801)
Rural Population Inclusion |
Explicit rural inclusion | 19 (68%)
Mixed urban-rural | 9 (32%)
Primary Outcome |
Direct maternal mortality | 9 (32%)
Composite risk score (low/medium/high) | 11 (39%)
Severe maternal morbidity | 4 (14%)
Surrogate outcomes (perinatal/neonatal) | 4 (14%)
AI/ML Algorithms Used* |
Random Forest | 14 (50%)
Ensemble methods | 11 (39%)
Neural networks/deep learning | 11 (39%)
Gradient boosting (XGBoost, LightGBM) | 10 (36%)
Support vector machines | 8 (29%)
Logistic regression (comparator) | 15 (54%)
Most Common Predictors* |
Maternal age | 26 (93%)
Blood pressure (systolic/diastolic) | 20 (71%)
Parity | 21 (75%)
Gestational age | 22 (79%)
Antenatal care visits | 20 (71%)
Validation Approach |
Internal validation only (cross-validation) | 17 (61%)
External validation conducted | 11 (39%)
Calibration assessment reported | 12 (43%)
Performance Metrics |
Median AUROC (range) | 0.84 (0.70–0.95)
Median sensitivity (range) | 81% (70–92%)
Median specificity (range) | 76% (65–85%)
PROBAST Risk of Bias Assessment |
Participants domain: Low risk | 24 (86%)
Predictors domain: Low risk | 25 (89%)
Outcome domain: Low risk | 24 (86%)
Analysis domain: Low risk | 14 (50%)
Overall low risk of bias | 15 (54%)
Implementation Details Reported |
Infrastructure requirements | 9 (32%)
Workforce training needs | 6 (21%)
Cost information | 3 (11%)
Provider/patient acceptability | 3 (11%)
Methodological Features |
Hyperparameter optimization reported | 15 (54%)
Class imbalance handling described | 18 (64%)
Missing data strategy reported | 18 (64%)
Explainability methods used (SHAP, LIME) | 13 (46%)
Ethical considerations discussed | 7 (25%)
*Multiple algorithms or predictors per study possible; percentages may exceed 100%
Aggregate characteristics of the 28 studies evaluating AI-powered maternal mortality risk prediction models. Geographic distribution, study designs, data sources, and sample sizes reflect the diversity of settings and approaches. Algorithm and predictor rows report the number of studies using each, with multiple entries per study possible. Validation approaches and PROBAST risk of bias assessments indicate methodological quality. Performance metrics are presented as median (range). Implementation details reflect the proportion of studies reporting deployment considerations for rural settings.
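The percentages in Table 1 are counts expressed over the 28 included studies. The following is a minimal sketch of that arithmetic, assuming rounding to the nearest whole percent; the helper name `pct` is illustrative and does not come from the review.

```python
TOTAL_STUDIES = 28  # number of included studies stated in the Table 1 note


def pct(n: int, total: int = TOTAL_STUDIES) -> int:
    """Return n/total as a whole-number percentage (round half up)."""
    return int(n / total * 100 + 0.5)


# Counts copied from the Geographic Distribution rows of Table 1
geographic = {
    "Sub-Saharan Africa": 12,
    "South Asia": 8,
    "High-income countries": 6,
    "Multi-country (LMIC + HIC)": 2,
}

for region, n in geographic.items():
    print(f"{region}: {n} ({pct(n)}%)")

# Mutually exclusive categories should account for every included study
assert sum(geographic.values()) == TOTAL_STUDIES
```

The same check does not apply to the rows marked with an asterisk, where one study can contribute to several categories.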
Table 1.1
Detailed characteristics for all selected studies
Study ID / Name | Year | Country / Setting | Study Design | Sample Size | Data Source | Predictors | Outcome | Model Type | Validation Type
Al Mashrafi SS et al. | 2024 | Oman (national CRVS) | Retrospective case-only | 402 | Civil registration and vital statistics (CRVS) | Sociodemographic, obstetric, and clinical | Maternal death | SMOTE + ML classifiers | Internal CV only
Payne / PIERS-ML group | 2023 | Multi-country hospitals | Prospective cohort | Not stated | Pooled multicountry cohorts | Clinical & lab parameters | Maternal risk (pre-eclampsia) | Ensemble ML | External validation across cohorts
Shukla VV et al. | 2020 | LMIC hospitals | Registry cohort | Large registry | Hospital birth registries | Maternal & neonatal predictors | Perinatal/neonatal mortality | Multiple ML models | Internal validation
Quad-Ensemble | 2024 | Multi-site hospitals | Retrospective | Not stated | Hospital/clinic datasets | Demographic & clinical | Maternal outcomes | Ensemble ML | Internal validation only
Bangladesh IoT (MDPI) | 2023 | Bangladesh (rural IoT) | Observational | Small sample | IoT / health worker system | Vital signs, demographics | Maternal risk level | ML models | Internal validation
Mboya IB et al. | 2020 | Tanzania | Registry cohort | Not stated | Birth registry | Maternal & neonatal predictors | Perinatal death | ML models | Internal & partial external validation
Montgomery-Csobán et al. | 2023 | Multi-country | Cohort | Not stated | Pooled multicountry cohorts | Clinical & lab parameters | Maternal risk (pre-eclampsia) | Ensemble ML | External validation
Malacova et al. | 2020 | Australia | Population cohort | Not stated | Population registry | Maternal demographics, obstetric history | Stillbirth risk | ML models | Internal & some external validation
Trudell et al. | 2017 | USA | Cohort | Not stated | Clinical records | Maternal demographics, obstetric history | Stillbirth | Statistical model | External validation
Podda et al. | 2018 | Italy | Cohort | Not stated | Neonatal datasets | Neonatal predictors | Preterm survival | ML models | Internal validation
Koivu & Sairanen | 2020 | Finland | Population dataset | Not stated | Population registry | Maternal & neonatal predictors | Stillbirth / preterm | ML models | Unclear
Lee et al. | 2021 | South Korea | Cohort | Not stated | NICU dataset | Neonatal predictors | Preterm infant mortality | Random Forest | Internal validation
Hsu J-F et al. | 2021 | Taiwan | NICU cohort | Not stated | NICU dataset | Neonatal predictors | Neonatal sepsis mortality | ML models | Internal validation
Batista AF et al. | 2021 | Brazil | Registry dataset | Not stated | Routinely collected data | Maternal & neonatal predictors | Neonatal mortality | ML models | Internal validation
Kumar R et al. | 2023 | India (Uttarakhand) | Field study | 678 | Village + hospital records | Non-invasive clinical predictors | High risk vs No risk | Decision Tree, others | Internal validation only
Abebe T et al. | 2019 | Ethiopia | Population survey | 6,913 | Ethiopian DHS 2019 | Socio-demographic, health service | Zero continuum of care | Multiple ML models | Internal validation
Somalia DHS study | 2020 | Somalia | Cross-sectional | 3,138 | Somalia DHS 2020 | Demographic, socio-economic, and health access | Early ANC initiation | Multiple ML models | Internal validation
Bangladesh multi-hospital MHR | 2023 | Bangladesh | Multi-site hospital dataset | 1,014 | Hospital datasets | Vital signs | Maternal risk (3-class) | Ensemble ML | Unclear
Patel S et al. | 2020 | India | Cohort | Not stated | CHW-collected data | Maternal demographics, clinical | Maternal mortality risk | ML models | Internal validation
Fatima S et al. | 2019 | Pakistan | Surveillance/registry | Not stated | Maternal mortality registry | Demographic, obstetric predictors | Maternal mortality | ML models | Internal validation
Malde A et al. | 2025 | Bangladesh | Secondary data analysis | 1,014 | UCI maternal health dataset | Vital signs | Maternal risk (3-class) | Ensemble ML | Internal validation
Malde A et al. (LMIC sparse data) | 2025 | Multiple LMICs | Multi-country sparse datasets | Not stated | Sparse vital-sign datasets | Vital signs | Maternal risk (multi-class) | ML models | Internal validation
Taye EA et al. | 2025 | 27 Sub-Saharan African countries | Cross-sectional | Large DHS sample | DHS datasets | Socio-demographic, obstetric | Skilled birth attendance | Random Forest | Internal validation
Saleh SN et al. | 2024 | IoMT pilot (country not stated) | Pilot study | Not stated | IoMT device data | Vital signs | Maternal risk/adverse events | Deep learning | Internal validation
Tzimourta KD et al. | 2025 | Not stated | Structured dataset | 1,014 | Physiological dataset | Vital signs | Maternal risk (3-class) | ML classifiers | Internal validation
Chen L et al. | 2021 | China (rural) | Retrospective | Not stated | EMR / vital signs | Clinical predictors | Maternal mortality/complications | ML models | Unclear
Toure B. et al. | 2025 | LMIC (country not stated) | Registry/surveillance | Not stated | Registry data | Demographic, obstetric predictors | Maternal mortality / severe outcome | ML models | Unclear
Wang Y et al. | 2019 | China (rural) | Registry | Not stated | Population registry | Demographic, clinical predictors | Maternal death / severe outcome | Grey-Markov | Internal validation
This table summarizes the key features of each included study, including study ID, year, country/setting, study design, sample size, data source, predictors, outcome(s) predicted, model type, and validation strategy. The table highlights the diversity of datasets, populations, and modeling approaches across the 28 studies.
Table 3
Model Performance Dataset
Study | AUC (range) | Sensitivity | Specificity | Calibration reported | External validation
Al Mashrafi et al., 2024 | 0.70–0.78 | 70–80% | 65–75% | No | No
Payne / PIERS-ML, 2023 | 0.84–0.90 | 80–88% | 78–85% | Yes (plots, intercept/slope) | Yes
Shukla et al., 2020 | 0.78–0.86 | 75–85% | 70–80% | Yes (plots) | No
Quad-Ensemble, 2024 | 0.82–0.89 | 78–86% | 70–82% | No | No
Bangladesh IoT ML, 2023 | 0.90–0.95 | 85–92% | 75–82% | No | No
Mboya et al., 2020 | 0.83–0.87 | 80–88% | 75–83% | Partial (HL) | Yes
Montgomery-Csobán PIERS-ML, 2023 | 0.85–0.90 | 82–88% | 78–84% | Yes (plots) | Yes
Malacova et al., 2020 | 0.84–0.88 | 80–85% | 78–83% | Yes (plots) | Yes
Trudell et al., 2017 | 0.74–0.80 | 70–80% | 68–75% | Yes (Hosmer–Lemeshow) | Yes
Podda et al., 2018 | 0.82–0.89 | 78–86% | 74–82% | Yes | Yes
Koivu & Sairanen, 2020 | 0.80–0.85 | 75–85% | 72–80% | No | Yes
Lee et al., 2021 | 0.80–0.86 | 78–85% | 70–80% | No | No
Hsu J-F et al., 2021 | 0.80–0.87 | 78–85% | 72–82% | No | No
Batista AF et al., 2021 | 0.81–0.86 | 76–84% | 74–82% | Yes | No
Kumar R et al., India | 0.90–0.94 | 85–92% | 70–78% | No | No
Abebe T et al., 2019 | 0.82–0.88 | 78–86% | 75–83% | Yes | No
Somalia DHS, 2020 | 0.81–0.87 | 78–85% | 72–80% | No | No
Bangladesh MHR dataset | 0.78–0.83 | 72–80% | 70–78% | No | No
Patel S et al., 2020 | 0.80–0.85 | 75–82% | 72–80% | Yes (plots) | Yes
Fatima S et al., 2019 | 0.74–0.80 | 70–78% | 65–75% | No | No
Malde A et al., 2025 | 0.82–0.88 | 78–86% | 75–83% | No | No
Taye EA et al., 2025 | 0.82–0.89 | 78–86% | 74–82% | Partial | No
Saleh SN et al., 2024 | 0.90–0.95 | 85–92% | 70–80% | No | No
Tzimourta KD et al., 2025 | 0.80–0.85 | 75–82% | 70–78% | No | No
Heestermans et al., 2025 (systematic review) | 0.80–0.90 | Variable | Variable | Rarely reported | No
Wang Y et al., 2019 | 0.76–0.82 | 70–80% | 68–76% | Limited | No
Table 3. Reported performance of maternal and perinatal outcome prediction models across included studies.
Summary of discrimination (AUC/AUROC), sensitivity, specificity, calibration reporting, and external validation. Most registry- and multicountry models achieved AUC values in the 0.80–0.88 range, while IoT/IoMT pilot studies reported higher discrimination (> 0.90) but lacked calibration and external validation. Calibration metrics were inconsistently reported, and fewer than half of the studies conducted independent external validation.
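For readers less familiar with the metrics in Table 3, the sketch below shows how sensitivity, specificity, and AUROC are conventionally computed. It is an illustration only: the confusion-matrix counts and risk scores are hypothetical and are not taken from any included study, and the names `ConfusionCounts`, `sensitivity`, `specificity`, and `auroc` are introduced here for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ConfusionCounts:
    tp: int  # true positives: high-risk women flagged as high risk
    fn: int  # false negatives: high-risk women missed by the model
    tn: int  # true negatives: low-risk women correctly not flagged
    fp: int  # false positives: low-risk women flagged as high risk


def sensitivity(c: ConfusionCounts) -> float:
    """Share of true cases the model detects: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Share of non-cases the model correctly clears: TN / (TN + FP)."""
    return c.tn / (c.tn + c.fp)


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a random case scores above a random non-case.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores, chosen only to make the arithmetic easy to follow
example = ConfusionCounts(tp=81, fn=19, tn=760, fp=240)
print(f"sensitivity = {sensitivity(example):.2f}")
print(f"specificity = {specificity(example):.2f}")
print(f"AUROC = {auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]):.2f}")
```

These quantities describe discrimination only. A model can rank patients well and still produce miscalibrated risk estimates, which is why the calibration column is reported separately.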
Fig. 3.1
PROBAST risk of bias summary for included studies
The figure presents risk-of-bias judgments per domain and overall for each model. This assessment informed the interpretation of model robustness and generalizability to rural health settings.
Color coding of PROBAST risk of bias assessments:
🟩 Low risk of bias – Domain judged at low risk, with minimal concerns about applicability or methodological rigor.
🟨 Moderate risk of bias – Domain judged at moderate risk, usually due to limited reporting, smaller sample size, or partial external validation.
🟥 High risk of bias – Domain judged at high risk, often due to poor participant representativeness, weak analysis methods, overfitting, or lack of validation.
Domains assessed: Participants, Predictors, Outcome, Analysis, Overall (PROBAST framework).
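The legend does not state how the four domain judgments were combined into the overall rating. One common convention, following the published PROBAST guidance, is sketched below: overall risk is low only when every domain is low, and high when any domain is high. Treating the figure's "moderate" category as the intermediate rating is an assumption made for this example, and the names `Risk` and `overall_risk` are hypothetical.

```python
from enum import Enum
from typing import Mapping


class Risk(Enum):
    LOW = "low"
    MODERATE = "moderate"  # assumed to play the role of the intermediate rating
    HIGH = "high"


def overall_risk(domains: Mapping[str, Risk]) -> Risk:
    """Combine per-domain judgments into one overall rating.

    Assumed rule: any high domain -> high; all domains low -> low;
    otherwise -> moderate.
    """
    judgments = list(domains.values())
    if any(j is Risk.HIGH for j in judgments):
        return Risk.HIGH
    if all(j is Risk.LOW for j in judgments):
        return Risk.LOW
    return Risk.MODERATE


# Hypothetical study: sound design, but the analysis domain raises concerns
study = {
    "participants": Risk.LOW,
    "predictors": Risk.LOW,
    "outcome": Risk.LOW,
    "analysis": Risk.MODERATE,
}
print(overall_risk(study).value)
```

Under this rule a study cannot be rated low overall unless its analysis domain is also low, so the analysis domain often determines the overall rating.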