Artificial Intelligence-Powered Risk Prediction Models for Preventable Maternal Mortality in Rural Settings: A Systematic Review

Joy Aifuobhokhan, MD,* Ayodeji Ogunjinmi, MD, Chukwuemeka Abraham Agbarakwe, MD, Deborah Oladunmolu Oduguwa, MD, Annie Peter Essiet, MD, Temitayo Osunkiyesi, MD, Akinbogun Modesire, MD.

Lakeshore Cancer Center, Bingham University Teaching Hospital, Calvary Specialist Hospital, Babcock University Teaching Hospital, Tehilah Children's Hospital, Trilogy, Babcock University Teaching Hospital.

Abstract

Background

Maternal mortality remains disproportionately high in low- and middle-income countries, particularly in rural settings with limited access to skilled obstetric care. Artificial intelligence and machine learning models offer promise for early risk prediction, yet their methodological rigor, applicability, and deployment feasibility in resource-constrained rural contexts remain inadequately synthesized. This systematic review evaluated AI-powered risk prediction models for preventable maternal mortality, emphasizing suitability for rural and low-resource settings.

Methods

A systematic literature search was conducted across PubMed, Scopus, Web of Science, IEEE Xplore, Google Scholar, and African Journals Online for studies published January 2015 to August 2025. Studies employing AI or machine learning to predict maternal mortality or severe maternal outcomes were included. The Prediction model Risk Of Bias Assessment Tool (PROBAST) assessed methodological quality across four domains: participants, predictors, outcomes, and analysis. Data extraction captured study characteristics, model architectures, performance metrics, validation strategies, and rural implementation considerations. This review was registered with PROSPERO (CRD420251174343) and reported per PRISMA 2020 guidelines.

Results

Twenty-eight studies met inclusion criteria, predominantly from sub-Saharan Africa (n = 12) and South Asia (n = 8). Dataset sizes ranged from 402 to over 31 million records from national surveys (n = 14), hospital registries (n = 9), and Internet of Things monitoring systems (n = 5). Random Forest (n = 14), ensemble methods (n = 11), and neural networks (n = 11) were most frequently employed. Reported area under the receiver operating characteristic curve values ranged from 0.70 to 0.95 (median 0.84), with sensitivity 70–92% and specificity 65–85%. PROBAST assessment revealed low risk of bias for participants (24/28), predictors (25/28), and outcomes (24/28), but substantial concerns in the analysis domain (14/28 low risk, 8/28 high risk). Key limitations included reliance on synthetic oversampling without external validation, inadequate calibration reporting, and small sample sizes in IoT studies. Only 11 studies (39%) conducted external validation. Common predictors were maternal age, blood pressure, gestational age, parity, and antenatal care attendance. Rural implementation barriers included limited connectivity, data sparsity, workforce training needs, and the absence of explainability frameworks.

Conclusions

AI-powered models demonstrate strong discrimination for maternal mortality prediction when trained on large, representative datasets. However, methodological weaknesses, particularly inadequate external validation and calibration assessment, limit confidence in their generalizability. Underrepresentation of rural populations and a scarcity of implementation studies constrain real-world applicability. Future development should prioritize federated learning for privacy-preserving multi-site collaboration, lightweight architectures for offline deployment, explainable AI frameworks, and integration into community health worker workflows to achieve equitable, scalable solutions for reducing preventable maternal deaths in rural low- and middle-income country settings.

Keywords: Maternal mortality, artificial intelligence, machine learning, risk prediction, rural health, low-resource settings, LMIC, preventable deaths, PROBAST, predictive modeling

Systematic review registration

PROSPERO CRD42025174343

Background

The Global Burden of Maternal Mortality

Maternal mortality remains one of the most profound indicators of health system performance and gender equity globally. Despite decades of international commitment, an estimated 287,000 maternal deaths occurred worldwide in 2020, reflecting a global maternal mortality ratio (MMR) of 223 deaths per 100,000 live births [1]. This burden is starkly inequitable: sub-Saharan Africa and South Asia collectively account for 86% of all maternal deaths, with MMRs exceeding 500 per 100,000 live births in several countries [2]. Within these regions, rural populations experience disproportionately higher mortality rates due to compounding barriers including geographic isolation, inadequate transportation infrastructure, shortage of skilled birth attendants, and delayed emergency obstetric care [3].

The leading causes of preventable maternal mortality (postpartum hemorrhage, hypertensive disorders of pregnancy including pre-eclampsia and eclampsia, sepsis, and obstructed labor) are well characterized and potentially manageable with timely intervention [4]. However, the critical window for life-saving action is often missed in rural settings where early warning signs go unrecognized, referral systems are weak, and facility-based care is inaccessible [5]. This preventability paradox underscores the urgent need for innovative approaches to maternal risk stratification that can function effectively in resource-constrained environments.

Evolution of Maternal Risk Assessment

Historically, maternal risk assessment has relied on clinical scoring systems developed primarily in high-income contexts. Tools such as the Modified Early Obstetric Warning Score (MEOWS) utilize fixed threshold values for vital signs and clinical parameters to identify women at risk of deterioration [6]. While these instruments provide standardized frameworks for triage, they possess several limitations when applied to rural LMIC settings. First, they typically require continuous monitoring infrastructure and trained clinical personnel, resources rarely available in remote health posts [7]. Second, traditional scoring systems employ linear risk models that may inadequately capture the complex, multifactorial nature of maternal mortality risk, which encompasses clinical, sociodemographic, behavioral, and health system access variables [8]. Third, most existing tools were validated in hospital settings with comprehensive laboratory support, limiting their transferability to community-based care environments where diagnostic capacity is minimal [9].

Epidemiological risk models using logistic regression have advanced beyond simple scoring systems by incorporating multiple predictor variables and generating individualized probability estimates [10]. However, unless interaction or non-linear terms are specified explicitly, these statistical approaches assume that each predictor has a linear, additive relationship with the log-odds of the outcome, potentially overlooking important non-linear interactions and threshold effects that characterize obstetric complications [11]. Furthermore, conventional models struggle to integrate heterogeneous data sources, ranging from demographic survey data to real-time vital signs, that are increasingly available through digital health initiatives in LMICs [12].

The Promise of Artificial Intelligence and Machine Learning

Artificial intelligence, particularly through machine learning and deep learning paradigms, offers transformative potential for maternal risk prediction by addressing several limitations of conventional approaches. Machine learning algorithms can identify complex, non-linear patterns in high-dimensional data, adaptively learn from diverse data sources, and generate predictions without requiring explicit programming of decision rules [13]. These capabilities are particularly relevant for maternal health, where risk profiles emerge from intricate interactions among physiological, social, and health system factors [14].

Recent applications of AI in related domains have demonstrated remarkable success. In neonatal medicine, machine learning models have achieved superior performance compared to traditional scoring systems for predicting mortality in preterm infants, with area under the receiver operating characteristic curve (AUROC) values exceeding 0.90 [15]. Similarly, AI-powered sepsis prediction systems have enabled earlier identification of at-risk patients in critical care settings, reducing time to antibiotic administration [16]. In cardiovascular medicine, deep learning algorithms analyzing electrocardiogram data have uncovered prognostic information invisible to human interpretation [17].

For maternal health specifically, AI applications have emerged across the continuum of care. Predictive models have been developed for gestational diabetes, preterm birth, pre-eclampsia, and postpartum hemorrhage, among other conditions [18]. Several studies have demonstrated that ensemble machine learning approaches, combining multiple algorithms, can outperform single-model strategies and traditional risk calculators [19]. Moreover, AI systems can integrate diverse data streams, including electronic health records, community health worker assessments, wearable sensor data, and patient-reported information, enabling comprehensive risk assessment even when individual data sources are incomplete [20].

Despite this promise, significant gaps remain in the evidence base regarding AI applications for maternal mortality prediction in rural settings. Most published studies have been conducted in high-income countries or urban tertiary hospitals with robust digital infrastructure [21]. The feasibility of deploying AI models in low-resource environments, where reliable electricity and internet connectivity cannot be assumed, where health workers may have limited digital literacy, and where cultural acceptability of algorithmic decision support is uncertain, remains poorly characterized [22]. Furthermore, critical methodological concerns, including model transparency, algorithmic bias, external validation, and ethical implications of AI deployment in vulnerable populations, have received insufficient attention in the maternal health literature [23].

Rationale for This Systematic Review

To date, no comprehensive synthesis has specifically examined AI-powered risk prediction models for preventable maternal mortality with explicit focus on rural and low-resource settings. Existing systematic reviews have addressed related topics, including AI in obstetric care broadly [24], prediction models for specific complications like pre-eclampsia [25], and maternal mortality risk assessment using conventional statistical methods [26], but none have systematically evaluated the methodological quality, performance characteristics, and implementation feasibility of AI models specifically designed for or applicable to rural contexts in LMICs.

This evidence gap is particularly problematic given the dual challenges of data scarcity and deployment constraints that characterize rural health systems. Understanding which AI approaches have demonstrated robust performance with limited predictor sets, which validation strategies ensure generalizability across diverse populations, and which implementation models have successfully integrated AI tools into frontline care workflows is essential for guiding future development efforts [27]. Moreover, critical assessment of methodological rigor, including risk of bias evaluation, is necessary to distinguish credible evidence from optimistic reporting that may characterize early-stage technology development [28].

Objectives and Research Questions

This systematic review was conducted to synthesize current evidence on AI-powered risk prediction models for preventable maternal mortality, with particular emphasis on their applicability to rural and resource-limited settings. The specific objectives were to:

Identify and characterize all published AI and machine learning models designed to predict maternal mortality or severe maternal morbidity

Assess the methodological quality and risk of bias of prediction model development and validation studies using the PROBAST framework

Evaluate reported model performance, including discrimination, calibration, and clinical utility metrics

Examine the datasets, predictor variables, and data preprocessing strategies employed

Analyze implementation considerations specific to rural health systems, including infrastructure requirements, workforce integration, and scalability

Identify methodological gaps and propose recommendations for future model development and deployment

Research Framework

The review was structured around the following research questions, framed using a hybrid PICO (Population, Intervention, Comparison, Outcome) and CoCoPop (Condition, Context, Population) framework to capture both predictive accuracy and contextual applicability:

Population: Pregnant women and women in the postpartum period, particularly those in rural or resource-limited settings in LMICs

Intervention: AI-powered or machine learning risk prediction models for maternal mortality or severe maternal outcomes

Comparison: Conventional risk scoring tools, traditional statistical models (e.g., logistic regression), or no formal risk prediction

Outcome: Predictive performance (AUROC, sensitivity, specificity, calibration), reduction in maternal mortality, implementation feasibility in rural contexts

Condition: Preventable maternal mortality or severe maternal morbidity from conditions including postpartum hemorrhage, hypertensive disorders, sepsis, and obstructed labor

Context: Rural, remote, or low-resource healthcare settings, including primary health centers, community-based care, and district hospitals in LMICs

By systematically addressing these objectives and research questions, this review aims to provide evidence-based guidance for researchers, policymakers, and implementers seeking to harness AI technologies to reduce preventable maternal deaths in the settings where they most frequently occur.

Methods

Protocol and Registration

This systematic review was prospectively registered with the International Prospective Register of Systematic Reviews (PROSPERO) under registration number CRD42025174343 and conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement [29]. The protocol was developed following PRISMA for Protocols (PRISMA-P) 2015 guidance [30]. All methodological decisions, including eligibility criteria, search strategies, and quality assessment tools, were specified a priori to minimize selective reporting and ensure transparency.

Eligibility Criteria

Studies were considered eligible for inclusion if they met the following criteria:

Inclusion criteria:

Employed artificial intelligence or machine learning techniques (including but not limited to random forest, support vector machines, neural networks, gradient boosting, deep learning architectures, or ensemble methods) to develop, validate, or evaluate risk prediction models

Addressed preventable causes of maternal mortality or severe maternal morbidity as primary or secondary outcomes

Included rural, remote, or low-resource settings as study contexts, or provided explicit discussion of model applicability to such environments

Constituted primary research reporting original model development or validation, including cohort studies, case-control studies, cross-sectional analyses, or prediction model studies

Published in the English language between January 1, 2015, and August 31, 2025 (this timeframe was selected to capture the period of rapid AI advancement while maintaining contemporary relevance)

Reported quantitative model performance metrics enabling assessment of predictive accuracy (e.g., AUROC, sensitivity, specificity, positive predictive value, calibration measures)

Exclusion criteria:

Studies not employing AI or machine learning methods for prediction (e.g., purely descriptive epidemiological studies, clinical trials without predictive modeling components, traditional statistical analyses without ML algorithms)

Non-human studies, laboratory experiments, or preclinical research

Opinion pieces, editorials, commentaries, and narrative reviews without original data or model development

Studies focused exclusively on neonatal, perinatal, or fetal outcomes unless maternal mortality risk was explicitly modeled as a primary outcome

Conference abstracts or proceedings without full-text availability

Duplicate publications reporting identical data and models (in such cases, the most complete or recent publication was retained)

Information Sources and Search Strategy

A comprehensive literature search was executed across multiple electronic databases to maximize the retrieval of relevant studies. The databases searched included PubMed/MEDLINE, Scopus, Web of Science Core Collection, IEEE Xplore Digital Library, Google Scholar (first 300 results sorted by relevance), and African Journals Online (AJOL). The search was conducted from June 2025 to August 28, 2025, with no date restrictions applied initially; date filters were subsequently applied during screening.

The search strategy was developed iteratively through piloting in PubMed. The strategy combined four concept groups using Boolean operators: (1) maternal mortality terms, (2) artificial intelligence and machine learning terms, (3) risk prediction terms, and (4) rural and low-resource setting terms. Medical Subject Headings (MeSH) terms were used where applicable, supplemented by free-text keyword searches. The complete search string for PubMed was:

(("Maternal Mortality"[Mesh] OR "maternal mortality"[tiab] OR "maternal death*"[tiab] OR "pregnancy-related death*"[tiab] OR "maternal outcome*"[tiab] OR "severe maternal morbidity"[tiab] OR "maternal near miss"[tiab] OR "obstetric mortality"[tiab]) AND ("Artificial Intelligence"[Mesh] OR "Machine Learning"[Mesh] OR "Deep Learning"[Mesh] OR "artificial intelligence"[tiab] OR "machine learning"[tiab] OR "deep learning"[tiab] OR "neural network*"[tiab] OR "random forest"[tiab] OR "support vector machine*"[tiab] OR "gradient boost*"[tiab] OR "ensemble method*"[tiab] OR "supervised learning"[tiab]) AND ("Risk Assessment"[Mesh] OR "Predictive Value of Tests"[Mesh] OR "risk prediction"[tiab] OR "risk assessment"[tiab] OR "risk model*"[tiab] OR "prediction model*"[tiab] OR "prognostic model*"[tiab] OR "risk stratification"[tiab] OR "early warning"[tiab]) AND ("Rural Health"[Mesh] OR "Developing Countries"[Mesh] OR "rural"[tiab] OR "remote"[tiab] OR "low-resource"[tiab] OR "resource-limited"[tiab] OR "low-income countr*"[tiab] OR "middle-income countr*"[tiab] OR "LMIC"[tiab] OR "developing countr*"[tiab] OR "community health"[tiab] OR "primary care"[tiab]))

This search strategy was adapted for each database according to specific syntax requirements and controlled vocabulary. For Scopus and Web of Science, equivalent field tags (TITLE-ABS-KEY) were used. For IEEE Xplore, a simplified Boolean search was applied, given the database's technical focus. Google Scholar searches employed key phrase combinations due to character limits.
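The four-concept structure of the search string can be illustrated with a short sketch. The snippet below is not the tooling used for this review; it simply shows, under the assumption that each concept group is a list of terms, how terms are joined with OR within a group and how groups are joined with AND. Only a small subset of the PubMed terms is reproduced, and the names `CONCEPT_GROUPS` and `build_query` are hypothetical.

```python
# Illustrative subset of the PubMed terms listed in the full search string.
# The dictionary keys and the function name are hypothetical labels.
CONCEPT_GROUPS = {
    "maternal_mortality": ['"Maternal Mortality"[Mesh]', '"maternal death*"[tiab]'],
    "ai_ml": ['"Machine Learning"[Mesh]', '"random forest"[tiab]'],
    "risk_prediction": ['"Risk Assessment"[Mesh]', '"prediction model*"[tiab]'],
    "rural_low_resource": ['"Rural Health"[Mesh]', '"low-resource"[tiab]'],
}


def build_query(groups):
    """OR the terms inside each concept group, then AND the groups together."""
    blocks = ["(" + " OR ".join(terms) + ")" for terms in groups.values()]
    return "(" + " AND ".join(blocks) + ")"


query = build_query(CONCEPT_GROUPS)
print(query)
```

Keeping each concept group as a separate list makes it straightforward to swap field tags (for example, replacing `[tiab]` with a database-specific equivalent) without altering the Boolean logic.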

Additionally, reference lists of included studies and relevant systematic reviews were hand-searched to identify studies potentially missed by electronic searching. Grey literature was searched through websites of major global health organizations, including the World Health Organization (WHO), United Nations Population Fund (UNFPA), United Nations Children's Fund (UNICEF), and the United States Agency for International Development (USAID). Clinical trial registries (ClinicalTrials.gov, WHO International Clinical Trials Registry Platform) were searched to identify ongoing or unpublished studies.

Study Selection Process

All records identified through database searching and other sources were imported into Covidence systematic review management software (Veritas Health Innovation, Melbourne, Australia). Duplicate records were automatically identified and manually verified for removal. The study selection process proceeded in two stages:

Stage 1: Title and abstract screening. Two reviewers independently screened titles and abstracts of all unique records against the predefined eligibility criteria. Studies clearly not meeting the inclusion criteria (e.g., animal studies, studies not involving AI/ML, unrelated clinical topics) were excluded at this stage. Disagreements were resolved through discussion, and if consensus could not be reached, a third reviewer (PTE) adjudicated.

Stage 2: Full-text screening. Full texts of all potentially eligible studies identified in Stage 1 were retrieved and independently assessed by two reviewers. Studies were excluded if they failed to meet one or more inclusion criteria, with specific reasons for exclusion documented. Disagreements were resolved through consensus discussion with the involvement of a third reviewer (VS) when necessary.

Throughout the selection process, inter-rater reliability was monitored using Cohen's kappa statistic. The study selection process was documented using a PRISMA flow diagram (Fig. 1) showing the number of records at each stage, exclusions with reasons, and final inclusions.
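For readers unfamiliar with Cohen's kappa, the following minimal sketch shows how the statistic is computed from two reviewers' screening decisions: observed agreement is corrected for the agreement expected by chance given each reviewer's marginal rates. The decision lists are hypothetical and are not the screening data from this review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical include/exclude decisions for ten records.
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 6 + ["include"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```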

Data Extraction

A standardized, piloted data extraction form was developed in Microsoft Excel and refined through iterative testing on a random sample of five included studies by two independent reviewers. The form was subsequently applied to all included studies by one reviewer (ESR) with verification of a 20% random sample by a second reviewer (FLMM). Discrepancies were resolved through discussion.

Data extracted from each study included:

Study characteristics:

Citation information (first author, publication year, journal)

Study design (cohort, cross-sectional, case-control, registry-based, etc.)

Country and healthcare setting (urban, rural, mixed; primary, secondary, tertiary level)

Study period and duration of follow-up (where applicable)

Funding sources and potential conflicts of interest

Population characteristics:

Sample size (training and validation cohorts reported separately)

Inclusion and exclusion criteria

Maternal demographic characteristics (age, parity, education, socioeconomic indicators)

Baseline risk profile of the population

Proportion of rural participants

Dataset characteristics:

Data source (electronic health records, national surveys, registries, IoT devices, community health worker records)

Data completeness and patterns of missingness

Data collection period and temporal validation considerations

Methods for handling missing data (complete case analysis, imputation techniques, other approaches)

Class imbalance characteristics (prevalence of outcome events)

Balancing techniques applied (oversampling, undersampling, SMOTE, ADASYN, other synthetic methods)

Predictor variables:

Complete list of candidate predictors considered

Final predictors included in models

Variable definitions and measurement methods

Categorization of predictors (sociodemographic, obstetric history, vital signs, laboratory values, health system access variables)

Feature selection methods (clinical expertise, statistical techniques, machine learning-based selection)

Data preprocessing and transformation steps

Outcome definitions:

Primary outcome (maternal death, severe maternal morbidity, composite outcomes, risk classification)

Operational definitions and diagnostic criteria

Timing of outcome assessment

Outcome ascertainment methods and data sources

Model development:

AI/ML algorithms employed (specific algorithms and software packages)

Model training procedures and computational infrastructure

Hyperparameter optimization strategies (grid search, random search, Bayesian optimization)

Cross-validation approaches (k-fold, leave-one-out, temporal validation)

Ensemble methods and model stacking strategies

Explainability and interpretability methods applied (SHAP values, LIME, feature importance plots)

Model validation:

Internal validation methods (bootstrap, cross-validation approaches)

External validation (geographic, temporal, or setting-based external datasets)

Calibration assessment methods (calibration plots, Hosmer-Lemeshow test, calibration slope and intercept)

Clinical utility evaluation (decision curve analysis, net benefit)

Model performance:

Discrimination metrics (AUROC with 95% confidence intervals, sensitivity, specificity, positive predictive value, negative predictive value, F1-score)

Calibration metrics (expected-to-observed ratios, calibration slope, Brier score)

Reclassification metrics (net reclassification improvement, integrated discrimination improvement)

Performance stratified by subgroups (rural vs urban, parity groups, maternal age categories)

Implementation considerations:

Infrastructure requirements (connectivity, devices, computational resources)

Integration with existing health information systems

Training requirements for end-users

Cost considerations

Scalability and sustainability assessments

Ethical considerations addressed

Patient and provider acceptability

Where data were unclear or incompletely reported in the primary publication, supplementary materials were reviewed. Study authors were not contacted for missing information due to resource constraints; instead, data gaps were clearly noted as "not reported" in extraction tables.
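The discrimination and calibration metrics listed in the extraction form can be made concrete with a small worked example. The sketch below uses only the Python standard library and hypothetical outcome labels and predicted risks; it is illustrative and does not reproduce any included study's data. AUROC is computed as the proportion of event/non-event pairs in which the event receives the higher predicted risk (ties count as one half), and the Brier score as the mean squared difference between predicted risk and observed outcome.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random event outranks a random non-event."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are flagged high risk."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes (1 = severe maternal outcome) and predicted risks.
outcomes = [1, 1, 1, 0, 0, 0, 0, 0]
risks = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

auc = auroc(outcomes, risks)
sens, spec = sensitivity_specificity(outcomes, risks, threshold=0.5)
brier = brier_score(outcomes, risks)
print(auc, sens, spec, brier)
```

The 0.5 threshold is an arbitrary choice for illustration; sensitivity and specificity change with the threshold, whereas AUROC and the Brier score do not depend on it.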

Risk of Bias and Applicability Assessment

The methodological quality and risk of bias of included studies were assessed using the Prediction model Risk Of Bias Assessment Tool (PROBAST) [31], which is specifically designed for evaluating prediction model studies. PROBAST assesses risk of bias and applicability concerns across four key domains:

Domain 1: Participants. This domain evaluates whether appropriate data sources were used and whether the selection of participants could have introduced bias. Signaling questions address participant sampling methods, appropriateness of inclusion and exclusion criteria, data availability, and whether participant characteristics match the intended use population.

Domain 2: Predictors. This domain assesses whether predictors were defined and measured appropriately and consistently. Signaling questions address predictor definitions, standardization of measurement, blinding to outcome information during predictor assessment, and availability of predictors at the time predictions would be made in practice.

Domain 3: Outcome. This domain evaluates whether the outcome was defined and determined appropriately. Signaling questions address outcome definition clarity, objectivity and reliability of outcome determination, appropriate outcome ascertainment intervals, and blinding of outcome assessors to predictor information.

Domain 4: Analysis. This domain assesses the appropriateness of the statistical analysis methods. Signaling questions address adequacy of sample size, handling of continuous and categorical predictors, appropriate selection of variables, appropriate handling of missing data, selection of predictors and interactions informed by subject matter knowledge or data-driven approaches, model development strategies, appropriate use of complexity reduction techniques, evaluation of model performance including discrimination and calibration, and appropriate application of internal validation or external validation procedures.

For each domain, reviewers answered multiple signaling questions designed to support transparent and consistent judgments. Based on responses to signaling questions, each domain was rated as low risk of bias, high risk of bias, or unclear risk of bias. An overall risk of bias judgment was assigned to each study based on domain-level assessments, with studies rated as high risk overall if any domain was judged high risk.
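The overall judgment rule can be expressed as a short function. The sketch below encodes the rule stated above (any high-risk domain yields a high overall rating) and additionally assumes, in line with general PROBAST guidance rather than anything specified in this review, that a study with no high-risk domain but at least one unclear domain is rated unclear overall. The function and constant names are hypothetical.

```python
DOMAINS = ("participants", "predictors", "outcome", "analysis")
VALID_RATINGS = {"low", "high", "unclear"}


def overall_risk_of_bias(ratings):
    """Combine the four PROBAST domain ratings into one overall judgment."""
    missing = [d for d in DOMAINS if d not in ratings]
    if missing:
        raise ValueError(f"missing domain ratings: {missing}")
    values = [ratings[d] for d in DOMAINS]
    invalid = [v for v in values if v not in VALID_RATINGS]
    if invalid:
        raise ValueError(f"unexpected ratings: {invalid}")
    if "high" in values:
        return "high"      # rule stated in the text: any high-risk domain
    if "unclear" in values:
        return "unclear"   # assumption: no high domain, at least one unclear
    return "low"           # all four domains rated low


# Hypothetical study with a weak analysis domain.
example = {"participants": "low", "predictors": "low",
           "outcome": "low", "analysis": "high"}
print(overall_risk_of_bias(example))
```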

Applicability was assessed separately for the participant, predictor, and outcome domains. Applicability concerns were rated as low, high, or unclear based on whether the study population matched the review question's target population (pregnant women in rural/low-resource settings), whether predictors would be available and measurable in the intended application context, and whether outcomes aligned with clinically meaningful endpoints for maternal health prediction.

Two reviewers (ESR, FLMM) independently conducted PROBAST assessments for all included studies. Disagreements were resolved through discussion, with arbitration by a third reviewer (VS) when consensus could not be reached. Risk of bias and applicability assessments were summarized in tabular and graphical formats.

Data Synthesis and Analysis

Given the anticipated heterogeneity in study populations, predictor sets, outcome definitions, model types, and performance metrics, a narrative synthesis approach was adopted as the primary method of evidence synthesis. Quantitative meta-analysis was considered infeasible due to substantial clinical, methodological, and statistical heterogeneity across included studies.

The narrative synthesis was structured according to the following elements:

Descriptive analysis

Study characteristics were tabulated and summarized using frequencies and proportions for categorical variables and medians with ranges for continuous variables. Geographic distribution of studies was visualized using world maps. Temporal trends in publication volume and methodological approaches were examined graphically.

Risk of bias synthesis

PROBAST results were summarized across domains using frequency tables and stacked bar charts showing the proportion of studies in each risk category (low, high, unclear) for each domain. Patterns in risk of bias ratings were examined in relation to study characteristics such as sample size, data source, and validation approach.

Model performance synthesis

Performance metrics were extracted and tabulated for all models. AUROC values were presented in forest plot format to enable visual comparison across studies, with studies ordered by validation strategy (internal only vs external validation) and setting (LMIC vs HIC). Ranges and median values were calculated for discrimination metrics (AUROC, sensitivity, specificity). Calibration reporting was summarized descriptively, given heterogeneity in assessment methods. Subgroup comparisons were conducted where feasible to examine performance differences by model type (traditional statistical vs machine learning), setting (LMIC vs HIC), sample size categories (< 1,000; 1,000–10,000; >10,000), and validation approach.
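As an illustration of the descriptive summaries described above, the following sketch groups studies into the three sample size categories and reports the range and median AUROC overall and per category. The study records are hypothetical placeholders, not extracted data, and the helper names are assumptions made for this example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (study label, sample size, reported AUROC) records.
HYPOTHETICAL_STUDIES = [
    ("A", 402, 0.70),
    ("B", 5000, 0.84),
    ("C", 250000, 0.95),
    ("D", 800, 0.78),
    ("E", 12000, 0.88),
]


def size_category(n):
    """Assign a sample size to one of the three categories used in the synthesis."""
    if n < 1000:
        return "<1,000"
    if n <= 10000:
        return "1,000-10,000"
    return ">10,000"


def summarize(values):
    """Range and median of a list of AUROC values."""
    return {"min": min(values), "median": median(values), "max": max(values)}


by_size = defaultdict(list)
for _, n, auc_value in HYPOTHETICAL_STUDIES:
    by_size[size_category(n)].append(auc_value)

overall = summarize([auc_value for _, _, auc_value in HYPOTHETICAL_STUDIES])
subgroups = {category: summarize(vals) for category, vals in by_size.items()}
print(overall)
print(subgroups)
```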

Predictor variable analysis

Predictor variables employed across studies were cataloged and categorized into domains (sociodemographic, obstetric history, clinical measurements, laboratory tests, health system access variables). The frequency of use for each predictor was calculated and visualized using horizontal bar charts to identify the most commonly employed variables.

Implementation considerations

Evidence relevant to rural implementation was narratively synthesized, focusing on reported infrastructure requirements, integration strategies, training approaches, and scalability assessments. Barriers and enablers to deployment in low-resource settings were systematically cataloged.

Heterogeneity assessment for potential meta-analysis (ultimately not conducted) was planned using the I² statistic for discrimination metrics if three or more studies with sufficiently similar characteristics could be identified. Publication bias assessment through funnel plot examination was planned if ten or more studies reporting the same outcome metric were available.
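Although no meta-analysis was performed, the planned heterogeneity statistic can be sketched for reference. The example below computes Cochran's Q with inverse-variance weights and derives I² as the percentage of total variation in excess of what chance would produce. The estimates and standard errors are hypothetical; in practice, AUROC values are often transformed (for example, to the logit scale) before pooling, which this simplified sketch does not do.

```python
def i_squared(estimates, standard_errors):
    """I² (%) from Cochran's Q using fixed-effect inverse-variance weights."""
    if len(estimates) != len(standard_errors) or len(estimates) < 2:
        raise ValueError("need at least two estimates with matching standard errors")
    weights = [1.0 / (se ** 2) for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= df:
        return 0.0  # I² is truncated at zero when Q does not exceed its df
    return 100.0 * (q - df) / q


# Hypothetical AUROC estimates from three studies with equal standard errors.
hypothetical_aurocs = [0.80, 0.85, 0.90]
hypothetical_ses = [0.02, 0.02, 0.02]
i2 = i_squared(hypothetical_aurocs, hypothetical_ses)
print(round(i2, 1))
```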

Handling of Missing Data

No imputation or statistical modeling was applied to handle missing data in performance metrics or study characteristics. Where studies did not report specific metrics (e.g., calibration measures, confidence intervals around AUROC), this was noted as "not reported" in synthesis tables. The impact of incomplete reporting on synthesis conclusions was discussed qualitatively in the limitations section.

Ethics and Dissemination

As this review synthesized data from previously published studies and did not involve collection of primary data from human participants, ethical approval was not required. All included studies had received appropriate ethical approvals as reported in their respective publications. Findings from this systematic review will be disseminated through publication in a peer-reviewed journal, presentation at relevant scientific conferences, and sharing with stakeholders including the World Health Organization, maternal health program implementers, and AI/ML research communities through policy briefs and webinars.

Results

Study Selection

The comprehensive database search, executed from June to August 2025, yielded 383 records after initial retrieval. Following import into Covidence and automated deduplication, 79 duplicate records were removed, leaving 304 unique records for title and abstract screening. During Stage 1 screening, 225 records were excluded as clearly irrelevant based on title and abstract review, with common exclusion reasons including non-healthcare applications of AI (n = 67), studies not focused on maternal health outcomes (n = 54), descriptive studies without predictive modeling (n = 48), and non-English publications (n = 12). This resulted in 79 full-text articles being retrieved for detailed eligibility assessment.

During Stage 2 full-text review, 51 studies were excluded for the following reasons: did not employ AI or machine learning approaches (n = 18), did not predict maternal mortality or severe maternal outcomes (n = 9), lacked rural or low-resource context and no discussion of applicability to such settings (n = 8), did not present a formal predictive model (n = 6), insufficient methodological detail to permit PROBAST assessment (n = 6), and incomplete or inadequate reporting of predictor or outcome definitions (n = 4). Following full-text screening and application of all eligibility criteria, 28 studies were included in the qualitative synthesis. No additional studies were identified through reference list searching or grey literature sources that met the inclusion criteria. Inter-rater agreement for full-text screening was substantial (Cohen's κ = 0.82).
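As a reminder of how Cohen's κ is obtained, the sketch below computes it from two reviewers' include/exclude decisions. The decision lists are hypothetical and are not the screening data from this review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical decisions on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical decisions on 10 full-text articles.
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude",
              "exclude", "include", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "exclude", "include", "exclude", "exclude", "exclude"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```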

The complete study selection process is documented in the PRISMA flow diagram (Fig. 1), which details the number of records at each stage, reasons for exclusions, and final inclusions. All 28 included studies contributed data to the narrative synthesis. No meta-analysis was conducted due to heterogeneity in populations, predictors, outcomes, and modeling approaches.

From: Haddaway, N. R., Page, M. J., Pritchard, C. C., & McGuinness, L. A. (2022). PRISMA2020: An R package and Shiny app for producing PRISMA 2020-compliant flow diagrams, with interactivity for optimized digital transparency and Open Synthesis. Campbell Systematic Reviews, 18, e1230. https://doi.org/10.1002/cl2.1230

Figure 1. PRISMA flow diagram of study selection.

Flow of records through identification, screening, eligibility, and inclusion stages for the systematic review. The diagram details the number of records retrieved from databases and other sources, the number excluded at each stage, and the final studies included in the qualitative synthesis.

Study Characteristics

The 28 included studies spanned a broad range of geographies, with the majority conducted in low- and middle-income countries (LMICs), particularly in sub-Saharan Africa (e.g., Ethiopia, Somalia, Tanzania, and a 27-country DHS analysis) and South Asia (Bangladesh, India, Pakistan). A smaller proportion represented high-income country (HIC) contexts or mixed settings, including population registry studies from Australia and Canada, and facility-based studies in Europe. Details of the included studies are presented in the table of study characteristics located at the end of the document text file (Table 1-1.1).

Figure 2. Visual Summary of Study Characteristics and Quality Assessment

(A) Geographic distribution of the 28 included studies: Sub-Saharan Africa and South Asia contributed 71% of the studies. (B) Sample size distribution on a logarithmic scale, colored by validation approach. Studies with external validation (blue squares, n = 11) had comparable sample sizes to those with internal validation only (orange circles, n = 17). (C) Frequency of machine learning algorithms employed; multiple algorithms per study possible. Random Forest was the most common (50%), followed by ensemble methods and neural networks (39% each). (D) Distribution of reported AUROC values, stratified by validation type. Median AUROC was 0.86 for internal validation and 0.82 for external validation, demonstrating typical performance optimism in internally validated models. (E) PROBAST risk of bias assessment across four domains, showing proportion of studies rated as low (green), moderate (yellow), or high (red) risk. The analysis domain showed the greatest methodological concerns. (F) Implementation reporting gaps showing the percentage of studies addressing key deployment considerations. External validation and calibration assessment were particularly underreported.

Geographic and Temporal Distribution

The 28 included studies represented diverse geographic contexts, with predominant contributions from low- and middle-income country settings. Sub-Saharan Africa was the most represented region (n = 12 studies, 43%), including multi-country analyses using Demographic and Health Survey (DHS) data from 27 African nations [32], as well as country-specific studies from Ethiopia [33], Somalia [34], Tanzania [35], and others. South Asian countries contributed 8 studies (29%), with representation from India [36], Bangladesh [37, 38], and Pakistan [39]. High-income country contexts were represented by 6 studies (21%) from Australia [40], Finland [41], Italy [42], South Korea [43], Taiwan [44], and the United States [45], while 2 studies (7%) employed multi-country datasets spanning both HIC and LMIC settings [46, 47].

Blue = cohort studies, Green = cross-sectional studies, Orange = registry or population-based datasets, Purple = observational, pilot, or secondary data studies, Red = Mixed design. Studies without a clearly reported country are not plotted.

Fig. 2

Geographical Map Distribution of Included Studies.

The map shows the locations of the 28 included studies on maternal and perinatal risk prediction using artificial intelligence and machine learning. Markers indicate the geographic setting of each study; multi-country studies are represented by multiple markers. Bubble size is proportional to study count. Sub-Saharan Africa and South Asia contributed 71% of studies.

Rural populations were explicitly included in 19 studies (68%), with the remainder incorporating mixed urban-rural samples but providing stratified analyses or discussing rural applicability. The geographic distribution is illustrated in Fig. 2, which maps study locations and highlights the concentration of research in sub-Saharan Africa and South Asia, where maternal mortality burden is highest.

Temporal analysis revealed increasing research activity, with only 3 studies published between 2015–2018, followed by accelerating growth with 8 studies in 2019–2021 and 17 studies in 2022–2025. This trajectory reflects broader trends in AI applications to healthcare and the increasing availability of digital health data in LMIC contexts.

Study Designs and Data Sources

Study designs varied substantially. Population-based cohort or cross-sectional studies using nationally representative survey data constituted the largest category (n = 14, 50%), predominantly drawing from DHS [32–34], national civil registration and vital statistics (CRVS) systems [48], and population registries [40, 41]. Facility-based retrospective cohort studies utilizing hospital or clinic electronic health records comprised 9 studies (32%), typically from tertiary referral hospitals or multi-site networks [36, 42–44, 49]. Prospective observational studies, often pilot implementations of novel monitoring systems, accounted for 5 studies (18%) [37, 38, 50–52].

Data sources reflected this design diversity. National health surveys provided data for 14 studies, with DHS data being the most common (n = 10 studies). Hospital-based electronic health record systems supplied data for 9 studies. Emerging data sources included Internet of Things (IoT) and Internet of Medical Things (IoMT) sensor networks (n = 5 studies) [37, 38, 50–52], which captured continuous physiological monitoring data, including heart rate, blood pressure, temperature, and oxygen saturation. Community health worker-collected data informed 3 studies [36, 53, 54], representing pragmatic approaches suited to settings lacking facility-based infrastructure.

Sample Sizes and Outcome Prevalence

Sample sizes spanned more than four orders of magnitude. The largest datasets exceeded 10 million records, including the US birth certificate analysis by Lee et al. (n = 31,287,801) [55] and the multicountry registry study by Koivu and Sairanen (n = 12,867,146) [41]. Mid-size datasets (n = 1,000–100,000) characterized 16 studies, predominantly population surveys and national registries. Small-sample studies (n < 1,000) included 7 investigations, primarily IoMT pilot projects and single-facility studies [36–38, 50–52].

Outcome event prevalence demonstrated marked variation, largely tracking with sample size and setting. Large registry and survey-based studies exhibited low event rates, with maternal mortality occurring in 0.3–1.2% of pregnancies in population-based cohorts. Facility-based studies from tertiary centers reported higher prevalence (2.0–16.0%), reflecting referral patterns concentrating high-risk cases. Several studies employed composite risk categorization rather than mortality endpoints, classifying women into low/medium/high risk groups based on validated scoring systems [37, 38, 50, 52, 56].

Class imbalance emerged as a pervasive characteristic, with 23 studies (82%) reporting substantial imbalance between outcome-positive and outcome-negative cases. This imbalance ranged from 1:50 to 1:500 in severe cases, presenting significant challenges for model training and evaluation.

Target Outcomes

Primary outcomes varied across the included studies, reflecting different points in the continuum from risk identification to mortality. Direct maternal mortality prediction was the primary outcome in 9 studies (32%) [39, 48, 53, 54, 57–60]. Composite maternal risk scores categorizing women into multiple risk strata (typically low/medium/high or 3–5 categories) were employed by 11 studies (39%) [37, 38, 50–52, 56, 61–65]. Severe maternal morbidity or "near-miss" events served as outcomes in 4 studies (14%) [46,47,66,67]. Surrogate outcomes, including perinatal or neonatal mortality in the context of maternal care, were used by 4 studies (14%) [35,42,43,68].

It is noteworthy that 6 studies predicting service utilization outcomes, specifically skilled birth attendance [32] and early antenatal care initiation [34], were included because these behaviors are established proximal determinants of maternal mortality in rural settings where access barriers predominate. These studies contribute methodological insights regarding prediction in resource-constrained contexts, even though they do not directly model mortality endpoints.

Predictor Variables

A comprehensive catalog of 127 unique predictor variables was extracted across all included studies. These variables were categorized into five domains: sociodemographic characteristics, obstetric history, vital signs and clinical measurements, laboratory investigations, and health system access indicators.

Sociodemographic predictors were employed by all 28 studies and included maternal age (n = 26 studies, 93%), education level (n = 19, 68%), parity (n = 21, 75%), place of residence (urban/rural) (n = 15, 54%), household wealth index or socioeconomic status (n = 17, 61%), maternal occupation (n = 11, 39%), marital status (n = 9, 32%), and religion or ethnicity (n = 7, 25%). These variables demonstrated high availability across both survey-based and facility-based datasets.

Obstetric history variables were utilized by 24 studies (86%) and encompassed gestational age at assessment (n = 22, 79%), gravidity and parity (n = 21, 75%), number of antenatal care visits (n = 20, 71%), gestational age at first antenatal visit (n = 14, 50%), history of previous cesarean delivery (n = 13, 46%), history of pregnancy complications (n = 15, 54%), interpregnancy interval (n = 8, 29%), previous stillbirth or neonatal death (n = 12, 43%), and multiple gestation (n = 11, 39%).

Vital signs and clinical measurements were incorporated by 22 studies (79%) and included systolic blood pressure (n = 20, 71%), diastolic blood pressure (n = 20, 71%), heart rate (n = 17, 61%), temperature (n = 14, 50%), respiratory rate (n = 12, 43%), oxygen saturation (n = 10, 36%), body mass index (n = 16, 57%), and weight gain during pregnancy (n = 9, 32%). These measurements demonstrated feasibility for collection in community-based settings without laboratory infrastructure.

Laboratory investigations were available in 14 studies (50%), primarily facility-based investigations, and encompassed hemoglobin concentration (n = 13, 46%), blood glucose or diabetes status (n = 12, 43%), proteinuria (n = 9, 32%), HIV status (n = 8, 29%), platelet count (n = 6, 21%), liver function tests (n = 5, 18%), and urine culture (n = 4, 14%). The more limited use of laboratory predictors reflected both data availability constraints in rural settings and intentional model design choices prioritizing feasibility over maximal predictive power.

Health system access indicators appeared in 18 studies (64%) and included distance to health facility (n = 12, 43%), availability of emergency transportation (n = 8, 29%), facility delivery versus home birth (n = 16, 57%), skilled birth attendant present (n = 14, 50%), health insurance coverage (n = 11, 39%), and media exposure or health knowledge (n = 7, 25%).

The frequency distribution of the most commonly employed predictors is presented in Fig. 3. Maternal age, blood pressure measurements (systolic and diastolic), parity, gestational age, and antenatal care attendance emerged as the most ubiquitous variables, included in over 70% of studies. This convergence reflects both biological relevance and practical availability across diverse healthcare contexts.

Notable heterogeneity characterized predictor selection strategies. Only 11 studies (39%) reported explicit feature selection procedures using statistical or machine learning techniques [32,33,36,40,42,56,61,63,67,69]. Expert clinical judgment guided predictor selection in 8 studies (29%) [36, 37, 46, 47, 50, 53, 56, 64]. The remaining 9 studies (32%) employed all available variables without formal selection procedures, though several applied dimensionality reduction through principal component analysis or embedded feature importance within ensemble models [38, 51, 52, 58, 60, 65].

Risk of Bias Assessment

PROBAST Domain-Level Findings

The PROBAST assessment revealed heterogeneous methodological quality across included studies, with a particular concentration of limitations in the analysis domain (Table 2 and Fig. 3–3.1).

Participants domain

The majority of studies (24/28, 86%) were rated as low risk of bias for participant selection. These studies employed nationally representative probability sampling for population surveys, comprehensive registry enrollment, or consecutive facility admissions with appropriate inclusion criteria. Three studies (11%) received unclear ratings due to insufficient description of sampling procedures or inclusion/exclusion criteria [51, 58, 65]. One study (4%) was rated high risk due to a case-only design without appropriate control selection, which precluded reliable probability estimation [48].

Predictors domain

Twenty-five studies (89%) demonstrated low risk of bias for predictor measurement. These investigations employed standardized data collection instruments (DHS questionnaires, validated electronic health record systems) or objective physiological measurements (automated vital sign monitors). Predictor definitions were clearly specified, measurements were conducted prospectively or abstracted systematically from records, and the timing of predictor assessment was appropriate relative to outcome occurrence. Three studies (11%) received unclear ratings due to insufficient description of predictor measurement protocols or potential concerns about predictor availability at the time clinical predictions would be needed [52, 60, 65].

Outcome domain

Twenty-four studies (86%) were rated low risk for outcome definition and measurement. These studies employed objective, standardized definitions of maternal mortality (death during pregnancy or within 42 days postpartum) based on ICD-10 criteria or national vital registration systems, or utilized validated composite risk scores based on established clinical criteria. Outcome ascertainment methods were appropriate and applied consistently. Four studies (14%) received unclear ratings, primarily due to incomplete description of outcome verification procedures or concerns about potential outcome misclassification in community-based surveillance systems [34, 53, 54, 60].

Analysis domain

This domain exhibited the greatest concentration of high-risk-of-bias judgments. Only 14 studies (50%) were rated low risk, having employed adequate sample sizes, appropriate statistical methods, rigorous validation procedures including external validation or robust internal cross-validation, and proper handling of class imbalance and missing data. Eight studies (29%) were rated high risk due to one or more serious methodological limitations, including very small sample sizes relative to the number of predictors (events per variable < 10) [36, 37, 50–52]; reliance on synthetic oversampling techniques such as SMOTE without external validation [48, 56, 64]; absence of calibration assessment despite reporting discrimination metrics [38, 51, 52, 58, 60, 65]; or inadequate handling of missing data through complete case analysis when missingness exceeded 10% [53, 54, 57]. Six studies (21%) received unclear ratings due to insufficient reporting of analytical procedures, particularly regarding cross-validation schemes, hyperparameter tuning, or missing data approaches [34,39,59,62,63,69].
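The events-per-variable criterion mentioned above can be made concrete with a brief Python sketch. The function name and the pilot-study figures are hypothetical; the threshold of 10 is the rule of thumb applied in this assessment.

```python
def events_per_variable(n_participants, outcome_prevalence, n_candidate_predictors):
    """Expected number of outcome events divided by the number of candidate predictors."""
    if n_candidate_predictors <= 0:
        raise ValueError("At least one candidate predictor is required.")
    expected_events = n_participants * outcome_prevalence
    return expected_events / n_candidate_predictors


# Hypothetical pilot study: 600 women, 5% outcome prevalence, 12 candidate predictors.
epv = events_per_variable(600, 0.05, 12)
meets_rule_of_thumb = epv >= 10  # criterion used in the analysis-domain rating
```

In this hypothetical case only about 30 events support 12 predictors, so the study would fall well short of the threshold despite a sample of several hundred participants.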

Table 2
Frequency of Studies Rated as Low, Moderate, or High Risk of Bias by PROBAST Domain
Domain	Low Risk (n, %)	Moderate (n, %)	High (n, %)
Participants	24 (92%)	2 (8%)	1
Predictors	25 (96%)	1 (4%)	0
Outcome	25 (96%)	1 (4%)	0
Analysis	12 (46%)	6 (23%)	8 (31%)
Overall	15 (58%)	4 (15%)	7 (27%)

Most studies showed low risk of bias in participants, predictors, and outcomes, owing to robust sampling and relevant predictor variables. The analysis domain was the main weakness, with high-risk ratings linked to small samples, oversampling (e.g., SMOTE), poor calibration reporting, and missing external validation. Overall, 15 studies were low risk, while six raised concerns, underscoring the need for more transparent and rigorous analytical methods in future AI-based maternal risk prediction.

Overall Risk of Bias

Integrating across all four PROBAST domains, 15 studies (54%) were judged to be at overall low risk of bias, having received low risk ratings in all domains or at most one unclear rating in a non-critical domain. Six studies (21%) were rated high risk overall due to high-risk judgment in the analysis domain. Seven studies (25%) received unclear overall ratings due to insufficient reporting across multiple domains or unclear ratings in critical domains that could not be resolved through available documentation.

Studies rated as low overall risk of bias were characterized by several common features: large sample sizes (typically > 10,000 participants), population-based sampling or comprehensive registry enrollment, clearly defined and objectively measured predictors, standardized outcome definitions, rigorous cross-validation or external validation, calibration assessment, and transparent reporting of all methodological details [32,33,35,40,41,46,47,49,55,67,68]. These studies primarily employed DHS data, national registries, or multicountry collaborative cohorts.

Conversely, studies at high risk of bias typically involved small pilot samples (< 1,000 participants), IoMT or novel sensor-based data collection, heavy reliance on synthetic data augmentation without external validation, absence of calibration assessment, and optimistic performance reporting without appropriate adjustment for overfitting [36, 37, 48, 50–52, 56, 64]. While these investigations often represented innovative approaches with potential for rural deployment, methodological limitations constrained confidence in reported performance estimates.

The distribution of risk of bias ratings across PROBAST domains is illustrated in Fig. 3, demonstrating that methodological rigor was generally stronger for study design, participant selection, and outcome definition than for analytical approaches.

Fig. 3

Stacked bar chart of the PROBAST results

This chart shows the proportion of studies in each risk category (Low, Moderate, High) across the four domains (Participants, Predictors, Outcome, Analysis).

Model Development and Validation Approaches

Machine Learning Algorithms Employed

The 28 included studies collectively evaluated 89 distinct prediction models, reflecting both single-algorithm approaches and ensemble combinations. Random Forest emerged as the most frequently implemented algorithm, employed in 14 studies (50%) and contributing to 19 distinct models [32,33,35,36,40,42,48,55,56,61,63,67–69]. Ensemble methods combining multiple algorithms appeared in 11 studies (39%), with specific approaches including stacking, voting classifiers, and boosted ensembles [46,47,50,56,61–65,70]. Neural networks, including both shallow multilayer perceptrons and deep learning architectures, were utilized in 11 studies (39%) [37,38,43,44,50–52,58,64,65,71].

Gradient boosting algorithms (XGBoost, LightGBM, CatBoost) were employed in 10 studies (36%) [35,40,42,46,47,56,61,63,67,70], reflecting the recent availability of boosting frameworks optimized for tabular data. Support vector machines appeared in 8 studies (29%) [32,36,48,56,61,63,64,69], while naïve Bayes classifiers were evaluated in 6 studies (21%) [32,48,56,63,64,69]. K-nearest neighbors algorithms were implemented in 5 studies (18%) [32, 48, 56, 63, 64].

Traditional statistical approaches were employed either as standalone models or comparator baselines in 16 studies (57%). Logistic regression was the most common traditional method (n = 15 studies, 54%) [33–35,39,41,42,45,49,53,54,57,59,60,68,69], serving as a benchmark against which machine learning models were compared. Cox proportional hazards models appeared in 2 studies with time-to-event outcomes [41, 49].

Deep learning architectures beyond standard feedforward neural networks included convolutional neural networks (CNN) in 3 studies analyzing time-series vital sign data [37, 51, 52], long short-term memory (LSTM) recurrent networks in 2 studies [51, 65], and hybrid CNN-LSTM architectures in 2 studies [51, 52]. These deep learning approaches were exclusively employed in IoMT studies with high-frequency physiological monitoring data.

Model Training and Hyperparameter Optimization

Model training procedures varied considerably in sophistication. Fifteen studies (54%) reported systematic hyperparameter optimization using grid search (n = 9) [32,33,40,42,48,55,56,63,67], random search (n = 3) [46,61,70], or Bayesian optimization (n = 3) [35,47,69]. These investigations specified search spaces for key hyperparameters and reported final optimized parameter values; a subset also employed nested cross-validation to avoid overfitting during tuning.

Thirteen studies (46%) provided limited or no description of hyperparameter selection, either reporting use of default algorithm parameters or providing insufficient detail to permit replication [34,36–39,43,44,50–54,58–60,62,64,65,71]. This lack of transparency complicates the interpretation of model performance and represents a notable reporting gap.

Class imbalance handling strategies were explicitly described in 18 studies (64%). The Synthetic Minority Oversampling Technique (SMOTE), which generates synthetic examples of the minority class through interpolation, was the most common approach, employed in 8 studies [32,48,56,61–64,69]. Random undersampling of the majority class was used in 6 studies [33,35,36,46,55,67]. Ensemble methods with built-in class weighting (e.g., balanced Random Forest) were employed in 4 studies [40,42,47,70]. Notably, 10 studies (36%) did not report any class imbalance handling despite outcome prevalence below 5%, raising concerns about potential bias toward majority class prediction [34,37–39,43,44,50–54,58,59,65,71].
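To illustrate the interpolation idea behind SMOTE, the following simplified Python sketch generates one synthetic minority-class example. It is not the reference SMOTE implementation and not the code used by any included study; the function name, the choice of k, and the blood pressure values are hypothetical.

```python
import random


def smote_like_sample(minority_points, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    if len(minority_points) < 2:
        raise ValueError("At least two minority-class points are required.")
    rng = rng or random.Random(0)
    base = rng.choice(minority_points)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority_points if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    # Interpolate: the synthetic point lies on the segment joining base and neighbour.
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class records: [systolic BP, diastolic BP] in mmHg.
high_risk = [[150, 95], [160, 100], [155, 105], [170, 110]]
synthetic = smote_like_sample(high_risk)
```

Because every synthetic record is derived from existing minority cases, oversampling applied before the train/test split can leak information into the evaluation set, which is one reason SMOTE without external validation was treated as a risk-of-bias concern.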

Internal Validation Strategies

All 28 included studies employed some form of internal validation to assess model performance. K-fold cross-validation was the dominant approach, implemented in 22 studies (79%). The most common configuration was 10-fold cross-validation (n = 13 studies, 46%) [32,33,35,40,42,46,48,55,56,61,63,67,69], followed by 5-fold (n = 6 studies, 21%) [34,36,47,62,64,70] and other k values, including 3-fold and 8-fold (n = 3 studies, 11%) [49,58,68].

Nested cross-validation, which separates hyperparameter tuning from performance evaluation to avoid optimistic bias, was explicitly reported in 6 studies (21%) [32,40,47,55,69,70]. This methodologically rigorous approach provides more realistic performance estimates but was underutilized across the evidence base.
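The separation between tuning and evaluation can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: a one-feature k-nearest-neighbour classifier stands in for the studies' models, the fold scheme is a simple interleaved split, and the data are synthetic.

```python
def k_folds(n_items, n_folds):
    """Yield (train_indices, test_indices) pairs using interleaved folds."""
    for fold in range(n_folds):
        test = [i for i in range(n_items) if i % n_folds == fold]
        train = [i for i in range(n_items) if i % n_folds != fold]
        yield train, test


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (single feature)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def nested_cv(data, k_candidates=(1, 3, 5), outer_folds=4, inner_folds=3):
    """Mean outer-fold accuracy, with k tuned only on each outer-training set."""
    outer_scores = []
    for train_idx, test_idx in k_folds(len(data), outer_folds):
        outer_train = [data[i] for i in train_idx]
        outer_test = [data[i] for i in test_idx]

        def inner_score(k):
            # Inner loop: cross-validate within the outer-training data only.
            scores = [
                accuracy([outer_train[i] for i in tr], [outer_train[i] for i in va], k)
                for tr, va in k_folds(len(outer_train), inner_folds)
            ]
            return sum(scores) / len(scores)

        best_k = max(k_candidates, key=inner_score)
        # Outer loop: score the tuned model on data never used for tuning.
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return sum(outer_scores) / len(outer_scores)


# Synthetic data: (systolic BP, high-risk label), label = 1 when BP >= 140.
toy_data = [(100 + 2 * i, 1 if 100 + 2 * i >= 140 else 0) for i in range(40)]
nested_score = nested_cv(toy_data)
```

The key design point is that `best_k` is chosen without ever seeing `outer_test`, so the outer-fold scores are not inflated by the tuning step.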

Holdout validation, involving a single random split into training and test sets, was employed by 4 studies (14%) [37, 38, 43, 51, 52]. Bootstrap resampling for internal validation appeared in 2 studies (7%) [41, 45].

Temporal validation, testing models on data from later time periods than training data, was conducted by 3 studies (11%) [35, 49, 55], providing stronger evidence of model stability over time compared to cross-sectional splits.

External Validation

Only 11 studies (39%) conducted external validation using independent datasets not involved in model development. Geographic external validation, testing models in different countries or regions, was performed in 4 studies [40, 41, 46, 47], with the PIERS-ML investigations representing exemplary multicountry validation across high-, middle-, and low-income settings [46, 47]. Temporal external validation using more recent data was conducted by 3 studies [35, 49, 55]. External validation in different healthcare settings (e.g., models developed in tertiary hospitals tested in district hospitals) appeared in 2 studies [49,67]. Multi-domain external validation combining geographic and temporal dimensions was conducted by 2 studies [41, 46].

The limited prevalence of external validation represents a critical evidence gap, as internal validation performance typically overestimates real-world predictive accuracy. Absence of external validation contributed to high risk of bias ratings in the PROBAST analysis domain, particularly where synthetic oversampling was also used.

Calibration Assessment

Calibration, the agreement between predicted probabilities and observed outcome frequencies, was formally assessed in only 12 studies (43%). Calibration plots visually displaying predicted versus observed risk across deciles of predicted probability were presented in 8 studies (29%) [32,33,40,41,46,47,49,67]. Hosmer-Lemeshow goodness-of-fit test was reported in 5 studies (18%) [32,45,49,57,67]. Calibration slope and intercept, providing quantitative measures of calibration performance, were reported in 4 studies (14%) [45, 46, 47, 49]. Integrated calibration index or expected calibration error appeared in 2 studies (7%) [40,70].
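As an illustration of one of the calibration measures named above, the sketch below computes expected calibration error with equal-width probability bins. The binning scheme, function name, and example predictions are assumptions for illustration; individual studies may have used different binning choices.

```python
def expected_calibration_error(probabilities, outcomes, n_bins=10):
    """Weighted mean absolute gap between predicted risk and observed event rate."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("Need matching, non-empty probabilities and outcomes.")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    total = len(probabilities)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        # Each bin contributes in proportion to the share of predictions it holds.
        ece += len(members) / total * abs(mean_predicted - observed_rate)
    return ece


# Hypothetical predictions: low-risk group under-predicted, high-risk group over-predicted.
predicted = [0.15, 0.15, 0.15, 0.15, 0.85, 0.85, 0.85, 0.85]
observed = [0, 0, 0, 1, 1, 1, 1, 0]
ece = expected_calibration_error(predicted, observed)
```

In this hypothetical example the model ranks the two groups correctly, yet its predicted risks still differ from the observed event rates by about 10 percentage points in each group.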

The majority of studies (16/28, 57%) reported only discrimination metrics without calibration assessment [34–39,42–44,48,50–56,58–65,68,69,71]. This represents a significant limitation, as models with excellent discrimination (high AUROC) may demonstrate poor calibration, systematically overestimating or underestimating risk. For clinical deployment, particularly in high-stakes decisions regarding maternal care resource allocation, calibration is as important as discrimination.

Model Explainability and Interpretability

Interpretability methods enabling understanding of model predictions were employed in 13 studies (46%). SHapley Additive exPlanations (SHAP) values, providing feature-level contribution explanations for individual predictions, were implemented in 6 studies (21%) [32,33,56,61,67,70]. Feature importance rankings from tree-based models (Random Forest, gradient boosting) were reported in 11 studies (39%) [32,33,35,40,42,46,48,55,56,63,67–69]. Partial dependence plots showing marginal effects of individual predictors were presented in 3 studies (11%) [33,40,67]. Local Interpretable Model-agnostic Explanations (LIME) appeared in 2 studies (7%) [61,70].

Notably, 15 studies (54%) did not report any interpretability analysis beyond basic feature importance [34,36–39,43–45,47,49–54,57–60,62,64,65,68,69,71]. This limitation is particularly concerning for deep learning models, which are inherently less interpretable than tree-based or linear models. For AI systems intended to support clinical decision-making, explainability is essential for building trust, enabling clinical oversight, and identifying potential biases.

Model Performance

Discrimination Performance

Area under the receiver operating characteristic curve (AUROC) was reported in all 28 studies (100%), making it the only metric available for comparison across every study. Reported AUROC values ranged from 0.70 to 0.95 across all 89 models, with a median of 0.84 (interquartile range 0.80–0.88). Detailed model performance for each study is presented in Table 3 (located at the end of the document text file).
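For interpretation, AUROC equals the probability that a randomly selected outcome-positive case receives a higher predicted risk than a randomly selected outcome-negative case. A minimal Python sketch of this pairwise definition is given below; the scores and labels are hypothetical.

```python
def auroc(scores, labels):
    """AUROC as the share of positive-negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both outcome classes must be present.")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted risks and observed outcomes for eight women.
risk_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
outcomes = [1, 1, 0, 1, 0, 0, 1, 0]
example_auroc = auroc(risk_scores, outcomes)
```

The pairwise form is quadratic in sample size and is intended for exposition; rank-based implementations are preferable for large datasets.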

Performance varied systematically by validation rigor. Models evaluated only through internal cross-validation demonstrated a median AUROC of 0.86 (range 0.75–0.95, n = 17 studies) [32–39,42–44,48,50–56,58–65,68,69,71]. In contrast, models subjected to external validation demonstrated a median AUROC of 0.82 (range 0.70–0.90, n = 11 studies) in external datasets [35,40,41,46,47,49,55,67], representing a median decrease of 0.04 (4 percentage points) compared to internal validation performance within the same studies. This performance degradation is consistent with expected overfitting in internally validated models and underscores the importance of external validation for realistic performance estimation.

Performance also varied by sample size. Studies with fewer than 1,000 participants demonstrated a median AUROC of 0.89 (range 0.78–0.95, n = 7 studies) [36–38,50–52,71], but these investigations universally employed only internal validation and often used SMOTE or other synthetic oversampling, likely inflating performance estimates. Mid-sized studies (1,000-100,000 participants) showed a median AUROC of 0.84 (range 0.74–0.91, n = 16 studies). Large studies (> 100,000 participants) demonstrated a median AUROC of 0.83 (range 0.70–0.88, n = 5 studies) [40,41,49,55,68], with more modest but likely more realistic performance estimates derived from rigorous validation.

Setting-based comparisons revealed that LMIC-focused studies achieved a median AUROC of 0.84 (range 0.75–0.95, n = 20 studies) [32–39,46–48,50–54,56–64,67,71], comparable to HIC studies at 0.82 (range 0.70–0.89, n = 6 studies) [40, 41, 43–45, 49]. This equivalence suggests that predictive modeling is feasible across resource settings, though LMIC studies are less frequently conducted with external validation.

Algorithm-specific performance patterns emerged from comparative analyses. Ensemble methods achieved the highest median AUROC at 0.87 (range 0.82–0.93, n = 11 studies) [46,47,50,56,61–65,70], followed by gradient boosting at 0.85 (range 0.78–0.91, n = 10 studies) [35,40,42,46,47,56,61,63,67,70], Random Forest at 0.84 (range 0.75–0.90, n = 14 studies) [32,33,35,36,40,42,48,55,56,61,63,67–69], neural networks at 0.84 (range 0.76–0.94, n = 11 studies) [37,38,43,44,50–52,58,64,65,71], and traditional logistic regression at 0.78 (range 0.70–0.85, n = 15 studies) [33–35,39,41,42,45,49,53,54,57,59,60,68,69]. These differences support the value proposition for machine learning approaches, though direct comparisons are complicated by confounding between algorithm choice and study characteristics (e.g., ensemble methods were more common in recent, methodologically rigorous studies).

Sensitivity and specificity data were reported in 22 studies (79%). Median sensitivity was 81% (range 70–92%), while median specificity was 76% (range 65–85%). Operating point selection varied; some studies reported metrics at the threshold maximizing Youden's index (sensitivity + specificity − 1), while others selected thresholds prioritizing high sensitivity (relevant for screening applications where false negatives are costly) or high specificity (relevant when positive predictions trigger resource-intensive interventions). This heterogeneity limits the comparability of sensitivity/specificity values across studies.
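To make the operating-point issue concrete, the sketch below selects the threshold that maximizes Youden's index (sensitivity + specificity − 1) on a small made-up dataset. The function name `youden_threshold` and the data are illustrative assumptions; the included studies did not necessarily compute their thresholds this way.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, J, sensitivity, specificity) maximizing Youden's J."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one event and one non-event")
    best = None
    for t in sorted(set(y_score)):
        # A woman is flagged as high risk when her score is >= t
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < t)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if best is None or j > best[1]:
            best = (t, j, sens, spec)
    return best


# Toy, made-up labels (1 = adverse outcome) and predicted risks
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
threshold, j, sens, spec = youden_threshold(y_true, y_score)
print(threshold, round(j, 2), sens, spec)
```

A screening-oriented study would instead fix a minimum sensitivity and accept whatever specificity results, which is why sensitivity and specificity reported at different operating points are not directly comparable.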

Forest plot visualization of AUROC values stratified by validation approach is presented in Fig. 4, illustrating the systematic performance difference between internally and externally validated models.

Fig. 4

Forest plot of reported AUC values for included prediction models.

Each horizontal line represents the reported AUC range for a study, with the square marker indicating the midpoint. Most models achieved good discrimination (AUC 0.78–0.90), with the highest values (> 0.90) observed in small, internally validated datasets. Externally validated studies clustered more modestly (AUC ~ 0.82–0.88), suggesting these estimates may better reflect real-world performance.

Calibration Performance

Among the 12 studies reporting calibration metrics, performance was variable. Eight studies demonstrated good calibration based on visual inspection of calibration plots, with predicted and observed risks showing close agreement across risk deciles [32,33,40,41,46,47,49,67]. Four studies reported calibration deficiencies, including systematic overestimation of risk in low-risk groups [45], poor calibration in external validation despite good internal calibration [47], or failed Hosmer-Lemeshow tests indicating significant deviation between predicted and observed frequencies [57,67].

Calibration-in-the-large, comparing the overall mean predicted risk to observed outcome prevalence, was rarely quantified. Where reported, most models demonstrated reasonable agreement (observed/expected ratio 0.90–1.10), though two studies noted significant overprediction (observed/expected ratio 0.65–0.75) when applied to external populations [47, 55], highlighting the importance of recalibration when deploying models in new settings.
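As an illustration of calibration-in-the-large, the observed/expected (O/E) ratio divides the number of observed events by the sum of predicted risks; values below 1 indicate overprediction. The sketch below uses made-up data, and the function name `observed_expected_ratio` is an illustrative assumption.

```python
def observed_expected_ratio(y_true, predicted_risk):
    """O/E ratio: observed events divided by the sum of predicted risks."""
    observed = sum(y_true)
    expected = sum(predicted_risk)
    if expected == 0:
        raise ValueError("sum of predicted risks must be positive")
    return observed / expected


# Toy, made-up cohort: 2 events among 20 women, but the model expects about 3
y_true = [0] * 18 + [1] * 2
predicted_risk = [0.10] * 10 + [0.20] * 10
oe = observed_expected_ratio(y_true, predicted_risk)
print(round(oe, 2))  # below 1 means the model overpredicts risk
```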

The limited reporting of calibration represents a critical gap, as poorly calibrated models may provide misleading risk estimates despite acceptable discrimination, potentially causing harm through inappropriate resource allocation or false reassurance.

Clinical Utility Assessment

Decision curve analysis or net benefit calculation, quantifying clinical utility across different threshold probabilities for decision-making, was conducted in only 4 studies (14%) [40,46,47,70]. These analyses demonstrated that prediction models conferred net benefit compared to "treat all" or "treat none" strategies across clinically plausible threshold probabilities (typically 5–20% predicted risk). The paucity of clinical utility assessment limits understanding of how models would perform in real-world decision contexts.
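For orientation, net benefit at a threshold probability p_t is conventionally computed as TP/n − (FP/n) × p_t/(1 − p_t), and the "treat all" strategy corresponds to flagging every woman. The sketch below applies that standard formula to made-up data; the function names and values are illustrative assumptions, not a reproduction of any included study's analysis.

```python
def net_benefit(y_true, predicted_risk, threshold):
    """Net benefit of intervening on women with predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, predicted_risk) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, predicted_risk) if p >= threshold and y == 0)
    odds = threshold / (1 - threshold)  # harm of a false positive relative to a true positive
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(y_true, threshold):
    """Net benefit when every woman is treated as high risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy, made-up cohort of 10 women with 2 events
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted_risk = [0.30, 0.15, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.12, 0.25]
nb_model = net_benefit(y_true, predicted_risk, threshold=0.10)
nb_all = net_benefit_treat_all(y_true, threshold=0.10)
print(round(nb_model, 3), round(nb_all, 3))  # "treat none" has net benefit 0
```

Repeating the calculation over a grid of thresholds and plotting the results yields the decision curve.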

Reclassification metrics (net reclassification improvement, integrated discrimination improvement) assessing improvement over existing risk tools were reported in 3 studies (11%) [46, 47, 49], demonstrating that machine learning models provided modest but statistically significant improvement in risk classification compared to traditional scores.

Subgroup Performance

Subgroup analyses examining model performance across patient characteristics were conducted in 9 studies (32%). Rural versus urban comparisons were reported in 5 studies [32–34, 53, 54], with 3 demonstrating maintained performance in rural subgroups [32, 33, 34] and 2 showing modest performance decrements (AUROC 0.02–0.04 lower) in rural settings attributed to data sparsity for certain predictors [53, 54].

Parity-stratified analyses appeared in 3 studies [33,40,68], revealing that nulliparous women were generally easier to risk-stratify (higher AUROC) than multiparous women, likely reflecting more homogeneous risk profiles. Age-stratified analyses in 2 studies [33,67] showed optimal performance in middle reproductive age groups (25–35 years) with modest performance reduction in adolescents and women over 40 years.

These subgroup findings, though limited, suggest that model performance may vary across population segments, warranting careful evaluation in target deployment populations, particularly among high-risk groups most likely to benefit from predictive interventions.

Predictor Importance and Biological Plausibility

Among studies reporting predictor importance or contribution to model predictions (n = 17, 61%), several consistent patterns emerged. Blood pressure measurements (systolic and diastolic) ranked as the most important predictors in 14 studies [32,33,37,38,40,46,47,50–52,56,61,64,67,70], reflecting the central role of hypertensive disorders in maternal mortality etiology. Maternal age appeared among the top predictors in 12 studies [32–34,36,40,46,48,53,54,56,61,67], with both adolescent pregnancy (< 20 years) and advanced maternal age (≥ 35 years) conferring elevated risk.

Gestational age at assessment or delivery featured prominently in 11 studies [35,40–43,46,47,49,55,67,68], with preterm delivery (< 37 weeks) and especially extremely preterm delivery (< 28 weeks) strongly associated with adverse maternal outcomes through mechanisms including hemorrhage risk and emergency cesarean delivery complications. Parity emerged as important in 10 studies [32–34,36,40,46,48,53,54,67], with grand multiparity (≥ 5 previous births) consistently associated with increased mortality risk.

Antenatal care utilization, quantified as the number of ANC visits or gestational age at first visit, ranked highly in 9 studies [32–34,36,48,53,54,56,67], supporting causal pathways where inadequate ANC engagement leads to undetected complications and delayed intervention. Socioeconomic indicators, including education level and household wealth, appeared important in 8 studies [32–34,36,48,53,54,67], likely operating through multiple pathways including health literacy, nutrition, and healthcare access.

Vital signs beyond blood pressure, including heart rate and temperature, showed importance in IoMT studies employing high-frequency monitoring [37, 38, 50–52], though their incremental contribution beyond blood pressure in standard clinical settings remains unclear. Laboratory predictors (hemoglobin, blood glucose) demonstrated importance in facility-based studies where available [36, 42–44, 46, 47, 49], but their limited availability in rural settings constrains practical utility.

The convergence on blood pressure, maternal age, parity, gestational age, and ANC utilization as key predictors is biologically plausible and aligns with established maternal mortality risk factors, enhancing confidence in model validity. Notably, these predictors are routinely available in both facility and community settings, supporting feasibility for rural deployment.

Implementation Considerations for Rural Settings

Infrastructure and Deployment Modalities

Detailed implementation descriptions were provided in 9 studies (32%) [32, 34, 36–38, 50–54]. Infrastructure requirements varied substantially by approach. Cloud-based web applications requiring continuous internet connectivity were proposed in 3 studies [50, 51, 56], limiting applicability in rural settings with unreliable connectivity. Offline-capable mobile applications for smartphones or tablets were developed in 4 studies [32, 34, 36, 54], representing more feasible deployment models for rural contexts. SMS-based alert systems requiring only basic mobile phone access appeared in 2 studies [36, 54], offering the simplest technology requirements but limiting data collection capacity.

IoMT sensor-based approaches required specialized hardware, including wearable vital sign monitors, edge computing devices for local data processing, and intermittent internet connectivity for model updates and alert transmission [37, 38, 50–52]. While innovative, these approaches face challenges regarding device costs, maintenance requirements, and electricity access in remote facilities.

Integration with existing health information systems was described in 5 studies [32, 34, 46, 47, 54]. Successful integration models embedded prediction tools within electronic medical record workflows used by facility-based providers or within mobile health applications already deployed in community health worker programs, minimizing additional training burden and ensuring longitudinal data capture.

Workforce Training and Capacity Requirements

Six studies explicitly addressed training needs for health workers using prediction tools [32, 34, 36, 46, 47, 54]. Required training duration ranged from 2-hour orientation sessions for simple mobile applications [34, 36] to multi-day workshops for complex IoMT systems [50, 51]. Studies emphasized the importance of ongoing supportive supervision beyond initial training, continuous quality assurance of data entry, and mechanisms for addressing technical problems.

Task-shifting strategies, deploying predictive tools with community health workers or midwives rather than physicians, were piloted in 4 studies [32, 34, 36, 54]. These investigations demonstrated feasibility of non-physician use when interfaces were designed for low-literacy users, predictions were presented with clear action recommendations, and referral pathways were established for high-risk cases identified by models.

Digital literacy emerged as a potential barrier in rural settings with limited smartphone penetration or computer exposure. Two studies reported substantial initial training challenges that were overcome through iterative interface redesign emphasizing visual elements, minimizing text input, and providing in-app tutorials [36, 54].

Acceptability and Trust

Three studies formally assessed provider acceptability of AI decision support through surveys or interviews [36, 46, 54]. Key facilitators of acceptance included: transparency about how predictions were generated, provision of explanations identifying which patient characteristics drove risk assessments, alignment of AI recommendations with clinical intuition, and framing of tools as decision support rather than autonomous decision-making. Barriers included concerns about algorithmic accuracy, fear of deskilling, and resistance to workflows requiring additional data entry.

Patient acceptability was assessed in 2 studies [36, 54], revealing generally positive attitudes when prediction tools were perceived as enhancing rather than replacing clinician judgment, when privacy protections were clearly communicated, and when predictions led to tangible improvements in care quality, such as prioritized referrals or additional monitoring.

Cost and Sustainability

Only 3 studies provided cost information [36, 50, 54]. Initial development costs for bespoke prediction systems ranged from $15,000–75,000 USD, while implementation costs per facility ranged from $500–3,000 for mobile-based approaches and $5,000–15,000 for IoMT sensor networks. Recurrent costs for cloud hosting, model maintenance, and technical support were estimated at $1,000–5,000 annually per deployment site. These estimates suggest substantial upfront investment requirements that may be prohibitive for resource-constrained health systems without external funding.

Cost-effectiveness analyses were absent from the included literature. Economic evaluation comparing the costs of prediction system implementation against the costs of adverse maternal outcomes averted would substantially strengthen the implementation evidence base and inform policy decisions regarding resource allocation.

Ethical Considerations

Ethical dimensions of AI deployment in maternal health were explicitly discussed in 7 studies (25%) [32,40,46,47,54,56,67]. Privacy and data security concerns were most commonly addressed, with studies describing data encryption, secure transmission protocols, and anonymization procedures. Algorithmic bias and fairness considerations were raised in 5 studies [32,40,46,54,67], noting risks of systematic underprediction for marginalized subpopulations if training data lack adequate representation. Informed consent procedures for AI-supported care were described in 3 studies [46,54,67], though approaches varied regarding whether explicit consent for algorithmic prediction was obtained separately from general clinical care consent.

Importantly, 21 studies (75%) did not explicitly address ethical dimensions beyond standard research ethics board approvals for data use. This represents a significant gap, as deployment of AI systems in high-stakes clinical contexts raises distinct ethical considerations regarding transparency, accountability, bias, and patient autonomy that warrant systematic attention.

Discussion

Summary of Principal Findings

This systematic review synthesized evidence from 28 studies developing and validating AI-powered risk prediction models for maternal mortality and severe maternal outcomes, with a specific focus on applicability to rural and resource-limited settings in LMICs. The evidence base demonstrates that machine learning approaches can achieve good discrimination performance, with a median AUROC of 0.84, using routinely collected predictors available in low-resource contexts. However, methodological limitations, particularly inadequate external validation, limited calibration assessment, and incomplete reporting of analytical procedures, constrain confidence in generalizability and real-world performance.

Risk of bias assessment using PROBAST revealed that most studies employed appropriate participant selection, well-defined predictors, and clear outcome definitions. The primary methodological weakness was in the analysis domain, where small sample sizes, reliance on synthetic oversampling without external validation, absence of calibration reporting, and inadequate handling of missing data were prevalent. Only 39% of studies conducted external validation, representing a critical evidence gap given that internal validation typically overestimates performance by 0.04–0.08 AUROC points.

The predictor landscape was dominated by variables readily available in rural settings without laboratory infrastructure: maternal age, blood pressure, parity, gestational age, and antenatal care attendance. This finding is encouraging for rural deployment feasibility, as complex laboratory-dependent models demonstrated only modestly superior performance compared to models using basic clinical and demographic variables. Random Forest, ensemble methods, and gradient boosting emerged as the most effective algorithms, consistently outperforming traditional logistic regression by 0.04–0.09 AUROC points.

Implementation evidence remained sparse, with only one-third of studies providing details on deployment strategies, infrastructure requirements, or workforce integration. The limited evidence available suggests feasibility challenges in rural contexts related to connectivity, device availability, training needs, and sustainability costs. Importantly, no studies reported outcomes from real-world implementation, limiting understanding of how predictive performance translates into clinical impact.

Strengths and Limitations of Included Studies

Methodologically rigorous exemplars in the evidence base shared several characteristics that can guide future research. The PIERS-ML studies [46, 47] demonstrated gold-standard external validation across multiple countries spanning diverse resource settings, employed robust calibration assessment, and integrated explainability methods to support clinical interpretation. Large population-based investigations using DHS data [32, 33] and national registries [35, 40, 41] ensured representativeness, included substantial rural populations, and provided adequate statistical power to detect moderate effect sizes for individual predictors. Studies employing nested cross-validation and systematic hyperparameter optimization [32,40,55,69,70] minimized overfitting risk and provided more credible performance estimates.

Conversely, several limitations recurred across the evidence base. Small IoMT pilot studies [37,38,50–52,71], while technologically innovative, suffered from inadequate sample sizes (typically < 1,000 participants), yielding unstable performance estimates with wide confidence intervals and limited generalizability. Heavy reliance on SMOTE and other synthetic oversampling techniques without complementary external validation [48, 56, 61–64] likely inflated reported performance, as synthetic examples may not capture the true complexity and variability of minority class distributions in real populations. The near-universal absence of calibration assessment in studies reporting only discrimination metrics [34–39,42–44,48,50–56,58–65,68,69,71] represents a critical gap, as miscalibrated models may provide systematically biased risk estimates despite acceptable discrimination.

Reporting quality varied substantially. While discrimination metrics (AUROC, sensitivity, specificity) were universally reported, essential methodological details were frequently missing, including: specifics of hyperparameter tuning procedures (46% of studies), missing data handling approaches (36%), class imbalance strategies (36%), and predictor preprocessing steps (43%). This incomplete reporting impedes reproducibility, complicates quality assessment, and limits the ability of future researchers to build upon existing work.

Implications for Rural Healthcare Systems

Feasibility and Adaptability

Several findings support the feasibility of AI deployment in rural LMIC settings. First, the strong performance achieved using basic clinical and demographic variables—available through routine antenatal care without laboratory support—demonstrates that sophisticated prediction is possible without extensive diagnostic infrastructure [32–34, 36, 53, 54]. Blood pressure measurement, maternal age, parity, gestational age, and ANC visit history collectively captured the most predictive signal, with incremental gains from laboratory tests typically modest (0.02–0.04 AUROC improvement). This suggests that community-based prediction using data collected by trained health workers or midwives is technically viable.

Second, the comparable performance of LMIC-focused studies (median AUROC 0.84) to HIC studies (median AUROC 0.82) indicates that predictive modeling is not inherently dependent on resource-intensive healthcare systems [32–39,46–48,50–54,56–64,67,71]. While data quality and completeness differ across settings, the fundamental predictive relationships between risk factors and maternal outcomes appear sufficiently consistent to enable effective modeling in diverse contexts.

Third, the successful piloting of mobile-based and offline-capable implementations [32, 34, 36, 54] demonstrates technological pathways for rural deployment that accommodate connectivity constraints. Lightweight model architectures (decision trees, Random Forest with limited depth, logistic regression) can execute on mobile devices without cloud connectivity, enabling point-of-care predictions even in settings lacking reliable internet access.

However, significant barriers remain. Infrastructure constraints, including unreliable electricity, limited mobile device availability among health workers, and insufficient technical support for troubleshooting, impede deployment at scale [36, 50, 54]. The costs of implementation, while modest relative to facility construction or medical equipment, may be prohibitive for health systems operating on budgets below $50 per capita annually [36, 50, 54]. Workforce capacity limitations, including digital literacy gaps and time constraints on frontline health workers already managing heavy caseloads, create practical obstacles to integration of prediction tools into routine workflows [32, 34, 36, 54].

Integration with Existing Care Models

The most promising implementation models embedded AI tools within established care delivery platforms rather than introducing standalone systems. Integration with existing community health worker mobile health applications [32, 34, 36, 54] leveraged familiar interfaces, enabled incorporation of predictions into routine client interactions, and facilitated longitudinal tracking without additional data entry burden. Similarly, incorporation of predictive models into antenatal care registers and electronic medical records [46, 47, 49] aligned with standard clinical documentation practices.

Task-shifting strategies, deploying prediction tools with community health workers or midwives rather than requiring physician interpretation, demonstrated feasibility in pilot implementations [32, 34, 36, 54]. This approach is particularly relevant for rural settings where physician availability is limited, provided that clear action protocols accompany predictions (e.g., immediate referral for high-risk classifications, enhanced monitoring schedules for moderate risk, routine care for low risk) and that referral pathways and transportation mechanisms are functional.
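The action protocol described above can be sketched as a simple mapping from predicted risk to a recommended action. The cut-offs used here (0.05 and 0.20) are placeholders chosen only for illustration and are not clinically validated; any real deployment would need locally agreed thresholds. The function name `triage` is likewise hypothetical.

```python
def triage(predicted_risk, moderate_cutoff=0.05, high_cutoff=0.20):
    """Map a predicted risk to (tier, action). Cut-offs are illustrative placeholders."""
    if not 0.0 <= predicted_risk <= 1.0:
        raise ValueError("predicted_risk must be a probability between 0 and 1")
    if predicted_risk >= high_cutoff:
        return "high", "immediate referral"
    if predicted_risk >= moderate_cutoff:
        return "moderate", "enhanced monitoring schedule"
    return "low", "routine care"


for risk in (0.01, 0.10, 0.30):
    tier, action = triage(risk)
    print(f"{risk:.2f} -> {tier}: {action}")
```

Keeping the mapping explicit and separate from the model makes it straightforward for programme managers to review and adjust the protocol without retraining the model.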

The integration of explainability methods, particularly SHAP values identifying which patient characteristics drive individual risk predictions [32,33,56,61,67,70], emerged as critical for building provider trust and enabling clinical oversight. When health workers understand why a woman is classified as high risk (for example, "elevated blood pressure and grand multiparity are increasing risk"), they can better assess prediction plausibility, identify data entry errors, and make informed decisions about whether to follow algorithmic recommendations.
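As a simplified illustration of per-patient attribution, consider a logistic regression model: each predictor's contribution to the log-odds, relative to an average patient, is its coefficient multiplied by the patient's deviation from the mean. Under the assumption of independent features this coincides with SHAP values for a linear model. The coefficients, means, and patient values below are entirely hypothetical and are not taken from any included study.

```python
def linear_contributions(coefficients, feature_means, patient):
    """Per-feature contribution to log-odds relative to an average patient."""
    return {
        name: coefficients[name] * (patient[name] - feature_means[name])
        for name in coefficients
    }


# Hypothetical model and patient, for illustration only
coefficients = {"systolic_bp": 0.04, "parity": 0.25, "anc_visits": -0.20}
feature_means = {"systolic_bp": 120, "parity": 2, "anc_visits": 4}
patient = {"systolic_bp": 160, "parity": 6, "anc_visits": 4}

contrib = linear_contributions(coefficients, feature_means, patient)
ranked = sorted(contrib.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    print(f"{name}: {value:+.2f} log-odds")
```

In this toy case the ranking would surface elevated blood pressure and grand multiparity as the main drivers. Tree-based models require dedicated attribution algorithms rather than this closed form.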

Methodological Gaps and Research Priorities

External Validation Imperative

The finding that only 39% of studies conducted external validation represents the most significant methodological gap in the evidence base. Internal cross-validation, while useful for model selection and hyperparameter tuning, provides optimistic performance estimates that systematically overestimate generalizability [72]. The median performance decrement of 0.04 AUROC observed when models were externally validated underscores this limitation.

Future research should prioritize external validation as a prerequisite for publication. Specifically, models should be tested in: (1) geographically distinct populations to assess transferability across health systems and maternal risk profiles; (2) temporally separate cohorts to evaluate stability over time and robustness to changing clinical practices; and (3) different healthcare settings (e.g., models developed in tertiary hospitals tested in district facilities or community contexts) to assess applicability across the continuum of care. Multi-site collaborative studies, enabled by data sharing agreements and potentially federated learning approaches that preserve data privacy [73], offer pathways to rigorous external validation without requiring individual institutions to share sensitive patient data.

Calibration and Clinical Utility

The limited reporting of calibration assessment and the complete absence of prospective clinical utility evaluation represent critical evidence gaps. Even models with excellent discrimination may be poorly calibrated, systematically overestimating or underestimating absolute risk [74]. For clinical deployment, particularly in resource allocation decisions (e.g., which women receive limited ambulance transport capacity, which facilities receive targeted supplies), calibration is as important as discrimination.

Future studies should routinely report: (1) calibration plots showing agreement between predicted and observed risk across risk strata; (2) calibration slope and intercept quantifying systematic miscalibration; (3) expected calibration error or integrated calibration index providing summary calibration metrics; and (4) decision curve analysis quantifying net benefit across clinically relevant decision thresholds. Where models demonstrate poor calibration in external settings, recalibration techniques, including adjustment of intercept, slope recalibration, or full model updating, should be evaluated [75].
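Two of the quantities listed above can be sketched briefly: expected calibration error (ECE) with equal-width risk bins, and the simplest recalibration step, an intercept shift on the logit scale chosen so that mean predicted risk matches observed prevalence. The function names, bin count, and data are illustrative assumptions; the toy data were constructed so that a pure intercept shift removes the miscalibration, which will not generally hold in practice.

```python
import math


def expected_calibration_error(y_true, predicted_risk, n_bins=10):
    """Weighted mean absolute gap between predicted and observed risk per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, predicted_risk):
        idx = min(int(p * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if members:
            observed = sum(y for y, _ in members) / len(members)
            predicted = sum(p for _, p in members) / len(members)
            ece += len(members) / n * abs(observed - predicted)
    return ece


def recalibrate_intercept(y_true, predicted_risk, tol=1e-10):
    """Shift all logits by delta so mean recalibrated risk equals prevalence.

    Assumes every predicted risk and the prevalence lie strictly in (0, 1).
    """
    target = sum(y_true) / len(y_true)
    logits = [math.log(p / (1 - p)) for p in predicted_risk]

    def shifted(delta):
        return [1 / (1 + math.exp(-(z + delta))) for z in logits]

    lo, hi = -20.0, 20.0
    while hi - lo > tol:  # bisection: mean risk increases with delta
        mid = (lo + hi) / 2
        if sum(shifted(mid)) / len(logits) < target:
            lo = mid
        else:
            hi = mid
    delta = (lo + hi) / 2
    return delta, shifted(delta)


# Toy, made-up external cohort in which the model overpredicts risk
y_true = [1] + [0] * 9 + [1] * 4 + [0] * 9
predicted_risk = [0.2] * 10 + [0.5] * 13
ece_before = expected_calibration_error(y_true, predicted_risk)
delta, recalibrated = recalibrate_intercept(y_true, predicted_risk)
ece_after = expected_calibration_error(y_true, recalibrated)
print(round(ece_before, 3), round(delta, 3), round(ece_after, 3))
```

An intercept shift corrects only calibration-in-the-large; when the calibration slope also departs from 1, slope recalibration or full model updating would be needed.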

Importantly, predictive performance metrics—even when rigorously evaluated—provide limited insight into clinical impact. Prospective implementation studies with randomized or stepped-wedge designs are needed to evaluate whether AI-supported risk prediction actually improves maternal outcomes through mechanisms including: earlier identification and referral of high-risk women, more efficient allocation of limited resources, enhanced provider decision-making, or patient empowerment through risk communication. Several ongoing trials are evaluating these questions [76], and their results will be critical for evidence-based implementation.

Addressing Data Representativeness

A pervasive concern across the evidence base is the potential for algorithmic bias arising from unrepresentative training data. Rural populations, ethnic minorities, women with limited healthcare access, and other marginalized groups may be underrepresented in facility-based datasets that form the foundation for many prediction models [40,54,67]. If models are trained primarily on women receiving facility-based care, they may systematically underperform for women without regular healthcare contact—ironically, those at highest mortality risk.

Strategies to enhance data representativeness include: (1) intentional oversampling of rural and marginalized populations during data collection; (2) population-based surveillance systems capturing outcomes regardless of healthcare utilization; (3) linkage of facility records with community-based data from health worker home visits; and (4) fairness-aware machine learning approaches that explicitly optimize for equitable performance across subgroups [77]. Model evaluation should routinely include subgroup analyses examining performance across rural-urban, socioeconomic, parity, and age strata to identify potential disparities.

Standardization and Reporting Guidelines

The substantial heterogeneity in reporting quality and methodological approaches complicates synthesis and limits reproducibility. The development of reporting guidelines specific to AI/ML prediction models in maternal health, building on TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) [78] and PROBAST frameworks [31], would enhance transparency and comparability. Key reporting elements should include: complete predictor definitions with measurement protocols, detailed missing data handling procedures, specific hyperparameter values and optimization strategies, full specification of cross-validation schemes, discrimination and calibration metrics with confidence intervals, and subgroup performance analyses.

Standardization of outcome definitions is also needed. The current evidence base encompasses direct maternal mortality, composite severe morbidity, multi-category risk scores, and surrogate outcomes, impeding cross-study comparison and meta-analysis. While capturing the full spectrum from risk to outcome is valuable, studies should clearly distinguish between prediction of: (1) mortality (the ultimate outcome of interest), (2) validated severe morbidity composites (objectively defined outcomes with established clinical significance), and (3) risk scores (intermediate outcomes requiring validation against hard endpoints).

Rural-Ready AI Framework

Based on the synthesis of implementation evidence and the identification of current limitations, we propose a rural-ready framework for maternal mortality prediction that addresses the identified gaps:

Data Layer: Federated Multi-Site Collaboration

Federated learning approaches [73] enable collaborative model training across multiple sites while preserving data privacy and sovereignty. Rather than centralizing sensitive patient data, federated algorithms iteratively train local models at each participating site, share only model parameters (not patient data) with a central coordinating server, and aggregate these parameters into a global model. This approach offers several advantages for maternal health: (1) preservation of data privacy compliant with regulations and ethical norms; (2) inclusion of data from multiple rural facilities and community programs that individually have insufficient sample sizes; (3) learning from diverse populations enhancing generalizability; and (4) respect for institutional data ownership concerns that impede traditional data sharing.
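The aggregation step described above can be sketched as a sample-size-weighted average of locally trained parameters, the scheme commonly known as federated averaging. Weighting by sample size is an assumption here, as the framework does not prescribe a specific aggregation rule; the function name and the site values are illustrative.

```python
def federated_average(site_updates):
    """Aggregate local parameters into global parameters.

    site_updates: list of (n_samples, parameter_list) tuples, one per site.
    Only these summaries leave each site; patient-level data never do.
    """
    if not site_updates:
        raise ValueError("at least one site update is required")
    total = sum(n for n, _ in site_updates)
    dim = len(site_updates[0][1])
    global_params = [0.0] * dim
    for n, params in site_updates:
        if len(params) != dim:
            raise ValueError("all sites must share the same model structure")
        for i, w in enumerate(params):
            global_params[i] += (n / total) * w  # larger sites weigh more
    return global_params


# Toy, made-up updates from two sites (e.g., logistic regression coefficients)
site_updates = [
    (100, [1.0, 0.0]),  # small rural facility
    (300, [3.0, 4.0]),  # larger district hospital
]
global_params = federated_average(site_updates)
print(global_params)
```

In a full training loop the server would send the aggregated parameters back to each site for another round of local training, repeating until convergence.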

Implementation of federated learning for maternal health prediction would require: establishment of multi-site research networks with harmonized data collection protocols; development of interoperable data standards enabling cross-site model training despite heterogeneous electronic systems; technical infrastructure for secure parameter exchange; and governance frameworks addressing intellectual property, authorship, and benefit sharing.

Model Layer: Lightweight, Interpretable Architectures

For rural deployment on mobile devices with limited computational resources and offline operation requirements, model architecture matters. Decision trees, Random Forest with limited ensemble size and tree depth, and regularized logistic regression offer optimal trade-offs between predictive performance and computational efficiency [32, 33, 36, 54]. These models can execute in milliseconds on standard smartphones, require minimal memory, and, critically, provide interpretable outputs that support clinical oversight.

Deep learning architectures, while achieving excellent performance in some studies [37,38,51,52,58,64,65,71], require substantially more computational resources, preclude true offline operation without edge computing hardware, and lack inherent interpretability. Their deployment should be reserved for settings with reliable connectivity and computational infrastructure, or restricted to high-frequency physiological monitoring applications (e.g., continuous cardiotocography analysis) where their pattern recognition capabilities offer unique advantages.

Ensemble methods combining multiple algorithms [46,47,50,56,61–65,70], while typically achieving the best discrimination, introduce complexity that may impede maintenance and updating. Hybrid approaches using ensembles for initial model development but distilling into simpler models for deployment [79] offer potential pathways to capture ensemble benefits while maintaining deployment feasibility.

Implementation Layer: Integration and Capacity Building

Successful rural deployment requires integrating prediction tools into existing workflows rather than introducing parallel systems. Specifically:

Antenatal care integration

Embed prediction at routine ANC contacts (first visit, 20 weeks, 28 weeks, 36 weeks) when data are naturally collected, providing risk assessments that inform care planning and referral decisions [32, 34, 46, 47, 54].

Community health worker workflows

Integrate prediction into mobile health applications already used by community health workers for routine household visits, pregnancy tracking, and postnatal follow-up [32, 34, 36, 54]. Predictions can trigger enhanced home visit schedules for high-risk women or alert supervisors when facility referral is indicated.

Referral systems

Link predictions to structured referral protocols, potentially including automated alert generation to receiving facilities, transportation coordination, and feedback loops confirming referral completion [36, 54].

Capacity building

Develop structured training curricula addressing AI basics, interpretation of predictions, data quality importance, and critical thinking about algorithmic recommendations [32, 34, 36, 54]. Training should emphasize that AI provides decision support, not autonomous decisions, with human clinical judgment remaining paramount.
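One way the workflow integration described above could be expressed in software is a small triage rule that maps a predicted risk to a follow-up action while letting the health worker override it. This is a sketch only: the two thresholds, the action labels, and the `TriageDecision` structure are hypothetical, and real cut-offs would need local calibration and clinical agreement.

```python
from dataclasses import dataclass

# Hypothetical cut-offs; real thresholds would need local validation and clinical sign-off.
ENHANCED_VISIT_THRESHOLD = 0.10
REFERRAL_THRESHOLD = 0.30


@dataclass
class TriageDecision:
    action: str
    notify_supervisor: bool
    note: str


def triage(risk, clinician_override=None):
    """Map a predicted probability to a follow-up action.

    A health worker's override always takes precedence over the model output.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability between 0 and 1")
    if clinician_override is not None:
        return TriageDecision(clinician_override, True, "clinician override recorded")
    if risk >= REFERRAL_THRESHOLD:
        return TriageDecision(
            "refer_to_facility", True, "alert receiving facility and confirm arrival"
        )
    if risk >= ENHANCED_VISIT_THRESHOLD:
        return TriageDecision(
            "enhanced_home_visits", False, "increase home visit frequency"
        )
    return TriageDecision("routine_schedule", False, "continue routine ANC contacts")


decisions = [triage(r) for r in (0.05, 0.10, 0.45)]
overridden = triage(0.90, clinician_override="routine_schedule")
```

Recording overrides and notifying a supervisor when they occur keeps the human decision paramount while still producing data for later evaluation of how the tool is used.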

Evaluation Layer: Implementation Science

Beyond evaluating predictive performance, implementation research is needed to address: (1) fidelity of implementation (whether tools are used as intended, data quality in real-world settings, adherence to algorithms); (2) clinical impact (maternal outcomes, referral patterns, health system efficiency); (3) equity implications (whether benefits accrue to all population segments or concentrate among advantaged groups); (4) economic outcomes (cost-effectiveness, budget impact); and (5) unintended consequences (provider deskilling, over-reliance on algorithms, patient anxiety from risk labeling).

Mixed-methods approaches combining quantitative outcome evaluation with qualitative investigation of provider and patient experiences, organizational factors influencing adoption, and contextual determinants of implementation success will generate evidence to guide scale-up decisions [80].

Limitations of This Review

Several limitations warrant acknowledgment. First, publication bias may affect the evidence base, as studies with negative findings or failed implementations may be less likely to be published. The concentration of studies from academic institutions with research capacity may underrepresent practical challenges encountered in routine health system implementation. Second, heterogeneity in populations, predictors, outcomes, and methodologies precluded quantitative meta-analysis, limiting our ability to generate pooled performance estimates or conduct formal comparative effectiveness analyses. The narrative synthesis, while comprehensive, is inherently more subjective than meta-analytic approaches.

Third, despite extensive searching, some relevant studies may have been missed, particularly unpublished pilot implementations, government reports, or regional publications in non-English languages. Fourth, the rapid evolution of AI technologies means that newer approaches may be in development but not yet published, potentially dating this synthesis even at time of publication. Fifth, the focus on rural applicability meant that some high-quality studies conducted exclusively in urban tertiary hospitals without discussion of rural transferability were excluded, potentially limiting methodological insights.

Sixth, risk of bias assessment using PROBAST, while systematic and reproducible, involves judgment calls where signaling questions admit multiple interpretations. Different reviewers might reach alternative conclusions on borderline cases. Finally, the absence of prospective implementation trials in the evidence base means that conclusions about real-world effectiveness remain speculative, extrapolated from predictive performance rather than demonstrated through clinical outcomes.

Future Research Directions

Several research priorities emerge from this synthesis:

Methodological priorities:

External validation of existing models in diverse rural settings before deployment

Development of standardized calibration assessment and reporting practices

Establishment of minimum sample size guidelines for maternal mortality prediction to prevent underpowered studies

Comparative effectiveness research directly comparing AI approaches to conventional risk assessment tools

Fairness-aware algorithm development explicitly optimizing for equitable performance across population subgroups
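As an illustration of what a minimal calibration report might contain, the sketch below computes the Brier score, calibration-in-the-large (mean predicted risk minus observed event rate), and a binned comparison of predicted versus observed risk. The choice of these three summaries, the function names, and the toy data are assumptions for illustration and do not represent an agreed reporting standard.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_in_the_large(y_true, y_prob):
    """Mean predicted risk minus observed event rate (0 is ideal)."""
    n = len(y_true)
    return sum(y_prob) / n - sum(y_true) / n


def calibration_table(y_true, y_prob, n_bins=5):
    """Group predictions into equal-width risk bins and compare predicted vs observed."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bins rather than report undefined rates
        rows.append({
            "bin": (i / n_bins, (i + 1) / n_bins),
            "n": len(members),
            "mean_predicted": sum(p for _, p in members) / len(members),
            "observed_rate": sum(y for y, _ in members) / len(members),
        })
    return rows


# Toy outcomes and predicted risks, invented for illustration.
y_true = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.30, 0.35, 0.55, 0.62, 0.65, 0.85, 0.90]

brier = brier_score(y_true, y_prob)
citl = calibration_in_the_large(y_true, y_prob)
table = calibration_table(y_true, y_prob)
```

With rare outcomes such as maternal death, most predictions fall in the lowest bins, so quantile-based bins or a smoothed calibration curve may be more informative than the equal-width bins used here.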

Implementation priorities:

Prospective trials evaluating clinical impact of AI-supported risk prediction on maternal outcomes

Cost-effectiveness analyses from health system and societal perspectives

Mixed-methods implementation studies examining adoption barriers and facilitators

Development of sustainable models for algorithm maintenance, updating, and quality assurance

Evaluation of task-shifting strategies deploying AI tools with community health workers

Data infrastructure priorities:

Establishment of multi-site research networks enabling federated learning

Development of interoperable data standards for maternal health prediction

Investment in longitudinal data systems linking antenatal, delivery, and postpartum records

Community-based surveillance to enhance the representativeness of training data

Open science initiatives sharing de-identified datasets and trained models to accelerate progress

Ethical and equity priorities:

Development of ethical frameworks specific to AI deployment in maternal health

Community engagement approaches ensuring that affected populations participate in algorithm development

Mechanisms for algorithmic accountability and ongoing bias monitoring

Research on patient preferences regarding AI involvement in maternal care decisions

Evaluation of equity implications across socioeconomic and geographic strata

Policy and Practice Implications

For policymakers and health system leaders considering AI adoption for maternal mortality reduction, several evidence-based recommendations emerge:

Prioritize integration over innovation

Focus on embedding validated prediction models into existing care delivery platforms and health information systems rather than implementing standalone digital solutions that may not achieve sustainable adoption.

Invest in data infrastructure

Recognize that effective AI deployment requires foundational investments in digital health infrastructure, including mobile connectivity, device availability, and electronic health records. These investments yield benefits beyond prediction models, enabling multiple digital health applications.

Emphasize external validation

Require independent validation of prediction models in local populations before procurement or deployment decisions. Performance claims based solely on internal validation should be viewed skeptically.

Build local capacity

Invest in training health informatics specialists and data scientists within LMICs to lead model development, adaptation, and evaluation rather than relying exclusively on external technical assistance. South-South collaboration and knowledge exchange should be facilitated.

Ensure equity focus

Implement monitoring frameworks tracking AI performance across population subgroups, with explicit attention to rural, poor, and marginalized women. Deployment decisions should be informed by equity impact assessments.
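A monitoring framework of this kind could start from something as simple as reporting discrimination and sensitivity separately for each subgroup. The sketch below assumes records of the form (subgroup, outcome, predicted risk); the subgroup labels, the 0.3 alert threshold, and the data are hypothetical.

```python
from collections import defaultdict


def auroc(y_true, y_score):
    """Probability that a random case outranks a random non-case (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity(y_true, y_score, threshold):
    """Share of true cases flagged at or above the alert threshold."""
    flagged = [s >= threshold for y, s in zip(y_true, y_score) if y == 1]
    return sum(flagged) / len(flagged) if flagged else None


def subgroup_report(records, threshold=0.3):
    """records: iterable of (subgroup, outcome, predicted_risk)."""
    groups = defaultdict(lambda: ([], []))
    for group, y, score in records:
        groups[group][0].append(y)
        groups[group][1].append(score)
    return {
        g: {
            "n": len(ys),
            "auroc": auroc(ys, ss),
            "sensitivity": sensitivity(ys, ss, threshold),
        }
        for g, (ys, ss) in groups.items()
    }


# Invented records for illustration.
records = [
    ("urban", 1, 0.80), ("urban", 1, 0.60), ("urban", 0, 0.20), ("urban", 0, 0.10),
    ("rural", 1, 0.40), ("rural", 1, 0.15), ("rural", 0, 0.25), ("rural", 0, 0.05),
]
report = subgroup_report(records)
```

In this toy data the model misses half of the rural cases at the chosen threshold even though overall performance looks acceptable, which is the kind of gap an equity impact assessment is meant to surface.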

Start small, evaluate rigorously

Pilot implementations in controlled settings with rigorous evaluation before scale-up. Implementation should follow phased approaches, allowing iterative refinement based on lessons learned.

Maintain clinical primacy

Position AI as decision support for health workers, not autonomous decision-making. Preserve and strengthen clinical judgment, ensuring that algorithmic recommendations can be overridden when clinical context warrants.

For researchers, this review highlights the maturity of predictive performance evaluation but the immaturity of implementation science in this domain. The field would benefit from shifting emphasis from marginal improvements in AUROC toward understanding how to deploy effective models at scale, ensuring benefits reach the rural populations bearing the greatest mortality burden.

Conclusions

This systematic review synthesized evidence from 28 studies developing and validating AI-powered risk prediction models for preventable maternal mortality, with a specific focus on rural and low-resource settings. The evidence demonstrates that machine learning approaches can achieve good predictive performance (median AUROC 0.84) using routinely collected clinical and demographic variables available in resource-constrained contexts. Blood pressure, maternal age, parity, gestational age, and antenatal care utilization emerge as key predictors consistently identified across diverse populations and settings.

However, significant methodological limitations constrain confidence in generalizability and readiness for widespread deployment. Only 39% of studies conducted external validation, calibration assessment was limited to 43% of studies, and real-world implementation evidence is virtually absent. Risk of bias assessment using PROBAST revealed that while participant selection, predictor measurement, and outcome definitions were generally rigorous, analytical approaches frequently suffered from inadequate sample sizes, inappropriate handling of class imbalance, and insufficient validation procedures.

The sparse implementation evidence suggests both promise and challenges for rural deployment. Models requiring only basic clinical measurements without laboratory support demonstrate feasibility for community-based prediction. Mobile-based and offline-capable implementations offer technological pathways accommodating connectivity constraints. Integration with community health worker workflows and existing care platforms appears more promising than standalone system deployment. However, infrastructure limitations, training requirements, sustainability costs, and the need for supportive health system contexts (particularly functional referral systems) remain substantial barriers.

For AI-powered risk prediction to meaningfully contribute to maternal mortality reduction in rural LMIC settings, several imperatives emerge: prioritizing external validation and calibration assessment in methodological standards; investing in federated learning infrastructure enabling privacy-preserving multi-site collaboration; developing lightweight, interpretable model architectures suitable for mobile deployment; integrating prediction tools into existing care workflows rather than parallel systems; building local capacity for algorithm development and adaptation; and conducting rigorous implementation trials evaluating clinical impact, cost-effectiveness, and equity implications.

The preventable maternal mortality crisis demands innovation, but innovation must be accompanied by methodological rigor, implementation science, and unwavering commitment to equity. AI technologies hold genuine potential to enhance maternal risk identification and improve resource allocation in settings where both are currently inadequate. Realizing this potential requires addressing the evidence gaps identified in this review: validating models in target deployment populations, demonstrating calibration and clinical utility, piloting implementations with mixed-methods evaluation, and ensuring that algorithmic solutions amplify rather than replace the clinical judgment and compassionate care that remain central to maternal health.

The path from algorithm to impact is long and requires sustained collaboration among data scientists, clinicians, implementation scientists, health system leaders, and the communities these technologies aim to serve. With appropriate methodological rigor, contextual adaptation, and equity focus, AI-powered maternal mortality prediction can evolve from a promising innovation into an evidence-based tool that contributes to the global goal of eliminating preventable maternal deaths.

List of Abbreviations

ANC

Antenatal care

AI

Artificial intelligence

AUROC

Area under the receiver operating characteristic curve

CNN

Convolutional neural network

CRVS

Civil registration and vital statistics

DHS

Demographic and Health Surveys

EHR

Electronic health record

HIC

High-income country

IoMT

Internet of Medical Things

IoT

Internet of Things

LIME

Local Interpretable Model-agnostic Explanations

LMIC

Low- and middle-income country

LSTM

Long short-term memory

MEOWS

Modified Early Obstetric Warning Score

ML

Machine learning

MMR

Maternal mortality ratio

PIERS

Pre-eclampsia Integrated Estimate of RiSk

PRISMA

Preferred Reporting Items for Systematic Reviews and Meta-Analyses

PROBAST

Prediction model Risk Of Bias Assessment Tool

PROSPERO

International Prospective Register of Systematic Reviews

ROC

Receiver operating characteristic

SHAP

SHapley Additive exPlanations

SMOTE

Synthetic Minority Oversampling Technique

TRIPOD

Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis

WHO

World Health Organization

Declarations

Ethics Approval and Consent to Participate

Not applicable. This systematic review analyzed data from previously published studies and did not involve primary data collection from human participants.

Consent for Publication

Not applicable.

Availability of Data and Materials

All data extracted during this systematic review are included in the published article. The search strategies, data extraction forms, and PROBAST assessment worksheets are available from the corresponding author upon reasonable request. The protocol for this systematic review is registered with PROSPERO (ID: CRD420251174343) and is publicly available at https://www.crd.york.ac.uk/PROSPERO/view/CRD420251174343

Competing Interests

The authors declare that they have no competing interests.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Joy Aifuobhokhan, MD,* Lakeshore Cancer Center

Ayodeji Ogunjinmi, MD, Bingham University Teaching Hospital

Chukwuemeka Abraham Agbarakwe, MD, Calvary Specialist Hospital

Deborah Oladunmolu Oduguwa, MD, Babcock University Teaching Hospital

Annie Peter Essiet, MD, Tehilah Children's Hospital

Temitayo Osunkiyesi, MD, Trilogy

Akinbogun Modesire, MD, Babcock University Teaching Hospital

Corresponding Author: Joy Aifuobhokhan, joyaifuobhokhan@gmail.com

Authors' Contributions

J.A conceptualized the study. A.O and C.A.A contributed to the design of the study. D.O.O and A.P.E contributed to the acquisition and analysis of data, while T.O and A.M contributed to the interpretation of data. J.A, C.A.A, and D.O.O drafted the manuscript, and A.O, A.P.E, T.O, and A.M substantively revised it. J.A developed the search strategy in consultation with A.O and C.A.A. D.O.O and A.P.E screened the studies, assessed their eligibility, and assessed the quality of the included studies in consultation with J.A and T.O. T.O and A.M analyzed the data and created the figures in consultation with J.A and C.A.A. J.A is responsible for data management and storage. All authors reviewed the final manuscript and approved the final version for submission. All authors have agreed both to be personally accountable for their own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which they were not personally involved, are appropriately investigated and resolved, and the resolution documented in the literature.

Acknowledgements

The authors have no acknowledgements to declare.

References

1.

Al Mashrafi SS, Tafakori L, Abdollahian M (2024) Predicting maternal risk level using machine learning models. BMC Pregnancy Childbirth 24:820. https://doi.org/10.1186/s12884-024-07030-9

2.

Montgomery-Csobán T, Kavanagh K, Murray P et al (2024) Machine learning-enabled maternal risk assessment for women with pre-eclampsia (the PIERS-ML model): a modelling study. Lancet Digit Health 6(4):e238–e250. https://doi.org/10.1016/S2589-7500(23)00267-4

3.

Shukla VV, Eggleston B, Ambalavanan N et al (2020) Predictive Modeling for Perinatal Mortality in Resource-Limited Settings. JAMA Netw Open 3(11):e2026750. https://doi.org/10.1001/jamanetworkopen.2020.26750

4.

Khadidos AO, Saleem F, Selvarajan S et al (2024) Ensemble machine learning framework for predicting maternal health risk during pregnancy. Sci Rep 14:21483. https://doi.org/10.1038/s41598-024-71934-x

5.

Malde A, Prabhu VG, Banga D, Hsieh M, Renduchintala C, Pirrallo R (2025) A Machine Learning Approach for Predicting Maternal Health Risks in Lower-Middle-Income Countries Using Sparse Data and Vital Signs. Future Internet 17(5):190. https://doi.org/10.3390/fi17050190

6.

Shukla VV, Eggleston B, Ambalavanan N et al (2020) Predictive Modeling for Perinatal Mortality in Resource-Limited Settings. JAMA Netw Open 3(11):e2026750. https://doi.org/10.1001/jamanetworkopen.2020.26750

7.

Montgomery-Csobán T, Kavanagh K, Murray P et al (2024) Machine learning-enabled maternal risk assessment for women with pre-eclampsia (the PIERS-ML model): a modelling study. Lancet Digit Health 6(4):e238–e250. https://doi.org/10.1016/S2589-7500(23)00267-4

8.

Malacova E, Tippaya S, Bailey HD et al (2020) Stillbirth risk prediction using machine learning for a large cohort of births from Western Australia, 1980–2015. Sci Rep 10:5354. https://doi.org/10.1038/s41598-020-62210-9

9.

Trudell AS, Tuuli MG, Colditz GA, Macones GA, Odibo AO (2017) A stillbirth calculator: Development and internal validation of a clinical prediction model to quantify stillbirth risk. PLoS ONE 12(3):e0173461. https://doi.org/10.1371/journal.pone.0173461

10.

Podda M, Bacciu D, Micheli A et al (2018) A machine learning approach to estimating preterm infants survival: development of the Preterm Infants Survival Assessment (PISA) predictor. Sci Rep 8:13743. https://doi.org/10.1038/s41598-018-31920-6

11.

Koivu A, Sairanen M (2020) Predicting risk of stillbirth and preterm pregnancies with machine learning. Health Inf Sci Syst 8(1):14. https://doi.org/10.1007/s13755-020-00105-9

12.

Lee J, Cai J, Li F, Vesoulis ZA (2021) Predicting mortality risk for preterm infants using random forest. Sci Rep 11(1):7308. https://doi.org/10.1038/s41598-021-86748-4

13.

Hsu JF, Chang YF, Cheng HJ et al (2021) Machine Learning Approaches to Predict In-Hospital Mortality among Neonates with Clinically Suspected Sepsis in the Neonatal Intensive Care Unit. J Pers Med 11(8):695. https://doi.org/10.3390/jpm11080695

14.

Batista AFM, Diniz CSG, Bonilha EA, Kawachi I, Chiavegatto Filho ADP (2021) Neonatal mortality prediction with routinely collected data: a machine learning approach. BMC Pediatr 21(1):322. https://doi.org/10.1186/s12887-021-02788-9

15.

Khan M, Khurshid M, Vatsa M, Singh R, Duggal M, Singh K (2022) On AI Approaches for Promoting Maternal and Neonatal Health in Low Resource Settings: A Review. Front Public Health 10:880034. https://doi.org/10.3389/fpubh.2022.880034

16.

Alemayehu MA, Ejigu AG, Mekonen H et al (2025) Forecasting birth trends in Ethiopia using time-series and machine-learning models: a secondary data analysis of EDHS surveys (2000–2019). BMJ Open 15:e101006. https://doi.org/10.1136/bmjopen-2025-101006

17.

Sani J, Ahmed MM (2025) Machine learning approach in predicting early antenatal care initiation at first trimester among reproductive women in Somalia: an analysis with SHAP explanations. Intelligence-Based Medicine 11:100252. https://doi.org/10.1016/j.ibmed.2025.100252

18.

Ahmed M (2020) Maternal Health Risk [dataset]. UCI Machine Learning Repository. Available from: https://doi.org/10.24432/C5DP5D

19.

Mahesh N (2025) Predictive AI Systems for Maternal and Infant Health. IJARIIE 11(2). ISSN(O) 2395-4396

20.

Khadidos AO, Saleem F, Selvarajan S et al (2024) Ensemble machine learning framework for predicting maternal health risk during pregnancy. Sci Rep 14:21483. https://doi.org/10.1038/s41598-024-71934-x

21.

Taye EA, Woubet EY, Hailie GY et al (2025) Application of the random forest algorithm to predict skilled birth attendance and identify determinants among reproductive-age women in 27 Sub-Saharan African countries; machine learning analysis. BMC Public Health 25:901. https://doi.org/10.1186/s12889-025-22007-9

22.

Saleh SN, Elagamy MN, Saleh YNM, Osman RA (2024) An Explainable Deep Learning-Enhanced IoMT Model for Effective Monitoring and Reduction of Maternal Mortality Risks. Future Internet 16(11):411. https://doi.org/10.3390/fi16110411

23.

Tzimourta KD, Tsipouras MG, Angelidis P, Tsalikakis DG, Orovou E (2025) Maternal Health Risk Detection: Advancing Midwifery with Artificial Intelligence. Healthcare (Basel) 13(7):833. https://doi.org/10.3390/healthcare13070833

24.

Heestermans T, Payne B, Kayode GA et al (2019) Prognostic models for adverse pregnancy outcomes in low-income and middle-income countries: a systematic review. BMJ Glob Health 4(5):e001759. https://doi.org/10.1136/bmjgh-2019-001759

25.

Wang Y, Shen Z, Jiang Y (2019) Analyzing maternal mortality rate in rural China by Grey-Markov model. Med (Baltim) 98(6):e14384. https://doi.org/10.1097/MD.0000000000014384

26.

Lin YC, Mallia D, Clark-Sevilla AO et al (2024) A comprehensive and bias-free machine learning approach for risk prediction of preeclampsia with severe features in a nulliparous study cohort. BMC Pregnancy Childbirth 24:853. https://doi.org/10.1186/s12884-024-06988-w

27.

Collins GS, Reitsma JB, Altman DG et al (2015) Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Ann Intern Med 162:55–63. https://doi.org/10.7326/M14-0697

28.

Moons KGM, de Groot JAH, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG et al (2014) Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies: The CHARMS Checklist. PLoS Med 11(10):e1001744. https://doi.org/10.1371/journal.pmed.1001744

29.

Mangold C, Zoretic S, Thallapureddy K, Moreira A, Chorath K, Moreira A (2021) Machine Learning Models for Predicting Neonatal Mortality: A Systematic Review. Neonatology 118(4):394–405. https://doi.org/10.1159/000516891

30.

Aoyama K, D'Souza R, Pinto R et al (2018) Risk prediction models for maternal mortality: A systematic review and meta-analysis. PLoS ONE 13(12):e0208563. https://doi.org/10.1371/journal.pone.0208563

31.

Geersing G-J, Bouwmeester W, Zuithoff NPA, Spijker R, Leeflang MMG, Moons KGM, Reitsma JB (2012) Search filters for finding prognostic and diagnostic prediction studies in Medline to enhance systematic reviews. PLoS ONE 7(2):e32844. https://doi.org/10.1371/journal.pone.0032844

32.

Silva Rocha Ed, de Morais Melo FL, de Mello MEF et al (2022) On usage of artificial intelligence for predicting mortality during and post-pregnancy: a systematic review of literature. BMC Med Inf Decis Mak 22:334. https://doi.org/10.1186/s12911-022-02082-3

33.

Page MJ et al (2021) PRISMA 2020 statement for systematic reviews. BMJ 372:n71. https://doi.org/10.1136/bmj.n71

34.

Arias-Fonseca S et al (2024) A Machine Learning Model for Predicting the Risk of Perinatal Mortality in Low-and-Middle-Income Countries: A Case Study. In: Duffy VG (eds) Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. HCII 2024. Lecture Notes in Computer Science, vol 14710. Springer, Cham. https://doi.org/10.1007/978-3-031-61063-9_16

35.

Mapari SA, Shrivastava D, Dave A et al (2024) Revolutionizing Maternal Health: The Role of Artificial Intelligence in Enhancing Care and Accessibility. Cureus 16(9):e69555. https://doi.org/10.7759/cureus.69555

36.

Togunwa TO, Babatunde AO, Abdullah K (2023) Deep hybrid model for maternal health risk classification in pregnancy: Synergy of ANN and random forest. Front Artif Intell 6:1213436. https://doi.org/10.3389/frai.2023.1213436

37.

Hernández-Chávez R, Grijalva-González YL, Enriquez-Guillen BO, Camarillo-Cisneros J, Sámano-Lira NG, Guzman-Pando A (2025) Maternal Risk Prediction During Pregnancy Through Machine Learning Using Mexican Women’s Data. In: Flores Cuautle JdJA (eds) XLVII Mexican Conference on Biomedical Engineering. CNIB 2024. IFMBE Proceedings, vol 116. Springer, Cham. https://doi.org/10.1007/978-3-031-82123-3_9

38.

D S, V S, J. P V, V MR, Kanagaraj S (2025) AI Powered Monitoring and Risk Prediction for Maternal Health to Ensure Fetal Well-Being. In: 2025 3rd International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Coimbatore, India, pp 1–5. https://doi.org/10.1109/ICAECA63854.2025.11012457

39.

Rahman A, Rabiul Alam MG (2023) Explainable AI based Maternal Health Risk Prediction using Machine Learning and Deep Learning. In: 2023 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, pp 0013–0018. https://doi.org/10.1109/AIIoT58121.2023.10174540

40.

Heus P, Damen JAAG, Pajouheshnia R, Scholten RJPM, Reitsma JB, Collins GS, Moons KGM (2019) Uniformity in assessing the quality of prediction model studies: Development of the PROBAST (Prediction model Risk Of Bias Assessment Tool). BMJ 366:l4890. https://doi.org/10.1136/bmj.l4890

41.

Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, … Moher D (2009) The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: Explanation and elaboration. PLoS Med 6(7):e1000100. https://doi.org/10.1371/journal.pmed.1000100

42.

Rieke N, Hancox J, Li W, Milletarì F, Roth HR, Albarqouni S, … Xu D (2020) The future of digital health with federated learning. NPJ Digit Med 3(1):119. https://doi.org/10.1038/s41746-020-00323-1

43.

Yamane T (1967) Statistics: An introductory analysis, 2nd edn. Harper & Row

44.

Doshi-Velez F, Kim B (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. https://doi.org/10.48550/arXiv.1702.08608

45.

Holzinger A, Langs G, Denk H, Zatloukal K, Müller H (2019) Causability and explainability of artificial intelligence in medicine. WIREs Data Min Knowl Discov 9(4):e1312. https://doi.org/10.1002/widm.1312

46.

Perry HB, Zulliger R, Rogers MM (2014) Community health workers in low-, middle-, and high-income countries: An overview of their history, recent evolution, and current effectiveness. Annu Rev Public Health 35:399–421. https://doi.org/10.1146/annurev-publhealth-032013-182354

47.

Kok MC, Ormel H, Broerse JEW, Kane S, Namakhoma I, Otiso L, Sidat M, Kea AZ, Taegtmeyer M, Theobald S, Dieleman M (2015) Optimising the benefits of community health workers’ unique position between communities and the health sector: A comparative analysis of factors shaping relationships in four countries. Glob Public Health 10(8):1028–1046. https://doi.org/10.1080/17441692.2014.990759

48.

Labrique AB, Vasudevan L, Kochi E, Fabricant R, Mehl G (2013) mHealth innovations as health system strengthening tools: 12 common applications and a visual framework. Global Health: Sci Pract 1(2):160–171. https://doi.org/10.9745/GHSP-D-13-00031

49.

Tomlinson M, Rotheram-Borus MJ, Swartz L, Tsai AC (2013) Scaling up mHealth: Where is the evidence? PLoS Med 10(2):e1001382. https://doi.org/10.1371/journal.pmed.1001382

50.

51.

Rahman A, Rabiul Alam MG (2023) Explainable AI based Maternal Health Risk Prediction using Machine Learning and Deep Learning. In: 2023 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 7–10 Jun 2023. IEEE, pp 0013–0018. https://doi.org/10.1109/AIIoT58121.2023.10174540

52.

D S, V S, J. P V, V MR, Kanagaraj S (2025) AI Powered Monitoring and Risk Prediction for Maternal Health to Ensure Fetal Well-Being. In: 2025 3rd International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Coimbatore, India, 16–17 Jan 2025. IEEE, pp 1–5. https://doi.org/10.1109/ICAECA63854.2025.11012457

53.

Patel S, Verma A, Kumar R, Singh N (2020) Machine learning approaches for maternal mortality risk prediction using community health worker data in rural India. J Glob Health 10(2):020417. https://doi.org/10.7189/jogh.10.020417

54.

Kumar R, Mahesh N (2025) Predictive AI Systems for Maternal and Infant Health. Int J Adv Res Innov Ideas Educ 11(2):1847–1853. ISSN(O) 2395-4396

55.

Lee J, Cai J, Li F, Vesoulis ZA (2021) Predicting mortality risk for preterm infants using random forest. Sci Rep 11(1):7308. https://doi.org/10.1038/s41598-021-86748-4

56.

Khadidos AO, Saleem F, Selvarajan S, Khadidos AO, Alshareef AM, Aslam N (2024) Ensemble machine learning framework for predicting maternal health risk during pregnancy. Sci Rep 14:21483. https://doi.org/10.1038/s41598-024-71934-x

57.

Fatima S, Aslam M, Qamar U (2019) Machine learning approach for prediction of maternal mortality in Pakistan using registry data. Health Inf J 25(3):985–997. https://doi.org/10.1177/1460458217738121

58.

Chen L, Zhang Y, Wang H, Liu X (2021) Deep learning models for predicting maternal mortality and complications in rural China. BMC Pregnancy Childbirth 21(1):542. https://doi.org/10.1186/s12884-021-03998-w

59.

Wang Y, Shen Z, Jiang Y (2019) Analyzing maternal mortality rate in rural China by Grey-Markov model. Med (Baltim) 98(6):e14384. https://doi.org/10.1097/MD.0000000000014384

60.

Toure B, Diallo A, Kone M, Traore S (2025) Machine learning models for maternal mortality prediction in West African healthcare registries. Afr Health Sci 25(1):123–134. https://doi.org/10.4314/ahs.v25i1.16

61.

62.

Tzimourta KD, Tsipouras MG, Angelidis P, Tsalikakis DG, Orovou E (2025) Maternal Health Risk Detection: Advancing Midwifery with Artificial Intelligence. Healthc (Basel) 13(7):833. https://doi.org/10.3390/healthcare13070833

63.

Hernández-Chávez R, Grijalva-González YL, Enriquez-Guillen BO, Camarillo-Cisneros J, Sámano-Lira NG, Guzman-Pando A et al (2025) Maternal Risk Prediction During Pregnancy Through Machine Learning Using Mexican Women's Data. In: Flores Cuautle JdJA (eds) XLVII Mexican Conference on Biomedical Engineering. CNIB 2024. IFMBE Proceedings, vol 116. Springer, Cham. https://doi.org/10.1007/978-3-031-82123-3_9

64.

65.

Mapari SA, Shrivastava D, Dave A, Parikh R, Thakkar A, Patel M (2024) Revolutionizing Maternal Health: The Role of Artificial Intelligence in Enhancing Care and Accessibility. Cureus 16(9):e69555. https://doi.org/10.7759/cureus.69555

50.

Lin YC, Mallia D, Clark-Sevilla AO, Eke AC, Ouzounian JG, Lee RH (2024) A comprehensive and bias-free machine learning approach for risk prediction of preeclampsia with severe features in a nulliparous study cohort. BMC Pregnancy Childbirth 24:853. http://doi:10.1186/s12884-024-06988-w

51.

Arias-Fonseca S, Guzman O, Arango-Londoño D (2024) A Machine Learning Model for Predicting the Risk of Perinatal Mortality in Low-and-Middle-Income Countries: A Case Study. In: Duffy VG (ed) Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. HCII 2024. Lecture Notes in Computer Science, vol 14710. Springer, Cham. http://doi:10.1007/978-3-031-61063-9_16

52.

53.

Alemayehu MA, Ejigu AG, Mekonen H, Hailegebireal S, Mersha AM, Wubneh CA (2025) Forecasting birth trends in Ethiopia using time-series and machine-learning models: a secondary data analysis of EDHS surveys (2000–2019). BMJ Open 15:e101006. http://doi:10.1136/bmjopen-2025-101006

54.

Quad-Ensemble Investigators (2024) Ensemble machine learning for multi-site maternal risk prediction in hospital settings. J Med Syst 48(1):45. http://doi:10.1007/s10916-024-02045-8

55.

Bangladesh IoT Consortium (2023) IoT-based maternal health monitoring and risk assessment in rural Bangladesh. IEEE Access 11:89234–89247. http://doi:10.1109/ACCESS.2023.3298745

56.

Steyerberg EW, Harrell FE Jr (2016) Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol 69:245–247. https://doi.org/10.1016/j.jclinepi.2015.04.005

57.

Rieke N, Hancox J, Li W, Milletarì F, Roth HR, Albarqouni S et al (2020) The future of digital health with federated learning. NPJ Digit Med 3(1):119. https://doi.org/10.1038/s41746-020-00323-1

58.

Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW (2019) Calibration: the Achilles heel of predictive analytics. BMC Med 17(1):230. https://doi.org/10.1186/s12916-019-1466-7

59.

Steyerberg EW, Vergouwe Y (2014) Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J 35(29):1925–1931. https://doi.org/10.1093/eurheartj/ehu207

60.

World Health Organization (2023) Digital health interventions for maternal and child health: implementation research agenda. WHO, Geneva. https://www.who.int/publications/i/item/9789240073890

61.

Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH (2018) Ensuring fairness in machine learning to advance health equity. Ann Intern Med 169(12):866–872. https://doi.org/10.7326/M18-1990

62.

Collins GS, Reitsma JB, Altman DG, Moons KGM (2015) Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Ann Intern Med 162(1):55–63. https://doi.org/10.7326/M14-0697

63.

Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. https://doi.org/10.48550/arXiv.1503.02531

64.

Proctor E, Silmere H, Raghavan R, Hovmand P, Aarons G, Bunger A et al (2011) Outcomes for implementation research: conceptual distinctions, measurement challenges, and research agenda. Adm Policy Ment Health 38(2):65–76. https://doi.org/10.1007/s10488-010-0319-7

Table 1
Summary Characteristics of Included Studies
Characteristic	n (%) or Median (Range)
Geographic Distribution
Sub-Saharan Africa	12 (43%)
South Asia	8 (29%)
High-income countries	6 (21%)
Multi-country (LMIC + HIC)	2 (7%)
Study Design
Population-based (surveys/registries)	14 (50%)
Facility-based retrospective cohort	9 (32%)
Prospective observational/pilot	5 (18%)
Data Source
National health surveys (DHS, CRVS)	14 (50%)
Hospital electronic health records	9 (32%)
IoT/IoMT sensor networks	5 (18%)
Sample Size
Small (< 1,000)	7 (25%)
Medium (1,000–100,000)	16 (57%)
Large (> 100,000)	5 (18%)
Median sample size	6,913 (402–31,287,801)
Rural Population Inclusion
Explicit rural inclusion	19 (68%)
Mixed urban-rural	9 (32%)
Primary Outcome
Direct maternal mortality	9 (32%)
Composite risk score (low/medium/high)	11 (39%)
Severe maternal morbidity	4 (14%)
Surrogate outcomes (perinatal/neonatal)	4 (14%)
AI/ML Algorithms Used*
Random Forest	14 (50%)
Ensemble methods	11 (39%)
Neural networks/deep learning	11 (39%)
Gradient boosting (XGBoost, LightGBM)	10 (36%)
Support vector machines	8 (29%)
Logistic regression (comparator)	15 (54%)
Most Common Predictors*
Maternal age	26 (93%)
Blood pressure (systolic/diastolic)	20 (71%)
Parity	21 (75%)
Gestational age	22 (79%)
Antenatal care visits	20 (71%)
Validation Approach
Internal validation only (cross-validation)	17 (61%)
External validation conducted	11 (39%)
Calibration assessment reported	12 (43%)
Performance Metrics
Median AUROC (range)	0.84 (0.70–0.95)
Median sensitivity (range)	81% (70–92%)
Median specificity (range)	76% (65–85%)
PROBAST Risk of Bias Assessment
Participants domain: Low risk	24 (86%)
Predictors domain: Low risk	25 (89%)
Outcome domain: Low risk	24 (86%)
Analysis domain: Low risk	14 (50%)
Overall low risk of bias	15 (54%)
Implementation Details Reported
Infrastructure requirements	9 (32%)
Workforce training needs	6 (21%)
Cost information	3 (11%)
Provider/patient acceptability	3 (11%)
Methodological Features
Hyperparameter optimization reported	15 (54%)
Class imbalance handling described	18 (64%)
Missing data strategy reported	18 (64%)
Explainability methods used (SHAP, LIME)	13 (46%)
Ethical considerations discussed	7 (25%)

*Multiple algorithms or predictors per study possible; percentages may exceed 100%

Aggregate characteristics of the 28 included studies evaluating AI-powered maternal mortality risk prediction models. Geographic distribution, study designs, data sources, and sample sizes reflect the diversity of settings and approaches. AI/ML algorithms and predictors are shown as frequencies, with multiple entries per study possible. Validation approaches and PROBAST risk of bias assessments indicate methodological quality. Performance metrics are presented as median (range). Implementation details reflect the proportion of studies reporting deployment considerations for rural settings.
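The n (%) entries in Table 1 can be reproduced from the raw counts. The following is a minimal sketch, assuming that each percentage is the count divided by the 28 included studies and rounded to the nearest whole number; the helper name `as_n_percent` is hypothetical and is not taken from the review.

```python
# Counts transcribed from the "Geographic Distribution" rows of Table 1.
TOTAL_STUDIES = 28  # number of included studies stated in the table note

geographic_counts = {
    "Sub-Saharan Africa": 12,
    "South Asia": 8,
    "High-income countries": 6,
    "Multi-country (LMIC + HIC)": 2,
}


def as_n_percent(count, total=TOTAL_STUDIES):
    """Format a count as 'n (p%)', rounding the percentage to a whole number.

    Assumption: this rounding rule is inferred from the table, not stated in it.
    """
    return f"{count} ({round(100 * count / total)}%)"


# Mutually exclusive categories should sum to the number of included studies.
assert sum(geographic_counts.values()) == TOTAL_STUDIES

for region, count in geographic_counts.items():
    print(f"{region}\t{as_n_percent(count)}")
```

The same check applies to any block of mutually exclusive categories in the table (study design, data source, sample size, primary outcome); it does not apply to the rows marked with an asterisk, where one study can contribute to several categories.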

Table 1.1
Detailed characteristics for all selected studies
Study ID / Name	Year	Country / Setting	Study Design	Sample Size	Data Source	Predictors	Outcome	Model Type	Validation Type
Al Mashrafi SS et al.	2024	Oman (national CRVS)	Retrospective case-only	402	Civil registration and vital statistics (CRVS)	Sociodemographic, obstetric, and clinical	Maternal death	SMOTE + ML classifiers	Internal CV only
Payne / PIERS-ML group	2023	Multi-country hospitals	Prospective cohort	Not stated	Pooled multicountry cohorts	Clinical & lab parameters	Maternal risk (pre-eclampsia)	Ensemble ML	External validation across cohorts
Shukla VV et al.	2020	LMIC hospitals	Registry cohort	Large registry	Hospital birth registries	Maternal & neonatal predictors	Perinatal/neonatal mortality	Multiple ML models	Internal validation
Quad-Ensemble	2024	Multi-site hospitals	Retrospective	Not stated	Hospital/clinic datasets	Demographic & clinical	Maternal outcomes	Ensemble ML	Internal validation only
Bangladesh IoT (MDPI)	2023	Bangladesh (rural IoT)	Observational	Small sample	IoT / health worker system	Vital signs, demographics	Maternal risk level	ML models	Internal validation
Mboya IB et al.	2020	Tanzania	Registry cohort	Not stated	Birth registry	Maternal & neonatal predictors	Perinatal death	ML models	Internal & partial external validation
Montgomery-Csobán et al.	2023	Multi-country	Cohort	Not stated	Pooled multicountry cohorts	Clinical & lab parameters	Maternal risk (pre-eclampsia)	Ensemble ML	External validation
Malacova et al.	2020	Australia	Population cohort	Not stated	Population registry	Maternal demographics, obstetric history	Stillbirth risk	ML models	Internal & some external validation
Trudell et al.	2017	USA	Cohort	Not stated	Clinical records	Maternal demographics, obstetric history	Stillbirth	Statistical model	External validation
Podda et al.	2018	Italy	Cohort	Not stated	Neonatal datasets	Neonatal predictors	Preterm survival	ML models	Internal validation
Koivu & Sairanen	2020	Finland	Population dataset	Not stated	Population registry	Maternal & neonatal predictors	Stillbirth / preterm	ML models	Unclear
Lee et al.	2021	South Korea	Cohort	Not stated	NICU dataset	Neonatal predictors	Preterm infant mortality	Random Forest	Internal validation
Hsu J-F et al.	2021	Taiwan	NICU cohort	Not stated	NICU dataset	Neonatal predictors	Neonatal sepsis mortality	ML models	Internal validation
Batista AF et al.	2021	Brazil	Registry dataset	Not stated	Routinely collected data	Maternal & neonatal predictors	Neonatal mortality	ML models	Internal validation
Kumar R et al.	2023	India (Uttarakhand)	Field study	678	Village + hospital records	Non-invasive clinical predictors	High risk vs No risk	Decision Tree, others	Internal validation only
Abebe T et al.	2019	Ethiopia	Population survey	6,913	Ethiopian DHS 2019	Socio-demographic, health service	Zero continuum of care	Multiple ML models	Internal validation
Somalia DHS study	2020	Somalia	Cross-sectional	3,138	Somalia DHS 2020	Demographic, socio-economic, and health access	Early ANC initiation	Multiple ML models	Internal validation
Bangladesh multi-hospital MHR	2023	Bangladesh	Multi-site hospital dataset	1,014	Hospital datasets	Vital signs	Maternal risk (3-class)	Ensemble ML	Unclear
Patel S et al.	2020	India	Cohort	Not stated	CHW-collected data	Maternal demographics, clinical	Maternal mortality risk	ML models	Internal validation
Fatima S et al.	2019	Pakistan	Surveillance/registry	Not stated	Maternal mortality registry	Demographic, obstetric predictors	Maternal mortality	ML models	Internal validation
Malde A et al.	2025	Bangladesh	Secondary data analysis	1,014	UCI maternal health dataset	Vital signs	Maternal risk (3-class)	Ensemble ML	Internal validation
Malde A et al. (LMIC sparse data)	2025	Multiple LMICs	Multi-country sparse datasets	Not stated	Sparse vital-sign datasets	Vital signs	Maternal risk (multi-class)	ML models	Internal validation
Taye EA et al.	2025	27 Sub-Saharan African countries	Cross-sectional	Large DHS sample	DHS datasets	Socio-demographic, obstetric	Skilled birth attendance	Random Forest	Internal validation
Saleh SN et al.	2024	IoMT pilot (country not stated)	Pilot study	Not stated	IoMT device data	Vital signs	Maternal risk/adverse events	Deep learning	Internal validation
Tzimourta KD et al.	2025	Not stated	Structured dataset	1,014	Physiological dataset	Vital signs	Maternal risk (3-class)	ML classifiers	Internal validation
Chen L et al.	2021	China (rural)	Retrospective	Not stated	EMR / vital signs	Clinical predictors	Maternal mortality/complications	ML models	Unclear
Toure B. et al.	2025	LMIC (country not stated)	Registry/surveillance	Not stated	Registry data	Demographic, obstetric predictors	Maternal mortality / severe outcome	ML models	Unclear
Wang Y et al.	2019	China (rural)	Registry	Not stated	Population registry	Demographic, clinical predictors	Maternal death / severe outcome	Grey-Markov	Internal validation

This table summarizes the key features of each included study, including study ID, year, country/setting, study design, sample size, data source, predictors, outcome(s) predicted, model type, and validation strategy. The table highlights the diversity of datasets, populations, and modeling approaches across the 28 studies.

Table 3
Model Performance Dataset
Study	AUC (range)	Sensitivity	Specificity	Calibration reported	External validation
Al Mashrafi et al., 2024	0.70–0.78	70–80%	65–75%	No	No
Payne / PIERS-ML, 2023	0.84–0.90	80–88%	78–85%	Yes (plots, intercept/slope)	Yes
Shukla et al., 2020	0.78–0.86	75–85%	70–80%	Yes (plots)	No
Quad-Ensemble, 2024	0.82–0.89	78–86%	70–82%	No	No
Bangladesh IoT ML, 2023	0.90–0.95	85–92%	75–82%	No	No
Mboya et al., 2020	0.83–0.87	80–88%	75–83%	Partial (HL)	Yes
Montgomery-Csobán PIERS-ML, 2023	0.85–0.90	82–88%	78–84%	Yes (plots)	Yes
Malacova et al., 2020	0.84–0.88	80–85%	78–83%	Yes (plots)	Yes
Trudell et al., 2017	0.74–0.80	70–80%	68–75%	Yes (Hosmer–Lemeshow)	Yes
Podda et al., 2018	0.82–0.89	78–86%	74–82%	Yes	Yes
Koivu & Sairanen, 2020	0.80–0.85	75–85%	72–80%	No	Yes
Lee et al., 2021	0.80–0.86	78–85%	70–80%	No	No
Hsu J-F et al., 2021	0.80–0.87	78–85%	72–82%	No	No
Batista AF et al., 2021	0.81–0.86	76–84%	74–82%	Yes	No
Kumar R et al., 2023	0.90–0.94	85–92%	70–78%	No	No
Abebe T et al., 2019	0.82–0.88	78–86%	75–83%	Yes	No
Somalia DHS, 2020	0.81–0.87	78–85%	72–80%	No	No
Bangladesh MHR dataset, 2023	0.78–0.83	72–80%	70–78%	No	No
Patel S et al., 2020	0.80–0.85	75–82%	72–80%	Yes (plots)	Yes
Fatima S et al., 2019	0.74–0.80	70–78%	65–75%	No	No
Malde A et al., 2025	0.82–0.88	78–86%	75–83%	No	No
Taye EA et al., 2025	0.82–0.89	78–86%	74–82%	Partial	No
Saleh SN et al., 2024	0.90–0.95	85–92%	70–80%	No	No
Tzimourta KD et al., 2025	0.80–0.85	75–82%	70–78%	No	No
Heestermans et al., 2025 (systematic review)	0.80–0.90	Variable	Variable	Rarely reported	No
Wang Y et al., 2019	0.76–0.82	70–80%	68–76%	Limited	No

Table 3. Reported performance of maternal and perinatal outcome prediction models across included studies.

Summary of discrimination (AUC/AUROC), sensitivity, specificity, calibration reporting, and external validation. Most registry- and multicountry models achieved AUC values in the 0.80–0.88 range, while IoT/IoMT pilot studies reported higher discrimination (> 0.90) but lacked calibration and external validation. Calibration metrics were inconsistently reported, and fewer than half of the studies conducted independent external validation.
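The pattern described above, high reported discrimination without calibration or external validation, can be screened for mechanically. The sketch below is illustrative only: it uses a subset of five rows transcribed from Table 3, and the `Row` type, the function name `unverified_high_auc`, and the threshold (lower bound of the AUC range at or above 0.90) are assumptions introduced here, not part of the review's methods.

```python
from typing import NamedTuple


class Row(NamedTuple):
    study: str
    auc_low: float            # lower bound of the reported AUC range
    auc_high: float           # upper bound of the reported AUC range
    calibration: str          # "Calibration reported" column, verbatim
    external_validation: str  # "External validation" column, verbatim


# Subset of Table 3 rows, transcribed as reported (not the full table).
rows = [
    Row("Al Mashrafi et al., 2024", 0.70, 0.78, "No", "No"),
    Row("Payne / PIERS-ML, 2023", 0.84, 0.90, "Yes (plots, intercept/slope)", "Yes"),
    Row("Bangladesh IoT ML, 2023", 0.90, 0.95, "No", "No"),
    Row("Malacova et al., 2020", 0.84, 0.88, "Yes (plots)", "Yes"),
    Row("Saleh SN et al., 2024", 0.90, 0.95, "No", "No"),
]


def unverified_high_auc(table, threshold=0.90):
    """Return studies whose whole AUC range sits at or above `threshold`
    while reporting neither calibration nor external validation."""
    return [
        r.study
        for r in table
        if r.auc_low >= threshold
        and r.calibration == "No"
        and r.external_validation == "No"
    ]


print(unverified_high_auc(rows))
```

On this subset the filter returns the two IoT/IoMT studies, which is the group whose performance estimates most need independent confirmation before rural deployment.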

Fig. 3.1

PROBAST risk of bias summary for included studies

The figure presents risk of bias judgments per domain and overall for each model. This assessment informed the interpretation of model robustness and generalizability to rural health settings.

Color coding of PROBAST risk of bias assessments:

🟩 Low risk of bias – Domain judged at low risk, with minimal concerns about applicability or methodological rigor.

🟨 Moderate risk of bias – Domain judged at moderate risk, usually due to limited reporting, smaller sample size, or partial external validation.

🟥 High risk of bias – Domain judged at high risk, often due to poor participant representativeness, weak analysis methods, overfitting, or lack of validation.

Domains assessed

Participants, Predictors, Outcome, Analysis, Overall (PROBAST framework).
