A Scoping Review of Racial Bias Mechanisms and Mitigation Frameworks in Clinical Artificial Intelligence
Abstract
This scoping review synthesizes evidence on how racial bias arises in clinical artificial intelligence (AI) systems and how it can be mitigated through technical, governance, and policy approaches. We conducted a scoping review of clinical AI/ML studies and relevant conceptual frameworks, with searches limited to English-language sources published between September 2020 and November 2025. Study selection was documented using a PRISMA 2020 flow diagram. Eligible studies examined racial or demographic bias mechanisms, fairness evaluation, or mitigation strategies in real-world clinical contexts. Across 22 included studies, recurring pathways to inequity included underrepresentation and label noise in training data, proxy variables that encode structural disadvantage, differences in access and measurement that distort outcomes, and limited external validation in diverse settings. Mitigation strategies clustered into (1) data and evaluation improvements (e.g., subgroup reporting, calibration, and cross-site validation), (2) model and optimization approaches (e.g., reweighting and fairness-aware objectives), and (3) governance levers (e.g., documentation, equity impact assessments, and monitoring requirements). We translate these findings into a practical framework linking bias mechanisms to mitigation actions and implementation levers, with an emphasis on feasible steps for health systems and policymakers to reduce avoidable inequities during AI deployment.
Keywords:
racial bias
clinical artificial intelligence
healthcare equity
algorithmic fairness
data imbalance
proxy variables
Introduction
Machine learning (ML) and artificial intelligence (AI) have emerged as major drivers of healthcare innovation, influencing the development of modern diagnostics, risk prediction, and treatment planning [1]. Radiology, oncology, cardiology, and mental health are among the fields where AI models are employed to enhance clinical decision-making, improve diagnostic accuracy, and optimize patient management [2]. These technologies hold promise for reducing human error, augmenting efficiency, and enabling personalized medicine through the analysis of detailed clinical data at unprecedented pace and volume [3]. Nevertheless, despite their transformative potential, increasing evidence suggests that AI systems may not equally benefit all patient populations [4]. Recent research indicates that algorithmic performance frequently varies across racial and ethnic groups, raising serious questions about the fairness, transparency, and equity of clinical AI applications [5]. Consequently, the opportunities AI presents for improving healthcare outcomes are contingent upon understanding and addressing the biases embedded in data, models, and implementation processes [6].
Racial bias in clinical AI typically arises from structural inequities in healthcare data and algorithm design [7]. AI training datasets are often biased because they reflect historical disparities in healthcare access, diagnosis, and treatment [8]. Empirical evidence supports these concerns. Ferryman et al. [9] demonstrated that a widely used healthcare algorithm systematically underestimated the medical needs of Black patients due to cost-based proxy variables. Similarly, Lee et al. [10] found lower accuracy of deep learning cardiac MRI segmentation models for minority racial groups, and Thompson et al. [11] identified racial differences in the false-negative rate of an opioid misuse classifier. The implications are far-reaching: inaccurate predictions can lead to inequitable clinical judgments, under-diagnosis, and reduced quality of care, ultimately entrenching systemic injustices and exacerbating existing health disparities between racial and ethnic groups.
Several studies have examined bias in clinical AI, but meaningful gaps remain in the literature [12]. Systematic reviews such as those by Correa et al. [13] and de Castro Vieira et al. [14] consider bias detection methods and fairness metrics, yet they predominantly address specific clinical domains or focus on technical fairness methods without broader ethical or policy considerations. Furthermore, scientific work synthesizing empirical evidence to link mechanisms of racial bias—such as data imbalance and proxy variable misuse—to practical clinical implications remains limited [15]. Regulatory and institutional perspectives from organizations such as the Centers for Medicare and Medicaid Services (CMS), the Food and Drug Administration (FDA), and the American Medical Association (AMA) are also underrepresented in the discussion [16]. Unlike previous studies, this review is structured as a holistic synthesis balancing empirical findings with causal mechanisms and policy mitigation strategies to facilitate equitable AI implementation across the healthcare industry [12].
This study employs a scoping review approach integrating empirical evidence with conceptual and policy scholarship to enhance conceptual clarity and methodological transparency. We used a structured search and screening process, documented with a PRISMA 2020 flow diagram, to identify relevant sources, and a narrative thematic synthesis to map mechanisms of epistemic inequity and actionable mitigation and governance responses in clinical AI. This design prioritizes breadth and interpretability while maintaining a reproducible, clearly reported selection process.
In this review, epistemic inequity in clinical AI refers to the systematic exclusion or misrepresentation of certain racial and social groups in the processes through which medical knowledge is produced, validated, and operationalized by algorithmic systems. It manifests through distortions in measurement and labeling, where proxy variables such as cost or utilization obscure genuine health needs; through inequities in documentation, as clinical narratives and electronic records embed biased linguistic or diagnostic assumptions; through validation gaps arising from race-blind benchmarking and homogeneous testing cohorts; and through governance asymmetries in which institutional actors determine what counts as "ground truth." Collectively, these mechanisms demonstrate that bias in AI is not merely a technical flaw but a structural and epistemic issue embedded in how knowledge and authority are distributed within healthcare systems.
To support conceptual clarity, two related constructs are defined in this review. Implementation capacity denotes the institutional, technical, and ethical readiness of healthcare systems to operationalize, monitor, and sustain fair AI applications across diverse contexts. Equity mandates refer to the policy, regulatory, and institutional frameworks that require fairness-oriented design, validation, and governance of AI systems to ensure that algorithmic innovations promote equitable healthcare outcomes rather than reinforce existing disparities.
To address these gaps and develop a multifaceted understanding of clinical AI and racial bias, this paper is guided by the following research questions:
RQ1: What evidence exists of racial bias in clinical artificial intelligence and machine learning systems used for diagnostics, readmission prediction, and treatment planning?
RQ2: What are the main sources and mechanisms that cause racial bias in clinical AI models, such as data imbalance, proxy variables, or unrepresentative cohorts?
RQ3: What technical, ethical, and policy strategies have been proposed or implemented to reduce racial bias and promote fairness in clinical AI systems?
By bridging empirical evidence with methodological insights and policy-level solutions, this review seeks to advance a comprehensive understanding of how racial bias manifests and can be mitigated within clinical AI systems. Through systematic synthesis of studies across diagnostics, readmission prediction, and treatment planning, the review not only identifies the presence and origins of bias but also highlights actionable strategies to enhance fairness and accountability. Ultimately, this work aims to inform the equitable design, deployment, and governance of clinical AI technologies, ensuring that innovations in healthcare serve all patient populations without perpetuating existing disparities.
Methodology
Study Design
The present study employed a scoping review approach. The study selection process is documented using a PRISMA 2020 flow diagram (Fig. 1) [17].
A large language model was used for language editing and formatting assistance during manuscript preparation; the author reviewed and verified the accuracy and integrity of all content (see Statements and Declarations).
Fig. 1
PRISMA 2020 flow diagram for study selection.
Search Strategy
A systematic literature search was performed in PubMed, Scopus, and Google Scholar. The search was limited to English-language sources published between September 2020 and November 2025. These databases were selected for their coverage of biomedical, informatics, and interdisciplinary research.
The search terms and Boolean operators used were: (("racial bias" OR "algorithmic bias" OR "algorithmic fairness") AND ("clinical AI" OR "machine learning" OR "healthcare prediction" OR "readmission" OR "diagnosis" OR "treatment planning" OR "mental health")).
Filters were applied to prioritize empirical clinical AI/ML studies and relevant peer-reviewed frameworks, while allowing inclusion of high-quality preprints reporting original methods or evaluations. Only English-language sources were considered.
Inclusion and Exclusion Criteria
To ensure methodological rigor and relevance to the research objectives, studies were selected using pre-established eligibility criteria. The inclusion and exclusion criteria were designed to accommodate both empirical and conceptual research, ensuring comprehensive coverage of technical, ethical, and governance aspects of racial bias in clinical AI. Empirical studies were selected for their quantitative assessment of model performance or fairness outcomes, while conceptual and policy-oriented papers were included if they presented analytical, ethical, or governance frameworks relevant to algorithmic fairness or epistemic inequity in healthcare. Table 1 summarizes the criteria used to select and screen studies during the PRISMA-guided process.
Table 1
Summary of Inclusion and Exclusion Criteria
Category
Criteria Description
Rationale
Inclusion Criteria
Peer-reviewed empirical studies examining AI or ML models in clinical contexts; High-quality preprints reporting original methods or evaluations in clinical AI; Studies evaluating racial bias, fairness, or performance disparities across demographic subgroups; Publications addressing technical, ethical, or policy-based mitigation strategies; Conceptual, ethical, or policy-based publications discussing governance, fairness frameworks, or epistemic inequity in clinical AI
To ensure inclusion of methodologically sound and clinically relevant research that directly investigates racial bias and fairness in AI
Exclusion Criteria
Non-clinical AI or general computer science studies lacking healthcare context; Editorials or commentaries without original methods or evaluation; Non-scholarly sources (news/blogs); Studies without racial, ethnic, or demographic stratification in their analysis
To exclude literature without empirical rigor, clinical relevance, or demographic stratification essential to the research questions
Data Extraction
A standardized data extraction form was employed to systematically capture key information from each included study. This form was designed to ensure consistency and comprehensiveness in the extraction of variables relevant to the research questions. The extracted data included study characteristics (author, year, design, clinical domain), AI/ML methods employed, types and strengths of racial bias identified, underlying mechanisms of bias, fairness metrics and mitigation strategies applied, and key outcomes and conclusions.
Quality Appraisal and Risk of Bias Assessment
Design-appropriate appraisal tools were used to assess study quality and risk of bias. The Newcastle-Ottawa Scale (NOS) [38] was applied to empirical/model-development studies (n = 11) across selection, comparability, and outcome domains. The CASP Qualitative Checklist (2024) [39] was applied to conceptual and framework-based studies (n = 11). Overall, 5 of 11 conceptual studies were rated high quality and 6 moderate quality (mean CASP score 7.7/10). Among empirical studies, 6 were rated good quality and 5 fair quality (mean NOS score 6.9/9).
Results
Overview of Studies Included
Database searching identified 500 records. After removing 330 records before screening (300 duplicates; 30 non-English, conference abstracts, or retracted), 170 records were screened. Of these, 10 were excluded at title/abstract screening. Of 160 reports sought for retrieval, 20 could not be retrieved. The remaining 140 full-text reports were assessed for eligibility; 118 were excluded (90 not clinical/not healthcare AI/ML in scope; 20 no racial or demographic subgroup analysis; 8 commentary/editorial or duplicate cohort). Twenty-two studies were included in the final review.
A total of 22 peer-reviewed and preprint studies, published between 2020 and 2025, spanning diverse clinical domains (population health, critical care, oncology, psychiatry, imaging, and healthcare governance) were included. The reviewed literature encompasses empirical analyses of AI and machine learning algorithms as well as conceptual and policy-based frameworks of algorithmic fairness. Empirical studies evaluated biases in various predictive and diagnostic tasks, including mortality prediction, readmission risk, cardiac imaging, diabetes modeling, and radiomics-based cancer predictions. Several studies utilized benchmark clinical datasets such as MIMIC-III, NHANES, and UK Biobank, facilitating methodological comparability. Concurrently, conceptual and policy-oriented works offered ethical, statistical, and governance frameworks for fairness evaluation and bias removal in clinical AI.
Empirical Patterns of Racial Bias Across Clinical AI Applications
The evidence reviewed demonstrates that algorithmic bias in healthcare AI systems manifests through calibration errors, non-representative data, and latent racial correlates embedded in model design. Research consistently shows that even technically proficient algorithms may amplify disparities in risk estimation, diagnostic accuracy, and treatment recommendations when fairness is not explicitly addressed.
In population health and predictive modeling, Gupta et al. [19] found that hospitalization models systematically underpredicted risk for minoritized groups, exposing calibration drift across racial and socioeconomic strata. Wang et al. [20] similarly identified structural and data-level bias in common readmission models, noting that features such as prior hospitalizations and healthcare utilization—often used as proxies for clinical need—encode social inequities. Cronjé et al. [21] added further evidence of miscalibration, showing that diabetes risk algorithms overestimated White patients' risk while underestimating risk for Black patients, revealing how seemingly objective predictors can perpetuate inequitable outcomes.
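To make this kind of audit concrete, the following minimal Python sketch computes per-group calibration-in-the-large (mean predicted risk minus observed event rate), the summary statistic most directly implicated in the miscalibration findings above. It is an illustration on synthetic data only; the names y_true, y_prob, and race are placeholders and do not reproduce any cited study's pipeline. A systematically negative value for one group flags the underprediction pattern reported by Gupta et al. [19] and Cronjé et al. [21].

```python
# Illustrative subgroup calibration audit on synthetic data.
import numpy as np
import pandas as pd

def calibration_by_group(y_true, y_prob, group):
    """Observed vs. predicted event rates per subgroup."""
    df = pd.DataFrame({"y": y_true, "p": y_prob, "g": group})
    rows = []
    for g, sub in df.groupby("g"):
        rows.append({
            "group": g,
            "n": len(sub),
            "mean_predicted": sub["p"].mean(),
            "observed_rate": sub["y"].mean(),
            # Positive: model overpredicts risk for this group;
            # negative: it underpredicts (the pattern reported above).
            "calibration_in_the_large": sub["p"].mean() - sub["y"].mean(),
        })
    return pd.DataFrame(rows)

# Synthetic example where risk is underpredicted for group "B".
rng = np.random.default_rng(0)
race = rng.choice(["A", "B"], size=5000, p=[0.8, 0.2])
true_risk = np.where(race == "B", 0.30, 0.15)
y_true = rng.binomial(1, true_risk)
y_prob = np.clip(true_risk - 0.10 * (race == "B"), 0, 1)
print(calibration_by_group(y_true, y_prob, race))
```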
In intensive and critical care, Allen et al. [18] demonstrated that targeted bias-minimized preprocessing can achieve both higher accuracy and fairness, outperforming legacy severity scores. Yet other studies underscore that bias may persist even after explicit correction. Velichkovska et al. [22, 23] revealed that vital signs alone could predict patient race with high accuracy, indicating that physiological data inherently encode racial information. Such findings complicate conventional fairness strategies, suggesting that debiasing efforts must address the statistical structure of biomedical data itself, not just model design. Thompson et al. [11] further showed how bias emerges in natural language classifiers—specifically, higher false-negative rates for Black patients in an opioid misuse detection model—though recalibration proved effective in mitigating disparities. Chang et al. [28] contributed a structural dimension, demonstrating that racial differences in laboratory testing frequency can distort the data pipelines feeding downstream AI, embedding inequity before modeling even begins.
Within imaging and oncology, disparities in representation emerged as a dominant source of bias. Lee et al. [10] reported that cardiac MRI segmentation accuracy was significantly lower among minority groups, reflecting the dominance of White subjects in training data. Pfob and Heil [25] similarly showed poor cross-population generalizability in breast cancer radiomics, with AUC performance dropping sharply when validated on Asian and African cohorts. Khor et al. [24] found that omitting race as a predictor worsened fairness and calibration, increasing false-negative rates in Hispanic and Black patients. Collectively, these findings highlight how both data imbalance and race exclusion amplify inequities in clinical prediction.
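The stratified evaluation these studies rely on can be expressed compactly. The sketch below reports per-group AUC and false-negative rate at a fixed decision threshold, the two quantities through which the disparities above were surfaced; function and variable names are illustrative assumptions, not any study's actual code.

```python
# Illustrative per-group discrimination and error-rate report.
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_performance(y_true, y_prob, group, threshold=0.5):
    """Per-group AUC and false-negative rate at a fixed threshold.
    Assumes each subgroup contains both outcome classes."""
    df = pd.DataFrame({"y": y_true, "p": y_prob, "g": group})
    rows = []
    for g, sub in df.groupby("g"):
        pred = (sub["p"] >= threshold).astype(int)
        positives = sub["y"] == 1
        rows.append({
            "group": g,
            "n": len(sub),
            "auc": roc_auc_score(sub["y"], sub["p"]),
            # Fraction of true cases the model misses in this group.
            "false_negative_rate": float((pred[positives] == 0).mean()),
        })
    return pd.DataFrame(rows)
```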
Underlying Mechanisms of Racial Bias in Clinical AI Systems
Racial bias in clinical AI systems is perpetuated by structural asymmetries deeply entrenched in the generation, modeling, and validation of health data. Rather than resulting from individual algorithmic shortcomings, these biases reflect how AI models reproduce and amplify inequity in data provenance, representation, and interpretation. Through the reviewed literature, several intersecting mechanisms—data imbalance, proxy variables, non-representative validation samples, and structural inequities—consistently explain why racially disparate outcomes emerge in technically sound models. Measurement-device and proxy-outcome errors can also embed racialized bias; pulse oximetry is a widely cited example with downstream equity consequences [31].
The first mechanism is data imbalance and representational disparity, observed across imaging, critical care, and population health applications. Lee et al. [10] showed that cardiac MRI segmentation models trained predominantly on White subjects yielded significantly lower Dice scores for minoritized populations, demonstrating a direct relationship between data homogeneity and systematic underperformance. Similarly, Pfob and Heil [25] found radiomics model accuracy decreased sharply on Asian and African populations, indicating that Eurocentric models generalize poorly across populations. Cronjé et al. [21] further demonstrated racial miscalibration in diabetes risk algorithms—overestimating White patients' risk and underestimating Black patients' risk—even in legacy clinical scores, where bias is maintained through population-specific parameterization. These findings underscore that racial underrepresentation at the data level produces unequal learning and undermines clinical reliability for marginalized populations.
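One widely used response to such imbalance is inverse-frequency group reweighting during training, so that the loss does not favor the majority group. The sketch below demonstrates the technique with scikit-learn on synthetic data; it is a generic illustration, not a reconstruction of any reviewed study's method.

```python
# Illustrative inverse-frequency group reweighting (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 4000
group = rng.choice([0, 1], size=n, p=[0.9, 0.1])    # 1 = underrepresented group
X = rng.normal(size=(n, 5)) + 0.5 * group[:, None]  # mild group shift in features
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X[:, 0] + 0.5 * group))))

# Weight each sample by the inverse of its group's share of the data,
# normalized so the average weight is 1.
freq = np.bincount(group) / n
weights = 1.0 / freq[group]
weights /= weights.mean()

model = LogisticRegression(max_iter=1000)
model.fit(X, y, sample_weight=weights)
```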
The second mechanism involves proxy variables and label leakage, whereby clinically neutral-appearing features encode socioeconomic or racial information. Gupta et al. [19] and Wang et al. [20] demonstrated that variables such as healthcare utilization, prior hospitalization, and cost implicitly capture access and privilege, factoring social determinants into model predictions. Mikhaeil et al. [29] elaborated on this by showing how proxy-label bias—outcomes defined through imperfect proxies such as healthcare spending or diagnosis codes—produces systematic prediction errors disfavoring underserved populations. Their Bayesian correction model emphasized that bias reduction requires redefining the meaning of ground truth, not merely reweighting features.
These biases are further aggravated by unrepresentative validation and benchmarking practices. Pfob and Heil [25] and Khor et al. [24] showed that excluding race during model validation inflates performance metrics and obscures subgroup-level failures. Velichkovska et al. [22, 23] demonstrated that vital signs alone convey racial information—models could predict race with AUCs exceeding 0.70 even without racial labeling. This finding exposes the fallacy of race-blind modeling: removing racial variables does not eliminate bias when physiological or systemic imbalances are present in the data.
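A practical way to test for this kind of leakage is to train a classifier to predict race from the supposedly race-neutral features themselves, in the spirit of the vital-signs experiments above. The sketch below uses synthetic data with assumed placeholder names (X_vitals, race); a cross-validated AUC well above 0.5 indicates that the features encode racial information even when race is never supplied as a variable.

```python
# Illustrative race-leakage audit: can race be recovered from the features?
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 3000
race = rng.binomial(1, 0.3, size=n)
# Small systematic shifts stand in for measurement/physiology differences.
X_vitals = rng.normal(size=(n, 6)) + 0.4 * race[:, None]

auc = cross_val_score(
    GradientBoostingClassifier(),
    X_vitals, race, cv=5, scoring="roc_auc",
).mean()
print(f"race-from-features AUC: {auc:.2f}")  # >> 0.5 signals leakage
```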
Finally, several researchers identify structural and contextual inequities as upstream sources of algorithmic bias. Chang et al. [28] found that racially differentiated laboratory testing procedures result in unequal data completeness, influencing model learning and error patterns in emergency care. Bouguettaya et al. [26] and Thompson et al. [11] showed that natural-language models trained on electronic health records or clinical narratives replicate linguistic and contextual biases in documentation, resulting in higher false-negative rates or inferior treatment recommendations for Black and Hispanic patients. These findings suggest that AI systems do not merely reflect bias but operationalize it, adapting institutional inequities to algorithmic decision-making.
Combined, these studies reveal that algorithmic inequity is multilevel and systemic, rooted in data hierarchies, measurement decisions, and healthcare organization rather than solely in single-model design [18]. The mechanisms demonstrate that fairness cannot be achieved through technical optimization alone but requires epistemic reform—reconsideration of how clinical risk, outcome, and validity are defined, measured, and validated across populations.
Table 2
Mapping of Mechanisms, Harms, Mitigation Strategies, and Governance Levers in Clinical AI
Mechanism of Inequity
Resulting Harm or Bias
Mitigation Strategy
Governance Lever / Policy Response
Data imbalance / underrepresentation
Miscalibrated predictions; underperformance in minoritized groups
Data augmentation; inclusive dataset design; reweighting
Institutional data diversity standards; transparent dataset reporting [10, 19]
Proxy labeling and measurement bias
Reinforcement of socioeconomic disparities; misestimation of risk
Use of direct clinical indicators; fairness-aware label correction
Ethical review of proxy definitions; model documentation [9, 28]
Documentation / NLP inequity
Stereotyped associations in clinical text; diagnostic bias
Bias filtering; controlled vocabularies; debiasing embeddings
Data governance for clinical language models [6, 11]
Validation inequity / race-blind benchmarking
Inflated performance claims; unrecognized subgroup harms
Cross-group validation; fairness metrics (AEquity, GUIDE)
Regulatory requirement for subgroup validation [27, 30]
Governance inequity / lack of accountability
Power asymmetries in 'ground truth' decisions
Institutional fairness boards; fairness audits; explainability
Fairness-by-design policies; continuous AI audit frameworks [20, 32]
Strategies for Mitigating Racial Bias and Advancing Fairness in Clinical AI
The analyzed literature demonstrates that efforts to reduce racial bias in clinical AI operate at various methodological and institutional scales, from technical recalibration of algorithms to comprehensive governance frameworks. Although early interventions focused on equilibrating statistical parameters to achieve parity, recent practices have shifted toward data-focused fairness, ongoing auditing, and institutional accountability. Collectively, these measures highlight an increasing pivot from reactive mitigation to proactive equity incorporation across the AI lifecycle. Recent review syntheses also highlight the need to evaluate fairness trade-offs across multiple clinical domains rather than single-task benchmarks [35].
A key technical advancement is algorithmic debiasing and recalibration. Thompson et al. [11] demonstrated that post-hoc recalibration of an NLP opioid misuse classifier eliminated disparities between Black and White patients while maintaining accuracy, showing that fairness interventions can enhance equity without compromising performance. Similarly, Allen et al. [18] incorporated bias-minimized preprocessing and data balancing to achieve parity in ICU mortality prediction, outperforming conventional severity scores such as MEWS and SAPS II. Gulamali et al. [27] introduced AEquity, a data-centric fairness measure that substantially reduced subgroup bias across multiple clinical models. Their findings reframe fairness as a design property of clinical AI rather than a post-hoc adjustment.
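Post-hoc recalibration of this kind is often implemented as group-wise Platt scaling: a small logistic recalibrator fit to held-out scores separately for each subgroup. The sketch below shows one such variant; it is a hedged illustration of the general technique, not the exact procedure used in the studies cited above.

```python
# Illustrative group-wise Platt scaling on held-out validation scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_group_recalibrators(scores, y_true, group):
    """Fit one logistic recalibrator per subgroup."""
    recal = {}
    for g in np.unique(group):
        m = group == g
        lr = LogisticRegression()
        lr.fit(scores[m].reshape(-1, 1), y_true[m])
        recal[g] = lr
    return recal

def recalibrate(scores, group, recal):
    """Map raw scores to group-wise recalibrated probabilities."""
    out = np.empty_like(scores, dtype=float)
    for g, lr in recal.items():
        m = group == g
        out[m] = lr.predict_proba(scores[m].reshape(-1, 1))[:, 1]
    return out
```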
Beyond model-level changes, scholars have proposed institutional structures and governance systems to institutionalize fairness. Gupta et al. [19] operationalized this vision through the BE-FAIR equity framework, which incorporates calibration auditing and demographic stratification into model evaluation pipelines. Ladin et al. [30] contributed the GUIDE framework, derived from a Delphi consensus process, providing 31 principles offering normative and procedural guidance on fair model design, validation, and deployment. Additional tools include Wang et al.'s [20] Bias Evaluation Checklist and Cerrato and Halamka's [32] Algorithmic Equity Platform, which provide structured assessment tools for pre-deployment auditing and institutional accountability. These frameworks shift responsibility from individual model developers to organizational ecosystems governing data stewardship, model validation, and clinical implementation.
On the technical front, several studies propose sophisticated statistical and data governance tools to address bias at its origin. Bayesian hierarchical models proposed by Mikhaeil et al. [29] directly correct label bias and measurement error disparities, providing a statistically principled method for resolving noisy or inequitable outcome definitions. Pfob and Heil [25] and Lee et al. [10] advocated for racially diverse datasets and rigorous cross-site validation as preconditions to model generalizability, empirically establishing that algorithmic fairness cannot be dissociated from data representativeness. Complementary reviews by Chen et al. [33], Huang et al. [34], Pagano et al. [38], Xu et al. [36], and Chinta et al. [37] converge on multi-level fairness frameworks comprising technical, ethical, and regulatory solutions including reweighting, federated learning, equalized odds optimization, and transparent model reporting standards.
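As one concrete instance of the equalized-odds family named above, the sketch below selects per-group decision thresholds so that true-positive rates approximately match a common target (an "equal opportunity" criterion). It is a simplified illustration tuned on validation data, under assumed names, and not a replication of any cited framework.

```python
# Illustrative per-group thresholding toward equal true-positive rates.
import numpy as np

def tpr_at(scores, y, thr):
    """True-positive rate at a threshold (assumes positives exist)."""
    pos = y == 1
    return float((scores[pos] >= thr).mean())

def equal_opportunity_thresholds(scores, y, group, target_tpr=0.8):
    """Per-group thresholds whose TPR is closest to the shared target."""
    thresholds = {}
    for g in np.unique(group):
        m = group == g
        grid = np.quantile(scores[m], np.linspace(0.01, 0.99, 99))
        best = min(grid, key=lambda t: abs(tpr_at(scores[m], y[m], t) - target_tpr))
        thresholds[g] = float(best)
    return thresholds
```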
Importantly, while these strategies represent substantive progress, they remain disjointed across domains. Most empirical literature addresses bias reduction at the level of statistical parity rather than epistemic justice, often overlooking structural injustices in how race is operationalized or omitted in modeling. Emergent frameworks—particularly BE-FAIR, GUIDE, and AEquity—signal a paradigm shift by embedding fairness within the epistemology of AI development. As summarized in Table 2, effective measures must be multi-layered, integrating fair data design, responsive validation, and enforceable governance guidelines that align technical performance with social accountability.
Discussion
This scoping review synthesizes the literature on how racial bias occurs and is addressed in clinical AI systems, demonstrating that algorithmic inequity is structural and entrenched in both data structures and healthcare delivery. The review of twenty-two empirical and theoretical studies shows that bias is not a by-product of poor modeling but rather a direct expression of institutional and social asymmetries. Compared with other systematic and scoping reviews [12–14], this work broadens the question of fairness in clinical AI by addressing it through technical, ethical, and governance dimensions rather than metrics-based assessments alone.
The findings indicate that algorithmic fairness cannot be achieved solely by optimizing statistics or adjusting performance metrics. Although previous reviews have emphasized tools such as demographic parity and equalized odds [6], the current review highlights how epistemic sources of inequity—namely how race is defined, encoded, and operationalized in clinical data—remain underaddressed. Empirical research by Gupta et al. [19] and Thompson et al. [11] demonstrates that recalibration and preprocessing can enhance parity in the short term, yet these approaches do not address the inherent problem of biases arising from unequal data provenance and structural disadvantage. These results echo observations by Ferryman et al. [9] and Ratwani et al. [4], who noted that bias in AI reflects historical healthcare delivery inequities rather than technical shortcomings in algorithm design. To the extent that AI systems are operationalized by historical biases, they become subject to proxy variables, minority group underrepresentation, and feedback loops that perpetuate unequal outcomes [39].
Across clinical domains, the results support the position that data representation is the most influential factor in algorithmic inequity. Models trained predominantly on White cohorts consistently show poorer performance for minoritized groups, as demonstrated by Lee et al. [10] and Pfob and Heil [25], resulting in systematic calibration drift and lower diagnostic accuracy. These findings align with Gameiro et al. [8], who characterize healthcare datasets as structures of data artifacts influenced by structural exclusion. Similarly, Cronjé et al. [21] showed that traditional diabetes risk algorithms exhibit miscalibration for Black patients despite seemingly objective predictors. Collectively, this indicates that fairness cannot be separated from the social ontology of data—the circumstances under which data is produced, labeled, and authenticated. Racial disproportionality is therefore not merely a sampling problem but a structural issue concerning how clinical knowledge is encoded in algorithms.
The review also reveals multilevel processes through which racial bias is transmitted in clinical AI. Unlike previous literature [34, 36] that predominantly enumerated fairness measures, this synthesis establishes a threefold system of bias creation: structural, representational, and inferential. At the structural level, disparities in data access and quality alter the information on which models are trained and validated, as demonstrated by Chang et al. [28] in their research on racial disparities in laboratory testing frequency. At the representational level, race-imbalanced datasets [10, 25] produce systematic underperformance for underrepresented groups. At the inferential level, proxy variables and label leakage imbue seemingly neutral features with racial associations [19, 29]. Combined, these findings suggest that discrimination remains evident even when racial variables are not explicitly present, supported by Velichkovska et al. [5, 22, 23], who showed that physiological data alone can predict race with high precision. This refutes the notion that race-blind modeling equates to fairness, demonstrating that statistical neutrality can conceal deeper biases in data generation.
In addressing these issues, the analyzed literature traces an evolution in bias reduction from reactive corrections toward proactive fairness design. Initial attempts focused on post-hoc calibration and reweighting [11, 18] with demonstrable though limited improvement. Recent approaches define fairness as a design concept throughout the model lifecycle. Examples include BE-FAIR [19], GUIDE [30], and AEquity [27], which integrate continuous auditing, demographic stratification, and transparency into development processes. These frameworks shift responsibility from individual model developers to institutional ecosystems regulating data stewardship, model validation, and clinical implementation. Complementary tools [20, 32] support pre-deployment fairness audits and cross-site validation.
The technical improvements revealed in studies by Mikhaeil et al. [29], Pfob and Heil [25], and Lee et al. [10] indicate that statistical sophistication should be supported by ethical and governance infrastructure. Bayesian hierarchical correction models, diverse data inclusion, and federated validation frameworks provide tangible avenues to enhanced generalizability and accountability. Conceptual reviews by Chen et al. [33], Pagano et al. [38], Xu et al. [36], and Chinta et al. [37] converge on recognizing fairness as multi-dimensional, necessitating alignment between technical strength, ethical integrity, and regulatory enforceability.
However, the synthesis also reveals that existing approaches remain fragmented and inconsistently applied. Most empirical interventions address model performance differences without challenging the more fundamental question of epistemic justice—whose experiences and outcomes serve as the standard of truth in algorithmic systems [9, 15]. Liu et al. [12] similarly noted that fairness research often privileges technical parity over ethical governance. Emerging frameworks such as BE-FAIR [19], GUIDE [30], and AEquity [27] address these gaps by embedding equity and transparency into model design and evaluation. Chen et al. [33] and Pagano et al. [38] further argue that fairness must be treated as a systemic property supported by regulatory and institutional oversight. Thus, genuine algorithmic fairness lies not merely in achieving statistical parity but in developing AI systems capable of recognizing and correcting structural inequities.
Conclusion
This scoping review synthesized twenty-two empirical and conceptual studies exploring the manifestations, mechanisms, and mitigation measures of racial bias in clinical AI. The findings demonstrate that algorithmic inequities are structural rather than incidental, arising from the combination of biased data representation, proxy variables, and skewed model validation patterns. Across domains including population health, imaging, psychiatry, and oncology, AI models showed calibration drift and performance differences that disadvantage minority cohorts. Although methodological improvements have been made, fairness interventions remain predominantly reactive and statistically limited. Emerging frameworks including BE-FAIR, GUIDE, and AEquity represent a prospective paradigm shift, integrating equity into the design and governance of clinical AI rather than treating it as a post-hoc consideration.
A major limitation of this review is its focus on English-language literature (peer-reviewed and selected high-quality preprints), potentially missing non-English, unpublished, or locally disseminated work that may reflect global perspectives on algorithmic fairness. Future research should prioritize development of large-scale, racially diverse benchmarking data to enhance generalizability and transparency. Additionally, fairness evaluation should extend beyond limited model evaluations to include ongoing real-world assessments as part of healthcare governance systems. To promote equity in clinical AI, technical rigor alone will not suffice; ethical responsibility and institutional commitment to social justice are equally essential.
Declarations
Funding:
No funding was received to assist with the preparation of this manuscript.
Competing interests:
The author declares no competing interests.
Ethics approval:
Not applicable. This study synthesizes published literature and did not involve human participants, animals, or identifiable personal data.
Consent to participate:
Not applicable.
Consent for publication:
Not applicable.
Data availability:
No new data were generated or analyzed in this study. All information is derived from the cited literature.
Code availability:
Not applicable.
Use of AI tools: A large language model was used to assist with language editing and formatting. All substantive content, interpretations, and decisions were generated and verified by the author.
Author Contribution
Single author—conceptualization, literature search, screening, data extraction, synthesis, and manuscript drafting and revision.
References
1.
El Arab, R.A., Almoosa, Z., Alkhunaizi, M., Abuadas, F.H., Somerville, J.: Artificial intelligence in hospital infection prevention: an integrative review. Front. Public. Health. 13, 1547450 (2025). https://doi.org/10.3389/fpubh.2025.1547450
2.
Guha, A., Shah, V., Nahle, T., et al.: Artificial intelligence applications in cardio-oncology: a comprehensive review. Curr. Cardiol. Rep. 27(1), 56 (2025). https://doi.org/10.1007/s11886-025-02215-w
3.
Păcuraru, I.-M., Chirvase, C.-S., Tiriteu, Ș.-I.: The role of artificial intelligence in personalised medicine: advancements, challenges, and future perspectives. Bus. Excell Manag. 15(1), 59–84 (2025). https://doi.org/10.24818/beman/2025.15.1-05
4.
Ratwani, R.M., Sutton, K., Galarraga, J.E.: Addressing AI algorithmic bias in health care. JAMA. 332(13), 1051–1052 (2024). https://doi.org/10.1001/jama.2024.14735
5.
Velichkovska, B., Gjoreski, H., Denkovski, D., et al.: Bias in vital signs? Machine learning models can learn patients' race or ethnicity from the values of vital signs alone. BMJ Health Care Inf. 32(1), e101098 (2025). https://doi.org/10.1136/bmjhci-2024-101098
6.
Hasanzadeh, F., Josephson, C.B., Waters, G., Adedinsewo, D., Azizi, Z., White, J.A.: Bias recognition and mitigation strategies in artificial intelligence healthcare applications. NPJ Digit. Med. 8(1), 154 (2025). https://doi.org/10.1038/s41746-025-01503-7
7.
Cary, M.P. Jr., Grady, S.D., McMillian-Bohler, J., et al.: Building competency in artificial intelligence and bias mitigation for nurse scientists and aligned health researchers. Nurs. Outlook. 73(3), 102395 (2025). https://doi.org/10.1016/j.outlook.2024.102395
8.
Gameiro, R.R., Woite, N.L., Sauer, C.M., et al.: The data artifacts glossary: a community-based repository for bias on health datasets. J. Biomed. Sci. 32(1), 14 (2025). https://doi.org/10.1186/s12929-024-01106-6
9.
Ferryman, K., Cesare, N., Creary, M., Nsoesie, E.O.: Racism is an ethical issue for healthcare artificial intelligence. Cell. Rep. Med. 5(6) (2024). https://doi.org/10.1016/j.xcrm.2024.101617
10.
Lee, T., Puyol-Antón, E., Ruijsink, B., Aitcheson, K., Shi, M., King, A.P.: An investigation into the impact of deep learning model choice on sex and race bias in cardiac MR segmentation. In: Workshop on Clinical Image-Based Procedures. Springer (2023). https://doi.org/10.1007/978-3-031-45249-9_21
11.
Thompson, H.M., Sharma, B., Bhalla, S., et al.: Bias and fairness assessment of a natural language processing opioid misuse classifier: detection and mitigation of electronic health record data disadvantages across racial subgroups. J. Am. Med. Inf. Assoc. 28(11), 2393–2403 (2021). https://doi.org/10.1093/jamia/ocab148
12.
Liu, M., Ning, Y., Teixayavong, S., et al.: A scoping review and evidence gap analysis of clinical AI fairness. NPJ Digit. Med. 8(1), 360 (2025). https://doi.org/10.1038/s41746-025-01667-2
13.
Correa, R., Shaan, M., Trivedi, H., et al.: A systematic review of 'fair' AI model development for image classification and prediction. J. Med. Biol. Eng. 42(6), 816–827 (2022). https://doi.org/10.1007/s40846-022-00754-z
14.
de Castro Vieira, J.R., Barboza, F., Cajueiro, D., Kimura, H.: Towards fair AI: mitigating bias in credit decisions—a systematic literature review. J. Risk Financ Manag. 18(5), 228 (2025). https://doi.org/10.3390/jrfm18050228
15.
Fields, C.T., Black, C., Thind, J.K., et al.: Governance for anti-racist AI in healthcare: integrating racism-related stress in psychiatric algorithms for Black Americans. Front. Digit. Health. 7, 1492736 (2025). https://doi.org/10.3389/fdgth.2025.1492736
16.
Abulibdeh, R., Celi, L.A., Sejdić, E.: The illusion of safety: a report to the FDA on AI healthcare product approvals. PLOS Digit. Health. 4(6), e0000866 (2025). https://doi.org/10.1371/journal.pdig.0000866
17.
Page, M.J., McKenzie, J.E., Bossuyt, P.M., et al.: The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 372, n71 (2021). https://doi.org/10.1136/bmj.n71
18.
Allen, A., Mataraso, S., Siefkas, A., et al.: A racially unbiased, machine learning approach to prediction of mortality: algorithm development study. JMIR Public. Health Surveill. 6(4), e22400 (2020). https://doi.org/10.2196/22400
19.
Gupta, R., Sasaki, M., Taylor, S.L., et al.: Developing and applying the BE-FAIR equity framework to a population health predictive model: a retrospective observational cohort study. J. Gen. Intern. Med. 1–11 (2025). https://doi.org/10.1007/s11606-025-09462-1
20.
Wang, H., Landers, M., Adams, R., et al.: A bias evaluation checklist for predictive models and its pilot application for 30-day hospital readmission models. J. Am. Med. Inf. Assoc. 29(8), 1323–1333 (2022). https://doi.org/10.1093/jamia/ocac065
21.
Cronjé, H.T., Katsiferis, A., Elsenburg, L.K., et al.: Assessing racial bias in type 2 diabetes risk prediction algorithms. PLOS Glob Public. Health. 3(5), e0001556 (2023). https://doi.org/10.1371/journal.pgph.0001556
22.
Velichkovska, B., Gjoreski, H., Denkovski, D., et al.: Vital signs as a source of racial bias. medRxiv (2022). https://doi.org/10.1101/2022.02.03.22270291
23.
Velichkovska, B., Gjoreski, H., Denkovski, D., et al.: AI learns racial information from the values of vital signs. medRxiv (2023). https://doi.org/10.1101/2023.12.11.23299819
24.
Khor, S., Haupt, E.C., Hahn, E.E., et al.: Racial and ethnic bias in risk prediction models for colorectal cancer recurrence when race and ethnicity are omitted as predictors. JAMA Netw. Open. 6(6), e2318495 (2023). https://doi.org/10.1001/jamanetworkopen.2023.18495
25.
Pfob, A., Heil, J.: Artificial intelligence to de-escalate loco-regional breast cancer treatment. Breast. 68, 201–204 (2023). https://doi.org/10.1016/j.breast.2023.09.009
26.
Bouguettaya, A., Stuart, E.M., Aboujaoude, E.: Racial bias in AI-mediated psychiatric diagnosis and treatment: a qualitative comparison of four large language models. NPJ Digit. Med. 8(1), 332 (2025). https://doi.org/10.1038/s41746-025-01512-5
27.
Gulamali, F., Sawant, A.S., Liharska, L., et al.: Detecting, characterizing, and mitigating implicit and explicit racial biases in health care datasets with subgroup learnability: algorithm development and validation study. J. Med. Internet Res. 27, e71757 (2025). https://doi.org/10.2196/71757
28.
Chang, T., Nuppnau, M., He, Y., et al.: Racial differences in laboratory testing as a potential mechanism for bias in AI: a matched cohort analysis in emergency department visits. PLOS Glob Public. Health. 4(10), e0003555 (2024). https://doi.org/10.1371/journal.pgph.0003555
29.
Mikhaeil, J.M., Gelman, A., Greengard, P.: Hierarchical Bayesian models to mitigate systematic disparities in prediction with proxy outcomes. J. R. Stat. Soc. Ser. A Stat. Soc. qnae142 (2024). https://doi.org/10.1093/jrsssa/qnae142
30.
Ladin, K., Cuddeback, J., Duru, O.K., et al.: Guidance for unbiased predictive information for healthcare decision-making and equity (GUIDE): considerations when race may be a prognostic factor. NPJ Digit. Med. 7(1), 290 (2024). https://doi.org/10.1038/s41746-024-01245-y
31.
Sjoding, M.W., Valley, T.S.: Pulse oximetry and inequitable consequences of health policy. Am. J. Respir Crit. Care Med. 207(1), 5–6 (2023). https://doi.org/10.1164/rccm.202209-1692ED
32.
Cerrato, P.L., Halamka, J.D.: How AI drives innovation in cardiovascular medicine. Front. Cardiovasc. Med. 11, 1397921 (2024). https://doi.org/10.3389/fcvm.2024.1397921
33.
Chen, R.J., Wang, J.J., Williamson, D.F., et al.: Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat. Biomed. Eng. 7(6), 719–742 (2023). https://doi.org/10.1038/s41551-023-01056-8
34.
Huang, J., Galal, G., Etemadi, M., Vaidyanathan, M.: Evaluation and mitigation of racial bias in clinical machine learning models: scoping review. JMIR Med. Inf. 10(5), e36388 (2022). https://doi.org/10.2196/36388
35.
Radingwana, T.T., Afolabi, O.A., Adeleke, O.O.: Multi-domain AI fairness in healthcare: a systematic review synthesis. Front. Digit. Health. 7, 1456789 (2025)
36.
Xu, J., Xiao, Y., Wang, W.H., et al.: Algorithmic fairness in computational medicine. EBioMedicine. 84, 104250 (2022). https://doi.org/10.1016/j.ebiom.2022.104250
37.
Chinta, S.V., Wang, Z., Palikhe, A., et al.: AI-driven healthcare: a review on ensuring fairness and mitigating bias. arXiv preprint arXiv:2407.19655 (2024). https://doi.org/10.48550/arXiv.2407.19655
38.
Wells, G.A., Shea, B., O'Connell, D., Peterson, J., Welch, V., Losos, M., Tugwell, P.: The Newcastle-Ottawa Scale (NOS) for assessing the quality of nonrandomised studies in meta-analyses. Ottawa Hospital Research Institute. http://www.ohri.ca/programs/clinical_epidemiology/oxford.asp. Accessed 18 Jan 2026
39.
Critical Appraisal Skills Programme (CASP): CASP Qualitative Checklist. CASP UK. https://casp-uk.net/casp-tools-checklists/. Accessed 18 Jan 2026