Segmenting with Confidence: Uncertainty Quantification for Brain Tumor Imaging
Authors: Yassine Guennoun1, Pierre Nedelec1, Mark McArthur1, Evan Bloch1, Jinchi Wei1, Leo Sugrue1, Evan Calabrese2, Andreas Rauschecker1.
Institutions:
1 Center for Intelligent Imaging (ci2), Department of Radiology & Biomedical Imaging, University of California San Francisco (UCSF), San Francisco, USA.
2 Department of Radiology, Duke University Medical Center, Durham, NC, USA.
Corresponding author: Yassine Guennoun, yassineguennoun02@gmail.com
Abstract
Purpose
To develop and validate a deep learning framework that provides clinically meaningful uncertainty estimates for meningioma segmentation, enabling more trustworthy longitudinal volumetric assessment.
Materials and Methods
In this retrospective study, we developed an evidential deep learning (EDL) ensemble framework and trained it on 1,655 post-contrast T1-weighted brain MRIs from 788 patients with meningiomas. We evaluated the clinical utility of an architecturally heterogeneous ensemble on an independent test set of 68 MRIs from 43 patients. We compared its performance to other state-of-the-art segmentation models and uncertainty estimates. The evaluation included: (1) assessment of Dice score and overall segmentation accuracy, (2) qualitative correspondence of spatial uncertainty maps with neuroradiologist-defined ambiguity, (3) quantitative calibration of volumetric credible intervals using empirical coverage, assessing whether the model’s intervals indeed contained the true volume in 95% of cases and (4) external validation on an independent cohort from another institution to confirm generalizability.
Results
High segmentation accuracy was achieved across all ensemble configurations (median Dice ≈ 0.93), with spatial uncertainty maps qualitatively aligning with regions rated as difficult by a neuroradiologist. Out of all models tested, the heterogeneous EDL ensemble produced the most reliable volumetric credible intervals, capturing the true tumor volume in 92.8% of cases. External validation on an independent cohort of 353 patients with meningioma confirmed high generalizability, achieving a median Dice of 0.92.
Conclusion
Evidential deep learning ensembles provide well-calibrated uncertainty estimates while maintaining high segmentation accuracy. Architectural diversity enhances credible interval calibration, enabling more trustworthy single time-point and longitudinal assessments and supporting safer clinical deployment of automated meningioma segmentation. The methods presented here for meningioma are directly applicable to medical image lesion segmentation more broadly, promising to increase trust and safety in the use of AI in medical imaging.
Key words:
Meningioma
Brain MRI
Uncertainty Estimation
Deep Learning
Evidential Learning
Tumor Segmentation
Volume Quantification
Introduction
Medical imaging is central to the practice of medicine, allowing experts to localize, diagnose, and longitudinally follow abnormalities inside the body. For example, in neuro-oncology practice, MRI-based monitoring of brain tumors, such as meningiomas, guides critical treatment decisions. The clinical standard for measuring tumors often relies on either subjective evaluation or simplified bi-dimensional measurements as outlined by the Response Assessment in Neuro-Oncology (RANO) working group, primarily designed for gliomas (Wen et al. 2023) and extended to meningiomas (Huang et al. 2019). However, for tumors with slow or irregular growth patterns, these 2D metrics can fail to capture the true tumor burden, with studies showing that 3D volumetric analysis provides a more accurate picture of progression (Ramakrishnan et al. 2023; Ellingson et al. 2021).
While deep learning has demonstrated strong performance metrics for 3D segmentation across a wide variety of biomedical image segmentation tasks, clinical adoption of these tools has been limited. A primary reason for limited adoption is a lack of trust and interpretability in overconfident deterministic models. Standard segmentation models produce deterministic outputs that mask the inherent uncertainty of the prediction, making it difficult for clinicians to distinguish true biological growth from measurement noise. Furthermore, small segmentation inaccuracies near eloquent brain regions or along tumor margins could have substantial consequences for radiation or surgical planning, or for evaluation of tumor growth versus stability, directly affecting treatment decisions.
Until mechanisms exist to quantify and communicate voxelwise model confidence in individual patients, segmentation methods will remain limited in terms of clinical applicability. Model precision, and its confidence, can vary by tumor pathology, across tumors of the same pathology in different locations or with different imaging appearances, and within portions of a single tumor. Uncertainty quantification (UQ) (Lambert et al. 2022) is a group of methods for addressing this inherent variability in model confidence. Techniques include: Bayesian methods such as Monte Carlo Dropout (Gal and Ghahramani 2015) and Deep Ensembles (Lakshminarayanan, Pritzel, and Blundell 2016) for capturing model uncertainty, and generative approaches like the Probabilistic U-Net (Kohl et al. 2018) for modeling structural ambiguity in segmentations. A more recent paradigm is Evidential Deep Learning (EDL), which directly models evidence from data to form a predictive distribution (Sensoy, Kaplan, and Kandemir 2018; Li et al. 2023). However, recent work has shown that single-model EDL can produce unreliable uncertainty estimates, motivating the need for more robust, ensemble-based approaches (Shen et al. 2024).
The purpose of this study was to develop and validate a fully automated framework for tumor segmentation on real clinical brain MRIs, built on ensembles of EDL models, capable of producing clinically meaningful volumetric credible intervals and interpretable uncertainty maps. We focus this study on meningiomas, the most common primary brain tumor, accounting for over one-third of all intracranial tumors and nearly half of all primary brain tumors (Baldi et al. 2018; Lin et al. 2019; Price et al. 2024). While some meningiomas have very well-defined borders and model confidence would be expected to be high, others are infiltrative, abut anatomical structures that obscure the tumor boundary, or demonstrate unusual shapes, all leading to expected regions of model uncertainty that are critical to consider when measuring tumor volumes or when communicating tumor boundaries to clinicians. Notably, our study also includes post-operative brain MRIs; uncertainty can be especially high when enhancing treatment-related changes abut enhancing tumors, yet these MRIs are critical for evaluating growth of residual or recurrent tumors (Goldbrunner et al. 2021). We hypothesized that the clinical reliability and calibration of these uncertainty estimates are critically dependent on architectural heterogeneity and that they directly relate to radiologists’ uncertainty of tumor boundaries.
Results
An evidential deep learning (EDL) ensemble framework was trained to detect and segment meningiomas based on T1-weighted post-contrast images acquired as part of routine clinical care (Fig. 1). Importantly, as described in detail in Methods, the EDL framework allows for estimation of probability of tumor presence at each voxel, translating into a measure of voxel-wise uncertainty and overall uncertainty of tumor volume. Further details of model training can be found in the Methods section.
Fig. 1
Conceptual overview of the evidential deep learning ensemble framework. A T1 post-contrast MRI is processed by an ensemble of models. Each model outputs the parameters of a Beta distribution per voxel, representing the probability of tumor presence. These individual distributions are combined into a mixture model, from which a final posterior probability is derived to classify each voxel and quantify uncertainty.
The EDL framework models demonstrate state-of-the-art segmentation performance
For the test set, we established a strict human reference standard where every case was manually segmented by a neuroradiologist. We then compared the segmentation performance of three ensemble configurations against this ground truth. A baseline homogeneous ensemble (five SegResNets without EDL) achieved a median Dice of 0.93. Our proposed homogeneous EDL ensemble (five SegResNets) scored a median Dice of 0.95, while our heterogeneous EDL ensemble (mixed-architecture) scored a median Dice of 0.93 (Fig. 2). A repeated-measures ANOVA revealed a significant effect of the ensemble type on Dice scores, with the homogeneous EDL ensemble outperforming both the heterogeneous and baseline configurations. Both EDL ensembles outperformed a baseline nnU-Net v2 model (Isensee et al. 2021) trained on the same dataset, which achieved a median Dice score of 0.77 (p < 0.05). These results demonstrate that the EDL framework, in both its homogeneous and heterogeneous implementations, provides uncertainty quantification without compromising segmentation accuracy.
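For readers less familiar with the overlap metric reported above, the following is a minimal sketch of the Dice score for two binary masks. The function name `dice_score`, the flat-list mask representation, and the convention that two empty masks score 1.0 are illustrative assumptions, not details taken from the study's evaluation code.

```python
def dice_score(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled tumor in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks agree perfectly.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy masks (illustrative only): 2 shared tumor voxels, 3 tumor voxels in each mask.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 1, 1, 0, 1, 0]
example_dice = dice_score(pred, truth)  # 2 * 2 / (3 + 3)
```

A Dice of 0 means no overlapping tumor voxels, which is the situation described for the zero-Dice points in Fig. 2.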
While all five individual architectures in the mixed ensemble achieve a similarly high median Dice score, they possess different and often uncorrelated failure modes. We observed instances where the SWIN UNETR and DiNTS failed while the SegResNets succeeded, and vice versa.
Fig. 2
Quantitative evaluation of the heterogeneous EDL ensemble on the test set (N = 68). (A) Distribution of Dice scores, stratified by ground truth tumor volume. The model maintains high performance across a range of tumor sizes. The few zero-Dice points correspond to challenging or out-of-distribution cases: the green dot reflects the most difficult example (rated 9/10), while the others are small 4 cm³ tumors. (B) Predicted versus ground truth volume, colored by radiologist-assessed difficulty. Error bars indicate the 95% credible interval derived through the EDL framework. The credible interval widens for smaller and more difficult tumors, demonstrating appropriate uncertainty calibration.
External validation demonstrates high generalizability of the model
To assess generalizability beyond our private internal dataset, we evaluated the homogeneous SegResNet (EDL) ensemble on an independent external cohort of 353 patients with meningiomas obtained from Duke University. The model achieved a median Dice score of 0.92, not significantly different from the performance observed on the internal (UCSF) test set (p = 0.57, independent t-test), demonstrating cross-institutional robustness and consistency. A detailed analysis of performance as a function of tumor volume is provided in the Supplementary Material, where Dice scores are plotted against ground truth volume to visualize performance stability across tumor sizes (Figures S1 and S2).
Fig. S1
Quantitative evaluation of the homogeneous EDL ensemble on the test set (N = 68). (A) Distribution of Dice scores, stratified by ground truth tumor volume. The model maintains high performance across a range of tumor sizes. The few zero-Dice points correspond to challenging or out-of-distribution cases: the green dot reflects the most difficult example (rated 9/10), while the others are small 4 cm³ tumors with resolutions differing from the training data. (B) Predicted versus ground truth volume, colored by radiologist-assessed difficulty. Error bars indicate the 95% credible interval derived through the EDL framework.
Appendix F - Dice score per tumor volume
A consistent performance trend is observed across both the internal UCSF (Figure S1) and external Duke (Figure S2) test sets. The model's segmentation accuracy, as measured by the Dice score, shows a clear positive correlation with tumor volume. For smaller tumors, particularly those in the 0.3-1 cm³ range, the model is more sensitive to error, exhibiting lower median Dice scores, greater variance, and a higher number of outlier cases with poor overlap. Conversely, as tumor volume increases, the model's performance becomes significantly more robust and reliable. For larger tumors (> 5 cm³), the segmentation is highly accurate.
Fig. S2
Segmentation Performance by Tumor Volume on the UCSF Dataset. Boxplots of Dice scores achieved by the model, stratified by ground truth tumor volume groups (cm³). The number of cases (n) in each group is indicated on the x-axis.
EDL framework improves uncertainty quantification over prior methods
Beyond segmentation accuracy, we analyzed the quality of the predictions and uncertainty estimates. Empirical coverage refers to the percentage of cases where the true ground-truth volume falls within the model's predicted 95% credible interval. Thus, a 95% credible interval should ideally result in empirical coverage of 95%.
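The definition of empirical coverage above can be sketched in a few lines. The function name `empirical_coverage` and the toy volumes and intervals are hypothetical and serve only to illustrate the calculation.

```python
def empirical_coverage(true_volumes, intervals):
    """Fraction of cases whose true volume lies inside its credible interval."""
    if len(true_volumes) != len(intervals):
        raise ValueError("need exactly one interval per case")
    if not true_volumes:
        raise ValueError("need at least one case")
    hits = sum(1 for v, (lo, hi) in zip(true_volumes, intervals) if lo <= v <= hi)
    return hits / len(true_volumes)


# Toy example in cm^3 (illustrative values, not study data).
true_volumes = [4.2, 12.0, 0.8, 30.5]
intervals = [(3.9, 4.6), (10.5, 11.8), (0.5, 1.1), (28.0, 33.0)]
coverage = empirical_coverage(true_volumes, intervals)  # second case is missed: 3 / 4
```

Coverage well below the nominal 95% indicates intervals that are too narrow (overconfidence), while coverage well above it indicates intervals that are wider than necessary.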
Table 1
Empirical coverage and Pearson correlations between predicted 95% credible interval width and segmentation metrics (segmentation error, average symmetric surface distance, and volumetric error) across four different frameworks. Training details, including for MC Dropout, are provided in Appendix C. p refers to the dropout rate. The metrics are defined in Appendix D and calculated on all 68 segmentations in the test set.
| Method | Coverage | Seg. Error, r (R²) | ASSD, r (R²) | Vol. Error, r (R²) |
|---|---|---|---|---|
| Homogeneous EDL (SegResNet) | 87% | 0.78 (0.61) | 0.56 (0.32) | 0.73 (0.53) |
| MC Dropout (p = 0.2) | 66% | 0.79 (0.63) | 0.73 (0.54) | 0.74 (0.54) |
| MC Dropout (p = 0.5) | 72% | 0.81 (0.65) | 0.73 (0.54) | 0.74 (0.55) |
| Mixed Arch. (EDL) | 93% | 0.82 (0.68) | 0.71 (0.50) | 0.63 (0.40) |
The results (Table 1) reveal differences in uncertainty quality across methods, which we analyze by first comparing our two proposed EDL ensembles and then benchmarking against MC Dropout (Gal and Ghahramani 2015).
First, we assessed the impact of architectural diversity by comparing the homogeneous and heterogeneous EDL ensembles. The homogeneous EDL using five SegResNets demonstrated 87% coverage and strong correlations with segmentation (R² = 0.61) and volumetric error (R² = 0.53). In comparison, the heterogeneous mixed architecture EDL ensemble demonstrated the benefits of architectural diversity. It achieved the highest coverage (93%), reflecting improved calibration, and exhibits stronger correlations with geometric error (R² = 0.50 vs 0.32) and segmentation error (R² = 0.68 vs 0.61). These findings indicate that combining diverse architectures enhances the model’s ability to capture spatial and volumetric deviations through its uncertainty estimates.
Next, we compared our proposed mixed architecture EDL ensemble to the MC Dropout baseline. While MC Dropout (p = 0.5) showed strong error correlations (R² = 0.65 for segmentation error, R² = 0.54 for ASSD), the mixed architecture EDL ensemble achieved comparable performance (R² = 0.68, R² = 0.50) while providing substantially better coverage (93% vs 72%). This higher coverage indicates that EDL avoids the overconfidence of MC Dropout, producing more reliable credible intervals. In clinical practice, this improved calibration is critical for reducing false negatives, thereby supporting safer decision-making.
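The correlations discussed above relate the width of each case's 95% credible interval to an error metric. Table 1 reports a Pearson correlation followed by a second value in parentheses, and the text quotes the parenthetical values as R²; for a simple linear relationship R² is the square of the Pearson coefficient (e.g. 0.78² ≈ 0.61). A minimal sketch follows, in which the function name `pearson_r` and the toy numbers are assumptions rather than study data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# Toy data (illustrative only): interval width in cm^3 vs. absolute volume error in cm^3.
interval_widths = [0.4, 0.9, 1.5, 2.2, 3.0]
volume_errors = [0.1, 0.3, 0.2, 0.6, 0.8]
r = pearson_r(interval_widths, volume_errors)
r_squared = r ** 2  # share of error variance explained by interval width
```

A high R² means that wide intervals tend to accompany large errors, which is the behavior one wants from an uncertainty estimate.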
Uncertainty-aware segmentation is qualitatively meaningful
The novel mixed architecture EDL ensemble model produced realistic uncertainty-aware segmentation outputs (Fig. 3). As detailed in the Methods, the framework decomposes total uncertainty into two distinct components. Aleatoric-like uncertainty captures ambiguity intrinsic to the task itself, identifying voxels that are inherently difficult to classify regardless of the model's training. Epistemic-like uncertainty reflects the model’s internal lack of knowledge, arising when it encounters anatomy that differs from the patterns seen in its training data. In challenging meningioma cases, we found that both the epistemic-like and aleatoric-like uncertainty estimates were high along the borders of the tumor, in particular at boundaries with other enhancing structures. The epistemic-like uncertainty maps reflect regions of uncertainty caused by the interface of enhancing tumor with other anatomically normal enhancing structures, such as the dural venous sinuses or post-operative changes. Aleatoric-like uncertainty, meanwhile, highlights the inherent uncertainty in the boundary of tumors in MRI data (e.g. due to partial voluming). Credible intervals tended to be wider for cases judged as difficult to segment by a neuroradiologist (Figure S3), although there was no significant difference, likely due to insufficient data for very difficult cases.
Fig. 3
Qualitative visualization of the uncertainty-aware segmentation results for four challenging (difficulty score ≥ 7/10 as rated by a neuroradiologist) meningioma cases (A–D). Each panel shows a spatial decomposition of predictive uncertainty into epistemic and aleatoric components on a representative slice of the 3D input volume. Aleatoric uncertainty systematically concentrates along tumor boundaries, aligning with regions of intrinsic voxel-level ambiguity, while epistemic uncertainty appears more spatially clustered, often localizing to anatomically ambiguous regions such as tumor-brain interfaces. Patient A’s tumor is challenging because of involvement of the adjacent superior sagittal sinus, which also enhances similarly to the meningioma. Patient B’s tumor is challenging because of involvement of the adjacent cavernous sinus, which enhances similarly to the meningioma. Patient C is a post-operative brain MRI with residual/recurrent meningioma, but some of the enhancing structures represent normal vasculature or post-operative scar tissue, making differentiation from residual tumor difficult. Patient D’s tumor extends along the dural reflection towards the sphenoparietal sinus, which is difficult to distinguish from the tumor boundary.
Next, to assess robustness under image acquisition variability, we simulated clinically plausible degradations such as motion blur and reduced resolution (Fig. 4). These degradations commonly arise in routine clinical workflows. Despite these perturbations, the model remained qualitatively robust, with uncertainty increasing appropriately under both conditions. Specifically, for well-defined, high-contrast tumors with sharp borders (Patient A), segmentation and uncertainty maps remained stable. In more ambiguous cases (Patients B–D), degradations amplified uncertainty in a clinically coherent manner. Thus, qualitatively, the uncertainty maps appear to offer a reliable signal of reduced confidence in the setting of image degradations, specifically in cases that require higher quality images.
Fig. 4
Qualitative visualization of segmentation robustness and uncertainty decomposition under image degradation for four MRIs with meningiomas (A–D). Each MRI is shown under three imaging conditions: original, motion blur, and low resolution, with columns depicting (i) prediction overlaid with ground truth, (ii) aleatoric uncertainty, and (iii) epistemic uncertainty. The human reference standard segmentation is shown in blue and the model prediction in green overlay. For an easier case such as Patient A, both the segmentation and uncertainty maps remain stable across degraded inputs, indicating strong model confidence despite image degradations. For Patients C and D, degradation leads to visibly increased uncertainty without significantly affecting segmentation accuracy, consistent with the model expressing higher uncertainty while maintaining performance. In contrast, Patient B, a particularly challenging case of a cavernous sinus meningioma involving complex anatomy, shows both elevated uncertainty and a clear drop in segmentation accuracy under motion blur and low resolution. These results confirm the expected relationship: as real-world image artifacts increase task difficulty, the model becomes more uncertain, reflecting appropriate calibration of predictive confidence.
Utility of uncertainty in longitudinal evaluations of brain tumors
The above results demonstrate the utility of uncertainty estimates in guiding physicians to investigate areas of tumor segmentations that may be less reliable or where model confidence is low. To demonstrate the potential utility of uncertainty estimation in a longitudinal setting, voxel-wise uncertainty maps are translated into 95% credible intervals around volumetric measurements of the tumors (Fig. 5). These credible intervals endow longitudinal variation in volumes with interpretability, differentiating measurement noise from true changes in tumor volume.
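One simple way to use such intervals when reading a longitudinal series is sketched below: a change between consecutive timepoints is flagged only when the two 95% credible intervals do not overlap. This non-overlap rule is a conservative heuristic assumed here for illustration; it is not stated to be the decision rule used in this study, and the names `intervals_overlap` and `flag_changes` and all values are hypothetical.

```python
def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def flag_changes(timepoints):
    """Compare consecutive timepoints given as (label, volume_cm3, (lo, hi)).

    Returns (earlier_label, later_label, changed) tuples, where `changed` is
    True only when the two credible intervals do not overlap.
    """
    flags = []
    for prev, curr in zip(timepoints, timepoints[1:]):
        changed = not intervals_overlap(prev[2], curr[2])
        flags.append((prev[0], curr[0], changed))
    return flags


# Toy series (illustrative values, not patient data).
series = [
    ("t0", 5.0, (4.5, 5.6)),
    ("t1", 5.4, (4.8, 6.1)),  # overlaps t0: difference may be measurement noise
    ("t2", 7.9, (7.0, 8.8)),  # no overlap with t1: likely a true change
]
changes = flag_changes(series)
```

In this toy series the step from t0 to t1 is not flagged, whereas the step from t1 to t2 is.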
Fig. 5
Longitudinal meningioma volume evolution in a single patient post-resection. Left: Tumor volume progression (solid blue line) with 95% credible intervals (shaded area). Axial MRIs show the anatomical context at each timepoint. Right: A zoom on the final timepoint reveals a region of high epistemic uncertainty at the tumor margin, where it abuts post-surgical granulation tissue, illustrating how model uncertainty can directly impact volumetric assessment.
Additional results and subgroup analyses are reported in Appendix E, Appendix F and Appendix G.
Discussion
In this work, we developed and validated an uncertainty-aware framework for meningioma volumetrics. Our approach, using evidential deep learning (EDL) ensembles, successfully produced principled credible intervals for tumor volume while maintaining state-of-the-art segmentation accuracy. By directly modeling the amount of evidence supporting the prediction at every voxel in the image, the EDL framework enables neural networks to express both what they predict and how confident they are. In the context of meningiomas, the most common primary brain tumor, our validation of this method demonstrates state-of-the-art segmentation accuracy with robust, meaningful uncertainty metrics. This capability has the potential to substantially increase trust in biomedical image segmentation, particularly in applications such as brain tumor volumetrics where decisions are sensitive to boundary ambiguities and subtle longitudinal changes in the context of heterogeneous image quality.
The quality of uncertainty estimates is central to clinical applicability (Faghani et al. 2023). In our analysis, the heterogeneous EDL ensemble achieved the highest empirical coverage (92.8%), indicating that the reported credible intervals around the predicted tumor volumes reliably reflect the true variability of tumor volume estimates. Moreover, error metrics of segmentations were correlated with degree of uncertainty. In practice, uncertainty maps can help physicians focus their attention on challenging regions, improving efficiency in clinical review. Similarly, longitudinal credible intervals enable clinicians to distinguish true volumetric change from model ambiguity, supporting more reliable monitoring of tumor progression.
The voxel-level decomposition into epistemic and aleatoric uncertainty is visually interpretable and qualitatively closely reflects clinical intuition, effectively translating theoretical uncertainty modeling into a form usable for radiological assessment. The epistemic-like component tended to be high in regions of uncertainty due to enhancing tumor abutting other normally enhancing structures, such as the dural venous sinuses. These are regions where it can be difficult for an expert neuroradiologist to differentiate between tumor and normal tissue. The aleatoric-like component tended to be high along the boundaries of the tumor, which is inherently uncertain. Motion blur and lower resolution qualitatively increased both components, most notably for ambiguous or unusual regions of tumors, where high-quality imaging is most essential.
A key finding of this study is that uncertainty quantification can be achieved without a trade-off in performance. Both homogeneous (Dice 0.95) and heterogeneous EDL (Dice 0.93) ensembles maintained high Dice scores comparable to or better than standard state-of-the-art segmentation networks, such as nnU-Net v2 (Dice 0.77), confirming that uncertainty modeling need not come at the expense of segmentation accuracy.
While the homogeneous ensemble achieved slightly higher median Dice, the heterogeneous ensemble provided superior calibration and stronger correlations with segmentation error and surface distance (ASSD), although its correlation with volumetric error was weaker (Table 1). These results highlight the benefit of combining various architectures with distinct inductive biases to better capture complementary aspects of model uncertainty.
Compared to MC Dropout (Lemay et al. 2022), the evidential framework offers substantial computational advantages. MC Dropout requires 20 stochastic forward passes, resulting in significant runtime overhead, whereas each member of the EDL ensemble produces its uncertainty estimate in a single deterministic forward pass, for five passes in total. This efficiency is particularly important for clinical deployment, where rapid inference and integration into existing workflows are essential.
While our current system was designed for deployment within the local workflow and trained exclusively on internal private data, the high segmentation performance on an independent external test set provides compelling evidence of cross-institutional generalization, without any fine-tuning. Nevertheless, future studies incorporating multi-center datasets and multi-rater annotations will be necessary to fully validate model uncertainty against human inter-observer variability.
Limitations of this study should be noted. While we compared the homogeneous and heterogeneous EDL frameworks as described, other combinations of ensemble models could achieve even higher performance. Moreover, it is unclear how to quantitatively compare uncertainty maps across methods and relate those to human uncertainty. Ultimately, future studies incorporating uncertainty quantification techniques like the one described here should measure these methods’ impact on radiological workflows and clinical decision making, which was beyond the scope of the current analyses.
In conclusion, we have presented an evidential deep learning framework that produces well-calibrated credible intervals for tumor volume. We found that leveraging architectural diversity within the ensemble is a powerful strategy for achieving high-quality uncertainty estimates. This work advances the broader goal of developing transparent, safe, and trustworthy AI for medicine, paving the way for uncertainty-aware quantitative monitoring of tumor dynamics in routine clinical care.
Materials and Methods
Study Population
This retrospective study was approved by our institution's IRB, with a waiver for consent due to low risk. All procedures complied with HIPAA. Data were stored on secure, access-controlled servers and de-identified as necessary. The patient population has not been previously reported.
The study involved three distinct patient cohorts:
1. Training Cohort. This cohort, used for model training and validation, consisted of 1,655 3D post-contrast T1-weighted brain MRIs (1,324 for training and 331 for validation) from 788 unique patients from a curated private institutional dataset.
The ground truth segmentations for this cohort were generated through a semi-automated process. An initial model (Isensee et al. 2021) was trained on a subset of the data to generate predictions on the full training set. Cases with outlier or clearly erroneous segmentations were subsequently identified and discarded during a curation step.
2. Independent Test Cohort. This cohort, used for final model evaluation, consisted of 93 3D T1-weighted post-contrast brain MRIs from 55 unique patients [11 male, 44 female; age range: 35–85 years; mean age: 59 years]. Patients included in the test cohort were not included in the training cohort. Exams with meningioma volumes smaller than 0.3 cm³ were excluded, as precise volumetric measurement and credible intervals at this scale are not clinically meaningful. After filtering, 68 MRIs from 43 unique patients (10 male, 33 female; age range: 35–82 years; mean age: 58 years) were retained for analysis.
For this cohort, the ground truth segmentation was reviewed and manually corrected on a voxel-wise basis by a neuroradiologist with 6 years of post-graduate clinical experience, directly guided and supervised by a neuroradiologist with 12 years of post-graduate clinical experience. Challenging cases were discussed in detail to create accurate reference standard segmentations.
3. Independent External Institution Test Cohort. This independent external cohort, used to assess model generalization, consisted of 353 unique patients from Duke University [94 male, 259 female; age range: 19–96 years; mean age: 63.2 years] (LaBella et al. 2024), and is partially publicly available.
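The 0.3 cm³ exclusion criterion applied to the internal test cohort depends on converting a segmentation mask into a physical volume. A minimal sketch of that conversion is given below, assuming the volume is computed as the tumor voxel count multiplied by the voxel size; the function name `volume_cm3`, the exam identifiers, and the voxel counts are hypothetical.

```python
MIN_VOLUME_CM3 = 0.3  # exclusion threshold stated for the internal test cohort


def volume_cm3(n_tumor_voxels, spacing_mm):
    """Tumor volume in cm^3 from a voxel count and (dx, dy, dz) spacing in mm."""
    dx, dy, dz = spacing_mm
    voxel_volume_mm3 = dx * dy * dz
    return n_tumor_voxels * voxel_volume_mm3 / 1000.0  # 1 cm^3 = 1000 mm^3


# Hypothetical exams: (identifier, tumor voxel count, voxel spacing in mm).
exams = [
    ("exam_a", 250, (1.0, 1.0, 1.0)),   # 0.25 cm^3, below the threshold
    ("exam_b", 4000, (0.5, 0.5, 1.0)),  # 1.0 cm^3
    ("exam_c", 12000, (1.0, 1.0, 1.0)), # 12.0 cm^3
]
retained = [name for name, n, spacing in exams
            if volume_cm3(n, spacing) >= MIN_VOLUME_CM3]
```

Because the voxel size enters multiplicatively, the same voxel count can fall on either side of the threshold depending on acquisition resolution.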
Beta Modeling of Class Probabilities
We adopt an Evidential Deep Learning (EDL) framework to model both predictive performance and voxel-wise uncertainty in a principled Bayesian setting. Unlike conventional softmax-based networks that output deterministic class probabilities, EDL models generate parameters of a predictive distribution, enabling uncertainty quantification that propagates to higher-level volumetric measures.
Let $x$ denote the input 3D image volume and $y_i \in \{0, 1\}$ the binary voxel-level label indicating the presence or absence of tumor tissue. In our framework, the network, parameterized by $\theta$, is trained to output parameters of a Beta distribution for each voxel $i$, as $p_i \mid x, \theta \sim \mathrm{Beta}(\alpha_i, \beta_i)$, where $p_i$ is the tumor probability at that voxel. This is a special case of the Dirichlet distribution when $K = 2$ classes.
Specifically, the neural network outputs two non-negative evidence values $e_i^{+}, e_i^{-} \geq 0$, which are passed through a ReLU activation to ensure non-negativity. The parameters of the Beta distribution are then defined as:
$$\alpha_i = e_i^{+} + 1, \qquad \beta_i = e_i^{-} + 1$$
This formulation produces a predictive distribution over $p_i$ that reflects the model’s confidence in voxel-wise classification. The mean of this distribution represents the expected tumor probability given the input and the trained model:
$$\hat{p}_i = \mathbb{E}[p_i \mid x, \theta] = \frac{\alpha_i}{\alpha_i + \beta_i}$$
A binary segmentation mask is obtained by thresholding this probability at a fixed cutoff $\tau$.
Furthermore, the model’s confidence in each voxel prediction is quantified by the variance of the Beta distribution:
$$\mathrm{Var}[p_i \mid x, \theta] = \frac{\alpha_i \beta_i}{(\alpha_i + \beta_i)^2 (\alpha_i + \beta_i + 1)}$$
This captures within-model uncertainty, reflecting both noise in the input data and class overlap, given a fixed trained model $\theta$.
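The per-voxel quantities described in this section can be illustrated with a short sketch. It assumes the common EDL convention in which each Beta parameter equals the ReLU-activated evidence plus one, and it uses a 0.5 cutoff for the binary mask; both choices, along with all function names and numbers, are assumptions made for illustration rather than details confirmed for this study. The mean and variance are the standard closed-form moments of a Beta distribution.

```python
def beta_from_evidence(raw_pos, raw_neg):
    """Map two raw network outputs to Beta parameters (assumed: ReLU, then +1)."""
    alpha = max(raw_pos, 0.0) + 1.0  # evidence for tumor
    beta = max(raw_neg, 0.0) + 1.0   # evidence for background
    return alpha, beta


def beta_mean(alpha, beta):
    """Expected tumor probability under Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of Beta(alpha, beta); shrinks as total evidence grows."""
    s = alpha + beta
    return alpha * beta / (s * s * (s + 1.0))


THRESHOLD = 0.5  # assumed cutoff for the binary mask

# Strong tumor evidence vs. no evidence at all (toy values).
confident = beta_from_evidence(9.0, 0.0)  # Beta(10, 1)
ignorant = beta_from_evidence(-2.0, 0.0)  # negative output clipped, giving Beta(1, 1)
is_tumor = beta_mean(*confident) > THRESHOLD
```

With no evidence the distribution is uniform (mean 0.5, variance 1/12), whereas accumulating evidence for one class moves the mean toward that class and reduces the variance.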
Loss Function and Optimization Objective
To train the model to express reliable uncertainty and accurate segmentation, we use a compound loss function that balances prediction accuracy with uncertainty regularization:
$$\mathcal{L}(t) = \mathcal{L}_{\mathrm{Dice}} + \lambda(t)\, \mathcal{L}_{\mathrm{KL}}$$
The first term, $\mathcal{L}_{\mathrm{Dice}}$, focuses on segmentation accuracy. It is a variant of the standard Dice loss that rewards accurate spatial overlap between the prediction and the ground truth, while also penalizing the model for being uncertain about its correct predictions. The second term, $\mathcal{L}_{\mathrm{KL}}$, is a regularization penalty that specifically discourages the model from making overconfident errors.
The mathematical formulation of these loss components and the annealing schedule $\lambda(t)$ used to balance them during training are detailed in Appendix A for technical completeness.
Model Ensembling and Total Uncertainty Quantification
A single deep learning model, even one trained with EDL, can sometimes be overconfident. To obtain a more robust and reliable uncertainty estimate, we used an ensemble approach. Instead of relying on one model, we trained five independent models and aggregated their predictions. This strategy is analogous to seeking a consensus from multiple experts. It provides a practical and effective way to approximate the true Bayesian predictive distribution, which formally accounts for uncertainty over the model's parameters. The mathematical basis for this approximation is detailed in Appendix B.
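Each model $m \in \{1, \dots, M\}$ yields a Beta distribution $\mathrm{Beta}(\alpha_i^{(m)}, \beta_i^{(m)})$ for each voxel $i$, with mean $\hat{p}_i^{(m)} = \alpha_i^{(m)} / (\alpha_i^{(m)} + \beta_i^{(m)})$ and ensemble mean $\bar{p}_i = \frac{1}{M}\sum_{m} \hat{p}_i^{(m)}$. The total predictive variance per voxel can then be decomposed into two components:

$$\mathrm{Var}_{\text{total}}[p_i] = \frac{1}{M}\sum_{m=1}^{M} \mathrm{Var}\big[p_i^{(m)}\big] + \frac{1}{M}\sum_{m=1}^{M} \big(\hat{p}_i^{(m)} - \bar{p}_i\big)^2.$$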
In this decomposition, the first term, often referred to as aleatoric-like uncertainty, captures the average intrinsic variance within each model’s prediction. This reflects the uncertainty that remains even with a single, perfectly trained model, such as from data noise or ambiguous voxel-level boundaries. It is typically considered irreducible with more data, as it is a property of the data-generating process itself.
The second term, referred to as epistemic-like uncertainty, measures the variance in the mean predictions across the ensemble. This represents the uncertainty arising from model parameter variability due to a finite training set. It manifests as disagreement among ensemble members and is considered reducible with the addition of more training data. To assess the impact of model diversity on the quality of uncertainty estimates, we evaluate two distinct configurations: a homogeneous ensemble composed of models with a single architecture, and a heterogeneous ensemble combining multiple, diverse architectures.
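The decomposition can be sketched for a single voxel as follows. This is a minimal illustration, not the study's implementation: it assumes equal weighting of ensemble members, takes each member's Beta parameters as a plain `(alpha, beta)` pair, and uses a hypothetical function name.

```python
def decompose_uncertainty(member_params):
    """Split total predictive variance for one voxel.

    member_params: list of (alpha, beta) pairs, one per ensemble member.
    Returns (ensemble_mean, aleatoric_like, epistemic_like).
    """
    n = len(member_params)
    means = [a / (a + b) for a, b in member_params]
    variances = [a * b / ((a + b) ** 2 * (a + b + 1.0)) for a, b in member_params]

    ensemble_mean = sum(means) / n
    # Average within-model variance.
    aleatoric_like = sum(variances) / n
    # Variance of the member means around the ensemble mean.
    epistemic_like = sum((mu - ensemble_mean) ** 2 for mu in means) / n
    return ensemble_mean, aleatoric_like, epistemic_like


# Five members that agree: the disagreement term vanishes.
agree = [(9.0, 1.0)] * 5
print(decompose_uncertainty(agree))

# Five members that disagree strongly: the disagreement term dominates.
disagree = [(9.0, 1.0), (1.0, 9.0), (9.0, 1.0), (1.0, 9.0), (5.0, 5.0)]
print(decompose_uncertainty(disagree))
```

In the second case every member is individually fairly confident, so the within-model term stays small while the between-model term is large. That is the pattern this kind of decomposition is intended to expose.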
We use the terms "aleatoric-like" and "epistemic-like" to acknowledge that while this decomposition provides a valuable and intuitive framework for separating different sources of error, the strict philosophical dichotomy between aleatoric and epistemic uncertainty is a subject of ongoing debate in the literature, and they can be highly correlated in practice (Kirchhof, Kasneci, & Kasneci, 2025). For instance, high aleatoric uncertainty at a difficult boundary might lead to greater model disagreement (epistemic uncertainty) as different models converge to slightly different solutions.
Nevertheless, this formulation provides a highly interpretable and clinically useful visual decomposition of uncertainty. It allows us to distinguish between regions of intrinsic ambiguity (e.g., tumor boundaries) and areas where the model is simply "unsure" due to a lack of training data or out-of-distribution input (e.g., dural reflections), which is especially important for downstream tasks such as volumetric quantification.
Bayesian Estimation of Tumor Volume with Credible Intervals
To derive uncertainty-aware tumor volumes, we employ Bayesian Model Averaging (BMA) over the ensemble of voxel-wise Beta distributions predicted by our models. Given the ensemble, we define a mixture model over class probabilities and compute credible intervals through Monte Carlo sampling. For each voxel $i$, we first compute the expected tumor probability for each model:

$$\hat{p}_i^{(m)} = \frac{\alpha_i^{(m)}}{\alpha_i^{(m)} + \beta_i^{(m)}}.$$

The ensemble mean prediction is then obtained by averaging over the $M$ models:

$$\bar{p}_i = \frac{1}{M}\sum_{m=1}^{M} \hat{p}_i^{(m)}.$$

To capture predictive uncertainty, we sample voxel-wise class probabilities from the mixture of Beta distributions defined by the ensemble. Concretely, for each voxel $i$, we draw $S$ samples:

$$p_i^{(s)} \sim \mathrm{Beta}\big(\alpha_i^{(m_s)}, \beta_i^{(m_s)}\big), \qquad s = 1, \dots, S,$$

where $m_s$ is randomly selected from $\{1, \dots, M\}$. The number of samples $S$ was chosen to provide a stable estimate of the posterior distribution without excessive computational overhead. From the $S$ samples, we estimate the $(1-\delta)$-credible interval by computing the quantiles $q_i^{\delta/2}$ and $q_i^{1-\delta/2}$ (with $\delta = 0.05$ for the 95% intervals evaluated in this study).
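Based on these quantiles, the lower bound $q_i^{\delta/2}$ and the upper bound $q_i^{1-\delta/2}$ of the voxel-wise credible interval, we define three segmentation masks by comparison with the segmentation threshold $\tau$:
Baseline segmentation $M_{\text{base}}$: voxel $i$ is included if the ensemble mean satisfies $\bar{p}_i \geq \tau$.
High confidence segmentation $M_{\text{high}}$: voxel $i$ is included if $q_i^{\delta/2} \geq \tau$, indicating high confidence in tumor presence.
Inclusive segmentation $M_{\text{incl}}$: voxel $i$ is included if $q_i^{1-\delta/2} \geq \tau$, indicating even marginal suspicion of tumor.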
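These three masks allow us to define a credible interval for the total tumor volume:

$$\big[\,V_{\text{high}},\; V_{\text{incl}}\,\big], \quad \text{with point estimate } V_{\text{base}},$$

where each volume $V_k$, $k \in \{\text{base}, \text{high}, \text{incl}\}$, is computed by summing over the binary segmentation mask $M_k$ and multiplying by the voxel spacing for each dimension, $\Delta x$, $\Delta y$ and $\Delta z$:

$$V_k = \Delta x \, \Delta y \, \Delta z \sum_i M_k(i).$$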
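The sampling procedure and the three volumes can be sketched end to end on a toy example. This is a hedged illustration under stated assumptions: ensemble members are drawn with equal probability, the threshold `tau=0.5`, the sample count `n_samples=200`, and the index-based quantile rule are all illustrative choices rather than values taken from the study, and voxels are looped over in pure Python instead of being vectorized.

```python
import random


def empirical_quantile(sorted_values, q):
    """Quantile by rounding to the nearest index of a sorted list."""
    idx = int(round(q * (len(sorted_values) - 1)))
    return sorted_values[idx]


def volume_interval(voxel_params, spacing, n_samples=200, delta=0.05,
                    tau=0.5, seed=0):
    """Return baseline, high-confidence and inclusive volumes.

    voxel_params: one entry per voxel, each a list of (alpha, beta) pairs
                  with one pair per ensemble member.
    spacing:      (dx, dy, dz) voxel spacing.
    """
    rng = random.Random(seed)
    voxel_volume = spacing[0] * spacing[1] * spacing[2]
    counts = {"base": 0, "high": 0, "incl": 0}

    for members in voxel_params:
        # Ensemble mean of the per-model expected tumor probabilities.
        mean = sum(a / (a + b) for a, b in members) / len(members)

        # Mixture sampling: pick a member at random, then sample its Beta.
        samples = sorted(
            rng.betavariate(*rng.choice(members)) for _ in range(n_samples)
        )
        lower = empirical_quantile(samples, delta / 2.0)
        upper = empirical_quantile(samples, 1.0 - delta / 2.0)

        counts["base"] += int(mean >= tau)
        counts["high"] += int(lower >= tau)
        counts["incl"] += int(upper >= tau)

    return {name: n * voxel_volume for name, n in counts.items()}


# Toy volume with four voxels: clear tumor, two ambiguous, clear background.
toy = [
    [(50.0, 1.0)] * 5,
    [(3.0, 2.0)] * 5,
    [(2.0, 3.0)] * 5,
    [(1.0, 50.0)] * 5,
]
print(volume_interval(toy, spacing=(1.0, 1.0, 2.0)))
```

By construction the high-confidence volume never exceeds the inclusive volume, since the lower quantile of a voxel cannot be larger than its upper quantile.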
Model Training
To construct a diverse ensemble, we implemented a strategy that induces variation along several complementary axes, with the aim of improving the robustness and calibration of the uncertainty estimates. Our approach systematically introduces diversity (Wang & Ji, 2023) at the data, model, and objective levels:
Data-level Diversity: We used a 5-fold cross-validation structure combined with bootstrap resampling of the training data for each fold. This ensures that each model is trained on a unique, though overlapping, subset of the patient data, exposing it to different sampling distributions.
Model-level Diversity: Each model was trained from a distinct random initialization. Furthermore, each network was subjected to its own unique sequence of data augmentations during training, including random flips, rotations, and intensity shifts, forcing them to learn along different optimization trajectories.
Objective-level Diversity: We varied the hyperparameters of the loss function for each model, specifically the weight and annealing schedule of the KL-divergence penalty. This encourages individual models to learn different trade-offs between segmentation accuracy and uncertainty penalization. We also modulated the softness of the ground-truth boundary on a per-model basis (by dilating the tumor mask by 0–4 voxels when computing the regularization loss), prompting different models to learn varying degrees of caution at ambiguous tumor margins.
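The three levels of diversity can be pictured as a per-member configuration. The sketch below is purely illustrative: resampling at the patient level, the candidate KL weights, and all field names are assumptions introduced here, and only the 5-fold structure, the bootstrap resampling, the distinct initializations, and the 0 to 4 voxel dilation range come from the description above.

```python
import random


def make_member_configs(patient_ids, n_members=5, seed=0):
    """Build one hypothetical training configuration per ensemble member."""
    rng = random.Random(seed)
    # Simple interleaved split into n_members folds.
    folds = [patient_ids[k::n_members] for k in range(n_members)]
    configs = []
    for k in range(n_members):
        # Data level: hold out fold k, then bootstrap the remaining patients.
        pool = [p for j, fold in enumerate(folds) if j != k for p in fold]
        bootstrap = [rng.choice(pool) for _ in pool]
        configs.append({
            "val_patients": folds[k],
            "train_patients": bootstrap,
            # Model level: distinct seed for initialization and augmentation.
            "init_seed": rng.randrange(2 ** 31),
            # Objective level: hypothetical KL weights, dilation in 0..4 voxels.
            "kl_weight": rng.choice([0.1, 0.5, 1.0]),
            "dilation_voxels": rng.randint(0, 4),
        })
    return configs


configs = make_member_configs([f"patient_{i:03d}" for i in range(20)])
for cfg in configs:
    print(cfg["init_seed"], cfg["kl_weight"], cfg["dilation_voxels"],
          len(set(cfg["train_patients"])))
```

Because bootstrap sampling draws with replacement, each member typically sees fewer unique patients than are in its training pool, which is one source of the overlapping but distinct training subsets.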
Our study compares two main ensemble configurations built from three different deep learning architectures, 3D SegResNet (Myronenko 2019), SWIN UNETR (Pan 2024), and DiNTS (He et al. 2021):
1. A homogeneous ensemble consisting of five 3D SegResNet models.
2. A heterogeneous ensemble consisting of two 3D SegResNets, two SWIN UNETRs, and one DiNTS model.
The specific preprocessing, implementation, and training hyperparameters for each architecture are detailed in Appendix C.
Evaluation and Statistical Analysis
All evaluations were performed on the independent, anonymized test set. The primary outcome for assessing uncertainty was the empirical coverage of the 95% volumetric credible intervals. Secondary outcomes included model robustness, segmentation accuracy (Dice Similarity Coefficient) (Dice 1945), and the clinical relevance of the uncertainty estimates. To validate this clinical relevance, we assessed the model's ability to predict its own failure modes, based on the principle that an ideal uncertainty estimate should be high when the model’s prediction is inaccurate. Accordingly, we investigated the relationship between the model's predicted volumetric uncertainty and three distinct error metrics: Segmentation Error, Geometric Error (ASSD), and Relative Volumetric Error. Based on a preliminary analysis suggesting a power-law relationship, this was assessed using Pearson correlation on log-transformed variables, except for the Relative Volumetric Error, which was analyzed on a linear scale, with a P value of less than .05 considered statistically significant. The detailed mathematical definitions for the error metrics are provided in Appendix D.
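The correlation analysis on log-transformed variables can be sketched as follows. This is a minimal illustration with synthetic numbers, not study data; the helper names are hypothetical, and the P value is deliberately left out because in practice it would come from a statistics package (for example `scipy.stats.pearsonr`).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log_log_correlation(uncertainty, error):
    """Pearson correlation after log-transforming both variables."""
    log_u = [math.log(u) for u in uncertainty]
    log_e = [math.log(e) for e in error]
    return pearson_r(log_u, log_e)


# Synthetic example: error follows an exact power law of the uncertainty.
uncertainty = [0.05, 0.1, 0.2, 0.4, 0.8]
error = [2.0 * u ** 1.5 for u in uncertainty]

print(log_log_correlation(uncertainty, error))  # linear in log-log space
print(pearson_r(uncertainty, error))            # weaker on the raw scale
```

A power-law relationship becomes a straight line after taking logarithms, which is why the log-transformed correlation is the more sensitive measure when such a relationship is suspected.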
Appendix A – Loss function definition
The first term, $\mathcal{L}_{\text{Dice}}$, is a soft Bayesian Dice loss. It operates directly on the expected prediction $\hat{p}_i$ and penalizes high predictive variance $\mathrm{Var}[p_i]$.
This formulation encourages the model to produce high evidence (i.e., low variance) for correct predictions.
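The second term, $\mathcal{L}_{\text{KL}}$, is a regularization loss that penalizes overconfident predictions on misclassified voxels. It computes the Kullback–Leibler divergence between the predicted Beta distribution and a uniform prior:

$$\mathcal{L}_{\text{KL}} = \mathrm{KL}\big(\mathrm{Beta}(\tilde{\boldsymbol{\alpha}}_i)\,\big\|\,\mathrm{Beta}(1, 1)\big),$$

where $\tilde{\boldsymbol{\alpha}}_i = \mathbf{y}_i + (1 - \mathbf{y}_i) \odot \boldsymbol{\alpha}_i$ and $\boldsymbol{\alpha}_i = (\alpha_i, \beta_i)$. Here, $\mathbf{y}_i$ is the one-hot encoded ground truth label, and $\odot$ denotes element-wise multiplication. The use of $\tilde{\boldsymbol{\alpha}}_i$ instead of $\boldsymbol{\alpha}_i$ masks the contribution of the true class to focus the KL penalty on the incorrect one.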
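For reference, the divergence between a Beta distribution and the uniform prior has a closed form. The expression below is the standard result for two generic parameters $a, b > 0$, which stand in here for the masked parameters; $\Gamma$ is the gamma function and $\psi$ the digamma function. It is given as background and is not claimed to be the exact expression used in training.

```latex
\mathrm{KL}\big(\mathrm{Beta}(a, b)\,\big\|\,\mathrm{Beta}(1, 1)\big)
  = \ln\Gamma(a + b) - \ln\Gamma(a) - \ln\Gamma(b)
  + (a - 1)\big[\psi(a) - \psi(a + b)\big]
  + (b - 1)\big[\psi(b) - \psi(a + b)\big]
```

The expression vanishes at $a = b = 1$, so a voxel whose remaining evidence is zero contributes no penalty, while evidence accumulated for the wrong class increases it.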
To balance the two terms during training, we apply an annealing schedule to the weight $\lambda(t)$ of the second term, increasing it gradually over the course of training.
This encourages the model to focus initially on learning discriminative features before penalizing overconfidence.
Appendix B – Bayesian Ensemble Approximation
In a full Bayesian framework, predictive uncertainty is derived from the posterior predictive distribution

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta,$$

where $p(\theta \mid \mathcal{D})$ is the posterior over model parameters. As this integral is intractable, standard Evidential Deep Learning (EDL) methods approximate it by assuming a single point estimate for the optimal parameters, $\theta^{*}$, which is formally equivalent to setting the model posterior to a Dirac delta function, $p(\theta \mid \mathcal{D}) \approx \delta(\theta - \theta^{*})$. This simplification, however, completely neglects the epistemic uncertainty associated with the model parameters themselves. Our work addresses this limitation by employing a deep ensemble of $M$ independently trained models $\{\theta_m\}_{m=1}^{M}$ to serve as a Monte Carlo approximation of the intractable integral, thereby re-integrating this critical source of uncertainty:

$$p(y \mid x, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^{M} p(y \mid x, \theta_m).$$

Consequently, our combined approach provides a more comprehensive estimate of total epistemic uncertainty than standard single-model EDL, capturing both the uncertainty expressed by individual evidential models and the parameter uncertainty expressed through their disagreement.
Appendix C – Training details
The specific training configurations for each architecture were tailored to their computational requirements while promoting model diversity. SegResNet models were trained on volumes preprocessed to the dataset’s median voxel spacing using a random crop size of 296×296×128 voxels. SWIN UNETR and DiNTS models were trained on volumes resampled to a 1 mm³ isotropic voxel spacing, with crop sizes of 160×160×160 and 128×128×128 voxels, respectively. These region-of-interest sizes were the largest that each architecture could process without out-of-memory errors. At inference, full-resolution segmentations for all models were reconstructed via a sliding window strategy with Gaussian-weighted blending.
All models were implemented in the MONAI framework and shared a common training protocol. To manage memory demands, we used a per-GPU batch size of 1 with gradient accumulation over 4 steps, Group Normalization, and mixed-precision (FP16) training. Each model was trained for 1000 epochs using the AdamW (Loshchilov and Hutter 2017) optimizer with a base learning rate of 2×10⁻⁵ and weight decay of 5×10⁻⁴, governed by a linear warm-up and cosine annealing schedule. The MC Dropout models, trained with dropout rates of 0.2 and 0.5, used the same training setup as the EDL variants, except that the EDL-specific loss functions were replaced by a standard Dice + Focal loss and the ReLU activation was substituted with a Softmax applied to the model’s logits.
Training was distributed across two NVIDIA Tesla V100 GPUs per model. The average epoch duration was approximately 12 minutes for SegResNet and SWIN UNETR, and 6 minutes for the computationally lighter DiNTS model. For inference, we retained the checkpoint corresponding to the highest validation Dice score for each model.
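The learning-rate schedule described above (linear warm-up followed by cosine annealing) can be sketched as a plain function of the training step. Only the base learning rate of 2×10⁻⁵ is taken from the text; the warm-up length, the minimum learning rate, and the function name are hypothetical values chosen for illustration.

```python
import math


def learning_rate(step, total_steps, base_lr=2e-5, warmup_steps=100, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay towards min_lr.

    warmup_steps and min_lr are illustrative assumptions.
    """
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


total = 1000
for step in (0, 50, 99, 100, 550, 1000):
    print(step, learning_rate(step, total))
```

The warm-up avoids large updates while the optimizer statistics are still unreliable, and the cosine phase lowers the learning rate smoothly instead of in discrete drops.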
Appendix D – Quantitative metrics
The correlation analysis quantified the relationship between the model's predicted volumetric uncertainty and three distinct prediction error metrics. These metrics are defined as follows:
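1. Predicted Volumetric Uncertainty: The relative width of the volumetric credible interval.
2. Segmentation Error: The complement of the Dice Similarity Coefficient, measuring volumetric overlap error. Let P and G be the predicted and ground-truth voxel sets, respectively.

$$\text{Segmentation Error} = 1 - \frac{2\,|P \cap G|}{|P| + |G|}$$

3. Geometric Error (ASSD): The Average Symmetric Surface Distance between the predicted (P) and ground-truth (G) segmentations. Let S(⋅) be the set of surface voxels and $d(a, S) = \min_{b \in S} \lVert a - b \rVert$ the distance from a voxel to a surface.

$$\mathrm{ASSD}(P, G) = \frac{\sum_{p \in S(P)} d\big(p, S(G)\big) + \sum_{g \in S(G)} d\big(g, S(P)\big)}{|S(P)| + |S(G)|}$$

4. Relative Volumetric Error: The absolute difference in predicted volume ($V_{\text{pred}}$) and ground-truth volume ($V_{\text{GT}}$), normalized by the ground-truth volume.

$$\text{Relative Volumetric Error} = \frac{|V_{\text{pred}} - V_{\text{GT}}|}{V_{\text{GT}}}$$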
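Three of these metrics are simple enough to sketch directly. The example below represents segmentations as sets of voxel coordinates. The normalization of the interval width by a reference volume is an assumption, since the text only states that the width is relative, and the function names are hypothetical. ASSD is omitted because it additionally requires surface extraction and a distance computation.

```python
def segmentation_error(pred, truth):
    """Complement of the Dice coefficient for two sets of voxel coordinates."""
    if not pred and not truth:
        return 0.0  # both empty: treated here as perfect agreement
    dice = 2.0 * len(pred & truth) / (len(pred) + len(truth))
    return 1.0 - dice


def relative_volume_error(v_pred, v_true):
    """Absolute volume difference normalized by the ground-truth volume."""
    return abs(v_pred - v_true) / v_true


def relative_interval_width(v_low, v_high, v_ref):
    """Credible interval width divided by a reference volume (assumed)."""
    return (v_high - v_low) / v_ref


pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
truth = {(0, 0, 1), (0, 1, 0), (1, 1, 1)}

print(segmentation_error(pred, truth))
print(relative_volume_error(110.0, 100.0))
print(relative_interval_width(90.0, 120.0, 100.0))
```

In the toy case two of three voxels overlap, giving a Dice coefficient of 2/3 and therefore a segmentation error of 1/3.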
Appendix E – Quantitative evaluation of the homogeneous EDL ensemble
Fig. S4
Relationship between uncertainty width and radiologist-assessed difficulty for the heterogeneous EDL ensemble. Left: normalized credible interval width versus difficulty ratings. Right: Dice score versus difficulty ratings. Statistical interpretation remains limited by the class imbalance and small number of cases in the “Hard” difficulty group.
Data Availability
The private internal dataset cannot be publicly shared due to UCSF institutional policies and HIPAA privacy laws; however, de-identified portions may be made available by the corresponding author upon reasonable request. Regarding the external cohort (Duke), a subset of the data is publicly available as part of the Brain Tumor Segmentation (BraTS) challenge dataset.
References
Baldi, I., J. Engelhardt, C. Bonnet, L. Bauchet, E. Berteaud, A. Grüber, and H. Loiseau. 2018. “Epidemiology of Meningiomas.” Neuro-Chirurgie 64 (1): 5–14.
Dice, Lee R. 1945. “Measures of the Amount of Ecologic Association between Species.” Ecology 26 (3): 297–302.
Wang, Hanjing, and Qiang Ji. 2023. “Diversity-Enhanced Probabilistic Ensemble for Uncertainty Estimation.”
Ellingson, Benjamin, Grace Kim, Matt Brown, Jihey Lee, Noriko Salamon, Lori Steelman, Islam Hassan, et al. 2021. “NIMG-33. Volumetric Tumor Measurements Are Superior to 2D Bidirectional Measurements in the Evaluation of IDH Inhibition in Diffuse Gliomas: Evidence from a Prospective Phase I Trial of Ivosidenib.” Neuro-Oncology 23 (Supplement_6): vi136–vi136.
Faghani, Shahriar, Mana Moassefi, Pouria Rouzrokh, Bardia Khosravi, Francis I. Baffour, Michael D. Ringler, and Bradley J. Erickson. 2023. “Quantifying Uncertainty in Deep Learning of Radiologic Images.” Radiology 308 (2): e222217.
Gal, Yarin, and Zoubin Ghahramani. 2015. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” ArXiv [Stat.ML]. arXiv. http://arxiv.org/abs/1506.02142.
Goldbrunner, Roland, Pantelis Stavrinou, Michael D. Jenkinson, Felix Sahm, Christian Mawrin, Damien C. Weber, Matthias Preusser, et al. 2021. “EANO Guideline on the Diagnosis and Management of Meningiomas.” Neuro-Oncology 23 (11): 1821–34.
He, Yufan, Dong Yang, Holger Roth, Can Zhao, and Daguang Xu. 2021. “DiNTS: Differentiable Neural Network Topology Search for 3D Medical Image Segmentation.” In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. https://doi.org/10.1109/cvpr46437.2021.00578.
Huang, Raymond Y., Wenya Linda Bi, Michael Weller, Thomas Kaley, Jaishri Blakeley, Ian Dunn, Evanthia Galanis, et al. 2019. “Proposed Response Assessment and Endpoints for Meningioma Clinical Trials: Report from the Response Assessment in Neuro-Oncology Working Group.” Neuro-Oncology 21 (1): 26–36.
Isensee, Fabian, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. 2021. “NnU-Net: A Self-Configuring Method for Deep Learning-Based Biomedical Image Segmentation.” Nature Methods 18 (2): 203–11.
Kirchhof, M., G. Kasneci, and E. Kasneci. 2025. “Re-Examining the Foundational Assumption of Aleatoric and Epistemic Uncertainty Separation.” ICLR Blogpost. https://iclr-blogposts.github.io/2025/blog/reexamining-the-aleatoric-and-epistemic-uncertainty-dichotomy/.
Kohl, Simon A. A., Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R. Ledsam, Klaus H. Maier-Hein, S. M. Ali Eslami, Danilo Jimenez Rezende, and Olaf Ronneberger. 2018. “A Probabilistic U-Net for Segmentation of Ambiguous Images.” ArXiv [Cs.CV]. arXiv. http://arxiv.org/abs/1806.05034.
LaBella, Dominic, Omaditya Khanna, Shan McBurney-Lin, Ryan Mclean, Pierre Nedelec, Arif S. Rashid, Nourel Hoda Tahon, et al. 2024. “A Multi-Institutional Meningioma MRI Dataset for Automated Multi-Sequence Image Segmentation.” Scientific Data 11 (1): 496.
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2016. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” ArXiv [Stat.ML]. arXiv. http://arxiv.org/abs/1612.01474.
Lambert, Benjamin, Florence Forbes, Alan Tucholka, Senan Doyle, Harmonie Dehaene, and Michel Dojat. 2022. “Trustworthy Clinical AI Solutions: A Unified Review of Uncertainty Quantification in Deep Learning Models for Medical Image Analysis.” ArXiv [Eess.IV]. arXiv. http://arxiv.org/abs/2210.03736.
Lemay, Andreanne, Katharina Hoebel, Christopher P. Bridge, Brian Befano, Silvia De Sanjosé, Didem Egemen, Ana Cecilia Rodriguez, Mark Schiffman, John Peter Campbell, and Jayashree Kalpathy-Cramer. 2022. “Improving the Repeatability of Deep Learning Models with Monte Carlo Dropout.” Npj Digital Medicine 5 (1): 174.
Li, Hao, Yang Nan, Javier Del Ser, and Guang Yang. 2023. “Region-Based Evidential Deep Learning to Quantify Uncertainty and Improve Robustness of Brain Tumor Segmentation.” Neural Computing & Applications 35 (30): 22071–85.
Lin, Dong-Dong, Jia-Liang Lin, Xiang-Yang Deng, Wei Li, Dan-Dong Li, Bo Yin, Jian Lin, Nu Zhang, and Han-Song Sheng. 2019. “Trends in Intracranial Meningioma Incidence in the United States, 2004–2015.” Cancer Medicine 8 (14): 6458–67.
Loshchilov, Ilya, and Frank Hutter. 2017. “Decoupled Weight Decay Regularization.” ArXiv [Cs.LG]. arXiv. http://arxiv.org/abs/1711.05101.
Myronenko, Andriy. 2019. “3D MRI Brain Tumor Segmentation Using Autoencoder Regularization.” In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, 311–20. Lecture Notes in Computer Science. Cham: Springer International Publishing.
Pan, Jiachen. 2024. “Swin UNet: A Memory-Efficient and Accurate Deep Learning Model for Medical Image Segmentation.” In Third International Conference on Machine Vision, Automatic Identification, and Detection (MVAID 2024), edited by Renchao Jin, 74. SPIE.
Price, Mackenzie, Corey Neff, Nitin Nagarajan, Carol Kruchko, Kristin A. Waite, Gino Cioffi, Brittany B. Cordeiro, et al. 2024. “CBTRUS Statistical Report: American Brain Tumor Association & NCI Neuro-Oncology Branch Adolescent and Young Adult Primary Brain and Other Central Nervous System Tumors Diagnosed in the United States in 2016–2020.” Neuro-Oncology 26 (Supplement_3): iii1–53.
Ramakrishnan, D., Sarah C. Brüningk, Fatima Memon, Sandra Abi Fadel, Nazanin Maleki, R. Bahar, A. Avesta, et al. n.d. Comparison of Volumetric and 2D-Based Response Methods in the PNOC-001 Pediatric Low-Grade Glioma Clinical Trial M. von Reppert.
Sensoy, Murat, Lance Kaplan, and Melih Kandemir. 2018. “Evidential Deep Learning to Quantify Classification Uncertainty.” ArXiv [Cs.LG]. arXiv. http://arxiv.org/abs/1806.01768.
Shen, Maohao, J. Jon Ryu, Soumya Ghosh, Yuheng Bu, Prasanna Sattigeri, Subhro Das, and Gregory W. Wornell. 2024. “Are Uncertainty Quantification Capabilities of Evidential Deep Learning a Mirage?” ArXiv [Cs.LG]. arXiv. http://arxiv.org/abs/2402.06160.
Wen, Patrick Y., Martin van den Bent, Gilbert Youssef, Timothy F. Cloughesy, Benjamin M. Ellingson, Michael Weller, Evanthia Galanis, et al. 2023. “RANO 2.0: Update to the Response Assessment in Neuro-Oncology Criteria for High- and Low-Grade Gliomas in Adults.” Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology 41 (33): 5187–99.
Author Contribution
Y.G. contributed to the methodology, pipeline development, and data analysis. A.R. conceived the study. P.N., E.B., J.W., and M.M. performed data acquisition, preprocessing, and validation. E.C. contributed material support (Duke imaging resources). A.R. and P.N. supervised the project execution. Y.G., P.N. and A.R. primarily wrote the manuscript. L.S. and all authors contributed to manuscript review and revision.
Funding
This work was supported by the Tianqiao and Chrissy Chen Institute Chen Scholars Program and the Foundation of the American Society of Neuroradiology.
Competing Interests
The authors declare no competing interests.
Ethics Statement
This retrospective study was approved by the Institutional Review Board of the University of California, San Francisco (IRB #18-24775).
All procedures were conducted in accordance with institutional guidelines and HIPAA regulations.