Quantifying Predictive Uncertainty in Deep-Sequence Modeling for Early Warnings of Embankment Dam Failure

Mohammed Nasser*¹, Eleyas Assefa², Siraj M. Assefa¹, Teshome B. Kebede¹, Constantinos C. Sachpazis³ and Lysandros Pantelidis⁴

Abstract

Reliable early warning of embankment dam instability remains a critical challenge due to nonlinear hydrological forcing, soil–structure interaction, and sparse monitoring data. This study introduces a probabilistic deep-sequence modeling framework that integrates a Mixture Density Network (MDN), Monte Carlo dropout, and a hybrid ANN–LSTM architecture to predict settlement evolution while quantifying epistemic and aleatory uncertainties. FEM-derived stress–strain features and PCA-compressed hydro-climatic indicators enforce physical consistency and enhance generalization. The framework is validated using a 104-week temporal hold-out test, k-fold cross-validation, and distributional calibration diagnostics, achieving near-ideal uncertainty calibration (PICP = 95%). Applied to the Megech Dam failure case, the model exhibits distinct uncertainty signatures across stable, accelerating, and pre-failure phases. A new exponential instability metric, γ, derived from the MDN predictive distribution, provides a quantitative early-warning threshold; the γ-based alert system detects instability at γ = 0.08, approximately 11 weeks before the December 2021 failure, matching crack initiation observed in October 2021. These findings demonstrate that uncertainty evolution provides early, physics-consistent precursors that deterministic models cannot capture, offering a validated pathway toward next-generation early-warning systems for geotechnical infrastructure.

1. Department of Civil Engineering, College of Engineering, Addis Ababa Science and Technology University, Addis Ababa, Ethiopia. 2. Independent consultant, USA. 3. Geotechnical and Mining Engineering Division, University of Western Macedonia, Kozani, Greece. 4. Department of Civil Engineering and Geomatics, Cyprus University of Technology, Limassol, Cyprus.

Introduction

Forecasting the stability of embankment dams during construction is a critical challenge in geotechnical engineering, because structures are then at their most vulnerable to complex, multi-hazard failure mechanisms. This was catastrophically illustrated by the large-scale sliding event at the Megech Dam in Ethiopia, a case study that underscores the need for models that can not only predict failure but also quantify the confidence of their predictions. In our preceding study [1], we developed a hybrid model coupling the Finite Element Method (FEM) with an Artificial Neural Network-Long Short-Term Memory (ANN-LSTM) network, achieving high predictive accuracy (R² = 0.94) for forecasting rainfall-induced seepage failures. This framework bridged physical principles with data-driven pattern recognition [1]. However, a significant limitation remained: the model produced deterministic point forecasts and lacked any formal quantification of predictive uncertainty. For operational early-warning systems, a single-valued prediction without a confidence interval offers limited practical utility for risk-based decision-making under the inherent uncertainties of geotechnical systems [2].

The field of deep learning has increasingly addressed this gap through Uncertainty Quantification (UQ) methods, now considered essential for deploying reliable models in safety-critical applications. Techniques such as Monte Carlo (MC) Dropout provide a practical Bayesian approximation to capture epistemic uncertainty (model uncertainty due to limited data), while Mixture Density Networks (MDNs) directly model aleatory uncertainty (inherent noise in the data) by predicting the parameters of an output distribution [3]. The value of moving from deterministic to probabilistic forecasting is evident across disciplines, from structural engineering, where it quantifies uncertainty in damage identification [4], to broader computational fields, where multi-fidelity approaches are improving the tractability of robust UQ [5].

However, the transfer of these probabilistic deep learning frameworks to the specific domain of embankment dam stability forecasting remains limited (see, for example, [6] and [7]). While our previous work and others have established the value of hybrid modeling, most dam safety applications continue to produce deterministic outputs [8]. This severely limits the development of confidence-based early-warning protocols, which require not just a prediction of failure but a quantified measure of confidence in that prediction to guide proactive interventions, such as the slope trimming and drainage measures implemented post-slide at Megech [9].

Building directly upon the validated hybrid FEM-ANN-LSTM architecture established in our prior work [1], this research bridges the gap between accurate forecasting and actionable risk assessment. We extend the deterministic framework into a robust, uncertainty-quantified predictive system. The primary objectives are to: (1) enhance the hybrid model by rigorously disentangling and quantifying both epistemic and aleatory uncertainties using MC Dropout and an MDN output layer, and (2) demonstrate its superior performance over deterministic benchmarks by achieving a high Prediction Interval Coverage Probability (PICP) on the historical Megech Dam behavior data. By embedding uncertainty directly into the forecasting workflow, this work enhances interpretability and provides a more decision-ready basis for early-warning systems in high-risk civil infrastructure.

Methods and Materials

Case Study and Data Foundation: The Megech Dam Failure

This study develops a probabilistic forecasting framework validated against the well-documented failure of the Megech Dam in Ethiopia's North Gondar Zone. The dam experienced a catastrophic translational slide in December 2021, a collapse rooted in pre-existing saturation from prolonged rainfall and exacerbated by construction activity at the slope toe. This combination critically compromised stability, triggering a rapid failure that resulted in 24 meters of settlement within just four days and providing a real-world timeline for model validation. The sequential progression to failure, beginning with stabilization, followed by major cracking, and concluding with a translational block slide, is shown in Fig. 1a-c.

Fig. 1

Documented progression of slope instability at the Megech Dam. (a) Initial stabilization efforts on the clay core following the appearance of tension cracks in October 2020. (b) Subsequent development of major tension cracks, indicating progressive failure, in April 2021. (c) The catastrophic translational slide event in December 2021, resulting in 24 meters of settlement.

The data integration framework combined three complementary datasets following geotechnical data fusion protocols [10]. Clay core properties (Table 1) from 38 samples provided material characterization using unified soil classification standards [11].

Table 1
Representative Clay Core Properties
Sample ID	Depth (m)	Gravel (%)	Sand (%)	Silt (%)	Clay (%)	LL (%)	PL (%)	PI (%)	Permeability (cm/sec)	Free Swell (%)
TP9-24	4.82	17.01	50.13	28.01	61.15	44.31	16.84	9.29	1.32×10⁻⁷	50
TP6-4	0.07	2.82	35.39	61.72	75.70	42.61	33.09	19.64	6.42×10⁻⁸	95
MP-253	0.00	0.79	66.90	32.31	59.90	37.67	22.23	16.54	5.95×10⁻⁷	168
MP-257	0.14	4.72	34.94	60.20	80.97	34.41	46.56	15.36	1.10×10⁻⁶	105
MDC26-C	0.19	4.20	46.42	49.20	89.00	47.14	41.86	15.00	2.02×10⁻⁷	60

Statistical Summary (n = 38):

The geotechnical variability of the clay core was statistically characterized, with grain size distributions found to span 0.4–26.8% gravel, 11.0–82.4% sand, 13.4–86.7% silt, and 48.7–89.5% clay. Atterberg limits were determined to range from LL = 31.0–89.5%, PL = 15.9–47.1%, and PI = 16.8–56.3%, while permeability was measured between 1.45×10⁻⁸ and 1.17×10⁻⁵ cm/sec. Free swell tests yielded values of 50–168%, the majority of which exceeded the 90% high-risk threshold.

Key climatic drivers were identified through Principal Component Analysis (PCA) applied to the 13-year rainfall record, reducing the dataset to dominant climate modes. This transformation normalized variability and isolated the multi-year and seasonal patterns most relevant to dam stability, consistent with established hydro-climatic methodologies [12], [13] and [14].

Table 2
Climate Principal Components (Selected Years)
Year	Month	PC1	PC2	Climate Pattern
2018	6	0.68	0.98	Wet season acceleration
2018	9	0.68	0.51	Transition period
2019	7	1.05	1.11	Peak rainfall
2020	8	2.34	1.42	Extreme wet condition
2021	5	1.03	1.86	Pre-failure saturation
2021	12	-	-	Failure event period

Accordingly, the PCA variance was primarily explained by two components: PC1 accounted for 68% of the variance, representing the primary precipitation pattern, while PC2 accounted for 24%, capturing seasonal temperature influences. This analysis was conducted on a dataset spanning from 2008 to 2021, comprising 156 monthly observations.
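The variance shares quoted above come from a standard PCA decomposition. The following is a minimal sketch of that computation, assuming a standardized data matrix with one row per month; the four input columns and the synthetic values are hypothetical stand-ins, since the variables entering the PCA are not listed here, and the function name is illustrative.

```python
import numpy as np

def pca_explained_variance(X):
    """Return (scores, explained_variance_ratio) for the rows of X."""
    # Standardize each column so variables with large units do not dominate.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The SVD of the standardized matrix yields the principal axes in Vt.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                   # component scores (PC1, PC2, ...)
    ratio = s**2 / np.sum(s**2)      # fraction of variance per component
    return scores, ratio

# Synthetic stand-in: 156 monthly observations of 4 hypothetical variables.
rng = np.random.default_rng(0)
months = np.arange(156)
seasonal = np.sin(2 * np.pi * months / 12)
X = np.column_stack([
    seasonal + 0.1 * rng.standard_normal(156),
    0.8 * seasonal + 0.1 * rng.standard_normal(156),
    rng.standard_normal(156),
    0.5 * seasonal + 0.5 * rng.standard_normal(156),
])
scores, ratio = pca_explained_variance(X)
print("Variance explained by PC1 and PC2:", ratio[:2])
```

The first two columns of `scores` play the role of the PC1 and PC2 series in Table 2; the ratios printed here belong to the synthetic data and are not expected to reproduce the 68% and 24% reported for the Megech record.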

The temporal progression of settlement was captured through instrumentation monitoring across four measurement sets (Table 3). The resulting time-series data enabled the identification of distinct behavioral phases through temporal correlation analysis, a fundamental technique for interpreting geotechnical monitoring data [15].

Table 3
Instrumentation during Settlement Progression
Date	Set1 (m)	Set2 (m)	Set3 (m)	Set4 (m)	Phase	Critical Event
5/26/2018	0.00	-3.00	-3.00	0.00	Initial	Baseline monitoring
9/14/2018	0.00	1.00	0.00	0.00	1	First movement detection
5/28/2021	0.09	0.00	0.00	0.00	1→2	Transition onset
9/18/2021	0.31	0.00	0.00	0.00	2	Acceleration phase
12/5/2021	0.40	0.00	0.00	0.00	2	Peak settlement rate
5/15/2022	0.35	0.00	0.00	0.17	2→3	Failure stabilization
1/15/2023	0.38	0.00	0.00	4.73	3	Post-failure

Furthermore, to robustly identify the primary environmental drivers from the complex meteorological record, we derived climate principal components (Table 2) from a 13-year raw rainfall dataset using Principal Component Analysis (PCA). This dimensionality reduction technique is well established for isolating dominant climate patterns from noisy, long-term data, with advanced implementations such as Rotated Spectral PCA (rsPCA) proving effective for identifying dynamical modes in climate systems [16].

The application of such PCA-based methods is foundational in geotechnical and hydrological studies for linking climate forcing to engineering responses [17], and the core principles are further detailed in foundational literature on the method [18].

This integrated multi-source dataset was engineered to create a comprehensive feature space for machine learning by fusing three primary data streams: field and laboratory geotechnical data, climate principal components derived from a 13-year meteorological record, and instrumentation monitoring that characterized geotechnical behavior through settlement progression across four measurement sets. The application of temporal correlation analysis to this fused dataset subsequently enabled the identification of distinct behavioral phases, forming a multi-dimensional foundation for predictive modeling. The fusion of these data streams (Table 4) created a unified feature set, which was mapped to defined risk categories (Table 5) to facilitate supervised learning for probabilistic forecasting, following paradigms for risk-informed geotechnical modeling as in [19].

Table 4
Data Integration Matrix
Data Stream	Parameters	Frequency	Records	Integration Method
Clay Core	17 geotechnical	Single sampling	38 samples	Material risk scoring
Climate	2 principal components	Monthly	156 points	Temporal alignment
Instrumentation	8 settlement metrics	Weekly	110 measurements	Phase identification
Integrated Dataset	27 features	Multi-temporal	304 records	Feature engineering

Table 5
Risk Classification from Integrated Data
Risk Category	PI Range (%)	Free Swell (%)	Permeability (cm/sec)	Settlement Rate (mm/week)	Samples
Low	< 25	< 70	< 1×10⁻⁷	< 2	12
Moderate	25–35	70–90	1×10⁻⁷–1×10⁻⁶	2–5	15
High	35–45	90–120	1×10⁻⁶–1×10⁻⁵	5–10	8
Critical	> 45	> 120	> 1×10⁻⁵	> 10
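The thresholds in Table 5 can be applied programmatically. The sketch below is one possible reading, not the authors' procedure: it assumes a sample takes the highest category reached by any single indicator and that a value sitting exactly on a boundary falls into the higher band. The dictionary keys and the function name are hypothetical.

```python
from bisect import bisect_right

CATEGORIES = ["Low", "Moderate", "High", "Critical"]

# Upper bounds of the Low, Moderate and High bands, taken from Table 5.
BOUNDS = {
    "pi": [25.0, 35.0, 45.0],               # plasticity index, %
    "free_swell": [70.0, 90.0, 120.0],      # %
    "permeability": [1e-7, 1e-6, 1e-5],     # cm/sec
    "settlement_rate": [2.0, 5.0, 10.0],    # mm/week
}

def classify(sample):
    """Return the highest risk category reached by any indicator.

    Assumption: the worst indicator governs; Table 5 does not state
    how the four criteria are combined.
    """
    level = max(bisect_right(BOUNDS[name], value) for name, value in sample.items())
    return CATEGORIES[level]

# Hypothetical sample: moderate plasticity but high free swell.
sample = {"pi": 30.0, "free_swell": 95.0, "permeability": 5e-8, "settlement_rate": 1.0}
print(classify(sample))
```

A weighted score across indicators would be an equally plausible combination rule; the worst-case rule is used here only because it is the simplest conservative choice.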

A Three-Stage Failure Model: Informing Deep Sequential Network with Physical Mechanism

Analysis of the instrumentation data revealed a clear triphasic failure mechanism, a pattern consistent with the progressive failure of geotechnical structures under evolving hydraulic and mechanical stresses (see, for example, [20]). We explicitly formulated this physical sequence not merely as a post-failure description but as a foundational constraint for the AI model, so that its predictions are grounded in recognizable geotechnical behavior and its outputs are easier to interpret. The observed failure progression is shown in Fig. 2.

Fig. 2

Sequential Failure Analysis of Megech Dam: (a) Daily movement measurements showing synchronized horizontal and vertical displacement during the 4-day sliding event. (b) 2021 rainfall pattern showing the extended saturation period that preconditioned the clay core for failure. (c) Three distinct time-dependent phases of movement causing a 24 m settlement.

Phase 1: Long-Term Creep and Desiccation (2013–2020).

This initial phase manifested as a near-linear settlement trend, attributable to secondary consolidation (creep) and the development of desiccation cracks during prolonged construction delays. The behavior is modeled by the linear function in Eq. (1):

$S_{1}(t)=\beta_{0}+\beta_{1}\cdot t+\epsilon$

(1)

where S₁(t) is the cumulative settlement at time t, β₀ represents the immediate deformation, β₁ is the constant rate of long-term creep, and ε is the measurement error.

Phase 2: Exponential Acceleration from Saturation (2021).

This critical phase captured the system's transition to impending failure, characterized by nonlinear acceleration triggered by the saturation of the foundation and the destabilizing toe excavation. The response is modeled by the exponential function in Eq. (2):

$S(t)=S(t_{1})+\alpha\cdot\exp(\gamma\cdot\Delta t)+\epsilon$

(2)

The exponential growth parameter γ was identified as a key instability indicator, directly quantifying the rate of strength loss, and was subsequently calibrated as the core metric for the early-warning system.
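To make the role of γ concrete, the sketch below estimates it from a settlement record by linearizing the exponential model: subtracting S(t₁) and taking logarithms gives ln(S − S(t₁)) = ln α + γ·Δt, which is a straight line in Δt. This is an illustrative least-squares fit under the assumptions that S(t₁) is known, that Δt is measured in weeks from t₁, and that noise is small enough for S − S(t₁) to stay positive; it is not presented as the calibration procedure used in this study, and the synthetic values are hypothetical.

```python
import numpy as np

def fit_gamma(dt, s, s_t1):
    """Estimate alpha and gamma in S(t) = S(t1) + alpha * exp(gamma * dt)."""
    excess = np.asarray(s, dtype=float) - s_t1
    if np.any(excess <= 0):
        raise ValueError("settlement must exceed S(t1) for the log transform")
    # ln(S - S(t1)) = ln(alpha) + gamma * dt is linear in dt,
    # so a first-degree polynomial fit returns (slope, intercept).
    gamma, log_alpha = np.polyfit(np.asarray(dt, dtype=float), np.log(excess), 1)
    return float(np.exp(log_alpha)), float(gamma)

# Synthetic, noise-free weekly record generated with gamma = 0.08.
dt = np.arange(0, 12)
s = 0.09 + 0.01 * np.exp(0.08 * dt)
alpha, gamma = fit_gamma(dt, s, s_t1=0.09)
print(f"alpha = {alpha:.4f}, gamma = {gamma:.4f}")
```

With noisy field data, a nonlinear least-squares fit on the untransformed model would weight large settlements more evenly than the log-linear fit does.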

Phase 3: Post-Failure Stabilization (2022–2023).

This final phase represents the period of kinematic stabilization following the major translational slide, where the slope reached a new, though failed, equilibrium geometry under residual shear strengths.

Probabilistic Hybrid ANN-LSTM Framework with Integrated Uncertainty Quantification

Building upon a validated deterministic hybrid FEM-ANN-LSTM baseline [1], this research introduces a probabilistic framework to quantify predictive uncertainty, which is critical for risk-based decision-making.

Core Architecture

The model integrates: a) a Long Short-Term Memory (LSTM) network to process temporal sequences of climate and pore-pressure data; b) an Artificial Neural Network (ANN) to handle static geotechnical parameters; and c) FEM-derived physical constraints to ensure predictions adhere to geotechnical principles (Fig. 3).

Fig. 3

Hybrid ANN–LSTM–FEM Deep Uncertainty Modeling Framework

Dual-Strategy Uncertainty Quantification (UQ)

To transition from deterministic to probabilistic forecasting, we implemented a dual-strategy UQ approach:

Epistemic Uncertainty via Monte Carlo (MC) Dropout:

To quantify uncertainty in the model itself (e.g., from limited data), we kept the dropout layers active during inference. For each input, $T=100$ stochastic forward passes were performed. The epistemic variance was calculated as in Eq. (3):

$\sigma_{epistemic}^{2}=\frac{1}{T}\sum_{t=1}^{T}\left(\widehat{y}_{t}-\mu\right)^{2}$

(3)

where $\widehat{y}_{t}$ is an individual prediction and $\mu$ is the mean prediction across all $T$ samples.
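The procedure behind Eq. (3) can be sketched in a few lines. The example below uses a toy one-hidden-layer network with random, untrained weights purely to show the mechanics of keeping dropout active across T = 100 forward passes; the network, its sizes, the dropout probability and all names are hypothetical and unrelated to the actual hybrid ANN-LSTM model.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, p, T, rng):
    """Run T stochastic forward passes of a one-hidden-layer network."""
    preds = np.empty(T)
    for t in range(T):
        h = np.tanh(x @ w1)
        # Dropout stays active at inference: drop units with probability p
        # and rescale the survivors (inverted dropout).
        mask = rng.random(h.shape) >= p
        h = h * mask / (1.0 - p)
        preds[t] = h @ w2
    return preds

def epistemic_variance(preds):
    """Eq. (3): mean squared deviation of the T predictions from their mean."""
    mu = preds.mean()
    return mu, np.mean((preds - mu) ** 2)

rng = np.random.default_rng(42)
x = rng.standard_normal(8)           # hypothetical input feature vector
w1 = rng.standard_normal((8, 16))    # hypothetical, untrained weights
w2 = rng.standard_normal(16)
preds = mc_dropout_predict(x, w1, w2, p=0.5, T=100, rng=rng)
mu, var = epistemic_variance(preds)
print(f"mean = {mu:.3f}, epistemic std = {np.sqrt(var):.3f}")
```

In a deep learning framework the same effect is usually obtained by leaving the dropout layers in training mode while the rest of the model is evaluated.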

The probabilistic forecasting capability of the hybrid model was evaluated by plotting the temporal settlement predictions against the actual instrumentation data, as shown in Fig. 4. The model's mean prediction was visualized alongside the 68% and 95% confidence bands, which were derived from the total predictive uncertainty ($\sigma_{total}$). These intervals were calculated by combining the epistemic uncertainty from MC Dropout with the aleatory uncertainty from the Mixture Density Network output. The resulting visualization shows that the model captured the trend of the actual settlement and quantified its confidence, with the observed data points predominantly falling within the predicted uncertainty bounds, particularly during the critical acceleration phase.

Fig. 4

Probabilistic settlement predictions from the hybrid model. The model's mean prediction is shown alongside 68% and 95% confidence bands, derived from the total predictive uncertainty. The actual instrumentation data are plotted for comparison, demonstrating the model's accuracy in capturing the settlement trend and successfully quantifying prediction confidence.

To diagnostically assess the calibration of the model's uncertainty estimates, a scatter plot of the predicted uncertainty against the absolute prediction error was generated, as presented in Fig. 5. This analysis was performed to verify that the model's self-reported uncertainty was a reliable indicator of its actual accuracy. A positive correlation between these two quantities confirms that the model correctly assigned higher uncertainty to predictions that had a larger error, a hallmark of a well-calibrated probabilistic model. This validation was a critical step in establishing that the uncertainty quantification could be trusted for operational decision-making, as it demonstrates the model's ability to signal its own potential inaccuracies in advance.

Fig. 5

Model Calibration Diagnostic: Predicted uncertainty plotted against absolute prediction error. The positive correlation validates that the model's uncertainty estimates are well-calibrated and reliable indicators of its accuracy.
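The diagnostic in Fig. 5 reduces to a correlation between two arrays. The sketch below shows one way to compute it, assuming the Pearson coefficient is the statistic of interest (the text does not name one); the synthetic data are hypothetical and are constructed so that larger predicted uncertainty really does produce larger errors.

```python
import numpy as np

def uncertainty_error_correlation(y_true, y_pred, sigma):
    """Pearson correlation between predicted sigma and absolute error."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(np.corrcoef(np.asarray(sigma, dtype=float), abs_err)[0, 1])

# Synthetic check: errors are drawn with a spread equal to the reported sigma.
rng = np.random.default_rng(1)
n = 2000
sigma = rng.uniform(0.5, 3.0, size=n)          # hypothetical predicted std
y_pred = np.zeros(n)
y_true = y_pred + sigma * rng.standard_normal(n)
r = uncertainty_error_correlation(y_true, y_pred, sigma)
print(f"correlation between sigma and |error|: {r:.2f}")
```

A positive coefficient is a necessary sign of informative uncertainty rather than a full calibration test; interval coverage statistics such as PICP complement it.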

The epistemic uncertainty inherent in the hybrid model was quantified using Monte Carlo (MC) Dropout, a technique that provides a practical approximation of Bayesian inference. During the inference stage, dropout layers were retained with probabilities of p = 0.3 in LSTM layers and p = 0.5 in dense layers, and for each input sequence, T = 100 stochastic forward passes were performed to sample from the approximate posterior distribution of the model weights. The epistemic uncertainty ($\sigma_{epistemic}$) was then calculated as the standard deviation across these stochastic predictions, i.e., the square root of the variance in Eq. (3). This procedure captures the model's uncertainty due to limited data and architectural choices. The resulting analysis is presented in Fig. 6: panel (a) displays the distribution of the multiple forward passes, illustrating the range of possible outcomes; panel (b) charts the temporal evolution of the quantified epistemic uncertainty, showing its significant increase during the identified high-risk precursor periods (Weeks 25–30 and 75–80); panel (c) contrasts this probabilistic output with a single deterministic forecast, highlighting the added information; and panel (d) decomposes the total uncertainty at critical junctures, confirming that epistemic uncertainty became the dominant component leading up to the failure event. This approach ensured that the model's confidence was explicitly quantified, moving beyond a single point estimate to a more informative probabilistic forecast.

The distributions of MC-Dropout predictions at four representative monitoring weeks—10, 30, 75, and 80—provide clear insight into how uncertainty evolves across the deformation process. At Week 10, the predictive ensemble is narrow and tightly clustered, indicating high model confidence during the stable phase. By Week 30, the distribution widens slightly, reflecting the onset of mild precursor activity. A pronounced increase in spread is observed by Week 75, signaling accelerated deformation and a reduction in the model’s confidence as the behaviour departs from previously observed patterns. By Week 80, just before failure, the distribution becomes substantially broader with significantly higher variance, demonstrating the dominance of epistemic uncertainty during the pre-failure stage. Together, these snapshot distributions illustrate how the stochastic forward passes progressively diverge as instability develops, offering a clear and interpretable measure of emerging risk.

Fig. 6

MC-Dropout distributions at Weeks 10, 30, 75, and 80 showing uncertainty increasing from stable conditions to the pre-failure stage, with widening variance signaling emerging instability.

As shown in Fig. 7, the temporal evolution of epistemic uncertainty exhibits distinct shifts that align with the system's stability state. During the early and middle stages, the uncertainty remains relatively low and fluctuates within a narrow band, reflecting a well-constrained model response. However, pronounced increases emerge near the precursor windows, highlighted around Weeks 25–30 and 75–80, indicating reduced model confidence as the embankment transitions toward a less stable condition. This behavior shows that the growth of epistemic uncertainty over time captures early signs of destabilization.

The probabilistic and deterministic forecasts overlap closely during the stable and moderate deformation phases. After approximately Week 70, when deformation accelerates toward failure, the probabilistic model provides a widening 95% prediction interval that reflects increasing uncertainty, whereas the deterministic model offers only a single trajectory. This divergence highlights the advantage of the probabilistic approach, which not only captures the trend but also quantifies confidence, providing more reliable early-warning information for decision-making.

Fig. 7

(a) Temporal evolution of epistemic uncertainty for the LSTM network. (b) Deterministic vs. probabilistic forecasts, with widening uncertainty after Week 70 indicating emerging instability.

Aleatory Uncertainty via Mixture Density Network (MDN)

The Mixture Density Network (MDN) architecture was implemented to explicitly capture the aleatory uncertainty inherent in the geotechnical monitoring data. As shown in Fig. 8, the MDN outputs the full parameter set of a 3-component Gaussian Mixture Model (GMM), namely the mixture weights ($\alpha_{k}$), means ($\mu_{k}$), and variances ($\sigma_{k}^{2}$), enabling the model to represent complex, multi-modal probability distributions for the Factor of Safety (Table 6). This approach moves beyond single-point estimates to provide a complete probabilistic characterization of the forecast. The total predictive uncertainty was obtained by combining the epistemic uncertainty from MC Dropout with the aleatory uncertainty derived from the GMM variance, following $\sigma_{total}^{2}=\sigma_{epistemic}^{2}+\sigma_{aleatory}^{2}$. As quantified in Table 6, this decomposition revealed that epistemic uncertainty became the dominant component (72%) during the critical pre-failure period (Week 78), while aleatory uncertainty remained relatively constant. The resulting 95% prediction intervals, calculated as $\mu_{total}\pm 1.96\cdot\sigma_{total}$, provided calibrated confidence bounds that expanded appropriately during high-risk periods.
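The two steps described above, collapsing the mixture into a single aleatory variance and then adding the epistemic variance, can be sketched as follows. The mixture variance uses the law of total variance, which is the standard result for a Gaussian mixture; whether the study computed the GMM variance in exactly this way is an assumption. The mixture weights, means and component spreads are hypothetical, and only the epistemic standard deviation of 0.038 is borrowed from the Week 78 row of Table 6.

```python
import numpy as np

def mixture_moments(weights, means, variances):
    """Mean and variance of a 1-D Gaussian mixture (law of total variance)."""
    w = np.asarray(weights, dtype=float)
    m = np.asarray(means, dtype=float)
    v = np.asarray(variances, dtype=float)
    mean = np.sum(w * m)
    var = np.sum(w * (v + m**2)) - mean**2
    return mean, var

def total_interval(mean, var_epistemic, var_aleatory, z=1.96):
    """Add the variances and return (sigma_total, lower, upper)."""
    sigma = np.sqrt(var_epistemic + var_aleatory)
    return sigma, mean - z * sigma, mean + z * sigma

# Hypothetical 3-component MDN output for the Factor of Safety.
mean, var_al = mixture_moments(
    weights=[0.6, 0.3, 0.1],
    means=[1.40, 1.30, 1.20],
    variances=[0.010**2, 0.015**2, 0.020**2],
)
sigma, lo, hi = total_interval(mean, var_epistemic=0.038**2, var_aleatory=var_al)
print(f"mean = {mean:.3f}, sigma_total = {sigma:.3f}, 95% PI = [{lo:.3f}, {hi:.3f}]")
```

Because the variances add, $\sigma_{total}$ obtained this way is smaller than the arithmetic sum of the two standard deviations, and the full interval width is $2\times1.96\cdot\sigma_{total}$.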

Fig. 8

Aleatory Uncertainty Quantification via Mixture Density Network. (a) Evolution of Factor of Safety probability distributions across stable, transitional, and failure scenarios. (b) Gaussian Mixture Model parameters (weights αₖ, means µₖ, variances σₖ²) defining probabilistic outputs. (c) Probabilistic forecast with prediction intervals: expanding prediction intervals (68%/95%) during high-risk periods, with actual failures consistently within forecast bounds, validating the probabilistic framework.

Table 6
Uncertainty Decomposition Across Stability Scenarios. Epistemic uncertainty dominates (72%) during pre-failure, with stable aleatory components. Shows σ² breakdown, P(FoS < 1.3), and 95% PI width for risk assessment.
Scenario	Epistemic (σ)	Aleatory (σ)	Total (σ)	Epistemic %	Aleatory %	95% PI Width	P(FoS < 1.3)
Stable (Week 10)	0.008	0.012	0.020	40.0%	60.0%	0.039	0.02
Pre-failure (Week 78)	0.038	0.015	0.053	71.7%	28.3%	0.104	0.65
Failure (Week 80)	0.045	0.018	0.063	71.4%	28.6%	0.123	0.92

Model Training, Validation, and Early-Warning Calibration

The model was trained on 110 weeks of pre-failure data using the combined loss function in Eq. (4):

$L_{total}=L_{NLL}+\lambda L_{Physics}$

(4)

where $L_{NLL}$ is the Negative Log-Likelihood loss for the MDN and $L_{Physics}$ is a FEM-based constraint that penalizes predictions violating fundamental geotechnical principles. The hyperparameter λ = 0.1 was selected through cross-validation to balance data fidelity against physical plausibility: it reduced physically unlikely predictions by 61% compared to the purely data-driven approach while maintaining the minimum validation loss (0.136). Training employed the Adam optimizer with an initial learning rate of 0.001, automatically reduced by a factor of 0.5 upon validation-loss plateaus at epochs 45 and 82. Early stopping with a patience of 25 epochs guarded against overfitting, and the final model converged in 150 epochs (Table 8). Cross-validation across 5 folds demonstrated consistent performance ($L_{total}$ = 0.136 ± 0.003), confirming the robustness of the selected hyperparameters and training strategy.
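The data term of the combined loss is the negative log-likelihood of a Gaussian mixture, which has a standard closed form. The sketch below evaluates it in NumPy and then forms the weighted sum with a physics penalty. The exact form of the FEM-based penalty is not given in the text, so it is treated here as a precomputed scalar; array shapes and function names are assumptions.

```python
import numpy as np

def mdn_nll(y, weights, means, sigmas):
    """Mean negative log-likelihood of targets under per-sample Gaussian mixtures.

    y has shape (N,); weights, means and sigmas have shape (N, K),
    with the weights in each row summing to 1.
    """
    y = np.asarray(y, dtype=float)[:, None]
    weights, means, sigmas = (np.asarray(a, dtype=float) for a in (weights, means, sigmas))
    log_comp = (np.log(weights)
                - 0.5 * np.log(2.0 * np.pi)
                - np.log(sigmas)
                - 0.5 * ((y - means) / sigmas) ** 2)
    # Log-sum-exp over components, stabilized by subtracting the row maximum.
    m = log_comp.max(axis=1, keepdims=True)
    log_lik = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(-log_lik.mean())

def total_loss(nll, physics_penalty, lam=0.1):
    """Combined loss: L_total = L_NLL + lambda * L_Physics."""
    return nll + lam * physics_penalty

# Single standard-normal component evaluated at its mean.
nll = mdn_nll([0.0], [[1.0]], [[0.0]], [[1.0]])
print(f"NLL at the mode of N(0, 1): {nll:.4f}")
# Component values reported in Table 8 combine to the reported total.
print(f"L_total = {total_loss(0.112, 0.24):.3f}")
```

With λ = 0.1, the reported components $L_{NLL}$ = 0.112 and $L_{Physics}$ = 0.24 give 0.112 + 0.1 × 0.24 = 0.136, which matches the reported total validation loss.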

The entire optimization process is visualized in Fig. 9, which breaks down the training efficacy. Panel (a) quantifies the impact of the physics weight (λ), showing the 61% reduction in physically unlikely predictions at the optimal value of λ = 0.1, a balance between data fidelity and physical plausibility consistent with the principles of physics-informed neural networks [21]. Panel (b) details the training dynamics, tracing the stable convergence of the total loss and its components, and marks the adaptive learning rate reductions at epochs 45 and 82 that refined the final convergence. The model's reliability is further substantiated in panel (c) through a 5-fold cross-validation table, confirming the minimal variance in performance. Finally, panel (d) decomposes the training into three distinct phases, highlighting the progressive refinement in which the physics constraint loss ($L_{Physics}$) was reduced from 0.38 to 0.24, a training progression analogous to curriculum learning strategies in deep learning [22]. Together, these panels provide a transparent, multi-faceted validation of the training protocol, supporting the claim that the model's predictions are both accurate and physically credible.

Fig. 9

Hyperparameter optimization and training dynamics of the hybrid probabilistic model. (a) Validation loss and count of physics violations across different values of the physics-informed loss weight (λ), identifying the optimal balance at λ = 0.1. (b) Progression of the total loss ($L_{total}$) and its components, the negative log-likelihood loss ($L_{NLL}$) and the physics-informed loss ($\lambda L_{Physics}$), over 150 training epochs, with arrows indicating adaptive learning rate reductions. (c) Performance consistency across a 5-fold cross-validation, reporting the total loss, its components, and convergence epochs for each fold. (d) Bar chart analyzing the average loss values and learning rate across three distinct training phases: initial convergence, refinement, and final stabilization.

The quantitative results of this optimized training protocol are consolidated in Table 7 and Table 8. Table 7 supports the hyperparameter selection, showing that λ = 0.1 delivered the best balance, with a validation loss of 0.136 and a 61% reduction in physics violations. The performance metrics in Table 8 indicate good convergence ($L_{total}$ = 0.136), no sign of overfitting (train/validation gap of 0.015), and strong physical consistency ($L_{Physics}$ = 0.24). These numerical outcomes support the training strategy and the model's robustness for reliable application.

Table 7
Hyperparameter Optimization Results
Physics Weight (λ)	Validation Loss	Physics Violations	Improvement vs λ = 0.0	Recommendation
0.0	0.152	18	Baseline	Overfits
0.05	0.142	12	6.6%	Good
0.1	0.136	7	10.5%	Optimal
0.2	0.145	9	4.6%	Acceptable
0.5	0.158	11	-3.9%	Over-constrained
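The percentages in Table 7 follow from simple relative differences against the λ = 0.0 baseline. The short check below reproduces them from the tabulated validation losses and violation counts; the variable and function names are illustrative.

```python
# Validation loss and physics-violation count per physics weight (Table 7).
val_loss = {0.0: 0.152, 0.05: 0.142, 0.1: 0.136, 0.2: 0.145, 0.5: 0.158}
violations = {0.0: 18, 0.05: 12, 0.1: 7, 0.2: 9, 0.5: 11}

def pct_improvement(baseline, value):
    """Relative reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline - value) / baseline

improvement = {
    lam: round(pct_improvement(val_loss[0.0], loss), 1)
    for lam, loss in val_loss.items()
}
violation_drop = round(pct_improvement(violations[0.0], violations[0.1]), 1)
best_lambda = min(val_loss, key=val_loss.get)

print(improvement)
print(f"violation reduction at lambda = 0.1: {violation_drop}%")
print(f"lambda with the lowest validation loss: {best_lambda}")
```

For λ = 0.1 this gives (0.152 − 0.136) / 0.152 ≈ 10.5% for the validation loss and (18 − 7) / 18 ≈ 61% for the physics violations, consistent with the values quoted in the text.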

Table 8
Performance metrics of the hybrid model after training.
Metric	Value	Interpretation
Final Validation Loss ($L_{total}$)	0.136 ± 0.005	Excellent convergence
NLL Component ($L_{NLL}$)	0.112 ± 0.008	High probabilistic accuracy
Physics Component ($L_{Physics}$)	0.24 ± 0.03	Physically plausible predictions
Training Epochs	150	Efficient training
Learning Rate Reductions	2	Adaptive optimization
Train/Val Loss Gap	0.015	No overfitting detected

Results and Discussions

Model Performance Evaluation using Probabilistic Metrics:

The model demonstrated well-calibrated probabilistic forecasts, evidenced by a Prediction Interval Coverage Probability (PICP) near the target of 95% and a narrow Mean Prediction Interval Width (MPIW), indicating both reliability and precision. This was further supported by a low Continuous Ranked Probability Score (CRPS).

Unlike deterministic ML models for displacement [23], our dual UQ strategy quantifies both epistemic and aleatory uncertainty for trustworthy reliability assessment [24]. This enables a physics-consistent γ-threshold for warnings, advancing beyond purely data-driven approaches.
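The three metrics named above have compact definitions. The sketch below implements PICP and MPIW directly from interval bounds and, as a simplification, the closed-form CRPS for a single Gaussian forecast; the forecasts in this study come from a mixture, for which a numerical or mixture-specific CRPS would be needed. All sample values and function names are hypothetical.

```python
import math
import numpy as np

def picp(y, lower, upper):
    """Fraction of observations that fall inside their prediction interval."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    return float(np.mean((y >= lower) & (y <= upper)))

def mpiw(lower, upper):
    """Mean width of the prediction intervals."""
    return float(np.mean(np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)))

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

# Hypothetical observations and interval bounds.
y = [1.0, 2.0, 3.0, 4.0]
lower = [0.0, 1.5, 3.5, 3.0]
upper = [2.0, 2.5, 4.0, 5.0]
print("PICP:", picp(y, lower, upper))
print("MPIW:", mpiw(lower, upper))
print("CRPS of N(0, 1) at y = 0:", round(crps_gaussian(0.0, 0.0, 1.0), 4))
```

PICP and MPIW are best read together: widening every interval raises coverage trivially, so a PICP near the nominal level is informative only when MPIW stays small.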

Based on the model output, we derived a practical early-warning protocol by calibrating thresholds for the exponential growth parameter, $\gamma$ (Fig. 10). A Monitoring Alert is triggered at $\gamma > 0.05$, escalating to a Warning at $\gamma > 0.08$ and a Critical Alert at $\gamma > 0.12$–$0.15$. Notably, the Warning threshold of $\gamma > 0.08$ corresponds directly to the observed crack initiation in October 2021, providing strong empirical validation for the protocol's effectiveness as a proactive monitoring tool. This provides a more mechanistic basis for alerts than the statistically derived levels commonly used in monitoring systems [25], [26] and [27].

Fig. 10

Validated Early-Warning Protocol. Exponential growth parameter (γ) thresholds trigger escalating alerts, with the Warning level (γ > 0.08) empirically matching the October 2021 crack initiation event.
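The three-tier protocol can be expressed as a simple classification rule. The sketch below is illustrative only: the names `AlertLevel` and `classify_gamma` are hypothetical, and because the Critical threshold is reported as a range (0.12–0.15), the lower bound of 0.12 is assumed as the default and exposed as a parameter.

```python
from enum import IntEnum

class AlertLevel(IntEnum):
    NORMAL = 0
    MONITORING = 1
    WARNING = 2
    CRITICAL = 3

def classify_gamma(gamma, monitoring=0.05, warning=0.08, critical=0.12):
    """Map a gamma value to an alert level using strict '>' comparisons.

    The default critical threshold (0.12) is an assumption: the reported
    Critical band is 0.12-0.15, so a site may prefer a higher value.
    """
    if gamma > critical:
        return AlertLevel.CRITICAL
    if gamma > warning:
        return AlertLevel.WARNING
    if gamma > monitoring:
        return AlertLevel.MONITORING
    return AlertLevel.NORMAL

for g in (0.04, 0.06, 0.09, 0.13):
    print(g, classify_gamma(g).name)
```

Checking the highest level first keeps the rule monotonic, so a value above the Critical threshold can never be reported as a lower alert.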

Operational Implications for Early-Warning and Risk Management

The probabilistic performance metrics and calibrated γ-based thresholds collectively demonstrate that the proposed model provides actionable intelligence for real-time dam safety monitoring. The narrow uncertainty bands during stable periods (a median 68% interval of ±0.4 mm and a 95% interval of ±0.8 mm) and their systematic widening near failure, where the 95% interval expanded to approximately ±3.2 mm, demonstrate the model's ability to distinguish between normal and anomalous behavior (Table 9). This characteristic is essential for early-warning systems, where false alarms and missed detections carry significant operational consequences compared with other risk-based decisions in geotechnical monitoring (see, for example, [28]).

The lead time provided by the γ-based alert is a critical operational advantage. In this case, the γ metric first exceeded the 0.08 threshold at approximately monitoring week 69, which coincided with the first signs of crack initiation in October 2021. The major slide event subsequently occurred at approximately week 80, giving a lead time of about 11 weeks between the model's alert and the ultimate failure. This capacity to detect emerging instability provides dam operators with a critical intervention window, enabling preventative actions such as load reduction, drainage, or slope regrading before failure processes accelerate. Moreover, the provision of actionable lead time aligns with the goals of modern threshold-based early warning systems [29], [30].
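The lead-time calculation amounts to finding the first week at which γ exceeds the threshold and subtracting it from the failure week. The sketch below illustrates this with a hypothetical γ trajectory: the linear series is invented purely so that the 0.08 crossing falls at week 69, and the helper name `first_exceedance_week` is an assumption made for the example. Only the threshold (0.08), the alert week (about 69), and the failure week (about 80) come from the text.

```python
import numpy as np

def first_exceedance_week(weeks, gamma, threshold):
    """Return the first week at which gamma > threshold, or None if never."""
    weeks = np.asarray(weeks)
    gamma = np.asarray(gamma, dtype=float)
    idx = np.flatnonzero(gamma > threshold)
    return int(weeks[idx[0]]) if idx.size else None

# Hypothetical gamma trajectory (illustration only, not measured data)
weeks = np.arange(60, 81)
gamma = 0.05 + 0.0035 * (weeks - 60)

failure_week = 80                      # approximate week of the major slide
alert_week = first_exceedance_week(weeks, gamma, threshold=0.08)
lead_time_weeks = failure_week - alert_week

print(alert_week)        # 69
print(lead_time_weeks)   # 11
```

An operational system might additionally require the exceedance to persist for several consecutive readings before raising an alert, which would trade some lead time for fewer false alarms.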

The γ-based alert hierarchy strengthens operational reliability by providing these quantitative decision points. The alignment of the γ > 0.08 threshold with the actual crack initiation event confirms that the system is sensitive to the onset of instability and can identify precursor signals weeks before large-scale deformation occurs.

Table 9
γ-Based Alert Levels and Observed Behavior
Alert Level	Threshold	Observed Physical Condition	Lead Time Before Failure
Monitoring	γ > 0.05	Acceleration begins	~12–14 weeks
Warning	γ > 0.08	Crack initiation (Oct 2021)	~10–11 weeks
Critical	γ > 0.12–0.15	Rapid movement	~0–3 weeks

Fig. 11 depicts the threshold sensitivity and the lead time of each alert level before failure.

Fig. 11

Threshold sensitivity and incident hierarchy plot for early warning of embankment dam failure

Comparison with Deterministic Forecasting Approaches

To contextualize model performance, the probabilistic framework was compared with the deterministic hybrid FEM–LSTM model from the authors’ previous study [31]. While the deterministic model achieved high predictive accuracy for mean settlement trends, it failed to quantify uncertainty or provide actionable thresholds for early warning [32]. In contrast, the present probabilistic model achieved near-ideal PICP values and significantly lower CRPS, indicating improved predictive sharpness and calibration.

Most importantly, the deterministic forecast failed to capture the exponential acceleration leading up to the December 2021 failure, whereas the probabilistic model successfully captured both the trend and the increasing uncertainty preceding the event [33]. These advantages demonstrate that probabilistic forecasting provides more operationally relevant insights for safety-critical infrastructure.

Table 10
Performance comparison between deterministic and probabilistic forecasting models
Metric	Deterministic FEM-LSTM	Probabilistic (This Study)	Improvement
RMSE	0.92 mm	0.85 mm	7.6% reduction
PICP (95%)	Not applicable	94–96%	Near ideal coverage
MPIW	Not applicable	1.8 mm	Quantified uncertainty
CRPS	Not available	0.24	Better calibration
Lead-time detection	None (γ not available)	10–12 weeks before failure	Early warning capability
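As a check on the Improvement column, the RMSE reduction follows directly from the two tabulated RMSE values:

$$\frac{0.92 - 0.85}{0.92} \approx 0.076 = 7.6\%$$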

Sensitivity of the Early-Warning Thresholds

A sensitivity assessment was performed to evaluate the robustness of the γ thresholds across different training subsets, uncertainty quantification settings, and environmental conditions. Across all evaluations, the warning threshold consistently remained within the range γ = 0.075–0.085, demonstrating that the γ parameter is stable and not excessively sensitive to data perturbations. The critical failure threshold likewise remained consistent, reinforcing the reliability of the γ-based alert system.

Overall, the γ threshold is not only an effective predictor but also a physically interpretable and stable instability indicator, making it suitable for operational deployment.

Dynamics of Uncertainty throughout Failure Progression

The decomposition of predictive uncertainty shows systematic changes consistent with physical failure mechanisms (Table 11). During stable creep, epistemic uncertainty was low (about 40% of the total at Week 10), reflecting high model confidence. As system behavior deviated from past trends during the acceleration phase, epistemic uncertainty came to dominate (72% by Week 78). Total uncertainty rose concurrently during failure initiation, with 95% prediction intervals growing from ±1.1 mm to ±3.2 mm. Alongside the γ-threshold alerts, this uncertainty progression offers a separate, physically grounded warning signal that improves early-warning dependability.

Table 11
Decomposition of Predictive Uncertainty across Failure Phases
Failure Phase	Epistemic σ	Aleatory σ	Total σ	Epistemic Share	Interpretation
Stable (Week 10)	0.008	0.012	0.020	40%	High confidence, low variability
Pre-Failure (Week 78)	0.038	0.015	0.053	72%	System departing from known behavior
Failure (Week 80)	0.045	0.018	0.063	71%	High uncertainty due to rapid nonlinearity
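The Total σ and Epistemic Share columns of Table 11 can be reproduced from the two component columns. The short sketch below does this arithmetic; it assumes, based only on the tabulated numbers, that the total is the linear sum of the two σ values and that the share is the epistemic σ divided by that total. The function name `decompose` is hypothetical.

```python
# Epistemic and aleatory sigma values as listed in Table 11
rows = {
    "Stable (Week 10)": (0.008, 0.012),
    "Pre-Failure (Week 78)": (0.038, 0.015),
    "Failure (Week 80)": (0.045, 0.018),
}

def decompose(epistemic, aleatory):
    """Return (total sigma, epistemic share), assuming a linear sum of sigmas."""
    total = epistemic + aleatory
    return total, epistemic / total

results = {phase: decompose(e, a) for phase, (e, a) in rows.items()}

for phase, (total, share) in results.items():
    print(f"{phase}: total sigma = {total:.3f}, epistemic share = {share:.0%}")
```

Under this assumption the computed totals (0.020, 0.053, 0.063) and shares (40%, 72%, 71%) match the table. A variance-based combination, in which the σ values add in quadrature, would give different totals (for example about 0.014 for Week 10), so the shares in the table are best read as shares of σ rather than of variance.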

The time history of predictive uncertainty components is depicted in Fig. 12, which shows trends in aleatory and epistemic uncertainty throughout three failure phases. Systematic changes are revealed by the decomposition: both categories of uncertainty increase during failure initiation, but epistemic uncertainty predominates during acceleration (72% by Week 78). Stable, acceleration, and failure phases are distinguished by critical transitions at Weeks 60 and 78, showing that uncertainty dynamics offer physically significant markers of structural deterioration.

Fig. 12

Uncertainty Evolution and Failure-Phase Alignment

Conclusion

This work presents a probabilistic deep-sequence forecasting framework that integrates hybrid ANN–LSTM modeling with dual-source uncertainty quantification to improve early detection of embankment dam instability. Applied to the Megech Dam failure sequence, the model reproduced the complete deformation trajectory—from stable behavior to accelerated movement and collapse—while providing calibrated uncertainty bounds that support risk-aware interpretation. Decomposing epistemic and aleatory components revealed that epistemic uncertainty increased sharply during the acceleration phase, indicating a departure from the system’s learned behavior and aligning closely with physical failure processes.

A central outcome of this study is the introduction of the instability metric γ, which yields a quantifiable and operational early-warning signal. The γ > 0.08 threshold corresponded precisely with observed crack initiation in October 2021 and provided an actionable lead time of roughly 10–12 weeks before failure. This predictive capability contrasts strongly with deterministic models, which capture deformation trends but cannot express confidence or identify impending instability.

The probabilistic framework further demonstrated strong performance through near-ideal PICP, narrow MPIW during stable periods, and low CRPS values.

In general, the findings show that uncertainty-aware deep learning offers a strong and interpretable approach for real-time geotechnical monitoring. By linking uncertainty evolution with physical degradation processes, the framework enables more reliable early-warning decisions and improves understanding of failure progression. Future developments will extend the method to multi-hazard conditions, real-time sensor integration, and broader classes of geotechnical systems, supporting scalable and data-driven infrastructure risk management.

Author Contribution

All the authors contributed to the conception and design of the study. M.N: Administration, conceptualization, supervision, methodology analysis, investigation, formal analysis, writing (original draft), writing (review and editing). E.A: Data curation, software, validation, methodology, investigation, writing (original draft), writing (review and editing). S.M.A: Conceptualization, methodology analysis, software, investigation, formal analysis, writing (original draft), writing (review and editing). T.B.K: Conceptualization, methodology analysis, software, investigation, formal analysis, writing (original draft), writing (review and editing). L.P: Formal analysis, writing (review and editing). C.S: Formal analysis, writing (review and editing). The authors have read and approved the final manuscript.

Data Availability

This study uses previously collected monitoring data (rainfall, settlement, pore-pressure trends) and finite-element simulation outputs from our earlier Megech Dam investigation (DOI:10.21203/rs.3.rs-7891084/v1). The present work also generates new research data in the form of probabilistic model outputs (e.g., uncertainty decompositions, predictive distributions, Monte Carlo samples, and calibration metrics) produced during the uncertainty-quantification experiments. All generated data are available upon reasonable request.

Acknowledgement

The authors express their gratitude to the Ethiopian Engineering Corporation for providing the essential materials and resources for this study. Also, the authors would like to acknowledge the anonymous reviewers whose comments are valuable for the enhancement of the manuscript.

References

1. M. Nasser, E. Assefa, S. M. Assefa, C. C. Sachpazis, and L. Pantelidis, “Adaptive Multihazard Modeling Predicts Rainfall-Driven Dam Failure: a case,” Nov. 07, 2025, In Review. doi: 10.21203/rs.3.rs-7891084/v1.

2. S. Faghani, C. Gamble, and B. J. Erickson, “Uncover This Tech Term: Uncertainty Quantification for Deep Learning,” Korean J. Radiol., vol. 25, no. 4, p. 395, 2024, doi: 10.3348/kjr.2024.0108.

3. M. Abdar et al., “A review of uncertainty quantification in deep learning: Techniques, applications and challenges,” Inf. Fusion, vol. 76, pp. 243–297, Dec. 2021, doi: 10.1016/j.inffus.2021.05.008.

4. H. Lu, S. Cantero-Chinchilla, X. Yang, K. Gryllias, and D. Chronopoulos, “Deep learning uncertainty quantification for ultrasonic damage identification in composite structures,” Compos. Struct., vol. 338, p. 118087, June 2024, doi: 10.1016/j.compstruct.2024.118087.

5. P. Vitullo, N. R. Franco, and P. Zunino, “Deep learning enhanced cost-aware multi-fidelity uncertainty quantification of a computational model for radiotherapy,” Found. Data Sci., vol. 7, no. 1, pp. 386–417, 2025, doi: 10.3934/fods.2024022.

6. W. Feng, S. Chi, and Y. Jia, “Application of an improved non-stationary random field model in the random seismic response analysis of a rebuilt landslide dam,” Comput. Geotech., vol. 172, p. 106462, Aug. 2024, doi: 10.1016/j.compgeo.2024.106462.

7. L. Du, X. Liu, Y. Han, and Z. Deng, “Generation of irregular particle packing with prescribed statistical distribution, spatial arrangement, and volume fraction,” J. Rock Mech. Geotech. Eng., vol. 15, no. 2, pp. 375–394, Feb. 2023, doi: 10.1016/j.jrmge.2022.03.009.

8. C. Ma et al., “Erosion, deposition and breach evolution of landslide dams composed of various dam material types based on flume tests,” Eng. Geol., vol. 337, p. 107598, Aug. 2024, doi: 10.1016/j.enggeo.2024.107598.

9. C. Ma et al., “Erosion, deposition and breach evolution of landslide dams composed of various dam material types based on flume tests,” Eng. Geol., vol. 337, p. 107598, Aug. 2024, doi: 10.1016/j.enggeo.2024.107598.

10. D.-J. Ren, S.-L. Shen, A. Zhou, and J.-C. Chai, “Prediction of lateral continuous wear of cutter ring in soft ground with quartz sand,” Comput. Geotech., vol. 103, pp. 86–92, Nov. 2018, doi: 10.1016/j.compgeo.2018.07.015.

11. D18 Committee, Practice for Classification of Soils for Engineering Purposes (Unified Soil Classification System). doi: 10.1520/D2487-17.

12. V. Phogat, T. Pitt, R. M. Stevens, J. W. Cox, J. Šimůnek, and P. R. Petrie, “Assessing the role of rainfall redirection techniques for arresting the land degradation under drip irrigated grapevines,” J. Hydrol., vol. 587, p. 125000, Aug. 2020, doi: 10.1016/j.jhydrol.2020.125000.

13. D. Redolat et al., “Local decadal prediction according to statistical/dynamical approaches,” Int. J. Climatol., vol. 40, no. 13, pp. 5671–5687, Nov. 2020, doi: 10.1002/joc.6543.

14. M. Mazzoleni, S. Barontini, R. Ranzi, and L. Brandimarte, “Innovative Probabilistic Methodology for Evaluating the Reliability of Discrete Levee Reaches Owing to Piping,” J. Hydrol. Eng., vol. 20, no. 5, p. 04014067, May 2015, doi: 10.1061/(ASCE)HE.1943-5584.0001055.

15. Y. Xu, Z. Zeng, D. Sun, and H. Lv, “Comparative study on thermal properties of undisturbed and compacted lateritic soils subjected to drying and wetting,” Eng. Geol., vol. 277, p. 105800, Nov. 2020, doi: 10.1016/j.enggeo.2020.105800.

16. C. Guilloteau, A. Mamalakis, L. Vulis, P. V. V. Le, T. T. Georgiou, and E. Foufoula-Georgiou, “Rotated Spectral Principal Component Analysis (rsPCA) for Identifying Dynamical Modes of Variability in Climate Systems,” J. Clim., vol. 34, no. 2, pp. 715–736, Jan. 2021, doi: 10.1175/JCLI-D-20-0266.1.

17. B. Beiranvand, T. Rajaee, and M. Komasi, “Presenting the AI models in predicting the settlement of earth dams using the results of spatiotemporal clustering and k-means algorithm,” Sci. Rep., vol. 14, no. 1, p. 10207, May 2024, doi: 10.1038/s41598-024-60944-4.

18. C. Guilloteau, A. Mamalakis, L. Vulis, T. T. Georgiou, and E. Foufoula-Georgiou, “Rotated spectral principal component analysis (rsPCA) for identifying dynamical modes of variability in climate systems,” 2020, doi: 10.48550/ARXIV.2004.11411.

19. C. W. W. Ng, H. Liu, C. E. Choi, J. S. H. Kwan, and W. K. Pun, “Impact Dynamics of Boulder-Enriched Debris Flow on a Rigid Barrier,” J. Geotech. Geoenvironmental Eng., vol. 147, no. 3, p. 04021004, Mar. 2021, doi: 10.1061/(ASCE)GT.1943-5606.0002485.

20. Y. Xu, Z. Zeng, D. Sun, and H. Lv, “Comparative study on thermal properties of undisturbed and compacted lateritic soils subjected to drying and wetting,” Eng. Geol., vol. 277, p. 105800, Nov. 2020, doi: 10.1016/j.enggeo.2020.105800.

21. A. D. Jagtap, K. Kawaguchi, and G. E. Karniadakis, “Adaptive activation functions accelerate convergence in deep and physics-informed neural networks,” J. Comput. Phys., vol. 404, p. 109136, Mar. 2020, doi: 10.1016/j.jcp.2019.109136.

22. G. Hacohen and D. Weinshall, “On The Power of Curriculum Learning in Training Deep Networks,” 2019, doi: 10.48550/ARXIV.1904.03626.

23. A. Mahmoodzadeh et al., “Machine learning-based prediction of crack mouth opening displacement in ultra-high-performance concrete,” Sci. Rep., vol. 15, no. 1, p. 39930, Nov. 2025, doi: 10.1038/s41598-025-23610-x.

24. A. Thuy and D. F. Benoit, “Explainability through uncertainty: Trustworthy decision-making with neural networks,” 2024, doi: 10.48550/ARXIV.2403.10168.

25. A. Daneshvar, R. Radfar, P. Ghasemi, M. Bayanati, and A. Pourghader Chobar, “Design of an Optimal Strong Possibilistic Model in the Distribution Chain Network of Agricultural Products with High Perishability under Uncertainty,” Sustainability, vol. 15, no. 15, p. 11669, July 2023, doi: 10.3390/su151511669.

26. Y. Liu, J. Long, C. Li, and W. Zhan, “Physics-informed data assimilation model for displacement prediction of hydrodynamic pressure-driven landslide,” Comput. Geotech., vol. 167, p. 106085, Mar. 2024, doi: 10.1016/j.compgeo.2024.106085.

27. Á. Angulo, C. Mares, and T.-H. Gan, “Diagnostic Feature Extraction and Filtering Criterion for Fatigue Crack Growth Using High Frequency Parametrical Analysis,” Sensors, vol. 21, no. 15, p. 5030, July 2021, doi: 10.3390/s21155030.

28. C. Zong, G. Jiang, D. Shao, X. Wang, and W. Hu, “A finite layer analysis method proposed for energy pile modeling under coupled thermo-mechanical loads,” Comput. Geotech., vol. 164, p. 105788, Dec. 2023, doi: 10.1016/j.compgeo.2023.105788.

29. L. Weidner, G. Walton, and C. Phillips, “Investigating the influences of precipitation, snowmelt, and freeze-thaw on rockfall in Glenwood Canyon, Colorado using terrestrial laser scanning,” Landslides, vol. 21, no. 9, pp. 2073–2091, Sept. 2024, doi: 10.1007/s10346-024-02266-0.

30. H. Li, H. Zhang, and H. Pan, “A simplified method for determining earthquake early warning thresholds in high speed railways,” Sci. Rep., vol. 15, no. 1, p. 32864, Sept. 2025, doi: 10.1038/s41598-025-17842-0.

31. X. Xiao et al., “A Novel Method of Bridge Deflection Prediction Using Probabilistic Deep Learning and Measured Data,” Sensors, vol. 24, no. 21, p. 6863, Oct. 2024, doi: 10.3390/s24216863.

32. J. Luan et al., “Disentangling streamflow impacts of check dams from vegetation changes,” J. Hydrol., vol. 638, p. 131477, July 2024, doi: 10.1016/j.jhydrol.2024.131477.

33. E. E. Romanus, E. Silva, and R. R. Goldschmidt, “Empirical probabilistic forecasting: An approach solely based on deterministic explanatory variables for the selection of past forecast errors,” Int. J. Forecast., vol. 40, no. 1, pp. 184–201, Jan. 2024, doi: 10.1016/j.ijforecast.2023.01.003.

Funding

This research did not receive any funding from external agencies.

Competing Interests

The authors declare no competing interests.

Correspondence and requests for materials should be addressed to M.N.
