Projecting Future TOC in Data-Scarce Agricultural Reservoirs: A WGAN-Enhanced Framework Revealing the Importance of Dynamic Variability
Christian Joseph Siose a, Jeongho Han b*, Kyung-Sook Choi c, Byoung Han Choi d, Kyoung Jae Lim a
a Department of Regional Infrastructure Engineering, Kangwon National University, Chuncheon 24341, Republic of Korea
b Agriculture and Life Sciences Research Institute, Kangwon National University, Chuncheon 24341, Republic of Korea
c Department of Agricultural Civil Engineering, Kyungpook National University, Daegu 37224, Republic of Korea
d Rural Research Institute, Korea Rural Community Corporation, Ansan 15634, Republic of Korea
Abstract
Climate change poses a significant threat to water quality in agricultural reservoirs. However, projecting future changes using machine learning (ML) is often hampered by the limited availability of long-term observational data. This study develops a robust framework to project future Total Organic Carbon (TOC) concentrations by combining a Wasserstein Generative Adversarial Network (WGAN) for data augmentation with high-performance ML models. We applied this framework to two contrasting South Korean reservoirs (one pristine, one polluted). Historical data (n = 60) for each reservoir were augmented using WGAN, and a suite of nine ML models was optimized and evaluated. The best-performing model, M5 Model Tree with Stochastic Gradient Boosting (M5-SGB), was then used to project TOC from 2025 to 2100 using an ensemble of 12 Global Climate Models (GCMs) from CMIP6 under the SSP2-4.5 and SSP5-8.5 scenarios. Notably, the effectiveness of WGAN augmentation was model-dependent, significantly boosting the performance of certain algorithms while having a negligible effect on others. Future projections revealed a statistically significant increasing TOC trend for both reservoirs. However, the response was site-specific: the pristine but temporally unstable reservoir showed vulnerability even under the moderate-emissions scenario, while the polluted but stable reservoir only exhibited a significant trend under the high-emissions scenario. This study provides a practical framework for water quality prediction in data-limited contexts. The findings demonstrate that a reservoir's vulnerability to climate change is critically linked to its dynamic variability, not just its static water quality grade. This offers crucial insights for designing more effective, risk-based water management and monitoring strategies.
Keywords
Reservoir water quality
Total organic carbon
Climate change
Machine learning model
WGAN
Data augmentation
1 Introduction
Climate change is one of the most critical global challenges, exerting profound and far-reaching consequences on both natural and human systems. As global temperatures continue to rise, precipitation patterns shift, and extreme weather events become more frequent, the impact on freshwater resources is becoming increasingly evident (Delpla et al. 2009). These disruptions not only threaten water availability and accessibility but also degrade water quality, posing significant risks to ecosystems, public health, and socio-economic development (Tranvik et al. 2009). Consequently, there is a growing urgency to understand the intricate relationship between climate change and water quality and to develop effective mitigation strategies.
Reservoirs serve as crucial infrastructure for water storage and distribution, providing resources for domestic, agricultural, and industrial purposes, as well as supporting flood control and ecosystem services (Jeppesen et al. 2014). However, climate change exerts considerable stress on reservoirs, affecting both inflow patterns and water quality dynamics (Whitehead et al. 2009). These pressures highlight the urgent need for innovative management strategies and reliable predictive frameworks to anticipate water quality variations under climate change scenarios.
Traditionally, water quality has been assessed using parameters such as biochemical oxygen demand (BOD) and chemical oxygen demand (COD). Although these parameters are widely used, they can be influenced by specific organic or chemical conditions in aquatic environments. In contrast, total organic carbon (TOC), which represents the total amount of carbon bound in organic compounds dissolved or suspended in water, is increasingly recognized as a more comprehensive indicator of water quality (Aguilar-Torrejón et al. 2023). TOC quantifies the overall organic carbon content without depending on particular oxidation pathways. Thus, TOC provides a more stable measure of organic pollution that can better capture various forms of organic matter in a single parameter (Oh et al. 2024).
Previous studies have also shown that higher summer temperatures and greater seasonal temperature variability are associated with higher TOC concentrations in lake sediments, whereas reduced seasonality is associated with lower TOC (Zhou et al. 2025). Similarly, in catchments, increases in rainfall intensity and duration drive higher TOC concentrations and export in runoff, with heavy rainfall events generating TOC fluxes several times greater than those during lower-intensity storms (Delpla et al. 2011). Given these findings, many process-based or physically based hydrological and water quality models have been developed and widely used to predict changes in water quality considering climate dynamics. However, their limited ability to capture complex, non-linear relationships among environmental variables hinders accurate predictions (Yan et al. 2024).
Compared to these traditional approaches, machine learning (ML) models are capable of identifying intricate patterns in large datasets and offer a promising alternative for water quality prediction (Nallakaruppan et al. 2024). Despite the superiority of ML models in handling non-linear relationships, their application to water quality prediction often faces challenges because water quality datasets are sparse or irregular compared to those available for hydrological modeling. Given that ML models are inherently data-driven, insufficient training data can significantly undermine their predictive performance (Zheng et al. 2025). To address this constraint, data augmentation techniques, such as Wasserstein Generative Adversarial Networks (WGAN), have been applied to generate synthetic data and enhance model performance (Arjovsky et al. 2017). WGAN is a type of generative model that can learn the underlying distribution of the original data and generate realistic synthetic data.
In this context, the present study aims to develop a comprehensive framework to project the impact of climate change on reservoir water quality, particularly TOC, by integrating ML models with a data augmentation approach. The specific objectives are (1) to develop and compare diverse ML models for TOC prediction; (2) to apply WGAN-based data augmentation to enhance small water quality datasets; and (3) to evaluate future reservoir water quality trends based on Coupled Model Intercomparison Project Phase 6 (CMIP6) global climate models (GCMs), using the best-performing ML model. By achieving these objectives, this study will contribute to improving water quality management strategies for reservoirs, offering insight into climate-resilient infrastructure planning and providing a foundation for sustainable resource management.
2 Materials and Methods
The overall research workflow is depicted in Fig. 1. First, observational data and CMIP6 climate projections were collected and pre-processed to construct the input dataset. Next, a WGAN-based data augmentation technique was applied to address the limitations of a small dataset. Subsequently, Optuna was used to optimize and validate nine ML models. The best-performing model was then selected to project future TOC concentrations for the period 2025–2100 and to analyze the corresponding trends. The detailed procedures and data sources for each step are described in the following sections.
Fig. 1
Methodological workflow for developing TOC prediction models and projecting future concentrations under climate change scenarios, detailing the key steps from data preprocessing to model application
2.1 Study Area
This study focused on two representative agricultural reservoirs in South Korea: Sacheon Reservoir in Gangneung and Docheon Reservoir in Yeongdeok (Fig. 2). While both are situated in the eastern coastal region of the Korean Peninsula, they were selected to represent contrasting environmental conditions.
Sacheon Reservoir is located in Gangwon Province, at approximately 37°47′ N, 128°53′ E (Fig. 2a). It has a total storage capacity of 2.1 × 10⁶ m³ and a watershed area of 2,280 ha. This reservoir, managed by the Korea Rural Community Corporation (KRC), is characterized by high water quality, consistently meeting the criteria for Grade A under South Korea's agricultural water quality standards, indicating its suitability for all agricultural purposes without prior treatment. The region is defined by a mountainous topography that contributes to higher precipitation and cooler temperatures.
Docheon Reservoir is located in Gyeongsangbuk Province at approximately 36°31′ N, 129°35′ E (Fig. 2b), with a total storage capacity of 1.1 × 10⁶ m³ and a watershed area of 1,380 ha. Also managed by the KRC, this reservoir's water quality is generally poor, often classified as Grade E, indicating the need for significant treatment before agricultural use. The area has a relatively warmer climate with more distinct seasonal variations in rainfall compared to the Sacheon watershed.
These two sites were selected based on four key criteria: (1) their roles as typical agricultural reservoirs in South Korea; (2) the availability of comprehensive water quality data; (3) their contrasting water quality statuses (high vs. low); and (4) their regional climate conditions.
Fig. 2
Study area in South Korea, showing the locations of (a) Sacheon and (b) Docheon Reservoirs
2.2 Data Acquisition and Preprocessing
The target variable for this study was TOC, with historical data obtained from the Rural Agricultural Water Resource Information System (RAWRIS 2024). The dataset for each reservoir spanned from 2011 to 2024 and consisted of 60 observations, collected at monthly or bimonthly intervals.
To quantify the temporal variability of this historical data, the coefficient of variation (CV) was calculated for the TOC time series of each reservoir. Sacheon Reservoir exhibited greater instability (CV = 34.59%) compared to the more stable Docheon Reservoir (CV = 21.72%).
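The CV used above is the ratio of the standard deviation to the mean, expressed as a percentage. A minimal sketch follows; it assumes the sample standard deviation (the text does not state whether the sample or population form was used), and the function name and the TOC values are hypothetical, chosen only for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Return the coefficient of variation in percent: sample std / mean * 100."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: sample (n - 1) standard deviation
    return statistics.stdev(values) / mean * 100.0


# Hypothetical TOC readings (mg/L), not data from this study
toc_example = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
cv_example = coefficient_of_variation(toc_example)
```

A higher CV means the series fluctuates more relative to its own mean, which is why the metric allows the two reservoirs to be compared despite their different baseline TOC levels.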
Meteorological variables, including daily precipitation (Prcp), minimum temperature (Tmin), maximum temperature (Tmax), and relative humidity (Rhum), were collected from the Korea Meteorological Administration (KMA) for the North Gangneung and Yeongdeok stations (Fig. 2). To address the temporal discrepancy between the daily meteorological data and the less frequent TOC measurements, monthly averages of each meteorological variable were calculated to serve as input features for the TOC prediction models.
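The temporal alignment step can be sketched as a simple group-by-month average. The snippet below is an illustrative sketch rather than the processing code used in this study; the function name, record layout, and temperature values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import fmean


def monthly_means(daily_records):
    """Average daily values per calendar month.

    daily_records: iterable of (date, value) pairs.
    Returns a dict mapping (year, month) to the mean of that month's values.
    """
    buckets = defaultdict(list)
    for day, value in daily_records:
        buckets[(day.year, day.month)].append(value)
    return {key: fmean(vals) for key, vals in sorted(buckets.items())}


# Hypothetical daily maximum temperatures (degrees C), not KMA data
daily_tmax = [
    (date(2020, 1, 1), 3.0),
    (date(2020, 1, 2), 5.0),
    (date(2020, 2, 1), 8.0),
    (date(2020, 2, 2), 10.0),
    (date(2020, 2, 3), 12.0),
]
tmax_monthly = monthly_means(daily_tmax)
```

Each monthly value can then be paired with the TOC observation taken in the same month.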
For the climate change impact assessment, future climate projection data were retrieved from the Earth System Grid Federation node under the CMIP6 framework. We selected two contrasting scenarios: the intermediate-emissions pathway (SSP2-4.5) and the high-emissions, fossil-fueled development pathway (SSP5-8.5).
SSP2-4.5 assumes that radiative forcing stabilizes at 4.5 W/m² by 2100, whereas SSP5-8.5 represents a worst-case scenario with forcing reaching 8.5 W/m² (Tebaldi et al. 2021). Analyzing these two contrasting scenarios was intended to capture a plausible range of future climate impacts on TOC.
To account for inter-model uncertainty, outputs from 12 GCMs from CMIP6 were incorporated into our analysis (refer to Table S1 in the Supplementary Material). These GCMs were selected based on their established performance in simulating East Asian climate patterns and the availability of the required variables. This multi-model ensemble approach was designed to enhance the robustness of our future TOC projections by considering a wide spectrum of potential climate futures.
2.3 WGAN-based Data Augmentation and Validation
Given that ML model performance relies heavily on data volume, the 60 observations available for each reservoir were deemed insufficient for robust training. To address this limitation, we augmented the dataset using a WGAN. A Generative Adversarial Network (GAN) (Goodfellow et al. 2014) is a deep learning architecture consisting of two competing neural networks: a generator that creates synthetic data and a discriminator that distinguishes real data from synthetic counterparts. While conventional GANs often suffer from training instability, WGAN utilizes the Wasserstein distance metric to mitigate these issues and improve convergence (Arjovsky et al. 2017; Gulrajani et al. 2017).
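For reference, the Wasserstein-1 distance that the WGAN critic approximates can be written in its Kantorovich-Rubinstein dual form, which leads to the minimax objective of Arjovsky et al. (2017). The symbols below (P_r for the real data distribution, P_g for the generated distribution, f and D for the critic, G for the generator, z for the latent noise) are notation introduced here for illustration and do not appear elsewhere in this paper.

```latex
\[
W(P_r, P_g) = \sup_{\|f\|_L \le 1}
  \; \mathbb{E}_{x \sim P_r}\!\left[ f(x) \right]
  - \mathbb{E}_{x \sim P_g}\!\left[ f(x) \right]
\]

\[
\min_{G} \; \max_{D \in \mathcal{D}}
  \; \mathbb{E}_{x \sim P_r}\!\left[ D(x) \right]
  - \mathbb{E}_{z \sim p(z)}\!\left[ D(G(z)) \right]
\]
```

Here \mathcal{D} denotes the set of 1-Lipschitz functions. In practice this constraint is enforced approximately, for example by weight clipping (Arjovsky et al. 2017) or by a gradient penalty (Gulrajani et al. 2017).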
Despite its advantages, universally established guidelines for applying WGAN to environmental data augmentation remain scarce. Drawing upon precedents in related fields, we established our methodological parameters. Previous studies demonstrated the suitability of a learning rate of 0.00005 and the feasibility of 3000 training epochs for environmental prediction tasks (Lee et al. 2021; Peng et al. 2022). Accordingly, the present study adopted these values. The original dataset was augmented fivefold, resulting in a 5:1 synthetic-to-real data ratio for subsequent model training.
To validate the fidelity of the generated data, we assessed the distributional similarity between the synthetic and real datasets using three complementary nonparametric tests: the Kolmogorov–Smirnov (KS) test, which is adept at detecting differences in the central region of distributions; the Anderson–Darling (AD) test, which assigns greater weight to the tails and is thus more sensitive to discrepancies in extreme values; and the Energy Distance (ED) test, which provides a single, robust measure of divergence between multivariate samples (Kolmogorov 1933; Anderson and Darling 1952; Rizzo and Székely 2016).
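To make two of these statistics concrete, the sketch below computes the two-sample KS statistic (the largest gap between the two empirical cumulative distribution functions) and a univariate squared energy distance. It is an illustrative simplification: it returns the statistics only, without p-values, and the energy distance is shown for a single variable, whereas the ED test in this study is applied to the multivariate samples. Function names and sample values are hypothetical.

```python
from bisect import bisect_right
from itertools import product


def ks_statistic(a, b):
    """Two-sample KS statistic: maximum absolute gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The gap can only change at observed data points, so check each of them
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def energy_distance_squared(a, b):
    """Univariate squared energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    def mean_abs_diff(u, v):
        return sum(abs(x - y) for x, y in product(u, v)) / (len(u) * len(v))

    return 2 * mean_abs_diff(a, b) - mean_abs_diff(a, a) - mean_abs_diff(b, b)


# Hypothetical real and synthetic samples, for illustration only
real_sample = [1.0, 2.0, 3.0, 4.0]
synthetic_sample = [1.5, 2.5, 3.5, 4.5]
ks_example = ks_statistic(real_sample, synthetic_sample)
ed_example = energy_distance_squared(real_sample, synthetic_sample)
```

Both statistics are zero for identical samples and grow as the two samples diverge; the corresponding p-values are what the tests report in practice.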
2.4 Predictive Modeling Framework
2.4.1 Machine Learning Models
To identify the most effective algorithm for TOC prediction, nine distinct ML models were evaluated. The comprehensive suite, selected to cover diverse modeling principles, included the decision tree (DT), random forest (RF), gradient boosting (GB), eXtreme gradient boosting (XGB), support vector regression (SVR), ridge regression (RR), LassoLars (LL), the M5 Model Tree (M5) and M5 Model Tree with Stochastic Gradient Boosting (M5-SGB). For more comprehensive details on the theoretical background of each algorithm, readers are referred to the original studies cited below.
DT organizes data into a tree-like structure for regression, offering high interpretability but requiring hyperparameter tuning to prevent overfitting (Breiman et al. 1984; Mingers 1989). RF is an ensemble model that constructs multiple decision trees, each trained on a random subset of the data, and aggregates their predictions, thereby reducing overfitting and capturing complex nonlinear relationships (Breiman 2001; Probst et al. 2019). GB sequentially combines weak learners (typically trees) into a strong predictive model by iteratively training new learners on the residual errors of the previous ones (Friedman 2001; Natekin and Knoll 2013). XGB is a highly efficient and scalable implementation of GB that incorporates regularization to control overfitting and leverages parallel processing for computational speed (Chen and Guestrin 2016).
SVR is a kernel-based method that maps features into a higher-dimensional space to model nonlinear relationships, fitting a function within a specified error margin (Lee et al. 2005; Awad and Khanna 2015). Two regularized linear models, LL and RR, were also included. LL uses L1 regularization for feature selection by shrinking some coefficients to zero, while RR uses L2 regularization to handle multicollinearity by reducing coefficient variance (Hoerl and Kennard 1970; Tibshirani 1996).
M5 is a hybrid algorithm that places linear regression models at the leaf nodes of decision trees. This enables it to capture localized linear trends and to perform limited extrapolation beyond the training data range, addressing a key limitation of standard decision trees, which cannot predict values outside the observed data (Quinlan 1992).
M5-SGB enhances the M5 model by integrating it with Stochastic Gradient Boosting to iteratively refine weak learners, thereby reducing bias and variance. This improved robustness enhances the model's potential for extrapolation, making it better suited for predicting novel, non-stationary climate-TOC relationships (Friedman 2002; Sattari et al. 2018).
2.4.2 Model Training, Optimization, and Application
The historical observation data were partitioned into training (80%) and test (20%) sets. Synthetic data generated by the WGAN were added exclusively to the training sets to enhance model learning. Subsequently, the key hyperparameters for each of the nine ML models were optimized using Optuna, an automated hyperparameter optimization framework (Akiba et al. 2019). The search space for each hyperparameter (see Table S2 in the Supplementary Material) was defined based on established practices in environmental modeling (Sipper 2022). The model demonstrating the best performance on the test set was selected as the final model. This optimized model was then used to project future TOC concentrations through 2100, driven by climate projections from 12 GCMs under the SSP2-4.5 and SSP5-8.5 scenarios.
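The key point of this partitioning is that the test set contains only real observations, so synthetic samples cannot inflate the reported accuracy. A minimal sketch is given below. It assumes a random shuffle before splitting (the text does not state whether the split was random or chronological), and the function name, seed, and placeholder rows are hypothetical.

```python
import random


def split_then_augment(real_rows, synthetic_rows, test_fraction=0.2, seed=0):
    """Hold out real test rows first, then add synthetic rows to training only."""
    rng = random.Random(seed)
    shuffled = list(real_rows)
    rng.shuffle(shuffled)  # assumption: random rather than chronological split
    n_test = round(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    # Synthetic rows never enter the test set
    train_set = shuffled[n_test:] + list(synthetic_rows)
    return train_set, test_set


# Placeholder rows sized to match the study: 60 real and 300 synthetic (5:1)
real_rows = [("real", i) for i in range(60)]
synthetic_rows = [("synthetic", i) for i in range(300)]
train_set, test_set = split_then_augment(real_rows, synthetic_rows)
```

With these sizes the training set holds 48 real plus 300 synthetic rows, and the test set holds 12 real rows.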
2.4.3 Performance Evaluation of ML Models
ML model performance was evaluated using two metrics widely accepted in hydrological modeling: the Nash–Sutcliffe model efficiency (NSE) (Nash and Sutcliffe 1970) and the coefficient of determination (R²), whose effectiveness has been demonstrated in numerous studies (Moriasi et al. 2015; Bhatta et al. 2019; Alqahtani et al. 2022; Modi and Chintalacheruvu 2024).
NSE is a normalized statistic that determines the relative magnitude of the residual variance compared to the measured data variance, ranging from negative infinity to 1. An NSE value of 1 indicates a perfect match, while a value of 0 suggests the model is only as accurate as the mean of the observed data. R² quantifies the proportion of the variance in the observed data that is predictable from the model, ranging from 0 to 1.
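The two metrics can be sketched as follows. NSE follows the standard definition of Nash and Sutcliffe (1970). For R², the sketch assumes the squared Pearson correlation between observed and predicted values, which is bounded between 0 and 1 as described above; the exact formula used in this study is not stated. Function names and values are hypothetical.

```python
from statistics import fmean


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - sum((o - p)^2) / sum((o - mean(o))^2)."""
    obs_mean = fmean(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - residual / variance


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    obs_mean, pred_mean = fmean(observed), fmean(predicted)
    cov = sum((o - obs_mean) * (p - pred_mean) for o, p in zip(observed, predicted))
    var_o = sum((o - obs_mean) ** 2 for o in observed)
    var_p = sum((p - pred_mean) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Hypothetical values: predictions with a constant bias of +1
observed = [1.0, 2.0, 3.0, 4.0]
biased_prediction = [2.0, 3.0, 4.0, 5.0]
nse_example = nse(observed, biased_prediction)
r2_example = r_squared(observed, biased_prediction)
```

In this example the predictions track the observations perfectly apart from a constant offset, so the squared correlation is 1 while NSE drops to 0.2. This illustrates why the two metrics are reported together: NSE penalizes systematic bias that a correlation-based R² does not detect.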
2.5 Trend Analysis of Future TOC Concentrations under Climate Change Scenarios
This study used an ML-based approach to project reservoir TOC concentrations from 2025 to 2100 under SSP2-4.5 and SSP5-8.5 climate scenarios. To rigorously analyze these long-term projections and identify significant patterns, we adopted a comprehensive trend analysis framework.
To detect the presence of monotonic trends, we applied the non-parametric Mann–Kendall (MK) test, which is robust to outliers and non-normal data distributions (Hirsch et al. 1982). The magnitude of any identified trend was then quantified using Sen's slope, a method that calculates the median of all possible pairwise slopes and is therefore resistant to the influence of extreme values (Sen 1968).
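Both procedures can be sketched in a few lines. The version below is a simplified illustration: it omits the tie correction and any autocorrelation adjustment of the MK variance, and it assumes equally spaced time steps for Sen's slope. Function names and the example series are hypothetical.

```python
import math
from statistics import median


def mann_kendall_z(series):
    """Mann-Kendall Z statistic (no tie or autocorrelation correction)."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of each later-minus-earlier pair
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        return (s - 1) / math.sqrt(var_s)
    if s < 0:
        return (s + 1) / math.sqrt(var_s)
    return 0.0


def sens_slope(series):
    """Median of all pairwise slopes, assuming unit spacing between time steps."""
    n = len(series)
    slopes = [
        (series[j] - series[i]) / (j - i)
        for i in range(n - 1)
        for j in range(i + 1, n)
    ]
    return median(slopes)


# Hypothetical steadily rising annual series, for illustration only
rising = [float(year) for year in range(1, 11)]
z_example = mann_kendall_z(rising)
slope_example = sens_slope(rising)
```

For a two-sided test at the 5% level, a trend is considered significant when the absolute Z value exceeds 1.96.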
Furthermore, to investigate seasonal dynamics, the projected time series were aggregated into seasonal means (spring, summer, autumn). This seasonal approach can uncover distinct trends that might be obscured in an annual analysis (Ha et al. 2022). Winter data were excluded from this seasonal analysis because monitoring during ice-covered months is typically absent.
3 Results
3.1 Distribution Similarity Analysis between Synthetic and Real data
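The distributional similarity between the synthetic and real data for both Sacheon and Docheon reservoirs was evaluated using three complementary nonparametric tests (Table 1). For nearly all tests, the results were statistically non-significant (p ≥ 0.05), thus failing to reject the null hypothesis that both datasets originate from the same underlying distribution. The exception was TOC in Sacheon Reservoir, for which the KS (p = 0.016) and AD (p = 0.028) tests indicated a significant difference, although the multivariate ED test across all variables remained non-significant (p = 0.948).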
This outcome indicates that the WGAN-generated data reproduced the key statistical properties of the original observations. Specifically, the results from the KS and AD tests pointed to a high degree of similarity in the central regions and the tails of the distributions, respectively. The ED test further corroborated these findings, indicating a strong overall proximity between the two datasets. Collectively, these results support the conclusion that the synthetic data are of sufficient fidelity to augment the training set for the subsequent development of ML models.
Table 1
Distribution similarity analysis between synthetic data and real data in each reservoir. p-values below 0.05 indicate a statistically significant difference between the synthetic and real distributions
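| Reservoir | Variable | KS statistic | KS p-value | AD statistic | AD p-value | ED statistic | ED p-value |
|---|---|---|---|---|---|---|---|
| Sacheon | Prcp | 0.130 | 0.380 | 0.037 | 0.250 | - | - |
| Sacheon | Tmax | 0.111 | 0.575 | -0.108 | 0.250 | - | - |
| Sacheon | Tmin | 0.099 | 0.716 | -0.695 | 0.250 | - | - |
| Sacheon | Rhum | 0.079 | 0.907 | -0.880 | 0.250 | - | - |
| Sacheon | TOC | 0.225 | 0.016 | 2.099 | 0.028 | - | - |
| Sacheon | Overall variables | - | - | - | - | 0.157 | 0.948 |
| Docheon | Prcp | 0.104 | 0.653 | -0.575 | 0.250 | - | - |
| Docheon | Tmax | 0.104 | 0.653 | 0.491 | 0.250 | - | - |
| Docheon | Tmin | 0.083 | 0.878 | 0.004 | 0.250 | - | - |
| Docheon | Rhum | 0.125 | 0.422 | 0.002 | 0.250 | - | - |
| Docheon | TOC | 0.100 | 0.702 | 0.541 | 0.250 | - | - |
| Docheon | Overall variables | - | - | - | - | 0.257 | 0.597 |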
3.2 Comparative Performance of Machine Learning Models
In Sacheon Reservoir (Fig. 3a), the M5-SGB model outperformed all other algorithms, achieving the highest performance metrics (NSE = 0.862, R² = 0.864). Following M5-SGB, the standalone M5 model also demonstrated strong performance. The evaluation revealed a nuanced impact of WGAN-based data augmentation: boosting-based models such as XGB and GB showed improved performance with the augmented data, whereas other models, including RF, LL, RR, and SVR, exhibited a marginal decrease.
For Docheon Reservoir (Fig. 3b), M5-SGB again proved to be the most accurate predictor, recording the highest NSE of 0.804 and an R² of 0.895. In this reservoir, most models, including the regularized linear models (LL and RR), showed slight performance enhancements with the augmented dataset.
In summary, M5-SGB was consistently the superior model for TOC prediction in both reservoirs. However, the effectiveness of WGAN-based data augmentation was observed to be model-dependent.
Fig. 3
Performance of ML models trained on the original and WGAN-augmented datasets for (a) Sacheon and (b) Docheon Reservoirs
3.3 Projection of TOC in 12 GCMs under Climate Change Scenarios
The projected TOC time series for Sacheon Reservoir were analyzed using the MK test and Sen's slope (Fig. 4a and Table 2). Under the high-emissions SSP5-8.5 scenario, results indicated a statistically significant increasing trend in TOC levels for all 12 GCMs. Under the intermediate SSP2-4.5 scenario, 11 of the 12 GCMs also showed a significant increasing trend, with the FGOALS-g3 model being the sole exception. The seasonal analysis (Fig. 5a) further revealed that projected summer TOC concentrations consistently exceeded those of spring and autumn, emphasizing the potential influence of intensified temperature fluctuations under climate change.
In Docheon Reservoir (Fig. 4b and Table 3), 8 of the 12 GCMs under the SSP5-8.5 scenario exhibited a statistically significant increasing TOC trend. In contrast, under the SSP2-4.5 scenario, 9 of the 12 models displayed no significant trend, with only MRI-ESM2-0, NESM3, and NorESM2-LM showing a significant upward trajectory. The seasonal analysis (Fig. 5b) illustrated distinct periodic fluctuations, with particularly pronounced summer peaks.
Fig. 4
Projected annual TOC concentrations from 2025–2100 for (a) Sacheon and (b) Docheon Reservoirs, based on 12 GCMs. For each GCM, the faint line illustrates the projected annual values, while the bold line is the long-term trend estimated by Sen's slope. Trend significance is determined by the MK test (p < 0.05), as detailed in Tables 2 and 3
Table 2
Results of the MK test and Sen's slope analysis applied to the projected TOC time series (2025–2100) for Sacheon Reservoir. The analysis covers 12 GCMs under SSP2-4.5 and SSP5-8.5 scenarios. Asterisks (*) denote a statistically significant trend (p < 0.05)
| Model | Scenario | Trend | Z-Score | Sen's Slope |
|---|---|---|---|---|
| ACCESS-CM2 | SSP245 | Increasing* | 2.938 | 0.003 |
| ACCESS-CM2 | SSP585 | Increasing* | 5.996 | 0.005 |
| ACCESS-ESM1-5 | SSP245 | Increasing* | 5.243 | 0.005 |
| ACCESS-ESM1-5 | SSP585 | Increasing* | 7.557 | 0.010 |
| CMCC-ESM2 | SSP245 | Increasing* | 6.472 | 0.006 |
| CMCC-ESM2 | SSP585 | Increasing* | 7.862 | 0.010 |
| CanESM5 | SSP245 | Increasing* | 4.274 | 0.005 |
| CanESM5 | SSP585 | Increasing* | 9.288 | 0.013 |
| FGOALS-g3 | SSP245 | No trend | 0.489 | 0.000 |
| FGOALS-g3 | SSP585 | Increasing* | 3.512 | 0.003 |
| KIOST-ESM | SSP245 | Increasing* | 2.381 | 0.002 |
| KIOST-ESM | SSP585 | Increasing* | 3.763 | 0.003 |
| MIROC6 | SSP245 | Increasing* | 2.435 | 0.002 |
| MIROC6 | SSP585 | Increasing* | 6.292 | 0.007 |
| MRI-ESM2-0 | SSP245 | Increasing* | 2.651 | 0.003 |
| MRI-ESM2-0 | SSP585 | Increasing* | 4.041 | 0.004 |
| NESM3 | SSP245 | Increasing* | 3.108 | 0.002 |
| NESM3 | SSP585 | Increasing* | 3.009 | 0.002 |
| NorESM2-LM | SSP245 | Increasing* | 2.731 | 0.003 |
| NorESM2-LM | SSP585 | Increasing* | 6.552 | 0.007 |
| NorESM2-MM | SSP245 | Increasing* | 3.826 | 0.004 |
| NorESM2-MM | SSP585 | Increasing* | 4.050 | 0.004 |
| TaiESM1 | SSP245 | Increasing* | 5.216 | 0.004 |
| TaiESM1 | SSP585 | Increasing* | 6.813 | 0.007 |
Table 3
Results of the MK test and Sen's slope analysis applied to the projected TOC time series (2025–2100) for Docheon Reservoir. The analysis covers 12 GCMs under SSP2-4.5 and SSP5-8.5 scenarios. Asterisks (*) denote a statistically significant trend (p < 0.05)
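| Model | Scenario | Trend | Z-Score | Sen's Slope |
|---|---|---|---|---|
| ACCESS-CM2 | SSP245 | No trend | 0.525 | 0.000 |
| ACCESS-CM2 | SSP585 | No trend | -0.049 | 0.000 |
| ACCESS-ESM1-5 | SSP245 | No trend | -0.112 | 0.000 |
| ACCESS-ESM1-5 | SSP585 | No trend | 1.135 | 0.001 |
| CMCC-ESM2 | SSP245 | No trend | 1.377 | 0.001 |
| CMCC-ESM2 | SSP585 | Increasing* | 5.099 | 0.005 |
| CanESM5 | SSP245 | No trend | 1.673 | 0.002 |
| CanESM5 | SSP585 | Increasing* | 6.283 | 0.007 |
| FGOALS-g3 | SSP245 | No trend | 0.229 | 0.000 |
| FGOALS-g3 | SSP585 | Increasing* | 5.557 | 0.006 |
| KIOST-ESM | SSP245 | No trend | 1.709 | 0.003 |
| KIOST-ESM | SSP585 | Increasing* | 4.005 | 0.006 |
| MIROC6 | SSP245 | No trend | 1.180 | 0.001 |
| MIROC6 | SSP585 | Increasing* | 5.521 | 0.005 |
| MRI-ESM2-0 | SSP245 | Increasing* | 1.978 | 0.002 |
| MRI-ESM2-0 | SSP585 | Increasing* | 2.238 | 0.002 |
| NESM3 | SSP245 | Increasing* | 3.252 | 0.004 |
| NESM3 | SSP585 | Increasing* | 7.952 | 0.015 |
| NorESM2-LM | SSP245 | Increasing* | 2.256 | 0.002 |
| NorESM2-LM | SSP585 | Increasing* | 3.539 | 0.004 |
| NorESM2-MM | SSP245 | No trend | 1.404 | 0.001 |
| NorESM2-MM | SSP585 | No trend | 0.740 | 0.001 |
| TaiESM1 | SSP245 | No trend | -0.184 | 0.000 |
| TaiESM1 | SSP585 | No trend | -1.601 | -0.002 |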
Fig. 5
Projected seasonal TOC cycles for the near-term (2050–2055) and long-term (2080–2085) in (a) Sacheon and (b) Docheon Reservoirs. Lines represent the ensemble mean of 12 GCMs for each period.
4 Discussion
4.1 Augmentation Effects on Model Performance
The superior performance of the M5-SGB model across both reservoirs highlights the advantages of its hybrid architecture for predicting complex environmental variables like TOC. By integrating the piecewise linear functions of the M5 model tree with the iterative error-correction of Stochastic Gradient Boosting, M5-SGB effectively captures the non-linear relationships between climatic drivers and water quality. Unlike conventional tree-based models that are limited to their training range, the model's ability to perform localized linear regression at its leaf nodes likely provided a critical advantage in extrapolating future TOC concentrations under novel climate conditions.
The successful application of WGAN for data augmentation aligns with findings from various domains. For instance, WGAN has been shown to generate synthetic data that closely matches real distributions in medical time series (Esteban et al. 2017) and electronic health records (Yoon et al. 2019). More specifically, within the environmental sciences, studies have reported that WGAN-based augmentation enhanced the predictive performance of models for aquatic ecosystem health (Lee et al. 2021) and improved streamflow prediction accuracy (López-Chacón et al. 2023). Our finding that WGAN provided a distinct performance boost to boosting algorithms like XGB and GB is consistent with this established potential.
However, our results also introduce a critical nuance. The observation that this positive impact did not extend to inherently robust models like RF or regularized linear models suggests that data augmentation is not a universally beneficial strategy. Its utility is highly dependent on the chosen model architecture. This implies that a tailored approach is crucial, as synthetic data may offer significant benefits for data-sensitive algorithms while providing negligible or even slightly negative effects for models already well-suited to smaller datasets. This distinction is also pragmatically important. Implementing WGAN is computationally intensive and requires specialized expertise; therefore, understanding precisely which models benefit most ensures that this powerful technique is applied efficiently. Consequently, studies like the present one, which investigate the specific interactions between data augmentation methods and model architectures, are essential for developing effective and resource-conscious modeling strategies.
4.2 Contrasting Future TOC Projections between Reservoirs under Climate Change
A central finding of this study is the divergent response of the two reservoirs to climate change: Sacheon Reservoir is projected to experience significant TOC increases under both the moderate- and high-emissions scenarios, whereas Docheon Reservoir's increasing trend is significant only under the high-emissions scenario.
Sacheon Reservoir presents a "pristine but unstable" paradox. Despite its high water quality (Grade A), it exhibits high variability (CV = 34.59%), indicating considerable instability and sensitivity to irregular meteorological events or pollution inputs. This inherent vulnerability makes the reservoir highly susceptible to the systematic pressures of climate change, such as accelerated warming and intensified precipitation, which drive increased organic matter input from the watershed (Monteith et al. 2007; Fenner et al. 2021). Consequently, even the moderate climatic shift under SSP2-4.5 appears sufficient to trigger a significant rising TOC trend.
Conversely, Docheon Reservoir can be characterized as polluted but stable, with poor baseline water quality but a much lower variability (CV = 21.72%). This suggests a more predictable system, possibly buffered by its already high organic load. A compelling hypothesis for its behavior is a threshold-based response. The watershed and reservoir system may effectively buffer TOC mobilization under moderate warming, but this capacity is overwhelmed once a climatic threshold is surpassed under the more extreme high-emissions conditions (Laudon et al. 2011). This would explain the lack of a significant trend in the SSP2-4.5 scenario and the abrupt emergence of one under SSP5-8.5.
These long-term trends are further contextualized by the seasonal analysis, which showed pronounced summer peaks in TOC for both reservoirs. This is consistent with established mechanisms where rising temperatures enhance microbial activity and the decomposition of organic matter, a process expected to intensify under future warming (Sobek et al. 2007). These findings underscore that predicting climate change impacts on water quality requires a nuanced approach that considers not only the current state (e.g., water quality grade) but also the system's temporal variability and potential for non-linear responses.
4.3 Implications and Future Directions
The projected increase in TOC concentrations carries significant practical implications for agricultural water management in South Korea. Rising TOC levels can increase the potential for summer algal blooms and elevate the costs and complexity of water treatment, particularly concerning the formation of disinfection byproducts. These findings present a direct challenge to water management bodies like the Korea Rural Community Corporation (KRC) and underscore the need for proactive, climate-adaptive watershed management strategies. Furthermore, this study's finding that the pristine Sacheon Reservoir is more unstable (high CV) than the polluted Docheon Reservoir suggests that current monitoring strategies may need re-evaluation. A water body's vulnerability to climate change may be better characterized by its dynamic variability than its current quality grade alone. Therefore, designing robust monitoring and management plans should consider not just the current state, but also the inherent instability of the system.
While this study provides a robust framework for future TOC prediction, the analysis was confined to two reservoirs, and the generalization of these findings requires further research across a wider range of hydro-climatic conditions. Furthermore, all climate projections are subject to inherent uncertainty from the GCMs themselves, although this was mitigated by using a 12-model ensemble. Therefore, future research should aim to expand this modeling framework to a larger, more diverse set of reservoirs. Comparing WGAN with other generative models, such as Variational Autoencoders (VAEs), could also yield valuable insights into the most effective data augmentation techniques. Finally, incorporating additional predictor variables, such as satellite-derived land-use data or point-source pollution information, could further enhance the accuracy and explanatory power of the predictive models.
5 Conclusion
This study developed and validated a predictive framework that combines WGAN-based data augmentation with a high-performance M5-SGB model to project future TOC concentrations in data-scarce agricultural reservoirs. The ability to achieve satisfactory prediction accuracy using only readily available meteorological data is particularly encouraging. High-frequency water quality monitoring in reservoirs is often constrained by the logistical challenges of manual, boat-based sampling, making this modeling approach a valuable tool for supplementing sparse observational records. The framework proved effective in capturing the complex relationships between climatic drivers and water quality.
The projections reveal a significant increasing trend in TOC under future climate change, a trend that is particularly pronounced in the pristine but temporally unstable reservoir. This highlights the vulnerability of even healthy aquatic ecosystems to climatic shifts. Methodologically, this research demonstrated that the effectiveness of data augmentation is highly model-dependent, providing a critical insight for the practical application of machine learning in hydrology.
In conclusion, this study provides a robust tool for water resource managers to anticipate future challenges. It underscores that effective climate adaptation strategies must consider not only the current state of water bodies but also their dynamic variability to build true resilience.
Acknowledgements
This work was supported by the Korea Institute of Planning and Evaluation for Technology in Food, Agriculture and Forestry (IPET) through the Intelligent Agricultural Infra Management for Climate Change Development Program, funded by the Ministry of Agriculture, Food and Rural Affairs (MAFRA) (RS-2025-02263904).
Supplementary Information
Supplementary data to this article are submitted at the same time.
Author Contributions
Conceptualization: Christian Joseph Siose, Jeongho Han, Kyung-Sook Choi, Byoung Han Choi, Kyoung Jae Lim; Methodology: Jeongho Han, Kyung-Sook Choi, Byoung Han Choi, Kyoung Jae Lim; Formal analysis and investigation: Christian Joseph Siose, Jeongho Han; Writing – original draft preparation: Christian Joseph Siose, Jeongho Han; Writing – review and editing: Kyung-Sook Choi, Byoung Han Choi, Kyoung Jae Lim; Funding acquisition: Kyoung Jae Lim; Supervision: Kyoung Jae Lim; Project administration: Kyoung Jae Lim.
Data Availability
The data that support the findings of this study are available from the corresponding author (hanjeongho24@gmail.com) upon reasonable request.
Declarations
Competing Interests
The authors declare that there is no conflict of interest.
References
Aguilar-Torrejón JA, Balderas-Hernández P, Roa-Morales G et al (2023) Relationship, importance, and development of analytical techniques: COD, BOD, and, TOC in water—An overview through time. SN Appl Sci 5
Akiba T, Sano S, Yanase T et al (2019) Optuna: A Next-generation hyperparameter optimization framework. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, pp 2623–2631
Alqahtani A, Shah MI, Aldrees A, Javed MF (2022) Comparative assessment of individual and ensemble machine learning models for efficient analysis of river water quality. Sustain (Switzerland) 14. https://doi.org/10.3390/su14031183
Anderson TW, Darling DA (1952) Asymptotic theory of certain goodness of fit criteria based on stochastic processes
Arjovsky M, Chintala S, Bottou L (2017) Wasserstein Generative Adversarial Networks
Awad M, Khanna R (2015) Support vector machines for classification. In: Efficient learning machines. Apress, Berkeley, CA, pp 39–66
Bhatta B, Shrestha S, Shrestha PK, Talchabhadel R (2019) Evaluation and application of a SWAT model to assess the climate change impact on the hydrology of the Himalayan River Basin. Catena (Amst) 181. https://doi.org/10.1016/j.catena.2019.104082
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman & Hall/CRC. https://doi.org/10.1201/9781315139470
Breiman L (2001) Random forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
Chen T, Guestrin C (2016) XGBoost: A Scalable Tree Boosting System. https://doi.org/10.1145/2939672.2939785
Delpla I, Baurès E, Jung AV, Thomas O (2011) Impacts of rainfall events on runoff water quality in an agricultural environment in temperate areas. Sci Total Environ 409:1683–1688. https://doi.org/10.1016/j.scitotenv.2011.01.033
Delpla I, Jung A-V, Baures E et al (2009) Impacts of climate change on surface water quality in relation to drinking water production. Environ Int 35:1225–1233. https://doi.org/10.1016/j.envint.2009.07.001
Esteban C, Hyland SL, Rätsch G (2017) Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633
Fenner N, Meadham J, Jones T et al (2021) Effects of Climate Change on Peatland Reservoirs: A DOC Perspective. Global Biogeochem Cycles 35. https://doi.org/10.1029/2021GB006992
Friedman JH (2001) Greedy function approximation: A gradient boosting machine
Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38:367–378. https://doi.org/10.1016/S0167-9473(01)00065-2
Goodfellow IJ, Pouget-Abadie J, Mirza M et al (2014) Generative Adversarial Nets
Gulrajani I, Ahmed F, Arjovsky M et al (2017) Improved training of wasserstein gans. Adv Neural Inf Process Syst 30
Ha DW, Jung KY, Baek J et al (2022) Trend analysis using long-term monitoring data of water quality at Churyeongcheon and Yocheon basins. Sustain (Switzerland) 14. https://doi.org/10.3390/su14159770
Hirsch RM, Slack JR, Smith RA (1982) Techniques of trend analysis for monthly water quality data. Water Resour Res 18:107–121
Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12:55–67
Jeppesen E, Meerhoff M, Davidson TA et al (2014) Climate change impacts on lakes: An integrated ecological perspective based on a multi-faceted approach, with special focus on shallow lakes. J Limnol 73:88–111. https://doi.org/10.4081/jlimnol.2014.844
Kolmogorov A (1933) Sulla determinazione empirica di una legge di distribuzione. Giorn Dell'Inst Ital Degli Att 4:89–91
Laudon H, Berggren M, Ågren A et al (2011) Patterns and dynamics of dissolved organic carbon (DOC) in boreal streams: The role of processes, connectivity, and scaling. Ecosystems 14:880–893
Lee S, Kim J, Lee G et al (2021) Prediction of aquatic ecosystem health indices through machine learning models using the wgan-based data augmentation method. Sustain (Switzerland) 13. https://doi.org/10.3390/su131810435
Lee Y-J, Hsieh W-F, Huang C-M (2005) ε-SSVR: a smooth support vector machine for ε-insensitive regression. IEEE Trans Knowl Data Eng 17:678–685. https://doi.org/10.1109/TKDE.2005.77
López-Chacón SR, Salazar F, Bladé E (2023) Combining Synthetic and Observed Data to Enhance Machine Learning Model Performance for Streamflow Prediction. Water (Switzerland) 15. https://doi.org/10.3390/w15112020
Mingers J (1989) An empirical comparison of pruning methods for decision tree induction. Mach Learn 4:227–243. https://doi.org/10.1023/A:1022604100933
Modi P, Chintalacheruvu MR (2024) Investigating River water quality assessment through non-parametric analysis: A case study of the Godavari River in India. Environ Qual Manage 33:239–264. https://doi.org/10.1002/tqem.22117
Monteith DT, Stoddard JL, Evans CD et al (2007) Dissolved organic carbon trends resulting from changes in atmospheric deposition chemistry. Nature 450:537–540. https://doi.org/10.1038/nature06316
Moriasi DN, Gitau MW, Pai N, Daggupati P (2015) Hydrologic and water quality models: Performance measures and evaluation criteria. Trans ASABE 58:1763–1785. https://doi.org/10.13031/trans.58.10715
Nallakaruppan MK, Gangadevi E, Shri ML et al (2024) Reliable water quality prediction and parametric analysis using explainable AI models. Sci Rep 14:7520. https://doi.org/10.1038/s41598-024-56775-y
Nash JE, Sutcliffe JV (1970) River flow forecasting through conceptual models part I: A discussion of principles
Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Front Neurorobot 7. https://doi.org/10.3389/fnbot.2013.00021
Oh H, Park HY, Kim JI et al (2024) Enhancing machine learning models for total organic carbon prediction by integrating geospatial parameters in river watersheds. Sci Total Environ 943. https://doi.org/10.1016/j.scitotenv.2024.173743
Peng L, Li S, Sun H, Huang S (2022) A pipe ultrasonic guided wave signal generation network suitable for data enhancement in deep learning: US-WGAN. Energies (Basel) 15. https://doi.org/10.3390/en15186695
Probst P, Wright MN, Boulesteix AL (2019) Hyperparameters and tuning strategies for random forest. Wiley Interdiscip Rev Data Min Knowl Discov 9
Quinlan JR (1992) Learning with continuous classes. World Scientific
Rizzo ML, Székely GJ (2016) Energy distance. Wiley Interdiscip Rev Comput Stat 8:27–38
Rural Agricultural Water Resource Information System (RAWRIS) (2024) Rural Agricultural Water Resource Information System. Available at: https://rawris.ekr.or.kr. Accessed 11 Nov 2024
Sattari MT, Pal M, Mirabbasi R, Abraham J (2018) Ensemble of M5 model tree-based modelling of sodium adsorption ratio
Sen PK (1968) Estimates of the regression coefficient based on Kendall’s tau. J Am Stat Assoc 63:1379–1389
Sipper M (2022) High Per Parameter: A Large-Scale Study of Hyperparameter Tuning for Machine Learning Algorithms. Algorithms 15:315. https://doi.org/10.3390/a15090315
Sobek S, Tranvik LJ, Prairie YT et al (2007) Patterns and regulation of dissolved organic carbon: An analysis of 7,500 widely distributed lakes. Limnol Oceanogr 52:1208–1219. https://doi.org/10.4319/lo.2007.52.3.1208
Tebaldi C, Debeire K, Eyring V et al (2021) Climate model projections from the Scenario Model Intercomparison Project (ScenarioMIP) of CMIP6. Earth Sys Dyn 12:253–293. https://doi.org/10.5194/esd-12-253-2021
Tibshirani R (1996) Regression Shrinkage and Selection via the Lasso
Tranvik LJ, Downing JA, Cotner JB et al (2009) Lakes and reservoirs as regulators of carbon cycling and climate. Limnol Oceanogr 54:2298–2314. https://doi.org/10.4319/lo.2009.54.6_part_2.2298
Whitehead PG, Wilby RL, Battarbee RW et al (2009) A review of the potential impacts of climate change on surface water quality. Hydrol Sci J 54:101–123. https://doi.org/10.1623/hysj.54.1.101
Yan X, Zhang T, Du W et al (2024) A comprehensive review of machine learning for water quality prediction over the past five years. J Mar Sci Eng 12
Yoon J, Jarrett D, Van der Schaar M (2019) Time-series generative adversarial networks. Adv Neural Inf Process Syst 32
Zheng Y, Zhang X, Zhou Y et al (2025) Deep representation learning enables cross-basin water quality prediction under data-scarce conditions. NPJ Clean Water 8. https://doi.org/10.1038/s41545-025-00466-2
Zhou S, Long H, Chen W et al (2025) Temperature seasonality regulates organic carbon burial in lake. Nat Commun 16. https://doi.org/10.1038/s41467-025-56399-4
Figure 1 Methodological workflow for developing TOC prediction models and projecting future concentrations under climate change scenarios, detailing the key steps from data preprocessing to model application
Figure 2 Study area in South Korea, showing the locations of (a) Sacheon and (b) Docheon Reservoirs
Figure 3 Performance of ML models trained on the original and WGAN-augmented datasets for (a) Sacheon and (b) Docheon Reservoirs
Figure 4 Projected annual TOC concentrations from 2025–2100 for (a) Sacheon and (b) Docheon Reservoirs, based on 12 GCMs. For each GCM, the faint line illustrates the projected annual values, while the bold line is the long-term trend estimated by Sen's slope. Trend significance is determined by the MK test (p < 0.05), as detailed in Tables 2 and 3
Figure 5 Projected seasonal TOC cycles for the near-term (2050–2055) and long-term (2080–2085) in (a) Sacheon and (b) Docheon Reservoirs. Lines represent the ensemble mean of 12 GCMs for each period
Table 1 Distribution similarity analysis between synthetic data and real data in each reservoir. Asterisks (*) denote a statistically significant difference between the synthetic and real distributions (p < 0.05)
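| Reservoir | Variable | Kolmogorov-Smirnov statistic | Kolmogorov-Smirnov p-value | Anderson-Darling statistic | Anderson-Darling p-value | Energy distance statistic | Energy distance p-value |
|---|---|---|---|---|---|---|---|
| Sacheon | Prcp | 0.130 | 0.380 | 0.037 | 0.250 | - | - |
| Sacheon | Tmax | 0.111 | 0.575 | -0.108 | 0.250 | - | - |
| Sacheon | Tmin | 0.099 | 0.716 | -0.695 | 0.250 | - | - |
| Sacheon | Rhum | 0.079 | 0.907 | -0.880 | 0.250 | - | - |
| Sacheon | TOC | 0.225 | 0.016 | 2.099 | 0.028 | - | - |
| Sacheon | Overall variables | - | - | - | - | 0.157 | 0.948 |
| Docheon | Prcp | 0.104 | 0.653 | -0.575 | 0.250 | - | - |
| Docheon | Tmax | 0.104 | 0.653 | 0.491 | 0.250 | - | - |
| Docheon | Tmin | 0.083 | 0.878 | 0.004 | 0.250 | - | - |
| Docheon | Rhum | 0.125 | 0.422 | 0.002 | 0.250 | - | - |
| Docheon | TOC | 0.100 | 0.702 | 0.541 | 0.250 | - | - |
| Docheon | Overall variables | - | - | - | - | 0.257 | 0.597 |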
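For readers unfamiliar with the per-variable comparison reported in Table 1, the following is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic, written with the Python standard library only. It computes the statistic but not the p-value; in practice a library routine such as `scipy.stats.ks_2samp` would typically supply both. The function name and the sample values are hypothetical and do not reproduce the study's data or code.

```python
from bisect import bisect_right


def ks_two_sample_statistic(real, synthetic):
    """Largest absolute gap between the empirical CDFs of two samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Both empirical CDFs are step functions that change only at observed
    # values, so the maximum gap is attained at one of the pooled data points.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        largest_gap = max(largest_gap, abs(cdf_real - cdf_synth))
    return largest_gap


# Hypothetical observed and generated TOC values (mg/L), NOT the study's data.
real_toc = [2.1, 2.4, 1.9, 3.0, 2.7]
synthetic_toc = [2.0, 2.5, 2.2, 2.9, 2.6, 3.1]

d_stat = ks_two_sample_statistic(real_toc, synthetic_toc)
print(f"KS statistic = {d_stat:.3f}")
```

A statistic near 0 means the two empirical distributions nearly coincide, while a value near 1 means they barely overlap.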
Table 2 Results of the MK test and Sen's slope analysis applied to the projected TOC time series (2025–2100) for Sacheon Reservoir. The analysis covers 12 GCMs under SSP2-4.5 and SSP5-8.5 scenarios. Asterisks (*) denote a statistically significant trend (p < 0.05)
| Model | Scenario | Trend | Z-Score | Sen's Slope |
|---|---|---|---|---|
| ACCESS-CM2 | SSP245 | Increasing* | 2.938 | 0.003 |
| ACCESS-CM2 | SSP585 | Increasing* | 5.996 | 0.005 |
| ACCESS-ESM1-5 | SSP245 | Increasing* | 5.243 | 0.005 |
| ACCESS-ESM1-5 | SSP585 | Increasing* | 7.557 | 0.010 |
| CMCC-ESM2 | SSP245 | Increasing* | 6.472 | 0.006 |
| CMCC-ESM2 | SSP585 | Increasing* | 7.862 | 0.010 |
| CanESM5 | SSP245 | Increasing* | 4.274 | 0.005 |
| CanESM5 | SSP585 | Increasing* | 9.288 | 0.013 |
| FGOALS-g3 | SSP245 | No trend | 0.489 | 0.000 |
| FGOALS-g3 | SSP585 | Increasing* | 3.512 | 0.003 |
| KIOST-ESM | SSP245 | Increasing* | 2.381 | 0.002 |
| KIOST-ESM | SSP585 | Increasing* | 3.763 | 0.003 |
| MIROC6 | SSP245 | Increasing* | 2.435 | 0.002 |
| MIROC6 | SSP585 | Increasing* | 6.292 | 0.007 |
| MRI-ESM2-0 | SSP245 | Increasing* | 2.651 | 0.003 |
| MRI-ESM2-0 | SSP585 | Increasing* | 4.041 | 0.004 |
| NESM3 | SSP245 | Increasing* | 3.108 | 0.002 |
| NESM3 | SSP585 | Increasing* | 3.009 | 0.002 |
| NorESM2-LM | SSP245 | Increasing* | 2.731 | 0.003 |
| NorESM2-LM | SSP585 | Increasing* | 6.552 | 0.007 |
| NorESM2-MM | SSP245 | Increasing* | 3.826 | 0.004 |
| NorESM2-MM | SSP585 | Increasing* | 4.050 | 0.004 |
| TaiESM1 | SSP245 | Increasing* | 5.216 | 0.004 |
| TaiESM1 | SSP585 | Increasing* | 6.813 | 0.007 |
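The Z-score and Sen's slope columns can be illustrated with a minimal Python sketch of the Mann-Kendall test and the Sen's slope estimator. This is a sketch under stated assumptions, not the study's implementation: it uses the basic (non-seasonal) MK test without a tie correction and a normal approximation for the two-sided p-value, and the study may have used a different variant. The function names and the annual series are hypothetical.

```python
import math
from statistics import median


def mann_kendall(series):
    """Return (Z, two-sided p-value) for the basic Mann-Kendall trend test.

    Assumes no tied values (no tie correction is applied to the variance).
    """
    n = len(series)
    if n < 3:
        raise ValueError("at least three observations are required")
    # S counts later-minus-earlier pairs that increase minus pairs that decrease.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)  # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return z, p_value


def sens_slope(series):
    """Median of all pairwise slopes, assuming equally spaced time steps."""
    n = len(series)
    slopes = [
        (series[j] - series[i]) / (j - i)
        for i in range(n - 1)
        for j in range(i + 1, n)
    ]
    return median(slopes)


# Hypothetical annual mean TOC (mg/L) over ten years, NOT the study's projections.
annual_toc = [2.00, 2.03, 2.01, 2.05, 2.04, 2.08, 2.07, 2.10, 2.12, 2.11]

z_stat, p_val = mann_kendall(annual_toc)
slope = sens_slope(annual_toc)
label = "Increasing*" if (p_val < 0.05 and z_stat > 0) else "No significant increase"
print(f"Z = {z_stat:.3f}, p = {p_val:.4f}, Sen's slope = {slope:.4f} per year, {label}")
```

Sen's slope is the median of the pairwise slopes, so it is far less sensitive to a few extreme years than an ordinary least-squares slope, which suits noisy projected series.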
Table 3 Results of the MK test and Sen's slope analysis applied to the projected TOC time series (2025–2100) for Docheon Reservoir. The analysis covers 12 GCMs under SSP2-4.5 and SSP5-8.5 scenarios. Asterisks (*) denote a statistically significant trend (p < 0.05)
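| Model | Scenario | Trend | Z-Score | Sen's Slope |
|---|---|---|---|---|
| ACCESS-CM2 | SSP245 | No trend | 0.525 | 0.000 |
| ACCESS-CM2 | SSP585 | No trend | -0.049 | 0.000 |
| ACCESS-ESM1-5 | SSP245 | No trend | -0.112 | 0.000 |
| ACCESS-ESM1-5 | SSP585 | No trend | 1.135 | 0.001 |
| CMCC-ESM2 | SSP245 | No trend | 1.377 | 0.001 |
| CMCC-ESM2 | SSP585 | Increasing* | 5.099 | 0.005 |
| CanESM5 | SSP245 | No trend | 1.673 | 0.002 |
| CanESM5 | SSP585 | Increasing* | 6.283 | 0.007 |
| FGOALS-g3 | SSP245 | No trend | 0.229 | 0.000 |
| FGOALS-g3 | SSP585 | Increasing* | 5.557 | 0.006 |
| KIOST-ESM | SSP245 | No trend | 1.709 | 0.003 |
| KIOST-ESM | SSP585 | Increasing* | 4.005 | 0.006 |
| MIROC6 | SSP245 | No trend | 1.180 | 0.001 |
| MIROC6 | SSP585 | Increasing* | 5.521 | 0.005 |
| MRI-ESM2-0 | SSP245 | Increasing* | 1.978 | 0.002 |
| MRI-ESM2-0 | SSP585 | Increasing* | 2.238 | 0.002 |
| NESM3 | SSP245 | Increasing* | 3.252 | 0.004 |
| NESM3 | SSP585 | Increasing* | 7.952 | 0.015 |
| NorESM2-LM | SSP245 | Increasing* | 2.256 | 0.002 |
| NorESM2-LM | SSP585 | Increasing* | 3.539 | 0.004 |
| NorESM2-MM | SSP245 | No trend | 1.404 | 0.001 |
| NorESM2-MM | SSP585 | No trend | 0.740 | 0.001 |
| TaiESM1 | SSP245 | No trend | -0.184 | 0.000 |
| TaiESM1 | SSP585 | No trend | -1.601 | -0.002 |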
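To summarize Tables 2 and 3 side by side, the short Python sketch below tallies how many GCMs show a significant increasing trend for each reservoir and scenario. The Z-scores are transcribed from the two tables in their GCM order. The threshold of Z > 1.96 (two-sided 5% level, positive direction) is an assumption that matches the "Increasing*" labels in the tables; the function and variable names are illustrative.

```python
# Mann-Kendall Z-scores transcribed from Tables 2 and 3, in the tables' GCM order:
# ACCESS-CM2, ACCESS-ESM1-5, CMCC-ESM2, CanESM5, FGOALS-g3, KIOST-ESM,
# MIROC6, MRI-ESM2-0, NESM3, NorESM2-LM, NorESM2-MM, TaiESM1.
z_scores = {
    ("Sacheon", "SSP245"): [2.938, 5.243, 6.472, 4.274, 0.489, 2.381,
                            2.435, 2.651, 3.108, 2.731, 3.826, 5.216],
    ("Sacheon", "SSP585"): [5.996, 7.557, 7.862, 9.288, 3.512, 3.763,
                            6.292, 4.041, 3.009, 6.552, 4.050, 6.813],
    ("Docheon", "SSP245"): [0.525, -0.112, 1.377, 1.673, 0.229, 1.709,
                            1.180, 1.978, 3.252, 2.256, 1.404, -0.184],
    ("Docheon", "SSP585"): [-0.049, 1.135, 5.099, 6.283, 5.557, 4.005,
                            5.521, 2.238, 7.952, 3.539, 0.740, -1.601],
}

Z_CRITICAL = 1.96  # assumed threshold: two-sided 5% critical value of N(0, 1)


def count_significant_increases(z_values, z_critical=Z_CRITICAL):
    """Count series whose Z-score indicates a significant increasing trend."""
    return sum(1 for z in z_values if z > z_critical)


counts = {key: count_significant_increases(vals) for key, vals in z_scores.items()}

for (reservoir, scenario), n_sig in counts.items():
    n_models = len(z_scores[(reservoir, scenario)])
    print(f"{reservoir} {scenario}: {n_sig} of {n_models} GCMs with a significant increase")
```

A tally of this kind makes the between-scenario contrast explicit without relying on any single GCM.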