Revealing Dengue Dynamics Through a Novel Bin-Wise Gaussian Process Model for Probabilistic Forecasting
Ewerton Rocha Vieira1, Konstantin Mischaikow1, Claudia M.E. Romero Vivas2, Ubydul Haque3*
1 Department of Mathematics, Rutgers, NJ, USA
2 Laboratory of Tropical Diseases, Department of Medicine, Health Division, Universidad del Norte, Barranquilla 080003, Colombia
3 Rutgers Global Health Institute, Rutgers University, New Brunswick, NJ, USA
*Corresponding author
Ubydul Haque
E-mail: ubydul.kth@gmail.com
Abstract
Aedes aegypti thrives in urban settings, where socio-economic and climatic factors sustain dengue transmission. This study develops a generalizable Gaussian Process model integrating these determinants to improve incidence forecasting and intervention planning.
This study analyzed dengue transmission dynamics in Barranquilla, Colombia, using a dataset from 2018–2023 across 20 metropolitan areas. Monthly dengue cases were modeled against socio-environmental factors, including socio-economic strata, temperature, and rainfall. The dataset was normalized by adjusting dengue case counts relative to the population of each municipality, enabling comparisons across locations. Rainfall was modeled using an exponential distribution, introducing a novel approach. Socio-economic infrastructure indicators, temperature, and rainfall were discretized into 32 bins based on defined thresholds. A parameter-dependent function framework predicted month-to-month dengue progression, emphasizing environmental and socio-economic influences on disease transmission patterns.
This study utilized GP regression to predict month-to-month variations in dengue cases, leveraging socio-economic, temperature, and rainfall data. The models classified transmission dynamics into six categories: extinction, monotonic growth, oscillations, transient pulses, low-level persistence, and high uncertainty. High-risk scenarios were associated with elevated temperatures, rainfall, and lower socioeconomic strata, while conditions limiting mosquito development predicted rapid declines in transmission. Bin-specific analysis revealed ecological feedback driving oscillatory patterns and transient outbreaks. These robust predictions informed targeted mitigation strategies, despite uncertainties in certain parameter combinations, providing nuanced insights into the dynamics of dengue.
This study integrates socio-economic, environmental, and epidemiological factors into a probabilistic, adaptable model, providing a scalable framework to enhance vector-borne disease forecasting and public health decisions.
Keywords:
Vector-borne diseases
control
prevention
Introduction
Aedes aegypti has adapted from sylvatic to urban environments, with behavioral flexibility and abundant man-made breeding sites driving high mosquito densities and human contact1–4. Urban factors such as temperature, vegetation, and water practices further shape transmission patterns, making understanding these interactions essential for accurate dengue outbreak prediction5.
Dengue fever ranks among the world’s top ten health threats6,7. Lacking effective treatments or vaccines, control depends on vector management, making predictive models crucial for anticipating outbreaks and guiding interventions amid rapid urbanization and climate variability.
Early dengue forecasting efforts primarily relied on theoretical epidemiological models, such as deterministic compartmental models (e.g., SIR, SEIR, SEIAR), that simulate disease spread through human and vector populations under varying transmission parameters8–10. These models offer mechanistic insights but are limited in their predictive performance when confronted with noisy or incomplete data. For example, in “A Coupled Statistical and Deterministic Model in Selangor, Malaysia”, the SI-SIR component alone gives weaker forecasts than when coupled with climate data; similarly, SEIR-SEI models with data assimilation (e.g., SEIR-SEI-EnKF) improve but still face challenges in longer-horizon predictions11,12.
Subsequent studies introduced statistical models, including autoregressive integrated moving average (ARIMA), generalized linear models (GLM), and Poisson regression, to forecast dengue incidence based on historical case counts and climate covariates13–16. While these approaches improved short-term forecasts, they often assume linear relationships and struggle to capture complex, nonlinear interactions among socio-environmental predictors. Reported predictive accuracies range from RMSE = X–Y and correlation coefficients up to 0.8 for one-month-ahead predictions14,17,18.
Recent advances in machine learning, such as random forests, support vector machines, and deep learning models, have demonstrated enhanced flexibility in capturing nonlinear dynamics and interactions among climatic, entomological, and socioeconomic features17,19,20. Nevertheless, their black-box nature and dependence on large training datasets can limit interpretability and generalizability across different ecological or social contexts.
Thus, despite decades of progress, existing forecasting models remain limited by their inability to simultaneously integrate climatic, entomological, and socio-economic dimensions while maintaining interpretability and transferability across urban settings.
Gaussian Process (GP) regression is a flexible nonparametric method that can capture complex, nonlinear relationships in data. However, fitting a single GP across an entire dataset may be computationally expensive and can obscure local heterogeneity. To address these challenges, researchers have developed bin-wise GP modeling, where the data are partitioned into bins, defined by intervals or groups along dimensions such as time, space, or covariates, and a separate GP is fitted to each subset.
Gaussian Process (GP) regression has recently emerged as a promising nonparametric method for dengue forecasting, owing to its ability to model nonlinear relationships and quantify predictive uncertainty21–25. For instance, studies in Brazil demonstrated that GP models achieved higher short-term forecast accuracy (e.g., RMSE between X–Y and correlation up to 0.9) compared to ARIMA or random forest approaches24. In Colombia, GP-based spatio-temporal models successfully identified early outbreak signals up to four weeks in advance22. However, most published GP applications to dengue forecasting remain limited by data sparsity, regional biases, and challenges in generalizing across different epidemiological settings22,25.
To address the limitations of conventional GP models, which often assume global homogeneity and therefore fail to capture local variations, we propose a bin-wise GP modeling framework. In this approach, the dataset is partitioned into bins along relevant dimensions such as time or geography, and a separate GP is fitted to each subset. This enables the model to capture localized temporal patterns (e.g., seasonal peaks in transmission) and spatial heterogeneity (e.g., differences in ecological or socio-economic risk factors), resulting in forecasts that are both more flexible and interpretable.
To the best of our knowledge, bin-wise GP modeling has not previously been applied to dengue forecasting. We hypothesize that socio-economic strata, entomological indices, mean temperature, and mean rainfall are significant predictors of dengue dynamics, and that the proposed bin-wise GP framework will more effectively capture their localized temporal and spatial effects than a single, global model.
Although the present analysis focuses on a single urban setting in Colombia, the proposed modeling framework is designed to be generalizable. Its structure allows the incorporation of diverse climatic, demographic, and socio-economic inputs, meaning that once trained and validated, the same framework can be transferred to other cities, such as Ibagué, Cali, or Villavicencio, by retraining on locally available data. This adaptability stems from the nonparametric and modular nature of the GP approach, which does not require predefined parametric forms or site-specific assumptions.
The objectives of this study are threefold: i) To quantify the predictive power of socio-economic and weather factors in explaining monthly variations in dengue incidence, ii) To evaluate the utility of bin-wise modeling approaches for identifying high-risk transmission scenarios and uncovering nonlinear interactions among variables, and iii) To optimize Gaussian Process models for enhanced predictive accuracy and uncertainty quantification, thereby improving the reliability of dengue forecasting across varied urban contexts.
Material and methods
Source of data
This study used a comprehensive dengue dataset from Barranquilla, Colombia, spanning 2018 to 2023 and covering 20 metropolitan areas (Fig. 1). The dataset includes the total yearly population of each metropolitan area from 2018–2023, dengue cases, the pupal house index (percentage of houses positive for the presence of Ae. aegypti pupae), and socio-economic strata, a classification system that divides urban areas into six residential strata (1 to 6) based on household income, with stratum 1 the poorest and stratum 6 the richest. A single neighborhood can contain more than one stratum (Table 1)26.
Monthly mean temperature and mean rainfall at the municipality level from January 2018 to December 2023 were extracted from Weather Spark [16]. The assembled dataset provides monthly records of reported dengue cases along with key contextual variables, including population size, pupal house index, socio-economic strata, mean temperature, and mean rainfall. The study models monthly rainfall using an exponential distribution, an approach that contrasts with the linear or categorical treatments used in prior studies.
Our objective was to model the month-to-month progression of dengue cases as a function of relevant socio-environmental parameters. We defined a class of parameter-dependent functions ƒθ : [0,∞) → [0,∞), where the function maps normalized dengue case counts in a given month (x) to those in the following month (y). The data were structured as sequential pairs (x, y), representing normalized case counts across time for each metropolitan area.
Dataset
Step 1: Data normalization and organization
To develop a location-independent model of dengue transmission, monthly case counts at the municipality level were normalized on a per capita basis. This standardization enabled meaningful comparisons of disease burden across metropolitan areas with differing population sizes. The data were structured as paired observations (x, y), where x represents the normalized case count for a given month and y represents the normalized count for the subsequent month in the same metropolitan area, resulting in a total of 1,420 data points.
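The normalization and pairing step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `make_pairs`, the synthetic case counts, and the use of a single population figure per area (the source reports yearly populations) are all assumptions made for brevity.

```python
from typing import Dict, List, Tuple


def make_pairs(cases: Dict[str, List[int]],
               population: Dict[str, int]) -> List[Tuple[float, float]]:
    """Return (x, y) pairs of per-capita cases for consecutive months."""
    pairs: List[Tuple[float, float]] = []
    for area, series in cases.items():
        # Per-capita normalization (assumption: one population value per area).
        rate = [count / population[area] for count in series]
        # Pair each month with the following month in the same area.
        pairs.extend(zip(rate[:-1], rate[1:]))
    return pairs


# Synthetic placeholder data: 20 areas, 72 months (Jan 2018 to Dec 2023).
cases = {f"area_{i}": [(i + m) % 7 for m in range(72)] for i in range(20)}
population = {f"area_{i}": 50_000 + 1_000 * i for i in range(20)}

pairs = make_pairs(cases, population)
print(len(pairs))  # 20 areas x 71 consecutive-month pairs = 1420
```

With 72 monthly observations per area, each area contributes 71 consecutive-month pairs, so 20 areas give 1,420 pairs, which is consistent with the total reported above.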
Explanatory variables were discretized as follows: i) yearly socio-economic strata were grouped into four distinct categories based on income and infrastructure indicators at the municipality level; ii) temperature was binned into two groups using the dataset's mean monthly average temperature (31.7°C) as the threshold; and iii) monthly average rainfall values were modeled using an exponential distribution and classified into quantiles for binning.
Table 1
Parameter binning scheme for dengue transmission modeling
| Parameter | Bin label | Range / Value | Description |
|---|---|---|---|
| Socio-Economic Strata | SE₁ | 1.0 | Lowest socio-economic group |
| | SE₂ | 1.2 | Low-moderate group |
| | SE₃ | 2.3 | Moderate-high group |
| | SE₄ | 3.0 | Highest socio-economic group |
| Temperature (°C) | T₁ | < 31.70 | Low temperature (Temp_low) |
| | T₂ | ≥ 31.70 | High temperature (Temp_high) |
| Rainfall (mm) | R₁ (q1) | [0, 0.67) | 0–25th percentile |
| | R₂ (q2) | [0.67, 1.62) | 25–50th percentile |
| | R₃ (q3) | [1.62, 3.23) | 50–75th percentile |
| | R₄ (q4) | [3.23, ∞) | 75–100th percentile |
| Total Bin Combinations | - | 4 (SE) × 2 (Temp) × 4 (Rainfall) = 32 | Total combinations of θ = (θ₁, θ₂, θ₃) |
Step 2: Defining the parameter space
To capture external influences on transmission, socio-economic strata, mean temperature, and mean rainfall were integrated as exogenous variables [17]. These variables were then modeled as key parameters influencing dengue transmission dynamics. Due to data sparsity, the parameters were discretized into 32 bins to enable robust and stratified analysis. For temperature, the range was divided at a threshold of 31.70°C into low (−∞, 31.70) and high [31.70, ∞) categories, matching the half-open intervals in Table 1. Rainfall and socio-economic strata were similarly binned using quantile-based segmentation, which partitions data into equal-probability intervals independent of calendar time. Socio-economic strata were categorized into four discrete subgroups: 1, 1.2, 2.3, and 3.
Monthly average rainfall was modeled using an exponential distribution (see Fig. 2). Based on the quantiles of this distribution, rainfall data were divided into four bins (Table 1). The resulting parameter space was thus defined by the combination of these three variables, forming 4 (socio-economic) × 2 (temperature) × 4 (rainfall) = 32 distinct bins. Each bin is indexed by a parameter vector θ = (θ₁, θ₂, θ₃), where θ₁ ∈ {1, 1.2, 2.3, 3} denotes the socio-economic group, θ₂ ∈ {low, high} denotes the temperature category, and θ₃ ∈ {q1, q2, q3, q4} denotes the rainfall quantile bin.
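The quantile thresholds of an exponential distribution have a closed form, q_p = −μ ln(1 − p), where μ is the mean. The sketch below uses this to derive rainfall bin edges. The mean of 2.33 mm is an assumption back-computed from the Table 1 thresholds and is not reported in the text. The helper names `exponential_quantile` and `rainfall_bin` are hypothetical.

```python
import math
from typing import Sequence


def exponential_quantile(p: float, mean: float) -> float:
    """Inverse CDF of an exponential distribution with the given mean."""
    return -mean * math.log(1.0 - p)


def rainfall_bin(rain: float, edges: Sequence[float]) -> str:
    """Map a rainfall value to q1..q4 using half-open intervals [lo, hi)."""
    for k, edge in enumerate(edges, start=1):
        if rain < edge:
            return f"q{k}"
    return f"q{len(edges) + 1}"


# Assumed mean rainfall (mm); inferred from Table 1, not stated in the paper.
MEAN_RAIN = 2.33
edges = [exponential_quantile(p, MEAN_RAIN) for p in (0.25, 0.50, 0.75)]

print([round(e, 2) for e in edges])  # [0.67, 1.62, 3.23]
print(rainfall_bin(1.0, edges))      # q2
```

Under this assumed mean, the 25th, 50th, and 75th percentiles reproduce the rainfall thresholds listed in Table 1, which suggests the bins are quantiles of the fitted distribution rather than of the empirical data.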
Step 3: Data Binning and Gaussian Process Training
For each parameter bin θ, the normalized dengue case data were collected as Dθ = {(xₙ, yₙ) ∈ [0,∞) × [0,∞) | n = 1, …, Nθ}, where the socio-economic status, temperature, and rainfall for the month associated with xₙ lie in the bin θ = (θ₁, θ₂, θ₃) and Nθ is the total number of elements in the bin θ. A total of 32 such parameter bins were generated. For each of the 32 parameter bins, a Gaussian Process was fitted using a radial basis function kernel, assuming that the data are subject to Gaussian noise. In the Gaussian Process (GP) regression model, the GPy library was utilized to fit the underlying data using the specified kernel, initializing the variance in [0.1, 1.0, 10], the length scale in [0.1, 1.0, 10], and the noise in [0.01, 0.1, 1.0, 10]29.
For each initial combination of hyperparameter values (variance, length scale, and noise), the model was optimized by maximizing the marginal log-likelihood, allowing refinement of the kernel hyperparameters to enhance predictive performance. The model that achieved the highest log-likelihood was selected as the best-performing configuration. This procedure ensured stability in parameter estimation and robustness in the optimization process.
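The model-selection criterion can be illustrated with a dependency-free sketch that evaluates the log marginal likelihood of a zero-mean GP with an RBF kernel at each of the 3 × 3 × 4 = 36 starting combinations and keeps the best one. This is a simplification, not the study's procedure: the study optimizes the hyperparameters from each start with GPy, whereas this sketch only scores the starting values. The zero mean, the treatment of noise as a variance added to the diagonal, and the toy data are assumptions.

```python
import math
from itertools import product

VARIANCES = (0.1, 1.0, 10.0)
LENGTH_SCALES = (0.1, 1.0, 10.0)
NOISES = (0.01, 0.1, 1.0, 10.0)


def rbf_gram(xs, variance, length_scale, noise):
    """RBF kernel matrix with the noise variance added on the diagonal."""
    n = len(xs)
    return [[variance * math.exp(-0.5 * ((xs[i] - xs[j]) / length_scale) ** 2)
             + (noise if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def log_marginal_likelihood(xs, ys, variance, length_scale, noise):
    """log p(y | x) for a zero-mean GP: -0.5 y'K^-1 y - 0.5 log|K| - 0.5 n log(2 pi)."""
    low = cholesky(rbf_gram(xs, variance, length_scale, noise))
    # Forward substitution solves L z = y, so that y' K^-1 y = z' z.
    z = []
    for i, row in enumerate(low):
        z.append((ys[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    n = len(xs)
    return (-0.5 * sum(v * v for v in z)
            - sum(math.log(low[i][i]) for i in range(n))
            - 0.5 * n * math.log(2.0 * math.pi))


def best_start(xs, ys):
    """Return the (variance, length_scale, noise) start with the highest likelihood."""
    grid = product(VARIANCES, LENGTH_SCALES, NOISES)
    return max(grid, key=lambda h: log_marginal_likelihood(xs, ys, *h))


# Toy (x, y) pairs standing in for one bin's normalized case data.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.1, 0.4, 0.7, 0.9, 1.4]
print(best_start(xs, ys))
```

Starting the optimizer from several initial values and keeping the highest-likelihood fit guards against local optima of the marginal likelihood, which is the stated motivation for the multi-start procedure.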
The bins were ordered from 0 to 31 as follows:
Bin_0 = Socio_1.0-Temp_low-Rainfall_q1, Bin_1 = Socio_1.0-Temp_low-Rainfall_q2
Bin_2 = Socio_1.0-Temp_low-Rainfall_q3, Bin_3 = Socio_1.0-Temp_low-Rainfall_q4
Bin_4 = Socio_1.0-Temp_high-Rainfall_q1, Bin_5 = Socio_1.0-Temp_high-Rainfall_q2
Bin_6 = Socio_1.0-Temp_high-Rainfall_q3, Bin_7 = Socio_1.0-Temp_high-Rainfall_q4
⁝
Bin_28 = Socio_3.0-Temp_high-Rainfall_q1, Bin_29 = Socio_3.0-Temp_high-Rainfall_q2
Bin_30 = Socio_3.0-Temp_high-Rainfall_q3, Bin_31 = Socio_3.0-Temp_high-Rainfall_q4
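The listed ordering is consistent with a mixed-radix index in which rainfall varies fastest, then temperature, then socio-economic group. The sketch below encodes that mapping. It assumes the elided bins follow the same pattern as the ones shown, and the function names `bin_index` and `bin_label` are hypothetical.

```python
SOCIO = (1.0, 1.2, 2.3, 3.0)
TEMP = ("low", "high")
RAIN = ("q1", "q2", "q3", "q4")


def bin_index(socio: float, temp: str, rain: str) -> int:
    """Index in 0..31: rainfall varies fastest, socio-economic group slowest."""
    return ((SOCIO.index(socio) * len(TEMP) + TEMP.index(temp)) * len(RAIN)
            + RAIN.index(rain))


def bin_label(index: int) -> str:
    """Inverse of bin_index, formatted like the labels used in the text."""
    rest, r = divmod(index, len(RAIN))
    s, t = divmod(rest, len(TEMP))
    return f"Socio_{SOCIO[s]}-Temp_{TEMP[t]}-Rainfall_{RAIN[r]}"


print(bin_index(1.0, "high", "q1"))  # 4
print(bin_label(28))                 # Socio_3.0-Temp_high-Rainfall_q1
print(bin_label(19))                 # Socio_2.3-Temp_low-Rainfall_q4
```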
Each figure presents pairs of data points (x,y), where x represents the normalized dengue cases for the bin indicated in the title, and y represents the normalized cases in the subsequent month. In the upper left plot of Fig. 3A, each point corresponds to dengue cases in Bin_0 (Socio_1.0–Temp_low–Rainfall_q1) on the x-axis, and cases in the following month on the y-axis. Red dots indicate that the subsequent month remains in the same bin. Blue dots indicate a transition to a different bin, with the adjacent number denoting the destination bin (e.g., a blue dot labeled "4" signifies a transition to Bin_4, Socio_1.0–Temp_high–Rainfall_q1, reflecting a change from low to high temperature).
Step 4: Gaussian Process Modeling
For each of the 32 parameter-defined bins, a GP model was trained to characterize the relationship between dengue cases in each month and those in the subsequent month. A separate GP regression was fitted to the data subset corresponding to each bin. Hyperparameters, including variance, length scale, and noise, were systematically optimized to enhance model robustness and data efficiency. The models used a radial basis function kernel, offering a non-parametric, probabilistic framework suitable for capturing nonlinear dynamics in dengue transmission. Monthly rainfall was modeled using an exponential distribution, representing a novel alternative to the linear or categorical approaches commonly used in previous studies [19].
This binning strategy enabled the application of Gaussian Process (GP) modeling with a Radial Basis Function (RBF) kernel to capture non-linear dependencies between normalized case progression and exogenous parameters. The GP framework also facilitates uncertainty quantification in predictions, which is critical for modeling complex, real-world disease dynamics.
Results
The Gaussian Process (GP) regression models effectively predicted month-to-month variations in dengue cases by integrating normalized case counts with socio-economic strata, temperature, and rainfall data. Integrating socio-economic and weather variables demonstrated strong predictive capability for dengue forecasting. Optimization of hyperparameters (e.g., variance, length scale) and the use of radial basis function kernels ensured robust predictions and stable parameter estimation. The bin-wise modeling approach enabled the identification of high-risk transmission scenarios and supported the development of targeted mitigation strategies. Computational results are presented in Figs. 3–9.
Bin-specific dengue case predictions
The GP models demonstrated significant variation in dengue transmission dynamics across different socio-economic strata and environmental conditions. Higher temperatures and lower socio-economic strata were notably associated with more persistent dengue cases.
Parameter bins were identified where the posterior Gaussian Process predictive distribution assigned ≥ 95% probability to a rapid decline in dengue cases. These scenarios reflected combinations of socio-economic status, temperature, and rainfall that ecologically constrained mosquito development, particularly during pupal and larval stages, thereby preventing sustained transmission30. The resulting collapse of transmission chains consistently led to a sharp reduction in cases, as illustrated across subpanels (Figs. 3A–3G), all of which exhibited extinction-dominated dynamics (Fig. 3).
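Because the GP posterior predictive at a given input is Gaussian, the probability of a month-to-month decline can be read off the normal CDF. The sketch below assumes that "decline" means y < x, which is an illustrative definition and not one stated in the text. The predictive mean and standard deviation values are placeholders, and the function names are hypothetical.

```python
import math


def prob_decline(x: float, mean: float, std: float) -> float:
    """P(y < x) when the predictive distribution is y ~ Normal(mean, std**2)."""
    z = (x - mean) / (std * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def is_extinction_like(x: float, mean: float, std: float,
                       level: float = 0.95) -> bool:
    """True when the predictive distribution puts at least `level` mass below x."""
    return prob_decline(x, mean, std) >= level


# Placeholder predictive values at current normalized case count x = 1.0.
print(is_extinction_like(1.0, mean=0.4, std=0.3))  # True  (x is 2 std above the mean)
print(is_extinction_like(1.0, mean=0.8, std=0.3))  # False
```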
Bins exhibited a high posterior probability of sustained month-to-month increases in dengue cases (Fig. 4). These regimes were characterized by temperatures exceeding 31.7°C and/or rainfall in the upper quantiles, conditions that supported elevated survival of immature mosquito stages and increased adult vector density31. Figures 4A and 4B both reflected this persistent growth pattern, with consistently high predicted trajectories.
The models identified parameter bins associated with persistent, high-amplitude oscillations in dengue incidence. These dynamics were characterized by an initial growth phase, followed by an overshoot and sustained cyclical patterns (Figs. 5A–F). These dynamics indicated strong ecological feedback, likely driven by alternating phases of vector population build-up and resource depletion, which repeatedly pushed transmission above epidemic thresholds. To highlight the consistency of these oscillatory behaviors, the environmental parameters were fixed (temperature kept low and rainfall in the fourth quantile) while the socio-economic stratum was varied slightly from 1.2 to 2.3. Figs. 5C and 5E show the same oscillatory patterns despite this small socio-economic change, demonstrating the robustness of these patterns under identical environmental conditions and minor changes in socio-economic conditions.
Cases exhibited a brief rise followed by rapid collapse (Figs. 6A–C). A transient period of favorable temperature and rainfall permitted short-term transmission; however, conditions deteriorated quickly enough to cause the outbreak to self-limit.
High-uncertainty bins were characterized by wide 95% confidence intervals in the Gaussian Process posterior, with the mean spanning a broad range of observed values. This indicated that the available data were insufficient to resolve a dominant transmission regime. The resulting dynamics were classified into five types: likely extinction despite uncertainty, and scenarios driven by potential ecological variability (Fig. 7).
Despite wide confidence intervals, all panels exhibited high-probability extinction scenarios (Fig. 7-1A–G). These high-uncertainty classes consistently indicated likely fade-out of dengue transmission across varying parameter combinations. One bin exhibited a high-uncertainty trajectory, marked by an initial transient outbreak followed by a probable fade-out (Fig. 7-2A). In another, a gradual upward trend in predicted cases was observed, and high uncertainty persisted across the posterior distribution (Fig. 7-3A). Inconclusive outcomes were also observed, with multimodal posterior distributions indicating competing transmission trajectories and persistent uncertainty in system dynamics (Figs. 7-4A–E).
Using the same binning approach, Fig. 8A showed rapid fade-out dynamics consistent with the extinction regime. Figure 8B revealed a high posterior likelihood of sustained month-to-month increases, while Fig. 9 demonstrated a monotonic growth regime characterized by high uncertainty and a slow, steady upward trend.
Bin-wise Gaussian Process modeling identified six distinct dengue transmission dynamics: extinction, monotonic growth, periodic oscillations, transient pulses, low-level persistence, and high-uncertainty outcomes, each linked to specific socio-environmental conditions. This classification provided nuanced insights into how transmission shifts across ecological and social contexts. Models fitted using only red-labeled data points assumed stable temperature and rainfall in the following month, reflecting locally stationary environmental conditions and allowing performance evaluation under fixed weather variables.
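To make the link between a one-step map and these regimes concrete, the sketch below iterates a hypothetical map x → f(x) and labels the tail of the trajectory. It is an illustration only and is not the classification procedure used in this study: the example maps, the tail length, the tolerance, and the labeling rules are all assumptions, and it ignores predictive uncertainty entirely.

```python
from typing import Callable, List


def iterate(f: Callable[[float], float], x0: float, steps: int = 24) -> List[float]:
    """Iterate x_{t+1} = f(x_t), clipping at zero since case counts are non-negative."""
    traj = [x0]
    for _ in range(steps):
        traj.append(max(0.0, f(traj[-1])))
    return traj


def classify(traj: List[float], eps: float = 1e-3) -> str:
    """Label the last six points of a trajectory with a coarse regime name."""
    tail = traj[-6:]
    diffs = [b - a for a, b in zip(tail, tail[1:])]
    if max(tail) < eps:
        return "extinction"
    if all(d > 0 for d in diffs):
        return "monotonic growth"
    sign_changes = sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0)
    if sign_changes >= 3:
        return "oscillation"
    return "other"


# Hypothetical one-step maps standing in for a fitted GP posterior mean.
print(classify(iterate(lambda x: 0.5 * x, 1.0)))              # extinction
print(classify(iterate(lambda x: 1.1 * x, 1.0)))              # monotonic growth
print(classify(iterate(lambda x: 3.2 * x * (1.0 - x), 0.3)))  # oscillation
```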
Discussion
By optimizing hyperparameters (variance, length scale, and noise) to maximize the marginal log-likelihood, the study identified the best-performing models for each bin. The use of radial basis function kernels ensured stability in parameter estimation and robust predictions. The study visualized how dengue cases transitioned between bins across months. Red dots in the result figures indicated cases remaining within the same bin, while blue dots represented transitions to different bins, providing insights into how environmental changes affect dengue spread. The visualization of data points in different bins uncovers hidden movement between dengue transmission categories (e.g., from low-temperature to high-temperature conditions). The use of blue vs. red dots to track movement between bins provides a novel way of understanding how environmental and socio-economic changes influence dengue progression over time.
Consistent with other published studies [17, 20, 21, 22], persistent dengue transmission and growth were strongly associated with high temperatures (≥ 31.7°C), upper-quantile rainfall, and lower socio-economic strata, conditions that favor immature mosquito survival and increased vector density. By integrating socio-economic strata alongside climatic variables, this study moves beyond traditional models that focus solely on environmental drivers. Normalizing dengue cases by population enhances geographic scalability, while binning socio-economic and weather variables enables pattern recognition even in data-limited contexts. Classifying socio-economic strata into four discrete categories provides a novel framework for assessing disease burden across diverse populations and highlights how disparities shape outbreak dynamics.
A particularly illustrative case is Bin_19, defined by socio-economic stratum 2.3, low temperature, and high rainfall. Under transitory weather assumptions, the model predicts a sustained rise in dengue cases with high-amplitude oscillations, suggesting a stable endemic or cyclic epidemic regime (Fig. 5E). In contrast, assuming stationary weather conditions (i.e., using data confined to the same bin) results in a projected plateau in case numbers, indicating a fundamentally different transmission dynamic (Fig. 9). These findings underscore the probabilistic nature of dengue transmission and the need to account for both environmental variability and socio-economic context in predictive modeling.
Visualizations of red and blue dots captured transitions between parameter bins, offering insights into how fluctuating environmental conditions affect dengue spread. High-uncertainty bins revealed limitations in data resolution, with multimodal posterior distributions suggesting competing transmission trajectories that challenge forecasting under volatile ecological conditions.
This study presents a structured binning method for socio-economic status, temperature, and rainfall, converting continuous variables into discrete categories to enhance model interpretability and enable targeted interventions across different risk zones. By systematically tuning the variance, length scale, and noise parameters, as in other studies [10, 11, 12], it establishes a robust framework for optimizing GP models in epidemiology that is applicable beyond dengue to other vector-borne diseases.
A key innovation is the application of GP regression to model dengue case variations while explicitly accounting for uncertainty in epidemiological data. The binning strategy ensures data-efficient representation, improving model performance and interpretability in data-sparse environments. The resulting probabilistic model offers a flexible, adaptive approach for dengue forecasting, supporting vector control, resource allocation, and early warning systems.
This research contributes a location-independent, unified modeling framework that integrates socio-economic, climatic, and epidemiological factors, enhancing predictive accuracy in limited-data contexts. By bridging gaps in existing methods, it advances data-driven, scalable models with strong policy relevance. Ultimately, this methodology deepens understanding of dengue dynamics under varying environmental and socio-economic conditions, informing public health interventions and forecasting efforts.
This study addresses key limitations in dengue forecasting by integrating epidemiological, socio-economic, and climate variables within a probabilistic GP framework. Unlike traditional deterministic models [23, 24, 25, 26], this approach quantifies uncertainty through confidence intervals, enhancing predictive reliability in data-sparse settings. It supports targeted interventions based on risk factors and is adaptable to other vector-borne diseases such as malaria, Zika, and chikungunya, offering a valuable tool for climate change adaptation and broader infectious disease modeling.
Compared with previous studies [27, 28], this study enhanced predictive accuracy by selecting the best-performing GP model based on maximum log-likelihood, offering a data-driven framework for optimizing dengue forecasts. High posterior variance in certain bins indicated epistemic uncertainty, likely due to limited observations and abrupt within-month weather shifts. Expanding the dataset, via longer time series and finer temporal resolution, could narrow confidence intervals and better distinguish underlying transmission regimes.
This restrictive, same-bin modeling approach serves two purposes: (1) validating inferences from models trained on variable conditions, and (2) assessing the sensitivity of dengue dynamics to short-term environmental constancy. Results reveal that even within a one-month horizon and stable weather parameters, dengue trajectories vary widely, underscoring the influence of initial conditions on outbreak outcomes.
This study used passive surveillance-based pupae surveys and aggregated dengue case data at the municipality level. High model uncertainty highlights the need for additional data to improve predictive accuracy and reduce epistemic uncertainty. At the same time, when dengue dynamics are driven by abrupt, short-lived weather fluctuations, simply increasing the number of monthly observations is inadequate. Instead, data must be collected at finer temporal resolutions (e.g., bi-weekly or weekly) to capture rapid regime shifts. This modeling challenge is exemplified by Bin_19, where identical environmental conditions yield divergent epidemiological outcomes, depending on whether climate variables are treated as static or dynamic (Fig. 5E and Fig. 9). These findings reveal the limitations of monthly forecasting in contexts with high sensitivity to environmental variability and emphasize the need for richer datasets and finer-grained temporal models to resolve competing transmission hypotheses.