Title: Using Wearable Data to Detect Depression in a Cross-Population Study

Miriam I. Hehlmann, Ph.D.¹, Rayyan Tutunji, Ph.D.², Wolfgang Lutz, Ph.D.³, Carlotta L. Rieble, M.Sc.², Ricarda K. K. Proppert, M.Sc.², Fabienne Mink, M.Sc.³, Julian A. Rubel, Ph.D.¹, Marieke Schreuder, Ph.D.⁴, & Eiko I. Fried, Ph.D.²

¹Osnabrück University

²Leiden University

³Trier University

⁴KU Leuven, Tilburg University

Corresponding author:

Miriam Ina Hehlmann, Ph.D.

Osnabrück University

Lise-Meitner-Strasse 3, 49078 Osnabrück, Germany

Email: miriam.hehlmann@uni-osnabrueck.de

Phone: +495419697612

Manuscript word count

4366 words; main text = 2202 words

Abstract

Early detection of depression is crucial, yet current assessment methods depend on self-report questionnaires and clinical interviews, which are resource-intensive. Wearable devices provide a scalable way to assess physiological and behavioral features, but their predictive value across clinical and non-clinical populations remains insufficiently established. Wearable-derived features were collected from a student sample (n = 187) and an outpatient sample (n = 95). Depressive symptoms were assessed using the Patient Health Questionnaire-9 (PHQ-9), and participants were categorized as depressed (≥ 10) or non-depressed (< 10). An elastic net regularized logistic regression model was used for classification, with performance evaluated in held-out test data. Sensitivity analyses controlled for age and sleep timing, tested alternative PHQ-9 cutoffs, and compared the model with baseline models with and without wearable features. Across the combined sample (n = 282), the model achieved good discriminative performance (area under the curve = 0.82; accuracy = 79%). Sensitivity analyses revealed that site was a strong predictor, but Garmin-derived features still added incremental value. Minimum awake heart rate, variability in sleep duration, and maximum step count emerged as the strongest predictors. Wearable-derived features show promise for detecting depressive symptoms across clinical and non-clinical populations. Site-specific factors should be considered in future applications and replications.

Keywords:

Depression detection

Passive sensing

Machine learning

Digital Phenotyping

Multi-site data

Introduction

Depression is one of the most pressing health-related issues, significantly impacting quality of life. It can cause severe and enduring functional impairments, reduce workplace productivity, and increase the risk of suicidality, and it is often considered a chronic disease¹⁻³. Despite advances in treatments, their effectiveness has remained limited, with many individuals still experiencing depressive symptoms post-treatment⁴,⁵. As interventions in routine care are often initiated when symptoms are already well established, early and accurate detection may represent an important opportunity to prevent chronic courses and potentially improve treatment outcomes. Despite its importance, the assessment of depression largely relies on traditional methods such as clinical observations, patient history, and self-report questionnaire data. In line with this, the current study uses the term depression to refer to symptom severity as measured by self-report screening tools, rather than a clinical diagnosis. These traditional assessment approaches, while valuable, present several challenges: they are time-consuming, often depend on potentially biased retrospective self-report⁶, and are further complicated by a growing shortage of mental health professionals⁷.

In recent years, wearable technologies such as smartwatches and fitness trackers have attracted attention as potential tools for unobtrusively detecting depression. They monitor physiological indicators such as heart rate, physical activity, and sleep patterns, which are closely related to core depressive symptoms including sleep disturbances and reduced energy or motivation⁸. A recent meta-analysis by Abd-Alrazaq et al.⁹ and a systematic review by De Angel et al.¹⁰ support the potential of wearable data to detect depressive symptomatology. However, both reviews emphasize that the field is still at an early stage, with several challenges. First, half of the included studies relied on small samples (≤ 55 participants), calling for work in larger datasets⁹. Second, current findings are limited in generalizability¹⁰, calling for studies in more diverse and representative populations. Third, while several recent large-scale studies¹¹,¹² include thousands of participants and advance the field methodologically, these samples are typically drawn from general populations without confirmed clinical diagnoses. It thus remains unclear whether wearable-based depression detection performs consistently across clinically diagnosed outpatients and non-clinical groups such as university students, and to our knowledge, no study has evaluated performance across both populations within the same analytic framework.

We aim to tackle these three challenges in the present work. By using two distinct datasets, one from an outpatient sample and the other from a student sample, our study strengthens this area of research and provides a more comprehensive analysis by including larger, more diverse, and clinically relevant samples. The present work is designed as a classification study, aiming to distinguish between individuals above and below a clinically relevant threshold of depressive symptom severity. The study has two main objectives: (1) to study the utility of wearable data for detecting depression, and (2) to identify the most predictive indicators for this purpose. By combining data from different countries and populations, and by using affordable, consumer-grade wearable technology, our research aims to advance understanding in this promising area and contribute to the development of scalable tools for depression detection.

Results

After excluding participants with missing data, the final dataset comprised 282 participants. Among them, 129 were classified as depressed based on the predefined PHQ-9 cutoff (≥ 10), including 47 from the [masked] A sample and 82 from the [masked] B sample. On average, the [masked] B sample showed considerably higher depression scores (M = 15, SD = 4.67) than the [masked] A sample (M = 7.15, SD = 4.46). Table 2 presents the descriptive statistics (M, SD) of the predictor variables for the depressed and non-depressed groups. Site-specific descriptive statistics for each variable are provided in the supplemental material (Table 2).

Table 2

Descriptive Statistics of Predictor Variables Across Depressed and Non-Depressed Groups
Variable	Non-depressed (n = 153), M (SD)	Depressed (n = 129), M (SD)	p
Steps
Mean	6314 (2339)	6359 (2905)	0.89
Minimum	1333 (1494)	1536 (1569)	0.27
Maximum	12968 (4650)	12422 (5025)	0.35
SD	3355 (1305)	3182 (1341)	0.27
Sleep (in minutes)
Mean	510 (45.4)	487 (61.3)	< 0.001
Minimum	373 (69.8)	339 (97.2)	< 0.001
Maximum	636 (73.7)	621 (92.4)	0.131
SD	74.8 (28.1)	84.6 (37.9)	0.016
Start time (in minutes from midnight)
Mean	750 (363)	926 (372)	< 0.001
SD	503 (179)	356 (264)	< 0.001
Stop time (in minutes from midnight)
Mean	511 (83.8)	435 (121)	< 0.001
SD	76.8 (26.0)	87.7 (64)	0.073
Awake HR
Mean	80.6 (7.75)	81.1 (8.10)	0.61
Minimum	61.3 (7.59)	60.4 (7.66)	0.3
Maximum	113 (12.3)	126 (20.9)	< 0.001
SD	12.1 (2.02)	13.4 (2.78)	< 0.001
Resting HR
Mean	64.2 (8.04)	69.2 (9.32)	< 0.001
Minimum	55.7 (7.40)	58.7 (8.32)	0.002
Maximum	84.0 (9.61)	93.6 (15.8)	< 0.001
SD	5.72 (1.49)	7.21 (2.31)	< 0.001
Kurtosis	1.28 (1.90)	0.827 (1.77)	0.038
Skewness	1.06 (0.528)	0.929 (0.523)	0.043
Note. M = Mean; SD = Standard deviation; HR = Heart rate.

The final elastic net model, selected via 10-fold cross-validation, used ridge regularization (alpha = 0) with an optimal penalty parameter of lambda = 0.02. The model achieved a strong AUC of 0.82 in both cross-validation and the independent test set (95% CI [0.75, 0.96])¹³, an accuracy of 0.79, and an F1-score of 0.80. Sensitivity and specificity were 0.79, indicating balanced performance in identifying depressed and non-depressed participants (see Table 3 for key metrics). The most influential predictors were the SD of sleep duration, minimum awake heart rate, and maximum step count; greater sleep variability, lower minimum awake heart rate, and fewer maximum daily steps were associated with depression. A full ranking of predictor importance, including variable importance, standardized coefficients, and odds ratios, is provided in Table 3 of the supplementary material.

To assess whether Garmin-derived features added predictive value beyond site, we compared the elastic net model using Garmin features only with two further models: a baseline logistic regression using site only, and an elastic net model including site plus Garmin features. As shown in Table 3, the site-only model achieved high sensitivity (0.92) but moderate specificity (0.65), with a lower AUC (0.78) and F1 score (0.74). The elastic net model including both site and Garmin features improved overall performance (accuracy = 0.81, AUC = 0.85, F1 score = 0.74), while the elastic net model based on Garmin features alone achieved slightly lower AUC (0.82) and accuracy (0.79) but the highest F1 score (0.82) and balanced sensitivity (0.83) and specificity (0.73). Full confusion matrices are provided in the supplementary material (Table 4). These findings indicate that while site captures substantial predictive signal, Garmin-derived features also provide incremental discriminative value across models.

Table 3

Classification Performance Across Models: Site Only, Site + Garmin, and Elastic Net (Garmin Only)
Model	Accuracy	Sensitivity	Specificity	AUC	F1 Score
Elastic Net Garmin only	0.79	0.83	0.73	0.82	0.82
Site only	0.79	0.92	0.65	0.78	0.74
Elastic Net Site + Garmin	0.81	0.89	0.70	0.85	0.77
Note. Values show key performance metrics for each model. Sensitivity = proportion of correctly identified depressed cases; Specificity = proportion of correctly identified non-depressed cases; AUC = area under curve.
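
The metrics reported in Table 3 can all be derived from a 2 × 2 confusion matrix and from the predicted scores. The following is a minimal Python sketch for illustration only; the study's analyses were run in R, and the function names and example counts here are hypothetical and are not taken from the study's confusion matrices.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity, and F1 from confusion-matrix counts.

    "Positive" means depressed (PHQ-9 >= 10), matching the Table 3 note.
    """
    sensitivity = tp / (tp + fn)   # proportion of depressed cases identified
    specificity = tn / (tn + fp)   # proportion of non-depressed cases identified
    precision = tp / (tp + fp)     # proportion of predicted cases that are depressed
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def auc(scores_depressed, scores_non_depressed):
    """Rank-based AUC: probability that a depressed case receives a higher
    predicted score than a non-depressed case (ties count as one half)."""
    wins = 0.0
    for p in scores_depressed:
        for n in scores_non_depressed:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_depressed) * len(scores_non_depressed))


# Hypothetical counts and scores, for illustration only.
metrics = classification_metrics(tp=30, fp=10, tn=35, fn=5)
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3])
```

Because F1 combines precision and sensitivity and ignores true negatives, a model can rank differently on F1 than on accuracy or AUC.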

In our second sensitivity analysis, we re-estimated the models in the combined dataset while controlling for age and for sleep start time. Performance decreased slightly in both cases, but the overall classification pattern remained stable. Notably, the same three top predictors emerged when controlling for age. When controlling for bedtime, SD of sleep duration remained among the top predictors, while average heart rate during waking hours and the typical time participants woke up replaced the other two. Lastly, varying the PHQ-9 cutoff for depression classification (using thresholds of 9 and 11) did not change the results. Full results of these analyses are presented in supplementary material Tables 5 and 6.

Discussion

This study examined the potential of consumer-grade wearable data to classify depression across two distinct datasets, one from an outpatient sample and another from a student sample. While models relied heavily on site-related differences in the data for classification, the final elastic net model including Garmin data achieved the strongest classification performance. The most influential predictors for depression were greater variability (SD) in sleep duration, lower minimum awake heart rate, and fewer maximum daily steps. These findings contribute to the growing body of research suggesting that physiological and behavioral data from wearable devices may serve as valuable indicators for detecting depression.

The most influential predictors in our model spanned different domains rather than being confined to a single measure type, highlighting the potential advantage of integrating multiple data streams for depression classification. Greater variability in sleep duration was associated with depression, potentially reflecting irregular sleep patterns common in depression, such as difficulties falling asleep or disrupted sleep¹⁴. Fewer maximum steps were linked to depression, consistent with findings that depressed individuals tend to engage in lower levels of physical activity in daily life, such as reduced walking, which is associated with higher symptom severity¹⁵. A lower minimum awake HR was observed in the depressed group, which may reflect lower activity levels but could also be influenced by factors such as metabolic differences or blunted physiological reactivity, warranting further investigation. However, given the interdependencies among these features, future research should explore whether integrating diverse signals can meaningfully improve predictive performance beyond individual domains. Additionally, studies should explore the role of key predictors not only in detecting current depression but also in predicting future symptom trajectories.

Our findings align with a recent meta-analysis⁹ reporting high classification performance (accuracy = 0.89, sensitivity = 0.87, specificity = 0.93) for AI-based depression classification using Fitbit and Actiwatch data, and extend this literature to a new device (Garmin) and more heterogeneous populations, including students and outpatients. Our model showed slightly lower accuracy (79%) but comparable sensitivity (0.83). Unlike the mentioned AI models that favor detecting non-depressed individuals, it performed better at identifying depression. This could be due to our enriched sample, allowing for a broader range of depressive symptom severity and contributing to the model's higher sensitivity. While our model showed strong performance, as in similar studies, caution is warranted due to potential overfitting.

Consistent with previous research, our study shows that passive sensing demonstrates some ability to detect depression, providing promising but not yet fully actionable results. Moving the field forward to fully leverage the potential of passive sensing will require several complementary strategies. First, our study represents an initial step by examining a diverse sample, made possible through collaboration across sites using the same device and similar assessment periods, as implemented in infrastructures such as Remote Assessment of Disease and Relapse – Central Nervous System (RADAR-CNS)¹²,¹⁵. Future progress will depend on fostering larger multi-site collaborations that employ harmonized assessment strategies, enabling not only larger datasets but also more diverse and representative data. Second, future research should adopt a multi-modal approach, combining wearable and smartphone sensors with Ecological Momentary Assessment (EMA) self-reports. Integrating physiological signals and self-reports with contextual information such as location or screen time could provide richer, more informative predictors. Finally, future studies should explore integrating passive sensing with EMA or clinical monitoring in structured workflows where sensor data supports but does not replace clinical judgment. Such approaches can help evaluate the real-world utility of predictive models and inform how they might be applied safely and effectively.

Several limitations should be considered when interpreting the findings. The performance of wearable devices may be limited by sensor accuracy and missing data. Many participants, particularly in the clinical population, were excluded because we required at least 50% coverage per feature. The main limitation of this study is that combining the student and outpatient samples introduces the risk that the model may partly distinguish between these groups, which differ in setting and depression prevalence, rather than accurately identifying individuals with depression. In addition, the unbalanced distribution of depression status across sites limited our ability to train the model on one dataset and test it on the other. Future research should prioritize larger, more balanced multi-site datasets to reliably evaluate the contribution of passive sensing features within individual populations.

Measurement-related issues should also be noted. We used a binary PHQ-9 cutoff for classification, which captures clinically relevant thresholds but not the full spectrum of depressive severity. For the [masked] A dataset, but not the [masked] B data, a slightly adapted version of the PHQ was used from which the PHQ-9 was reconstructed, potentially introducing measurement non-equivalence and affecting comparability across sites. Furthermore, student and outpatient data were collected during different periods, and seasonal effects (e.g., winter vs. summer) were not explicitly controlled, which may have influenced the physiological and behavioral patterns captured by the wearable devices. Finally, while our study included participants from two sites representing both student and outpatient populations, generalizability remains limited. Several sociodemographic variables (e.g., gender identity, immigration history) were not collected, and the sample was relatively homogeneous (primarily German and Dutch participants). Future studies should include broader and more diverse demographic information to enhance external validity.

This study advances research on wearable-based depression detection by leveraging consumer-grade smartwatches within a larger, more diverse sample. Our findings show both the promise and the complexity of using wearable data in detecting depression. While model performance may still be influenced by site-related factors like demographic differences, our findings highlight the potential of wearable technology for mental health monitoring. Future cross-site prediction efforts will be crucial to supporting its clinical translation.

Methods

Participants and Procedure

[masked] sample A. The WARN-D study is a multi-cohort student study in the Netherlands, funded by the European Research Council (Horizon 2020, No. 949059), and with ethical approval by Leiden University (2021-09-06-E.I.FriedV2-3406). All procedures were performed in accordance with the relevant guidelines and regulations. Data for the [masked] sample A were collected between April 2021 and October 2024. Here, the first two stages of [masked] are relevant: after screening for inclusion criteria via an online screener, Stage 1 consisted of an online survey including depression measures that was filled out by all participants. In Stage 2, participants provided Garmin smartwatch data for three months. In the current study, we used only the first two weeks of wearable data. Exclusion criteria for [masked] included moderate to severe depression (Patient Health Questionnaire-9 (PHQ-9)¹⁶ ≥ 14), current mania, thought disorders, primary substance use disorder, current or pending mental health treatment for any of the aforementioned, moderate to severe suicidal ideation, or distress from seeing daily calorie estimates. Although the [masked] study includes a large student sample, we selected a subset to match the [masked] sample B on age and reduce class imbalance between depressed and non-depressed participants (see Statistical Analyses for details). Further details on the [masked] are available [masked]¹⁷.

[masked] sample B. This study is embedded in the outpatient clinic of Trier University and funded by the German Research Foundation (LU 600/19 − 1), with ethical approval from Trier University (03.03.2022). All procedures were performed in accordance with the relevant guidelines and regulations. Data for the [masked] sample B were collected between October 2019 and August 2024. As part of standard care, all patients completed pre-treatment assessments, including the PHQ-9, which were submitted at the initial interview. Study invitations were sent alongside initial interview invitations, during which inclusion and exclusion criteria were assessed. Exclusion criteria included current suicidality, psychosis, or mania. Eligible participants attended an introductory session, provided written informed consent, and began wearing a Garmin fitness tracker for two weeks of treatment, starting immediately after the diagnostic session. Further details on study design and procedure are available in [masked]¹⁸.

The final sample consisted of 282 participants (mean age = 26.95, SD = 9.83; 81% female). Table 1 provides an overview of the sociodemographic characteristics for the [masked] A and [masked] B datasets. Please note that no data were collected on gender identity, and information on other diversity indicators such as socioeconomic background or immigration history is also limited.

Table 1

Sociodemographic Characteristics of Participants and Baseline Depression
	[masked] A (n = 187)	[masked] B (n = 95)
PHQ-9, Mean (SD)	7.15 (4.46)	15 (4.67)
Age, Mean (SD)	24.4 (7.77)	30 (11.1)
Sex, No. (%)
Male	29 (16)	23 (24)
Female	158 (84)	70 (74)
Other	-	2 (2)
Nationality, No. (%)
German	13 (7)	86 (91)
Dutch	111 (59)	-
Other	63 (34)	9 (9)
Family status, No. (%)
Single	82 (44)	63 (66)
Married	5 (3)	15 (16)
Other	100 (53)	17 (18)
Educational Level, No. (%)
Student	187 (100)	15 (16)
University degree	106 (57)	49 (51)
Other	-	31 (33)
Note. N = 282. “Other” includes all relationship statuses beyond “single” and “married” (e.g., separated, divorced, widowed, living with a partner, or dating), which were harmonized across sites to ensure comparability.

Measures

Depressive symptom severity. The 9-item PHQ-9 is a self-report measure assessing depressive symptoms over the past two weeks on a 4-point scale (0 = not at all to 3 = nearly every day). Total scores range from 0 to 27, with higher scores indicating greater symptom severity. Here, baseline PHQ-9 scores were used for binary classification, with a cutoff of ≥ 10 to identify moderate depression¹⁹. In [masked] A, an adapted 15-item version of the PHQ-9 was used, from which PHQ-9 scores were reconstructed (see Table 1 in the supplemental material for more details).
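
As an illustration of the scoring rule described above, the following minimal Python sketch sums the nine item scores and applies the ≥ 10 cutoff. The function names are hypothetical, and the sketch does not cover the reconstruction of PHQ-9 scores from the adapted 15-item version used in [masked] A.

```python
def phq9_total(items):
    """Sum nine PHQ-9 item scores, each coded 0 (not at all) to 3 (nearly every day)."""
    if len(items) != 9:
        raise ValueError("PHQ-9 requires exactly nine item scores")
    if any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("each item must be coded 0, 1, 2, or 3")
    return sum(items)


def is_depressed(total, cutoff=10):
    """Return 1 if the total score meets the cutoff, otherwise 0."""
    return int(total >= cutoff)


# Hypothetical respondent, for illustration only.
example_total = phq9_total([2, 1, 2, 1, 0, 1, 2, 1, 0])
example_label = is_depressed(example_total)
```

The `cutoff` argument also accommodates the alternative thresholds (≥ 9 and ≥ 11) used in the robustness checks.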

Wearable data. Participants in both studies were instructed to wear a Garmin Vivosmart 4 smartwatch continuously, except for recharging. The device recorded heart rate (HR) every 15 seconds and provided daily measures such as step count, sleep duration, sleep start and end time (calculated as minutes from midnight). HR was assessed using photoplethysmography, which estimates pulse by detecting changes in light absorption due to blood flow. While Garmin devices provide these daily aggregated activity and sleep metrics, they do not offer continuous wear-time indicators or validated non-wear flags, which limits the ability to determine exact periods of non-wear.

Statistical Analysis

Data processing and analysis were conducted in R version 4.4.0²⁰; all scripts and supplemental materials are accessible online (https://osf.io/fjzb3/?view_only=8b2f882e6d0f458da103ebb3051430db). To reduce class imbalance, we matched a subsample of the larger [masked] A dataset to the [masked] B sample by age using nearest-neighbor matching with the Mahalanobis distance in MatchIt (v4.7.0)²¹. This reduced but did not fully eliminate age differences.
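
To make the matching idea concrete, the sketch below shows greedy nearest-neighbor matching without replacement on a single covariate (age) in Python. It is not the MatchIt implementation: the function name, the processing order, the `ratio` parameter, and the example ages are assumptions for illustration. With one covariate, the Mahalanobis distance reduces to the absolute age difference divided by a standard deviation, so the choice of scaling does not change which neighbor is nearest.

```python
import statistics


def match_by_age(target_ages, pool_ages, ratio=1):
    """Greedy nearest-neighbor matching without replacement on one covariate.

    Returns a dict mapping each target index to a list of matched pool indices.
    """
    # Assumption: scale by the SD of the combined sample. With a single
    # covariate this constant does not affect the ordering of neighbors.
    sd = statistics.stdev(list(target_ages) + list(pool_ages))
    available = set(range(len(pool_ages)))
    matches = {i: [] for i in range(len(target_ages))}
    for _ in range(ratio):
        for i, age in enumerate(target_ages):
            if not available:
                break
            # Ties are broken by the lower pool index to keep results deterministic.
            best = min(available, key=lambda j: (abs(pool_ages[j] - age) / sd, j))
            matches[i].append(best)
            available.remove(best)
    return matches


# Hypothetical ages, for illustration only.
one_to_one = match_by_age([30, 22], [21, 29, 35, 23], ratio=1)
two_to_one = match_by_age([30, 22], [21, 29, 35, 23], ratio=2)
```

Greedy matching depends on the order in which target units are processed, which is one reason residual age differences can remain after matching.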

Missing data were handled by excluding participants (1) with missing PHQ-9 scores or (2) with wearable data coverage below 50% for any given feature (see flowchart in the supplemental material, Fig. 1). With a 14-day monitoring period, we required ≥ 7 valid days to ensure stable person-level feature estimates. No imputation was performed, as imputed values for participants with low coverage would yield unreliable estimates; complete-case aggregation was used instead.
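
The exclusion rule can be sketched as follows in Python. This is a simplified illustration that assumes each feature is stored as one value per monitoring day, with `None` marking a missing day; how coverage was computed for the 15-second heart rate stream is not specified here, and all names and data are hypothetical.

```python
def has_sufficient_coverage(daily_values, min_coverage=0.5):
    """True if the share of non-missing days is at least `min_coverage`."""
    valid_days = sum(value is not None for value in daily_values)
    return valid_days / len(daily_values) >= min_coverage


def include_participant(phq9_total, features, min_coverage=0.5):
    """Apply both exclusion rules: missing PHQ-9 score, or low coverage on any feature."""
    if phq9_total is None:
        return False
    return all(
        has_sufficient_coverage(days, min_coverage) for days in features.values()
    )


# Hypothetical 14-day records: 7 valid days pass the rule, 6 do not.
seven_valid = {"steps": [5000] * 7 + [None] * 7}
six_valid = {"steps": [5000] * 6 + [None] * 8}
```

Because the rule applies to every feature, a participant with good step data but sparse sleep data is still excluded.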

Several features were extracted: for step count and sleep duration, we calculated the mean (M), standard deviation (SD), minimum, and maximum values. To minimize the influence of date-related information that might bias the models, sleep start and end times were converted to times relative to midnight. HR was split into resting, measured during Garmin-recorded sleep periods, and awake. To control for outliers, the lowest and highest 5% of HR values were removed within person. For both resting and awake HR, we extracted the M, SD, minimum, and maximum values. Additionally, we calculated kurtosis and skewness for resting HR only. In total, 22 predictors were used for depression detection; all were z-transformed.
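
The feature extraction steps above can be sketched in Python as follows. Several details are assumptions: trimming is done by count after sorting, skewness and kurtosis use population moments with kurtosis reported as excess kurtosis, and SD is the sample SD. The authors' R scripts may differ on each of these points, and the example readings are hypothetical.

```python
import statistics


def summarize(values):
    """Mean, sample SD, minimum, and maximum of one person's values."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def trim_extremes(values, proportion=0.05):
    """Drop the lowest and highest `proportion` of values (by count) within one person."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return ordered[k:len(ordered) - k]


def central_moment(values, order):
    """Population central moment of the given order."""
    m = statistics.mean(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


def z_transform(column):
    """Standardize one feature across participants (mean 0, SD 1)."""
    m = statistics.mean(column)
    s = statistics.stdev(column)
    return [(v - m) / s for v in column]


# 40 hypothetical heart rate readings; trimming removes two values from each end.
trimmed_hr = trim_extremes(list(range(41, 81)))
hr_features = summarize(trimmed_hr)
```

Trimming before computing the minimum and maximum means these two features reflect the 5th and 95th percentile region of a person's heart rate rather than single extreme readings.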

After merging the two datasets ([masked] A and [masked] B), the data were split into a training (70%) and test (30%) set. Elastic net binary classification models were then fit using 10-fold cross-validation to predict the presence (1, PHQ ≥ 10) or absence (0) of moderate depression using standardized wearable data as predictors. Elastic net models were chosen as they allow regularization of potentially highly correlated wearable data²². The elastic net penalty, which combines L1 and L2 regularization, was tuned by optimizing the alpha and lambda hyperparameters to select the best-performing model in the training set. The final model was then evaluated on the test set, with performance assessed using accuracy, area under the curve (AUC), and F1 score. Variable importance was derived from the absolute magnitude of standardized model coefficients, reflecting each predictor's contribution to the classification. Standardized coefficients and odds ratios are also reported to show the direction and magnitude of each predictor’s effect, noting that predictors with low importance can still have meaningful effects due to elastic net shrinkage and correlations among features. This study was not preregistered.
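
For reference, the penalized objective can be written as below. This uses the parameterization common in elastic net software such as glmnet; the software used to fit the model is not named here, so the exact scaling of the penalty is an assumption. Here N is the number of training participants, x_i the vector of standardized predictors, y_i the binary depression label, and β_0 and β the intercept and coefficients.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
+ \lambda\Big[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
      + \alpha\,\lVert\beta\rVert_1\Big]
```

In this form, alpha = 1 gives the lasso (L1) penalty and alpha = 0 gives the ridge (L2) penalty, while lambda controls the overall strength of shrinkage. Because predictors are z-transformed, exp(β_j) can be read as the odds ratio for a one-SD increase in predictor j.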

Sensitivity Analyses

We conducted two sensitivity analyses to determine whether sample characteristics, such as study site or age, would offer more accuracy than the Garmin data alone. First, to account for potential site-related influences, we fit a baseline logistic regression using site as the sole predictor and a second logistic regression including both site and Garmin-derived features. The performance of these models was compared to the previously described elastic net model. Second, we used the combined dataset to address potential confounding by age and sleep timing. Specifically, we residualized age and time of going to bed from the predictor variables and re-applied the elastic net model using the same procedure described above. Finally, we tested the robustness of our classification results by using alternative PHQ-9 cutoffs (≥ 9 and ≥ 11) to define depressed versus non-depressed individuals.
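
Residualizing a predictor means regressing it on the covariates (here age and bedtime) and keeping only the residuals, so that the remaining variation is linearly unrelated to those covariates. The Python sketch below illustrates this with ordinary least squares solved through the normal equations. It is an assumption-laden illustration rather than the authors' R code; the function names and example data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def residualize(y, covariates):
    """Residuals of y after OLS regression on an intercept plus the covariate columns."""
    design = [[1.0] + [col[i] for col in covariates] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    return [y[i] - sum(b * v for b, v in zip(beta, row)) for i, row in enumerate(design)]


# Hypothetical data: age in years, bedtime in hours after 20:00, one wearable feature.
age = [20.0, 25.0, 30.0, 35.0, 40.0]
bedtime = [1.0, 3.0, 2.0, 5.0, 4.0]
feature = [5.0, 3.0, 8.0, 1.0, 9.0]
feature_resid = residualize(feature, [age, bedtime])
```

Each wearable predictor would be residualized in this way before the elastic net is refit, so any remaining predictive signal cannot be attributed to a linear effect of age or bedtime.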

Electronic Supplementary Material

Below is the link to the electronic supplementary material

Supplementary Material 1

Supplementary Material 2

References

1. Cuijpers, P., Beekman, A. T. F. & Reynolds, C. F. Preventing depression: a global priority. JAMA 307, 1033 (2012).

2. Cuijpers, P., Karyotaki, E., Reijnders, M. & Huibers, M. J. H. Who benefits from psychotherapies for adult depression? A meta-analytic update of the evidence. Cogn. Behav. Ther. 47, 91–106 (2018).

3. Berman, A. L. Depression and suicide. In Handbook of Depression (eds Gotlib, I. H. & Hammen, C. L.) 510–530 (Guilford Press, 2009).

4. Delgadillo, J. & Lutz, W. A development pathway towards precision mental health care. JAMA Psychiatry 77, 889 (2020).

5. Lutz, W., Vehlen, A. & Schwartz, B. Data-informed psychological therapy, measurement-based care, and precision mental health. J. Consult. Clin. Psychol. 92, 671–673 (2024).

6. Jacobson, N. C., Weingarden, H. & Wilhelm, S. Digital biomarkers of mood disorders and symptom change. npj Digit. Med. 2, 3 (2019).

7. Oladeji, B. D. & Gureje, O. Brain drain: a challenge to global mental health. BJPsych Int. 13, 61–63 (2016).

8. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders (DSM-5-TR) (American Psychiatric Association Publishing, 2022).

9. Abd-Alrazaq, A. et al. Systematic review and meta-analysis of performance of wearable artificial intelligence in detecting and predicting depression. npj Digit. Med. 6, 84 (2023).

10. De Angel, V. et al. Digital health tools for the passive monitoring of depression: a systematic review of methods. npj Digit. Med. 5, 3 (2022).

11. Price, G. D., Heinz, M. V., Collins, A. C. & Jacobson, N. C. Detecting major depressive disorder presence using passively-collected wearable movement data in a nationally-representative sample. Psychiatry Res. 332, 115693 (2024).

12. Zhang, Y. et al. Large-scale digital phenotyping: identifying depression and anxiety indicators in a general UK population with over 10,000 participants. J. Affect. Disord. 375, 412–422 (2025).

13. White, N., Parsons, R., Collins, G. & Barnett, A. Evidence of questionable research practices in clinical prediction models. BMC Med. 21, 339 (2023).

14. Zainal, N. H. & Hitchcock, P. F. Sleep disturbance recorded via wearable sensors predicts depression severity 9 years later. J. Affect. Disord. 120426 (2025).

15. Zhang, Y. et al. Associations between depression symptom severity and daily-life gait characteristics derived from long-term acceleration signals in real-world settings: retrospective analysis. JMIR Mhealth Uhealth 10, e40667 (2022).

16. Kroenke, K., Spitzer, R. L. & Williams, J. B. W. The PHQ-9: validity of a brief depression severity measure. J. Gen. Intern. Med. 16, 606–613 (2001).

17. Fried, E. I., Proppert, R. K. K. & Rieble, C. L. Building an early warning system for depression: rationale, objectives, and methods of the WARN-D study. Clin. Psychol. Eur. 5, e10075 (2023).

18. Hehlmann, M. I. et al. The use of digitally assessed stress levels to model change processes in CBT: a feasibility study on seven case examples. Front. Psychiatry 12, 613085 (2021).

19. Levis, B., Benedetti, A. & Thombs, B. D. Accuracy of Patient Health Questionnaire-9 (PHQ-9) for screening to detect major depression: individual participant data meta-analysis. BMJ 364, l1476 (2019).

20. R Core Team. R: A language and environment for statistical computing (R Foundation for Statistical Computing, 2024).

21. Ho, D., Imai, K., King, G. & Stuart, E. A. MatchIt: nonparametric preprocessing for parametric causal inference (R package version 4.7.0) (2023).

22. Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320 (2005).

Author Contribution

M. I. H.: Conceptualization; Data curation; Formal analysis; Methodology; Writing – original draft; Writing – review & editing. R. T.: Conceptualization; Data curation; Investigation; Project administration; Validation; Formal analysis; Writing – review & editing. W. L.: Conceptualization; Funding acquisition; Methodology; Project administration; Supervision; Writing – review & editing. C. L. R.: Data curation; Investigation; Project administration; Validation; Writing – review & editing. R. K. K. P.: Data curation; Investigation; Project administration; Validation; Writing – review & editing. F. M.: Data curation; Investigation; Project administration; Validation; Writing – review & editing. J. A. R.: Methodology; Validation; Writing – review & editing. M. Sch.: Methodology; Validation; Writing – review & editing. E. I. F.: Conceptualization; Data curation; Funding acquisition; Methodology; Project administration; Supervision; Writing – review & editing.

Funding:

Eiko I. Fried, Rayyan Tutunji, Ricarda K. K. Proppert, and Carlotta L. Rieble are supported by funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program, Grant No. 949059.

Wolfgang Lutz and Fabienne Mink are supported by funding from the German Research Foundation (DFG), Grant No. LU 600/19 − 1 and Grant No. LU 660/20 − 1.

Acknowledgement

We wholeheartedly thank all participants of WARN-D; all prior and current WARN-D team members; all members of the scientific advisory board of WARN-D; and all experts who participated in the Delphi study to inform our baseline measurement battery. We sincerely thank all participants of the Trier EMA study as well as all past and present team members for their valuable contributions to the project.

Competing Interests:

The authors have declared that no competing interests exist.

Data Availability

We will make [masked] A data available on the [masked] A project hub ([masked]) when all data are collected, cleaned, and deidentified. To make this article reproducible in the future, we share the exact participant IDs we used for this article in the supplementary materials. The participants of the [masked] B study did not provide written consent for their data to be shared publicly. Due to the sensitive nature of the research, the supporting data from the [masked] B study is not available.
