Language-Based Detection of Depression with Machine Learning: Systematic Review and Meta-Analysis


Hadar Fisher, PhD¹,²

Nigel M. Jaffe, BA¹

Kristina Pidvirny, BA¹

Anna O. Tierney, BA¹

Mia S. Vaidean¹

Poorvesh Dongre, PhD³

Christian A. Webb, PhD¹,²

¹ McLean Hospital, Belmont, MA, USA

² Department of Psychiatry, Harvard Medical School, Boston, MA, USA

³ Department of Computer Science, Virginia Tech, Blacksburg, VA, USA

Corresponding author:

Hadar Fisher, Ph.D.

Harvard Medical School & McLean Hospital

115 Mill Street, Belmont, MA 02478.

Email: hbfisher@mclean.harvard.edu

Phone: 617-564-6140

Fax: 617-855-4231

Abstract

Early detection of depression is critical for timely intervention. Natural language processing (NLP) and machine learning (ML) approaches have increasingly been used to automatically detect depression from text data, yet comprehensive evidence regarding their diagnostic performance remains limited.

We systematically reviewed and meta-analyzed studies applying NLP and ML to identify depression from spoken or written language. Six electronic databases and additional sources were searched, yielding 892 full-text articles, of which 123 met inclusion criteria. One representative result per dataset was selected for quantitative synthesis, resulting in 50 independent studies. Pooled accuracy across studies (k = 43; n = 40,983) was 0.80 (95% CI, 0.76–0.83). Precision (k = 28) was 0.78 (95% CI, 0.72–0.83), recall (k = 33) 0.76 (95% CI, 0.68–0.83), AUC (k = 14) 0.79 (95% CI, 0.70–0.85), and balanced accuracy (k = 16) 0.71 (95% CI, 0.63–0.78). Subgroup analyses showed significant differences by language, text source, feature type, and classifier (all p < .001). Accuracy was highest in studies using structured clinical interviews, non-English languages, and linguistic or embedding-based features. However, in one-at-a-time meta-regressions, only text source remained a significant predictor (QM(3) = 8.78, p = .032), explaining 13.6% of the between-study variance. Publication bias was minimal. Automated depression detection from text shows promising performance with substantial heterogeneity. Performance varies by language, data source, feature extraction, and model type. Findings highlight both current limitations and potential of text-based depression detection and underscore the need for methodological standardization and validation before clinical use.

Key words:

Depression

Artificial intelligence

Natural Language Processing

Machine learning

Introduction

Depression is among the most prevalent psychiatric disorders worldwide and a leading cause of disability.¹ Its global prevalence has risen steadily over the past three decades, with the United States alone seeing depressive symptoms increase approximately threefold from 2017 to 2020.² Despite this high burden, depression remains substantially underdiagnosed and undertreated, with many individuals never seeking professional help or doing so only after symptoms have become severe and impairing.³ Untreated depression is associated with worse prognosis, highlighting the importance of early identification for timely intervention and improved outcomes.⁴,⁵ Emerging digital health tools offer opportunities to expand access and may enable earlier depression detection.⁶

One promising avenue for automated depression detection lies in the analysis of language.⁷ Advances in natural language processing (NLP) and machine learning (ML) now enable scalable, automated analysis of linguistic data to detect depression.⁸ The use of NLP and ML to detect depression has grown exponentially in recent years. However, despite this progress, it remains uncertain how accurate these approaches are overall and whether their performance generalizes reliably across different contexts. Importantly, much of this work has been published in computer science journals, making it less accessible to psychiatric researchers and clinicians who might benefit from these insights. Clinical researchers must be engaged and informed to critically evaluate, guide, and translate these tools into meaningful clinical use.

Several prior reviews have mapped the landscape of NLP and ML approaches for detecting depression, summarizing common text sources, feature extraction strategies, and model architectures.⁸⁻¹⁵ However, despite the rapid expansion of this research area, none of these reviews quantitatively evaluated the accuracy of these models across studies. In this systematic review and meta-analysis, we evaluated the performance of NLP- and ML-based models in detecting depression using datasets where depression status was determined independently (e.g., through validated questionnaires or clinical diagnoses). We further examined study-level moderators, including language, text source, feature type, and model class.

Methods

This systematic review and meta-analysis evaluated the performance of NLP and ML models for detecting depression from text data. The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) reporting guidelines (Table S1).¹⁶ The study protocol was preregistered with the International Prospective Register of Systematic Reviews (PROSPERO; ID CRD42024513390).

Screening and review management were conducted using Covidence software (Veritas Health Innovation, Melbourne, Australia). In the first and second phases (title and abstract review, full text review), two independent raters screened each record. Disagreements were resolved by consensus between reviewers. To ensure consistency and reliability, all reviewers participated in six structured training sessions prior to data collection: two sessions each for title screening, abstract screening, and data extraction. Meetings continued periodically throughout the screening and extraction period.

Data Sources and Search Strategy

We systematically searched both computer science and psychiatric databases to capture the multidisciplinary literature on this topic. Searches were conducted during March 2024 in ScienceDirect, IEEE Xplore, ACM Digital Library, Scopus, MEDLINE, PubMed, and PsycINFO. The search was supplemented in September 2025 by a hand search of reference lists and relevant reviews. The search string combined disorder-specific and computational terms: “depress*” AND (“language” OR “text”). Searches were restricted to peer-reviewed articles published in English.

Study Eligibility Criteria

Inclusion criteria required that studies: (1) used adult participants’ (≥ 18 years) own spoken or written text data to detect (concurrent) depression; (2) employed formal measures of depression (e.g., structured clinical interview, self-report questionnaire, clinical diagnoses); (3) applied machine learning (ML) models to predict the person’s depression status (i.e., binary classification) or depression severity; (4) relied solely on textual input for model development and prediction (i.e., studies incorporating additional modalities such as audio, video, or demographic variables were excluded if they lacked a text-specific model); (5) were peer-reviewed and fully available in scientific databases; and (6) provided sufficient information to extract classification performance (e.g., accuracy, precision, recall, F1, or equivalent).

For the purpose of this review, we included only studies that evaluated model performance on an external sample (e.g., through cross-validation or independent test data), ensuring that reported estimates reflected true predictive capacity rather than overfitting to training data. Large Language Models (LLMs) that were used directly for text classification were not required to include additional validation, as these models are pretrained on large, diverse corpora and apply learned representations without task-specific parameter fitting, reducing the risk of overfitting. However, when LLMs were used solely for feature extraction (e.g., deriving sentiment or emotion scores) and the extracted features were subsequently entered into a separate classification or prediction model to detect depression, such studies were required to include a validation process to ensure the robustness and generalizability of the results.

Studies relying exclusively on social media text were excluded. We made this decision to ensure the inclusion of high-quality studies with clear clinical relevance. Social media data pose several limitations.¹⁵ First, depressed individuals may change their online activity, leading to underrepresentation and biased samples. Second, in many social media datasets, depression status is inferred directly from the text (for example, by external annotators rating whether the text contains depression-related content), so the same material is used both to define and to predict depression. This circular annotation strategy undermines label validity and limits generalizability to clinically assessed depression.

Data Extraction and Risk of Bias Assessment

HF and two reviewers per study independently extracted data from eligible studies using a standardized template and assessed risk of bias using the ML-specific quality assessment tool developed by Ciharova et al.,¹⁷ adapted from established instruments (PROBAST,¹⁸ Cochrane RoB 2.0,¹⁹ ROBINS-I²⁰). When studies reported results from multiple ML models (e.g., based on different feature sets or classifiers), we extracted the performance metrics of the model that achieved the highest performance. Any discrepancies were resolved in group discussion until consensus was reached. In addition to reading the full text of each article, extracted responses were validated using NotebookLM, which supported but never replaced reviewer judgment. This process was implemented to maximize accuracy and agreement while ensuring that decisions were grounded in direct review of the original studies.

Extracted variables included: study sample size; text source and language; depression measure; feature extraction method (e.g., statistical, linguistic, embeddings, transformer, hybrid); classifier type (traditional machine learning, neural, transformer); type of validation (e.g., k-fold cross validation, train-test split); and reported outcomes (e.g., accuracy, precision, recall, and other model performance metrics).

Given the absence of a standardized framework for evaluating the quality of machine learning studies, we adopted the quality assessment approach used in prior research on ML-based prediction models.¹⁷ In line with that framework, study quality was assessed across four domains: (1) adequate sample size (≥ 100 participants); (2) balanced class distribution in cases where participants are divided into multiple groups, such as depressed vs. non-depressed (no group more than ten times smaller than others); (3) appropriate model validation (model parameters tuned on a training set and evaluated on an independent test set); and (4) use of a validated outcome measure assessing depression presence or severity (i.e., the ground truth). These criteria were designed to minimize risk of overfitting and enhance generalizability to unseen data. Since the use of validation was an inclusion criterion, item 3 was not relevant and is excluded from the quality assessment table. See Table S2 for the risk of bias table.
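As an illustration only, the three retained quality domains could be checked with a small helper such as the one below. The function name, argument names, and the example study are hypothetical and are not taken from the review's materials; the handling of studies that report fewer than two groups is likewise an assumption.

```python
def quality_flags(n_participants, group_sizes, validated_outcome):
    """Flag the three retained quality domains for one study (illustrative helper)."""
    # Domain 1: adequate sample size (at least 100 participants).
    adequate_n = n_participants >= 100
    # Domain 2: no group more than ten times smaller than another.
    # Assumption: treated as satisfied when fewer than two groups are reported.
    if len(group_sizes) >= 2:
        balanced = max(group_sizes) <= 10 * min(group_sizes)
    else:
        balanced = True
    # Domain 4: a validated depression measure serves as the ground truth.
    return {
        "adequate_n": adequate_n,
        "balanced": balanced,
        "validated_outcome": bool(validated_outcome),
    }

# Hypothetical study: 150 participants, 12 depressed vs. 138 non-depressed.
flags = quality_flags(150, [12, 138], validated_outcome=True)
print(flags)
```

In this hypothetical case the sample-size domain is met, but the class-balance domain is not, because the larger group is more than ten times the size of the smaller one.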

Outcomes

The primary outcome was pooled classification accuracy, which quantifies the overall proportion of classifications made by the model that were correct, or in this case, the overall rate at which the model correctly identifies both depressed and non-depressed individuals. Secondary outcomes included precision (positive predictive value), sensitivity (recall), and the area under the receiver operating characteristic curve (AUC). Precision is the proportion of correctly predicted positive cases among all predicted positive cases, whereas sensitivity is the proportion of actual positive cases correctly identified. High precision reflects fewer false alarms, while high sensitivity reflects fewer missed cases. AUC considers both sensitivity and specificity in its calculation, showing the balance between predicting a positive outcome when the outcome is indeed positive and predicting a negative outcome when the outcome is indeed negative. Mathematically, it represents the area under the curve plotting sensitivity (true positive rate) against 1 – specificity (false positive rate). When possible, we also calculated balanced accuracy, defined as the average of sensitivity and specificity. By giving equal weight to correctly identifying positive and negative cases regardless of their prevalence, it provides a less sample-biased estimate of model performance, particularly in datasets where one class (e.g., depressed or non-depressed) is more frequent than the other.
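The definitions above can be made concrete with a minimal Python sketch that derives each metric from confusion-matrix counts. The function name and the example counts are hypothetical; the example is deliberately imbalanced to show how accuracy and balanced accuracy can diverge.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics defined above from confusion-matrix counts.

    tp/fn: depressed cases classified correctly/incorrectly;
    tn/fp: non-depressed cases classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)   # recall: share of actual positives detected
    specificity = tn / (tn + fp)   # share of actual negatives detected
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical imbalanced sample: 10 depressed, 90 non-depressed participants.
m = classification_metrics(tp=5, fp=5, tn=85, fn=5)
print(m)
```

With these hypothetical counts, accuracy is 0.90 even though only half of the depressed participants are detected, whereas balanced accuracy is about 0.72.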

The F1-score was used for interpretation in the systematic review but was not included as an outcome in the meta-analysis. This is because it is a harmonic mean of two ratios that already contain partial information about sample size, making its sampling variance difficult to estimate for meta-analytic weighting.
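For reference, the standard definition of the F1-score can be written in terms of confusion-matrix counts (the symbols TP, FP, and FN for true positives, false positives, and false negatives are notation introduced here, not in the original text). The final form shows that F1 is not a simple proportion of the total sample, since true negatives do not enter it:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
```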

Statistical Analysis

We extracted both qualitative and quantitative data from each selected study. Meta-analysis was conducted for studies where data availability permitted summary estimation with 95% confidence intervals (CI). Accuracy and related performance metrics were first logit-transformed to stabilize variances and normalize distributions before pooling. Primary pooled estimates were obtained using random-effects models fit with maximum likelihood estimation. To assess heterogeneity, we used $\tau^{2}$, which quantifies the absolute magnitude of the true between-study variance, and $I^{2}$, which indicates the percentage of the total observed variance that is attributable to real differences in effect sizes rather than to sampling error. Publication bias was evaluated through visual inspection of funnel plots and the Galbraith (radial) plot, Egger’s regression tests, and the trim-and-fill method. As a sensitivity analysis, leave-one-out analyses assessed the influence of individual studies. We repeated the same procedure with precision, sensitivity and AUC.
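The pooling logic can be sketched in a few lines of Python. This is not the authors' code (the analyses were run in R with metafor): for brevity the sketch swaps in the DerSimonian-Laird moment estimator of the between-study variance in place of maximum likelihood, and computes the heterogeneity percentage with the Q-based Higgins-Thompson formula. All function names and input values are hypothetical.

```python
import math

def pool_random_effects(yi, vi):
    """Pool effect sizes yi with sampling variances vi under a random-effects model.

    tau^2 uses the DerSimonian-Laird moment estimator, a simpler stand-in
    for the maximum likelihood estimator described in the text.
    """
    k = len(yi)
    wi = [1.0 / v for v in vi]                               # fixed-effect weights
    sw = sum(wi)
    mu_fe = sum(w * y for w, y in zip(wi, yi)) / sw          # fixed-effect mean
    q = sum(w * (y - mu_fe) ** 2 for w, y in zip(wi, yi))    # Cochran's Q
    c = sw - sum(w * w for w in wi) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)                       # between-study variance
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    wr = [1.0 / (v + tau2) for v in vi]                      # random-effects weights
    mu = sum(w * y for w, y in zip(wr, yi)) / sum(wr)
    se = math.sqrt(1.0 / sum(wr))
    return {"mu": mu, "se": se, "tau2": tau2, "I2": i2, "Q": q}

def inv_logit(x):
    """Back-transform a logit to the probability scale."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logit-transformed accuracies and sampling variances for three studies.
yi = [1.0, 1.5, 2.0]
vi = [0.04, 0.05, 0.04]
res = pool_random_effects(yi, vi)
pooled = inv_logit(res["mu"])
ci = (inv_logit(res["mu"] - 1.96 * res["se"]),
      inv_logit(res["mu"] + 1.96 * res["se"]))
print(pooled, ci, res["tau2"], res["I2"])
```

The pooled estimate and its confidence limits are computed on the logit scale and only then back-transformed, which keeps the interval inside the 0 to 1 range.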

To enable pooled estimation across studies that reported performance metrics but not underlying classification counts (i.e., confusion-matrix data), we reconstructed approximate event counts for AUC, precision, and recall following the same logic applied to accuracy. For each study, the number of “events” was computed as the reported metric multiplied by the total sample size, rounded to the nearest integer, with the remainder treated as “failures.” This reconstruction allowed standard error estimation based on the binomial distribution and ensured consistent weighting across studies with differing sample sizes. While AUC reflects overall discriminative ability and precision and recall represent class-specific performance, all three were analyzed using the same framework to ensure comparability. Using total sample size instead of class- or threshold-specific denominators provided a conservative approximation for variance estimation across studies. Effect sizes were calculated using logit transformation of proportions (PLO). A continuity correction of 0.5 was applied to stabilize estimates when proportions were 0 or 1. All pooled estimates and CIs were back-transformed to the probability scale for interpretability.
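The reconstruction described above can be sketched as follows. This is an illustrative Python approximation rather than the authors' R code; the function names and example values are hypothetical, and rounding half up is an assumption about how ties were handled.

```python
import math

def logit_effect_size(metric, n):
    """Approximate a logit-transformed proportion and its sampling variance.

    Events are reconstructed as metric * n rounded to the nearest integer
    (ties rounded up, an assumption); the remainder counts as failures.
    A 0.5 continuity correction is applied only when the reconstructed
    proportion is exactly 0 or 1.
    """
    events = math.floor(metric * n + 0.5)
    failures = n - events
    if events == 0 or failures == 0:
        events += 0.5
        failures += 0.5
    yi = math.log(events / failures)        # logit of the proportion
    vi = 1.0 / events + 1.0 / failures      # binomial-based sampling variance
    return yi, vi

def inv_logit(x):
    """Back-transform a logit to the probability scale."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical study reporting accuracy 0.80 on a total sample of 100.
yi, vi = logit_effect_size(0.80, 100)
print(yi, vi, inv_logit(yi))
```

Because the variance depends on the reconstructed counts, a study with the same reported metric but a larger total sample receives a smaller variance and therefore more weight in the pooled estimate.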

Subgroup analyses compared pooled accuracies across language (English, Chinese, Other), source of text (clinical interview, open-ended questions, communication, interaction with therapist), feature extraction method (simple, linguistic, embeddings, transformer, hybrid), and classifier type (traditional, neural, transformer). Between-group heterogeneity (Q_between) was estimated using random-effects models with restricted maximum likelihood estimation. To examine whether these factors explained variability in accuracy, we then ran one-at-a-time meta-regressions with omnibus tests (QM) and R² based on reductions in τ². All analyses were performed in R, version 4.3.3, using the metafor package (version 4.8.0). Code is available at https://osf.io/x7tm9.
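For readers less familiar with this statistic, an R² based on reductions in τ² is conventionally defined as the proportional drop in estimated between-study variance when a moderator is added. In the expression below, the subscript 0 denotes the model without moderators and the subscript M the model with the moderator; both subscripts are notation introduced here for illustration.

```latex
R^{2} = \max\!\left(0,\; \frac{\hat{\tau}^{2}_{0} - \hat{\tau}^{2}_{M}}{\hat{\tau}^{2}_{0}}\right) \times 100\%
```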

Results

The search yielded 35,000 records. After removing duplicates and screening titles and abstracts, 892 full-text articles were reviewed, and 123 publications contributing 129 effects were included in the qualitative synthesis (Fig. 1). When articles reported results based on multiple datasets, each dataset-specific effect was treated as a separate study.

Over half of the included studies (73/129, 56.6%) used the Distress Analysis Interview Corpus (DAIC) or related datasets derived from it, including three studies that used DAIC alongside another dataset. The DAIC is a well-known benchmark dataset comprising semi-structured clinical interviews. Its Wizard of Oz (DAIC-WOZ) subset,²¹ released as part of the AVEC 2016 challenge, includes 189 interviews conducted by a virtual agent, with audio, video, and transcript recordings available. An extended version (E-DAIC)²² was later released for AVEC 2019, expanding the sample to 275 interviewed participants. In both datasets, depression severity was labeled using the Patient Health Questionnaire-8 (PHQ-8)²³. Table 1 summarizes studies using unique datasets, and Table 2 summarizes those using DAIC data.

Table 1
Description of Studies Included in the Systematic Review and Meta-Analyses
Study	Language	N	Diagnosis	Source of text	Features extraction method	Outcome	Type of classification model	Validation	Accuracy	F1	AUC	Specificity	Precision	Recall/Sensitivity
Abilkaiyrkyzy 2024²⁹	English	20 (+ 275 trained on E-DAIC)	Mildly depressed: 8, Moderately depressed: 6 Not depressed: 4	Open-ended questions	Transformer (BERT tokenizer and LanguageModelFeaturizer)	PHQ-9	Fine-tuned BERT sequence classifier for multi-class depression severity (Softmax output).	Trained on E-DAIC tested on sample of 20 university students	0.65
Aloshban 2021³⁰	Italian	59	Depressed: 29 Not Depressed: 30	Interviewed about everyday life aspects (e.g., activities on the weekend or interaction with family members)	Embedding (Wikipedia2Vec)	Professional psychiatrists’ diagnosis	BiLSTM	5-fold cross validation	0.729	0.619			1	0.448
Antoniou 2022³¹	English	773 (270)	Depression/stress: 356 sessions of 184 patients. Other problems: 417 sessions of 86 patients	Interaction with therapist (text-based counseling)	LIWC	Patients report presenting problem before the first interaction	Quadratic discriminant analyses	5-fold cross validation	0.778	0.71	0.76
Banerjee 2021³²	English	1999	Unclear, 71.4% from the data before cleaning	Open-ended questions	Embedding (doc2vec); Affective features; Word polarity; linguistic tags (e.g., Proper Noun Tag, Singular Noun Tag)	PHQ-9	CNN-Dynamic Attention	60-20-20 random train-validation-test split		0.644
Boian 2025³³	Romanian	3955 (861)	Per-item classification Not at all (NO): 457 Several days (SD): 1,063, More than half the days (HA): 446, Nearly every day (EV): 442 or Irrelevant (IR): 227	Clinical interview (conducted by aiCARE chatbot)	TF-IDF	PHQ-9	Logistic regression	Split to train and test set Train: 1320 Test: 2635	0.840	0.80		0.80	0.80	0.80
Burkhardt 2022³⁴	English	13327 (6551)	Unclear	Interaction with therapist	LIWC; Embedding (BERT-based model for GoEmotions was used to extract emotion features)	PHQ-9	Random forest	Training set (80%): 4,913 patients, 10,006 observations Test set (20%): 1,638 patients, 3,321 observations		0.520	0.67		0.612	0.453
Cao 2025³⁵	Chinese	50 (the full sample included 100 patients the best model included 50)	Very severe 37 Mild 27 Severe 19 Normal 13 Moderate 13	Clinical interview	LLM (Qwen2.5-7B-Instruct)	HAMD-17	LLM: Qwen2.5-7B-Instruct fine-tuned with LoRA	leave-one-out cross-validation		0.61			0.61	0.61
Chen 2024³⁶ (CMDC) (Same as Zou 2023)	Chinese	78	Depressed: 26 Not depressed: 52	Clinical interview	Transformer (the Chinese-BERT); Xmnlp: Word-level features: ratios of adjectives, adverbs, exclamations, verbs, auxiliary words, modal particles, and total word count; Sentence-level features: number of sentences, ratio of positive and negative sentences, and overall sentiment score; Lexical-emotion feature: proportion of modal words	MINI	IIFDD	5-fold cross-validation	0.87*	0.8			0.82	0.79
Chen 2024³⁶ (EATD) (Same as Shen 2022)*	Chinese	162	Depressed: 30 Not depressed: 132	General interview	Same as Chen 2024a (above)	The Self-rating Depression Scale (SDS).	IIFDD	3-fold cross-validation		0.45			0.36	0.70
Cohen 2023³⁷	English	73 (68)	Depressed: 15 Control: 58	Interaction with an online agent, Tina	TF-IDF	PHQ-9	SVM	leave-one-subject-out			0.54
Cook 2016³⁸	Spanish	1458	Depressed: 662 Not depressed: 796	Free-text responses to the question: "how do you feel today?"	n-gram	GHQ-12	Logistic regression	50% split to train and test	0.53	0.42		0.79	0.64	0.31
deHond 2024³⁹	English	4070	Depressed: 127 Not depressed: 3943	Patient (cancer)-generated emails to their health care teams	Transformer (BERT)	ICD-9 and ICD-10 codes obtained from electronic health record data	LASSO logistic regression	Train (67%): 2713 Test (33%): 1357	0.925	0.091	0.54	0.95	0.925	0.13
Demiroglu 2020⁴⁰	Turkish	77 (70)	Depressed: 50 records. Not depressed: 27 records	Interview, with 3 types of questions: neutral, positive, and negative.	Average length of the utterances, subjects in negative, positive, and neutral answers separately. Three-dimensional feature. Rate of speech for negative, positive, and neutral answers. Sentiments of the question-answer pairs.	BDI	SVM	Leave-one-out	0.65*	0.68			0.76	0.67
Demiroglu 2020⁴⁰	German	100 (84)	Depressed: 44 records. Not depressed: 56 records	Interview, general questions (e.g., “What is your favorite dish?”)	Same as above	BDI	SVM	Leave-one-out	0.85*	0.77			0.89	0.75
Gao 2024a⁴¹	Chinese	156	Depressed: 77 Not depressed: 79	Responses to four questions about recent events, sleep, mood, and suicidal tendencies	Transformer (BERT and an improved TextCNN)	Medical records	Dual-branch BERT + improved TextCNN model	Train: 94 (60%) Validation: 31 (20%) Test: 31 (20%)	0.942	0.947			0.931	0.964
Guo 2024⁴²	Chinese	524	Depressed: 59 Not depressed: 465	Clinical interview	Transformer (EmoLLM + GraphRAG)	HAMD, HAMA	EmoLLM	N/A	0.84	0.49			0.38	0.68
Hayati 2022⁴³	Dialectal Malay	53	Depressed: 11 Not depressed: 42	Clinical interview	Transformer (GPT-3)	BDI	GPT3 (They compared its performance using 2–10 examples)	N/A	0.71	0.67
He 2022 (same data as Yuan 2021)⁴⁴	Chinese	108	Depressed: 54 Not depressed: 54	Picture description and question-answering tasks	Embedding (GloVe)	BDI-II and PHQ-9	GRU-based RNN	8:1:1 random train-validation-test split	0.659	0.631			0.688	0.583
Howes 2014⁴⁵	English	882 (167)	Unclear	Interaction with therapist (text-based counseling)	n-gram	PHQ-9	Logistic regression	10-fold cross-validation		0.686
Iyortsuun 2024⁴⁶ (Same as Shen 2022)	Chinese	162	Depressed: 30 Not depressed: 132	General interview	Transformer (Transformer-based, USE-large)	SDS	BiLSTM + Attention	3-Fold cross-validation	0.606	0.66			0.79	0.58
Joharee 2023⁴⁷	Bahasa Malaysia	511 (172)	Unclear (in the test set 28 depressed and 23 not depressed)	3 open-ended questions	TF-IDF	BDI-II and PHQ-9	Extra Tree Classifier	Split 70% training and 30% test	0.73	0.63
Krishnamurti 2022⁴⁸	English	1007 (666)	Not depressed: 48.2% Mild: 38.3% moderate: 10.5% severe: 3.0%	Open-ended questions documenting their pregnancy journey	LIWC; Embedding (Word2Vec); Latent Dirichlet Allocation (LDA), SentiWordNet (SWN)	Edinburgh Postnatal Depression Scale (EPDS)	LASSO regression model	70% training, 15% for prediction (additional 15% were not used)			0.87
Li 2023⁴⁹	Chinese	387 (329)	Euthymia = 46, mild = 102, moderate = 160, severe = 79	Clinical interview	Transformer (BERT)	HAMD-17	BiLSTM + Self-Attention + Multilayer Perceptron (MLP) + Softmax,	Training = 273 recordings, Test = 114 recordings.	0.86	0.911		0.696	0.921	0.901
Liu 2022⁵⁰	English	219	Depressed: 64 Not Depressed: 155	Text message	LIWC	PHQ-8	Logistic Regression with L2 regularization	leave-one-out			0.72
Munthuli 2023⁵¹	Thai	80	Depressed: 40 Healthy control: 40	Clinical interview	Fine-tuned transformer encoder (XLM-RoBERTa)	PHQ-9 and HAM-D	Transformer-based binary classifier (XLM-RoBERTa)	K×L-fold stratified and nested cross-validation	0.9	0.898		0.925	0.921	0.875
Nobles 2018⁵²	English	1213 (33)	Suicidality day: 685 Depression Day: 528	Text message	TF-IDF	Depression: periods where the individual had no suicidal ideation or attempt	DNN	10-fold cross-validation	0.7	0.75		0.56	0.71	0.81
Oh 2024⁵³	Korean	166 (77)	Depressed: 60 Other psychiatric illnesses: 17	Clinical interview	Emotional Analysis Module patented by Acryl Inc.	Clinical diagnosis (DSM-5), provided by psychiatrist	XGBoost	Train: 136 Test:30	0.794	0.877	0.85	0.25		0.962
Ohse 2024⁵⁴	German	84	Depressed: 25 Not Depressed: 59	Clinical interview	GPT3.5 fine-tuned	PHQ-8	GPT3.5 fine-tuned	N/A	0.910	0.820			0.850	0.840
Orhan 2019⁵⁵	Turkish	60	Depressed: 30 Healthy control: 30	10-minute free verbal samples of the subjects	Turkish version of the Harvard-III Psychological Dictionary	Structured clinical diagnosis	Bayesian Logistic Regression	Train: 42 (21 for each category) Test: 18 (9 for each set)	0.89
Parkeaw 2025⁵⁶	Thai	373	Low risk: 261, High risk: 112	SCT consisted of 34 items covering four key depression-related domains: 1) family, 2) society, 3) health, and 4) self-concept	LLM (LLama3.1) was used to extract sentiment scores	PHQ-9	Random forest	5-fold cross-validation	0.786				0.782
Pérez-Toro 2022⁵⁷	Spanish	60	Depressed Parkinson's Disease patients (D-PD): 25 Non-depressed Parkinson's Disease patients (ND-PD): 35	Free response prompt (asked to talk about their daily routines)	Transformer (BERT)	Depression item from the MDS-UPDRS	Gaussian Mixture Model-Universal Background Model	Nested leave one out cross-validation	0.67*	0.7	0.7	0.8		0.56
Podina 2025⁵⁸	Romanian	765	Depressed: 397 Not depressed: 367	Clinical interview (with the aiCARE chatbot)	TF-IDF	PHQ-9	Logistic regression	This is a test set for the algorithm that was built in Boian et al., 2025³³	0.84	0.85		0.78	0.76	0.93
Qin 2025¹⁵	English	37	Depressed: 17 Control: 20	3 Phases: 1. small talk, 2. semi-structural interview. 3. Demographic questions	LLM (qCammel-13B-GPTQ)	MINI	LLM (qCammel-13B-GPTQ)	N/A	0.81	0.87	0.88			0.80
Ren 2024⁵⁹	English	1070 (94)	Depressed: 570 Not depressed: 500	Interaction with therapist (message-based online therapy)	LIWC; Transformer (BERT)	PHQ-9	Neural network (classification head, unspecified)	Training: 870 Test set: 200 Each of the 94 participants contribute 3 observations for training and one for the test.	0.60	0.59	0.64
Resnik 2013⁶⁰	English	124	Depressed: 12 Not depressed: 112	Students were asked to “describe your deepest thoughts and feelings about being in college”.	LIWC; Topic modeling (LDA)	BDI	Logistic Regression	Split to train (94) and test (30)	0.80*	0.50			0.50	0.50
Rutowski 2020⁶¹	English	15,950 (11,000)	Depressed: 4259 Not depressed: 11,691	Participants interacted with an app that presented questions on different topics, such as “work” or “home”.	Transfer learning, implemented via ULMFiT	PHQ-8	LSTM	Split to train (80%) and test (20%)	0.75		0.82	0.75		0.75
Shen 2022⁶²	Chinese	162	Depressed: 30 Not depressed: 132	Interviews	Embedding (ELMo)	PHQ-8 SDS	BiLSTM with Attention	3-fold CV		0.65			0.65	0.66
Shin 2022⁶³	Korean	166	Depressed: 83 Healthy control: 83	Clinical interview	LIWC; Bag-of-words	MINI	Naive bayes	80/20 split	0.83		0.91	0.96		0.70
Shin 2024⁶⁴	Korean	428 (91)	Depressed: 73 Not depressed: 357	Daily diary	Transformer Gpt3.5_ft_CoT (fine-tuned models, Chain-of-thought)	PHQ-9 and Beck Scale for Suicide Ideation (BSS)	Gpt3.5_ft_CoT (fine-tuned models, Chain-of-thought)	N/A	0.90	0.69		0.95	0.75	0.64
Smirnova 2018⁶⁵	Russian	201	Depressed: 124 Healthy control: 77	Free response prompt (participants wrote narratives on the topic "The current state of life and future expectations")	Lexico-semantic features: metaphors, similes, informal words, repetitions Syntactic features: sentence types, word order, ellipses Lexico-grammatical features: pronouns, verb tenses/forms	Clinical psychiatric interviews coded using ICD-10 diagnostic criteria	Linear discriminant analysis	Mention cross validation, but not clear which type	0.99
Smirnova 2019⁶⁶	Russian	201	Same as above	Same as above	Component lexis analysis	HDRS-21	Linear discriminant analysis	Mention cross validation, but not clear which type	0.96
Sood 2023⁶⁷	English	626	Depressed: 152 Not depressed: 474	Clinical interview [Combination of 3 data sets: DAIC, E-DAIC and EATD corpus (originally Chinese but translated to English)]	TF-IDF	PHQ-8 and SDS	SVM	Training set: 399 Development set: 108 Test set: 119 (34 depressed)	0.90*	0.82			0.83	0.83
Tao 2023⁶⁸	Chinese	139	Depressed: 64 Anxious: 75	Interaction with chatbot asking about daily activities	Transformer (ChatGPT)	Psychiatrist diagnosis	ChatGPT	N/A	0.68	0.71			0.69	0.72
Tlachac 2020⁶⁹	English	162	Depressed:55 Not Depressed: 107	Text message	Lexical category features via Empath; POS tag frequencies; Sentiment scores (polarity and subjectivity); Volume features: number of messages, words, characters	PHQ-9	Logistic Regression	5-fold cross-validation	0.804*	0.806		0.742	0.728	0.925
Tlachac 2022a⁷⁰	English	302	Depressed: 142 (47.0%) Not depressed: 160 (53.0%)	Free response prompt	Transformer (BERT); Part-of-speech (POS) tagging Lexical category features via Empath	PHQ-9	BERT-LSTM (a variation of BERT incorporating a Long Short-Term Memory layer)	Training set: 218 Test set: 84 (27.8%)	0.55	0.67		0.17	0.51	0.97
Tlachac 2022b⁷¹	English	3,000 (unclear of how many participants)	Unclear	Text message	Transformer (BERT)	PHQ-9	Fine-tuned BERT classifier	Training: 2400 (1,200 messages per class) Testing: 600 (300 messages per class)	0.711
Tlachac 2022⁷²	English	88	Depressed: 53 Not depressed: 35	Text message	Lexical category; Frequency features; BOW	PHQ-9	Logistic Regression	leave-group-out cross validation	0.71	0.79		0.4		0.93
Weber 2025⁷³	German	126 (65 from 44 participants, 61 synthetic)	n/a	Clinical interview	Transformer (BERT-base-German-cased)	MADRS	Linear regression	5-fold cross validation	0.83
Wright-Berryman 2023⁷⁴	English	2416 (1433)	Depressed: 863 Not depressed:1553	Clinical interview	TF-IDF	PHQ-9	SVM	Leave-one-subject-out cross-validation	0.69		0.77	0.04	0.68	0.55
Xue 2024⁷⁵ (Same as Shen 2022)	Chinese	162	Depressed: 30 Not depressed: 132	EATD-Corpus (general interview)	Transformer (BERT)	SDS	Fine-tuned BERT model with fully connected layers	Does not specify what type of validation was used		0.72			0.66	0.80
Ye 2021⁷⁶	Chinese	160	Depressed: 80 Not depressed: 80	Clinical interview	Embedding (Word2vec)	HAMD	One-hot Transformer	5-fold cross-validation	0.882	0.874
Yuan 2021⁷⁷	Chinese	108	Depressed: 54 Not depressed: 54	Picture descriptions and responses to 30 questions.	Embedding	BDI-II and PHQ-9	Text Recurrent Encoder (TRE)	8:1:1 random train-validation-test split	0.659	0.651			0.688	0.583
Zhang 2024a⁷⁸	Chinese	240	Depressed: 120 Not depressed: 120	Clinical interview	BERT Chinese pre-training model with Multi-Head Attention (MHA) module	PHQ-9	Fully connected deep learning classifier	Training set: 168, Test set: 72	0.64	0.64			0.64	0.64
Zou 2023⁷⁹	Chinese	78	Depressed: 26 Not depressed: 52	Clinical interview	Transformer (Chinese BERT)	MINI	Logistic Regression	5-fold cross-validation	0.92*	0.93	0.99		0.87	0.93
Studies Reporting Continuous Outcomes
Study	Language	N	Source of text	Features extraction method		Outcome	Type of classification model	Validation	MAE		RMSE		R2
Morales 2016⁸⁰	German	138 (84)	Interview on everyday life aspects	LIWC; n-gram; Part-of-Speech (POS); Text-based speech rate features		BDI-II	SVM	leave-one-out cross-validation	7.56		9.21		0.526
Ozkanca 2018⁸¹	Turkish	70	Open-ended questions (neutral, positive, and negative questions)	Manual sentiment tagging (positive/negative/neutral), number of responses per sentiment, average utterance length, speech rate, features computed separately for positive/negative/neutral questions (15 total features)		BDI-II	SVR	Leave-one-out			10.3
Note: The number in the “N” column represents the total number of text observations, with the value in parentheses indicating the number of participants from whom these observations were collected. The accuracy result marked with a * has been computed by us. BiLSTM: Bidirectional Long Short-Term Memory; LIWC: Linguistic Inquiry and Word Count; CNN: Convolutional Neural Network; TF-IDF: Term Frequency–Inverse Document Frequency; PHQ-9: Patient Health Questionnaire–9; BERT: Bidirectional Encoder Representations from Transformers; IIFDD: Intra- and Inter-modal Fusion Model for Depression Detection; SVM: Support Vector Machine; LASSO: Least absolute shrinkage and selection operator; BDI-II: Beck Depression Inventory–II; GRU: Gated Recurrent Unit; RNN: Recurrent Neural Network; LDA: Latent Dirichlet Allocation; POS: Part-of-speech; HAMD: Hamilton Depression Rating Scale; DNN: Deep Neural Net; XGBoost: Extreme Gradient Boosting; MDS-UPDRS: Movement Disorders Society Unified Parkinson's Disease Rating Scale; ULMFiT: Universal Language Model Fine-tuning; SDS: Self-rating Depression Scale questionnaire; HDRS-21: Hamilton Depression Rating Scale-21; SVR: Support Vector Regression; SCT: Sentence Completion Test; MADRS: Montgomery-Åsberg Depression Rating Scale.

Table 2
Description of Studies That Use DAIC Dataset
First author	Features extraction method	Outcome type	Type of classification model	Accuracy	F1	Precision	Recall/ Sensitivity	MAE	RMSE
Agarwal 2022⁸²	Embedding (GloVe)	Binary	MV-IA-Mean	0.72	0.73	0.74	UAR: 0.72
Agarwal 2024⁸³	Embedding (Sentence embeddings from all-mpnet-base-v2; graph built using cosine similarity between embeddings)	Binary	GCN + Transformer multi-head attention	0.83	0.81	0.80	UAR: 0.82
Al-Hanai 2018⁸⁴	Embedding (Word2Vec)	Binary	LSTM		0.67	0.57	0.8	5.18	6.38
Ansari 2023⁸⁵	Count vectorization	Binary	LR and LSTM	LR: 0.748, LSTM: 0.73	LR: 0.67, LSTM: 0.61
Burdisso 2023*⁸⁶	TF-IDF; PMI (Pointwise Mutual Information); PageRank	Binary	node-weighted GCN		0.84
Cao 2022⁸⁷	Transformer (BERT)	Binary	BERT	0.91
Chen 2024⁸⁸	TF-IDF; PMI (Pointwise Mutual Information); PageRank	Binary	GCN		0.84
Correia 2016⁸⁹	Embedding (GloVe)	Binary	SVM	Per sentence: 0.533 Per interview: 1.00
Dang 2017⁹⁰	• SALAT; • siNLP; • TAALES; • SEANCE; • ANEW; • EmoLex; • SenticNet; • Lasswell	Cont.	SVR					4.98	6.02
Danner 2023⁹¹	Transformer (BERT)	Binary	BERT		0.82	0.83	0.82
Fang 2023⁹²	Transformer (USE)	Cont.	Bi-LSTM with an attention mechanism					3.61	4.76
Firoz 2023a⁹³	• BoW • TF-IDF • Embedding (Word2Vec, FastText)	Binary	Ensemble model of CNN-LSTM and Bi-LSTM	0.80
Firoz 2023b⁹⁴	Transformer (BERT); Counts of absolutist language (e.g., always, never, completely)	Cont.	LSTM					5.65	9.45
Flores 2023²⁴	Transformer (BERT)	Binary	LSTM		0.72
Guo 2024⁹⁵	Transformer (BERT)	Binary	PTDD	0.69	0.60	0.48	0.73
Hadzic 2024⁹⁶	GPT4	Binary	GPT4		0.71	0.81	0.70
Hong 2022⁹⁷	Embedding (GRL using Schema Encoders)	Cont.	Schema-Based Graph Neural Network					3.76
Iyortsuun 2024⁴⁵	Transformer (Transformer-based, USE-large)	Binary and cont.	BiLSTM + Attention	0.727	0.78	0.80	0.76	3.96
Jo 2022*⁹⁸	Embedding (exact type unclear)	Binary	CNN	0.8171	0.8101	0.80	0.8205
Kokkera 2023⁹⁹	• Word frequencies • POS tags • Sentiment scores	Binary	RF	0.40	0.40	0.44	0.43
Lam 2019¹⁰⁰	Manual topic modelling + augmentation + embedding + Transformer	Binary	Transformer architecture		0.78	0.91	0.83
Lau 2021¹⁰¹	Transformer (BERT)	Binary and cont.	BiLSTM + attention		0.83	0.83	0.83	4.23	5.32
Lau 2023¹⁰²	Transformer (BERT and RoBERTa)	Cont.	BiLSTM + attention					4.17	0.02
Li 2022a¹⁰³	Embedding (from scratch)	Binary	biLSTM + RNN network	0.745	0.706	0.701	0.715
Li 2022b¹⁰⁴	Transformer (BERT) (utterance-based)	Binary	BiLSTM + attention with an MLP-Softmax classifier		0.78		UAR: 0.79
Li 2023¹⁰⁵	Part-of-Speech (POS); Named Entity Recognition (NER); Embedding (GloVe)	Binary	BiLSTM		0.79	0.69	0.80
Lin 2020¹⁰⁶	Embedding (ELMo)	Binary	BiLSTM + Attention		0.83	0.83	0.83
Lopez-Otero 2017¹⁰⁷	Embedding (GloVe)	Binary	SVM	0.857	0.730
Lorenc 2022¹⁰⁸	Embedding and transformer (USE5, DAN, sBERT )	Binary	Chunk-based biLSTM model				UAR: 0.803
Lu 2023¹⁰⁹	Transformer (BERT)	Binary	BERT		0.76
Rodrigues Makiuchi 2019¹¹⁰*	Transformer (BERT)	Cont.	8 CNN blocks-LSTM					4.22
Mallol-Ragolta 2019¹¹¹	Embedding (GloVe)	Binary	HCAN		0.63		UAR: 0.66
Mao 2023¹¹²	Embedding (GloVe)	5-levels classification	BiLSTM	0.968	0.971
Milintsevich 2023¹¹³	Transformer (RoBERTa)	Binary, 5-levels classification and cont.	BiLSTM + Attention		Binary: Micro-F1 = 0.766 Macro-F1 = 0.739 5-Class: Micro-F1 = 0.426 Macro-F1 = 0.270			3.78
Niu 2021¹¹⁴	Embedding (GloVe)	Binary and cont.	Hierarchical context-aware graph attention model	0.77		0.70	0.82	3.73	4.8
Pampouchidou 2016¹¹⁵	• LIWC; • Total number of words and sentences • Average sentence length • Laughter-to-word ratio • Depression-related word ratio: • ANEW • Mean and SD of pleasure, arousal, dominance ratings • Word frequency	Binary	Decision Tree		Depressed: 0.23 Not depressed: 0.79			8.99	10.75
Prabhu 2022¹¹⁶	Embedding (Word2vec pretrained)	Binary	LSTM	0.823
Qureshi 2019¹¹⁷	Embedding (from scratch, feature learning via an LSTM encoder)	Continuous and 5-class	DNN	0.67	0.53			3.90	4.96
Qureshi 2020¹¹⁸	Transformer (USE)	Cont. and 5-level class	LSTM	0.667	0.62			Class: 0.66 Cont: 3.81	Class: 1.23 Cont: 4.70
Qureshi 2021¹¹⁹	Transformer (USE)	Cont.	LSTM					3.78	4.88
Rasipuram 2022¹²⁰	Transformer (GPT2)	Cont.	BiLSTM					3.21	4.25
Ray 2019¹²¹*	Transformer (USE)	Cont.	stacked BiLSTM + feedforward network					4.02	4.73
Rinaldi 2020¹²²	Embedding (GloVe)	Binary	Joint Latent Prompt Categorization (JLPC)		0.604
Rohanian 2019¹²³	Embedding (GloVe)	Binary and cont.	LSTM		0.69	0.68		4.98	6.05
Sadeghi 2023¹²⁴*	Transformer (GPT-3.5-Turbo and DepRoBERTa )	Cont.	SVR with a polynomial (poly) kernel					4.26	5.36
Sadeghi 2024¹²⁵*	Transformer (GPT-3.5-Turbo (prompt asking the model to describe the interview + DepRoBERTa and GPT-3.5-Turbo response to 11 questions on the interview)	Cont.	SVR					3.86	4.66
Samareh 2018¹²⁶	• Basic linguistic stats (e.g., word count); • Dictionary based depression-related word ratio; • Sentiment features (AFINN)	Cont.	RF regression with confidence-based decision-level fusion.					4.78	5.59
Senn 2022¹²⁷	Transformer (BERT and RoBERTa)	Binary	Ensemble of BERT, RoBERTa, DistilBERT		0.62		0.64
Shen 2022⁶¹ (also used the EATD-Corpus)	Embedding (ELMo)	Binary	BiLSTM with Attention		0.83	0.83	0.83
Stasak 2017¹²⁸	Word Affect Features: single affect word-rating reference, such as the General Index	Binary	decision tree classification	0.82
Stepanov 2018¹²⁹	BOW	Cont.	SVR					4.88	5.83
Sun 2017¹³⁰	Selected key phrases related to symptoms	Cont.	RF		0.55	0.40	0.89	3.87	4.98
Tlachac 2022⁷⁰	Transformer (BERT)	Binary	fine-tuned BERT classifier	0.48
Toto 2021¹³¹	Transformer (BERT)	Binary	LSTM		0.67
Marriwala 2023¹³²	Embedding (Word2vec)	Binary	CNN	0.8	0.6	0.63	0.68
Van Steijn 2022¹³³*	LIWC; Transformer (BERT); Sentiment; speech rate; Repetition rate; Confidence score	Cont.	KELM						6.06
Villatoro-Tello 2021¹³⁴*	Lexical Availability	Binary	MLP (Train the model on E-DAIC and tested on DAIC-WOZ)		0.83	0.87	0.81
Williamson 2016¹³⁵	Embedding (GloVe); Topics	Binary and cont.	SVR		0.84			3.34	4.46
Xezonaki 2020¹³⁶	LIWC; TF-IDF; Embedding (GloVe); Affective lexica (AFINN, Bing Liu, MPQA, EmoLex, SemEval15)	Binary	Hierarchical Attention Network with Lexicon and Summary Integration		0.70	0.70
Xia 2024¹³⁷	Embedding (Word2vec)	Binary	BiLSTM-GNN	0.64	0.60	0.585	0.584
Xiao 2021¹³⁸	Transformer (BERT)	Binary	BERT				0.70
Xu 2023¹³⁹*	Transformer (BERT)	Binary	Two-layer LSTM network		0.82	0.81	0.83
Xue 2024⁷⁵	Transformer (BERT)	Binary	Fine-tuned BERT model with fully connected (FC) layers		0.85	0.79	0.92
Yadav 2023¹⁴⁰	Embedding (Word2Vec, ELMo); Transformer (BERT)	5-levels class	BGRU model with two Fully Connected (FC) networks as output layers		0.923	0.929	0.928
Yang 2017a¹⁴¹	Embedding (PV); Global structural and behavioral text features (e.g., Number of words)	Binary	SVM		Depressed: 0.667 Not depressed: 0.885	Depressed: 1.000 Not depressed: 0.793	Depressed: 0.50 Not depressed: 1.00
Yang 2017b¹⁴²	Embedding (PV)	Cont.	DCNN and DNN					Female: 3.750 Male: 3.525	Female: 4.361 Male: 4.406
Yang 2018¹⁴³	Embedding (PV)	Binary	SVM	0.75
Yang 2019¹⁴⁴	Embedding (Doc2vec) and Text Convolutional Neural Network	Binary	SVM	0.72
Zhang 2020a¹⁴⁵*	Embedding (PV or doc2vec )	Binary and cont.	Multitask Deep Neural Network (DNN)	0.839	0.907				4.66
Zhang 2020b¹⁴⁶	Transformer (BERT); Key phrase matching	Binary	bidirectional variable-length LSTM model		0.81	0.82	0.8
Zhang 2024b¹⁴⁷	Transformer [Sentence-BERT (nli-bert-large)]	Binary	BiLSTM		0.87
Zhang 2024c¹⁴⁸	Transformer (T5-Encoder and BERT)	Binary, 3-levels, 5-levels	T5 + BERT dual-branch fusion	Binary: 0.8913 3-level: 0.6739 5-level: 0.5435	Binary: 0.8276 3-level: 0.6677 5-level: 0.5259	Binary: 0.80	Binary: 0.857	5.283
Zhao 2022¹⁴⁹	n-gram	Cont.	Transformer-based architecture with self-attention and feed-forward layers					5.03	5.95
Note: * indicates studies using the E-DAIC dataset; all others are based on DAIC-WOZ. Since only two studies reported AUC and three studies reported specificity, these metrics were removed from the table. MV-IA-Mean: Multi-view model with inter-view attention coupled with the mean function; GCN: Graph Convolutional Network; SALAT: Suite of Linguistic Analysis Tools, an open-source toolkit used to extract various linguistic and word affect features from transcripts; siNLP: Simple Natural Language Processing Tool; TAALES: Tool for Automatic Analysis of Lexical Sophistication; SEANCE: Sentiment Analysis and Cognition Engine; ANEW: Affective Norms for English Words; EmoLex: features based on token words related to eight emotion types (anger, anticipation, disgust, fear, joy, sadness, surprise, trust); SenticNet: features based on nearly 13,000 token words, evaluating perceptual polarity norms for aptitude, attention, pleasantness, and sensitivity; Lasswell: 146 features from 63 different word lists categorized by eight semantic characterizations, with a particular interest in the well-being category; BERT: Bidirectional Encoder Representations from Transformers; SVR: Support Vector Regression; Bi-LSTM: Bi-directional LSTM; SVM: Support Vector Machine; NB: Naïve Bayes; LR: Logistic Regression; LSTM: Long Short-Term Memory; USE: Universal Sentence Encoder; BoW: Bag of Words; TF-IDF: Term Frequency-Inverse Document Frequency; PTDD: Prompt-based Topic-modeling method for Depression Detection; RF: Random Forest; UAR: Unweighted Average Recall; USE5: USE Transformer-based; DAN: Deep Averaging Network, a simpler sentence embedding model; sBERT: Sentence-BERT, a Transformer fine-tuned for sentence similarity; HCAN: Hierarchical Contextual Attention Network; POS: Part-of-speech; DepRoBERTa: fine-tuned RoBERTa language model specifically designed for depression detection; CNN: Convolutional Neural Network; KELM: Kernel Extreme Learning Machine; BGRU: Bidirectional Gated Recurrent Unit; DCNN: Deep Convolutional Neural Network; DNN: Deep Neural Network; MLP: Multi-layer Perceptron; MDSD-T5: the T5-based (Google encoder–decoder Transformer) branch of the MDSD-FGPL system; PV: paragraph vector, an extension of Word2Vec.

Across studies, a total of 35,171 unique participants (removing duplicates from overlapping datasets) contributed 58,413 text samples (e.g., utterances, transcripts, or messages). Publication years ranged from 2013 to 2025, with 95 articles (77.2%) published since 2020, reflecting growing interest in automated text-based depression detection.

Dataset Characteristics

Among the 56 studies using unique datasets, 20 (35.7%) analyzed English text, 14 (25.0%) analyzed Chinese text, and 21 (37.5%) examined other languages (e.g., Italian, Spanish, Turkish, Korean, Russian, German, Malay, Thai). One study (1.8%) combined Chinese and English datasets by translating the Chinese texts into English. The DAIC datasets included English-language clinical interviews.

The number of text observations ranged from 53 to 15,950 (mean = 1,216; median = 210). The main categories of data sources were structured clinical interviews (19/56, 33.9%), responses to open-ended questions (e.g., “Describe your weekend activities”; 27/56, 48.2%), text messaging and chat logs (6/56, 10.7%), and interactions with therapists (i.e., text-based therapy; 4/56, 7.1%).

Outcome Formulations

The great majority of studies framed automated depression detection as a binary classification task (100/129, 77.5%). In these cases, models were trained to determine whether an individual was depressed, or whether a given text sample was produced by a depressed vs. non-depressed person. Typically, “depressed” status was defined using a clinical cutoff on a depression severity questionnaire (for example, a PHQ-9 score ≥ 10) or based on a formal diagnostic evaluation.
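The cutoff-based labeling described above can be illustrated with a minimal sketch. The function name `label_from_phq9` and the example scores are hypothetical; only the PHQ-9 ≥ 10 threshold comes from the text, and individual studies may have used other cutoffs or instruments.

```python
def label_from_phq9(score: int, cutoff: int = 10) -> int:
    """Return 1 (depressed) if the PHQ-9 total meets the cutoff, else 0.

    The PHQ-9 has nine items scored 0-3, so totals range from 0 to 27.
    """
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 totals range from 0 to 27")
    return int(score >= cutoff)


# Hypothetical questionnaire totals for four participants.
scores = [3, 9, 10, 17]
labels = [label_from_phq9(s) for s in scores]
print(labels)  # [0, 0, 1, 1]
```

Dichotomizing a continuous severity score in this way discards information near the threshold (a score of 9 and a score of 10 receive different labels), which is one reason some studies instead modeled continuous outcomes.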

NLP feature extraction and classification methods

To train ML models to predict depression, text must first be converted into numeric values that serve as model input. Broadly, feature extraction methods fell into four categories: (1) Simple textual features (12/129, 9.3%): Unstructured representations such as Bag-of-Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF) term vectors, which convert text into numerical frequency-based vectors without incorporating external knowledge. (2) Lexicon-based linguistic features (16/129, 12.4%): Specific variables derived from psychological dictionaries or lexica (e.g., LIWC, EmoLex, SenticNet). These typically involve counting the occurrences of words in predefined semantic, emotional, or psychological categories (e.g., number of sad words in the text). (3) Pre-trained word embeddings (31/129, 24.0%): Dense, distributed vector representations of words or documents learned from large text corpora, excluding transformer-based models. Common examples included Word2Vec and GloVe embeddings, which represent words based on their co-occurrence patterns in large datasets. (4) Transformer-based language model features (48/129, 37.2%): Deep text representations from large pre-trained models (e.g., BERT, GPT) that use attention mechanisms to capture relationships between words across entire sentences, enabling a more nuanced understanding of context and meaning. The remaining studies employed hybrid feature approaches, combining multiple feature types to enrich the model’s input (22/129, 17.1%).
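To make the first two feature categories concrete, the following is a minimal pure-Python sketch of bag-of-words counts, a basic TF-IDF weighting, and lexicon-based ratios. The example sentences and the two word lists are hypothetical toy stand-ins (they are not LIWC or any published lexicon), and the TF-IDF variant shown (relative term frequency times log inverse document frequency) is one of several common formulations, not necessarily the one used in any reviewed study.

```python
import math
from collections import Counter

# Hypothetical toy corpus; real studies used transcripts or messages.
docs = ["i feel sad and tired", "we had a great weekend", "i am sad i am alone"]
tokenized = [d.split() for d in docs]


def bag_of_words(tokens):
    """Raw term counts for one document."""
    return Counter(tokens)


def tf_idf(tokenized_docs):
    """Relative term frequency weighted by log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    weighted = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weighted.append(
            {t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
             for t, c in counts.items()}
        )
    return weighted


# Hypothetical word lists, for illustration only.
NEGATIVE_WORDS = {"sad", "tired", "alone"}
FIRST_PERSON = {"i", "me", "my"}


def lexicon_features(tokens):
    """Share of tokens falling into each predefined category."""
    n = len(tokens)
    return {
        "neg_ratio": sum(t in NEGATIVE_WORDS for t in tokens) / n,
        "first_person_ratio": sum(t in FIRST_PERSON for t in tokens) / n,
    }


print(bag_of_words(tokenized[2]))
print(tf_idf(tokenized)[1]["weekend"])
print(lexicon_features(tokenized[0]))
```

Embedding- and transformer-based features replace these hand-built counts with dense vectors from pre-trained models, which requires external model files and is therefore not sketched here.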

In line with broader NLP trends, classification methods evolved from traditional machine-learning models to deep learning approaches capable of directly interpreting and learning from raw text. About one-third of studies (44/129, 34.1%) used traditional classifiers such as support vector machines, logistic regression, naïve Bayes, or random forests. A large portion (56/129, 43.4%) of the studies employed recurrent neural network architectures, most commonly Long Short-Term Memory (LSTM) or bidirectional LSTM, often augmented with attention mechanisms to capture the temporal dependencies among words. By the early 2020s, transformer-based models (29/129, 22.5%) became increasingly prevalent, with studies either fine-tuning pre-trained architectures (e.g., BERT, RoBERTa) for depression detection or using their contextual embeddings as input features for separate classifiers. For example, Flores et al.²⁴ found that while BERT embeddings alone achieved relatively modest performance (maximum F1: 0.58), combining them with an LSTM substantially improved results (average F1: 0.72).

Meta-analysis results

Forty-three studies (n = 40,983) were included in the pooled analysis of classification accuracy (Fig. 2). The pooled accuracy was 0.80 (95% CI, 0.76–0.83), with substantial heterogeneity (τ² = 0.471, I² = 98.4%). Egger’s test was nonsignificant (z = 1.17, p = 0.25), and trim-and-fill identified no missing studies, indicating no evidence of publication bias. Leave-one-out analyses demonstrated stability (range: 0.79–0.80; Table S3), and the Galbraith plot (Fig. 3) revealed no extreme outliers, confirming that heterogeneity reflects broad between-study variability rather than undue influence from individual studies. As a sensitivity analysis, we also excluded seven studies in which accuracy was approximated from other reported metrics (k = 36), yielding an identical pooled estimate (0.80, 95% CI, 0.75–0.83) and similarly high heterogeneity (Fig. 4).
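For readers less familiar with how a pooled proportion, τ², and I² are obtained, the sketch below shows one common approach: a DerSimonian-Laird random-effects model on logit-transformed accuracies, followed by a leave-one-out loop. This is an illustration under stated assumptions, not the estimator used in this review (the section above does not specify it); the function names and the four study counts are hypothetical.

```python
import math


def inv_logit(z):
    return 1 / (1 + math.exp(-z))


def pool_random_effects(correct, total):
    """DerSimonian-Laird pooling of proportions on the logit scale.

    Assumes at least two studies and 0 < correct < total for each study
    (no continuity correction is applied).
    """
    y = [math.log(x / (n - x)) for x, n in zip(correct, total)]  # logit(p)
    v = [1 / x + 1 / (n - x) for x, n in zip(correct, total)]    # sampling variance
    w = [1 / vi for vi in v]                                      # fixed-effect weights
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))    # Cochran's Q
    df = len(y) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                 # between-study variance
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0  # share of variance due to heterogeneity
    w_re = [1 / (vi + tau2) for vi in v]          # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return {
        "pooled": inv_logit(mu),
        "ci": (inv_logit(mu - 1.96 * se), inv_logit(mu + 1.96 * se)),
        "tau2": tau2,
        "i2": i2,
    }


def leave_one_out(correct, total):
    """Pooled estimate after dropping each study in turn."""
    return [
        pool_random_effects(correct[:i] + correct[i + 1:],
                            total[:i] + total[i + 1:])["pooled"]
        for i in range(len(correct))
    ]


# Hypothetical counts of correctly classified cases and sample sizes.
correct = [80, 150, 45, 300]
total = [100, 200, 70, 340]
result = pool_random_effects(correct, total)
print(result)
print(leave_one_out(correct, total))
```

Under this formulation τ² is expressed on the logit scale rather than as a variance of raw accuracies, and a narrow leave-one-out range indicates that no single study drives the pooled estimate.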

Moderator analysis

There were significant between-group differences based on language (Q_between = 153.24, p < .001). Pooled classification accuracy was highest for studies conducted in languages other than English or Chinese (k = 8, Accuracy = 0.82, 95% CI, 0.76–0.86), followed by those in Chinese (k = 9, Accuracy = 0.81, 95% CI, 0.70–0.88), and studies in English (k = 16, Accuracy = 0.77, 95% CI, 0.71–0.83). Between-group differences were also significant for text source (Q_between = 187.05, p < .001). Studies using structured clinical interviews demonstrated the highest pooled accuracy (k = 18, Accuracy = 0.84, 95% CI, 0.81–0.88), followed by communication-based interactions (k = 5, Accuracy = 0.79, 95% CI, 0.66–0.88), open-ended questions (k = 18, Accuracy = 0.75, 95% CI, 0.68–0.81), and finally therapist-patient interactions (k = 2, Accuracy = 0.70, 95% CI, 0.50–0.84). A significant between-group difference was observed based on feature type (Q_between = 164.62, p < .001). Linguistic features produced the highest pooled accuracy (k = 5, Accuracy = 0.86, 95% CI, 0.75–0.93), followed by embedding-based features (k = 6, Accuracy = 0.84, 95% CI, 0.75–0.90), transformer-based features (k = 18, Accuracy = 0.81, 95% CI, 0.75–0.85), simple features (k = 7, Accuracy = 0.75, 95% CI, 0.65–0.82), and hybrid features (k = 7, Accuracy = 0.74, 95% CI, 0.63–0.83). Classifier type also showed a significant between-group effect (Q_between = 166.25, p < .001). Transformer-based and traditional classifiers performed similarly, both with Accuracy = 0.81 (transformers: k = 14, 95% CI, 0.74–0.87; traditional: k = 21, 95% CI, 0.76–0.85), outperforming neural networks (k = 8, Accuracy = 0.72, 95% CI, 0.63–0.80).

In one-at-a-time meta-regressions, only text source was significant (QM(3) = 8.78, p = .032), accounting for 13.6% of the variance. Other moderators including risk of bias, sample balance, language, feature type, classifier, and sample size (log N) did not significantly predict variability in accuracy (all p > 0.17; R²= 0–7.7%).
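The proportion of variance attributed to a moderator in a meta-regression is often computed as a pseudo-R²: the relative reduction in between-study variance when the moderator is added, R² = (τ²_null − τ²_moderator) / τ²_null, truncated at zero. The sketch below assumes that definition; the source does not state which formula was used, and the τ² values in the example are hypothetical rather than taken from this review.

```python
def pseudo_r2(tau2_null: float, tau2_moderator: float) -> float:
    """Relative reduction in between-study variance, truncated at 0.

    tau2_null: tau^2 from the random-effects model without moderators.
    tau2_moderator: residual tau^2 after including the moderator.
    """
    if tau2_null <= 0:
        return 0.0  # no heterogeneity to explain
    return max(0.0, (tau2_null - tau2_moderator) / tau2_null)


# Hypothetical values: the moderator removes one fifth of the heterogeneity.
print(round(pseudo_r2(0.50, 0.40), 3))  # 0.2
```

Because the estimate is truncated at zero, a moderator whose inclusion leaves τ² unchanged or slightly larger is reported as explaining 0% of the variance, consistent with the lower bound of the R² range reported above.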

Secondary analysis

Precision. Twenty-eight studies (n = 31,644) yielded a pooled precision of 0.78 (95% CI, 0.72–0.83) with high heterogeneity (τ² = 0.731, I² = 99.1%; Figure S1). Leave-one-out analyses were stable (range = 0.77–0.79, Table S4).

Recall. Thirty-three studies (n = 47,738) produced a pooled recall of 0.76 (95% CI, 0.68–0.83) with high heterogeneity (τ² = 1.42, I² = 99.7%; Figure S2). Leave-one-out analyses confirmed stability (range = 0.75–0.77, Table S5).

AUC. Fourteen studies (n = 39,412) showed a pooled AUC of 0.79 (95% CI, 0.70–0.85) with high heterogeneity (τ² = 0.66, I² = 99.6%; Figure S3). Leave-one-out analyses were stable (range = 0.76–0.81, Table S6). Across all three metrics (precision, recall, and AUC), Egger’s tests were nonsignificant, and trim-and-fill analyses indicated that no studies were missing, suggesting no evidence of publication bias.

Balanced accuracy. Sixteen studies (n = 31,661) showed a pooled balanced accuracy of 0.71 (95% CI, 0.63–0.78) with high heterogeneity (τ² = 0.54, I² = 99.4%; Figure S4). Leave-one-out analyses were stable (range = 0.70–0.74, Table S7). Egger’s test was nonsignificant, and trim-and-fill analyses suggested that only one study should be imputed, which slightly reduced the pooled balanced accuracy estimate from 0.71 to 0.70 (Figure S5).

Discussion

This study revealed a substantial increase in studies using NLP- and ML-based approaches for automated depression detection. Among studies included in the quantitative synthesis, models achieved a pooled accuracy of 80%, correctly distinguishing depressed from non-depressed individuals in roughly four out of five cases. For comparison, meta-analyses using resting-state fMRI found similar accuracy (~ 80%)²⁵, whereas those using wearable AI reported higher accuracy (~ 89%)²⁶. Yet given the substantial heterogeneity, results should be interpreted cautiously.

Subgroup analyses showed significant differences in accuracy across several factors. Studies conducted in English yielded lower accuracy (77%) than those in Chinese or other languages, possibly reflecting linguistic or cultural differences²⁷ or dataset-specific effects. This finding warrants further investigation into whether cultural and language factors affect automated detection. Models trained on structured clinical interviews achieved the highest accuracy (84%), whereas those analyzing free-form patient–therapist conversations performed worse (70%), suggesting that direct questioning about mood elicits clearer linguistic signals of depression.

Interestingly, lexicon-based features showed the highest pooled accuracy (86%), outperforming more complex or hybrid models (74%). One explanation is that targeted linguistic markers of depression (e.g., frequent use of negative emotion words or first-person pronouns) are robust across contexts, allowing simpler models to perform well. However, this finding is based on few studies and requires further validation. Traditional machine learning models performed comparably to transformer-based models (81%), potentially indicating a performance ceiling, possibly due to limited input text that provides insufficient linguistic signal. Other factors may include between-person variability in how depression is expressed, small and imbalanced datasets, insufficient transformer fine-tuning, shallow linguistic cues in the data collection task, and differences in model optimization or evaluation procedures. These factors may constrain even advanced models, underscoring the need for richer longitudinal data and more personalized and context-aware modeling approaches. Consistent with this interpretation, only the type of text source significantly explained between-study variance in meta-regression (accounting for 13.6% of heterogeneity), suggesting that depression-focused linguistic content may be more critical than model complexity for detection accuracy.

Secondary analyses revealed a pooled AUC of 0.79, suggesting that NLP models can detect depression from text at a level that could be clinically useful, especially for early screening or augmenting clinicians’ assessments. The pooled precision (0.78) and recall (0.76) suggest that, on average, models show a reasonable balance between correctly identifying depressed individuals and avoiding false positives. However, there was substantial variability across studies, likely because these metrics trade off against each other, and researchers can optimize one at the expense of the other. This trade-off means that while some models favor higher sensitivity to reduce missed cases, others prioritize precision to avoid false alarms.
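The precision-recall trade-off can be seen by varying the decision threshold applied to a model's predicted probabilities. The following sketch uses six hypothetical predictions (the scores, labels, and function name are illustrative and not drawn from any reviewed study).

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are called 'depressed'."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical predicted probabilities and true labels (1 = depressed).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.35, 0.5, 0.7):
    print(threshold, precision_recall_at(scores, labels, threshold))
```

In this toy example, lowering the threshold to 0.35 captures every depressed case (recall 1.0) at the cost of one false alarm (precision 0.75), whereas raising it to 0.7 removes the false alarm (precision 1.0) but misses one depressed case.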

This study is one of the few systematic reviews, and among the first meta-analyses, to quantitatively evaluate NLP and ML techniques for detecting depression. Engaging clinical researchers in this rapidly evolving field is essential to ensure these tools are properly understood, evaluated, and directed toward real clinical needs. To maintain a high standard, we included only studies that validated their models against independent measures of depression. Nonetheless, several limitations should be acknowledged. The high heterogeneity means the pooled accuracy should be interpreted cautiously, as individual study results varied widely. Although moderator analyses explained some variability, much remains unexplained, likely reflecting differences in validation methods, preprocessing, or sample characteristics.

Furthermore, many papers did not report all the standard classification metrics. The field would benefit from more standardized reporting. Relatedly, accuracy alone can be misleading in imbalanced datasets where high accuracy may simply reflect the majority (non-depressed) class rather than true model performance.
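A short worked example shows why accuracy alone can mislead on imbalanced data. The sketch below computes standard metrics from a confusion matrix for a hypothetical sample of 90 non-depressed and 10 depressed participants, classified by a trivial model that labels everyone as non-depressed; the function name and counts are illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Majority-class model: every participant predicted as non-depressed.
metrics = classification_metrics(tp=0, fp=0, tn=90, fn=10)
print(metrics["accuracy"])           # 0.9
print(metrics["balanced_accuracy"])  # 0.5
print(metrics["recall"])             # 0.0
```

Accuracy is 0.90 even though no depressed participant is detected, while balanced accuracy sits at the chance level of 0.50, which is why reporting recall, precision, and balanced accuracy alongside accuracy gives a more faithful picture.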

An additional limitation is that over half of studies used variants of the DAIC corpus, meaning a large portion of the literature is built on the same or very similar data, which limits generalizability. While the DAIC is a valuable benchmark, there is a clear need for new datasets that cover different populations, languages, and modes of communication.

Gender differences were not examined, as few studies reported this information. Given that depression is more prevalent among women, they were likely overrepresented in training data, potentially biasing model performance and increasing misclassification risk for men.²⁸ Finally, automated detection differs conceptually from tracking within-person change. Identifying who is depressed does not necessarily translate to monitoring fluctuations or improvement over time, which may rely on different linguistic dynamics.

In summary, NLP and ML can detect depression with good accuracy across a variety of settings. The growing interest since 2020 has yielded many promising approaches. Advancing the field will require greater standardization in reporting, the use of more diverse datasets, and explicit attention to gender and cultural fairness to address the substantial heterogeneity observed. With continued refinement and validation, text-based depression detection systems have the potential to complement traditional assessment methods and broaden access to mental health screening.

Author Contribution

HF had full access to all study data and takes responsibility for the integrity and accuracy of the data analysis. HF and CW contributed to the study concept and design. HF, NJ, KP, AT, and MV acquired, analyzed, and interpreted the data. HF drafted the manuscript, and HF, NJ, KP, AT, MV, PD, and CW critically revised it for important intellectual content. HF performed the statistical analysis, and CW supervised the study. All authors read and approved the final manuscript.

Funding

HF was supported by the VATAT Scholarship (Israel), the Livingston Award, and the Kaplen Fellowship at Harvard Medical School. CW was partially supported by R01MH116969, R01MH135844, R01AT011002, the Tommy Fuss Fund, a NARSAD Young Investigator Grant from the Brain & Behavior Research Foundation, and the Klingenstein Third Generation Foundation. The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.

Competing Interests

CW has received consulting fees from King & Spalding law firm but declares no non-financial competing interests. CW’s interests were reviewed and are managed by McLean Hospital and Mass General Brigham in accordance with their conflict of interest policies. No funding from this entity was used to support the current work, and all views expressed are solely those of the authors. The other authors declare no competing financial or non-financial interests.

Data Availability

Data and materials underlying the findings of this study are publicly available at https://osf.io/x7tm9/overview.

References

1. GBD 2019 Mental Disorders Collaborators. Global, regional, and national burden of 12 mental disorders in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet Psychiatry 9, 137–150 (2022).

2. Ettman, C. K. et al. Prevalence of depression symptoms in US adults before and during the COVID-19 pandemic. JAMA Netw. Open 3, e2019686 (2020).

3. Mohr, D. C. et al. Perceived barriers to psychological treatments and their relationship to depression. J. Clin. Psychol. 66, 394–409 (2010).

4. Kraus, C., Kadriu, B., Lanzenberger, R., Zarate Jr, C. A. & Kasper, S. Prognosis and improved outcomes in major depression: a review. Transl. Psychiatry 9, 127 (2019).

5. Halfin, A. Depression: the benefits of early and appropriate treatment. Am. J. Manag. Care 13, S92 (2007).

6. Naslund, J. A. et al. Digital technology for treating and preventing mental disorders in low-income and middle-income countries: a narrative review of the literature. Lancet Psychiatry 4, 486–500 (2017).

7. Mao, K., Wu, Y. & Chen, J. A systematic review on automated clinical depression diagnosis. NPJ Ment. Health Res. 2, 20 (2023).

8. Zhang, T., Schoene, A. M., Ji, S. & Ananiadou, S. Natural language processing applied to mental illness detection: a narrative review. NPJ Digit. Med. 5, 46 (2022).

9. Le Glaz, A. et al. Machine learning and natural language processing in mental health: systematic review. J. Med. Internet Res. 23, e15708 (2021).

10. Teferra, B. G. et al. Screening for depression using natural language processing: literature review. Interact. J. Med. Res. 13, e55067 (2024).

11. Nanomi Arachchige, I. A., Sandanapitchai, P. & Weerasinghe, R. Investigating machine learning & natural language processing techniques applied for predicting depression disorder from online support forums: A systematic literature review. Inf. 12, 444 (2021).

12. William, D. & Suhartono, D. Text-based depression detection on social media posts: A systematic literature review. Procedia Comput. Sci. 179, 582–589 (2021).

13. Malgaroli, M., Hull, T. D., Zech, J. M. & Althoff, T. Natural language processing for mental health interventions: a systematic review and research framework. Transl. Psychiatry 13, 309 (2023).

14. Li, Y. et al. Automated Depression Detection from Text and Audio: A Systematic Review. IEEE J. Biomed. Health Inform. (2025).

15. Qin, R. et al. Language models for online depression detection: A review and benchmark analysis on remote interviews. ACM Trans. Manag. Inf. Syst. 16, 1–35 (2025).

16. Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372, (2021).

17. Ciharova, M. et al. Use of machine learning algorithms based on text, audio, and video data in the prediction of anxiety and posttraumatic stress in general and clinical populations: a systematic review. Biol. Psychiatry 96, 519–531 (2024).

18. Wolff, R. F. et al. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann. Intern. Med. 170, 51–58 (2019).

19. Higgins, J. P. et al. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ 343, (2011).

20. Sterne, J. A. et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ 355, (2016).

21. Gratch, J. et al. The distress analysis interview corpus of human and computer interviews. in Proceedings of the Ninth International Conference on Language Resources and Evaluation vol. 14 3123–3128 (Reykjavik, Iceland, 2014).

22. DeVault, D. et al. SimSensei Kiosk: A virtual human interviewer for healthcare decision support. in Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems 1061–1068 (2014).

23. Kroenke, K. et al. The PHQ-8 as a measure of current depression in the general population. J. Affect. Disord. 114, 163–173 (2009).

24. Flores, R., Tlachac, M., Toto, E. & Rundensteiner, E. Transfer learning for depression screening from follow-up clinical interview questions. in Deep Learning Applications, Volume 4 53–78 (Springer, 2023).

25. Chen, Y., Zhao, W., Yi, S. & Liu, J. The diagnostic performance of machine learning based on resting-state functional magnetic resonance imaging data for major depressive disorders: a systematic review and meta-analysis. Front. Neurosci. 17, 1174080 (2023).

26. Abd-Alrazaq, A. et al. Systematic review and meta-analysis of performance of wearable artificial intelligence in detecting and predicting depression. NPJ Digit. Med. 6, 84 (2023).

27. Hoemann, K., Şencan, R. S., Cochez, A., Beckers, C. & de Mesquita, B. G. Cultural models of emotion manifest in descriptions of everyday experience: A case study of the US and Belgium. (2024).

28. Cirillo, D. et al. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. NPJ Digit. Med. 3, 81 (2020).

29. Abilkaiyrkyzy, A., Laamarti, F., Hamdi, M. & El Saddik, A. Dialogue system for early mental illness detection: toward a digital twin solution. IEEE Access 12, 2007–2024 (2024).

30. Aloshban, N., Esposito, A. & Vinciarelli, A. Language or Paralanguage, This is the Problem: Comparing Depressed and Non-Depressed Speakers Through the Analysis of Gated Multimodal Units. in Interspeech 2496–2500 (2021).

31. Antoniou, M. et al. Predicting mental health status in remote and rural farming communities: computational analysis of text-based counseling. JMIR Form. Res. 6, e33036 (2022).

32. Banerjee, T. et al. Predicting mood disorder symptoms with remotely collected videos using an interpretable multimodal dynamic attention fusion network. arXiv preprint arXiv:2109.03029 (2021).

33. Boian, R. et al. A conversational agent framework for mental health screening: Design, implementation, and usability. Behav. Inf. Technol. 44, 2364–2378 (2025).

34. Burkhardt, H., Pullmann, M., Hull, T., Areán, P. & Cohen, T. Comparing emotion feature extraction approaches for predicting depression and anxiety. in Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology 105–115 (2022).

35.

Cao, P. et al. A multimodal depression consultation dataset of speech and text with HAMD-17 assessments. Sci. Data 12, 1577 (2025).

36.

Chen, J. et al. IIFDD: Intra and inter-modal fusion for depression detection with multi-modal information from Internet of Medical Things. Inf. Fusion 102, 102017 (2024).

37.

Cohen, J. et al. A multimodal dialog approach to mental state characterization in clinically depressed, anxious, and suicidal populations. Front. Psychol. 14, 1135469 (2023).

38.

Cook, B. L. et al. Novel use of natural language processing (NLP) to predict suicidal ideation and psychiatric symptoms in a text-based mental health intervention in Madrid. Comput. Math. Methods Med. 2016, 8708434 (2016).

39.

de Hond, A. et al. Predicting depression risk in patients with cancer using multimodal data: algorithm development study. JMIR Med. Inform. 12, e51925 (2024).

40.

Demiroglu, C., Beşirli, A., Ozkanca, Y. & Çelik, S. Depression-level assessment from multi-lingual conversational speech data using acoustic and text features. EURASIP J. Audio Speech Music Process. 2020, 17 (2020).

41.

Gao, H., Zhou, Y., Chen, L. & Chi, K. Deep Depression Detection Based on Feature Fusion and Result Fusion. in Chinese Conference on Pattern Recognition and Computer Vision (PRCV) 64–74 (Springer, 2023).

42.

Guo, Y. & Guo, Y. A Knowledge Graph and Large Language Model-Based Framework for Depression Detection. in 2024 International Conference on Image Processing, Computer Vision and Machine Learning 670–673 (IEEE, 2024).

43.

Hayati, M. F. M., Ali, M. A. M. & Rosli, A. N. M. Depression detection on Malay dialects using GPT-3. in 2022 IEEE-EMBS Conference on Biomedical Engineering and Sciences 360–364 (IEEE, 2022).

44.

He, Y., Lu, X., Yuan, J., Pan, T. & Wang, Y. Depressive Tendency Recognition by Fusing Speech and Text Features: A Comparative Analysis. in 2022 13th International Symposium on Chinese Spoken Language Processing 344–348 (IEEE, 2022).

45.

Howes, C. & Purver, M. Linguistic indicators of severity and progress in online text-based therapy for depression. in Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (2014).

46.

Iyortsuun, N. K., Kim, S.-H., Yang, H.-J., Kim, S.-W. & Jhon, M. Additive cross-modal attention network (ACMA) for depression detection based on audio and textual features. IEEE Access 12, 20479–20489 (2024).

47.

Joharee, I. N., Hashim, N. N. W. N. & Shah, N. S. M. Sentiment analysis and text classification for depression detection. J. Integr. Adv. Eng. 3, 65–78 (2023).

48.

Krishnamurti, T., Allen, K., Hayani, L., Rodriguez, S. & Davis, A. L. Identification of maternal depression risk from natural language collected in a mobile health app. Procedia Comput. Sci. 206, 132–140 (2022).

49.

Li, N. et al. Using deeply time-series semantics to assess depressive symptoms based on clinical interview speech. Front. Psychiatry 14, 1104190 (2023).

50.

Liu, T. et al. The relationship between text message sentiment and self-reported depression. J. Affect. Disord. 302, 7–14 (2022).

51.

Munthuli, A. et al. Classification and analysis of text transcription from Thai depression assessment tasks among patients with depression. PLoS One 18, e0283095 (2023).

52.

Nobles, A. L., Glenn, J. J., Kowsari, K., Teachman, B. A. & Barnes, L. E. Identification of imminent suicide risk among young adults using text messages. in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems 1–11 (2018).

53.

Oh, J. et al. Development of depression detection algorithm using text scripts of routine psychiatric interview. Front. Psychiatry 14, 1256571 (2024).

54.

Ohse, J. et al. Zero-Shot Strike: Testing the generalisation capabilities of out-of-the-box LLM models for depression detection. Comput. Speech Lang. 88, 101663 (2024).

55.

Orhan, Z., Mercan, M. & Gökgöl, M. K. A new digital mental health system infrastructure for diagnosis of psychiatric disorders and patient follow-up by text analysis in Turkish. in International Conference on Medical and Biological Engineering 395–402 (Springer, 2019).

56.

Porkaew, P., Zhu, T., Li, A. & Chuenphitthayavut, K. The effectiveness of a sentence completion test for depression screening using large language models. Acta Psychol. 259, 105425 (2025).

57.

Pérez-Toro, P. A. et al. Depression assessment in people with Parkinson’s disease: The combination of acoustic features and natural language processing. Speech Commun. 145, 10–20 (2022).

58.

Podina, I. R., Bucur, A.-M., Fodor, L. & Boian, R. Screening for common mental health disorders: a psychometric evaluation of a chatbot system. Behav. Inf. Technol. 44, 2160–2169 (2025).

59.

Ren, X., Burkhardt, H. A., Areán, P. A., Hull, T. D. & Cohen, T. Deep representations of first-person pronouns for prediction of depression symptom severity. in AMIA Symposium vol. 2023 1226 (2024).

60.

Resnik, P., Garron, A. & Resnik, R. Using topic modeling to improve prediction of neuroticism and depression in college students. in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing 1348–1353 (2013).

61.

Rutowski, T. et al. Depression and anxiety prediction using deep language models and transfer learning. in 2020 7th International Conference on Behavioural and Social Computing 1–6 (IEEE, 2020).

62.

Shen, Y., Yang, H. & Lin, L. Automatic depression detection: An emotional audio-textual corpus and a GRU/BiLSTM-based model. in International Conference on Acoustics, Speech and Signal Processing 6247–6251 (IEEE, 2022).

63.

Shin, D. et al. Detection of depression and suicide risk based on text from clinical interviews using machine learning: possibility of a new objective diagnostic marker. Front. Psychiatry 13, 801301 (2022).

64.

Shin, D., Kim, H., Lee, S., Cho, Y. & Jung, W. Using large language models to detect depression from user-generated diary text data as a novel approach in digital mental health screening: instrument validation study. J. Med. Internet Res. 26, e54617 (2024).

65.

Smirnova, D. et al. Language patterns discriminate mild depression from normal sadness and euthymic state. Front. Psychiatry 9, 105 (2018).

66.

Smirnova, D. et al. Language in mild depression: How it is spoken, what it is about, and why it is important to listen. Psychiatria Danub. 31, 427–433 (2019).

67.

Sood, P., Yang, X. & Wang, P. Enhancing depression detection from narrative interviews using language models. in International Conference on Bioinformatics and Biomedicine 3173–3180 (IEEE, 2023).

68.

Tao, Y. et al. Classifying anxiety and depression through LLMs virtual interactions: a case study with ChatGPT. in International Conference on Bioinformatics and Biomedicine 2259–2264 (IEEE, 2023).

69.

Tlachac, M. & Rundensteiner, E. Screening for depression with retrospectively harvested private versus public text. IEEE J. Biomed. Health Inform. 24, 3326–3332 (2020).

70.

Tlachac, M. et al. StudentSADD: Rapid mobile depression and suicidal ideation screening of college students during the coronavirus pandemic. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 6, 1–32 (2022).

71.

Tlachac, M. et al. Text generation to aid depression detection: a comparative study of conditional sequence generative adversarial networks. in International Conference on Big Data 2804–2813 (IEEE, 2022).

72.

Tlachac, M., Shrestha, A., Shah, M., Litterer, B. & Rundensteiner, E. A. Automated construction of lexicons to improve depression screening with text messages. IEEE J. Biomed. Health Inform. 27, 2751–2759 (2022).

73.

Weber, S. et al. Using a fine-tuned large language model for symptom-based depression evaluation. NPJ Digit. Med. 8, 598 (2025).

74.

Wright-Berryman, J., Cohen, J., Haq, A., Black, D. P. & Pease, J. L. Virtually screening adults for depression, anxiety, and suicide risk using machine learning and language from an open-ended interview. Front. Psychiatry 14, 1143175 (2023).

75.

Xue, J. et al. Fusing multi-level features from audio and contextual sentence embedding from text for interview-based depression detection. in International Conference on Acoustics, Speech and Signal Processing 6790–6794 (IEEE, 2024).

76.

Ye, J. et al. Multi-modal depression detection based on emotional audio and evaluation text. J. Affect. Disord. 295, 904–913 (2021).

77.

Yuan, J. et al. Depressive tendency recognition using the gated recurrent unit from speech and text features. in 2021 International Conference on Asian Language Processing 42–46 (IEEE, 2021).

78.

Zhang, Z. et al. Multimodal sensing for depression risk detection: Integrating audio, video, and text data. Sensors 24, 3714 (2024).

79.

Zou, B. et al. Semi-structural interview-based Chinese multimodal depression corpus towards automatic preliminary screening of depressive disorders. IEEE Trans. Affect. Comput. 14, 2823–2838 (2022).

80.

Morales, M. R. & Levitan, R. Speech vs. text: A comparative analysis of features for depression detection systems. in Spoken Language Technology Workshop 136–143 (IEEE, 2016).

81.

Özkanca, Y., Demiroglu, C., Besirli, A. & Celik, S. Multi-Lingual Depression-Level Assessment from Conversational Speech Using Acoustic and Text Features. in Interspeech 3398–3402 (2018).

82.

Agarwal, N., Dias, G. & Dollfus, S. Agent-based splitting of patient-therapist interviews for depression estimation. in Empowering Communities: A Participatory Approach to AI for Mental Health (2022).

83.

Agarwal, N., Dias, G. & Dollfus, S. Multi-view graph-based interview representation to improve depression level estimation. Brain Inform. 11, 14 (2024).

84.

Al Hanai, T., Ghassemi, M. M. & Glass, J. R. Detecting depression with audio/text sequence modeling of interviews. in Interspeech 1716–1720 (2018).

85.

Ansari, G., Sharma, A., Arya, P. & Saxena, Y. Multimodal Depression Detection System Using Machine Learning. in 2023 Second International Conference on Informatics 1–7 (IEEE, 2023).

86.

Burdisso, S., Villatoro-Tello, E., Madikeri, S. & Motlicek, P. Node-weighted Graph Convolutional Network for Depression Detection in Transcribed Clinical Interviews. in Interspeech 3617–3621 (2023).

87.

Cao, Y., Hao, Y., Li, B. & Xue, J. Depression prediction based on BiAttention-GRU. J. Ambient Intell. Humaniz. Comput. 13, 5269–5277 (2022).

88.

Chen, Z. et al. Depression detection in clinical interviews with LLM-empowered structural element graph. in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 8181–8194 (2024).

89.

Correia, J., Trancoso, I. & Raj, B. Detecting psychological distress in adults through transcriptions of clinical interviews. in International Conference on Advances in Speech and Language Technologies for Iberian Languages 162–171 (Springer, 2016).

90.

Dang, T. et al. Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in AVEC 2017. in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge 27–35 (2017).

91.

Danner, M. et al. Advancing mental health diagnostics: GPT-based method for depression detection. in 2023 62nd Annual Conference of the Society of Instrument and Control Engineers 1290–1296 (IEEE, 2023).

92.

Fang, M., Peng, S., Liang, Y., Hung, C.-C. & Liu, S. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomed. Signal Process. Control 82, 104561 (2023).

93.

Firoz, N., Beresteneva, O. G., Vladimirovich, A. S., Tahsin, M. S. & Tafannum, F. Automated text-based depression detection using hybrid ConvLSTM and Bi-LSTM model. in 2023 Third International Conference on Artificial Intelligence and Smart Energy 734–740 (IEEE, 2023).

94.

Firoz, N., Beresteneva, O. G. & Aksyonov, S. V. Enhancing depression detection: Employing autoencoders and linguistic feature analysis with BERT and LSTM model. in 2023 International Russian Automation Conference 299–304 (IEEE, 2023).

95.

Guo, Y. et al. A prompt-based topic-modeling method for depression detection on low-resource data. IEEE Trans. Comput. Soc. Syst. 11, 1430–1439 (2023).

96.

Hadzic, B. et al. Enhancing early depression detection with AI: a comparative use of NLP models. SICE J. Control. Meas. Syst. Integr. 17, 135–143 (2024).

97.

Hong, S., Cohn, A. & Hogg, D. C. Using graph representation learning with schema encoders to measure the severity of depressive symptoms. in International Conference on Learning Representations (2022).

98.

Jo, A.-H. & Kwak, K.-C. Diagnosis of depression based on four-stream model of bi-LSTM and CNN from audio and text information. IEEE Access 10, 134113–134135 (2022).

99.

Kokkera, A., Varsha, N. & Vasanth, A. V. Multimodal Approach for Detecting Depression Using Physiological and Behavioural Data. in 2023 3rd International Conference on Pervasive Computing and Social Networking 53–65 (IEEE, 2023).

100.

Lam, G., Dongyan, H. & Lin, W. Context-aware deep learning for multi-modal depression detection. in International Conference on Acoustics, Speech and Signal Processing 3946–3950 (IEEE, 2019).

101.

Lau, C., Chan, W.-Y. & Zhu, X. Improving depression assessment with multi-task learning from speech and text information. in 2021 55th Asilomar Conference on Signals, Systems, and Computers 449–453 (IEEE, 2021).

102.

Lau, C., Zhu, X. & Chan, W.-Y. Automatic depression severity assessment with deep learning using parameter-efficient tuning. Front. Psychiatry 14, 1160291 (2023).

103.

Li, C., Braud, C. & Amblard, M. Multi-Task Learning for Depression Detection in Dialogs. in Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue. 1–8 (2022).

104.

Li, M., Xu, H., Liu, W. & Liu, J. Bidirectional LSTM and attention for depression detection on clinical interview transcripts. in IEEE 10th International Conference on Information, Communication and Networks (ICICN) 638–643 (IEEE, 2022).

105.

Li, M., Sun, X. & Wang, M. Detecting depression with heterogeneous graph neural network in clinical interview transcript. IEEE Trans. Comput. Soc. Syst. 11, 1315–1324 (2023).

106.

Lin, L., Chen, X., Shen, Y. & Zhang, L. Towards automatic depression detection: A BiLSTM/1D CNN-based model. Appl. Sci. 10, 8701 (2020).

107.

Lopez-Otero, P., Fernández, L. D., Abad, A. & Garcia-Mateo, C. Depression Detection Using Automatic Transcriptions of De-Identified Speech. in Interspeech 3157–3161 (2017).

108.

Lorenc, P., Uban, A.-S., Rosso, P. & Šedivý, J. Detecting early signs of depression in the conversational domain: The role of transfer learning in low-resource scenarios. in International Conference on Applications of Natural Language to Information Systems 358–369 (Springer, 2022).

109.

Lu, K.-C., Thamrin, S. A. & Chen, A. L. Depression detection via conversation turn classification. Multimed. Tools Appl. 82, 39393–39413 (2023).

110.

Rodrigues Makiuchi, M., Warnita, T., Uto, K. & Shinoda, K. Multimodal fusion of BERT-CNN and gated CNN representations for depression detection. in Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop 55–63 (2019).

111.

Mallol-Ragolta, A., Zhao, Z., Stappen, L., Cummins, N. & Schuller, B. A hierarchical attention network-based approach for depression detection from transcribed clinical interviews. in Interspeech 221–225 (2019).

112.

Mao, K. et al. Prediction of depression severity based on the prosodic and semantic features with bidirectional LSTM and time distributed CNN. IEEE Trans. Affect. Comput. 14, 2251–2265 (2022).

113.

Milintsevich, K., Sirts, K. & Dias, G. Towards automatic text-based estimation of depression through symptom prediction. Brain Inform. 10, 4 (2023).

114.

Niu, M., Chen, K., Chen, Q. & Yang, L. HCAG: A hierarchical context-aware graph attention model for depression detection. in International Conference on Acoustics, Speech and Signal Processing 4235–4239 (IEEE, 2021).

115.

Pampouchidou, A. et al. Depression assessment by fusing high and low level features from audio, video, and text. in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge 27–34 (2016).

116.

Prabhu, S., Mittal, H., Varagani, R., Jha, S. & Singh, S. Harnessing emotions for depression detection. Pattern Anal. Appl. 25, 537–547 (2022).

117.

Qureshi, S. A., Saha, S., Hasanuzzaman, M. & Dias, G. Multitask representation learning for multimodal estimation of depression level. IEEE Intelligent Systems 34, 45–52 (2019).

118.

Qureshi, S. A., Dias, G., Hasanuzzaman, M. & Saha, S. Improving depression level estimation by concurrently learning emotion intensity. IEEE Comput. Intell. Mag. 15, 47–59 (2020).

119.

Qureshi, S. A., Dias, G., Saha, S. & Hasanuzzaman, M. Gender-aware estimation of depression severity level in a multimodal setting. in 2021 International Joint Conference on Neural Networks 1–8 (IEEE, 2021).

120.

Rasipuram, S., Bhat, J. H., Maitra, A., Shaw, B. & Saha, S. Multimodal depression detection using task-oriented transformer-based embedding. in 2022 IEEE Symposium on Computers and Communications 01–04 (IEEE, 2022).

121.

Ray, A., Kumar, S., Reddy, R., Mukherjee, P. & Garg, R. Multi-level attention network using text, audio and video for depression prediction. in Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop 81–88 (2019).

122.

Rinaldi, A., Tree, J. E. F. & Chaturvedi, S. Predicting depression in screening interviews from latent categorization of interview prompts. in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 7–18 (2020).

123.

Rohanian, M., Hough, J. & Purver, M. Detecting depression with word-level multimodal fusion. in Interspeech 1443–1447 (2019).

124.

Sadeghi, M. et al. Exploring the capabilities of a language model-only approach for depression detection in text data. in International Conference on Biomedical and Health Informatics 1–5 (IEEE, 2023).

125.

Sadeghi, M. et al. Harnessing multimodal approaches for depression detection using large language models and facial expressions. NPJ Ment. Health Res. 3, 66 (2024).

126.

Samareh, A., Jin, Y., Wang, Z., Chang, X. & Huang, S. Predicting depression severity by multi-modal feature engineering and fusion. in Proceedings of the AAAI Conference on Artificial Intelligence vol. 32 (2018).

127.

Senn, S., Tlachac, M., Flores, R. & Rundensteiner, E. Ensembles of BERT for depression classification. in 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society 4691–4694 (IEEE, 2022).

128.

Stasak, B., Epps, J. & Goecke, R. Elicitation Design for Acoustic Depression Classification: An Investigation of Articulation Effort, Linguistic Complexity, and Word Affect. in Interspeech vol. 17 834–838 (2017).

129.

Stepanov, E. A. et al. Depression severity estimation from multiple modalities. in 2018 IEEE 20th International Conference on e-Health Networking, Application and Services 1–6 (IEEE, 2018).

130.

Sun, B. et al. A random forest regression method with selected-text feature for depression assessment. in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge 61–68 (2017).

131.

Toto, E., Tlachac, M. & Rundensteiner, E. A. AudiBERT: A deep transfer learning multimodal classification framework for depression screening. in Proceedings of the 30th ACM International Conference on Information & Knowledge Management 4145–4154 (2021).

132.

Marriwala, N. & Chaudhary, D. A hybrid model for depression detection using deep learning. Meas.: Sensors 25, 100587 (2023).

133.

Van Steijn, F., Sogancioglu, G. & Kaya, H. Text-based interpretable depression severity modeling via symptom predictions. in Proceedings of the 2022 International Conference on Multimodal Interaction 139–147 (2022).

134.

Villatoro-Tello, E., Ramírez-de-la-Rosa, G., Gática-Pérez, D., Magimai.-Doss, M. & Jiménez-Salazar, H. Approximating the mental lexicon from clinical interviews as a support tool for depression detection. in Proceedings of the 2021 International Conference on Multimodal Interaction 557–566 (2021).

135.

Williamson, J. R. et al. Detecting depression using vocal, facial and semantic communication cues. in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge 11–18 (2016).

136.

Xezonaki, D., Paraskevopoulos, G. & Potamianos, A. Affective Conditioning on Hierarchical Attention Networks applied to Depression Detection from Transcribed Clinical Interviews. in Interspeech 4556–4560 (2020).

137.

Xia, Y. et al. A depression detection model based on multimodal graph neural network. Multimed. Tools Appl. 83, 63379–63395 (2024).

138.

Xiao, J., Huang, Y., Zhang, G. & Liu, W. A deep learning method on audio and text sequences for automatic depression detection. in 2021 3rd International Conference on Applied Machine Learning 388–392 (IEEE, 2021).

139.

Xu, X., Zhang, G., Lu, Q. & Mao, X. Multimodal depression recognition that integrates audio and text. in 2023 4th International Symposium on Computer Engineering and Intelligent Communications 164–170 (IEEE, 2023).

140.

Yadav, U. & Sharma, A. K. A novel automated depression detection technique using text transcript. Int. J. Imaging Syst. Technol. 33, 108–122 (2023).

141.

Yang, L. et al. Hybrid depression classification and estimation from audio video and text information. in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge 45–51 (2017).

142.

Yang, L. et al. Multimodal measurement of depression using deep learning models. in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge 53–59 (2017).

143.

Yang, L., Jiang, D. & Sahli, H. Integrating deep and shallow models for multi-modal depression analysis—hybrid architectures. IEEE Trans. Affect. Comput. 12, 239–253 (2018).

144.

Yang, C., Lai, X., Hu, Z., Liu, Y. & Shen, P. Depression tendency screening use text based emotional analysis technique. J. Phys.: Conf. Ser. 1237, 032035 (2019).

145.

Zhang, Z., Lin, W., Liu, M. & Mahmoud, M. Multimodal deep learning framework for mental disorder recognition. in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition 344–350 (IEEE, 2020).

146.

Zhang, Y., Wang, Y., Wang, X., Zou, B. & Xie, H. Text-based decision fusion model for detecting depression. in Proceedings of the 2020 2nd Symposium on Signal Processing Systems 101–106 (2020).

147.

Zhang, W., Mao, K. & Chen, J. A Multimodal approach for detection and assessment of depression using text, audio and video. Phenomics 4, 234–249 (2024).

148.

Zhang, J. & Guo, Y. Multilevel depression status detection based on fine-grained prompt learning. Pattern Recog. Lett. 178, 167–173 (2024).

149.

Zhao, Z. & Wang, K. Unaligned multimodal sequences for depression assessment from speech. in 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society 3409–3413 (IEEE, 2022).

Fig. 1

Flow Diagram of Study Selection

Fig. 2

Forest plot of study-level classification accuracy with pooled random-effects estimate

Fig. 3

Galbraith (radial) plot assessing heterogeneity across studies in classification accuracy

Note. The Galbraith (radial) plot displays each study’s standardized effect size (z_i) against its statistical precision (1 / √(v_i + τ²)), the inverse of the standard error, which indexes the reliability of each study’s estimate. The x-axis shows precision: studies farther to the right are more precise (smaller SE). The y-axis shows the standardized effect size: points near the top or bottom deviate more from the pooled effect. The solid line represents the pooled effect from the random-effects model, and the shaded 95% confidence region indicates the expected range of variation.

In this analysis, most studies fall within the shaded region, suggesting that the observed heterogeneity (I² = 98.8%) reflects broad variability across studies rather than the influence of a single outlier.
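The coordinates described in the note can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis code used for Fig. 3: it assumes the conventional radial-plot definition z_i = y_i / √(v_i + τ²), and the function names and input values are hypothetical. Under that definition, the slope of the least-squares line through the origin equals the inverse-variance weighted (random-effects) pooled estimate, and a study lies inside the 95% band when its vertical distance from that line is at most 1.96.

```python
import math

def galbraith_points(effects, variances, tau2):
    """Return (precision, z) pairs for a random-effects Galbraith plot."""
    points = []
    for y, v in zip(effects, variances):
        precision = 1.0 / math.sqrt(v + tau2)  # x-axis: 1 / sqrt(v_i + tau^2)
        z = y * precision                      # y-axis: standardized effect size
        points.append((precision, z))
    return points

def pooled_slope(points):
    """Slope of the least-squares line through the origin (pooled effect)."""
    return sum(x * z for x, z in points) / sum(x * x for x, _ in points)

def inside_band(points, slope, half_width=1.96):
    """Flag studies whose z lies within +/- half_width of the pooled line."""
    return [abs(z - slope * x) <= half_width for x, z in points]

# Hypothetical study-level effects, sampling variances, and tau^2 (not from the review)
effects = [0.70, 0.80, 0.90]
variances = [0.010, 0.020, 0.015]
tau2 = 0.005

points = galbraith_points(effects, variances, tau2)
slope = pooled_slope(points)
flags = inside_band(points, slope)
```

Studies flagged False would fall outside the shaded region and would be candidates for closer inspection as potential outliers.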

Fig. 4

Forest Plot of Classification Accuracy from Sensitivity Analysis Excluding Studies

Because the DAIC dataset and its derivatives were used repeatedly across numerous studies, with overlapping samples and similar recording procedures, their inclusion could bias descriptive summaries toward the characteristics of that single corpus. Therefore, in the dataset characteristics section, DAIC-based studies are summarized separately from those using unique datasets to better represent the diversity of data sources included in the review.

Many studies tested multiple NLP feature extraction and classification methods. The counts reported here represent the best-performing approach in each study rather than the only method used. For example, complex models were often compared with simpler or traditional approaches to test whether they improved performance.