Language-Based Detection of Depression with Machine Learning: Systematic Review and Meta-Analysis
Hadar Fisher, PhD1,2
Nigel M. Jaffe, BA1
Kristina Pidvirny, BA1
Anna O. Tierney, BA1
Mia S. Vaidean1
Poorvesh Dongre, PhD3
Christian A. Webb, PhD1,2
1 McLean Hospital, Belmont, MA, USA
2 Department of Psychiatry, Harvard Medical School, Boston, MA, USA
3 Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
Corresponding author:
Hadar Fisher, Ph.D.
Harvard Medical School & McLean Hospital
115 Mill Street, Belmont, MA 02478.
Email: hbfisher@mclean.harvard.edu
Phone: 617-564-6140
Fax: 617-855-4231
Abstract
Early detection of depression is critical for timely intervention. Natural language processing (NLP) and machine learning (ML) approaches have increasingly been used to automatically detect depression from text data, yet comprehensive evidence regarding their diagnostic performance remains limited.
We systematically reviewed and meta-analyzed studies applying NLP and ML to identify depression from spoken or written language. Six electronic databases and additional sources were searched, yielding 892 full-text articles, of which 123 met inclusion criteria. One representative result per dataset was selected for quantitative synthesis, resulting in 50 independent studies. Pooled accuracy across studies (k = 43; n = 40,983) was 0.80 (95% CI, 0.76–0.83). Precision (k = 28) was 0.78 (95% CI, 0.72–0.83), recall (k = 33) 0.76 (95% CI, 0.68–0.83), AUC (k = 14) 0.79 (95% CI, 0.70–0.85), and balanced accuracy (k = 16) 0.71 (95% CI, 0.63–0.78). Subgroup analyses showed significant differences by language, text source, feature type, and classifier (all p < .001). Accuracy was highest in studies using structured clinical interviews, non-English languages, and linguistic or embedding-based features. However, in one-at-a-time meta-regressions, only text source remained a significant predictor (QM(3) = 8.78, p = .032), explaining 13.6% of the between-study variance. Publication bias was minimal. Automated depression detection from text shows promising performance with substantial heterogeneity. Performance varies by language, data source, feature extraction, and model type. Findings highlight both current limitations and potential of text-based depression detection and underscore the need for methodological standardization and validation before clinical use.
Key words:
Depression
Artificial intelligence
Natural Language Processing
Machine learning
Introduction
Depression is among the most prevalent psychiatric disorders worldwide and a leading cause of disability.1 Its global prevalence has risen steadily over the past three decades, with the United States alone seeing depressive symptoms increase approximately threefold from 2017 to 2020.2 Despite this high burden, depression remains substantially underdiagnosed and undertreated, with many individuals never seeking professional help or doing so only after symptoms have become severe and impairing.3 Untreated depression is associated with worse prognosis, highlighting the importance of early identification for timely intervention and improved outcomes.4,5 Emerging digital health tools offer opportunities to expand access and may enable earlier depression detection.6
One promising avenue for automated depression detection lies in the analysis of language.7 Advances in natural language processing (NLP) and machine learning (ML) now enable scalable, automated analysis of linguistic data to detect depression.8 The use of NLP and ML to detect depression has grown exponentially in recent years. However, despite this progress, it remains uncertain how accurate these approaches are overall and whether their performance generalizes reliably across different contexts. Importantly, much of this work has been published in computer science journals, making it less accessible to psychiatric researchers and clinicians who might benefit from these insights. Clinical researchers must be engaged and informed to critically evaluate, guide, and translate these tools into meaningful clinical use.
Several prior reviews have mapped the landscape of NLP and ML approaches for detecting depression, summarizing common text sources, feature extraction strategies, and model architectures.8–15 However, despite the rapid expansion of this research area, none of these reviews quantitatively evaluated the accuracy of these models across studies. In this systematic review and meta-analysis, we evaluated the performance of NLP- and ML-based models in detecting depression using datasets where depression status was determined independently (e.g., through validated questionnaires or clinical diagnoses). We further examined study-level moderators, including language, text source, feature type, and model class.
Methods
This systematic review and meta-analysis evaluated the performance of NLP and ML models for detecting depression from text data. The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) reporting guidelines (Table S1).16 The study protocol was preregistered with the International Prospective Register of Systematic Reviews (PROSPERO; ID CRD42024513390).
Screening and review management were conducted using Covidence software (Veritas Health Innovation, Melbourne, Australia). In the first and second phases (title and abstract review, full text review), two independent raters screened each record. Disagreements were resolved by consensus between reviewers. To ensure consistency and reliability, all reviewers participated in six structured training sessions prior to data collection: two sessions each for title screening, abstract screening, and data extraction. Meetings continued periodically throughout the screening and extraction period.
Data Sources and Search Strategy
We systematically searched both computer science and psychiatric databases to capture the multidisciplinary literature on this topic. Searches were conducted during March 2024 in ScienceDirect, IEEE Xplore, ACM Digital Library, Scopus, MEDLINE, PubMed, and PsycINFO. The search was supplemented in September 2025 by a hand search of reference lists and relevant reviews. The search string combined disorder-specific and computational terms: “depress*” AND (“language” OR “text”). Searches were restricted to peer-reviewed articles published in English.
Study Eligibility Criteria
Inclusion criteria required that studies: (1) used adult participants’ (≥ 18 years) own spoken or written text data to detect (concurrent) depression; (2) employed formal measures of depression (e.g., structured clinical interview, self-report questionnaire, clinical diagnoses); (3) applied machine learning (ML) models to predict the person’s depression status (i.e., binary classification) or depression severity; (4) relied solely on textual input for model development and prediction (i.e., studies incorporating additional modalities such as audio, video, or demographic variables were excluded if they lacked a text-specific model); (5) were peer-reviewed and fully available in scientific databases; and (6) provided sufficient information to extract classification performance (e.g., accuracy, precision, recall, F1, or equivalent).
For the purpose of this review, we included only studies that evaluated model performance on data held out from model training (e.g., through cross-validation or an independent test set), ensuring that reported estimates reflected true predictive capacity rather than overfitting to training data. Large Language Models (LLMs) that were used directly for text classification were not required to include additional validation, as these models are pretrained on large, diverse corpora and apply learned representations without task-specific parameter fitting, reducing the risk of overfitting. However, when LLMs were used solely for feature extraction (e.g., deriving sentiment or emotion scores) and the extracted features were subsequently entered into a separate classification or prediction model to detect depression, such studies were required to include a validation process to ensure the robustness and generalizability of the results.
Studies relying exclusively on social media text were excluded. We made the decision to exclude social media text in order to ensure the inclusion of high-quality studies with clear clinical relevance. Social media data pose several limitations15: first, depressed individuals may change their online activity, leading to underrepresentation and biased samples. Second, in many social media datasets, depression status is inferred directly from the text, for example, by external annotators rating whether the text contains depression-related content, so the same material is used both to define and to predict depression. This circular annotation strategy undermines label validity and limits generalizability to clinically assessed depression.
Data Extraction and Risk of Bias Assessment
HF and two reviewers per study independently extracted data from eligible studies using a standardized template and assessed risk of bias using the ML-specific quality assessment tool developed by Ciharova et al.,17 adapted from established instruments (PROBAST,18 Cochrane RoB 2.0,19 ROBINS-I20). When studies reported results from multiple ML models (e.g., based on different feature sets or classifiers), we extracted the performance metrics of the model that achieved the highest performance. Any discrepancies were resolved in group discussion until consensus was reached. In addition to reading the full text of each article, extracted responses were validated using NotebookLM, which supported but never replaced reviewer judgment. This process was implemented to maximize accuracy and agreement while ensuring that decisions were grounded in direct review of the original studies.
Extracted variables included: study sample size; text source and language; depression measure; feature extraction method (e.g., statistical, linguistic, embeddings, transformer, hybrid); classifier type (traditional machine learning, neural, transformer); type of validation (e.g., k-fold cross validation, train-test split); and reported outcomes (e.g., accuracy, precision, recall, and other model performance metrics).
Given the absence of a standardized framework for evaluating the quality of machine learning studies, we adopted the quality assessment approach used in prior research on ML-based prediction models.17 In line with that framework, study quality was assessed across four domains: (1) adequate sample size (≥ 100 participants); (2) balanced class distribution in cases where participants are divided into multiple groups, such as depressed vs. non-depressed (no group more than ten times smaller than the others); (3) appropriate model validation (model parameters tuned on a training set and evaluated on an independent test set); and (4) use of a validated outcome measure assessing depression presence or severity (i.e., the ground truth). These criteria were designed to minimize risk of overfitting and enhance generalizability to unseen data. Since using validation was an inclusion criterion, item 3 was not relevant and is excluded from the quality assessment table. See Table S2 for the risk of bias table.
Outcomes
The primary outcome was pooled classification accuracy, which quantifies the overall proportion of classifications made by the model that were correct, or in this case, the overall rate at which the model correctly identifies both depressed and non-depressed individuals. Secondary outcomes included precision (positive predictive value), sensitivity (recall), and the area under the receiver operating characteristic curve (AUC). Precision is the proportion of correctly predicted positive cases among all predicted positive cases, whereas sensitivity is the proportion of actual positive cases correctly identified. High precision reflects fewer false alarms, while high sensitivity reflects fewer missed cases. AUC considers both sensitivity and specificity in its calculation, showing the balance between predicting a positive outcome when the outcome is indeed positive and predicting a negative outcome when the outcome is indeed negative. Mathematically, it represents the area under the curve plotting sensitivity (true positive rate) against 1 – specificity (false positive rate). When possible, we also calculated balanced accuracy, defined as the average of sensitivity and specificity. By giving equal weight to correctly identifying positive and negative cases regardless of their prevalence, it provides a less sample-biased estimate of model performance, particularly in datasets where one class (e.g., depressed or non-depressed) is more frequent than the other.
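The outcome definitions above can be made concrete with a minimal sketch that derives each metric from a 2×2 confusion matrix. The class name, function name, and example counts are hypothetical and chosen only for illustration; they do not come from any included study. The sketch assumes every denominator is non-zero.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts from a binary depressed vs. non-depressed classification."""
    tp: int  # depressed, predicted depressed
    fp: int  # non-depressed, predicted depressed
    tn: int  # non-depressed, predicted non-depressed
    fn: int  # depressed, predicted non-depressed


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return the outcome metrics defined in the text (assumes non-zero denominators)."""
    total = c.tp + c.fp + c.tn + c.fn
    sensitivity = c.tp / (c.tp + c.fn)   # recall: share of actual positives found
    specificity = c.tn / (c.tn + c.fp)   # share of actual negatives found
    precision = c.tp / (c.tp + c.fp)     # share of predicted positives that are correct
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical imbalanced sample: 30 depressed and 170 non-depressed participants.
example = ConfusionCounts(tp=20, fp=10, tn=160, fn=10)
metrics = classification_metrics(example)
print({name: round(value, 3) for name, value in metrics.items()})
```

In this made-up sample, accuracy is noticeably higher than balanced accuracy because the majority (non-depressed) class dominates the raw proportion of correct classifications.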
The F1-score was used for interpretation in the systematic review but was not included as an outcome in the meta-analysis. This is because it is a harmonic mean of two ratios that already contain partial information about sample size, making its sampling variance difficult to estimate for meta-analytic weighting.
Statistical Analysis
We extracted both qualitative and quantitative data from each selected study. Meta-analysis was conducted for studies where data availability permitted summary estimation with 95% confidence intervals (CI). Accuracy and related performance metrics were first logit-transformed to stabilize variances and normalize distributions before pooling. Primary pooled estimates were obtained using random-effects models fit with maximum likelihood estimation. To assess heterogeneity, we used τ², which quantifies the absolute magnitude of the true between-study variance, and I², which indicates the percentage of the total observed variance that is attributable to real differences in effect sizes rather than to sampling error. Publication bias was evaluated through visual inspection of funnel plots and the Galbraith (radial) plot, Egger’s regression tests, and the trim-and-fill method. As a sensitivity analysis, leave-one-out analyses assessed the influence of individual studies. We repeated the same procedure with precision, sensitivity, and AUC.
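As a rough illustration of how τ², I², and a random-effects pooled estimate relate to one another, the sketch below pools logit-scale estimates with the closed-form DerSimonian-Laird estimator. This is a simpler stand-in for the maximum likelihood fit described above, so its output would not be expected to match the reported values exactly. The function name and the toy inputs are assumptions made for illustration.

```python
import math


def pool_random_effects(yi, vi, z=1.96):
    """Pool logit-scale estimates yi (sampling variances vi) with DerSimonian-Laird.

    Returns tau2, I2 (in percent), the pooled logit, and the pooled proportion
    with its 95% CI back-transformed to the probability scale.
    """
    k = len(yi)
    if k < 2:
        raise ValueError("at least two studies are needed to estimate heterogeneity")

    # Fixed-effect (inverse-variance) weights and Cochran's Q.
    w = [1.0 / v for v in vi]
    sum_w = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, yi)) / sum_w
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, yi))

    # Between-study variance (tau2) and share of variance due to heterogeneity (I2).
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0

    # Random-effects weights add tau2 to each study's sampling variance.
    w_re = [1.0 / (v + tau2) for v in vi]
    mu = sum(wi * y for wi, y in zip(w_re, yi)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "tau2": tau2,
        "i2": i2,
        "pooled_logit": mu,
        "pooled": inv_logit(mu),
        "ci_low": inv_logit(mu - z * se),
        "ci_high": inv_logit(mu + z * se),
    }


# Toy input: two studies with logit accuracies 0 and 2 and unit sampling variance.
toy = pool_random_effects([0.0, 2.0], [1.0, 1.0])
print(toy)
```

A larger τ² flattens the random-effects weights, so small and large studies contribute more equally than they would under a fixed-effect model.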
To enable pooled estimation across studies that reported performance metrics but not underlying classification counts (i.e., confusion-matrix data), we reconstructed approximate event counts for AUC, precision, and recall following the same logic applied to accuracy. For each study, the number of “events” was computed as the reported metric multiplied by the total sample size, rounded to the nearest integer, with the remainder treated as “failures.” This reconstruction allowed standard error estimation based on the binomial distribution and ensured consistent weighting across studies with differing sample sizes. While AUC reflects overall discriminative ability and precision and recall represent class-specific performance, all three were analyzed using the same framework to ensure comparability. Using total sample size instead of class- or threshold-specific denominators provided a conservative approximation for variance estimation across studies. Effect sizes were calculated using logit transformation of proportions (PLO). A continuity correction of 0.5 was applied to stabilize estimates when proportions were 0 or 1. All pooled estimates and CIs were back-transformed to the probability scale for interpretability.
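The reconstruction described above can be sketched as follows: a reported metric and the total sample size are converted into approximate event and failure counts, which then give a logit effect size and its binomial sampling variance. The function names are hypothetical, and the continuity-correction handling (adding 0.5 to both counts only when one of them is zero) is an assumption about how the correction was applied.

```python
import math


def logit_effect(metric: float, n: int, correction: float = 0.5):
    """Approximate logit effect size and variance from a reported proportion.

    metric: reported performance value in [0, 1] (e.g., accuracy or recall)
    n:      total sample size used as the binomial denominator
    """
    events = round(metric * n)   # reconstructed "successes"
    failures = n - events        # remainder treated as "failures"

    # Continuity correction keeps the logit finite when a count is zero.
    if events == 0 or failures == 0:
        events += correction
        failures += correction

    yi = math.log(events / failures)     # logit of the proportion
    vi = 1.0 / events + 1.0 / failures   # binomial-based sampling variance
    return yi, vi


def inv_logit(y: float) -> float:
    """Back-transform a logit to the probability scale."""
    return 1.0 / (1.0 + math.exp(-y))


# Hypothetical study reporting accuracy 0.80 on 100 participants.
yi, vi = logit_effect(0.80, 100)
print(round(yi, 4), round(vi, 4), round(inv_logit(yi), 2))

# Hypothetical study reporting a perfect score on 10 participants.
yi_edge, vi_edge = logit_effect(1.0, 10)
print(round(yi_edge, 4), round(vi_edge, 4))
```

Because the variance shrinks as both counts grow, larger studies receive more weight in the pooled estimate, which is the consistent weighting the reconstruction is meant to provide.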
Subgroup analyses compared pooled accuracies across language (English, Chinese, Other), source of text (clinical interview, open-ended questions, communication, interaction with therapist), feature extraction method (simple, linguistic, embeddings, transformer, hybrid), and classifier type (traditional, neural, transformer). Between-group heterogeneity (Qbetween) was estimated using random-effects models with restricted maximum likelihood estimation. To examine whether these factors explained variability in accuracy, we then ran one-at-a-time meta-regressions with omnibus tests (QM) and R² based on reductions in τ². All analyses were performed in R, version 4.3.3, using the metafor package (version 4.8.0). Code is available at https://osf.io/x7tm9.
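The R² reported for a meta-regression is the proportional reduction in τ² once a moderator is added. A minimal sketch of that calculation follows; the function name and the τ² values are illustrative placeholders, not estimates from this review.

```python
def pseudo_r2(tau2_null: float, tau2_moderator: float) -> float:
    """Share of between-study variance explained by a moderator.

    tau2_null:      tau2 from the random-effects model without moderators
    tau2_moderator: residual tau2 from the model including the moderator
    The result is truncated at zero, since a moderator cannot explain
    a negative share of heterogeneity.
    """
    if tau2_null <= 0:
        return 0.0
    return max(0.0, (tau2_null - tau2_moderator) / tau2_null)


# Placeholder values: residual heterogeneity drops from 0.40 to 0.30.
r2 = pseudo_r2(0.40, 0.30)
print(f"{r2:.1%} of between-study variance explained")
```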
Results
The search yielded 35,000 records. After removing duplicates and screening titles and abstracts, 892 full-text articles were reviewed, and 123 publications contributing 129 effects were included in the qualitative synthesis (Fig. 1). When articles reported results based on multiple datasets, each dataset-specific effect was treated as a separate study.
Over half of the included studies (73/129, 56.6%) used the Distress Analysis Interview Corpus (DAIC) or related datasets derived from it, including three studies that used DAIC alongside another dataset. The DAIC is a well-known benchmark dataset comprising semi-structured clinical interviews. Its Wizard of Oz (DAIC-WOZ) subset,21 released as part of the AVEC 2016 challenge, includes 189 interviews conducted by a virtual agent, with audio, video, and transcript recordings available. An extended version (E-DAIC)22 was later released for AVEC 2019, expanding the sample to 275 interviewed participants. In both datasets, depression severity was labeled using the Patient Health Questionnaire-8 (PHQ-8)23. Table 1 summarizes studies using unique datasets, and Table 2 summarizes those using DAIC data.
Table 1
Description of Studies Included in the Systematic Review and Meta-Analyses
| Study | Language | N | Diagnosis | Source of text | Feature extraction method | Outcome | Type of classification model | Validation | Accuracy | F1 | AUC | Specificity | Precision | Recall/Sensitivity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Abilkaiyrkyzy 2024²⁹ | English | 20 (+ 275 trained on E-DAIC) | Mildly depressed: 8; Moderately depressed: 6; Not depressed: 4 | Open-ended questions | Transformer (BERT tokenizer and LanguageModelFeaturizer) | PHQ-9 | Fine-tuned BERT sequence classifier for multi-class depression severity (softmax output) | Trained on E-DAIC, tested on a sample of 20 university students | 0.65 |  |  |  |  |  |
| Aloshban 2021³⁰ | Italian | 59 | Depressed: 29; Not depressed: 30 | Interviewed about everyday life aspects (e.g., activities on the weekend or interaction with family members) | Embedding (Wikipedia2Vec) | Professional psychiatrists’ diagnosis | BiLSTM | 5-fold cross-validation | 0.729 | 0.619 |  |  | 1 | 0.448 |
| Antoniou 2022³¹ | English | 773 (270) | Depression/stress: 356 sessions of 184 patients; Other problems: 417 sessions of 86 patients | Interaction with therapist (text-based counseling) | LIWC | Patients report presenting problem before the first interaction | Quadratic discriminant analysis | 5-fold cross-validation | 0.778 | 0.71 | 0.76 |  |  |  |
| Banerjee 2021³² | English | 1999 | Unclear, 71.4% from the data before cleaning | Open-ended questions | Embedding (doc2vec); Affective features; Word polarity; Linguistic tags (e.g., Proper Noun Tag, Singular Noun Tag) | PHQ-9 | CNN-Dynamic Attention | 60-20-20 random train-validation-test split |  | 0.644 |  |  |  |  |
| Boian 2025³³ | Romanian | 3955 (861) | Per-item classification: Not at all (NO): 457; Several days (SD): 1,063; More than half the days (HA): 446; Nearly every day (EV): 442; Irrelevant (IR): 227 | Clinical interview (conducted by aiCARE chatbot) | TF-IDF | PHQ-9 | Logistic regression | Split into train and test sets; Train: 1320; Test: 2635 | 0.840 | 0.80 |  | 0.80 | 0.80 | 0.80 |
| Burkhardt 2022³⁴ | English | 13327 (6551) | Unclear | Interaction with therapist | LIWC; Embedding (BERT-based model for GoEmotions was used to extract emotion features) | PHQ-9 | Random forest | Training set (80%): 4,913 patients, 10,006 observations; Test set (20%): 1,638 patients, 3,321 observations |  | 0.520 | 0.67 |  | 0.612 | 0.453 |
| Cao 2025³⁵ | Chinese | 50 (the full sample included 100 patients; the best model included 50) | Very severe: 37; Mild: 27; Severe: 19; Normal: 13; Moderate: 13 | Clinical interview | LLM (Qwen2.5-7B-Instruct) | HAMD-17 | LLM: Qwen2.5-7B-Instruct fine-tuned with LoRA | Leave-one-out cross-validation |  | 0.61 |  |  | 0.61 | 0.61 |
| Chen 2024³⁶ (CMDC) (Same as Zou 2023) | Chinese | 78 | Depressed: 26; Not depressed: 52 | Clinical interview | Transformer (the Chinese-BERT); Xmnlp: Word-level features: ratios of adjectives, adverbs, exclamations, verbs, auxiliary words, modal particles, and total word count; Sentence-level features: number of sentences, ratio of positive and negative sentences, and overall sentiment score; Lexical-emotion feature: proportion of modal words | MINI | IIFDD | 5-fold cross-validation | 0.87* | 0.8 |  |  | 0.82 | 0.79 |
| Chen 2024³⁶ (EATD) (Same as Shen 2022)* | Chinese | 162 | Depressed: 30; Not depressed: 132 | General interview | Same as Chen 2024a (above) | Self-rating Depression Scale (SDS) | IIFDD | 3-fold cross-validation |  | 0.45 |  |  | 0.36 | 0.70 |
| Cohen 2023³⁷ | English | 73 (68) | Depressed: 15; Control: 58 | Interaction with an online agent, Tina | TF-IDF | PHQ-9 | SVM | Leave-one-subject-out |  |  | 0.54 |  |  |  |
| Cook 2016³⁸ | Spanish | 1458 | Depressed: 662; Not depressed: 796 | Free-text responses to the question: "how do you feel today?" | n-gram | GHQ-12 | Logistic regression | 50% split into train and test | 0.53 | 0.42 |  | 0.79 | 0.64 | 0.31 |
| deHond 2024³⁹ | English | 4070 | Depressed: 127; Not depressed: 3943 | Patient (cancer)-generated emails to their health care teams | Transformer (BERT) | ICD-9 and ICD-10 codes obtained from electronic health record data | LASSO logistic regression | Train (67%): 2713; Test (33%): 1357 | 0.925 | 0.091 | 0.54 | 0.95 | 0.925 | 0.13 |
| Demiroglu 2020⁴⁰ | Turkish | 77 (70) | Depressed: 50 records; Not depressed: 27 records | Interview, with 3 types of questions: neutral, positive, and negative | Average length of the utterances, subjects in negative, positive, and neutral answers separately. Three-dimensional feature. Rate of speech for negative, positive, and neutral answers. Sentiments of the question-answer pairs. | BDI | SVM | Leave-one-out | 0.65* | 0.68 |  |  | 0.76 | 0.67 |
| Demiroglu 2020⁴⁰ | German | 100 (84) | Depressed: 44 records; Not depressed: 56 records | Interview, general questions (e.g., “What is your favorite dish?”) | Same as above | BDI | SVM | Leave-one-out | 0.85* | 0.77 |  |  | 0.89 | 0.75 |
| Gao 2024a⁴¹ | Chinese | 156 | Depressed: 77; Not depressed: 79 | Responses to four questions about recent events, sleep, mood, and suicidal tendencies | Transformer (BERT and an improved TextCNN) | Medical records | Dual-branch BERT + improved TextCNN model | Train: 94 (60%); Validation: 31 (20%); Test: 31 (20%) | 0.942 | 0.947 |  |  | 0.931 | 0.964 |
| Guo 2024⁴² | Chinese | 524 | Depressed: 59; Not depressed: 465 | Clinical interview | Transformer (EmoLLM + GraphRAG) | HAMD, HAMA | EmoLLM | N/A | 0.84 | 0.49 |  |  | 0.38 | 0.68 |
| Hayati 2022⁴³ | Dialectal Malay | 53 | Depressed: 11; Not depressed: 42 | Clinical interview | Transformer (GPT-3) | BDI | GPT-3 (performance compared using 2–10 examples) | N/A | 0.71 | 0.67 |  |  |  |  |
| He 2022 (same data as Yuan 2021)⁴⁴ | Chinese | 108 | Depressed: 54; Not depressed: 54 | Picture description and question-answering tasks | Embedding (GloVe) | BDI-II and PHQ-9 | GRU-based RNN | 8:1:1 random train-validation-test split | 0.659 | 0.631 |  |  | 0.688 | 0.583 |
| Howes 2014⁴⁵ | English | 882 (167) | Unclear | Interaction with therapist (text-based counseling) | n-gram | PHQ-9 | Logistic regression | 10-fold cross-validation |  | 0.686 |  |  |  |  |
| Iyortsuun 2024⁴⁶ (Same as Shen 2022) | Chinese | 162 | Depressed: 30; Not depressed: 132 | General interview | Transformer (Transformer-based, USE-large) | SDS | BiLSTM + Attention | 3-fold cross-validation | 0.606 | 0.66 |  |  | 0.79 | 0.58 |
| Joharee 2023⁴⁷ | Bahasa Malaysia | 511 (172) | Unclear (in the test set, 28 depressed and 23 not depressed) | 3 open-ended questions | TF-IDF | BDI-II and PHQ-9 | Extra Tree Classifier | Split 70% training and 30% test | 0.73 | 0.63 |  |  |  |  |
| Krishnamurti 2022⁴⁸ | English | 1007 (666) | Not depressed: 48.2%; Mild: 38.3%; Moderate: 10.5%; Severe: 3.0% | Open-ended questions documenting their pregnancy journey | LIWC; Embedding (Word2Vec); Latent Dirichlet Allocation (LDA); SentiWordNet (SWN) | Edinburgh Postnatal Depression Scale (EPDS) | LASSO regression model | 70% training, 15% for prediction (additional 15% were not used) |  |  | 0.87 |  |  |  |
| Li 2023⁴⁹ | Chinese | 387 (329) | Euthymia = 46, mild = 102, moderate = 160, severe = 79 | Clinical interview | Transformer (BERT) | HAMD-17 | BiLSTM + Self-Attention + Multilayer Perceptron (MLP) + Softmax | Training = 273 recordings; Test = 114 recordings | 0.86 | 0.911 |  | 0.696 | 0.921 | 0.901 |
| Liu 2022⁵⁰ | English | 219 | Depressed: 64; Not depressed: 155 | Text message | LIWC | PHQ-8 | Logistic regression with L2 regularization | Leave-one-out |  |  | 0.72 |  |  |  |
| Munthuli 2023⁵¹ | Thai | 80 | Depressed: 40; Healthy control: 40 | Clinical interview | Fine-tuned transformer encoder (XLM-RoBERTa) | PHQ-9 and HAM-D | Transformer-based binary classifier (XLM-RoBERTa) | K×L-fold stratified and nested cross-validation | 0.9 | 0.898 |  | 0.925 | 0.921 | 0.875 |
| Nobles 2018⁵² | English | 1213 (33) | Suicidality day: 685; Depression day: 528 | Text message | TF-IDF | Depression: periods where the individual had no suicidal ideation or attempt | DNN | 10-fold cross-validation | 0.7 | 0.75 |  | 0.56 | 0.71 | 0.81 |
| Oh 2024⁵³ | Korean | 166 (77) | Depressed: 60; Other psychiatric illnesses: 17 | Clinical interview | Emotional Analysis Module patented by Acryl Inc. | Clinical diagnosis (DSM-5), provided by psychiatrist | XGBoost | Train: 136; Test: 30 | 0.794 | 0.877 | 0.85 | 0.25 |  | 0.962 |
| Ohse 2024⁵⁴ | German | 84 | Depressed: 25; Not depressed: 59 | Clinical interview | GPT-3.5 fine-tuned | PHQ-8 | GPT-3.5 fine-tuned | N/A | 0.910 | 0.820 |  |  | 0.850 | 0.840 |
| Orhan 2019⁵⁵ | Turkish | 60 | Depressed: 30; Healthy control: 30 | 10-minute free verbal samples of the subjects | Turkish version of the Harvard-III Psychological Dictionary | Structured clinical diagnosis | Bayesian logistic regression | Train: 42 (21 for each category); Test: 18 (9 for each set) | 0.89 |  |  |  |  |  |
| Parkeaw 2025⁵⁶ | Thai | 373 | Low risk: 261; High risk: 112 | SCT consisting of 34 items covering four key depression-related domains: 1) family, 2) society, 3) health, and 4) self-concept | LLM (Llama 3.1) was used to extract sentiment scores | PHQ-9 | Random forest | 5-fold cross-validation | 0.786 |  |  |  | 0.782 |  |
| Pérez-Toro 2022⁵⁷ | Spanish | 60 | Depressed Parkinson's Disease patients (D-PD): 25; Non-depressed Parkinson's Disease patients (ND-PD): 35 | Free response prompt (asked to talk about their daily routines) | Transformer (BERT) | Depression item from the MDS-UPDRS | Gaussian Mixture Model-Universal Background Model | Nested leave-one-out cross-validation | 0.67* | 0.7 | 0.7 | 0.8 |  | 0.56 |
| Podina 2025⁵⁸ | Romanian | 765 | Depressed: 397; Not depressed: 367 | Clinical interview (with the aiCARE chatbot) | TF-IDF | PHQ-9 | Logistic regression | Test set for the algorithm built in Boian et al., 2025³³ | 0.84 | 0.85 |  | 0.78 | 0.76 | 0.93 |
| Qin 2025¹⁵ | English | 37 | Depressed: 17; Control: 20 | 3 phases: 1. small talk; 2. semi-structured interview; 3. demographic questions | LLM (qCammel-13B-GPTQ) | MINI | LLM (qCammel-13B-GPTQ) | N/A | 0.81 | 0.87 | 0.88 |  |  | 0.80 |
| Ren 2024⁵⁹ | English | 1070 (94) | Depressed: 570; Not depressed: 500 | Interaction with therapist (message-based online therapy) | LIWC; Transformer (BERT) | PHQ-9 | Neural network (classification head, unspecified) | Training: 870; Test set: 200; each of the 94 participants contributes 3 observations for training and one for the test | 0.60 | 0.59 | 0.64 |  |  |  |
| Resnik 2013⁶⁰ | English | 124 | Depressed: 12; Not depressed: 112 | Students were asked to “describe your deepest thoughts and feelings about being in college” | LIWC; Topic modeling (LDA) | BDI | Logistic regression | Split into train (94) and test (30) | 0.80* | 0.50 |  |  | 0.50 | 0.50 |
| Rutowski 2020⁶¹ | English | 15,950 (11,000) | Depressed: 4259; Not depressed: 11,691 | Participants interacted with an app that presented questions on different topics, such as “work” or “home” | Transfer learning, implemented via ULMFiT | PHQ-8 | LSTM | Split into train (80%) and test (20%) | 0.75 |  | 0.82 | 0.75 |  | 0.75 |
| Shen 2022⁶² | Chinese | 162 | Depressed: 30; Not depressed: 132 | Interviews | Embedding (ELMo) | PHQ-8; SDS | BiLSTM with Attention | 3-fold CV |  | 0.65 |  |  | 0.65 | 0.66 |
| Shin 2022⁶³ | Korean | 166 | Depressed: 83; Healthy control: 83 | Clinical interview | LIWC; Bag-of-words | MINI | Naive Bayes | 80/20 split | 0.83 |  | 0.91 | 0.96 |  | 0.70 |
| Shin 2024⁶⁴ | Korean | 428 (91) | Depressed: 73; Not depressed: 357 | Daily diary | Transformer Gpt3.5_ft_CoT (fine-tuned models, Chain-of-thought) | PHQ-9 and Beck Scale for Suicide Ideation (BSS) | Gpt3.5_ft_CoT (fine-tuned models, Chain-of-thought) | N/A | 0.90 | 0.69 |  | 0.95 | 0.75 | 0.64 |
| Smirnova 2018⁶⁵ | Russian | 201 | Depressed: 124; Healthy control: 77 | Free response prompt (participants wrote narratives on the topic "The current state of life and future expectations") | Lexico-semantic features: metaphors, similes, informal words, repetitions; Syntactic features: sentence types, word order, ellipses; Lexico-grammatical features: pronouns, verb tenses/forms | Clinical psychiatric interviews coded using ICD-10 diagnostic criteria | Linear discriminant analysis | Mentions cross-validation, but the type is not clear | 0.99 |  |  |  |  |  |
| Smirnova 2019⁶⁶ | Russian | 201 | Same as above | Same as above | Component lexis analysis | HDRS-21 | Linear discriminant analysis | Mentions cross-validation, but the type is not clear | 0.96 |  |  |  |  |  |
| Sood 2023⁶⁷ | English | 626 | Depressed: 152; Not depressed: 474 | Clinical interview [combination of 3 data sets: DAIC, E-DAIC, and EATD corpus (originally Chinese but translated to English)] | TF-IDF | PHQ-8 and SDS | SVM | Training set: 399; Development set: 108; Test set: 119 (34 depressed) | 0.90* | 0.82 |  |  | 0.83 | 0.83 |
| Tao 2023⁶⁸ | Chinese | 139 | Depressed: 64; Anxious: 75 | Interaction with chatbot asking about daily activities | Transformer (ChatGPT) | Psychiatrist diagnosis | ChatGPT | N/A | 0.68 | 0.71 |  |  | 0.69 | 0.72 |
| Tlachac 2020⁶⁹ | English | 162 | Depressed: 55; Not depressed: 107 | Text message | Lexical category features via Empath; POS tag frequencies; Sentiment scores (polarity and subjectivity); Volume features: number of messages, words, characters | PHQ-9 | Logistic regression | 5-fold cross-validation | 0.804* | 0.806 |  | 0.742 | 0.728 | 0.925 |
| Tlachac 2022a⁷⁰ | English | 302 | Depressed: 142 (47.0%); Not depressed: 160 (53.0%) | Free response prompt | Transformer (BERT); Part-of-speech (POS) tagging; Lexical category features via Empath | PHQ-9 | BERT-LSTM (a variation of BERT incorporating a Long Short-Term Memory layer) | Training set: 218; Test set: 84 (27.8%) | 0.55 | 0.67 |  | 0.17 | 0.51 | 0.97 |
| Tlachac 2022b⁷¹ | English | 3,000 (number of participants unclear) | Unclear | Text message | Transformer (BERT) | PHQ-9 | Fine-tuned BERT classifier | Training: 2400 (1,200 messages per class); Testing: 600 (300 messages per class) | 0.711 |  |  |  |  |  |
| Tlachac 2022⁷² | English | 88 | Depressed: 53; Not depressed: 35 | Text message | Lexical category; Frequency features; BOW | PHQ-9 | Logistic regression | Leave-group-out cross-validation | 0.71 | 0.79 |  | 0.4 |  | 0.93 |
| Weber 2025⁷³ | German | 126 (65 from 44 participants, 61 synthetic) | n/a | Clinical interview | Transformer (BERT-base-German-cased) | MADRS | Linear regression | 5-fold cross-validation | 0.83 |  |  |  |  |  |
| Wright-Berryman 2023⁷⁴ | English | 2416 (1433) | Depressed: 863; Not depressed: 1553 | Clinical interview | TF-IDF | PHQ-9 | SVM | Leave-one-subject-out cross-validation | 0.69 |  | 0.77 | 0.04 | 0.68 | 0.55 |
| Xue 2024⁷⁵ (Same as Shen 2022) | Chinese | 162 | Depressed: 30; Not depressed: 132 | EATD-Corpus (general interview) | Transformer (BERT) | SDS | Fine-tuned BERT model with fully connected layers | Type of validation not specified |  | 0.72 |  |  | 0.66 | 0.80 |
| Ye 2021⁷⁶ | Chinese | 160 | Depressed: 80; Not depressed: 80 | Clinical interview | Embedding (Word2vec) | HAMD | One-hot Transformer | 5-fold cross-validation | 0.882 | 0.874 |  |  |  |  |
| Yuan 2021⁷⁷ | Chinese | 108 | Depressed: 54; Not depressed: 54 | Picture descriptions and responses to 30 questions | Embedding | BDI-II and PHQ-9 | Text Recurrent Encoder (TRE) | 8:1:1 random train-validation-test split | 0.659 | 0.651 |  |  | 0.688 | 0.583 |
| Zhang 2024a⁷⁸ | Chinese | 240 | Depressed: 120; Not depressed: 120 | Clinical interview | BERT Chinese pre-training model with Multi-Head Attention (MHA) module | PHQ-9 | Fully connected deep learning classifier | Training set: 168; Test set: 72 | 0.64 | 0.64 |  |  | 0.64 | 0.64 |
| Zou 2023⁷⁹ | Chinese | 78 | Depressed: 26; Not depressed: 52 | Clinical interview | Transformer (Chinese BERT) | MINI | Logistic regression | 5-fold cross-validation | 0.92* | 0.93 | 0.99 |  | 0.87 | 0.93 |
Studies Reporting Continuous Outcomes
| Study | Language | N | Source of text | Features extraction method | Outcome | Type of classification model | Validation | MAE | RMSE | R² |
|---|---|---|---|---|---|---|---|---|---|---|
| Morales 201680 | German | 138 (84) | Interview on everyday life aspects | LIWC; n-gram; Part-of-Speech (POS); Text-based speech rate features | BDI-II | SVM | leave-one-out cross-validation | 7.56 | 9.21 | 0.526 |
| Ozkanca 201881 | Turkish | 70 | Open-ended questions (neutral, positive, and negative questions) | Manual sentiment tagging (positive/negative/neutral), number of responses per sentiment, average utterance length, speech rate, features computed separately for positive/negative/neutral questions (15 total features) | BDI-II | SVR | Leave-one-out |  | 10.3 |  |
Note: The number in the “N” column represents the total number of text observations, with the value in parentheses indicating the number of participants from whom these observations were collected. The accuracy result marked with a * has been computed by us. BiLSTM: Bidirectional Long Short-Term Memory; LIWC: Linguistic Inquiry and Word Count; CNN: Convolutional Neural Network; TF-IDF: Term Frequency–Inverse Document Frequency; PHQ-9: Patient Health Questionnaire–9; BERT: Bidirectional Encoder Representations from Transformers; IIFDD: Intra- and Inter-modal Fusion Model for Depression Detection; SVM: Support Vector Machine; LASSO: Least absolute shrinkage and selection operator; BDI-II: Beck Depression Inventory–II; GRU: Gated Recurrent Unit; RNN: Recurrent Neural Network; LDA: Latent Dirichlet Allocation; POS: Part-of-speech; HAMD: Hamilton Depression Rating Scale; DNN: Deep Neural Net; XGBoost: Extreme Gradient Boosting; MDS-UPDRS: Movement Disorders Society Unified Parkinson's Disease Rating Scale; ULMFiT: Universal Language Model Fine-tuning; SDS: Self-rating Depression Scale questionnaire; HDRS-21: Hamilton Depression Rating Scale-21; SVR: Support Vector Regression; SCT: Sentence Completion Test; MADRS: Montgomery-Åsberg Depression Rating Scale.
Table 2
Description of Studies That Use DAIC Dataset
| First author | Features extraction method | Outcome type | Type of classification model | Accuracy | F1 | Precision | Recall/Sensitivity | MAE | RMSE |
|---|---|---|---|---|---|---|---|---|---|
| Agarwal 202282 | Embedding (GloVe) | Binary | MV-IA-Mean | 0.72 | 0.73 | 0.74 | UAR: 0.72 |  |  |
| Agarwal 202483 | Embedding (Sentence embeddings from all-mpnet-base-v2; graph built using cosine similarity between embeddings) | Binary | GCN + Transformer multi-head attention | 0.83 | 0.81 | 0.80 | UAR: 0.82 |  |  |
| Al-Hanai 201884 | Embedding (Word2Vec) | Binary | LSTM |  | 0.67 | 0.57 | 0.8 | 5.18 | 6.38 |
| Ansari 202385 | Count vectorization | Binary | LR and LSTM | LR: 0.748; LSTM: 0.73 | LR: 0.67; LSTM: 0.61 |  |  |  |  |
| Burdisso 2023*86 | TF-IDF; PMI (Pointwise Mutual Information); PageRank | Binary | node-weighted GCN |  | 0.84 |  |  |  |  |
| Cao 202287 | Transformer (BERT) | Binary | BERT | 0.91 |  |  |  |  |  |
| Chen 202488 | TF-IDF; PMI (Pointwise Mutual Information); PageRank | Binary | GCN |  | 0.84 |  |  |  |  |
| Correia 201689 | Embedding (GloVe) | Binary | SVM | Per sentence: 0.533; Per interview: 1.00 |  |  |  |  |  |
| Dang 201790 | SALAT; siNLP; TAALES; SEANCE; ANEW; EmoLex; SenticNet; Lasswell | Cont. | SVR |  |  |  |  | 4.98 | 6.02 |
| Danner 202391 | Transformer (BERT) | Binary | BERT |  | 0.82 | 0.83 | 0.82 |  |  |
| Fang 202392 | Transformer (USE) | Cont. | Bi-LSTM with an attention mechanism |  |  |  |  | 3.61 | 4.76 |
| Firoz 2023a93 | BoW; TF-IDF; Embedding (Word2Vec, FastText) | Binary | Ensemble model of CNN-LSTM and Bi-LSTM | 0.80 |  |  |  |  |  |
| Firoz 2023b94 | Transformer (BERT); Counts of absolutist language (e.g., always, never, completely) | Cont. | LSTM |  |  |  |  | 5.65 | 9.45 |
| Flores 202324 | Transformer (BERT) | Binary | LSTM |  | 0.72 |  |  |  |  |
| Guo 202495 | Transformer (BERT) | Binary | PTDD | 0.69 | 0.60 | 0.48 | 0.73 |  |  |
| Hadzic 202496 | GPT-4 | Binary | GPT-4 |  | 0.71 | 0.81 | 0.70 |  |  |
| Hong 202297 | Embedding (GRL using Schema Encoders) | Cont. | Schema-Based Graph Neural Network |  |  |  |  | 3.76 |  |
| Iyortsuun 202445 | Transformer (Transformer-based, USE-large) | Binary and cont. | BiLSTM + Attention | 0.727 | 0.78 | 0.80 | 0.76 | 3.96 |  |
| Jo 2022*98 | Embedding (exact type unclear) | Binary | CNN | 0.8171 | 0.8101 | 0.80 | 0.8205 |  |  |
| Kokkera 202399 | Word frequencies; POS tags; Sentiment scores | Binary | RF | 0.40 | 0.40 | 0.44 | 0.43 |  |  |
| Lam 2019100 | Manual topic modelling + augmentation + embedding + Transformer | Binary | Transformer architecture |  | 0.78 | 0.91 | 0.83 |  |  |
| Lau 2021101 | Transformer (BERT) | Binary and cont. | BiLSTM + attention |  | 0.83 | 0.83 | 0.83 | 4.23 | 5.32 |
| Lau 2023102 | Transformer (BERT and RoBERTa) | Cont. | BiLSTM + attention |  |  |  |  | 4.17 | 0.02 |
| Li 2022a103 | Embedding (from scratch) | Binary | biLSTM + RNN network | 0.745 | 0.706 | 0.701 | 0.715 |  |  |
| Li 2022b104 | Transformer (BERT) (utterance-based) | Binary | BiLSTM + attention with an MLP-Softmax classifier |  | 0.78 |  | UAR: 0.79 |  |  |
| Li 2023105 | Part-of-Speech (POS); Named Entity Recognition (NER); Embedding (GloVe) | Binary | BiLSTM |  | 0.79 | 0.69 | 0.80 |  |  |
| Lin 2020106 | Embedding (ELMo) | Binary | BiLSTM + Attention |  | 0.83 | 0.83 | 0.83 |  |  |
| Lopez-Otero 2017107 | Embedding (GloVe) | Binary | SVM | 0.857 | 0.730 |  |  |  |  |
| Lorenc 2022108 | Embedding and transformer (USE5, DAN, sBERT) | Binary | Chunk-based biLSTM model |  |  |  | UAR: 0.803 |  |  |
| Lu 2023109 | Transformer (BERT) | Binary | BERT |  | 0.76 |  |  |  |  |
| Rodrigues Makiuchi 2019110* | Transformer (BERT) | Cont. | 8 CNN blocks-LSTM |  |  |  |  | 4.22 |  |
| Mallol-Ragolta 2019111 | Embedding (GloVe) | Binary | HCAN |  | 0.63 |  | UAR: 0.66 |  |  |
| Mao 2023112 | Embedding (GloVe) | 5-levels classification | BiLSTM | 0.968 | 0.971 |  |  |  |  |
| Milintsevich 2023113 | Transformer (RoBERTa) | Binary, 5-levels classification and cont. | BiLSTM + Attention |  | Binary: Micro-F1 = 0.766, Macro-F1 = 0.739; 5-Class: Micro-F1 = 0.426, Macro-F1 = 0.270 |  |  | 3.78 |  |
| Niu 2021114 | Embedding (GloVe) | Binary and cont. | Hierarchical context-aware graph attention model | 0.77 |  | 0.70 | 0.82 | 3.73 | 4.8 |
| Pampouchidou 2016115 | LIWC; Total number of words and sentences; Average sentence length; Laughter-to-word ratio; Depression-related word ratio; ANEW; Mean and SD of pleasure, arousal, dominance ratings; Word frequency | Binary | Decision Tree |  | Depressed: 0.23; Not depressed: 0.79 |  |  | 8.99 | 10.75 |
| Prabhu 2022116 | Embedding (Word2vec pretrained) | Binary | LSTM | 0.823 |  |  |  |  |  |
| Qureshi 2019117 | Embedding (from scratch, feature learning via an LSTM encoder) | Continuous and 5-class | DNN | 0.67 | 0.53 |  |  | 3.90 | 4.96 |
| Qureshi 2020118 | Transformer (USE) | Cont. and 5-level class | LSTM | 0.667 | 0.62 |  |  | Class: 0.66; Cont: 3.81 | Class: 1.23; Cont: 4.70 |
| Qureshi 2021119 | Transformer (USE) | Cont. | LSTM |  |  |  |  | 3.78 | 4.88 |
| Rasipuram 2022120 | Transformer (GPT-2) | Cont. | BiLSTM |  |  |  |  | 3.21 | 4.25 |
| Ray 2019121* | Transformer (USE) | Cont. | stacked BiLSTM + feedforward network |  |  |  |  | 4.02 | 4.73 |
| Rinaldi 2020122 | Embedding (GloVe) | Binary | Joint Latent Prompt Categorization (JLPC) |  | 0.604 |  |  |  |  |
| Rohanian 2019123 | Embedding (GloVe) | Binary and cont. | LSTM |  | 0.69 | 0.68 |  | 4.98 | 6.05 |
| Sadeghi 2023124* | Transformer (GPT-3.5-Turbo and DepRoBERTa) | Cont. | SVR with a polynomial (poly) kernel |  |  |  |  | 4.26 | 5.36 |
| Sadeghi 2024125* | Transformer (GPT-3.5-Turbo (prompt asking the model to describe the interview + DepRoBERTa and GPT-3.5-Turbo response to 11 questions on the interview)) | Cont. | SVR |  |  |  |  | 3.86 | 4.66 |
| Samareh 2018126 | Basic linguistic stats (e.g., word count); Dictionary-based depression-related word ratio; Sentiment features (AFINN) | Cont. | RF regression with confidence-based decision-level fusion |  |  |  |  | 4.78 | 5.59 |
| Senn 2022127 | Transformer (BERT and RoBERTa) | Binary | Ensemble of BERT, RoBERTa, DistilBERT |  | 0.62 |  | 0.64 |  |  |
| Shen 202261 (also used EATD-Corpus) | Embedding (ELMo) | Binary | BiLSTM with Attention |  | 0.83 | 0.83 | 0.83 |  |  |
| Stasak 2017128 | Word Affect Features: single affect word-rating reference, such as the General Index | Binary | decision tree classification | 0.82 |  |  |  |  |  |
| Stepanov 2018129 | BOW | Cont. | SVR |  |  |  |  | 4.88 | 5.83 |
| Sun 2017130 | Selected key phrases related to symptoms | Cont. | RF |  | 0.55 | 0.40 | 0.89 | 3.87 | 4.98 |
| Tlachac 202270 | Transformer (BERT) | Binary | fine-tuned BERT classifier | 0.48 |  |  |  |  |  |
| Toto 2021131 | Transformer (BERT) | Binary | LSTM |  | 0.67 |  |  |  |  |
| Marriwala 2023132 | Embedding (Word2vec) | Binary | CNN | 0.8 | 0.6 | 0.63 | 0.68 |  |  |
| Van Steijn 2022133* | LIWC; Transformer (BERT); Sentiment; speech rate; Repetition rate; Confidence score | Cont. | KELM |  |  |  |  |  | 6.06 |
| Villatoro-Tello 2021134* | Lexical Availability | Binary | MLP (trained on E-DAIC and tested on DAIC-WOZ) |  | 0.83 | 0.87 | 0.81 |  |  |
| Williamson 2016135 | Embedding (GloVe); Topics | Binary and cont. | SVR |  | 0.84 |  |  | 3.34 | 4.46 |
| Xezonaki 2020136 | LIWC; TF-IDF; Embedding (GloVe); Affective lexica (AFINN, Bing Liu, MPQA, Emolex, SemEval15) | Binary | Hierarchical Attention Network with Lexicon and Summary Integration |  | 0.70 | 0.70 |  |  |  |
| Xia 2024137 | Embedding (Word2vec) | Binary | BiLSTM-GNN | 0.64 | 0.60 | 0.585 | 0.584 |  |  |
| Xiao 2021138 | Transformer (BERT) | Binary | BERT |  |  |  | 0.70 |  |  |
| Xu 2023139* | Transformer (BERT) | Binary | Two-layer STM network |  | 0.82 | 0.81 | 0.83 |  |  |
| Xue 202473 | Transformer (BERT) | Binary | Fine-tuned BERT model with fully connected (FC) layers |  | 0.85 | 0.79 | 0.92 |  |  |
| Yadav 2023140 | Embedding (Word2Vec, ELMo); Transformer (BERT) | 5-levels class | BGRU model with two Fully Coupled (FC) networks as output layers |  | 0.923 | 0.929 | 0.928 |  |  |
| Yang 2017a141 | Embedding (PV); Global structural and behavioral text features (e.g., Number of words) | Binary | SVM |  | Depressed: 0.667; Not depressed: 0.885 | Depressed: 1.000; Not depressed: 0.793 | Depressed: 0.50; Not depressed: 1.00 |  |  |
| Yang 2017b142 | Embedding (PV) | Cont. | DCNN and DNN |  |  |  |  | Female: 3.750; Male: 3.525 | Female: 4.361; Male: 4.406 |
| Yang 2018143 | Embedding (PV) | Binary | SVM | 0.75 |  |  |  |  |  |
| Yang 2019144 | Embedding (Doc2vec) and Text Convolutional Neural Network | Binary | SVM | 0.72 |  |  |  |  |  |
| Zhang 2020a145* | Embedding (PV or doc2vec) | Binary and cont. | Multitask Deep Neural Network (DNN) | 0.839 | 0.907 |  |  |  | 4.66 |
| Zhang 2020b146 | Transformer (BERT); Key phrase matching | Binary | bidirectional variable-length LSTM model |  | 0.81 | 0.82 | 0.8 |  |  |
| Zhang 2024b147 | Transformer [Sentence-BERT (nli-bert-large)] | Binary | BiLSTM |  | 0.87 |  |  |  |  |
| Zhang 2024c148 | Transformer (T5-Encoder and BERT) | Binary, 3-levels, 5-levels | T5 + BERT dual-branch fusion | Binary: 0.8913; 3-level: 0.6739; 5-level: 0.5435 | Binary: 0.8276; 3-level: 0.6677; 5-level: 0.5259 | Binary: 0.80 | Binary: 0.857 | 5.283 |  |
| Zhao 2022149 | n-gram | Cont. | Transformer-based architecture with self-attention and feed-forward layers |  |  |  |  | 5.03 | 5.95 |
Note: * indicates studies using the E-DAIC dataset; all others are based on DAIC-WOZ. Since only two studies reported AUC and three studies reported specificity, these metrics were removed from the table. MV-IA-Mean: Multi-view model with inter-view attention coupled with the mean function; GCN: Graph Convolutional Network; SALAT: Suite of Linguistic Analysis Tools, an open-source toolkit used to extract various linguistic and word affect features from transcripts; siNLP: Simple Natural Language Processing Tool; TAALES: Tool for Automatic Analysis of Lexical Sophistication; SEANCE: Sentiment Analysis and Cognition Engine; ANEW: Affective Norms for English Words; EmoLex: features based on token words related to eight emotion types (anger, anticipation, disgust, fear, joy, sadness, surprise, trust); SenticNet: features based on nearly 13,000 token words, evaluating perceptual polarity norms for aptitude, attention, pleasantness, and sensitivity; Lasswell: 146 features from 63 different word lists categorized by eight semantic characterizations, with a particular interest in the well-being category; BERT: Bidirectional Encoder Representations from Transformers; SVR: Support Vector Regression; Bi-LSTM: Bi-directional LSTM; SVM: Support Vector Machine; NB: Naïve Bayes; LR: Logistic Regression; LSTM: Long Short-Term Memory; USE: Universal Sentence Encoder; BoW: Bag of Words; TF-IDF: Term Frequency-Inverse Document Frequency; PTDD: Prompt-based Topic-modeling method for Depression Detection; RF: Random Forest; UAR: Unweighted Average Recall; USE5: USE Transformer-based; DAN: Deep Averaging Network, a simpler sentence embedding model; sBERT: Sentence-BERT Transformer fine-tuned for sentence similarity; HCAN: Hierarchical Contextual Attention Network; POS: Part-of-speech; DepRoBERTa: a RoBERTa language model fine-tuned specifically for depression detection; CNN: Convolutional Neural Network; KELM: Kernel Extreme Learning Machine; BGRU: Bidirectional Gated Recurrent Unit; DCNN: Deep Convolutional Neural Network; DNN: Deep Neural Network; MLP: Multi-layer Perceptron; MDSD-T5: the T5-based (Google encoder–decoder Transformer) branch of the MDSD-FGPL system; PV: paragraph vector, an extension of Word2Vec.
Across studies, a total of 35,171 unique participants (removing duplicates from overlapping datasets) contributed 58,413 text samples (e.g., utterances, transcripts, or messages). Publication years ranged from 2013 to 2025, with 95 articles (77.2%) published since 2020, reflecting growing interest in automated text-based depression detection.
Dataset Characteristics
Among the 56 studies using unique datasets, 20 (35.7%) analyzed English text, 14 (25.0%) analyzed Chinese text, and 21 (37.5%) examined other languages (e.g., Italian, Spanish, Turkish, Korean, Russian, German, Malay, Thai). One study (1.8%) combined Chinese and English datasets by translating the Chinese texts into English. The DAIC datasets included English-language clinical interviews.
The number of text observations ranged from 53 to 15,950 (mean = 1,216; median = 210). Data sources fell into four categories: structured clinical interviews (19/56, 33.9%), responses to open-ended questions (e.g., “Describe your weekend activities”; 27/56, 48.2%), text messaging and chat logs (6/56, 10.7%), and interactions with therapists (i.e., text-based therapy; 4/56, 7.1%).
Outcome Formulations
The great majority of studies framed automated depression detection as a binary classification task (100/129, 77.5%). In these cases, models were trained to classify whether an individual, or the person who produced a given text sample, was depressed or non-depressed. Typically, “depressed” status was defined using a clinical cutoff on a depression severity questionnaire (for example, a PHQ-9 score ≥ 10) or based on a formal diagnostic evaluation.
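As a minimal sketch of this labeling step, the snippet below binarizes questionnaire totals with the PHQ-9 ≥ 10 cutoff mentioned above. The function name, the constant, and the example scores are hypothetical and chosen only for illustration; individual studies may have used other instruments or cutoffs.

```python
from typing import List

# Example cutoff named in the text (PHQ-9 score >= 10); other studies may differ.
PHQ9_CUTOFF = 10


def binarize_scores(scores: List[int], cutoff: int = PHQ9_CUTOFF) -> List[int]:
    """Map questionnaire totals to 1 (depressed) or 0 (not depressed)."""
    return [1 if score >= cutoff else 0 for score in scores]


# Hypothetical PHQ-9 totals for six participants (illustrative values only).
example_scores = [3, 10, 14, 9, 21, 0]
labels = binarize_scores(example_scores)
print(labels)
```

A score exactly at the cutoff is labeled as depressed here because the rule is "greater than or equal to".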
NLP feature extraction and classification methods
To train ML models to predict depression, text must first be converted into numeric values that serve as the model input. Broadly, feature extraction methods fell into four categories: (1) Simple textual features (12/129, 9.3%): Unstructured representations such as Bag-of-Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF) term vectors, which convert text into numerical frequency-based vectors without incorporating external knowledge. (2) Lexicon-based linguistic features (16/129, 12.4%): Specific variables derived from psychological dictionaries or lexica (e.g., LIWC, EmoLex, SenticNet). These typically involve counting the occurrences of words in predefined semantic, emotional, or psychological categories (e.g., number of sad words in the text). (3) Pre-trained word embeddings (31/129, 24.0%): Dense, distributed vector representations of words or documents learned from large text corpora, excluding transformer-based models. Common examples included Word2Vec and GloVe embeddings, which represent words based on their co-occurrence patterns in large datasets. (4) Transformer-based language model features (48/129, 37.2%): Deep text representations from large pre-trained models (e.g., BERT, GPT) that use attention mechanisms to capture relationships between words across entire sentences, enabling a more nuanced understanding of context and meaning. The remaining studies employed hybrid feature approaches, combining multiple feature types to enrich the model’s input (22/129, 17.0%).
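The first two feature categories can be sketched in a few lines of standard-library Python. This is an illustrative toy, not the pipeline of any reviewed study: the word lists below are a hypothetical mini-lexicon (real dictionaries such as LIWC are far larger and proprietary), and the tokenizer is deliberately simple.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# Hypothetical mini-lexicon for illustration; not taken from LIWC or any real dictionary.
NEGATIVE_EMOTION_WORDS: Set[str] = {"sad", "tired", "hopeless", "alone"}
FIRST_PERSON_PRONOUNS: Set[str] = {"i", "me", "my", "myself"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text: str) -> Dict[str, int]:
    """Simple textual feature: raw count of each token."""
    return dict(Counter(tokenize(text)))


def lexicon_features(text: str) -> Dict[str, float]:
    """Lexicon-based feature: share of tokens that fall in each category."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "negative_emotion_ratio": sum(t in NEGATIVE_EMOTION_WORDS for t in tokens) / total,
        "first_person_ratio": sum(t in FIRST_PERSON_PRONOUNS for t in tokens) / total,
    }


sample = "I feel sad and tired, and I am alone most days."
print(bag_of_words(sample))
print(lexicon_features(sample))
```

Either output can be turned into a fixed-length numeric vector and passed to a classifier; embedding and transformer features replace these hand-built counts with learned dense vectors.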
In line with broader NLP trends, classification methods evolved from traditional machine-learning models to deep learning approaches capable of learning directly from raw text. About one-third of studies (44/129, 34.1%) used traditional classifiers such as support vector machines, logistic regression, naïve Bayes, or random forests. A large portion of the studies (56/129, 43.4%) employed recurrent neural network architectures, most commonly Long Short-Term Memory (LSTM) or bidirectional LSTM, often augmented with attention mechanisms to capture the temporal dependencies among words. By the early 2020s, transformer-based models (29/129, 22.5%) had become increasingly prevalent, with studies either fine-tuning pre-trained architectures (e.g., BERT, RoBERTa) for depression detection or passing their contextual embeddings as features to separate classifiers. For example, Flores et al.24 found that while BERT embeddings alone achieved relatively modest performance (maximum F1: 0.58), combining them with an LSTM substantially improved results (average F1: 0.72).
Meta-analysis results
Forty-three studies (n = 40,983) were included in the pooled analysis of classification accuracy (Fig. 2). The pooled accuracy was 0.80 (95% CI, 0.76–0.83), with substantial heterogeneity (τ² = 0.471, I² = 98.4%). Egger’s test was nonsignificant (z = 1.17, p = 0.25), and trim-and-fill identified no missing studies, indicating no evidence of publication bias. Leave-one-out analyses demonstrated stability (range: 0.79–0.80; Table S3), and the Galbraith plot (Fig. 3) revealed no extreme outliers, confirming that heterogeneity reflects broad between-study variability rather than undue influence from individual studies. As a sensitivity analysis, we also excluded seven studies in which accuracy was approximated from other reported metrics (k = 36), yielding an identical pooled estimate (0.80, 95% CI, 0.75–0.83) and similarly high heterogeneity (Fig. 4).
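To make the pooled estimate and the heterogeneity statistics concrete, the sketch below pools study-level accuracies with a random-effects model. It assumes a logit transformation and the DerSimonian-Laird estimator of the between-study variance; this section does not state which transformation or estimator the authors used, so treat these as illustrative choices. The input accuracies and sample sizes are invented.

```python
import math
from typing import Dict, List


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pool_accuracies(accuracies: List[float], sample_sizes: List[int]) -> Dict[str, float]:
    """Random-effects pooling of proportions on the logit scale (assumes 0 < p < 1)."""
    effects = [logit(p) for p in accuracies]
    # Approximate sampling variance of a logit-transformed proportion.
    variances = [1.0 / (n * p * (1.0 - p)) for p, n in zip(accuracies, sample_sizes)]

    fixed_w = [1.0 / v for v in variances]
    sum_w = sum(fixed_w)
    fixed_mean = sum(w * y for w, y in zip(fixed_w, effects)) / sum_w

    # Cochran's Q, then the DerSimonian-Laird tau^2 and the I^2 statistic.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(fixed_w, effects))
    df = len(effects) - 1
    c = sum_w - sum(w * w for w in fixed_w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each study's sampling variance.
    random_w = [1.0 / (v + tau2) for v in variances]
    mean = sum(w * y for w, y in zip(random_w, effects)) / sum(random_w)
    se = math.sqrt(1.0 / sum(random_w))
    return {
        "pooled": inv_logit(mean),
        "ci_low": inv_logit(mean - 1.96 * se),
        "ci_high": inv_logit(mean + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }


# Invented study-level accuracies and sample sizes, for illustration only.
result = pool_accuracies([0.70, 0.85, 0.90, 0.65], [100, 150, 80, 250])
print(result)
```

A leave-one-out analysis of the kind reported above amounts to calling the same function repeatedly, each time dropping one study from both input lists.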
Moderator analysis
There were significant between-group differences based on language (Qbetween = 153.24, p < .001). Pooled classification accuracy was highest for studies conducted in languages other than English or Chinese (k = 8, Accuracy = 0.82, 95% CI, 0.76–0.86), followed by those in Chinese (k = 9, Accuracy = 0.81, 95% CI, 0.70–0.88), and studies in English (k = 16, Accuracy = 0.77, 95% CI, 0.71–0.83). Between-group differences were also significant for text source (Qbetween = 187.05, p < .001). Studies using structured clinical interviews demonstrated the highest pooled accuracy (k = 18, Accuracy = 0.84, 95% CI, 0.81–0.88), followed by communication-based interactions (k = 5, Accuracy = 0.79, 95% CI, 0.66–0.88), open-ended questions (k = 18, Accuracy = 0.75, 95% CI, 0.68–0.81), and finally therapist-patient interactions (k = 2, Accuracy = 0.70, 95% CI, 0.50–0.84). A significant between-group difference was also observed for feature type (Qbetween = 164.62, p < .001). Linguistic features produced the highest pooled accuracy (k = 5, Accuracy = 0.86, 95% CI, 0.75–0.93), followed by embedding-based features (k = 6, Accuracy = 0.84, 95% CI, 0.75–0.90), transformer features (k = 18, Accuracy = 0.81, 95% CI, 0.75–0.85), simple features (k = 7, Accuracy = 0.75, 95% CI, 0.65–0.82), and hybrid features (k = 7, Accuracy = 0.74, 95% CI, 0.63–0.83). Classifier type also showed a significant between-group effect (Qbetween = 166.25, p < .001). Transformer-based and traditional classifiers performed similarly, both with Accuracy = 0.81 (transformers: k = 14, 95% CI, 0.74–0.87; traditional: k = 21, 95% CI, 0.76–0.85), outperforming neural networks (k = 8, Accuracy = 0.72, 95% CI, 0.63–0.80).
In one-at-a-time meta-regressions, only text source was significant (QM(3) = 8.78, p = .032), accounting for 13.6% of the between-study variance. Other moderators, including risk of bias, sample balance, language, feature type, classifier, and sample size (log N), did not significantly predict variability in accuracy (all p > 0.17; R² = 0–7.7%).
Secondary analysis
Precision. Twenty-eight studies (n = 31,644) yielded a pooled precision of 0.78 (95% CI, 0.72–0.83) with high heterogeneity (τ² = 0.731, I² = 99.1%; Figure S1). Leave-one-out analyses were stable (range = 0.77–0.79; Table S4).
Recall. Thirty-three studies (n = 47,738) produced a pooled recall of 0.76 (95% CI, 0.68–0.83) with high heterogeneity (τ² = 1.42, I² = 99.7%; Figure S2). Leave-one-out analyses confirmed stability (range = 0.75–0.77; Table S5).
AUC. Fourteen studies (n = 39,412) showed a pooled AUC of 0.79 (95% CI, 0.70–0.85) with high heterogeneity (τ² = 0.66, I² = 99.6%; Figure S3). Leave-one-out analyses were stable (range = 0.76–0.81; Table S6). Across all three metrics (precision, recall, and AUC), Egger’s tests were nonsignificant, and trim-and-fill analyses indicated that no studies were missing, suggesting no evidence of publication bias.
Balanced accuracy. Sixteen studies (n = 31,661) showed a pooled balanced accuracy of 0.71 (95% CI, 0.63–0.78) with high heterogeneity (τ² = 0.54, I² = 99.4%; Figure S4). Leave-one-out analyses were stable (range = 0.70–0.74; Table S7). Egger’s test was nonsignificant, and trim-and-fill analyses suggested that only one study should be imputed, which slightly reduced the pooled balanced accuracy estimate from 0.71 to 0.70 (Figure S5).
Discussion
This study revealed a substantial increase in studies using NLP- and ML-based approaches for automated depression detection. Among studies included in the quantitative synthesis, models achieved a pooled accuracy of 80%, correctly distinguishing depressed from non-depressed individuals in roughly four out of five cases. For comparison, meta-analyses using resting-state fMRI found similar accuracy (~80%)25, whereas those using wearable AI reported higher accuracy (~89%)26. Yet given the substantial heterogeneity, results should be interpreted cautiously.
Subgroup analyses showed significant differences in accuracy across several factors. Studies conducted in English yielded lower accuracy (77%) than those in Chinese or other languages, possibly reflecting linguistic or cultural differences27 or dataset-specific effects. This finding warrants further investigation into whether cultural and language factors affect automated detection. Models trained on structured clinical interviews achieved the highest accuracy (84%), whereas those analyzing free-form patient–therapist conversations performed worse (70%), suggesting that direct questioning about mood elicits clearer linguistic signals of depression.
Interestingly, lexicon-based features showed the highest pooled accuracy (86%), outperforming more complex or hybrid models (74%). One explanation is that targeted linguistic markers of depression (e.g., frequent use of negative emotion words or first-person pronouns) are robust across contexts, allowing simpler models to perform well. However, this finding is based on few studies and requires further validation. Traditional machine learning models performed comparably to transformer-based models (81%), potentially indicating a performance ceiling, possibly due to limited input text that provides insufficient linguistic signal. Other factors may include between-person variability in how depression is expressed, small and imbalanced datasets, insufficient transformer fine-tuning, shallow linguistic cues in the data collection task, and differences in model optimization or evaluation procedures. These factors may constrain even advanced models, underscoring the need for richer longitudinal data and more personalized and context-aware modeling approaches. Consistent with this interpretation, only the type of text source significantly explained some between-study variance in meta-regression (accounting for 13.6% of heterogeneity), suggesting that depression-focused linguistic content may be more critical than model complexity for detection accuracy.
Secondary analysis reveals a pooled AUC of 0.79, suggesting that NLP models can detect depression from text at a level that could be clinically useful, especially for early screening or augmenting clinicians’ assessments. The pooled precision (0.78) and recall (0.76) suggest that, on average, models show a reasonable balance between correctly identifying depressed individuals and avoiding false positives. However, there was substantial variability across studies, likely because these metrics are complementary, and researchers can optimize one at the expense of the other. This trade-off means that while some models favor higher sensitivity to reduce missed cases, others prioritize precision to avoid false alarms.
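The precision-recall trade-off described here can be illustrated by sweeping the decision threshold of a single classifier. The sketch below is a toy under stated assumptions: the predicted probabilities and labels are invented, and the helper name is hypothetical.

```python
from typing import List, Tuple


def precision_recall_at(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """Precision and recall when scores >= threshold are predicted as depressed (1)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented model outputs and true labels (1 = depressed), for illustration only.
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.35, 0.5, 0.7):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, lowering the threshold raises recall at the cost of precision, and raising it does the opposite, which is one reason the two metrics can vary widely across studies that tune for different goals.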
This study is one of the few systematic reviews, and among the first meta-analyses, to quantitatively evaluate NLP and ML techniques for detecting depression. Engaging clinical researchers in this rapidly evolving field is essential to ensure these tools are properly understood, evaluated, and directed toward real clinical needs. To maintain a high standard, we included only studies that validated their models against independent measures of depression. Nonetheless, several limitations should be acknowledged. The high heterogeneity means the pooled accuracy should be interpreted cautiously, as individual study results varied widely. Although moderator analyses explained some variability, much remains unexplained, likely reflecting differences in validation methods, preprocessing, or sample characteristics.
Furthermore, many papers did not report all the standard classification metrics, and the field would benefit from more standardized reporting. Relatedly, accuracy alone can be misleading in imbalanced datasets, where high accuracy may simply reflect the majority (non-depressed) class rather than true model performance.
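A small worked example shows why accuracy can mislead under class imbalance. The sketch below assumes a hypothetical sample of 90 non-depressed and 10 depressed participants and a trivial model that always predicts "not depressed"; balanced accuracy is computed as the mean of sensitivity and specificity, which corresponds to the unweighted average recall (UAR) reported by some DAIC studies.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels where 1 = depressed."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def balanced_accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Hypothetical imbalanced sample: 90 non-depressed (0) and 10 depressed (1).
y_true = [0] * 90 + [1] * 10
# Trivial majority-class model: predicts "not depressed" for everyone.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The majority-class model reaches 90% accuracy while detecting no depressed participant, and balanced accuracy exposes this by falling to chance level.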
An additional limitation is that over half of studies used variants of the DAIC corpus, meaning a large portion of the literature is built on the same or very similar data, which limits generalizability. While the DAIC is a valuable benchmark, there is a clear need for new datasets that cover different populations, languages, and modes of communication.
Gender differences were not examined, as few studies reported this information. Given that depression is more prevalent among women, they were likely overrepresented in training data, potentially biasing model performance and increasing misclassification risk for men.28 Finally, automated detection differs conceptually from tracking within-person change. Identifying who is depressed does not necessarily translate to monitoring fluctuations or improvement over time, which may rely on different linguistic dynamics.
In summary, NLP and ML can detect depression with good accuracy across a variety of settings. The growing interest since 2020 has yielded many promising approaches. Advancing the field will require greater standardization in reporting, the use of more diverse datasets, and explicit attention to gender and cultural fairness to address the substantial heterogeneity observed. With continued refinement and validation, text-based depression detection systems have the potential to complement traditional assessment methods and broaden access to mental health screening.
Author Contribution
HF had full access to all study data and takes responsibility for the integrity and accuracy of the data analysis. HF and CW contributed to the study concept and design. HF, NJ, KP, AT, and MV acquired, analyzed, and interpreted the data. HF drafted the manuscript, and HF, NJ, KP, AT, MV, PD, and CW critically revised it for important intellectual content. HF performed the statistical analysis, and CW supervised the study. All authors read and approved the final manuscript.
Funding
HF was supported by the VATAT Scholarship (Israel), the Livingston Award, and the Kaplen Fellowship at Harvard Medical School. CW was partially supported by R01MH116969, R01MH135844, R01AT011002, the Tommy Fuss Fund, a NARSAD Young Investigator Grant from the Brain & Behavior Research Foundation, and the Klingenstein Third Generation Foundation. The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
Competing Interests
CW has received consulting fees from King & Spalding law firm but declares no non-financial competing interests. CW’s interests were reviewed and are managed by McLean Hospital and Mass General Brigham in accordance with their conflict of interest policies. No funding from this entity was used to support the current work, and all views expressed are solely those of the authors. The other authors declare no competing financial or non-financial interests.
Data Availability
Data and materials underlying the findings of this study are publicly available at https://osf.io/x7tm9/overview.
References
1. GBD 2019 Mental Disorders Collaborators. Global, regional, and national burden of 12 mental disorders in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet Psychiatry 9, 137–150 (2022).
2. Ettman, C. K. et al. Prevalence of depression symptoms in US adults before and during the COVID-19 pandemic. JAMA Netw. Open 3, e2019686–e2019686 (2020).
3. Mohr, D. C. et al. Perceived barriers to psychological treatments and their relationship to depression. J. Clin. Psychol. 66, 394–409 (2010).
4. Kraus, C., Kadriu, B., Lanzenberger, R., Zarate Jr, C. A. & Kasper, S. Prognosis and improved outcomes in major depression: a review. Transl. Psychiatry 9, 127 (2019).
5. Halfin, A. Depression: the benefits of early and appropriate treatment. Am. J. Manag. Care 13, S92 (2007).
6. Naslund, J. A. et al. Digital technology for treating and preventing mental disorders in low-income and middle-income countries: a narrative review of the literature. Lancet Psychiatry 4, 486–500 (2017).
7. Mao, K., Wu, Y. & Chen, J. A systematic review on automated clinical depression diagnosis. NPJ Ment. Health Res. 2, 20 (2023).
8. Zhang, T., Schoene, A. M., Ji, S. & Ananiadou, S. Natural language processing applied to mental illness detection: a narrative review. NPJ Digit. Med. 5, 46 (2022).
9. Le Glaz, A. et al. Machine learning and natural language processing in mental health: systematic review. J. Med. Internet Res. 23, e15708 (2021).
10. Teferra, B. G. et al. Screening for depression using natural language processing: literature review. Interact. J. Med. Res. 13, e55067 (2024).
11. Nanomi Arachchige, I. A., Sandanapitchai, P. & Weerasinghe, R. Investigating machine learning & natural language processing techniques applied for predicting depression disorder from online support forums: A systematic literature review. Inf. 12, 444 (2021).
12. William, D. & Suhartono, D. Text-based depression detection on social media posts: A systematic literature review. Procedia Comput. Sci. 179, 582–589 (2021).
13. Malgaroli, M., Hull, T. D., Zech, J. M. & Althoff, T. Natural language processing for mental health interventions: a systematic review and research framework. Transl. Psychiatry 13, 309 (2023).
14. Li, Y. et al. Automated Depression Detection from Text and Audio: A Systematic Review. IEEE J. Biomed. Health Inform. (2025).
15. Qin, R. et al. Language models for online depression detection: A review and benchmark analysis on remote interviews. ACM Trans. Manag. Inf. Syst. 16, 1–35 (2025).
16. Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372, (2021).
17. Ciharova, M. et al. Use of machine learning algorithms based on text, audio, and video data in the prediction of anxiety and posttraumatic stress in general and clinical populations: a systematic review. Biol. Psychiatry 96, 519–531 (2024).
18. Wolff, R. F. et al. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann. Intern. Med. 170, 51–58 (2019).
19. Higgins, J. P. et al. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ 343, (2011).
20. Sterne, J. A. et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ 355, (2016).
21. Gratch, J. et al. The distress analysis interview corpus of human and computer interviews. in Proceedings of the Ninth International Conference on Language Resources and Evaluation vol. 14 3123–3128 (Reykjavik, Iceland, 2014).
22. DeVault, D. et al. SimSensei Kiosk: A virtual human interviewer for healthcare decision support. in Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems 1061–1068 (2014).
23. Kroenke, K. et al. The PHQ-8 as a measure of current depression in the general population. J. Affect. Disord. 114, 163–173 (2009).
24. Flores, R., Tlachac, M., Toto, E. & Rundensteiner, E. Transfer learning for depression screening from follow-up clinical interview questions. in Deep Learning Applications, Volume 4 53–78 (Springer, 2023).
25. Chen, Y., Zhao, W., Yi, S. & Liu, J. The diagnostic performance of machine learning based on resting-state functional magnetic resonance imaging data for major depressive disorders: a systematic review and meta-analysis. Front. Neurosci. 17, 1174080 (2023).
26. Abd-Alrazaq, A. et al. Systematic review and meta-analysis of performance of wearable artificial intelligence in detecting and predicting depression. NPJ Digit. Med. 6, 84 (2023).
27. Hoemann, K., Şencan, R. S., Cochez, A., Beckers, C. & de Mesquita, B. G. Cultural models of emotion manifest in descriptions of everyday experience: A case study of the US and Belgium. (2024).
28. Cirillo, D. et al. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. NPJ Digit. Med. 3, 81 (2020).
29. Abilkaiyrkyzy, A., Laamarti, F., Hamdi, M. & El Saddik, A. Dialogue system for early mental illness detection: toward a digital twin solution. IEEE Access 12, 2007–2024 (2024).
30. Aloshban, N., Esposito, A. & Vinciarelli, A. Language or Paralanguage, This is the Problem: Comparing Depressed and Non-Depressed Speakers Through the Analysis of Gated Multimodal Units. in Interspeech 2496–2500 (2021).
31. Antoniou, M. et al. Predicting mental health status in remote and rural farming communities: computational analysis of text-based counseling. JMIR Form. Res. 6, e33036 (2022).
32. Banerjee, T. et al. Predicting mood disorder symptoms with remotely collected videos using an interpretable multimodal dynamic attention fusion network. arXiv preprint arXiv:2109.03029 (2021).
33. Boian, R. et al. A conversational agent framework for mental health screening: Design, implementation, and usability. Behav. Inf. Technol. 44, 2364–2378 (2025).
34.
Burkhardt, H., Pullmann, M., Hull, T., Areán, P. & Cohen, T. Comparing emotion feature extraction approaches for predicting depression and anxiety. in Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology 105–115 (2022).
35.
Cao, P. et al. A multimodal depression consultation dataset of speech and text with HAMD-17 assessments. Sci. Data 12, 1577 (2025).
36.
Chen, J. et al. IIFDD: Intra and inter-modal fusion for depression detection with multi-modal information from Internet of Medical Things. Inf. Fusion 102, 102017 (2024).
37.
Cohen, J. et al. A multimodal dialog approach to mental state characterization in clinically depressed, anxious, and suicidal populations. Front. Psychol. 14, 1135469 (2023).
38.
Cook, B. L. et al. Novel use of natural language processing (NLP) to predict suicidal ideation and psychiatric symptoms in a text-based mental health intervention in Madrid. Comput. Math. Methods Med. 2016, 8708434 (2016).
39.
de Hond, A. et al. Predicting depression risk in patients with cancer using multimodal data: algorithm development study. JMIR Med. Inform. 12, e51925 (2024).
40.
Demiroglu, C., Beşirli, A., Ozkanca, Y. & Çelik, S. Depression-level assessment from multi-lingual conversational speech data using acoustic and text features. EURASIP J. Audio Speech Music Process. 2020, 17 (2020).
41.
Gao, H., Zhou, Y., Chen, L. & Chi, K. Deep Depression Detection Based on Feature Fusion and Result Fusion. in Chinese Conference on Pattern Recognition and Computer Vision (PRCV) 64–74 (Springer, 2023).
42.
Guo, Y. & Guo, Y. A Knowledge Graph and Large Language Model-Based Framework for Depression Detection. in 2024 International Conference on Image Processing, Computer Vision and Machine Learning 670–673 (IEEE, 2024).
43.
Hayati, M. F. M., Ali, M. A. M. & Rosli, A. N. M. Depression detection on Malay dialects using GPT-3. in 2022 IEEE-EMBS Conference on Biomedical Engineering and Sciences 360–364 (IEEE, 2022).
44.
He, Y., Lu, X., Yuan, J., Pan, T. & Wang, Y. Depressive Tendency Recognition by Fusing Speech and Text Features: A Comparative Analysis. in 2022 13th International Symposium on Chinese Spoken Language Processing 344–348 (IEEE, 2022).
45.
Howes, C. & Purver, M. Linguistic indicators of severity and progress in online text-based therapy for depression. in Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (2014).
46.
Iyortsuun, N. K., Kim, S.-H., Yang, H.-J., Kim, S.-W. & Jhon, M. Additive cross-modal attention network (ACMA) for depression detection based on audio and textual features. IEEE Access 12, 20479–20489 (2024).
47.
Joharee, I. N., Hashim, N. N. W. N. & Shah, N. S. M. Sentiment analysis and text classification for depression detection. J. Integr. Adv. Eng. 3, 65–78 (2023).
48.
Krishnamurti, T., Allen, K., Hayani, L., Rodriguez, S. & Davis, A. L. Identification of maternal depression risk from natural language collected in a mobile health app. Procedia Comput. Sci. 206, 132–140 (2022).
49.
Li, N. et al. Using deeply time-series semantics to assess depressive symptoms based on clinical interview speech. Front. Psychiatry 14, 1104190 (2023).
50.
Liu, T. et al. The relationship between text message sentiment and self-reported depression. J. Affect. Disord. 302, 7–14 (2022).
51.
Munthuli, A. et al. Classification and analysis of text transcription from Thai depression assessment tasks among patients with depression. PLoS One 18, e0283095 (2023).
52.
Nobles, A. L., Glenn, J. J., Kowsari, K., Teachman, B. A. & Barnes, L. E. Identification of imminent suicide risk among young adults using text messages. in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems 1–11 (2018).
53.
Oh, J. et al. Development of depression detection algorithm using text scripts of routine psychiatric interview. Front. Psychiatry 14, 1256571 (2024).
54.
Ohse, J. et al. Zero-Shot Strike: Testing the generalisation capabilities of out-of-the-box LLM models for depression detection. Comput. Speech Lang. 88, 101663 (2024).
55.
Orhan, Z., Mercan, M. & Gökgöl, M. K. A new digital mental health system infrastructure for diagnosis of psychiatric disorders and patient follow-up by text analysis in Turkish. in International Conference on Medical and Biological Engineering 395–402 (Springer, 2019).
56.
Porkaew, P., Zhu, T., Li, A. & Chuenphitthayavut, K. The effectiveness of a sentence completion test for depression screening using large language models. Acta Psychol. 259, 105425 (2025).
57.
Pérez-Toro, P. A. et al. Depression assessment in people with Parkinson’s disease: The combination of acoustic features and natural language processing. Speech Commun. 145, 10–20 (2022).
58.
Podina, I. R., Bucur, A.-M., Fodor, L. & Boian, R. Screening for common mental health disorders: a psychometric evaluation of a chatbot system. Behav. Inf. Technol. 44, 2160–2169 (2025).
59.
Ren, X., Burkhardt, H. A., Areán, P. A., Hull, T. D. & Cohen, T. Deep representations of first-person pronouns for prediction of depression symptom severity. in AMIA Symposium vol. 2023 1226 (2024).
60.
Resnik, P., Garron, A. & Resnik, R. Using topic modeling to improve prediction of neuroticism and depression in college students. in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing 1348–1353 (2013).
61.
Rutowski, T. et al. Depression and anxiety prediction using deep language models and transfer learning. in 2020 7th International Conference on Behavioural and Social Computing 1–6 (IEEE, 2020).
62.
Shen, Y., Yang, H. & Lin, L. Automatic depression detection: An emotional audio-textual corpus and a GRU/BiLSTM-based model. in International Conference on Acoustics, Speech and Signal Processing 6247–6251 (IEEE, 2022).
63.
Shin, D. et al. Detection of depression and suicide risk based on text from clinical interviews using machine learning: possibility of a new objective diagnostic marker. Front. Psychiatry 13, 801301 (2022).
64.
Shin, D., Kim, H., Lee, S., Cho, Y. & Jung, W. Using large language models to detect depression from user-generated diary text data as a novel approach in digital mental health screening: instrument validation study. J. Med. Internet Res. 26, e54617 (2024).
65.
Smirnova, D. et al. Language patterns discriminate mild depression from normal sadness and euthymic state. Front. Psychiatry 9, 105 (2018).
66.
Smirnova, D. et al. Language in mild depression: How it is spoken, what it is about, and why it is important to listen. Psychiatria Danub. 31, 427–433 (2019).
67.
Sood, P., Yang, X. & Wang, P. Enhancing depression detection from narrative interviews using language models. in International Conference on Bioinformatics and Biomedicine 3173–3180 (IEEE, 2023).
68.
Tao, Y. et al. Classifying anxiety and depression through LLMs virtual interactions: a case study with ChatGPT. in International Conference on Bioinformatics and Biomedicine 2259–2264 (IEEE, 2023).
69.
Tlachac, M. & Rundensteiner, E. Screening for depression with retrospectively harvested private versus public text. IEEE J. Biomed. Health Inform. 24, 3326–3332 (2020).
70.
Tlachac, M. et al. StudentSADD: Rapid mobile depression and suicidal ideation screening of college students during the coronavirus pandemic. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 6, 1–32 (2022).
71.
Tlachac, M. et al. Text generation to aid depression detection: a comparative study of conditional sequence generative adversarial networks. in International Conference on Big Data 2804–2813 (IEEE, 2022).
72.
Tlachac, M., Shrestha, A., Shah, M., Litterer, B. & Rundensteiner, E. A. Automated construction of lexicons to improve depression screening with text messages. IEEE J. Biomed. Health Inform. 27, 2751–2759 (2022).
73.
Weber, S. et al. Using a fine-tuned large language model for symptom-based depression evaluation. NPJ Digit. Med. 8, 598 (2025).
74.
Wright-Berryman, J., Cohen, J., Haq, A., Black, D. P. & Pease, J. L. Virtually screening adults for depression, anxiety, and suicide risk using machine learning and language from an open-ended interview. Front. Psychiatry 14, 1143175 (2023).
75.
Xue, J. et al. Fusing multi-level features from audio and contextual sentence embedding from text for interview-based depression detection. in International Conference on Acoustics, Speech and Signal Processing 6790–6794 (IEEE, 2024).
76.
Ye, J. et al. Multi-modal depression detection based on emotional audio and evaluation text. J. Affect. Disord. 295, 904–913 (2021).
77.
Yuan, J. et al. Depressive tendency recognition using the gated recurrent unit from speech and text features. in 2021 International Conference on Asian Language Processing 42–46 (IEEE, 2021).
78.
Zhang, Z. et al. Multimodal sensing for depression risk detection: Integrating audio, video, and text data. Sensors 24, 3714 (2024).
79.
Zou, B. et al. Semi-structural interview-based Chinese multimodal depression corpus towards automatic preliminary screening of depressive disorders. IEEE Trans. Affect. Comput. 14, 2823–2838 (2022).
80.
Morales, M. R. & Levitan, R. Speech vs. text: A comparative analysis of features for depression detection systems. in Spoken Language Technology Workshop 136–143 (IEEE, 2016).
81.
Özkanca, Y., Demiroglu, C., Besirli, A. & Celik, S. Multi-Lingual Depression-Level Assessment from Conversational Speech Using Acoustic and Text Features. in Interspeech 3398–3402 (2018).
82.
Agarwal, N., Dias, G. & Dollfus, S. Agent-based splitting of patient-therapist interviews for depression estimation. in Empowering Communities: A Participatory Approach to AI for Mental Health (2022).
83.
Agarwal, N., Dias, G. & Dollfus, S. Multi-view graph-based interview representation to improve depression level estimation. Brain Inform. 11, 14 (2024).
84.
Al Hanai, T., Ghassemi, M. M. & Glass, J. R. Detecting depression with audio/text sequence modeling of interviews. in Interspeech 1716–1720 (2018).
85.
Ansari, G., Sharma, A., Arya, P. & Saxena, Y. Multimodal Depression Detection System Using Machine Learning. in 2023 Second International Conference on Informatics 1–7 (IEEE, 2023).
86.
Burdisso, S., Villatoro-Tello, E., Madikeri, S. & Motlicek, P. Node-weighted Graph Convolutional Network for Depression Detection in Transcribed Clinical Interviews. in Interspeech 3617–3621 (2023).
87.
Cao, Y., Hao, Y., Li, B. & Xue, J. Depression prediction based on BiAttention-GRU. J. Ambient Intell. Humaniz. Comput. 13, 5269–5277 (2022).
88.
Chen, Z. et al. Depression detection in clinical interviews with LLM-empowered structural element graph. in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 8181–8194 (2024).
89.
Correia, J., Trancoso, I. & Raj, B. Detecting psychological distress in adults through transcriptions of clinical interviews. in International Conference on Advances in Speech and Language Technologies for Iberian Languages 162–171 (Springer, 2016).
90.
Dang, T. et al. Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in AVEC 2017. in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge 27–35 (2017).
91.
Danner, M. et al. Advancing mental health diagnostics: GPT-based method for depression detection. in 2023 62nd Annual Conference of the Society of Instrument and Control Engineers 1290–1296 (IEEE, 2023).
92.
Fang, M., Peng, S., Liang, Y., Hung, C.-C. & Liu, S. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomed. Signal Process. Control 82, 104561 (2023).
93.
Firoz, N., Beresteneva, O. G., Vladimirovich, A. S., Tahsin, M. S. & Tafannum, F. Automated text-based depression detection using hybrid ConvLSTM and Bi-LSTM model. in 2023 Third International Conference on Artificial Intelligence and Smart Energy 734–740 (IEEE, 2023).
94.
Firoz, N., Beresteneva, O. G. & Aksyonov, S. V. Enhancing depression detection: Employing autoencoders and linguistic feature analysis with BERT and LSTM model. in 2023 International Russian Automation Conference 299–304 (IEEE, 2023).
95.
Guo, Y. et al. A prompt-based topic-modeling method for depression detection on low-resource data. IEEE Trans. Comput. Soc. Syst. 11, 1430–1439 (2023).
96.
Hadzic, B. et al. Enhancing early depression detection with AI: a comparative use of NLP models. SICE J. Control. Meas. Syst. Integr. 17, 135–143 (2024).
97.
Hong, S., Cohn, A. & Hogg, D. C. Using graph representation learning with schema encoders to measure the severity of depressive symptoms. in International Conference on Learning Representations (2022).
98.
Jo, A.-H. & Kwak, K.-C. Diagnosis of depression based on four-stream model of bi-LSTM and CNN from audio and text information. IEEE Access 10, 134113–134135 (2022).
99.
Kokkera, A., Varsha, N. & Vasanth, A. V. Multimodal Approach for Detecting Depression Using Physiological and Behavioural Data. in 2023 3rd International Conference on Pervasive Computing and Social Networking 53–65 (IEEE, 2023).
100.
Lam, G., Dongyan, H. & Lin, W. Context-aware deep learning for multi-modal depression detection. in International Conference on Acoustics, Speech and Signal Processing 3946–3950 (IEEE, 2019).
101.
Lau, C., Chan, W.-Y. & Zhu, X. Improving depression assessment with multi-task learning from speech and text information. in 2021 55th Asilomar Conference on Signals, Systems, and Computers 449–453 (IEEE, 2021).
102.
Lau, C., Zhu, X. & Chan, W.-Y. Automatic depression severity assessment with deep learning using parameter-efficient tuning. Front. Psychiatry 14, 1160291 (2023).
103.
Li, C., Braud, C. & Amblard, M. Multi-Task Learning for Depression Detection in Dialogs. in Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue. 1–8 (2022).
104.
Li, M., Xu, H., Liu, W. & Liu, J. Bidirectional LSTM and attention for depression detection on clinical interview transcripts. in IEEE 10th International Conference on Information, Communication and Networks (ICICN) 638–643 (IEEE, 2022).
105.
Li, M., Sun, X. & Wang, M. Detecting depression with heterogeneous graph neural network in clinical interview transcript. IEEE Trans. Comput. Soc. Syst. 11, 1315–1324 (2023).
106.
Lin, L., Chen, X., Shen, Y. & Zhang, L. Towards automatic depression detection: A BiLSTM/1D CNN-based model. Appl. Sci. 10, 8701 (2020).
107.
Lopez-Otero, P., Fernández, L. D., Abad, A. & Garcia-Mateo, C. Depression Detection Using Automatic Transcriptions of De-Identified Speech. in Interspeech 3157–3161 (2017).
108.
Lorenc, P., Uban, A.-S., Rosso, P. & Šedivý, J. Detecting early signs of depression in the conversational domain: The role of transfer learning in low-resource scenarios. in International Conference on Applications of Natural Language to Information Systems 358–369 (Springer, 2022).
109.
Lu, K.-C., Thamrin, S. A. & Chen, A. L. Depression detection via conversation turn classification. Multimed. Tools Appl. 82, 39393–39413 (2023).
110.
Rodrigues Makiuchi, M., Warnita, T., Uto, K. & Shinoda, K. Multimodal fusion of BERT-CNN and gated CNN representations for depression detection. in Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop 55–63 (2019).
111.
Mallol-Ragolta, A., Zhao, Z., Stappen, L., Cummins, N. & Schuller, B. A hierarchical attention network-based approach for depression detection from transcribed clinical interviews. in Interspeech 221–225 (2019).
112.
Mao, K. et al. Prediction of depression severity based on the prosodic and semantic features with bidirectional LSTM and time distributed CNN. IEEE Trans. Affect. Comput. 14, 2251–2265 (2022).
113.
Milintsevich, K., Sirts, K. & Dias, G. Towards automatic text-based estimation of depression through symptom prediction. Brain Inform. 10, 4 (2023).
114.
Niu, M., Chen, K., Chen, Q. & Yang, L. HCAG: A hierarchical context-aware graph attention model for depression detection. in International Conference on Acoustics, Speech and Signal Processing 4235–4239 (IEEE, 2021).
115.
Pampouchidou, A. et al. Depression assessment by fusing high and low level features from audio, video, and text. in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge 27–34 (2016).
116.
Prabhu, S., Mittal, H., Varagani, R., Jha, S. & Singh, S. Harnessing emotions for depression detection. Pattern Anal. Appl. 25, 537–547 (2022).
117.
Qureshi, S. A., Saha, S., Hasanuzzaman, M. & Dias, G. Multitask representation learning for multimodal estimation of depression level. IEEE Intell. Syst. 34, 45–52 (2019).
118.
Qureshi, S. A., Dias, G., Hasanuzzaman, M. & Saha, S. Improving depression level estimation by concurrently learning emotion intensity. IEEE Comput. Intell. Mag. 15, 47–59 (2020).
119.
Qureshi, S. A., Dias, G., Saha, S. & Hasanuzzaman, M. Gender-aware estimation of depression severity level in a multimodal setting. in 2021 International Joint Conference on Neural Networks 1–8 (IEEE, 2021).
120.
Rasipuram, S., Bhat, J. H., Maitra, A., Shaw, B. & Saha, S. Multimodal depression detection using task-oriented transformer-based embedding. in 2022 IEEE Symposium on Computers and Communications 01–04 (IEEE, 2022).
121.
Ray, A., Kumar, S., Reddy, R., Mukherjee, P. & Garg, R. Multi-level attention network using text, audio and video for depression prediction. in Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop 81–88 (2019).
122.
Rinaldi, A., Tree, J. E. F. & Chaturvedi, S. Predicting depression in screening interviews from latent categorization of interview prompts. in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 7–18 (2020).
123.
Rohanian, M., Hough, J. & Purver, M. Detecting depression with word-level multimodal fusion. in Interspeech 1443–1447 (2019).
124.
Sadeghi, M. et al. Exploring the capabilities of a language model-only approach for depression detection in text data. in International Conference on Biomedical and Health Informatics 1–5 (IEEE, 2023).
125.
Sadeghi, M. et al. Harnessing multimodal approaches for depression detection using large language models and facial expressions. NPJ Ment. Health Res. 3, 66 (2024).
126.
Samareh, A., Jin, Y., Wang, Z., Chang, X. & Huang, S. Predicting depression severity by multi-modal feature engineering and fusion. in Proceedings of the AAAI Conference on Artificial Intelligence vol. 32 (2018).
127.
Senn, S., Tlachac, M., Flores, R. & Rundensteiner, E. Ensembles of BERT for depression classification. in 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society 4691–4694 (IEEE, 2022).
128.
Stasak, B., Epps, J. & Goecke, R. Elicitation Design for Acoustic Depression Classification: An Investigation of Articulation Effort, Linguistic Complexity, and Word Affect. in Interspeech vol. 17 834–838 (2017).
129.
Stepanov, E. A. et al. Depression severity estimation from multiple modalities. in 2018 IEEE 20th International Conference on e-Health Networking, Application and Services 1–6 (IEEE, 2018).
130.
Sun, B. et al. A random forest regression method with selected-text feature for depression assessment. in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge 61–68 (2017).
131.
Toto, E., Tlachac, M. & Rundensteiner, E. A. AudiBERT: A deep transfer learning multimodal classification framework for depression screening. in Proceedings of the 30th ACM International Conference on Information & Knowledge Management 4145–4154 (2021).
132.
Marriwala, N. & Chaudhary, D. A hybrid model for depression detection using deep learning. Meas.: Sensors 25, 100587 (2023).
133.
Van Steijn, F., Sogancioglu, G. & Kaya, H. Text-based interpretable depression severity modeling via symptom predictions. in Proceedings of the 2022 International Conference on Multimodal Interaction 139–147 (2022).
134.
Villatoro-Tello, E., Ramírez-de-la-Rosa, G., Gática-Pérez, D., Magimai.-Doss, M. & Jiménez-Salazar, H. Approximating the mental lexicon from clinical interviews as a support tool for depression detection. in Proceedings of the 2021 International Conference on Multimodal Interaction 557–566 (2021).
135.
Williamson, J. R. et al. Detecting depression using vocal, facial and semantic communication cues. in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge 11–18 (2016).
136.
Xezonaki, D., Paraskevopoulos, G. & Potamianos, A. Affective Conditioning on Hierarchical Attention Networks applied to Depression Detection from Transcribed Clinical Interviews. in Interspeech 4556–4560 (2020).
137.
Xia, Y. et al. A depression detection model based on multimodal graph neural network. Multimed. Tools Appl. 83, 63379–63395 (2024).
138.
Xiao, J., Huang, Y., Zhang, G. & Liu, W. A deep learning method on audio and text sequences for automatic depression detection. in 2021 3rd International Conference on Applied Machine Learning 388–392 (IEEE, 2021).
139.
Xu, X., Zhang, G., Lu, Q. & Mao, X. Multimodal depression recognition that integrates audio and text. in 2023 4th International Symposium on Computer Engineering and Intelligent Communications 164–170 (IEEE, 2023).
140.
Yadav, U. & Sharma, A. K. A novel automated depression detection technique using text transcript. Int. J. Imaging Syst. Technol. 33, 108–122 (2023).
141.
Yang, L. et al. Hybrid depression classification and estimation from audio video and text information. in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge 45–51 (2017).
142.
Yang, L. et al. Multimodal measurement of depression using deep learning models. in Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge 53–59 (2017).
143.
Yang, L., Jiang, D. & Sahli, H. Integrating deep and shallow models for multi-modal depression analysis—hybrid architectures. IEEE Trans. Affect. Comput. 12, 239–253 (2018).
144.
Yang, C., Lai, X., Hu, Z., Liu, Y. & Shen, P. Depression tendency screening use text based emotional analysis technique. J. Phys.: Conf. Ser. 1237, 032035 (2019).
145.
Zhang, Z., Lin, W., Liu, M. & Mahmoud, M. Multimodal deep learning framework for mental disorder recognition. in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition 344–350 (IEEE, 2020).
146.
Zhang, Y., Wang, Y., Wang, X., Zou, B. & Xie, H. Text-based decision fusion model for detecting depression. in Proceedings of the 2020 2nd Symposium on Signal Processing Systems 101–106 (2020).
147.
Zhang, W., Mao, K. & Chen, J. A Multimodal approach for detection and assessment of depression using text, audio and video. Phenomics 4, 234–249 (2024).
148.
Zhang, J. & Guo, Y. Multilevel depression status detection based on fine-grained prompt learning. Pattern Recog. Lett. 178, 167–173 (2024).
149.
Zhao, Z. & Wang, K. Unaligned multimodal sequences for depression assessment from speech. in 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society 3409–3413 (IEEE, 2022).
Fig. 1
Flow Diagram of Study Selection
Fig. 2
Forest plot of study-level classification accuracy with pooled random-effects estimate
Fig. 3
Galbraith (radial) plot assessing heterogeneity across studies in classification accuracy
Note. The Galbraith (radial) plot displays each study’s standardized effect size (zi) against its statistical precision (1 / √(vi + τ²)), the inverse of the standard error, which reflects the reliability of each study’s estimate. On the x-axis, studies farther to the right are more precise (smaller SE); on the y-axis, points near the top or bottom deviate more from the pooled effect. The solid line represents the pooled effect from the random-effects model, and the shaded 95% confidence region indicates the expected range of variation.
In this analysis, most studies fall within the shaded region, suggesting that the observed heterogeneity (I² = 98.8%) reflects broad variability across studies rather than the influence of a single outlier.
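The quantities described in this note can be illustrated with a minimal sketch. It is not the analysis code used for this review. It assumes a DerSimonian-Laird estimate of τ² (chosen only because it has a closed form) and effect sizes already on an approximately normal scale, for example logit-transformed accuracies. The function name `radial_coordinates` and the example values are hypothetical.

```python
import math


def radial_coordinates(effects, variances, z_crit=1.96):
    """Random-effects summary plus Galbraith (radial) plot coordinates.

    Assumption: tau^2 is estimated with the DerSimonian-Laird method;
    other estimators (e.g. REML) would give different values.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect inverse-variance weights
    sum_w = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q, then DerSimonian-Laird tau^2 and the I^2 statistic (%)
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights and pooled estimate
    w_re = [1.0 / (v + tau2) for v in variances]
    mu_random = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)

    # x-axis: precision 1 / sqrt(v_i + tau^2); y-axis: standardized effect z_i
    x = [math.sqrt(wi) for wi in w_re]
    z = [yi * xi for yi, xi in zip(effects, x)]

    # A study lies inside the 95% band if its standardized residual from the
    # pooled-effect line (slope mu_random through the origin) is within z_crit
    inside = [abs(zi - mu_random * xi) <= z_crit for zi, xi in zip(z, x)]

    return {"tau2": tau2, "i2": i2, "mu": mu_random,
            "x": x, "z": z, "inside": inside}


if __name__ == "__main__":
    # Hypothetical logit-accuracy estimates and sampling variances
    result = radial_coordinates([1.2, 1.6, 0.9, 2.1], [0.02, 0.05, 0.01, 0.08])
    for xi, zi, ok in zip(result["x"], result["z"], result["inside"]):
        print(f"precision={xi:.2f}  z={zi:.2f}  inside_band={ok}")
```

Because each point is compared with the pooled-effect line rather than with zero, a large I² can coexist with most points lying inside the band when the excess variability is spread across many studies rather than driven by one.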
Fig. 4
Forest Plot of Classification Accuracy from Sensitivity Analysis Excluding Studies
1
Because the DAIC dataset and its derivatives were used repeatedly across numerous studies, with overlapping samples and similar recording procedures, their inclusion could bias descriptive summaries toward the characteristics of that single corpus. Therefore, in the dataset characteristics section, DAIC-based studies are summarized separately from those using unique datasets to better represent the diversity of data sources included in the review.
2
Many studies tested multiple NLP feature extraction and classification methods. The counts reported here represent the best-performing approach in each study rather than the only method used. For example, some studies compared complex models with simpler or traditional approaches to test whether performance improved.