A Time-Aware Multilingual Multimodal Framework for Depression Detection on Social Media
Vandana Sharma1
Geeta Sikka2
Ajay K Sharma3
Abstract
Around 280 million people across the world live with depression, making it one of the most common mental health concerns today bib42. Early detection is one of the most effective ways to support those in need and prevent their condition from worsening. Social media lets us see people's daily activities and feelings in today's digital world. More people express their genuine opinions online than in clinical settings. These websites are therefore helpful in learning about mental health trends. However, most previous studies have examined only English text and ignored the variety of languages and media people use on social platforms. This gap is evident in India, where users often write in Hinglish (a natural mix of Hindi and English), which brings new linguistic challenges. To bridge this gap, our study introduces a time-aware multilingual and multimodal framework for detecting signs of depression from social media posts collected from X (formerly Twitter). The model develops a deeper insight into how users behave and express themselves by learning from text, images, and posting times. The results of our experiments indicate that the model performs well, consistently across runs, with an F1-score of 0.79 and an AUC of 0.74, outperforming all text-only or single-language baselines. These results suggest that combining behavioural, visual, and textual cues improves the accuracy and cross-linguistic flexibility of depression detection. This is the first study to examine multilingual and multimodal depression detection using actual Indian social media data. This study shows the value of multicultural research and offers a valuable framework for developing tools that can facilitate online mental health monitoring.
Keywords
Depression detection
Social media
Mental health
Deep learning
Multimodal
Introduction
All around the world, many people now face depression, a condition that can affect anyone regardless of who they are. More than 280 million people live with it today bib42, indicating its widespread prevalence. It brings sadness and shapes one's thoughts, relationships, and daily quality of life. Due to a lack of resources for mental health, the poor infrastructure for healthcare, and the enduring social stigma, low- and middle-income countries (LMICs) bear a disproportionate amount of this burden. Nearly one in twenty people in India suffers from depressive disorders, many of which go undiagnosed or untreated, according to data from the National Mental Health Survey bib17. If depression is not treated, it can result in decreased productivity, social disengagement, and, frequently, suicidal thoughts and behaviors. These statistics highlight the urgent need for early, accessible, and non-invasive methods to identify individuals at risk and facilitate timely intervention.
In recent years, social media has transformed the way people express their emotions and share their experiences. Social media sites like X (formerly Twitter) are valuable resources for mental health research as they offer vast amounts of real-time data representing users' emotions and behaviors bib13. Unlike structured clinical interviews or self-reported surveys, social media expressions are spontaneous and often reflect authentic emotional states, offering new opportunities for large-scale behavioral analysis. They therefore provide potential for passive and early detection of mental health issues such as depression bib20. However, such an approach is still in its infancy bib43, facing challenges such as noisy multilingual data, scarce ground-truth labels, and unstructured content that is difficult to interpret.
Language is central to mental health bib14, yet most studies focus on English-language data. Very few have explored depression detection in languages other than English, and almost none have addressed mixed-language contexts, such as Hinglish, a hybrid form combining Hindi and English. Hinglish is prevalent on Indian social media and is characterized by code-switching within and across sentences, the use of Hindi words written in both Roman script and Devanagari script, and a flexible, informal grammar. This mixture reflects natural conversation but poses challenges for computational analysis, as conventional natural language processing tools are designed for single-language text. Understanding Hinglish is crucial for developing culturally grounded and linguistically sensitive depression detection systems that accurately reflect how Indian users communicate online. Despite reflecting the everyday language of millions of people, this hybrid form of Hindi and English is still underrepresented in computational models.
The absence of multimodal exploration in current research is another significant drawback. Most earlier research has examined only textual data, overlooking the other forms of self-expression that accompany written posts. People often post pictures on social media, such as memes or photos, that either support or contradict what they say in text. If these visual cues are ignored, a user's mental state may be hard to understand fully. Multimodal learning combines text, images, and time-based information to provide a more comprehensive picture of how people feel and act online. This approach is well suited to India, where people often mix words, pictures, and cultural symbols when communicating. Figure 1 illustrates this idea by showing how an image can reveal emotions that the text alone might miss.
Fig. 1
An example of a multimodal post showing contrasting signals. The text conveys a positive tone, yet the image suggests underlying sadness, illustrating how visual data add crucial emotional context.
This study investigates how to identify depression using user-generated data from X to fill these gaps. We present a multilingual, multimodal framework that integrates text, images, and time into a single learning pipeline. The proposed model uses XLM-RoBERTa for text encoding, ResNet-18 for image analysis, and a temporal embedding layer to represent posting patterns. XLM-RoBERTa, a transformer model trained on 100 languages, can handle multilingual and code-switched text efficiently bib12. ResNet-18, a deep residual network, extracts emotional and contextual information from photos bib22. Together, these components help the model read emotional cues from multiple data types.
The following is an outline of the study's key contributions:
We developed and annotated a multimodal dataset of depression-related social media posts from X, containing both English and Hindi content, along with associated text and image data.
We proposed a time-aware, multilingual, multimodal framework that jointly models text, image, and temporal information using an early-fusion learning approach, enabling the model to understand emotional meaning and context better.
We conducted systematic experiments comparing unimodal, multimodal, and language-specific setups to assess the benefits of integrating multiple modalities and languages.
We also conducted a case study to interpret the model's behaviour and comprehend how textual, visual, and temporal elements contribute to its predictions.
The paper is divided into the following sections. Section 2 reviews related work on depression detection. Section 3 describes the proposed methodology and model architecture. In Section 4, the experiments and main results are described. Section 5 then explains the case study and how the model decisions were interpreted. The paper concludes with a summary and ideas for future work in Section 6.
Related work
Evolution of computational approaches
In the early 2010s, experts began to see that social media could be helpful in spotting signs of depression, since it captures how people think and feel in real time. The words people use on X can reveal signs of depression, according to one of the first studies by bib15. Their dictionary-based method demonstrated that a person's language frequently reflects their mood, providing a foundation for further computer research. Subsequently, researchers began to go beyond text alone by incorporating social activity features, such as network size and interaction levels, and user details, including age, gender, and location bib39,bib43. Depressive symptoms show correlation with behavioural and environmental factors such as posting time, weather, and daily activity patterns bib13,bib21,bib43. According to these results, combining behavioural and linguistic indicators rather than just language is the most effective way to understand depression.
After the initial text-based studies, researchers examined how depressive signs change over time. bib39 noticed that people showing depressive behaviour often post at unusual hours or follow irregular daily patterns. bib33 found that tracking when people post over days or weeks allows a model to detect mood swings more accurately. bib10 also demonstrated that time patterns can improve predictions, achieving an F1-score of 0.956 by applying a time-aware attention model to Instagram posts. At the same time, other researchers began to concentrate on outward manifestations of depression. bib15 found that users with depressive tendencies often shared darker and less colourful images, while bib31 achieved more than 70% accuracy with image-based machine learning.
As the field developed, research started integrating text and image data to create richer and more trustworthy models. \cite{bib5} incorporated topic modelling to help the model better capture the meaning of posts. In a separate study, bib19 employed a Bi-GRU network for text and a VGGNet for images to integrate text and image features. More recently, bib18 examined videos to identify body language and facial expressions as nonverbal signs of emotional distress. Although their study included video data, the material came from mixed sources: clinical recordings for videos and social media content for images and text. These studies show that multimodal learning is steadily advancing bib11,bib34. However, work in this area is still limited, as many datasets are private and results are hard to reproduce. Another ongoing challenge is the limited language variety in available data. Most existing research still focuses mainly on English-language datasets. Only a handful of studies have looked into other languages, including Chinese bib43,bib40, Russian bib37, Arabic \cite{bib2}, and Bengali bib25.
In contrast, mixed-language settings like Hinglish, which naturally blends Hindi and English, have received little attention. Such texts are often complex for regular NLP systems to interpret, including mixed scripts, informal wording, and inconsistent grammar bib32. Overcoming this language gap is crucial for developing mental health detection models that are realistic, inclusive, and more accurately reflect how people communicate online.
Deep learning for depression detection
In computer science, the study of depression detection has moved steadily from hand-crafted statistical methods to deep learning architectures capable of learning complex, non-linear relationships in multimodal and multilingual data. Early work primarily relied on manual feature extraction, where researchers identified linguistic markers, behavioral signals, or profile attributes and then trained models such as support vector machines, logistic regression, or random forests bib14,bib21. These methods progressively added visual cues, temporal patterns, and social network indicators as richer data became available. However, most still handled each signal separately rather than using a single framework. Attempts have also been made to model time explicitly. bib10 investigated the relationship between the timing of Instagram posts and people's emotions. Their model focused on user post patterns and found links to emotional changes. Earlier, bib33 also examined timing in conjunction with the text to observe how moods evolve. Despite these advances, few large-scale public datasets include text, images, and timestamps, and multilingual and code-switched environments in particular remain largely unexplored.
The transition to deep learning significantly improved the ability to capture such dependencies. Early research employed recurrent models, such as LSTMs and GRUs, to investigate how words in text relate to each other over time bib35,bib43. Convolutional networks, such as VGGNet and ResNet, gained popularity for image data analysis because they can capture emotional and contextual details from visual content bib19. Later, transformer models such as BERT, RoBERTa, and XLM-RoBERTa revolutionised text analysis, enabling the learning of deep language patterns that transfer across different languages bib12,bib28. Recent studies have also started to focus on model transparency and multilingual learning. bib40 explored ways to make depression prediction models more understandable by showing where the model focuses its attention and adjusting it for different languages. \cite{bib7} took a different approach, utilising a Time2Vec method to connect feelings and timing information from other types of data.
Our work combines XLM-RoBERTa for handling text and ResNet-18 for processing visual content. XLM-RoBERTa is a multilingual transformer trained on approximately 2.5 TB of text from CommonCrawl, covering 100 languages bib12. It extends the masked language modelling (MLM) idea from BERT to multilingual data by predicting hidden words based on their context. The model minimizes the following cross-entropy loss:
\begin{equation}
\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log P\left(x_i \mid x_{\setminus \mathcal{M}}\right)
\end{equation}
where $x_i$, $i \in \mathcal{M}$, represents the masked tokens, and $x_{\setminus \mathcal{M}}$ denotes the surrounding words used as context.

ResNet-18 is a convolutional neural network designed to overcome the vanishing gradient issue using shortcut, or skip, connections bib22. These connections allow the input feature map $x$ to bypass one or more layers and be added directly to the output. The relationship can be represented mathematically as follows:
\begin{equation}
y = \mathcal{F}(x) + x
\end{equation}
where $\mathcal{F}(x)$ denotes the residual function, $x$ represents the input feature map, and $y$ is the resulting output after the residual connection. This setup preserves important visual details while enabling the deep network to recognize complex patterns that relate to emotions and mood. By combining these two models, our system captures useful signals from text, images, and posting behavior, leading to a more realistic picture of how depressive signs appear on social media. Building on earlier multimodal research bib11,bib10,bib7, our framework also explores areas that have received less attention, especially multilingual and code-mixed communication challenges in real-world depression detection. To contextualize our contribution, Table \ref{tab:relatedwork} summarizes studies that have used non-English content or a mix of English with other languages for depression detection.
\begin{sidewaystable*}
\caption{Summary of studies using non-English or multilingual data for depression detection.}\label{tab:relatedwork}
\tiny
\renewcommand{\arraystretch}{1.2}
\setlength{\tabcolsep}{2pt}
\begin{tabular*}{\textheight}{@{\extracolsep\fill}p{2.5cm}p{2.8cm}p{1.6cm}p{1.6cm}p{2.2cm}p{1.6cm}p{0.8cm}p{5.2cm}}
\toprule
Author (Year) & Models & Data Modality & Model Type & Platform & Language(s) & F1 & Key Highlights \\
\midrule
Our Work (2025) & XLM-RoBERTa + ResNet18 + Temporal Embeddings & Text, Images, Time & Multimodal & X & English, Hindi, Hinglish & 0.79 & Multilingual, code-switched data, early fusion, time-aware design \\
Xiao et al. (2024) bib44 & Hierarchical Transformer Network (HTN) & Text & Unimodal & Sina Weibo & Chinese & 0.87 & Hierarchical text modeling across social posts \\
Yang Liu (2024) bib28 & Weighted Graph-RoBERTa Neural Network (WGRNN) & Text & Unimodal & Online Reviews & Chinese & 0.84 & Graph-based contextual representation \\
Bucur et al. (2023) \cite{bib7} & Multimodal Transformer (Vanilla, Set, Time2Vec) & Text, Images & Multimodal & Reddit, X & English, Spanish & 0.81 & Temporal embeddings and multimodal fusion \\
Kabir et al. (2022) bib25 & KSVM, RF, LR, KNN, NB, LSTM, GRU, RNN & Text & Unimodal & Blogs \& Microblogs & Bangla & 0.77 & Traditional + deep text models for low-resource language \\
Abdulqader M. Almars (2022) \cite{bib3} & Attention-based Bi-LSTM & Text & Unimodal & X & Arabic & 0.83 & Attention for contextualized Arabic depression cues \\
Cheng et al. (2022) bib10 & Multimodal Time-Aware Attention Networks (MTAN) & Text, Image, Time & Multimodal & Instagram & English & 0.956 & Time-aware LSTM + attention for interpretability \\
El-Ramly et al. (2021) bib16 & Fine-tuned ARABERT, MARBERT & Text & Unimodal & X & Arabic & 0.82 & Domain-specific BERT adaptation \\
Mumu et al. (2021) bib30 & CNN-LSTM Hybrid & Text & Unimodal & Facebook & Bangla & 0.75 & CNN-LSTM synergy for Bangla \\
Safa et al. (2021) bib34 & CNN + BERT Fusion & Text, Image, Profile & Multimodal & X & English & 0.74 & User-level multimodal analysis \\
Wang et al. (2020) bib40 & BERT, XLNet, RoBERTa & Text & Unimodal & Sina Weibo & Chinese & 0.86 & Transformer-based comparative analysis \\
Chiu et al. (2020) bib11 & LSTM with Temporal Gaps & Text, Images, Time & Multimodal & Instagram & English, Chinese & 0.83 & Time interval modeling for user posts \\
Eatedal Alabdulkreem (2020) \cite{bib2} & RNN + LSTM Cascaded DNN & Text & Unimodal & X & Arabic & 0.81 & Layered deep neural structure \\
Wang et al. (2019) bib40 & CNN, LSTM, BERT, RoBERTa, XLNet & Text & Unimodal & Sina Weibo & Chinese & 0.80 & Early transformer benchmark \\
Stankevich et al. (2019) bib37 & MLP & Text & Unimodal & VKontakte & Russian & 0.78 & Depression detection from short social posts \\
Shen et al. (2018) bib35 & DNN & Text & Unimodal & X, Weibo & English, Chinese & 0.72 & Cross-lingual early deep model \\
Wu et al. (2018) bib43 & RNN-inspired DNN & Text & Unimodal & Facebook & Chinese & 0.76 & Early deep architecture for behavioral signals \\
\botrule
\end{tabular*}
\footnotemark
\footnotetext{\textbf{Abbreviations:} CNN – Convolutional Neural Network; RNN – Recurrent Neural Network; LSTM – Long Short-Term Memory; Bi-LSTM – Bidirectional Long Short-Term Memory; DNN – Deep Neural Network; MLP – Multilayer Perceptron; BERT – Bidirectional Encoder Representations from Transformers; XLNet – Generalized Autoregressive Transformer; RoBERTa – Robustly Optimized BERT Pretraining Approach; KSVM – Kernel Support Vector Machine; RF – Random Forest; LR – Logistic Regression; KNN – K-Nearest Neighbors; NB – Naïve Bayes; GRU – Gated Recurrent Unit.}
\footnotetext{Hinglish – a Hindi–English mixed language commonly used on Indian social media platforms, combining words or phrases from both languages within the same post.}
\end{sidewaystable*}
As shown in Table 6, most earlier studies focused on a single type of data, mainly text in languages such as Chinese bib43,bib40,bib28,bib44, Arabic bib16, Russian bib37 and Bangla bib30,bib25. Even the more recent multimodal studies bib1,bib10,bib7,bib27 still face key limitations, such as the limited availability of high-quality image data, restricted language coverage, and a lack of focus on multilingual or code-mixed settings.
So far, no research directly explores Hinglish or other mixed languages, which create unique challenges in tokenisation, meaning understanding, and connecting text with images. Our work addresses these shortcomings by proposing a time-aware, multilingual, and multimodal framework that jointly learns from text, images, and timing information drawn from real social media posts. Previous studies have mainly examined text and pictures separately, without considering their interaction. Our model takes a different path: it studies both side by side to see how words and visuals influence each other. Using transformers for text and CNNs for images, the system learns subtle language details alongside the visual mood or setting. This combination helps it build a more precise and realistic picture of how people express emotions or signs of depression on social platforms.
Method
Problem overview
This work examines how various types of information, including text, images, and posting time, can be combined to detect signs of depression in social media posts. Unlike standard text classification, our dataset includes multiple languages and diverse input types, which introduces additional layers of complexity. To clarify the research motivation, we first identify the major sources of challenges—such as multilinguality, code-mixing, and multimodal imbalance—and then describe the specific objectives of the work.
Linguistic and multimodal challenges
Language diversity poses one of the most significant challenges in our dataset. Hindi and English, two of India's most widely spoken languages, are used to write the posts. While Hindi is typically written in the Devanagari script, users online frequently write it using English letters, referred to as Romanised Hindi, for convenience. For example, a sentence usually appearing in Hindi script may instead be typed as ``mujhe lagta hai zindagi se haar gaya hoon,'' meaning ``I feel like I have lost to life.'' Many users also mix Hindi and English within the same post, a phenomenon known as code-mixing or Hinglish bib32. For instance, one might write ``Mood kharab hai today'' or ``Kal se bilkul motivation nahi hai.'' Such mixed posts often contain non-standard grammar, informal phrasing, and numerous spelling variations, all of which hinder the performance of standard NLP models.
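To make the distinction between these writing styles concrete, the following minimal sketch tags a post as English (EN), Devanagari Hindi (HD), Romanised Hindi (HR), or Hinglish (HN). It is an illustrative heuristic only, not the identification method used in this study: the function name `tag_language_form` and the tiny `ROMAN_HINDI` vocabulary (taken from the example sentences above) are assumptions introduced here.

```python
import re

# Devanagari Unicode block (U+0900 to U+097F).
DEVANAGARI = re.compile(r"[\u0900-\u097F]")

# Hypothetical toy vocabulary of Romanised Hindi words, drawn from the
# example sentences in the text. A real system would need a much larger
# list or a trained language identifier.
ROMAN_HINDI = {
    "mujhe", "lagta", "hai", "zindagi", "se", "haar", "gaya", "hoon",
    "kharab", "kal", "bilkul", "nahi",
}


def tag_language_form(post: str) -> str:
    """Return 'EN', 'HD', 'HR', or 'HN' for a post (illustrative heuristic)."""
    # Any Devanagari character marks the post as Devanagari Hindi here,
    # even if English words are also present.
    if DEVANAGARI.search(post):
        return "HD"
    words = re.findall(r"[a-z]+", post.lower())
    hindi = sum(w in ROMAN_HINDI for w in words)
    if hindi == 0:
        return "EN"   # no known Hindi words (also the fallback for empty text)
    if hindi == len(words):
        return "HR"   # every word is Romanised Hindi
    return "HN"       # mixture of Romanised Hindi and other (English) words


for example in [
    "mujhe lagta hai zindagi se haar gaya hoon",
    "Mood kharab hai today",
    "I feel like I have lost to life",
]:
    print(tag_language_form(example), "|", example)
```

A vocabulary lookup of this kind breaks down quickly on spelling variants such as "nahin" or "nhi", which is one reason a learned multilingual encoder is attractive for this kind of text.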
The multimodal nature of the data introduces additional challenges. While most posts consist solely of text, a smaller subset includes images that may convey emotional or contextual cues. However, this image-based content is unevenly distributed: English posts tend to include images more frequently than Hindi or Hinglish posts, which are predominantly text-only. This imbalance complicates model training, a trend also observed in previous multimodal mental health studies bib31,bib11. Another difficulty arises when textual and visual modalities convey conflicting emotional tones. A cheerful image might accompany a melancholic caption, or vice versa, obscuring the overall affective meaning of the post.
To address these challenges, we made several design choices during dataset curation and model development. We constructed a multilingual lexicon encompassing English, Romanised Hindi, and variant spellings, guided by earlier multilingual depression detection approaches bib43,bib10. For textual encoding, we employed XLM-RoBERTa bib12, a transformer model trained on 100 languages that has shown strong performance on code-switched and low-resource text. For visual feature extraction, we used ResNet-18 bib22, a convolutional architecture effective for both emotional and contextual understanding of images. Finally, we integrated textual, visual, and temporal representations using adaptive fusion. Temporal embeddings capture posting-time regularities, allowing the model to learn behavioral rhythms and emotional drifts over time. Together, these strategies mitigate linguistic noise, balance modality contributions, and provide a holistic understanding of how depressive expressions manifest across languages and media.
Task description
The main objective of this work is to identify signs of depression from social media posts by jointly analysing textual, visual, and temporal modalities. The task is formulated as a binary classification problem, where each post is labelled as either ``depression'' or ``normal''.
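Let the dataset be denoted as $\mathcal{D}$. Each post in the dataset consists of four components:
$T_i$ – textual content written by the user,
$I_i$ – the image attached to the post (if any),
$t_i$ – the timestamp indicating when the post was created, and
$y_i$ – the ground-truth label, where $y_i = 1$ represents a depressive post and $y_i = 0$ denotes a normal one.
Thus, the dataset can be represented as:
\begin{equation}
\mathcal{D} = \{(T_i, I_i, t_i, y_i)\}_{i=1}^{N}
\label{eq:dataset}
\end{equation}
The goal is to learn a predictive model $f_\theta$ that takes multimodal inputs and produces a probability $\hat{y}_i \in [0, 1]$ indicating the likelihood that a given post reflects depressive traits:
\begin{equation}
\hat{y}_i = f_\theta(T_i, I_i, t_i)
\label{eq:model}
\end{equation}
Each modality is processed through a dedicated encoder:
The text input is encoded using a multilingual transformer, producing a 768-dimensional representation $\mathbf{h}^{T}_i \in \mathbb{R}^{768}$.
The image, when available, is passed through a convolutional neural network (CNN) encoder to yield a 512-dimensional representation $\mathbf{h}^{I}_i \in \mathbb{R}^{512}$.
The temporal signal $t_i$ is embedded through a learnable temporal embedding layer, producing a 32-dimensional representation $\mathbf{h}^{t}_i \in \mathbb{R}^{32}$.
The encoded representations are concatenated to form a unified multimodal feature vector:
\begin{equation}
\mathbf{z}_i = [\mathbf{h}^{T}_i ; \mathbf{h}^{I}_i ; \mathbf{h}^{t}_i]
\label{eq:concat}
\end{equation}
The resulting feature vector $\mathbf{z}_i$ has a combined dimension of $768 + 512 + 32 = 1312$ and encapsulates information from linguistic, visual, and temporal aspects. It is then passed through one or more fully connected (FC) layers, followed by a sigmoid activation $\sigma$ to generate the predicted probability:
\begin{equation}
\hat{y}_i = \sigma\left(\mathrm{FC}(\mathbf{z}_i)\right)
\label{eq:sigmoid}
\end{equation}
The model is optimised using the binary cross-entropy loss function, defined as:
\begin{equation}
\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
\label{eq:loss}
\end{equation}
where $N$ is the total number of posts in the training set.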
This framework enables the model to jointly learn from multimodal signals such as linguistic patterns, emotional content in images, and temporal behaviour, providing a more comprehensive understanding of depressive expression. The conceptual flow of encoding, fusion, and classification is illustrated in Algorithm \ref{alg:framework}.
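\begin{algorithm}
\caption{Multimodal Depression Detection Framework}\label{alg:framework}
\begin{algorithmic}[1]
\Require Dataset $\mathcal{D} = \{(T_i, I_i, t_i, y_i)\}_{i=1}^{N}$
\Ensure Predicted labels $\hat{y}_i$
\State Initialize pretrained encoders:
\State \hspace{1em} $E_T \gets$ XLM-RoBERTa
\State \hspace{1em} $E_I \gets$ ResNet18
\State \hspace{1em} $E_t \gets$ Temporal embedding layer
\State Initialize learnable projection matrices: $W_T, W_I, W_t$
\State Initialize fusion weights: $\alpha_T, \alpha_I, \alpha_t$
\For{each post $(T_i, I_i, t_i, y_i)$ in $\mathcal{D}$}
\State \# Encode text
\State $\mathbf{h}^{T}_i \gets E_T(T_i)$
\State $\tilde{\mathbf{h}}^{T}_i \gets W_T \mathbf{h}^{T}_i$
\State \# Encode image if available
\If{$I_i$ is available}
\State $\mathbf{h}^{I}_i \gets E_I(I_i)$
\State $\tilde{\mathbf{h}}^{I}_i \gets W_I \mathbf{h}^{I}_i$
\Else
\State $\mathbf{h}^{I}_i \gets \mathbf{0}$
\State $\tilde{\mathbf{h}}^{I}_i \gets \mathbf{0}$
\EndIf
\State \# Encode temporal features
\State $\mathbf{h}^{t}_i \gets E_t(t_i)$
\State $\tilde{\mathbf{h}}^{t}_i \gets W_t \mathbf{h}^{t}_i$
\State \# Adaptive early fusion
\State $\mathbf{z}_i \gets [\alpha_T \tilde{\mathbf{h}}^{T}_i ; \alpha_I \tilde{\mathbf{h}}^{I}_i ; \alpha_t \tilde{\mathbf{h}}^{t}_i]$
\State \# Classification
\State $\hat{y}_i \gets \sigma\left(\mathrm{FC}(\mathbf{z}_i)\right)$
\EndFor
\State \# Model optimization
\State Update parameters using backpropagation:
\State $\theta \gets \theta - \eta \nabla_{\theta} \mathcal{L}_{\mathrm{BCE}}$
\State Repeat until convergence criteria are satisfied.
\State \Return Final classifier $f_\theta$
\end{algorithmic}
\end{algorithm}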
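The fusion and classification steps can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the trained model: random vectors stand in for the XLM-RoBERTa and temporal-embedding outputs, a single linear layer stands in for the fully connected head, plain concatenation is used in place of any learned fusion weights, and a missing image is represented by a zero vector. All function names are hypothetical.

```python
import math
import random

TEXT_DIM, IMAGE_DIM, TIME_DIM = 768, 512, 32  # sizes stated in the task description


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fuse(h_text, h_image, h_time):
    """Early fusion by concatenation; a missing image becomes a zero vector."""
    if h_image is None:
        h_image = [0.0] * IMAGE_DIM
    return list(h_text) + list(h_image) + list(h_time)


def predict(z, weights, bias):
    """Single linear layer plus sigmoid (stand-in for the FC head)."""
    return sigmoid(sum(w * v for w, v in zip(weights, z)) + bias)


def bce(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over a batch of labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


rng = random.Random(0)
h_text = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]  # placeholder text encoding
h_time = [rng.gauss(0, 1) for _ in range(TIME_DIM)]  # placeholder temporal embedding
z = fuse(h_text, None, h_time)                       # a text-only post
weights = [rng.gauss(0, 0.01) for _ in range(len(z))]
p = predict(z, weights, bias=0.0)
print("fused dimension:", len(z))
print("loss for a positive label:", bce([1], [p]))
```

Zero-filling the image slot keeps the fused vector at a fixed length of 1312 for text-only posts, so the same classifier head can be applied to every post regardless of whether an image is attached.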
Data collection
Our study gathered publicly available posts from the X platform. The aim is to collect real online content that might show signs of depression, written in different languages and styles. These posts reflect how people usually share their feelings online, often switching between languages in a single post. Each post is assigned one of two labels: Depression or Normal. The dataset covers four major linguistic forms: English (EN), Hindi written in the Devanagari script (HD), Hindi written using Roman letters (HR), and Hinglish (HN), which combines Hindi and English words within the same post. This diversity makes the data linguistically rich and realistic, but also adds complexity for automated analysis bib32.
To identify relevant posts, we followed the strategy proposed by bib14 but broadened the focus beyond explicit self-reports of depression. The collection process used not only explicit statements like ``I am depressed'' but also posts that showed quieter forms of distress, such as feelings of emptiness, loss of hope, or emotional fatigue, an approach consistent with earlier findings on implicit markers of mental health issues in social media text bib15, bib20.
To ensure systematic collection, a multilingual lexicon is developed, guided by three major clinical frameworks: the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) \cite{bib4}, the Patient Health Questionnaire-9 (PHQ-9) bib26, and the International Classification of Diseases (ICD-10) bib41. These frameworks are not used here for medical diagnosis; they help identify recurring language patterns linked to depressive feelings. Similar to earlier work bib43, bib10, we extracted common words and short phrases that reflect sadness, tiredness, low motivation, and self-blaming or suicidal thoughts. Each keyword is written in EN, HD, and HR, and frequent misspellings and social media abbreviations are included to reflect informal online communication and broaden coverage.
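A lexicon of this kind can be represented as a mapping from a concept to its surface forms in each script, compiled into one pattern per concept. The sketch below is illustrative only: the two concepts, their variants, and the helper names `build_pattern` and `match_concepts` are assumptions made for this example and are not the study's actual keyword list.

```python
import re

# Hypothetical mini-lexicon: each concept maps to English, Devanagari,
# and Romanised Hindi forms, plus informal spelling variants.
LEXICON = {
    "sad": ["sad", "उदास", "udas", "udaas"],
    "hopeless": ["hopeless", "no hope", "निराश", "nirash"],
}


def build_pattern(variants):
    """Compile one case-insensitive pattern matching any whole-word variant."""
    ordered = sorted(variants, key=len, reverse=True)  # prefer longer variants
    alternatives = "|".join(re.escape(v) for v in ordered)
    # Lookarounds act as word boundaries so "sad" does not match inside "sadly".
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


PATTERNS = {concept: build_pattern(forms) for concept, forms in LEXICON.items()}


def match_concepts(post: str):
    """Return the sorted list of lexicon concepts found in a post."""
    return sorted(c for c, pattern in PATTERNS.items() if pattern.search(post))


print(match_concepts("Aaj bahut UDAAS hoon"))
print(match_concepts("There is no hope left, feeling sad"))
```

Grouping variants under a concept makes it straightforward to add a newly observed misspelling without changing the retrieval logic.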
The collection followed a staged process, drawing from methods used in previous multimodal studies bib11. The initial step is to create a multilingual lexicon that captures variations in script and spelling. This list is then used to search for posts, taking into account informal or mixed-language writing that is common on social media. The entire workflow is summarised in Figure 2, which illustrates the stages of multilingual lexicon creation, data retrieval, cleaning, and multimodal alignment.
Fig. 2
Overview of the multilingual and multimodal data collection workflow. The process includes lexicon creation, multilingual retrieval, cleaning, preprocessing, and alignment of text, image, and temporal data.
After that, the data is cleaned by removing retweets, ads, and bot-generated content, so that the final set consists of real user posts expressing genuine feelings bib31. For users with multiple depression-related posts, historical tweets for up to 60 days are also collected to provide temporal continuity and behavioural context, following the practice used in earlier work bib14.
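As a rough illustration of the retweet filter and the 60-day history window, consider the sketch below. It rests on assumptions that the text does not specify: retweets are recognised by the conventional "RT @" prefix, the window is measured backwards from a chosen anchor post, and each post is a dictionary with hypothetical `text` and `created_at` fields. Detection of ads and bot accounts is not covered.

```python
from datetime import datetime, timedelta


def is_retweet(text: str) -> bool:
    # Assumption: a retweet starts with the conventional "RT @" prefix.
    return text.lstrip().startswith("RT @")


def history_window(posts, anchor_time, days=60):
    """Keep a user's original posts from the `days` leading up to anchor_time."""
    start = anchor_time - timedelta(days=days)
    return [
        p for p in posts
        if start <= p["created_at"] <= anchor_time and not is_retweet(p["text"])
    ]


posts = [
    {"text": "RT @someone: hello", "created_at": datetime(2024, 3, 1)},
    {"text": "feeling low again", "created_at": datetime(2024, 2, 20)},
    {"text": "old post", "created_at": datetime(2023, 11, 1)},
]
kept = history_window(posts, anchor_time=datetime(2024, 3, 10))
print([p["text"] for p in kept])
```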
After collection, both textual and visual data are carefully preprocessed. The text is standardised through lowercasing and cleaned by removing URLs, emojis, hashtags, mentions, repeated characters, and other noise, consistent with previous preprocessing pipelines bib43, bib10. To distinguish between HD and HR, a fastText-based language identification model bib24 is applied, supported by script-specific heuristics. All images are downloaded, passed through a perceptual hashing filter to remove duplicates, resized to 224 × 224 pixels, and normalised to fit CNN-based models bib22. Further metadata, such as posting time, is also retained to support later analysis of users' temporal activity patterns.
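The text-cleaning steps listed above can be approximated with a handful of regular expressions. The following is a minimal sketch, not the pipeline used in this study: the function name `clean_text` is hypothetical, hashtags are assumed to be removed entirely (tag text included), runs of three or more identical characters are assumed to be shortened to two, and the emoji ranges are common ones rather than an exhaustive list.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\S+")  # assumption: drop the whole tag, not just '#'
# Common emoji and pictograph ranges; not exhaustive.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
REPEAT = re.compile(r"(.)\1{2,}")  # three or more of the same character
SPACES = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase a post and strip URLs, mentions, hashtags, and emojis."""
    text = text.lower()
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    text = HASHTAG.sub(" ", text)
    text = EMOJI.sub(" ", text)
    text = REPEAT.sub(r"\1\1", text)  # e.g. "sooooo" becomes "soo"
    return SPACES.sub(" ", text).strip()


print(clean_text("Sooooo tired 😞 #sad @friend https://t.co/abc"))
print(clean_text("Mood kharab hai today!!!"))
```

Lowercasing leaves Devanagari text unchanged because the script has no letter case, so the same function can be applied to all four linguistic forms.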
Ethical safeguards and anonymisation
As the study involves sensitive mental health content, special care is taken to ensure the ethical handling of data, user privacy, and protection against any risk of re-identification. The study follows established best practices for privacy-preserving social media research bib13 and uses a custom anonymisation pipeline built in Python with the \texttt{pandas} library. Before starting anonymisation, several quasi-identifiers (QIDs) are identified: details that do not directly reveal a person's identity but could still be used to trace them when combined with other information. These include screen names, display names, and exact posting timestamps. Although our dataset lacks direct identifiers such as email addresses or phone numbers, earlier research shows that even indirect data points can facilitate re-identification, especially in sensitive areas like mental health bib31.
To address this, a three-step anonymisation process is applied. In the first step, all identifiable details are removed. Usernames, personal names, and tagged handles are deleted, and all links are cleared to ensure that posts cannot be traced back to their original profiles. Next, the posting time is generalised to the nearest calendar day. This reduces timestamp precision but maintains the natural posting order, helping to preserve temporal patterns useful for later analysis bib10.
This study also applies k-anonymity with $k = 5$, ensuring that every record belongs to a group of at least five records sharing the same basic features, such as language and posting date. This step follows privacy protection guidelines widely used in social media research bib38. These three layers (personal information masking, temporal generalisation, and controlled anonymisation) create an ethically secure and research-ready dataset. It retains enough linguistic and temporal detail for mental health modelling without compromising privacy. In future work, automated language detection tools such as fastText or langdetect may be used to further refine linguistic consistency across anonymised data entries.
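The three anonymisation layers can be sketched as follows. This is a hedged illustration rather than the study's pipeline: it uses only the Python standard library instead of \texttt{pandas} so that it is self-contained, the record fields (`text`, `language`, `created_at`) and function names are hypothetical, timestamps are truncated to the calendar day, and the default group size of five is chosen only to echo the description above.

```python
import re
from collections import Counter
from datetime import datetime

HANDLE = re.compile(r"@\w+")
LINK = re.compile(r"https?://\S+")


def anonymise(record):
    """Mask handles and links, drop name fields, and coarsen the timestamp."""
    text = LINK.sub("", HANDLE.sub("", record["text"]))
    return {
        "text": " ".join(text.split()),
        "language": record["language"],
        # Only the calendar day is kept; screen and display names are dropped.
        "date": record["created_at"].date().isoformat(),
    }


def small_groups(records, quasi_ids=("language", "date"), k=5):
    """Return quasi-identifier groups with fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return {group: n for group, n in counts.items() if n < k}


raw = {
    "text": "@ravi check https://t.co/x feeling empty",
    "language": "EN",
    "screen_name": "example_user",
    "created_at": datetime(2024, 3, 1, 23, 45),
}
clean = anonymise(raw)
records = [clean] * 5 + [{"text": "...", "language": "HD", "date": "2024-03-01"}]
print(clean)
print(small_groups(records))
```

Records that fall into a small group would then need further generalisation (for example, a coarser date) or suppression before release.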
Data annotation
Our study combines automated filtering with human review to label the data. Initially, candidate posts are selected using a list of words commonly associated with sadness or depression. To build this list, earlier studies of how people express mental health issues on social media are reviewed, and related terms from the languages covered (EN, HD, HR, HN) are incorporated bib15, bib35, bib11, bib10. Each selected post is then manually reviewed to ensure label accuracy.
Similar to lexicon-driven methods used by bib43 and bib14, our term list is grounded in established clinical frameworks such as the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) bib4, the Patient Health Questionnaire-9 (PHQ-9) bib26, and the International Classification of Diseases (ICD-10) bib41. These sources outline medically recognised symptoms like ongoing sadness, lack of interest, tiredness, hopelessness, and suicidal thoughts, which ensure that the labelling focuses on genuine clinical signs rather than regular mood changes. Afterwards, each post is read manually and marked as either Depression or Normal. Posts expressing clear or implied signs of distress, such as loneliness, emptiness, worthlessness, or self-blame, are marked as Depression. Posts representing neutral discussions, humour, or casual conversation are labelled Normal. In cases where meaning is uncertain, the broader context is considered, including the repeated use of negative language, tone of expression, or emotional intensity. This approach follows earlier manual labelling strategies applied to Twitter and Reddit data bib23.
Although labelling is primarily done at the post level, user-level patterns are also considered. If a user's posts show a clear emotional pattern over time, that background is also considered while deciding the final label. This idea aligns with the eRisk and CLPsych shared-task project methodology bib29, bib36. When available, attached images are also reviewed, as previous multimodal analyses demonstrated that visual cues often reinforce or clarify emotional states expressed in text bib31, bib21. After annotation, roughly 41% of posts are categorised as Depression and 59% as Normal. This pattern matches what earlier social media studies show: posts about depression are usually fewer than normal ones bib10. By combining automatic keyword search with careful human review, the final set provides a more authentic and culturally diverse representation of how people discuss depression in various languages, encompassing both text and images.
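The keyword-based pre-selection that precedes manual review could look roughly like the sketch below. The mini-lexicon is hypothetical (the study's actual term list is not reproduced here), and whitespace tokenisation is an assumption chosen so that Devanagari words stay intact.

```python
import string

# Hypothetical mini-lexicon mixing English, Romanised Hindi, and Devanagari terms.
LEXICON = {"hopeless", "worthless", "akela", "udaas", "अकेला"}

# ASCII punctuation plus the Devanagari danda, stripped from token edges.
EDGE_CHARS = string.punctuation + "।"

def tokenize(text):
    """Lower-case, split on whitespace, and strip punctuation from token edges."""
    tokens = (tok.strip(EDGE_CHARS) for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def is_candidate(text, lexicon=LEXICON):
    """Flag a post for manual review if it contains any lexicon term."""
    return any(tok in lexicon for tok in tokenize(text))
```

A flagged post is only a candidate; the Depression or Normal label still comes from the human reviewer.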
Feature extraction
Accurately identifying depressive signals requires drawing information from multiple input forms such as text, images, and posting behaviour. To achieve this, a multimodal feature extraction pipeline is developed that converts each modality into compact numerical representations suitable for deep learning models.
Textual features
For the text-based stream, the pretrained XLM-RoBERTa model bib12 is utilised, a transformer architecture specifically designed to handle multiple languages and informal code-mixed expressions commonly found in social media posts. Each tweet is tokenised and passed through the model to generate dense contextual representations. From the output, the embedding linked to the [CLS] token is extracted to summarise the entire message, producing a 768-dimensional feature vector:
\[ \mathbf{h}_{\text{text}} \in \mathbb{R}^{768} \]
This vector embeds the text's overall meaning and emotional undertone, helping the system interpret variations in linguistic style and sentiment across languages.
Image features
The visual component of each post is processed using ResNet-18 bib22, a deep residual convolutional architecture capable of recognising subtle emotional or contextual elements within images. Before feature extraction, every image is resized to a fixed input resolution and normalised to maintain consistency across the dataset. The processed images are then passed through the pretrained ResNet-18 network, and the output from its final average pooling layer is taken as the image representation, producing a 512-dimensional embedding:
\[ \mathbf{h}_{\text{img}} \in \mathbb{R}^{512} \]
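When a post contains multiple images, say $K$ of them, the resulting embeddings are averaged and projected as:
\[ \bar{\mathbf{h}}_{\text{img}} = \frac{1}{K} \sum_{k=1}^{K} \mathbf{h}_{\text{img}}^{(k)}, \qquad \tilde{\mathbf{h}}_{\text{img}} = \phi\left(\mathbf{W}_{v}\, \bar{\mathbf{h}}_{\text{img}} + \mathbf{b}_{v}\right) \]
where $\phi$ is a non-linear activation and $\mathbf{W}_{v}$, $\mathbf{b}_{v}$ are the projection parameters.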
This process helps retain emotional and contextual patterns visible in the visual data, such as colour tone, composition, and mood.
Temporal features
Temporal aspects of user activity are extracted from each post's timestamp and converted into separate components, including the day of the week, month, and general time segment of posting. These discrete variables are then embedded within a 32-dimensional latent space to form the temporal representation:
\[ \mathbf{h}_{\text{time}} \in \mathbb{R}^{32} \]
This representation enables the model to learn behavioural rhythms such as late-night posting or weekday variations that may correspond to depressive tendencies.
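As a rough illustration of the timestamp decomposition, the sketch below splits an ISO-format timestamp into the three categorical components. The time-block boundaries and labels are hypothetical, since the exact segments are not specified here; in a full model each component would index a trainable embedding table.

```python
from datetime import datetime

# Hypothetical time blocks as (starting hour, label), in ascending order.
TIME_BLOCKS = [(0, "late_night"), (6, "morning"), (12, "afternoon"), (18, "evening")]

def time_block(hour):
    """Return the label of the last block whose starting hour is <= hour."""
    label = TIME_BLOCKS[0][1]
    for start, name in TIME_BLOCKS:
        if hour >= start:
            label = name
    return label

def temporal_components(iso_timestamp):
    """Split a timestamp into day of week (Monday = 0), month, and time block."""
    ts = datetime.fromisoformat(iso_timestamp)
    return {
        "day_of_week": ts.weekday(),
        "month": ts.month,
        "time_block": time_block(ts.hour),
    }
```

For example, a post made at 02:15 on a Monday in July maps to day 0, month 7, and the `late_night` block under these assumed boundaries.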
By combining these textual, visual, and temporal embeddings, the framework ensures that semantic, emotional, and behavioural signals are represented in a consistent and learnable form for the later fusion and classification stages.
Model architecture
The proposed model combines three types of information (text, images, and posting time) to gain a more comprehensive understanding of signs of depression. Each component contributes a unique perspective: the text reflects what users express, the images reveal how emotions are visually portrayed, and the temporal dimension shows when these expressions occur. Figure 3 illustrates the overall design of the proposed multimodal feature extraction and fusion framework.
Fig. 3
Proposed multimodal architecture showing the integration of text, image, and temporal feature streams for depression detection.
Text encoder
The textual component of each post is encoded using XLM-RoBERTa bib12, a multilingual transformer model well-suited for code-switched and Hinglish (HN) content commonly found on social media. An input post consisting of $n$ tokens is represented as:
\[ X = \{x_1, x_2, \ldots, x_n\} \]
Each token is transformed into embeddings as:
\[ \mathbf{e}_i = \mathrm{Embed}(x_i), \quad i = 1, \ldots, n \]
where $\mathbf{e}_i \in \mathbb{R}^{768}$. These embeddings pass through twelve transformer encoder layers containing multi-head self-attention, residual connections, and feed-forward networks. The hidden state corresponding to the [CLS] token is extracted as a 768-dimensional vector representing the semantic meaning of the text:
\[ \mathbf{h}_{\text{text}} = \mathbf{H}_{[\mathrm{CLS}]} \in \mathbb{R}^{768} \]
This step allows the encoder to capture both linguistic meaning and emotional undertones, even when users mix multiple languages online.
Image encoder
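Images accompanying posts are processed using ResNet-18 bib22, a residual convolutional neural network designed to learn expressive visual patterns while preventing gradient degradation. Each image $I$ is resized and normalised as:
\[ I' = \mathrm{Normalise}(\mathrm{Resize}(I)) \]
and passed through convolutional blocks with skip connections. The resulting output from the final average pooling layer serves as the image embedding:
\[ \mathbf{h}_{\text{img}} = \mathrm{AvgPool}(\mathrm{ResNet18}(I')) \in \mathbb{R}^{512} \]
If a post contains $K$ images, their embeddings are averaged and projected using a non-linear activation function $\phi$:
\[ \tilde{\mathbf{h}}_{\text{img}} = \phi\left(\mathbf{W}_{v} \cdot \frac{1}{K} \sum_{k=1}^{K} \mathbf{h}_{\text{img}}^{(k)} + \mathbf{b}_{v}\right) \]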
This process captures contextual and emotional cues visible in images, such as colour tone, composition, and brightness, which often correlate with the poster's emotional state.
Temporal encoder
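Each post's timestamp $t$ is decomposed into categorical components (day of week, month, and time block) and then mapped into a trainable embedding space. The temporal embedding is represented as:
\[ \mathbf{h}_{\text{time}} = \mathrm{Embed}(d_t, m_t, b_t) \in \mathbb{R}^{32} \]
where $d_t$, $m_t$, and $b_t$ denote the day of week, month, and time block of $t$.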
This representation helps the model identify behavioural rhythms such as irregular posting intervals or late-night activity, patterns commonly associated with depressive behaviour.
Fusion and classification
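To integrate all modalities, the textual, visual, and temporal embeddings are concatenated to form a unified multimodal representation:
\[ \mathbf{z} = [\,\mathbf{h}_{\text{text}} \,;\, \mathbf{h}_{\text{img}} \,;\, \mathbf{h}_{\text{time}}\,] \]
This fused representation is passed through fully connected layers with non-linear activations, denoted $f_{\theta}$, followed by a sigmoid output function $\sigma$ to predict the probability of a post expressing depressive symptoms:
\[ \hat{y} = \sigma\left(f_{\theta}(\mathbf{z})\right) \]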
This architecture enables the model to combine textual, visual, and temporal cues into a shared multimodal space, providing a deeper and more reliable understanding of depressive behaviour in multilingual social media posts.
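The fusion and classification step can be illustrated with a toy forward pass. This is a minimal sketch in plain Python, assuming a single hidden layer with ReLU activation; the layer sizes, the parameter names (`W1`, `b1`, `W2`, `b2`), and the tiny example vectors are hypothetical and far smaller than the real embeddings.

```python
import math

def concat(*vectors):
    """Early fusion: join the modality embeddings into one vector."""
    fused = []
    for vec in vectors:
        fused.extend(vec)
    return fused

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def predict(h_text, h_img, h_time, params):
    """Return the predicted probability that a post is depressive."""
    z = concat(h_text, h_img, h_time)
    hidden = relu(linear(z, params["W1"], params["b1"]))
    logit = linear(hidden, params["W2"], params["b2"])[0]
    return sigmoid(logit)

# Toy example: 2-d text, 2-d image, 1-d time embeddings and 2 hidden units.
toy_params = {
    "W1": [[0.5, -0.2, 0.1, 0.0, 0.3], [-0.4, 0.6, 0.0, 0.2, -0.1]],
    "b1": [0.0, 0.1],
    "W2": [[0.7, -0.5]],
    "b2": [0.0],
}
probability = predict([1.0, 0.5], [0.2, 0.8], [1.0], toy_params)
```

Because the concatenation happens before any classification layer, the hidden units can weight combinations of textual, visual, and temporal features jointly.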
Experiment
The experiments evaluate the effectiveness of the proposed multimodal framework in detecting signs of depression across various data modalities and feature combinations. The primary objective is to assess how textual, visual, and temporal cues individually and collectively contribute to model performance. By experimenting with different settings, we aim to understand the impact of each input and how combining them affects the model's accuracy.
Dataset statistics
The final dataset comprises 15,739 posts, among which 1,595 posts (approximately 10.1%) contain valid images. Most of the data still comes from text, which aligns with findings from earlier social media studies. People mainly express themselves through words, but sometimes add pictures to highlight or support their feelings. While relatively small, the multimodal portion of the dataset adds valuable diversity, enabling the evaluation of models that combine linguistic and visual signals, a dimension often underrepresented in earlier depression detection datasets.
Class distribution
The data shows that Normal posts are more common than Depression posts. This is typical of social media datasets, as people tend to discuss emotional distress less frequently. Depression posts also include images less often than Normal ones, which may indicate that users with low moods prefer text over pictures when sharing their feelings. Table 1 shows how the posts are divided between text-only and multimodal types, while Figure 4 displays the number of image-based posts in each class. These numbers help us understand how people naturally share content and point to the real-world challenges of building multimodal models for mental health research.
\begin{table}[htbp]
\centering
\caption{Class distribution by modality}
\label{tab:class_dist}
\begin{tabular}{lccc}
\hline
Label & Text-only & With Images & Total \\
\hline
Depression & 5,861 & 595 & 6,456 \\
Normal & 8,177 & 1,106 & 9,283 \\
\hline
Total & 14,038 & 1,701 & 15,739 \\
\hline
\end{tabular}
\end{table}
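As a quick arithmetic check on Table 1, the share of posts with images can be computed per class. The snippet below is only a worked example using the counts from the table; the dictionary layout and function name are illustrative.

```python
# Counts copied from Table 1 (class distribution by modality).
TABLE_1 = {
    "Depression": {"text_only": 5861, "with_images": 595},
    "Normal": {"text_only": 8177, "with_images": 1106},
}

def image_share(row):
    """Fraction of a class's posts that include at least one image."""
    total = row["text_only"] + row["with_images"]
    return row["with_images"] / total

shares = {label: round(image_share(row), 3) for label, row in TABLE_1.items()}
# shares == {"Depression": 0.092, "Normal": 0.119}
```

Under these counts, roughly 9.2% of Depression posts and 11.9% of Normal posts carry an image.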
Fig. 4
Image distribution across classes
Language distribution
The dataset reflects the linguistic diversity that is natural to social media users. The majority of posts are in English (EN, 8,832), followed by Romanised Hindi (HR, 6,289), Hindi written in the Devanagari script (HD, 475), and Hinglish (HN, 143), as shown in Table 2. Although EN is the primary language in the dataset, the strong presence of HR and HD highlights the importance of building models that can handle multilingual and mixed-language content. The pattern also changes across labels: EN posts appear fairly balanced between Depression and Normal, while HR has more Normal posts. Posts in HD, on the other hand, are more strongly associated with depressive content. This variation, as shown in the heatmap in Figure 8 and the cross-tabulated results in Table 2, suggests that language choice itself may convey subtle cultural or emotional cues that influence how people express distress online.
\begin{table}[htbp]
\centering
\caption{Language–Label distribution}
\label{tab:lang_dist}
\begin{tabular}{lcccc}
\hline
Language & Depression & Normal & Total & Images \\
\hline
EN & 3,623 & 5,209 & 8,832 & 1,280 \\
HR & 2,580 & 3,709 & 6,289 & 203 \\
HD & 195 & 280 & 475 & 65 \\
HN & 58 & 85 & 143 & 47 \\
\hline
Total & 6,456 & 9,283 & 15,739 & 1,595 \\
\hline
\end{tabular}
\end{table}
Fig. 5
Language distribution across classes
Aside from language distribution, other descriptive trends provide further insight. Posts labelled Depression tend to be longer than Normal posts in both word count and character count. Individuals experiencing emotional turmoil may write longer or more detailed posts as a means of self-reflection. Prior research on computational mental health identified similar trends. Figure 6 illustrates this difference with box plots comparing text lengths across classes.
Fig. 6
Box plots comparing text lengths across Depression and Normal posts
User activity patterns also reveal interesting dynamics. While most contributors post infrequently, a few highly active users are responsible for a significant portion of the content. Figure 7 illustrates that the five most active users have been posting regularly for several years. This uneven activity, evident in daily social media data, reminds us to handle user bias carefully so that the model does not learn too much from the language or emotions of just a few people.
Fig. 7
User posting frequency distribution showing the top five most active users
Finally, to visualise how language and emotion interact, a heatmap is generated to display the relationship between labels and languages. As shown in Figure 8, Depression posts show a slightly higher concentration in HD, while HR and EN posts are more balanced across both labels.
Fig. 8
Label–Language heatmap showing Depression and Normal distributions across languages
Implementation details
Our model is trained on 80% of the data, with the remaining 20% held out to evaluate its performance. The proportion of Depression and Normal posts is kept roughly the same in both sets. The model is trained with the Adam optimiser and binary cross-entropy loss, using a learning rate of 0.0001 and a batch size of 8. Training runs for approximately 30 epochs and stops early when no improvement is seen after a few validation checks, to avoid overfitting. The learning rate and batch size are selected after a few trial runs so that training remains steady and does not fluctuate excessively. Every configuration is run three times with different random seeds to ensure the results are not merely a matter of luck. Accuracy, Precision, Recall, F1-score, and AUC are used to evaluate the model's performance. Standard deviations and confidence intervals are also calculated by resampling the data to quantify the uncertainty of the results, and score differences are tested to determine whether they reflect real variation or random noise. Overall, this setup gives a precise and repeatable way to test performance across all data types.
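The reporting protocol (mean and standard deviation over three seeds, plus resampling-based confidence intervals) can be sketched as follows. This is an illustration only: the use of the sample standard deviation, the percentile bootstrap, and the values of `n_boot` and `alpha` are assumptions rather than details confirmed by the study.

```python
import random
import statistics

def mean_sd(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric on a test set."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample test examples with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

For instance, `mean_sd([0.80, 0.81, 0.82])` gives a mean of 0.81 with a standard deviation of about 0.01, which is the form in which the per-configuration scores are reported.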
Multimodal training
The encoders described in Section 3.3 are jointly optimised under an early-fusion paradigm. The concatenated representation $\mathbf{z}$ is used as input to the classifier, enabling joint learning of cross-modal interactions. The architecture is trained end-to-end using the binary cross-entropy (BCE) loss:
\[ \mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \]
where $y_i$ is the ground-truth label, $\hat{y}_i$ is the predicted probability, and $N$ is the number of posts in the batch. Optimisation is performed using the Adam optimiser, and early stopping is implemented based on the validation loss.
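For concreteness, the BCE loss can be computed by hand as in the sketch below. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero; deep learning frameworks apply a similar safeguard internally.

```python
import math

def bce_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels (0 or 1) and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

A confident correct prediction contributes a small loss, while a confident wrong one is penalised heavily: `bce_loss([1], [0.9])` is about 0.105, whereas `bce_loss([1], [0.1])` is about 2.303.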
Several techniques, such as dropout regularisation, handling of missing images, and batch normalisation, are adopted to improve robustness. For multimodal layers, dropout is applied as:
\[ \tilde{\mathbf{z}} = \mathbf{m} \odot \mathbf{z}, \qquad m_j \sim \mathrm{Bernoulli}(p) \]
where $p$ is the keep probability and $\odot$ denotes element-wise multiplication.
If an image is unavailable, the visual embedding is either replaced by a placeholder vector or stochastically masked during training.
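One common way to realise both options is shown below. This is a sketch under stated assumptions: the placeholder is taken to be an all-zero vector, and the masking probability `drop_prob` is a hypothetical hyperparameter; neither is confirmed as the study's exact choice.

```python
import random

def image_or_placeholder(h_img, dim):
    """Use the image embedding if present, otherwise an all-zero placeholder."""
    if h_img is None:
        return [0.0] * dim
    return list(h_img)

def modality_dropout(h_img, drop_prob, rng):
    """During training, zero out the whole image embedding with probability drop_prob."""
    if rng.random() < drop_prob:
        return [0.0] * len(h_img)
    return list(h_img)
```

Masking real image embeddings during training exposes the classifier to the same all-zero input it sees for text-only posts, which may reduce its reliance on the visual stream.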
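To stabilise training across heterogeneous modalities, batch normalisation is applied to fused representations:
\[ \hat{\mathbf{z}} = \gamma \, \frac{\mathbf{z} - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \epsilon}} + \beta \]
where $\mu_{B}$, $\sigma_{B}^{2}$ are batch statistics, $\gamma$, $\beta$ are learnable scale and shift parameters, and $\epsilon$ is a small constant for numerical stability.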
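The normalisation can be written out directly for a small batch. The sketch below is the standard training-time batch-normalisation computation in plain Python; the value of `eps` is an assumed default and running statistics for inference are omitted.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalise each feature over the batch, then apply scale and shift."""
    n = len(batch)
    dim = len(batch[0])
    # Per-feature batch mean and (biased) variance.
    mu = [sum(row[j] for row in batch) / n for j in range(dim)]
    var = [sum((row[j] - mu[j]) ** 2 for row in batch) / n for j in range(dim)]
    return [
        [gamma[j] * (row[j] - mu[j]) / math.sqrt(var[j] + eps) + beta[j]
         for j in range(dim)]
        for row in batch
    ]
```

After normalisation every feature has roughly zero mean and unit variance within the batch, so embeddings from encoders with very different output scales contribute on a comparable footing.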
The model is trained following the multimodal architecture illustrated in Figure 3. It uses an early-fusion strategy, where outputs from the text, image, and temporal encoders are concatenated and optimised together within a single network. Training is carried out end-to-end using backpropagation, allowing the gradients to update all encoders simultaneously. For a single post, the binary cross-entropy loss is:
\[ \mathcal{L} = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right] \]
where $y = 1$ denotes the Depression label and $\hat{y}$ is the predicted probability.
This configuration enables the model to learn how different modalities interact in a common feature space while collecting information specific to each modality. Early integration of these representations enhances the system's ability to detect subtle patterns between language, images, and posting behaviour. The early-fusion approach is selected because it offers a good balance between interpretability and computational efficiency, making it well-suited for real-world mental health analysis tasks.
Results and analysis
The results show that using multiple types of input, such as text, pictures, and time, helps the model identify signs of depression more effectively than using only one type. Table 3 lists how each version of the model performs, including those that use only one type of data and those that combine them. When trained under three different random seeds, the multimodal configurations achieved higher accuracy and showed more consistent performance across runs.
Table 3
Comparative performance of unimodal and multimodal variants with statistical significance testing.
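\begin{tabular}{lccccc}
\hline
\textbf{Model} & Accuracy ($\pm$ SD) & F1-score ($\pm$ SD) & AUC ($\pm$ SD) & Precision ($\pm$ SD) & Recall ($\pm$ SD) \\
\hline
Text only & $0.758 \pm 0.040$ & $0.681 \pm 0.092$ & $0.614 \pm 0.132$ & $0.625 \pm 0.028$ & $0.758 \pm 0.036$ \\
Image only & $0.805 \pm 0.005$ & $0.791 \pm 0.007$ & $0.728 \pm 0.013$ & $0.795 \pm 0.010$ & $0.805 \pm 0.008$ \\
Time only & $0.758 \pm 0.031$ & $0.705 \pm 0.037$ & $0.653 \pm 0.076$ & $0.746 \pm 0.022$ & $0.758 \pm 0.028$ \\
Text + Image & $0.824 \pm 0.006$ & $0.808 \pm 0.009$ & $0.728 \pm 0.009$ & $0.821 \pm 0.012$ & $0.824 \pm 0.011$ \\
Text + Time & $0.805 \pm 0.019$ & $0.769 \pm 0.032$ & $0.748 \pm 0.032$ & $0.821 \pm 0.021$ & $0.805 \pm 0.017$ \\
Full (Text + Image + Time) & $0.810 \pm 0.004$ & $0.794 \pm 0.005$ & $0.738 \pm 0.023$ & $0.802 \pm 0.014$ & $0.810 \pm 0.010$ \\
\hline
\end{tabular}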
Looking at the unimodal baselines, the image-only model performed the strongest, reaching an average accuracy of $0.805 \pm 0.005$ and an F1-score of $0.791 \pm 0.007$. The text-only model reached an accuracy of $0.758 \pm 0.040$ and an F1-score of $0.681 \pm 0.092$, while the temporal-only variant achieved an accuracy of $0.758 \pm 0.031$ and an F1-score of $0.705 \pm 0.037$. Once modalities were fused, performance generally improved. The text + image model reached the highest accuracy of $0.824 \pm 0.006$ and the best F1-score of $0.808 \pm 0.009$. The complete multimodal model, which combines text, image, and temporal signals, achieved an accuracy of $0.810 \pm 0.004$ and an F1-score of $0.794 \pm 0.005$. Across all fusion settings, AUC values ranged between $0.728$ and $0.748$, showing a clear ability to discriminate depressive from non-depressive content rather than performing at chance level.
These numbers lead to three concrete observations. First, images carry strong predictive cues in this dataset. The image-only baseline is approximately as robust as the full model on some metrics, and substantially better than text-only. Second, textual signals are informative but noisier in our collection; they improve overall performance when combined with images, but do not match image-only performance on their own. Third, temporal features add modest gains in stability and discrimination when combined with other modalities, even though they are not highly predictive alone.
Interpreting these observations requires care. The strong performance of the image-only model reflects a combination of meaningful visual cues and dataset-specific regularities. In many depressive posts in our corpus, images tend to exhibit darker tones, lower contrast, or solitary scenes, visual traits that a convolutional encoder can pick up reliably. This makes image features valuable on our dataset, but it also calls for caution, as part of the model's success may come from learning visual style differences that correlate with labels rather than deeper, content-level indicators of depression. For this reason, we treat the image-only result as an empirical outcome that is valid for our dataset while acknowledging the need for external validation to confirm generality.
The weaker performance of the text-only model can be traced to the linguistic diversity of the dataset. The corpus comprises posts written in English and Hindi, using both Roman and Devanagari scripts, often within the same sentence. Frequent code-switching, irregular grammar, and nonstandard spellings make tokenisation difficult and reduce the effectiveness of pretrained multilingual encoders. This variability likely explains the higher standard deviation observed for the text-only configuration. Even so, text remains indispensable for understanding emotional meaning. It directly expresses feelings such as hopelessness, guilt, or fatigue, signals that neither images nor timestamps can reliably convey. This complementary nature of modalities is reflected in the strong performance of the text + image model (F1 $= 0.808 \pm 0.009$), which resolves ambiguity more effectively and improves precision and recall compared to the unimodal models.
Temporal information plays a more subtle but helpful role. Posting behaviour has weak but not negligible predictive power, according to the time-only model (F1 $= 0.705 \pm 0.037$). These temporal embeddings enhance stability and marginally increase AUC scores when combined with visual or textual signals. Although the gain is modest, it suggests that behavioural rhythm adds context beyond content alone.
Two further observations emerge from the variance analysis. First, the fusion-based models exhibit remarkably low variability across random seeds; for example, the full model's F1-score varies by only $\pm 0.005$. This low variance indicates that the multimodal combination improves consistency as well as performance. Second, several confidence intervals overlap between fusion variants, suggesting that although additional modalities such as time provide stability, their incremental accuracy gains are small. In other words, multimodal training contributes more robustness than dramatic jumps in headline metrics.
The results also highlight certain limitations. The visual encoder could depend on surface-level patterns instead of real emotional indicators because of dataset-specific biases, such as the prevalence of darker or sadder images in depressed posts. Likewise, the lower and more variable text performance emphasises the need for more effective preprocessing and modelling strategies for code-switched and Romanised Hindi. Because the numeric differences among fusion variants are sometimes narrow, statistical significance testing remains crucial to confirm that observed trends are not due to random variation.
All things considered, these results support the use of a multimodal strategy. Despite the drawbacks of each modality alone, when combined, they produce a more stable and accurate classifier than any unimodal baseline. Three directions for future study and practical implementation are validating the model's predictions on external datasets, examining potential visual and annotation biases, and enhancing linguistic preprocessing to support transliteration and code-mixing more effectively. Together, these steps would help turn the promising empirical gains observed here into robust, generalisable advances for automated depression detection on social media.
4.4.2 Encoder comparisons
We tried different model setups for text, pictures, and time to determine which ones helped the system the most and how they interacted with each other. Table 4 presents the text encoder comparison. Each encoder family is selected to represent a different perspective of user expression (linguistic, visual, and behavioural), and all are fine-tuned under identical conditions for fair comparison.
For the textual modality, three models are examined: FastText, BERT-base, and XLM-RoBERTa. These represent the evolution from shallow word embeddings to multilingual transformer-based contextual models. With an F1-score of 0.707, FastText delivered acceptable performance but struggled to handle informal, code-mixed Hinglish text. BERT-base provided a substantial improvement (F1 = 0.799), showing the benefit of transformer-based contextual learning. The best results are achieved with XLM-RoBERTa (F1 = 0.835), which handles cross-lingual semantics and code-switching more effectively through multilingual pre-training across more than 100 languages. This value is higher than that of the text-only configuration in the multimodal experiment (F1 $\approx$ 0.68), which is expected since the encoder comparison involved longer fine-tuning on text alone, whereas the multimodal setup included early stopping and fusion layers. Hence, the standalone XLM-RoBERTa results reflect the upper-bound potential of the text encoder.
Table 4
Comparison of text encoders.
\begin{tabular}{lc}
\hline
\textbf{Model} & F1-score \\
\hline
FastText & 0.707 \\
BERT-base & 0.799 \\
XLM-RoBERTa & 0.835 \\
\hline
\end{tabular}
For the image modality, three convolutional architectures (ResNet-18, DenseNet-121, and EfficientNet-B0) are compared, corresponding to residual, densely connected, and compound-scaled designs respectively. All models are initialised with ImageNet weights and fine-tuned to identify affective and contextual visual cues. As shown in Table 5, despite sparse and noisy social media image data, ResNet-18 demonstrated strong generalisation, achieving the highest F1-score (0.7956). DenseNet-121 performed worst (F1 = 0.6830), most likely due to over-parameterisation relative to the dataset size, while the lighter EfficientNet-B0 performed competitively (F1 = 0.7575). The results indicate that smaller residual architectures, such as ResNet-18, are reliable for affective visual analysis when data is scarce.
Table 5
Comparison of image encoders.
\begin{tabular}{lc}
\hline
\textbf{Image Encoder} & F1-score \\
\hline
DenseNet-121 & 0.6830 \\
EfficientNet-B0 & 0.7575 \\
ResNet-18 & 0.7956 \\
\hline
\end{tabular}
Table 6 compares temporal representation strategies. We tested three conditions: no temporal features, bucketized time attributes (day, hour, month), and learned temporal embeddings as used in our final model. Even simple posting patterns contain useful behavioural signals associated with depressive patterns, as demonstrated by the improvement in F1-score from 0.709 (without time) to 0.810 (using bucketized features) when temporal information was included. The embedding-based method yielded a slightly lower result of 0.789, possibly due to the dataset's limited temporal coverage. These results support earlier studies that show how posting habits and timing can provide additional helpful clues. The better results from bucketization might be because it works well when the data is small, while learned embeddings do better with larger or continuous data.
Table 6
Comparison of temporal representation strategies.
\begin{tabular}{lc}
\hline
\textbf{Temporal Representation} & F1-score \\
\hline
None (no temporal info) & 0.709 \\
Buckets (day/time/month) & 0.810 \\
Learned embeddings & 0.789 \\
\hline
\end{tabular}
Combined, these experiments show that the strengths of each modality are complementary. Temporal encoders uncover behavioural patterns, image encoders extract affective context, and text encoders capture semantic and emotional subtleties. Their consistent advantages justify using XLM-RoBERTa and ResNet-18 as the main building blocks of the finished framework. The findings suggest that combining language, visuals, and timing provides a more comprehensive and stable approach to detecting depression in multilingual social media data.
4.4.3 Language-wise Performance
Our data includes several languages and mixed writing, so we looked at how the model performs for each group. Table 7 shows the average and variation from three tests. The results change depending on script style, data balance, and language form.
Table 7
Comparative performance of the model across different language modalities.
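\begin{tabular}{lccccc}
\hline
\textbf{Language} & Accuracy ($\pm$ SD) & Precision ($\pm$ SD) & Recall ($\pm$ SD) & F1-score ($\pm$ SD) & AUC ($\pm$ SD) \\
\hline
EN & $0.849 \pm 0.012$ & $0.820 \pm 0.015$ & $0.816 \pm 0.018$ & $0.818 \pm 0.014$ & $0.917 \pm 0.010$ \\
HR & $0.662 \pm 0.021$ & $0.642 \pm 0.025$ & $0.743 \pm 0.020$ & $0.684 \pm 0.024$ & $0.771 \pm 0.018$ \\
HD & $0.710 \pm 0.017$ & $0.692 \pm 0.022$ & $0.705 \pm 0.021$ & $0.698 \pm 0.019$ & $0.742 \pm 0.020$ \\
HN & $0.846 \pm 0.014$ & $0.846 \pm 0.013$ & $1.000 \pm 0.000$ & $0.917 \pm 0.011$ & $0.818 \pm 0.016$ \\
\hline
\end{tabular}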
English (EN) achieved the strongest and most stable performance, with an average F1-score of 0.818 $\pm$ 0.014 and an AUC of 0.917 $\pm$ 0.010. This is expected, as English benefits from richer representation in multilingual transformer pretraining and generally follows more regular grammar and word usage. The model finds it easier to capture emotional meaning when syntax and semantics are relatively uniform.
Romanised Hindi (HR) showed the lowest F1-score (F1 = 0.684 $\pm$ 0.024, AUC = 0.771 $\pm$ 0.018). The drop in performance appears to stem from the inconsistent spellings and informal writing styles people use online. Many users render Hindi sounds in English letters in inconsistent ways, which makes it hard for the system to break the text into proper tokens and fragments word meanings. Hindi written in Devanagari (HD) achieved a slightly higher F1-score of 0.698 ($\pm$ 0.019), although its AUC of 0.742 ($\pm$ 0.020) was lower. Even though it forms a smaller share of the dataset, its consistent script provides cleaner token boundaries and more stable embeddings.
Hinglish (HN) displayed an interesting pattern. It reached the highest recall (1.000 $\pm$ 0.000) and a strong overall F1-score (0.917 $\pm$ 0.011). This suggests that the model is highly sensitive to depressive cues expressed in mixed-language text, where emotional tone tends to be vivid and direct. Some of this boost, however, can be attributed to dataset imbalance. Hinglish posts contained a larger proportion of depression-labelled samples, which likely enhanced recall. Despite that, the stability across runs suggests that this effect reflects genuine sensitivity rather than overfitting.
In summary, these results highlight both the promise and the difficulty of multilingual mental health modelling. English and Hinglish performed better, likely because they have stronger pretraining support and clearer emotional vocabulary. Romanised and Devanagari Hindi faced issues with how words are split into tokens and with the smaller amount of related data. In future, adding more posts in these scripts, improving the tokenisation of Romanised text, and using data augmentation could help the model better handle different languages.
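The relation between the precision, recall, and F1 columns of Table 7 can be checked with the harmonic-mean formula. The snippet below is a worked example only; it assumes the second and third value columns of the table are precision and recall, and it recomputes F1 from the reported averages rather than from per-run scores.

```python
def f1_from_pr(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) per language, copied from Table 7.
TABLE_7 = {
    "EN": (0.820, 0.816, 0.818),
    "HR": (0.642, 0.743, 0.684),
    "HD": (0.692, 0.705, 0.698),
    "HN": (0.846, 1.000, 0.917),
}

recomputed = {lang: round(f1_from_pr(p, r), 3) for lang, (p, r, _) in TABLE_7.items()}
# recomputed == {"EN": 0.818, "HR": 0.689, "HD": 0.698, "HN": 0.917}
```

The recomputed values match the reported F1-scores for EN, HD, and HN. The small gap for HR (0.689 versus 0.684) would be consistent with the reported figure being an average of per-run F1-scores, which need not equal the F1 of the averaged precision and recall.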
4.4.4 Comparison with previous studies
To better understand our results and their relevance to the broader literature, Table 8 compares the model with earlier key works that employed similar social media platforms and approaches. bib43 were the first to use deep learning for detecting depression on Weibo, combining textual information with external knowledge sources. bib34 later tried a multimodal method on Twitter by combining text, images, and user details. Although this improved prediction, their work, like earlier studies, used only one language, either English or Chinese, and did not address multilingual or mixed-language data.
Table 8
Comparison with previous studies on social media-based depression detection.
\begin{tabular}{llllcl}
\hline
\textbf{Study} & Platform & Languages & Modalities & F1-score & Notes \\
\hline
Wu et al. (2018) bib43 & Weibo & Chinese & Text + External Knowledge & 0.76 & Early deep model with knowledge integration \\[3pt]
Safa et al. (2021) bib34 & X & English & Text + Image + Profile & 0.74 & Multimodal analysis with user metadata \\[3pt]
\textbf{Proposed Model (Ours)} & X & English, Hindi & Text + Image + Time & 0.79 & Multilingual, code-switched multimodal model \\
\hline
\end{tabular}
Our work advances this line of research by applying a multimodal framework to a linguistically diverse dataset that includes EN, HR, HD, and HN posts. Even with this added variation and noise, the model achieves an average F1-score of 0.79, which is comparable to or better than the results from earlier single-language studies. These results suggest that it is feasible to create reliable depression detection systems that function effectively across scripts and languages, reflecting how people communicate in multilingual online settings.
Case study
We examined several representative examples, including both accurate and inaccurate classifications, to gain a deeper understanding of how the proposed multimodal framework operates across languages and input types. Sample cases representing various emotional tones and modality combinations are displayed in Figure 9 in English (EN), Romanised Hindi (HR), Devanagari Hindi (HD), and Hinglish (HN). The main behavioural patterns observed along with their implications for the model's advantages and disadvantages are highlighted in the following discussion.
Fig. 9
Examples illustrating the behaviour of the proposed multimodal framework on depressive posts across EN, HR, HD, and HN. Image, text, and post time are shown alongside predicted depression probabilities. Multimodal fusion enhances prediction accuracy, particularly when text and pictures are emotionally aligned. Performance drops when modalities conflict, or when noisy HD or HR text reduces linguistic clarity, leading to heavier reliance on visual signals. These examples reflect strengths and limitations revealed during error analysis.
The model performed best when the text and image conveyed the same emotional intent. For example, posts whose captions and pictures both showed sadness or lack of energy were detected easily, particularly HR examples expressing emotional fatigue. The model's depression probability decreased significantly (from approximately … to less than …) when the accompanying text was removed, suggesting that textual content remains the primary factor influencing the decision. Temporal information added some context about user activity but rarely changed predictions.
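The text-removal check described above can be illustrated with a minimal sketch of a modality ablation. The `ablation_deltas` helper, the dictionary-based post format, and the `toy_predict` stand-in (including its weights) are hypothetical and chosen only for illustration; they are not the authors' model or their reported values.

```python
from typing import Callable, Dict, Optional, Sequence

# A post is represented as a dict of modality name -> content (None = masked).
Post = Dict[str, Optional[object]]


def ablation_deltas(
    predict: Callable[[Post], float],
    post: Post,
    modalities: Sequence[str] = ("text", "image", "time"),
) -> Dict[str, float]:
    """Drop in predicted depression probability when each modality is masked."""
    full = predict(post)
    deltas = {}
    for name in modalities:
        masked = dict(post)      # shallow copy so the original post is untouched
        masked[name] = None      # mask exactly one modality
        deltas[name] = full - predict(masked)
    return deltas


def toy_predict(post: Post) -> float:
    """Hypothetical stand-in for a trained classifier (weights are made up)."""
    score = 0.10
    if post.get("text") is not None:
        score += 0.60
    if post.get("image") is not None:
        score += 0.20
    if post.get("time") is not None:
        score += 0.05
    return score


example = {"text": "placeholder caption", "image": "placeholder.jpg", "time": 23}
deltas = ablation_deltas(toy_predict, example)
# The modality whose removal causes the largest drop dominates the prediction.
print(max(deltas, key=deltas.get))
```

With a real classifier in place of `toy_predict`, a large drop for the text modality would correspond to the behaviour reported here, where text remains the primary driver of the decision.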
Fig. 10
Examples illustrating the behaviour of the proposed multimodal framework on normal posts across EN, HR, HD, and HN. Image, text, and posting time are displayed along with predicted depression probabilities. The model correctly identifies most normal posts when textual and visual cues convey neutral or positive emotion. When the text is unclear, particularly in HR and HN due to spelling errors or language mixing, the model often relies more on the image and its confidence may decline. These examples demonstrate that, while language variation can still confuse the model, it performs best when the image and text match.
For normal posts, as depicted in Figure 10, the pattern is similar but reversed. When the image and text both projected positive emotions, such as family gatherings or encouraging quotes, the multimodal model's confidence increased (for instance, from … for text-only to … after fusion). This behaviour shows how multimodal learning captures emotional consistency more effectively than any single modality alone.
Some incorrect predictions, shown in Figure 11, reveal where the model still struggles. One common problem arose when the image and text carried opposite meanings. For instance, an EN post saying "My life is a mess" was paired with a calm-looking photo, leading the system to mark it as Normal rather than depressive. A similar mix-up happened in an HR example expressing loneliness ("All the seats lie empty"), where the picture seemed emotionally neutral and again confused the classifier. Conversely, an HN post that expressed frustration over a broken object was incorrectly labelled as depressive because the model interpreted its descriptive HN words as expressions of sadness. These mislabelled posts expose a more significant weakness in how the system integrates textual and visual data: when the photo and text do not match, one modality takes control and the result goes wrong. Language made a difference, too. EN and HR posts fared better because their vocabulary and style are easier for the model to handle.
By contrast, posts written in HD or HN tended to confuse the model; the constant switching between languages and unusual spellings made the text harder to process. As a result, the system leaned more on what it saw in the images. In many cases, the visual side ultimately determined the final label, highlighting how language inconsistencies can shift the model's balance between text and visuals.
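One simple way to surface the text-image conflicts discussed above during error analysis is to compare unimodal predictions for the same post. The sketch below assumes that separate text-only and image-only depression probabilities are available; the `modality_conflict` function and its `threshold` and `margin` defaults are hypothetical choices, not part of the proposed framework.

```python
def modality_conflict(
    p_text: float,
    p_image: float,
    threshold: float = 0.5,
    margin: float = 0.3,
) -> bool:
    """Flag a post whose text-only and image-only predictions disagree.

    A conflict is reported only when the two probabilities fall on opposite
    sides of the decision threshold AND differ by at least `margin`, so that
    borderline cases near the threshold are not flagged.
    """
    labels_differ = (p_text >= threshold) != (p_image >= threshold)
    return labels_differ and abs(p_text - p_image) >= margin


# Depressive text paired with a calm-looking image: flagged for review.
print(modality_conflict(p_text=0.9, p_image=0.2))    # True
# Both modalities close to the threshold: not flagged.
print(modality_conflict(p_text=0.55, p_image=0.45))  # False
```

Posts flagged in this way could be inspected separately, since they correspond to the cases in which one modality tends to override the other.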
Fig. 11
Examples of misclassified posts in the EN, HR, HD, and HN categories. Even with multimodal fusion, incorrect results occurred because the model was influenced by text and image conflicts, unclear images, or noisy transformed language. While visually sombre images and neutral text may cause high depression scores, posts with emotionally neutral images and depressing text are typically under-reported. These examples highlight common failure patterns identified in error analysis and suggest potential avenues for modality-aware fusion and improved code-mixed text handling in the future.
Two broader factors also shaped these results. First, mild dataset bias played a role: since Normal posts are more common, the model favoured the Normal label, especially when visual cues seemed neutral. As a result, posts with subtle or indirect depressive language were sometimes missed. Second, the uneven distribution of modalities caused an imbalance during fusion. A bright or aesthetically neutral image could downplay depressive text, while darker visuals could trigger false positives.
Overall, these insights show both the potential and the sensitivity of multimodal depression detection. A greater understanding of user expression is made possible by combining text, image, and temporal signals; however, this also creates dependencies that may amplify noise when the signals diverge. Stronger multilingual embeddings for processing transliterated and code-switched input, as well as adaptive fusion mechanisms that dynamically weigh each modality, will be necessary to address these problems. Such improvements could help future models reason more contextually, understanding what is said or shown and how these elements interact to convey emotion.
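To make the idea of adaptive fusion concrete, the following minimal sketch weighs each modality embedding by a softmax over per-modality gate scores. It is an illustration under stated assumptions rather than the fusion used in this study: in a trained model the gate scores would come from a learned layer, whereas here they are passed in directly, and the names `gated_fusion` and `gate_scores` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(
    embeddings: Dict[str, List[float]],
    gate_scores: Dict[str, float],
) -> Tuple[List[float], Dict[str, float]]:
    """Fuse same-sized modality embeddings as a softmax-weighted sum.

    Returns the fused vector and the weight assigned to each modality.
    """
    names = sorted(embeddings)
    weights = softmax([gate_scores[name] for name in names])
    dim = len(embeddings[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(embeddings[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


# Toy two-dimensional embeddings; a higher gate score gives a modality more say.
toy_embeddings = {"text": [1.0, 0.0], "image": [0.0, 1.0], "time": [0.0, 0.0]}
fused, weights = gated_fusion(
    toy_embeddings, {"text": 2.0, "image": 0.0, "time": -1.0}
)
print(weights)
```

Because the weights are computed per post, a gate of this kind could in principle reduce the influence of a noisy modality, for example transliterated text that the encoder handles poorly, instead of letting it dominate the fused representation.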
Conclusion and future work
This study presented a multilingual multimodal framework for recognising depressive expressions on social media by integrating textual, visual, and temporal signals. The methodology was tested on a linguistically diverse dataset that included posts in English (EN), Romanised Hindi (HR), Devanagari Hindi (HD), and Hinglish (HN). The results demonstrated that temporal embeddings increased the model's stability and sensitivity to behavioural context, while combining text and image features consistently increased classification accuracy. The language-wise analysis showed that multilingual modelling is both feasible and necessary to handle the wide range of expressions present on social media.
Although the results are promising, the study also has some drawbacks. Only text and still images are currently included in the dataset; dynamic signals, such as audio, video, and emojis, which frequently convey significant emotional context, are not included. In addition, the annotation process followed linguistic and contextual guidelines rather than clinical validation, which may leave room for subjective interpretation in the labelling process.
The temporal modelling is also comparatively limited: it concentrates on daily or monthly intervals rather than long-term behavioural trends. Computational limitations also prevented experiments with more complex transformer-based models or larger fusion networks. Furthermore, because the dataset originates from a single platform with a specific demographic focus, it remains unclear whether the findings can be generalised to other populations.
There are many ways this framework can grow in the future. Adding new forms of data, such as voice, short videos, or emojis, could help the system notice emotions more clearly. Examining user activity over extended periods might also reveal gradual mood shifts. Larger multimodal models that connect text and images could make emotional patterns easier to interpret. Testing the model on other social platforms and among diverse user groups will also be important. Doing so can help build systems that stay accurate while respecting cultural differences and that can be used responsibly to support early mental health screening.
Author Contribution
All authors contributed equally to this work
Data Availability
The dataset used in this study contains publicly sourced social media posts related to mental health. Due to the sensitive nature of the content and privacy considerations, the data cannot be openly shared. De-identified data may be made available upon reasonable request to the corresponding author for non-commercial research purposes, subject to ethical approval and platform terms of use.
References:
Al Hanai, T. and Ghassemi, M. M. and Glass, J. R. (2018) Detecting depression with audio/text sequence modelling of interviews. 10.21437/Interspeech.2018-2522, 1716--1720, Interspeech 2018
Alabdulkreem, E. (2020) Prediction of Depressed Arab Women Using Their Tweets. Journal of Decision Systems 30(2-3): 102--117 https://doi.org/10.1080/12460125.2020.1859745
Almars, A. M. (2022) Attention-based Bi-LSTM model for Arabic depression classification. Computers, Materials and Continua 71(2): 3091--3106 https://doi.org/10.32604/cmc.2022.022609
American Psychiatric Association (2013) Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Publishing, Arlington, 10.1176/appi.books.9780890425596, 5th edition
An, M. and Wang, J. and Li, S. and Zhou, G. (2020) Multimodal topic-enriched auxiliary learning for depression detection. 10.18653/v1/2020.coling-main.94, 1078--1089, Proceedings of COLING 2020
Benton, A. and Coppersmith, G. and Dredze, M. (2017) Ethical research protocols for social media health data. 10.18653/v1/W17-1612, 94--102, Proceedings of the First ACL Workshop on Ethics in Natural Language Processing
Bucur, A.-M. and Cosma, A. and Rosso, P. and Dinu, L. P. It is just a matter of time: Detecting depression with time-enriched multimodal transformers. Advances in Information Retrieval, 10.1007/978-3-031-28244-7_13, 2023, Springer, 202--216
Cavazos-Rehg, P. A. and Krauss, M. J. and Sowles, S. and Connolly, S. and Rosas, C. and Bharadwaj, M. and Bierut, L. J. (2016) A content analysis of depression-related Tweets. Computers in Human Behavior 54: 351--357 https://doi.org/10.1016/j.chb.2015.08.023
Chancellor, S. and Birnbaum, M. L. and Caine, E. D. and Silenzio, V. M. B. and De Choudhury, M. (2019) A taxonomy of ethical tensions in inferring mental health states from social media. 10.1145/3287560.3287587, 79--88, Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency
Cheng, Z. and Chen, Y. (2022) Multimodal time-aware attention networks for depression detection. Journal of Intelligent Information Systems 59: 407--426 https://doi.org/10.1007/s10844-022-00704-w
Chiu, C.-Y. and Tseng, Y. and Tsai, M. (2020) Multimodal depression detection on Instagram considering the time interval of posts. Journal of Intelligent Information Systems 54: 233--254 https://doi.org/10.1007/s10844-020-00599-5
Conneau, A. and Khandelwal, K. and Goyal, N. and Chaudhary, V. and Wenzek, G. and Guzmán, F. and Grave, E. and Ott, M. and Zettlemoyer, L. and Stoyanov, V. (2020) Unsupervised cross-lingual representation learning at scale. 10.18653/v1/2020.acl-main.747, 8440--8451, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Coppersmith, G. and Dredze, M. and Harman, C. (2014) Quantifying mental health signals in Twitter. 10.3115/v1/W14-3207, 51--60, Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality
Coppersmith, G. and Dredze, M. and Harman, C. and Hollingshead, K. (2015) From ADHD to SAD: Analyzing the Language of Mental Health on Twitter through Self-Reported Diagnoses. 10.3115/v1/W15-1201, 1--10, Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality
De Choudhury, M. and Gamon, M. and Counts, S. and Horvitz, E. (2021) Predicting Depression via Social Media. 10.1609/icwsm.v7i1.14432, 128--137, 1, 7, Proceedings of the International AAAI Conference on Web and Social Media
El-Ramly, M. and others (2021) CairoDep: Detecting depression in Arabic posts using BERT transformers. 10.1109/ICICIS52592.2021.9694178, 207--212, 2021 Tenth International Conference on Intelligent Computing and Information Systems (ICICIS)
Gautham, M. S. and Gururaj, G. and Varghese, M. and Benegal, V. and Rao, G. N. and Kokane, A. and Chavan, B. S. and Dalal, P. K. and Ram, D. and Pathak, K. and Singh, R. K. L. and Singh, L. K. and Sharma, P. and Saha, P. K. and Ramasubramanian, C. and Mehta, R. Y. and Shibukumar, T. M. and NMHS Collaborators Group (2020) The National Mental Health Survey of India (2016): Prevalence, socio-demographic correlates and treatment gap of mental morbidity. International Journal of Social Psychiatry 66(4): 361--372 https://doi.org/10.1177/0020764020907941
Gimeno-Gómez, D. and Bucur, A.-M. and Cosma, A. and Martínez-Hinarejos, C. D. and Rosso, P. Reading between the frames: Multi-modal depression detection in videos from non-verbal cues. Advances in Information Retrieval (ECIR 2024), 10.1007/978-3-031-56027-9_12, 2024, Springer, 14608, Lecture Notes in Computer Science
Gui, T. and Zhu, L. and Zhang, Q. and Peng, M. and Zhou, X. and Ding, K. and Chen, Z. (2019) Cooperative multimodal approach to depression detection in Twitter. 10.1609/aaai.v33i01.3301110, 110--117, 01, 33, Proceedings of the AAAI Conference on Artificial Intelligence
Guntuku, S. C. and Preotiuc-Pietro, D. and Eichstaedt, J. C. and Ungar, L. H. (2019) What Twitter profiles and posted images reveal about depression and anxiety. 10.1609/icwsm.v13i01.3225, 236--246, 1, 13, Proceedings of the International AAAI Conference on Web and Social Media
Guntuku, S. C. and Schneider, R. and Pelullo, A. and Young, J. and Wong, V. and Ungar, L. H. and Polsky, D. (2019) Studying expressions of loneliness in individuals using Twitter: An observational study. BMJ Open 9: e030355 https://doi.org/10.1136/bmjopen-2019-030355
He, K. and Zhang, X. and Ren, S. and Sun, J. (2016) Deep residual learning for image recognition. 10.1109/CVPR.2016.90, 770--778, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Husseini Orabi, A. and Buddhitha, P. and Husseini Orabi, M. and Inkpen, D. (2018) Deep Learning for Depression Detection of Twitter Users. Association for Computational Linguistics, New Orleans, LA, 10.18653/v1/W18-0609, 88--97, Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic
Joulin, A. and Grave, E. and Bojanowski, P. and Mikolov, T. (2017) Bag of Tricks for Efficient Text Classification. 427--431, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017)
Kabir, M. K. and Islam, M. and Kabir, A. N. B. and Haque, A. and Rhaman, M. K. (2022) Detection of depression severity using Bengali social media posts on mental health: A study using natural language processing techniques. JMIR Formative Research 6(9): e36118 https://doi.org/10.2196/36118
Kroenke, K. and Spitzer, R. L. and Williams, J. B. W. (2001) The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine 16(9): 606--613 https://doi.org/10.1046/j.1525-1497.2001.016009606.x
Lin, Y.-S. and Tai, L.-K. and Chen, A.-L. (2023) The detection of mental health conditions by incorporating external knowledge. Journal of Intelligent Information Systems 61: 497--518 https://doi.org/10.1007/s10844-022-00774-w
Liu, Y. (2024) Depression detection via a Chinese social media platform: A novel causal relation-aware deep learning approach. Journal of Supercomputing 80: 10327--10356 https://doi.org/10.1007/s11227-023-05830-y
Losada, D. E. and Crestani, F. and Parapar, J. Overview of eRisk: Early Risk Prediction on the Internet. In: Bellot, P. and others (Eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: CLEF 2018. 10.1007/978-3-319-98932-7_30, 2018, Cham, Springer, 1--30, 11018, Lecture Notes in Computer Science
Mumu, T. F. and Munni, I. J. and Das, A. K. (2021) Detecting depressed people from Bangla social media statuses using an LSTM and CNN approach. Journal of Engineering Advancements 2(1): 31--37 https://doi.org/10.38032/JEA.2021.01.006
Reece, A. G. and Danforth, C. M. (2017) Instagram photos reveal predictive markers of depression. EPJ Data Science 6(15) https://doi.org/10.1140/epjds/s13688-017-0110-z
Rijhwani, S. and Sequiera, R. and Choudhury, M. and Bali, K. and Maddila, C. S. (2017) Estimating code-switching on Twitter with a novel generalised word-level language detection technique. 10.18653/v1/P17-1180, 1971--1982, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
Sadeque, F. and Xu, D. and Bethard, S. (2017) UArizona at the CLEF eRisk 2017 Pilot Task: Linear and Recurrent Models for Early Depression Detection. 1866, CEUR Workshop Proceedings
Safa, R. and Bayat, P. and Moghtader, L. (2022) Automatic detection of depression symptoms in Twitter using multimodal analysis. Journal of Supercomputing 78: 4709--4744 https://doi.org/10.1007/s11227-021-04040-8
Shen, T. and Jia, J. and Shen, G. and Feng, F. and He, X. and Luan, H. and Tang, J. and Tiropanis, T. and Chua, T.-S. and Hall, W. (2018) Cross-domain depression detection via harvesting social media. AAAI Press, 10.24963/ijcai.2018/223, 1611--1617, Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI'18)
Shing, H.-C. and Nair, S. and Zirikly, A. and Friedenberg, M. and Daumé III, H. and Resnik, P. (2018) Expert, Crowdsourced, and Machine Assessment of Suicide Risk via Online Postings. Association for Computational Linguistics, New Orleans, LA, 10.18653/v1/W18-0603, 25--36, Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic
Stankevich, M. and Smirnov, I. and Kiselnikova, N. and Ushakova, A. Depression detection from social media profiles. Data Analytics and Management in Data Intensive Domains, 10.1007/978-3-030-51913-1_12, 2020, Cham, Springer, 1223, Communications in Computer and Information Science
Sweeney, L. (2002) k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5): 557--570 https://doi.org/10.1142/S0218488502001648
Tsugawa, S. and Kikuchi, Y. and Kishino, F. and Nakajima, K. and Itoh, Y. and Ohsaki, H. (2015) Recognizing depression from Twitter activity. Association for Computing Machinery, New York, NY, USA, 10.1145/2702123.2702280, 3187--3196, Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI '15)
Wang, X. and others (2019) Assessing depression risk in Chinese microblogs: A corpus and machine learning methods. 10.1109/ICHI.2019.8904506, 1--5, 2019 IEEE International Conference on Healthcare Informatics (ICHI)
World Health Organisation (1992) The ICD-10 Classification of Mental and Behavioural Disorders: Clinical Descriptions and Diagnostic Guidelines. WHO, Geneva, https://www.who.int/publications/i/item/9241544554
World Health Organisation. Depression. 2021, https://www.who.int/news-room/fact-sheets/detail/depression
Wu, M.-Y. and Shen, C.-Y. and Wang, E.-T. and others (2020) A deep architecture for depression detection using posting, behaviour, and living environment data. Journal of Intelligent Information Systems 54: 225--244 https://doi.org/10.1007/s10844-018-0533-4
Xiao, Y. and Yang, Y. and Xu, H. and Li, S. (2024) Empirical insights into the interaction effects of groups at high risk of depression on online social platforms with NLP-based sentiment analysis. Data and Information Management 8(4): 100080 https://doi.org/10.1016/j.dim.2024.100080