Introduction
Emotion analysis in text has emerged as a fundamental component of modern natural language processing, with applications spanning social media monitoring, customer feedback analysis, mental health assessment, and human-computer interaction systems. The ability to automatically detect and classify emotions from textual content has become increasingly important as digital communication continues to proliferate across various platforms and domains.
Despite significant advancements in deep learning and transformer-based architectures, accurately identifying and categorizing complex emotional expressions remains a formidable challenge. This difficulty stems from several factors: the contextual nature of emotional expression, the subtlety of linguistic cues that convey emotion, the frequent co-occurrence of multiple emotions within a single text, and the inherent subjectivity in emotional interpretation. Traditional approaches often struggle with these complexities, particularly in multi-label scenarios where multiple emotions may be present simultaneously.
Recent developments in Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding nuanced textual content, while BERT and its variants have established themselves as powerful tools for fine-grained text classification tasks. However, each approach has distinct strengths and limitations: BERT variants excel at capturing local contextual relationships and can be fine-tuned for specific tasks, while LLMs demonstrate superior performance in understanding global context and handling diverse linguistic patterns through their extensive pre-training.
This work bridges these gaps with a rule-based voting framework that adaptively integrates BERT variants and LLMs, achieving state-of-the-art multi-label emotion detection. It addresses three fundamental research questions:
RQ1: How can BERT variants be optimally utilized for multi-label emotion detection? Compared with traditional word-embedding methods, BERT is widely regarded as one of the strongest approaches to date. Its bidirectional architecture takes into account both the preceding and the following context of a word. It can therefore capture emotional nuances in text and detect the sentiment a text implies more precisely.
RQ2: Why does a rule-based voting aggregator outperform traditional ensemble approaches? We hypothesize that standard voting classifiers perform worse than voting guided by explicit rules, and we test this hypothesis experimentally. Moreover, specifying reasonable rules helps handle classification efficiently in a specific domain, compared with default voting rules. Traditional ensemble approaches run their classifiers and vote for the highest output in a similar way but lack weighting, which leaves them vulnerable to data imbalance. This study conducts experiments to compare the rule-based voting classification method with standard voting and with traditional ensemble classification methods.
RQ3: How can Large Language Models enhance emotion detection capabilities? In recent years, Large Language Models (LLMs) have demonstrated significant capabilities in detecting emotions within text; however, systematic and comprehensive evaluations of their performance remain limited. The integration of sentiment analysis into LLMs enhances their ability to generate contextually appropriate, empathetic, and accurate responses, thereby improving user engagement and interaction quality. This reciprocal relationship allows LLMs not only to contribute to emotion recognition systems but also to leverage sentiment analysis feedback to refine their conversational abilities.
Modern LLMs, such as ChatGPT \cite{openai2023gpt4} and Gemini \cite{team2023gemini}, now embed dedicated sentiment analysis modules, leading to more precise and effective emotion detection in textual data.
This paper makes the following key contributions:
1. Unlike traditional ensemble methods that apply uniform model combinations, our approach selects optimal models independently for each emotion category based on performance metrics.
2. We introduce a sophisticated voting system that goes beyond simple majority voting by incorporating confidence scores, probability thresholds, and hierarchical decision rules.
3. We provide extensive experimental validation across multiple BERT variants and LLMs, demonstrating consistent improvements over individual model approaches. Our method achieves superior performance on the SemEval-2025 Task 11 benchmark, establishing new performance benchmarks for multi-label emotion detection.
The remainder of this paper is structured as follows: Sect. 2 presents a comprehensive literature review covering traditional emotion analysis approaches, deep learning methods, and ensemble techniques. Sect. 8 details our proposed methodology, including the system architecture, model selection mechanisms, and voting strategies. Sect. 18 describes data pre-processing, the experimental setup, and implementation details. Sect. 26 presents comprehensive experimental results and ablation studies. Sect. 34 provides detailed discussion of findings, limitations, and comparative analysis. Finally, Sect. 39 concludes the paper with key findings and future research directions.
Proposed Methodology
System Architecture
The proposed system combines Transformer-based models and Large Language Models through a rule-based voting aggregator for multi-label emotion detection, as shown in Figure 1. The architecture comprises three main components: (1) transformer-based models, (2) large language models, and (3) rule-based voting aggregator.
The system architecture operates through a three-stage pipeline for multi-label emotion detection. First, input text is processed in parallel through multiple model pathways: transformer-based models (BERT variants) perform fine-tuned classification with sigmoid activation to generate emotion probabilities, while LLMs use either SFT or ICL to produce binary emotion predictions. Second, the Weight Assignment Mechanism converts these diverse model outputs into standardized probability estimates, using sigmoid outputs for BERT variants and F1-based calibration for LLM binary predictions. Finally, the Rule-based Voting Aggregator processes these calibrated probabilities through emotion-specific model selection and hierarchical decision rules to generate the final multi-label emotion classifications. This workflow integrates complementary model strengths while maintaining consistent probability-based aggregation across diverse architectures.
Transformer-based Models
To systematically evaluate Transformer-based architectures for multi-label emotion detection, we selected twelve widely-adopted BERT variants based on their availability and established usage within the research community. Our methodology employs a comprehensive empirical evaluation approach wherein we assess all model variants to determine which architectures exhibit optimal performance for individual emotion categories, rather than relying on predetermined model selection criteria.
The selected models encompass significant advancements in Transformer architecture development:
Standard pretraining models: BERT \cite{RN1402} serves as the foundational model, trained via masked language modeling. RoBERTa \cite{RN1466} extends BERT with dynamic masking, longer training durations, and larger batch sizes, enhancing performance consistency.
Autoregressive and Permutation-based models: XLNet \cite{RN1296} employs a permutation-based objective in combination with Transformer-XL recurrence, enabling modeling of bidirectional contexts without relying on masking strategies.
Discriminator-based pretraining: ELECTRA \cite{RN1463} replaces the masked token prediction task with a discriminator trained to distinguish real input tokens from synthetically generated ones, significantly improving training sample efficiency.
Lightweight and compressed variants: DistilBERT \cite{sanh2019distilbert}, a distilled version of BERT, offers reduced size and inference time while retaining performance. ALBERT \cite{lan2020albert} achieves compression through cross-layer parameter sharing and embedding factorization. TinyBERT \cite{jiao2020tinybert} is trained via knowledge distillation from a larger BERT model.
Multilingual and cross-lingual Models: mBERT \cite{pires2019mbert} supports 104 languages using a shared WordPiece vocabulary. XLM-RoBERTa \cite{conneau2020xlmr} is pretrained on an extended multilingual corpus to enhance cross-lingual generalization.
Domain-specific and span-based variants: SpanBERT \cite{joshi2020spanbert} improves span-level understanding by masking and predicting continuous text spans. BART \cite{lewis2020bart} combines bidirectional encoding and autoregressive decoding within a denoising sequence-to-sequence framework.
Our system architecture places no fixed limit on the number of model variants, so new Transformer models can be integrated as they become available. This keeps the approach adaptable to evolving NLP architectures while retaining empirical performance assessment for emotion-specific model selection.
Each transformer-based model underwent supervised fine-tuning using the training and development partitions from Track A of SemEval 2025 Task 11, an established benchmark for multi-label emotion detection. We augmented each pretrained backbone with a task-specific classification head comprising a linear layer that generates independent binary predictions for the five target emotion categories (anger, fear, joy, sadness, surprise). The fine-tuning process followed standard transfer learning protocols, enabling end-to-end optimization of both the pretrained representations and the classification layer on the emotion-annotated corpus.
Prediction Generation, F1 Performance, and Probability Procedures
The transformer-based models generate binary predictions for each emotion category across all test data instances, accompanied by corresponding probability estimates for each emotion class. Each model produces sigmoid-activated probability scores within the range [0, 1], representing the confidence level for the presence of each target emotion in the input text. These probability estimates serve as direct inputs to the Weight Assignment Mechanism described in Sect. 16. The binary predictions are subsequently processed by the Voting Rules Mechanism outlined in Sect. 17 for final emotion classification decisions.
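The conversion from classifier outputs to probabilities and binary predictions described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names are hypothetical, and the 0.5 decision threshold is an assumption, since the text does not state the cut-off applied to the sigmoid outputs.

```python
import math

EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise"]


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logits_to_predictions(logits, threshold=0.5):
    """Return (probabilities, binary predictions), both keyed by emotion.

    `logits` holds one raw score per emotion from the classification head.
    `threshold` is an assumed cut-off; it is not specified in the text.
    """
    probs = {e: sigmoid(z) for e, z in zip(EMOTIONS, logits)}
    preds = {e: int(p >= threshold) for e, p in probs.items()}
    return probs, preds


# Hypothetical logits for a single input text.
probs, preds = logits_to_predictions([2.0, -1.5, 0.0, 0.3, -3.0])
```

Each emotion is decided independently, which is what allows several emotions to be predicted for the same text.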
Performance evaluation is conducted through systematic comparison of binary predictions against ground truth annotations in the development dataset, yielding F1 metrics computed at both individual emotion levels and aggregate levels (macro and micro F1 scores). These F1 performance metrics function as selection criteria for the emotion-specific model selection mechanism within the Rule-based Voting Aggregator described in Sect. 14, ensuring that only the most effective models contribute to each emotion category's final classification.
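As an illustration of the metrics named above, the following sketch computes per-emotion, macro, and micro F1 from binary predictions. It uses the standard F1 definition; the function names and the toy labels are hypothetical and are not taken from the experiments.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from two equal-length lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(gold, pred, emotions):
    """gold/pred map emotion -> list of 0/1 labels over the development set."""
    per_emotion = {e: f1_score(gold[e], pred[e]) for e in emotions}
    # Macro F1: unweighted mean of the per-emotion scores.
    macro = sum(per_emotion.values()) / len(emotions)
    # Micro F1: pool every (instance, emotion) decision before scoring.
    all_true = [t for e in emotions for t in gold[e]]
    all_pred = [p for e in emotions for p in pred[e]]
    micro = f1_score(all_true, all_pred)
    return per_emotion, macro, micro


# Toy labels for two emotions over four instances (illustrative only).
gold = {"joy": [1, 0, 1, 1], "fear": [0, 0, 1, 0]}
pred = {"joy": [1, 0, 0, 1], "fear": [0, 1, 1, 0]}
per_emotion, macro, micro = evaluate(gold, pred, ["joy", "fear"])
```

The per-emotion scores are the quantities that feed the emotion-specific model selection; the macro and micro values summarize a model as a whole.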
Large Language Models (LLMs)
We employ Large Language Models through two distinct methodologies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). This dual approach combines the adaptability of fine-tuned models with the zero-shot capabilities of pre-trained LLMs for multi-label emotion detection.
To ensure consistent prompt formatting across all LLM approaches, we design standardized prompt templates that facilitate reliable model training and inference. Table 1 presents the two primary template configurations employed throughout our methodology: Template 1 for unified multi-label classification and Template 2 for binary emotion-specific detection. These templates are strategically designed to optimize model performance while maintaining consistency across different LLM architectures and fine-tuning strategies.
\begin{table}[htbp]
\vspace{-12pt}
\centering
\caption{Prompt templates for supervised fine-tuning approaches}
\label{tab:prompt_templates}
\begin{tabular}{cp{12cm}}
\toprule
Template & Prompt Structure \\
\midrule
Template 1 & "Analyze the sentiment of the following Tweets \{input text\} and classify them as 'ANGER', 'FEAR', 'JOY', 'SADNESS', 'SURPRISE'. Choose all possible emotions in the list that match the sentence without any explanation." \\
\midrule
Template 2 & "Analyze the sentiment of the following Tweets \{input text\} Do you find any \{emotion\} in that sentence? Use the schema: Yes or No. Make sure you do not add any explanation nor other detail." \\
\bottomrule
\end{tabular}
\vspace{-6pt}
\end{table}
SFT Approaches
Supervised fine-tuning adapts pre-trained models using labeled data through prompt-label pairs. We implement three strategies using Gemini-1.5-flash-001-tuning, each targeting different aspects of multi-label emotion classification (Figure 2).
Our SFT methodology follows a two-phase process: fine-tuning with labeled data using specific prompt templates (Table 1), then inference using identical templates for prediction consistency. During training, models receive input text formatted with templates and corresponding emotion labels as targets. For prediction, the same template structure queries the fine-tuned model for emotion classifications.
Approach 1: Vanilla Multi-Label SFT
Uses Template 1 for unified multi-label detection. A single model trains on the complete dataset to predict all emotions simultaneously, capturing inter-emotion relationships within one inference pass.
Approach 2: Task-Decomposed SFT
Employs Template 2 to create five specialized binary classifiers. Each model focuses on one emotion category, reducing interference between different emotional signals and enabling targeted optimization.
Approach 3: Data-Augmented SFT
Builds on Approach 2 using Template 2 but incorporates augmented training data with positive-filtered external instances. This addresses class imbalance while maintaining the focused benefits of binary decomposition.
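The way the two templates yield prompt-label pairs for Approaches 1 and 2 can be sketched as below. The template wording follows Table 1, but everything else is an assumption made for illustration: the function names, the comma-separated target format for Template 1, and the use of upper-case emotion names inside Template 2 are not specified in the text.

```python
EMOTIONS = ["ANGER", "FEAR", "JOY", "SADNESS", "SURPRISE"]

TEMPLATE_1 = (
    "Analyze the sentiment of the following Tweets {input_text} and classify "
    "them as 'ANGER', 'FEAR', 'JOY', 'SADNESS', 'SURPRISE'. Choose all possible "
    "emotions in the list that match the sentence without any explanation."
)
TEMPLATE_2 = (
    "Analyze the sentiment of the following Tweets {input_text} Do you find any "
    "{emotion} in that sentence? Use the schema: Yes or No. Make sure you do not "
    "add any explanation nor other detail."
)


def multilabel_pair(text, labels):
    """Approach 1: a single prompt whose target lists every emotion present.

    The comma-separated target format is an assumption.
    """
    target = ", ".join(e for e in EMOTIONS if labels[e])
    return TEMPLATE_1.format(input_text=text), target


def binary_pairs(text, labels):
    """Approach 2: one Yes/No prompt-label pair per emotion."""
    return {
        e: (TEMPLATE_2.format(input_text=text, emotion=e),
            "Yes" if labels[e] else "No")
        for e in EMOTIONS
    }


# Hypothetical annotated instance.
labels = {"ANGER": 0, "FEAR": 0, "JOY": 1, "SADNESS": 0, "SURPRISE": 1}
prompt, target = multilabel_pair("I can't believe we won!", labels)
pairs = binary_pairs("I can't believe we won!", labels)
```

Approach 3 would reuse `binary_pairs` unchanged and differ only in the training instances supplied to it.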
ICL Prompt Approaches
ICL performs tasks through carefully designed prompts without parameter updates. We apply Template 1 from Table 1 across multiple LLM variants for comprehensive multi-label classification.
Our ICL methodology employs Template 1 to enable simultaneous detection of all five emotion categories within one inference pass. The template structure is:
"Analyze the sentiment of the following Tweets \{input text\} and classify them as 'ANGER', 'FEAR', 'JOY', 'SADNESS', 'SURPRISE'. Choose all possible emotions in the list that match the sentence without any explanation."
For each test instance, we format the input text according to Template 1 and send the prompt to the respective LLM through API calls. The models process the structured prompt and return predictions indicating which emotions are present in the input text. We implement this approach across multiple LLM variants without any fine-tuning, relying solely on the models' pre-trained capabilities to respond to the formatted prompts and provide emotion classifications for the given text instances.
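Turning a Template 1 reply into binary labels can be sketched as follows. The text does not describe how responses are parsed, so this is only one plausible approach under stated assumptions: the reply is assumed to name the detected emotions in plain text, and the function name is hypothetical. The API call itself is omitted.

```python
import re

EMOTIONS = ["ANGER", "FEAR", "JOY", "SADNESS", "SURPRISE"]


def parse_multilabel_response(response: str) -> dict:
    """Map a free-text LLM reply to a 0/1 label per emotion.

    Matching is case-insensitive and on whole alphabetic tokens, so
    "joyful" does not count as JOY. Negated mentions such as "no anger"
    are not handled; this is a simplifying assumption.
    """
    tokens = set(re.findall(r"[A-Z]+", response.upper()))
    return {e: int(e in tokens) for e in EMOTIONS}


# Hypothetical model reply.
labels = parse_multilabel_response("'JOY', 'SURPRISE'")
```

The resulting dictionary has the same shape as the binary predictions produced by the transformer-based models, which is what lets both families enter the same voting step.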
Prediction Generation and F1 Performance Assessment
The LLM approaches generate binary predictions for each emotion category across all test dataset instances through two distinct pathways: SFT models produce direct binary outputs via fine-tuned classification heads, while ICL models generate classifications through prompt-response mechanisms. These binary predictions undergo two-stage processing within the rule-based voting framework: conversion to calibrated probability estimates for weight assignment (Sect. 16) and direct integration into the hierarchical voting rules mechanism (Sect. 17).
Performance evaluation employs systematic comparison of binary predictions against ground truth annotations in the development dataset, computing F1 metrics at both individual emotion and aggregate levels (macro and micro F1 scores). These F1 performance metrics serve dual functions within the rule-based voting framework: determining which LLM models participate in voting for each emotion category through emotion-specific model selection and enabling probability estimation by converting binary predictions into calibrated confidence scores through the Weight Assignment Mechanism outlined in Sect. 16.
Rule-based Voting Aggregator
The core component of our proposed methodology is the Rule-based Voting Aggregator, which combines predictions from multiple different models to produce final emotion classifications. Figure 3 shows the detailed structure and decision flow of this aggregation process. In this framework, each participating model, whether a transformer-based model or an LLM, contributes binary predictions, F1 performance metrics, and probability estimates for individual emotion categories. These outputs are systematically integrated through our hierarchical rule-based decision mechanism to produce the final multi-label emotion classifications.
Model Selection Mechanism
Traditional ensemble methods apply uniform model combinations across all prediction tasks. However, selecting the most appropriate subset of models to participate in the voting process is a critical step for optimizing classification performance. Our approach recognizes that different models exhibit varying strengths for different emotions, necessitating adaptive selection strategies. In this study, model selection is based on the F1 score, a balanced metric that accounts for both precision and recall in multi-label classification tasks \cite{zhang2014review,pedregosa2011scikit}.
Specifically, for each individual emotion category (e.g., joy, sadness, fear, anger, and surprise), models are independently evaluated on the development set. An F1 score is calculated per emotion, and only the top-performing models for each category are selected to contribute predictions. This per-emotion selection strategy enables the ensemble to leverage the unique strengths of different architectures for specific emotional labels.
By relying on F1-based selection, the ensemble voting process effectively filters out underperforming or noisy models, thereby reducing the risk of deteriorating the overall accuracy of the emotion classification system.
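For each emotion category $e$, we select the top-$k$ performing models based on development set F1 scores. The selection process follows:
\begin{algorithm}[H]
\caption{Emotion-Specific Model Selection}
\begin{algorithmic}
\State \textbf{Input:} Model set $\mathcal{M}$, emotion $e$, development set $D_{dev}$
\State \textbf{Output:} Selected model subset $\mathcal{M}_e$
\For{each model $m \in \mathcal{M}$}
\State Evaluate $m$ on $D_{dev}$ for emotion $e$ to obtain $F1_m^e$
\EndFor
\State Sort models by $F1_m^e$ in descending order
\State Select top-$k$ models: $\mathcal{M}_e \leftarrow \{m_1, \dots, m_k\}$
\end{algorithmic}
\end{algorithm}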
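The per-emotion selection procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the model names, and the F1 values are all hypothetical, and the development-set scores are assumed to be computed beforehand.

```python
def select_top_k(f1_by_model, k):
    """Return the k model names with the highest dev-set F1 for one emotion.

    f1_by_model maps model name -> F1 score. Ties keep insertion order
    because Python's sort is stable.
    """
    ranked = sorted(f1_by_model, key=f1_by_model.get, reverse=True)
    return ranked[:k]


def select_per_emotion(dev_f1, k):
    """dev_f1 maps emotion -> {model name -> F1}; selection is independent per emotion."""
    return {emotion: select_top_k(scores, k) for emotion, scores in dev_f1.items()}


# Made-up scores, for illustration only.
dev_f1 = {
    "joy": {"roberta": 0.78, "electra": 0.74, "gemini_sft": 0.81},
    "fear": {"roberta": 0.70, "electra": 0.75, "gemini_sft": 0.72},
}
selected = select_per_emotion(dev_f1, k=2)
```

In this toy example the selected subsets differ between the two emotions, which is the behavior that distinguishes per-emotion selection from a uniform ensemble.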
Weight Assignment Mechanism
The weight assignment mechanism transforms diverse model outputs into standardized voting weights through a systematic three-stage process that enables effective integration of heterogeneous architectures within the rule-based voting framework.
Step 1: Class Probability Generation
The first stage generates probability estimates from heterogeneous model architectures through architecture-specific approaches.
Transformer-based Models (BERT Variants): Class probabilities are generated directly through sigmoid activation functions applied to the final classification layer outputs, producing probability values in the range [0,1] that reflect the model's confidence in emotion presence.
Large Language Models: Class probabilities are estimated through F1-based calibration since LLMs produce discrete binary predictions. The calibration process maps binary outputs to probability estimates using the model's empirical performance on the development set:
\begin{equation}
p_m^e = \begin{cases}
F1_m^e & \text{if } \text{prediction}_m^e = 1 \\
1 - F1_m^e & \text{if } \text{prediction}_m^e = 0
\end{cases}
\end{equation}
where $F1_m^e$ represents the F1 score achieved by model $m$ for emotion $e$ on the development set.
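The F1-based calibration for LLM outputs is a direct case split, sketched below. The function name is hypothetical and the F1 value in the example is made up for illustration.

```python
def calibrate(prediction: int, f1: float) -> float:
    """Map a binary LLM prediction to a probability estimate.

    A positive prediction receives the model's dev-set F1 for that emotion;
    a negative prediction receives 1 - F1.
    """
    if prediction not in (0, 1):
        raise ValueError("prediction must be 0 or 1")
    if not 0.0 <= f1 <= 1.0:
        raise ValueError("f1 must lie in [0, 1]")
    return f1 if prediction == 1 else 1.0 - f1


# A hypothetical model with dev-set F1 = 0.82 for some emotion.
p_present = calibrate(1, 0.82)
p_absent = calibrate(0, 0.82)
```

One consequence worth noting is that a model's estimate can take only two values per emotion, so a model with F1 near 0.5 contributes estimates near 0.5 regardless of what it predicts.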
Step 2: Probability-to-Weight Conversion
The second stage transforms the generated class probabilities into discrete voting weights using the systematic mapping scheme presented in Table 2. This conversion process applies uniformly to all probability estimates, ensuring consistent weight assignment across diverse model architectures.
\begin{table}[htbp]
\vspace{-6pt}
\centering
\caption{Probability-to-weight mapping scheme with confidence interpretations}
\label{tab:probability-weight-mapping}
\begin{tabular}{ccp{7.5cm}}
\toprule
Class Probability & Weight & Confidence Interpretation \\
\midrule
0.8 -- 1.0 & +2 & Strong model confidence that the emotion is present \\
0.6 -- 0.8 & +1 & Moderate confidence levels indicating emotion presence \\
0.4 -- 0.6 & 0 & Model uncertainty about emotion presence or absence \\
0.2 -- 0.4 & -1 & Moderate confidence suggesting emotion absence \\
0.0 -- 0.2 & -2 & Strong confidence in emotion absence \\
\bottomrule
\end{tabular}
\vspace{-6pt}
\end{table}
Step 3: Weight Aggregation
The final stage integrates the assigned weights into the voting rules mechanism through systematic aggregation of individual model contributions. For each emotion category $e$, the total weight score is computed as:
\begin{equation}
W^e = \sum_{m=1}^{k} w_m^e
\end{equation}
where $w_m^e$ represents the weight assigned to model $m$ for emotion $e$, and $k$ denotes the number of selected models participating in the voting process.
These aggregated weight scores serve as primary inputs to the hierarchical voting rules mechanism described in Sect. 17, enabling effective integration of diverse architectural strengths through standardized probability-based aggregation.
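Steps 2 and 3 can be sketched together as follows. The interval-to-weight values follow Table 2, but the table leaves the shared boundaries (0.2, 0.4, 0.6, 0.8) ambiguous; assigning each boundary to the upper interval is an assumption made here, as are the function names and example probabilities.

```python
def probability_to_weight(p: float) -> int:
    """Map a class probability to a discrete voting weight in {-2, ..., +2}.

    Boundary values are assigned to the upper interval; this tie-breaking
    choice is an assumption, since the table does not specify it.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p >= 0.8:
        return 2
    if p >= 0.6:
        return 1
    if p >= 0.4:
        return 0
    if p >= 0.2:
        return -1
    return -2


def total_weight(probs) -> int:
    """Sum the voting weights of the selected models for one emotion."""
    return sum(probability_to_weight(p) for p in probs)


# Hypothetical probabilities from four selected models for one emotion.
score = total_weight([0.93, 0.65, 0.30, 0.18])
```

Here the weights are +2, +1, -1, and -2, so the total is zero and the decision would fall through to the tie-breaking rules.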
Voting Rules Mechanism
The voting rules mechanism implements a hierarchical decision-making process that systematically aggregates weighted predictions from selected models to determine final emotion classifications. As illustrated in Figure 3, this mechanism employs a three-tier rule structure that addresses various prediction scenarios, from clear consensus to highly ambiguous cases.
Rule 1: Total Weight Aggregation
The primary decision criterion evaluates the total weight score computed from all selected models for each emotion category. For each emotion $e$, the total weight $W^e$ is calculated as:
\begin{equation}
W^e = \sum_{m=1}^{k} w_m^e
\end{equation}
where $w_m^e$ represents the weight assigned to model $m$ for emotion $e$ based on the probability-to-weight mapping scheme (Table 2), and $k$ is the number of selected models. The decision logic follows:
If $W^e > 0$: Assign positive value (1) - the emotion is present
If $W^e < 0$: Assign negative value (0) - the emotion is absent
This primary rule handles the majority of classification decisions where model consensus provides a clear directional preference.
Rule 2: Positive vs Negative Weight Count
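When the total weight equals zero, indicating balanced positive and negative evidence, the system evaluates the distribution of model predictions by counting the number of models contributing positive versus negative weights:
\begin{equation}
N_{+}^e = \left|\{m : w_m^e > 0\}\right|, \qquad N_{-}^e = \left|\{m : w_m^e < 0\}\right|
\end{equation}
where $w_m^e$ is the weight assigned to model $m$ for emotion $e$. The decision logic applies:
If $N_{+}^e > N_{-}^e$: Assign positive value (1)
If $N_{+}^e < N_{-}^e$: Assign negative value (0)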
This secondary rule leverages model agreement patterns when weighted aggregation fails to provide decisive evidence.
Rule 3: Average Probability Threshold
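In the most ambiguous scenario where both total weight and vote counts are balanced, the system employs a probability-based tie-breaking mechanism. The average class probability across all selected models is computed as:
\begin{equation}
\bar{p}^e = \frac{1}{k} \sum_{m=1}^{k} p_m^e
\end{equation}
where $p_m^e$ represents the probability output of model $m$ for emotion $e$, $k$ is the number of selected models, and $\tau$ denotes the probability threshold. The final decision criterion applies:
If $\bar{p}^e \geq \tau$: Assign positive value (1)
If $\bar{p}^e < \tau$: Assign negative value (0)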
This tertiary rule ensures that even in highly uncertain cases, the system can make informed decisions based on the collective confidence of all participating models.
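The three rules can be read as a short cascade, sketched below for a single emotion. Several details are assumptions made for illustration and are not stated in the text: the 0.5 value of the Rule 3 threshold, treating an average exactly at the threshold as positive, assigning table boundary values to the upper interval, and all function names.

```python
def probability_to_weight(p: float) -> int:
    """Probability-to-weight mapping (Table 2); boundaries go to the upper interval (assumed)."""
    if p >= 0.8:
        return 2
    if p >= 0.6:
        return 1
    if p >= 0.4:
        return 0
    if p >= 0.2:
        return -1
    return -2


def vote(probs, threshold=0.5):
    """Return 1 (emotion present) or 0 (absent) from the selected models' probabilities.

    `threshold` is an assumed value for the Rule 3 cut-off.
    """
    if not probs:
        raise ValueError("at least one model probability is required")
    weights = [probability_to_weight(p) for p in probs]

    # Rule 1: the sign of the total weight decides when it is non-zero.
    total = sum(weights)
    if total != 0:
        return int(total > 0)

    # Rule 2: compare how many models lean positive versus negative.
    positive = sum(1 for w in weights if w > 0)
    negative = sum(1 for w in weights if w < 0)
    if positive != negative:
        return int(positive > negative)

    # Rule 3: fall back to the average probability.
    return int(sum(probs) / len(probs) >= threshold)


# Hypothetical inputs exercising each rule in turn.
rule1 = vote([0.90, 0.70, 0.30])  # weights +2, +1, -1
rule2 = vote([0.90, 0.35, 0.30])  # weights +2, -1, -1
rule3 = vote([0.70, 0.35])        # weights +1, -1; mean 0.525
```

Running the cascade once per emotion over that emotion's selected models yields the final multi-label prediction for an input text.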
References:
Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 4171--4186, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1
Wankhade, Mayur and Rao, Annavarapu Chandra Sekhara and Kulkarni, Chaitanya (2022) A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review 55(7): 5731--5780 https://doi.org/10.1007/s10462-022-10144-1, ISSN 1573-7462
Clark, Kevin and Luong, Minh-Thang and Le, Quoc V. and Manning, Christopher D. (2020) ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv:2003.10555 [cs.CL]
Yang, Zhilin and Dai, Zihang and Yang, Yiming and Carbonell, Jaime and Salakhutdinov, Ruslan and Le, Quoc V. (2020) XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv:1906.08237 [cs.CL]
Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692
Nandwani, Pansy and Verma, Rupali (2021) A review on sentiment analysis and emotion detection from text. Social Network Analysis and Mining 11(1): 81 https://doi.org/10.1007/s13278-021-00776-6, ISSN 1869-5469
Campbell, S. L. and Gear, C. W. (1995) The index of general nonlinear {D}{A}{E}{S}. Numer. {M}ath. 72(2): 173--196
Slifka, M. K. and Whitton, J. L. (2000) Clinical implications of dysregulated cytokine production. J. {M}ol. {M}ed. 78: 74--80 https://doi.org/10.1007/s001090000086
Hamburger, C. (1995) Quasimonotonicity, regularity and duality for nonlinear systems of partial differential equations. Ann. Mat. Pura. Appl. 169(2): 321--354
Geddes, K. O. and Czapor, S. R. and Labahn, G. (1992) Algorithms for {C}omputer {A}lgebra. Kluwer, Boston
Broy, M. Software engineering---from auxiliary to key technologies. In: Broy, M. and Denert, E. (Eds.) Software Pioneers, 1992, Springer, New {Y}ork, 10--13
(1981) Conductive {P}olymers. Plenum, New {Y}ork, Seymour, R. S.
Smith, S. E. (1976) Neuromuscular blocking drugs in man. Springer, Heidelberg, 593--660, Neuromuscular junction. {H}andbook of experimental pharmacology, 42, Zaimis, E.
Chung, S. T. and Morris, R. L.. Isolation and characterization of plasmid deoxyribonucleic acid from Streptomyces fradiae. Paper presented at the 3rd international symposium on the genetics of industrial microorganisms, University of {W}isconsin, {M}adison, 4--9 June 1978. 1978
Hao, Z. and AghaKouchak, A. and Nakhjiri, N. and Farahmand, A.. Global integrated drought monitoring and prediction system (GIDMaPS) data sets. figshare https://doi.org/10.6084/m9.figshare.853801. 2014
Babichev, S. A. and Ries, J. and Lvovsky, A. I.. Quantum scissors: teleportation of single-mode optical states by means of a nonlocal single photon. Preprint at https://arxiv.org/abs/quant-ph/0208066v1. 2002
Beneke, M. and Buchalla, G. and Dunietz, I. (1997) Mixing induced {CP} asymmetries in inclusive {B} decays. Phys. {L}ett. B393: 132-142 gr-gc, 0707.3168, arXiv
Abbott, T. M. C. and others (2019) {Dark Energy Survey Year 1 Results: Constraints on Extended Cosmological Models from Galaxy Clustering and Weak Lensing}. Phys. Rev. D 99(12): 123505 https://doi.org/10.1103/PhysRevD.99.123505, FERMILAB-PUB-18-507-PPD, astro-ph.CO, arXiv, 1810.02499, DES
Grootendorst, Maarten (2022) BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Lukasz and Polosukhin, Illia (2017) Attention is all you need. 5998--6008, 30, Advances in neural information processing systems
Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, R{\'e}mi and Funtowicz, Morgan and others (2020) Transformers: State-of-the-art natural language processing. 38--45, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Kovaleva, Olga and Romanov, Alexey and Rogers, Anna and Rumshisky, Anna (2019) Revealing the dark secrets of BERT. 4365--4374, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
Sun, Chi and Huang, Luyao and Qiu, Xipeng (2020) EMOBERT: Learning emotion representations using BERT for emotion detection. 4498--4508, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
Demszky, Dorottya and Mahowald, Kyle and Kohli, Pushpendre and Zhao, Jieyu and Gibson, Emma and Sachan, Mrinmaya and Jurafsky, Dan (2020) GoEmotions: A dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547
Xia, Rui and Ding, Rui and Ding, Ziqing and Li, Rui (2019) EmotionX-IDEA: Emotion BERT--an affectional model for conversation. 285--291, Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805
Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692
Yang, Zhilin and Dai, Zihang and Yang, Yiming and Carbonell, Jaime and Salakhutdinov, Ruslan and Le, Quoc V (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding. Advances in neural information processing systems 32
Clark, Kevin and Luong, Minh-Thang and Le, Quoc V and Manning, Christopher D (2020) ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv preprint arXiv:2003.10555
Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108
Lan, Zhenzhong and Chen, Mingda and Goodman, Sebastian and Gimpel, Kevin and Sharma, Piyush and Soricut, Radu (2020) ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. International Conference on Learning Representations
Jiao, Xiaoqi and Yin, Yichun and Shang, Lifeng and Jiang, Xin and Chen, Xiao and Li, Linlin and Wang, Fang and Liu, Qun (2020) TinyBERT: Distilling BERT for Natural Language Understanding. 4163--4174, Findings of the Association for Computational Linguistics: EMNLP 2020
Pires, Telmo and Schlinger, Eva and Garrette, Dan (2019) How multilingual is Multilingual BERT?. 4996--5001, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzmán, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin (2020) Unsupervised Cross-lingual Representation Learning at Scale. 8440--8451, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Joshi, Mandar and Chen, Danqi and Liu, Yinhan and Weld, Daniel S and Zettlemoyer, Luke and Levy, Omer (2020) SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics 8: 64--77
Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke (2020) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics : 7871--7880
Pedregosa, Fabian and Varoquaux, Ga{\"e}l and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and others (2011) Scikit-learn: Machine learning in Python. Journal of machine learning research 12: 2825--2830
Zhang, Min-Ling and Zhou, Zhi-Hua (2014) A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26(8): 1819--1837
Breiman, Leo (2001) Random forests. Machine learning 45(1): 5--32
Friedman, Jerome H (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics : 1189--1232
Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and others (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33: 1877--1901
Howard, Jeremy and Ruder, Sebastian (2018) Universal language model fine-tuning for text classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) : 328--339
Wei, Jason and Tay, Yi and Bommasani, Rishi and Raffel, Colin and Zoph, Barret and Borgeaud, Sebastian and Yogatama, Dani and Bosma, Maarten and Zhou, Denny and Metzler, Donald and others (2022) Emergent abilities of large language models. Transactions on Machine Learning Research
Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140): 1--67
Dong, Qingxiu and Li, Lei and Dai, Damai and Zheng, Ce and Wu, Zhiyong and Chang, Baobao and Sun, Xu and Xu, Jingjing and Sui, Zhifang (2023) A survey for in-context learning. arXiv preprint arXiv:2301.00234
Liu, Xiao and Zheng, Yanan and Du, Zhengxiao and Ding, Ming and Qian, Yujie and Yang, Zhilin and Tang, Jie (2023) GPT understands, too. AI Open 4: 40--47
Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu (2021) LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685
Zhang, Lei and Wang, Shuai and Liu, Bing (2012) Sentiment analysis and opinion mining. Synthesis lectures on human language technologies 5(1): 1--167
Mohammad, Saif M and Turney, Peter D (2013) Crowdsourcing a word-emotion association lexicon. Computational Intelligence 29(3): 436--465
Calvo, Rafael A and D'Mello, Sidney (2010) Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing 1(1): 18--37
Qiu, Xipeng and Sun, Tianxiang and Xu, Yige and Shao, Yunfan and Dai, Ning and Huang, Xuanjing (2020) Pre-trained models for natural language processing: A survey. Science China Technological Sciences 63(10): 1872--1897
Rogers, Anna and Kovaleva, Olga and Rumshisky, Anna (2020) A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics 8: 842--866
Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and others (2019) PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32
OpenAI (2023) GPT-4 Technical Report. arXiv preprint arXiv:2303.08774
Gemini Team and Anil, Rohan and Borgeaud, Sebastian and Wu, Yonghui and Alayrac, Jean-Baptiste and Yu, Jiahui and Soricut, Radu and Schalkwyk, Johan and Dai, Andrew M and Hauth, Anja and others (2023) Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805
SemEval-2025 Organizers (2025) SemEval-2025 Task 11: Multi-label Emotion Detection in Social Media Posts. Shared Task Competition. Available at: https://github.com/emotion-analysis-project/SemEval2025-task11
Loshchilov, Ilya and Hutter, Frank (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101