Research on a Multimodal Emotion Perception Model Based on GCN + GIN Hybrid Model
Yingqiang Wang 1,2, Elcid A. Serrano 3,*
1 School of Graduate Studies, Mapua University, Manila, Philippines
2 School of electronic information engineering, Xi’an Siyuan university, Xi’an, China
3 School of Information Technology, Mapúa University, Manila, Philippines
* Corresponding author’s email: easerrano@mapua.edu.ph
Abstract
In recent years, graph neural networks (GNNs) have demonstrated strong performance on graph-structured data, particularly in capturing complex relationships among nodes, where they show advantages over traditional neural networks. However, challenges persist, including difficulties in cross-modal information fusion, inadequate modeling of relationships between modalities, and high computational cost. To address these limitations, this paper proposes GGMEN, a model that integrates the local neighborhood aggregation capability of graph convolutional networks (GCNs) with the global structural expressiveness of graph isomorphism networks (GINs). Shallow features are extracted from peripheral physiological signals through joint time-frequency analysis, yielding 14 representative statistical features. In parallel, a Transformer model captures spatial features from individual facial expression video frames, enabling spatio-temporal modeling of facial expressions. The GCN layer models temporal dependencies in physiological signals and spatial relationships among facial key points, while the GIN layer strengthens the modeling of complex higher-order relationships. Multimodal emotion perception is then achieved through attention-based modality fusion. Experiments on the DEAP dataset validate the model's effectiveness across multiple emotion perception benchmarks, achieving an emotion recognition accuracy of 81.25%. Comparative analyses with existing models confirm the accuracy improvement of the proposed framework.
Keywords: Multimodal; Emotion Perception; GCN; GIN
1. Introduction
In the contemporary digital era, emotion perception plays a ​​pivotal​​ role across diverse application scenarios, including ​​intelligent human-computer interaction systems​​, social media analytics, mental health monitoring, and ​​intelligent customer service​​. As technological capabilities advance, ​​single-modal emotion recognition techniques​​ (e.g., relying exclusively on text or voice) have proven ​​increasingly inadequate​​ for addressing real-world complexities, ​​propelling​​ multimodal emotion perception to the forefront of research [1]. ​​This approach integrates heterogeneous data sources​​ (e.g., text, audio, physiological signals, and video) to enable ​​holistic interpretation​​ of users' emotional states [2]. ​​Current multimodal research predominantly leverages​​ text, speech, EEG, physiological signals, and facial expressions. ​​To effectively capture real-time emotional states, this study prioritizes minimally intrusive data acquisition. Consequently, the paper focuses on physiological signals and facial expressions as the primary modalities for modeling.​
Multimodal emotion perception improves recognition accuracy and robustness by combining complementary information across modalities. For instance, in intelligent assistant systems, integrating users' prosodic cues with textual content enables more precise emotion interpretation and more personalized responses. Nevertheless, prevailing methods face persistent challenges: difficulty in cross-modal information fusion [3], suboptimal modeling of relationships between modalities [4], and computational inefficiency [5]. To overcome these limitations, this study proposes the GGMEN framework, which integrates physiological signals and facial expressions within a hybrid architecture. The model combines GCN for local neighborhood aggregation with GIN for structural expressiveness. By fusing GCN's neighborhood-level information integration with GIN's high-order relational modeling, GGMEN aims to achieve efficient multimodal fusion while adapting to complex variations in emotional signals, thereby improving emotion recognition performance across application scenarios.
The remainder of this paper is structured as follows: Section 2 reviews existing research in multimodal emotion perception and graph neural networks. Section 3 details the foundational algorithms, specifically GCN and GIN, used in this study. Section 4 presents the proposed methodology, focusing on feature extraction from peripheral physiological signals and facial expressions, as well as the design of the GCN + GIN hybrid architecture. Section 5 describes the experimental implementation, including results and analysis. Section 6 summarizes key contributions, identifies limitations, and outlines future research directions.
2. Related work
2.1 Multimodal emotion perception
Multimodal fusion methods integrate emotional features across heterogeneous information sources to enhance emotion recognition accuracy. These techniques mitigate ​​modality interference​​—such as when conflicting emotional cues occur between facial expressions and vocal patterns—thereby improving recognition robustness. For instance, ​​divergent emotional indicators​​ across modalities may compromise recognition precision; fusion algorithms resolve such discrepancies to yield more ​​stable and reliable affective assessments​​. Substantial research has been conducted globally to advance these methodologies.
Reference [6] deeply reviews the latest progress, basic theories, architectures, information fusion mechanisms, relevant datasets, performance evaluation and practical applications of multimodal emotion recognition systems based on deep learning, identifies the main challenges and limitations in this field, and proposes future research directions. Reference [7] provides a comprehensive overview of the latest updates in this field, briefly introduces many recently proposed algorithms and various MSA applications, classifies a large number of recent articles, and explains the latest research trends in MSA and related fields. Yujian Cai and others proposed a unimodal feature extraction network (UFEN) to extract unimodal features with stronger representation capabilities; then, a multi-task fusion network (MTFN) was introduced to improve the correlation and fusion effect between multiple modalities. The model utilizes multi-layer feature extraction, attention mechanisms, and transformers to mine the potential relationships between features. Experimental results on the MOSI, MOSEI, and SIMS datasets demonstrate better performance in multimodal sentiment analysis tasks compared to existing baselines [8]. Hao Chao proposed a new method for multi-channel EEG signal emotion recognition based on three-dimensional features and convolutional neural networks. The EEG signal features of each channel are obtained through time-domain processing, and the features of all channels are combined into a three-dimensional feature matrix. Then, an advanced convolutional neural network containing univariate convolutional layers and multivariate convolutional layers is used for emotion recognition, fully utilizing the correlation information between multiple channels and improving the accuracy of emotion recognition [9]. 
Hajar Filali and others designed an algorithm that feeds text features, audio features, and fused text-audio features to separate groups of neurons, using deep learning to learn each feature set separately and then combining them in the last layer. This method was evaluated on MELD, a multimodal multi-party dataset for emotion recognition in conversations, where it achieved an accuracy of 86.69% [10]. Alireza Ghorbanali proposed a deep ensemble transfer learning method based on capsule networks, called DSY-ETL-MSA. This method uses pre-trained VGG16 models fine-tuned on the target datasets to extract high-level features for image classification, uses pre-trained GloVe models to embed words, and employs two separate classifiers for text classification. The results of the text and image classifiers are combined as early fusion, and their final results are fused at the decision level as late fusion. Experimental results show that on the MVSA and T4SA datasets, the proposed method achieved final accuracy rates of 0.9866 and 0.9996, respectively [11]. Doha Taha Nour El-Deen et al. proposed a hybrid deep learning model incorporating an attention mechanism. The model employs a Convolutional Bidirectional Gated Recurrent Unit (Bi-GRU) with an attention mechanism, referred to as the CBGA model. The attention mechanism is placed after the Bi-GRU, followed by the softmax output layer. The model was evaluated on the IMDB dataset, Yelp 2015, and other datasets and achieved good results [12]. Bilotti U et al. used multimodal methods and convolutional neural networks (CNNs) for emotion recognition, compared different strategies on two multimodal datasets, and proposed a multimodal input method that combines facial frames, optical flow of consecutive facial frames, and mel-spectrograms from word melodies.
By combining multimodal features in different ways, convolutional neural networks were used for emotion recognition, and excellent accuracy was achieved on two benchmark datasets [13]. Ajwa Aslamde et al. proposed a framework called "Attention-based Multimodal Sentiment Analysis and Emotion Recognition (AMSAER)", which uses attention mechanisms to improve the accuracy of sentiment and emotion classification through internal discriminative features and cross-modal correlations between visual, audio, and text modalities [14]. Mani K T et al. proposed a new uncertainty-aware multimodal fusion method COLD Fusion, which is achieved by learning to constrain the variance of the potential distribution between modalities. The two variance constraints, calibration and ordinal sorting, respectively represent the amount of information of the temporal context of different modalities on emotion recognition, thereby improving the accuracy and robustness of emotion recognition [15]. Lorenzo V et al. proposed a video emotion recognition model called ViPER, which uses a modality-independent late fusion network to combine video frames, audio recordings, and text annotations to predict people's emotional arousal, enhancing the model's adaptability to different modalities [16]. Lucas Goncalves et al. adopted a method that combines auxiliary networks and transformer architectures and optimized the training mechanism to achieve pattern alignment in audio-visual emotion recognition, capture temporal dynamic information, and handle feature loss, thereby improving the accuracy of emotion recognition under non-ideal conditions [17].
2.2 Graph neural networks
Bharti Khemani delved into specific GNN models, such as Graph Convolutional Networks (GCN), GraphSAGE, and Graph Attention Networks (GAT), providing a broad overview of GNN research and its practical implementation. He also discussed the message passing mechanisms employed by GNN models and examined their advantages and limitations in various fields. Additionally, he explored the various applications, commonly used datasets, and Python libraries that support GNN models [18]. Tomasz Wiercinski and others adjusted GraphSleepNet (a GNN for classifying sleep stages) for use with physiological data for emotion recognition. The novelty of this research lies not only in using the GNN network with the GraphSleepNet architecture for emotion recognition but also in analyzing the potential of emotion recognition based on differential entropy features in the Ekman model [19]. Nannan Lu and others proposed a new multimodal fusion method based on bi-stream graph learning for emotion recognition in conversations. This method includes a single-modal stream graph learning for modeling long-distance contextual information within each modality and a cross-modal stream graph learning for modeling interactions between modalities. GNNs are used in parallel to learn both within-modality and cross-modality information. This separation learning scheme can successfully alleviate conflicts and heterogeneity issues in multimodal data fusion and promote the explicit modeling of cross-modal relationships [20]. Jiang Li and others proposed a novel graph network-based multimodal fusion technique (GraphMFT) for emotion recognition in conversations. Within this technical framework, multimodal data is skillfully modeled as a graph structure. Specifically, each data object is considered a node in the graph, and the dependencies between nodes within the same modality (intra-modal) and between different modalities (inter-modal) are transformed into edges in the graph. 
GraphMFT utilizes multiple improved graph attention networks to accurately capture within-modality contextual information and cross-modality complementary information [21]. Hongbin Wang et al. proposed a cross-instance Graph Neural Network (GNN) that leverages global features from the dataset to detect emotions in text–image pairs. The method begins by extracting five attributes from each image and constructing a co-occurrence matrix based on the relationships among these attributes. This matrix is then used to generate an Attribute Graph Convolutional Network (Attribute_GCN) for emotion recognition. Then, using pointwise common information and message passing mechanisms, the representations of edges and nodes are updated, thereby constructing a text graph neural network (Text_GNN). Finally, multimodal deep fusion based on multi-head attention mechanisms is implemented to better predict the emotions in image-text pairs [22]. The paper [23] introduces a groundbreaking method, namely, the cumulative attribute-weighted graph neural network, which is innovatively designed to integrate three-modal text, audio, and visual data from two multimodal datasets. This algorithm achieved impressive performance metrics on the CMU-MOSI dataset, with an accuracy rate of 94%, and the accuracy, recall, and F1 scores for negative, neutral, and positive emotional categories were all above 92%. Similarly, on the IEMOCAP dataset, the algorithm achieved an overall accuracy rate of 93%, with high precision and recall for both neutral and positive categories. Huyen Trang Phan proposed an aspect-level sentiment analysis method based on a graph-structured convolutional attention neural network. 
By improving the graph structure to consider edge weights, introducing an attention mechanism to classify sentiment based on the word position in a sentence, and considering co-occurring words, inter-word and inter-sentence relationships, he used deep learning and graph structure to improve the F1 score to 7.75% [24].
However, GNNs also have limitations. The over-smoothing problem makes deep models difficult to train, high computational complexity limits their application to large-scale graphs, and their robustness to noise and missing data still needs improvement. Future research is expected to focus on optimizing computational efficiency, enhancing interpretability, improving the aggregation of node relationships, and integrating GNNs with other techniques (such as multimodal learning and hybrid models) to meet the needs of more complex practical scenarios. For example, Lin proposed a new aggregation mechanism that distinguishes homophilic (same-affinity) from heterophilic (different-affinity) neighbors to achieve effective "classified aggregation"; by combining block modeling with the aggregation process, it allows GCN to learn the aggregation rules for different types of neighbors automatically [25].
3. Algorithm introduction
3.1 GCN
The core strength of GCN lies in its ability to process non-Euclidean structured data, capturing node relationships and graph topology through message-passing mechanisms. This capability enables cross-domain applicability in fields such as social network analysis, recommendation systems, and bioinformatics. Key advantages include powerful relational modeling for complex system interactions, architectural flexibility supporting heterogeneous graphs (e.g., R-GCN) and dynamic graphs (e.g., Temporal GNN), and scalable structural representation that adapts to diverse data geometries.
GNNs transform graph-structured data into ​​learnable vector representations​​ by implementing ​​neighborhood aggregation strategies​​ across nodes and edges. This standardization enables seamless integration with diverse neural architectures, achieving ​​state-of-the-art performance​​ in node classification, edge attribute propagation, and graph clustering. The core mechanism involves ​​iteratively refining node embeddings​​ through message passing, where representations of adjacent nodes converge toward similarity. GCNs operationalize this principle by ​​generalizing convolutional operations​​ to non-Euclidean domains. Specifically, GCNs update node embeddings via ​​linear transformations of the adjacency matrix and node feature matrix​​, enabling efficient neighbor information aggregation.
Graph Structure: A graph $G = (V, E)$ comprises a set of nodes (vertices) $V$ and a set of edges $E$, formally represented by an adjacency matrix $A$, where $A_{ij} \in \{0, 1\}$ indicates whether an edge exists between nodes $i$ and $j$.
Node Features: Each node $v$ has a feature vector $X_v$; together these form the node feature matrix $X \in \mathbb{R}^{N \times d}$, where $N$ is the total number of nodes and $d$ is the feature dimension.
Hierarchical structure: A GCN consists of multiple graph convolutional layers, each of which updates the feature representation of every node. The output $H^{(l+1)}$ of layer $l$ serves as the input to layer $l+1$. By stacking multiple convolutional layers, a GCN can learn high-level feature representations of nodes. The structure is shown in Fig. 1.
Fig. 1
GCN structure diagram
Convolutional Mechanism​​: Graph Convolutional Networks fundamentally redefine convolution through ​​structure-aware feature propagation​​, distinct from traditional image-based convolution. The process involves two core phases:​​
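Step 1: Neighborhood Aggregation. Each node collects feature vectors from adjacent nodes (and from itself) according to the normalized adjacency matrix $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = A + I_N$ is the adjacency matrix with self-loops and $\tilde{D}$ is its degree matrix.
Step 2: Feature Transformation. Learned weights $W^{(l)}$ project the aggregated features into a higher-dimensional space.
This operation synthesizes topological information (encoded in the adjacency matrix $A$) and node features via a layer-wise propagation rule:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right) \quad (1)$$
where $H^{(l)}$ is the node representation matrix at layer $l$ (for the input layer, $H^{(0)} = X$), $W^{(l)}$ is the weight matrix of layer $l$, and $\sigma$ is an activation function (for example, ReLU).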
This ​​message-passing paradigm​​ enables nodes to progressively encode local graph substructures through stacked layers.
The algorithmic workflow of GCN comprises ten formalized steps:
Step 1: Initialization. Initialize the node feature matrix $X \in \mathbb{R}^{|V| \times d}$, where $|V| = N$ is the node count and $d$ is the feature dimension. Initialize the first-layer weight matrix $W^{(0)} \in \mathbb{R}^{d \times d_h}$, where $d_h$ is the hidden dimension.
Step 2: Graph Construction. Construct the adjacency matrix $A \in \{0, 1\}^{N \times N}$, where $A_{ij} = 1$ if edge $(i, j)$ exists.
Step 3: Self-loop Augmentation. Augment the adjacency matrix with self-connections: $\tilde{A} = A + I_N$, where $I_N$ is the identity matrix.
Step 4: Degree Matrix Calculation. Calculate the degree matrix $\tilde{D}$, whose diagonal elements $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ are the sum of the $i$-th row of $\tilde{A}$, that is, the degree of node $i$.
Step 5: Symmetric Normalization. Calculate the normalized adjacency matrix, usually using symmetric normalization: $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$.
Step 6: Layer-wise Feature Propagation. For each layer $l$, perform the following operations:
1) Compute the update of node features:
$$Z^{(l)} = \hat{A} H^{(l)} W^{(l)} \quad (2)$$
2) Apply a nonlinear activation function $\sigma$, such as ReLU: $H^{(l+1)} = \sigma\left(Z^{(l)}\right)$.
3) Update the node feature matrix $H^{(l+1)}$ as the input of the next layer.
Step 7: Iterative Layer Propagation​. Repeat step 6 until the required number of layers L is reached.
Step 8: Task-Specific Readout​. Depending on the task type, design a readout layer (such as a fully connected layer) to generate the final output.
Step 9: Objective Formulation. Minimize a task-aligned loss function; for node classification tasks, for example, cross-entropy loss can be used.
Step 10: Gradient-Based Optimization. Compute the gradients of the loss with respect to the weights and update the weights with the chosen optimizer, optionally applying early stopping.
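Steps 3 to 7 above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the toy graph, the random weights, and the function names `normalize_adjacency` and `gcn_forward` are assumptions introduced only for illustration, and training (Steps 8 to 10) is omitted.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a binary adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])            # Step 3: add self-loops
    deg = A_tilde.sum(axis=1)                   # Step 4: row sums = node degrees
    d_inv_sqrt = 1.0 / np.sqrt(deg)             # degrees are >= 1 because of self-loops
    # Step 5: scale rows and columns by D^{-1/2}
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, weights):
    """Steps 6-7: one ReLU graph convolution per weight matrix."""
    A_hat = normalize_adjacency(A)
    H = X
    for W in weights:
        H = np.maximum(A_hat @ H @ W, 0.0)      # H^{(l+1)} = ReLU(A_hat H^{(l)} W^{(l)})
    return H

# Hypothetical toy input: path graph 0 - 1 - 2 with 4 features per node.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 2))]

H_out = gcn_forward(A, X, weights)
print(H_out.shape)  # (3, 2)
```

Because the self-loops guarantee a degree of at least one for every node, the inverse square root in Step 5 is always defined, which is one practical reason for performing Step 3 before Step 4.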
3.2 GIN
The Graph Isomorphism Network (GIN) is a theoretically grounded graph neural architecture designed to maximize structural discriminative power. Its expressiveness matches that of the Weisfeiler-Lehman (WL) graph isomorphism test, which allows it to distinguish graph topologies that simpler aggregation schemes cannot separate. This capability supports applications such as social network analysis and recommendation systems. Although GIN can be computationally intensive and prone to overfitting, these issues can be mitigated through measures such as regularization and layer pruning.
The core principle of GIN involves two key innovations: introducing a learnable parameter ϵ, and incorporating a multilayer perceptron (MLP) module. Specifically, the learnable parameter ϵ provides flexible control over the weight relationship between central nodes and their neighbors. The resulting feature update formula can be expressed as:
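$$h_v^{(k)} = \mathrm{MLP}^{(k)}\left(\left(1 + \epsilon^{(k)}\right) \cdot h_v^{(k-1)} + \sum_{u \in N(v)} h_u^{(k-1)}\right) \quad (3)$$
where
$h_v^{(k)}$ denotes the feature representation of node $v$ at layer $k$;
$h_u^{(k-1)}$ represents the feature of neighboring node $u$ from layer $k-1$;
$N(v)$ is the neighbor set of node $v$;
$\epsilon^{(k)}$ is a learnable parameter that balances the importance between the central node and its neighbors;
$\mathrm{MLP}^{(k)}$ denotes the multilayer perceptron for nonlinear transformation.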
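The GIN update can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: the MLP is taken to be a two-layer perceptron with ReLU, the parameter ϵ is fixed rather than learned, and the function name `gin_layer`, the star graph, and the identity weights are hypothetical choices that make the arithmetic easy to follow.

```python
import numpy as np

def gin_layer(A, H, eps, W1, W2):
    """One GIN update: MLP((1 + eps) * h_v + sum of neighbor features)."""
    neighbor_sum = A @ H                        # sum over u in N(v); A has no self-loops
    combined = (1.0 + eps) * H + neighbor_sum   # weight the central node by (1 + eps)
    hidden = np.maximum(combined @ W1, 0.0)     # MLP layer 1 with ReLU (assumed form)
    return hidden @ W2                          # MLP layer 2 (linear)

# Hypothetical star graph: node 0 is connected to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])
W1 = np.eye(2)   # identity weights keep the arithmetic visible
W2 = np.eye(2)

H_next = gin_layer(A, H, eps=0.5, W1=W1, W2=W2)
print(H_next)
```

For node 0, the neighbor sum is [2, 3] and the scaled central feature is [1.5, 0], so its updated representation is [3.5, 3]. In a trained model, ϵ and the MLP weights would be learned jointly with the rest of the network.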
The implementation process of GIN is as follows:
Step 1: Initialize node features. The initial features $h_v^{(0)}$ of each node are usually part of the input data (such as node attributes or embedding vectors).
Step 2: Iteratively update node features. For each layer $k$, GIN first aggregates the neighborhood of each node by summing the neighbor node features:
$$a_v^{(k)} = \sum_{u \in N(v)} h_u^{(k-1)} \quad (4)$$
The central node features are then combined with the aggregated neighborhood features, followed by a nonlinear transformation using an MLP to generate the updated node representations (Formula 3).
Step 3: Global pooling. For graph classification tasks, the paper aggregates node features into a graph-level embedding through:
Sum pooling: directly sum all node features.
Mean pooling: take the average of all node features.
Maximum pooling: take the maximum value of all node features.
Step 4: Downstream task adaptation. The final graph representation can be used for classification or regression, etc.
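The three pooling options in Step 3 behave differently on repeated node features, which the following sketch illustrates. The function name `readout` and the two toy feature matrices are assumptions made for this example only.

```python
import numpy as np

def readout(H, mode="sum"):
    """Aggregate node features H of shape (N, d) into one graph-level vector."""
    if mode == "sum":
        return H.sum(axis=0)
    if mode == "mean":
        return H.mean(axis=0)
    if mode == "max":
        return H.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

# Two hypothetical graphs whose nodes all carry the same feature vector,
# but which differ in size (2 nodes versus 4 nodes).
H_small = np.array([[1.0, 2.0]] * 2)
H_large = np.array([[1.0, 2.0]] * 4)

for mode in ("sum", "mean", "max"):
    print(mode, readout(H_small, mode), readout(H_large, mode))
```

Mean and max pooling return the same vector for both graphs, whereas sum pooling separates them. This is the usual argument for preferring sum pooling when the number of nodes carries information; which operator suits emotion perception best remains an empirical question.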
GIN is designed for ​​architectural simplicity and computational efficiency​​. By incorporating ​​targeted enhancements​​ to traditional message-passing mechanisms, GIN effectively captures high-order structural features. Simultaneously, its ​​MLP-based nonlinear transformations​​ significantly improve complex pattern learning capabilities. GIN demonstrates ​​broad applicability​​ across diverse tasks including graph classification, node classification, and link prediction. It readily adapts to domain-specific requirements such as social network analysis, recommendation systems and so on. These attributes establish GIN as a ​​highly effective framework​​ for graph-structured problem-solving.
4. Model design
4.1 Multimodal data representation
Peripheral physiological signals typically manifest as high-dimensional time series.​​ Direct model ingestion incurs prohibitive computational complexity and introduces substantial ​​emotion-irrelevant noise​​, thereby impeding effective learning. Consequently, ​​feature extraction preprocessing​​ including time-domain, frequency-domain, and nonlinear feature derivation is essential to retain emotion-correlated information, reduce dimensionality and enhance computational efficiency.
4.1.1 Physiological feature representation
This study employs ​​modality-specific shallow feature extraction​​ for peripheral physiological signals:
​​Electrodermal Activity (EDA)​​: Skin conductance level (mean, SD) and SCR count (peak detection).
​​Respiration​​: Rate (breaths/min) and inspiration-expiration ratio (amplitude analysis).
​​Electrocardiogram (ECG)​​: Mean heart rate (HR) and heart rate variability (HRV) via R-wave detection.
​​Photoplethysmography (PPG)​​: Derived heart rate and pulse wave morphology.
​​Skin Temperature​​: Mean value and fluctuation (variance-based).
Feature extraction methodologies are illustrated in Fig. 2.
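A few of the shallow features listed above can be sketched in NumPy as follows. The sketch makes several assumptions that are not specified in this paper: SCR events are approximated by local maxima above a fixed threshold, heart rate variability is approximated by the standard deviation of R-R intervals (SDNN), and R-peak timestamps are assumed to be already available from an upstream detector. All function names and sample values are hypothetical.

```python
import numpy as np

def level_stats(signal):
    """Mean and standard deviation, e.g. for skin conductance level."""
    s = np.asarray(signal, dtype=float)
    return float(s.mean()), float(s.std())

def count_peaks(signal, threshold):
    """Count local maxima above a threshold (simple stand-in for SCR detection)."""
    s = np.asarray(signal, dtype=float)
    interior = s[1:-1]
    is_peak = (interior > s[:-2]) & (interior > s[2:]) & (interior > threshold)
    return int(is_peak.sum())

def heart_rate_features(r_peak_times_s):
    """Mean heart rate (beats/min) and SDNN (s) from R-peak timestamps in seconds."""
    rr = np.diff(np.asarray(r_peak_times_s, dtype=float))   # R-R intervals
    return float(60.0 / rr.mean()), float(rr.std())

# Hypothetical samples, chosen only to make the arithmetic easy to check.
eda = [0.1, 0.5, 0.2, 0.3, 0.9, 0.4, 0.1]
r_peaks = [0.0, 0.8, 1.6, 2.4]

scl_mean, scl_sd = level_stats(eda)
scr_count = count_peaks(eda, threshold=0.4)
mean_hr, sdnn = heart_rate_features(r_peaks)
print(scr_count, round(mean_hr, 1))
```

Concatenating such per-modality values for each signal segment gives one row of the node feature matrix used later for graph construction.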
Fig. 2
Peripheral physiological signal feature extraction
4.1.2 Facial expression feature representation
Facial expressions convey affective states and communicative intentions during human interaction, including core emotions such as joy, anger, sadness, surprise, disgust, and fear. These expressions typically comprise four computational components:
Action Units (AUs)​​: Fundamental facial muscle movements (e.g., orbicularis oculi contraction, frontalis lifting) encoded by the Facial Action Coding System. AU combinations generate expressions, such as smiling, frowning, anger, sadness, and so on.
​​Expression Formation​​: Compound expressions emerge through AU synergies (e.g., smile = zygomatic major activation + orbicularis oculi compression).
​​Expression Sequences: A combination of multiple facial expressions arranged sequentially, typically employed to convey complex emotional states and communicative intentions.
​​Expression Temporal Dynamics: The temporal characteristics of facial expressions, including onset, apex and offset phases, which are crucial for assessing expression intensity and duration.
The ​​Transformer's superior temporal modeling capability​​ effectively captures inter-frame dependencies in video sequences, significantly enhancing facial expression analysis. Consequently, this paper employs a Transformer architecture to extract affective features from facial expression videos.
The pipeline comprises four key phases including data preprocessing​​, spatiotemporal feature representation​​, transformer model configuration​​ and feature extraction and dimensionality reduction​.
​​Step 1: Data preprocessing.
Facial expression video data typically exists in the form of frame sequences, so it is necessary to preprocess the data to better adapt to the input requirements of the Transformer model. This mainly involves the following tasks:
Video frame segmentation refers to the process of decomposing video streams into sequential image frames.
​​Facial region processing​​ refers to the process of localizing faces per frame via CNN-based detectors (e.g., MTCNN, Dlib) and applying affine transformations (rotation/scaling) for spatial consistency.
Normalization refers to the process of scaling pixel values to a fixed range, commonly [0,1] or [-1,1] depending on the convention expected by the downstream model, to accelerate convergence.
Step 2: Spatiotemporal feature representation​.
To enable Transformer compatibility, raw video frames undergo hierarchical feature abstraction:
​​Spatial Encoding​​: A pre-trained ResNet model extracts ​​frame-wise discriminative features​​, converting pixels into high-dimensional latent representations.
​​Temporal Tokenization​​: These feature vectors serve as ​​input tokens​​ for Transformer processing, preserving spatial semantics while enabling temporal relationship modeling.
Step 3: Transformer model configuration.
The core of the Transformer lies in its ​​self-attention mechanism​​, which captures ​​temporal dependencies​​ between frames. Each frame's feature vector is combined with ​​positional encoding​​ to form the Transformer input preserving temporal order information. The ​​multi-head self-attention mechanism​​ models global dependencies across frames. Within each Transformer layer, features undergo nonlinear transformation through a ​​feedforward network​​. By stacking multiple Transformer layers, the model progressively extracts ​​higher-level representations​​.
Step 4: Feature Extraction and Dimensionality Reduction
The Transformer model outputs ​​spatiotemporal contextual representations​​ of the video frame sequence. To generate a fixed-length global feature vector, the paper applies ​​global pooling operators​​ (e.g., mean or max pooling) to aggregate frame-level features. This compresses temporal dynamics while preserving discriminative patterns essential for affective computing tasks.
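Steps 3 and 4 can be sketched with a single attention head in NumPy. The sketch assumes sinusoidal positional encoding and mean pooling, neither of which is confirmed as the configuration used in this study. It also omits multi-head projection, the feedforward network, and layer stacking, and it replaces the per-frame ResNet features with random vectors.

```python
import numpy as np

def sinusoidal_positions(num_frames, dim):
    """Sinusoidal positional encoding of shape (num_frames, dim)."""
    pos = np.arange(num_frames)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    pe = np.zeros((num_frames, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])       # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])       # odd dimensions
    return pe

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over frame tokens X of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])      # pairwise frame-to-frame scores
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 5, 8                                     # 5 frames, 8-dim stand-in frame features
frames = rng.normal(size=(T, d))
tokens = frames + sinusoidal_positions(T, d)    # inject temporal order
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

context, attn = self_attention(tokens, Wq, Wk, Wv)
video_vector = context.mean(axis=0)             # Step 4: mean pooling over frames
print(attn.shape, video_vector.shape)
```

Each row of `attn` describes how strongly one frame attends to every other frame, and the pooled `video_vector` has a fixed length regardless of the number of frames.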
Figure 3 illustrates the workflow​​ for extracting spatiotemporal affective features from facial expression videos using the Transformer architecture.
Fig. 3
Transformer-based affective feature extraction for facial expression videos​
4.2 GGMEN model architecture design
The proposed ​​GGMEN architecture​​ synergistically integrates GCN​​ and GIN​​ to harness their complementary strengths. GCN captures local spatiotemporal dependencies between graph nodes through neighborhood aggregation, while GIN enhances global representational capacity by modeling complex structural relationships via injective multiset transformations. This hybrid design leverages GCN's efficiency in ​​local dependency modeling​​ and GIN's expressiveness in ​​high-order relational reasoning​​, enabling efficient processing of multimodal emotion perception data. The architectural framework is illustrated in Fig. 4.
4.2.1 Input data preprocessing
The preprocessing pipeline for ​​peripheral physiological signals​​ (ECG, EMG, EDA) implements a ​​structured feature extraction framework​​ to generate graph-compatible representations. Time-domain features (mean, variance), frequency-domain characteristics (dominant frequency, spectral entropy), and statistical metrics (kurtosis, zero-crossing rate) are extracted from raw signals. Signal-specific processing includes: ​​R-wave detection​​ for ECG-derived heart rate variability (HRV), ​​tonic/phasic decomposition​​ for EDA-based skin conductance level (SCL), and ​​RMS normalization​​ for EMG muscle activation intensity. These features are structured into a ​​node feature matrix​​, serving as input for subsequent graph neural network architectures. This approach preserves physiological discriminability while optimizing computational efficiency through dimensionality-controlled representation.
Fig. 4
GGMEN model architecture design
For facial expression analysis, the paper employs a Transformer architecture to extract spatiotemporal affective features from video sequences. These features are structured within a facial action graph $G_f = (V_f, E_f)$, where nodes represent facial action units and edges encode bidirectional relationships between them: spatial dependencies derived from anatomical adjacency and temporal co-activation patterns.
This graph formulation preserves ​​expression dynamics​​ through:
Frame-synchronized node features (Transformer outputs).
Edge weights dynamically updated via ​​cross-modal attention​​ with physiological graphs.
Hierarchical aggregation of micro-expression motifs (e.g., rapid AU1 + AU2 activation in surprise).
The unified representation enables joint modeling of ​​neurophysiological-affective synchrony​​ in multimodal emotion perception.
4.2.2 Graph construction
The model designed in this article respectively constructs graph structures for peripheral physiological signal data and facial expression signal data.
In the peripheral physiological signal graph, nodes represent fixed-duration physiological signal segments, and edges encode temporal adjacency and statistical correlations. Each node contains extracted multidimensional features:
​​Time-domain​​: Mean, variance, zero-crossing rate
​​Frequency-domain​​: Spectral entropy, dominant frequency
​​Statistical​​: Kurtosis, peak-to-peak amplitude
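Formally, the graph is defined as $G_{physio} = (V, E, X)$, where:
$V = \{v_i \mid i \in [1, N]\}$ is the node set (N denotes the total number of nodes);
$E = \{(v_i, v_{i+1})\} \cup \{(v_i, v_j) \mid \rho_{ij} > \theta\}$ is the edge set, linking temporally adjacent segments and segment pairs whose correlation $\rho_{ij}$ exceeds a threshold $\theta$;
$X \in \mathbb{R}^{N \times F}$ is the node feature matrix (F = 14 features per segment).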
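The edge definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: it assumes the correlation between two segments is the Pearson correlation of their feature vectors, and the threshold value and the helper names `pearson` and `build_physio_edges` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # constant vector: treat as uncorrelated
    return cov / (sx * sy)

def build_physio_edges(features, theta=0.8):
    """Return the undirected edge set E for N segment nodes.

    features: list of N per-segment feature vectors (the rows of X).
    theta: correlation threshold (assumed value, not taken from the paper).
    """
    n = len(features)
    edges = set()
    # Temporal adjacency: (v_i, v_{i+1}).
    for i in range(n - 1):
        edges.add((i, i + 1))
    # Statistical correlation: (v_i, v_j) with rho_ij > theta.
    for i in range(n):
        for j in range(i + 1, n):
            if pearson(features[i], features[j]) > theta:
                edges.add((i, j))
    return sorted(edges)

# Toy example: 4 segments with 3 features each (the paper uses F = 14).
X = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 4.0, 6.1],
    [0.5, 0.1, 0.9],
]
edges = build_physio_edges(X, theta=0.8)
```

In this toy input, segments 0 and 2 are strongly correlated, so they are linked in addition to the three temporal-adjacency edges.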
Facial expressions are encoded through 68 standardized facial landmarks, with each landmark serving as a graph node characterized by a spatiotemporal feature vector. These vectors integrate spatial attributes (2D/3D coordinates) with dynamic temporal descriptors such as frame-to-frame displacement and velocity. The representation captures kinematic evolution across expression sequences, enabling modeling of micro-expression dynamics through normalized positional derivatives and trajectory curvature analysis.
The facial graph structure is defined by ​​anatomically constrained edges​​ connecting adjacent landmarks based on biomechanical relationships. Region-specific subgraphs emerge naturally:
Periorbital landmarks form eye-region clusters tracking blink dynamics.
Lip commissure points establish oral-nodal networks monitoring articulation.
Glabellar points create brow furrow patterns detecting valence shifts.
Edges incorporate ​​dynamic weighting​​ proportional to muscular linkage strength, modeling both static facial topology and transient expression-specific coordination. This dual-layer representation provides physiological grounding for graph neural networks while maintaining robustness against head pose variations through coordinate standardization.
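One way to realize anatomically constrained edges is to connect consecutive landmarks within each facial region. The sketch below assumes the widely used 68-point landmark indexing (jaw 0–16, brows 17–26, nose 27–35, eyes 36–47, mouth 48–67). The paper does not state which indexing or edge list it uses, so the region table and the function name `build_face_edges` are illustrative assumptions, and the dynamic edge weights are omitted.

```python
# Region name -> (landmark indices, whether the contour is closed).
# Index ranges follow the common 68-point layout (an assumption here).
REGIONS = {
    "jaw": (range(0, 17), False),
    "right_brow": (range(17, 22), False),
    "left_brow": (range(22, 27), False),
    "nose_bridge": (range(27, 31), False),
    "nose_base": (range(31, 36), False),
    "right_eye": (range(36, 42), True),
    "left_eye": (range(42, 48), True),
    "outer_lip": (range(48, 60), True),
    "inner_lip": (range(60, 68), True),
}

def build_face_edges(regions=REGIONS):
    """Connect neighbouring landmarks inside each region.

    Closed contours (eyes, lips) also link the last landmark back to the first.
    """
    edges = []
    for indices, closed in regions.values():
        idx = list(indices)
        edges.extend(zip(idx[:-1], idx[1:]))
        if closed:
            edges.append((idx[-1], idx[0]))
    return edges

face_edges = build_face_edges()
face_nodes = {node for edge in face_edges for node in edge}
```

Each region forms its own connected subgraph, which mirrors the eye, mouth, and brow clusters described above. Cross-region edges would have to be added separately.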
4.2.3 GCN-GIN hybrid architecture​
The proposed hybrid model integrates ​​Graph Convolutional Networks (GCN)​​ and ​​Graph Isomorphism Networks (GIN)​​ in a cascaded architecture:
GCN Layer processes graph-structured data through ​​neighborhood aggregation​​, capturing local spatiotemporal dependencies between adjacent nodes.
GIN Layer enhances representational capacity through injective neighborhood aggregation with learnable parameters. Because injective aggregation maps structurally different neighborhoods to different embeddings, this layer captures complex higher-order dependencies, for example facial-physiological cross-modal interactions.
The GCN → GIN cascade enables hierarchical feature learning: local feature extraction → global relationship modeling → task-optimized representations for emotion classification. Cross-layer skip connections mitigate over-smoothing while residual gating dynamically balances modality contributions.
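The difference between the two aggregation schemes can be illustrated with a small pure-Python sketch. It shows only the standard propagation rules (symmetric-normalized aggregation for GCN, sum aggregation with a self-term for GIN). The learnable weight matrix, the MLP, and the skip connections are omitted, so this is not the paper's layer implementation, and the function names are hypothetical.

```python
import math

def gcn_aggregate(adj, h):
    """GCN-style propagation D^{-1/2} (A + I) D^{-1/2} H, weights omitted."""
    n = len(adj)
    # Add self-loops, then normalise each edge by the degrees of both endpoints.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    dim = len(h[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in range(n):
            if a_hat[i][j]:
                w = a_hat[i][j] / math.sqrt(deg[i] * deg[j])
                for k in range(dim):
                    row[k] += w * h[j][k]
        out.append(row)
    return out

def gin_aggregate(adj, h, eps=0.0):
    """GIN-style aggregation (1 + eps) * h_v + sum of neighbours, MLP omitted."""
    n = len(adj)
    dim = len(h[0])
    out = []
    for i in range(n):
        row = [(1.0 + eps) * h[i][k] for k in range(dim)]
        for j in range(n):
            if adj[i][j]:
                for k in range(dim):
                    row[k] += h[j][k]
        out.append(row)
    return out

# Star graph: node 0 is linked to nodes 1..3; every node carries feature [1.0].
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
h = [[1.0], [1.0], [1.0], [1.0]]
gcn_out = gcn_aggregate(star, h)
gin_out = gin_aggregate(star, h)
```

With identical input features, the normalized GCN aggregation damps the difference between the hub and the leaf nodes, whereas the GIN sum keeps the neighborhood size visible in the output (4 for the hub, 2 for each leaf). This is the property that makes sum aggregation more discriminative.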
4.2.4 Modality fusion
By incorporating attention mechanisms into GCN and GIN, the model more effectively performs weighted fusion of multimodal data. This enables automatic learning of the relative importance of different modalities, thereby enhancing both the performance and robustness of the emotion perception system. The ​​modality-wise attention mechanism​​ is crucial, allowing flexible handling of diverse data features and optimization of emotion classification efficacy.
​​Modality-wise Attention:​​ This mechanism performs weighted fusion of feature representations. ​​It assigns weights​​ to cross-modal interactions based on importance, improving fusion effectiveness.
​​Weighted Concatenation:​​ Features are concatenated ​​with a weighting mechanism modulating​​ each modality’s contribution. Specifically, GCN/GIN output vectors ​​are weighted, then concatenated​​ before propagation to subsequent layers.
Physiological signal modality Xphys and facial expression modality Xface are processed through GCN and GIN, ​​respectively​​, yielding feature representations Hphys and Hface. ​​A modality-wise attention mechanism​​ is then ​​employed​​ to perform weighted fusion:
5
Where αphys and αface denote the ​​attention weights​​ computed by the modality-wise attention mechanism, ​​quantifying each modality's contribution​​ to the fused representation.
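As an illustration of how such modality-wise weights can be computed and applied, the following minimal Python sketch assumes that the fused representation is the attention-weighted sum of Hphys and Hface and that both representations share the same dimension. The scoring vectors `w_phys` and `w_face` are hypothetical stand-ins for learned attention parameters; the paper's exact scoring function is not specified here.

```python
import math

def modality_attention_fusion(h_phys, h_face, w_phys, w_face):
    """Fuse two modality vectors with softmax-normalised attention weights.

    Returns (h_fused, alpha_phys, alpha_face), with alpha_phys + alpha_face = 1.
    """
    # Scalar relevance score per modality (dot product with a scoring vector).
    s_phys = sum(a * b for a, b in zip(h_phys, w_phys))
    s_face = sum(a * b for a, b in zip(h_face, w_face))
    # Numerically stable softmax over the two scores.
    m = max(s_phys, s_face)
    e_phys, e_face = math.exp(s_phys - m), math.exp(s_face - m)
    z = e_phys + e_face
    alpha_phys, alpha_face = e_phys / z, e_face / z
    # Assumed fusion rule: element-wise weighted sum of the two representations.
    h_fused = [alpha_phys * p + alpha_face * f for p, f in zip(h_phys, h_face)]
    return h_fused, alpha_phys, alpha_face

h_fused, a_phys, a_face = modality_attention_fusion(
    h_phys=[1.0, 0.0], h_face=[0.0, 1.0], w_phys=[1.0, 1.0], w_face=[1.0, 1.0]
)
```

When both modalities receive the same score, the weights are equal and the fused vector is the plain average. A higher score for one modality shifts the fused representation toward it.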
4.2.5 Fully connected layer
Following modality fusion, the integrated feature representation Hfused propagates through the fully connected layer. This layer applies ReLU activation and extracts high-level discriminative features for emotion classification.
4.2.6 Output layer
The output layer uses the Softmax activation function for multi-class emotion classification (e.g., happy, sad, angry).
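The classification head described in Sections 4.2.5 and 4.2.6 can be sketched as a dense layer with ReLU followed by a dense layer with Softmax. The weights and dimensions below are toy values chosen for illustration only; they are not taken from the trained model.

```python
import math

def dense(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    """Numerically stable softmax returning class probabilities."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    z = sum(exps)
    return [e / z for e in exps]

# Toy fused feature vector and hypothetical layer parameters.
h_fused = [0.5, -1.0, 2.0]
W1, b1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]
W2, b2 = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], [0.0, 0.0, 0.0]

hidden = relu(dense(h_fused, W1, b1))   # fully connected layer + ReLU
probs = softmax(dense(hidden, W2, b2))  # output layer: class probabilities
predicted_class = probs.index(max(probs))
```

The predicted emotion is the class with the highest probability. In training, the same probabilities would feed a cross-entropy loss.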
5. Experiment design
5.1 Dataset introduction
This study employs the DEAP dataset (Koelstra et al. [19]) for model validation. Compiled by researchers at Queen Mary University of London, DEAP is a publicly available dataset containing multimodal recordings of human emotional responses. It includes electroencephalogram (EEG), electrocardiogram (ECG), electrodermal activity (EDA), electromyogram (EMG), and facial expression videos. The simultaneous capture of facial expressions and physiological signals enables robust analysis of emotion-expression correlations [26]. Given its multimodal nature, DEAP is a well-suited benchmark for emotion perception models. For real-time emotion inference in practical scenarios, the paper prioritizes readily acquirable modalities. While EEG requires specialized equipment, peripheral physiological signals (ECG, EDA, EMG) can be captured via consumer-grade wearables and facial expressions via ordinary cameras. Consequently, this research focuses on these two pragmatic modalities.
The DEAP dataset operationalizes emotions according to ​​Russell's Circumplex Model ​​[27], a dimensional theory formalizing affective states through two orthogonal axes:
​​Valence​​: Quantifies the hedonic value of emotions (positive vs. negative).
​​Arousal​​: Measures the physiological intensity of emotions (calm vs. excited).
Complementing these core dimensions, DEAP incorporates two supplementary metrics [19]:
​​Dominance​​: Assesses the perceived control level over emotional stimuli.
Liking​​: Gauges subjective preference toward stimulus content.
This framework enables ​​continuous 9-point scale ratings​​ (1–9) of emotional experiences, providing ​​fine-grained continuous descriptors​​ of affective states. Such dimensional quantification offers richer nuance than categorical models (e.g., joy, anger).
5.2 Model implementation
The algorithm design and model implementation follow the GGMEN architecture described above and proceed in six steps.
Step 1: Data Preprocessing. Shallow feature extraction methods were applied to the peripheral physiological signals, and the Transformer model was used to extract facial expression features from the facial expression videos.
Threshold rules were applied to map the dimensional ratings (Valence, Arousal, Dominance, Liking) from the DEAP dataset into discrete emotion categories. Based on the 1–9 ratings, the samples were classified into nine categories: joy, excited, anger, pressure, sad, fear, surprise, calm, and boring. The threshold criteria for this mapping are listed in Table 1.
Table 1
Threshold rules
| Emotion | Valence | Arousal | Dominance | Liking |
| --- | --- | --- | --- | --- |
| Joy | High (6–9) | Moderate-High (5–9) | Moderate-High (5–9) | High (6–9) |
| Excited | High (6–9) | High (7–9) | High (6–9) | High (6–9) |
| Anger | Low (1–4) | High (7–9) | Low (1–4) | Low (1–4) |
| Pressure | Low (1–4) | High (7–9) | Moderate (4–6) | Low (1–4) |
| Sad | Low (1–4) | Low (1–4) | Low (1–4) | Low (1–4) |
| Fear | Low (1–4) | High (7–9) | Very Low (1–3) | Low (1–4) |
| Surprise | Moderate (4–6) | Extremely High (8–9) | Moderate (4–6) | Moderate (4–6) |
| Calm | High (6–9) | Low (1–4) | High (6–9) | High (6–9) |
| Boring | Low (1–4) | Low (1–4) | Moderate-High (5–9) | Very Low (1–3) |
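The threshold rules in Table 1 can be expressed as a small lookup routine. The sketch below is illustrative only. Several rules overlap (for example, every rating that satisfies Fear also satisfies Anger, and every rating that satisfies Excited also satisfies Joy), and the table does not state how such ties are resolved. This sketch therefore assumes that the narrower rule is checked first, and it returns `None` when no rule matches. The names `RULES` and `map_emotion` are hypothetical.

```python
# Each rule: (label, valence, arousal, dominance, liking) with inclusive bounds.
# Narrower rules are listed before the broader rules that contain them
# (an assumed tie-breaking order; Table 1 does not specify one).
RULES = [
    ("excited",  (6, 9), (7, 9), (6, 9), (6, 9)),
    ("joy",      (6, 9), (5, 9), (5, 9), (6, 9)),
    ("fear",     (1, 4), (7, 9), (1, 3), (1, 4)),
    ("anger",    (1, 4), (7, 9), (1, 4), (1, 4)),
    ("pressure", (1, 4), (7, 9), (4, 6), (1, 4)),
    ("sad",      (1, 4), (1, 4), (1, 4), (1, 4)),
    ("surprise", (4, 6), (8, 9), (4, 6), (4, 6)),
    ("calm",     (6, 9), (1, 4), (6, 9), (6, 9)),
    ("boring",   (1, 4), (1, 4), (5, 9), (1, 3)),
]

def map_emotion(valence, arousal, dominance, liking):
    """Return the first emotion label whose four ranges all contain the ratings."""
    ratings = (valence, arousal, dominance, liking)
    for label, *bounds in RULES:
        if all(lo <= r <= hi for r, (lo, hi) in zip(ratings, bounds)):
            return label
    return None  # rating combination not covered by any rule

label = map_emotion(valence=2, arousal=2, dominance=2, liking=2)
```

Because DEAP ratings are continuous, values that fall between two ranges (such as 4.5 on a dimension split into 1–4 and 5–9) match no rule in this sketch. How such samples are handled is a design decision the mapping would need to make explicit.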
The preprocessing pipeline generates two distinct feature matrices:
Physiological signal features: 1280×15 matrix (1280 samples × 14 features + 1 emotion label column).
Facial expression features: 874×515 matrix (874 samples × 514 features + 1 emotion label column).
Step 2: Graph Construction​. The peripheral physiological data graph and facial expression data graph were constructed separately through a systematic process. First, node features were initialized to represent the fundamental units of each graph. For the peripheral physiological data, edges were then established based on biological signal correlations, while for the facial expression data, edges were defined according to predefined anatomical relationships between facial key points, such as those derived from the facial action coding system (FACS). Finally, both the node features and edge connections were converted into PyTorch tensors to ensure compatibility with subsequent deep learning processing. This structured approach effectively transforms multimodal data into graph representations suitable for neural network analysis.
Step 3: Model Construction. The GCN, GIN, and attention mechanism modules are constructed, all of which use ReLU as the activation function.
Step 4: Model Fusion. During fusion, the input dimensions of both physiological and facial features are first calculated. The physiological data is then processed through two GCN layers while the facial features are processed through one GIN layer, followed by weighted modality fusion using an attention mechanism.
Step 5: Classification Layer​​. The fused feature vector undergoes processing through a fully connected layer, where the Softmax activation function generates probabilistic outputs for multi-class emotion classification (e.g., happiness, sadness, anger).
​​Step 6: Model Training​​. The model undergoes supervised training using backpropagation with gradient-based optimization. The training process employs standard techniques such as batch normalization and dropout to enhance learning stability and prevent overfitting.
5.3 Experimental process
The experimental environment for this paper is as follows:
Processor: Intel® Xeon® Gold 5218CR CPU
Memory: 128GB DDR4 RAM
Operating System: Windows 11 Pro
Software Stack: Python 3.11.7, PyTorch 2.6.0, CUDA 12.6 (cu126).
5.3.1 Model parameter tuning experiment
The model was trained on the DEAP dataset using the Adam optimizer, with hyperparameters including epochs, learning rate, weight decay, and dropout rate selected for tuning. Performance was evaluated based on accuracy, precision, and F1-score. The experiments were conducted in two stages: first, a broad parameter search was performed to identify promising ranges, followed by a refined search within narrowed ranges to determine the optimal parameter combination.
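For reference, the three evaluation metrics can be computed from true and predicted labels as sketched below. The paper does not state how precision and F1-score are averaged across the nine classes, so macro averaging is an assumption here, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro F1) for multi-class labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, f1_scores = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        f1_scores.append(f1)
    # Macro averaging: every class counts equally, regardless of its size.
    macro_precision = sum(precisions) / len(labels)
    macro_f1 = sum(f1_scores) / len(labels)
    return accuracy, macro_precision, macro_f1

y_true = ["joy", "joy", "sad", "sad", "calm", "calm"]
y_pred = ["joy", "sad", "sad", "sad", "calm", "joy"]
acc, prec, f1 = classification_metrics(y_true, y_pred)
```

Under macro averaging, precision and F1-score can differ noticeably from accuracy when some classes are predicted much better than others.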
The first experiment represents the initial phase of model parameter optimization, aimed at identifying the parameter ranges that yield better performance.
Hyperparameters are pre-set configuration parameters manually defined before training deep learning models, whose choices significantly shape the model's convergence speed, generalization capability, and ultimate performance. Epochs determine how many times the model processes the entire training dataset, influencing the extent of pattern recognition. Learning rate controls the step size for parameter updates, serving as the core hyperparameter in gradient descent. Weight decay regulates penalty terms to suppress overfitting and enhance generalization, while dropout randomly drops out units during training to force the model to learn redundant representations, thus preventing overfitting.
Guided by theoretical insights, empirical experience, and considerations of model complexity and computational cost, the experiment initially defines broad ranges for four key parameters: epochs, learning rate, weight decay, and dropout rate. A relatively large step size is used for each parameter to efficiently explore the search space. The Adam optimizer is employed to ensure stable performance and accelerate the training process.
The parameter ranges established for the first experimental phase were configured as follows:
epochs: [50, 100, 150],
learning rate: [0.1, 0.01, 0.001],
weight decay: [0.0001, 0.001, 0.01],
dropout rate: [0.2, 0.3, 0.5]
The optimal parameters obtained in this experiment were: epochs = 50, learning rate = 0.001, weight decay = 0.01, dropout = 0.3, achieving an F1-score of 0.7082.
Based on the optimal parameters from the first experiment, the parameter ranges were reset by adjusting values within the same order of magnitude in the second experiment. The refined ranges were set as follows:
epochs: [20,30,40,50,60,70,80]
learning rate: [0.0006,0.0008,0.001,0.002,0.003,0.004]
weight decay: [0.006,0.008,0.01,0.015,0.02]
dropout rate: [0.2,0.25,0.3,0.35,0.4,0.45,0.5]
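The two search stages correspond to exhaustive grids over the listed values. A minimal sketch using Python's `itertools.product` shows how the configurations can be enumerated. The dictionary names and the helper `expand_grid` are illustrative, and the training call itself is omitted.

```python
from itertools import product

# Stage 1: coarse search (3 values per hyperparameter).
coarse_grid = {
    "epochs": [50, 100, 150],
    "learning_rate": [0.1, 0.01, 0.001],
    "weight_decay": [0.0001, 0.001, 0.01],
    "dropout": [0.2, 0.3, 0.5],
}

# Stage 2: refined search around the best stage-1 configuration.
fine_grid = {
    "epochs": [20, 30, 40, 50, 60, 70, 80],
    "learning_rate": [0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004],
    "weight_decay": [0.006, 0.008, 0.01, 0.015, 0.02],
    "dropout": [0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5],
}

def expand_grid(grid):
    """Return one dict per hyperparameter combination in the grid."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

coarse_configs = expand_grid(coarse_grid)  # 3 * 3 * 3 * 3 combinations
fine_configs = expand_grid(fine_grid)      # 7 * 6 * 5 * 7 combinations
```

The refined grid contains 7 × 6 × 5 × 7 = 1,470 configurations, and the coarse grid contains 3⁴ = 81.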
In the second experiment, every combination of the refined values for epochs, learning rate, weight decay, and dropout was trained, giving 1,470 training runs (7 × 6 × 5 × 7) whose evaluation metrics were compared. Table 2 lists the six best-performing configurations in descending order of accuracy.
Table 2
The evaluation results of the model parameters of the top 6 models
| Epochs | Learning rate | Weight decay | Dropout | Accuracy | Precision | F1-score |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.001 | 0.008 | 0.4 | 0.8125 | 0.71175 | 0.7507 |
| 70 | 0.0006 | 0.006 | 0.2 | 0.8047 | 0.74134 | 0.7502 |
| 60 | 0.0006 | 0.02 | 0.35 | 0.8047 | 0.70310 | 0.7492 |
| 80 | 0.0008 | 0.006 | 0.4 | 0.7969 | 0.69324 | 0.7382 |
| 60 | 0.0008 | 0.006 | 0.25 | 0.7891 | 0.71955 | 0.7381 |
| 80 | 0.0006 | 0.006 | 0.25 | 0.7813 | 0.68546 | 0.7300 |
Table 2 shows that the best-performing configuration uses epochs = 50, learning rate = 0.001, weight decay = 0.008, and dropout = 0.4, yielding an accuracy of 0.8125 and an F1-score of 0.7507.
5.3.2 Analysis of the impact of hyperparameters on model
Based on parameter combinations yielding peak model performance, this article systematically analyzes how individual hyperparameters impact model behavior.
When analyzing the impact of epoch on model performance, with fixed hyperparameters learning rate = 0.001, weight decay = 0.008, and dropout = 0.4, the paper systematically analyzed variations in accuracy, precision, and F1-score across epoch values [20, 30, 40, 50, 60, 70, 80]. The experimental results are visualized in Fig. 5.
Fig. 5
Impact of epoch on model performance
Experimental results demonstrate that the model achieves optimal performance at 50 epochs. Accuracy remains consistently high (0.7802–0.8125) and peaks at epoch 50; the minor fluctuations afterwards indicate stable overall classification capability on the test data. Precision drops at epoch 60 and, despite a slight rebound, trends downward overall, reflecting decreasing reliability of positive-instance recognition. The F1-score fluctuates between 0.62 and 0.75, also peaking at epoch 50. Its close correlation with the precision trend suggests that the balance between precision and recall shifts as training proceeds.
When analyzing the impact of learning rate on model performance, with fixed hyperparameters epoch = 50, weight decay = 0.008, and dropout = 0.4, the paper quantitatively assessed variations in accuracy, precision, and F1-score across learning rates [0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004]. The corresponding experimental outcomes are illustrated in Fig. 6.
Fig. 6
Impact of learning rate on model performance
Experimental results indicate that a learning rate of approximately 0.001 yields optimal model performance in classification accuracy, positive-class reliability, and precision-recall balance on the test set. At learning rate = 0.001, classification accuracy peaks at 0.8125 and the F1-score reaches its maximum of 0.7507, indicating a good balance between precision and recall. The precision trend corroborates these observations, establishing 0.001 as the best setting among the tested values. Conversely, learning rates above 0.001 cause progressive performance degradation: classification accuracy declines by 2.8–5.4%, indicating compromised generalization; precision decreases monotonically (Δ ≈ 12%), reflecting more false positives; and the F1-score deteriorates accordingly, confirming an overall loss of precision-recall balance.
When analyzing the impact of weight decay on model performance, with fixed hyperparameters epoch = 50, learning rate = 0.001, and dropout = 0.4, the paper systematically evaluated variations in accuracy, precision, and F1-score over weight decay values [0.006, 0.008, 0.01, 0.015, 0.02]. Experimental results are presented in Fig. 7.
Fig. 7
Impact of weight decay on model performance
The experimental results show that a weight decay of 0.008 gives the best performance on the test set in terms of classification accuracy, reliability of positive predictions, and precision-recall balance.
At weight decay = 0.008, the test accuracy reaches its peak, indicating that this regularization strength gives the best overall classification ability on the test data. The F1-score is also at its highest, meaning that the model best balances precision and recall. Precision likewise peaks, so a high proportion of the instances predicted as positive are truly positive.
When the weight decay deviates from 0.008 in either direction, accuracy decreases to varying degrees. When the regularization is too strong (weight decay > 0.008), the model is prevented from learning features that are useful for classification. When it is too weak (weight decay < 0.008), the model tends to overfit the training data and generalizes poorly to new test data.
When analyzing the impact of dropout on model performance, with fixed hyperparameters epoch = 50, learning rate = 0.001, and weight decay = 0.008, the paper systematically evaluated model performance across dropout rates [0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5], with results presented in Fig. 8.
Fig. 8
Impact of dropout on model performance
Experimental results demonstrate that a dropout rate of 0.40 achieves optimal model performance in terms of classification accuracy, positive-class reliability, and precision-recall balance on the test set. When examining dropout rates within the 0.20–0.40 range, the test accuracy remains consistently high, indicating that this range provides appropriate regularization strength that effectively mitigates overfitting while preserving the model's learning capacity. The F1-scores remain relatively stable in this range, showing good balance between precision and recall. Although precision shows some fluctuation, it generally aligns with accuracy and F1-score trends without exhibiting extreme deviations.
At a dropout rate of 0.45, precision drops significantly to approximately 0.60, suggesting that higher dropout ratios lead to increased misclassification of positive instances, potentially due to excessive disruption of learned features. The F1-score decreases correspondingly, indicating an imbalance between precision and recall. While accuracy shows only a modest decline, this smaller reduction compared to precision can be attributed to factors such as the proportion of negative samples in the classification task.
When the dropout rate increases further to 0.50, both precision and F1-score partially recover, but overall performance remains below that obtained at 0.40. The model therefore retains some capacity to adapt to high dropout rates, yet excessive dropout still degrades its final performance metrics.
5.3.3 Ablation experiment
The proposed GGMEN model employs a hybrid architecture combining GCN and GIN to integrate facial expressions and peripheral physiological signals for multimodal emotion recognition. To quantify the contributions of individual components, the paper conducts systematic ablation experiments by removing core modules while maintaining the optimal hyperparameters: ​​epochs = 50, learning rate = 0.001, weight decay = 0.008, and dropout = 0.4​​. The ablation experiment design is summarized in ​​Table 3.
Table 3
Ablation experiment design​​​​
| Component Type | Experimental Group | Control Group |
| --- | --- | --- |
| Data Modality | Unimodal: Facial Expressions | Full multimodal input |
| Data Modality | Unimodal: Peripheral Physiological Signals | Full multimodal input |
| Architecture | GCN replaced with Standard CNN | Complete GGMEN (GCN + GIN) |
The ablation study comprises data modality ablation experiments and network module ablation experiments.
The data modality ablation experiment aims to evaluate the individual contributions of ​​facial expressions ​​(FE) and ​​peripheral physiological signals (PPS)​​, determining whether multimodal fusion enhances emotion recognition performance compared to unimodal inputs. The experiments assess whether retaining both modalities improves model efficiency and accuracy. The experimental process is as follows:
​​Step 1: Ablating facial expressions​​
​​Experimental Group​​: Input only ​​PPS data​​ (GGMEN architecture unchanged).
​​Control Group​​: Full multimodal input with GGMEN architecture.
​​Evaluation Metric​​: Compare ​​accuracy​​ between groups to quantify FE’s impact on performance.
​​Step 2: Ablating peripheral physiological signals​​
​​Experimental Group​​: Input only ​​ facial expressions data​​ (GGMEN architecture unchanged).
​​Control Group​​: Full multimodal input with GGMEN architecture.
​​Evaluation Metric​​: Compare ​​accuracy​​ between groups to assess PPS’s contribution.
The results of the data modality ablation experiments are presented in Table 4 below.
Table 4
Data Modality Ablation Experiment Results​​
| Model Variant | Modality Input | Dataset | Accuracy (%) |
| --- | --- | --- | --- |
| GGMEN (Full) | Facial Expressions + Peripheral Physiological Signals | DEAP | 81.25 |
| GCN + GIN | Facial Expressions | DEAP | 74.34 |
| GCN + GIN | Peripheral Physiological Signals | DEAP | 72.68 |
Experimental results demonstrate that the model achieves an accuracy of 81.25% when utilizing both facial expressions and peripheral physiological signals as multimodal inputs. In contrast, the accuracy drops significantly to 74.34% with facial expressions alone and 72.68% with peripheral physiological signals alone. These findings confirm that multimodal fusion provides richer feature representations, thereby effectively enhancing emotion recognition performance and validating the value of multimodal integration.​​
​​Comparative analysis of the unimodal configurations reveals that facial expressions (74.34%) yield higher accuracy than peripheral physiological signals (72.68%), indicating that facial data contains more discriminative information for this specific task under the current experimental setup. However, both unimodal approaches exhibit performance limitations due to inherent information deficiency, underscoring the superiority of multimodal fusion.​​
​​Notably, the same GCN + GIN architecture demonstrates varying performance levels depending on the input modalities, highlighting its adaptability to multimodal data while simultaneously emphasizing that modality selection and combination constitute critical factors influencing model effectiveness. Future work should further investigate modality complementarity and feature fusion strategies to optimize performance.​
The network module ablation experiment aims to validate the functional contributions of both the GCN and GIN layers in feature extraction and multimodal fusion, and to assess their necessity as key components for performance enhancement. All tests were conducted using the model's optimal hyperparameters (epochs = 50, learning rate = 0.001, weight decay = 0.008, dropout = 0.4) with fused multimodal input (facial expressions + peripheral physiological signals). The experimental process is as follows:
​​​​​Step 1: GCN Layer Ablation​​
Experimental Group: Replace GCN layers with standard convolutional layers while retaining all other components (GIN layers, fusion mechanism, etc.)
Control Group: Original GGMEN architecture
Evaluation Metric: Accuracy comparison to quantify GCN's advantage in graph-structured data processing
​​Step 2: GIN Layer Ablation​​
Experimental Group: Remove GIN layers while maintaining GCN layers and other components
Control Group: Original GGMEN architecture
Evaluation Metric: Accuracy comparison to assess GIN's role in feature refinement and multimodal fusion
​​The quantitative results of network module ablation are presented in ​​Table 5​​.
Table 5
Network module ablation experiment results​​
| Model Variant | Modality Input | Dataset | Accuracy (%) |
| --- | --- | --- | --- |
| GGMEN | Facial Expressions + Peripheral Physiological Signals | DEAP | 81.25 |
| CNN + GIN | Facial Expressions + Peripheral Physiological Signals | DEAP | 79.63 |
| GCN | Facial Expressions + Peripheral Physiological Signals | DEAP | 77.68 |
Experimental results reveal clear performance differences among the three models under identical multimodal conditions. The full GGMEN model achieves the highest accuracy, 81.25% on the DEAP dataset, ahead of the CNN + GIN variant (79.63%) and the GCN-only variant (77.68%). This advantage stems from GGMEN's fusion strategy, which integrates facial expressions and peripheral physiological signals to capture complementary cross-modal correlations while reducing information redundancy and noise interference.
The CNN + GIN hybrid model exhibits notable limitations. Its naive fusion operations fail to model complex cross-modal interactions, and rasterizing the face into a grid disrupts physiological constraints such as eyebrow-eye muscle synergy. The standalone GCN model also performs suboptimally due to inherent architectural constraints: it cannot process raw visual data and therefore requires pre-extracted facial landmarks, and its limited temporal modeling capability yields coarse feature granularity, which is particularly detrimental when capturing subtle physiological patterns such as orbicularis oculi microtremors or RR interval fluctuations. The GGMEN framework addresses these limitations through hierarchical feature extraction and biologically grounded fusion mechanisms, supporting robust multimodal emotion recognition.
5.3.4 Model comparison
To validate model efficacy, the paper benchmarks the proposed architecture against contemporary state-of-the-art approaches, with comparative results summarized in Table 6.
Table 6
Model comparison
| Study | Modalities | Dataset | Architecture | Accuracy (%) |
| --- | --- | --- | --- | --- |
| GGMEN | Facial expressions, Peripheral physiological signals | DEAP | GCN + GIN | 81.25 |
| Yassine Ouzar [28] | Facial expressions, Peripheral physiological signals | BP4D+ | 3D squeeze-and-excitation based 3D Xception architecture | 71.9 |
| Yuan Wenzhen [29] | EEG signals, Facial expressions | DEAP | LMF | 89.33 |
| Li Jing [30] | EEG signals, Peripheral physiological signals, Facial expressions | DEAP | NMSNet | 86.5 |
Performance comparisons show that GGMEN outperforms the other model that processes the same input modalities (facial expressions + peripheral physiological signals): it achieves 81.25% accuracy compared with 71.9% for Ouzar [28], a 9.35 percentage point advantage. Because [28] was evaluated on BP4D+ rather than DEAP, this comparison is indicative rather than like-for-like. While the models of Yuan [29] (89.33%) and Li [30] (86.5%) attain higher accuracy, their performance stems primarily from incorporating electroencephalography (EEG) signals, which are highly sensitive to emotional fluctuations. EEG-dependent approaches, however, require specialized scalp-electrode equipment, whereas GGMEN relies exclusively on readily acquirable modalities, offering better deployment feasibility and user comfort.
The comparison clarifies where GGMEN's advantage lies: its GCN + GIN architecture represents multimodal relationships through graph-based topology modeling, and the injective aggregation of GIN helps detect subtle pattern variations that conventional convolutional operations and simple feature fusion mechanisms, with their limited relational reasoning capacity, tend to miss. These findings support the architectural design underpinning GGMEN's performance in EEG-free emotion recognition.
6. Discussion and conclusion
This paper addresses the challenges of multimodal information fusion, inadequate modeling of cross-modal interactions, and high computational costs in multimodal emotion perception by proposing GGMEN, a novel framework integrating the local neighborhood aggregation capability of GCN with the global structural representation capability of GIN. For physiological signals, 14 handcrafted features are extracted via joint time-frequency analysis, covering statistical properties across the time, spectral, and entropy domains. For facial expressions, a Transformer derives spatial features per frame and models temporal dependencies across frames. Dedicated GCN branches process spatial facial landmarks and temporal physiological patterns, while the GIN layer strengthens high-order cross-modal fusion. Experiments show that GGMEN achieves 81.25% accuracy and a 0.7507 F1-score with optimal hyperparameters.
However, the proposed model currently relies exclusively on physiological signals and facial expressions​​, without incorporating core modalities such as voice or text, ​​resulting in inherent limitations for recognizing emotions in complex scenarios. Additionally, the demographic bias of the DEAP dataset​​ ​​compromises generalizability across cultures and age groups. To address these challenges, future work will​​ design a contrastive learning-based feature alignment module to bridge the semantic gap hindering fusion between voice and physiological signals, and employ conditional GANs to synthesize culturally diverse facial expressions and physiological response patterns, ​​thereby expanding training data diversity to ultimately enhance cross-demographic adaptability and model robustness​​.
Ethics Declaration
Not applicable.
Consent to Participate
Not applicable.
Consent to Publish
Not applicable.
Funding
This work was supported by the major fund project of the president of Xi'an Siyuan University in 2025 under Grant No. XASYB24ZHD02 and “Yulin Science and Technology Light” Talent Project under Grant No. 2024-KJZG-ZQNLJ-007.
Data Availability
The datasets analyzed for this study can be found at https://www.eecs.qmul.ac.uk/mmv/datasets/deap/.
Conflicts of Interest
The authors declare that they have no competing interests.
Author Contribution
Conceptualization, Y.W. and E.A.S.; Methodology, Y.W.; Software, Y.W.; Validation, E.A.S.; Writing – Original Draft Preparation, Y.W.; Writing – Review & Editing, Y.W.; Supervision, E.A.S.; Project Administration, E.A.S.
References
1.
Kalateh S, Estrada-Jimenez LA, Nikghadam-Hojjati S, Barata J. A systematic review on multimodal emotion recognition: building blocks, current state, applications, and challenges. IEEE Access. 2024;12:103976–4019. https://doi.org/10.1109/ACCESS.2024.3430850.
2.
Zhu L, Zhu Z, Zhang C, Xu Y, Kong X. Multimodal sentiment analysis based on fusion methods: A survey. Inform Fusion. 2023;95:306–25. https://doi.org/10.1016/j.inffus.2023.02.028.
3.
Zhao T, Meng L, Song D. Multimodal aspect-based sentiment analysis: a survey of tasks, methods, challenges and future directions. Inform Fusion. 2024;112:102552. https://doi.org/10.1016/j.inffus.2024.102552.
4.
Yu J, Chen K, Xia R. Hierarchical interactive multimodal transformer for aspect-based multimodal sentiment analysis. IEEE Trans Affect Comput. 2022;14(3):1966–78. https://doi.org/10.1109/TAFFC.2022.3171091.
5.
Sun H, Chen YW, Lin L. TensorFormer: A tensor-based multimodal transformer for multimodal sentiment analysis and depression detection. IEEE Trans Affect Comput. 2022;14(4):2776–86. https://doi.org/10.1109/TAFFC.2022.3233070.
6.
Geetha AV, Mala T, Priyanka D, Uma E. Multimodal emotion recognition with deep learning: advancements, challenges, and future directions. Inform Fusion. 2024;105:102218. https://doi.org/10.1016/j.inffus.2023.102218.
7.
Gandhi A, Adhvaryu K, Poria S, Cambria E, Hussain A. Multimodal Sentiment Analysis: A Systematic Review of History, Datasets, Multimodal Fusion Methods, Applications, Challenges and Future Directions. Inform Fusion. 2023;91:424–44. https://doi.org/10.1016/j.inffus.2022.09.025.
8.
Cai Y, Li X, Zhang Y, Li J, Zhu F, Rao L. Multimodal sentiment analysis based on multi-layer feature fusion and multi-task learning. Sci Rep. 2025;15(1):2126. https://doi.org/10.1038/s41598-025-85859-6.
9.
Chao H, Dong L. Emotion Recognition Using Three-Dimensional Feature and Convolutional Neural Network from Multichannel EEG Signals. IEEE Sens J. 2021;21(2):2024–34. https://doi.org/10.1109/JSEN.2020.3020828.
10.
Filali H, Riffi J, Boulealam C, Mahraz MA, Tairi H. Multimodal emotional classification based on meaningful learning. Big Data Cogn Comput. 2022;6(3):95. https://doi.org/10.3390/bdcc6030095.
11.
Ghorbanali A, Sohrabi MK. Capsule network-based deep ensemble transfer learning for multimodal sentiment analysis. Expert Syst Appl. 2024;239:122454. https://doi.org/10.1016/j.eswa.2023.122454.
12.
El-Deen DTN, El-Sayed RS, Hussein AM, Zaki MS. Multi-label Classification for Sentiment Analysis Using CBGA Hybrid Deep Learning Model. Eng Lett. 2024;32(2).
13.
Bilotti U, Bisogni C, De Marsico M, Tramonte S. Multimodal Emotion Recognition via Convolutional Neural Networks: Comparison of different strategies on two multimodal datasets. Eng Appl Artif Intell. 2024;130:107708. https://doi.org/10.1016/j.engappai.2023.107708.
14.
Aslam A, Sargano AB, Habib Z. Attention-based Multimodal Sentiment Analysis and Emotion Recognition Using Deep Neural Networks. Appl Soft Comput. 2023;144:110494. https://doi.org/10.1016/j.asoc.2023.110494.
15.
Tellamekala MK, Amiriparian S, Schuller BW, André E, Giesbrecht T, Valstar M. COLD Fusion: Calibrated and Ordinal Latent Distribution Fusion for Uncertainty-Aware Multimodal Emotion Recognition. IEEE Trans Pattern Anal Mach Intell. 2024;46(2):805–22. https://doi.org/10.1109/TPAMI.2023.3325770.
16.
Vaiani L, La Quatra M, Cagliero L, Garza P. ViPER: Video-based Perceiver for Emotion Recognition. In: ACM International Conference on Multimedia; 2022; New York, NY, USA. pp. 67–73. https://doi.org/10.1145/3551876.3554806.
17.
Goncalves L, Busso C. Robust Audiovisual Emotion Recognition: Aligning Modalities, Capturing Temporal Information, and Handling Missing Features. IEEE Trans Affect Comput. 2022;13(4):2156–70. https://doi.org/10.1109/TAFFC.2022.3216993.
18.
Khemani B, Patil S, Kotecha K, Tanwar S. A review of graph neural networks: concepts, architectures, techniques, challenges, datasets, applications, and future directions. J Big Data. 2024;11(1):18. https://doi.org/10.1186/s40537-023-00876-4.
19.
Wierciński T, Rock M, Zwierzycki R, Zawadzka T, Zawadzki M. Emotion recognition from physiological channels using graph neural network. Sensors. 2022;22(8):2980. https://doi.org/10.3390/s22082980.
20.
Lu N, Han Z, Han M, Qian J. Bi-stream graph learning based multimodal fusion for emotion recognition in conversation. Inform Fusion. 2024;106:102272. https://doi.org/10.1016/j.inffus.2024.102272.
21.
Li J, Wang X, Lv G, Zeng J. GraphMFT: A graph network based multimodal fusion technique for emotion recognition in conversation. Neurocomputing. 2023;550:126427. https://doi.org/10.1016/j.neucom.2023.126427.
22.
Wang H, Ren C, Yu Z. Multimodal sentiment analysis based on cross-instance graph neural networks. Appl Intell. 2024;54(4):3403–16. https://doi.org/10.1007/s10489-024-05309-0.
23.
Al-Saadawi HFT, Das R. TER-CA-WGNN: Trimodel Emotion Recognition Using Cumulative Attribute-Weighted Graph Neural Network. Appl Sci. 2024;14(6):2252. https://doi.org/10.3390/app14062252.
24.
Phan HT, Nguyen NT, Hwang D. Convolutional Attention Neural Network over Graph Structures for Improving the Performance of Aspect-Level Sentiment Analysis. Inf Sci. 2022;589:416–39. https://doi.org/10.1016/j.ins.2021.12.127.
25.
Lin K, Chen R, Chen J, Lu P, Yang F. Multi-View Block Matrix-Based Graph Convolutional Network. Eng Lett. 2024;32(6).
26.
Koelstra S, Muhl C, Soleymani M, Lee J-S, Yazdani A, Ebrahimi T. DEAP: A database for emotion analysis; using physiological signals. IEEE Trans Affect Comput. 2011;3(1):18–31. https://doi.org/10.1109/T-AFFC.2011.15.
27.
Russell JA. A circumplex model of affect. J Personal Soc Psychol. 1980;39(6):1161–78. https://doi.org/10.1037/h0077714.
28.
Ouzar Y, Bousefsaf F, Djeldjli D, Maaoui C. Video-based multimodal spontaneous emotion recognition using facial expressions and physiological signals. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. pp. 2460–2469.
29.
Yuan W. Research on Emotion Recognition Based on Facial Expressions and EEG [dissertation]. University of Electronic Science and Technology of China; 2024. https://doi.org/10.27005/d.cnki.gdzku.2024.004238.
30.
Li J. Research on Multimodal Emotion Recognition Based on EEG, Peripheral Physiological Signals, and Facial Expressions [dissertation]. Nanjing University of Posts and Telecommunications; 2023. https://doi.org/10.27251/d.cnki.gnjdc.2023.000751.