5. Experiment design
5.1 Dataset introduction
This study employs the DEAP dataset (Koelstra et al. [19]) for model validation. Compiled by researchers at Queen Mary University of London, DEAP is a publicly available dataset containing multimodal recordings of human emotional responses. It includes electroencephalogram (EEG), electrocardiogram (ECG), electrodermal activity (EDA), electromyogram (EMG), and facial expression videos. The simultaneous capture of facial expressions and physiological signals enables robust analysis of emotion-expression correlations [26]. Given its multimodal nature, DEAP is a suitable benchmark for emotion perception models. For real-time emotion inference in practical scenarios, the paper prioritizes readily acquirable modalities. While EEG requires specialized equipment, peripheral physiological signals (ECG, EDA, EMG) and facial expressions can be captured with consumer-grade devices such as wearables and cameras. Consequently, this research focuses on these two pragmatic modalities.
The DEAP dataset operationalizes emotions according to Russell's Circumplex Model [27], a dimensional theory formalizing affective states through two orthogonal axes:
Valence: Quantifies the hedonic value of emotions (positive vs. negative).
Arousal: Measures the physiological intensity of emotions (calm vs. excited).
Complementing these core dimensions, DEAP incorporates two supplementary metrics [19]:
Dominance: Assesses the perceived control level over emotional stimuli.
Liking: Gauges subjective preference toward stimulus content.
Each dimension is rated on a continuous 9-point scale (1–9), providing fine-grained descriptors of affective states. Such dimensional quantification offers richer nuance than categorical models (e.g., joy, anger).
5.2 Model implementation
The algorithm and model were implemented according to the proposed GGMEN design, in the following steps.
Step 1: Data preprocessing. Shallow feature extraction methods were applied to the peripheral physiological signals, and a Transformer model was used to extract facial expression features from the facial expression videos.
Threshold rules were then applied to map the dimensional ratings (Valence, Arousal, Dominance, Liking) from the DEAP dataset into discrete emotion categories. Based on the 9-point scale ratings, the samples were classified into nine categories: joy, excited, anger, pressure, sad, fear, surprise, calm, and boring. The threshold criteria for this mapping are listed in Table 1.
Table 1
Emotion | Valence | Arousal | Dominance | Liking |
|---|---|---|---|---|
Joy | High (6–9) | Moderate-High (5–9) | Moderate-High (5–9) | High (6–9) |
Excited | High (6–9) | High (7–9) | High (6–9) | High (6–9) |
Anger | Low (1–4) | High (7–9) | Low (1–4) | Low (1–4) |
Pressure | Low (1–4) | High (7–9) | Moderate (4–6) | Low (1–4) |
Sad | Low (1–4) | Low (1–4) | Low (1–4) | Low (1–4) |
Fear | Low (1–4) | High (7–9) | Very Low (1–3) | Low (1–4) |
Surprise | Moderate (4–6) | Extremely High (8–9) | Moderate (4–6) | Moderate (4–6) |
Calm | High (6–9) | Low (1–4) | High (6–9) | High (6–9) |
Boring | Low (1–4) | Low (1–4) | Moderate-High (5–9) | Very Low (1–3) |
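The threshold rules in Table 1 can be sketched as a small lookup routine. The ranges below are copied from the table, but the matching order is an assumption: several rows overlap (for example, every Excited rating also satisfies Joy), and the text does not say how such ties are resolved. This sketch tries the narrowest rule first and returns `None` when no row matches; the function name `classify` is hypothetical.

```python
from typing import Optional

# Inclusive rating ranges copied from Table 1, in the order
# (valence, arousal, dominance, liking).
RULES = {
    "Joy":      ((6, 9), (5, 9), (5, 9), (6, 9)),
    "Excited":  ((6, 9), (7, 9), (6, 9), (6, 9)),
    "Anger":    ((1, 4), (7, 9), (1, 4), (1, 4)),
    "Pressure": ((1, 4), (7, 9), (4, 6), (1, 4)),
    "Sad":      ((1, 4), (1, 4), (1, 4), (1, 4)),
    "Fear":     ((1, 4), (7, 9), (1, 3), (1, 4)),
    "Surprise": ((4, 6), (8, 9), (4, 6), (4, 6)),
    "Calm":     ((6, 9), (1, 4), (6, 9), (6, 9)),
    "Boring":   ((1, 4), (1, 4), (5, 9), (1, 3)),
}


def _total_width(ranges):
    """Sum of the range widths; smaller means a more specific rule."""
    return sum(hi - lo for lo, hi in ranges)


# ASSUMPTION (not stated in the paper): overlapping rules are resolved by
# trying the most specific (narrowest) rule first.
ORDERED_RULES = sorted(RULES.items(), key=lambda item: _total_width(item[1]))


def classify(valence, arousal, dominance, liking) -> Optional[str]:
    """Return the first emotion whose four ranges all contain the ratings."""
    ratings = (valence, arousal, dominance, liking)
    for label, ranges in ORDERED_RULES:
        if all(lo <= r <= hi for r, (lo, hi) in zip(ratings, ranges)):
            return label
    return None  # rating falls in a gap that Table 1 does not cover


print(classify(7, 8, 7, 7))  # Excited
print(classify(2, 2, 2, 2))  # Sad
print(classify(5, 5, 5, 5))  # None
```

Because DEAP ratings are continuous, values that fall between two ranges (such as 4.5) match no row under this reading; how such samples are handled is not specified in the text.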
The preprocessing pipeline generates two distinct feature matrices:
Physiological signal features: 1280×15 matrix (1280 samples × 14 features + 1 emotion label column).
Facial expression features: 874×515 matrix (874 samples × 514 features + 1 emotion label column).
Step 2: Graph Construction. The peripheral physiological data graph and facial expression data graph were constructed separately through a systematic process. First, node features were initialized to represent the fundamental units of each graph. For the peripheral physiological data, edges were then established based on biological signal correlations, while for the facial expression data, edges were defined according to predefined anatomical relationships between facial key points, such as those derived from the facial action coding system (FACS). Finally, both the node features and edge connections were converted into PyTorch tensors to ensure compatibility with subsequent deep learning processing. This structured approach effectively transforms multimodal data into graph representations suitable for neural network analysis.
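As a rough illustration of correlation-based edge construction for the physiological graph, the sketch below treats each feature column as a node and connects two nodes when the absolute Pearson correlation of their columns reaches a threshold. The node definition, the threshold value, and the function names are assumptions made for illustration; the text does not specify them. The result uses a two-row source/target layout that could then be converted to a tensor.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def correlation_edges(samples, threshold=0.5):
    """Build an undirected edge list between feature columns.

    samples: list of rows, one row per sample, one column per feature.
    Returns [sources, targets]; each undirected edge is stored in both
    directions. The threshold is a hypothetical choice.
    """
    columns = list(zip(*samples))
    sources, targets = [], []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                sources += [i, j]
                targets += [j, i]
    return [sources, targets]


# Toy data: features 0 and 1 are almost perfectly correlated, feature 2 is not.
toy = [
    [1.0, 2.0, 5.0],
    [2.0, 4.1, 3.0],
    [3.0, 6.2, 6.0],
    [4.0, 7.9, 2.0],
]
edge_index = correlation_edges(toy, threshold=0.9)
print(edge_index)  # [[0, 1], [1, 0]]
```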
Step 3: Construct the GCN, GIN, and attention mechanism models, all of which use ReLU as the activation function.
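For reference, the standard layer updates for the two graph networks named in Step 3 are given below, with ReLU as the nonlinearity as stated. These are the commonly cited textbook formulations (GCN with symmetric normalization and self-loops, GIN with sum aggregation); the text does not say whether GGMEN uses exactly these variants, so they should be read as an assumption.

```latex
% GCN layer: \tilde{A} = A + I adds self-loops, \tilde{D} is its degree matrix
H^{(l+1)} = \mathrm{ReLU}\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)

% GIN layer: sum over neighbours \mathcal{N}(v), \epsilon^{(k)} is fixed or learned
h_v^{(k)} = \mathrm{MLP}^{(k)}\!\left( \left(1 + \epsilon^{(k)}\right) h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \right)
```

Here $H^{(l)}$ is the node feature matrix at layer $l$, $W^{(l)}$ is a trainable weight matrix, and $h_v^{(k)}$ is the feature vector of node $v$ after $k$ GIN layers.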
Step 4: Model Fusion. During fusion, the input dimensions of both physiological and facial features are first calculated. The physiological data is then processed through two GCN layers while the facial features are processed through one GIN layer, followed by weighted modality fusion using an attention mechanism.
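The weighted modality fusion in Step 4 can be illustrated with a minimal attention sketch. It assumes that both modality embeddings have already been projected to the same dimension and that attention scores come from a dot product with a query vector (learned in practice, fixed here); neither detail is given in the text, and the names `attention_fuse` and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(physio_vec, face_vec, query):
    """Fuse two equal-length modality embeddings by attention weighting.

    ASSUMPTION: scores are dot products with a query vector; the paper does
    not specify the form of its attention mechanism.
    """
    embeddings = [physio_vec, face_vec]
    scores = [sum(q * x for q, x in zip(query, emb)) for emb in embeddings]
    weights = softmax(scores)
    fused = [
        sum(w * emb[i] for w, emb in zip(weights, embeddings))
        for i in range(len(query))
    ]
    return fused, weights


fused, weights = attention_fuse([1.0, 0.0], [0.0, 1.0], query=[0.0, 0.0])
print(fused, weights)  # [0.5, 0.5] [0.5, 0.5]
```

With a zero query both modalities score equally and the fusion reduces to a plain average; a query aligned with one embedding shifts weight toward that modality.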
Step 5: Classification Layer. The fused feature vector is passed through a fully connected layer, and the Softmax activation function produces class probabilities for multi-class emotion classification (e.g., joy, sad, anger).
Step 6: Model Training. The model undergoes supervised training using backpropagation with gradient-based optimization. The training process employs standard techniques such as batch normalization and dropout to enhance learning stability and prevent overfitting.
5.3 Experimental process
The experimental environment for this paper is as follows:
Processor: Intel® Xeon® Gold 5218CR CPU
Memory: 128GB DDR4 RAM
Operating System: Windows 11 Pro
Software Stack: Python 3.11.7, PyTorch 2.6.0, CUDA 12.6 (cu126).
5.3.1 Model parameter tuning experiment
The model was trained on the DEAP dataset using the Adam optimizer, with hyperparameters including epochs, learning rate, weight decay, and dropout rate selected for tuning. Performance was evaluated based on accuracy, precision, and F1-score. The experiments were conducted in two stages: first, a broad parameter search was performed to identify promising ranges, followed by a refined search within narrowed ranges to determine the optimal parameter combination.
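For a multi-class task, precision and F1-score require an averaging scheme, which the text does not state. The sketch below assumes macro averaging (the unweighted mean of per-class scores); the function name `macro_metrics` is hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro F1) for label sequences.

    ASSUMPTION: macro averaging; the paper does not specify the scheme.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, f1_scores = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        f1_scores.append(f1)

    n = len(labels)
    return accuracy, sum(precisions) / n, sum(f1_scores) / n


acc, prec, f1 = macro_metrics(
    ["joy", "joy", "sad", "sad"],
    ["joy", "sad", "sad", "sad"],
)
print(round(acc, 4), round(prec, 4), round(f1, 4))  # 0.75 0.8333 0.7333
```

Under macro averaging, accuracy can exceed both precision and F1 when minority classes are predicted poorly, which is one possible reading of the gap between these metrics in the reported results.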
The first experiment represents the initial phase of model parameter optimization, aimed at identifying the parameter ranges that yield better performance.
Hyperparameters are pre-set configuration parameters manually defined before training deep learning models, whose choices significantly shape the model's convergence speed, generalization capability, and ultimate performance. Epochs determine how many times the model processes the entire training dataset, influencing the extent of pattern recognition. Learning rate controls the step size for parameter updates, serving as the core hyperparameter in gradient descent. Weight decay regulates penalty terms to suppress overfitting and enhance generalization, while dropout randomly drops out units during training to force the model to learn redundant representations, thus preventing overfitting.
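To make the dropout description concrete, the following is a minimal sketch of inverted dropout, the formulation commonly used in deep learning frameworks: during training each unit is zeroed with probability `rate`, and the surviving units are scaled by 1 / (1 - rate) so that the expected activation is unchanged. It is an illustration, not the paper's implementation.

```python
import random


def dropout(values, rate, rng, training=True):
    """Apply inverted dropout to a list of activations.

    rate: probability of dropping each unit (e.g. 0.4).
    rng:  a random.Random instance, passed in for reproducibility.
    """
    if not training or rate == 0.0:
        return list(values)  # dropout is disabled at inference time
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10
print(dropout(activations, rate=0.4, rng=rng))
print(dropout(activations, rate=0.4, rng=rng, training=False))
```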
Guided by theoretical insights, empirical experience, and considerations of model complexity and computational cost, the experiment initially defines broad ranges for four key parameters: epochs, learning rate, weight decay, and dropout rate. A relatively large step size is used for each parameter to efficiently explore the search space. The Adam optimizer is employed to ensure stable performance and accelerate the training process.
The parameter ranges established for the first experimental phase were configured as follows:
epochs: [50, 100, 150],
learning rate: [0.1, 0.01, 0.001],
weight decay: [0.0001, 0.001, 0.01],
dropout rate: [0.2, 0.3, 0.5]
The optimal parameters obtained in this experiment were: epochs = 50, learning rate = 0.001, weight decay = 0.01, dropout = 0.3, achieving an F1-score of 0.7082.
Based on the optimal parameters from the first experiment, the parameter ranges were reset by adjusting values within the same order of magnitude in the second experiment. The refined ranges were set as follows:
epochs: [20,30,40,50,60,70,80]
learning rate: [0.0006,0.0008,0.001,0.002,0.003,0.004]
weight decay: [0.006,0.008,0.01,0.015,0.02]
dropout rate: [0.2,0.25,0.3,0.35,0.4,0.45,0.5]
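The two search stages can be enumerated as full Cartesian grids. The sketch below only generates the configurations; the training call is omitted, and the dictionary keys are illustrative names. Assuming every combination is trained, the refined grid contains 7 × 6 × 5 × 7 = 1,470 configurations and the coarse grid contains 3 × 3 × 3 × 3 = 81.

```python
from itertools import product

# Ranges copied from the two experimental phases described in the text.
COARSE_GRID = {
    "epochs": [50, 100, 150],
    "learning_rate": [0.1, 0.01, 0.001],
    "weight_decay": [0.0001, 0.001, 0.01],
    "dropout": [0.2, 0.3, 0.5],
}
REFINED_GRID = {
    "epochs": [20, 30, 40, 50, 60, 70, 80],
    "learning_rate": [0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004],
    "weight_decay": [0.006, 0.008, 0.01, 0.015, 0.02],
    "dropout": [0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5],
}


def grid_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


coarse = list(grid_configs(COARSE_GRID))
refined = list(grid_configs(REFINED_GRID))
print(len(coarse), len(refined))  # 81 1470
```

In a full search, each yielded dictionary would be passed to a training routine and the resulting metrics recorded for comparison.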
In the second experiment, every combination of epochs, learning rate, weight decay, and dropout from the refined ranges was trained, giving 7 × 6 × 5 × 7 = 1,470 training runs, and the evaluation metrics of the resulting 1,470 models were compared. Table 2 lists only the six best-performing parameter combinations, in descending order of accuracy.
Table 2
Evaluation results of the six best-performing parameter combinations (metrics reported as fractions in [0, 1])
epochs | learning_rate | weight_decay | dropout | accuracy | precision | f1 |
|---|---|---|---|---|---|---|
50 | 0.001 | 0.008 | 0.4 | 0.8125 | 0.71175 | 0.7507 |
70 | 0.0006 | 0.006 | 0.2 | 0.8047 | 0.74134 | 0.7502 |
60 | 0.0006 | 0.02 | 0.35 | 0.8047 | 0.70310 | 0.7492 |
80 | 0.0008 | 0.006 | 0.4 | 0.7969 | 0.69324 | 0.7382 |
60 | 0.0008 | 0.006 | 0.25 | 0.7891 | 0.71955 | 0.7381 |
80 | 0.0006 | 0.006 | 0.25 | 0.7813 | 0.68546 | 0.7300 |
As Table 2 shows, the best-performing combination is epochs = 50, learning rate = 0.001, weight decay = 0.008, and dropout = 0.4, which yields an accuracy of 0.8125 and an F1-score of 0.7507.
5.3.2 Analysis of the impact of hyperparameters on model
Starting from the parameter combination that yielded peak performance, the paper analyzes how each hyperparameter affects model behavior by varying it while holding the others fixed.
When analyzing the impact of epoch on model performance, with fixed hyperparameters learning rate = 0.001, weight decay = 0.008, and dropout = 0.4, the paper systematically analyzed variations in accuracy, precision, and F1-score across epoch values [20, 30, 40, 50, 60, 70, 80]. The experimental results are visualized in Fig. 5.
Experimental results demonstrate that the model achieves optimal performance at 50 epochs. Accuracy remains consistently high (0.7802–0.8125), peaking at epoch 50 and fluctuating only slightly afterwards, which indicates stable overall classification capability on the test data. Precision drops at epoch 60 and, despite a slight rebound, trends downward overall, reflecting decreasing reliability of positive-instance recognition. The F1-score fluctuates between 0.62 and 0.75, also peaking at epoch 50. Its close tracking of the precision trend suggests that the balance between precision and recall shifts as training proceeds.
When analyzing the impact of learning rate on model performance, with fixed hyperparameters epoch = 50, weight decay = 0.008, and dropout = 0.4, the paper quantitatively assessed variations in accuracy, precision, and F1-score across learning rates [0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004]. The corresponding experimental outcomes are illustrated in Fig. 6.
Experimental results indicate that a learning rate of approximately 0.001 yields the best classification accuracy, positive-class reliability, and precision-recall balance on the test set. At learning rate = 0.001, accuracy peaks at 0.8125 and the F1-score reaches its maximum of 0.7507, indicating an effective balance between precision and recall. The precision trend corroborates these observations. Conversely, learning rates above 0.001 lead to progressive performance degradation: accuracy declines by 2.8–5.4%, indicating compromised generalization; precision decreases monotonically (Δ ≈ 12%), reflecting elevated false positive rates; and the F1-score deteriorates, indicating a loss of precision-recall balance.
When analyzing the impact of weight decay on model performance, with fixed hyperparameters epoch = 50, learning rate = 0.001, and dropout = 0.4, the paper systematically evaluated variations in accuracy, precision, and F1-score over weight decay values [0.006, 0.008, 0.01, 0.015, 0.02]. Experimental results are presented in Fig. 7.
The experimental results show that weight decay = 0.008 gives the best classification accuracy, reliability of positive predictions, and precision-recall balance on the test set.
At this value the test accuracy, the F1-score, and the precision all reach their peaks. The peak in precision means that a high proportion of the samples predicted as positive are truly positive.
When the weight decay deviates from 0.008 in either direction, accuracy decreases to varying degrees. With stronger regularization (weight decay > 0.008), the penalty appears excessive and prevents the model from learning features useful for classification. With weaker regularization (weight decay < 0.008), the model appears to overfit the training data and generalizes poorly to the test data.
When analyzing the impact of dropout on model performance, with fixed hyperparameters epoch = 50, learning rate = 0.001, and weight decay = 0.008, the paper systematically evaluated model performance across dropout rates [0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5], with results presented in Fig. 8.
Experimental results demonstrate that a dropout rate of 0.40 achieves optimal model performance in terms of classification accuracy, positive-class reliability, and precision-recall balance on the test set. When examining dropout rates within the 0.20–0.40 range, the test accuracy remains consistently high, indicating that this range provides appropriate regularization strength that effectively mitigates overfitting while preserving the model's learning capacity. The F1-scores remain relatively stable in this range, showing good balance between precision and recall. Although precision shows some fluctuation, it generally aligns with accuracy and F1-score trends without exhibiting extreme deviations.
At a dropout rate of 0.45, precision drops significantly to approximately 0.60, suggesting that higher dropout ratios lead to increased misclassification of positive instances, potentially due to excessive disruption of learned features. The F1-score decreases correspondingly, indicating an imbalance between precision and recall. While accuracy shows only a modest decline, this smaller reduction compared to precision can be attributed to factors such as the proportion of negative samples in the classification task.
When the dropout rate rises further to 0.50, both precision and F1-score partially recover, but overall performance remains below that obtained at 0.40. The model therefore retains some capacity to adapt to very high dropout rates, yet excessive dropout still degrades its final performance metrics.
5.3.3 Ablation experiment
The proposed GGMEN model employs a hybrid architecture combining GCN and GIN to integrate facial expressions and peripheral physiological signals for multimodal emotion recognition. To quantify the contributions of individual components, the paper conducts systematic ablation experiments by removing core modules while maintaining the optimal hyperparameters: epochs = 50, learning rate = 0.001, weight decay = 0.008, and dropout = 0.4. The ablation experiment design is summarized in Table 3.
Table 3
Ablation experiment design
Component Type | Experimental Group | Control Group |
|---|---|---|
Data Modality | Unimodal: Facial Expressions | Full multimodal input |
Data Modality | Unimodal: Peripheral Physiological Signals | Full multimodal input |
Architecture | GCN replaced with Standard CNN | Complete GGMEN (GCN + GIN) |
Architecture | GIN layers removed | Complete GGMEN (GCN + GIN) |
The ablation study comprises data modality ablation experiments and network module ablation experiments.
The data modality ablation experiment aims to evaluate the individual contributions of facial expressions (FE) and peripheral physiological signals (PPS), determining whether multimodal fusion enhances emotion recognition performance compared to unimodal inputs. The experiments assess whether retaining both modalities improves model efficiency and accuracy. The experimental process is as follows:
Step 1: Ablating facial expressions
Experimental Group: Input only PPS data (GGMEN architecture unchanged).
Control Group: Full multimodal input with GGMEN architecture.
Evaluation Metric: Compare accuracy between groups to quantify FE’s impact on performance.
Step 2: Ablating peripheral physiological signals
Experimental Group: Input only facial expressions data (GGMEN architecture unchanged).
Control Group: Full multimodal input with GGMEN architecture.
Evaluation Metric: Compare accuracy between groups to assess PPS’s contribution.
The results of the data modality ablation experiments are presented in Table 4 below.
Table 4
Data Modality Ablation Experiment Results
Model Variant | Modality Input | Dataset | Accuracy (%) |
|---|---|---|---|
GGMEN (Full) | Facial Expressions + Peripheral Physiological Signals | DEAP | 81.25 |
GCN + GIN | Facial Expressions | DEAP | 74.34 |
GCN + GIN | Peripheral Physiological Signals | DEAP | 72.68 |
Experimental results demonstrate that the model achieves an accuracy of 81.25% when utilizing both facial expressions and peripheral physiological signals as multimodal inputs. In contrast, the accuracy drops significantly to 74.34% with facial expressions alone and 72.68% with peripheral physiological signals alone. These findings confirm that multimodal fusion provides richer feature representations, thereby effectively enhancing emotion recognition performance and validating the value of multimodal integration.
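The size of the fusion benefit follows directly from the accuracies in Table 4. The short sketch below computes the gain of the full model over each unimodal variant in percentage points; the variable and function names are illustrative.

```python
# Accuracies (%) copied from Table 4.
FULL_ACCURACY = 81.25
UNIMODAL_ACCURACY = {
    "Facial Expressions": 74.34,
    "Peripheral Physiological Signals": 72.68,
}


def fusion_gain(full, single):
    """Accuracy gain of the multimodal model, in percentage points."""
    return round(full - single, 2)


gains = {
    name: fusion_gain(FULL_ACCURACY, acc)
    for name, acc in UNIMODAL_ACCURACY.items()
}
for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
# Facial Expressions: +6.91 percentage points
# Peripheral Physiological Signals: +8.57 percentage points
```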
Comparative analysis of the unimodal configurations reveals that facial expressions (74.34%) yield higher accuracy than peripheral physiological signals (72.68%), indicating that facial data contains more discriminative information for this specific task under the current experimental setup. However, both unimodal approaches exhibit performance limitations due to inherent information deficiency, underscoring the superiority of multimodal fusion.
Notably, the same GCN + GIN architecture demonstrates varying performance levels depending on the input modalities, highlighting its adaptability to multimodal data while simultaneously emphasizing that modality selection and combination constitute critical factors influencing model effectiveness. Future work should further investigate modality complementarity and feature fusion strategies to optimize performance.
The network module ablation experiment aims to validate the functional contributions of both GCN and GIN layers in feature extraction and multimodal fusion, while assessing their necessity as key components for performance enhancement. This experiment provides critical insights for architectural optimization. All tests were conducted using the model's optimal hyperparameters (epochs = 50, learning rate = 0.001, weight decay = 0.008, dropout = 0.4) with fused multimodal input (facial expressions + peripheral physiological signals). The experimental process is as follows:
Step 1: GCN Layer Ablation
Experimental Group: Replace GCN layers with standard convolutional layers while retaining all other components (GIN layers, fusion mechanism, etc.)
Control Group: Original GGMEN architecture
Evaluation Metric: Accuracy comparison to quantify GCN's advantage in graph-structured data processing
Step 2: GIN Layer Ablation
Experimental Group: Remove GIN layers while maintaining GCN layers and other components
Control Group: Original GGMEN architecture
Evaluation Metric: Accuracy comparison to assess GIN's role in feature refinement and multimodal fusion
The quantitative results of network module ablation are presented in Table 5.
Table 5
Network module ablation experiment results
Model Variant | Modality Input | Dataset | Accuracy (%) |
|---|---|---|---|
GGMEN | Facial Expressions + Peripheral Physiological Signals | DEAP | 81.25 |
CNN + GIN | Facial Expressions + Peripheral Physiological Signals | DEAP | 79.63 |
GCN | Facial Expressions + Peripheral Physiological Signals | DEAP | 77.68 |
Experimental results reveal clear performance differences among the three models under identical multimodal conditions. The GGMEN model achieves the highest accuracy of the three variants on the DEAP dataset (81.25%), outperforming both the CNN + GIN variant (79.63%) and the GCN-only variant (77.68%). This advantage is attributed to GGMEN's fusion strategy, which integrates facial expressions and peripheral physiological signals, capturing complementary cross-modal correlations while reducing information redundancy and noise interference.
In contrast, the CNN + GIN hybrid model exhibits critical limitations. Its naive fusion operations fail to model complex cross-modal interactions, while facial grid rasterization disrupts physiological constraints like eyebrow-eye muscle synergy. Standalone Graph Convolutional Network (GCN) models exhibit suboptimal performance due to inherent architectural constraints. Their inability to process raw visual data necessitates pre-extraction of facial landmarks, while insufficient temporal modeling capability yields coarse feature granularity, which is particularly detrimental when capturing subtle physiological patterns like orbicularis oculi microtremors or RR interval fluctuations. The GGMEN framework addresses these limitations through hierarchical feature extraction and biologically grounded fusion mechanisms, offering a more robust approach to multimodal emotion recognition.
5.3.4 Model comparison
To validate model efficacy, the paper benchmarks the proposed architecture against contemporary state-of-the-art approaches, with comparative results summarized in Table 6.
Table 6
Method | Modalities | Dataset | Architecture | Accuracy (%) |
|---|---|---|---|---|
GGMEN | Facial expressions, Peripheral physiological signals | DEAP | GCN + GIN | 81.25 |
Yassine Ouzar [28] | Facial expressions, Peripheral physiological signals | BP4D+ | 3D squeeze-and-excitation-based 3D Xception architecture | 71.9 |
Yuan Wenzhen [29] | EEG signals, Facial expressions | DEAP | LMF | 89.33 |
Li Jing [30] | EEG signals, Peripheral physiological signals, Facial expressions | DEAP | NMSNet | 86.5 |
Among models that process the same input modalities (facial expressions + peripheral physiological signals), GGMEN achieves 81.25% accuracy compared with 71.9% for Yassine Ouzar [28], a 9.35 percentage point advantage. Because the two results were obtained on different datasets (DEAP and BP4D+), this comparison is indicative rather than strictly like-for-like. While the models of Yuan [29] (89.33%) and Li [30] (86.5%) attain higher accuracy, their performance stems primarily from incorporating electroencephalography (EEG) signals, which are highly sensitive to emotional fluctuations. Crucially, EEG-dependent approaches require electrode caps and specialized acquisition equipment, whereas GGMEN relies exclusively on modalities that are easier to acquire, offering better deployment feasibility and user comfort.
These results indicate GGMEN's competitive position: its GCN + GIN architecture enables effective structural representation of multimodal relationships through graph-based topology modeling. This approach detects subtle pattern variations via graph isomorphism principles, which conventional convolutional operations and feature fusion mechanisms with limited relational reasoning capacity capture less well. These findings support the architectural design underlying GGMEN's performance in emotion recognition without EEG.