Introduction
Figure skating, as a highly challenging competitive sport that integrates technical and artistic elements, requires extremely high professional standards for action evaluation. Traditional manual judging suffers from strong subjectivity and poor consistency, while existing computer vision methods face enormous challenges when processing figure skating's high-speed rotations, complex jumps, and refined artistic movements. These actions exhibit high spatiotemporal correlations, multi-dimensional posture changes, and strict technical specification requirements. Although existing deep learning methods have achieved progress in general action recognition, they lack specialized algorithmic designs for the unique motion scenarios of figure skating.
This research proposes a 3D convolutional neural network model tailored for figure skating that achieves high-precision automatic recognition of complex figure skating actions through systematic research across four levels: constructing a specialized network architecture, designing action capture algorithms, establishing pose recognition mechanisms, and building a performance validation system. The work advances sports action recognition technology and provides a technical foundation for intelligent sports training systems and automated competition judging.
1 3D CNN Architecture for Figure Skating
1.1 Specialized 3D CNN Network Design for Figure Skating
Targeting the special characteristics of figure skating, this research designs a spatiotemporal convolutional kernel architecture adapted for high-speed rotational actions. Traditional 3D CNNs suffer from temporal receptive field mismatch and spatial feature loss when processing high-speed figure skating rotations[1]. To address this, we propose a multi-temporal-scale convolutional kernel combination strategy, using three different-sized spatiotemporal convolutional kernels (3×3×3, 5×5×5, and 7×7×7) in parallel to capture short-term posture changes, medium-term action transitions, and long-term action sequence features respectively. As shown in Fig. 1, this multi-temporal-scale convolutional kernel architecture effectively integrates motion features at different temporal scales through a hierarchical fusion mechanism of feature pyramids.
In multi-resolution feature pyramid construction, we adopt a bottom-up feature fusion mechanism that performs weighted aggregation of spatiotemporal features from different levels, effectively improving recognition capability for actions at different scales. Considering interference from ice surface reflections and motion blur caused by high-speed movement on visual recognition, the network integrates deblurring preprocessing modules and illumination normalization layers. This module dynamically adjusts input image contrast and saturation by learning the optical characteristics of ice surface reflections, while using temporal information to compensate for detail loss caused by motion blur[2]. Experiments show that compared to standard 3D ResNet, this specialized network improves accuracy in figure skating action recognition tasks by 8.7%, with only a 12% increase in computational complexity.
1.2 Innovative Design of Spatiotemporal Attention Mechanism
To achieve precise localization of key moments and key body parts in figure skating actions, this research proposes a spatiotemporal attention mechanism based on action key frames. In the temporal dimension, we design a dynamic key frame detection algorithm that automatically identifies critical turning points in action sequences by analyzing posture change amplitude and motion velocity gradients between consecutive frames. This algorithm uses a sliding window mechanism to calculate temporal attention weights, assigning higher weight coefficients to critical moments such as takeoff instants, peak air positions, and landing moments[3].
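The key frame weighting described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule (mean joint displacement plus its frame-to-frame change), the softmax normalization, and the function name `temporal_attention` are all assumptions made for illustration, and the default window is kept small, whereas Table 1 lists a 16-frame window.

```python
import math

def temporal_attention(poses, window=3):
    """poses: list of frames, each a list of (x, y) joint coordinates.

    Returns one attention weight per frame (weights sum to 1).
    """
    # Posture change amplitude: mean joint displacement between consecutive frames.
    change = [0.0]
    for prev, cur in zip(poses, poses[1:]):
        d = [math.dist(p, c) for p, c in zip(prev, cur)]
        change.append(sum(d) / len(d))
    # Velocity gradient: absolute change of that displacement (assumed proxy).
    grad = [0.0] + [abs(b - a) for a, b in zip(change, change[1:])]
    score = [c + g for c, g in zip(change, grad)]
    # Sliding-window average of the raw scores.
    half = window // 2
    smooth = []
    for i in range(len(score)):
        seg = score[max(0, i - half): i + half + 1]
        smooth.append(sum(seg) / len(seg))
    # Softmax turns the smoothed scores into normalized weights.
    m = max(smooth)
    e = [math.exp(s - m) for s in smooth]
    z = sum(e)
    return [v / z for v in e]
```

Frames around an abrupt posture change, such as a takeoff instant, receive larger weights than frames in which the pose is static.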
In the spatial dimension, we establish an attention localization mechanism for key body parts, focusing on regions with high contribution to action recognition such as head, torso, and limb extremities. By introducing anatomical prior knowledge, we construct a human kinematic constraint model to ensure attention weight distributions conform to biomechanical principles[4]. The core innovation of this mechanism lies in the spatiotemporal decoupling and recoupling strategy, which first calculates temporal and spatial attention separately, then performs adaptive recombination through a gating fusion network. This strategy effectively avoids interference between spatiotemporal attention, improving the stability and accuracy of the attention mechanism. Table 1 details the key parameter configurations of each spatiotemporal attention mechanism module, providing precise technical specifications for model reproduction.
Table 1
Spatiotemporal Attention Mechanism Parameter Configuration
| Parameter | Temporal Attention | Spatial Attention | Fusion Module |
| --- | --- | --- | --- |
| Window Size | 16 frames | 7×7 pixels | - |
| Learning Rate | 0.001 | 0.002 | 0.0005 |
| Dropout Rate | 0.3 | 0.2 | 0.4 |
| Hidden Dimensions | 512 | 256 | 128 |
| Activation Function | ReLU | Sigmoid | Tanh |
| Weight Initialization | Xavier | He Normal | Xavier |
Ablation experiments show that after introducing this attention mechanism, the model's recognition accuracy for complex rotational actions improved by 6.2%, and key point localization error for jumping actions decreased by 23%.
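As a rough illustration of the decoupling and recoupling strategy in this section, the sketch below fuses separately computed temporal and spatial attention values with a scalar gate. The gate parameters `w_t`, `w_s`, and `bias` are hypothetical stand-ins for a learned gating network, and both attention maps are assumed to have been broadcast to the same positions beforehand.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(temporal, spatial, w_t=1.0, w_s=1.0, bias=0.0):
    """Blend two attention values per position with a sigmoid gate."""
    fused = []
    for a_t, a_s in zip(temporal, spatial):
        # Gate in (0, 1): how much to trust the temporal branch here.
        g = sigmoid(w_t * a_t + w_s * a_s + bias)
        # Convex combination keeps the result between the two inputs.
        fused.append(g * a_t + (1.0 - g) * a_s)
    return fused
```

Because the output is a convex combination, neither branch can push the fused weight outside the range spanned by the two inputs, which is one simple way to limit interference between them.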
1.3 Network Training and Optimization Strategies
Addressing the inconsistent sequence lengths and gradient vanishing problems in figure skating action sequences, this research proposes specialized training optimization strategies. For gradient stabilization, we adopt improved gradient clipping techniques that dynamically adjust clipping thresholds to adapt to different sequence input lengths. This method adaptively sets upper limits for gradient norms based on sequence length and current training stage, effectively preventing gradient explosion in long sequence training. The multi-scale loss function design combines classification loss, temporal consistency loss, and action boundary detection loss with weight ratios of 0.6, 0.3, and 0.1 respectively. Figure 2 shows the dynamic scheduling strategy for the three loss function weights during training, where classification loss weight gradually decreases in later training stages while action boundary detection loss weight correspondingly increases, reflecting the training philosophy from coarse-grained to fine-grained recognition.
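The paper does not give the exact rule for the dynamic clipping threshold, so the following is only a sketch under stated assumptions: the threshold shrinks with the square root of sequence length relative to a hypothetical reference length `ref_len`, and decays linearly to half its base value over training. `clip_by_norm` is ordinary global-norm clipping.

```python
import math

def clip_threshold(seq_len, epoch, total_epochs, base=1.0, ref_len=64):
    """Assumed schedule: tighter bound for longer sequences and later epochs."""
    length_factor = min(1.0, math.sqrt(ref_len / max(seq_len, 1)))
    stage_factor = 1.0 - 0.5 * (epoch / total_epochs)
    return base * length_factor * stage_factor

def clip_by_norm(grads, max_norm):
    """Rescale a flat gradient list so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

For instance, a 256-frame sequence at the start of training would be clipped at half the threshold used for a 64-frame sequence under these assumed settings.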
Temporal consistency loss ensures action continuity by minimizing differences between adjacent frame feature vectors[5], while action boundary detection loss specifically optimizes action transition points. Data augmentation strategies are customized for figure skating characteristics, including random cropping on the temporal axis, frame rate transformation, and random rotation and mirror flipping in the spatial domain[6]. To avoid overfitting, we introduce an adaptive regularization mechanism that dynamically adjusts L2 regularization strength based on validation set performance. Additionally, we employ temperature-regulated knowledge distillation technology to transfer knowledge from pretrained general action recognition models to figure skating-specific models. Experimental results show that the optimized training strategy improves model convergence speed by 40% and increases final recognition accuracy by 11.3% compared to baseline methods.
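A minimal sketch of how the three loss terms could be combined is given here. The mean squared difference used for the temporal consistency term is an assumption (the text only says that differences between adjacent frame features are minimized), and the classification and boundary losses are passed in as precomputed numbers. The default weights are the initial 0.6, 0.3, and 0.1 ratios stated in Section 1.3.

```python
def temporal_consistency_loss(features):
    """Mean squared difference between feature vectors of adjacent frames."""
    total = 0.0
    pairs = 0
    for a, b in zip(features, features[1:]):
        total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
        pairs += 1
    return total / pairs if pairs else 0.0

def total_loss(cls_loss, features, boundary_loss, weights=(0.6, 0.3, 0.1)):
    """Weighted sum of classification, temporal consistency, and boundary terms."""
    w_c, w_t, w_b = weights
    return (w_c * cls_loss
            + w_t * temporal_consistency_loss(features)
            + w_b * boundary_loss)
```

A dynamic schedule such as the one the paper attributes to Figure 2 would pass a different `weights` tuple at each training stage.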
1.4 Spatiotemporal Geometric Theoretical Modeling for Figure Skating Actions
Based on Riemannian geometric theory, this research proposes a spatiotemporal manifold modeling framework for figure skating actions. We treat figure skating action sequences as trajectories embedded in high-dimensional spatiotemporal manifolds, capturing the intrinsic structural features of actions by computing local geometric curvature. We establish rotation-invariant representations for rotational actions based on the SO(3) group and derive invariance theorems for figure skating actions under spatiotemporal transformations, proving the preservation of essential action features under rotation, translation, and scale transformations. This theoretical framework provides a mathematical foundation for designing adaptive spatiotemporal convolutional kernels, enabling the network to dynamically adjust receptive fields according to the geometric properties of actions. Experiments demonstrate that geometry manifold-based convolution operations achieve 7.3% higher accuracy in rotation invariance tests compared to traditional methods, establishing an important foundation for theoretical research in figure skating action recognition.
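The notion of a rotation-invariant representation can be made concrete with a small example. The sketch below is not the manifold-based construction of the paper; it only shows the simplest SO(3)-invariant descriptor, the set of pairwise joint distances, and checks that it is unchanged by a rotation about the vertical axis. The three-joint skeleton is hypothetical.

```python
import math

def rotate_z(points, angle):
    """Rotate 3D points about the z (vertical) axis by angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def pairwise_distances(points):
    """Distances between all joint pairs; unchanged by any rigid rotation."""
    return [math.dist(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]

skeleton = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 1.0)]
before = pairwise_distances(skeleton)
after = pairwise_distances(rotate_z(skeleton, 1.3))
```

Up to floating-point error, `before` and `after` agree, whereas the raw coordinates do not.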
2 Figure Skating Action Capture Algorithm Research
2.1 Trajectory Capture Algorithm for Complex Rotational Actions
Rotational actions in figure skating are characterized by high angular velocity, long duration, and dramatic body posture changes[7], making them difficult for traditional target tracking algorithms to handle effectively. This research proposes a 3D CNN-based athlete body contour extraction algorithm that achieves high-precision contour recognition by learning spatial distribution patterns of body parts during rotation. The algorithm first uses spatiotemporal feature extraction networks to capture body contour information in consecutive frames, then obtains precise body boundaries through morphological filtering and edge detection techniques[8]. Addressing posture continuity maintenance during high-speed rotation, we design a trajectory smoothing algorithm based on kinematic constraints. This algorithm utilizes physical constraints of human joint motion, predicting joint positions in the next frame through Kalman filters and performing optimal estimation combined with current observations. The calculation formula for rotational angular velocity is:
$$\omega(t) = \frac{\theta(t + \Delta t) - \theta(t)}{\Delta t}$$
where $\omega(t)$ is the angular velocity at time $t$ (rad/s), $\theta$ is the body rotation angle relative to the vertical axis (rad), and $\Delta t$ is the time interval (s).
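The finite-difference definition of angular velocity can be sketched as follows. The wrap-around handling is an added assumption: it presumes the per-frame rotation stays below half a turn, so that an angle jumping across the ±π boundary is read as a small step rather than a near-full reverse turn.

```python
import math

def angular_velocity(angles, dt):
    """Finite-difference angular velocity (rad/s) from sampled angles (rad)."""
    omegas = []
    for a, b in zip(angles, angles[1:]):
        # Map the raw difference into [-pi, pi) to undo angle wrap-around.
        d = (b - a + math.pi) % (2 * math.pi) - math.pi
        omegas.append(d / dt)
    return omegas
```

For example, angles of 3.0 rad and -3.0 rad in consecutive frames are treated as a forward step of 2π - 6 rad, not a backward step of 6 rad.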
For occlusion and self-occlusion situations, we developed multi-viewpoint information fusion trajectory reconstruction technology that analyzes visible joint points from different viewpoints and uses 3D reconstruction algorithms to recover spatial coordinates of occluded parts.
Table 2
Rotational Action Capture Performance Comparison
| Method | Accuracy (%) | Processing Speed (FPS) | Occlusion Robustness |
| --- | --- | --- | --- |
| Traditional Optical Flow | 67.3 | 25 | Poor |
| 2D CNN Tracking | 78.9 | 18 | Fair |
| Our 3D CNN Method | 92.4 | 22 | Excellent |
| State-of-the-art | 85.7 | 20 | Good |
2.2 Spatiotemporal Modeling Method for Jumping Actions
Jumping actions are an important component of figure skating technical scoring, requiring precise description of motion characteristics during takeoff, airborne rotation, and landing phases[9]. This research establishes a physics-based kinematic spatiotemporal modeling framework for jumping actions, decomposing the jumping process into three consecutive sub-phases. The takeoff phase mainly analyzes acceleration changes and takeoff angles of athletes, identifying takeoff moments by detecting sudden velocity changes in the vertical direction. Airborne phase modeling focuses on coupled analysis of the center of gravity's parabolic trajectory and rotational motion. The center of gravity trajectory equation is:
$$h(t) = h_0 + v_0 t - \frac{1}{2} g t^2$$
where $h(t)$ is the vertical position of the center of gravity at time $t$ (m), $h_0$ is the center of gravity height at takeoff (m), $v_0$ is the vertical velocity at takeoff (m/s), $g$ is gravitational acceleration (9.8 m/s²), and $t$ is time (s).
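The parabolic model gives simple closed forms for the apex and the airborne duration, which the sketch below evaluates. The function names are illustrative, and `flight_time` assumes the center of gravity lands at the same height at which it took off, which a real landing only approximates.

```python
G = 9.8  # gravitational acceleration, m/s^2

def height(t, h0, v0, g=G):
    """Vertical position of the center of gravity at time t."""
    return h0 + v0 * t - 0.5 * g * t * t

def apex_time(v0, g=G):
    """Time at which vertical velocity reaches zero."""
    return v0 / g

def apex_height(h0, v0, g=G):
    """Maximum height reached during the airborne phase."""
    return h0 + v0 * v0 / (2 * g)

def flight_time(v0, g=G):
    """Airborne duration, assuming landing at the takeoff height."""
    return 2 * v0 / g
```

With a takeoff velocity of 3 m/s, these formulas give an airborne time of about 0.61 s and a rise of about 0.46 m above the takeoff height.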
The automatic counting algorithm for airborne rotations is achieved by analyzing periodic changes in body posture angles. The algorithm monitors the torso rotation angle relative to the ice surface, recording one complete rotation when the cumulative angle change reaches 360°. Rotation counting precision is ensured through angle thresholds and time window constraints. Landing phase identification is based on vertical impact force detection and body posture stability analysis. By establishing multi-phase spatiotemporal feature fusion models, we achieve precise classification of different types of jumping actions (such as Axel jumps, toe loop jumps, etc.)[10]. This modeling method achieves 91.2% accuracy in jumping action recognition tasks with rotation counting error controlled within ± 0.1 rotations. Figure 3 clearly depicts the complete spatiotemporal trajectory modeling process of jumping actions from takeoff to landing, showing the coupling relationship between center of gravity parabolic trajectory and body rotational motion.
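The rotation counting idea, accumulating angle change until it reaches 360°, can be sketched in a few lines. This is a simplified illustration: it omits the angle thresholds and time window constraints mentioned above and assumes the torso angle is sampled densely enough that each frame-to-frame step is under half a turn.

```python
import math

def count_rotations(angles):
    """Total rotations (possibly fractional) from sampled torso angles in radians."""
    total = 0.0
    for a, b in zip(angles, angles[1:]):
        # Unwrap each step into [-pi, pi) before accumulating.
        total += (b - a + math.pi) % (2 * math.pi) - math.pi
    return abs(total) / (2 * math.pi)
```

Taking the integer part of the result gives the number of completed rotations, while the fractional part indicates how far the next rotation had progressed at landing.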
2.3 Precise Capture of Artistic Performance Actions
Artistic performance action evaluation standards focus more on action fluidity, coordination, and expressiveness, requiring higher levels of algorithm precision[11]. This research develops specialized feature extraction algorithms for arm and leg extension actions based on human kinematic models, focusing on analyzing spatial positions and motion trajectories of limb extremities. By establishing temporal models of joint angles, we quantitatively evaluate action extension amplitude and maintenance duration. Quantitative modeling of facial expression and body coordination is a key technical challenge in artistic scoring[12]. This research proposes a multimodal feature fusion coordination assessment method combining facial expression recognition and body posture analysis, evaluating coordination levels by computing correlation coefficients between expression intensity and action amplitude. The coordination index calculation formula is:
$$C = \sum_{i=1}^{n} w_i r_i$$
where $C$ is the coordination index, $w_i$ is the weight coefficient for the $i$-th feature, $r_i$ is the correlation coefficient between the $i$-th expression feature and corresponding action feature, and $n$ is the total number of features.
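A minimal sketch of the coordination index follows, assuming the correlation coefficient is the Pearson coefficient computed over time series of equal length and that the index is a plain weighted sum. Both choices are assumptions, since the text does not specify the correlation type or any normalization.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)

def coordination_index(expr_feats, action_feats, weights):
    """Weighted sum of per-feature expression/action correlations."""
    return sum(w * pearson(e, a)
               for w, e, a in zip(weights, expr_feats, action_feats))
```

If the weights sum to 1, the index lies in [-1, 1], with values near 1 indicating that expression intensity rises and falls together with action amplitude.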
Music beat and action synchronization analysis is achieved through time-frequency domain signal processing techniques. The algorithm first extracts beat information from music, then analyzes periodic characteristics of athlete actions, calculating time deviations and phase synchronization between them. This technology is significant for scoring dance-type figure skating events, achieving 95.8% recognition accuracy for music beat and action rhythm matching.
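The time-deviation part of the synchronization analysis can be illustrated as follows. The sketch assumes beat times and action event times (for example, detected movement accents) have already been extracted, in seconds; the extraction itself and the phase analysis are outside its scope.

```python
def beat_deviations(beats, events):
    """Signed offset (s) of each action event from its nearest music beat."""
    deviations = []
    for e in events:
        nearest = min(beats, key=lambda b: abs(b - e))
        deviations.append(e - nearest)
    return deviations

def mean_abs_deviation(beats, events):
    """Average absolute timing error between events and nearest beats."""
    d = beat_deviations(beats, events)
    return sum(abs(v) for v in d) / len(d)
```

A positive deviation means the movement lags the beat, and a negative one means it anticipates the beat.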
2.4 Group Theory-Based Action Symmetry Loss Function
Targeting the symmetry characteristics of figure skating actions, we propose a novel loss function design based on group theory. Utilizing the Lie group SO(3) to describe the algebraic structure of 3D rotations, we establish a constrained optimization framework that preserves action essential invariance. This loss function consists of three components: classification loss, group invariance loss, and symmetry preservation loss, with weight ratios of 0.5:0.3:0.2. Group invariance loss ensures recognition result consistency by minimizing feature differences of actions under group actions, while symmetry preservation loss ensures mirror actions have the same recognition confidence. We theoretically prove the convexity and convergence of this loss function, deriving its convergence rate as O(1/√t) under Lipschitz continuous conditions. Experimental validation shows that compared to traditional cross-entropy loss, this method improves accuracy by 4.8% in symmetric action recognition tasks, demonstrating good theoretical guidance value and practical effectiveness.
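The three-part structure of this loss can be sketched as shown here. The concrete forms of the two auxiliary terms are assumptions: group invariance is taken as the mean squared difference between features of an action and of its rotated copy, and symmetry preservation as the squared difference between the confidences of an action and its mirror image. Only the 0.5:0.3:0.2 weighting comes from the text.

```python
def mean_sq_diff(a, b):
    """Mean squared difference between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def symmetry_loss(cls_loss, feat, feat_rotated, conf, conf_mirrored,
                  weights=(0.5, 0.3, 0.2)):
    """Classification + group invariance + symmetry preservation terms."""
    w_c, w_i, w_s = weights
    invariance = mean_sq_diff(feat, feat_rotated)   # features under a rotation
    symmetry = (conf - conf_mirrored) ** 2          # mirrored-action confidence gap
    return w_c * cls_loss + w_i * invariance + w_s * symmetry
```

When the features are perfectly invariant and the mirrored confidence matches, the loss reduces to the weighted classification term alone.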
3 Figure Skating Pose Recognition Algorithm Research
3.1 Hierarchical Recognition Mechanism for Technical Actions
This research constructs a three-layer progressive recognition architecture from basic postures to complete routines, effectively solving the complex hierarchical problems of figure skating technical actions. The bottom-layer basic posture recognition module identifies basic states such as gliding, stationary, and preparatory positions by analyzing spatial configurations of key skeletal nodes. This module uses graph convolutional networks to model human skeletal topological structure, achieving 97.3% recognition accuracy for basic postures. Mid-layer combination action recognition is based on temporal convolutional networks, mapping basic posture sequences to technical action labels such as spiral positions and layback spins[13]. This layer introduces an action template matching mechanism that evaluates action quality by computing similarity between actual actions and standard templates. The action similarity calculation formula is:
$$S(A, T) = \exp\left(-\frac{\lVert \mathbf{f}_A - \mathbf{f}_T \rVert^2}{2\sigma^2}\right)$$
where $S(A, T)$ is the similarity between action $A$ and template $T$, $\mathbf{f}_A$ and $\mathbf{f}_T$ are feature vectors of action $A$ and template $T$ respectively, and $\sigma$ is the standard deviation parameter of the Gaussian kernel.
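The Gaussian kernel similarity is straightforward to compute, as the following sketch shows. The default `sigma=1.0` is an arbitrary illustrative value; the paper does not state the value it uses.

```python
import math

def action_similarity(f_a, f_t, sigma=1.0):
    """Gaussian-kernel similarity between an action and a template feature vector."""
    sq_dist = sum((a - t) ** 2 for a, t in zip(f_a, f_t))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```

The similarity equals 1 when the action matches the template exactly and decays toward 0 as the feature distance grows, with larger `sigma` making the score more tolerant of deviations.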
Top-layer complete routine recognition processes combination action sequences through long short-term memory networks, recognizing composite technical actions such as Axel jumps and layback spins[14]. This layer integrates automatic action difficulty coefficient evaluation algorithms that provide comprehensive scoring based on technical complexity, execution quality, and innovation of actions. Experiments show that this hierarchical recognition mechanism achieves 94.1% recognition accuracy on the international competition standard figure skating action database, with an average processing time of 0.23 seconds per action sequence.
3.2 Multimodal Feature Fusion Recognition Method
Multimodal feature fusion is a key technology for improving figure skating pose recognition accuracy. This research proposes a deep joint modeling framework for skeletal key points and visual textures. The skeletal feature extraction module uses an improved PoseNet network, outputting 3D coordinate information for 17 key joint points and describing geometric features of body postures by computing joint angles and limb length ratios. Visual texture features are extracted through a ResNet-50 backbone network, focusing on visual semantic information such as clothing colors, blade trajectories, and background environments[15]. Kinematics and dynamics feature fusion adopts attention-weighted strategies. Kinematics features include first and second-order derivative information such as velocity, acceleration, and angular velocity, while dynamics features involve physical quantities such as center of gravity changes, momentum conservation, and angular momentum changes[16]. Adaptive allocation of feature weights is achieved through gating networks that dynamically adjust contribution degrees of different modal features based on current action types and execution phases.
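The geometric skeletal features mentioned above, joint angles and limb length ratios, can be computed directly from 3D key point coordinates. The sketch below is illustrative only, and the hip, knee, and ankle coordinates in the usage note are hypothetical.

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (radians) between segments b->a and b->c."""
    u = [p - q for p, q in zip(a, b)]
    v = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_angle)

def limb_length_ratio(a, b, c):
    """Ratio of segment a-b length to segment b-c length."""
    return math.dist(a, b) / math.dist(b, c)
```

For a hip at (0, 1, 0), a knee at (0, 0, 0), and an ankle at (1, 0, 0), the knee angle is a right angle and the thigh-to-shank ratio is 1.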
Temporal consistency constraints ensure recognition result smoothness by minimizing adjacent frame feature differences, effectively avoiding recognition jitter caused by illumination changes or pose estimation errors[17]. Figure 4 details the multimodal feature fusion network architecture design, where the attention weight allocation mechanism can adaptively adjust contribution degrees of various modal features according to different action types.
3.3 Error Action Detection and Correction Suggestions
Automatic detection and correction of erroneous actions is a core function of intelligent training systems. This research establishes an error detection framework based on standard action deviation analysis. This framework first constructs reference models for standard actions, including key posture sequences, time nodes, and quality assessment standards, then identifies error types by comparative analysis of differences between actual executed actions and standard models. The key posture missing detection algorithm judges action completeness by monitoring whether action sequences contain necessary technical elements, such as takeoff posture, peak air posture, and landing buffering posture in jumping actions[18]. Recognition of improper execution mainly targets problems such as insufficient action amplitude, too short duration, and poor body coordination, making automatic judgments by setting quantitative thresholds for technical specifications.
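The key posture missing check and the threshold-based execution check can be sketched as below. The posture labels, the ordering requirement, and the threshold parameters are hypothetical; the paper does not list its label set or threshold values.

```python
REQUIRED_JUMP_POSTURES = ("takeoff", "peak_air", "landing")  # assumed labels

def missing_postures(detected, required=REQUIRED_JUMP_POSTURES):
    """Return required postures that do not appear, in order, in the detected sequence."""
    missing = []
    pos = 0
    for name in required:
        try:
            # Search only after the previously matched posture to enforce ordering.
            pos = detected.index(name, pos) + 1
        except ValueError:
            missing.append(name)
    return missing

def execution_errors(amplitude, duration, min_amplitude, min_duration):
    """Flag improper execution against quantitative thresholds."""
    errors = []
    if amplitude < min_amplitude:
        errors.append("insufficient_amplitude")
    if duration < min_duration:
        errors.append("too_short_duration")
    return errors
```

A sequence such as glide, takeoff, landing would be reported as missing the peak air posture.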
The algorithm adopts a multi-level error classification system, dividing errors into fatal errors, major errors, and minor errors, establishing corresponding deduction standards for each error type[19]. Generation of personalized action improvement suggestions is based on the combination of expert knowledge bases and machine learning algorithms. The system provides targeted improvement plans for each detected error based on athlete technical levels, physical conditions, and historical performance data. Suggestion content includes technical action key points, training methods, and common problem solutions, displayed to users through visualization interfaces showing error locations and improvement directions. This error detection system achieves 89.7% detection accuracy in professional athlete training tests, with generated improvement suggestions receiving 92% recognition from coaches.
3.4 Algorithm Complexity Theoretical Analysis
We conduct comprehensive theoretical complexity analysis of the proposed 3D convolutional neural network. In terms of time complexity, the network forward propagation computational complexity is O(WHTN²), where W, H, T represent input width, height, and temporal dimensions respectively, and N is the number of network channels. The spatiotemporal attention mechanism complexity is O(T²WH + W²HT), significantly reducing the traditional attention complexity of O(T²W²H²) through decomposed computation. In terms of space complexity, network parameter count is O(N²K³), where K is the convolution kernel size, achieving linear growth through parameter sharing and channel grouping strategies. Based on Rademacher complexity theory, we derive the model's generalization error bound as O(√(log N/m)), where m is the number of training samples. Convergence analysis shows that under strongly convex loss function conditions, the algorithm convergence rate is O(1/t). This theoretical analysis provides important guidance for model design, ensuring method scalability and practical deployment feasibility.
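To give a sense of scale for the attention complexity terms, the following sketch evaluates the two expressions exactly as stated above, with constant factors ignored. The input size of 16 frames at 56×56 resolution is an assumed example, not a configuration reported in the paper.

```python
def joint_attention_cost(t, w, h):
    """Operation count proportional to T^2 * W^2 * H^2 (joint attention)."""
    return t ** 2 * w ** 2 * h ** 2

def decomposed_attention_cost(t, w, h):
    """Operation count proportional to T^2*W*H + W^2*H*T (decomposed attention)."""
    return t ** 2 * w * h + w ** 2 * h * t

t, w, h = 16, 56, 56  # assumed feature-map size
joint = joint_attention_cost(t, w, h)
decomposed = decomposed_attention_cost(t, w, h)
ratio = joint / decomposed
```

For this assumed size, the decomposed form needs roughly 700 times fewer operations than joint attention, and the gap widens as the spatial resolution grows.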
4 Model Performance Validation and Optimization Research
4.1 Model Accuracy Validation and Benchmark Comparison Analysis
To comprehensively evaluate the performance of the proposed model, this research designs a specialized evaluation metric system for figure skating that comprehensively considers multi-dimensional indicators such as action recognition accuracy, temporal prediction precision, real-time processing capability, and robustness. For action recognition accuracy, we use classification accuracy, precision, recall, and F1 score as main evaluation metrics, while introducing weighted average precision to handle sample imbalance among different action types. Temporal prediction precision is measured by calculating time deviations between predicted action boundaries and true boundaries, which is important for action segmentation and real-time scoring systems.
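The classification metrics named above can be computed as in the sketch below. Interpreting "weighted average precision" as per-class precision weighted by class support is an assumption; the action labels in the usage note are illustrative.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        metrics[c] = (prec, rec, f1)
    return metrics

def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to class support."""
    support = Counter(y_true)
    metrics = per_class_metrics(y_true, y_pred)
    return sum(support[c] * metrics[c][0] for c in metrics) / len(y_true)
```

With true labels jump, jump, spin, step and predictions jump, spin, spin, step, the jump class has precision 1.0 and recall 0.5, and the support-weighted precision is 0.875.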
Comparison experiments with traditional computer vision methods include mainstream technical approaches such as optical flow, 2D CNN, LSTM, and Transformer[20], conducted on a figure skating dataset containing 1200 action sequences covering four major categories: jumps, spins, steps, and artistic performances. Results show that the research method improves overall recognition accuracy by 11.3% compared to optimal baseline methods, with more pronounced advantages in complex rotational action recognition, achieving accuracy improvements of 15.8%. Model accuracy and computational efficiency balance analysis is conducted through Pareto frontier analysis, showing that the proposed model maintains high precision while achieving real-time processing requirements, establishing a foundation for practical applications. Computational complexity analysis shows that compared to traditional 3D CNN methods, the research model reduces parameter count by 23% and floating-point operations by 18%. Table 3 comprehensively compares performance of different methods on figure skating action recognition tasks, validating the comprehensive advantages of this research method in accuracy and computational efficiency.
Table 3
Model Performance Comparison Analysis
| Method | Overall Accuracy (%) | Jump Recognition (%) | Spin Recognition (%) | Step Recognition (%) | Processing Speed (FPS) | Model Size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| Optical Flow + SVM | 73.2 | 68.5 | 71.8 | 78.9 | 35 | 15.2 |
| 2D CNN + LSTM | 81.7 | 76.3 | 79.2 | 86.4 | 28 | 45.7 |
| 3D ResNet-50 | 87.4 | 82.1 | 85.6 | 91.2 | 22 | 98.3 |
| Transformer-based | 89.1 | 84.7 | 87.3 | 92.8 | 18 | 156.9 |
| Our Method | 94.7 | 91.2 | 92.8 | 96.1 | 25 | 76.4 |
4.2 Generalization Performance Validation and Robustness Testing
Model generalization performance is an important indicator for evaluating practical application value. This research conducts in-depth validation of model generalization capability from multiple dimensions. Cross-athlete body type and technical style generalization testing covers athlete groups of different ages, genders, body types, and technical levels, with test data including three levels: junior, adult, and professional groups. Results show that the model maintains 91.3% recognition accuracy in cross-body-type testing and achieves 89.7% accuracy in cross-technical-style testing, demonstrating good individual adaptability. Robustness testing under different rink environments considers the influence of environmental factors such as lighting conditions, background complexity, camera angles, and image quality.
Testing in indoor standard rinks, outdoor natural ice rinks, and different lighting conditions shows that model performance degradation is controlled within 5% under various environments, demonstrating strong environmental adaptability. Model parameter sensitivity analysis evaluates model stability through perturbation testing of key hyperparameters, including learning rate, batch size, network depth, and attention weights[21]. Analysis results show that the model has good robustness to most hyperparameters, with significant performance degradation only when learning rate changes exceed one order of magnitude. Optimization boundary research determines optimal parameter configurations for the model through grid search and Bayesian optimization methods, providing parameter tuning guidance for practical deployment. Cross-dataset validation experiments are conducted on International Skating Union standard datasets and self-built datasets, with the model maintaining 87.2% recognition accuracy on unseen datasets. Table 4 summarizes robustness testing results under various environmental interference conditions, with confidence interval analysis showing good model stability and reliability.
Table 4
Robustness Testing Results Statistics
| Test Condition | Accuracy Drop (%) | Standard Deviation | Confidence Interval (95%) | Sample Size |
| --- | --- | --- | --- | --- |
| Different Lighting | 2.3 | 0.8 | [1.9, 2.7] | 240 |
| Camera Angle Variation | 3.1 | 1.2 | [2.6, 3.6] | 180 |
| Background Complexity | 1.8 | 0.6 | [1.5, 2.1] | 200 |
| Image Quality Degradation | 4.2 | 1.5 | [3.5, 4.9] | 160 |
| Cross-rink Environment | 2.9 | 1.0 | [2.4, 3.4] | 220 |
| Athlete Body Type | 3.7 | 1.3 | [3.1, 4.3] | 300 |
4.3 Computational Efficiency Optimization and Real-time Validation
Real-time performance is a key requirement for practical intelligent figure skating analysis systems. This research improves model computational efficiency from network structure optimization and hardware acceleration perspectives. Network structure pruning adopts channel-level pruning strategies based on importance scoring, removing redundant and inefficient network structures by analyzing contribution degrees of various convolution channels to final recognition performance. The pruning process consists of three stages: sensitivity analysis, iterative pruning, and fine-tuning recovery, ultimately compressing model parameter count by 35% while maintaining recognition accuracy loss below 2%. Quantization compression technology quantizes network weights from 32-bit floating-point numbers to 8-bit integers, reducing model size by 75% through combined post-training quantization and quantization-aware training with almost no accuracy loss. GPU parallel computing optimization mainly improves memory access patterns for 3D convolution operations, reducing single-frame processing time from 45 milliseconds to 28 milliseconds through techniques such as data prefetching, memory coalesced access, and computation-communication overlap. Additionally, TensorRT inference acceleration framework is used for model graph optimization and layer fusion, further improving inference speed. Precision maintenance technology under real-time processing requirements balances speed and accuracy through adaptive batching and dynamic resolution adjustment.
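The 75% size reduction follows directly from the bit widths, since an 8-bit weight occupies one quarter of a 32-bit weight. The sketch below illustrates one common scheme, symmetric per-tensor quantization, in plain Python. It is an assumed example of the general technique and not the paper's specific combination of post-training quantization and quantization-aware training.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    # Scale maps the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized integers."""
    return [v * scale for v in q]

size_reduction = 1 - 8 / 32  # fraction of storage saved by 32-bit -> 8-bit
```

The round-trip error per weight is bounded by about half the scale. Separately, the reported drop in per-frame time from 45 ms to 28 ms corresponds to throughput rising from about 22 to about 36 frames per second.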
When the system detects processing delays exceeding thresholds, it automatically reduces input resolution or processing frame rate to ensure real-time performance. The final system achieves 25 FPS processing speed on NVIDIA RTX 3080 GPU, meeting real-time analysis requirements while recognition accuracy decreases by only 1.8%.
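The fallback behavior can be sketched as a simple controller. The 40 ms budget is the per-frame time implied by 25 FPS; the resolution bounds, step size, and the 80% headroom rule for restoring resolution are hypothetical values chosen for illustration.

```python
def adjust_resolution(latency_ms, resolution, budget_ms=40.0,
                      min_res=112, max_res=224, step=16):
    """Lower input resolution when over budget; raise it again when there is headroom."""
    if latency_ms > budget_ms:
        return max(min_res, resolution - step)
    if latency_ms < 0.8 * budget_ms:
        return min(max_res, resolution + step)
    return resolution
```

The gap between the two thresholds prevents the controller from oscillating when latency sits close to the budget.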
Conclusion
The 3D convolutional neural network model for figure skating proposed in this research successfully addresses the technical challenges of automatic recognition of complex figure skating actions through systematic innovation across four levels. At the network architecture level, we construct specialized 3D CNN and innovative spatiotemporal attention mechanisms adapted for high-speed rotations; at the action capture level, we achieve refined algorithms for three categories of actions: rotational, jumping, and artistic performance; at the pose recognition level, we establish complete mechanisms for hierarchical recognition and error detection; at the performance validation level, we form comprehensive evaluation systems for accuracy, generalization, and efficiency.
Experiments validate the effectiveness and advancement of the model, providing important support for intelligent sports technology development. However, the research still has certain limitations, such as robustness under extreme lighting conditions requiring improvement and cross-sport generalization capability needing further enhancement. Future work will focus on model lightweight design, multi-sport extension applications, and deep integration with augmented reality technology, promoting intelligent sports technology development toward broader application domains.