Higher Performance Full-Body Tracking Method by Integrating Multiple Tracking Techniques Based on Deep Latent Space
Kazuhiro Esaki
Department of Intelligent Systems, Graduate School of Informatics, Nagoya University, Nagoya, Japan
esaki.kazuhiro.m7@s.mail.nagoya-u.ac.jp
Katashi Nagao
Department of Intelligent Systems, Graduate School of Informatics, Nagoya University, Nagoya, Japan
nagao@i.nagoya-u.ac.jp
Abstract
In this paper, we propose the Deep Latent Space Assimilation Model (D-LSAM), a novel framework for integrating multiple body tracking techniques in XR environments to achieve more precise, real-time motion capture. Inside-Out Body Tracking (IOBT) on VR headsets can accurately track upper-body and finger movements, yet it struggles to capture areas outside the camera’s field of view—particularly the lower body. On the other hand, external-camera or smartphone-based systems can observe the entire body but often suffer from delays or reduced accuracy. The D-LSAM addresses these limitations by combining a Wasserstein autoencoder for pose compression, a Transformer-driven Latent Time-Stepping module for movement prediction, and a cross-attention gating mechanism that adaptively fuses data from various sources. Experimental results confirm that the D-LSAM outperforms both the extended Kalman filter and particle filter-based methods in short- to mid-term motion forecasting. Future work will emphasize faster inference, improved handling of rapid movements, and support for a wider range of devices. Progress in this methodology holds promise for delivering more immersive XR applications and for advancing fields such as medicine, sports, and rehabilitation.
Additional Keywords and Phrases: Body Tracking, Motion Capture, Data Assimilation, Deep Learning, Extended Reality
1 Introduction
Full-body tracking technology plays an important role in fields as diverse as gaming, film making, medical rehabilitation, and sports analysis, enabling real-time motion recognition and tracking. In particular, body tracking technology is being deployed in the XR field to enable self-projection and avatar manipulation to provide a more immersive experience (Caserman et al. 2018; Winkler et al. 2022). Accurate full-body tracking is essential for improving interaction in virtual spaces and is expected to be applied in immersive games, remote work support, and as a support tool for rehabilitation (Cha et al. 2021; Lam et al. 2023; Neidhardt et al. 2023; Obdrzálek et al. 2012).
However, each existing body tracking technology has its own challenges. For example, Inside-Out Body Tracking (IOBT), which is used in VR headsets, can track the upper body and fingers in real time with high accuracy using information from the headset and built-in camera, but it has difficulty tracking the lower body and joints outside the field of view. In contrast, markerless motion capture technology using an external camera can track the entire body, but latency and reduced accuracy for fine movements are problematic. In addition, although such systems can acquire data without a dedicated sensor device, there is a trade-off between tracking accuracy and inference speed, and real-time performance remains an issue (Armitano-Lago et al. 2022; Baldinger et al. 2024).
Each of these techniques has its advantages and challenges, and especially in real-time applications, there is a need for fusion models that combine different techniques to mitigate the limitations of individual methods. In recent years, pose prediction technology (human motion forecasting) has advanced, and methods for predicting future motion by utilizing past motion data have been studied (Aksan et al. 2020; Li et al. 2018; Sofianos et al. 2021; Wang et al. 2023). In particular, methods that utilize deep learning have shown the potential to estimate complex human motion patterns with high accuracy.
In the past, methods such as the extended Kalman filter (EKF) and particle filter have been used to integrate multiple sensor information (Liu et al. 2011; Montañez et al. 2023). However, these methods suffer from poor adaptability to complex human behavior and changes in the environment. In this context, particle filter methods incorporating deep learning have also been studied, and the differentiable particle filter (DPF) (Jonschkowski et al. 2018), designed in a differentiable form for resampling and state updating of particle filters, has been proposed. The DPF is a method for optimal state estimation in combination with neural networks and has been shown to be useful for data-adaptive sequential Bayesian inference, for example, in studies on visual localization (Karkus et al. 2018).
Furthermore, the Deep Latent Space Particle Filter (D-LSPF) (Mücke et al. 2024) has been proposed as a method that combines particle filtering and deep learning. The D-LSPF leverages deep learning to perform time evolution and particle updates in latent space, building on the processing flow of particle filters to integrate multiple datasets, and has been shown to improve accuracy and robustness over conventional filtering methods.
In this study, we propose a new model called the Deep Latent Space Assimilation Model (D-LSAM), which further develops the D-LSPF to achieve more efficient and accurate data fusion by introducing distribution-based prediction and selective fusion in the latent space. The proposed method aims to integrate data from different tracking techniques to achieve more accurate motion representation in XR environments.
2 Related Work
2.1 Body Tracking
Body tracking technology is essential for improving the accuracy of interactions in XR environments. In particular, in virtual spaces, it plays a role in enhancing the immersive experience by accurately reflecting the user's body movements, enabling avatar manipulation and self-projection (González-Franco et al. 2020; Park et al. 2025). In recent years, this technology has been applied not only to games and entertainment but also to fields such as medical rehabilitation and remote work support, and various methods have been proposed to accurately track full-body movements (Jiang et al. 2022; Obdrzálek et al. 2012). In particular, applications such as sports training and rehabilitation require accurate reflection of movements without delay, and research is underway to meet these requirements (Lam et al. 2023; Obukhov et al. 2023; Suo et al. 2024).
Body tracking methods in XR can be broadly classified into sensor-based and markerless methods. In the sensor-based method, a dedicated device such as the HTC VIVE Tracker is worn to acquire position and posture information for highly accurate tracking (Caserman et al. 2018; Vox et al. 2021). This method has excellent real-time performance and is suitable for VR games and training applications, but it requires that users wear an additional device, which increases the burden on them. On the other hand, markerless methods use depth or stereo cameras and machine learning algorithms to estimate user motions (Fortini et al. 2023). In particular, techniques such as QuestSim have been developed to simulate reasonable full-body movements from minimal sensor information (Winkler et al. 2022), and they are expected to improve the accuracy of tracking using only a head-mounted display and controller.
2.2 Pose Estimation
Pose estimation is a technique for identifying the joint positions of the human body and is the basis for 2D and 3D body tracking. In recent years, the development of deep learning has enabled highly accurate and real-time posture estimation, which has found a wide range of applications in gaming, medical rehabilitation, and sports analysis (Zhou et al. 2023).
In 2D pose estimation, the joint positions of the human body are estimated on the image plane using a monocular camera. Typical methods include OpenPose (Cao et al. 2018), BlazePose (Bazarevsky et al. 2020; Grishchenko et al. 2022), and RTMPose (Jiang et al. 2023). OpenPose detects keypoints of multiple people in real time and is used in numerous applications. BlazePose, on the other hand, is optimized for mobile devices and provides highly accurate real-time pose estimation using 33 keypoints. RTMPose, a highly efficient multi-person pose estimation model based on the MMPose framework (Chen et al. 2019), excels in achieving a balance between speed and accuracy.
In 3D pose estimation, depth information is estimated from 2D images to recover joint positions in 3D space. Two types of methods exist: those that use a monocular camera and those that utilize multiple views. For the monocular approach, MotionAGFormer (Mehraban et al. 2023) has been proposed, which combines a transformer and a graph convolutional network (GCN) and is capable of highly accurate estimation considering local joint relationships. For multi-view methods, a learnable triangulation technique has been developed, which integrates information from different camera viewpoints to achieve highly accurate 3D pose estimation (Iskakov et al. 2019).
Conventional optical motion capture is highly accurate but requires a dedicated environment and marker attachment. Markerless motion capture, on the other hand, can acquire natural motions without restrictions and is expected to be used in clinical fields and sports analysis (Wade et al. 2022). In particular, OpenPose-based systems using multiple cameras have improved 3D accuracy (Nakano et al. 2019). However, many issues remain unresolved, such as the effects of ambient light and errors associated with fast movements. Selecting the appropriate method for a given application therefore requires an understanding of the advantages and limitations of these techniques.
2.3 Human Pose Forecasting
Human pose prediction plays an important role in interactive applications, especially in VR and AR. Pose prediction techniques are essential for improving motion recognition and interaction, as predicting future poses can provide a more natural and immersive experience. Methods based on deep learning have attracted much attention in this field and have contributed to improving the accuracy of pose prediction.
In recent years, new architectures have been proposed to capture spatial and temporal correlations. For example, the space-time-separable graph convolutional network (STS-GCN), which processes spatial and temporal features separately, has achieved excellent accuracy in long-term pose prediction (Sofianos et al. 2021), and GCNext, which pursues a unified GCN approach, has greatly improved prediction efficiency (Wang et al. 2023). These methods strike a better balance between computational cost and accuracy than previous methods and are setting a new standard in motion prediction.
In addition, methods utilizing sequence-to-sequence models and transformers have emerged, showing particularly strong performance in predicting complex behaviors: CNN-based sequence-to-sequence models enable highly accurate prediction of dynamics (Li et al. 2018), and Transformer models have been shown to be more accurate than conventional RNNs in long-term prediction (Aksan et al. 2020).
These developments are leading to more accurate and efficient motion prediction. However, improving real-time performance and computational efficiency remains an important challenge.
2.4 Differentiable Particle Filter
The differentiable particle filter (DPF) introduces differentiability to conventional particle filters to enable integration with deep learning models and end-to-end learning and to improve performance in nonlinear state estimation problems. Conventional particle filters have been widely used for state estimation in nonlinear state-space models (Liu et al. 2011; Montañez et al. 2023). However, they do not assume differentiability of the model, making integration with deep learning models and end-to-end learning difficult.
Jonschkowski et al. (2018) implemented particle filters in a differentiable form to improve the performance of state estimation by making behavioral and observational models trainable. Their method makes the structure of the particle filter differentiable end-to-end and incorporates algorithmic prior knowledge to make model learning more efficient.
Chen and Li (2023) reviewed the latest advances in methods for building each component of a particle filter using neural networks and optimizing them by gradient descent. They examined in detail the design choices for the main components of a DPF, including the dynamic model, observation model, proposal distribution, optimization objectives, and differentiable resampling technique.
In addition, research on DPFs incorporating normalizing flow has received considerable attention. Normalizing flow is a method of constructing complex distributions by applying a series of invertible nonlinear transformations to a simple initial distribution to achieve a more expressive approximation (Rezende and Mohamed 2015). This approach has been reported to increase the flexibility of the proposal distribution and dynamic model and to improve state estimation performance in complex real-world scenarios (Chen et al. 2021, 2024).
Finally, Mücke et al. (2024) proposed a new neural network-based particle filtering method called the Deep Latent Space Particle Filter (D-LSPF). The method uses a Wasserstein autoencoder (WAE) and a modified vision transformer layer to map high-dimensional data into a low-dimensional latent space. It then filters in that latent space and performs time evolution with a transformer model, which significantly improves computational efficiency. In multiple test cases, their method was shown to improve computational speed and accuracy by up to an order of magnitude over conventional high-precision particle filters.
3 Data Assimilation Based on Deep Latent Space
We propose a new method, the Deep Latent Space Assimilation Model (D-LSAM), which integrates multiple tracking data using deep learning. Unlike conventional DPFs, the D-LSAM performs state estimation based on distributions and sampling without using actual particles. It is based on the DPF framework and consists of three main processes: Transition Modeling, Observation Modeling, and Differentiable Resampling. In this study, these components were implemented using the WAE (Tolstikhin et al. 2017) and Transformer (Vaswani et al. 2017).
The novelty of this method is that it establishes a framework for adaptively integrating different observation data by fusing observation data through cross-attention and introducing selective state updating using a gating mechanism that is based on Mamba's selection mechanism (Gu and Dao 2023). In this section, each component of the proposed method is described in detail.
As shown in Fig. 1, the WAE encoder is used to compress the observed data into latent space and generate an initial distribution. Transitions by Latent Time-Stepping are performed on the data sampled from the distribution to generate the predictive distribution zpred. Next, the observed data zobs are used to modify the predictive distribution to the next state. Finally, the sampled latent variables are reconstructed by the WAE decoder to estimate the 3D location of the joint points.
Fig. 1 Overview of D-LSAM.
3.1 Pose Auto-Encoder
In this study, we used the WAE (Tolstikhin et al. 2017) to compress human pose information. The WAE maps input pose data into a low-dimensional latent space and reconstructs the data from that space, which is essential for modeling time-series transitions in the latent space. Specifically, the encoder takes the 3D positions of the full-body joint points as input and maps them to the latent space through a GCN (Kipf and Welling 2016; Ullah et al. 2019), which captures the relationships among joint points in a pose and encodes joint motion efficiently. The decoder, on the other hand, consists of three fully connected layers and reconstructs the original 3D positions from the latent representation. With this encoder and decoder configuration, the WAE converts between pose data and their latent representation (Fig. 2).
Fig. 2 Architecture of Pose Auto-Encoder.
In training the WAE, the loss function described in the D-LSPF study was used. First, the mean squared error (MSE) is used to minimize the reconstruction error. Next, the maximum mean discrepancy (MMD) loss is added so that the distribution of the latent space approaches the standard normal distribution. This MMD loss aligns the structure of the latent space with the normal distribution, allowing the latent representation to generalize better (Gretton et al. 2007). A consistency loss is also combined with the MMD loss to ensure consistency between the encoder and decoder. It is calculated as the MSE between the original latent vectors and the latent vectors obtained by passing the output of the decoder through the encoder again. The relative weights of the MSE, MMD, and consistency losses affect the model's convergence speed and reconstruction accuracy, so they must be tuned as hyperparameters.
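The three loss terms described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the RBF kernel, its bandwidth, the weights `w_mmd` and `w_cons`, and all function names are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(z, prior_samples, bandwidth=1.0):
    """Biased estimate of the squared MMD between latents and prior samples."""
    k_zz = rbf_kernel(z, z, bandwidth).mean()
    k_pp = rbf_kernel(prior_samples, prior_samples, bandwidth).mean()
    k_zp = rbf_kernel(z, prior_samples, bandwidth).mean()
    return k_zz + k_pp - 2.0 * k_zp

def wae_loss(x, x_recon, z, z_reencoded, prior_samples, w_mmd=1.0, w_cons=1.0):
    """Weighted sum of reconstruction, MMD, and consistency terms."""
    mse = np.mean((x - x_recon) ** 2)        # reconstruction error
    mmd = mmd_squared(z, prior_samples)      # latent distribution vs. N(0, I) samples
    cons = np.mean((z - z_reencoded) ** 2)   # encoder-decoder consistency
    return mse + w_mmd * mmd + w_cons * cons
```

Here `prior_samples` would be drawn from a standard normal distribution with the same shape as `z`, and `z_reencoded` is the encoder output for the decoded poses.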
3.2 Latent Time-Stepping
Latent Time-Stepping is the key component for predicting time series in the latent space and serves as the transition model of the integrated model. It employs the Transformer encoder-decoder architecture and is pre-trained to predict future poses from past pose sequences. Specifically, the 3D positions of the joint points are given as input and are first compressed into low-dimensional latent variables by the WAE encoder. The Transformer then learns temporal transitions over these latent variables and predicts future states. The prediction is reconstructed by the WAE decoder and restored to the original 3D position data. The architecture during the pre-training phase is shown in Fig. 3.
Fig. 3 Architecture of Latent Time-Stepping during pre-training.
One feature of this model is that it accounts for uncertainty, following the variational autoencoder (VAE) (Kingma and Welling 2013). In conventional models, forecasts are made deterministically as a single point estimate, but in Variational Latent Time-Stepping, the model outputs the mean and log variance of the latent variable, which are then used as the basis for stochastic forecasts of future poses. This allows the model to represent uncertainty in forecasting and enables more flexible forecasting.
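Sampling from a predicted mean and log variance is commonly done with the reparameterization trick used in VAEs. The sketch below assumes a diagonal Gaussian in latent space; the function name `sample_latent` and the latent dimensionality are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_latent(mu, log_var, rng, n_samples=1):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    sigma = np.exp(0.5 * log_var)                      # log variance -> std. dev.
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])        # hypothetical 3-dimensional latent mean
log_var = np.array([0.0, -1.0, -2.0])  # hypothetical per-dimension log variance
z_samples = sample_latent(mu, log_var, rng, n_samples=4)  # shape (4, 3)
```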
Furthermore, the work of Chen et al. (2021, 2024) has shown that more complex probability distributions can be represented by using normalizing flow. Based on this, this study introduces normalizing flow into the prediction in latent space to enable the representation of complex motion patterns that cannot be captured by a simple Gaussian distribution. Specifically, normalizing flow, expressed as a sequence of invertible transformations, is applied to the predicted latent variables to realize more flexible modeling of probability distributions.
During training, the negative log-likelihood (NLL) loss, shown in the following equation, was used as a loss function to improve forecast accuracy. The NLL loss works to minimize the deviation from the actual data based on the predicted probability distribution. This loss term plays an important role in obtaining more appropriate results when making forecasts that incorporate uncertainty.
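$$\mathcal{L}_{\mathrm{NLL}} = -\log p_\theta(x) \tag{1}$$
Here, $p_\theta$ denotes the predicted probability distribution and $x$ the actual data.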
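For intuition, the NLL can be written out for the simplest case of a diagonal Gaussian parameterized by a mean and log variance. This sketch deliberately omits the log-determinant terms that a normalizing flow would add, and the function name `gaussian_nll` is an assumption for illustration.

```python
import numpy as np

def gaussian_nll(z_true, mu, log_var):
    """Mean negative log-likelihood of z_true under N(mu, diag(exp(log_var)))."""
    per_dim = 0.5 * (np.log(2.0 * np.pi) + log_var
                     + (z_true - mu) ** 2 / np.exp(log_var))
    # Sum over latent dimensions, then average over the batch.
    return float(np.mean(np.sum(per_dim, axis=-1)))
```

A larger predicted variance reduces the penalty for a given deviation but is itself penalized through the `log_var` term, which is what lets the model express uncertainty without collapsing to arbitrarily wide distributions.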
In addition, Soft-Dynamic Time Warping (DTW) loss (Cuturi and Blondel 2017) has also been used to measure the similarity of time series data. DTW adequately captures temporal variation by minimizing the temporal distortion between predicted and actual poses.
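$$\mathcal{L}_{\mathrm{DTW}}(\hat{x}, x) = -\gamma \log \sum_{\pi \in \mathcal{A}(\hat{x}, x)} \exp\left(-\frac{1}{\gamma} \sum_{(i,j) \in \pi} d(\hat{x}_i, x_j)\right) \tag{2}$$
Here, $\mathcal{A}(\hat{x}, x)$ denotes the set of all paths between sequences $\hat{x}$ and $x$, $d(\hat{x}_i, x_j)$ represents the vector distance between corresponding time points, and $\gamma > 0$ is the smoothing parameter. This loss term is important for improving the prediction accuracy of time series data.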
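Soft-DTW is usually evaluated with a dynamic program in which the hard minimum of classical DTW is replaced by a smooth minimum. The sketch below follows that standard recursion from Cuturi and Blondel (2017); the squared Euclidean cost and the value of `gamma` are assumptions, since the paper's exact settings are not stated here.

```python
import numpy as np

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.sum(np.exp(v - m))))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW between sequences x (n, d) and y (m, d), squared Euclidean cost."""
    n, m = len(x), len(y)
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Accumulate the cost over the three admissible predecessor cells.
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[n, m]
```

Because the alignment may stretch or compress time, a prediction that is correct in shape but slightly early or late is penalized far less than it would be under a frame-by-frame MSE.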
3.3 Latent Space Data Assimilation
The D-LSAM performs state estimation based on distributions and sampling without using actual particles, and therefore neither calculates similarity to observed data nor performs resampling. As an alternative to updating the distribution by resampling, a cross-attention mechanism is introduced to fuse the observed data and generate a proposed distribution. Furthermore, a gating mechanism is introduced for selective state updating, which achieves efficient state updates while making effective use of both predictions and observed data.
Fig. 4 Observed data fusion and selective state updates.
As shown in Fig. 4, a proposed distribution is generated based on the predicted distribution and observed data (embedded representation). Then, from the current and proposed distributions, important information is selectively updated to the next state using the gating mechanism.
3.3.1 Distribution Proposal Using Cross-Attention
Cross-attention, shown in Fig. 5, was used to integrate the predictive distribution and the observed data. Cross-attention is applied to the parameter zpred (µpred, log σpred) of the predictive distribution at time t and the latent representation zobs of the observed data to estimate the parameter zprop (µprop, log σprop). This process generates a proposed distribution that takes into account the observed data.
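As a rough illustration of this fusion step, the sketch below implements single-head scaled dot-product cross-attention in NumPy, with the predictive parameters as queries and the observation embeddings as keys and values. How the paper tokenizes its inputs is not specified here, so treating µpred and log σpred as two query tokens and each tracking source as one context token, as well as the random projection matrices, are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries attend to the context tokens."""
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8                                   # hypothetical latent dimensionality
z_pred = rng.standard_normal((2, d))    # rows: mu_pred, log_sigma_pred
z_obs = rng.standard_normal((2, d))     # rows: e.g. IOBT and TDPT embeddings
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
z_prop, attn = cross_attention(z_pred, z_obs, w_q, w_k, w_v)
mu_prop, log_sigma_prop = z_prop        # one output row per query token
```

The attention weights indicate how strongly each observation source contributes to each proposed parameter, which is what makes the fusion adaptive rather than a fixed average.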
Fig. 5 Fusion of observed data and generation of proposed distributions using cross-attention.
3.3.2 Selective Update with Gating Mechanism
To transition smoothly from the current state to the proposed distribution, a gating mechanism was constructed with reference to Mamba's selection mechanism, as shown in Fig. 6. The gate value is calculated dynamically from the parameters of the current state distribution and those of the proposed distribution, and is output in the range of 0 to 1 using cross-attention, fully connected layers, and the sigmoid function. The final state update is calculated as a weighted sum of the distributions using the gate values, as shown in the following equations:
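$$\mu_{\mathrm{next}} = g\,\mu_{\mathrm{prop}} + (1 - g)\,\mu_{\mathrm{cur}} \tag{3}$$
$$\log \sigma_{\mathrm{next}} = g\,\log \sigma_{\mathrm{prop}} + (1 - g)\,\log \sigma_{\mathrm{cur}} \tag{4}$$
Here, $g$ is the gate value, and the subscripts cur, prop, and next denote the current state distribution, the proposed distribution, and the updated state, respectively.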
Since the gate value is dynamically determined by considering the current state and the confidence level of the proposed distribution, the gate value will be small and past information will be prioritized when the uncertainty of the proposed distribution is high. On the other hand, when the confidence level of the proposed distribution is high, the gate value becomes large and new information is more strongly reflected. This mechanism enables adaptive state updating while reducing the effect of noise.
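The behavior described above can be checked with a small numerical sketch. It assumes the update is a convex combination of the current and proposed distribution parameters; in the model the gate would be produced by cross-attention and fully connected layers, whereas here `gate_logit` is simply passed in as a hypothetical input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(mu_cur, log_sigma_cur, mu_prop, log_sigma_prop, gate_logit):
    """Blend current and proposed Gaussian parameters with a gate in (0, 1)."""
    g = sigmoid(gate_logit)
    mu_next = g * mu_prop + (1.0 - g) * mu_cur
    log_sigma_next = g * log_sigma_prop + (1.0 - g) * log_sigma_cur
    return mu_next, log_sigma_next

mu_cur, ls_cur = np.zeros(4), np.zeros(4)
mu_prop, ls_prop = np.ones(4), -np.ones(4)
# A strongly negative logit keeps the state close to the current distribution.
mu_low, _ = gated_update(mu_cur, ls_cur, mu_prop, ls_prop, gate_logit=-4.0)
# A strongly positive logit moves the state close to the proposal.
mu_high, _ = gated_update(mu_cur, ls_cur, mu_prop, ls_prop, gate_logit=4.0)
```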
Fig. 6 Gate mechanism architecture.
4 Experiments on Multi-Source Pose Data Fusion
To validate the proposed method, experiments were conducted using motion capture data obtained from IOBT on the Meta Quest 3 VR headset and from TDPT (ThreeD Pose Tracker), a smartphone application. IOBT uses the built-in camera of the VR headset and can track the upper body, including hands and fingers, in real time and with high accuracy, but movements outside the headset's field of view are difficult to track. TDPT performs markerless motion capture using the built-in camera of a smartphone and can track full-body movements, but its real-time performance is limited. The proposed method aims to integrate these data to achieve more robust and accurate motion estimation.
Data for fine-tuning the trained model and for testing were collected in the environment shown in Fig. 7.
Fig. 7 Experimental environment for data collection.
4.1 Dataset and Preprocessing
AIST++ (Li et al. 2021; Tsuchida et al. 2019), a large dance movement dataset, was used for pre-training each component of the proposed method because it contains various dance movements performed by multiple dancers, providing a large amount of dynamic human movement data together with highly accurate pose annotations.
In this study, in order to construct a dataset including a variety of movements, three categories (daily movements, dance movements, and sports movements) were established, and data on ten different movements were collected twice each. The daily movements included basic movements such as “walking,” “sitting,” and “talking,” the dance movements included choreographic moves including basic steps such as “up-down,” “box step,” and “running man,” and the sports movements included movements such as “squatting,” “throwing,” and “serving.” For each movement, both IOBT and TDPT were used to record data for 54 joint positions and rotations that conform to Unity Humanoid. Each movement was recorded for 10 s and sampled at 60 Hz. Examples of the collected motion data are shown in Fig. 8.
Fig. 8 Examples of movement data collected from IOBT (upper body data) and TDPT (full body data).
In addition, to obtain ground truth (GT) data, four cameras were used to capture each movement from different viewpoints, and existing pose estimation methods such as MediaPipe were applied to create highly accurate data. The 3D joint positions were obtained by integrating the information obtained from multiple viewpoints.
GT data were acquired using a markerless motion capture system with multi-view cameras to obtain 3D skeleton data. A shooting environment of approximately 3.5-m square was set up. RGB cameras capable of synchronized recording at full HD resolution and 60 fps were installed at each corner to simultaneously capture the subject's full-body movements from multiple viewpoints. The position and orientation of each camera were calibrated beforehand to estimate internal and external parameters.
The MediaPipe Pose model (Bazarevsky et al. 2020; Grishchenko et al. 2022) was applied to each acquired viewpoint image to extract the 2D coordinates of 33 joint points. MediaPipe Pose is a lightweight, real-time pose estimation library that outputs 33 landmark points corresponding to a standard skeleton structure. These were combined with camera parameters to perform triangulation based on multi-view geometry, and the 3D joint positions for each frame were reconstructed using singular value decomposition (SVD). The SVD-based method is a standard technique for algebraically integrating information from multiple viewpoints. However, errors in 2D estimation or occlusions can sometimes result in unnatural joint placements. Therefore, anatomical constraints, such as bone lengths, were introduced to correct the 3D reconstruction results, ensuring consistency with human anatomy. This enabled anatomically plausible 3D pose estimation that is robust against noise. The final joint point data were converted to the Unity Humanoid format using a proprietary algorithm and further smoothed temporally by applying the One Euro Filter (Casiez et al. 2012). This series of processes enables the generation of high-precision GT motion data from multi-view video.
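The SVD-based triangulation step can be illustrated with the standard direct linear transform (DLT). This is a generic sketch under the assumption that each camera is described by a calibrated 3x4 projection matrix; it does not include the bone-length correction or smoothing, and the function names are hypothetical.

```python
import numpy as np

def project(projection, point_3d):
    """Project a 3D point to pixel coordinates with a 3x4 projection matrix."""
    x = projection @ np.append(point_3d, 1.0)
    return x[:2] / x[2]

def triangulate_point(projections, points_2d):
    """Triangulate one joint from two or more views via DLT and SVD."""
    rows = []
    for p, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * p[2] - p[0])
        rows.append(v * p[2] - p[1])
    a = np.stack(rows)
    _, _, vt = np.linalg.svd(a)
    x = vt[-1]                 # right singular vector of the smallest singular value
    return x[:3] / x[3]        # dehomogenize
```

With noisy 2D detections the constraints are no longer exactly consistent, and the smallest singular vector gives the algebraic least-squares solution, which is why a subsequent anatomical correction can still be useful.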
4.2 Model Training and Evaluation
The proposed method requires pre-training of two key components: the WAE and Latent Time-Stepping. For this purpose, we first pre-trained on the AIST++ dataset and then performed fine-tuning on the original dataset. One hundred dance sequences (30,000 frames in total) from the AIST++ dataset were used for training. For the original dataset, 3D pose data recorded by the multiple cameras were used as GT, and the two recordings of each movement were split between training and testing at a ratio of 1:1.
In training the WAE, conversion to the Unity Humanoid format and standardization were performed as preprocessing, and scaling, rotation within the horizontal plane, and random noise were applied as data augmentation.
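The augmentation described above might look like the following sketch. The scale range, noise level, and the choice of the y-axis as the vertical axis (as in Unity's coordinate system) are assumptions for illustration; the paper does not give these values here.

```python
import numpy as np

def augment_pose(joints, rng, scale_range=(0.9, 1.1), noise_std=0.005):
    """Randomly scale, rotate about the vertical (y) axis, and jitter a pose.

    joints: array of shape (n_joints, 3) holding x, y, z positions.
    """
    scale = rng.uniform(*scale_range)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(angle), np.sin(angle)
    # Rotation about the y-axis leaves heights unchanged.
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    noise = rng.normal(0.0, noise_std, size=joints.shape)
    return scale * joints @ rot_y.T + noise
```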
As hyperparameters, we tuned the number of GCN blocks and units in the encoder of the WAE, the number of layers and units in the fully connected layer of the decoder, as well as the batch size, learning rate, and weights of each loss function.
Optuna (Akiba et al. 2019) was used to optimize the hyperparameters. After training, the average reconstruction error of the WAE was 26.1 mm, confirming sufficient reconstruction accuracy.
For Latent Time-Stepping training, the frame rate was unified to 30 fps, and the model was trained to predict motions up to 0.4 s (12 frames) into the future from the past 1 s (30 frames) of data. For comparison, Table 1 also shows the results of training STS-GCN (Sofianos et al. 2021), an existing pose prediction model, under the same conditions.
Table 1 Training results of pose prediction models (MPJPE in m, by prediction horizon)

| Model | 100 ms | 400 ms |
|---|---|---|
| STS-GCN | 0.082 | 0.152 |
| Ours | 0.079 | 0.154 |

a Training was performed using the Latent Time-Stepping method with a frame rate unified to 30 fps, predicting motions up to 0.4 s (12 frames) into the future from the past 1 s (30 frames) of data. The results were compared under the same conditions with the STS-GCN model (Sofianos et al. 2021).
The average error at 100 ms was about 8 cm, confirming that relatively natural motion prediction was possible in the short-term prediction.
Furthermore, we conducted a detailed examination of the prediction model architecture. First, we compared varying input sequence lengths of 12, 18, 30, and 60 frames. When the input was too short (12 frames), prediction accuracy decreased due to insufficient information. Conversely, when the input was too long (60 frames), while the amount of effective information did not increase, errors tended to accumulate more easily. Experimental results confirmed that an input length of 30 frames yielded the best accuracy, achieving a balance between short-term and long-term prediction.
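The input and output lengths discussed above correspond to how training pairs are cut from each recording. A minimal sketch of such windowing is shown below; the function name, the stride, and the array layout are assumptions.

```python
import numpy as np

def make_windows(sequence, n_in=30, n_out=12, stride=1):
    """Cut a (frames, ...) sequence into (input, target) training pairs."""
    inputs, targets = [], []
    for start in range(0, len(sequence) - n_in - n_out + 1, stride):
        inputs.append(sequence[start:start + n_in])
        targets.append(sequence[start + n_in:start + n_in + n_out])
    return np.stack(inputs), np.stack(targets)

# A 10 s recording at 30 fps with 54 joints (placeholder zeros, not real data).
recording = np.zeros((300, 54, 3))
past, future = make_windows(recording)   # 30-frame inputs, 12-frame targets
```

Changing `n_in` to 12, 18, or 60 reproduces the other input lengths that were compared, at the cost of fewer or more frames of context per sample.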
For the loss function, we employed a composite loss combining the conventional MSE with the Dynamic Time Warping (DTW) and Negative Log-Likelihood (NLL) losses introduced in Section 3.2. Comparative experiments showed no significant difference between MSE alone and MSE + DTW, but a weighted combination of MSE, DTW, and NLL ultimately yielded the most stable and highest accuracy.
Regarding the network architecture, we tuned hyperparameters such as the number of heads in the multi-head attention, the number of blocks, and the embedding dimension in the Transformer encoder–decoder.
To verify the effectiveness of the proposed method, the accuracy of pose integration was evaluated and compared with that of previous methods. Delays of 100 and 400 ms were assumed for the inference, and the results are shown in Tables 2 and 3, which compare the prediction accuracy by category. Mean per joint position error (MPJPE) was employed as the evaluation index; MPJPE represents the mean error of the reconstructed 3D position of each joint point, with smaller values indicating more accurate estimation.
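MPJPE as defined above is the Euclidean distance between predicted and reference joint positions, averaged over joints and frames. A minimal sketch, assuming both arrays have shape (frames, joints, 3) and are expressed in metres:

```python
import numpy as np

def mpjpe(pred, target):
    """Mean per joint position error over all frames and joints."""
    per_joint_error = np.linalg.norm(pred - target, axis=-1)  # (frames, joints)
    return float(np.mean(per_joint_error))

# Hypothetical example: every joint is displaced by 3 cm in x and 4 cm in y.
target = np.zeros((10, 54, 3))
pred = target + np.array([0.03, 0.04, 0.0])
error = mpjpe(pred, target)   # 0.05 m, since sqrt(0.03^2 + 0.04^2) = 0.05
```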
As a baseline for the evaluation, we employed the D-LSPF, which uses the same components as the proposed method but performs particle-based weight calculations and soft resampling (Jang et al. 2016) in the observation updates. Similar experiments were also conducted with a traditional method, the EKF, which is widely used in sensor fusion as it estimates the internal state of the system while correcting it with observed data.
Table 2 Evaluation accuracy assuming 100 ms delay (MPJPE in m, by category)

| Method | Daily | Dance | Sport |
|---|---|---|---|
| EKF | 0.0995 | 0.1252 | 0.1537 |
| D-LSPF | 0.1299 | 0.1437 | 0.1680 |
| Ours | 0.0650 | 0.1008 | 0.1347 |

a This experiment assumed a 100-ms delay in the MPJPE calculation.
Table 3 Evaluation accuracy assuming 400 ms delay (MPJPE in m, by category)

| Method | Daily | Dance | Sport |
|---|---|---|---|
| EKF | 0.1161 | 0.2111 | 0.1961 |
| D-LSPF | 0.1392 | 0.1733 | 0.1829 |
| Ours | 0.0814 | 0.1270 | 0.1516 |

a This experiment assumed a 400-ms delay in the MPJPE calculation.
Experimental results confirmed that the proposed method achieves the lowest MPJPE in all categories at both delay settings, indicating highly accurate pose estimation. This improvement may be attributed to the distribution-based approach of the proposed method, which allows more flexible prediction and data integration in the latent space.
The overall large error of the D-LSPF is likely because the observed data themselves contain large errors, so that particle-based resampling, which pulls the estimate toward the observations, in fact increased the error. This supports the effectiveness of the proposed method, which balances the forecast against the observation and therefore remains flexible even when the observed data are noisy. In the 400-ms forecast, the EKF error grew, especially in the Dance and Sport categories, which suggests that the EKF cannot capture complex human movements in longer-term forecasts. In this respect, deep learning-based motion prediction was effective for longer-term prediction.
5 Architectural Changes for Improved Processing Efficiency
This section describes architectural improvements aimed at accelerating inference for the practical deployment of the proposed D-LSAM. We first measured the inference time of the initial implementation and analyzed its bottlenecks, and then designed four improvement strategies: replacing the prediction module with a lighter-weight architecture, integrating the future prediction and observation fusion modules, converting sequential prediction to batch processing, and performing fusion in the latent space centered on the observation data. The impact of each approach on accuracy and processing speed was evaluated. The section first reports the measured inference times, then presents the design principles of each improvement, and concludes with a comprehensive comparison and discussion.
5.1 Background of Acceleration and Analysis of Computational Bottlenecks
In XR environments, practical full-body pose estimation requires both high estimation accuracy and real-time processing. Particularly in interactions involving locomotion or physical actions by the user, system latency directly affects user experience and physical discomfort, which makes latency reduction critical (Choi et al. 2018; Warburton et al. 2023). Previous research indicates that latency on the order of a few milliseconds to a few tens of milliseconds is desirable for typical VR applications (Warburton et al. 2023). Achieving high-accuracy full-body pose estimation within these constraints is technically challenging, and the challenge is particularly acute on standalone XR devices such as the Meta Quest 3, where computational resources are limited. To perform real-time full-body pose estimation while sharing computational resources with other tasks such as real-time rendering and physics simulation, it is essential to improve processing efficiency and reduce model size while preserving estimation accuracy as far as possible.
Figure 9 shows the input processing and full-body motion estimation flow in the initial implementation of D-LSAM proposed in Section 4. At each time step t, the model receives two types of input: real-time upper-body data at time t and full-body data at time t − N with an N-frame delay. First, the model uses the past series of full-body data to predict the full-body state for the next frame. The prediction obtained at this stage reflects the overall movement context of the subject but lacks detailed information about the upper body. Next, the predicted full-body state is corrected using the real-time upper body data to adjust for pose discrepancies and align it with the current movement. By iteratively applying this two-step process of prediction and observation-based correction, the full-body data at time t is estimated incrementally.
Fig. 9
Process flow in D-LSAM.
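The two-step loop of prediction and observation-based correction can be sketched as below. This is only a structural illustration under strong simplifications: the state is a plain list whose first `n_upper` entries stand for the upper body, `predict_next` is a constant-velocity stand-in for the learned latent predictor, and `correct` is a simple blend standing in for the learned cross-attention fusion. All three function names are hypothetical.

```python
def predict_next(history):
    """Stand-in predictor: constant-velocity extrapolation of the last two states."""
    prev, last = history[-2], history[-1]
    return [2 * b - a for a, b in zip(prev, last)]

def correct(predicted, upper_obs, n_upper, gain=1.0):
    """Stand-in correction: pull the upper-body entries toward the observation."""
    out = list(predicted)
    for i in range(n_upper):
        out[i] += gain * (upper_obs[i] - out[i])
    return out

def estimate_current(delayed_history, upper_obs_seq, n_upper):
    """Roll the delayed full-body history forward, one frame per observation.

    delayed_history: full-body states up to time t - N.
    upper_obs_seq:   real-time upper-body observations for the N missing frames.
    """
    history = [list(state) for state in delayed_history]
    for upper_obs in upper_obs_seq:
        predicted = predict_next(history)                       # step 1: predict
        history.append(correct(predicted, upper_obs, n_upper))  # step 2: correct
    return history[-1]
```

Because each iteration depends on the previous corrected state, the N steps cannot run in parallel.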
Next, we measured the inference time of the initial D-LSAM implementation. The measurement device was a Meta Quest 3; inference ran in a standalone app using Unity Sentis, with the model converted to ONNX format. The first ten inferences were discarded as warm-up, and the average inference time was computed from the total time of the subsequent 100 inferences. The result was an average inference time of approximately 100 ms. Although the model achieves high accuracy, this speed must be improved for real-time processing in XR environments.
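The measurement protocol (discard ten warm-up runs, then average over 100 timed runs) can be expressed generically as follows. The actual measurements were taken on-device with Unity Sentis. This Python version only illustrates the protocol, and `run_once` is a hypothetical callable wrapping a single inference.

```python
import time

def average_inference_ms(run_once, warmup=10, runs=100):
    """Average wall-clock time per call in milliseconds, excluding warm-up calls."""
    for _ in range(warmup):
        run_once()  # warm-up: caches, allocations, lazy initialization
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs
```

Timing the whole batch of runs and dividing by the run count, rather than timing each run separately, reduces the influence of timer overhead on short inferences.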
A detailed analysis of this processing time revealed the following average time distribution (Table 4).
Table 4
Breakdown of D-LSAM inference time by module.

| Module Name | Time (ms) | Percentage (%) |
|---|---|---|
| Pose Encoder | 20.52 | 19.3 |
| Pose Decoder | 3.60 | 3.4 |
| Observation model | 5.31 | 5.0 |
| Transformer Encoder | 3.03 | 2.8 |
| Transformer Decoder | 35.79 | 33.6 |
| Fusion layer | 20.69 | 19.5 |
| Selection gates | 12.58 | 11.8 |
| Normalizing flow | 4.89 | 4.6 |
| Total | 106.41 | 100.0 |

a Time required for each module and its proportion of the total.
These results reveal that future prediction and observation integration at each step constitute the primary bottleneck, accounting for 64.9% of the total inference time, with the future prediction module alone contributing 51.7%. In particular, the sequential processing performed at each step—predicting the next step and correcting based on observation data—is expected to be the main cause of computational load.
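The shares in Table 4 can be recomputed directly from the per-module times. In the sketch below, grouping the Transformer Decoder, Fusion layer, and Selection gates as the per-step prediction and fusion work is an assumption; it is one grouping that reproduces the 64.9% figure.

```python
# Per-module inference times in milliseconds, copied from Table 4.
module_ms = {
    "Pose Encoder": 20.52,
    "Pose Decoder": 3.60,
    "Observation model": 5.31,
    "Transformer Encoder": 3.03,
    "Transformer Decoder": 35.79,
    "Fusion layer": 20.69,
    "Selection gates": 12.58,
    "Normalizing flow": 4.89,
}

def share_percent(names, times=module_ms):
    """Combined share of the named modules in the total time, in percent."""
    total = sum(times.values())
    return 100.0 * sum(times[name] for name in names) / total

total_ms = sum(module_ms.values())
# Assumed grouping of the modules that run at every prediction/correction step.
per_step_modules = ["Transformer Decoder", "Fusion layer", "Selection gates"]
per_step_share = share_percent(per_step_modules)
# Modules ordered from most to least expensive.
ranked = sorted(module_ms, key=module_ms.get, reverse=True)
```

Ranking the modules this way puts the Transformer Decoder first, followed by the Fusion layer and the Pose Encoder.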
5.2 Acceleration Techniques
Based on the analysis in Section 5.1, we devised the following four acceleration approaches, each targeting a different aspect of the computational behavior and bottlenecks of D-LSAM.
1. Lightweighting through Prediction Module Replacement
Replacing the Transformer-based prediction module with an MLP-based TSMixer improves computational efficiency while reducing memory usage.
2. Integration of Prediction and Observation Fusion Processing
Executing previously separate prediction and observation fusion processing within a single module reduces overhead and redundant calculations.
3. Conversion from Sequential to Batch Processing
Changing step-by-step sequential processing to batch processing for the entire sequence enables acceleration through parallel computation.
4. Observation-Based Integration of Latent Space
Observation data provides high-precision, delay-free information for the upper body. Using this as a starting point not only suppresses the accumulation of errors associated with sequential prediction but also enables the construction of latent representations based on observations.
5.2.1 Lightweighting through Replacement of Prediction Modules
This subsection describes an attempt to replace the future prediction module with a lighter-weight architecture to improve the inference speed of the proposed D-LSAM method. In the D-LSAM used in the experiments up to Section 4, the process of predicting the next state from past latent sequences and correcting the obtained result with upper-body observation data is performed sequentially. The Transformer used for future prediction was identified as the primary bottleneck in inference time, with sequential processing delays significantly limiting overall processing efficiency. Conventional Transformer-based prediction modules capture dependencies between sequences using the Multi-Head Attention mechanism. Processing in high-dimensional latent spaces, in particular, results in large attention matrix sizes, imposing heavy loads on both memory usage and computational time.
In the field of time series forecasting, lightweight yet high-performance methods such as TSMixer (Time Series Mixer) (Chen et al. 2023) and PatchTST (Nie et al. 2022) have been proposed in recent years. The shift from Transformer-based models to MLP-based models has enabled significant speed improvements. Therefore, this study introduces TSMixer as an alternative to the Transformer used in the forecasting component. TSMixer is a lightweight time series forecasting model composed solely of MLPs, without using RNNs or attention mechanisms. It achieves sequence forecasting by alternately applying a Time-Mixing MLP, which captures temporal dependencies, and a Feature-Mixing MLP, which captures correlations between features.
When integrating TSMixer into D-LSAM, the following design principles were adopted. TSMixer takes the past sequence (length t_past) of encoded full-body data as input and predicts the future sequence (length t_future) covering the delayed frames. Specifically, the input sequence is first projected into a latent space. A Time-Mixing MLP captures temporal dependencies, followed by a Feature-Mixing MLP that learns correlations between features. Applying these layers alternately enables efficient representation of spatiotemporal patterns across the entire sequence. The final layer outputs latent representations for t_future steps, yielding the predicted full-body pose sequence. Compared with the Transformer, this architecture has a simpler structure and, because it has no attention mechanism, a lower computational cost, so a substantial speed-up can be expected. It is particularly well suited to inference on standalone XR devices.
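The alternation of time mixing and feature mixing can be sketched in a few lines. This follows the general description in Chen et al. (2023) but is deliberately reduced: normalization and dropout are omitted, the weights are passed in as plain nested lists, and the helper names are hypothetical. It is not the configuration used in the experiments.

```python
def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(v, 0.0) for v in row] for row in m]

def add(a, b):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]

def mixer_block(x, w_time, w_feat_in, w_feat_out):
    """x has t_past rows (time steps) and C columns (latent features)."""
    # Time mixing: one linear map over the time axis, shared by all features.
    x = add(x, relu(matmul(w_time, x)))                         # w_time: t_past x t_past
    # Feature mixing: a two-layer MLP over the feature axis, shared by all steps.
    x = add(x, matmul(relu(matmul(x, w_feat_in)), w_feat_out))  # C x H, then H x C
    return x

def tsmixer_forecast(x, blocks, w_proj):
    """Apply the mixer blocks, then project t_past steps onto t_future steps."""
    for w_time, w_feat_in, w_feat_out in blocks:
        x = mixer_block(x, w_time, w_feat_in, w_feat_out)
    return matmul(w_proj, x)                                    # w_proj: t_future x t_past
```

Every operation is a dense matrix product over a fixed-size sequence, so the cost per block grows with the sizes of the weight matrices rather than with an attention matrix computed from the input.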
5.2.2 Overhead Reduction through Integration of Prediction and Observation Fusion Modules
In the initial version of the D-LSAM, the future prediction module was pre-trained, and future prediction and observation fusion were executed in separate stages. Specifically, the model first estimated the next state in the latent space using the prediction model and then corrected it using observation data. While this design achieves high estimation accuracy through pre-training, it also complicates the training procedure. Furthermore, data transfer between modules and the computational overhead of sequential processing may reduce overall throughput.
To address this issue, this study investigates a method that integrates prediction and observation fusion into a single module. This improved approach avoids pre-training the full-body pose prediction model. Instead, both future state estimation and observation fusion are directly learned simultaneously during the D-LSAM training process. Specifically, we eliminated the Transformer Encoder-Decoder previously used as the prediction model. Instead, we adopted a design that integrates sequence prediction and observation integration by alternately applying self-attention to past sequences and cross-attention to observation data. This approach is expected to simplify learning and reduce sequential processing.
5.2.3 Conversion from Sequential Forecasting to Batch Forecasting
In the initial version of the D-LSAM, to compensate for missing frames in delayed full-body data, a sequential forecasting method was employed on the latent space, with corrections made at each step based on observed data. However, this sequential processing increased computational load, becoming a bottleneck for inference speed.
To mitigate this issue, we introduced a method that applies forecasting of future states and corrections based on observation of past states to the entire fixed-length sequence in one batch. The flow of this process is shown in Fig. 10. Specifically, we first simultaneously forecast the latent states for all steps from the past full-sequence data, then use the observation data to correct the entire sequence in one batch. This approach is expected to improve computational efficiency compared to sequential processing while enabling estimation that maintains consistency across the entire sequence.
Fig. 10
Batch prediction of latent states for fixed-length sequences and correction of entire sequence using observed data (Method 3).
The sequential method of the initial implementation, on the other hand, had the advantage of adapting dynamically to the amount of delay when delays were variable. The batch method, by contrast, limits the maximum delay that can be compensated to the fixed sequence length set in advance.
To apply this batch processing method to the D-LSAM, we implemented a new architecture. Specifically, we first predict the full-body data for the delayed frame batch using the past full-body data sequence as input. Subsequently, we apply cross-attention while referencing the observed upper-body data to correct the resulting prediction sequence. This design is expected to enable efficient inference by eliminating sequential processing while ensuring consistency across the entire sequence.
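The difference in control flow between the two schemes, and the fixed-length limitation of the batch scheme, can be illustrated with a toy example. Here each latent state is a single number and both predictors are constant-velocity stand-ins for the learned models; `max_len` is a hypothetical fixed sequence length. The two functions agree only because the toy predictor is linear, which would not hold for learned networks.

```python
def step_predict(seq):
    """Toy one-step predictor: constant-velocity extrapolation."""
    return 2 * seq[-1] - seq[-2]

def sequential_forecast(history, n_missing):
    """Autoregressive rollout: one predictor call per missing frame."""
    seq, calls = list(history), 0
    for _ in range(n_missing):
        seq.append(step_predict(seq))
        calls += 1
    return seq[len(history):], calls

def batch_forecast(history, n_missing, max_len=12):
    """All missing frames in a single call, bounded by a fixed sequence length."""
    if n_missing > max_len:
        raise ValueError("delay exceeds the fixed sequence length")
    velocity = history[-1] - history[-2]
    return [history[-1] + k * velocity for k in range(1, n_missing + 1)], 1
```

The sequential version issues as many predictor calls as there are missing frames, whereas the batch version issues one call but rejects delays longer than `max_len`.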
5.2.4 Integration of Latent Spaces Based on Observations
In conventional latent space-based sequential forecasting, the generation of future states primarily relied on the model's autoregressive predictions. In contrast, we introduce a mechanism that directly references the upper body observation sequence to correct the latent state of the entire body.
Specifically, cross-attention is first applied to the upper body observation data and the full-body past sequence to generate an initial sequence that includes the delayed frame portion. Subsequently, cross-attention applied to the observation data and cross-attention applied to the past sequence are alternately applied to progressively correct the initial sequence. This process enables efficient, batch-wise correction of the entire sequence while suppressing the accumulation of delays and errors that commonly occur in sequential forecasting.
This improvement enables sequence correction in latent space to be performed in a manner closely dependent on the observed data. Compared to conventional sequential correction methods, it is expected to enhance inference efficiency while also stabilizing accuracy.
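The observation-first scheme can be sketched with plain scaled dot-product attention. Several choices below are assumptions made for illustration and are not stated in the text: the observation latents act as queries in the initial step, the learned projection matrices are omitted (identity projections), corrections are added residually, and the number of rounds `n_rounds` is arbitrary.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention; returns one output row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def add(a, b):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]

def observation_based_fusion(obs_latents, past_latents, n_rounds=2):
    """Build an initial sequence from the observations, then refine it."""
    # Initial sequence: each observation step attends to the past full-body sequence.
    seq = cross_attention(obs_latents, past_latents, past_latents)
    for _ in range(n_rounds):
        # Alternate between attending to the observations and to the past sequence.
        seq = add(seq, cross_attention(seq, obs_latents, obs_latents))
        seq = add(seq, cross_attention(seq, past_latents, past_latents))
    return seq
```

The output has one latent per observation step, so the whole sequence up to the current time is produced without an autoregressive loop.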
5.3 Comparison of Accuracy and Inference Speed for Each Method
This subsection quantitatively compares the accuracy and inference time of each method presented in Section 5.2 and discusses the optimal approach for practical implementation. Inference time was measured using the same procedure as in the experiments described in Section 5.1. The accuracy (MPJPE) and inference time for the baseline (the model before improvement) and each improvement method are shown in Table 5 and Table 6.
Table 5
Estimation accuracy and inference time assuming 100 ms delay (MPJPE).

| Method | MPJPE (Daily) | MPJPE (Dance) | MPJPE (Sport) | Inference time (ms) |
|---|---|---|---|---|
| Baseline | 0.0650 | 0.1008 | 0.1347 | 100.5 |
| Method 1 | 0.0823 | 0.1158 | 0.1523 | 45.7 |
| Method 2 | 0.0799 | 0.1196 | 0.1384 | 82.6 |
| Method 3 | 0.0703 | 0.0973 | 0.1239 | 55.7 |
| Method 4 | 0.0582 | 0.0953 | 0.1157 | 41.8 |

a This experiment assumed a 100-ms delay in the MPJPE calculation.
Table 6
Estimation accuracy and inference time assuming 400 ms delay (MPJPE).

| Method | MPJPE (Daily) | MPJPE (Dance) | MPJPE (Sport) | Inference time (ms) |
|---|---|---|---|---|
| Baseline | 0.0814 | 0.1270 | 0.1516 | 100.5 |
| Method 1 | 0.0903 | 0.1262 | 0.1592 | 45.7 |
| Method 2 | 0.0901 | 0.1256 | 0.1453 | 82.6 |
| Method 3 | 0.0826 | 0.1315 | 0.1426 | 55.7 |
| Method 4 | 0.0693 | 0.1127 | 0.1283 | 41.8 |

a This experiment assumed a 400-ms delay in the MPJPE calculation.
Method 1 (Architecture Simplification)
Replacing the Transformer Encoder-Decoder-based prediction module with an MLP-based TSMixer significantly reduced inference time (100.5 ms → 45.7 ms). However, accuracy (MPJPE) degraded in almost every category; the only exception was Dance at the 400-ms delay, where the error was essentially unchanged (0.1270 → 0.1262). While the simpler MLP-based architecture substantially reduced processing time, the Transformer's ability to learn dependencies likely plays a crucial role in predicting human movements. The simple structure of the MLP-based model limits its ability to capture complex spatiotemporal patterns, and at the 100-ms delay the error increased by roughly 0.015 m or more in every category, including Dance and Sport, which contain complex movement patterns.
Method 2 (Integrated Architecture)
Eliminating the pre-training of the Transformer Encoder-Decoder-based prediction module and integrating the forecasting and observation correction modules yielded no significant improvement over the baseline. Inference time decreased slightly from 100.5 ms to 82.6 ms, but accuracy slightly decreased on average, with MPJPE rising from 0.109 to 0.116. The simplification of architecture through integration did not yield the expected benefits. Instead, it is concluded that the loss of pre-training benefits led to the decrease in accuracy.
Method 3 (Batch Processing Approach)
By changing from sequential processing to batch prediction and correction application for fixed-length sequences, inference time was reduced from 100.5 ms to 55.7 ms. Notably, this reduction in inference time was accompanied by a slight improvement in accuracy (mean MPJPE: 0.124 → 0.115). Compared to Method 1, inference time increases by about 10 ms, but considering the accuracy improvement, Method 3 is superior. This confirms sequential processing was the primary bottleneck, and processing the entire sequence in batches significantly improved computational efficiency. Furthermore, the ability to apply corrections considering the entire sequence likely contributed to the accuracy improvement.
Method 4 (Observation-Based Approach)
This approach further refines Method 3 and demonstrated the most outstanding results. In conventional methods, the past sequence was converted into a 256-dimensional embedding, then the future sequence was predicted using self-attention, followed by cross-attention with the observation sequence to correct the entire sequence. In contrast, Method 4 changed the processing to observation-based. First, cross-attention was applied to the past sequence of the full-body data and the observation sequence up to the present to generate initial values for the missing frames in the full-body data. Subsequently, by alternately processing the observation sequence and past sequence with cross-attention, it achieved the highest accuracy across all datasets (mean MPJPE: 0.115 → 0.095) while maintaining inference time comparable to Method 3 (55.7 ms → 41.8 ms).
The experimental results indicate that Method 4 is the most promising approach for practical implementation. Method 4 successfully reduces inference time to less than half that of the baseline (41.8 ms) while simultaneously achieving a significant improvement in accuracy. In particular, it was demonstrated that the correction approach starting from observation data is more effective than the conventional prediction-then-correction framework.
From the perspective of inference time, batching sequential processing (Methods 3 and 4) proved most effective, while simplifying the architecture (Method 1) resulted in excessive accuracy loss. Conversely, from the accuracy perspective, the superiority of Transformer-based architecture was confirmed, particularly for actions involving complex spatiotemporal patterns (such as Dance and Sport), highlighting the importance of the attention mechanism.
These results provide important guidelines for balancing processing efficiency and accuracy in the practical implementation of real-time full-body pose estimation systems. Method 4 demonstrates that a system combining a practical inference time (under 50 ms) with high accuracy is feasible.
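The trade-off can be summarized numerically from Tables 5 and 6. The sketch below takes an unweighted mean over the six MPJPE values of each method (three categories at two delay settings) and the speed-up relative to the baseline. The unweighted averaging is an assumption, so the resulting means may differ slightly from means computed with another weighting, for example by number of samples per category.

```python
# (Daily, Dance, Sport) MPJPE in metres at 100 ms and at 400 ms delay, followed by
# the inference time in milliseconds; all values copied from Tables 5 and 6.
results = {
    "Baseline": ((0.0650, 0.1008, 0.1347), (0.0814, 0.1270, 0.1516), 100.5),
    "Method 1": ((0.0823, 0.1158, 0.1523), (0.0903, 0.1262, 0.1592), 45.7),
    "Method 2": ((0.0799, 0.1196, 0.1384), (0.0901, 0.1256, 0.1453), 82.6),
    "Method 3": ((0.0703, 0.0973, 0.1239), (0.0826, 0.1315, 0.1426), 55.7),
    "Method 4": ((0.0582, 0.0953, 0.1157), (0.0693, 0.1127, 0.1283), 41.8),
}

def summarize(results, baseline="Baseline"):
    """Unweighted mean MPJPE and speed-up over the baseline for each method."""
    base_ms = results[baseline][2]
    summary = {}
    for name, (delay_100, delay_400, ms) in results.items():
        values = delay_100 + delay_400
        summary[name] = {
            "mean_mpjpe": sum(values) / len(values),
            "speedup": base_ms / ms,
        }
    return summary

summary = summarize(results)
for name, row in summary.items():
    print(f"{name}: mean MPJPE {row['mean_mpjpe']:.3f} m, speed-up x{row['speedup']:.2f}")
```

Under this averaging, Method 4 has both the lowest mean error and the largest speed-up, which is consistent with the per-category comparison above.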
6 Conclusion and Future Work
In this study, we proposed D-LSAM, a novel deep learning-based framework for integrating multiple motion capture data sources. Specifically, the proposed method uses a Wasserstein Auto-Encoder for data compression and Latent Time-Stepping for time series prediction, thereby improving the accuracy and consistency of pose estimation. Furthermore, the proposed method achieves higher pose estimation accuracy than previous methods, including the extended Kalman filter.
6.1 Extension to Hand and Finger Tracking
In this study, we focused on full-body motion tracking. However, a future direction is to extend the framework to include more detailed tracking, such as hands and fingers. Capturing finer motions, including finger articulations and hand gestures, would enable more precise and expressive motion representation, thereby expanding the potential applications in XR environments, such as gesture recognition, object interaction, and high-fidelity avatar animation.
6.2 Validation of Different Devices
To further expand the usefulness of the proposed method, future work should include validation on VR headsets other than the Quest 3 and on body tracking systems other than TDPT. For example, by using other VR headsets and trackers, such as the Pico 4 and Motion Tracker, we need to confirm the method's versatility across different sensor characteristics and data formats. Integration with camera-based tracking systems [e.g., OpenPose, Move AI (Move AI 2025)] and wearable sensor devices should also be considered.
6.3 Future Challenges and Prospects
Although the method proposed in this study has contributed to improving the accuracy of pose estimation, several challenges still remain. In particular, further improvement in accuracy is needed for better adaptability to very fast movements and extreme poses. It is also important to continue research on uncertainty modeling to achieve more accurate and flexible estimation.
Future research is expected to further enhance the generality and adaptability of the proposed method and lead to the development of a system that can handle real-time applications and large datasets.
Acknowledgement
We thank all members of the Nagao Laboratory at Nagoya University who contributed to this work, particularly for their assistance with the experimental setup and data collection. Informed consent was obtained from all individual participants included in the study. The subjects signed informed consent forms regarding the disclosure of their data. This work was supported by the Council for Science, Technology and Innovation, “Cross-ministerial Strategic Innovation Promotion Program (SIP), Development of foundational technologies and rules for expansion of the virtual economy” (JPJ012495) (funding agency: NEDO).
Author Contribution
Kazuhiro Esaki and Katashi Nagao conceived the study. Kazuhiro Esaki curated the data and, together with Katashi Nagao, developed the methodology. Katashi Nagao administered and supervised the project. Kazuhiro Esaki and Katashi Nagao prepared the original draft, and Katashi Nagao reviewed and edited the manuscript. All authors have read and approved the published version of the manuscript.
Data Availability
All datasets used in this study were created by the authors and are publicly available in a GitHub repository. These data and related files can be accessed via the following URL and downloaded through Google Drive.
GitHub Repository: https://github.com/kazuhiro1999/data-assimilation
Google Drive Download Link: https://drive.google.com/file/d/1cy4QFPCc_HHepIjy2ne-C3Yx8aii7hrW/view?usp=sharing
These data are provided as open access and require no additional application or approval.
REFERENCES
Akiba T, Sano S, Yanase T, Ohta T, Koyama M (2019) Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2623–2631. https://doi.org/10.1145/3292500.3330701
Aksan E, Cao P, Kaufmann M, Hilliges O (2020) A Spatio-temporal Transformer for 3D Human Motion Prediction. 2021 International Conference on 3D Vision (3DV) 565–574. https://doi.org/10.1109/3DV53792.2021.00066
Armitano-Lago C, Willoughby D, Kiefer AW (2022) A SWOT Analysis of Portable and Low-Cost Markerless Motion Capture Systems to Assess Lower-Limb Musculoskeletal Kinematics in Sport. Front Sports Act Living 3. https://doi.org/10.3389/fspor.2021.809898
Baldinger M, Lippmann K, Senner V (2024) Artificial Intelligence-Based Motion Capture: Current Technologies, Applications and Challenges. Artificial Intelligence in Sports, Movement, and Health 161–176. https://doi.org/10.1007/978-3-031-67256-9_10
Bazarevsky V, Grishchenko I, Raveendran K, Zhu T, Zhang F, Grundmann M (2020) BlazePose: On-device Real-time Body Pose tracking. ArXiv, abs/2006.10204
Cao Z, Hidalgo G, Simon T, Wei S-E, Sheikh Y (2018) OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans Pattern Anal Mach Intell 43:172–186. https://doi.org/10.1109/TPAMI.2019.2929257
Caserman P, Garcia-Agundez A, Konrad RA, Göbel S, Steinmetz R (2018) Real-time body tracking in virtual reality using a Vive tracker. Virtual Real 23:155–168. https://doi.org/10.1007/s10055-018-0374-z
Casiez G, Roussel N, Vogel D (2012) 1 € Filter: A Simple Speed-Based Low-Pass Filter for Noisy Input in Interactive Systems. In Proceedings of the 30th Annual ACM Conference on Human Factors in Computing Systems (CHI '12), Austin, TX, USA, pp. 2527–2530. https://doi.org/10.1145/2207676.2208639
Cha K, Wang J, Li Y, Shen L, Chen Z, Long J (2021) A novel upper-limb tracking system in a virtual environment for stroke rehabilitation. J Neuroeng Rehabil 18. https://doi.org/10.1186/s12984-021-00957-6
Chen K, Wang J, Pang J, Cao Y, Xiong Y, Li X, Sun S, Feng W, Liu Z, Xu J, Zhang Z, Cheng D, Zhu C, Cheng T, Zhao Q, Li B, Lu X, Zhu R, Wu Y, Liu K, Dai J (2019) MMDetection: Open MMLab Detection Toolbox and Benchmark. ArXiv, abs/1906.07155.
Chen S, Li CL, Yoder N, Arik SÖ, Pfister T (2023) TSMixer: An all-MLP Architecture for Time Series Forecasting. Trans Mach Learn Res, vol. 2023. https://api.semanticscholar.org/CorpusID:257482532
Chen X, Li Y (2023) An overview of differentiable particle filters for data-adaptive sequential Bayesian inference. ArXiv, abs/2302.09639
Chen X, Wen H, Li Y (2021) Differentiable Particle Filters through Conditional Normalizing Flow. 2021 IEEE 24th International Conference on Information Fusion (FUSION) 1–6. https://doi.org/10.23919/FUSION49465.2021.9626998
Chen X, Li Y (2024) Normalizing Flow-Based Differentiable Particle Filters. IEEE Trans Signal Process 73:493–507. https://doi.org/10.1109/TSP.2024.3521338
Choi S-W, Lee S, Seo M-W, Kang S-J (2018) Time Sequential Motion-to-Photon Latency Measurement System for Virtual Reality Head-Mounted Displays. Electronics 7:171. https://doi.org/10.3390/electronics7090171
Cuturi M, Blondel M (2017) Soft-DTW: a Differentiable Loss Function for Time-Series. International Conference on Machine Learning. https://doi.org/10.48550/arXiv.1703.01541
Fortini L, Leonori M, Gandarias JM, Momi ED, Ajoudani A (2023) Markerless 3D human pose tracking through multiple cameras and AI: Enabling high accuracy, robustness, and real-time performance. ArXiv, abs/2303.18119.
González-Franco M, Cohn BA, Ofek E, Burin D, Maselli A (2020) The Self-Avatar Follower Effect in Virtual Reality. 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR) 18–25. https://doi.org/10.1109/VR46266.2020.00019
Gretton A, Fukumizu K, Teo CH, Song L, Scholkopf B, Smola A (2007) A Kernel Statistical Test of Independence. In Proceedings of the 21st International Conference on Neural Information Processing Systems 585–592. https://dl.acm.org/doi/10.5555/2981562.2981636
Grishchenko I, Bazarevsky V, Zanfir A, Bazavan EG, Zanfir M, Yee R, Raveendran K, Zhdanovich M, Grundmann M, Sminchisescu C (2022) BlazePose GHUM Holistic: Real-time 3D Human Landmarks and Pose Estimation. ArXiv, abs/2206.11678
Gu A, Dao T (2023) Mamba: Linear-Time Sequence Modeling with Selective State Spaces. ArXiv, abs/2312.00752
Iskakov K, Burkov E, Lempitsky V, Malkov Y (2019) Learnable Triangulation of Human Pose. International Conference on Computer Vision (ICCV)
Jang EJ, Gu SS, Poole B (2016) Categorical Reparameterization with Gumbel-Softmax. ArXiv, abs/1611.01144
Jiang T, Lu P, Zhang L, Ma N, Han R, Lyu C, Li Y, Chen K (2023) RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose. ArXiv, abs/2303.07399
Jiang J, Streli P, Qiu H, Fender AR, Laich L, Snape P, Holz C (2022) AvatarPoser: Articulated Full-Body Pose Tracking from Sparse Motion Sensing. 17th European Conference on Computer Vision 443–460. https://doi.org/10.1007/978-3-031-20065-6_26
Jonschkowski R, Rastogi D, Brock O (2018) Differentiable Particle Filters: End-to-End Learning with Algorithmic Priors. ArXiv, abs/1805.11122
Karkus P, Hsu D, Lee WS (2018) Particle Filter Networks: End-to-End Probabilistic Localization From Visual Observations. ArXiv, abs/1805.08975.
Kingma DP, Welling M (2013) Auto-Encoding Variational Bayes. CoRR. https://doi.org/10.48550/arXiv.1312.6114
Kipf T, Welling M (2016) Semi-Supervised Classification with Graph Convolutional Networks. ArXiv, abs/1609.02907
Lam WW, Tang Y-M, Fong KNK (2023) A systematic review of the applications of markerless motion capture (MMC) technology for clinical measurement in rehabilitation. J Neuroeng Rehabil 20:57. https://doi.org/10.1186/s12984-023-01186-9
Li C, Zhang Z, Lee WS, Lee GH (2018) Convolutional Sequence to Sequence Model for Human Dynamics. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 5226–5234. https://doi.org/10.1109/CVPR.2018.00548
Li R, Yang S, Ross DA, Kanazawa A (2021) AI Choreographer: Music Conditioned 3D Dance Generation with AIST++. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 13381–13392. https://doi.org/10.1109/ICCV48922.2021.01315
Liu Z, Lee D, Sepp W (2011) Particle filter based monocular human tracking with a 3d cardbox model and a novel deterministic resampling strategy. 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems 3626–3631. https://doi.org/10.1109/IROS.2011.6094733
Mehraban S, Adeli V, Taati B (2023) MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 6905–6915. https://doi.org/10.1109/WACV57701.2024.00677
Montañez OJ, Suarez MJ, Fernandez EA (2023) Application of Data Sensor Fusion Using Extended Kalman Filter Algorithm for Identification and Tracking of Moving Targets from LiDAR-Radar Data. Remote Sens 15:3396. https://doi.org/10.3390/rs15133396
Move AI (2025) Move AI. https://www.move.ai. Accessed 28 Feb 2025
Mücke NT, Bohté SM, Oosterlee CW (2024) The deep latent space particle filter for real-time data assimilation with uncertainty quantification. Sci Rep 14:19447. https://doi.org/10.1038/s41598-024-69901-7
Nakano N, Sakura T, Ueda K, Omura L, Kimura A, Iino Y, Fukashiro S, Yoshioka S (2019) Evaluation of 3D Markerless Motion Capture Accuracy Using OpenPose With Multiple Video Cameras. Front Sports Act Living 2. https://doi.org/10.3389/fspor.2020.00050
Neidhardt M, Schmidt SG, Fiedler I, Grube S, Busse B, Schlaefer A (2023) VR-based body tracking to stimulate musculoskeletal training. ArXiv, abs/2308.03375
Nie Y, Nguyen NH, Sinthong P, Kalagnanam J (2022) A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. https://doi.org/10.48550/arXiv.2211.14730
Obdrzálek S, Kurillo G, Han JJ, Abresch RT, Bajcsy R (2012) Real-Time Human Pose Detection and Tracking for Tele-Rehabilitation in Virtual Reality. Stud Health Technol Inform 173:320–324. https://doi.org/10.3233/978-1-61499-022-2-320
Obukhov AD, Volkov A, Pchelintsev AN, Nazarova AO, Teselkin DV, Surkova E, Fedorchuk I (2023) Examination of the Accuracy of Movement Tracking Systems for Monitoring Exercise for Musculoskeletal Rehabilitation. Sensors 23:8058. https://doi.org/10.3390/s23198058
Park M, Lee J, Yang H, Kim J (2025) Designing and Analyzing Virtual Avatar Based on Rigid-Body Tracking in Immersive Virtual Environments. IEEE Access 13:5522–5533. https://doi.org/10.1109/ACCESS.2025.3525630
Rezende DJ, Mohamed S (2015) Variational Inference with Normalizing Flows. ArXiv, abs/1505.05770
Sofianos T, Sampieri A, Franco L, Galasso F (2021) Space-Time-Separable Graph Convolutional Network for Pose Forecasting. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 11189–11198. https://doi.org/10.48550/arXiv.2110.04573
Suo X, Tang W, Li Z (2024) Motion Capture Technology in Sports Scenarios: A Survey. Sensors 24:2947. https://doi.org/10.3390/s24092947
Tolstikhin I, Bousquet O, Gelly S, Schoelkopf B (2017) Wasserstein Auto-Encoders. ArXiv, abs/1711.01558
Tsuchida S, Fukayama S, Hamasaki M, Goto M (2019) AIST Dance Video Database: Multi-genre, Multi-dancer, and Multi-camera Database for Dance Information Processing. Proceedings of the 20th International Society for Music Information Retrieval Conference
Ullah I, Manzo M, Shah M, Madden MG (2019) Graph Convolutional Networks: analysis, improvements and results. Appl Intell 52:9033–9044. https://doi.org/10.1007/s10489-021-02973-4
Vaswani A, Shazeer NM, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is All you Need. Neural Information Processing Systems. https://dl.acm.org/doi/10.5555/3295222.3295349
Vox JP, Weber A, Wolf KI, Izdebski K, Schüler T, Schüler P, Wallhoff F, Friemert D (2021) An Evaluation of Motion Trackers with Virtual Reality Sensor Technology in Comparison to a Marker-Based Motion Capture System Based on Joint Angles for Ergonomic Risk Assessment. Sensors 21:3145. https://doi.org/10.3390/s21093145
Wade L, Needham L, McGuigan PM, Bilzon JLJ (2022) Applications and limitations of current markerless motion capture methods for clinical gait biomechanics. PeerJ 10:e12995. https://doi.org/10.7717/peerj.12995
Wang X, Cui Q, Chen C, Liu M (2023) GCNext: Towards the Unity of Graph Convolutions for Human Motion Prediction. AAAI Conference on Artificial Intelligence. https://doi.org/10.1609/aaai.v38i6.28375
Warburton M, Mon-Williams M, Mushtaq F, Morehead JR (2023) Measuring motion-to-photon latency for sensorimotor experiments with virtual reality systems. Behav Res Methods 55:3658–3678. https://doi.org/10.3758/s13428-022-01983-5
Winkler A, Won J, Ye Y (2022) QuestSim: Human Motion Tracking from Sparse Sensors with Simulated Avatars. SIGGRAPH Asia 2022 Conference Papers 2:1–8. https://doi.org/10.1145/3550469.3555411
Zhou L, Meng X, Liu Z, Wu M, Gao Z, Wang P (2023) Human Pose-based Estimation, Tracking and Action Recognition with Deep Learning: A Survey. ArXiv, abs/2310.13039