Kazuhiro Esaki
Department of Intelligent Systems, Graduate School of Informatics, Nagoya University, Nagoya, Japan
esaki.kazuhiro.m7@s.mail.nagoya-u.ac.jp
Katashi Nagao
Department of Intelligent Systems, Graduate School of Informatics, Nagoya University, Nagoya, Japan
nagao@i.nagoya-u.ac.jp
Abstract
In this paper, we propose the Deep Latent Space Assimilation Model (D-LSAM), a novel framework for integrating multiple body tracking techniques in XR environments to achieve more precise, real-time motion capture. Inside-Out Body Tracking (IOBT) on VR headsets can accurately track upper-body and finger movements, yet it struggles to capture areas outside the camera’s field of view—particularly the lower body. On the other hand, external-camera or smartphone-based systems can observe the entire body but often suffer from delays or reduced accuracy. The D-LSAM addresses these limitations by combining a Wasserstein autoencoder for pose compression, a Transformer-driven Latent Time-Stepping module for movement prediction, and a cross-attention gating mechanism that adaptively fuses data from various sources. Experimental results confirm that the D-LSAM outperforms both the extended Kalman filter and particle filter-based methods in short- to mid-term motion forecasting. Future work will emphasize faster inference, improved handling of rapid movements, and support for a wider range of devices. Progress in this methodology holds promise for delivering more immersive XR applications and for advancing fields such as medicine, sports, and rehabilitation.
Additional Keywords and Phrases: Body Tracking, Motion Capture, Data Assimilation, Deep Learning, Extended Reality
Full-body tracking technology plays an important role in fields as diverse as gaming, film making, medical rehabilitation, and sports analysis, enabling real-time motion recognition and tracking. In particular, body tracking technology is being deployed in the XR field to enable self-projection and avatar manipulation to provide a more immersive experience (Caserman et al. 2018; Winkler et al. 2022). Accurate full-body tracking is essential for improving interaction in virtual spaces and is expected to be applied in immersive games, remote work support, and as a support tool for rehabilitation (Cha et al. 2021; Lam et al. 2023; Neidhardt et al. 2023; Obdrzálek et al. 2012).
However, each existing body tracking technology has its own challenges. For example, Inside-Out Body Tracking (IOBT), which is used in VR headsets, tracks the upper body and fingers accurately and in real time using information from the headset and its built-in cameras, but it cannot reliably track the lower body or joints outside the cameras' field of view. In contrast, markerless motion capture with an external camera can track the entire body, but it suffers from latency and reduced accuracy for fine movements. In addition, although such systems can acquire data without dedicated sensor hardware, they involve a trade-off between tracking accuracy and inference speed, and real-time performance remains an issue (Armitano-Lago et al. 2022; Baldinger et al. 2024).
Each of these techniques has its advantages and challenges, and especially in real-time applications, there is a need for fusion models that combine different techniques to mitigate the limitations of individual methods. In recent years, pose prediction technology (human motion forecasting) has advanced, and methods for predicting future motion by utilizing past motion data have been studied (Aksan et al. 2020; Li et al. 2018; Sofianos et al. 2021; Wang et al. 2023). In particular, methods that utilize deep learning have shown the potential to estimate complex human motion patterns with high accuracy.
In the past, methods such as the extended Kalman filter (EKF) and the particle filter have been used to integrate information from multiple sensors (Liu et al. 2011; Montañez et al. 2023). However, these methods adapt poorly to complex human behavior and to changes in the environment. In this context, particle filter methods incorporating deep learning have also been studied, and the differentiable particle filter (DPF) (Jonschkowski et al. 2018), which formulates the resampling and state update steps of the particle filter in a differentiable form, has been proposed. The DPF performs state estimation in combination with neural networks and has been shown to be useful for data-adaptive sequential Bayesian inference, for example, in studies on visual localization (Karkus et al. 2018).
Furthermore, the Deep Latent Space Particle Filter (D-LSPF) (Mücke et al. 2024) has been proposed as a method that combines particle filtering and deep learning. The D-LSPF leverages deep learning to perform time evolution and particle updates in latent space, building on the processing flow of particle filters to integrate multiple datasets, and has been shown to improve accuracy and robustness over conventional filtering methods.
In this study, we propose a new model called the Deep Latent Space Assimilation Model (D-LSAM), which further develops the D-LSPF to achieve more efficient and accurate data fusion by introducing distribution-based prediction and selective fusion in the latent space. The proposed method aims to integrate data from different tracking techniques to achieve more accurate motion representation in XR environments.
Body tracking technology is essential for improving the accuracy of interactions in XR environments. In particular, in virtual spaces, it plays a role in enhancing the immersive experience by accurately reflecting the user's body movements, enabling avatar manipulation and self-projection (González-Franco et al. 2020; Park et al. 2025). In recent years, this technology has been applied not only to games and entertainment but also to fields such as medical rehabilitation and remote work support, and various methods have been proposed to accurately track full-body movements (Jiang et al. 2022; Obdrzálek et al. 2012). In particular, applications such as sports training and rehabilitation require accurate reflection of movements without delay, and research is underway to meet these requirements (Lam et al. 2023; Obukhov et al. 2023; Suo et al. 2024).
Body tracking methods in XR can be broadly classified into sensor-based and markerless methods. In the sensor-based method, a dedicated device such as the HTC VIVE Tracker is worn to acquire position and posture information for highly accurate tracking (Caserman et al. 2018; Vox et al. 2021). This method has excellent real-time performance and is suitable for VR games and training applications, but it requires that users wear an additional device, which increases the burden on them. On the other hand, markerless methods use depth or stereo cameras and machine learning algorithms to estimate user motions (Fortini et al. 2023). In particular, techniques such as QuestSim have been developed to simulate reasonable full-body movements from minimal sensor information (Winkler et al. 2022), and they are expected to improve the accuracy of tracking using only a head-mounted display and controller.
Pose estimation is a technique for identifying the joint positions of the human body and is the basis for 2D and 3D body tracking. In recent years, the development of deep learning has enabled highly accurate and real-time posture estimation, which has found a wide range of applications in gaming, medical rehabilitation, and sports analysis (Zhou et al. 2023).
In 2D pose estimation, the joint positions of the human body are estimated on the image plane using a monocular camera. Typical methods include OpenPose (Cao et al. 2018), BlazePose (Bazarevsky et al. 2020; Grishchenko et al. 2022), and RTMPose (Jiang et al. 2023). OpenPose detects keypoints of multiple people in real time and is used in numerous applications. BlazePose, on the other hand, is optimized for mobile devices and provides highly accurate real-time pose estimation using 33 keypoints. RTMPose, a highly efficient multi-person pose estimation model based on the MMPose framework (Chen et al. 2019), excels in achieving a balance between speed and accuracy.
In 3D pose estimation, depth information is estimated from 2D images to recover joint positions in 3D space. Two types of methods exist: those that use a monocular camera and those that utilize multiple views. For the monocular approach, MotionAGFormer (Mehraban et al. 2023) has been proposed, which combines a transformer and a graph convolutional network (GCN) and is capable of highly accurate estimation considering local joint relationships. For multi-view methods, a learnable triangulation technique has been developed, which integrates information from different camera viewpoints to achieve highly accurate 3D pose estimation (Iskakov et al. 2019).
Conventional optical motion capture is highly accurate but requires a dedicated environment and marker attachment. Markerless motion capture, on the other hand, can acquire natural motions without restrictions and is expected to be used in clinical fields and sports analysis (Wade et al. 2022). In particular, OpenPose-based systems using multiple cameras have improved 3D accuracy (Nakano et al. 2019). However, there are many unresolved issues, such as the effects of ambient light and errors associated with the high speed of operation. An understanding of the advantages and limitations of these techniques is required to select the appropriate method for the appropriate application.
Human pose prediction plays an important role in interactive applications, especially in VR and AR. Pose prediction techniques are essential for improving motion recognition and interaction, as predicting future poses can provide a more natural and immersive experience. Methods based on deep learning have attracted much attention in this field and have contributed to improving the accuracy of pose prediction.
In recent years, new architectures have been proposed to capture spatial and temporal correlations. For example, space-time-separable graph convolutional networks (STS-GCNs), which process spatial and temporal features separately, have achieved excellent accuracy in long-term pose prediction (Sofianos et al. 2021), and GCNext, which pursues a unified GCN approach, has greatly improved prediction efficiency (Wang et al. 2023). These methods strike a better balance between computational cost and accuracy than previous methods and are setting a new standard in motion prediction.
In addition, methods utilizing sequence-to-sequence models and transformers have emerged, showing particularly strong performance in predicting complex behaviors: CNN-based sequence-to-sequence models enable highly accurate prediction of dynamics (Li et al. 2018), and Transformer models have been shown to be more accurate than conventional RNNs in long-term prediction (Aksan et al. 2020).
These developments are leading to more accurate and efficient motion prediction. However, improving real-time performance and computational efficiency remains an important challenge.
2.4 Differentiable Particle Filter
The differentiable particle filter (DPF) introduces differentiability to conventional particle filters to enable integration with deep learning models and end-to-end learning and to improve performance in nonlinear state estimation problems. Conventional particle filters have been widely used for state estimation in nonlinear state-space models (Liu et al. 2011; Montañez et al. 2023). However, they do not assume differentiability of the model, which makes integration with deep learning models and end-to-end learning difficult.
Jonschkowski et al. (2018) implemented particle filters in a differentiable form to improve the performance of state estimation by making behavioral and observational models trainable. Their method makes the structure of the particle filter differentiable end-to-end and incorporates algorithmic prior knowledge to make model learning more efficient.
Chen and Li (2023) reviewed the latest advances in methods for building each component of a particle filter using neural networks and optimizing them by gradient descent. They examined in detail the design choices regarding the main components of a DPF, including the dynamic model, observation model, proposal distribution, optimization objectives, and differentiable resampling technique.
In addition, research on DPFs incorporating normalizing flows has received considerable attention. A normalizing flow constructs a complex distribution by applying a series of invertible nonlinear transformations to a simple initial distribution, yielding a more expressive approximation (Rezende and Mohamed 2015). This approach has been reported to increase the flexibility of the proposal distribution and dynamic model and to improve state estimation performance in complex real-world scenarios (Chen et al. 2021, 2024).
Finally, Mücke et al. (2024) proposed a new neural network-based particle filtering method called the Deep Latent Space Particle Filter (D-LSPF). The method uses a Wasserstein autoencoder (WAE) and a modified vision transformer layer to map high-dimensional data into a low-dimensional latent space. It then filters in that latent space and performs time evolution with a transformer model, which significantly improves computational efficiency. In multiple test cases, their method was shown to improve computational speed and accuracy by up to an order of magnitude over conventional high-precision particle filters.
3 Data Assimilation Based on Deep Latent Space
We propose a new method, the Deep Latent Space Assimilation Model (D-LSAM), which integrates multiple tracking data using deep learning. Unlike conventional DPFs, D-LSAM performs state estimation based on distributions and sampling without using actual particles. It is based on the DPF framework and consists of three main processes: Transition Modeling, Observation Modeling, and Differentiable Resampling. In this study, these components were implemented using the WAE (Tolstikhin et al. 2017) and Transformer (Vaswani et al. 2017).
The novelty of this method is that it establishes a framework for adaptively integrating different observation data by fusing observation data through cross-attention and introducing selective state updating using a gating mechanism that is based on Mamba's selection mechanism (Gu and Dao 2023). In this section, each component of the proposed method is described in detail.
As shown in Fig. 1, the WAE encoder is used to compress the observed data into the latent space and generate an initial distribution. Transitions by Latent Time-Stepping are applied to samples drawn from this distribution to generate the predictive distribution $z_{\mathrm{pred}}$. Next, the observed data $z_{\mathrm{obs}}$ are used to correct the predictive distribution and obtain the next state. Finally, the sampled latent variables are reconstructed by the WAE decoder to estimate the 3D locations of the joint points.
In this study, we used the WAE (Tolstikhin et al. 2017) as an autoencoder to compress information about human poses. The WAE compresses the input pose data into a low-dimensional latent space and restores it from that space, which is essential for modeling time series transitions in the latent space. Specifically, the WAE encoder takes as input the 3D positions of the full-body joint points and maps them to the latent space. The encoder is built on a GCN (Kipf and Welling 2016; Ullah et al. 2019), which captures the relationships among the joint points of a pose and encodes joint motion efficiently. The decoder, on the other hand, consists of three fully connected layers and reconstructs the original 3D positions from the latent representation (Fig. 2).
In training the WAE, the loss function described in the D-LSPF study was used. First, the mean squared error (MSE) is used to minimize the reconstruction error. Next, a maximum mean discrepancy (MMD) loss is added so that the distribution of the latent space approaches the standard normal distribution. Aligning the latent space with the normal distribution in this way allows the latent representation to generalize better (Gretton et al. 2007). A consistency loss is also combined with the MMD loss to ensure consistency between the encoder and decoder. It is calculated as the MSE between the original latent vectors and the latent vectors obtained by passing the decoder output through the encoder again. The relative weights of the MSE, MMD, and consistency losses affect the model's convergence speed and reconstruction accuracy, so the loss weights must be tuned as hyperparameters.
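The composite loss described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the encoder and decoder are omitted, the RBF kernel, its bandwidth, the biased MMD estimator, and the unit loss weights are all assumptions, and the function names are hypothetical.

```python
import math

def mse(a, b):
    """Mean squared error between two equally shaped batches of vectors."""
    total = sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v))
    return total / (len(a) * len(a[0]))

def rbf(u, v, bandwidth):
    """Gaussian (RBF) kernel between two vectors (kernel choice is an assumption)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2(z, prior, bandwidth=1.0):
    """Biased estimate of the squared MMD between latent codes and prior samples."""
    def mean_kernel(p, q):
        return sum(rbf(u, v, bandwidth) for u in p for v in q) / (len(p) * len(q))
    return mean_kernel(z, z) + mean_kernel(prior, prior) - 2.0 * mean_kernel(z, prior)

def wae_loss(x, x_rec, z, z_reenc, prior, w_mse=1.0, w_mmd=1.0, w_cons=1.0):
    """Weighted sum of reconstruction, MMD, and consistency terms.

    x, x_rec : input poses and decoder outputs
    z        : encoder outputs for x
    z_reenc  : encoder outputs for x_rec (decoder output encoded again)
    prior    : samples drawn from the standard normal prior
    The weights are placeholders; the paper tunes them as hyperparameters.
    """
    return (w_mse * mse(x, x_rec)
            + w_mmd * mmd2(z, prior)
            + w_cons * mse(z, z_reenc))
```

In practice the three terms would be computed on mini-batches inside an automatic differentiation framework; plain Python lists are used here only to keep the sketch self-contained.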
Latent Time-Stepping is the key component for predicting time series in the latent space and serves as the transition model for the integrated model. The model employs the Transformer encoder-decoder architecture and is pre-trained to predict future poses from past pose sequences. Specifically, 3D position data of the joint points are given as input and are first compressed into low-dimensional latent variables by the WAE encoder. The Transformer then learns the temporal transitions of these latent variables and predicts future states. The prediction is finally reconstructed by the WAE decoder into the original 3D position data. The architecture during the pre-training phase is shown in Fig. 3.
One of the features of this model is that it accounts for uncertainty, following the variational autoencoder (VAE) (Kingma and Welling 2013). Conventional models make deterministic forecasts as a single point estimate, whereas Variational Latent Time-Stepping outputs the mean and log variance of the latent variable, which then serve as the basis for stochastic forecasts of future poses. This allows the model to represent uncertainty in its forecasts and makes forecasting more flexible.
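As a rough illustration of how a mean and log variance can drive stochastic forecasts, the sketch below draws a latent sample with the reparameterization trick used in VAEs. The function name and the use of a diagonal Gaussian are assumptions made for this example; the paper does not specify the sampling code.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps per dimension, with eps ~ N(0, 1).

    mu, log_var : lists holding the predicted mean and log variance
    rng         : a random.Random instance, so that sampling is reproducible
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)
z = sample_latent([0.1, -0.3], [-2.0, -1.0], rng)  # one stochastic forecast
```

Predicting the log variance rather than the variance keeps the standard deviation positive without any explicit constraint.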
Furthermore, Chen et al. (2021, 2024) have shown that more complex probability distributions can be represented by using normalizing flows. Based on this, this study introduces a normalizing flow into the prediction in latent space to represent complex motion patterns that cannot be captured by a simple Gaussian distribution. Specifically, a normalizing flow, expressed as a sequence of invertible transformations, is applied to the predicted latent variables to model the probability distribution more flexibly.
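The paper does not state which flow architecture is used. Purely as an illustration of one invertible transformation and its log-determinant, the sketch below implements a single planar flow layer in the form given by Rezende and Mohamed (2015); the parameter names u, w, and b are placeholders for learned weights.

```python
import math

def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b) and return (f(z), log|det J|).

    The layer is invertible when u.w >= -1; in a trained model this
    condition would have to be enforced on the learned parameters.
    """
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    out = [zi + ui * h for zi, ui in zip(z, u)]
    # Jacobian determinant of a planar flow: 1 + (1 - tanh(a)^2) * (u . w)
    det = 1.0 + (1.0 - h * h) * sum(ui * wi for ui, wi in zip(u, w))
    return out, math.log(abs(det))
```

By the change-of-variables rule, the log-density of the transformed sample is the base log-density minus the returned log-determinant, and stacking several such layers gives the sequence of transformations referred to in the text.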
During training, the negative log-likelihood (NLL) loss was used as a loss function to improve forecast accuracy. The NLL loss penalizes deviation of the actual data from the predicted probability distribution. This loss term plays an important role in obtaining appropriate results when making forecasts that incorporate uncertainty.
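The exact form of the NLL used by the authors is not recoverable from the text. As a point of reference only, for a diagonal Gaussian prediction with mean $\mu$ and variance $\sigma^2$ over a $D$-dimensional latent target $z$, the NLL takes the standard form below. The symbols $z$, $\mu$, $\sigma$, and $D$ are notation assumed for this sketch.

```latex
\mathcal{L}_{\mathrm{NLL}}
  = -\log \mathcal{N}\!\left(z \mid \mu, \operatorname{diag}(\sigma^2)\right)
  = \frac{1}{2} \sum_{d=1}^{D}
    \left[ \log \sigma_d^2 + \frac{(z_d - \mu_d)^2}{\sigma_d^2} + \log 2\pi \right]
```

When a normalizing flow is applied to the prediction, the log-determinant of the flow's Jacobian would also enter the likelihood through the change-of-variables rule.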
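In addition, the soft dynamic time warping (soft-DTW) loss (Cuturi and Blondel 2017) was used to measure the similarity of time series data. Soft-DTW captures temporal variation by minimizing the temporal distortion between predicted and actual poses. In the notation of Cuturi and Blondel (2017), it can be written as

$$\mathrm{DTW}_{\gamma}(x, y) = \min{}^{\gamma} \left\{ \langle A, \Delta(x, y) \rangle : A \in \mathcal{A} \right\}, \qquad \min{}^{\gamma}\{a_1, \dots, a_k\} = -\gamma \log \sum_{i=1}^{k} e^{-a_i / \gamma}.$$

Here, $\mathcal{A}$ denotes the set of all paths between sequences $x$ and $y$, $\Delta(x, y)$ represents the vector distance between corresponding time points, and $\gamma > 0$ is the smoothing parameter. This loss term is important for improving the prediction accuracy of time series data.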
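For readers who want to see how the soft-DTW value of Cuturi and Blondel (2017) is computed, the following is a minimal forward-pass sketch using the usual dynamic-programming recursion with a smooth minimum. The squared Euclidean frame distance and the function names are assumptions, and the gradient computation needed for training is omitted.

```python
import math

def soft_min(values, gamma):
    """Smooth minimum -gamma * log(sum(exp(-v / gamma))), computed stably."""
    lo = min(values)
    return lo - gamma * math.log(sum(math.exp(-(v - lo) / gamma) for v in values))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between sequences x and y (lists of equal-length vectors)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # r[i][j] holds the soft alignment cost of x[:i] against y[:j].
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum((a - b) ** 2 for a, b in zip(x[i - 1], y[j - 1]))
            r[i][j] = cost + soft_min(
                [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma)
    return r[n][m]
```

As gamma approaches zero the smooth minimum approaches the ordinary minimum, so the value approaches classical DTW; larger values of gamma give a smoother loss surface.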
3.3 Latent Space Data Assimilation
The D-LSAM performs state estimation based on distributions and sampling without using actual particles, and therefore it neither computes particle weights from the similarity to observed data nor performs resampling. As an alternative to updating the distribution by resampling, a cross-attention mechanism is introduced to fuse the observed data with the prediction and generate a proposed distribution. Furthermore, a gating mechanism is introduced and combined with selective state updating to update the state efficiently while making effective use of both predictions and observed data.
As shown in Fig. 4, a proposed distribution is generated based on the predicted distribution and observed data (embedded representation). Then, from the current and proposed distributions, important information is selectively updated to the next state using the gating mechanism.
3.3.1 Distribution Proposal Using Cross-Attention
Cross-attention, shown in Fig. 5, was used to integrate the predictive distribution and the observed data. Cross-attention is applied to the parameters $z_{\mathrm{pred}} = (\mu_{\mathrm{pred}}, \log \sigma_{\mathrm{pred}})$ of the predictive distribution at time $t$ and the latent representation $z_{\mathrm{obs}}$ of the observed data to estimate the parameters $z_{\mathrm{prop}} = (\mu_{\mathrm{prop}}, \log \sigma_{\mathrm{prop}})$. This process generates a proposed distribution that takes into account the observed data.
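The fusion step can be pictured with a single-head scaled dot-product attention in which the predictive parameters act as the query and the observation embeddings act as keys and values. The sketch below is an assumption-laden simplification: the learned query, key, and value projections, multiple heads, and the output layers of the actual model are omitted, and all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    query  : vector derived from the predictive distribution parameters
    keys   : one vector per observation source (for example IOBT and TDPT)
    values : one vector per observation source, to be mixed by the weights
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    fused = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return fused, weights
```

The attention weights indicate how strongly each observation source contributes to the proposal for the current prediction, which is what allows the fusion to adapt from frame to frame.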
3.3.2 Selective Update with Gating Mechanism
To transition smoothly from the current state to the proposed distribution, a gating mechanism was constructed with reference to Mamba's selection mechanism, as shown in Fig. 6. The gate value is computed dynamically from the parameters of the current state distribution and those of the proposed distribution, using cross-attention, fully connected layers, and a sigmoid function, so that it lies in the range of 0 to 1. The final state update is a weighted sum of the two distributions' parameters, with the gate value weighting the proposed distribution and its complement weighting the current state.
Since the gate value is dynamically determined by considering the current state and the confidence level of the proposed distribution, the gate value will be small and past information will be prioritized when the uncertainty of the proposed distribution is high. On the other hand, when the confidence level of the proposed distribution is high, the gate value becomes large and new information is more strongly reflected. This mechanism enables adaptive state updating while reducing the effect of noise.
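A minimal sketch of such a gated update is given below, assuming an element-wise convex combination of the current and proposed parameters. In the paper the gate is produced by cross-attention and fully connected layers; here the pre-sigmoid gate logits are simply passed in, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gated_update(current, proposed, gate_logits):
    """Blend parameters element-wise: g * proposed + (1 - g) * current.

    A gate near 0 keeps the current state (past information is prioritized);
    a gate near 1 adopts the proposed distribution (new information dominates).
    """
    gate = [sigmoid(g) for g in gate_logits]
    updated = [g * p + (1.0 - g) * c
               for g, p, c in zip(gate, proposed, current)]
    return updated, gate
```

The same blend would be applied to both the mean and the log standard deviation of the distribution.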
4 Experiments on Multi-Source Pose Data Fusion
To validate the proposed method, experiments were conducted using motion capture data obtained from IOBT on the Meta Quest 3 VR headset and from TDPT (ThreeD Pose Tracker), a smartphone application. IOBT uses the built-in cameras of the VR headset and can track the upper body, including hands and fingers, in real time and with high accuracy, but movements outside the headset's field of view are difficult to track. TDPT performs markerless motion capture with the built-in camera of a smartphone and can track full-body movements, but it has difficulty operating in real time. The proposed method aims to integrate these data to achieve more robust and accurate motion estimation.
Data collection for the fine tuning of the training model and for testing was done in the environment shown in Fig. 7.
4.1 Dataset and Preprocessing
AIST++ (Li et al. 2021; Tsuchida et al. 2019), a large dance movement dataset, was used for pre-training each component of the proposed method because it contains a wide variety of dance movements performed by multiple dancers, a large amount of dynamic human movement data, and highly accurate pose annotations.
In this study, to construct a dataset covering a variety of movements, three categories (daily movements, dance movements, and sports movements) were established, and data on ten different movements were collected twice each. The daily movements included basic movements such as “walking,” “sitting,” and “talking”; the dance movements included choreographic moves with basic steps such as “up-down,” “box step,” and “running man”; and the sports movements included movements such as “squatting,” “throwing,” and “serving.” For each movement, both IOBT and TDPT were used to record the positions and rotations of 54 joints conforming to Unity Humanoid. Each movement was recorded for 10 s and sampled at 60 Hz. Examples of the collected motion data are shown in Fig. 8.
In addition, to obtain ground truth (GT) data, four cameras were used to capture each movement from different viewpoints, and existing pose estimation methods such as MediaPipe were applied to create highly accurate data. The 3D joint positions were obtained by integrating the information obtained from multiple viewpoints.
GT data were acquired using a markerless motion capture system with multi-view cameras to obtain 3D skeleton data. A shooting environment of approximately 3.5-m square was set up. RGB cameras capable of synchronized recording at full HD resolution and 60 fps were installed at each corner to simultaneously capture the subject's full-body movements from multiple viewpoints. The position and orientation of each camera were calibrated beforehand to estimate the intrinsic and extrinsic parameters.
The MediaPipe Pose model (Bazarevsky et al. 2020; Grishchenko et al. 2022) was applied to each acquired viewpoint image to extract the 2D coordinates of 33 joint points. MediaPipe Pose is a lightweight, real-time pose estimation library that outputs 33 landmark points corresponding to a standard skeleton structure. These were combined with camera parameters to perform triangulation based on multi-view geometry, and the 3D joint positions for each frame were reconstructed using singular value decomposition (SVD). The SVD-based method is a standard technique for algebraically integrating information from multiple viewpoints. However, errors in 2D estimation or occlusions can sometimes result in unnatural joint placements. Therefore, anatomical constraints, such as bone lengths, were introduced to correct the 3D reconstruction results, ensuring consistency with human anatomy. This made the 3D pose estimation anatomically plausible and robust to noise. The final joint point data were converted to the Unity Humanoid format using a proprietary algorithm and further smoothed temporally by applying the One Euro Filter (Casiez et al. 2012). This series of processes enables the generation of high-precision GT motion data from multi-view video.
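The SVD-based triangulation step can be illustrated with the standard direct linear transform (DLT). The sketch below assumes calibrated 3x4 projection matrices and one 2D detection per view; the function name is hypothetical, and the bone-length correction and temporal smoothing described above are not included.

```python
import numpy as np

def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    projections : list of 3x4 camera projection matrices
    points_2d   : list of (u, v) image coordinates, one per view
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

With four cameras, each joint yields eight equations in four unknowns, so the SVD returns a least-squares solution that averages out part of the 2D detection noise.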
4.2 Model Training and Evaluation
The proposed method requires pre-training of two key components: the WAE and Latent Time-Stepping. For this purpose, we first pre-trained on the AIST++ dataset and then performed fine tuning on the original dataset. One hundred dance sequences (30,000 frames in total) from the AIST++ dataset were used for training. For the original dataset, 3D pose data recorded by the multiple cameras were used as GT, and the two recordings of each movement were split between training and testing at a ratio of 1:1.
In training the WAE, data conversion to the Unity Humanoid format and standardization were performed as preprocessing, and scaling, rotation within the horizontal plane, and random noise were applied as data augmentation.
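The three augmentations can be sketched as below. This is an illustration under stated assumptions: the Y axis is taken as vertical (the Unity convention), rotation is about that axis, noise is Gaussian, and the function name and parameter ranges are hypothetical rather than taken from the paper.

```python
import math
import random

def augment_pose(joints, scale, yaw, noise_std, rng):
    """Rotate joints about the vertical (Y) axis, scale them, and add noise.

    joints    : list of [x, y, z] joint positions
    scale     : uniform scaling factor
    yaw       : rotation angle in radians within the horizontal (X-Z) plane
    noise_std : standard deviation of the additive Gaussian noise
    rng       : a random.Random instance
    """
    c, s = math.cos(yaw), math.sin(yaw)
    out = []
    for x, y, z in joints:
        xr = c * x + s * z
        zr = -s * x + c * z
        out.append([scale * xr + rng.gauss(0.0, noise_std),
                    scale * y + rng.gauss(0.0, noise_std),
                    scale * zr + rng.gauss(0.0, noise_std)])
    return out

rng = random.Random(0)
pose = [[0.0, 1.6, 0.0], [0.2, 1.4, 0.1]]
augmented = augment_pose(pose, scale=rng.uniform(0.9, 1.1),
                         yaw=rng.uniform(-math.pi, math.pi),
                         noise_std=0.005, rng=rng)
```

Rotating only about the vertical axis keeps the pose upright, so the augmented samples remain physically plausible.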
As hyperparameters, we tuned the number of GCN blocks and units in the encoder of the WAE, the number of layers and units in the fully connected layer of the decoder, as well as the batch size, learning rate, and weights of each loss function.
Optuna (Akiba et al. 2019) was used to optimize the hyperparameters. After training, the average reconstruction error of the WAE was 26.1 mm, confirming sufficient reconstruction accuracy.
For Latent Time-Stepping training, the frame rate was unified to 30 fps, and training was performed to predict motions up to 0.4 s (12 frames) into the future from the past 1 s (30 frames) of data. As a comparison, Table 1 shows the results of training STS-GCN (Sofianos et al. 2021), a previous study on pose prediction, under the same conditions.
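The 30-frame input and 12-frame target described above correspond to a sliding-window split of each sequence. The sketch below shows one way to build such training pairs; the function name and the stride of one frame are assumptions.

```python
def make_windows(frames, n_input=30, n_future=12, stride=1):
    """Split a frame sequence into (past, future) training pairs.

    With 30 fps data, n_input=30 covers the past 1 s and n_future=12
    covers the next 0.4 s.
    """
    windows = []
    last_start = len(frames) - (n_input + n_future)
    for start in range(0, last_start + 1, stride):
        past = frames[start:start + n_input]
        future = frames[start + n_input:start + n_input + n_future]
        windows.append((past, future))
    return windows
```

For example, if a 10 s recording at 60 Hz (600 frames) were reduced to 30 fps by keeping every second frame, a stride of one would yield 259 such pairs.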
Table 1. Training results of pose prediction models

| MPJPE (m) by Time | 100 ms | 400 ms |
|---|---|---|
| STS-GCN | 0.082 | 0.152 |
| Ours | 0.079 | 0.154 |

Note: Training was performed using the Latent Time-Stepping method with a frame rate unified to 30 fps, predicting motions up to 0.4 s (12 frames) into the future from the past 1 s (30 frames) of data. The results were compared under the same conditions with the STS-GCN model (Sofianos et al. 2021).
The average error at 100 ms was about 8 cm, confirming that relatively natural motion prediction was possible in the short-term prediction.
Furthermore, we conducted a detailed examination of the prediction model architecture. First, we compared varying input sequence lengths of 12, 18, 30, and 60 frames. When the input was too short (12 frames), prediction accuracy decreased due to insufficient information. Conversely, when the input was too long (60 frames), while the amount of effective information did not increase, errors tended to accumulate more easily. Experimental results confirmed that an input length of 30 frames yielded the best accuracy, achieving a balance between short-term and long-term prediction.
For the loss function, we employed a composite loss combining the conventional MSE with the soft dynamic time warping (soft-DTW) loss and the negative log-likelihood (NLL) loss introduced in the previous section. Comparative experiments showed no significant difference between MSE alone and MSE + DTW, but a weighted combination of MSE, DTW, and NLL ultimately gave the most stable and most accurate results.
Regarding the network architecture, we tuned hyperparameters such as the number of heads in the multi-head attention, the number of blocks, and the embedding dimension in the Transformer encoder–decoder.
To verify the effectiveness of the proposed method, the accuracy of pose integration was evaluated and compared with that of previous methods. Delays of 100 and 400 ms were assumed for the inference, and the results are shown in Tables 2 and 3, which compare the prediction accuracy by category. Mean per joint position error (MPJPE) was employed as the evaluation index; MPJPE represents the mean error of the reconstructed 3D position of each joint point, with smaller values indicating more accurate estimation.
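For reference, MPJPE as defined above can be computed as in the following sketch, which assumes that predictions and ground truth are given as nested lists indexed by frame, joint, and coordinate. The function name is hypothetical.

```python
import math

def mpjpe(pred, target):
    """Mean per joint position error over all frames and joints.

    pred, target : nested lists indexed as [frame][joint][xyz], in meters
    """
    total, count = 0.0, 0
    for frame_p, frame_t in zip(pred, target):
        for joint_p, joint_t in zip(frame_p, frame_t):
            total += math.dist(joint_p, joint_t)  # Euclidean distance
            count += 1
    return total / count
```

Because the result is an average of Euclidean distances, it carries the same unit as the joint coordinates, which is meters in Tables 2 and 3.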
As a baseline for the evaluation, we employed the D-LSPF, which uses the same components as the proposed method but performs particle-based weight calculations and soft resampling (Jang et al. 2016) in the observation updates. Similar experiments were also conducted with a traditional method, the EKF, which is widely used in sensor fusion as it estimates the internal state of the system while correcting it with observed data.
Table 2. Evaluation accuracy assuming 100 ms delay (MPJPE)

| MPJPE (m) by Category | Daily | Dance | Sport |
|---|---|---|---|
| EKF | 0.0995 | 0.1252 | 0.1537 |
| D-LSPF | 0.1299 | 0.1437 | 0.1680 |
| Ours | 0.0650 | 0.1008 | 0.1347 |

Note: This experiment assumed a 100-ms delay in the MPJPE calculation.
Table 3. Evaluation accuracy assuming 400 ms delay (MPJPE)

| MPJPE (m) by Category | Daily | Dance | Sport |
|---|---|---|---|
| EKF | 0.1161 | 0.2111 | 0.1961 |
| D-LSPF | 0.1392 | 0.1733 | 0.1829 |
| Ours | 0.0814 | 0.1270 | 0.1516 |

Note: This experiment assumed a 400-ms delay in the MPJPE calculation.
Experimental results confirmed that the proposed method has the lowest MPJPE in all categories and can achieve highly accurate pose estimation. This improvement in accuracy may be attributed to the fact that the proposed method adopts a distribution-based approach, which allows for more flexible prediction and data integration in the latent space.
The overall large error of the D-LSPF is thought to arise because the observed data themselves contain large errors, so that particle-based resampling, which pulls the estimate toward the observations, in fact increased the error. This result supports the effectiveness of the proposed method, which balances observations against forecasts and thus allows flexible estimation even when the observed data contain large errors. In the 400-ms forecast, the EKF error was larger, especially in the Dance and Sport categories. This suggests that the EKF cannot capture complex human movements in longer-term forecasts, whereas motion prediction based on deep learning remained effective.
5 Architectural Changes for Improved Processing Efficiency
This section describes architectural improvements aimed at accelerating inference processing for the practical implementation of the proposed D-LSAM method. We first measured processing speeds and analyzed bottlenecks during the initial phase and then designed multiple improvement strategies. Specifically, experiments were conducted on four approaches: replacing the prediction module with a lighter-weight architecture, integrating the future prediction and observation fusion modules, converting sequential predictions to batch processing, and performing fusion in the latent space centered on observation data. The impact of each approach on accuracy and processing speed was evaluated. First, we report the measured inference times for D-LSAM in its initial stage. We then present the design principles for each improvement method, concluding with a comprehensive comparison and discussion.