Category-Level 6D Pose Estimation Based on Deep Cross-Modal Feature Fusion
Chunhui Tang1
Mingyang Zhang1
Yi Zhao1
Shouxue Shan2 ✉ Email
1 School of Optical Information and Computer Engineering, University of Shanghai for Science and Technology, No. 516 Jungong Road, Yangpu District, Shanghai 200093, China
2 College of Economics and Information, Zhejiang Tongji Vocational College of Science and Technology, Ningwei Street, Xiaoshan District, Hangzhou 311200, China
Abstract
Category-level 6D pose estimation methods aim to predict the rotation, translation, and size of unseen objects in a given category. RGB-D based dense correspondence methods have achieved leading performance. However, because the textures and shapes of objects within a category vary, the object masks produced by previous instance segmentation methods may be defective, resulting in inaccurate object point clouds back-projected from the depth map and inaccurate RGB image patches obtained by cropping. Existing fusion methods that directly concatenate RGB and geometric features cannot obtain accurate fused features. To solve these problems, we propose a new data processing method that improves the accuracy of the input data: the object position information provided by an object detection algorithm is fused with the image embeddings extracted by a vision transformer to obtain an accurate object mask. In addition, we introduce a new implicit fusion strategy that employs a cross-attention mechanism to align the two different semantic features and then reasons about the fused features of the two input modalities through a transformer-based architecture. We demonstrate the approach's effectiveness through experiments on two publicly available datasets, REAL275 and CAMERA25.
Keywords
6D Pose estimation
Implicit fusion
Instance segmentation
Transformer
Chunhui Tang and Mingyang Zhang: These authors contributed equally to this work.
Introduction
Category-level 6D pose estimation focuses on identifying an object's 3D translation and 3D rotation with respect to a camera. This technique plays a crucial role in several domains, such as robotic arm manipulation\cite{Mousavian_Eppner_Fox_2019}, augmented reality\cite{gattullo2019towards}, and virtual reality\cite{cipresso2018past}. Early research efforts focused on instance-level 6D pose estimation\cite{He_Sun_Huang_Liu_Fan_Sun_2019, He_Huang_Fan_Chen_Sun_2021, zhou2023deep, 2024A}. However, the application of these methods has been limited by the difficulty of generalizing them to other instances of the same category. In recent years, therefore, researchers have turned to category-level pose estimation methods with stronger generalization ability, which handle the pose estimation of unseen objects within a category more effectively and have consequently attracted widespread attention.
Category-level pose estimation aims to train a network on a limited sample of objects from different categories so that it can predict the 6D poses and sizes of unseen objects in the same categories. This approach has demonstrated excellent generalization performance on the pose estimation task\cite{wang2019normalized}. Existing category-level pose estimation methods based on RGB-D images usually adopt a two-stage process: first, an object-specific mask is extracted from the image using instance segmentation techniques\cite{he2017mask}; then, the RGB image, the depth map, and the mask are combined to generate the corresponding image patches and point clouds, which serve as inputs to the subsequent pose estimation module. In the second stage, the pose estimation module establishes correspondences between the input points and 3D points in the normalized object coordinate space (NOCS)\cite{wang2019normalized} by combining the features of each input point to accurately solve the 6D pose of the object\cite{chen2021sgpa, wang2019normalized, liu2023prior, lin2022sar}.
Fig. 1
Pose estimation error analysis of SGPA\cite{chen2021sgpa}.
Although previous approaches have improved the accuracy of pose estimation, they have typically not adequately considered the quality of the input data, particularly the challenges posed by missing and noisy modal data. For example, Fig. 1 shows defective mask images of a laptop and a camera caused by incomplete masks generated during instance segmentation. As a result, the point cloud reconstructed from the depth information is incomplete and noisy, which seriously affects subsequent processing. In the feature fusion stage, previous methods directly combine RGB features with geometric features and thus fail to resolve the spatial feature perturbation triggered by missing data and noise, which directly reduces the accuracy of pose estimation and produces erroneous results.
In this paper, we aim to mitigate the negative impact of missing input modal data and noise on the pose estimation results by introducing innovative data processing and feature fusion techniques. Specifically, in the data processing phase, we employ an open-vocabulary object detection algorithm to automatically obtain the positional information of objects in the image and use it as a prompt for Segment Anything (SAM)\cite{kirillov2023segment}. This approach achieves instance segmentation without manually specifying the segmentation region and accurately extracts the mask even for unseen objects within the category. In the feature fusion stage, we propose a novel cross-modal feature fusion method. The method fully exploits the intrinsic correlation between the two input modalities and implicitly integrates relevant features from different modalities, thus effectively mitigating the spatial feature perturbation caused by missing data and noise. This cross-modal fusion strategy enhances the model's understanding and utilization of multi-source information, improving the robustness and accuracy of pose estimation and ensuring reliable results even under complex and suboptimal conditions.
1.
We adopt a deep cross-modal feature fusion module. This module implicitly aggregates significant features of both modal data by reasoning about the global semantic similarity between appearance and geometric information, effectively overcoming the adverse effects of missing and noisy modal data.
Related Work
Instance-Level Object Pose Estimation
Instance-level 6D object pose estimation aims to estimate the 6D pose of a specific object. This research area mainly includes four types of methods: correspondence-based, template-matching, voting-based, and regression-based methods. Correspondence-based methods can be divided into sparse correspondence methods\cite{wang2019densefusion} and dense correspondence methods\cite{Li_Wang_Ji_2019}. These methods establish sparse or dense correspondences between the image or point cloud and the CAD model and use PnP or least squares to solve the 6D pose. Template-based methods\cite{Sundermeyer_Marton_Durner_Brucker_Triebel_2018, Li_Lin_Jia_2022} match the most similar reference model to the object to be estimated and directly apply the known pose of the model to the target object; they effectively address the challenges posed by texture-less objects. Voting-based methods\cite{Liu_Iwase_Kitani_2021, Tian_Pan_Ang_Lee_2020} estimate a set of predefined 2D or 3D key points through pixel-level or point-level voting schemes and establish correspondences with the CAD model to solve the object's pose. Regression-based methods\cite{Gao_Lauri_Wang_Hu_Zhang_Frintrop_2020, Kleeberger_Huber_2020} directly obtain the object's pose from the features extracted by a network.
Category-Level Object Pose Estimation
Category-level pose estimation methods aim to predict the 6D poses of all instances within a specified category without relying on a specific 3D object model. NOCS\cite{wang2019normalized} creates a standard 3D model shared within the class in the normalized object coordinate space, establishes dense correspondences between pixel points in the input RGB image and the standard 3D model in the NOCS space through a network, and employs the Umeyama algorithm\cite{Umeyama_1991} to obtain the object's pose and size. Considering the shape variation of objects within a class, SPD\cite{tian2020shape} proposes a 3D model reconstruction strategy based on object deformation: it utilizes RGB-D data and deforms the class shape prior to adapt it to the specific morphology of different instances. This approach effectively addresses the challenges caused by shape differences and improves pose estimation accuracy. To reduce the effect of redundant information on the estimation results, SGPA\cite{chen2021sgpa} introduces a key-point selection mechanism. It improves pose estimation accuracy by acquiring pixel-point fusion features\cite{wang2019densefusion} and establishing sparse correspondences between key points and points in the NOCS space; selecting key points enables the model to focus on the most representative features, further enhancing robustness. Unlike the above approaches that build a standard 3D model in the NOCS space, gCasp\cite{ligenerative} employs a generative network to directly generate a 3D model of the object and establishes semantic correspondences between the input point cloud and the generated 3D model to solve the object's pose. This approach copes effectively with the shape differences of objects within a class and provides a more flexible and adaptive solution.
Method
Overview
For a given RGB-D image, the object's RGB image patch and point cloud are acquired by our data processing module and used as inputs to the feature extraction module (Sect. 7). The feature extraction module employs a CNN\cite{he2016deep} and 3D-GCN\cite{Lin_Huang_Wang_2020} as encoders for the two different inputs. In the decoding phase, step-by-step fusion is performed using a deep cross-modal feature fusion module (DCF), which effectively aggregates semantic features from the different modalities (Sect. 8). Next, the DenseFusion\cite{wang2019densefusion} module combines RGB appearance features and point cloud geometric features at each point, resulting in point-wise fused features. A multi-layer perceptron (MLP)\cite{taud2018multilayer} then extracts key points of the point cloud in the camera coordinate system, while the 3D model generation module\cite{hao2020dualsdf} constructs the 3D model in the object coordinate system (Sect. 9). Finally, by aligning the 3D models in the camera and object coordinate systems, we obtain the 6D pose of the object (Sect. 10).
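To make the data flow above concrete, the following pseudocode sketches one forward pass of the pipeline. Every callable passed in (detector, sam, crop_patch, backproject, the encoders, dcf, dense_fusion, keypoint_mlp, shape_generator, solve_pose) is an illustrative placeholder name, not an identifier from our implementation.

```python
def estimate_poses(rgb, depth, detector, sam, crop_patch, backproject,
                   rgb_encoder, pcd_encoder, dcf, dense_fusion,
                   keypoint_mlp, shape_generator, solve_pose):
    """Illustrative sketch of the overall pipeline (all names are placeholders)."""
    boxes = detector(rgb)                              # N x 4 bounding boxes
    masks = sam(rgb, boxes)                            # one mask per detected object
    results = []
    for mask in masks:
        patch = crop_patch(rgb, mask)                  # object RGB image patch
        points = backproject(depth, mask)              # object point cloud
        f_rgb = rgb_encoder(patch)                     # appearance features (CNN)
        f_geo = pcd_encoder(points)                    # geometric features (3D-GCN)
        f_rgb, f_geo = dcf(f_rgb, f_geo)               # deep cross-modal feature fusion
        f_fused = dense_fusion(f_rgb, f_geo)           # point-wise fused features
        kpts = keypoint_mlp(f_fused, points)           # key points, camera frame
        model_3d = shape_generator(f_fused)            # generated 3D model, object frame
        results.append(solve_pose(kpts, model_3d))     # rotation, translation, size
    return results
```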
Fig. 2
(a) Framework overview. (b) Data processing module. (c) Feature extraction and fusion module (DCF). (d) Pose and size estimation. ⊕ denotes the feature concatenation operation, and & signifies the image AND operation, which selects the corresponding mask region from the depth image. In this module, $F_{fused}$ represents the fused features.
Data processing module
Previous data processing modules cannot obtain accurate masks for unseen objects within a category, which biases the pose estimation results. To address this, as shown in Fig. 2(b), we propose a new end-to-end data processing module built on the SAM large-model instance segmentation method\cite{kirillov2023segment}, which greatly improves generalization to objects within the category.
As shown in Fig. 2(b), the object detection algorithm\cite{cheng2024yolo} acquires the bounding boxes of the objects, denoted as $N\times4$, where $N$ is the number of objects in the image and 4 denotes the coordinates of the upper-left and lower-right corners $(x_{left}, y_{top}, x_{right}, y_{bottom})$ of each bounding box. The position encoder\cite{Tancik_Srinivasan_Mildenhall_Fridovich-Keil_Raghavan_Singhal_Ramamoorthi_Barron_Ng_2020} extracts the object position features as $F_p{\in}\mathbb{R}^{N\times2\times256}$, where 2 denotes the two corner points of the bounding box.
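As an illustration of the corner encoding, the sketch below maps $N\times4$ boxes to $N\times2\times256$ features with random Fourier features in the spirit of \cite{Tancik_Srinivasan_Mildenhall_Fridovich-Keil_Raghavan_Singhal_Ramamoorthi_Barron_Ng_2020}; the frequency count, scale, and normalization are assumptions for illustration, not the exact configuration of our position encoder.

```python
import torch

def fourier_box_encoding(boxes, image_size, num_freqs=128, scale=1.0):
    """Encode N x 4 boxes (x_left, y_top, x_right, y_bottom) into N x 2 x 256
    random Fourier position features (frequency count and scale are illustrative)."""
    n = boxes.shape[0]
    corners = boxes.view(n, 2, 2) / image_size             # normalize the two corners to [0, 1]
    freqs = torch.randn(2, num_freqs) * scale               # random projection, fixed at init in practice
    proj = 2 * torch.pi * corners @ freqs                   # N x 2 x num_freqs
    return torch.cat([proj.sin(), proj.cos()], dim=-1)      # N x 2 x 256

boxes = torch.tensor([[40., 60., 200., 220.]])              # one hypothetical bounding box
pos_feat = fourier_box_encoding(boxes, image_size=640.)
print(pos_feat.shape)                                        # torch.Size([1, 2, 256])
```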
The input RGB image is first padded to a fixed size, and the image features $F_i$ are extracted by a ViT encoder\cite{dosovitskiy2020image} pre-trained with MAE\cite{He_Chen_Xie_Li_Dollar_Girshick_2022}. The two different semantic features, $F_p$ and $F_i$, are fed into a transformer-based network module for fusion. A self-attention mechanism\cite{vaswani2017attention} is first employed to enhance the global representation capability. Subsequently, the position features are used as the query in a cross-attention module\cite{liu2019cross}, with the image features $F_i$ as key and value, to fuse the information of the two different modalities, position and image. To further focus the image features on the position information, the fused features after MLP processing are used as the key and value of a subsequent cross-attention module, while the original image features are used as the query to obtain the final fused features. The fused features are mapped to $\mathbb{R}^{H\times W}$ by an up-sampling module, where $H$ and $W$ are the height and width of the input image, respectively. Finally, a series of fully connected layers computes the probability that each pixel in the object location region belongs to the mask and generates the final mask.
By employing a mask decoder based on the transformer architecture and training on a large-scale mask dataset, the model can accurately extract an object's mask even for objects whose texture and shape vary significantly within the category. This mask extraction method maintains high accuracy in complex and changing scenes, thus providing a more reliable basis for subsequent tasks.
Deep Cross-Modal Feature Fusion Module
Directly concatenating the RGB features of each pixel with the corresponding geometric features cannot overcome the interference caused by missing input modal data and noise in the feature space. We therefore introduce a practical deep cross-modal feature fusion module for RGB-D data into the original backbone network\cite{ligenerative}. The core of this module is a new transformer-based cross-modal fusion module, the Deep Cross-Modal Feature Fusion Network (DCF).
The RGB features at the $i$-th layer of the decoder in each feature extraction branch, denoted as $F_R$, and the point cloud geometric features $F_G$ are taken as the inputs to the DCF fusion module. For these features with two different semantic structures, we first apply the self-attention mechanism (SA)\cite{vaswani2017attention} to the RGB features $F_R$ and the geometric features $F_G$ to establish the correlations within each feature, enhancing their global representation ability. Then, the features $F_R$ are unfolded to align their feature space with the geometric features, so that both are composed of feature vectors of size $C$; the unfolded representation is denoted as $\hat{F}_R\in\mathbb{R}^{N\times C}$, where $N=H\times W$. A cross-attention mechanism is introduced to reduce the interference from missing modal data and noise. The global feature vector obtained through max pooling is used as the query feature, and the following linear transformations are applied to obtain the query (Q), key (K), and value (V) matrices required for the cross-attention mechanism:
$$Q = W_Q F_{max},\qquad K = W_K F,\qquad V = W_V F,$$
where $W_Q$, $W_K$, and $W_V$ represent the linear transformation matrices for Q, K, and V, respectively, $F_{max}$ represents the feature obtained from max pooling, and $F$ denotes the feature sequence of the other modality ($\hat{F}_R$ or $F_G$).
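A minimal sketch of this step, assuming token sequences of equal length and letting nn.MultiheadAttention hold the $W_Q$, $W_K$, $W_V$ projections; the dimensions, head count, and variable names are illustrative rather than the exact configuration of DCF.

```python
import torch
import torch.nn as nn

class PooledQueryCrossAttention(nn.Module):
    """Sketch: the max-pooled global vector of one modality queries the token
    sequence of the other modality (dimensions are illustrative)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_query_tokens, f_other_tokens):
        # Global query: max pooling over the token dimension, kept as a length-1 sequence.
        f_max = f_query_tokens.max(dim=1, keepdim=True).values        # B x 1 x C
        # Q from the pooled global feature, K/V from the other modality's tokens;
        # the W_Q, W_K, W_V projections live inside nn.MultiheadAttention.
        out, _ = self.attn(f_max, f_other_tokens, f_other_tokens)      # B x 1 x C
        return out

# Hypothetical usage: each modality's pooled query attends to the other modality.
f_r = torch.randn(2, 1024, 256)    # unfolded RGB features  (B x N x C)
f_g = torch.randn(2, 1024, 256)    # point cloud features   (B x N x C)
ca = PooledQueryCrossAttention()
f_r2g = ca(f_r, f_g)                # RGB-side global query over geometric tokens
f_g2r = ca(f_g, f_r)                # geometry-side global query over RGB tokens
```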
Fig. 3
Deep Cross-Modal Feature Fusion Module. \label{deep_fusion}
We use a multi-head cross-attention module to align the semantic information of the two modalities and obtain network inference features. The multi-head cross-attention mechanism learns information in parallel from different feature subspaces, thereby capturing a variety of features and dependencies in the input data and improving the inference ability of the network model:
$$\mathrm{CA}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
$$\mathrm{MCA}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W_O,\qquad \mathrm{head}_i=\mathrm{CA}\!\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right),$$
where $\mathrm{CA}$ is the cross-attention mechanism, $\mathrm{head}_i$ represents the $i$-th head in the multi-head cross-attention, $\sqrt{d_k}$ denotes the scaling factor with $d_k=C/h$, $\mathrm{softmax}$ is the standard softmax normalization function, $W_O$ represents the output transformation matrix, and $h$ is the number of cross-attention heads. The features $F_{R2G}$ and $F_{G2R}$ obtained from the cross-attention module are concatenated with the original input features $F_R$ and $F_G$, respectively, to obtain the semantically aligned features, denoted as $\tilde{F}_R$ and $\tilde{F}_G$. These semantically aligned features are concatenated into a new feature sequence $F_{cat}$, which is the input to the transformer-based module.
Contrary to conventional direct feature concatenation, we utilize a transformer-based deep network module to perform implicit fusion of the cross-modal features. This enables reasonable inference of features from missing modalities while minimizing the impact of noise. After processing through the cross-attention mechanism, the resulting feature representation is denoted as $F^{L}$, with $L$ indicating the layer within this mechanism. The feature $F^{L}$ is subsequently fed into the transformer-based module illustrated in Fig. 3, yielding the final implicit fusion feature representation, denoted as $F_{final}$:
$$z_0 = F^{L} + E_{pos},\qquad z'_{l} = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1},\qquad z_{l} = \mathrm{MLP}(\mathrm{LN}(z'_{l})) + z'_{l},\qquad F_{final} = \mathrm{LN}(z_{L_T}),$$
where $l=1,\dots,L_T$ indexes the transformer layers, $E_{pos}$ is the position embedding, $\mathrm{MSA}$ stands for multi-head self-attention, $\mathrm{LN}$ for layer normalization, and $\mathrm{MLP}$ for multi-layer perceptron. Ultimately, the feature $F_{final}$ is divided into two sequences, one per modality, which are then integrated into the original feature branches to serve as the fused features.
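The sketch below illustrates this implicit fusion step with a standard transformer encoder, assuming learned position embeddings and a fixed maximum token count; depth, width, and head count are illustrative choices, not the exact configuration of DCF.

```python
import torch
import torch.nn as nn

class ImplicitFusionTransformer(nn.Module):
    """Sketch: the concatenated, semantically aligned RGB and geometric token
    sequences are jointly encoded and then split back into the two branches
    (layer/width choices are illustrative)."""
    def __init__(self, dim=256, depth=2, heads=4, max_tokens=2048):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, max_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, f_r_aligned, f_g_aligned):
        n_r = f_r_aligned.shape[1]
        f_cat = torch.cat([f_r_aligned, f_g_aligned], dim=1)       # B x (N_R + N_G) x C
        f_cat = f_cat + self.pos_embed[:, :f_cat.shape[1]]          # add position embeddings
        f_final = self.encoder(f_cat)                                # MSA + LN + MLP blocks
        return f_final[:, :n_r], f_final[:, n_r:]                    # split back per modality
```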
In summary, for the two modal features output from the decoding layers of the network, DCF not only establishes long-range dependencies between them but also implicitly aggregates the key features of the two modalities by inferring the global semantic similarity between appearance and geometric information. This global semantic consistency helps mitigate feature space perturbations caused by missing or noisy modal data. With two independent feature extraction branches and the proposed fusion module, the system implicitly acquires dense point-pair features and passes them to the DenseFusion\cite{wang2019densefusion} module to generate richer dense fusion features.
Keypoint Extraction and 3D Model Reconstruction
For the point-wise fusion features acquired from DCF and DenseFusion, we refer to \cite{ligenerative} by constructing dense correspondences between key points in the camera coordinate system and the 3D model within the generated object coordinate system. To determine the 6D pose of the object, we employ the Umeyama algorithm\cite{Umeyama_1991}. This process ensures accurate alignment and positioning of the 3D model relative to the camera's viewpoint.
To acquire key points in the camera coordinate system, we perform point-wise classification on the fusion features of the input point cloud. The network's loss function is the cross-entropy shown in Equation \ref{crossentropy}:
\begin{equation}\label{crossentropy}
\mathcal{L}_{ce} = -\sum_{i}\log p\!\left(\hat{c}_i = c_i\right),
\end{equation}
where $\hat{c}_i$ represents the predicted semantic class label for a point $p_i$ in the point cloud augmented with RGB features, and $c_i$ denotes the ground-truth semantic class label. The semantic class label is defined as shown in Equation \ref{class_label}:
\begin{equation}\label{class_label}
c_i = \operatorname*{arg\,min}_{j\in\{1,\dots,N_c\}}\left\lVert o_i - s_j\right\rVert_2,
\end{equation}
where $o_i$ denotes the corresponding point of $p_i$ in the NOCS within the training dataset, $s_j$ represents the 3D points selected from 3D models with distinct semantic classes in the training data, NOCS stands for the normalized object coordinate space, and $N_c$ denotes the number of semantic classes (256 in total). By performing point-wise classification, we compute the average coordinates of the input point cloud points that share the same semantic class label and designate these averages as key points in the camera coordinate system.
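A minimal sketch of the class-wise averaging that yields the key points, assuming per-point class logits from the MLP head; shapes and names are illustrative.

```python
import torch

def extract_keypoints(points, class_logits, num_classes=256):
    """Average the 3D coordinates of points sharing a predicted semantic class
    label to obtain per-class key points in the camera frame. Returns the
    key points (num_classes x 3) and a validity mask for classes that
    received at least one point."""
    labels = class_logits.argmax(dim=-1)                   # N, predicted class per point
    keypoints = torch.zeros(num_classes, 3)
    valid = torch.zeros(num_classes, dtype=torch.bool)
    for c in range(num_classes):
        sel = labels == c
        if sel.any():
            keypoints[c] = points[sel].mean(dim=0)          # mean coordinate of class c
            valid[c] = True
    return keypoints, valid
```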
Fig. 4
Reconstruct the 3D model. \label{3D_model}
In contrast to previous methods that establish class-unified 3D models within the NOCS, our method employs a generative network\cite{hao2020dualsdf} to reconstruct the 3D model of the object. As illustrated in Fig. 4, the generated 3D model is composed of 256 semantic primitives, each labeled with distinct semantic class labels, with different colors used to differentiate these labels. By establishing correspondences between the key points in the camera coordinate system and the 3D model in the generated object coordinate system using semantic labels, we can accurately determine the 6D pose of the object.
Pose and Size Estimator
By leveraging the semantic class correspondence between the key points $k_i$ in the camera coordinate system and the corresponding points $m_i$ of the generated 3D model in the object coordinate system, we apply Equation \ref{estimate_pose} to solve for the 6D pose of the object:
\begin{equation}\label{estimate_pose}
\min_{s,R,t}\ \sum_{i=1}^{N_k}\left\lVert k_i - \left(sR\,m_i + t\right)\right\rVert_2^{2},
\end{equation}
which is solved in closed form with the Umeyama algorithm\cite{Umeyama_1991}. Here, $R$, $t$, and $s$ denote the rotation, translation, and scale, and $N_k$ indicates the total number of key points involved in establishing this correspondence.
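For reference, a compact sketch of the closed-form Umeyama solution to Equation \ref{estimate_pose}; src corresponds to the generated-model points $m_i$, dst to the camera-frame key points $k_i$, and the variable names are illustrative.

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Closed-form similarity transform (s, R, t) minimizing
    sum_i || dst_i - (s R src_i + t) ||^2  (Umeyama, 1991).
    src: object-frame key points (N x 3), dst: camera-frame key points (N x 3)."""
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / src.shape[0]                  # 3 x 3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:          # reflection correction
        S[2, 2] = -1
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src                 # optimal uniform scale
    t = mu_dst - s * R @ mu_src
    return s, R, t
```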
Experiments
Dataset and Evaluation Metrics
The proposed method in this paper is evaluated on the NOCS dataset, focusing primarily on the object segmentation within the data processing module and the pose estimation performance of the entire network. The dataset comprises six object categories: bottle, bowl, camera, can, laptop, and mug.
In line with previous work, we adopt two evaluation metrics: 3D Intersection over Union (IoU) and pose error. The 3D IoU measures the overlap ratio between the 3D bounding box predicted by the network and the ground truth, with commonly used thresholds of 50% and 75%. The pose error measures the rotational and translational errors between the predicted pose and the true pose: the rotational error is quantified as an angular difference, while the translational error is measured as a displacement distance. The commonly used error thresholds are $(5^{\circ},2\,\mathrm{cm})$, $(5^{\circ},5\,\mathrm{cm})$, $(10^{\circ},2\,\mathrm{cm})$, and $(10^{\circ},5\,\mathrm{cm})$.
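As a reference, the sketch below shows how the rotation and translation thresholds can be checked for a single prediction; the meter-to-centimeter conversion and the omission of symmetry handling are assumptions, not a reproduction of the official NOCS evaluation code.

```python
import numpy as np

def pose_errors(R_pred, t_pred, R_gt, t_gt):
    """Rotation error (degrees) and translation error (cm) between two poses;
    symmetric objects would need a symmetry-aware variant of this check."""
    cos_angle = np.clip((np.trace(R_pred @ R_gt.T) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_angle))
    trans_err_cm = np.linalg.norm(t_pred - t_gt) * 100.0   # assumes translations in meters
    return rot_err_deg, trans_err_cm

def within_threshold(R_pred, t_pred, R_gt, t_gt, deg=5.0, cm=2.0):
    """True if the prediction satisfies, e.g., the (5 deg, 2 cm) criterion."""
    r, t = pose_errors(R_pred, t_pred, R_gt, t_gt)
    return r <= deg and t <= cm
```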
Training and Implementation
For the data processing stage of acquiring image patches and point clouds of the objects, only an existing object detection algorithm needs to be fine-tuned to detect the objects in the NOCS dataset. The 2D bounding boxes of the objects are fed into the SAM network as prompts to obtain the mask information. Finally, the image patch and the corresponding point cloud are extracted from the positions indicated by the mask in the original RGB image and depth map; the image patch is resized to a specified size, and 1024 point cloud points are sampled. For image feature extraction, a pre-trained ResNet34\cite{he2016deep} is used as the encoder combined with a four-level PSPNet\cite{Zhao_Shi_Qi_Wang_Jia_2017} as the decoder. For point cloud geometric feature extraction, 3D-GCN\cite{Lin_Huang_Wang_2020} is chosen as the feature extraction network. Between the layers of the decoders, the deep cross-modal feature fusion module is inserted to effectively fuse the features of the two modalities, image and point cloud. The fused features are then used to solve the 6D pose of the object in the same way as the gCasp method.
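The back-projection and sampling step can be sketched as follows, assuming a boolean mask, depth in meters, and the pinhole intrinsics $K$; it illustrates the procedure rather than reproducing our exact preprocessing code.

```python
import numpy as np

def masked_pointcloud(depth, mask, K, num_points=1024):
    """Back-project masked depth pixels into a camera-frame point cloud and
    randomly sample 1024 points (depth assumed in meters, K is the 3x3
    camera intrinsic matrix)."""
    ys, xs = np.nonzero(mask & (depth > 0))                # valid object pixels
    z = depth[ys, xs]
    x = (xs - K[0, 2]) * z / K[0, 0]                       # X = (u - cx) * Z / fx
    y = (ys - K[1, 2]) * z / K[1, 1]                       # Y = (v - cy) * Z / fy
    points = np.stack([x, y, z], axis=1)
    idx = np.random.choice(len(points), num_points,
                           replace=len(points) < num_points)
    return points[idx]
```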
Evaluation on Benchmark Datasets
We compare the proposed method with several state-of-the-art algorithms\cite{wang2019normalized, tian2020shape, chen2021sgpa, lin2022sar, ligenerative}. NOCS\cite{wang2019normalized} maps pixel points directly to a canonical space through a network; SPD\cite{tian2020shape} and SGPA\cite{chen2021sgpa} utilize shape priors and object deformation predictions to recover the shape of target objects; SAR\cite{lin2022sar} and gCasp\cite{ligenerative} both construct 3D models of objects using generative models and estimate object poses via spatial correspondences between point pairs. Table 1 shows the quantitative results on the CAMERA25 and REAL275 datasets. Overall, the data processing method proposed in this paper achieves superior IoU metrics on both datasets compared with the methods above. By accurately obtaining the target RGB images and point cloud information and combining them with the proposed pose estimation method, the pose estimation results for target objects outperform the other methods.
\begin{table}[h]\caption{Comparison of our method with five other SOTA methods\cite{wang2019normalized, tian2020shape, chen2021sgpa, lin2022sar, ligenerative} on REAL275 and CAMERA25.}\label{tab1}
\begin{tabular*}{\textwidth}{@{\extracolsep\fill}cccccccc}
\toprule
Dataset & Methods & IoU50 & IoU75 & $\mathbf{(5^{\circ},2cm)}$ & $\mathbf{(5^{\circ},5cm)}$ & $\mathbf{(10^{\circ},2cm)}$ & $\mathbf{(10^{\circ},5cm)}$ \\
\midrule
\multirow[m]{6}{*}{REAL275}
 & NOCS\cite{wang2019normalized} & $77.0$ & $30.1$ & $7.2$ & $10.0$ & $13.8$ & $25.2$ \\
 & SPD\cite{tian2020shape} & $76.3$ & $53.2$ & $19.3$ & $21.4$ & $43.2$ & $54.1$ \\
 & SGPA\cite{chen2021sgpa} & $79.6$ & $61.9$ & $35.9$ & $39.6$ & $61.3$ & $70.7$ \\
 & SAR\cite{lin2022sar} & $78.3$ & $62.4$ & $31.6$ & $42.3$ & $50.3$ & $68.3$ \\
 & gCasp\cite{ligenerative} & $78.0$ & $65.3$ & $46.9$ & $54.7$ & $64.2$ & $76.3$ \\
 & Ours & $\mathbf{83.0}$ & $\mathbf{76.9}$ & $\mathbf{53.6}$ & $\mathbf{60.7}$ & $\mathbf{72.6}$ & $\mathbf{81.7}$ \\
\midrule
\multirow[m]{5}{*}{CAMERA25}
 & NOCS\cite{wang2019normalized} & $83.9$ & $69.5$ & $32.3$ & $40.9$ & $48.2$ & $64.6$ \\
 & SGPA\cite{chen2021sgpa} & $93.2$ & $88.1$ & $70.7$ & $74.5$ & $82.7$ & $88.4$ \\
 & SAR\cite{lin2022sar} & $86.8$ & $79.0$ & $66.7$ & $70.9$ & $75.3$ & $80.3$ \\
 & gCasp\cite{ligenerative} & $95.7$ & $89.3$ & $71.7$ & $77.0$ & $80.1$ & $86.9$ \\
 & Ours & $\mathbf{96.2}$ & $\mathbf{91.7}$ & $\mathbf{74.8}$ & $\mathbf{79.2}$ & $\mathbf{83.0}$ & $\mathbf{88.3}$ \\
\botrule
\end{tabular*}
\end{table}
Fig. 5
Comparison of Segmentation Results between Our Method and Mask R-CNN. \label{Segment Results}
Fig. 6
Qualitative Comparison between Our Method and gCasp. \label{result compare}
Figure 5 presents a qualitative comparison of our proposed data processing method against the traditional Mask R-CNN\cite{he2017mask} on the real-world dataset REAL275. The results demonstrate that our method significantly improves the segmentation generalization capability for category-level datasets.
Figure 6 visually compares the pose estimation results on the real dataset between the proposed method in this paper and the benchmark network. It can be seen that through improvements in data processing and feature extraction, there are significant enhancements in both IoU and rotational and translational accuracy.
Ablation Study
To validate the effectiveness of our proposed method on the REAL275 dataset, we conducted a series of ablation studies. These experiments aim to evaluate the contribution of each component to the overall performance by progressively removing or modifying key elements of the method.
Effectiveness of the Preprocessing Module. Compared to previous category-level pose estimation methods, the combination of object detection and instance segmentation adopted in our approach demonstrates stronger generalization capabilities in open scenes. Specifically, while maintaining the backbone network for pose estimation unchanged, the data processing method proposed in this paper significantly improves the IoU metric. Detailed results are shown in Table 2.
\begin{table}[h]\caption{Comparison of Pose Estimation Results Using Different Data Processing Methods}\label{tab2}
\begin{tabular*}{\textwidth}{@{\extracolsep\fill}ccc}
\toprule
Data Processing Methods & IoU50 & IoU75 \\
\midrule
MaskRCNN & $78.0$ & $65.3$ \\
Ours & $84.9$ & $76.9$ \\
\botrule
\end{tabular*}
\end{table}
Effectiveness of the Fusion Module. The choice of fusion strategy has a significant impact on model performance. To validate the effectiveness of the fusion module, we designed four experiments, as shown in Table 3. In the data processing stage, we used either the traditional Mask R-CNN or our proposed instance segmentation module. Building upon gCasp\cite{ligenerative} as the backbone network in the feature fusion stage, we compared dense feature fusion with the cross-modal feature fusion proposed in this paper. The results demonstrate that using our two-stage approach in the data processing stage and cross-modal feature fusion in the feature fusion stage yields the highest pose estimation accuracy.
\begin{table}[h]\caption{Comparison of Pose Estimation Results Using Different Data Processing and Feature Fusion Methods}\label{tab3}
\begin{tabular*}{\textwidth}{@{\extracolsep\fill}cccccccc}
\toprule
\multicolumn{2}{c}{Data Processing Methods} & \multicolumn{2}{c}{Feature Fusion Methods} & \multicolumn{4}{c}{Results} \\
\hline
MaskRCNN & Ours & DenseFusion & Ours & $5^\circ2cm$ & $5^\circ5cm$ & $10^\circ2cm$ & $10^\circ5cm$ \\
\hline
\checkmark & & \checkmark & & $48.2$ & $56.1$ & $65.6$ & $78.2$ \\
\checkmark & & & \checkmark & $50.4$ & $57.8$ & $67.4$ & $79.2$ \\
 & \checkmark & \checkmark & & $49.1$ & $57.3$ & $66.2$ & $78.9$ \\
 & \checkmark & & \checkmark & $53.5$ & $60.7$ & $72.7$ & $81.6$ \\
\botrule
\end{tabular*}
\end{table}
Based on the data presented in Tables 2 and 3, we can draw the following conclusions: the data processing module primarily influences the IoU metric in object pose estimation. This module mainly provides information about the location and category of objects. The subsequent backbone network establishes accurate correspondences between the features extracted from pixel points and the 3D points in the 3D model within the object coordinate system to estimate the 6D pose of the object. Therefore, the data processing module and the cross-modal feature fusion module proposed in this paper contribute to significant improvements in object pose estimation, validating the effectiveness of our approach.
Conclusions
Existing category-level pose estimation methods have insufficient generalization capability for unseen objects within a category, so the image patches and point clouds acquired by the data processing module may be incomplete, missing, or noisy. As a result, existing methods cannot obtain accurate fused features to establish correct dense correspondences, producing incorrect pose estimates. To address this problem, we first propose a new data processing module that effectively combines two methods capable of handling few-shot objects, so that accurate image patches and point clouds can still be acquired even for unseen objects within the category. Second, we propose a deep cross-modal feature fusion module that, unlike the previous direct concatenation of RGB appearance and point cloud geometric features, reasons about the global semantic similarity between appearance and geometric information and implicitly aggregates the salient features of the two modalities, effectively overcoming the negative effects of missing and noisy input modal data. The obtained pixel-point fusion features are finally used to establish correspondences between the input point cloud and the generated 3D model of the object and to solve the object's 6D pose. We validate the effectiveness of the method by comparing it with several state-of-the-art methods on the public datasets REAL275 and CAMERA25.
Although the proposed method improves the generalization ability of the pose estimation network for unseen objects within a category, it performs poorly on symmetric objects, because symmetric objects present the same appearance features under different poses. In addition, the cross-modal feature fusion approach increases the complexity of the network, leading to a significant increase in training time and computational cost. Future work should therefore address the pose estimation of symmetric objects and the adoption of lightweight feature fusion modules.
\bibliography{sn-bibliography}
Author Contribution
C.T. and M.Z. made substantial contributions to the conception and design of the work; C.T. and X.S. drafted the work and revised it critically for important intellectual content; Y.Z. approved the version to be published; C.T., M.Z., Y.Z., and X.S. agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
References:
Gattullo, Michele and Scurati, Giulia Wally and Fiorentino, Michele and Uva, Antonio Emmanuele and Ferrise, Francesco and Bordegoni, Monica (2019) Towards augmented reality manuals for industry 4.0: A methodology. robotics and computer-integrated manufacturing 56: 276--286 Elsevier
Zhou, Jun and Chen, Kai and Xu, Linlin and Dou, Qi and Qin, Jing (2023) Deep fusion transformer network with weighted vector-wise keypoints voting for robust 6d object pose estimation. 13967--13977, Proceedings of the IEEE/CVF International Conference on Computer Vision
Cipresso, Pietro and Giglioli, Irene Alice Chicchi and Raya, Mariano Alca{\~n}iz and Riva, Giuseppe (2018) The past, present, and future of virtual and augmented reality research: a network and cluster analysis of the literature. Frontiers in psychology 9: 2086 Frontiers Media SA
Mousavian, Arsalan and Eppner, Clemens and Fox, Dieter (2019) 6-DOF GraspNet: Variational Grasp Generation for Object Manipulation. en-US, Oct, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 10.1109/iccv.2019.00299, http://dx.doi.org/10.1109/iccv.2019.00299
Tremblay, Jonathan and To, Thang and Sundaralingam, Balakumar and Xiang, Yu and Fox, Dieter and Birchfield, Stan (2018) Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790
Liu, Jian and Sun, Wei and Yang, Hui and Zeng, Zhiwen and Liu, Chongpei and Zheng, Jin and Liu, Xingyu and Rahmani, Hossein and Sebe, Nicu and Mian, Ajmal (2024) Deep Learning-Based Object Pose Estimation: A Comprehensive Survey. arXiv preprint arXiv:2405.07801
Lee, Taeyeop and Lee, Byeong-Uk and Shin, Inkyu and Choe, Jaesung and Shin, Ukcheol and Kweon, In So and Yoon, Kuk-Jin (2022) UDA-COPE: Unsupervised domain adaptation for category-level object pose estimation. 14891--14900, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Lin, Jiehong and Wei, Zewei and Li, Zhihao and Xu, Songcen and Jia, Kui and Li, Yuanqing (2021) Dualposenet: Category-level 6d object pose and size estimation using dual pose network with refined learning of pose consistency. 3560--3569, Proceedings of the IEEE/CVF International Conference on Computer Vision
Liu, Jianhui and Chen, Yukang and Ye, Xiaoqing and Qi, Xiaojuan (2023) Prior-free category-level pose estimation with implicit space transformation. IEEE International Conference on Computer Vision 2023 (02/10/2023-06/10/2023, Paris)
Wang, He and Sridhar, Srinath and Huang, Jingwei and Valentin, Julien and Song, Shuran and Guibas, Leonidas J (2019) Normalized object coordinate space for category-level 6d object pose and size estimation. 2642--2651, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Wang, Chen and Xu, Danfei and Zhu, Yuke and Mart{\'\i}n-Mart{\'\i}n, Roberto and Lu, Cewu and Fei-Fei, Li and Savarese, Silvio (2019) Densefusion: 6d object pose estimation by iterative dense fusion. 3343--3352, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Yu, Sheng and Zhai, Di-Hua and Xia, Yuanqing (2024) Catformer: Category-level 6d object pose estimation with transformer. 6808--6816, 7, 38, Proceedings of the AAAI Conference on Artificial Intelligence
He, Kaiming and Gkioxari, Georgia and Doll{\'a}r, Piotr and Girshick, Ross (2017) Mask r-cnn. 2961--2969, Proceedings of the IEEE international conference on computer vision
Cheng, Tianheng and Song, Lin and Ge, Yixiao and Liu, Wenyu and Wang, Xinggang and Shan, Ying (2024) Yolo-world: Real-time open-vocabulary object detection. 16901--16911, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C and Lo, Wan-Yen and others (2023) Segment anything. 4015--4026, Proceedings of the IEEE/CVF International Conference on Computer Vision
Tian, Meng and Ang, Marcelo H and Lee, Gim Hee (2020) Shape prior deformation for categorical 6d object pose and size estimation. Springer, 530--546, Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXI 16
Chen, Chun-Fu Richard and Fan, Quanfu and Panda, Rameswar (2021) Crossvit: Cross-attention multi-scale vision transformer for image classification. 357--366, Proceedings of the IEEE/CVF international conference on computer vision
Dosovitskiy, Alexey (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Deng, Xinke and Geng, Junyi and Bretl, Timothy and Xiang, Yu and Fox, Dieter (2022) iCaps: Iterative category-level object pose and shape estimation. IEEE Robotics and Automation Letters 7(2): 1784--1791 IEEE
Li, Guanglin and Li, Yifeng and Ye, Zhichao and Zhang, Qihang and Kong, Tao and Cui, Zhaopeng and Zhang, Guofeng (2022) Generative Category-Level Shape and Pose Estimation with Semantic Primitives. Conference on Robot Learning (CoRL)
Umeyama, S. (1991) Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence : 376 –380 https://doi.org/10.1109/34.88573, en-US, Apr, http://dx.doi.org/10.1109/34.88573
Hao, Zekun and Averbuch-Elor, Hadar and Snavely, Noah and Belongie, Serge (2020) Dualsdf: Semantic shape manipulation using a two-level representation. 7631--7641, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Tian, Meng and Ang, Marcelo H. and Lee, Gim Hee (2020) Shape Prior Deformation for Categorical 6D Object Pose and Size Estimation. en-US, 530 –546, Jan, Computer Vision – ECCV 2020,Lecture Notes in Computer Science, 10.1007/978-3-030-58589-1_32, http://dx.doi.org/10.1007/978-3-030-58589-1_32
Chen, Kai and Dou, Qi (2021) Sgpa: Structure-guided prior adaptation for category-level 6d object pose estimation. 2773--2782, Proceedings of the IEEE/CVF International Conference on Computer Vision
Lin, Haitao and Liu, Zichang and Cheang, Chilam and Fu, Yanwei and Guo, Guodong and Xue, Xiangyang (2022) Sar-net: Shape alignment and recovery network for category-level 6d object pose and size estimation. 6707--6717, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Peng, Sida and Liu, Yuan and Huang, Qixing and Zhou, Xiaowei and Bao, Hujun (2019) Pvnet: Pixel-wise voting network for 6dof pose estimation. 4561--4570, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wu, Yangzheng and Javaheri, Alireza and Zand, Mohsen and Greenspan, Michael (2022) Keypoint cascade voting for point cloud based 6DoF pose estimation. IEEE, 176--186, 2022 International Conference on 3D Vision (3DV)
Sundermeyer, Martin and Marton, Zoltan-Csaba and Durner, Maximilian and Brucker, Manuel and Triebel, Rudolph (2018) Implicit 3D Orientation Learning for 6D Object Detection from RGB Images. en-US, 712 –729, Jan, Computer Vision – ECCV 2018,Lecture Notes in Computer Science, 10.1007/978-3-030-01231-1_43, http://dx.doi.org/10.1007/978-3-030-01231-1_43
Li, Hongyang and Lin, Jiehong and Jia, Kui (2022) DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation. en-US, Oct
Li, Zhigang and Wang, Gu and Ji, Xiangyang (2019) CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation. en-US, Oct, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 10.1109/iccv.2019.00777, http://dx.doi.org/10.1109/iccv.2019.00777
Liu, Xingyu and Iwase, Shun and Kitani, Kris M. (2021) KDFNet: Learning Keypoint Distance Field for 6D Object Pose Estimation. en-US, Sep, 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 10.1109/iros51168.2021.9636489, http://dx.doi.org/10.1109/iros51168.2021.9636489
Tian, Meng and Pan, Liang and Ang, MarceloH. and Lee, GimHee (2020) Robust 6D Object Pose Estimation by Learning RGB-D Features. Cornell University - arXiv,Cornell University - arXiv https://doi.org/10.1109/icra40945.2020.9197555, en-US, May
Gao, Ge and Lauri, Mikko and Wang, Yulong and Hu, Xiaolin and Zhang, Jianwei and Frintrop, Simone (2020) 6D Object Pose Regression via Supervised Learning on Point Clouds. en-US, May, 2020 IEEE International Conference on Robotics and Automation (ICRA), 10.1109/icra40945.2020.9197461, http://dx.doi.org/10.1109/icra40945.2020.9197461
Kleeberger, Kilian and Huber, MarcoF. (2020) Single Shot 6D Object Pose Estimation. Cornell University - arXiv,Cornell University - arXiv en-US, Apr
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian (2016) Deep residual learning for image recognition. 770--778, Proceedings of the IEEE conference on computer vision and pattern recognition
Lin, Zhi-Hao and Huang, Sheng-Yu and Wang, Yu-Chiang Frank (2020) Convolution in the Cloud: Learning Deformable Kernels in 3D Graph Convolution Networks for Point Cloud Analysis. en-US, Jun, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr42600.2020.00187, http://dx.doi.org/10.1109/cvpr42600.2020.00187
Hao, Zekun and Averbuch-Elor, Hadar and Snavely, Noah and Belongie, Serge (2020) DualSDF: Semantic Shape Manipulation using a Two-Level Representation. Cornell University - arXiv,Cornell University - arXiv en-US, Apr
Taud, Hind and Mas, Jean-Fran{\c c}ois (2018) Multilayer perceptron (MLP). Geomatic approaches for modeling land change scenarios : 451--455 Springer
Zhao, Hengshuang and Shi, Jianping and Qi, Xiaojuan and Wang, Xiaogang and Jia, Jiaya (2017) Pyramid Scene Parsing Network. en-US, Jul, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr.2017.660, http://dx.doi.org/10.1109/cvpr.2017.660
He, Yisheng and Sun, Wei and Huang, Haibin and Liu, Jianran and Fan, Haoqiang and Sun, Jian (2019) PVN3D: A Deep Point-wise 3D Keypoints Voting Network for 6DoF Pose Estimation. arXiv: Computer Vision and Pattern Recognition,arXiv: Computer Vision and Pattern Recognition en-US, Nov
He, Yisheng and Huang, Haibin and Fan, Haoqiang and Chen, Qifeng and Sun, Jian (2021) FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation. en-US, Jun, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr46437.2021.00302, http://dx.doi.org/10.1109/cvpr46437.2021.00302
Peng, Sida and Liu, Yuan and Huang, Qixing and Zhou, Xiaowei and Bao, Hujun (2019) PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation. en-US, Jun, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr.2019.00469, http://dx.doi.org/10.1109/cvpr.2019.00469
Lin, Tsung-Yi and Dollar, Piotr and Girshick, Ross and He, Kaiming and Hariharan, Bharath and Belongie, Serge (2017) Feature Pyramid Networks for Object Detection. en-US, Jul, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr.2017.106, http://dx.doi.org/10.1109/cvpr.2017.106
Liu, Shu and Qi, Lu and Qin, Haifang and Shi, Jianping and Jia, Jiaya (2018) Path Aggregation Network for Instance Segmentation. en-US, Jun, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10.1109/cvpr.2018.00913, http://dx.doi.org/10.1109/cvpr.2018.00913
Vaswani, A (2017) Attention is all you need. Advances in Neural Information Processing Systems
Liu, Mengyu and Yin, Hujun (2019) Cross attention network for semantic segmentation. IEEE, 2434--2438, 2019 IEEE International Conference on Image Processing (ICIP)
Tancik, Matthew and Srinivasan, PratulP. and Mildenhall, Ben and Fridovich-Keil, Sara and Raghavan, Nithin and Singhal, Utkarsh and Ramamoorthi, Ravi and Barron, JonathanT. and Ng, Ren (2020) Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. arXiv: Computer Vision and Pattern Recognition,arXiv: Computer Vision and Pattern Recognition en-US, Jun
He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Dollar, Piotr and Girshick, Ross (2022) Masked Autoencoders Are Scalable Vision Learners. en-US, Jun, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr52688.2022.01553, http://dx.doi.org/10.1109/cvpr52688.2022.01553
Song, Yiwei and Tang, Chunhui (2024) A RGB-D feature fusion network for occluded object 6D pose estimation. Signal, Image and Video Processing 18(8-9): 6309-6319