Introduction
Category-level 6D pose estimation focuses on recovering an object's 3D translation and 3D rotation with respect to a camera. This technique plays a crucial role in several domains, such as robotic arm manipulation\cite{Mousavian_Eppner_Fox_2019}, augmented reality\cite{gattullo2019towards}, and virtual reality\cite{cipresso2018past}. Early research efforts focused on instance-level 6D pose estimation\cite{He_Sun_Huang_Liu_Fan_Sun_2019, He_Huang_Fan_Chen_Sun_2021, zhou2023deep, 2024A}. However, such methods generalize poorly to other instances of the same category, which limits their application. In recent years, scholars have therefore turned to category-level pose estimation, which handles unseen objects of a known category more effectively and has consequently attracted widespread attention.
Category-level pose estimation aims to train a network on a limited sample of objects from different categories so that it can predict the 6D poses and sizes of unseen objects in the same categories. This approach has demonstrated excellent generalization performance on the pose estimation task\cite{wang2019normalized}. Existing category-level pose estimation methods based on RGB-D images usually adopt a two-stage process. First, an object-specific mask is extracted from the image using instance segmentation techniques\cite{he2017mask}; the RGB image, the depth map, and the mask are then combined to generate the corresponding image patches and point clouds, which serve as inputs to the pose estimation module. In the second stage, the pose estimation module combines the features of each input point to establish correspondences between the input points and 3D points in the Normalized Object Coordinate Space (NOCS)\cite{wang2019normalized}, from which the object's 6D pose is solved accurately\cite{chen2021sgpa, wang2019normalized, liu2023prior, lin2022sar}.
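To make the second stage concrete, the standard correspondence-based pose solve can be sketched as follows: masked depth pixels are back-projected into a camera-space point cloud, and the predicted NOCS coordinates are aligned to it with Umeyama's closed-form similarity transform\cite{ref1}. This is a minimal illustrative sketch, not any specific method's implementation; the pinhole intrinsics and function names are our own assumptions.

```python
import numpy as np

def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked, valid depth pixels to a camera-space point cloud (N, 3)."""
    v, u = np.nonzero(mask & (depth > 0))
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def umeyama(src, dst):
    """Closed-form similarity transform: find s, R, t with dst ~ s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    var_s = ((src - mu_s) ** 2).sum() / len(src)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)   # cross-covariance (3, 3)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:     # guard against reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Here `src` would hold the predicted NOCS coordinates and `dst` the back-projected observed points, so the recovered scale directly gives the object size.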
Although previous approaches have improved the accuracy of pose estimation, they have typically not adequately considered the quality of the input data, particularly the challenges posed by missing and noisy modal data. For example, Fig. 1 illustrates defective mask images of a laptop and a camera caused by incomplete masks generated during instance segmentation. Such defects make the point cloud reconstructed from the depth information incomplete and noisy, seriously affecting subsequent processing. In the feature fusion stage, previous methods directly combine RGB features with geometric features and thus fail to handle the spatial feature perturbation triggered by missing data and noise, which directly degrades the accuracy of pose estimation and produces erroneous results.
In this paper, we aim to mitigate the negative impact of missing input modal data and noise on pose estimation by introducing innovative data processing and feature fusion techniques. In the data processing stage, we employ an object detection algorithm for open scenes to automatically obtain the location of each object in the image and use it as an input prompt to Segment Anything (SAM)\cite{kirillov2023segment}. This approach achieves instance segmentation without manually specifying the segmentation region and accurately extracts masks even for previously unseen objects within a category. In the feature fusion stage, we propose a novel cross-modal feature fusion method. The method fully exploits the intrinsic correlation between the two input modalities and implicitly integrates relevant features from each, thus effectively mitigating the spatial feature perturbation caused by missing data and noise. This cross-modal fusion strategy enhances the model's understanding and utilization of multi-source information, improving the robustness and accuracy of pose estimation and ensuring reliable results even under complex and suboptimal conditions.
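The data-processing stage described above can be sketched as a small pipeline in which detector boxes serve as segmentation prompts. The `detect` and `segment_with_box` callables below are illustrative stand-ins for an open-scene detector and SAM's box-prompt interface, not their actual APIs:

```python
def auto_segment(image, detect, segment_with_box):
    """Prompt-free instance segmentation: each detected bounding box is
    passed to a promptable segmenter as a box prompt, so no region has
    to be specified manually."""
    results = []
    for box, label in detect(image):           # box = (x0, y0, x1, y1)
        mask = segment_with_box(image, box)    # binary mask for this object
        results.append({"label": label, "box": box, "mask": mask})
    return results
```

The resulting per-object masks are exactly what the first stage of the pipeline needs before cropping image patches and lifting depth to point clouds.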
1. We propose a deep cross-modal feature fusion module. By reasoning about the global semantic similarity between appearance and geometric information, the module implicitly aggregates significant features of both modalities, effectively overcoming the adverse effects of missing and noisy modal data.
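As an illustration of the idea behind this module, a single cross-attention step between geometric and appearance tokens can be sketched as follows. The shapes, weight matrices, and final concatenation are our own illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(geo_feat, rgb_feat, Wq, Wk, Wv):
    """Each geometric token attends over all appearance tokens, so points
    corrupted by missing or noisy depth can borrow evidence from RGB context."""
    q = geo_feat @ Wq                                # queries from geometry (N, d)
    k = rgb_feat @ Wk                                # keys from appearance  (M, d)
    v = rgb_feat @ Wv                                # values from appearance (M, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # global semantic similarity (N, M)
    fused = attn @ v                                 # aggregated appearance per point
    return np.concatenate([geo_feat, fused], axis=-1)
```

Because the attention weights are computed from global similarity rather than pixel-to-point alignment, the aggregation is implicit: no hard correspondence between image pixels and 3D points is required.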
References:
Gattullo, Michele and Scurati, Giulia Wally and Fiorentino, Michele and Uva, Antonio Emmanuele and Ferrise, Francesco and Bordegoni, Monica (2019) Towards augmented reality manuals for industry 4.0: A methodology. Robotics and Computer-Integrated Manufacturing 56: 276--286 Elsevier
Zhou, Jun and Chen, Kai and Xu, Linlin and Dou, Qi and Qin, Jing (2023) Deep fusion transformer network with weighted vector-wise keypoints voting for robust 6d object pose estimation. 13967--13977, Proceedings of the IEEE/CVF International Conference on Computer Vision
Cipresso, Pietro and Giglioli, Irene Alice Chicchi and Raya, Mariano Alca{\~n}iz and Riva, Giuseppe (2018) The past, present, and future of virtual and augmented reality research: a network and cluster analysis of the literature. Frontiers in Psychology 9: 2086 Frontiers Media SA
Mousavian, Arsalan and Eppner, Clemens and Fox, Dieter (2019) 6-DOF GraspNet: Variational Grasp Generation for Object Manipulation. en-US, Oct, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 10.1109/iccv.2019.00299, http://dx.doi.org/10.1109/iccv.2019.00299
Tremblay, Jonathan and To, Thang and Sundaralingam, Balakumar and Xiang, Yu and Fox, Dieter and Birchfield, Stan (2018) Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790
Liu, Jian and Sun, Wei and Yang, Hui and Zeng, Zhiwen and Liu, Chongpei and Zheng, Jin and Liu, Xingyu and Rahmani, Hossein and Sebe, Nicu and Mian, Ajmal (2024) Deep Learning-Based Object Pose Estimation: A Comprehensive Survey. arXiv preprint arXiv:2405.07801
Lee, Taeyeop and Lee, Byeong-Uk and Shin, Inkyu and Choe, Jaesung and Shin, Ukcheol and Kweon, In So and Yoon, Kuk-Jin (2022) UDA-COPE: Unsupervised domain adaptation for category-level object pose estimation. 14891--14900, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Lin, Jiehong and Wei, Zewei and Li, Zhihao and Xu, Songcen and Jia, Kui and Li, Yuanqing (2021) Dualposenet: Category-level 6d object pose and size estimation using dual pose network with refined learning of pose consistency. 3560--3569, Proceedings of the IEEE/CVF International Conference on Computer Vision
Liu, Jianhui and Chen, Yukang and Ye, Xiaoqing and Qi, Xiaojuan (2023) Prior-free category-level pose estimation with implicit space transformation. IEEE International Conference on Computer Vision 2023 (02/10/2023-06/10/2023, Paris)
Wang, He and Sridhar, Srinath and Huang, Jingwei and Valentin, Julien and Song, Shuran and Guibas, Leonidas J (2019) Normalized object coordinate space for category-level 6d object pose and size estimation. 2642--2651, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Wang, Chen and Xu, Danfei and Zhu, Yuke and Mart{\'\i}n-Mart{\'\i}n, Roberto and Lu, Cewu and Fei-Fei, Li and Savarese, Silvio (2019) Densefusion: 6d object pose estimation by iterative dense fusion. 3343--3352, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Yu, Sheng and Zhai, Di-Hua and Xia, Yuanqing (2024) Catformer: Category-level 6d object pose estimation with transformer. 6808--6816, 7, 38, Proceedings of the AAAI Conference on Artificial Intelligence
He, Kaiming and Gkioxari, Georgia and Doll{\'a}r, Piotr and Girshick, Ross (2017) Mask r-cnn. 2961--2969, Proceedings of the IEEE international conference on computer vision
Cheng, Tianheng and Song, Lin and Ge, Yixiao and Liu, Wenyu and Wang, Xinggang and Shan, Ying (2024) Yolo-world: Real-time open-vocabulary object detection. 16901--16911, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C and Lo, Wan-Yen and others (2023) Segment anything. 4015--4026, Proceedings of the IEEE/CVF International Conference on Computer Vision
Tian, Meng and Ang, Marcelo H and Lee, Gim Hee (2020) Shape prior deformation for categorical 6d object pose and size estimation. Springer, 530--546, Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXI 16
Chen, Chun-Fu Richard and Fan, Quanfu and Panda, Rameswar (2021) Crossvit: Cross-attention multi-scale vision transformer for image classification. 357--366, Proceedings of the IEEE/CVF international conference on computer vision
Dosovitskiy, Alexey (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Deng, Xinke and Geng, Junyi and Bretl, Timothy and Xiang, Yu and Fox, Dieter (2022) iCaps: Iterative category-level object pose and shape estimation. IEEE Robotics and Automation Letters 7(2): 1784--1791 IEEE
Li, Guanglin and Zhang, Yifeng and Ye, Zhichao and Li, Qihang and Kong, Tao and Cui, Zhaopeng and Zhang, Guofeng (2022) Generative Category-Level Shape and Pose Estimation with Semantic Primitives. Conference on Robot Learning (CoRL)
Umeyama, S. (1991) Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence: 376--380 https://doi.org/10.1109/34.88573, en-US, Apr, http://dx.doi.org/10.1109/34.88573
Hao, Zekun and Averbuch-Elor, Hadar and Snavely, Noah and Belongie, Serge (2020) Dualsdf: Semantic shape manipulation using a two-level representation. 7631--7641, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Tian, Meng and Ang, Marcelo H. and Lee, Gim Hee (2020) Shape Prior Deformation for Categorical 6D Object Pose and Size Estimation. en-US, 530--546, Jan, Computer Vision -- ECCV 2020, Lecture Notes in Computer Science, 10.1007/978-3-030-58589-1_32, http://dx.doi.org/10.1007/978-3-030-58589-1_32
Chen, Kai and Dou, Qi (2021) Sgpa: Structure-guided prior adaptation for category-level 6d object pose estimation. 2773--2782, Proceedings of the IEEE/CVF International Conference on Computer Vision
Lin, Haitao and Liu, Zichang and Cheang, Chilam and Fu, Yanwei and Guo, Guodong and Xue, Xiangyang (2022) Sar-net: Shape alignment and recovery network for category-level 6d object pose and size estimation. 6707--6717, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Peng, Sida and Liu, Yuan and Huang, Qixing and Zhou, Xiaowei and Bao, Hujun (2019) Pvnet: Pixel-wise voting network for 6dof pose estimation. 4561--4570, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wu, Yangzheng and Javaheri, Alireza and Zand, Mohsen and Greenspan, Michael (2022) Keypoint cascade voting for point cloud based 6DoF pose estimation. IEEE, 176--186, 2022 International Conference on 3D Vision (3DV)
Sundermeyer, Martin and Marton, Zoltan-Csaba and Durner, Maximilian and Brucker, Manuel and Triebel, Rudolph (2018) Implicit 3D Orientation Learning for 6D Object Detection from RGB Images. en-US, 712--729, Jan, Computer Vision -- ECCV 2018, Lecture Notes in Computer Science, 10.1007/978-3-030-01231-1_43, http://dx.doi.org/10.1007/978-3-030-01231-1_43
Li, Hongyang and Lin, Jiehong and Jia, Kui (2022) DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation. en-US, Oct
Li, Zhigang and Wang, Gu and Ji, Xiangyang (2019) CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation. en-US, Oct, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 10.1109/iccv.2019.00777, http://dx.doi.org/10.1109/iccv.2019.00777
Liu, Xingyu and Iwase, Shun and Kitani, Kris M. (2021) KDFNet: Learning Keypoint Distance Field for 6D Object Pose Estimation. en-US, Sep, 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 10.1109/iros51168.2021.9636489, http://dx.doi.org/10.1109/iros51168.2021.9636489
Tian, Meng and Pan, Liang and Ang, Marcelo H. and Lee, Gim Hee (2020) Robust 6D Object Pose Estimation by Learning RGB-D Features. 2020 IEEE International Conference on Robotics and Automation (ICRA), https://doi.org/10.1109/icra40945.2020.9197555, en-US, May
Gao, Ge and Lauri, Mikko and Wang, Yulong and Hu, Xiaolin and Zhang, Jianwei and Frintrop, Simone (2020) 6D Object Pose Regression via Supervised Learning on Point Clouds. en-US, May, 2020 IEEE International Conference on Robotics and Automation (ICRA), 10.1109/icra40945.2020.9197461, http://dx.doi.org/10.1109/icra40945.2020.9197461
Kleeberger, Kilian and Huber, Marco F. (2020) Single Shot 6D Object Pose Estimation. arXiv preprint, en-US, Apr
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian (2016) Deep residual learning for image recognition. 770--778, Proceedings of the IEEE conference on computer vision and pattern recognition
Lin, Zhi-Hao and Huang, Sheng-Yu and Wang, Yu-Chiang Frank (2020) Convolution in the Cloud: Learning Deformable Kernels in 3D Graph Convolution Networks for Point Cloud Analysis. en-US, Jun, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr42600.2020.00187, http://dx.doi.org/10.1109/cvpr42600.2020.00187
Hao, Zekun and Averbuch-Elor, Hadar and Snavely, Noah and Belongie, Serge (2020) DualSDF: Semantic Shape Manipulation using a Two-Level Representation. arXiv preprint, en-US, Apr
Taud, Hind and Mas, Jean-Franccois (2018) Multilayer perceptron (MLP). Geomatic approaches for modeling land change scenarios : 451--455 Springer
Zhao, Hengshuang and Shi, Jianping and Qi, Xiaojuan and Wang, Xiaogang and Jia, Jiaya (2017) Pyramid Scene Parsing Network. en-US, Jul, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr.2017.660, http://dx.doi.org/10.1109/cvpr.2017.660
He, Yisheng and Sun, Wei and Huang, Haibin and Liu, Jianran and Fan, Haoqiang and Sun, Jian (2019) PVN3D: A Deep Point-wise 3D Keypoints Voting Network for 6DoF Pose Estimation. arXiv preprint, en-US, Nov
He, Yisheng and Huang, Haibin and Fan, Haoqiang and Chen, Qifeng and Sun, Jian (2021) FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation. en-US, Jun, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr46437.2021.00302, http://dx.doi.org/10.1109/cvpr46437.2021.00302
Peng, Sida and Liu, Yuan and Huang, Qixing and Zhou, Xiaowei and Bao, Hujun (2019) PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation. en-US, Jun, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr.2019.00469, http://dx.doi.org/10.1109/cvpr.2019.00469
Lin, Tsung-Yi and Dollar, Piotr and Girshick, Ross and He, Kaiming and Hariharan, Bharath and Belongie, Serge (2017) Feature Pyramid Networks for Object Detection. en-US, Jul, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr.2017.106, http://dx.doi.org/10.1109/cvpr.2017.106
Liu, Shu and Qi, Lu and Qin, Haifang and Shi, Jianping and Jia, Jiaya (2018) Path Aggregation Network for Instance Segmentation. en-US, Jun, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10.1109/cvpr.2018.00913, http://dx.doi.org/10.1109/cvpr.2018.00913
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia (2017) Attention is all you need. Advances in Neural Information Processing Systems
Liu, Mengyu and Yin, Hujun (2019) Cross attention network for semantic segmentation. IEEE, 2434--2438, 2019 IEEE International Conference on Image Processing (ICIP)
Tancik, Matthew and Srinivasan, Pratul P. and Mildenhall, Ben and Fridovich-Keil, Sara and Raghavan, Nithin and Singhal, Utkarsh and Ramamoorthi, Ravi and Barron, Jonathan T. and Ng, Ren (2020) Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. arXiv preprint, en-US, Jun
He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Dollar, Piotr and Girshick, Ross (2022) Masked Autoencoders Are Scalable Vision Learners. en-US, Jun, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10.1109/cvpr52688.2022.01553, http://dx.doi.org/10.1109/cvpr52688.2022.01553
Song, Yiwei and Tang, Chunhui (2024) A RGB-D feature fusion network for occluded object 6D pose estimation. Signal, Image and Video Processing 18(8-9): 6309-6319