IROS 2024

3D Object Detection Via Stereo Pyramid Transformers with Rich Semantic Feature Fusion

Rongqi Gu, Chu Yang, Yaohan Lu, Peigen Liu, Fei Wu, Guang Chen

Abstract

Camera-based 3D object detectors are prized for their broader applicability and lower cost compared with LiDAR-based systems, but they still grapple with the inherently ill-posed nature of depth extraction from images. In this work, we present an approach that employs a transformer-based backbone and a fused geometry volume to enrich features and improve detection accuracy. First, we propose the Stereo Pyramid Transformer backbone to extract features from stereo images; it captures global information and establishes cross-image semantic connections. Then, to tackle the challenge posed by small-baseline binocular cameras, we propose to fuse stereo geometry volumes constructed by Stereo Plane Sweeping Volume (SPSV), Monocular Semantic Volume (MSV), and Lifted Volume (LV) to create finely detailed feature volumes. In extensive experiments on KITTI and on our own datasets, our approach surpasses all existing transformer-based stereo 3D detection methods and, for the first time, achieves performance comparable to the leading CNN-based 3D detectors.
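
As a rough illustration of why a small baseline makes stereo depth hard, the standard pinhole stereo relation z = f·B/d implies a first-order depth uncertainty of about z²/(f·B) per pixel of disparity error. The sketch below is not the paper's method. The function names, focal length, and baseline values are illustrative assumptions, not taken from the paper or from any specific dataset.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Pinhole stereo relation: z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(focal_px, baseline_m, depth_m, disparity_error_px=1.0):
    """First-order depth uncertainty: dz ~= z**2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


if __name__ == "__main__":
    focal_px = 720.0  # illustrative focal length in pixels (assumption)
    # Illustrative wide vs. small baselines in metres (assumptions).
    for baseline_m in (0.54, 0.12):
        err = depth_error(focal_px, baseline_m, depth_m=40.0)
        print(f"baseline={baseline_m:.2f} m: "
              f"about {err:.2f} m depth error per pixel at 40 m")
```

Under these assumed values, the same one-pixel disparity error maps to a several times larger depth error for the smaller baseline, which is the kind of ambiguity that additional monocular and lifted cues are meant to compensate for.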

Index terms

Object Detection, Segmentation and Categorization; Computer Vision for Transportation; Deep Learning Methods