ICRA 2024

CVFormer: Learning Circum-View Representation and Consistency Constraints for Vision-Based Occupancy Prediction Via Transformers

Zhengqi Bai, Wenjun Shi, Dongchen Zhu, HanLong Kang, Guanghui Zhang, Gang Ye, Yang Xiao, Lei Wang, Xiaolin Zhang, Bo Li, Jiamao Li

Abstract

With the increasing demands for perception accuracy in autonomous driving, there is a growing focus on fine-grained 3D semantic occupancy prediction. Effectively representing detailed three-dimensional scenes has become a significant challenge in the development of this task. In this paper, we present a novel transformer-based framework named CVFormer, which leverages two-dimensional circum-views from the ego vehicle to extract three-dimensional features of the surrounding environment. Circum-views provide a novel solution for effectively representing dense and fine-grained scenes. Specifically, a multi-attention module, CTMA, is designed to fuse temporal features from circum-views, fully exploiting the spatiotemporal correlations between frames and capturing more comprehensive clues. Furthermore, a novel 2D projection constraint is established by observing objects from different perspective directions, and multiple 3D constraints based on object invariance and semantic consistency are also applied to supervise the network, which enhances its scene understanding. Experimental results on the nuScenes dataset demonstrate that the proposed CVFormer clearly outperforms existing methods for occupancy prediction.

Index terms

Deep Learning for Visual Perception; Computer Vision for Transportation; Computer Vision for Automation