CVFormer: Learning Circum-View Representation and Consistency Constraints for Vision-Based Occupancy Prediction Via Transformers
Zhengqi Bai, Wenjun Shi, Dongchen Zhu, HanLong Kang, Guanghui Zhang, Gang Ye, Yang Xiao, Lei Wang, Xiaolin Zhang, Bo Li, Jiamao Li
Abstract
With the increasing demands for perception accuracy in autonomous driving, there is a growing focus on fine-grained 3D semantic occupancy prediction. Effectively representing detailed three-dimensional scenes has become a significant challenge for this task. In this paper, we present a novel transformer-based framework named CVFormer, which leverages two-dimensional circum-views from the ego vehicle to extract three-dimensional features of the surrounding environment. Circum-views offer a novel way to represent dense, fine-grained scenes effectively. Specifically, a multi-attention module, CTMA, is designed to fuse temporal features from circum-views, exploiting the spatiotemporal correlations between frames and capturing more comprehensive cues. Furthermore, a novel 2D projection constraint is established by observing objects from different perspective directions, and multiple 3D constraints based on object invariance and semantic consistency are also imposed to supervise the network, which improves its scene understanding. Experimental results on the nuScenes dataset demonstrate that the proposed CVFormer outperforms existing methods for occupancy prediction.