IROS 2024

WidthFormer: Toward Efficient Transformer-Based BEV View Transformation

Chenhongyi Yang, Tianwei Lin, Lichao Huang, Elliot J. Crowley

Abstract

We present WidthFormer, a novel transformer-based module to compute Bird’s-Eye-View (BEV) representations from multi-view cameras for real-time autonomous-driving applications. WidthFormer is computationally efficient, robust, and does not require any special engineering effort to deploy. We first introduce a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information, which enables our model to compute high-quality BEV representations with only a single transformer decoder layer. This mechanism is also beneficial for existing sparse 3D object detectors. Inspired by recently proposed works, we further improve our model’s efficiency by vertically compressing the image features when they serve as attention keys and values, and we develop two modules to compensate for the potential information loss due to feature compression. Experimental evaluation on the widely used nuScenes 3D object detection benchmark demonstrates that our method outperforms previous approaches across different 3D detection architectures. More importantly, our model is highly efficient. For example, when using 256 × 704 input images, it achieves 1.5 ms and 2.8 ms latency on an NVIDIA 3090 GPU and the Horizon Journey-5 computation solution, respectively. Furthermore, WidthFormer also exhibits strong robustness to different degrees of camera perturbations. Our study offers valuable insights into the deployment of BEV transformation methods in real-world, complex road environments. Code is available at https://github.com/ChenhongyiYang/WidthFormer.
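The vertical compression described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes mean pooling over the height axis as the compression step, uses plain single-head attention, and omits the 3D positional encoding, the two compensation modules, and multi-view handling. All names and sizes (`compress_vertically`, `cross_attention`, `C`, `H`, `W`, `num_bev`) are hypothetical.

```python
import numpy as np

def compress_vertically(feats):
    """Collapse the height axis of (C, H, W) image features into (W, C) tokens.

    Assumption: mean pooling, chosen here for illustration only.
    """
    return feats.mean(axis=1).T  # (C, W) -> (W, C)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: (Q, C); keys and values: (N, C). Returns output (Q, C) and weights (Q, N).
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])  # (Q, N)
    scores -= scores.max(axis=-1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over keys
    return weights @ values, weights

rng = np.random.default_rng(0)
C, H, W = 8, 16, 44   # hypothetical feature-map size
num_bev = 10          # hypothetical number of BEV queries

image_feats = rng.standard_normal((C, H, W))
bev_queries = rng.standard_normal((num_bev, C))

# Keys and values shrink from H * W tokens to W tokens per image.
width_feats = compress_vertically(image_feats)
bev_out, attn = cross_attention(bev_queries, width_feats, width_feats)
print(bev_out.shape, attn.shape)  # (10, 8) (10, 44)
```

In this sketch each BEV query attends to 44 width tokens instead of 16 × 44 = 704 image tokens, which is the kind of saving that compressing keys and values along the vertical axis is meant to provide.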

Index terms

Deep Learning for Visual Perception; Computer Vision for Automation; Object Detection, Segmentation and Categorization