A Point-Based Approach to Efficient LiDAR Multi-Task Perception
Christopher Lang, Alexander Braun, Lars Schillingmann, Abhinav Valada
Abstract
Multi-task perception networks hold great potential, as they can improve both performance and computational efficiency compared to their single-task counterparts, facilitating online deployment. However, current multi-task architectures in point cloud perception combine multiple task-specific point cloud representations, each requiring a separate feature encoder, which makes the network significantly larger and slower. In this work, we propose PAttFormer, an efficient multi-task learning architecture for joint semantic segmentation and object detection in point clouds that relies only on a point-based representation. The network builds on transformer-based feature encoders using neighborhood attention and grid-pooling, complemented with a query-based detection decoder using a novel 3D deformable-attention detection head topology. Unlike other LiDAR-based multi-task architectures, our proposed PAttFormer does not require separate feature encoders for multiple task-specific point cloud representations, resulting in a network that is 3× smaller and 1.4× faster while achieving competitive performance on the nuScenes and KITTI benchmarks for autonomous driving perception. We perform extensive evaluations that show substantial improvement from multi-task learning, achieving +1.7% in mIoU for LiDAR semantic segmentation and +1.7% in mAP for 3D object detection on the nuScenes benchmark compared to the single-task models.
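To make the grid-pooling step mentioned above concrete, the following is a minimal NumPy sketch of how points can be pooled into grid cells. It is an illustration under stated assumptions, not the PAttFormer implementation: the function name `grid_pool`, the `cell_size` parameter, and the choice of mean-pooling coordinates and max-pooling features are all assumptions made for this example.

```python
import numpy as np


def grid_pool(points, feats, cell_size):
    """Pool points that fall into the same grid cell (illustrative sketch).

    points: (N, 3) float array of xyz coordinates.
    feats:  (N, C) float array of per-point features.
    Returns pooled coordinates (M, 3), pooled features (M, C), and the
    (N,) index of the cell that each input point was assigned to.
    """
    # Integer cell coordinates for every point.
    cells = np.floor(points / cell_size).astype(np.int64)

    # One row per occupied cell; `inverse` maps each point to its cell row.
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    num_cells = int(inverse.max()) + 1

    # Assumption: coordinates are mean-pooled within each cell.
    counts = np.bincount(inverse, minlength=num_cells).astype(points.dtype)
    pooled_xyz = np.zeros((num_cells, 3), dtype=points.dtype)
    np.add.at(pooled_xyz, inverse, points)
    pooled_xyz /= counts[:, None]

    # Assumption: features are max-pooled within each cell.
    pooled_feats = np.full((num_cells, feats.shape[1]), -np.inf, dtype=feats.dtype)
    np.maximum.at(pooled_feats, inverse, feats)

    return pooled_xyz, pooled_feats, inverse


# Toy input: two points share a cell, the third lies in a neighboring cell.
points = np.array([[0.1, 0.1, 0.1], [0.4, 0.2, 0.3], [1.2, 0.1, 0.1]])
feats = np.array([[1.0, 5.0], [3.0, 2.0], [7.0, 0.0]])
pooled_xyz, pooled_feats, inverse = grid_pool(points, feats, cell_size=1.0)
```

The returned `inverse` index records which cell each input point fell into, so coarse per-cell features can be copied back to the original points with `pooled_feats[inverse]`, which is one simple way to recover the per-point outputs that semantic segmentation needs.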