ICRA 2026

DRIFT: Dual-Representation Inter-Fusion Transformer for Automated Driving Perception with 4D Radar Point Clouds

Siqi Pei, Andras Palffy, Dariu Gavrila

AI summary

DRIFT outperforms existing baselines on 4D radar perception tasks by effectively fusing local point-level and global pillar-level features through a novel dual-path transformer architecture.

4D Radar, Point Cloud Perception, Dual-Representation Transformer, Automated Driving, Feature Fusion

Problem

4D radar point clouds are significantly sparser and noisier than LiDAR data, making it difficult for traditional single-representation models to capture both the fine-grained local details and coarse-grained global context required for accurate automated driving perception.

Approach

The authors propose DRIFT, a dual-path transformer model that processes raw radar points and pillar-voxelized data in parallel, intertwining them at multiple stages via novel feature-sharing blocks to fuse local and global representations.
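The exchange between the two paths can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the mean-pooling scatter, and the additive fusion are all assumptions chosen for illustration, and the transformer layers of each path are omitted. It only shows how point-level features might be pooled into pillars (local to global) and how pillar features might be gathered back to points (global to local).

```python
import numpy as np

def pillar_indices(points_xy, x_range, y_range, pillar_size):
    """Map each point's (x, y) to a flat pillar index on a bird's-eye-view grid."""
    nx = int(np.ceil((x_range[1] - x_range[0]) / pillar_size))
    ny = int(np.ceil((y_range[1] - y_range[0]) / pillar_size))
    ix = np.floor((points_xy[:, 0] - x_range[0]) / pillar_size).astype(int)
    iy = np.floor((points_xy[:, 1] - y_range[0]) / pillar_size).astype(int)
    # Clamp points on the upper boundary into the last cell.
    ix = np.clip(ix, 0, nx - 1)
    iy = np.clip(iy, 0, ny - 1)
    return ix * ny + iy, nx * ny

def points_to_pillars(point_feats, idx, num_pillars):
    """Mean-pool point features into their pillars (local -> global)."""
    sums = np.zeros((num_pillars, point_feats.shape[1]))
    np.add.at(sums, idx, point_feats)  # unbuffered add handles repeated indices
    counts = np.bincount(idx, minlength=num_pillars)[:, None]
    return sums / np.maximum(counts, 1)  # empty pillars stay zero

def pillars_to_points(pillar_feats, idx):
    """Give each point the feature of the pillar it falls in (global -> local)."""
    return pillar_feats[idx]

def share_features(point_feats, pillar_feats, idx):
    """Hypothetical sharing step: each path adds the other path's features."""
    num_pillars = pillar_feats.shape[0]
    new_pillar = pillar_feats + points_to_pillars(point_feats, idx, num_pillars)
    new_point = point_feats + pillars_to_points(pillar_feats, idx)
    return new_point, new_pillar
```

In a real model, a step like `share_features` would sit between stages of the two paths, and the simple addition would likely be replaced by a learned fusion; the sketch assumes both paths use the same feature width.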

Key results

52.6% mAP on the View-of-Delft (VoD) dataset, compared with 45.4% for CenterPoint
Surpasses baselines in 3D object detection and free-road segmentation
Validated on public and large-scale proprietary radar datasets
Novel feature-sharing blocks enable effective bi-directional local-global fusion

Why it matters

Strengthens 4D radar as a low-cost sensing option for automated driving perception, one that stays robust under adverse weather conditions.

Abstract

4D radars, which provide 3D point cloud data along with Doppler velocity, are attractive components of modern automated driving systems due to their low cost and robustness under adverse weather conditions. However, they provide a significantly lower point cloud density than LiDAR sensors. This makes it important to exploit not only local but also global contextual scene information. This paper proposes DRIFT, a model that effectively captures and fuses both local and global contexts through a dual-path architecture. The model incorporates a point path to aggregate fine-grained local features and a pillar path to encode coarse-grained global features. These two parallel paths are intertwined via novel feature-sharing layers at multiple stages, enabling full utilization of both representations. DRIFT is evaluated on the widely used View-of-Delft (VoD) dataset [1] and a proprietary internal dataset. It outperforms the baselines on the tasks of object detection and/or free road estimation. For example, DRIFT achieves a mean average precision (mAP) of 52.6% on the VoD dataset, compared with 45.4% for CenterPoint [2].

Index terms

Intelligent Transportation Systems; Object Detection, Segmentation and Categorization; Deep Learning Methods