DMFuser: Distilled Multi-Task Learning for End-To-End Transformer-Based Sensor Fusion in Autonomous Driving
Pedram Agand, Mohammad Mahdavian, Manolis Savva, Mo Chen
Abstract
In end-to-end autonomous driving, current sensor fusion and navigational control techniques used by imitation learning algorithms are insufficient in challenging scenarios involving multiple dynamic agents and result in poor driving capabilities. To tackle this issue, we introduce DMFuser, a transformer-based algorithm that employs knowledge distilla- tion between multi-task student and single-task teachers and combines attention and convolutions to fuse multiple RGB-D camera representations to produce vehicular navigational com- mands (throttle, steering and brake). Our model incorporates two modules. The first module, perception, encodes data from RGB-D cameras for tasks like semantic segmentation, semantic depth cloud (SDC) mapping, and traffic light state recognition. To enhance feature extraction and fusion from both RGB and depth sources, we harness local and global capabilities of convolution and transformer modules. We employ an attention- CNN fusion structure to effectively learn and fuse RGB and SDC map features. Subsequently, the control module decodes these features along with supplementary data, containing envi- ronment’s static and dynamic information, to predict waypoints and vehicular control actions. We evaluate the model and conduct a comparative analysis, in various scenarios, weather conditions, and traffic situations, spanning from normal to adversarial in the CARLA simulator. We achieve better or comparable results in term of driving score (DS) and other metrics with respect to our baselines. Also, our ablation studies demonstrate the effectiveness of our contributions to improve the driving skills. Our code is available at the following github page: https://github.com/pagand/e2etransfuser