Multi-Task Visual Perception with Temporal Feature Fusion for Autonomous Driving
Huei-Yung Lin, Shih-Han Wei
AI summary
Problem
Most existing perception models are trained on a single dataset, which limits data diversity and degrades performance in complex scenarios such as occlusion and illumination variation. They also lack temporal consistency, making lane and road marking detection unreliable in real-world driving.
Approach
The authors propose a unified network that jointly performs object detection, lane detection, and road/drivable area segmentation using a shared backbone. It integrates a homography-guided temporal fusion module to align features across consecutive frames and employs cross-dataset training to expand data diversity and improve generalization.
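As a rough illustration of the homography-guided alignment described above, the following minimal Python sketch warps a previous-frame feature map into the current frame and blends the two. It is not the authors' implementation: the function names (`apply_homography`, `warp_features`, `fuse`), the nearest-neighbour sampling, the single-channel feature map, and the fixed blending weight `alpha` are all assumptions made for illustration. How the homography is estimated and how the network actually fuses features are not covered here.

```python
from typing import List

Matrix = List[List[float]]


def apply_homography(H: Matrix, x: float, y: float):
    """Map pixel (x, y) through the 3x3 homography H.

    Returns (x', y'), or None when the projective scale is ~0
    (the point maps to infinity).
    """
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        return None
    return xs / w, ys / w


def warp_features(prev: Matrix, H_cur_to_prev: Matrix, fill: float = 0.0) -> Matrix:
    """Inverse-warp a previous-frame feature map into the current frame.

    For each current-frame pixel, look up the corresponding location in the
    previous frame and copy the nearest value (assumed sampling scheme).
    Pixels that fall outside the previous map receive `fill`.
    """
    h, w = len(prev), len(prev[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            mapped = apply_homography(H_cur_to_prev, x, y)
            if mapped is None:
                continue
            ix, iy = round(mapped[0]), round(mapped[1])
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = prev[iy][ix]
    return out


def fuse(cur: Matrix, warped: Matrix, alpha: float = 0.5) -> Matrix:
    """Blend current and aligned previous features (assumed fixed weight)."""
    return [
        [alpha * c + (1.0 - alpha) * p for c, p in zip(row_c, row_p)]
        for row_c, row_p in zip(cur, warped)
    ]


# Toy example: the scene shifted by one pixel between frames.
prev_feat = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
cur_feat = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
H_shift = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]  # current (x, y) -> previous (x + 1, y)
aligned = warp_features(prev_feat, H_shift)
fused = fuse(cur_feat, aligned)
```

In a real network the same idea would typically be applied per channel to backbone feature maps with bilinear sampling, and the blending weight would usually be learned rather than fixed.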
Key results
- Unified architecture jointly handles object detection, lane detection, and road/drivable area segmentation with shared features for computational efficiency.
- Homography-guided temporal fusion aligns consecutive frame features to enhance lane and road marking robustness without extra sensors.
- Cross-dataset training expands available resources and improves generalization across all perception tasks.
- Achieves performance competitive with state-of-the-art multi-task models on the BDD100K, VIL-100, and SeRM benchmarks.
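One common way to train a shared network across datasets that each annotate only some tasks is to skip the loss terms for tasks without labels in the current batch. The sketch below illustrates that idea under stated assumptions: the dataset names, the task-to-dataset mapping in `LABELLED_TASKS`, the task weights, and the function names are hypothetical placeholders, not details taken from the paper.

```python
from typing import Dict, Optional

TASKS = ("detection", "lane", "segmentation")

# Hypothetical mapping: which tasks each training dataset provides labels for.
LABELLED_TASKS = {
    "dataset_a": {"detection", "lane", "segmentation"},
    "dataset_b": {"lane"},
    "dataset_c": {"segmentation"},
}


def mask_losses(dataset: str, raw_losses: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Keep only the losses for tasks this dataset annotates; others become None."""
    labelled = LABELLED_TASKS[dataset]
    return {task: (raw_losses[task] if task in labelled else None) for task in TASKS}


def combined_loss(task_losses: Dict[str, Optional[float]], weights: Dict[str, float]) -> float:
    """Weighted sum over the tasks that have labels in this batch.

    A value of None marks a task with no annotations, so it contributes
    nothing to the total (and, in a real framework, no gradient).
    """
    total = 0.0
    for task in TASKS:
        loss = task_losses.get(task)
        if loss is None:
            continue
        total += weights.get(task, 1.0) * loss
    return total


# Example: a batch drawn from a lane-only dataset.
weights = {"detection": 1.0, "lane": 1.0, "segmentation": 0.5}
raw = {"detection": 2.0, "lane": 0.75, "segmentation": 1.0}
batch_loss = combined_loss(mask_losses("dataset_b", raw), weights)
```

With this masking, the shared backbone receives gradients from every dataset, while each task head is updated only by the batches that carry its labels.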
Why it matters
Provides a computationally efficient and robust perception framework that improves real-time decision-making and safety for autonomous vehicles in complex, dynamic driving environments.
Abstract
With the rapid development of autonomous driving technologies, accurate scene perception has become essential for safe and efficient navigation. Key perception tasks such as lane detection, semantic segmentation of road markings and road area, and object detection directly impact vehicle decision-making and obstacle avoidance. However, most existing methods are trained on a single-task dataset, limiting data diversity and reducing performance in complex scenarios or under occlusion and illumination variation. In this work, we propose a multi-task perception network based on image sequence input, integrating lane detection, road marking and road area segmentation, and object detection into a unified framework. The network employs multi-task learning to share features and improve computational efficiency, and adopts a cross-dataset training paradigm to enhance generalization across tasks. Furthermore, temporal information from adjacent frames is leveraged to compensate for visual degradation in the current frame. Experimental results obtained on multiple datasets demonstrate that the proposed technique achieves competitive performance compared to state-of-the-art approaches. Code is available at https://github.com/hank890121/MTVP