ICRA 2026

Multi-Task Visual Perception with Temporal Feature Fusion for Autonomous Driving

Huei-Yung Lin, Shih-Han Wei

AI summary

A unified multi-task network with homography-guided temporal fusion and cross-dataset training achieves robust, efficient perception for autonomous driving across challenging conditions.

Multi-task learning, autonomous driving, temporal feature fusion, lane detection, road marking segmentation, cross-dataset training

Problem

Most existing perception methods are trained on a single-task dataset, which limits data diversity and degrades performance in complex scenes, under occlusion, and under illumination variation. They also do not use temporal information from adjacent frames, so lane and road marking detection becomes unreliable when the current frame is visually degraded.

Approach

The authors propose a unified network that jointly performs object detection, lane detection, and road/drivable area segmentation using a shared backbone. It integrates a homography-guided temporal fusion module to align features across consecutive frames and employs cross-dataset training to expand data diversity and improve generalization.
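The alignment step can be illustrated with a minimal NumPy sketch. It assumes a known 3x3 homography that maps previous-frame pixel coordinates to current-frame coordinates, warps the previous feature map by inverse mapping with nearest-neighbor sampling, and blends it with the current features. The function names, the `alpha` weight, and the simple weighted blend are illustrative assumptions; the summary does not describe the internals of the authors' fusion module.

```python
import numpy as np

def warp_features(feat_prev, H, fill=0.0):
    """Warp a (C, H, W) feature map into the current frame.

    H is assumed to map previous-frame pixel coordinates (x, y, 1)
    to current-frame coordinates. Inverse mapping is used: for each
    target pixel, look up the source pixel via H^-1.
    """
    c, h, w = feat_prev.shape
    H_inv = np.linalg.inv(H)

    # Homogeneous coordinates of every target pixel, shape (3, h*w).
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)
    src = H_inv @ pts.astype(np.float64)

    # Perspective divide, guarding against a zero scale factor.
    ok = np.abs(src[2]) > 1e-8
    scale = np.where(ok, src[2], 1.0)
    sx = np.rint(src[0] / scale).astype(int)
    sy = np.rint(src[1] / scale).astype(int)

    # Keep only source locations that fall inside the previous frame.
    valid = ok & (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)

    out = np.full((c, h * w), fill, dtype=feat_prev.dtype)
    out[:, valid] = feat_prev[:, sy[valid], sx[valid]]
    return out.reshape(c, h, w), valid.reshape(h, w)

def fuse(feat_cur, feat_prev, H, alpha=0.5):
    """Blend current features with aligned previous features.

    Where the warped map has no valid source pixel, the current
    features are kept unchanged. The weighted average is a stand-in
    for a learned fusion layer.
    """
    warped, valid = warp_features(feat_prev, H)
    blended = alpha * feat_cur + (1.0 - alpha) * warped
    return np.where(valid[None], blended, feat_cur)
```

With an identity homography the warp returns the previous features unchanged; with a one-pixel horizontal translation, every feature column shifts right by one and the leftmost column is marked invalid, so the fused output falls back to the current-frame features there.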

Key results

Unified architecture jointly handles object detection, lane detection, and road/drivable area segmentation with shared features for computational efficiency.
Homography-guided temporal fusion aligns consecutive frame features to enhance lane and road marking robustness without extra sensors.
Cross-dataset training expands available resources and improves generalization across all perception tasks.
Achieves performance competitive with state-of-the-art approaches on the BDD100K, VIL-100, and SeRM benchmarks.
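One common way to train a shared network across datasets that each annotate only some tasks is to sum the losses of the tasks labeled for the current sample and skip the rest. The sketch below illustrates that idea only; the task names, the `combined_loss` helper, and the per-task weights are hypothetical and are not taken from the paper or its code.

```python
# Hypothetical task names for a shared-backbone, multi-head network.
TASKS = ("detection", "lane", "segmentation")

def combined_loss(task_losses, available, weights=None):
    """Sum weighted per-task losses over the tasks a sample is labeled for.

    task_losses: dict mapping task name -> scalar loss for this sample.
    available:   set of task names annotated by the sample's source dataset.
    weights:     optional dict of per-task weights (defaults to 1.0 each).
    """
    if weights is None:
        weights = {t: 1.0 for t in TASKS}
    total = 0.0
    for task in TASKS:
        # Tasks without labels contribute nothing, so their heads
        # receive no gradient from this sample.
        if task in available:
            total += weights[task] * task_losses[task]
    return total
```

For a sample drawn from a lane-only dataset, calling `combined_loss(losses, {"lane"})` returns just the lane loss, while the shared backbone still benefits from the additional data.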

Why it matters

The framework offers a computationally efficient and robust perception pipeline whose outputs support vehicle decision-making and obstacle avoidance in complex, dynamic driving environments.

Abstract

With the rapid development of autonomous driving technologies, accurate scene perception has become essential for safe and efficient navigation. Key perception tasks such as lane detection, semantic segmentation of road markings and road area, and object detection directly impact vehicle decision-making and obstacle avoidance. However, most existing methods are trained on a single-task dataset, limiting data diversity and reducing performance in complex scenarios or under occlusion and illumination variation. In this work, we propose a multi-task perception network based on image sequence input, integrating lane detection, road marking and road area segmentation, and object detection into a unified framework. The network model employs multi-task learning to share features and improve computational efficiency, and adopts a cross-dataset training paradigm to enhance generalization across tasks. Furthermore, temporal information from adjacent frames is leveraged to compensate for visual degradation in the current frame. Experimental results obtained on multiple datasets demonstrate that the proposed technique achieves competitive performance compared to state-of-the-art approaches. Code is available at https://github.com/hank890121/MTVP

Index terms

Autonomous Vehicle Navigation; Intelligent Transportation Systems; Computer Vision for Transportation