ICRA 2026

Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference

Huy Dung Nguyen, Anass Bairouk, Mirjana Maras, Wei Xiao, Tsun-Hsuan Wang, Patrick Chareyre, Ramin Hasani, Marc Blanchon, Daniela Rus

AI summary

A unified encoder trained on diverse visual tasks learns a rich latent space that significantly outperforms baselines for steering estimation while maintaining strong multi-task performance.

multi-task learning, unified encoder, autonomous driving, latent space, steering estimation, knowledge distillation

Problem

Existing autonomous driving models often rely on single-task objectives or generic datasets, lacking the diverse contextual visual cues needed for robust perception and control in complex driving scenarios.

Approach

The authors propose a unified encoder trained jointly on depth estimation, pose estimation, 3D scene flow estimation, and semantic, instance, panoptic, and motion segmentation. A multi-scale pose decoder and knowledge distillation from a multi-backbone teacher model are used to balance gradients and stabilize training.
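The training setup described above can be pictured with a small sketch: one shared encoder feeds several task heads, and each head is pulled toward both its ground-truth label and the output of a teacher. This is only an illustration under stated assumptions, not the authors' implementation. The linear encoder, the scalar heads, the output-level squared-error distillation term, and the names `encode`, `head`, `total_loss`, and `alpha` are all hypothetical; the source does not specify the form of the distillation loss.

```python
# Minimal sketch, NOT the paper's code: a shared encoder with per-task heads
# and an assumed output-level distillation term from a teacher model.

def encode(x, enc_w):
    """Hypothetical shared encoder: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in enc_w]

def head(z, head_w):
    """Hypothetical task head: a linear readout of the shared latent z."""
    return sum(w * zi for w, zi in zip(head_w, z))

def total_loss(x, targets, teacher_out, enc_w, heads, alpha=0.5):
    """Sum over tasks of a supervised term and a distillation term.

    alpha (an assumed hyperparameter) weights the teacher term against
    the ground-truth term for every task.
    """
    z = encode(x, enc_w)  # computed once and reused by every head
    loss = 0.0
    for name, head_w in heads.items():
        pred = head(z, head_w)
        supervised = (pred - targets[name]) ** 2
        distill = (pred - teacher_out[name]) ** 2
        loss += (1.0 - alpha) * supervised + alpha * distill
    return loss
```

The point of the sketch is the data flow: the latent `z` is computed once per input, so adding a task costs only one extra head at inference time.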

Key results

Strong generalization across all visual tasks, comparable to state-of-the-art dedicated models
Frozen latent representations outperform fine-tuned and ImageNet-pretrained baselines for steering estimation
Multi-scale pose decoder improves depth estimation in dynamic scenes
Knowledge distillation stabilizes training across supervised and self-supervised tasks
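The frozen-latent result listed above corresponds to a linear-probe style evaluation, which can be sketched as follows. This is a toy illustration under assumptions, not the paper's steering pipeline: the linear-plus-ReLU encoder, the scalar steering target, plain SGD, and the names `train_steering_head` and `predict` are all hypothetical.

```python
# Minimal sketch, NOT the paper's code: train only a steering head on top of
# a frozen encoder. The encoder weights are read but never updated.

def encode(x, enc_w):
    """Hypothetical frozen encoder: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in enc_w]

def train_steering_head(data, enc_w, lr=0.05, epochs=200):
    """Fit a linear head to (input, steering) pairs with SGD on squared error."""
    # Because the encoder is frozen, features can be computed once and cached.
    feats = [(encode(x, enc_w), y) for x, y in data]
    head_w = [0.0] * len(enc_w)
    for _ in range(epochs):
        for z, y in feats:
            err = sum(w * zi for w, zi in zip(head_w, z)) - y
            # Gradient of err**2 with respect to head_w is 2 * err * z.
            head_w = [w - lr * 2.0 * err * zi for w, zi in zip(head_w, z)]
    return head_w

def predict(x, enc_w, head_w):
    """Steering prediction from the frozen latent."""
    return sum(w * zi for w, zi in zip(head_w, encode(x, enc_w)))
```

Freezing the encoder keeps the learned multi-task latent intact and leaves only a small head to train for the downstream task.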

Why it matters

This approach provides an efficient, context-rich foundation for autonomous driving systems by mimicking human multi-cue perception, benefiting researchers and engineers developing compact, robust perception pipelines.

Abstract

Autonomous driving systems require a comprehensive understanding of the environment, achieved by extracting visual features essential for perception, planning, and control. However, models trained solely on single-task objectives or generic datasets often lack the contextual information needed for robust performance in complex driving scenarios. In this work, we present a unified encoder trained across a diverse set of computer vision tasks essential for urban driving, including depth estimation, pose estimation, 3D scene flow estimation, and semantic, instance, panoptic, and motion segmentation. This single-encoder approach not only integrates these complementary visual cues, inspired by the diversity of visual cues used in human driving perception, but also enables a compact and inference-efficient model that embeds a rich, navigation-relevant latent space. Indeed, the unified encoder learns to embed multi-task knowledge into a shared representation, allowing for better downstream task adaptation, particularly for steering estimation. To ensure efficient learning across tasks within a unified encoder, we propose a multi-scale pose decoder and employ knowledge distillation from a multi-backbone teacher model. Our experiments demonstrate that (1) the unified encoder achieves strong generalization across all visual tasks, comparable to state-of-the-art dedicated models, and (2) its frozen latent representations significantly outperform both fine-tuned models and ImageNet-pretrained baselines for steering estimation. These results underscore how multi-task feature learning, inspired by the diversity of perceptual cues used in human driving, offers an efficient and context-rich foundation for autonomous driving systems.

Index terms

Deep Learning for Visual Perception, Semantic Scene Understanding, Vision-Based Navigation