ICRA 2026

AutoFocus-IL: VLM-Based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations

Litian Gong, Fatemeh Bahrani, Yutai Zhou, Amin Banayeeanzade, Jiachen Li, Erdem Bıyık

AI summary

VLM-generated saliency maps effectively regularize imitation learning policies, significantly improving data efficiency and generalization without requiring human annotations.

Imitation Learning, Vision-Language Models, Saliency Maps, Causal Confusion, Data Efficiency, Behavior Cloning

Problem

Imitation learning struggles with data scarcity, poor generalization, and causal confusion where policies latch onto spurious visual correlations. Existing saliency-based fixes require costly human supervision like gaze data or manual labels.

Approach

AutoFocus-IL automatically identifies and tracks task-relevant objects across demonstration frames using vision-language models to generate temporal saliency maps. These maps regularize behavior cloning policies to focus on causal features while suppressing distractors.
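The summary does not give the exact formulation, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a VLM plus tracker has already produced per-frame bounding boxes for the task-relevant objects, renders them as Gaussian blobs, carries saliency across frames with a decay factor, and adds a KL penalty between the saliency map and a policy attention map to a mean-squared-error behavior cloning loss. The function names, the box format, the `decay` and `lam` parameters, and the choice of KL divergence are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def boxes_to_saliency(boxes, height, width, sigma_scale=0.5):
    """Render one frame's saliency map as the max over Gaussian blobs.

    boxes: list of (x_min, y_min, x_max, y_max) in pixels. This box format
    is a hypothetical stand-in for the output of a VLM-based object tracker.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    saliency = np.zeros((height, width), dtype=np.float64)
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        # Blob width follows the box size (assumed heuristic).
        sx = max((x1 - x0) * sigma_scale, 1.0)
        sy = max((y1 - y0) * sigma_scale, 1.0)
        blob = np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))
        saliency = np.maximum(saliency, blob)
    return saliency


def temporal_saliency(per_frame_boxes, height, width, decay=0.5):
    """Carry saliency forward in time: each map is the max of the current
    frame's map and a decayed copy of the previous one (assumed scheme)."""
    maps = []
    prev = np.zeros((height, width), dtype=np.float64)
    for boxes in per_frame_boxes:
        current = boxes_to_saliency(boxes, height, width)
        prev = np.maximum(current, decay * prev)
        maps.append(prev)
    return np.stack(maps)


def normalize(m, eps=1e-8):
    """Turn a non-negative map into a probability distribution over pixels."""
    m = m + eps
    return m / m.sum()


def regularized_bc_loss(pred_action, expert_action, policy_attention,
                        saliency, lam=0.1):
    """Behavior cloning MSE plus a KL(saliency || attention) penalty.

    Returns (total, bc_term, kl_term). The KL form is an assumption; other
    alignment losses would fit the same description.
    """
    bc = float(np.mean((pred_action - expert_action) ** 2))
    p = normalize(saliency)
    q = normalize(policy_attention)
    kl = float(np.sum(p * np.log(p / q)))
    return bc + lam * kl, bc, kl
```

In this sketch the penalty vanishes when the policy's attention map matches the saliency map and grows as attention spreads onto regions the saliency map marks as irrelevant, which is one way to discourage reliance on distractors.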

Key results

104% driving score improvement over standard behavior cloning in CARLA
50% performance gain in real-world WidowX robot manipulation tasks
Outperforms state-of-the-art baselines using privileged human gaze supervision
Introduces a scalable, annotation-free pipeline for VLM-driven object filtering and temporal saliency modeling

Why it matters

Enables robust, data-efficient robot learning at scale by replacing expensive human supervision with automated, context-aware visual attention.

Abstract

We present AutoFocus-IL, a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Saliency regularization has emerged as a promising way to achieve this, but existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Our findings highlight that VLM-driven saliency provides a scalable, annotation-free path toward robust imitation learning in robotics. Particularly, our experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. The supplementary materials, including code, datasets, and trained policy videos, are publicly available at https://AutoFocus-IL.github.io/.

Index terms

Imitation Learning, Learning from Demonstration, Representation Learning