ICRA 2026

Look, Focus, Act: Efficient and Robust Robot Learning Via Human Gaze and Foveated Vision Transformers

Ian Chuang, Jinyu Zou, Andrew Lee, Dechen Gao, Iman Soltani

AI summary

Incorporating human-like gaze and foveated vision into robot learning drastically reduces computational cost while improving robustness and task performance.

Foveated Vision, Robot Learning, Vision Transformers, Human Gaze, Imitation Learning, Active Vision

Problem

Current robot learning systems process camera images uniformly, ignoring the biological efficiency of human active gaze and foveation. This leads to high computational costs and reduced robustness in complex environments.

Approach

The authors propose GIAVA, a framework that collects human gaze and manipulation data via VR headsets, and integrates gaze-guided foveated patch tokenization into Vision Transformers to focus robot policies on task-relevant regions.
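To make the idea of gaze-guided foveated patch tokenization concrete, here is a minimal NumPy sketch. It is not the authors' scheme: the two-level layout (full-resolution patches in a window around the gaze point, plus a strided, downsampled view of the whole frame) and the names `patchify`, `foveated_tokens`, `fovea`, and `stride` are assumptions chosen for illustration only.

```python
import numpy as np


def patchify(img: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = img.shape
    rows, cols = h // patch, w // patch
    img = img[: rows * patch, : cols * patch]
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    x = img.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(rows * cols, patch * patch * c)


def foveated_tokens(img, gaze_xy, patch=16, fovea=64, stride=4):
    """Hypothetical two-level tokenizer: fine patches near the gaze point,
    coarse patches from a downsampled copy of the full frame."""
    h, w, _ = img.shape
    gx, gy = gaze_xy
    # Clamp the foveal window so it always lies inside the image.
    x0 = int(np.clip(gx - fovea // 2, 0, w - fovea))
    y0 = int(np.clip(gy - fovea // 2, 0, h - fovea))
    fine = patchify(img[y0:y0 + fovea, x0:x0 + fovea], patch)
    # Periphery: subsample every `stride`-th pixel, then patchify.
    coarse = patchify(img[::stride, ::stride], patch)
    return np.concatenate([fine, coarse], axis=0)


if __name__ == "__main__":
    frame = np.random.rand(256, 256, 3).astype(np.float32)
    print("uniform tokens: ", patchify(frame, 16).shape[0])
    print("foveated tokens:", foveated_tokens(frame, (128, 128)).shape[0])
```

For a 256x256 frame with 16-pixel patches, uniform tokenization yields 256 tokens, while this sketch yields 32 (16 foveal plus 16 peripheral). A real system would also need position embeddings that encode each token's scale and location, which are omitted here.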

Key results

Reduces ViT computational overhead by 94% while preserving performance
Enhances robustness to background distractors
Improves success rates on high-precision manipulation tasks
Provides an open-source simulation benchmark and synchronized gaze dataset
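
As a rough guide to why fewer tokens reduce ViT cost, the sketch below uses the standard per-layer estimate of multiply-accumulate operations for a Transformer encoder block: 12·N·d² for the linear layers (with an MLP expansion ratio of 4) plus 2·N²·d for self-attention. The token counts and embedding width are illustrative assumptions, and the function names are hypothetical. It does not reproduce the paper's 94% figure.

```python
def vit_layer_macs(n_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates for one Transformer encoder layer."""
    proj = 4 * n_tokens * dim * dim             # Q, K, V and output projections
    attn = 2 * n_tokens * n_tokens * dim        # QK^T scores plus weighted sum over V
    mlp = 2 * mlp_ratio * n_tokens * dim * dim  # the two linear layers of the MLP
    return proj + attn + mlp


def relative_saving(n_uniform: int, n_foveated: int, dim: int = 768) -> float:
    """Fraction of per-layer compute saved by using fewer tokens."""
    full = vit_layer_macs(n_uniform, dim)
    fov = vit_layer_macs(n_foveated, dim)
    return 1.0 - fov / full


if __name__ == "__main__":
    # Illustrative numbers only: 256 uniform tokens versus 32 foveated tokens.
    print(f"estimated saving: {relative_saving(256, 32):.1%}")
```

Because the attention term grows quadratically in the number of tokens, the saving is slightly larger than the reduction in token count alone.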

Why it matters

This approach provides a scalable, biologically inspired inductive bias that cuts the computational cost of robot policies while improving robustness and, on some tasks, performance, offering a practical pathway toward more efficient embodied AI systems.

Abstract

Human vision is a highly active process driven by gaze, which directs attention to task-relevant regions through foveation, dramatically reducing visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance efficiency and robustness. We develop GIAVA (Gaze Integrated Active-Vision ALOHA), a robot vision system that emulates human head and neck movement, and gaze adjustment for foveated processing. Extending the AV-ALOHA robot platform, we introduce a framework for simultaneously collecting eye-tracking, perspective control, and robot manipulation demonstration data from a human operator. We also open-source a simulation benchmark and dataset for training robot policies that incorporate human gaze. Inspired by recent work in foveated image segmentation and given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation. For this purpose, we explore two approaches to gaze estimation: The first is a two-stage model that predicts gaze independently to guide foveation and subsequently action. The second integrates gaze into the action space, allowing the policy to jointly estimate gaze and actions end-to-end. Our results show that our method for foveated robot vision drastically reduces computational overhead, and enhances robustness to background distractors. Notably, on certain high-precision tasks, foveated vision also improves performance, as reflected in higher success rates. Together, these findings suggest that human-inspired foveated visual processing offers untapped potential and should be further considered as a useful inductive bias in robotic vision systems.
https://soltanilara.github.io/giava/

Index terms

Imitation Learning, Bimanual Manipulation, Deep Learning in Grasping and Manipulation