Look, Focus, Act: Efficient and Robust Robot Learning Via Human Gaze and Foveated Vision Transformers
Ian Chuang, Jinyu Zou, Andrew Lee, Dechen Gao, Iman Soltani
AI summary
Problem
Current robot learning systems process camera images uniformly, ignoring the biological efficiency of human active gaze and foveation. This leads to high computational costs and reduced robustness in complex environments.
Approach
The authors propose GIAVA, a framework that collects human gaze and manipulation data via VR headsets, and integrates gaze-guided foveated patch tokenization into Vision Transformers to focus robot policies on task-relevant regions.
Key results
- Reduces ViT computational overhead by 94% while preserving performance
- Enhances robustness to background distractors
- Improves success rates on high-precision manipulation tasks
- Provides an open-source simulation benchmark and synchronized gaze dataset
Why it matters
This approach provides a scalable, biologically inspired inductive bias that reduces the computational cost of robot vision policies, improves their robustness, and can raise success rates on high-precision tasks, offering a practical pathway toward more efficient embodied AI systems.
Abstract
Human vision is a highly active process driven by gaze, which directs attention to task-relevant regions through foveation, dramatically reducing visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance efficiency and robustness. We develop GIAVA (Gaze Integrated Active-Vision ALOHA), a robot vision system that emulates human head and neck movement, and gaze adjustment for foveated processing. Extending the AV-ALOHA robot platform, we introduce a framework for simultaneously collecting eye-tracking, perspective control, and robot manipulation demonstration data from a human operator. We also open-source a simulation benchmark and dataset for training robot policies that incorporate human gaze. Inspired by recent work in foveated image segmentation and given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation. For this purpose, we explore two approaches to gaze estimation: the first is a two-stage model that predicts gaze independently to guide foveation and subsequently action. The second integrates gaze into the action space, allowing the policy to jointly estimate gaze and actions end-to-end. Our results show that our method for foveated robot vision drastically reduces computational overhead, and enhances robustness to background distractors. Notably, on certain high-precision tasks, foveated vision also improves performance, as reflected in higher success rates. Together, these findings suggest that human-inspired foveated visual processing offers untapped potential and should be further considered as a useful inductive bias in robotic vision systems.
https://soltanilara.github.io/giava/
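To make the token-count argument in the abstract concrete, the following is a minimal sketch of one possible gaze-centered tokenizer, not the authors' implementation. It assumes a multi-scale crop pyramid: square windows of increasing size are cropped around the gaze point, each is average-pooled to the same small resolution, and each is split into ViT-style patches. The function names, crop sizes, image size, and patch size are all hypothetical choices made for illustration, and the resulting token counts are not meant to reproduce the figures reported in the paper.

```python
import numpy as np


def uniform_token_count(height, width, patch):
    """Number of tokens a standard ViT produces with uniform patching."""
    return (height // patch) * (width // patch)


def crop_centered(image, cy, cx, size):
    """Crop a size x size window around (cy, cx), shifted to stay inside the image."""
    h, w = image.shape[:2]
    top = int(np.clip(cy - size // 2, 0, h - size))
    left = int(np.clip(cx - size // 2, 0, w - size))
    return image[top:top + size, left:left + size]


def downsample(crop, out_size):
    """Average-pool a square crop to out_size x out_size.

    Assumes the crop side is an integer multiple of out_size.
    """
    stride = crop.shape[0] // out_size
    channels = crop.shape[2]
    blocks = crop.reshape(out_size, stride, out_size, stride, channels)
    return blocks.mean(axis=(1, 3))


def patchify(image, patch):
    """Split an H x W x C image into flattened, non-overlapping patch tokens."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    grid = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * c)


def foveated_tokens(image, gaze_yx, crop_sizes=(32, 64, 128), out_size=32, patch=16):
    """Hypothetical foveated tokenizer: fine detail near the gaze, coarse context farther out."""
    cy, cx = gaze_yx
    levels = []
    for size in crop_sizes:
        crop = crop_centered(image, cy, cx, size)
        levels.append(patchify(downsample(crop, out_size), patch))
    return np.concatenate(levels, axis=0)


# Illustrative comparison on a random 256 x 256 RGB image (assumed sizes).
image = np.random.default_rng(0).random((256, 256, 3))
tokens = foveated_tokens(image, gaze_yx=(120, 200))
n_uniform = uniform_token_count(256, 256, patch=16)
print(tokens.shape, n_uniform)
```

With these assumed settings, uniform patching yields 256 tokens while the sketch yields 12 (three pyramid levels of four patches each). Because self-attention cost grows roughly quadratically with the number of tokens, reductions of this kind are what make a foveated scheme cheaper than uniform tokenization. In a gaze-conditioned policy, `gaze_yx` would come from a gaze predictor or from the policy's own output rather than being fixed by hand.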