Zero-Shot Exocentric Viewpoint-Robust Imitation Learning (VIL): Bridging Handheld Gripper and Exocentric Views
Boyan Li, Peilin Meng, Chang Liu, Yulin Chen, Qi Zhou, Youyi Bi
AI summary
Problem
Existing handheld gripper pipelines for robot learning rely on egocentric views, limiting global context and failing under viewpoint shifts, while existing imitation learning algorithms are highly sensitive to camera perspective changes, hindering scalability and real-world deployment.
Approach
The authors design a Robotiq-like handheld gripper that captures both egocentric and exocentric views, paired with a zero-shot imitation learning algorithm that uses a hybrid Swin Transformer/ResNet encoder and a SAM2-based inpainting module to align appearances and extract viewpoint-consistent features for an ACT policy.
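The appearance-alignment idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary mask of the handheld gripper is already available (in the paper, a SAM2-based module plays that role) and swaps the learned inpainting step for a plain copy from a reference background frame. The function name `align_appearance` and all array shapes are hypothetical.

```python
import numpy as np


def align_appearance(frame, gripper_mask, background):
    """Replace the pixels under gripper_mask with pixels from background.

    frame, background: (H, W, 3) arrays of the same shape and dtype.
    gripper_mask: (H, W) boolean array, True where the gripper is visible.
    This stands in for an inpainting step; it is an illustrative assumption,
    not the method described in the paper.
    """
    if frame.shape != background.shape:
        raise ValueError("frame and background must have the same shape")
    if gripper_mask.shape != frame.shape[:2]:
        raise ValueError("gripper_mask must match the frame's height and width")
    # Add a trailing axis so the mask broadcasts over the colour channels.
    mask = gripper_mask.astype(bool)[..., None]
    return np.where(mask, background, frame)


# Toy example: a bright 4x4 frame with a 2x2 "gripper" region in the centre.
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
background = np.full((4, 4, 3), 50, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

aligned = align_appearance(frame, mask, background)
```

In this toy case the masked centre takes the background values while the remaining pixels are left unchanged. A real pipeline would obtain the mask from a segmentation model and fill the region with a learned inpainting model.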
Key results
- 93.3% success in simulated cube transfer under novel top-down views
- Zero-shot policy stability across symmetric and novel exocentric viewpoints
- Robust real-world manipulation under dynamic camera shifts
- Swin Transformer outperforms ResNet and ViT for viewpoint generalization
Why it matters
Provides a scalable, low-cost, manipulator-independent pipeline for collecting and deploying viewpoint-robust imitation learning policies in real-world settings.
Abstract
Recent advances in robot learning have motivated integrated pipelines that combine hardware for data collection with imitation learning algorithms. Existing data collection methods such as leader–follower, VR/AR, and exoskeletons rely on costly hardware and exhibit limited scalability, while imitation learning algorithms built on them remain highly sensitive to viewpoint shifts, further constraining generalizability. Handheld grippers provide a low-cost, robot-agnostic alternative, but prior systems bypass exocentric view alignment by relying solely on wrist-mounted cameras, which narrows the field of observation and reduces policy robustness. We propose VIL, a framework that pairs a customized handheld gripper with a zero-shot, exocentric viewpoint-robust imitation learning algorithm, bridging the handheld gripper with exocentric views. Our approach employs adapters for appearance alignment and a hybrid encoder design to extract view-consistent representations for an ACT-style policy, enabling robust execution across diverse perspectives. We further optimize the data collection pipeline and validate the system in both simulation and real-world tasks. Experiments show that VIL achieves stable performance under viewpoint shifts, challenging low-horizon scenarios, and dynamic perspectives, outperforming SOTA methods and demonstrating a scalable pipeline for manipulator-independent, viewpoint-robust policy learning. The project repository containing code and hardware is available at https://github.com/liboyan233/VIL.git.