ICRA 2026

Zero-Shot Exocentric Viewpoint-Robust Imitation Learning (VIL): Bridging Handheld Gripper and Exocentric Views

Boyan Li, Peilin Meng, Chang Liu, Yulin Chen, Qi Zhou, Youyi Bi

AI summary

VIL enables zero-shot, viewpoint-robust imitation learning by combining a custom handheld gripper with a Swin Transformer-based encoder and inpainting alignment, achieving stable policy execution across diverse exocentric views without fine-tuning.

Imitation Learning, Viewpoint Robustness, Handheld Gripper, Zero-Shot Generalization, Swin Transformer, Inpainting Alignment

Problem

Existing handheld-gripper pipelines for robot learning rely on egocentric views, which limits global context and causes failures under viewpoint shifts. Imitation learning algorithms are themselves highly sensitive to changes in camera perspective, which hinders scalability and real-world deployment.

Approach

The authors design a Robotiq-like handheld gripper that captures both egocentric and exocentric views. It is paired with a zero-shot imitation learning algorithm that uses a hybrid Swin Transformer/ResNet encoder and a SAM2-based inpainting module to align appearances and extract viewpoint-consistent features for an ACT policy.
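
The data flow described above (align appearance, encode with two branches, pass features to the policy) can be illustrated with a toy sketch. This is not the authors' implementation: the function names are hypothetical, the segmentation mask and fill image are given as plain inputs rather than produced by SAM2 or an inpainting model, the two "branches" are simple statistics standing in for the Swin Transformer and ResNet encoders, and fusing the branches by concatenation is an assumption that this summary does not confirm.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities
Mask = List[List[int]]     # 1 where a pixel belongs to the gripper, else 0


def align_appearance(image: Image, gripper_mask: Mask, fill: Image) -> Image:
    """Replace masked (gripper) pixels with the corresponding pixels of `fill`.

    Assumption: in a real pipeline the mask would come from a segmenter
    such as SAM2 and `fill` from an inpainting model. Here both are inputs.
    """
    return [
        [f if m else p for p, m, f in zip(img_row, mask_row, fill_row)]
        for img_row, mask_row, fill_row in zip(image, gripper_mask, fill)
    ]


def global_branch(image: Image) -> List[float]:
    """Hypothetical stand-in for the transformer branch: the global mean."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels)]


def local_branch(image: Image) -> List[float]:
    """Hypothetical stand-in for the convolutional branch: per-row means."""
    return [sum(row) / len(row) for row in image]


def hybrid_features(image: Image) -> List[float]:
    """Concatenate both branches into one vector (fusion is an assumption)."""
    return global_branch(image) + local_branch(image)


image = [[0.0, 1.0], [1.0, 0.0]]
mask = [[0, 1], [0, 0]]          # top-right pixel is "gripper"
fill = [[0.5, 0.5], [0.5, 0.5]]  # replacement appearance
aligned = align_appearance(image, mask, fill)
features = hybrid_features(aligned)  # would be fed to the policy
```

Keeping alignment separate from encoding means the encoder only ever sees appearance-aligned images, which is one plausible reading of how the inpainting module and the hybrid encoder divide the work.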

Key results

93.3% success in simulated cube transfer under novel top-down views
Zero-shot policy stability across symmetric and novel exocentric viewpoints
Robust real-world manipulation under dynamic camera shifts
Swin Transformer outperforms ResNet and ViT for viewpoint generalization

Why it matters

Provides a scalable, low-cost, manipulator-independent pipeline for collecting and deploying viewpoint-robust imitation learning policies in real-world settings.

Abstract

Recent advances in robot learning have motivated integrated pipelines that combine hardware for data collection with imitation learning algorithms. Existing data collection methods like leader–follower, VR/AR, and exoskeletons rely on costly hardware and exhibit limited scalability, while imitation learning algorithms built on them remain highly sensitive to viewpoint shifts, further constraining generalizability. Handheld grippers provide a low-cost, robot-agnostic alternative, but prior systems bypass exocentric view alignment by relying solely on wrist-mounted cameras, resulting in narrowed observation and reduced policy robustness. We propose VIL, a framework pairing a customized handheld gripper with a zero-shot, exocentric viewpoint-robust imitation learning algorithm, bridging the handheld gripper with exocentric views. Our approach employs adapters for appearance alignment and a hybrid encoder design to extract view-consistent representations for an ACT-style policy, enabling robust execution across diverse perspectives. We further optimize the data collection pipeline and validate the system in both simulation and real-world tasks. Experiments show that VIL achieves stable performance under viewpoint shifts, challenging low-horizon scenarios, and dynamic perspectives, outperforming SOTA methods and demonstrating a scalable pipeline for manipulator-independent, viewpoint-robust policy learning. The project repository containing code and hardware is available at https://github.com/liboyan233/VIL.git.

Index terms

Imitation Learning, Learning from Demonstration, Grippers and Other End-Effectors