Observer-Actor: Active Vision Imitation Learning with Sparse-View Gaussian Splatting
Yilong Wang, Cheng Qian, Ruomeng Fan, Edward Johns
AI summary
Problem
Current imitation learning relies on static or wrist-mounted cameras that struggle with occlusions and limited viewpoints, while existing active vision methods require fixed roles or extensive human demonstrations.
Approach
The ObAct framework dynamically assigns observer and actor roles at test time. The observer builds a 3D Gaussian Splatting model from three captured images, searches it for a low-occlusion viewpoint, and moves there before the actor executes the task.
Key results
- Introduces the ObAct decoupled observer–actor framework
- First application of sparse-view 3DGS for test-time active vision optimization
- Extends trajectory transfer and behavior cloning to dynamic view-conditioned settings
- Achieves up to a 233% relative improvement in success rate over static cameras under occlusion
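A note on reading the percentages: a success rate cannot exceed 100%, so an improvement of 233% can only be a relative gain over the baseline, not a difference in percentage points. The short sketch below shows that arithmetic. The function names and the success rates are hypothetical, chosen only for illustration, and are not figures from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline


def max_baseline_for_gain(gain_percent: float) -> float:
    """Largest baseline success rate (0..1) that still allows this relative gain.

    Follows from improved = baseline * (1 + gain) <= 1.
    """
    return 1.0 / (1.0 + gain_percent / 100.0)


# Hypothetical success rates, NOT values reported in the paper.
static_camera = 0.24
active_view = 0.80
print(f"{relative_improvement(static_camera, active_view):.0f}% relative improvement")

# A 233% relative gain is only possible if the baseline succeeded at most ~30% of the time.
print(f"baseline <= {max_baseline_for_gain(233):.2f}")
```

The second function makes the implication explicit: large relative gains are only possible when the static-camera baseline fails most of the time.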
Why it matters
Enables more robust and data-efficient robotic manipulation policies by dynamically optimizing camera views, benefiting researchers and practitioners in active vision and imitation learning.
Abstract
We propose Observer-Actor (ObAct), a novel framework for active vision imitation learning in which the observer moves to a viewpoint that gives the actor optimal visual observations. We study ObAct on a dual-arm robotic system equipped with wrist-mounted cameras. At test time, ObAct dynamically assigns observer and actor roles: the observer arm constructs a 3D Gaussian Splatting (3DGS) representation from three images, virtually explores this representation to find an optimal camera pose, then moves to this pose; the actor arm then executes a policy using the observer’s observations. This formulation enhances the clarity and visibility of both the object and the gripper in the policy’s observations. As a result, we enable the training of ambidextrous policies on observations that remain closer to the occlusion-free training distribution, leading to more robust policies. We study this formulation with two existing imitation learning methods, trajectory transfer and behavior cloning, and experiments show that ObAct significantly outperforms static-camera setups: trajectory transfer improves by 145% without occlusion and 233% with occlusion, while behavior cloning improves by 75% and 143%, respectively. Videos are available at https://obact.github.io.
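To make the "virtually explore, then pick a pose" step concrete, here is a minimal sketch of viewpoint selection by occlusion scoring. It is not the authors' method. It replaces the 3DGS rendering with a simple geometric proxy: occluders are spheres, and a candidate camera is scored by the fraction of target points it can see along a straight line. All names (`visibility`, `hemisphere_candidates`, `best_view`) and the toy scene are assumptions made for illustration.

```python
import math
from itertools import product


def segment_hits_sphere(p, q, center, radius):
    """True if the segment p->q passes within `radius` of `center` (assumes p != q)."""
    d = [q[i] - p[i] for i in range(3)]
    f = [center[i] - p[i] for i in range(3)]
    dd = sum(x * x for x in d)
    # Parameter of the point on the segment closest to the sphere centre.
    t = max(0.0, min(1.0, sum(f[i] * d[i] for i in range(3)) / dd))
    closest = [p[i] + t * d[i] for i in range(3)]
    return math.dist(closest, center) < radius


def visibility(camera, targets, occluders):
    """Fraction of target points with an unobstructed line of sight from `camera`."""
    visible = sum(
        not any(segment_hits_sphere(camera, t, c, r) for c, r in occluders)
        for t in targets
    )
    return visible / len(targets)


def hemisphere_candidates(center, radius, n_azimuth=12, n_elevation=3):
    """Sample candidate camera positions on a hemisphere above `center`."""
    poses = []
    for i, j in product(range(n_azimuth), range(1, n_elevation + 1)):
        az = 2 * math.pi * i / n_azimuth
        el = (math.pi / 2) * j / (n_elevation + 1)
        poses.append((
            center[0] + radius * math.cos(el) * math.cos(az),
            center[1] + radius * math.cos(el) * math.sin(az),
            center[2] + radius * math.sin(el),
        ))
    return poses


def best_view(candidates, targets, occluders):
    """Return the candidate with the highest visibility score."""
    return max(candidates, key=lambda cam: visibility(cam, targets, occluders))


# Toy scene (hypothetical): three points on the object, one spherical occluder on the +x side.
targets = [(0.0, 0.0, 0.0), (0.02, 0.0, 0.0), (0.0, 0.02, 0.0)]
occluders = [((0.15, 0.0, 0.05), 0.08)]
candidates = hemisphere_candidates((0.0, 0.0, 0.0), radius=0.5)
best = best_view(candidates, targets, occluders)
print(best, visibility(best, targets, occluders))
```

In a setup like the one the abstract describes, the scoring function would instead be evaluated on views rendered from the 3DGS model and would also account for gripper visibility. The outer loop of scoring candidates and keeping the best one would stay the same.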