Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning
Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, Matthew Walter
AI summary
Problem
Standard RGB policies trained on fixed viewpoints fail in real-world deployments due to viewpoint shifts, as they implicitly rely on static background cues to infer camera geometry rather than learning true invariance.
Approach
We explicitly condition standard behavior cloning policies on camera extrinsics by integrating per-pixel Plücker ray embeddings into the visual input pipeline, decoupling pose estimation from manipulation learning.
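To make the idea of a per-pixel Plücker ray embedding concrete, here is a minimal NumPy sketch. It is not taken from the released code: the function name `plucker_ray_map`, the pinhole intrinsics matrix `K`, the 4x4 camera-to-world pose convention, the half-pixel offset, and the channel ordering (direction first, then moment) are all assumptions made for illustration. Each pixel's ray is encoded as its unit direction d in the world frame together with the moment o × d, where o is the camera center.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d), one per pixel.

    Assumed conventions (hypothetical, not from the paper's code):
    K is a 3x3 pinhole intrinsics matrix, cam_to_world is a 4x4 pose.
    """
    # Pixel centers in image coordinates (u along columns, v along rows).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project through the intrinsics to get camera-frame directions.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate into the world frame and normalize to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Moment m = o x d, where o is the camera center in the world frame.
    moment = np.cross(np.broadcast_to(t, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)
```

The resulting six-channel map has the same spatial layout as the RGB image, so one plausible way to feed it to a policy is to stack it with the image along the channel axis or pass it through a parallel encoder. How the paper integrates it into each architecture is not specified here.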
Key results
- Explicit camera conditioning boosts success rates across ACT, Diffusion Policy, and SmolVLA
- Policies without conditioning exploit fixed background cues, causing performance collapse under viewpoint shifts
- Six new benchmark tasks in robosuite and ManiSkill, each with "fixed" and "randomized" scene variants, isolate viewpoint generalization
- Delta end-effector actions and random cropping further enhance conditioning benefits
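One practical detail when combining random cropping with per-pixel camera conditioning is that the crop has to be applied identically to the image and to its ray map, otherwise pixels no longer line up with their rays. The sketch below illustrates that constraint only. The function name `random_crop_with_rays`, the channel-wise concatenation, and the (H, W, C) array layout are assumptions for illustration and are not drawn from the authors' implementation.

```python
import numpy as np

def random_crop_with_rays(image, ray_map, crop_h, crop_w, rng):
    """Apply one random crop to an RGB image and its ray map, then stack them.

    image:   (H, W, 3) array
    ray_map: (H, W, 6) array of per-pixel ray embeddings
    rng:     a numpy.random.Generator
    """
    height, width = image.shape[:2]
    # Sample the top-left corner so the window fits inside the image.
    top = rng.integers(0, height - crop_h + 1)
    left = rng.integers(0, width - crop_w + 1)
    window = (slice(top, top + crop_h), slice(left, left + crop_w))
    # The same window is used for both arrays, so each pixel keeps its own ray.
    return np.concatenate([image[window], ray_map[window]], axis=-1)
```

Because the ray map is cropped rather than recomputed, the cropped pixels retain the rays of the original camera, so the embedding still describes the true viewing geometry after augmentation.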
Why it matters
Enables robust, view-invariant robot control with standard RGB cameras, critical for real-world deployment and cross-embodiment transfer without requiring depth sensors or complex pose estimation.
Abstract
We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plücker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in robosuite and ManiSkill that pair “fixed” and “randomized” scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes. This shortcut collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code to facilitate reproducibility and future research. Code and project materials are available at ripl.github.io/know_your_camera.