ICRA 2026

Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning

Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, Matthew Walter

AI summary

Explicitly conditioning RGB-only imitation learning policies on camera extrinsics using Plücker embeddings significantly improves viewpoint invariance and generalization across diverse camera poses.

View-invariant learning; Camera conditioning; Plücker embeddings; Imitation learning; Robot manipulation; Generalization

Problem

Standard RGB policies trained on fixed viewpoints fail in real-world deployments due to viewpoint shifts, as they implicitly rely on static background cues to infer camera geometry rather than learning true invariance.

Approach

We explicitly condition standard behavior cloning policies on camera extrinsics by integrating per-pixel Plücker ray embeddings into the visual input pipeline, decoupling pose estimation from manipulation learning.
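As a rough illustration of what a per-pixel Plücker ray embedding looks like, the sketch below builds a six-channel map from a pinhole intrinsics matrix and a camera-to-world pose. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `plucker_ray_map`, the pixel-center convention, and the channel order (direction first, then moment) are all hypothetical choices made for this example.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Return an (H, W, 6) map holding (d, o x d) for every pixel ray.

    K            : (3, 3) pinhole intrinsics (assumed, no distortion).
    cam_to_world : (4, 4) homogeneous camera-to-world transform.
    The (direction, moment) channel order is an assumption of this sketch.
    """
    # Pixel centers; u runs along the width, v along the height.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project each pixel to a camera-frame direction: K^{-1} [u, v, 1]^T.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate directions into the world frame and normalize to unit length.
    rotation = cam_to_world[:3, :3]
    origin = cam_to_world[:3, 3]
    dirs_world = dirs_cam @ rotation.T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Plücker moment m = o x d, where o is the camera center in the world frame.
    moment = np.cross(np.broadcast_to(origin, dirs_world.shape), dirs_world)

    return np.concatenate([dirs_world, moment], axis=-1)


# Example: a 64x48 camera placed at (0.5, -0.2, 1.0) with identity rotation.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [0.5, -0.2, 1.0]
rays = plucker_ray_map(K, pose, height=48, width=64)
```

Because the map has the same spatial layout as the image, one simple way to feed it to a policy is to stack it with the RGB channels; how the conditioning is actually fused into each policy architecture is not specified in this summary.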

Key results

Explicit camera conditioning boosts success rates across ACT, Diffusion Policy, and SmolVLA
Policies without conditioning exploit fixed background cues, causing performance collapse under viewpoint shifts
Six new benchmark tasks in robosuite and ManiSkill isolate viewpoint generalization
Delta end-effector actions and random cropping further enhance conditioning benefits
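One practical detail when random cropping is combined with per-pixel camera conditioning: the crop window presumably has to be shared between the image and its ray map, otherwise each pixel would be paired with the wrong ray. The sketch below shows that idea under this assumption; the helper `random_crop_pair` and its signature are hypothetical and are not taken from the released code.

```python
import numpy as np

def random_crop_pair(image, ray_map, crop_h, crop_w, rng):
    """Apply one random crop window to both an image and its per-pixel ray map.

    image   : (H, W, C) array.
    ray_map : (H, W, 6) array aligned pixel-for-pixel with the image.
    rng     : a numpy.random.Generator.
    """
    height, width = image.shape[:2]
    if ray_map.shape[:2] != (height, width):
        raise ValueError("image and ray_map must share spatial dimensions")

    # Sample the top-left corner so the window stays inside the image.
    top = int(rng.integers(0, height - crop_h + 1))
    left = int(rng.integers(0, width - crop_w + 1))
    window = (slice(top, top + crop_h), slice(left, left + crop_w))

    # The same window indexes both arrays, keeping pixels and rays aligned.
    return image[window], ray_map[window]


# Example with placeholder data: a 48x64 RGB image and a matching 6-channel map.
rng = np.random.default_rng(0)
image = rng.random((48, 64, 3))
ray_map = rng.random((48, 64, 6))
image_crop, ray_crop = random_crop_pair(image, ray_map, 40, 56, rng)
```

Cropping the ray map rather than recomputing it keeps the original per-pixel rays, so the policy still sees the true geometry of each retained pixel.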

Why it matters

Enables robust, view-invariant robot control with standard RGB cameras, critical for real-world deployment and cross-embodiment transfer without requiring depth sensors or complex pose estimation.

Abstract

We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plücker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in robosuite and ManiSkill that pair “fixed” and “randomized” scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes. This shortcut collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code to facilitate reproducibility and future research. Code and project materials are available at ripl.github.io/know your camera.

Index terms

Imitation Learning; Deep Learning in Grasping and Manipulation; Deep Learning Methods