GP3: A 3D Geometry-Aware Policy with Multi-View Images for Robotic Manipulation
Quanhao Qian, Guoyang Zhao, Gongjie Zhang, Jiuniu Wang, Junlong Gao, Deli Zhao, Ran Xu
AI summary
Problem
Existing methods rely on unstable explicit depth sensors or struggle to generalize when learning implicit 3D geometry from RGB images, limiting robust manipulation in diverse real-world scenes.
Approach
GP3 fine-tunes a pretrained 3D reconstruction model on multi-view RGB inputs to extract dense spatial features, then uses a novel language-conditioned attention mechanism to fuse these features with task instructions for action prediction.
Key results
- Fine-tuned RoboVGGT encoder for robust multi-view 3D reconstruction from RGB
- G-FiLM module that uses language guidance to suppress redundant cross-view attention
- State-of-the-art performance with 11.2% improvement on MetaWorld and 22.7% on RLBench
- Effective real-world transfer to depth-challenging scenes with minimal fine-tuning
Why it matters
Enables robust, sensor-agnostic 3D-aware robotic manipulation that generalizes to unseen tasks without relying on unstable depth hardware.
Abstract
Effective robotic manipulation relies on a precise understanding of 3D scene geometry, and one of the most straightforward ways to acquire such geometry is through multi-view observations. Motivated by this, we present GP3, a 3D geometry-aware robotic manipulation policy that leverages multi-view input. GP3 employs a spatial encoder to infer dense spatial features from RGB observations, which enable the estimation of depth and camera parameters, leading to a compact yet expressive 3D scene representation tailored for manipulation. This representation is fused with language instructions and translated into continuous actions via a lightweight policy head. We further introduce G-FiLM, which applies language-conditioned FiLM only to cross-view global attention. Comprehensive experiments demonstrate that GP3 consistently outperforms state-of-the-art methods on simulated benchmarks. Furthermore, GP3 transfers effectively to real-world robots in depth-challenging scenes with only minimal fine-tuning. These results highlight GP3 as a practical, sensor-agnostic solution for geometry-aware robotic manipulation.
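The abstract does not spell out how G-FiLM is implemented, so the following is only a minimal sketch of generic FiLM (feature-wise linear modulation), the operation that G-FiLM is described as applying to cross-view global attention. A conditioning vector (here a stand-in for a language embedding) is projected to a per-channel scale `gamma` and shift `beta`, which then modulate every token. The function name `film`, the projection matrices `w_gamma` and `w_beta`, and all shapes are assumptions made for illustration; none of them come from the paper.

```python
import numpy as np


def film(features, lang_embedding, w_gamma, w_beta):
    """Generic FiLM: per-channel scale and shift predicted from a conditioning vector.

    features:       (num_tokens, channels) tokens, e.g. pooled across camera views
    lang_embedding: (lang_dim,) hypothetical instruction embedding
    w_gamma, w_beta: (lang_dim, channels) hypothetical linear projections
    """
    gamma = lang_embedding @ w_gamma  # (channels,) scale per channel
    beta = lang_embedding @ w_beta    # (channels,) shift per channel
    # Broadcasting applies the same scale and shift to every token.
    return gamma * features + beta


# Toy example: 3 views x 4 tokens per view, 8 channels, 16-dim language vector.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3 * 4, 8))
lang = rng.normal(size=16)
w_gamma = rng.normal(size=(16, 8)) * 0.1
w_beta = rng.normal(size=(16, 8)) * 0.1

modulated = film(tokens, lang, w_gamma, w_beta)
```

In this sketch the modulation is applied to a flat set of tokens from all views. Restricting it to the cross-view attention layers, as the abstract describes, would mean calling something like `film` only inside those layers and leaving per-view processing unmodulated; how GP3 does this in detail is not stated here.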