ICRA 2026

GP3: A 3D Geometry-Aware Policy with Multi-View Images for Robotic Manipulation

Quanhao Qian, Guoyang Zhao, Gongjie Zhang, Jiuniu Wang, Junlong Gao, Deli Zhao, Ran Xu

AI summary

GP3 achieves state-of-the-art robotic manipulation using only multi-view RGB images by learning implicit 3D geometry and language-guided attention, eliminating the need for explicit depth sensors.
Keywords: multi-view RGB, 3D geometry, robotic manipulation, language-conditioned attention, implicit spatial representation, vision-language-action

Problem

Existing methods either rely on explicit depth sensors, which can be unstable, or learn implicit 3D geometry from RGB images but struggle to generalize. Both shortcomings limit robust manipulation in diverse real-world scenes.

Approach

GP3 fine-tunes a pretrained 3D reconstruction model on multi-view RGB inputs to extract dense spatial features, then uses a novel language-conditioned attention mechanism to fuse these features with task instructions for action prediction.

Key results

  • Fine-tuned RoboVGGT encoder for robust multi-view 3D reconstruction from RGB
  • G-FiLM module that uses language guidance to suppress redundant cross-view attention
  • State-of-the-art performance with 11.2% improvement on MetaWorld and 22.7% on RLBench
  • Effective real-world transfer to depth-challenging scenes with minimal fine-tuning

Why it matters

Enables robust, sensor-agnostic 3D-aware robotic manipulation that generalizes to unseen tasks without relying on unstable depth hardware.

Abstract

Effective robotic manipulation relies on a precise understanding of 3D scene geometry, and one of the most straightforward ways to acquire such geometry is through multi-view observations. Motivated by this, we present GP3, a 3D geometry-aware robotic manipulation policy that leverages multi-view input. GP3 employs a spatial encoder to infer dense spatial features from RGB observations, which enable the estimation of depth and camera parameters, leading to a compact yet expressive 3D scene representation tailored for manipulation. This representation is fused with language instructions and translated into continuous actions via a lightweight policy head. We further introduce G-FiLM, which applies language-conditioned FiLM only to cross-view global attention. Comprehensive experiments demonstrate that GP3 consistently outperforms state-of-the-art methods on simulated benchmarks. Furthermore, GP3 transfers effectively to real-world robots in depth-challenging scenes with only minimal fine-tuning. These results highlight GP3 as a practical, sensor-agnostic solution for geometry-aware robotic manipulation.
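
The G-FiLM idea named in the abstract (language-conditioned FiLM applied only to cross-view global attention) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. FiLM itself is the standard channel-wise scale-and-shift, y = gamma * x + beta, with gamma and beta predicted from the conditioning input. Everything else here is an assumption made for illustration: the names `linear`, `film`, and `modulate_block`, the "frame"/"global" block labels, the toy weights, and the choice to modulate a block's token features rather than some specific point inside the attention computation, which the abstract does not specify.

```python
from typing import List, Sequence

Vector = List[float]


def linear(x: Sequence[float],
           weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> Vector:
    """Affine map y = W x + b; a stand-in for a learned projection."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(tokens: Sequence[Sequence[float]],
         gamma: Sequence[float],
         beta: Sequence[float]) -> List[Vector]:
    """FiLM: scale and shift each token channel-wise, y = gamma * x + beta."""
    return [[g * v + b for g, v, b in zip(gamma, tok, beta)]
            for tok in tokens]


def modulate_block(tokens, block_kind, lang_embedding,
                   w_gamma, b_gamma, w_beta, b_beta) -> List[Vector]:
    """Hypothetical gate: apply FiLM only in cross-view ("global") blocks."""
    if block_kind != "global":
        # Per-view ("frame") blocks pass through without language modulation.
        return [list(tok) for tok in tokens]
    gamma = linear(lang_embedding, w_gamma, b_gamma)
    beta = linear(lang_embedding, w_beta, b_beta)
    return film(tokens, gamma, beta)


# Toy example: a 2-d instruction embedding modulating 3-channel tokens.
lang = [1.0, -1.0]
w_gamma = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
b_gamma = [1.0, 1.0, 1.0]          # gamma starts near identity scaling
w_beta = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b_beta = [0.0, 0.0, 0.0]
tokens = [[2.0, 2.0, 2.0], [4.0, 4.0, 4.0]]

global_out = modulate_block(tokens, "global", lang,
                            w_gamma, b_gamma, w_beta, b_beta)
frame_out = modulate_block(tokens, "frame", lang,
                           w_gamma, b_gamma, w_beta, b_beta)
print(global_out)  # [[3.0, 1.0, 2.0], [6.0, 2.0, 4.0]]
print(frame_out)   # [[2.0, 2.0, 2.0], [4.0, 4.0, 4.0]]
```

In the toy run the instruction embedding yields gamma = [1.5, 0.5, 1.0], so the first channel is amplified and the second is damped in the global block, while the per-view block is left untouched. That selective gating is the only property the sketch is meant to convey.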

Index terms

Deep Learning for Visual Perception; Visual Servoing; Imitation Learning
