Seeing the Bigger Picture: 3D Latent Mapping for Mobile Manipulation Policy Learning
Sunghwan Kim, Woojeh Chung, Zhirui Dai, Dwait Bhatt, Arth Shukla, Hao Su, Yulun Tian, Nikolay Atanasov
AI summary
Problem
Most current mobile manipulation policies rely solely on 2D images, which provide no consistent 3D understanding of the scene and make long-horizon reasoning difficult. Existing 3D methods either lack temporal consistency or cannot adapt to novel views online.
Approach
We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning framework that incrementally fuses multiview camera observations into a persistent 3D grid of scene-specific latent features. A pre-trained, scene-agnostic decoder reconstructs target embeddings from this map, which allows the map features to be optimized online during task execution. A 3D feature aggregator distills global context from the map into a token that conditions a manipulation policy trained with behavior cloning or reinforcement learning.
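The incremental fusion step can be illustrated with a short sketch. This is not the SBP implementation: the class `LatentVoxelMap`, its parameters, and the running-mean update are assumptions chosen for illustration. SBP optimizes map features through a pre-trained decoder, which this sketch replaces with simple per-voxel averaging.

```python
import numpy as np


class LatentVoxelMap:
    """Hypothetical dense voxel grid of latent features (illustration only)."""

    def __init__(self, origin, voxel_size, dims, feat_dim):
        self.origin = np.asarray(origin, dtype=np.float64)
        self.voxel_size = float(voxel_size)
        self.dims = np.asarray(dims, dtype=np.int64)
        # One latent vector and one observation count per voxel.
        self.features = np.zeros((*dims, feat_dim), dtype=np.float64)
        self.counts = np.zeros(tuple(dims), dtype=np.int64)

    def integrate(self, points, feats):
        """Fuse per-point features (N, D) observed at world-frame points (N, 3).

        Assumption: a running mean stands in for the decoder-driven
        online optimization described in the text.
        """
        points = np.asarray(points, dtype=np.float64)
        feats = np.asarray(feats, dtype=np.float64)
        idx = np.floor((points - self.origin) / self.voxel_size).astype(np.int64)
        # Discard points that fall outside the grid bounds.
        inside = np.all((idx >= 0) & (idx < self.dims), axis=1)
        for (i, j, k), f in zip(idx[inside], feats[inside]):
            self.counts[i, j, k] += 1
            n = self.counts[i, j, k]
            # Incremental mean: new_mean = old_mean + (f - old_mean) / n
            self.features[i, j, k] += (f - self.features[i, j, k]) / n
```

Calling `integrate` once per camera frame keeps the map persistent across views: voxels that leave the field of view retain their fused features, which is the property the policy relies on for memory.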
Key results
- Incrementally constructed 3D latent map serves as persistent spatial and temporal memory
- Policy conditioned on global map tokens outperforms image-based baselines in novel scenes
- 15% higher success rate than image-based policies on the sequential manipulation task
- Modular encoder-decoder design enables cross-scene generalization without per-scene retraining
Why it matters
A persistent 3D latent map gives the policy context beyond the robot's current field of view and memory over long horizons, which supports long-horizon mobile manipulation and generalization to novel scenes.
Abstract
In this paper, we demonstrate that mobile manipulation policies utilizing a 3D latent map achieve stronger spatial and temporal reasoning than policies relying solely on images. We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning approach that operates directly on a 3D map of latent features. In SBP, the map extends perception beyond the robot’s current field of view and aggregates observations over long horizons. Our mapping approach incrementally fuses multiview observations into a grid of scene-specific latent features. A pre-trained, scene-agnostic decoder reconstructs target embeddings from these features and enables online optimization of the map features during task execution. A policy, trainable with behavior cloning or reinforcement learning, treats the latent map as a state variable and uses global context from the map obtained via a 3D feature aggregator. We evaluate SBP on scene-level mobile manipulation and sequential tabletop manipulation tasks. Our experiments demonstrate that SBP (i) reasons globally over the scene, (ii) leverages the map as long-horizon memory, and (iii) outperforms image-based policies in both in-distribution and novel scenes, e.g., improving the success rate by 15% for the sequential manipulation task.
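To make the role of the 3D feature aggregator concrete, the sketch below pools a grid of latent features into a single global token and concatenates it with the robot state as policy input. The abstract does not specify the aggregator's architecture, so everything here is an assumption: single-query attention pooling is swapped in as one plausible choice, and the names `aggregate_map_token`, `policy_input`, `query`, `w_key`, and `w_value` are hypothetical.

```python
import numpy as np


def aggregate_map_token(features, occupied, query, w_key, w_value):
    """Attention-pool occupied voxel features into one global token.

    features: (X, Y, Z, D) latent grid
    occupied: (X, Y, Z) boolean mask of voxels that have been observed
    query:    (H,) query vector (learned in practice; hypothetical here)
    w_key, w_value: (D, H) projection matrices (hypothetical)
    """
    feats = features[occupied]                        # (M, D)
    if feats.shape[0] == 0:
        # Nothing observed yet: return a zero token.
        return np.zeros(w_value.shape[1])
    keys = feats @ w_key                              # (M, H)
    values = feats @ w_value                          # (M, H)
    scores = keys @ query / np.sqrt(query.shape[0])   # (M,)
    # Numerically stable softmax over occupied voxels.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                           # (H,)


def policy_input(map_token, proprio):
    """Condition the policy on the map by concatenating token and robot state."""
    return np.concatenate([map_token, proprio])
```

Because the token is computed from the whole map rather than the current image, the policy input can reflect parts of the scene that are outside the camera's current field of view.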