ICRA 2026

Seeing the Bigger Picture: 3D Latent Mapping for Mobile Manipulation Policy Learning

Sunghwan Kim, Woojeh Chung, Zhirui Dai, Dwait Bhatt, Arth Shukla, Hao Su, Yulun Tian, Nikolay Atanasov

AI summary

Conditioning mobile manipulation policies on an incrementally built 3D latent map significantly improves long-horizon reasoning and task success over purely image-based approaches.

3D latent mapping, mobile manipulation, policy learning, long-horizon reasoning, vision-language models

Problem

Current mobile manipulation policies rely on 2D images, which lack consistent 3D understanding and struggle with long-horizon reasoning, while existing 3D methods either lack temporal consistency or cannot adapt to novel views online.

Approach

We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning framework that incrementally builds a persistent 3D latent feature map from streaming camera data. A pre-trained decoder reconstructs semantic embeddings from this map, and a 3D feature aggregator distills global context into a token that conditions a manipulation policy trained via behavior cloning or reinforcement learning.
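The mapping and aggregation steps described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a sparse voxel grid, replaces the learned, optimized latent features with a simple running mean, and replaces the learned 3D feature aggregator with mean pooling. The names `LatentVoxelMap`, `fuse`, `global_token`, and `VOXEL_SIZE` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

VOXEL_SIZE = 0.5  # metres per voxel edge (hypothetical value)


def voxel_index(point, voxel_size=VOXEL_SIZE):
    """Map a 3D point (x, y, z) to integer voxel coordinates."""
    return tuple(int(c // voxel_size) for c in point)


class LatentVoxelMap:
    """Sparse voxel grid holding one feature vector per observed voxel."""

    def __init__(self, feature_dim):
        self.feature_dim = feature_dim
        self.features = {}              # voxel index -> feature vector
        self.counts = defaultdict(int)  # voxel index -> observation count

    def fuse(self, points, point_features):
        """Fold one frame of 3D points and their features into the map.

        Assumption: a running mean stands in for learned feature updates.
        """
        for point, feat in zip(points, point_features):
            idx = voxel_index(point)
            n = self.counts[idx]
            old = self.features.get(idx, [0.0] * self.feature_dim)
            self.features[idx] = [(o * n + f) / (n + 1) for o, f in zip(old, feat)]
            self.counts[idx] = n + 1

    def global_token(self):
        """Mean-pool all voxel features into a single context vector.

        Assumption: mean pooling stands in for a learned aggregator.
        """
        if not self.features:
            return [0.0] * self.feature_dim
        total = [0.0] * self.feature_dim
        for feat in self.features.values():
            total = [t + f for t, f in zip(total, feat)]
        return [t / len(self.features) for t in total]


world_map = LatentVoxelMap(feature_dim=2)
# Frame 1 observes two different voxels.
world_map.fuse([(0.1, 0.1, 0.1), (1.2, 0.0, 0.3)], [[1.0, 0.0], [0.0, 1.0]])
# Frame 2 revisits the first voxel, so its feature is averaged, not overwritten.
world_map.fuse([(0.2, 0.3, 0.4)], [[3.0, 0.0]])
token = world_map.global_token()
```

The map persists between frames, so the token reflects regions that are no longer in view. In a policy, a vector like `token` would be passed alongside the current observation.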

Key results

Incrementally constructed 3D latent map serves as persistent spatial and temporal memory
Policy conditioned on global map tokens outperforms image-based baselines in novel scenes
15% higher success rate than image-based policies on the sequential manipulation task
Modular encoder-decoder design enables cross-scene generalization without per-scene retraining

Why it matters

Enables robots to perform complex, long-horizon mobile manipulation tasks in dynamic environments by providing persistent 3D context, advancing the scalability of vision-language models for real-world robotics.

Abstract

In this paper, we demonstrate that mobile manipulation policies utilizing a 3D latent map achieve stronger spatial and temporal reasoning than policies relying solely on images. We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning approach that operates directly on a 3D map of latent features. In SBP, the map extends perception beyond the robot’s current field of view and aggregates observations over long horizons. Our mapping approach incrementally fuses multiview observations into a grid of scene-specific latent features. A pre-trained, scene-agnostic decoder reconstructs target embeddings from these features and enables online optimization of the map features during task execution. A policy, trainable with behavior cloning or reinforcement learning, treats the latent map as a state variable and uses global context from the map obtained via a 3D feature aggregator. We evaluate SBP on scene-level mobile manipulation and sequential tabletop manipulation tasks. Our experiments demonstrate that SBP (i) reasons globally over the scene, (ii) leverages the map as long-horizon memory, and (iii) outperforms image-based policies in both in-distribution and novel scenes, e.g., improving the success rate by 15% for the sequential manipulation task.
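The idea of optimizing map features online through a fixed decoder can be shown with a small, hedged example. Everything specific here is an assumption for illustration: the decoder is a frozen linear map, the loss is squared error against a target embedding, and the update is plain gradient descent. The abstract does not state the decoder architecture, loss, or optimizer, and `decode`, `refine_latent`, and `squared_error` are hypothetical names.

```python
def decode(weights, latent):
    """Frozen linear decoder (assumption): embedding = weights @ latent."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]


def squared_error(weights, latent, target):
    """Reconstruction loss between the decoded latent and the target embedding."""
    return sum((p - t) ** 2 for p, t in zip(decode(weights, latent), target))


def refine_latent(weights, latent, target, lr=0.1, steps=200):
    """Gradient descent on the latent only; the decoder weights never change."""
    latent = list(latent)
    for _ in range(steps):
        residual = [p - t for p, t in zip(decode(weights, latent), target)]
        # Gradient of ||W z - t||^2 with respect to z is 2 * W^T (W z - t).
        grad = [
            2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
            for j in range(len(latent))
        ]
        latent = [z - lr * g for z, g in zip(latent, grad)]
    return latent


# Hypothetical frozen decoder: 2-D latent -> 3-D embedding.
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
# Target embedding as it might be observed for one map cell.
target = decode(W, [0.5, -1.0])

initial = [0.0, 0.0]
refined = refine_latent(W, initial, target)
error_before = squared_error(W, initial, target)
error_after = squared_error(W, refined, target)
```

Because only the latent is updated, the same decoder could in principle be reused across scenes while each scene keeps its own features, which matches the scene-agnostic decoder and scene-specific features split described in the abstract.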

Index terms

Perception for Grasping and Manipulation; Semantic Scene Understanding; Deep Learning in Grasping and Manipulation