ICRA 2026

DSPv2: Improved Dense Policy for Effective and Generalizable Whole-Body Mobile Manipulation

Yue Su, Chubin Zhang, Sijin Chen, Liufan Tan, Yansong Tang, Jianan Wang, Xihui Liu

AI summary

DSPv2 achieves robust, generalizable whole-body mobile manipulation by fusing multi-view 2D semantic and 3D spatial features, and generating coherent actions via a bidirectional dense autoregressive policy.

Whole-body manipulation, Dense Policy, Multi-view fusion, Policy generalization, Imitation learning, Mobile manipulation

Problem

Whole-body mobile manipulation policies struggle with complex multi-view observations, poor generalization to unseen environments or objects, and error amplification across high-dimensional robot components.

Approach

The method aligns sparse 3D spatial features with dense multi-view 2D semantic features using a Q-former, then processes the fused representation through a dense autoregressive action head that predicts whole-body trajectories bidirectionally.
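The alignment step can be pictured as query-based cross-attention, which is the core operation of a Q-former. The sketch below is illustrative only: it assumes the sparse 3D tokens act as queries over the dense 2D tokens and that a residual sum forms the fused output. The names `cross_attend` and `fuse` are hypothetical, and the learned projections, multiple heads, and stacked layers of a real Q-former are omitted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    # Scaled dot-product cross-attention: each query row pools the value rows.
    d = len(queries[0])
    pooled_rows = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        pooled = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        pooled_rows.append(pooled)
    return pooled_rows

def fuse(spatial_tokens, semantic_tokens):
    # Assumption: 3D tokens are the queries; 2D tokens serve as keys and values.
    attended = cross_attend(spatial_tokens, semantic_tokens, semantic_tokens)
    # Residual sum keeps the geometric signal alongside the pooled semantics.
    return [[s + a for s, a in zip(s_row, a_row)]
            for s_row, a_row in zip(spatial_tokens, attended)]

# Toy usage: 2 sparse 3D tokens, 3 dense 2D tokens, feature width 2.
spatial = [[1.0, 0.0], [0.0, 1.0]]
semantic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse(spatial, semantic)
```

The output has one fused token per 3D token, so the downstream action head sees a short sequence regardless of how many camera views contribute 2D features.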

Key results

Surpasses existing whole-body policies in success rates across five real-world tasks
Demonstrates strong generalization to unseen objects, lighting, layouts, and spatial arrangements
Extends the Dense Policy paradigm to enable precise, coherent whole-body action generation
Mitigates inter-component error amplification through bidirectional autoregressive prediction
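The bidirectional dense autoregressive prediction mentioned above can be read as a coarse-to-fine expansion: a short action sequence is repeatedly upsampled and then refined with context from both temporal directions. The sketch below is a toy under stated assumptions, not the paper's model. `upsample`, `expand`, and `smooth` are hypothetical names, and `smooth` is a neighbour-averaging stand-in for what would be a learned, observation-conditioned refiner.

```python
def upsample(seq):
    # Double the temporal resolution by repeating every action step once.
    return [list(step) for step in seq for _ in range(2)]

def smooth(seq):
    # Toy stand-in for a learned refiner: each step is averaged with its
    # left AND right neighbours, so information flows in both directions.
    n = len(seq)
    refined = []
    for t in range(n):
        left = seq[max(t - 1, 0)]
        right = seq[min(t + 1, n - 1)]
        refined.append([(l + c + r) / 3 for l, c, r in zip(left, seq[t], right)])
    return refined

def expand(seed, refine, levels):
    # Start from one coarse action and alternate upsampling with refinement.
    seq = [list(seed)]
    for _ in range(levels):
        seq = refine(upsample(seq))
    return seq

# Toy usage: a 2-dimensional seed action expanded over 3 levels.
trajectory = expand([0.0, 1.0], smooth, levels=3)
```

In this sketch the horizon doubles at every level, so a trajectory of length T takes about log2(T) refinement passes instead of T sequential next-step predictions. Because every pass refines the whole action vector at once, all components of the trajectory are updated jointly.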

Why it matters

Provides a scalable, generalizable policy framework that bridges the gap between simulation-trained manipulation and reliable real-world household robot deployment.

Abstract

Learning whole-body mobile manipulation via imitation is essential for generalizing robotic skills to diverse environments and complex tasks. However, this goal is hindered by significant challenges, particularly in effectively processing complex observation, achieving robust generalization, and generating coherent actions. To address these issues, we propose DSPv2, a novel policy architecture. DSPv2 introduces an effective encoding scheme that aligns 3D spatial features with multi-view 2D semantic features. This fusion enables the policy to achieve broad generalization while retaining the fine-grained perception necessary for precise control. Furthermore, we extend the Dense Policy paradigm to the whole-body mobile manipulation domain, demonstrating its effectiveness in generating coherent and precise actions for the whole-body robotic platform. Extensive experiments show that our method significantly outperforms existing approaches in both task performance and generalization ability. Project page is available at: https://selen-suyue.github.io/DSPv2Net/.

Index terms

Imitation Learning, Bimanual Manipulation, Mobile Manipulation