ICRA 2026

AdaptPNP: Integrating Prehensile and Non-Prehensile Skills for Adaptive Robotic Manipulation

Jinxuan Zhu, Chenrui Tie, Xinyi Cao, Yuran Wang, Jingxiang Guo, Zixuan Chen, Haonan Chen, Junting Chen, Yangyu Xiao, Ruihai Wu, Lin Shao

AI summary

A VLM-guided framework with digital-twin rehearsal and closed-loop reflection enables robots to adaptively combine grasping and contact-based skills for complex manipulation tasks.

Prehensile manipulation, Non-prehensile manipulation, Vision-language models, Task and motion planning, Digital twin, Adaptive robotics

Problem

Robots struggle to adaptively integrate grasping and contact-based manipulation because skill selection is combinatorial and existing planners lack physical reasoning.

Approach

A vision-language model generates high-level action sequences, a digital twin samples and validates 6D object poses for them, and the plan is iteratively refined through execution feedback.

Key results

Unified VLM-guided framework for adaptive P&NP skill scheduling
Digital-twin intermediate layer for physically-informed 6D pose generation
Closed-loop reflection mechanism enabling online replanning from execution errors
Superior performance over RL, MPC, and hierarchical VLM baselines across simulation and real-world hybrid tasks

Why it matters

Advances general-purpose robotic manipulation by enabling flexible, physically-aware skill composition in unstructured environments.

Abstract

Non-prehensile (NP) manipulation, in which robots alter object states without forming stable grasps (for example, pushing, poking, or sliding), significantly broadens robotic manipulation capabilities when grasping is infeasible or insufficient. However, enabling a unified framework that generalizes across different tasks, objects, and environments while seamlessly integrating non-prehensile and prehensile (P) actions remains challenging: robots must determine when to invoke NP skills, select the appropriate primitive for each context, and compose P and NP strategies into robust, multi-step plans. We introduce AdaptPNP, a vision-language model (VLM)-empowered task and motion planning framework that systematically selects and combines P and NP skills to accomplish diverse manipulation objectives. Our approach leverages a VLM to interpret visual scene observations and textual task descriptions, generating a high-level plan skeleton that prescribes the sequence and coordination of P and NP actions. A digital-twin-based object-centric intermediate layer predicts desired object poses, enabling proactive mental rehearsal of manipulation sequences. Finally, a control module synthesizes low-level robot commands, with continuous execution feedback enabling online task plan refinement and adaptive replanning through the VLM. We evaluate AdaptPNP across representative P&NP hybrid manipulation tasks in both simulation and real-world environments. These results underscore the potential of hybrid P&NP manipulation as a crucial step toward general-purpose, human-level robotic manipulation capabilities. More detailed information can be found at https://adaptpnp.github.io/.
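
The plan, rehearse, execute, and replan cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`Step`, `run_closed_loop`, `toy_planner`, `ToyTwin`, `toy_execute`) is hypothetical, the planner is a hard-coded stand-in for a VLM, and the "digital twin" is a single boolean flag rather than a physics simulation. The toy scenario, a flat object that can only be grasped after it has been pushed to the table edge, is likewise an assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# 6D object pose as (x, y, z, roll, pitch, yaw); the layout is an assumption.
Pose = Tuple[float, float, float, float, float, float]


@dataclass
class Step:
    skill: str                          # "P" (prehensile) or "NP" (non-prehensile)
    primitive: str                      # e.g. "grasp", "push_to_edge"
    target_pose: Optional[Pose] = None  # filled in during rehearsal


def run_closed_loop(
    propose_plan: Callable[[Optional[str]], List[Step]],
    rehearse: Callable[[Step], Optional[Pose]],
    execute: Callable[[Step], bool],
    max_rounds: int = 3,
) -> Optional[List[Step]]:
    """Plan, rehearse each step, execute, and replan on failure."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        plan = propose_plan(feedback)   # stand-in for the VLM plan skeleton
        failed: Optional[Step] = None
        for step in plan:               # rehearsal: predict a desired object pose
            step.target_pose = rehearse(step)
            if step.target_pose is None:
                failed = step
                break
        if failed is not None:
            feedback = f"rehearsal rejected {failed.primitive}"
            continue
        for step in plan:               # execution with success/failure feedback
            if not execute(step):
                failed = step
                break
        if failed is None:
            return plan
        feedback = f"execution failed at {failed.primitive}"
    return None


def toy_planner(feedback: Optional[str]) -> List[Step]:
    # Hypothetical planner: tries a direct grasp, then adds a push after feedback.
    if feedback is None:
        return [Step("P", "grasp")]
    return [Step("NP", "push_to_edge"), Step("P", "grasp")]


class ToyTwin:
    """Hypothetical twin: the object is graspable only once it is at the edge."""

    def __init__(self) -> None:
        self.at_edge = False

    def rehearse(self, step: Step) -> Optional[Pose]:
        if step.primitive == "push_to_edge":
            self.at_edge = True
            return (0.4, 0.0, 0.0, 0.0, 0.0, 0.0)
        if step.primitive == "grasp" and self.at_edge:
            return (0.4, 0.0, 0.1, 0.0, 0.0, 0.0)
        return None                     # no feasible pose found


def toy_execute(step: Step) -> bool:
    return True                         # pretend the controller always succeeds


twin = ToyTwin()
plan = run_closed_loop(toy_planner, twin.rehearse, toy_execute)
print([(s.skill, s.primitive) for s in plan])
```

In this toy run the first proposal (a direct grasp) is rejected during rehearsal, the rejection is passed back to the planner as text, and the second proposal combines an NP push with a P grasp. Passing feedback as a plain string is only one possible design choice for the sketch.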

Index terms

Manipulation Planning, Task and Motion Planning