ICRA 2026

NaturalVLM: Leveraging Fine-Grained Natural Language for Affordance-Guided Visual Manipulation

Ran Xu, Yan Shen, Xiaoqi Li, Ruihai Wu, Hao Dong

AI summary

Integrating fine-grained, step-by-step language instructions with cross-modal prompts enables robots to accurately execute complex manipulation tasks that high-level instructions alone cannot handle.

Fine-grained language, Visual manipulation, Robot benchmark, Cross-modal alignment, Affordance prediction, Embodied AI

Problem

Existing robot manipulation benchmarks rely on simplistic, high-level language instructions that fail for complex, multi-step real-world tasks. Robots lack the detailed, step-by-step linguistic guidance required to navigate intricate object interactions and unfamiliar scenarios.

Approach

The authors introduce the NrVLM benchmark with 4,500+ episodes annotated with fine-grained step-by-step instructions, and propose a framework that uses pre-defined action and perception prompt modules to align visual, linguistic, and manipulation features for stepwise gripper and contact point prediction.
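To make the idea of an episode annotated with step-by-step instructions concrete, here is a minimal Python sketch of how such a record might be represented. The class names, field names, object identifier, and numeric values are all hypothetical and chosen for illustration; they are not taken from the NrVLM data format, which this summary does not describe.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Step:
    """One fine-grained step of an episode (all field names are hypothetical)."""
    instruction: str                            # natural-language instruction for this step
    contact_point: Tuple[float, float, float]   # assumed 3D contact point (x, y, z)
    gripper_pose: Tuple[float, ...]             # assumed pose: position (3) + quaternion (4)


@dataclass
class Episode:
    """A manipulation episode split into steps, each with its own instruction."""
    task: str
    object_id: str
    steps: List[Step] = field(default_factory=list)

    def instructions(self) -> List[str]:
        """Return the step instructions in execution order."""
        return [s.instruction for s in self.steps]


# Illustrative values only; not drawn from the benchmark.
episode = Episode(
    task="open_drawer",
    object_id="cabinet_001",
    steps=[
        Step("Move the gripper in front of the top drawer handle.",
             (0.40, 0.00, 0.55), (0.30, 0.00, 0.55, 0.0, 0.0, 0.0, 1.0)),
        Step("Close the gripper on the handle.",
             (0.40, 0.00, 0.55), (0.40, 0.00, 0.55, 0.0, 0.0, 0.0, 1.0)),
        Step("Pull the handle straight back to slide the drawer open.",
             (0.40, 0.00, 0.55), (0.20, 0.00, 0.55, 0.0, 0.0, 0.0, 1.0)),
    ],
)

for i, text in enumerate(episode.instructions(), start=1):
    print(f"Step {i}: {text}")
```

Keeping the contact point and gripper pose next to each instruction mirrors the stepwise prediction targets named above, but the actual annotation schema may differ.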

Key results

NrVLM benchmark comprising 15 tasks, 82 object variations, and 4,500+ fine-grained instruction episodes
Prompt-based cross-modal alignment framework that explicitly bridges language, vision, and manipulation modalities
Superior manipulation accuracy and step-by-step execution compared to four competitive visual-language baselines
Strong generalization performance across novel tasks and unseen object geometries

Why it matters

Provides a critical resource and methodology for training embodied AI agents to reliably interpret and execute complex, multi-step human instructions in dynamic home environments.

Abstract

Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects based on human language instructions is a pivotal challenge. Prior research has predominantly focused on simplistic and task-oriented instructions, e.g., "Slide the top drawer open". However, many real-world tasks demand intricate multi-step reasoning, and without human instructions, these will become extremely difficult for robot manipulation. To address these challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks, containing over 4500 episodes meticulously annotated with fine-grained language instructions. We split the long-term task process into several steps, with each step having a natural language instruction. Moreover, we propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions. Specifically, we first identify the instruction to execute, taking into account visual observations and the end-effector’s current state. Subsequently, our approach facilitates explicit learning through action-prompts and perception-prompts to promote manipulation-aware cross-modality alignment. Leveraging both visual observations and linguistic guidance, our model outputs a sequence of actionable predictions for manipulation, including contact points and end-effector poses. We evaluate our method and baselines using the proposed benchmark NrVLM. The experimental results demonstrate the effectiveness of our approach. For additional details, please refer to https://sites.google.com/view/naturalvlm.
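The two-stage loop described in the abstract (first pick the instruction to execute from the observation and end-effector state, then predict an action for it) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `run_episode`, `select_instruction`, `ToyEnv`, `toy_score`, and `toy_predict` are hypothetical names, and the learned scoring and prediction models are replaced by trivial stubs.

```python
from typing import Callable, List, Sequence, Tuple

Observation = Sequence[float]                 # placeholder for visual features
EEState = Tuple[float, float, float]          # placeholder end-effector position
Action = Tuple[Tuple[float, float, float], EEState]  # (contact point, target pose stand-in)


def select_instruction(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring instruction."""
    return max(range(len(scores)), key=lambda i: scores[i])


def run_episode(
    instructions: List[str],
    observe: Callable[[], Tuple[Observation, EEState]],
    score_fn: Callable[[List[str], Observation, EEState], Sequence[float]],
    predict_fn: Callable[[str, Observation, EEState], Action],
    execute_fn: Callable[[Action], None],
    max_steps: int = 10,
) -> List[int]:
    """Repeatedly choose an instruction, predict an action for it, and execute it."""
    executed: List[int] = []
    for _ in range(max_steps):
        obs, ee = observe()
        idx = select_instruction(score_fn(instructions, obs, ee))
        execute_fn(predict_fn(instructions[idx], obs, ee))
        executed.append(idx)
        if idx == len(instructions) - 1:  # assumed stop rule: last instruction done
            break
    return executed


class ToyEnv:
    """Stub environment whose only state is how many steps have been executed."""

    def __init__(self) -> None:
        self.progress = 0

    def observe(self) -> Tuple[Observation, EEState]:
        return [float(self.progress)], (0.0, 0.0, 0.1 * self.progress)

    def execute(self, action: Action) -> None:
        self.progress += 1


def toy_score(instructions: List[str], obs: Observation, ee: EEState) -> List[float]:
    """Stub scorer: favors the instruction whose index matches current progress."""
    current = int(obs[0])
    return [1.0 if i == current else 0.0 for i in range(len(instructions))]


def toy_predict(instruction: str, obs: Observation, ee: EEState) -> Action:
    """Stub predictor: returns a fixed contact point and the current pose."""
    return ((0.4, 0.0, 0.5), ee)


env = ToyEnv()
order = run_episode(
    ["Approach the handle.", "Grasp the handle.", "Pull the drawer open."],
    env.observe, toy_score, toy_predict, env.execute,
)
print(order)
```

In a real system the scorer and predictor would be learned models conditioned on images and language; the stop rule and the argmax selection here are simplifying assumptions.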

Index terms

Perception for Grasping and Manipulation; Data Sets for Robot Learning; Deep Learning in Grasping and Manipulation