ICRA 2026

NaturalVLM: Leveraging Fine-Grained Natural Language for Affordance-Guided Visual Manipulation

Ran Xu, Yan Shen, Xiaoqi Li, Ruihai Wu, Hao Dong

AI summary

Integrating fine-grained, step-by-step language instructions with cross-modal prompts enables robots to accurately execute complex manipulation tasks that high-level instructions alone cannot handle.
Fine-grained language, Visual manipulation, Robot benchmark, Cross-modal alignment, Affordance prediction, Embodied AI

Problem

Existing robot manipulation benchmarks rely on simplistic, high-level language instructions that fail for complex, multi-step real-world tasks. Robots lack the detailed, step-by-step linguistic guidance required to navigate intricate object interactions and unfamiliar scenarios.

Approach

The authors introduce the NrVLM benchmark with 4,500+ episodes annotated with fine-grained step-by-step instructions, and propose a framework that uses pre-defined action and perception prompt modules to align visual, linguistic, and manipulation features for stepwise gripper and contact point prediction.
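To make the idea of an episode annotated with fine-grained, step-by-step instructions concrete, here is a minimal sketch of how such a record might be represented. The class names, field names, and example instructions are all hypothetical; they are not the benchmark's actual schema or data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One fine-grained step of a task (hypothetical schema)."""
    instruction: str


@dataclass
class Episode:
    """A task episode annotated step by step (hypothetical schema)."""
    task: str
    object_id: str
    steps: List[Step] = field(default_factory=list)

    def instructions(self) -> List[str]:
        """Return the step instructions in execution order."""
        return [step.instruction for step in self.steps]


# Illustrative content only; not taken from the benchmark.
episode = Episode(
    task="open_drawer",
    object_id="cabinet_001",
    steps=[
        Step("Move the gripper in front of the top drawer handle."),
        Step("Close the gripper on the handle."),
        Step("Pull the handle straight back to slide the drawer open."),
    ],
)

for number, text in enumerate(episode.instructions(), start=1):
    print(f"Step {number}: {text}")
```

A high-level instruction such as "Slide the top drawer open" would correspond to the whole episode, while each `Step` carries the finer-grained guidance the summary describes.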

Key results

  • NrVLM benchmark comprising 15 tasks, 82 object variations, and 4,500+ fine-grained instruction episodes
  • Prompt-based cross-modal alignment framework that explicitly bridges language, vision, and manipulation modalities
  • Superior manipulation accuracy and step-by-step execution compared to four competitive visual-language baselines
  • Strong generalization performance across novel tasks and unseen object geometries

Why it matters

Provides a critical resource and methodology for training embodied AI agents to reliably interpret and execute complex, multi-step human instructions in dynamic home environments.

Abstract

Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects based on human language instructions is a pivotal challenge. Prior research has predominantly focused on simplistic and task-oriented instructions, e.g., "Slide the top drawer open". However, many real-world tasks demand intricate multi-step reasoning, and without human instructions, these will become extremely difficult for robot manipulation. To address these challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks, containing over 4500 episodes meticulously annotated with fine-grained language instructions. We split the long-term task process into several steps, with each step having a natural language instruction. Moreover, we propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions. Specifically, we first identify the instruction to execute, taking into account visual observations and the end-effector’s current state. Subsequently, our approach facilitates explicit learning through action-prompts and perception-prompts to promote manipulation-aware cross-modality alignment. Leveraging both visual observations and linguistic guidance, our model outputs a sequence of actionable predictions for manipulation, including contact points and end-effector poses. We evaluate our method and baselines using the proposed benchmark NrVLM. The experimental results demonstrate the effectiveness of our approach. For additional details, please refer to https://sites.google.com/view/naturalvlm.
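The stepwise procedure described in the abstract (select the instruction to execute from the observation and end-effector state, then predict a contact point and end-effector pose) can be sketched as a simple control loop. This is only an illustration under stated assumptions: `run_episode`, `Action`, and the callables passed in are hypothetical names, and the toy selector and predictor in `demo` are placeholders, not the authors' learned modules.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence, Tuple


@dataclass
class Action:
    """Hypothetical container for one actionable prediction."""
    contact_point: Tuple[float, float, float]
    gripper_pose: Tuple[float, ...]  # pose layout is an assumption


def run_episode(
    instructions: Sequence[str],
    observe: Callable[[], Tuple[Any, Any]],
    select_step: Callable[[Sequence[str], Any, Any], Optional[int]],
    predict: Callable[[str, Any], Action],
    execute: Callable[[Action], None],
    max_iters: int = 20,
) -> List[int]:
    """Run the select -> predict -> execute loop; return executed step indices."""
    executed: List[int] = []
    for _ in range(max_iters):
        obs, ee_state = observe()
        # Choose which instruction applies, given what is seen and the gripper state.
        idx = select_step(instructions, obs, ee_state)
        if idx is None:  # selector signals that no instruction remains
            break
        action = predict(instructions[idx], obs)
        execute(action)
        executed.append(idx)
    return executed


def demo() -> List[int]:
    """Toy run with stand-in components (no real vision or robot)."""
    instructions = ["reach the handle", "grasp the handle", "pull the drawer"]
    state = {"done": 0}  # toy stand-in for the end-effector state

    def observe() -> Tuple[Any, Any]:
        return None, state["done"]  # no real image in this toy setup

    def select_step(instrs: Sequence[str], obs: Any, ee_state: Any) -> Optional[int]:
        return ee_state if ee_state < len(instrs) else None

    def predict(instruction: str, obs: Any) -> Action:
        return Action(contact_point=(0.0, 0.0, 0.0), gripper_pose=(0.0,) * 6)

    def execute(action: Action) -> None:
        state["done"] += 1  # pretend every step succeeds

    return run_episode(instructions, observe, select_step, predict, execute)


print(demo())
```

Passing the selector and predictor in as callables keeps the loop independent of any particular model, so a learned instruction selector or pose predictor could be swapped in without changing the control flow.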

Index terms

Perception for Grasping and Manipulation; Data Sets for Robot Learning; Deep Learning in Grasping and Manipulation
