ICRA 2026

GraspControl: Text-Sketch Instruction As an Interface for Controllable Grasp Synthesis

XiaoPeng Wen, Songtao Tian, Yi Sun

AI summary

GraspControl bridges the gap between language instructions and visual object data to generate precise, task-specific 3D grasp poses using a guided diffusion framework.

Grasp synthesis, Diffusion models, Text-sketch interface, Robotic manipulation, Vision-language alignment, 3D reconstruction

Problem

Existing vision-language models struggle to align abstract text instructions with precise spatial object features, making it difficult for robots to execute specific, task-oriented grasps based on human intent.

Approach

The method uses a diffusion model to convert text prompts and 2D object sketches into 2D grasp sketches, which then guide a coarse-to-fine 3D reconstruction pipeline to output accurate 6-DoF grasp poses.
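For readers unfamiliar with the output format, a 6-DoF grasp pose combines three translational and three rotational degrees of freedom. The sketch below shows one common way to store such a pose and expand it into a 4x4 homogeneous transform. The `GraspPose` class, its field names, and the quaternion convention (w, x, y, z) are assumptions made for illustration; the listing does not say how GraspControl represents poses internally.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GraspPose:
    """Hypothetical container for a 6-DoF grasp (not the paper's own type)."""
    position: tuple      # (x, y, z) in metres, frame assumed to be the object's
    quaternion: tuple    # (w, x, y, z); normalised inside to_matrix()

    def to_matrix(self):
        """Return the 4x4 homogeneous transform as nested lists."""
        w, x, y, z = self.quaternion
        n = math.sqrt(w * w + x * x + y * y + z * z)
        if n == 0.0:
            raise ValueError("zero-length quaternion")
        w, x, y, z = w / n, x / n, y / n, z / n
        px, py, pz = self.position
        # Standard unit-quaternion to rotation-matrix conversion.
        return [
            [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y), px],
            [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x), py],
            [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y), pz],
            [0.0, 0.0, 0.0, 1.0],
        ]


# Example: gripper 10 cm above the object origin, rotated 90 degrees about z.
half_angle = math.pi / 4
pose = GraspPose(position=(0.0, 0.0, 0.10),
                 quaternion=(math.cos(half_angle), 0.0, 0.0, math.sin(half_angle)))
T = pose.to_matrix()
```

The upper-left 3x3 block of `T` is the gripper orientation and the last column is its position, which is the form most motion planners accept.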

Key results

Text-sketch guided diffusion model for 2D grasp sketch generation
Multi-modal attention loss aligning semantic and structural features
Coarse-to-fine 3D diffusion framework for object-grasp CAD reconstruction
Validated high-quality, diverse grasp synthesis in simulation and on real robots
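The attention loss named in the list above is not specified in this listing. As a rough illustration of the general idea, the sketch below penalises attention that falls outside the grasp area. The function name, the mask-ratio formulation, and the list-of-lists inputs are all assumptions; the paper's actual loss may differ substantially.

```python
def grasp_region_attention_loss(attention, grasp_mask, eps=1e-8):
    """Return 1 - (attention mass inside the grasp region) / (total attention mass).

    attention:  2D list of non-negative floats, a hypothetical cross-attention
                map for a grasp-related text token (one value per image patch).
    grasp_mask: 2D list of 0/1 values with the same shape (1 = grasp area).
    """
    if len(attention) != len(grasp_mask):
        raise ValueError("shape mismatch")
    inside = 0.0
    total = 0.0
    for att_row, mask_row in zip(attention, grasp_mask):
        if len(att_row) != len(mask_row):
            raise ValueError("shape mismatch")
        for a, m in zip(att_row, mask_row):
            inside += a * m   # attention that lands on the grasp area
            total += a        # all attention, inside or outside
    return 1.0 - inside / (total + eps)


# Toy 2x2 example: the right-hand column is the grasp area.
mask = [[0, 1], [0, 1]]
focused = [[0.0, 0.5], [0.0, 0.5]]      # all attention on the grasp area
diffuse = [[0.25, 0.25], [0.25, 0.25]]  # attention spread evenly
loss_focused = grasp_region_attention_loss(focused, mask)
loss_diffuse = grasp_region_attention_loss(diffuse, mask)
```

Under this formulation the loss approaches 0 when attention concentrates on the grasp area and approaches 1 when it misses the area entirely.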

Why it matters

Provides a controllable, human-intent-driven interface for robotic manipulation, advancing reliable object interaction in complex environments.

Abstract

Large vision-language models have been shown to perform complex tasks. However, aligning language instructions with object visual information to enable general inference for robotic grasping poses a significant challenge. To tackle this issue, we introduce GraspControl, a method that leverages grasp language instructions and sketches of objects to control the generation of grasps. Initially, we construct a dataset that augments language instructions with position and orientation information of grasps, and visual information with sketches of the gripper and target objects. Subsequently, we develop a model capable of generating 2D grasp sketches given grasp language and 2D object sketches as input prompts, thereby bridging the gap between the linguistic and visual representations of the object to be grasped. These generated 2D grasp sketches serve as an innovative input modality for grasp synthesis, directing the creation of 3D object models and corresponding 3D grasp poses through a 3D reconstruction module. Furthermore, we incorporate a multi-modal attention loss to ensure the consistency between high-level semantic grasp features and intricate low-level visual features, with a particular emphasis on the grasping area of the object. We evaluate the capabilities of our grasp approach through extensive experiments in both simulated and real-world robotic scenarios. The experimental results confirm that our method can execute grasps in complex environments.
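The abstract describes a dataset whose language instructions are augmented with grasp position and orientation information. A minimal sketch of what such an augmented instruction might look like is given below. The `GraspAnnotation` fields, the sentence template, and the example values are hypothetical; the dataset's real schema and wording are not given in this listing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraspAnnotation:
    """Hypothetical annotation record (the real dataset schema is not stated)."""
    object_name: str
    part: str        # position cue: where on the object to grasp
    approach: str    # orientation cue: direction the gripper comes from


def build_instruction(ann: GraspAnnotation) -> str:
    """Compose a grasp instruction carrying both position and orientation cues."""
    return f"Grasp the {ann.object_name} by its {ann.part}, approaching {ann.approach}."


example = build_instruction(GraspAnnotation("mug", "handle", "from the side"))
```

A text prompt of this kind would then be paired with a 2D object sketch as the conditioning input described in the abstract.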

Index terms

Deep Learning in Grasping and Manipulation, Grasping, Deep Learning for Visual Perception