ICRA 2023

Towards Open-World Interactive Disambiguation for Robotic Grasping

Yuchen Mo, Hanbo Zhang, Tao Kong

Abstract

Language-based communication is essential in human-robot interaction, especially for non-expert users, who make up the majority. In this paper, we present SeeAsk, an open-world interactive visual grounding system that grasps targets specified by ambiguous natural language instructions. The main contribution of SeeAsk is that it robustly handles open-world scenes in terms of both open-set objects and open-vocabulary interactions. Specifically, SeeAsk is built on modern large-scale pre-trained vision-language models and a traditional decision-making process, and shows promising results for deployment in real-world scenarios. SeeAsk outperforms previous state-of-the-art algorithms by a clear margin, not only in success rate but also in asking smarter and more informative questions. User studies also demonstrate its advantages over previous work.

Index terms

Multi-Modal Perception for HRI; Integrated Planning and Learning