IROS 2024

Kosmos-E: Learning to Follow Instruction for Robotic Grasping

Zhi Wang, Xun Wu, Xun Wu, Li Dong, Wang Wenhui, Shuming Ma, Furu Wei

Abstract

Tuning on instruction-following data has been shown to enhance the capabilities and controllability of language models, but the idea remains less explored in robotics. In this work, we introduce KOSMOS-E, a Multimodal Large Language Model (MLLM) that leverages instruction-following robotic grasping data to perform precise and intricate grasping maneuvers. To this end, we construct a large-scale instruction-following robotic grasping dataset, termed INSTRUCT-GRASP, which primarily covers two settings: (i) grasping a single object following descriptions at varying levels of granularity, e.g., different angles and parts, and (ii) grasping a specific object in a multi-object environment following specified attributes, e.g., color and shape. Extensive experiments show the effectiveness of KOSMOS-E on robotic grasping tasks across a variety of environments.

Index terms

Deep Learning in Grasping and Manipulation; Grasping; Deep Learning Methods