Use of Knowledge Embedded in Vision-Language Model to Estimate Robotic Grasping Force through Robot-To-Human Image Translation
Shohei Hagane, Shigeaki Goto, Yoshihiro Ohama
Abstract
In recent years, general-purpose robots have been introduced into domains requiring delicate manipulation, such as materials experimentation. While advances have been made in automating specific processes, generalized robotic pick-and-place operations still pose a challenge due to the diversity of target objects and the need for appropriate force control. This study proposes a novel approach for zero-shot estimation of target grasping force that utilizes the prior knowledge about human motions embedded in a Vision-Language Model (VLM). The key idea is to convert robot manipulation images into human-action images using a style transfer approach based on a fine-tuned Variational Auto-Encoder (VAE), enabling the VLM to better infer grasping force requirements. The VLM, specifically GPT-4o, is prompted to estimate the target grasping force as one of three discrete categories (no grasp, light grip, firm grip). Experimental results demonstrate that converting robot images into human representations improves the accuracy not only of target grasping force estimation but also of the VLM's understanding of the target objects. Furthermore, including target object information in the prompt improves estimation accuracy across all input image types. These findings highlight the effectiveness of utilizing VLMs trained on human knowledge for robotic force control and open new avenues for general-purpose, cost-efficient manipulation without relying on large-scale robot force datasets.
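The discrete-category prompting step described above can be illustrated with a minimal sketch. It only shows how a prompt might be assembled, with or without target object information, and how a free-text VLM reply might be mapped onto the three categories. The function names, the prompt wording, and the parsing rule are assumptions made for illustration and are not taken from this paper; no VLM call or image translation is performed here.

```python
from typing import Optional

# The three grasping-force categories named in the abstract.
GRASP_CATEGORIES = ("no grasp", "light grip", "firm grip")


def build_prompt(target_object: Optional[str] = None) -> str:
    """Assemble a text prompt (hypothetical wording, not the paper's prompt)."""
    options = ", ".join(GRASP_CATEGORIES)
    prompt = (
        "Look at the image of a hand manipulating an object. "
        f"Answer with exactly one of: {options}."
    )
    if target_object is not None:
        # Optionally include target object information in the prompt.
        prompt += f" The target object is: {target_object}."
    return prompt


def parse_grasp_category(response: str) -> Optional[str]:
    """Map a free-text reply to one category, or None if it is ambiguous."""
    text = response.lower()
    matches = [c for c in GRASP_CATEGORIES if c in text]
    # Accept the reply only when exactly one category is mentioned.
    return matches[0] if len(matches) == 1 else None
```

In such a setup, the prompt string would be sent to the VLM together with the translated image, and replies that name zero or several categories would be treated as invalid rather than guessed.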