CLIPUNetr: Assisting Human-Robot Interface for Uncalibrated Visual Servoing Control with CLIP-Driven Referring Expression Segmentation
Chen Jiang, Yuchen Yang, Martin Jagersand
Abstract
The classical human-robot interface in uncalibrated image-based visual servoing (UIBVS) relies on either human annotations or semantic segmentation with categorical labels. Neither method matches natural human communication or conveys the rich semantics of manipulation tasks as effectively as natural language expressions. In this paper, we tackle this problem by using referring expression segmentation, a prompt-based approach, to provide more in-depth information for robot perception. To generate high-quality segmentation predictions from referring expressions, we propose CLIPUNetr, a new CLIP-driven referring expression segmentation network. CLIPUNetr leverages CLIP’s strong vision-language representations to segment regions from referring expressions, while its “U-shaped” encoder-decoder architecture generates predictions with sharper boundaries and finer structures. Furthermore, we propose a new pipeline to integrate CLIPUNetr into UIBVS and apply it to control robots in real-world environments. In experiments, our method improves boundary and structure measurements by an average of 120% and successfully assists real-world UIBVS control in an unstructured manipulation environment.