ICRA 2026

GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning

Rui Tang, Guankun Wang, Long Bai, Huxin Gao, Jiewen Lai, Chi Kit Ng, Jiazheng Wang, Fan Zhang, Hongliang Ren

AI summary

GeoLanG achieves state-of-the-art language-guided grasping and segmentation in cluttered scenes by unifying RGB-D inputs and multi-scale features in an end-to-end framework.

language-guided grasping, RGB-D perception, multimodal learning, geometric priors, cluttered environments, end-to-end robotics

Problem

Existing language-guided grasping methods rely on multi-stage pipelines or separate RGB/depth processing, leading to poor cross-modal fusion, limited generalization, and failure in cluttered or occluded environments.

Approach

GeoLanG is an end-to-end framework built on CLIP-VMamba that fuses RGB-D data and language into a shared space, using a Depth-guided Geometric Module for spatial priors and Adaptive Dense Channel Integration for multi-layer feature fusion.
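
One common way to inject a geometric prior into attention is to add a depth-derived bias to the attention logits before the softmax. The sketch below illustrates that general idea only; it is not the paper's Depth-guided Geometric Module, whose exact formulation is not given here. The function name `depth_biased_attention`, the penalty `-lam * |d_i - d_j|`, and the strength `lam` are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def depth_biased_attention(q, k, v, depth, lam=1.0):
    """Scaled dot-product attention with an additive depth bias.

    q, k, v: lists of token vectors (lists of floats), one per image patch.
    depth:   one representative depth value per patch (hypothetical input).
    lam:     hypothetical strength of the depth penalty; lam=0 gives plain attention.
    """
    dim = len(q[0])
    out = []
    for i, qi in enumerate(q):
        logits = []
        for j, kj in enumerate(k):
            score = sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim)
            # Patches at similar depth get a smaller penalty, so they attend
            # to each other more strongly than patches far apart in depth.
            bias = -lam * abs(depth[i] - depth[j])
            logits.append(score + bias)
        weights = softmax(logits)
        out.append([
            sum(weights[j] * v[j][c] for j in range(len(v)))
            for c in range(len(v[0]))
        ])
    return out


# Three patches with identical queries and keys; the third lies much deeper.
q = k = [[1.0, 0.0]] * 3
v = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
depth = [0.5, 0.5, 2.0]
biased = depth_biased_attention(q, k, v, depth, lam=2.0)
plain = depth_biased_attention(q, k, v, depth, lam=0.0)
```

With `lam=0.0` the first patch attends to all three patches equally; with `lam=2.0` it draws almost entirely on the patch at the same depth. The bias in this sketch has no learned parameters, which is one reason additive priors of this kind are cheap to adopt.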

Key results

85.77% IoU and 92.13% J@N on OCID-VLG, surpassing prior state-of-the-art methods
Depth-guided Geometric Module injects spatial priors into attention without extra compute
Adaptive Dense Channel Integration balances multi-layer features for robust alignment
Validated on unseen objects and real-world hardware
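
For reference, the IoU figure above is a mask-overlap score. A minimal sketch of the standard definition, the intersection over union of a predicted and a ground-truth binary mask, follows. The helper name `mask_iou` and the toy masks are illustrative, and the paper's exact evaluation protocol (thresholds, averaging over the dataset) is not reproduced here.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 overlapping pixels / 4 pixels in the union = 0.5
```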

Why it matters

Enables reliable, explainable robotic manipulation in complex real-world settings where objects are cluttered, occluded, or lack texture.

Abstract

Language-guided grasping has emerged as a promising paradigm for enabling robots to identify and manipulate target objects through natural language instructions, yet it remains highly challenging in cluttered or occluded scenes. Existing methods often rely on multi-stage pipelines that separate object perception and grasping, which leads to limited cross-modal fusion, redundant computation, and poor generalization in cluttered, occluded, or low-texture scenes. To address these limitations, we propose GeoLanG, an end-to-end multi-task framework built upon the CLIP architecture that unifies visual and linguistic inputs into a shared representation space for robust semantic alignment and improved generalization. To enhance target discrimination under occlusion and low-texture conditions, we explore a more effective use of depth information through the Depth-guided Geometric Module (DGGM), which converts depth into explicit geometric priors and injects them into the attention mechanism without additional computational overhead. In addition, we propose Adaptive Dense Channel Integration, which adaptively balances the contributions of multi-layer features to produce more discriminative and generalizable visual representations. Extensive experiments on the OCID-VLG dataset, as well as in both simulation and real-world hardware, demonstrate that GeoLanG enables precise and robust language-guided grasping in complex, cluttered environments, paving the way toward more reliable multimodal robotic manipulation in real-world human-centric settings. Our code and dataset are publicly available at https://github.com/Tomry1114/GeoLanG.

I. INTRODUCTION

In recent years, robots have been increasingly deployed in applications such as household assistance, elderly care, and warehouse logistics [1]–[5]. As shown in Fig. 1, a central challenge in this field is enabling robots to perform adaptive, explainable, and reliable grasping in open-world environments, where multiple objects may occlude or interfere with each other. Traditional RGB-based grasping pipelines typically separate object detection, segmentation, and grasp planning [6], [7]. While effective in controlled settings, these approaches often exhibit limited semantic understanding, poor generalization to cluttered or occluded scenes, and cumulative errors across stages.

Fig. 1. Language-Guided Multimodal Perception for 6-DoF Robotic Grasping in Cluttered Environments.

Language-guided grasping has emerged as a promising paradigm that uses natural language instructions to guide perception and planning. Contrastive language-image pretraining models, such as CLIP [8], provide strong capabilities for aligning visual and linguistic modalities. Recent approaches integrate CLIP with preprocessed object regions and Transformer encoders to construct multimodal fusion frameworks [9]. Although these methods improve semantic understanding, several limitations remain. First, many approaches rely on external object and grasp detectors, which are susceptible to cascading errors. Second, commonly adopted visual encoders such as CLIP-ResNet and CLIP-ViT each face intrinsic drawbacks: CLIP-ResNet, constrained by its static hierarchical structure, cannot dynamically model object scale variations and deformations, thereby limiting its representation of multi-scale and non-rigid objects [10], [11]; CLIP-ViT, while offering global context modeling, incurs high computational cost due to quadratic self-attention and struggles to capture fine-grained structures under fixed patch partitioning, restricting its ability to represent local details and deformations. These limitations highlight the need for a visual framework that combines efficient global modeling with dynamic features for robust language-guided grasping.

Accurate spatial modeling is crucial in cluttered or occluded scenes, where objects overlap or exhibit substantial visual similarity. Recent studies have introduced depth information to enhance grasping by providing structural cues such as object shape, spatial relationships, and occlusion reason-

* Equal contribution
This work was supported in part by Ministry of Science and Technology (MOST) of China Key Project 2025YFE0122500, Guangdong Basic and Applied Basic Research Foundation (Grant No. 2025A1515011594); National Natural Science Foundation of China (Grant No. 62403402); NSFC Distinguished Young Scientists Fund – Category A (Grant No. T252500134); Hong Kong Research Grants Council (Grant Nos. C4026-21GF, R4020-22, 14200425, 14206125, 14204524, 14203323). (Corresponding to: H. Ren, hlren@ee.cuhk.edu.hk.)
1 The Chinese University of Hong Kong, Hong Kong SAR, China.
2 The Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co. Ltd., Hong Kong SAR, China.
3 Shenzhen Loop Area Institute, Shenzhen, China.

2026 IEEE International Conference on Robotics and Automation (ICRA 2026), June 1-5, 2026, Vienna, Austria. 979-8-3315-8160-2/26/$31.00 ©2026 IEEE

Index terms

Deep Learning in Grasping and Manipulation; RGB-D Perception; Object Detection, Segmentation and Categorization