ICRA 2026

Visual Grounding Via Heterogeneous Representation Learning and Hierarchical Reasoning of Human-To-Vehicle Commands

Hao Wang, Suining He, Kang G. Shin

AI summary

VIGOR improves visual grounding accuracy for human-to-vehicle commands by 14.81% on average, measured by Intersection over Union (IoU) against state-of-the-art baselines, through cross-modal fusion and hierarchical reasoning.

Visual grounding, Human-to-vehicle commands, Heterogeneous representation learning, Hierarchical reasoning, Autonomous vehicles, LiDAR fusion

Problem

Autonomous vehicles struggle to accurately locate objects of interest based on short, opaque, and context-dependent natural language commands from riders in complex traffic environments. Existing methods fail to effectively fuse heterogeneous sensor data and reason about relative positions in low-visibility scenarios.

Approach

VIGOR fuses visual, textual, and LiDAR-based situational data using heterogeneous representation learning, then applies object- and context-level Part-of-Speech tagging to perform hierarchical reasoning for precise visual grounding.

Key results

Heterogeneous modality learning framework fusing vision, text, and LiDAR
Hierarchical reasoning mechanism using object- and context-level POS tagging
14.81% average IoU improvement over state-of-the-art baselines
Robust grounding performance in complex and low-visibility traffic conditions
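
The IoU metric behind the headline result can be illustrated with a minimal sketch. This is the standard Intersection over Union for axis-aligned 2D boxes, not the authors' evaluation code. The box layout (x_min, y_min, x_max, y_max) and the names Box and iou are assumptions made for illustration, and the listing does not say how the paper aggregates IoU across scenes.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(pred: Box, truth: Box) -> float:
    """Intersection over Union of two axis-aligned boxes (hypothetical helper)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(pred[0], truth[0])
    iy_min = max(pred[1], truth[1])
    ix_max = min(pred[2], truth[2])
    iy_max = min(pred[3], truth[3])

    # Clamp to zero so disjoint boxes give an empty intersection.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - intersection

    # Degenerate (zero-area) boxes have no meaningful overlap ratio.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    ground_truth = (1.0, 1.0, 3.0, 3.0)
    print(iou(predicted, ground_truth))
```

A predicted box that matches the ground truth exactly scores 1.0, and boxes that do not overlap score 0.0, so a higher average IoU means the grounded object of interest is localized more tightly.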

Why it matters

Enables safer and more reliable human-vehicle collaboration by allowing autonomous vehicles to accurately interpret and act on natural language rider commands.

Abstract

With the proliferation of autonomous vehicles (AVs) and their increasing interaction and communication with riders, how to ground or locate the visual objects of interest (OoIs), such as pedestrians of concern and other traffic participants, based on the human riders' natural language and communication (e.g., vocal commands), is essential for increasing the efficiency, effectiveness, and reliability/safety of AVs in following the riders' reasonable commands and preferences. There are several technical challenges in achieving visual grounding for such human-to-vehicle commanding (HVC) scenes, including (1) how to fuse heterogeneous sensor modalities, i.e., visual object information, textual contexts, and situation awareness (say, obtained from light detection and ranging, or LiDAR); (2) how to discern opaque commands in human natural language; and (3) how to reason about the relative positions of the OoIs within the visual modality. To meet these challenges, we propose VIGOR, a VIsual Grounding approach based on heterogeneous mOdality learning and hierarchical Reasoning for HVC scenes. First, we design a heterogeneous modality learning approach to incorporate the visual, textual, and situational modalities, and learn their cross-modality representations to identify important information for visual grounding. Then, VIGOR performs hierarchical reasoning at the object and context levels, and differentiates the OoIs in complex traffic environments that relate to the natural language commands. Finally, we conduct extensive experimental studies on a total of 12,037 HVC scenes, demonstrating that VIGOR achieves higher accuracy than state-of-the-art approaches (by 14.81% on average) in terms of Intersection over Union (IoU) in grounding the OoIs in complex (including low-visibility) HVC scenes.

Index terms

AI-Based Methods, Human-Centered Automation, Intelligent Transportation Systems