Visual Grounding Via Heterogeneous Representation Learning and Hierarchical Reasoning of Human-To-Vehicle Commands
Hao Wang, Suining He, Kang G. Shin
AI summary
Problem
Autonomous vehicles struggle to accurately locate objects of interest based on short, opaque, and context-dependent natural language commands from riders in complex traffic environments. Existing methods fail to effectively fuse heterogeneous sensor data and reason about relative positions in low-visibility scenarios.
Approach
VIGOR fuses visual, textual, and LiDAR-based situational data using heterogeneous representation learning, then applies object- and context-level Part-of-Speech tagging to perform hierarchical reasoning for precise visual grounding.
Key results
- Heterogeneous modality learning framework fusing vision, text, and LiDAR
- Hierarchical reasoning mechanism using object- and context-level POS tagging
- 14.81% average IoU improvement over state-of-the-art baselines
- Robust grounding performance in complex and low-visibility traffic conditions
Why it matters
Enables safer and more reliable human-vehicle collaboration by allowing autonomous vehicles to accurately interpret and act on natural language rider commands.
Abstract
With the proliferation of autonomous vehicles (AVs) and their increasing interaction and communication with riders, grounding or locating the visual objects of interest (OoIs), such as pedestrians of concern and other traffic participants, based on the human riders' natural language and communication (e.g., vocal commands) is essential for increasing the efficiency, effectiveness, and reliability/safety of AVs in following the riders' reasonable commands and preferences. There are several technical challenges in achieving visual grounding for such human-to-vehicle commanding (HVC) scenes, including (1) how to fuse heterogeneous sensor modalities, i.e., visual object information, textual contexts, and situation awareness (say, obtained from light detection and ranging); (2) how to discern the opaque commands in human natural language; and (3) how to reason about the relative positions of the OoIs within the visual modality. To meet these challenges, we propose VIGOR, a VIsual Grounding approach based on heterogeneous mOdality learning and hierarchical Reasoning for HVC scenes. First, we design a heterogeneous modality learning approach to incorporate the visual, textual, and situational modalities and learn their cross-modality representations in order to identify important information for visual grounding. Then, VIGOR performs hierarchical reasoning at the object and context levels, and differentiates the OoIs in complex traffic environments that relate to the natural language commands. Finally, we conduct extensive experimental studies on a total of 12,037 HVC scenes, demonstrating that VIGOR achieves higher accuracy than the state-of-the-art approaches (by 14.81% on average) in terms of the Intersection over Union (IoU) in grounding the OoIs in complex (including low-visibility) HVC scenes.
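The evaluation metric named in the abstract, Intersection over Union (IoU), can be illustrated with a minimal sketch. It assumes axis-aligned bounding boxes given as (x_min, y_min, x_max, y_max); the abstract does not specify the box format or the evaluation code, so the `Box` type and the `iou` function below are hypothetical names used for illustration only.

```python
from typing import Tuple

# Assumed box format (not stated in the abstract): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so disjoint boxes have zero intersection
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)     # hypothetical predicted box
    ground_truth = (1.0, 1.0, 3.0, 3.0)  # hypothetical annotated box
    print(iou(predicted, ground_truth))
```

In this example the two boxes overlap in a 1 x 1 square and each has area 4, so the IoU is 1 / (4 + 4 - 1) = 1/7. A value of 1.0 means the predicted box matches the annotated OoI exactly, and 0.0 means the boxes do not overlap.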