ICRA 2026

CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion

Mouad Abrini, Mohamed Chetouani

AI summary

Extracting cross-modal attention maps from a vision-language model and processing them with a lightweight CNN reliably detects referential ambiguity, enabling state-of-the-art interactive visual grounding.

Interactive Visual Grounding, Ambiguity Detection, Vision-Language Models, Cross-Modal Attention, Human-Robot Interaction, Parameter-Efficient Fine-Tuning

Problem

Current interactive visual grounding systems lack a reliable, spatially grounded mechanism to determine when a user's instruction is ambiguous, often relying on indirect confidence scores or heuristics.

Approach

CLUE extracts text-to-image attention maps from a VLM's mid-layers and feeds them into a lightweight CNN to explicitly detect and localize ambiguity, while a LoRA-fine-tuned decoder generates clarification questions and grounding tokens.
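
The attention-map step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `text_to_image_attention_map`, the averaging over heads and text tokens, the normalisation, and the toy token layout (image tokens first, then text tokens) are all assumptions made for illustration. It only shows how one layer's attention weights could be reduced to a 2D map over image patches, which a small CNN could then consume.

```python
import numpy as np

def text_to_image_attention_map(attn, text_idx, image_idx, grid_hw):
    """Aggregate one layer's attention into a 2D text-to-image map.

    attn:      (num_heads, seq_len, seq_len) attention weights, rows = queries.
    text_idx:  positions of the instruction tokens (used as queries).
    image_idx: positions of the image patch tokens (used as keys), row-major.
    grid_hw:   (height, width) of the patch grid; height * width == len(image_idx).
    """
    h, w = grid_hw
    if h * w != len(image_idx):
        raise ValueError("grid size must match the number of image tokens")
    # Keep only text-query rows and image-key columns: (heads, n_text, n_image).
    sub = attn[:, text_idx][:, :, image_idx]
    # Average over heads and text tokens to get one weight per patch (assumed choice).
    per_patch = sub.mean(axis=(0, 1))
    # Normalise so the map sums to 1, then reshape to the patch grid.
    per_patch = per_patch / per_patch.sum()
    return per_patch.reshape(h, w)

# Toy input: random softmax attention, 16 image tokens followed by 5 text tokens.
rng = np.random.default_rng(0)
num_heads, n_image, n_text = 4, 16, 5
seq_len = n_image + n_text
logits = rng.normal(size=(num_heads, seq_len, seq_len))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

image_idx = list(range(n_image))
text_idx = list(range(n_image, seq_len))
amap = text_to_image_attention_map(attn, text_idx, image_idx, (4, 4))
print(amap.shape)
```

In a real pipeline the `attn` tensor would come from the chosen decoder layer of the VLM rather than from random logits, and the token positions would depend on the model's prompt template.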

Key results

CNN-based detector outperforms autoregressive baselines with an F1 of 0.846 on synthetic ambiguity data
End-to-end IVG model achieves state-of-the-art grounding performance with InViG-only supervision
Mid-decoder layer attention (layer 14) optimally balances precision and recall for ambiguity localization
Public release of a synthetic ambiguity dataset generated via Isaac Sim alongside code and models
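
For readers unfamiliar with the F1 metric reported for the detector, the short sketch below shows how precision, recall, and F1 are computed for a binary ambiguous/unambiguous classifier. The labels are hypothetical toy values, not data from the paper, and the helper name `precision_recall_f1` is chosen here for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, 1 = ambiguous."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical toy labels (1 = ambiguous instruction), not results from the paper.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Because F1 is a harmonic mean, a detector that asks for clarification too often (low precision) or too rarely (low recall) is penalised either way.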

Why it matters

Provides robots with an interpretable, computationally efficient signal for triggering clarification, directly improving human-robot interaction reliability and grounding accuracy.

Abstract

With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism to determine when to ask clarification questions, as they implicitly rely on their learned representations. CLUE addresses this gap by converting the VLM’s cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text-to-image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive dataset for IVG, and a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning. Similarly, the ambiguity detector outperforms prior baselines. Overall, CLUE turns the internal cross-modal attention of a VLM into an explicit, spatially grounded signal for deciding when to ask. The data and code are publicly available at: mouadabrini.github.io/clue/
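
As background on the parameter-efficient fine-tuning mentioned in the abstract, the sketch below shows the general LoRA idea: a frozen weight matrix plus a trainable low-rank update scaled by `alpha / r`. The dimensions, rank, and `alpha` are hypothetical values picked for illustration; the paper's actual LoRA configuration is not stated on this page.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Compute x W^T + (alpha / r) * x A^T B^T, with W frozen and A, B trainable."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2             # hypothetical sizes, not from the paper
W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in))       # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection, zero-initialised
x = rng.normal(size=(3, d_in))

y = lora_forward(x, W, A, B, alpha=4.0)
trainable = A.size + B.size          # parameters updated during fine-tuning
full = W.size                        # parameters a full fine-tune would update
print(trainable, full)
```

With `B` initialised to zero, the adapted layer starts out identical to the pretrained one, and only the small `A` and `B` matrices receive gradient updates.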

Index terms

Multi-Modal Perception for HRI, Intention Recognition, Natural Dialog for HRI