Research Paper | Researchia:202602.10073

CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion

Mouad Abrini

Submitted: February 10, 2026
Subjects: Robotics

Description / Details

With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism to determine when to ask clarification questions, as they implicitly rely on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text-to-image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive dataset for IVG and on a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning. Similarly, the ambiguity detector outperforms prior baselines. Overall, CLUE turns the internal cross-modal attention of a VLM into an explicit, spatially grounded signal for deciding when to ask. The data and code are publicly available at: mouadabrini.github.io/clue
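
To make the attention-to-detector idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the CNN architecture, which layers or heads are used, or the decision threshold. The function and class names (`text_to_image_attention_map`, `TinyAmbiguityDetector`), the untrained weights, the 2x2 patch grid, and the 0.5 threshold are all assumptions chosen for illustration. The sketch averages attention from text-token queries onto image-patch keys, reshapes the result into a spatial map, and scores that map with a single convolution filter followed by global max pooling and a logistic output.

```python
import math


def text_to_image_attention_map(attn, text_idx, image_idx, grid_h, grid_w):
    """Average attention from text-token queries onto image-patch keys.

    attn is a nested list indexed [head][query][key] for one layer.
    Returns a grid_h x grid_w map normalised to sum to 1.
    """
    if len(image_idx) != grid_h * grid_w:
        raise ValueError("image_idx must contain grid_h * grid_w positions")
    flat = [0.0] * len(image_idx)
    for head in attn:
        for q in text_idx:
            for j, k in enumerate(image_idx):
                flat[j] += head[q][k]
    total = sum(flat)
    if total > 0:
        flat = [v / total for v in flat]
    return [flat[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


def conv2d_valid(x, kernel):
    """Single-channel 2D cross-correlation without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(x[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


class TinyAmbiguityDetector:
    """Hypothetical stand-in for a lightweight CNN: conv, ReLU, max pool, sigmoid."""

    def __init__(self, kernel, weight, bias):
        self.kernel = kernel
        self.weight = weight
        self.bias = bias

    def predict_proba(self, attn_map):
        features = conv2d_valid(attn_map, self.kernel)
        # ReLU followed by global max pooling over the feature map.
        pooled = max(max(0.0, v) for row in features for v in row)
        return 1.0 / (1.0 + math.exp(-(self.weight * pooled + self.bias)))


# Toy input: 2 heads, 6 tokens (2 text tokens, then 4 image patches in a 2x2 grid),
# with uniform attention everywhere. Real maps would come from the VLM.
n_tokens = 6
attn = [[[1.0 / n_tokens] * n_tokens for _ in range(n_tokens)] for _ in range(2)]
attn_map = text_to_image_attention_map(
    attn, text_idx=[0, 1], image_idx=[2, 3, 4, 5], grid_h=2, grid_w=2
)

# Untrained, hand-picked weights (assumption); a real detector would be trained.
detector = TinyAmbiguityDetector(kernel=[[1.0, 1.0], [1.0, 1.0]], weight=1.0, bias=-0.5)
p_ambiguous = detector.predict_proba(attn_map)
should_ask = p_ambiguous >= 0.5  # hypothetical threshold for asking a question
```

In a pipeline of this shape, `should_ask` would gate whether the dialog model asks a clarification question or commits to a grounding prediction; how CLUE actually makes that decision is described in the paper itself.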


Source: arXiv:2602.08999v1 (http://arxiv.org/abs/2602.08999v1)
PDF: https://arxiv.org/pdf/2602.08999v1
