ICRA 2026

MIS: Light Response Agent for Video Comment with Multimodal Informative Seeking

Dong Zhang, Tongfei Shen, Zhiyu Tang, Shoushan Li, Guodong Zhou

AI summary

MIS achieves state-of-the-art video comment response generation by efficiently extracting key textual and visual context, outperforming heavy LLM-based baselines while enabling real-time robotic deployment.

Video comment response, Multimodal generation, Lightweight agent, Key vision selection, Comment context retrieval, Human-robot interaction

Problem

Existing video comment response models rely on computationally heavy LLMs, ignore precise cross-modal information extraction, and suffer from noisy dataset annotations, hindering real-time robotic applications.

Approach

MIS uses a lightweight architecture with a Comment Context Retrieval module to find relevant historical comments and a Key Vision Selection module to isolate crucial video frames, fusing them via a cross-modal decoder for efficient response generation.
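
To make the retrieval idea concrete, here is a minimal sketch of picking the historical comments most relevant to a target comment. It is not the paper's Comment Context Retrieval module: the summary does not say how relevance is scored, so this sketch assumes a plain bag-of-words cosine similarity, and the names `bow`, `cosine`, and `retrieve_context` are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts for a comment (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_context(target, other_comments, k=2):
    """Return the k comments most similar to the target, most similar first."""
    query = bow(target)
    ranked = sorted(other_comments, key=lambda c: cosine(query, bow(c)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    history = [
        "what a goal in the last minute",
        "nice weather today",
        "the keeper had no chance on that goal",
    ]
    for comment in retrieve_context("the goal in the last minute was amazing", history):
        print(comment)
```

A learned sentence encoder would likely replace the bag-of-words step in practice; the top-k ranking logic would stay the same.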

Key results

Surpasses SOTA baselines on BLEU, ROUGE, and CIDEr metrics
Introduces UMC dataset with verified comment-response pairs
Reduces computational overhead while improving response fluency and diversity
Enables real-time deployment on resource-constrained robotic platforms

Why it matters

Provides an efficient, context-aware response generation framework tailored for real-time human-robot interaction and automated service systems.

Abstract

Automatic response generation of video comments (RGVC) aims to generate a target reply to the content of the target comment based on the video context. Existing works for RGVC normally rely on large language models (LLMs), and mostly neglect the importance of extracting key information from both linguistic and visual perspectives. This limitation hinders the deployment of fluent and targeted response generation systems in real-world robotic and automated applications, where computational efficiency and precision are essential. In this work, we introduce a lightweight response agent with a novel multimodal informative seeking approach (MIS), which includes a Comment Context Retrieval (CCR) module and a Key Vision Selection (KVS) module to simultaneously seek essential information from both textual and visual modalities. Specifically, the CCR module enriches the dialogue context by retrieving relevant comments from other comment blocks, while the KVS module utilizes a spatial-temporal Transformer with cross-modal attention to highlight the most crucial information in the video. Moreover, we also build a large-scale user-level multimodal chitchat (UMC) dataset with exact comment-response interactions to better investigate RGVC. Extensive experiments demonstrate that our model effectively captures human points of interest and generates more fluent and diverse responses than state-of-the-art methods in both open and closed resources. These attributes make MIS particularly suitable for deployment in social robots, service automation, and other interactive robotic systems requiring real-time visual and linguistic inference.
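
As an illustration of how cross-modal attention can highlight key frames, the sketch below scores each frame feature against a text query with scaled dot-product attention and keeps the top-scoring frames. It is a simplified stand-in, not the KVS module itself: the spatial-temporal Transformer is omitted, the feature vectors are assumed to be precomputed, and `attention_weights` and `select_key_frames` are hypothetical names.

```python
import math


def attention_weights(query, frames):
    """Softmax over scaled dot products between a text query and frame features."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, frame)) / scale for frame in frames]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_key_frames(query, frames, k=2):
    """Indices of the k frames with the highest attention weight, in temporal order."""
    weights = attention_weights(query, frames)
    top = sorted(range(len(frames)), key=lambda i: weights[i], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    # Toy 2-D features; real features would come from text and vision encoders.
    text_query = [1.0, 0.0]
    frame_features = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5], [0.8, 0.3]]
    print(select_key_frames(text_query, frame_features, k=2))
```

Returning the indices in temporal order keeps the selected frames usable as a short ordered clip for a downstream decoder.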

Index terms

Representation Learning, Deep Learning Methods, AI-Based Methods