ICRA 2026

BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands

Seongwon Cho, Daechul Ahn, Donghyun Shin, Hyeonbeom Choi, San Kim, Jonghyun Choi

AI summary

Decoupling continuous video monitoring from selective 3D reconstruction via a dual-process framework significantly boosts success rates and efficiency in dynamic open-vocabulary mobile manipulation.

open-vocabulary mobile manipulation, dual-process reasoning, continuous environmental monitoring, 3D scene reconstruction, video-LLM, dynamic task adaptation

Problem

Prior open-vocabulary mobile manipulation systems rely on intermittent, computationally expensive 3D scene reconstruction, leaving robots with limited environmental awareness between updates and causing missed objects, delayed error detection, and slower replanning.

Approach

BINDER separates strategic planning from continuous monitoring using a dual-process architecture: a Deliberative Response Module handles high-level task planning with 3D maps, while an Instant Response Module continuously analyzes video streams to detect changes and trigger on-demand updates or local corrections.
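
The coordination described above can be sketched as a small control loop. This is an illustrative toy, not the authors' implementation: the class names, the `Signal` enum, the `(room, offset)` location encoding, and the rule that a room change forces replanning are all assumptions made for the example, and simple rules stand in for the multimodal LLM and the Video-LLM.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Signal(Enum):
    CONTINUE = auto()          # nothing relevant changed
    LOCAL_CORRECTION = auto()  # small fix, e.g. re-aim a grasp
    REPLAN = auto()            # the plan's assumptions no longer hold


@dataclass
class Plan:
    steps: list
    watch_for: set  # objects the planner asks the monitor to attend to


class DeliberativeModule:
    """Hypothetical stand-in for the slow planner (an MLLM with a 3D map in the paper)."""

    def __init__(self):
        self.map_updates = 0

    def plan(self, target, scene):
        self.map_updates += 1  # each call models one costly scene update
        steps = [("navigate", scene[target]), ("grasp", target)]
        return Plan(steps=steps, watch_for={target})


class InstantModule:
    """Hypothetical stand-in for the fast monitor (a Video-LLM in the paper)."""

    def __init__(self):
        self.focus = set()
        self.memory = {}

    def set_focus(self, objects):
        self.focus = set(objects)

    def observe(self, frame):
        worst = Signal.CONTINUE
        for obj, loc in frame.items():
            if obj not in self.focus:
                continue  # planner-guided attention: ignore unrelated objects
            prev = self.memory.get(obj)
            self.memory[obj] = loc  # memory is updated on every frame
            if prev is None or prev == loc:
                continue
            # Assumed rule: same room means a local fix, new room means replan.
            found = Signal.LOCAL_CORRECTION if prev[0] == loc[0] else Signal.REPLAN
            if found.value > worst.value:
                worst = found
        return worst


def run_episode(target, frames):
    drm, irm = DeliberativeModule(), InstantModule()
    plan = drm.plan(target, frames[0])
    irm.set_focus(plan.watch_for)       # planner guides the monitor
    signals = []
    for frame in frames:
        signal = irm.observe(frame)
        signals.append(signal)
        if signal is Signal.REPLAN:     # monitor triggers an on-demand update
            plan = drm.plan(target, frame)
            irm.set_focus(plan.watch_for)
    return signals, drm.map_updates


frames = [
    {"mug": ("kitchen", 0), "book": ("office", 2)},
    {"mug": ("kitchen", 0), "book": ("office", 2)},
    {"mug": ("kitchen", 1), "book": ("office", 2)},  # mug nudged
    {"mug": ("office", 1), "book": ("office", 2)},   # mug carried to another room
    {"mug": ("office", 1), "book": ("office", 3)},   # unrelated object moves
]
signals, map_updates = run_episode("mug", frames)
print([s.name for s in signals], map_updates)
```

In this toy run the monitor inspects every frame, but the costly planner is invoked only twice: once at the start and once when the watched object leaves the room. The small shift is handled as a local correction, and the unrelated book movement is ignored.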

Key results

Decouples continuous video monitoring from selective 3D reconstruction to maintain awareness between map updates without the cost of frequent full scene updates
Implements bidirectional coordination where the planning module guides monitoring attention and the monitoring module triggers on-demand updates
Achieves substantially higher success rates and task efficiency than state-of-the-art baselines in three real-world environments where objects are moved during execution
Enables immediate grasp correction, early failure detection, and opportunistic replanning during navigation and manipulation

Why it matters

Provides a computationally efficient and robust perception strategy for real-world robotic deployment in dynamic, unstructured environments like homes and offices.

Abstract

Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation as the environment changes dynamically. However, most prior works update their world representation only at discrete milestones, such as waypoints or the end of an action step. Such sparse updates leave robots with limited awareness between updates, causing missed objects, delayed error detection, and slower replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that separates strategic planning from continuous environmental monitoring. BINDER combines a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a Video-LLM for continuous monitoring). The DRM handles strategic planning through structured 3D scene updates and guides the IRM’s focus, while the IRM processes video streams to update memory, proactively adjust actions, and trigger replanning when needed. This bidirectional coordination ensures continuous awareness without costly updates, enabling reliable and robust operation under dynamic conditions. We evaluate BINDER in three real-world environments where objects are moved during execution and show that it achieves substantially higher success rates and efficiency than state-of-the-art baselines, confirming its effectiveness for real-world deployment.

Index terms

Mobile Manipulation; Task Planning; Deep Learning in Grasping and Manipulation