BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands
Seongwon Cho, Daechul Ahn, Donghyun Shin, Hyeonbeom Choi, San Kim, Jonghyun Choi
AI summary
Problem
Prior open-vocabulary mobile manipulation systems rely on intermittent, computationally expensive 3D scene reconstruction, leaving robots with limited environmental awareness between updates and causing missed objects, delayed error detection, and slower replanning.
Approach
BINDER separates strategic planning from continuous monitoring using a dual-process architecture: a Deliberative Response Module handles high-level task planning with 3D maps, while an Instant Response Module continuously analyzes video streams to detect changes and trigger on-demand updates or local corrections.
Key results
- Decouples continuous video monitoring from selective 3D reconstruction to maintain awareness without the cost of constant full-scene updates
- Implements bidirectional coordination where the planning module guides monitoring attention and the monitoring module triggers on-demand updates
- Achieves substantially higher success rates and task efficiency than state-of-the-art baselines in real-world dynamic environments
- Enables immediate grasp correction, early failure detection, and opportunistic replanning during navigation and manipulation
Why it matters
Provides a computationally efficient and robust perception strategy for real-world robotic deployment in dynamic, unstructured environments like homes and offices.
Abstract
Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation as the environment changes dynamically. However, most prior works update their world representation only at discrete milestones, such as waypoints or the end of an action step. Such sparse updates leave robots with limited awareness between updates, causing missed objects, delayed error detection, and slower replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that separates strategic planning from continuous environmental monitoring. BINDER combines a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a Video-LLM for continuous monitoring). The DRM handles strategic planning through structured 3D scene updates and guides the IRM’s focus, while the IRM processes video streams to update memory, proactively adjust actions, and trigger replanning when needed. This bidirectional coordination ensures continuous awareness without costly updates, enabling reliable and robust operation under dynamic conditions. We evaluate BINDER in three real-world environments where objects are moved during execution and show that it achieves substantially higher success rates and efficiency than state-of-the-art baselines, confirming its effectiveness for real-world deployment.
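The bidirectional coordination described above can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the class names `DeliberativeModule` and `InstantModule`, the function `run_episode`, and the dictionary-based "frames" are all hypothetical stand-ins. In the actual system the DRM is a multimodal LLM working from 3D scene updates and the IRM is a Video-LLM; here both are replaced by trivial rules so the message flow is visible: the planner sets the monitor's focus, and the monitor triggers replanning only when a focused object changes.

```python
from dataclasses import dataclass, field


@dataclass
class DeliberativeModule:
    """Hypothetical stand-in for the DRM: plans from a costly scene update."""
    map_updates: int = 0

    def plan(self, scene: dict, target: str):
        # Count each call as one expensive structured scene update.
        self.map_updates += 1
        steps = [("navigate", scene[target]), ("grasp", target)]
        focus = {target}  # objects the monitor should attend to
        return steps, focus


@dataclass
class InstantModule:
    """Hypothetical stand-in for the IRM: cheap per-frame monitoring."""
    focus: set = field(default_factory=set)
    memory: dict = field(default_factory=dict)

    def observe(self, frame: dict) -> bool:
        """Update memory; return True if a focused object has moved."""
        changed = False
        for name, location in frame.items():
            previous = self.memory.get(name)
            if name in self.focus and previous is not None and previous != location:
                changed = True
            self.memory[name] = location
        return changed


def run_episode(frames: list, target: str):
    """Plan once, monitor every frame, and replan only on a relevant change."""
    drm, irm = DeliberativeModule(), InstantModule()
    steps, irm.focus = drm.plan(frames[0], target)  # DRM guides IRM focus
    irm.observe(frames[0])
    replans = 0
    for frame in frames[1:]:
        if irm.observe(frame):  # IRM triggers an on-demand update
            steps, irm.focus = drm.plan(frame, target)
            replans += 1
    return steps, replans, drm.map_updates


if __name__ == "__main__":
    frames = [
        {"cup": "table", "book": "shelf"},
        {"cup": "table", "book": "desk"},     # unfocused object moves: ignored
        {"cup": "counter", "book": "desk"},   # focused object moves: replan
        {"cup": "counter", "book": "desk"},
    ]
    print(run_episode(frames, "cup"))
```

In this toy run the expensive planning step is invoked twice across four frames (once initially, once after the target moves), while the lightweight monitor inspects every frame. Moving an object outside the current focus does not trigger a replan, which mirrors the idea of the planner directing the monitor's attention.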