DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation
Xiongfeng Peng, Jiaqian Yu, Dingzhe Li, Yixiang Jin, Lu Xu, Yamin Mao, Chao Zhang, Weiming Li, Sujin Jang, Dongwook Lee, Daehyun Ji
AI summary
Problem
Current Vision-Language-Action frameworks struggle to balance broad task generalization with the precise, task-specific control required for complex robotic manipulation in dynamic environments.
Approach
The framework leverages a VLM to route actions between two specialized diffusion models for arm movement and gripper manipulation, using a dual-scale weighting mechanism to dynamically coordinate them based on visual and linguistic cues.
Key results
- Superior average success rates over SOTA VLA methods in SIMPLER and FurnitureBench simulations
- Robust generalization to long-horizon and contact-rich tasks
- Validated effectiveness in real-world pick-and-place experiments
- Novel action routing and dual-scale weighting for precise dynamic coordination
Why it matters
Enables robots to seamlessly transition between gross motion and precise manipulation, advancing the deployment of general-purpose manipulation agents in dynamic real-world settings.
Abstract
In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulations to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism, using task-specific visual and linguistic cues to select appropriate action models (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves superior success rates compared to state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.
1 Xiongfeng Peng, Jiaqian Yu, Dingzhe Li, Yixiang Jin, Lu Xu, Yamin Mao, Chao Zhang, and Weiming Li are with Advanced Research Lab, Samsung R&D Institute China-Beijing (SRCB), China.
2 Sujin Jang, Dongwook Lee, and Daehyun Ji are with Samsung AI Center, DS Division, South Korea.
3 Sujin Jang is also with Hanyang University ERICA, South Korea.
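To make the routing and weighting ideas in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify how routing scores or weights are computed, so everything here is an assumption: the function names `route` and `blend_actions` are hypothetical, the "dual-scale" weighting is interpreted as a coarse per-chunk routing weight combined with a fine per-step gripper preference, and the two action chunks stand in for the outputs of the arm-movement and gripper-manipulation diffusion models.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits: Sequence[float]) -> List[float]:
    """Hypothetical router: turn two scores (arm, gripper) into weights.

    In a real system these scores would come from visual and linguistic
    cues; here they are plain numbers supplied by the caller.
    """
    if len(logits) != 2:
        raise ValueError("expected one score each for arm and gripper")
    return softmax(logits)


def blend_actions(
    arm_chunk: Sequence[Sequence[float]],
    gripper_chunk: Sequence[Sequence[float]],
    route_weights: Sequence[float],
    step_gripper_pref: Sequence[float],
) -> List[List[float]]:
    """Blend two action chunks using a coarse and a fine weight (assumed).

    route_weights     : coarse, per-chunk weights (arm, gripper), both > 0.
    step_gripper_pref : fine, per-step preference in [0, 1] for the
                        gripper model's prediction.
    """
    w_arm, w_grip = route_weights
    blended = []
    for arm_a, grip_a, g in zip(arm_chunk, gripper_chunk, step_gripper_pref):
        a = w_arm * (1.0 - g)   # combined weight for the arm model
        b = w_grip * g          # combined weight for the gripper model
        norm = a + b            # > 0 when both coarse weights are > 0
        blended.append([(a * x + b * y) / norm for x, y in zip(arm_a, grip_a)])
    return blended


# Usage: equal routing scores, with the per-step preference shifting
# from the arm model to the gripper model over a three-step chunk.
weights = route([0.0, 0.0])
actions = blend_actions(
    arm_chunk=[[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    gripper_chunk=[[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
    route_weights=weights,
    step_gripper_pref=[0.0, 0.5, 1.0],
)
```

In this sketch the coarse weight decides which model dominates a whole chunk, while the fine weight lets the blend shift step by step, for example as the end effector approaches an object. Whether DAM-VLA combines its two scales in this way is not stated in the abstract.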