ICRA 2026

DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation

Xiongfeng Peng, Jiaqian Yu, Dingzhe Li, Yixiang Jin, Lu Xu, Yamin Mao, Chao Zhang, Weiming Li, Sujin Jang, Dongwook Lee, Daehyun Ji

AI summary

DAM-VLA dynamically routes between specialized arm and gripper diffusion models guided by VLM reasoning, achieving superior success rates in simulation and real-world manipulation tasks.

Vision-Language-Action, Robot Manipulation, Diffusion Policy, Action Routing, Dynamic Coordination, Robotic Generalization

Problem

Current Vision-Language-Action frameworks struggle to balance broad task generalization with the precise, task-specific control required for complex robotic manipulation in dynamic environments.

Approach

The framework leverages a VLM to route actions between two specialized diffusion models for arm movement and gripper manipulation, using a dual-scale weighting mechanism to dynamically coordinate them based on visual and linguistic cues.
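The routing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the 7-dimensional action layout, the stub `arm_head` and `gripper_head` functions (stand-ins for the two diffusion models), and the `route_and_blend` helper are all hypothetical names chosen for illustration. The single softmax gate used here is also an assumption; the summary does not specify how the dual-scale weighting is computed, so that mechanism is not modeled.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical action layout: 6 arm pose deltas followed by 1 gripper command.
Action = List[float]
Head = Callable[[Optional[object]], Action]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def arm_head(obs: Optional[object]) -> Action:
    """Stub for the arm-movement model (a diffusion model in the paper)."""
    return [0.1, 0.0, -0.05, 0.0, 0.0, 0.0, 0.0]


def gripper_head(obs: Optional[object]) -> Action:
    """Stub for the gripper-manipulation model (a diffusion model in the paper)."""
    return [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]


def route_and_blend(
    router_logits: Sequence[float],
    obs: Optional[object],
    heads: Sequence[Head],
) -> Tuple[List[float], Action]:
    """Turn router logits into weights and blend the heads' actions.

    Assumption: router_logits would come from the VLM's reading of the
    image and instruction; here they are passed in directly.
    """
    weights = softmax(router_logits)
    actions = [head(obs) for head in heads]
    dim = len(actions[0])
    blended = [
        sum(w * a[i] for w, a in zip(weights, actions)) for i in range(dim)
    ]
    return weights, blended


if __name__ == "__main__":
    # Strongly favor the arm model, as one might during a reaching phase.
    weights, action = route_and_blend([4.0, 0.0], None, [arm_head, gripper_head])
    print("weights:", weights)
    print("action:", action)
```

A soft blend is used so that the transition between arm motion and gripper actuation is gradual; a hard router would instead pick the head with the largest weight.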

Key results

Superior average success rates over SOTA VLA methods in SIMPLER and FurnitureBench simulations
Robust generalization to long-horizon and contact-rich tasks
Validated effectiveness in real-world pick-and-place experiments
Novel action routing and dual-scale weighting for precise dynamic coordination

Why it matters

Enables robots to seamlessly transition between gross motion and precise manipulation, advancing the deployment of general-purpose manipulation agents in dynamic real-world settings.

Abstract

In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulations to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism, using task-specific visual and linguistic cues to select appropriate action models (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves superior success rates compared to state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.

1 Xiongfeng Peng, Jiaqian Yu, Dingzhe Li, Yixiang Jin, Lu Xu, Yamin Mao, Chao Zhang, and Weiming Li are with Advanced Research Lab, Samsung R&D Institute China-Beijing (SRCB), China
2 Sujin Jang, Dongwook Lee, and Daehyun Ji are with Samsung AI Center, DS Division, South Korea
3 Sujin Jang is also with Hanyang University ERICA, South Korea

Index terms

Deep Learning in Grasping and Manipulation, Manipulation Planning, Grasping