MolmoAct: Action Reasoning Models That Can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna
AI summary
Problem
Current robotic foundation models map perception directly to control, limiting adaptability, generalization, and interpretability. They lack structural inductive biases needed for spatial reasoning and purposeful action.
Approach
MolmoAct autoregressively predicts three structured chains in order: depth perception tokens for 3D scene understanding, visual reasoning traces for 2D trajectory planning, and action tokens for robot control.
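As a rough illustration of this decoding order, the Python sketch below generates the three chains one after another from a single growing context, so each chain is conditioned on the prompt and on every chain before it. It is not the released MolmoAct code: `decode_chains`, `make_scripted_model`, the stage names, and the end-of-stage sentinel tokens are all hypothetical, and a scripted stand-in replaces the real model.

```python
from typing import Callable, Dict, Iterable, List, Sequence

# Stage order follows the description above; the names themselves are assumptions.
STAGES = ("depth", "trace", "action")


def decode_chains(
    next_token: Callable[[Sequence[int]], int],
    prompt: Sequence[int],
    end_tokens: Dict[str, int],
    max_tokens_per_stage: int = 32,
) -> Dict[str, List[int]]:
    """Autoregressively decode one chain per stage from a shared context."""
    context = list(prompt)  # encoded observation + instruction (assumed token ids)
    chains: Dict[str, List[int]] = {}
    for stage in STAGES:
        chain: List[int] = []
        for _ in range(max_tokens_per_stage):
            tok = next_token(context)
            context.append(tok)  # later stages see everything generated so far
            if tok == end_tokens[stage]:  # hypothetical end-of-stage sentinel
                break
            chain.append(tok)
        chains[stage] = chain
    return chains


def make_scripted_model(script: Iterable[int]) -> Callable[[Sequence[int]], int]:
    """Toy stand-in for a real model: replays a fixed token script."""
    it = iter(script)

    def next_token(context: Sequence[int]) -> int:
        return next(it)

    return next_token


# Demo: 100, 101, and 102 are made-up sentinels that close each stage.
model = make_scripted_model([11, 12, 100, 21, 22, 23, 101, 31, 102])
chains = decode_chains(
    model,
    prompt=[1, 2],
    end_tokens={"depth": 100, "trace": 101, "action": 102},
)
print(chains)
```

With the scripted stand-in, the depth chain is `[11, 12]`, the trace chain is `[21, 22, 23]`, and the action chain is `[31]`. Keeping one shared context is the point of the sketch: the action tokens are produced only after the depth and trace tokens are already in the context.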
Key results
- 70.5% zero-shot accuracy on SimplerEnv Visual Matching
- 86.6% average success rate on LIBERO benchmark
- +10% to +22.7% real-world fine-tuning gains over π0-FAST
- Release of 10k-trajectory MOLMOACT DATASET boosting training performance by +5.5%
Why it matters
Provides an open, interpretable blueprint for building robotic foundation models that transform perception into purposeful, steerable action, advancing embodied AI research and deployment.
Abstract
Reasoning is essential for purposeful action, yet most robotic foundation models map perception and instructions directly to control, limiting adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), which integrate perception, planning, and control through a structured three-stage pipeline. Our model, MOLMOACT, encodes observations and instructions into depth perception tokens, generates 2D spatial plans, and predicts fine-grained actions, enabling explainable and steerable behavior. MOLMOACT-7B-D achieves 70.5% zero-shot accuracy on SimplerEnv Visual Matching (surpassing π0 and GR00T N1.5), 86.6% average success on LIBERO, and real-world fine-tuning gains of +10% (single-arm) and +22.7% (bimanual) over π0-FAST. It further improves out-of-distribution generalization by +23.3% and ranks highest in human-preference evaluations for open-instruction following and trajectory steering. We also release MOLMOACT DATASET, a dataset of 10k diverse robot trajectories that yields an average +5.5% performance boost when used for training. Together with open model weights and code, this establishes MOLMOACT as a state-of-the-art robotic foundation model and an open blueprint for building ARMs that transform perception into grounded, purposeful action. Further experimental details and results with MOLMOACT DATASET and human-preference evaluations are included in the supplementary video.
1 Allen Institute for AI, Seattle WA
2 University of Washington, Seattle WA
*Equal contribution. {jason328,duanj1,hqfang}@uw.edu
†Core Contributors