ICRA 2026

MolmoAct: Action Reasoning Models That Can Reason in Space

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna

AI summary

MOLMOACT achieves state-of-the-art robotic manipulation results by generating explicit depth and trajectory reasoning tokens before predicting actions, making its control interpretable and steerable.

Action Reasoning Models, Robotic Foundation Models, Spatial Reasoning, Vision-Language-Action, Steerable Robotics, Open-Source AI

Problem

Current robotic foundation models map perception directly to control, which limits adaptability, generalization, and interpretability. They lack the structural inductive biases needed for spatial reasoning and purposeful action.

Approach

The model autoregressively predicts three structured chains: depth perception tokens for 3D scene understanding, visual reasoning traces for 2D trajectory planning, and action tokens for robot control.
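The chained structure described above can be illustrated with a minimal sketch. This is not the released MOLMOACT code: the `run_chain` function, the `ChainOutput` type, the token strings, and the three toy decoders are all hypothetical stand-ins, assumed here only to show how each stage conditions on the tokens produced by the previous one. The optional `trace_override` argument is likewise an assumption about how a user-supplied 2D trajectory could replace the predicted one to steer the action stage.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-in for one autoregressive decoding stage:
# it reads the context so far and returns the next chunk of tokens.
Decoder = Callable[[Sequence[str]], List[str]]


@dataclass
class ChainOutput:
    depth_tokens: List[str]   # stage 1: depth perception tokens
    trace_tokens: List[str]   # stage 2: 2D trajectory waypoints
    action_tokens: List[str]  # stage 3: robot control tokens


def run_chain(
    observation: str,
    instruction: str,
    decode_depth: Decoder,
    decode_trace: Decoder,
    decode_action: Decoder,
    trace_override: Optional[List[str]] = None,
) -> ChainOutput:
    """Run the three stages in order, each conditioned on all earlier tokens."""
    context: List[str] = [f"<obs:{observation}>", f"<inst:{instruction}>"]

    depth = decode_depth(context)
    context = context + depth

    # Assumed steering hook: a caller-provided trace replaces the predicted one.
    trace = trace_override if trace_override is not None else decode_trace(context)
    context = context + trace

    actions = decode_action(context)
    return ChainOutput(depth, trace, actions)


# Toy decoders (illustrative only, not a real model).
def toy_depth(ctx: Sequence[str]) -> List[str]:
    return ["<depth_3>", "<depth_7>"]


def toy_trace(ctx: Sequence[str]) -> List[str]:
    return ["<pt_10_20>", "<pt_30_40>"]


def toy_action(ctx: Sequence[str]) -> List[str]:
    # Toy rule: emit one action token per trajectory waypoint in the context.
    n_waypoints = sum(tok.startswith("<pt_") for tok in ctx)
    return [f"<act_{i}>" for i in range(n_waypoints)]


default = run_chain("frame_0", "pick up the cup", toy_depth, toy_trace, toy_action)
steered = run_chain(
    "frame_0",
    "pick up the cup",
    toy_depth,
    toy_trace,
    toy_action,
    trace_override=["<pt_5_5>", "<pt_6_9>", "<pt_8_12>"],
)
print(default.action_tokens)
print(steered.action_tokens)
```

In this toy setup, changing only the trace changes the actions, because the action stage reads the trace from its context. A real model would use a single autoregressive decoder over a shared vocabulary rather than three separate functions.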

Key results

70.5% zero-shot accuracy on SimplerEnv Visual Matching
86.6% average success rate on LIBERO benchmark
+10% (single-arm) and +22.7% (bimanual) real-world fine-tuning gains over π0-FAST
Release of MOLMOACT DATASET (10k robot trajectories), which yields an average +5.5% performance boost when used for training

Why it matters

Provides an open, interpretable blueprint for building robotic foundation models that transform perception into purposeful, steerable action, advancing embodied AI research and deployment.

Abstract

Reasoning is essential for purposeful action, yet most robotic foundation models map perception and instructions directly to control, limiting adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), which integrate perception, planning, and control through a structured three-stage pipeline. Our model, MOLMOACT, encodes observations and instructions into depth perception tokens, generates 2D spatial plans, and predicts fine-grained actions, enabling explainable and steerable behavior. MOLMOACT-7B-D achieves 70.5% zero-shot accuracy on SimplerEnv Visual Matching (surpassing π0 and GR00T N1.5), 86.6% average success on LIBERO, and real-world fine-tuning gains of +10% (single-arm) and +22.7% (bimanual) over π0-FAST. It further improves out-of-distribution generalization by +23.3% and ranks highest in human-preference evaluations for open-instruction following and trajectory steering. We also release MOLMOACT DATASET, a dataset of 10k diverse robot trajectories that yields an average +5.5% performance boost when used for training. Together with open model weights and code, this establishes MOLMOACT as a state-of-the-art robotic foundation model and an open blueprint for building ARMs that transform perception into grounded, purposeful action. Further experimental details and results with MOLMOACT DATASET and human-preference evaluations are included in the supplementary video.

1 Allen Institute for AI, Seattle, WA. 2 University of Washington, Seattle, WA. *Equal contribution. †Core contributors. {jason328,duanj1,hqfang}@uw.edu

Index terms

Big Data in Robotics and Automation, Imitation Learning, Representation Learning