ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking
Guangming Wang, Qizhen Ying, Yixiong Jing, Olaf Wysocki, Brian Sheil
AI summary
Problem
Classical robotic systems lack generalization and require extensive domain-specific coding, while data-driven Vision-Language-Action models struggle to scale due to the vast continuous action space compared to linguistic tokens.
Approach
The authors propose ActionReasoning, a multi-agent LLM framework that ingests accurate 3D environmental states and uses structured prompting to perform explicit, physics-guided action reasoning, decomposing manipulation tasks into specialized agents that generate and verify waypoints.
Key results
- A gated multi-agent LLM pipeline decomposes brick stacking into six specialized reasoning stages
- Stable brick placement achieved in simulation with improved robustness over classical controllers
- Successful generalization across stacking configurations without per-scene code
- Demonstration that high-level prompting and 3D spatial reasoning effectively bridge perception and execution
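The gating idea in the first result can be sketched as a chain of stages in which each stage must pass before the next one runs. This is a minimal illustration, not the authors' implementation: the summary does not name the six stages, so the two stages here (`propose_waypoint`, `verify_waypoint`), the state keys, and the `GatedPipeline` class are hypothetical, and plain Python functions stand in for the LLM agents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
# A stage receives the shared state and returns (gate_passed, updated_state).
Stage = Callable[[State], Tuple[bool, State]]


@dataclass
class GatedPipeline:
    """Runs stages in order and stops at the first stage whose gate fails."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def run(self, state: State) -> Tuple[bool, State, List[str]]:
        completed: List[str] = []
        for name, stage in self.stages:
            ok, state = stage(state)
            if not ok:
                # Gate failed: later stages never see this state.
                return False, state, completed
            completed.append(name)
        return True, state, completed


def propose_waypoint(state: State) -> Tuple[bool, State]:
    # Hypothetical stand-in for a generating agent: place the new brick's
    # centre half a brick height above the top face of the supporting brick.
    x, y, z_top = state["support_top_center"]
    new_state = dict(state)
    new_state["waypoint"] = (x, y, z_top + state["brick_height"] / 2)
    return True, new_state


def verify_waypoint(state: State) -> Tuple[bool, State]:
    # Hypothetical stand-in for a verifying agent: the brick's bottom face
    # must not sit below the support's top face (small numeric tolerance).
    bottom = state["waypoint"][2] - state["brick_height"] / 2
    return bottom >= state["support_top_center"][2] - 1e-9, state


pipeline = GatedPipeline(stages=[("propose", propose_waypoint),
                                 ("verify", verify_waypoint)])
ok, final_state, completed = pipeline.run(
    {"support_top_center": (0.0, 0.0, 0.5), "brick_height": 0.25}
)
```

Keeping each gate as a boolean return value makes it easy to see which stage rejected a plan, since `completed` lists only the stages that passed.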
Why it matters
This approach provides a scalable pathway for general-purpose robotic manipulation by leveraging LLMs for physical reasoning, reducing reliance on massive datasets and hand-coded controllers.
Abstract
Classical robotic systems typically rely on custom planners designed for constrained environments. While effective in restricted settings, these systems lack generalization capabilities, limiting the scalability of embodied AI and general-purpose robots. Recent data-driven Vision-Language-Action (VLA) approaches aim to learn policies from large-scale simulation and real-world data. However, the continuous action space of the physical world significantly exceeds the representational capacity of linguistic tokens, making it unclear if scaling data alone can yield general robotic intelligence. To address this gap, we propose ActionReasoning, an LLM-driven framework that performs explicit action reasoning to produce physics-consistent, prior-guided decisions for robotic manipulation. ActionReasoning leverages the physical priors and real-world knowledge already encoded in Large Language Models (LLMs) and structures them within a multi-agent architecture. We instantiate this framework on a tractable case study of brick stacking, where the environment states are assumed to be already accurately measured. The environmental states are then serialized and passed to a multi-agent LLM framework that generates physics-aware action plans. The experiments demonstrate that the proposed multi-agent LLM framework enables stable brick placement while shifting effort from low-level domain-specific coding to high-level tool invocation and prompting, highlighting its potential for broader generalization. This work introduces a promising approach to bridging perception and execution in robotic manipulation by integrating physical reasoning with LLMs.
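One way to picture the serialization step described in the abstract is as a conversion from measured brick poses to a JSON block embedded in a prompt. The sketch below is an assumption for illustration only: the abstract does not specify the state schema or prompt wording, so `BrickState`, its fields, `serialize_scene`, and `build_prompt` are all hypothetical names.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class BrickState:
    # Hypothetical schema: one measured brick pose and its dimensions.
    brick_id: str
    center_xyz_m: Tuple[float, float, float]
    size_xyz_m: Tuple[float, float, float]
    yaw_deg: float


def serialize_scene(bricks: List[BrickState]) -> str:
    """Serialize the measured scene to JSON with explicit units."""
    payload = {
        "units": {"length": "m", "angle": "deg"},
        "bricks": [asdict(b) for b in bricks],
    }
    # sort_keys gives a stable key order, so identical scenes yield identical text.
    return json.dumps(payload, sort_keys=True, indent=2)


def build_prompt(scene_json: str, task: str) -> str:
    # Hypothetical prompt layout; the actual prompts are not given in the abstract.
    return (
        f"Task: {task}\n"
        f"Scene state (JSON):\n{scene_json}\n"
        "Return the next waypoint as JSON."
    )


scene = [BrickState("brick_0", (0.0, 0.0, 0.025), (0.2, 0.1, 0.05), 0.0)]
scene_json = serialize_scene(scene)
prompt = build_prompt(scene_json, "Stack brick_1 on brick_0")
```

Stating the units inside the payload is a design choice in this sketch: it keeps the numeric state unambiguous for whichever agent reads it.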