ICRA 2026

ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking

Guangming Wang, Qizhen Ying, Yixiong Jing, Olaf Wysocki, Brian Sheil

AI summary

A multi-agent LLM framework enables stable, physics-consistent robotic manipulation by replacing low-level coding with high-level 3D spatial reasoning.

LLM robotics, 3D action reasoning, multi-agent planning, robotic manipulation, brick stacking, physics-guided control

Problem

Classical robotic systems generalize poorly and require extensive domain-specific coding, while data-driven Vision-Language-Action (VLA) models struggle to scale because the continuous action space of the physical world far exceeds what linguistic tokens can represent.

Approach

The authors propose ActionReasoning, a multi-agent LLM framework that takes serialized 3D environment states (assumed to be accurately measured) as input and uses structured prompting to perform explicit, physics-guided action reasoning. Manipulation tasks are decomposed across specialized agents that generate and verify waypoints.
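
The summary does not give the paper's state format or verification rules, so the following is only a minimal sketch of the idea: measured brick states are serialized to text for a prompt, and a candidate placement is checked against a simple physical rule. The `Brick` record, `serialize_scene`, and `is_supported` are hypothetical names, and the support test (center of the upper brick inside the footprint of the lower one, for axis-aligned bricks) is an assumed stand-in, not the authors' method.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Brick:
    # Hypothetical state record: center position and size in meters, axis-aligned.
    name: str
    x: float
    y: float
    z: float
    length: float  # extent along x
    width: float   # extent along y
    height: float  # extent along z


def serialize_scene(bricks):
    """Turn measured brick states into a JSON string that can be placed in a prompt."""
    return json.dumps({"bricks": [asdict(b) for b in bricks]}, sort_keys=True)


def is_supported(top, base, tol=1e-6):
    """Assumed stability rule: the top brick rests on the base's upper face
    and its center lies within the base's footprint."""
    resting = abs((top.z - top.height / 2) - (base.z + base.height / 2)) <= tol
    inside_x = abs(top.x - base.x) <= base.length / 2
    inside_y = abs(top.y - base.y) <= base.width / 2
    return resting and inside_x and inside_y
```

A verifier agent could call a check like `is_supported` on each proposed placement before the waypoint is passed on for execution; a real system would need a fuller stability model.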

Key results

A gated multi-agent LLM pipeline decomposes brick stacking into six specialized reasoning stages
Stable brick placement achieved in simulation with improved robustness over classical controllers
Successful generalization across stacking configurations without per-scene code
Demonstration that high-level prompting and 3D spatial reasoning effectively bridge perception and execution
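
The gating described in the results above can be sketched as a sequence of stages where each stage must pass before the next one runs. This is an illustrative sketch only: the summary does not name the six stages, so the two stages here (`propose_waypoint`, `check_clearance`) are hypothetical stand-ins for LLM agents, and the 0.10 m approach offset and `min_height` threshold are assumed values.

```python
from typing import Callable, List, Optional, Tuple

# A stage takes the shared context and returns (passed, updated_context).
Stage = Callable[[dict], Tuple[bool, dict]]


def run_gated(stages: List[Tuple[str, Stage]], context: dict) -> Tuple[bool, Optional[str], dict]:
    """Run stages in order and stop at the first stage whose gate fails.

    Returns (success, name_of_failed_stage_or_None, final_context).
    """
    for name, stage in stages:
        passed, context = stage(context)
        if not passed:
            return False, name, context
    return True, None, context


def propose_waypoint(ctx: dict) -> Tuple[bool, dict]:
    # Hypothetical generator: approach point 0.10 m above the target (assumed offset).
    tx, ty, tz = ctx["target"]
    return True, {**ctx, "waypoint": (tx, ty, tz + 0.10)}


def check_clearance(ctx: dict) -> Tuple[bool, dict]:
    # Hypothetical gate: the approach waypoint must be at or above a minimum safe height.
    return ctx["waypoint"][2] >= ctx["min_height"], ctx
```

Calling `run_gated([("propose", propose_waypoint), ("clearance", check_clearance)], {"target": (0.0, 0.0, 0.06), "min_height": 0.1})` succeeds, while raising `min_height` to 0.5 makes the run stop at the `clearance` stage and report its name.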

Why it matters

This approach provides a scalable pathway for general-purpose robotic manipulation by leveraging LLMs for physical reasoning, reducing reliance on massive datasets and hand-coded controllers.

Abstract

Classical robotic systems typically rely on custom planners designed for constrained environments. While effective in restricted settings, these systems lack generalization capabilities, limiting the scalability of embodied AI and general-purpose robots. Recent data-driven Vision-Language-Action (VLA) approaches aim to learn policies from large-scale simulation and real-world data. However, the continuous action space of the physical world significantly exceeds the representational capacity of linguistic tokens, making it unclear if scaling data alone can yield general robotic intelligence. To address this gap, we propose ActionReasoning, an LLM-driven framework that performs explicit action reasoning to produce physics-consistent, prior-guided decisions for robotic manipulation. ActionReasoning leverages the physical priors and real-world knowledge already encoded in Large Language Models (LLMs) and structures them within a multi-agent architecture. We instantiate this framework on a tractable case study of brick stacking, where the environment states are assumed to be already accurately measured. The environmental states are then serialized and passed to a multi-agent LLM framework that generates physics-aware action plans. The experiments demonstrate that the proposed multi-agent LLM framework enables stable brick placement while shifting effort from low-level domain-specific coding to high-level tool invocation and prompting, highlighting its potential for broader generalization. This work introduces a promising approach to bridging perception and execution in robotic manipulation by integrating physical reasoning with LLMs.

Index terms

Assembly, Compliant Assembly, Intelligent and Flexible Manufacturing