The Price Is Not Right: Neuro-Symbolic Methods Outperform VLAs on Structured Long-Horizon Manipulation Tasks with Significantly Lower Energy Consumption
Timothy R. Duggan, Pierrick Lorang, Hong Lu, Matthias Scheutz
AI summary
Problem
The effectiveness, reliability, and computational efficiency of Vision-Language-Action (VLA) models for structured, long-horizon manipulation tasks remain unclear, particularly regarding their substantial energy costs.
Approach
The authors conduct a head-to-head empirical comparison between a fine-tuned open-weight VLA model and a neuro-symbolic architecture that combines PDDL-based symbolic planning with learned low-level control on simulated Towers of Hanoi tasks.
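To make the symbolic-planning half of this approach concrete, here is a minimal sketch of a Towers of Hanoi planner. It is not the authors' implementation: a plain breadth-first search stands in for a PDDL planner, and the state encoding and the names `legal_moves` and `plan` are assumptions introduced for illustration. In a neuro-symbolic pipeline of the kind described, each abstract move in the returned plan would be handed to a learned low-level controller, which the sketch does not model.

```python
from collections import deque


def legal_moves(state):
    """Yield (src, dst, next_state) for every legal single-block move.

    A state is a tuple of pegs; each peg is a tuple of block sizes
    listed bottom to top (hypothetical encoding, not from the paper).
    """
    for src, src_peg in enumerate(state):
        if not src_peg:
            continue
        block = src_peg[-1]  # only the top block of a peg can move
        for dst, dst_peg in enumerate(state):
            if dst == src:
                continue
            if dst_peg and dst_peg[-1] < block:
                continue  # a larger block may not sit on a smaller one
            pegs = list(state)
            pegs[src] = src_peg[:-1]
            pegs[dst] = dst_peg + (block,)
            yield src, dst, tuple(pegs)


def plan(n_blocks, n_pegs=3):
    """Return a shortest list of (src, dst) moves via breadth-first search."""
    tower = tuple(range(n_blocks, 0, -1))  # largest block at the bottom
    start = (tower,) + ((),) * (n_pegs - 1)
    goal = ((),) * (n_pegs - 1) + (tower,)
    frontier = deque([start])
    parent = {start: None}  # state -> (previous state, move taken)
    while frontier:
        state = frontier.popleft()
        if state == goal:
            moves = []
            while parent[state] is not None:
                state, move = parent[state]
                moves.append(move)
            return moves[::-1]
        for src, dst, nxt in legal_moves(state):
            if nxt not in parent:
                parent[nxt] = (state, (src, dst))
                frontier.append(nxt)
    return None


if __name__ == "__main__":
    for n in (3, 4):
        print(n, "blocks:", len(plan(n)), "moves")
```

The same search handles both three and four blocks with no change beyond the `n_blocks` argument. This illustrates, in a simplified form, why a planner operating on explicit symbolic state can extend to a larger problem instance without retraining.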
Key results
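- The neuro-symbolic model achieves 95% success on the 3-block task versus 34% for the best-performing VLA
- The neuro-symbolic model reaches 78% success on an unseen 4-block variant, while the VLAs fail to complete the task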
- VLA fine-tuning consumes nearly two orders of magnitude more energy than the neuro-symbolic approach
- Explicit symbolic structure improves reliability, data efficiency, and energy efficiency
Why it matters
The findings highlight critical performance-efficiency trade-offs for roboticists and AI researchers choosing between end-to-end foundation models and structured reasoning architectures for complex manipulation tasks.
Abstract
Vision-Language-Action (VLA) models have recently been proposed as a pathway toward generalist robotic policies capable of interpreting natural language and visual inputs to generate manipulation actions. However, their effectiveness and efficiency on structured, long-horizon manipulation tasks remain unclear. In this work, we present a head-to-head empirical comparison between a fine-tuned open-weight VLA model (π0) and a neuro-symbolic architecture that combines PDDL-based symbolic planning with learned low-level control. We evaluate both approaches on structured variants of the Towers of Hanoi manipulation task in simulation while measuring both task performance and energy consumption during training and execution. On the 3-block task, the neuro-symbolic model achieves 95% success compared to 34% for the best-performing VLA. The neuro-symbolic model also generalizes to an unseen 4-block variant (78% success), whereas both VLAs fail to complete the task. During training, VLA fine-tuning consumes nearly two orders of magnitude more energy than the neuro-symbolic approach. These results highlight important trade-offs between end-to-end foundation-model approaches and structured reasoning architectures for long-horizon robotic manipulation, emphasizing the role of explicit symbolic structure in improving reliability, data efficiency, and energy efficiency. Code and models are available at https://price-is-not-right.github.io