ICRA 2026

The Price Is Not Right: Neuro-Symbolic Methods Outperform VLAs on Structured Long-Horizon Manipulation Tasks with Significantly Lower Energy Consumption

Timothy R. Duggan, Pierrick Lorang, Hong Lu, Matthias Scheutz

AI summary

On simulated Towers of Hanoi manipulation tasks, a neuro-symbolic model outperforms fine-tuned Vision-Language-Action models (95% versus 34% success on the 3-block task), and its training consumes nearly two orders of magnitude less energy than VLA fine-tuning.

Neuro-symbolic robotics; Vision-Language-Action models; Long-horizon manipulation; Energy efficiency; Symbolic planning; Towers of Hanoi

Problem

The effectiveness, reliability, and computational efficiency of Vision-Language-Action (VLA) models for structured, long-horizon manipulation tasks remain unclear, particularly regarding their substantial energy costs.

Approach

The authors conduct a head-to-head empirical comparison between a fine-tuned open-weight VLA model and a neuro-symbolic architecture that combines PDDL-based symbolic planning with learned low-level control on simulated Towers of Hanoi tasks.
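To illustrate the symbolic-planning half of such an architecture, here is a minimal sketch of a breadth-first planner for Towers of Hanoi. It is not the authors' PDDL domain or planner; the state encoding and the names `legal_moves` and `plan` are assumptions made for illustration. In a neuro-symbolic system, each symbolic move returned by a planner like this would be handed to a learned low-level controller for execution.

```python
from collections import deque


def legal_moves(state):
    """Yield (src, dst, next_state) for every legal move.

    state is a tuple of three tuples; each inner tuple lists block sizes
    from bottom to top on one peg (hypothetical encoding).
    """
    for src, src_stack in enumerate(state):
        if not src_stack:
            continue
        block = src_stack[-1]  # only the top block can be moved
        for dst, dst_stack in enumerate(state):
            if dst == src:
                continue
            # A block may not be placed on a smaller block.
            if dst_stack and dst_stack[-1] < block:
                continue
            pegs = list(state)
            pegs[src] = src_stack[:-1]
            pegs[dst] = dst_stack + (block,)
            yield src, dst, tuple(pegs)


def plan(n_blocks, start_peg=0, goal_peg=2):
    """Return a shortest list of (src, dst) moves via breadth-first search."""
    tower = tuple(range(n_blocks, 0, -1))  # largest block at the bottom
    start = tuple(tower if i == start_peg else () for i in range(3))
    goal = tuple(tower if i == goal_peg else () for i in range(3))

    parents = {start: None}
    frontier = deque([start])
    while frontier:
        state = frontier.popleft()
        if state == goal:
            break
        for src, dst, nxt in legal_moves(state):
            if nxt not in parents:
                parents[nxt] = (state, (src, dst))
                frontier.append(nxt)

    # Walk back from the goal to recover the move sequence.
    moves = []
    state = goal
    while parents[state] is not None:
        state, move = parents[state]
        moves.append(move)
    return moves[::-1]


if __name__ == "__main__":
    for n in (3, 4):
        print(n, "blocks:", len(plan(n)), "moves")
```

Because the rules are encoded explicitly, the same planner handles the 4-block case without any change, which is the kind of structural generalization the comparison examines.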

Key results

The neuro-symbolic model reaches 95% success on the 3-block task, versus 34% for the best-performing VLA
The neuro-symbolic model reaches 78% success on an unseen 4-block variant, which both VLAs fail to complete
During training, VLA fine-tuning consumes nearly two orders of magnitude more energy than the neuro-symbolic approach
Explicit symbolic structure improves reliability, data efficiency, and energy efficiency
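
A comparison like the energy result above rests on two small calculations: integrating sampled power draw into energy, and expressing a ratio in orders of magnitude. The sketch below shows both under stated assumptions. The function names and all numbers are hypothetical and are not measurements from the paper, which does not describe its measurement procedure in this summary.

```python
import math


def energy_joules(timestamps_s, power_w):
    """Integrate power samples (watts) over time (seconds), trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def orders_of_magnitude(larger, smaller):
    """Return log10 of the ratio between two positive quantities."""
    return math.log10(larger / smaller)


if __name__ == "__main__":
    # Hypothetical power traces, for illustration only.
    run_a = energy_joules([0, 10, 20], [300.0, 300.0, 300.0])
    run_b = energy_joules([0, 10, 20], [4.0, 4.0, 4.0])
    print(run_a, run_b, orders_of_magnitude(run_a, run_b))
```

On this scale, "nearly two orders of magnitude" corresponds to a ratio approaching 100.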

Why it matters

The results highlight performance and efficiency trade-offs that roboticists and AI researchers face when choosing between end-to-end foundation models and structured reasoning architectures for long-horizon manipulation tasks.

Abstract

Vision-Language-Action (VLA) models have recently been proposed as a pathway toward generalist robotic policies capable of interpreting natural language and visual inputs to generate manipulation actions. However, their effectiveness and efficiency on structured, long-horizon manipulation tasks remain unclear. In this work, we present a head-to-head empirical comparison between a fine-tuned open-weight VLA model (π0) and a neuro-symbolic architecture that combines PDDL-based symbolic planning with learned low-level control. We evaluate both approaches on structured variants of the Towers of Hanoi manipulation task in simulation while measuring both task performance and energy consumption during training and execution. On the 3-block task, the neuro-symbolic model achieves 95% success compared to 34% for the best-performing VLA. The neuro-symbolic model also generalizes to an unseen 4-block variant (78% success), whereas both VLAs fail to complete the task. During training, VLA fine-tuning consumes nearly two orders of magnitude more energy than the neuro-symbolic approach. These results highlight important trade-offs between end-to-end foundation-model approaches and structured reasoning architectures for long-horizon robotic manipulation, emphasizing the role of explicit symbolic structure in improving reliability, data efficiency, and energy efficiency. Code and models are available at https://price-is-not-right.github.io

Index terms

Deep Learning in Grasping and Manipulation; Task and Motion Planning; Perception for Grasping and Manipulation