ICRA 2026

DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, Hang Zhao

AI summary

Explicitly integrating a pretrained depth expert into a Vision-Language-Action model significantly boosts spatial reasoning and manipulation success across real-world and simulated benchmarks.

Vision-Language-Action, Depth Estimation, Spatial Reasoning, Robot Manipulation, Mixture-of-Transformers, Embodied AI

Problem

Current Vision-Language-Action models struggle with precise 3D spatial reasoning and rely on inefficient large-scale action pretraining that still fails at fine-grained manipulation tasks.

Approach

DepthVLA uses a mixture-of-transformers architecture that fuses a pretrained depth expert with a VLM and an action expert through fully shared attention, enabling end-to-end spatial reasoning at a small inference-latency cost (about 20 ms).
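The shared-attention idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `Expert` class, the `mot_shared_attention` function, the single attention head, the common token width, and the absence of any attention mask are all simplifying assumptions made here for illustration. Each stream (VLM, depth, action) keeps its own projection weights, while attention is computed once over the concatenated tokens of all streams.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class Expert:
    """Hypothetical per-stream weights: one expert per modality."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))
        self.wo = rng.normal(0.0, scale, (dim, dim))


def mot_shared_attention(experts, token_groups):
    """One attention layer: separate weights per expert, joint attention.

    experts: list of Expert, one per stream.
    token_groups: list of arrays, each of shape (n_tokens_i, dim).
    Returns a list of arrays with the same shapes as token_groups.
    """
    # Each expert projects only its own tokens with its own weights.
    q = np.concatenate([x @ e.wq for e, x in zip(experts, token_groups)])
    k = np.concatenate([x @ e.wk for e, x in zip(experts, token_groups)])
    v = np.concatenate([x @ e.wv for e, x in zip(experts, token_groups)])

    # Attention runs over the full concatenated sequence, so every stream
    # can read from every other stream (no mask in this sketch).
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mixed = weights @ v

    # Split the result back per stream and apply each expert's own
    # output projection with a residual connection.
    outputs, start = [], 0
    for e, x in zip(experts, token_groups):
        n = x.shape[0]
        outputs.append(x + mixed[start:start + n] @ e.wo)
        start += n
    return outputs


rng = np.random.default_rng(0)
dim = 16
experts = [Expert(dim, rng) for _ in range(3)]  # VLM, depth, action
groups = [rng.normal(size=(n, dim)) for n in (10, 8, 4)]
outs = mot_shared_attention(experts, groups)
```

Because the queries of the action stream attend over depth and VLM keys in the same softmax, changing the depth tokens changes the action outputs, which is the property a shared-attention design relies on.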

Key results

78.5% real-world task progress vs. 65.0% for the baseline
94.9% on LIBERO (vs. 93.6%) and 74.8% on Simpler (vs. 58.8%)
Enables independent pretraining of semantic and spatial experts
Adds ~20 ms latency while boosting 3D perception
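
For a quick sense of scale, the absolute gains implied by the figures reported in the abstract can be computed directly. The dictionary keys and variable names below are illustrative labels chosen here, not identifiers from the paper.

```python
# (DepthVLA score, baseline score) in percent, taken from the abstract.
reported = {
    "real_world": (78.5, 65.0),
    "libero": (94.9, 93.6),
    "simpler": (74.8, 58.8),
}

# Absolute improvement in percentage points, rounded to one decimal.
gains = {
    name: round(ours - baseline, 1)
    for name, (ours, baseline) in reported.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The margin is largest on Simpler and in the real-world tasks, and smallest on LIBERO, where the baseline is already above 93%.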

Why it matters

Provides a scalable, efficient pathway for generalist robot policies to achieve the precise spatial understanding required for real-world manipulation.

Abstract

Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.

Index terms

AI-Enabled Robotics, Learning from Demonstration, Perception for Grasping and Manipulation