ICRA 2026

NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, Jonas Frey

AI summary

Current vision-language models consistently lag behind human experts in predicting embodiment-specific navigation traces due to poor spatial grounding and goal localization.

Vision-language models, Embodied navigation, Benchmark, Semantic-aware scoring, Robotic evaluation, Spatial grounding

Problem

Evaluating VLM navigation capabilities is hindered by costly real-world trials, simplified simulations, and benchmarks that ignore cross-embodiment challenges and trace-level planning.

Approach

The authors introduce NaviTrace, a VQA benchmark where models predict 2D navigation paths for four embodiment types across 1,000 real-world images, evaluated with a novel semantic-aware scoring metric.

Key results

NaviTrace benchmark with 1,000 real-world scenarios and 3,000+ expert traces
Semantic-aware trace score strongly correlates with human preferences (Spearman ~0.87)
Comprehensive evaluation of eight SOTA VLMs reveals consistent gaps to human performance
Spatial grounding and goal localization identified as primary navigation bottlenecks
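
The Spearman figure above is a rank correlation between the metric's scores and human preference ratings. As a minimal sketch of how such a value is computed (the function names and the toy data are illustrative assumptions, not the authors' evaluation code), the ranks of both lists are compared with a Pearson correlation:

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: metric values and human ratings for five traces.
metric_values = [0.9, 0.4, 0.7, 0.2, 0.5]
human_ratings = [5, 2, 4, 1, 2]
rho = spearman(metric_values, human_ratings)
```

A value near 1 means the metric orders traces almost the same way as the human raters do; only the ordering matters, not the scale of either list.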

Why it matters

Provides a scalable, reproducible benchmark for assessing and advancing real-world robotic navigation capabilities in foundation models.

Abstract

Vision–language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models’ navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.
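
The abstract names three ingredients of the trace score but does not give the formula. The sketch below shows one way such a score could be assembled, purely as an illustration: the weighted sum, the weights, the length normalization, the class names, and the cost table are all assumptions, not the paper's definition. Lower values mean a better trace in this sketch.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 2D point sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean pixel distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def semantic_penalty(trace, label_map, class_cost):
    """Mean penalty of the semantic classes under the trace points.

    label_map is a row-major grid of class names (label_map[y][x]);
    class_cost maps class name -> penalty for one embodiment (hypothetical).
    """
    total = 0.0
    for x, y in trace:
        # Clamp to the image bounds so off-image points use the nearest pixel.
        row = min(max(int(round(y)), 0), len(label_map) - 1)
        col = min(max(int(round(x)), 0), len(label_map[0]) - 1)
        total += class_cost.get(label_map[row][col], 0.0)
    return total / len(trace)


def trace_score(pred, ref, label_map, class_cost, w_dtw=1.0, w_goal=1.0, w_sem=1.0):
    """Assumed weighted sum of DTW distance, endpoint error, and semantic penalty."""
    dtw = dtw_distance(pred, ref) / max(len(pred), len(ref))
    goal_error = math.dist(pred[-1], ref[-1])
    penalty = semantic_penalty(pred, label_map, class_cost)
    return w_dtw * dtw + w_goal * goal_error + w_sem * penalty


# Toy 1x2 "image": a wheeled robot is penalized for driving onto stairs.
label_map = [["path", "stairs"]]
wheeled_cost = {"stairs": 5.0}
reference = [(0, 0), (0, 0)]
prediction = [(0, 0), (1, 0)]
score = trace_score(prediction, reference, label_map, wheeled_cost)
```

Making the cost table depend on the embodiment is what lets the same predicted trace score differently for, say, a legged robot and a wheeled robot crossing the same stairs.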

Index terms

Vision-Based Navigation, Performance Evaluation and Benchmarking, Data Sets for Robot Learning