ICRA 2026

FastViDAR: Real-Time Omnidirectional Depth Estimation Via Alternative Hierarchical Attention

Hangtian ZHAO, Xiang Chen, Yizhe Li, Qianhao Wang, Haibo Lu, Fei Gao


AI summary

FastViDAR estimates accurate 360° depth in real time from multiple fisheye cameras on embedded hardware by replacing heavy cost volumes with an efficient hierarchical attention mechanism.
Omnidirectional depth · Alternative hierarchical attention · Equirectangular projection · Real-time inference · Multi-fisheye fusion · Embedded robotics

Problem

Real-time omnidirectional depth estimation remains challenging for resource-constrained platforms due to the computational cost of transformer-based multi-view models and the calibration dependencies of classic fisheye stereo methods.

Approach

The network takes four fisheye images and fuses features across views with an Alternative Hierarchical Attention (AHA) module, which alternates intra-frame and inter-frame windowed self-attention instead of building explicit 3D cost volumes. The resulting per-camera depth estimates are then projected into a shared equirectangular (ERP) coordinate system and fused into a single 360° depth map.
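The alternation between intra-frame and inter-frame attention can be sketched in pure Python. This is a minimal illustration, not the authors' implementation: it follows the abstract's description of separate intra-frame and inter-frame windowed self-attention, but it assumes (hypothetically) that the inter-frame step groups the tokens sharing a window index across all views, that the query, key and value projections are identities, and that each step is wrapped in a residual connection. Learned weights, multi-head splitting, normalization and MLP layers are omitted, and the function names are illustrative only.

```python
import math
from typing import List

Token = List[float]
Views = List[List[Token]]  # indexed as [view][token][channel]


def self_attention(tokens: List[Token]) -> List[Token]:
    """Scaled dot-product self-attention with identity Q/K/V projections (illustrative)."""
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) / total
                    for c in range(d)])
    return out


def intra_frame(views: Views, window: int) -> Views:
    """Hypothetical intra-frame step: tokens attend only within a window of their own view."""
    out = []
    for view in views:
        mixed: List[Token] = []
        for start in range(0, len(view), window):
            mixed.extend(self_attention(view[start:start + window]))
        out.append(mixed)
    return out


def inter_frame(views: Views, window: int) -> Views:
    """Hypothetical inter-frame step: tokens sharing a window index attend across all views."""
    num_views, n = len(views), len(views[0])
    out: Views = [[[] for _ in range(n)] for _ in range(num_views)]
    for start in range(0, n, window):
        idx = range(start, min(start + window, n))
        group = [views[v][i] for v in range(num_views) for i in idx]
        mixed = iter(self_attention(group))
        for v in range(num_views):  # write results back in the order they were gathered
            for i in idx:
                out[v][i] = next(mixed)
    return out


def add(a: Views, b: Views) -> Views:
    """Element-wise residual connection."""
    return [[[x + y for x, y in zip(ta, tb)] for ta, tb in zip(va, vb)]
            for va, vb in zip(a, b)]


def aha_block(views: Views, window: int, num_layers: int = 2) -> Views:
    """Alternate intra-frame and inter-frame attention, each followed by a residual add."""
    x = views
    for _ in range(num_layers):
        x = add(x, intra_frame(x, window))
        x = add(x, inter_frame(x, window))
    return x


# Usage: four views, eight 2-D tokens each (e.g. a flattened low-resolution feature map).
example_views = [[[float(v), float(i % 3)] for i in range(8)] for v in range(4)]
fused = aha_block(example_views, window=4)
```

For V views of N tokens and window size w, attending jointly over all V·N tokens scores (V·N)² pairs, whereas this sketch scores V·N·w pairs in the intra-frame step and V·N·(V·w) pairs in the inter-frame step. For a fixed window both grow linearly in N, which illustrates why windowed attention can cut overhead relative to full cross-view attention.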

Key results

  • Novel Alternative Hierarchical Attention (AHA) mechanism for efficient cross-view fusion
  • Equirectangular projection (ERP) fusion for seamless 360° depth estimation
  • Competitive zero-shot accuracy on HM3D and 2D-3D-S benchmarks
  • Real-time inference up to 20 FPS on NVIDIA Orin NX embedded hardware

Why it matters

Provides a practical, low-cost alternative to LiDAR for real-time spatial perception in robotics and autonomous driving.

Abstract

In this paper, we propose FastViDAR, a novel framework that takes four fisheye camera inputs and produces a full 360° depth map along with per-camera depth, fusion depth, and confidence estimates. Our main contributions are: (1) We introduce an Alternative Hierarchical Attention (AHA) mechanism that efficiently fuses features across views through separate intra-frame and inter-frame windowed self-attention, achieving cross-view feature mixing with reduced overhead. (2) We propose a novel equirectangular projection (ERP) fusion approach that projects multi-view depth estimates to a shared equirectangular coordinate system to obtain the final fusion depth. (3) We generate ERP image-depth pairs using HM3D and 2D-3D-S datasets for comprehensive evaluation, demonstrating competitive zero-shot performance on real datasets while achieving up to 20 FPS on NVIDIA Orin NX embedded hardware. Project page: https://zhao-hangtian.github.io/FastViDAR/
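The ERP fusion step in contribution (2) can be sketched with a few geometric helpers. This is a hedged illustration, not the authors' method: the abstract does not specify the fisheye camera model, the depth convention, or the fusion rule, so the sketch assumes an equidistant fisheye model (r = f·θ), depth measured as distance along each pixel's viewing ray, a rig frame with +z forward, +x right and +y down, nearest-cell splatting into the ERP grid, and a confidence-weighted mean per ERP cell. The names `fisheye_ray`, `to_rig`, `erp_cell` and `fuse_erp` are hypothetical.

```python
import math
from typing import Dict, Iterable, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def fisheye_ray(u: float, v: float, cx: float, cy: float, f: float) -> Vec3:
    """Unit viewing ray for pixel (u, v), ASSUMING an equidistant fisheye model r = f * theta."""
    dx, dy = u - cx, v - cy
    r = math.hypot(dx, dy)
    if r == 0.0:
        return (0.0, 0.0, 1.0)  # the principal point looks along the optical axis (+z)
    theta = r / f  # angle between the ray and the optical axis
    s = math.sin(theta) / r
    return (dx * s, dy * s, math.cos(theta))


def to_rig(ray: Vec3, distance: float, R: Sequence[Sequence[float]], t: Vec3) -> Vec3:
    """Scale the ray by its (assumed ray-distance) depth, then map into the rig frame: p = R x + t."""
    x = [c * distance for c in ray]
    return tuple(sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3))


def erp_cell(p: Vec3, width: int, height: int) -> Optional[Tuple[int, int, float]]:
    """ERP column, row and range for a rig-frame point (+z forward, +x right, +y down)."""
    x, y, z = p
    rng = math.sqrt(x * x + y * y + z * z)
    if rng == 0.0:
        return None  # the rig origin has no defined direction
    lon = math.atan2(x, z)                          # longitude in [-pi, pi], 0 = forward
    lat = math.asin(max(-1.0, min(1.0, -y / rng)))  # latitude, positive upward
    col = int((lon + math.pi) / (2.0 * math.pi) * width) % width
    row = min(int((math.pi / 2.0 - lat) / math.pi * height), height - 1)
    return col, row, rng


def fuse_erp(samples: Iterable[Tuple[Vec3, float]],
             width: int, height: int) -> Dict[Tuple[int, int], float]:
    """Confidence-weighted mean range per ERP cell from (rig-frame point, confidence) samples."""
    acc: Dict[Tuple[int, int], Tuple[float, float]] = {}
    for p, conf in samples:
        cell = erp_cell(p, width, height)
        if cell is None or conf <= 0.0:
            continue  # skip degenerate points and zero-confidence samples
        col, row, rng = cell
        num, den = acc.get((col, row), (0.0, 0.0))
        acc[(col, row)] = (num + conf * rng, den + conf)
    return {k: num / den for k, (num, den) in acc.items()}


# Usage: one hypothetical camera aligned with the rig (identity extrinsics).
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ray = fisheye_ray(400.0, 300.0, cx=320.0, cy=320.0, f=200.0)
point = to_rig(ray, distance=3.0, R=I3, t=(0.0, 0.0, 0.0))
erp = fuse_erp([(point, 0.9)], width=1024, height=512)
```

A weighted mean is only one plausible rule; keeping the highest-confidence or nearest sample per cell would behave differently where views disagree near occlusion boundaries.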

Index terms

Omnidirectional Vision; RGB-D Perception; Deep Learning for Visual Perception
