ICRA 2026

DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-Time Optical Flow and Stereo Estimation

Tushar Anand, Maheswar Bora, Antitza Dantcheva, Abhijit Das

AI summary

DenVisCoM achieves state-of-the-art accuracy and real-time inference for optical flow and stereo estimation by jointly modeling image pairs within a linear-complexity Mamba-Transformer architecture.

Optical flow, Stereo disparity, Mamba State Space Models, Real-time vision, Hybrid architecture

Problem

Deep learning methods for optical flow and stereo estimation face a critical trade-off between accuracy and computational efficiency, while existing Mamba models lack explicit mechanisms for dense cross-image correspondence.

Approach

The authors introduce DenVisCoM, a hybrid architecture that fuses symmetric convolution branches with a joint Mamba sequence block and self/cross-attention to process left and right image patches simultaneously, enabling efficient long-range dependency modeling and precise dense matching.
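To make the idea of scanning both images as one sequence concrete, here is a minimal sketch of a linear-time state-space recurrence run over concatenated left and right tokens. It is an illustration under strong assumptions, not the authors' DenVisCoM block: the function names `ssm_scan` and `joint_scan` are hypothetical, tokens are plain scalars, and the parameters `a`, `b`, `c` are fixed constants, whereas Mamba makes them input-dependent. The convolution branches and the attention stage are omitted entirely.

```python
def ssm_scan(tokens, a=0.9, b=1.0, c=1.0):
    """Scalar linear state-space recurrence over a 1-D token sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    A single pass over the sequence, so cost grows linearly with its length.
    (Hypothetical helper: fixed a, b, c instead of Mamba's selective parameters.)
    """
    h = 0.0
    outputs = []
    for x in tokens:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


def joint_scan(left_tokens, right_tokens, **params):
    """Scan left and right tokens as one sequence, then split the outputs."""
    joint = list(left_tokens) + list(right_tokens)
    out = ssm_scan(joint, **params)
    n = len(left_tokens)
    return out[:n], out[n:]


# Toy input: a single impulse in the left image, nothing in the right image.
left = [1.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0]

left_out, right_out = joint_scan(left, right, a=0.5)
right_alone = ssm_scan(right, a=0.5)

print(right_out)    # non-zero: the state carried information over from the left tokens
print(right_alone)  # all zeros: scanned on its own, the right sequence sees nothing
```

In this toy, the right-image outputs are non-zero only because the hidden state was carried over from the left-image tokens, which is the sense in which a joint scan couples the two views without forming a quadratic all-pairs comparison. The sketch is causal, so information flows from left to right only; how the actual block orders, mixes, or re-scans the two token streams is not specified in this summary.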

Key results

Lowest EPE (1.34) and F1-all (2.52) on KITTI15 optical flow benchmark
Real-time inference speed (~39.9 FPS) with memory comparable to leading methods
Competitive Sintel Final unmatched error (10.67), outperforming Unimatch and FlowFormer
Novel hybrid Mamba-Transformer block enabling simultaneous joint learning of image pairs without quadratic complexity
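For readers unfamiliar with the two accuracy metrics quoted above, the following sketch shows how they are commonly computed. EPE (end-point error) is the mean Euclidean distance between predicted and ground-truth flow vectors. F1-all (often written Fl-all) is the KITTI outlier percentage: a pixel counts as an outlier when its error exceeds both 3 px and 5% of the ground-truth flow magnitude. The function name `flow_metrics` and the list-of-tuples input format are assumptions made for this illustration; real evaluation code works on dense arrays and applies a validity mask, which is left out here.

```python
import math


def flow_metrics(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Return (EPE, outlier percentage) for paired lists of (u, v) flow vectors.

    Hypothetical helper: assumes every pixel is valid (no ground-truth mask).
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")

    # Per-pixel end-point error and ground-truth flow magnitude.
    errors = [math.hypot(pu - gu, pv - gv) for (pu, pv), (gu, gv) in zip(pred, gt)]
    mags = [math.hypot(gu, gv) for gu, gv in gt]

    epe = sum(errors) / len(errors)

    # KITTI-style outlier: error above 3 px AND above 5% of the true magnitude.
    outliers = sum(
        1 for e, m in zip(errors, mags) if e > abs_thresh and e > rel_thresh * m
    )
    return epe, 100.0 * outliers / len(errors)


# Toy data: four pixels, two of them predicted with a 4 px error.
gt = [(10.0, 0.0), (0.0, 10.0), (100.0, 0.0), (3.0, 4.0)]
pred = [(10.0, 0.0), (0.0, 14.0), (104.0, 0.0), (3.0, 4.0)]

epe, outlier_pct = flow_metrics(pred, gt)
print(epe, outlier_pct)  # 2.0 25.0
```

Both erroneous pixels are off by 4 px, but only the one with a small true motion (magnitude 10) counts as an outlier; for the pixel with magnitude 100, a 4 px error is below the 5% relative threshold. On the speed side, the reported ~39.9 FPS corresponds to about 1000 / 39.9 ≈ 25 ms per inference.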

Why it matters

Enables accurate, real-time dense perception for resource-constrained applications like autonomous driving and robotics by overcoming the accuracy-efficiency bottleneck of current vision models.

Abstract

In this work, we propose DenVisCoM, a novel Mamba block, as well as a novel hybrid architecture specifically tailored for accurate, real-time estimation of optical flow and disparity. Given that such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture to tackle them jointly. Specifically, the proposed hybrid architecture is based on DenVisCoM and a Transformer-based attention block that efficiently addresses real-time inference, memory footprint, and accuracy at the same time for the joint estimation of motion and 3D dense perception tasks. We extensively analyze the benchmark trade-off between accuracy and real-time processing on a large number of datasets. Our experimental results and related analysis suggest that our proposed model can accurately estimate optical flow and disparity in real time. All models and associated code are available at https://github.com/vimstereo/DenVisCoM.

Index terms

Deep Learning for Visual Perception