ICRA 2026

Multi-View Stereo with Geometric Encoding for Large-Scale Dense Scene Reconstruction

Guidong Yang, Rui Cao, Junjie Wen, Benyun Zhao, Qingxiang Li, Xi Chen, Yunhui Liu, Ben M. Chen

AI summary

GE-MVS achieves state-of-the-art depth estimation and point cloud reconstruction by explicitly encoding geometric cues during network learning.

Multi-view stereo, depth estimation, point cloud reconstruction, geometric encoding, UAV mapping, dense reconstruction

Problem

Existing learning-based multi-view stereo methods implicitly encode geometric cues, leading to insufficient geometric guidance, inaccurate depth estimation, and incomplete reconstructions in challenging scenes.

Approach

The authors propose GE-MVS, a coarse-to-fine network that enhances geometric modeling through adaptive cost volume aggregation, explicit depth consistency optimization using adjacent view cues, and surface normal-assisted depth refinement.
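
To make the idea of adaptive cost volume aggregation concrete, the sketch below fuses per-view matching scores for a single voxel (one pixel at one depth hypothesis) with softmax weights, so that views that match poorly, for example because of occlusion, contribute less than in a plain average. This is a toy illustration only: GE-MVS learns its weights inside the network, and the softmax weighting and the names `softmax`, `aggregate_costs`, and `plain_mean` are assumptions made here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_costs(per_view_similarity):
    """Fuse per-view similarities for ONE voxel (pixel + depth hypothesis).

    Each entry is the matching score against one source view (higher is
    better). The weights come from a softmax over the scores themselves,
    a hypothetical stand-in for learned visibility weights.
    """
    weights = softmax(per_view_similarity)
    return sum(w * s for w, s in zip(weights, per_view_similarity))


def plain_mean(values):
    """Unweighted average, the baseline that treats all views equally."""
    return sum(values) / len(values)


# Two source views agree strongly; the third is occluded and matches poorly.
similarities = [0.9, 0.8, -0.5]
fused = aggregate_costs(similarities)
baseline = plain_mean(similarities)
print(f"weighted: {fused:.3f}, plain mean: {baseline:.3f}")
```

In this example the weighted score stays closer to the two agreeing views than the plain mean does, which is the qualitative effect visibility weighting aims for.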

Key results

State-of-the-art accuracy and completeness on DTU, Tanks and Temples, and BlendedMVS benchmarks
Voxel-wise visibility weighting and explicit depth consistency checks improve reconstruction fidelity
Successful UAV-based large-scale reconstruction outperforming industrial solutions in efficiency and effectiveness
Surface normal encoding refines depth hypotheses for geometrically consistent local neighborhoods
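
A depth consistency check in 3D point space can be sketched as follows: lift a reference pixel to 3D with its predicted depth, move the point into an adjacent camera's frame, project it, and compare it with the 3D point implied by the adjacent view's depth at that location. This is a minimal sketch under simplifying assumptions (pinhole cameras with shared intrinsics, identity relative rotation, a callable depth lookup); the function names are hypothetical and this is not the authors' loss.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D point in the camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def to_adjacent(point, translation):
    """Reference frame -> adjacent frame (relative rotation assumed identity)."""
    return tuple(p + t for p, t in zip(point, translation))


def point_space_error(u, v, pred_depth, adjacent_depth_at, intrinsics, translation):
    """3D distance between the predicted point and the adjacent-view point."""
    fx, fy, cx, cy = intrinsics
    p_ref = backproject(u, v, pred_depth, fx, fy, cx, cy)
    p_adj = to_adjacent(p_ref, translation)
    u2, v2 = project(p_adj, fx, fy, cx, cy)
    # Depth of the adjacent view at the projected location (assumed given).
    adj_depth = adjacent_depth_at(u2, v2)
    p_gt = backproject(u2, v2, adj_depth, fx, fy, cx, cy)
    return math.dist(p_adj, p_gt)


# Toy scene: a fronto-parallel wall 2 m away, adjacent camera shifted 0.1 m in x.
intrinsics = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy
translation = (-0.1, 0.0, 0.0)


def wall_depth(u, v):
    return 2.0


error_good = point_space_error(60, 40, 2.0, wall_depth, intrinsics, translation)
error_bad = point_space_error(60, 40, 2.2, wall_depth, intrinsics, translation)
print(error_good, error_bad)
```

A correct depth prediction yields a near-zero error, while an overestimated depth produces a 3D offset that a training loss could penalize.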

Why it matters

Provides a scalable and highly accurate 3D reconstruction pipeline essential for robotic navigation, aerial mapping, and autonomous systems.

Abstract

Multi-view stereo (MVS) implicitly encodes photometric and geometric cues into the cost volume for multi-view correspondence matching, transferring insufficient geometric cues essential to depth estimation and reconstruction. This paper proposes GE-MVS, a novel multi-view stereo network with geometric encoding for more accurate and complete depth estimation and point cloud reconstruction. First, the cross-view adaptive cost volume aggregation module is proposed to strengthen the encoding of multi-view geometric cues during cost volume construction. Then, the depth consistency optimization is performed in 3D point space during learning by invoking ground-truth depth cues from adjacent views. Finally, the surface normal geometries are explicitly encoded to refine the sampled depth hypotheses to be consistent in the local neighbor regions. Extensive experiments on the standard MVS benchmarks including DTU, Tanks and Temples, and BlendedMVS demonstrate the state-of-the-art depth estimation and point cloud reconstruction performance of GE-MVS. The GE-MVS is further deployed in real-world experiments for UAV-based large-scale reconstruction, where our method outperforms the prevalent industrial reconstruction solutions in terms of reconstruction efficiency and effectiveness. Supplementary video can be found at https://youtu.be/Z4tGROatVjU

Note to Practitioners: Multi-view stereo (MVS) enables dense point cloud reconstruction of target scenes from calibrated multi-view images and has been widely adopted in robotic navigation, exploration, and manipulation. Recently, learning-based MVS methods have significantly improved reconstruction accuracy and completeness compared to traditional approaches. This work aims to enhance geometric modeling during network learning by utilizing ground-truth depth cues from adjacent views and encoding surface normal geometries. Extensive experiments conducted on both datasets and real-world scenarios validate the effectiveness, scalability, and efficiency of the proposed method. The proposed method can provide accurate depth and dense point cloud representations for applications such as aerial path planning, robotic manipulation, autonomous driving, and virtual and augmented reality. Future work will focus on adapting the proposed method from terrestrial to underwater domains for real-world underwater dense scene reconstruction.

Received 27 September 2024; revised 5 March 2025 and 5 August 2025; accepted 29 September 2025. Date of publication 9 October 2025; date of current version 28 October 2025. This article was recommended for publication by Associate Editor G. Chen and Editor Z. Li upon evaluation of the reviewers’ comments. This work was supported in part by the Research Grants Council of Hong Kong, SAR, under Grant 14206821, Grant 14217922, and Grant 14209623; and in part by the InnoHK Clusters of the Hong Kong SAR Government via Hong Kong Centre for Logistics Robotics. (Corresponding author: Junjie Wen.)

Guidong Yang, Rui Cao, Benyun Zhao, Qingxiang Li, Xi Chen, Yun-Hui Liu, and Ben M. Chen are with the Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong (CUHK), Hong Kong (e-mail: gdyang@mae.cuhk.edu.hk; rcao@mae.cuhk.edu.hk; byzhao@mae.cuhk.edu.hk; qingxiang.li@polimi.it; xichen@mae.cuhk.edu.hk; yhliu@mae.cuhk.edu.hk; bmchen@mae.cuhk.edu.hk).

Junjie Wen is with the Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong (CUHK), Hong Kong, and also with the Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: wenjj@pcl.ac.cn).

This article has supplementary downloadable material available at https://doi.org/10.1109/TASE.2025.3619093, provided by the authors. Digital Object Identifier 10.1109/TASE.2025.3619093
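
The abstract's surface-normal step, refining depth hypotheses so they agree within a local neighborhood, can be illustrated with the standard local-plane relation: if pixel p has depth d_p and normal n, a neighbor q on the same plane has depth d_q = d_p (n · K⁻¹p̃) / (n · K⁻¹q̃), where p̃ and q̃ are homogeneous pixel coordinates. The sketch below assumes a skew-free pinhole camera, and the helper names are hypothetical; it shows the geometry only, not how GE-MVS encodes normals in the network.

```python
def ray(u, v, fx, fy, cx, cy):
    """Viewing ray K^-1 [u, v, 1]^T for a pinhole camera without skew."""
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def propagate_depth(depth_p, normal, pixel_p, pixel_q, intrinsics):
    """Depth at pixel_q, assuming it lies on the plane through pixel_p's
    3D point with the given normal (the normal need not be unit length,
    because only a ratio of dot products is used)."""
    fx, fy, cx, cy = intrinsics
    ray_p = ray(*pixel_p, fx, fy, cx, cy)
    ray_q = ray(*pixel_q, fx, fy, cx, cy)
    denom = dot(normal, ray_q)
    if abs(denom) < 1e-8:
        raise ValueError("viewing ray is parallel to the local plane")
    return depth_p * dot(normal, ray_p) / denom


intrinsics = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy

# Fronto-parallel surface: the neighbor inherits the same depth.
flat = propagate_depth(2.0, (0.0, 0.0, -1.0), (50, 50), (60, 50), intrinsics)

# Slanted surface z = 2 + 0.5 * x, whose normal is parallel to (0.5, 0, -1).
slanted = propagate_depth(2.0, (0.5, 0.0, -1.0), (50, 50), (60, 50), intrinsics)
print(flat, slanted)
```

On the slanted surface the propagated depth differs from the center depth, which is why a normal-aware hypothesis can fit a neighborhood better than simply copying the depth.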

Index terms

Aerial Systems: Applications; Aerial Systems: Perception and Autonomy; Computer Vision for Automation