High-Quality Sparse-View Gaussian Splatting without Ground-Truth Camera Poses
Chun Her Lim, Yingnan Guo, Wen Yang, Yu Zhang
AI summary
Problem
Current 3D reconstruction methods require dense image inputs and precise camera poses, which are often impractical to obtain in real-world scenarios. Under sparse-view conditions, these dependencies cause severe overfitting, geometric degradation, and reconstruction failure.
Approach
The method initializes 3D Gaussian Splatting using point clouds and coarse poses generated by the MASt3R vision transformer, then refines the scene through point-rendered LPIPS, local depth/normal geometric, and semantic regularization while jointly optimizing camera parameters.
Key results
- Outperforms state-of-the-art methods on Tanks and Temples and MVImgNet datasets
- Achieves higher fidelity and photorealistic novel view synthesis
- Accurately estimates camera poses without ground-truth extrinsics
- Mitigates overfitting and geometric degradation via multi-task regularization
Why it matters
Enables practical 3D scene reconstruction and novel view synthesis for robotics and autonomous navigation where dense inputs and precise camera calibration are unavailable.
Abstract
The existing methods for novel view synthesis depend on dense input images and accurate camera poses, which significantly limits their practical application. We pro- pose a novel framework that enables high-quality sparse- view reconstruction via 3D Gaussian Splatting (3DGS) without knowing camera poses. Our approach leverages MASt3R, a ViT-based multi-view stereo prior, to generate point clouds and coarse camera poses from uncalibrated sparse images. We use the point clouds to initial 3DGS. Additionally, we propose several regularization techniques, including point- rendered LPIPS regularization, geometric regularization (local depth regularization and normal regularization), and semantic regularization to improve the quality of reconstructed scenes and enhance the generalization capability of the model in unseen viewpoint. Due to the inaccuracies in the camera poses output by MASt3R, we optimized the camera poses during both the training and testing phase. Experimental results on the Tanks and Temples and MVImgNet datasets demonstrate that our method outperforms state-of-the-art techniques in novel view synthesis and camera pose estimation under sparse- view settings. Our approach achieves higher fidelity and more photorealistic visual effects.