ICRA 2026

High-Quality Sparse-View Gaussian Splatting without Ground-Truth Camera Poses

Chun Her Lim, Yingnan Guo, Wen Yang, Yu Zhang

AI summary

The proposed framework achieves high-fidelity, photorealistic novel view synthesis and accurate camera pose estimation from sparse, uncalibrated images by integrating 3D Gaussian Splatting with MASt3R priors and multi-task regularization.

Sparse-view reconstruction, 3D Gaussian Splatting, Pose-free synthesis, Multi-view stereo, Geometric regularization, Novel view synthesis

Problem

Current 3D reconstruction methods require dense image inputs and precise camera poses, which are often impractical to obtain in real-world scenarios. Under sparse-view conditions, these dependencies cause severe overfitting, geometric degradation, and reconstruction failure.

Approach

The method initializes 3D Gaussian Splatting from the point clouds and coarse camera poses produced by MASt3R, a ViT-based multi-view stereo model. It then refines the scene with point-rendered LPIPS regularization, geometric regularization (local depth and normal terms), and semantic regularization, while jointly optimizing the camera poses.
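The summary does not give the formulas for the regularization terms. As a rough illustration of what a local depth term can look like, the sketch below compares a rendered depth map against a prior depth map after normalizing each small patch to zero mean and unit variance, so that only the local depth structure is penalized and not the global scale or offset. The patch-wise normalization, the function names, and the patch size are all assumptions made for this example, not the authors' stated formulation.

```python
from statistics import fmean, pstdev


def normalize_patch(values, eps=1e-6):
    """Shift a patch to zero mean and scale it to (roughly) unit std."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def patches(depth, size):
    """Yield non-overlapping size x size patches of a 2D list as flat lists."""
    rows, cols = len(depth), len(depth[0])
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            yield [depth[r + i][c + j] for i in range(size) for j in range(size)]


def local_depth_loss(rendered, prior, size=2):
    """Mean squared difference between patch-normalized depth maps.

    Hypothetical form of a local depth regularizer: both maps are
    normalized per patch before comparison.
    """
    total, count = 0.0, 0
    for patch_r, patch_p in zip(patches(rendered, size), patches(prior, size)):
        norm_r = normalize_patch(patch_r)
        norm_p = normalize_patch(patch_p)
        total += sum((a - b) ** 2 for a, b in zip(norm_r, norm_p))
        count += len(norm_r)
    return total / count


# Toy 4 x 4 depth prior (values are made up for illustration).
prior = [
    [1.0, 2.0, 4.0, 4.5],
    [1.5, 2.5, 5.0, 5.5],
    [3.0, 3.5, 6.0, 7.0],
    [3.2, 3.9, 6.5, 8.0],
]

# A rendered depth that differs only by scale and offset is barely penalized.
rescaled = [[2.0 * d + 3.0 for d in row] for row in prior]
print(local_depth_loss(rescaled, prior))

# A rendered depth with the wrong local structure is penalized.
mirrored = [row[::-1] for row in prior]
print(local_depth_loss(mirrored, prior))
```

Normalizing per patch is one way to use a depth prior whose absolute scale is unreliable, which is often the case for depth predicted from sparse, uncalibrated images.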

Key results

Outperforms state-of-the-art methods on the Tanks and Temples and MVImgNet datasets under sparse-view settings
Achieves higher-fidelity, more photorealistic novel view synthesis
Accurately estimates camera poses without ground-truth extrinsics
Mitigates overfitting and geometric degradation via multi-task regularization

Why it matters

Enables practical 3D scene reconstruction and novel view synthesis for robotics and autonomous navigation where dense inputs and precise camera calibration are unavailable.

Abstract

Existing methods for novel view synthesis depend on dense input images and accurate camera poses, which significantly limits their practical application. We propose a novel framework that enables high-quality sparse-view reconstruction via 3D Gaussian Splatting (3DGS) without known camera poses. Our approach leverages MASt3R, a ViT-based multi-view stereo prior, to generate point clouds and coarse camera poses from uncalibrated sparse images. We use the point clouds to initialize 3DGS. Additionally, we propose several regularization techniques, including point-rendered LPIPS regularization, geometric regularization (local depth regularization and normal regularization), and semantic regularization, to improve the quality of reconstructed scenes and enhance the generalization capability of the model at unseen viewpoints. Because the camera poses output by MASt3R are inaccurate, we optimize the camera poses during both the training and testing phases. Experimental results on the Tanks and Temples and MVImgNet datasets demonstrate that our method outperforms state-of-the-art techniques in novel view synthesis and camera pose estimation under sparse-view settings. Our approach achieves higher fidelity and more photorealistic visual effects.
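The idea of optimizing camera poses at test time can be sketched with a toy problem. The example below assumes that the reconstructed scene is held fixed and that only the pose of a held-out view is updated by gradient descent, which is a common setup in pose-free pipelines but is not confirmed by the abstract. It also swaps the photometric rendering loss for a simple reprojection error on a handful of 3D points and restricts the pose to a 2D camera offset. All names and values are hypothetical.

```python
def project(point, cam_xy, focal=1.0):
    """Pinhole projection for a camera shifted by cam_xy, looking down +z."""
    x, y, z = point
    cx, cy = cam_xy
    return (focal * (x - cx) / z, focal * (y - cy) / z)


def reprojection_loss_and_grad(points, observed, cam_xy, focal=1.0):
    """Mean squared reprojection error and its gradient w.r.t. cam_xy."""
    loss, grad_x, grad_y = 0.0, 0.0, 0.0
    n = len(points)
    for point, (u_obs, v_obs) in zip(points, observed):
        u, v = project(point, cam_xy, focal)
        du, dv = u - u_obs, v - v_obs
        loss += (du * du + dv * dv) / n
        # d(u)/d(cx) = -focal / z, and likewise for v and cy.
        scale = -focal / point[2]
        grad_x += 2.0 * du * scale / n
        grad_y += 2.0 * dv * scale / n
    return loss, (grad_x, grad_y)


def refine_pose(points, observed, init_xy, lr=1.0, steps=200):
    """Gradient descent on the camera offset; the scene points never change."""
    cam = list(init_xy)
    for _ in range(steps):
        _, (gx, gy) = reprojection_loss_and_grad(points, observed, cam)
        cam[0] -= lr * gx
        cam[1] -= lr * gy
    return tuple(cam)


# Frozen toy scene (stand-in for the trained Gaussians).
scene = [(0.0, 0.0, 2.0), (1.0, 0.5, 2.5), (-1.0, 1.0, 4.0), (0.5, -1.0, 3.0)]

# Observations of the held-out view, generated from its true pose.
true_pose = (0.5, -0.1)
observed = [project(p, true_pose) for p in scene]

# Start from an inaccurate coarse pose and refine it against the observations.
coarse_pose = (0.3, -0.2)
refined_pose = refine_pose(scene, observed, coarse_pose)
print(refined_pose)
```

In a full pipeline the same loop would backpropagate an image-space loss through the renderer to the pose parameters. The point of the sketch is only that a coarse pose can be corrected against a fixed scene before a test view is evaluated.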

Index terms

Deep Learning for Visual Perception; Visual Learning; Autonomous Vehicle Navigation