ICRA 2026

MOGS: Monocular Object-Guided Gaussian Splatting in Large Scenes

Shengkai Zhang, Yuhe Liu, Jianhua He, Xuedou Xiao, Mozi Chen, Kezhong Liu

AI summary

MOGS enables scalable, low-cost monocular 3D Gaussian Splatting for large scenes by replacing expensive LiDAR depth with object-anchored dense depth derived from sparse visual-inertial SfM points, cutting training time by up to 30.4% and memory consumption by 19.8% while keeping rendering quality competitive with LiDAR-based methods.

3D Gaussian Splatting, Monocular Depth, Large Scene Reconstruction, Visual-Inertial SfM, Object-guided Modeling, Low-cost SLAM

Problem

Extending 3D Gaussian Splatting to large scenes typically relies on costly high-channel LiDAR, whose dense point clouds strain memory and computation and limit scalability, fleet deployment, and optimization speed. Monocular alternatives lack reliable metric depth, causing scale drift and geometric inconsistency.

Approach

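MOGS replaces LiDAR with a visual-inertial SfM frontend. It uses image semantics to hypothesize per-object shape priors, anchors them with sparse but metrically reliable SfM points, and propagates the resulting metric constraints across each object to produce dense depth. A multi-scale shape consensus module merges small segments into coarse objects that have enough SfM support, and a cross-object depth refinement module then optimizes the per-pixel depth.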
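The anchoring step can be illustrated with a minimal sketch. It assumes a pinhole camera with intrinsics K and uses a single plane as a stand-in for the parametric shape models, whose actual form is not given in this summary. The function names fit_plane and densify_object_depth are hypothetical and are not taken from the MOGS code.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane n.x = d through Nx3 points (N >= 3, not collinear)."""
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value of the
    # centered points is the plane normal.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal, float(normal @ centroid)

def densify_object_depth(mask, uv, z, K):
    """Propagate sparse metric depths to every pixel of one object mask.

    mask: HxW boolean object segment (assumed to come from image semantics).
    uv:   Nx2 pixel coordinates (u, v) of sparse metric points in the object.
    z:    N metric depths of those points.
    K:    3x3 pinhole intrinsics (assumption).
    """
    K_inv = np.linalg.inv(K)
    # Back-project the sparse anchors: X = z * K^-1 [u, v, 1]^T
    uv1 = np.column_stack([uv, np.ones(len(uv))])
    anchors = (K_inv @ uv1.T).T * z[:, None]
    normal, d = fit_plane(anchors)

    # Intersect each masked pixel's viewing ray with the fitted plane:
    # n.(depth * ray) = d  =>  depth = d / (n.ray)
    vs, us = np.nonzero(mask)
    rays = (K_inv @ np.vstack([us, vs, np.ones_like(us)])).T
    denom = rays @ normal
    depth = np.full(mask.shape, np.nan)  # NaN outside the object
    valid = np.abs(denom) > 1e-9         # skip rays parallel to the plane
    depth[vs[valid], us[valid]] = d / denom[valid]
    return depth
```

A handful of metric points is enough to fix the plane, after which every pixel in the segment receives a depth value. This is the sense in which a few sparse anchors can metrize a whole object.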

Key results

Reduces training time by up to 30.4%
Lowers memory consumption by 19.8%
Achieves rendering quality competitive with LiDAR-based methods
Enables reliable Gaussian initialization using only low-cost visual-inertial sensors

Why it matters

It makes scalable, high-fidelity 3D scene reconstruction more accessible for autonomous driving and robotics by removing the need for expensive LiDAR hardware while also reducing training time and memory use.

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) deliver striking photorealism, and extending it to large scenes opens new opportunities for semantic reasoning and prediction in applications such as autonomous driving. Today’s state-of-the-art systems for large scenes primarily originate from LiDAR-based pipelines that utilize long-range depth sensing. However, they require costly high-channel sensors whose dense point clouds strain memory and computation, limiting scalability, fleet deployment, and optimization speed. We present MOGS, a monocular 3DGS framework that replaces active LiDAR depth with object-anchored, metrized dense depth derived from sparse visual-inertial (VI) structure-from-motion (SfM) cues. Our key idea is to exploit image semantics to hypothesize per-object shape priors, anchor them with sparse but metrically reliable SfM points, and propagate the resulting metric constraints across each object to produce dense depth. To address two key challenges, i.e., insufficient SfM coverage within objects and cross-object geometric inconsistency, MOGS introduces 1) a multi-scale shape consensus module that adaptively merges small segments into coarse objects best supported by SfM and fits them with parametric shape models, and 2) a cross-object depth refinement module that optimizes per-pixel depth under a combinatorial objective combining geometric consistency, prior anchoring, and edge-aware smoothness. Experiments on public datasets show that, with a low-cost VI sensor suite, MOGS reduces training time by up to 30.4% and memory consumption by 19.8%, while achieving high-quality rendering competitive with costly LiDAR-based approaches in large scenes. The source code is publicly available at https://github.com/ClarenceZSK/MOGS/.
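The three-term refinement objective named in the abstract can be pictured with a small sketch. The abstract does not give the terms' exact form, so everything here is an assumption for illustration: the geometric term is approximated as agreement with sparse metric depths, the prior term as squared deviation from the object-level prior depth, and the smoothness term as depth gradients down-weighted at image edges. The function name refinement_energy and the weights are hypothetical.

```python
import numpy as np

def refinement_energy(depth, prior, image, anchors,
                      w_geo=1.0, w_prior=0.1, w_smooth=0.5):
    """Weighted sum of three illustrative per-pixel depth penalties.

    depth, prior, image: HxW float arrays (image is grayscale in [0, 1]).
    anchors: iterable of (row, col, metric_depth) sparse observations.
    """
    # Geometric term (assumed form): squared error at sparse metric anchors.
    e_geo = sum((depth[r, c] - z) ** 2 for r, c, z in anchors)

    # Prior anchoring (assumed form): stay close to the object-level prior.
    e_prior = np.sum((depth - prior) ** 2)

    # Edge-aware smoothness: penalize depth gradients, but less so where
    # the image itself has a strong gradient (a likely object boundary).
    dzx, dzy = np.diff(depth, axis=1), np.diff(depth, axis=0)
    dix, diy = np.diff(image, axis=1), np.diff(image, axis=0)
    e_smooth = (np.sum(np.abs(dzx) * np.exp(-np.abs(dix)))
                + np.sum(np.abs(dzy) * np.exp(-np.abs(diy))))

    return w_geo * e_geo + w_prior * e_prior + w_smooth * e_smooth
```

Under this form, a depth discontinuity that coincides with an image edge costs less than the same discontinuity inside a textureless region, which is the usual motivation for an edge-aware weight.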

Index terms

Computer Vision for Automation, Mapping, Visual-Inertial SLAM