ICRA 2026

MGS-Track: Monocular 6DoF Pose Tracking Via Masked 3D Prior and Online Gaussian Splatting

Zhiyuan Chen, Fan Lu, Guo Yu, Sanqing Qu, Ya Wu, Huang Yuan, Alois Knoll, Guang Chen

AI summary

MGS-Track enables robust, real-time 6DoF pose tracking and high-fidelity 3D reconstruction of unseen objects from monocular video by fusing mask-augmented geometric priors with online 3D Gaussian Splatting.

Monocular pose tracking, 3D Gaussian Splatting, 6DoF estimation, online reconstruction, geometric priors, robotic manipulation

Problem

Estimating the 6DoF pose of unseen objects from monocular RGB videos is essential for robotics but hindered by depth ambiguity and the lack of reliable, low-cost depth sensors. Existing monocular trackers often suffer from geometric errors and poor robustness in dynamic, occluded scenes.

Approach

The method uses a mask-augmented feed-forward network to extract object-centric geometric priors, which initialize and guide an online 3D Gaussian Splatting representation. It then jointly optimizes the Gaussian field and object pose in a coarse-to-fine manner, supplemented by adaptive pruning to maintain computational efficiency.

Key results

Surpasses monocular baselines on pose tracking accuracy across HO3D and YCBInEOAT datasets
Delivers high-fidelity, photorealistic 3D reconstruction of unseen objects in real time
Introduces adaptive Gaussian management to control model growth and ensure online efficiency
Maintains robust tracking under severe occlusions and dynamic hand-object interactions

Why it matters

Provides a practical, sensor-light solution for real-time robotic manipulation and object-centric scene understanding without requiring expensive depth hardware.

Abstract

Tracking the 6DoF pose of previously unseen objects from monocular RGB videos is crucial for robotic manipulation, yet remains challenging due to depth ambiguity and limited object-centric visual context. Existing trackers often rely on accurate depth sensors, which constrains deployment in low-cost settings, while substituting monocular pseudo-depth frequently introduces geometric errors that reduce tracking robustness. To this end, we propose MGS-Track, an object-centric online tracking and reconstruction framework that combines learning-based geometric priors with differentiable 3D Gaussian Splatting (3DGS). Specifically, we first introduce a mask-augmented DUSt3R network (DUSt3R-M) to establish pairwise correspondences and predict point maps, which serve as geometric priors for initializing and guiding an online 3DGS representation. We then jointly optimize Gaussian parameters and 6DoF object poses in a coarse-to-fine manner, enabling robust tracking and high-fidelity reconstruction. To control model growth and maintain efficiency over time, we further introduce adaptive Gaussian management with capacity-aware selection and mask-consistent pruning. Experiments on YCBInEOAT and HO3D show that MGS-Track consistently outperforms competitive monocular baselines on both pose tracking and object reconstruction in challenging object-centric scenarios.
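
The abstract names two Gaussian-management steps, capacity-aware selection and mask-consistent pruning, without giving their details here. The Python sketch below is one plausible reading, not the authors' implementation. It assumes that pruning drops Gaussians whose centers project outside the object mask under a pinhole camera, and that selection keeps the highest-opacity Gaussians up to a fixed budget. The names `Gaussian`, `mask_consistent_prune`, and `capacity_select`, the opacity-based ranking, and the toy intrinsics are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    center: tuple   # (x, y, z) in the camera frame, z pointing forward (assumed convention)
    opacity: float  # stand-in importance score (assumption, not from the paper)


def project(center, fx, fy, cx, cy):
    """Pinhole projection to integer pixel coordinates, or None if behind the camera."""
    x, y, z = center
    if z <= 0:
        return None
    return int(round(fx * x / z + cx)), int(round(fy * y / z + cy))


def mask_consistent_prune(gaussians, mask, fx, fy, cx, cy):
    """Keep only Gaussians whose centers project onto a True pixel of the object mask."""
    h, w = len(mask), len(mask[0])
    kept = []
    for g in gaussians:
        uv = project(g.center, fx, fy, cx, cy)
        if uv is None:
            continue  # behind the camera
        u, v = uv
        if 0 <= u < w and 0 <= v < h and mask[v][u]:
            kept.append(g)
    return kept


def capacity_select(gaussians, capacity):
    """Keep at most `capacity` Gaussians, preferring higher opacity."""
    ranked = sorted(gaussians, key=lambda g: g.opacity, reverse=True)
    return ranked[:capacity]


# Toy data: a 4x4 mask whose central 2x2 block is the object.
mask = [[False] * 4 for _ in range(4)]
for v in (1, 2):
    for u in (1, 2):
        mask[v][u] = True

gaussians = [
    Gaussian((0.0, 0.0, 1.0), 0.9),    # projects to pixel (1, 1): inside the mask
    Gaussian((0.5, 0.5, 1.0), 0.2),    # projects to pixel (2, 2): inside the mask
    Gaussian((0.5, 0.0, 1.0), 0.5),    # projects to pixel (2, 1): inside the mask
    Gaussian((1.0, 0.0, 1.0), 0.8),    # projects to pixel (3, 1): outside the mask
    Gaussian((0.0, 0.0, -1.0), 0.7),   # behind the camera
    Gaussian((-1.0, 0.0, 1.0), 0.6),   # projects outside the image
]

pruned = mask_consistent_prune(gaussians, mask, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
selected = capacity_select(pruned, capacity=2)
```

In this toy run, pruning keeps the three Gaussians that land on the mask, and the budget of two then drops the lowest-opacity one. A real system would likely score Gaussians by rendered contribution rather than raw opacity, and would test the projected footprint rather than only the center.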

Index terms

Perception for Grasping and Manipulation, SLAM, Visual-Inertial SLAM