MGS-Track: Monocular 6DoF Pose Tracking Via Masked 3D Prior and Online Gaussian Splatting
Zhiyuan Chen, Fan Lu, Guo Yu, Sanqing Qu, Ya Wu, Huang Yuan, Alois Knoll, Guang Chen
AI summary
Problem
Estimating the 6DoF pose of unseen objects from monocular RGB videos is essential for robotics but hindered by depth ambiguity and the lack of reliable, low-cost depth sensors. Existing monocular trackers often suffer from geometric errors and poor robustness in dynamic, occluded scenes.
Approach
The method uses a mask-augmented feed-forward network to extract object-centric geometric priors, which initialize and guide an online 3D Gaussian Splatting representation. It then jointly optimizes the Gaussian field and object pose in a coarse-to-fine manner, supplemented by adaptive pruning to maintain computational efficiency.
Key results
- Surpasses monocular baselines in pose-tracking accuracy on the HO3D and YCBInEOAT datasets
- Delivers high-fidelity, photorealistic 3D reconstruction of unseen objects in real time
- Introduces adaptive Gaussian management to control model growth and ensure online efficiency
- Maintains robust tracking under severe occlusions and dynamic hand-object interactions
Why it matters
Provides a practical, sensor-light solution for real-time robotic manipulation and object-centric scene understanding without requiring expensive depth hardware.
Abstract
Tracking the 6DoF pose of previously unseen objects from monocular RGB videos is crucial for robotic manipulation, yet remains challenging due to depth ambiguity and limited object-centric visual context. Existing trackers often rely on accurate depth sensors, which constrains deployment in low-cost settings, while substituting monocular pseudo-depth frequently introduces geometric errors that reduce tracking robustness. To this end, we propose MGS-Track, an object-centric online tracking and reconstruction framework that combines learning-based geometric priors with differentiable 3D Gaussian Splatting (3DGS). Specifically, we first introduce a mask-augmented DUSt3R network (DUSt3R-M) to establish pairwise correspondences and predict point maps, which serve as geometric priors for initializing and guiding an online 3DGS representation. We then jointly optimize Gaussian parameters and 6DoF object poses in a coarse-to-fine manner, enabling robust tracking and high-fidelity reconstruction. To control model growth and maintain efficiency over time, we further introduce adaptive Gaussian management with capacity-aware selection and mask-consistent pruning. Experiments on YCBInEOAT and HO3D show that MGS-Track consistently outperforms competitive monocular baselines on both pose tracking and object reconstruction in challenging object-centric scenarios.
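The abstract names mask-consistent pruning and capacity-aware selection but does not spell out their criteria. The following is a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes that a Gaussian is mask-consistent when its center, transformed by the current object pose (R, t) and projected with pinhole intrinsics K, lands on the object mask, and that capacity is enforced by keeping the highest-opacity survivors. The function names `mask_consistent` and `select_by_capacity`, the opacity ranking, and the `max_gaussians` parameter are all hypothetical.

```python
import numpy as np


def mask_consistent(centers, R, t, K, mask):
    """Flag Gaussians whose projected centers land on the object mask.

    centers: (N, 3) Gaussian centers in the object frame.
    R, t:    object-to-camera rotation (3, 3) and translation (3,).
    K:       (3, 3) pinhole intrinsics with last row [0, 0, 1].
    mask:    (H, W) boolean object mask for the current frame.
    Assumption: testing only the center is a simplification for illustration.
    """
    h, w = mask.shape
    cam = centers @ R.T + t                  # object frame -> camera frame
    keep = cam[:, 2] > 1e-6                  # discard points behind the camera
    idx = np.flatnonzero(keep)
    pix = cam[idx] @ K.T                     # homogeneous pixel coordinates
    u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
    v = np.round(pix[:, 1] / pix[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    on_mask = np.zeros(len(idx), dtype=bool)
    on_mask[inside] = mask[v[inside], u[inside]]
    keep[idx] = on_mask
    return keep


def select_by_capacity(keep, opacities, max_gaussians):
    """Return indices of kept Gaussians, capped at max_gaussians.

    Assumption: when over budget, the highest-opacity Gaussians are retained.
    """
    idx = np.flatnonzero(keep)
    if len(idx) > max_gaussians:
        order = np.argsort(-opacities[idx], kind="stable")
        idx = np.sort(idx[order[:max_gaussians]])
    return idx
```

In an online loop, a step like this would run after each pose update: `select_by_capacity(mask_consistent(centers, R, t, K, mask), opacities, max_gaussians)` yields the indices of the Gaussians to retain, so the model size stays bounded as new frames arrive.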