Semantic Equirectangular Visual Tracking in Lightweight 3D Building Reconstructions
Hussein Loubani, Nathan Crombez, Jocelyn Buisson, Yassine Ruichek
AI summary
Problem
Accurate visual localization typically depends on dense, high-fidelity 3D models that are costly and unscalable, while lightweight city models lack the textures and fine details needed for reliable alignment.
Approach
The method converts real panoramic semantic building masks into Gaussian Mixtures and aligns them with synthetic masks rendered from coarse 3D models, using a seamless 360° formulation and frequency-domain computation for efficient optimization.
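To illustrate why a Gaussian Mixture helps where a binary mask does not, the following minimal sketch places one isotropic Gaussian on every building pixel and compares horizontal gradients far from the building. It is an illustration under stated assumptions, not the authors' implementation: the function name `gaussian_mixture_direct`, the extent parameter `lam`, and the use of pixel-space (rather than spherical) distances are all hypothetical choices.

```python
import numpy as np

def gaussian_mixture_direct(mask, lam):
    """Sum one isotropic Gaussian per foreground pixel (direct form).

    mask: 2D array of 0/1 building labels in equirectangular layout.
    lam:  Gaussian extent in pixels (hypothetical parameter name).
    Assumption: distances are measured in pixel space, with the
    horizontal axis treated as periodic (360 degree panorama).
    """
    h, w = mask.shape
    vs, us = np.nonzero(mask)                # foreground pixel coordinates
    grid_v, grid_u = np.mgrid[0:h, 0:w]
    out = np.zeros((h, w), dtype=float)
    for v, u in zip(vs, us):
        du = np.abs(grid_u - u)
        du = np.minimum(du, w - du)          # wrap across the left/right seam
        dv = grid_v - v
        out += np.exp(-(du**2 + dv**2) / (2.0 * lam**2))
    return out

# Toy panorama: a single rectangular "building" region.
mask = np.zeros((16, 32))
mask[6:10, 12:20] = 1.0
gm = gaussian_mixture_direct(mask, lam=4.0)

# Central-difference horizontal gradient at a column far from the building.
row, col = 8, 4
grad_mask = mask[row, col + 1] - mask[row, col - 1]  # flat: no information
grad_gm = gm[row, col + 1] - gm[row, col - 1]        # points toward the building
```

In this toy case the binary mask is flat away from its boundary, so an optimizer receives no signal there, while the mixture still slopes toward the building. The direct sum visits every pixel for every foreground pixel, which is the cost that a frequency-domain computation avoids.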
Key results
- Semantic-based alignment pipeline for visual tracking over coarse 3D models
- Gaussian Mixture extension to semantic masks overcoming poor binary gradients
- Frequency-domain GM calculation reducing computational complexity to O(P log P)
- Seamless 360° equirectangular preprocessing eliminating boundary artifacts
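The last two points can be sketched together. Summing one identical Gaussian per mask pixel is a convolution of the mask with a Gaussian kernel, so it can be evaluated with FFTs in O(P log P), assuming P denotes the number of pixels. The FFT's built-in periodicity also matches the left/right seam of an equirectangular image. The sketch below is an assumption-laden illustration, not the paper's preprocessing: `gaussian_mixture_fft` and `lam` are hypothetical names, distances are in pixel space, and zero-padding the vertical axis is one possible way to keep the poles from wrapping.

```python
import numpy as np

def gaussian_mixture_fft(mask, lam):
    """Gaussian Mixture of a binary mask via FFT-based circular convolution.

    Horizontal axis: left periodic on purpose (360 degree seam).
    Vertical axis: zero-padded so top and bottom rows do not interact
    (an assumption of this sketch, not a documented step of the method).
    """
    h, w = mask.shape
    pad = int(np.ceil(4 * lam))              # kernel is negligible beyond this
    padded = np.pad(mask.astype(float), ((pad, pad), (0, 0)))
    H = padded.shape[0]

    # Gaussian kernel sampled on circular offsets, centred at index (0, 0).
    dv = np.minimum(np.arange(H), H - np.arange(H))
    du = np.minimum(np.arange(w), w - np.arange(w))
    kernel = np.exp(-(dv[:, None]**2 + du[None, :]**2) / (2.0 * lam**2))

    # Pointwise product in the frequency domain equals circular convolution.
    spectrum = np.fft.rfft2(padded) * np.fft.rfft2(kernel)
    out = np.fft.irfft2(spectrum, s=padded.shape)
    return out[pad:pad + h]                  # drop the padding rows

# A building pixel on the left border influences the right border.
seam_mask = np.zeros((8, 16))
seam_mask[4, 0] = 1.0
seam_gm = gaussian_mixture_fft(seam_mask, lam=2.0)
```

With this layout, `seam_gm[4, 15]` and `seam_gm[4, 1]` are both one pixel away from the source and receive the same value, so the seam introduces no discontinuity.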
Why it matters
Enables scalable, privacy-preserving visual localization for robotics and AR without relying on expensive dense reconstructions.
Abstract
Accurate visual localization often relies on dense, high-fidelity 3D models, which provide rich geometric and photometric detail but are expensive to acquire, heavy to store, and limited in scalability. As an alternative, lightweight city models represent only coarse building volumes, offering compactness, accessibility, and privacy but posing challenges for reliable alignment due to the lack of textures and fine structure. This work addresses these challenges by introducing a semantic equirectangular Gaussian Mixture–based virtual visual servoing approach that aligns real panoramic images with synthetic views rendered from lightweight building models. The method combines semantic building masks with Gaussian Mixtures, a seamless 360° formulation, and frequency-domain computation to overcome the poor gradients of direct photometric binary-mask alignment while maintaining computational efficiency. Experiments on outdoor trajectories show stable tracking under frame skipping and dynamic occlusions through semantic masking. These results indicate that reliable localization is feasible with coarse city models, providing a scalable alternative to high-fidelity reconstructions and opening perspectives for deeper integration of semantic rules into the localization process.