ICRA 2026

Semantic Equirectangular Visual Tracking in Lightweight 3D Building Reconstructions

Hussein Loubani, Nathan Crombez, Jocelyn Buisson, Yassine Ruichek

AI summary

Reliable ground-level visual tracking is achievable using only coarse, lightweight 3D building models by aligning real semantic masks with synthetic views via Gaussian Mixture-based virtual visual servoing.

Visual tracking, lightweight 3D models, equirectangular imagery, Gaussian mixtures, virtual visual servoing, semantic alignment

Problem

Accurate visual localization typically depends on dense, high-fidelity 3D models that are costly and unscalable, while lightweight city models lack the textures and fine details needed for reliable alignment.

Approach

The method converts real panoramic semantic building masks into Gaussian Mixtures and aligns them with synthetic masks rendered from coarse 3D models, using a seamless 360° formulation and frequency-domain computation for efficient optimization.
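As a rough illustration of the mask-to-mixture step, the sketch below places one isotropic Gaussian on every building pixel of a binary mask and sums them, which is the same as convolving the mask with a Gaussian kernel. Doing the convolution with an FFT costs O(P log P) for P pixels and wraps around the left and right image edges, which suits an equirectangular panorama. The function name `gaussian_mixture_image`, the single shared `sigma`, and the 4-sigma row padding are assumptions made for this example; the paper's exact formulation may differ.

```python
import numpy as np


def gaussian_mixture_image(mask, sigma):
    """Sum of isotropic Gaussians centred on each foreground pixel of `mask`.

    This equals a convolution of the mask with a Gaussian kernel, computed
    here in the frequency domain. Columns (longitude) wrap around; rows
    (latitude) are zero-padded so the top and bottom do not bleed together.
    """
    mask = np.asarray(mask, dtype=np.float64)
    h, w = mask.shape

    # Zero-pad rows only. With 4*sigma on each side, the wrapped vertical
    # distance between the first and last image rows exceeds 8*sigma.
    pad = int(np.ceil(4 * sigma))
    padded = np.pad(mask, ((pad, pad), (0, 0)))
    H = h + 2 * pad

    # Signed pixel offsets in FFT ordering: 0, 1, ..., -2, -1.
    dy = np.fft.fftfreq(H) * H
    dx = np.fft.fftfreq(w) * w
    kernel = np.exp(-(dy[:, None] ** 2 + dx[None, :] ** 2) / (2 * sigma ** 2))

    # Circular convolution via the convolution theorem.
    spectrum = np.fft.rfft2(padded) * np.fft.rfft2(kernel)
    out = np.fft.irfft2(spectrum, s=(H, w))
    return out[pad:pad + h]


# Usage: a single building pixel in the first column spreads equally
# to the columns on either side of the seam.
mask = np.zeros((32, 64))
mask[16, 0] = 1.0
smooth = gaussian_mixture_image(mask, sigma=2.0)
```

A larger `sigma` widens the basin in which a mismatch between real and synthetic masks still produces a usable gradient, at the price of blurrier boundaries.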

Key results

Semantic-based alignment pipeline for visual tracking over coarse 3D models
Gaussian Mixture extension to semantic masks overcoming poor binary gradients
Frequency-domain GM calculation reducing computational complexity to O(P log P)
Seamless 360° equirectangular preprocessing eliminating boundary artifacts
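To see why binary masks give poor gradients, and why a seamless treatment of the panorama matters, consider a toy one-dimensional version: a single image row treated as a ring of 360 columns, with a building that straddles the left/right seam. The sketch below compares a sum-of-squared-differences cost over all circular shifts for the raw binary row and for a Gaussian-smoothed row. The names `smooth_ring` and `ssd_per_shift`, the ring size, and the value of `sigma` are illustrative assumptions, and the cost is a stand-in rather than the paper's objective.

```python
import numpy as np


def smooth_ring(signal, sigma):
    """Circular Gaussian smoothing of a 1D signal via FFT."""
    n = signal.size
    d = np.fft.fftfreq(n) * n                      # signed circular offsets
    kernel = np.exp(-d ** 2 / (2 * sigma ** 2))
    return np.fft.irfft(np.fft.rfft(signal) * np.fft.rfft(kernel), n=n)


def ssd_per_shift(reference, current):
    """SSD between `reference` and `current` rolled by every circular shift."""
    return np.array([
        np.sum((reference - np.roll(current, s)) ** 2)
        for s in range(reference.size)
    ])


n = 360
reference = np.zeros(n)
reference[350:] = 1.0      # building straddling the seam:
reference[:10] = 1.0       # columns 350..359 and 0..9
current = np.zeros(n)
current[80:100] = 1.0      # same 20-column building seen elsewhere

binary_cost = ssd_per_shift(reference, current)
smooth_cost = ssd_per_shift(smooth_ring(reference, 30.0),
                            smooth_ring(current, 30.0))

# Both costs reach their minimum at the shift that moves column 80 to 350.
best_shift = int(np.argmin(smooth_cost))
```

Wherever the two binary rows do not overlap, `binary_cost` is constant, so an optimizer gets no hint about which way to move. The smoothed cost keeps decreasing toward the correct shift over the same range, and because the smoothing is circular, the building split across the seam behaves like any other.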

Why it matters

Enables scalable, privacy-preserving visual localization for robotics and AR without relying on expensive dense reconstructions.

Abstract

Accurate visual localization often relies on dense, high-fidelity 3D models, which provide rich geometric and photometric detail but are expensive to acquire, heavy to store, and limited in scalability. As an alternative, lightweight city models represent only coarse building volumes, offering compactness, accessibility, and privacy but posing challenges for reliable alignment due to the lack of textures and fine structure. This work addresses these challenges by introducing a semantic equirectangular Gaussian Mixture–based virtual visual servoing approach that aligns real panoramic images with synthetic views rendered from lightweight building models. The method combines semantic building masks with Gaussian Mixtures, a seamless 360° formulation, and frequency-domain computation to overcome the poor gradients of direct photometric binary-mask alignment while maintaining computational efficiency. Experiments on outdoor trajectories show stable tracking under frame skipping and dynamic occlusions through semantic masking. These results indicate that reliable localization is feasible with coarse city models, providing a scalable alternative to high-fidelity reconstructions and opening perspectives for deeper integration of semantic rules into the localization process.

Index terms

Visual Tracking, Visual Servoing, Localization