SLoFT: End-To-End Semantic Localization with Floorplan and Transformer
Chaerin Min, Hongsheng Yu, Fengtao Fan, Srinath Sridhar, Qiuxuan Wu, Chao Guo
AI summary
Problem
Traditional visual localization relies on costly, privacy-invasive 3D maps that quickly become obsolete, while existing floorplan-based methods are constrained by strict assumptions, limited layouts, and poor generalization across domains.
Approach
The model treats raw 2D floorplans as rich semantic images and uses a dual Vision Transformer encoder with a cross-attention fusion module to match camera visual cues to the floorplan, predicting the camera’s 2D position and yaw as a probability distribution.
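As a rough illustration of what "predicting the camera's 2D position and yaw as a probability distribution" can look like at inference time, the sketch below decodes a single pose from a discretized score volume over floorplan cells and yaw bins. The function name `decode_pose`, the grid layout `logits[row][col][yaw_bin]`, the cell size, and the number of yaw bins are all assumptions made for this example and are not taken from the paper.

```python
import math

def decode_pose(logits, meters_per_cell, num_yaw_bins):
    """Pick the most likely (x, y, yaw) from a score volume.

    logits[row][col][k] is an unnormalized score for the camera being in
    floorplan cell (row, col) with yaw bin k. This layout is hypothetical.
    Returns (x_m, y_m, yaw_deg, probability).
    """
    # Flatten to (score, row, col, yaw_bin) tuples.
    flat = [
        (score, r, c, k)
        for r, plane in enumerate(logits)
        for c, cell in enumerate(plane)
        for k, score in enumerate(cell)
    ]
    # Softmax over the whole volume, stabilized by subtracting the max score.
    top = max(score for score, _, _, _ in flat)
    norm = sum(math.exp(score - top) for score, _, _, _ in flat)
    best_score, r, c, k = max(flat, key=lambda item: item[0])
    prob = math.exp(best_score - top) / norm
    # Report the cell center in meters and the yaw bin start in degrees.
    x_m = (c + 0.5) * meters_per_cell
    y_m = (r + 0.5) * meters_per_cell
    yaw_deg = k * 360.0 / num_yaw_bins
    return x_m, y_m, yaw_deg, prob

# Toy 2 x 2 grid with 4 yaw bins and one confident cell.
toy = [[[0.0] * 4 for _ in range(2)] for _ in range(2)]
toy[1][0][2] = 5.0
pose = decode_pose(toy, meters_per_cell=0.5, num_yaw_bins=4)
```

Taking the arg-max is only one way to read out such a distribution; keeping the full volume also allows a downstream system to reason about ambiguous locations.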
Key results
- Unified model generalizes across unseen indoor and outdoor domains
- Achieves up to 63.6% recall at 1m error on Structured3D and MGL benchmarks
- Runs at 13.3 FPS without retraining or test-time optimization
- Remains robust to floorplan rotations, lighting changes, and varying camera intrinsics
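For readers unfamiliar with the "recall at 1m error" metric, the following minimal sketch computes the fraction of queries whose predicted 2D position falls within a distance threshold of the ground truth. The function name `recall_at` and the input format are hypothetical, and the paper's exact evaluation protocol (for example, whether yaw error is also thresholded) is not specified in this section.

```python
import math

def recall_at(predictions, ground_truth, max_error_m=1.0):
    """Fraction of predictions within max_error_m of the ground truth.

    predictions and ground_truth are equal-length lists of (x, y) positions
    in meters. Only position error is considered in this sketch.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not predictions:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(predictions, ground_truth)
        if math.hypot(px - gx, py - gy) <= max_error_m
    )
    return hits / len(predictions)

# Position errors here are 0.5 m, 2.0 m, and about 0.9 m.
preds = [(0.0, 0.0), (2.0, 0.0), (5.0, 5.0)]
truth = [(0.5, 0.0), (0.0, 0.0), (5.0, 5.9)]
score = recall_at(preds, truth, max_error_m=1.0)
```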
Why it matters
It offers a scalable, privacy-preserving alternative to 3D mapping for AR navigation, AI audio guidance, and mobile robotics by eliminating costly map construction and strict environmental constraints.
Abstract
Visual localization is critical for AR navigation, AI-driven audio guidance, and mobile robot localization. However, traditional SLAM methods that rely on pre-built 3D maps suffer from high costs, privacy concerns, and sensitivity to environmental changes. Recent floorplan-based localization methods attempt to address these challenges by using 2D floorplans, eliminating the need for 3D map construction. Still, existing approaches are often impractical for real-world applications, as they are limited to specific layouts and fail to generalize beyond their training domains. We propose a novel approach that learns to semantically match visual cues from a camera image to a floorplan image rich in semantic details, inspired by the human ability to localize oneself directly from a complex floorplan image. To achieve this, we train a single, unified model on a diverse dataset of 1.2M images and 740K floorplans that we curated, which includes a new collection of semantically rich, real-world floorplans. This allows our model to generalize effectively to previously unseen areas and suggests potential for generalization to unseen buildings. Without making assumptions about camera poses or floorplan structures, our end-to-end model outperforms existing methods and remains robust to variations such as floorplan rotations, lighting changes, and different camera intrinsics.