ICRA 2026

SLoFT: End-To-End Semantic Localization with Floorplan and Transformer

Chaerin Min, Hongsheng Yu, Fengtao Fan, Srinath Sridhar, Qiuxuan Wu, Chao Guo

AI summary

SLoFT enables accurate, privacy-preserving camera localization using only raw 2D floorplan images and a single unified model that generalizes across diverse indoor and outdoor environments without retraining.

Visual Localization, Floorplan Mapping, Transformer, End-to-End Learning, AR Navigation, Privacy-Preserving

Problem

Traditional visual localization relies on costly, privacy-invasive 3D maps that quickly become obsolete, while existing floorplan-based methods are constrained by strict assumptions, limited layouts, and poor generalization across domains.

Approach

The model treats raw 2D floorplans as rich semantic images and uses a dual Vision Transformer encoder with a cross-attention fusion module to match camera visual cues to the floorplan, predicting the camera’s 2D position and yaw as a probability distribution.
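As a rough illustration of the fusion and output stages described above, the following NumPy sketch lets floorplan tokens attend to camera tokens and turns the fused features into a joint probability over grid cells and yaw bins. It is a minimal sketch under stated assumptions, not the authors' implementation: the single attention head, the residual connection, the linear `yaw_head`, the discretization into `grid_hw` cells and yaw bins, and every function name here are hypothetical and do not come from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(floorplan_tokens, camera_tokens):
    """Single-head cross-attention (assumed design): each floorplan token
    queries all camera tokens, and the result is added back residually."""
    dim = floorplan_tokens.shape[-1]
    scores = floorplan_tokens @ camera_tokens.T / np.sqrt(dim)  # (N_fp, N_cam)
    weights = softmax(scores, axis=-1)
    return floorplan_tokens + weights @ camera_tokens


def pose_distribution(fused_tokens, yaw_head, grid_hw):
    """Map fused tokens to one joint distribution over (row, col, yaw bin).

    yaw_head is a hypothetical (dim, num_yaw_bins) projection matrix."""
    height, width = grid_hw
    logits = fused_tokens @ yaw_head              # (height * width, num_yaw_bins)
    probs = softmax(logits.reshape(-1), axis=0)   # normalize over all poses jointly
    return probs.reshape(height, width, -1)


def decode_pose(probs, metres_per_cell):
    """Return the most likely pose as (x in m, y in m, yaw in degrees),
    using the centre of the winning grid cell and yaw bin."""
    num_yaw_bins = probs.shape[2]
    row, col, yaw_bin = np.unravel_index(np.argmax(probs), probs.shape)
    x = (col + 0.5) * metres_per_cell
    y = (row + 0.5) * metres_per_cell
    yaw_deg = (yaw_bin + 0.5) * 360.0 / num_yaw_bins
    return float(x), float(y), float(yaw_deg)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    floorplan_tokens = rng.normal(size=(12, 8))  # a 3 x 4 grid of floorplan patches
    camera_tokens = rng.normal(size=(5, 8))      # 5 camera-image patches
    yaw_head = rng.normal(size=(8, 4))           # 4 yaw bins of 90 degrees each

    fused = cross_attention(floorplan_tokens, camera_tokens)
    probs = pose_distribution(fused, yaw_head, grid_hw=(3, 4))
    print(decode_pose(probs, metres_per_cell=0.5))
```

Keeping the output as a full distribution rather than a single regressed pose means ambiguous views (for example, two similar corridors) can be represented as several modes; taking the argmax is only one possible way to read out a pose.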

Key results

Unified model generalizes across unseen indoor and outdoor domains
Achieves up to 63.6% recall at 1 m error on the Structured3D and MGL benchmarks
Runs at 13.3 FPS without retraining or test-time optimization
Remains robust to floorplan rotations, lighting changes, and varying camera intrinsics
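
For readers unfamiliar with the "recall at 1 m error" metric quoted above, here is a minimal sketch of how such a number is commonly computed: the fraction of test images whose predicted position falls within a distance threshold of the ground truth. The exact evaluation protocol is not given on this page, so the Euclidean distance in metres, the inclusive threshold, and the function name `recall_at_threshold` are assumptions.

```python
import math


def recall_at_threshold(predicted_xy, true_xy, threshold_m=1.0):
    """Fraction of predictions within threshold_m metres of the ground truth.

    Both arguments are equal-length sequences of (x, y) positions in metres.
    """
    if len(predicted_xy) != len(true_xy):
        raise ValueError("predicted_xy and true_xy must have the same length")
    if not predicted_xy:
        raise ValueError("at least one prediction is required")

    hits = sum(
        1
        for pred, true in zip(predicted_xy, true_xy)
        if math.dist(pred, true) <= threshold_m  # inclusive threshold (assumed)
    )
    return hits / len(predicted_xy)


if __name__ == "__main__":
    # Made-up positions for illustration only; position errors are 0.5 m, 2.0 m, 0.9 m.
    predictions = [(0.0, 0.0), (2.0, 0.0), (5.0, 5.0)]
    ground_truth = [(0.5, 0.0), (4.0, 0.0), (5.0, 5.9)]
    print(recall_at_threshold(predictions, ground_truth))  # 2 of 3 within 1 m
```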

Why it matters

It offers a scalable, privacy-preserving alternative to 3D mapping for AR navigation, AI audio guidance, and mobile robotics by eliminating costly map construction and strict environmental constraints.

Abstract

Visual localization is critical for AR navigation, AI-driven audio guidance, and mobile robot localization. However, traditional SLAM methods that rely on pre-built 3D maps suffer from high costs, privacy concerns, and sensitivity to environmental changes. Recent floorplan-based localization methods attempt to address these challenges by using 2D floorplans, eliminating the need for 3D map construction. Still, existing approaches are often impractical for real-world applications, as they are limited to specific layouts and fail to generalize beyond their training domains. We propose a novel approach that learns to semantically match visual cues from a camera image to a floorplan image rich in semantic details, inspired by the human ability to localize oneself directly from a complex floorplan image. To achieve this, we train a single, unified model on a diverse dataset of 1.2M images and 740K floorplans that we curated, which includes a new collection of semantically rich, real-world floorplans. This allows our model to generalize effectively to previously unseen areas and suggests potential for generalization to unseen buildings. Without making assumptions about camera poses or floorplan structures, our end-to-end model outperforms existing methods and handles variations such as floorplan rotations, lighting changes, and different camera intrinsics.

Index terms

Localization, Visual Learning, Deep Learning for Visual Perception