ICRA 2026

SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video

Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng

AI summary

SAVMap accurately reconstructs large-scale semantic wireframe maps of warehouse infrastructure from a single consumer-grade panoramic video by combining semantic segmentation with geometry-constrained structure-from-motion.

Panoramic mapping; Structure-from-motion; Semantic wireframes; Warehouse digitization; Constrained optimization; 2.5D reconstruction

Problem

Generic 3D mapping techniques struggle with drift in visually repetitive environments and lack semantic context, while high-accuracy LiDAR systems are costly and complex to deploy at scale.

Approach

The pipeline extracts rectified shelf and ceiling views from panoramic video, detects structural boundaries via semantic segmentation, tracks feature points across frames, and solves for 3D coordinates using a modified structure-from-motion algorithm that enforces Manhattan grid constraints.
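The first step, extracting rectified views from a panorama, can be illustrated with a minimal sketch. It assumes the panoramic frames are stored in equirectangular projection, which this page does not state, and it maps a pixel of a virtual pinhole view to the panorama location it should be sampled from. The function name, the axis conventions, and the example yaw and pitch values are all hypothetical choices for illustration, not details taken from the paper.

```python
import math


def perspective_to_equirect(u, v, view_w, view_h, fov_deg,
                            yaw_deg, pitch_deg, pano_w, pano_h):
    """Map pixel (u, v) of a virtual pinhole view to (x, y) in an
    equirectangular panorama.

    Assumed conventions (hypothetical, not from the paper): camera x points
    right, y points down, z points forward; positive yaw turns right,
    positive pitch looks up; the panorama centre is straight ahead.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    f = (view_w / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

    # Ray through the pixel, expressed in the virtual camera frame.
    x, y, z = u - view_w / 2.0, v - view_h / 2.0, f

    # Pitch: rotate about the x-axis (positive pitch tilts the view upward).
    p = math.radians(pitch_deg)
    y, z = y * math.cos(p) - z * math.sin(p), y * math.sin(p) + z * math.cos(p)

    # Yaw: rotate about the y-axis (positive yaw turns the view to the right).
    q = math.radians(yaw_deg)
    x, z = x * math.cos(q) + z * math.sin(q), -x * math.sin(q) + z * math.cos(q)

    # Convert the ray direction to longitude and latitude.
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)                           # range [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, y / norm)))   # range [-pi/2, pi/2]

    # Longitude spans the panorama width, latitude spans its height.
    return (lon / (2.0 * math.pi) + 0.5) * pano_w, (lat / math.pi + 0.5) * pano_h


# Hypothetical usage: a side-facing (shelf) view and an upward (ceiling) view.
shelf_centre = perspective_to_equirect(320, 240, 640, 480, 90, 90, 0, 4096, 2048)
ceiling_centre = perspective_to_equirect(320, 240, 640, 480, 90, 0, 90, 4096, 2048)
```

Evaluating this mapping for every pixel of the virtual view and sampling the panorama at the returned coordinates yields one rectified image. Under the conventions assumed here, a yaw of 90 degrees gives a view facing sideways and a pitch of 90 degrees gives a view facing the ceiling.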

Key results

End-to-end pipeline generating semantic wireframe maps from consumer-grade panoramic video
Constrained structure-from-motion algorithm enforcing Manhattan grid geometry for geometric consistency
Real-world validation across 46 shelving rows achieving 4.8 cm aggregate mean absolute error against ground truth
Scalable reconstruction of over 5000 shelf elements from just one hour of video footage
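
To make the accuracy metric concrete, the sketch below computes a mean absolute error pooled over several rows. This page does not say how the aggregate figure is formed, so pooling every element across rows is an assumption; the alternative of averaging per-row means is shown for contrast. The function names are hypothetical and all numbers are synthetic placeholders, not measurements from the paper.

```python
def pooled_mae(pairs_by_row):
    """Mean absolute error over all (estimate, reference) pairs from all rows.

    Assumption: errors are pooled, so rows with more elements weigh more.
    """
    errors = [abs(est - ref)
              for pairs in pairs_by_row.values()
              for est, ref in pairs]
    return sum(errors) / len(errors)


def mean_of_row_maes(pairs_by_row):
    """Alternative aggregate: average the per-row MAEs, weighting rows equally."""
    row_maes = [sum(abs(est - ref) for est, ref in pairs) / len(pairs)
                for pairs in pairs_by_row.values()]
    return sum(row_maes) / len(row_maes)


# Synthetic example in metres (illustrative values only).
example = {
    "row_A": [(1.00, 1.05), (2.00, 1.98)],
    "row_B": [(3.00, 3.02)],
}
pooled = pooled_mae(example)          # averages 0.05, 0.02, 0.02
per_row = mean_of_row_maes(example)   # averages 0.035 and 0.02
```

The two aggregates differ whenever rows contain different numbers of elements, which is why the aggregation rule matters when comparing reported errors.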

Why it matters

Provides a low-cost, scalable alternative to LiDAR for generating semantically rich 3D warehouse maps essential for robot localization and digital twin generation.

Abstract

Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground-truth.
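
The effect of a Manhattan grid constraint can be shown with a toy example. The sketch below is not the paper's constrained structure-from-motion algorithm, which is not reproduced on this page. It assumes that noisy in-plane estimates of shelf corners are already available, and it shows how requiring corners in the same column to share a horizontal position, and corners in the same row to share a height, reduces the number of unknowns. The function names and all values are hypothetical.

```python
from collections import defaultdict


def fit_manhattan_grid(observations):
    """Fit shared column positions and row heights to noisy corner estimates.

    observations: {(row, col): (x, z)} in metres, measured in the shelf plane.
    Under the constraints x[row, col] = X[col] and z[row, col] = Z[row], the
    least-squares optimum for each X[col] and Z[row] is the mean of the
    corresponding observations.
    """
    xs_by_col = defaultdict(list)
    zs_by_row = defaultdict(list)
    for (row, col), (x, z) in observations.items():
        xs_by_col[col].append(x)
        zs_by_row[row].append(z)
    col_x = {c: sum(v) / len(v) for c, v in xs_by_col.items()}
    row_z = {r: sum(v) / len(v) for r, v in zs_by_row.items()}
    return col_x, row_z


def snap_to_grid(observations, col_x, row_z):
    """Replace each corner estimate with its grid-consistent position."""
    return {(r, c): (col_x[c], row_z[r]) for (r, c) in observations}


# Synthetic 2 x 2 grid of corners (illustrative values only).
noisy = {
    (0, 0): (0.1, 0.02), (0, 1): (2.9, -0.02),
    (1, 0): (-0.1, 1.48), (1, 1): (3.1, 1.52),
}
col_x, row_z = fit_manhattan_grid(noisy)
wireframe = snap_to_grid(noisy, col_x, row_z)
```

In this toy case, four corners with eight free coordinates collapse to two column positions and two row heights. On a large shelf face the reduction is much greater, which is one intuition for why such constraints can help keep a reconstruction geometrically consistent.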

Index terms

Mapping; Manufacturing, Maintenance and Supply Chains; SLAM