SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video
Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng
AI summary
Problem
Generic 3D mapping techniques struggle with drift in visually repetitive environments and lack semantic context, while high-accuracy LiDAR systems are costly and complex to deploy at scale.
Approach
The pipeline extracts rectified shelf and ceiling views from panoramic video, detects structural boundaries via semantic segmentation, tracks feature points across frames, and solves for 3D coordinates using a modified structure-from-motion algorithm that enforces Manhattan grid constraints.
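To make the Manhattan grid constraint concrete, here is a minimal sketch. It is not the authors' constrained structure-from-motion solver. It only illustrates the underlying geometric idea as a post-hoc step on points that have already been triangulated. The function name `snap_to_manhattan_grid`, the `(column, level)` indexing of shelf corners, and the axis convention (x along the aisle, y as depth of the shelf face, z as height) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import fmean


def snap_to_manhattan_grid(points):
    """Project noisy shelf-corner estimates onto an axis-aligned grid.

    points: dict mapping a hypothetical (column, level) index to a noisy
            (x, y, z) estimate in metres.
    Returns a dict with the same keys in which corners of one column share
    x, corners of one level share z, and all corners share the face depth y.
    """
    xs_by_column = defaultdict(list)
    zs_by_level = defaultdict(list)
    ys = []
    for (column, level), (x, y, z) in points.items():
        xs_by_column[column].append(x)
        zs_by_level[level].append(z)
        ys.append(y)

    # The mean is the least-squares estimate of a coordinate that several
    # noisy observations are assumed to share.
    x_of_column = {c: fmean(v) for c, v in xs_by_column.items()}
    z_of_level = {l: fmean(v) for l, v in zs_by_level.items()}
    y_of_face = fmean(ys)

    return {
        (column, level): (x_of_column[column], y_of_face, z_of_level[level])
        for (column, level) in points
    }


# Illustrative input: four corners of one shelf bay with a few cm of noise.
noisy = {
    (0, 0): (0.02, 1.01, 0.0),
    (0, 1): (-0.02, 0.99, 2.0),
    (1, 0): (3.0, 1.0, 0.1),
    (1, 1): (3.0, 1.0, 1.9),
}
snapped = snap_to_manhattan_grid(noisy)
```

After snapping, the two corners in column 0 share one x value and the two corners on each level share one height, which is the kind of geometric consistency the constraint is meant to enforce. A solver that applies the constraint during optimization, as the approach describes, would instead fold these shared coordinates into the reprojection-error minimization.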
Key results
- End-to-end pipeline generating semantic wireframe maps from consumer-grade panoramic video
- Constrained structure-from-motion algorithm enforcing Manhattan grid geometry for geometric consistency
- Real-world validation across 46 shelving rows achieving 4.8 cm aggregate mean absolute error against ground truth
- Scalable reconstruction of over 5000 shelf elements from just one hour of video footage
Why it matters
Provides a low-cost, scalable alternative to LiDAR for generating semantically rich 3D warehouse maps essential for robot localization and digital twin generation.
Abstract
Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground-truth.
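As a reading aid for the reported accuracy figure, the sketch below shows one plausible way to compute an aggregate mean absolute error over several shelving rows. The abstract does not say how the aggregation is done, so pooling every per-element error across rows before averaging is an assumption, as are the function name `aggregate_mae` and the sample values, which are invented for illustration and are not data from the paper.

```python
def aggregate_mae(rows):
    """Mean absolute error pooled over all elements of all rows.

    rows: iterable of rows, each a list of (estimated, ground_truth)
          coordinate pairs in metres. Pooling across rows is an assumed
          aggregation scheme, not one confirmed by the source.
    """
    errors = [abs(est - gt) for row in rows for est, gt in row]
    if not errors:
        raise ValueError("no measurements to aggregate")
    return sum(errors) / len(errors)


# Invented sample values for two rows (metres), purely illustrative.
sample_rows = [
    [(1.00, 1.05), (2.00, 1.96)],
    [(3.10, 3.00)],
]
mae_m = aggregate_mae(sample_rows)
mae_cm = mae_m * 100.0
```

Under this scheme, rows with more shelf elements contribute more to the aggregate. Averaging per-row means instead would weight every row equally and could give a different number.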