SC-VLMaps: Depth-Free Visual–Language Mapping Via Scene Coordinate Regression
Nanda Febri Istighfarin, Baehoon Choi, HyungGi Jo
AI summary
Problem
Existing visual-language mapping methods require depth sensors to reconstruct 3D geometry, which limits scalability and deployment due to hardware costs and practical constraints.
Approach
The framework replaces depth inputs with a scene coordinate regression network that predicts dense 3D coordinates directly from monocular RGB frames, fusing them with frozen visual-language features into an implicit voxel map.
Key results
- Denser, more compact 3D reconstructions than depth-dependent baselines
- Stronger semantic alignment and precise text-query localization on indoor and outdoor benchmarks
- Zero-shot generalization to unseen sequences without additional training
- Efficient online mapping at ~4 FPS using only monocular RGB input
Why it matters
It enables scalable, language-interactable robotic mapping using only inexpensive monocular cameras, lowering hardware barriers for embodied AI in diverse environments.
Abstract
The ability to connect visual observations with human language is increasingly valuable for embodied agents in tasks such as navigation and semantic mapping. The existing visual–language maps (VLMaps) approach enables this connection but typically depends on depth images to project semantic features into 3D space, which limits scalability due to sensor cost and deployment constraints. In this work, we introduce SC-VLMaps, a depth-free visual–language mapping framework that constructs semantic maps using only monocular RGB input. SC-VLMaps leverages a scene coordinate regression (SCR) network to predict dense 3D coordinates from images, bypassing the need for depth supervision and enabling implicit geometry reconstruction. The predicted coordinates are fused into a voxel grid and augmented with language-aligned features from a frozen visual–language encoder, producing maps that are both geometrically coherent and semantically enriched. By employing a multi-scene training strategy, SC-VLMaps generalizes from indoor datasets (7Scenes) to challenging outdoor benchmarks (Cambridge Landmarks). Experiments show that SC-VLMaps achieves denser, more compact maps with stronger semantic alignment than VLMaps, while requiring only monocular RGB images.
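To make the fusion step concrete, here is a minimal sketch of how per-pixel scene coordinates and language-aligned features might be accumulated into a voxel grid. The abstract does not specify the fusion rule, so the averaging scheme, the function name `fuse_into_voxel_grid`, and the `voxel_size` parameter are all assumptions made for illustration. In a real pipeline the coordinates would come from the SCR network and the features from the frozen visual–language encoder; here they are plain arrays.

```python
import numpy as np


def fuse_into_voxel_grid(coords, feats, voxel_size=0.05):
    """Average per-point features inside each occupied voxel.

    Hypothetical helper, not the authors' implementation.
    coords: (N, 3) predicted 3D scene coordinates (assumed to be in metres).
    feats:  (N, D) language-aligned feature vectors, one per point.
    Returns (voxels, mean_feats): integer voxel indices of shape (M, 3)
    and the averaged features of shape (M, D).
    """
    coords = np.asarray(coords, dtype=np.float64)
    feats = np.asarray(feats, dtype=np.float64)

    # Quantize each 3D point to an integer voxel index.
    idx = np.floor(coords / voxel_size).astype(np.int64)

    # Find the occupied voxels and map every point to its voxel.
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)

    # Sum the features per voxel, then divide by the point count.
    sums = np.zeros((len(voxels), feats.shape[1]))
    np.add.at(sums, inverse, feats)
    counts = np.bincount(inverse, minlength=len(voxels))
    return voxels, sums / counts[:, None]
```

Only occupied voxels are stored, so memory grows with the observed surface rather than with the scene's bounding volume. A running average over frames would be a natural extension for online mapping, but that is likewise an assumption rather than something the abstract states.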