ICRA 2026

SC-VLMaps: Depth-Free Visual–Language Mapping Via Scene Coordinate Regression

Nanda Febri Istighfarin, Baehoon Choi, HyungGi Jo

AI summary

SC-VLMaps builds dense, semantically rich 3D maps from monocular RGB images alone by predicting scene coordinates, eliminating the need for costly depth sensors while generalizing across indoor and outdoor environments.

visual-language mapping, scene coordinate regression, monocular RGB, semantic mapping, embodied AI, depth-free reconstruction

Problem

Existing visual-language mapping methods require depth sensors to reconstruct 3D geometry, which limits scalability and deployment due to hardware costs and practical constraints.

Approach

The framework replaces depth inputs with a scene coordinate regression network that predicts dense 3D coordinates directly from monocular RGB frames, fusing them with frozen visual-language features into an implicit voxel map.
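The fusion step can be illustrated with a minimal sketch. It assumes a plain running-average voxel grid, which may differ from the paper's actual map representation. The function name `fuse_into_voxels`, the `voxel_size` value, and the toy inputs are hypothetical; in practice the coordinates would come from the SCR network and the features from the frozen visual–language encoder.

```python
from math import floor


def fuse_into_voxels(coords, feats, voxel_size=0.1):
    """Average per-pixel features into the voxels their predicted 3D points fall in.

    coords: list of (x, y, z) scene coordinates, one per pixel (stand-in for SCR output)
    feats:  list of feature vectors, one per pixel (stand-in for encoder output)
    voxel_size: edge length of a voxel in scene units (assumed value, not from the paper)
    """
    sums = {}    # voxel index -> running sum of feature vectors
    counts = {}  # voxel index -> number of points that landed in the voxel
    for (x, y, z), f in zip(coords, feats):
        # Quantize the predicted 3D point to an integer voxel index.
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        if key not in sums:
            sums[key] = [0.0] * len(f)
            counts[key] = 0
        for i, v in enumerate(f):
            sums[key][i] += v
        counts[key] += 1
    # Return the mean feature per occupied voxel.
    return {k: [v / counts[k] for v in s] for k, s in sums.items()}


# Toy example: two pixels land in the same voxel, one in another.
coords = [(0.01, 0.02, 0.03), (0.04, 0.01, 0.02), (0.26, 0.0, 0.0)]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
voxel_map = fuse_into_voxels(coords, feats)
```

Because the 3D position of every pixel is predicted from the RGB frame, no depth image or back-projection through camera intrinsics appears anywhere in this loop.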

Key results

Denser, more compact 3D reconstructions than depth-dependent baselines
Stronger semantic alignment and precise text-query localization on indoor and outdoor benchmarks
Zero-shot generalization to unseen sequences without additional training
Efficient online mapping at ~4 FPS using only monocular RGB input
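Text-query localization over a language-aligned map is commonly done by comparing a text embedding against the stored voxel features. The sketch below assumes cosine similarity and a top-k selection; the summary does not state which scoring rule the paper uses. The names `cosine` and `query_voxels`, and the two-dimensional toy embeddings, are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def query_voxels(voxel_feats, text_embedding, top_k=1):
    """Return the indices of the top_k voxels most similar to the text embedding.

    voxel_feats: dict mapping voxel index -> feature vector (stand-in for the semantic map)
    text_embedding: vector in the same space as the voxel features (stand-in for a text encoder output)
    """
    scored = [(cosine(f, text_embedding), key) for key, f in voxel_feats.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:top_k]]


# Toy map with two voxels and a query that points mostly along the second feature axis.
toy_map = {(0, 0, 0): [1.0, 0.0], (2, 0, 0): [0.0, 1.0]}
best = query_voxels(toy_map, [0.1, 0.9])
```

A real system would use high-dimensional embeddings from the same visual–language model for both the image features and the text query, so that similarity in embedding space corresponds to semantic match.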

Why it matters

It enables scalable, language-interactable robotic mapping using only inexpensive monocular cameras, lowering hardware barriers for embodied AI in diverse environments.

Abstract

The ability to connect visual observations with human language is increasingly valuable for embodied agents in tasks such as navigation and semantic mapping. The existing visual–language map (VLMaps) approach enables this connection but typically depends on depth images to project semantic features into 3D space, which limits scalability due to sensor cost and deployment constraints. In this work, we introduce SC-VLMaps, a depth-free visual–language mapping framework that constructs semantic maps using only monocular RGB input. SC-VLMaps leverages a scene coordinate regression (SCR) network to predict dense 3D coordinates from images, bypassing the need for depth supervision and enabling implicit geometry reconstruction. The predicted coordinates are fused into a voxel grid and augmented with language-aligned features from a frozen visual–language encoder, producing maps that are both geometrically coherent and semantically enriched. By employing a multi-scene training strategy, SC-VLMaps generalizes from indoor datasets (7Scenes) to challenging outdoor benchmarks (Cambridge Landmarks). Experiments show that SC-VLMaps achieves denser, more compact maps with stronger semantic alignment than VLMaps, while requiring only monocular RGB images.

Index terms

Mapping; Semantic Scene Understanding; Object Detection, Segmentation and Categorization