ICRA 2026

Find Anything Like Humans: Online Semantic Mapping and Coarse-To-Fine Navigation in Dynamic Environments

AI summary

FALH enables robots to find objects in dynamic environments by mimicking human coarse-to-fine search, outperforming existing mapping baselines with lower computational cost.

open-vocabulary navigation, coarse-to-fine search, prompt-free mapping, dynamic environments, robot navigation, scene memory

Problem

Existing open-vocabulary navigation systems rely on prompt-driven pipelines and dense 3D reconstruction, which limit flexibility, ignore unseen objects, and impose high computational costs that hinder real-time operation in dynamic settings.

Approach

The framework builds a compact, prompt-free scene memory of class-agnostic visual features and poses during exploration, then retrieves likely locations via feature similarity and verifies targets through local fine search.
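The recall step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `MemoryEntry`, `SceneMemory`, and `recall`, the `(x, y, yaw)` pose layout, and the use of cosine similarity over short toy vectors are all assumptions made for illustration. The source only states that features are paired with poses and retrieved by feature similarity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    feature: list   # class-agnostic visual feature (toy vector; format is an assumption)
    pose: tuple     # pose at which the feature was observed, assumed here to be (x, y, yaw)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


@dataclass
class SceneMemory:
    """Hypothetical pose-feature memory: a flat list of (feature, pose) entries."""
    entries: list = field(default_factory=list)

    def add(self, feature, pose):
        # Online update: append each new observation as it arrives during exploration.
        self.entries.append(MemoryEntry(list(feature), tuple(pose)))

    def recall(self, query, k=3):
        """Return up to k (similarity, pose) pairs, most similar first."""
        scored = [(cosine(query, e.feature), e.pose) for e in self.entries]
        scored.sort(key=lambda sp: sp[0], reverse=True)
        return scored[:k]


memory = SceneMemory()
memory.add([1.0, 0.0, 0.0], (0.0, 0.0, 0.0))
memory.add([0.0, 1.0, 0.0], (2.0, 1.0, 1.57))
memory.add([0.9, 0.1, 0.0], (4.0, 3.0, 3.14))

# Query with a feature resembling the first and third observations.
top = memory.recall([1.0, 0.0, 0.0], k=2)
for score, pose in top:
    print(round(score, 3), pose)
```

A linear scan keeps the sketch short. A real system with many entries would more likely use a vectorized or approximate nearest-neighbour index, which the source does not specify.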

Key results

Prompt-free perception front-end builds scene memory from class-agnostic proposals
Coarse-to-fine search strategy recalls likely regions and performs precise local verification
Outperforms ConceptGraphs and HOV-SG in simulation success rates
Achieves reliable real-world deployment across static and dynamic scenes
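
The coarse-to-fine strategy listed above can be sketched as a loop over recalled candidates, each checked by a local verification step before it is accepted. This is an illustrative sketch only: the function `coarse_to_fine_search`, the `verify` callback, the 0.5 threshold, and the toy poses are assumptions, and the paper's actual verification procedure is not described on this page.

```python
def coarse_to_fine_search(candidates, verify, threshold=0.5):
    """Visit recalled poses in decreasing coarse score and verify each locally.

    candidates: iterable of (coarse_score, pose) pairs from memory recall.
    verify:     hypothetical callable, pose -> (verification_score, goal_or_None).
    Returns (goal, visited_poses); goal is None if no candidate passes.
    """
    visited = []
    for _, pose in sorted(candidates, key=lambda c: c[0], reverse=True):
        visited.append(pose)
        score, goal = verify(pose)
        if score >= threshold:
            # Fine stage succeeded: the target was confirmed near this pose.
            return goal, visited
    return None, visited


def fake_verify(pose):
    # Stand-in for a local detector run at the pose. In this toy scene the
    # object has moved, so the top-ranked remembered pose no longer shows it.
    currently_visible = {(2.0, 1.0): (0.9, (2.3, 1.1, 0.4))}
    return currently_visible.get(pose, (0.0, None))


recalled = [(0.95, (0.0, 0.0)), (0.80, (2.0, 1.0)), (0.30, (5.0, 5.0))]
goal, visited = coarse_to_fine_search(recalled, fake_verify)
print(goal)     # 3D goal estimated at the second candidate
print(visited)  # poses checked before the target was confirmed
```

Falling through to the next candidate when verification fails is one simple way to tolerate a stale memory entry in a changing scene. Whether FALH handles failures this way is not stated here.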

Why it matters

Provides a computationally efficient and flexible navigation framework for robots operating in real-world, changing environments without vocabulary constraints.

Abstract

Enabling robots to follow natural-language instructions in dynamic environments requires scene representations that support open-ended queries, adapt to change, and operate in real time. However, existing approaches often rely on prompt-driven pipelines and dense 3D reconstruction, which limit flexibility and impose high computational cost. We propose Find Anything Like Humans (FALH), an online framework inspired by how people search: recalling likely regions from memory and then verifying them up close. During exploration, FALH constructs a compact scene memory by pairing visual features with observed poses, using class-agnostic detectors without predefined prompts. At query time, it retrieves candidate locations via feature similarity, then performs local verification to confirm the target and estimate a precise 3D goal. All components operate on a unified pose-feature representation that supports efficient recall, online updates, and robust performance in cluttered, changing scenes. Experiments in simulation (HM3D, AI2-THOR) and in the real world show that FALH outperforms object-centric baselines in both success rate and responsiveness under limited resources. Code, videos, and datasets are available at: https://github.com/yutian929/Find-Anything-Like-Humans.

This work was supported by National Key Research and Development Program of China (No. 2024YFB4709800), Guangdong S&T Program (No. 2024B0101050002), Shenzhen Innovation in Science and Technology Foundation for The Excellent Youth Scholars (No. RCYX20231211090248064). Yutian Zhang, Jianyu Zhang and Mengyuan Liu are with the State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, Shenzhen 518055, China. Corresponding Author: liumengyuan@pku.edu.cn.

Index terms

Semantic Scene Understanding