ICRA 2026

OTAS: Open-vocabulary Token Alignment for Outdoor Segmentation

Simon Schwaiger, Stefan Thalhammer, Wilfried Wöber, and Gerald Steinbauer-Wagner

AI summary

OTAS enables real-time, zero-shot open-vocabulary segmentation in unstructured outdoor environments by aligning self-supervised visual tokens with language features, outperforming existing 3D mapping methods and matching fine-tuned 2D baselines.
open-vocabulary segmentation, outdoor robotics, token alignment, zero-shot learning, semantic mapping, vision-language models

Problem

Existing open-vocabulary segmentation models rely on object-centric priors that fail in unstructured outdoor settings because of semantic ambiguities and indistinct boundaries. Current language-grounded 3D mapping approaches are computationally heavy, require per-scene training, or retain these object-centric biases.

Approach

The method clusters self-supervised visual tokens into coarse semantic structures and pools vision-language features over these clusters for language grounding, optionally fusing multi-view data into a geometrically consistent 3D feature field without per-scene fine-tuning or differentiable rendering.
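
The cluster-then-pool idea described above can be sketched roughly as follows for a single image. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: plain k-means with farthest-point initialisation stands in for whatever clustering OTAS actually uses, the multi-view 3D fusion step is left out, and the names `assign`, `kmeans`, `pool_by_cluster`, `segment`, `visual_tokens`, `vl_features`, and `text_embeddings` are all hypothetical. In practice the inputs would be per-patch features from a self-supervised vision model and a vision-language model, plus text embeddings of the query prompts.

```python
import numpy as np


def assign(tokens, centers):
    """Index of the nearest center (squared Euclidean distance) per token."""
    dists = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)


def kmeans(tokens, k, iters=20):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering)."""
    tokens = np.asarray(tokens, dtype=float)
    # Deterministic farthest-point initialisation: start from token 0, then
    # repeatedly add the token farthest from all centers chosen so far.
    centers = [tokens[0]]
    for _ in range(1, k):
        d = np.min([((tokens - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(tokens[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = assign(tokens, centers)
        for c in range(k):
            members = tokens[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return assign(tokens, centers)


def l2_normalize(x):
    """Scale rows to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True).clip(min=1e-12)


def pool_by_cluster(labels, vl_features, k):
    """Mean-pool vision-language features over the tokens of each cluster."""
    pooled = np.zeros((k, vl_features.shape[1]))
    for c in range(k):
        mask = labels == c
        if mask.any():
            pooled[c] = vl_features[mask].mean(axis=0)
    return pooled


def segment(visual_tokens, vl_features, text_embeddings, k):
    """Return, for every token, the index of the best-matching text prompt."""
    labels = kmeans(visual_tokens, k)               # coarse semantic structure
    pooled = pool_by_cluster(labels, vl_features, k)
    sims = l2_normalize(pooled) @ l2_normalize(text_embeddings).T
    cluster_class = sims.argmax(axis=1)             # one prompt per cluster
    return cluster_class[labels]                    # broadcast back to tokens
```

Calling `segment` with arrays of shape `(N, D_v)`, `(N, D_l)`, and `(P, D_l)` yields `N` prompt indices, which can be reshaped to the patch grid to form a coarse segmentation mask. Pooling before comparing with the text embeddings means every token in a cluster receives the same label, which is the point of extracting structure from the visual tokens first.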

Key results

  • Training-free token alignment fusing DINOv2 and CLIP features for zero-shot segmentation
  • Real-time 2D segmentation on ORFD achieving up to 94.34 IoU at ~17 fps
  • Up to 151% relative IoU improvement in 3D vegetation segmentation on TartanAir over baselines
  • Language-grounded 3D feature field enabling real-time mapping without scene-specific MLPs or rendering
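
To make the metrics in the list above concrete, the sketch below computes IoU for a pair of binary masks and a relative improvement in percent. It assumes the standard definitions (intersection over union, and the gain divided by the baseline value); the function names and the numbers in the usage note are made up for illustration and are not taken from the paper.

```python
import numpy as np


def iou(pred, target):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def relative_improvement(new, baseline):
    """Gain of `new` over `baseline`, as a percentage of `baseline`."""
    return 100.0 * (new - baseline) / baseline
```

Under this definition, a 151% relative improvement means the new IoU is 2.51 times the baseline. For example, a hypothetical baseline IoU of 0.20 rising to 0.502 gives `relative_improvement(0.502, 0.20)` of about 151.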

Why it matters

Provides a fast, zero-shot semantic mapping solution critical for robotic navigation and task planning in complex, unstructured outdoor environments.

Abstract

Understanding open-world semantics is critical for robotic planning and control, particularly in unstructured outdoor environments. Existing vision-language mapping approaches typically rely on object-centric segmentation priors, which often fail outdoors due to semantic ambiguities and indistinct class boundaries. We propose OTAS, an Open-vocabulary Token Alignment method for Outdoor Segmentation. OTAS addresses the limitations of open-vocabulary segmentation models by extracting semantic structure directly from the output tokens of pre-trained vision models. By clustering semantically similar structures across single and multiple views and grounding them in language, OTAS reconstructs a geometrically consistent feature field that supports open-vocabulary segmentation queries. Our method operates in a zero-shot manner, without scene-specific fine-tuning, and achieves real-time performance of up to ≈17 fps. On the Off-Road Freespace Detection dataset, OTAS yields a modest IoU improvement over fine-tuned and open-vocabulary 2D segmentation baselines. In 3D segmentation on TartanAir, it achieves up to a 151% relative IoU improvement compared to existing open-vocabulary mapping methods. Real-world reconstructions further demonstrate OTAS’ applicability to robotic deployment. Code and a ROS 2 node are available at https://otas-segmentation.github.io/.

Index terms

Semantic Scene Understanding, Visual Learning, Deep Learning for Visual Perception

Related papers