ICRA 2026

ClustViT: Clustering-Based Token Merging for Semantic Segmentation

Fabio Montello, Ronja Güldenring, Lazaros Nalpantidis

AI summary

ClustViT reduces the computational cost of Vision Transformers for semantic segmentation (up to 2.18× fewer GFLOPs) by dynamically merging semantically similar tokens while keeping accuracy comparable to the baseline.
Keywords: semantic segmentation, Vision Transformers, token merging, computational efficiency, robotic perception, clustering

Problem

The self-attention cost of Vision Transformers grows quadratically with the number of tokens, which limits real-world robotic deployment. Existing token merging methods are poorly suited to dense prediction tasks, which require spatial and semantic detail to be preserved.
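
To make the quadratic term concrete, the following is a rough sketch of a per-layer cost model for self-attention. The function name `attention_macs`, the token counts, and the embedding width are illustrative assumptions, not values from the paper; the model ignores softmax, multi-head bookkeeping, and the MLP block.

```python
def attention_macs(n_tokens: int, dim: int) -> dict:
    """Approximate multiply-accumulate counts for one self-attention layer.

    Hypothetical helper for illustration only; not taken from ClustViT.
    """
    # Q, K, V and output projections: four (n_tokens x dim) @ (dim x dim) products.
    projections = 4 * n_tokens * dim * dim
    # QK^T and the attention-weighted sum over V: both scale with n_tokens squared.
    quadratic = 2 * n_tokens * n_tokens * dim
    return {
        "projections": projections,
        "quadratic": quadratic,
        "total": projections + quadratic,
    }


# Assumed example sizes: 1024 tokens before merging, 512 after.
full = attention_macs(n_tokens=1024, dim=768)
merged = attention_macs(n_tokens=512, dim=768)

quadratic_ratio = full["quadratic"] / merged["quadratic"]        # 4.0
projection_ratio = full["projections"] / merged["projections"]   # 2.0
```

Under this simplified model, halving the token count cuts the attention-matrix work by a factor of four but the linear projections only by a factor of two, so the overall saving depends on how much of the layer's cost sits in the quadratic term.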

Approach

ClustViT integrates a trainable clustering module that merges tokens based on semantic pseudo-clusters derived from segmentation masks, followed by a regenerator module that restores fine details to maintain compatibility with standard segmentation heads.
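
The merge-then-restore idea can be sketched in a few lines. This is a non-learned stand-in, not the paper's method: ClustViT's Cluster and Regenerator modules are trainable, whereas this sketch assumes cluster assignments are already given, merges by simple averaging, and restores by copying each merged token back to its original positions. The names `merge_by_cluster` and `regenerate` are hypothetical.

```python
from collections import defaultdict


def merge_by_cluster(tokens, cluster_ids):
    """Average all tokens that share a cluster id (assumed merging rule).

    Returns the merged tokens and, for each original position, the index
    of the merged token that now represents it.
    """
    groups = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        groups[cid].append(idx)

    order = sorted(groups)  # deterministic ordering of merged tokens
    dim = len(tokens[0])
    merged = []
    for cid in order:
        members = groups[cid]
        merged.append(
            [sum(tokens[i][d] for i in members) / len(members) for d in range(dim)]
        )

    slot_of = {cid: slot for slot, cid in enumerate(order)}
    assignment = [slot_of[cid] for cid in cluster_ids]
    return merged, assignment


def regenerate(merged, assignment):
    """Restore the original sequence length by copying merged tokens back."""
    return [list(merged[slot]) for slot in assignment]


# Five toy 2-D tokens; the first two and the next two are assumed to be similar.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
cluster_ids = [0, 0, 1, 1, 2]

merged, assignment = merge_by_cluster(tokens, cluster_ids)
restored = regenerate(merged, assignment)
```

Here five tokens shrink to three for the intermediate layers and are expanded back to five before a segmentation head would consume them. A plain copy like this cannot recover per-token differences lost in averaging, which is the gap a learned regeneration step is meant to address.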

Key results

  • Up to 2.18× fewer GFLOPs and 1.64× faster inference across three datasets
  • Comparable segmentation accuracy (mIoU) to baseline Vision Transformers
  • Significant speedups on robotics-relevant datasets with large background regions
  • End-to-end trainable clustering component integrated directly into the ViT backbone
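
For readers unfamiliar with the accuracy metric, here is a minimal sketch of mean Intersection over Union (mIoU) on flattened label lists. It follows the common textbook definition; the paper's exact evaluation protocol (ignored labels, accumulation over the whole dataset) is not stated here, so treat the details as assumptions. The labels below are made up for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy per-pixel labels for a six-pixel image with three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.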

Why it matters

Enables efficient semantic segmentation with Vision Transformers on resource-constrained robotic platforms while keeping accuracy comparable to the baseline.

Abstract

Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18× fewer GFLOPs and 1.64× faster inference on three different datasets, with comparable segmentation accuracy. Our code and models are made publicly available.

Index terms

Deep Learning for Visual Perception; Object Detection, Segmentation and Categorization; Semantic Scene Understanding
