ClustViT: Clustering-Based Token Merging for Semantic Segmentation
Fabio Montello, Ronja Güldenring, Lazaros Nalpantidis
AI summary
Problem
Vision Transformers suffer from quadratic computational complexity that limits real-world robotic deployment, and existing token merging methods are poorly suited for dense prediction tasks requiring preserved spatial and semantic detail.
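As a rough illustration of the quadratic term, the sketch below counts multiply-accumulate operations for the two matrix products in one self-attention head (the score matrix and the weighted sum of values). The token count of 1024 and the embedding width of 768 are assumed values chosen for illustration, not figures from the paper, and the helper name `attention_macs` is hypothetical.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Count multiply-accumulates for the two N x N products in self-attention.

    Ignores the Q/K/V/output projections and the MLP, which scale
    linearly with the number of tokens.
    """
    scores = num_tokens * num_tokens * dim        # Q @ K^T
    weighted_sum = num_tokens * num_tokens * dim  # softmax(scores) @ V
    return scores + weighted_sum


# Assumed sizes for illustration only.
full = attention_macs(num_tokens=1024, dim=768)
halved = attention_macs(num_tokens=512, dim=768)
print(full / halved)  # 4.0: halving the tokens quarters this term
```

Because the projection and MLP layers grow only linearly with the token count, the saving for a whole network is smaller than the saving in this term alone.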
Approach
ClustViT integrates a trainable clustering module that merges tokens based on semantic pseudo-clusters derived from segmentation masks, followed by a regenerator module that restores fine details to maintain compatibility with standard segmentation heads.
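The merge-then-restore idea can be sketched in a few lines of NumPy. This is a simplified stand-in, not the authors' implementation: it assumes cluster assignments are already given, merges by plain averaging, and "regenerates" by copying each merged token back to its original positions, whereas the paper describes trainable Cluster and Regenerator modules. The function names `merge_tokens` and `regenerate_tokens` are hypothetical.

```python
import numpy as np


def merge_tokens(tokens, cluster_ids):
    """Average all tokens that share a cluster id (assumed merge rule).

    tokens: (N, D) array; cluster_ids: (N,) integer array.
    Returns the (K, D) merged tokens and an (N,) index mapping each
    original position to its merged token.
    """
    tokens = np.asarray(tokens, dtype=float)
    unique_ids, inverse = np.unique(np.asarray(cluster_ids), return_inverse=True)
    inverse = inverse.reshape(-1)
    merged = np.zeros((unique_ids.size, tokens.shape[1]), dtype=float)
    np.add.at(merged, inverse, tokens)  # sum the tokens of each cluster
    counts = np.bincount(inverse, minlength=unique_ids.size)
    merged /= counts[:, None]           # turn sums into means
    return merged, inverse


def regenerate_tokens(merged, inverse):
    """Restore the original sequence length by copying merged tokens back."""
    return merged[inverse]


# Four tokens; the first two belong to the same pseudo-cluster.
tokens = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 0.0], [0.0, 10.0]])
cluster_ids = np.array([0, 0, 1, 2])
merged, inverse = merge_tokens(tokens, cluster_ids)
restored = regenerate_tokens(merged, inverse)
print(merged.shape, restored.shape)  # (3, 2) (4, 2)
```

The layers between merging and regeneration operate on the shorter sequence, which is where the compute saving comes from; restoring the full length afterwards keeps the output shape a standard segmentation head expects.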
Key results
- Up to 2.18× fewer GFLOPs and 1.64× faster inference across three datasets
- Comparable segmentation accuracy (mIoU) to baseline Vision Transformers
- Significant speedups on robotics-relevant datasets with large background regions
- End-to-end trainable clustering component integrated directly into the ViT backbone
Why it matters
Enables efficient, high-performance semantic segmentation with Vision Transformers on resource-constrained robotic platforms without sacrificing accuracy.
Abstract
Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited by their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, which extends the Vision Transformer (ViT) backbone to address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network, guided by pseudo-clusters derived from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18× fewer GFLOPs and 1.64× faster inference on three different datasets, with comparable segmentation accuracy. Our code and models are made publicly available.