ICRA 2024

SAM-Event-Adapter: Adapting Segment Anything Model for Event-RGB Semantic Segmentation

Bowen Yao, Yongjian Deng, Yuhan Liu, Hao Chen, You-Fu Li, Zhen Yang

Abstract

Semantic segmentation, a fundamental visual task widely used in sectors ranging from transportation and robotics to healthcare, has long attracted the attention of the research community. In the wake of rapid advances in large model research, a foundation model for semantic segmentation tasks, termed the Segment Anything Model (SAM), has been introduced. This model substantially alleviates two shortcomings of previous segmentation models: poor generalizability and the need to retrain the whole model for each new dataset. Nonetheless, segmentation models built on SAM remain constrained by the inherent limitations of RGB sensors, particularly in scenarios characterized by complex lighting conditions and high-speed motion. Motivated by these observations, a natural recourse is to adapt SAM to additional visual modalities without compromising its robust generalizability. To achieve this, we introduce a lightweight SAM-Event-Adapter (SE-Adapter) module, which incorporates event camera data into a cross-modal learning architecture based on SAM while adding only a small number of tunable parameters. Capitalizing on the high dynamic range and temporal resolution afforded by event cameras, our proposed multi-modal Event-RGB learning architecture effectively improves the performance of semantic segmentation tasks. In addition, we propose a novel paradigm for representing event data in a patch format compatible with transformer-based models, employing multi-spatiotemporal scale encoding to efficiently extract motion and semantic correlations from event representations. Extensive empirical evaluations on the DSEC-Semantic and DDD17 datasets validate the effectiveness and rationality of our proposed approach.
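To make the idea of a patch-format event representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes events arrive as rows of (x, y, t, polarity), accumulates them into a voxel grid (a common event representation, swapped in here for illustration), and splits the grid into non-overlapping patches that a transformer could consume as tokens. Using several temporal bin counts is only a stand-in for the multi-spatiotemporal scale encoding mentioned in the abstract. All function names and parameters are hypothetical.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate (x, y, t, polarity) events into a (num_bins, H, W) grid.

    Assumption: this voxel-grid encoding is illustrative only; the paper's
    actual event representation is not specified in the abstract.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2]
    p = np.where(events[:, 3] > 0, 1.0, -1.0)  # map polarity to +1 / -1
    span = max(t.max() - t.min(), 1e-9)        # avoid division by zero
    # Assign each event to a temporal bin; the latest event goes to the last bin.
    b = np.minimum(((t - t.min()) / span * num_bins).astype(np.int64),
                   num_bins - 1)
    np.add.at(grid, (b, y, x), p)              # unbuffered add handles repeated pixels
    return grid

def patchify(grid, patch):
    """Split a (C, H, W) grid into (N, C * patch * patch) patch tokens."""
    c, h, w = grid.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    g = grid.reshape(c, h // patch, patch, w // patch, patch)
    g = g.transpose(1, 3, 0, 2, 4)             # (rows, cols, C, patch, patch)
    return g.reshape((h // patch) * (w // patch), c * patch * patch)

def multi_scale_tokens(events, height, width, patch, bin_counts=(2, 4)):
    """Hypothetical multi-temporal-scale encoding: one token set per bin count."""
    return [patchify(events_to_voxel_grid(events, n, height, width), patch)
            for n in bin_counts]

# Usage: three synthetic events on a 4x4 sensor, 2x2 patches.
events = np.array([[0, 0, 0.0, 1],
                   [1, 0, 0.5, -1],
                   [3, 3, 1.0, 1]])
tokens = multi_scale_tokens(events, height=4, width=4, patch=2)
```

Each entry of `tokens` has one row per patch, so it can be passed through a linear projection in the same way as image patches in a vision transformer. How the event tokens are fused with the RGB branch of SAM is not covered by this sketch.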

Index terms

Computer Vision for Automation; Sensor Fusion; Deep Learning for Visual Perception