ICRA 2026

Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection Via Vision-Language Knowledge Distillation

Jinchang Zhang, Zijun Li, Jiakai Lin, Guoyu Lu

AI summary

A novel framework enables open-vocabulary object detection on event cameras by distilling CLIP’s visual knowledge and using spiking neural networks for adaptive event stream slicing.

Event cameras, Open-vocabulary detection, Knowledge distillation, Spiking neural networks, Vision-language models, Adaptive event slicing

Problem

Event cameras lack texture and color, making open-vocabulary detection difficult, while directly applying image-based vision-language models fails due to a severe modality gap. Traditional fixed event stream slicing also discards crucial temporal information or introduces redundancy.

Approach

The authors bridge the modality gap by distilling visual knowledge from a frozen CLIP image encoder into an event-based detector, while using a spiking neural network with self-supervised feedback to dynamically slice event streams at optimal moments.
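To make the slicing idea concrete, here is a minimal Python sketch. It is not the authors' network: it assumes a single hand-set leaky integrate-and-fire (LIF) neuron in place of the learned SNN, and it leaves out the self-supervised feedback entirely. The function name `lif_slices`, the parameters `threshold` and `tau_s`, and the `(timestamp, x, y, polarity)` event layout are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

# Assumed event layout: (timestamp_seconds, x, y, polarity).
Event = Tuple[float, int, int, int]


def lif_slices(events: List[Event], threshold: float = 4.0,
               tau_s: float = 0.01) -> List[List[Event]]:
    """Cut an event stream wherever a toy LIF membrane crosses `threshold`.

    Each event adds 1.0 to the membrane potential, which decays
    exponentially with time constant `tau_s` between events. Dense bursts
    therefore trigger cuts quickly, while sparse activity accumulates
    into longer slices. (Hypothetical stand-in for a learned SNN.)
    """
    slices: List[List[Event]] = []
    current: List[Event] = []
    potential = 0.0
    last_t = None
    for ev in events:
        t = ev[0]
        if last_t is not None:
            # Leak: decay the potential over the gap since the last event.
            potential *= math.exp(-(t - last_t) / tau_s)
        potential += 1.0  # integrate the incoming event
        last_t = t
        current.append(ev)
        if potential >= threshold:
            # Spike: close the current slice and reset the membrane.
            slices.append(current)
            current = []
            potential = 0.0
    if current:
        slices.append(current)  # keep any trailing events
    return slices
```

With a dense burst followed by a few widely spaced events, this sketch places its cuts inside the burst and keeps the sparse tail together as one slice, whereas a fixed time window would split or merge them regardless of activity.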

Key results

First event-based open-vocabulary object detection framework
CLIP-to-event knowledge distillation bridging the modality gap
Self-supervised spiking neural network for adaptive event slicing
Category-agnostic proposal module for enhanced generalization

Why it matters

Enables event cameras to detect unseen objects using natural language, advancing real-time, low-power perception for robotics and autonomous systems.

Abstract

Event cameras offer advantages in object detection tasks thanks to properties such as high-speed response, low latency, and robustness to motion blur. However, event cameras inherently lack texture and color information, making open-vocabulary detection particularly challenging. Current event-based detection methods are typically trained on predefined target categories, limiting their ability to generalize to novel objects in settings where encountering previously unseen objects is common. Vision-language models (VLMs) have enabled open-vocabulary object detection in RGB images. However, the modality gap between images and event streams makes it ineffective to directly transfer CLIP to event data, as CLIP was not designed for event streams. To bridge this gap, we propose an event-image knowledge distillation framework, leveraging CLIP’s semantic understanding to achieve open-vocabulary object detection on event data. Instead of training CLIP directly on event streams, we use image frames as teacher model inputs, guiding the event-based student model to learn CLIP’s rich visual representations. Through spatial attention-based distillation, the student network learns meaningful visual features directly from raw event inputs, while inheriting CLIP’s broad visual knowledge. Furthermore, to prevent information loss due to event data segmentation, we design a hybrid Spiking Neural Network (SNN) and Convolutional Neural Network (CNN) framework. Unlike fixed-group event segmentation methods, which often discard crucial temporal information, our SNN adaptively determines the optimal event segmentation moments, ensuring that key temporal features are extracted. The extracted event features are then processed by CNNs for object detection.
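The abstract does not give the form of the spatial attention-based distillation loss, so the following Python sketch only illustrates one common formulation and should not be read as the paper's loss. It assumes an attention-transfer style objective: each feature map is collapsed over channels into a spatial attention map, the map is L2-normalized, and the student is penalized for deviating from the teacher. The names `spatial_attention` and `attention_distillation_loss` are hypothetical, and plain nested lists stand in for tensors.

```python
import math
from typing import List

# Assumed layout: feature[channel][row][col].
FeatureMap = List[List[List[float]]]


def spatial_attention(feat: FeatureMap) -> List[float]:
    """Collapse channels into a flattened, L2-normalized spatial map.

    Attention at each location is the sum of squared activations over
    channels, so teacher and student may have different channel counts.
    """
    height, width = len(feat[0]), len(feat[0][0])
    att = [sum(ch[i][j] ** 2 for ch in feat)
           for i in range(height) for j in range(width)]
    norm = math.sqrt(sum(a * a for a in att)) or 1.0  # avoid divide-by-zero
    return [a / norm for a in att]


def attention_distillation_loss(student: FeatureMap,
                                teacher: FeatureMap) -> float:
    """Mean squared difference between normalized attention maps."""
    a_s = spatial_attention(student)
    a_t = spatial_attention(teacher)
    if len(a_s) != len(a_t):
        raise ValueError("student and teacher must share spatial size")
    return sum((s - t) ** 2 for s, t in zip(a_s, a_t)) / len(a_s)
```

In this formulation the teacher features would come from the frozen CLIP image encoder applied to an image frame and the student features from the event branch. Because the maps are normalized, the loss depends on where each network concentrates its activation rather than on the raw feature magnitudes.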

Index terms

Object Detection, Segmentation and Categorization; Deep Learning for Visual Perception