ICRA 2026

CETUS: Causal Event-Driven Temporal Modeling with Unified Variable-Rate Scheduling

Hanfang Liang, Bing Wang, Shizhen Zhang, Wen Jiang, Yizhuo Yang, Weixiang Guo, Shenghai Yuan

AI summary

CETUS eliminates fixed-window latency by directly processing raw event streams with a causal spatial encoder and Mamba backbone, achieving real-time, high-accuracy object detection for high-speed UAVs.

Event camera, real-time detection, Mamba, state-space models, variable-rate scheduling, UAV perception

Problem

Existing event-camera detection methods either convert the stream into fixed-window intermediate representations (frames, voxel grids, point clouds), which adds window latency and dilutes the microsecond temporal resolution, or process events pointwise at a computational cost too high for real-time use.

Approach

CETUS directly processes asynchronous event streams using a lightweight causal spatial neighborhood encoder and a Mamba-based state-space model, coupled with an adaptive controller that dynamically adjusts processing speed based on the instantaneous event rate.
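
As a rough illustration of what "causal" means for a spatial neighborhood encoder, the sketch below builds a per-event feature from the most recent earlier events at nearby pixels, so no future event is ever read. It is not the paper's encoder: the function name, the exponential time-decay feature, and the parameters `radius` and `tau` are assumptions made for this example, and events are assumed to be time-sorted `(x, y, t, polarity)` tuples.

```python
import numpy as np

def causal_neighborhood_features(events, height, width, radius=1, tau=0.01):
    """Hypothetical helper: one feature vector per event, built only from the past.

    events: time-sorted iterable of (x, y, t, polarity), with t in seconds.
    Returns an array of shape (num_events, (2*radius + 1)**2).
    """
    k = 2 * radius + 1
    # Last-seen timestamp per pixel, padded by `radius`; -inf means "no event yet".
    last_t = np.full((height + 2 * radius, width + 2 * radius), -np.inf)
    feats = np.zeros((len(events), k * k), dtype=np.float64)
    for i, (x, y, t, _p) in enumerate(events):
        px, py = int(x) + radius, int(y) + radius  # coordinates in the padded grid
        patch = last_t[py - radius:py + radius + 1, px - radius:px + radius + 1]
        dt = t - patch  # +inf wherever the neighbor has not fired yet
        # Recent neighbors map to values near 1, old or missing ones to 0.
        feats[i] = np.where(np.isfinite(dt), np.exp(-dt / tau), 0.0).ravel()
        # Write the current event only after reading, which keeps the lookup causal.
        last_t[py, px] = t
    return feats
```

Because each event touches a fixed-size patch, the cost per event is constant, which is the property a lightweight encoder of this kind relies on.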

Key results

Eliminates fixed-window latency for millisecond-level detection
Introduces an event-rate-aware causal spatial neighborhood encoder
Implements a PID-based variable-rate inference controller
Achieves state-of-the-art accuracy and efficiency on EV-UAV with fewer parameters
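
The PID-based variable-rate controller listed above can be pictured with a minimal sketch like the following. The summary does not say which signal is controlled or how the error is defined, so everything here is an assumption: the controlled quantity is taken to be the number of events processed per step (`chunk`), the error is the gap between measured inference latency and the time the stream needs to fill one chunk, and the names `PID` and `next_chunk_size`, the gains, and the clamping bounds are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PID:
    """Textbook discrete PID controller."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, error: float, dt: float) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def next_chunk_size(pid, chunk, event_rate, infer_latency, dt, lo=64, hi=8192):
    """Hypothetical rule: pick how many events to process in the next step.

    event_rate is in events/second; infer_latency and dt are in seconds.
    """
    window_latency = chunk / event_rate      # time the stream needs to fill one chunk
    error = infer_latency - window_latency   # > 0: inference is slower than the stream
    # The PID output is in seconds; scale by the event rate to get a count of events.
    chunk = chunk + pid.update(error, dt) * event_rate
    return int(min(hi, max(lo, chunk)))
```

Under these assumptions, a burst of events shortens the window latency, the error turns positive, and the controller enlarges the chunk until the two latencies meet again; when the scene is quiet it shrinks the chunk so detections are not delayed waiting for events.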

Why it matters

Enables ultra-low-latency, real-time object detection for high-speed autonomous systems such as UAVs, where traditional frame-based or windowed event methods fail.

Abstract

Event cameras capture asynchronous pixel-level brightness changes with microsecond temporal resolution, offering unique advantages for high-speed vision tasks. Existing methods often convert event streams into intermediate representations such as frames, voxel grids, or point clouds, which inevitably require predefined time windows and thus introduce window latency. Meanwhile, pointwise detection methods are too computationally expensive to run in real time. To overcome these limitations, we propose the Variable-Rate Spatial Event Mamba, a novel architecture that directly processes raw event streams without intermediate representations. Our method introduces a lightweight causal spatial neighborhood encoder to efficiently capture local geometric relations, followed by Mamba-based state space models for scalable temporal modeling with linear complexity. During inference, a controller adaptively adjusts the processing speed according to the event rate, achieving an optimal balance between window latency and inference latency. GitHub: https://github.com/lianghanfang/CETUS
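
The "linear complexity" of state space models comes from a recurrence that carries a fixed-size hidden state from one input to the next. The sketch below shows that recurrence for a plain diagonal, time-invariant SSM, h_t = a * h_{t-1} + b * x_t and y_t = c * h_t. It is a simplification for illustration, not the paper's model: Mamba additionally makes the parameters depend on the input and uses a parallel scan, and the function name and argument shapes here are assumptions.

```python
import numpy as np

def diagonal_ssm_scan(x, a, b, c):
    """Run a diagonal linear state-space recurrence over a sequence.

    x: (T, D) input sequence; a, b, c: (D,) per-channel parameters.
    Cost is O(T * D): one constant-size state update per time step.
    """
    x = np.asarray(x, dtype=np.float64)
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]   # state update uses only the past state and current input
        y[t] = c * h           # readout
    return y
```

Since the state has a fixed size, each new event costs the same amount of work regardless of how long the stream has been running, which is what makes this family of models a natural fit for unbounded event streams.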

Index terms

Deep Learning for Visual Perception, Computer Vision for Automation, Visual Learning