State-Space Time Surfaces for Event-Based Zero-Shot Robotic Grasping and Scene Reconstruction
Gu GONG, David Navarro-Alarcon
AI summary
Problem
Event cameras produce sparse, asynchronous data that is incompatible with vision-language models trained on dense RGB images, which blocks their use in zero-shot robotic perception and manipulation.
Approach
The authors introduce S3TS, a training-free representation that converts event streams into a three-channel pseudo-RGB image using a multi-scale diagonal state-space model with input-dependent selective decay, allowing direct feeding into a frozen OWLv2 detector.
Key results
- Detects over twice as many objects as single-channel event representations
- Achieves the highest top-1 confidence (0.486) in zero-shot text-prompted detection
- Enables depth-free robotic grasping, with near-nadir refinement reducing XY error by 5–15 mm
- Produces dense 3D temporal meshes (29k vertices) via multi-view TSDF fusion
Why it matters
Provides a practical, theoretically grounded pathway for deploying event cameras in open-vocabulary robotic manipulation without requiring model training or depth sensors.
Abstract
Event cameras report per-pixel brightness changes asynchronously with microsecond latency, but their output is incompatible with vision foundation models trained on conventional images. We propose State-Space Time Surfaces (S3TS), a training-free representation that recasts exponential-decay time surfaces as a diagonal state-space model with multi-scale temporal channels and input-dependent selective decay inspired by Mamba. The resulting three-channel pseudo-RGB image is fed directly to a frozen OWLv2 detector for zero-shot, text-prompted object detection from events alone. We demonstrate two applications on a simulated 6-DOF manipulator: (i) event-only grasping with near-nadir refinement that localizes objects without any depth sensor, and (ii) dense 3D scene reconstruction via multi-view TSDF fusion with neuromorphic per-vertex surface descriptors. S3TS detects over twice as many objects as single-channel event representations and produces faithful 3D workspace meshes, all without network training or fine-tuning.
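To make the idea of a time surface as a diagonal state-space model concrete, the following is a minimal sketch and not the authors' implementation. It assumes events arrive as (t, x, y, p) rows, bins them into fixed time steps, and updates each pixel independently with a zero-order-hold recurrence h ← a·h + (1 − a)·u, where a = exp(−Δ/τ). The function name `s3ts_sketch`, the three time constants, the bin width, and the selection rule Δ = bin_dt·(1 + gain·u) are all hypothetical choices made for illustration; the abstract does not specify them, and polarity is ignored here.

```python
import numpy as np


def s3ts_sketch(events, height, width, taus=(0.01, 0.05, 0.25),
                bin_dt=0.005, gain=1.0):
    """Illustrative multi-scale time surface (hypothetical parameters).

    events: rows of (t, x, y, p), sorted by t in seconds; polarity p is ignored.
    Returns a float array of shape (height, width, 3) with values in [0, 1].
    """
    events = np.asarray(events, dtype=np.float64).reshape(-1, 4)
    taus = np.asarray(taus, dtype=np.float64).reshape(-1, 1, 1)
    # One state per (channel, pixel): the state matrix is diagonal,
    # so every pixel and every time scale evolves independently.
    state = np.zeros((taus.shape[0], height, width))
    if events.shape[0] == 0:
        return state.transpose(1, 2, 0)

    t0 = events[0, 0]
    n_bins = int((events[-1, 0] - t0) // bin_dt) + 1
    bin_idx = np.minimum(((events[:, 0] - t0) // bin_dt).astype(int), n_bins - 1)
    xs = events[:, 1].astype(int)
    ys = events[:, 2].astype(int)

    for k in range(n_bins):
        in_bin = bin_idx == k
        # Input u: 1 where at least one event fired in this bin, else 0.
        u = np.zeros((height, width))
        u[ys[in_bin], xs[in_bin]] = 1.0
        # Assumed selection rule: active pixels take a larger step,
        # so they forget old state faster and absorb the new input more.
        delta = bin_dt * (1.0 + gain * u)
        a = np.exp(-delta / taus)            # per-channel, per-pixel decay
        state = a * state + (1.0 - a) * u    # zero-order-hold update

    # Per-channel peak normalization so each channel spans [0, 1].
    peak = state.max(axis=(1, 2), keepdims=True)
    state = state / np.where(peak > 0, peak, 1.0)
    return state.transpose(1, 2, 0)
```

Under these assumptions, a pseudo-RGB image for a frozen detector could be obtained with `(255 * s3ts_sketch(events, H, W)).astype(np.uint8)`. The short time constant highlights only the most recent motion, while the long one retains older structure, which is the intuition behind using several temporal scales as separate channels.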