ICRA 2026

State-Space Time Surfaces for Event-Based Zero-Shot Robotic Grasping and Scene Reconstruction

Gu GONG, David Navarro-Alarcon

AI summary

S3TS bridges the modality gap between event cameras and frozen vision-language models, enabling training-free zero-shot object detection and 3D reconstruction for robotics.

Event cameras, State-space models, Zero-shot detection, Robotic grasping, Scene reconstruction, Vision-language models

Problem

Event cameras generate sparse, asynchronous data that is fundamentally incompatible with dense RGB-trained vision-language models, blocking their use in zero-shot robotic perception and manipulation.

Approach

The authors introduce S3TS, a training-free representation that converts event streams into a three-channel pseudo-RGB image using a multi-scale diagonal state-space model with input-dependent selective decay, allowing direct feeding into a frozen OWLv2 detector.
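
The conversion described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `s3ts_sketch`, the time constants in `taus`, the number of time bins, and the particular form of the input-dependent step size are all assumptions chosen for illustration. Each channel is a leaky integrator with its own time constant (a diagonal state-space model), and pixels that receive events in a bin take a larger effective step, loosely mimicking Mamba-style selective decay. Event polarity is ignored here for simplicity.

```python
import numpy as np


def s3ts_sketch(events, height, width, taus=(0.01, 0.05, 0.25), n_steps=20, gain=1.0):
    """Turn an event stream into a 3-channel uint8 image (illustrative only).

    events: array-like of shape (N, 4) with columns (x, y, t, polarity),
            t in seconds. Polarity is ignored in this sketch.
    taus:   one decay time constant per output channel (assumed values).
    """
    events = np.asarray(events, dtype=np.float64).reshape(-1, 4)
    state = np.zeros((len(taus), height, width))
    if len(events) == 0:
        return state.transpose(1, 2, 0).astype(np.uint8)

    t0, t1 = events[:, 2].min(), events[:, 2].max()
    dt = max((t1 - t0) / n_steps, 1e-9)
    # Assign each event to a time bin; the last timestamp falls in the last bin.
    bins = np.minimum(((events[:, 2] - t0) / dt).astype(int), n_steps - 1)
    taus_arr = np.asarray(taus, dtype=np.float64).reshape(-1, 1, 1)

    for k in range(n_steps):
        sel = events[bins == k]
        # Per-pixel event count for this bin.
        u = np.zeros((height, width))
        np.add.at(u, (sel[:, 1].astype(int), sel[:, 0].astype(int)), 1.0)

        # Input-dependent step size (assumption): more events -> larger step.
        delta = dt * (1.0 + gain * u)
        # Zero-order-hold discretization of dh/dt = (-h + x) / tau, per channel.
        a = np.exp(-delta / taus_arr)
        state = a * state + (1.0 - a) * (u > 0)

    # The state stays in [0, 1]; scale to 8-bit and move channels last.
    return (255.0 * np.clip(state, 0.0, 1.0)).transpose(1, 2, 0).astype(np.uint8)


if __name__ == "__main__":
    demo = [[2, 1, 0.0, 1], [2, 1, 0.1, 1], [0, 0, 0.0, 1]]
    img = s3ts_sketch(demo, height=4, width=5)
    print(img.shape, img.dtype)
```

The resulting height x width x 3 array has the same layout as an ordinary RGB image, which is what would let it be passed to an off-the-shelf detector without retraining. Short time constants emphasize the most recent motion, while long ones retain older structure.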

Key results

Detects over twice as many objects as single-channel event representations
Achieves highest top-1 confidence (0.486) in zero-shot text-prompted detection
Enables depth-free robotic grasping with 5–15 mm XY error reduction via near-nadir refinement
Produces dense 3D temporal meshes (29k vertices) via multi-view TSDF fusion

Why it matters

Provides a practical, theoretically grounded pathway for deploying event cameras in open-vocabulary robotic manipulation without requiring model training or depth sensors.

Abstract

Event cameras report per-pixel brightness changes asynchronously with microsecond latency, but their output is incompatible with vision foundation models trained on conventional images. We propose State-Space Time Surfaces (S3TS), a training-free representation that recasts exponential-decay time surfaces as a diagonal state-space model with multi-scale temporal channels and input-dependent selective decay inspired by Mamba. The resulting three-channel pseudo-RGB image is fed directly to a frozen OWLv2 detector for zero-shot, text-prompted object detection from events alone. We demonstrate two applications on a simulated 6-DOF manipulator: (i) event-only grasping with near-nadir refinement that localizes objects without any depth sensor, and (ii) dense 3D scene reconstruction via multi-view TSDF fusion with neuromorphic per-vertex surface descriptors. S3TS detects over twice as many objects as single-channel event representations and produces faithful 3D workspace meshes, all without network training or fine-tuning.
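
One plausible reading of how an exponential-decay time surface becomes a diagonal state-space model is sketched below. The symbols ($S$, $h$, $u$, $\tau_c$, $\Delta_k$, $\bar a$, $\bar b$) are notation assumed here for illustration; the paper's exact parameterization is not given in this summary.

```latex
% Classical time surface at one pixel, with last event time t_last:
S(t) = \exp\!\left(-\frac{t - t_{\text{last}}}{\tau}\right),
\qquad
\frac{dS}{dt} = -\frac{1}{\tau}\, S \quad \text{between events.}

% Discretizing with step \Delta and an event indicator u_k \in \{0, 1\}
% gives a scalar linear recurrence:
h_k = \bar a\, h_{k-1} + \bar b\, u_k,
\qquad
\bar a = \exp\!\left(-\frac{\Delta}{\tau}\right).

% Multi-scale: one independent state per time constant \tau_c, c = 1, 2, 3,
% i.e. a diagonal state matrix
A = \operatorname{diag}\!\left(-\tfrac{1}{\tau_1},\, -\tfrac{1}{\tau_2},\, -\tfrac{1}{\tau_3}\right).

% Selective decay (assumed form): the step size depends on the input,
\Delta_k = \Delta(u_k),
\qquad
\bar a_{c,k} = \exp\!\left(-\frac{\Delta_k}{\tau_c}\right).
```

With $\bar b = 1$ and a single event at step $k_0$, the recurrence gives $h_k = \bar a^{\,k - k_0}$, which matches $S(t)$ sampled on the grid. With several events the linear recurrence accumulates rather than resetting, and choosing $\bar b = 1 - \bar a$ keeps the state within $[0, 1]$.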

Index terms

Sensor-based Control