ICRA 2026

ST-HNet: A CNN-LSM Hybrid Architecture for Spatio-Temporal Feature Learning in Event-Based Visual Place Recognition

Xun Xiao, Shasha Guo, Tie Junbo, Jingyue Zhao, Ziqi Wang, Yuan Li, Jingzhuo Yuan, Qiang Dou, Lei Wang

AI summary

ST-HNet improves event-based visual place recognition accuracy by combining a compact binary event representation with a hybrid CNN-LSM network and a soft-margin triplet loss.

Event-based VPR, CNN-LSM hybrid, BBVG representation, Spiking neural networks, Soft-margin loss, Autonomous navigation

Problem

The sparse and asynchronous nature of dynamic vision sensor event streams complicates compact representation, spatio-temporal feature extraction, and robust place embedding for visual place recognition.

Approach

ST-HNet converts raw events into a compact Bipolar Binary Voxel Grid, extracts spatial features via a CNN, aggregates temporal dynamics with a Liquid State Machine, and optimizes learning using a soft-margin triplet loss.
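
The first step of this pipeline, turning an event stream into a binary voxel grid, can be illustrated with a minimal sketch. The layout below is an assumption, not the paper's definition of BBVG: it reads "bipolar" as one channel per event polarity and "binary" as one bit per cell recording whether any event occurred. The function name `binary_voxel_grid` and the `(t, x, y, polarity)` event tuple are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical event format: (timestamp, x, y, polarity), polarity in {+1, -1}.
Event = Tuple[float, int, int, int]


def binary_voxel_grid(
    events: Sequence[Event],
    t_start: float,
    t_end: float,
    num_bins: int,
    height: int,
    width: int,
) -> List[List[List[List[int]]]]:
    """Return grid[channel][bin][y][x], set to 1 where at least one event fell.

    Channel 0 holds positive-polarity events, channel 1 negative-polarity ones.
    This layout is an illustrative assumption, not the paper's BBVG definition.
    """
    duration = t_end - t_start
    if duration <= 0 or num_bins <= 0:
        raise ValueError("need t_end > t_start and num_bins > 0")

    grid = [
        [[[0] * width for _ in range(height)] for _ in range(num_bins)]
        for _ in range(2)
    ]
    for t, x, y, polarity in events:
        if not (t_start <= t < t_end):
            continue  # ignore events outside the time window
        # Map the timestamp to a temporal bin index in [0, num_bins - 1].
        b = min(int((t - t_start) / duration * num_bins), num_bins - 1)
        channel = 0 if polarity > 0 else 1
        grid[channel][b][y][x] = 1  # binary: repeated events do not accumulate
    return grid


events = [(0.0, 1, 0, 1), (0.1, 1, 0, 1), (0.6, 0, 1, -1), (1.0, 0, 0, 1)]
grid = binary_voxel_grid(events, 0.0, 1.0, num_bins=2, height=2, width=2)
active_cells = sum(v for ch in grid for b in ch for row in b for v in row)
```

Because each cell is a single bit, such a grid can be packed far more tightly than a count-based or float-valued voxel grid, which is one way a binary representation can save memory.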

Key results

Novel Bipolar Binary Voxel Grid (BBVG) for memory-efficient event encoding
CNN-LSM hybrid network for robust spatio-temporal feature extraction
Soft-margin triplet loss to handle continuous spatial transitions
Recall@1 improvements of 11% and 23% over state-of-the-art approaches on the Brisbane-Event-VPR and DDD20 datasets, respectively
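
For readers unfamiliar with the metric, Recall@1 can be sketched as follows: a query counts as correct when its single nearest reference embedding is a true match. The function name `recall_at_1`, the use of Euclidean distance, and the per-query sets of ground-truth reference indices are assumptions for illustration; the paper's exact matching tolerance is not stated here.

```python
import math
from typing import List, Sequence, Set


def recall_at_1(
    query_embs: Sequence[Sequence[float]],
    ref_embs: Sequence[Sequence[float]],
    positives: Sequence[Set[int]],
) -> float:
    """Fraction of queries whose nearest reference is a ground-truth match.

    positives[i] is the set of reference indices accepted as correct for
    query i (for example, references within some distance of the query's
    true location; that tolerance is an assumption of this sketch).
    """
    hits = 0
    for q_idx, q in enumerate(query_embs):
        # Index of the reference embedding closest to the query.
        nearest = min(
            range(len(ref_embs)), key=lambda r: math.dist(q, ref_embs[r])
        )
        if nearest in positives[q_idx]:
            hits += 1
    return hits / len(query_embs)


refs: List[List[float]] = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
queries: List[List[float]] = [[0.1, 0.0], [0.9, 0.1], [0.4, 0.6]]
truth = [{0}, {1}, {0}]  # the third query's nearest reference is index 2: a miss
score = recall_at_1(queries, refs, truth)
```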

Why it matters

Enables more reliable loop closure and localization for autonomous robots operating in challenging lighting or high-motion environments.

Abstract

Visual Place Recognition (VPR) based on Dynamic Vision Sensors (DVSs) has gained attention due to their high temporal resolution and robustness under challenging lighting conditions. However, the sparse and asynchronous event stream output of DVS introduces unique challenges for effective VPR. In this paper, we propose ST-HNet, a novel framework for VPR that introduces improvements in event representation, spatio-temporal feature extraction, and loss design. Specifically, we introduce a compact and efficient event representation called Bipolar Binary Voxel Grid (BBVG). Then, we propose a hybrid feature extractor that combines a Convolutional Neural Network (CNN) for spatial encoding and a Liquid State Machine (LSM) for temporal aggregation. We refer to this combination as a CNN-LSM hybrid architecture. Moreover, we introduce a soft-margin triplet loss to better accommodate the gradual transitions between nearby locations in the event-based VPR task. Extensive experiments conducted on the Brisbane-Event-VPR and DDD20 datasets demonstrate that our method outperforms state-of-the-art approaches, achieving improvements of 11% and 23% in Recall@1 performance, respectively.
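
To make the loss design concrete, here is a minimal sketch contrasting a hard-margin triplet loss with a commonly used soft-margin form, log(1 + exp(d_ap - d_an)). This softplus variant is an assumption: the abstract does not give the paper's exact formulation, and the function names and the margin value of 0.5 are hypothetical.

```python
import math
from typing import Sequence


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def hard_margin_triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.5,  # hypothetical value, for illustration only
) -> float:
    """Hinge form: exactly zero once the negative is farther by `margin`."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def soft_margin_triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
) -> float:
    """Softplus form: decays smoothly and never reaches exactly zero."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return softplus(d_ap - d_an)


a, p, n = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]  # d_ap = 1, d_an = 5
hard = hard_margin_triplet_loss(a, p, n)
soft = soft_margin_triplet_loss(a, p, n)
```

The hinge form imposes a fixed cutoff between matches and non-matches, whereas the smooth form keeps a small, graded penalty. A graded penalty of this kind is one way a loss can accommodate places that change gradually along a route rather than at sharp boundaries.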

Index terms

Deep Learning for Visual Perception; Localization; Visual Learning