ICRA 2026

Flexible and Efficient Spatio-Temporal Transformer for Sequential Visual Place Recognition

Yu Kiu (Idan) Lau, Chao Chen, Ge Jin, Chen Feng

AI summary

A recurrent deformable transformer encoder supports variable sequence lengths in sequential visual place recognition, cutting sequence extraction time by 36% and memory usage by 35% while improving recall by up to 17%.
Visual Place Recognition, Spatio-Temporal Modeling, Recurrent Transformer, Deformable Attention, Real-Time Inference, Robotics

Problem

Existing transformer-based sequential visual place recognition models prioritize performance over flexibility and efficiency, often requiring fixed sequence lengths and incurring high computational costs. This limits their practical deployment in real-time robotic and autonomous systems.

Approach

The authors propose Adapt-STformer, built around a novel Recurrent Deformable Transformer Encoder (Recurrent-DTE) that iteratively fuses spatio-temporal features across the frames of a sequence. Because the fusion is recurrent rather than tied to a fixed number of frames, the design supports variable sequence lengths while reducing memory usage and inference time.
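To make the idea of recurrent fusion concrete, here is a minimal sketch of why an iterative design is indifferent to sequence length. It is not the authors' Recurrent-DTE: it assumes plain single-head dot-product attention in place of deformable attention, and the names `attend` and `fuse_sequence`, the mean-pooled initial state, and the 0.5/0.5 update rule are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Single-head dot-product attention of one query vector over token vectors.

    Stand-in for deformable attention (an assumption of this sketch).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    return [
        sum(w * tok[d] for w, tok in zip(weights, tokens))
        for d in range(len(query))
    ]


def fuse_sequence(frames):
    """Fuse a sequence of frames into one fixed-size descriptor.

    frames: list of frames; each frame is a non-empty list of token vectors
    that all share the same dimension. The number of frames may vary.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0][0])
    # Hypothetical initial state: mean of the first frame's tokens.
    state = [sum(tok[d] for tok in frames[0]) / len(frames[0]) for d in range(dim)]
    # One update per frame: the state size never depends on the sequence length.
    for frame in frames:
        context = attend(state, frame)
        state = [0.5 * s + 0.5 * c for s, c in zip(state, context)]
    return state
```

Calling `fuse_sequence` on 3 frames or on 5 frames returns a descriptor of the same dimension, and only one frame's tokens plus the running state are needed at each step, which is the property the recurrent design relies on.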

Key results

  • Up to 17% higher recall on the Nordland, Oxford, and NuScenes datasets
  • 36% reduction in sequence extraction time relative to the best comparable baseline
  • 35% lower memory usage relative to the best comparable baseline
  • Native support for arbitrary sequence lengths
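For readers unfamiliar with the recall figures, the sketch below shows the Recall@K definition commonly used in place recognition: a query counts as correct if at least one true match appears among its top K retrieved database entries. The summary does not state which K or which match tolerance the paper uses, so the function name `recall_at_k` and the toy data are illustrative assumptions only.

```python
def recall_at_k(ranked_ids, positives, k):
    """Fraction of queries with at least one positive in their top-k results.

    ranked_ids: per query, a list of database ids sorted best match first.
    positives:  per query, the set of database ids that count as correct.
    """
    if len(ranked_ids) != len(positives):
        raise ValueError("ranked_ids and positives must have the same length")
    if not ranked_ids:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, pos in zip(ranked_ids, positives)
        if any(db_id in pos for db_id in ranked[:k])
    )
    return hits / len(ranked_ids)


# Toy example with two queries (made-up ids, not results from the paper).
ranked = [[3, 1, 2], [0, 2, 1]]
truth = [{1}, {1}]
r1 = recall_at_k(ranked, truth, 1)
r2 = recall_at_k(ranked, truth, 2)
r3 = recall_at_k(ranked, truth, 3)
```

In the toy example neither query has its match at rank 1, only the first query has it within the top 2, and both have it within the top 3.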

Why it matters

Enables real-time, resource-constrained visual place recognition for robots and autonomous vehicles operating in dynamic environments.

Abstract

Sequential Visual Place Recognition (Seq-VPR) leverages transformers to capture spatio-temporal features effectively. In practice, a transformer-based Seq-VPR model should be flexible to the number of frames per sequence (seq-length), deliver fast inference, and have low memory usage to meet real-time constraints. However, existing approaches prioritize performance at the expense of flexibility and efficiency. To address this gap, we propose Adapt-STformer, a Seq-VPR method built around our novel Recurrent Deformable Transformer Encoder (Recurrent-DTE), which uses an iterative recurrent mechanism to fuse information from multiple sequential frames. This design naturally supports variable seq-lengths, fast inference, and low memory usage. Experiments on the Nordland, Oxford, and NuScenes datasets show that Adapt-STformer boosts recall by up to 17% while reducing sequence extraction time by 36% and lowering memory usage by 35% relative to our best comparable baseline. Our code is released at https://ai4ce.github.io/Adapt-STFormer/.

Index terms

Deep Learning for Visual Perception, Recognition, Visual Learning