ICRA 2026

Hyper-STTN: Hypergraph Augmented Spatial-Temporal Transformer for Trajectory Prediction

Weizheng Wang, Baijian Yang, Sungeun Hong, Wenhai Sun, Byung-Cheol Min

AI summary

Hyper-STTN outperforms state-of-the-art methods by jointly modeling pairwise and groupwise social interactions through multiscale hypergraphs and multimodal transformer fusion.

Trajectory Prediction, Hypergraph Neural Networks, Spatial-Temporal Transformer, Crowd Dynamics, Multimodal Fusion, Human-Human Interaction

Problem

Accurately forecasting human trajectories in crowds remains difficult due to complex pairwise spatial-temporal interactions and heterogeneous groupwise dynamics that existing models fail to jointly capture and align.

Approach

The method builds multiscale hypergraphs using Mahalanobis distance to model groupwise correlations, while a spatial-temporal transformer captures pairwise interactions. A multimodal transformer then aligns these heterogeneous features before trajectory decoding.
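
As a rough illustration of the hypergraph construction described here, the sketch below builds one hyperedge per pedestrian and per group size from Mahalanobis-distance nearest neighbours. It is a minimal sketch under assumptions, not the authors' implementation: the function name `mahalanobis_knn_hypergraph`, the `group_sizes` parameter, the use of the sample covariance of the input features, and the choice of one hyperedge per node are all hypothetical.

```python
import numpy as np


def mahalanobis_knn_hypergraph(features, group_sizes=(2, 3)):
    """Build a multiscale incidence matrix H (hypothetical helper).

    features: array of shape (n_nodes, n_dims), e.g. positions or velocities.
    Returns H of shape (n_nodes, n_nodes * len(group_sizes)); column j of each
    block is the hyperedge centred on node j for that group size.
    Assumes at least two nodes so the sample covariance is defined.
    """
    x = np.asarray(features, dtype=float)
    n = x.shape[0]

    # Inverse covariance of the features; pinv tolerates a singular covariance.
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    cov_inv = np.linalg.pinv(cov)

    # Squared Mahalanobis distance between every pair of nodes.
    diff = x[:, None, :] - x[None, :, :]
    d2 = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)

    # Force each node to rank first in its own neighbour list.
    np.fill_diagonal(d2, -1.0)
    order = np.argsort(d2, axis=1, kind="stable")

    blocks = []
    for k in group_sizes:
        k = min(k, n)  # a group cannot be larger than the crowd
        h = np.zeros((n, n))
        for centre in range(n):
            # Hyperedge = the centre node plus its k - 1 nearest neighbours.
            h[order[centre, :k], centre] = 1.0
        blocks.append(h)
    return np.concatenate(blocks, axis=1)


points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.2], [0.0, 0.2]])
H = mahalanobis_knn_hypergraph(points, group_sizes=(2, 3))
```

Stacking the per-scale blocks side by side gives a single incidence matrix in which small and large groups coexist, which is one simple way to read "multiscale" here.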

Key results

Jointly models groupwise and pairwise social interactions across spatial-temporal domains
Introduces a multimodal transformer to align heterogeneous interaction features
Constructs context-aware multiscale hypergraphs via Mahalanobis distance-based KNN
Consistently outperforms state-of-the-art baselines on public pedestrian trajectory datasets
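
The alignment of heterogeneous features named in the list above can be pictured with scaled dot-product cross-attention, where one feature stream supplies the queries and the other supplies keys and values. This is a generic sketch of that mechanism; the summary does not state the exact form of the paper's multimodal transformer, and the names `cross_attention`, `pairwise_feats`, `groupwise_feats`, and the random projection matrices are illustrative assumptions.

```python
import numpy as np


def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, wq, wk, wv):
    """Single-head cross-attention: queries attend over the context stream."""
    q = query_feats @ wq
    k = context_feats @ wk
    v = context_feats @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
pairwise_feats = rng.normal(size=(4, 8))   # hypothetical: 4 pedestrians, 8-dim features
groupwise_feats = rng.normal(size=(4, 8))  # hypothetical hypergraph features
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
aligned, attn = cross_attention(pairwise_feats, groupwise_feats, wq, wk, wv)
```

In a two-stream setup the same call would typically be repeated with the roles swapped, so that each stream is re-expressed in terms of the other before fusion.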

Why it matters

Enables more reliable crowd behavior forecasting for safety-critical applications like autonomous driving and social robotics.

Abstract

Predicting crowd intentions and trajectories is critical for a range of real-world applications, including social robotics and autonomous driving. Accurately modeling such behavior remains challenging due to the complexity of pairwise spatial-temporal interactions and the heterogeneous influence of groupwise dynamics. To address these challenges, we propose Hyper-STTN, a Hypergraph-augmented Spatial-Temporal Transformer Network for crowd trajectory prediction. Hyper-STTN constructs crowd hypergraphs with multiscale group sizes to model groupwise correlations, which are captured through spectral hypergraph convolution based on hypergraph random walk. In parallel, a spatial-temporal transformer is employed to learn pedestrians’ pairwise latent interactions across multimodal dimensions. Finally, these heterogeneous groupwise and pairwise features are fused and aligned via a multimodal transformer. Extensive experiments on public pedestrian motion datasets demonstrate that Hyper-STTN consistently outperforms state-of-the-art baselines and ablation models. The project website is available at https://sites.google.com/view/hypersttn.
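
For readers unfamiliar with spectral hypergraph convolution, the sketch below shows one widely used formulation: features are propagated with the normalised operator D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}, which arises from a random walk on the hypergraph, and then passed through a learnable projection and ReLU. Whether Hyper-STTN uses exactly this operator is an assumption; the function name `hypergraph_conv` and its arguments are illustrative.

```python
import numpy as np


def hypergraph_conv(x, h, theta, edge_weights=None):
    """One hypergraph convolution layer (illustrative sketch).

    x: node features, shape (n_nodes, d_in)
    h: incidence matrix, shape (n_nodes, n_edges), h[i, e] = 1 if node i is in edge e
    theta: projection matrix, shape (d_in, d_out)
    edge_weights: optional hyperedge weights, shape (n_edges,)
    """
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    if edge_weights is None:
        w = np.ones(h.shape[1])
    else:
        w = np.asarray(edge_weights, dtype=float)

    dv = h @ w          # vertex degree: weighted number of hyperedges per node
    de = h.sum(axis=0)  # hyperedge degree: number of nodes per hyperedge

    # Safe inverses: isolated nodes and empty hyperedges contribute nothing.
    dv_inv_sqrt = np.where(dv > 0, 1.0 / np.sqrt(np.where(dv > 0, dv, 1.0)), 0.0)
    de_inv = np.where(de > 0, 1.0 / np.where(de > 0, de, 1.0), 0.0)

    # D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}
    left = dv_inv_sqrt[:, None] * h * (w * de_inv)[None, :]
    op = left @ (h.T * dv_inv_sqrt[None, :])

    return np.maximum(op @ x @ theta, 0.0)  # ReLU activation


# One hyperedge containing all three nodes: each output is the group mean.
node_feats = np.array([[1.0], [2.0], [3.0]])
incidence = np.ones((3, 1))
out = hypergraph_conv(node_feats, incidence, theta=np.array([[1.0]]))
```

With a single hyperedge covering every node, the operator reduces to averaging, which is a quick sanity check that information flows node to hyperedge and back to node.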

Index terms

Human Detection and Tracking, Intention Recognition