ICRA 2026

SSQA: Sibling-Selective Quadtree Attention for Hierarchical Modeling in Perception Tasks

Yufan Chen, Arnav Bali, Angela Liu, Laura Zheng, Ming C. Lin

AI summary

SSQA achieves linear-time hierarchical attention that matches or improves performance over state-of-the-art models while drastically reducing computational cost.

Quadtree Attention, Linear-Time Attention, Hierarchical Modeling, Autonomous Driving, Efficient Vision Transformers, Sibling-Selective Attention

Problem

Standard vision transformers incur attention costs that grow quadratically with the number of tokens, while efficient attention alternatives trade modeling capacity for speed. Both are a poor fit for structured perception tasks in robotics and autonomous driving.

Approach

The authors propose Sibling-Selective Quadtree Attention (SSQA), which organizes image tokens into a fixed 2×2 quadtree hierarchy, applies constant-size sibling attention with soft top-k filtering, and uses bottom-up aggregation with cross-injection fusion to preserve global context efficiently.
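The hierarchy described above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `sibling_attention` and `build_pyramid` are hypothetical, queries, keys, and values are the raw tokens with no learned projections, parents are formed by mean pooling, and the soft top-k filtering and cross-injection fusion steps are omitted because the summary does not specify them. It also assumes a square token grid whose side is a power of two.

```python
import numpy as np


def sibling_attention(x):
    """Softmax attention restricted to each 2x2 sibling group.

    x: (H, W, C) token grid with even H and W.
    Returns an array of shape (H/2, W/2, 4, C): attended tokens per group.
    """
    H, W, C = x.shape
    if H % 2 or W % 2:
        raise ValueError("grid sides must be even")
    # Gather the four siblings of every 2x2 block: (H/2, W/2, 4, C).
    g = (x.reshape(H // 2, 2, W // 2, 2, C)
          .transpose(0, 2, 1, 3, 4)
          .reshape(H // 2, W // 2, 4, C))
    # Constant-size 4x4 score matrix per group (assumption: no projections).
    scores = g @ g.transpose(0, 1, 3, 2) / np.sqrt(C)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ g


def build_pyramid(x):
    """Bottom-up pass: attend among siblings, then pool them into a parent.

    Returns the list of parent grids (fine to coarse) and the total number
    of pairwise attention scores computed across all levels.
    """
    levels, n_scores = [], 0
    while x.shape[0] > 1:
        attended = sibling_attention(x)
        n_scores += attended.shape[0] * attended.shape[1] * 16
        x = attended.mean(axis=2)  # assumption: mean pooling forms the parent
        levels.append(x)
    return levels, n_scores


levels, n_scores = build_pyramid(np.random.default_rng(0).normal(size=(8, 8, 16)))
print([lvl.shape for lvl in levels], n_scores)
```

For an 8×8 grid (64 tokens) this sketch computes 336 pairwise scores across three levels, whereas dense attention over the same tokens would compute 64² = 4096.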

Key results

Linear-time complexity with constant-size sibling interactions
Cross-Injection Fusion mechanism preserves suppressed hierarchical features
Plug-in guidance variants for detector and Fourier-based spatial refinement
Highest average score on CARLA driving benchmark across 14 scenarios
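
The linear-time claim can be motivated with a short count. This is a sketch under assumptions, not the paper's own analysis: it supposes N = 4^L tokens arranged as a full quadtree with L levels, and that each sibling group costs a constant 4 × 4 = 16 pairwise scores; the symbols N, L, and ℓ are introduced here for illustration.

```latex
\[
  \underbrace{\sum_{\ell=1}^{L} 16 \cdot \frac{N}{4^{\ell}}}_{\text{sibling attention, all levels}}
  \;=\; \frac{16N}{3}\left(1 - 4^{-L}\right)
  \;<\; \frac{16N}{3}
  \;=\; O(N),
  \qquad\text{versus } N^{2} \text{ scores for dense attention.}
\]
```

The geometric series is what keeps the total linear: each coarser level has a quarter as many groups as the one below it.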

Why it matters

It provides a practical, compute-efficient alternative to quadratic attention for robotics and autonomous driving applications where real-time performance and structured spatial reasoning are critical.

Abstract

Perception tasks for navigation in robotics, including aerial platforms such as drones and autonomous driving systems, are inherently structured. Drone-mounted cameras typically capture sky above, terrain below, and obstacles or man-made structures in between, while driving data often contains organized road layouts, lane markings, and surrounding agents. Motivated by these axis-aligned structural priors, we note that such information is typically more structured than in generic image tasks. We hypothesize that processing information in a quadtree-esque manner can not only model features effectively and hierarchically, but also offer an efficient linear-time alternative to vanilla attention mechanisms, which run in quadratic time. In this paper, we propose Sibling-Selective Quadtree Attention (SSQA), which models image tokens hierarchically as a structured, full quadtree. We provide an analytical complexity analysis that guarantees linear-time feature modeling, in addition to empirical experiments comparing inference speeds with other popular modeling approaches, such as Mamba 2 and Quadtree Attention. Our results, benchmarked across several tasks, show that we achieve results at least as good as, if not notably better than, those of other approaches at a fraction of the computational cost.

Index terms

AI-Based Methods, Computer Vision for Automation, RGB-D Perception