SSQA: Sibling-Selective Quadtree Attention for Hierarchical Modeling in Perception Tasks
Yufan Chen, Arnav Bali, Angela Liu, Laura Zheng, Ming C. Lin
AI summary
Problem
Standard vision transformers incur quadratic computational cost, while efficient attention alternatives trade modeling capacity for speed. Both are a poor fit for structured perception tasks in robotics and autonomous driving.
Approach
The authors propose Sibling-Selective Quadtree Attention (SSQA), which organizes image tokens into a fixed 2×2 quadtree hierarchy, applies constant-size sibling attention with soft top-k filtering, and uses bottom-up aggregation with cross-injection fusion to preserve global context efficiently.
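To make the description above concrete, here is a minimal NumPy sketch of sibling-restricted attention over a 2×2 quadtree with bottom-up aggregation. It is an illustration under stated assumptions, not the authors' implementation. The function names (`group_siblings`, `sibling_attention`, `quadtree_pass`), the sigmoid form of the soft top-k gate, the `temperature` parameter, and mean pooling as the aggregation step are all hypothetical choices. Cross-injection fusion is not modeled, and the grid is assumed to be square with a power-of-two side.

```python
import numpy as np

def group_siblings(x):
    """(H, W, C) grid -> (H//2, W//2, 4, C): each 2x2 block becomes one sibling group."""
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4, C)

def sibling_attention(g, k=2, temperature=0.5):
    """Attention restricted to the 4 siblings of each group.

    The soft top-k gate below is an assumed form: a sigmoid centred on each
    query's k-th largest score, so lower-ranked siblings are down-weighted
    rather than hard-masked.
    """
    C = g.shape[-1]
    scores = g @ g.swapaxes(-1, -2) / np.sqrt(C)            # (..., 4, 4)
    kth = np.sort(scores, axis=-1)[..., -k][..., None]      # k-th largest per query
    gate = 1.0 / (1.0 + np.exp(-(scores - kth) / temperature))
    w = np.exp(scores - scores.max(axis=-1, keepdims=True)) * gate
    w = w / w.sum(axis=-1, keepdims=True)                   # renormalise gated weights
    return w @ g                                            # (..., 4, C)

def quadtree_pass(x, k=2):
    """Bottom-up pass: attend among siblings, then pool each group into its parent."""
    levels = []
    while x.shape[0] > 1:
        g = sibling_attention(group_siblings(x), k=k)
        x = g.mean(axis=2)  # parent token per group (mean pooling is an assumption)
        levels.append(x)
    return levels

if __name__ == "__main__":
    tokens = np.random.default_rng(0).normal(size=(8, 8, 16))
    for level in quadtree_pass(tokens):
        print(level.shape)
```

Every attention call in this sketch operates on a fixed 4×4 score matrix regardless of image size, which is the property the constant-size sibling interaction relies on.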
Key results
- Linear-time complexity with constant-size sibling interactions
- Cross-Injection Fusion mechanism preserves suppressed hierarchical features
- Plug-in guidance variants for detector and Fourier-based spatial refinement
- Highest average score on CARLA driving benchmark across 14 scenarios
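The linear-time claim in the first bullet can be checked with a simple counting argument. The sketch below assumes that every level of a full quadtree runs one 4×4 attention per sibling group; the authors' own analysis may count costs differently. The function names are hypothetical. For N leaf tokens the per-level group counts form a geometric series (N/4 + N/16 + ... + 1), so the total is 16(N - 1)/3 query-key pairs, compared with N² for dense attention.

```python
def sibling_pair_count(side):
    """Query-key pairs evaluated by 4-way sibling attention over a full quadtree.

    Assumes a square grid whose side is a power of two.
    """
    total = 0
    n = side * side
    while n > 1:
        total += (n // 4) * 16  # n/4 sibling groups, each with a 4x4 score matrix
        n //= 4                 # each group collapses into one parent token
    return total

def dense_pair_count(side):
    """Query-key pairs evaluated by dense (vanilla) attention on the same grid."""
    return (side * side) ** 2

if __name__ == "__main__":
    for side in (8, 16, 32, 64):
        print(side * side, sibling_pair_count(side), dense_pair_count(side))
```

Quadrupling the number of tokens roughly quadruples the sibling count, while the dense count grows by a factor of sixteen.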
Why it matters
It provides a practical, compute-efficient alternative to quadratic attention for robotics and autonomous driving applications where real-time performance and structured spatial reasoning are critical.
Abstract
Perception tasks for navigation in robotics, including aerial platforms such as drones and autonomous driving systems, are inherently structured. Drone-mounted cameras typically capture sky above, terrain below, and obstacles or man-made structures in between, while driving data often contains organized road layouts, lane markings, and surrounding agents. Motivated by these axis-aligned structural priors, we note that such information is typically more structured than in generic image tasks. We hypothesize that processing information in a quadtree-like manner not only models features effectively through a hierarchy, but also offers an efficient linear-time alternative to vanilla attention mechanisms, which run in quadratic time. In this paper, we propose Sibling-Selective Quadtree Attention (SSQA), which models image tokens hierarchically as a structured, full quadtree. We provide a complexity analysis that guarantees linear-time feature modeling, along with empirical experiments comparing inference speeds with other popular modeling approaches, such as Mamba 2 and Quadtree Attention. Our results, benchmarked across several tasks, show that SSQA performs at least as well as, if not notably better than, other approaches at a fraction of the computational cost.