ICRA 2026

SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception

Gurmeher Khurana, Lan Wei, Dandan Zhang

AI summary

Preserving spatial structure during self-supervised learning yields significantly better representations for geometry-sensitive robotic manipulation than global pooling methods.

visuo-tactile learning, self-supervised learning, spatial representation, robotic manipulation, BYOL, tactile sensing

Problem

Most self-supervised learning frameworks compress visuo-tactile features into global vectors, discarding the spatial structure required for contact-rich robotic manipulation.
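To make the loss of spatial structure concrete, here is a minimal sketch in plain Python. The 2x2 single-channel feature maps are made-up toy values, not data from the paper, and `global_average_pool` is a hypothetical helper. It shows two maps whose activation sits at different locations but which collapse to the same pooled value.

```python
# Illustrative only: toy single-channel feature maps, not data from the paper.

def global_average_pool(feature_map):
    """Collapse an H x W map (a list of rows) into one scalar by averaging."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

# Same total activation, different location
# (for example, contact along the left edge vs. the right edge).
contact_left = [[1.0, 0.0],
                [1.0, 0.0]]
contact_right = [[0.0, 1.0],
                 [0.0, 1.0]]

pooled_left = global_average_pool(contact_left)
pooled_right = global_average_pool(contact_right)

# The maps differ, yet the pooled descriptors are identical,
# so the location of the contact is no longer recoverable.
maps_differ = contact_left != contact_right
pooled_equal = pooled_left == pooled_right
```

A learning objective that sees only the pooled value cannot distinguish the two contact locations, which is the limitation the summary describes.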

Approach

SARL augments the BYOL architecture with three intermediate feature map losses that enforce consistent attention, semantic part composition, and geometric relationships across augmented views.
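The general shape of "a global objective plus a map-level objective" can be sketched as follows. The global term is the standard BYOL normalized-MSE loss (2 minus twice the cosine similarity). The saliency term is an assumption made for illustration only: this page does not give the exact form of the SARL losses, so `saliency_map`, `saliency_alignment_loss`, `total_loss`, and `map_weight` are hypothetical. The sketch also assumes the two feature maps are already spatially aligned across views.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def byol_global_loss(online_prediction, target_projection):
    """Standard BYOL loss on global vectors: 2 - 2 * cosine similarity."""
    q = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(q, z))

def saliency_map(feature_map):
    """Hypothetical saliency: softmax over locations of the channel norm.

    feature_map is a list of H*W locations, each a list of C channel values.
    """
    energy = [math.sqrt(sum(c * c for c in loc)) for loc in feature_map]
    peak = max(energy)
    exps = [math.exp(e - peak) for e in energy]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def saliency_alignment_loss(map_a, map_b):
    """Assumed form: mean squared difference between two saliency maps."""
    s_a, s_b = saliency_map(map_a), saliency_map(map_b)
    return sum((a - b) ** 2 for a, b in zip(s_a, s_b)) / len(s_a)

def total_loss(q, z, map_a, map_b, map_weight=1.0):
    """Global BYOL term plus a weighted map-level term (weight is hypothetical)."""
    return byol_global_loss(q, z) + map_weight * saliency_alignment_loss(map_a, map_b)

# Toy usage: four spatial locations with two channels each.
view_a = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
view_b = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
same_focus = saliency_alignment_loss(view_a, view_a)
shifted_focus = saliency_alignment_loss(view_a, view_b)
```

In this toy case the map-level term is zero when both views attend to the same location and positive when the focus shifts, a signal the global term alone does not see.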

Key results

Outperforms nine SSL baselines across six downstream visuo-tactile tasks
Achieves a 30% relative MAE reduction on edge-pose regression over the next-best SSL method (0.3955 vs. 0.5682 MAE)
Demonstrates robust transfer to four unseen tactile sensor datasets
Confirms fused visuo-tactile data and spatial losses outperform unimodal and global-only baselines
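The 30% figure can be checked against the two MAE values reported in the abstract (0.3955 for SARL and 0.5682 for the next-best SSL method). A short worked calculation, with variable names chosen here for illustration:

```python
sarl_mae = 0.3955        # MAE reported for SARL on edge-pose regression
next_best_mae = 0.5682   # MAE reported for the next-best SSL method

# Relative reduction = (baseline - new) / baseline
relative_reduction = (next_best_mae - sarl_mae) / next_best_mae
print(f"{relative_reduction:.1%}")  # prints 30.4%
```

The result, roughly 30.4%, matches the "30% relative" improvement stated above.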

Why it matters

Provides a practical, label-free method for learning geometry-aware representations that directly improve robotic dexterity and contact-rich manipulation.

Abstract

Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives, including Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), to keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE) and approaching the supervised upper bound. These findings indicate that, for fused visual–tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, which enables more capable robotic perception.
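One way to read "keeping geometric relations consistent across views" is to compare pairwise affinities between regions of a feature map. The sketch below is an assumption for illustration and not the paper's RAM formulation, which the abstract does not spell out. The helpers `cosine`, `affinity_matrix`, and `affinity_matching_loss` are hypothetical, and the region features are toy values.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two region feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def affinity_matrix(regions):
    """Pairwise cosine affinities between all regions of one view."""
    return [[cosine(r_i, r_j) for r_j in regions] for r_i in regions]

def affinity_matching_loss(regions_a, regions_b):
    """Assumed form: mean squared difference between the two affinity matrices."""
    aff_a = affinity_matrix(regions_a)
    aff_b = affinity_matrix(regions_b)
    n = len(aff_a)
    total = sum((aff_a[i][j] - aff_b[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)

# Toy region features: three regions with two channels each.
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Same relations, features uniformly rescaled (e.g. a brightness-like change).
view_rescaled = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
# Different relations: the second region now resembles the first.
view_changed = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

loss_rescaled = affinity_matching_loss(view_a, view_rescaled)
loss_changed = affinity_matching_loss(view_a, view_changed)
```

Under these assumptions the loss stays near zero when only feature magnitudes change and grows when the relations between regions change, which is the kind of relational consistency the abstract attributes to RAM.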

Index terms

Representation Learning; Perception for Grasping and Manipulation; Force and Tactile Sensing