ICRA 2026

Learning View-Invariant Sign Language Representations Via Dual-Stream Contrastive Learning

Yuting Peng, Yuecong Min, Xilin Chen

AI summary

A dual-stream contrastive learning framework learns view-invariant sign language representations from skeleton data, achieving state-of-the-art cross-view recognition without inference overhead.

Sign language recognition, view-invariant learning, dual-stream contrastive learning, skeleton-based recognition, cross-view generalization, robotic interaction

Problem

Current sign language recognition systems degrade significantly under viewpoint variations, struggling with occlusions and visual noise in real-world settings where frontal camera data is unavailable.

Approach

ViSL processes accurate 2D keypoints and synthesized 3D-augmented keypoints through a shared feature extractor, aligning them via cross-view and cross-stream contrastive losses with learnable class proxies.
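
To make the idea of aligning two streams through learnable class proxies more concrete, here is a minimal sketch in plain Python. It is not the paper's loss: the function names, the temperature value, and the choice of a softmax cross-entropy over cosine similarities are all assumptions made for illustration. Each stream's features are pulled toward the proxy of their class, so features of the same sign from both streams end up close to one another.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def proxy_contrastive_loss(features, labels, proxies, temperature=0.1):
    """Mean cross-entropy over cosine similarities to class proxies.

    features: list of feature vectors, one per sample.
    labels:   list of integer class ids, one per sample.
    proxies:  list of proxy vectors, one per class (learnable in practice).
    The temperature of 0.1 is a hypothetical default, not a reported value.
    """
    unit_proxies = [l2_normalize(p) for p in proxies]
    total = 0.0
    for feat, label in zip(features, labels):
        unit_feat = l2_normalize(feat)
        logits = [
            sum(a * b for a, b in zip(unit_feat, proxy)) / temperature
            for proxy in unit_proxies
        ]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[label]
    return total / len(features)

def dual_stream_loss(primary_feats, lifting_feats, labels, proxies,
                     temperature=0.1):
    """Sum the proxy loss over both streams (an assumed, equal weighting)."""
    return (
        proxy_contrastive_loss(primary_feats, labels, proxies, temperature)
        + proxy_contrastive_loss(lifting_feats, labels, proxies, temperature)
    )
```

Because both streams share the same proxies, only one stream needs to be run at inference time, which matches the single-stream inference described in the key results.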

Key results

67.14% Top-1 accuracy on cross-view MM-WLAuslan, surpassing prior SOTA by 9.2%
Bridges viewpoint and estimation gaps via dual-stream contrastive learning with learnable proxies
Maintains lightweight single-stream inference while training with noisy 3D-augmented data
Demonstrates robust generalization across in-the-wild, synthetic, studio, and temporal disturbance test subsets

Why it matters

Enables robust, real-world sign language recognition for assistive technologies and human-robot interaction without requiring multi-camera setups or heavy inference costs.

Abstract

Viewpoint shifts significantly change how gestures and facial expressions appear and frequently cause occlusions, posing a critical challenge for robust Sign Language Recognition (SLR). To address this challenge, we exploit the spatial flexibility and computational efficiency of skeleton data and propose ViSL, a dual-stream contrastive learning framework to learn View-invariant representations for Sign Language understanding. Specifically, the primary and lifting streams share a common visual feature extractor with different types of input: the primary stream (P-Stream) directly processes frontal-view skeleton data, and the lifting stream (L-Stream) synthesizes skeleton data from arbitrary viewpoints based on 3D estimations. We further propose a view-invariant contrastive loss to align representations across both viewpoints and streams. Experimental results on the challenging cross-view setting of MM-WLAuslan demonstrate that ViSL achieves substantial performance improvements, highlighting its potential for robust real-world SLR applications.
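
The view synthesis performed by the L-Stream can be pictured as rotating estimated 3D keypoints and projecting them back to 2D. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function name `synthesize_view`, the rotation about the vertical axis only, and the orthographic projection are all hypothetical simplifications.

```python
import math

def synthesize_view(keypoints_3d, yaw_deg):
    """Rotate 3D keypoints about the vertical (y) axis, then drop depth.

    keypoints_3d: list of (x, y, z) tuples, e.g. lifted from 2D estimates.
    yaw_deg:      rotation angle in degrees (0 keeps the frontal view).
    Returns a list of (x, y) tuples as seen from the rotated viewpoint.
    Orthographic projection is assumed here purely for brevity.
    """
    yaw = math.radians(yaw_deg)
    cos_yaw, sin_yaw = math.cos(yaw), math.sin(yaw)
    projected = []
    for x, y, z in keypoints_3d:
        # Only the horizontal coordinate changes under a yaw rotation.
        x_rotated = cos_yaw * x + sin_yaw * z
        projected.append((x_rotated, y))
    return projected

# Example: a wrist held to the signer's right, viewed from 45 degrees.
frontal = [(1.0, 0.5, 0.0)]
side = synthesize_view(frontal, 45.0)
```

Sampling `yaw_deg` at random during training would expose the shared feature extractor to many viewpoints of the same sign, which is the kind of input the abstract describes for the lifting stream.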

Index terms

Gesture, Posture and Facial Expressions; Deep Learning for Visual Perception; Recognition